HPC and Machine Learning Experts

AIST ABCI

ABCI Background

The AI Bridging Cloud Infrastructure (ABCI) supercomputer, operated by Japan’s National Institute of Advanced Industrial Science and Technology (AIST), is noteworthy for several reasons. It achieved a top-ten ranking on both the June 2018 Top 500 (#5) and Green 500 (#8) lists. ABCI represents a new breed of supercomputers that deliver a high degree of energy efficiency and are built with AI and deep learning in mind. Interestingly, these may not even be ABCI’s most distinguishing characteristics: increasingly, the world’s most powerful supercomputers are also among the most energy efficient.

Project Background

The ABCI effort builds on experience gained constructing the TSUBAME3.0 supercomputer at Tokyo Tech (#6 on the Green 500 list), which also runs on Univa Grid Engine. As is the case with the RIKEN RAIDEN supercomputer (#10 on the Green 500 list and also running on Univa Grid Engine), ABCI uses the latest NVIDIA Tesla V100 (Volta) GPUs. Running and managing deep learning workloads at scale requires sophisticated GPU-aware workload management, and the Univa team is honored that three of the top 10 Green 500 supercomputers run on Univa software.

The ABCI supercomputer has 391,680 cores, 476 TB of memory, and approximately 24 PB of storage spread across local SSDs and a parallel file system. The system comprises 1,088 InfiniBand-connected servers, each with dual Intel Xeon Gold 6148 (Skylake) CPUs and 4 x NVIDIA Tesla V100 (Volta) GPUs, with each GPU providing 640 Tensor Cores. Communication between GPUs is facilitated by NVIDIA’s NVLink, which delivers up to 10x the bandwidth of PCIe Gen 3 connections.

Whether executing isolated deep learning applications on a single GPU or running distributed frameworks, like TensorFlow or Caffe, ABCI’s diverse applications demand sophisticated management software. Along with Docker, Singularity and other tools, Univa Grid Engine plays a key role in ABCI’s software stack, ensuring that workloads run as efficiently as possible.

Univa Grid Engine brings a unique set of capabilities to GPU-powered deep learning environments (a brief usage sketch follows the list):

  • With the Univa Grid Engine Resource Map (RSMAP), an abstraction used for representing GPUs, GPU-aware workloads can be managed more effectively, leading to better utilization and productivity.
  • By taking advantage of NVIDIA’s Data Center GPU Manager (DCGM) facility, the latest versions of Univa Grid Engine can factor the state of GPUs into scheduling decisions and dispatch tasks more efficiently for faster completion of deep learning workloads.
  • Univa Grid Engine is integrated with Docker to offer users seamless management of containerized workloads. Univa also supports Singularity, enabling ABCI to run applications deployed with either container technology on the same cluster.
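
To illustrate how these capabilities are typically exercised, the sketch below submits a containerized training job that requests GPUs through an RSMAP-style complex. The complex name "gpu", the Singularity image tensorflow.sif, and train.py are illustrative assumptions, not ABCI’s actual configuration.

```python
# Minimal sketch: submitting a GPU-aware, containerized job to Univa Grid Engine
# from Python. The RSMAP complex name "gpu", the image tensorflow.sif, and
# train.py are hypothetical placeholders.
import subprocess

job_script = """#!/bin/bash
#$ -N tf_train     # job name
#$ -l gpu=4        # request 4 GPUs via the (assumed) RSMAP complex "gpu"
#$ -cwd            # run in the submission directory
# Run the workload inside a Singularity container with GPU support (--nv)
singularity exec --nv tensorflow.sif python train.py
"""

with open("train_job.sh", "w") as f:
    f.write(job_script)

# qsub prints the assigned job id on success
subprocess.run(["qsub", "train_job.sh"], check=True)
```

When the job is dispatched, the scheduler grants specific GPU ids from the resource map to the job, which is what keeps GPU utilization high across the cluster.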

Used together, this combination of software helps ABCI get the most from its infrastructure investment. In the latest release of Univa Grid Engine, considerable effort has been invested in the low-level details of binding computational tasks to various combinations of CPU cores and GPUs to improve efficiency. Like other deep learning supercomputers, ABCI benefits from innovations in Univa Grid Engine such as scheduling non-GPU workloads so that cores on GPU hosts do not sit idle, placing parallel jobs in a way that considers switch topologies and rack placement, and using Advance Reservations to ensure that priority workloads run at scheduled times.
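
As a concrete example, an Advance Reservation can be created ahead of a priority training run and the job can then be bound to specific cores when it is submitted into that reservation. The commands below are a hedged sketch; the start time, duration, reservation id, and resource names are placeholders rather than ABCI’s real settings.

```python
# Illustrative sketch of Advance Reservations and core binding with Univa Grid
# Engine; all values are placeholders.
import subprocess

# Reserve resources in advance for a priority training run.
subprocess.run(
    ["qrsub",
     "-a", "202511150900",   # start time (YYYYMMDDhhmm)
     "-d", "04:00:00",       # duration of the reservation
     "-l", "gpu=4"],         # assumed RSMAP complex, as in the earlier sketch
    check=True,
)

# Submit a job into the reservation (qrsub reports the real id; 42 is a
# placeholder) and ask the scheduler to bind it linearly to 10 CPU cores.
subprocess.run(
    ["qsub", "-ar", "42", "-binding", "linear:10", "train_job.sh"],
    check=True,
)
```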

Whether you are operating a top 10 supercomputer like ABCI or a small on-premises cluster, you can realize the full benefits of Univa Grid Engine when running GPU clusters. The Univa team offers a variety of solutions for managing deep learning and other compute-intensive workloads on-premises, in the cloud, or in hybrid environments.

Pacific Teck’s Role

Pacific Teck worked closely with the system integrator (Fujitsu) to understand and communicate the benefits of Univa Grid Engine and BeeOND to the end user. Thanks in part to Pacific Teck’s efforts, Univa Grid Engine and BeeOND were utilized in a similar project, TSUBAME3.0, and the end user was interested in replicating those elements. The end user was also interested in Singularity, for which Pacific Teck was able to make an introduction. Pacific Teck is also involved in post-sales support, acting as a bridge between the vendors and the system integrator. We also assisted with configuring Univa Grid Engine to manage Singularity containers so that MPI jobs can run inside containers. BeeOND creates a temporary file system from the NVMe drives in the compute nodes, with a maximum capacity of about 1 PB; the size used at a given time is determined when the file system is kicked off by Univa Grid Engine.
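
A rough sketch of how such a per-job BeeOND file system might be launched from a Grid Engine prolog is shown below. The mountpoint, NVMe path, and beeond options are assumptions for illustration and will differ from ABCI’s actual setup.

```python
# Sketch: starting a per-job BeeOND file system across the nodes Grid Engine
# granted to a parallel job. Paths and options are illustrative assumptions.
import os
import subprocess

def start_beeond():
    # Grid Engine lists the hosts granted to a parallel job in $PE_HOSTFILE;
    # BeeOND aggregates the local NVMe of exactly those nodes.
    nodefile = "/tmp/beeond_nodes"
    with open(os.environ["PE_HOSTFILE"]) as src, open(nodefile, "w") as dst:
        for line in src:
            dst.write(line.split()[0] + "\n")   # first column is the hostname

    subprocess.run(
        ["beeond", "start",
         "-n", nodefile,        # nodes contributing local NVMe
         "-d", "/local/nvme",   # assumed per-node NVMe scratch directory
         "-c", "/mnt/beeond"],  # mountpoint the job sees
        check=True,
    )

def stop_beeond():
    # Tear the temporary file system down at the end of the job
    # (additional cleanup flags omitted).
    subprocess.run(["beeond", "stop", "-n", "/tmp/beeond_nodes"], check=True)

if __name__ == "__main__":
    start_beeond()
```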


Learn More about Singularity