
1 Massive Supercomputing Coping with Heterogeneity of Modern Accelerators Toshio Endo and Satoshi Matsuoka Tokyo Institute of Technology, Japan

2 Accelerators for High Performance Computing  In HPC systems, power consumption has been, and will remain, a major concern  SIMD accelerators are promising for their excellent Flops/Watt ratio: ClearSpeed X620 accelerator, 80 GFlops, 25W; NVIDIA GeForce 8800GTX, 375 GFlops (single precision), ~200W

3 Heterogeneous Architectures (1) HPC systems built only with “special purpose” accelerators are infeasible  They don’t directly support existing compilers, applications, MPI, Linux…  Heterogeneous architectures are attractive for  Generality from general-purpose CPUs (typically x86/x86-64)  Higher Flops/Watt ratio from accelerators (ClearSpeed accelerators, GPGPUs, Cell BE…)  Examples: IBM Roadrunner, TokyoTech TSUBAME

4 Heterogeneous Architectures (2) Objective:  Running a large parallel application on heterogeneous systems Questions:  How can we utilize heterogeneous resources effectively?  Are they scalable up to supercomputing scale? [Diagram: AMD Opteron 880, 4.8 GFlops peak per core, combined with a ClearSpeed X620 accelerator, 80 GFlops peak]

5 Overview of Our Work  We take a tightly coupled program, Linpack  Combined usage of 10,368 Opteron cores and 648 ClearSpeed SIMD accelerators  >60 TFlops: the world’s highest Linpack performance on heterogeneous supercomputers

6 NEC/Sun/ClearSpeed/Voltaire TokyoTech TSUBAME Supercomputer (2006)  SunFire X4600, 16 Opteron cores/node x 655 nodes  ClearSpeed CSX600 SIMD accelerator on PCI-X boards: 360, later expanded to 648  Voltaire ISR9288 InfiniBand, 10Gbps  Storage nodes with 48 x 500GB disks  102 TFlops peak = Opteron 49.8TF + ClearSpeed 52.2TF  16th supercomputer in the world, 2nd in Asia (Top500, Nov 2007)

7 Structure of a TSUBAME Node with Heterogeneous Processors [Diagram: a SunFire X4600 node with 8 dual-core Opteron CPUs (16 cores), a ClearSpeed accelerator on PCI-X, and an InfiniBand interconnect]

8 [Photos of the TSUBAME installation: cooling towers (~20 units), 1.6 PByte storage, 655 compute nodes with 16 Opteron cores each, and 6 x 288-port 10Gbps InfiniBand switches]

9 ClearSpeed Accelerator  PCI-X accelerator board: CSX600 SIMD processor x 2 + 1GB DRAM on board; 210MHz x 2 FP x 96 SIMD x 2 = 80.6 GFlops peak; configurable up to 250MHz; power: 25W/board  Provided software: Cn programming language, CSXL BLAS library (used by this work), CSFFT library

10 DGEMM Performance of Opteron and ClearSpeed  Multiply of (MxB) x (BxM) matrices  An accelerator is equivalent to 14 CPU cores at peak  ClearSpeed performance is much more sensitive to matrix size! [Graph: DGEMM throughput vs. block size B, comparing GOTO BLAS on one Opteron core with CSXL BLAS 2.50 on ClearSpeed; GOTO BLAS is by Kazushige Goto, U. Texas]
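A minimal, self-contained sketch of the kind of measurement behind this comparison: a timed (MxB) x (BxM) DGEMM update through a CBLAS interface (GOTO BLAS, OpenBLAS, or similar), swept over a few block sizes B. The matrix sizes and the CBLAS route are illustrative assumptions, not the authors' actual benchmark harness.

/* dgemm_sweep.c -- time C(MxM) += A(MxB) * B(BxM) for several block sizes B.
   Assumes a CBLAS-style BLAS (GOTO BLAS, OpenBLAS, ...); build e.g. with
   cc -O2 dgemm_sweep.c -lopenblas */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cblas.h>

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    const int M = 4096;                         /* illustrative matrix dimension */
    const int bsizes[] = {240, 480, 864};       /* block sizes also used in the slides */
    for (size_t i = 0; i < sizeof bsizes / sizeof bsizes[0]; i++) {
        const int B = bsizes[i];
        double *A  = malloc((size_t)M * B * sizeof *A);
        double *Bm = malloc((size_t)B * M * sizeof *Bm);
        double *C  = calloc((size_t)M * M, sizeof *C);
        for (size_t j = 0; j < (size_t)M * B; j++) { A[j] = 1.0; Bm[j] = 1.0; }

        double t0 = now_sec();
        /* row-major: A is MxB (lda=B), Bm is BxM (ldb=M), C is MxM (ldc=M) */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    M, M, B, 1.0, A, B, Bm, M, 1.0, C, M);
        double sec = now_sec() - t0;

        printf("B = %4d: %6.1f GFlops\n", B, 2.0 * M * M * B / sec / 1e9);
        free(A); free(Bm); free(C);
    }
    return 0;
}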

11 Linpack: Our Target Application Benchmark  Linpack is a numerical benchmark used in the Top500: solve N x N dense linear equations  HPL (High-Performance Linpack) by A. Petitet: a well-known MPI parallel implementation  Matrix multiply (DGEMM) is the most time-consuming computation; O(N^3) flops in total

12 Data Decomposition in HPL  The matrix is uniformly distributed with a 2D block-cyclic distribution [Figure: an N x N matrix split into B x B blocks, distributed over 6 (= 2x3) processes]
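As a concrete illustration of the 2D block-cyclic rule: with a P x Q process grid, the B x B block at global block coordinates (i, j) lives on process (i mod P, j mod Q). A tiny sketch; the function names are illustrative, not HPL's own.

/* block_cyclic.c -- print which process in a P x Q grid owns each B x B block. */
#include <stdio.h>

static void block_owner(int brow, int bcol, int P, int Q, int *prow, int *pcol) {
    *prow = brow % P;      /* blocks are dealt out cyclically along rows    */
    *pcol = bcol % Q;      /* ... and cyclically along columns              */
}

int main(void) {
    const int P = 2, Q = 3;          /* the 6 (= 2x3) process grid from the slide */
    const int nblocks = 6;           /* illustrative: a 6 x 6 grid of blocks      */
    for (int i = 0; i < nblocks; i++) {
        for (int j = 0; j < nblocks; j++) {
            int pr, pc;
            block_owner(i, j, P, Q, &pr, &pc);
            printf("(%d,%d) ", pr, pc);
        }
        printf("\n");
    }
    return 0;
}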

13 Flow of HPL (simplified)  Performance is bottlenecked by the slowest process  HPL is designed for uniform systems [Diagram: each process repeatedly performs panel factorization etc., a matrix multiply (DGEMM) on its own data, and MPI communication, then loops back to the top]
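A runnable stub of the loop each process executes in the diagram above; the routine names are placeholders, not HPL's actual functions, and the MPI step is represented only by a print.

/* hpl_flow_stub.c -- the simplified per-process flow: every iteration does a
   panel factorization, communication, and a DGEMM update on local data.
   Because the steps are synchronized, the slowest process sets the pace. */
#include <stdio.h>

static void panel_factorize(int k)    { printf("iter %d: panel factorization, etc.\n", k); }
static void exchange_panel(int k)     { printf("iter %d: MPI communication\n", k); }
static void update_local_dgemm(int k) { printf("iter %d: DGEMM on own blocks\n", k); }

int main(void) {
    const int nsteps = 3;                 /* tiny illustrative iteration count */
    for (int k = 0; k < nsteps; k++) {
        panel_factorize(k);
        exchange_panel(k);
        update_local_dgemm(k);            /* the O(N^3) bulk of the work */
    }
    return 0;
}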

14 Requirements on a Heterogeneous System HPL is designed for homogeneous systems, but  Intra-node heterogeneity: a node has both general-purpose CPUs and a SIMD accelerator  Inter-node heterogeneity: (in the previous TSUBAME configuration) about half the nodes have accelerators while the others do not  We want to keep modifications to the HPL source code small How can we run HPL efficiently on heterogeneous systems?

15 Three System Configurations  CPU-Only: no heterogeneity  Fully-Acc’d: intra-node heterogeneity  Half-Acc’d: intra-node + inter-node heterogeneity

16 Our Basic Policy (1/2)  For intra-node heterogeneity, we ‘virtualize’ heterogeneous processors at the library layer  Processors are providers of DGEMM performance  We control the mapping between processes and processors [Figure: example mapping between processes and processors during DGEMM]
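A hedged, self-contained sketch of the "virtualize at the library layer" idea: every process calls one DGEMM entry point, and which backend serves it is decided by the process-to-processor mapping. Here the backends are print-only stubs standing in for GOTO BLAS and CSXL BLAS, and the USE_ACCEL environment variable is a hypothetical switch, not the authors' mechanism.

/* dgemm_dispatch.c -- one DGEMM call site, two interchangeable providers. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef void (*dgemm_fn)(int m, int n, int k,
                         const double *A, const double *B, double *C);

/* Stub for the CPU path (GOTO BLAS in the real system). */
static void dgemm_cpu(int m, int n, int k,
                      const double *A, const double *B, double *C) {
    (void)A; (void)B; (void)C;
    printf("DGEMM %dx%dx%d served by CPU BLAS\n", m, n, k);
}

/* Stub for the accelerator path (CSXL BLAS in the real system). */
static void dgemm_accel(int m, int n, int k,
                        const double *A, const double *B, double *C) {
    (void)A; (void)B; (void)C;
    printf("DGEMM %dx%dx%d forwarded to the accelerator\n", m, n, k);
}

int main(void) {
    const char *flag = getenv("USE_ACCEL");          /* hypothetical switch */
    dgemm_fn dgemm = (flag && strcmp(flag, "1") == 0) ? dgemm_accel : dgemm_cpu;

    double A[4] = {0}, B[4] = {0}, C[4] = {0};
    dgemm(2, 2, 2, A, B, C);     /* the application code is identical either way */
    return 0;
}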

17 Our Basic Policy (2/2)  For inter-node heterogeneity, we control the number of processes per node (cf. CHARM++, AMPI from UIUC)  We can keep the kernel workload of each process uniform (good for HPL) while still accommodating heterogeneity

18 Careful Tuning is Necessary for Performance Since SIMD accelerators are sensitive to many HPL parameters, careful tuning is necessary: process granularity, process mapping, block size  We need different tuning for each system configuration

19 Tuning of Process Granularity We can tune ‘process granularity’ as the number of BLAS threads per process  If processes are too coarse (a process uses many threads), it is more difficult to balance among nodes  If they are too fine, HPL suffers from duplicated computation
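A small sketch of how process granularity might be set at launch time as a BLAS thread count. The BLAS_THREADS variable is an assumption for illustration; GOTO BLAS has its own controls (e.g. GOTO_NUM_THREADS), and OpenMP-threaded BLAS builds honor omp_set_num_threads / OMP_NUM_THREADS.

/* granularity.c -- pick the per-process BLAS thread count from the environment.
   Coarse processes (many threads) make inter-node balancing harder; fine
   processes make HPL duplicate more computation.  Build with -fopenmp. */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void) {
    const char *s = getenv("BLAS_THREADS");   /* illustrative variable name */
    int nthreads = s ? atoi(s) : 1;
    omp_set_num_threads(nthreads);            /* affects an OpenMP-threaded BLAS */

    #pragma omp parallel
    {
        #pragma omp single
        printf("this process will run DGEMM with %d thread(s)\n",
               omp_get_num_threads());
    }
    return 0;
}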

20 Tuning of Block Size  When the block size B is small, ClearSpeed performance is heavily degraded  When B is too large, HPL suffers from large overhead in panel factorization

21 Tuning of the “CPU-Only” Case  Focus: bringing out BLAS performance on the CPUs (16 Opteron cores x 648 nodes)  Block size is 240, which is good for GOTO BLAS

22 Tuning of the “Fully-Acc’d” Case  Focus: process granularity and block size should be large enough for ClearSpeed BLAS; balance among processes while utilizing both kinds of processors  Block size is 864, which is good for ClearSpeed BLAS [Diagram: per node, 16 Opteron cores plus a ClearSpeed board, with CPU cores set aside for PCI-X communication; x 648 nodes]

23 Tuning of the “Half-Acc’d” Case  Focus: balance between accelerated and non-accelerated nodes  Block size is 864 [Diagram: 360 nodes with ClearSpeed (CPU cores set aside for PCI-X communication) and 288 nodes without]

24 Experimentation  648 SunFire X4600 nodes in TSUBAME  Modified HPL + Voltaire MPI + GOTO BLAS + CSXL BLAS  Three configurations: CPU-Only (only Opteron CPUs are used), Half-Acc’d (only half the nodes are accelerated), Fully-Acc’d (all the nodes are accelerated)

25 Experimental Results  38.18TF in “CPU-Only”  48.88TF in “Half-Acc’d”: +28% over CPU-Only  63.51TF in “Fully-Acc’d”  Check precise figures in the next Top500 list in June [Chart: relative speed, with CPU-Only = 1]

26 Summary  The scalability of heterogeneous supercomputers with SIMD accelerators is demonstrated  >60 TFlops Linpack performance is achieved  Our method works efficiently even when nodes are only partially accelerated Future work:  From hand-tuning to automatic tuning  Other useful applications!

27 Our Basic Policy (2/2) Two types of HPL processes are introduced:  CPU processes use GOTO BLAS’s DGEMM  SIMD processes throw DGEMM requests to the accelerator  Additional SIMD servers are introduced as multiplexers [Diagram: CPU processes, SIMD processes, and a SIMD server in front of the CSXL library]

28 Mapping between Processes and Nodes  Peak DGEMM performance per node is ~120 GFlops with an accelerator and ~70 GFlops without: roughly 7:4
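A tiny arithmetic sketch of why 7:4 works: giving an accelerated node 7 processes and a plain node 4 makes the per-process DGEMM peak nearly equal, which is what HPL's uniform decomposition wants. The peak figures are the ones quoted on this slide; the program itself is only illustrative.

/* ratio.c -- per-process DGEMM peak under a 7:4 process split. */
#include <stdio.h>

int main(void) {
    const double peak_acc = 120.0, peak_plain = 70.0;   /* GFlops per node (slide figures) */
    const int procs_acc = 7, procs_plain = 4;           /* roughly 120:70 = 7:4 */

    printf("accelerated node: %.1f GFlops per process\n", peak_acc / procs_acc);
    printf("plain node:       %.1f GFlops per process\n", peak_plain / procs_plain);
    /* prints ~17.1 vs ~17.5 GFlops -- nearly uniform per-process workload */
    return 0;
}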

29 Mapping between Processes and Processors  We need to consider CPU usage for communication with the accelerator (the black region in the figure)  Remaining idle CPU cores are used to help the accelerator [Figure: 16 Opteron cores and the ClearSpeed accelerator divided among seven processes]
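One way to reserve specific cores, e.g. keeping a core free for PCI-X communication while the rest serve compute processes, is CPU affinity. Below is a Linux-specific sketch using sched_setaffinity; the core number would come from the launcher, and this is an assumption about mechanism, not necessarily how the authors pinned processes.

/* pin_core.c -- pin the calling process to one CPU core (Linux only). */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int core = (argc > 1) ? atoi(argv[1]) : 0;   /* core id chosen by the launcher */

    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    if (sched_setaffinity(0, sizeof set, &set) != 0) {   /* pid 0 = this process */
        perror("sched_setaffinity");
        return 1;
    }
    printf("pinned to core %d\n", core);
    return 0;
}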

