Parallel Communications and NUMA Control on the TeraGrid’s New Sun Constellation System
Lars Koesterke with Kent Milfeld and Karl W. Schulz
AUS Presentation 09/06/08

Bigger Systems, Higher Complexity
Ranger is BIG!
– 3936 nodes
– 62,976 cores (3936 nodes x 16 cores per node)
– 2 large Switches
Ranger’s Architecture is Multi-Level Parallel and Asymmetric

Get the Most out of the New Generation of Supercomputers!
– Understand the Implications of the Multi-Level Parallel Architecture
– Optimize Operational Methods and Applications
– Maximize the Yield from Ranger and other Big TeraGrid Machines yet to come!

Outline
Introduction
General description of the Experiment
– Layout of the Node
– Layout of the Interconnect (NEM and Magnum switches)
Experiments : Ping-Pong and Barrier Cost
– On-Node Experiments
– On-NEM Experiment
– Switch Experiment
Conclusion
– Implications for System Management
– Implications for Users
NEM: Network Express Module

Parameter Selection for Experiments
Ranger Nodes have 4 quad-core Sockets : 16 cores per Node
Natural Setups
– Pure MPI : 16 tasks per Node
– Hybrid : 4 tasks per Node or 1 task per Node
Tests are selected accordingly: 1, 4 and 16 tasks
Figure: three Node layouts (16 MPI Tasks; 4 MPI Tasks with 4 Threads/Task; 1 MPI Task with 16 Threads/Task). Legend: MPI Task on Core, Master Thread of MPI Task, Slave Thread of MPI Task
In large-scale calculations with 16 tasks per Node, communication could/should be bundled
– Measure with one Task per Node

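The three natural setups all fill the 16 cores of a node, just with different splits between MPI tasks and threads. A minimal sketch that enumerates them; the names `CORES_PER_NODE` and `layouts` are illustrative and not taken from the presentation.

```python
CORES_PER_NODE = 16  # 4 quad-core sockets per Ranger node (from the slide)


def layouts(tasks_per_node_options=(16, 4, 1)):
    """Return (MPI tasks per node, threads per task) pairs that fill one node."""
    result = []
    for tasks in tasks_per_node_options:
        if CORES_PER_NODE % tasks != 0:
            raise ValueError(
                f"{tasks} tasks do not divide {CORES_PER_NODE} cores evenly"
            )
        result.append((tasks, CORES_PER_NODE // tasks))
    return result


if __name__ == "__main__":
    for tasks, threads in layouts():
        print(f"{tasks:2d} MPI task(s) per node x {threads:2d} thread(s) per task")
```

Every layout keeps tasks x threads equal to 16, so no core is left idle or oversubscribed.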
Experiment 1 : Ping-Pong with MPI
MPI processes reside on :
– same Node
– same Chassis (connected by one NEM)
– different Chassis (connected by Magnum switch)
Messages are sent back and forth (Ping-Pong)
– Communication Distance is varied (Node, NEM, Magnum)
– Communication Volume is varied
  Message Size : 32 Bytes to … MB
  Number of processes sending/receiving simultaneously
Effective Bandwidth per Communication Channel
– Timing taken from multiple runs on a dedicated system
Node : 16 Cores
Chassis : 12 Nodes
Total : 328 Chassis, 3936 Nodes

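The slide does not spell out how the effective bandwidth is derived from the timings. A common convention, assumed here, is that one ping-pong round trip moves the message across the channel twice, so the one-way time is half the round-trip time. The function names and the choice of the fastest run are illustrative assumptions, and the sample timings are made up.

```python
def effective_bandwidth_mb_s(message_bytes, round_trip_seconds):
    """Bandwidth of one channel, assuming the message crosses it twice per round trip."""
    if round_trip_seconds <= 0:
        raise ValueError("round-trip time must be positive")
    one_way_seconds = round_trip_seconds / 2.0
    return message_bytes / one_way_seconds / 1.0e6  # MB/s, with 1 MB = 10^6 bytes


def best_of(timings):
    """Pick the fastest of several runs to suppress noise (an assumed choice)."""
    return min(timings)


if __name__ == "__main__":
    # Hypothetical round-trip timings in seconds for a 1 MB message.
    runs = [2.3e-3, 2.0e-3, 2.1e-3]
    print(effective_bandwidth_mb_s(1_000_000, best_of(runs)), "MB/s")
```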
Experiment 2 : MPI Barrier Cost
MPI processes reside on :
– same Node
– same Chassis (connected by a NEM)
– different Chassis (connected by Magnum switch)
Synchronize on Barriers
– Communication Distance is varied (Node, NEM, Magnum)
– Communication Volume is varied
  Number of processes executing Barrier
Barrier Cost measured in Clock Periods (CP)
– Timing taken from multiple runs on a dedicated system
Node : 16 Cores
Chassis : 12 Nodes
Total : 328 Chassis, 3936 Nodes

Node Architecture
4 quad-core CPUs (Sockets) per node
Memory local to Sockets
3-way HyperTransport
“Missing” connection (figure labels: CPU, PCI Express Bridge)
Asymmetry
– Local vs. Remote Memory : remote access requires one additional “hop”
– PCI Connection
Note: Accessing local memory on Sockets 0 and 3 is also slower, because of an extra HT hop (Cache Coherence)

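The effect of the missing connection can be illustrated by counting HyperTransport hops between sockets. The slide only states that Sockets 0 and 3 have no direct link; the full link table below (a square with the 1-2 diagonal) is an assumption chosen to be consistent with that statement, and `HT_LINKS` and `ht_hops` are illustrative names.

```python
from collections import deque

# Assumed link layout: four sockets in a square plus the 1-2 diagonal.
# Only the absence of a direct 0-3 link is taken from the slide.
HT_LINKS = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 3], 3: [1, 2]}


def ht_hops(src, dst):
    """Breadth-first search for the minimum number of HT hops between two sockets."""
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        socket, hops = queue.popleft()
        if socket == dst:
            return hops
        for neighbor in HT_LINKS[socket]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, hops + 1))
    raise ValueError(f"socket {dst} is unreachable from socket {src}")


if __name__ == "__main__":
    for target in (1, 2, 3):
        print(f"socket 0 -> socket {target}: {ht_hops(0, target)} hop(s)")
```

Under this assumed layout, Socket 0 reaches Sockets 1 and 2 directly, while traffic to Socket 3 has to pass through an intermediate socket.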
Network Architecture
Each Chassis (12 Blades) is connected to a Network Express Module (NEM)
Each NEM is connected to a Line Card in the Magnum Switch
The Switch connects the Line Cards through a Backplane
Figure: path from HCA through NEM and Line Card to NEM and HCA, with 1, 3, 5 or 7 hops
Number of Hops / Latency
– 1 Hop : 1.57 μsec : Blades in the same Chassis
– 3 Hops : 2.04 μsec : NEMs connected to the same Line Card
– 5/7 Hops : 2.45/2.85 μsec : Connection through the Magnum switch

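The latency figures suggest that most of the cost is paid on the first hop and that each further hop adds comparatively little. A small sketch, using only the four values on the slide, that computes the average added latency per hop between consecutive measurements; the names are illustrative.

```python
# Latencies from the slide in microseconds, keyed by hop count.
LATENCY_US = {1: 1.57, 3: 2.04, 5: 2.45, 7: 2.85}


def added_latency_per_hop(latency_us=LATENCY_US):
    """Average extra latency per hop between consecutive measured hop counts."""
    hops = sorted(latency_us)
    return {
        (a, b): (latency_us[b] - latency_us[a]) / (b - a)
        for a, b in zip(hops, hops[1:])
    }


if __name__ == "__main__":
    for (a, b), extra in added_latency_per_hop().items():
        print(f"{a} -> {b} hops: {extra:.3f} microseconds per additional hop")
```

Each additional hop costs roughly 0.2 to 0.24 μsec, which is small compared with the 1.57 μsec of the first hop.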
On-Node : Ping-Pong
Socket 0 ping-pongs with Sockets 1, 2 and 3
1, 2, 4 simultaneous communications (quad-core)
Bandwidth scales with number of communications
Missing Connection : Communication between Sockets 0 and 3 is slower
Maximum Bandwidth : 1100 MB/s, 700 MB/s, 300 MB/s

On-Node : Barrier Cost (2 Cores)
One Barrier : 0---1, 0---2, …
Cost : … CPs
Asymmetry : Communication between Sockets 0 and 3 is slower

On-Node : Barrier Cost (Multiple Cores, 2 Sockets)
Barriers per Socket : 1, 2, 4
Cost : 1700, 3200, 6800 CPs

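The measured costs grow almost linearly with the number of simultaneous barriers per socket. The sketch below divides each cost by the barrier count and converts clock periods to microseconds. The clock frequency is not given in the slides, so `ASSUMED_CLOCK_HZ` is a placeholder assumption that should be replaced with the actual core clock.

```python
# Hypothetical clock frequency; the slides do not state it.
ASSUMED_CLOCK_HZ = 2.3e9

# Barriers per socket -> total cost in clock periods (from the slide).
BARRIER_COST_CP = {1: 1700, 2: 3200, 4: 6800}


def cp_to_microseconds(clock_periods, clock_hz=ASSUMED_CLOCK_HZ):
    """Convert a cost in clock periods to microseconds for a given clock."""
    return clock_periods / clock_hz * 1.0e6


def cost_per_barrier(costs=BARRIER_COST_CP):
    """Cost in clock periods divided by the number of simultaneous barriers."""
    return {n: cp / n for n, cp in costs.items()}


if __name__ == "__main__":
    for n, cp in cost_per_barrier().items():
        print(f"{n} barrier(s) per socket: {cp:.0f} CPs each, "
              f"{cp_to_microseconds(cp):.2f} microseconds at the assumed clock")
```

The per-barrier cost stays between 1600 and 1700 CPs, so simultaneous barriers on the two sockets appear to be serialized rather than overlapped.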
On-NEM : Ping-Pong
2-12 Nodes in the same Chassis
1 MPI Process per Node (1-6 communication pairs)
Perfect Scaling for up to 6 simultaneous communications
Maximum Bandwidth : 6 x 900 MB/s

On-NEM : Barrier Scaling
Barriers per Node : 1, 4, 16
Cost : starts at 5000/15000 CPs and increases up to 20000/27000/32000 CPs

NEM-to-NEM : Ping-Pong
Maximum Distance : 7 hops
1 MPI Process per Node (1-12 communication pairs)
Maximum Performance : 2 x 900 MB/s up to 12 x 450 MB/s

Switch : Barrier Scaling
Communication between 1-12 Nodes on 2 Chassis
Barriers per Node : 1, 4, 16
Two Runs : System was not clean during this test
Results similar to On-NEM test

Communication pattern reveals Asymmetry on the Node level
– No Direct HT Connection between Sockets 0 and 3
Max. Bandwidth
– On-NEM : 6 x 900 MB/s
– NEM-to-NEM : 2 x 900 MB/s up to 12 x 450 MB/s
– Further Investigation necessary to achieve theoretical 12 x 900 MB/s
Ranger
– 16-way Nodes : NUMA
– Multi-Level Interconnect : low-latency, high-bandwidth

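The gap between measured and theoretical bandwidth is easier to see as aggregate numbers. A short sketch using the figures from the ping-pong slides (6 x 900, 2 x 900 and 12 x 450 MB/s) and the 900 MB/s per pair implied by the theoretical 12 x 900 MB/s; the names are illustrative.

```python
PER_PAIR_PEAK_MB_S = 900  # per-pair rate implied by the theoretical 12 x 900 MB/s

# Label -> (communication pairs, measured MB/s per pair), from the slides.
MEASURED = {
    "On-NEM, 6 pairs": (6, 900),
    "NEM-to-NEM, 2 pairs": (2, 900),
    "NEM-to-NEM, 12 pairs": (12, 450),
}


def aggregate_mb_s(pairs, per_pair_mb_s):
    """Total bandwidth summed over all simultaneous communication pairs."""
    return pairs * per_pair_mb_s


def fraction_of_peak(pairs, per_pair_mb_s, peak=PER_PAIR_PEAK_MB_S):
    """Measured aggregate bandwidth relative to pairs x peak."""
    return aggregate_mb_s(pairs, per_pair_mb_s) / aggregate_mb_s(pairs, peak)


if __name__ == "__main__":
    for label, (pairs, rate) in MEASURED.items():
        print(f"{label}: {aggregate_mb_s(pairs, rate)} MB/s aggregate, "
              f"{fraction_of_peak(pairs, rate):.0%} of peak")
```

With 12 pairs across two chassis the aggregate is 5400 MB/s, the same total as 6 pairs inside one chassis and half of the theoretical 10800 MB/s.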
Conclusions
Aggregate Communication and I/O on Node (SMP) level
– Reduce total number of Communications
– Reduce Traffic through Magnum switches
– On 16-way Node : 15 compute tasks and a single Communication task?
– Use of MPI with OpenMP?
Apply Load-Balancing
– Asymmetry on Node Level
– Multi-Level Interconnect (Node, NEM, Magnum switches)
Use full Chassis (12 Nodes, 192 Cores)
– Use extremely low-latency Connections through NEM (< 1.6 μsecs)
Take Advantage of the Architecture at all Levels
– Applications should be cognizant of various SMP/Network levels
– More topology-aware scheduling is under investigation
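Two of these recommendations reduce to simple arithmetic: sizing a job in whole chassis, and counting how many messages leave a node when communication is bundled. A minimal sketch under stated assumptions: the function names are illustrative, and the message count assumes each task would otherwise send one message per exchange to a partner on the neighboring node.

```python
import math

CORES_PER_NODE = 16      # from the slides
NODES_PER_CHASSIS = 12   # from the slides
CHASSIS_TOTAL = 328      # from the slides
CORES_PER_CHASSIS = CORES_PER_NODE * NODES_PER_CHASSIS  # 192


def chassis_needed(cores):
    """Number of chassis required to host a job of the given core count."""
    nodes = math.ceil(cores / CORES_PER_NODE)
    return math.ceil(nodes / NODES_PER_CHASSIS)


def uses_full_chassis(cores):
    """True if the core count fills every chassis it occupies."""
    return cores > 0 and cores % CORES_PER_CHASSIS == 0


def messages_per_exchange(tasks_per_node=CORES_PER_NODE, aggregated=False):
    """Messages one node sends to a neighbor node per exchange step.

    Assumes one message per task when not aggregated, and a single
    bundled message sent by one communication task when aggregated.
    """
    return 1 if aggregated else tasks_per_node


if __name__ == "__main__":
    for cores in (192, 200, 384):
        print(f"{cores} cores: {chassis_needed(cores)} chassis, "
              f"full chassis: {uses_full_chassis(cores)}")
    print("messages per exchange:", messages_per_exchange(),
          "unbundled vs", messages_per_exchange(aggregated=True), "bundled")
```

A job of 200 cores spills onto a second chassis and sends part of its traffic through the Magnum switch, whereas 192 or 384 cores stay aligned with the NEM boundaries.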