1 Opportunities and Challenges of Modern Communication Architectures: Case Study with QsNet. CAC Workshop, Santa Fe, NM, 2004. Sameer Kumar* and Laxmikant V. Kale, Parallel Programming Laboratory, University of Illinois at Urbana-Champaign
2 Outline Processor virtualization QsNet Opportunities Performance Evaluation of QsNet Challenges of QsNet Summary
3 Processor Virtualization. Basic idea of processor virtualization: the user specifies the interaction between objects (virtual processors, VPs), and the runtime system (RTS) maps VPs onto physical processors. Typically, # virtual processors > # processors. Embodied in Charm++ and AMPI. [Figure: user view vs. system implementation]
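The mapping step can be sketched as follows. This is a minimal illustration, not Charm++ or AMPI code: the helper name `map_vps_round_robin` and the round-robin policy are assumptions chosen for simplicity, and a real runtime system may place VPs differently (for example, based on load).

```python
from collections import defaultdict


def map_vps_round_robin(num_vps, num_procs):
    """Assign virtual processors (VPs) to physical processors round-robin.

    Hypothetical helper for illustration only; the placement policy is an
    assumption, not the policy of any particular runtime system.
    """
    if num_procs <= 0:
        raise ValueError("num_procs must be positive")
    placement = defaultdict(list)
    for vp in range(num_vps):
        # VP i lands on processor i mod num_procs.
        placement[vp % num_procs].append(vp)
    return dict(placement)


if __name__ == "__main__":
    # 8 VPs on 2 processors: a virtualization ratio of 4.
    print(map_vps_round_robin(8, 2))
```

With more VPs than processors, each processor holds several VPs, which is what gives the runtime something else to run while one VP waits for a message.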
4 QsNet. Popular interconnect from Quadrics. Several parallel systems in the Top500 use QsNet: Pittsburgh's Lemieux (6 TF), ASCI-Q (20 TF). Elite network; Elan adaptor
5 Elite Network. 320 MB/s each way after protocol; reliable fat-tree network; multiple routes provide fault tolerance; adaptive wormhole routing; 35 ns per hop
6 Elan Network Adaptor. Features: low latency (4.5 μs for MPI), high bandwidth (320 MB/s/node). Components: SPARC processor, DMA engine, 64 MB RAM, on-chip cache
7 Low CPU Overhead CPU Overhead is small and does not change much with the message size
8 Traditional Message Passing. [Figure: timeline for processors P0 and P1 showing send overhead, receive overhead, and idle time] Traditional message passing does not utilize the low CPU overhead of Elan
9 Adaptive Overlap. [Figure: timeline for processors P0 and P1, each running VP 0 and VP 1, showing send overhead and receive overhead] Processor virtualization takes full advantage of the low CPU overhead of Elan
10 Benefit of Adaptive Overlap Problem setup: 3D stencil calculation of size run on Lemieux. Shows AMPI with virtualization ratio of 1 and 8.
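A toy model may help show why a higher virtualization ratio can hide communication latency. It is a sketch under simplifying assumptions, not a model of the measured stencil runs: each VP computes an equal share of the work, then waits a fixed latency for its message, and the CPU cost of sending and receiving is taken as zero. The function names and the sample numbers are illustrative, not measurements.

```python
def iteration_time(compute, latency, ratio):
    """Steady-state time per iteration on one processor in a toy overlap model.

    compute: total computation per iteration on the processor
    latency: time a VP waits for its message after sending
    ratio:   virtualization ratio (VPs per processor)
    All inputs are illustrative assumptions, not measured values.
    """
    if ratio < 1:
        raise ValueError("ratio must be at least 1")
    per_vp = compute / ratio
    # The first VP can resume once its message has arrived (per_vp + latency),
    # but not before the processor has run the remaining VPs (compute).
    return max(compute, per_vp + latency)


def idle_fraction(compute, latency, ratio):
    """Fraction of each iteration the processor spends idle in this model."""
    total = iteration_time(compute, latency, ratio)
    return (total - compute) / total


if __name__ == "__main__":
    for ratio in (1, 8):
        print(ratio, iteration_time(8.0, 2.0, ratio), idle_fraction(8.0, 2.0, ratio))
```

In this model a ratio of 1 leaves the processor idle for the full latency each iteration, while a larger ratio hides the latency whenever the other VPs have enough work to cover it.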
11 Charm++ Message Driven Execution. [Figure: scheduler, handler, and pump loop; labels: Garbage Collection, Send, Tport Send, Post Receives, Receive Message]
12 NAMD: A Production MD System Written in Charm++. Fully featured program; NIH-funded development; distributed free of charge (5000+ downloads so far) as binaries and source code; installed at NSF centers; large published simulations (e.g., aquaporin simulation featured in keynote)
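The general shape of message-driven execution can be sketched as a scheduler loop that dequeues a message and invokes the handler registered for it. This is an illustrative sketch only, not the Charm++ or Converse implementation: the `Scheduler` class and its method names are hypothetical, and the network pump (posting receives, Tport sends, garbage collection) is not modeled.

```python
from collections import deque


class Scheduler:
    """Minimal message-driven scheduler (hypothetical, for illustration)."""

    def __init__(self):
        self.handlers = {}    # handler name -> callable(scheduler, payload)
        self.queue = deque()  # pending (handler name, payload) messages

    def register(self, name, fn):
        self.handlers[name] = fn

    def enqueue(self, name, payload):
        self.queue.append((name, payload))

    def run(self):
        """Process messages until the queue is empty; return how many ran."""
        processed = 0
        while self.queue:
            name, payload = self.queue.popleft()
            # Execution is driven by message arrival: the handler runs only
            # when its message is dequeued, and may enqueue further messages.
            self.handlers[name](self, payload)
            processed += 1
        return processed


def run_ping_pong(rounds):
    """Bounce a counter between two handlers and record the call order."""
    log = []
    sched = Scheduler()

    def ping(s, n):
        log.append(("ping", n))
        if n > 0:
            s.enqueue("pong", n - 1)

    def pong(s, n):
        log.append(("pong", n))
        if n > 0:
            s.enqueue("ping", n - 1)

    sched.register("ping", ping)
    sched.register("pong", pong)
    sched.enqueue("ping", rounds)
    return sched.run(), log
```

No handler ever blocks waiting for a specific message, so the processor always runs whichever message is available next.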
13 Scaling NAMD Several QsNet challenges had to be overcome to scale NAMD
14 QsNet Challenge: Latency. Applications need to post receives for messages of different sizes
15 Latency Bottlenecks. Slow NIC processor with a 100 MHz clock; cache size only 8 KB; traversing a large loop flushes it. [Plots: Latency; Cache Misses vs. Number of Receives Posted]
16 Managing Latency: Message Combining. Organize processors in a 2D (virtual) mesh. Phase 1: processors send messages to row neighbors; a message from (x1,y1) to (x2,y2) goes via (x1,y2). Phase 2: processors send messages to column neighbors. 2*(sqrt(P)-1) messages instead of P-1
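The two-phase routing and the message count can be sketched as follows. The sketch assumes a square mesh of side sqrt(P), with rank r placed at (x, y) = divmod(r, side) so that x is the row index. The function names and this rank layout are assumptions for illustration, not taken from the Charm++ communication library. In such a mesh each processor sends sqrt(P)-1 combined messages per phase, hence 2*(sqrt(P)-1) in total instead of P-1 direct messages.

```python
import math


def mesh_side(num_procs):
    """Side length of the square virtual mesh; rejects non-square counts."""
    side = math.isqrt(num_procs)
    if side * side != num_procs:
        raise ValueError("this sketch assumes a square number of processors")
    return side


def mesh_route(src, dst, side):
    """Ranks visited by a message from src to dst.

    A message from (x1, y1) to (x2, y2) goes via (x1, y2): phase 1 moves it
    along the row, phase 2 along the column.
    """
    x1, _y1 = divmod(src, side)
    _x2, y2 = divmod(dst, side)
    via = x1 * side + y2
    route = [src]
    for rank in (via, dst):
        if rank != route[-1]:  # skip a hop that stays in place
            route.append(rank)
    return route


def messages_per_processor(num_procs):
    """Messages each processor sends: direct all-to-all vs. 2D mesh."""
    side = mesh_side(num_procs)
    return {"direct": num_procs - 1, "mesh": 2 * (side - 1)}


if __name__ == "__main__":
    print(mesh_route(5, 14, 4))        # (1,1) -> (1,2) -> (3,2)
    print(messages_per_processor(64))
```

The trade-off is that each byte travels two hops instead of one, so the mesh pays off mainly for small messages, where per-message cost dominates.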
17 NAMD PME Performance. Performance of NAMD with the ATPase molecule. The PME step in NAMD involves a 192 x 144 processor collective operation with 900-byte messages
18 QsNet Challenge: Bandwidth. QsNet network bandwidth: 320 MB/s. Bandwidth (MB/s): One Way 290; Two Way 128. PCI/DMA contention restricts bandwidth on Alpha servers
19 Improving Bandwidth. Node bandwidth (MB/s) for different placements of source and destination. Columns: Main-Main, Elan-Main, Elan-Elan. Rows: One Way, Two Way. Sending messages from Elan memory is faster
20 QsNet Challenge: Stretched Handlers. Stretched sends (green superscripts); similar stretches observed in the middle of entry methods. [Figure: NAMD timeline, time vs. processors, showing force compute and integrate phases]
21 Stretching Solution. Stretched sends: Elan Isend blocked when the rendezvous for the previous Isend to any destination had not been acknowledged. Solved the problem by working closely with Quadrics and obtaining a patch: Isend now blocks only on the rendezvous of the previous message to the same destination
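The difference between the two blocking rules can be illustrated with a small counting model. It is a sketch, not Elan library code: the function name is hypothetical, and it assumes for simplicity that no rendezvous acknowledgement arrives during the burst of sends.

```python
def count_blocked_sends(destinations, per_destination):
    """Count sends in a burst that must wait on an earlier rendezvous.

    Toy model (assumption): no acknowledgement arrives during the burst.
    per_destination=False: a send waits if any earlier send is unacknowledged
                           (the behavior before the patch).
    per_destination=True:  a send waits only if an earlier send to the same
                           destination is unacknowledged (after the patch).
    """
    pending = set()  # destinations with an unacknowledged rendezvous
    blocked = 0
    for dest in destinations:
        if per_destination:
            waits = dest in pending
        else:
            waits = len(pending) > 0
        if waits:
            blocked += 1
        pending.add(dest)
    return blocked


if __name__ == "__main__":
    burst = [1, 2, 3, 1]
    print(count_blocked_sends(burst, per_destination=False))
    print(count_blocked_sends(burst, per_destination=True))
```

For a burst to destinations 1, 2, 3, 1, the first rule makes every send after the first wait, while the second makes only the repeated send to destination 1 wait.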
22 Stretching Solution Contd. Stretches in the middle of entry methods: caused by OS daemons; using blocking receives minimized these stretches, since daemons can be scheduled when the processor is idle
23 NAMD With Blocking Receives. [Figure: NAMD timeline, time vs. processors, with blocking receives]
24 NAMD Performance on Lemieux
25 Summary. QsNet is an excellent network; NIC co-processor ideal for message-driven execution. Programming guidelines: send messages from Elan memory; post a limited number of receives, and post them before the sends; use blocking receives to avoid stretching
26 Future Work. One-sided communication; barrier?; persistent one-sided communication; reserve buffers on destination
27 Fat Tree Topology. [Figure: (a) 4-ary 1-tree, (b) 4-ary 2-tree, (c) 4-ary 3-tree]
28 Elan3 Adapter. DMA engine; thread processor; on-chip shared cache; 64-bit 66 MHz PCI interface; 64 MB RAM
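The sizes of the three trees can be worked out from the standard k-ary n-tree definition, in which the tree connects k^n processing nodes through n levels of k^(n-1) switches each. The helper name `fat_tree_size` is hypothetical, and the sketch counts nodes and switches only, not links.

```python
def fat_tree_size(k, n):
    """Return (processing nodes, switches) of a k-ary n-tree.

    Uses the standard definition: k**n nodes and n levels of
    k**(n - 1) switches each.
    """
    if k < 1 or n < 1:
        raise ValueError("k and n must be positive")
    nodes = k ** n
    switches = n * k ** (n - 1)
    return nodes, switches


if __name__ == "__main__":
    for levels in (1, 2, 3):
        print(f"4-ary {levels}-tree:", fat_tree_size(4, levels))
```

Each added level multiplies the node count by k, which is how the topology scales to large machines with small switches.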
29 Object Based Communication Framework. [Figure: layered stack with labels Application, AMPI, Charm++, Comm. Framework Object Layer, Converse, Comm. Framework Processor Layer, Communication Layer, Strategy] The object layer performs object-level optimizations; the processor layer optimizes inter-processor communication
30 AAPC Processor Overhead. [Plot: completion time and CPU overhead for the Mesh and Direct strategies] Lower CPU overhead enables applications using Mesh to perform better even for large messages