Emulating Massively Parallel (PetaFLOPS) Machines
Neelam Saboo, Arun Kumar Singla, Joshua Mostkoff Unger, Gengbin Zheng, Laxmikant V. Kalé
Parallel Programming Laboratory, Department of Computer Science
http://charm.cs.uiuc.edu
Roadmap
- BlueGene Architecture
- Need for an Emulator
- Charm++ BlueGene
- Converse BlueGene
- Future Work
Blue Gene: Processor-in-Memory Case Study
Five steps to a PetaFLOPS, taken from http://www.research.ibm.com/bluegene/:
- Processor: 1 GFlop/s, 0.5 MB
- Node/Chip: 25 GFlop/s, 12.5 MB
- Board
- Tower
- Blue Gene: 1 PFlop/s, 0.5 TB
Functional model: a 34 x 34 x 36 cube of shared-memory nodes, each having 25 processors.
SMP Node
- 25 processors, 200 processing elements
- Input/Output buffer: 32 x 128 bytes
- Network: connected to six neighbors via duplex links; 16 bits @ 500 MHz = 1 Gigabyte/s
- Latencies: 5 cycles per hop, 75 cycles per turn
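From these figures, a message's routing delay is roughly 5h + 75t cycles for h hops and t turns, at 2 ns per cycle. A minimal sketch of that arithmetic; the function name and the ~1 ns/byte wire-time term are illustrative assumptions, not from the slides:

    // Toy latency estimate from the slide's figures: 5 cycles/hop,
    // 75 cycles/turn, 500 MHz clock (2 ns/cycle), 1 Gigabyte/s links
    // (~1 ns per byte). Names and the wire-time term are assumptions.
    #include <cstdio>

    double latencyMicroseconds(int hops, int turns, int msgBytes) {
      const double nsPerCycle = 2.0;                        // 500 MHz
      double routingNs = (5.0 * hops + 75.0 * turns) * nsPerCycle;
      double wireNs = msgBytes;                             // 1 GB/s link
      return (routingNs + wireNs) / 1000.0;
    }

    int main() {
      // e.g. a 128-byte packet crossing 10 hops with 2 turns
      printf("%.3f us\n", latencyMicroseconds(10, 2, 128));
      return 0;
    }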
Processor
STATS:
- 500 MHz clock
- Memory-side cache eliminates coherency problems
- Latencies: 10 cycles local cache, 20 cycles remote cache, 10 cycles cache miss
- 8 integer units sharing 2 floating-point units
- 8 PEs x 25 processors x ~40,000 nodes = ~8 x 10^6 processing elements!
Need for an Emulator
An emulator enables the programmer to develop, compile, and run software using the programming interface that will be used on the actual machine.
Emulator Objectives
- Emulate Blue Gene and other petaFLOPS machines.
- Memory and time limitations on a single processor require that the simulation MUST be performed on a parallel architecture.
- Issues: assume that a program written for a processor-in-memory machine will itself handle out-of-order execution and messaging; therefore no complex event queue/rollback is needed.
Emulator Implementation
What are the basic data structures and interface (see the sketch below)?
- Machine configuration (topology) and handler registration
- Nodes with node-level shared data
- Threads (associated with each node) representing processing elements
- Communication between nodes
How do we manage all these objects on a parallel architecture, and how do we handle object-to-object communication? The difficulties of implementation are eased by using Charm++, an object-oriented parallel programming paradigm.
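The slides do not show the interface itself; the following is a minimal C++ sketch of the kind of API these bullets describe, where every name (Config, registerHandler, sendPacket, getNodeData) is an illustrative assumption rather than the emulator's actual interface:

    // Sketch of an emulator interface of the kind described above.
    // All names here are illustrative assumptions, not the real API.
    typedef void (*BgHandler)(void *msg);

    struct Config {            // machine configuration (topology)
      int sizeX, sizeY, sizeZ; // nodes along each dimension
      int numCommThreads;      // communication threads per node
      int numWorkThreads;      // worker threads (PEs) per node
    };

    int registerHandler(BgHandler h);   // returns a handler ID

    // Deliver msg to the inBuffer of node (x,y,z); a communication
    // thread there schedules it onto a worker thread, which runs the
    // registered handler.
    void sendPacket(int x, int y, int z, int handlerID, void *msg, int size);

    // Node-level data shared by all threads of one node.
    void *getNodeData();
    void  setNodeData(void *data);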
Experiments on Emulator
Sample applications implemented:
- Primes
- Jacobi relaxation
- MD prototype: 40,000 atoms, no bonds calculated, nearest-neighbor cutoff
Ran the full Blue Gene configuration (with 8 x 10^6 threads) on ~100 ASCI-Red processors.
[Figure: ApoA-I, 92k atoms]
Collective Operations
Explore different algorithms for broadcasts and reductions: RING, LINE, OCTREE (a sketch follows below).
Testbed: a "primitive" 30 x 30 x 20 (10 threads per node) Blue Gene emulation on a 50-processor Linux cluster.
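As one example, a RING broadcast visits the nodes of the mesh in a fixed linear order, each node forwarding the message to its successor. A minimal sketch against the assumed interface above; the mesh dimensions come from the slide, but the helper and handler names are assumptions:

    // Sketch of a RING broadcast over the assumed interface above:
    // nodes are visited in a fixed linear order (x fastest, z slowest),
    // each forwarding the message to its successor.
    struct BcastMsg { int x, y, z; /* payload ... */ };

    static const int SX = 30, SY = 30, SZ = 20;  // emulated mesh size
    static int gRingHandler;                     // from registerHandler()

    static void nextInRing(int &x, int &y, int &z) {
      if (++x < SX) return;
      x = 0;
      if (++y < SY) return;
      y = 0;
      ++z;
    }

    void ringBcastHandler(void *m) {
      BcastMsg *msg = static_cast<BcastMsg *>(m);
      // ... consume the payload locally, then forward along the ring ...
      int x = msg->x, y = msg->y, z = msg->z;
      nextInRing(x, y, z);
      if (z < SZ) {                              // stop after the last node
        msg->x = x; msg->y = y; msg->z = z;
        sendPacket(x, y, z, gRingHandler, msg, sizeof(BcastMsg));
      }
    }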
Converse BlueGene Emulator
Objectives:
- Performance estimation (with proper time stamping)
- Provide an API for building Charm++ on top of the emulator.
Switching from the Charm++ BlueGene emulator to the Converse emulator improves performance by accessing the low-level communication and thread libraries directly via Converse. It also makes it possible to port Charm++ on top of the BlueGene emulator, so that Charm++ becomes one of the available parallel programming languages on BlueGene and existing Charm++ applications can run on the emulator.
Bluegene Emulator: Node Structure
Like Converse, the BlueGene emulator is a message-driven system: the only way two nodes communicate is by sending a BlueGene message with a handler function associated with it, much like active messages.
Each node's 200 processing elements are represented as threads, divided into two types:
- Communication threads poll the node's inBuffer and schedule incoming messages onto worker threads.
- Worker threads pick up the tasks assigned to them and execute them.
There are two kinds of messages. For performance, we introduce affinity messages: an affinity message can only be executed on a specified thread, so it carries a specific thread ID, and the communication threads must schedule it onto that worker thread's affinity message queue. A non-affinity message can be assigned to any worker thread via the shared non-affinity message queue.
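A minimal C++ sketch of this node structure and the communication thread's scheduling step; all type and member names are illustrative assumptions, not the emulator's actual code:

    // Sketch of the node structure above; names are assumptions.
    #include <deque>
    #include <vector>

    struct BgMsg {
      int handlerID;
      int threadID;   // >= 0 for an affinity message, -1 otherwise
      // payload ...
    };

    struct BgNode {
      std::deque<BgMsg *> inBuffer;                     // filled by the network
      std::deque<BgMsg *> nonAffinityQueue;             // any worker may take these
      std::vector<std::deque<BgMsg *> > affinityQueues; // one per worker thread
    };

    // One step of a communication thread: drain the inBuffer, routing
    // each message to the proper queue.
    void commThreadStep(BgNode &node) {
      while (!node.inBuffer.empty()) {
        BgMsg *msg = node.inBuffer.front();
        node.inBuffer.pop_front();
        if (msg->threadID >= 0)
          node.affinityQueues[msg->threadID].push_back(msg);  // affinity
        else
          node.nonAffinityQueue.push_back(msg);               // non-affinity
      }
    }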
Performance: Pingpong
- Converse BlueGene pingpong: 81-103 us RTT, close to native Converse pingpong (92 us RTT)
- Charm++ pingpong: 116 us RTT
- Charm++ BlueGene pingpong: 134-175 us RTT
The Converse emulator eliminates the Charm++ message overhead, so its performance is much better than the previous Charm++ BlueGene emulator's. Tests were conducted on an Origin2000.
Charm++ on top of Emulator
- A BlueGene thread represents a Charm++ node.
- Name conflicts: Cpv, Ctv; MsgSend, etc.; CkMyPe(), CkNumPes(), etc.
- Note: CkMyPe() now returns the thread's global serial number.
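For example, an ordinary Charm++ fragment like the following runs on the emulator unchanged, except that CkMyPe()/CkNumPes() now range over emulated threads rather than physical processors (a minimal fragment; the function itself is illustrative):

    #include "charm++.h"

    // On the emulator, CkMyPe() is this thread's global serial number
    // and CkNumPes() the total number of emulated processing elements
    // (~8 x 10^6 for the full Blue Gene configuration).
    void sayHello() {
      CkPrintf("PE %d of %d\n", CkMyPe(), CkNumPes());
    }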
Future Work: Simulator
- LeanMD: a fully functional MD application with cutoff only.
- How can we examine the performance of algorithms on variants of the processor-in-memory design in a massive system?
- Several layers of detail to measure:
  - Basic: correctly model performance; timestamp messages, with correction for out-of-order execution (a sketch follows below).
  - More detailed: network performance, memory access, modeling the sharing of floating-point units, estimation techniques.
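One plausible form of the basic timestamping correction, sketched under the assumption that each message carries its virtual send time and an estimated latency; all names and the latency model are assumptions, not the simulator's design:

    // Sketch: a message's handler may not execute before its virtual
    // arrival time, even if it physically arrived early/out of order.
    struct TimedMsg {
      double sendTime;  // virtual time when the message was sent
      double latency;   // estimated network latency for this message
      // payload ...
    };

    double clockOnReceive(double nodeClock, const TimedMsg &msg) {
      double arrival = msg.sendTime + msg.latency;
      return (arrival > nodeClock) ? arrival : nodeClock;
    }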