1
Emulating Massively Parallel (PetaFLOPS) Machines
Neelam Saboo, Arun Kumar Singla, Joshua Mostkoff Unger, Gengbin Zheng, Laxmikant V. Kalé
Department of Computer Science, Parallel Programming Laboratory
2
Roadmap
BlueGene Architecture
Need for an Emulator
Charm++ BlueGene
Converse BlueGene
Future Work
3
Blue Gene: Processor-in-memory Case Study
Five steps to a PetaFLOPS (figure): PROCESSOR (1 GFlop/s, 0.5 MB) → NODE/CHIP (25 GFlop/s, 12.5 MB) → BOARD → TOWER → BLUE GENE (1 PFlop/s, 0.5 TB).
FUNCTIONAL MODEL: a 34 × 34 × 36 cube of shared-memory nodes, each having 25 processors.
4
SMP Node
25 processors, 200 processing elements
Input/Output buffer: 32 × 128 bytes
Network: connected to six neighbors via duplex links (… MHz = … Gigabyte/s)
Latencies: 5 cycles per hop, 75 cycles per turn
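These latency figures imply a simple first-order cost model. A minimal sketch in C++ (hypothetical helper functions, assuming dimension-ordered routing on the 3-D mesh; not part of the emulator's API):

    #include <cstdlib>

    // Estimated network latency, in cycles, for a message that makes
    // `hops` link traversals and `turns` changes of direction, using
    // the per-hop and per-turn costs quoted above.
    long networkLatencyCycles(long hops, long turns) {
        const long cyclesPerHop  = 5;
        const long cyclesPerTurn = 75;
        return cyclesPerHop * hops + cyclesPerTurn * turns;
    }

    // Dimension-ordered (X, then Y, then Z) route between two nodes:
    // the hop count is the Manhattan distance, and the route changes
    // direction once per additional dimension it traverses.
    long routeLatencyCycles(int dx, int dy, int dz) {
        int  dims  = (dx != 0) + (dy != 0) + (dz != 0);
        long hops  = labs(dx) + labs(dy) + labs(dz);
        long turns = dims > 1 ? dims - 1 : 0;
        return networkLatencyCycles(hops, turns);
    }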
5
Processor
STATS: 500 MHz
Memory-side cache eliminates coherency problems
10 cycles: local cache; 20 cycles: remote cache; 10 cycles: cache miss
8 integer units sharing floating-point units
8 × 25 × ~40,000 = ~8 × 10⁶ processing elements!
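The closing arithmetic checks out against the functional model on slide 3; a one-line verification in plain C++ (the ~40,000 is the 34 × 34 × 36 node count):

    #include <cstdio>

    int main() {
        const long nodes        = 34L * 34 * 36;  // 41,616 nodes (~40,000)
        const long procsPerNode = 25;
        const long pesPerProc   = 8;              // 8 integer units per processor
        // 8 x 25 x ~40,000 = 8,323,200, i.e. ~8 million processing elements
        printf("%ld\n", nodes * procsPerNode * pesPerProc);
        return 0;
    }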
6
Need for Emulator
An emulator enables the programmer to develop, compile, and run software using the programming interface that will be used on the actual machine.
7
Emulator Objectives
Emulate Blue Gene and other PetaFLOPS machines.
Memory and time limitations on a single processor require that the emulation MUST be performed on a parallel architecture.
Issues: Assume that a program written for a processor-in-memory machine will handle out-of-order execution and messaging itself; therefore we don't need a complex event queue/rollback.
8
Emulator Implementation
What are the basic data structures and interfaces?
Machine configuration (topology), handler registration
Nodes with node-level shared data
Threads (associated with each node) representing processing elements
Communication between nodes
How do we manage all of these objects on a parallel architecture, and how do we handle object-to-object communication? The difficulties of implementation are eased by using Charm++, an object-oriented parallel programming paradigm; a sketch of these structures follows.
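As a concrete illustration, a minimal C++ sketch of what these data structures might look like (all names are hypothetical; the actual emulator API is part of the Charm++ distribution):

    #include <deque>
    #include <vector>

    // One emulated Bluegene message: a handler index plus a payload.
    // Affinity messages carry the worker thread they must run on.
    struct BgMessage {
        int affinityThread;            // target worker thread, or -1 for any
        int handlerIdx;                // index into the registered handler table
        std::vector<char> payload;
    };

    typedef void (*BgHandler)(BgMessage* msg);

    // One emulated node: its position in the machine, node-level shared
    // data, an inBuffer fed by the network, and per-thread work queues.
    struct BgNode {
        int x, y, z;                                     // position in the node mesh
        void* nodeData;                                  // node-level shared data
        std::deque<BgMessage*> inBuffer;                 // incoming messages
        std::deque<BgMessage*> nonAffinityQ;             // any worker may take these
        std::vector< std::deque<BgMessage*> > affinityQ; // one queue per worker
    };

    // Machine-wide state: configuration (topology) plus handler registration.
    struct BgMachine {
        int sizeX, sizeY, sizeZ;          // machine configuration
        int commThreads, workThreads;     // threads per node
        std::vector<BgHandler> handlers;  // registered handler functions
        std::vector<BgNode> nodes;
    };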
9
Experiments on Emulator
Sample applications implemented:
Primes
Jacobi relaxation
MD prototype: 40,000 atoms, no bonds calculated, nearest-neighbor cutoff
Ran the full Blue Gene (with 8 × 10⁶ threads) on ~100 ASCI-Red processors
ApoA-I: 92k atoms
10
Collective Operations
Explore different algorithms for broadcasts and reductions: RING, LINE, OCTREE (figure: the three patterns on the x/y/z node mesh).
Used a "primitive" 30 × 30 × 20 (10 threads) Blue Gene emulation on a 50-processor Linux cluster.
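For illustration, a minimal sketch of the RING strategy (hypothetical sendToNode callback; the emulator's actual collectives are message-driven handlers):

    // RING broadcast over numNodes arranged in a logical ring: the root
    // sends to its successor and every node forwards once, so the message
    // circles the ring in numNodes - 1 sends. Latency grows linearly in
    // the node count, which is why tree schemes such as OCTREE scale better.
    void ringBroadcastStep(int myRank, int rootRank, int numNodes,
                           void (*sendToNode)(int rank)) {
        int next = (myRank + 1) % numNodes;
        if (next != rootRank)   // stop once the ring is closed
            sendToNode(next);
    }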
11
Converse BlueGene Emulator Objective
Performance estimation (with proper time stamping)
Provide an API for building Charm++ on top of the emulator.
Switching from the Charm++ Bluegene emulator to the Converse emulator allows better performance by accessing the low-level communication and thread libraries directly via Converse; it also makes it possible to port Charm++ on top of the Bluegene emulator, so that Charm++ can become one of the possible parallel programming languages on Bluegene and existing Charm++ applications can run on the emulator.
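A minimal sketch of the time-stamping idea (the fields and the max rule are assumptions about how such estimation could work, not the emulator's documented design):

    // Per-message timing for performance estimation: a message cannot be
    // handled before it arrives, and a worker thread cannot handle it
    // before finishing its previous task, so the estimated start time is
    // the later of the two.
    struct BgTimedMessage {
        double sendTime;      // emulated time at which the sender posted it
        double arrivalTime;   // sendTime plus the modeled network latency
    };

    double estimatedStart(const BgTimedMessage& msg, double threadNow) {
        return msg.arrivalTime > threadNow ? msg.arrivalTime : threadNow;
    }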
12
Bluegene Emulator Node Structure
Like Converse, the Bluegene emulator is a message-driven system: the only way two nodes can communicate is by sending a Bluegene message with a handler function associated with it, much like active messages.
Each node is abstracted as follows. The 200 processing elements on a node are represented as threads, divided into two types: communication threads and worker threads. A communication thread polls the node's inBuffer and schedules incoming messages onto worker threads; a worker thread picks up the tasks assigned to it and executes them.
There are two kinds of messages. For performance, an affinity message is a special message that can only execute on a specified thread: it carries that thread's ID, so communication threads must schedule it onto the specified worker thread's affinity message queue. A non-affinity message can be assigned to any worker thread and goes into the shared non-affinity message queue.
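Continuing the hypothetical structures sketched earlier, the two thread roles might look like this (a sketch of the scheduling logic just described, not the emulator's actual code):

    // Communication thread: drain the node's inBuffer and route each
    // message. An affinity message goes to the queue of the worker thread
    // named in the message; a non-affinity message goes to the shared queue.
    void commThreadLoop(BgNode& node) {
        while (!node.inBuffer.empty()) {
            BgMessage* msg = node.inBuffer.front();
            node.inBuffer.pop_front();
            if (msg->affinityThread >= 0)
                node.affinityQ[msg->affinityThread].push_back(msg);
            else
                node.nonAffinityQ.push_back(msg);
        }
    }

    // Worker thread: prefer messages pinned to this thread, fall back to
    // the shared queue, and run the registered handler on the message.
    void workerThreadStep(BgMachine& machine, BgNode& node, int myThread) {
        BgMessage* msg = 0;
        if (!node.affinityQ[myThread].empty()) {
            msg = node.affinityQ[myThread].front();
            node.affinityQ[myThread].pop_front();
        } else if (!node.nonAffinityQ.empty()) {
            msg = node.nonAffinityQ.front();
            node.nonAffinityQ.pop_front();
        }
        if (msg) machine.handlers[msg->handlerIdx](msg);
    }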
13
Performance: Pingpong
Converse Bluegene pingpong is close to Converse pingpong (… us vs. 92 us RTT); Charm++ pingpong: 116 us RTT; Charm++ Bluegene pingpong: … us RTT.
Eliminating the Charm++ message overhead makes performance much better than the previous Charm++ Bluegene emulator. Tests were conducted on an Origin2000.
14
Charm++ on top of Emulator
A BlueGene thread represents a Charm++ node.
Name conflicts: Cpv, Ctv, MsgSend, etc.; CkMyPe(), CkNumPes(), etc.
Note: CkMyPe() now returns the thread's global serial number.
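One plausible way to derive that global serial number (an assumed layout, numbering nodes in x-major order and then the worker threads within each node; the emulator's actual numbering may differ):

    // Global serial number of a worker thread, unique across the whole
    // emulated machine, so that CkMyPe() can identify the Charm++ node.
    int globalSerial(int x, int y, int z, int sizeX, int sizeY,
                     int workThreadsPerNode, int myWorkThread) {
        int nodeIdx = x + sizeX * (y + sizeY * z);
        return nodeIdx * workThreadsPerNode + myWorkThread;
    }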
15
Future Work: Simulator
LeanMD: a fully functional MD code with cutoff only.
How can we examine the performance of algorithms on variants of the processor-in-memory design in a massive system?
Several layers of detail to measure:
Basic: correctly model performance; timestamp messages with correction for out-of-order execution
More detailed: network performance, memory access, modeling the sharing of floating-point units, estimation techniques