1
Emulating Massively Parallel (PetaFLOPS) Machines
Neelam Saboo, Arun Kumar Singla, Joshua Mostkoff Unger, Gengbin Zheng, Laxmikant V. Kalé
Department of Computer Science, Parallel Programming Laboratory
2
Roadmap
BlueGene Architecture
Need for an Emulator
Charm++ BlueGene
Converse BlueGene
Future Work
3
Blue Gene: Processor-in-memory Case Study
Five steps to a PetaFLOPS (figure): PROCESSOR (1 GFlop/s, 0.5 MB) → NODE/CHIP (25 GFlop/s, 12.5 MB) → BOARD → TOWER → BLUE GENE (1 PFlop/s, 0.5 TB).
FUNCTIONAL MODEL: a 34 × 34 × 36 cube of shared-memory nodes, each having 25 processors.
4
SMP Node
25 processors, 200 processing elements
Input/Output buffer: 32 × 128 bytes
Network: connected to six neighbors via duplex links (… MHz = … Gigabyte/s)
Latencies: 5 cycles per hop, 75 cycles per turn
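These latency figures imply a simple first-order cost model. A minimal sketch in C++ (hypothetical helper functions, assuming dimension-ordered routing on the 3-D mesh; not part of the emulator's API):

    #include <cstdlib>

    // Estimated network latency, in cycles, for a message that makes
    // `hops` link traversals and `turns` changes of direction, using
    // the per-hop and per-turn costs quoted above.
    long networkLatencyCycles(long hops, long turns) {
        const long cyclesPerHop  = 5;
        const long cyclesPerTurn = 75;
        return cyclesPerHop * hops + cyclesPerTurn * turns;
    }

    // Dimension-ordered (X, then Y, then Z) route between two nodes:
    // the hop count is the Manhattan distance, and the route changes
    // direction once per additional dimension it traverses.
    long routeLatencyCycles(int dx, int dy, int dz) {
        int  dims  = (dx != 0) + (dy != 0) + (dz != 0);
        long hops  = labs(dx) + labs(dy) + labs(dz);
        long turns = dims > 1 ? dims - 1 : 0;
        return networkLatencyCycles(hops, turns);
    }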
5
Processor
STATS: 500 MHz
Memory-side cache eliminates coherency problems
10 cycles: local cache; 20 cycles: remote cache; 10 cycles: cache miss
8 integer units sharing floating-point units
8 × 25 × ~40,000 = ~8 × 10⁶ processing elements!
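The closing arithmetic checks out against the functional model on slide 3; a one-line verification in plain C++ (the ~40,000 is the 34 × 34 × 36 node count):

    #include <cstdio>

    int main() {
        const long nodes        = 34L * 34 * 36;  // 41,616 nodes (~40,000)
        const long procsPerNode = 25;
        const long pesPerProc   = 8;              // 8 integer units per processor
        // 8 x 25 x ~40,000 = 8,323,200, i.e. ~8 million processing elements
        printf("%ld\n", nodes * procsPerNode * pesPerProc);
        return 0;
    }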
6
Need for Emulator
An emulator enables the programmer to develop, compile, and run software using the programming interface that will be used on the actual machine.
7
Emulator Objectives
Emulate Blue Gene and other PetaFLOPS machines.
Memory and time limitations on a single processor require that the emulation MUST be performed on a parallel architecture.
Issues: Assume that a program written for a processor-in-memory machine will handle out-of-order execution and messaging itself; therefore we don't need a complex event queue/rollback.
8
Emulator Implementation
What are the basic data structures and interfaces?
Machine configuration (topology), handler registration
Nodes with node-level shared data
Threads (associated with each node) representing processing elements
Communication between nodes
How do we manage all of these objects on a parallel architecture, and how do we handle object-to-object communication? The difficulties of implementation are eased by using Charm++, an object-oriented parallel programming paradigm; a sketch of these structures follows.
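As a concrete illustration, a minimal C++ sketch of what these data structures might look like (all names are hypothetical; the actual emulator API is part of the Charm++ distribution):

    #include <deque>
    #include <vector>

    // One emulated Bluegene message: a handler index plus a payload.
    // Affinity messages carry the worker thread they must run on.
    struct BgMessage {
        int affinityThread;            // target worker thread, or -1 for any
        int handlerIdx;                // index into the registered handler table
        std::vector<char> payload;
    };

    typedef void (*BgHandler)(BgMessage* msg);

    // One emulated node: its position in the machine, node-level shared
    // data, an inBuffer fed by the network, and per-thread work queues.
    struct BgNode {
        int x, y, z;                                     // position in the node mesh
        void* nodeData;                                  // node-level shared data
        std::deque<BgMessage*> inBuffer;                 // incoming messages
        std::deque<BgMessage*> nonAffinityQ;             // any worker may take these
        std::vector< std::deque<BgMessage*> > affinityQ; // one queue per worker
    };

    // Machine-wide state: configuration (topology) plus handler registration.
    struct BgMachine {
        int sizeX, sizeY, sizeZ;          // machine configuration
        int commThreads, workThreads;     // threads per node
        std::vector<BgHandler> handlers;  // registered handler functions
        std::vector<BgNode> nodes;
    };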
9
Experiments on Emulator
Sample applications implemented:
Primes
Jacobi relaxation
MD prototype: 40,000 atoms, no bonds calculated, nearest-neighbor cutoff
Ran the full Blue Gene (with 8 × 10⁶ threads) on ~100 ASCI-Red processors
ApoA-I: 92k atoms
10
Collective Operations
Explore different algorithms for broadcasts and reductions: RING, LINE, OCTREE (figure: the three patterns on the x/y/z node mesh).
Used a "primitive" 30 × 30 × 20 (10 threads) Blue Gene emulation on a 50-processor Linux cluster.
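For illustration, a minimal sketch of the RING strategy (hypothetical sendToNode callback; the emulator's actual collectives are message-driven handlers):

    // RING broadcast over numNodes arranged in a logical ring: the root
    // sends to its successor and every node forwards once, so the message
    // circles the ring in numNodes - 1 sends. Latency grows linearly in
    // the node count, which is why tree schemes such as OCTREE scale better.
    void ringBroadcastStep(int myRank, int rootRank, int numNodes,
                           void (*sendToNode)(int rank)) {
        int next = (myRank + 1) % numNodes;
        if (next != rootRank)   // stop once the ring is closed
            sendToNode(next);
    }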
11
Converse BlueGene Emulator Objective
Performance estimation (with proper time stamping)
Provide an API for building Charm++ on top of the emulator.
Switching from the Charm++ Bluegene emulator to the Converse emulator allows better performance by accessing the low-level communication and thread libraries directly via Converse; it also makes it possible to port Charm++ on top of the Bluegene emulator, so that Charm++ can become one of the possible parallel programming languages on Bluegene and existing Charm++ applications can run on the emulator.
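A minimal sketch of the time-stamping idea (the fields and the max rule are assumptions about how such estimation could work, not the emulator's documented design):

    // Per-message timing for performance estimation: a message cannot be
    // handled before it arrives, and a worker thread cannot handle it
    // before finishing its previous task, so the estimated start time is
    // the later of the two.
    struct BgTimedMessage {
        double sendTime;      // emulated time at which the sender posted it
        double arrivalTime;   // sendTime plus the modeled network latency
    };

    double estimatedStart(const BgTimedMessage& msg, double threadNow) {
        return msg.arrivalTime > threadNow ? msg.arrivalTime : threadNow;
    }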
12
Bluegene Emulator Node Structure
Like Converse, the Bluegene emulator is a message-driven system: the only way two nodes can communicate is by sending a Bluegene message with a handler function associated with it, much like active messages.
Each node is abstracted as follows. The 200 processing elements on a node are represented as threads, divided into two types: communication threads and worker threads. A communication thread polls the node's inBuffer and schedules incoming messages onto worker threads; a worker thread picks up the tasks assigned to it and executes them.
There are two kinds of messages. For performance, an affinity message is a special message that can only execute on a specified thread: it carries that thread's ID, so communication threads must schedule it onto the specified worker thread's affinity message queue. A non-affinity message can be assigned to any worker thread and goes into the shared non-affinity message queue.
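Continuing the hypothetical structures sketched earlier, the two thread roles might look like this (a sketch of the scheduling logic just described, not the emulator's actual code):

    // Communication thread: drain the node's inBuffer and route each
    // message. An affinity message goes to the queue of the worker thread
    // named in the message; a non-affinity message goes to the shared queue.
    void commThreadLoop(BgNode& node) {
        while (!node.inBuffer.empty()) {
            BgMessage* msg = node.inBuffer.front();
            node.inBuffer.pop_front();
            if (msg->affinityThread >= 0)
                node.affinityQ[msg->affinityThread].push_back(msg);
            else
                node.nonAffinityQ.push_back(msg);
        }
    }

    // Worker thread: prefer messages pinned to this thread, fall back to
    // the shared queue, and run the registered handler on the message.
    void workerThreadStep(BgMachine& machine, BgNode& node, int myThread) {
        BgMessage* msg = 0;
        if (!node.affinityQ[myThread].empty()) {
            msg = node.affinityQ[myThread].front();
            node.affinityQ[myThread].pop_front();
        } else if (!node.nonAffinityQ.empty()) {
            msg = node.nonAffinityQ.front();
            node.nonAffinityQ.pop_front();
        }
        if (msg) machine.handlers[msg->handlerIdx](msg);
    }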
13
Performance: Pingpong
Converse Bluegene pingpong is close to Converse pingpong (… us vs. 92 us RTT); Charm++ pingpong: 116 us RTT; Charm++ Bluegene pingpong: … us RTT.
Eliminating the Charm++ message overhead makes performance much better than the previous Charm++ Bluegene emulator. Tests were conducted on an Origin2000.
14
Charm++ on top of Emulator
A BlueGene thread represents a Charm++ node.
Name conflicts: Cpv, Ctv, MsgSend, etc.; CkMyPe(), CkNumPes(), etc.
Note: CkMyPe() now returns the thread's global serial number.
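One plausible way to derive that global serial number (an assumed layout, numbering nodes in x-major order and then the worker threads within each node; the emulator's actual numbering may differ):

    // Global serial number of a worker thread, unique across the whole
    // emulated machine, so that CkMyPe() can identify the Charm++ node.
    int globalSerial(int x, int y, int z, int sizeX, int sizeY,
                     int workThreadsPerNode, int myWorkThread) {
        int nodeIdx = x + sizeX * (y + sizeY * z);
        return nodeIdx * workThreadsPerNode + myWorkThread;
    }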
15
Future Work: Simulator
LeanMD: a fully functional MD code with cutoff only.
How can we examine the performance of algorithms on variants of the processor-in-memory design in a massive system?
Several layers of detail to measure:
Basic: correctly model performance; timestamp messages with correction for out-of-order execution
More detailed: network performance, memory access, modeling the sharing of floating-point units, estimation techniques