BigSim: A Parallel Simulator for Performance Prediction of Extremely Large Parallel Machines
Gengbin Zheng, Gunavardhan Kakulapati, Laxmikant V. Kale
University of Illinois at Urbana-Champaign
IPDPS 4/29/

Motivations
- Extremely large parallel machines are around the corner. Examples: ASCI Purple (12K processors, 100 TF), BlueGene/L (64K, 360 TF), BlueGene/C (8M, 1 PF)
- PF machines are likely to have 100K+ processors (1M?)
- Would existing parallel applications scale?
- The machines are not available yet
- Parallel performance is hard to model without actually running the program
BlueGene/L
Roadmap
- Explore suitable programming models: Charm++ (message-driven), and MPI with its extension AMPI (an adaptive version of MPI)
- Use a parallel emulator to run applications
- Build a coarse-grained simulator for performance prediction (not hardware simulation)
Charm++: An Object-Based Programming Model
- The user is concerned only with the interaction between objects
(Figure: user view vs. system implementation)
Charm++ Object-Based Programming Model
- Processor virtualization: divide the computation into a large number of pieces
  - Independent of the number of processors, and typically larger than the number of processors
- Let the system map objects to processors
- Empowers an adaptive, intelligent runtime system
(Figure: user view vs. system implementation)
Charm++ for Peta-scale Machines
- Explicit management of resources: this data on that processor, this work on that processor
- Objects can migrate
- Automatic, efficient resource management
- One-sided communication
- Asynchronous global operations (reductions, ...)
AMPI: MPI + Processor Virtualization
- MPI "processes" implemented as virtual processors (user-level migratable threads)
(Figure: MPI "processes" mapped onto real processors)
Parallel Emulator
- Actually runs a parallel program, emulating the full machine on an existing parallel machine
- Based on a common low-level abstraction (API): many multiprocessor nodes connected via message passing
- The emulator supports Charm++ and AMPI

Gengbin Zheng, Arun Singla, Joshua Unger, Laxmikant V. Kalé, "A Parallel-Object Programming Model for PetaFLOPS Machines and Blue Gene/Cyclops", NGS Program Workshop, IPDPS 2002
Emulation on a Parallel Machine
- Emulating 8M threads on 96 ASCI-Red processors
(Figure: simulated multiprocessor nodes and simulated processors mapped onto the simulating (host) processors)
Emulator Performance
- Scalable: emulating a real-world MD application on a 200K-processor BG machine

Gengbin Zheng, Arun Singla, Joshua Unger, Laxmikant V. Kalé, "A Parallel-Object Programming Model for PetaFLOPS Machines and Blue Gene/Cyclops", NGS Program Workshop, IPDPS 2002
Emulator to Simulator
- Goal: predicting parallel performance
- Modeling parallel performance accurately is challenging, due to:
  - the communication subsystem
  - the behavior of the runtime system
  - the sheer size of the machine
Performance Prediction
- Parallel Discrete Event Simulation (PDES)
  - Each logical processor (LP) has a virtual clock
  - Events are timestamped
  - The state of an LP changes when an event arrives at it
- Our emulator was extended to carry out PDES
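The LP mechanics above can be sketched in a few lines. This is a minimal, illustrative model (the names Event and LogicalProcessor are ours, not the BigSim API): each event carries a timestamp, and the LP's virtual clock jumps to the timestamp of each event it processes, in timestamp order.

```cpp
#include <functional>
#include <queue>
#include <vector>

// One logical processor (LP) in a discrete-event simulation (sketch).
struct Event {
    double timestamp;            // virtual time at which the event occurs
    std::function<void()> action;
};

struct EventLater {
    bool operator()(const Event& a, const Event& b) const {
        return a.timestamp > b.timestamp;   // min-heap on timestamp
    }
};

class LogicalProcessor {
public:
    double clock = 0.0;          // virtual clock of this LP

    void schedule(double t, std::function<void()> action) {
        queue.push(Event{t, std::move(action)});
    }

    // Process all pending events in timestamp order.
    void run() {
        while (!queue.empty()) {
            Event ev = queue.top();
            queue.pop();
            clock = ev.timestamp;  // the LP's state changes when the event arrives
            ev.action();
        }
    }

private:
    std::priority_queue<Event, std::vector<Event>, EventLater> queue;
};
```

Events scheduled out of order are still executed in increasing timestamp order, and the clock ends at the latest processed timestamp.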
Predicting Parallel Components
- How do we predict the time of each component? At multiple resolution levels:
- Sequential component: user-supplied expression, performance counters, or instruction-level simulation
- Parallel (communication) component: a simple latency-based network model, or contention-based network simulation
Prior PDES Work
- Conservative vs. optimistic protocols
- Conservative (examples: DaSSF, MPI-SIM):
  - ensures the safety of processing events in a global fashion
  - typically requires look-ahead, incurring high global synchronization overhead
- Optimistic (examples: Time Warp, SPEEDES):
  - each LP processes its earliest event on its own, undoing earlier out-of-order execution when causality errors occur
  - exploits the parallelism of the simulation better, and is preferred
Why Not Use Existing PDES?
- Major synchronization overheads: rollback/restart overhead, checkpointing overhead
- We can do better when simulating certain parallel applications
- Key property: inherent determinacy — most parallel programs are written to be deterministic (example: Jacobi)
Timestamp Correction
- Messages should be executed in the order of their timestamps
- Causality errors arise from out-of-order message delivery
- Rollback and checkpointing are necessary in traditional methods
- The inherent determinacy hidden in applications lets us avoid them, but we must capture event dependencies:
  - run-time detection
  - using the Structured Dagger language to express dependencies explicitly
Simulation of Different Applications
- Linear-order applications: no wildcard MPI receives; strong determinacy, so no timestamp correction is necessary
- Reactive (atomic) applications: message-driven objects whose methods execute as the corresponding messages arrive
- Multi-dependent applications: Irecvs with WaitAll (MPI), or Structured Dagger to capture dependencies (Charm++)
Structured Dagger

entry void jacobiLifeCycle() {
  for (i = 0; i < MAX_ITER; i++) {
    atomic { sendStripToLeftAndRight(); }
    overlap {
      when getStripFromLeft(Msg *leftMsg) {
        atomic { copyStripFromLeft(leftMsg); }
      }
      when getStripFromRight(Msg *rightMsg) {
        atomic { copyStripFromRight(rightMsg); }
      }
    }
    atomic { doWork(); /* Jacobi Relaxation */ }
  }
}
Timestamping Messages
- Each LP keeps a virtual timer curT
- When a message is sent: RecvT(msg) = curT + Latency
- When a message is scheduled: curT = max(curT, RecvT(msg))
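These two rules transcribe directly into code. The sketch below is illustrative only (the LP struct and method names are ours); "latency" is a parameter of the network model, not a measured value. The max() on the receive side ensures an LP's virtual clock never moves backwards.

```cpp
#include <algorithm>

// Timestamping rules for one LP (sketch).
struct LP {
    double curT = 0.0;  // virtual timer of this LP

    // Sending: stamp the message with its predicted receive time.
    double stampOutgoing(double latency) const {
        return curT + latency;  // RecvT(msg) = curT + Latency
    }

    // Scheduling: advance the clock, but never move it backwards.
    void scheduleIncoming(double recvT) {
        curT = std::max(curT, recvT);  // curT = max(curT, RecvT(msg))
    }
};
```

For example, a sender with curT = 10 and a latency of 2 stamps its message with RecvT = 12; a receiver whose clock is already past 12 keeps its own (later) time.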
IPDPS 4/29/ M1M7M6M5M4M3M2 RecvTime Execution TimeLine M8 Execution TimeLine M1M7M6M5M4M3M2M8 RecvTime Correction Message Timestamps Correction
Architecture of the BigSim Simulator (online mode)
- Charm++ and MPI applications run on the BigSim Emulator and the Charm++ Runtime
- The online PDES engine draws on an instruction simulator (RSim, IBM, ...), performance counters, or a simple network model, plus a load balancing module
- Simulation output: trace logs, fed to performance visualization (Projections)
Architecture of the BigSim Simulator (offline mode)
- The same components as online mode, plus offline PDES: the trace logs are fed to BigNetSim (POSE), a network simulator
- Performance visualization (Projections) consumes the simulation output
Big Network Simulation
- Simulates network behavior: packetization, routing, contention, etc.
- Incorporates post-mortem timestamp correction via POSE
- Switches are connected in a torus network
- Pipeline: BigSim Emulator → BG log files (tasks & dependencies) → POSE timestamp correction → timestamp-corrected tasks → BigNetSim
BigSim Validation on Lemieux
(Figure: validation results on 32 real processors)
Jacobi on a 64K BG/L
Case Study: LeanMD
- Molecular dynamics simulation designed for large machines, with k-away cut-off parallelization
- Benchmark: er-gre, 3-away; 1.6 million objects; an 8-step simulation of a 32K-processor BG machine
- Running on 400 PSC Lemieux processors
- Analyzed with performance visualization tools
Load Imbalance Histogram
Performance of the BigSim Simulator
(Figure: simulation performance vs. number of real processors, PSC Lemieux)
Conclusions
- Improved simulation efficiency by exploiting the inherent determinacy of parallel applications
- The simulation techniques explored show good parallel scalability
Future Work
- Improving simulation accuracy: instruction-level simulator, network simulator
- Developing run-time techniques (e.g., load balancing) for very large machines, using the simulator