BigSim: A Parallel Simulator for Performance Prediction of Extremely Large Parallel Machines
Gengbin Zheng, Gunavardhan Kakulapati, Laxmikant V. Kale
University of Illinois at Urbana-Champaign

Motivations
- Extremely large parallel machines are around the corner. Examples: ASCI Purple (12K processors, 100 TF), BlueGene/L (64K processors, 360 TF), BlueGene/C (8M processors, 1 PF)
- Petaflops machines are likely to have 100K+ processors (perhaps 1M)
- Would existing parallel applications scale?
  - The machines are not there yet
  - Parallel performance is hard to model without actually running the program

BlueGene/L
(Figure: the BlueGene/L machine)

Roadmap
- Explore suitable programming models: Charm++ (message-driven), and MPI with its extension AMPI (an adaptive version of MPI)
- Use a parallel emulator to run applications
- Build a coarse-grained simulator for performance prediction (not hardware simulation)

Charm++: An Object-Based Programming Model
- The user is only concerned with the interactions between objects
(Figure: user view vs. system implementation)

Charm++ Object-Based Programming Model
- Processor virtualization: divide the computation into a large number of pieces
  - independent of the number of processors
  - typically many more pieces than processors
- Let the system map objects to processors
- This empowers an adaptive, intelligent runtime system
(Figure: user view vs. system implementation)
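To make the virtualization idea concrete, here is a minimal, hedged Charm++ sketch (not code from the talk; the module, class, and method names are hypothetical): the user creates an array of Worker objects whose size is chosen by the problem, and the runtime decides which processor each element runs on.

  // worker.ci -- Charm++ interface file (hypothetical names)
  mainmodule worker {
    readonly CProxy_Main mainProxy;
    mainchare Main {
      entry Main(CkArgMsg* m);
      entry void done();
    };
    array [1D] Worker {
      entry Worker();
      entry void doStep(int iter);
    };
  };

  // worker.C -- the number of Worker objects is independent of the number of
  // physical processors; the runtime maps (and can migrate) the elements.
  #include "worker.decl.h"

  CProxy_Main mainProxy;

  class Main : public CBase_Main {
    int received, total;
  public:
    Main(CkArgMsg* m) : received(0), total(100000) {
      delete m;
      mainProxy = thisProxy;
      CProxy_Worker workers = CProxy_Worker::ckNew(total);  // 100K pieces
      workers.doStep(0);           // asynchronous broadcast to all elements
    }
    void done() { if (++received == total) CkExit(); }
  };

  class Worker : public CBase_Worker {
  public:
    Worker() {}
    Worker(CkMigrateMessage* m) {}
    void doStep(int iter) {
      // ... compute on this object's piece of the data ...
      mainProxy.done();
    }
  };

  #include "worker.def.h"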

Charm++ for Petascale Machines
- Explicit management of resources ("this data on that processor", "this work on that processor")
- Objects can migrate
- Automatic, efficient resource management
- One-sided communication
- Asynchronous global operations (reductions, ...), sketched below
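As a hedged illustration of the asynchronous global operations mentioned above (a fragment reusing the hypothetical Worker/Main names from the previous sketch, not the talk's own code): each object contributes a value to a non-blocking sum reduction and continues working, and the runtime delivers the combined result to a callback.

  // Fragment extending the earlier Worker/Main sketch; sumDone must also be
  // declared as an entry method in worker.ci.

  // In Worker::doStep(): contribute a local value to a global sum and keep going.
  double local = computeLocalValue();             // hypothetical helper
  contribute(sizeof(double), &local, CkReduction::sum_double,
             CkCallback(CkIndex_Main::sumDone(NULL), mainProxy));

  // In Main: the runtime invokes this once all contributions are combined.
  void Main::sumDone(CkReductionMsg* msg) {
    double total = *(double*)msg->getData();      // combined sum from all Workers
    delete msg;
    CkPrintf("global sum = %f\n", total);
  }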

AMPI: MPI + Processor Virtualization
- MPI processes are implemented as virtual processors (user-level, migratable threads)
(Figure: 7 MPI "processes" implemented as threads on real processors)
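A hedged usage sketch (program name and processor counts are illustrative): an unmodified MPI code is built with AMPI's compiler wrapper and launched with more virtual processors than physical processors, so each MPI rank becomes a user-level, migratable thread.

  # Build an ordinary MPI program against AMPI
  ampicc -o jacobi jacobi.c

  # Run 32 MPI ranks (virtual processors) on 4 physical processors
  charmrun +p4 ./jacobi +vp32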

Parallel Emulator
- Actually run the parallel program, emulating the full machine on an existing parallel machine
- Based on a common low-level abstraction (API): many multiprocessor nodes connected via message passing
- The emulator supports Charm++ and AMPI
- Gengbin Zheng, Arun Singla, Joshua Unger, Laxmikant V. Kalé, "A Parallel-Object Programming Model for PetaFLOPS Machines and Blue Gene/Cyclops", NGS Program Workshop, IPDPS 2002

Emulation on a Parallel Machine
(Figure: each simulating (host) processor hosts many simulated multi-processor nodes and simulated processors)
- Emulating 8M threads on 96 ASCI Red processors

Emulator Performance
- Scalable: emulated a real-world molecular dynamics application on a 200K-processor BlueGene machine
- (Same reference: Zheng et al., "A Parallel-Object Programming Model for PetaFLOPS Machines and Blue Gene/Cyclops", NGS Program Workshop, IPDPS 2002)

From Emulator to Simulator
- Goal: predicting parallel performance
- Modeling parallel performance accurately is challenging because of:
  - the communication subsystem
  - the behavior of the runtime system
  - the sheer size of the machine

Performance Prediction
- Parallel Discrete Event Simulation (PDES):
  - each logical processor (LP) has a virtual clock
  - events are time-stamped
  - the state of an LP changes when an event arrives at it
- Our emulator was extended to carry out PDES (a generic sketch follows)
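A generic, hedged C++ sketch of the PDES ingredients named above (not BigSim's actual implementation): a logical processor with a virtual clock and a queue of time-stamped events, processed in timestamp order.

  #include <functional>
  #include <queue>
  #include <vector>

  struct Event {
    double timestamp;                  // virtual time at which the event takes effect
    std::function<void()> handler;     // state change the event applies to the LP
    bool operator>(const Event& o) const { return timestamp > o.timestamp; }
  };

  class LogicalProcessor {
    double virtualClock = 0.0;
    std::priority_queue<Event, std::vector<Event>, std::greater<Event>> pending;
  public:
    void schedule(const Event& e) { pending.push(e); }

    // Process the earliest pending event: advance the virtual clock to its
    // timestamp (never backwards) and apply its state change.
    bool step() {
      if (pending.empty()) return false;
      Event e = pending.top();
      pending.pop();
      if (e.timestamp > virtualClock) virtualClock = e.timestamp;
      e.handler();
      return true;
    }

    double now() const { return virtualClock; }
  };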

Predicting Performance Components
- How to predict the components of execution time? At multiple resolution levels:
- Sequential component:
  - user-supplied expressions
  - performance counters
  - instruction-level simulation
- Parallel (communication) component:
  - simple latency-based network model
  - contention-based network simulation

Prior PDES Work
- Conservative vs. optimistic protocols
- Conservative (examples: DaSSF, MPI-SIM):
  - ensure the safety of processing events in a global fashion
  - typically require a lookahead, with high global synchronization overhead
- Optimistic (examples: Time Warp, SPEEDES):
  - each LP processes its earliest event on its own, and undoes earlier out-of-order execution when causality errors occur
  - exploits the parallelism of the simulation better, and is generally preferred

Why Not Use Existing PDES?
- Major synchronization overheads: rollback/restart overhead and checkpointing overhead
- We can do better when simulating certain parallel applications
- Key property: the inherent determinacy of parallel applications; most parallel programs are written to be deterministic (example: Jacobi)

Timestamp Correction
- Messages should be executed in the order of their timestamps
- Out-of-order message delivery causes causality errors; traditional methods require rollback and checkpointing
- The inherent determinacy is hidden in applications, so event dependencies must be captured:
  - by run-time detection
  - by using the "Structured Dagger" language to express dependencies

Simulation of Different Application Classes
- Linear-order applications: no wildcard MPI receives; strong determinacy, so no timestamp correction is necessary
- Reactive (atomic) applications: message-driven objects whose methods execute as the corresponding messages arrive
- Multi-dependent applications: Irecvs completed by a WaitAll (MPI), or Structured Dagger used to capture the dependencies (Charm++); see the MPI fragment below
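A hedged C/C++ MPI fragment (buffers and neighbor ranks are illustrative) contrasting the first and third classes above: nonblocking receives completed by a WaitAll may complete in either order, so the simulator must capture both dependencies, whereas a fixed-source blocking receive imposes a linear order that needs no timestamp correction.

  #include <mpi.h>

  void exchangeStrips(double* leftBuf, double* rightBuf, int n,
                      int left, int right, int tag) {
    // Multi-dependent pattern: two nonblocking receives completed together.
    // Either message may arrive first, so both dependencies must be tracked
    // before the computation that follows can be given a timestamp.
    MPI_Request reqs[2];
    MPI_Irecv(leftBuf,  n, MPI_DOUBLE, left,  tag, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(rightBuf, n, MPI_DOUBLE, right, tag, MPI_COMM_WORLD, &reqs[1]);
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    // Linear-order pattern, for contrast: a blocking receive from a fixed
    // source (no MPI_ANY_SOURCE wildcard), so the message order is determined
    // by the program itself.
    // MPI_Recv(leftBuf, n, MPI_DOUBLE, left, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  }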

Structured Dagger

  entry void jacobiLifeCycle() {
    for (i = 0; i < MAX_ITER; i++) {
      atomic { sendStripToLeftAndRight(); }
      overlap {
        when getStripFromLeft(Msg *leftMsg) {
          atomic { copyStripFromLeft(leftMsg); }
        }
        when getStripFromRight(Msg *rightMsg) {
          atomic { copyStripFromRight(rightMsg); }
        }
      }
      atomic { doWork(); /* Jacobi relaxation */ }
    }
  }

Time-Stamping Messages
- Each LP has a virtual timer: curT
- When a message is sent: RecvT(msg) = curT + Latency
- When a message is scheduled for execution: curT = max(curT, RecvT(msg))
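A hedged C++ rendering of the two rules (the names are illustrative and follow the generic LogicalProcessor sketch earlier, not BigSim's internals):

  #include <algorithm>

  // Rule 1 -- on send: stamp the message with the sender's current virtual
  // time plus the modeled network latency.
  double receiveTimestamp(double senderCurT, double latency) {
    return senderCurT + latency;              // RecvT(msg) = curT + Latency
  }

  // Rule 2 -- on scheduling at the receiver: the virtual clock never moves
  // backwards, but jumps forward if the message arrives "in the future".
  double advanceClock(double receiverCurT, double recvT) {
    return std::max(receiverCurT, recvT);     // curT = max(curT, RecvT(msg))
  }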

Message Timestamp Correction
(Figure: an execution timeline of messages M1 through M8 with their receive times, before and after timestamp correction)

Architecture of the BigSim Simulator
(Figure: Charm++ and MPI applications run on the BigSim emulator over the Charm++ runtime; an online PDES engine works with instruction-level simulators (RSim, IBM, ...), a simple network model, performance counters, and a load balancing module; simulation output trace logs feed performance visualization in Projections)

Architecture of the BigSim Simulator (with offline network simulation)
(Figure: the same components as above, with the simulation output trace logs additionally feeding the BigNetSim (POSE) network simulator for offline PDES)

Big Network Simulation
- Simulate network behavior: packetization, routing, contention, etc.
- Incorporated with post-mortem timestamp correction via POSE
- Switches are connected in a torus network
(Figure: the BigSim emulator produces BG log files of tasks and dependencies; POSE timestamp correction in BigNetSim produces timestamp-corrected tasks)

BigSim Validation on Lemieux
(Figure: validation results using 32 real processors)

Jacobi on a 64K-Processor BG/L
(Figure: Jacobi performance on a simulated 64K-processor BlueGene/L)

Case Study: LeanMD
- Molecular dynamics simulation designed for large machines, using k-away cutoff parallelization
- Benchmark: er-gre with 3-away parallelization; 1.6 million objects; an 8-step simulation
- Target: a 32K-processor BlueGene machine, simulated on 400 PSC Lemieux processors
- Analyzed with performance visualization tools

Load Imbalance Histogram
(Figure)

Performance of the BigSim Simulator
(Figure: simulator performance as a function of the number of real processors on PSC Lemieux)

Conclusions
- Improved simulation efficiency by taking advantage of the "inherent determinacy" of parallel applications
- The simulation techniques we explored show good parallel scalability

Future Work
- Improving simulation accuracy: instruction-level simulator, network simulator
- Developing run-time techniques (e.g., load balancing) for very large machines using the simulator