Welcome to the 2015 Charm++ Workshop!
Laxmikant (Sanjay) Kale
Parallel Programming Laboratory, Department of Computer Science
University of Illinois at Urbana-Champaign

A couple of forks
– MPI + X
– "Task models": asynchrony
– Overdecomposition and migratability: most adaptivity
[Figure labels: MPI+X, Task Models, Overdecomposition + Migratability]

Overdecomposition
– Decompose the work units and data units into many more pieces than execution units (cores/nodes/...)
– Not so hard: we do decomposition anyway
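
A minimal Charm++ sketch of this idea (illustrative only, not from the talk; the module, class, and variable names such as Block and numBlocks are made up): create a chare array with many more elements than processing elements and let the runtime place them.

    /* blocks.ci (interface file), hypothetical:
     *   mainmodule blocks {
     *     mainchare Main   { entry Main(CkArgMsg *m); };
     *     array [1D] Block { entry Block(); entry void doWork(); };
     *   };
     */
    #include "blocks.decl.h"

    class Main : public CBase_Main {
    public:
      Main(CkArgMsg *m) {
        // Overdecomposition: many more work/data units than cores.
        int numBlocks = 8 * CkNumPes();
        CProxy_Block blocks = CProxy_Block::ckNew(numBlocks);
        blocks.doWork();              // asynchronous broadcast to all elements
        CkExitAfterQuiescence();      // exit once all messages are processed
        delete m;
      }
    };

    class Block : public CBase_Block {
    public:
      Block() {}
      void doWork() {
        CkPrintf("Block %d running on PE %d of %d\n",
                 thisIndex, CkMyPe(), CkNumPes());
      }
    };

    #include "blocks.def.h"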

Migratability
– Allow these work and data units to be migratable at runtime, i.e., the programmer or the runtime can move them
– Consequences for the app developer: communication must now be addressed to logical units with global names, not to physical processors. But this is a good thing
– Consequences for the RTS: it must keep track of where each unit is (naming and location management)
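
In Charm++ terms (again a hypothetical sketch, reusing the made-up Block array from above), migratability means two things: an element describes its state to the runtime through a PUP routine, and peers send to its logical array index through a proxy rather than to a physical processor.

    #include "blocks.decl.h"
    #include "pup_stl.h"              // PUP support for std::vector
    #include <vector>

    /*readonly*/ int numBlocks;       // would be declared readonly in the .ci file

    class Block : public CBase_Block {
      std::vector<double> data;       // element state that travels with it
    public:
      Block() : data(1024, 0.0) {}
      Block(CkMigrateMessage *m) {}   // constructor used when migrating in

      void pup(PUP::er &p) {          // serialize/deserialize for migration
        CBase_Block::pup(p);          // pup the superclass state first
        p | data;
      }

      // exchange() and recvBoundary() would be declared as entry methods
      // in the .ci file.
      void exchange() {
        int right = (thisIndex + 1) % numBlocks;
        // Addressed to a *logical* neighbor; the RTS finds it wherever it
        // currently lives, even if it was migrated a moment ago.
        thisProxy[right].recvBoundary(data.back());
      }
      void recvBoundary(double v) { /* use the neighbor's boundary value */ }
    };

    #include "blocks.def.h"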

Asynchrony: Message-Driven Execution
– Now you have multiple units on each processor, and they address each other via logical names
– Need for scheduling: what sequence should the work units execute in?
– One answer: let the programmer sequence them (seen in current codes, e.g. some AMR frameworks)
– Message-driven execution: let the work unit that happens to have data (a "message") available for it execute next. Let the RTS select among the ready work units. The programmer should not specify what executes next, but can influence it via priorities
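
A sketch of what "influence via priorities" can look like (assuming the standard CkEntryOptions interface; the proxy and method names are the made-up ones from above): the sender attaches a priority to an otherwise ordinary asynchronous invocation, and the scheduler on the destination picks among its ready messages accordingly.

    // Smaller integer priority values are served first.
    void sendUrgent(CProxy_Block blocks, int i) {
      CkEntryOptions opts;
      opts.setPriority(-10);          // more urgent than the default priority 0
      blocks[i].doWork(&opts);        // still asynchronous and message-driven
    }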

Charm++
– Charm++ began as an adaptive runtime system for dealing with application variability: dynamic load imbalances; task parallelism first (state-space search); iterative (but irregular/dynamic) apps in the mid-1990s
– But it turns out to be useful for future hardware, which is also characterized by variability

Message-Driven Execution
[Figure illustrating an asynchronous invocation A[..].foo(…)]

Empowering the RTS
The adaptive RTS can:
– Dynamically balance loads
– Optimize communication: spread it over time, asynchronous collectives
– Provide automatic latency tolerance
– Prefetch data with almost perfect predictability
[Figure: overdecomposition, migratability, and asynchrony enable the adaptive runtime system, which provides introspection and adaptivity]
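
For instance, dynamic load balancing is driven by the migratable elements themselves. The sketch below (hypothetical names, assuming the documented AtSync protocol) shows an element periodically handing control to the runtime, which may measure load, migrate elements, and then resume everyone.

    class Block : public CBase_Block {
    public:
      Block() { usesAtSync = true; }       // opt in to AtSync load balancing
      Block(CkMigrateMessage *m) {}

      void step() {                        // an entry method in the .ci file
        // ... one iteration of computation and communication ...
        AtSync();                          // give the RTS a chance to rebalance
      }
      void ResumeFromSync() {              // invoked once balancing completes
        thisProxy[thisIndex].step();       // go on to the next iteration
      }
    };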

What Do RTSs Look Like: Charm++
[Figure]

Fault Tolerance in Charm++/AMPI
Four approaches available:
– Disk-based checkpoint/restart
– In-local-storage double checkpoint with automatic restart (demonstrated on 64K cores)
– Proactive object migration
– Message logging: scalable fault tolerance; can tolerate frequent faults; parallel restart and the potential for handling faults during recovery
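
A sketch of how an application might trigger the first two schemes (assuming the documented CkStartCheckpoint and CkStartMemCheckpoint calls; Main, mainProxy, and resume are hypothetical names). In both cases object state is captured through the same PUP routines used for migration.

    void Main::checkpointNow() {
      // Entry method to run once the checkpoint is complete.
      CkCallback cb(CkIndex_Main::resume(), mainProxy);

      // (1) Disk-based checkpoint/restart: write every chare into a directory.
      CkStartCheckpoint("ckpt_dir", cb);

      // (2) In-local-storage double checkpoint with automatic restart:
      // each node keeps its own snapshot in memory plus a buddy's copy.
      // CkStartMemCheckpoint(cb);
    }

    void Main::resume() {
      // Continue the run (a disk checkpoint can later be restarted
      // with the +restart option).
    }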

Scalable Fault Tolerance
– Faults will be frequent at exascale (true??); fail-stop and soft failures are both important
– Checkpoint/restart may not scale (or will it?): it requires all nodes to roll back even when just one fails, which is inefficient in both computation and power
– As the MTBF goes down, it becomes infeasible
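
A quick back-of-the-envelope version of that last point (not on the slide; it uses the standard Young/Daly first-order model): if writing a checkpoint costs C and the machine MTBF is M, the optimal checkpoint interval is roughly τ ≈ sqrt(2·C·M), and the fraction of time lost to checkpointing plus rework is roughly sqrt(2·C/M). With C = 10 minutes and M = 1 hour that is already sqrt(2·600/3600) ≈ 0.58, i.e., more than half the machine is wasted, and it only worsens as M shrinks while C (a global I/O operation) does not.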

Message Logging
Basic idea:
– Only the processes/objects on the failed node go back to the checkpoint!
– Messages are stored by their senders during execution
– Periodic checkpoints are still maintained
– After a crash, the restarted objects reprocess the "resent" messages to regain their state
Does it help at exascale?
– Not really, or only a bit: recovery takes the same time!
– But with overdecomposition, the work of one processor is divided across multiple virtual processors, so the restart can be parallelized
– Virtualization helps the fault-free case as well
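
The sender-side bookkeeping can be pictured with the schematic below (illustrative only, not the actual Charm++ protocol, which also has to handle determinants, message ordering, and garbage collection; all type and function names are made up).

    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    struct Message { int destObj; uint64_t seq; std::vector<char> payload; };

    struct Sender {
      std::unordered_map<int, std::vector<Message>> log;  // per-destination log

      void send(const Message &m) {
        log[m.destObj].push_back(m);   // keep a copy until the next checkpoint
        transmit(m);                   // normal delivery
      }
      void onCheckpointCommitted(int destObj) {
        log.erase(destObj);            // messages older than the checkpoint can go
      }
      void onNodeFailure(int destObj) {
        // Only the failed objects roll back; everyone else re-sends its log,
        // and the restarted objects replay those messages to regain their state.
        for (const Message &m : log[destObj]) transmit(m);
      }
      void transmit(const Message &) { /* network send elided */ }
    };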

Normal checkpoint/restart method
[Figure: progress and power consumption over time] Power consumption is continuous; progress is slowed down with failures

Message logging + object-based virtualization
[Figure: progress and power consumption over time] Power consumption is lower during recovery; progress is faster with failures

Fail-stop recovery with message logging: a research vision
[Figure: the cylinder surface represents the nodes of the machine]

A fault hits a node; it regresses. Its objects start re-execution, in parallel, on neighboring nodes!

Re-execution continues even as other nodes continue forward. Due to the parallel re-execution, the neighborhood catches up.

Back to normal execution.

Another fault hits.

Even as its neighborhood is helping it recover, a third fault hits. Concurrent recovery is possible as long as the two failed nodes are not checkpoint buddies.

Review of Last Year at PPL
SC14:
– 6 papers at the main conference, including a state-of-the-practice paper on Charm++
– Charm++ tutorial and Resilience tutorial
– Charm++ BoF
– Harshitha Menon: George Michael Fellowship
Publications:
– Applications: SC, ParCo, ICPP, ICORES, IPDPS'14, IPDPS'15
– Resilience: TPDS, TJS, ParCo, Cluster (best paper)
– Runtime systems: SC, ROSS, ICPP, HiPC, IPDPS'15
– Interconnects/topologies: SC, HiPC, IPDPS'15
– Energy: SC, TOPC, PMAM
– Parallel discrete event simulation
Petascale applications made excellent progress: ChaNGa, NAMD, EpiSimdemics, OpenAtom
Exploration of Charm++ for exascale by DOE labs, Intel, ...

Charmworks, Inc.
A path to long-term sustainability of Charm++
Commercially supported version:
– Focus on nodes at Charmworks
– Existing collaborative apps (NAMD, OpenAtom) continue with the same licensing as before
The university version continues to be distributed freely, in source code form, for non-profits
Code base:
– Committed to avoiding divergence for a few years
– The Charmworks codebase will be streamlined
We will be happy to take your feedback

Workshop Overview
Keynotes: Martin Berzins, Jesus Labarta
Applications: Christoph Junghans, Tom Quinn (ChaNGa), Jim Phillips (NAMD), Xiang Ni (cloth simulation), Eric Bohm, Sohrab Ismail-Beigi, Glenn Martyna (OpenAtom)
New applications and mini-apps: Esteban Meneses, Robert Steinke (ADHydro), David Hollman (miniAero), Sam White (PlasComCM), Chen Meng (SC_Tanagram), Eric Mikida (ROSS), Hassan Eslami (graphs), Cyril Bordage, Huiwei Lu (ArgoBots)
Charm++ features and capabilities: Akhil Langer (power), Bilge Acun (TraceR & malleability), Phil Miller (64-bit IDs)
Tools: Xu Liu, Kate Isaacs, Nikhil Jain, Abhinav Bhatele, Todd Gamblin
Panel: Sustainable community software in academia