Robust Non-Intrusive Record-Replay with Processor Extraction Filippo Gioachin Gengbin Zheng Laxmikant Kalé Parallel Programming Laboratory Departement.

Slides:



Advertisements
Similar presentations
Ensuring Operating System Kernel Integrity with OSck By Owen S. Hofmann Alan M. Dunn Sangman Kim Indrajit Roy Emmett Witchel Kent State University College.
Advertisements

CHESS: A Systematic Testing Tool for Concurrent Software CSCI6900 George.
Abhinav Bhatele, Laxmikant V. Kale University of Illinois at Urbana-Champaign Sameer Kumar IBM T. J. Watson Research Center.
Abhinav Bhatele, Laxmikant V. Kale University of Illinois at Urbana-Champaign Sameer Kumar IBM T. J. Watson Research Center.
Dr. Gengbin Zheng and Ehsan Totoni Parallel Programming Laboratory University of Illinois at Urbana-Champaign April 18, 2011.
User Level Interprocess Communication for Shared Memory Multiprocessor by Bershad, B.N. Anderson, A.E., Lazowska, E.D., and Levy, H.M.
1: Operating Systems Overview
OPERATING SYSTEM OVERVIEW
Active Messages: a Mechanism for Integrated Communication and Computation von Eicken et. al. Brian Kazian CS258 Spring 2008.
Computer Organization and Architecture
Parallel Programming Models Jihad El-Sana These slides are based on the book: Introduction to Parallel Computing, Blaise Barney, Lawrence Livermore National.
An Example Use Case Scenario
Analyzing parallel programs with Pin Moshe Bach, Mark Charney, Robert Cohn, Elena Demikhovsky, Tevi Devor, Kim Hazelwood, Aamer Jaleel, Chi- Keung Luk,
Boston, May 22 nd, 2013 IPDPS 1 Acceleration of an Asynchronous Message Driven Programming Paradigm on IBM Blue Gene/Q Sameer Kumar* IBM T J Watson Research.
Integrated Performance Views in Charm++: Projections meets TAU Scott Biersdorff Allen D. Malony Department Computer and Information Science University.
(Superficial!) Review of Uniprocessor Architecture Parallel Architectures and Related concepts CS 433 Laxmikant Kale University of Illinois at Urbana-Champaign.
Advanced / Other Programming Models Sathish Vadhiyar.
1 Blue Gene Simulator Gengbin Zheng Gunavardhan Kakulapati Parallel Programming Laboratory Department of Computer Science.
A Fault Tolerant Protocol for Massively Parallel Machines Sayantan Chakravorty Laxmikant Kale University of Illinois, Urbana-Champaign.
Dynamic Load Balancing in Charm++ Abhinav S Bhatele Parallel Programming Lab, UIUC.
Workshop BigSim Large Parallel Machine Simulation Presented by Eric Bohm PPL Charm Workshop 2004.
1Charm++ Workshop 2010 The BigSim Parallel Simulation System Gengbin Zheng, Ryan Mokos Charm++ Workshop 2010 Parallel Programming Laboratory University.
University of Illinois at Urbana-Champaign Memory Architectures for Protein Folding: MD on million PIM processors Fort Lauderdale, May 03,
A uGNI-Based Asynchronous Message- driven Runtime System for Cray Supercomputers with Gemini Interconnect Yanhua Sun, Gengbin Zheng, Laximant(Sanjay) Kale.
Programming an SMP Desktop using Charm++ Laxmikant (Sanjay) Kale Parallel Programming Laboratory Department of Computer Science.
Shangkar Mayanglambam, Allen D. Malony, Matthew J. Sottile Computer and Information Science Department Performance.
Integrated Performance Views in Charm++: Projections meets TAU Scott Biersdorff Allen D. Malony Department Computer and Information Science University.
Hierarchical Load Balancing for Large Scale Supercomputers Gengbin Zheng Charm++ Workshop 2010 Parallel Programming Lab, UIUC 1Charm++ Workshop 2010.
FTC-Charm++: An In-Memory Checkpoint-Based Fault Tolerant Runtime for Charm++ and MPI Gengbin Zheng Lixia Shi Laxmikant V. Kale Parallel Programming Lab.
Embedded Real-Time Systems Processing interrupts Lecturer Department University.
Embedded Real-Time Systems
Teragrid 2009 Scalable Interaction with Parallel Applications Filippo Gioachin Chee Wai Lee Laxmikant V. Kalé Department of Computer Science University.
1 Scalable Cosmological Simulations on Parallel Machines Filippo Gioachin¹ Amit Sharma¹ Sayantan Chakravorty¹ Celso Mendes¹ Laxmikant V. Kale¹ Thomas R.
Debugging Large Scale Parallel Applications Filippo Gioachin Parallel Programming Laboratory Departement of Computer Science University of Illinois at.
Doctoral Defense Filippo Gioachin September 27, 2010 Debugging Large Scale Applications with Virtualization.
Pitfalls: Time Dependent Behaviors CS433 Spring 2001 Laxmikant Kale.
Debugging Large Scale Applications in a Virtualized Environment Filippo Gioachin Gengbin Zheng Laxmikant Kalé Parallel Programming Laboratory Departement.
PADTAD 2008 Memory Tagging in Charm++ Filippo Gioachin Laxmikant V. Kalé Department of Computer Science University of Illinois at Urbana-Champaign.
IPDPS 2009 Dynamic High-level Scripting in Parallel Applications Filippo Gioachin Laxmikant V. Kalé Department of Computer Science University of Illinois.
1 ChaNGa: The Charm N-Body GrAvity Solver Filippo Gioachin¹ Pritish Jetley¹ Celso Mendes¹ Laxmikant Kale¹ Thomas Quinn² ¹ University of Illinois at Urbana-Champaign.
Debugging Tools for Charm++ Applications Filippo Gioachin University of Illinois at Urbana-Champaign.
Lecture 3 – MapReduce: Implementation CSE 490h – Introduction to Distributed Computing, Spring 2009 Except as otherwise noted, the content of this presentation.
Productive Performance Tools for Heterogeneous Parallel Computing
Operating System Structure
Presented by: Daniel Taylor
YAHMD - Yet Another Heap Memory Debugger
Distributed Processors
Operating Systems (CS 340 D)
Operating System Structure
CharmDebug Filippo Gioachin.
uGNI-based Charm++ Runtime for Cray Gemini Interconnect
Operating Systems (CS 340 D)
Designing and Debugging Batch and Interactive COBOL Programs
Integrated Runtime of Charm++ and OpenMP
Introduction to locality sensitive approach to distributed systems
Operating Systems.
Presented by Neha Agrawal
Prof. Leonardo Mostarda University of Camerino
CS510 - Portland State University
Case Studies with Projections
PROGRAMMING FUNDAMENTALS Lecture # 03. Programming Language A Programming language used to write computer programs. Its mean of communication between.
Operating System Introduction.
BigSim: Simulating PetaFLOPS Supercomputers
Debugging Support for Charm++
Gengbin Zheng, Esteban Meneses, Abhinav Bhatele and Laxmikant V. Kale
Database System Architectures
Chapter 2 Operating System Overview
An Orchestration Language for Parallel Objects
Lecture Topics: 11/1 Hand back midterms
Emulating Massively Parallel (PetaFLOPS) Machines
Presentation transcript:

Robust Non-Intrusive Record-Replay with Processor Extraction Filippo Gioachin Gengbin Zheng Laxmikant Kalé Parallel Programming Laboratory Departement of Computer Science University of Illinois at Urbana-Champaign

13 Aprile 2010Filippo Gioachin - UIUC2 Outline ● Motivations ● Three-step Procedure – Detecting failure ● Performance – Benchmarks and real applications ● Conclusions

13 Aprile 2010Filippo Gioachin - UIUC3 Motivations ● Intermittent problem – Only certain event orderings cause the problem ● Problems may not appear at small scale – Races between messages ● Latencies in the underlying hardware – Incorrect messaging – Data decomposition

13 Aprile 2010Filippo Gioachin - UIUC4 Problems at Large Scale ● Infeasible – Debugger needs to handle many processors – Human can be overwhelmed by information – Long waiting time in queue – Machine not available ● Expensive – Large machine allocation consume a lot of computational resources

13 Aprile 2010Filippo Gioachin - UIUC5 Do we need all the processors? ● The problem manifests itself on a single processor – If more than one, they are equivalent ● The cause can span multiple processors (causally related) – The subset is generally much smaller than the whole system ● Select the interesting processors and ignore the others

13 Aprile 2010Filippo Gioachin - UIUC6 Fighting non-determinism ● Record all data processed by each processor – Huge volume of data stored – High interference with application ● Likely the bug will not appear... – Need to run a non-optimized code ● Record only message ordering – Based on piecewise deterministic assumption – Must re-execute using the same machine

13 Aprile 2010Filippo Gioachin - UIUC7 Three-step Procedure for Processor Extraction Execute program recording message ordering Replay application with detailed recording enabled Replay selected processors as stand-alone Is problem solved? Done Select processor s to record Step1 Step2 Step3 Has bug appear ed? Minimize perturbation (few bytes per message) ● Iterate for incremental extraction ● Use message ordering to guarantee determinism ● Can execute in the virtualized environment Robust Record- Replay with Processor Extraction * F. Gioachin, G. Zheng, L.V. Kalé: "Robust Record- Replay with Processor Extraction" in Proceedings of the Workshop on Parallel and Distributed Systems: Testing, Analysis, and Debugging (PADTAD – VIII), 2010 Use r

13 Aprile 2010Filippo Gioachin - UIUC8 What if the piecewise deterministic assumption is not met? ● Make sure to detect it, and notify the user ● If all messages are identical, then we can assume the non-determinism was captured ● Methods to detect failure: – Message size and destination – Checksum of the whole message (XOR, CRC32)

13 Aprile 2010Filippo Gioachin - UIUC9 Computing Checksums ● Checksum considers memory as raw data, ignores what it contains – Pointers – Garbage ● Uninitialized fields ● Compiler padding ● Use Charm++ memory allocator – Intercept calls to malloc and pre-fill memory double intshort double int double

13 Aprile 2010Filippo Gioachin - UIUC10 Message Order Recording Performance (on NCSA's Abe)

13 Aprile 2010Filippo Gioachin - UIUC11 kNeighbor

13 Aprile 2010Filippo Gioachin - UIUC12 ChaNGa (dwf on NCSA's BluePrint)

13 Aprile 2010Filippo Gioachin - UIUC13 Replaying the Application

13 Aprile 2010Filippo Gioachin - UIUC14 Virtual Processor BigSim Emulator Message Queue Converse Main Thread Worker Thread Communication Thread Communication Thread

13 Aprile 2010Filippo Gioachin - UIUC15 Replaying under BigSim Emulation: NAMD

13 Aprile 2010Filippo Gioachin - UIUC16 Amount of Data Saved ChaNGa dwf1.2048, numbers in MB

13 Aprile 2010Filippo Gioachin - UIUC17 Debugging Case Study ● Message race during particle exchange – Fixed with tedious print statements (while trying to avoid hiding the bug...)../charmdebug +p16../ChaNGa cube300.param +record +recplay-crc../charmdebug +p16../ChaNGa cube300.param +replay +recplay-crc +record-detail 7 gdb../ChaNGa >> run cube300.param +replay-detail 7/16

13 Aprile 2010Filippo Gioachin - UIUC18 CharmDebug ● Debugger tailored to Charm++ applications – Show information pertinent to the user ● Messages in the queue ● Chare elements and their state ● Effective scalability to large applications – Uses single connection to interact with application

13 Aprile 2010Filippo Gioachin - UIUC19 CharmDebug Integration ● Enable record/replay – Select which processors to record/replay – Select which detection mechanism to use

13 Aprile 2010Filippo Gioachin - UIUC20 Summary ● Capture non-determinism present in applications ● Resources are expensive and should not be wasted – Processor Extraction to capture non-determinism of parallel application ● Must not interfere too much with the application timing ● Save only needed information – Use virtualization to extract using less resources

13 Aprile 2010Filippo Gioachin - UIUC21 Future Extensions ● Multi-replay with CharmDebug ● Replay on different architectures – Message translation (envelope and content) ● Replay in isolation of single virtual entities – Extract single chares from the application – Conditions of validity