Doctoral Defense Filippo Gioachin September 27, 2010 Debugging Large Scale Applications with Virtualization.

Slides:



Advertisements
Similar presentations
Study of Hurricane and Tornado Operating Systems By Shubhanan Bakre.
Advertisements

The Path to Multi-core Tools Paul Petersen. Multi-coreToolsThePathTo 2 Outline Motivation Where are we now What is easy to do next What is missing.
Lightweight Remote Procedure Call Brian N. Bershad, Thomas E. Anderson, Edward D. Lazowska, and Henry M. Levy Presented by Alana Sweat.
User Level Interprocess Communication for Shared Memory Multiprocessor by Bershad, B.N. Anderson, A.E., Lazowska, E.D., and Levy, H.M.
3.5 Interprocess Communication Many operating systems provide mechanisms for interprocess communication (IPC) –Processes must communicate with one another.
Active Messages: a Mechanism for Integrated Communication and Computation von Eicken et. al. Brian Kazian CS258 Spring 2008.
3.5 Interprocess Communication
Replay Debugging for Distributed Systems Dennis Geels, Gautam Altekar, Ion Stoica, Scott Shenker.
Processes Part I Processes & Threads* *Referred to slides by Dr. Sanjeev Setia at George Mason University Chapter 3.
Stack Management Each process/thread has two stacks  Kernel stack  User stack Stack pointer changes when exiting/entering the kernel Q: Why is this necessary?
Computer System Architectures Computer System Software
Rensselaer Polytechnic Institute CSCI-4210 – Operating Systems CSCI-6140 – Computer Operating Systems David Goldschmidt, Ph.D.
Frontiers in Massive Data Analysis Chapter 3.  Difficult to include data from multiple sources  Each organization develops a unique way of representing.
A Fault Tolerant Protocol for Massively Parallel Machines Sayantan Chakravorty Laxmikant Kale University of Illinois, Urbana-Champaign.
Issues Autonomic operation (fault tolerance) Minimize interference to applications Hardware support for new operating systems Resource management (global.
Debugging parallel programs. Breakpoint debugging Probably the most widely familiar method of debugging programs is breakpoint debugging. In this method,
Computing Simulation in Orders Based Transparent Parallelizing Pavlenko Vitaliy Danilovich, Odessa National Polytechnic University Burdeinyi Viktor Viktorovych,
Seminar of “Virtual Machines” Course Mohammad Mahdizadeh SM. University of Science and Technology Mazandaran-Babol January 2010.
13-1 Chapter 13 Concurrency Topics Introduction Introduction to Subprogram-Level Concurrency Semaphores Monitors Message Passing Java Threads C# Threads.
Lecture 4 Mechanisms & Kernel for NOSs. Mechanisms for Network Operating Systems  Network operating systems provide three basic mechanisms that support.
3/12/2013Computer Engg, IIT(BHU)1 PARALLEL COMPUTERS- 2.
Operating Systems Unit 2: – Process Context switch Interrupt Interprocess communication – Thread Thread models Operating Systems.
Silberschatz, Galvin and Gagne ©2009Operating System Concepts – 8 th Edition Chapter 4: Threads.
Teragrid 2009 Scalable Interaction with Parallel Applications Filippo Gioachin Chee Wai Lee Laxmikant V. Kalé Department of Computer Science University.
Debugging Large Scale Parallel Applications Filippo Gioachin Parallel Programming Laboratory Departement of Computer Science University of Illinois at.
Robust Non-Intrusive Record-Replay with Processor Extraction Filippo Gioachin Gengbin Zheng Laxmikant Kalé Parallel Programming Laboratory Departement.
Debugging Large Scale Applications in a Virtualized Environment Filippo Gioachin Gengbin Zheng Laxmikant Kalé Parallel Programming Laboratory Departement.
PADTAD 2008 Memory Tagging in Charm++ Filippo Gioachin Laxmikant V. Kalé Department of Computer Science University of Illinois at Urbana-Champaign.
1 Computer System Overview Chapter 1. 2 Operating System Exploits the hardware resources of one or more processors Provides a set of services to system.
IPDPS 2009 Dynamic High-level Scripting in Parallel Applications Filippo Gioachin Laxmikant V. Kalé Department of Computer Science University of Illinois.
1 ChaNGa: The Charm N-Body GrAvity Solver Filippo Gioachin¹ Pritish Jetley¹ Celso Mendes¹ Laxmikant Kale¹ Thomas Quinn² ¹ University of Illinois at Urbana-Champaign.
Debugging Tools for Charm++ Applications Filippo Gioachin University of Illinois at Urbana-Champaign.
The Distributed Application Debugger (DAD)
Productive Performance Tools for Heterogeneous Parallel Computing
Presented by: Daniel Taylor
Processes and threads.
YAHMD - Yet Another Heap Memory Debugger
Chapter 3: Process Concept
Self Healing and Dynamic Construction Framework:
Advanced OS Concepts (For OCR)
Chapter 2: System Structures
Operating System (013022) Dr. H. Iwidat
CharmDebug Filippo Gioachin.
Introduction to Operating System (OS)
Real-time Software Design
#01 Client/Server Computing
Chapter 4: Threads.
Introduction to Operating Systems
Chapter 4: Threads.
Computer System Overview
Process Description and Control
Threads Chapter 4.
Process Description and Control
Multithreaded Programming
Prof. Leonardo Mostarda University of Camerino
Process Description and Control
CS510 - Portland State University
Chapter 3: Processes.
Chapter 4: Threads & Concurrency
Chapter 4: Threads.
BigSim: Simulating PetaFLOPS Supercomputers
Chapter 2 Processes and Threads 2.1 Processes 2.2 Threads
Database System Architectures
Chapter 2 Operating System Overview
Outline Chapter 2 (cont) Chapter 3: Processes Virtual machines
Support for Adaptivity in ARMCI Using Migratable Objects
MapReduce: Simplified Data Processing on Large Clusters
#01 Client/Server Computing
Presentation transcript:

Doctoral Defense Filippo Gioachin September 27, 2010 Debugging Large Scale Applications with Virtualization

27 September 2010Filippo Gioachin2 Committee

27 September 2010Filippo Gioachin3 Outline ● Introduction – Motivations ● Debugging on Large Machines – Unsupervised Execution ● Virtualized Debugging – Separation of Entities ● Processor Extraction ● Provisional Message Delivery ● Conclusions

27 September 2010Filippo Gioachin4 Outline ● Introduction – Motivations ● Debugging on Large Machines – Unsupervised Execution ● Virtualized Debugging – Separation of Entities ● Processor Extraction ● Provisional Message Delivery ● Conclusions

27 September 2010Filippo Gioachin5 Motivations ● Debugging is a fundamental part of software development ● Parallel programs have all the sequential bugs: – Memory corruption – Incorrect results –....

27 September 2010Filippo Gioachin6 Motivations (2) ● Parallel programs have new types of bugs: – Data races / multicore (heavily studied in literature) – Communication mistakes – Synchronization mistakes / Message races ● To complicate things further: – Non-determinism – Problems may show up only at large scale

27 September 2010Filippo Gioachin7 Problems at Large Scale ● Problems may not appear at small scale – Races between messages ● Latencies in the underlying hardware – Incorrect messaging – Data decomposition ● Important to handle large scale applications

27 September 2010Filippo Gioachin8 Challenges to Large Scale Debugging ● Infeasible – Debugger needs to handle many processors – Human can be overwhelmed by information – Machine not available – Long waiting time in queue ● Expensive – Large machine allocations consume a lot of computational resources Techniques used in this thesis ● Tight RTS integration ● Unsupervised execution ● Virtualized debug ● Processor extraction ● Provisional delivery

27 September 2010Filippo Gioachin9 Thesis Goals ● New techniques to help debugging large scale parallel programs – Tight integration with runtime system – Processor virtualization ● Applying these techniques to message driven parallel programs

27 September 2010Filippo Gioachin10 CharmDebug Java GUI (local machine) Firewal l Parallel Application (remote machine) Existing Framework GDB CharmDebu g Applicatio n CCS (Converse Client- Server)

27 September 2010Filippo Gioachin11 Using Fewer Resources: Leveraging Existing Tools ● Record-replay – Record the message ordering to replay deterministically a parallel program – Introduced new techniques for robustness and stability ● BigSim – Simulate large machines before they are available – Performance oriented

27 September 2010Filippo Gioachin12 Outline ● Introduction – Motivations ● Debugging on Large Machines – Unsupervised Execution ● Virtualized Debugging – Separation of Entities ● Processor Extraction ● Provisional Message Delivery ● Conclusions

27 September 2010Filippo Gioachin13 Traditional Debugging ● TotalView, Allinea, Eclipse – The user manages all the processors directly ● STAT (Stack Trace Analysis Tool) using MRNET ● ATP (Abnormal Termination Processing) ● Relative debugging (requires working program) External processes that supervise the application – Information access – Scalability challenge

27 September 2010Filippo Gioachin14 Tight Integration with Runtime System ● We use the same communication infrastructure that the application uses to scale – Scale debugger with the application

27 September 2010Filippo Gioachin15 Scalability (1) Kraken Cray XT5

27 September 2010Filippo Gioachin16 Scalability (2) Kraken Cray XT5

27 September 2010Filippo Gioachin17 Scalability (3) Kraken Cray XT5

27 September 2010Filippo Gioachin18 Autoinspection ● The programmer should not manually handle all the processors – Unsupervised execution – Notification to the user from interesting processors ● User-defined ● System-defined – Discussed in the prelim (Thesis Chapter 3) ● Breakpoints ● Assertion failure ● Abort / signals ● Memory corruption

27 September 2010Filippo Gioachin19 Autoinspection (2) ● Upload a script to perform checking on the correctness of data structures – No need to recompile and restart – Execute when needed ● On demand ● Before and after entry methods – Python language ● Provide complex control flow * F. Gioachin, L.V. Kalé: "Dynamic High-Level Scripting in Parallel Applications". In Proceedings of the 23rd IEEE International Parallel and Distributed Processing Symposium (IPDPS 2009)

27 September 2010Filippo Gioachin20 Python Scripting length = charm.getValue(self, MyArray, len) arr = charm.getValue(self, MyArray, data) for i in range(0, length): value = charm.getArray(arr, double, i) if (value > 10 or value < -10): print "Error: value ", i, " = ", value return i Select on which entry points the script should run Suspend execution if a value is returned Access program's data (circumvent lack of reflection)

27 September 2010Filippo Gioachin21 5-point 2D Jacobi ● Running on four 8-core machines ● Python interface used through CharmDebug ● The checking code does not perform significant work ● Execution time in ms

27 September 2010Filippo Gioachin23 Using Fewer Resources ● Infeasible – Long waiting time in queue – Machine not available ● Expensive – Large machine allocation consume a lot of computational resources

27 September 2010Filippo Gioachin24 Outline ● Introduction – Motivations ● Debugging on Large Machines – Unsupervised Execution ● Virtualized Debugging – Separation of Entities ● Processor Extraction ● Provisional Message Delivery ● Conclusions

27 September 2010Filippo Gioachin25 Related Work ● No real work in HPC ● In other fields use Xen of IBM Hypervisor – OS-level virtualization – Hard for the user to access/install

27 September 2010Filippo Gioachin26 Virtualized Emulation ● Use emulation techniques to provide virtual processors to display to the user – Use BigSim emulation tool ● Cannot assume correctness of program – Debugger needs to communicate with application – Single address space Debugging Large Scale Applications in a Virtualized Environment * F. Gioachin, G. Zheng, L.V. Kalé: "Debugging Large Scale Applications in a Virtualized Environment" to appear in the Proceedings of the 23rd International Workshop on Languages and Compilers for Parallel Computing (LCPC2010)

27 September 2010Filippo Gioachin27 Communication under Emulated Environment Virtual Processor Worker Thread Communication Thread Message Queue Converse Main Thread Virtual Processor Worker Thread Communication Thread Comm. Gateway Real PE 12 VP 513 VP 87

27 September 2010Filippo Gioachin28 Resource Consumption: Jacobi (on NCSA's BluePrint) ● User thinks for one minute about what to do: – 8 processors ● 86 sec. ● ~0.2 SU – 1024 procs ● 60.5 sec. ● ~17 SU

27 September 2010Filippo Gioachin29 Demo

27 September 2010Filippo Gioachin30 Usage: Starting

27 September 2010Filippo Gioachin31 Usage: Debugging

27 September 2010Filippo Gioachin32 Restrictions ● Small memory footprint – Many processors needs to fit into a single physical processor ● Session constraint by human speed – Allocation idle most of the time waiting for user input – Bad for computation intensive applications

27 September 2010Filippo Gioachin33 Separation of Virtual Entities ● Single address space shared by different entities – Virtual processors for emulation – Multiple chares in Charm++ – Multiple user-level threads ● One entity can overwrite memory of another entity – Dangling pointers (memory reallocated) – Pointers passed between entities – Spurious writes (e.g. buffer overflow)

27 September 2010Filippo Gioachin34 Memory Module ● Intercept memory related system calls – The runtime system knows about all allocated memory – Useful for statistics, memory leak detection, and other memory related facilities

27 September 2010Filippo Gioachin35 Memory Tagging Associate a unique ID to each virtual entity, and tag each allocation accordingly ● * F. Gioachin, L.V. Kale: ”Memory Tagging in Charm++”; in Proceedings of the 6th Workshop on Parallel and Distributed Systems: Testing, Analysis, and Debugging (PADTAD '08) ● Best Paper

27 September 2010Filippo Gioachin36 Memory Corruption Detection ● Protect memory such that spurious writes can be detected ● Exploit the scheduler in message driven systems Has corrupti on occured ? Reset memory protection Check memory corruption No Yes Pick message User code: process message Memory Tagging in Charm++ * F. Gioachin, L.V. Kalé: "Memory Tagging in Charm++" in Proceedings of the 6th Workshop on Parallel and Distributed Systems: Testing, Analysis, and Debugging (PADTAD '08)

27 September 2010Filippo Gioachin37 Memory Corruption Detection (2) ● Three techniques developed – CRC: ✔ Low memory overhead ✗ CPU performance bound – Memcopy: ✔ Much faster than CRC ✗ Memory bandwidth bound Has corrupt ion occure d? Compute CRC of every allocated memory block Scan all memory blocks and check if CRC changed No Yes Pick message User code: process message Diff program memory with the saved copy Make a copy of all allocated memory

27 September 2010Filippo Gioachin38 Memory Corruption Detection (3) ● Mmap/mprotect ● Each malloc becomes ”mmap” ✔ Accurate detection of faulty instruction ✔ Can detect no-change writes and spurious reads ✗ Memory footprint increase ✗ Depends on mmap availability Pick message Corrupti on detected Mprotect read-only memory of other virtual entities Signal User code: process message

27 September 2010Filippo Gioachin39 Protection Mechanisms ● Checksum: Cyclic Redundancy Check (CRC) – Compute CRC-32 for all the memory in the system. Recompute upon entry method return ● Memory copy – Copy all the memory in a system in a separate area. Compare upon entry method return ● mprotect – Allocate memory with mmap and mark read-only with mprotect. Receive signal upon corruption

27 September 2010Filippo Gioachin40 Performance Aspect 4,000 allocated blocks 32 MB of allocated memory

27 September 2010Filippo Gioachin41 Related Work ● Memory protection – Studied for concurrent threads (Data Races) ● Intel Thread Checker ● RecPlay – Not applicable with only one execution thread – TotalView ● User can mimic the protection manually

27 September 2010Filippo Gioachin42 Outline ● Introduction – Motivations ● Debugging on Large Machines – Unsupervised Execution ● Virtualized Debugging – Separation of Entities ● Processor Extraction ● Provisional Message Delivery ● Conclusions

27 September 2010Filippo Gioachin43 Do we need all the processors? ● The problem manifests itself on a single processor ● The cause can span multiple processors (causally related) – The subset is generally much smaller than the whole system ● Select the interesting processors and ignore the others

27 September 2010Filippo Gioachin44 Extracting Processors: Challenges ● Record all data processed by each processor – Huge volume of data stored – High interference with application (probe effect) ● The bug may not appear – Existing: RecPlay, TotalView – Work on reduction in space requirements ● Online analysis of necessary data ● Processors grouping ● Record only some processors?

27 September 2010Filippo Gioachin45 Fighting non-determinism ● Record all data processed by each processor – Huge volume of data stored – High interference with application (probe effect) ● The bug may not appear ● Record only message ordering – Must re-execute using the whole machine – Based on piecewise deterministic assumption

27 September 2010Filippo Gioachin46 Related Work ● Full detailed recording – RecPlay – TotalView ReplayEngine ● Reduction in space requirement – Online analysis of necessary data – Processors grouping

27 September 2010Filippo Gioachin47 Three-step Procedure for Processor Extraction Execute program recording message ordering Replay application with detailed recording enabled Replay selected processors as stand-alone Is problem solved? Done Select processor s to record Step1 Step2 Step3 Has bug appear ed? Minimize perturbation (few bytes per message) ● Iterate for incremental extraction ● Use message ordering to guarantee determinism ● Can execute in the virtualized environment Robust Record- Replay with Processor Extraction * F. Gioachin, G. Zheng, L.V. Kalé: "Robust Record- Replay with Processor Extraction" in Proceedings of the Workshop on Parallel and Distributed Systems: Testing, Analysis, and Debugging (PADTAD – VIII), 2010 Use r No

27 September 2010Filippo Gioachin48 Three-step Procedure for Processor Extraction Execute program recording message ordering Replay application with detailed recording enabled Replay selected processors as stand-alone Is problem solved? Done Select processor s to record Step1 Step2 Step3 Has bug appear ed? Minimize perturbation (few bytes per message) ● Iterate for incremental extraction ● Use message ordering to guarantee determinism ● Can execute in the virtualized environment Robust Record- Replay with Processor Extraction * F. Gioachin, G. Zheng, L.V. Kalé: "Robust Record- Replay with Processor Extraction" in Proceedings of the Workshop on Parallel and Distributed Systems: Testing, Analysis, and Debugging (PADTAD – VIII), 2010 Use r No

27 September 2010Filippo Gioachin49 What if the piecewise deterministic assumption is not met? ● Make sure to detect it, and notify the user ● If all messages during replay are identical to those during record, we can assume the application is deterministic ● Methods to detect failure: – Message size and destination – Checksum of the whole message (XOR, CRC32)

27 September 2010Filippo Gioachin50 Message Order Recording Performance (on NCSA's Abe)

27 September 2010Filippo Gioachin51 kNeighbor

27 September 2010Filippo Gioachin52 ChaNGa (dwf on NCSA's BluePrint)

27 September 2010Filippo Gioachin53 Amount of Data Saved ChaNGa dwf1.2048, numbers in MB

27 September 2010Filippo Gioachin54 Debugging Case Study ● Message race during particle exchange – Was fixed with tedious print statements ● Printf often made the bug disappear../charmrun +p16../ChaNGa cube300.param +record +recplay-crc../charmrun +p16../ChaNGa cube300.param +replay +recplay-crc +record-detail 7 gdb../ChaNGa >> run cube300.param +replay-detail 7/16

27 September 2010Filippo Gioachin55 Demo

27 September 2010Filippo Gioachin56 Record-replay with CharmDebug

27 September 2010Filippo Gioachin57 Outline ● Introduction – Motivations ● Debugging on Large Machines – Unsupervised Execution ● Virtualized Debugging – Separation of Entities ● Processor Extraction ● Provisional Message Delivery ● Conclusions

27 September 2010Filippo Gioachin58 Provisional Message Delivery ● Instead of replaying the same message ordering, replay a different one – Can force the bug to appear on a smaller scale – Automatic or manual ● Manual: programmers may have an idea where the problem lies – A specific message may confirm or refute the idea – Need a way to test without restarting the application ● Important for bugs that appear after long time

27 September 2010Filippo Gioachin59 Related Work ● GDB checkpoint ● TotalView ReplayEngine – Can go back in time, but program is immutable ● MAD – Re-execution to validate new event ordering ● Analysis and reduction of searched space – ISP (for MPI), DSP (for DAG-based execution)

27 September 2010Filippo Gioachin60 Testing Execution Paths ● Save state of the running application – Deliver the message – Rollback to try another path (live) Process msg B Save state Process msg A MA MC MD Message received from the network Outgoing message proc. MA MB M C MA MB M C M D FORK CHILD PARENT

27 September 2010Filippo Gioachin61 Operation Normal Mode Provisio nal Mode. Undeliver Some Deliver Provisional Undeliver All Deliver Immediate Deliver Provisional Pe X CCS Request CCS Reply Qu eu e IncomingIncoming MessageMessage Child Pe X Forwar d Reply CCS Request Reply Pe X Child IncomingIncoming Qu eu e Forward Qu eu e Generated MessageMessage Message

27 September 2010Filippo Gioachin62 Demo

27 September 2010Filippo Gioachin63 Provisional Delivery: Example

27 September 2010Filippo Gioachin64 Outline ● Introduction – Motivations ● Debugging on Large Machines – Unsupervised Execution ● Virtualized Debugging – Separation of Entities ● Processor Extraction ● Provisional Message Delivery ● Conclusions

27 September 2010Filippo Gioachin65 Thesis Contributions ● Techniques to handle thousands of processors efficiently – Unsupervised execution with notification upon event ● Reducing the resources required for debugging of large scale applications – Virtualized debugging – Processor extraction – Provisional message delivery ● Techniques to debug message driven parallel applications

27 September 2010Filippo Gioachin66 Future Extensions ● Shared memory compliance ● Race detector – Automated testing of message delivery to discover message races ● Replay in isolation of single virtual entities – Conditions of validity

27 September 2010Filippo Gioachin67 Peer Reviewed Papers ● F. Gioachin, G. Zheng, L.V. Kale: ”Robust Non-Intrusive Record-Replay with Processor Extraction”; in Proceedings of the 8 th Workshop on Parallel and Distributed Systems: Testing, Analysis, and Debugging (PADTAD 2010) ● F. Gioachin, G. Zheng, L.V. Kale: ”Debugging Large Scale Applications in a Virtualized Environment”; to appear in Proceedings of the 23rd International Workshop on Languages and Compilers for Parallel Computing (LCPC2010) ● F. Gioachin, C.W. Lee, L.V. Kale: ”Scalable Interaction with Parallel Applications”; in Proceedings of TeraGrid'09 ● F. Gioachin, L.V. Kale: ”Dynamic High-Level Scripting in Parallel Applications”; in Proceedings of the 23rd IEEE International Parallel and Distributed Processing Symposium (IPDPS 2009) ● F. Gioachin, L.V. Kale: ”Memory Tagging in Charm++”; in Proceedings of the 6 th Workshop on Parallel and Distributed Systems: Testing, Analysis, and Debugging (PADTAD '08) ● C. Mei, G. Zheng, F. Gioachin, L.V. Kale: ”Optimizing a Parallel Runtime System for Multicore Clusters: A Case Study”; in Proceedings of Teragrid'10 ● P. Jetley, L. Wesolowski, F. Gioachin, L.V. Kale, T.R. Quinn: ”Scaling Hierarchical N-Body Simulations on GPU Clusters”; to appear in Proceedings of the ACM/IEEE Supercomputing Conference 2010 (SC10) ● P. Jetley, F. Gioachin, C. Mendes, L.V. Kale, T.R. Quinn: ”Massively Parallel Cosmological Simulations with ChaNGa”; in Proceedings of IEEE International Parallel and Distributed Processing Symposium 2008 ● F. Gioachin, A. Sharma, S. Chakravorty, C. Mendes, L.V. Kale, T.R. Quinn: ”Scalable Cosmological Simulations on Parallel Machines”; in Proceedings of the 7th International Meeting on High Performance Computing for Computational Science (VECPAR 2006). LNCS 4395, pp , 2007 ● F. Gioachin, R. Shankesi, M.J. May, C.A. Gunter, W. Shin: ”Emergency Alerts as RSS Feeds with Interdomain Authorization”; in IARIA International Conference on Internet Monitoring and Protection (ICIMP '07), 2007