
Execution Replay and Debugging

Contents

Introduction

Parallel program: a set of co-operating processes. Co-operation uses:
– shared variables
– message passing

Developing parallel programs is considered difficult:
– normal errors, as in sequential programs
– synchronisation errors (deadlocks, races)
– performance errors

⇒ We need good development tools

Debugging of parallel programs

The most used technique is cyclic debugging, which requires repeatable, equivalent executions. This is a problem for parallel programs: lots of non-determinism is present. The solution is an execution replay mechanism:
– record phase: trace information about the non-deterministic choices
– replay phase: force an equivalent re-execution using the trace, allowing the use of intrusive debugging techniques
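
The record/replay idea can be sketched in a few lines of Python (an illustrative sketch only; the ReplayLog class and all its names are invented for this example and are not part of any tool discussed in these slides):

```python
import random

class ReplayLog:
    """Record phase: log every non-deterministic value.
    Replay phase: feed the logged values back, in order."""
    def __init__(self, trace=None):
        self.replaying = trace is not None
        self.trace = list(trace) if trace is not None else []
        self.pos = 0

    def choose(self, make_value):
        if self.replaying:                 # replay: reuse the recorded choice
            v = self.trace[self.pos]
            self.pos += 1
            return v
        v = make_value()                   # record: make and log the choice
        self.trace.append(v)
        return v

def run(log):
    # a "program" whose output depends on non-deterministic choices
    return [log.choose(lambda: random.randint(0, 9)) for _ in range(5)]

rec = ReplayLog()
first = run(rec)                    # record phase
second = run(ReplayLog(rec.trace))  # replay phase: equivalent execution
assert first == second
```

On replay the program can be run under a debugger as often as needed: every re-execution reproduces the recorded choices.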

Non-determinism

Classes:
– external vs. internal non-determinism
– desired vs. undesired non-determinism

Important: the amount of non-determinism depends on the abstraction level. E.g. a semaphore P()-operation can be fully deterministic while consisting of a number of non-deterministic spinlocking operations.

Causes of Non-determinism

In sequential programs:
– program code (self-modifying code?)
– program input (disk, keyboard, network, ...)
– certain system calls ( gettimeofday() )
– interrupts, signals, ...

In parallel programs:
– accesses to shared variables: race conditions (synchronisation races and data races)

In distributed programs:
– promiscuous receive operations
– test operations for non-blocking message operations

Main Issues in Execution Replay

recorded execution = original execution:
– trace as little as possible, in order to limit the overhead in time and in space

replayed execution = recorded execution:
– faithful re-execution: trace enough

Execution Replay Methods

Two types: content-based vs. ordering-based.
– content-based: force each process to read the same value or to receive the same message as during the original execution
– ordering-based: force each process to access the variables or to receive the messages in the same logical order as during the original execution
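
An ordering-based scheme can be sketched as follows (a minimal illustration; the OrderReplayer class and its names are invented for this example): during replay, every access to a shared object must wait for its turn in the recorded order, whatever the scheduler does.

```python
import threading

class OrderReplayer:
    """Force shared-object accesses to follow a previously
    recorded total order of thread ids."""
    def __init__(self, order):
        self.order = order          # recorded order from the record phase
        self.pos = 0
        self.cv = threading.Condition()

    def access(self, tid, op):
        with self.cv:
            while self.order[self.pos] != tid:
                self.cv.wait()      # not this thread's turn yet
            result = op()
            self.pos += 1
            self.cv.notify_all()
            return result

log = []
rep = OrderReplayer([1, 2, 1, 2])   # the "trace" of the original execution

def worker(tid):
    for _ in range(2):
        rep.access(tid, lambda: log.append(tid))

threads = [threading.Thread(target=worker, args=(t,)) for t in (1, 2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert log == [1, 2, 1, 2]          # the recorded order is reproduced
```

Note that only the order is stored, not the values: for deterministic operations the same order yields the same values, which is why ordering-based traces can be much smaller than content-based ones.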

Logical Clocks for Ordering-based Methods

A clock C() attaches a timestamp C(x) to an event x. It is used for tracing the logical order of events.

Clock condition: x → y ⇒ C(x) < C(y)

Clocks are strongly consistent if, in addition, C(x) < C(y) ⇒ x → y.

The new timestamp is the increment of the maximum of the old timestamps of the process and the object.

Scalar Clocks

Also known as Lamport clocks. Simple and fast update algorithm, and scales very well with the number of processes. Provides only limited information: the clock condition holds, but C(x) < C(y) does not allow one to conclude x → y.
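
The scalar update rule can be sketched as follows (an illustrative sketch; the ScalarClock class is invented for this example):

```python
class ScalarClock:
    """A Lamport clock per process; each shared object
    also carries a scalar timestamp."""
    def __init__(self):
        self.t = 0

    def sync(self, obj_clock):
        # new timestamp = increment of the maximum of the two old timestamps
        self.t = max(self.t, obj_clock) + 1
        return self.t

p1, p2 = ScalarClock(), ScalarClock()
m = 0               # timestamp of a shared object, e.g. a mutex
m = p1.sync(m)      # p1 uses the object first: timestamp 1
m = p2.sync(m)      # p2 uses it afterwards: timestamp 2
assert p1.t < p2.t  # clock condition: x -> y implies C(x) < C(y)
# the converse does not hold: concurrent events may also get ordered timestamps
```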

Vector Clocks

A vector clock for a program using N processes consists of N scalar values. Such a clock is strongly consistent: by comparing vector timestamps one can deduce concurrency information: x → y ⇔ C(x) < C(y), and two events are concurrent iff their timestamps are incomparable.
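
The comparison of vector timestamps can be sketched directly (function names are chosen for this example):

```python
def vc_leq(a, b):
    """Component-wise comparison of two vector timestamps."""
    return all(x <= y for x, y in zip(a, b))

def happened_before(a, b):
    # strong consistency: timestamp order mirrors causal order
    return vc_leq(a, b) and a != b

def concurrent(a, b):
    # incomparable timestamps: neither event precedes the other
    return not happened_before(a, b) and not happened_before(b, a)

assert happened_before((1, 0), (2, 1))
assert concurrent((2, 0), (0, 1))   # neither dominates: a race
```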

An Example Program

A parallel program with two threads, communicating using the shared variables A, B, MA and MB. The local variables are x and y. M is used as a mutex, built on an atomic swap operation provided by the CPU.

An Example Program (II)

The lock operation on a mutex M is implemented (in a library) as: L(M) { while (swap(M,1) == 1) /* spin */ ; }

The unlock operation on a mutex M is implemented as: U(M) { M = 0; }

All variables are initially 0.

An Example Program (III)

The example program:

Thread 1: L(MA); A=8; U(MA); L(MB); B=7; U(MB);
Thread 2: B=6; L(MB); x=B; U(MB); L(MA); y=A; U(MA);
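
Under the assumptions of these slides (a swap-based mutex, all variables initially 0), the example program can be sketched in Python; a plain Lock merely stands in for the CPU's atomic swap, and the SwapLock class is invented for this sketch:

```python
import threading

class SwapLock:
    """Mutex in the style of the slides: spin on an atomic swap."""
    def __init__(self):
        self.m = 0
        self._atomic = threading.Lock()   # stands in for hardware atomicity

    def _swap(self, new):
        with self._atomic:
            old, self.m = self.m, new
            return old

    def L(self):
        while self._swap(1) == 1:         # spin until swap(M,1) returns 0
            pass

    def U(self):
        self.m = 0

shared = {'A': 0, 'B': 0, 'x': 0, 'y': 0}
MA, MB = SwapLock(), SwapLock()

def thread1():
    MA.L(); shared['A'] = 8; MA.U()
    MB.L(); shared['B'] = 7; MB.U()

def thread2():
    shared['B'] = 6                       # unsynchronised write: a data race
    MB.L(); shared['x'] = shared['B']; MB.U()
    MA.L(); shared['y'] = shared['A']; MA.U()

t1 = threading.Thread(target=thread1)
t2 = threading.Thread(target=thread2)
t1.start(); t2.start(); t1.join(); t2.join()
# the outcome is non-deterministic: x may be 6 or 7, y may be 0 or 8
assert shared['x'] in (6, 7) and shared['y'] in (0, 8)
```

Running this repeatedly can yield different final values for x and y, which is exactly the non-determinism that execution replay has to capture.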

A Possible Execution: Low Level View

Thread 1: swap(MA,1)→0; A=8; MA=0; swap(MB,1)→0; B=7; MB=0
Thread 2: B=6; swap(MB,1)→1; swap(MB,1)→0; x=B; MB=0; swap(MA,1)→0; y=A; MA=0

A Possible Execution: High Level View

Thread 1: L(MA); A=8; U(MA); L(MB); B=7; U(MB)
Thread 2: B=6; L(MB); x=B; U(MB); L(MA); y=A; U(MA)

Recap

A content-based replay method: the value read by each load operation is stored. Trace generation of 1 MB/s was measured on a VAX 11/780. Drawback: the time needed to record the large amount of trace information modifies the initial execution. One advantage: it is possible to replay a subset of the processes in isolation.
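
The Recap idea can be sketched as a logged memory (a hypothetical sketch; the RecapMemory class and its names are invented for this example): every load is recorded, and on replay the log is fed back, so a single thread can be re-executed without the others.

```python
class RecapMemory:
    """Content-based replay: log the value returned by every load;
    on replay, return the logged values instead of reading memory."""
    def __init__(self, replay_log=None):
        self.mem = {}
        self.replay_log = list(replay_log) if replay_log is not None else None
        self.log = []

    def store(self, var, val):
        self.mem[var] = val

    def load(self, var):
        if self.replay_log is not None:
            return self.replay_log.pop(0)   # replay: recorded value
        val = self.mem.get(var, 0)          # all variables initially 0
        self.log.append(val)                # record: log every load
        return val

mem = RecapMemory()
mem.store('B', 6)
seen = [mem.load('B'), mem.load('A')]       # record phase: [6, 0]
replayed = RecapMemory(mem.log)
assert [replayed.load('B'), replayed.load('A')] == seen
```

Note that the replayed run never consults `mem.mem` at all, which illustrates both the advantage (isolated replay) and the cost (every load produces a trace entry).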

Recap: Example

The low-level execution from the previous slides; Recap records the value returned by every load operation in it.

Instant Replay

The first ordering-based replay method, developed for CREW (concurrent read, exclusive write) algorithms. Each shared object receives a version number that is updated or logged at each CREW operation:
– read: the version number is logged
– write: the version number is incremented, and the number of preceding read operations is logged
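
The per-object bookkeeping can be sketched as follows (a minimal sketch; the CrewObject class is invented for this example):

```python
class CrewObject:
    """Instant Replay bookkeeping for one shared object under CREW."""
    def __init__(self):
        self.version = 0
        self.reads = 0      # reads of the current version
        self.log = []

    def read(self):
        self.log.append(('read', self.version))   # log the version read
        self.reads += 1

    def write(self):
        # log the number of preceding reads, then start a new version
        self.log.append(('write', self.version, self.reads))
        self.version += 1
        self.reads = 0

b = CrewObject()
b.write()   # B=7: new version, no preceding reads
b.read()    # x=B: logs the version it read
assert b.log == [('write', 0, 0), ('read', 1)]
```

On replay, a read waits until the object reaches the logged version, and a write waits until the logged number of reads of the current version has completed.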

Instant Replay: Example

In the high-level execution, each write to B increments its version and logs the number of preceding reads (version 1, 0 reads), and each read of B logs the current version (version 1). PROBLEM: the unsynchronised write B=6 bypasses the CREW protocol and is not captured by the log.

Netzer

A widely cited method. It attaches a vector clock to each process; the clocks attach a timestamp to each memory operation. It uses the vector clocks to detect concurrent (racing) memory operations, and automatically traces the transitive reduction of the dependencies.

Netzer: Basic Idea

Given the two conflicting writes B=6 and B=7 (with the intervening swap(MB,1)→0): is their order guaranteed by dependencies already traced, or must it be traced explicitly?

Netzer: Transitive Reduction

The dependency from B=7 to x=B need not be traced: it is already implied by the traced dependency from MB=0 to swap(MB,1)→0.
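
The reduction test can be sketched with vector timestamps (the timestamps below are taken from the example slides; the function name is invented for this sketch): a dependency needs no trace entry if the destination's vector clock already dominates the source's timestamp.

```python
def already_ordered(src_ts, dst_clock):
    """True if the order src -> dst is transitively implied by
    earlier traced dependencies, i.e. dst's vector clock already
    covers src's timestamp component-wise."""
    return all(s <= d for s, d in zip(src_ts, dst_clock))

# thread 2 already learned of thread 1's events up to (4,0) via MB:
assert already_ordered((4, 0), (4, 3))      # B=7 -> x=B: implied, not traced
assert not already_ordered((5, 1), (4, 4))  # not yet implied: must be traced
```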

Netzer: Example

The low-level execution from the previous slides, used as the running example.

Netzer: Example

The same execution, now with a vector timestamp attached to every operation: (1,0), (2,0), (3,0), (4,0), (5,1), (6,4), (0,1), (4,2), (4,3), (4,4), (6,5), (6,6), (6,7), (6,8), (6,9), (6,10).


Netzer: Problems

The size of the vector clocks grows with the number of processes:
– the method doesn’t scale well
– what about programs that create threads dynamically?

A vector timestamp has to be attached to all shared memory locations: a huge space overhead. The method basically detects all data and synchronisation races and replays them.

ROLT

Attaches a Lamport clock to each process; the clocks attach a timestamp to each memory operation. ROLT does not detect racing operations, but merely re-executes them in the same order. It also automatically traces the transitive reduction of the dependencies.
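
The idea can be sketched with a scalar clock per process (a hypothetical sketch; the class names are invented for this example): a trace entry is needed only when synchronising on an object makes the clock jump, i.e. when the order is not already implied by program order.

```python
class RoltProcess:
    """One Lamport clock per process; log an (old, new) pair only
    when the clock jumps by more than one."""
    def __init__(self):
        self.t = 0
        self.trace = []

    def sync(self, obj):
        new = max(self.t, obj.t) + 1
        if new != self.t + 1:              # clock jump: order imposed elsewhere
            self.trace.append((self.t, new))
        self.t = obj.t = new

class Obj:
    def __init__(self):
        self.t = 0

p1, p2, mb = RoltProcess(), RoltProcess(), Obj()
p1.sync(mb)   # p1: timestamp 1, program order, nothing traced
p1.sync(mb)   # p1: timestamp 2, nothing traced
p2.sync(mb)   # p2 jumps from 0 to 3: traced as (0, 3)
assert p1.trace == [] and p2.trace == [(0, 3)]
```

On replay, each process increments its clock in program order and, at each traced pair, waits until the object has reached the required timestamp before proceeding.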

ROLT: Example

The low-level execution from the previous slides, now to be annotated with Lamport timestamps.

ROLT: Example

The same low-level execution with Lamport timestamps attached. Traced: (1,5), (5,8), (7,9)


ROLT using three phases

Problem: high overhead due to the tracing of all memory operations.
Solution: only record/replay the synchronisation operations (a subset of all race conditions).

Problem: no correct replay is possible if the execution contains a data race.
Solution: add a third phase for detecting the data races.

ROLT using three phases

Phase 1: record the order of the synchronisation races.
Phase 2: replay the synchronisation races while using intrusive data race detection techniques.
Phase 3: replay the synchronisation races and use cyclic debugging techniques to find the ‘normal’ errors.

ROLT: Example

The high-level execution: A=8, L(MA), U(MA), L(MB), B=7, U(MB) on one thread; B=6, L(MB), x=B, U(MB), L(MA), y=A, U(MA) on the other. Recording only the synchronisation operations, a single pair is traced: (0,5).

ROLT

ROLT replays synchronisation races and detects data races. The method scales well, has a small space and time overhead, and produces small trace files. A total order is imposed ⇒ artificial dependencies.

Conclusions