BugNet Continuously Recording Program Execution for Deterministic Replay Debugging Satish Narayanasamy Gilles Pokam Brad Calder.

Slides:



Advertisements
Similar presentations
Debugging operating systems with time-traveling virtual machines Sam King George Dunlap Peter Chen CoVirt Project, University of Michigan.
Advertisements

Chapter 4 Memory Management Page Replacement 补充:什么叫页面抖动?
More on File Management
Software & Services Group PinPlay: A Framework for Deterministic Replay and Reproducible Analysis of Parallel Programs Harish Patil, Cristiano Pereira,
Lecture 12 Reduce Miss Penalty and Hit Time
Caching and Virtual Memory. Main Points Cache concept – Hardware vs. software caches When caches work and when they don’t – Spatial/temporal locality.
Virtual Memory Introduction to Operating Systems: Module 9.
Recording Inter-Thread Data Dependencies for Deterministic Replay Tarun GoyalKevin WaughArvind Gopalakrishnan.
Helper Threads via Virtual Multithreading on an experimental Itanium 2 processor platform. Perry H Wang et. Al.
S. Narayanasamy, Z. Wang, J. Tigani, A. Edwards, B. Calder UCSD and Microsoft PLDI 2007.
Continuously Recording Program Execution for Deterministic Replay Debugging.
File Management Systems
Computer ArchitectureFall 2007 © November 21, 2007 Karem A. Sakallah Lecture 23 Virtual Memory (2) CS : Computer Architecture.
Deterministic Logging/Replaying of Applications. Motivation Run-time framework goals –Collect a complete trace of a program’s user-mode execution –Keep.
Informationsteknologi Friday, November 16, 2007Computer Architecture I - Class 121 Today’s class Operating System Machine Level.
Active Messages: a Mechanism for Integrated Communication and Computation von Eicken et. al. Brian Kazian CS258 Spring 2008.
Chapter 11 Operating Systems
A “Flight Data Recorder” for Enabling Full-system Multiprocessor Deterministic Replay Min Xu, Rastislav Bodik, Mark D. Hill
A. Frank - P. Weisberg Operating Systems Introduction to Tasks/Threads.
Projects Using gem5 ParaDIME (2012 – 2015) RoMoL (2013 – 2018)
PRASHANTHI NARAYAN NETTEM.
Caching and Virtual Memory. Main Points Cache concept – Hardware vs. software caches When caches work and when they don’t – Spatial/temporal locality.
A “Flight Data Recorder” for Enabling Full-system Multiprocessor Deterministic Replay Min Xu, Rastislav Bodik, Mark D. Hill
1 Computer System Overview Chapter 1. 2 n An Operating System makes the computing power available to users by controlling the hardware n Let us review.
Chapter 1 Computer System Overview Dave Bremer Otago Polytechnic, N.Z. ©2008, Prentice Hall Operating Systems: Internals and Design Principles, 6/E William.
15-740/ Oct. 17, 2012 Stefan Muller.  Problem: Software is buggy!  More specific problem: Want to make sure software doesn’t have bad property.
- 1 - Dongyoon Lee †, Mahmoud Said*, Satish Narayanasamy †, Zijiang James Yang*, and Cristiano L. Pereira ‡ University of Michigan, Ann Arbor † Western.
SSGRR A Taxonomy of Execution Replay Systems Frank Cornelis Andy Georges Mark Christiaens Michiel Ronsse Tom Ghesquiere Koen De Bosschere Dept. ELIS.
Lecture 3 Process Concepts. What is a Process? A process is the dynamic execution context of an executing program. Several processes may run concurrently,
SafetyNet: improving the availability of shared memory multiprocessors with global checkpoint/recovery Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill,
Hardware process When the computer is powered up, it begins to execute fetch-execute cycle for the program that is stored in memory at the boot strap entry.
Replay Compilation: Improving Debuggability of a Just-in Time Complier Presenter: Jun Tao.
The Mach System Abraham Silberschatz, Peter Baer Galvin, Greg Gagne Presentation By: Agnimitra Roy.
Processes and Process Control 1. Processes and Process Control 2. Definitions of a Process 3. Systems state vs. Process State 4. A 2 State Process Model.
Shared Memory Consistency Models. SMP systems support shared memory abstraction: all processors see the whole memory and can perform memory operations.
Virtual Memory The memory space of a process is normally divided into blocks that are either pages or segments. Virtual memory management takes.
A Regulated Transitive Reduction (RTR) for Longer Memory Race Recording (ASLPOS’06) Min Xu Rastislav BodikMark D. Hill Shimin Chen LBA Reading Group Presentation.
Computer Architecture 2 nd year (computer and Information Sc.)
Silberschatz, Galvin and Gagne ©2009 Operating System Concepts – 8 th Edition, Chapter 3: Process-Concept.
CSC Multiprocessor Programming, Spring, 2012 Chapter 11 – Performance and Scalability Dr. Dale E. Parson, week 12.
Processes and Virtual Memory
Virtual Application Profiler (VAPP) Problem – Increasing hardware complexity – Programmers need to understand interactions between architecture and their.
Hardware process When the computer is powered up, it begins to execute fetch-execute cycle for the program that is stored in memory at the boot strap entry.
3/12/2013Computer Engg, IIT(BHU)1 PARALLEL COMPUTERS- 2.
What is a Process ? A program in execution.
Transactional Memory Coherence and Consistency Lance Hammond, Vicky Wong, Mike Chen, Brian D. Carlstrom, John D. Davis, Ben Hertzberg, Manohar K. Prabhu,
Qin Zhao1, Joon Edward Sim2, WengFai Wong1,2 1SingaporeMIT Alliance 2Department of Computer Science National University of Singapore
Embedded Real-Time Systems
CS161 – Design and Architecture of Computer
CMSC 611: Advanced Computer Architecture
Free Transactions with Rio Vista
Processes and threads.
CS161 – Design and Architecture of Computer
William Stallings Computer Organization and Architecture 8th Edition
Lecture Topics: 11/1 Processes Process Management
Chapter 9 – Real Memory Organization and Management
Lecture 14 Virtual Memory and the Alpha Memory Hierarchy
Chapter 9: Virtual-Memory Management
Page Replacement.
15-740/ Computer Architecture Lecture 5: Precise Exceptions
Chapter 8: Memory management
Outline Module 1 and 2 dealt with processes, scheduling and synchronization Next two modules will deal with memory and storage Processes require data to.
Free Transactions with Rio Vista
Lecture Topics: 11/1 General Operating System Concepts Processes
Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 9/30/2011
CSE 471 Autumn 1998 Virtual memory
Review What are the advantages/disadvantages of pages versus segments?
COT 5611 Operating Systems Design Principles Spring 2014
Virtual Memory.
CSC Multiprocessor Programming, Spring, 2011
Presentation transcript:

BugNet Continuously Recording Program Execution for Deterministic Replay Debugging Satish Narayanasamy Gilles Pokam Brad Calder

Motivation Current Scenario  Increasing Software Complexity  Difficult to guarantee correctness  Released software contain bugs Problem  Bugs manifest at customer site  Difficult to reproduce bugs at developer site Solution  Continuously record information about program execution, even during production runs Challenge  Recording should be transparent to customer –> HW can help!

Conventional Debugging Examine core dump  Developer can examine final system state just before the crash  Very challenging to determine the root cause Customer site (or even during testing) Debugging at developer’s site Core dump Program Execution Bug source Crash

Deterministic Replay Debugging Developer Site Crash Replay Replay Window Program Execution Bug occurs Crash Customer Site Continuous Recording What is Deterministic Replay? Executing same sequence of instructions with same input operands like in original execution

Deterministic Replay Debugging Developer Site Crash Replay Replay Window Program Execution Bug occurs Crash Customer Site Continuous Recording Deterministic Replay Debugging Debugger can examine variable values Helps figuring out root cause of bug Reproduce even non-deterministic Bugs

BugNet Goal  Architecture support to enable Deterministic Replay Debugging Focus  Debugging user code Application and shared libraries No logging during execution of system code (interrupt service routines, system calls) Approach  Log initial architectural state (registers, PC, etc) and then load values Sufficient to replay user code, even across interrupts etc..

Overview Program Execution Checkpoint Interval ~10 million instr Log Header Program Counter Arch Register Values Process ID, Thread ID Checkpoint ID.. Load Values Only output of loads need to be logged Input and output values of other instructions can be regenerated during replay

First Load Log  Log load value only if the load is the first memory access to a location  HW Support: “FLL bits” for every word in L1 and L2 caches  Reset at the beginning of a checkpoint interval  Set on access Program Execution First Load Log (FLL) Load A Load BLoad A Logged Not Logged

Load B First Load Log Store values never need to be logged Regenerated during replay Load A Store B Program Execution First Load Log (FLL) PROBLEMS Memory location can be modified by stores in Interrupts, system calls Other threads in multithreaded programs DMA transfers

Interrupts Interrupt, System Call, Context Switch Prematurely Terminate checkpoint (FLL bits are reset) New checkpoint started After servicing interrupt (Start logging First loads) Interrupts, system calls, I/O, DMA NOT tracked BUT any values consumed later by the application will be logged, ON DEMAND, in the new checkpoint

Support for Multi-threaded Programs

Assumptions for Multithreaded Programs Shared Memory Multi-threaded processors Sequential Consistency  Memory operations form a total order Directory based Cache Coherence protocol

Shared Memory Communication Thread 1 Processor 1 A First Load Log (FLL) for each thread is collected locally Problem : Shared memory communication between threads Affects First Load optimization Thread 2 Processor 2

Shared Memory Communication Thread 1 Thread 2 Store A Time Load A Invalidate Message Resets FLL (First-Load Log) bits for the word A in Thread 1 Logged Not Logged DMA are handled similarly as they use same coherence protocol

Independently Replaying Threads Thread 1 A thread can be replayed using its local FLL, independent of other threads FLL checkpoints in different threads need not begin at the same time  Prematurely terminating checkpoints for interrupts becomes easier Thread 2 Processor 1 Processor 2

Logging Memory Order Infer and debug data races  Log order of memory operations executed across all the threads Adapt Flight Data Recorder (FDR) Xu, Bodik, Hill ISCA’03 Piggyback coherence replies with execution states (Thread-ID, Checkpoint-ID, Inst Count) of sender thread

Memory Race Log Thread X Thread Y Resets first-load bit for A ICx Store A Invalidate Invalidate Ack (Y, CP_ID1, ICy) For Thread X Log (ICx, Y, CP_ID1, ICy) Will be used to determine order of Store A wrt memory operations in other threads CP_ID 1 CP_ID 2 CP_ID 3 CP_ID 1 CP_ID 2 Executing STORE (ICx)

Memory Race Log Thread X Thread Y ICy Load A Write update request (X, CPId3, ICx) For Thread Y Log (ICy, X, CPid3, ICx) cid 3 cid 4 cid 5 cid 3 cid 4 Write update reply Executing LOAD (ICy)

Main Memory Architecture Support Summary Memory Race Log Buffer Checkpoint Log Buffer Dictionary L2 L1 Control Cache coherence Controller PCRegisters Goal: Deterministically Replay Crash Checkpoint Mechanism  First Load Opt  Online Dictionary Based Compression Memory Backed Support for Multithreading Pipeline 16 KB FIFO 32 KB FIFO

Memory Back Support Handling bursts  CB -16 KB; MRB – 32 KB  During bursts, CB & MRB buffers can get full Processor stalled OR Flush the buffer and start a new checkpoint CB and MRB are memory backed  Contents continuously written back to main memory at two separate locations  Amount of main memory space allocated determines replay window length

Checkpoint Management Oldest checkpoint discarded when allocated main memory space is full Checkpoint Interval length chosen based on available main memory space  Tradeoff Smaller the checkpoint interval lesser the information loss when a checkpoint is discarded Larger the checkpoint interval lesser the information/instruction that need to be logged  Reason: First-Load optimization

Re-player Infrastructure Collecting FLL  Pin Dynamic Instrumentation Luk et al., PLDI ’05 Replaying program execution using FLL  Virtutech Simics A full system functional simulator

How to replay a checkpoint? Replay using a functional simulator – eg: Simics  Can be integrated into conventional debuggers Steps:  Load the binaries into the same address locations like in the original location  Initialize state of PC and architectural registers  Start emulating instructions  For first loads, get the value from FLL, else get value from simulated memory Core Dump Not Required

Re-player Implementation Issues Code Space  Address locations of application code and shared libraries in application’s virtual address space need to be same as in the original execution Solution: Include starting locations of user and library code space in the log  Developer should have access to binaries and libraries used by the customer Self-Modifying Code  Cannot be handled by BugNet Reason: Instructions are not logged Possible Solution  Log first load (fetch) of instructions

Replay Window Length Program Execution Execution of Latest instance of “buggy instruction” Crash Lower Bound on Replay Window Length Number of dynamic instructions between the latest execution of the buggy instruction and the crash

Bug Characteristics ProgramNature of BugReplay Window length (in instructions) gzipOverflows global variable32,209 ncompressStack Corruption17,966 tarHeap object Overflow6,634 ghostscriptDangling pointer18,030,519 tidyNull pointer dereference2,537,326 xv-3.10aBuffer overflow7,543,600 gaim Null pointer dereference74,590 napster-1.52Dangling pointer189,391 pythonBuffer Overflow92 w3mNull pointer dereference79,309 Average1,594,252 Lower bound on required replay window length AccMon Zhou et.al. MICRO’04 Sourceforge Single Threaded Sourceforge Multi-Threaded

FLL Trace Size Less than 1MB (<20M interval) is required to capture majority of bugs

BugNet Vs FDR ( Xu, Bodik & Hill ISCA’03) Flight Data Recorder (FDR) – Replay full system for debugging  Uses SafetyNet Checkpoint MechanismSorin et.al. ISCA’02 Logs values replaced by first stores Recover initial full system state from core dump and store log  To enable replay, Interrupt, Prg I/O, DMA are logged separately  Requires more HW and larger logs than BugNet BugNet -- Focus on debugging only application code  First load checkpoint mechanism  Core dump, Interrupt, I/O, DMA logs NOT required Performance overhead of both is negligible  Logging is off the critical path of main computation

Limitation Debugging ability  Debugging OS code not possible BUT, memory values modified during interrupts, I/O and DMA will be captured in FLL  Hence, the application with limited interactions with OS can be debugged  No Core Dump Values of data structures untouched during replay window are unknown BUT, values responsible for bug can be found in the log or reproduced during replay if the replay window is large enough to capture the source of bug If a variable is not accessed between the source of bug and the crash then it should not be a reason for the crash

Limitation Replay window not long enough  Problem: Cause of bug lie outside replay window  Reason: Limited storage space -- Depends on amount of main memory to devote to capture logs  Solution: OS can fine tune allocation  User Input  Memory usage at any instant of time

Summary Bugs in released software are difficult to reproduce  Goal is to continuously record a light weight trace at the customer’s site to capture hard to reproduce bugs Deterministic Replay Debugging  On average at least 1.5 million instructions need to be replayed to capture bugs that we studied Recording architectural state and load values are sufficient to enable replay  Small FLL log size No core dump No I/O, DMA, Interrupt logs Limitation  Debug only user code and shared libraries Though it supports replaying across interrupts Replay WindowFLL Size 20 Million instr< 1 MB 100 Million instr< 3 MB