Samsara: Efficient Deterministic Replay with Hardware Virtualization Extensions. Shiru Ren, Chunqi Li, Le Tan, and Zhen Xiao, Peking University. July 27, 2015


Samsara: Efficient Deterministic Replay with Hardware Virtualization Extensions
Shiru Ren, Chunqi Li, Le Tan, and Zhen Xiao
Peking University
July 27, 2015

Table of Contents
1. Introduction
2. Background & Motivation
3. Samsara Overview
4. R&R the Memory Interleaving with HAV
5. Evaluation
6. Conclusion

Introduction: Deterministic Replay
- Gives computer users the ability to travel backward in time, recreating past states and events in the computer.
- Checkpoint + record all non-deterministic events (NDEs), e.g. user/network input, interrupts.
[Figure: Recording phase: from the initial state, the instruction stream runs to the final state while the NDEs are written to an NDE log. Replay phase: starting from the checkpoint, the same instruction stream is executed and the NDEs are injected at the logged points, reaching Final State'.]

Application Scenarios
- For debugging: cyclic debugging
- For security: forensics, intrusion detection, malware analysis
- For fault tolerance: hot standby, data recovery
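The record/replay idea on this slide can be illustrated with a toy model. The sketch below is not Samsara's code: the "machine" is a hypothetical integer state, `run`, `record`, and `replay` are made-up names, and the NDEs are random bytes standing in for inputs and interrupts. It only shows that logging each NDE together with the point at which it occurred is enough to reproduce the final state.

```python
import random

def run(steps, nde_source):
    """Toy deterministic machine: the state depends only on the step counter
    and on the non-deterministic events (NDEs) injected along the way."""
    state = 0
    for pc in range(steps):
        event = nde_source(pc)          # None when no NDE occurs at this point
        if event is not None:
            state = state * 31 + event
        state = (state + pc) % 1_000_003
    return state

def record(steps, seed):
    """Recording phase: run once and log every NDE with the point it hit."""
    rng = random.Random(seed)           # stands in for the outside world
    log = {}
    def source(pc):
        if rng.random() < 0.1:
            value = rng.randrange(256)
            log[pc] = value             # log the content and the injection point
            return value
        return None
    return run(steps, source), log

def replay(steps, log):
    """Replay phase: same instruction stream, NDEs injected from the log."""
    return run(steps, log.get)

final_state, nde_log = record(1000, seed=42)
print(replay(1000, nde_log) == final_state)
```

The replay never consults the random source; everything non-deterministic comes from the log, which is the property the slide's diagram describes.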

Introduction: Deterministic Replay for Multi-processor
- Deterministic replay for a single processor is relatively mature and well developed.
- The challenge on multi-processor systems: memory interleaving.

Background & Motivation: Software-only schemes
- Modify the OS, compiler, runtime libraries, or VMM
- Virtualization-based approaches: the CREW protocol
- CREW: Concurrent-Read & Exclusive-Write

Issues
- Each memory access operation must be checked for logging
- Serious performance degradation (more than 10X)
- Huge log size (approximately 1 MB/processor/second)

Background & Motivation: Hardware-based schemes
- Use special hardware support for recording memory-access interleaving
- Redesign the cache coherence protocol

Issues
- Huge space overhead, which limits the duration of the recorded interval
- Modeled only using software simulations
- Only support Sequential Consistency (SC)
- Impractical for use in realistic systems

We believe software-only approaches will remain the focus for optimization, as commercial processors with dedicated hardware-based R&R features are not commonly available yet.

Background & Motivation
Combine the merits of both solutions
- An efficient way to record memory interleaving
- Without hardware changes
- Utilize current hardware acceleration as much as possible

Hardware-assisted virtualization
- Some hardware characteristics are available to boost performance
- Efficient full virtualization using help from hardware capabilities

Evaluation results show that our system:
- Incurs less than 3X overhead when compared to native execution
- Reduces log size by 90% (even smaller than hardware-based schemes)

Samsara Overview
System composition
- Controller
- DMA recorder
- R&R component
- Log recorder daemon

Record and Replay Non-deterministic Events
- Synchronous events: record the contents
- Asynchronous events: record the timestamp
- Compound events (DMA): record both
- Memory interleaving: the most important challenge
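The three event classes above differ only in what has to be logged. As a rough sketch (the `LogEntry` type, the `make_entry` helper, the short kind names, and the integer timestamp are all assumptions made for illustration, not Samsara's log format), a single record type can cover them:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class LogEntry:
    kind: str                        # "sync", "async", or "compound"
    timestamp: Optional[int] = None  # when to inject (asynchronous part)
    payload: Optional[bytes] = None  # what to inject (synchronous part)

def make_entry(kind, timestamp=None, payload=None):
    """Build a log entry, keeping only the fields that the event class needs."""
    if kind not in ("sync", "async", "compound"):
        raise ValueError(f"unknown event kind: {kind}")
    needs_time = kind in ("async", "compound")
    needs_data = kind in ("sync", "compound")
    if needs_time and timestamp is None:
        raise ValueError(f"{kind} events need a timestamp")
    if needs_data and payload is None:
        raise ValueError(f"{kind} events need their contents")
    return LogEntry(kind,
                    timestamp if needs_time else None,
                    payload if needs_data else None)

# A compound event such as DMA carries both the timing and the data.
print(make_entry("compound", timestamp=1234, payload=b"\x00\x01"))
```

Memory interleaving does not fit this per-event scheme, which is why the following slides treat it separately.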

R&R the Memory Interleaving with HAV Extensions
Motivation
- Point-to-point logging approach
  - Records dependences between pairs of instructions (log size, record overhead)
- Avoid the large number of memory access detections
  - Chunk-based schemes (only the total sequence of chunks is recorded)

Chunk-based Strategy
- Restrict virtual processors' execution into a series of chunks
- Merely need to record the commit order
- Chunk execution must satisfy:
  - Atomicity
  - Serializability

R&R the Memory Interleaving with HAV Extensions
- Serializability: COW, conflict detection strategy
- Atomicity: some instructions are hard to undo
[Figure: chunk execution timelines for P0 and P1. P0: chunk start, LD (A), ST (A), ST (B) with COW, chunk complete (truncation reason: I/O instruction), commit. P1: LD (A), LD (B), ST (B), LD (D), ST (D), with squash & rollback and re-execution after conflict detection (truncation reason: chunk size limit). R/W-sets shown: R-set { A }, W-set { A, B }; R-set { A, B }, W-set { B }; R-set { D }, W-set { D }.]
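A minimal sketch of how copy-on-write and conflict detection fit together is shown below. It is an illustration under assumptions, not the authors' implementation: the `Chunk` class is hypothetical, pages are modeled as dictionary keys, and the conflict rule used (a committing chunk's W-set overlapping another chunk's R-set or W-set forces a squash) is a common rule for chunk-based schemes that the slide does not spell out. The page names come from the figure; which chunk each set belongs to is my reading.

```python
class Chunk:
    """One chunk of a virtual processor's execution over shared pages."""

    def __init__(self, memory):
        self.memory = memory      # shared state: page name -> value
        self.private = {}         # copy-on-write copies, invisible to others
        self.r_set = set()
        self.w_set = set()

    def load(self, page):
        self.r_set.add(page)
        # Read our own uncommitted copy first, then the shared page.
        return self.private.get(page, self.memory.get(page, 0))

    def store(self, page, value):
        self.w_set.add(page)
        self.private[page] = value    # the shared page stays untouched (COW)

    def conflicts_with(self, committed_w_set):
        # Assumed rule: we saw or overwrote a page that the committer changed.
        return bool(committed_w_set & (self.r_set | self.w_set))

    def commit(self):
        self.memory.update(self.private)   # write back the private copies
        return frozenset(self.w_set)

    def squash(self):
        # Rollback is cheap: drop the private copies and start over.
        self.private.clear()
        self.r_set.clear()
        self.w_set.clear()

memory = {"A": 1, "B": 2, "D": 3}
p0, p1 = Chunk(memory), Chunk(memory)
p0.load("A"); p0.store("A", 10); p0.store("B", 20)   # R-set {A}, W-set {A, B}
p1.load("A"); p1.load("B"); p1.store("B", 99)        # R-set {A, B}, W-set {B}
committed = p0.commit()
if p1.conflicts_with(committed):
    p1.squash()                                      # squash & rollback
print(memory)
```

Because P1's stores only ever touched its private copies, squashing it leaves shared memory exactly as P0 committed it, which is what makes chunk execution appear atomic.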

R&R the Memory Interleaving with HAV Extensions
Obtain R&W-set Efficiently via HAV Extensions
- VM-based approaches: VM exit (hardware page protection)
- Our approach: a single EPT traversal
  - Accessed and Dirty flags of EPT
  - Optimization: tree-based design of EPT
- Only the first write to each memory page triggers an EPT violation VM exit
- Reduces extra VM exits by at least 50%
[Figure: the access sequence W(b) R(b) R(a) R(c) W(b) R(b) W(e), shown twice, with the label "a single EPT traversal".]
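The tree-based optimization can be pictured with a simulated page-table tree. The sketch below does not touch real EPT structures: `Entry`, `touch`, and `collect` are hypothetical names, and `touch` plays the role of the hardware setting the accessed and dirty flags during a page walk. The point it illustrates is that a subtree whose upper-level entry was never marked accessed can be skipped entirely. How the accessed and dirty pages are turned into the R-set and W-set is left open here.

```python
from dataclasses import dataclass, field

@dataclass
class Entry:
    accessed: bool = False
    dirty: bool = False                            # only meaningful on leaves
    children: dict = field(default_factory=dict)   # index -> Entry; empty on leaves

def touch(root, path, write=False):
    """Simulate the hardware page walk: set the flags along the path."""
    node = root
    node.accessed = True
    for index in path:
        node = node.children.setdefault(index, Entry())
        node.accessed = True
    if write:
        node.dirty = True

def collect(node, path=()):
    """One traversal at the end of a chunk.

    Returns (accessed_pages, dirty_pages, entries_visited) and clears the
    flags so that the next chunk starts from a clean table."""
    if not node.accessed:
        return set(), set(), 1          # prune: nothing below was touched
    node.accessed = False
    if not node.children:               # leaf entry: one guest page
        dirty = {path} if node.dirty else set()
        node.dirty = False
        return {path}, dirty, 1
    accessed, dirty, visited = set(), set(), 1
    for index, child in node.children.items():
        a, d, v = collect(child, path + (index,))
        accessed |= a
        dirty |= d
        visited += v
    return accessed, dirty, visited

root = Entry()
touch(root, (0, 1, 2))
touch(root, (0, 1, 3), write=True)
touch(root, (5, 0, 0))
print(collect(root))
```

On a second chunk that touches only one page, the traversal visits fewer entries than the tree contains, because the untouched subtree is pruned at its top-level entry.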

R&R the Memory Interleaving with HAV Extensions
Observations
- Chunk commit is a time-consuming process
- The time spent waiting for the commit lock is excessive (40%)
- The write-back of updates causes serious performance degradation
[Figure: commit timeline for P0 with the stages Chunk Complete, Obtain R&W-set, Wait for Lock, Detect Conflict, Broadcast Updates, Write-back Updates, and Subsequent Chunk; a Lock region marks the synchronized part.]

R&R the Memory Interleaving with HAV Extensions
Reduce Lock Granularity with a Decentralized Three-Phase Commit Protocol
- Pages committed concurrently by different chunks have no intersection
- Move the write-back out of the synchronized block
- Chunks can commit out of order
[Figure: commit timeline for P0 with the stages Chunk Complete, Obtain R&W-set, Wait for Lock, Insert into Committing List, Update Chunk Info, Check Committing List, Detect Conflict, Broadcast Updates, Write-back Updates, and Subsequent Chunk; the Lock region is shorter than before.]
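The idea of shrinking the critical section can be sketched as follows. This is a simplified illustration, not the protocol from the talk: the `Committer` class, its three steps, and the choice to reject a commit that overlaps an in-flight one are all assumptions. It only shows the lock being held for bookkeeping while the expensive write-back runs outside it, which is safe when concurrently committed page sets are disjoint.

```python
import threading

class Committer:
    def __init__(self):
        self.lock = threading.Lock()
        self.memory = {}          # shared pages: name -> value
        self.committing = {}      # chunk id -> pages currently being written back
        self.order = []           # commit order, i.e. what gets logged

    def commit(self, chunk_id, writes):
        pages = frozenset(writes)
        # Step 1 (short critical section): check the committing list,
        # register this chunk, and fix its place in the commit order.
        with self.lock:
            if any(pages & other for other in self.committing.values()):
                return False      # overlaps an in-flight commit; caller retries
            self.committing[chunk_id] = pages
            self.order.append(chunk_id)
        # Step 2 (no global lock): write back the updates. Disjoint page
        # sets mean concurrent write-backs cannot interfere.
        self.memory.update(writes)
        # Step 3 (short critical section): leave the committing list.
        with self.lock:
            del self.committing[chunk_id]
        return True

committer = Committer()
workers = [threading.Thread(target=committer.commit, args=(f"P{i}", {f"page{i}": i}))
           for i in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()
print(sorted(committer.memory.items()), len(committer.order))
```

The design choice to illustrate is that the commit order is fixed inside the lock, so the log stays well defined even though the write-backs themselves may finish in a different order.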

R&R the Memory Interleaving with HAV Extensions
Replay Memory Interleaving
- Guarantee that all chunks are properly rebuilt and executed in the original order:
  1. Truncate a chunk at the recorded timestamp (hardware performance counter)
  2. Ensure that all preceding chunks have been committed successfully before the subsequent chunk starts
- Allow processors to execute concurrently during replay
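Enforcing the recorded order while still letting processors run in parallel can be sketched with a condition variable. The `ReplayOrder` class and the `(processor, chunk index)` identifiers are hypothetical, and the sketch gates the commit point rather than the chunk start, which is a simplification of the rule on the slide.

```python
import threading

class ReplayOrder:
    """Lets each chunk commit only when the recorded log says it is next."""

    def __init__(self, commit_log):
        self.commit_log = list(commit_log)   # recorded (processor, chunk index) order
        self.next = 0
        self.committed = []
        self.cond = threading.Condition()

    def commit(self, chunk):
        with self.cond:
            # Block until every preceding chunk in the log has committed.
            self.cond.wait_for(lambda: self.commit_log[self.next] == chunk)
            self.committed.append(chunk)
            self.next += 1
            self.cond.notify_all()

def replay_processor(order, name, chunk_count):
    for index in range(chunk_count):
        # The chunk body would execute here, concurrently with other processors.
        order.commit((name, index))

log = [("P0", 0), ("P1", 0), ("P1", 1), ("P0", 1)]
order = ReplayOrder(log)
threads = [threading.Thread(target=replay_processor, args=(order, name, 2))
           for name in ("P0", "P1")]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(order.committed == log)
```

Whatever the thread scheduling, the commits come out in the logged order, since a processor that reaches its commit point early simply waits for its turn.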

Evaluation: Experimental Setup
- 4-core Intel Core i processor, 12GB memory, 1TB Hard Drive
- Host: Ubuntu with Linux kernel version and Qemu
- Guest: Ubuntu with Linux kernel version

Workloads
- PARSEC 3.0

Workload      Application Domain   Parallelization Granularity   Sharing   Exchange   Working Set
blackscholes  Financial Analysis   coarse                        low       low        small
bodytrack     Computer Vision      medium                        high      medium     medium
raytrace      Rendering            medium                        high      low        unbounded
swaptions     Financial Analysis   coarse                        low       low        medium
freqmine      Data Mining          coarse                        high      medium     unbounded
x264          Media Processing     coarse                        high      high       medium
(Sharing and Exchange are the two Data Usage columns.)

Evaluation: Log Size
- Samsara generates log at an average rate of MB/s and MB/s for recording two and four processors, respectively
- Reduces log size by 90% (even smaller than hardware-based schemes)
- For comparison: MB/s (single processor)

Evaluation: Log Size
- The size of the chunk commit log is practically negligible compared with other non-deterministic events
- 2.33% with 2 processors and 7.37% with 4 processors on average

Evaluation: Recording Overhead Compared to Native Execution
- Measures the overhead of the system relative to the base platform it runs on (cannot reflect the actual execution time of the system in real life)
- 2.6X and 5.0X for recording these workloads on two and four processors, respectively

Conclusion
We made the first attempt to leverage HAV extensions to achieve an efficient software-based replay system:
- Record processors' execution as a series of chunks
- Avoid the large number of memory access detections by performing a single EPT traversal at the end of each chunk
- Propose a decentralized three-phase commit protocol to reduce the lock granularity of the chunk commit process

Thanks