Download presentation
Presentation is loading. Please wait.
Published byMatthew Griffin Modified over 9 years ago
1
Samsara: Efficient Deterministic Replay with Hardware Virtualization Extensions Peking University Shiru Ren, Chunqi Li, Le Tan, and Zhen Xiao July 27 , 2015
2
Introduction 1 Background & Motivation 2 Samsara Overview 3 R & R the Memory Interleaving with HAV 4 2 Conclusion 6 Evaluation 5 Table of Contents
3
Introduction 3 Deterministic Replay Gives computer users the ability to travel backward in time, recreating past states and events in the computer. Checkpoint + record all non-deterministic events Checkpoint Execute same instruction stream Inject NDEs in logged points Initial State Instruction stream Final State Non-determinism Events (NDEs) (e.g. user / network input, interrupts… ) Final State’ NDEs log Recording Phase Replay Phase Application Scenarios For debugging: cyclic debugging For security: forensics, intrusion detection, malware analysis For fault tolerance: hot standby, data recovery
4
Introduction 4 Deterministic Replay for Multi-processor Deterministic replay for single processor is relatively mature and well- developed Challenge on the multi-processor system: memory interleaving
5
Introduction 1 Background & Motivation 2 Samsara Overview 3 R & R the Memory Interleaving with HAV 4 5 Conclusion 6 Evaluation 5 Table of Contents
6
6 Software-only schemes Modify OS, compiler, runtime libraries or VMM Virtualization-based approaches—CREW protocol CREW: Concurrent-Read & Exclusive-Write Background & Motivation Issues Each memory access operation must be checked for logging Serious performance degradation (more than 10X) Huge log size (approximately 1 MB/processor/second)
7
Background & Motivation 7 Hardware-based schemes Use special hardware support for recording memory-access interleaving Redesign the cache coherence protocol Issues Huge space overhead which limits the duration of the recorded interval Modeled only using software simulations Only support Sequential Consistency (SC) Impractical for use in realistic systems We believe software-only approaches will remain in the focus for optimizations as commercial processors with dedicated hardware-based R&R features are not commonly available yet.
8
8 Evaluation results show that our system: Incurs less than 3X overhead when compared to native execution Reduce 90% log size (even smaller than hardware-based scheme) Combine the merits of both solutions An Efficient way to record memory interleaving Without hardware changes Utilize current hardware acceleration as much as possible Hardware-assisted virtualization Some hardware characteristics are available to boost performance Efficient full virtualization using help from hardware capabilities Background & Motivation
9
Introduction 1 Background & Motivation 2 Samsara Overview 3 R & R the Memory Interleaving with HAV 4 9 Conclusion 6 Evaluation 5 Table of Contents
10
Samsara Overview 10
11
Samsara Overview 11 System composition Controller DMA recorder R&R Component Log recorder daemon Record and Replay Non-deterministic Events Synchronous Events: record the contents Asynchronous Events: record timestamp Compound Events: DMA, record both Memory interleaving: most important challenge
12
Introduction 1 Background & Motivation 2 Samsara Overview 3 R & R the Memory Interleaving with HAV 4 12 Conclusion 6 Evaluation 5 Table of Contents
13
R&R the Memory Interleaving with HAV Extensions 13 Motivation point-to-point logging approach Record dependences between pairs of instructions( log size, record overhead ) Avoid the large number of memory access detections Chunk-based schemes ( only the total sequence of chunks is recorded ) Chunk-based Strategy Restrict virtual processors’ execution into a series of chunks Merely need to record commit order Chunk execution must satisfy: Atomicity Serializability
14
14 Serializability: COW, conflict detection strategy Atomicity: some instructions that hard to undo R&R the Memory Interleaving with HAV Extensions P0 Chunk Start LD (A) COW ST (A) ST (B) COW Chunk Complete Truncation Reason: I/O Instruction Commit P1 LD (A) Squash & Rollback LD (B) ST (B) Re-execution LD (D) ST (D) Conflict Detection R-set { A } W-set { A, B } Truncation Reason: Chunk Size Limit R-set { D } W-set { D } R-set { A, B } W-set { B } ……
15
15 Obtain R&W-set Efficiently via HAV Extensions VM-based approaches: VM exit (hardware page protection) Our approach: a single EPT traversal Accessed and Dirty Flags of EPT Optimization: tree-based design of EPT W(b) R(b) R(a) R(c) W(b) R(b) W(e) W(b) R(b) R(a) R(c) W(b) R(b) W(e) a single EPT traversal Just the first write to each memory page will trigger an EPT violation VM exit Reduce at least 50% extra VM exits R&R the Memory Interleaving with HAV Extensions
16
16 R&R the Memory Interleaving with HAV Extensions Observations Chunk commit is a time-consuming process The time consumed on waiting for this lock is excessive (40%) Update write-back operation involves serious performance degradation P0 Chunk Complete Wait for Lock Detect Conflict Broadcast Updates Subsequent Chunk Lock Obtain R&W-set Write-back Updates
17
17 Reduce Lock Granularity with a Decentralized Three-Phase Commit Protocol Pages committed concurrently by different chunks have no intersection Move this out of the synchronized block Chunk committing out-of-order R&R the Memory Interleaving with HAV Extensions P0 Chunk Complete Wait for Lock Detect Conflict Broadcast Updates Write-back Updates Subsequent Chunk Lock Insert into committing list Update Chunk Info Check Committing List Obtain R&W-set
18
18 Replay Memory Interleaving Guarantee all chunks will be properly re-built and executed in the original order 1. Truncate a chunk at the recorded timestamp (hardware performance counter) 2. Ensure that all preceding chunks have been committed successfully before the subsequent chunk starts Allowing processors execute concurrently in replay R&R the Memory Interleaving with HAV Extensions
19
Introduction 1 Background & Motivation 2 Samsara Overview 3 R & R the Memory Interleaving with HAV 4 19 Conclusion 6 Evaluation 5 Table of Contents
20
Evaluation 20 Experimental Setup 4-core Intel Core i7-4790 processor, 12GB memory, 1TB Hard Drive Host: Ubuntu 12.04 with Linux kernel version 3.11.0 and Qemu-1.2.2 Guest: Ubuntu 14.04 with Linux kernel version 3.13.0 WorkloadApplication Domain Parallelization Granularity Data Usage Working Set SharingExchange blackscholesFinancial Analysiscoarselow small bodytrackComputer Visionmediumhighmedium raytraceRenderingmediumhighlowunbounded swaptionsFinancial Analysiscoarselow medium freqmineData Miningcoarsehighmediumunbounded x264Media Processingcoarsehigh medium Workloads PARSEC 3.0
21
Evaluation 21 Log Size Samsara generates log at an average rate of 0.127 MB/s and 0.151 MB/s for recoding two and four processors, respectively Reduce 90% log size (even smaller than hardware-based schemes) For comparison : 0.131 MB/s (single processor)
22
Evaluation 22 Log Size The size of the chunk commit log is practically negligible compared with other non-deterministic events 2.33% with 2 processors and 7.37% with 4 processors on average
23
Evaluation 23 Recording Overhead Compared to Native Execution Measure the overhead of the system relative to the base platform it runs on (can not reflect the actual execution time of the system in real life) 2.6X and 5.0X for recording these workloads on two and four processors
24
Conclusion 24 We made the first attempt to leverage HAV extensions to achieve an efficient software-based replay system Record processors’ execution as a series of chunks Avoid the large number of memory access detections by performing a single EPT traversal at the end of each chunk Propose a decentralized three-phase commit protocol to reduce the lock granularity of the chunk commit process
25
Thanks 25
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.