Dongyoon Lee, Benjamin Wester, Kaushik Veeraraghavan, Satish Narayanasamy, Peter M. Chen, and Jason Flinn University of Michigan, Ann Arbor Respec: Efficient.

Slides:



Advertisements
Similar presentations
Remus: High Availability via Asynchronous Virtual Machine Replication
Advertisements

PEREGRINE: Efficient Deterministic Multithreading through Schedule Relaxation Heming Cui, Jingyue Wu, John Gallagher, Huayang Guo, Junfeng Yang Software.
Michael Bond (Ohio State) Milind Kulkarni (Purdue)
UW-Madison Computer Sciences Multifacet Group© 2011 Karma: Scalable Deterministic Record-Replay Arkaprava Basu Jayaram Bobba Mark D. Hill Work done at.
An Case for an Interleaving Constrained Shared-Memory Multi-Processor Jie Yu and Satish Narayanasamy University of Michigan.
Gwendolyn Voskuilen, Faraz Ahmad, and T. N. Vijaykumar Electrical & Computer Engineering ISCA 2010.
R2: An application-level kernel for record and replay Z. Guo, X. Wang, J. Tang, X. Liu, Z. Xu, M. Wu, M. F. Kaashoek, Z. Zhang, (MSR Asia, Tsinghua, MIT),
OSDI ’10 Research Visions 3 October Epoch parallelism: One execution is not enough Jessica Ouyang, Kaushik Veeraraghavan, Dongyoon Lee, Peter Chen,
OSDI ’10 Research Visions 3 October Epoch parallelism: One execution is not enough Jessica Ouyang, Kaushik Veeraraghavan, Dongyoon Lee, Peter Chen,
Detecting and surviving data races using complementary schedules
CS530 Operating System Nesting Paging in VM Replay for MPs Jaehyuk Huh Computer Science, KAIST.
Asynchronous Assertions Eddie Aftandilian and Sam Guyer Tufts University Martin Vechev ETH Zurich and IBM Research Eran Yahav Technion.
S. Narayanasamy, Z. Wang, J. Tigani, A. Edwards, B. Calder UCSD and Microsoft PLDI 2007.
Continuously Recording Program Execution for Deterministic Replay Debugging.
October 2003 What Does the Future Hold for Parallel Languages A Computer Architect’s Perspective Josep Torrellas University of Illinois
Execution Replay for Multiprocessor Virtual Machines George W. Dunlap Dominic Lucchetti Michael A. Fetterman Peter M. Chen.
Deterministic Logging/Replaying of Applications. Motivation Run-time framework goals –Collect a complete trace of a program’s user-mode execution –Keep.
BugNet Continuously Recording Program Execution for Deterministic Replay Debugging Satish Narayanasamy Gilles Pokam Brad Calder.
Remus: High Availability via Asynchronous Virtual Machine Replication.
7. Fault Tolerance Through Dynamic or Standby Redundancy 7.5 Forward Recovery Systems Upon the detection of a failure, the system discards the current.
1 SigRace: Signature-Based Data Race Detection Abdullah Muzahid, Dario Suarez*, Shanxiang Qi & Josep Torrellas Computer Science Department University of.
Parallelizing Data Race Detection Benjamin Wester Facebook David Devecsery, Peter Chen, Jason Flinn, Satish Narayanasamy University of Michigan.
DoublePlay: Parallelizing Sequential Logging and Replay Kaushik Veeraraghavan Dongyoon Lee, Benjamin Wester, Jessica Ouyang, Peter M. Chen, Jason Flinn,
DTHREADS: Efficient Deterministic Multithreading
MSWAT: Low-Cost Hardware Fault Detection and Diagnosis for Multicore Systems Siva Kumar Sastry Hari, Man-Lap (Alex) Li, Pradeep Ramachandran, Byn Choi,
RCDC SLIDES README Font Issues – To ensure that the RCDC logo appears correctly on all computers, it is represented with images in this presentation. This.
1 Rollback-Recovery Protocols II Mahmoud ElGammal.
Light64: Lightweight Hardware Support for Data Race Detection during Systematic Testing of Parallel Programs A. Nistor, D. Marinov and J. Torellas to appear.
0 Deterministic Replay for Real- time Software Systems Alice Lee Safety, Reliability & Quality Assurance Office JSC, NASA Yann-Hang.
Advanced Topics: MapReduce ECE 454 Computer Systems Programming Topics: Reductions Implemented in Distributed Frameworks Distributed Key-Value Stores Hadoop.
Samsara: Efficient Deterministic Replay with Hardware Virtualization Extensions Peking University Shiru Ren, Chunqi Li, Le Tan, and Zhen Xiao July 27 ,
What is the Cost of Determinism?
Accelerating Mobile Applications through Flip-Flop Replication
Parallelizing Security Checks on Commodity Hardware E.B. Nightingale, D. Peek, P.M. Chen and J. Flinn U Michigan.
Accelerating Precise Race Detection Using Commercially-Available Hardware Transactional Memory Support Serdar Tasiran Koc University, Istanbul, Turkey.
- 1 - Dongyoon Lee, Peter Chen, Jason Flinn, Satish Narayanasamy University of Michigan, Ann Arbor Chimera: Hybrid Program Analysis for Determinism * Chimera.
- 1 - Dongyoon Lee †, Mahmoud Said*, Satish Narayanasamy †, Zijiang James Yang*, and Cristiano L. Pereira ‡ University of Michigan, Ann Arbor † Western.
SSGRR A Taxonomy of Execution Replay Systems Frank Cornelis Andy Georges Mark Christiaens Michiel Ronsse Tom Ghesquiere Koen De Bosschere Dept. ELIS.
AADEBUG MUNCHEN Non-intrusive on-the-fly data race detection using execution replay Michiel Ronsse - Koen De Bosschere Ghent University - Belgium.
ReSlice: Selective Re-execution of Long-retired Misspeculated Instructions Using Forward Slicing Smruti R. Sarangi, Wei Liu, Josep Torrellas, Yuanyuan.
Parallelizing Security Checks on Commodity Hardware Ed Nightingale Dan Peek, Peter Chen Jason Flinn Microsoft Research University of Michigan.
Rerun: Exploiting Episodes for Lightweight Memory Race Recording Derek R. Hower and Mark D. Hill Computer systems complex – more so with multicore What.
SPECULATIVE EXECUTION IN A DISTRIBUTED FILE SYSTEM E. B. Nightingale P. M. Chen J. Flint University of Michigan.
Efficient Deterministic Replay of Multithreaded Executions in a Managed Language Virtual Machine Michael Bond Milind Kulkarni Man Cao Meisam Fathi Salmi.
Low-Overhead Software Transactional Memory with Progress Guarantees and Strong Semantics Minjia Zhang, 1 Jipeng Huang, Man Cao, Michael D. Bond.
Speculative Execution in a Distributed File System Ed Nightingale Peter Chen Jason Flinn University of Michigan.
Drinking from Both Glasses: Adaptively Combining Pessimistic and Optimistic Synchronization for Efficient Parallel Runtime Support Man Cao Minjia Zhang.
A Regulated Transitive Reduction (RTR) for Longer Memory Race Recording (ASLPOS’06) Min Xu Rastislav BodikMark D. Hill Shimin Chen LBA Reading Group Presentation.
…and region serializability for all JESSICA OUYANG, PETER CHEN, JASON FLINN & SATISH NARAYANASAMY UNIVERSITY OF MICHIGAN.
An Integrated Framework for Dependable and Revivable Architecture Using Multicore Processors Weidong ShiMotorola Labs Hsien-Hsin “Sean” LeeGeorgia Tech.
Sound and Precise Analysis of Parallel Programs through Schedule Specialization Jingyue Wu, Yang Tang, Gang Hu, Heming Cui, Junfeng Yang Columbia University.
Dongyoon Lee, Benjamin Wester, Kaushik Veeraraghavan, Satish Narayanasamy, Peter M. Chen, and Jason Flinn University of Michigan, Ann Arbor Respec: Efficient.
CoreDet: A Compiler and Runtime System for Deterministic Multithreaded Execution Tom Bergan Owen Anderson, Joe Devietti, Luis Ceze, Dan Grossman To appear.
Speculation Supriya Vadlamani CS 6410 Advanced Systems.
AtomCaml: First-class Atomicity via Rollback Michael F. Ringenburg and Dan Grossman University of Washington International Conference on Functional Programming.
State Machine Replication State Machine Replication through transparent distributed protocols State Machine Replication through a shared log.
Flashback : A Lightweight Extension for Rollback and Deterministic Replay for Software Debugging Sudarshan M. Srinivasan, Srikanth Kandula, Christopher.
Speculative Execution in a Distributed File System Ed Nightingale Peter Chen Jason Flinn University of Michigan Best Paper at SOSP 2005 Modified for CS739.
PRES: Probabilistic Replay with Execution Sketching on Multiprocessors Soyeon Park and Yuanyuan Zhou (UCSD) Weiwei Xiong, Zuoning Yin, Rini Kaushik, Kyu.
Kendo: Efficient Deterministic Multithreading in Software M. Olszewski, J. Ansel, S. Amarasinghe MIT to be presented in ASPLOS 2009 slides by Evangelos.
Explicitly Parallel Programming with Shared-Memory is Insane: At Least Make it Deterministic! Joe Devietti, Brandon Lucia, Luis Ceze and Mark Oskin University.
Lecture 3 – MapReduce: Implementation CSE 490h – Introduction to Distributed Computing, Spring 2009 Except as otherwise noted, the content of this presentation.
Optimistic Hybrid Analysis
On-Demand Dynamic Software Analysis
Rerun: Exploiting Episodes for Lightweight Memory Race Recording
Heming Cui, Jingyue Wu, John Gallagher, Huayang Guo, Junfeng Yang
Persistency for Synchronization-Free Regions
Changing thread semantics
Parallel and Distributed Simulation
CSE 542: Operating Systems
Presentation transcript:

Dongyoon Lee, Benjamin Wester, Kaushik Veeraraghavan, Satish Narayanasamy, Peter M. Chen, and Jason Flinn University of Michigan, Ann Arbor Respec: Efficient Online Multiprocessor Replay via Speculation and External Determinism

Dongyoon Lee Deterministic Replay Record and reproduce non-deterministic events 1)Offline Uses: replay repeatedly after original run Debugging Forensics 2)Online Uses: record and replay concurrently Fault tolerance Decoupled runtime checks We focus on online replay for multi-processors Deterministic Replay 2

Dongyoon Lee Online Deterministic Replay Uses ServerReplica Takeover Fault ToleranceDecoupled Runtime Checks AppReplay + Check A A + Check B B C C Need to record and replay concurrently Both recording and replaying should be efficient Requestlog Response Fault !! replay keep the same state 3 P1 P2 P3 P4 A A B B C C

Dongyoon Lee Uniprocessor Replay Program Input (e.g. system calls, signals, etc) Thread scheduling Multiprocessor Replay: + Shared memory dependencies Instrument every memory operation PinSEL [Pereira, IISWC’08], iDNA [Bhansali, VEE’06] Page protection SMP-ReVirt [Dunlap, VEE’08] Offline search ODR [Altekar, SOSP’09], PRES [Park, SOSP’09] Replay-SAT [Lee, MICRO’09] Hardware support FDR [Xu, ISCA’03], Strata [Narayanasamy, ASPLOS’06], ReRun [Hower, ISCA’08], DeLorean [Montesinos, ISCA’08] Past Solutions for Deterministic Replay → x → 2-9x → Slow replay → Custom HW 4

Dongyoon Lee Goal: Efficient online software-only multiprocessor replay Key Idea: Speculation + Check 1)Speculate data race free 2)Detect mis-speculation using a cheap check 3)Rollback and retry on mis-speculation Overview of Our Approach multi-threaded fork Lock(l) Unlock(l) Lock(l) T1T2 Checkpoint A Recorded Process T1’T2’ A’ Replayed Process Lock(l’) Unlock(l’) Lock(l’) Speculate Race free Check B’==B? Checkpoint B 5

Dongyoon Lee Motivation/Overview Respec Design 1.Speculate data race free 2.Detect mis-speculation 3.Rollback and Retry on mis-speculation Evaluation Conclusion Roadmap 6

Dongyoon Lee Observation Reproducing program input and happens-before order of sync. operations guarantees deterministic replay of data-race-free programs [Ronsse and Bosschere ’99] 1) Program input ( e.g. system calls, signals, etc. ) Record: Log system call effects Replay: Emulate system call 2) Synchronization Operations Record and replay happens-before order Instrument common (not all) synchronization primitives in glibc Deterministic Replay of Data-race-free Programs 7 + total order

Dongyoon Lee What if a program is NOT race free? Problem Need to detect mis-speculation Data race detector is too heavy-weight Insight: External Determinism is sufficient Not necessary to replay data races Ensure that the replayed process produces the same visible effects as the recorded process to an external observer Visible effects = System output + Final program state Solution: Divergence checks Detect mis-speculation when the replay is not externally deterministic 8

Dongyoon Lee 1) System Output Check For every system call, compare system call argument Ensure that the replay produces the same output as the recorded process Divergence Check #1 – System Output Lock(l) Unlock(l) Lock(l) Lock(l’) Unlock(l’) Lock(l’) T1T2 Start A Recorded Process T1’T2’ Start A’ Replayed Process Check O’==O? SysRead X SysWrite O SysRead X’ SysWrite O’ 9 multi-threaded fork

Dongyoon Lee Benign Data Races Not all races cause divergence checks to fail A data race is inconsequential if system output matches x=1 x!=0? x=1 x!=0? T1T2 Start A Recorded ProcessReplayed Process Success SysWrite(x) 10 multi-threaded fork T1’T2’ Start A’

Dongyoon Lee 1)Need to rollback to the beginning 2)Need to buffer system output till the end 11 Divergence due to Data Races Start A Recorded Process Replayed Process Start A’ multi-threaded fork T1T2 T1’T2’ Fail SysWrite(x) x=1 x=2 x=1 x=2 SysWrite(x)

Dongyoon Lee 2) Program state check Compare register and memory state at semi-regular intervals (epochs) Construct a safe intermediate point –To release buffered output –To rollback to in case of mis-speculation Divergence Check #2 – Program State 12 Replayed Process T1’T2’ Start A’ T1T2 Recorded Process Start A epoch Release Output Success epoch SysWrite(x) x=1 x=2 x=1 x=2 SysWrite(x) Fail Checkpoint B B’ == B ?

Dongyoon Lee Recovery from Mis-speculation Rollback Rollback both recorded and replayed processes to the previous checkpoint Re-execute Optimistically re-run the failed epoch On repeated failure, switch to uniprocessor execution model –Record and replay only one thread at a time –Parallel execution resumes after the failed interval T1T2 T1’T2’ Check B’==B? x=1 x=2 Fail A == A’ Checkpoint B x=1 x=2 Checkpoint A x=1 x=2 Checkpoint B Checkpoint C Checkpoint A Check B’==B? x=1 x=2 Check C’==C? A == A’ 13

Dongyoon Lee Speculative Execution Speculator [Nightingale et al. SOSP’05] Buffer output during speculation Block execution if speculative execution is not feasible Release buffered output on commit Undo speculative changes and squash buffered output on mis-speculation 14

Dongyoon Lee Motivation/Overview Respec Design Evaluation 1.Performance results 2.Breakdown of performance overhead 3.Rollback frequency and overhead Conclusion Roadmap 15

Dongyoon Lee Evaluation Setup Test Environment 2 GHz 8 core Xeon processor with 3 GB of RAM Run 1~4 worker threads (excluding control threads) Collect the average of 10 trials (except pbzip2 and aget) Benchmarks PARSEC suite –blackscholes, bodytrack, fluidanimate, swaptions, streamcluster SPLASH-2 suite –ocean, raytrace, volrend, water-nsq, fft, and radix Real applications –pbzip2, pfscan, aget, and Apache 16

Dongyoon Lee Record and Replay Performance 18% for 2 threads, 55% for 4 threads Real applications (including Apache) showed <50% for 4 threads 17

Dongyoon Lee 1) Redundant Execution Overhead (25%) Cost of running two executions (Lower bound of online replay) Mainly due to sharing limited resources: memory system Contribute 25% of total cost for 4 threads Redundant execution overhead (25%) 18

Dongyoon Lee 2) Epoch overhead (17%) Due to checkpoint cost Due to artificial epoch barrier cost Contribute 17% of total cost for 4 threads 19 Epoch overhead (17%) Redundant execution overhead (25%)

Dongyoon Lee 3) Memory Comparison Overhead (16%) Optimization 1. compare dirty pages only Optimization 2. parallelize comparison Contribute 16% of total cost for 4 threads 20 Memory comparison overhead (16%) Epoch overhead (17%) Redundant execution overhead (25%)

Dongyoon Lee 4) Logging Overhead (42%) Logging synchronization operations and system calls overhead Main cost for applications with fine-grained synchronizations Contribute 42% of total cost for 4 threads 21 Logging and other overhead (42%) Memory comparison overhead (16%) Epoch overhead (17%) Redundant execution overhead (25%)

Dongyoon Lee Rollback Frequency and Overhead App.ThreadsRollback FrequencyOverheadAvg. Overhead Pbzip2 (100 runs) 4 84% none 41% 45% 15% once 66% 1% twice 105% Aget (50 runs) 4 80% none 6% 18% once 6% 2% twice 6% Pbzip2(16%) and Aget(20%) invoke one or more rollbacks Pbzip2: Rollbacks contribute <10% of total overhead Aget: Rollback overhead is negligible frequent checkpoints => short epochs => small amount of work to be re-done 22

Dongyoon Lee Conclusion Goal: Deterministic replay for multithreaded programs Software-only: no custom hardware Online: record and replay concurrently Contributions to replay Speculation: speculate race-free, and rollback/retry if needed External Determinism: Match system output and program states Results Performance overhead record and replay concurrently 2 threads: 18% 4 threads: 55% Thank you… 23