DoublePlay: Parallelizing Sequential Logging and Replay
Kaushik Veeraraghavan, Dongyoon Lee, Benjamin Wester, Jessica Ouyang, Peter M. Chen, Jason Flinn, Satish Narayanasamy
University of Michigan

Deterministic replay
Record and reproduce execution
– Debugging
– Program analysis
– Forensics and intrusion detection
– … and many more uses
Open problem: guarantee deterministic replay efficiently on a commodity multiprocessor

Contributions
Fastest guaranteed deterministic replay system that runs on commodity multiprocessors
– ≈ 20% overhead with spare cores
– ≈ 100% overhead if the workload uses all cores
Uniparallelism: a novel execution model for multithreaded programs

Why is multiprocessor replay hard?
Multiprocessor replay: threads update shared memory concurrently (e.g., i++ on one CPU races with i-- on another)
– Up to 9x slowdown, or lose the replay guarantee
Uniprocessor replay: only one thread updates shared memory at a time
– OS-level replay suffices: log system calls, signals, and the thread schedule
Goal: guarantee deterministic replay at a lower cost than previous systems
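The slide's i++/i-- figure boils down to a classic data race. The following is a minimal sketch (not from the talk) of that race in C with pthreads; on a multiprocessor, the interleaving of the unsynchronized updates changes from run to run, which is exactly the nondeterminism an OS-level log of system calls and the thread schedule cannot capture by itself.

    /* Minimal sketch (not from the talk): two threads race on a shared counter.
     * The final value depends on how the unsynchronized read-modify-write
     * sequences interleave, so each multiprocessor run can end differently. */
    #include <pthread.h>
    #include <stdio.h>

    static int i = 0;                       /* shared, intentionally unsynchronized */

    static void *inc(void *arg) { (void)arg; for (int n = 0; n < 100000; n++) i++; return NULL; }
    static void *dec(void *arg) { (void)arg; for (int n = 0; n < 100000; n++) i--; return NULL; }

    int main(void) {
        pthread_t a, b;
        pthread_create(&a, NULL, inc, NULL);
        pthread_create(&b, NULL, dec, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        printf("i = %d\n", i);              /* often nonzero, and different each run */
        return 0;
    }

On a single CPU the same two threads are time-sliced, so logging system calls, signals, and the schedule is enough to reproduce the run; recovering that property without giving up parallel hardware is what uniparallelism is for.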

Uniparallelism: a new execution model
[Figure: thread-parallel execution runs the program's threads concurrently across CPUs 0–3; epoch-parallel execution divides time into epochs (Ep 1–Ep 4) and runs each epoch with all of its threads time-sliced on a single CPU (CPUs 4–7), pipelining the epochs across cores]
Programs benefit from uniprocessor execution while scaling performance with increasing processors.
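The sketch below is a rough, hypothetical illustration of the epoch-parallel half of uniparallelism on a Linux host: each epoch starts from a copy-on-write checkpoint of the process (taken here with fork) and is re-executed with all of its work confined to one spare core, so several epochs can run "single-threaded" at the same time. The core numbers, helper names, and use of fork are assumptions for illustration, not DoublePlay's implementation.

    /* Hypothetical sketch of epoch pipelining (illustrative, not DoublePlay's code):
     * each epoch gets a copy-on-write checkpoint via fork() and is confined to a
     * single spare core, so the epochs overlap in time across CPUs 4..7. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/wait.h>

    static void pin_to_cpu(int cpu) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        sched_setaffinity(0, sizeof(set), &set);   /* confine this epoch to one core */
    }

    static void run_epoch_single_threaded(int epoch, int cpu) {
        /* stand-in for re-executing the epoch's threads, time-sliced on one CPU */
        printf("epoch %d re-executed on CPU %d\n", epoch, cpu);
    }

    int main(void) {
        for (int epoch = 0; epoch < 4; epoch++) {
            int cpu = 4 + epoch % 4;               /* spare cores 4..7 in this sketch */
            pid_t pid = fork();                    /* copy-on-write checkpoint of the address space */
            if (pid == 0) {
                pin_to_cpu(cpu);
                run_epoch_single_threaded(epoch, cpu);
                _exit(0);
            }
            /* in the real system, the thread-parallel run would execute the next
             * epoch here before forking the checkpoint for the one after it */
        }
        while (wait(NULL) > 0) ;                   /* join all epoch executions */
        return 0;
    }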

When should one choose uniparallelism?
Cost
– Two executions, so twice the utilization
When is it applicable?
– When the property of interest benefits from uniprocessor execution (e.g., deterministic replay)
– When all cores are not fully utilized: some applications don't scale well, and hardware manufacturers continue increasing core counts

Roadmap
Uniparallelism
DoublePlay
– Record program execution
– Reproduce execution offline
Evaluation

DoublePlay records epoch-parallel execution
OS-level uniprocessor replay
– Log system calls, signals, and the thread schedule
No shared-memory accesses need to be logged
[Figure: epochs Ep 1–Ep 4 running epoch-parallel on CPUs 4–7, with syscalls, sync ops, and signals logged per epoch]

Divergence checks verify execution
Verify register and memory state at the end of each epoch ("state matches?")
– Match: continue program execution
– Mismatch: initiate recovery
[Figure: thread-parallel execution (CPUs 0–3) and epoch-parallel execution (CPUs 4–7) compared at each epoch boundary]
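A hypothetical sketch of what an end-of-epoch check could look like: digest the register state and the pages dirtied during the epoch in both executions and compare. The struct, field names, and use of a simple FNV-1a hash are assumptions for illustration; the real system performs its state comparison in the kernel, not through this interface.

    /* Hypothetical end-of-epoch divergence check (illustrative only): compare a
     * digest of register state and dirtied memory from the thread-parallel and
     * epoch-parallel runs; a mismatch triggers recovery. */
    #include <stdint.h>
    #include <stddef.h>
    #include <stdbool.h>

    struct epoch_state {
        uint64_t reg_digest;   /* digest of each thread's architectural registers */
        uint64_t mem_digest;   /* digest of pages written during the epoch        */
    };

    /* FNV-1a, a simple stand-in for whatever digest a real system would use */
    static uint64_t digest(const uint8_t *buf, size_t len) {
        uint64_t h = 1469598103934665603ULL;
        for (size_t i = 0; i < len; i++) { h ^= buf[i]; h *= 1099511628211ULL; }
        return h;
    }

    static bool epoch_states_match(const struct epoch_state *thread_parallel,
                                   const struct epoch_state *epoch_parallel) {
        return thread_parallel->reg_digest == epoch_parallel->reg_digest &&
               thread_parallel->mem_digest == epoch_parallel->mem_digest;
    }

    int main(void) {
        uint8_t page_a[4096] = {0}, page_b[4096] = {0};
        struct epoch_state a = { 1, digest(page_a, sizeof page_a) };
        struct epoch_state b = { 1, digest(page_b, sizeof page_b) };
        return epoch_states_match(&a, &b) ? 0 : 1;   /* 0: states match, continue */
    }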

DoublePlay keeps the executions in sync
Respec [ASPLOS '10]: online record and replay
– Total order and return values of system calls
– Partial order and return values of synchronization operations
[Figure: the syscall, sync-op, and signal log produced by the thread-parallel run drives the epoch-parallel run, epoch by epoch]
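As a concrete (hypothetical) illustration of what such a log entry might carry, the sketch below pairs each event with the ordering information the slide lists: a global sequence number totally orders system calls, while a per-object acquisition count only constrains synchronization operations on the same lock, giving a partial order. The field names are assumptions, not Respec's actual data structures.

    /* Hypothetical log-entry layout (field names are illustrative, not Respec's).
     * System calls carry a global sequence number (total order) plus the return
     * value to reproduce; sync operations carry a per-object acquisition count,
     * which only orders operations on the same lock (partial order). */
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/types.h>

    enum event_kind { EV_SYSCALL, EV_SYNC_OP, EV_SIGNAL };

    struct replay_event {
        enum event_kind kind;
        pid_t           tid;            /* thread that executed the event      */
        uint64_t        global_seq;     /* total order across all system calls */
        union {
            struct { int nr; long ret; } syscall;                      /* number, return value */
            struct { void *obj; uint64_t acq_count; long ret; } sync;  /* e.g., a lock         */
            struct { int signo; } signal;
        } u;
    };

    int main(void) {
        struct replay_event e = { .kind = EV_SYSCALL, .tid = 1234, .global_seq = 42 };
        e.u.syscall.nr  = 1;            /* e.g., write */
        e.u.syscall.ret = 5;            /* replayed instead of re-executing the call */
        printf("syscall %d at seq %llu returned %ld\n",
               e.u.syscall.nr, (unsigned long long)e.global_seq, e.u.syscall.ret);
        return 0;
    }

During replay, a thread would block until its next logged event becomes eligible: a system call waits for its global_seq turn, while a sync operation only waits until its object's acquisition count is reached.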

Forward recovery from memory divergence
If the end-of-epoch states do not match, resume execution from the valid epoch-parallel state
[Figure: on a mismatch at an epoch boundary, the thread-parallel run is restarted from the epoch-parallel run's state]

Monitor divergences within an epoch
State at the end of the epoch must not diverge
– OK to allow divergence within the epoch
Benefits
– Epoch-parallel execution proceeds further
– Might eliminate pipeline stalls altogether
[Figure: the two runs interleave x = 1, x = 2, and syswrite(x) differently within the epoch, diverging temporarily before the end-of-epoch check]

How is the thread schedule recreated?
Option 1: Deterministic scheduling algorithm
– Each thread is assigned a strict priority
– Highest-priority thread runs on the processor at any given time
Option 2: Record & replay scheduling decisions
– Log instruction pointer & branch count on preemption
– Preempt offline replay thread at logged values
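For option 2, a hypothetical sketch of the preemption record is shown below, assuming a hardware performance counter such as a retired-branch count is available: the (branch count, instruction pointer) pair pins down the exact dynamic instruction at which the thread was preempted, so the offline replayer can stop it at the same place. The struct and field names are illustrative assumptions.

    /* Hypothetical preemption record for "record & replay scheduling decisions".
     * The (branch_count, instr_ptr) pair identifies the dynamic instruction at
     * which the recorded thread was preempted; during offline replay, a counter
     * overflow plus single-stepping can stop the thread at the same point before
     * switching to next_tid. */
    #include <stdint.h>
    #include <sys/types.h>

    struct preemption_record {
        pid_t    tid;            /* thread that was preempted                  */
        pid_t    next_tid;       /* thread scheduled next                      */
        uint64_t branch_count;   /* retired-branch counter value at preemption */
        uint64_t instr_ptr;      /* instruction pointer at preemption          */
    };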

DoublePlay: summary
Offline replay reproduces the epoch-parallel execution
[Figure: RECORD (uniparallelism) runs the thread-parallel execution on CPUs 0–3 and the epoch-parallel execution on CPUs 4–7, logging syscalls, sync ops, and signals; REPLAY re-executes epochs Ep 1–Ep 4 on a single CPU, starting from the initial process state and consuming the same log]

DoublePlay evaluation
Benchmarks
– Applications: Apache, MySQL, pbzip2, pfscan, aget
– SPLASH-2: fft, radix, ocean, water
Run on 8-core Xeon server
Two scenarios
– With spare cores: 2 – 4 worker threads
– No spare cores: worker threads use all CPUs
Evaluation uses deterministic scheduler

Recording overhead: with spare cores
Average cost: 15% for 2 threads, 28% for 4 threads
[Chart: normalized runtime per benchmark; per-benchmark overheads range from 1% to 23% with 2 threads and from 9% to 31% with 4 threads]

Recording overhead: no spare cores
Without spare cores, overhead is ≈ 100%
Scales well with increasing cores
[Chart: normalized runtime per benchmark]

Contributions
DoublePlay
– Fastest guaranteed replay on commodity multiprocessors
– ≈ 20% overhead with spare cores, ≈ 100% overhead if the workload uses all cores
Uniparallelism
– Simplicity of uniprocessor execution while scaling performance
Future work: new applications of uniparallelism
– Other properties that benefit from uniprocessor execution

Can DoublePlay replay realistic bugs?
Epoch-parallel execution can recreate the same race as thread-parallel execution
– Record preemptions
Scenario 1: Forensic analysis, auditing, etc.
– Record & reproduce the epoch-parallel run
Scenario 2: In-house software debugging
– The epoch-parallel run is a cheap race detector
– Can log mismatches and reproduce as needed

Why replay epoch-parallel execution?
Goal: record and reproduce execution
Both thread-parallel & epoch-parallel executions are valid
Preemptions can be recorded and reproduced
– Both executions cover the same execution space
– Both executions can experience the same bugs

Offline replay performance
Offline replay takes approximately the same time as single-threaded execution
[Chart: relative overhead per benchmark]

Forward recovery and loose replay
[Table: threads, rollbacks without optimizations, rollbacks with optimizations, and relative reduction in execution time for pbzip2, aget, Apache, and MySQL (numeric values not preserved)]
Optimizations beneficial for CPU-bound apps with long epochs

Forward recovery from syscall divergence
Thread-parallel execution is speculative
– DoublePlay logs an undo operation for each system call
Apply undo operations back to the divergent system call
Patch the address space to match the valid epoch-parallel state
[Figure: the two runs reach syswrite("Hello") under diverging conditions, triggering recovery]
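To illustrate the undo idea, here is a hypothetical sketch of an undo record for a file write: before the speculative thread-parallel run issues the write, the recorder saves the bytes it is about to overwrite, so the write can be reversed back to the divergent system call. The struct and helper names are assumptions, and the sketch ignores short reads and file-extending writes; DoublePlay's actual undo logging is done in the kernel and covers other system calls as well.

    /* Hypothetical undo record for a speculative write() (illustrative only):
     * save the region the write will clobber so it can be restored if the
     * divergence check later shows this execution path was invalid. */
    #include <stdint.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <unistd.h>

    struct write_undo {
        int      fd;
        off_t    offset;        /* file offset the speculative write targets   */
        size_t   len;
        uint8_t *saved_bytes;   /* previous contents of [offset, offset + len) */
    };

    /* Capture undo state before the speculative write is applied. */
    static int log_write_undo(struct write_undo *u, int fd, off_t off, size_t len) {
        u->fd = fd;
        u->offset = off;
        u->len = len;
        u->saved_bytes = malloc(len);
        if (!u->saved_bytes)
            return -1;
        return pread(fd, u->saved_bytes, len, off) < 0 ? -1 : 0;
    }

    /* Roll the write back: restore the saved bytes at the same offset. */
    static int apply_write_undo(const struct write_undo *u) {
        return pwrite(u->fd, u->saved_bytes, u->len, u->offset) < 0 ? -1 : 0;
    }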

Loose divergence checks
1) Skip syscalls without visible effects
– E.g.: nanosleep, getpid
[Figure: one run executes an extra sleep() along an if (!x) sleep() path that the other run skips; because the call has no visible effect, the two logs still match]
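A minimal sketch, under the assumption that the check works over per-thread syscall logs, of how such a filter could be applied: effect-free calls are simply dropped before the two logs are compared. Which calls count as "invisible" is limited here to the two examples the slide gives.

    /* Illustrative filter for loose divergence checks: system calls with no
     * externally visible effect (e.g., nanosleep, getpid) are skipped when the
     * two executions' syscall logs are matched against each other. */
    #include <stdbool.h>
    #include <sys/syscall.h>

    static bool has_visible_effect(long syscall_nr) {
        switch (syscall_nr) {
        case SYS_nanosleep:
        case SYS_getpid:
            return false;        /* safe to skip when matching the two logs */
        default:
            return true;         /* everything else must agree across runs  */
        }
    }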

Loose divergence checks (contd.)
2) Skip self-cancelling synchronization operations
[Figure: one run performs an extra lock(l); unlock(l) pair around if (x) y = x; since the pair cancels itself out, the synchronization logs are still considered matching]