PEREGRINE: Efficient Deterministic Multithreading through Schedule Relaxation
Heming Cui, Jingyue Wu, John Gallagher, Huayang Guo, Junfeng Yang
Software Systems Lab, Columbia University

Nondeterminism in Multithreading
Different runs → different behaviors, depending on thread schedules
Complicates a lot of things:
– Understanding
– Testing
– Debugging
– …

Nondeterministic Synchronization (Apache Bug #21287):
  Thread 0: mutex_lock(M); *obj = …; mutex_unlock(M)
  Thread 1: mutex_lock(M); free(obj); mutex_unlock(M)
  Which thread grabs M first varies from run to run; if Thread 1 wins, Thread 0 writes to freed memory.

Data Race (FFT in SPLASH2):
  Thread 0: …; barrier_wait(B); print(result)
  Thread 1: …; barrier_wait(B); result += …
  After the barrier, the read in print(result) races with the update result += ….
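To see the FFT-style race concretely, here is a minimal self-contained C program (our sketch, not from the talk) whose output depends on the schedule:

    /* Minimal sketch of the FFT-style race: both the print and the
       update happen after the barrier, so their order -- and the
       printed value -- depends on the thread schedule. */
    #include <pthread.h>
    #include <stdio.h>

    static pthread_barrier_t B;
    static int result = 0;

    static void *worker(void *arg) {
        pthread_barrier_wait(&B);
        result += 10;              /* races with the read in main */
        return NULL;
    }

    int main(void) {
        pthread_t t;
        pthread_barrier_init(&B, NULL, 2);
        pthread_create(&t, NULL, worker, NULL);
        pthread_barrier_wait(&B);
        printf("%d\n", result);    /* may print 0 or 10 */
        pthread_join(t, NULL);
        return 0;
    }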

Deterministic Multithreading (DMT)
Same input → same schedule
– Addresses many problems caused by nondeterminism
Existing DMT systems enforce one of two kinds of schedule:
– Sync-schedule: a deterministic total order of synchronization operations (e.g., lock()/unlock())
– Mem-schedule: a deterministic order of shared memory accesses (e.g., load/store)
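As a concrete illustration of a sync-schedule (our sketch, not Kendo's or TERN's actual mechanism), threads can pass a turn round-robin so that lock acquisitions always happen in the same total order:

    /* Hypothetical sketch: serialize sync operations into a fixed
       round-robin order with one semaphore per thread. Assumes each
       thread knows its index 0..N-1; ignores threads that block or
       exit, which a real system must detect and skip. */
    #include <pthread.h>
    #include <semaphore.h>

    #define N 2
    static sem_t turn[N];               /* turn[i] lets thread i proceed */

    void dmt_init(void) {
        for (int i = 0; i < N; i++)
            sem_init(&turn[i], 0, i == 0 ? 1 : 0);  /* thread 0 goes first */
    }

    /* Wrap every synchronization operation like this. */
    void dmt_lock(int self, pthread_mutex_t *m) {
        sem_wait(&turn[self]);               /* wait for my turn */
        pthread_mutex_lock(m);
        sem_post(&turn[(self + 1) % N]);     /* pass the turn on */
    }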

Sync-schedule [TERN OSDI '10], [Kendo ASPLOS '09], etc.
Pros: efficient (16% overhead in Kendo)
Cons: deterministic only when there are no races
– Many programs contain races [Lu ASPLOS '08]
(Slide repeats the FFT in SPLASH2 and Apache Bug #21287 examples: ordering the sync operations does not order the racy accesses.)

Mem-schedule [COREDET ASPLOS '10], [dOS OSDI '10], etc.
Pros: deterministic despite data races
Cons: high overhead (e.g., 1.2–10.1X slowdown in dOS)
(Slide repeats the FFT in SPLASH2 data-race example.)
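Conceptually, a mem-schedule extends the same turn-passing to every shared load and store, which is deterministic even for racy code but shows why the overhead is high; a naive sketch (real systems such as COREDET and dOS instead use quanta and ownership tracking to recover parallelism):

    /* Hypothetical sketch: schedule EVERY shared access, not just
       sync operations. Deterministic despite races, but instrumenting
       each load/store is costly. */
    #include <semaphore.h>
    #define N 2
    extern sem_t turn[N];   /* round-robin semaphores, as in the sketch above */

    long det_load(int self, long *addr) {
        sem_wait(&turn[self]);               /* wait for this thread's turn */
        long v = *addr;                      /* the one scheduled access */
        sem_post(&turn[(self + 1) % N]);     /* pass the turn on */
        return v;
    }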

Open Challenge [WODET '11]
Either determinism or efficiency, but not both:

Type of Schedule | Determinism           | Efficiency
Sync             | only race-free code   | yes
Mem              | yes                   | no

Can we get both?

Yes, we can!

Type of Schedule | Determinism           | Efficiency
Sync             | only race-free code   | yes
Mem              | yes                   | no
PEREGRINE        | yes                   | yes

PEREGRINE Insight
Races rarely occur
– Intuitively, many races have already been detected and fixed
– Empirically, six real apps each had at most 10 races
Hybrid schedule
– Sync-schedule over the race-free portion (the bulk of the execution)
– Mem-schedule over the racy portion (small)
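To make "hybrid schedule" concrete, one plausible representation (our sketch, not the paper's exact data structure) pairs a total order of sync operations with order constraints only for the racy accesses:

    /* Hypothetical representation of a hybrid schedule. */
    enum sync_op { CREATE, LOCK, UNLOCK, BARRIER_WAIT };

    struct sync_event { int tid; enum sync_op op; };   /* one sync operation */
    struct order_edge { int tid_a, inst_a;             /* racy access A ...  */
                        int tid_b, inst_b; };          /* ...runs before B   */

    struct hybrid_schedule {
        struct sync_event *sync_order;  int nsync;   /* total order of sync ops */
        struct order_edge *race_order;  int nraces;  /* order for racy accesses */
    };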

PEREGRINE: Efficient DMT
Schedule relaxation
– Record an execution trace for a new input
– Relax the trace into a hybrid schedule
– Reuse the schedule on many inputs: deterministic + efficient
Reuse rates are high (e.g., 90.3% for Apache [TERN OSDI '10])
Fully automatic, using new program analysis techniques
Runs on Linux, in user space
Handles Pthread synchronization operations
Works with server programs [TERN OSDI '10]

Summary of Results
Evaluated on a diverse set of 18 programs
– 4 real applications: Apache, PBZip2, aget, pfscan
– 13 scientific programs (10 from SPLASH2, 3 from PARSEC)
– Racey (a popular stress-testing benchmark for determinism)
Deterministically resolves all races
Efficient: from 54% faster to 49% slower
Stable: frequently reuses schedules for 9 programs
– Many benefits, e.g., reusing known-good schedules [TERN OSDI '10]

Outline
PEREGRINE overview
An example
Evaluation
Conclusion

PEREGRINE Overview
[Architecture diagram: the Instrumentor (built on LLVM) instruments the Program Source. At runtime, each INPUT is matched against the Schedule Cache of (INPUT, Si) pairs. Hit: the Replayer runs the program on the OS under the cached schedule. Miss: the Recorder runs the program and logs an execution trace, which the Analyzer turns into a schedule to cache.]

Outline
PEREGRINE overview
An example
Evaluation
Conclusion

An Example

main(argc, char *argv[]) {
  nthread = atoi(argv[1]);            // Read input.
  size = atoi(argv[2]);
  for (i = 1; i < nthread; ++i)       // Create child threads.
    pthread_create(worker);
  worker();
  // Missing pthread_join()
  if ((flag = atoi(argv[3])) == 1)    // If flag is 1, update result.
    result += …;
  printf("%d\n", result);             // Read from result.
}

worker() {
  char *data;
  data = malloc(size/nthread);        // Allocate data of size/nthread.
  for (i = 0; i < size/nthread; ++i)
    data[i] = myRead(i);              // Read data from disk and compute.
  pthread_mutex_lock(&mutex);         // Grab mutex.
  result += …;                        // Work: write to result.
  pthread_mutex_unlock(&mutex);
}
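For readers who want to run the example, below is a self-contained completion; this is our reconstruction rather than the talk's code: the elided "…" work is replaced by a trivial increment, myRead() by a stub, and nthread is assumed to be at most 64.

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
    static int nthread, size, flag, result;

    static char myRead(int i) { return (char)i; }   /* stub for disk read */

    static void *worker(void *arg) {
        char *data = malloc(size / nthread);
        for (int i = 0; i < size / nthread; ++i)
            data[i] = myRead(i);
        pthread_mutex_lock(&mutex);
        result += 1;                                /* the "..." work */
        pthread_mutex_unlock(&mutex);
        free(data);
        return NULL;
    }

    int main(int argc, char *argv[]) {
        nthread = atoi(argv[1]);                    /* assumed <= 64 */
        size = atoi(argv[2]);
        pthread_t t[64];
        for (int i = 1; i < nthread; ++i)
            pthread_create(&t[i], NULL, worker, NULL);
        worker(NULL);
        /* Missing pthread_join(): main may read result before workers
           finish -- the race the talk later resolves deterministically. */
        if ((flag = atoi(argv[3])) == 1)
            result += 1;
        printf("%d\n", result);
        return 0;
    }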

Instrumentor
(Same example code as above.) The instrumentor marks what PEREGRINE must track and control:
– Instrument command-line arguments.
– Instrument the read() function within myRead().
– Instrument synchronization operations.

Recorder
$ ./a.out
Running the instrumented program records a per-thread execution trace of branches, sync operations, and instrumented calls:

Thread 0 (main):
  nthread = atoi()
  size = atoi()
  (1 < nthread) == 1
  pthread_create()
  (2 < nthread) == 0
  worker():
    data = malloc()
    (0 < size/nthread) == 1
    data[i] = myRead()
    (1 < size/nthread) == 0
    lock(); result += …; unlock()
  (flag == 1) == 0
  printf(…, result)

Thread 1 (worker):
  data = malloc()
  (0 < size/nthread) == 1
  data[i] = myRead()
  (1 < size/nthread) == 0
  lock(); result += …; unlock()

Analyzer: Hybrid Schedule
From the trace, the analyzer extracts a hybrid schedule:
– Sync-schedule: a total order of the synchronization operations — pthread_create(), then the two lock() … unlock() critical sections in the recorded order across Thread 1 and Thread 0.
– Mem-schedule, for the racy accesses only: Thread 1's result += … is ordered before Thread 0's printf(…, result), the race left exposed by the missing pthread_join().
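In the hypothetical representation sketched earlier, this example's schedule could be written as follows (thread and instruction ids are illustrative, and we assume thread 1's critical section was recorded first):

    /* Uses the sync_event / order_edge types sketched earlier. */
    struct sync_event order[] = {
        {0, CREATE}, {1, LOCK}, {1, UNLOCK}, {0, LOCK}, {0, UNLOCK},
    };
    struct order_edge races[] = {
        /* thread 1's "result += ..." before thread 0's printf(result) */
        {1, /*inst_a=*/42, 0, /*inst_b=*/57},
    };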

Analyzer: Preconditions
The analyzer computes preconditions on the input (here, to ./a.out) under which the hybrid schedule can be reused on future runs.
Challenges:
– Ensure the schedule is feasible for the new input
– Ensure the run introduces no new races
(Slide shows the recorded trace, the hybrid schedule, and the main() source as before.)

Naïve Approach to Computing Preconditions
Collect the branch conditions along the recorded trace and translate them into constraints on the input:
  (1 < nthread) == 1 and (2 < nthread) == 0  ⇒  nthread == 2
  (0 < size/nthread) == 1 and (1 < size/nthread) == 0  ⇒  size == 2
  (flag == 1) == 0  ⇒  flag != 1

Analyzer: Preconditions (the Naïve Way)
Resulting constraints: nthread == 2, size == 2, flag != 1
Problem: over-constraining!
– size must be 2 to reuse the schedule, even though the schedule does not depend on size
This problem absorbed most of our brainpower for this paper!
Solution: two new program analysis techniques; see the paper

Analyzer: Preconditions
After slicing away the constraints that do not affect the schedule, only the relevant ones remain:
  (1 < nthread) == 1 and (2 < nthread) == 0  ⇒  nthread == 2
  (flag == 1) == 0  ⇒  flag != 1
The size constraints are dropped: the loops over data[] do not touch the schedule.
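At reuse time, checking these simplified preconditions is just a cheap test on the new input; a hypothetical sketch matching the example's command line:

    #include <stdlib.h>

    /* Hypothetical precondition check before reusing the cached schedule;
       argv indices follow the example program (nthread, size, flag). */
    int preconditions_hold(int argc, char *argv[]) {
        int nthread = atoi(argv[1]);
        int flag    = atoi(argv[3]);
        return nthread == 2 && flag != 1;   /* size is unconstrained */
    }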

Replayer
$ ./a.out
On a new run, the replayer checks the preconditions (nthread == 2, flag != 1) against the input. If they hold, it enforces the cached hybrid schedule: the recorded order of pthread_create() and the lock()/unlock() regions, plus thread 1's result += … before thread 0's printf(…, result).
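One way to enforce the race-order part of the schedule at replay time is a semaphore per order edge; a minimal sketch (our illustration; PEREGRINE's actual mechanism may differ):

    /* Hypothetical sketch of enforcing one race-order edge at replay:
       the write side posts a semaphore that the read side waits on, so
       "result += ..." always precedes "printf(result)".
       Call sem_init(&race_edge, 0, 0) during startup. */
    #include <semaphore.h>
    static sem_t race_edge;

    /* instrumented after thread 1's racy write */
    void after_racy_write(void) { sem_post(&race_edge); }

    /* instrumented before thread 0's racy read */
    void before_racy_read(void) { sem_wait(&race_edge); }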

Benefits of PEREGRINE
Deterministic: resolves the race on result; introduces no new data races
Efficient: the loops over data[] run in parallel
Stable [TERN OSDI '10]: the schedule can be reused for any data size or contents
Other applications are possible; talk to us!

Outline
PEREGRINE overview
An example
Evaluation
Conclusion

General Experiment Setup
Program–workload pairs:
– Apache: download a 100KB HTML page using ApacheBench
– PBZip2: compress a 10MB file
– Aget: download linux tar.bz2 (77MB)
– Pfscan: scan for a keyword across 100 files from the gcc project
– 13 scientific benchmarks (10 from SPLASH2, 3 from PARSEC): run for … ms
– Racey: default workload
Machine: 2.67GHz dual-socket quad-core Intel Xeon (eight cores) with 24GB memory
Concurrency: eight threads in all experiments

Determinism

Program        | # Races | Sync-schedule | Hybrid schedule
Apache         | 0       | deterministic | deterministic
PBZip2         | 4       | no            | deterministic
barnes         | 5       | no            | deterministic
fft            | 10      | no            | deterministic
lu-non-contig  | 10      | no            | deterministic
streamcluster  | 0       | deterministic | deterministic
racey          | …       | no            | deterministic

A sync-schedule alone is deterministic only for the race-free programs; the hybrid schedule is deterministic for all of them.

Overhead in Reusing Schedules
[Figure: normalized execution time of PEREGRINE when reusing schedules, per benchmark; per the results summary, from 54% faster to 49% slower than nondeterministic execution.]

% of Instructions Left in the Trace
[Figure: percentage of recorded instructions remaining after the analyzer's slicing, per benchmark.]

Conclusion
Hybrid schedule: combines the best of sync-schedules and mem-schedules
PEREGRINE
– Uses schedule relaxation to compute hybrid schedules
– Deterministic: makes all 7 racy programs deterministic
– Efficient: from 54% faster to 49% slower
– Stable: frequently reuses schedules for 9 out of 17 programs
Has broad applications

Thank you! Questions?