Parrot: A Practical Runtime for Deterministic, Stable, and Reliable Threads
Heming Cui, Jiri Simsa, Yi-Hong Lin, Hao Li, Ben Blum, Xinan Xu, Junfeng Yang, Garth Gibson, Randal Bryant

Parrot: A Practical Runtime for Deterministic, Stable, and Reliable Threads
Heming Cui, Jiri Simsa, Yi-Hong Lin, Hao Li, Ben Blum, Xinan Xu, Junfeng Yang, Garth Gibson, Randal Bryant
Columbia University / Carnegie Mellon University 1

Parrot Preview
Multithreading: hard to get right
– Key reason: too many thread interleavings, or schedules
Techniques to reduce the number of schedules
– Deterministic Multithreading (DMT)
– Stable Multithreading (StableMT)
– Challenges: too slow or too complicated to deploy
Parrot: a practical StableMT runtime
– Fast and deployable: effective performance hints
– Greatly improves reliability 2

Too Many Schedules in Multithreading
// thread 1: ...; lock(m); ...; unlock(m);
// thread N: ...; lock(m); ...; unlock(m);
Schedule: a total order of synchronizations
N threads, each doing K such steps: N! orderings per step, so (N!)^K schedules, and that is only a lower bound!
# of schedules: exponential in both N and K
All inputs: many more schedules
All schedules vs. checked schedules 3

All schedules
– Benefits pretty much all reliability techniques
– E.g., improves the precision of static analysis [Wu PLDI 12]
Stable Multithreading (StableMT): reducing the number of schedules for all inputs [HotPar 13] [CACM 14]
// thread 1: ...; lock(m); ...; unlock(m);
// thread N: ...; lock(m); ...; unlock(m);
Checked schedules 4

Conceptual View
Traditional multithreading
– Hard to understand, test, analyze, etc.
Stable Multithreading (StableMT)
– E.g., [Tern OSDI 10] [Determinator OSDI 10] [Peregrine SOSP 11] [Dthreads SOSP 11]
Deterministic Multithreading (DMT)
– E.g., [Dmp ASPLOS 09] [Kendo ASPLOS 09] [CoreDet ASPLOS 10] [dOS OSDI 10]
StableMT is better! [HotPar 13] [CACM 14] 5

Challenges of StableMT
Performance challenge: slow
– Ignoring load balance (e.g., [Dthreads SOSP 11]): serializes parallelism (5x slowdown on 30% of programs)
Deployment challenge: too complicated
– Reusing schedules (e.g., [Tern OSDI 10] [Peregrine SOSP 11] [Ics OOPSLA 13]): requires sophisticated program analysis
// thread 1: ...; lock(m); ...; unlock(m); compute();
// thread N: ...; lock(m); ...; unlock(m);
6

Parrot Key Insight
The rule
– Most threads spend the majority of their time in a small number of core computations
Solution for good performance
– The StableMT schedules only need to balance these core computations 7

Parrot: A Practical StableMT Runtime
Simple: a runtime system in user space
– Enforces a round-robin schedule for Pthreads synchronizations
Flexible: performance hints
– Soft barrier: co-schedule threads at core computations
– Performance critical section: get through the section fast
Practical: evaluated on 108 popular programs
– Easy to use: 1.2 lines of hints, 0.5~2 hours per program
– Fast: 6.9% overhead on 55 real-world programs, 12.7% for all
– Scalable: 24-core machine, different input sizes
– Reliable: improves the coverage of [Dbug SPIN 11] by 10^6 ~

Outline Example Evaluation Conclusion 9

An Example based on PBZip2

int main(int argc, char *argv[]) {
  for (i = 0; i < atoi(argv[1]); ++i)    // argv[1]: # of threads
    pthread_create(…, consumer, 0);
  for (i = 0; i < atoi(argv[2]); ++i) {  // argv[2]: # of file blocks
    block = block_read(i, argv[3]);      // argv[3]: file name
    add(queue, block);
  }
}

void *consumer(void *arg) {
  for (;;) {                             // exit logic elided for clarity
    block = get(queue);                  // blocking call
    compress(block);                     // core computation
  }
}

// add() expands to:
pthread_mutex_lock(&mu);
enqueue(queue, block);
pthread_cond_signal(&cv);
pthread_mutex_unlock(&mu);

// get() expands to:
pthread_mutex_lock(&mu);                 // termination logic elided
while (empty(queue))
  pthread_cond_wait(&cv, &mu);
char *block = dequeue(queue);
pthread_mutex_unlock(&mu);
10

The Serialization Problem

LD_PRELOAD=parrot.so pbzip2 2 2 a.txt

(Same PBZip2-style code as the previous slide.)

Round-robin timeline:
main thread: add() … add()
consumer1: get() waits, becomes runnable, get() returns, compress()
consumer2: get() waits, becomes runnable, get() returns, compress()
The compress() calls run one after another: serialized!
Observed 7.7x slowdown with 16 threads in a previous system. 11

Adding Soft Barrier Hints

LD_PRELOAD=parrot.so pbzip2 2 2 a.txt

In main(): soba_init(atoi(argv[1]));
In consumer(), before compress(): soba_wait();

Timeline with hints:
main thread: add() … add()
consumer1: get() waits, get() returns, soba_wait()
consumer2: get() returns, soba_wait()
Both consumers then run compress() in parallel.
Only 0.8% overhead! 12

Performance Hint: Soft Barrier
Usage
– Co-schedule threads at core computations
Interface
– void soba_init(int size, void *id = 0, int timeout = 20);
– void soba_wait(void *id = 0);
Can also benefit
– Other similar systems, and traditional OS schedulers 13

Performance Hint: Performance Critical Section (PCS)
Motivation
– Optimize low-level synchronizations
– E.g., {lock(); x++; unlock();}
Usage
– Get through these sections fast by ignoring round-robin
Interface
– void pcs_enter();
– void pcs_exit();
And can check
– Use model checking tools to completely check schedules in a PCS 14

Evaluation Questions Performance of Parrot Effectiveness of performance hints Improvement on model checking coverage 15

Evaluation Setup
A wide range of 108 programs: 10x more, and complete
– 55 real-world programs: BerkeleyDB, OpenLDAP, MPlayer, etc.
– 53 benchmark programs: Parsec, Splash2x, Phoenix, NPB
– Rich thread idioms: Pthreads, OpenMP, data partition, fork-join, pipeline, map-reduce, and workpile
Concurrency setup
– Machine: 24 cores, Linux
– # of threads: 16 or 24
Inputs
– At least 3 input sizes (small, medium, large) per program 16

Performance of Parrot
[Figure: normalized execution time of Parrot on ImageMagick, GNU C++ Parallel STL, Parsec, Splash2-x, Phoenix, NPB, BerkeleyDB, OpenLDAP, Redis, MEncoder, pbzip2 (compress and decompress), pfscan, and aget] 17

Effectiveness of Performance Hints

                                # programs     # lines     Overhead     Overhead
                                requiring      of hints    w/o hints    w/ hints
Soft barrier                        —             —            —          9.0%
Performance critical section        —             —            —         42.1%
Total                               —             —            —         11.9%

Time: 0.5~2 hours per program, mostly by inexperienced students.
# lines: on average, 1.2 lines per program.
How: deterministic performance debugging + idiom patterns. 18

Improving Dbug's Coverage
Model checking: systematically explores schedules
– E.g., [Dpor POPL 05] [Explode OSDI 06] [MaceMC NSDI 07] [Chess OSDI 08] [Modist NSDI 09] [Demeter SOSP 11] [Dbug SPIN 11]
– Challenge: state-space explosion leads to poor coverage
Parrot+Dbug integration
– Verified 99 of 108 programs under the test setup (1 day); Dbug alone verified only 43
– Reduced the number of schedules for 56 programs by 10^6 ~ (not a typo!) 19

Conclusion and Future Work
Multithreading: too many schedules
Parrot: a practical StableMT runtime system
– Well-defined round-robin synchronization schedules
– Performance hints: flexibly optimize performance
Thorough evaluation
– Easy to use, fast, and scalable
– Greatly improves model checking coverage
Broad applications
– Current: static analysis, model checking
– Future: replication for distributed systems 20

Thank you! Questions? Parrot: Lab: 21