Incremental Parallel and Distributed Systems Pramod Bhatotia MPI-SWS & Saarland University April 2015.


1 Incremental Parallel and Distributed Systems Pramod Bhatotia MPI-SWS & Saarland University April 2015

2 Akshat, Alexander, Björn, Pedro, Rodrigo, Umut, Ekin, Rafael, Flavio

3 Common workflow: rerun the same application with evolving input (scientific computing, reactive systems, data analytics). Small input change → small work to update the output. Goal: efficient execution of apps in successive runs.

4 Existing approaches: (1) re-compute from scratch: (+) easy to design, (-) inefficient; (2) design application-specific algorithms for incremental computation: (+) efficient, (-) hard to design. Goal: automatic + efficient.

5 My thesis research: to enable transparent, practical, and efficient incremental computation in real-world parallel and distributed systems.

6 My projects @ MPI-SWS: Incoop, incremental batch processing [HotCloud’11, SoCC’11]; Shredder, incremental storage [FAST’12]; Slider, incremental stream processing [Middleware’14, Hadoop Summit’15]; Conductor*, incremental job deployment [NSDI’12, LADIS’10, PODC’10]; iThreads, incremental multithreading [ASPLOS’15]. (* 2nd author)

7 Incremental Multithreading (iThreads@mpi-sws.org). Joint work with: Pedro Fonseca and Björn Brandenburg (MPI-SWS), Rodrigo Rodrigues (Nova University of Lisbon), and Umut Acar (CMU).

8 Auto-“incrementalization”: a well-studied topic in the PL community, with compiler- and language-based approaches. But they target sequential programs!

9 Parallelism: there are proposals for parallel incremental computation, e.g., Hammer et al. [DAMP’07] and Burckhardt et al. [OOPSLA’11]. Limitations: they require a new language with special data types, and they restrict the programming model to strict fork-join. Existing multi-threaded programs are not supported!

10 Design goals: 1. Transparency: target unmodified pthreads-based programs. 2. Practicality: support the full range of synchronization primitives. 3. Efficiency: design a parallel infrastructure for incremental computation.

11 iThreads: $ LD_PRELOAD="iThreads.so" $ ./myProgram (initial run); $ echo " " >> changes.txt; $ ./myProgram (incremental run). Speedups w.r.t. pthreads: up to 8X.

12 Outline: motivation, design, evaluation.

13 Behind the scenes. Initial run: step 1, divide the computation into sub-computations; step 2, build the dependence graph. Incremental run: step 3, perform change propagation.

14 A simple example (shared variables: x, y, and z). Thread-1: lock(); z = ++y; unlock(); then lock(); x++; unlock(). Thread-2: local_var++; then lock(); y = 2*x + z; unlock().
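The slide's two-thread example can be sketched in runnable form. This is a minimal Python rendering (the talk targets pthreads/C; Python's `threading.Lock` stands in for `lock()`/`unlock()`), with the threads run one after the other so the result is deterministic for illustration:

```python
import threading

# Shared variables from the slide's example; local_var is thread-local.
x, y, z = 0, 0, 0
lock = threading.Lock()

def thread_1():
    global x, y, z
    with lock:          # sub-computation: reads y, writes y and z
        y += 1
        z = y
    with lock:          # sub-computation: reads and writes x
        x += 1

def thread_2():
    global y
    local_var = 0
    local_var += 1      # sub-computation: touches only thread-local state
    with lock:          # sub-computation: reads x and z, writes y
        y = 2 * x + z

t1 = threading.Thread(target=thread_1)
t2 = threading.Thread(target=thread_2)
t1.start(); t1.join()   # forced schedule: thread-1 entirely before thread-2
t2.start(); t2.join()
print(x, y, z)  # with this schedule: 1 3 1
```

Under a different interleaving the final values of y and z would differ, which is exactly why the later slides record the schedule and the per-sub-computation read/write sets.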

15 Step 1 (divide: computation → sub-computations). How do we divide a computation into sub-computations?

16 Sub-computations (example from slide 14). Entire thread? “Coarse-grained”: a small change implies recomputing the entire thread. Single instruction? “Fine-grained”: requires tracking individual load/store instructions.

17 Sub-computation granularity: between the extremes of a single instruction (expensive tracking) and an entire thread (too coarse-grained). A sub-computation is a sequence of instructions between pthreads synchronization points; the Release Consistency (RC) memory model is used to define the granularity of a sub-computation.
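The granularity rule above can be sketched as a small function that splits one thread's instruction trace at synchronization points. The event encoding (`"sync"` vs. `"op"` tuples) is an assumption for illustration, not iThreads' actual representation:

```python
# Hypothetical sketch: split a thread's trace into sub-computations at
# synchronization points, per the slide's definition.

def split_into_subcomputations(trace):
    """trace: list of ("sync", name) or ("op", text) events.
    Each sub-computation is the run of ops between sync events."""
    subs, current = [], []
    for kind, payload in trace:
        if kind == "sync":
            if current:          # close the current sub-computation
                subs.append(current)
                current = []
        else:
            current.append(payload)
    if current:
        subs.append(current)
    return subs

# Thread-1 from the slide-14 example:
trace = [
    ("sync", "lock"), ("op", "z = ++y"), ("sync", "unlock"),
    ("sync", "lock"), ("op", "x++"), ("sync", "unlock"),
]
print(split_into_subcomputations(trace))  # [['z = ++y'], ['x++']]
```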

18 Sub-computations (example from slide 14): thread-1 splits into sub-computations at its lock()/unlock() boundaries; thread-2 likewise splits into its own sub-computations.

19 Step 2 (build: dependence graph). How do we build the dependence graph?

20 Example: changed schedule (shared variables: x, y, and z). Recorded read/write sets: thread-1: Read={y} Write={y,z}; Read={x} Write={x}; thread-2: Read={local_var} Write={local_var}; Read={x,z} Write={y}. A different schedule reorders the sub-computations. (1): Record happens-before dependencies.

21 Example: same schedule (shared variables: x, y, and z). With the same schedule, the recorded read/write sets are unchanged: thread-1: Read={y} Write={y,z}; Read={x} Write={x}; thread-2: Read={local_var} Write={local_var}; Read={x,z} Write={y}.

22 Example: changed input (shared variables: x, y, and z). With a different input y’, the sub-computations whose read sets contain the changed data must re-run: Read={y} Write={y,z}, and then, because z changes, Read={x,z} Write={y}. (2): Record data dependencies.

23 Dependence graph. Vertices: sub-computations. Edges: happens-before order between sub-computations; intra-thread: totally ordered based on execution order; inter-thread: partially ordered based on synchronization primitives. Data dependency between sub-computations: if they can be ordered by happens-before, and the antecedent writes data that the subsequent sub-computation reads.
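The data-dependency rule on this slide can be sketched directly from the recorded read/write sets. A minimal version, assuming sub-computations are given as tuples in a happens-before-consistent order (names like `t1.a` are illustrative, not from the source):

```python
# Hedged sketch: derive data-dependency edges from recorded read/write
# sets, per the slide's definition.

def dependence_edges(subs):
    """subs: list of (name, read_set, write_set) in happens-before order.
    Returns edges (a, b) where antecedent a writes data that b reads."""
    edges = []
    for i, (a, _, write_a) in enumerate(subs):
        for b, read_b, _ in subs[i + 1:]:
            if write_a & read_b:   # a's writes feed b's reads
                edges.append((a, b))
    return edges

# Read/write sets recorded for the slide-14 example:
subs = [
    ("t1.a", {"y"}, {"y", "z"}),             # z = ++y
    ("t1.b", {"x"}, {"x"}),                  # x++
    ("t2.a", {"local_var"}, {"local_var"}),  # local_var++
    ("t2.b", {"x", "z"}, {"y"}),             # y = 2*x + z
]
print(dependence_edges(subs))  # [('t1.a', 't2.b'), ('t1.b', 't2.b')]
```

Both of thread-1's sub-computations feed the final one in thread-2 (via z and x respectively), matching the slide-22 observation that a change to y propagates to y = 2*x + z.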

24 Step 3 (perform: change propagation). How do we perform change propagation?

25 Change propagation. Dirty set ← {changed input}. For each sub-computation in a thread, checked in the recorded happens-before order: if (read set ∩ dirty set ≠ ∅), re-compute the sub-computation and add its write set to the dirty set; else, skip execution of the sub-computation and apply its memoized effects.
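The change-propagation loop above can be sketched as follows, again assuming sub-computations carry their recorded read/write sets (the tuple encoding and names are illustrative):

```python
# Sketch of the change-propagation loop: dirty data flows forward
# through the recorded happens-before order.

def change_propagate(subs, changed_input):
    """subs: list of (name, read_set, write_set) in happens-before order.
    Returns the names of sub-computations that must be re-computed."""
    dirty = set(changed_input)
    recomputed = []
    for name, reads, writes in subs:
        if reads & dirty:             # read set intersects dirty set
            recomputed.append(name)   # re-compute this sub-computation
            dirty |= writes           # its outputs become dirty too
        # else: skip it and apply its memoized effects
    return recomputed

subs = [
    ("t1.a", {"y"}, {"y", "z"}),             # z = ++y
    ("t1.b", {"x"}, {"x"}),                  # x++
    ("t2.a", {"local_var"}, {"local_var"}),  # local_var++
    ("t2.b", {"x", "z"}, {"y"}),             # y = 2*x + z
]
print(change_propagate(subs, {"y"}))  # ['t1.a', 't2.b']
```

Changing y dirties z (written by the first sub-computation), which in turn forces y = 2*x + z to re-run; x++ and local_var++ are skipped and their memoized effects reused, which is where the incremental-run speedup comes from.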

26 Outline: motivation, design, evaluation.

27 Evaluating iThreads: 1. speedups for the incremental run; 2. overheads for the initial run. Implementation (for Linux): a 32-bit dynamically linkable shared library. Platform: evaluated on a 12-core Intel Xeon running Linux.

28 1. Speedup for the incremental run (figure: speedups vs. pthreads): up to 8X.

29 Memoization overhead (figure: space overheads w.r.t. input size): iThreads writes a lot of intermediate state.

30 2. Overheads for the initial run (figure: runtime overheads vs. pthreads).

31 Summary: a case for parallel incremental computation. iThreads, incremental multithreading: transparent (targets unmodified programs), practical (supports the full range of sync primitives), efficient (employs parallelism for change propagation). Usage: a dynamically linkable shared library.

32 Incremental systems: transparent + practical + efficient. bhatotia@mpi-sws.org. Thank you all!

