Two for the Price of One: A Model for Parallel and Incremental Computation
Sebastian Burckhardt, Daan Leijen, Tom Ball (Microsoft Research, Redmond)
Caitlin Sadowski, Jaeheon Yi (University of California, Santa Cruz)

Motivation: Compute-Mutate Loops
- A common pattern in applications (e.g. browsers, games, compilers, spreadsheets, editors, forms, simulations): the program alternates between a mutate phase (nondeterministic, performs I/O) and a compute phase (deterministic, no I/O, potentially parallel).
- Goal: perform better (faster, less power).

Incremental Computation vs. Parallel Computation
[diagram: an incremental computation reuses work when mapping a changed input' to output'; a parallel computation maps one input to its output faster]
Which one would you choose? Do we have to choose?

Incremental + Parallel Computation
[diagram: a single computation from input to output that is both parallel and incremental]
- Work stealing: dynamic task graph
- Self-adjusting computation: dynamic dependence graph

Two for the Price of One
Wanted: a programming model for parallel & incremental computation (from input to output), with a small set of primitives to express the computation.

Compute-Mutate Loop Examples

Our Primitives: fork, join, record, repeat
- Start with deterministic parallel programming: the Concurrent Revisions model
  - fork and join revisions (= isolated tasks)
  - declare shared data and the operations on it
- Add primitives for record and repeat
  - c = record { f(); } for some computation f()
  - repeat c is equivalent to calling f() again, but faster
  - the compute-mutate loop does record - mutate - repeat - mutate - repeat ...
A toy stand-in for record/repeat is sketched below.
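A minimal C# stand-in for the two new primitives, assuming hypothetical names (Computation, Record, Repeat); this toy version simply re-executes the recorded delegate and has none of the incremental reuse the real runtime provides:

    using System;

    // Toy stand-in for record/repeat: Record runs the computation once and
    // remembers it; Repeat re-executes it. (The real runtime would replay
    // only the summaries whose dependencies changed.)
    class Computation
    {
        private readonly Action body;
        private Computation(Action body) { this.body = body; }

        public static Computation Record(Action body)
        {
            var c = new Computation(body);
            body();            // first (recorded) execution
            return c;
        }

        public void Repeat() => body();
    }

    class Demo
    {
        static int x = 1;

        static void Main()
        {
            var c = Computation.Record(() => Console.WriteLine(x * 2)); // record: prints 2
            x = 100;                                                    // mutate
            c.Repeat();                                                 // repeat: prints 200
        }
    }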

Concurrent Revisions Model [OOPSLA '10] [ESOP '11] [WoDet '11]
- Deterministic parallelism by fork and join (fork creates concurrent tasks called revisions)
- Revisions are isolated
  - fork copies all state
  - join replays the revision's updates
- Use optimized types (copy-on-write, merge functions)
[diagram: x.Set(1); fork; the main branch does x.Set(2) while the revision does x.Add(3); join]
The toy sketch below makes this isolation concrete for a single versioned variable.
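A toy, sequential C# sketch of revision isolation for one versioned integer; the names (Versioned, Revision, Join) are modeled on the paper's pseudocode, and unlike the real library this version runs the revision's body eagerly at fork rather than in parallel:

    using System;

    // A versioned integer: fork snapshots it, join installs the revision's
    // final value. For a plain versioned type, the joined revision wins over
    // concurrent main-branch writes, which keeps the outcome deterministic.
    class Versioned
    {
        public int Value;
    }

    class Revision
    {
        private readonly Versioned shared;
        private readonly Versioned copy;   // "fork copies all state"

        public Revision(Versioned x, Action<Versioned> body)
        {
            shared = x;
            copy = new Versioned { Value = x.Value };  // snapshot at fork
            body(copy);                                // runs isolated on the copy
        }

        public void Join() => shared.Value = copy.Value;  // "join replays updates"
    }

    class Demo
    {
        static void Main()
        {
            var x = new Versioned { Value = 0 };
            x.Value = 1;                                 // x.Set(1)
            var r = new Revision(x, c => c.Value += 3);  // revision: x.Add(3), sees 1
            x.Value = 2;                                 // main branch: x.Set(2)
            r.Join();
            Console.WriteLine(x.Value);                  // prints 4 (revision wins)
        }
    }

With a cumulative type and an additive merge function, join would instead merge the two branches to 2 + (4 - 1) = 5; that is what the slide's "optimized types" provide.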

Example: Parallel Sum

Example, Step 1: Record

Example (cont'd), Step 2: Mutate; Step 3: Repeat. A sketch of the full cycle follows below.
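A sketch of the parallel-sum example's record/mutate/repeat cycle, with plain .NET tasks standing in for forked revisions; the threshold value and array setup are assumptions, and unlike the prototype this version recomputes everything on the second run:

    using System;
    using System.Threading.Tasks;

    class ParallelSum
    {
        // Granularity control: stop forking below this size
        // (see the optimizations slide).
        const int Threshold = 1 << 14;

        static long Sum(int[] a, int lo, int hi)
        {
            if (hi - lo <= Threshold)
            {
                long s = 0;
                for (int i = lo; i < hi; i++) s += a[i];
                return s;
            }
            int mid = lo + (hi - lo) / 2;
            var left = Task.Run(() => Sum(a, lo, mid));   // stands in for fork
            long right = Sum(a, mid, hi);
            return left.Result + right;                   // stands in for join
        }

        static void Main()
        {
            var a = new int[1000000];
            for (int i = 0; i < a.Length; i++) a[i] = 1;

            Console.WriteLine(Sum(a, 0, a.Length));  // "record": 1000000
            a[42] = 5;                               // mutate one element
            Console.WriteLine(Sum(a, 0, a.Length));  // "repeat": 1000004; the prototype
                                                     // would reuse every summary except
                                                     // those covering a[42]
        }
    }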

How does it work? On record:
- Create an ordered tree of summaries (summary = revision)
- The Revisions library already stores the effects of revisions
  - we can keep them around to "re-execute" (= join again)
  - we can track dependencies
- Record dependencies, invalidate summaries
- At the time of a fork, we know whether its summary is still valid
[diagram: summary tree r with children r1, r2 and grandchildren r1.1, r1.2, r2.1, r2.2]
A sketch of a summary node follows below.
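What such a summary node could hold, as a sketch (the names and shapes are assumptions, not the prototype's actual types):

    using System;
    using System.Collections.Generic;

    // One node in the ordered summary tree (one summary per revision).
    class Summary
    {
        public List<Summary> Children = new List<Summary>();   // r1, r1.1, r1.2, ...
        public HashSet<object> Reads = new HashSet<object>();  // locations this revision read
        public List<Action> Effects = new List<Action>();      // recorded writes; "join again" replays them
        public bool Valid = true;                              // known before the fork on repeat
    }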

Illustration Example
Consider a computation shaped like this (e.g. our CSS layout algorithm with 3 passes):

    c = record { pass1(); pass2(); pass3(); }

[diagram: three summary trees, one per pass, each r with children r1, r2 and grandchildren r1.1, r1.2, r2.1, r2.2; one summary in a later pass reads x]

Illustration Example (cont'd)

    c = record { pass1(); pass2(); pass3(); }
    x = 100;

[diagram: the summary that read x is now marked as depending on x]

Illustration Example (cont'd)

    c = record { pass1(); pass2(); pass3(); }
    x = 100;
    repeat c;

[diagram, animated over several frames: on repeat, only the summaries that (transitively) depend on x are re-executed; every other recorded summary is simply joined again]

Invalidating Summaries is Tricky
- A summary may depend on external input
  - invalidate between compute and repeat
- A summary may depend on a write by some other summary
  - invalidate if the write changes
  - invalidate if the write disappears
- Dependencies can change
  - invalidate if a new write appears between the old write and the read
- And all of this is concurrent
A sequential sketch of these checks follows below.
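Phrased sequentially, the checks could look like this sketch (the store shape and all names are assumptions; the actual runtime performs these checks concurrently):

    using System.Collections.Generic;

    // What each recorded read remembers: who wrote the location, and what
    // value was read, at record time.
    class Dependency
    {
        public string Location;
        public object WriterAtRecordTime;   // a summary, or the external input
        public object ValueAtRecordTime;
    }

    class Invalidator
    {
        // Current state of the store: location -> (writer, value).
        public Dictionary<string, (object Writer, object Value)> Store =
            new Dictionary<string, (object Writer, object Value)>();

        public bool StillValid(IEnumerable<Dependency> reads)
        {
            foreach (var d in reads)
            {
                var cur = Store[d.Location];
                if (!Equals(cur.Writer, d.WriterAtRecordTime))
                    return false;  // a write appeared, disappeared, or moved
                if (!Equals(cur.Value, d.ValueAtRecordTime))
                    return false;  // same writer, but the written value changed
            }
            return true;
        }
    }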

Who has to think about what
Our runtime:
- detects the nature of dependencies (dynamic tracking of reads and writes)
- records and replays the effects of revisions
- schedules revisions in parallel on a multiprocessor (based on TPL, a work-stealing scheduler)
The programmer:
- explicitly structures the computation (fork, join)
- declares the data that participates in concurrent accesses and dependencies
- thinks about performance: revision granularity, applying the marker optimization

Optimizations by the Programmer
- Granularity control
  - Problem: too much overhead if revisions do not contain enough work
  - Solution: use the usual techniques (e.g. a recursion threshold, as in the parallel-sum sketch) to keep revisions large enough
- Markers
  - Problem: too much overhead when tracking individual memory locations (e.g. the bytes of a picture)
  - Solution: use a marker object to represent a group of locations (e.g. a tile of the picture)
A sketch of the marker idea follows below.
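A sketch of the marker idea (the names and tile layout are assumptions): every pixel access is funneled through one marker per tile, so the runtime records a single dependency per tile instead of one per byte:

    // One marker object stands for a whole group of locations; the runtime
    // tracks reads/writes of the marker rather than of each pixel. The
    // Read/Write bodies are placeholders for the runtime's tracking hooks.
    class Marker
    {
        public void Read()  { /* runtime: current revision read this group  */ }
        public void Write() { /* runtime: current revision wrote this group */ }
    }

    class Picture
    {
        const int TileSize = 64;
        readonly int[,] pixels = new int[1024, 1024];
        readonly Marker[,] tiles = new Marker[16, 16];   // 1024 / 64 = 16 tiles per side

        public Picture()
        {
            for (int i = 0; i < 16; i++)
                for (int j = 0; j < 16; j++)
                    tiles[i, j] = new Marker();
        }

        public int Get(int x, int y)
        {
            tiles[x / TileSize, y / TileSize].Read();    // one tracked location per tile
            return pixels[x, y];
        }

        public void Set(int x, int y, int v)
        {
            tiles[x / TileSize, y / TileSize].Write();
            pixels[x, y] = v;
        }
    }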

Results on 5 Benchmarks
On 8 cores:
- recording is still faster than the baseline (1.8x - 6.4x)
- repeat after a small change is significantly faster than the baseline (12x - 37x)
- repeat after a large change costs the same as record

What part does parallelism play?
- Without parallelism, record is up to 31% slower than the baseline
- Without parallelism, repeat after a small change is still 4.9x - 24x faster than the baseline

Controlling Task Granularity is Important

Contributions
- Philosophical: connect incremental and parallel computation
- Programming model: Concurrent Revisions + record/repeat
- Algorithm: concurrent summary creation, invalidation, and replay
- Empirical: implemented a C# prototype, measured it on examples, and identified the important optimizations

THANK YOU!