Presentation transcript:

Cilk: a C language for programming dynamic multithreaded applications on shared-memory multiprocessors. Example applications: virus shell assembly, graphics rendering, n-body simulation, heuristic search, dense and sparse matrix computations, friction-stir welding simulation, artificial evolution. © 2006 Charles E. Leiserson

Parallel Programming Model Questions. A programming model is made up of the languages and libraries that create an abstract view of the machine. Control: How is parallelism created? How are dependencies (orderings) enforced? Data: Can data be shared, or is it all private? How is shared data accessed or private data communicated? Synchronization: What operations can be used to coordinate parallelism? What are the atomic (indivisible) operations?

Fibonacci Example: Creating Parallelism

C elision:

    int fib (int n) {
        if (n<2) return (n);
        else {
            int x, y;
            x = fib(n-1);
            y = fib(n-2);
            return (x+y);
        }
    }

Cilk code:

    cilk int fib (int n) {
        if (n<2) return (n);
        else {
            int x, y;
            x = spawn fib(n-1);
            y = spawn fib(n-2);
            sync;
            return (x+y);
        }
    }

Cilk is a faithful extension of C. A Cilk program's serial elision is always a legal implementation of Cilk semantics. Cilk provides no new data types.
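
One way to make "faithful extension" concrete: the serial elision can be produced mechanically by deleting the three keywords, for example with C preprocessor definitions. This is a sketch for illustration only, not what the cilk2c translator actually does:

    /* Eliding the Cilk keywords turns the Cilk fib() above into
       ordinary C that computes the same result serially. */
    #define cilk        /* a Cilk procedure becomes a plain C function */
    #define spawn       /* a spawn becomes an ordinary function call */
    #define sync        /* "sync;" becomes the empty statement ";" */

With these definitions in effect, the Cilk version of fib compiles as standard C and is exactly the serial elision shown above.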

Basic Cilk Keywords

    cilk int fib (int n) {
        if (n<2) return (n);
        else {
            int x, y;
            x = spawn fib(n-1);
            y = spawn fib(n-2);
            sync;
            return (x+y);
        }
    }

cilk identifies a function as a Cilk procedure, capable of being spawned in parallel. spawn means the named child Cilk procedure can execute in parallel with the parent caller. sync means control cannot pass this point until all spawned children have returned.

Dynamic Multithreading

Example: fib(4). [Figure: the computation dag of fib(4), unfolding through the recursive instances fib(3), fib(2), fib(1), and fib(0).] The computation dag unfolds dynamically, and processors are virtualized: the program spawns logical threads without regard to how many physical processors exist.

Algorithmic Complexity Measures

TP = execution time on P processors
T1 = work
T∞ = span*

Lower bounds: TP ≥ T1/P and TP ≥ T∞.

*Also called critical-path length or computational depth.
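
A quick worked instance of the two lower bounds, with numbers chosen purely for illustration: for a computation with work $T_1 = 100$ and span $T_\infty = 10$ on $P = 4$ processors,

$$T_P \;\ge\; \max\!\left(\frac{T_1}{P},\; T_\infty\right) = \max\!\left(\frac{100}{4},\; 10\right) = 25,$$

so no scheduler can finish in fewer than 25 steps, and beyond $P = T_1/T_\infty = 10$ processors the span bound dominates.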

Speedup. Definition: T1/TP = speedup on P processors.
If T1/TP = Θ(P) ≤ P, we have linear speedup;
if T1/TP = P, we have perfect linear speedup;
if T1/TP > P, we have superlinear speedup, which is not possible in our model because of the lower bound TP ≥ T1/P.
T1/T∞ = parallelism = the average amount of work per step along the span.

Example: fib(4). Assume for simplicity that each Cilk thread in fib() takes unit time to execute.
Work: T1 = 17
Span: T∞ = 8
Parallelism: T1/T∞ = 2.125
Using many more than 2 processors makes little sense.

Parallelizing Vector Addition

    void vadd (real *A, real *B, int n) {
        int i;
        for (i=0; i<n; i++) A[i] += B[i];
    }

Parallelizing Vector Addition

C, converted to divide and conquer:

    void vadd (real *A, real *B, int n) {
        if (n <= BASE) {
            int i;
            for (i=0; i<n; i++) A[i] += B[i];
        } else {
            vadd (A, B, n/2);
            vadd (A+n/2, B+n/2, n-n/2);
        }
    }

Parallelization strategy: 1. Convert loops to recursion.

Parallelizing Vector Addition

Cilk:

    cilk void vadd (real *A, real *B, int n) {
        if (n <= BASE) {
            int i;
            for (i=0; i<n; i++) A[i] += B[i];
        } else {
            spawn vadd (A, B, n/2);
            spawn vadd (A+n/2, B+n/2, n-n/2);
            sync;
        }
    }

Parallelization strategy: 1. Convert loops to recursion. 2. Insert Cilk keywords. Side benefit: divide and conquer is generally good for caches!


Vector Addition Analysis

To add two vectors of length n, where BASE = Θ(1):
Work: T1 = Θ(n)
Span: T∞ = Θ(lg n)
Parallelism: T1/T∞ = Θ(n/lg n)
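
These bounds follow from standard divide-and-conquer recurrences; sketching the derivation:

$$T_1(n) = 2\,T_1(n/2) + \Theta(1) = \Theta(n), \qquad T_\infty(n) = T_\infty(n/2) + \Theta(1) = \Theta(\lg n).$$

The span recurrence has only one recursive term because the two spawned halves execute in parallel, so only one branch counts toward the critical path.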

Another Parallelization

C:

    void vadd1 (real *A, real *B, int n) {
        int i;
        for (i=0; i<n; i++) A[i] += B[i];
    }
    void vadd (real *A, real *B, int n) {
        int j;
        for (j=0; j<n; j+=BASE) {
            vadd1 (A+j, B+j, min(BASE, n-j));
        }
    }

Cilk:

    cilk void vadd1 (real *A, real *B, int n) {
        int i;
        for (i=0; i<n; i++) A[i] += B[i];
    }
    cilk void vadd (real *A, real *B, int n) {
        int j;
        for (j=0; j<n; j+=BASE) {
            spawn vadd1 (A+j, B+j, min(BASE, n-j));
        }
        sync;
    }

Analysis

To add two vectors of length n, where BASE = Θ(1):
Work: T1 = Θ(n)
Span: T∞ = Θ(n)
Parallelism: T1/T∞ = Θ(1). PUNY!
The loop spawns its n/BASE chunks one at a time, so the spawning itself contributes Θ(n/BASE) = Θ(n) to the span when BASE = Θ(1).

Optimal Choice of BASE

To add two vectors of length n using an optimal choice of BASE to maximize parallelism:
Work: T1 = Θ(n)
Span: T∞ = Θ(BASE + n/BASE)
Choosing BASE = √n ⇒ T∞ = Θ(√n)
Parallelism: T1/T∞ = Θ(√n)
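
The choice BASE = √n is simply the minimizer of the span expression; a one-line derivation:

$$\mathrm{BASE} + \frac{n}{\mathrm{BASE}} \;\ge\; 2\sqrt{n} \quad \text{(AM-GM)},$$

with equality exactly when $\mathrm{BASE} = n/\mathrm{BASE}$, i.e., $\mathrm{BASE} = \sqrt{n}$, giving $T_\infty = \Theta(\sqrt{n})$ and parallelism $T_1/T_\infty = \Theta(n/\sqrt{n}) = \Theta(\sqrt{n})$.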

Greedy Scheduling

IDEA: Do as much as possible on every step. Definition: A thread is ready if all its predecessors have executed.
Complete step: ≥ P threads ready. Run any P.
Incomplete step: < P threads ready. Run all of them.
[Figure: a computation dag scheduled greedily on P = 3 processors.]
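
Greedy scheduling yields the classic Graham-Brent bound, which matches the two lower bounds from earlier to within a factor of 2:

$$T_P \;\le\; \frac{T_1}{P} + T_\infty,$$

because there are at most $T_1/P$ complete steps (each one consumes $P$ units of work) and at most $T_\infty$ incomplete steps (each one shortens the remaining critical path by 1).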

Cilk's Work-Stealing Scheduler

Each processor maintains a work deque of ready threads, and it manipulates the bottom of the deque like a stack. When a processor runs out of work, it steals a thread from the top of a random victim's deque. [Figure: four processors, each with its own deque; an idle processor steals from the top of a victim's deque.]

Performance of Work-Stealing

Theorem: Cilk's work-stealing scheduler achieves an expected running time of TP ≤ T1/P + O(T∞) on P processors.

Pseudoproof. A processor is either working or stealing. The total time all processors spend working is T1. Each steal has a 1/P chance of reducing the span by 1. Thus, the expected cost of all steals is O(P·T∞). Since there are P processors, the expected time is (T1 + O(P·T∞))/P = T1/P + O(T∞). ■

Space Bounds

Theorem. Let S1 be the stack space required by a serial execution of a Cilk program. Then the space required by a P-processor execution is at most SP ≤ P·S1.

Proof (by induction). The work-stealing algorithm maintains the busy-leaves property: every extant procedure frame with no extant descendants has a processor working on it. ■

[Figure: P = 3 processors, each holding at most S1 of stack space.]

Cilk vs. Pthreads

How will the following code execute in Pthreads? In Cilk?

    for (i=1; i<1000000000; i++) {
        spawn-or-fork foo(i);
    }
    sync-or-join;

What if foo contains code that waits (e.g., spins) on a variable being set by another instance of foo? The difference is a liveness property:
Cilk threads are spawned lazily ("may" parallelism).
Pthreads are spawned eagerly ("must" parallelism).
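
For contrast, here is a minimal sketch of the eager Pthreads version of such a loop, with a hypothetical foo and the count cut down so the program can actually run. Each pthread_create demands a real thread immediately, which is why a billion-iteration version is hopeless, whereas a Cilk spawn merely records work that may run in parallel:

    #include <pthread.h>
    #include <stdio.h>

    /* Hypothetical work function standing in for foo(i). */
    static void *foo(void *arg) {
        printf("foo(%ld)\n", (long)arg);
        return NULL;
    }

    int main(void) {
        enum { N = 8 };                 /* a billion creates would exhaust the OS */
        pthread_t tid[N];
        for (long i = 0; i < N; i++)
            pthread_create(&tid[i], NULL, foo, (void *)i);  /* eager: a thread NOW */
        for (long i = 0; i < N; i++)
            pthread_join(tid[i], NULL);                     /* join-all, like sync */
        return 0;
    }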

Operating on Returned Values

Programmers may sometimes wish to incorporate a value returned from a spawned child into the parent frame by means other than a simple variable assignment. Example:

    x += spawn foo(a,b,c);

Cilk achieves this functionality using an internal function, called an inlet, which is executed as a secondary thread on the parent frame when the child returns.

Semantics of Inlets

    int max, ix = -1;

    inlet void update ( int val, int index ) {
        if (ix == -1 || val > max) {
            ix = index;
            max = val;
        }
    }

    for (i=0; i<1000000; i++) {
        update ( spawn foo(i), i );
    }
    sync; /* ix now indexes the largest foo(i) */

The inlet keyword defines a void internal function to be an inlet. In the current implementation of Cilk, the inlet definition may not contain a spawn, and only the first argument of the inlet may be spawned at the call site.

Semantics of Inlets, continued. For the call

    update ( spawn foo(i), i );

the following steps occur: 1. The non-spawn args to update() are evaluated. 2. The Cilk procedure foo(i) is spawned. 3. Control passes to the next statement. 4. When foo(i) returns, update() is invoked.

Aborting Cilk Threads: Computing a Product

$$p = \prod_{i=0}^{n-1} A_i$$

    int product(int *A, int n) {
        int i, p = 1;
        for (i=0; i<n; i++) {
            p *= A[i];
        }
        return p;
    }

Optimization: Quit early if the partial product ever becomes 0.
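
In the serial code this optimization is a one-line change; a sketch:

    int product(int *A, int n) {
        int i, p = 1;
        for (i=0; i<n; i++) {
            p *= A[i];
            if (p == 0) break;   /* quit early: once zero, the product stays zero */
        }
        return p;
    }

The interesting question, taken up next, is how to get the same early exit once the recursive halves run in parallel.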

Computing a Product in Parallel

$$p = \prod_{i=0}^{n-1} A_i$$

    cilk int product(int *A, int n) {
        int p = 1;
        if (n == 1) {
            return A[0];
        } else {
            p *= spawn product(A, n/2);
            p *= spawn product(A+n/2, n-n/2);
            sync;
            return p;
        }
    }

How do we quit early if we discover a zero?

Cilk's Abort Feature

1. Recode the implicit inlet to make it explicit.

    cilk int product(int *A, int n) {
        int p = 1;
        inlet void mult(int x) {
            p *= x;
            return;
        }
        if (n == 1) {
            return A[0];
        } else {
            mult( spawn product(A, n/2) );
            mult( spawn product(A+n/2, n-n/2) );
            sync;
            return p;
        }
    }

2. Check for 0 within the inlet.

    cilk int product(int *A, int n) {
        int p = 1;
        inlet void mult(int x) {
            p *= x;
            if (p == 0) {
                abort;   /* Aborts existing children, */
            }            /* but not future ones.      */
            return;
        }
        if (n == 1) {
            return A[0];
        } else {
            mult( spawn product(A, n/2) );
            mult( spawn product(A+n/2, n-n/2) );
            sync;
            return p;
        }
    }


3. Don't spawn after an abort.

    cilk int product(int *A, int n) {
        int p = 1;
        inlet void mult(int x) {
            p *= x;
            if (p == 0) {
                abort;   /* Aborts existing children, */
            }            /* but not future ones.      */
            return;
        }
        if (n == 1) {
            return A[0];
        } else {
            mult( spawn product(A, n/2) );
            if (p == 0) {    /* Don't spawn if we've */
                return 0;    /* already aborted!     */
            }
            mult( spawn product(A+n/2, n-n/2) );
            sync;
            return p;
        }
    }

Mutual Exclusion

Cilk's solution to mutual exclusion is no better than anybody else's. Cilk provides a library of spin locks declared with Cilk_lockvar. To avoid deadlock with the Cilk scheduler, a lock should only be held within a single Cilk thread; i.e., spawn and sync should not be executed while a lock is held. Fortunately, Cilk's control parallelism often mitigates the need for extensive locking.
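
A minimal sketch of the lock library in use, assuming the Cilk-5 primitives Cilk_lockvar, Cilk_lock_init, Cilk_lock, and Cilk_unlock (the exact header name depends on your Cilk installation); note that the critical section contains no spawn or sync:

    #include <stdio.h>
    /* plus your installation's Cilk lock header */

    Cilk_lockvar mylock;

    cilk void increment(int *counter) {
        Cilk_lock(mylock);        /* enter critical section */
        (*counter)++;             /* no spawn or sync while the lock is held */
        Cilk_unlock(mylock);      /* leave critical section */
    }

    cilk int main(void) {
        int counter = 0;
        Cilk_lock_init(mylock);   /* must initialize before first use */
        spawn increment(&counter);
        spawn increment(&counter);
        sync;
        printf("%d\n", counter);  /* always 2: the updates never interleave */
        return 0;
    }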

Compiling Cilk

The pipeline: Cilk source → cilk2c (source-to-source translator) → C post-source → gcc (C compiler) → object code → ld (linking loader, together with the Cilk RTS) → binary.

The cilkc compiler encapsulates the process. cilk2c translates straight C code into identical C post-source.

Cilk's Compiler Strategy

The cilk2c translator generates two "clones" of each Cilk procedure:
fast clone: serial, common-case code.
slow clone: code with parallel bookkeeping.

The fast clone is always spawned, saving live variables on Cilk's work deque (the shadow stack). The slow clone is resumed if a thread is stolen, restoring variables from the shadow stack. A check is made whenever a procedure returns to see if the resuming parent has been stolen.

Compiling spawn — Fast Clone

Cilk source:

    x = spawn fib(n-1);

cilk2c generates C post-source:

    frame->entry = 1;              /* save restart point */
    frame->n = n;                  /* save live variables */
    push(frame);                   /* suspend parent: push frame on the deque */
    x = fib(n-1);                  /* run child */
    if (pop()==FAILURE) {          /* pop fails: the parent frame was stolen */
        frame->x = x;
        frame->join--;
        ⟨clean up & return to scheduler⟩   /* resume parent remotely */
    }

[Figure: the Cilk deque holding a frame with fields entry, join, n, x, y.]

Compiling sync — Fast Clone

Cilk source:

    sync;

cilk2c generates C post-source:

    ;

No synchronization overhead in the fast clone!

Compiling the Slow Clone

    void fib_slow(fib_frame *frame) {
        int n, x, y;
        switch (frame->entry) {    /* restore program counter */
            case 1: goto L1;
            case 2: goto L2;
            case 3: goto L3;
        }
        ...
        frame->entry = 1;          /* same as fast clone */
        frame->n = n;
        push(frame);
        x = fib(n-1);
        if (pop()==FAILURE) {
            frame->x = x;
            frame->join--;
            ⟨clean up & return to scheduler⟩
        }
        if (0) {
        L1:;
            n = frame->n;          /* restore local variables if resuming */
        }
        /* continue */
        ...
    }

[Figure: the Cilk deque frame with fields entry, join, n, x, y.]

Breakdown of Work Overhead (circa 1997)

[Figure: bar chart of the work overhead T1/TS, broken into components: C, state saving, frame allocation, and stealing protocol, across four machines: MIPS R10000 (27ns), UltraSPARC I (78ns), Pentium Pro (113ns), Alpha 21164 (115ns).] Benchmark: fib on one processor.

Cilk Chess Programs

★Socrates placed 3rd in the 1994 International Computer Chess Championship running on NCSA's 512-node Connection Machine CM5.
★Socrates 2.0 took 2nd place in the 1995 World Computer Chess Championship running on Sandia National Labs' 1824-node Intel Paragon.
Cilkchess placed 1st in the 1996 Dutch Open running on a 12-processor Sun Enterprise 5000. It placed 2nd in 1997 and 1998 running on Boston University's 64-processor SGI Origin 2000.
Cilkchess tied for 3rd in the 1999 WCCC running on NASA's 256-node SGI Origin 2000.

★Socrates Normalized Speedup

[Figure: log-log plot of normalized speedup (T1/TP)/(T1/T∞) versus normalized machine size P/(T1/T∞). The measured speedups track the curve TP = T1/P + T∞, between the bounds TP = T1/P (perfect linear speedup) and TP = T∞ (span limit).]