Multithreaded Programming in Cilk LECTURE 1


1 Multithreaded Programming in Cilk LECTURE 1
Charles E. Leiserson, Supercomputing Technologies Research Group, Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology. Adapted to current Cilk syntax by Shirley Moore. July 13, 2006.

2 Cilk
A C language for programming dynamic multithreaded applications on shared-memory multiprocessors. Example applications: virus shell assembly, graphics rendering, n-body simulation, heuristic search, dense and sparse matrix computations, friction-stir welding simulation, artificial evolution.

3 Shared-Memory Multiprocessor
[Figure: P processors, each with a cache ($), connected by a network to shared memory and I/O.] In particular, over the next decade, chip multiprocessors (CMPs) will be an increasingly important platform!

4 Cilk Is Simple
Cilk extends the C language with just a handful of keywords. Every Cilk program has serial semantics. Not only is Cilk fast, it provides performance guarantees based on performance abstractions. Cilk is processor-oblivious. Cilk's provably good runtime system automatically manages low-level aspects of parallel execution, including protocols, load balancing, and scheduling. Cilk supports speculative parallelism.

5 Minicourse Outline
LECTURE 1: Basic Cilk programming (Cilk keywords, performance measures, scheduling). LECTURE 2: Analysis of Cilk algorithms (matrix multiplication, sorting, tableau construction). LABORATORY: Programming matrix multiplication in Cilk (Dr. Bradley C. Kuszmaul). LECTURE 3: Advanced Cilk programming (inlets, abort, speculation, data synchronization, and more).

6 Multithreaded Programming in Cilk Lecture 1
LECTURE 1: Basic Cilk Programming, Performance Measures, Parallelizing Vector Addition, Scheduling Theory, A Chess Lesson, Cilk's Scheduler, Conclusion.

7 Fibonacci
C elision:

    int fib (int n) {
      if (n<2) return (n);
      else {
        int x, y;
        x = fib(n-1);
        y = fib(n-2);
        return (x+y);
      }
    }

In the Cilk code, the two recursive calls become spawns followed by a sync:

    x = cilk_spawn fib(n-1);
    y = cilk_spawn fib(n-2);
    cilk_sync;

Cilk is a faithful extension of C. A Cilk program's serial elision is always a legal implementation of Cilk semantics. Cilk provides no new data types.

8 Basic Cilk Keywords

    int fib (int n) {
      if (n<2) return (n);
      else {
        int x, y;
        x = cilk_spawn fib(n-1);
        y = cilk_spawn fib(n-2);
        cilk_sync;
        return (x+y);
      }
    }

cilk_spawn: the named child Cilk procedure can execute in parallel with the parent caller. cilk_sync: control cannot pass this point until all spawned children have returned.
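For readers who want to run the example, here is a minimal self-contained sketch assuming an OpenCilk/Cilk Plus-style toolchain in which the keywords come from <cilk/cilk.h>; the main function and the use of atoi are illustrative additions, not part of the lecture.

    #include <cilk/cilk.h>   /* provides cilk_spawn, cilk_sync */
    #include <stdio.h>
    #include <stdlib.h>

    int fib (int n) {
      if (n < 2) return n;
      else {
        int x, y;
        x = cilk_spawn fib(n-1);   /* child may run in parallel with the parent */
        y = cilk_spawn fib(n-2);
        cilk_sync;                 /* wait for both children before using x and y */
        return x + y;
      }
    }

    int main (int argc, char *argv[]) {
      int n = (argc > 1) ? atoi(argv[1]) : 30;
      printf("fib(%d) = %d\n", n, fib(n));
      return 0;
    }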

9 Dynamic Multithreading
Example: fib(4), using the Cilk fib() above. [Figure: the computation dag for fib(4); each node is labeled with the argument of the corresponding recursive call.] The computation dag unfolds dynamically. "Processor oblivious": the code mentions no processors.

10 Multithreaded Computation
[Figure: a multithreaded computation drawn as a dag of threads, running from the initial thread to the final thread; threads are connected by spawn edges, continue edges, and return edges.]

11 Cactus Stack
Cilk supports C's rule for pointers: a pointer to stack space can be passed from parent to child, but not from child to parent. (Cilk also supports malloc.) [Figure: procedures A, B, C, D, E and the view of the stack each one sees; Cilk's cactus stack supports the several views in parallel.]
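A small sketch of what the pointer rule permits, using the same cilk_spawn/cilk_sync keywords as above; the function names (fill, process_halves, caller) are illustrative only, not from the lecture.

    #include <cilk/cilk.h>

    void fill (double *a, int n) {
      for (int i = 0; i < n; i++) a[i] = (double)i;   /* child writes through the parent's pointer */
    }

    void process_halves (double *buf, int n) {
      cilk_spawn fill(buf, n/2);             /* legal: parent passes pointers into its stack to children */
      cilk_spawn fill(buf + n/2, n - n/2);
      cilk_sync;                             /* children done; buf is fully written */
    }

    void caller (void) {
      double data[1024];                     /* lives in caller's stack frame: one view of the cactus stack */
      process_halves(data, 1024);
      /* A child must NOT hand back a pointer into its own stack frame: that frame
         may be gone once the child returns. Heap-allocate (malloc) anything that
         has to flow upward from child to parent. */
    }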

12 Multithreaded Programming in Cilk Lecture 1
LECTURE 1: Basic Cilk Programming, Performance Measures, Parallelizing Vector Addition, Scheduling Theory, A Chess Lesson, Cilk's Scheduler, Conclusion.

13 Algorithmic Complexity Measures
TP = execution time on P processors.

14 Algorithmic Complexity Measures
TP = execution time on P processors. T1 = work.

15 Algorithmic Complexity Measures
TP = execution time on P processors. T1 = work. T∞ = span.* (*Also called critical-path length or computational depth.)


17 Speedup
Definition: T1/TP = speedup on P processors. Speedup equal to P is (perfect) linear speedup; speedup greater than P is impossible in this model, since TP ≥ T1/P.

18 Parallelism
Definition: T1/T∞ = parallelism, the maximum possible speedup: since TP ≥ T∞, the speedup T1/TP can never exceed T1/T∞.

19 Example: fib(4)
Assume for simplicity that each Cilk thread in fib() takes unit time to execute. [Figure: the fib(4) computation dag.] Work: T1 = 17. Span: T∞ = 8.

20 Example: fib(4)
Assume for simplicity that each Cilk thread in fib() takes unit time to execute. Work: T1 = 17. Span: T∞ = 8. Parallelism: T1/T∞ = 2.125. Using many more than 2 processors makes little sense.
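One way to check the work count (our breakdown, consistent with how the lecture's dag divides each procedure instance into threads): the four instances that spawn, namely fib(4), fib(3), and two copies of fib(2), contribute three threads each, and the five base cases, three copies of fib(1) and two of fib(0), contribute one thread each:

    T_1 = 4 \cdot 3 + 5 \cdot 1 = 17, \qquad \frac{T_1}{T_\infty} = \frac{17}{8} = 2.125 .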

21 Multithreaded Programming in Cilk Lecture 1
LECTURE 1: Basic Cilk Programming, Performance Measures, Parallelizing Vector Addition, Scheduling Theory, A Chess Lesson, Cilk's Scheduler, Conclusion.

22 Parallelizing Vector Addition
C:

    void vadd (real *A, real *B, int n) {
      int i;
      for (i=0; i<n; i++) A[i] += B[i];
    }

23 Parallelizing Vector Addition
Parallelization strategy: (1) Convert loops to recursion.

C:

    void vadd (real *A, real *B, int n) {
      if (n<=BASE) {
        int i;
        for (i=0; i<n; i++) A[i] += B[i];
      } else {
        vadd (A, B, n/2);
        vadd (A+n/2, B+n/2, n-n/2);
      }
    }

24 Parallelizing Vector Addition
Parallelization strategy: (1) Convert loops to recursion. (2) Insert Cilk keywords. Side benefit: divide-and-conquer is generally good for caches!

Cilk:

    void vadd (real *A, real *B, int n) {
      if (n<=BASE) {
        int i;
        for (i=0; i<n; i++) A[i] += B[i];
      } else {
        cilk_spawn vadd (A, B, n/2);
        cilk_spawn vadd (A+n/2, B+n/2, n-n/2);
        cilk_sync;
      }
    }

25 Vector Addition

    void vadd (real *A, real *B, int n) {
      if (n<=BASE) {
        int i;
        for (i=0; i<n; i++) A[i] += B[i];
      } else {
        cilk_spawn vadd (A, B, n/2);
        cilk_spawn vadd (A+n/2, B+n/2, n-n/2);
        cilk_sync;
      }
    }
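A minimal driver to exercise the routine, assuming real is a typedef for double and BASE is a small coarsening constant; both names come from the slides, but the concrete values and main itself are illustrative choices, not part of the lecture.

    #include <cilk/cilk.h>
    #include <stdio.h>
    #include <stdlib.h>

    typedef double real;     /* assumption: the slides' "real" type */
    #define BASE 256         /* assumption: a plausible base-case size */

    void vadd (real *A, real *B, int n) {
      if (n <= BASE) {
        int i;
        for (i = 0; i < n; i++) A[i] += B[i];
      } else {
        cilk_spawn vadd (A, B, n/2);
        cilk_spawn vadd (A + n/2, B + n/2, n - n/2);
        cilk_sync;
      }
    }

    int main (void) {
      int n = 1 << 20;
      real *A = malloc(n * sizeof(real));
      real *B = malloc(n * sizeof(real));
      for (int i = 0; i < n; i++) { A[i] = i; B[i] = 1.0; }
      vadd(A, B, n);
      printf("A[0]=%g A[n-1]=%g\n", A[0], A[n-1]);   /* expect 1 and n */
      free(A); free(B);
      return 0;
    }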

26 Vector Addition Analysis
To add two vectors of length n, where BASE = Θ(1): Work: T1 = Θ(n). Span: T∞ = Θ(lg n). Parallelism: T1/T∞ = Θ(n/lg n).
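These bounds follow from the divide-and-conquer recurrences (a standard derivation, not spelled out on the slide): both halves contribute to the work, but only one of them lies on the critical path.

    T_1(n) = 2\,T_1(n/2) + \Theta(1) = \Theta(n), \qquad
    T_\infty(n) = T_\infty(n/2) + \Theta(1) = \Theta(\lg n).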

27 Another Parallelization
C:

    void vadd1 (real *A, real *B, int n) {
      int i;
      for (i=0; i<n; i++) A[i] += B[i];
    }
    void vadd (real *A, real *B, int n) {
      int j;
      for (j=0; j<n; j+=BASE) {
        vadd1(A+j, B+j, min(BASE, n-j));
      }
    }

Cilk:

    void vadd1 (real *A, real *B, int n) {
      int i;
      for (i=0; i<n; i++) A[i] += B[i];
    }
    void vadd (real *A, real *B, int n) {
      int j;
      for (j=0; j<n; j+=BASE) {
        cilk_spawn vadd1(A+j, B+j, min(BASE, n-j));
      }
      cilk_sync;
    }

28 Analysis
To add two vectors of length n, where BASE = Θ(1): Work: T1 = Θ(n). Span: T∞ = Θ(n). Parallelism: T1/T∞ = Θ(1). PUNY!

29 Optimal Choice of BASE
To add two vectors of length n using an optimal choice of BASE to maximize parallelism: Work: T1 = Θ(n). Span: T∞ = Θ(BASE + n/BASE). Choosing BASE = √n gives T∞ = Θ(√n). Parallelism: T1/T∞ = Θ(√n).
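Why √n is the right choice (a one-line optimization, our arithmetic rather than the slide's): the span has one term that grows with BASE and one that shrinks with it, and the sum is minimized where the two terms balance.

    \frac{d}{d\,\mathrm{BASE}}\Bigl(\mathrm{BASE} + \frac{n}{\mathrm{BASE}}\Bigr)
      = 1 - \frac{n}{\mathrm{BASE}^2} = 0
    \;\Longrightarrow\; \mathrm{BASE} = \sqrt{n},
    \qquad T_\infty = \Theta(\sqrt{n}), \quad \frac{T_1}{T_\infty} = \Theta(\sqrt{n}).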

30 Multithreaded Programming in Cilk Lecture 1
LECTURE 1: Basic Cilk Programming, Performance Measures, Parallelizing Vector Addition, Scheduling Theory, A Chess Lesson, Cilk's Scheduler, Conclusion.

31 Scheduling
Cilk allows the programmer to express potential parallelism in an application. The Cilk scheduler maps Cilk threads onto processors dynamically at run time. Since on-line schedulers are complicated, we'll illustrate the ideas with an off-line scheduler. [Figure: the shared-memory multiprocessor of slide 3.]

32 Greedy Scheduling
IDEA: Do as much as possible on every step. Definition: A thread is ready if all its predecessors have executed.

33 Greedy Scheduling
IDEA: Do as much as possible on every step. Complete step: at least P threads are ready, so a greedy scheduler runs any P of them. [Figure: the dag scheduled with P = 3.]

34 Greedy Scheduling
IDEA: Do as much as possible on every step. Incomplete step: fewer than P threads are ready, so a greedy scheduler runs all of them. [Figure: the dag scheduled with P = 3.]

35 Greedy-Scheduling Theorem
Theorem [Graham '68 & Brent '75]. Any greedy scheduler achieves TP ≤ T1/P + T∞. Proof. The number of complete steps is at most T1/P, since each complete step performs P work. The number of incomplete steps is at most T∞, since each incomplete step reduces the span of the unexecuted dag by 1. ■ [Figure: the dag scheduled greedily with P = 3.]
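As a concrete check (our combination of the fib(4) numbers from slide 20 with this bound, not something shown on the slide): with T1 = 17 unit-time threads, T∞ = 8, and P = 3,

    T_3 \;\le\; \frac{T_1}{P} + T_\infty \;=\; \frac{17}{3} + 8 \;\approx\; 13.7,

so a greedy 3-processor schedule needs at most 13 unit-time steps, since the step count is an integer.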

36 Optimality of Greedy
Corollary. Any greedy scheduler achieves within a factor of 2 of optimal. (Reason: any P-processor schedule needs time at least max{T1/P, T∞}, and the greedy bound T1/P + T∞ is at most twice that.)

37 Linear Speedup
Definition. The quantity (T1/T∞)/P is called the parallel slackness. If the slackness is large, that is, if P ≪ T1/T∞, then T∞ ≪ T1/P, and the greedy bound gives TP ≈ T1/P: nearly perfect linear speedup.
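A small numeric illustration with made-up numbers (ours, not from the slides): take T1 = 32000, T∞ = 32, and P = 32, so the slackness is (32000/32)/32 ≈ 31. The greedy bound then gives

    T_{32} \;\le\; \frac{32000}{32} + 32 \;=\; 1032 \;\approx\; \frac{T_1}{P},

a speedup of at least 32000/1032 ≈ 31, i.e., about 97% of perfect linear speedup.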

38 Cilk Performance

39 Multithreaded Programming in Cilk Lecture 1
LECTURE 1: Basic Cilk Programming, Performance Measures, Parallelizing Vector Addition, Scheduling Theory, A Chess Lesson, Cilk's Scheduler, Conclusion.

40 Cilk Chess Programs
Socrates placed 3rd in the 1994 International Computer Chess Championship running on NCSA's 512-node Connection Machine CM5. Socrates 2.0 took 2nd place in the 1995 World Computer Chess Championship running on Sandia National Labs' 1824-node Intel Paragon. Cilkchess placed 1st in the 1996 Dutch Open running on a 12-processor Sun Enterprise. It placed 2nd in 1997 and 1998 running on Boston University's 64-processor SGI Origin 2000. Cilkchess tied for 3rd in the 1999 WCCC running on NASA's 256-node SGI Origin 2000.

41 Socrates Normalized Speedup
[Figure: measured speedup of Socrates on log-log axes, with T1/TP normalized by T1/T∞ plotted against P normalized by T1/T∞ (both axes running from 0.01 to 1). The measured points track the curve TP = T1/P + T∞, which lies between the perfect-speedup line TP = T1/P and the span bound TP = T∞.]

42 Developing Socrates
For the competition, Socrates was to run on a 512-processor Connection Machine Model CM5 supercomputer at the University of Illinois. The developers had easy access to a similar 32-processor CM5 at MIT. One of the developers proposed a change to the program that produced a speedup of over 20% on the MIT machine. After a back-of-the-envelope calculation, the proposed "improvement" was rejected!

43 Socrates Speedup Paradox
Multithreaded Programming in Cilk Lecture 1 Socrates Speedup Paradox July 13, 2006 Original program Proposed program T32 = 65 seconds T32 = 40 seconds TP  T1/P + T T1 = 2048 seconds T = 1 second T1 = 1024 seconds T = 8 seconds T32 = 2048/32 + 1 = 65 seconds = 40 seconds T32 = 1024/32 + 8 T512 = 2048/ = 5 seconds T512 = 1024/ = 10 seconds

44 Lesson
Work and span can predict performance on large machines better than running times on small machines can.

45 Multithreaded Programming in Cilk Lecture 1
LECTURE 1: Basic Cilk Programming, Performance Measures, Parallelizing Vector Addition, Scheduling Theory, A Chess Lesson, Cilk's Scheduler, Conclusion.

46 Cilk’s Work-Stealing Scheduler
Multithreaded Programming in Cilk Lecture 1 July 13, 2006 Cilk’s Work-Stealing Scheduler Each processor maintains a work deque of ready threads, and it manipulates the bottom of the deque like a stack. Spawn! P P P P

47 Cilk’s Work-Stealing Scheduler
Multithreaded Programming in Cilk Lecture 1 July 13, 2006 Cilk’s Work-Stealing Scheduler Each processor maintains a work deque of ready threads, and it manipulates the bottom of the deque like a stack. Spawn! P P P P

48 Cilk’s Work-Stealing Scheduler
Multithreaded Programming in Cilk Lecture 1 July 13, 2006 Cilk’s Work-Stealing Scheduler Each processor maintains a work deque of ready threads, and it manipulates the bottom of the deque like a stack. Return! P P P P

49 Cilk’s Work-Stealing Scheduler
Multithreaded Programming in Cilk Lecture 1 July 13, 2006 Cilk’s Work-Stealing Scheduler Each processor maintains a work deque of ready threads, and it manipulates the bottom of the deque like a stack. Return! P P P P

50 Cilk’s Work-Stealing Scheduler
Multithreaded Programming in Cilk Lecture 1 July 13, 2006 Cilk’s Work-Stealing Scheduler Each processor maintains a work deque of ready threads, and it manipulates the bottom of the deque like a stack. Steal! P P P P When a processor runs out of work, it steals a thread from the top of a random victim’s deque.

51 Cilk’s Work-Stealing Scheduler
Multithreaded Programming in Cilk Lecture 1 July 13, 2006 Cilk’s Work-Stealing Scheduler Each processor maintains a work deque of ready threads, and it manipulates the bottom of the deque like a stack. Steal! P P P P When a processor runs out of work, it steals a thread from the top of a random victim’s deque.

52 Cilk’s Work-Stealing Scheduler
Multithreaded Programming in Cilk Lecture 1 July 13, 2006 Cilk’s Work-Stealing Scheduler Each processor maintains a work deque of ready threads, and it manipulates the bottom of the deque like a stack. P P P P When a processor runs out of work, it steals a thread from the top of a random victim’s deque.

53 Cilk’s Work-Stealing Scheduler
Multithreaded Programming in Cilk Lecture 1 July 13, 2006 Cilk’s Work-Stealing Scheduler Each processor maintains a work deque of ready threads, and it manipulates the bottom of the deque like a stack. Spawn! P P P P When a processor runs out of work, it steals a thread from the top of a random victim’s deque.
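A highly simplified sketch of the deque discipline described above, written in plain C under the assumption of a sequential toy setting; it ignores the synchronization a real work-stealing deque (such as Cilk's THE protocol) needs, and all names are ours, not the runtime's.

    #include <stdlib.h>

    typedef struct { void (*run)(void *); void *arg; } task_t;

    typedef struct {
      task_t *buf;          /* buf[top .. bot-1] holds the ready tasks */
      int top, bot, cap;
    } deque_t;

    deque_t *deque_new (int cap) {
      deque_t *d = malloc(sizeof *d);
      d->buf = malloc(cap * sizeof *d->buf);
      d->top = d->bot = 0;
      d->cap = cap;
      return d;
    }

    /* Owner pushes a freshly spawned task onto the bottom of its own deque. */
    void push_bottom (deque_t *d, task_t t) {
      if (d->bot < d->cap) d->buf[d->bot++] = t;
    }

    /* Owner pops from the bottom (LIFO), like an ordinary call stack. */
    int pop_bottom (deque_t *d, task_t *out) {
      if (d->bot == d->top) return 0;      /* deque empty: time to steal */
      *out = d->buf[--d->bot];
      return 1;
    }

    /* A thief steals the oldest task from the top of a random victim's deque. */
    int steal_top (deque_t *victim, task_t *out) {
      if (victim->bot == victim->top) return 0;   /* nothing to steal */
      *out = victim->buf[victim->top++];
      return 1;
    }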

54 Performance of Work-Stealing
Theorem: Cilk's work-stealing scheduler achieves an expected running time of TP ≤ T1/P + O(T∞) on P processors. Pseudoproof. A processor is either working or stealing. The total time all processors spend working is T1. Each steal has a 1/P chance of reducing the span by 1, so the expected cost of all steals is O(PT∞). Since there are P processors, the expected time is (T1 + O(PT∞))/P = T1/P + O(T∞). ■

55 Space Bounds
Theorem. Let S1 be the stack space required by a serial execution of a Cilk program. Then the space required by a P-processor execution is at most SP ≤ P·S1. Proof (by induction). The work-stealing algorithm maintains the busy-leaves property: every extant procedure frame with no extant descendants has a processor working on it. ■ [Figure: P = 3.]

56 Linguistic Implications
Code like the following executes properly without any risk of blowing out memory:

    for (i=1; i< /* very large bound */ ; i++) {
      cilk_spawn foo(i);
    }
    cilk_sync;

With parent-stealing, each worker executes the spawned child immediately and only the loop continuation can be stolen, so at most P iterations are live at once and the space stays within the P·S1 bound. MORAL: Better to steal parents than children!

57 Multithreaded Programming in Cilk Lecture 1
LECTURE 1: Basic Cilk Programming, Performance Measures, Parallelizing Vector Addition, Scheduling Theory, A Chess Lesson, Cilk's Scheduler, Conclusion.

58 Key Ideas
Cilk is simple: cilk_spawn, cilk_sync. Recursion, recursion, recursion, ... Work & span.

