Multithreaded Programming in Cilk LECTURE 1

Slides:

Advertisements

Similar presentations

Multi-core Computing Lecture 3 MADALGO Summer School 2012 Algorithms for Modern Parallel and Distributed Models Phillip B. Gibbons Intel Labs Pittsburgh.

Advertisements

© 2009 Charles E. Leiserson and Pablo Halpern1 Introduction to Cilk++ Programming PADTAD July 20, 2009 Cilk, Cilk++, Cilkview, and Cilkscreen, are trademarks.

Lecture 24 MAS 714 Hartmut Klauck

Introduction to Computer Science 2 Lecture 7: Extended binary trees

1 CS 240A : Numerical Examples in Shared Memory with Cilk++ Matrix-matrix multiplication Matrix-vector multiplication Hyperobjects Thanks to Charles E.

A Dynamic World, what can Grids do for Multi-Core computing? Daniel Goodman, Anne Trefethen and Douglas Creager

Greedy Algorithms Greed is good. (Some of the time)

U NIVERSITY OF M ASSACHUSETTS, A MHERST – Department of Computer Science The Implementation of the Cilk-5 Multithreaded Language (Frigo, Leiserson, and.

Analysis of Cilk. A Formal Model for Cilk  A thread: maximal sequence of instructions in a procedure instance (at runtime!) not containing spawn, sync,

CILK: An Efficient Multithreaded Runtime System. People n Project at MIT & now at UT Austin –Bobby Blumofe (now UT Austin, Akamai) –Chris Joerg –Brad.

Parallelizing Incremental Bayesian Segmentation (IBS) Joseph Hastings Sid Sen.

Nested Parallelism in Transactional Memory Kunal Agrawal, Jeremy T. Fineman and Jim Sukha MIT.

Parallelizing C Programs Using Cilk Mahdi Javadi.

Parallelism Analysis, and Work Distribution BY DANIEL LIVSHEN. BASED ON CHAPTER 16 “FUTURES, SCHEDULING AND WORK DISTRIBUTION” ON “THE ART OF MULTIPROCESSOR.

Image Processing Using Cilk 1 Parallel Processing – Final Project Image Processing Using Cilk Tomer Y & Tuval A (pp25)

1 CS 140 : Jan 27 – Feb 3, 2010 Multicore (and Shared Memory) Programming with Cilk++ Multicore and NUMA architectures Multithreaded Programming Cilk++

Describing Syntax and Semantics

Cilk CISC 879 Parallel Computation Erhan Atilla Avinal.

Week 7 - Wednesday.  What did we talk about last time?  Recursive running time  Master Theorem  Introduction to trees.

Juan Mendivelso.  Serial Algorithms: Suitable for running on an uniprocessor computer in which only one instruction executes at a time.  Parallel Algorithms:

Multithreaded Algorithms Andreas Klappenecker. Motivation We have discussed serial algorithms that are suitable for running on a uniprocessor computer.

Multithreaded Programming in Cilk L ECTURE 2 Charles E. Leiserson Supercomputing Technologies Research Group Computer Science and Artificial Intelligence.

STRATEGIC NAMING: MULTI-THREADED ALGORITHM (Ch 27, Cormen et al.) Parallelization Four types of computing: –Instruction (single, multiple) per clock cycle.

May/01/2000HIPS Online Computation of Critical Paths for Multithreaded Languages Yoshihiro Oyama Kenjiro Taura Akinori Yonezawa University of Tokyo.

1 L ECTURE 2 Matrix Multiplication Tableau Construction Recurrences (Review) Conclusion Merge Sort.

1 CS 140 : Non-numerical Examples with Cilk++ Divide and conquer paradigm for Cilk++ Quicksort Mergesort Thanks to Charles E. Leiserson for some of these.

1 L ECTURE 2 Matrix Multiplication Tableau Construction Recurrences (Review) Conclusion Merge Sort.

Overview Work-stealing scheduler O(pS 1 ) worst case space small overhead Narlikar scheduler 1 O(S 1 +pKT  ) worst case space large overhead Hybrid scheduler.

An Implementation and Performance Evaluation of Language with Fine-Grain Thread Creation on Shared Memory Parallel Computer Yoshihiro Oyama, Kenjiro Taura,

Télécom 2A – Algo Complexity (1) Time Complexity and the divide and conquer strategy Or : how to measure algorithm run-time And : design efficient algorithms.

LATA: A Latency and Throughput- Aware Packet Processing System Author: Jilong Kuang and Laxmi Bhuyan Publisher: DAC 2010 Presenter: Chun-Sheng Hsueh Date:

1 CS 140 : Feb 19, 2015 Cilk Scheduling & Applications Analyzing quicksort Optional: Master method for solving divide-and-conquer recurrences Tips on parallelism.

P ARALLEL P ROCESSING F INAL P RESENTATION CILK Eliran Ben Moshe Neriya Cohen.

Week 7 - Wednesday.  What did we talk about last time?  Recursive running time  Master Theorem  Symbol tables.

Scheduling Multithreaded Computations By Work-Stealing Robert D. Blumofe The University of Texas, Austin Charles E. Leiserson, MIT Laboratory for Computer.

CS 240A: Shared Memory & Multicore Programming with Cilk++

Recursion Unrolling for Divide and Conquer Programs Radu Rugina and Martin Rinard Laboratory for Computer Science Massachusetts Institute of Technology.

1 Cilk Chao Huang CS498LVK. 2 Introduction A multithreaded parallel programming language Effective for exploiting dynamic, asynchronous parallelism (Chess.

1 Interactive Computer Theorem Proving CS294-9 October 19, 2006 Adam Chlipala UC Berkeley Lecture 9: Beyond Primitive Recursion.

CILK: An Efficient Multithreaded Runtime System

CS 140 : Numerical Examples on Shared Memory with Cilk++

Decrease-and-Conquer Approach

CMPS 5433 Programming Models

Prabhanjan Kambadur, Open Systems Lab, Indiana University

Problem Parallelize (serial) applications that use files. In general

Task Scheduling for Multicore CPUs and NUMA Systems

Introduction to Algorithms

Real-Time Ray Tracing Stefan Popov.

Week 11 - Friday CS221.

Martin Rinard Laboratory for Computer Science

Amir Kamil and Katherine Yelick

Binary Tree and General Tree

Cilk and OpenMP (Lecture 20, cs262a)

Prepared by Chen & Po-Chuan 2016/03/29

Complexity Measures for Parallel Computation

CS 583 Analysis of Algorithms

Introduction to CILK Some slides are from:

RECURSION Annie Calpe

Transactions with Nested Parallelism

Multithreaded Programming in Cilk Lecture 1

Amir Kamil and Katherine Yelick

Cilk A C language for programming dynamic multithreaded applications on shared-memory multiprocessors. Example applications: virus shell assembly graphics.

Complexity Measures for Parallel Computation

Cilk and Writing Code for Hardware

Priority Queues & Heaps

Concurrent Cache-Oblivious B-trees Using Transactional Memory

Introduction to CILK Some slides are from:

Lecture 21 Amortized Analysis

Algorithm Course Algorithms Lecture 3 Sorting Algorithm-1

Presentation transcript:

Multithreaded Programming in Cilk LECTURE 1 July 13, 2006 Multithreaded Programming in Cilk LECTURE 1 Charles E. Leiserson Adapted to current Cilk syntax by Shirley Moore Supercomputing Technologies Research Group Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology

Multithreaded Programming in Cilk Lecture 1 July 13, 2006 Cilk A C language for programming dynamic multithreaded applications on shared-memory multiprocessors. Example applications: virus shell assembly graphics rendering n-body simulation heuristic search dense and sparse matrix computations friction-stir welding simulation artificial evolution

Shared-Memory Multiprocessor Multithreaded Programming in Cilk Lecture 1 July 13, 2006 Shared-Memory Multiprocessor P Network … Memory I/O $ In particular, over the next decade, chip multiprocessors (CMP’s) will be an increasingly important platform!

Multithreaded Programming in Cilk Lecture 1 July 13, 2006 Cilk Is Simple Cilk extends the C language with just a handful of keywords. Every Cilk program has a serial semantics. Not only is Cilk fast, it provides performance guarantees based on performance abstractions. Cilk is processor-oblivious. Cilk’s provably good runtime system auto-matically manages low-level aspects of parallel execution, including protocols, load balancing, and scheduling. Cilk supports speculative parallelism.

Multithreaded Programming in Cilk Lecture 1 Minicourse Outline July 13, 2006 LECTURE 1 Basic Cilk programming: Cilk keywords, performance measures, scheduling. LECTURE 2 Analysis of Cilk algorithms: matrix multiplication, sorting, tableau construction. LABORATORY Programming matrix multiplication in Cilk — Dr. Bradley C. Kuszmaul LECTURE 3 Advanced Cilk programming: inlets, abort, speculation, data synchronization, & more.

Multithreaded Programming in Cilk Lecture 1 July 13, 2006 LECTURE 1 Basic Cilk Programming Performance Measures Parallelizing Vector Addition Scheduling Theory A Chess Lesson Cilk’s Scheduler Conclusion

Multithreaded Programming in Cilk Lecture 1 July 13, 2006 Fibonacci int fib (int n) { if (n<2) return (n); else { int x,y; x = fib(n-1); y = fib(n-2); return (x+y); } C elision x = cilk_spawn fib(n-1); y = cilk_spawn fib(n-2); cilk_sync; Cilk code Cilk is a faithful extension of C. A Cilk program’s serial elision is always a legal implementation of Cilk semantics. Cilk provides no new data types.

Multithreaded Programming in Cilk Lecture 1 July 13, 2006 Basic Cilk Keywords int fib (int n) { if (n<2) return (n); else { int x,y; x = cilk_spawn fib(n-1); y = cilk_spawn fib(n-2); cilk_sync; return (x+y); } The named child Cilk procedure can execute in parallel with the parent caller. Control cannot pass this point until all spawned children have returned.

Dynamic Multithreading Multithreaded Programming in Cilk Lecture 1 Dynamic Multithreading July 13, 2006 int fib (int n) { if (n<2) return (n); else { int x,y; x = cilk_spawn fib(n-1); y = cilk_spawn fib(n-2); cilk_sync; return (x+y); } Example: fib(4) 4 3 2 2 1 1 “Processor oblivious” The computation dag unfolds dynamically. 1

Multithreaded Computation Multithreaded Programming in Cilk Lecture 1 July 13, 2006 Multithreaded Computation initial thread final thread continue edge return edge spawn edge

Multithreaded Programming in Cilk Lecture 1 Cactus Stack Multithreaded Programming in Cilk Lecture 1 July 13, 2006 Cilk supports C’s rule for pointers: A pointer to stack space can be passed from parent to child, but not from child to parent. (Cilk also supports malloc.) A B C D E Views of stack B A C E D Cilk’s cactus stack supports several views in parallel.

Multithreaded Programming in Cilk Lecture 1 July 13, 2006 LECTURE 1 Basic Cilk Programming Performance Measures Parallelizing Vector Addition Scheduling Theory A Chess Lesson Cilk’s Scheduler Conclusion

Algorithmic Complexity Measures Multithreaded Programming in Cilk Lecture 1 Algorithmic Complexity Measures July 13, 2006 TP = execution time on P processors

Algorithmic Complexity Measures Multithreaded Programming in Cilk Lecture 1 Algorithmic Complexity Measures July 13, 2006 TP = execution time on P processors T1 = work

Algorithmic Complexity Measures Multithreaded Programming in Cilk Lecture 1 July 13, 2006 TP = execution time on P processors T1 = work T∞ = span* * Also called critical-path length or computational depth.

Algorithmic Complexity Measures Multithreaded Programming in Cilk Lecture 1 July 13, 2006 TP = execution time on P processors T1 = work T∞= span* * Also called critical-path length or computational depth.

Multithreaded Programming in Cilk Lecture 1 Speedup July 13, 2006 Definition: T1/TP = speedup on P processors.

Multithreaded Programming in Cilk Lecture 1 Parallelism July 13, 2006

Multithreaded Programming in Cilk Lecture 1 Example: fib(4) July 13, 2006 1 8 2 7 3 4 6 5 Assume for simplicity that each Cilk thread in fib() takes unit time to execute. Work: T1 = ? Work: T1 = 17 Span: T∞ = 8 Span: T1 = ?

Multithreaded Programming in Cilk Lecture 1 Example: fib(4) Multithreaded Programming in Cilk Lecture 1 July 13, 2006 Assume for simplicity that each Cilk thread in fib() takes unit time to execute. Work: T1 = ? Work: T1 = 17 Using many more than 2 processors makes little sense. Span: T1 = ? Span: T1 = 8 Parallelism: T1/T∞ = 2.125

Multithreaded Programming in Cilk Lecture 1 July 13, 2006 LECTURE 1 Basic Cilk Programming Performance Measures Parallelizing Vector Addition Scheduling Theory A Chess Lesson Cilk’s Scheduler Conclusion

Parallelizing Vector Addition Multithreaded Programming in Cilk Lecture 1 July 13, 2006 Parallelizing Vector Addition C void vadd (real *A, real *B, int n){ int i; for (i=0; i<n; i++) A[i]+=B[i]; }

Parallelizing Vector Addition Multithreaded Programming in Cilk Lecture 1 July 13, 2006 Parallelizing Vector Addition C void vadd (real *A, real *B, int n){ int i; for (i=0; i<n; i++) A[i]+=B[i]; } C if (n<=BASE) { int i; for (i=0; i<n; i++) A[i]+=B[i]; } else { void vadd (real *A, real *B, int n){ vadd (A, B, n/2); vadd (A+n/2, B+n/2, n-n/2); } Parallelization strategy: Convert loops to recursion.

Parallelizing Vector Addition Multithreaded Programming in Cilk Lecture 1 July 13, 2006 Parallelizing Vector Addition C void vadd (real *A, real *B, int n){ int i; for (i=0; i<n; i++) A[i]+=B[i]; } C ilk void vadd (real *A, real *B, int n){ if (n<=BASE) { int i; for (i=0; i<n; i++) A[i]+=B[i]; } else { cilk_spawn vadd (A, B, n/2; cilk_spawn vadd (A+n/2, B+n/2, n-n/2; } cilk_sync; Parallelization strategy: Convert loops to recursion. Insert Cilk keywords. Side benefit: D&C is generally good for caches!

Multithreaded Programming in Cilk Lecture 1 Vector Addition July 13, 2006 void vadd (real *A, real *B, int n){ if (n<=BASE) { int i; for (i=0; i<n; i++) A[i]+=B[i]; } else { cilk_spawn vadd (A, B, n/2); cilk_spawn vadd (A+n/2, B+n/2, n-n/2); cilk_sync; }

Vector Addition Analysis Multithreaded Programming in Cilk Lecture 1 Vector Addition Analysis July 13, 2006 To add two vectors of length n, where BASE = (1): Work: T1 = ? (n) Span: T∞ = ? (lg n) Parallelism: T1/T∞ = ? (n/lg n) BASE

Another Parallelization Multithreaded Programming in Cilk Lecture 1 July 13, 2006 Another Parallelization C void vadd1 (real *A, real *B, int n){ int i; for (i=0; i<n; i++) A[i]+=B[i]; } void vadd (real *A, real *B, int n){ int j; for (j=0; j<n; j+=BASE) { vadd1(A+j, B+j, min(BASE, n-j)); Cilk void vadd1 (real *A, real *B, int n){ int i; for (i=0; i<n; i++) A[i]+=B[i]; } void vadd (real *A, real *B, int n){ int j; for (j=0; j<n; j+=BASE) { cilk_spawn vadd1(A+j, B+j, min(BASE, n-j)); cilk_sync;

Multithreaded Programming in Cilk Lecture 1 Analysis Multithreaded Programming in Cilk Lecture 1 July 13, 2006 … … BASE To add two vectors of length n, where BASE = (1): (n) PUNY! Work: T1 = ? Span: T∞ = ? Parallelism: T1/T∞ = ? (n) (1)

Multithreaded Programming in Cilk Lecture 1 Optimal Choice of BASE July 13, 2006 … … BASE To add two vectors of length n using an optimal choice of BASE to maximize parallelism: Work: T1 = ? (n) Span: T∞ = ? (BASE + n/BASE) Choosing BASE = √n , T∞ = (√n) Parallelism: T1/T∞ = ? (√n )

Multithreaded Programming in Cilk Lecture 1 July 13, 2006 LECTURE 1 Basic Cilk Programming Performance Measures Parallelizing Vector Addition Scheduling Theory A Chess Lesson Cilk’s Scheduler Conclusion

Multithreaded Programming in Cilk Lecture 1 Scheduling July 13, 2006 Cilk allows the programmer to express potential parallelism in an application. The Cilk scheduler maps Cilk threads onto processors dynamically at runtime. Since on-line schedulers are complicated, we’ll illustrate the ideas with an off-line scheduler. P Network … Memory I/O $

Multithreaded Programming in Cilk Lecture 1 Greedy Scheduling July 13, 2006 IDEA: Do as much as possible on every step. Definition: A thread is ready if all its predecessors have executed.

Multithreaded Programming in Cilk Lecture 1 Greedy Scheduling Multithreaded Programming in Cilk Lecture 1 July 13, 2006 IDEA: Do as much as possible on every step. Definition: A thread is ready if all its predecessors have executed. P = 3

Multithreaded Programming in Cilk Lecture 1 Greedy Scheduling Multithreaded Programming in Cilk Lecture 1 July 13, 2006 IDEA: Do as much as possible on every step. Definition: A thread is ready if all its predecessors have executed. P = 3

Greedy-Scheduling Theorem Multithreaded Programming in Cilk Lecture 1 Greedy-Scheduling Theorem July 13, 2006 Theorem [Graham ’68 & Brent ’75]. Any greedy scheduler achieves TP  T1/P + T. P = 3 Proof. # complete steps  T1/P, since each complete step performs P work. # incomplete steps  T∞, since each incomplete step reduces the span of the unexecuted dag by 1. ■

Multithreaded Programming in Cilk Lecture 1 Optimality of Greedy July 13, 2006 Corollary. Any greedy scheduler achieves within a factor of 2 of optimal.

Multithreaded Programming in Cilk Lecture 1 Linear Speedup Multithreaded Programming in Cilk Lecture 1 July 13, 2006 Definition. The quantity (T1/T∞ )/P is called the parallel slackness.

Multithreaded Programming in Cilk Lecture 1 Cilk Performance July 13, 2006

Multithreaded Programming in Cilk Lecture 1 July 13, 2006 LECTURE 1 Basic Cilk Programming Performance Measures Parallelizing Vector Addition Scheduling Theory A Chess Lesson Cilk’s Scheduler Conclusion

Multithreaded Programming in Cilk Lecture 1 Cilk Chess Programs July 13, 2006 Socrates placed 3rd in the 1994 International Computer Chess Championship running on NCSA’s 512-node Connection Machine CM5. Socrates 2.0 took 2nd place in the 1995 World Computer Chess Championship running on Sandia National Labs’ 1824-node Intel Paragon. Cilkchess placed 1st in the 1996 Dutch Open running on a 12-processor Sun Enterprise 5000. It placed 2nd in 1997 and 1998 running on Boston University’s 64-processor SGI Origin 2000. Cilkchess tied for 3rd in the 1999 WCCC running on NASA’s 256-node SGI Origin 2000.

Socrates Normalized Speedup Multithreaded Programming in Cilk Lecture 1 July 13, 2006 1 TP = T TP = T1/P + T T1/TP T1/T 0.1 TP = T1/P measured speedup 0.01 0.01 0.1 1 P T1/T

Multithreaded Programming in Cilk Lecture 1 Developing Socrates July 13, 2006 For the competition, Socrates was to run on a 512-processor Connection Machine Model CM5 supercomputer at the University of Illinois. The developers had easy access to a similar 32-processor CM5 at MIT. One of the developers proposed a change to the program that produced a speedup of over 20% on the MIT machine. After a back-of-the-envelope calculation, the proposed “improvement” was rejected!

Socrates Speedup Paradox Multithreaded Programming in Cilk Lecture 1 Socrates Speedup Paradox July 13, 2006 Original program Proposed program T32 = 65 seconds T32 = 40 seconds TP  T1/P + T T1 = 2048 seconds T = 1 second T1 = 1024 seconds T = 8 seconds T32 = 2048/32 + 1 = 65 seconds = 40 seconds T32 = 1024/32 + 8 T512 = 2048/512 + 1 = 5 seconds T512 = 1024/512 + 8 = 10 seconds

Multithreaded Programming in Cilk Lecture 1 Lesson July 13, 2006 Work and span can predict performance on large machines better than running times on small machines can.

Multithreaded Programming in Cilk Lecture 1 July 13, 2006 LECTURE 1 Basic Cilk Programming Performance Measures Parallelizing Vector Addition Scheduling Theory A Chess Lesson Cilk’s Scheduler Conclusion

Cilk’s Work-Stealing Scheduler Multithreaded Programming in Cilk Lecture 1 July 13, 2006 Cilk’s Work-Stealing Scheduler Each processor maintains a work deque of ready threads, and it manipulates the bottom of the deque like a stack. Spawn! P P P P

Cilk’s Work-Stealing Scheduler Multithreaded Programming in Cilk Lecture 1 July 13, 2006 Cilk’s Work-Stealing Scheduler Each processor maintains a work deque of ready threads, and it manipulates the bottom of the deque like a stack. Spawn! P P P P

Cilk’s Work-Stealing Scheduler Multithreaded Programming in Cilk Lecture 1 July 13, 2006 Cilk’s Work-Stealing Scheduler Each processor maintains a work deque of ready threads, and it manipulates the bottom of the deque like a stack. Return! P P P P

Cilk’s Work-Stealing Scheduler Multithreaded Programming in Cilk Lecture 1 July 13, 2006 Cilk’s Work-Stealing Scheduler Each processor maintains a work deque of ready threads, and it manipulates the bottom of the deque like a stack. Return! P P P P

Cilk’s Work-Stealing Scheduler Multithreaded Programming in Cilk Lecture 1 July 13, 2006 Cilk’s Work-Stealing Scheduler Each processor maintains a work deque of ready threads, and it manipulates the bottom of the deque like a stack. Steal! P P P P When a processor runs out of work, it steals a thread from the top of a random victim’s deque.

Cilk’s Work-Stealing Scheduler Multithreaded Programming in Cilk Lecture 1 July 13, 2006 Cilk’s Work-Stealing Scheduler Each processor maintains a work deque of ready threads, and it manipulates the bottom of the deque like a stack. Steal! P P P P When a processor runs out of work, it steals a thread from the top of a random victim’s deque.

Cilk’s Work-Stealing Scheduler Multithreaded Programming in Cilk Lecture 1 July 13, 2006 Cilk’s Work-Stealing Scheduler Each processor maintains a work deque of ready threads, and it manipulates the bottom of the deque like a stack. P P P P When a processor runs out of work, it steals a thread from the top of a random victim’s deque.

Cilk’s Work-Stealing Scheduler Multithreaded Programming in Cilk Lecture 1 July 13, 2006 Cilk’s Work-Stealing Scheduler Each processor maintains a work deque of ready threads, and it manipulates the bottom of the deque like a stack. Spawn! P P P P When a processor runs out of work, it steals a thread from the top of a random victim’s deque.

Performance of Work-Stealing Multithreaded Programming in Cilk Lecture 1 Performance of Work-Stealing July 13, 2006 Theorem: Cilk’s work-stealing scheduler achieves an expected running time of TP  T1/P + O(T∞) on P processors. Pseudoproof. A processor is either working or stealing. The total time all processors spend working is T1. Each steal has a 1/P chance of reducing the span by 1. Thus, the expected cost of all steals is O(PT∞). Since there are P processors, the expected time is (T1 + O(PT∞))/P = T1/P + O(T∞) . ■

Multithreaded Programming in Cilk Lecture 1 Space Bounds Multithreaded Programming in Cilk Lecture 1 July 13, 2006 P = 3 Proof (by induction). The work-stealing algorithm maintains the busy-leaves property: every extant procedure frame with no extant descendents has a processor working on it. ■ P S1

Linguistic Implications Multithreaded Programming in Cilk Lecture 1 July 13, 2006 Code like the following executes properly without any risk of blowing out memory: for (i=1; i<1000000000; i++) { cilk_spawn foo(i); } cilk_sync; MORAL Better to steal parents than children!

Multithreaded Programming in Cilk Lecture 1 July 13, 2006 LECTURE 1 Basic Cilk Programming Performance Measures Parallelizing Vector Addition Scheduling Theory A Chess Lesson Cilk’s Scheduler Conclusion

Multithreaded Programming in Cilk Lecture 1 Key Ideas July 13, 2006 Cilk is simple: spawn, sync Recursion, recursion, recursion, … Work & span