
1 LECTURE 2 Recurrences (Review) · Matrix Multiplication · Merge Sort · Tableau Construction · Conclusion

Algorithmic Complexity Measures
T_P = execution time on P processors
T_1 = work
T_∞ = span*
LOWER BOUNDS: T_P ≥ T_1/P and T_P ≥ T_∞
*Also called critical-path length or computational depth.

Speedup
Definition: T_1/T_P = speedup on P processors.
If T_1/T_P = Θ(P) ≤ P, we have linear speedup; if = P, we have perfect linear speedup; if > P, we have superlinear speedup, which is not possible in our model because of the lower bound T_P ≥ T_1/P.

Parallelism
Because we have the lower bound T_P ≥ T_∞, the maximum possible speedup given T_1 and T_∞ is T_1/T_∞ = parallelism = the average amount of work per step along the span.

5 The Master Method
The master method for solving recurrences applies to recurrences of the form T(n) = a T(n/b) + f(n), where a ≥ 1, b > 1, and f is asymptotically positive.*
IDEA: Compare n^(log_b a) with f(n).
*The unstated base case is T(n) = Θ(1) for sufficiently small n.

6 Master Method — CASE 1
T(n) = a T(n/b) + f(n)
n^(log_b a) ≫ f(n): specifically, f(n) = O(n^(log_b a − ε)) for some constant ε > 0.
Solution: T(n) = Θ(n^(log_b a)).

7 Master Method — CASE 2
T(n) = a T(n/b) + f(n)
n^(log_b a) ≈ f(n): specifically, f(n) = Θ(n^(log_b a) lg^k n) for some constant k ≥ 0.
Solution: T(n) = Θ(n^(log_b a) lg^(k+1) n).

8 Master Method — CASE 3
T(n) = a T(n/b) + f(n)
n^(log_b a) ≪ f(n): specifically, f(n) = Ω(n^(log_b a + ε)) for some constant ε > 0, and f(n) satisfies the regularity condition that a f(n/b) ≤ c f(n) for some constant c < 1.
Solution: T(n) = Θ(f(n)).

9 Master Method Summary
T(n) = a T(n/b) + f(n)
CASE 1: f(n) = O(n^(log_b a − ε)), constant ε > 0 ⇒ T(n) = Θ(n^(log_b a)).
CASE 2: f(n) = Θ(n^(log_b a) lg^k n), constant k ≥ 0 ⇒ T(n) = Θ(n^(log_b a) lg^(k+1) n).
CASE 3: f(n) = Ω(n^(log_b a + ε)), constant ε > 0, and regularity condition ⇒ T(n) = Θ(f(n)).

10 Master Method Quiz
T(n) = 4 T(n/2) + n: n^(log_b a) = n² ≫ n ⇒ CASE 1: T(n) = Θ(n²).
T(n) = 4 T(n/2) + n²: n^(log_b a) = n² = n² lg⁰ n ⇒ CASE 2: T(n) = Θ(n² lg n).
T(n) = 4 T(n/2) + n³: n^(log_b a) = n² ≪ n³ ⇒ CASE 3: T(n) = Θ(n³).
T(n) = 4 T(n/2) + n²/lg n: the master method does not apply!
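As a numerical sanity check on the second quiz line (an editorial sketch, not from the lecture), iterating T(n) = 4 T(n/2) + n² with T(1) = 1 shows T(n)/(n² lg n) settling toward a constant, as CASE 2 predicts:

#include <stdio.h>

int main(void) {
    double t = 1.0;                      /* T(1) = Θ(1), normalized to 1 */
    for (int k = 1; k <= 20; k++) {
        double n = (double)(1L << k);    /* n = 2^k */
        t = 4.0 * t + n * n;             /* T(n) = 4 T(n/2) + n^2 */
        printf("n=2^%-2d  T(n)/(n^2 lg n) = %.4f\n", k, t / (n * n * k));
    }
    return 0;
}

The exact solution of this instance is T(n) = n²(lg n + 1), so the printed ratio tends to 1.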

11 LECTURE 2 Recurrences (Review) · Matrix Multiplication · Merge Sort · Tableau Construction · Conclusion

12 Square-Matrix Multiplication
C = A × B, where C = [c_ij], A = [a_ij], and B = [b_ij] are n × n matrices, and
c_ij = Σ_{k=1}^{n} a_ik b_kj.
Assume for simplicity that n = 2^k.
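Written out in plain C (a reference sketch, not on the slide; row-major storage in flat arrays assumed), the definition is the familiar triple loop with Θ(n³) work:

/* Serial square-matrix multiply straight from the definition:
   C[i][j] = sum over k of A[i][k] * B[k][j]. */
void matmul_serial(float *C, const float *A, const float *B, int n) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            float sum = 0.0f;
            for (int k = 0; k < n; k++)
                sum += A[i*n + k] * B[k*n + j];
            C[i*n + j] = sum;
        }
}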

13 Recursive Matrix Multiplication
Divide and conquer — partition each matrix into four (n/2) × (n/2) blocks:
[C11 C12; C21 C22] = [A11 A12; A21 A22] × [B11 B12; B21 B22]
                   = [A11·B11  A11·B12; A21·B11  A21·B12] + [A12·B21  A12·B22; A22·B21  A22·B22]
8 multiplications of (n/2) × (n/2) matrices.
1 addition of n × n matrices.

14–17 Matrix Multiply in Pseudo-Cilk (C = A · B)

cilk void Mult(*C, *A, *B, n) {
  float *T = Cilk_alloca(n*n*sizeof(float));
  ⟨base case & partition matrices⟩
  spawn Mult(C11,A11,B11,n/2);
  spawn Mult(C12,A11,B12,n/2);
  spawn Mult(C22,A21,B12,n/2);
  spawn Mult(C21,A21,B11,n/2);
  spawn Mult(T11,A12,B21,n/2);
  spawn Mult(T12,A12,B22,n/2);
  spawn Mult(T22,A22,B22,n/2);
  spawn Mult(T21,A22,B21,n/2);
  sync;
  spawn Add(C,T,n);
  sync;
  return;
}

Notes on the pseudo-code:
Absence of type declarations.
Coarsen base cases for efficiency.
Submatrices are produced by pointer calculation, not copying of elements; a real implementation also needs a row-size argument for array indexing (see the sketch below).

The helper Add computes C = C + T:

cilk void Add(*C, *T, n) {
  ⟨base case & partition matrices⟩
  spawn Add(C11,T11,n/2);
  spawn Add(C12,T12,n/2);
  spawn Add(C21,T21,n/2);
  spawn Add(C22,T22,n/2);
  sync;
  return;
}
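To make the pointer-calculation remark concrete, here is a hedged C sketch of quadrant partitioning (the function name and signature are illustrative; ld is the row stride that the pseudo-code elides):

/* Partition an n x n matrix (row-major, row stride ld) into quadrants
   by pointer arithmetic alone -- no elements are copied.  The stride
   stays ld for every quadrant, which is why real code must pass it. */
void partition(float *M, int n, int ld,
               float **M11, float **M12, float **M21, float **M22) {
    *M11 = M;                        /* top-left     */
    *M12 = M + n/2;                  /* top-right    */
    *M21 = M + (n/2) * ld;           /* bottom-left  */
    *M22 = M + (n/2) * ld + n/2;     /* bottom-right */
}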

18 Work of Matrix Addition
Work: A_1(n) = 4 A_1(n/2) + Θ(1) = Θ(n²).
— CASE 1: n^(log_b a) = n^(log_2 4) = n² ≫ Θ(1).

19 Span of Matrix Addition
Span: A_∞(n) = A_∞(n/2) + Θ(1) = Θ(lg n). (The four spawned quadrant additions run in parallel, so the span takes the maximum over them, not the sum.)
— CASE 2: n^(log_b a) = n^(log_2 1) = 1 and f(n) = Θ(1) = Θ(n^(log_b a) lg⁰ n).

20 Work of Matrix Multiplication
Work: M_1(n) = 8 M_1(n/2) + A_1(n) + Θ(1)
             = 8 M_1(n/2) + Θ(n²)
             = Θ(n³).
— CASE 1: n^(log_b a) = n^(log_2 8) = n³ ≫ Θ(n²).

21 Span of Matrix Multiplication
Span: M_∞(n) = M_∞(n/2) + A_∞(n) + Θ(1)
             = M_∞(n/2) + Θ(lg n)
             = Θ(lg² n).
(All eight spawned multiplications run in parallel, so only one counts toward the span.)
— CASE 2: n^(log_b a) = n^(log_2 1) = 1 and f(n) = Θ(lg n) = Θ(n^(log_b a) lg¹ n).

22 Parallelism of Matrix Multiply
Work: M_1(n) = Θ(n³)
Span: M_∞(n) = Θ(lg² n)
Parallelism: M_1(n)/M_∞(n) = Θ(n³/lg² n)
For 1000 × 1000 matrices, parallelism ≈ (10³)³/10² = 10⁷.

23 Stack Temporaries
float *T = Cilk_alloca(n*n*sizeof(float));
In hierarchical-memory machines (especially chip multiprocessors), memory accesses are so expensive that minimizing storage often yields higher performance.
IDEA: Trade off parallelism for less storage.

24 No-Temp Matrix Multiplication

cilk void MultA(*C, *A, *B, n) { // C = C + A * B
  ⟨base case & partition matrices⟩
  spawn MultA(C11,A11,B11,n/2);
  spawn MultA(C12,A11,B12,n/2);
  spawn MultA(C22,A21,B12,n/2);
  spawn MultA(C21,A21,B11,n/2);
  sync;
  spawn MultA(C21,A22,B21,n/2);
  spawn MultA(C22,A22,B22,n/2);
  spawn MultA(C12,A12,B22,n/2);
  spawn MultA(C11,A12,B21,n/2);
  sync;
  return;
}

Saves space, but at what expense?

25 Work of No-Temp Multiply
Work: M_1(n) = 8 M_1(n/2) + Θ(1) = Θ(n³).
— CASE 1: n^(log_b a) = n^(log_2 8) = n³ ≫ Θ(1).

26 Span of No-Temp Multiply
Span: M_∞(n) = 2 M_∞(n/2) + Θ(1) = Θ(n).
(The two sync'd groups of four spawns run one after the other; within each group the span takes the maximum, but the two groups add.)
— CASE 1: n^(log_b a) = n^(log_2 2) = n ≫ Θ(1).

27 Parallelism of No-Temp Multiply
Work: M_1(n) = Θ(n³)
Span: M_∞(n) = Θ(n)
Parallelism: M_1(n)/M_∞(n) = Θ(n²)
For 1000 × 1000 matrices, parallelism ≈ (10³)³/10³ = 10⁶. Faster in practice!

28 Testing Synchronization
Cilk language feature: a programmer can check whether a Cilk procedure is “synched” (without actually performing a sync) by testing the pseudovariable SYNCHED:
SYNCHED = 0 ⇒ some spawned children might not have returned.
SYNCHED = 1 ⇒ all spawned children have definitely returned.

29 Best of Both Worlds

cilk void Mult1(*C, *A, *B, n) { // multiply & store
  ⟨base case & partition matrices⟩
  spawn Mult1(C11,A11,B11,n/2); // multiply & store
  spawn Mult1(C12,A11,B12,n/2);
  spawn Mult1(C22,A21,B12,n/2);
  spawn Mult1(C21,A21,B11,n/2);
  if (SYNCHED) {
    spawn MultA1(C11,A12,B21,n/2); // multiply & add
    spawn MultA1(C12,A12,B22,n/2);
    spawn MultA1(C22,A22,B22,n/2);
    spawn MultA1(C21,A22,B21,n/2);
  } else {
    float *T = Cilk_alloca(n*n*sizeof(float));
    spawn Mult1(T11,A12,B21,n/2); // multiply & store
    spawn Mult1(T12,A12,B22,n/2);
    spawn Mult1(T22,A22,B22,n/2);
    spawn Mult1(T21,A22,B21,n/2);
    sync;
    spawn Add(C,T,n); // C = C + T
  }
  sync;
  return;
}

This code is just as parallel as the original, but it only uses more space if runtime parallelism actually exists.

30 Ordinary Matrix Multiplication
c_ij = Σ_{k=1}^{n} a_ik b_kj
IDEA: Spawn n² inner products in parallel. Compute each inner product in parallel.
Work: Θ(n³). Span: Θ(lg n). Parallelism: Θ(n³/lg n).
BUT, this algorithm exhibits poor locality and does not exploit the cache hierarchy of modern microprocessors, especially CMPs.
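A hedged pseudo-Cilk sketch of this idea, in the style of the earlier slides (the names InnerProd and MultInner are illustrative, not from the lecture):

/* Dot product of a row of A (stride 1) with a column of B (stride sb),
   by parallel divide-and-conquer: Θ(n) work, Θ(lg n) span. */
cilk float InnerProd(float *a, float *b, int n, int sb) {
  if (n == 1) return a[0] * b[0];
  float x, y;
  x = spawn InnerProd(a,       b,            n/2,     sb);
  y = spawn InnerProd(a + n/2, b + sb*(n/2), n - n/2, sb);
  sync;
  return x + y;
}

/* Compute C[i][j] for i in [ilo,ihi) and j in [jlo,jhi) by splitting
   the larger index range in half; the n*n leaves run in parallel.
   Top-level call: MultInner(C, A, B, n, 0, n, 0, n). */
cilk void MultInner(float *C, float *A, float *B, int n,
                    int ilo, int ihi, int jlo, int jhi) {
  if (ihi - ilo == 1 && jhi - jlo == 1) {
    float v;
    v = spawn InnerProd(&A[ilo*n], &B[jlo], n, n);
    sync;
    C[ilo*n + jlo] = v;
  } else if (ihi - ilo >= jhi - jlo) {
    int mid = (ilo + ihi) / 2;
    spawn MultInner(C, A, B, n, ilo, mid, jlo, jhi);
    spawn MultInner(C, A, B, n, mid, ihi, jlo, jhi);
    sync;
  } else {
    int mid = (jlo + jhi) / 2;
    spawn MultInner(C, A, B, n, ilo, ihi, jlo, mid);
    spawn MultInner(C, A, B, n, ilo, ihi, mid, jhi);
    sync;
  }
}

The splitting tree over the n² output cells has depth Θ(lg n), and each InnerProd adds another Θ(lg n), giving the Θ(lg n) span claimed above.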

31 LECTURE 2 Recurrences (Review) · Matrix Multiplication · Merge Sort · Tableau Construction · Conclusion

Merging Two Sorted Arrays

void Merge(int *C, int *A, int *B, int na, int nb) {
  while (na>0 && nb>0) {
    if (*A <= *B) { *C++ = *A++; na--; }
    else          { *C++ = *B++; nb--; }
  }
  while (na>0) { *C++ = *A++; na--; }
  while (nb>0) { *C++ = *B++; nb--; }
}

Time to merge n elements = Θ(n).

33 Merge Sort

cilk void MergeSort(int *B, int *A, int n) {
  if (n==1) {
    B[0] = A[0];
  } else {
    int *C;
    C = (int*) Cilk_alloca(n*sizeof(int));
    spawn MergeSort(C, A, n/2);
    spawn MergeSort(C+n/2, A+n/2, n-n/2);
    sync;
    Merge(B, C, C+n/2, n/2, n-n/2);
  }
}

The two half-sorts run in parallel; the merge is serial.

34 Work of Merge Sort
Work: T_1(n) = 2 T_1(n/2) + Θ(n) = Θ(n lg n).
— CASE 2: n^(log_b a) = n^(log_2 2) = n and f(n) = Θ(n) = Θ(n^(log_b a) lg⁰ n).

35 Span of Merge Sort
Span: T_∞(n) = T_∞(n/2) + Θ(n) = Θ(n).
— CASE 3: n^(log_b a) = n^(log_2 1) = 1 ≪ Θ(n).

36 Parallelism of Merge Sort
Work: T_1(n) = Θ(n lg n)
Span: T_∞(n) = Θ(n)
Parallelism: T_1(n)/T_∞(n) = Θ(lg n)
We need to parallelize the merge!

37 Parallel Merge
(The figure shows arrays A, of length na, and B, of length nb, with na ≥ nb. Split A at its median A[na/2]; binary-search B for the position j where A[na/2] would fall, so that elements ≤ A[na/2] lie to its left and elements ≥ A[na/2] to its right. Then recursively merge A[0..na/2−1] with B[0..j−1], and A[na/2..na−1] with B[j..nb−1].)
KEY IDEA: If the total number of elements to be merged in the two arrays is n = na + nb, the total number of elements in the larger of the two recursive merges is at most (3/4)n: each recursive merge omits at least half of A, and since na ≥ n/2, half of A is at least n/4 elements.

38 Parallel Merge

cilk void P_Merge(int *C, int *A, int *B, int na, int nb) {
  if (na < nb) {
    spawn P_Merge(C, B, A, nb, na);
  } else if (na==1) {
    if (nb == 0) {
      C[0] = A[0];
    } else {
      C[0] = (A[0]<B[0]) ? A[0] : B[0]; /* minimum */
      C[1] = (A[0]<B[0]) ? B[0] : A[0]; /* maximum */
    }
  } else {
    int ma = na/2;
    int mb = BinarySearch(A[ma], B, nb);
    spawn P_Merge(C, A, B, ma, mb);
    spawn P_Merge(C+ma+mb, A+ma, B+mb, na-ma, nb-mb);
    sync;
  }
}

Coarsen base cases for efficiency.
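The code assumes a helper BinarySearch(x, B, nb) that returns how many elements of sorted B are less than x. A plain-C sketch consistent with that use (the original helper's exact semantics may differ):

/* Return the number of elements of sorted B[0..nb-1] that are < x,
   i.e. the index at which x would be inserted.  O(lg nb) time. */
int BinarySearch(int x, int *B, int nb) {
    int lo = 0, hi = nb;   /* invariant: B[0..lo-1] < x <= B[hi..nb-1] */
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (B[mid] < x) lo = mid + 1;
        else            hi = mid;
    }
    return lo;
}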

39 Span of P_Merge
Span: T_∞(n) = T_∞(3n/4) + Θ(lg n) = Θ(lg² n).
— CASE 2: n^(log_b a) = n^(log_{4/3} 1) = 1 and f(n) = Θ(lg n) = Θ(n^(log_b a) lg¹ n).

40 Work of P_Merge
Work: T_1(n) = T_1(αn) + T_1((1−α)n) + Θ(lg n), where 1/4 ≤ α ≤ 3/4.
CLAIM: T_1(n) = Θ(n).

41–43 Analysis of Work Recurrence
T_1(n) = T_1(αn) + T_1((1−α)n) + Θ(lg n), where 1/4 ≤ α ≤ 3/4.
Substitution method: the inductive hypothesis is T_1(k) ≤ c_1 k − c_2 lg k, where c_1, c_2 > 0. Prove that the relation holds, and solve for c_1 and c_2:
T_1(n) = T_1(αn) + T_1((1−α)n) + Θ(lg n)
       ≤ c_1(αn) − c_2 lg(αn) + c_1((1−α)n) − c_2 lg((1−α)n) + Θ(lg n)
       = c_1 n − c_2 lg(αn) − c_2 lg((1−α)n) + Θ(lg n)
       = c_1 n − c_2 (lg(α(1−α)) + 2 lg n) + Θ(lg n)
       = c_1 n − c_2 lg n − (c_2 (lg n + lg(α(1−α))) − Θ(lg n))
       ≤ c_1 n − c_2 lg n
by choosing c_1 and c_2 large enough (the subtracted term is nonnegative for large enough c_2, since 1/4 ≤ α ≤ 3/4 makes lg(α(1−α)) a constant). Hence T_1(n) = O(n), and since the merge must touch every element, T_1(n) = Θ(n).

44 Parallelism of P_Merge
Work: T_1(n) = Θ(n)
Span: T_∞(n) = Θ(lg² n)
Parallelism: T_1(n)/T_∞(n) = Θ(n/lg² n)

45 Parallel Merge Sort

cilk void P_MergeSort(int *B, int *A, int n) {
  if (n==1) {
    B[0] = A[0];
  } else {
    int *C;
    C = (int*) Cilk_alloca(n*sizeof(int));
    spawn P_MergeSort(C, A, n/2);
    spawn P_MergeSort(C+n/2, A+n/2, n-n/2);
    sync;
    spawn P_Merge(B, C, C+n/2, n/2, n-n/2);
  }
}

46 Work of Parallel Merge Sort
Work: T_1(n) = 2 T_1(n/2) + Θ(n) = Θ(n lg n).
— CASE 2: n^(log_b a) = n^(log_2 2) = n and f(n) = Θ(n) = Θ(n^(log_b a) lg⁰ n).

47 Span of Parallel Merge Sort
Span: T_∞(n) = T_∞(n/2) + Θ(lg² n) = Θ(lg³ n).
— CASE 2: n^(log_b a) = n^(log_2 1) = 1 and f(n) = Θ(lg² n) = Θ(n^(log_b a) lg² n).

48 Parallelism of Parallel Merge Sort
Work: T_1(n) = Θ(n lg n)
Span: T_∞(n) = Θ(lg³ n)
Parallelism: T_1(n)/T_∞(n) = Θ(n/lg² n)

49 LECTURE 2 Recurrences (Review) · Matrix Multiplication · Merge Sort · Tableau Construction · Conclusion

50 Tableau Construction
Problem: Fill in an n × n tableau A, where
A[i, j] = f(A[i, j−1], A[i−1, j], A[i−1, j−1]).
Applications in dynamic programming: longest common subsequence, edit distance, time warping.
Work: Θ(n²).
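For concreteness, a serial C sketch of a tableau fill, with f instantiated as the edit-distance rule on two strings (an illustrative choice; the lecture keeps f abstract):

#define MIN3(a,b,c) ((a)<(b) ? ((a)<(c)?(a):(c)) : ((b)<(c)?(b):(c)))

/* Fill the n x n tableau A (row-major): each interior cell depends on
   its left, upper, and upper-left neighbors, exactly the dependence
   pattern on the slide.  Assumes s and t have length >= n-1. */
void tableau(int *A, const char *s, const char *t, int n) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            if (i == 0)      A[j]     = j;   /* boundary row    */
            else if (j == 0) A[i*n]   = i;   /* boundary column */
            else A[i*n+j] = MIN3(A[i*n+(j-1)] + 1,      /* left     */
                                 A[(i-1)*n+j] + 1,      /* up       */
                                 A[(i-1)*n+(j-1)]       /* diagonal */
                                   + (s[i-1] != t[j-1]));
        }
}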

51 Recursive Construction
Divide the n × n tableau into four (n/2) × (n/2) quadrants: I (top-left), II (top-right), III (bottom-left), IV (bottom-right). Cilk code:
  spawn I;
  sync;
  spawn II;
  spawn III;
  sync;
  spawn IV;
  sync;
II and III can be filled in parallel once I is done; IV must wait for both. (A fuller sketch follows.)
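A hedged pseudo-Cilk sketch of that schedule (the name Tableau and the quadrant-origin arithmetic are illustrative; ⟨fill cell⟩ stands for applying f):

/* Recursively fill the n x n sub-tableau whose top-left corner is
   (i, j), following the I; (II || III); IV schedule above. */
cilk void Tableau(int i, int j, int n) {
  if (n == 1) { ⟨fill cell (i, j) using f⟩; return; }
  spawn Tableau(i,       j,       n/2);   /* I   */
  sync;
  spawn Tableau(i,       j + n/2, n/2);   /* II  */
  spawn Tableau(i + n/2, j,       n/2);   /* III */
  sync;
  spawn Tableau(i + n/2, j + n/2, n/2);   /* IV  */
  sync;
}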

52 Recursive Construction — Work
Work: T_1(n) = 4 T_1(n/2) + Θ(1) = Θ(n²) — CASE 1.

53 Recursive Construction — Span
Span: T_∞(n) = 3 T_∞(n/2) + Θ(1) = Θ(n^(lg 3)) — CASE 1.
(I, then {II, III} in parallel, then IV: three of the four quadrants lie on the critical path.)

54 Analysis of Tableau Construction
Work: T_1(n) = Θ(n²)
Span: T_∞(n) = Θ(n^(lg 3)) ≈ Θ(n^1.58)
Parallelism: T_1(n)/T_∞(n) ≈ Θ(n^0.42)

55 A More-Parallel Construction
Divide the tableau into nine (n/3) × (n/3) subtableaux I–IX, numbered along antidiagonals. Cilk code:
  spawn I;
  sync;
  spawn II;
  spawn III;
  sync;
  spawn IV;
  spawn V;
  spawn VI;
  sync;
  spawn VII;
  spawn VIII;
  sync;
  spawn IX;
  sync;

56 A More-Parallel Construction — Work
Work: T_1(n) = 9 T_1(n/3) + Θ(1) = Θ(n²) — CASE 1.

57 A More-Parallel Construction — Span
Span: T_∞(n) = 5 T_∞(n/3) + Θ(1) = Θ(n^(log_3 5)) — CASE 1.
(The five antidiagonal phases execute in sequence; each contributes one subproblem to the critical path.)

58 Analysis of Revised Construction
Work: T_1(n) = Θ(n²)
Span: T_∞(n) = Θ(n^(log_3 5)) ≈ Θ(n^1.46)
Parallelism: T_1(n)/T_∞(n) ≈ Θ(n^0.54)
More parallel by a factor of Θ(n^0.54)/Θ(n^0.42) = Θ(n^0.12).

59 LECTURE 2 Recurrences (Review) · Matrix Multiplication · Merge Sort · Tableau Construction · Conclusion

60 Key Ideas
Cilk is simple: cilk, spawn, sync, SYNCHED.
Recurrences, recurrences, recurrences, …
Work & span.

61 Palindrome
Propose a Cilk palindrome solver. What is the key idea? What is the algorithm? What is the span? What is the work? What is the parallelism?
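One possible key idea (an editorial sketch, not the lecture's solution): compare s[i] with s[n−1−i] for every i in the first half, by divide-and-conquer over the index range, so the comparisons run in parallel:

/* Hedged sketch: return nonzero iff s[i] == s[n-1-i] for all i in
   [lo, hi).  Top-level call for a string of length n >= 2:
   Pal(s, n, 0, n/2).  Work Θ(n), span Θ(lg n), parallelism Θ(n/lg n). */
cilk int Pal(char *s, int n, int lo, int hi) {
  if (hi - lo == 1) return s[lo] == s[n-1-lo];
  int mid = (lo + hi) / 2;
  int left, right;
  left  = spawn Pal(s, n, lo,  mid);
  right = spawn Pal(s, n, mid, hi);
  sync;
  return left && right;
}

Note that this checker has work Θ(n) and span Θ(lg n); the two slides that follow instead plug the merge-sort-shaped recurrence into the master method, giving Θ(n lg n) work and Θ(n) span.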

62 Palindrome
The slide shows the MergeSort code from slide 33 again, as the divide-and-conquer template to adapt for the palindrome solver: split the input in half, spawn the two halves, sync, then combine.

63 Work of Palindrome
Work: T_1(n) = 2 T_1(n/2) + Θ(n) = Θ(n lg n).
— CASE 2: n^(log_b a) = n^(log_2 2) = n and f(n) = Θ(n) = Θ(n^(log_b a) lg⁰ n).
(The analysis mirrors merge sort: two half-size subproblems plus Θ(n) of combining work.)

64 Span of Palindrome
Span: T_∞(n) = T_∞(n/2) + Θ(n) = Θ(n).
— CASE 3: n^(log_b a) = n^(log_2 1) = 1 ≪ Θ(n).