Multithreaded Programming in Cilk: LECTURE 2
Charles E. Leiserson
Supercomputing Technologies Research Group
Computer Science and Artificial Intelligence Laboratory

Presentation transcript:
1 LECTURE 2: Recurrences (Review), Matrix Multiplication, Merge Sort, Tableau Construction, Conclusion

2 The Master Method
The master method for solving recurrences applies to recurrences of the form
    T(n) = a T(n/b) + f(n),
where a ≥ 1, b > 1, and f is asymptotically positive.
IDEA: Compare n^(log_b a) with f(n).
*The unstated base case is T(n) = Θ(1) for sufficiently small n.*

3 Master Method — CASE 1
T(n) = a T(n/b) + f(n)
n^(log_b a) ≫ f(n): specifically, f(n) = O(n^(log_b a − ε)) for some constant ε > 0.
Solution: T(n) = Θ(n^(log_b a)).

4 Master Method — CASE 2
T(n) = a T(n/b) + f(n)
n^(log_b a) ≈ f(n): specifically, f(n) = Θ(n^(log_b a) lg^k n) for some constant k ≥ 0.
Solution: T(n) = Θ(n^(log_b a) lg^(k+1) n).

5 Master Method — CASE 3
T(n) = a T(n/b) + f(n)
n^(log_b a) ≪ f(n): specifically, f(n) = Ω(n^(log_b a + ε)) for some constant ε > 0, and f(n) satisfies the regularity condition that a f(n/b) ≤ c f(n) for some constant c < 1.
Solution: T(n) = Θ(f(n)).

6 Master Method Summary
T(n) = a T(n/b) + f(n)
CASE 1: f(n) = O(n^(log_b a − ε)), constant ε > 0 ⇒ T(n) = Θ(n^(log_b a)).
CASE 2: f(n) = Θ(n^(log_b a) lg^k n), constant k ≥ 0 ⇒ T(n) = Θ(n^(log_b a) lg^(k+1) n).
CASE 3: f(n) = Ω(n^(log_b a + ε)), constant ε > 0, and the regularity condition ⇒ T(n) = Θ(f(n)).

7 Master Method Quiz
• T(n) = 4 T(n/2) + n
  n^(log_b a) = n² ≫ n ⇒ CASE 1: T(n) = Θ(n²).
• T(n) = 4 T(n/2) + n²
  n^(log_b a) = n² = n² lg⁰ n ⇒ CASE 2: T(n) = Θ(n² lg n).
• T(n) = 4 T(n/2) + n³
  n^(log_b a) = n² ≪ n³ ⇒ CASE 3: T(n) = Θ(n³).
• T(n) = 4 T(n/2) + n²/lg n
  Master method does not apply!
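
As a quick numerical sanity check of the CASE 2 answer above, the short C program below (an illustrative addition for this transcript, not from the lecture) evaluates T(n) = 4 T(n/2) + n² bottom-up for n a power of 2 and compares it against n² lg n; the ratio settles toward a constant, as the theorem predicts.

  #include <stdio.h>
  #include <math.h>

  /* Evaluate T(n) = 4 T(n/2) + n^2 with T(1) = 1, bottom-up for n = 2^k. */
  int main(void) {
      double T = 1.0;                       /* T(1) */
      for (int k = 1; k <= 30; k++) {
          double n = ldexp(1.0, k);         /* n = 2^k */
          T = 4.0 * T + n * n;              /* T(n) = 4 T(n/2) + n^2 */
          if (k % 5 == 0)                   /* print every fifth size */
              printf("n = 2^%-2d  T(n)/(n^2 lg n) = %.4f\n",
                     k, T / (n * n * k));   /* lg n = k */
      }
      return 0;
  }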

8 LECTURE 2 outline: Recurrences (Review), Matrix Multiplication, Merge Sort, Tableau Construction, Conclusion. Next: Matrix Multiplication.

9 Square-Matrix Multiplication
C = A × B, where C = [c_ij], A = [a_ij], and B = [b_ij] are n × n matrices, and
    c_ij = Σ_{k=1}^{n} a_ik b_kj.
Assume for simplicity that n = 2^k.
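
For reference, here is that definition transcribed directly into serial C (a baseline sketch added for this transcript, not lecture code), for n × n row-major matrices:

  /* Direct implementation of c_ij = sum over k of a_ik * b_kj. */
  void matmul(float *C, const float *A, const float *B, int n) {
      for (int i = 0; i < n; i++)
          for (int j = 0; j < n; j++) {
              float sum = 0.0f;
              for (int k = 0; k < n; k++)
                  sum += A[i*n + k] * B[k*n + j];   /* a_ik * b_kj */
              C[i*n + j] = sum;                     /* c_ij */
          }
  }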

10 Recursive Matrix Multiplication
Divide and conquer: partition each matrix into four (n/2) × (n/2) blocks:
    [C11 C12; C21 C22] = [A11 A12; A21 A22] × [B11 B12; B21 B22]
                       = [A11B11 A11B12; A21B11 A21B12] + [A12B21 A12B22; A22B21 A22B22]
8 multiplications of (n/2) × (n/2) matrices.
1 addition of n × n matrices.

11 Matrix Multiply in Pseudo-Cilk

cilk void Mult(*C, *A, *B, n) {          // C = A · B
  float *T = Cilk_alloca(n*n*sizeof(float));
  ⟨base case & partition matrices⟩
  spawn Mult(C11,A11,B11,n/2);
  spawn Mult(C12,A11,B12,n/2);
  spawn Mult(C22,A21,B12,n/2);
  spawn Mult(C21,A21,B11,n/2);
  spawn Mult(T11,A12,B21,n/2);
  spawn Mult(T12,A12,B22,n/2);
  spawn Mult(T22,A22,B22,n/2);
  spawn Mult(T21,A22,B21,n/2);
  sync;
  spawn Add(C,T,n);
  sync;
  return;
}

Note the absence of type declarations (this is pseudo-Cilk).

12 Matrix Multiply in Pseudo-Cilk (same code as slide 11), C = A · B.
Callout: coarsen base cases for efficiency.

13 Matrix Multiply in Pseudo-Cilk (same code as slide 11), C = A · B.
Callout: submatrices are produced by pointer calculation, not copying of elements. This also requires a row-size argument for array indexing.
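
A minimal sketch of what "pointer calculation" means here, assuming row-major storage with an explicit row stride (the helper name and signature are my own illustration, not lecture code):

  /* A quadrant of an n x n submatrix stored with row stride `stride`
     is just an offset pointer -- no elements are copied. */
  static inline float *quadrant(float *M, int n, int stride, int qi, int qj) {
      /* qi, qj in {0,1}: e.g. M11 = quadrant(M, n, s, 0, 0),
                               M12 = quadrant(M, n, s, 0, 1). */
      return M + (qi * (n/2)) * stride + (qj * (n/2));
  }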

14 Matrix Multiply in Pseudo-Cilk
The helper Add computes C = C + T:

cilk void Add(*C, *T, n) {
  ⟨base case & partition matrices⟩
  spawn Add(C11,T11,n/2);
  spawn Add(C12,T12,n/2);
  spawn Add(C21,T21,n/2);
  spawn Add(C22,T22,n/2);
  sync;
  return;
}

15 Work of Matrix Addition
Work: A₁(n) = 4 A₁(n/2) + Θ(1) = Θ(n²),
by CASE 1: n^(log_b a) = n^(log₂ 4) = n² ≫ Θ(1).
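
The same result falls out of unrolling the recurrence (a recursion-tree view, added here for completeness): level i of the tree has 4^i nodes, each doing Θ(1) work, over lg n levels:

  A_1(n) \;=\; \sum_{i=0}^{\lg n} 4^{i}\,\Theta(1)
         \;=\; \Theta\bigl(4^{\lg n}\bigr)
         \;=\; \Theta(n^{2}).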

16 Span of Matrix Addition
Span: A∞(n) = A∞(n/2) + Θ(1) = Θ(lg n)
(the four spawned quadrant additions run in parallel, so the span takes the maximum over them, not the sum),
by CASE 2: n^(log_b a) = n^(log₂ 1) = 1 ⇒ f(n) = Θ(n^(log_b a) lg⁰ n).

17 Work of Matrix Multiplication
Work: M₁(n) = 8 M₁(n/2) + A₁(n) + Θ(1)
            = 8 M₁(n/2) + Θ(n²)
            = Θ(n³),
by CASE 1: n^(log_b a) = n^(log₂ 8) = n³ ≫ Θ(n²).

18 Span of Matrix Multiplication
Span: M∞(n) = M∞(n/2) + A∞(n) + Θ(1)
            = M∞(n/2) + Θ(lg n)
            = Θ(lg² n),
by CASE 2: n^(log_b a) = n^(log₂ 1) = 1 ⇒ f(n) = Θ(n^(log_b a) lg¹ n).
(All eight recursive multiplications run in parallel, so only one M∞(n/2) term appears.)
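
Unrolling this recurrence makes the lg² n explicit (added for completeness): one subproblem per level, lg n levels, with Θ(lg(n/2^i)) added at level i:

  M_\infty(n) \;=\; \sum_{i=0}^{\lg n} \Theta\bigl(\lg(n/2^{i})\bigr)
              \;=\; \Theta\Bigl(\sum_{j=0}^{\lg n} j\Bigr)
              \;=\; \Theta(\lg^{2} n).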

19 Parallelism of Matrix Multiply
Work: M₁(n) = Θ(n³)
Span: M∞(n) = Θ(lg² n)
Parallelism: M₁(n)/M∞(n) = Θ(n³/lg² n)
For 1000 × 1000 matrices, parallelism ≈ (10³)³/10² = 10⁷.

20 Stack Temporaries
  float *T = Cilk_alloca(n*n*sizeof(float));
In hierarchical-memory machines (especially chip multiprocessors), memory accesses are so expensive that minimizing storage often yields higher performance.
IDEA: Trade off parallelism for less storage.

21 No-Temp Matrix Multiplication

cilk void MultA(*C, *A, *B, n) {    // C = C + A * B
  ⟨base case & partition matrices⟩
  spawn MultA(C11,A11,B11,n/2);
  spawn MultA(C12,A11,B12,n/2);
  spawn MultA(C22,A21,B12,n/2);
  spawn MultA(C21,A21,B11,n/2);
  sync;
  spawn MultA(C21,A22,B21,n/2);
  spawn MultA(C22,A22,B22,n/2);
  spawn MultA(C12,A12,B22,n/2);
  spawn MultA(C11,A12,B21,n/2);
  sync;
  return;
}

Saves space, but at what expense?

22 =(n3)=(n3) Work of No-Temp Multiply M 1 (n)= ? 8 M 1 (n/2) +  (1) Work: — C ASE 1 cilk void MultA(*C, *A, *B, n) { // C = C + A * B h base case & partition matrices i spawn MultA(C11,A11,B11,n/2); spawn MultA(C12,A11,B12,n/2); spawn MultA(C22,A21,B12,n/2); spawn MultA(C21,A21,B11,n/2); sync; spawn MultA(C21,A22,B21,n/2); spawn MultA(C22,A22,B22,n/2); spawn MultA(C12,A12,B22,n/2); spawn MultA(C11,A12,B21,n/2); sync; return; } cilk void MultA(*C, *A, *B, n) { // C = C + A * B h base case & partition matrices i spawn MultA(C11,A11,B11,n/2); spawn MultA(C12,A11,B12,n/2); spawn MultA(C22,A21,B12,n/2); spawn MultA(C21,A21,B11,n/2); sync; spawn MultA(C21,A22,B21,n/2); spawn MultA(C22,A22,B22,n/2); spawn MultA(C12,A12,B22,n/2); spawn MultA(C11,A12,B21,n/2); sync; return; }

23 Span of No-Temp Multiply
Span: M∞(n) = 2 M∞(n/2) + Θ(1) = Θ(n) — CASE 1.
(The two groups of four spawns are separated by a sync and so run in series; within each group the spawns run in parallel, contributing the maximum, hence the factor of 2.)

24 Parallelism of No-Temp Multiply
Work: M₁(n) = Θ(n³)
Span: M∞(n) = Θ(n)
Parallelism: M₁(n)/M∞(n) = Θ(n²)
For 1000 × 1000 matrices, parallelism ≈ (10³)³/10³ = 10⁶. Faster in practice!

25 Testing Synchronization
Cilk language feature: a programmer can check whether a Cilk procedure is "synched" (without actually performing a sync) by testing the pseudovariable SYNCHED:
  SYNCHED = 0 ⇒ some spawned children might not have returned.
  SYNCHED = 1 ⇒ all spawned children have definitely returned.

26 Best of Both Worlds

cilk void Mult1(*C, *A, *B, n) {        // multiply & store
  ⟨base case & partition matrices⟩
  spawn Mult1(C11,A11,B11,n/2);         // multiply & store
  spawn Mult1(C12,A11,B12,n/2);
  spawn Mult1(C22,A21,B12,n/2);
  spawn Mult1(C21,A21,B11,n/2);
  if (SYNCHED) {
    spawn MultA1(C11,A12,B21,n/2);      // multiply & add
    spawn MultA1(C12,A12,B22,n/2);
    spawn MultA1(C22,A22,B22,n/2);
    spawn MultA1(C21,A22,B21,n/2);
  } else {
    float *T = Cilk_alloca(n*n*sizeof(float));
    spawn Mult1(T11,A12,B21,n/2);       // multiply & store
    spawn Mult1(T12,A12,B22,n/2);
    spawn Mult1(T22,A22,B22,n/2);
    spawn Mult1(T21,A22,B21,n/2);
    sync;
    spawn Add(C,T,n);                   // C = C + T
  }
  sync;
  return;
}

This code is just as parallel as the original, but it only uses more space if runtime parallelism actually exists.

27 Ordinary Matrix Multiplication
    c_ij = Σ_{k=1}^{n} a_ik b_kj
IDEA: Spawn n² inner products in parallel. Compute each inner product in parallel.
Work: Θ(n³). Span: Θ(lg n). Parallelism: Θ(n³/lg n).
BUT, this algorithm exhibits poor locality and does not exploit the cache hierarchy of modern microprocessors, especially CMPs.
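
A sketch of the "compute each inner product in parallel" idea (shown here as plain serial C, added for this transcript; in Cilk the two recursive halves could be spawned, which is what gives the Θ(lg n) span):

  /* Divide-and-conquer dot product: the two halves are independent. */
  static float dot(const float *a, const float *b, int n) {
      if (n == 1)
          return a[0] * b[0];
      int h = n / 2;
      float lo = dot(a,     b,     h);      /* spawnable in Cilk */
      float hi = dot(a + h, b + h, n - h);  /* spawnable in Cilk */
      return lo + hi;                       /* combine after a sync */
  }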

28 LECTURE 2 outline: Recurrences (Review), Matrix Multiplication, Merge Sort, Tableau Construction, Conclusion. Next: Merge Sort.

29 Merging Two Sorted Arrays

void Merge(int *C, int *A, int *B, int na, int nb) {
  while (na>0 && nb>0) {
    if (*A <= *B) {
      *C++ = *A++; na--;
    } else {
      *C++ = *B++; nb--;
    }
  }
  while (na>0) {
    *C++ = *A++; na--;
  }
  while (nb>0) {
    *C++ = *B++; nb--;
  }
}

Time to merge n elements = Θ(n).

30 Merge Sort

cilk void MergeSort(int *B, int *A, int n) {
  if (n==1) {
    B[0] = A[0];
  } else {
    int *C;
    C = (int*) Cilk_alloca(n*sizeof(int));
    spawn MergeSort(C, A, n/2);
    spawn MergeSort(C+n/2, A+n/2, n-n/2);
    sync;
    Merge(B, C, C+n/2, n/2, n-n/2);
  }
}

[Figure: the two recursively sorted halves are combined by the merge step.]

31 Work of Merge Sort
Work: T₁(n) = 2 T₁(n/2) + Θ(n) = Θ(n lg n),
by CASE 2: n^(log_b a) = n^(log₂ 2) = n ⇒ f(n) = Θ(n^(log_b a) lg⁰ n).

32 Span of Merge Sort
Span: T∞(n) = T∞(n/2) + Θ(n) = Θ(n),
by CASE 3: n^(log_b a) = n^(log₂ 1) = 1 ≪ Θ(n).
(The two recursive sorts run in parallel, but the Θ(n) serial Merge runs after them and dominates.)

33 Parallelism of Merge Sort
Work: T₁(n) = Θ(n lg n)
Span: T∞(n) = Θ(n)
Parallelism: T₁(n)/T∞(n) = Θ(lg n)
We need to parallelize the merge!

34 Parallel Merge
Assume na ≥ nb (otherwise swap the roles of A and B). Take the median element A[na/2] of A, and binary-search B for the index j such that B[0..j) ≤ A[na/2] ≤ B[j..nb). Then recursively merge A[0..na/2) with B[0..j), and A[na/2..na) with B[j..nb).
[Figure: arrays A (length na) and B (length nb); elements ≤ A[na/2] feed the left recursive merge, elements ≥ A[na/2] the right one; a binary search locates the split j in B.]
KEY IDEA: If the total number of elements to be merged in the two arrays is n = na + nb, then the total number of elements in the larger of the two recursive merges is at most (3/4)n: since na ≥ nb, one half of A contains at least n/4 elements, and each recursive merge excludes one half of A.

35 Parallel Merge

cilk void P_Merge(int *C, int *A, int *B, int na, int nb) {
  if (na < nb) {
    spawn P_Merge(C, B, A, nb, na);
  } else if (na==1) {
    if (nb == 0) {
      C[0] = A[0];
    } else {
      C[0] = (A[0]<B[0]) ? A[0] : B[0]; /* minimum */
      C[1] = (A[0]<B[0]) ? B[0] : A[0]; /* maximum */
    }
  } else {
    int ma = na/2;
    int mb = BinarySearch(A[ma], B, nb);
    spawn P_Merge(C, A, B, ma, mb);
    spawn P_Merge(C+ma+mb, A+ma, B+mb, na-ma, nb-mb);
    sync;
  }
}

Coarsen base cases for efficiency.
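
The lecture does not show the body of BinarySearch; a plausible reading, consistent with how P_Merge uses its result mb, is a lower-bound search returning how many elements of B precede the key (this version is an assumption, not lecture code):

  /* Assumed semantics: return the number of elements of B[0..nb)
     that are < key, i.e. the split point mb.  Theta(lg nb) time. */
  static int BinarySearch(int key, const int *B, int nb) {
      int lo = 0, hi = nb;   /* invariant: B[0..lo) < key <= B[hi..nb) */
      while (lo < hi) {
          int mid = lo + (hi - lo) / 2;
          if (B[mid] < key)
              lo = mid + 1;
          else
              hi = mid;
      }
      return lo;
  }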

36 Span of P_Merge
Span: T∞(n) = T∞(3n/4) + Θ(lg n) = Θ(lg² n),
by CASE 2: n^(log_b a) = n^(log_{4/3} 1) = 1 ⇒ f(n) = Θ(n^(log_b a) lg¹ n).

37 Work of P_Merge
Work: T₁(n) = T₁(αn) + T₁((1−α)n) + Θ(lg n), where 1/4 ≤ α ≤ 3/4.
CLAIM: T₁(n) = Θ(n).

38 Analysis of Work Recurrence
T₁(n) = T₁(αn) + T₁((1−α)n) + Θ(lg n), where 1/4 ≤ α ≤ 3/4.
Substitution method: the inductive hypothesis is T₁(k) ≤ c₁k − c₂ lg k, where c₁, c₂ > 0. Prove that the relation holds, and solve for c₁ and c₂.
T₁(n) = T₁(αn) + T₁((1−α)n) + Θ(lg n)
      ≤ c₁(αn) − c₂ lg(αn) + c₁((1−α)n) − c₂ lg((1−α)n) + Θ(lg n)

39 Analysis of Work Recurrence (continued)
T₁(n) = T₁(αn) + T₁((1−α)n) + Θ(lg n), where 1/4 ≤ α ≤ 3/4.
      ≤ c₁(αn) − c₂ lg(αn) + c₁(1−α)n − c₂ lg((1−α)n) + Θ(lg n)

40 Analysis of Work Recurrence (continued)
T₁(n) ≤ c₁(αn) − c₂ lg(αn) + c₁(1−α)n − c₂ lg((1−α)n) + Θ(lg n)
      ≤ c₁n − c₂ lg(αn) − c₂ lg((1−α)n) + Θ(lg n)
      ≤ c₁n − c₂ (lg(α(1−α)) + 2 lg n) + Θ(lg n)
      ≤ c₁n − c₂ lg n − (c₂ (lg n + lg(α(1−α))) − Θ(lg n))
      ≤ c₁n − c₂ lg n,
by choosing c₁ and c₂ large enough.

41 Parallelism of P_Merge
Work: T₁(n) = Θ(n)
Span: T∞(n) = Θ(lg² n)
Parallelism: T₁(n)/T∞(n) = Θ(n/lg² n)

42 Parallel Merge Sort

cilk void P_MergeSort(int *B, int *A, int n) {
  if (n==1) {
    B[0] = A[0];
  } else {
    int *C;
    C = (int*) Cilk_alloca(n*sizeof(int));
    spawn P_MergeSort(C, A, n/2);
    spawn P_MergeSort(C+n/2, A+n/2, n-n/2);
    sync;
    spawn P_Merge(B, C, C+n/2, n/2, n-n/2);
  }
}

43 Work of Parallel Merge Sort
Work: T₁(n) = 2 T₁(n/2) + Θ(n) = Θ(n lg n) — CASE 2.

44 Span of Parallel Merge Sort
Span: T∞(n) = T∞(n/2) + Θ(lg² n) = Θ(lg³ n),
by CASE 2: n^(log_b a) = n^(log₂ 1) = 1 ⇒ f(n) = Θ(n^(log_b a) lg² n).
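
Unrolled (added for completeness), the span is a sum of squared logarithms over the lg n levels of recursion:

  T_\infty(n) \;=\; \sum_{i=0}^{\lg n} \Theta\bigl(\lg^{2}(n/2^{i})\bigr)
              \;=\; \Theta\Bigl(\sum_{j=0}^{\lg n} j^{2}\Bigr)
              \;=\; \Theta(\lg^{3} n).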

45 Parallelism of Parallel Merge Sort
Work: T₁(n) = Θ(n lg n)
Span: T∞(n) = Θ(lg³ n)
Parallelism: T₁(n)/T∞(n) = Θ(n/lg² n)

46 LECTURE 2 outline: Recurrences (Review), Matrix Multiplication, Merge Sort, Tableau Construction, Conclusion. Next: Tableau Construction.

47 Tableau Construction
Problem: Fill in an n × n tableau A, where
    A[i, j] = f(A[i, j−1], A[i−1, j], A[i−1, j−1]).
Applications in dynamic programming: longest common subsequence, edit distance, time warping.
Work: Θ(n²).
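
As a concrete instance of this dependency pattern, here is a serial C tableau fill for edit distance, one of the applications listed above (the code itself is an illustrative addition, not from the lecture; it assumes strings shorter than 256 characters for brevity):

  #include <string.h>

  /* D[i][j] = edit distance between s[0..i) and t[0..j);
     each cell depends on its left, upper, and upper-left neighbors,
     exactly the pattern A[i,j] = f(A[i,j-1], A[i-1,j], A[i-1,j-1]). */
  int edit_distance(const char *s, const char *t) {
      int m = (int)strlen(s), n = (int)strlen(t);
      static int D[256][256];                  /* assumes m, n < 256 */
      for (int i = 0; i <= m; i++) D[i][0] = i;
      for (int j = 0; j <= n; j++) D[0][j] = j;
      for (int i = 1; i <= m; i++)
          for (int j = 1; j <= n; j++) {
              int sub = D[i-1][j-1] + (s[i-1] != t[j-1]);  /* up-left */
              int del = D[i-1][j] + 1;                     /* up      */
              int ins = D[i][j-1] + 1;                     /* left    */
              int best = sub < del ? sub : del;
              D[i][j] = best < ins ? best : ins;
          }
      return D[m][n];
  }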

48 Recursive Construction
Divide the n × n tableau into four (n/2) × (n/2) quadrants: I (upper left), II (upper right), III (lower left), IV (lower right). Quadrant I must be filled first; II and III depend only on I and can be filled in parallel; IV depends on II and III.
Cilk code:
  spawn I;
  sync;
  spawn II;
  spawn III;
  sync;
  spawn IV;
  sync;

49 Recursive Construction
Work: T₁(n) = 4 T₁(n/2) + Θ(1) = Θ(n²) — CASE 1.

50 Recursive Construction
Span: T∞(n) = 3 T∞(n/2) + Θ(1) = Θ(n^(lg 3)) — CASE 1.
(The critical path runs through three of the four quadrants in series: I, then the longer of II and III, then IV.)

51 Analysis of Tableau Construction
Work: T₁(n) = Θ(n²)
Span: T∞(n) = Θ(n^(lg 3)) ≈ Θ(n^1.58)
Parallelism: T₁(n)/T∞(n) ≈ Θ(n^0.42)
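
Spelled out (added for completeness), the parallelism exponent is 2 − lg 3 ≈ 0.42:

  \frac{T_1(n)}{T_\infty(n)} \;=\; \frac{\Theta(n^{2})}{\Theta(n^{\lg 3})}
  \;=\; \Theta\bigl(n^{\,2-\lg 3}\bigr) \;\approx\; \Theta\bigl(n^{0.42}\bigr).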

52 A More-Parallel Construction
Divide the n × n tableau into nine (n/3) × (n/3) blocks I through IX, numbered along anti-diagonals (I; then II, III; then IV, V, VI; then VII, VIII; then IX). Blocks on the same anti-diagonal can be filled in parallel.
Cilk code:
  spawn I;
  sync;
  spawn II;
  spawn III;
  sync;
  spawn IV;
  spawn V;
  spawn VI;
  sync;
  spawn VII;
  spawn VIII;
  sync;
  spawn IX;
  sync;

53 A More-Parallel Construction
Work: T₁(n) = 9 T₁(n/3) + Θ(1) = Θ(n²) — CASE 1.

54 A More-Parallel Construction
Span: T∞(n) = 5 T∞(n/3) + Θ(1) = Θ(n^(log₃ 5)) — CASE 1.
(The critical path passes through the five anti-diagonal stages: I; {II, III}; {IV, V, VI}; {VII, VIII}; IX.)

55 Analysis of Revised Construction
Work: T₁(n) = Θ(n²)
Span: T∞(n) = Θ(n^(log₃ 5)) ≈ Θ(n^1.46)
Parallelism: T₁(n)/T∞(n) ≈ Θ(n^0.54)
More parallel by a factor of Θ(n^0.54)/Θ(n^0.42) = Θ(n^0.12).

56 LECTURE 2 outline: Recurrences (Review), Matrix Multiplication, Merge Sort, Tableau Construction, Conclusion. Next: Conclusion.

57 Key Ideas
• Cilk is simple: cilk, spawn, sync, SYNCHED
• Recurrences, recurrences, recurrences, …
• Work & span