Multithreaded Programming in Cilk L ECTURE 2 Charles E. Leiserson Supercomputing Technologies Research Group Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology
© 2006 by Charles E. LeisersonMultithreaded Programming in Cilk —L ECTURE 2July 14, Minicourse Outline ● L ECTURE 1 Basic Cilk programming: Cilk keywords, performance measures, scheduling. ● L ECTURE 2 Analysis of Cilk algorithms: matrix multiplication, sorting, tableau construction. ● L ABORATORY Programming matrix multiplication in Cilk — Dr. Bradley C. Kuszmaul ● L ECTURE 3 Advanced Cilk programming: inlets, abort, speculation, data synchronization, & more.
© 2006 by Charles E. LeisersonMultithreaded Programming in Cilk —L ECTURE 2July 14, L ECTURE 2 Matrix Multiplication Tableau Construction Recurrences (Review) Conclusion Merge Sort
© 2006 by Charles E. LeisersonMultithreaded Programming in Cilk —L ECTURE 2July 14, The Master Method The Master Method for solving recurrences applies to recurrences of the form T(n) = a T(n/b) + f (n), where a ¸ 1, b > 1, and f is asymptotically positive. I DEA : Compare n log b a with f (n). *The unstated base case is T(n) = (1) for sufficiently small n. *
© 2006 by Charles E. LeisersonMultithreaded Programming in Cilk —L ECTURE 2July 14, Master Method — C ASE 1 n log b a À f (n) Specifically, f (n) = O(n log b a – ) for some constant > 0. Solution: T(n) = (n log b a ). T(n) = a T(n/b) + f (n)
© 2006 by Charles E. LeisersonMultithreaded Programming in Cilk —L ECTURE 2July 14, Master Method — C ASE 2 Specifically, f (n) = (n log b a lg k n) for some constant k ¸ 0. Solution: T(n) = (n log b a lg k+1 n). n log b a ¼ f (n) T(n) = a T(n/b) + f (n)
© 2006 by Charles E. LeisersonMultithreaded Programming in Cilk —L ECTURE 2July 14, Master Method — C ASE 3 Specifically, f (n) = (n log b a + ) for some constant > 0 and f (n) satisfies the regularity condition that a f (n/b) · c f (n) for some constant c < 1. Solution: T(n) = (f (n)). n log b a ¿ f (n) T(n) = a T(n/b) + f (n)
© 2006 by Charles E. LeisersonMultithreaded Programming in Cilk —L ECTURE 2July 14, Master Method Summary C ASE 1: f (n) = O(n log b a – ), constant > 0 T(n) = (n log b a ). C ASE 2: f (n) = (n log b a lg k n), constant k 0 T(n) = (n log b a lg k+1 n). C ASE 3: f (n) = (n log b a + ), constant > 0, and regularity condition T(n) = ( f (n)). T(n) = a T(n/b) + f (n)
© 2006 by Charles E. LeisersonMultithreaded Programming in Cilk —L ECTURE 2July 14, Master Method Quiz T(n) = 4 T(n/2) + n n log b a = n 2 À n ) C ASE 1: T(n) = (n 2 ). T(n) = 4 T(n/2) + n 2 n log b a = n 2 = n 2 lg 0 n ) C ASE 2: T(n) = (n 2 lg n). T(n) = 4 T(n/2) + n 3 n log b a = n 2 ¿ n 3 ) C ASE 3: T(n) = (n 3 ). T(n) = 4 T(n/2) + n 2 / lg n Master method does not apply!
© 2006 by Charles E. LeisersonMultithreaded Programming in Cilk —L ECTURE 2July 14, L ECTURE 2 Matrix Multiplication Tableau Construction Recurrences (Review) Conclusion Merge Sort
© 2006 by Charles E. LeisersonMultithreaded Programming in Cilk —L ECTURE 2July 14, Square-Matrix Multiplication c 11 c 12 c1nc1n c 21 c 22 c2nc2n cn1cn1 cn2cn2 c nn a 11 a 12 a1na1n a 21 a 22 a2na2n an1an1 an2an2 a nn b 11 b 12 b1nb1n b 21 b 22 b2nb2n bn1bn1 bn2bn2 b nn = £ CAB c ij = k = 1 n a ik b kj Assume for simplicity that n = 2 k.
© 2006 by Charles E. LeisersonMultithreaded Programming in Cilk —L ECTURE 2July 14, Recursive Matrix Multiplication 8 multiplications of (n/2) £ (n/2) matrices. 1 addition of n £ n matrices. Divide and conquer — C 11 C 12 C 21 C 22 = £ A 11 A 12 A 21 A 22 B 11 B 12 B 21 B 22 =+ A 11 B 11 A 11 B 12 A 21 B 11 A 21 B 12 A 12 B 21 A 12 B 22 A 22 B 21 A 22 B 22
© 2006 by Charles E. LeisersonMultithreaded Programming in Cilk —L ECTURE 2July 14, cilk void Mult(*C, *A, *B, n) { float *T = Cilk_alloca(n*n*sizeof(float)); h base case & partition matrices i spawn Mult(C11,A11,B11,n/2); spawn Mult(C12,A11,B12,n/2); spawn Mult(C22,A21,B12,n/2); spawn Mult(C21,A21,B11,n/2); spawn Mult(T11,A12,B21,n/2); spawn Mult(T12,A12,B22,n/2); spawn Mult(T22,A22,B22,n/2); spawn Mult(T21,A22,B21,n/2); sync; spawn Add(C,T,n); sync; return; } cilk void Mult(*C, *A, *B, n) { float *T = Cilk_alloca(n*n*sizeof(float)); h base case & partition matrices i spawn Mult(C11,A11,B11,n/2); spawn Mult(C12,A11,B12,n/2); spawn Mult(C22,A21,B12,n/2); spawn Mult(C21,A21,B11,n/2); spawn Mult(T11,A12,B21,n/2); spawn Mult(T12,A12,B22,n/2); spawn Mult(T22,A22,B22,n/2); spawn Mult(T21,A22,B21,n/2); sync; spawn Add(C,T,n); sync; return; } Matrix Multiply in Pseudo-Cilk C = A ¢ B Absence of type declarations.
© 2006 by Charles E. LeisersonMultithreaded Programming in Cilk —L ECTURE 2July 14, cilk void Mult(*C, *A, *B, n) { float *T = Cilk_alloca(n*n*sizeof(float)); h base case & partition matrices i spawn Mult(C11,A11,B11,n/2); spawn Mult(C12,A11,B12,n/2); spawn Mult(C22,A21,B12,n/2); spawn Mult(C21,A21,B11,n/2); spawn Mult(T11,A12,B21,n/2); spawn Mult(T12,A12,B22,n/2); spawn Mult(T22,A22,B22,n/2); spawn Mult(T21,A22,B21,n/2); sync; spawn Add(C,T,n); sync; return; } cilk void Mult(*C, *A, *B, n) { float *T = Cilk_alloca(n*n*sizeof(float)); h base case & partition matrices i spawn Mult(C11,A11,B11,n/2); spawn Mult(C12,A11,B12,n/2); spawn Mult(C22,A21,B12,n/2); spawn Mult(C21,A21,B11,n/2); spawn Mult(T11,A12,B21,n/2); spawn Mult(T12,A12,B22,n/2); spawn Mult(T22,A22,B22,n/2); spawn Mult(T21,A22,B21,n/2); sync; spawn Add(C,T,n); sync; return; } C = A ¢ B Coarsen base cases for efficiency. Matrix Multiply in Pseudo-Cilk
© 2006 by Charles E. LeisersonMultithreaded Programming in Cilk —L ECTURE 2July 14, cilk void Mult(*C, *A, *B, n) { float *T = Cilk_alloca(n*n*sizeof(float)); h base case & partition matrices i spawn Mult(C11,A11,B11,n/2); spawn Mult(C12,A11,B12,n/2); spawn Mult(C22,A21,B12,n/2); spawn Mult(C21,A21,B11,n/2); spawn Mult(T11,A12,B21,n/2); spawn Mult(T12,A12,B22,n/2); spawn Mult(T22,A22,B22,n/2); spawn Mult(T21,A22,B21,n/2); sync; spawn Add(C,T,n); sync; return; } cilk void Mult(*C, *A, *B, n) { float *T = Cilk_alloca(n*n*sizeof(float)); h base case & partition matrices i spawn Mult(C11,A11,B11,n/2); spawn Mult(C12,A11,B12,n/2); spawn Mult(C22,A21,B12,n/2); spawn Mult(C21,A21,B11,n/2); spawn Mult(T11,A12,B21,n/2); spawn Mult(T12,A12,B22,n/2); spawn Mult(T22,A22,B22,n/2); spawn Mult(T21,A22,B21,n/2); sync; spawn Add(C,T,n); sync; return; } C = A ¢ B Submatrices are produced by pointer calculation, not copying of elements. Also need a row- size argument for array indexing. Matrix Multiply in Pseudo-Cilk
© 2006 by Charles E. LeisersonMultithreaded Programming in Cilk —L ECTURE 2July 14, cilk void Mult(*C, *A, *B, n) { float *T = Cilk_alloca(n*n*sizeof(float)); h base case & partition matrices i spawn Mult(C11,A11,B11,n/2); spawn Mult(C12,A11,B12,n/2); spawn Mult(C22,A21,B12,n/2); spawn Mult(C21,A21,B11,n/2); spawn Mult(T11,A12,B21,n/2); spawn Mult(T12,A12,B22,n/2); spawn Mult(T22,A22,B22,n/2); spawn Mult(T21,A22,B21,n/2); sync; spawn Add(C,T,n); sync; return; } cilk void Mult(*C, *A, *B, n) { float *T = Cilk_alloca(n*n*sizeof(float)); h base case & partition matrices i spawn Mult(C11,A11,B11,n/2); spawn Mult(C12,A11,B12,n/2); spawn Mult(C22,A21,B12,n/2); spawn Mult(C21,A21,B11,n/2); spawn Mult(T11,A12,B21,n/2); spawn Mult(T12,A12,B22,n/2); spawn Mult(T22,A22,B22,n/2); spawn Mult(T21,A22,B21,n/2); sync; spawn Add(C,T,n); sync; return; } C = A ¢ B cilk void Add(*C, *T, n) { h base case & partition matrices i spawn Add(C11,T11,n/2); spawn Add(C12,T12,n/2); spawn Add(C21,T21,n/2); spawn Add(C22,T22,n/2); sync; return; } cilk void Add(*C, *T, n) { h base case & partition matrices i spawn Add(C11,T11,n/2); spawn Add(C12,T12,n/2); spawn Add(C21,T21,n/2); spawn Add(C22,T22,n/2); sync; return; } C = C + T Matrix Multiply in Pseudo-Cilk
© 2006 by Charles E. LeisersonMultithreaded Programming in Cilk —L ECTURE 2July 14, A 1 (n)= ? 4 A 1 (n/2) + (1) cilk void Add(*C, *T, n) { h base case & partition matrices i spawn Add(C11,T11,n/2); spawn Add(C12,T12,n/2); spawn Add(C21,T21,n/2); spawn Add(C22,T22,n/2); sync; return; } cilk void Add(*C, *T, n) { h base case & partition matrices i spawn Add(C11,T11,n/2); spawn Add(C12,T12,n/2); spawn Add(C21,T21,n/2); spawn Add(C22,T22,n/2); sync; return; } Work of Matrix Addition Work: n log b a = n log 2 4 = n 2 À (1). — C ASE 1 =(n2)=(n2)
© 2006 by Charles E. LeisersonMultithreaded Programming in Cilk —L ECTURE 2July 14, cilk void Add(*C, *T, n) { h base case & partition matrices i spawn Add(C11,T11,n/2); spawn Add(C12,T12,n/2); spawn Add(C21,T21,n/2); spawn Add(C22,T22,n/2); sync; return; } cilk void Add(*C, *T, n) { h base case & partition matrices i spawn Add(C11,T11,n/2); spawn Add(C12,T12,n/2); spawn Add(C21,T21,n/2); spawn Add(C22,T22,n/2); sync; return; } cilk void Add(*C, *T, n) { h base case & partition matrices i spawn Add(C11,T11,n/2); spawn Add(C12,T12,n/2); spawn Add(C21,T21,n/2); spawn Add(C22,T22,n/2); sync; return; } cilk void Add(*C, *T, n) { h base case & partition matrices i spawn Add(C11,T11,n/2); spawn Add(C12,T12,n/2); spawn Add(C21,T21,n/2); spawn Add(C22,T22,n/2); sync; return; } A 1 (n)= ? Span of Matrix Addition A 1 (n/2) + (1) Span: n log b a = n log 2 1 = 1 ) f (n) = (n log b a lg 0 n). — C ASE 2 = (lg n) maximum
© 2006 by Charles E. LeisersonMultithreaded Programming in Cilk —L ECTURE 2July 14, M 1 (n)= ? Work of Matrix Multiplication 8 M 1 (n/2) +A 1 (n) + (1) Work: cilk void Mult(*C, *A, *B, n) { float *T = Cilk_alloca(n*n*sizeof(float)); h base case & partition matrices i spawn Mult(C11,A11,B11,n/2); spawn Mult(C12,A11,B12,n/2); spawn Mult(T21,A22,B21,n/2); sync; spawn Add(C,T,n); sync; return; } cilk void Mult(*C, *A, *B, n) { float *T = Cilk_alloca(n*n*sizeof(float)); h base case & partition matrices i spawn Mult(C11,A11,B11,n/2); spawn Mult(C12,A11,B12,n/2); spawn Mult(T21,A22,B21,n/2); sync; spawn Add(C,T,n); sync; return; } 8 n log b a = n log 2 8 = n 3 À (n 2 ). =8 M 1 (n/2) + (n 2 ) =(n3)=(n3) — C ASE 1
© 2006 by Charles E. LeisersonMultithreaded Programming in Cilk —L ECTURE 2July 14, cilk void Mult(*C, *A, *B, n) { float *T = Cilk_alloca(n*n*sizeof(float)); h base case & partition matrices i spawn Mult(C11,A11,B11,n/2); spawn Mult(C12,A11,B12,n/2); spawn Mult(T21,A22,B21,n/2); sync; spawn Add(C,T,n); sync; return; } cilk void Mult(*C, *A, *B, n) { float *T = Cilk_alloca(n*n*sizeof(float)); h base case & partition matrices i spawn Mult(C11,A11,B11,n/2); spawn Mult(C12,A11,B12,n/2); spawn Mult(T21,A22,B21,n/2); sync; spawn Add(C,T,n); sync; return; } cilk void Mult(*C, *A, *B, n) { float *T = Cilk_alloca(n*n*sizeof(float)); h base case & partition matrices i spawn Mult(C11,A11,B11,n/2); spawn Mult(C12,A11,B12,n/2); spawn Mult(T21,A22,B21,n/2); sync; spawn Add(C,T,n); sync; return; } cilk void Mult(*C, *A, *B, n) { float *T = Cilk_alloca(n*n*sizeof(float)); h base case & partition matrices i spawn Mult(C11,A11,B11,n/2); spawn Mult(C12,A11,B12,n/2); spawn Mult(T21,A22,B21,n/2); sync; spawn Add(C,T,n); sync; return; } M 1 (n)= ? M 1 (n/2) + A 1 (n) + (1) Span of Matrix Multiplication Span: n log b a = n log 2 1 = 1 ) f (n) = (n log b a lg 1 n). =M 1 (n/2) + (lg n) = (lg 2 n) — C ASE 2 8
© 2006 by Charles E. LeisersonMultithreaded Programming in Cilk —L ECTURE 2July 14, Parallelism of Matrix Multiply M1(n)= (n3)M1(n)= (n3) Work: M 1 (n)= (lg 2 n) Span: Parallelism: M1(n)M1(n) M1(n)M1(n) = (n 3 /lg 2 n) For 1000 £ 1000 matrices, parallelism ¼ (10 3 ) 3 /10 2 = 10 7.
© 2006 by Charles E. LeisersonMultithreaded Programming in Cilk —L ECTURE 2July 14, cilk void Mult(*C, *A, *B, n) { h base case & partition matrices i spawn Mult(C11,A11,B11,n/2); spawn Mult(C12,A11,B12,n/2); spawn Mult(T21,A22,B21,n/2); sync; spawn Add(C,T,n); sync; return; } cilk void Mult(*C, *A, *B, n) { h base case & partition matrices i spawn Mult(C11,A11,B11,n/2); spawn Mult(C12,A11,B12,n/2); spawn Mult(T21,A22,B21,n/2); sync; spawn Add(C,T,n); sync; return; } Stack Temporaries float *T = Cilk_alloca(n*n*sizeof(float)); In hierarchical-memory machines (especially chip multiprocessors), memory accesses are so expensive that minimizing storage often yields higher performance. I DEA : Trade off parallelism for less storage.
© 2006 by Charles E. LeisersonMultithreaded Programming in Cilk —L ECTURE 2July 14, No-Temp Matrix Multiplication cilk void MultA(*C, *A, *B, n) { // C = C + A * B h base case & partition matrices i spawn MultA(C11,A11,B11,n/2); spawn MultA(C12,A11,B12,n/2); spawn MultA(C22,A21,B12,n/2); spawn MultA(C21,A21,B11,n/2); sync; spawn MultA(C21,A22,B21,n/2); spawn MultA(C22,A22,B22,n/2); spawn MultA(C12,A12,B22,n/2); spawn MultA(C11,A12,B21,n/2); sync; return; } cilk void MultA(*C, *A, *B, n) { // C = C + A * B h base case & partition matrices i spawn MultA(C11,A11,B11,n/2); spawn MultA(C12,A11,B12,n/2); spawn MultA(C22,A21,B12,n/2); spawn MultA(C21,A21,B11,n/2); sync; spawn MultA(C21,A22,B21,n/2); spawn MultA(C22,A22,B22,n/2); spawn MultA(C12,A12,B22,n/2); spawn MultA(C11,A12,B21,n/2); sync; return; } Saves space, but at what expense?
© 2006 by Charles E. LeisersonMultithreaded Programming in Cilk —L ECTURE 2July 14, =(n3)=(n3) Work of No-Temp Multiply M 1 (n)= ? 8 M 1 (n/2) + (1) Work: — C ASE 1 cilk void MultA(*C, *A, *B, n) { // C = C + A * B h base case & partition matrices i spawn MultA(C11,A11,B11,n/2); spawn MultA(C12,A11,B12,n/2); spawn MultA(C22,A21,B12,n/2); spawn MultA(C21,A21,B11,n/2); sync; spawn MultA(C21,A22,B21,n/2); spawn MultA(C22,A22,B22,n/2); spawn MultA(C12,A12,B22,n/2); spawn MultA(C11,A12,B21,n/2); sync; return; } cilk void MultA(*C, *A, *B, n) { // C = C + A * B h base case & partition matrices i spawn MultA(C11,A11,B11,n/2); spawn MultA(C12,A11,B12,n/2); spawn MultA(C22,A21,B12,n/2); spawn MultA(C21,A21,B11,n/2); sync; spawn MultA(C21,A22,B21,n/2); spawn MultA(C22,A22,B22,n/2); spawn MultA(C12,A12,B22,n/2); spawn MultA(C11,A12,B21,n/2); sync; return; }
© 2006 by Charles E. LeisersonMultithreaded Programming in Cilk —L ECTURE 2July 14, cilk void MultA(*C, *A, *B, n) { // C = C + A * B h base case & partition matrices i spawn MultA(C11,A11,B11,n/2); spawn MultA(C12,A11,B12,n/2); spawn MultA(C22,A21,B12,n/2); spawn MultA(C21,A21,B11,n/2); sync; spawn MultA(C21,A22,B21,n/2); spawn MultA(C22,A22,B22,n/2); spawn MultA(C12,A12,B22,n/2); spawn MultA(C11,A12,B21,n/2); sync; return; } cilk void MultA(*C, *A, *B, n) { // C = C + A * B h base case & partition matrices i spawn MultA(C11,A11,B11,n/2); spawn MultA(C12,A11,B12,n/2); spawn MultA(C22,A21,B12,n/2); spawn MultA(C21,A21,B11,n/2); sync; spawn MultA(C21,A22,B21,n/2); spawn MultA(C22,A22,B22,n/2); spawn MultA(C12,A12,B22,n/2); spawn MultA(C11,A12,B21,n/2); sync; return; } cilk void MultA(*C, *A, *B, n) { // C = C + A * B h base case & partition matrices i spawn MultA(C11,A11,B11,n/2); spawn MultA(C12,A11,B12,n/2); spawn MultA(C22,A21,B12,n/2); spawn MultA(C21,A21,B11,n/2); sync; spawn MultA(C21,A22,B21,n/2); spawn MultA(C22,A22,B22,n/2); spawn MultA(C12,A12,B22,n/2); spawn MultA(C11,A12,B21,n/2); sync; return; } cilk void MultA(*C, *A, *B, n) { // C = C + A * B h base case & partition matrices i spawn MultA(C11,A11,B11,n/2); spawn MultA(C12,A11,B12,n/2); spawn MultA(C22,A21,B12,n/2); spawn MultA(C21,A21,B11,n/2); sync; spawn MultA(C21,A22,B21,n/2); spawn MultA(C22,A22,B22,n/2); spawn MultA(C12,A12,B22,n/2); spawn MultA(C11,A12,B21,n/2); sync; return; } =(n)=(n) M 1 (n)= ? Span of No-Temp Multiply Span: — C ASE 1 2 M 1 (n/2) + (1) maximum
© 2006 by Charles E. LeisersonMultithreaded Programming in Cilk —L ECTURE 2July 14, Parallelism of No-Temp Multiply M1(n)=(n3)M1(n)=(n3) Work: M1(n)=(n)M1(n)=(n) Span: Parallelism: M1(n)M1(n) M1(n)M1(n) = (n 2 ) For 1000 £ 1000 matrices, parallelism ¼ (10 3 ) 3 /10 3 = Faster in practice!
© 2006 by Charles E. LeisersonMultithreaded Programming in Cilk —L ECTURE 2July 14, Testing Synchronization Cilk language feature: A programmer can check whether a Cilk procedure is “synched” (without actually performing a sync ) by testing the pseudovariable SYNCHED : SYNCHED = 0 ) some spawned children might not have returned. SYNCHED = 1 ) all spawned children have definitely returned.
© 2006 by Charles E. LeisersonMultithreaded Programming in Cilk —L ECTURE 2July 14, Best of Both Worlds cilk void Mult1(*C, *A, *B, n) {// multiply & store h base case & partition matrices i spawn Mult1(C11,A11,B11,n/2); // multiply & store spawn Mult1(C12,A11,B12,n/2); spawn Mult1(C22,A21,B12,n/2); spawn Mult1(C21,A21,B11,n/2); if (SYNCHED) { spawn MultA1(C11,A12,B21,n/2); // multiply & add spawn MultA1(C12,A12,B22,n/2); spawn MultA1(C22,A22,B22,n/2); spawn MultA1(C21,A22,B21,n/2); } else { float *T = Cilk_alloca(n*n*sizeof(float)); spawn Mult1(T11,A12,B21,n/2); // multiply & store spawn Mult1(T12,A12,B22,n/2); spawn Mult1(T22,A22,B22,n/2); spawn Mult1(T21,A22,B21,n/2); sync; spawn Add(C,T,n); // C = C + T } sync; return; } cilk void Mult1(*C, *A, *B, n) {// multiply & store h base case & partition matrices i spawn Mult1(C11,A11,B11,n/2); // multiply & store spawn Mult1(C12,A11,B12,n/2); spawn Mult1(C22,A21,B12,n/2); spawn Mult1(C21,A21,B11,n/2); if (SYNCHED) { spawn MultA1(C11,A12,B21,n/2); // multiply & add spawn MultA1(C12,A12,B22,n/2); spawn MultA1(C22,A22,B22,n/2); spawn MultA1(C21,A22,B21,n/2); } else { float *T = Cilk_alloca(n*n*sizeof(float)); spawn Mult1(T11,A12,B21,n/2); // multiply & store spawn Mult1(T12,A12,B22,n/2); spawn Mult1(T22,A22,B22,n/2); spawn Mult1(T21,A22,B21,n/2); sync; spawn Add(C,T,n); // C = C + T } sync; return; } This code is just as parallel as the original, but it only uses more space if runtime parallelism actually exists.
© 2006 by Charles E. LeisersonMultithreaded Programming in Cilk —L ECTURE 2July 14, Ordinary Matrix Multiplication c ij = k = 1 n a ik b kj I DEA : Spawn n 2 inner products in parallel. Compute each inner product in parallel. Work: (n 3 ) Span: (lg n) Parallelism: (n 3 /lg n) B UT, this algorithm exhibits poor locality and does not exploit the cache hierarchy of modern microprocessors, especially CMP’s.
© 2006 by Charles E. LeisersonMultithreaded Programming in Cilk —L ECTURE 2July 14, L ECTURE 2 Matrix Multiplication Tableau Construction Recurrences (Review) Conclusion Merge Sort
© 2006 by Charles E. LeisersonMultithreaded Programming in Cilk —L ECTURE 2July 14, Merging Two Sorted Arrays void Merge(int *C, int *A, int *B, int na, int nb) { while (na>0 && nb>0) { if (*A <= *B) { *C++ = *A++; na--; } else { *C++ = *B++; nb--; } while (na>0) { *C++ = *A++; na--; } while (nb>0) { *C++ = *B++; nb--; } void Merge(int *C, int *A, int *B, int na, int nb) { while (na>0 && nb>0) { if (*A <= *B) { *C++ = *A++; na--; } else { *C++ = *B++; nb--; } while (na>0) { *C++ = *A++; na--; } while (nb>0) { *C++ = *B++; nb--; } Time to merge n elements = ? (n). (n).
© 2006 by Charles E. LeisersonMultithreaded Programming in Cilk —L ECTURE 2July 14, cilk void MergeSort(int *B, int *A, int n) { if (n==1) { B[0] = A[0]; } else { int *C; C = (int*) Cilk_alloca(n*sizeof(int)); spawn MergeSort(C, A, n/2); spawn MergeSort(C+n/2, A+n/2, n-n/2); sync; Merge(B, C, C+n/2, n/2, n-n/2); } cilk void MergeSort(int *B, int *A, int n) { if (n==1) { B[0] = A[0]; } else { int *C; C = (int*) Cilk_alloca(n*sizeof(int)); spawn MergeSort(C, A, n/2); spawn MergeSort(C+n/2, A+n/2, n-n/2); sync; Merge(B, C, C+n/2, n/2, n-n/2); } Merge Sort merge
© 2006 by Charles E. LeisersonMultithreaded Programming in Cilk —L ECTURE 2July 14, = (n lg n) T 1 (n)= ? 2 T 1 (n/2) + (n) Work of Merge Sort Work: — C ASE 2 n log b a = n log 2 2 = n ) f (n) = (n log b a lg 0 n). cilk void MergeSort(int *B, int *A, int n) { if (n==1) { B[0] = A[0]; } else { int *C; C = (int*) Cilk_alloca(n*sizeof(int)); spawn MergeSort(C, A, n/2); spawn MergeSort(C+n/2, A+n/2, n-n/2); sync; Merge(B, C, C+n/2, n/2, n-n/2); } cilk void MergeSort(int *B, int *A, int n) { if (n==1) { B[0] = A[0]; } else { int *C; C = (int*) Cilk_alloca(n*sizeof(int)); spawn MergeSort(C, A, n/2); spawn MergeSort(C+n/2, A+n/2, n-n/2); sync; Merge(B, C, C+n/2, n/2, n-n/2); }
© 2006 by Charles E. LeisersonMultithreaded Programming in Cilk —L ECTURE 2July 14, T 1 (n)= ? T 1 (n/2) + (n) Span of Merge Sort Span: — C ASE 3 =(n)=(n) n log b a = n log 2 1 = 1 ¿ (n). cilk void MergeSort(int *B, int *A, int n) { if (n==1) { B[0] = A[0]; } else { int *C; C = (int*) Cilk_alloca(n*sizeof(int)); spawn MergeSort(C, A, n/2); spawn MergeSort(C+n/2, A+n/2, n-n/2); sync; Merge(B, C, C+n/2, n/2, n-n/2); } cilk void MergeSort(int *B, int *A, int n) { if (n==1) { B[0] = A[0]; } else { int *C; C = (int*) Cilk_alloca(n*sizeof(int)); spawn MergeSort(C, A, n/2); spawn MergeSort(C+n/2, A+n/2, n-n/2); sync; Merge(B, C, C+n/2, n/2, n-n/2); }
© 2006 by Charles E. LeisersonMultithreaded Programming in Cilk —L ECTURE 2July 14, Parallelism of Merge Sort T 1 (n)= (n lg n) Work: T1(n)=(n)T1(n)=(n) Span: Parallelism: T1(n)T1(n) T1(n)T1(n) = (lg n) We need to parallelize the merge!
© 2006 by Charles E. LeisersonMultithreaded Programming in Cilk —L ECTURE 2July 14, B A 0na 0nb na ¸ nb Parallel Merge · A[na/2] ¸ A[na/2] Binary search jj+1 Recursive merge Recursive merge na/2 · A[na/2] ¸ A[na/2] K EY I DEA : If the total number of elements to be merged in the two arrays is n = na + nb, the total number of elements in the larger of the two recursive merges is at most ? (3/4) n.
© 2006 by Charles E. LeisersonMultithreaded Programming in Cilk —L ECTURE 2July 14, Parallel Merge cilk void P_Merge(int *C, int *A, int *B, int na, int nb) { if (na < nb) { spawn P_Merge(C, B, A, nb, na); } else if (na==1) { if (nb == 0) { C[0] = A[0]; } else { C[0] = (A[0]<B[0]) ? A[0] : B[0]; /* minimum */ C[1] = (A[0]<B[0]) ? B[0] : A[0]; /* maximum */ } } else { int ma = na/2; int mb = BinarySearch(A[ma], B, nb); spawn P_Merge(C, A, B, ma, mb); spawn P_Merge(C+ma+mb, A+ma, B+mb, na-ma, nb-mb); sync; } cilk void P_Merge(int *C, int *A, int *B, int na, int nb) { if (na < nb) { spawn P_Merge(C, B, A, nb, na); } else if (na==1) { if (nb == 0) { C[0] = A[0]; } else { C[0] = (A[0]<B[0]) ? A[0] : B[0]; /* minimum */ C[1] = (A[0]<B[0]) ? B[0] : A[0]; /* maximum */ } } else { int ma = na/2; int mb = BinarySearch(A[ma], B, nb); spawn P_Merge(C, A, B, ma, mb); spawn P_Merge(C+ma+mb, A+ma, B+mb, na-ma, nb-mb); sync; } Coarsen base cases for efficiency.
© 2006 by Charles E. LeisersonMultithreaded Programming in Cilk —L ECTURE 2July 14, T 1 (n)= ? T 1 (3n/4) + (lg n) Span of P_Merge Span: — C ASE 2 = (lg 2 n) cilk void P_Merge(int *C, int *A, int *B, int na, int nb) { if (na < nb) { } else { int ma = na/2; int mb = BinarySearch(A[ma], B, nb); spawn P_Merge(C, A, B, ma, mb); spawn P_Merge(C+ma+mb, A+ma, B+mb, na-ma, nb-mb); sync; } cilk void P_Merge(int *C, int *A, int *B, int na, int nb) { if (na < nb) { } else { int ma = na/2; int mb = BinarySearch(A[ma], B, nb); spawn P_Merge(C, A, B, ma, mb); spawn P_Merge(C+ma+mb, A+ma, B+mb, na-ma, nb-mb); sync; } n log b a = n log 4/3 1 = 1 ) f (n) = (n log b a lg 1 n).
© 2006 by Charles E. LeisersonMultithreaded Programming in Cilk —L ECTURE 2July 14, T 1 (n) = ? T 1 ( n) + T 1 ((1– )n) + (lg n), where 1/4 · · 3/4. Work of P_Merge Work: cilk void P_Merge(int *C, int *A, int *B, int na, int nb) { if (na < nb) { } else { int ma = na/2; int mb = BinarySearch(A[ma], B, nb); spawn P_Merge(C, A, B, ma, mb); spawn P_Merge(C+ma+mb, A+ma, B+mb, na-ma, nb-mb); sync; } cilk void P_Merge(int *C, int *A, int *B, int na, int nb) { if (na < nb) { } else { int ma = na/2; int mb = BinarySearch(A[ma], B, nb); spawn P_Merge(C, A, B, ma, mb); spawn P_Merge(C+ma+mb, A+ma, B+mb, na-ma, nb-mb); sync; } C LAIM : T 1 (n) = (n).
© 2006 by Charles E. LeisersonMultithreaded Programming in Cilk —L ECTURE 2July 14, Analysis of Work Recurrence Substitution method: Inductive hypothesis is T 1 (k) · c 1 k – c 2 lg k, where c 1,c 2 > 0. Prove that the relation holds, and solve for c 1 and c 2. T 1 (n) = T 1 ( n) + T 1 ((1– )n) + (lg n), where 1/4 · · 3/4. T 1 (n)=T 1 ( n) + T 1 ((1– )n) + (lg n) · c 1 ( n) – c 2 lg( n) + c 1 ((1– )n) – c 2 lg((1– )n) + (lg n)
© 2006 by Charles E. LeisersonMultithreaded Programming in Cilk —L ECTURE 2July 14, Analysis of Work Recurrence T 1 (n)=T 1 ( n) + T 1 ((1– )n) + (lg n) · c 1 ( n) – c 2 lg( n) + c 1 (1– )n – c 2 lg((1– )n) + (lg n) T 1 (n) = T 1 ( n) + T 1 ((1– )n) + (lg n), where 1/4 · · 3/4.
© 2006 by Charles E. LeisersonMultithreaded Programming in Cilk —L ECTURE 2July 14, T 1 (n)=T 1 ( n) + T 1 ((1– )n) + (lg n) · c 1 ( n) – c 2 lg( n) + c 1 (1– )n – c 2 lg((1– )n) + (lg n) Analysis of Work Recurrence · c 1 n – c 2 lg( n) – c 2 lg((1– )n) + (lg n) · c 1 n – c 2 ( lg( (1– )) + 2 lg n ) + (lg n) · c 1 n – c 2 lg n – ( c 2 (lg n + lg( (1– ))) – (lg n) ) · c 1 n – c 2 lg n by choosing c 1 and c 2 large enough. T 1 (n) = T 1 ( n) + T 1 ((1– )n) + (lg n), where 1/4 · · 3/4.
© 2006 by Charles E. LeisersonMultithreaded Programming in Cilk —L ECTURE 2July 14, Parallelism of P_Merge T1(n)=(n)T1(n)=(n) Work: T 1 (n)= (lg 2 n) Span: Parallelism: T1(n)T1(n) T1(n)T1(n) = (n/lg 2 n)
© 2006 by Charles E. LeisersonMultithreaded Programming in Cilk —L ECTURE 2July 14, cilk void P_MergeSort(int *B, int *A, int n) { if (n==1) { B[0] = A[0]; } else { int *C; C = (int*) Cilk_alloca(n*sizeof(int)); spawn P_MergeSort(C, A, n/2); spawn P_MergeSort(C+n/2, A+n/2, n-n/2); sync; spawn P_Merge(B, C, C+n/2, n/2, n-n/2); } cilk void P_MergeSort(int *B, int *A, int n) { if (n==1) { B[0] = A[0]; } else { int *C; C = (int*) Cilk_alloca(n*sizeof(int)); spawn P_MergeSort(C, A, n/2); spawn P_MergeSort(C+n/2, A+n/2, n-n/2); sync; spawn P_Merge(B, C, C+n/2, n/2, n-n/2); } Parallel Merge Sort
© 2006 by Charles E. LeisersonMultithreaded Programming in Cilk —L ECTURE 2July 14, T 1 (n)=2 T 1 (n/2) + (n) Work of Parallel Merge Sort Work: — C ASE 2 = (n lg n) cilk void P_MergeSort(int *B, int *A, int n) { if (n==1) { B[0] = A[0]; } else { int *C; C = (int*) Cilk_alloca(n*sizeof(int)); spawn P_MergeSort(C, A, n/2); spawn P_MergeSort(C+n/2, A+n/2, n-n/2); sync; spawn P_Merge(B, C, C+n/2, n/2, n-n/2); } cilk void P_MergeSort(int *B, int *A, int n) { if (n==1) { B[0] = A[0]; } else { int *C; C = (int*) Cilk_alloca(n*sizeof(int)); spawn P_MergeSort(C, A, n/2); spawn P_MergeSort(C+n/2, A+n/2, n-n/2); sync; spawn P_Merge(B, C, C+n/2, n/2, n-n/2); }
© 2006 by Charles E. LeisersonMultithreaded Programming in Cilk —L ECTURE 2July 14, Span of Parallel Merge Sort T 1 (n)= ? T 1 (n/2) + (lg 2 n) Span: — C ASE 2 = (lg 3 n) n log b a = n log 2 1 = 1 ) f (n) = (n log b a lg 2 n). cilk void P_MergeSort(int *B, int *A, int n) { if (n==1) { B[0] = A[0]; } else { int *C; C = (int*) Cilk_alloca(n*sizeof(int)); spawn P_MergeSort(C, A, n/2); spawn P_MergeSort(C+n/2, A+n/2, n-n/2); sync; spawn P_Merge(B, C, C+n/2, n/2, n-n/2); } cilk void P_MergeSort(int *B, int *A, int n) { if (n==1) { B[0] = A[0]; } else { int *C; C = (int*) Cilk_alloca(n*sizeof(int)); spawn P_MergeSort(C, A, n/2); spawn P_MergeSort(C+n/2, A+n/2, n-n/2); sync; spawn P_Merge(B, C, C+n/2, n/2, n-n/2); }
© 2006 by Charles E. LeisersonMultithreaded Programming in Cilk —L ECTURE 2July 14, Parallelism of Merge Sort T 1 (n)= (n lg n) Work: T 1 (n)= (lg 3 n) Span: Parallelism: T1(n)T1(n) T1(n)T1(n) = (n/lg 2 n)
© 2006 by Charles E. LeisersonMultithreaded Programming in Cilk —L ECTURE 2July 14, L ECTURE 2 Matrix Multiplication Tableau Construction Recurrences (Review) Conclusion Merge Sort
© 2006 by Charles E. LeisersonMultithreaded Programming in Cilk —L ECTURE 2July 14, Tableau Construction A[i, j] = f ( A[i, j–1], A[i–1, j], A[i–1, j–1] ). Problem: Fill in an n £ n tableau A, where Dynamic programming Longest common subsequence Edit distance Time warping Work: (n 2 ).
© 2006 by Charles E. LeisersonMultithreaded Programming in Cilk —L ECTURE 2July 14, n n spawn I; sync; spawn II; spawn III; sync; spawn IV; sync; I I II III IV Cilk code Recursive Construction
© 2006 by Charles E. LeisersonMultithreaded Programming in Cilk —L ECTURE 2July 14, n n Work: T 1 (n) = ? 4T 1 (n/2) + (1) spawn I; sync; spawn II; spawn III; sync; spawn IV; sync; I I II III IV Cilk code Recursive Construction = (n 2 ) — C ASE 1
© 2006 by Charles E. LeisersonMultithreaded Programming in Cilk —L ECTURE 2July 14, Span: T 1 (n) = ? n n spawn I; sync; spawn II; spawn III; sync; spawn IV; sync; I I II III IV Cilk code Recursive Construction 3T 1 (n/2) + (1) = (n lg3 ) — C ASE 1
© 2006 by Charles E. LeisersonMultithreaded Programming in Cilk —L ECTURE 2July 14, Analysis of Tableau Construction Work: T 1 (n) = (n 2 ) Span: T 1 (n) = (n lg3 ) ¼ (n 1.58 ) Parallelism: T1(n)T1(n) T1(n)T1(n) ¼ (n 0.42 )
© 2006 by Charles E. LeisersonMultithreaded Programming in Cilk —L ECTURE 2July 14, n spawn I; sync; spawn II; spawn III; sync; spawn IV; spawn V; spawn VI sync; spawn VII; spawn VIII; sync; spawn IX; sync; A More-Parallel Construction I I II III IV V V VI VII VIII IX n
© 2006 by Charles E. LeisersonMultithreaded Programming in Cilk —L ECTURE 2July 14, n spawn I; sync; spawn II; spawn III; sync; spawn IV; spawn V; spawn VI sync; spawn VII; spawn VIII; sync; spawn IX; sync; A More-Parallel Construction I I II III IV V V VI VII VIII IX n Work: T 1 (n) = ? 9T 1 (n/3) + (1) = (n 2 ) — C ASE 1
© 2006 by Charles E. LeisersonMultithreaded Programming in Cilk —L ECTURE 2July 14, n spawn I; sync; spawn II; spawn III; sync; spawn IV; spawn V; spawn VI sync; spawn VII; spawn VIII; sync; spawn IX; sync; A More-Parallel Construction I I II III IV V V VI VII VIII IX n Span: T 1 (n) = ? 5T 1 (n/3) + (1) = (n log 3 5 ) — C ASE 1
© 2006 by Charles E. LeisersonMultithreaded Programming in Cilk —L ECTURE 2July 14, Analysis of Revised Construction Work: T 1 (n) = (n 2 ) Span: T 1 (n) = (n log 3 5 ) ¼ (n 1.46 ) Parallelism: T1(n)T1(n) T1(n)T1(n) ¼ (n 0.54 ) More parallel by a factor of (n 0.54 )/ (n 0.42 ) = (n 0.12 ).
© 2006 by Charles E. LeisersonMultithreaded Programming in Cilk —L ECTURE 2July 14, Puzzle You may only use basic Cilk control constructs ( spawn, sync ) for synchronization. No locks, synchronizing through memory, etc. What is the largest parallelism that can be obtained for the tableau- construction problem using Cilk?
© 2006 by Charles E. LeisersonMultithreaded Programming in Cilk —L ECTURE 2July 14, L ECTURE 2 Matrix Multiplication Tableau Construction Recurrences (Review) Conclusion Merge Sort
© 2006 by Charles E. LeisersonMultithreaded Programming in Cilk —L ECTURE 2July 14, Key Ideas Cilk is simple: cilk, spawn, sync, SYNCHED Recurrences, recurrences, recurrences, … Work & span
© 2006 by Charles E. LeisersonMultithreaded Programming in Cilk —L ECTURE 2July 14, Minicourse Outline ● L ECTURE 1 Basic Cilk programming: Cilk keywords, performance measures, scheduling. ● L ECTURE 2 Analysis of Cilk algorithms: matrix multiplication, sorting, tableau construction. ● L ABORATORY Programming matrix multiplication in Cilk — Dr. Bradley C. Kuszmaul ● L ECTURE 3 Advanced Cilk programming: inlets, abort, speculation, data synchronization, & more.