CS 140 : Numerical Examples on Shared Memory with Cilk++


CS 140 : Numerical Examples on Shared Memory with Cilk++
Matrix-matrix multiplication
Matrix-vector multiplication
Thanks to Charles E. Leiserson for some of these slides

Work and Span (Recap)

TP = execution time on P processors
T1 = work
T∞ = span (also called critical-path length or computational depth)
Speedup on P processors: T1/TP
Parallelism: T1/T∞

Cilk Loops: Divide and Conquer

cilk_for (int i=0; i<n; ++i) {
    A[i] += B[i];
}

Vector addition. A cilk_for loop is implemented by divide and conquer over the iteration range; G denotes the grain size, the number of iterations handled serially at each leaf.

Work: T1 = Θ(n)
Span: T∞ = Θ(lg n)
Parallelism: T1/T∞ = Θ(n/lg n)
Assume that G = Θ(1).
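A minimal sketch of the divide-and-conquer lowering the slide alludes to (the name vadd_recur, the GRAIN constant, and the explicit recursion are illustrative assumptions, not the actual Cilk++ runtime code; headers are omitted as in the slides):

const int GRAIN = 1;  // grain size G; assume G = Θ(1) as above

template <typename T>
void vadd_recur(T *A, T *B, int lo, int hi) {
    if (hi - lo <= GRAIN) {
        for (int i = lo; i < hi; ++i)  // serial base case of G iterations
            A[i] += B[i];
    } else {
        int mid = lo + (hi - lo) / 2;
        cilk_spawn vadd_recur(A, B, lo, mid);  // left half, in parallel
        vadd_recur(A, B, mid, hi);             // right half, same worker
        cilk_sync;
    }
}

The recursion tree has Θ(n/G) leaves and depth Θ(lg(n/G)), which is where the Θ(lg n) span comes from when G = Θ(1).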

Square-Matrix Multiplication

C = A · B, where A = (aij), B = (bij), and C = (cij) are n × n matrices:

$$c_{ij} = \sum_{k=1}^{n} a_{ik}\, b_{kj}$$

Assume for simplicity that n = 2^k (an exact power of 2).

Parallelizing Matrix Multiply

cilk_for (int i=0; i<n; ++i) {
    cilk_for (int j=0; j<n; ++j) {
        for (int k=0; k<n; ++k) {
            C[i][j] += A[i][k] * B[k][j];
        }
    }
}

Work: T1 = Θ(n³)
Span: T∞ = Θ(n)   (the inner loop is serial; the two cilk_for control trees add only Θ(lg n))
Parallelism: T1/T∞ = Θ(n²)

For 1000 × 1000 matrices, parallelism ≈ (10³)² = 10⁶.

Recursive Matrix Multiplication

Divide and conquer — view each n × n matrix as a 2 × 2 matrix of n/2 × n/2 blocks:

$$\begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix}
= \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}
\cdot \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix}
= \begin{pmatrix} A_{11}B_{11} & A_{11}B_{12} \\ A_{21}B_{11} & A_{21}B_{12} \end{pmatrix}
+ \begin{pmatrix} A_{12}B_{21} & A_{12}B_{22} \\ A_{22}B_{21} & A_{22}B_{22} \end{pmatrix}$$

8 multiplications of n/2 × n/2 matrices.
1 addition of n × n matrices.

D&C Matrix Multiplication

template <typename T>
void MMult(T *C, T *A, T *B, int n) {  // n = row/column length of the matrices
    T *D = new T[n*n];
    // base case (coarsened for efficiency) & partition matrices;
    // submatrices such as C11, A12 are determined by index calculation
    cilk_spawn MMult(C11, A11, B11, n/2);
    cilk_spawn MMult(C12, A11, B12, n/2);
    cilk_spawn MMult(C22, A21, B12, n/2);
    cilk_spawn MMult(C21, A21, B11, n/2);
    cilk_spawn MMult(D11, A12, B21, n/2);
    cilk_spawn MMult(D12, A12, B22, n/2);
    cilk_spawn MMult(D22, A22, B22, n/2);
    MMult(D21, A22, B21, n/2);
    cilk_sync;
    MAdd(C, D, n); // C += D;
}
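The slide leaves the index calculation implicit. A hedged sketch of one way to do it for a row-major matrix (the helper name Block and the explicit leading-dimension parameter ld are our assumptions; the real code must thread ld through the recursion, because the blocks of the original matrix are not contiguous):

// Pointer to block (p, q) of an n × n matrix X stored row-major with
// leading dimension ld; p, q ∈ {1, 2}, and each block is n/2 × n/2.
template <typename T>
T* Block(T *X, int ld, int n, int p, int q) {
    return X + (p - 1) * (n/2) * ld + (q - 1) * (n/2);
}
// e.g. A11 = Block(A, ld, n, 1, 1);  B21 = Block(B, ld, n, 2, 1); ...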

Matrix Addition

The MAdd used by MMult above (the MMult code is unchanged):

template <typename T>
void MAdd(T *C, T *D, int n) {
    cilk_for (int i=0; i<n; ++i) {
        cilk_for (int j=0; j<n; ++j) {
            C[n*i+j] += D[n*i+j];
        }
    }
}

Analysis of Matrix Addition

For MAdd above:

Work: A1(n) = Θ(n²)
Span: A∞(n) = Θ(lg n)

Nested cilk_for statements have the same Θ(lg n) span as a single one.
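Why nesting does not increase the span (a one-line derivation, implicit on the slide): along any single path through the computation, the two loop-control trees compose additively, not multiplicatively:

$$A_\infty(n) = \underbrace{\Theta(\lg n)}_{\text{outer control}} + \underbrace{\Theta(\lg n)}_{\text{inner control}} + \underbrace{\Theta(1)}_{\text{body}} = \Theta(\lg n)$$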

Work of Matrix Multiplication

For MMult above (eight recursive multiplies plus one MAdd):

Work: M1(n) = 8 M1(n/2) + A1(n) + Θ(1)
            = 8 M1(n/2) + Θ(n²)
            = Θ(n³)

Master method, CASE 1: n^(log_b a) = n^(log₂ 8) = n³ dominates f(n) = Θ(n²).

Span of Matrix Multiplication

The recursive multiplies all run in parallel, so only the maximum of their spans counts:

Span: M∞(n) = M∞(n/2) + A∞(n) + Θ(1)
            = M∞(n/2) + Θ(lg n)
            = Θ(lg²n)

Master method, CASE 2: n^(log_b a) = n^(log₂ 1) = 1 and f(n) = Θ(n^(log_b a) lg¹ n).
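If CASE 2 of the master method is unfamiliar, unrolling the recurrence gives the same answer (a standard derivation, not from the slides):

$$M_\infty(n) = \sum_{i=0}^{\lg n} \Theta\!\left(\lg \frac{n}{2^i}\right) = \Theta\!\left(\sum_{i=0}^{\lg n} (\lg n - i)\right) = \Theta(\lg^2 n)$$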

Parallelism of Matrix Multiply

Work: M1(n) = Θ(n³)
Span: M∞(n) = Θ(lg²n)
Parallelism: M1(n)/M∞(n) = Θ(n³/lg²n)

For 1000 × 1000 matrices, parallelism ≈ (10³)³/10² = 10⁷.

Stack Temporaries

Every call to MMult allocates a temporary, T *D = new T[n*n], to hold the second group of products. Idea: since minimizing storage tends to yield higher performance, trade off parallelism for less storage.

No-Temp Matrix Multiplication

// C += A*B;
template <typename T>
void MMult2(T *C, T *A, T *B, int n) {
    // base case & partition matrices
    cilk_spawn MMult2(C11, A11, B11, n/2);
    cilk_spawn MMult2(C12, A11, B12, n/2);
    cilk_spawn MMult2(C22, A21, B12, n/2);
    MMult2(C21, A21, B11, n/2);
    cilk_sync;
    cilk_spawn MMult2(C11, A12, B21, n/2);
    cilk_spawn MMult2(C12, A12, B22, n/2);
    cilk_spawn MMult2(C22, A22, B22, n/2);
    MMult2(C21, A22, B21, n/2);
    // implicit cilk_sync at function return
}

Saves space, but at what expense?

Work of No-Temp Multiply

For MMult2 above:

Work: M1(n) = 8 M1(n/2) + Θ(1)
            = Θ(n³)

Master method, CASE 1: n^(log_b a) = n^(log₂ 8) = n³ dominates f(n) = Θ(1).

Span of No-Temp Multiply

The two groups of four calls, separated by the cilk_sync, run in series; within each group only the maximum span counts, so each level pays for two recursive calls:

Span: M∞(n) = 2 M∞(n/2) + Θ(1)
            = Θ(n)

Master method, CASE 1: n^(log_b a) = n^(log₂ 2) = n dominates f(n) = Θ(1).

Parallelism of No-Temp Multiply

Work: M1(n) = Θ(n³)
Span: M∞(n) = Θ(n)
Parallelism: M1(n)/M∞(n) = Θ(n²)

For 1000 × 1000 matrices, parallelism ≈ (10³)² = 10⁶. Faster in practice: the parallelism is lower than before but still ample, and avoiding the temporary saves allocation and improves locality.

How General Was That?

Matrices are often rectangular, and even when they are square their dimensions are hardly ever an exact power of two. In general we compute C += A · B, where C is m × n, A is m × k, and B is k × n. Which dimension should we split?

General Matrix Multiplication

template <typename T>
void MMult3(T *A, T *B, T *C,
            int i0, int i1, int j0, int j1, int k0, int k1) {
    int di = i1 - i0;   // remaining extent of m
    int dj = j1 - j0;   // remaining extent of n
    int dk = k1 - k0;   // remaining extent of k
    if (di >= dj && di >= dk && di >= THRESHOLD) {
        // Split m if it is the largest
        int mi = i0 + di / 2;
        MMult3(A, B, C, i0, mi, j0, j1, k0, k1);
        MMult3(A, B, C, mi, i1, j0, j1, k0, k1);
    } else if (dj >= dk && dj >= THRESHOLD) {
        // Split n if it is the largest
        int mj = j0 + dj / 2;
        MMult3(A, B, C, i0, i1, j0, mj, k0, k1);
        MMult3(A, B, C, i0, i1, mj, j1, k0, k1);
    } else if (dk >= THRESHOLD) {
        // Split k if it is the largest
        int mk = k0 + dk / 2;
        MMult3(A, B, C, i0, i1, j0, j1, k0, mk);
        MMult3(A, B, C, i0, i1, j0, j1, mk, k1);
    } else {
        // Iterative (triple-nested loop) multiply
        for (int i = i0; i < i1; ++i)
            for (int j = j0; j < j1; ++j)
                for (int k = k0; k < k1; ++k)
                    C[i][j] += A[i][k] * B[k][j];
    }
}
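For concreteness, a hedged usage sketch (our own, not from the slides): to compute C += A·B for an m × n result with inner dimension k, the top-level call covers the full index ranges. THRESHOLD is the base-case cutoff below which the serial triple loop runs; the slides leave its value unspecified, so treat it as a tuning parameter.

MMult3(A, B, C, 0, m, 0, n, 0, k);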

Parallelizing General MMult

template <typename T>
void MMult3(T *A, T *B, T *C,
            int i0, int i1, int j0, int j1, int k0, int k1) {
    int di = i1 - i0;
    int dj = j1 - j0;
    int dk = k1 - k0;
    if (di >= dj && di >= dk && di >= THRESHOLD) {
        int mi = i0 + di / 2;
        cilk_spawn MMult3(A, B, C, i0, mi, j0, j1, k0, k1);
        MMult3(A, B, C, mi, i1, j0, j1, k0, k1);
    } else if (dj >= dk && dj >= THRESHOLD) {
        int mj = j0 + dj / 2;
        cilk_spawn MMult3(A, B, C, i0, i1, j0, mj, k0, k1);
        MMult3(A, B, C, i0, i1, mj, j1, k0, k1);
    } else if (dk >= THRESHOLD) {
        int mk = k0 + dk / 2;
        // Unsafe to spawn here unless we use a temporary!
        MMult3(A, B, C, i0, i1, j0, j1, k0, mk);
        MMult3(A, B, C, i0, i1, j0, j1, mk, k1);
    } else {
        // Iterative (triple-nested loop) multiply
        for (int i = i0; i < i1; ++i)
            for (int j = j0; j < j1; ++j)
                for (int k = k0; k < k1; ++k)
                    C[i][j] += A[i][k] * B[k][j];
    }
    // implicit cilk_sync at function return
}

Split m: the two subproblems write disjoint halves of C (different rows), each reading its own rows of A and all of B. No races, safe to spawn!

Split n: the two subproblems write disjoint halves of C (different columns), each reading all of A and its own columns of B. No races, safe to spawn!

Split k: both subproblems accumulate into every element of C. Data races, unsafe to spawn!

Matrix-Vector Multiplication

y = A · x, where A = (aij) is an n × n matrix and x, y are n-vectors:

$$y_i = \sum_{j=1}^{n} a_{ij}\, x_j$$

Let each worker handle a single row!

Matrix-Vector Multiplication

template <typename T>
void MatVec(T *A, T *x, T *y, int m, int n) {
    for (int i=0; i<m; i++) {
        y[i] = 0;
        for (int j=0; j<n; j++)
            y[i] += A[i][j] * x[j];   // was "=" on the slide; must accumulate
    }
}

Parallelize:

template <typename T>
void MatVec(T *A, T *x, T *y, int m, int n) {
    cilk_for (int i=0; i<m; i++) {
        y[i] = 0;
        for (int j=0; j<n; j++)
            y[i] += A[i][j] * x[j];
    }
}
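The cilk_for is race-free because iteration i is the only one that writes y[i]. The slide stops here, but the same style of analysis as before applies (our own derivation, following the earlier slides; note that with a flat row-major array, as in MAdd, A[i][j] would be written A[n*i+j]):

Work: T1 = Θ(mn)
Span: T∞ = Θ(n + lg m)   (serial inner loop of length n, plus the cilk_for control tree over m rows)
Parallelism: T1/T∞ = Θ(mn/(n + lg m)), i.e. Θ(m) when n ≫ lg m — one useful worker per row, as the slide suggests.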