Parallel Programming in C with MPI and OpenMP

Presentation transcript:

1 Parallel Programming in C with MPI and OpenMP
Michael J. Quinn

2 Chapter 11: Matrix Multiplication

3 Outline
Sequential algorithms: iterative, row-oriented; recursive, block-oriented
Parallel algorithms: rowwise block striped decomposition; Cannon's algorithm

4 Iterative, Row-oriented Algorithm
Computes C = AB as a series of inner product (dot product) operations: each element of C is the dot product of a row of A with a column of B.

5 Performance as n Increases

6 Reason: Matrix B Gets Too Big for Cache
Computing a row of C requires accessing every element of B

7 Block Matrix Multiplication
Partition the matrices into square blocks; the same multiplication formula applies with scalar multiplication replaced by block matrix multiplication and scalar addition replaced by block matrix addition.

8 Recurse Until B Small Enough

9 Comparing Sequential Performance

10 First Parallel Algorithm
Partitioning: divide the matrices into rows; each primitive task has the corresponding rows of the three matrices.
Communication: each task must eventually see every row of B, so organize the tasks into a ring.

11 First Parallel Algorithm (cont.)
Agglomeration and mapping: there is a fixed number of tasks, each requiring the same amount of computation, with regular communication among tasks.
Strategy: assign each process a contiguous group of rows.

12–15 Communication of B
(Figures: in each of the four steps, every process multiplies its rows of A by the block of B it currently holds, accumulates into C, then passes that block of B to the next process in the ring.)

16 Complexity Analysis
Algorithm has p iterations. During each iteration a process multiplies an (n/p) × (n/p) block of A by an (n/p) × n block of B: Θ(n³/p²). Total computation time: Θ(n³/p). Each process ends up passing (p − 1)n²/p = Θ(n²) elements of B.

17 Weakness of Algorithm 1
The blocks of B being manipulated have p times more columns than rows. Each process must access every element of matrix B. The ratio of computation to communication is poor: only 2n/p.

18 Parallel Algorithm 2 (Cannon's Algorithm)
Associate a primitive task with each matrix element. Agglomerate the tasks responsible for a square (or nearly square) block of C. The computation-to-communication ratio rises to n/√p.

19 Elements of A and B Needed to Compute a Process's Portion of C
(Figure: side-by-side comparison for Algorithm 1 and Cannon's Algorithm.)

20 Blocks Must Be Aligned
(Figure: the process grid before and after alignment.)

21 Blocks Need to Be Aligned
(Figure: each triangle represents a matrix block Aij or Bij on a 4 × 4 grid; only same-color triangles, i.e. blocks Aik and Bkj with matching inner index k, should be multiplied together.)

22 Rearrange Blocks
Block Aij cycles left i positions; block Bij cycles up j positions. (Figure: the 4 × 4 grid of blocks after this initial alignment.)

23–26 Consider Process P1,2
(Figures: over steps 1 through 4, process P1,2 multiplies, in turn, the block pairs (A1,3, B3,2), (A1,0, B0,2), (A1,1, B1,2), and (A1,2, B2,2), accumulating all four terms of its block C1,2.)

27 Complexity Analysis
Algorithm has √p iterations. During each iteration a process multiplies two (n/√p) × (n/√p) matrices: Θ(n³/p^(3/2)). Computational complexity: Θ(n³/p). During each iteration a process also sends and receives two blocks of size (n/√p) × (n/√p). Communication complexity: Θ(n²/√p).

28 Summary
Considered two sequential algorithms: the iterative, row-oriented algorithm and the recursive, block-oriented algorithm. The second has the better cache hit rate as n increases.
Developed two parallel algorithms: the first based on rowwise block striped decomposition, the second on checkerboard block decomposition. The second algorithm is scalable, while the first is not.

