1 Lecture 9 Architecture Independent (MPI) Algorithm Design
Parallel Computing Spring 2010

2 Matrix Computations
SPMD program design stipulates that all processors execute a single program, each on a different piece of the data. For matrix computations it therefore makes sense to distribute a matrix evenly among the p processors of a parallel computer. Such a distribution should also take into account how the matrix is stored (by, say, the compiler), so that locality is respected and cache lines are filled efficiently to speed up computation. There are various ways to divide a matrix; the most common ones are described below.
One way to distribute a matrix is by using block distributions: split the array into blocks of size n/p1 × n/p2, where p = p1 × p2, and assign the i-th block to processor i. Block distributions are suitable as long as the amount of work is the same for all elements of the matrix. The most common block distributions are:
• Column-wise (block) distribution. Split the matrix into p column stripes, so that the n/p consecutive columns forming the i-th stripe are stored on processor i. This is the case p1 = 1, p2 = p.
• Row-wise (block) distribution. Split the matrix into p row stripes, so that the n/p consecutive rows forming the i-th stripe are stored on processor i. This is the case p1 = p, p2 = 1.
• Block (square) distribution. This is the case p1 = p2 = √p, i.e. the blocks are of size n/√p × n/√p, and block i is stored on processor i.
There are certain cases (e.g. LU decomposition, Cholesky factorization) where the amount of work differs for different elements of the matrix. For these cases block distributions are not suitable.
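A minimal C sketch of the owner computation behind these block distributions (the function name is illustrative, not from any library; it assumes n is divisible by both p1 and p2):

/* Owner of element (i, j) under an n/p1 x n/p2 block distribution,
 * where p = p1 * p2 and blocks are numbered row by row.
 * Assumes n is divisible by both p1 and p2. */
int block_owner(int i, int j, int n, int p1, int p2)
{
    int bi = i / (n / p1);   /* block-row index,    0 <= bi < p1 */
    int bj = j / (n / p2);   /* block-column index, 0 <= bj < p2 */
    return bi * p2 + bj;     /* processor storing block (bi, bj) */
}

/* Row-wise:    block_owner(i, j, n, p, 1)
 * Column-wise: block_owner(i, j, n, 1, p)
 * Square:      block_owner(i, j, n, sqrt_p, sqrt_p) */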

3 Matrix block distributions

4 Matrix-Vector Multiplication
Sequential algorithm: n^2 multiplications and additions, so the running time is O(n^2).

MAT_VECT(A, x, y)
  for i = 0 to n-1 do
    y[i] = 0
    for j = 0 to n-1 do
      y[i] = y[i] + A[i][j] * x[j]

5 Matrix-Vector Multiplication: Rowwise 1-D Partitioning
Assume p = n (p = number of processes). Steps:
Step 1: Initial partitioning of the matrix and the vector. Matrix distribution: each process gets one complete row of the matrix. Vector distribution: the n × 1 vector is distributed so that each process owns one of its elements.
Step 2: All-to-all broadcast. Every process has one element of the vector, but every process needs the entire vector.
Step 3: Computation. Process Pi computes y[i] = Σj (A[i][j] · x[j]).
Running time: the all-to-all broadcast takes Θ(n) on any architecture, and multiplying a single row of A with the vector x takes Θ(n), so the total running time is Θ(n). Total work is Θ(n^2) – cost-optimal.

6 Matrix-Vector Multiplication: Rowwise 1-D Partitioning

7 Matrix-Vector Multiplication: Rowwise 1-D Partitioning
Assume p < n (p = number of processes). Three steps:
Step 1: Initial partitioning of the matrix and the vector. Each process initially stores n/p complete rows of the matrix and a portion of the vector of size n/p.
Step 2: All-to-all broadcast. Performed among p processes, with messages of size n/p.
Step 3: Computation. Each process multiplies its n/p rows of the matrix with the vector x to produce n/p elements of the result vector.
Running time:
All-to-all broadcast: T = (ts + (n/p) tw)(p − 1) on any architecture; T = ts log p + (n/p) tw (p − 1) on a hypercube.
Computation: T = n · n/p = Θ(n^2/p).
Total running time: T = Θ(n^2/p + ts log p + tw n).
Total work: W = Θ(n^2 + ts p log p + tw n p) – cost-optimal.
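A minimal MPI sketch of this rowwise algorithm (a sketch under the stated assumptions: n divisible by p, the n/p rows and n/p vector elements already distributed; variable names are illustrative):

#include <mpi.h>
#include <stdlib.h>

/* Rowwise 1-D matrix-vector multiply: each process holds n/p rows of A
 * (A_local, row-major) and n/p elements of x (x_local), and produces
 * the n/p elements y_local of the result. Assumes p divides n. */
void mat_vect_rowwise(const double *A_local, const double *x_local,
                      double *y_local, int n, MPI_Comm comm)
{
    int p;
    MPI_Comm_size(comm, &p);
    int nloc = n / p;

    /* Step 2: all-to-all broadcast of the vector pieces, so every
     * process ends up with the full n-element vector x. */
    double *x = malloc(n * sizeof(double));
    MPI_Allgather(x_local, nloc, MPI_DOUBLE, x, nloc, MPI_DOUBLE, comm);

    /* Step 3: multiply the local n/p rows with the full vector. */
    for (int i = 0; i < nloc; i++) {
        y_local[i] = 0.0;
        for (int j = 0; j < n; j++)
            y_local[i] += A_local[i * n + j] * x[j];
    }
    free(x);
}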

8 Matrix-Vector Multiplication: Columnwise 1-D Partitioning
Similar to rowwise 1-D Partitioning

9 Matrix-Vector Multiplication: 2-D Partitioning
Assume p = n^2. Steps:
Step 1: Initial partitioning. Each process gets one element of the matrix. The vector is distributed only among the processes on the diagonal, each of which owns one element.
Step 2: Columnwise one-to-all broadcast. The i-th element of the vector must be available to the i-th element of each row of the matrix, so this step consists of n simultaneous one-to-all broadcast operations, one in each column of processes.
Step 3: Computation. Each process multiplies its matrix element with the corresponding element of x.
Step 4: All-to-one reduction of partial results. The products computed in each row must be added, leaving the sums in the last column of processes.
Running time: one-to-all broadcast Θ(log n); computation in each process Θ(1); all-to-one reduction Θ(log n). Total running time: Θ(log n). Total work: Θ(n^2 log n) – not cost-optimal.

10 Matrix-Vector Multiplication: 2-D Partitioning

11 Matrix-Vector Multiplication: 2-D Partitioning
Assume p < n^2. Steps:
Step 1: Initial partitioning. Each process gets an (n/√p) × (n/√p) block of the matrix. The vector is distributed only among the processes on the diagonal, each of which owns n/√p elements.
Step 2: Columnwise one-to-all broadcast. The i-th group of n/√p vector elements must be available to the i-th group of blocks in each row of the matrix, so this step consists of √p simultaneous one-to-all broadcast operations, one in each column of processes.
Step 3: Computation. Each process multiplies its (n/√p) × (n/√p) block of the matrix with the corresponding n/√p elements of x.
Step 4: All-to-one reduction of partial results. The products computed in each row must be added, leaving the sums in the last column of processes.
Running time:
Columnwise one-to-all broadcast: T = (ts + (n/√p) tw) log √p on any architecture.
Computation in each process: T = (n/√p) · (n/√p) = n^2/p.
All-to-one reduction: T = (ts + (n/√p) tw) log √p on any architecture.
Total running time: T = n^2/p + 2(ts + (n/√p) tw) log √p on any architecture.
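A minimal MPI sketch of the 2-D algorithm using row and column sub-communicators (a sketch under the stated assumptions: a √p × √p grid with row-major ranks, nb = n/√p divides n, the vector held by the diagonal processes; names are illustrative):

#include <mpi.h>
#include <stdlib.h>

/* 2-D matrix-vector multiply on a sqrt_p x sqrt_p process grid.
 * Process (r, c) holds an nb x nb block A_local, nb = n / sqrt_p.
 * Only the diagonal (r == c) holds meaningful data in x_local before
 * the broadcast, but every process must supply an nb-element buffer.
 * The result y_local lands on the last column (c == sqrt_p - 1). */
void mat_vect_2d(const double *A_local, double *x_local, double *y_local,
                 int nb, int sqrt_p, MPI_Comm grid)
{
    int rank;
    MPI_Comm_rank(grid, &rank);
    int r = rank / sqrt_p;                   /* grid row    */
    int c = rank % sqrt_p;                   /* grid column */

    MPI_Comm col_comm, row_comm;
    MPI_Comm_split(grid, c, r, &col_comm);   /* processes in same column */
    MPI_Comm_split(grid, r, c, &row_comm);   /* processes in same row    */

    /* Step 2: the diagonal process of column c (grid row r == c, hence
     * rank c within col_comm) broadcasts its vector block down the column. */
    MPI_Bcast(x_local, nb, MPI_DOUBLE, c, col_comm);

    /* Step 3: local block times vector block -> nb partial sums. */
    double *partial = malloc(nb * sizeof(double));
    for (int i = 0; i < nb; i++) {
        partial[i] = 0.0;
        for (int j = 0; j < nb; j++)
            partial[i] += A_local[i * nb + j] * x_local[j];
    }

    /* Step 4: all-to-one reduction along each row into the last column. */
    MPI_Reduce(partial, y_local, nb, MPI_DOUBLE, MPI_SUM,
               sqrt_p - 1, row_comm);
    free(partial);

    MPI_Comm_free(&col_comm);
    MPI_Comm_free(&row_comm);
}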

12 Matrix-Vector Multiplication: 1-D Partitioning vs. 2-D Partitioning
Matrix-vector multiplication is faster with block 2-D partitioning of the matrix than with block 1-D partitioning for the same number of processes. Moreover, 1-D partitioning can use at most n processes, while 2-D partitioning can use up to n^2: if the number of processes is greater than n, 1-D partitioning cannot be used at all, and even when the number of processes is less than or equal to n, 2-D partitioning is preferable because it is faster.

13 Matrix Distributions : Block cyclic
In block-cyclic distributions the rows (and similarly the columns) are split into q groups of n/q consecutive rows each, where potentially q > p, and the i-th group is assigned to a processor in a cyclic fashion.
• Column-cyclic distribution. A one-dimensional cyclic distribution: split the matrix into q column stripes, so that the n/q consecutive columns forming the i-th stripe are stored on processor i % p, where % is the mod (remainder of integer division) operator. Usually q > p. The term wrapped-around column distribution is sometimes used for the case n/q = 1, i.e. q = n.
• Row-cyclic distribution. A one-dimensional cyclic distribution: split the matrix into q row stripes, so that the n/q consecutive rows forming the i-th stripe are stored on processor i % p. Usually q > p. The term wrapped-around row distribution is sometimes used for the case n/q = 1, i.e. q = n.
• Scattered distribution. Let the p = qi · qj processors be divided into qj groups, each group Pj consisting of qi processors; specifically, Pj = {j·qi + l | 0 ≤ l ≤ qi − 1}, and processor j·qi + l is called the l-th processor of group Pj. Matrix element (i, j), 0 ≤ i, j < n, is assigned to the (i mod qi)-th processor of group P(j mod qj). A scattered distribution usually refers to the special case qi = qj = √p.
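A minimal C sketch of the owner computations for these cyclic distributions (function names are illustrative; assumes q divides n):

/* Owner of column j (similarly row i) under a column-cyclic distribution
 * with q stripes of n/q consecutive columns dealt out over p processors. */
int cyclic_owner(int j, int n, int q, int p)
{
    int stripe = j / (n / q);    /* which of the q stripes column j is in */
    return stripe % p;           /* stripes are assigned cyclically       */
}

/* Owner of element (i, j) under the scattered distribution with
 * p = qi * qj: the (i mod qi)-th processor of group P_(j mod qj). */
int scattered_owner(int i, int j, int qi, int qj)
{
    return (j % qj) * qi + (i % qi);
}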

14 Block cyclic distributions

15 Scattered Distribution

16 Matrix Multiplication – Serial algorithm
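A minimal C sketch of the standard triple-loop serial algorithm, with Θ(n^3) multiplications and additions:

/* Serial matrix multiplication C = A * B for n x n matrices,
 * all stored row-major. */
void mat_mult(const double *A, const double *B, double *C, int n)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            C[i * n + j] = 0.0;
            for (int k = 0; k < n; k++)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
        }
}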

17 Matrix Multiplication
The algorithm for matrix multiplication presented below appeared in the seminal work of Valiant. It works for p ≤ n^2. Three steps:
Initial partitioning: matrices A and B are partitioned into p blocks Ai,j and Bi,j (0 ≤ i, j < √p) of size n/√p × n/√p each. These blocks are mapped onto a √p × √p logical mesh of processes, labeled P0,0 through P√p−1,√p−1.
All-to-all broadcast: process Pi,j initially stores Ai,j and Bi,j and computes block Ci,j of the result matrix. Computing submatrix Ci,j requires all submatrices Ai,k and Bk,j for 0 ≤ k < √p. To acquire all the required blocks, an all-to-all broadcast of matrix A's blocks is performed in each row of processes, and an all-to-all broadcast of matrix B's blocks is performed in each column.
Computation: after Pi,j acquires Ai,0, Ai,1, …, Ai,√p−1 and B0,j, B1,j, …, B√p−1,j, it performs the submatrix multiplication and addition steps of lines 7 and 8 in Alg. 8.3.
Running time:
All-to-all broadcast: T = (ts + (n^2/p) tw)(√p − 1) on any architecture; T = ts log √p + (n^2/p) tw (√p − 1) on a hypercube.
Computation: T = √p · (n/√p)^3 = n^3/p.
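A minimal MPI sketch of this broadcast-based algorithm (a sketch under the stated assumptions: a √p × √p grid with row-major ranks, block size nb = n/√p, one block of A and B per process; names are illustrative):

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

/* Broadcast-based parallel matrix multiply on a sqrt_p x sqrt_p grid.
 * Process (r, c) holds nb x nb blocks A_local and B_local and computes
 * the nb x nb block C_local of C = A * B, where nb = n / sqrt_p. */
void mat_mult_2d(const double *A_local, const double *B_local,
                 double *C_local, int nb, int sqrt_p, MPI_Comm grid)
{
    int rank;
    MPI_Comm_rank(grid, &rank);
    int r = rank / sqrt_p, c = rank % sqrt_p;
    int bsz = nb * nb;

    MPI_Comm row_comm, col_comm;
    MPI_Comm_split(grid, r, c, &row_comm);   /* same grid row    */
    MPI_Comm_split(grid, c, r, &col_comm);   /* same grid column */

    /* All-to-all broadcast: gather blocks A_{r,0..sqrt_p-1} along the
     * row and blocks B_{0..sqrt_p-1,c} along the column. */
    double *A_row = malloc(sqrt_p * bsz * sizeof(double));
    double *B_col = malloc(sqrt_p * bsz * sizeof(double));
    MPI_Allgather(A_local, bsz, MPI_DOUBLE, A_row, bsz, MPI_DOUBLE, row_comm);
    MPI_Allgather(B_local, bsz, MPI_DOUBLE, B_col, bsz, MPI_DOUBLE, col_comm);

    /* Computation: C_{r,c} = sum over k of A_{r,k} * B_{k,c}. */
    memset(C_local, 0, bsz * sizeof(double));
    for (int k = 0; k < sqrt_p; k++) {
        const double *Ak = A_row + k * bsz;
        const double *Bk = B_col + k * bsz;
        for (int i = 0; i < nb; i++)
            for (int j = 0; j < nb; j++) {
                double s = 0.0;
                for (int l = 0; l < nb; l++)
                    s += Ak[i * nb + l] * Bk[l * nb + j];
                C_local[i * nb + j] += s;
            }
    }
    free(A_row); free(B_col);
    MPI_Comm_free(&row_comm); MPI_Comm_free(&col_comm);
}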

18 Matrix Multiplication
The input matrices A and B are divided into p block-submatrices, each of dimension m × m, where m = n/√p. We call this distribution of the input among the processors a block distribution. This way, element A(i, j), 0 ≤ i, j < n, belongs to the ((j/m)·√p + (i/m))-th block (integer division), which is assigned to the memory of the same-numbered processor. Let Ai (respectively, Bi) denote the i-th block of A (respectively, B) stored in processor i. With these conventions the algorithm can be described as in Figure 1, and the following proposition describes its performance.

19 Matrix Multiplication

