Matrix Multiplication
Instructor: Dr Sushil K Prasad
Presented by: R. Jayampathi Sampath
Outline
Introduction
Hypercube Interconnection Network
The Parallel Algorithm
Matrix Transposition
Communication Efficient Matrix Multiplication on Hypercubes (the paper)
Introduction
Matrix multiplication is an important problem in parallel algorithm design.
Matrix multiplication on a hypercube with p processors:
  – Small diameter: log(p)
  – Degree = log(p)
The straightforward RAM algorithm for MatMul requires O(n^3) time.
  – Sequential algorithm:
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            t = 0;
            for (k = 0; k < n; k++) {
                t = t + a[i][k] * b[k][j];   /* accumulate row i of A times column j of B */
            }
            c[i][j] = t;
        }
    }
Hypercube Interconnection Network
Hypercube Interconnection Network (contd.)
The formal specification of a hypercube interconnection network:
  – Let N = 2^g processors P_0, P_1, ..., P_(N-1) be available.
  – Let i and i^(b) be two integers, 0 <= i, i^(b) <= N-1, whose binary representations differ only in position b, 0 <= b < g.
  – Specifically, if i_(g-1) i_(g-2) ... i_(b+1) i_b i_(b-1) ... i_1 i_0 is the binary representation of i, then i_(g-1) i_(g-2) ... i_(b+1) i'_b i_(b-1) ... i_1 i_0 is the binary representation of i^(b), where i'_b is the complement of bit i_b.
A g-dimensional hypercube interconnection network is formed by connecting each processor P_i, 0 <= i <= N-1, to P_(i^(b)) by a two-way link for all 0 <= b < g.
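The neighbor relation in the specification above amounts to flipping one bit of the processor index. A minimal sketch in C, assuming processor indices fit in an unsigned int; the function names are hypothetical and not part of the slides:

```c
#include <assert.h>

/* Index of the neighbor of processor i across dimension b:
   flip bit b of i, i.e. the i^(b) of the definition. */
unsigned hypercube_neighbor(unsigned i, unsigned b)
{
    return i ^ (1u << b);
}

/* Fill out[0..g-1] with the g neighbors of processor i in a
   g-dimensional hypercube with N = 2^g processors. */
void hypercube_neighbors(unsigned i, unsigned g, unsigned *out)
{
    assert(g < 32 && i < (1u << g));
    for (unsigned b = 0; b < g; b++)
        out[b] = hypercube_neighbor(i, b);
}
```

For g = 3, processor 5 (binary 101) is linked to processors 4 (100), 7 (111) and 1 (001), one neighbor per dimension, which is why the degree is log(N).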
The Parallel Algorithm
Example (parallel algorithm): A and B are 2 x 2 matrices, n = 2 = 2^1, number of processors N = n^3 = 2^3 = 8; each processor index is written as three bits X,X,X for (i, j, k).
[Figure: Step 1 on the 8-processor hypercube, showing the initial step, Step 1.1 and Step 1.2]
  – A(0,j,k) and B(0,j,k) -> processors (i,j,k), where 1 <= i <= n-1
  – A(i,j,i) -> processors (i,j,k), where 0 <= k <= n-1
  – B(i,i,k) -> processors (i,j,k), where 0 <= j <= n-1
The Parallel Algorithm (contd.)
[Figure: Step 2 and Step 3 of the example]
The Parallel Algorithm (contd.)
Implementation of the straightforward RAM algorithm on a hypercube (HC).
The multiplication of two n x n matrices A and B, where n = 2^q.
Use an HC with N = n^3 = 2^(3q) processors.
Each processor P_r occupies position (i,j,k), where r = i*n^2 + j*n + k for 0 <= i,j,k <= n-1.
If the binary representation of r is r_(3q-1) r_(3q-2) ... r_(2q) r_(2q-1) ... r_q r_(q-1) ... r_0, then the binary representations of i, j and k are r_(3q-1) r_(3q-2) ... r_(2q), r_(2q-1) ... r_q and r_(q-1) ... r_0, respectively.
The Parallel Algorithm (contd.)
Example (positioning)
  – The multiplication of two 2 x 2 matrices A and B, where n = 2 = 2^1, so q = 1
  – Use an HC with N = n^3 = 2^(3q) = 8 processors
  – Each processor P_r occupies position (i,j,k), where r = i*2^2 + j*2 + k for 0 <= i,j,k <= 1
  – If the binary representation of r is r_2 r_1 r_0, then the binary representations of i, j and k are r_2, r_1 and r_0, respectively
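The positioning rule r = i*n^2 + j*n + k with n = 2^q can be sketched with shifts and masks, since i, j and k are just the three q-bit fields of r. The type and function names below are hypothetical, and the limit q <= 10 is an assumption made so that 3q bits fit in an unsigned int:

```c
#include <assert.h>

typedef struct { unsigned i, j, k; } Position;

/* r = i*n^2 + j*n + k with n = 2^q: concatenate the three q-bit fields. */
unsigned position_to_rank(Position p, unsigned q)
{
    unsigned n = 1u << q;
    assert(q <= 10 && p.i < n && p.j < n && p.k < n);
    return (p.i << (2 * q)) | (p.j << q) | p.k;
}

/* Split the 3q-bit index r back into its fields (i, j, k). */
Position rank_to_position(unsigned r, unsigned q)
{
    unsigned mask = (1u << q) - 1u;
    Position p;
    assert(q <= 10 && r < (1u << (3 * q)));
    p.k = r & mask;
    p.j = (r >> q) & mask;
    p.i = (r >> (2 * q)) & mask;
    return p;
}
```

With q = 1, processor P_5 (binary 101) sits at position (1,0,1); with q = 2, position (2,1,3) is processor 2*16 + 1*4 + 3 = 39.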
The Parallel Algorithm (contd.)
All processors with the same index value in one of the coordinates i, j, k form an HC with n^2 processors.
All processors with the same index values in two of the coordinates form an HC with n processors.
Each processor has three registers A_r, B_r and C_r, also denoted A(i,j,k), B(i,j,k) and C(i,j,k).
[Figure: registers A_r, B_r and C_r of processor 101]
The Parallel Algorithm (contd.)
Step 1: The elements of A and B are distributed to the n^3 processors so that the processor in position (i,j,k) contains a_ji and b_ik.
  – 1.1 Copies of the data initially in A(0,j,k) and B(0,j,k) are sent to the processors in positions (i,j,k), where 1 <= i <= n-1, resulting in A(i,j,k) = a_jk and B(i,j,k) = b_jk for 0 <= i <= n-1.
  – 1.2 Copies of the data in A(i,j,i) are sent to the processors in positions (i,j,k), where 0 <= k <= n-1, resulting in A(i,j,k) = a_ji for 0 <= k <= n-1.
  – 1.3 Copies of the data in B(i,i,k) are sent to the processors in positions (i,j,k), where 0 <= j <= n-1, resulting in B(i,j,k) = b_ik for 0 <= j <= n-1.
Step 2: Each processor in position (i,j,k) computes the product C(i,j,k) = A(i,j,k) * B(i,j,k).
Step 3: The sum C(0,j,k) = ∑ C(i,j,k) over 0 <= i <= n-1 is computed for 0 <= j,k <= n-1.
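The three steps can be checked with a small sequential simulation in which each array cell stands for the register of the processor at position (i,j,k). This is a sketch under stated assumptions, not the hypercube implementation itself: the function name and the NMAX limit are hypothetical, each broadcast is collapsed into a plain loop rather than q doubling iterations, and the sender in Step 1.3 is taken to be B(i,i,k), which is what the stated result B(i,j,k) = b_ik requires.

```c
#include <assert.h>

#define NMAX 4  /* largest n supported by this sketch (hypothetical limit) */

/* Simulates Steps 1-3; c receives the product of the n x n matrices a and b. */
void hypercube_matmul_sim(int n, double a[NMAX][NMAX],
                          double b[NMAX][NMAX], double c[NMAX][NMAX])
{
    double A[NMAX][NMAX][NMAX], B[NMAX][NMAX][NMAX];
    int i, j, k;

    assert(n >= 1 && n <= NMAX);

    /* Initial placement: the layer i = 0 holds a_jk and b_jk. */
    for (j = 0; j < n; j++)
        for (k = 0; k < n; k++) {
            A[0][j][k] = a[j][k];
            B[0][j][k] = b[j][k];
        }

    /* Step 1.1: copy layer 0 to layers i = 1..n-1. */
    for (i = 1; i < n; i++)
        for (j = 0; j < n; j++)
            for (k = 0; k < n; k++) {
                A[i][j][k] = A[0][j][k];
                B[i][j][k] = B[0][j][k];
            }

    /* Step 1.2: A(i,j,i) is sent along k, so A(i,j,k) = a_ji. */
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++) {
            double v = A[i][j][i];
            for (k = 0; k < n; k++)
                A[i][j][k] = v;
        }

    /* Step 1.3 (assumed sender B(i,i,k)): sent along j, so B(i,j,k) = b_ik. */
    for (i = 0; i < n; i++)
        for (k = 0; k < n; k++) {
            double v = B[i][i][k];
            for (j = 0; j < n; j++)
                B[i][j][k] = v;
        }

    /* Step 2 (local product) and Step 3 (sum over i into layer 0). */
    for (j = 0; j < n; j++)
        for (k = 0; k < n; k++) {
            double sum = 0.0;
            for (i = 0; i < n; i++)
                sum += A[i][j][k] * B[i][j][k];
            c[j][k] = sum;
        }
}
```

After Step 1 the processor at (i,j,k) holds a_ji and b_ik, so the sum over i of the products is exactly c_jk.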
The Parallel Algorithm (contd.)
Analysis
  – Steps 1.1, 1.2, 1.3 and 3 each consist of q constant-time iterations.
  – Step 2 requires constant time.
  – So, T(n^3) = O(q) = O(log n).
  – Cost: p * T(p) = O(n^3 log n).
  – Not cost optimal, since the sequential algorithm runs in O(n^3) time.
Example: n = 4 = 2^2, number of processors N = n^3 = 4^3 = 64; each processor index is written as three 2-bit fields XX,XX,XX for (i, j, k).
[Figure: initial placement; the data is held in A(0,j,k) and B(0,j,k)]
[Figure: Step 1.1; copies of A(0,j,k) and B(0,j,k) are sent to the processors in positions (i,j,k), where 1 <= i <= n-1]
[Figure: Step 1.2, senders of A; copies of the data in A(i,j,i) are sent to the processors in positions (i,j,k), where 0 <= k <= n-1]
[Figure: Step 1.3, senders of B; copies of the data in B(i,i,k) are sent to the processors in positions (i,j,k), where 0 <= j <= n-1]
Matrix Transposition
The number of processors used is N = n^2 = 2^(2q), and processor P_r occupies position (i,j), where r = i*n + j and 0 <= i,j <= n-1.
Initially, processor P_r holds element a_ij of matrix A, where r = i*n + j.
Upon termination, processor P_s holds element a_ij, where s = j*n + i.
Matrix Transposition (contd.)
A recursive interpretation of the algorithm
  – Divide the matrix into four (n/2) x (n/2) sub-matrices.
  – The first level of recursion:
      The elements of the bottom-left sub-matrix are swapped with the corresponding elements of the top-right sub-matrix.
      The elements of the other two sub-matrices are untouched.
  – The same step is then applied to each of the four (n/2) x (n/2) sub-matrices.
  – This continues until 2 x 2 matrices are transposed.
Analysis
  – The algorithm consists of q constant-time iterations.
  – T(n) = O(log n)
  – Cost = O(n^2 log n), which is not cost optimal: on the RAM, an n x n matrix is transposed with n(n-1)/2 operations by swapping a_ij with a_ji for all i < j.
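The recursive swap can be expressed on processor indices: at each level, a processor whose row bit and matching column bit differ exchanges its element with the processor whose two bits are both flipped. The sketch below simulates this sequentially, with reg[r] standing for the register of P_r; the function name is hypothetical. On an actual hypercube the two partners differ in two bit positions, so each exchange is routed over two links.

```c
#include <assert.h>

/* reg[r] is the element held by P_r, r = i*n + j, n = 2^q, N = 2^(2q).
   On return, P_s with s = j*n + i holds the element that started at r. */
void hypercube_transpose_sim(int *reg, unsigned q)
{
    unsigned N, m;

    assert(q >= 1 && q <= 15);
    N = 1u << (2 * q);

    /* One iteration per recursion level, from the most significant bit pair. */
    for (m = 2 * q - 1; m >= q; m--) {
        unsigned hi = 1u << m;        /* bit of the row index i    */
        unsigned lo = 1u << (m - q);  /* bit of the column index j */
        unsigned u;

        for (u = 0; u < N; u++) {
            /* Row bit set, column bit clear: bottom-left quadrant at this
               level. Its partner in the top-right quadrant is u with both
               bits flipped. Processors with equal bits are untouched. */
            if ((u & hi) && !(u & lo)) {
                unsigned v = u ^ hi ^ lo;
                int t = reg[u];
                reg[u] = reg[v];
                reg[v] = t;
            }
        }
    }
}
```

Each of the q iterations swaps one bit of i with the matching bit of j, so after the last one the row and column indices of every element have traded places.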
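Matrix Transposition (contd.)
Example
Initial matrix:
  A = 1 e c f
      b 2 d g
      h x 3 z
      v y w 4
After the first level (bottom-left and top-right 2 x 2 sub-matrices swapped):
  A = 1 e h x
      b 2 v y
      c f 3 z
      d g w 4
After the second level (each 2 x 2 sub-matrix transposed):
  A = 1 b h v
      e 2 x y
      c d 3 w
      f g z 4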
Outline
2D Diagonal Algorithm
The 3-D Diagonal Algorithm
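2D Diagonal Algorithm
[Figure: 4 x 4 example. A is partitioned into column groups A_*0, A_*1, A_*2, A_*3 and B into row groups B_0*, B_1*, B_2*, B_3*.]
Step 1 (diagonal processor p_ii holds A_*i and B_i*):
  A_*0 B_0*   .           .           .
  .           A_*1 B_1*   .           .
  .           .           A_*2 B_2*   .
  .           .           .           A_*3 B_3*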
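2D Diagonal Algorithm (Contd.)
Step 2 (A_*i is broadcast along row i):
  A_*0 B_0*   A_*0        A_*0        A_*0
  A_*1        A_*1 B_1*   A_*1        A_*1
  A_*2        A_*2        A_*2 B_2*   A_*2
  A_*3        A_*3        A_*3        A_*3 B_3*
Step 3 (the blocks of B_i* are distributed along row i, so p_ij holds A_*i and B_ij):
  A_*0 B_0*   A_*0 B_01   A_*0 B_02   A_*0 B_03
  A_*1 B_10   A_*1 B_1*   A_*1 B_12   A_*1 B_13
  A_*2 B_20   A_*2 B_21   A_*2 B_2*   A_*2 B_23
  A_*3 B_30   A_*3 B_31   A_*3 B_32   A_*3 B_3*
Step 4 (blocks of C, as labeled in the figure):
  C00,C10   C20,C30   C01,C11   C21,C31   C02,C12   C22,C32   C03,C13   C23,C33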
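The data movement in the figures can be mimicked with a sequential simulation. This sketch rests on several assumptions: every block is a single element (so the mesh side equals the matrix order and A_*i is column i of A), the step numbering follows the figure labels, and the final reduction runs over i within each mesh column j. The function name and the MESH constant are hypothetical.

```c
#include <assert.h>

#define MESH 4  /* mesh side and matrix order (one element per block) */

/* Simulates the 2D Diagonal data flow on a MESH x MESH processor grid. */
void diagonal2d_sim(double a[MESH][MESH], double b[MESH][MESH],
                    double c[MESH][MESH])
{
    double diagA[MESH][MESH], diagB[MESH][MESH];  /* held by p_ii      */
    double Acol[MESH][MESH][MESH];                /* A_*i copy at p_ij */
    double Bblk[MESH][MESH];                      /* B_ij at p_ij      */
    int i, j, r;

    assert(c != a && c != b);  /* output must not alias the inputs */

    /* Step 1: p_ii holds column group A_*i and row group B_i*. */
    for (i = 0; i < MESH; i++)
        for (r = 0; r < MESH; r++) {
            diagA[i][r] = a[r][i];
            diagB[i][r] = b[i][r];
        }

    /* Step 2: one-to-all broadcast of A_*i along row i. */
    for (i = 0; i < MESH; i++)
        for (j = 0; j < MESH; j++)
            for (r = 0; r < MESH; r++)
                Acol[i][j][r] = diagA[i][r];

    /* Step 3: personalized broadcast of B_i*, so p_ij receives B_ij. */
    for (i = 0; i < MESH; i++)
        for (j = 0; j < MESH; j++)
            Bblk[i][j] = diagB[i][j];

    /* Step 4: p_ij computes A_*i * B_ij; summing over i gives C_*j. */
    for (j = 0; j < MESH; j++)
        for (r = 0; r < MESH; r++) {
            double sum = 0.0;
            for (i = 0; i < MESH; i++)
                sum += Acol[i][j][r] * Bblk[i][j];
            c[r][j] = sum;
        }
}
```

The design point the sketch illustrates is that C is built as a sum of outer products: column group i of A times row group i of B, with the sum over i done by the reduction.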
2D Diagonal Algorithm (Contd.)
The above algorithm can be extended to a 3-D mesh embedded in a hypercube, with A_*i and B_i* initially distributed along the third dimension z.
Processor p_iik holds the sub-blocks A_ki and B_ik.
The one-to-all personalized broadcast of B_i* is then replaced by a point-to-point communication of B_ik from p_iik to p_kik.
It is followed by a one-to-all broadcast of B_ik from p_kik along the z direction.
The 3-D Diagonal Algorithm
An HC consisting of p processors can be visualized as a 3-D mesh of size p^(1/3) x p^(1/3) x p^(1/3).
Matrices A and B are partitioned into p^(2/3) blocks, with p^(1/3) blocks along each dimension.
Initially, it is assumed that A and B are mapped onto the 2-D plane x = y, with processor p_iik containing the blocks A_ki and B_ki.
The 3-D Diagonal Algorithm (contd.)
The algorithm consists of 3 phases:
  – Phase 1: point-to-point communication of B_ki from p_iik to p_ikk.
  – Phase 2: one-to-all broadcast of the blocks of A along the x direction and of the newly acquired blocks of B along the z direction.
      Now processor p_ijk has the blocks A_kj and B_ji.
      Each processor calculates the product of its blocks of A and B.
  – Phase 3: reduction by adding the result sub-matrices along the y direction (the product A_kj * B_ji contributes to C_ki, so the sum runs over j).
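The three phases can be traced with a sequential simulation on a small mesh. This is a sketch under stated assumptions: every block is a single element, the mesh side SIDE is a hypothetical value of p^(1/3), the broadcasts are collapsed into plain loops, and the final reduction is taken over the index j (the second coordinate), because the product A_kj * B_ji held by p_ijk contributes to C_ki. The function name is hypothetical.

```c
#include <assert.h>

#define SIDE 3  /* mesh side p^(1/3); one element per block (assumption) */

/* Simulates the 3-D Diagonal phases; c receives the product a * b. */
void diagonal3d_sim(double a[SIDE][SIDE], double b[SIDE][SIDE],
                    double c[SIDE][SIDE])
{
    double A[SIDE][SIDE][SIDE] = {{{0}}};  /* A[x][y][z]: block at p_xyz */
    double B[SIDE][SIDE][SIDE] = {{{0}}};
    int i, j, k, x, z;

    assert(c != a && c != b);  /* output must not alias the inputs */

    /* Initial mapping on the plane x = y: p_iik holds A_ki and B_ki. */
    for (i = 0; i < SIDE; i++)
        for (k = 0; k < SIDE; k++) {
            A[i][i][k] = a[k][i];
            B[i][i][k] = b[k][i];
        }

    /* Phase 1: p_iik sends B_ki to p_ikk. */
    for (i = 0; i < SIDE; i++)
        for (k = 0; k < SIDE; k++)
            B[i][k][k] = B[i][i][k];

    /* Phase 2a: p_jjk broadcasts A_kj along x, so p_xjk holds A_kj. */
    for (j = 0; j < SIDE; j++)
        for (k = 0; k < SIDE; k++) {
            double v = A[j][j][k];
            for (x = 0; x < SIDE; x++)
                A[x][j][k] = v;
        }

    /* Phase 2b: p_ijj broadcasts B_ji along z, so p_ijz holds B_ji. */
    for (i = 0; i < SIDE; i++)
        for (j = 0; j < SIDE; j++) {
            double v = B[i][j][j];
            for (z = 0; z < SIDE; z++)
                B[i][j][z] = v;
        }

    /* Local products, then Phase 3: sum over j gives C_ki at p_iik. */
    for (i = 0; i < SIDE; i++)
        for (k = 0; k < SIDE; k++) {
            double sum = 0.0;
            for (j = 0; j < SIDE; j++)
                sum += A[i][j][k] * B[i][j][k];
            c[k][i] = sum;
        }
}
```

The result C_ki ends up on the same plane x = y where the inputs started, with the same block layout as A.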
The 3-D Diagonal Algorithm (contd.)
Analysis
  – Phase 1: passing messages of size n^2 / p^(2/3) requires log(p^(1/3)) * (t_s + t_w * n^2 / p^(2/3)) time, where t_s is the start-up time for sending a message and t_w is the time to send one word from a processor to its neighbor.
  – Phase 2: takes twice as much time as Phase 1.
  – Phase 3: can be completed in the same amount of time as Phase 1.
Overall, the algorithm takes 4 * log(p^(1/3)) * (t_s + t_w * n^2 / p^(2/3)) time, that is, the pair ((4/3) log p, (n^2 / p^(2/3)) (4/3) log p) of coefficients of t_s and t_w.
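The overall figure follows from adding the three phase costs. The derivation below is a sketch that reads the pair (a, b) as the coefficients of t_s and t_w; that reading is an assumption, chosen because it agrees with the phase costs listed above.

```latex
\begin{align*}
T_1 &= \log\sqrt[3]{p}\,\Bigl(t_s + t_w \frac{n^2}{p^{2/3}}\Bigr)
     = \tfrac{1}{3}\log p\,\Bigl(t_s + t_w \frac{n^2}{p^{2/3}}\Bigr) \\
T_2 &= 2\,T_1, \qquad T_3 = T_1 \\
T   &= T_1 + T_2 + T_3 = 4\,T_1
     = \tfrac{4}{3}\log p \; t_s
       + \frac{n^2}{p^{2/3}}\,\tfrac{4}{3}\log p \; t_w
\end{align*}
```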
Bibliography
Akl, Parallel Computation: Models and Methods, Prentice Hall
Gupta, H. & Sadayappan, P., Communication Efficient Matrix Multiplication on Hypercubes, Proceedings of the Sixth Annual ACM Symposium on Parallel Algorithms and Architectures, August 1994
Quinn, M. J., Parallel Computing – Theory and Practice, McGraw Hill, 1997