Published by Leslie McLaughlin. Modified over 8 years ago.
1
CSc 8530 Matrix Multiplication and Transpose By Jaman Bhola
2
Outline
Matrix multiplication: one processor
Matrix multiplication: parallel (algorithm, example, analysis)
Matrix transpose (algorithm, example, analysis)
3
Matrix multiplication: 1 processor
Using one processor: $O(n^3)$ time.
Algorithm:
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            t = 0;
            for (k = 0; k < n; k++) {
                t = t + a[i][k] * b[k][j];
            }
            c[i][j] = t;
        }
    }
4
Matrix multiplication: parallel
Using a hypercube: the algorithm given in the book assumes the multiplication of two $n \times n$ matrices where $n$ is a power of 2, which lets the processors be arranged as a hypercube. It needs $N = n^3 = 2^{3q}$ processors, where $n = 2^q$ is the size of the matrix.
5
Each of the $N$ processors occupies one vertex of the hypercube. Processor $P_r$ has position $(i, j, k)$, where $r = in^2 + jn + k$ for $0 \le i, j, k \le n-1$.
If the binary representation of $r$ is $r_{3q-1} r_{3q-2} \ldots r_{2q}\; r_{2q-1} \ldots r_q\; r_{q-1} \ldots r_0$, then the binary representations of $i$, $j$ and $k$ are $r_{3q-1} r_{3q-2} \ldots r_{2q}$, $r_{2q-1} \ldots r_q$ and $r_{q-1} \ldots r_0$ respectively.
With this numbering, processors that are directly connected in the hypercube have indices that differ in exactly one binary digit.
6
Also, all processors that agree in one or two of the coordinates $i$, $j$, $k$ form a hypercube.
Example: building a hypercube for $q = 1$ gives $N = n^3 = 2^{3q} = 8$ processors. For $P_r$, where $r = in^2 + jn + k$, we get:
7
Processor   r = 4i + 2j + k    i  j  k
P_0         0 + 0 + 0 = 0      0  0  0
P_1         0 + 0 + 1 = 1      0  0  1
P_2         0 + 2 + 0 = 2      0  1  0
P_3         0 + 2 + 1 = 3      0  1  1
P_4         4 + 0 + 0 = 4      1  0  0
P_5         4 + 0 + 1 = 5      1  0  1
P_6         4 + 2 + 0 = 6      1  1  0
P_7         4 + 2 + 1 = 7      1  1  1
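The mapping between a position $(i, j, k)$ and the index $r$ can be sketched in C as follows. This is a minimal illustration that assumes $n = 2^q$; the names `pack_index` and `unpack_index` are hypothetical and do not come from the slides.

```c
#include <assert.h>

/* Pack position (i, j, k) into the processor index r = i*n^2 + j*n + k.
   Assumes n = 2^q, so each coordinate occupies q bits of r. */
unsigned pack_index(unsigned i, unsigned j, unsigned k, unsigned q)
{
    unsigned n = 1u << q;
    assert(i < n && j < n && k < n);
    return (i << (2 * q)) | (j << q) | k;
}

/* Recover (i, j, k) from r by slicing its 3q-bit binary representation. */
void unpack_index(unsigned r, unsigned q,
                  unsigned *i, unsigned *j, unsigned *k)
{
    unsigned mask = (1u << q) - 1u;
    *i = (r >> (2 * q)) & mask;  /* bits 3q-1 .. 2q */
    *j = (r >> q) & mask;        /* bits 2q-1 .. q  */
    *k = r & mask;               /* bits q-1 .. 0   */
}
```

For $q = 1$, `pack_index(1, 0, 1, 1)` gives 5, matching the row for $P_5$ in the table.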
8
[Figure: 3-dimensional hypercube with vertices labeled 000, 001, 010, 011, 100, 101, 110 and 111.]
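The neighbor relation in the hypercube can be expressed with single-bit operations. The sketch below uses the hypothetical helper names `flip_bit` and `hamming`. `flip_bit(r, m)` computes the index written $r^{(m)}$ in the algorithm, that is, $r$ with bit $m$ complemented.

```c
#include <assert.h>

/* r^(m): the index obtained from r by complementing bit m.
   In a hypercube, P_r and P_{r^(m)} are directly connected. */
unsigned flip_bit(unsigned r, unsigned m)
{
    assert(m < 32);
    return r ^ (1u << m);
}

/* Number of bit positions in which a and b differ (Hamming distance).
   Two hypercube vertices are neighbors exactly when this equals 1. */
unsigned hamming(unsigned a, unsigned b)
{
    unsigned x = a ^ b, count = 0;
    while (x != 0) {
        count += x & 1u;
        x >>= 1;
    }
    return count;
}
```

For example, vertices 101 and 100 are neighbors, while 000 and 111 are three links apart.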
9
Processor layout
Each processor $P_r$ has three registers: $A_r$, $B_r$ and $C_r$.
[Figure: processor $P_0$ with its registers A, B and C.]
The following is a step-by-step description of the algorithm.
10
Step 1: The elements of A and B are distributed to the $n^3$ processors so that the processor in position $(i,j,k)$ contains $a_{ji}$ and $b_{ik}$. Initially, $A(0,j,k) = a_{jk}$ and $B(0,j,k) = b_{jk}$.
(1.1) Copies of the data initially in $A(0,j,k)$ and $B(0,j,k)$ are sent to the processors in positions $(i,j,k)$, where $1 \le i \le n-1$, resulting in $A(i,j,k) = a_{jk}$ and $B(i,j,k) = b_{jk}$ for $0 \le i \le n-1$.
(1.2) Copies of the data in $A(i,j,i)$ are sent to the processors in positions $(i,j,k)$, where $0 \le k \le n-1$, resulting in $A(i,j,k) = a_{ji}$ for $0 \le k \le n-1$.
(1.3) Copies of the data in $B(i,i,k)$ are sent to the processors in positions $(i,j,k)$, where $0 \le j \le n-1$, resulting in $B(i,j,k) = b_{ik}$ for $0 \le j \le n-1$.
11
Step 2: Each processor in position $(i,j,k)$ computes the product $C(i,j,k) = A(i,j,k) \cdot B(i,j,k)$. Thus $C(i,j,k) = a_{ji} \cdot b_{ik}$ for $0 \le i,j,k \le n-1$.
Step 3: The sum $C(0,j,k) = \sum_{i=0}^{n-1} C(i,j,k)$ is computed for $0 \le j,k \le n-1$, so that $C(0,j,k) = c_{jk}$.
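The three steps can be checked with a sequential simulation over an $n \times n \times n$ array of "processors". This is only a sketch of the data placement described above, not a parallel implementation; the function name `cube_multiply` and the bound `N_MAX` are assumptions made for the example.

```c
#include <assert.h>

#define N_MAX 8  /* hypothetical upper bound on n for this sketch */

/* Sequentially simulate Steps 1-3 on an n x n x n grid of processors.
   a and b are the n x n inputs; c receives the product a*b. */
void cube_multiply(int n, int a[N_MAX][N_MAX], int b[N_MAX][N_MAX],
                   int c[N_MAX][N_MAX])
{
    static int C[N_MAX][N_MAX][N_MAX];
    int i, j, k;
    assert(n >= 1 && n <= N_MAX);

    /* Steps 1 and 2: processor (i,j,k) ends up holding a_ji and b_ik
       and multiplies them. */
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            for (k = 0; k < n; k++)
                C[i][j][k] = a[j][i] * b[i][k];

    /* Step 3: sum over i into layer 0, which then holds c_jk. */
    for (j = 0; j < n; j++)
        for (k = 0; k < n; k++) {
            int sum = 0;
            for (i = 0; i < n; i++)
                sum += C[i][j][k];
            c[j][k] = sum;
        }
}
```

The sum over $i$ of $a_{ji} b_{ik}$ is exactly the definition of $c_{jk}$, which is why layer 0 holds the product after Step 3.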
12
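The algorithm:
Here $r^{(m)}$ denotes the index obtained from $r$ by complementing bit $m$, and $N(\text{condition})$ denotes the set of indices $r$, $0 \le r \le N-1$, that satisfy the condition.
Step 1:
(1.1) for m = 3q-1 downto 2q do
        for all $r \in N(r_m = 0)$ do in parallel
          (i) $A_{r^{(m)}} = A_r$
          (ii) $B_{r^{(m)}} = B_r$
        end for
      end for
(1.2) for m = q-1 downto 0 do
        for all $r \in N(r_m = r_{2q+m})$ do in parallel
          $A_{r^{(m)}} = A_r$
        end for
      end for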
13
(1.3) for m = 2q-1 downto q do
        for all $r \in N(r_m = r_{q+m})$ do in parallel
          $B_{r^{(m)}} = B_r$
        end for
      end for
Step 2:
for r = 0 to N-1 do in parallel
  $C_r = A_r \cdot B_r$
end for
14
Step 3:
for m = 2q to 3q-1 do
  for all $r \in N(r_m = 0)$ do in parallel
    $C_r = C_r + C_{r^{(m)}}$
  end for
end for
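The bit-level algorithm can be traced with a sequential simulation of the $N = 2^{3q}$ registers. The sketch below makes three assumptions: layer 0 initially holds $a_{jk}$ and $b_{jk}$, substep (1.3) runs for $m = 2q-1$ down to $q$, and Step 3 accumulates into $C_r$. The names `hypercube_multiply`, `bit` and `P_MAX` are hypothetical. In every loop the senders and receivers are disjoint, so updating the arrays in place gives the same result as a parallel step.

```c
#include <assert.h>

#define P_MAX 64  /* hypothetical limit: N = n^3 <= 64, i.e. q <= 2 */

static int bit(int r, int m) { return (r >> m) & 1; }

/* Simulate the hypercube algorithm for n = 2^q.
   a and b are n x n matrices in row-major order; c receives a*b. */
void hypercube_multiply(int q, const int *a, const int *b, int *c)
{
    int n = 1 << q, N = 1 << (3 * q);
    int A[P_MAX] = {0}, B[P_MAX] = {0}, C[P_MAX];
    int m, r;
    assert(q >= 1 && N <= P_MAX);

    /* Initial load: layer i = 0 holds A(0,j,k) = a_jk and B(0,j,k) = b_jk. */
    for (r = 0; r < n * n; r++) { A[r] = a[r]; B[r] = b[r]; }

    /* Step 1.1: copy layer 0 to every layer i. */
    for (m = 3 * q - 1; m >= 2 * q; m--)
        for (r = 0; r < N; r++)
            if (bit(r, m) == 0) {
                A[r ^ (1 << m)] = A[r];
                B[r ^ (1 << m)] = B[r];
            }

    /* Step 1.2: spread A(i,j,i) over k, giving A(i,j,k) = a_ji. */
    for (m = q - 1; m >= 0; m--)
        for (r = 0; r < N; r++)
            if (bit(r, m) == bit(r, 2 * q + m))
                A[r ^ (1 << m)] = A[r];

    /* Step 1.3: spread B(i,i,k) over j, giving B(i,j,k) = b_ik. */
    for (m = 2 * q - 1; m >= q; m--)
        for (r = 0; r < N; r++)
            if (bit(r, m) == bit(r, q + m))
                B[r ^ (1 << m)] = B[r];

    /* Step 2: one multiplication per processor. */
    for (r = 0; r < N; r++)
        C[r] = A[r] * B[r];

    /* Step 3: sum over the i bits into layer 0. */
    for (m = 2 * q; m <= 3 * q - 1; m++)
        for (r = 0; r < N; r++)
            if (bit(r, m) == 0)
                C[r] += C[r ^ (1 << m)];

    for (r = 0; r < n * n; r++)
        c[r] = C[r];
}
```

Each of the loops over `m` runs $q = \log n$ times, which is where the logarithmic running time of the parallel algorithm comes from.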
15
An example using $2 \times 2$ matrices. This example requires $n^3 = 8$ processors. The matrices are
$A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}$ and $B = \begin{pmatrix} -1 & -2 \\ -3 & -4 \end{pmatrix}$.
16
[Figure: contents of the A and B registers of the eight processors after each substep of Step 1.]
17
[Figure: contents of the C registers after Step 2 (the eight products) and after Step 3 (the sums held in layer 0).]
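As a check on the example, the product of the two matrices can be worked out directly from the definition $c_{jk} = \sum_i a_{ji} b_{ik}$. Each entry is the sum of two of the eight products computed in Step 2:

```latex
\begin{align*}
c_{00} &= (1)(-1) + (2)(-3) = -7,  & c_{01} &= (1)(-2) + (2)(-4) = -10,\\
c_{10} &= (3)(-1) + (4)(-3) = -15, & c_{11} &= (3)(-2) + (4)(-4) = -22,
\end{align*}
\qquad
C = AB = \begin{pmatrix} -7 & -10 \\ -15 & -22 \end{pmatrix}.
```

These four values are the ones left in layer 0 after Step 3.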
18
Analysis of algorithm
If the layout of the processors is viewed as an $n \times n \times n$ array, it consists of $n$ layers, each an $n \times n$ array of processors. Initially, each processor in layer 0 has a distinct element of matrix A in its A register and a distinct element of matrix B in its B register. Loading these values is a constant-time operation.
Step 1.1: The number of layers holding the data doubles at each iteration, so it takes $O(\log n)$ time to copy the data from layer 0 to layers 1 through $n-1$.
19
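Steps 1.2 and 1.3: In layer $i$, each processor in column $i$ sends its A data to the processors in its row, and similarly each processor in row $i$ sends its B data to the processors in its column. Each step requires $q = \log n$ constant-time iterations.
Step 2 requires constant time.
Step 3 requires $q$ constant-time iterations.
Overall, the algorithm requires $O(\log n)$ time. Its cost, however, is $n^3 \times O(\log n) = O(n^3 \log n)$, which is not optimal, since the sequential algorithm takes $O(n^3)$ time.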
20
A faster algorithm: Quinn
For all $P_m$, where $1 \le m \le p$:
    for i = m to n step p do
        for j = 1 to n do
            t = 0
            for k = 1 to n do
                t = t + a[i][k] * b[k][j]
            c[i][j] = t
Time: $O(n^3/p + p)$. Maximum number of processors: $n^2$.
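The work assigned to one processor $P_m$ in this scheme can be sketched as a C function. The name `multiply_rows_of` and the row-major storage are assumptions made for the example; calling it once for each $m$ from 1 to $p$ reproduces the full product sequentially.

```c
#include <assert.h>

/* Work done by processor P_m (1 <= m <= p) in the row-cyclic scheme:
   it computes rows m, m+p, m+2p, ... of c = a*b, with rows numbered
   from 1 as in the pseudocode. Matrices are n x n in row-major order. */
void multiply_rows_of(int m, int p, int n,
                      const double *a, const double *b, double *c)
{
    int i, j, k;
    assert(1 <= m && m <= p);
    for (i = m; i <= n; i += p)
        for (j = 1; j <= n; j++) {
            double t = 0.0;
            for (k = 1; k <= n; k++)
                t += a[(i - 1) * n + (k - 1)] * b[(k - 1) * n + (j - 1)];
            c[(i - 1) * n + (j - 1)] = t;
        }
}
```

Because each processor writes a disjoint set of rows of `c`, the $p$ calls do not interfere with each other.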
21
An actual implementation
Get the processor id, then compute the half-open range of rows [lower, upper) that this processor is responsible for. When p does not divide n, processor 0 also takes the leftover rows, so that the entire matrix is computed.
    chunksize = n / p;               /* integer division */
    differ = n - chunksize * p;      /* leftover rows; 0 if p divides n */
    if (id == 0) {
        lower = 0;
        upper = chunksize + differ;
    } else {
        lower = id * chunksize + differ;
        upper = (id + 1) * chunksize + differ;
    }
    for (i = lower; i < upper; i++) {
        for (j = 0; j < n; j++) {
            total = 0;
            for (k = 0; k < n; k++) {
                total = total + mat1[i][k] * mat2[k][j];
            }
            mat3[i][j] = total;
        }
    }
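Whether a block partition like this really covers every row can be checked mechanically. The sketch below assumes that process 0 absorbs the `n % p` leftover rows and that ranges are half-open. The names `row_range` and `ranges_tile` are hypothetical.

```c
#include <assert.h>

/* Half-open row range [*lower, *upper) owned by process id out of p.
   Assumption: process 0 takes the leftover rows when p does not divide n. */
void row_range(int id, int p, int n, int *lower, int *upper)
{
    int chunksize = n / p;
    int differ = n - chunksize * p;
    assert(p >= 1 && id >= 0 && id < p);
    if (id == 0) {
        *lower = 0;
        *upper = chunksize + differ;
    } else {
        *lower = id * chunksize + differ;
        *upper = (id + 1) * chunksize + differ;
    }
}

/* Returns 1 if the ranges of processes 0..p-1 tile [0, n) with no gap
   and no overlap, and 0 otherwise. */
int ranges_tile(int p, int n)
{
    int expected = 0, id, lower, upper;
    for (id = 0; id < p; id++) {
        row_range(id, p, n, &lower, &upper);
        if (lower != expected || upper < lower)
            return 0;
        expected = upper;
    }
    return expected == n;
}
```

With `n = 10` and `p = 3`, the ranges come out as [0, 4), [4, 7) and [7, 10). Giving all leftover rows to one process is simple but not load balanced. Spreading them over the first `n % p` processes would be a common alternative.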
23
Another faster algorithm: Gupta & Sadayappan
The 3-D Diagonal Algorithm is a 3-phase algorithm.
The concept: a hypercube of $p$ processors is viewed as a 3-D mesh of size $\sqrt[3]{p} \times \sqrt[3]{p} \times \sqrt[3]{p}$.
Matrices A and B are each partitioned into $p^{2/3}$ blocks, with $\sqrt[3]{p}$ blocks along each dimension.
Initially, it is assumed that A and B are mapped onto the 2-D plane $x = y$, and the 2-D plane $y = j$ is responsible for calculating the outer product of $A_{*,j}$ (the set of columns of A stored at processors $p_{j,j,*}$) and $B_{j,*}$ (the set of rows of B).
24
Phase 1: Point-to-point communication of $B_{k,i}$ by $p_{i,i,k}$ to $p_{i,k,k}$.
Phase 2: One-to-all broadcasts of the blocks of A along the x-direction and of the newly acquired blocks of B (from Phase 1) along the z-direction, i.e. processor $p_{i,i,k}$ broadcasts $A_{k,i}$ to $p_{*,i,k}$ and every processor of the form $p_{i,k,k}$ broadcasts $B_{k,i}$ to $p_{i,k,*}$.
At the end of Phase 2, every processor $p_{i,j,k}$ has the blocks $A_{k,j}$ and $B_{j,i}$. Each processor now calculates the product of its pair of blocks of A and B.
25
Phase 3: After computation, there is reduction by addition in the y-direction providing the final matrix C.
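The data movement of the three phases can be traced with a sequential sketch that uses $1 \times 1$ blocks, i.e. it assumes $p = n^3$, so that the mesh side $\sqrt[3]{p}$ equals $n$. The function name `diagonal_3d` and the bound `S_MAX` are hypothetical, and the sketch follows the phase descriptions above rather than the paper's own code.

```c
#include <assert.h>

#define S_MAX 4  /* hypothetical bound on the mesh side s = cbrt(p) */

/* Simulate the three phases with 1 x 1 blocks on an s x s x s mesh.
   A and B are the s x s inputs; C receives the product A*B.
   a[i][j][k] and b[i][j][k] model the data held by processor p(i,j,k). */
void diagonal_3d(int s, int A[S_MAX][S_MAX], int B[S_MAX][S_MAX],
                 int C[S_MAX][S_MAX])
{
    static int a[S_MAX][S_MAX][S_MAX], b[S_MAX][S_MAX][S_MAX];
    int i, j, k, x, z;
    assert(s >= 1 && s <= S_MAX);

    /* Phase 1: p(i,i,k), which holds B_ki, sends it to p(i,k,k). */
    for (i = 0; i < s; i++)
        for (k = 0; k < s; k++)
            b[i][k][k] = B[k][i];

    /* Phase 2: p(i,i,k) broadcasts A_ki along x to p(*,i,k);
       p(i,k,k) broadcasts B_ki along z to p(i,k,*). */
    for (i = 0; i < s; i++)
        for (k = 0; k < s; k++) {
            for (x = 0; x < s; x++)
                a[x][i][k] = A[k][i];
            for (z = 0; z < s; z++)
                b[i][k][z] = b[i][k][k];
        }

    /* Local product A_kj * B_ji at p(i,j,k), then Phase 3:
       reduction by addition along y into C_ki at p(i,i,k). */
    for (i = 0; i < s; i++)
        for (k = 0; k < s; k++) {
            C[k][i] = 0;
            for (j = 0; j < s; j++)
                C[k][i] += a[i][j][k] * b[i][j][k];
        }
}
```

With larger blocks, the scalar product in the last loop becomes a block matrix product and the additions become block additions. The communication pattern stays the same.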
26
Algorithm Analysis
Phase 1: Passing messages of size $n^2/p^{2/3}$ requires $\log(\sqrt[3]{p})\,\left(t_s + t_w\, n^2/p^{2/3}\right)$ time, where $t_s$ is the startup time for sending a message and $t_w$ is the time it takes to send a word from one processor to its neighbor.
Phase 2 takes twice as much time as Phase 1.
Phase 3 can be completed in the same amount of time as Phase 1.
Overall, the algorithm takes $\left(\tfrac{4}{3}\log p,\ \tfrac{n^2}{p^{2/3}} \cdot \tfrac{4}{3}\log p\right)$, where a pair $(a, b)$ stands for a communication time of $t_s a + t_w b$.
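The overall expression can be evaluated numerically. To stay in exact arithmetic, the sketch below assumes $p = 2^{3d}$, so that $\sqrt[3]{p} = 2^d$ and $\tfrac{4}{3}\log_2 p = 4d$. The function name `comm_time` and the parameter `d` are hypothetical.

```c
#include <assert.h>

/* Communication time of the one-port algorithm for p = 2^(3d) processors.
   With cbrt(p) = 2^d and (4/3) log2 p = 4d, this returns t_s*a + t_w*b
   where a = 4d and b = (n^2 / p^(2/3)) * 4d. */
double comm_time(int n, int d, double ts, double tw)
{
    double side, block, a, b;
    assert(n >= 1 && d >= 0 && d < 10);
    side = (double)(1 << d);                 /* cbrt(p) */
    block = ((double)n * n) / (side * side); /* words per block: n^2 / p^(2/3) */
    a = 4.0 * d;
    b = block * 4.0 * d;
    return ts * a + tw * b;
}
```

For example, with $n = 8$, $p = 8$ (so $d = 1$), $t_s = 10$ and $t_w = 1$, each block has 16 words and the estimate is $10 \cdot 4 + 1 \cdot 64 = 104$ time units.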
27
Some added conditions are:
1. $p \le n^3$
2. Overall space used is $2 n^2 \sqrt[3]{p}$.
The above description is for a one-port hypercube architecture, in which a processor can use at most one communication link at a time to send and receive data. On a multi-port architecture, in which a processor can use all of its communication ports simultaneously, the algorithm is faster: the above time is reduced by a factor of $\log(\sqrt[3]{p})$.
28
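The algorithm
Initial distribution: processor $p_{i,i,k}$ contains $A_{k,i}$ and $B_{k,i}$.
Program of processor $p_{i,j,k}$:
    if (i = j) then
        Send $B_{k,i}$ to $p_{i,k,k}$
        Broadcast $A_{k,i}$ to all processors $p_{*,i,k}$
    endif
    if (j = k) then
        Receive $B_{j,i}$ from $p_{i,i,j}$
        Broadcast $B_{j,i}$ to all processors $p_{i,j,*}$
    endif
    Receive $A_{k,j}$ from $p_{j,j,k}$ and $B_{j,i}$ from $p_{i,j,j}$
    Calculate $I_{k,i} = A_{k,j} \times B_{j,i}$
    Send $I_{k,i}$ to $p_{i,i,k}$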
29
    if (i = j) then
        for l = 0 to $\sqrt[3]{p} - 1$
            Receive $I_{k,i}$ from $p_{i,l,k}$
            $C_{k,i} = C_{k,i} + I_{k,i}$
        endfor
    endif
$I$ is an intermediate matrix.
30
Matrix Transposition
The same concept is used here as in matrix multiplication.
The number of processors used is $N = n^2 = 2^{2q}$, and processor $P_r$ occupies position $(i,j)$, where $r = in + j$ and $0 \le i,j \le n-1$.
Initially, processor $P_r$ holds element $a_{ij}$ of matrix A, where $r = in + j$.
Upon termination, processor $P_s$ holds element $a_{ij}$, where $s = jn + i$.
31
If $r$ is represented by $r_{2q-1} r_{2q-2} \ldots r_q\; r_{q-1} \ldots r_1 r_0$, then the binary representations of $i$ and $j$ are $r_{2q-1} r_{2q-2} \ldots r_q$ and $r_{q-1} \ldots r_1 r_0$ respectively.
Likewise, $s$ is represented by $s_{2q-1} s_{2q-2} \ldots s_q\; s_{q-1} \ldots s_1 s_0$, and the binary representations of $j$ and $i$ are $s_{2q-1} s_{2q-2} \ldots s_q$ and $s_{q-1} \ldots s_1 s_0$ respectively.
Thus $r_{2q-1} r_{2q-2} \ldots r_q = s_{q-1} \ldots s_1 s_0$ and $r_{q-1} r_{q-2} \ldots r_0 = s_{2q-1} s_{2q-2} \ldots s_q$.
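In other words, the destination index $s$ is obtained from $r$ by exchanging its high and low $q$-bit halves. A minimal C sketch, with the hypothetical name `transpose_index`:

```c
#include <assert.h>

/* Destination index s = j*n + i for the element that starts at
   r = i*n + j, with n = 2^q: swap the two q-bit halves of r. */
unsigned transpose_index(unsigned r, unsigned q)
{
    unsigned mask = (1u << q) - 1u;
    unsigned i, j;
    assert(r < (1u << (2 * q)));
    i = (r >> q) & mask;  /* high half: row index    */
    j = r & mask;         /* low half: column index  */
    return (j << q) | i;
}
```

For $q = 2$, index 0010 (row 0, column 2) maps to 1000 (row 2, column 0), and diagonal elements such as 0101 map to themselves.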
32
The algorithm
First, the requirements for the algorithm: each processor $P_u$ needs two registers, $A_u$ and $B_u$.
The index of $P_u$ is written $u = u_{2q-1} u_{2q-2} \ldots u_q\; u_{q-1} \ldots u_1 u_0$, matching that of $r$.
33
for m = 2q-1 downto q do
  for u = 0 to N-1 do in parallel
    (1) if $u_m \ne u_{m-q}$ then $B_{u^{(m)}} = A_u$ endif
    (2) if $u_m = u_{m-q}$ then $A_{u^{(m-q)}} = B_u$ endif
  endfor
endfor
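The register transfers can be simulated sequentially, doing all of substep (1) before substep (2) in each iteration, as the parallel algorithm does. In this sketch the names `hypercube_transpose`, `tbit` and `T_MAX` are hypothetical, and `u ^ (1u << m)` plays the role of $u^{(m)}$.

```c
#include <assert.h>

#define T_MAX 64  /* hypothetical limit: N = n^2 <= 64 processors */

static unsigned tbit(unsigned u, unsigned m) { return (u >> m) & 1u; }

/* Transpose an n x n matrix (n = 2^q, row-major, N = n*n entries in A)
   by simulating the register transfers. A is overwritten in place. */
void hypercube_transpose(unsigned q, int *A)
{
    unsigned N = 1u << (2 * q), u, m;
    int B[T_MAX];
    assert(q >= 1 && N <= T_MAX);

    for (m = 2 * q - 1; m >= q; m--) {
        /* (1) processors with u_m != u_{m-q} store A_u in B of P_{u^(m)} */
        for (u = 0; u < N; u++)
            if (tbit(u, m) != tbit(u, m - q))
                B[u ^ (1u << m)] = A[u];
        /* (2) the receivers forward B_u to A of P_{u^(m-q)} */
        for (u = 0; u < N; u++)
            if (tbit(u, m) == tbit(u, m - q))
                A[u ^ (1u << (m - q))] = B[u];
    }
}
```

Every processor with $u_m = u_{m-q}$ received a value in substep (1), so substep (2) never reads an unset `B` entry.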
34
Explanation of algorithm
The algorithm follows a recursive scheme to achieve the transpose of A.
Divide the matrix into 4 submatrices of size $n/2 \times n/2$.
In iteration 1, when $m = 2q-1$, swap the elements of the top right submatrix with those of the bottom left submatrix. The other 2 submatrices are not touched.
Now do the same recursively within each of the 4 submatrices until all of the elements are swapped.
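The recursive view can be written down directly as a sequential C function. This is a sketch of the scheme just described, not of the hypercube implementation; the names `transpose_block` and `swap_int` are hypothetical, and the block size is assumed to be a power of 2.

```c
#include <assert.h>

static void swap_int(int *x, int *y) { int t = *x; *x = *y; *y = t; }

/* Transpose, in place, the size x size block whose top-left corner is
   (row, col) inside an n x n row-major matrix a.
   size must be a power of 2. */
void transpose_block(int *a, int n, int row, int col, int size)
{
    int h = size / 2, i, j;
    assert(size >= 1 && (size & (size - 1)) == 0);
    if (size == 1)
        return;

    /* Swap the top right quadrant with the bottom left quadrant. */
    for (i = 0; i < h; i++)
        for (j = 0; j < h; j++)
            swap_int(&a[(row + i) * n + col + h + j],
                     &a[(row + h + i) * n + col + j]);

    /* Recurse into each of the four quadrants. */
    transpose_block(a, n, row, col, h);
    transpose_block(a, n, row, col + h, h);
    transpose_block(a, n, row + h, col, h);
    transpose_block(a, n, row + h, col + h, h);
}
```

The recursion depth is $q = \log n$, one level per iteration of the parallel algorithm. In the parallel version, all quadrants at the same level are handled at once.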
35
Example. We want the transpose of the following matrix:
        a b c d
A =     e f g h
        i j k l
        m n o p
36
We use 16 processors with the following indices:
0000 0001 0010 0011
0100 0101 0110 0111
1000 1001 1010 1011
1100 1101 1110 1111
37
Drawing a hypercube for this:
[Figure: the 16 processors drawn as a 4-dimensional hypercube, with vertices labeled 0 through 15.]
38
Processor 0 (binary 0000) holds $a_{00}$, which is the value a.
Processor 1 (binary 0001) holds $a_{01}$, which is the value b.
Processor 2 (binary 0010) holds $a_{02}$, which is the value c.
And so on.
In the first iteration $m = 2q-1$; since $q = 2$ in this example, $m = 3$.
Step 1: Each $P_u$ with $u_3 \ne u_1$ sends its element $A_u$ to $P_{u^{(3)}}$, which stores the value in $B_{u^{(3)}}$; i.e. processors 2, 3, 6 and 7 send to processors 10, 11, 14 and 15 respectively, and processors 8, 9, 12 and 13 send to processors 0, 1, 4 and 5 respectively.
39
Step 2: Each processor that received data in Step 1 now sends the data from $B_u$ to $P_{u^{(1)}}$, where it is stored in $A_{u^{(1)}}$; i.e.
Processors 0, 1, 4, 5 send to 2, 3, 6, 7 respectively.
Processors 10, 11, 14, 15 send to 8, 9, 12, 13 respectively.
By the end of the first iteration our matrix A will look like:
        a b i j
A =     e f m n
        c d k l
        g h o p
40
In the second iteration, when $m = q = 2$:
Step 1: Each $P_u$ (where $u_2 \ne u_0$) sends $A_u$ to $P_{u^{(2)}}$, storing it in $B_{u^{(2)}}$. This is a simultaneous transfer from:
processor 4 to processor 0
processor 1 to processor 5
processor 6 to processor 2
processor 3 to processor 7
processor 12 to processor 8
processor 9 to processor 13
processor 14 to processor 10
processor 11 to processor 15
41
Step 2: For $u_2 = u_0$, each $P_u$ sends $B_u$ to $P_{u^{(0)}}$, where it is stored in $A_{u^{(0)}}$. This swaps the element in the top right corner with the element in the bottom left corner of each of the $2 \times 2$ submatrices. The transfers are from:
processor 0 to processor 1
processor 5 to processor 4
processor 2 to processor 3
processor 7 to processor 6
processor 8 to processor 9
processor 13 to processor 12
processor 10 to processor 11
processor 15 to processor 14
42
After the second iteration, we have the following transposed matrix:
        a e i m
A =     b f j n
        c g k o
        d h l p
43
Algorithm Analysis
It takes $q$ constant-time iterations, giving $t(n) = O(\log n)$.
But it takes $n^2$ processors. Therefore the cost is $O(n^2 \log n)$, which is not optimal, since a sequential transpose takes $O(n^2)$ time.
44
Bibliography
Akl, Parallel Computation: Models and Methods, Prentice Hall, 1997.
Drake, J.B. and Luo, Q., A Scalable Parallel Strassen's Matrix Multiplication Algorithm for Distributed-Memory Computers, Proceedings of the 1995 ACM Symposium on Applied Computing, February 1995, 221-226.
Gupta, H. and Sadayappan, P., Communication Efficient Matrix Multiplication on Hypercubes, Proceedings of the Sixth Annual ACM Symposium on Parallel Algorithms and Architectures, August 1994, 320-329.
Quinn, M.J., Parallel Computing: Theory and Practice, McGraw Hill, 1997.