Matrix Multiplication (i,j,k)

    for i = 1 to n do
      for j = 1 to n do
        for k = 1 to n do
          C[i,j] = C[i,j] + A[i,k] x B[k,j]
        endfor
      endfor
    endfor
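The (i,j,k) loop nest above might look as follows in C. This is a minimal sketch under assumptions not stated on the slide: the matrices are square, stored by rows in flat arrays of doubles, and indexed from 0 rather than 1. The function name `matmul_ijk` is hypothetical.

```c
#include <stddef.h>

/* C += A * B for n-by-n matrices, (i,j,k) loop order.
   Assumption: row-major storage, so element (r,c) sits at index r*n + c. */
void matmul_ijk(size_t n, const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k < n; k++)
                /* A is read with stride 1, B with stride n. */
                C[i*n + j] += A[i*n + k] * B[k*n + j];
}
```

The caller is expected to zero C first if a plain product is wanted, since the loop accumulates into C exactly as the pseudocode does.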
(i,j,k) Memory Map
[Figure: C = A x B, with index labels i, j, k on the three matrices]
Scalar Architecture
[Figure: functional units, registers, cache memory, memory bus, main memory]
Cache lines: matrix stored by rows
[Figure: cache lines along the matrix rows, with the stride-1 dimension marked]
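To make "stored by rows" concrete, the address arithmetic can be sketched in C. The helper name `rm_index` is hypothetical, and the sketch assumes 0-based indices and a matrix with n columns.

```c
#include <stddef.h>

/* Offset of element (i, j) in a row-major matrix with n columns.
   Incrementing j moves by 1 element (stride 1);
   incrementing i moves by n elements (stride n). */
size_t rm_index(size_t i, size_t j, size_t n)
{
    return i*n + j;
}
```

With this layout, consecutive elements of a row share a cache line, while consecutive elements of a column are n elements apart.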
Matrix Multiplication (i,k,j): Improve Spatial Locality

    for i = 1 to n do
      for k = 1 to n do
        for j = 1 to n do
          C[i,j] = C[i,j] + A[i,k] x B[k,j]
        endfor
      endfor
    endfor
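A C sketch of the (i,k,j) order shows why it helps when matrices are stored by rows: the innermost loop now walks both C and B along a row. The row-major flat-array layout, the 0-based indices, and the name `matmul_ikj` are assumptions made for illustration.

```c
#include <stddef.h>

/* C += A * B for n-by-n row-major matrices, (i,k,j) loop order. */
void matmul_ikj(size_t n, const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < n; k++) {
            const double a = A[i*n + k];        /* invariant in the j loop */
            for (size_t j = 0; j < n; j++)
                C[i*n + j] += a * B[k*n + j];   /* stride 1 in both C and B */
        }
}
```

The arithmetic is the same as in the (i,j,k) version; only the order of memory accesses changes.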
(i,k,j) Memory Map
[Figure: C = A x B, with index labels i, j, k on the three matrices]
Matrix Multiplication (i,k,j): Improve Temporal Locality

    | C11 C12 C13 |   | A11 A12 A13 |   | B11 B12 B13 |
    | C21 C22 C23 | = | A21 A22 A23 | x | B21 B22 B23 |
    | C31 C32 C33 |   | A31 A32 A33 |   | B31 B32 B33 |

    C11 = A11 x B11 + A12 x B21 + A13 x B31
Submatrix Multiplication (i,k,j)

    for it = 1 to n by s do
      for kt = 1 to n by s do
        for jt = 1 to n by s do
          for i = it to min(it+s-1,n) do
            for k = kt to min(kt+s-1,n) do
              for j = jt to min(jt+s-1,n) do
                C[i,j] = C[i,j] + A[i,k] x B[k,j]
              endfor
            endfor
          endfor
        endfor
      endfor
    endfor
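The tiled loop nest can be sketched in C as below. The names `matmul_blocked` and `min_sz` are hypothetical, and the sketch assumes row-major flat arrays, 0-based indices, and a tile size s of at least 1. The `min_sz` bound plays the role of `min(it+s-1,n)` in the pseudocode and handles the case where s does not divide n.

```c
#include <stddef.h>

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* C += A * B for n-by-n row-major matrices, processed in s-by-s tiles.
   Assumption: s >= 1. Edge tiles are clipped to the matrix size. */
void matmul_blocked(size_t n, size_t s,
                    const double *A, const double *B, double *C)
{
    for (size_t it = 0; it < n; it += s)
        for (size_t kt = 0; kt < n; kt += s)
            for (size_t jt = 0; jt < n; jt += s)
                /* Multiply tile (it,kt) of A by tile (kt,jt) of B. */
                for (size_t i = it; i < min_sz(it + s, n); i++)
                    for (size_t k = kt; k < min_sz(kt + s, n); k++) {
                        const double a = A[i*n + k];
                        for (size_t j = jt; j < min_sz(jt + s, n); j++)
                            C[i*n + j] += a * B[k*n + j];
                    }
}
```

A common rule of thumb is to pick s so that the three tiles in use fit in cache together; the best value depends on the machine and is not given in the slides.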
(i,k,j) Memory Map
[Figure: C = A x B, with tile labels it, jt, kt and tile size s]
Multiprocessor Architecture
[Figure: two CPUs, each with cache memory, sharing a memory bus to main memory]
Parallel (i,k,j): Inner loop

    for i = 1 to n do
      for k = 1 to n do
        parfor j = 1 to n do
          C[i,j] = C[i,j] + A[i,k] x B[k,j]
        endparfor
      endfor
    endfor
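One way to express the inner-loop `parfor` in C is an OpenMP `parallel for` directive. The slides do not name a programming model, so OpenMP, the row-major layout, and the name `matmul_par_inner` are assumptions here. Compiled without OpenMP support, the pragma is ignored and the code runs serially.

```c
#include <stddef.h>

/* C += A * B, (i,k,j) order, with the innermost j loop run in parallel. */
void matmul_par_inner(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++) {
            const double a = A[(size_t)i*n + k];
            /* Each j writes a distinct C[i][j], so iterations are independent. */
            #pragma omp parallel for
            for (int j = 0; j < n; j++)
                C[(size_t)i*n + j] += a * B[(size_t)k*n + j];
        }
}
```

In this form the parallel loop is entered n x n times, each time with only n iterations of work, so the scheduling overhead is paid on every (i,k) pair.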
Parallel (i,k,j): Inner loop memory mapping
[Figure: C = A x B, with index labels i and k]
Parallel (i,k,j): Outer loop

    parfor i = 1 to n do
      for k = 1 to n do
        for j = 1 to n do
          C[i,j] = C[i,j] + A[i,k] x B[k,j]
        endfor
      endfor
    endparfor
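The outer-loop `parfor` could be sketched with the same OpenMP directive moved to the i loop. As before, OpenMP, the row-major layout, and the name `matmul_par_outer` are assumptions, not something the slides specify.

```c
#include <stddef.h>

/* C += A * B, (i,k,j) order, with the outermost i loop run in parallel. */
void matmul_par_outer(int n, const double *A, const double *B, double *C)
{
    /* Each i owns row i of C, so no two threads write the same element. */
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++) {
            const double a = A[(size_t)i*n + k];
            for (int j = 0; j < n; j++)
                C[(size_t)i*n + j] += a * B[(size_t)k*n + j];
        }
}
```

Here the parallel loop is entered once and each iteration carries a full row of work. All threads read the whole of B, while A and C are partitioned by rows.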
Parallel (i,k,j): Outer loop memory mapping
[Figure: C = A x B]
Parallel (i,k,j): Submatrix

    parfor it = 1 to n by s do
      for kt = 1 to n by s do
        for jt = 1 to n by s do
          for i = it to min(it+s-1,n) do
            for k = kt to min(kt+s-1,n) do
              for j = jt to min(jt+s-1,n) do
                C[i,j] = C[i,j] + A[i,k] x B[k,j]
              endfor
            endfor
          endfor
        endfor
      endfor
    endparfor
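Combining tiling with the outer `parfor` might look like this in C. OpenMP, the row-major layout, a tile size s of at least 1, and the names `matmul_par_blocked` and `min_int` are all assumptions made for the sketch.

```c
#include <stddef.h>

static int min_int(int a, int b) { return a < b ? a : b; }

/* C += A * B in s-by-s tiles, with row bands of tiles run in parallel.
   Assumption: s >= 1. */
void matmul_par_blocked(int n, int s,
                        const double *A, const double *B, double *C)
{
    /* Each it owns rows it .. it+s-1 of C, so threads never share a C element. */
    #pragma omp parallel for
    for (int it = 0; it < n; it += s)
        for (int kt = 0; kt < n; kt += s)
            for (int jt = 0; jt < n; jt += s)
                for (int i = it; i < min_int(it + s, n); i++)
                    for (int k = kt; k < min_int(kt + s, n); k++) {
                        const double a = A[(size_t)i*n + k];
                        for (int j = jt; j < min_int(jt + s, n); j++)
                            C[(size_t)i*n + j] += a * B[(size_t)k*n + j];
                    }
}
```

The parallel loop has only about n/s iterations, so a large s relative to n leaves few row bands to share among the threads.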