Friday, September 22, 2006
"If one ox could not do the job they did not try to grow a bigger ox, but used two oxen."
- Grace Murray Hopper (1906-1992)

Today
§ Block matrix operations
§ Network topologies

Strided access
§ Stride: a sequence of memory reads and writes to addresses, each separated from the previous one by a constant interval called the "stride length"
§ Unit stride
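
To make the definition concrete, here is a minimal C sketch of strided access. The function name `strided_sum` and its parameters are illustrative assumptions, not part of the slides. With `stride == 1` the loop touches consecutive elements (unit stride); larger strides skip elements, so each cache line fetched contributes fewer useful values.

```c
#include <stddef.h>

/* Sum every `stride`-th element of x[0..n-1].
   stride == 1 is the unit-stride case: consecutive addresses are read.
   (Hypothetical helper for illustration only.) */
double strided_sum(const double *x, size_t n, size_t stride)
{
    double sum = 0.0;
    if (stride == 0) {
        return sum;            /* guard: a zero stride would never advance */
    }
    for (size_t i = 0; i < n; i += stride) {
        sum += x[i];           /* address advances by stride * sizeof(double) */
    }
    return sum;
}
```

Walking down one column of a row-major n-by-n matrix is the same pattern with a stride of n elements.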

do i = 1, N
  do j = 1, N
    A[i] = A[i] + B[j]
  enddo
enddo
N is large, so B[j] cannot remain in cache until it is used again in the next iteration of the outer loop: there is little reuse between touches.
How many cache misses occur for A and B?
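
One possible back-of-the-envelope answer, under assumptions the slide does not state: a cache line holds L array elements, the cache uses LRU replacement and is too small to hold all of B, and A[i] stays in a register or in cache for the whole inner loop. Here M_A and M_B denote the approximate miss counts for A and B.

```latex
\begin{align*}
M_A &\approx \frac{N}{L}, \\
M_B &\approx N \cdot \frac{N}{L} = \frac{N^2}{L}.
\end{align*}
```

Each line of A is loaded once over the whole run, while B is streamed through the cache in full on every one of the N outer iterations.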

Blocking
Inner loop split into strips of length S:
do i = 1, N
  do j = 1, N, S
    do jj = j, MIN(j+S-1, N)
      A[i] = A[i] + B[jj]
    enddo
  enddo
enddo
Original loop:
do i = 1, N
  do j = 1, N
    A[i] = A[i] + B[j]
  enddo
enddo

Blocking
Blocked loop, with the strip loop moved outermost:
do j = 1, N, S
  do i = 1, N
    do jj = j, MIN(j+S-1, N)
      A[i] = A[i] + B[jj]
    enddo
  enddo
enddo
Original loop:
do i = 1, N
  do j = 1, N
    A[i] = A[i] + B[j]
  enddo
enddo
S is the maximum number of elements of B that can remain in cache between two iterations of the i loop.
This transformation is called blocking, or strip mining.
How many cache misses occur for A and B?
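
The blocked loop nest can be sketched in C as follows, using 0-based indexing. The function names `add_all_naive` and `add_all_blocked` are assumptions made for illustration, and the strip length `s` would in practice be tuned to the cache size of the target machine.

```c
#include <stddef.h>

/* Original loop nest: a[i] += b[j] for all i, j. */
void add_all_naive(double *a, const double *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            a[i] += b[j];
        }
    }
}

/* Blocked (strip-mined) version: the strip b[j .. jend-1] is reused
   for every i before the next strip is touched. */
void add_all_blocked(double *a, const double *b, size_t n, size_t s)
{
    if (s == 0) {
        s = 1;                 /* guard: a zero strip length would not advance */
    }
    for (size_t j = 0; j < n; j += s) {
        size_t jend = (j + s < n) ? j + s : n;   /* exclusive upper bound */
        for (size_t i = 0; i < n; i++) {
            for (size_t jj = j; jj < jend; jj++) {
                a[i] += b[jj];
            }
        }
    }
}
```

For each a[i], the elements of b are still added in increasing index order, so both versions produce the same result. Under idealized assumptions (a cache line holds L elements and a strip of S elements of B stays resident), B would miss about N/L times in total and A about N/L times per strip, or roughly N^2/(S L) overall.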

Operation Count vs. Memory Operations
§ Example: Matrix multiplication
§ Previous example?
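
A rough comparison of the two examples, assuming that "previous example" refers to the A[i] = A[i] + B[j] loop nest and counting each multiplication and addition as one operation:

```latex
\[
\frac{\text{operations}}{\text{data elements}}
  = \frac{2n^3}{3n^2} = \frac{2n}{3}
  \quad \text{(matrix multiplication of } n \times n \text{ matrices)}
\]
\[
\frac{\text{operations}}{\text{data elements}}
  = \frac{N^2}{2N} = \frac{N}{2}
  \quad \text{(loop nest } A[i] = A[i] + B[j] \text{)}
\]
```

In both cases each data element is used on the order of n times, which is the reuse that blocking tries to keep in cache.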

Matrix multiplication
int i, j, k;
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        for (k = 0; k < n; k++) {
            c[i][j] = c[i][j] + a[i][k] * b[k][j];
        }
    }
}
Remember to initialize c[i][j] to zero.

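Matrix multiplication with blocking
/* min(x, y) is assumed to be a user-supplied helper or macro; it is not part of standard C */
int i, j, k, ii, jj, kk;
for (ii = 0; ii < n; ii += S) {
    for (jj = 0; jj < n; jj += S) {
        for (kk = 0; kk < n; kk += S) {
            /* multiply one S-by-S block of a by one S-by-S block of b */
            for (i = ii; i < min(ii + S, n); i++) {
                for (j = jj; j < min(jj + S, n); j++) {
                    for (k = kk; k < min(kk + S, n); k++) {
                        c[i][j] = c[i][j] + a[i][k] * b[k][j];
                    }
                }
            }
        }
    }
}
Remember to initialize c[i][j] to zero.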

Exercise
§ Matrix-vector multiplication
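
One possible approach to the exercise, offered as a sketch and not as the intended solution: strip-mine the column loop so that a strip of the vector x stays in cache while every row uses it, in the same way B was blocked earlier. The function name `matvec_blocked`, the flat row-major storage of the matrix, and the strip length `s` are all assumptions.

```c
#include <stddef.h>

/* y += A * x for an n-by-n matrix stored row-major in a[0 .. n*n-1].
   The caller is expected to initialize y (for example to zero). */
void matvec_blocked(size_t n, size_t s,
                    const double *a, const double *x, double *y)
{
    if (s == 0) {
        s = 1;                 /* guard: a zero strip length would not advance */
    }
    for (size_t jj = 0; jj < n; jj += s) {
        size_t jend = (jj + s < n) ? jj + s : n;   /* exclusive upper bound */
        for (size_t i = 0; i < n; i++) {
            double acc = y[i];
            /* x[jj .. jend-1] is reused by every row i */
            for (size_t j = jj; j < jend; j++) {
                acc += a[i * n + j] * x[j];
            }
            y[i] = acc;
        }
    }
}
```

Each matrix element is still read only once, so the benefit, if any, comes from the reuse of x; whether it pays off depends on n and on the cache size.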

Cache coherence in multiprocessor systems
§ Suppose two processors on a shared bus have loaded the same variable.
§ If one processor changes the value of that variable, then either:
  - Invalidate the other copies, or
  - Update the other copies

Cache coherence in multiprocessor systems
§ What if a processor reads a data item only once initially?
§ The invalidate protocol is more commonly used.
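
The question above can be illustrated with a toy model. This is a deliberately simplified sketch, not an implementation of any real coherence protocol: processor P1 reads the variable once, and processor P0 then writes it `writes` times. The function names and the idea of counting one bus message per coherence action are assumptions made for illustration.

```c
/* Update protocol (toy model): every write by P0 broadcasts the new
   value, because P1's copy stays valid even though P1 never reads it again. */
unsigned long bus_messages_update(unsigned long writes)
{
    unsigned long messages = 0;
    int p1_has_copy = 1;                 /* P1 read the variable once */
    for (unsigned long w = 0; w < writes; w++) {
        if (p1_has_copy) {
            messages++;                  /* one update message per write */
        }
    }
    return messages;
}

/* Invalidate protocol (toy model): the first write invalidates P1's copy;
   later writes find no other valid copy and stay local. */
unsigned long bus_messages_invalidate(unsigned long writes)
{
    unsigned long messages = 0;
    int p1_has_copy = 1;                 /* P1 read the variable once */
    for (unsigned long w = 0; w < writes; w++) {
        if (p1_has_copy) {
            messages++;                  /* a single invalidation message */
            p1_has_copy = 0;             /* P1 never reads again, so never reloads */
        }
    }
    return messages;
}
```

In this model, 100 writes cost 100 messages under update but only 1 under invalidate, which is one reason an invalidate protocol can be attractive when a reader touches the data only once.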

False Sharing (multiprocessor)
§ Two processors are accessing different data items in the same cache block.
§ What happens if they both attempt to write to it?
§ Padding in data structures (tradeoff: space vs. time)
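
A minimal sketch of the padding idea in C. The 64-byte cache block size is an assumption (it is common but not universal), and the struct names are hypothetical. In the unpadded struct both counters will usually fall in the same cache block; in the padded struct they are a full block apart, so writes by two processors no longer contend for one block.

```c
#include <stddef.h>

#define CACHE_LINE_BYTES 64   /* assumed cache block size; check the target CPU */

/* Unpadded: a and b are adjacent and likely share one cache block. */
struct counters_shared {
    long a;                   /* written by processor 0 */
    long b;                   /* written by processor 1 */
};

/* One counter padded out to a full cache block. */
struct padded_counter {
    long value;
    char pad[CACHE_LINE_BYTES - sizeof(long)];   /* unused filler bytes */
};

/* Padded: a.value and b.value are CACHE_LINE_BYTES apart. */
struct counters_padded {
    struct padded_counter a;  /* written by processor 0 */
    struct padded_counter b;  /* written by processor 1 */
};
```

The cost is space: the padded struct occupies two full cache blocks where the unpadded one needs only two `long` values, which is the space vs. time tradeoff named on the slide.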

Network Topologies
§ Bus-based, crossbar, and multistage networks
§ Earth Simulator: crossbar
§ IBM SP-2: multistage network

Network Topologies
Large number of links in a completely connected network.
Bottleneck at the central node in a star topology.
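
The difference in link count can be made concrete with a small sketch. The function names are illustrative assumptions; `p` is the total number of nodes, and for the star it includes the central node.

```c
/* Completely connected network: one link per pair of nodes, p(p-1)/2. */
unsigned long links_complete(unsigned long p)
{
    if (p < 2) {
        return 0;
    }
    return p * (p - 1) / 2;
}

/* Star network: every non-central node has one link to the center, p-1. */
unsigned long links_star(unsigned long p)
{
    if (p < 2) {
        return 0;
    }
    return p - 1;
}
```

For 64 nodes this gives 2016 links when completely connected but only 63 in a star, where all traffic between non-central nodes has to pass through the center.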

Network Topologies
1-D torus
Intel Paragon: 2-D mesh
BlueGene/L: 3-D torus
Cray T3E: 3-D cube

§ 2-D and 3-D meshes are common in parallel computers.
§ Regularly structured computations map naturally to a 2-D mesh.
§ 3-D network topologies: weather modeling, structural modeling
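
As an illustration of how a regular 2-D computation maps onto such a network, the sketch below computes the neighbor of a node in an nx-by-ny mesh and in the corresponding torus. The row-major numbering of nodes and the function names are assumptions made for this example.

```c
/* Neighbor of node `rank` at offset (dx, dy) in an nx-by-ny 2-D mesh.
   Nodes are numbered row-major: rank = y * nx + x.
   Returns -1 when the neighbor would fall outside the mesh. */
int mesh_neighbor(int rank, int nx, int ny, int dx, int dy)
{
    int x = rank % nx + dx;
    int y = rank / nx + dy;
    if (x < 0 || x >= nx || y < 0 || y >= ny) {
        return -1;             /* a mesh has no wraparound links */
    }
    return y * nx + x;
}

/* Same neighbor in a 2-D torus: the edges wrap around. */
int torus_neighbor(int rank, int nx, int ny, int dx, int dy)
{
    int x = ((rank % nx + dx) % nx + nx) % nx;   /* wrap into 0 .. nx-1 */
    int y = ((rank / nx + dy) % ny + ny) % ny;   /* wrap into 0 .. ny-1 */
    return y * nx + x;
}
```

In a 4-by-3 network, node 0 has no left neighbor in the mesh, while in the torus its left neighbor is node 3. A grid-based computation that exchanges boundary data with its four neighbors would use exactly these links.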

