12e.1 More on Parallel Computing UNC-Wilmington, C. Ferner, 2007 Mar 21, 2007
12e.2 Block Mapping (Review)
blksz = (int)ceil((float)N / P);
for (i = lb + my_rank * blksz; i < min(N, lb + (my_rank + 1) * blksz); i++) {
    ...
}
(lb is the lower bound of the original loop)
12e.3 Example
for (i = 1; i < N; i++) {
    for (j = 0; j < N; j++) {
        a[i][j] += f(a[i-1][j]);
    }
}
12e.4 Example [Figure: the iteration space drawn as a grid of (i, j) points, from (0,0) to (N-1,N-1), with axes i and j]
12e.5 Example If we mapped iterations of the i loop to processors, the dependencies would cross processor boundaries, because a[i][j] depends on a[i-1][j]. Interprocessor communication would therefore be required.
12e.6 Example [Figure: the same (i, j) iteration space, divided among PE 0, PE 1, PE 2, ..., PE P]
12e.7 Example A better solution would be to map iterations of the j loop to processors. Each dependency stays within a single column j, so no dependency crosses a processor boundary.
12e.8 Example [Figure: the same (i, j) iteration space, divided among PE 0, PE 1, PE 2, and PE 3]
12e.9 Example
for (i = 1; i < N; i++) {
    /* the j loop is block mapped: each processor handles its own columns */
    for (j = my_rank * blksz; j < min(N, (my_rank + 1) * blksz); j++) {
        a[i][j] += f(a[i-1][j]);
    }
}
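To see why mapping the j loop needs no communication, the following is a minimal sketch that runs each rank's share of the work one rank at a time and compares the result with the serial loop nest. The problem size N = 8, the body of f(), and the helper names (min_int, serial_update, rank_update, column_mapping_matches_serial) are assumptions made for illustration; they are not from the slides.

```c
#include <assert.h>

#define N 8 /* hypothetical problem size for this sketch */

/* Hypothetical stand-in for the slides' f(); any pure function would do. */
int f(int x) { return 2 * x + 1; }

int min_int(int a, int b) { return a < b ? a : b; }

/* Serial reference: the original loop nest. */
void serial_update(int a[N][N]) {
    for (int i = 1; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] += f(a[i - 1][j]);
}

/* The share of the work done by rank my_rank when the j loop is block mapped. */
void rank_update(int a[N][N], int my_rank, int P) {
    assert(P > 0);
    int blksz = (N + P - 1) / P; /* integer form of ceil(N / P) */
    for (int i = 1; i < N; i++)
        for (int j = my_rank * blksz; j < min_int(N, (my_rank + 1) * blksz); j++)
            a[i][j] += f(a[i - 1][j]);
}

/* Runs the P ranks one after another (in reverse order, to show that the
   order does not matter) and returns 1 if the result equals the serial one. */
int column_mapping_matches_serial(int P) {
    int s[N][N], p[N][N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s[i][j] = p[i][j] = i + j;
    serial_update(s);
    for (int r = P - 1; r >= 0; r--)
        rank_update(p, r, P);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            if (s[i][j] != p[i][j])
                return 0;
    return 1;
}
```

Because each rank reads and writes only its own columns, the ranks can run in any order, or at the same time, and still produce the serial result. On a real distributed-memory machine each rank would hold only its own columns; this sketch uses one shared array purely to make the comparison easy.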
12e.11 Block Mapping
12e.12 Block Mapping The problem is that block mapping can lead to a load imbalance. Example: let N=26 and P=6 (with lb = 0). Then blksz = ceiling(26/6) = 5.
12e.13 Block Mapping Processors 0-4 have 5 iterations of work. Processor 5 has 1 iteration.
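The iteration counts above can be checked with a short sketch. It assumes lb = 0 and uses integer arithmetic in place of the slides' (int)ceil((float)N / P); the function names block_size and block_count are made up for this example.

```c
#include <assert.h>

int min_int(int a, int b) { return a < b ? a : b; }

/* ceil(N / P) computed with integers only. */
int block_size(int N, int P) {
    assert(P > 0);
    return (N + P - 1) / P;
}

/* Number of iterations rank my_rank executes under block mapping (lb = 0). */
int block_count(int N, int P, int my_rank) {
    int blksz = block_size(N, P);
    int begin = min_int(N, my_rank * blksz);       /* first iteration of the block */
    int end   = min_int(N, (my_rank + 1) * blksz); /* one past the last iteration */
    return end - begin;
}
```

For N=26 and P=6, block_size(26, 6) is 5, block_count returns 5 for ranks 0 through 4, and it returns 1 for rank 5, which matches the slide.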
12e.14 Cyclic Mapping An alternative to block mapping is cyclic mapping, in which iterations are assigned to processors in round-robin fashion. This leads to a better load balance.
12e.15 Cyclic Mapping Processors 0-2 have 6 iterations of work. Processors 3-6 have only 5, but that is just 1 iteration fewer!
12e.16 Cyclic Mapping for (i = lb + my_rank; i < N; i += P) {... } (lb is the lower bound of the original loop)
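As a rough illustration of the cyclic loop above, this sketch counts how many iterations each rank receives. It assumes lb = 0, and the function names cyclic_count and cyclic_owner are hypothetical.

```c
#include <assert.h>

/* Number of iterations rank my_rank executes under cyclic mapping (lb = 0).
   The loop is the same one a rank would run to do its real work. */
int cyclic_count(int N, int P, int my_rank) {
    assert(P > 0);
    int count = 0;
    for (int i = my_rank; i < N; i += P)
        count++;
    return count;
}

/* The rank that executes iteration i under cyclic mapping (lb = 0). */
int cyclic_owner(int i, int P) {
    assert(P > 0);
    return i % P;
}
```

With N=38 and P=7 (values picked here only as an example), ranks 0-2 get 6 iterations and ranks 3-6 get 5. With N=26 and P=6, ranks 0-1 get 5 and ranks 2-5 get 4. In both cases the counts differ by at most 1.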
12e.17 Cyclic Mapping Conceptually, this is an easier mapping to implement than block mapping. It leads to better load balancing. However, it can (and often does) lead to more communication. Suppose that each iteration in the above example depends on the previous iteration.
12e.18 Cyclic Mapping With more than one processor, consecutive iterations always land on different processors, so a message is sent from iteration 0 to 1, from 1 to 2, from 2 to 3, from 3 to 4, from 4 to 5, from 5 to 6,...
12e.19 Block Mapping With block mapping, messages are sent only across block boundaries: from iteration 5 to 6, from 11 to 12, from 17 to 18, and from 23 to 24.
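The difference in communication can be counted directly. The sketch below assumes a chain dependency (iteration i needs the result of iteration i-1), lb = 0, and one message for each dependency whose two ends sit on different ranks. All function names are hypothetical.

```c
#include <assert.h>

/* Block mapping with lb = 0: iteration i belongs to rank i / blksz. */
int block_owner(int i, int blksz) {
    assert(blksz > 0);
    return i / blksz;
}

/* Cyclic mapping with lb = 0: iteration i belongs to rank i % P. */
int cyclic_owner(int i, int P) {
    assert(P > 0);
    return i % P;
}

/* Messages needed under block mapping when iteration i depends on i - 1. */
int block_messages(int N, int blksz) {
    int msgs = 0;
    for (int i = 1; i < N; i++)
        if (block_owner(i - 1, blksz) != block_owner(i, blksz))
            msgs++; /* the dependency crosses a block boundary */
    return msgs;
}

/* Messages needed under cyclic mapping when iteration i depends on i - 1. */
int cyclic_messages(int N, int P) {
    int msgs = 0;
    for (int i = 1; i < N; i++)
        if (cyclic_owner(i - 1, P) != cyclic_owner(i, P))
            msgs++; /* neighbours are on different ranks whenever P > 1 */
    return msgs;
}
```

For 26 iterations in blocks of 6, block_messages(26, 6) returns 4, one for each boundary listed above. With cyclic mapping over 6 processors, cyclic_messages(26, 6) returns 25, that is, one message for every dependency.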
12e.20 Block vs Cyclic Block mapping increases the granularity and reduces overall communication (O(P) messages). However, it can lead to load imbalances (up to O(N/P) iterations). Cyclic mapping decreases granularity and increases overall communication (O(N) messages). However, it improves load balance (an imbalance of O(1) iterations). Block-cyclic mapping is a combination of the two.
12e.21 Block-Cyclic Mapping Block-cyclic with N=26, P=6, and blksz=2. The load imbalance will be <= blksz.
12e.22 Block-Cyclic Mapping (N, P, and blksz are given)
nLayers = (int)ceil(((float)N) / (blksz * P));
for (layer = 0; layer < nLayers; layer++) {
    /* each layer holds one block per processor, i.e. blksz * P iterations */
    beginBlk = layer * blksz * P;
    for (i = beginBlk + mypid * blksz; i < min(N, beginBlk + (mypid + 1) * blksz); i++) {
        ...
    }
}
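As a check on the block-cyclic loop nest, here is a small sketch that runs it for one process and records which iterations are visited. It assumes lb = 0 and a layer stride of blksz * P iterations (one block per processor per layer). The function name block_cyclic_visit and the hits array are made up for this example.

```c
#include <assert.h>

int min_int(int a, int b) { return a < b ? a : b; }

/* Runs the block-cyclic loop nest for process mypid. Each executed iteration i
   increments hits[i] (if hits is not NULL). Returns the number of iterations
   executed by this process. */
int block_cyclic_visit(int N, int P, int blksz, int mypid, int *hits) {
    assert(P > 0 && blksz > 0);
    int stride = blksz * P;                  /* iterations per layer */
    int nLayers = (N + stride - 1) / stride; /* integer form of ceil(N / stride) */
    int count = 0;
    for (int layer = 0; layer < nLayers; layer++) {
        int beginBlk = layer * stride;
        for (int i = beginBlk + mypid * blksz;
             i < min_int(N, beginBlk + (mypid + 1) * blksz); i++) {
            if (hits)
                hits[i]++;
            count++;
        }
    }
    return count;
}
```

With N=26, P=6, and blksz=2, process 0 executes 6 iterations and processes 1-5 execute 4 each, so the imbalance is 2, which is within the bound of blksz. Calling the function for every process with a zeroed hits array should leave every entry equal to 1, meaning each iteration is executed exactly once. Setting blksz to 5 gives the block-mapping counts for this N and P, and setting it to 1 gives the cyclic-mapping counts.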
12e.23 Block vs Cyclic Block-Cyclic is in between Block and Cyclic in terms of granularity, communication, and load balancing. Block and Cyclic are special cases of Block-Cyclic:
– Block = Block-Cyclic with blksz = ceiling(N/P)
– Cyclic = Block-Cyclic with blksz = 1