Slide 1: Tuesday, September 19, 2006

"The practical scientist is trying to solve tomorrow's problem on yesterday's computer. Computer scientists often have it the other way around." - Numerical Recipes, C Edition
Slide 2: Reference Material

- Lectures 1 & 2
  - "Parallel Computer Architecture" by David Culler et al., Chapter 1.
  - "Sourcebook of Parallel Computing" by Jack Dongarra et al., Chapters 1 and 2.
  - "Introduction to Parallel Computing" by Grama et al., Chapter 1 and Chapter 2 §2.4.
  - www.top500.org
- Lecture 3
  - "Introduction to Parallel Computing" by Grama et al., Chapter 2 §2.3.
  - Introduction to Parallel Computing, Lawrence Livermore National Laboratory, http://www.llnl.gov/computing/tutorials/parallel_comp/
- Lectures 4 & 5
  - "Techniques for Optimizing Applications" by Garg et al., Chapter 9.
  - "Software Optimizations for High Performance Computing" by Wadleigh et al., Chapter 5.
  - "Introduction to Parallel Computing" by Grama et al., Chapter 2 §2.1-2.2.
Slide 3: Software Optimizations

- Optimize serial code before parallelizing it.
Slide 4: Loop Unrolling

Original loop:

  do i = 1, n
    A(i) = B(i)
  enddo

Unrolled by 4 (assumes n is divisible by 4):

  do i = 1, n, 4
    A(i)   = B(i)
    A(i+1) = B(i+1)
    A(i+2) = B(i+2)
    A(i+3) = B(i+3)
  enddo

- Reduces the overhead of index increments and conditional checks.
- Exposes independent operations, so pipelining can hide instruction latencies.
- Some compilers allow users to specify the unrolling depth.
- Avoid excessive unrolling: register pressure and spills can hurt performance.
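The slide assumes n is divisible by 4. In C one would normally add a cleanup loop for the leftover iterations; a minimal sketch (the function name `copy_unrolled` is illustrative, not from the slides):

```c
#include <assert.h>

/* Copy src into dst, manually unrolled by 4, with a cleanup loop
 * for the remaining 0-3 elements when n is not a multiple of 4. */
void copy_unrolled(double *dst, const double *src, int n) {
    int i;
    int nend = n & ~3;              /* largest multiple of 4 <= n */
    for (i = 0; i < nend; i += 4) { /* unrolled main body */
        dst[i]     = src[i];
        dst[i + 1] = src[i + 1];
        dst[i + 2] = src[i + 2];
        dst[i + 3] = src[i + 3];
    }
    for (; i < n; i++)              /* remainder loop */
        dst[i] = src[i];
}
```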
Slide 5: Loop Unrolling

  do j = 1, N
    do i = 1, N
      Z[i,j] = Z[i,j] + X[i]*Y[j]
    enddo
  enddo

Unroll the outer loop by 2.
Slide 6: Loop Unrolling

Original:

  do j = 1, N
    do i = 1, N
      Z[i,j] = Z[i,j] + X[i]*Y[j]
    enddo
  enddo

Outer loop unrolled by 2 (assumes N is even):

  do j = 1, N, 2
    do i = 1, N
      Z[i,j]   = Z[i,j]   + X[i]*Y[j]
      Z[i,j+1] = Z[i,j+1] + X[i]*Y[j+1]
    enddo
  enddo
Slide 7: Loop Unrolling

  do j = 1, N, 2
    do i = 1, N
      Z[i,j]   = Z[i,j]   + X[i]*Y[j]
      Z[i,j+1] = Z[i,j+1] + X[i]*Y[j+1]
    enddo
  enddo

- The number of load operations can be reduced, e.g. half as many loads of X.
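A C sketch of the same outer-loop unrolling (the function name `rank1_unrolled` and the row-major `Z[i][j]` layout are assumptions, not from the slides; N is assumed even as above):

```c
/* Rank-1 update Z += X * Y^T with the j (outer) loop unrolled by 2.
 * Each X[i] is loaded into a register once and reused for two columns,
 * halving the loads of X relative to the original loop nest. */
void rank1_unrolled(int N, double Z[N][N],
                    const double X[N], const double Y[N]) {
    for (int j = 0; j < N; j += 2) {
        for (int i = 0; i < N; i++) {
            double x = X[i];            /* one load serves two columns */
            Z[i][j]     += x * Y[j];
            Z[i][j + 1] += x * Y[j + 1];
        }
    }
}
```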
Slide 8: Loop Fusion

- Beneficial in loop-intensive programs.
- Decreases index-calculation overhead.
- Can also help instruction-level parallelism.
- Beneficial when the same data structures are used in different loops.
Slide 9: Loop Fusion

  for (i = 0; i < n; i++)
      temp[i] = x[i]*y[i];
  for (i = 0; i < n; i++)
      z[i] = w[i] + temp[i];
Slide 10: Loop Fusion

Before:

  for (i = 0; i < n; i++)
      temp[i] = x[i]*y[i];
  for (i = 0; i < n; i++)
      z[i] = w[i] + temp[i];

After fusion:

  for (i = 0; i < n; i++)
      z[i] = x[i]*y[i] + w[i];

- Check for register pressure before fusing.
Slide 11: Loop Fission

- Conditional statements can hurt pipelining.
- Split the loop into two: one with the conditional statements and one without.
- The compiler can then apply optimizations such as unrolling to the condition-free loop.
- Also beneficial for fat loops that may lead to register spills.
Slide 12: Loop Fission

  for (i = 0; i < nodes; i++) {
      a[i] = a[i]*small;
      dtime = a[i] + b[i];
      dtime = fabs(dtime*ratinpmt);
      temp1[i] = dtime*relaxn;
      if (temp1[i] > hgreat) {
          temp1[i] = 1;
      }
  }
Slide 13: Loop Fission

Before:

  for (i = 0; i < nodes; i++) {
      a[i] = a[i]*small;
      dtime = a[i] + b[i];
      dtime = fabs(dtime*ratinpmt);
      temp1[i] = dtime*relaxn;
      if (temp1[i] > hgreat) {
          temp1[i] = 1;
      }
  }

After fission:

  for (i = 0; i < nodes; i++) {
      a[i] = a[i]*small;
      dtime = a[i] + b[i];
      dtime = fabs(dtime*ratinpmt);
      temp1[i] = dtime*relaxn;
  }
  for (i = 0; i < nodes; i++) {
      if (temp1[i] > hgreat) {
          temp1[i] = 1;
      }
  }
Slide 14: Reductions

  for (i = 0; i < n; i++) {
      sum += x[i];
  }

- Normally a single register would be used for the reduction variable.
- How can we hide the floating-point instruction latency?
Slide 15: Reductions

Before:

  for (i = 0; i < n; i++) {
      sum += x[i];
  }

With four independent partial sums (hides the floating-point add latency):

  sum1 = sum2 = sum3 = sum4 = 0.0;
  nend = (n>>2)<<2;          /* largest multiple of 4 <= n */
  for (i = 0; i < nend; i += 4) {
      sum1 += x[i];
      sum2 += x[i+1];
      sum3 += x[i+2];
      sum4 += x[i+3];
  }
  sumx = sum1 + sum2 + sum3 + sum4;
  for (i = nend; i < n; i++)
      sumx += x[i];
Slide 16: a**0.5 vs sqrt(a)
Slide 17: a**0.5 vs sqrt(a)

- Appropriate include files can help in generating faster code, e.g. math.h.
Slide 18

- The time to access memory has not kept pace with CPU clock speeds.
- A program's performance can be suboptimal because the data for an operation is not delivered from memory to registers by the time the processor is ready to use it.
- The resulting waste of CPU cycles is called CPU starvation.
Slide 20

- The ability of the memory system to feed data to the processor is characterized by:
  - Memory latency
  - Memory bandwidth
Slide 21: Effect of Memory Latency

- 1 GHz processor (1 ns clock)
  - Capable of executing 4 instructions in each 1 ns cycle
- DRAM with 100 ns latency
- Cache block size: 1 word
- Peak processor rating?
Slide 22: Effect of Memory Latency

- 1 GHz processor (1 ns clock)
  - Capable of executing 4 instructions in each 1 ns cycle
- DRAM with 100 ns latency (no caches)
- Memory block: 1 word
- Peak processor rating: 4 GFLOPS
Slide 23: Effect of Memory Latency

- 1 GHz processor (1 ns clock)
  - Capable of executing 4 instructions in each 1 ns cycle
- DRAM with 100 ns latency (no caches)
- Memory block: 1 word
- Peak processor rating: 4 GFLOPS
- Dot product of two vectors
- Peak speed of computation?
Slide 24: Effect of Memory Latency

- 1 GHz processor (1 ns clock)
  - Capable of executing 4 instructions in each 1 ns cycle
- DRAM with 100 ns latency (no caches)
- Memory block: 1 word
- Peak processor rating: 4 GFLOPS
- Dot product of two vectors: peak speed of computation?
- Every operand must come from DRAM, so the processor completes one floating-point operation every 100 ns, i.e. a speed of 10 MFLOPS.
Slide 25: Effect of Memory Latency: Introduce Cache

- 1 GHz processor (1 ns clock)
  - Capable of executing 4 instructions in each 1 ns cycle
- DRAM with 100 ns latency
- Memory block: 1 word
- Cache: 32 KB with 1 ns latency
- Multiply two 32x32-word matrices A and B, with the result in C. (Note: the previous example had no data reuse.)
- Assume ideal cache placement and enough capacity to hold A, B, and C.
Slide 26: Effect of Memory Latency: Introduce Cache

- Multiply two 32x32-word matrices A and B, with the result in C
- 32x32 = 1K words per matrix
- Total operations and total time taken?
Slide 27: Effect of Memory Latency: Introduce Cache

- Multiply two 32x32-word matrices A and B, with the result in C
- 32x32 = 1K words per matrix
- Total operations and total time taken?
- Fetching the two input matrices requires 2K words.
- Multiplying two n x n matrices requires 2n^3 operations.
Slide 28: Effect of Memory Latency: Introduce Cache

- Multiply two 32x32-word matrices A and B, with the result in C
- 32x32 = 1K words per matrix
- Fetching the two input matrices: 2K words x 100 ns = 200 µs
- Multiplying the matrices requires 2n^3 = 2 x 32^3 = 64K operations
- At 4 operations per cycle we need 64K/4 cycles = 16 µs
- Total time = 200 µs + 16 µs = 216 µs
- Computation rate: 64K operations / 216 µs ≈ 303 MFLOPS
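The slide's arithmetic can be checked with a short computation (assuming K = 1024 throughout, and keeping the slide's rounding of the 204.8 µs fetch time down to 200 µs; the function name `matmul_mflops` is illustrative):

```c
/* Reproduce the slide's rate estimate: 2n^3 = 64K operations,
 * 2K words fetched at 100 ns each, peak of 4 operations per 1 ns cycle. */
double matmul_mflops(void) {
    const long ops = 2L * 32 * 32 * 32;      /* 2n^3 = 65536 operations     */
    double fetch_us   = 200.0;               /* 2K words * 100 ns, rounded  */
    double compute_us = ops / 4.0 / 1000.0;  /* 16.384 us at 4 ops per ns   */
    return ops / (fetch_us + compute_us);    /* ops per us == MFLOPS        */
}
```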
Slide 29: Effect of Memory Bandwidth

- 1 GHz processor (1 ns clock)
  - Capable of executing 4 instructions in each 1 ns cycle
- DRAM with 100 ns latency
- Memory block: 4 words
- Cache: 32 KB with 1 ns latency
- Dot product example again
- Bandwidth is increased 4-fold: each 100 ns access now delivers 4 consecutive words, so eight floating-point operations complete for every pair of accesses (one to each vector), i.e. 8 FLOPs per 200 ns, or 40 MFLOPS.
Slide 30

- Reduce cache misses by exploiting:
  - Spatial locality
  - Temporal locality
Slide 31: Impact of Strided Access

  for (i = 0; i < 1000; i++) {
      column_sum[i] = 0.0;
      for (j = 0; j < 1000; j++)
          column_sum[i] += b[j][i];
  }

The inner loop strides through b with a step of 1000 words, so every access touches a different cache line.
Slide 32: Eliminating Strided Access

  for (i = 0; i < 1000; i++)
      column_sum[i] = 0.0;
  for (j = 0; j < 1000; j++)
      for (i = 0; i < 1000; i++)
          column_sum[i] += b[j][i];

With the loops interchanged, the inner loop sweeps row b[j] contiguously. Assumption: the vector column_sum is retained in the cache.
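A self-contained version of the interchanged loops (small size for illustration; at N = 4 the slide's assumption that column_sum stays cached holds trivially):

```c
#define N 4   /* illustrative size; the slide uses 1000 */

/* Column sums with the loops interchanged: the inner i loop walks
 * row b[j] with unit stride, giving cache-friendly access, while
 * the small column_sum vector is reused across all j iterations. */
void column_sums(double b[N][N], double column_sum[N]) {
    for (int i = 0; i < N; i++)
        column_sum[i] = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            column_sum[i] += b[j][i];
}
```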
Slide 33

  do i = 1, N
    do j = 1, N
      A[i] = A[i] + B[j]
    enddo
  enddo

- N is large, so B[j] cannot remain in cache until it is used again in the next iteration of the outer loop: there is little reuse between touches.
- How many cache misses for A and B?
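The slide leaves the miss count as an exercise. One standard remedy for the lack of reuse it points out (not shown on the slide) is to block the j loop, so that a cache-sized chunk of B is reused across all i iterations while it is still resident; a sketch, with the name `sum_blocked` and the tile size as assumptions:

```c
#define BLOCK 256   /* illustrative tile size; tune to cache capacity */

/* A[i] += sum over B[j], with the j loop blocked: each BLOCK-sized
 * chunk of B is swept by every i iteration while it is cache-hot.
 * The j < n guard handles n that is not a multiple of BLOCK. */
void sum_blocked(double *A, const double *B, int n) {
    for (int jj = 0; jj < n; jj += BLOCK)
        for (int i = 0; i < n; i++)
            for (int j = jj; j < jj + BLOCK && j < n; j++)
                A[i] += B[j];
}
```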