Performance Optimization Getting your programs to run faster CS 691.


1 Performance Optimization Getting your programs to run faster CS 691

2 Why optimize
  Better turn-around on jobs
  Run more programs/scenarios
  Release resources to other applications
  You want the job to finish before you retire

3 Ways to get more performance
  Run on bigger, faster hardware (clock speed, more memory, …)
  Tweak your algorithm
  Optimize your code

4 Loop Unrolling
  Converting passes of a loop into in-line streams of code
  Useful when loops do calculations on data in arrays
  Unrolling can take advantage of pipelined processing units in processors
  Compiler may preload operands into CPU registers

5 Loop Unrolling – disadvantages
  May be limited by the number of floating-point registers:
    Pentium III: 8
    Pentium 4: 8
    Itanium: 128

6 Loop Unrolling – simple example
Loop:
    do i=1,n
      a(i) = b(i) + x*c(i)
    enddo
Unrolled loop (as written, assumes n is a multiple of 4):
    do i=1,n,4
      a(i)   = b(i)   + x*c(i)
      a(i+1) = b(i+1) + x*c(i+1)
      a(i+2) = b(i+2) + x*c(i+2)
      a(i+3) = b(i+3) + x*c(i+3)
    enddo

7 Loop Unrolling – simple example
  Performance, rolled:
    P3 550 MHz: 13 Mflops
    Itanium: 30 Mflops
  Performance, unrolled:
    P3 550 MHz: 30 Mflops
    Itanium: 107 Mflops
  *from: LCI and NCSA

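8 Loop Unrolling
Original loop:
    int a[100];
    int i;
    for (i = 0; i < 100; i++) {
        a[i] = a[i] * 2;
    }
Unrolled by 5 (100 is a multiple of 5, so no cleanup loop is needed):
    int a[100];
    int i;
    for (i = 0; i < 100; i += 5) {
        a[i]     = a[i]     * 2;
        a[i + 1] = a[i + 1] * 2;
        a[i + 2] = a[i + 2] * 2;
        a[i + 3] = a[i + 3] * 2;
        a[i + 4] = a[i + 4] * 2;
    }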

9 Loop unrolling
Original nested loop:
    int a[10][10];
    int i, j;
    for (i = 0; i < 10; i++) {
        for (j = 0; j < 10; j++) {
            a[i][j] = a[i][j] * 2;
        }
    }
Inner loop fully unrolled:
    int a[10][10];
    int i;
    for (i = 0; i < 10; i++) {
        a[i][0] = a[i][0] * 2;
        a[i][1] = a[i][1] * 2;
        a[i][2] = a[i][2] * 2;
        a[i][3] = a[i][3] * 2;
        a[i][4] = a[i][4] * 2;
        a[i][5] = a[i][5] * 2;
        a[i][6] = a[i][6] * 2;
        a[i][7] = a[i][7] * 2;
        a[i][8] = a[i][8] * 2;
        a[i][9] = a[i][9] * 2;
    }

10 Loop unrolling – Dot Product
Original loop:
    float a[100];
    float b[100];
    float z = 0.0f;   /* the accumulator must start at zero */
    int i;
    for (i = 0; i < 100; i++) {
        z = z + a[i] * b[i];
    }
Unrolled by 2:
    float a[100];
    float b[100];
    float z = 0.0f;
    int i;
    for (i = 0; i < 100; i += 2) {
        z = z + a[i] * b[i];
        z = z + a[i + 1] * b[i + 1];
    }

11 Unrolling Loops You can do it automatically

12 Unrolling Loops – compiler options
  GNU Compilers
    -funroll-loops
    -funroll-all-loops (not recommended)
  PGI Compilers
    -Munroll
    -Munroll=c:N
    -Munroll=n:M

13 Unrolling Loops – Compiler Options
  Intel Compilers
    -unrollM (up to M times)
    -unroll

14 Taking Memory in Order
  Optimizing the use of cache
  Row major order vs. column major order
    row major: a(1,1), a(1,2), a(1,3), a(2,1), a(2,2), …
    column major: a(1,1), a(2,1), a(3,1), a(1,2), a(2,2), …

15 Taking Memory in Order
  Remember C and Fortran store arrays in the opposite manner
    C: row major
    Fortran: column major

16 Taking Memory in Order
  [Figure: panels labeled C and Fortran]

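17 Taking Memory in Order
i in the outer loop, j in the inner loop:
    do i=1,m
      do j=1,n
        a(i,j)=b(i,j)+c(i)
      end do
    end do
  loop time: 23.42
  loop runs at 4.48 Mflops
j in the outer loop, i in the inner loop (matches Fortran's column-major storage):
    do j=1,n
      do i=1,m
        a(i,j)=b(i,j)+c(i)
      end do
    end do
  loop time: 2.80
  loop runs at 37.48 Mflops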

18 Floating Point Division
  FP division is very expensive in terms of processor time
    20-60 clock cycles to compute
    Usually not pipelined
  FP division required by IEEE “rules”

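19 Floating point division – use reciprocal
Division inside the loop:
    float a[100];
    int i;
    for (i = 0; i < 100; i++) {
        a[i] = a[i] / 2;
    }
Reciprocal computed once, multiply inside the loop:
    float a[100];
    float denom;
    int i;
    denom = 1.0f / 2.0f;   /* 1/2 would be integer division and give 0 */
    for (i = 0; i < 100; i++) {
        a[i] = a[i] * denom;
    }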

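20 Floating Point Division – compiler options for IEEE compatibility
  PGI compilers: -Knoieee
  Intel compilers: -mp
  GNU compilers: can’t do (per the original slide; current GCC releases document -freciprocal-math, which -ffast-math also enables)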

21 Compilers can’t optimize if the divisor is not a scalar
  Replacing division with a reciprocal multiply breaks IEEE “rules”
  May impact portability

22 Function Inlining
  Build functions/subroutines in as inline parts of the program’s code…
  … rather than as separate functions/subroutines
  Minimizes function calls (and management of…)

23 Function Inlining
  Compile with:
    -Minline                compiler tries to inline what it can (functions that meet the compiler's criteria)
    -Minline=except:func    excludes func from inlining
    -Minline=func           inlines only func

24 Function Inlining
  …Compile with:
    -Minline=myfile.lib     inlines functions from an inline library file
    -Minline=levels:n       inlines functions up to n levels of calls (default is usually 1)

25 MPI Tuning
  Minimize messages
    Pointers/counts
    MPI derived datatypes
    MPI_Pack/MPI_Unpack
  Using shared memory for message passing
    #PBS -l nodes=6:ppn=1
    … but…
    #PBS -l nodes=3:ppn=2
    … is better.

26 Compiler optimizations
  -O0        no optimization
  -O1        local optimization, register allocation
  -O2        local/limited global optimization
  -O3        aggressive global optimization
  -Munroll   loop unrolling
  -Mvect     vectorization
  -Minline   function inlining

27 gcc Compiler Optimizations
  See: http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
