Published by Charleen Williamson. Modified over 9 years ago.
1
Performance Optimization
Getting your programs to run faster
CS 691
2
Why optimize
  Better turn-around on jobs
  Run more programs/scenarios
  Release resources to other applications
  You want the job to finish before you retire
3
Ways to get more performance
  Run on bigger, faster hardware
    clock speed, more memory, …
  Tweak your algorithm
  Optimize your code
4
Loop Unrolling
  Converting passes of a loop into in-line streams of code
  Useful when loops do calculations on data in arrays
  Unrolling can take advantage of pipeline processing units in processors
  Compiler may preload operands into CPU registers
5
Loop Unrolling – disadvantages
  May be limited by the number of floating-point registers
    Pentium III: 8
    Pentium 4: 8
    Itanium: 128
6
Loop Unrolling – simple example
Loop
  do i=1,n
    a(i) = b(i) + x*c(i)
  enddo
Unrolled loop (assumes n is a multiple of 4)
  do i=1,n,4
    a(i)   = b(i)   + x*c(i)
    a(i+1) = b(i+1) + x*c(i+1)
    a(i+2) = b(i+2) + x*c(i+2)
    a(i+3) = b(i+3) + x*c(i+3)
  enddo
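The unrolled Fortran loop above only works when n is a multiple of 4. A minimal C sketch of the same computation with a cleanup loop for the leftover elements might look like the following. The function name axpy_unrolled4 and its parameter list are assumptions made for illustration; they do not come from the slides.

```c
#include <stddef.h>

/* Hypothetical helper: computes a[i] = b[i] + x * c[i] for i in [0, n),
   with the main loop unrolled by a factor of 4. */
void axpy_unrolled4(float *a, const float *b, const float *c,
                    float x, size_t n)
{
    size_t i = 0;
    size_t limit = n - (n % 4);  /* largest multiple of 4 not exceeding n */

    /* Main unrolled body: four independent updates per pass. */
    for (; i < limit; i += 4) {
        a[i]     = b[i]     + x * c[i];
        a[i + 1] = b[i + 1] + x * c[i + 1];
        a[i + 2] = b[i + 2] + x * c[i + 2];
        a[i + 3] = b[i + 3] + x * c[i + 3];
    }

    /* Cleanup loop: handles the 0 to 3 elements left over. */
    for (; i < n; i++) {
        a[i] = b[i] + x * c[i];
    }
}
```

Compilers that unroll automatically typically generate a similar cleanup step themselves, so hand unrolling is mainly worth trying when measurements show the compiler did not do it.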
7
Loop Unrolling – simple example
Performance – rolled
  P3 550 MHz – 13 Mflops
  Itanium – 30 Mflops
Performance – unrolled
  P3 550 MHz – 30 Mflops
  Itanium – 107 Mflops
*from: LCI and NCSA
8
Loop Unrolling
Rolled
  int a[100];
  for (i=0; i<100; i++) {
      a[i] = a[i] * 2;
  }
Unrolled by 5
  int a[100];
  for (i=0; i<100; i+=5) {
      a[i]   = a[i]   * 2;
      a[i+1] = a[i+1] * 2;
      a[i+2] = a[i+2] * 2;
      a[i+3] = a[i+3] * 2;
      a[i+4] = a[i+4] * 2;
  }
9
Loop unrolling
Rolled
  int a[10][10];
  for (i=0; i<10; i++) {
      for (j=0; j<10; j++) {
          a[i][j] = a[i][j] * 2;
      }
  }
Inner loop fully unrolled
  int a[10][10];
  for (i=0; i<10; i++) {
      a[i][0] = a[i][0] * 2;
      a[i][1] = a[i][1] * 2;
      a[i][2] = a[i][2] * 2;
      a[i][3] = a[i][3] * 2;
      a[i][4] = a[i][4] * 2;
      a[i][5] = a[i][5] * 2;
      a[i][6] = a[i][6] * 2;
      a[i][7] = a[i][7] * 2;
      a[i][8] = a[i][8] * 2;
      a[i][9] = a[i][9] * 2;
  }
10
Loop unrolling – Dot Product
Rolled
  float a[100];
  float b[100];
  float z = 0;   /* the accumulator must start at zero */
  for (i=0; i<100; i++) {
      z = z + a[i] * b[i];
  }
Unrolled by 2
  float a[100];
  float b[100];
  float z = 0;
  for (i=0; i<100; i+=2) {
      z = z + a[i] * b[i];
      z = z + a[i+1] * b[i+1];
  }
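In the unrolled dot product above, both statements still update the same variable z, so each addition has to wait for the previous one. A common variation, sketched here under the assumption that a small change in rounding is acceptable, uses one accumulator per unrolled statement so the additions are independent. The name dot_unrolled2 is hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper: dot product unrolled by 2 with two accumulators.
   The summation order differs from the simple loop, so the result can
   differ from it in the last bits. */
float dot_unrolled2(const float *a, const float *b, size_t n)
{
    float z0 = 0.0f;  /* partial sum over even indices */
    float z1 = 0.0f;  /* partial sum over odd indices */
    size_t i = 0;

    /* Two independent multiply-adds per pass. */
    for (; i + 1 < n; i += 2) {
        z0 += a[i]     * b[i];
        z1 += a[i + 1] * b[i + 1];
    }

    /* Leftover element when n is odd. */
    if (i < n) {
        z0 += a[i] * b[i];
    }

    return z0 + z1;
}
```

Because floating-point addition is not associative, a compiler will generally not make this change on its own unless it has been told that relaxed floating-point rules are acceptable.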
11
Unrolling Loops
  The compiler can do it for you automatically
12
Unrolling Loops – compiler options
GNU Compilers
  -funroll-loops
  -funroll-all-loops (not recommended)
PGI Compilers
  -Munroll
  -Munroll=c:N
  -Munroll=n:M
13
Unrolling Loops – Compiler Options
Intel Compilers
  -unrollM (up to M times)
  -unroll
14
Taking Memory in Order
  Optimizing the use of cache
  Row major order vs column major order
    row major (last index varies fastest): a(1,1), a(1,2), a(1,3), a(2,1), a(2,2), …
    column major (first index varies fastest): a(1,1), a(2,1), a(3,1), a(1,2), a(2,2), …
15
Taking Memory in Order
  Remember: C and Fortran store arrays in opposite orders
    C – row major
    Fortran – column major
16
Taking Memory in Order
[Figure: array storage order in C vs. Fortran]
17
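Taking Memory in Order
Inner loop over j (jumps through memory in a column-major Fortran array)
  do i=1,m
    do j=1,n
      a(i,j)=b(i,j)+c(i)
    end do
  end do
  loop time: 23.42
  loop runs at 4.48 Mflops
Inner loop over i (walks memory in order)
  do j=1,n
    do i=1,m
      a(i,j)=b(i,j)+c(i)
    end do
  end do
  loop time: 2.80
  loop runs at 37.48 Mflops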
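Since C stores arrays in row-major order, the fast loop order in C is the reverse of the Fortran one: the inner loop should run over the last index. The sketch below shows both orders on a flat array where element (i, j) lives at index i*n + j. The function names and the flat layout are assumptions made for illustration, and no timings are claimed for them.

```c
#include <stddef.h>

/* Hypothetical helper: inner loop over j touches consecutive addresses,
   which is the cache-friendly order for a row-major C array. */
void add_row_order(float *a, const float *b, const float *c,
                   size_t m, size_t n)
{
    for (size_t i = 0; i < m; i++) {
        for (size_t j = 0; j < n; j++) {
            a[i * n + j] = b[i * n + j] + c[i];
        }
    }
}

/* Hypothetical helper: same result, but the inner loop over i jumps
   n elements between accesses, so it uses the cache poorly when the
   array is large. */
void add_column_order(float *a, const float *b, const float *c,
                      size_t m, size_t n)
{
    for (size_t j = 0; j < n; j++) {
        for (size_t i = 0; i < m; i++) {
            a[i * n + j] = b[i * n + j] + c[i];
        }
    }
}
```

Both functions compute identical values; only the order in which memory is visited differs, so any speed difference has to be measured on the target machine.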
18
Floating Point Division
  FP division is very expensive in terms of processor time
    20-60 clock cycles to compute
    Usually not pipelined
  True FP division is required by IEEE “rules” (the compiler may not silently substitute a cheaper operation)
19
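Floating point division – use reciprocal
Division inside the loop
  float a[100];
  for (i=0; i<100; i++) {
      a[i] = a[i] / 2;
  }
Multiply by the reciprocal
  float a[100];
  float denom;
  denom = 1.0f / 2;   /* 1/2 is integer division and would give 0 */
  for (i=0; i<100; i++) {
      a[i] = a[i] * denom;
  }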
20
Floating Point Division – compiler options for IEEE compatibility
  PGI Compilers: -Knoieee
  Intel Compilers: -mp
  GNU Compilers: can’t do
21
Compilers can’t optimize the division if the divisor is not a scalar
The optimization breaks IEEE “rules”
May impact portability
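To see why replacing a division with a reciprocal multiply is not always allowed, the two results can be compared element by element. The sketch below counts how many elements differ; the function name count_reciprocal_mismatches is hypothetical, and the code assumes it is compiled without relaxed floating-point options.

```c
#include <stddef.h>

/* Hypothetical helper: returns how many elements of x give a different
   float result for x[i] / d than for x[i] * (1 / d). */
size_t count_reciprocal_mismatches(const float *x, size_t n, float d)
{
    float recip = 1.0f / d;  /* computed once, outside the loop */
    size_t mismatches = 0;

    for (size_t i = 0; i < n; i++) {
        float by_division   = x[i] / d;
        float by_reciprocal = x[i] * recip;
        if (by_division != by_reciprocal) {
            mismatches++;
        }
    }
    return mismatches;
}
```

For a power-of-two divisor such as 2 the reciprocal is exact and the count is zero for ordinary values. For a divisor such as 3 the reciprocal is rounded, so some inputs can differ in the last bit, which is the kind of change the IEEE “rules” do not permit by default.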
22
Function Inlining
  Build functions/subroutines in as inline parts of the program's code…
  … rather than as separate functions/subroutines
  Minimizes function calls (and management of…)
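As a small illustration of a good inlining candidate, consider a helper whose body is cheaper than the call itself. This is a sketch with hypothetical names (square, sum_of_squares); in C, `static inline` is only a hint, and the compiler decides whether to inline.

```c
#include <stddef.h>

/* Tiny helper: the work is a single multiply, so the cost of a real
   function call would be comparable to the work itself. */
static inline float square(float v)
{
    return v * v;
}

/* Calls square() once per element. If the compiler inlines the call,
   the loop body becomes a plain multiply and add with no call at all. */
float sum_of_squares(const float *x, size_t n)
{
    float total = 0.0f;
    for (size_t i = 0; i < n; i++) {
        total += square(x[i]);
    }
    return total;
}
```

Inlining pays off most for small functions called inside hot loops; for large functions it mainly increases code size.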
23
Function Inlining
Compile with:
  -Minline
    compiler tries to inline what it can (functions that meet the compiler's criteria)
  -Minline=except:func
    excludes func from inlining
  -Minline=func
    inlines only func
24
Function Inlining
…Compile with:
  -Minline=myfile.lib
    inlines functions from the inline library file
  -Minline=levels:n
    inlines functions up to n levels of calls
    default is usually 1
25
MPI Tuning
  Minimize messages
  Pointers/counts
  MPI derived datatypes
  MPI_Pack/MPI_Unpack
  Using shared memory for message passing
    #PBS -l nodes=6:ppn=1
    … but…
    #PBS -l nodes=3:ppn=2
    … is better.
26
Compiler optimizations
  -O0 – no optimization
  -O1 – local optimization, register allocation
  -O2 – local/limited global optimization
  -O3 – aggressive global optimization
  -Munroll – loop unrolling
  -Mvect – vectorization
  -Minline – function inlining
27
gcc Compiler Optimizations
See: http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html