Published by Brice Simmons. Modified over 9 years ago.
1
University of Amsterdam
Computer Systems – optimizing program performance
Arnoud Visser

Computer Systems
Optimizing program performance
2
Performance can make the difference

Recommendations by Patrick van der Smagt (1991) for neural net implementations:
– Use pointers instead of array indices
– Use doubles instead of floats
– Optimize inner loops
3
Performance gain

A factor of 10 can easily be gained.
We now know how programs are executed:
– Load/use hazards (20% of load instr. → 1 bubble)
– Mispredicted branches (40% of jmp instr. → 2 bubbles)
– Return from procedure calls (100% of ret instr. → 3 bubbles)
→ Directions for optimizing procedures and loops!
Gain has to be measured.
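Because gain has to be measured, here is a minimal timing sketch. It assumes the CPU time reported by the standard clock() call is precise enough for the workload. The names time_call, busy_loop and sink are hypothetical and do not come from the slides.

```c
#include <time.h>

/* Hypothetical helper: CPU seconds spent in a single call of fn(arg). */
double time_call(void (*fn)(int), int arg)
{
    clock_t start = clock();
    fn(arg);
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}

/* Hypothetical workload: sums 0..n-1. The volatile sink keeps the
 * compiler from discarding the loop as dead code. */
static volatile long sink;

void busy_loop(int n)
{
    long s = 0;
    for (int i = 0; i < n; i++)
        s += i;
    sink = s;
}
```

A call such as time_call(busy_loop, 100000000) returns the elapsed CPU seconds. For short runs, repeating the call and taking the minimum usually gives a steadier number.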
4
Amdahl's Law

If a part of the system initially consumes a fraction α of the execution time, and that part is sped up by a factor k, the overall speedup S is much smaller than k:

    S = 1 / ((1 − α) + α/k)

When we speed up a part of a program, the effect on the overall performance is limited by the significance of that part.
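A small worked example of Amdahl's law, S = 1 / ((1 − α) + α/k), as a sketch. The function name amdahl_speedup and the sample values are illustrative assumptions, not taken from the slides.

```c
/* Overall speedup S when a fraction alpha of the original run time
 * is sped up by a factor k (Amdahl's law). */
double amdahl_speedup(double alpha, double k)
{
    return 1.0 / ((1.0 - alpha) + alpha / k);
}
```

With alpha = 0.6 and k = 3 the overall speedup is 1 / (0.4 + 0.2) ≈ 1.67. Even if that part became infinitely fast, S could not exceed 1 / 0.4 = 2.5.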
5
Recipe for optimizing

– Use a profiler to find the most-used procedure
– Optimize the inner loop of that procedure

    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            a[n*i + j] = b[j];
6
Optimizing Compilers

Provide efficient mapping to machine
– register allocation
– code selection and ordering
– eliminating minor inefficiencies
Have difficulty with "optimization blockers"
– potential memory aliasing
– potential procedure side-effects
7
Manual solution

Code movement: compute the loop-invariant product n*i once per outer iteration.

Before:

    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            a[n*i + j] = b[j];

After:

    for (i = 0; i < n; i++) {
        int ni = n*i;
        for (j = 0; j < n; j++)
            a[ni + j] = b[j];
    }

Most compilers do a good job with array code + simple loop structures
8
Compiler's solution

– As long as no optimization blockers are present, compilers can't be beaten

Source:

    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            a[n*i + j] = b[j];

Generated code:

        imull %ebx,%eax            # i*n
        movl 8(%ebp),%edi          # a
        leal (%edi,%eax,4),%edx    # p = a+i*n (scaled by 4)
    # Inner Loop
    .L40:
        movl 12(%ebp),%edi         # b
        movl (%edi,%ecx,4),%eax    # b+j (scaled by 4)
        movl %eax,(%edx)           # *p = b[j]
        addl $4,%edx               # p++ (scaled by 4)
        incl %ecx                  # j++
        jl .L40                    # loop if j<n

Equivalent C:

    for (i = 0; i < n; i++) {
        int ni = n*i;
        int *p = a+ni;
        for (j = 0; j < n; j++)
            *p++ = b[j];
    }
9
Memory Aliasing

    void twiddle1 (int *xp, int *yp)
    {
        *xp += *yp;
        *xp += *yp;
    }

    void twiddle2 (int *xp, int *yp)
    {
        *xp += 2* *yp;
    }

Twiddle (&xp, &xp): both arguments point to the same location
– Twiddle1: 4x xp
– Twiddle2: 3x xp
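The aliasing effect can be checked with a short sketch. twiddle1 and twiddle2 are the functions from the slide. The helper run_aliased is a hypothetical addition that calls either one with both pointers referring to the same int.

```c
void twiddle1(int *xp, int *yp)
{
    *xp += *yp;
    *xp += *yp;
}

void twiddle2(int *xp, int *yp)
{
    *xp += 2 * *yp;
}

/* Hypothetical helper: call tw with xp and yp aliased to the same
 * variable and return the value left in it. */
int run_aliased(void (*tw)(int *, int *), int x)
{
    tw(&x, &x);
    return x;
}
```

Starting from x = 5, run_aliased(twiddle1, 5) doubles x twice (5 → 10 → 20), while run_aliased(twiddle2, 5) adds 2*5 once (5 → 15). With distinct arguments the two functions agree, so a compiler may not rewrite twiddle1 into twiddle2 unless it can prove the pointers never alias.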
10
Side effects

    int func1 (int x)
    {
        return f(x)+f(x)+f(x)+f(x);
    }

    int func2 (int x)
    {
        return 4* f(x);
    }

f(x) {return counter++;} → Func (0)
– Func1 = 0+1+2+3 = 6
– Func2 = 4*0 = 0
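A runnable sketch of the side-effect example. It assumes counter is a file-scope int starting at 0 and that f ignores its argument; the slide abbreviates both details.

```c
/* Assumed global state modified by f; not spelled out on the slide. */
static int counter = 0;

int f(int x)
{
    (void)x;            /* argument is unused in this example */
    return counter++;   /* side effect: every call changes counter */
}

int func1(int x)
{
    /* Four calls: returns 0+1+2+3 when counter starts at 0. The C
     * evaluation order of the operands is unspecified, but the sum
     * is the same in any order. */
    return f(x) + f(x) + f(x) + f(x);
}

int func2(int x)
{
    /* One call: returns 4*0 when counter starts at 0. */
    return 4 * f(x);
}
```

func1 leaves counter at 4 and func2 leaves it at 1, so the two functions differ in both result and effect on global state. That is why a compiler cannot replace four calls with one unless it can see that f is free of side effects.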
11
Limitations for Compilers

Operate under a fundamental constraint
– Must not cause any change in program behavior under any possible condition
– This often prevents optimizations that would only affect behavior under pathological conditions
Behavior that may be obvious to the programmer can be obfuscated by languages and coding styles
– e.g., data ranges may be more limited than variable types suggest
Most analysis is performed only within procedures
– whole-program analysis is too expensive in most cases
Most analysis is based only on static information
– compiler has difficulty anticipating run-time inputs
When in doubt, the compiler must be conservative
12
Machine-independent versus Machine-dependent optimizations

– Optimizations you should do regardless of processor / compiler
    Code motion (out of the loop)
    Reducing procedure calls
    Avoiding unneeded memory usage
    Sharing common sub-expressions
– Machine-dependent optimizations
    Pointer code
    Unrolling
    Enabling instruction-level parallelism
13
Optimization Example

    void combine1(vec_ptr v, data_t *dest)
    {
        int i;
        *dest = IDENT;
        for (i = 0; i < vec_length(v); i++) {
            int val;
            get_vec_element(v, i, &val);
            *dest = *dest OPER val;
        }
    }

Procedure
– Compute aggregate OPER of all elements of vector
– Store result at destination location
Integer addition, clock cycles per element (CPE):
– 42.06 (compiled -g)
– 31.25 (compiled -O2)
14
Move Call Out of Loop

    void combine2(vec_ptr v, data_t *dest)
    {
        int i;
        int length = vec_length(v);
        *dest = IDENT;
        for (i = 0; i < length; i++) {
            int val;
            get_vec_element(v, i, &val);
            *dest = *dest OPER val;
        }
    }

    int vec_length(vec_ptr v)
    {
        return v->len;
    }

Optimization
– Move call to vec_length out of inner loop
    Value does not change from one iteration to next
    Function calls are expensive
– CPE: 20.66 (compiled -O2)
    vec_length() requires 10 clock cycles
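The same transformation applies to any call whose result does not change inside the loop. A familiar case outside this deck is strlen in a loop condition. The sketch below is an analogous illustration; lower1 and lower2 are hypothetical names, not functions from the slides.

```c
#include <string.h>

/* strlen(s) is re-evaluated before every iteration, so the loop
 * does work proportional to the square of the string length. */
void lower1(char *s)
{
    for (size_t i = 0; i < strlen(s); i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] += 'a' - 'A';
}

/* The length is computed once before the loop: same result,
 * linear work. */
void lower2(char *s)
{
    size_t len = strlen(s);
    for (size_t i = 0; i < len; i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] += 'a' - 'A';
}
```

Because the loop body writes to s, a compiler generally cannot prove that strlen(s) stays constant, so the hoisting is left to the programmer.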
15
Bypass data-abstraction

    void combine3(vec_ptr v, data_t *dest)
    {
        int i;
        int length = vec_length(v);
        int *data = get_vec_start(v);
        *dest = IDENT;
        for (i = 0; i < length; i++)
            *dest = *dest OPER data[i];
    }

    /* Bounds-checked element access, as called from combine1/combine2 */
    int get_vec_element(vec_ptr v, int index, int *dest)
    {
        if (index < 0 || index >= v->len)
            return 0;
        *dest = v->data[index];
        return 1;
    }

Optimization
– Avoid procedure call to retrieve each vector element
    Get pointer to start of array before loop
    Within loop just do pointer reference
    Not as clean in terms of data abstraction
– CPE: 6.00 (compiled -O2)
    get_vec_element() requires 14 clock cycles
    Bounds checking is expensive
16
Eliminate Unneeded Memory Refs

    void combine4(vec_ptr v, int *dest)
    {
        int i;
        int length = vec_length(v);
        int *data = get_vec_start(v);
        int sum = IDENT;
        for (i = 0; i < length; i++)
            sum = sum OPER data[i];
        *dest = sum;
    }

Optimization
– Don't need to store in destination until end
– Local variable sum held in register
– Avoids 1 memory read, 1 memory write per iteration
– CPE: 2.00 (compiled -O2)
    Memory references are expensive!
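To experiment with these steps, here is a self-contained sketch of the vector abstraction with the first and last versions side by side. The struct layout follows the v->len and v->data accesses shown on the slides. The constructor new_vec, the choice of int elements, and the settings IDENT 0 and OPER + are assumptions made for illustration.

```c
#include <stdlib.h>

#define IDENT 0   /* assumed: identity element for integer addition */
#define OPER  +   /* assumed: combine by addition */

typedef struct {
    int len;
    int *data;
} vec_rec, *vec_ptr;

/* Hypothetical constructor: zero-filled vector of len elements. */
vec_ptr new_vec(int len)
{
    vec_ptr v = malloc(sizeof(vec_rec));
    if (!v)
        return NULL;
    v->len = len;
    v->data = calloc(len > 0 ? (size_t)len : 1, sizeof(int));
    if (!v->data) {
        free(v);
        return NULL;
    }
    return v;
}

int vec_length(vec_ptr v)     { return v->len; }
int *get_vec_start(vec_ptr v) { return v->data; }

/* Bounds-checked access: returns 0 if index is out of range. */
int get_vec_element(vec_ptr v, int index, int *dest)
{
    if (index < 0 || index >= v->len)
        return 0;
    *dest = v->data[index];
    return 1;
}

/* Baseline: two calls and a read/write of *dest per element. */
void combine1(vec_ptr v, int *dest)
{
    *dest = IDENT;
    for (int i = 0; i < vec_length(v); i++) {
        int val;
        get_vec_element(v, i, &val);
        *dest = *dest OPER val;
    }
}

/* Calls hoisted, direct array access, accumulator in a local. */
void combine4(vec_ptr v, int *dest)
{
    int length = vec_length(v);
    int *data = get_vec_start(v);
    int sum = IDENT;
    for (int i = 0; i < length; i++)
        sum = sum OPER data[i];
    *dest = sum;
}
```

When dest does not point into the vector, both versions produce the same result. The CPE figures on the slides come from the author's measurements and will differ on other machines and compilers.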
17
Why didn't the compiler do that?

Different behavior due to memory aliasing
– Combine (v, get_vec_start(v)+2) with OPER *
– Combine3: [2,3,5] → [2,3,1] → [2,3,2] → [2,3,6] → [2,3,36]
– Combine4: [2,3,5] → [2,3,5] → [2,3,5] → [2,3,5] → [2,3,30]
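The two traces can be reproduced with a simplified sketch that works on a plain int array instead of vec_ptr and fixes OPER to * with IDENT 1. The names combine3_mul and combine4_mul are hypothetical stand-ins for combine3 and combine4.

```c
/* Like combine3: accumulates directly in *dest, so every partial
 * product is visible to the later reads of data[i] if dest aliases
 * an element of data. */
void combine3_mul(int *data, int length, int *dest)
{
    *dest = 1;
    for (int i = 0; i < length; i++)
        *dest = *dest * data[i];
}

/* Like combine4: accumulates in a local and writes *dest once,
 * so the array is unchanged while it is being read. */
void combine4_mul(int *data, int length, int *dest)
{
    int prod = 1;
    for (int i = 0; i < length; i++)
        prod = prod * data[i];
    *dest = prod;
}
```

With data = {2, 3, 5} and dest = data + 2, combine3_mul first overwrites the last element with 1, then computes 1*2 = 2, 2*3 = 6 and finally 6*6 = 36, because the last element it reads is its own partial result. combine4_mul reads the untouched 5 and stores 2*3*5 = 30.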
18
Machine Independent

Code Motion
– Reduce the frequency with which a computation is performed
    If it will always produce the same result
    Especially moving expensive code out of a loop
19
Conclusion

How should I write my programs, given that I have a good, optimizing compiler?
Don't: Smash Code into Oblivion
– Hard to read, maintain, & assure correctness
Do:
– Select best algorithm & data representation
– Write code that's readable & maintainable
    Procedures, recursion, without built-in constant limits
    Even though these factors can slow down code
Focus on Inner Loops
– Detailed optimization means detailed measurement
20
Assignment

Practice Problems
– Practice Problem 5.1: 'What is the effect of the call swap(&xp, &xp)?'
– Practice Problem 5.3: 'Indicate the number of function calls in 3 fragments'