Download presentation
Presentation is loading. Please wait.
Published byEthan Higgins Modified over 8 years ago
1
University of Amsterdam Computer Systems – optimizing program performance Arnoud Visser 1 Computer Systems Optimizing program performance
2
University of Amsterdam Computer Systems – optimizing program performance Arnoud Visser 2 Performance can make the difference Recommendations Patrick van der Smagt in 1991 for neural net implementations Use Pointers instead of array indices Use doubles instead of floats Optimize inner loops
3
University of Amsterdam Computer Systems – optimizing program performance Arnoud Visser 3 –Optimizations you should do regardless of processor / compiler Code Motion (out of the loop) Reducing procedure calls Unneeded Memory usage Share Common sub-expressions –Machine-Dependent Optimizations Pointer code Unrolling Enabling instruction level parallelism Machine-independent versus Machine-dependent optimizations
4
University of Amsterdam Computer Systems – optimizing program performance Arnoud Visser 4 Machine dependent One has to known today’s architectures Superscalar (Pentium) (often two instructions/cycle) Dynamic execution (P6) (three instructions out-of-order/cycle) Explicit parallelism (Itanium) (six execution units)
5
University of Amsterdam Computer Systems – optimizing program performance Arnoud Visser 5 Pentium III Design Execution Functional Units Instruction Control Integer/ Branch FP Add FP Mult/Div LoadStore Instruction Cache Data Cache Fetch Control Instruction Decode Address Instrs. Operations Prediction OK? Data Addr. General Integer Operation Results Retirement Unit Register File Register Updates
6
University of Amsterdam Computer Systems – optimizing program performance Arnoud Visser 6 Functional Units of Pentium III Multiple Instructions Can Execute in Parallel –1 load –1 store –2 integer (one may be branch) –1 FP Addition –1 FP Multiplication or Division
7
University of Amsterdam Computer Systems – optimizing program performance Arnoud Visser 7 Performance of Pentium III operations Many instructions can be Pipelined to 1 cycle InstructionLatencyCycles/Issue –Load / Store3 1 –Integer Multiply4 1 –Integer Divide36 36 –Double/Single FP Multiply5 2 –Double/Single FP Add3 1 –Double/Single FP Divide38 38
8
University of Amsterdam Computer Systems – optimizing program performance Arnoud Visser 8 Instruction Control Grabs Instructions From Memory –Based on current PC + predicted targets for predicted branches –Hardware dynamically guesses (possibly) branch target Translates Instructions Into Operations –Primitive steps required to perform instruction –Typical instruction requires 1–3 operations Converts Register References Into Tags –Abstract identifier linking destination of one operation with sources of later operations Instruction Control Instruction Cache Fetch Control Instruction Decode Address Instrs. Operations Retirement Unit Register File
9
University of Amsterdam Computer Systems – optimizing program performance Arnoud Visser 9 Translation Example Version of Combine4 –Integer data, multiply operation Translation of First Iteration.L24:# Loop: imull (%eax,%edx,4),%ecx# t *= data[i] incl %edx# i++ cmpl %esi,%edx# i:length jl.L24# if < goto Loop.L24: imull (%eax,%edx,4),%ecx incl %edx cmpl %esi,%edx jl.L24 load (%eax,%edx.0,4) t.1 imull t.1, %ecx.0 %ecx.1 incl %edx.0 %edx.1 cmpl %esi, %edx.1 cc.1 jl-taken cc.1
10
University of Amsterdam Computer Systems – optimizing program performance Arnoud Visser 10 Visualizing Operations Operations –Vertical position denotes time at which executed Cannot begin operation until operands available –Height denotes latency load (%eax,%edx,4) t.1 imull t.1, %ecx.0 %ecx.1 incl %edx.0 %edx.1 cmpl %esi, %edx.1 cc.1 jl-taken cc.1 cc.1 t.1 load %ecx.1 incl cmpl jl %edx.0 %edx.1 %ecx.0 imull Time
11
University of Amsterdam Computer Systems – optimizing program performance Arnoud Visser 11 4 Iterations of Combining Sum Resource Analysis Performance –Unlimited resources should give CPE of 1.0 –Would require executing 4 integer operations in parallel 4 integer ops
12
University of Amsterdam Computer Systems – optimizing program performance Arnoud Visser 12 Pentium Resource Constraints –Only two integer functional units –Set priority based on program order Performance –Sustain CPE of 2.0
13
University of Amsterdam Computer Systems – optimizing program performance Arnoud Visser 13 Loop Unrolling Optimization –Combine multiple iterations into single loop body –Amortizes loop overhead across multiple iterations –Finish extras at end void combine5(vec_ptr v, int *dest) { int length = vec_length(v); int limit = length-2; int *data = get_vec_start(v); int sum = 0; int i; /* Combine 3 elements at a time */ for (i = 0; i < limit; i+=3) { sum += data[i] + data[i+2] + data[i+1]; } /* Finish any remaining elements */ for (; i < length; i++) { sum += data[i]; } *dest = sum; } –Measured CPE=1.33
14
University of Amsterdam Computer Systems – optimizing program performance Arnoud Visser 14 Resource distribution with Loop Unrolling Predicted Performance –Can complete iteration in 3 cycles –Should give CPE of 1.0
15
University of Amsterdam Computer Systems – optimizing program performance Arnoud Visser 15 Effect of Unrolling Only helps integer sum for our examples –Other cases constrained by functional unit latencies Effect is nonlinear with degree of unrolling Many subtle effects determine exact scheduling of operations Unrolling Degree 1234816 Intege r Sum2.001.501.331.501.251.06 Intege r Product4.00 FPSum3.00 FPProduct5.00
16
University of Amsterdam Computer Systems – optimizing program performance Arnoud Visser 16 Unrolling is for long vectors New source of overhead –The need to finish the remaining elements when the vector length is not divisible by the degree of unrolling Unrolling Degree1234816 IntSum∞2.001.501.331.501.251.06 IntSum10242.061.561.401.561.311.12 IntSum314.023.573.393.843.913.66
17
University of Amsterdam Computer Systems – optimizing program performance Arnoud Visser 17 3 Iterations of Combining Product Unlimited Resource Analysis –Assume operation can start as soon as operands available Performance –Limiting factor becomes latency –Gives CPE of 4.0
18
University of Amsterdam Computer Systems – optimizing program performance Arnoud Visser 18 Iteration splitting Optimization –Make operands available by accumulating in two different products (x0,x1) –Combine at end Performance –CPE = 2.0 void combine6(vec_ptr v, int *dest) { int length = vec_length(v); int limit = length-1; int *data = get_vec_start(v); int x0 = 1; int x1 = 1; int i; /* Combine 2 elements at a time */ for (i = 0; i < limit; i+=2) { x0 *= data[i]; x1 *= data[i+1]; } /* Finish any remaining elements */ for (; i < length; i++) { x0 *= data[i]; } *dest = x0 * x1; }
19
University of Amsterdam Computer Systems – optimizing program performance Arnoud Visser 19 –Predicted Performance Make use of both execution units Gives CPE of 2.0 Resource distribution with Iteration Splitting
20
University of Amsterdam Computer Systems – optimizing program performance Arnoud Visser 20 Results for Pentium III –Biggest gain doing basic optimizations –But, last little bit helps
21
University of Amsterdam Computer Systems – optimizing program performance Arnoud Visser 21 Results for Pentium 4 –Higher latencies (int * = 14, fp + = 5.0, fp * = 7.0) Clock runs at 2.0 GHz Not an improvement over 1.0 GHz P3 for integer * –Avoids FP multiplication anomaly
22
University of Amsterdam Computer Systems – optimizing program performance Arnoud Visser 22 Loop Unrolling –Some compilers do this automatically –Generally not as clever as what can achieve by hand Exposing Instruction-Level Parallelism –Very machine dependent Warning: –Benefits depend heavily on particular machine –Do only for performance-critical parts of code –Best if performed by compiler But GCC on IA32/Linux is not very good Machine-Dependent Opt. Summary
23
University of Amsterdam Computer Systems – optimizing program performance Arnoud Visser 23 How should I write my programs, given that I have a good, optimizing compiler? Don’t: Smash Code into Oblivion –Hard to read, maintain, & assure correctness Do: –Select best algorithm & data representation –Write code that’s readable & maintainable Procedures, recursion, without built-in constant limits Even though these factors can slow down code Focus on Inner Loops –Detailed optimization means detailed measurement Conclusion
24
University of Amsterdam Computer Systems – optimizing program performance Arnoud Visser 24 Assignment Practice Problems –Practice Problem 5.8: ‘Associations of aprod.' Optimization Lab
25
University of Amsterdam Computer Systems – optimizing program performance Arnoud Visser 25 Performance of Pentium Core 2 operations Two stores and one load in a single cycle. The performance of IDIV depends on the quotient InstructionLatencyCycles/Issue –Load / Store1 0.33 –Integer Multiply3 1 –Integer Divide17-41 12-36 –Double/Single FP Multiply5 2 –Double/Single FP Add3 1 –Double/Single FP Divide32 32
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.