Presentation is loading. Please wait.

Presentation is loading. Please wait.

1 Code Optimization. 2 Outline Optimizing Blockers –Memory alias –Side effect in function call Understanding Modern Processor –Super-scalar –Out-of –order.

Similar presentations


Presentation on theme: "1 Code Optimization. 2 Outline Optimizing Blockers –Memory alias –Side effect in function call Understanding Modern Processor –Super-scalar –Out-of –order."— Presentation transcript:

1 1 Code Optimization

2 2 Outline Optimizing Blockers –Memory alias –Side effect in function call Understanding Modern Processor –Super-scalar –Out-of –order execution More Code Optimization techniques Performance Tuning Suggested reading –5.1, 5.7 ~ 5.16

3 3 5.1 Capabilities and Limitations of Optimizing Compliers Review on 5.3 Program Example 5.4 Eliminating Loop Inefficiencies 5.5 Reducing Procedure Calls 5.6 Eliminating Unneeded Memory References

4 4 void combine1(vec_ptr v, data_t *dest) { int i; *dest = IDENT; for (i = 0; i < vec_length(v); i++) { int val; get_vec_element(v, i, &val); *dest = *dest OPER val; } Example P387

5 5 void combine2(vec_ptr v, int *dest) { int i; int length = vec_length(v); *dest = IDENT; for (i = 0; i < length; i++) { int val; get_vec_element(v, i, &val); *dest = *dest OPER val; } Example P388

6 6 void combine3(vec_ptr v, int *dest) { int i; int length = vec_length(v); int *data = get_vec_start(v); *dest = IDENT; for (i = 0; i < length; i++) { *dest = *dest OPER data[i]; } Example P392

7 7 void combine4(vec_ptr v, int *dest) { int i; int length = vec_length(v); int *data = get_vec_start(v); int x = IDENT; for (i = 0; i < length; i++) x = x OPER data[i]; *dest = x; } Example P394

8 8 Machine Independent Opt. Results Optimizations –Reduce function calls and memory references within loop

9 9 Machine Independent Opt. Results Performance Anomaly –Computing FP product of all elements exceptionally slow. –Very large speedup when accumulate in temporary –Memory uses 64-bit format, register use 80 –Benchmark data caused overflow of 64 bits, but not 80 Combine4 Combine3 Combine2 Combine1 P385 P388 P392 P394

10 10 Optimization Blockers P394 void combine4(vec_ptr v, int *dest) { int i; int length = vec_length(v); int *data = get_vec_start(v); int sum = 0; for (i = 0; i < length; i++) sum += data[i]; *dest = sum; }

11 11 Optimization Blocker: Memory Aliasing P394 Aliasing –Two different memory references specify single location Example –v: [3, 2, 17] –combine3(v, get_vec_start(v)+2) -->? –combine4(v, get_vec_start(v)+2) -->?

12 12 Optimization Blocker: Memory Aliasing Observations –Easy to have happen in C Since allowed to do address arithmetic Direct access to storage structures –Get in habit of introducing local variables Accumulating within loops Your way of telling compiler not to check for aliasing

13 13 Optimizing Compilers Provide efficient mapping of program to machine –register allocation –code selection and ordering –eliminating minor inefficiencies

14 14 Optimizing Compilers Don’t (usually) improve asymptotic efficiency –up to programmer to select best overall algorithm –big-O savings are (often) more important than constant factors but constant factors also matter Have difficulty overcoming “optimization blockers” –potential memory aliasing –potential procedure side-effects

15 15 Limitations of Optimizing Compilers Operate Under Fundamental Constraint –Must not cause any change in program behavior under any possible condition –Often prevents it from making optimizations when would only affect behavior under pathological conditions.

16 16 Limitations of Optimizing Compilers Behavior that may be obvious to the programmer can be obfuscated by languages and coding styles –e.g., data ranges may be more limited than variable types suggest e.g., using an “ int ” in C for what could be an enumerated type obfuscated :混乱

17 17 Limitations of Optimizing Compilers Most analysis is performed only within procedures –whole-program analysis is too expensive in most cases Most analysis is based only on static information –compiler has difficulty anticipating run-time inputs When in doubt, the compiler must be conservative

18 18 Optimization Blockers P380 Memory aliasing void twiddle1(int *xp, int *yp) { *xp += *yp ; } void twiddle2(int *xp, int *yp) { *xp += 2* *yp ; }

19 19 Optimization Blockers P381 Function call and side effect int f(int) ; int func1(x) { return f(x)+f(x)+f(x)+f(x) ; } int func2(x) { return 4*f(x) ; }

20 20 Optimization Blockers P381 Function call and side effect int counter = 0 ; int f(int x) { return counter++ ; }

21 21 5.7 Understanding Modern Processors 5.7.1 Overall Operation

22 22 Modern CPU Design Figure 5.11 P396 Execution Functional Units Instruction Control Integer/ Branch FP Add FP Mult/Div LoadStore Instruction Cache Data Cache Fetch Control Instruction Decode Address Instructions Operations Prediction OK? Data Addr. General Integer Operation Results Retirement Unit Register File Register Updates

23 23 Retirement Unit Register File Instruction Cache Fetch Control Instruction Decode Address Instructions Integer /branch General Integer FP Add FP mult/div LoadStore Functional units operations Predication OK? Data Cache Operation results addr data addr data Register Updates 1) 2) 3) 4) 5) (1)(2)(3)(4)(5)(6) (7)

24 24 Modern Processor P396 Superscalar –Perform multiple operations on every clock cycle Out-of-order execution –The order in which the instructions execute need not correspond to their ordering in the assembly program

25 25 Modern Processor P396 Two main parts –Instruction Control Unit Responsible for reading a sequence of instructions from memory Generating from above instructions a set of primitive operations to perform on program data –Execution Unit

26 26 1) Instruction Control Unit Instruction Cache –A special, high speed memory containing the most recently accessed instructions.

27 27 1) Instruction Control Unit Instruction Decoding Logic –Take actual program instructions –Converts them into a set of primitive operations –Each primitive operation performs some simple task Simple arithmetic, Load, Store addl %eax, 4(%edx) --- three operations load 4(%edx)  t1 addl %eax, t1  t2 store t2, 4(%edx) –Register renaming P397 P398

28 28 2) Fetch Control Fetch Ahead P396 –Fetches well ahead of currently accessed instructions –ICU has enough time to decode these –ICU has enough time to send decoded operations down to the EU

29 29 Fetch Control Branch Predication P397 –Branch taken or fall through –Guess whether branch is taken or not Speculative Execution P397 –Fetch, decode and execute only according to the branch prediction –Before the branch predication has been determined

30 30 5.7 Understanding Modern Processors 5.7.2 Functional Unit Performance

31 31 Multi-functional Units Multiple Instructions Can Execute in Parallel –1 load –1 store –2 integer (one may be branch) –1 FP Addition –1 FP Multiplication or Division

32 32 Multi-functional Units Figure 5.12 P400 Some Instructions Take > 1 Cycle, but Can be Pipelined –InstructionLatency Cycles/Issue –Load / Store31 –Integer Multiply41 –Integer Divide3636 –Double/Single FP Multiply52 –Double/Single FP Add31 –Double/Single FP Divide3838

33 33 5.7 Understanding Modern Processors 5.7.1 Overall Operation

34 34 Execution Unit Receives operations from ICU Each cycle it may receive more than one operation Operations are queued in buffer

35 35 Execution Unit Operation is dispatched to one of multi- functional units, whenever –All the operands of an operation are ready –Suitable functional units are available Execution results are passed among functional units (7) Data Cache P398 –A high speed memory containing the most recently accessed data values

36 36 4) Retirement Unit P398 Instructions need to commit in serial order –Misprediction –Exception Updates Architecture status –Memory and register values

37 37 5.7.3 A Closer Look at Processor Operation Translation Instruction into Operations

38 38 Translation Example P401.L24:# Loop: imull (%eax,%edx,4),%ecx# t *= data[i] incl %edx# i++ cmpl %esi,%edx# i:length jl.L24# if < goto Loop.L24: imull (%eax,%edx,4),%ecx incl %edx cmpl %esi,%edx jl.L24 load (%eax,%edx.0,4)  t.1 imull t.1, %ecx.0  %ecx.1 incl %edx.0  %edx.1 cmpl %esi, %edx.1  cc.1 jl-taken cc.1

39 39 Understanding Translation Example P401 Split into two operations –Load reads from memory to generate temporary result t.1 –Multiply operation just operates on registers imull (%eax,%edx,4),%ecxload (%eax,%edx.0,4)  t.1 imull t.1, %ecx.0  %ecx.1

40 40 Understanding Translation Example P401 Operands –Registers %eax does not change in loop. Values will be retrieved from register file during decoding imull (%eax,%edx,4),%ecxload (%eax,%edx.0,4)  t.1 imull t.1, %ecx.0  %ecx.1

41 41 Understanding Translation Example P401 Operands –Register %ecx changes on every iteration. –Uniquely identify different versions as %ecx.0, %ecx.1, %ecx.2, … –Register renaming Values passed directly from producer to consumers imull (%eax,%edx,4),%ecxload (%eax,%edx.0,4)  t.1 imull t.1, %ecx.0  %ecx.1

42 42 Understanding Translation Example P402 Register %edx changes on each iteration Renamed as %edx.0, %edx.1, %edx.2, … incl %edxincl %edx.0  %edx.1

43 43 Understanding Translation Example P402 Condition codes are treated similar to registers Assign tag to define connection between producer and consumer cmpl %esi,%edxcmpl %esi, %edx.1  cc.1

44 44 Understanding Translation Example P402 Instruction control unit determines destination of jump Predicts whether target will be taken Starts fetching instruction at predicted destination jl.L24jl-taken cc.1

45 45 Understanding Translation Example P401 Execution unit simply checks whether or not prediction was OK If not, it signals instruction control –Instruction control then “invalidates” any operations generated from misfetched instructions –Begins fetching and decoding instructions at correct target jl.L24jl-taken cc.1

46 46 Operations –Vertical position denotes time at which executed Cannot begin operation until operands available –Height denotes latency Operands –Arcs shown only for operands that are passed within execution unit cc.1 t.1 load %ecx.1 incl cmpl jl %edx.0 %edx.1 %ecx.0 imull load (%eax,%edx.0,4)  t.1 imull t.1, %ecx.0  %ecx.1 incl %edx.0  %edx.1 cmpl %esi, %edx.1  cc.1 jl-taken cc.1 Time Visualizing Operations Figure 5.13 P403

47 47 Operations –Same as before, except that add has latency of 1 load (%eax,%edx,4)  t.1 iaddl t.1, %ecx.0  %ecx.1 incl %edx.0  %edx.1 cmpl %esi, %edx.1  cc.1 jl-taken cc.1 Time cc.1 t.1 %ecx. i +1 incl cmpl jl load %edx.0 %edx.1 %ecx.0 addl %ecx.1 load Visualizing Operations Figure 5.14 P403

48 48 Unlimited Resource Analysis –Assume operation can start as soon as operands available –Operations for multiple iterations overlap in time Performance –Limiting factor becomes latency of integer multiplier –Gives CPE of 4.0 3 Iterations of Combining Product Figure 5.15 P404

49 49 Unlimited Resource Analysis Performance –Can begin a new iteration on each clock cycle –Should give CPE of 1.0 –Would require executing 4 integer operations in parallel 4 integer ops 4 Iterations of Combining Sum Figure 5.16 P405

50 50 Combining Product: Resource Constraints Figure 5.17 P406 Figure 5.17 P406

51 51 Combining Sum: Resource Constraints Figure 5.18 P408

52 52 Combining Sum: Resource Constraints Only have two integer functional units Some operations delayed even though operands available Set priority based on program order Performance –Sustain CPE of 2.0

53 53 5.8 Reducing Loop Overhead

54 54 Loop unrolling P409 void combine5(vec_ptr v, int *dest) { int i; int length = vec_length(v); int *data = get_vec_start(v); int x = IDENT; /* combine 3 elements at a time */ for (i = 0; i < length-2; i+=3) x = x OPER data[i] OPER data[i+1] OPER data[i+2]; /* finish any remaining elements */ for (; i < length; i++) x = x OPER data[i]; *dest = x; }

55 55 –Loads can pipeline, since don’t have dependencies –Only one set of loop control operations load (%eax,%edx.0,4)  t.1a iaddl t.1a, %ecx.0c  %ecx.1a load 4(%eax,%edx.0,4)  t.1b iaddl t.1b, %ecx.1a  %ecx.1b load 8(%eax,%edx.0,4)  t.1c iaddl t.1c, %ecx.1b  %ecx.1c iaddl $3,%edx.0  %edx.1 cmpl %esi, %edx.1  cc.1 jl-taken cc.1 Visualizing Unrolled Loop P410

56 56 Time %edx.0 %edx.1 %ecx.0c cc.1 t.1a %ecx. i +1 addl cmpl jl addl %ecx.1c addl t.1b t.1c %ecx.1a %ecx.1b load Measured CPE = 1.33 Visualizing Unrolled Loop Figure 5.20 P410

57 57 Executing with Loop Unrolling Figure 5.21 P411

58 58 Executing with Loop Unrolling Predicted Performance –Can complete iteration in 3 cycles –Should give CPE of 1.0 Measured Performance –CPE of 1.33 –One iteration every 4 cycles

59 59 Unrolling Degree1234816 IntegerSum2.001.501.331.501.251.06 IntegerProduct4.00 FPSum3.00 FPProduct5.00 Effect of Unrolling P411

60 60 Effect of Unrolling Only helps integer sum for our examples –Other cases constrained by functional unit latencies Effect is nonlinear with degree of unrolling –Many subtle effects determine exact scheduling of operations

61 61 5.9 Converting to Pointer Code

62 62 void combine4p(vec_ptr v, int *dest) { int i; int length = vec_length(v); int *data = get_vec_start(v); int *dend = data + length ; int x = IDENT; for (; data < dend ; data++ ) x = x OPER *data; *dest = x; } Example P413

63 63 Some compilers and processors do better job optimizing array code FunctionIntegerFloating pointer + * Combine42.00 4.003.00 5.00 Combine4p3.00 4.003.00 5.00 Pointer Code vs. Array Code P414

64 64.L24:# Loop: addl (%eax,%edx,4),%ecx# x += data[i] incl %edx# i++ cmpl %esi,%edx# i:length jl.L24# if < goto Loop.L30:# Loop: addl (%eax),%ecx# x += *data addl $4,%eax# data ++ cmpl %edx,%eax# data:dend jb.L30# if < goto Loop Pointer vs. Array Code Inner Loops P414

65 65 Performance –Array Code: 4 instructions in 2 clock cycles –Pointer Code: Almost same 4 instructions in 3 clock cycles Pointer vs. Array Code Inner Loops

66 66 5.10 Enhancing Parallelism

67 67 void combine6(vec_ptr v, int *dest) { int i; int length = vec_length(v); int *data = get_vec_start(v); int x0 = IDENT, x1 = IDENT; /* combine 2 elements at a time */ for (i = 0; i < length; i+=2){ x0 = x0 OPER data[i]; x1 = x1 OPER data[i+1]; } /* finish any remaining elements */ for (; i < length; i++) x0 = x0 OPER data[i]; *dest = x0 OPER x1; } Loop Splitting P416

68 68 Loop Splitting Optimization –Accumulate in two different sums Can be performed simultaneously –Combine at end –Exploits property that integer addition & multiplication are associative & commutative –FP addition & multiplication not associative, but transformation usually acceptable Associative: 可结合的 Commutative: 可交换的

69 69 load (%eax,%edx.0,4)  t.1a imull t.1a, %ecx.0  %ecx.1 load 4(%eax,%edx.0,4)  t.1b imull t.1b, %ebx.0  %ebx.1 iaddl $2,%edx.0  %edx.1 cmpl %esi, %edx.1  cc.1 jl-taken cc.1 Visualizing Parallel Loop P417 Two multiplies within loop no longer have data dependency Allows them to pipeline

70 70 Time %edx.1 %ecx.0 %ebx.0 cc.1 t.1a imull %ecx.1 addl cmpl jl %edx.0 imull %ebx.1 t.1b load Visualizing Parallel Loop Figure 5.25 P417

71 71 Executing with Parallel Loop Figure 5.26 P418

72 72 Optimization Results for Combining P419

73 73 Optimization Results for Combining Register spilling – only 6 registers available –Using memory as storage Register spilling –movl -12(%ebp), %edi –imull 24(%eax), %edi –movl %edi, -12(%ebp)

74 74 5.11 Putting it Together: Summary of Results for Optimizing Combining Code 5.11.1 Floating-Point Performance Anomaly 5.11.2 Changing Platforms

75 75 5.12 Branch Prediction and Misprediction Penalties

76 76 What About Branches? Challenge –Instruction Control Unit must work well ahead of Exec. Unit –To generate enough operations to keep EU busy

77 77 80489f3:movl $0x1,%ecx 80489f8:xorl %edx,%edx 80489fa:cmpl %esi,%edx 80489fc:jnl 8048a25 80489fe:movl %esi,%esi 8048a00:imull (%eax,%edx,4),%ecx Executing Fetching & Decoding What About Branches?

78 78 What About Branches? Challenge –When encounters conditional branch, cannot reliably determine where to continue fetching

79 79 Branch Outcomes When encounter conditional branch, cannot determine where to continue fetching –Branch Taken: Transfer control to branch target –Branch Not-Taken: Continue with next instruction in sequence Cannot resolve until outcome determined by branch/integer unit

80 80 80489f3:movl $0x1,%ecx 80489f8:xorl %edx,%edx 80489fa:cmpl %esi,%edx 80489fc:jnl 8048a25 80489fe:movl %esi,%esi 8048a00:imull (%eax,%edx,4),%ecx 8048a25:cmpl %edi,%edx 8048a27:jl 8048a20 8048a29:movl 0xc(%ebp),%eax 8048a2c:leal 0xffffffe8(%ebp),%esp 8048a2f:movl %ecx,(%eax) Branch Taken Branch Not-Taken Branch Outcomes

81 81 Branch Prediction Idea –Guess which way branch will go –Begin executing instructions at predicted position But don’t actually modify register or memory data

82 82 80489f3:movl $0x1,%ecx 80489f8:xorl %edx,%edx 80489fa:cmpl %esi,%edx 80489fc:jnl 8048a25... 8048a25:cmpl %edi,%edx 8048a27:jl 8048a20 8048a29:movl 0xc(%ebp),%eax 8048a2c:leal 0xffffffe8(%ebp),%esp 8048a2f:movl %ecx,(%eax) Predict Taken Execute Branch Prediction

83 83 80488b1:movl (%ecx,%edx,4),%eax 80488b4:addl %eax,(%edi) 80488b6:incl %edx 80488b7:cmpl %esi,%edx 80488b9:jl 80488b1 80488b1:movl (%ecx,%edx,4),%eax 80488b4:addl %eax,(%edi) 80488b6:incl %edx 80488b7:cmpl %esi,%edx 80488b9:jl 80488b1 80488b1:movl (%ecx,%edx,4),%eax 80488b4:addl %eax,(%edi) 80488b6:incl %edx 80488b7:cmpl %esi,%edx 80488b9:jl 80488b1 i = 98 i = 99 i = 100 Predict Taken (OK) Predict Taken (Oops) 80488b1:movl (%ecx,%edx,4),%eax 80488b4:addl %eax,(%edi) 80488b6:incl %edx 80488b7:cmpl %esi,%edx 80488b9:jl 80488b1 i = 101 Assume vector length = 100 Read invalid location Executed Fetched Branch Prediction Through Loop

84 84 80488b1:movl (%ecx,%edx,4),%eax 80488b4:addl %eax,(%edi) 80488b6:incl %edx 80488b7:cmpl %esi,%edx 80488b9:jl 80488b1 80488b1:movl (%ecx,%edx,4),%eax 80488b4:addl %eax,(%edi) 80488b6:incl %edx 80488b7:cmpl %esi,%edx 80488b9:jl 80488b1 80488b1:movl (%ecx,%edx,4),%eax 80488b4:addl %eax,(%edi) 80488b6:incl %edx 80488b7:cmpl %esi,%edx 80488b9:jl 80488b1 i = 98 i = 99 i = 100 Predict Taken (OK) Predict Taken (Oops) 80488b1:movl (%ecx,%edx,4),%eax 80488b4:addl %eax,(%edi) 80488b6:incl %edx i = 101 Invalidate Assume vector length = 100 Branch Misprediction Invalidation

85 85 80488b1:movl (%ecx,%edx,4),%eax 80488b4:addl %eax,(%edi) 80488b6:incl %edx 80488b7:cmpl %esi,%edx 80488b9:jl 80488b1 80488b1:movl (%ecx,%edx,4),%eax 80488b4:addl %eax,(%edi) 80488b6:incl %edx 80488b7:cmpl %esi,%edx 80488b9:jl 80488b1 80488bb:leal 0xffffffe8(%ebp),%esp 80488be:popl %ebx 80488bf:popl %esi 80488c0:popl %edi i = 98 i = 99 Predict Taken (OK) Definitely not taken Assume vector length = 100 Branch Misprediction Recovery

86 86 Branch Misprediction Recovery P427 Performance Cost –Misprediction on Pentium III wastes ~14 clock cycles –That’s a lot of time on a high performance processor

87 87 Misprediction penalty is about 14 cycles in PIII machine Conditional mov is used to avoid the misprediction penalty when the branch outcome is not predictable For example: int absval(int val) { return (val <0)? –val : val } Conditional Jump Figure 5.29 P427

88 88 Conditional Jump P428 movl 8(%ebp), %eaxGet val as result movl %eax, %edxCopy to %edx negl %edxNegate %edx testl %eax, %eaxTest Val cmov1 %edx, %eaxif <0 copy %edx to result

89 89 5.13 Understanding Memory Performance

90 90 typedef struct ELE { struct ELE *next ; int data ; } list_ele, *list_ptr ; int list_len(list_ptr ls) { int len = 0 ; for (;ls;ls=ls->next) len++ ; return len ; } Assembly Instructions.L27: incl %eax movl(%edx), %edx testl %edx, %edx jne.L27 Execution unit operations incl %eax.0  %eax.1 load (%edx.0)  %edx.1 testl %edx.1, %edx.1  cc.1 jne-taken cc.1 Load Latency P429, P430 Figure 5.30 P430

91 91 incl testl jne %eax.0 %edx.0 incl testl jne load incl testl jne load %eax.1 %eax.2 %eax.3 %edx.1 %edx.2 %edx.3 cc.1 cc.2 cc.3 i=1 i=2 i=3 1 2 3 4 5 6 7 8 9 10 11 Figure 5.31 P430

92 92 Store Latency Figure 5.32 P431 void array_clear(int *dest, int n) { int i; for ( i = 0 ; i < n ; i++) dest[i] = 0 ; } CPE 2.0

93 93 Store Latency Figure 5.32 P431 void array_clear(int *dest, int n) { int i; int len = n-7 ; for ( i = 0 ; i < len ; i++) { dest[i] = dest[i+1] = dest[i+2] = dest[i+3] = 0 ; dest[i+4] = dest[i+5] = dest[i+6] = dest[i+7] = 0 ; } for ( ; i < n ; i++) dest[i] = 0 ; } CPE 1.25

94 94 Store latency Figure 5.33 P432 void write_read(int *src, int *dest, int n) { int cnt = n; int val = 0; while (cnt--) { *dest = val; val = (*src)+1; }

95 95 Store latency Figure 5.33 P432 write_read(&a[0], &a[1], 3) initialiter. 1iter. 2iter. 3 cnt3210 a(-10, 17)(-10, 0)(-10, -9)(-10, -9) val0-9-9-9 write_read(&a[0], &a[0], 3) initialiter. 1iter. 2iter. 3 cnt3210 a(-10, 17)(0, 17)(1, 17)(2, 17) val0123

96 96 Store latency void write_read(int *src, int *dest, int n) { int cnt = n; int val = 0; while (cnt--) { *dest = val; val = (*src)+1; }

97 97 Store latency P434.L32: movl %edx, (%ecx) movl (%ebx), %edx incl %edx decl %eax jnc.L32 storeaddr (%ecx) storedata %edx.0 load (%ebx)  %edx.1a incl %edx.1a  %edx.1b decl %eax.0  %eax.1 jnc-taken cc.1

98 98 %eax.2 1 2 3 4 5 6 7 decl store data store addr load incl jncdecl store data store addr load incl jnc  %edx.1a %edx.1b cc.1 %eax.1  %edx.2a %edx.2b %eax.0 %edx.0 Store latency Figure 5.35 P434

99 99 1 2 3 4 5 6 7 decl store data store addr load incl jncdecl Store data store addr incl jnc = %edx.1a %edx.1b cc.1 %eax.1 = %edx.2b %eax.2 %eax.0 %edx.0 load %edx.2a Figure 5.36 P435

100 100 5.14 Life in the Real World: Performance Improvement Techniques

101 101 Basic Strategies for Optimizing Program Performance High-level design Basic coding principles Eliminate excessive function calls Eliminate unnecessary memory references Low-level optimizations Try various forms of pointer versus array code. Reduce loop overhead by unrolling loops. Find ways to make use of the pipelined functional units by techniques such as iteration splitting

102 102 5.15 Identifying and Eliminating Performance Bottlenecks

103 103 Performance Tuning Identify –Which is the hottest part of the program –Using a very useful method profiling Instrument the program Run it with typical input data Collect information from the result Analysis the result –gprof example $gcc –O2 –pg prog.c –o prog $prog file.text (generate new file gmon.out) $gprof prog (with gmon.out)

104 104 Example Task –Count word frequencies in text document –Sort the words in descending order of occurence Steps –Convert strings to lower case –Apply hash function –Read words and insert into hash table Mostly list operations Maintain counter for each unique word –Sort results

105 105 Examples unix> gcc –O2 –pg prog.c –o prog unix>./prog file.txt unix> gprof prog % cumulative self self total time seconds seconds calls ms/call ms/call name 86.60 8.21 8.21 1 8210.00 8210.00 sort_words 5.80 8.76 0.55 946596 0.00 0.00 lower1 4.75 9.21 0.45 946596 0.00 0.00 find_ele_rec 1.27 9.33 0.12 946596 0.00 0.00 h_add

106 106 Branch Misprediction Recovery Performance Cost –Misprediction on Pentium III wastes ~14 clock cycles –That’s a lot of time on a high performance processor

107 107 4872758 find_ele_rec [5] 0.600.01946596/946596 insert_string [4] [5]6.70.600.01946596+4872758 find_ele_rec [5] 0.000.0126946/26946 save_string [9] 0.000.0026946/26946 new_ele [11] 4872758 find_ele_rec [5] Example P439

108 108 Principle Interval counting –Maintain a counter for each function Record the time spent executing this function –Interrupted at regular time (1ms) Check which function is executing when interrupt occurs Increment the counter for this function

109 109 Data Set P439 Collected works of Shakespeare 946,596 total words, 26,596 unique Initial implementation: 9.2 seconds

110 110 Code Optimizations –First step: Use more efficient sorting function –Library function qsort Figure 5.37 P441 1)2)3)4)5)6)7)

111 111 Further Optimizations 2)3)4)5)6)7)1)

112 112 Example 3) Iter first: Use iterative function to insert elements in linked list –Causes code to slow down 4) Iter last: Iterative function, places new entry at end of list –Tend to place most common words at front of list 5) Big table: Increase number of hash buckets 6) Better hash: Use more sophisticated hash function 7) Linear lower: Move strlen out of loop

113 113 Code Motion Example#2 void lower(char *s) { int i; for (i = 0; i < strlen(s); i++) if (s[i] >= 'A' && s[i] <= 'Z') s[i] -= ('A' - 'a'); } void lower(char *s) { int i = 0; if (i >= strlen(s)) goto done; loop: if (s[i] >= 'A' && s[i] <= 'Z') s[i] -= ('A' - 'a'); i++; if (i < strlen(s)) goto loop; done: }

114 114 Lower Case Conversion Performance –Time quadruples when double string length –Quadratic performance

115 115 Time quadruples when double string length Quadratic performance Lower Case Conversion Performance

116 116 Move call to strlen outside of loop Since result does not change from one iteration to another Form of code motion void lower(char *s) { int i; int len = strlen(s); for (i = 0; i < len; i++) if (s[i] >= 'A' && s[i] <= 'Z') s[i] -= ('A' - 'a'); } Improving Performance

117 117 Lower Case Conversion Performance –Time doubles when double string length –Linear performance

118 118 Benefits –Helps identify performance bottlenecks –Especially useful when have complex system with many components Limitations –Only shows performance for data tested –E.g., linear lower did not show big gain, since words are short Quadratic inefficiency could remain lurking in code –Timing mechanism fairly crude Only works for programs that run for > 3 seconds Performance Tuning

119 119 T new = (1-  )T old + (  T old )/k = T old [(1-  ) +  /k] S = T old / T new = 1/[(1-  ) +  /k] S  = 1/(1-  ) Amdahl’s Law P443


Download ppt "1 Code Optimization. 2 Outline Optimizing Blockers –Memory alias –Side effect in function call Understanding Modern Processor –Super-scalar –Out-of –order."

Similar presentations


Ads by Google