1 Code Optimization
2 Outline Optimization Blockers –Memory aliasing –Side effects in function calls Understanding Modern Processors –Superscalar –Out-of-order execution More Code Optimization Techniques Performance Tuning Suggested reading –5.1, 5.7 ~ 5.16
3 5.1 Capabilities and Limitations of Optimizing Compilers Review of 5.3 Program Example 5.4 Eliminating Loop Inefficiencies 5.5 Reducing Procedure Calls 5.6 Eliminating Unneeded Memory References
4 void combine1(vec_ptr v, data_t *dest) { int i; *dest = IDENT; for (i = 0; i < vec_length(v); i++) { int val; get_vec_element(v, i, &val); *dest = *dest OPER val; } } Example P387
5 void combine2(vec_ptr v, int *dest) { int i; int length = vec_length(v); *dest = IDENT; for (i = 0; i < length; i++) { int val; get_vec_element(v, i, &val); *dest = *dest OPER val; } } Example P388
6 void combine3(vec_ptr v, int *dest) { int i; int length = vec_length(v); int *data = get_vec_start(v); *dest = IDENT; for (i = 0; i < length; i++) { *dest = *dest OPER data[i]; } } Example P392
7 void combine4(vec_ptr v, int *dest) { int i; int length = vec_length(v); int *data = get_vec_start(v); int x = IDENT; for (i = 0; i < length; i++) x = x OPER data[i]; *dest = x; } Example P394
8 Machine Independent Opt. Results Optimizations –Reduce function calls and memory references within loop
9 Machine Independent Opt. Results Performance Anomaly –Computing FP product of all elements is exceptionally slow –Very large speedup when accumulating in a temporary –Memory uses 64-bit format, registers use 80-bit format –Benchmark data caused overflow of 64 bits, but not 80 [Table rows: combine1 to combine4; see P385, P388, P392, P394]
10 Optimization Blockers P394 void combine4(vec_ptr v, int *dest) { int i; int length = vec_length(v); int *data = get_vec_start(v); int sum = 0; for (i = 0; i < length; i++) sum += data[i]; *dest = sum; }
11 Optimization Blocker: Memory Aliasing P394 Aliasing –Two different memory references specify single location Example –v: [3, 2, 17] –combine3(v, get_vec_start(v)+2) -->? –combine4(v, get_vec_start(v)+2) -->?
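The two calls in the example can be traced with a minimal sketch. It assumes OPER is `*` and IDENT is `1`, and it replaces the `vec_ptr` abstraction with a plain `int` array plus a length. The names `combine3_prod` and `combine4_prod` are hypothetical stand-ins, not the original functions.

```c
/* Sketch only: plain arrays stand in for vec_ptr; OPER is *, IDENT is 1. */

/* Like combine3: accumulates directly in *dest, so every iteration
   reads and writes memory that may overlap data[]. */
void combine3_prod(int *data, int length, int *dest)
{
    int i;
    *dest = 1;
    for (i = 0; i < length; i++)
        *dest = *dest * data[i];
}

/* Like combine4: accumulates in a local variable and stores to
   *dest only once, after the loop. */
void combine4_prod(int *data, int length, int *dest)
{
    int i;
    int x = 1;
    for (i = 0; i < length; i++)
        x = x * data[i];
    *dest = x;
}
```

With v = [3, 2, 17] and dest pointing at the last element, the first version leaves 36 there (the element is overwritten before it is read: 1, 3, 6, then 6 * 6), while the second leaves 102 (3 * 2 * 17). Under these assumptions the two functions are not interchangeable, which is why a compiler cannot turn combine3 into combine4 on its own.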
12 Optimization Blocker: Memory Aliasing Observations –Easy to have happen in C Since allowed to do address arithmetic Direct access to storage structures –Get in habit of introducing local variables Accumulating within loops Your way of telling compiler not to check for aliasing
13 Optimizing Compilers Provide efficient mapping of program to machine –register allocation –code selection and ordering –eliminating minor inefficiencies
14 Optimizing Compilers Don’t (usually) improve asymptotic efficiency –up to programmer to select best overall algorithm –big-O savings are (often) more important than constant factors but constant factors also matter Have difficulty overcoming “optimization blockers” –potential memory aliasing –potential procedure side-effects
15 Limitations of Optimizing Compilers Operate Under Fundamental Constraint –Must not cause any change in program behavior under any possible condition –This often prevents optimizations that would only affect behavior under pathological conditions
16 Limitations of Optimizing Compilers Behavior that may be obvious to the programmer can be obfuscated by languages and coding styles –e.g., data ranges may be more limited than variable types suggest e.g., using an “ int ” in C for what could be an enumerated type
17 Limitations of Optimizing Compilers Most analysis is performed only within procedures –whole-program analysis is too expensive in most cases Most analysis is based only on static information –compiler has difficulty anticipating run-time inputs When in doubt, the compiler must be conservative
18 Optimization Blockers P380 Memory aliasing void twiddle1(int *xp, int *yp) { *xp += *yp; *xp += *yp; } void twiddle2(int *xp, int *yp) { *xp += 2 * *yp; }
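A minimal sketch of why a compiler cannot rewrite twiddle1 as twiddle2. It assumes twiddle1 performs the addition twice, which is the form for which twiddle2 is the candidate replacement. The two agree when xp and yp point to different ints and disagree when they alias.

```c
/* Adds *yp to *xp twice: 6 memory references (2 reads of *yp,
   2 reads of *xp, 2 writes of *xp). */
void twiddle1(int *xp, int *yp)
{
    *xp += *yp;
    *xp += *yp;
}

/* Adds 2 * *yp to *xp once: 3 memory references. */
void twiddle2(int *xp, int *yp)
{
    *xp += 2 * *yp;
}
```

Starting from x = 5 and calling each function with xp == yp == &x, twiddle1 doubles x twice (giving 20) while twiddle2 triples it (giving 15). With distinct variables x = 5 and y = 3, both give 11.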
19 Optimization Blockers P381 Function call and side effect int f(int); int func1(int x) { return f(x)+f(x)+f(x)+f(x); } int func2(int x) { return 4*f(x); }
20 Optimization Blockers P381 Function call and side effect int counter = 0 ; int f(int x) { return counter++ ; }
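Putting the two slides together gives a small sketch of the side-effect problem. The definitions below follow the slides; only the `(void)x` line is added here to make the unused argument explicit.

```c
int counter = 0;

/* f has a side effect: every call changes the global counter. */
int f(int x)
{
    (void)x;            /* argument is unused, as on the slide */
    return counter++;
}

/* Calls f four times: the calls return 0, 1, 2, 3 in some order. */
int func1(int x)
{
    return f(x) + f(x) + f(x) + f(x);
}

/* Calls f once and scales the result. */
int func2(int x)
{
    return 4 * f(x);
}
```

Starting from counter == 0, func1 returns 0 + 1 + 2 + 3 = 6 and leaves counter at 4, while func2 returns 4 * 0 = 0 and leaves counter at 1. Because the compiler generally cannot prove that f is free of side effects, it must keep all four calls in func1.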
Understanding Modern Processors
22 Modern CPU Design Figure 5.11 P396 [Block diagram. Instruction Control: Instruction Cache, Fetch Control, Instruction Decode, Retirement Unit, Register File. Execution: functional units (Integer/Branch, General Integer, FP Add, FP Mult/Div, Load, Store) and Data Cache. Instruction Control sends operations to Execution; Execution returns operation results, register updates, and the "Prediction OK?" signal.]
23 [The same block diagram, with its components numbered (1) to (7) for the slides that follow.]
24 Modern Processor P396 Superscalar –Perform multiple operations on every clock cycle Out-of-order execution –The order in which the instructions execute need not correspond to their ordering in the assembly program
25 Modern Processor P396 Two main parts –Instruction Control Unit Responsible for reading a sequence of instructions from memory Generating from above instructions a set of primitive operations to perform on program data –Execution Unit
26 1) Instruction Control Unit Instruction Cache –A special, high speed memory containing the most recently accessed instructions.
27 1) Instruction Control Unit Instruction Decoding Logic P397 P398 –Takes actual program instructions –Converts them into a set of primitive operations –Each primitive operation performs some simple task: simple arithmetic, load, store –Example: addl %eax, 4(%edx) becomes three operations
   load 4(%edx) -> t1
   addl %eax, t1 -> t2
   store t2, 4(%edx)
–Register renaming
28 2) Fetch Control Fetch Ahead P396 –Fetches well ahead of currently accessed instructions –ICU has enough time to decode these –ICU has enough time to send decoded operations down to the EU
29 Fetch Control Branch Prediction P397 –Branch taken or fall through –Guess whether branch is taken or not Speculative Execution P397 –Fetch, decode and execute according to the branch prediction –Before the branch outcome has been determined
30 Multi-functional Units Multiple Instructions Can Execute in Parallel –1 load –1 store –2 integer (one may be branch) –1 FP Addition –1 FP Multiplication or Division
31 Multi-functional Units Figure 5.12 P400 Some Instructions Take > 1 Cycle, but Can be Pipelined
   Instruction                 Latency   Cycles/Issue
   Load / Store                3         1
   Integer Multiply            4         1
   Integer Divide              36        36
   Double/Single FP Multiply   5         2
   Double/Single FP Add        3         1
   Double/Single FP Divide     38        38
32 Execution Unit Receives operations from ICU Each cycle it may receive more than one operation Operations are queued in buffer
33 Execution Unit Operation is dispatched to one of the functional units whenever –All the operands of the operation are ready –A suitable functional unit is available Execution results are passed among functional units (7) Data Cache P398 –A high-speed memory containing the most recently accessed data values
34 4) Retirement Unit P398 Instructions must commit in serial order –Misprediction –Exception Updates architectural state –Memory and register values
35 Translation Example P401
Assembly:
   .L24:                          # Loop:
     imull (%eax,%edx,4),%ecx     # t *= data[i]
     incl %edx                    # i++
     cmpl %esi,%edx               # i:length
     jl .L24                      # if < goto Loop
Operations for the first iteration:
     load (%eax,%edx.0,4) -> t.1
     imull t.1, %ecx.0 -> %ecx.1
     incl %edx.0 -> %edx.1
     cmpl %esi, %edx.1 -> cc.1
     jl-taken cc.1
36 Understanding Translation Example P401 Split into two operations –Load reads from memory to generate temporary result t.1 –Multiply operation just operates on registers Instruction: imull (%eax,%edx,4),%ecx Operations: load (%eax,%edx.0,4) -> t.1; imull t.1, %ecx.0 -> %ecx.1
37 Understanding Translation Example P401 Operands –Register %eax does not change in loop. Values will be retrieved from register file during decoding Instruction: imull (%eax,%edx,4),%ecx Operations: load (%eax,%edx.0,4) -> t.1; imull t.1, %ecx.0 -> %ecx.1
38 Understanding Translation Example P401 Operands –Register %ecx changes on every iteration –Uniquely identify different versions as %ecx.0, %ecx.1, %ecx.2, … –Register renaming Values passed directly from producer to consumers Instruction: imull (%eax,%edx,4),%ecx Operations: load (%eax,%edx.0,4) -> t.1; imull t.1, %ecx.0 -> %ecx.1
39 Understanding Translation Example P402 Register %edx changes on each iteration Renamed as %edx.0, %edx.1, %edx.2, … Instruction: incl %edx Operation: incl %edx.0 -> %edx.1
40 Understanding Translation Example P402 Condition codes are treated similarly to registers Assign tag to define connection between producer and consumer Instruction: cmpl %esi,%edx Operation: cmpl %esi, %edx.1 -> cc.1
41 Understanding Translation Example P402 Instruction control unit determines destination of jump Predicts whether the branch will be taken Starts fetching instructions at predicted destination Instruction: jl .L24 Operation: jl-taken cc.1
42 Understanding Translation Example P401 Execution unit simply checks whether or not prediction was OK If not, it signals instruction control –Instruction control then “invalidates” any operations generated from misfetched instructions –Begins fetching and decoding instructions at correct target Instruction: jl .L24 Operation: jl-taken cc.1
43 Visualizing Operations Figure 5.13 P403 Operations –Vertical position denotes time at which executed Cannot begin operation until operands available –Height denotes latency Operands –Arcs shown only for operands that are passed within execution unit Operations shown: load (%eax,%edx.0,4) -> t.1; imull t.1, %ecx.0 -> %ecx.1; incl %edx.0 -> %edx.1; cmpl %esi, %edx.1 -> cc.1; jl-taken cc.1
44 Visualizing Operations Figure 5.14 P403 Operations –Same as before, except that add has latency of 1 Operations shown: load (%eax,%edx,4) -> t.1; iaddl t.1, %ecx.0 -> %ecx.1; incl %edx.0 -> %edx.1; cmpl %esi, %edx.1 -> cc.1; jl-taken cc.1
45 Iterations of Combining Product Figure 5.15 P404 Unlimited Resource Analysis –Assume operation can start as soon as operands available –Operations for multiple iterations overlap in time Performance –Limiting factor becomes latency of integer multiplier –Gives CPE of 4.0
46 4 Iterations of Combining Sum Figure 5.16 P405 Unlimited Resource Analysis Performance –Can begin a new iteration on each clock cycle –Should give CPE of 1.0 –Would require executing 4 integer operations in parallel
47 Combining Sum: Resource Constraints Figure 5.18 P408
48 Combining Sum: Resource Constraints Only have two integer functional units Some operations delayed even though operands available Set priority based on program order Performance –Sustain CPE of 2.0
Converting to Pointer Code
50 void combine4p(vec_ptr v, int *dest) { int i; int length = vec_length(v); int *data = get_vec_start(v); int *dend = data + length ; int x = IDENT; for (; data < dend ; data++ ) x = x OPER *data; *dest = x; } Example P413
51 Pointer Code vs. Array Code P414 Some compilers and processors do a better job optimizing array code [Table: CPE of combine4 and combine4p for integer and floating-point + and *]
52 Pointer vs. Array Code Inner Loops P414
Array code:
   .L24:                          # Loop:
     addl (%eax,%edx,4),%ecx      # x += data[i]
     incl %edx                    # i++
     cmpl %esi,%edx               # i:length
     jl .L24                      # if < goto Loop
Pointer code:
   .L30:                          # Loop:
     addl (%eax),%ecx             # x += *data
     addl $4,%eax                 # data++
     cmpl %edx,%eax               # data:dend
     jb .L30                      # if < goto Loop
53 Performance –Array Code: 4 instructions in 2 clock cycles –Pointer Code: Almost same 4 instructions in 3 clock cycles Pointer vs. Array Code Inner Loops
Reducing Loop Overhead
55 Loop unrolling P409 void combine5(vec_ptr v, int *dest) { int i; int length = vec_length(v); int *data = get_vec_start(v); int x = IDENT; /* combine 3 elements at a time */ for (i = 0; i < length-2; i+=3) x = x OPER data[i] OPER data[i+1] OPER data[i+2]; /* finish any remaining elements */ for (; i < length; i++) x = x OPER data[i]; *dest = x; }
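As a concrete instance of the unrolled loop above, here is a minimal sketch with OPER fixed to `+` and IDENT to `0`, working on a plain array instead of `vec_ptr`. The name `sum_unrolled3` is hypothetical.

```c
/* Sketch: 3-way unrolled integer sum over a plain array. */
int sum_unrolled3(const int *data, int length)
{
    int i;
    int x = 0;

    /* Combine 3 elements at a time; the bound length - 2 keeps
       data[i + 2] inside the array. */
    for (i = 0; i < length - 2; i += 3)
        x = x + data[i] + data[i + 1] + data[i + 2];

    /* Finish the 0, 1, or 2 remaining elements. */
    for (; i < length; i++)
        x = x + data[i];

    return x;
}
```

The second loop is what makes the function correct for lengths that are not a multiple of 3, including lengths 0, 1, and 2, where the first loop does not run at all.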
56 Visualizing Unrolled Loop P410 –Loads can pipeline, since they have no dependencies –Only one set of loop control operations Operations: load (%eax,%edx.0,4) -> t.1a; iaddl t.1a, %ecx.0c -> %ecx.1a; load 4(%eax,%edx.0,4) -> t.1b; iaddl t.1b, %ecx.1a -> %ecx.1b; load 8(%eax,%edx.0,4) -> t.1c; iaddl t.1c, %ecx.1b -> %ecx.1c; iaddl $3,%edx.0 -> %edx.1; cmpl %esi, %edx.1 -> cc.1; jl-taken cc.1
57 Visualizing Unrolled Loop Figure 5.20 P410 [Figure: dependency graph of one unrolled iteration, with three loads feeding a chain of three addl operations] Measured CPE = 1.33
58 Executing with Loop Unrolling Figure 5.21 P411
59 Executing with Loop Unrolling Predicted Performance –Can complete iteration in 3 cycles –Should give CPE of 1.0 Measured Performance –CPE of 1.33 –One iteration every 4 cycles
60 Effect of Unrolling P411 [Table: CPE by unrolling degree. Integer sum varies with degree; integer product 4.00; FP sum 3.00; FP product 5.00]
61 Effect of Unrolling Only helps integer sum for our examples –Other cases constrained by functional unit latencies Effect is nonlinear with degree of unrolling –Many subtle effects determine exact scheduling of operations
Enhancing Parallelism
63 Loop Splitting P409 void combine6(vec_ptr v, int *dest) { int i; int length = vec_length(v); int *data = get_vec_start(v); int x0 = IDENT, x1 = IDENT; /* combine 2 elements at a time */ for (i = 0; i < length-1; i+=2) { x0 = x0 OPER data[i]; x1 = x1 OPER data[i+1]; } /* finish any remaining elements */ for (; i < length; i++) x0 = x0 OPER data[i]; *dest = x0 OPER x1; }
64 Loop Splitting Optimization –Accumulate in two different sums Can be performed simultaneously –Combine at end –Exploits property that integer addition & multiplication are associative & commutative –FP addition & multiplication are not associative, but the transformation is usually acceptable
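A minimal sketch of the two-accumulator idea, with OPER fixed to `+` and IDENT to `0` and a plain array in place of `vec_ptr`. The name `sum_two_accumulators` is hypothetical. Note the first loop bound: it must stop while data[i + 1] is still inside the array.

```c
/* Sketch: integer sum with 2-way unrolling and two independent accumulators. */
int sum_two_accumulators(const int *data, int length)
{
    int i;
    int x0 = 0, x1 = 0;

    /* x0 collects even-indexed elements, x1 odd-indexed ones.
       The two additions do not depend on each other. */
    for (i = 0; i < length - 1; i += 2) {
        x0 += data[i];
        x1 += data[i + 1];
    }

    /* Finish the last element when length is odd. */
    for (; i < length; i++)
        x0 += data[i];

    /* Combine at the end; valid because integer + is
       associative and commutative. */
    return x0 + x1;
}
```

Because x0 and x1 form two separate dependency chains, the processor can overlap their operations. For integer addition the regrouped result is the same as the sequential one; for floating point it can differ in rounding.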
65 Visualizing Parallel Loop P417 Two multiplies within loop no longer have data dependency Allows them to pipeline Operations: load (%eax,%edx.0,4) -> t.1a; imull t.1a, %ecx.0 -> %ecx.1; load 4(%eax,%edx.0,4) -> t.1b; imull t.1b, %ebx.0 -> %ebx.1; iaddl $2,%edx.0 -> %edx.1; cmpl %esi, %edx.1 -> cc.1; jl-taken cc.1
66 Visualizing Parallel Loop Figure 5.25 P417 [Figure: dependency graph of one iteration, with two loads feeding two independent imull operations on %ecx and %ebx]
67 Executing with Parallel Loop Figure 5.26 P418
68 Optimization Results for Combining P419
69 Optimization Results for Combining Register spilling –Only 6 registers available –Using memory as storage Example of spilled code –movl -12(%ebp), %edi –imull 24(%eax), %edi –movl %edi, -12(%ebp)
Putting it Together: Summary of Results for Optimizing Combining Code
Branch Prediction and Misprediction Penalties
72 What About Branches? Challenge –Instruction Control Unit must work well ahead of Exec. Unit –To generate enough operations to keep EU busy
What About Branches? [Figure labels: Executing; Fetching & Decoding]
   f3:      movl $0x1,%ecx
   80489f8: xorl %edx,%edx
   80489fa: cmpl %esi,%edx
   80489fc: jnl 8048a
   fe:      movl %esi,%esi
   8048a00: imull (%eax,%edx,4),%ecx
74 What About Branches? Challenge –When it encounters a conditional branch, it cannot reliably determine where to continue fetching
75 Branch Outcomes When encounter conditional branch, cannot determine where to continue fetching –Branch Taken: Transfer control to branch target –Branch Not-Taken: Continue with next instruction in sequence Cannot resolve until outcome determined by branch/integer unit
Branch Outcomes
   f3:      movl $0x1,%ecx
   80489f8: xorl %edx,%edx
   80489fa: cmpl %esi,%edx
   80489fc: jnl 8048a
Branch Not-Taken:
   fe:      movl %esi,%esi
   8048a00: imull (%eax,%edx,4),%ecx
Branch Taken:
   8048a25: cmpl %edi,%edx
   8048a27: jl 8048a
   a29:     movl 0xc(%ebp),%eax
   8048a2c: leal 0xffffffe8(%ebp),%esp
   8048a2f: movl %ecx,(%eax)
77 Branch Prediction Idea –Guess which way branch will go –Begin executing instructions at predicted position But don’t actually modify register or memory data
Branch Prediction
Execute:
   f3:      movl $0x1,%ecx
   80489f8: xorl %edx,%edx
   80489fa: cmpl %esi,%edx
   80489fc: jnl 8048a
Predict Taken:
   8048a25: cmpl %edi,%edx
   8048a27: jl 8048a
   a29:     movl 0xc(%ebp),%eax
   8048a2c: leal 0xffffffe8(%ebp),%esp
   8048a2f: movl %ecx,(%eax)
Branch Prediction Through Loop Assume vector length = 100
Loop body, fetched again for each iteration:
   80488b1: movl (%ecx,%edx,4),%eax
   80488b4: addl %eax,(%edi)
   80488b6: incl %edx
   80488b7: cmpl %esi,%edx
   80488b9: jl 80488b1
   i = 98: Predict Taken (OK)
   i = 99: Predict Taken (Oops)
   i = 100: Read invalid location
   i = 101: fetched
   [Figure labels: Executed; Fetched]
Branch Misprediction Invalidation Assume vector length = 100
   Same loop body as above.
   i = 98: Predict Taken (OK)
   i = 99: Predict Taken (Oops)
   i = 100 and i = 101 (fetched as far as incl %edx): Invalidate
Branch Misprediction Recovery Assume vector length = 100
   Same loop body as above.
   i = 98: Predict Taken (OK)
   i = 99: Definitely not taken; execution continues after the loop:
   80488bb: leal 0xffffffe8(%ebp),%esp
   80488be: popl %ebx
   80488bf: popl %esi
   80488c0: popl %edi
82 Branch Misprediction Recovery P427 Performance Cost –Misprediction on Pentium III wastes ~14 clock cycles –That’s a lot of time on a high performance processor
83 Conditional Jump Figure 5.29 P427 Misprediction penalty is about 14 cycles on a PIII machine A conditional move is used to avoid the misprediction penalty when the branch outcome is not predictable For example: int absval(int val) { return (val < 0) ? -val : val; }
84 Conditional Jump P428
   movl 8(%ebp), %eax    # Get val as result
   movl %eax, %edx       # Copy to %edx
   negl %edx             # Negate %edx
   testl %eax, %eax      # Test val
   cmovl %edx, %eax      # If < 0, copy %edx to result
Understanding Memory Performance
86 Load Latency P429, P430 Figure 5.30 P430
typedef struct ELE { struct ELE *next; int data; } list_ele, *list_ptr;
int list_len(list_ptr ls) { int len = 0; for (; ls; ls = ls->next) len++; return len; }
Assembly Instructions:
   .L27:
     incl %eax
     movl (%edx), %edx
     testl %edx, %edx
     jne .L27
Execution unit operations:
     incl %eax.0 -> %eax.1
     load (%edx.0) -> %edx.1
     testl %edx.1, %edx.1 -> cc.1
     jne-taken cc.1
87 Figure 5.31 P430 [Figure: three iterations (i = 1, 2, 3) of the list_len loop; each load of %edx depends on the previous load (%edx.0 to %edx.3), while incl (%eax.0 to %eax.3), testl (cc.1 to cc.3) and jne hang off that chain]
88 Store Latency Figure 5.32 P431 void array_clear(int *dest, int n) { int i; for ( i = 0 ; i < n ; i++) dest[i] = 0 ; } CPE 2.0
89 Store Latency Figure 5.32 P431 void array_clear(int *dest, int n) { int i; int len = n-7; for (i = 0; i < len; i += 8) { dest[i] = dest[i+1] = dest[i+2] = dest[i+3] = 0; dest[i+4] = dest[i+5] = dest[i+6] = dest[i+7] = 0; } for (; i < n; i++) dest[i] = 0; } CPE 1.25
90 Store latency Figure 5.33 P432 void write_read(int *src, int *dest, int n) { int cnt = n; int val = 0; while (cnt--) { *dest = val; val = (*src)+1; } }
91 Store latency Figure 5.33 P432
write_read(&a[0], &a[1], 3)
         initial     iter. 1     iter. 2     iter. 3
   cnt   3           2           1           0
   a     (-10, 17)   (-10, 0)    (-10, -9)   (-10, -9)
   val   0           -9          -9          -9
write_read(&a[0], &a[0], 3)
         initial     iter. 1     iter. 2     iter. 3
   cnt   3           2           1           0
   a     (-10, 17)   (0, 17)     (1, 17)     (2, 17)
   val   0           1           2           3
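The two traces can be reproduced with a small variant of write_read. This is a sketch: `write_read_val` is a hypothetical name, and the only change from the slide's function is that it returns the final val so the last column can be checked.

```c
/* Same loop as write_read, but returns the final val. */
int write_read_val(int *src, int *dest, int n)
{
    int cnt = n;
    int val = 0;

    while (cnt--) {
        *dest = val;         /* store */
        val = (*src) + 1;    /* load: sees the store only when src == dest */
    }
    return val;
}
```

With a = (-10, 17), the call with src = &a[0] and dest = &a[1] ends with a = (-10, -9) and val = -9, because each load reads the unchanged a[0]. The call with src = dest = &a[0] ends with a = (2, 17) and val = 3, because each load depends on the store just before it.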
92 Store latency void write_read(int *src, int *dest, int n) { int cnt = n; int val = 0; while (cnt--) { *dest = val; val = (*src)+1; } }
93 Store latency P434
Assembly:
   .L32:
     movl %edx, (%ecx)
     movl (%ebx), %edx
     incl %edx
     decl %eax
     jnc .L32
Operations:
     storeaddr (%ecx)
     storedata %edx.0
     load (%ebx) -> %edx.1a
     incl %edx.1a -> %edx.1b
     decl %eax.0 -> %eax.1
     jnc-taken cc.1
94 Store latency Figure 5.35 P434 [Figure: two iterations of the write_read loop; in each, store addr, store data, load, incl, decl and jnc, linked by %edx.0, %edx.1a, %edx.1b, %edx.2a, %edx.2b, %eax.0, %eax.1 and cc.1]
Figure 5.36 P435 [Figure: the same two iterations when src and dest are equal; the address comparison (=) makes each load wait for the preceding store data operation]
Life in the Real World: Performance Improvement Techniques
Identifying and Eliminating Performance Bottlenecks
98 Performance Tuning Identify –Which is the hottest part of the program –Using a very useful method: profiling Instrument the program Run it with typical input data Collect information from the result Analyze the result –gprof example $ gcc -O2 -pg prog.c -o prog $ ./prog file.txt (generates new file gmon.out) $ gprof prog (reads gmon.out)
99 Example Task –Count word frequencies in text document –Sort the words in descending order of occurrence Steps –Convert strings to lowercase –Apply hash function –Read words and insert into hash table Mostly list operations Maintain counter for each unique word –Sort results
100 Examples unix> gcc -O2 -pg prog.c -o prog unix> ./prog file.txt unix> gprof prog [gprof flat profile. Columns: % time, cumulative seconds, self seconds, calls, self ms/call, total ms/call, name. Rows: sort_words, lower, find_ele_rec, h_add]
Example P439 [gprof call-graph entry for find_ele_rec [5]: called from insert_string [4] and recursively from itself; calls save_string [9] and new_ele [11], each with 26946 total calls]
103 Principle Interval counting –Maintain a counter for each function Record the time spent executing this function –Interrupted at regular time (1ms) Check which function is executing when interrupt occurs Increment the counter for this function
104 Data Set P439 Collected works of Shakespeare 946,596 total words, 26,946 unique Initial implementation: 9.2 seconds
105 Code Optimizations Figure 5.37 P441 –First step: Use more efficient sorting function –Library function qsort [Chart comparing versions 1) to 7)]
106 Further Optimizations [Chart comparing versions 1) to 7)]
107 Example 3) Iter first: Use iterative function to insert elements in linked list –Causes code to slow down 4) Iter last: Iterative function, places new entry at end of list –Tend to place most common words at front of list 5) Big table: Increase number of hash buckets 6) Better hash: Use more sophisticated hash function 7) Linear lower: Move strlen out of loop
108 Code Motion Example #2 Original: void lower(char *s) { int i; for (i = 0; i < strlen(s); i++) if (s[i] >= 'A' && s[i] <= 'Z') s[i] -= ('A' - 'a'); } Equivalent goto version: void lower(char *s) { int i = 0; if (i >= strlen(s)) goto done; loop: if (s[i] >= 'A' && s[i] <= 'Z') s[i] -= ('A' - 'a'); i++; if (i < strlen(s)) goto loop; done: ; }
109 Lower Case Conversion Performance –Time quadruples when string length doubles –Quadratic performance
111 Improving Performance Move call to strlen outside of loop Since result does not change from one iteration to another Form of code motion void lower(char *s) { int i; int len = strlen(s); for (i = 0; i < len; i++) if (s[i] >= 'A' && s[i] <= 'Z') s[i] -= ('A' - 'a'); }
112 Lower Case Conversion Performance –Time doubles when string length doubles –Linear performance
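The quadratic versus linear behavior can be made visible by counting how many characters the length computation examines. The sketch below is an illustration, not the measured code: `counting_strlen`, `chars_scanned`, `lower_quadratic` and `lower_linear` are hypothetical names, and `counting_strlen` is an instrumented stand-in for strlen.

```c
#include <stddef.h>

/* Total number of characters examined by counting_strlen so far. */
unsigned long chars_scanned = 0;

/* Instrumented stand-in for strlen: walks the string and records the work. */
size_t counting_strlen(const char *s)
{
    size_t n = 0;
    while (s[n] != '\0')
        n++;
    chars_scanned += n;
    return n;
}

/* Length recomputed on every loop test: quadratic in the string length. */
void lower_quadratic(char *s)
{
    size_t i;
    for (i = 0; i < counting_strlen(s); i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}

/* Length computed once before the loop: linear in the string length. */
void lower_linear(char *s)
{
    size_t i;
    size_t len = counting_strlen(s);
    for (i = 0; i < len; i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}
```

For a string of length n, the first version evaluates the loop test n + 1 times and so scans n(n + 1) characters, while the second scans n. For the 5-character string "HeLLo" that is 30 versus 5.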
113 Performance Tuning Benefits –Helps identify performance bottlenecks –Especially useful for a complex system with many components Limitations –Only shows performance for data tested –E.g., linear lower did not show big gain, since words are short Quadratic inefficiency could remain lurking in code –Timing mechanism fairly crude Only works for programs that run for > 3 seconds
114 Amdahl’s Law P443
$T_{new} = (1-\alpha)T_{old} + (\alpha T_{old})/k = T_{old}[(1-\alpha) + \alpha/k]$
$S = T_{old}/T_{new} = 1/[(1-\alpha) + \alpha/k]$
As $k \to \infty$: $S_{\infty} = 1/(1-\alpha)$
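A worked instance of the formula, as a minimal sketch. Here alpha is the fraction of the original run time taken by the part being improved and k is the factor by which that part is sped up; the function name `amdahl_speedup` is hypothetical.

```c
/* Overall speedup S = 1 / ((1 - alpha) + alpha / k). */
double amdahl_speedup(double alpha, double k)
{
    return 1.0 / ((1.0 - alpha) + alpha / k);
}
```

For example, speeding up 60% of the run time (alpha = 0.6) by a factor of k = 3 gives S = 1 / (0.4 + 0.2), about 1.67. Even with k arbitrarily large, the speedup for alpha = 0.6 is bounded by 1 / 0.4 = 2.5.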