
Presentation transcript:

1 Code Optimization

2 Outline Optimization Blockers –Memory aliasing –Side effects in function calls Understanding Modern Processors –Superscalar –Out-of-order execution More Code Optimization Techniques Performance Tuning Suggested reading –5.1, 5.7 ~ 5.16

3 5.1 Capabilities and Limitations of Optimizing Compilers Review of 5.3 Program Example 5.4 Eliminating Loop Inefficiencies 5.5 Reducing Procedure Calls 5.6 Eliminating Unneeded Memory References

4 Example P387
void combine1(vec_ptr v, data_t *dest)
{
    int i;
    *dest = IDENT;
    for (i = 0; i < vec_length(v); i++) {
        int val;
        get_vec_element(v, i, &val);
        *dest = *dest OPER val;
    }
}

5 Example P388
void combine2(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);   /* call moved out of the loop */
    *dest = IDENT;
    for (i = 0; i < length; i++) {
        int val;
        get_vec_element(v, i, &val);
        *dest = *dest OPER val;
    }
}

6 Example P392
void combine3(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    int *data = get_vec_start(v);   /* direct access to the array */
    *dest = IDENT;
    for (i = 0; i < length; i++) {
        *dest = *dest OPER data[i];
    }
}

7 Example P394
void combine4(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    int *data = get_vec_start(v);
    int x = IDENT;                  /* accumulate in a local variable */
    for (i = 0; i < length; i++)
        x = x OPER data[i];
    *dest = x;
}

8 Machine Independent Opt. Results Optimizations –Reduce function calls and memory references within loop

9 Machine Independent Opt. Results Performance Anomaly –Computing the FP product of all elements is exceptionally slow –Very large speedup when accumulating in a temporary –Memory uses a 64-bit format; registers use 80 bits –Benchmark data caused overflow of 64 bits, but not 80 Functions compared: Combine1, Combine2, Combine3, Combine4 (P385, P388, P392, P394)

10 Optimization Blockers P394
void combine4(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    int *data = get_vec_start(v);
    int sum = 0;
    for (i = 0; i < length; i++)
        sum += data[i];
    *dest = sum;
}

11 Optimization Blocker: Memory Aliasing P394 Aliasing –Two different memory references specify single location Example –v: [3, 2, 17] –combine3(v, get_vec_start(v)+2) -->? –combine4(v, get_vec_start(v)+2) -->?
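The two questions on this slide can be answered by running both functions with the destination aliased to the last vector element. The sketch below is a minimal stand-in, not the course's vector library: the `vec_rec` struct, the trivial `vec_length` and `get_vec_start` helpers, the `run_aliased` driver, and the choice of `IDENT` as 0 and `OPER` as `+` are all assumptions made for illustration.

```c
#include <stddef.h>

/* Hypothetical minimal vector type standing in for the slides' vec_ptr. */
typedef struct {
    int len;
    int *data;
} vec_rec, *vec_ptr;

static int vec_length(vec_ptr v) { return v->len; }
static int *get_vec_start(vec_ptr v) { return v->data; }

/* Assumed instantiation: integer sum. */
#define IDENT 0
#define OPER +

/* combine3 accumulates directly in *dest, so every iteration reads and
   writes memory that may overlap the vector itself. */
void combine3(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    int *data = get_vec_start(v);
    *dest = IDENT;
    for (i = 0; i < length; i++)
        *dest = *dest OPER data[i];
}

/* combine4 accumulates in a local variable and writes *dest once. */
void combine4(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    int *data = get_vec_start(v);
    int x = IDENT;
    for (i = 0; i < length; i++)
        x = x OPER data[i];
    *dest = x;
}

/* Hypothetical driver: builds v = [3, 2, 17], passes the address of the
   last element as dest, and returns the final value of that element. */
int run_aliased(void (*combine)(vec_ptr, int *))
{
    int storage[3] = {3, 2, 17};
    vec_rec v = {3, storage};
    combine(&v, get_vec_start(&v) + 2);
    return storage[2];
}
```

Under these assumptions `run_aliased(combine3)` yields 10, because the vector goes through the states [3, 2, 0], [3, 2, 3], [3, 2, 5], [3, 2, 10], while `run_aliased(combine4)` yields 22. Since the two functions differ when `dest` aliases the vector, a compiler cannot turn combine3 into combine4 on its own.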

12 Optimization Blocker: Memory Aliasing Observations –Aliasing arises easily in C Address arithmetic is allowed Direct access to storage structures –Get in the habit of introducing local variables Accumulate within loops This is your way of telling the compiler it need not check for aliasing

13 Optimizing Compilers Provide efficient mapping of program to machine –register allocation –code selection and ordering –eliminating minor inefficiencies

14 Optimizing Compilers Don’t (usually) improve asymptotic efficiency –up to programmer to select best overall algorithm –big-O savings are (often) more important than constant factors but constant factors also matter Have difficulty overcoming “optimization blockers” –potential memory aliasing –potential procedure side-effects

15 Limitations of Optimizing Compilers Operate Under a Fundamental Constraint –Must not cause any change in program behavior under any possible condition –This often prevents optimizations that would affect behavior only under pathological conditions

16 Limitations of Optimizing Compilers Behavior that may be obvious to the programmer can be obfuscated by languages and coding styles –e.g., data ranges may be more limited than variable types suggest e.g., using an "int" in C for what could be an enumerated type

17 Limitations of Optimizing Compilers Most analysis is performed only within procedures –whole-program analysis is too expensive in most cases Most analysis is based only on static information –compiler has difficulty anticipating run-time inputs When in doubt, the compiler must be conservative

18 Optimization Blockers P380 Memory aliasing
void twiddle1(int *xp, int *yp)
{
    *xp += *yp;
    *xp += *yp;
}
void twiddle2(int *xp, int *yp)
{
    *xp += 2 * *yp;
}
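A small check makes the aliasing hazard concrete. This sketch assumes twiddle1 adds `*yp` to `*xp` twice, which is what makes it look interchangeable with twiddle2. The two agree when the pointers are distinct and disagree when they point to the same location.

```c
/* twiddle1: two separate read-modify-write steps (assumed form). */
void twiddle1(int *xp, int *yp)
{
    *xp += *yp;
    *xp += *yp;
}

/* twiddle2: one step that adds twice the value of *yp. */
void twiddle2(int *xp, int *yp)
{
    *xp += 2 * *yp;
}
```

With distinct variables `x = 5`, `y = 3`, both leave `x` at 11. With `xp == yp` and `x = 5`, twiddle1 doubles `x` twice (20), while twiddle2 triples it (15), so a compiler that cannot rule out aliasing must not replace one with the other.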

19 Optimization Blockers P381 Function call and side effect
int f(int);
int func1(int x)
{
    return f(x) + f(x) + f(x) + f(x);
}
int func2(int x)
{
    return 4 * f(x);
}

20 Optimization Blockers P381 Function call and side effect
int counter = 0;
int f(int x)
{
    return counter++;
}
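Putting the two slides together shows why the compiler cannot rewrite four calls as one call times four. The sketch below simply combines the slide's `f` with `func1` and `func2`; nothing beyond those names is assumed.

```c
int counter = 0;

/* f has a side effect: it modifies global state on every call. */
int f(int x)
{
    (void)x;            /* argument unused, as on the slide */
    return counter++;
}

/* Calls f four times: returns 0 + 1 + 2 + 3 when counter starts at 0. */
int func1(int x)
{
    return f(x) + f(x) + f(x) + f(x);
}

/* Calls f once: returns 4 * 0 when counter starts at 0. */
int func2(int x)
{
    return 4 * f(x);
}
```

Starting from `counter = 0`, `func1` returns 6 and leaves `counter` at 4, while `func2` returns 0 and leaves `counter` at 1. The order in which the four calls in `func1` are evaluated is unspecified in C, but the sum is the same in any order.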

Understanding Modern Processors

22 Modern CPU Design Figure 5.11 P396 [Block diagram. Instruction Control: fetch control, instruction cache, instruction decode, and retirement unit with register file. Execution: integer/branch, general integer, FP add, FP mult/div, load, and store functional units, plus the data cache. Decoded operations flow to the functional units; operation results, "prediction OK?" signals, and register updates flow back.]

23 [Same block diagram, with components numbered 1) through 5) and (1) through (7) for reference in the following slides.]

24 Modern Processor P396 Superscalar –Perform multiple operations on every clock cycle Out-of-order execution –The order in which the instructions execute need not correspond to their ordering in the assembly program

25 Modern Processor P396 Two main parts –Instruction Control Unit Responsible for reading a sequence of instructions from memory Generating from above instructions a set of primitive operations to perform on program data –Execution Unit

26 1) Instruction Control Unit Instruction Cache –A special, high speed memory containing the most recently accessed instructions.

27 1) Instruction Control Unit P397 P398 Instruction Decoding Logic
–Takes actual program instructions
–Converts them into a set of primitive operations
–Each primitive operation performs some simple task: simple arithmetic, load, store
–Example: addl %eax, 4(%edx) becomes three operations
    load 4(%edx) -> t1
    addl %eax, t1 -> t2
    store t2, 4(%edx)
–Register renaming

28 2) Fetch Control Fetch Ahead P396 –Fetches well ahead of currently accessed instructions –ICU has enough time to decode these –ICU has enough time to send decoded operations down to the EU

29 Fetch Control Branch Prediction P397 –Branch taken or fall through –Guess whether branch is taken or not Speculative Execution P397 –Fetch, decode and execute according to the branch prediction –Before the branch outcome has been determined

30 Multi-functional Units Multiple Instructions Can Execute in Parallel –1 load –1 store –2 integer (one may be branch) –1 FP Addition –1 FP Multiplication or Division

31 Multi-functional Units Figure 5.12 P400 Some instructions take > 1 cycle, but can be pipelined
Instruction                  Latency   Cycles/Issue
Load / Store                 3         1
Integer Multiply             4         1
Integer Divide               36        36
Double/Single FP Multiply    5         2
Double/Single FP Add         3         1
Double/Single FP Divide      38        38

32 Execution Unit Receives operations from ICU Each cycle it may receive more than one operation Operations are queued in buffer

33 Execution Unit An operation is dispatched to one of the multiple functional units whenever –All the operands of the operation are ready –A suitable functional unit is available Execution results are passed among functional units (7) Data Cache P398 –A high speed memory containing the most recently accessed data values

34 4) Retirement Unit P398 Instructions need to commit in serial order –Misprediction –Exception Updates architectural state –Memory and register values

35 Translation Example P401
Assembly:
.L24:                             # Loop:
    imull (%eax,%edx,4),%ecx      # t *= data[i]
    incl %edx                     # i++
    cmpl %esi,%edx                # i:length
    jl .L24                       # if < goto Loop
Execution unit operations:
    load (%eax,%edx.0,4) -> t.1
    imull t.1, %ecx.0 -> %ecx.1
    incl %edx.0 -> %edx.1
    cmpl %esi, %edx.1 -> cc.1
    jl-taken cc.1

36 Understanding Translation Example P401
    imull (%eax,%edx,4),%ecx
becomes
    load (%eax,%edx.0,4) -> t.1
    imull t.1, %ecx.0 -> %ecx.1
Split into two operations
–Load reads from memory to generate temporary result t.1
–Multiply operation just operates on registers

37 Understanding Translation Example P401
    imull (%eax,%edx,4),%ecx
becomes
    load (%eax,%edx.0,4) -> t.1
    imull t.1, %ecx.0 -> %ecx.1
Operands
–Register %eax does not change in the loop; its value is retrieved from the register file during decoding

38 Understanding Translation Example P401
    imull (%eax,%edx,4),%ecx
becomes
    load (%eax,%edx.0,4) -> t.1
    imull t.1, %ecx.0 -> %ecx.1
Operands
–Register %ecx changes on every iteration
–Uniquely identify different versions as %ecx.0, %ecx.1, %ecx.2, …
–Register renaming: values are passed directly from producer to consumers

39 Understanding Translation Example P402
    incl %edx
becomes
    incl %edx.0 -> %edx.1
Register %edx changes on each iteration
Renamed as %edx.0, %edx.1, %edx.2, …

40 Understanding Translation Example P402
    cmpl %esi,%edx
becomes
    cmpl %esi, %edx.1 -> cc.1
Condition codes are treated similarly to registers
A tag is assigned to define the connection between producer and consumer

41 Understanding Translation Example P402
    jl .L24
becomes
    jl-taken cc.1
Instruction control unit determines the destination of the jump
Predicts whether the branch will be taken
Starts fetching instructions at the predicted destination

42 Understanding Translation Example P401
    jl .L24
becomes
    jl-taken cc.1
Execution unit simply checks whether or not the prediction was OK
If not, it signals instruction control
–Instruction control then "invalidates" any operations generated from misfetched instructions
–Begins fetching and decoding instructions at the correct target

43 Visualizing Operations Figure 5.13 P403
    load (%eax,%edx.0,4) -> t.1
    imull t.1, %ecx.0 -> %ecx.1
    incl %edx.0 -> %edx.1
    cmpl %esi, %edx.1 -> cc.1
    jl-taken cc.1
Operations
–Vertical position denotes the time at which an operation executes; it cannot begin until its operands are available
–Height denotes latency
Operands
–Arcs shown only for operands that are passed within the execution unit
[Diagram: data-flow graph of load, imull, incl, cmpl, and jl against a time axis, linking %edx.0, %edx.1, %ecx.0, %ecx.1, t.1, and cc.1]

44 Visualizing Operations Figure 5.14 P403
    load (%eax,%edx.0,4) -> t.1
    iaddl t.1, %ecx.0 -> %ecx.1
    incl %edx.0 -> %edx.1
    cmpl %esi, %edx.1 -> cc.1
    jl-taken cc.1
Operations
–Same as before, except that add has a latency of 1
[Diagram: data-flow graph of load, addl, incl, cmpl, and jl against a time axis]

45 Iterations of Combining Product Figure 5.15 P404 Unlimited Resource Analysis –Assume an operation can start as soon as its operands are available –Operations for multiple iterations overlap in time Performance –Limiting factor becomes the latency of the integer multiplier –Gives CPE of 4.0

46 4 Iterations of Combining Sum Figure 5.16 P405 Unlimited Resource Analysis Performance –Can begin a new iteration on each clock cycle –Should give CPE of 1.0 –Would require executing 4 integer operations in parallel

47 Combining Sum: Resource Constraints Figure 5.18 P408

48 Combining Sum: Resource Constraints Only have two integer functional units Some operations delayed even though operands available Set priority based on program order Performance –Sustain CPE of 2.0

Converting to Pointer Code

50 Example P413
void combine4p(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    int *data = get_vec_start(v);
    int *dend = data + length;
    int x = IDENT;
    for (; data < dend; data++)
        x = x OPER *data;
    *dest = x;
}

51 Pointer Code vs. Array Code P414 Some compilers and processors do a better job optimizing array code [Table: CPE of combine4 and combine4p for integer and floating point, + and *; values not recoverable]

52 Pointer vs. Array Code Inner Loops P414
Array code:
.L24:                             # Loop:
    addl (%eax,%edx,4),%ecx       # x += data[i]
    incl %edx                     # i++
    cmpl %esi,%edx                # i:length
    jl .L24                       # if < goto Loop
Pointer code:
.L30:                             # Loop:
    addl (%eax),%ecx              # x += *data
    addl $4,%eax                  # data++
    cmpl %edx,%eax                # data:dend
    jb .L30                       # if < goto Loop

53 Performance –Array Code: 4 instructions in 2 clock cycles –Pointer Code: Almost same 4 instructions in 3 clock cycles Pointer vs. Array Code Inner Loops

Reducing Loop Overhead

55 Loop unrolling P409
void combine5(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    int *data = get_vec_start(v);
    int x = IDENT;
    /* combine 3 elements at a time */
    for (i = 0; i < length-2; i += 3)
        x = x OPER data[i] OPER data[i+1] OPER data[i+2];
    /* finish any remaining elements */
    for (; i < length; i++)
        x = x OPER data[i];
    *dest = x;
}

56 Visualizing Unrolled Loop P410
    load (%eax,%edx.0,4) -> t.1a
    iaddl t.1a, %ecx.0c -> %ecx.1a
    load 4(%eax,%edx.0,4) -> t.1b
    iaddl t.1b, %ecx.1a -> %ecx.1b
    load 8(%eax,%edx.0,4) -> t.1c
    iaddl t.1c, %ecx.1b -> %ecx.1c
    iaddl $3,%edx.0 -> %edx.1
    cmpl %esi, %edx.1 -> cc.1
    jl-taken cc.1
–Loads can pipeline, since they have no dependencies
–Only one set of loop control operations

57 Visualizing Unrolled Loop Figure 5.20 P410 Measured CPE = 1.33 [Diagram: data-flow graph of three loads feeding three chained addl operations (%ecx.0c -> %ecx.1a -> %ecx.1b -> %ecx.1c), plus addl, cmpl, and jl for loop control, against a time axis]

58 Executing with Loop Unrolling Figure 5.21 P411

59 Executing with Loop Unrolling Predicted Performance –Can complete iteration in 3 cycles –Should give CPE of 1.0 Measured Performance –CPE of 1.33 –One iteration every 4 cycles

60 Effect of Unrolling P411 [Table: CPE by unrolling degree. Integer Sum: values not recoverable; Integer Product: 4.00; FP Sum: 3.00; FP Product: 5.00]

61 Effect of Unrolling Only helps integer sum for our examples –Other cases constrained by functional unit latencies Effect is nonlinear with degree of unrolling –Many subtle effects determine exact scheduling of operations

Enhancing Parallelism

63 Loop Splitting P409
void combine5(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    int *data = get_vec_start(v);
    int x0 = IDENT, x1 = IDENT;
    /* combine 2 elements at a time; stop before data[i+1] runs past the end */
    for (i = 0; i < length-1; i += 2) {
        x0 = x0 OPER data[i];
        x1 = x1 OPER data[i+1];
    }
    /* finish any remaining elements */
    for (; i < length; i++)
        x0 = x0 OPER data[i];
    *dest = x0 OPER x1;
}

64 Loop Splitting Optimization –Accumulate in two different sums (or products), which can be computed simultaneously –Combine at end –Exploits the property that integer addition and multiplication are associative and commutative –FP addition and multiplication are not associative, but the transformation is usually acceptable
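The caveat about floating point can be seen with three numbers. The sketch below uses two hypothetical helpers, `sum_left` and `sum_right`, that add the same three doubles with different grouping, and it assumes IEEE 754 double arithmetic. The `volatile` temporaries force each partial sum to be rounded to double, so extended-precision registers do not hide the effect.

```c
/* (a + b) + c */
double sum_left(double a, double b, double c)
{
    volatile double t = a + b;   /* rounded to double before the next add */
    return t + c;
}

/* a + (b + c) */
double sum_right(double a, double b, double c)
{
    volatile double t = b + c;   /* rounded to double before the next add */
    return a + t;
}
```

With a = 1e20, b = -1e20, c = 3.14, the left grouping gives 3.14, while the right grouping gives 0.0, because 3.14 is absorbed when added to -1e20. Splitting a loop into two accumulators regroups the operations in the same way, which is why the result can differ for FP data even though it is usually close enough.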

65 Visualizing Parallel Loop P417
    load (%eax,%edx.0,4) -> t.1a
    imull t.1a, %ecx.0 -> %ecx.1
    load 4(%eax,%edx.0,4) -> t.1b
    imull t.1b, %ebx.0 -> %ebx.1
    iaddl $2,%edx.0 -> %edx.1
    cmpl %esi, %edx.1 -> cc.1
    jl-taken cc.1
Two multiplies within the loop no longer have a data dependency
Allows them to pipeline

66 Visualizing Parallel Loop Figure 5.25 P417 [Diagram: data-flow graph with two independent load/imull chains (%ecx.0 -> %ecx.1 and %ebx.0 -> %ebx.1), plus addl, cmpl, and jl for loop control, against a time axis]

67 Executing with Parallel Loop Figure 5.26 P418

68 Optimization Results for Combining P419

69 Optimization Results for Combining Register spilling –Only 6 registers available –Memory is used as storage for values that do not fit Example of register spilling
    movl -12(%ebp), %edi
    imull 24(%eax), %edi
    movl %edi, -12(%ebp)

Putting it Together: Summary of Results for Optimizing Combining Code

Branch Prediction and Misprediction Penalties

72 What About Branches? Challenge –Instruction Control Unit must work well ahead of Exec. Unit –To generate enough operations to keep EU busy

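73 What About Branches?
    80489f3: movl $0x1,%ecx
    80489f8: xorl %edx,%edx
    80489fa: cmpl %esi,%edx
    80489fc: jnl 8048a25
    80489fe: movl %esi,%esi
    8048a00: imull (%eax,%edx,4),%ecx
Labels: Executing, Fetching & Decoding (earlier instructions are executing while later ones are still being fetched and decoded)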

74 What About Branches? Challenge –When the processor encounters a conditional branch, it cannot reliably determine where to continue fetching

75 Branch Outcomes When the processor encounters a conditional branch, it cannot determine where to continue fetching –Branch Taken: Transfer control to branch target –Branch Not-Taken: Continue with next instruction in sequence Cannot resolve until the outcome is determined by the branch/integer unit

76 Branch Outcomes
    80489f3: movl $0x1,%ecx
    80489f8: xorl %edx,%edx
    80489fa: cmpl %esi,%edx
    80489fc: jnl 8048a25
Branch Not-Taken:
    80489fe: movl %esi,%esi
    8048a00: imull (%eax,%edx,4),%ecx
Branch Taken:
    8048a25: cmpl %edi,%edx
    8048a27: jl 8048a…
    8048a29: movl 0xc(%ebp),%eax
    8048a2c: leal 0xffffffe8(%ebp),%esp
    8048a2f: movl %ecx,(%eax)

77 Branch Prediction Idea –Guess which way branch will go –Begin executing instructions at predicted position But don’t actually modify register or memory data

78 Branch Prediction
    80489f3: movl $0x1,%ecx
    80489f8: xorl %edx,%edx
    80489fa: cmpl %esi,%edx
    80489fc: jnl 8048a25
Predict Taken, then Execute from the predicted target:
    8048a25: cmpl %edi,%edx
    8048a27: jl 8048a…
    8048a29: movl 0xc(%ebp),%eax
    8048a2c: leal 0xffffffe8(%ebp),%esp
    8048a2f: movl %ecx,(%eax)

79 Branch Prediction Through Loop
Assume vector length = 100
Loop body, shown once per iteration on the slide:
    80488b1: movl (%ecx,%edx,4),%eax
    80488b4: addl %eax,(%edi)
    80488b6: incl %edx
    80488b7: cmpl %esi,%edx
    80488b9: jl 80488b1
i = 98: Predict Taken (OK)
i = 99: Predict Taken (Oops)
i = 100: Read invalid location
i = 101
Labels: Executed, Fetched

80 Branch Misprediction Invalidation
Assume vector length = 100
Same loop body as on the previous slide
i = 98: Predict Taken (OK)
i = 99: Predict Taken (Oops)
i = 100: Invalidate
i = 101: Invalidate (only movl, addl, and incl have been fetched)

81 Branch Misprediction Recovery
Assume vector length = 100
Same loop body for i = 98 and i = 99
i = 98: Predict Taken (OK)
i = 99: Definitely not taken; execution continues after the loop:
    80488bb: leal 0xffffffe8(%ebp),%esp
    80488be: popl %ebx
    80488bf: popl %esi
    80488c0: popl %edi

82 Branch Misprediction Recovery P427 Performance Cost –Misprediction on Pentium III wastes ~14 clock cycles –That’s a lot of time on a high performance processor

83 Conditional Jump Figure 5.29 P427 Misprediction penalty is about 14 cycles on a PIII machine. A conditional move is used to avoid the misprediction penalty when the branch outcome is not predictable. For example:
int absval(int val)
{
    return (val < 0) ? -val : val;
}

84 Conditional Jump P428
    movl 8(%ebp), %eax       # Get val as result
    movl %eax, %edx          # Copy to %edx
    negl %edx                # Negate %edx
    testl %eax, %eax         # Test val
    cmovl %edx, %eax         # If < 0, copy %edx to result

Understanding Memory Performance

86 Load Latency P429, P430 Figure 5.30 P430
typedef struct ELE {
    struct ELE *next;
    int data;
} list_ele, *list_ptr;
int list_len(list_ptr ls)
{
    int len = 0;
    for (; ls; ls = ls->next)
        len++;
    return len;
}
Assembly Instructions
.L27:
    incl %eax
    movl (%edx), %edx
    testl %edx, %edx
    jne .L27
Execution unit operations
    incl %eax.0 -> %eax.1
    load (%edx.0) -> %edx.1
    testl %edx.1, %edx.1 -> cc.1
    jne-taken cc.1

87 Figure 5.31 P430 [Diagram: three iterations (i=1, i=2, i=3) of the list_len loop against a time axis; each load producing %edx.1, %edx.2, %edx.3 depends on the previous load, while incl (%eax.1, %eax.2, %eax.3), testl, and jne (cc.1, cc.2, cc.3) proceed alongside]

88 Store Latency Figure 5.32 P431 CPE 2.0
void array_clear(int *dest, int n)
{
    int i;
    for (i = 0; i < n; i++)
        dest[i] = 0;
}

89 Store Latency Figure 5.32 P431 CPE 1.25
void array_clear(int *dest, int n)
{
    int i;
    int len = n - 7;
    /* clear 8 elements per iteration, so step by 8 */
    for (i = 0; i < len; i += 8) {
        dest[i] = dest[i+1] = dest[i+2] = dest[i+3] = 0;
        dest[i+4] = dest[i+5] = dest[i+6] = dest[i+7] = 0;
    }
    /* finish any remaining elements */
    for (; i < n; i++)
        dest[i] = 0;
}

90 Store latency Figure 5.33 P432
void write_read(int *src, int *dest, int n)
{
    int cnt = n;
    int val = 0;
    while (cnt--) {
        *dest = val;
        val = (*src) + 1;
    }
}

91 Store latency Figure 5.33 P432
write_read(&a[0], &a[1], 3)
        initial      iter. 1      iter. 2      iter. 3
cnt     3            2            1            0
a       (-10, 17)    (-10, 0)     (-10, -9)    (-10, -9)
val     0            -9           -9           -9
write_read(&a[0], &a[0], 3)
        initial      iter. 1      iter. 2      iter. 3
cnt     3            2            1            0
a       (-10, 17)    (0, 17)      (1, 17)      (2, 17)
val     0            1            2            3
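The two traces can be reproduced directly. The sketch below uses `write_read_val`, a hypothetical variant of the slide's `write_read` that returns the final `val` so it can be inspected; the loop body is unchanged.

```c
/* Same loop as write_read, but returns the final val (an addition made
   here for illustration; the slide's version returns void). */
int write_read_val(int *src, int *dest, int n)
{
    int cnt = n;
    int val = 0;
    while (cnt--) {
        *dest = val;          /* store */
        val = (*src) + 1;     /* load: depends on the store only if src == dest */
    }
    return val;
}
```

Starting from a = (-10, 17): with `src = &a[0]` and `dest = &a[1]`, three iterations leave a = (-10, -9) and val = -9, because each load reads the untouched a[0]. With `src = dest = &a[0]`, they leave a = (2, 17) and val = 3, because each load reads the value just stored. Only in the second case does every load have to wait for the preceding store.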

92 Store latency
void write_read(int *src, int *dest, int n)
{
    int cnt = n;
    int val = 0;
    while (cnt--) {
        *dest = val;
        val = (*src) + 1;
    }
}

93 Store latency P434
.L32:
    movl %edx, (%ecx)
    movl (%ebx), %edx
    incl %edx
    decl %eax
    jnc .L32
Execution unit operations
    storeaddr (%ecx)
    storedata %edx.0
    load (%ebx) -> %edx.1a
    incl %edx.1a -> %edx.1b
    decl %eax.0 -> %eax.1
    jnc-taken cc.1

94 Store latency Figure 5.35 P434 [Diagram: two loop iterations of store addr, store data, load, incl, decl, and jnc against a time axis, linking %eax.0, %eax.1, %edx.0, %edx.1a, %edx.1b, %edx.2a, %edx.2b, and cc.1]

95 Figure 5.36 P435 [Diagram: the same two loop iterations, with the store and load addresses marked as equal (=), producing %eax.1, %eax.2, %edx.1a, %edx.1b, %edx.2a, %edx.2b]

Life in the Real World: Performance Improvement Techniques

Identifying and Eliminating Performance Bottlenecks

98 Performance Tuning Identify –Which is the hottest part of the program –Using a very useful method: profiling Instrument the program Run it with typical input data Collect information from the result Analyze the result –gprof example
    $ gcc -O2 -pg prog.c -o prog
    $ ./prog file.txt        (generates new file gmon.out)
    $ gprof prog             (reads gmon.out)

99 Example Task –Count word frequencies in a text document –Sort the words in descending order of occurrence Steps –Convert strings to lowercase –Apply hash function –Read words and insert into hash table Mostly list operations Maintain counter for each unique word –Sort results

100 Examples
    unix> gcc -O2 -pg prog.c -o prog
    unix> ./prog file.txt
    unix> gprof prog
[gprof flat profile. Columns: % time, cumulative seconds, self seconds, calls, self ms/call, total ms/call, name. Rows: sort_words, lower, find_ele_rec, h_add. Values not recoverable]


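102 Example P439 [gprof call graph entry for find_ele_rec [5]. Callers: find_ele_rec [5] (recursive) and insert_string [4]. Callees: save_string [9] and new_ele [11], each out of 26946 calls, and find_ele_rec [5] (recursive). Remaining counts not recoverable]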

103 Principle Interval counting –Maintain a counter for each function Records the time spent executing this function –The program is interrupted at regular intervals (1 ms) Check which function is executing when the interrupt occurs Increment the counter for this function
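The bookkeeping behind interval counting fits in a few lines. This is a simulation sketch only: it does not install a real timer interrupt, and the names `NUM_FUNCS`, `INTERVAL_MS`, `record_sample`, `reset_counters`, and `estimated_ms` are hypothetical. Each call to `record_sample` stands for one timer interrupt that found the given function running.

```c
#include <stddef.h>

#define NUM_FUNCS 4        /* hypothetical number of profiled functions */
#define INTERVAL_MS 1.0    /* sampling interval; 1 ms assumed, as on the slide */

static unsigned long counters[NUM_FUNCS];   /* one counter per function */

/* Clear all counters before a profiling run. */
void reset_counters(void)
{
    for (size_t i = 0; i < NUM_FUNCS; i++)
        counters[i] = 0;
}

/* Stands for one timer interrupt: func_id identifies the function that
   was executing when the interrupt occurred. */
void record_sample(int func_id)
{
    if (func_id >= 0 && func_id < NUM_FUNCS)
        counters[func_id]++;
}

/* Estimated time in a function = number of samples * interval. */
double estimated_ms(int func_id)
{
    return counters[func_id] * INTERVAL_MS;
}
```

Because time is only attributed at sample points, a function that runs for less than one interval may be charged a whole interval or nothing at all, which is why the estimates are only meaningful for longer runs.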

104 Data Set P439 Collected works of Shakespeare 946,596 total words, 26,596 unique Initial implementation: 9.2 seconds

105 Code Optimizations Figure 5.37 P441 –First step: use a more efficient sorting function –Library function qsort [Chart: run time of program versions 1) through 7)]

106 Further Optimizations [Chart: run time of program versions 1) through 7)]

107 Example 3) Iter first: Use iterative function to insert elements in linked list –Causes code to slow down 4) Iter last: Iterative function, places new entry at end of list –Tend to place most common words at front of list 5) Big table: Increase number of hash buckets 6) Better hash: Use more sophisticated hash function 7) Linear lower: Move strlen out of loop

108 Code Motion Example #2
void lower(char *s)
{
    int i;
    for (i = 0; i < strlen(s); i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}
Equivalent goto form, showing that strlen is called on every iteration:
void lower(char *s)
{
    int i = 0;
    if (i >= strlen(s))
        goto done;
loop:
    if (s[i] >= 'A' && s[i] <= 'Z')
        s[i] -= ('A' - 'a');
    i++;
    if (i < strlen(s))
        goto loop;
done:
    ;
}

109 Lower Case Conversion Performance –Time quadruples when the string length doubles –Quadratic performance

110 Lower Case Conversion Performance –Time quadruples when the string length doubles –Quadratic performance

111 Improving Performance Move the call to strlen outside of the loop, since its result does not change from one iteration to another. This is a form of code motion.
void lower(char *s)
{
    int i;
    int len = strlen(s);
    for (i = 0; i < len; i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}

112 Lower Case Conversion Performance –Time doubles when the string length doubles –Linear performance

113 Performance Tuning Benefits –Helps identify performance bottlenecks –Especially useful for a complex system with many components Limitations –Only shows performance for the data tested –E.g., linear lower did not show a big gain, since words are short A quadratic inefficiency could remain lurking in the code –Timing mechanism is fairly crude Only works for programs that run for > 3 seconds

114 Amdahl's Law P443
$$T_{new} = (1-\alpha)\,T_{old} + \frac{\alpha\,T_{old}}{k} = T_{old}\left[(1-\alpha) + \frac{\alpha}{k}\right]$$
$$S = \frac{T_{old}}{T_{new}} = \frac{1}{(1-\alpha) + \alpha/k}$$
$$S_{\infty} = \frac{1}{1-\alpha}$$
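A quick numeric check helps build intuition for the formula. In the sketch below, `alpha` is the fraction of the original run time that is sped up and `k` is the speedup factor applied to that fraction; the function names `amdahl_speedup` and `amdahl_limit` are hypothetical.

```c
/* Overall speedup S = 1 / ((1 - alpha) + alpha / k). */
double amdahl_speedup(double alpha, double k)
{
    return 1.0 / ((1.0 - alpha) + alpha / k);
}

/* Limit of S as k grows without bound: 1 / (1 - alpha). */
double amdahl_limit(double alpha)
{
    return 1.0 / (1.0 - alpha);
}
```

For example, speeding up 60% of a program by a factor of 3 gives S = 1/(0.4 + 0.2), about 1.67, and no amount of further speedup on that 60% can push S past 1/0.4 = 2.5. The part left untouched quickly dominates the total time.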