1 Code Optimization. 2 Outline Optimizing Blockers –Memory alias –Side effect in function call Understanding Modern Processor –Super-scalar –Out-of –order.

Slides:

Advertisements

Similar presentations

Code Optimization and Performance Chapter 5 CS 105 Tour of the Black Holes of Computing.

Advertisements

Computer Organization and Architecture

CSCI 4717/5717 Computer Architecture

Carnegie Mellon Today Program optimization  Optimization blocker: Memory aliasing  Out of order processing: Instruction level parallelism  Understanding.

Instructor: Erol Sahin Program Optimization CENG331: Introduction to Computer Systems 11 th Lecture Acknowledgement: Most of the slides are adapted from.

Program Optimization (Chapter 5)

ECE 454 Computer Systems Programming Compiler and Optimization (I) Ding Yuan ECE Dept., University of Toronto

1 Code Optimization(II). 2 Outline Understanding Modern Processor –Super-scalar –Out-of –order execution Suggested reading –5.14,5.7.

Winter 2015 COMP 2130 Intro Computer Systems Computing Science Thompson Rivers University Program Optimization.

Limits on ILP. Achieving Parallelism Techniques – Scoreboarding / Tomasulo’s Algorithm – Pipelining – Speculation – Branch Prediction But how much more.

Carnegie Mellon 1 Program Optimization : Introduction to Computer Systems 25 th Lecture, Nov. 23, 2010 Instructors: Randy Bryant and Dave O’Hallaron.

PipelinedImplementation Part I CSC 333. – 2 – Overview General Principles of Pipelining Goal Difficulties Creating a Pipelined Y86 Processor Rearranging.

Chapter 14 Superscalar Processors. What is Superscalar? “Common” instructions (arithmetic, load/store, conditional branch) can be executed independently.

1 ICS 51 Introductory Computer Organization Fall 2006 updated: Oct. 2, 2006.

Code Optimization I September 24, 2007 Topics Machine-Independent Optimizations Basic optimizations Optimization blockers class08.ppt F’07.

Code Optimization II Feb. 19, 2008 Topics Machine Dependent Optimizations Understanding Processor Operations Branches and Branch Prediction class11.ppt.

Code Optimization: Machine Independent Optimizations Feb 12, 2004 Topics Machine-Independent Optimizations Machine-Dependent Opts Understanding Processor.

Code Optimization I: Machine Independent Optimizations Sept. 26, 2002 Topics Machine-Independent Optimizations Code motion Reduction in strength Common.

Code Optimization II September 26, 2007 Topics Machine Dependent Optimizations Understanding Processor Operations Branches and Branch Prediction class09.ppt.

Chapter 12 CPU Structure and Function. Example Register Organizations.

Code Optimization II: Machine Dependent Optimizations Topics Machine-Dependent Optimizations Pointer code Unrolling Enabling instruction level parallelism.

Architecture Basics ECE 454 Computer Systems Programming

Code Optimization 1. Outline Machine-Independent Optimization –Code motion –Memory optimization Suggested reading –5.1 ~

CS 3214 Computer Systems Godmar Back Lecture 10. Announcements Stay tuned for Exercise 5 Project 2 due Sep 30 Auto-fail rule 2: –Need at least Firecracker.

1 Code Optimization. 2 Outline Machine-Independent Optimization –Code motion –Memory optimization Suggested reading –5.2 ~ 5.6.

Recitation 7: 10/21/02 Outline Program Optimization –Machine Independent –Machine Dependent Loop Unrolling Blocking Annie Luo

University of Amsterdam Computer Systems – optimizing program performance Arnoud Visser 1 Computer Systems Optimizing program performance.

1 Code Optimization. 2 Outline Optimizing Blockers –Memory alias –Side effect in function call Understanding Modern Processor –Super-scalar –Out-of –order.

Machine-Dependent Optimization CS 105 “Tour of the Black Holes of Computing”

University of Amsterdam Computer Systems – optimizing program performance Arnoud Visser 1 Computer Systems Optimizing program performance.

Code Optimization and Performance Chapters 5 and 9 perf01.ppt CS 105 “Tour of the Black Holes of Computing”

Introduction to ECE 454 Computer Systems Programming Topics: Lecture topics and assignments Profiling rudiments Lab schedule and rationale Cristiana Amza.

Code Optimization II: Machine Dependent Optimization Topics Machine-Dependent Optimizations Unrolling Enabling instruction level parallelism.

Machine Independent Optimizations Topics Code motion Reduction in strength Common subexpression sharing.

Code Optimization and Performance CS 105 “Tour of the Black Holes of Computing”

Programming for Performance CS 740 Oct. 4, 2000 Topics How architecture impacts your programs How (and how not) to tune your code.

Machine-Dependent Optimization CS 105 “Tour of the Black Holes of Computing”

1 Code Optimization. 2 Outline Machine-Independent Optimization –Code motion –Memory optimization Suggested reading –5.2 ~ 5.6.

Code Optimization and Performance I Chapter 5 perf01.ppt CS 105 “Tour of the Black Holes of Computing”

Real-World Pipelines Idea –Divide process into independent stages –Move objects through stages in sequence –At any given times, multiple objects being.

PipeliningPipelining Computer Architecture (Fall 2006)

Code Optimization and Performance II

Real-World Pipelines Idea Divide process into independent stages

Code Optimization.

Computer Architecture Chapter (14): Processor Structure and Function

Code Optimization II September 27, 2006

CS 3214 Introduction to Computer Systems

Machine-Dependent Optimization

Code Optimization II Dec. 2, 2008

Chapter 14 Instruction Level Parallelism and Superscalar Processors

Code Optimization I: Machine Independent Optimizations

Instructors: Dave O’Hallaron, Greg Ganger, and Greg Kesden

Code Optimization II: Machine Dependent Optimizations Oct. 1, 2002

Code Optimization II: Machine Dependent Optimizations

Machine-Dependent Optimization

Code Optimization and Performance

Code Optimization I: Machine Independent Optimizations Feb 11, 2003

Code Optimization(II)

Code Optimization April 6, 2000

Code Optimization I Nov. 25, 2008

Pipelined Implementation : Part I

COMP 2130 Intro Computer Systems Thompson Rivers University

Optimizing program performance

Program Optimization CSE 238/2038/2138: Systems Programming

Machine-Independent Optimization

Lecture 11: Machine-Dependent Optimization

Code Optimization and Performance

Code Optimization II October 2, 2001

Presentation transcript:

1 Code Optimization

2 Outline Optimizing Blockers –Memory alias –Side effect in function call Understanding Modern Processor –Super-scalar –Out-of –order execution More Code Optimization techniques Performance Tuning Suggested reading –5.1, 5.7 ~ 5.16

3 5.1 Capabilities and Limitations of Optimizing Compliers Review on 5.3 Program Example 5.4 Eliminating Loop Inefficiencies 5.5 Reducing Procedure Calls 5.6 Eliminating Unneeded Memory References

4 void combine1(vec_ptr v, data_t *dest) { int i; *dest = IDENT; for (i = 0; i < vec_length(v); i++) { int val; get_vec_element(v, i, &val); *dest = *dest OPER val; } Example P387

5 void combine2(vec_ptr v, int *dest) { int i; int length = vec_length(v); *dest = IDENT; for (i = 0; i < length; i++) { int val; get_vec_element(v, i, &val); *dest = *dest OPER val; } Example P388

6 void combine3(vec_ptr v, int *dest) { int i; int length = vec_length(v); int *data = get_vec_start(v); *dest = IDENT; for (i = 0; i < length; i++) { *dest = *dest OPER data[i]; } Example P392

7 void combine4(vec_ptr v, int *dest) { int i; int length = vec_length(v); int *data = get_vec_start(v); int x = IDENT; for (i = 0; i < length; i++) x = x OPER data[i]; *dest = x; } Example P394

8 Machine Independent Opt. Results Optimizations –Reduce function calls and memory references within loop

9 Machine Independent Opt. Results Performance Anomaly –Computing FP product of all elements exceptionally slow. –Very large speedup when accumulate in temporary –Memory uses 64-bit format, register use 80 –Benchmark data caused overflow of 64 bits, but not 80 Combine4 Combine3 Combine2 Combine1 P385 P388 P392 P394

10 Optimization Blockers P394 void combine4(vec_ptr v, int *dest) { int i; int length = vec_length(v); int *data = get_vec_start(v); int sum = 0; for (i = 0; i < length; i++) sum += data[i]; *dest = sum; }

11 Optimization Blocker: Memory Aliasing P394 Aliasing –Two different memory references specify single location Example –v: [3, 2, 17] –combine3(v, get_vec_start(v)+2) -->? –combine4(v, get_vec_start(v)+2) -->?

12 Optimization Blocker: Memory Aliasing Observations –Easy to have happen in C Since allowed to do address arithmetic Direct access to storage structures –Get in habit of introducing local variables Accumulating within loops Your way of telling compiler not to check for aliasing

13 Optimizing Compilers Provide efficient mapping of program to machine –register allocation –code selection and ordering –eliminating minor inefficiencies

14 Optimizing Compilers Don’t (usually) improve asymptotic efficiency –up to programmer to select best overall algorithm –big-O savings are (often) more important than constant factors but constant factors also matter Have difficulty overcoming “optimization blockers” –potential memory aliasing –potential procedure side-effects

15 Limitations of Optimizing Compilers Operate Under Fundamental Constraint –Must not cause any change in program behavior under any possible condition –Often prevents it from making optimizations when would only affect behavior under pathological conditions.

16 Limitations of Optimizing Compilers Behavior that may be obvious to the programmer can be obfuscated by languages and coding styles –e.g., data ranges may be more limited than variable types suggest e.g., using an “ int ” in C for what could be an enumerated type obfuscated ：混乱

17 Limitations of Optimizing Compilers Most analysis is performed only within procedures –whole-program analysis is too expensive in most cases Most analysis is based only on static information –compiler has difficulty anticipating run-time inputs When in doubt, the compiler must be conservative

18 Optimization Blockers P380 Memory aliasing void twiddle1(int *xp, int *yp) { *xp += *yp ; } void twiddle2(int *xp, int *yp) { *xp += 2* *yp ; }

19 Optimization Blockers P381 Function call and side effect int f(int) ; int func1(x) { return f(x)+f(x)+f(x)+f(x) ; } int func2(x) { return 4*f(x) ; }

20 Optimization Blockers P381 Function call and side effect int counter = 0 ; int f(int x) { return counter++ ; }

Understanding Modern Processors Overall Operation

22 Modern CPU Design Figure 5.11 P396 Execution Functional Units Instruction Control Integer/ Branch FP Add FP Mult/Div LoadStore Instruction Cache Data Cache Fetch Control Instruction Decode Address Instructions Operations Prediction OK? Data Addr. General Integer Operation Results Retirement Unit Register File Register Updates

23 Retirement Unit Register File Instruction Cache Fetch Control Instruction Decode Address Instructions Integer /branch General Integer FP Add FP mult/div LoadStore Functional units operations Predication OK? Data Cache Operation results addr data addr data Register Updates 1) 2) 3) 4) 5) (1)(2)(3)(4)(5)(6) (7)

24 Modern Processor P396 Superscalar –Perform multiple operations on every clock cycle Out-of-order execution –The order in which the instructions execute need not correspond to their ordering in the assembly program

25 Modern Processor P396 Two main parts –Instruction Control Unit Responsible for reading a sequence of instructions from memory Generating from above instructions a set of primitive operations to perform on program data –Execution Unit

26 1) Instruction Control Unit Instruction Cache –A special, high speed memory containing the most recently accessed instructions.

27 1) Instruction Control Unit Instruction Decoding Logic –Take actual program instructions –Converts them into a set of primitive operations –Each primitive operation performs some simple task Simple arithmetic, Load, Store addl %eax, 4(%edx) --- three operations load 4(%edx)  t1 addl %eax, t1  t2 store t2, 4(%edx) –Register renaming P397 P398

28 2) Fetch Control Fetch Ahead P396 –Fetches well ahead of currently accessed instructions –ICU has enough time to decode these –ICU has enough time to send decoded operations down to the EU

29 Fetch Control Branch Predication P397 –Branch taken or fall through –Guess whether branch is taken or not Speculative Execution P397 –Fetch, decode and execute only according to the branch prediction –Before the branch predication has been determined

Understanding Modern Processors Functional Unit Performance

31 Multi-functional Units Multiple Instructions Can Execute in Parallel –1 load –1 store –2 integer (one may be branch) –1 FP Addition –1 FP Multiplication or Division

32 Multi-functional Units Figure 5.12 P400 Some Instructions Take > 1 Cycle, but Can be Pipelined –InstructionLatency Cycles/Issue –Load / Store31 –Integer Multiply41 –Integer Divide3636 –Double/Single FP Multiply52 –Double/Single FP Add31 –Double/Single FP Divide3838

Understanding Modern Processors Overall Operation

34 Execution Unit Receives operations from ICU Each cycle it may receive more than one operation Operations are queued in buffer

35 Execution Unit Operation is dispatched to one of multi- functional units, whenever –All the operands of an operation are ready –Suitable functional units are available Execution results are passed among functional units (7) Data Cache P398 –A high speed memory containing the most recently accessed data values

36 4) Retirement Unit P398 Instructions need to commit in serial order –Misprediction –Exception Updates Architecture status –Memory and register values

A Closer Look at Processor Operation Translation Instruction into Operations

38 Translation Example P401.L24:# Loop: imull (%eax,%edx,4),%ecx# t *= data[i] incl %edx# i++ cmpl %esi,%edx# i:length jl.L24# if < goto Loop.L24: imull (%eax,%edx,4),%ecx incl %edx cmpl %esi,%edx jl.L24 load (%eax,%edx.0,4)  t.1 imull t.1, %ecx.0  %ecx.1 incl %edx.0  %edx.1 cmpl %esi, %edx.1  cc.1 jl-taken cc.1

39 Understanding Translation Example P401 Split into two operations –Load reads from memory to generate temporary result t.1 –Multiply operation just operates on registers imull (%eax,%edx,4),%ecxload (%eax,%edx.0,4)  t.1 imull t.1, %ecx.0  %ecx.1

40 Understanding Translation Example P401 Operands –Registers %eax does not change in loop. Values will be retrieved from register file during decoding imull (%eax,%edx,4),%ecxload (%eax,%edx.0,4)  t.1 imull t.1, %ecx.0  %ecx.1

41 Understanding Translation Example P401 Operands –Register %ecx changes on every iteration. –Uniquely identify different versions as %ecx.0, %ecx.1, %ecx.2, … –Register renaming Values passed directly from producer to consumers imull (%eax,%edx,4),%ecxload (%eax,%edx.0,4)  t.1 imull t.1, %ecx.0  %ecx.1

42 Understanding Translation Example P402 Register %edx changes on each iteration Renamed as %edx.0, %edx.1, %edx.2, … incl %edxincl %edx.0  %edx.1

43 Understanding Translation Example P402 Condition codes are treated similar to registers Assign tag to define connection between producer and consumer cmpl %esi,%edxcmpl %esi, %edx.1  cc.1

44 Understanding Translation Example P402 Instruction control unit determines destination of jump Predicts whether target will be taken Starts fetching instruction at predicted destination jl.L24jl-taken cc.1

45 Understanding Translation Example P401 Execution unit simply checks whether or not prediction was OK If not, it signals instruction control –Instruction control then “invalidates” any operations generated from misfetched instructions –Begins fetching and decoding instructions at correct target jl.L24jl-taken cc.1

46 Operations –Vertical position denotes time at which executed Cannot begin operation until operands available –Height denotes latency Operands –Arcs shown only for operands that are passed within execution unit cc.1 t.1 load %ecx.1 incl cmpl jl %edx.0 %edx.1 %ecx.0 imull load (%eax,%edx.0,4)  t.1 imull t.1, %ecx.0  %ecx.1 incl %edx.0  %edx.1 cmpl %esi, %edx.1  cc.1 jl-taken cc.1 Time Visualizing Operations Figure 5.13 P403

47 Operations –Same as before, except that add has latency of 1 load (%eax,%edx,4)  t.1 iaddl t.1, %ecx.0  %ecx.1 incl %edx.0  %edx.1 cmpl %esi, %edx.1  cc.1 jl-taken cc.1 Time cc.1 t.1 %ecx. i +1 incl cmpl jl load %edx.0 %edx.1 %ecx.0 addl %ecx.1 load Visualizing Operations Figure 5.14 P403

48 Unlimited Resource Analysis –Assume operation can start as soon as operands available –Operations for multiple iterations overlap in time Performance –Limiting factor becomes latency of integer multiplier –Gives CPE of Iterations of Combining Product Figure 5.15 P404

49 Unlimited Resource Analysis Performance –Can begin a new iteration on each clock cycle –Should give CPE of 1.0 –Would require executing 4 integer operations in parallel 4 integer ops 4 Iterations of Combining Sum Figure 5.16 P405

50 Combining Product: Resource Constraints Figure 5.17 P406 Figure 5.17 P406

51 Combining Sum: Resource Constraints Figure 5.18 P408

52 Combining Sum: Resource Constraints Only have two integer functional units Some operations delayed even though operands available Set priority based on program order Performance –Sustain CPE of 2.0

Reducing Loop Overhead

54 Loop unrolling P409 void combine5(vec_ptr v, int *dest) { int i; int length = vec_length(v); int *data = get_vec_start(v); int x = IDENT; /* combine 3 elements at a time */ for (i = 0; i < length-2; i+=3) x = x OPER data[i] OPER data[i+1] OPER data[i+2]; /* finish any remaining elements */ for (; i < length; i++) x = x OPER data[i]; *dest = x; }

55 –Loads can pipeline, since don’t have dependencies –Only one set of loop control operations load (%eax,%edx.0,4)  t.1a iaddl t.1a, %ecx.0c  %ecx.1a load 4(%eax,%edx.0,4)  t.1b iaddl t.1b, %ecx.1a  %ecx.1b load 8(%eax,%edx.0,4)  t.1c iaddl t.1c, %ecx.1b  %ecx.1c iaddl $3,%edx.0  %edx.1 cmpl %esi, %edx.1  cc.1 jl-taken cc.1 Visualizing Unrolled Loop P410

56 Time %edx.0 %edx.1 %ecx.0c cc.1 t.1a %ecx. i +1 addl cmpl jl addl %ecx.1c addl t.1b t.1c %ecx.1a %ecx.1b load Measured CPE = 1.33 Visualizing Unrolled Loop Figure 5.20 P410

57 Executing with Loop Unrolling Figure 5.21 P411

58 Executing with Loop Unrolling Predicted Performance –Can complete iteration in 3 cycles –Should give CPE of 1.0 Measured Performance –CPE of 1.33 –One iteration every 4 cycles

59 Unrolling Degree IntegerSum IntegerProduct4.00 FPSum3.00 FPProduct5.00 Effect of Unrolling P411

60 Effect of Unrolling Only helps integer sum for our examples –Other cases constrained by functional unit latencies Effect is nonlinear with degree of unrolling –Many subtle effects determine exact scheduling of operations

Converting to Pointer Code

62 void combine4p(vec_ptr v, int *dest) { int i; int length = vec_length(v); int *data = get_vec_start(v); int *dend = data + length ; int x = IDENT; for (; data < dend ; data++ ) x = x OPER *data; *dest = x; } Example P413

63 Some compilers and processors do better job optimizing array code FunctionIntegerFloating pointer + * Combine Combine4p Pointer Code vs. Array Code P414

64.L24:# Loop: addl (%eax,%edx,4),%ecx# x += data[i] incl %edx# i++ cmpl %esi,%edx# i:length jl.L24# if < goto Loop.L30:# Loop: addl (%eax),%ecx# x += *data addl $4,%eax# data ++ cmpl %edx,%eax# data:dend jb.L30# if < goto Loop Pointer vs. Array Code Inner Loops P414

65 Performance –Array Code: 4 instructions in 2 clock cycles –Pointer Code: Almost same 4 instructions in 3 clock cycles Pointer vs. Array Code Inner Loops

Enhancing Parallelism

67 void combine6(vec_ptr v, int *dest) { int i; int length = vec_length(v); int *data = get_vec_start(v); int x0 = IDENT, x1 = IDENT; /* combine 2 elements at a time */ for (i = 0; i < length; i+=2){ x0 = x0 OPER data[i]; x1 = x1 OPER data[i+1]; } /* finish any remaining elements */ for (; i < length; i++) x0 = x0 OPER data[i]; *dest = x0 OPER x1; } Loop Splitting P416

68 Loop Splitting Optimization –Accumulate in two different sums Can be performed simultaneously –Combine at end –Exploits property that integer addition & multiplication are associative & commutative –FP addition & multiplication not associative, but transformation usually acceptable Associative: 可结合的 Commutative: 可交换的

69 load (%eax,%edx.0,4)  t.1a imull t.1a, %ecx.0  %ecx.1 load 4(%eax,%edx.0,4)  t.1b imull t.1b, %ebx.0  %ebx.1 iaddl $2,%edx.0  %edx.1 cmpl %esi, %edx.1  cc.1 jl-taken cc.1 Visualizing Parallel Loop P417 Two multiplies within loop no longer have data dependency Allows them to pipeline

70 Time %edx.1 %ecx.0 %ebx.0 cc.1 t.1a imull %ecx.1 addl cmpl jl %edx.0 imull %ebx.1 t.1b load Visualizing Parallel Loop Figure 5.25 P417

71 Executing with Parallel Loop Figure 5.26 P418

72 Optimization Results for Combining P419

73 Optimization Results for Combining Register spilling – only 6 registers available –Using memory as storage Register spilling –movl -12(%ebp), %edi –imull 24(%eax), %edi –movl %edi, -12(%ebp)

Putting it Together: Summary of Results for Optimizing Combining Code Floating-Point Performance Anomaly Changing Platforms

Branch Prediction and Misprediction Penalties

76 What About Branches? Challenge –Instruction Control Unit must work well ahead of Exec. Unit –To generate enough operations to keep EU busy

f3:movl $0x1,%ecx 80489f8:xorl %edx,%edx 80489fa:cmpl %esi,%edx 80489fc:jnl 8048a fe:movl %esi,%esi 8048a00:imull (%eax,%edx,4),%ecx Executing Fetching & Decoding What About Branches?

78 What About Branches? Challenge –When encounters conditional branch, cannot reliably determine where to continue fetching

79 Branch Outcomes When encounter conditional branch, cannot determine where to continue fetching –Branch Taken: Transfer control to branch target –Branch Not-Taken: Continue with next instruction in sequence Cannot resolve until outcome determined by branch/integer unit

f3:movl $0x1,%ecx 80489f8:xorl %edx,%edx 80489fa:cmpl %esi,%edx 80489fc:jnl 8048a fe:movl %esi,%esi 8048a00:imull (%eax,%edx,4),%ecx 8048a25:cmpl %edi,%edx 8048a27:jl 8048a a29:movl 0xc(%ebp),%eax 8048a2c:leal 0xffffffe8(%ebp),%esp 8048a2f:movl %ecx,(%eax) Branch Taken Branch Not-Taken Branch Outcomes

81 Branch Prediction Idea –Guess which way branch will go –Begin executing instructions at predicted position But don’t actually modify register or memory data

f3:movl $0x1,%ecx 80489f8:xorl %edx,%edx 80489fa:cmpl %esi,%edx 80489fc:jnl 8048a a25:cmpl %edi,%edx 8048a27:jl 8048a a29:movl 0xc(%ebp),%eax 8048a2c:leal 0xffffffe8(%ebp),%esp 8048a2f:movl %ecx,(%eax) Predict Taken Execute Branch Prediction

b1:movl (%ecx,%edx,4),%eax 80488b4:addl %eax,(%edi) 80488b6:incl %edx 80488b7:cmpl %esi,%edx 80488b9:jl 80488b b1:movl (%ecx,%edx,4),%eax 80488b4:addl %eax,(%edi) 80488b6:incl %edx 80488b7:cmpl %esi,%edx 80488b9:jl 80488b b1:movl (%ecx,%edx,4),%eax 80488b4:addl %eax,(%edi) 80488b6:incl %edx 80488b7:cmpl %esi,%edx 80488b9:jl 80488b1 i = 98 i = 99 i = 100 Predict Taken (OK) Predict Taken (Oops) 80488b1:movl (%ecx,%edx,4),%eax 80488b4:addl %eax,(%edi) 80488b6:incl %edx 80488b7:cmpl %esi,%edx 80488b9:jl 80488b1 i = 101 Assume vector length = 100 Read invalid location Executed Fetched Branch Prediction Through Loop

b1:movl (%ecx,%edx,4),%eax 80488b4:addl %eax,(%edi) 80488b6:incl %edx 80488b7:cmpl %esi,%edx 80488b9:jl 80488b b1:movl (%ecx,%edx,4),%eax 80488b4:addl %eax,(%edi) 80488b6:incl %edx 80488b7:cmpl %esi,%edx 80488b9:jl 80488b b1:movl (%ecx,%edx,4),%eax 80488b4:addl %eax,(%edi) 80488b6:incl %edx 80488b7:cmpl %esi,%edx 80488b9:jl 80488b1 i = 98 i = 99 i = 100 Predict Taken (OK) Predict Taken (Oops) 80488b1:movl (%ecx,%edx,4),%eax 80488b4:addl %eax,(%edi) 80488b6:incl %edx i = 101 Invalidate Assume vector length = 100 Branch Misprediction Invalidation

b1:movl (%ecx,%edx,4),%eax 80488b4:addl %eax,(%edi) 80488b6:incl %edx 80488b7:cmpl %esi,%edx 80488b9:jl 80488b b1:movl (%ecx,%edx,4),%eax 80488b4:addl %eax,(%edi) 80488b6:incl %edx 80488b7:cmpl %esi,%edx 80488b9:jl 80488b bb:leal 0xffffffe8(%ebp),%esp 80488be:popl %ebx 80488bf:popl %esi 80488c0:popl %edi i = 98 i = 99 Predict Taken (OK) Definitely not taken Assume vector length = 100 Branch Misprediction Recovery

86 Branch Misprediction Recovery P427 Performance Cost –Misprediction on Pentium III wastes ~14 clock cycles –That’s a lot of time on a high performance processor

87 Misprediction penalty is about 14 cycles in PIII machine Conditional mov is used to avoid the misprediction penalty when the branch outcome is not predictable For example: int absval(int val) { return (val <0)? –val : val } Conditional Jump Figure 5.29 P427

88 Conditional Jump P428 movl 8(%ebp), %eaxGet val as result movl %eax, %edxCopy to %edx negl %edxNegate %edx testl %eax, %eaxTest Val cmov1 %edx, %eaxif <0 copy %edx to result

Understanding Memory Performance

90 typedef struct ELE { struct ELE *next ; int data ; } list_ele, *list_ptr ; int list_len(list_ptr ls) { int len = 0 ; for (;ls;ls=ls->next) len++ ; return len ; } Assembly Instructions.L27: incl %eax movl(%edx), %edx testl %edx, %edx jne.L27 Execution unit operations incl %eax.0  %eax.1 load (%edx.0)  %edx.1 testl %edx.1, %edx.1  cc.1 jne-taken cc.1 Load Latency P429, P430 Figure 5.30 P430

91 incl testl jne %eax.0 %edx.0 incl testl jne load incl testl jne load %eax.1 %eax.2 %eax.3 %edx.1 %edx.2 %edx.3 cc.1 cc.2 cc.3 i=1 i=2 i= Figure 5.31 P430

92 Store Latency Figure 5.32 P431 void array_clear(int *dest, int n) { int i; for ( i = 0 ; i < n ; i++) dest[i] = 0 ; } CPE 2.0

93 Store Latency Figure 5.32 P431 void array_clear(int *dest, int n) { int i; int len = n-7 ; for ( i = 0 ; i < len ; i++) { dest[i] = dest[i+1] = dest[i+2] = dest[i+3] = 0 ; dest[i+4] = dest[i+5] = dest[i+6] = dest[i+7] = 0 ; } for ( ; i < n ; i++) dest[i] = 0 ; } CPE 1.25

94 Store latency Figure 5.33 P432 void write_read(int *src, int *dest, int n) { int cnt = n; int val = 0; while (cnt--) { *dest = val; val = (*src)+1; }

95 Store latency Figure 5.33 P432 write_read(&a[0], &a[1], 3) initialiter. 1iter. 2iter. 3 cnt3210 a(-10, 17)(-10, 0)(-10, -9)(-10, -9) val write_read(&a[0], &a[0], 3) initialiter. 1iter. 2iter. 3 cnt3210 a(-10, 17)(0, 17)(1, 17)(2, 17) val0123

96 Store latency void write_read(int *src, int *dest, int n) { int cnt = n; int val = 0; while (cnt--) { *dest = val; val = (*src)+1; }

97 Store latency P434.L32: movl %edx, (%ecx) movl (%ebx), %edx incl %edx decl %eax jnc.L32 storeaddr (%ecx) storedata %edx.0 load (%ebx)  %edx.1a incl %edx.1a  %edx.1b decl %eax.0  %eax.1 jnc-taken cc.1

98 %eax decl store data store addr load incl jncdecl store data store addr load incl jnc  %edx.1a %edx.1b cc.1 %eax.1  %edx.2a %edx.2b %eax.0 %edx.0 Store latency Figure 5.35 P434

decl store data store addr load incl jncdecl Store data store addr incl jnc = %edx.1a %edx.1b cc.1 %eax.1 = %edx.2b %eax.2 %eax.0 %edx.0 load %edx.2a Figure 5.36 P435

Life in the Real World: Performance Improvement Techniques

101 Basic Strategies for Optimizing Program Performance High-level design Basic coding principles Eliminate excessive function calls Eliminate unnecessary memory references Low-level optimizations Try various forms of pointer versus array code. Reduce loop overhead by unrolling loops. Find ways to make use of the pipelined functional units by techniques such as iteration splitting

Identifying and Eliminating Performance Bottlenecks

103 Performance Tuning Identify –Which is the hottest part of the program –Using a very useful method profiling Instrument the program Run it with typical input data Collect information from the result Analysis the result –gprof example $gcc –O2 –pg prog.c –o prog $prog file.text (generate new file gmon.out) $gprof prog (with gmon.out)

104 Example Task –Count word frequencies in text document –Sort the words in descending order of occurence Steps –Convert strings to lower case –Apply hash function –Read words and insert into hash table Mostly list operations Maintain counter for each unique word –Sort results

105 Examples unix> gcc –O2 –pg prog.c –o prog unix>./prog file.txt unix> gprof prog % cumulative self self total time seconds seconds calls ms/call ms/call name sort_words lower find_ele_rec h_add

106 Branch Misprediction Recovery Performance Cost –Misprediction on Pentium III wastes ~14 clock cycles –That’s a lot of time on a high performance processor

find_ele_rec [5] / insert_string [4] [5] find_ele_rec [5] /26946 save_string [9] /26946 new_ele [11] find_ele_rec [5] Example P439

108 Principle Interval counting –Maintain a counter for each function Record the time spent executing this function –Interrupted at regular time (1ms) Check which function is executing when interrupt occurs Increment the counter for this function

109 Data Set P439 Collected works of Shakespeare 946,596 total words, 26,596 unique Initial implementation: 9.2 seconds

110 Code Optimizations –First step: Use more efficient sorting function –Library function qsort Figure 5.37 P441 1)2)3)4)5)6)7)

111 Further Optimizations 2)3)4)5)6)7)1)

112 Example 3) Iter first: Use iterative function to insert elements in linked list –Causes code to slow down 4) Iter last: Iterative function, places new entry at end of list –Tend to place most common words at front of list 5) Big table: Increase number of hash buckets 6) Better hash: Use more sophisticated hash function 7) Linear lower: Move strlen out of loop

113 Code Motion Example#2 void lower(char *s) { int i; for (i = 0; i < strlen(s); i++) if (s[i] >= 'A' && s[i] <= 'Z') s[i] -= ('A' - 'a'); } void lower(char *s) { int i = 0; if (i >= strlen(s)) goto done; loop: if (s[i] >= 'A' && s[i] <= 'Z') s[i] -= ('A' - 'a'); i++; if (i < strlen(s)) goto loop; done: }

114 Lower Case Conversion Performance –Time quadruples when double string length –Quadratic performance

115 Time quadruples when double string length Quadratic performance Lower Case Conversion Performance

116 Move call to strlen outside of loop Since result does not change from one iteration to another Form of code motion void lower(char *s) { int i; int len = strlen(s); for (i = 0; i < len; i++) if (s[i] >= 'A' && s[i] <= 'Z') s[i] -= ('A' - 'a'); } Improving Performance

117 Lower Case Conversion Performance –Time doubles when double string length –Linear performance

118 Benefits –Helps identify performance bottlenecks –Especially useful when have complex system with many components Limitations –Only shows performance for data tested –E.g., linear lower did not show big gain, since words are short Quadratic inefficiency could remain lurking in code –Timing mechanism fairly crude Only works for programs that run for > 3 seconds Performance Tuning

119 T new = (1-  )T old + (  T old )/k = T old [(1-  ) +  /k] S = T old / T new = 1/[(1-  ) +  /k] S  = 1/(1-  ) Amdahl’s Law P443