Presentation transcript:

1 Code Optimization (II)

2 Outline
Understanding Modern Processor
–Super-scalar
–Out-of-order execution
Suggested reading
–5.14, 5.7

3 Modern CPU Design

4 How is it possible?

Combine3: 5 instructions in 6 clock cycles
.L18:
  movl (%ecx,%edx,4),%eax
  addl %eax,(%edi)
  incl %edx
  cmpl %esi,%edx
  jl .L18

Combine4: 4 instructions in 2 clock cycles
.L24:
  addl (%eax,%edx,4),%ecx
  incl %edx
  cmpl %esi,%edx
  jl .L24

5 Exploiting Instruction-Level Parallelism
Need a general understanding of modern processor design
–Hardware can execute multiple instructions in parallel
Performance limited by data dependencies
Simple transformations can yield dramatic performance improvement
–Compilers often cannot make these transformations
–Lack of associativity and distributivity in floating-point arithmetic

6 Modern Processor
Superscalar
–Performs multiple operations on every clock cycle
Out-of-order execution
–The order in which the instructions execute need not correspond to their ordering in the assembly program

7 [Block diagram: the Instruction Control Unit (instruction cache, fetch control, instruction decode, retirement unit, register file) sends operations to the Execution Unit's functional units (integer/branch, FP add, FP mult/div, load, store), which access memory through the data cache and send register updates and branch-prediction outcomes back to the ICU.]

8 Modern Processor
Two main parts
–Instruction Control Unit (ICU)
 Responsible for reading a sequence of instructions from memory
 Generates from these instructions a set of primitive operations to perform on program data
–Execution Unit (EU)
 Executes these operations

9 Instruction Control Unit
Instruction Cache
–A special, high-speed memory containing the most recently accessed instructions

10 Instruction Control Unit
Fetch Control
–Fetches ahead of the instructions currently being executed, leaving enough time to decode them and send the decoded operations down to the EU

11 Fetch Control
Branch Prediction
–Branch taken or fall through
–Guess whether the branch is taken or not
Speculative Execution
–Fetch, decode, and execute instructions according to the branch prediction
–Before the branch outcome has been determined

12 Instruction Control Unit
Instruction Decoding Logic
–Takes actual program instructions

13 Instruction Control Unit
Instruction Decoding Logic
–Takes actual program instructions
–Converts them into a set of primitive operations
 An instruction can be decoded into a variable number of operations
–Each primitive operation performs some simple task
 Simple arithmetic, Load, Store
–Register renaming

Example: addl %eax, 4(%edx) decodes into
  load 4(%edx) → t1
  addl %eax, t1 → t2
  store t2, 4(%edx)

14 Execution Unit
Multi-functional Units
–Receive operations from the ICU
–Execute a number of operations on each clock cycle
–Handle specific types of operations
[Diagram: functional units (integer/branch, FP add, FP mult/div, load, store) connected to the data cache]

15 Multi-functional Units
Multiple instructions can execute in parallel
–Nehalem CPU (Core i7)
 1 load, with address computation
 1 store, with address computation
 2 simple integer (one may be branch)
 1 complex integer (multiply/divide)
 1 FP multiply
 1 FP add

16 Multi-functional Units
Some instructions take more than one cycle, but can be pipelined

Nehalem (Core i7)
Instruction                  Latency   Cycles/Issue
Integer Add                     1           1
Integer Multiply                3           1
Integer/Long Divide
Single/Double FP Add            3           1
Single/Double FP Multiply      4/5          1
Single/Double FP Divide
(The divide entries were lost in this transcript.)

17 Execution Unit
An operation is dispatched to one of the multi-functional units whenever
–All the operands of the operation are ready
–A suitable functional unit is available
Execution results are passed among functional units
Data Cache
–A high-speed memory containing the most recently accessed data values

18 Execution Unit
Data Cache
–Load and store units access memory via the data cache
–A high-speed memory containing the most recently accessed data values

19 Instruction Control Unit
Retirement Unit
–Keeps track of the ongoing processing
–Obeys the sequential semantics of the machine-level program (misprediction & exception)

20 Instruction Control Unit
Register File
–Integer, floating-point, and other registers
–Controlled by the Retirement Unit

21 Instruction Control Unit
Instructions Retired/Flushed
–Instructions are placed into a first-in, first-out queue
–Retired: any updates to the registers are made
 Operations of the instruction have completed
 Any branch predictions leading to the instruction are confirmed correct
–Flushed: discard any results that have been computed
 Some branch prediction was mispredicted
 Mispredictions thus cannot alter the program state

22 Execution Unit
Operation Results
–Functional units can send results directly to each other
–An elaborate form of data forwarding

23 Execution Unit
Register Renaming
–Values passed directly from producer to consumers
–A tag t is generated for the result of each operation
 E.g. %ecx.0, %ecx.1
–Renaming table
 Maintains the association between program register r and tag t for an operation that will update this register

24 Data-Flow Graphs
–Visualize how the data dependencies in a program dictate its performance
–Example: combine4 (data_t = float, OP = *)

void combine4(vec_ptr v, data_t *dest)
{
    long int i;
    long int length = vec_length(v);
    data_t *data = get_vec_start(v);
    data_t x = IDENT;
    for (i = 0; i < length; i++)
        x = x OP data[i];
    *dest = x;
}

25 Translation Example

.L488:                          # Loop:
  mulss (%rax,%rdx,4),%xmm0     # t *= data[i]
  addq $1, %rdx                 # Increment i
  cmpq %rdx,%rbp                # Compare length:i
  jg .L488                      # If >, goto Loop

Translated into operations:
  load (%rax,%rdx.0,4) → t.1
  mulq t.1, %xmm0.0 → %xmm0.1
  addq $1, %rdx.0 → %rdx.1
  cmpq %rdx.1, %rbp → cc.1
  jg-taken cc.1

26 Understanding Translation Example
Split into two operations
–Load reads from memory to generate temporary result t.1
–Multiply operation just operates on registers

mulss (%rax,%rdx,4),%xmm0 becomes
  load (%rax,%rdx.0,4) → t.1
  mulq t.1, %xmm0.0 → %xmm0.1

27 Understanding Translation Example
Operands
–Register %rax does not change in the loop
–Its value will be retrieved from the register file during decoding

mulss (%rax,%rdx,4),%xmm0 becomes
  load (%rax,%rdx.0,4) → t.1
  mulq t.1, %xmm0.0 → %xmm0.1

28 Understanding Translation Example
Operands
–Register %xmm0 changes on every iteration
–Uniquely identify different versions as %xmm0.0, %xmm0.1, %xmm0.2, …
–Register renaming
 Values passed directly from producer to consumers

mulss (%rax,%rdx,4),%xmm0 becomes
  load (%rax,%rdx.0,4) → t.1
  mulq t.1, %xmm0.0 → %xmm0.1

29 Understanding Translation Example
Register %rdx changes on each iteration
–Renamed as %rdx.0, %rdx.1, %rdx.2, …

addq $1, %rdx becomes addq $1, %rdx.0 → %rdx.1

30 Understanding Translation Example
Condition codes are treated similarly to registers
–Assign a tag to define the connection between producer and consumer

cmpq %rdx,%rbp becomes cmpq %rdx.1, %rbp → cc.1

31 Understanding Translation Example
The instruction control unit determines the destination of the jump
–Predicts whether the branch will be taken
–Starts fetching instructions at the predicted destination

jg .L488 becomes jg-taken cc.1

32 Understanding Translation Example
The execution unit simply checks whether or not the prediction was OK
If not, it signals the instruction control unit
–The instruction control unit then "invalidates" any operations generated from misfetched instructions
–It begins fetching and decoding instructions at the correct target

jg .L488 becomes jg-taken cc.1

33 Graphical Representation

mulss (%rax,%rdx,4), %xmm0     load (%rax,%rdx.0,4) → t.1
                               mulq t.1, %xmm0.0 → %xmm0.1
addq $1,%rdx                   addq $1, %rdx.0 → %rdx.1
cmpq %rdx,%rbp                 cmpq %rdx.1, %rbp → cc.1
jg loop                        jg-taken cc.1

Registers
–Read-only: %rax, %rbp
–Write-only: none
–Loop: %rdx, %xmm0
–Local: t, cc

[Data-flow graph: load, mul, add, cmp, and jg nodes connecting the input registers %rax, %rbp, %rdx, %xmm0 to their output versions]

34 Refinement of Graphical Representation
Data Dependencies
[Diagram: the full graph is simplified by dropping the read-only registers %rax and %rbp and the local values t and cc, leaving the loop registers %rdx and %xmm0 flowing through the load, mul, add, cmp, and jg operations]

35 Refinement of Graphical Representation
Data Dependencies
[Diagram: cmp and jg are dropped, leaving one iteration's core operations: load feeding mul (which updates %xmm0 using data[i]) and add (which updates %rdx)]

36 Refinement of Graphical Representation
[Diagram: the per-iteration template is replicated; iteration k takes %rdx.k and %xmm0.k, loads data[k], multiplies into %xmm0.k+1, and increments to %rdx.k+1]

37 Refinement of Graphical Representation
[Diagram: chaining n iterations (data[0] through data[n-1]) exposes two parallel dependency chains, one through the mul operations on %xmm0 and one through the add operations on %rdx]

38 Refinement of Graphical Representation
[Table: measured CPEs of combine4 for integer +, integer *, FP +, single-precision F*, and double-precision D*]
Two chains of data dependencies
–Update x by mul
–Update i by add
Critical path
–Latency of mul is 4
–Latency of add is 1
–The CPE of combine4 (data_t = float, OP = *) is therefore bounded by 4

39 Performance-limiting Critical Path

Nehalem (Core i7)
Instruction                  Latency   Cycles/Issue
Integer Add                     1           1
Integer Multiply                3           1
Integer/Long Divide
Single/Double FP Add            3           1
Single/Double FP Multiply      4/5          1
Single/Double FP Divide
(The divide entries were lost in this transcript.)

[Table: measured CPEs of combine4 for integer +, integer *, FP +, F*, and D*]

40 Other Performance Factors
The data-flow representation provides only a lower bound
–e.g. integer addition, CPE = 2.0
–Total number of functional units available
–The number of data values that can be passed among functional units
Next step
–Enhance instruction-level parallelism
–Goal: CPEs close to 1.0

41 Next Class
More code optimization techniques
Suggested reading
–5.8 ~ 5.13