MAMAS – Computer Architecture: Out-Of-Order Execution. Dr. Lihu Rappoport, 11/2004.


What's Next
• Remember our goal: minimize CPU time
    CPU Time = clock cycle × CPI × IC
• So far we have learned:
  – Minimize clock cycle ⇒ add more pipe stages
  – Minimize CPI ⇒ use a pipeline
  – Minimize IC ⇒ architecture
• In a pipelined CPU:
  – CPI without hazards is 1
  – CPI with hazards is > 1
• Adding more pipe stages reduces the clock cycle but increases CPI:
  – Higher penalty due to control hazards
  – More data hazards
• Beyond some point, adding more pipe stages does not help
• What can we do? Further reduce the CPI!
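The trade-off between clock cycle and CPI can be checked with a few lines of arithmetic. The sketch below simply evaluates CPU Time = clock cycle × CPI × IC; the function name `cpu_time` and all numbers are hypothetical, chosen only to illustrate a case where a deeper pipeline halves the clock cycle yet ends up slower because hazards raise the CPI.

```python
def cpu_time(clock_cycle_ns: float, cpi: float, instruction_count: int) -> float:
    """CPU time in nanoseconds: clock cycle x CPI x IC."""
    return clock_cycle_ns * cpi * instruction_count


# Hypothetical numbers (not from the slides), same program on two designs.
IC = 1_000_000
shallow = cpu_time(clock_cycle_ns=1.0, cpi=1.2, instruction_count=IC)
deep = cpu_time(clock_cycle_ns=0.5, cpi=2.5, instruction_count=IC)

# The deeper pipeline has a shorter cycle, but its higher CPI outweighs it.
print(f"shallow pipeline: {shallow:.0f} ns, deep pipeline: {deep:.0f} ns")
```

With these made-up inputs the deeper pipeline loses, which is the situation the last two bullets describe.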

A Superscalar CPU
• Just duplicating the hardware in one pipe stage (e.g., having 2 ALUs) won't help:
  – The bottleneck will simply move to other stages
• In order to have CPI < 1, we need to be able to fetch, decode, execute, and retire more than a single instruction per clock
[Figure: pipeline stages IF, ID, EXE, MEM, WB]

The Pentium® Processor
• Fetches and decodes 2 instructions per cycle
• Before the register file read, a pairing decision is made: can the two instructions be executed in parallel?
• The pairing decision is based on:
  – Data dependencies: the instructions must be independent
  – Resources:
      · Some instructions use resources from both pipes
      · The second pipe can execute only a subset of the instructions
[Figure: IF, ID, pairing, U-pipe, V-pipe]

Misprediction Penalty in a Superscalar CPU
• MPI: misses per instruction
    MPI = (# of incorrectly predicted branches) / (total # of instructions)
        = MPR × (# of predicted branches) / (total # of instructions)
• MPI correlates well with performance. For example, assume:
  – MPR = 5%, %branches = 20% ⇒ MPI = 1%
  – Without hazards, CPI = 0.5 (2 instructions per cycle)
  – Flush penalty of 5 cycles
• We get:
  – MPI = 1% ⇒ one flush every 100 instructions
  – One flush every 50 cycles (since CPI = 0.5)
  – A 5-cycle flush penalty every 50 cycles ⇒ a 10% performance loss
  – (For CPI = 1 we get a 5-cycle flush penalty every 100 cycles ⇒ a 5% performance loss)
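The arithmetic on this slide can be reproduced directly. The sketch below is only an illustration: the function name `mispredict_overhead` and its parameter names are made up here, and the overhead is expressed the same way the slide does, as flush cycles relative to hazard-free cycles.

```python
def mispredict_overhead(mpr, branch_fraction, base_cpi, flush_penalty):
    """Return (MPI, overhead), with overhead = flush cycles / hazard-free cycles."""
    mpi = mpr * branch_fraction              # mispredictions per instruction
    instructions_per_flush = 1 / mpi         # e.g. 100 instructions for MPI = 1%
    cycles_per_flush = instructions_per_flush * base_cpi
    return mpi, flush_penalty / cycles_per_flush


# Numbers taken from the slide: MPR = 5%, 20% branches, 5-cycle flush.
mpi, superscalar = mispredict_overhead(0.05, 0.20, base_cpi=0.5, flush_penalty=5)
_, scalar = mispredict_overhead(0.05, 0.20, base_cpi=1.0, flush_penalty=5)

print(f"MPI = {mpi:.0%}, overhead at CPI 0.5 = {superscalar:.0%}, at CPI 1 = {scalar:.0%}")
```

The same misprediction rate costs twice as much, relatively, on the machine with the lower base CPI, because each flush displaces more useful work.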

Is Superscalar Good Enough?
• A superscalar processor can fetch, decode, execute and retire two instructions in parallel
• However, two instructions can be executed in parallel only if they are independent
• But adjacent instructions are usually dependent
  – The utilization of the second pipe is usually low
  – There are algorithms in which both pipes are highly utilized
• What to do?
• Out-of-order execution:
  – Execute instructions out of program order
  – Must still produce the same result as the original program

Out Of Order Execution
• Execute instructions based on "data flow" rather than program order
• Basic idea: look ahead in a window of instructions and find instructions that are ready to execute:
  – They have no data dependencies on previous instructions that have not yet executed
  – Resources are available
• Out-of-order execution is: starting the execution stage of an instruction before the execution stage of a previous instruction
• Advantages:
  – Helps exploit Instruction Level Parallelism (ILP)
  – Helps cover latencies (e.g., cache miss, divide)
• Can compilers do the work?
  – Compilers can statically reschedule instructions
  – Compilers do not have run-time information (e.g., cache misses, conditional branch direction ⇒ limited to basic blocks)

Data Flow Analysis
• Example: assume that a divide operation takes 20 cycles.
    (1) r1 ← r4 / r7
    (2) r8 ← r1 + r2
    (3) r5 ← r5 + 1
    (4) r6 ← r6 - r3
    (5) r4 ← r5 + r6
    (6) r7 ← r8 * r
[Figure: data flow graph; in-order execution vs. out-of-order execution]
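A rough way to compare the two timelines is to compute when each instruction can start. The sketch below is a simplified model, not the slide's own figure. It assumes a 20-cycle divide and 1 cycle for everything else, unlimited execution units, in-order issue of at most one instruction per cycle, and it tracks true (RAW) dependencies only. The second operand of instruction 6 is truncated in the slide text, so only its known source, r8, is modelled.

```python
# Each entry: (destination, source registers, latency in cycles).
PROGRAM = [
    ("r1", ["r4", "r7"], 20),  # (1) r1 <- r4 / r7  (divide: 20 cycles)
    ("r8", ["r1", "r2"], 1),   # (2) r8 <- r1 + r2
    ("r5", ["r5"], 1),         # (3) r5 <- r5 + 1
    ("r6", ["r6", "r3"], 1),   # (4) r6 <- r6 - r3
    ("r4", ["r5", "r6"], 1),   # (5) r4 <- r5 + r6
    ("r7", ["r8"], 1),         # (6) r7 <- r8 * ?   (operand truncated in source)
]


def finish_times(program, in_order):
    """Cycle at which each instruction finishes under the stated assumptions."""
    ready = {}        # register -> cycle its most recent earlier value is ready
    finish = []
    prev_start = -1
    for dst, srcs, latency in program:
        # An instruction can start once all of its source values are ready.
        start = max((ready.get(s, 0) for s in srcs), default=0)
        if in_order:
            # In-order issue: cannot start before the previous instruction.
            start = max(start, prev_start + 1)
        prev_start = start
        ready[dst] = start + latency
        finish.append(start + latency)
    return finish


in_order_finish = finish_times(PROGRAM, in_order=True)
ooo_finish = finish_times(PROGRAM, in_order=False)
print("in-order:    ", in_order_finish, "total", max(in_order_finish))
print("out-of-order:", ooo_finish, "total", max(ooo_finish))
```

Under these assumptions instructions 3, 4 and 5 finish while the divide is still running, so only instructions 2 and 6 remain on the critical path behind it.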

OOOE - General Scheme
• Fetch and decode instructions in parallel, but in order, to fill an instruction pool
• Out of the instructions in the window, execute those that are ready:
  – All the data required by the instruction is ready
  – Execution resources are available
• An executed instruction signals all dependent instructions that its data is ready
• Commit a few instructions in order
  – An instruction can commit only after all preceding instructions (in program order) have committed
[Figure: Fetch & Decode (in-order) → Instruction pool → Execute (out-of-order) → Retire/commit (in-order)]
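The in-order commit rule can be expressed as a running maximum over completion times. The sketch below is a minimal illustration with a hypothetical function name, `commit_cycles`. It assumes an unlimited commit width, whereas the slide says only "a few" instructions commit per cycle. The finish cycles in the usage line are made-up inputs.

```python
def commit_cycles(finish_cycles):
    """Earliest commit cycle per instruction, given finish cycles in program order.

    An instruction commits only once it has finished AND every earlier
    instruction has committed. Commit width is assumed unlimited.
    """
    commits = []
    last_commit = 0
    for done in finish_cycles:
        last_commit = max(last_commit, done)
        commits.append(last_commit)
    return commits


# Hypothetical finish cycles: instructions 3-5 finish early, out of order.
example = commit_cycles([20, 21, 1, 1, 2, 22])
print(example)  # [20, 21, 21, 21, 21, 22]
```

Instructions that finished early still wait in the pool until the long-latency instruction ahead of them commits.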

Out Of Order Execution - Example
• Assume that executing a divide operation takes 20 cycles.
    (1) r1 ← r5 / r4
    (2) r3 ← r1 + r8
    (3) r8 ← r5 + 1
    (4) r3 ← r7 - 2
    (5) r6 ← r6 + r7
• Inst2 has a RAW dependency on r1 with Inst1, so it cannot be executed in parallel with Inst1
• Can successive instructions pass Inst2?
  – What about Inst3? No: Inst2 must read r8 before Inst3 writes to it
  – What about Inst4? No: Inst4 must write to r3 after Inst2
  – What about Inst5? Yes
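The three answers above follow mechanically from comparing destination and source registers. The sketch below encodes the five instructions of this example and classifies the dependency between an earlier and a later instruction. The `Inst` type and the `hazards` function are illustrative names, not part of the slides.

```python
from typing import NamedTuple, Tuple


class Inst(NamedTuple):
    dst: str
    srcs: Tuple[str, ...]


def hazards(earlier: Inst, later: Inst):
    """Dependencies that stop `later` from freely passing `earlier`."""
    found = []
    if earlier.dst in later.srcs:
        found.append("RAW")   # later reads what earlier writes (true dependency)
    if later.dst in earlier.srcs:
        found.append("WAR")   # later overwrites a register earlier still reads
    if later.dst == earlier.dst:
        found.append("WAW")   # both write the same register
    return found


# The example program, keyed by instruction number.
PROGRAM = {
    1: Inst("r1", ("r5", "r4")),  # r1 <- r5 / r4
    2: Inst("r3", ("r1", "r8")),  # r3 <- r1 + r8
    3: Inst("r8", ("r5",)),       # r8 <- r5 + 1
    4: Inst("r3", ("r7",)),       # r3 <- r7 - 2
    5: Inst("r6", ("r6", "r7")),  # r6 <- r6 + r7
}

for later in (3, 4, 5):
    print(f"Inst{later} vs Inst2:", hazards(PROGRAM[2], PROGRAM[later]) or "independent")
```

Only the RAW case reflects a value actually flowing between instructions; the WAR and WAW cases arise purely from register reuse.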

False Dependencies
• OOOE creates new dependencies:
  – WAR: an instruction writes into a register which needs to be read by an earlier instruction
  – WAW: an instruction writes into a register which needs to be written by an earlier instruction
• These are false dependencies:
  – There is no missing data
• Still, false dependencies prevent executing instructions out of order
• Solution: Register Renaming

Register Renaming
• Hold a pool of physical registers
• Architectural registers are mapped into physical registers
  – When an instruction writes to an architectural register:
      · A free physical register is allocated from the pool
      · The physical register points to the architectural register
      · The instruction writes the value to the physical register
  – When an instruction reads from an architectural register:
      · It reads the data from the latest preceding instruction that writes to the same architectural register
      · If no such instruction exists, it reads directly from the architectural register
  – When an instruction commits:
      · It moves the value from the physical register to the architectural register it points to

OOOE with Register Renaming: Example
        Original            Renamed
  (1)   r1 ← mem1           r1' ← mem1
  (2)   r2 ← r2 + r1        r2' ← r2 + r1'
  (3)   r1 ← mem2           r1'' ← mem2
  (4)   r3 ← r3 + r1        r3' ← r3 + r1''
  (5)   r1 ← mem3           r1''' ← mem3
  (6)   r4 ← r5 + r1        r4' ← r5 + r1'''
  (7)   r5 ← 2              r5' ← 2
  (8)   r6 ← r5 + 2         r6' ← r5' + 2
[Figure annotations: cycle 1, cycle 2, WAW, WAR]
Register Renaming Benefits
• Removes false dependencies
• Removes the architectural limit on the number of registers
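The renamed column can be generated mechanically. The sketch below is a simplified model: instead of allocating physical registers from a free pool, it appends one prime mark per write to an architectural register, which reproduces the slide's notation. The function name `rename` and the tuple encoding of instructions are assumptions made for illustration.

```python
def rename(program):
    """Rename (dst, operands) pairs; each write to a register gets a fresh version.

    Primes stand in for newly allocated physical registers. Operands that were
    never written (such as memory names and constants) are left unchanged.
    """
    writes = {}      # architectural register -> number of writes seen so far
    renamed = []
    for dst, operands in program:
        # Sources read the latest version written by a preceding instruction.
        new_operands = [op + "'" * writes.get(op, 0) for op in operands]
        # The destination then gets a fresh version.
        writes[dst] = writes.get(dst, 0) + 1
        renamed.append((dst + "'" * writes[dst], new_operands))
    return renamed


PROGRAM = [
    ("r1", ["mem1"]),      # (1) r1 <- mem1
    ("r2", ["r2", "r1"]),  # (2) r2 <- r2 + r1
    ("r1", ["mem2"]),      # (3) r1 <- mem2
    ("r3", ["r3", "r1"]),  # (4) r3 <- r3 + r1
    ("r1", ["mem3"]),      # (5) r1 <- mem3
    ("r4", ["r5", "r1"]),  # (6) r4 <- r5 + r1
    ("r5", ["2"]),         # (7) r5 <- 2
    ("r6", ["r5", "2"]),   # (8) r6 <- r5 + 2
]

renamed = rename(PROGRAM)
for number, (dst, operands) in enumerate(renamed, start=1):
    print(f"({number}) {dst} <- {', '.join(operands)}")
```

Sources are renamed before the destination is updated, so an instruction that reads and writes the same register (like instruction 2) reads the old version and writes a new one.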

Executing Beyond Branches
• The scheme we saw so far does not search beyond a branch
• It is limited to the parallelism within a basic block
  – A basic block is ~5 instructions long
    (1) r1 ← r4 / r7
    (2) r2 ← r2 + r1
    (3) r3 ← r2 - 5
    (4) beq r3,0,300
    (5) r8 ← r8 + 1
  – If the beq is predicted not taken (NT), Inst 5 can be speculatively executed
• We would like to look beyond branches
  – But what if we execute an instruction beyond a branch and then it turns out that we predicted the wrong path?
• Solution: Speculative Execution

Speculative Execution
• Execution of instructions from a predicted (yet unsure) path
• The path may eventually turn out to be wrong
• Implementation:
  – Hold a pool of all "not yet executed" instructions
  – Fetch instructions into the pool from a predicted path
  – Instructions for which all operands are "ready" can be executed
  – An instruction may change the processor state (commit) only when it is safe
      · An instruction commits only when all previous (in-order) instructions have committed
      · Instructions commit in order
      · Instructions which follow a branch commit only after the branch commits
      · If a branch was mispredicted, all the instructions which follow it are flushed
• Register Renaming helps speculative execution
  – Renamed registers are kept until speculation is verified to be correct

Speculative Execution - Example
            Original            Renamed
  (1)       r1 ← mem1           r1' ← mem1
  (2)       r2 ← r2 + r1        r2' ← r2 + r1'
  (3)       r1 ← mem2           r1'' ← mem2
  (4)       r3 ← r3 + r1        r3' ← r3 + r1''
  (5)       jmp cond L2
  (6) L2:   r1 ← mem3           r1''' ← mem3
  (7)       r4 ← r5 + r1        r4' ← r5 + r1'''
  (8)       r5 ← 2              r5' ← 2
  (9)       r6 ← r5 + 2         r6' ← r5' + 2
[Figure annotations: cycle 1, cycle 2, WAW, WAR, Speculative Execution]
• Instructions 6-9 are speculatively executed
  – If the prediction turns out to be wrong, they will be flushed
• If the branch was predicted taken, the instructions from the other path would have been speculatively executed.
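The commit-or-flush rule for instructions 6-9 can be sketched as a walk over the instruction pool in program order. This is a minimal model with illustrative names (`retire`, the `mispredicted` flag). It assumes every instruction has already executed and that a mispredicted branch itself still commits before the younger instructions are discarded.

```python
def retire(pool):
    """Walk the pool in program order; return (committed, flushed) names.

    Each entry is a dict with a 'name' and, for branches, an optional
    'mispredicted' flag. Everything younger than a mispredicted branch is
    flushed instead of committed.
    """
    committed = []
    for index, entry in enumerate(pool):
        committed.append(entry["name"])
        if entry.get("mispredicted", False):
            flushed = [younger["name"] for younger in pool[index + 1:]]
            return committed, flushed
    return committed, []


def build_pool(branch_mispredicted):
    """The nine instructions of the example; instruction 5 is the branch."""
    pool = [{"name": f"inst{n}"} for n in range(1, 10)]
    pool[4]["mispredicted"] = branch_mispredicted
    return pool


ok_committed, ok_flushed = retire(build_pool(branch_mispredicted=False))
bad_committed, bad_flushed = retire(build_pool(branch_mispredicted=True))
print("correct prediction: flushed", ok_flushed)
print("misprediction:      flushed", bad_flushed)
```

Because nothing younger than the branch has changed architectural state, discarding those entries is enough to recover in this model; a real machine would also restart fetch from the correct path.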

Out Of Order Execution - Summary
• Advantages
  – Helps exploit Instruction Level Parallelism (ILP)
  – Helps cover latencies (e.g., cache miss, divide)
  – Superior/complementary to the compiler scheduler
      · Dynamic instruction window
      · Register renaming: can use more registers than the number of architectural registers
• Complex microarchitecture
  – Complex scheduler
  – Requires a reordering mechanism (retirement) in the back-end for:
      · Precise interrupt resolution
      · Misprediction/speculation recovery
      · Memory ordering
• Speculative Execution
  – Advantage: larger scheduling window ⇒ reveals more ILP
  – Issues: misprediction cost and misprediction recovery