Computer Structure 2014 – Out-Of-Order Execution 1 Computer Structure Out-Of-Order Execution Lihu Rappoport and Adi Yoaz.

Presentation transcript:

Slide 2: What's Next
• Goal: minimize CPU time
  CPU Time = clock cycle × CPI × IC
• So far we have learned
  - Minimize clock cycle → add more pipe stages
  - Minimize CPI → use pipeline
  - Minimize IC → architecture
• In a pipelined CPU
  - CPI w/o hazards is 1
  - CPI with hazards is > 1
• Adding more pipe stages reduces clock cycle but increases CPI
  - Higher penalty due to control hazards
  - More data hazards
• What can we do? Further reduce the CPI!
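As a rough illustration of the CPU time equation and of the trade-off between clock cycle and CPI, the following Python sketch compares a shallow and a deep pipeline. All numbers (instruction count, cycle times, CPIs) are made-up assumptions, not figures from the lecture.

```python
def cpu_time(clock_cycle_ns, cpi, instruction_count):
    """CPU time in seconds: clock cycle x CPI x IC."""
    return clock_cycle_ns * 1e-9 * cpi * instruction_count

IC = 1_000_000_000  # hypothetical instruction count

# Hypothetical shallow pipeline: longer cycle, fewer hazard stalls.
shallow = cpu_time(clock_cycle_ns=1.0, cpi=1.2, instruction_count=IC)
# Hypothetical deep pipeline: shorter cycle, but CPI grows with hazards.
deep = cpu_time(clock_cycle_ns=0.6, cpi=1.5, instruction_count=IC)

print(f"shallow pipeline: {shallow:.2f} s, deep pipeline: {deep:.2f} s")
```

Under these assumed numbers the shorter cycle outweighs the higher CPI; lowering the CPI of the deep pipeline would reduce its CPU time further.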

Slide 3: A Superscalar CPU
• Duplicating HW in one pipe stage won't help
  - e.g., have 2 ALUs
  - the bottleneck moves to other stages
• Getting IPC > 1 requires fetching, decoding, executing, and retiring more than one instruction per clock
[Figure: pipeline stages IF, ID, EXE, MEM, WB]

Slide 4: The Pentium® Processor
• Fetches and decodes 2 instructions per cycle
  - Before register file read, decide on pairing: can the two instructions be executed in parallel?
• Pairing decision is based on
  - Data dependencies: the 2nd instruction must be independent of the 1st
  - Resources: U-pipe and V-pipe are not symmetric (save HW)
    - Common instructions can execute on either pipe
    - Some instructions can execute only on the U-pipe
    - If the 2nd instruction requires the U-pipe, it cannot pair
    - Some instructions use resources of both pipes
[Figure: IF and ID stages, followed by the pairing decision and the U-pipe and V-pipe]
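The pairing rules above can be sketched as a small predicate. This is a simplified model, not the actual Pentium pairing logic: the `Inst` type, the `pipe` classes ("any", "u_only", "both"), and the choice to treat both read-after-write and write-after-write as blocking dependencies are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Inst:
    dst: Optional[str]          # destination register, or None
    srcs: Tuple[str, ...]       # source registers
    pipe: str = "any"           # hypothetical class: "any", "u_only", "both"

def can_pair(first: Inst, second: Inst) -> bool:
    """True if `second` may issue in the V-pipe alongside `first` in the U-pipe."""
    # An instruction that uses resources of both pipes cannot pair.
    if first.pipe == "both" or second.pipe == "both":
        return False
    # The 2nd instruction would go to the V-pipe, so it must not need the U-pipe.
    if second.pipe == "u_only":
        return False
    # The 2nd instruction must be independent of the 1st (assumed: RAW and WAW).
    if first.dst is not None:
        if first.dst in second.srcs or first.dst == second.dst:
            return False
    return True

add = Inst("r1", ("r2", "r3"))
independent = Inst("r4", ("r5",))
dependent = Inst("r6", ("r1",))
print(can_pair(add, independent), can_pair(add, dependent))
```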

Slide 5: Misprediction Penalty in a Superscalar CPU
• MPI: misses per instruction
  MPI = (# incorrectly predicted branches) / (total # of instructions)
      = MPR × (# predicted branches) / (total # of instructions)
• MPI correlates well with performance, e.g., assume
  - MPR = 5%, %branches = 20% → MPI = 1%
  - Without hazards IPC = 2 (2 instructions per cycle)
  - Flush penalty of 5 cycles
• We get
  - MPI = 1% → a flush every 100 instructions
  - IPC = 2 → a flush every 100/2 = 50 cycles
  - 5 cycles flush penalty every 50 cycles → 10% performance hit
• For IPC = 1 we would get
  - 5 cycles flush penalty per 100 cycles → 5% performance hit
• Flush penalty increases as the machine gets deeper and wider
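The arithmetic on this slide can be reproduced with a few lines of Python. The function name and parameters are illustrative; the result is the flush penalty relative to the hazard-free cycles between flushes, which is the approximation the slide uses.

```python
def flush_overhead(mpr, branch_fraction, ipc, flush_penalty_cycles):
    """Flush penalty as a fraction of the hazard-free cycles between flushes."""
    mpi = mpr * branch_fraction                 # misses per instruction
    instructions_between_flushes = 1 / mpi
    cycles_between_flushes = instructions_between_flushes / ipc
    return flush_penalty_cycles / cycles_between_flushes

wide = flush_overhead(mpr=0.05, branch_fraction=0.20, ipc=2, flush_penalty_cycles=5)
narrow = flush_overhead(mpr=0.05, branch_fraction=0.20, ipc=1, flush_penalty_cycles=5)
print(f"IPC=2: {wide:.0%} hit, IPC=1: {narrow:.0%} hit")
```

The same MPI costs twice as much on the machine with twice the IPC, which is why a wider machine is more sensitive to mispredictions.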

Slide 6: Extract More ILP
• ILP: Instruction Level Parallelism
  - A given program, executed on given input data, has a given parallelism
  - Can execute only independent instructions in parallel
  - If, for example, each instruction depends on the previous instruction, the ILP of the program is 1
    - Adding more HW will not change that
• Adjacent instructions are usually dependent
  - The utilization of the 2nd pipe is usually low
  - There are algorithms in which both pipes are highly utilized
• Solution: Out-Of-Order Execution
  - Look for independent instructions further ahead in the program
  - Execute instructions based on data readiness
  - Still need to keep the semantics of the original program

Slide 7: Data Flow Analysis
• Example:
  (1) r1 ← r4 / r7   ; assume divide takes 20 cycles
  (2) r8 ← r1 + r2
  (3) r5 ← r5 + 1
  (4) r6 ← r6 - r3
  (5) r4 ← r5 + r6
  (6) r7 ← r8 * r
[Figure: in-order execution and out-of-order execution timelines, and the data flow graph with edges labeled r1, r5, r6, r4, r8]
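To make the gap between in-order and data-flow execution concrete, here is a minimal Python sketch that computes finish times for instructions (1) to (5) of the example. Instruction (6) is left out because its second operand is cut off in the transcript. The timing model is an assumption: every non-divide operation takes 1 cycle, the in-order machine starts at most one instruction per cycle and stalls until operands are ready, and the data-flow machine has unlimited resources and only honors true (read-after-write) dependencies.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class Inst:
    dst: str
    srcs: Tuple[str, ...]
    latency: int = 1    # assumption: non-divide operations take 1 cycle

PROGRAM = [
    Inst("r1", ("r4", "r7"), latency=20),  # (1) divide, 20 cycles per the slide
    Inst("r8", ("r1", "r2")),              # (2)
    Inst("r5", ("r5",)),                   # (3)
    Inst("r6", ("r6", "r3")),              # (4)
    Inst("r4", ("r5", "r6")),              # (5)
]

def finish_times(program, in_order):
    """Cycle at which each instruction's result is ready."""
    value_ready = {}     # register -> cycle its newest value is ready
    prev_start = -1
    finish = []
    for inst in program:
        # True dependencies only: wait for the newest producer of each source.
        ready = max((value_ready.get(s, 0) for s in inst.srcs), default=0)
        # In-order: also wait for the previous instruction to have started.
        start = max(ready, prev_start + 1) if in_order else ready
        prev_start = start
        done = start + inst.latency
        finish.append(done)
        value_ready[inst.dst] = done
    return finish

print("in-order:    ", finish_times(PROGRAM, in_order=True))
print("out-of-order:", finish_times(PROGRAM, in_order=False))
```

In this model instructions (3) to (5) finish under the shadow of the divide when executed out of order, while in order they wait behind instruction (2).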

Slide 8: OOOE – General Scheme
• Fetch & decode instructions in parallel but in order
  - Fill the instruction pool
• Execute ready instructions from the instruction pool
  - All source data ready + needed execution resources available
• Once an instruction is executed
  - Signal all dependent instructions that data is ready
• Commit instructions in parallel but in order
  - State change (memory, register) and fault/exception handling
[Figure: Fetch & Decode (in-order) → Instruction pool → Execute (out-of-order) → Retire (commit) (in-order)]
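The "execute ready instructions from the pool" step can be sketched as a per-cycle select loop. This is a toy model under stated assumptions: each pool entry is a hypothetical `(name, sources, latency)` tuple whose sources name earlier pool entries, execution units are fully pipelined (a unit is free again in the next cycle), and the oldest ready instruction is selected first.

```python
def schedule(pool, units=2):
    """Return {name: issue cycle} for a pool given in program order."""
    done_at = {}             # name -> cycle its result becomes available
    issued_at = {}
    waiting = list(pool)
    cycle = 0
    while waiting:
        free = units         # execution resources available this cycle
        for inst in list(waiting):              # scan oldest first
            name, srcs, latency = inst
            sources_ready = all(done_at.get(s, float("inf")) <= cycle for s in srcs)
            if free and sources_ready:
                issued_at[name] = cycle
                done_at[name] = cycle + latency  # "signals" dependents when reached
                waiting.remove(inst)
                free -= 1
        cycle += 1
    return issued_at

# Hypothetical pool: i2 depends on a 20-cycle i1; i5 depends on i3 and i4.
pool = [
    ("i1", [], 20),
    ("i2", ["i1"], 1),
    ("i3", [], 1),
    ("i4", [], 1),
    ("i5", ["i3", "i4"], 1),
]
print(schedule(pool, units=2))
```

Fetch and commit are not modeled here; the sketch only covers the out-of-order middle of the scheme.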

Slide 9: Write-After-Write Dependency
  (1) r1 ← r9/17
  (2) r2 ← r2+r1
  (3) r1 ← 23
  (4) r3 ← r3+r1
  (5) jcc L2
  (6) L2: r1 ← 35
  (7) r4 ← r3+r1
  (8) r3 ← 2

Slide 10: Write-After-Write Dependency
  (1) r1 ← r9/17
  (2) r2 ← r2+r1
  (3) r1 ← 23
  (4) r3 ← r3+r1
  (5) jcc L2
  (6) L2: r1 ← 35
  (7) r4 ← r3+r1
  (8) r3 ← 2
If inst (3) is executed before inst (1), r1 ends up having a wrong value. This is called a write-after-write false dependency.

Slide 11: Write-After-Write Dependency
  (1) r1 ← r9/17
  (2) r2 ← r2+r1
  (3) r1 ← 23
  (4) r3 ← r3+r1
  (5) jcc L2
  (6) L2: r1 ← 35
  (7) r4 ← r3+r1
  (8) r3 ← 2
Inst (4) should use the value of r1 produced by inst (3), even if inst (1) is executed after inst (3).
Write-After-Write (WAW) is a false dependency
  - Not a real data dependency, but an artifact of OOO execution

Slide 12: Speculative Execution
  (1) r1 ← r9/17
  (2) r2 ← r2+r1
  (3) r1 ← 23
  (4) r3 ← r3+r1
  (5) jcc L2
  (6) L2: r1 ← 35
  (7) r4 ← r3+r1
  (8) r3 ← 2
About 1 in 5 instructions is a branch → continue fetching, decoding, and allocating instructions into the instruction pool according to the predicted path. This is called "speculative execution".

Slide 13: Write-After-Read Dependency
  (1) r1 ← r9/17
  (2) r2 ← r2+r1
  (3) r1 ← 23
  (4) r3 ← r3+r1
  (5) jcc L2
  (6) L2: r1 ← 35
  (7) r4 ← r3+r1
  (8) r3 ← 2

Slide 14: Write-After-Read Dependency
  (1) r1 ← r9/17
  (2) r2 ← r2+r1
  (3) r1 ← 23
  (4) r3 ← r3+r1
  (5) jcc L2
  (6) L2: r1 ← 35
  (7) r4 ← r3+r1
  (8) r3 ← 2
If inst (8) is executed before inst (7), inst (7) gets a wrong value of r3. This is called a write-after-read false dependency.
Write-After-Read (WAR) is a false dependency
  - Not a real data dependency, but an artifact of OOO execution
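The dependency kinds discussed on the last few slides can be told apart mechanically. The sketch below classifies the relation between an older and a younger instruction, each written as a hypothetical `(destination, sources)` pair. It compares just the two instructions and ignores any writers in between.

```python
def dependencies(older, younger):
    """Return the set of dependency kinds from `older` to `younger`."""
    older_dst, older_srcs = older
    younger_dst, younger_srcs = younger
    kinds = set()
    if older_dst is not None and older_dst in younger_srcs:
        kinds.add("RAW")   # true dependency: younger reads what older writes
    if older_dst is not None and older_dst == younger_dst:
        kinds.add("WAW")   # false dependency: both write the same register
    if younger_dst is not None and younger_dst in older_srcs:
        kinds.add("WAR")   # false dependency: younger overwrites what older reads
    return kinds

# Instructions from the running example, as (destination, sources).
inst1 = ("r1", ("r9",))        # (1) r1 <- r9/17
inst3 = ("r1", ())             # (3) r1 <- 23
inst4 = ("r3", ("r3", "r1"))   # (4) r3 <- r3+r1
inst7 = ("r4", ("r3", "r1"))   # (7) r4 <- r3+r1
inst8 = ("r3", ())             # (8) r3 <- 2

print(dependencies(inst1, inst3))  # the WAW case from the slides
print(dependencies(inst7, inst8))  # the WAR case from the slides
print(dependencies(inst3, inst4))  # a true dependency
```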

Slide 15: Register Renaming
• Hold a pool of physical registers
  - Map architectural registers into physical registers (still in-order)
• When an instruction is allocated into the instruction pool (still in-order)
  - Allocate a free physical register from the pool
  - The physical register points to the architectural register
• When an instruction executes and writes a result
  - Write the result value to the physical register
• When an instruction needs data from a register
  - Read data from the physical register allocated to the latest instruction that writes to the same architectural register and precedes the current instruction
    - If no such instruction exists, read the value from the architectural register
• When an instruction commits
  - Copy the value from its physical register to the architectural register

Slide 16: Register Renaming
  Instruction          Renaming
  (1) r1 ← 17          r1:pr1   pr1 ← 17
  (2) r2 ← r2+r1       r2:pr2   pr2 ← r2+pr1
  (3) r1 ← 23          r1:pr3   pr3 ← 23
  (4) r3 ← r3+r1       r3:pr4   pr4 ← r3+pr3
  (5) jcc L2
  (6) L2: r1 ← 35      r1:pr5   pr5 ← 35
  (7) r4 ← r3+r1       r4:pr6   pr6 ← pr4+pr5
  (8) r3 ← 2           r3:pr7   pr7 ← 2
[Figure: architectural registers r1 to r4 and the register mapping table; after instruction (8) the mapping is r1:pr5, r2:pr2, r3:pr7, r4:pr6]
When an instruction commits: copy its physical register into the architectural register
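The renaming shown on this slide can be reproduced with a short Python sketch. It is a simplified model: instructions are hypothetical `(destination, source registers)` pairs with constants omitted, physical registers are handed out as pr1, pr2, ... from an unbounded pool, and freeing of physical registers at commit is not modeled.

```python
def rename(program):
    """Rename destinations to fresh physical registers, in program order."""
    mapping = {}        # architectural register -> newest physical register
    next_pr = 1
    renamed = []
    for dst, srcs in program:
        # Sources read the mapping before this instruction updates it;
        # an unmapped source is read from the architectural register.
        new_srcs = tuple(mapping.get(s, s) for s in srcs)
        if dst is None:                 # e.g. a branch: no destination
            renamed.append((None, new_srcs))
            continue
        pr = f"pr{next_pr}"
        next_pr += 1
        mapping[dst] = pr
        renamed.append((pr, new_srcs))
    return renamed, mapping

PROGRAM = [
    ("r1", ()),             # (1) r1 <- 17
    ("r2", ("r2", "r1")),   # (2) r2 <- r2+r1
    ("r1", ()),             # (3) r1 <- 23
    ("r3", ("r3", "r1")),   # (4) r3 <- r3+r1
    (None, ()),             # (5) jcc L2
    ("r1", ()),             # (6) r1 <- 35
    ("r4", ("r3", "r1")),   # (7) r4 <- r3+r1
    ("r3", ()),             # (8) r3 <- 2
]

renamed, final_mapping = rename(PROGRAM)
for number, (dst, srcs) in enumerate(renamed, start=1):
    print(number, dst, srcs)
print(final_mapping)
```

After renaming, instructions (1), (3), and (6) write different physical registers, and (8) no longer overwrites the register that (7) reads, so the WAW and WAR dependencies disappear.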

Slide 17: Speculative Execution – Misprediction
  Instruction          Renaming
  (1) r1 ← 17          r1:pr1   pr1 ← 17
  (2) r2 ← r2+r1       r2:pr2   pr2 ← r2+pr1
  (3) r1 ← 23          r1:pr3   pr3 ← 23
  (4) r3 ← r3+r1       r3:pr4   pr4 ← r3+pr3
  (5) jcc L2
  (6) L2: r1 ← 35      r1:pr5   pr5 ← 35
  (7) r4 ← r3+r1       r4:pr6   pr6 ← pr4+pr5
  (8) r3 ← 2           r3:pr7   pr7 ← 2
[Figure: architectural registers r1 to r4 and the register mapping table]
If the predicted branch path turns out to be wrong (when the branch is executed):
  - The instructions following the branch are flushed before they are committed → the architectural state is not changed

Slide 18: Speculative Execution – Misprediction
  Instruction          Renaming
  (1) r1 ← 17          r1:pr1   pr1 ← 17
  (2) r2 ← r2+r1       r2:pr2   pr2 ← r2+pr1
  (3) r1 ← 23          r1:pr3   pr3 ← 23
  (4) r3 ← r3+r1       r3:pr4   pr4 ← r3+pr3
  (5) jcc L2
  (6) L2: r1 ← 35      r1:pr5   pr5 ← 35
  (7) r4 ← r3+r1       r4:pr6   pr6 ← pr4+pr5
  (8) r3 ← 2           r3:pr7   pr7 ← 2
[Figure: architectural registers r1 to r4 and the register mapping table]
But the register mapping was already wrongly updated by the wrong-path instructions

Slide 19: Jump Misprediction – Flush at Retire
• When the mispredicted jump retires
  - Flush the pipeline
    - When the branch commits, all the instructions remaining in the pipe are younger than the branch → they are from the wrong path
  - Reset the renaming map
    - So all registers are mapped to architectural registers
    - This is OK since there are no consumers of physical registers (the pipe is flushed)
  - Start fetching instructions from the correct path
• Disadvantage
  - Very high misprediction penalty
  - Misprediction is already known after the jump was executed
  - We will see ways to recover from a misprediction at execution
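A minimal sketch of flush-at-retire, assuming a toy core in which result values are known at allocation and the `mispredicted` flag stands in for the outcome of executing the branch. The class and method names are hypothetical.

```python
class FlushAtRetireCore:
    def __init__(self):
        self.arch_regs = {}    # committed (architectural) register values
        self.rename_map = {}   # arch register -> newest in-flight writer
        self.rob = []          # in-flight instructions, oldest first

    def allocate(self, dst=None, value=None, mispredicted=False):
        """Allocate an instruction in program order and update the map."""
        entry = {"dst": dst, "value": value, "mispredicted": mispredicted}
        self.rob.append(entry)
        if dst is not None:
            self.rename_map[dst] = entry
        return entry

    def retire_oldest(self):
        """Commit the oldest instruction; flush if it is a mispredicted jump."""
        entry = self.rob.pop(0)
        if entry["dst"] is not None:
            self.arch_regs[entry["dst"]] = entry["value"]
            if self.rename_map.get(entry["dst"]) is entry:
                del self.rename_map[entry["dst"]]
        if entry["mispredicted"]:
            # Everything left is younger than the jump, so it is wrong-path.
            self.rob.clear()
            # Reset the map: all registers map to architectural registers.
            self.rename_map.clear()
            return "flushed"
        return "retired"

core = FlushAtRetireCore()
core.allocate("r1", 17)              # (1)
core.allocate("r1", 23)              # (3)
core.allocate(mispredicted=True)     # (5) jcc, predicted wrongly
core.allocate("r1", 35)              # (6) wrong path
core.allocate("r3", 2)               # (8) wrong path
results = [core.retire_oldest() for _ in range(3)]
print(results, core.arch_regs, core.rename_map)
```

The wrong-path writes to r1 and r3 never reach the architectural registers, and the map entries they polluted are discarded by the reset.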

Slide 20: OOO Requires Accurate Branch Predictor
• Accurate branch predictor increases the effective scheduling window size
  - Speculate across multiple branches (a branch every 5 to 10 instructions)
[Figure: instruction pool spanning several branches, ranging from high chances to commit to low chances to commit]
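One way to see why accuracy matters: if each branch is predicted correctly with the same probability, independently of the others (an assumption made here for illustration), an instruction that sits behind k unresolved branches commits only if all k predictions were right. The 95% and 99% accuracies below are example values; 95% matches the MPR = 5% used in the earlier misprediction example.

```python
def commit_probability(prediction_accuracy, branches_ahead):
    """Chance that an instruction behind `branches_ahead` branches commits."""
    return prediction_accuracy ** branches_ahead

for accuracy in (0.95, 0.99):
    chances = [commit_probability(accuracy, k) for k in range(0, 9, 2)]
    print(accuracy, [f"{c:.2f}" for c in chances])
```

With a branch every 5 to 10 instructions, a window of a few dozen instructions already spans several branches, so even a small accuracy gain keeps noticeably more of the window useful.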

Slide 21: Interrupts and Faults Handling
• Complications for pipelined and OOO execution
  - Interrupts occur in the middle of an instruction
  - A speculative instruction can get a fault (divide by 0, page fault)
• Faults are served in program order, at retirement only
  - Mark an instruction that takes a fault at execution
  - Instructions older than the faulting instruction are retired
  - Handle the fault only when the faulting instruction retires
    - Flush subsequent instructions
    - Initiate the fault handling code according to the fault type
    - Restart faulting and/or subsequent instructions
• Interrupts are served when the next instruction retires
  - Let the instruction in the current cycle retire
  - Flush subsequent instructions and initiate the interrupt service code
  - Fetch the subsequent instructions
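The "mark at execution, handle at retirement" rule for faults can be sketched as follows. The reorder buffer is modeled as a hypothetical list of `(name, fault)` pairs in program order, where `fault` is `None` or a fault type recorded when the instruction executed; the fault handler itself is not modeled.

```python
def retire_in_order(rob):
    """Retire from the oldest entry; stop at the first marked fault.

    Returns (retired names, (name, fault) to handle or None, flushed names).
    """
    retired = []
    for index, (name, fault) in enumerate(rob):
        if fault is not None:
            # The faulting instruction reached retirement: flush everything
            # younger and hand the fault to the handler for its type.
            flushed = [younger for younger, _ in rob[index + 1:]]
            return retired, (name, fault), flushed
        retired.append(name)
    return retired, None, []

# i3 may have executed (and faulted) before i2, but i2's fault is older in
# program order, so it is the one that gets served.
rob = [("i1", None), ("i2", "page fault"), ("i3", "divide by 0")]
print(retire_in_order(rob))
```

Because i3 is flushed before it retires, its fault is never raised; it is re-evaluated only if i3 executes again after the page fault is handled.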

Slide 22: Out Of Order Execution Summary
• Look ahead in a window of instructions
  - Dispatch ready instructions to execution
    - Do not depend on data from previous instructions that have not yet executed
    - Have the required execution resources available
• Advantages
  - Exploit Instruction Level Parallelism beyond adjacent instructions
  - Help cover latencies (e.g., L1 data cache miss, divide)
  - Superior/complementary to compiler scheduler
    - Can look for ILP beyond conditional branches
    - In a given control path instructions may be independent
    - Register renaming: use more registers than the number of architectural registers
• Complex micro-architecture
  - Register renaming, complex scheduler, misprediction recovery
  - Memory ordering (not covered so far)