Computer Structure 2014 – Out-Of-Order Execution 1 Computer Structure Out-Of-Order Execution Lihu Rappoport and Adi Yoaz.

Computer Structure 2014 – Out-Of-Order Execution 1 Computer Structure Out-Of-Order Execution Lihu Rappoport and Adi Yoaz

Computer Structure 2014 – Out-Of-Order Execution 2 What’s Next u Goal: minimize CPU Time CPU Time = clock cycle  CPI  IC u So far we have learned  Minimize clock cycle  add more pipe stages  Minimize CPI  use pipeline  Minimize IC  architecture u In a pipelined CPU  CPI w/o hazards is 1  CPI with hazards is > 1 u Adding more pipe stages reduces clock cycle but increases CPI  Higher penalty due to control hazards  More data hazards u What can we do ? Further reduce the CPI !

Computer Structure 2014 – Out-Of-Order Execution 3 A Superscalar CPU u Duplicating HW in one pipe stage won’t help  e.g., have 2 ALUs  the bottleneck moves to other stages u Getting IPC > 1 requires to fetch, decode, exe, and retire >1 instruction per clock: IF ID EXE MEM WB

Computer Structure 2014 – Out-Of-Order Execution 4 The Pentium  Processor u Fetches and decodes 2 instructions per cycle  Before register file read, decide on pairing: can the two instructions be executed in parallel u Pairing decision is based on  Data dependencies: 2 nd instruction must be independent of 1 st  Resources: U-pipe and V-pipe are not symmetric (save HW) Common instructions can execute on either pipe Some instructions can execute only on the U-pipe If the 2 nd instruction requires the U-pipe, it cannot pair Some instructions use resources of both pipes IF ID U-pipe V-pipe pairing

Computer Structure 2014 – Out-Of-Order Execution 5 u MPI : miss-per-instruction: #incorrectly predicted branches #predicted branches MPI = = MPR× total # of instructions total # of instructions u MPI correlates well with performance, e.g., assume  MPR = 5%, %branches = 20%  MPI = 1%  Without hazards IPC=2 (2 instructions per cycles)  Flush penalty of 5 cycles u We get  MPI = 1%  flush in every 100 instructions  IPC=2  flush every 100/2 = 50 cycles  5 cycles flush penalty every 50 cycles  10% performance hit u For IPC=1 we would get  5 cycles flush penalty per 100 cycles  5% performance hit u Flush penalty increases as the machine is deeper and wider Misprediction Penalty in a Superscalar CPU

Computer Structure 2014 – Out-Of-Order Execution 6 Extract More ILP u ILP – Instruction Level Parallelism  A given program, executed on a given input data has a given parallelism  Can execute only independent instructions in parallel  If for example each instruction is dependent on the previous instruction, the ILP of the program is 1 Adding more HW will not change that u Adjacent instructions are usually dependent  The utilization of the 2 nd pipe is usually low  There are algorithms in which both pipes are highly utilized u Solution: Out-Of-Order Execution  Look for independent instructions further ahead in the program  Execute instructions based on data readiness  Still need to keep the semantics of the original program

Computer Structure 2014 – Out-Of-Order Execution 7 Data Flow Analysis u Example: (1) r1  r4 / r7 ; assume divide takes 20 cycles (2) r8  r1 + r2 (3) r5  r5 + 1 (4) r6  r6 - r3 (5) r4  r5 + r6 (6) r7  r8 * r4 1 3 4 5 2 6 In-order execution 1 3 4 526 Out-of-order execution 134 2 5 6 Data Flow Graph r1 r5r6 r4 r8

Computer Structure 2014 – Out-Of-Order Execution 8 OOOE – General Scheme u Fetch & decode instructions in parallel but in order  Fill the Instruction Pool u Execute ready instructions from the instructions pool  All source data ready + needed execution resources available u Once an instruction is executed  signal all dependent instructions that data is ready u Commit instructions in parallel but in-order  State change (memory, register) and fault/exception handling Retire (commit) In-order Fetch & Decode Instruction pool In-order Execute Out-of-order

Computer Structure 2014 – Out-Of-Order Execution 9 (1) r1  R9/17 (2) r2  r2+r1 (4) r3  r3+r1 (5) jcc L2 (6) L2 r1  35 Write-After-Write Dependency (8) r3  2 (7) r4  r3+r1 (3) r1  23

Computer Structure 2014 – Out-Of-Order Execution 10 (1) r1  R9/17 (2) r2  r2+r1 (4) r3  r3+r1 (5) jcc L2 (6) L2 r1  35 Write-After-Write Dependency (8) r3  2 (7) r4  r3+r1 (3) r1  23 If inst (3) is executed before inst (1), r1 ends up having a wrong value. Called write-after-write false dependency.

Computer Structure 2014 – Out-Of-Order Execution 11 (1) r1  R9/17 (2) r2  r2+r1 (4) r3  r3+r1 (5) jcc L2 (6) L2 r1  35 Write-After-Write Dependency (8) r3  2 (7) r4  r3+r1 (3) r1  23 Inst (4) should use the value of r1 produced by inst (3), even if inst (1) is executed after inst (3). Write-After-Write (WAW) is a false dependency Not a real data dependency, but an artifact of OOO execution

Computer Structure 2014 – Out-Of-Order Execution 12 (1) r1  R9/17 (2) r2  r2+r1 (4) r3  r3+r1 (5) jcc L2 (6) L2 r1  35 Speculative Execution (8) r3  2 (7) r4  r3+r1 (3) r1  23 1/5 instruction is a branch  continue fetching, decoding, and allocating instructions into the instruction pool according to the predicted path. Called “speculative execution”

Computer Structure 2014 – Out-Of-Order Execution 13 (1) r1  R9/17 (2) r2  r2+r1 (4) r3  r3+r1 (5) jcc L2 (6) L2 r1  35 Write-After-Read Dependency (3) r1  23 (8) r3  2 (7) r4  r3+r1

Computer Structure 2014 – Out-Of-Order Execution 14 (7) r4  r3+r1 (1) r1  R9/17 (2) r2  r2+r1 (4) r3  r3+r1 (5) jcc L2 (6) L2 r1  35 Write-After-Read Dependency (3) r1  23 (8) r3  2 If inst (8) is executed before inst (7), inst (7) gets a wrong value of r3. Called write-after-read false dependency. Write-After-Read (WAR) is a false dependency Not a real data dependency, but an artifact of OOO execution

Computer Structure 2014 – Out-Of-Order Execution 15 Register Renaming u Hold a pool of physical registers  Map architectural registers into physical registers (still in-order) u When an instruction is allocated into the instruction pool (still in-order)  Allocate a free physical register from a pool  The physical register points to the architectural register u When an instruction executes and writes a result  Write the result value to the physical register u When an instruction needs data from a register  Read data from the physical register allocated to the latest inst which writes to the same arch register, and precedes the current inst If no such instruction exists, read from the reset arch. value u When an instruction commits  Copy the value from its physical register to the architectural register

Computer Structure 2014 – Out-Of-Order Execution 16 Renaming r1:pr1 pr1  17 r2:pr2 pr2  r2+pr1 r1:pr3 pr3  23 r3:pr4 pr4  r3+pr3 r1:pr5 pr5  35 r4:pr6 pr6  pr4+pr5 r3:pr7 pr7  2 (1) r1  17 (2) r2  r2+r1 (3) r1  23 (4) r3  r3+r1 (5) jcc L2 (6) L2 r1  35 (7) r4  r3+r1 (8) r3  2 Register Renaming r1r2r3r4 Register mappingr1r2r3r4 pr1pr2 pr3pr4 pr5pr6 pr7 When an instruction commits: Copy its physical register into the architectural register

Computer Structure 2014 – Out-Of-Order Execution 17 Renaming r1:pr1 pr1  17 r2:pr2 pr2  r2+pr1 r1:pr3 pr3  23 r3:pr4 pr4  r3+pr3 r1:pr5 pr5  35 r4:pr6 pr6  pr4+pr5 r3:pr7 pr7  2 (1) r1  17 (2) r2  r2+r1 (3) r1  23 (4) r3  r3+r1 (5) jcc L2 (6) L2 r1  35 (7) r4  r3+r1 (8) r3  2 Speculative Execution – Misprediction r1r2r3r4 Register mappingr1r2r3r4 pr1pr2 pr3pr4 pr5pr6 pr7 If the predicted branch path turns out to be wrong (when the branch is executed): The instructions following the branch are flushed before they are committed  the architectural state is not changed

Computer Structure 2014 – Out-Of-Order Execution 18 Renaming r1:pr1 pr1  17 r2:pr2 pr2  r2+pr1 r1:pr3 pr3  23 r3:pr4 pr4  r3+pr3 r1:pr5 pr5  35 r4:pr6 pr6  pr4+pr5 r3:pr7 pr7  2 (1) r1  17 (2) r2  r2+r1 (3) r1  23 (4) r3  r3+r1 (5) jcc L2 (6) L2 r1  35 (7) r4  r3+r1 (8) r3  2 Speculative Execution – Misprediction r1r2r3r4 Register mappingr1r2r3r4 pr1pr2 pr3pr4 pr5pr6 pr7 But the register mapping was already wrongly updated by the wrong path instructions

Computer Structure 2014 – Out-Of-Order Execution 19 Jump Misprediction – Flush at Retire u When the mispredicted jump retires  Flush the pipeline When the branch commits, all the instructions remaining in the pipe are younger than the branch  from the wrong path  Reset the renaming map So all register are mapped to architectural registers This is ok since there are no consumers of physical registers (pipe is flushed)  Start fetching instructions from the correct path u Disadvantage  Very high misprediction penalty  Misprediction is already known after the jump was executed  We will see ways to recover a misprediction at execution

Computer Structure 2014 – Out-Of-Order Execution 20 OOO Requires Accurate Branch Predictor u Accurate branch predictor increases the effective scheduling window size  Speculate across multiple branches (a branch every 5 – 10 instructions) Instruction pool branches High chances to commit Low chances to commit

Computer Structure 2014 – Out-Of-Order Execution 21 Interrupts and Faults Handling u Complications for pipelined and OOO execution  Interrupts occur in the middle of an instruction  A speculative instruction can get a fault (divide by 0, page fault) u Faults are served in program order, at retirement only  Mark an instruction that takes a fault at execution  Instructions older than the faulting instruction are retired  Only when the faulting instruction retires – handle the fault Flush subsequent instructions Initiate the fault handling code according to the fault type Restart faulting and/or subsequent instructions u Interrupts are served when the next instruction retires  Let the instruction in the current cycle retire  Flush subsequent instructions and initiate the interrupt service code  Fetch the subsequent instructions

Computer Structure 2014 – Out-Of-Order Execution 22 Out Of Order Execution Summary u Look ahead in a window of instructions  Dispatch ready instructions to execution Do not depend on data from previous instructions still not executed Have the required execution resources available u Advantages  Exploit Instruction Level Parallelism beyond adjacent instructions  Help cover latencies (e.g., L1 data cache miss, divide)  Superior/complementary to compiler scheduler Can look for ILP beyond conditional branches In a given control path instructions may be independent Register Renaming: use more than the number architectural registers u Complex micro-architecture  Register renaming, complex scheduler, misprediction recovery  Memory ordering – so far we did not talk about that

Computer Structure 2014 – Out-Of-Order Execution 1 Computer Structure Out-Of-Order Execution Lihu Rappoport and Adi Yoaz.

Similar presentations

Presentation on theme: "Computer Structure 2014 – Out-Of-Order Execution 1 Computer Structure Out-Of-Order Execution Lihu Rappoport and Adi Yoaz."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Computer Structure 2014 – Out-Of-Order Execution 1 Computer Structure Out-Of-Order Execution Lihu Rappoport and Adi Yoaz.

Similar presentations

Presentation on theme: "Computer Structure 2014 – Out-Of-Order Execution 1 Computer Structure Out-Of-Order Execution Lihu Rappoport and Adi Yoaz."— Presentation transcript:

Similar presentations

About project

Feedback