Download presentation
Presentation is loading. Please wait.
Published byJakob Slee Modified over 9 years ago
1
Computer Structure 2014 – Out-Of-Order Execution 1 Computer Structure Out-Of-Order Execution Lihu Rappoport and Adi Yoaz
2
Computer Structure 2014 – Out-Of-Order Execution 2 What’s Next u Goal: minimize CPU Time CPU Time = clock cycle CPI IC u So far we have learned Minimize clock cycle add more pipe stages Minimize CPI use pipeline Minimize IC architecture u In a pipelined CPU CPI w/o hazards is 1 CPI with hazards is > 1 u Adding more pipe stages reduces clock cycle but increases CPI Higher penalty due to control hazards More data hazards u What can we do ? Further reduce the CPI !
3
Computer Structure 2014 – Out-Of-Order Execution 3 A Superscalar CPU u Duplicating HW in one pipe stage won’t help e.g., have 2 ALUs the bottleneck moves to other stages u Getting IPC > 1 requires to fetch, decode, exe, and retire >1 instruction per clock: IF ID EXE MEM WB
4
Computer Structure 2014 – Out-Of-Order Execution 4 The Pentium Processor u Fetches and decodes 2 instructions per cycle Before register file read, decide on pairing: can the two instructions be executed in parallel u Pairing decision is based on Data dependencies: 2 nd instruction must be independent of 1 st Resources: U-pipe and V-pipe are not symmetric (save HW) Common instructions can execute on either pipe Some instructions can execute only on the U-pipe If the 2 nd instruction requires the U-pipe, it cannot pair Some instructions use resources of both pipes IF ID U-pipe V-pipe pairing
5
Computer Structure 2014 – Out-Of-Order Execution 5 u MPI : miss-per-instruction: #incorrectly predicted branches #predicted branches MPI = = MPR× total # of instructions total # of instructions u MPI correlates well with performance, e.g., assume MPR = 5%, %branches = 20% MPI = 1% Without hazards IPC=2 (2 instructions per cycles) Flush penalty of 5 cycles u We get MPI = 1% flush in every 100 instructions IPC=2 flush every 100/2 = 50 cycles 5 cycles flush penalty every 50 cycles 10% performance hit u For IPC=1 we would get 5 cycles flush penalty per 100 cycles 5% performance hit u Flush penalty increases as the machine is deeper and wider Misprediction Penalty in a Superscalar CPU
6
Computer Structure 2014 – Out-Of-Order Execution 6 Extract More ILP u ILP – Instruction Level Parallelism A given program, executed on a given input data has a given parallelism Can execute only independent instructions in parallel If for example each instruction is dependent on the previous instruction, the ILP of the program is 1 Adding more HW will not change that u Adjacent instructions are usually dependent The utilization of the 2 nd pipe is usually low There are algorithms in which both pipes are highly utilized u Solution: Out-Of-Order Execution Look for independent instructions further ahead in the program Execute instructions based on data readiness Still need to keep the semantics of the original program
7
Computer Structure 2014 – Out-Of-Order Execution 7 Data Flow Analysis u Example: (1) r1 r4 / r7 ; assume divide takes 20 cycles (2) r8 r1 + r2 (3) r5 r5 + 1 (4) r6 r6 - r3 (5) r4 r5 + r6 (6) r7 r8 * r4 1 3 4 5 2 6 In-order execution 1 3 4 526 Out-of-order execution 134 2 5 6 Data Flow Graph r1 r5r6 r4 r8
8
Computer Structure 2014 – Out-Of-Order Execution 8 OOOE – General Scheme u Fetch & decode instructions in parallel but in order Fill the Instruction Pool u Execute ready instructions from the instructions pool All source data ready + needed execution resources available u Once an instruction is executed signal all dependent instructions that data is ready u Commit instructions in parallel but in-order State change (memory, register) and fault/exception handling Retire (commit) In-order Fetch & Decode Instruction pool In-order Execute Out-of-order
9
Computer Structure 2014 – Out-Of-Order Execution 9 (1) r1 R9/17 (2) r2 r2+r1 (4) r3 r3+r1 (5) jcc L2 (6) L2 r1 35 Write-After-Write Dependency (8) r3 2 (7) r4 r3+r1 (3) r1 23
10
Computer Structure 2014 – Out-Of-Order Execution 10 (1) r1 R9/17 (2) r2 r2+r1 (4) r3 r3+r1 (5) jcc L2 (6) L2 r1 35 Write-After-Write Dependency (8) r3 2 (7) r4 r3+r1 (3) r1 23 If inst (3) is executed before inst (1), r1 ends up having a wrong value. Called write-after-write false dependency.
11
Computer Structure 2014 – Out-Of-Order Execution 11 (1) r1 R9/17 (2) r2 r2+r1 (4) r3 r3+r1 (5) jcc L2 (6) L2 r1 35 Write-After-Write Dependency (8) r3 2 (7) r4 r3+r1 (3) r1 23 Inst (4) should use the value of r1 produced by inst (3), even if inst (1) is executed after inst (3). Write-After-Write (WAW) is a false dependency Not a real data dependency, but an artifact of OOO execution
12
Computer Structure 2014 – Out-Of-Order Execution 12 (1) r1 R9/17 (2) r2 r2+r1 (4) r3 r3+r1 (5) jcc L2 (6) L2 r1 35 Speculative Execution (8) r3 2 (7) r4 r3+r1 (3) r1 23 1/5 instruction is a branch continue fetching, decoding, and allocating instructions into the instruction pool according to the predicted path. Called “speculative execution”
13
Computer Structure 2014 – Out-Of-Order Execution 13 (1) r1 R9/17 (2) r2 r2+r1 (4) r3 r3+r1 (5) jcc L2 (6) L2 r1 35 Write-After-Read Dependency (3) r1 23 (8) r3 2 (7) r4 r3+r1
14
Computer Structure 2014 – Out-Of-Order Execution 14 (7) r4 r3+r1 (1) r1 R9/17 (2) r2 r2+r1 (4) r3 r3+r1 (5) jcc L2 (6) L2 r1 35 Write-After-Read Dependency (3) r1 23 (8) r3 2 If inst (8) is executed before inst (7), inst (7) gets a wrong value of r3. Called write-after-read false dependency. Write-After-Read (WAR) is a false dependency Not a real data dependency, but an artifact of OOO execution
15
Computer Structure 2014 – Out-Of-Order Execution 15 Register Renaming u Hold a pool of physical registers Map architectural registers into physical registers (still in-order) u When an instruction is allocated into the instruction pool (still in-order) Allocate a free physical register from a pool The physical register points to the architectural register u When an instruction executes and writes a result Write the result value to the physical register u When an instruction needs data from a register Read data from the physical register allocated to the latest inst which writes to the same arch register, and precedes the current inst If no such instruction exists, read from the reset arch. value u When an instruction commits Copy the value from its physical register to the architectural register
16
Computer Structure 2014 – Out-Of-Order Execution 16 Renaming r1:pr1 pr1 17 r2:pr2 pr2 r2+pr1 r1:pr3 pr3 23 r3:pr4 pr4 r3+pr3 r1:pr5 pr5 35 r4:pr6 pr6 pr4+pr5 r3:pr7 pr7 2 (1) r1 17 (2) r2 r2+r1 (3) r1 23 (4) r3 r3+r1 (5) jcc L2 (6) L2 r1 35 (7) r4 r3+r1 (8) r3 2 Register Renaming r1r2r3r4 Register mappingr1r2r3r4 pr1pr2 pr3pr4 pr5pr6 pr7 When an instruction commits: Copy its physical register into the architectural register
17
Computer Structure 2014 – Out-Of-Order Execution 17 Renaming r1:pr1 pr1 17 r2:pr2 pr2 r2+pr1 r1:pr3 pr3 23 r3:pr4 pr4 r3+pr3 r1:pr5 pr5 35 r4:pr6 pr6 pr4+pr5 r3:pr7 pr7 2 (1) r1 17 (2) r2 r2+r1 (3) r1 23 (4) r3 r3+r1 (5) jcc L2 (6) L2 r1 35 (7) r4 r3+r1 (8) r3 2 Speculative Execution – Misprediction r1r2r3r4 Register mappingr1r2r3r4 pr1pr2 pr3pr4 pr5pr6 pr7 If the predicted branch path turns out to be wrong (when the branch is executed): The instructions following the branch are flushed before they are committed the architectural state is not changed
18
Computer Structure 2014 – Out-Of-Order Execution 18 Renaming r1:pr1 pr1 17 r2:pr2 pr2 r2+pr1 r1:pr3 pr3 23 r3:pr4 pr4 r3+pr3 r1:pr5 pr5 35 r4:pr6 pr6 pr4+pr5 r3:pr7 pr7 2 (1) r1 17 (2) r2 r2+r1 (3) r1 23 (4) r3 r3+r1 (5) jcc L2 (6) L2 r1 35 (7) r4 r3+r1 (8) r3 2 Speculative Execution – Misprediction r1r2r3r4 Register mappingr1r2r3r4 pr1pr2 pr3pr4 pr5pr6 pr7 But the register mapping was already wrongly updated by the wrong path instructions
19
Computer Structure 2014 – Out-Of-Order Execution 19 Jump Misprediction – Flush at Retire u When the mispredicted jump retires Flush the pipeline When the branch commits, all the instructions remaining in the pipe are younger than the branch from the wrong path Reset the renaming map So all register are mapped to architectural registers This is ok since there are no consumers of physical registers (pipe is flushed) Start fetching instructions from the correct path u Disadvantage Very high misprediction penalty Misprediction is already known after the jump was executed We will see ways to recover a misprediction at execution
20
Computer Structure 2014 – Out-Of-Order Execution 20 OOO Requires Accurate Branch Predictor u Accurate branch predictor increases the effective scheduling window size Speculate across multiple branches (a branch every 5 – 10 instructions) Instruction pool branches High chances to commit Low chances to commit
21
Computer Structure 2014 – Out-Of-Order Execution 21 Interrupts and Faults Handling u Complications for pipelined and OOO execution Interrupts occur in the middle of an instruction A speculative instruction can get a fault (divide by 0, page fault) u Faults are served in program order, at retirement only Mark an instruction that takes a fault at execution Instructions older than the faulting instruction are retired Only when the faulting instruction retires – handle the fault Flush subsequent instructions Initiate the fault handling code according to the fault type Restart faulting and/or subsequent instructions u Interrupts are served when the next instruction retires Let the instruction in the current cycle retire Flush subsequent instructions and initiate the interrupt service code Fetch the subsequent instructions
22
Computer Structure 2014 – Out-Of-Order Execution 22 Out Of Order Execution Summary u Look ahead in a window of instructions Dispatch ready instructions to execution Do not depend on data from previous instructions still not executed Have the required execution resources available u Advantages Exploit Instruction Level Parallelism beyond adjacent instructions Help cover latencies (e.g., L1 data cache miss, divide) Superior/complementary to compiler scheduler Can look for ILP beyond conditional branches In a given control path instructions may be independent Register Renaming: use more than the number architectural registers u Complex micro-architecture Register renaming, complex scheduler, misprediction recovery Memory ordering – so far we did not talk about that
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.