Published by Marian Day. Modified over 9 years ago.
1
out-of-order execution, Lihu Rappoport, 11/2004
MAMAS – Computer Architecture
Out-Of-Order Execution
Dr. Lihu Rappoport
2
What’s Next
Remember our goal: minimize CPU Time
CPU Time = clock cycle × CPI × IC
So far we have learned:
– Minimize clock cycle → add more pipe stages
– Minimize CPI → use a pipeline
– Minimize IC → architecture
In a pipelined CPU:
– CPI without hazards is 1
– CPI with hazards is > 1
Adding more pipe stages reduces the clock cycle but increases CPI:
– Higher penalty due to control hazards
– More data hazards
Beyond some point, adding more pipe stages does not help
What can we do? Further reduce the CPI!
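The trade-off in this slide can be checked with a few lines of arithmetic. The sketch below evaluates CPU Time = clock cycle × CPI × IC for two designs. The function name `cpu_time` and all numbers are made up for illustration and are not taken from the lecture.

```python
def cpu_time(clock_cycle_ns, cpi, instruction_count):
    """CPU Time = clock cycle x CPI x IC (result in nanoseconds)."""
    return clock_cycle_ns * cpi * instruction_count


# Hypothetical numbers: a deeper pipeline shortens the clock cycle,
# but the extra hazard penalties raise the CPI.
IC = 1_000_000
baseline = cpu_time(clock_cycle_ns=1.0, cpi=1.2, instruction_count=IC)
deeper = cpu_time(clock_cycle_ns=0.8, cpi=1.6, instruction_count=IC)

print(f"baseline: {baseline:.0f} ns, deeper pipeline: {deeper:.0f} ns")
```

With these assumed values the deeper pipeline is slower overall, which is the "beyond some point" case the slide describes.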
3
A Superscalar CPU
Just duplicating the hardware in one of the pipe stages (e.g., having 2 ALUs) won’t help:
– The bottleneck will simply move to other stages
In order to have CPI < 1 we need to be able to fetch, decode, execute, and retire more than a single instruction per clock
[Figure: pipeline stages IF, ID, EXE, MEM, WB]
4
The Pentium Processor
Fetches and decodes 2 instructions per cycle
Before the register file read, a pairing decision is made: can the two instructions be executed in parallel?
The pairing decision is based on:
– Data dependencies: the instructions must be independent
– Resources: some instructions use resources from both pipes
The second pipe can execute only a subset of the instructions
[Figure: IF, ID, pairing, U-pipe, V-pipe]
5
Misprediction Penalty in a Superscalar CPU
MPI: miss-per-instruction, the number of mispredicted branches per instruction
MPI = (# of incorrectly predicted branches) / (total # of instructions) = MPR × (# of predicted branches) / (total # of instructions)
where MPR is the misprediction rate: (# of incorrectly predicted branches) / (# of predicted branches)
MPI correlates well with performance. For example, assume:
– MPR = 5%, %branches = 20% → MPI = 1%
– Without hazards CPI = 0.5 (2 instructions per cycle)
– Flush penalty of 5 cycles
We get:
– MPI = 1% → a flush every 100 instructions
– A flush every 50 cycles (since CPI = 0.5)
– 5 cycles of flush penalty every 50 cycles → 10% loss in performance
(For CPI = 1 we get 5 cycles of flush penalty every 100 cycles → 5% loss in performance)
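The arithmetic in this example is easy to replay. The sketch below computes MPI and the relative flush overhead from the slide's inputs. The function name `flush_overhead` is an illustrative choice, and the overhead is expressed the way the slide does it: penalty cycles divided by the hazard-free cycles.

```python
def flush_overhead(mpr, branch_fraction, base_cpi, flush_penalty_cycles):
    """Return (MPI, penalty cycles as a fraction of hazard-free cycles)."""
    mpi = mpr * branch_fraction                      # mispredictions per instruction
    penalty_per_instruction = mpi * flush_penalty_cycles
    return mpi, penalty_per_instruction / base_cpi


# Inputs from the slide: MPR = 5%, 20% branches, 5-cycle flush penalty.
mpi, superscalar = flush_overhead(0.05, 0.20, base_cpi=0.5, flush_penalty_cycles=5)
_, scalar = flush_overhead(0.05, 0.20, base_cpi=1.0, flush_penalty_cycles=5)

print(f"MPI = {mpi:.2%}")
print(f"overhead at CPI 0.5: {superscalar:.0%}, at CPI 1: {scalar:.0%}")
```

The same misprediction rate costs twice as much, relatively, on the machine with the lower base CPI, which is why MPI matters more as the machine gets wider.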
6
Is Superscalar Good Enough?
A superscalar processor can fetch, decode, execute and retire two instructions in parallel
However, two instructions can only be executed in parallel if they are independent
But adjacent instructions are usually dependent
– The utilization of the second pipe is usually low
– There are algorithms in which both pipes are highly utilized
What to do? Out-of-order execution:
– Execute instructions out of program order
– Must still produce the same results as the original program
7
Out Of Order Execution
Execute instructions based on “data flow” rather than program order
Basic idea: look ahead in a window of instructions and find instructions that are ready to execute:
– They have no data dependencies on previous instructions that have not yet executed
– Resources are available
Out-of-order execution is: starting the execution stage of an instruction before the execution stage of a previous instruction
Advantages:
– Helps exploit Instruction Level Parallelism (ILP)
– Helps cover latencies (e.g., cache miss, divide)
Can Compilers do the Work?
– Compilers can statically reschedule instructions
– Compilers do not have run time information (e.g., cache misses, conditional branch direction → limited to basic blocks)
8
Data Flow Analysis
Example: assume that a divide operation takes 20 cycles.
(1) r1 ← r4 / r7
(2) r8 ← r1 + r2
(3) r5 ← r5 + 1
(4) r6 ← r6 - r3
(5) r4 ← r5 + r6
(6) r7 ← r8 * r4
[Figure: Data Flow Graph. Instructions 1, 3 and 4 have no producers; 2 depends on 1; 5 depends on 3 and 4; 6 depends on 2 and 5.]
[Figure: execution timelines for in-order execution and out-of-order execution]
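The data flow graph for this example can be derived mechanically from the register names. The sketch below finds the RAW producers of each instruction and compares two simplified timing models. Several things here are assumptions rather than part of the lecture: every non-divide instruction takes 1 cycle, execution resources are unlimited, WAR/WAW dependencies are ignored (as if registers were renamed), and "in-order" means only that an instruction may not start before its predecessor has started.

```python
from typing import NamedTuple


class Inst(NamedTuple):
    dest: str
    srcs: tuple
    latency: int


PROGRAM = [
    Inst("r1", ("r4", "r7"), 20),  # (1) r1 <- r4 / r7  (divide: 20 cycles)
    Inst("r8", ("r1", "r2"), 1),   # (2) r8 <- r1 + r2
    Inst("r5", ("r5",), 1),        # (3) r5 <- r5 + 1
    Inst("r6", ("r6", "r3"), 1),   # (4) r6 <- r6 - r3
    Inst("r4", ("r5", "r6"), 1),   # (5) r4 <- r5 + r6
    Inst("r7", ("r8", "r4"), 1),   # (6) r7 <- r8 * r4
]


def raw_producers(program):
    """For each instruction, the indices of the earlier instructions it reads from."""
    last_writer, deps = {}, []
    for i, inst in enumerate(program):
        deps.append(sorted({last_writer[s] for s in inst.srcs if s in last_writer}))
        last_writer[inst.dest] = i
    return deps


def schedule(program, in_order):
    """Return (start, finish) cycles under the simplified model described above."""
    deps = raw_producers(program)
    start, finish = [], []
    for i, inst in enumerate(program):
        ready = max((finish[p] for p in deps[i]), default=0)
        if in_order and i > 0:
            ready = max(ready, start[i - 1])  # may not start before its predecessor
        start.append(ready)
        finish.append(ready + inst.latency)
    return start, finish


ooo_start, ooo_finish = schedule(PROGRAM, in_order=False)
ino_start, ino_finish = schedule(PROGRAM, in_order=True)
print("producers (0-based):", raw_producers(PROGRAM))
print("out-of-order start cycles:", ooo_start, "total:", max(ooo_finish))
print("in-order start cycles:    ", ino_start, "total:", max(ino_finish))
```

Under these assumptions instructions 3, 4 and 5 run in the shadow of the divide when executed out of order, while the in-order model holds them back until instruction 2 has started.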
9
OOOE - General Scheme
Fetch & decode instructions in parallel, but in order, to fill an instruction pool (window)
Out of the instructions in the window, execute instructions that are ready:
– All the data required by the instruction is ready
– Execution resources are available
An executed instruction signals all dependent instructions that its data is ready
Commit a few instructions in-order
– An instruction can commit only after all preceding instructions (in program order) have committed
[Figure: Fetch & Decode → Instruction pool → Retire (commit), in-order; Execute, out-of-order]
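A toy cycle-by-cycle model makes the scheme concrete: instructions issue out of order once their producers are done and an issue slot is free, and they commit in program order. This is only a sketch under stated assumptions. The whole program is assumed to sit in the window already, `issue_width` stands in for "execution resources", only RAW dependencies are tracked, and the 1-cycle latency of non-divide instructions is assumed. The program is the one from the Data Flow Analysis example.

```python
def run_window(program, issue_width):
    """program: list of (dest, srcs, latency) tuples in program order.

    Returns (issue, commit): the cycle each instruction starts executing
    and the cycle from which it may commit.
    """
    # RAW producers: the latest earlier writer of each source register.
    last_writer, producers = {}, []
    for i, (dest, srcs, _) in enumerate(program):
        producers.append({last_writer[s] for s in srcs if s in last_writer})
        last_writer[dest] = i

    n = len(program)
    issue = [None] * n
    done = [None] * n
    cycle = 0
    while None in issue:
        free = issue_width
        for i in range(n):  # scan the window oldest-first
            if free == 0:
                break
            if issue[i] is not None:
                continue
            # Ready when every producer has finished executing.
            if all(done[p] is not None and done[p] <= cycle for p in producers[i]):
                issue[i] = cycle
                done[i] = cycle + program[i][2]
                free -= 1
        cycle += 1

    # In-order commit: never earlier than any preceding instruction.
    commit, previous = [], 0
    for d in done:
        previous = max(previous, d)
        commit.append(previous)
    return issue, commit


program = [
    ("r1", ("r4", "r7"), 20),  # (1) divide
    ("r8", ("r1", "r2"), 1),   # (2)
    ("r5", ("r5",), 1),        # (3)
    ("r6", ("r6", "r3"), 1),   # (4)
    ("r4", ("r5", "r6"), 1),   # (5)
    ("r7", ("r8", "r4"), 1),   # (6)
]
issue, commit = run_window(program, issue_width=2)
print("issue cycles: ", issue)
print("commit cycles:", commit)
```

Instructions 3 to 5 finish executing within the first few cycles but cannot commit until the divide and instruction 2 have committed, which is the in-order retirement rule from the slide.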
10
Out Of Order Execution - Example
Assume that executing a divide operation takes 20 cycles.
(1) r1 ← r5 / r4
(2) r3 ← r1 + r8
(3) r8 ← r5 + 1
(4) r3 ← r7 - 2
(5) r6 ← r6 + r7
Inst2 has a RAW dependency on r1 with Inst1, so it cannot be executed in parallel with Inst1
Can successive instructions pass Inst2?
– What about Inst3? No: Inst2 must read r8 before Inst3 writes to it
– What about Inst4? No: Inst4 must write to r3 after Inst2
– What about Inst5? Yes
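The three answers above follow from comparing destination and source registers pairwise. The sketch below classifies the dependency between an earlier and a later instruction as RAW, WAR or WAW. The `(dest, srcs)` encoding and the function name `hazards` are illustrative choices, and the pairwise check ignores intervening writers, which is sufficient for this five-instruction example.

```python
def hazards(earlier, later):
    """Return the set of dependencies that stop `later` from passing `earlier`."""
    e_dest, e_srcs = earlier
    l_dest, l_srcs = later
    found = set()
    if e_dest in l_srcs:
        found.add("RAW")  # later reads what earlier writes
    if l_dest in e_srcs:
        found.add("WAR")  # later overwrites what earlier still has to read
    if l_dest == e_dest:
        found.add("WAW")  # both write the same register
    return found


inst1 = ("r1", {"r5", "r4"})  # (1) r1 <- r5 / r4
inst2 = ("r3", {"r1", "r8"})  # (2) r3 <- r1 + r8
inst3 = ("r8", {"r5"})        # (3) r8 <- r5 + 1
inst4 = ("r3", {"r7"})        # (4) r3 <- r7 - 2
inst5 = ("r6", {"r6", "r7"})  # (5) r6 <- r6 + r7

print("Inst2 vs Inst1:", hazards(inst1, inst2))
for name, inst in [("Inst3", inst3), ("Inst4", inst4), ("Inst5", inst5)]:
    blocking = hazards(inst2, inst)
    print(name, "can pass Inst2" if not blocking else f"blocked by {blocking}")
```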
11
False Dependencies
OOOE creates new dependencies:
– WAR: an instruction writes into a register which needs to be read by an earlier instruction
– WAW: an instruction writes into a register which needs to be written by an earlier instruction
These are false dependencies:
– There is no missing data
Still, false dependencies prevent executing instructions out-of-order
Solution: Register Renaming
12
Register Renaming
Hold a pool of physical registers
Architectural registers are mapped into physical registers
– When an instruction writes to an architectural register
  – A free physical register is allocated from the pool
  – The physical register points to the architectural register
  – The instruction writes the value to the physical register
– When an instruction reads from an architectural register
  – It reads the data from the latest preceding instruction which writes to the same architectural register
  – If no such instruction exists, it reads directly from the architectural register
– When an instruction commits
  – It moves the value from the physical register to the architectural register it points to
13
OOOE with Register Renaming: Example
      Original            Renamed
(1)   r1 ← mem1           r1’ ← mem1
(2)   r2 ← r2 + r1        r2’ ← r2 + r1’
(3)   r1 ← mem2           r1” ← mem2
(4)   r3 ← r3 + r1        r3’ ← r3 + r1”
(5)   r1 ← mem3           r1”’ ← mem3
(6)   r4 ← r5 + r1        r4’ ← r5 + r1”’
(7)   r5 ← 2              r5’ ← 2
(8)   r6 ← r5 + 2         r6’ ← r5’ + 2
[Figure labels: cycle 1, cycle 2, WAW, WAR]
Register Renaming Benefits
– Removes false dependencies
– Removes the architectural limit on the number of registers
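The renamed column in this example can be reproduced with a very small renamer. The sketch below gives every write a fresh name by appending one more prime to the architectural register, and makes every read use the latest name, falling back to the architectural register when there is no earlier writer. Modelling the physical-register pool as an unlimited supply of primed names is a simplification for illustration, not how the hardware allocates registers.

```python
def rename(program):
    """program: list of (dest, srcs) in program order. Returns the renamed list."""
    writes = {}  # architectural register -> number of writes seen so far
    renamed = []
    for dest, srcs in program:
        # Reads use the latest version; operands never written (r2, mem1, 2) stay as-is.
        new_srcs = [s + "'" * writes.get(s, 0) for s in srcs]
        # The write allocates a fresh name, after the sources have been looked up.
        writes[dest] = writes.get(dest, 0) + 1
        renamed.append((dest + "'" * writes[dest], new_srcs))
    return renamed


program = [
    ("r1", ["mem1"]),      # (1)
    ("r2", ["r2", "r1"]),  # (2)
    ("r1", ["mem2"]),      # (3)
    ("r3", ["r3", "r1"]),  # (4)
    ("r1", ["mem3"]),      # (5)
    ("r4", ["r5", "r1"]),  # (6)
    ("r5", ["2"]),         # (7)
    ("r6", ["r5", "2"]),   # (8)
]
renamed = rename(program)
for number, (dest, srcs) in enumerate(renamed, start=1):
    print(f"({number}) {dest} <- {', '.join(srcs)}")
```

After renaming, the three loads write three different registers, so only the true (RAW) dependencies between each load and its consumer remain.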
14
Executing Beyond Branches
The scheme we saw so far does not search beyond a branch
Limited to the parallelism within a basic-block
– A basic-block is ~5 instructions long
(1) r1 ← r4 / r7
(2) r2 ← r2 + r1
(3) r3 ← r2 - 5
(4) beq r3,0,300
(5) r8 ← r8 + 1
If the beq is predicted NT (not taken), Inst 5 can be speculatively executed
We would like to look beyond branches
– But what if we execute an instruction beyond a branch and then it turns out that we predicted the wrong path?
Solution: Speculative Execution
15
Speculative Execution
Execution of instructions from a predicted (yet unsure) path
– Eventually, the path may turn out to be wrong
Implementation:
– Hold a pool of all “not yet executed” instructions
– Fetch instructions into the pool from a predicted path
– Instructions for which all operands are “ready” can be executed
– An instruction may change the processor state (commit) only when it is safe
  – An instruction commits only when all previous (in-order) instructions have committed
  – Instructions commit in-order
  – Instructions which follow a branch commit only after the branch commits
  – If a branch was mispredicted, all the instructions which follow it are flushed
Register Renaming helps speculative execution
– Renamed registers are kept until speculation is verified to be correct
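The commit rules above can be sketched with a small in-order queue. In this sketch instructions may finish in any order, but they leave the queue only from the head, and a mispredicted branch discards everything younger than itself. The class and field names are hypothetical, and flushing at the moment the branch commits is an assumed policy chosen for simplicity; a real design may recover earlier.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Entry:
    name: str
    done: bool = False          # finished executing (possibly out of order)
    mispredicted: bool = False  # only meaningful for branches


class ReorderBuffer:
    def __init__(self):
        self.entries = deque()  # program order, oldest at the left
        self.committed = []

    def fetch(self, name):
        entry = Entry(name)
        self.entries.append(entry)
        return entry

    def commit_ready(self):
        """Commit from the head for as long as the oldest instruction is done."""
        while self.entries and self.entries[0].done:
            head = self.entries.popleft()
            self.committed.append(head.name)
            if head.mispredicted:
                self.entries.clear()  # flush all younger, speculative instructions
                break


rob = ReorderBuffer()
div = rob.fetch("div")
branch = rob.fetch("branch")
spec = rob.fetch("speculative add")

spec.done = True  # executed early, but two older instructions are still pending
rob.commit_ready()
committed_early = list(rob.committed)

div.done = True
branch.done = True
branch.mispredicted = True  # the prediction turned out to be wrong
rob.commit_ready()

print("committed before the divide finished:", committed_early)
print("committed at the end:", rob.committed)
print("entries left after the flush:", len(rob.entries))
```

The speculative instruction never reaches the committed list, so its result never becomes architectural state even though it executed first.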
16
Speculative Execution - Example
          Original            Renamed
(1)       r1 ← mem1           r1’ ← mem1
(2)       r2 ← r2 + r1        r2’ ← r2 + r1’
(3)       r1 ← mem2           r1” ← mem2
(4)       r3 ← r3 + r1        r3’ ← r3 + r1”
(5)       jmp cond L2
(6) L2:   r1 ← mem3           r1”’ ← mem3
(7)       r4 ← r5 + r1        r4’ ← r5 + r1”’
(8)       r5 ← 2              r5’ ← 2
(9)       r6 ← r5 + 2         r6’ ← r5’ + 2
[Figure labels: cycle 1, cycle 2, WAW, WAR, Speculative Execution]
Instructions 6-9 are speculatively executed
– If the prediction turns out to be wrong, they will be flushed
If the branch was predicted taken, the instructions from the other path would have been speculatively executed.
17
Out Of Order Execution - Summary
Advantages
– Helps exploit Instruction Level Parallelism (ILP)
– Helps cover latencies (e.g., cache miss, divide)
– Superior/complementary to compiler scheduler
  – Dynamic instruction window
  – Reg Renaming: can use more than the number of architectural registers
Complex microarchitecture
– Complex scheduler
– Requires reordering mechanism (retirement) in the back-end for:
  – Precise interrupt resolution
  – Misprediction/speculation recovery
  – Memory ordering
Speculative Execution
– Advantage: larger scheduling window reveals more ILP
– Issues: misprediction cost and misprediction recovery