Download presentation
Presentation is loading. Please wait.
1
Computer Architecture 2011 – out-of-order execution (lec 7) 1 Computer Architecture Out-of-order execution By Dan Tsafrir, 11/4/2011 Presentation based on slides by David Patterson, Avi Mendelson, Lihu Rappoport, and Adi Yoaz
2
Computer Architecture 2011 – out-of-order execution (lec 7) 2 Out-of-order in a nutshell u Real processors Do not really serially execute instruction by instruction u Instead, they have an instruction window Moves across the program Ready instructions get picked up from window Not necessarily in program order Need window because some instructions aren’t ready User is unaware (except that the program runs faster)
3
Computer Architecture 2011 – out-of-order execution (lec 7) 3 What’s Next u Remember our goal: minimize CPU Time CPU Time = clock cycle CPI IC u So far we have learned Minimize clock cycle add more pipe stages Minimize CPI use pipeline Minimize IC architecture u In a pipelined CPU: CPI w/o hazards is 1 CPI with hazards is > 1 u Adding more pipe stages reduces clock cycle but increases CPI Higher penalty due to control hazards More data hazards u Beyond some point adding more pipe stages does not help u What can we do? Further reduce the CPI!
4
Computer Architecture 2011 – out-of-order execution (lec 7) 4 A Superscalar CPU u Duplicating HW in one pipe stage won’t help e.g., have 2 ALUs the bottleneck moves to other stages u Getting IPC > 1 requires to fetch, decode, exe, and retire >1 instruction per clock: IF ID EXE MEM WB
5
Computer Architecture 2011 – out-of-order execution (lec 7) 5 The Pentium Processor u Fetches and decodes 2 instructions per cycle u Before register file read, decide on pairing can the two instructions be executed in parallel u Pairing decision is based on Data dependencies: instructions must be independent Resources: Some instructions use resources from the 2 pipes The second pipe can only execute part of the instructions IF ID U-pipe V-pipe pairing
6
Computer Architecture 2011 – out-of-order execution (lec 7) 6 u MPI : miss-per-instruction (MPR = miss-predict rate): #incorrectly predicted branches #predicted branches MPI = = MPR× total # of instructions total # of instructions u MPI correlates well with performance. E.g., assume: MPR = 5%, %branches = 20% MPI = 1% Without hazards IPC=2 (2 instructions per cycles) Flush penalty of 5 cycles u We get: MPI = 1% flush in every 100 instructions IPC=2 flush every 100/2 = 50 cycles 5 cycles flush penalty every 50 cycles 10% in performance u For IPC=1 we would get 5 cycles flush penalty per 100 cycles 5% in performance Misprediction Penalty in a Superscalar CPU
7
Computer Architecture 2011 – out-of-order execution (lec 7) 7 Is Superscalar Good Enough ? u A superscalar processor can fetch, decode, execute and retire 2 instructions in parallel Can execute only independent instructions in parallel u But … adjacent instructions are usually dependent The utilization of the second pipe is usually low There are algorithms in which both pipes are highly utilized u Solution: out-of-order execution Execute instructions based on “data flow” rather than program order Still need to keep the semantics of the original program
8
Computer Architecture 2011 – out-of-order execution (lec 7) 8 Out Of Order Execution u Look ahead in a window of instructions and find instructions that are ready to execute Don’t depend on data from previous instructions still not executed Resources are available u Out-of-order execution Start instruction execution before execution of a previous instructions u Advantages: Help exploit Instruction Level Parallelism (ILP) Help cover latencies (e.g., L1 data cache miss, divide) u Can Compilers do the Work ? Compilers can statically reschedule instructions Compilers do not have run time information Conditional branch direction → limited to basic blocks Data values, which may affect calculation time and control Cache miss / hit
9
Computer Architecture 2011 – out-of-order execution (lec 7) 9 Data Flow Analysis u Example: (1) r1 r4 / r7 ; assume divide takes 20 cycles (2) r8 r1 + r2 (3) r5 r5 + 1 (4) r6 r6 - r3 (5) r4 r5 + r6 (6) r7 r8 * r4 13 4 5 2 6 In-order execution 1 3 4 526 Out-of-order execution 134 2 5 6 Data Flow Graph r1 r5r6 r4 r8 1 3 4 5 2 6 In-order (superscalar 3)
10
Computer Architecture 2011 – out-of-order execution (lec 7) 10 OOOE – General Scheme u Fetch & decode instructions in parallel but in order, to fill inst. pool u Execute ready instructions from the instructions pool All the data required for the instruction is ready Execution resources are available u Once an instruction is executed signal all dependant instructions that data is ready u Commit instructions in parallel but in-order Can commit an instruction only after all preceding instructions (in program order) have committed Fetch & Decode Instruction pool Retire (commit) In-order Execute Out-of-order
11
Computer Architecture 2011 – out-of-order execution (lec 7) 11 Out Of Order Execution – Example u Assume that executing a divide operation takes 20 cycles (1)r1 r5 / r4 (2)r3 r1 + r8 (3)r8 r5 + 1 (4)r3 r7 - 2 (5)r6 r6 + r7 u Inst2 has a RAW dependency on r1 with Inst1 It cannot be executed in parallel with Inst1 u Can successive instructions pass Inst2 ? Inst3 cannot since Inst2 must read r8 before Inst3 writes to it (WAR) Inst4 cannot since it must write to r3 after Inst2 (WAW) Inst5 can 1 34 2 5
12
Computer Architecture 2011 – out-of-order execution (lec 7) 12 False Dependencies u OOOE creates new dependencies WAR: write to a register which is read by an earlier inst. (1)r3 r2 + r1 (2)r2 r4 + 3 WAW: write to a register which is written by an earlier inst. (1)r3 r1 + r2 (2)r3 r4 + 3 u These are false dependencies There is no missing data Still prevent executing instructions out-of-order u Solution: Register Renaming
13
Computer Architecture 2011 – out-of-order execution (lec 7) 13 Register Renaming u Hold a pool of physical registers u Map architectural registers into physical registers Before an instruction can be sent for execution Allocate a free physical register from a pool The physical register points to the architectural register When an instruction writes a result Write the result value to the physical register When an instruction needs data from a register Read data from the physical register allocated to the latest inst which writes to the same arch register, and precedes the current inst If no such instruction exists, read directly from the arch. register When an instruction commits Move the value from the physical register to the arch register it points
14
Computer Architecture 2011 – out-of-order execution (lec 7) 14 OOOE with Register Renaming: Example cycle 1 cycle 2 (1)r1 mem1r1’ mem1 (2)r2 r2 + r1 r2’ r2 + r1’ (3)r1 mem2r1” mem2 (4)r3 r3 + r1 r3’ r3 + r1” (5)r1 mem3r1”’ mem3 (6)r4 r5 + r1 r4’ r5 + r1”’ (7)r5 2r5’ 2 (8)r6 r5 + 2 r6’ r5’ + 2 Register Renaming Benefits Removes false dependencies Removes architecture limit for # of registers WAW WAR
15
Computer Architecture 2011 – out-of-order execution (lec 7) 15 Executing Beyond Branches u So far we do not look for instructions ready to execute beyond a branch Limited to the parallelism within a “basic-block” A basic-block is ~5 instruction long ( 1) r1 r4 / r7 (2)r2 r2 + r1 (3)r3 r2 - 5 (4)beq r3,0,300 If the beq is predicted NT, (5)r8 r8 + 1 Inst 5 can be spec executed u We would like to look beyond branches But what if we execute an instruction beyond a branch and then it turns out that we predicted the wrong path ? Solution: Speculative Execution
16
Computer Architecture 2011 – out-of-order execution (lec 7) 16 Speculative Execution u Execution of instructions from a predicted (yet unsure) path Eventually, path may turn wrong u Implementation: Hold a pool of all not yet executed instructions Fetch instructions into the pool from a predicted path Instructions for which all operands are ready can be executed An instruction may change the processor state (commit) only when it is safe An instruction commits only when all previous (in-order) instructions have committed instructions commit in-order Instructions which follow a branch commit only after the branch commits If a predicted branch is wrong all the instructions which follow it are flushed u Register Renaming helps speculative execution Renamed registers are kept until speculation is verified to be correct
17
Computer Architecture 2011 – out-of-order execution (lec 7) 17 Speculative Execution - Example cycle 1 cycle 2 (1) r1 mem1 r1’ mem1 (2) r2 r2 + r1 r2’ r2 + r1’ (3) r1 mem2 r1” mem2 (4) r3 r3 + r1 r3’ r3 + r1” (5) jmp cond L2 predicted taken to L2 (6)L2 r1 mem3 r1”’ mem3 (7) r4 r5 + r1 r4’ r5 + r1”’ (8) r5 2 r5’ 2 (9) r6 r5 + 2 r6’ r5’ + 2 u Instructions 6-9 are speculatively executed If the prediction turns wrong, they will be flushed u If the branch was predicted not-taken The instructions from the other path would have been speculatively executed WAW WAR Speculative Execution
18
Computer Architecture 2011 – out-of-order execution (lec 7) 18 Out Of Order Execution – Summary u Advantages Help exploit Instruction Level Parallelism (ILP) Help cover latencies (e.g., cache miss, divide) Superior/complementary to compiler scheduler Dynamic instruction window Regs renaming: can use more than the number of arch registers u Complex micro-architecture Complex scheduler Requires reordering mechanism (retirement) in the back-end for: Precise interrupt resolution Misprediction/speculation recovery Memory ordering u Speculative Execution Advantage: larger scheduling window reveals more ILP Issues: misprediction cost and misprediction recovery
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.