Out-of-Order Execution Scheduling A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto
Instruction Level Parallel Processing
- Sequential Execution Semantics
- Out-of-Order Execution
  - How it can help
  - Issues:
    - Maintaining Sequential Semantics
    - Scheduling: Scoreboard, Register Renaming
- Initially, we’ll focus on registers; memory later on
Sequential Semantics - Review
- Instructions appear as if they executed:
  - In the order they appear in the program
  - One after the other
[Figure: program order vs. pipelining, superscalar and out-of-order execution]
Out-of-Order Execution
    loop: add r4, r4, 1
          ld  r2, 10(r4)
          add r3, r3, r2
          sub r1, r1, 1
          bne r1, r0, loop
Equivalent C:
    do { sum += a[++m]; i--; } while (i != 0);
[Figure: pipeline timing of the loop body (fetch, decode, then add, ld, add, sub, bne) on a superscalar vs. an out-of-order machine]
Sequential Semantics?
- Execution does NOT adhere to sequential semantics
  - To be precise: eventually it may
- Simplest solution: define the problem away
  - Not acceptable today: e.g., Virtual Memory
- Three-phase instruction execution: In-Progress, Completed and Committed
[Figure: out-of-order pipeline timing for add, ld, sub, bne, marking where the machine state is inconsistent and where it is consistent]
Out-of-order Execution Issues
- Preserving Sequential Semantics
- Stalling instructions w/ dependences
- Issuing instructions when dependences are satisfied
Back to Sequential Semantics
- Instr. exec. in 3 phases: In-progress, Completed, Committed
  - OOO for In-progress and Completed
  - In-order Commits
- Completed - out-of-order: “Visible only inside”
  - Results visible to subsequent instructions
  - Results not visible to outsiders
  - On interrupts, completed results are discarded
- Committed - in-order: “Visible to all”
  - Results visible to outsiders
  - On interrupts, committed results are preserved
How Completes Help w/ Performance
    DIV R3, _, _
    ADD R1, _, _
    ADD _, R1, _
[Figure: timelines for in-order completes vs. out-of-order completes with in-order commits]
[Figure: fetch, decode, execute, complete and commit timing for add, ld, sub, bne with in-order commits]
Implementing Completes/Commits
- Key idea: maintain sufficient state to be able to roll back when necessary
- Roll-back: discard (aka squash) all not-committed instructions
- One solution (conceptual):
  - Upon Complete, the instruction records the previous value of its target register
  - Upon Discard, the instruction restores the target value
  - Upon Commit, nothing to do
- We will return to this shortly
- Focus on scheduling mechanisms
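The conceptual record/restore scheme on this slide can be sketched as a register file with a history log. This is only an illustration: the class name `RollbackRegs` and its methods are hypothetical, and it assumes that writes are logged in program order.

```python
class RollbackRegs:
    """Register file with a history log so uncommitted writes can be squashed."""

    def __init__(self, nregs):
        self.regs = [0] * nregs
        self.log = []  # (register, previous value), oldest first (assumed program order)

    def complete(self, reg, value):
        # Upon complete: record the previous value, then update the register.
        self.log.append((reg, self.regs[reg]))
        self.regs[reg] = value

    def commit_oldest(self):
        # Upon commit: nothing to restore, just drop the oldest log entry.
        self.log.pop(0)

    def squash(self):
        # Upon discard: undo all uncommitted writes, youngest first.
        while self.log:
            reg, old = self.log.pop()
            self.regs[reg] = old


rf = RollbackRegs(4)
rf.complete(1, 10)   # oldest instruction writes r1
rf.complete(2, 20)   # younger instruction writes r2
rf.complete(1, 30)   # youngest instruction overwrites r1
rf.commit_oldest()   # r1 = 10 becomes architectural
rf.squash()          # interrupt: discard the two uncommitted writes
```

After the squash only the committed write survives: r1 holds 10 and r2 is back to its old value.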
Out-of-Order Execution Overview
[Figure: program form vs. processing phase. Program form: static program, dynamic inst. stream (trace), execution window, completed instructions. Processing phases: In-Progress (dispatch/dependences, inst. issue, inst. execution), Completed (inst. reorder & commit), Committed]
Out-of-Order Execution: Stages
- Fetch: get instruction from memory
- Decode/Dispatch: what is it? What are the dependences?
- Issue: go, all dependences satisfied
- Execute: perform operation
- Complete: result available to other insts.
- Commit: result available to outsiders
We’ll start w/ Decode/Dispatch, then we’ll consider Issue
OOO Scheduling
- Instruction @ Decode: do I have dependences yet to be satisfied?
  - Yes: stall until they are
  - No: clear to issue
- Wakeup of stalled instructions:
  - Dependences satisfied
  - Allow instruction to issue
- Dependence: (later instruction, earlier instruction) & type
- We’ll first consider RAW and then move on to WAW and WAR
Stalling @ Decode for RAW
- Are there unsatisfied dependences?
- RAW: have to wait for register value
  - We don’t really care who is producing the value
  - Only whether it is available
- Can use the Register Availability Vector (RAV) as in pipelining/superscalar
  - Also known as scoreboard
- At decode:
  - Check all bits for source regs: if any is 0, stall
  - Reset the bit corresponding to your target
- At writeback: set the bit
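A minimal sketch of the Register Availability Vector check, assuming one bit per register and a decode step that checks sources before resetting the target bit. The class and method names are hypothetical.

```python
class Scoreboard:
    """Register Availability Vector: one bit per register, 1 = value available."""

    def __init__(self, nregs):
        self.avail = [1] * nregs

    def must_stall(self, srcs):
        # RAW check: stall if any source register is still pending.
        return any(self.avail[r] == 0 for r in srcs)

    def decode(self, srcs, tgt):
        # Returns True if the instruction may proceed, False if it stalls.
        if self.must_stall(srcs):
            return False
        if tgt is not None:
            self.avail[tgt] = 0  # reset the bit of our target
        return True

    def writeback(self, tgt):
        self.avail[tgt] = 1  # value is now available


sb = Scoreboard(5)
ok_ld = sb.decode(srcs=[4], tgt=2)        # ld  r2, 10(r4)
ok_add = sb.decode(srcs=[3, 2], tgt=3)    # add r3, r3, r2 stalls on r2
sb.writeback(2)                           # load result written back
ok_retry = sb.decode(srcs=[3, 2], tgt=3)  # now clear to proceed
```

The dependent add stalls until the load writes back r2, then proceeds and marks r3 as pending.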
Issuing Instructions: Scheduling
- Determine when an instruction can issue
  - Ignore resources for the time being
  - Stalled because of RAW w/ preceding instruction
- Concept: producer (write) notifies consumers (read)
- Requirements: consumers need to be able to identify the producer
  - The register name is one possible link
- Mechanism:
  - Consumer is placed in a reservation station
  - Producer, on complete, broadcasts its identity
  - Waiting instructions observe the broadcast
  - Update operand availability
  - Issue if all operands are now available
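The notify/wakeup mechanism can be sketched as follows, using the register name as the link between producer and consumers. The `Entry` class and the `broadcast` helper are hypothetical names chosen for illustration.

```python
class Entry:
    """Reservation-station entry, linked to producers by register name."""

    def __init__(self, op, srcs, tgt, avail):
        self.op = op
        self.tgt = tgt
        # Map each source register to its availability at decode time.
        self.ready = {r: bool(avail[r]) for r in srcs}
        self.status = "Rdy" if all(self.ready.values()) else "Wait"

    def observe(self, reg):
        # A producer broadcast `reg`: update operand availability.
        if reg in self.ready:
            self.ready[reg] = True
        if self.status == "Wait" and all(self.ready.values()):
            self.status = "Rdy"


def broadcast(window, reg):
    # Every waiting instruction observes every completing producer.
    for entry in window:
        entry.observe(reg)


avail = {"r1": 1, "r2": 0, "r3": 1, "r4": 1}   # r2 is being loaded
window = [
    Entry("add", ["r3", "r2"], "r3", avail),   # waits for r2
    Entry("sub", ["r1"], "r1", avail),         # no pending sources
]
before = [e.status for e in window]
broadcast(window, "r2")                        # the load completes
after = [e.status for e in window]
```

Here the sub is ready immediately, while the add only becomes ready once r2 is broadcast.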
Reservation Station
- State pertaining to an instruction:
  - What registers it reads
  - Whether they are available
  - What is the destination register
  - What state the instruction is in: Waiting, Executing
Out-Of-Order Exec. Example
    loop: add r4, r4, 4
          ld  r2, 10(r4)      (5 cycles lat)
          add r3, r3, r2
          sub r1, r1, 1
          bne r1, r0, loop
Each cycle shows the window entries (op, src1, src2, tgt, status). A source written reg/1 is available; reg/0 is still pending.
[Figure: each cycle also shows the RAV bits for r1, r2, r3, r4]

Cycle 0 (initial state): window empty

Cycle 0
    op   src1  src2  tgt   status  note
    add  r4/1  NA/1  r4/0  Rdy     ready to be executed

Cycle 1
    op   src1  src2  tgt   status  note
    add  r4/1  NA/1  r4    Exec    r4 gets produced now; notify those waiting for r4
    ld   r4/1  NA/1  r2    Rdy

Cycle 2
    op   src1  src2  tgt   status  note
    add  r4/1  NA/1  r4    Cmtd
    ld   r4/1  NA/1  r2    Exec    result available @ cycle 6
    add  r3/1  r2/0  r3    Wait    wait for r2

Cycle 3
    op   src1  src2  tgt   status  note
    add  r4/1  NA/1  r4    Cmtd
    ld   r4/1  NA/1  r2    Exec    result available @ cycle 6
    add  r3/1  r2/0  r3    Wait    wait for r2
    sub  r1/1  NA/1  r1    Rdy     no dependences

Cycle 4
    op   src1  src2  tgt   status  note
    add  r4/1  NA/1  r4    Cmtd
    ld   r4/1  NA/1  r2    Exec    result available @ cycle 6
    add  r3/1  r2/0  r3    Wait    wait for r2
    sub  r1/1  NA/1  r1    Exec    r1 produced now; notify consumers
    bne  r1/1  r0/1  NA    Rdy     r1 will be available next cycle

Cycle 5
    op   src1  src2  tgt   status  note
    add  r4/1  NA/1  r4    Cmtd
    ld   r4/1  NA/1  r2    Exec    result available @ cycle 6
    add  r3/1  r2/0  r3    Wait    wait for r2
    sub  r1/1  NA/1  r1    Compl   completed
    bne  r1/1  r0/1  NA    Exec    executing

Cycle 6
    op   src1  src2  tgt   status  note
    add  r4/1  NA/1  r4    Cmtd
    ld   r4/1  NA/1  r2    Exec    result available now; notify consumers
    add  r3/1  r2/1  r3    Rdy
    sub  r1/1  NA/1  r1    Compl   completed
    bne  r1/1  r0/1  NA    Exec    executing

Cycle 7
    op   src1  src2  tgt   status  note
    add  r4/1  NA/1  r4    Cmtd
    ld   r4/1  NA/1  r2    Cmtd
    add  r3/1  r2/1  r3    Exec    executing; notify consumers
    sub  r1/1  NA/1  r1    Compl
    bne  r1/1  r0/1  NA    Compl   completed

Cycle 8
    op   src1  src2  tgt   status
    add  r4/1  NA/1  r4    Cmtd
    ld   r4/1  NA/1  r2    Cmtd
    add  r3/1  r2/1  r3    Cmtd
    sub  r1/1  NA/1  r1    Cmtd
    bne  r1/1  r0/1  NA    Cmtd
Notifying Consumers
- Identity of producer:
  - Must uniquely identify the instruction
  - Must be easily retrievable @ decode by others
- Candidates:
  - Target register: recall we stall on WAR or WAW
  - Functional unit: if not pipelined
  - Place in instruction window
  - PC? No. Why?
Name Dependences and OOO
- WAW or WAR: we need to update a register but others are still using it
    add r1, r1, 10
    sw  r1, 20(r2)
    add r1, r3, 30
    sub r2, r1, 40
- There is only one r1
  - sw needs to see the value of the 1st add
  - sub needs to wait for the 2nd add and not the 1st
- Solution: stall decode when WAW or WAR
Detecting WAW and WAR
- WAW? Look at the scoreboard
  - If the bit is 0 then there is a pending write
  - Stall
- WAR? Need to know whether all preceding consumers have read the value
  - Keep a count per register
  - Increase at decode for all reads
  - Decrease on issue
- More elegant solution via register renaming
  - Soon
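A sketch of the two checks, assuming a scoreboard bit per register for WAW and a per-register count of decoded-but-not-yet-issued readers for WAR. `HazardTracker` and its methods are hypothetical names; the example reuses the add/sw/add sequence from the name-dependence slide.

```python
class HazardTracker:
    def __init__(self, nregs):
        self.avail = [1] * nregs          # scoreboard bit, 0 = pending write
        self.pending_reads = [0] * nregs  # decoded readers that have not issued

    def can_decode(self, tgt):
        waw = self.avail[tgt] == 0          # an earlier write is still pending
        war = self.pending_reads[tgt] > 0   # earlier readers have not read yet
        return not (waw or war)

    def decode(self, srcs, tgt):
        for r in srcs:
            self.pending_reads[r] += 1      # increase at decode for all reads
        if tgt is not None:
            self.avail[tgt] = 0

    def issue(self, srcs):
        for r in srcs:
            self.pending_reads[r] -= 1      # decrease on issue

    def writeback(self, tgt):
        self.avail[tgt] = 1


t = HazardTracker(4)
t.decode([1], 1)                   # add r1, r1, 10
t.issue([1])
t.writeback(1)
t.decode([1, 2], None)             # sw r1, 20(r2): decoded, not yet issued
war_stall = not t.can_decode(1)    # add r1, r3, 30 must wait for sw to read r1
t.issue([1, 2])                    # sw reads its sources
clear = t.can_decode(1)            # the second add may now decode
t.decode([3], 1)                   # add r1, r3, 30
waw_stall = not t.can_decode(1)    # a further write to r1 would stall on WAW
```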
Window vs. Scheduler
- Window: distance between the oldest and youngest instruction that can co-exist inside the CPU
  - Larger window: potential for more ILP
  - Instructions enter at Fetch, exit at Commit
- Scheduler: number of instructions that are waiting to be issued
  - Instructions enter at Decode, leave at writeback
- Window >= Scheduler
  - Can be the same structure
  - In window but not in scheduler: completed instructions
Scoreboarding
- Schedule based on RAW dependences
- WAW and WAR cause stalls
  - WAW at decode
  - WAR at writeback
  - Optimization: why is this OK?
- Implemented in the CDC 6600 in ‘64
  - 18 non-pipelined FUs
    - 4 FP: 2 mul, 1 add, 1 div
    - 7 MEM: 5 load, 2 store
    - 7 INT: add, shift, logical etc.
- Centralized control scheme
  - Controls all instruction issue
  - Detects all hazards
MIPS/DLX w/ Scoreboarding
[Figure: register file connected to the functional units (FP mul, FP divide, FP add, integer), all controlled by the scoreboard]
Scoreboarding Overview
- Ignore IF and MEM for simplicity
- 4-stage execution:
  - Issue
    - Check for structural hazards
    - Check for WAW hazards
    - Stall until all clear
  - ReadOp
    - Check for RAW hazards
    - Wait until all operands are ready
    - Read registers
  - Execute
    - Execute operations
    - Notify scoreboard when complete
  - Write
    - Check for WAR hazards
    - Stall write until all clear
- A completing instruction cannot write dest if an earlier instruction has not read dest.
Scoreboarding Optimizations/Tricks
- WAW as in original OOO
- WAR is optimized
  - Second producer is allowed to execute up to complete
  - It is stalled there until preceding consumers complete
- No Commit
  - No precise interrupts
- Window is implemented in the scoreboard
  - One entry per functional unit
  - Recall: not pipelined
  - Instructions identified by FU id
Scoreboarding Organization
- Three structures: Instruction Status, Functional Unit Status, Register Result Status
- Instruction Status: which stage the instruction is currently in
- Functional Unit Status: scheduling
  - Busy
  - OP: operation
  - Fi: dest. reg.
  - Fj, Fk: source regs
  - Qj, Qk: FUs producing sources
  - Rj, Rk: ready bits for sources
- Register Result Status: dep. determination
  - Which FU will produce a register
Scoreboarding explained
- Register Result Status: which FU produces the register
  - Use at decode
  - Source reg match is a RAW
  - Target reg match is a WAW: stall
Functional Unit Status
- Busy: resource allocation
- OP: what to do once issued (e.g., add, sub)
- Dest. Reg.: where to write result
  - To find WAR
- Fj, Fk: source regs
  - For WAR: can’t write if consumers are pending for the previous value of the register (if FU not the same)
- Qj, Qk: FUs producing sources
  - To wait for the appropriate producer
- Rj, Rk: ready bits for sources
  - To determine when ready: all ready
Scoreboarding Example
Example: Cycle 0
Example, contd.
- The rest you’ll find on the web site
  - Go through it
  - Source: Patterson
- Summary:
  - Execution proceeds in an order dictated by dependences
  - RAW, WAR and WAW force ordering
  - Tricks may be possible
Beyond Simple OoO
    A: LF   F6, 34(R2)
    B: LF   F2, 45(R3)
    C: MULF F0, F2, F4
    D: SUBF F8, F2, F6
    E: ADDF F2, F7, F4
- E will wait for B, C and D
  - WAR w/ C and D
  - WAW w/ B
- Can we do better?
[Figure: dependence graph over A, B, C, D, E]
What if we had infinite registers
    A: LF   F6, 34(R2)
    B: LF   F2, 45(R3)
    C: MULF F0, F2, F4
    D: SUBF F8, F2, F6
    E: ADDF F2, F7, F4   becomes   E: ADDF F9, F7, F4
- No false dependences anymore
- Since we do not reuse a name, we can’t have WAW and WAR
Why we can’t have Infinite Registers
- False/Name dependences (WAR and WAW)
  - Artifact of having finite registers
- There is no such thing as infinite
- There is no such thing as large enough
  - Well, there is (in a sec.)
  - Computers execute billions of instructions per sec.
  - Even a multi-billion register file would soon be exhausted
- Want to exploit parallelism across several instances of the same code
  - Loops, recursive functions (most frequent part)
Yes, there is “large enough”
- At any given point there will be a finite number of instructions in the window
- If each instruction has a single register target and there are N instructions, how many registers do we need?
  - N?
  - N + X?
Register Renaming
- Register version
  - Every write creates a new version
  - Uses read the last version
  - Need to keep a version until all uses have read it
- Register renaming: architectural vs. physical registers
  - More phys. than arch.
  - Maintain a map of arch. to phys. regs.
  - Use in-order decoding to properly identify dependences
  - Instructions wait only for input op. availability
  - Only the last version is written to the reg. file
Register Renaming
- Need more physical registers than architectural
- Ignore control flow for the time being

    Instruction          Renamed (dest, src1, src2)
    A: DIVF F3, F1, F0   r1, -, -
    B: SUBF F2, F1, F0   r2, -, -
    C: MULF F0, F2, F4   r3, r2, -
    D: SUBF F6, F2, F3   r4, r2, r1
    E: ADDF F2, F5, F4   r5, -, -
    F: ADDF F0, F0, F2   r6, r3, r5
Register Renaming Process
- Only need to remember the last producer of each architectural register
  - Vector
- At decode:
  - Find the most recent producers for all source registers
  - After: declare self as most recent producer of the target register
- Complication: may have to retract
  - Speculative execution, e.g., interrupts
  - Need to be able to restore the mapping state
Register Renaming Support Structures
- Register Rename Table
  - f(aR) = pR
  - One entry per architectural register
- Free Register List
  - Lists the physical registers not in use
- At decode:
  - Grab a new register from the free list
  - Change the mapping in the rename table
- At commit:
  - Release register? Not… Why?
  - Could release the previous version
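The rename table and free list can be sketched as follows. This is an illustration under stated assumptions: the `Renamer` class is hypothetical, physical registers are plain integers, and architectural register Fi initially maps to physical register i (so the numbering differs from the r1..r6 names used in the earlier renaming example).

```python
from collections import deque


class Renamer:
    def __init__(self, arch_regs, num_phys):
        # Assumed initial state: architectural register i lives in physical register i.
        self.table = {a: i for i, a in enumerate(arch_regs)}
        self.free = deque(range(len(arch_regs), num_phys))

    def rename(self, tgt, srcs):
        # Sources use the current mapping, i.e. the most recent producer.
        phys_srcs = [self.table[s] for s in srcs]
        if not self.free:
            raise RuntimeError("no free physical register: stall decode")
        new = self.free.popleft()
        prev = self.table[tgt]   # previous version, releasable when this commits
        self.table[tgt] = new
        return new, phys_srcs, prev

    def commit(self, prev):
        # Release the previous version of the target, not the new register.
        self.free.append(prev)


prog = [
    ("F3", ["F1", "F0"]),  # A: DIVF F3, F1, F0
    ("F2", ["F1", "F0"]),  # B: SUBF F2, F1, F0
    ("F0", ["F2", "F4"]),  # C: MULF F0, F2, F4
    ("F6", ["F2", "F3"]),  # D: SUBF F6, F2, F3
    ("F2", ["F5", "F4"]),  # E: ADDF F2, F5, F4
    ("F0", ["F0", "F2"]),  # F: ADDF F0, F0, F2
]
r = Renamer([f"F{i}" for i in range(7)], 13)
out = [r.rename(tgt, srcs) for tgt, srcs in prog]
r.commit(out[0][2])        # A commits: the old physical register of F3 is freed
```

Each tuple in `out` is (new physical target, physical sources, previous version of the target). E gets a fresh register for F2, so it no longer conflicts with B, C or D.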
How Many Physical Registers?
- Correctness: at least as many as architectural, plus?
- Performance: as many as possible
  - Not correctness
  - Recall: not all instructions produce register results (stores and branches)
Dynamic Scheduling

    Instruction          Renamed (dest, src1, src2)
    A: DIVF F3, F1, F0   r1, -, -
    B: SUBF F2, F1, F0   r2, -, -
    C: MULF F0, F2, F4   r3, r2, -
    D: SUBF F6, F2, F3   r4, r2, r1
    E: ADDF F2, F5, F4   r5, -, -
    F: ADDF F0, F0, F2   r6, r3, r5

- Values and names flow together
- Writeback specifies both value and name
- A waiting instruction inspects all results
- It is allowed to execute when all inputs are available
[Figure: result bus carrying Name and Value]
Physical Registers
- Physical register file is just one option
- What we need is separate storage
  - Consumers could keep values in their reservation station
- Tomasulo’s next
Tomasulo’s Algorithm
- IBM 360/91: fast 360 for scientific code
  - Completed in 1967
  - Dynamic scheduling
  - Predates cache memories
- Pipelined FUs
  - Adder: up to 3 instructions
  - Multiplier: up to 2 instructions
- Tomasulo vs. Scoreboard
  - Distributed hazard detection and control
  - Results are bypassed to FUs
  - Common Data Bus (CDB) for results
    - All results visible to all instead of via a register
DLX w/ Tomasulo
- Tomasulo’s Algorithm
  - Use “tags” to identify data values
  - Reservation stations: distributed control
  - CDB broadcasts all results to all RSs
- Extend DLX as example
  - Assume multiple FUs rather than pipelined
  - Main difference is register-memory insts., i.e., DLX does not have them
  - But that’s really a detail :-)
- Physical registers? Not really
  - What we need is different storage and a different name for every version
  - Here it’s the producing reservation station
Dynamic DLX
[Figure: Tomasulo datapath with the operation stack, registers, reservation stations (RS) in front of the adders and multipliers, load buffers and store buffers, all connected by the CDB]
Tomasulo’s Algorithm
- 3 major steps: Dispatch, Issue, Complete
- Dispatch
  - Get instruction from fetch queue
  - ALU op: check for available RS
  - Load: check for available load buffer
  - If available: dispatch and copy read regs to RS or load buffer
  - If not: stall (structural hazard)
- Issue
  - If all ops are available: issue
  - If not: monitor CDB for operands
- Complete
  - If CDB available: broadcast result
  - Else: stall
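The three steps can be sketched in a few lines, assuming ALU operations with exactly two sources, integer immediates, and reservation station indices used as tags. The `RS` and `Tomasulo` classes are hypothetical, and the model ignores timing, load/store buffers and CDB contention.

```python
import operator


class RS:
    """One reservation station; its index doubles as the result tag."""

    def __init__(self):
        self.busy = False
        self.op = None
        self.q = [None, None]   # Qj, Qk: tags of pending producers
        self.v = [None, None]   # Vj, Vk: operand values once known


class Tomasulo:
    def __init__(self, regs, num_rs):
        self.val = dict(regs)                 # register values
        self.qi = {r: None for r in regs}     # tag of the pending producer, if any
        self.rs = [RS() for _ in range(num_rs)]

    def dispatch(self, op, tgt, srcs):
        free = [i for i, s in enumerate(self.rs) if not s.busy]
        if not free:
            return None                       # structural hazard: stall
        tag, s = free[0], self.rs[free[0]]
        s.busy, s.op = True, op
        for k, src in enumerate(srcs):
            if isinstance(src, int):          # immediate operand
                s.q[k], s.v[k] = None, src
            elif self.qi[src] is None:        # value is in the register file
                s.q[k], s.v[k] = None, self.val[src]
            else:                             # wait for the producing station
                s.q[k], s.v[k] = self.qi[src], None
        self.qi[tgt] = tag                    # rename the target to this station
        return tag

    def can_issue(self, tag):
        s = self.rs[tag]
        return s.busy and s.q == [None, None]

    def complete(self, tag):
        s = self.rs[tag]
        result = s.op(s.v[0], s.v[1])
        # CDB broadcast: (tag, value) is seen by all stations and registers.
        for other in self.rs:
            for k in range(2):
                if other.busy and other.q[k] == tag:
                    other.q[k], other.v[k] = None, result
        for r, q in self.qi.items():
            if q == tag:
                self.qi[r], self.val[r] = None, result
        s.busy = False
        return result


m = Tomasulo({"r1": 1, "r2": 2, "r3": 3, "r4": 4}, num_rs=3)
t0 = m.dispatch(operator.add, "r1", ["r2", 10])   # add r1, r2, 10
t1 = m.dispatch(operator.sub, "r4", ["r1", 20])   # sub r4, r1, 20 waits for t0
t2 = m.dispatch(operator.add, "r1", ["r3", 30])   # add r1, r3, 30 renames r1 again
sub_ready_before = m.can_issue(t1)
m.complete(t2)                                    # younger add finishes first
m.complete(t0)                                    # older add feeds only the sub
sub_ready_after = m.can_issue(t1)
m.complete(t1)
```

Because the register file only accepts a result whose tag matches its current `qi` entry, the late result of the first add reaches the sub but does not overwrite r1.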
Tomasulo’s Algorithm contd.
- Reservation stations
  - Handle distributed hazard detection and instruction control
- Everything receiving data gets its tag
  - 4-bit tag specifies reservation station or load buffer
  - Also which FU will produce the result
- Register specifiers are used to assign tags
  - Then they are discarded
  - Input register specifiers are ONLY used in dispatch (rename table)
- Common Data Bus: value + “tag” = where this comes from
  - vs. typical bus: value + “tag” = where this goes to
Tomasulo’s Algorithm Contd.
- Reservation Stations
  - Op: opcode
  - Qj, Qk: tag fields (source ops)
  - Vj, Vk: operand values (source ops)
  - Busy: currently in use
- Register file and Store Buffer
  - Qi: tag field
  - Vi: value
- Load Buffers
  - Busy: currently in use
Tomasulo’s: Understanding Speculative vs. Architectural State
    add r1, r2, 10
    sub r4, r1, 20
    add r1, r3, 30
Register file, one entry per arch. reg. name. “Where is the register?” can be: “I have it”, “reservation station id”
    reg  where      value
    r1   I have it  Value of r1
    r2   I have it  Value of r2
    r3   I have it  Value of r3
    r4   I have it  Value of r4
Reservation stations (tgt is the arch. name of the target register); all entries empty:
    tgt  src1                 src2
    NA   NA, Value of Src1    NA, Value of Src2

Renaming 1st Instruction: add r1, r2, 10
- Read sources (r2)
- Rename r1 to RS0
    reg  where      value
    r1   RS0        -----
    r2   I have it  Value of r2
    r3   I have it  Value of r3
    r4   I have it  Value of r4

    RS   tgt  src1                     src2
    RS0  r1   I have it, Value of r2   I have it, 10
    RS1  NA   NA                       NA
    RS2  NA   NA                       NA

Renaming 2nd Instruction: sub r4, r1, 20
- Sources: r1 in RS0, NYA
- Rename r4 to RS1
    reg  where      value
    r1   RS0        -----
    r2   I have it  Value of r2
    r3   I have it  Value of r3
    r4   RS1        ----

    RS   tgt  src1                     src2
    RS0  r1   I have it, Value of r2   I have it, 10
    RS1  r4   RS0, ----                I have it, 20
    RS2  NA   NA                       NA

Renaming 3rd Instruction: add r1, r3, 30
- Sources: r3 avail.
- Rename r1 to RS2
    reg  where      value
    r1   RS2        -----
    r2   I have it  Value of r2
    r3   I have it  Value of r3
    r4   RS1        ----

    RS   tgt  src1                     src2
    RS0  r1   I have it, Value of r2   I have it, 10
    RS1  r4   RS0, ----                I have it, 20
    RS2  r1   I have it, Value of r3   I have it, 30
Example: cycle 0
Example: cycle 1
Example: cycle 3
- Mul is issued vs. scoreboard
- What’s waiting for L1?
Example…
- Check the web site…
  - Too much for in-class
- Summary:
  - Execution proceeds in any order that does not violate RAW dependences
  - WAR and WAW are removed
Tomasulo’s vs. Scoreboard
- In-order issue
- Out-of-order execution
- Out-of-order completion
Scoreboard:
Tomasulo’s
- Out-of-order loads and stores?
  - What about WAW, RAW and WAR?
  - Compare all load addresses against the addresses of all preceding store buffers
  - Stall if they match
- CDB is a bottleneck
  - One write per cycle
  - Could duplicate
  - But that comes at a cost: datapath + duplicated tags and control
- Complex implementation
- Scalability?
  - All results to all sources
  - What if we want 128 instrs?
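The load-versus-store-buffer check can be sketched as a simple address comparison. The function name is hypothetical, and the sketch assumes the addresses of all preceding buffered stores are already known.

```python
def load_may_proceed(load_addr, preceding_store_addrs):
    """Return True if no earlier buffered store targets the load's address."""
    return all(load_addr != a for a in preceding_store_addrs)


store_buffer = [0x100, 0x180]   # addresses of earlier, still-pending stores
ok = load_may_proceed(0x200, store_buffer)        # no match: load can go ahead
blocked = load_may_proceed(0x180, store_buffer)   # match: stall the load
```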
Tomasulo’s Advantages
- Distribution of hazard detection
- Elimination of WAR and WAW stalls
- Common Data Bus
  - Broadcasts result to multiple instrs (+)
  - Bottleneck
- Register Renaming
  - Removes WAR and WAW hazards
  - More interesting when the same code appears twice
    - Think of loops
    - More on this later
- BUT: associative lookups
  - RECALL: direct map is faster
In Summary