EE457: Out of Order (OoO) Execution. Introduction to Dynamic Scheduling of Instructions (The Tomasulo Algorithm). By Gandhi Puvvada
References: EE557 Textbook; Prof. Dubois’ EE557 Classnotes; Prof. Annavaram’s slides; Prof. Patterson’s lecture slides.
Programs often contain several small, mutually independent fragments of code, which can be executed in any order.
OoO (Out of Order) execution. Io = In order, Oo = Out of order. "Execution" here means producing the results. "Completion" means committing the results (writing them into the register file or memory). IoI (IoD) OoE IoC = In-order Issue/Dispatch, Out-of-order Execution, and finally In-order Completion/Commitment.
IoC or OoC? IoC (in-order completion) is necessary to support exceptions (e.g., a page fault). Here we first present IoI (IoD) OoE OoC and then (at the end) IoI (IoD) OoE IoC.
OoC? But what about branches? Hopefully you are not executing instructions beyond an unresolved branch and committing them! So we dispatch a branch, suspend dispatching, and wait until the branch is resolved. Then we resume dispatching instructions beyond the branch, either from the fall-through area or from the target area.
Instruction Scheduling (Re-ordering of instructions) Basic block = a straight-line code sequence with no branches. Compiler can perform static instruction scheduling. Tomasulo Algorithm lets us schedule instructions dynamically (in hardware). Branch prediction and speculative execution beyond a branch (of course with ability to flush wrong-path instructions on misprediction) will be covered later (and implemented on FPGA in EE560).
Register renaming to allow later instructions to proceed. Before renaming: lw $8, 40($2); add $8, $8, $8; sw $8, 40($2); lw $8, 60($3); sw $8, 60($3); After renaming: lw $8, 40($2); add $8, $8, $8; sw $8, 40($2); lw $48, 60($3); add $48, $48, $48; sw $48, 60($3);
Static Scheduling (based on Prof. Dubois slide)
Strengths
-- Hardware simplicity
-- Compiler has a global view of the code (does not help the hardware much)
Weaknesses
-- can not be CPU-implementation specific
-- can not foresee dynamic events: cache misses, data-dependent delays, conditional branches
-- can only reschedule instructions in a basic block (basic block = a straight-line code sequence with no branches)
-- can not pre-compute memory addresses
Simple 5-stage pipeline: in-order execution. RAW dependency: solve it by forwarding; if that is not possible, by stalling. Dependent instructions are stalled in the ID stage. [Figure: pipeline with stages IF, ID, EX, M, WB, instruction memory IM, and data memory DM]
Simple 5-stage pipeline: Dependent instructions are stalled in the ID stage. [Figure: pipeline holding an "and" instruction and a "lw" instruction]
Simple 5-stage pipeline: Dependent instructions can not be stalled in the EX stage. Why? [Figure: pipeline holding an "and" instruction and a "lw" instruction]
Provide multiple functional units (for simplicity, we avoid talking about the floating-point execution unit and the floating-point register file). Instructions stall, after decoding, in queues. [Figure: IF (with IM) and ID, followed by queues feeding the Integer, Multiply, Divide, and Load/Store (with DM) functional units, followed by WB]
rs, rt (IDs) are carried into EX. Why do junior instructions carry their source register IDs into the EX stage? Because they need to get help (forwarded values) from Senior #1 or Senior #2 in the EX stage under the control of the FU. No more of that. There may be 40 seniors in front of you. So I, the dispatch unit, will tell you from which senior you need to get help for which source register.
Tomasulo’s plan: OoO (out of order) execution; multiple functional units (say, Integer, DM, Multiplier, Divider); queues between the ID and EX stages (in place of the ID/EX register).
Out of order execution?! Problems all over??!! For the time being: no branch prediction and no speculative execution beyond branches (just stall on a conditional branch), and no support for precise exceptions. Even then, …
RAW, WAR, and WAW. RAW = Read After Write: lw $8, 40($2); add $9, $8, $7; WAR = Write After Read: add $9, $8, $6; lw $8, 40($2); WAW = Write After Write: add $9, $8, $6; lw $9, 40($2); WAW? How is it possible? Consider a printer or a FIFO. Why would anyone produce a result in $9 and then, without using that result, overwrite it with another result?
WAW can easily occur! How is it possible? In out-of-order execution, instructions before a branch and instructions after the branch can coexist. For example, multiple iterations of this loop can coexist in the execution area, so the writes to $2 (and likewise to $4 and $1) from different iterations form WAW pairs. Loop: LW $2, 40($1); MULT $4, $2, $3; SW $4, 40($1); ADDI $1, $1, -4; BNE $1, $0, Loop;
Say a company gives a standard bonus to most of its employees and a higher bonus to the managers. So you load the standard bonus into $3 from the stdbonus location in memory. Then you check whether this is the case of a manager and, if so, load into $3 again (overwriting the earlier $3) the special bonus from the special location in memory. LW $3, stdbonus($0); BNE $1, $2, SKIP; LW $3, special($0);
RAW, WAR, and WAW (some terminology to remember). RAW = Read After Write, a true dependency: lw $8, 40($2); add $9, $8, $7; WAR = Write After Read, an anti-dependency: add $9, $8, $6; lw $8, 40($2); WAW = Write After Write, an output dependency: add $9, $8, $6; lw $9, 40($2); WAR and WAW are name dependences.
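The three definitions can be checked mechanically. The following is a minimal Python sketch, not part of the original slides: the `Instr` record and the `hazards` function are hypothetical names introduced here, and each instruction is reduced to the register it writes and the registers it reads.

```python
from typing import NamedTuple, Optional, Set, Tuple

class Instr(NamedTuple):
    text: str                 # assembly text, for display only
    dest: Optional[str]       # register written (None for sw, bne, ...)
    srcs: Tuple[str, ...]     # registers read

def hazards(senior: Instr, junior: Instr) -> Set[str]:
    """Register hazards that the junior (later) instruction has on the senior."""
    found = set()
    if senior.dest is not None and senior.dest in junior.srcs:
        found.add("RAW")      # junior reads what senior writes
    if junior.dest is not None and junior.dest in senior.srcs:
        found.add("WAR")      # junior writes what senior reads
    if senior.dest is not None and senior.dest == junior.dest:
        found.add("WAW")      # both write the same register
    return found

# The three instruction pairs from the slide.
raw = hazards(Instr("lw $8, 40($2)", "$8", ("$2",)),
              Instr("add $9, $8, $7", "$9", ("$8", "$7")))
war = hazards(Instr("add $9, $8, $6", "$9", ("$8", "$6")),
              Instr("lw $8, 40($2)", "$8", ("$2",)))
waw = hazards(Instr("add $9, $8, $6", "$9", ("$8", "$6")),
              Instr("lw $9, 40($2)", "$9", ("$2",)))
print(raw, war, waw)
```

A pair such as `lw $8, 40($2); add $8, $8, $8` reports both RAW and WAW, which is why a single temporary register creates so many name dependences.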
RAW, WAR, and WAW In-order execution: We need to deal with RAW only. Out of order execution: Now we need to deal with WAR and WAW besides RAW.
Limited architectural registers, more physical registers: register renaming. lw $8, 40($2); add $8, $8, $8; sw $8, 40($2); lw $8, 60($3); sw $8, 60($3); It is clear that the compiler is using $8 as a temporary register. If there is a delay in obtaining $2, the first part of the code can not proceed. Unfortunately, the second part of the code can not proceed either, because of the name dependency on $8.
This is an example of name dependency. If we had 64 registers instead of 32, then perhaps the compiler might have used $48 instead of $8, and we could have executed the second part of the code before the first part! lw $8, 40($2); add $8, $8, $8; sw $8, 40($2); lw $48, 60($3); add $48, $48, $48; sw $48, 60($3);
Four different temporary registers can be used here as shown: $8, $18, $28, and $48 (or, using code names, LION, TIGER, CAT, and ANT). lw $8, 40($2); add $18, $8, $8; sw $18, 40($2); lw $28, 60($3); add $48, $28, $28; sw $48, 60($3); With code names: lw LION, 40($2); add TIGER, LION, LION; sw TIGER, 40($2); lw CAT, 60($3); add ANT, CAT, CAT; sw ANT, 60($3);
Can a later implementation provide 64 registers (instead of 32) while maintaining binary compatibility with previously compiled codes? Answer: Yes / No Why?
Answer: No. We can not change the number of architectural registers. Instead, we use register renaming through tagging registers. This solves name dependency problems (WAR and WAW) while attending to true dependency (RAW) through waiting in queues.
RST = Register Status Table, RF = Register File. [Figure: RST and RF, each with entries $1 to $31, shown while dispatching square_root $2, $10; lw $8, 40($2); add $8, $8, $8; sw $8, 40($2); lw $8, 60($3); sw $8, 60($3); with destination registers and dependent source registers marked. The slide is repeated for successive dispatch steps.]
Dispatch unit decodes and dispatches instructions. For the destination operand, an instruction carries a TAG (but not the actual register name)! For source operands, an instruction carries either the values or the TAGs of the operands (but not the actual register names)! Example sequence: square_root $2, $10; lw $8, 40($2); add $8, $8, $8; sw $8, 40($2); lw $8, 60($3); sw $8, 60($3);
Register Renaming
TAGs for destinations or sources or for both? A new tag is assigned to the destination register of the instruction being dispatched. For each of the source registers (source operands) of the instruction being dispatched, either the value of the source register (if it has not been previously tagged) or the existing tag associated with the source register (if it has been tagged already) is conveyed to the instruction. If a tag is conveyed for a source, then the instruction needs to wait for the original instruction with that destination tag to go on to the CDB and announce the value.
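The dispatch rule above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the course's hardware design: the `Dispatcher` class, its field names, the tag numbers 10 to 13, and the register values are all hypothetical. The one detail worth noticing is that the source registers are looked up before the destination is re-tagged, so that `add $8, $8, $8` picks up the tag of the earlier `lw`.

```python
from typing import Dict, List, Optional, Tuple

Operand = Tuple[str, int]   # ("val", value) or ("tag", tag number)

class Dispatcher:
    def __init__(self, rf: Dict[str, int], free_tags: List[int]):
        self.rf = rf                      # register file: register -> value
        self.rst: Dict[str, int] = {}     # register status table: register -> pending tag
        self.free_tags = list(free_tags)  # stand-in for the tag FIFO

    def dispatch(self, dest: Optional[str], srcs: List[str]):
        # Sources first: convey the existing tag if the register is tagged,
        # otherwise convey the value from the register file.
        ops: List[Operand] = [
            ("tag", self.rst[r]) if r in self.rst else ("val", self.rf[r])
            for r in srcs
        ]
        # Then assign a fresh tag to the destination (if there is one).
        tag = None
        if dest is not None:
            tag = self.free_tags.pop(0)
            self.rst[dest] = tag
        return tag, ops

d = Dispatcher(rf={"$2": 100, "$3": 200, "$8": 0}, free_tags=[10, 11, 12, 13])
t1, ops1 = d.dispatch("$8", ["$2"])         # lw  $8, 40($2)
t2, ops2 = d.dispatch("$8", ["$8", "$8"])   # add $8, $8, $8
t3, ops3 = d.dispatch(None, ["$8", "$2"])   # sw  $8, 40($2)
t4, ops4 = d.dispatch("$8", ["$3"])         # lw  $8, 60($3)
t5, ops5 = d.dispatch(None, ["$8", "$3"])   # sw  $8, 60($3)
for t, ops in [(t1, ops1), (t2, ops2), (t3, ops3), (t4, ops4), (t5, ops5)]:
    print(t, ops)
```

In this run the second `lw` carries no tag from the first three instructions, so it can proceed even while the first `lw` is still waiting, which is exactly the effect that renaming $8 to $48 had in the earlier example.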
Unique TAG: Like an SSN, we need a unique TAG. SSNs are reused. Similarly, TAGs can be reused. TAGs are similar to numbered TOKENs.
Take a number vs. take a token. Taking a number helps to create a virtual queue. We do not need that here! In State Bank of India, the cashier issues brass tokens to customers withdrawing money as an identification (and not at all to put them in any virtual queue). Token numbers are in random order. The cashier verifies the signature in the records room, returns with the money, calls the token number, and hands over the money. Tokens are reclaimed and reused.
TAGs (= Tokens): How many tokens should the bank cashier have to start with? What happens if the tokens run out? Does he need to keep any order in holding and issuing tokens? Does he have to collect tokens back?
TAG FIFO (FIFOs are taught in EE560). To issue and collect tokens (TAGs), use a circular FIFO (First-In-First-Out) unit. While the FIFO order is not important here, a FIFO is easier to implement in hardware than a pile accessed in random order. It is filled with (say) 64 tokens (in any order) initially on reset. Tokens return out of order anyway. Put returned tokens back in the FIFO and issue them again. [Figure: circular FIFO with write pointer wp and read pointer rp in three states: full, after 2 tokens issued, and after 1 token returned]
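A software model of such a tag FIFO might look like the sketch below. It is an illustration only: the class name `TagFifo`, the method names, and the tag numbering 0 to N-1 are assumptions, and a `count` field is used to tell full from empty where a hardware design would more likely use an extra pointer bit.

```python
class TagFifo:
    """Circular FIFO of free tags with a read pointer (rp) and a write pointer (wp)."""

    def __init__(self, num_tags: int = 64):
        self.size = num_tags
        self.slots = list(range(num_tags))  # filled with all tags on reset
        self.rp = 0              # next tag to issue
        self.wp = 0              # next slot to accept a returned tag
        self.count = num_tags    # number of free tags currently held (full on reset)

    def issue(self):
        """Hand out a free tag, or None if none is left (dispatch must stall)."""
        if self.count == 0:
            return None
        tag = self.slots[self.rp]
        self.rp = (self.rp + 1) % self.size
        self.count -= 1
        return tag

    def give_back(self, tag: int) -> None:
        """Return a tag once its instruction has announced its result."""
        if self.count == self.size:
            raise RuntimeError("more tags returned than were issued")
        self.slots[self.wp] = tag
        self.wp = (self.wp + 1) % self.size
        self.count += 1

fifo = TagFifo(4)            # a small FIFO so that running out is easy to see
a = fifo.issue()
b = fifo.issue()
fifo.give_back(b)            # tags come back out of order; that is fine
rest = [fifo.issue() for _ in range(4)]
print(a, b, rest)
```

The `None` result answers the question of what happens when the tokens run out: the dispatch unit simply waits until some instruction returns its tag.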
CDB = Common Data Bus (compare it to a public announcement system). Block diagram provided by Prof. Dubois, simplified for EE457. [Figure: block diagram including the TAG FIFO, Int. Divider, Integer Multiplier, Issue Unit, and CDB]
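The public-announcement idea can be sketched as follows. This is a hedged Python illustration of the classic tag-matching scheme, not a description of the block diagram: `Waiting`, `snoop`, and `broadcast` are hypothetical names. Every waiting instruction listens for the tags it is missing, and the register file accepts the value only if the RST still holds the announced tag for that register.

```python
from typing import Dict, List, Tuple

class Waiting:
    """An instruction waiting in a queue with value or tag operands."""

    def __init__(self, name: str, operands: List[Tuple[str, int]]):
        self.name = name
        self.operands = list(operands)   # each ("val", v) or ("tag", t)

    def ready(self) -> bool:
        return all(kind == "val" for kind, _ in self.operands)

    def snoop(self, tag: int, value: int) -> None:
        # Replace every operand that was waiting on this tag with the value.
        self.operands = [("val", value) if op == ("tag", tag) else op
                         for op in self.operands]

def broadcast(tag: int, value: int, queues: List[List[Waiting]],
              rst: Dict[str, int], rf: Dict[str, int]) -> None:
    """Announce (tag, value) on the CDB to all queues and to the RST/RF."""
    for queue in queues:
        for entry in queue:
            entry.snoop(tag, value)
    for reg, pending in list(rst.items()):
        if pending == tag:       # only the latest producer updates the register
            rf[reg] = value
            del rst[reg]

# lw $8 got tag 10; add $8, $8, $8 got tag 11, so the RST holds 11 for $8.
rf, rst = {"$8": 0}, {"$8": 11}
add = Waiting("add $8, $8, $8", [("tag", 10), ("tag", 10)])
broadcast(10, 7, [[add]], rst, rf)    # the lw announces its result
after_lw = (add.ready(), dict(rst), dict(rf))
broadcast(11, 14, [[add]], rst, rf)   # the add announces its result
print(after_lw, rst, rf)
```

The first broadcast wakes up the `add` but leaves the register file alone, because $8 has since been re-tagged; that is how the tag check disposes of the WAW hazard on $8.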
Front-End & Back-End: IFQ = Instruction Fetch Queue (a FIFO structure); Dispatch unit (including RST, RF, Tag FIFO); Load-Store and other Issue Queues; Issue Unit; Functional units; CDB (Common Data Bus).
Bottleneck in the design: CDB = Common Data Bus. Do all instructions use the CDB? sw? j (jump)? beq?
Load-store queue: address calculation and memory disambiguation. Mr. Bruin: Let me take a guess! You will now propose to have an MST (Memory Status Table) (like the RST). And you will rename memory locations to solve WAW and WAR problems among memory locations, right?!
MST (Memory Status Table)? No way! It is too big! We will just ask the junior to stall and wait to solve his WAR and WAW problems with his seniors. [Figure: the RST alongside the RF ($1 to $31), compared with an MST alongside Memory]
Address calculation for lw and sw. EE557 approach for address calculation. EE457/560 approach for address calculation: dedicated adder, to compute the address, attached to the load-store queue.
Memory Disambiguation EE557
Memory Disambiguation. RAW: sw $2, 2000($0); lw $8, 2000($0); The later lw can proceed only if there is no store ahead of it with the same address. WAW: sw $2, 2000($0); sw $8, 2000($0); The later sw can proceed only if there is no store ahead of it with the same address. WAR: lw $2, 2000($0); sw $8, 2000($0); The later sw can proceed only if there is no load ahead of it with the same address.
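The three rules can be written as one check over the load-store queue. The Python sketch below is illustrative only: `MemOp` and `may_proceed` are hypothetical names, the queue is assumed to be in program order (oldest first), and every address is assumed to be already computed. A real queue would also have to hold back a junior behind any senior whose address is not yet known.

```python
from typing import List, NamedTuple

class MemOp(NamedTuple):
    kind: str    # "lw" or "sw"
    addr: int    # effective address, assumed already computed

def may_proceed(queue: List[MemOp], i: int) -> bool:
    """True if the memory operation at position i has no conflicting senior."""
    me = queue[i]
    for senior in queue[:i]:
        if senior.addr != me.addr:
            continue                     # different location: no conflict
        if me.kind == "lw" and senior.kind == "sw":
            return False                 # RAW: load behind a store
        if me.kind == "sw":
            return False                 # WAW (senior sw) or WAR (senior lw)
    return True

raw_ok = may_proceed([MemOp("sw", 2000), MemOp("lw", 2000)], 1)
waw_ok = may_proceed([MemOp("sw", 2000), MemOp("sw", 2000)], 1)
war_ok = may_proceed([MemOp("lw", 2000), MemOp("sw", 2000)], 1)
other_ok = may_proceed([MemOp("sw", 2000), MemOp("lw", 3000)], 1)
print(raw_ok, waw_ok, war_ok, other_ok)
```

A load behind another load to the same address is allowed to proceed, since two reads do not conflict.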
Maintaining instructions in the order of arrival (issue order/program order) in a queue: is it necessary or is it desirable? In the case of the L-S queue? NECESSARY, to enforce the memory disambiguation rules. In the case of the integer and other queues (mult queue, div queue)? DESIRABLE, so that an earlier instruction gets executed whenever possible, thereby perhaps reducing the number of instructions waiting on it.
Priority (based on the order of arrival) among instructions ready to execute: is it necessary or is it desirable? Local priority within the queues. Global priority across the queues.
Issue Unit: CDB availability constraint; pipelined functional unit vs. multi-cycle functional unit; conflict resolution. Is round-robin priority adequate? Well, …
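For reference, plain round-robin conflict resolution among the queues can be sketched as below. This is a generic arbiter in Python, not the issue unit of the course design: the function name and the one-request-bit-per-queue interface are assumptions, and the sketch deliberately ignores the CDB availability and functional-unit latency constraints listed above.

```python
from typing import List, Optional

def round_robin_grant(requests: List[bool], last_granted: int) -> Optional[int]:
    """Pick the next requesting queue after the one granted most recently."""
    n = len(requests)
    for offset in range(1, n + 1):
        i = (last_granted + offset) % n   # search starts just past the last winner
        if requests[i]:
            return i
    return None                           # nobody is requesting this clock

# Four queues (say integer, load-store, multiply, divide); queue 1 is idle.
requests = [True, False, True, True]
g1 = round_robin_grant(requests, last_granted=0)
g2 = round_robin_grant(requests, last_granted=g1)
g3 = round_robin_grant(requests, last_granted=g2)
print(g1, g2, g3)
```

The scheme is fair to the queues, but it pays no attention to which ready instruction is the oldest, which is one reason to ask whether it is adequate.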
Conditional branches: The dispatch unit stops dispatching until the branch is resolved. The CDB broadcasts the result of the branch. Dispatching continues thereafter at either the fall-through instruction or the target instruction. A successful (taken) branch shall cause flushing of the IFQ, very much like a jump.
Conditional branches: Since we stop dispatching instructions after a branch, does it mean that this branch is the last instruction to be executed in the back-end? Is it possible that the back-end simultaneously holds (a) some instructions dispatched before the branch and (b) some instructions dispatched after the branch was resolved?
Tomasulo Loop Example (based on Prof. Annavaram’s lecture slide). Loop: LW $2, 40($1); MULT $4, $2, $3; SW $4, 40($1); ADDI $1, $1, -4; BNE $1, $0, Loop; Assume the multiply takes 4 clocks. Assume the first load takes 8 clocks (cache miss) and the second load takes 1 clock (hit).
How could Tomasulo overlap iterations of loops? Loop: LW $2, 40($1); MULT $4, $2, $3; SW $4, 40($1); ADDI $1, $1, -4; BNE $1, $0, Loop; The destination registers bear different TAGs in different iterations. These tags are given, in place of the source operands, to the dependent instructions following them.
Say, only two iterations. Let us unroll the two iterations. Loop: LW $2, 40($1); MULT $4, $2, $3; SW $4, 40($1); ADDI $1, $1, -4; BNE $1, $0, Loop; Loop: LW $2, 40($1); BNE $1, $0, Loop; [Legend: destination register; dependent source register(s)]
Because there is no reorder buffer. Note: Your EE560 project will use a reorder buffer and much more!