Out of Order (OoO) Execution

Out of Order (OoO) Execution (EE457): Introduction to Dynamic Scheduling of Instructions (The Tomasulo Algorithm). By Gandhi Puvvada

References: EE557 textbook; Prof. Dubois' EE557 class notes; Prof. Annavaram's slides; Prof. Patterson's lecture slides.

Programs often contain several small fragments of code that can be executed in any order.

OoO (Out of Order) execution; Io = In order. "Execution" here means producing the results; "completion" means committing the results (writing them into the register file or memory). IoI (IoD) → OoE → IoC: In-order Issue/Dispatch, Out-of-order Execution, and finally In-order Completion/Commitment.

IoC or OoC? IoI (IoD) → OoE → IoC. IoC (in-order completion) is necessary to support exceptions (e.g., a page fault). Here we first present IoI (IoD) → OoE → OoC, and then (at the end) IoI (IoD) → OoE → IoC.

OoC? But what about branches? We had better not be executing instructions beyond a branch and committing them! Instead, we dispatch a branch, suspend dispatching, and wait until the branch is resolved. We then resume dispatching instructions beyond the branch, either in the fall-through area or in the target area.

Instruction Scheduling (re-ordering of instructions). Basic block = a straight-line code sequence with no branches. The compiler can perform static instruction scheduling. The Tomasulo algorithm lets us schedule instructions dynamically (in hardware). Branch prediction and speculative execution beyond a branch (with, of course, the ability to flush wrong-path instructions on a misprediction) will be covered later (and implemented on an FPGA in EE560).

Register renaming to allow later instructions to proceed. Before renaming:
lw $8, 40($2); add $8, $8, $8; sw $8, 40($2);
lw $8, 60($3); add $8, $8, $8; sw $8, 60($3);
After renaming the second fragment:
lw $8, 40($2); add $8, $8, $8; sw $8, 40($2);
lw $48, 60($3); add $48, $48, $48; sw $48, 60($3);

Static Scheduling (based on Prof. Dubois' slide).
Strengths: hardware simplicity; the compiler has a global view of the code (though this does not help the hardware much).
Weaknesses: cannot be CPU-implementation specific; cannot foresee dynamic events (cache misses, data-dependent delays); conditional branches mean it can only reschedule instructions within a basic block (a straight-line code sequence with no branches); cannot pre-compute memory addresses.

Simple 5-stage pipeline (IF, ID, EX, MEM, WB): in-order execution. A RAW dependency is solved by forwarding or, when forwarding cannot help, by stalling. Dependent instructions are stalled in the ID stage.

Simple 5-stage pipeline: dependent instructions (here, an and following an lw) are stalled in the ID stage.

Simple 5-stage pipeline: dependent instructions cannot be stalled in the EX stage. Why? (Again, consider an and following an lw.)

Provide multiple functional units (for simplicity, we avoid talking about the floating-point execution unit and the floating-point register file). Instructions stall, after decoding, in queues in front of the functional units: Integer, Load/Store (DM), Multiply, and Divide.

rs, rt (register IDs) are carried into EX. Why do junior instructions carry their source register IDs into the EX stage? They need to get help from Senior #1 or Senior #2 in the EX stage, under the control of the forwarding unit (FU). With dynamic scheduling there is no more of that: there may be 40 seniors in front of you, so the dispatch unit tells each junior from which senior it needs to get help for each source register.

Tomasulo's plan: out-of-order (OoO) execution; multiple functional units (say, Integer, DM, Multiplier, Divider); queues between the ID and EX stages (in place of the ID/EX register).

Out-of-order execution?! Problems all over?! For the time being: no branch prediction and no speculative execution beyond branches; we simply stall on a conditional branch. No support for precise exceptions for now. Even then, …

RAW, WAR, and WAW.
RAW = Read After Write: lw $8, 40($2); add $9, $8, $7;
WAR = Write After Read: add $9, $8, $6; lw $8, 40($2);
WAW = Write After Write: add $9, $8, $6; lw $9, 40($2);
WAW? How is it possible? Why would anyone produce a result in $9 and, without utilizing that result, overwrite it with another result? (Consider a printer or a FIFO.)

WAW can easily occur! In out-of-order execution, instructions before a branch and instructions after the branch can co-exist. For example, multiple iterations of this loop can coexist in the execution area:
Loop: LW $2, 40($1); MULT $4, $2, $3; SW $4, 40($1); ADDI $1, $1, -4; BNE $1, $0, Loop;

Say a company gives a standard bonus to most employees and a higher bonus to managers. You load the standard bonus from the stdbonus location in memory into $3. You then check whether this is the case of a manager and, if so, load into $3 again (overwriting the earlier $3) the special bonus from the special location in memory.
LW $3, stdbonus($0); BNE $1, $2, SKIP; LW $3, special($0);

RAW, WAR, and WAW (some terminology to remember):
RAW is a true dependence.
WAR is an anti-dependence (a name dependence).
WAW is an output dependence (a name dependence).

RAW, WAR, and WAW. In-order execution: we need to deal with RAW only. Out-of-order execution: we now need to deal with WAR and WAW besides RAW.
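The three hazard classes can be checked mechanically by comparing register operands. Below is a small illustrative Python sketch (not from the slides; the function name and the tuple encoding of instructions are my own):

```python
def classify(earlier, later):
    """Each instruction is (dest_reg, [source_regs]), e.g. ("$9", ["$8", "$7"]).
    Returns the set of hazards the later instruction has on the earlier one."""
    hazards = set()
    dst_e, srcs_e = earlier
    dst_l, srcs_l = later
    if dst_e is not None and dst_e in srcs_l:
        hazards.add("RAW")   # true dependence: later reads what earlier writes
    if dst_l is not None and dst_l in srcs_e:
        hazards.add("WAR")   # anti-dependence: later writes what earlier reads
    if dst_e is not None and dst_e == dst_l:
        hazards.add("WAW")   # output dependence: both write the same register
    return hazards

# lw $8, 40($2) followed by add $9, $8, $7 is a RAW hazard on $8:
print(classify(("$8", ["$2"]), ("$9", ["$8", "$7"])))  # {'RAW'}
```

An in-order pipeline only has to act on the RAW case; the out-of-order machine described in these slides must handle all three.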

Limited architectural registers, more physical registers: register renaming.
lw $8, 40($2); add $8, $8, $8; sw $8, 40($2);
lw $8, 60($3); sw $8, 60($3);
It is clear that the compiler is using $8 as a temporary register. If there is a delay in obtaining $2, the first part of the code cannot proceed. Unfortunately, the second part of the code cannot proceed either, because of the name dependence on $8.

This is an example of name dependence. If we had 64 registers instead of 32, then perhaps the compiler would have used $48 instead of $8, and we could have executed the second part of the code before the first part!
lw $8, 40($2); add $8, $8, $8; sw $8, 40($2);
lw $48, 60($3); add $48, $48, $48; sw $48, 60($3);

Four different temporary registers can be used here, as shown: $8, $18, $28, and $48 (or, called by code names, LION, TIGER, CAT, and ANT).
lw $8, 40($2); add $18, $8, $8; sw $18, 40($2);
lw $28, 60($3); add $48, $28, $28; sw $48, 60($3);
lw LION, 40($2); add TIGER, LION, LION; sw TIGER, 40($2);
lw CAT, 60($3); add ANT, CAT, CAT; sw ANT, 60($3);
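The renaming idea above can be sketched as a small Python model (illustrative only; the `rename` function, the tuple encoding, and the `p0, p1, …` physical-register names are my own, assuming an ample pool of fresh names): every destination gets a fresh name, and later sources are rewritten to the latest name of their architectural register.

```python
def rename(program, n_phys=64):
    """program: list of (op, dest_or_None, [sources]); registers are strings."""
    free = [f"p{i}" for i in range(n_phys)]      # pool of physical names
    latest = {}                                  # architectural -> current physical
    renamed = []
    for op, dst, srcs in program:
        srcs = [latest.get(s, s) for s in srcs]  # sources read the latest name
        if dst is not None:
            latest[dst] = free.pop(0)            # fresh name: removes WAR/WAW on dst
            dst = latest[dst]
        renamed.append((op, dst, srcs))
    return renamed

prog = [("lw", "$8", ["$2"]), ("add", "$8", ["$8", "$8"]), ("sw", None, ["$8", "$2"]),
        ("lw", "$8", ["$3"]), ("add", "$8", ["$8", "$8"]), ("sw", None, ["$8", "$3"])]
for ins in rename(prog):
    print(ins)   # the two halves now use p0/p1 and p2/p3: no name dependence left
```

After renaming, only the true (RAW) dependences within each fragment remain, so the second fragment can run ahead of the first.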

Can a later implementation provide 64 registers (instead of 32) while maintaining binary compatibility with previously compiled code? Answer: Yes / No. Why?

Answer: No, we cannot change the number of architectural registers. Instead: register renaming through tagging registers. This solves the name-dependence problems (WAR and WAW) while attending to true dependences (RAW) through waiting in queues.

RST = Register Status Table; RF = Register File. [Figure: the RST and the RF, each with entries $1 through $31, as the code below is dispatched; sources, destinations, and dependences are highlighted.]
square_root $2, $10; lw $8, 40($2); add $8, $8, $8; sw $8, 40($2); lw $8, 60($3); sw $8, 60($3);


The dispatch unit decodes and dispatches instructions. For its destination operand, an instruction carries a TAG (but not the actual register name)! For its source operands, an instruction carries either the values or the TAGs of the operands (but not the actual register names)!
square_root $2, $10; lw $8, 40($2); add $8, $8, $8; sw $8, 40($2); lw $8, 60($3); sw $8, 60($3);

Register Renaming

TAGs for destinations, for sources, or for both? A new tag is assigned to the destination register of the instruction being dispatched. For each source register (source operand) of the instruction being dispatched, either the value of the source register (if it has not previously been tagged) or the existing tag associated with it (if it has already been tagged) is conveyed to the instruction. If a tag is conveyed for a source, the instruction must wait for the instruction holding that destination tag to go onto the CDB and announce the value.
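This dispatch rule can be illustrated with a toy Python model (my own sketch, not the slides' hardware; the class and method names are assumptions): the RST records, per architectural register, the tag of the youngest pending writer; sources pick up either an RF value or a tag, and every destination picks up a fresh tag.

```python
class Dispatcher:
    """Toy model of the dispatch rule: the RST maps a register to the tag of
    its youngest pending writer; an untagged register is read from the RF."""
    def __init__(self, rf):
        self.rf = dict(rf)   # register file: reg -> value
        self.rst = {}        # register status table: reg -> tag
        self.next_tag = 0    # simple counter standing in for the tag FIFO
    def dispatch(self, dst, srcs):
        ops = [("tag", self.rst[s]) if s in self.rst else ("val", self.rf[s])
               for s in srcs]
        tag = self.next_tag
        self.next_tag += 1
        if dst is not None:
            self.rst[dst] = tag   # the instruction carries this tag, not dst's name
        return tag, ops
    def cdb_broadcast(self, tag, value, dst):
        """Waiting consumers match on the tag; the RF is updated only if this
        tag still belongs to the youngest writer of dst."""
        if self.rst.get(dst) == tag:
            self.rf[dst] = value
            del self.rst[dst]

d = Dispatcher({"$2": 100})
print(d.dispatch("$8", ["$2"]))        # (0, [('val', 100)]): $2 untagged, value conveyed
print(d.dispatch("$8", ["$8", "$8"]))  # (1, [('tag', 0), ('tag', 0)]): wait for tag 0 on the CDB
```

Note how the second instruction never learns the name $8 for its sources; it only knows to listen for tag 0 on the CDB.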

Unique TAG. Like an SSN, we need a unique TAG. SSNs are reused; similarly, TAGs can be reused. TAGs are similar to numbered TOKENs.

Take a number vs. take a token. In the State Bank of India, the cashier issues brass tokens to customers trying to draw money purely as identification (not at all to put them in any virtual queue); token numbers are handed out in random order. The cashier verifies the signature in the records room, returns with the money, calls the token number, and issues the money. Tokens are reclaimed and reused. A take-a-number system, by contrast, helps create a virtual queue; we do not need that here!

TAGs (= tokens). How many tokens should the bank cashier have to start with? What happens if the tokens run out? Does he need to maintain any order in holding and issuing tokens? Does he have to collect the tokens back?

TAG FIFO (FIFOs are taught in EE560). To issue and collect tokens (TAGs), use a circular FIFO (First-In-First-Out) unit. While FIFO order is not important here, a FIFO is the easiest structure to implement in hardware, compared to a random pile. It is filled with (say) 64 tokens (in any order) initially on reset. Tokens return out of order anyway; put them back into the FIFO and reissue them. [Figure: the circular FIFO's write pointer (wp) and read pointer (rp) as tokens are issued and returned.]
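A minimal software sketch of such a token dispenser (illustrative; the `TagFIFO` name and the behavior on an empty pool are my choices, modeling the 64-token circular FIFO described above):

```python
from collections import deque

class TagFIFO:
    """Circular FIFO of reusable tags, pre-filled on reset."""
    def __init__(self, n=64):
        self.fifo = deque(range(n))    # filled with n tokens; any order works
    def issue(self):
        if not self.fifo:
            raise RuntimeError("out of tags: dispatch must stall")
        return self.fifo.popleft()
    def reclaim(self, tag):
        self.fifo.append(tag)          # tags return out of order; that is fine

f = TagFIFO(4)
print([f.issue() for _ in range(4)])   # [0, 1, 2, 3]
f.reclaim(2)
print(f.issue())                       # 2: the reclaimed token is reissued
```

When the pool is exhausted, a real dispatch unit would stall rather than raise an error; the exception here just marks that condition.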

CDB = Common Data Bus (compare it to a public announcement system). [Block diagram provided by Prof. Dubois, simplified for EE457: the TAG FIFO, Integer unit, Integer Multiplier, Integer Divider, Issue Unit, and the CDB.]

Front-End & Back-End. IFQ = Instruction Fetch Queue (a FIFO structure); dispatch unit (including the RST, RF, and tag FIFO); load/store and other issue queues; issue unit; functional units; CDB (Common Data Bus).

Bottleneck in the design: the CDB (Common Data Bus). Do all instructions use the CDB? What about sw? j (jump)? beq?

Load/store queue: address calculation and memory disambiguation. Mr. Bruin: Let me take a guess! You will now propose an MST (Memory Status Table), like the RST, and you will rename memory locations to solve the WAW and WAR problems among memory locations, right?!

An MST (Memory Status Table)? No way! It is too big! (It would need an entry per memory location, unlike the small RST.) We will simply ask the junior instruction to stall and wait in order to resolve its WAR and WAW problems with its seniors.

Address calculation for lw and sw: the EE557 approach vs. the EE457/560 approach (a dedicated adder, attached to the load/store queue, computes the address).

Memory Disambiguation (EE557).


Memory Disambiguation.
RAW: sw $2, 2000($0); lw $8, 2000($0); (the later lw can proceed only if there is no store ahead of it with the same address).
WAW: sw $2, 2000($0); sw $8, 2000($0); (the later sw can proceed only if there is no store ahead of it with the same address).
WAR: lw $2, 2000($0); sw $8, 2000($0); (the later sw can proceed only if there is no load ahead of it with the same address).
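These three rules collapse into one conservative check over an in-order load/store queue. The sketch below is my own formulation (not the slides' hardware), assuming each queue entry is an (op, address) pair, with None for a not-yet-computed address:

```python
def may_issue(queue, i):
    """queue: in-order list of ("lw"|"sw", address) entries; the entry at
    position i may access memory only if no conflicting older entry exists."""
    op_i, addr_i = queue[i]
    for op_j, addr_j in queue[:i]:     # every older (senior) entry
        if addr_j is None:
            return False               # unknown older address: wait (conservative)
        if addr_j != addr_i:
            continue                   # different address: no conflict
        if op_j == "sw":
            return False               # older store, same address (RAW or WAW)
        if op_i == "sw":
            return False               # older load, same address (WAR)
    return True

print(may_issue([("sw", 2000), ("lw", 2000)], 1))  # False: RAW on address 2000
print(may_issue([("sw", 1000), ("lw", 2000)], 1))  # True: addresses differ
```

Note the rule for unknown addresses: until a senior's address is computed, a junior cannot prove the absence of a conflict, so it waits.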

Maintaining instructions in the order of arrival (issue order/program order) in a queue: is it necessary, or is it desirable? In the case of the L/S queue? In the case of the integer and other queues (mult queue, div queue)?

Maintaining instructions in the order of arrival (issue order/program order) in a queue: is it necessary, or is it desirable? In the case of the L/S queue: NECESSARY, to enforce the memory disambiguation rules. In the case of the integer and other queues (mult queue, div queue): DESIRABLE, so that an earlier instruction gets executed whenever possible, thereby reducing the number of instructions waiting on it.

Priority (based on the order of arrival) among instructions ready to execute: is it necessary or desirable? Local priority within the queues; global priority across the queues.

Issue Unit: the CDB availability constraint; pipelined functional units vs. multi-cycle functional units; conflict resolution. Is round-robin priority adequate? Well, …

Conditional branches: the dispatch unit stops dispatching until the branch is resolved. The CDB broadcasts the result of the branch. Dispatching continues thereafter, either at the fall-through instruction or at the target instruction. A taken branch causes flushing of the IFQ, very much like a jump.

Conditional branches: since we stop dispatching instructions after a branch, does that mean the branch is the last instruction to be executed in the back-end? Is it possible that the back-end simultaneously holds (a) some instructions dispatched before the branch and (b) some instructions dispatched after the branch was resolved?

Tomasulo Loop Example. Loop: LW $2, 40($1); MULT $4, $2, $3; SW $4, 40($1); ADDI $1, $1, -4; BNE $1, $0, Loop; Assume multiply takes 4 clocks. Assume the first load takes 8 clocks (cache miss) and the second load takes 1 clock (hit). Based on Prof. Annavaram's lecture slide.

How could Tomasulo overlap iterations of loops? Loop: LW $2, 40($1); MULT $4, $2, $3; SW $4, 40($1); ADDI $1, $1, -4; BNE $1, $0, Loop; The destination registers bear different TAGs in different iterations, and these tags are given, in place of the source operands, to the dependent instructions that follow them.
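A toy rendering of that claim (my own sketch; only the renaming bookkeeping is modeled, not the timing): each iteration's destination registers ($2 from the LW, $4 from the MULT, $1 from the ADDI) draw fresh tags, so the iterations do not collide on register names.

```python
rst = {}            # register status table: reg -> tag of youngest writer
next_tag = 0
iteration_tags = []
for it in range(2):                    # two iterations of the loop body
    tags = {}
    for dst in ("$2", "$4", "$1"):     # LW, MULT, ADDI destinations, in order
        rst[dst] = next_tag            # destination renamed to a fresh tag
        tags[dst] = next_tag
        next_tag += 1
    iteration_tags.append(tags)

# $2 is written under tag 0 in iteration 1 and under tag 3 in iteration 2,
# so the two LWs (and their dependent MULTs) can be in flight together.
print(iteration_tags)  # [{'$2': 0, '$4': 1, '$1': 2}, {'$2': 3, '$4': 4, '$1': 5}]
```

Each iteration's MULT listens for its own iteration's LW tag, never for a register name, which is what lets the iterations overlap.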

Say there are only two iterations; let us unroll them.
Iteration 1: LW $2, 40($1); MULT $4, $2, $3; SW $4, 40($1); ADDI $1, $1, -4; BNE $1, $0, Loop;
Iteration 2: LW $2, 40($1); MULT $4, $2, $3; SW $4, 40($1); ADDI $1, $1, -4; BNE $1, $0, Loop;
[Figure annotations mark each destination register and its dependent source register(s).]

Because there is no reorder buffer. Note: your EE560 project will use a reorder buffer and much more!