© Wen-mei Hwu and S. J. Patel, 2005 ECE 412, University of Illinois Lecture 10-11 Instruction Execution: Dynamic Scheduling.

Presentation transcript:

Outline
General concepts
– dataflow
– dynamic scheduling with Tomasulo’s Algorithm
The P6 Execution Microarchitecture
Dynamic Scheduling Issues

The Execution Problem
[Diagram: Instruction Supply, Execution Mechanism, Data Supply]
We are able to deliver instructions at high bandwidth, and we have techniques for high-bandwidth, low-latency data supply. But none of that matters if the execution mechanism cannot consume everything at the same high bandwidth.
We need to execute instructions in parallel.
Fundamental problem: executing instructions in the order prescribed by the programmer lets instruction dependences limit parallel execution.

Dynamic Scheduling
Reservation Station
Renaming
Retirement/Recovery
Memory Disambiguation
Tomasulo’s Algorithm

Dataflow Concepts
Source Code:
x = (a * b) - (c + d);
y = (r + s) / (t + v);
Machine Code:
1. MUL Ra, Rb -> Rm
2. ADD Rc, Rd -> Rn
3. SUB Rm, Rn -> Rx
4. ADD Rr, Rs -> Rm
5. ADD Rt, Rv -> Rn
6. DIV Rm, Rn -> Ry
[Figure: Dataflow Graph]
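
As a rough illustration of the dataflow view, the sketch below assigns each of the six instructions the earliest "level" at which it could execute if only true (producer-to-consumer) dependences mattered. The tuple encoding and the names `PROGRAM` and `dataflow_levels` are assumptions made for this example, not part of the lecture.

```python
# Each instruction: (number, opcode, source registers, destination register).
PROGRAM = [
    (1, "MUL", ("Ra", "Rb"), "Rm"),
    (2, "ADD", ("Rc", "Rd"), "Rn"),
    (3, "SUB", ("Rm", "Rn"), "Rx"),
    (4, "ADD", ("Rr", "Rs"), "Rm"),
    (5, "ADD", ("Rt", "Rv"), "Rn"),
    (6, "DIV", ("Rm", "Rn"), "Ry"),
]


def dataflow_levels(program):
    """Map instruction number -> earliest level, following true dependences only."""
    last_writer = {}  # register -> number of the most recent instruction writing it
    level = {}
    for num, _op, srcs, dest in program:
        # Producers are the most recent earlier writers of each source register.
        producers = [last_writer[r] for r in srcs if r in last_writer]
        # An instruction sits one level below its deepest producer (level 0 if none).
        level[num] = 1 + max((level[p] for p in producers), default=-1)
        last_writer[dest] = num
    return level


levels = dataflow_levels(PROGRAM)
print(levels)  # {1: 0, 2: 0, 3: 1, 4: 0, 5: 0, 6: 1}
```

Under these assumptions, instructions 1, 2, 4, and 5 could all start at once, and 3 and 6 follow one level later, even though program order lists them as 1 through 6.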

Data Dependences
Data flow dependence
– consumer-producer relationship
– register bypass and interlocks
Data output and antidependences
– reuse of registers at compile time
– register renaming

Interlocking
Allow an instruction to execute only when its data and resources are ready
– simple interlocking based on bypass logic for short pipelines
– scoreboarding for deep pipelines
– Tomasulo’s Algorithm for out-of-order instruction dispatch

Tomasulo’s Algorithm
Invented for the IBM System/360 Model 91 FPU
First published in 1967 (IBM Journal)
Not used in general CPU design until the 1990s
– by then the branch prediction and exception recovery problems had been solved

Tomasulo’s Algorithm
Register renaming
– tags for values
Out-of-order execution
– reservation stations
Data forwarding
– common data bus

Tomasulo’s Algorithm
Instruction decode
– read the register file for each source’s value and tag
– a tag is a handle for data that is currently being generated
– determine the RS entry that will hold the decoded operation

Reservation Station
Hardware mechanism that enables instructions to execute out of order, as soon as their source operands are ready.
An instruction waits in the RS until the tags for its source operands have been broadcast by their producers.

Tomasulo’s Algorithm
Instruction issue
– insert the operation and operands into the assigned reservation station entry
– mark the destination register as not ready

Tomasulo’s Algorithm
Operation dispatch
– identify operations ready for execution
– determine the highest-priority operation for each port/function unit

Tomasulo’s Algorithm
Data forwarding
– result value and tag distributed to RS entries for associative search
– result value and tag delivered to destination register for potential update
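
The issue, dispatch, and forwarding steps can be put together in a small software model. This is a minimal sketch under simplifying assumptions that are not from the lecture: every operation completes in one step, the oldest ready entry is always dispatched first, there is no limit on reservation station entries, and the class and method names (`Tomasulo`, `RSEntry`, `ready_tags`) are invented for illustration.

```python
import operator
from dataclasses import dataclass

OPS = {"ADD": operator.add, "SUB": operator.sub,
       "MUL": operator.mul, "DIV": operator.truediv}


@dataclass
class RSEntry:
    tag: str    # this entry's own tag, broadcast together with its result
    op: str
    vals: list  # operand values (None while still waiting)
    tags: list  # producer tags (None once the value has been captured)

    def ready(self):
        return all(t is None for t in self.tags)


class Tomasulo:
    def __init__(self, regs):
        self.regs = dict(regs)  # register -> value
        self.reg_tag = {}       # register -> tag of its pending producer
        self.rs = []            # reservation station entries, oldest first
        self.next_tag = 0

    def issue(self, op, src1, src2, dest):
        vals, tags = [], []
        for r in (src1, src2):
            if r in self.reg_tag:  # value still being generated: hold the tag
                vals.append(None)
                tags.append(self.reg_tag[r])
            else:                  # value available: capture it now
                vals.append(self.regs[r])
                tags.append(None)
        tag = f"RS{self.next_tag}"
        self.next_tag += 1
        self.rs.append(RSEntry(tag, op, vals, tags))
        self.reg_tag[dest] = tag   # mark destination register as not ready
        return tag

    def ready_tags(self):
        return [e.tag for e in self.rs if e.ready()]

    def dispatch(self):
        """Execute the oldest ready entry and broadcast its result."""
        for e in self.rs:
            if e.ready():
                self.rs.remove(e)
                self.broadcast(e.tag, OPS[e.op](*e.vals))
                return e.tag
        return None

    def broadcast(self, tag, value):
        for e in self.rs:  # associative search over waiting operands
            for i, t in enumerate(e.tags):
                if t == tag:
                    e.vals[i], e.tags[i] = value, None
        for r, t in list(self.reg_tag.items()):
            if t == tag:   # update only if this is still the latest producer
                self.regs[r] = value
                del self.reg_tag[r]


m = Tomasulo({"Ra": 2, "Rb": 3, "Rc": 4, "Rd": 5,
              "Rr": 6, "Rs": 10, "Rt": 1, "Rv": 3})
for ins in [("MUL", "Ra", "Rb", "Rm"), ("ADD", "Rc", "Rd", "Rn"),
            ("SUB", "Rm", "Rn", "Rx"), ("ADD", "Rr", "Rs", "Rm"),
            ("ADD", "Rt", "Rv", "Rn"), ("DIV", "Rm", "Rn", "Ry")]:
    m.issue(*ins)
ready_now = m.ready_tags()  # entries whose operands are all present
while m.dispatch() is not None:
    pass
print(ready_now, m.regs["Rx"], m.regs["Ry"])
```

Right after issue, the entries for instructions 1, 2, 4, and 5 are ready while 3 and 6 wait on tags. Note that when the first MUL broadcasts, register Rm is not updated, because its tag already names the later ADD; this is the "potential update" in the last bullet.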

Renaming
Objective: eliminate WAR and WAW (false) dependences
Renaming happens in program order
Renaming requires a table to map between architectural registers and physical registers
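
A minimal sketch of renaming with a mapping table, applied to the six-instruction example from the Dataflow Concepts slide. It assumes an unlimited supply of physical registers named P0, P1, ... (real hardware has a finite pool and must recycle them), and the function name `rename` and the tuple encoding are hypothetical.

```python
from itertools import count

PROGRAM = [
    (1, "MUL", ("Ra", "Rb"), "Rm"),
    (2, "ADD", ("Rc", "Rd"), "Rn"),
    (3, "SUB", ("Rm", "Rn"), "Rx"),
    (4, "ADD", ("Rr", "Rs"), "Rm"),
    (5, "ADD", ("Rt", "Rv"), "Rn"),
    (6, "DIV", ("Rm", "Rn"), "Ry"),
]


def rename(program):
    """Rename in program order; every write gets a fresh physical register."""
    table = {}                           # architectural -> current physical register
    fresh = (f"P{i}" for i in count())   # assumption: never runs out
    renamed = []
    for num, op, srcs, dest in program:
        psrcs = []
        for r in srcs:
            if r not in table:           # live-in value: map it on first use
                table[r] = next(fresh)
            psrcs.append(table[r])       # sources read the current mapping
        table[dest] = next(fresh)        # a new name removes WAR and WAW hazards
        renamed.append((num, op, tuple(psrcs), table[dest]))
    return renamed


renamed = rename(PROGRAM)
for num, op, psrcs, pdest in renamed:
    print(num, op, psrcs, "->", pdest)
```

After renaming, instructions 1 and 4 write different physical registers, so the only remaining orderings are the true dependences 1, 2 -> 3 and 4, 5 -> 6.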

Retirement
What happens if we inadvertently execute an instruction that should not have been executed (e.g., after a branch misprediction) or an instruction executes incorrectly (e.g., it raises an exception)?
We need to flush all bad instructions and make it look as if they never executed, and then start executing from the correct point.

Retirement using Reorder Buffer
[Diagram: Reorder Buffer holding insts in program order, with head pointer and tail pointer; values arrive from the Data Bus]
An instruction that reaches the head and executes without exception can be safely retired.
Flushing in-flight instructions is easy: clear out the RS and ROB.
Recovering RAT state is hard. That’s where the ROB comes in.
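
The in-order retirement rule can be sketched with a queue. This is an illustrative model only: the class `ReorderBuffer`, its dictionary-based entries, and the choice to flush everything when the head entry carries an exception are assumptions for the example.

```python
from collections import deque


class ReorderBuffer:
    def __init__(self):
        self.entries = deque()  # left end is the head (oldest), right end the tail

    def allocate(self, dest):
        """Allocate an entry at the tail, in program order."""
        entry = {"dest": dest, "value": None, "done": False, "exception": False}
        self.entries.append(entry)
        return entry

    def complete(self, entry, value=None, exception=False):
        """Record a result (possibly out of order) arriving from the data bus."""
        entry.update(value=value, done=True, exception=exception)

    def retire(self, arch_regs):
        """Retire finished instructions from the head; return True on a flush."""
        while self.entries and self.entries[0]["done"]:
            head = self.entries.popleft()
            if head["exception"]:
                self.entries.clear()  # discard all younger instructions
                return True
            arch_regs[head["dest"]] = head["value"]  # commit architectural state
        return False


rob, arch = ReorderBuffer(), {}
e1 = rob.allocate("Rm")
e2 = rob.allocate("Rn")
rob.complete(e2, 9)           # the younger instruction finishes first
rob.retire(arch)
after_first = dict(arch)      # nothing committed: the head is not done yet
rob.complete(e1, 6)
rob.retire(arch)
print(after_first, arch)
```

The younger result waits in the buffer until the older instruction at the head finishes, so architectural state only ever changes in program order.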

Putting it all together
[Diagram: Register Alias Table, Reservation Stations, FU, Reorder Buffer]

Memory Disambiguation
1. MUL Ra, Rb -> Rm
2. ADD Rc, Rd -> Rn
3. ST Rm -> 0(Rn)
4. LD 0(Rs) -> Rm
5. ADD Rt, Rv -> Rn
6. DIV Rm, Rn -> Ry
Does the load (4) depend on the store (3)? That depends on whether Rn == Rs.

Conceptual Memory Order Buffer
Entry fields: L/S, Addr, Value, V, V
Loads/Stores are held in program order.
Stores write into the buffer and pass to memory only after they reach the head and are retired.
What about loads?
– Could go in order (highly conservative)
– Could wait until all previous unknown store addresses are known (not so conservative)
– Could go as soon as the address is known (optimistic)
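
The three load policies can be compared with a small predicate. This sketch makes several assumptions of its own: the load's address is already computed, "in order" is modelled as "no older store remains in the buffer", each older store is a dictionary whose `addr` is `None` until its address is known, and the policy names are invented labels.

```python
def load_may_issue(older_stores, policy):
    """Decide whether a load with a known address may access memory now.

    older_stores: buffered stores that precede the load in program order.
    """
    if policy == "in_order":            # highly conservative
        return len(older_stores) == 0
    if policy == "wait_for_addresses":  # not so conservative
        return all(s["addr"] is not None for s in older_stores)
    if policy == "optimistic":          # go now, verify against stores later
        return True
    raise ValueError(f"unknown policy: {policy}")


def conflicts(older_stores, load_addr):
    """True if an older store with a known address writes the load's address."""
    return any(s["addr"] == load_addr for s in older_stores)


# ST Rm -> 0(Rn) is still waiting for Rn, so its address is unknown.
pending = [{"addr": None}]
for policy in ("in_order", "wait_for_addresses", "optimistic"):
    print(policy, load_may_issue(pending, policy))
```

Only the optimistic policy lets the load go ahead while the store address is unknown, which is why it needs a later check (such as `conflicts`) and a recovery path if the addresses turn out to match.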

The P6 Execution Microarchitecture
[making dynamic scheduling work at wide issue]
[Diagram blocks: Renaming, Scheduling/Execution, Memory, Retirement, Fetch/Decode; regions labeled in-order and out-of-order]

The P6 Register Alias Table
RAT entry fields: ROB Entry Number, RRF Valid
[Diagram labels: Srcs for μop0, Srcs for μop1, Srcs for μop2, Dests for μops, ROB Allocator, From retire (Dest, ROB entry #s), Physical sources]
If the producer has already retired, the value is in the Retirement Register File (RRF Valid is 1).
If the producer has not retired, then the value will have to be provided by the Reorder Buffer at the ROB Entry Number indicated in the RAT (RRF Valid is 0).
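
The two cases for locating a source operand can be written as a lookup. This is a sketch only: the dictionary layout, the function names `rename_source` and `rename_dest`, and the register names R1 and R2 are assumptions for illustration, not the actual P6 structures.

```python
def rename_source(rat, reg):
    """Return where the newest value of reg lives: the RRF or a ROB entry."""
    entry = rat[reg]
    if entry["rrf_valid"]:             # producer already retired
        return ("RRF", reg)
    return ("ROB", entry["rob_entry"])  # producer still in flight


def rename_dest(rat, reg, rob_entry):
    """Point reg at the ROB entry allocated for the uop that writes it."""
    rat[reg] = {"rrf_valid": False, "rob_entry": rob_entry}


rat = {r: {"rrf_valid": True, "rob_entry": None} for r in ("R1", "R2")}
rename_dest(rat, "R1", 7)  # an in-flight uop in ROB entry 7 will write R1
print(rename_source(rat, "R1"), rename_source(rat, "R2"))
```

A consumer of R1 is directed to ROB entry 7, while a consumer of R2, whose producer has retired, reads the Retirement Register File.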

ReOrder Buffer (ROB): Psrc Read and Pdest Write
ROB entry fields: V, Value, Dest, Status
[Diagram labels: PSrcs for μop0, PSrcs for μop1, PSrcs for μop2, PDests for μops from allocator, Values for Psrcs, Execution results from function units]

Retirement Register File: Psrc Read
[Diagram labels: PSrcs for μop0, PSrcs for μop1, PSrcs for μop2, Values for Psrcs, Value from ReOrder Buffer retirement]

Issue
[Diagram structures: RAT, RRF, ROB, Reservation Station]
Rename (RAT access)
Register Read (also ROB allocate)
Issue (RS allocate)

P6 Reservation Station
Entry fields: Entry Valid, Opcode, Psrc0 tag, Psrc0 data, Psrc0 V, Psrc1 tag, Psrc1 data, Psrc1 V, ROB Entry #
Up to three μops per cycle are added to the ResStation.

Execution
[Diagram: the Reservation Station feeds Port0, Port1, Port2, Port3, Port4; units shown are Integer Unit0, Integer Unit1, Floating point unit, Load addr gen, Store addr gen; also shown are the Memory Order Buffer, the Data Cache, and a path to the Reorder Buffer]

Memory Order Buffer
Entry fields: L/S, Store ID, ST Addr, LD Addr, ST Data
Address allocation happens in order, at issue.
Store data is buffered in the MOB until retirement of that store.
STIDs correspond to the entry of the previous store.
P6 Rule: STs must go in order with respect to other STs. LDs can go out of order with respect to other LDs and STs. LDs go as soon as the address is ready. Clean up at retirement.

Retirement
Reorder Buffer entry fields: V, Value, Dest, Status (the Head Pointer marks the oldest entry)
If Status indicates all is OK, then the value is written, or committed, to the RRF. Also, the (Dest, ROB entry number) pair is sent to the RAT to potentially set the RRF Valid bit.
If Status indicates something went wrong, then a recovery action is started.
Up to 3 μops can be retired per cycle.
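
One retirement cycle might be modelled as follows. This is a hedged sketch, not the P6 logic itself: the entry layout, the name `retire_cycle`, and the interpretation of "potentially set" (set RRF Valid only if the RAT still points at the retiring ROB entry) are assumptions for the example.

```python
from collections import deque

RETIRE_WIDTH = 3  # up to 3 uops retire per cycle


def retire_cycle(rob, rrf, rat):
    """Retire from the head; return (number retired, recovery needed)."""
    retired = 0
    while rob and retired < RETIRE_WIDTH and rob[0]["done"]:
        head = rob[0]
        if not head["ok"]:
            return retired, True               # start a recovery action
        rob.popleft()
        rrf[head["dest"]] = head["value"]      # commit the value to the RRF
        mapping = rat[head["dest"]]
        # Assumed rule: a younger uop may have renamed the same register,
        # in which case the RAT must keep pointing at that younger ROB entry.
        if not mapping["rrf_valid"] and mapping["rob_entry"] == head["entry"]:
            mapping["rrf_valid"] = True
        retired += 1
    return retired, False


def uop(entry, dest, value):
    return {"entry": entry, "dest": dest, "value": value, "done": True, "ok": True}


rob = deque([uop(0, "R1", 10), uop(1, "R2", 20), uop(2, "R3", 30), uop(3, "R1", 40)])
rrf = {"R1": 0, "R2": 0, "R3": 0}
rat = {"R1": {"rrf_valid": False, "rob_entry": 3},
       "R2": {"rrf_valid": False, "rob_entry": 1},
       "R3": {"rrf_valid": False, "rob_entry": 2}}
first = retire_cycle(rob, rrf, rat)
r1_valid_after_first = rat["R1"]["rrf_valid"]
second = retire_cycle(rob, rrf, rat)
print(first, second, rrf)
```

In the first cycle three μops retire, but R1 stays marked as in flight because ROB entry 3 also writes it; only the second cycle makes the RRF copy of R1 current.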

Recovery
ROB – flush all insts.
RS – flush all insts.
RRF – do nothing.
RAT – make all entries indicate RRF valid.
Send new PC to Fetch Mechanism.
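
The recovery actions are simple enough to sketch directly. The container types, field names, and the function `recover` are assumptions chosen for this example.

```python
def recover(rob, rs, rat, new_pc):
    """Discard all in-flight state and return the PC to send to fetch."""
    rob.clear()                      # ROB: flush all instructions
    rs.clear()                       # RS: flush all instructions
    # The RRF is left untouched: it already holds the committed state.
    for mapping in rat.values():     # RAT: every register now lives in the RRF
        mapping["rrf_valid"] = True
        mapping["rob_entry"] = None
    return new_pc


rob = [{"entry": 4, "dest": "R1"}, {"entry": 5, "dest": "R2"}]
rs = [{"rob_entry": 5}]
rat = {"R1": {"rrf_valid": False, "rob_entry": 4},
       "R2": {"rrf_valid": False, "rob_entry": 5}}
next_pc = recover(rob, rs, rat, new_pc=0x4000)
print(hex(next_pc), rat)
```

Because every RAT entry simply falls back to the RRF, no per-instruction undo is needed in this model; the committed values were never overwritten by the flushed instructions.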

Reservation Station Alternative Designs
Value-capture reservation stations vs. tag-only reservation stations
– Pentium 4 adjusts tags rather than moving values when retiring an instruction
– Need to keep entries in the ROB longer, until they are no longer needed to keep the retired value visible to subsequent instructions

Other thoughts
How many cycles for branch misprediction?
Read Sohi and Smith for more general concepts
Read about the MIPS R10000 (R10K) for details on an alternative implementation

Data Dependencies
Read After Write – Flow
1. MUL Ra, Rb -> Rm
3. SUB Rm, Rn -> Rx
Write After Write – Output
1. MUL Ra, Rb -> Rm
4. ADD Rr, Rs -> Rm
Write After Read – Anti
3. SUB Rm, Rn -> Rx
4. ADD Rr, Rs -> Rm
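
The three dependence types can be found mechanically by tracking, for each register, its most recent writer and the readers since that write. The sketch below does this for the six-instruction example; the tuple encoding and the name `classify_dependences` are assumptions for illustration.

```python
PROGRAM = [
    (1, "MUL", ("Ra", "Rb"), "Rm"),
    (2, "ADD", ("Rc", "Rd"), "Rn"),
    (3, "SUB", ("Rm", "Rn"), "Rx"),
    (4, "ADD", ("Rr", "Rs"), "Rm"),
    (5, "ADD", ("Rt", "Rv"), "Rn"),
    (6, "DIV", ("Rm", "Rn"), "Ry"),
]


def classify_dependences(program):
    """Return (earlier, later, kind, register) tuples in program order."""
    last_writer = {}  # register -> most recent instruction that wrote it
    readers = {}      # register -> instructions that read it since that write
    deps = []
    for num, _op, srcs, dest in program:
        for r in srcs:
            if r in last_writer:  # read after write: flow dependence
                deps.append((last_writer[r], num, "RAW", r))
            readers.setdefault(r, []).append(num)
        for reader in readers.get(dest, []):
            if reader != num:     # write after read: antidependence
                deps.append((reader, num, "WAR", dest))
        if dest in last_writer:   # write after write: output dependence
            deps.append((last_writer[dest], num, "WAW", dest))
        last_writer[dest] = num
        readers[dest] = []        # a new value has no readers yet
    return deps


deps = classify_dependences(PROGRAM)
for earlier, later, kind, reg in deps:
    print(f"{earlier} -> {later}: {kind} on {reg}")
```

For this program the sketch reports four flow dependences (1 -> 3, 2 -> 3, 4 -> 6, 5 -> 6), two antidependences (3 -> 4 on Rm, 3 -> 5 on Rn), and two output dependences (1 -> 4 on Rm, 2 -> 5 on Rn). Only the flow dependences carry data; the other four come from reusing Rm and Rn.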