Spring 2003CSE P5481 Reorder Buffer Implementation (Pentium Pro) Hardware data structures retirement register file (RRF) (~ IBM 360/91 physical registers)

Slides:



Advertisements
Similar presentations
Spring 2003CSE P5481 Out-of-Order Execution Several implementations out-of-order completion CDC 6600 with scoreboarding IBM 360/91 with Tomasulos algorithm.
Advertisements

Scoreboarding & Tomasulos Approach Bazat pe slide-urile lui Vincent H. Berk.
Hardware-Based Speculation. Exploiting More ILP Branch prediction reduces stalls but may not be sufficient to generate the desired amount of ILP One way.
Real-time Signal Processing on Embedded Systems Advanced Cutting-edge Research Seminar I&III.
Lec18.1 Step by step for Dynamic Scheduling by reorder buffer Copyright by John Kubiatowicz (http.cs.berkeley.edu/~kubitron)
A scheme to overcome data hazards
Dynamic ILP: Scoreboard Professor Alvin R. Lebeck Computer Science 220 / ECE 252 Fall 2008.
1 Lecture: Out-of-order Processors Topics: out-of-order implementations with issue queue, register renaming, and reorder buffer, timing, LSQ.
Dyn. Sched. CSE 471 Autumn 0219 Tomasulo’s algorithm “Weaknesses” in scoreboard: –Centralized control –No forwarding (more RAW than needed) Tomasulo’s.
Computer Organization and Architecture (AT70.01) Comp. Sc. and Inf. Mgmt. Asian Institute of Technology Instructor: Dr. Sumanta Guha Slide Sources: Based.
COMP25212 Advanced Pipelining Out of Order Processors.
Pipelining 5. Two Approaches for Multiple Issue Superscalar –Issue a variable number of instructions per clock –Instructions are scheduled either statically.
Computer Architecture Lec 8 – Instruction Level Parallelism.
Microprocessor Microarchitecture Dependency and OOO Execution Lynn Choi Dept. Of Computer and Electronics Engineering.
Instruction-Level Parallelism (ILP)
1 COMP 206: Computer Architecture and Implementation Montek Singh Mon., Oct. 14, 2002 Topic: Instruction-Level Parallelism (Multiple-Issue, Speculation)
CS 211: Computer Architecture Lecture 5 Instruction Level Parallelism and Its Dynamic Exploitation Instructor: M. Lancaster Corresponding to Hennessey.
CPE 731 Advanced Computer Architecture ILP: Part IV – Speculative Execution Dr. Gheith Abandah Adapted from the slides of Prof. David Patterson, University.
1 Copyright © 2012, Elsevier Inc. All rights reserved. Chapter 3 (and Appendix C) Instruction-Level Parallelism and Its Exploitation Cont. Computer Architecture.
1 Zvika Guz Slides modified from Prof. Dave Patterson, Prof. John Kubiatowicz, and Prof. Nancy Warter-Perez Out Of Order Execution.
Mult. Issue CSE 471 Autumn 011 Multiple Issue Alternatives Superscalar (hardware detects conflicts) –Statically scheduled (in order dispatch and hence.
1 Lecture 7: Out-of-Order Processors Today: out-of-order pipeline, memory disambiguation, basic branch prediction (Sections 3.4, 3.5, 3.7)
Review of CS 203A Laxmi Narayan Bhuyan Lecture2.
March 9, 2011CS152, Spring 2011 CS 152 Computer Architecture and Engineering Lecture 12 - Advanced Out-of-Order Superscalars Krste Asanovic Electrical.
1 IBM System 360. Common architecture for a set of machines. Robert Tomasulo worked on a high-end machine, the Model 91 (1967), on which they implemented.
ENGS 116 Lecture 71 Scoreboarding Vincent H. Berk October 8, 2008 Reading for today: A.5 – A.6, article: Smith&Pleszkun FRIDAY: NO CLASS Reading for Monday:
1 COMP 206: Computer Architecture and Implementation Montek Singh Wed, Oct 5, 2005 Topic: Instruction-Level Parallelism (Dynamic Scheduling: Scoreboarding)
OOO execution © Avi Mendelson, 4/ MAMAS – Computer Architecture Lecture 7 – Out Of Order (OOO) Avi Mendelson Some of the slides were taken.
Ch2. Instruction-Level Parallelism & Its Exploitation 2. Dynamic Scheduling ECE562/468 Advanced Computer Architecture Prof. Honggang Wang ECE Department.
Nov. 9, Lecture 6: Dynamic Scheduling with Scoreboarding and Tomasulo Algorithm (Section 2.4)
1 Chapter 2: ILP and Its Exploitation Review simple static pipeline ILP Overview Dynamic branch prediction Dynamic scheduling, out-of-order execution Hardware-based.
1 Advanced Computer Architecture Dynamic Instruction Level Parallelism Lecture 2.
Instruction-Level Parallelism Dynamic Scheduling
CSCE 614 Fall Hardware-Based Speculation As more instruction-level parallelism is exploited, maintaining control dependences becomes an increasing.
© Wen-mei Hwu and S. J. Patel, 2005 ECE 412, University of Illinois Lecture Instruction Execution: Dynamic Scheduling.
Spring 2003CSE P5481 Precise Interrupts Precise interrupts preserve the model that instructions execute in program-generated order, one at a time If an.
Professor Nigel Topham Director, Institute for Computing Systems Architecture School of Informatics Edinburgh University Informatics 3 Computer Architecture.
2/24; 3/1,3/11 (quiz was 2/22, QuizAns 3/8) CSE502-S11, Lec ILP 1 Tomasulo Organization FP adders Add1 Add2 Add3 FP multipliers Mult1 Mult2 From.
1 Lecture 7: Speculative Execution and Recovery Branch prediction and speculative execution, precise interrupt, reorder buffer.
Anshul Kumar, CSE IITD CSL718 : Superscalar Processors Speculative Execution 2nd Feb, 2006.
Pentium III Instruction Stream. Introduction Pentium III uses several key features to exploit ILP This part of our presentation will cover the methods.
04/03/2016 slide 1 Dynamic instruction scheduling Key idea: allow subsequent independent instructions to proceed DIVDF0,F2,F4; takes long time ADDDF10,F0,F8;
Out-of-order execution Lihu Rappoport 11/ MAMAS – Computer Architecture Out-Of-Order Execution Dr. Lihu Rappoport.
CS203 – Advanced Computer Architecture ILP and Speculation.
Ch2. Instruction-Level Parallelism & Its Exploitation 2. Dynamic Scheduling ECE562/468 Advanced Computer Architecture Prof. Honggang Wang ECE Department.
Instruction-Level Parallelism and Its Dynamic Exploitation
IBM System 360. Common architecture for a set of machines
Dynamic Scheduling Why go out of style?
/ Computer Architecture and Design
Out of Order Processors
Tomasulo Loop Example Loop: LD F0 0 R1 MULTD F4 F0 F2 SD F4 0 R1
CS203 – Advanced Computer Architecture
CS5100 Advanced Computer Architecture Hardware-Based Speculation
Advantages of Dynamic Scheduling
High-level view Out-of-order pipeline
Tomasulo With Reorder buffer:
A Dynamic Algorithm: Tomasulo’s
Out of Order Processors
Lecture 8: ILP and Speculation Contd. Chapter 2, Sections 2. 6, 2
Smruti R. Sarangi IIT Delhi
Adapted from the slides of Prof
Checking for issue/dispatch
Lecture 7: Dynamic Scheduling with Tomasulo Algorithm (Section 2.4)
Adapted from the slides of Prof
Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 9/30/2011
Scoreboarding ENGS 116 Lecture 7 Vincent H. Berk October 5, 2005
Chapter 3: ILP and Its Exploitation
High-level view Out-of-order pipeline
Lecture 7 Dynamic Scheduling
Conceptual execution on a processor which exploits ILP
Presentation transcript:

Spring 2003CSE P5481 Reorder Buffer Implementation (Pentium Pro) Hardware data structures retirement register file (RRF) (~ IBM 360/91 physical registers) physical register file that is the same size as the architectural registers one-to-one relationship reorder buffer (ROB) (~ R10K active list) provides in-order instruction commit circular queue with head & tail pointers holds 40 “executing” instructions in program order (dispatched but not yet committed) field for either integer or FP result after it has been computed a result value is put in its register after its producing instruction reaches the head of the buffer & is removed (i.e., commits)

Spring 2003CSE P5482 Reorder Buffer Implementation (Pentium Pro) register alias table (RAT) (~ R10K map table) provides register renaming important because very few GPRs in the x86 architecture indicates whether a source operand of a new instruction points to the reorder buffer or the physical register file do an associative search of ROB destination registers for the new source operands if found, consumer instruction points to the producer instruction in the ROB the data hazard check before instruction dispatch reservation station (~ IBM 360/91 reservation stations, R10000 instruction buffer) holds instructions waiting to execute provides forwarding to reduce RAW hazards result values go back to the reservation station (as well as ROB) so dependent instructions have source operand values provides out-of-order execution

Spring 2003CSE P5483 Pentium Pro Execution In-order issue decode instructions enter uops into reorder buffer for in-order completion rename registers via register alias table detect structural hazards for reservation station Out-of-order execution one reservation station, multiple entries check source operands for RAW hazards check structural hazards for separate integer, FP, memory units result goes to reservation station & reorder buffer In-order completion this & previous uops have completed write “G”PR registers rollback on interrupts

Spring 2003CSE P5484 Pentium Pro fetch & decode pipeline BTB access (1 stage) instruction fetch & align for decoding (2.5 stages) decode & uop generation (2.5 stages) register renaming & instruction issue to reservation stations (3 stages minimum) integer pipeline execute, resolve branch write registers & commit load pipeline address calculation & to memory reorder buffer integrated L1 & L2 data cache access pipelined FP add & multiply

Spring 2003CSE P5485 Pentium Pro

Spring 2003CSE P5486 Pentium Pro

Spring 2003CSE P5487

Spring 2003CSE P5488 Pentium Pro Some bandwidth constraints: maximum for one cycle 16 bytes fetched 3 instructions decoded 6  ops to the reorder buffer 5  ops dispatched to reservation station & functional units 1 load & 1 store access to the L1 data cache 1 cache result returned 3  ops committed if good instruction mix right instruction order operands available functional units available load & store to different cache banks all previous instructions already committed

Spring 2003CSE P5489 Making Tomasulo Precise Reorder buffer contains: instruction type (register operation, store, branch) destination field (register, memory, none) result values ready bit to indicate instruction has executed replaces load & store buffers replaces renaming function of the reservation stations operand fields in reservation stations point to reorder buffer location results tagged with reorder buffer entry number provides hardware speculation (speculative execution, not just fetch & issue) plus precise interrupts

Spring 2003CSE P54810 Precise Implementation for Tomasulo

Spring 2003CSE P54811 Tomasulo’s Algorithm: Execution Steps issue & read (assume the instruction has been fetched) structural hazard detection for entry in reservation station & reorder buffer issue if no hazard stall if hazard read registers for source operands can come from either registers or reorder buffer RAT indicates which allocate entry in reorder buffer execute RAW hazard detection for source operands snoop on common data bus for missing operands reservation station tag is entry number in reorder buffer dispatch to functional unit when obtain both operand values

Spring 2003CSE P54812 Tomasulo’s Algorithm: Execution Steps write result broadcast result & reorder buffer entry (tag) on the common data bus to reservation stations & reorder buffer commit retire the instruction at the head of the reorder buffer update register with result in reorder buffer or do a store remove instruction from reorder buffer if a mispredicted branch, restart with right-path instructions

Spring 2003CSE P54813 Example in the Book 1 Cycle after first load has committed ROB 6

Spring 2003CSE P54814 Example in the Book 2 Cycle after second load committed ROB 6

Spring 2003CSE P54815 Example in the Book 3 Cycle after subd wrote its result in the reorder buffer but still can’t commit yes ROB 4 ROB 6

Spring 2003CSE P54816 Example in the Book 4 Cycle after addd wrote its result in the reorder buffer but can’t commit ROB 6 ROB 4

Spring 2003CSE P54817 Example in the Book 5 Cycle after multd committed ROB 4 ROB 6 ROB 5

Spring 2003CSE P54818 Example in the Book 6 Cycle after subd committed; Next: complete divd, commit divd, commit addd ROB 6 ROB 5

Spring 2003CSE P54819 Pool of Physical Regisers vs. Reorder Buffer Think about the advantages and disadvantages of these implementations clean-up all done at instruction commit (physical register pool) record that value no longer speculative in register busy table unmap previous mapping for the architectural register instruction issue simpler (physical register pool) only look in one place for the source operands (the physical register file) book claims that deallocating register is more complicated with a physical register pool have to search for outstanding uses in the active list not done in practice: wait until the instruction that redefines the architectural register commits

Spring 2003CSE P54820 Limits Limits on out-of-order execution amount of ILP in the code scheduling window size need to do associative searches & its effect on cycle time number & types of functional units