Out of Order Processors

Presentation transcript:

Advanced Pipelining: Out of Order Processors (COMP25212)

From Wednesday…
What is a functional unit? A hardware component of a processor that can perform a specific operation (or set of operations), for example integer arithmetic, floating-point multiplication or memory access.
What is a structural hazard? An instruction cannot be issued because all suitable functional units are busy.
What data dependencies exist in out-of-order processors?
  True dependency (read-after-write, RAW): instruction A depends on the output of a previous instruction B.
  Anti-dependency (write-after-read, WAR): instruction A writes to an input of a previous instruction B. We need to ensure that B reads the correct value rather than the one generated by A.
  Output dependency (write-after-write, WAW): instructions A and B write to the same register. We need to ensure that the register keeps the value of the later instruction.

From Wednesday… Out-of-Order Execution with Scoreboard
Centralized data structure:
  Tracks the status of registers, FUs and instructions.
  Dynamically builds the dependency graph in hardware.
Limited scalability:
  The centralized nature limits scalability: small number of FUs and small window of instructions.
Dealing with dependencies:
  RAW: stall the conflicting instruction.
  WAW: stall the pipeline.
  WAR: stall WB.

Out of Order Execution with Tomasulo

Tomasulo’s Algorithm
Control logic for out-of-order execution is decentralized.
  Reservation Stations (RS) in the functional units keep instruction information.
  In addition, the RS seamlessly rename registers.
A Common Data Bus (CDB) broadcasts data and results to the different devices.
  Only one instruction can finish per cycle (one result on the CDB).
Distributed control allows for a larger window of instructions and more flexible dynamic scheduling.

Tomasulo’s Algorithm
Structural hazards stall the pipeline.
RS track operands and buffer them as soon as they are available.
  Reduces pressure on the register bank.
Impact of RAW dependencies is reduced.
  An instruction executes when all its operands are available.
WAW and WAR dependencies are avoided.
  Register renaming.

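Register Renaming (Example)
Eliminates WAR and WAW hazards by renaming all destination registers. It can be done by the compiler, but Tomasulo does it transparently in hardware (reservation stations).
  DIV.D F0, F2, F4
  ADD.D F6, F0, F8
  ST.D  F6, 0(R1)
  SUB.D F8, F10, F14
  MUL.D F6, F10, F8
[Figure legend: arrows on the code mark the true dependences (F0 from DIV.D to ADD.D, F6 from ADD.D to ST.D, F8 from SUB.D to MUL.D), the antidependence (F8 is read by ADD.D and then written by SUB.D) and the output dependence (F6 is written by both ADD.D and MUL.D).]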
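The renaming idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the temporary names T0, T1, ... and the tuple encoding (op, dest, sources) are hypothetical, and real hardware renames to reservation-station tags rather than to new register names.

```python
from itertools import count

def rename(program):
    """Give every destination a fresh name; sources read the latest name."""
    fresh = (f"T{i}" for i in count())   # hypothetical temporaries T0, T1, ...
    latest = {}                          # architectural register -> current name
    renamed = []
    for op, dest, srcs in program:
        # Sources are looked up first, so they see the previous producer.
        new_srcs = tuple(latest.get(s, s) for s in srcs)
        new_dest = None
        if dest is not None:
            new_dest = next(fresh)
            latest[dest] = new_dest
        renamed.append((op, new_dest, new_srcs))
    return renamed

program = [
    ("DIV.D", "F0", ("F2", "F4")),
    ("ADD.D", "F6", ("F0", "F8")),
    ("ST.D",  None, ("F6", "R1")),     # store: no register destination
    ("SUB.D", "F8", ("F10", "F14")),
    ("MUL.D", "F6", ("F10", "F8")),
]

renamed = rename(program)
for op, dest, srcs in renamed:
    print(op, dest, srcs)
```

After renaming, SUB.D writes T2 instead of F8 and MUL.D writes T3 instead of F6, so the WAR and WAW conflicts disappear while the three true dependences are still expressed through T0, T1 and T2.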

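Tomasulo Organization
[Figure: block diagram. An instruction queue and the FP registers feed the reservation stations (Add1 to Add3 in front of the FP adders, Mult1 and Mult2 in front of the FP multipliers). Load buffers (Load1 to Load6) receive data from memory; store buffers send data to memory. The Common Data Bus (CDB) carries results back to the stations, buffers and registers.]
Notes on the figure: Resolve RAW memory conflict? (address in memory buffers). Integer unit executes in parallel.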

Common Data Bus
Normal data bus: data + destination (“go to” bus).
Common data bus: data + source (“come from” bus).
  64 bits of data + 4 bits of functional unit address.
  Functional units broadcast their results.
  A reservation station takes the operand if the source matches the functional unit it is waiting on for either input.
  The register bank takes the operand if the source matches the functional unit due to write that register.
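A minimal sketch of the "come from" broadcast, assuming reservation stations are modelled as plain dictionaries with the Vj/Vk/Qj/Qk fields and that None stands for "no pending producer". The station names and values below are made up for illustration.

```python
def broadcast(tag, value, stations, reg_status, regs):
    """Put (tag, value) on the CDB; every listener compares the source tag."""
    for rs in stations.values():
        for q, v in (("Qj", "Vj"), ("Qk", "Vk")):
            if rs[q] == tag:            # this operand was waiting on `tag`
                rs[v], rs[q] = value, None
    # The register bank accepts the value only for registers still owned by `tag`.
    for reg in [r for r, owner in reg_status.items() if owner == tag]:
        regs[reg] = value
        del reg_status[reg]

stations = {
    "Mult1": {"Vj": None, "Vk": 2.5, "Qj": "Load2", "Qk": None},
    "Add1":  {"Vj": 1.5, "Vk": None, "Qj": None, "Qk": "Load2"},
}
reg_status = {"F2": "Load2", "F0": "Mult1", "F8": "Add1"}
regs = {"F2": 0.0}

broadcast("Load2", 4.0, stations, reg_status, regs)
print(stations, reg_status, regs)
```

One broadcast wakes up both waiting stations and updates F2 at the same time, which is the point of tagging the data with its source rather than with a destination.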

Stages of Tomasulo Algorithm
1. Issue (I): get an instruction from the FP op queue.
   If a reservation station is free (no structural hazard), issue the instruction and read its operands (or the RS producing them). Otherwise, stall the pipeline.
2. Execute (EX): operate on operands.
   When both source operands are ready, execute; if not ready, watch the Common Data Bus for results.
3. Write result (WB): finish execution.
   Write on the Common Data Bus to all awaiting units; free the reservation station.

Stages of a Tomasulo Pipeline
[Figure: Fetch and Issue feed several parallel execution paths (Execute + Mem, Execute FP Multiplication (two units), Execute FP Add, Execute FP Division); each path ends in Write Back and Retire.]

Reservation Station Components
No information about instructions is needed.
Information in the reservation station:
  Op: operation to perform in the unit (e.g., + or -)
  Vj, Vk: values of the source operands
  Qj, Qk: reservation stations producing the source registers (value to be written). Note: Qj, Qk = 0 means ready.
  Busy: indicates that the reservation station or FU is busy
Register result status: indicates which functional unit will write each register, if one exists. Blank when no pending instruction will write that register.
What you might have thought:
  1. Four stages of instruction execution.
  2. Status of FU: the normal things to keep track of (RAW, and structural hazards via Busy). Fi comes from the instruction format of the machine (Fi is the destination). The Add unit can Add or Sub. Rj, Rk: status of registers (Yes means ready). Qj, Qk: if Rj or Rk is No, the operand is waiting for an FU to write its result, and Qj, Qk say which FU it is waiting for.
  3. Status of register result (WAW and WAR): which FU is going to write into each register.
  Scoreboard on the 6600 = size of FU.
  6.7, 6.8, 6.9, 6.12, 6.13, 6.16, 6.17
  FU latencies: Add 2, Mult 10, Div 40 clocks.
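The fields above map naturally onto a small record. The following is a sketch, not the slides' implementation: the class and function names are hypothetical, None plays the role of "Qj, Qk = 0" (operand ready), and the caller is assumed to have already chosen a station of the right type.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReservationStation:
    name: str
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[float] = None
    vk: Optional[float] = None
    qj: Optional[str] = None   # producer of operand j; None means ready
    qk: Optional[str] = None   # producer of operand k; None means ready

def issue(rs, op, dest, src_j, src_k, regs, reg_status):
    """Issue stage: fill the station, or report a structural hazard."""
    if rs.busy:
        return False                       # no free station: stall
    rs.busy, rs.op = True, op
    rs.qj = reg_status.get(src_j)          # pending producer, if any
    rs.vj = None if rs.qj else regs[src_j]
    rs.qk = reg_status.get(src_k)
    rs.vk = None if rs.qk else regs[src_k]
    reg_status[dest] = rs.name             # register result status: dest now owned by rs
    return True

regs = {"F2": 0.0, "F4": 2.0}
reg_status = {"F2": "Load2"}               # F2 is still being loaded
mult1 = ReservationStation("Mult1")

ok = issue(mult1, "MUL.D", "F0", "F2", "F4", regs, reg_status)
second = issue(mult1, "DIV.D", "F10", "F0", "F6", regs, reg_status)
print(ok, second, mult1, reg_status)
```

The first call records that operand j will come from Load2 and copies the ready value of F4 into Vk; the second call is refused because the station is busy, which corresponds to the structural-hazard stall.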

Instruction status
[Figure: the instruction stream with its instruction status table.]
Tomasulo does not need this information; we show the times for each stage for convenience.

Functional Unit status
[Figure: reservation station table with rows for 3 load buffers, 3 adder stations and 2 multiplication stations. Annotated columns: FU countdown, input operands, and which FU will produce each operand.]

Register Status
[Figure: register status table showing which RS will write each register, together with the clock cycle counter.]

A Tomasulo Example
The following code is run on a Tomasulo pipeline:
  L.D   F6, 34(R2)
  L.D   F2, 45(R3)
  MUL.D F0, F2, F4
  SUB.D F8, F6, F2
  DIV.D F10, F0, F6
  ADD.D F6, F8, F2
with these functional units:
  Functional Unit (FU)       # of FUs   EX cycles
  FP Multiply/Division       2          10/40
  FP Addition/Subtraction    3          2
  Mem Load                   3          2

Dependency Graph For Example
Example code:
  1  L.D   F6, 34(R2)
  2  L.D   F2, 45(R3)
  3  MUL.D F0, F2, F4
  4  SUB.D F8, F6, F2
  5  DIV.D F10, F0, F6
  6  ADD.D F6, F8, F2
Data dependence (RAW): (1, 4) (1, 5) (2, 3) (2, 4) (2, 6) (3, 5) (4, 6)
Output dependence (WAW): (1, 6)
Anti-dependence (WAR): (5, 6)
[Figure: dependency graph over instructions 1 to 6, with edges marked as real data dependence (RAW), anti-dependence (WAR) and output dependence (WAW).]
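The dependence pairs can be derived mechanically from the register names. The sketch below is one way to do it, assuming each instruction is encoded as (op, dest, sources); the encoding and function name are hypothetical. Note that it also reports (4, 6) as a WAR pair, because instruction 4 reads F6 before instruction 6 writes it; the slide lists only (5, 6), presumably because 4 and 6 are already ordered by a true dependence.

```python
def dependences(program):
    """Return sorted RAW, WAR and WAW pairs (1-based instruction numbers)."""
    raw, war, waw = set(), set(), set()
    last_writer = {}   # register -> most recent instruction writing it
    readers = {}       # register -> readers since that last write
    for n, (_, dest, srcs) in enumerate(program, start=1):
        for s in srcs:
            if s in last_writer:
                raw.add((last_writer[s], n))
        for r in readers.get(dest, []):
            war.add((r, n))                 # n overwrites a value r still reads
        if dest in last_writer:
            waw.add((last_writer[dest], n))
        last_writer[dest] = n
        readers[dest] = []
        for s in set(srcs):
            readers.setdefault(s, []).append(n)
    return sorted(raw), sorted(war), sorted(waw)

program = [
    ("L.D",   "F6",  ("R2",)),
    ("L.D",   "F2",  ("R3",)),
    ("MUL.D", "F0",  ("F2", "F4")),
    ("SUB.D", "F8",  ("F6", "F2")),
    ("DIV.D", "F10", ("F0", "F6")),
    ("ADD.D", "F6",  ("F8", "F2")),
]

raw, war, waw = dependences(program)
print("RAW:", raw)
print("WAR:", war)
print("WAW:", waw)
```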

Tomasulo Example: cycle-by-cycle walkthrough

Cycle 1: LD#1 is issued.
Cycle 2: LD#2 is issued.
Cycle 3: MULTD is issued. LD#1 completes and broadcasts its result.
Cycle 4: LD#1's result updates the register bank and frees its RS. SUBD is issued. LD#2 completes and broadcasts its result.
Cycle 5: LD#2's result updates the register bank and frees its RS. Add1 (SUBD) and Mult1 (MULTD) start execution. DIVD is issued.
Cycle 6: ADDD is issued.
Cycle 7: Add1 (SUBD) completes and broadcasts its result.
Cycle 8: Add1 (SUBD) result updates the register bank and frees its RS. Add2 (ADDD) starts execution.
Cycle 9: ADDD and MULTD continue execution.
Cycle 10: Add2 (ADDD) completes and broadcasts its result.
Cycle 11: ADDD updates the register bank and frees its RS.
Cycles 12 to 14: MULTD continues execution.
Cycle 15: MULTD completes and broadcasts its result.
Cycle 16: MULTD updates the register bank and frees its RS. DIVD starts execution.
(39 cycles later…)
Cycle 55: DIVD is about to complete.
Cycle 56: DIVD completes and broadcasts its result.
Cycle 57: DIVD updates the register bank and frees its RS. Execution complete.

Summary of the example: in-order issue, out-of-order execution, out-of-order completion.
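The cycle numbers in this example can be reproduced with a small timing model. It is a sketch under explicit assumptions inferred from the slide timings, not a full simulator: one instruction issues per cycle; a load begins its 2-cycle access in the cycle it issues (issue in cycle 1, broadcast in cycle 3); any other operation starts the cycle after its issue and after its last operand is broadcast; and structural hazards and CDB contention are ignored because neither occurs here.

```python
LATENCY = {"L.D": 2, "MUL.D": 10, "DIV.D": 40, "ADD.D": 2, "SUB.D": 2}

program = [
    ("L.D",   "F6",  []),              # address registers are always ready
    ("L.D",   "F2",  []),
    ("MUL.D", "F0",  ["F2", "F4"]),
    ("SUB.D", "F8",  ["F6", "F2"]),
    ("DIV.D", "F10", ["F0", "F6"]),
    ("ADD.D", "F6",  ["F8", "F2"]),
]

def timeline(program):
    producer = {}   # register -> index of the instruction that will write it
    rows = []
    for i, (op, dest, srcs) in enumerate(program):
        issue = i + 1
        # Broadcast cycles of the pending producers captured at issue time.
        waits = [rows[producer[s]]["broadcast"] for s in srcs if s in producer]
        if op == "L.D":
            start = issue                       # assumption taken from the slides
        else:
            start = max([issue] + waits) + 1
        rows.append({"op": op, "issue": issue, "start": start,
                     "broadcast": start + LATENCY[op]})
        producer[dest] = i                      # later readers wait on this one
    return rows

rows = timeline(program)
for r in rows:
    print(r)
```

DIV.D captures the producer of F6 when it issues (the first load), so the later ADD.D that also writes F6 cannot disturb it; this is how the model reflects renaming. The last broadcast is in cycle 56, and the register bank is updated one cycle later, in cycle 57.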

Tomasulo’s advantages
(1) Distributed hazard detection logic
  Distributed reservation stations and the CDB.
  If multiple instructions are waiting on a single result, and each already has its other operand, they can all be released simultaneously by the broadcast on the CDB.
  If a centralized register file were used, the units would have to read their results from the registers when register buses are available.
(2) Avoids stalling due to WAW or WAR hazards

Tomasulo Drawbacks
Complexity of hardware.
Performance limited by the Common Data Bus:
  Each CDB must go to all functional units → high capacitance, high wiring density.
  The number of functional units that can complete per cycle is limited to one!
  Multiple CDBs → more FU logic for parallel stores.

Summary
Reservation stations: implicit register renaming by buffering source operands.
  Prevents registers from being the bottleneck.
  Avoids the WAR and WAW hazards of the scoreboard.
Lasting contributions:
  Dynamic scheduling
  Register renaming
Others (not covered here):
  Load/store disambiguation through a re-ordering buffer
  Speculative execution

Summary of Out-of-Order Processors

Out of Order Processors
BENEFITS:
  Accelerates the execution of programs
  More efficient design
  Increases the utilisation of processor resources
LIMITATIONS:
  More complex design
  Expensive in terms of area and power
  Non-precise interrupts: interrupting exactly after a given instruction becomes more difficult (but this can be solved with reordering buffers)

Scoreboard vs Tomasulo (originals)

Example
[Figure: example timing diagram(s), not recovered.]
Latencies: LD 4 cycles, Add/Sub 2 cycles, Mul/Div 2 cycles. Assuming no structural hazards.

Example: RAW
[Figure annotations; which diagram each one belongs to is not recoverable:]
  RAW: stall the pipeline
  WAW
  RAW: ADD stalled, SUB could be issued
  RAW: ADD stalled, SUB can be issued
Latencies: LD 4 cycles, Add/Sub 2 cycles, Mul/Div 2 cycles. Assuming no structural hazards.

Example: WAW
WAW: SUB cannot be issued; stall the pipeline.
WAW: allowed by register renaming in the RS.
  The ADD changes the register status so that R3 will be written by the ADD. Then the SUB overwrites that entry, so any later instruction that wants to read R3 will get its data from the CDB. When the ADD finishes, the register bank no longer has R3 marked as waiting for the ADD's result, so nothing is written to the register bank.
Latencies: LD 4 cycles, Add/Sub 2 cycles, Mul/Div 2 cycles. Assuming no structural hazards.

Example
Two instructions can finish at the same time.
The CDB limits finishing instructions to one per cycle.
Latencies: LD 4 cycles, Add/Sub 2 cycles, Mul/Div 2 cycles. Assuming no structural hazards.

Consider the following program, which implements R = A^2 + B^2 + C^2 + D^2:
  LD  r1, A
  MUL r2, r1, r1       -- A^2
  LD  r3, B
  MUL r4, r3, r3       -- B^2
  ADD r11, r2, r4      -- A^2 + B^2
  LD  r5, C
  MUL r6, r5, r5       -- C^2
  LD  r7, D
  MUL r8, r7, r7       -- D^2
  ADD r12, r6, r8      -- C^2 + D^2
  ADD r21, r11, r12    -- A^2 + B^2 + C^2 + D^2
  ST  r21, R
The current code is not really suitable for a superscalar pipeline because of its low instruction-level parallelism. Reorder the instructions to exploit superscalar execution. Assume all kinds of forwarding are implemented.
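As a starting point for this exercise, the sketch below checks one possible reordering (all loads first, then the multiplications, then the additions). It is not presented as the intended answer. The encoding and helper names are hypothetical, and the count of instructions that depend on their immediate predecessor is only a rough proxy for how often a superscalar pipeline would be unable to pair neighbouring instructions.

```python
ORIGINAL = [
    ("LD",  "r1",  []),
    ("MUL", "r2",  ["r1", "r1"]),
    ("LD",  "r3",  []),
    ("MUL", "r4",  ["r3", "r3"]),
    ("ADD", "r11", ["r2", "r4"]),
    ("LD",  "r5",  []),
    ("MUL", "r6",  ["r5", "r5"]),
    ("LD",  "r7",  []),
    ("MUL", "r8",  ["r7", "r7"]),
    ("ADD", "r12", ["r6", "r8"]),
    ("ADD", "r21", ["r11", "r12"]),
    ("ST",  None,  ["r21"]),
]

# One candidate order: group independent instructions together.
REORDERED = [ORIGINAL[i] for i in (0, 2, 5, 7, 1, 3, 6, 8, 4, 9, 10, 11)]

def respects_dependences(order):
    """Every source register must be produced earlier in the order."""
    written = set()
    for _, dest, srcs in order:
        if any(s not in written for s in srcs):
            return False
        if dest is not None:
            written.add(dest)
    return True

def adjacent_dependent_pairs(order):
    """Count instructions that read the result of the one just before them."""
    return sum(1 for prev, cur in zip(order, order[1:]) if prev[1] in cur[2])

valid = respects_dependences(REORDERED)
before = adjacent_dependent_pairs(ORIGINAL)
after = adjacent_dependent_pairs(REORDERED)
print(valid, before, after)
```

Because every instruction writes a different register, the program has no WAR or WAW dependences, so any order that respects the true dependences is legal.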