Lecture 7 Dynamic Scheduling

Slides:



Advertisements
Similar presentations
CMSC 611: Advanced Computer Architecture Tomasulo Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted.
Advertisements

Scoreboarding & Tomasulos Approach Bazat pe slide-urile lui Vincent H. Berk.
Instruction-level Parallelism Compiler Perspectives on Code Movement dependencies are a property of code, whether or not it is a HW hazard depends on.
Lec18.1 Step by step for Dynamic Scheduling by reorder buffer Copyright by John Kubiatowicz (http.cs.berkeley.edu/~kubitron)
A scheme to overcome data hazards
Dynamic ILP: Scoreboard Professor Alvin R. Lebeck Computer Science 220 / ECE 252 Fall 2008.
Dyn. Sched. CSE 471 Autumn 0219 Tomasulo’s algorithm “Weaknesses” in scoreboard: –Centralized control –No forwarding (more RAW than needed) Tomasulo’s.
Lecture 6: ILP HW Case Study— CDC 6600 Scoreboard & Tomasulo’s Algorithm Professor Alvin R. Lebeck Computer Science 220 Fall 2001.
Computer Organization and Architecture (AT70.01) Comp. Sc. and Inf. Mgmt. Asian Institute of Technology Instructor: Dr. Sumanta Guha Slide Sources: Based.
COMP25212 Advanced Pipelining Out of Order Processors.
CMSC 611: Advanced Computer Architecture Scoreboard Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted.
1 IBM System 360. Common architecture for a set of machines. Robert Tomasulo worked on a high-end machine, the Model 91 (1967), on which they implemented.
COMP381 by M. Hamdi 1 Pipelining (Dynamic Scheduling Through Hardware Schemes)
ENGS 116 Lecture 71 Scoreboarding Vincent H. Berk October 8, 2008 Reading for today: A.5 – A.6, article: Smith&Pleszkun FRIDAY: NO CLASS Reading for Monday:
1 COMP 206: Computer Architecture and Implementation Montek Singh Wed, Oct 5, 2005 Topic: Instruction-Level Parallelism (Dynamic Scheduling: Scoreboarding)
Nov. 9, Lecture 6: Dynamic Scheduling with Scoreboarding and Tomasulo Algorithm (Section 2.4)
1 Sixth Lecture: Chapter 3: CISC Processors (Tomasulo Scheduling and IBM System 360/91) Please recall:  Multicycle instructions lead to the requirement.
Out-of-order execution: Scoreboarding and Tomasulo Week 2
1 Lecture 6 Tomasulo Algorithm CprE 581 Computer Systems Architecture, Fall 2009 Zhao Zhang Reading:Textbook 2.4, 2.5.
CET 520/ Gannod1 Section A.8 Dynamic Scheduling using a Scoreboard.
Professor Nigel Topham Director, Institute for Computing Systems Architecture School of Informatics Edinburgh University Informatics 3 Computer Architecture.
2/24; 3/1,3/11 (quiz was 2/22, QuizAns 3/8) CSE502-S11, Lec ILP 1 Tomasulo Organization FP adders Add1 Add2 Add3 FP multipliers Mult1 Mult2 From.
CSC 4250 Computer Architectures September 29, 2006 Appendix A. Pipelining.
Chapter 3 Instruction Level Parallelism Dr. Eng. Amr T. Abdel-Hamid Elect 707 Spring 2011 Computer Applications Text book slides: Computer Architec ture:
04/03/2016 slide 1 Dynamic instruction scheduling Key idea: allow subsequent independent instructions to proceed DIVDF0,F2,F4; takes long time ADDDF10,F0,F8;
COMP25212 Advanced Pipelining Out of Order Processors.
Sections 3.2 and 3.3 Dynamic Scheduling – Tomasulo’s Algorithm 吳俊興 高雄大學資訊工程學系 October 2004 EEF011 Computer Architecture 計算機結構.
Instruction-Level Parallelism and Its Dynamic Exploitation
IBM System 360. Common architecture for a set of machines
Images from Patterson-Hennessy Book
/ Computer Architecture and Design
Out of Order Processors
Dynamic Scheduling and Speculation
Step by step for Tomasulo Scheme
CS203 – Advanced Computer Architecture
CSE 520 Computer Architecture Lec Chapter 2 - DS-Tomasulo
Lecture 6 Score Board And Tomasulo’s Algorithm
March 11, 2002 Prof. David E. Culler Computer Science 252 Spring 2002
Advantages of Dynamic Scheduling
High-level view Out-of-order pipeline
11/14/2018 CPE 631 Lecture 10: Instruction Level Parallelism and Its Dynamic Exploitation Aleksandar Milenković, Electrical and Computer.
CMSC 611: Advanced Computer Architecture
A Dynamic Algorithm: Tomasulo’s
COMP s1 Seminar 3: Dynamic Scheduling
Out of Order Processors
Last Week Talks Any feedback from the talks? What did you like?
CS252 Graduate Computer Architecture Lecture 6 Scoreboard, Tomasulo, Register Renaming February 7th, 2011 John Kubiatowicz Electrical Engineering and.
Lecture 8: ILP and Speculation Contd. Chapter 2, Sections 2. 6, 2
John Kubiatowicz (http.cs.berkeley.edu/~kubitron)
CS 704 Advanced Computer Architecture
Checking for issue/dispatch
Lecture 7: Dynamic Scheduling with Tomasulo Algorithm (Section 2.4)
March 11, 2002 Prof. David E. Culler Computer Science 252 Spring 2002
Static vs. dynamic scheduling
CSCE430/830 Computer Architecture
Static vs. dynamic scheduling
September 20, 2000 Prof. John Kubiatowicz
1/2/2019 CPE 631 Lecture 10: Instruction Level Parallelism and Its Dynamic Exploitation Aleksandar Milenković, Electrical and Computer.
Tomasulo Organization
Reduction of Data Hazards Stalls with Dynamic Scheduling
Lecture 5 Scoreboarding: Enforce Register Data Dependence
CS152 Computer Architecture and Engineering Lecture 16 Compiler Optimizations (Cont) Dynamic Scheduling with Scoreboards.
CS252 Graduate Computer Architecture Lecture 6 Tomasulo, Implicit Register Renaming, Loop-Level Parallelism Extraction Explicit Register Renaming February.
Scoreboarding ENGS 116 Lecture 7 Vincent H. Berk October 5, 2005
/ Computer Architecture and Design
John Kubiatowicz (http.cs.berkeley.edu/~kubitron)
September 20, 2000 Prof. John Kubiatowicz
CS252 Graduate Computer Architecture Lecture 6 Introduction to Advanced Pipelining: Out-Of-Order Execution John Kubiatowicz Electrical Engineering and.
High-level view Out-of-order pipeline
Conceptual execution on a processor which exploits ILP
Presentation transcript:

Lecture 7 Dynamic Scheduling This lecture will cover sections 1.1-1.4: introduction, application, technology trends, cost and price. Assoc. Prof Dr. Norafida Ithnin Advanced Computer System & Architecture MCC 2313

Dynamic vs. Static Scheduling Data hazards in a program to cause a processor to stall. With static scheduling the compiler tries to reorder these instructions during compile time to reduce pipeline stalls. Uses less hardware Can use more powerful algorithms With dynamic scheduling the hardware tries to rearrange the instructions during run-time to reduce pipeline stalls. Simpler compiler Handles dependencies not known at compile time Allows code compiled for a different machine to run efficiently.

Out-Of-Order Execution In our previous model of DLX, all instructions executed in the order that they appear This can lead to unnecessary stalls DIVD FO, F2, F4 ADDD F10, F0, F8 SUBD F12, F8, F14 SUBD stalls waiting for the ADDD to go first, even though SUBD does not have a data dependency. With out-of-order execution, the SUBD is allowed to executed before the add This can lead to out-of order completion, which can cause WAW and WAR hazards

Scoreboarding The scoreboard implements a centralized control scheme that Detects all resource and data hazards Allows instructions to execute out-of-order when no resource hazards or data dependencies First implemented in 1964 by the CDC 6600, which had 18 separate functional units 4 FP units (2 multiply, 1 add, 1 divide) 7 memory units (5 loads, 2 stores) 7 integer units (add, shift, logical, compare, etc.) Dynamic DLX (much simpler) 2 FP multiply (10 EX cycles) 1 FP add (2 EX cycles) 1 FP divide (40 EX cycles) 1 integer unit (1 EX cycle)

Out-of-Order Execution Out-of-order execution divides ID stage: 1. Issue—decode instructions, check for structural hazards 2. Read operands—wait until no data hazards, then read operands Scoreboards allow instruction to execute whenever 1 & 2 hold, not waiting for prior instructions CDC 6600: In order issue, out of order execution, out of order commit (also called completion)

Scoreboard Implications Out-of-order completion can lead to WAR and WAW hazards? Solution for WAW Detect WAW hazard before reading operands Stall write until other instruction completes Solutions for WAR Detect WAR hazards before writing back to the register files and stall the write back This scoreboard does not take advantage of forwarding, since it waits until both results are written back to the register file Scoreboard replaces ID, EX, WB with 4 stages

Four Stages of Scoreboard Control Issue (ID1) decode instructions check for structural and WAW hazards stall until structural and WAW hazards are resolved Read operands (ID2) wait until no RAW hazards then read operands Execution (EX) operate on operands may be multiple cycles - notify scoreboard when done Write result (WB) finish execution stall if WAR hazard

Three Parts of the Scoreboard 1. Instruction status—which of 4 steps the instruction is in: ID1, ID1, EX, or WB. 2. Functional unit status—Indicates the state of the functional unit (FU). 9 fields for each functional unit Busy—Indicates whether the unit is busy or not Op—Operation to perform in the unit (e.g., + or –) Fi—Destination register Fj, Fk—Source-register numbers Qj, Qk—Functional units producing source registers Fj, Fk Rj, Rk—Flags indicating when Fj, Fk are ready 3. Register result status—Indicates which functional unit will write each register, if one exists. Blank when no pending instructions will write that register

Scoreboard Example for DLX

CDC 6000 Scoreboard Summary Speedup from scoreboard 1.7 for FORTRAN programs 2.5 for hand-coded assembly language programs Effects of modern compilers? Hardware Scoreboard hardware approximately same as one FPU Main cost was buses (4x’s normal amount) Could be more severe for modern processors Limitations No forwarding logic Limited to instructions in basic block Stalls for WAW hazards Wait for WAR hazards before WB

Tomasulo Algorithm for Dynamic Scheduling For IBM 360/91 in 1967 - about 3 years after CDC 6600 Goal: High performance without special compilers Differences between IBM 360 & CDC 6600 IBM has only 2 register specifiers/instr vs. 3 in CDC 6600 IBM has register-memory instructions IBM has 4 FP registers vs. 8 in CDC 6600 IBM has pipelined functional units (3 adds, 2 multiplies) Tomasulo algorithm is designed to handle name dependencies (WAW and WAR hazards) efficiently SUB F1, F2, F0 DIVF F2, F3 , F2 ADDF F3, F0, F0 MULF F3, F1, F1

Tomasulo Algorithm Differences from Scoreboarding Distributed hazard detection and control (through reservation stations) Results are bypassed to function units Common data bus (CDB) broadcasts results to all FUs. HW renaming of registers to avoid WAR, WAW hazards Load and Stores treated as FUs as well Registers in instructions replaced by pointers to reservation station buffers Lead to concepts used in Alpha 21264, HP 8000, MIPS 10000, Pentium II, PowerPC 604, …

Tomasulo Organization FP Op Queue FP Registers Load Buffer Store Buffer Common Data Bus FP Add Res. Station FP Mul Res. Station

Reservation Station Components Op—Operation to perform in the unit (e.g., + or –) Qj, Qk—Reservation stations producing source registers Vj, Vk—Value of Source operands Rj, Rk—Flags indicating when Vj, Vk are ready Busy—Indicates reservation station and FU is busy Register result status—Indicates which functional unit will write each register, if one exists. Blank when no pending instructions that will write that register. What you might have thought 1. 4 stages of instruction executino 2.Status of FU: Normal things to keep track of (RAW & structura for busyl): Fi from instruction format of the mahine (Fi is dest) Add unit can Add or Sub Rj, Rk - status of registers (Yes means ready) Qj,Qk - If a no in Rj, Rk, means waiting for a FU to write result; Qj, Qk means wihch FU waiting for it 3.Status of register result (WAW &WAR)s: which FU is going to write into registers Scoreboard on 6600 = size of FU 6.7, 6.8, 6.9, 6.12, 6.13, 6.16, 6.17 FU latencies: Add 2, Mult 10, Div 40 clocks

Three Stages of Tomasulo Algorithm 1. Issue—get instruction from FP Op Queue If reservation station free, the scoreboard issues instr & sends operands (renames registers). 2. Execution—operate on operands (EX) When both operands ready then execute; if not ready, watch CDB for result 3. Write result—finish execution (WB) Write on Common Data Bus to all awaiting units; mark reservation station available.

Tomasulo Example Cycle 0

Tomasulo Summary Advantages Prevents register from being the bottleneck Eliminates WAR, WAW hazards Allows loop unrolling in HW Common Data Bus Broadcasts results to multiple instructions Central bottleneck Lasting Contributions Dynamic scheduling Register renaming Load/store disambiguation

Discussions