04/03/2016 slide 1 Dynamic instruction scheduling
Key idea: allow subsequent independent instructions to proceed
  DIVD F0,F2,F4    ; takes long time
  ADDD F10,F0,F8   ; stalls waiting for F0
  SUBD F12,F8,F13  ; let this instruction bypass the ADDD
Enables out-of-order execution => out-of-order completion
Figure: pipeline stages IF, ID, EX, M, WB, with the label "Instr. gets stuck here"
Two historical schemes used in recent machines:
- Scoreboard dates back to CDC 6600 in 1963
- Tomasulo in IBM 360/91 in 1967
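The key idea on slide 1 can be checked mechanically. The following is a minimal sketch, not part of the original slides: the `Instr` tuple and the helper names are made up for illustration, and only RAW (read-after-write) dependences are considered.

```python
from typing import NamedTuple, Tuple


class Instr(NamedTuple):
    """Hypothetical instruction record: opcode, destination, sources."""
    op: str
    dest: str
    srcs: Tuple[str, ...]


def raw_depends(later: Instr, earlier: Instr) -> bool:
    """True if `later` reads the register that `earlier` writes (RAW)."""
    return earlier.dest in later.srcs


def independent_of_all_earlier(index: int, program) -> bool:
    """True if program[index] has no RAW dependence on any earlier instruction."""
    later = program[index]
    return not any(raw_depends(later, e) for e in program[:index])


# The three instructions from slide 1.
program = [
    Instr("DIVD", "F0", ("F2", "F4")),
    Instr("ADDD", "F10", ("F0", "F8")),
    Instr("SUBD", "F12", ("F8", "F13")),
]

# Instructions (other than the first) that do not need to wait for a result.
can_bypass = [
    ins.op
    for i, ins in enumerate(program)
    if i > 0 and independent_of_all_earlier(i, program)
]
print(can_bypass)  # ['SUBD']
```

ADDD reads F0, which DIVD has not yet produced, so it must wait; SUBD reads only F8 and F13 and is free to proceed.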

04/03/2016 slide 2 Scoreboard pipeline
Figure: IF, Issue, Read operands, functional units (Int, FP Add, FP Mult1, FP Mult2, FP Div, Mem), Write, all controlled by the Scoreboard
The ID stage is split into:
- Issue: decode and check for structural hazards
- Read operands: wait until no data hazard, then read operands
All data hazards are handled by the scoreboard mechanism

04/03/2016 slide 3 Scoreboard complications
Out-of-order completion => WAR, WAW hazards
- WAR: instruction is stalled in the WB stage until a previous instruction has read the operand
- WAW: instruction is stalled in the Issue stage until a previous instruction has written its result
Scoreboard keeps track of dependencies and state of operations
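To make the three hazard types concrete, here is a small illustrative sketch. The function name, the `(dest, sources)` tuple encoding, and the exact operands of the example instructions are assumptions chosen for illustration; the stall stages in `STALL_STAGE` follow the scoreboard slides.

```python
def hazards(earlier, later):
    """Classify hazards between two instructions in program order.

    Each instruction is a hypothetical (dest, sources) pair.
    """
    e_dest, e_srcs = earlier
    l_dest, l_srcs = later
    found = set()
    if e_dest in l_srcs:
        found.add("RAW")  # later reads what earlier writes
    if l_dest in e_srcs:
        found.add("WAR")  # later overwrites what earlier still has to read
    if l_dest == e_dest:
        found.add("WAW")  # both write the same register
    return found


# Where the scoreboard stalls the later instruction for each hazard type.
STALL_STAGE = {"RAW": "Read operands", "WAR": "Write", "WAW": "Issue"}

divd = ("F10", ("F0", "F6"))  # assumed: DIVD F10,F0,F6
addd = ("F6", ("F8", "F2"))   # assumed: ADDD F6,F8,F2

print(sorted(hazards(divd, addd)))                  # ['WAR']
print(sorted(hazards(("F6", ("F1",)), addd)))       # ['WAW']
print(STALL_STAGE["WAR"])                           # Write
```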

04/03/2016 slide 4 Scoreboard functionality
Issue: instruction is issued when there is
- no structural hazard for a functional unit
- no WAW hazard with an instruction in execution
Read: instruction reads its operands when they become available (resolves RAW)
EX: normal execution
Write: instruction writes its result once all previous instructions have read the old value of its destination register (resolves WAR)
The scoreboard is updated when an instruction proceeds to a new stage
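The stage conditions above can be written as three predicates. This is a simplified sketch under stated assumptions: the function names and the set-based state (`pending_writes`, `unread_sources`) are hypothetical and track only earlier, still-active instructions; a real scoreboard derives the same information from its status tables.

```python
def can_issue(fu_busy, dest, pending_writes):
    """Issue: no structural hazard and no WAW with an active instruction."""
    return (not fu_busy) and dest not in pending_writes


def can_read_operands(srcs, pending_writes):
    """Read: every source has already been written (no outstanding RAW)."""
    return all(s not in pending_writes for s in srcs)


def can_write(dest, unread_sources):
    """Write: no earlier instruction still has to read `dest` (no WAR)."""
    return dest not in unread_sources


# DIVD F0,F2,F4 is in flight, so F0 is a pending write.
pending_writes = {"F0"}

print(can_issue(False, "F10", pending_writes))            # True: ADDD may issue
print(can_read_operands(("F0", "F8"), pending_writes))    # False: ADDD waits for F0
print(can_read_operands(("F8", "F13"), pending_writes))   # True: SUBD can read
print(can_write("F6", {"F6"}))                            # False: WAR on F6
```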

04/03/2016 slide 5 Data structures in the scoreboard
1. Instruction status: keeps track of which stage an instruction is in.
2. Functional unit status: indicates the state of the functional unit (FU). 9 fields for each FU:
   Busy: indicates whether the unit is busy or not
   Op: operation to perform in the unit (e.g. add or sub)
   Fi: destination register name
   Fj, Fk: source register names
   Qj, Qk: names of the functional units producing registers Fj, Fk
   Rj, Rk: flags indicating when Fj and Fk are ready
3. Register result status: indicates which functional unit will write to each register, if any.
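As an illustration of how these tables fit together, the sketch below models the functional unit status and the register result status and performs the issue-time bookkeeping. The class, the `issue` helper, and the unit names "Div" and "Add" are assumptions for this example, not the slides' own notation beyond the nine field names.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class FUStatus:
    """The nine functional unit status fields from the slide."""
    busy: bool = False
    op: Optional[str] = None
    fi: Optional[str] = None   # destination register
    fj: Optional[str] = None   # source register 1
    fk: Optional[str] = None   # source register 2
    qj: Optional[str] = None   # FU producing Fj (None if no producer pending)
    qk: Optional[str] = None   # FU producing Fk
    rj: bool = False           # Fj ready?
    rk: bool = False           # Fk ready?


def issue(fu_name, op, dest, src1, src2, fus, result_status):
    """Try to issue; return False on a structural or WAW hazard."""
    if fus[fu_name].busy or result_status.get(dest) is not None:
        return False
    qj = result_status.get(src1)
    qk = result_status.get(src2)
    fus[fu_name] = FUStatus(True, op, dest, src1, src2, qj, qk,
                            qj is None, qk is None)
    result_status[dest] = fu_name  # register result status
    return True


fus: Dict[str, FUStatus] = {"Div": FUStatus(), "Add": FUStatus()}
result_status: Dict[str, str] = {}

issue("Div", "DIVD", "F0", "F2", "F4", fus, result_status)
issue("Add", "ADDD", "F10", "F0", "F8", fus, result_status)
print(fus["Add"].qj, fus["Add"].rj, fus["Add"].rk)  # Div False True
```

With only one adder in this toy setup, a following SUBD would fail to issue because "Add" is busy, which is the structural hazard check from the Issue stage.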

04/03/2016 slide 6 Scoreboard example

04/03/2016 slide 7 Detailed Scoreboard Pipeline Control

04/03/2016 slide 8 Scoreboard example, cycle 1

04/03/2016 slide 9 Scoreboard example, cycle 2 Issue 2nd load?

04/03/2016 slide 10 Scoreboard example, cycle 3 Issue MULT?

04/03/2016 slide 11 Scoreboard example, cycle 4

04/03/2016 slide 12 Scoreboard example, cycle 5

04/03/2016 slide 13 Scoreboard example, cycle 6

04/03/2016 slide 14 Scoreboard example, cycle 7

04/03/2016 slide 15 Scoreboard example, cycle 8a

04/03/2016 slide 16 Scoreboard example, cycle 8

04/03/2016 slide 17 Scoreboard example, cycle 9 Read operands for MULT & SUB Issue ADDD?

04/03/2016 slide 18 Scoreboard example, cycle 11 SUBD completes execution

04/03/2016 slide 19 Scoreboard example, cycle 12 Read operands for DIVD?

04/03/2016 slide 20 Scoreboard example, cycle 13 Issue ADDD

04/03/2016 slide 21 Scoreboard example, cycle 14

04/03/2016 slide 22 Scoreboard example, cycle 16 Can ADDD write result?

04/03/2016 slide 23 Scoreboard example, cycle 17 ADDD stalls, waiting for DIVD to read F6 Resolves a WAR hazard!

04/03/2016 slide 24 Scoreboard example, cycle 19

04/03/2016 slide 25 Scoreboard example, cycle 20

04/03/2016 slide 26 Scoreboard example, cycle 21

04/03/2016 slide 27 Scoreboard example, cycle 22 Now ADDD can safely write its result in F6

04/03/2016 slide 28 Scoreboard example, cycle 61

04/03/2016 slide 29 Scoreboard example, cycle 62

04/03/2016 slide 30 Limitations with scoreboards
The scoreboard technique is limited by:
- Number of scoreboard entries (window size)
- Number and types of functional units
- Number of ports to the register bank
- Hazards caused by name dependencies
Tomasulo's algorithm addresses the last two limitations

04/03/2016 slide 31 Tomasulo's Algorithm
In IBM 360/91, 4 years after the CDC 6600
Goal: high performance without compiler support
Differences between Tomasulo & Scoreboard:
- Control & buffers distributed with FUs (called reservation stations) vs. centralised in Scoreboard
- Register names in instructions replaced by pointers to reservation station buffer (HW register renaming)
- Common Data Bus broadcasts results to all FUs
- Loads and stores treated as FUs as well
This technique has been adopted in many recent machines (e.g. PowerPC)

04/03/2016 slide 32 Hardware Organization

04/03/2016 slide 33 Three stages of Tomasulo's Alg.
1. Issue: get instruction from FP Op Queue
   Issue if no structural hazard for a reservation station
2. Execution: operate on operands (EX)
   Execute when both operands are available; if not ready, watch Common Data Bus (CDB) for result
3. Write result: finish execution (WB)
   Write on CDB to all awaiting functional units; mark reservation station available
Normal bus: data + destination
Common Data Bus: data + source (snooping)
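The "data + source" snooping in the write-result stage can be sketched as follows. This is an illustrative model only: the `Station` class, the field names `vj`/`vk` (operand values) and `qj`/`qk` (producer tags), the `broadcast` helper, and the station names are assumptions, not taken from the slides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Station:
    """Hypothetical reservation station entry."""
    name: str
    op: str
    vj: Optional[float] = None  # operand values, once known
    vk: Optional[float] = None
    qj: Optional[str] = None    # tags of the stations producing the operands
    qk: Optional[str] = None

    def ready(self) -> bool:
        """Execution may start when no operand is still awaited."""
        return self.qj is None and self.qk is None


def broadcast(tag, value, stations, reg_tag, reg_val):
    """Put (source tag, value) on the CDB; every waiting consumer grabs it."""
    for rs in stations:
        if rs.qj == tag:
            rs.vj, rs.qj = value, None
        if rs.qk == tag:
            rs.vk, rs.qk = value, None
    for reg, producer in list(reg_tag.items()):
        if producer == tag:  # register file snoops the bus too
            reg_val[reg] = value
            del reg_tag[reg]


add1 = Station("Add1", "ADDD", vk=2.0, qj="Mult1")  # waits for Mult1's result
reg_tag = {"F0": "Mult1"}  # F0 will be written by Mult1
reg_val = {}

print(add1.ready())  # False
broadcast("Mult1", 6.0, [add1], reg_tag, reg_val)
print(add1.ready(), add1.vj, reg_val)  # True 6.0 {'F0': 6.0}
```

Because the bus carries the producer's tag rather than a destination, any number of consumers can pick up the same result in one broadcast.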

04/03/2016 slide 34 Tomasulo example, cycle 0

04/03/2016 slide 35 Tomasulo example, cycle 1

04/03/2016 slide 36 Tomasulo example, cycle 2

04/03/2016 slide 37 Tomasulo example, cycle 3

04/03/2016 slide 38 Tomasulo example, cycle 4

04/03/2016 slide 39 Tomasulo example, cycle 5

04/03/2016 slide 40 Tomasulo example, cycle 6

04/03/2016 slide 41 Tomasulo example, cycle 7

04/03/2016 slide 42 Tomasulo example, cycle 8

04/03/2016 slide 43 Tomasulo example, cycle 10

04/03/2016 slide 44 Tomasulo example, cycle 11

04/03/2016 slide 45 Tomasulo example, cycle 15

04/03/2016 slide 46 Tomasulo example, cycle 16

04/03/2016 slide 47 Tomasulo example, cycle 56

04/03/2016 slide 48 Tomasulo example, cycle 57

04/03/2016 slide 49 Example of WAR hazards in Tomasulo's Algorithm
Example:
  LF   F6, 34(R2)
  DIVF F10, F6, F0
  ADDF F6, F8, F2
ADDF can safely finish before DIVF has read register F6 because:
- DIVF has renamed register F6 to point at LF's functional unit
- LF broadcasts its result on the Common Data Bus
Register renaming can thus be done:
- statically by the compiler
- dynamically by the hardware
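The renaming argument can be traced with a few lines of code. This is a minimal sketch assuming a hypothetical `issue` helper and made-up reservation station names (Load1, Mult1, Add1); it models only the tag bookkeeping at issue time, not execution.

```python
def issue(dest, srcs, station, reg_tag):
    """Capture where each source will come from, then rename `dest`.

    A source that has a pending producer is replaced by that producer's
    tag; otherwise the register name itself is kept (value already valid).
    """
    captured = tuple(reg_tag.get(s, s) for s in srcs)
    reg_tag[dest] = station  # later readers of `dest` wait on this station
    return captured


reg_tag = {}  # register -> station that will produce its next value

lf = issue("F6", ("R2",), "Load1", reg_tag)          # LF   F6, 34(R2)
divf = issue("F10", ("F6", "F0"), "Mult1", reg_tag)  # DIVF F10, F6, F0
addf = issue("F6", ("F8", "F2"), "Add1", reg_tag)    # ADDF F6, F8, F2

print(divf)           # ('Load1', 'F0')
print(reg_tag["F6"])  # Add1
```

DIVF holds the tag Load1 rather than the name F6, so ADDF overwriting F6 early cannot change the operand DIVF receives.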