1 Recap (Scoreboarding). 2 Dynamic Scheduling Dynamic Scheduling by Hardware – – Allow Out-of-order execution, Out-of-order completion – – Even though.

Slides:



Advertisements
Similar presentations
Scoreboarding & Tomasulos Approach Bazat pe slide-urile lui Vincent H. Berk.
Advertisements

Instruction-level Parallelism Compiler Perspectives on Code Movement dependencies are a property of code, whether or not it is a HW hazard depends on.
A scheme to overcome data hazards
Dynamic ILP: Scoreboard Professor Alvin R. Lebeck Computer Science 220 / ECE 252 Fall 2008.
Lecture 6: ILP HW Case Study— CDC 6600 Scoreboard & Tomasulo’s Algorithm Professor Alvin R. Lebeck Computer Science 220 Fall 2001.
COMP25212 Advanced Pipelining Out of Order Processors.
CSE 8383 Superscalar Processor 1 Abdullah A Alasmari & Eid S. Alharbi.
Oct. 18, 2000Machine Organization1 Machine Organization (CS 570) Lecture 7: Dynamic Scheduling and Branch Prediction * Jeremy R. Johnson Wed. Nov. 8, 2000.
CMSC 611: Advanced Computer Architecture Scoreboard Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted.
Lecture 6: Pipelining MIPS R4000 and More Kai Bu
Instruction-Level Parallelism (ILP)
Spring 2003CSE P5481 Reorder Buffer Implementation (Pentium Pro) Hardware data structures retirement register file (RRF) (~ IBM 360/91 physical registers)
1 IF IDEX MEM L.D F4,0(R2) MUL.D F0, F4, F6 ADD.D F2, F0, F8 L.D F2, 0(R2) WB IF IDM1 MEM WBM2M3M4M5M6M7 stall.
Computer Architecture
Data Hazards RAW Hazard ADD.D F3, F1, F2 SUB.D F5, F6, F3 No Solution, normal property of programs WAW Hazard DIV.D F3, F1, F2 SUB.D F3, F6, F5 This instruction.
CSC 4250 Computer Architectures October 17, 2006 Chapter 3.Instruction-Level Parallelism & Its Dynamic Exploitation.
1 Lecture 5 Branch Prediction (2.3) and Scoreboarding (A.7)
1 IBM System 360. Common architecture for a set of machines. Robert Tomasulo worked on a high-end machine, the Model 91 (1967), on which they implemented.
COMP381 by M. Hamdi 1 Pipelining (Dynamic Scheduling Through Hardware Schemes)
CSCI 620 NOTE8 1 Instruction Level Parallelism and Tomasulo’s approach.
ENGS 116 Lecture 71 Scoreboarding Vincent H. Berk October 8, 2008 Reading for today: A.5 – A.6, article: Smith&Pleszkun FRIDAY: NO CLASS Reading for Monday:
1 COMP 206: Computer Architecture and Implementation Montek Singh Wed, Oct 5, 2005 Topic: Instruction-Level Parallelism (Dynamic Scheduling: Scoreboarding)
EENG449b/Savvides Lec 5.1 1/27/04 January 27, 2004 Prof. Andreas Savvides Spring EENG 449bG/CPSC 439bG Computer.
CSC 4250 Computer Architectures October 13, 2006 Chapter 3.Instruction-Level Parallelism & Its Dynamic Exploitation.
Computer Architecture
Nov. 9, Lecture 6: Dynamic Scheduling with Scoreboarding and Tomasulo Algorithm (Section 2.4)
Out-of-order execution: Scoreboarding and Tomasulo Week 2
CSC 4250 Computer Architectures September 26, 2006 Appendix A. Pipelining.
1 Lecture 5 Overview of Superscalar Techniques CprE 581 Computer Systems Architecture, Fall 2009 Zhao Zhang Reading: Textbook, Ch. 2.1 “Complexity-Effective.
Instruction-Level Parallelism Dynamic Scheduling
1 Lecture 6 Tomasulo Algorithm CprE 581 Computer Systems Architecture, Fall 2009 Zhao Zhang Reading:Textbook 2.4, 2.5.
CET 520/ Gannod1 Section A.8 Dynamic Scheduling using a Scoreboard.
Professor Nigel Topham Director, Institute for Computing Systems Architecture School of Informatics Edinburgh University Informatics 3 Computer Architecture.
1 Lecture 5: Dependence Analysis and Superscalar Techniques Overview Instruction dependences, correctness, inst scheduling examples, renaming, speculation,
1 Images from Patterson-Hennessy Book Machines that introduced pipelining and instruction-level parallelism. Clockwise from top: IBM Stretch, IBM 360/91,
CSC 4250 Computer Architectures September 29, 2006 Appendix A. Pipelining.
Chapter 3 Instruction Level Parallelism Dr. Eng. Amr T. Abdel-Hamid Elect 707 Spring 2011 Computer Applications Text book slides: Computer Architec ture:
MS108 Computer System I Lecture 6 Scoreboarding Prof. Xiaoyao Liang 2015/4/3 1.
04/03/2016 slide 1 Dynamic instruction scheduling Key idea: allow subsequent independent instructions to proceed DIVDF0,F2,F4; takes long time ADDDF10,F0,F8;
CIS 662 – Computer Architecture – Fall Class 11 – 10/12/04 1 Scoreboarding  The following four steps replace ID, EX and WB steps  ID: Issue –
COMP25212 Advanced Pipelining Out of Order Processors.
Instruction-Level Parallelism and Its Dynamic Exploitation
IBM System 360. Common architecture for a set of machines
Images from Patterson-Hennessy Book
Out of Order Processors
Dynamic Scheduling and Speculation
Step by step for Tomasulo Scheme
CS203 – Advanced Computer Architecture
Lecture 6 Score Board And Tomasulo’s Algorithm
Chapter 3: ILP and Its Exploitation
Advantages of Dynamic Scheduling
High-level view Out-of-order pipeline
CMSC 611: Advanced Computer Architecture
A Dynamic Algorithm: Tomasulo’s
COMP s1 Seminar 3: Dynamic Scheduling
Out of Order Processors
Last Week Talks Any feedback from the talks? What did you like?
John Kubiatowicz (http.cs.berkeley.edu/~kubitron)
CS 704 Advanced Computer Architecture
Static vs. dynamic scheduling
CSCE430/830 Computer Architecture
Advanced Computer Architecture
Static vs. dynamic scheduling
Tomasulo Organization
Reduction of Data Hazards Stalls with Dynamic Scheduling
Lecture 5 Scoreboarding: Enforce Register Data Dependence
CS152 Computer Architecture and Engineering Lecture 16 Compiler Optimizations (Cont) Dynamic Scheduling with Scoreboards.
Scoreboarding ENGS 116 Lecture 7 Vincent H. Berk October 5, 2005
High-level view Out-of-order pipeline
Lecture 7 Dynamic Scheduling
Presentation transcript:

1 Recap (Scoreboarding)

2 Dynamic Scheduling Dynamic Scheduling by Hardware – – Allow Out-of-order execution, Out-of-order completion – – Even though an instruction is stalled, later instructions, with no data dependencies with the instructions which are stalled and causing the stall, can proceed – – Efficient utilization of functional unit with multiple units

3 Dynamic Pipeline Scheduling: The Concept Instruction are allowed to start executing out-of-order as soon as their operands are available. Example: This implies allowing out-of-order instruction commit (completion). In the case of in-order execution SUBD must wait for DIVD to complete which stalled ADDD before starting execution In out-of-order execution SUBD can start as soon as the values of its operands F8, F14 are available. DIVD F0, F2, F4 ADDD F10, F0, F8 SUBD F12, F8, F14

4 Dynamic Pipeline Scheduling Dynamic instruction scheduling is accomplished by: –Dividing the Instruction Decode ID stage into two stages: Issue: Decode instructions, check for structural hazards. Read operands: Wait until data hazard conditions, if any, are resolved, then read operands when available. (All instructions pass through the issue stage in order but can be stalled or pass each other in the read operands stage).

5 Scoreboard Implications Out-of-order execution ==> WAR, WAW hazards? DIVD F0, F2, F4 ADDD F10, F0, F8 SUBD F8, F8, F14 If the pipeline executes SUBD before ADDD, it will yield incorrect execution A WAW hazard would occur. We must detect the hazard and stall until other completes. DIVD F0, F2, F4 ADDD F10, F0, F8 SUBD F10, F8, F14

6 Scoreboard Specifics Several functional units –several floating-point units, integer units, and memory reference units Data dependencies (hazards) are detected when an instruction reaches the scoreboard –corresponding to instruction issue replacing part of the ID stage Scoreboard determines –when the instruction is ready for execution –based on when its operands and functional unit become available –where results are written

7 The basic structure of a MIPS processor with a scoreboard

8 Three Parts of the Scoreboard 1 Instruction status: Which of 4 steps the instruction is in. 2 Functional unit status: Indicates the state of the functional unit (FU). Nine fields for each functional unit: –Busy Indicates whether the unit is busy or not –Op Operation to perform in the unit (e.g., + or –) –Fi Destination register –Fj, Fk Source-register numbers –Qj, Qk Functional units producing source registers Fj, Fk –Rj, Rk Flags indicating when Fj, Fk are ready (set to Yes after operand is available to read) 3 Register result status: Indicates which functional unit will write to each register, if one exists. Blank when no pending instructions will write that register.

9 A Scoreboard Example The following code is run on the MIPS with a scoreboard given earlier with: L.D F6, 34(R2) L.D F2, 45(R3) MUL.D F0, F2, F4 SUB.D F8, F6, F2 DIV.D F10, F0, F6 ADD.D F6, F8, F2 Functional Unit (FU) # of FUs EX Latency Integer 1 0 Floating Point Multiply 2 10 Floating Point add 1 2 Floating point Divide 1 40 All functional units are not pipelined

10 Dependency Graph For Example Code L.D F6, 34(R2) L.D F2, 45(R3) MUL.D F0, F2, F4 SUB.D F8, F6, F2 DIV.D F10, F0, F6 ADD.D F6, F8, F L.D F6, 34 (R2) 1 L.D F2, 45 (R3) 2 MUL.D F0, F2, F4 3 DIV.D F10, F0, F6 5 SUB.D F8, F6, F2 4 ADD.D F6, F8, F2 6 Date Dependence: (1, 4) (1, 5) (2, 3) (2, 4) (2, 6) (3, 5) (4, 6) Output Dependence: (1, 6) Anti-dependence: (5, 6) Example Code Real Data Dependence (RAW) Anti-dependence (WAR) Output Dependence (WAW)

11 Scoreboard Example: Cycle 1 Instruction status ReadExecutionWrite Instructionjk IssueoperandscompleteResult L.DF634+R2 1 L.DF245+R3 MUL.DF0F2F4 SUB.DF8F6F2 DIV.DF10F0F6 ADD.DF6F8F2 Functional unit statusdestS1S2FU for jFU for kFj?Fk? TimeNameBusyOpFiFjFkQjQkRjRk IntegerYesLoadF6R2Yes Mult1No Mult2No AddNo DivideNo Register result status Clock F0F2F4F6F8F10F12...F30 1FUInteger FP Latency: Add = 2 cycles, Multiply = 10, Divide = 40

12 Scoreboard Example: Cycle 2 FP Latency: Add = 2 cycles, Multiply = 10, Divide = 40 Instruction status ReadExecutionWrite Instructionjk IssueoperandscompleteResult L.DF634+R2 1 L.DF245+R3 MUL.DF0F2F4 SUB.DF8F6F2 DIV.DF10F0F6 ADD.DF6F8F2 Functional unit statusdestS1S2FU for jFU for kFj?Fk? TimeNameBusyOpFiFjFkQjQkRjRk IntegerYesLoadF6R2Yes Mult1No Mult2No AddNo DivideNo Register result status Clock F0F2F4F6F8F10F12...F30 2FUInteger 2 Issue second L.D? No, stall on structural hazard

13 Scoreboard Example: Cycle 3 Issue MUL.D? In-order issue !!! Instruction status ReadExecutionWrite Instructionjk IssueoperandscompleteResult L.DF634+R2123 L.DF245+R3 MUL.DF0F2F4 SUB.DF8F6F2 DIV.DF10F0F6 ADD.DF6F8F2 Functional unit statusdestS1S2FU for jFU for kFj?Fk? TimeNameBusyOpFiFjFkQjQkRjRk IntegerYesLoadF6R2Yes Mult1No Mult2No AddNo DivideNo Register result status Clock F0F2F4F6F8F10F12...F30 3FUInteger ?

14 Scoreboard Example: Cycle 4 Instruction status ReadExecutionWrite Instructionjk IssueoperandscompleteResult L.DF634+R L.DF245+R3 MUL.DF0F2F4 SUB.DF8F6F2 DIV.DF10F0F6 ADD.DF6F8F2 Functional unit statusdestS1S2FU for jFU for kFj?Fk? TimeNameBusyOpFiFjFkQjQkRjRk IntegerYesLoadF6R2Yes Mult1No Mult2No AddNo DivideNo Register result status Clock F0F2F4F6F8F10F12...F30 4FUInteger

15 Scoreboard Example: Cycle 5 Instruction status ReadExecutionWrite Instructionjk IssueoperandscompleteResult F634+R F245+R3 F0F2F4 F8F6F2 F10F0F6 L.D MUL.D SUB.D DIV.D ADD.D F6F8F2 Functional unit statusdestS1S2FU for jFU for kFj?Fk? TimeNameBusyOpFiFjFkQjQkRjRk IntegerYesLoadF2R3Yes Mult1No Mult2No AddNo DivideNo Register result status Clock F0F2F4F6F8F10F12...F30 5FUInteger 5

16 Scoreboard Example: Cycle 6 Instruction status ReadExecutionWrite Instructionjk IssueoperandscompleteResult F634+R F245+R3 F0F2F4 F8F6F2 F10F0F6 F8F2 Functional unit statusdestS1S2FU for jFU for kFj?Fk? TimeNameBusyOpFiFjFkQjQkRjRk IntegerYesLoadF2R3Yes Mult1 Mult2No AddNo DivideNo Register result status Clock F0F2F4F6F8F10F12...F30 6FU Integer Yes Mult F0 F2 F4 Integer No Yes Mult1 L.D MUL.D SUB.D DIV.D ADD.D

17 Scoreboard Example: Cycle 7 Instruction status ReadExecutionWrite Instructionjk IssueoperandscompleteResult F634+R F245+R3 F0F2F4 F8F6F2 F10F0F6 F8F2 Functional unit statusdestS1S2FU for jFU for kFj?Fk? TimeNameBusyOpFiFjFkQjQkRjRk IntegerYesLoadF2R3Yes Mult1 Mult2 No Add Divide No Register result status Clock F0F2F4F6F8F10F12...F30 7FU Integer Yes Mult F0 F2 F4 Integer No Yes Yes Sub F8 F6 F2 Integer Yes No Mult1 Add 7 Read multiply operands? L.D MUL.D SUB.D DIV.D ADD.D

18 Scoreboard Example: Cycle 8a (First half of cycle 8) Instruction status ReadExecutionWrite Instructionjk IssueoperandscompleteResult F634+R F245+R3 F0F2F4 F8F6F2 F10F0F6 F8F2 Functional unit statusdestS1S2FU for jFU for kFj?Fk? TimeNameBusyOpFiFjFkQjQkRjRk IntegerYesLoadF2R3Yes Mult1 Mult2 No Add Divide Register result status Clock F0F2F4F6F8F10F12...F30 8FU Integer Yes Mult F0 F2 F4 Integer No Yes Yes Sub F8 F6 F2 Integer Yes No Mult1Add Divide 7 8 Yes Div F10 F0 F6 Mult1 No Yes L.D MUL.D SUB.D DIV.D ADD.D

19 Scoreboard Example: Cycle 8b (Second half of cycle 8) Instruction status ReadExecutionWrite Instructionjk IssueoperandscompleteResult F634+R F245+R3 F0F2F4 F8F6F2 F10F0F6 F8F2 Functional unit statusdestS1S2FU for jFU for kFj?Fk? TimeNameBusyOpFiFjFkQjQkRjRk Integer No Mult1 Mult2 No Add Divide Register result status Clock F0F2F4F6F8F10F12...F30 8FU Yes Mult F0 F2 F4 Yes Yes Yes Sub F8 F6 F2 Yes Yes Mult1Add Divide 7 8 Yes Div F10 F0 F6 Mult1 No Yes L.D MUL.D SUB.D DIV.D ADD.D

20 Scoreboard Example: Cycle 9 FP Latency: Add = 2 cycles, Multiply = 10, Divide = 40 Instruction status ReadExecutionWrite Instructionjk IssueoperandscompleteResult F634+R F245+R3 F0F2F4 F8F6F2 F10F0F6 F8F2 Functional unit statusdestS1S2FU for jFU for kFj?Fk? TimeNameBusyOpFiFjFkQjQkRjRk Integer No 10 Mult1 Mult2 No 2 Add Divide Register result status Clock F0F2F4F6F8F10F12...F30 9FU Yes Mult F0 F2 F4 Yes Yes Yes Sub F8 F6 F2 Yes Yes Mult1Add Divide Yes Div F10 F0 F6 Mult1 No Yes Read operands for MUL.D & SUB.D? Issue ADD.D? ? L.D MUL.D SUB.D DIV.D ADD.D

21 Scoreboard Example: Cycle 11 Instruction status ReadExecutionWrite Instructionjk IssueoperandscompleteResult F634+R F245+R3 F0F2F4 F8F6F2 F10F0F6 F8F2 Functional unit statusdestS1S2FU for jFU for kFj?Fk? TimeNameBusyOpFiFjFkQjQkRjRk Integer No 8 Mult1 Mult2 No 0 Add Divide Register result status Clock F0F2F4F6F8F10F12...F30 11FU Yes Mult F0 F2 F4 Yes Yes Yes Sub F8 F6 F2 Yes Yes Mult1Add Divide Yes Div F10 F0 F6 Mult1 No Yes L.D MUL.D SUB.D DIV.D ADD.D

22 Scoreboard Example: Cycle 12 Instruction status ReadExecutionWrite Instructionjk IssueoperandscompleteResult F634+R F245+R3 F0F2F4 F8F6F2 F10F0F6 F8F2 Functional unit statusdestS1S2FU for jFU for kFj?Fk? TimeNameBusyOpFiFjFkQjQkRjRk Integer No 7 Mult1 Mult2 No Add Divide Register result status Clock F0F2F4F6F8F10F12...F30 12FU Yes Mult F0 F2 F4 Yes Yes No Mult1 Divide Yes Div F10 F0 F6 Mult1 No Yes Read operands for DIV.D? L.D MUL.D SUB.D DIV.D ADD.D

23 Scoreboard Example: Cycle 13 Instruction status ReadExecutionWrite Instructionjk IssueoperandscompleteResult F634+R F245+R3 F0F2F4 F8F6F2 F10F0F6 F8F2 Functional unit statusdestS1S2FU for jFU for kFj?Fk? TimeNameBusyOpFiFjFkQjQkRjRk Integer No 6 Mult1 Mult2 No Add Divide Register result status Clock F0F2F4F6F8F10F12...F30 13FU Yes Mult F0 F2 F4 Yes Yes Mult1 Add Divide Yes Div F10 F0 F6 Mult1 No Yes Yes Add F6 F8 F2 Yes Yes 13 L.D MUL.D SUB.D DIV.D ADD.D

24 Scoreboard Example: Cycle 17 Instruction status ReadExecutionWrite Instructionjk IssueoperandscompleteResult F634+R21234 F245+R35678 F0F2F469 F8F6F F10F0F68 F8F Functional unit statusdestS1S2FU for jFU for kFj?Fk? TimeNameBusyOpFiFjFkQjQkRjRk IntegerNo 2Mult1YesMultF0F2F4Yes Mult2No AddYesAddF6F8F2Yes DivideYesDivF10F0F6Mult1NoYes Register result status Clock F0F2F4F6F8F10F12...F30 17FUMult1AddDivide Write result of ADD.D? No, WAR hazard L.D MUL.D SUB.D DIV.D ADD.D

25 Scoreboard Example: Cycle 20 Instruction status ReadExecutionWrite Instructionjk IssueoperandscompleteResult F634+R21234 F245+R35678 F0F2F F8F6F F10F0F68 F8F Functional unit statusdestS1S2FU for jFU for kFj?Fk? TimeNameBusyOpFiFjFkQjQkRjRk IntegerNo Mult1 Mult2No AddYesAddF6F8F2Yes DivideYesDivF10F0F6Yes Register result status Clock F0F2F4F6F8F10F12...F30 20FUAddDivide No L.D MUL.D SUB.D DIV.D ADD.D

26 Scoreboard Example: Cycle 21 Instruction status ReadExecutionWrite Instructionjk IssueoperandscompleteResult F634+R21234 F245+R35678 F0F2F F8F6F F10F0F68 21 F6F8F Functional unit statusdestS1S2FU for jFU for kFj?Fk? TimeNameBusy OpFiFjFkQjQkRjRk IntegerNo Mult1 Mult2No AddYesAddF6F8F2Yes DivideYesDivF10F0F6Yes Register result status Clock F0F2F4F6F8F10F12...F30 21FUAddDivide No L.D MUL.D SUB.D DIV.D ADD.D

27 Scoreboard Example: Cycle 22 Instruction status ReadExecutionWrite Instructionjk IssueoperandscompleteResult F634+R21234 F245+R35678 F0F2F F8F6F F10F0F68 21 F6F8F Functional unit status destS1S2FU for jFU for kFj?Fk? TimeNameBusy OpFiFjFkQjQkRjRk IntegerNo Mult1 Mult2No AddNo 40 DivideYesDivF10F0F6Yes Register result status Clock F0F2F4F6F8F10F12...F30 22FUDivide No L.D MUL.D SUB.D DIV.D ADD.D

28 Scoreboard Example: Cycle 61 Instruction status ReadExecutionWrite Instructionjk IssueoperandscompleteResult F634+R21234 F245+R35678 F0F2F F8F6F F10F0F F6F8F Functional unit status destS1S2FU for jFU for kFj?Fk? TimeNameBusy OpFiFjFkQjQkRjRk IntegerNo Mult1 Mult2No AddNo 0 DivideYesDivF10F0F6Yes Register result status Clock F0F2F4F6F8F10F12...F30 61FUDivide No L.D MUL.D SUB.D DIV.D ADD.D

29 Scoreboard Example: Cycle 62 Instruction status ReadExecutionWrite Instructionjk IssueoperandscompleteResult F634+R21234 F245+R35678 F0F2F F8F6F F10F0F F6F8F Functional unit status destS1S2FU for jFU for kFj?Fk? TimeNameBusy OpFiFjFkQjQkRjRk IntegerNo Mult1No Mult2No AddNo 0DivideNo Register result status Clock F0F2F4F6F8F10F12...F30 62FU Instruction Block done We have: In-oder issue, Out-of-order execute and commit L.D MUL.D SUB.D DIV.D ADD.D

30 Where have all the transistors gone? Superscalar (multiple instructions per clock cycle) Execution Icache D cache branch TLB Intel Pentium III (10M transistors) 2 Bus Intf Out-Of-Order SS Branch prediction (predict outcome of decisions) 3 levels of cache Out-of-order execution (executing instructions in different order than programmer wrote them)