Lecture 6 Score Board And Tomasulo’s Algorithm

Presentation transcript:

Lecture 6 Score Board And Tomasulo’s Algorithm Nov. 2, 2004 Lec. 7

Three Parts of the Scoreboard
1. Instruction status: which of 4 steps the instruction is in (Issue, Read Operands, Execute, Write Result).
2. Functional unit status: the state of each functional unit (FU). 9 fields for each functional unit:
   Busy: whether the unit is busy or not
   Op: operation to perform in the unit (e.g., + or –)
   Fi: destination register
   Fj, Fk: source-register numbers
   Qj, Qk: functional units producing source registers Fj, Fk
   Rj, Rk: flags indicating that Fj, Fk are ready and not yet read. Set to No after the operands are read.
3. Register result status: which functional unit will write each register, if one exists. Blank when no pending instruction will write that register.
What you might have thought
1. Instruction status: the 4 stages of instruction execution.
2. Status of FU: the normal things to keep track of (RAW hazards, and Busy for structural hazards). Fi comes from the instruction format of the machine (Fi is the destination). Op is needed because the Add unit can add or subtract. Rj, Rk give the status of the source registers (Yes means ready). A No in Rj or Rk before the operands are read means the operand is waiting for an FU to write its result; Qj, Qk say which FU it is waiting for.
3. Status of register result (WAW and WAR hazards): which FU is going to write into each register.
Scoreboard on 6600 = size of FU
6.7, 6.8, 6.9, 6.12, 6.13, 6.16, 6.17
FU latencies: Add 2, Mult 10, Div 40 clocks
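
As a rough illustration, the three tables can be modeled with a couple of small records. This is only a sketch, not the CDC 6600 hardware: the class names (FUStatus, Scoreboard) and the functional unit names are assumptions made for the example, and None stands in for a blank entry.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class FUStatus:
    """Functional unit status: the 9 fields kept per unit."""
    busy: bool = False
    op: Optional[str] = None
    fi: Optional[str] = None   # destination register
    fj: Optional[str] = None   # source registers
    fk: Optional[str] = None
    qj: Optional[str] = None   # FUs producing Fj, Fk (None = no pending producer)
    qk: Optional[str] = None
    rj: bool = False           # True = Fj ready and not yet read
    rk: bool = False           # True = Fk ready and not yet read

@dataclass
class Scoreboard:
    instr_status: Dict[int, str] = field(default_factory=dict)  # instruction index -> step
    fu_status: Dict[str, FUStatus] = field(default_factory=dict)
    reg_result: Dict[str, str] = field(default_factory=dict)    # register -> FU that will write it

# Hypothetical unit names; the real machine had a different unit mix.
sb = Scoreboard(fu_status={n: FUStatus() for n in ("Integer", "Add", "Mult1", "Mult2", "Divide")})

# Record the issue of DIV.D F0, F2, F4 to the Divide unit.
sb.instr_status[0] = "Issue"
sb.fu_status["Divide"] = FUStatus(busy=True, op="DIV.D", fi="F0",
                                  fj="F2", fk="F4", rj=True, rk=True)
sb.reg_result["F0"] = "Divide"
print(sb.reg_result)
```

Keeping the register result table as a separate map makes the WAW check at issue a single lookup on the destination register.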

Detailed Scoreboard Pipeline Control
Issue
   Wait until: Not Busy(FU) and not Result(D)   (the Result(D) test is the WAW check)
   Bookkeeping: Busy(FU) ← yes; Op(FU) ← op; Fi(FU) ← D; Fj(FU) ← S1; Fk(FU) ← S2; Qj ← Result(S1); Qk ← Result(S2); Rj ← not Qj; Rk ← not Qk; Result(D) ← FU
Read operands
   Wait until: Rj and Rk
   Bookkeeping: Rj ← No; Rk ← No
Execution complete
   Wait until: functional unit done
Write result
   Wait until: ∀f((Fj(f) != Fi(FU) or Rj(f) = No) & (Fk(f) != Fi(FU) or Rk(f) = No))   (the WAR check)
   Bookkeeping: ∀f(if Qj(f) = FU then Rj(f) ← Yes); ∀f(if Qk(f) = FU then Rk(f) ← Yes); Result(Fi(FU)) ← 0; Busy(FU) ← No
Rj, Rk == No means either
   1. the source operands are not ready, or
   2. the source operands are ready and have been read.
A.55 on page A-76
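
The wait conditions and bookkeeping above can be sketched as plain functions over dictionaries. This is a minimal sketch under assumptions: the function names, the dictionary layout and the three unit names are invented here, None stands for an empty Q field, and the MUL.D that writes F8 is a hypothetical instruction chosen to provoke a WAR stall.

```python
def new_fu():
    return {"busy": False, "op": None, "fi": None, "fj": None, "fk": None,
            "qj": None, "qk": None, "rj": False, "rk": False}

def can_issue(fus, result, fu, dest):
    # Structural hazard check (unit busy) and WAW check (dest has a pending writer).
    return not fus[fu]["busy"] and dest not in result

def issue(fus, result, fu, op, dest, s1, s2):
    u = fus[fu]
    u.update(busy=True, op=op, fi=dest, fj=s1, fk=s2,
             qj=result.get(s1), qk=result.get(s2))
    u["rj"], u["rk"] = u["qj"] is None, u["qk"] is None
    result[dest] = fu

def can_read_operands(fus, fu):
    return fus[fu]["rj"] and fus[fu]["rk"]

def read_operands(fus, fu):
    fus[fu]["rj"] = fus[fu]["rk"] = False

def can_write_result(fus, fu):
    # WAR check: no unit may still need to read the old value of our destination.
    fi = fus[fu]["fi"]
    return all((f["fj"] != fi or not f["rj"]) and (f["fk"] != fi or not f["rk"])
               for f in fus.values())

def write_result(fus, result, fu):
    for f in fus.values():          # wake up units waiting on this result
        if f["qj"] == fu:
            f["qj"], f["rj"] = None, True
        if f["qk"] == fu:
            f["qk"], f["rk"] = None, True
    del result[fus[fu]["fi"]]
    fus[fu] = new_fu()              # unit is free again

fus = {name: new_fu() for name in ("Add", "Mult", "Divide")}
result = {}
issue(fus, result, "Divide", "DIV.D", "F0", "F2", "F4")
issue(fus, result, "Add", "ADD.D", "F6", "F0", "F8")      # waits for F0 (RAW)
waw_blocked = not can_issue(fus, result, "Mult", "F6")     # F6 already has a pending writer
issue(fus, result, "Mult", "MUL.D", "F8", "F10", "F14")   # hypothetical writer of F8
read_operands(fus, "Mult")
war_blocked = not can_write_result(fus, "Mult")            # ADD.D has not read F8 yet
print(waw_blocked, war_blocked)
```

Once the divide writes F0, the add can read its operands, and only then may the multiply overwrite F8.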

Scoreboard Example
The following numbers illustrate behavior and are not representative:
LD: 1 cycle (compute address + data cache access)
ADDD and SUBD: 2 cycles
Multiply: 10 cycles
Divide: 40 cycles

Scoreboard Example

Scoreboard Example Cycle 1

Scoreboard Example Cycle 2
Note: Can’t issue I2 because the Integer unit is busy. Can’t issue the next instruction either, due to in-order issue.

Scoreboard Example Cycle 3

Scoreboard Example Cycle 4

Scoreboard Example Cycle 5
Now I2 is issued.

Scoreboard Example Cycle 6

Scoreboard Example Cycle 7
I3 is stalled at read because I2 isn’t complete.

Scoreboard Example Cycle 8

Scoreboard Example Cycle 9
Note: I3 and I4 read operands because F2 is now available. ADDD (I6) can’t be issued because SUBD (I4) uses the adder.

Scoreboard Example Cycle 11
Note: Add takes 2 cycles, so nothing happens in cycle 10. MUL continues.

Scoreboard Example Cycle 12

Scoreboard Example Cycle 13
Now ADDD is issued because SUBD has completed.

Scoreboard Example Cycle 14

Scoreboard Example Cycle 15
Note: ADDD takes 2 cycles, so no change.

Scoreboard Example Cycle 16
ADDD completes, but MULTD and DIVD go on.

Scoreboard Example Cycle 17
ADDD stalls; it can’t write back due to the WAR hazard with DIVD. MULTD and DIVD continue.

Scoreboard Example Cycle 18
MULTD and DIVD continue.

Scoreboard Example Cycle 19
MULTD completes execution after 10 cycles.

Scoreboard Example Cycle 20
MULTD writes its result to F0.

Scoreboard Example Cycle 21
Now DIVD reads its operands because F0 is available.

Scoreboard Example Cycle 22
ADDD writes its result because the WAR hazard is removed.

Scoreboard Example Cycle 61
DIVD completes execution.

Scoreboard Example Cycle 62
Execution is finished.

Review: Scoreboard
Limitations of 6600 scoreboard
   No forwarding
   Limited to instructions in basic block (small window)
   Number of functional units (structural hazards)
   Stall on WAR hazards
   Stall on WAW hazards
Name dependences in an example sequence
   DIV.D F0, F2, F4
   ADD.D F6, F0, F8
   S.D   F6, 0(R1)
   SUB.D F8, F10, F14
   MUL.D F6, F10, F8
   WAR (antidependence): SUB.D writes F8, which ADD.D reads
   WAW (output dependence): MUL.D writes F6, which ADD.D also writes
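
To make the hazard labels concrete, the following sketch classifies every pair of instructions in the sequence above. It is a simplification under assumptions: the tuple layout and the function name classify are invented for the example, and the pairwise test ignores intervening writes, which does not matter for this short sequence.

```python
def classify(instrs):
    """instrs: list of (text, dest, sources). Returns (earlier, later, kind, register) tuples."""
    found = []
    for i, (_, d1, s1) in enumerate(instrs):
        for j in range(i + 1, len(instrs)):
            _, d2, s2 = instrs[j]
            if d1 is not None and d1 in s2:
                found.append((i, j, "RAW", d1))   # later reads what earlier writes
            if d2 is not None and d2 in s1:
                found.append((i, j, "WAR", d2))   # later writes what earlier reads
            if d1 is not None and d1 == d2:
                found.append((i, j, "WAW", d1))   # both write the same register
    return found

code = [
    ("DIV.D F0, F2, F4",   "F0", ("F2", "F4")),
    ("ADD.D F6, F0, F8",   "F6", ("F0", "F8")),
    ("S.D F6, 0(R1)",      None, ("F6", "R1")),
    ("SUB.D F8, F10, F14", "F8", ("F10", "F14")),
    ("MUL.D F6, F10, F8",  "F6", ("F10", "F8")),
]
found = classify(code)
for i, j, kind, reg in found:
    print(f"{kind} on {reg}: {code[i][0]}  ->  {code[j][0]}")
```

Besides the two name dependences marked on the slide, the scan also reports the WAR on F6 between S.D and MUL.D and the true (RAW) dependences.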

Another Dynamic Algorithm: Tomasulo Algorithm
For IBM 360/91, about 3 years after CDC 6600
Goal: high performance without special compilers
Differences between Tomasulo Algorithm & Scoreboard
   Control & buffers distributed with functional units vs. centralized in scoreboard; the distributed buffers are called “reservation stations”
   Registers in instructions replaced by pointers to reservation station buffers
   HW renaming of registers to avoid WAW hazards
   Buffer operand values to avoid WAR hazards
   Common Data Bus broadcasts results to all FUs
   Loads and stores treated as FUs as well
Why study? Led to Alpha 21264, HP 8000, MIPS 10000, Pentium II, PowerPC 604, …

FP unit and load-store unit using Tomasulo’s algorithm.

Another Dynamic Algorithm: Tomasulo Algorithm
Register renaming (S and T are the renamed registers)
   DIV.D F0, F2, F4
   ADD.D S, F0, F8
   S.D   S, 0(R1)
   SUB.D T, F10, F14
   MUL.D F6, F10, T
Implemented through reservation stations (rs) per functional unit
   Buffers an operand as soon as it is available: avoids WAR hazards.
   Pending instructions designate the rs that will provide their inputs: avoids WAW hazards.
   The last write in a sequence of writes to the same register is the one that actually updates the register.
Decentralized hazard detection and execution control
Instruction results are passed directly to the FUs from the rs, rather than through registers, over the common data bus (CDB)
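
The effect of renaming can be sketched in a few lines. Note the assumptions: this version gives every destination a fresh name (T0, T1, …) instead of renaming only the conflicting registers as the slide does with S and T, and the function name rename and the tuple layout are invented for the example. In Tomasulo's scheme the fresh names are reservation station tags rather than extra registers.

```python
from itertools import count

def rename(instrs):
    """Give every destination a fresh name; sources read the newest name of their register."""
    fresh = (f"T{n}" for n in count())
    current = {}                       # architectural register -> latest name
    out = []
    for op, dest, srcs in instrs:
        srcs = tuple(current.get(s, s) for s in srcs)   # rename sources first
        if dest is not None:
            current[dest] = next(fresh)
            dest = current[dest]
        out.append((op, dest, srcs))
    return out, current

code = [
    ("DIV.D", "F0", ("F2", "F4")),
    ("ADD.D", "F6", ("F0", "F8")),
    ("S.D",   None, ("F6", "R1")),
    ("SUB.D", "F8", ("F10", "F14")),
    ("MUL.D", "F6", ("F10", "F8")),
]
renamed, final = rename(code)
for op, dest, srcs in renamed:
    print(op, dest, srcs)
print(final)
```

After renaming, no two instructions write the same name and no instruction writes a name that an earlier one reads, so only true (RAW) dependences remain. The final map shows that F6 ends up bound to the result of MUL.D, the last writer.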

Three Stages of Tomasulo Algorithm
1. Issue: get instruction from FP Op Queue
   Stall if there is a structural hazard, i.e. no free reservation station (rs).
   If an rs is free, the issue logic issues the instruction to the rs and reads operands into the rs if they are ready (register renaming => solves WAR).
   Mark the destination register as waiting for this latest instruction, even if a previous instruction writing to this register hasn’t completed => solves WAW hazards.
2. Execution: operate on operands (EX)
   When both operands are ready, execute; if not ready, watch the CDB for the result => solves RAW.
3. Write result: finish execution (WB)
   Write on the Common Data Bus to all awaiting units; mark the reservation station available.
   Write the result into the destination register only if the register’s status still names this reservation station => solves WAW.
Normal data bus: data + destination (“go to” bus)
CDB: data + source (“come from” bus)
   64 bits of data + 4 bits of functional unit source address
   A unit writes if the source matches the functional unit it expects (the one producing its result)
   Does broadcast
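
The three stages can be tried out with a toy model. This is a sketch under assumptions, not the 360/91 design: the class names, the station names (Add1, Add2, Mult1, Mult2), the register values and the untimed execute-and-write step are all invented for the example, and the store is left out.

```python
from dataclasses import dataclass
from typing import Optional

OPS = {"ADD.D": lambda a, b: a + b, "SUB.D": lambda a, b: a - b,
       "MUL.D": lambda a, b: a * b, "DIV.D": lambda a, b: a / b}

@dataclass
class RS:
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[float] = None   # operand values, once available
    vk: Optional[float] = None
    qj: Optional[str] = None     # tags of the stations that will produce them
    qk: Optional[str] = None

class Tomasulo:
    def __init__(self, regs, stations):
        self.regs = dict(regs)
        self.qi = {}                                 # register -> station that will write it
        self.rs = {name: RS() for name in stations}

    def issue(self, name, op, dest, s1, s2):
        r = self.rs[name]
        if r.busy:
            return False                             # structural hazard: stall
        r.busy, r.op = True, op
        r.qj, r.qk = self.qi.get(s1), self.qi.get(s2)
        r.vj = self.regs[s1] if r.qj is None else None   # buffer ready operands now
        r.vk = self.regs[s2] if r.qk is None else None
        self.qi[dest] = name                         # latest writer wins (renaming)
        return True

    def execute_and_write(self, name):
        r = self.rs[name]
        if not (r.busy and r.qj is None and r.qk is None):
            raise RuntimeError(f"{name} is still waiting for operands")
        value = OPS[r.op](r.vj, r.vk)
        for other in self.rs.values():               # CDB broadcast: (source tag, value)
            if other.qj == name:
                other.vj, other.qj = value, None
            if other.qk == name:
                other.vk, other.qk = value, None
        for reg in [x for x, tag in self.qi.items() if tag == name]:
            self.regs[reg] = value                   # only if still the latest writer
            del self.qi[reg]
        self.rs[name] = RS()                         # station becomes available
        return value

m = Tomasulo({"F0": 0.0, "F2": 8.0, "F4": 2.0, "F6": 0.0, "F8": 1.0,
              "F10": 5.0, "F14": 3.0}, ["Add1", "Add2", "Mult1", "Mult2"])
m.issue("Mult1", "DIV.D", "F0", "F2", "F4")
m.issue("Add1", "ADD.D", "F6", "F0", "F8")     # captures the old F8 = 1.0
m.issue("Add2", "SUB.D", "F8", "F10", "F14")
m.issue("Mult2", "MUL.D", "F6", "F10", "F8")   # becomes the latest writer of F6
m.execute_and_write("Add2")                    # SUB.D finishes first, out of order
m.execute_and_write("Mult2")
m.execute_and_write("Mult1")
add_value = m.execute_and_write("Add1")        # uses old F8; does not touch F6
print(m.regs["F6"], m.regs["F8"], add_value)
```

SUB.D and MUL.D complete before the older ADD.D, yet ADD.D still computes with the old F8 (no WAR problem) and its late result does not overwrite F6 (no WAW problem).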

Reservation Station Components
Op: operation to perform in the unit (e.g., + or –)
Vj, Vk: values of the source operands.
Qj, Qk: names of the RSs that will provide the source operands. A value of zero means the source operand is already available in Vj or Vk, or is not needed.
Busy: indicates that the reservation station or FU is busy
Register File Status Qi
Qi: indicates which functional unit will write each register, if one exists. Blank (0) when no pending instruction will write that register, meaning that the value is already available.

Tomasulo Example Cycle 0

Tomasulo Example Cycle 1

Tomasulo Example Cycle 2

Tomasulo Example Cycle 3

Tomasulo Example Cycle 4

Tomasulo Example Cycle 5

Tomasulo Example Cycle 6

Tomasulo Example Cycle 7

Tomasulo Example Cycle 8

Tomasulo Example Cycle 9

Tomasulo Example Cycle 10

Tomasulo Example Cycle 11

Tomasulo Example Cycle 12

Tomasulo Example Cycle 15

Tomasulo Example Cycle 16

Tomasulo Example Cycle 56

Tomasulo Example Cycle 57

Branch Prediction (3.4, 3.5)

Branch Prediction
Easiest (static prediction)
   Always taken, always not taken
   Opcode based
   Displacement based (forward not taken, backward taken)
   Compiler directed (branch likely, branch not likely)
Next easiest
   1-bit predictor: remember last taken/not taken per branch
   Use a branch-prediction buffer or branch-history table
   Use part of the PC (low-order bits) to index the buffer/table
   Multiple branches may share the same bit
   Invert the bit if the prediction is wrong
   Backward branches for loops will be mispredicted twice

Q: Assume a loop branch is taken nine times in a row, then not taken once. What is the prediction accuracy using a 1-bit predictor?
A: On re-entering the loop, the predictor says not taken, because the last time execution left the loop it set a “0” in the predictor. So the first iteration is a misprediction. The bit is now set to “1”. The predictor works fine until the last iteration, which is predicted as taken. So, 2 mispredictions in 10 loop iterations => 80% accuracy.
How about a 2-bit predictor? Let the prediction be changed only after it misses twice in a row.
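
The 80% figure can be checked with a short simulation. This is a sketch; the function name one_bit_accuracy is invented here, and the predictor bit starts at "not taken" to model the state left behind by a previous exit from the loop.

```python
def one_bit_accuracy(outcomes, initial=False):
    """Predict that each branch repeats its previous outcome."""
    bit, correct = initial, 0
    for taken in outcomes:
        correct += (bit == taken)
        bit = taken                      # remember the last outcome
    return correct / len(outcomes)

loop = [True] * 9 + [False]              # taken nine times, then not taken once
acc1 = one_bit_accuracy(loop * 10)       # ten trips through the loop
print(acc1)                              # 0.8
```

Each trip through the loop costs two mispredictions: one on the first iteration and one on the exit.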

2-bit Branch Prediction
Has 4 states instead of 2, allowing for more information about tendencies
A prediction must miss twice before it is changed
Good for backward branches of loops
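
One common way to realize such a 4-state predictor is a 2-bit saturating counter, used in the sketch below. That choice is an assumption, since the slide's state diagram is not part of this transcript; the function name and the starting state (strongly taken) are also invented for the example.

```python
def two_bit_accuracy(outcomes, state=3):
    """2-bit saturating counter: states 0-1 predict not taken, 2-3 predict taken."""
    correct = 0
    for taken in outcomes:
        correct += ((state >= 2) == taken)
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return correct / len(outcomes)

loop = [True] * 9 + [False]              # taken nine times, then not taken once
acc2 = two_bit_accuracy(loop * 10)       # ten trips through the loop
print(acc2)                              # 0.9
```

The single not-taken outcome at loop exit only moves the counter from 3 to 2, so the first iteration of the next trip is still predicted taken and only the exit is mispredicted.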

Branch History Table (BHT)
Has limited size: 2 bits by N entries (e.g. N = 4K)
4K performs about the same as infinite, see Fig. 3.9
Uses low-order bits of branch PC to choose entry
[Figure: branch PC indexing into the BHT]
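
A table of 2-bit counters indexed by low-order PC bits might look like the sketch below. Assumptions: the class and method names are invented, the two byte-offset bits of the PC are dropped because instructions are taken to be 4 bytes wide, counters are saturating, and a tiny 4-entry table is used so that aliasing is easy to show.

```python
class BranchHistoryTable:
    def __init__(self, entries=4096):
        if entries & (entries - 1):
            raise ValueError("entries must be a power of two")
        self.mask = entries - 1
        self.counters = [1] * entries        # start weakly not taken

    def index(self, pc):
        return (pc >> 2) & self.mask         # low-order bits of the word address

    def predict(self, pc):
        return self.counters[self.index(pc)] >= 2

    def update(self, pc, taken):
        i = self.index(pc)
        c = self.counters[i]
        self.counters[i] = min(c + 1, 3) if taken else max(c - 1, 0)

bht = BranchHistoryTable(entries=4)
bht.update(0x1000, True)
bht.update(0x1000, True)
# 0x1010 was never seen, but it maps to the same entry as 0x1000.
print(bht.index(0x1000), bht.index(0x1010), bht.predict(0x1010))
```

Because there are no tags, the second branch silently inherits the first branch's history; a larger table makes such sharing rarer.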

Can we do better?
Correlating branch predictors also look at other branches for clues
   if (aa==2)        // T
      aa = 0
   if (bb==2)        // T
      bb = 0
   if (aa!=bb) { …   // NT
(1,1) predictor: uses history of 1 branch and uses a 1-bit predictor. Each branch keeps two prediction bits: one used if the last branch was NT, one used if the last branch was T.

Correlating Branch Predictor
If we use 2 branches as history, then there are 4 possibilities (T-T, T-NT, NT-T, NT-NT). For each possibility, we need a predictor (1-bit or 2-bit). And this repeats for every branch.
(2,2) branch prediction

Performance of Correlating Branch Prediction
With the same number of state bits, (2,2) performs better than a noncorrelating 2-bit predictor.
It even outperforms a 2-bit predictor with an infinite number of entries.

General (m,n) Branch Predictors
The global history register (GHR) is an m-bit shift register that records the outcomes of the last m branches encountered by the processor
Usually use both the PC address and the GHR (2-level)
[Figure: the PC and the m-bit GHR feed a combining function that selects one of the n-bit predictors]
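
A generic (m,n) predictor can be sketched as follows. Assumptions: the combining function is simple concatenation of low-order PC bits with the GHR (other choices exist), the n-bit predictors are saturating counters, and the class name, the index_bits parameter and the alternating test branch are invented for the example.

```python
class CorrelatingPredictor:
    def __init__(self, m, n, index_bits=4):
        self.m, self.index_bits = m, index_bits
        self.ghr = 0                                  # last m outcomes, newest in bit 0
        self.max_count = (1 << n) - 1
        self.table = [0] * (1 << (index_bits + m))    # n-bit counters, start at 0

    def _index(self, pc):
        pc_bits = (pc >> 2) & ((1 << self.index_bits) - 1)
        return (pc_bits << self.m) | self.ghr         # concatenate PC bits and history

    def predict(self, pc):
        return self.table[self._index(pc)] > self.max_count // 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.table[i] = min(self.table[i] + 1, self.max_count)
        else:
            self.table[i] = max(self.table[i] - 1, 0)
        self.ghr = ((self.ghr << 1) | int(taken)) & ((1 << self.m) - 1)

def count_correct(predictor, outcomes, pc=0x40):
    correct = 0
    for taken in outcomes:
        correct += (predictor.predict(pc) == taken)
        predictor.update(pc, taken)
    return correct

alternating = [True, False] * 50                      # T, NT, T, NT, ...
plain = count_correct(CorrelatingPredictor(0, 2), alternating)       # (0,2): no history
correlated = count_correct(CorrelatingPredictor(2, 2), alternating)  # (2,2)
print(plain, correlated)
```

On this strictly alternating branch the plain 2-bit counter gets every taken outcome wrong, while the (2,2) predictor learns the pattern after a short warm-up because the history selects a different counter for each phase.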

Is Branch Predictor Enough?
When is using branch prediction beneficial?
   When the outcome is known later than the target
   For example, in our standard MIPS pipeline, we compute the target in the ID stage, but testing the branch condition incurs a structural hazard on the register file.
If we predict the branch is taken and suppose it is correct, what is the target address?
   Need a mechanism to provide the target address as well
Can we eliminate the one cycle delay for the 5-stage pipeline?
   Need to fetch from the branch target immediately after the branch

Branch Target Buffer (BTB)
Is the current instruction a branch?
• The BTB provides the answer before the current instruction is decoded and therefore enables fetching to begin after the IF stage.
What is the branch target?
• The BTB provides the branch target if the prediction is a taken direct branch (for not-taken branches the target is simply PC+4).
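
The lookup can be sketched with a dictionary keyed by the fetch PC. Assumptions: a real BTB is a small, fixed-size tagged table rather than an unbounded dictionary, only branches last seen taken are kept in it, and the class and method names are invented for the example.

```python
class BranchTargetBuffer:
    def __init__(self):
        self.entries = {}                    # branch PC -> predicted target

    def next_fetch_pc(self, pc):
        # Consulted during IF, before the instruction is decoded:
        # a hit means "predicted-taken branch", a miss means fall through.
        return self.entries.get(pc, pc + 4)

    def update(self, pc, taken, target):
        if taken:
            self.entries[pc] = target        # remember where the branch went
        else:
            self.entries.pop(pc, None)       # stop predicting taken

btb = BranchTargetBuffer()
first = btb.next_fetch_pc(0x100)             # unknown PC: fetch PC+4
btb.update(0x100, taken=True, target=0x80)   # branch resolved as taken to 0x80
second = btb.next_fetch_pc(0x100)            # now redirected in the fetch stage
print(hex(first), hex(second))
```

Because the lookup needs only the PC, the redirect happens without waiting for decode, which is what removes the one-cycle delay on correctly predicted taken branches.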