Images from Patterson-Hennessy Book

Presentation transcript:

Machines that introduced pipelining and instruction-level parallelism. Clockwise from top: IBM Stretch, IBM 360/91, and CDC 6600. Images from Patterson-Hennessy Book.

COMP 740: Computer Architecture and Implementation Montek Singh Oct 17-19, 2016 Topic: Instruction-Level Parallelism II (Dynamic Scheduling: Scoreboarding)

Outline. A more complex pipeline, the MIPS R4000: look at the effects of memory with longer latency, and also long floating-point instructions. Dynamic scheduling: scoreboarding.

Short case study: MIPS R4000. More complex than the basic MIPS pipeline. From the early 90s, just before SGI bought MIPS. Superpipelined, approx. 2 instructions per cycle. Caches were pipelined, which is what most of the book’s discussion is about. R4000: 100 MHz, 1.3M transistors, 2 levels of cache. R4400: up to 250 MHz, larger caches. Key attributes to focus on: longer memory latency (instruction fetches and memory loads take longer) and longer floating-point operations.

Pipeline Diagram. Key difference: multiple cycles for memory access. The deeper pipeline leads to more hazards, more complex forwarding, and longer branch delays. Figure C.41: The eight-stage pipeline structure of the R4000 uses pipelined instruction and data caches. The pipe stages are labeled and their detailed function is described in the text. The vertical dashed lines represent the stage boundaries as well as the location of pipeline latches. The instruction is actually available at the end of IS, but the tag check is done in RF, while the registers are fetched. Thus, we show the instruction memory as operating through RF. The TC stage is needed for data memory access, since we cannot write the data into the register until we know whether the cache access was a hit or not.

Forwarding, 2 cycle delay

Loads: 2 cycle delay. After a load, 2 instructions are affected: the memory value cannot be forwarded for 2 cycles, so only the 3rd instruction can receive the forwarded value in time. Figure C.42: The structure of the R4000 integer pipeline leads to a 2-cycle load delay. A 2-cycle delay is possible because the data value is available at the end of DS and can be bypassed. If the tag check in TC indicates a miss, the pipeline is backed up a cycle, when the correct data are available.

Or a 2 cycle stall: ADD is stalled for 2 cycles waiting for R1, SUB uses the forwarded value, and OR uses values from the register file.
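
The 2-cycle load delay can be expressed as a small formula. The sketch below is only an illustration of the rule stated on the slides (the two instructions after a load are affected, the third is not); the function name `load_use_stall` is hypothetical, and it assumes no other stalls separate the load from its consumer.

```python
LOAD_DELAY = 2  # cycles, as given for the R4000 integer pipeline


def load_use_stall(distance: int) -> int:
    """Stall cycles for a consumer placed `distance` instructions after a load.

    distance = 1 means the instruction immediately following the load.
    """
    if distance < 1:
        raise ValueError("the consumer must come after the load")
    # The value is forwardable LOAD_DELAY cycles later than in the classic
    # pipeline, so each intervening instruction hides one cycle of the delay.
    return max(0, LOAD_DELAY + 1 - distance)


for d in (1, 2, 3, 4):
    print(f"consumer {d} instruction(s) after the load: {load_use_stall(d)} stall cycle(s)")
```

With a distance of 1 this gives the 2-cycle stall shown for ADD; from the third instruction on, no stall is needed.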

Branch hazards. Branch delay = 3 cycles: the branch is completed in the ALU cycle (i.e., PC updated), giving one branch delay slot plus 2 predict-not-taken cycles. Figure C.44: The basic branch delay is 3 cycles, since the condition evaluation is performed during EX.

Predicted not taken: if the branch is taken, stall for 2 cycles beyond the delay slot.

8 Stages in FP pipeline. Stages are used one or more times, depending on the instruction.

Some FP Instructions. Note the latencies and initiation intervals; individual stages may result in structural hazards.

Structural Hazard Example 1. Units needed at the same time are highlighted.

Structural Hazard Example 2. The shorter ADD instruction clears the pipeline quickly, so it doesn’t stall MUL.

Structural Hazard Example 3. Notice how these long instructions can have long-lasting effects.

Performance. CPI for the base case (1.0), and with stalls. The left 4 programs are integer. Cache effects are not included. Load stalls are now 2 cycles, and branch stalls are now more expensive. FP result stalls are RAW hazards. Structural stalls are not a big problem.

What Do We Have So Far? Multiple instructions in flight at one time. If there is a data hazard, no new instructions issue until the hazard is cleared (stall). Stalls could be minimized by reordering instructions. Static scheduling: a smart compiler could reorder instructions to minimize stalls, using a detailed description of the architecture. Or add hardware to do this at run time: dynamic scheduling … next topic.

Out of Order Execution. With dynamic scheduling, we can do out of order execution: execute instructions with no dependencies. This implies out of order completion. Today we discuss one method: scoreboarding. So far, instructions are issued in order; later we’ll look at out of order issue.

Decode Stage. Split the ID stage into 2 stages. 1st = issue stage: decode and check for structural hazards. 2nd = read operand stage: wait until operands are available, then read and proceed.

Scoreboarding. Use a new hardware unit called the scoreboard, a hardware data structure. It keeps track of dependencies and executes instructions out of order as their operands become available. First used on the CDC 6600, with 16 functional units.

MIPS with Scoreboard. Complex EX stage. Each functional unit has 2 inputs and 1 output.

What is a Scoreboard? A scoreboard is a table maintained by the hardware. It keeps track of instructions being fetched, issued, executed…; keeps track of the resources (functional units and operands) they use/need; keeps track of which instructions modify which registers; and uses this information to dynamically schedule instructions. Very similar to a pen-and-paper calculation: a simple step-by-step procedure, easily implemented in hardware.

Dynamic Scheduling with a Scoreboard. Original development in the CDC 6600; simplified example in HP5 for MIPS FP operations. Uses neither renaming nor forwarding: values always move from registers to functional units, and from functional units back to registers. However, write-back of results happens as soon as possible, not in a statically scheduled slot. Out-of-order completion can give rise to WAR and WAW hazards. Remember: the machine “knows” the original program order (needed for hazard detection). Machine model: 2 FP multipliers (10 cycles), 1 FP adder (2 cycles), 1 FP divider (40 cycles), all non-pipelined; 1 integer unit for everything else (incl. memory references).

New Worry: WAR Hazards. These didn’t exist before, because reads occurred early. Example: DIV.D F0, F2, F4; ADD.D F10, F0, F8; SUB.D F8, F8, F14. ADD could easily stall waiting for DIV’s F0. If SUB is allowed to execute and write its result first, ADD might use the wrong value for F8: SUB has a WAR hazard with ADD through register F8!
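
To make the three hazard types concrete, here is a minimal sketch that classifies the register hazards between an earlier and a later instruction. The `Instr` tuple and the `hazards` function are hypothetical names introduced only for this illustration; the three instructions are the ones from the slide.

```python
from typing import NamedTuple, Set


class Instr(NamedTuple):
    op: str
    dest: str
    src1: str
    src2: str


def hazards(earlier: Instr, later: Instr) -> Set[str]:
    """Return the register hazards `later` has with respect to `earlier`."""
    found = set()
    if earlier.dest in (later.src1, later.src2):
        found.add("RAW")  # later reads what earlier writes
    if later.dest in (earlier.src1, earlier.src2):
        found.add("WAR")  # later overwrites what earlier still has to read
    if later.dest == earlier.dest:
        found.add("WAW")  # both write the same register
    return found


div = Instr("DIV.D", "F0", "F2", "F4")
add = Instr("ADD.D", "F10", "F0", "F8")
sub = Instr("SUB.D", "F8", "F8", "F14")

print(hazards(div, add))  # RAW through F0
print(hazards(add, sub))  # WAR through F8
print(hazards(div, sub))  # no hazard
```

The WAR case only becomes a problem once SUB can finish before ADD has read F8, which is exactly what out-of-order execution allows.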

Scoreboard Implications. Out-of-order completion → WAW, WAR hazards? For WAW: stall in Issue until the previous write completes. For WAR: stall in Write Result until the previous read completes. Need to have multiple instructions in the execution phase → multiple execution units or pipelined execution units. The scoreboard keeps track of dependences and the state of operations. The scoreboard replaces ID, EX, WB with 4 stages.

New Stages. The fetch is the same; the others have changed. Let’s look at them one by one: Fetch, Issue, Read Operands, EX, WB.

Issue. If the required functional unit is available, and no other unit is pending a write to the same register, then the instruction is issued and moves to the “read operands” stage. The register restriction prevents WAW hazards.
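
The issue condition amounts to two table lookups. The following is a minimal sketch, assuming the scoreboard state is held in two plain dictionaries (`busy` and `result` are hypothetical names for the unit-busy flags and the register-result table); it is not the CDC 6600 logic itself.

```python
def can_issue(unit: str, dest: str, busy: dict, result: dict) -> bool:
    """True when the unit is free (no structural hazard) and no active
    instruction is pending a write to `dest` (no WAW hazard)."""
    return not busy[unit] and result.get(dest) is None


# Hypothetical snapshot: a load occupies the integer unit and will write F6.
busy = {"Integer": True, "Mult1": False, "Add": False}
result = {"F6": "Integer"}

print(can_issue("Integer", "F2", busy, result))  # structural hazard
print(can_issue("Mult1", "F0", busy, result))    # free unit, free register
print(can_issue("Add", "F6", busy, result))      # WAW hazard on F6
```

Because issue is in order, a failed check stalls every later instruction as well.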

Read Operands. By now, the functional unit is assigned. If the operands are available, the scoreboard allows the functional unit to read them from the register file. This design has no forwarding, so there is one extra cycle of latency.

EX. Has more functional units. Memory access is during the EX cycle. Notifies the scoreboard when done.

Write Result. Prevents WAR hazards. In this case: DIV.D F0, F2, F4; ADD.D F10, F0, F8; SUB.D F8, F8, F14. The scoreboard will stall the WB of the SUB.D until ADD.D reads F8.
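
The Write Result check can be sketched as a scan over the operand reads that earlier instructions still have outstanding. This is a simplified model, assuming each outstanding read is recorded as a small tuple; `PendingRead` and `can_write` are hypothetical names, and the real scoreboard encodes the same information in per-unit ready flags.

```python
from typing import List, NamedTuple


class PendingRead(NamedTuple):
    instr: str      # earlier instruction that needs the register
    reg: str        # source register it needs
    has_read: bool  # True once it has passed the Read Operands stage


def can_write(dest: str, earlier_reads: List[PendingRead]) -> bool:
    """A result may be written only if no earlier instruction still has
    to read the old value of `dest` (no WAR hazard)."""
    return all(r.reg != dest or r.has_read for r in earlier_reads)


# ADD.D F10, F0, F8 is still waiting for DIV.D to produce F0.
before = [PendingRead("ADD.D", "F0", False), PendingRead("ADD.D", "F8", False)]
# Later, ADD.D has read both of its operands.
after = [PendingRead("ADD.D", "F0", True), PendingRead("ADD.D", "F8", True)]

print(can_write("F8", before))  # SUB.D must stall its write to F8
print(can_write("F8", after))   # SUB.D may now write F8
```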

Components of Scoreboard. Hardware data structure; look at the pieces one by one. Instructions (in order) are listed on the top left.

Instruction Status. All but the last have issued (ADD is waiting in the Issue stage). The first LD is complete. MUL and SUB are waiting for register F2 (LD). DIV is waiting for F0 (result of MUL).

Status of Each Functional Unit. Fi is the destination; j, k are the sources. Q lists the producers of the inputs. The R column indicates that input registers are ready but not yet read (set to No after read).

Register Result. Shows which unit is producing which register. Needed by the Issue stage.

Later in Execution. LD and SUB (fast ops) have completed. ADD and MUL are in process. DIV is waiting for MUL to write F0.

Almost Done. DIV is about ready to write. Most everything is complete and the pipeline is almost flushed.

Cost of Extra Performance. Scoreboard hardware; extra functional units; extra buses, which may result in structural hazards, so hardware needs to assign the buses. Performance depends on: the amount of parallelism in the code sequence; the window size of the scoreboard; the size of the basic block (i.e., code without branches) … next.

Status – Our Pipeline Now. Can execute instructions out of order. Have not discussed out of order issue; could extend our scoreboarding to do this. Still, the opportunities within a basic block are limited: basic blocks tend to be short, so we would like to issue past branches.

Next. We’ll first look at techniques to increase issue potential: compiler techniques. Then look at branch prediction. Then look at Tomasulo’s algorithm for dynamic scheduling.

Self-Study Material. Summary of the scoreboarding algorithm; one long scoreboarding example; formal logic equations for the scoreboarding logic.

Four Stages of Scoreboard Control. Issue: decode the instruction and check for structural hazards (ID1). If a functional unit is free and there is no WAW hazard with another active instruction, the scoreboard issues the instruction to the functional unit and updates its internal data structure. If a structural or WAW hazard exists, instruction issue stalls and no further instructions can issue until these hazards are cleared; only buffering between fetch and issue lets fetch continue in the meantime. Read operands: wait until there are no data hazards, then read (ID2). A source operand is available if no earlier issued active instruction is going to write it. When all source operands are available, the scoreboard tells the functional unit to read the operands from registers and begin execution. The scoreboard thus resolves RAW hazards dynamically in this step, and instructions may be sent into execution out of order.

Four Stages of Scoreboard Control (cont.). Execution: operate on operands. The functional unit begins execution upon receiving its operands; when the result is ready, it notifies the scoreboard. Write Result: finish execution (WB). Once the scoreboard is aware that the functional unit has completed execution, it checks for WAR hazards. If there is no WAR hazard, it writes the result; if there is a WAR hazard, it stalls the completing instruction. Example: DIV.D F0,F2,F4; ADD.D F10,F0,F8; SUB.D F8,F8,F14. The CDC 6600 scoreboard would stall SUB.D until ADD.D reads its operands.

Three Parts of the Scoreboard. Instruction status: which of the 4 steps the instruction is in. Functional unit (FU) status: indicates the state of the FU, with nine fields for each functional unit. Busy: whether the unit is busy or not. Op: operation to perform in the unit (e.g., + or -). Fi: destination register. Fj, Fk: source registers. Qj, Qk: functional units producing source registers Fj, Fk. Rj, Rk: flags indicating when Fj, Fk are ready. Register result status: indicates which functional unit will write each register, if any; blank when no pending instruction will write that register.
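
The three parts map naturally onto a small data structure. The sketch below is one possible software rendering, assuming Python dataclasses and dictionaries; the class and method names (`FUStatus`, `Scoreboard`, `issue`) are hypothetical, while the nine field names follow the slide.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class FUStatus:
    """The nine per-unit fields: Busy, Op, Fi, Fj, Fk, Qj, Qk, Rj, Rk."""
    busy: bool = False
    op: Optional[str] = None
    fi: Optional[str] = None   # destination register
    fj: Optional[str] = None   # source registers
    fk: Optional[str] = None
    qj: Optional[str] = None   # units producing Fj, Fk (None = no producer)
    qk: Optional[str] = None
    rj: bool = False           # Fj / Fk ready and not yet read
    rk: bool = False


@dataclass
class Scoreboard:
    instruction_status: Dict[str, str] = field(default_factory=dict)
    fu_status: Dict[str, FUStatus] = field(default_factory=dict)
    register_result: Dict[str, str] = field(default_factory=dict)

    def issue(self, label: str, unit: str, op: str,
              dest: str, src1: str, src2: str) -> None:
        """Fill in all three parts when an instruction issues."""
        qj = self.register_result.get(src1)
        qk = self.register_result.get(src2)
        self.fu_status[unit] = FUStatus(True, op, dest, src1, src2,
                                        qj, qk, qj is None, qk is None)
        self.register_result[dest] = unit
        self.instruction_status[label] = "issue"


sb = Scoreboard()
sb.register_result["F2"] = "Integer"   # assume a load will write F2
sb.issue("MULT", "Mult1", "mul", "F0", "F2", "F4")
print(sb.fu_status["Mult1"])
print(sb.register_result)
```

After the issue, the Mult1 entry records that F2 is still being produced by the integer unit (Qj set, Rj false) while F4 is ready.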

Scoreboard Example Cycle 0

Scoreboard Example Cycle 1 First LD issues

Scoreboard Example Cycle 2 Structural hazard on Integer unit; second LD stalls in IF stage

Scoreboard Example Cycle 3 Second LD is still stalled

Scoreboard Example Cycle 4 Second LD still stalled; first LD done

Scoreboard Example Cycle 5 Second LD issues as the structural hazard on Integer unit has cleared

Scoreboard Example Cycle 6 MULT issues

Scoreboard Example Cycle 7 SUBD issues; MULT stalled on LD

Scoreboard Example Cycle 8a DIVD issues; SUBD stalled on LD

Scoreboard Example Cycle 8b LD writes F2; MULT and SUBD enabled

Scoreboard Example Cycle 9 MULT and SUBD read operands and enter execution

Scoreboard Example Cycle 10 Structural hazard on Add unit stalls the final ADDD

Scoreboard Example Cycle 11 SUBD and MULT are still in execution

Scoreboard Example Cycle 12 SUBD writes results; Add unit free; structural hazard resolves

Scoreboard Example Cycle 13 Note WAR hazard between DIVD and ADDD

Scoreboard Example Cycle 14 MULT still executing; DIVD stalled on F0 (RAW hazard)

Scoreboard Example Cycle 15 MULT still executing

Scoreboard Example Cycle 16 ADDD completes execution, ready to write result into F6

Scoreboard Example Cycle 17 WAR hazard : ADDD stalls in Write Result stage

Scoreboard Example Cycle 18 DIVD stalled (RAW hazard on F0), ADDD stalled (WAR hazard on F6)

Scoreboard Example Cycle 19 MULT completes execution

Scoreboard Example Cycle 20 MULT writes result; DIVD can proceed to read operands at next cycle

Scoreboard Example Cycle 21 DIVD reads operands; WAR hazard on F6 is resolved

Scoreboard Example Cycle 22 Divide! ADDD completes writing of result

Scoreboard Example Cycle 61 DIVD completes execution; ready to write result
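
The cycle-by-cycle example can be replayed with a small simulator. The sketch below is a simplified model, not the CDC 6600 hardware: every decision in a cycle is taken from the state at the start of that cycle, the latencies come from the machine model above (the 1-cycle integer unit is an assumption), and the register operands are assumed from the standard textbook version of this example, since the transcript names only F0, F2, and F6. All identifiers are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

UNITS = {"Integer": "Integer", "Mult1": "Mult", "Mult2": "Mult",
         "Add": "Add", "Divide": "Divide"}           # unit name -> kind
LATENCY = {"Integer": 1, "Mult": 10, "Add": 2, "Divide": 40}


@dataclass
class Instr:
    name: str
    kind: str
    dest: str
    src1: Optional[str]
    src2: Optional[str]
    issue: int = 0   # 0 means "has not happened yet"
    read: int = 0
    done: int = 0
    write: int = 0


def simulate(program: List[Instr]) -> int:
    fu: Dict[str, Optional[dict]] = {u: None for u in UNITS}  # None = free
    result: Dict[str, str] = {}       # register -> unit that will write it
    nxt = cycle = 0
    while nxt < len(program) or any(fu.values()):
        cycle += 1
        # Decide all actions from the state at the start of the cycle.
        writers = []
        for u, e in fu.items():
            if e and e["ins"].done and e["ins"].done < cycle:
                d = e["ins"].dest
                war = any(o is not None and o is not e and
                          ((o["ins"].src1 == d and o["Rj"]) or
                           (o["ins"].src2 == d and o["Rk"]))
                          for o in fu.values())
                if not war:
                    writers.append(u)
        readers = [u for u, e in fu.items()
                   if e and not e["ins"].read and e["Rj"] and e["Rk"]]
        issue_unit = None
        if nxt < len(program):
            ins = program[nxt]
            free = [u for u, k in UNITS.items()
                    if k == ins.kind and fu[u] is None]
            if free and ins.dest not in result:    # no structural/WAW hazard
                issue_unit = free[0]
        # Apply: issue first, so a same-cycle write still wakes the new entry.
        if issue_unit is not None:
            ins = program[nxt]
            nxt += 1
            qj, qk = result.get(ins.src1), result.get(ins.src2)
            fu[issue_unit] = {"ins": ins, "Qj": qj, "Qk": qk,
                              "Rj": qj is None, "Rk": qk is None}
            result[ins.dest] = issue_unit
            ins.issue = cycle
        for u in readers:
            e = fu[u]
            e["Rj"] = e["Rk"] = False
            e["Qj"] = e["Qk"] = None
            e["ins"].read = cycle
            e["ins"].done = cycle + LATENCY[UNITS[u]]
        for u in writers:
            e = fu[u]
            for o in fu.values():
                if o is not None:
                    if o["Qj"] == u:
                        o["Rj"] = True
                    if o["Qk"] == u:
                        o["Rk"] = True
            del result[e["ins"].dest]
            e["ins"].write = cycle
            fu[u] = None
    return cycle


program = [
    Instr("LD", "Integer", "F6", "R2", None),
    Instr("LD", "Integer", "F2", "R3", None),
    Instr("MULT", "Mult", "F0", "F2", "F4"),
    Instr("SUBD", "Add", "F8", "F6", "F2"),
    Instr("DIVD", "Divide", "F10", "F0", "F6"),
    Instr("ADDD", "Add", "F6", "F8", "F2"),
]
last_cycle = simulate(program)
for i in program:
    print(f"{i.name:5} issue={i.issue:2} read={i.read:2} "
          f"done={i.done:2} write={i.write:2}")
```

Under these assumptions the printed table lines up with the slides: the second LD issues in cycle 5, ADDD stalls in Write Result until cycle 22, and DIVD completes execution in cycle 61.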

Scoreboard Summary. CDC designers measured a performance improvement of 1.7 for compiled FORTRAN code and 2.5 for assembly. No pipeline scheduling in software. Slow memory (no cache). Limitations of the 6600 scoreboard: no forwarding; limited to instructions in a basic block (small issue window); number of functional units (structural hazards); must wait for WAR hazards; must prevent WAW hazards.

Scoreboard: Bookkeeping Actions (for each instruction status: wait until, then bookkeeping)
Issue. Wait until: not Busy[FU] and not Result[D]. Bookkeeping: Busy[FU] ← yes; Op[FU] ← op; Fi[FU] ← D; Fj[FU] ← S1; Fk[FU] ← S2; Qj[FU] ← Result[S1]; Qk[FU] ← Result[S2]; Rj ← not Qj; Rk ← not Qk; Result[D] ← FU.
Read Operands. Wait until: Rj and Rk. Bookkeeping: Rj ← No; Rk ← No; Qj ← 0; Qk ← 0.
Execution Complete. Wait until: functional unit done.
Write Result. Wait until: ∀f ((Fj[f] ≠ Fi[FU] or Rj[f] = No) & (Fk[f] ≠ Fi[FU] or Rk[f] = No)). Bookkeeping: ∀f (if Qj[f] = FU then Rj[f] ← yes); ∀f (if Qk[f] = FU then Rk[f] ← yes); Result[Fi[FU]] ← 0; Busy[FU] ← No.