CSC 4250 Computer Architectures September 29, 2006 Appendix A. Pipelining.


Static Pipeline Scheduling
A simple pipeline fetches an instruction, decodes it, and checks for hazards (structural and data). If there is no hazard, the instruction issues. If there is a hazard, the pipeline stalls: no new instructions are fetched or issued. The compiler may schedule instructions to avoid the hazard; this is static scheduling.

Dynamic Pipeline Scheduling
The hardware rearranges instruction execution to reduce stalls. Two techniques: the scoreboarding technique of the CDC 6600 and Tomasulo’s algorithm (Chapter 3). In the simple pipeline, instructions issue in order: if an instruction is stalled in the pipeline, no later instruction can proceed. What if later instructions are independent? Example:
    DIV.D F0,F2,F4
    ADD.D F10,F0,F8
    MUL.D F6,F6,F14
We want to issue and execute the MUL.D instruction while the ADD.D instruction waits for the result of DIV.D.
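To make the benefit concrete, the sketch below compares when each of the three instructions could start executing with and without in-order execution. It is a rough illustration only: the latencies in `LATENCY` are invented for the example, `start_cycles` is a hypothetical helper, and structural hazards and WAR/WAW checks are ignored.

```python
# Assumed execution latencies in cycles (illustrative values, not from the slides).
LATENCY = {"DIV.D": 40, "ADD.D": 2, "MUL.D": 10}

def start_cycles(program, in_order):
    """Return the cycle in which each instruction starts executing.

    program is a list of (op, dest, src1, src2) tuples. One instruction is
    issued per cycle. With in_order=True an instruction cannot start before
    the previous one has started; otherwise it only waits for its operands.
    """
    ready_at = {}   # register -> cycle at which its pending value is available
    starts = []
    for i, (op, dest, src1, src2) in enumerate(program):
        operands = max(ready_at.get(src1, 0), ready_at.get(src2, 0))
        earliest = i
        if in_order and starts:
            earliest = max(earliest, starts[-1] + 1)
        start = max(earliest, operands)
        starts.append(start)
        ready_at[dest] = start + LATENCY[op]
    return starts

program = [
    ("DIV.D", "F0", "F2", "F4"),
    ("ADD.D", "F10", "F0", "F8"),
    ("MUL.D", "F6", "F6", "F14"),
]

print(start_cycles(program, in_order=True))   # [0, 40, 41]
print(start_cycles(program, in_order=False))  # [0, 40, 2]
```

Under these assumptions ADD.D waits for DIV.D either way, but MUL.D starts at cycle 2 instead of cycle 41 once it is allowed to go ahead of the stalled ADD.D.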

Scoreboarding
In a dynamically scheduled pipeline, all instructions pass through the issue stage in order (in-order issue); however, they can be stalled or can bypass each other in the second stage (read operands) and thus enter execution out of order. Scoreboarding is a technique for allowing instructions to execute out of order when there are sufficient resources and no data dependences; it is named after the CDC 6600 scoreboard, which developed this capability.

First Supercomputer
CDC = Control Data Corporation. In 1964 CDC delivered the first CDC 6600. The machine was unique in many ways:
- It introduced scoreboarding.
- It was the first processor to make extensive use of multiple functional units. It had 16 separate FUs, including 4 FP units, 5 units for memory references, and 7 units for integer operations.
- It had peripheral processors that used multithreading.
- The interaction between pipelining and instruction set design was understood, and a simple load-store instruction set was used to promote pipelining.

Structural and Data Hazards
Before: no instruction issue if there is either a structural or a data hazard. Data hazards include WAW, RAW, and WAR. Now: issue the instruction if there is no structural hazard and no WAW data hazard. Example:
    DIV.D F0,F2,F4
    ADD.D F10,F0,F8
    MUL.D F6,F6,F14
So all three instructions will be issued. Operands are read when there are no RAW hazards.
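The three kinds of data hazard can be told apart by comparing the destination and source registers of an earlier and a later instruction. The following is a minimal sketch; the `hazards` function and the `(dest, src1, src2)` tuple layout are assumptions made for illustration.

```python
def hazards(earlier, later):
    """Return the set of data hazards that `later` has with respect to `earlier`.

    Each instruction is a (dest, src1, src2) tuple of register names.
    """
    e_dest, *e_srcs = earlier
    l_dest, *l_srcs = later
    found = set()
    if e_dest in l_srcs:
        found.add("RAW")   # later reads what earlier writes
    if l_dest in e_srcs:
        found.add("WAR")   # later writes what earlier still has to read
    if l_dest == e_dest:
        found.add("WAW")   # both write the same register
    return found

DIV = ("F0", "F2", "F4")     # DIV.D F0,F2,F4
ADD = ("F10", "F0", "F8")    # ADD.D F10,F0,F8
MUL = ("F6", "F6", "F14")    # MUL.D F6,F6,F14

print(hazards(DIV, ADD))  # {'RAW'}
print(hazards(DIV, MUL))  # set()
print(hazards(ADD, MUL))  # set()
```

No pair in the example has a WAW hazard, which is consistent with all three instructions being issued; only ADD.D has to wait at the read-operands stage because of its RAW hazard on F0.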

Record Keeping
- Every instruction goes through the scoreboard, where a record of the data dependences is constructed; this step corresponds to instruction issue and replaces part of the ID step in the MIPS pipeline.
- The scoreboard determines when the instruction can read its operands and begin operation (RAW hazards).
- If the scoreboard decides that the instruction cannot execute immediately, it monitors every change in the hardware and decides when the instruction can execute.
- The scoreboard controls when an instruction can write its result into the destination register (WAR hazards).

Split ID Stage into Two Stages
1. Issue: decode instructions; check for structural and WAW hazards.
2. Read operands: wait until there are no RAW hazards; then read operands.
No issue (why no issue?):
    DIV.D F0,F2,F4
    ADD.D F10,F0,F8
    SUB.D F6,F6,F14
No issue (why no issue?):
    DIV.D F0,F2,F4
    ADD.D F10,F0,F8
    MUL.D F0,F6,F14

[Figure: MIPS Processor with Scoreboard]

Four Steps in Execution
1. Issue: if there are no structural or WAW hazards
2. Read operands: if there are no RAW hazards
3. Execute: once both operands are received
4. Write result: if there are no WAR hazards
We concentrate on FP operations and do not consider a step for memory access.

Step One. Issue
If a functional unit (FU) for the instruction is free and no other active instruction has the same destination register, the scoreboard issues the instruction to the FU and updates its internal data structure. By ensuring that no other active FU wants to write its result into the destination register, we guarantee that WAW hazards cannot be present. If a structural or WAW hazard exists, then instruction issue stalls, and no further instructions will issue until these hazards are cleared.
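The issue rule can be sketched as two checks followed by a bookkeeping update. The code below is an illustration under stated assumptions: the `FU_FOR_OP` mapping is hypothetical (a single Add unit shared by ADD.D and SUB.D, and only one multiplier modeled), and no earlier instruction is assumed to complete while the sequence is being issued.

```python
# Assumed opcode-to-FU mapping (hypothetical): one Add unit serves both
# ADD.D and SUB.D, and only one multiplier is modeled.
FU_FOR_OP = {"DIV.D": "Divide", "ADD.D": "Add", "SUB.D": "Add", "MUL.D": "Mult1"}

def try_issue(instr, busy, result):
    """Try to issue instr = (op, dest, src1, src2).

    busy is the set of occupied FUs; result maps a register to the FU that
    will write it. Returns 'issued', 'structural', or 'WAW'.
    """
    op, dest = instr[0], instr[1]
    fu = FU_FOR_OP[op]
    if fu in busy:
        return "structural"   # no free functional unit
    if dest in result:
        return "WAW"          # an active instruction already targets dest
    busy.add(fu)
    result[dest] = fu
    return "issued"

def issue_in_order(program):
    """Issue instructions in order, stopping at the first one that stalls."""
    busy, result = set(), {}
    outcomes = []
    for instr in program:
        outcome = try_issue(instr, busy, result)
        outcomes.append(outcome)
        if outcome != "issued":
            break             # in-order issue: later instructions wait too
    return outcomes

DIV = ("DIV.D", "F0", "F2", "F4")
ADD = ("ADD.D", "F10", "F0", "F8")

print(issue_in_order([DIV, ADD, ("MUL.D", "F6", "F6", "F14")]))
# ['issued', 'issued', 'issued']
print(issue_in_order([DIV, ADD, ("SUB.D", "F6", "F6", "F14")]))
# ['issued', 'issued', 'structural']
print(issue_in_order([DIV, ADD, ("MUL.D", "F0", "F6", "F14")]))
# ['issued', 'issued', 'WAW']
```

With this mapping, the SUB.D variant stalls because the Add unit is still held by ADD.D, and the MUL.D F0 variant stalls because DIV.D is still going to write F0.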

Step Two. Read Operands
The scoreboard monitors the availability of the source operands. A source operand is available if no earlier-issued active instruction is going to write it. When the source operands are available, the scoreboard tells the FU to proceed to read the operands from the registers and begin execution. The scoreboard resolves RAW hazards dynamically in this step, and instructions may be sent into execution out of order. The operands for an instruction are read only when both operands are available in the register file. The scoreboard does not take advantage of forwarding. Issue and Read Operands together replace the ID stage of the simple MIPS pipeline.
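The availability rule can be written as a small predicate over the earlier instructions that are still active. This is a sketch only; `Instr` and `operands_ready` are hypothetical names introduced for the example.

```python
from typing import List, NamedTuple

class Instr(NamedTuple):
    op: str
    dest: str
    src1: str
    src2: str

def operands_ready(instr: Instr, earlier_active: List[Instr]) -> bool:
    """True if no earlier-issued, still-active instruction will write a source."""
    pending = {e.dest for e in earlier_active}
    return instr.src1 not in pending and instr.src2 not in pending

div = Instr("DIV.D", "F0", "F2", "F4")
add = Instr("ADD.D", "F10", "F0", "F8")
mul = Instr("MUL.D", "F6", "F6", "F14")

print(operands_ready(add, [div]))       # False: F0 is still to be written by DIV.D
print(operands_ready(mul, [div, add]))  # True: MUL.D can read operands ahead of ADD.D
print(operands_ready(add, []))          # True once DIV.D has written F0
```

Only earlier instructions are considered, so MUL.D reading and writing F6 itself does not block its own operand read.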

Step Three. Execution
The FU begins execution upon receiving operands. When the result is ready, the FU notifies the scoreboard that it has completed execution. This step replaces the EX stage in the MIPS pipeline and takes multiple cycles in the MIPS FP pipeline.

Step Four. Write Result
Once it is aware that the FU has completed execution, the scoreboard checks for WAR hazards and stalls the completing instruction, if necessary. In general, a completing instruction cannot be allowed to write its results when
- there is an instruction that has not read its operands that precedes (i.e., in order of issue) the completing instruction, and
- one of the operands is the same register as the result of the completing instruction.
If a WAR hazard does not exist, or when it clears, the scoreboard tells the FU to store its result to the destination register. This step replaces the WB step in the simple MIPS pipeline.

Example (p. A-72)
    L.D   F6,34(R2)
    L.D   F2,45(R3)
    MUL.D F0,F2,F4
    SUB.D F8,F6,F2
    DIV.D F10,F0,F6
    ADD.D F6,F8,F2

Scoreboard
Three parts:
1. Instruction status: indicates which of the four steps the instruction is in
2. Functional unit status: Busy, Op, Fi, Fj, Fk, Qj, Qk, Rj, Rk
3. Register result status: for each register, indicates which functional unit will write it, if an active instruction has that register as its destination
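The three parts map naturally onto a small data structure. The sketch below is one possible layout, not the course's reference design: the class and method names are hypothetical, the field meanings follow the usual textbook reading (Fi is the destination, Fj and Fk the sources, Qj and Qk the FUs producing them, Rj and Rk flags for "ready and not yet read"), and only the bookkeeping done at issue is modeled, with no hazard checks.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

FU_NAMES = ("Integer", "Mult1", "Mult2", "Add", "Divide")

@dataclass
class FUStatus:
    busy: bool = False
    op: Optional[str] = None
    fi: Optional[str] = None   # destination register
    fj: Optional[str] = None   # source registers
    fk: Optional[str] = None
    qj: Optional[str] = None   # FUs that will produce Fj / Fk (None if none pending)
    qk: Optional[str] = None
    rj: bool = False           # True: operand is ready and not yet read
    rk: bool = False

@dataclass
class Scoreboard:
    # 1. instruction text -> most recent step reached
    instr_status: Dict[str, str] = field(default_factory=dict)
    # 2. one status record per functional unit
    fu_status: Dict[str, FUStatus] = field(
        default_factory=lambda: {name: FUStatus() for name in FU_NAMES})
    # 3. register -> FU that will write it
    reg_result: Dict[str, str] = field(default_factory=dict)

    def issue(self, text, fu, op, fi, fj=None, fk=None):
        """Record an issued instruction (bookkeeping only, checks omitted)."""
        qj = self.reg_result.get(fj)
        qk = self.reg_result.get(fk)
        self.fu_status[fu] = FUStatus(
            busy=True, op=op, fi=fi, fj=fj, fk=fk, qj=qj, qk=qk,
            rj=fj is not None and qj is None,
            rk=fk is not None and qk is None)
        self.reg_result[fi] = fu
        self.instr_status[text] = "Issue"

# Assumes the first load (L.D F6,34(R2)) has already written its result.
sb = Scoreboard()
sb.issue("L.D F2,45(R3)", "Integer", "Load", "F2", fj="R3")
sb.issue("MUL.D F0,F2,F4", "Mult1", "Mult", "F0", "F2", "F4")
sb.issue("SUB.D F8,F6,F2", "Add", "Sub", "F8", "F6", "F2")
sb.issue("DIV.D F10,F0,F6", "Divide", "Div", "F10", "F0", "F6")

print(sb.reg_result)
# {'F2': 'Integer', 'F0': 'Mult1', 'F8': 'Add', 'F10': 'Divide'}
print(sb.fu_status["Divide"].qj, sb.fu_status["Divide"].rk)  # Mult1 True
```

The Q fields are looked up before the register result entry for the new destination is set, so an instruction never ends up waiting on itself.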

Example Code:
    L.D   F6,34(R2)
    L.D   F2,45(R3)
    MUL.D F0,F2,F4
    SUB.D F8,F6,F2
    DIV.D F10,F0,F6
    ADD.D F6,F8,F2

Scoreboard Tables 1 (Fill in blanks)

Instruction status
Instruction         Issue   Read operands   Exec. complete   Write result
L.D F6,34(R2)       √       √
L.D F2,45(R3)
MUL.D F0,F2,F4
SUB.D F8,F6,F2
DIV.D F10,F0,F6
ADD.D F6,F8,F2

Functional unit status
Name      Busy   Op   Fi   Fj   Fk   Qj   Qk   Rj   Rk
Integer
Mult1
Mult2
Add
Divide

Register result status
      F0   F2   F4   F6   F8   F10   F12   …   F30
FU

Scoreboard Tables 2 (Fill in blanks)

Instruction status
Instruction         Issue   Read operands   Exec. complete   Write result
L.D F6,34(R2)       √       √               √                √
L.D F2,45(R3)       √       √               √
MUL.D F0,F2,F4      √
SUB.D F8,F6,F2      √
DIV.D F10,F0,F6     √
ADD.D F6,F8,F2

Functional unit status
Name      Busy   Op   Fi   Fj   Fk   Qj   Qk   Rj   Rk
Integer
Mult1
Mult2
Add
Divide

Register result status
      F0   F2   F4   F6   F8   F10   F12   …   F30
FU

Scoreboard Tables 3 (Fill in blanks)

Instruction status
Instruction         Issue   Read operands   Exec. complete   Write result
L.D F6,34(R2)       √       √               √                √
L.D F2,45(R3)       √       √               √                √
MUL.D F0,F2,F4      √       √               √
SUB.D F8,F6,F2      √       √               √                √
DIV.D F10,F0,F6     √
ADD.D F6,F8,F2      √       √               √

Functional unit status
Name      Busy   Op   Fi   Fj   Fk   Qj   Qk   Rj   Rk
Integer
Mult1
Mult2
Add
Divide

Register result status
      F0   F2   F4   F6   F8   F10   F12   …   F30
FU

Scoreboard Tables 4 (Fill in blanks)

Instruction status
Instruction         Issue   Read operands   Exec. complete   Write result
L.D F6,34(R2)       √       √               √                √
L.D F2,45(R3)       √       √               √                √
MUL.D F0,F2,F4      √       √               √                √
SUB.D F8,F6,F2      √       √               √                √
DIV.D F10,F0,F6     √       √               √
ADD.D F6,F8,F2      √       √               √                √

Functional unit status
Name      Busy   Op   Fi   Fj   Fk   Qj   Qk   Rj   Rk
Integer
Mult1
Mult2
Add
Divide

Register result status
      F0   F2   F4   F6   F8   F10   F12   …   F30
FU

Required Checks
Instruction status    Wait until
Issue                 Not Busy[FU] and not Result[D]
Read operands         Rj and Rk
Execution complete    Functional unit done
Write result          For every f: (Fj[f] ≠ Fi[FU] or Rj[f] = No) and (Fk[f] ≠ Fi[FU] or Rk[f] = No)

WAR Hazard
A WAR hazard exists if
- another instruction f has this instruction’s destination (Fi[FU]) as a source (Fj[f] or Fk[f]), and
- that instruction has not yet read the register (Rj[f] = Yes or Rk[f] = Yes for the matching source).
The test at write result prevents the write while a WAR hazard exists.
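The write-result test can be transcribed almost directly into code. In this sketch the functional unit status is a plain dictionary and `can_write_result` is a hypothetical helper name; True stands for Yes and False for No in the Rj and Rk flags, and the flags are assumed to be reset to No once an instruction has read its operands.

```python
def can_write_result(fu, fu_status):
    """Apply the write-result check for the instruction occupying `fu`.

    fu_status maps an FU name to a dict with keys Fi, Fj, Fk, Rj, Rk.
    The write is allowed only if no unit still has to read the old value
    of the destination register Fi[fu].
    """
    dest = fu_status[fu]["Fi"]
    return all(
        (f["Fj"] != dest or not f["Rj"]) and (f["Fk"] != dest or not f["Rk"])
        for f in fu_status.values()
    )

fu_status = {
    # DIV.D F10,F0,F6: F6 is ready but not yet read, F0 is still being produced
    "Divide": {"Fi": "F10", "Fj": "F0", "Fk": "F6", "Rj": False, "Rk": True},
    # ADD.D F6,F8,F2: operands already read, execution complete
    "Add": {"Fi": "F6", "Fj": "F8", "Fk": "F2", "Rj": False, "Rk": False},
}

print(can_write_result("Add", fu_status))   # False: DIV.D has not read F6 yet

# Once DIV.D has read its operands, its flags are reset and the hazard clears.
fu_status["Divide"]["Rj"] = False
fu_status["Divide"]["Rk"] = False
print(can_write_result("Add", fu_status))   # True
```

The check also runs over the writing unit itself, which is harmless because its own flags are already No by the time it reaches write result.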

Costs and Benefits of Scoreboarding
The reported performance improvement was a factor of 1.7 for FORTRAN programs and 2.5 for hand-coded assembly language. The scoreboard had about as much logic as one FU, which is surprisingly low. The main cost was the large number of buses: about four times as many as would be required if the CPU only executed instructions in order.

Factors Limiting Scoreboarding
1. Amount of parallelism available among the instructions: this determines whether independent instructions can be found to execute. If each instruction depends on its predecessor, no dynamic scheduling scheme can reduce stalls.
2. Number of scoreboard entries: this determines how far ahead the pipeline can look for independent instructions. The set of instructions examined as candidates for potential execution is called the window. The size of the scoreboard determines the size of the window.
3. Number and types of FUs: this determines the importance of structural hazards.
4. Presence of antidependences and output dependences: these lead to WAR and WAW stalls.

A.9. Fallacies and Pitfalls
Unexpected execution sequences may cause unexpected hazards. It might seem that WAW hazards should never occur in a code sequence, because no compiler would ever generate two writes to the same register without an intervening read. But they can occur when the sequence is unexpected. For example, the first write might be in the delay slot of a taken branch. Here is an example:
          BNEZ  R1,foo
          DIV.D F0,F2,F4   ; moved into delay slot from fall through
          .....
    foo:  L.D   F0,qrs
If the branch is taken, then before DIV.D can complete, the L.D will reach WB, causing a WAW hazard.

How Extensive Pipelining Affects Performance
Extensive pipelining can impact other aspects of a design, leading to overall worse cost-performance. The best example of this phenomenon comes from two implementations of the VAX, the 8600 and the 8700. When the 8600 was initially delivered, it had a cycle time of 80 ns. Subsequently, a redesigned version called the 8650, with a 55 ns clock, was introduced. The 8700 had a much simpler pipeline that operated at the microinstruction level, yielding a smaller CPU with a faster clock cycle of 45 ns. The overall outcome is that the 8650 had a CPI advantage of about 20%, but the 8700 had a clock rate that was about 20% faster. Thus, the 8700 achieved the same performance with much less hardware.
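The conclusion can be checked with the usual relation CPU time = instruction count × CPI × clock cycle time. The numbers below are a rough sanity check, not measured data: the instruction count and the baseline CPI are arbitrary placeholders that cancel in the ratio, and "a CPI advantage of about 20%" is read here as the 8650 needing 0.8 times the CPI of the 8700, which is one possible interpretation.

```python
def cpu_time(instruction_count, cpi, cycle_time_ns):
    """CPU time in ns = instruction count x CPI x clock cycle time."""
    return instruction_count * cpi * cycle_time_ns

INSTRUCTIONS = 1_000_000       # arbitrary; cancels in the ratio
CPI_8700 = 10.0                # placeholder value, not a measured figure
CPI_8650 = 0.8 * CPI_8700      # assumed reading of the 20% CPI advantage

# Clock rate of the 8700 (45 ns cycle) relative to the 8650 (55 ns cycle).
clock_rate_ratio = 55 / 45

# Execution time of the 8650 relative to the 8700 for the same program.
time_ratio = (cpu_time(INSTRUCTIONS, CPI_8650, 55)
              / cpu_time(INSTRUCTIONS, CPI_8700, 45))

print(round(clock_rate_ratio, 2))  # 1.22
print(round(time_ratio, 2))        # 0.98
```

Under this reading the two effects nearly cancel, so the two machines come out within a few percent of each other.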