Instruction-Level Parallelism and Its Dynamic Exploitation


Chapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation

Overview
- Instruction-level parallelism
- Dynamic scheduling techniques
  - Scoreboarding (Appendix A.8)
  - Tomasulo's algorithm
- Reducing branch cost with dynamic hardware prediction
  - Basic branch prediction and branch-prediction buffers
  - Branch-target buffers

CPI Equation

Pipeline CPI = Ideal pipeline CPI + Structural stalls + RAW stalls + WAR stalls + WAW stalls + Control stalls

Technique                                  Reduces
Loop unrolling                             Control stalls
Basic pipeline scheduling                  RAW stalls
Dynamic scheduling with scoreboarding      RAW stalls
Dynamic scheduling with register renaming  WAR and WAW stalls
Dynamic branch prediction                  Control stalls
Issuing multiple instructions per cycle    Ideal CPI
Compiler dependence analysis               Ideal CPI and data stalls
Software pipelining and trace scheduling   Ideal CPI and data stalls
Speculation                                All data and control stalls
Dynamic memory disambiguation              RAW stalls involving memory
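
As a quick worked example (the stall counts here are illustrative, not from the text): with an ideal pipeline CPI of 1.0 and per-instruction averages of 0.15 RAW, 0.05 WAR/WAW, and 0.20 control stall cycles,

    Pipeline CPI = 1.0 + 0.15 + 0.05 + 0.20 = 1.40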

Instruction-Level Parallelism
- Potential overlap among instructions
- Few possibilities within a basic block
  - Blocks are small (6-7 instructions)
  - Instructions are dependent
- Exploit ILP across multiple basic blocks, e.g. iterations of a loop:
    for (i = 1000; i > 0; i = i - 1)
        x[i] = x[i] + s;
- An alternative to vector instructions

Basic Pipeline Scheduling
- Find sequences of unrelated instructions
- The compiler's ability to schedule depends on:
  - Amount of ILP available in the program
  - Latencies of the functional units
- Latency assumptions for the examples:
  - Standard MIPS integer pipeline
  - No structural hazards (fully pipelined or duplicated units)
  - Latencies of FP operations:

    Instruction producing result  Instruction using result  Latency
    FP ALU op                     Another FP ALU op         3
    FP ALU op                     Store double (SD)         2
    Load double (LD)              FP ALU op                 1

Sample Pipeline
[Pipeline timing diagram: FP ALU instructions flow through IF, ID, FP1-FP4, DM, WB. A dependent FP ALU op stalls three cycles behind its producer; a dependent SD stalls two cycles.]

Basic Scheduling

for (i = 1000; i > 0; i = i - 1)
    x[i] = x[i] + s;

Sequential MIPS assembly code:
Loop:  LD    F0, 0(R1)
       ADDD  F4, F0, F2
       SD    0(R1), F4
       SUBI  R1, R1, #8
       BNEZ  R1, Loop

Pipelined execution:
Loop:  LD    F0, 0(R1)     1
       stall               2
       ADDD  F4, F0, F2    3
       stall               4
       stall               5
       SD    0(R1), F4     6
       SUBI  R1, R1, #8    7
       stall               8
       BNEZ  R1, Loop      9
       stall               10

Scheduled pipelined execution:
Loop:  LD    F0, 0(R1)     1
       SUBI  R1, R1, #8    2
       ADDD  F4, F0, F2    3
       stall               4
       BNEZ  R1, Loop      5
       SD    8(R1), F4     6

Dynamic Scheduling
- Scheduling separates dependent instructions
  - Static: performed by the compiler
  - Dynamic: performed by the hardware
- Advantages of dynamic scheduling
  - Handles dependences unknown at compile time
  - Simplifies the compiler
  - Optimization is done at run time
- Disadvantage
  - Cannot eliminate true data dependences

Out-of-Order Execution (1/2)
Central idea of dynamic scheduling.

In-order execution:
    DIVD F0, F2, F4     IF ID DIV ...
    ADDD F10, F0, F8    IF ID stall stall stall ...
    SUBD F12, F8, F14   IF stall stall ...

Out-of-order execution:
    DIVD F0, F2, F4     IF ID DIV ...
    SUBD F12, F8, F14   IF ID A1 A2 A3 A4 ...
    ADDD F10, F0, F8    IF ID stall ...

Out-of-Order Execution (2/2)
- The issue process in ID is split in two:
  - Issue: decode the instruction, check for structural hazards (done in order)
  - Read operands: wait until no data hazards remain
- Out-of-order execution/completion introduces problems:
  - Exception handling
  - WAR hazards

Dynamic Scheduling with a Scoreboard
- Details in Appendix A.8
- Allows out-of-order execution when there are:
  - Sufficient resources
  - No data dependences
- The scoreboard is responsible for issue, execution, and hazards
- Functional units with long delays are:
  - Duplicated
  - Fully pipelined
- CDC 6600: 16 functional units

MIPS with Scoreboard

Scoreboard Operation
- The scoreboard centralizes hazard management
  - Every instruction goes through the scoreboard
  - The scoreboard determines when the instruction can read its operands and begin execution
  - It monitors changes in the hardware and decides when a stalled instruction can execute
  - It controls when instructions can write results
- New pipeline: ID is split into Issue and Read Regs, followed by Execution (EX) and Write (WB)

Execution Process
- Issue
  - Functional unit is free (structural hazard check)
  - No active instruction has the same destination register Rd (WAW check)
- Read operands
  - Checks availability of source operands
  - Resolves RAW hazards dynamically (out-of-order execution)
- Execution
  - Functional unit begins execution when operands arrive
  - Notifies the scoreboard when it has completed execution
- Write result
  - Scoreboard checks for WAR hazards
  - Stalls the completing instruction if necessary

Scoreboard Data Structure
- Instruction status: indicates the pipeline stage of each instruction
- Functional unit status:
  - Busy: whether the functional unit is busy
  - Op: operation to perform in the unit (+, -, etc.)
  - Fi: destination register
  - Fj, Fk: source register numbers
  - Qj, Qk: functional units producing Fj, Fk
  - Rj, Rk: flags indicating when Fj, Fk are ready
- Register result status: which FU will write each register
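
A minimal C sketch of this bookkeeping, including the issue and write-result checks from the previous slide (names, table sizes, and the -1 encoding are illustrative assumptions, not the CDC 6600 design):

    #include <stdbool.h>

    #define NUM_FUS  5    /* e.g., Integer, Mult1, Mult2, Add, Divide */
    #define NUM_REGS 32

    typedef struct {
        bool busy;        /* Busy: is the unit in use?             */
        int  op;          /* Op: operation to perform              */
        int  fi, fj, fk;  /* Fi: dest reg; Fj, Fk: source regs     */
        int  qj, qk;      /* FUs producing Fj, Fk (-1 = none)      */
        bool rj, rk;      /* Rj, Rk: are Fj, Fk ready?             */
    } fu_status_t;

    static fu_status_t fu[NUM_FUS];
    static int reg_result[NUM_REGS];  /* FU that will write each reg
                                         (-1 = none; assume initialized) */

    /* Issue: unit free (structural) and no active instruction
       writes the same destination register (WAW). */
    bool can_issue(int unit, int dest_reg)
    {
        return !fu[unit].busy && reg_result[dest_reg] == -1;
    }

    /* Read operands: both sources ready (RAW resolved). */
    bool can_read_operands(int unit)
    {
        return fu[unit].rj && fu[unit].rk;
    }

    /* Write result: stall while any other active instruction still
       needs to read the old value of our destination (WAR). */
    bool can_write_result(int unit)
    {
        for (int f = 0; f < NUM_FUS; f++) {
            if (f == unit || !fu[f].busy) continue;
            if ((fu[f].fj == fu[unit].fi && fu[f].rj) ||
                (fu[f].fk == fu[unit].fi && fu[f].rk))
                return false;
        }
        return true;
    }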

Scoreboard Data Structure (1/3)

Instruction status:
Instruction         Issue  Read operands  Execution completed  Write
LD    F6, 34(R2)    Y      Y              Y                    Y
LD    F2, 45(R3)    Y      Y              Y
MULTD F0, F2, F4    Y
SUBD  F8, F6, F2    Y
DIVD  F10, F0, F6   Y
ADDD  F6, F8, F2

Functional unit status:
Name     Busy  Op    Fi   Fj  Fk  Qj       Qk       Rj  Rk
Integer  Y     Load  F2   R3                        N
Mult1    Y     Mult  F0   F2  F4  Integer           N   Y
Mult2    N
Add      Y     Sub   F8   F6  F2           Integer  Y   N
Divide   Y     Div   F10  F0  F6  Mult1             N   Y

Register result status:
     F0     F2       F4  F6  F8   F10     F12  ...  F30
FU   Mult1  Integer          Add  Divide

Scoreboard Data Structure (2/3)

Scoreboard Data Structure (3/3)

Scoreboard Algorithm

Scoreboard Limitations
- Amount of available ILP
- Number of scoreboard entries
  - Limited to a basic block
  - Can be extended beyond a branch
- Number and types of functional units
  - Structural hazards can increase with dynamic scheduling
- Presence of anti- and output dependences
  - Lead to WAR and WAW stalls

Tomasulo Approach
- Another approach to eliminating stalls
- Combines the scoreboard with register renaming (to avoid WAR and WAW hazards)
- Designed for the IBM 360/91
  - High FP performance needed for the whole 360 family
  - Only four double-precision FP registers
  - Long memory access and long FP delays
- Can support overlapped execution of multiple iterations of a loop

Tomasulo Approach

Stages
- Issue
  - Get an empty reservation station or buffer
  - Send available operands to the reservation station
  - Use the name of the reservation station for operands that are not yet available
- Execute
  - Execute the operation if both operands are available
  - Otherwise, monitor the CDB for the missing operands
- Write result
  - When the result is available, write it to the CDB
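
A minimal C sketch of a reservation station and the CDB broadcast (names, sizes, and the -1 encoding are illustrative assumptions, not the 360/91 hardware):

    #include <stdbool.h>

    typedef struct {
        bool   busy;
        int    op;
        double vj, vk;  /* operand values, valid once qj/qk == -1   */
        int    qj, qk;  /* reservation stations producing vj, vk    */
    } rs_t;

    #define NUM_RS 8
    static rs_t rs[NUM_RS];

    /* Execute: an operation may start once both operands have arrived. */
    bool ready(const rs_t *r)
    {
        return r->busy && r->qj == -1 && r->qk == -1;
    }

    /* Write result: broadcast a completed result on the CDB; every
       station waiting on 'producer' captures the value. */
    void cdb_broadcast(int producer, double value)
    {
        for (int i = 0; i < NUM_RS; i++) {
            if (!rs[i].busy) continue;
            if (rs[i].qj == producer) { rs[i].vj = value; rs[i].qj = -1; }
            if (rs[i].qk == producer) { rs[i].vk = value; rs[i].qk = -1; }
        }
    }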

Example (1/2)

Example (2/2)

Tomasulo's Algorithm An enhanced and detailed design appears in Fig. 3.5 of the textbook.

Loop Iterations

Loop:  LD    F0, 0(R1)
       MULTD F4, F0, F2
       SD    0(R1), F4
       SUBI  R1, R1, #8
       BNEZ  R1, Loop

Dynamic Hardware Prediction
- Importance of control dependences
  - Branches and jumps are frequent
  - Limiting factor as ILP increases (Amdahl's Law)
- Schemes to attack control dependences
  - Static
    - Basic (stall the pipeline)
    - Predict-not-taken and predict-taken
    - Delayed branch and canceling branch
  - Dynamic predictors
- Effectiveness of dynamic prediction schemes
  - Accuracy
  - Cost

Basic Branch Prediction Buffers
- a.k.a. Branch History Table (BHT)
- Small direct-mapped cache of T/NT (taken/not-taken) bits, indexed by the branch instruction's PC
[Figure: the PC of a branch instruction indexes the BHT; the stored bit selects either the branch target (predict taken) or PC + 4 (predict not taken).]

N-bit Branch Prediction Buffers
- Use an n-bit saturating counter per entry
- With a 2-bit counter, a loop branch mispredicts only at the loop exit
- A 2-bit predictor is almost as good as any general n-bit predictor
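
A minimal C sketch of a 2-bit saturating-counter buffer (table size and indexing are illustrative assumptions): counter values 0 and 1 predict not taken, 2 and 3 predict taken.

    #include <stdbool.h>

    #define BHT_ENTRIES 4096               /* 4K-entry table */
    static unsigned char bht[BHT_ENTRIES]; /* 2-bit counters, initially 0 */

    bool bht_predict(unsigned pc)
    {
        /* index by low PC bits; >> 2 drops the word-alignment bits */
        return bht[(pc >> 2) % BHT_ENTRIES] >= 2;
    }

    void bht_update(unsigned pc, bool taken)
    {
        unsigned char *c = &bht[(pc >> 2) % BHT_ENTRIES];
        if (taken  && *c < 3) (*c)++;   /* saturate at 3 (strongly taken)     */
        if (!taken && *c > 0) (*c)--;   /* saturate at 0 (strongly not taken) */
    }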

Prediction Accuracy of a 4K-entry 2-bit Prediction Buffer

Correlating Branch Predictors
- a.k.a. two-level predictors
- Use the recent behavior of other (previous) branches to make the prediction
[Figure: as in the basic BHT, the PC indexes the table, but a 1-bit global branch history (storing the behavior of the previous branch, NT or T) selects between two prediction bits per entry.]

Example

    BNEZ   R1, L1        ; branch b1 (d != 0)
    DADDIU R1, R0, #1
L1: DADDIU R3, R1, #-1
    BNEZ   R3, L2        ; branch b2
    . . .
L2:

Basic one-bit predictor, with d alternating between 2 and 0 (every branch is mispredicted):
d=?  b1 pred  b1 action  new b1 pred  b2 pred  b2 action  new b2 pred
2    NT       T          T            NT       T          T
0    T        NT         NT           T        NT         NT

One-bit predictor with one-bit correlation (each pair is "prediction if last branch NT / prediction if last branch T"):
d=?  b1 pred  b1 action  new b1 pred  b2 pred  b2 action  new b2 pred
2    NT/NT    T          T/NT         NT/NT    T          NT/T
0    T/NT     NT         T/NT         NT/T     NT         NT/T
2    T/NT     T          T/NT         NT/T     T          NT/T
0    T/NT     NT         T/NT         NT/T     NT         NT/T

With the correlating predictor, the only mispredictions occur on the first pass; afterward every branch is predicted correctly.

(2,2) Branch Prediction Buffer

(m, n) Predictors
- Use the behavior of the last m branches
- 2^m n-bit predictors for each branch-address entry
- Simple implementation: an m-bit shift register records the behavior of the last m branches (the global branch history)
[Figure: the PC selects a row of the (m,n) branch-prediction buffer; the m-bit global branch history selects which n-bit predictor in that row to use.]
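
A minimal C sketch of an (m, n) predictor with m = 2 and n = 2 (sizes and indexing are illustrative assumptions): the global history register selects one of the 2^m counters in the row chosen by the branch address.

    #include <stdbool.h>

    #define M       2
    #define ENTRIES 1024
    static unsigned ghr;  /* m-bit global branch history shift register */
    static unsigned char counters[ENTRIES][1 << M];  /* 2-bit counters */

    bool predict_2level(unsigned pc)
    {
        return counters[(pc >> 2) % ENTRIES][ghr & ((1u << M) - 1)] >= 2;
    }

    void update_2level(unsigned pc, bool taken)
    {
        unsigned char *c = &counters[(pc >> 2) % ENTRIES][ghr & ((1u << M) - 1)];
        if (taken  && *c < 3) (*c)++;
        if (!taken && *c > 0) (*c)--;
        ghr = (ghr << 1) | (taken ? 1 : 0);  /* shift in latest outcome */
    }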

Size of the Buffers
- Number of bits in an (m, n) predictor: 2^m × n × number of entries in the table
- Example: assume a budget of 8K bits for the BHT
  - (0,1): 8K entries
  - (0,2): 4K entries
  - (2,2): 1K entries
  - (12,2): 1 entry! Does not use the branch address at all; relies only on the global branch history
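
Checking the arithmetic: for (2,2) with 1K entries, 2^2 × 2 × 1024 = 8K bits, exactly the budget; for (12,2), a single entry already costs 2^12 × 2 = 8K bits.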

Performance Comparison of 2-bit Predictors

Branch-Target Buffers
- Further reduce control stalls (hopefully to zero)
- Store the predicted target address in the buffer
- Access the buffer during IF
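
A minimal C sketch of the IF-stage lookup (entry format and size are illustrative assumptions):

    #include <stdbool.h>

    #define BTB_ENTRIES 512

    typedef struct {
        bool     valid;
        unsigned tag;     /* PC of a branch predicted taken */
        unsigned target;  /* predicted target address       */
    } btb_entry_t;

    static btb_entry_t btb[BTB_ENTRIES];

    /* During IF: if the fetch PC hits in the BTB, fetch next from the
       stored target; otherwise fall through to PC + 4. */
    unsigned next_fetch_pc(unsigned pc)
    {
        btb_entry_t *e = &btb[(pc >> 2) % BTB_ENTRIES];
        if (e->valid && e->tag == pc)
            return e->target;
        return pc + 4;
    }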

Prediction with a BTB

Target Instruction Buffers
- Store target instructions instead of target addresses
- Advantages
  - The BTB access can take longer than the time between fetches, and the BTB can be larger
  - Branch folding: zero-cycle unconditional branches (replace the branch with the target instruction)
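
A minimal C sketch of branch folding with such a buffer (fields and behavior are illustrative assumptions):

    #include <stdbool.h>

    typedef struct {
        bool     valid;
        unsigned tag;           /* PC of an unconditional branch           */
        unsigned target_pc;     /* address of the instruction after target */
        unsigned target_instr;  /* the target instruction word itself      */
    } tib_entry_t;

    #define TIB_ENTRIES 256
    static tib_entry_t tib[TIB_ENTRIES];

    /* On a hit, return the target instruction in place of the branch and
       redirect fetch past it: the branch itself costs zero cycles. */
    bool fold_branch(unsigned pc, unsigned *instr, unsigned *next_pc)
    {
        tib_entry_t *e = &tib[(pc >> 2) % TIB_ENTRIES];
        if (!e->valid || e->tag != pc)
            return false;
        *instr   = e->target_instr;
        *next_pc = e->target_pc;
        return true;
    }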

Performance Issues
- Limitations of branch prediction schemes
  - Prediction accuracy (80% - 95%), depending on:
    - Type of program
    - Size of buffer
  - Penalty of misprediction
- Fetching from both directions can reduce the penalty; the memory system should:
  - Be dual-ported
  - Have an interleaved cache
  - Fetch from one path and then from the other