Lecture 9 - Instruction Level Parallelism
Topics: Review of Appendix C, Dynamic Scheduling, Scoreboarding, Tomasulo
Readings: Chapter 3
October 8, 2014
CSCE 513 Computer Architecture

Overview
Last Time
 Dynamic Scheduling
 MIPS R4000 pipeline / scoreboard
New
 Stalls in pipeline diagrams revisited
 Scoreboard review
 Dealing with control hazards in the 5-stage pipeline
References: Appendix A.8, Chapter 3

Review: Figure 3.1 - ILP techniques and references

Figure C.27 - Forwarding paths to the ALU

Figure C.24 - Stalls with forwarding

Figure C.28 - Branch Hazards
The stall from branch hazards can be reduced by moving the zero test and branch-target calculation into the ID phase of the pipeline. Notice that we have made two important changes, each of which removes 1 cycle from the 3-cycle stall for branches. The first change is to move both the branch-target address calculation and the branch condition decision to the ID cycle. The second change is to write the PC of the instruction in the IF phase, using either the branch-target address computed during ID or the incremented PC computed during IF. In comparison, Figure C.22 obtained the branch-target address from the EX/MEM register and wrote the result during the MEM clock cycle. As mentioned in Figure C.22, the PC can be thought of as a pipeline register (e.g., as part of ID/IF), which is written with the address of the next instruction at the end of each IF cycle.

Figure C.29 - Revised IF, ID Pipeline: Fetch and Decode Stages for Branch Instructions
 Fetch
IF/ID.IR  Mem[PC]
IF/ID.NPC, PC  if ((IF/ID.opcode == branch) & (Regs[IF/ID.IR[rs]] op 0))
then {IF/ID.NPC + (sign-ext(IF/ID.IR[Imm]) << 2)}
else {PC + 4}
 Decode
ID/EX.A  Regs[IF/ID.IR[rs]]
ID/EX.B  Regs[IF/ID.IR[rt]]
ID/EX.IR  IF/ID.IR
ID/EX.Imm  (IF/ID.IR[16])^16 ## IF/ID.IR[16..31] (sign bit replicated 16 times, concatenated with the low 16 bits)

Figure C.14 - Branch delay slots

C.4 What Makes Pipelining Hard to Implement? Exceptions!
 I/O device request
 Invoking an operating system service from a user program
 Tracing instruction execution
 Breakpoint (programmer-requested interrupt)
 Integer arithmetic overflow
 FP arithmetic anomaly
 Page fault (not in main memory)
 Misaligned memory accesses (if alignment is required)
 Memory protection violation
 Using an undefined or unimplemented instruction
 Hardware malfunctions

Introduction
Pipelining became a universal technique in 1985
 Overlaps execution of instructions
 Exploits "instruction-level parallelism"
Beyond this, there are two main approaches:
 Hardware-based dynamic approaches
» Used in server and desktop processors
» Not used as extensively in PMD (personal mobile device) processors
 Compiler-based static approaches
» Not as successful outside of scientific applications

Instruction-Level Parallelism
When exploiting instruction-level parallelism, the goal is to minimize CPI
 Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls
Parallelism within a basic block is limited
 Typical size of a basic block = 3-6 instructions
 Must optimize across branches
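A quick worked example of the CPI equation, with assumed (illustrative) stall rates rather than numbers from the text: if the ideal pipeline CPI is 1.0 and a program incurs on average 0.05 structural, 0.20 data-hazard, and 0.15 control stalls per instruction, then Pipeline CPI = 1.0 + 0.05 + 0.20 + 0.15 = 1.40, i.e., the machine runs at about 1.0/1.40 = 71% of its ideal throughput.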

Data Dependence
Loop-Level Parallelism
 Unroll loop statically or dynamically
 Use SIMD (vector processors and GPUs)
Challenges: data dependence
 Instruction j is data dependent on instruction i if
» instruction i produces a result that may be used by instruction j, or
» instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i
 Dependent instructions cannot be executed simultaneously

Data Dependence
Dependencies are a property of programs
Pipeline organization determines if a dependence is detected and if it causes a stall
Data dependence conveys:
 Possibility of a hazard
 Order in which results must be calculated
 Upper bound on exploitable instruction-level parallelism
Dependencies that flow through memory locations are difficult to detect

Name Dependence
Two instructions use the same name but there is no flow of information
 Not a true data dependence, but a problem when reordering instructions
 Antidependence: instruction j writes a register or memory location that instruction i reads
» Initial ordering (i before j) must be preserved
 Output dependence: instruction i and instruction j write the same register or memory location
» Ordering must be preserved
To resolve, use renaming techniques

Other Factors
Data hazards
 Read after write (RAW)
 Write after write (WAW)
 Write after read (WAR)
Control dependence
 Ordering of instruction i with respect to a branch instruction
 An instruction control dependent on a branch cannot be moved before the branch, so that its execution is no longer controlled by the branch
 An instruction not control dependent on a branch cannot be moved after the branch, so that its execution is controlled by the branch

Examples
Example 1: OR instruction dependent on DADDU and DSUBU
DADDU R1,R2,R3
BEQZ R4,L
DSUBU R1,R1,R6
L: …
OR R7,R1,R8
Example 2: assume R4 isn't used after skip; possible to move DSUBU before the branch
DADDU R1,R2,R3
BEQZ R12,skip
DSUBU R4,R5,R6
DADDU R5,R4,R9
skip: OR R7,R8,R9

Compiler Techniques for Exposing ILP
Pipeline scheduling
 Separate a dependent instruction from the source instruction by the pipeline latency of the source instruction
Example:
for (i=999; i>=0; i=i-1)
x[i] = x[i] + s;

Pipeline Stalls
Loop: L.D F0,0(R1)
stall
ADD.D F4,F0,F2
stall
stall
S.D F4,0(R1)
DADDUI R1,R1,#-8
stall (assume integer load latency is 1)
BNE R1,R2,Loop
9 cycles per loop iteration

Pipeline Scheduling
Scheduled code:
Loop: L.D F0,0(R1)
DADDUI R1,R1,#-8
ADD.D F4,F0,F2
stall
stall
S.D F4,8(R1)
BNE R1,R2,Loop
7 cycles per loop iteration

Loop Unrolling
Loop unrolling (not yet scheduled):
 Unroll by a factor of 4 (assume the number of elements is divisible by 4)
 Eliminate unnecessary instructions
Loop: L.D F0,0(R1)
ADD.D F4,F0,F2
S.D F4,0(R1) ;drop DADDUI & BNE
L.D F6,-8(R1)
ADD.D F8,F6,F2
S.D F8,-8(R1) ;drop DADDUI & BNE
L.D F10,-16(R1)
ADD.D F12,F10,F2
S.D F12,-16(R1) ;drop DADDUI & BNE
L.D F14,-24(R1)
ADD.D F16,F14,F2
S.D F16,-24(R1)
DADDUI R1,R1,#-32
BNE R1,R2,Loop
Note: number of live registers vs. original loop

Loop Unrolling/Pipeline Scheduling
Pipeline schedule the unrolled loop:
Loop: L.D F0,0(R1)
L.D F6,-8(R1)
L.D F10,-16(R1)
L.D F14,-24(R1)
ADD.D F4,F0,F2
ADD.D F8,F6,F2
ADD.D F12,F10,F2
ADD.D F16,F14,F2
S.D F4,0(R1)
S.D F8,-8(R1)
DADDUI R1,R1,#-32
S.D F12,16(R1)
S.D F16,8(R1)
BNE R1,R2,Loop
(The offsets of the last two stores are positive because DADDUI has already decremented R1 by 32.)

Strip Mining
Unknown number of loop iterations?
 Number of iterations = n
 Goal: make k copies of the loop body
 Generate a pair of loops:
» First executes n mod k times
» Second executes n / k times
 "Strip mining"
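A minimal C sketch of strip mining applied to the running x[i] = x[i] + s example, assuming k = 4 (the function name and signature are illustrative, not from the slides):

void add_scalar(double *x, double s, int n)
{
    int i;
    /* first loop: handles the n mod k leftover iterations */
    for (i = 0; i < n % 4; i++)
        x[i] = x[i] + s;
    /* second loop: body unrolled k = 4 times, executes n / k times */
    for (; i < n; i += 4) {
        x[i]     = x[i]     + s;
        x[i + 1] = x[i + 1] + s;
        x[i + 2] = x[i + 2] + s;
        x[i + 3] = x[i + 3] + s;
    }
}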

Static Branch Prediction
 Static branch prediction, in general, uses information gathered before the execution of the program:
» always taken (MIPS-X, Stanford)
» always not taken (Motorola MC88000)
 Another way is to run the program with a profiler and some training input data, and predict the branch direction that was most frequently taken according to the profile. This prediction can be indicated by a hint bit in the branch opcode.

Dynamic Branch Prediction
Dynamic branch prediction (or just branch prediction) uses information about taken or not-taken branches gathered at run time to predict the outcome of a branch.

1-Bit Branch Predictor
Fig C.18 - n-bit predictors

2-Bit Saturating Counter Branch Predictor

Branch History Table Indexing
 Instruction address: x31 x30 x29 x28 … x12 x11 … x2 0 0
» x1 = 0 and x0 = 0 - why? (Instructions are 4 bytes and word-aligned, so the low two address bits are always zero; the table index starts at x2.)
 Prediction
» what was done last time
» 2-bit
» n-bit prediction
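A minimal C sketch of a branch history table of 2-bit saturating counters, indexed as above; the table size and the names bht, predict_taken, and bht_update are illustrative assumptions:

#include <stdint.h>

#define BHT_ENTRIES 4096                   /* assumed table size */

static uint8_t bht[BHT_ENTRIES];           /* 2-bit counters: 0..3 */

static unsigned bht_index(uint32_t pc)
{
    return (pc >> 2) & (BHT_ENTRIES - 1);  /* drop the two always-zero bits */
}

int predict_taken(uint32_t pc)             /* counter >= 2 means predict taken */
{
    return bht[bht_index(pc)] >= 2;
}

void bht_update(uint32_t pc, int taken)    /* saturate at 0 and 3 */
{
    uint8_t *c = &bht[bht_index(pc)];
    if (taken && *c < 3) (*c)++;
    if (!taken && *c > 0) (*c)--;
}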

Correlating Branch Predictors
if (a == 2) a = 0;
if (b == 2) b = 0;
if (a != b) { …
(If the first two branches are both taken, then a == b == 0 and the third branch is not taken: its outcome correlates with the outcomes of the earlier branches.)

(m, n) Predictors
 The last m branches are used to predict: their outcomes select one of 2^m n-bit predictors
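A C sketch of an (m, n) = (2, 2) correlating predictor in the same style as the BHT sketch above; the global history register and the table size are illustrative assumptions:

#include <stdint.h>

#define ENTRIES 1024
#define M 2                                /* global history bits */

static uint8_t table[ENTRIES][1 << M];     /* 2^M two-bit counters per entry */
static unsigned ghr;                       /* global history of last M branches */

int corr_predict(uint32_t pc)
{
    return table[(pc >> 2) % ENTRIES][ghr] >= 2;
}

void corr_update(uint32_t pc, int taken)
{
    uint8_t *c = &table[(pc >> 2) % ENTRIES][ghr];
    if (taken && *c < 3) (*c)++;
    if (!taken && *c > 0) (*c)--;
    ghr = ((ghr << 1) | (taken ? 1 : 0)) & ((1 << M) - 1);  /* shift in outcome */
}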

Branch Prediction
Basic 2-bit predictor:
 For each branch:
» Predict taken or not taken
» If the prediction is wrong two consecutive times, change the prediction
Correlating predictor:
 Multiple 2-bit predictors for each branch
 One for each possible combination of outcomes of the preceding n branches
Local predictor:
 Multiple 2-bit predictors for each branch
 One for each possible combination of outcomes for the last n occurrences of this branch
Tournament predictor:
 Combine correlating predictor with local predictor

Branch Prediction Performance
(Figure: branch predictor performance comparison.)

Dynamic Scheduling
Rearrange the order of instructions to reduce stalls while maintaining data flow
Advantages:
 Compiler doesn't need to have knowledge of the microarchitecture
 Handles cases where dependencies are unknown at compile time
Disadvantages:
 Substantial increase in hardware complexity
 Complicates exceptions

Dynamic Scheduling
Dynamic scheduling implies:
 Out-of-order execution
 Out-of-order completion
 Creates the possibility of WAR and WAW hazards
Tomasulo's approach:
 Tracks when operands are available
 Introduces register renaming in hardware
 Minimizes WAW and WAR hazards

Register Renaming
Example:
DIV.D F0,F2,F4
ADD.D F6,F0,F8
S.D F6,0(R1)
SUB.D F8,F10,F14 ; antidependence on F8 (read by ADD.D)
MUL.D F6,F10,F8 ; name dependence with F6

Register Renaming
Example, with the name dependences renamed to temporaries S and T:
DIV.D F0,F2,F4
ADD.D S,F0,F8
S.D S,0(R1)
SUB.D T,F10,F14
MUL.D F6,F10,T
Now only RAW hazards remain, which can be strictly ordered

Register Renaming
Register renaming is provided by reservation stations (RS), which contain:
 The instruction
 Buffered operand values (when available)
 Reservation station number of the instruction providing the operand values
RS fetches and buffers an operand as soon as it becomes available (not necessarily involving the register file)
Pending instructions designate the RS to which they will send their output
 Result values are broadcast on a result bus, called the common data bus (CDB)
 Only the last output updates the register file
As instructions are issued, the register specifiers are renamed with the reservation station
May be more reservation stations than registers
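A C sketch of one reservation-station entry; the fields follow the textbook's Busy/Op/Vj/Vk/Qj/Qk/A convention, but the types and the struct name are illustrative:

typedef struct {
    int    busy;    /* is this station in use? */
    int    op;      /* operation to perform on Vj and Vk */
    double Vj, Vk;  /* operand values, once available */
    int    Qj, Qk;  /* RS tags producing the operands; 0 = value already in Vj/Vk */
    int    A;       /* immediate or computed effective address (loads/stores) */
} ReservationStation;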

Tomasulo's Algorithm
Load and store buffers
 Contain data and addresses; act like reservation stations
(Figure: top-level design.)

Tomasulo's Algorithm
Three steps:
Issue
 Get the next instruction from the FIFO queue
 If an RS is available, issue the instruction to the RS, with operand values if available
 If operand values are not available, stall the instruction
Execute
 When an operand becomes available, store it in any reservation stations waiting for it
 When all operands are ready, execute the instruction
 Loads and stores are maintained in program order through effective address calculation
 No instruction is allowed to initiate execution until all branches that precede it in program order have completed
Write result
 Write result on the CDB into reservation stations and store buffers
» (Stores must wait until both address and value are received)
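A sketch of the Write Result step, reusing the ReservationStation struct assumed above: a completed result is broadcast on the CDB with its producer's tag, and every waiting reservation station snoops it (function and parameter names are illustrative):

void cdb_broadcast(int tag, double value,
                   ReservationStation rs[], int num_rs)
{
    for (int i = 0; i < num_rs; i++) {
        if (!rs[i].busy) continue;
        if (rs[i].Qj == tag) { rs[i].Vj = value; rs[i].Qj = 0; }  /* operand j ready */
        if (rs[i].Qk == tag) { rs[i].Vk = value; rs[i].Qk = 0; }  /* operand k ready */
    }
    /* store buffers and register-file entries whose rename tag matches
       latch the value the same way */
}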

Example
(Figure: Tomasulo's algorithm worked example.)

Hardware-Based Speculation
Execute instructions along predicted execution paths, but only commit the results if the prediction was correct
Instruction commit: allowing an instruction to update the register file when the instruction is no longer speculative
Need an additional piece of hardware to prevent any irrevocable action until an instruction commits, i.e., updating state or taking an exception

Reorder Buffer
Reorder buffer - holds the result of an instruction between completion and commit
Four fields:
 Instruction type: branch/store/register
 Destination field: register number
 Value field: output value
 Ready field: completed execution?
Modify reservation stations:
 Operand source is now the reorder buffer instead of the functional unit
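A C sketch of one reorder-buffer entry with the four fields named above (the enum and type names are illustrative):

typedef enum { ROB_BRANCH, ROB_STORE, ROB_REGISTER } RobType;

typedef struct {
    RobType type;   /* instruction type: branch, store, or register operation */
    int     dest;   /* destination register number (or store address) */
    double  value;  /* output value, once computed */
    int     ready;  /* has the instruction completed execution? */
} RobEntry;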

Reorder Buffer
Register values and memory values are not written until an instruction commits
On misprediction:
 Speculated entries in the ROB are cleared
Exceptions:
 Not recognized until the instruction is ready to commit

Multiple Issue and Static Scheduling
To achieve CPI < 1, need to complete multiple instructions per clock
Solutions:
 Statically scheduled superscalar processors
 VLIW (very long instruction word) processors
 Dynamically scheduled superscalar processors

Multiple Issue
(Figure: comparison of multiple-issue approaches.)

VLIW Processors
Package multiple operations into one instruction
Example VLIW processor (a sketch of one such instruction follows):
 One integer instruction (or branch)
 Two independent floating-point operations
 Two independent memory references
Must be enough parallelism in the code to fill the available slots
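An illustrative C encoding of one instruction for the example VLIW machine above; the slot layout matches the slide, but the field names and 32-bit per-slot encoding are assumptions:

#include <stdint.h>

typedef struct {
    uint32_t int_or_branch;  /* one integer operation or branch */
    uint32_t fp_op[2];       /* two independent floating-point operations */
    uint32_t mem_op[2];      /* two independent memory references */
} VliwInstruction;           /* all five slots issue together each cycle */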

VLIW Processors
Disadvantages:
 Statically finding parallelism
 Code size
 No hazard detection hardware
 Binary code compatibility

Dynamic Scheduling, Multiple Issue, and Speculation
Modern microarchitectures: dynamic scheduling + multiple issue + speculation
Two approaches:
 Assign reservation stations and update the pipeline control table in half clock cycles
» Only supports 2 instructions/clock
 Design logic to handle any possible dependencies between the instructions
Hybrid approaches
Issue logic can become a bottleneck

Dynamic Scheduling, Multiple Issue, and Speculation
(Figure: overview of the design.)

Multiple Issue
Limit the number of instructions of a given class that can be issued in a "bundle"
 e.g., one FP, one integer, one load, one store
Examine all the dependencies among the instructions in the bundle (a sketch of this check follows)
If dependencies exist in the bundle, encode them in the reservation stations
Also need multiple completion/commit
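A C sketch of the intra-bundle dependence check, assuming a simple instruction record with one destination and two source registers (all names illustrative):

typedef struct { int dst, src1, src2; } Instr;

/* Returns 1 if any later instruction j in the bundle depends on an
   earlier instruction i (RAW, WAW, or WAR). */
int bundle_has_dependence(const Instr *b, int n)
{
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++) {
            if (b[j].src1 == b[i].dst || b[j].src2 == b[i].dst) return 1; /* RAW */
            if (b[j].dst == b[i].dst)                           return 1; /* WAW */
            if (b[j].dst == b[i].src1 || b[j].dst == b[i].src2) return 1; /* WAR */
        }
    return 0;
}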

Example
Loop: LD R2,0(R1) ;R2 = array element
DADDIU R2,R2,#1 ;increment R2
SD R2,0(R1) ;store result
DADDIU R1,R1,#8 ;increment pointer
BNE R2,R3,LOOP ;branch if not last element

Example (No Speculation)
(Figure: issue/execute/write timing without speculation.)

Dual Issue/Dual Commit Example
(Figure: issue/execute/write/commit timing with speculation.)

Branch-Target Buffer
Need high instruction bandwidth!
Branch-target buffers:
 Next-PC prediction buffer, indexed by the current PC
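A C sketch of a direct-mapped branch-target buffer: looked up with the current PC during IF, a tag hit returns the predicted next PC; the table size and names are illustrative:

#include <stdint.h>

#define BTB_ENTRIES 512                   /* assumed table size */

typedef struct {
    uint32_t tag;      /* full PC of the branch stored in this entry */
    uint32_t target;   /* predicted next PC */
    int      valid;
} BtbEntry;

static BtbEntry btb[BTB_ENTRIES];

/* Returns 1 and sets *next_pc on a hit; on a miss, fetch falls back to PC + 4. */
int btb_lookup(uint32_t pc, uint32_t *next_pc)
{
    BtbEntry *e = &btb[(pc >> 2) % BTB_ENTRIES];
    if (e->valid && e->tag == pc) { *next_pc = e->target; return 1; }
    return 0;
}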

Reorder Buffer
(Figure: reorder buffer organization.)

Review of Data Hazards
Assume instruction i comes before instruction j
 Instruction j is data dependent on i if
» i produces a result that j uses, or
» j depends on k and k depends on i
 Name dependence - two instructions use the same register
» Antidependence: j writes a register that i reads
» Output dependence: both i and j write the same register
 Hazards: RAW, WAW, WAR