The University of Adelaide, School of Computer Science

Presentation transcript:

Instruction-Level Parallelism and Its Exploitation
COMPUTER ARCHITECTURE
Subhajit Sidhanta
March 7, 2018
Chapter 2, Instructions: Language of the Computer

Computer Architecture: A Quantitative Approach, Fifth Edition
Chapter 3: Instruction-Level Parallelism and Its Exploitation
Copyright © 2012, Elsevier Inc. All rights reserved.

Introduction
- Pipelining became a universal technique in 1985
  - Overlaps execution of instructions
  - Exploits “instruction-level parallelism”
- Beyond this, there are two main approaches:
  - Hardware-based dynamic approaches
    - Used in server and desktop processors
    - Not used as extensively in modern parallel processors
  - Compiler-based static approaches
    - Not as successful outside of scientific applications

Instruction-Level Parallelism
- When exploiting instruction-level parallelism, the goal is to minimize CPI
  - Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls
- Parallelism within a basic block is limited
  - Typical size of a basic block = 3-6 instructions
  - Must optimize across branches
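
As a rough numeric illustration of the CPI breakdown above, the sketch below adds per-instruction stall contributions to an ideal CPI. The function name and every stall figure are assumptions invented for the example, not measurements from the slides; the point is only that lowering any stall term lowers the pipeline CPI.

```python
def pipeline_cpi(ideal_cpi, structural, data_hazard, control):
    """Pipeline CPI = ideal CPI + average stall cycles per instruction."""
    return ideal_cpi + structural + data_hazard + control


# Hypothetical stall contributions, in cycles per instruction.
cpi = pipeline_cpi(ideal_cpi=1.0, structural=0.05, data_hazard=0.30, control=0.15)
ipc = 1.0 / cpi  # instructions per clock is the reciprocal of CPI
```

With these made-up numbers the stalls add half a cycle to every instruction, so removing control stalls alone would bring the CPI down from 1.5 to 1.35.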

Data Dependence
- Loop-level parallelism
  - Unroll loop statically or dynamically
  - Use SIMD (vector processors and GPUs)
- Challenges: data dependency
  - Instruction j is data dependent on instruction i if either:
    - Instruction i produces a result that may be used by instruction j, or
    - Instruction j is data dependent on instruction k and instruction k is data dependent on instruction i
- Dependent instructions cannot be executed simultaneously

Data Dependence
- Dependencies are a property of programs
- Pipeline organization determines if a dependence is detected and if it causes a stall
- Data dependence conveys:
  - Possibility of a hazard
  - Order in which results must be calculated
  - Upper bound on exploitable instruction-level parallelism
- Dependencies that flow through memory locations are difficult to detect

Name Dependence
- Two instructions use the same name but there is no flow of information
  - Not a true data dependence, but is a problem when reordering instructions
  - Antidependence: instruction j writes a register or memory location that instruction i reads
    - Initial ordering (i before j) must be preserved
  - Output dependence: instruction i and instruction j write the same register or memory location
    - Ordering must be preserved
- To resolve, use renaming techniques

Other Factors
- Data hazards
  - Read after write (RAW)
  - Write after write (WAW)
  - Write after read (WAR)
- Control dependence
  - Ordering of instruction i with respect to a branch instruction
    - An instruction that is control dependent on a branch cannot be moved before the branch so that its execution is no longer controlled by the branch
    - An instruction that is not control dependent on a branch cannot be moved after the branch so that its execution is controlled by the branch
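
The three data hazards can be checked mechanically from the registers an instruction pair reads and writes. The sketch below is a simplified model that looks at register names only and ignores memory; the `(dest, sources)` tuple encoding and the function name are assumptions chosen for the example.

```python
def classify_hazards(first, second):
    """Return the hazards that constrain reordering `second` ahead of `first`.

    Each instruction is a (dest, sources) pair: dest is a register name or
    None, and sources is a tuple of register names. `first` comes earlier
    in program order.
    """
    dest_i, srcs_i = first
    dest_j, srcs_j = second
    hazards = set()
    if dest_i is not None and dest_i in srcs_j:
        hazards.add("RAW")  # j reads what i writes: true data dependence
    if dest_j is not None and dest_j in srcs_i:
        hazards.add("WAR")  # j writes what i reads: antidependence
    if dest_i is not None and dest_i == dest_j:
        hazards.add("WAW")  # both write the same name: output dependence
    return hazards


add = ("F6", ("F0", "F8"))    # ADD.D F6,F0,F8
sub = ("F8", ("F10", "F14"))  # SUB.D F8,F10,F14
mul = ("F6", ("F10", "F8"))   # MUL.D F6,F10,F8
```

For example, `classify_hazards(add, sub)` reports only a WAR hazard on F8, which is a name dependence and could therefore be removed by renaming.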

Examples: Data Hazard [figure]

Examples: Control Hazard
- Example 1:
          DADDU R1,R2,R3
          BEQZ  R4,L
          DSUBU R1,R5,R6
    L:    …
          OR    R7,R1,R8
  - OR instruction dependent on DADDU and DSUBU
- Example 2:
          BEQZ  R12,skip
          DSUBU R4,R5,R6
          DADDU R5,R4,R9
    skip: OR    R7,R8,R9
  - Assume R4 isn’t used after skip
  - Possible to move DSUBU before the branch

Compiler Techniques for Exposing ILP
- Pipeline scheduling
  - Separate a dependent instruction from the source instruction by the pipeline latency of the source instruction
- Example:
    for (i=999; i>=0; i=i-1)
        x[i] = x[i] + s;

Pipeline Stalls
    Loop: L.D    F0,0(R1)
          stall
          ADD.D  F4,F0,F2
          stall
          stall
          S.D    F4,0(R1)
          DADDUI R1,R1,#-8
          stall  (assume integer load latency is 1)
          BNE    R1,R2,Loop

Pipeline Scheduling
- Scheduled code:
    Loop: L.D    F0,0(R1)
          DADDUI R1,R1,#-8
          ADD.D  F4,F0,F2
          stall
          stall
          S.D    F4,8(R1)
          BNE    R1,R2,Loop

Loop Unrolling
- Loop unrolling
  - Unroll by a factor of 4 (assume # elements is divisible by 4)
  - Eliminate unnecessary instructions
    Loop: L.D    F0,0(R1)
          ADD.D  F4,F0,F2
          S.D    F4,0(R1)       ;drop DADDUI & BNE
          L.D    F6,-8(R1)
          ADD.D  F8,F6,F2
          S.D    F8,-8(R1)      ;drop DADDUI & BNE
          L.D    F10,-16(R1)
          ADD.D  F12,F10,F2
          S.D    F12,-16(R1)    ;drop DADDUI & BNE
          L.D    F14,-24(R1)
          ADD.D  F16,F14,F2
          S.D    F16,-24(R1)
          DADDUI R1,R1,#-32
          BNE    R1,R2,Loop
- Note: number of live registers vs. original loop

Loop Unrolling/Pipeline Scheduling
- Pipeline schedule the unrolled loop:
    Loop: L.D    F0,0(R1)
          L.D    F6,-8(R1)
          L.D    F10,-16(R1)
          L.D    F14,-24(R1)
          ADD.D  F4,F0,F2
          ADD.D  F8,F6,F2
          ADD.D  F12,F10,F2
          ADD.D  F16,F14,F2
          S.D    F4,0(R1)
          S.D    F8,-8(R1)
          DADDUI R1,R1,#-32
          S.D    F12,16(R1)
          S.D    F16,8(R1)
          BNE    R1,R2,Loop

Strip Mining
- Unknown number of loop iterations?
  - Number of iterations = n
  - Goal: make k copies of the loop body
  - Generate a pair of loops:
    - First executes n mod k times, with the original loop body
    - Second executes n / k times, with the body unrolled k times
  - “Strip mining”
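
The pair of loops described above can be sketched for the running example `x[i] = x[i] + s` with k = 4. This is a high-level Python model of the transformation, not compiler output; the function name is an assumption, and the loop counts upward for readability whereas the slides' loop counts down.

```python
K = 4  # unroll factor; the unrolled body below is written out for K = 4


def add_scalar_strip_mined(x, s):
    """Add s to every element of x in place using a strip-mined loop pair."""
    n = len(x)
    remainder = n % K
    # First loop: n mod k iterations of the original, non-unrolled body.
    for i in range(remainder):
        x[i] = x[i] + s
    # Second loop: n // k iterations of the body unrolled k times, with a
    # single index update and loop test per four elements.
    for i in range(remainder, n, K):
        x[i] = x[i] + s
        x[i + 1] = x[i + 1] + s
        x[i + 2] = x[i + 2] + s
        x[i + 3] = x[i + 3] + s
    return x
```

Because the first loop absorbs the leftover iterations, the unrolled loop never indexes past the end of the array, whatever n turns out to be at run time.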

5 Loop Unrolling Decisions
Requires understanding how one instruction depends on another and how the instructions can be changed or reordered given the dependences:
1. Determine that unrolling the loop is useful by finding that the loop iterations are independent (except for maintenance code)
2. Use different registers to avoid unnecessary constraints forced by using the same registers for different computations
3. Eliminate the extra test and branch instructions and adjust the loop termination and iteration code
4. Determine that loads and stores in the unrolled loop can be interchanged by observing that loads and stores from different iterations are independent
   - Transformation requires analyzing memory addresses and finding that they do not refer to the same address
5. Schedule the code, preserving any dependences needed to yield the same result as the original code

Software Pipelining
- Observation: if iterations from loops are independent, then we can get more ILP by taking instructions from different iterations
- Software pipelining: reorganizes loops so that each iteration is made from instructions chosen from different iterations of the original loop (~ Tomasulo in SW)

Recall: Software Pipelining Example
- Before: unrolled 3 times
          L.D    F0,0(R1)
          ADD.D  F4,F0,F2
          S.D    0(R1),F4
          L.D    F6,-8(R1)
          ADD.D  F8,F6,F2
          S.D    -8(R1),F8
          L.D    F10,-16(R1)
          ADD.D  F12,F10,F2
          S.D    -16(R1),F12
          DSUBUI R1,R1,#24
          BNEZ   R1,LOOP
- After: software pipelined
          S.D    0(R1),F4       ; Stores M[i]
          ADD.D  F4,F0,F2       ; Adds to M[i-1]
          L.D    F10,-16(R1)    ; Loads M[i-2]
          DSUBUI R1,R1,#8
          BNEZ   R1,LOOP
- Symbolic loop unrolling
  - Maximize result-use distance
  - Less code space than unrolling
  - Fill & drain pipe only once per loop vs. once per each unrolled iteration in loop unrolling
[Figure: overlapped ops over time, software-pipelined loop vs. unrolled loop]
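
The store/add/load overlap in the software-pipelined loop can be modelled at a high level as a prologue, a kernel, and an epilogue. The sketch below is an illustrative Python model under the assumption that iterations are independent; the function and variable names are invented for the example and do not correspond to registers in the slide.

```python
def add_scalar_software_pipelined(x, s):
    """Return [v + s for v in x], computed with a software-pipelined loop."""
    n = len(x)
    if n < 2:
        return [v + s for v in x]  # too short to fill the pipeline
    out = list(x)
    # Prologue: fill the pipeline.
    loaded = x[0]        # load for iteration 0
    summed = loaded + s  # add for iteration 0
    loaded = x[1]        # load for iteration 1
    # Kernel: each pass stores for iteration i, adds for iteration i+1,
    # and loads for iteration i+2.
    for i in range(n - 2):
        out[i] = summed
        summed = loaded + s
        loaded = x[i + 2]
    # Epilogue: drain the pipeline.
    out[n - 2] = summed
    summed = loaded + s
    out[n - 1] = summed
    return out
```

Within one kernel pass, no operation consumes a value produced earlier in that same pass, which is what lets a scheduler place the three operations close together without stalls.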

Loop Unrolling in VLIW
Clock | Memory reference 1 | Memory reference 2 | FP operation 1    | FP operation 2    | Int. op/branch
1     | L.D F0,0(R1)       | L.D F6,-8(R1)      |                   |                   |
2     | L.D F10,-16(R1)    | L.D F14,-24(R1)    |                   |                   |
3     | L.D F18,-32(R1)    | L.D F22,-40(R1)    | ADD.D F4,F0,F2    | ADD.D F8,F6,F2    |
4     | L.D F26,-48(R1)    |                    | ADD.D F12,F10,F2  | ADD.D F16,F14,F2  |
5     |                    |                    | ADD.D F20,F18,F2  | ADD.D F24,F22,F2  |
6     | S.D 0(R1),F4       | S.D -8(R1),F8      | ADD.D F28,F26,F2  |                   |
7     | S.D -16(R1),F12    | S.D -24(R1),F16    |                   |                   |
8     | S.D -32(R1),F20    | S.D -40(R1),F24    |                   |                   | DSUBUI R1,R1,#56
9     | S.D 8(R1),F28      |                    |                   |                   | BNEZ R1,LOOP
- Unrolled 7 times to avoid delays
- 7 results in 9 clocks, or 1.3 clocks per iteration (1.8X)
- Average: 2.5 ops per clock, 50% efficiency
- Note: need more registers in VLIW (15 vs. 6 in SS)
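
The utilization figures quoted for the unrolled VLIW loop can be re-derived by counting operations in the schedule. The short calculation below assumes five issue slots per clock (two memory, two FP, one integer/branch), as in the table; the variable names are chosen only for this example.

```python
# Operation counts read off the schedule: 7 loads, 7 FP adds, 7 stores,
# plus the DSUBUI and the BNEZ.
loads, adds, stores, other = 7, 7, 7, 2
clocks, slots_per_clock, results = 9, 5, 7

operations = loads + adds + stores + other
ops_per_clock = operations / clocks
efficiency = operations / (clocks * slots_per_clock)
clocks_per_result = clocks / results
```

This gives 23 operations in 45 available slots, which is consistent with the rounded figures of about 2.5 operations per clock, about 50% efficiency, and 1.3 clocks per iteration.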

Software Pipelining with Loop Unrolling in VLIW
Clock | Memory reference 1 | Memory reference 2 | FP operation 1    | FP operation 2 | Int. op/branch
1     | L.D F0,-48(R1)     | ST 0(R1),F4        | ADD.D F4,F0,F2    |                |
2     | L.D F6,-56(R1)     | ST -8(R1),F8       | ADD.D F8,F6,F2    |                | DSUBUI R1,R1,#24
3     | L.D F10,-64(R1)    | ST -16(R1),F12     | ADD.D F12,F10,F2  |                | BNEZ R1,LOOP
- Software pipelined across 9 iterations of the original loop
- In each iteration of the above loop, we:
  - Store to m, m-8, m-16 (iterations I-3, I-2, I-1)
  - Compute for m-24, m-32, m-40 (iterations I, I+1, I+2)
  - Load from m-48, m-56, m-64 (iterations I+3, I+4, I+5)
- 9 results in 9 cycles, or 1 clock per iteration
- Average: 3.3 ops per clock, 66% efficiency
- Note: need fewer registers for software pipelining (only using 7 registers here, was using 15)

Branch Prediction
- Basic 2-bit predictor:
  - For each branch: predict taken or not taken
  - If the prediction is wrong two consecutive times, change prediction
- Correlating predictor:
  - Multiple 2-bit predictors for each branch
  - One for each possible combination of outcomes of the preceding n branches
- Local predictor:
  - One for each possible combination of outcomes for the last n occurrences of this branch
- Tournament predictor:
  - Combine correlating predictor with local predictor
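
One common way to realize the basic 2-bit scheme is a saturating counter per branch. The sketch below is a minimal model under that assumption; the class name, the state encoding (0 and 1 predict not taken, 2 and 3 predict taken), and the helper function are illustrative choices rather than details taken from the slides.

```python
class TwoBitPredictor:
    """Saturating 2-bit counter: states 0-1 predict not taken, 2-3 taken."""

    def __init__(self, state=0):
        self.state = state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Move one step toward the observed outcome, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def count_mispredictions(outcomes, predictor):
    """Run the predictor over a list of booleans and count wrong guesses."""
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses


# A loop branch taken nine times and then not taken, executed three times.
loop_branch = ([True] * 9 + [False]) * 3
misses = count_mispredictions(loop_branch, TwoBitPredictor(state=3))
```

Starting from the strongly-taken state, the predictor misses only the single loop exit in each run, because one wrong guess is not enough to flip its prediction.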

2-bit Branch Predictor: branch predictor performance [figure]

Correlating Branch Predictor [figure]

Correlating Branch Prediction
- (m,n) predictor
  - Last m branch information → choose from 2^m branch predictors with n bits each
  - Global history of most recent m branches → m-bit register
- 2-level predictor
- (1,2) predictor: last branch information → choose between two 2-bit predictors
- Number of bits in an (m,n) predictor = 2^m * n * number of prediction entries selected by branch address
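
The storage formula and the history-indexed lookup can be sketched together. The code below is a simplified model of an (m,2) predictor with a single global history register; the class name, the modulo indexing by branch address, and the entry counts in the example are assumptions made for illustration.

```python
def predictor_bits(m, n, entries):
    """Total state bits of an (m,n) predictor: 2^m * n * entries."""
    return (2 ** m) * n * entries


class CorrelatingPredictor:
    """(m,2) predictor: global history selects one of 2^m 2-bit counters."""

    def __init__(self, m, entries):
        self.m = m
        self.entries = entries
        self.history = 0  # outcomes of the last m branches, one bit each
        self.counters = [[0] * (2 ** m) for _ in range(entries)]

    def predict(self, pc):
        return self.counters[pc % self.entries][self.history] >= 2

    def update(self, pc, taken):
        row = self.counters[pc % self.entries]
        count = row[self.history]
        row[self.history] = min(3, count + 1) if taken else max(0, count - 1)
        # Shift the new outcome into the m-bit global history register.
        mask = (1 << self.m) - 1
        self.history = ((self.history << 1) | int(taken)) & mask


# A (2,2) predictor with 1K entries uses as many bits as a plain 2-bit
# predictor, i.e. (0,2), with 4K entries.
bits_2_2 = predictor_bits(2, 2, 1024)
bits_0_2 = predictor_bits(0, 2, 4096)
```

With m = 1, a branch that strictly alternates between taken and not taken becomes fully predictable after a short warm-up, because each history value selects a counter that only ever sees one outcome.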

Branch Prediction Performance: branch predictor performance [figure]

Dynamic Scheduling
- Rearrange order of instructions to reduce stalls while maintaining data flow
- Advantages:
  - Compiler doesn’t need to have knowledge of microarchitecture
  - Handles cases where dependencies are unknown at compile time
- Disadvantages:
  - Substantial increase in hardware complexity
  - Complicates exceptions

Dynamic Scheduling
- Dynamic scheduling implies:
  - Out-of-order execution
  - Out-of-order completion
- Creates the possibility for WAR and WAW hazards
- Tomasulo’s approach
  - Tracks when operands are available
  - Introduces register renaming in hardware
    - Minimizes WAW and WAR hazards

Register Renaming
- Example:
    DIV.D F0,F2,F4
    ADD.D F6,F0,F8
    S.D   F6,0(R1)
    SUB.D F8,F10,F14
    MUL.D F6,F10,F8
- Antidependence on F8: ADD.D reads F8 before SUB.D writes it
- Antidependence on F6: S.D reads F6 before MUL.D writes it
- Name (output) dependence on F6: ADD.D and MUL.D both write F6

Register Renaming
- Example:
    DIV.D F0,F2,F4
    ADD.D S,F0,F8
    S.D   S,0(R1)
    SUB.D T,F10,F14
    MUL.D F6,F10,T
- Now only RAW hazards remain, which can be strictly ordered
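
The renaming in this example can be sketched as a single pass that gives each destination a fresh name and redirects later reads to it. The code below is an illustrative model, not the hardware mechanism: it renames every destination, whereas the slide introduces only the two temporaries S and T, and the `P0`, `P1`, ... names and the tuple encoding are assumptions.

```python
def rename(instructions):
    """Give every destination a fresh name and redirect later reads to it.

    Each instruction is (op, dest, sources). A dest of None marks a store,
    whose operands are all treated as sources.
    """
    latest = {}  # architectural name -> most recent renamed name
    renamed = []
    counter = 0
    for op, dest, sources in instructions:
        # Reads see the newest name of each source, if it has been renamed.
        new_sources = tuple(latest.get(reg, reg) for reg in sources)
        new_dest = dest
        if dest is not None:
            new_dest = f"P{counter}"  # hypothetical fresh register name
            counter += 1
            latest[dest] = new_dest
        renamed.append((op, new_dest, new_sources))
    return renamed


program = [
    ("DIV.D", "F0", ("F2", "F4")),
    ("ADD.D", "F6", ("F0", "F8")),
    ("S.D", None, ("F6", "R1")),
    ("SUB.D", "F8", ("F10", "F14")),
    ("MUL.D", "F6", ("F10", "F8")),
]
renamed_program = rename(program)
```

After renaming, no two instructions write the same name and no instruction writes a name that an earlier one reads, so only the true (RAW) dependences remain.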

Register Renaming
- Register renaming is provided by reservation stations (RS)
  - Each reservation station contains:
    - The instruction
    - Buffered operand values (when available)
    - Reservation station number of the instruction providing the operand values
  - RS fetches and buffers an operand as soon as it becomes available (not necessarily involving the register file)
  - Pending instructions designate the RS to which they will send their output
    - Result values are broadcast on a result bus, called the common data bus (CDB)
  - Only the last output updates the register file
  - As instructions are issued, the register specifiers are renamed with the reservation station
  - May be more reservation stations than registers

Tomasulo’s Algorithm [figure]

Tomasulo’s Algorithm
- Load and store buffers
  - Contain data and addresses, act like reservation stations
- Top-level design: [figure]

Tomasulo’s Algorithm
Three steps:
- Issue
  - Get the next instruction from the FIFO queue
  - If an RS is available, issue the instruction to the RS with operand values if available
  - If no RS is empty, stall
  - If operand values are not available, stall the instruction: keep track of the units that will produce the values and monitor the CDB
- Execute
  - When an operand becomes available, store it in any reservation stations waiting for it
  - When all operands are ready, begin executing the instruction
  - Loads and stores are maintained in program order: the effective address is computed from the base register and placed in the load buffer, and the access executes as soon as the memory unit is available
  - No instruction is allowed to initiate execution until all branches that precede it in program order have completed
- Write result
  - Write result on CDB into reservation stations and store buffers
    - (Stores must wait until address and value are received)
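
The issue and write-result steps can be sketched with a small data model. The code below is a heavily simplified illustration: it omits functional-unit timing, loads and stores, and branches, and the field names `vj`, `vk`, `qj`, `qk`, the station names, and the register-status dictionary are conventional choices assumed for the example rather than details given in the transcript.

```python
class ReservationStation:
    def __init__(self, name):
        self.name = name
        self.busy = False
        self.op = None
        self.vj = self.vk = None  # operand values, once known
        self.qj = self.qk = None  # stations that will produce missing operands

    def ready(self):
        """All operands present, so execution may begin."""
        return self.busy and self.qj is None and self.qk is None


def issue(station, op, src1, src2, dest, regs, reg_status):
    """Issue: copy available operands, else record the producing station."""
    station.busy, station.op = True, op
    station.vj = station.vk = station.qj = station.qk = None
    if src1 in reg_status:
        station.qj = reg_status[src1]
    else:
        station.vj = regs[src1]
    if src2 in reg_status:
        station.qk = reg_status[src2]
    else:
        station.vk = regs[src2]
    reg_status[dest] = station.name  # dest is renamed to this station


def broadcast(producer, value, stations, regs, reg_status):
    """Write result: send (producer, value) over the common data bus."""
    for rs in stations:
        if rs.qj == producer:
            rs.vj, rs.qj = value, None
        if rs.qk == producer:
            rs.vk, rs.qk = value, None
        if rs.name == producer:
            rs.busy = False  # the producing station is free again
    for reg, tag in list(reg_status.items()):
        if tag == producer:  # only the latest writer updates the register
            regs[reg] = value
            del reg_status[reg]
```

Issuing `DIV.D F0,F2,F4` and then `ADD.D F6,F0,F8` leaves the add waiting on the divide's station; the add becomes ready only when the divide's result is broadcast.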

Example [figure]

Hardware-Based Speculation
- Execute instructions along predicted execution paths but only commit the results if the prediction was correct
- Instruction commit: allowing an instruction to update the register file when the instruction is no longer speculative
- Need an additional piece of hardware to prevent any irrevocable action until an instruction commits
  - I.e. updating state or taking an exception

Hardware-Based Speculation Design [figure]

Reorder Buffer
- Reorder buffer: holds the result of an instruction between completion and commit
- Four fields:
  - Instruction type: branch/store/register
  - Destination field: register number
  - Value field: output value
  - Ready field: completed execution?
- Modify reservation stations:
  - Operand source is now the reorder buffer instead of the functional unit

Reorder Buffer
- Register values and memory values are not written until an instruction commits
- On misprediction:
  - Speculated entries in ROB are cleared
- Exceptions:
  - Not recognized until the instruction is ready to commit
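
In-order commit and misprediction recovery can be sketched with a queue of entries carrying the four fields listed for the reorder buffer. The code below is a minimal model that handles register results only; the class and method names are assumptions for illustration, and store and exception handling are left out.

```python
from collections import deque


class ReorderBuffer:
    def __init__(self):
        self.entries = deque()  # oldest instruction at the left

    def allocate(self, kind, dest):
        """Add an entry at issue time; kind is 'branch', 'store' or 'register'."""
        entry = {"type": kind, "dest": dest, "value": None, "ready": False}
        self.entries.append(entry)
        return entry

    def commit(self, regs):
        """Retire ready entries from the head, in program order."""
        committed = 0
        while self.entries and self.entries[0]["ready"]:
            entry = self.entries.popleft()
            if entry["type"] == "register":
                regs[entry["dest"]] = entry["value"]  # state changes only here
            committed += 1
        return committed

    def flush_after(self, branch_entry):
        """On a misprediction, discard every entry younger than the branch.

        Assumes branch_entry is still in the buffer.
        """
        while self.entries and self.entries[-1] is not branch_entry:
            self.entries.pop()
```

An instruction that finishes early simply waits in the buffer with its ready field set: `commit` stops at the first entry that is not ready, so register state is always updated in program order.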

Multiple Issue and Static Scheduling
- To achieve CPI < 1, need to complete multiple instructions per clock
- Solutions:
  - Statically scheduled superscalar processors
  - VLIW (very long instruction word) processors
  - Dynamically scheduled superscalar processors

Multiple Issue [figure]

VLIW Processors
- Package multiple operations into one instruction
- Example VLIW processor:
  - One integer instruction (or branch)
  - Two independent floating-point operations
  - Two independent memory references
- Must be enough parallelism in code to fill the available slots

Dynamic Scheduling, Multiple Issue, and Speculation: Overview of Design [figure]

VLIW Processors
- Disadvantages:
  - Statically finding parallelism
  - Code size
  - No hazard detection hardware
  - Binary code compatibility

Dynamic Scheduling, Multiple Issue, and Speculation
- Modern microarchitectures:
  - Dynamic scheduling + multiple issue + speculation
- Two approaches:
  - Assign reservation stations and update pipeline control table in half clock cycles
    - Only supports 2 instructions/clock
  - Design logic to handle any possible dependencies between the instructions
  - Hybrid approaches
- Issue logic can become the bottleneck

Multiple Issue
- Limit the number of instructions of a given class that can be issued in a “bundle”
  - E.g. one FP, one integer, one load, one store
- Examine all the dependencies among the instructions in the bundle
- If dependencies exist in the bundle, encode them in reservation stations
- Also need multiple completion/commit

Branch-Target Buffer
- Need high instruction bandwidth!
  - Branch-target buffers
    - Next PC prediction buffer, indexed by current PC
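
A next-PC prediction buffer indexed by the current PC can be sketched as a small direct-mapped table. The code below is an illustrative model only: the table size, the 4-byte instruction width, and the policy of storing taken branches only are assumptions made for the example.

```python
class BranchTargetBuffer:
    def __init__(self, entries=16):
        self.entries = entries
        self.table = [None] * entries  # each slot: (branch_pc, target) or None

    def _index(self, pc):
        return (pc // 4) % self.entries  # assumes 4-byte instructions

    def predict_next_pc(self, pc):
        slot = self.table[self._index(pc)]
        if slot is not None and slot[0] == pc:
            return slot[1]  # hit: fetch from the predicted target
        return pc + 4       # miss: assume sequential fetch

    def update(self, pc, taken, target):
        index = self._index(pc)
        if taken:
            self.table[index] = (pc, target)
        elif self.table[index] is not None and self.table[index][0] == pc:
            self.table[index] = None  # branch fell through: drop its entry
```

Because the lookup needs only the current PC, the prediction is available during fetch, before the instruction has even been decoded as a branch.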

Branch-Target Buffer: Problem [figure]

Branch Folding
- Optimization:
  - Larger branch-target buffer
  - Add target instruction into buffer to deal with the longer decoding time required by the larger buffer
    - “Branch folding”

Return Address Predictor
- A function return is an indirect jump: its destination address varies at runtime.
- Procedure returns account for about 15% of branches.
- Most indirect jumps come from function returns.
- The same procedure can be called from multiple sites.
- Calls from one site are not clustered in time, so a branch-target buffer may forget the return address from previous calls.
- Solution: create a return address buffer organized as a stack.
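The stack organization can be sketched as follows. This is a toy model: the class name, the default depth of 8, the 4-byte instruction size, and the drop-oldest overflow policy are all assumptions made for illustration.

```python
class ReturnAddressStack:
    """Small fixed-depth stack of predicted return addresses (a sketch)."""

    def __init__(self, depth=8):
        self.depth = depth
        self.stack = []

    def on_call(self, call_pc, instr_bytes=4):
        """Push the address of the instruction following the call."""
        if len(self.stack) == self.depth:
            self.stack.pop(0)  # overflow: the oldest entry is lost
        self.stack.append(call_pc + instr_bytes)

    def on_return(self):
        """Pop the predicted return address, or None if the stack is empty."""
        return self.stack.pop() if self.stack else None


ras = ReturnAddressStack()
# The same procedure called from two different sites:
ras.on_call(0x100)
from_first_site = ras.on_return()
ras.on_call(0x200)
from_second_site = ras.on_return()
```

A PC-indexed buffer would store a single target for the return instruction and mispredict whenever the call site changes. The stack predicts correctly for both sites, and for nested calls up to its depth.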

Integrated Instruction Fetch Unit
Design a monolithic unit that performs:
- Integrated branch prediction: the branch predictor becomes part of the instruction fetch unit.
- Integrated instruction prefetch: fetch ahead of the instructions currently needed.
- Instruction memory access and buffering: deal with fetches that cross cache lines.

Register Renaming
- Register renaming vs. reorder buffers:
  - Instead of virtual registers from the reservation stations and reorder buffer, create a single register pool.
  - The pool contains both the visible registers and the virtual registers.
  - Use a hardware-based map to rename registers during issue.
  - WAW and WAR hazards are avoided.
  - Speculation recovery occurs by copying during commit.
  - A ROB-like queue is still needed to update the table in order.
- Renaming simplifies commit:
  - Record that the mapping between the architectural register and the physical register is no longer speculative.
  - Free the physical register used to hold the older value.
  - In other words: swap physical registers on commit.
- Physical register de-allocation is more difficult.
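A small Python model of the map table, the free pool, and the two commit steps listed above. It is a sketch under assumptions: `RenameTable`, its fields, and the register counts are hypothetical, every instruction is assumed to write one destination, and misprediction recovery is left out.

```python
from collections import deque


class RenameTable:
    """Map table plus free list over a single physical register pool (a sketch)."""

    def __init__(self, arch_regs, num_phys):
        # Assumption: architectural register i starts out in physical register i.
        self.map = {reg: i for i, reg in enumerate(arch_regs)}
        self.free = deque(range(len(arch_regs), num_phys))
        # ROB-like queue, in program order: (arch_dst, new_phys, old_phys).
        self.queue = deque()

    def rename(self, dst, srcs):
        """Rename one instruction; return (phys_dst, phys_srcs)."""
        # Sources are looked up before the destination is remapped.
        phys_srcs = [self.map[s] for s in srcs]
        if not self.free:
            raise RuntimeError("no free physical register: issue must stall")
        new_phys = self.free.popleft()
        old_phys = self.map[dst]
        self.map[dst] = new_phys
        self.queue.append((dst, new_phys, old_phys))
        return new_phys, phys_srcs

    def commit(self):
        """Commit the oldest instruction: its mapping is no longer speculative,
        and the physical register holding the older value is freed."""
        dst, new_phys, old_phys = self.queue.popleft()
        self.free.append(old_phys)
        return dst, new_phys


rt = RenameTable(["F0", "F2", "F4"], num_phys=6)
i1 = rt.rename("F0", ["F2", "F4"])  # F0 <- F2 op F4
i2 = rt.rename("F4", ["F0"])        # reads the F0 written by i1
i3 = rt.rename("F0", ["F2"])        # writes F0 again (WAW with i1)
```

The two writes to `F0` land in different physical registers, so the WAW hazard between `i1` and `i3` disappears, and `i2` can still read the value produced by `i1` after `i3` has issued, which removes the WAR hazard as well.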

Integrated Issue and Renaming
Combining instruction issue with register renaming:
- The issue logic pre-reserves enough physical registers for the bundle (a fixed number?).
- The issue logic finds dependencies within the bundle and maps registers as necessary.
- The issue logic finds dependencies between the current bundle and bundles already in flight, and maps registers as necessary.
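The three steps above can be sketched as a single function. Everything here is an assumption made for illustration: the function name, the dict-based instruction format, and the list-based free pool. The loop renames the bundle sequentially, whereas hardware would produce the same mapping with parallel comparators.

```python
def rename_bundle(bundle, map_table, free_list):
    """Rename a whole issue bundle, or return None if the bundle must stall."""
    # Step 1: pre-reserve one physical register per destination in the bundle.
    needed = sum(1 for instr in bundle if instr["dst"] is not None)
    if len(free_list) < needed:
        return None  # not enough registers; nothing is modified
    reserved = [free_list.pop(0) for _ in range(needed)]

    renamed = []
    for instr in bundle:
        # Steps 2 and 3: map_table already reflects in-flight bundles, and it is
        # updated as we go, so later instructions see earlier ones in this bundle.
        srcs = [map_table[s] for s in instr["srcs"]]
        dst = None
        if instr["dst"] is not None:
            dst = reserved.pop(0)
            map_table[instr["dst"]] = dst
        renamed.append({"op": instr["op"], "dst": dst, "srcs": srcs})
    return renamed


# Hypothetical state: mappings left behind by earlier bundles, small free pool.
map_table = {"R1": 1, "R2": 2, "R3": 3}
free_list = [10, 11, 12]

first = rename_bundle(
    [
        {"op": "add", "dst": "R1", "srcs": ["R2", "R3"]},
        {"op": "sub", "dst": "R2", "srcs": ["R1", "R3"]},  # depends on the add
    ],
    map_table,
    free_list,
)
second = rename_bundle(
    [
        {"op": "mul", "dst": "R3", "srcs": ["R1", "R2"]},
        {"op": "add", "dst": "R1", "srcs": ["R3", "R3"]},
    ],
    map_table,
    free_list,
)
```

In this run the `sub` reads the physical register just assigned to the `add`, which is the within-bundle case. The second bundle stalls because only one free register remains.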

How Much to Speculate?
- Mis-speculation degrades performance and power relative to no speculation.
  - It may cause additional misses (cache, TLB).
  - Prevent speculative code from causing higher-cost misses (e.g. in L2).
- Speculating through multiple branches:
  - Complicates speculation recovery.
  - No processor could resolve multiple branches per cycle when the source text was written (2012).

Energy Efficiency
- Speculation and energy efficiency: speculation is only energy efficient when it significantly improves performance.
- Value prediction:
  - Uses: loads that read from a constant pool, and instructions that produce a value from a small set of values.
  - It had not been incorporated into processors when the source text was written (2012).
  - A similar idea, address aliasing prediction, is used on some processors.