Chapter 3: ILP and Its Exploitation


Chapter 3: ILP and Its Exploitation
- Review of the simple static pipeline
- ILP overview
- Dynamic branch prediction
- Dynamic scheduling and out-of-order execution
- Multiple issue (superscalar)
- Hardware-based speculation
- ILP limitations
- Intel Core i7 and ARM Cortex-A8

Fall 2016, CDA5155

The University of Adelaide, School of Computer Science, 8 December 2018

Introduction
- Pipelining became a universal technique in 1985
  - Overlaps the execution of instructions
  - Exploits "instruction-level parallelism" (ILP)
- Beyond basic pipelining, there are two main approaches:
  - Hardware-based dynamic approaches
    - Used in server and desktop processors
    - Not used as extensively in PMD (mobile) processors
  - Compiler-based static approaches
    - Not as successful outside of scientific applications

Copyright © 2012, Elsevier Inc. All rights reserved.

Instruction-Level Parallelism
- When exploiting instruction-level parallelism, the goal is to minimize CPI:
  Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls
- Parallelism within a basic block is limited
  - Typical basic block size: 3-6 instructions
- Must optimize across branches to avoid stalls

Data Dependence
- Loop-level parallelism:
  - Unroll the loop statically or dynamically
  - Use SIMD (vector processors and GPUs)
- Challenge: data dependence. Instruction j is data dependent on instruction i if:
  - Instruction i produces a result that may be used by instruction j, or
  - Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i
- Dependent instructions cannot be executed simultaneously

Data Dependence (cont.)
- Dependences are a property of programs
- The pipeline organization determines whether a dependence is detected and whether it causes a stall
- A data dependence conveys:
  - The possibility of a hazard
  - The order in which results must be calculated
  - An upper bound on exploitable instruction-level parallelism
- Dependences that flow through memory locations are difficult to detect
  - Called memory dependences, as opposed to register dependences

Data Dependence: Recursive Definition
- Instruction B is data dependent on instruction A iff:
  - B uses a data result produced by A, or
  - There is another instruction C such that B is data dependent on C, and C is data dependent on A
- Data dependence means the potential for a RAW hazard

Loop: LD    F0,0(R1)
      ADDD  F4,F0,F2
      SD    0(R1),F4
      SUBI  R1,R1,#8
      BNEZ  R1,Loop

- Dependence chains in the loop: LD -> ADDD -> SD (through F0 and F4), and SUBI -> BNEZ (through R1)

Name Dependence
- Two instructions use the same register or memory name, but there is no flow of information
- Not a true data dependence, but a problem when reordering instructions
- Antidependence (WAR): instruction j writes a register or memory location that instruction i reads
  - The initial ordering (i before j) must be preserved
- Output dependence (WAW): instructions i and j write the same register or memory location
  - The ordering must be preserved
- To resolve, use renaming techniques

WAR (Anti-) and WAW (Output-) Hazard Examples

WAR hazard: instruction J writes an operand before instruction I reads it.
  I: sub r4,r1,r3
  J: add r1,r2,r3
  K: mul r6,r1,r7

WAW hazard: instruction J writes an operand before instruction I writes it.
  I: sub r1,r4,r3
  J: add r1,r2,r3
  K: mul r6,r1,r7
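The WAR case above can be acted out in a short sketch (the register values and the physical-register name p1 are invented for illustration): renaming J's destination removes the hazard, so J and K need not wait for I, while I still reads the old r1.

```python
# WAR hazard demo: I reads r1, then J writes r1. If J ran first, I would
# read the wrong value. Renaming J's destination to a fresh physical
# register (here called "p1", a hypothetical name) removes the hazard.
regs = {"r1": 10, "r2": 3, "r3": 4, "r7": 2}

# I: sub r4,r1,r3  -- still reads the original r1
regs["r4"] = regs["r1"] - regs["r3"]
# J: add r1,r2,r3  -- destination renamed from r1 to p1
regs["p1"] = regs["r2"] + regs["r3"]
# K: mul r6,r1,r7  -- K's read of r1 is renamed to p1 as well
regs["r6"] = regs["p1"] * regs["r7"]

print(regs["r4"], regs["r6"])  # I saw the old r1; K saw J's new value
```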

Control Dependence
- Ordering of instruction i with respect to a branch instruction:
  - An instruction control dependent on a branch cannot be moved before the branch, so that its execution is no longer controlled by the branch
  - An instruction not control dependent on a branch cannot be moved after the branch, so that its execution is controlled by the branch
- Handling branches:
  - Static branch prediction or instruction scheduling
  - Dynamic branch prediction with out-of-order execution

Instruction Scheduling Examples: Branches

Example 1:
      DADDU R1,R2,R3
      BEQZ  R4,L
      DSUBU R1,R1,R6
  L:  OR    R7,R1,R8

Example 2:
      BEQZ  R12,skip
      DSUBU R4,R5,R6
      DADDU R5,R4,R9
skip: OR    R7,R8,R9

- In Example 1, the OR instruction depends on both DADDU and DSUBU, so neither can be moved across the branch
- In Example 2, assuming R4 is not used after skip, it is possible to move DSUBU before the branch

Compiler Techniques for Exposing ILP
- Pipeline scheduling (by the compiler):
  - Separate a dependent instruction from its source instruction by the pipeline latency of the source instruction
- Running example (Loop):

  for (i = 999; i >= 0; i = i - 1)
      x[i] = x[i] + s;

Pipeline Stalls

Loop: L.D    F0,0(R1)
      stall
      ADD.D  F4,F0,F2
      stall
      stall
      S.D    F4,0(R1)
      DADDUI R1,R1,#-8
      stall             ; assume integer ALU latency of 1
      BNE   R1,R2,Loop  ; must resolve in the ID stage

9 clock cycles per iteration before scheduling.

Apply Instruction Scheduling

1 Loop: L.D    F0,0(R1)
2       DADDUI R1,R1,#-8
3       ADD.D  F4,F0,F2
4       stall
5       stall
6       S.D    F4,8(R1)  ; altered offset: 8 - 8 = 0
7       BNEZ   R1,Loop

- Limited by the small basic block (5 instructions)
- 7 clock cycles per iteration, but only 3 do real work (L.D, ADD.D, S.D); the other 4 are stalls and loop overhead
- Can we go faster?

Unroll Loop Four Times

1  Loop: L.D    F0,0(R1)
3        ADD.D  F4,F0,F2
6        S.D    F4,0(R1)    ; drop DSUBUI/BNEZ
7        L.D    F6,-8(R1)
9        ADD.D  F8,F6,F2
12       S.D    F8,-8(R1)   ; drop DSUBUI/BNEZ
13       L.D    F10,-16(R1)
15       ADD.D  F12,F10,F2
18       S.D    F12,-16(R1) ; drop DSUBUI/BNEZ
19       L.D    F14,-24(R1)
21       ADD.D  F16,F14,F2
24       S.D    F16,-24(R1)
25       DADDUI R1,R1,#-32  ; decrement altered to 4*8
27       BNEZ   R1,Loop

- 27 clock cycles (the numbers are cycle counts; gaps are stalls), or 6.75 per original iteration
- Assumes the iteration count is a multiple of 4

Unrolled and Rescheduled Loop

1  Loop: L.D    F0,0(R1)
2        L.D    F6,-8(R1)   ; adjusted displacements
3        L.D    F10,-16(R1)
4        L.D    F14,-24(R1)
5        ADD.D  F4,F0,F2
6        ADD.D  F8,F6,F2
7        ADD.D  F12,F10,F2
8        ADD.D  F16,F14,F2
9        S.D    F4,0(R1)
10       S.D    F8,-8(R1)
11       S.D    F12,-16(R1)
12       DSUBUI R1,R1,#32
13       S.D    F16,8(R1)   ; 8 - 32 = -24
14       BNEZ   R1,Loop

- 14 clock cycles, no stalls, or 3.5 per original iteration
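The effect of unrolling by four can be sketched in plain code (the function name is mine; the slide's assumption that the trip count is a multiple of 4 carries over):

```python
def add_scalar_unrolled(x, s):
    """Unrolled-by-4 version of: for i in 999..0: x[i] += s.
    Assumes len(x) is a multiple of 4, as on the slide."""
    i = len(x) - 1
    while i >= 0:
        # Four independent copies of the body: the scheduler can group
        # the loads, then the adds, then the stores, hiding FP latency.
        x[i] += s
        x[i - 1] += s
        x[i - 2] += s
        x[i - 3] += s
        i -= 4  # one loop-overhead update per four elements
    return x

print(add_scalar_unrolled([1.0] * 8, 2.0))
```

Each pass now pays the decrement-and-branch overhead once per four elements instead of once per element, which is where the 6.75-then-3.5 cycles-per-iteration improvement comes from.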

Three Unrolling Considerations
- Decreasing returns on overhead: each extra unrolling amortizes less of the begin/end loop overhead (Amdahl's Law)
- Growth in code size: for larger loops, the concern is that it increases the instruction-cache miss rate
- Register pressure: aggressive unrolling and scheduling may create more live values than available registers; if not all live values can be allocated to registers, the transformation may lose some or all of its advantage
- Loop unrolling also reduces the impact of branches on the pipeline; branch prediction is another way

Strip Mining
- What if the number of loop iterations n is unknown at compile time, and we want k unrolled copies of the loop body?
- Generate a pair of loops:
  - The first executes n mod k iterations (not unrolled)
  - The second executes n / k passes of the k-times-unrolled body
- This is called "strip mining"
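The two generated loops can be sketched as follows (function and parameter names are illustrative):

```python
def strip_mine(n, k, body):
    """Strip mining for an unknown trip count n with unroll factor k:
    a cleanup loop runs n mod k iterations, then the unrolled loop
    runs n // k passes of k body copies."""
    done = n % k
    for i in range(done):        # first loop: leftover iterations
        body(i)
    for _ in range(n // k):      # second loop: the unrolled version
        for j in range(k):       # stands in for the k unrolled copies
            body(done + j)
        done += k

visited = []
strip_mine(10, 4, visited.append)
print(visited)  # every index 0..9, each exactly once
```

With n = 10 and k = 4, the cleanup loop handles indices 0 and 1, then two passes of the unrolled body cover 2..5 and 6..9.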

Control Stalls: Static Branch Prediction
- To fill delayed-branch slots and reorder code around branches, the compiler must predict branches statically at compile time
- Simplest scheme: predict every branch as taken
  - Average misprediction rate = untaken frequency = 34% on SPEC
- More accurate: profile-based prediction (see figure)

Dynamic Branch Prediction
- Why does prediction work?
  - The underlying algorithm has regularities
  - The data being operated on has regularities
  - The instruction sequence has redundancies that are artifacts of the way humans and compilers think about problems
- Is dynamic branch prediction better than static branch prediction?
  - It seems to be
  - A small number of important branches in programs have dynamic behavior

Dynamic Branch Prediction (cont.)
- As the amount of ILP exploited increases (CPI decreases), the impact of control stalls grows:
  - Branches come more often
  - A longer n-cycle delay postpones more instructions
- Dynamic hardware branch prediction:
  - "Learns" which branches are taken or not
  - Makes the right guess most of the time about whether a branch is taken
  - The delay depends on whether the prediction is correct and whether the branch is taken

Branch-Prediction Buffers (BPB)
- Also called a "branch history table"
- The low-order n bits of the branch address index a table of branch-history data used for prediction
  - There may be "collisions" between distant branches; associative tables are possible but expensive
- Each entry stores k bits of information about the history of that branch (also called the predictor); common values of k: 1, 2, and larger
- The entry is used to predict what the branch will do; the actual outcome of the branch updates the entry
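The indexing scheme can be sketched in a few lines (the table size and addresses are made-up values; the ~4K figure echoes a later slide on buffer sizing):

```python
TABLE_SIZE = 4096  # power of two; a later slide notes little gain beyond ~4K

def bpb_index(branch_pc):
    # The low-order bits of the word-aligned branch address index the
    # table; the 2 byte-alignment bits are dropped first.
    return (branch_pc >> 2) & (TABLE_SIZE - 1)

# Two branches whose addresses differ by exactly TABLE_SIZE words
# collide: they share one predictor entry ("aliasing").
a = 0x00400000
b = a + TABLE_SIZE * 4
print(bpb_index(a) == bpb_index(b))  # True: a collision
```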

Branch Prediction Schemes
- Basic 2-bit predictor, for each branch:
  - Predict taken or not taken
  - If the prediction is wrong two consecutive times, change the prediction
- Correlating predictor:
  - Multiple 2-bit predictors for each branch
  - One for each possible combination of outcomes of the preceding n branches
- Local predictor:
  - One predictor for each possible combination of outcomes of the last n occurrences of this branch
- Tournament predictor:
  - Combines a correlating predictor with a local predictor

1-bit Branch Prediction
- The entry for a branch has only two states:
  - Bit = 1: "The last time this branch was encountered, it was taken. I predict it will be taken next time."
  - Bit = 0: "The last time this branch was encountered, it was not taken. I predict it will not be taken next time."
- Makes 2 mistakes each time a loop is encountered: at the end of the first and last iterations
- May always mispredict in pathological cases!
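A minimal simulation, assuming 1 = taken and 0 = not taken (the function name is mine), confirms the two-mistakes-per-loop behavior:

```python
def mispredictions_1bit(outcomes, bit=0):
    """1-bit predictor: predict the branch does what it did last time.
    outcomes: sequence of 1 (taken) / 0 (not taken)."""
    wrong = 0
    for taken in outcomes:
        if bit != taken:
            wrong += 1
        bit = taken  # remember only the most recent outcome
    return wrong

# A 4-iteration loop branch (taken, taken, taken, exit) encountered twice:
print(mispredictions_1bit([1, 1, 1, 0] * 2))  # 4: two mistakes per encounter
```

The pathological case is an alternating branch: the 1-bit predictor is wrong every single time.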

2-bit Branch Prediction
- 4 states, based on the two most recent outcomes of the branch
- Only 1 misprediction per loop execution (at the last iteration), once the loop has been reached before
- What about an n-bit predictor?
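The 4-state scheme is usually realized as a 2-bit saturating counter; a minimal sketch (the function name is mine):

```python
def mispredictions_2bit(outcomes, state=0):
    """2-bit saturating counter: states 0-1 predict not taken, 2-3
    predict taken. Two consecutive wrong predictions are needed
    before the predicted direction flips."""
    wrong = 0
    for taken in outcomes:
        if (state >= 2) != bool(taken):
            wrong += 1
        # Saturate toward 3 on taken, toward 0 on not taken.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return wrong

# The same 4-iteration loop run three times, starting cold at strongly-NT:
print(mispredictions_2bit([1, 1, 1, 0] * 3))  # 5 = 3 (cold start) + 1 + 1
```

After warm-up, only the loop-exit branch is mispredicted, matching the one-miss-per-execution claim above.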

State Transition Diagram (2-bit predictor, 4 states)
- Note: the state transitions can be modified depending on the prediction algorithm

Implementing Branch Histories
- Option 1: a separate "cache" (prediction table) accessed during the IF stage to obtain the prediction
- Option 2: extra bits in the instruction cache
- Problem with either approach in the 5-stage MIPS pipeline:
  - After fetch, we don't know whether the instruction is actually a branch (until decode), nor its target address
  - By the time we know these things (in ID), we already know whether the branch is really taken, so no time has been saved!
- Branch-target buffers (BTBs) can fix this problem (later)

Branch-Target Buffers (BTB)
- How can we know the address of the next instruction as soon as the current instruction is fetched?
- Normally, an extra (ID) cycle is needed to:
  - Determine that the instruction is a branch
  - Determine whether the branch is taken
  - Compute the target address PC + offset
- Branch prediction alone doesn't help here: we need the next PC
- What if the next instruction address could instead be fetched at the same time as the current instruction? That is what the BTB provides

BTB Schematic
- Note: the BTB structure is similar to a cache, and holds taken branches only
- It is expensive, so the predictor for untaken branches can be kept separate (only branches predicted taken enter the BTB)
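A toy model of the "taken branches only" policy (class and method names are mine; a real BTB is a small tagged hardware table, not a dictionary):

```python
class BTB:
    """Minimal branch-target-buffer sketch: maps the PC of a
    predicted-taken branch to its target, so the fetch stage can
    redirect in the same cycle the branch is fetched."""
    def __init__(self):
        self.entries = {}          # branch PC -> target PC

    def next_pc(self, pc):
        # Hit: fetch from the stored target; miss: fall through to pc+4.
        return self.entries.get(pc, pc + 4)

    def resolve(self, pc, taken, target):
        if taken:
            self.entries[pc] = target    # enter/refresh on a taken branch
        else:
            self.entries.pop(pc, None)   # untaken branches leave the BTB

btb = BTB()
btb.resolve(0x1000, True, 0x2000)    # the branch at 0x1000 was taken
print(hex(btb.next_pc(0x1000)))      # 0x2000: redirected at fetch
print(hex(btb.next_pc(0x1004)))      # 0x1008: ordinary fall-through
```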

Branch Resolution in the 5-stage Pipeline
- With a BTB, the branch and its target address are resolved in the IF stage, so fetch can continue from the target without a bubble
- This flowchart is based on a 1-bit predictor

Misprediction Rate for 2-bit BPB

Branch-Prediction Performance
- The contribution to cycle count depends on:
  - Branch frequency and misprediction frequency
  - Frequencies of taken/not taken and predicted/mispredicted
  - Delays for taken/not taken and predicted/mispredicted
- How to reduce misprediction frequency?
  - Increase buffer size to avoid collisions: little effect beyond ~4,096 entries
  - Increase prediction accuracy:
    - More bits per entry (little effect beyond 2)
    - A different prediction scheme (correlating predictors, tournament predictors)

Correlated Prediction Example

Code fragment from eqntott:
  if (aa == 2) aa = 0;   /* if1 */
  if (bb == 2) bb = 0;   /* if2 */
  if (aa != bb) { ... }  /* false if both if1 and if2 conditions held */

MIPS code (aa in R1, bb in R2):
      SUBUI R3,R1,#2   ; (aa-2)
      BNEZ  R3,L1      ; branch b1 (aa!=2)
      ADD   R1,R0,R0   ; aa=0
  L1: SUBUI R3,R2,#2   ; (bb-2)
      BNEZ  R3,L2      ; branch b2 (bb!=2)
      ADD   R2,R0,R0   ; bb=0
  L2: SUBU  R3,R1,R2   ; (aa-bb)
      BEQZ  R3,L3      ; branch b3 (aa==bb); must be taken if b1 and b2 are untaken

Even Simpler Example

C code (d in R1):
  if (d == 0) d = 1;
  if (d == 1) ...

MIPS code:
      BNEZ  R1,L1      ; b1: taken if d != 0
      ADDI  R1,R0,#1   ; d = 1
  L1: SUBUI R3,R1,#1   ; (d-1)
      BNEZ  R3,L2      ; b2: taken if d != 1

Is there any correlation between b1 and b2? (If b1 is not taken, then d = 1, so b2 must be not taken.)

Using a 1-bit Predictor
- Suppose the value of d alternates between 2 and 0, and the code repeats many times
- Every branch is mispredicted! (with both predictors initialized to NT)
- Note: b1 and b2 each have their own 1-bit predictor

Correlating Predictors
- Keep different predictions for the current branch depending on whether the previously executed branch was taken or not
- Notation: _ / _ (two separate prediction entries)
  - Left: the prediction used if the last branch was NOT taken
  - Right: the prediction used if the last branch was taken
  - The prediction actually used is shown in bold in the figure
- Both entries are initialized to NT
- Note: by separating the history tables, the behavior "b1 not taken implies b2 always not taken" is preserved

(m,n) Correlating Predictors
- Use the behavior of the most recent m branches encountered to select one of 2^m different branch predictors for the next branch
- Each of these predictors records n bits of history for a given branch
- The previous slide showed a (1,1) predictor
- Easy to implement:
  - Behavior of the last m branches: an m-bit shift register
  - Branch-prediction buffer: indexed by the low-order bits of the branch address concatenated with the shift register
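A sketch of an (m,2) predictor along these lines (the class name and table sizes are illustrative; a Python dictionary stands in for the concatenated-index buffer):

```python
class CorrelatingPredictor:
    """(m, 2) correlating predictor sketch: an m-bit global shift
    register of recent branch outcomes selects among 2**m 2-bit
    counters per branch-address index."""
    def __init__(self, m, index_bits=12):
        self.m = m
        self.history = 0                      # m-bit global shift register
        self.index_mask = (1 << index_bits) - 1
        self.table = {}                       # (index, history) -> counter

    def _key(self, pc):
        # Low-order address bits concatenated with the global history.
        return ((pc >> 2) & self.index_mask, self.history)

    def predict(self, pc):
        return self.table.get(self._key(pc), 0) >= 2   # 2-3 mean taken

    def update(self, pc, taken):
        key = self._key(pc)
        ctr = self.table.get(key, 0)
        self.table[key] = min(ctr + 1, 3) if taken else max(ctr - 1, 0)
        self.history = ((self.history << 1) | taken) & ((1 << self.m) - 1)

# A strictly alternating branch defeats a plain 2-bit predictor, but a
# (1,2) predictor learns it: one counter per value of the history bit.
p = CorrelatingPredictor(m=1)
hits = 0
for i, taken in enumerate([1, 0] * 8):
    if i >= 6 and p.predict(0x100) == bool(taken):
        hits += 1
    p.update(0x100, taken)
print(hits)  # 10: every prediction correct once warmed up
```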

(2,2) Correlated Predictor
- Correlation is based on the outcomes of the last 2 branches
- Maintains 4 separate tables, one per global-history value

Correlated Predictors Perform Better (see figure)

Tournament Predictors
- Three predictors: a global correlating predictor, a local 2-bit predictor, and a selector (the "tournament" predictor)
- The selector determines which predictor, global or local, is used for each prediction
- In the figure's encoding, 1 = correct and 0 = incorrect
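The selector can be modeled as one more 2-bit saturating counter (function names are mine; the update follows the common "move toward the winner" rule when exactly one component predictor was correct):

```python
def choose(selector, global_pred, local_pred):
    """2-bit tournament selector sketch: values 0-1 favor the local
    predictor, values 2-3 favor the global one."""
    return global_pred if selector >= 2 else local_pred

def update_selector(selector, global_correct, local_correct):
    # Strengthen the preference for whichever predictor was right,
    # but only when exactly one of the two was right.
    if global_correct and not local_correct:
        return min(selector + 1, 3)
    if local_correct and not global_correct:
        return max(selector - 1, 0)
    return selector  # both right or both wrong: no change

sel = 0                           # start out favoring the local predictor
for _ in range(2):                # the global predictor is right twice...
    sel = update_selector(sel, True, False)
print(choose(sel, "G", "L"))      # "G": selection has switched to global
```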

Tournament Predictor Performance

Dynamic Branch Prediction Summary
- Prediction has become an important part of execution
- Branch history table: 2 bits per entry suffice for loop accuracy
- Correlation: recently executed branches are correlated with the next branch
  - Either different branches (GA: global history)
  - Or different executions of the same branch (PA: per-address history)
- Tournament predictors take the insight to the next level: multiple predictors, usually one based on global information and one on local information, combined with a selector
- In 2006, tournament predictors using about 30K bits were used in processors like the Power5 and Pentium 4
- Branch-target buffer: stores the branch address and prediction for taken branches