Chapter 3: ILP and Its Exploitation


1 Chapter 3: ILP and Its Exploitation
Review of the simple static pipeline
ILP overview
Dynamic branch prediction
Dynamic scheduling, out-of-order execution
Multiple issue (superscalar)
Hardware-based speculation
ILP limitations
Intel Core i7 and Cortex-A8
Fall 2016 CDA5155

2 Introduction
The University of Adelaide, School of Computer Science
8 December 2018
Pipelining became a universal technique in 1985:
Overlaps the execution of instructions
Exploits "instruction-level parallelism"
Beyond this, there are two main approaches:
Hardware-based dynamic approaches
Used in server and desktop processors
Not used as extensively in PMD (personal mobile device) processors
Compiler-based static approaches
Not as successful outside of scientific applications
Copyright © 2012, Elsevier Inc. All rights reserved.

3 Instruction-Level Parallelism
When exploiting instruction-level parallelism, the goal is to minimize CPI:
Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls
Parallelism within a basic block is limited
Typical size of a basic block = 3-6 instructions
Must optimize across branches to avoid stalls

4 Data Dependence
Loop-level parallelism:
Unroll the loop statically or dynamically
Use SIMD (vector processors and GPUs)
Challenge: data dependence
Instruction j is data dependent on instruction i if:
Instruction i produces a result that may be used by instruction j, or
Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i
Dependent instructions cannot be executed simultaneously

5 Data Dependence
Dependences are a property of programs
The pipeline organization determines whether a dependence is detected and whether it causes a stall
A data dependence conveys:
The possibility of a hazard
The order in which results must be calculated
An upper bound on exploitable instruction-level parallelism
Dependences that flow through memory locations are difficult to detect
These are called memory dependences, as opposed to register dependences

6 Data Dependence: Recursive Definition
Instruction B is data dependent on instruction A iff:
B uses a data result produced by instruction A, or
There is another instruction C such that B is data dependent on C, and C is data dependent on A
A data dependence is the potential for a RAW hazard
Loop: LD   F0,0(R1)
      ADDD F4,F0,F2
      SD   0(R1),F4
      SUBI R1,R1,#8
      BNEZ R1,Loop
(Figure: data-dependence arrows in the loop example)
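The recursive definition above can be sketched as a reachability check over direct RAW links. A minimal sketch, assuming a hypothetical (dest, [sources]) encoding of instructions that is not part of the slides:

```python
# Sketch of the recursive data-dependence definition: B depends on A
# iff there is a chain of direct RAW (read-after-write) links from A to B.
# Instructions are modeled as (dest, [sources]) tuples - an assumed encoding.

def direct_raw(instrs, i, j):
    """True if instr j directly reads the result of instr i (i < j),
    with no intervening write to the same register."""
    dest = instrs[i][0]
    if dest not in instrs[j][1]:
        return False
    # An intervening write to dest would break the link.
    return all(instrs[k][0] != dest for k in range(i + 1, j))

def data_dependent(instrs, i, j):
    """Recursive definition: direct RAW, or dependence through some k."""
    if direct_raw(instrs, i, j):
        return True
    return any(data_dependent(instrs, i, k) and data_dependent(instrs, k, j)
               for k in range(i + 1, j))

# Loop body from the slide: L.D writes F0, ADD.D reads F0 and writes F4,
# S.D reads F4 (it has no register destination).
loop = [("F0", ["R1"]),        # LD   F0,0(R1)
        ("F4", ["F0", "F2"]),  # ADDD F4,F0,F2
        (None, ["F4", "R1"])]  # SD   0(R1),F4
```

Here `data_dependent(loop, 0, 2)` holds only through the chain L.D → ADD.D → S.D, matching the transitive clause of the definition.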

7 Name Dependence
Two instructions use the same name, but there is no flow of information
Not a true data dependence, but a problem when reordering instructions
Antidependence (WAR): instruction j writes a register or memory location that instruction i reads
The initial ordering (i before j) must be preserved
Output dependence (WAW): instructions i and j write the same register or memory location
Ordering must be preserved
To resolve these, use register renaming techniques

8 WAR (Anti-), WAW (Output-) Examples
WAR hazard: InstrJ writes an operand before InstrI reads it
I: sub r4,r1,r3
J: add r1,r2,r3
K: mul r6,r1,r7
WAW hazard: InstrJ writes an operand before InstrI writes it
I: sub r1,r4,r3
J: add r1,r2,r3
K: mul r6,r1,r7
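The two sequences above can be classified mechanically. A minimal sketch, assuming the same hypothetical (dest, [sources]) register encoding used for illustration (not from the slides):

```python
# Classify RAW/WAR/WAW between an earlier instruction I and a later
# instruction J, each modeled as (dest, [sources]) - an assumed encoding.

def hazards(instr_i, instr_j):
    found = set()
    di, si = instr_i
    dj, sj = instr_j
    if di is not None and di in sj:
        found.add("RAW")           # true data dependence
    if dj is not None and dj in si:
        found.add("WAR")           # antidependence (name only)
    if di is not None and di == dj:
        found.add("WAW")           # output dependence (name only)
    return found

# WAR example from the slide: I reads r1, J later writes r1.
assert hazards(("r4", ["r1", "r3"]), ("r1", ["r2", "r3"])) == {"WAR"}
# WAW example from the slide: both I and J write r1.
assert hazards(("r1", ["r4", "r3"]), ("r1", ["r2", "r3"])) == {"WAW"}
```

Renaming J's destination register removes both name dependences without changing the program's results, which is exactly why renaming resolves WAR and WAW but not RAW.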

9 Other Factors: Branches
Control dependence:
The ordering of instruction i with respect to a branch instruction
An instruction control dependent on a branch cannot be moved before the branch, since its execution would no longer be controlled by the branch
An instruction not control dependent on a branch cannot be moved after the branch, since its execution would become controlled by the branch
Handling branches:
Static branch prediction or instruction scheduling
Dynamic branch prediction with out-of-order execution

10 Inst. Schedule Examples - Branch
Example 1:
DADDU R1,R2,R3
BEQZ R4,L
DSUBU R1,R1,R6
L: ...
OR R7,R1,R8
The OR instruction is dependent on both DADDU and DSUBU
Example 2:
BEQZ R12,skip
DSUBU R4,R5,R6
DADDU R5,R4,R9
skip: OR R7,R8,R9
Assuming R4 isn't used after skip, it is possible to move DSUBU before the branch

11 Compiler Techniques for Exposing ILP
Pipeline scheduling (by the compiler):
Separate a dependent instruction from the source instruction by the pipeline latency of the source instruction
Example loop:
for (i=999; i>=0; i=i-1)
    x[i] = x[i] + s;

12 Pipeline Stalls
Loop: L.D    F0,0(R1)
      stall
      ADD.D  F4,F0,F2
      stall
      stall
      S.D    F4,0(R1)
      DADDUI R1,R1,#-8
      stall  (assume integer add latency is 1)
      BNE    R1,R2,Loop  (must resolve in the ID stage)
(The two stalls between ADD.D and S.D follow from the same FP-ALU-to-store latency that produces the two stalls in the scheduled version on the next slide.)

13 Apply Instruction Scheduling
1 Loop: L.D    F0,0(R1)
2       DADDUI R1,R1,#-8
3       ADD.D  F4,F0,F2
4       stall
5       stall
6       S.D    8(R1),F4  ; altered offset
7       BNEZ   R1,Loop
The basic block is small (5 instructions): 7 clock cycles, but only 3 do actual work (L.D, ADD.D, S.D) and 4 are loop overhead. Can we make it faster?

14 Unroll Loop Four Times
1  Loop: L.D    F0,0(R1)
3        ADD.D  F4,F0,F2
6        S.D    0(R1),F4    ; no DSUBUI/BNEZ
7        L.D    F6,-8(R1)
9        ADD.D  F8,F6,F2
12       S.D    -8(R1),F8   ; drop loop test
13       L.D    F10,-16(R1)
15       ADD.D  F12,F10,F2
18       S.D    -16(R1),F12 ; drop loop test
19       L.D    F14,-24(R1)
21       ADD.D  F16,F14,F2
24       S.D    -24(R1),F16
25       DADDUI R1,R1,#-32  ; alter to 4*8
27       BNEZ   R1,Loop
27 clock cycles, or 6.75 per original iteration (assumes the iteration count is a multiple of 4)

15 Unrolled and Rescheduled Loop
1  Loop: L.D    F0,0(R1)
2        L.D    F6,-8(R1)   ; adjust displacement
3        L.D    F10,-16(R1)
4        L.D    F14,-24(R1)
5        ADD.D  F4,F0,F2
6        ADD.D  F8,F6,F2
7        ADD.D  F12,F10,F2
8        ADD.D  F16,F14,F2
9        S.D    0(R1),F4
10       S.D    -8(R1),F8
11       S.D    -16(R1),F12
12       DSUBUI R1,R1,#32
13       S.D    8(R1),F16   ; 8-32 = -24
14       BNEZ   R1,Loop
14 clock cycles, or 3.5 per original iteration

16 Three Unrolling Considerations
The decrease in overhead (begin/end of the loop) amortized by each extra unrolling diminishes (Amdahl's Law)
Growth in code size:
For larger loops, the concern is that unrolling increases the instruction-cache miss rate
Register pressure: a potential shortfall in registers created by aggressive unrolling and scheduling
If it is not possible to allocate all live values to registers, the code may lose some or all of the advantage of unrolling
Loop unrolling also reduces the impact of branches on the pipeline; another way to do that is branch prediction

17 Strip Mining
What if the number of loop iterations is unknown at compile time?
Let the number of iterations be n
Goal: make k copies of the loop body
Generate a pair of loops:
The first executes n mod k times (the original body)
The second executes n / k times (the body unrolled k times)
This is called "strip mining"
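The two-loop structure above can be sketched directly. A minimal sketch of strip mining applied to the running x[i] = x[i] + s example, with the function name and k=4 chosen for illustration:

```python
# Strip mining sketch: split a loop with unknown trip count n into a
# residual loop of n mod k iterations plus a main loop unrolled by k.

def strip_mined_add(x, s, k=4):
    n = len(x)
    i = 0
    # First loop: n mod k leftover iterations, running the original body.
    for _ in range(n % k):
        x[i] += s
        i += 1
    # Second loop: n // k passes, each containing k copies of the body.
    for _ in range(n // k):
        for u in range(k):
            x[i + u] += s
        i += k
    return x

# Works for any n, e.g. n=7: 3 residual iterations, then 1 unrolled pass.
strip_mined_add([1.0] * 7, 2.0)
```

In compiled code the inner `for u` loop would be fully unrolled into k straight-line copies, so only the n // k main-loop passes pay the branch overhead.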

18 Control Stall - Static Branch Prediction
Delayed branch
To reorder code around branches, the compiler must predict each branch statically at compile time
Predict every branch as taken: average misprediction rate = untaken frequency = 34% on SPEC
More accurate: profile-based prediction (see figure)

19 Dynamic Branch Prediction
Why does prediction work?
The underlying algorithm has regularities
The data being operated on has regularities
The instruction sequence has redundancies that are artifacts of the way humans and compilers think about problems
Is dynamic branch prediction better than static branch prediction?
It seems to be
There are a small number of important branches in programs that have dynamic behavior

20 Dynamic Branch Prediction
As the amount of ILP exploited increases (CPI decreases), the impact of control stalls increases:
Branches come more often
A longer n-cycle branch delay postpones more instructions
Dynamic hardware branch prediction:
"Learns" which branches are taken or not
Makes the right guess (most of the time) about whether a branch is taken
The delay depends on whether the prediction is correct and whether the branch is taken

21 Branch-Prediction Buffers (BPB)
Also called a "branch history table"
The low-order n bits of the branch address index a table of branch-history data used for prediction
There may be "collisions" between distant branches; associative tables are also possible (but expensive)
Each entry stores k bits of information about the history of that branch (also called a predictor); common values of k: 1, 2, and larger
The entry is used to predict what the branch will do; the actual outcome of the branch then updates the entry

22 Branch Prediction
Basic 2-bit predictor:
For each branch, predict taken or not taken
If the prediction is wrong two consecutive times, change the prediction
Correlating predictor:
Multiple 2-bit predictors for each branch
One for each possible combination of outcomes of the preceding n branches
Local predictor:
Multiple 2-bit predictors for each branch
One for each possible combination of outcomes of the last n occurrences of this branch
Tournament predictor:
Combines a correlating predictor with a local predictor

23 1-bit Branch Prediction
The entry for a branch has only two states:
Bit = 1: "The last time this branch was encountered, it was taken. I predict it will be taken next time."
Bit = 0: "The last time this branch was encountered, it was not taken. I predict it will not be taken next time."
Makes 2 mistakes each time a loop is encountered: on the first and last iterations
May always mispredict in pathological cases!
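The two-mispredictions-per-loop behavior can be demonstrated in a few lines. A minimal sketch of a single 1-bit predictor entry, with the loop shape (9 iterations) chosen for illustration:

```python
# 1-bit predictor sketch: always predict the branch's last outcome.
# Shows the two mispredictions per loop visit described above.

def run_1bit(outcomes, state=0):
    """state 0 = predict not taken, 1 = predict taken.
    Returns (misprediction count, final state)."""
    miss = 0
    for taken in outcomes:
        if (state == 1) != taken:
            miss += 1
        state = 1 if taken else 0    # remember only the last outcome
    return miss, state

# A 9-iteration loop branch: taken 8 times, then falls out (not taken).
loop = [True] * 8 + [False]
m1, state = run_1bit(loop, 0)        # first visit: miss on first and last
m2, _ = run_1bit(loop, state)        # every later visit: the same two misses
```

The final not-taken outcome flips the bit, so the next visit to the loop mispredicts its first iteration again; the pathological case is a branch that strictly alternates, which this predictor gets wrong every time.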

24 2-bit Branch Prediction
4 states, based on the most recent two branch outcomes
Only 1 misprediction per loop execution (on the last iteration), after the first time the loop is reached
What about an n-bit predictor?
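The one-miss-per-pass behavior follows from hysteresis in a saturating counter. A minimal sketch of a single 2-bit entry, using the common saturating-counter formulation of the 4 states (one valid choice of the transition diagram on the next slide):

```python
# 2-bit saturating-counter sketch: states 0..3, predict taken when >= 2.
# One wrong outcome nudges the counter; two in a row flip the prediction.

def run_2bit(outcomes, state):
    """Returns (misprediction count, final state) for one pass."""
    miss = 0
    for taken in outcomes:
        if (state >= 2) != taken:
            miss += 1
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return miss, state

loop = [True] * 8 + [False]       # 9-iteration loop branch
m1, state = run_2bit(loop, 0)     # first visit: cold-start misses
m2, state = run_2bit(loop, state) # later visits: one miss each (last iter)
m3, state = run_2bit(loop, state)
```

After warm-up the counter sits at 3 during the loop and drops only to 2 on loop exit, so the prediction stays "taken" and only the final not-taken iteration mispredicts.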

25 State Transition Diagram (2-bit predictor, 4 states)
Note: the state transitions can be modified based on the prediction algorithm

26 Implementing Branch Histories
Either a separate "cache" (prediction table) accessed during the IF stage (to obtain the predicted outcome), or extra bits in the instruction cache
Problem with this approach in MIPS:
After fetch, we don't know whether the instruction is really a branch (until decode), and we don't know the target address
In MIPS, by the time these are known (in ID), we already know whether the branch is really taken, so no time has been saved!
Branch-target buffers (BTBs) can fix this problem (later)...

27 Branch-Target Buffers (BTB)
How can we know the address of the next instruction as soon as the current instruction is fetched?
Normally, an extra (ID) cycle is needed to:
Determine that the fetched instruction is a branch
Determine whether the branch is taken
Compute the target address (PC + offset)
Branch prediction alone doesn't help here; we need the next PC
What if, instead, the next instruction address could be fetched at the same time as the current instruction? That is what a BTB provides.
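The BTB idea can be sketched as a small direct-mapped lookup table. A minimal sketch, assuming 4-byte instructions and a hypothetical 8-entry table (sizes chosen for illustration):

```python
# BTB sketch: a small direct-mapped table keyed by the fetch PC.
# A hit supplies the predicted next PC in the same cycle as the fetch.

class BTB:
    def __init__(self, entries=8):
        self.tags = [None] * entries
        self.targets = [0] * entries

    def lookup(self, pc):
        i = (pc >> 2) % len(self.tags)     # index by word address
        if self.tags[i] == pc:
            return self.targets[i]         # predicted taken: fetch target
        return pc + 4                      # miss: fall through sequentially

    def insert(self, pc, target):
        # Only branches that resolve taken are entered (see next slide).
        i = (pc >> 2) % len(self.tags)
        self.tags[i], self.targets[i] = pc, target

btb = BTB()
btb.lookup(0x40)          # miss -> 0x44, sequential fetch
btb.insert(0x40, 0x100)   # branch at 0x40 resolved taken, target 0x100
btb.lookup(0x40)          # hit  -> 0x100, fetched in the same cycle
```

Storing the tag (the full PC) is what distinguishes a real branch at this index from an ordinary instruction, which resolves the "is it even a branch?" problem before decode.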

28 BTB Schematic
Note: the BTB structure is similar to a cache, and holds entries for taken branches only. Because it is expensive, the predictor for untaken branches can be kept separate (only branches predicted taken enter the BTB).

29 Branch Resolution in 5-stage pipeline
The branch outcome and target address must be resolved in the IF stage so that fetch can continue from the target!
(This flowchart is based on a 1-bit predictor.)

30 Misprediction Rate for 2-bit BPB

31 Branch-Prediction Performance
The contribution to cycle count depends on:
Branch frequency and misprediction frequency
Frequencies of taken/not taken and predicted/mispredicted
Delays for taken/not taken, predicted/mispredicted
How to reduce misprediction frequency?
Increase the buffer size to avoid collisions (little effect beyond ~4,096 entries)
Increase prediction accuracy:
Increase the number of bits per entry (little effect beyond 2)
Use a different prediction scheme (correlating predictors, tournament predictors)

32 Correlated Prediction - Example
Code fragment from eqntott:
if (aa==2) aa=0;     /* if1 */
if (bb==2) bb=0;     /* if2 */
if (aa!=bb) { ... }  /* condition is false if both if1 and if2 fired */
MIPS code (aa=R1, bb=R2):
     SUBUI R3,R1,#2  ; (aa-2)
     BNEZ  R3,L1     ; branch b1 (aa!=2)
     ADD   R1,R0,R0  ; aa=0
L1:  SUBUI R3,R2,#2  ; (bb-2)
     BNEZ  R3,L2     ; branch b2 (bb!=2)
     ADD   R2,R0,R0  ; bb=0
L2:  SUBU  R3,R1,R2  ; (aa-bb)
     BEQZ  R3,L3     ; branch b3 (aa==bb); must be taken if b1 and b2 are untaken

33 Even Simpler Example
C code:
if (d==0) d=1;
if (d==1) ...
MIPS code (d=R1):
     BNEZ  R1,L1     ; b1: d!=0
     ADDI  R1,R0,#1  ; d=1
L1:  SUBUI R3,R1,#1  ; (d-1)
     BNEZ  R3,L2     ; b2: d!=1
     ...
Is there any correlation between b1 and b2? (If b1 is not taken, d becomes 1, so b2 cannot be taken.)

34 Using a 1-bit Predictor
Suppose the value of d alternates between 2 and 0, and the code repeats many times: every branch is mispredicted! (with both predictors initialized to NT)
Note: b1 and b2 each have an individual 1-bit predictor.

35 Correlating Predictors
Keep different predictions for the current branch depending on whether the previously executed branch was taken or not.
Notation: _/_ (two separate prediction tables):
The left entry is the prediction used if the last branch was NOT taken
The right entry is the prediction used if the last branch was TAKEN
The prediction actually used is shown in bold; both entries start at NT
Note: by separating the history tables, the pattern "b1 not taken implies b2 not taken" is preserved.

36 (m,n) correlated predictors
Uses the behavior of the most recent m branches encountered to select one of 2^m different branch predictors for the next branch.
Each of these predictors records n bits of history information for any given branch.
The previous slide showed a (1,1) predictor.
Easy to implement:
The behavior of the last m branches is kept in an m-bit shift register
The branch-prediction buffer is accessed with the low-order bits of the branch address concatenated with the shift register
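The shift-register construction above can be sketched end to end. A minimal sketch of an (m,n) predictor, with the class name, table size, and branch addresses chosen for illustration; it is run on the alternating d=2/d=0 pattern from the previous slides, where b2's outcome correlates with b1's:

```python
# (m,n) correlated predictor sketch: the last m global outcomes (an m-bit
# shift register) select among 2**m n-bit saturating counters per entry.

class CorrelatedPredictor:
    def __init__(self, m, n, entries=16):
        self.m, self.n = m, n
        self.ghist = 0                         # m-bit global shift register
        # One row per branch entry, 2**m counters of n bits each.
        self.table = [[0] * (1 << m) for _ in range(entries)]

    def predict(self, pc):
        ctr = self.table[pc % len(self.table)][self.ghist]
        return ctr >= (1 << (self.n - 1))      # counter MSB = prediction

    def update(self, pc, taken):
        row = self.table[pc % len(self.table)]
        ctr = row[self.ghist]
        row[self.ghist] = min(ctr + 1, (1 << self.n) - 1) if taken \
            else max(ctr - 1, 0)
        # Shift the outcome into the global history register.
        self.ghist = ((self.ghist << 1) | int(taken)) & ((1 << self.m) - 1)

# d alternates 2, 0: (b1 taken, b2 taken), then (b1 NT, b2 NT), repeating.
p = CorrelatedPredictor(m=1, n=1)
stream = [(0, True), (4, True), (0, False), (4, False)] * 3
misses = 0
for pc, taken in stream:
    misses += p.predict(pc) != taken
    p.update(pc, taken)
```

After the two cold-start misses in the first round, the (1,1) predictor gets every branch right, whereas the individual 1-bit predictors on slide 34 mispredicted every time.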

37 (2,2) Correlated Predictor
(Correlation based on the last 2 branch outcomes; maintains 4 separate tables!)

38 Correlated Predictors Perform Better

39 Tournament Predictors
Three predictors: a global correlating predictor, a local 2-bit predictor, and a selector.
The selector (the "tournament" part) determines whether the global or the local predictor is used for each prediction.
(In the selector's state diagram, 1 = that predictor was correct; 0 = incorrect.)
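The selector can be sketched as its own 2-bit saturating counter, trained on which component predictor was right. A minimal sketch, with the function name and the 0..3 state encoding chosen for illustration (one common form of the selector):

```python
# Tournament selector sketch: a 2-bit counter per branch chooses between
# the global and local predictors, shifting toward whichever was correct.

def choose_and_train(sel, global_ok, local_ok):
    """sel in 0..3: >= 2 means trust the local predictor, else the global.
    Returns (whether the chosen predictor was correct, new selector)."""
    chosen_ok = local_ok if sel >= 2 else global_ok
    if local_ok and not global_ok:
        sel = min(sel + 1, 3)      # local won this round
    elif global_ok and not local_ok:
        sel = max(sel - 1, 0)      # global won this round
    return chosen_ok, sel          # tie: selector unchanged

# If the local predictor keeps winning, the selector migrates to it.
sel = 0                            # start by trusting the global predictor
for _ in range(3):
    ok, sel = choose_and_train(sel, global_ok=False, local_ok=True)
```

The 2-bit hysteresis means one lucky round for the losing predictor does not flip the choice, mirroring the 2-bit branch predictor's resistance to one-off anomalies.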

40 Tournament Predictor Performance

41 Dynamic Branch Prediction Summary
Prediction has become an important part of execution
Branch history table: 2 bits suffice for loop accuracy
Correlation: recently executed branches are correlated with the next branch
Either different branches (GA)
Or different executions of the same branch (PA)
Tournament predictors take this insight to the next level by using multiple predictors, usually one based on global information and one based on local information, combined with a selector
In 2006, tournament predictors using about 30K bits were in processors such as the Power5 and Pentium 4
Branch target buffer: includes the branch address and prediction, for taken branches

