
Pipelining and Control Hazards Oct

Stall until the branch resolves (here, in the mem stage), then redirect fetch:

TIME  C1      C2      C3      C4      C5      C6      C7      C8      C9      C10
I1    branch  decode  exec    mem     wb
      bubble  bubble  bubble                          <- nothing useful fetched
I4                            fetch   decode  exec    mem     wb              <- redirected fetch
I5                                    fetch   decode  exec    mem     wb

Resolve the branch earlier (in exec) and the penalty shrinks:

TIME  C1      C2      C3      C4      C5      C6      C7      C8      C9
I1    branch  decode  exec    mem     wb
      bubble  bubble
I2                            fetch   decode  exec    mem     wb      <- redirected fetch
I3                                    fetch   decode  exec    mem     wb

Predict PC + 4 for every instruction: resolve in decode if it turns out to be a branch, resolve trivially if it is a non-branch.

TIME  C1      C2      C3      C4      C5      C6      C7      C8      C9
I1    fetch   decode  exec    mem     wb
I2            fetch   decode  exec    mem     wb
I3                    fetch   decode  exec    mem     wb
I4                            fetch   decode  exec    mem     wb
I5                                    fetch   decode  exec    mem     wb

Predict PC + 4; when the branch resolves and the next PC != PC + 4, the wrong-path instructions are squashed and fetch is redirected:

TIME  C1      C2      C3      C4      C5      C6      C7      C8      C9
I1    branch  decode  exec    mem     wb
I2            fetch   decode  bubble  bubble  bubble          <- squashed
I3                    fetch   bubble  bubble  bubble  bubble  <- squashed
I4                            fetch   decode  exec    mem     wb      <- redirected fetch
I5                                    fetch   decode  exec    mem     wb
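The cost of these squashed slots follows the usual back-of-the-envelope formula. This sketch uses assumed, illustrative numbers (20% branches, 10% misprediction rate, 3-cycle penalty), not measurements from any real machine:

```python
def average_cpi(base_cpi, branch_frac, mispredict_rate, penalty_cycles):
    """Average CPI = base CPI + misprediction stall cycles per instruction."""
    return base_cpi + branch_frac * mispredict_rate * penalty_cycles

# Illustrative (assumed) numbers: 20% branches, 10% mispredicted, 3 bubbles each.
cpi = average_cpi(1.0, 0.20, 0.10, 3)
print(cpi)  # ~1.06
```

Even a modest misprediction rate adds measurable stall cycles per instruction, which is why the rest of these slides work on predicting branches rather than stalling for them.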

do {
    if (a[i] != 0)
        some computation
    i++;
} while (i < 100);

DOWHILE: load r10, a[i]
         beq  r10, r0, SKIP       # skip the computation when a[i] == 0
         some computation
SKIP:    addi r11, r11, 1         # i++  (i in r11)
         blt  r11, r12, DOWHILE   # loop while i < 100  (r12 holds 100)

[Figure: in C1, I1 is fetched from the fetch cache; in C2, the instruction opcode is available, the Taken PC is calculated, and the next fetch selects PC + 4 or the Taken PC.]

ITERATION 1
DOWHILE: load r10, a[i]
         beq  r10, r0, SKIP       FIRST TIME SEEN → PREDICT NOT TAKEN → correct → LEARN NOT TAKEN
         some computation
SKIP:    addi r11, r11, 1
         blt  r11, r12, DOWHILE   FIRST TIME SEEN → PREDICT NOT TAKEN → MISPREDICTION → LEARN TAKEN

ITERATION 2
DOWHILE: load r10, a[i]
         beq  r10, r0, SKIP       SEEN BEFORE → PREDICT "SAME AS LAST TIME": NOT TAKEN → correct
         some computation
SKIP:    addi r11, r11, 1
         blt  r11, r12, DOWHILE   SEEN BEFORE → PREDICT "SAME AS LAST TIME": TAKEN → correct while the loop continues → LEARN TAKEN

Predictor table entry:  PC (tag) | V (valid bit) | N (last outcome: taken / not taken)

ITERATION 1
DOWHILE: load r10, a[i]
0x100:   beq  r10, r0, SKIP       not in table → predict not taken (default); record (PC=0x100, V=1, N=not taken)
         some computation
SKIP:    addi r11, r11, 1
0x200:   blt  r11, r12, DOWHILE   not in table → predict not taken (default), mispredicted; record (PC=0x200, V=1, N=taken)

ITERATION 2
0x100:   beq  r10, r0, SKIP       table hit → predict not taken (table)
0x200:   blt  r11, r12, DOWHILE   table hit → predict taken (table)

Accuracy = # correct predictions / # predictions
  100 executions of beq: all not taken, and all predicted not taken (default and table agree) → 100 correct
  100 executions of blt: the first is mispredicted (default not taken, actually taken) and the last is mispredicted (predicted taken, actually not taken) → 98 correct
Accuracy = (100 + 98) / 200 = 99%
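The arithmetic above can be checked with a tiny simulation of the 1-bit "same as last time" predictor. A sketch, assuming the outcome streams described on the slide (beq never taken; blt taken 99 times, then not taken once at loop exit):

```python
def last_time_correct(outcomes, default=False):
    """Count correct predictions of a 1-bit 'same as last time' predictor."""
    last = None
    correct = 0
    for taken in outcomes:
        prediction = default if last is None else last  # first time: default not taken
        correct += (prediction == taken)
        last = taken                                    # learn the actual outcome
    return correct

beq_outcomes = [False] * 100            # 100 executions, never taken
blt_outcomes = [True] * 99 + [False]    # taken 99 times, not taken at loop exit
correct = last_time_correct(beq_outcomes) + last_time_correct(blt_outcomes)
print(correct, correct / 200)  # 198 0.99
```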

How big does this table need to be?  PC | V | N
4G addresses (32-bit), 4 bytes per instruction, aligned → 1G possible branch PCs.
1G entries, each a 4-byte PC tag plus 2 bits (V and N) → TOO LARGE.
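The sizing claim can be spelled out, assuming a 32-bit address space and the 4-byte PC tag plus V and N bits from the slide:

```python
ADDRESS_SPACE = 2 ** 32        # 4G byte addresses (32-bit PC assumed)
INSTRUCTION_BYTES = 4          # aligned 4-byte instructions

entries = ADDRESS_SPACE // INSTRUCTION_BYTES   # 2**30 = 1G possible branch PCs
bits_per_entry = 32 + 1 + 1                    # PC tag + V bit + N bit
total_gib = entries * bits_per_entry / 8 / 2 ** 30
print(entries, total_gib)  # 1073741824 4.25
```

Over four gibibytes just to predict branches is clearly out of the question, which motivates the tagless, hashed tables on the next slides.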

But if we had 1G entries, the PC-to-entry mapping would be 1-to-1: each entry corresponds to exactly one PC, so there is no need to store the PC tag — only V and N remain per entry.

1G tagless entries are still far too many, so keep only a few entries and index them with a hash h(PC) of the branch address — e.g. low-order PC bits, after dropping the two '00' alignment bits. Several PCs now alias to each entry, so a tag could not identify the branch anyway: drop the PC and V fields and keep a single N bit per entry.
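A minimal sketch of the resulting tagless table. The table size (1024 entries) and the trivial hash are assumptions for illustration:

```python
TABLE_SIZE = 1024                  # assumed; real sizes vary
table = [0] * TABLE_SIZE           # one N bit per entry, 0 = not taken

def index(pc):
    return (pc >> 2) % TABLE_SIZE  # drop the two '00' alignment bits, then wrap

def predict(pc):
    return table[index(pc)] == 1   # no tag, no valid bit: aliasing is accepted

def update(pc, taken):
    table[index(pc)] = 1 if taken else 0

update(0x200, True)                # the blt at 0x200 was taken
print(predict(0x200), predict(0x100))  # True False (0x100 maps to another entry here)
```

Two branches whose hashes collide would share an entry and interfere; the design simply accepts that in exchange for the tiny table.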

Replace the single N bit with a 2-bit saturating counter per entry:

  00 Strongly NT  ⇄  01 Weakly NT  ⇄  10 Weakly T  ⇄  11 Strongly T

A taken outcome (T) moves the counter toward 11, a not-taken outcome (NT) toward 00, saturating at the ends. Predict taken in states 10 and 11, not taken in 00 and 01.
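The state machine above, as a sketch:

```python
def predict(counter):
    return counter >= 2              # states 10 and 11 predict taken

def update(counter, taken):
    if taken:
        return min(counter + 1, 3)   # move toward 11 (Strongly T), saturate
    return max(counter - 1, 0)       # move toward 00 (Strongly NT), saturate

c = 0                                # start in Strongly NT
for outcome in [True, True, True, False]:
    c = update(c, outcome)
print(c, predict(c))  # 2 True -- one NT after three Ts only weakens the prediction
```

The hysteresis is the point: a single anomalous outcome flips the prediction of a 1-bit scheme, but only nudges a 2-bit counter one state.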

    movi r18, 3        # max i
    movi r19, 2        # max j
    movi r8, 0         # i = 0
DOi:
    movi r9, 0         # j = 0
DOj:
    some computation
    addi r9, r9, 1
    blt  r9, r19, DOj  # J branch
    addi r8, r8, 1
    blt  r8, r18, DOi  # I branch

J branch outcomes:          T   T   NT   T   T   NT   T   T   NT
counter after each outcome: 11  11  10   11  11  10   11  11  10
The counter never leaves the taken states (10/11), so the prediction is always taken and every NT outcome is mispredicted.

[Figure: a branch-history shift register (older outcomes on the left, the youngest on the right) is combined by h() with the PC — its two '00' alignment bits dropped — to index the prediction table.]

The J branch (blt r9, r19, DOj) produces the repeating outcome pattern T T NT (1 1 0). A 2-bit history of its last outcomes selects the table entry, so the predictor learns, for each history, the outcome that followed it:

  after history 0 1 (NT, T) → next outcome 1 (T)
  after history 1 1 (T, T)  → next outcome 0 (NT)
  after history 1 0 (T, NT) → next outcome 1 (T)

Once these entries are learned, the pattern is predicted perfectly: from history 01 predict T (correct), from history 11 predict NT (correct), from history 10 predict T (correct).
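The walkthrough above can be sketched as a single-branch, 2-bit-history predictor. This is a simplification: one branch only, and last-outcome bits in the pattern table instead of saturating counters:

```python
HISTORY_BITS = 2
history = 0                # the branch's last two outcomes, youngest in bit 0
pattern_table = {}         # history value -> outcome that followed it

def predict():
    return pattern_table.get(history, False)   # unseen history: not taken

def update(taken):
    global history
    pattern_table[history] = taken             # learn what followed this history
    history = ((history << 1) | int(taken)) & ((1 << HISTORY_BITS) - 1)

for taken in [True, True, False] * 4:          # the J branch's T T NT pattern
    update(taken)

# Learned: after 01 -> T, after 11 -> NT, after 10 -> T (00 occurs only at startup)
print(pattern_table[0b01], pattern_table[0b11], pattern_table[0b10])  # True False True
```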

Neither predictor wins everywhere: for one branch (PC) the bimodal predictor may be best, for another gshare. Which is best for this branch?

Tournament prediction: run the bimodal and gshare predictors side by side, and let a meta predictor — itself a table of counters indexed by PC — learn which of the two to trust for each branch.
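A sketch of the chooser logic, assuming the two component predictions are given and the meta entry is a 2-bit counter (0–1 favor bimodal, 2–3 favor gshare):

```python
def tournament_predict(meta, bimodal_pred, gshare_pred):
    return gshare_pred if meta >= 2 else bimodal_pred

def tournament_update(meta, bimodal_pred, gshare_pred, taken):
    # Train the chooser only when the components disagree, toward the right one.
    if bimodal_pred != gshare_pred:
        if gshare_pred == taken:
            meta = min(meta + 1, 3)
        else:
            meta = max(meta - 1, 0)
    return meta

meta = 1                                   # start weakly favoring bimodal
for _ in range(3):                         # gshare keeps being right, bimodal wrong
    meta = tournament_update(meta, bimodal_pred=False, gshare_pred=True, taken=True)
print(meta, tournament_predict(meta, False, True))  # 3 True
```

When the components agree, the chooser learns nothing — there is no information about which one to prefer.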

[Figure: a small, fast predictor delivers a prediction already in C1 (fetch); a larger, slower predictor completes in C2 (decode) and may overwrite the fast prediction, redirecting fetch.]

BTB (Branch Target Buffer) — each entry:  PC (tag) | TARGET ADDRESS | V

Next-PC selection: the direction predictor chooses between PC + 4 (predicted not taken) and the target address supplied by the BTB (predicted taken).
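The selection can be sketched as follows; the BTB content (a branch at 0x100 targeting 0x180) is an assumed example:

```python
btb = {0x100: 0x180}          # tag -> target; assumed example entry

def next_pc(pc, predict_taken):
    target = btb.get(pc)      # BTB lookup by PC (tag match)
    if predict_taken and target is not None:
        return target         # predicted taken and target known: fetch the target
    return pc + 4             # otherwise fall through (predicted NT, or BTB miss)

print(hex(next_pc(0x100, True)),   # 0x180: predicted-taken BTB hit
      hex(next_pc(0x100, False)),  # 0x104: predicted not taken
      hex(next_pc(0x200, True)))   # 0x204: BTB miss, can only fetch PC + 4
```

Note the asymmetry: without a BTB hit, even a taken prediction is useless, because the fetch unit has nowhere else to get the target address this early.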

Calls and returns

if (error != 0) error_handle();

if (a[i] < threshold) a++; else b++;

With branches:
    load r8, a[i]
    blt  r8, r9, THEN       # r9 holds threshold
ELSE:
    addi r10, r10, 1        # b++
    br   DONE
THEN:
    addi r11, r11, 1        # a++
DONE:

Predicated (no branch):
    load  r8, a[i]
    cmplt c0, r8, r9        # condition register c0 = (r8 < r9)
 c0: addi r11, r11, 1       # a++  (executes only if c0)
!c0: addi r10, r10, 1       # b++  (executes only if !c0)