[Pipeline diagram, cycles C1-C9: I1 is a branch and moves through decode, exec, mem and wb. The slots behind it are bubbles until the redirected fetch brings in I4 and I5, each of which then goes through fetch, decode, exec, mem and wb.]

[Pipeline diagram, cycles C1-C9: I1 is a branch and moves through decode, exec, mem and wb. The slots behind it are bubbles until the redirected fetch brings in I2 and I3, each of which then goes through fetch, decode, exec, mem and wb.]

Predict PC + 4
[Pipeline diagram, cycles C1-C9: I1 to I5 each go through fetch, decode, exec, mem and wb with no bubbles. Annotations mark where the prediction is made (Predict PC + 4), where it resolves if the instruction is a branch, and where it resolves if it is a non-branch.]

Predict PC + 4, resolve: next PC != PC + 4
[Pipeline diagram, cycles C1-C9: I1 is a branch and moves through decode, exec, mem and wb. I2 (fetched and decoded) and I3 (fetched) are squashed and turn into bubbles. The redirected fetch then brings in I4 and I5, which go through fetch, decode, exec, mem and wb.]

do {
    if (a[i] != 0)
        some computation
    i++;
} while (i < 100);

DOWHILE: load in r10 a[i]
         beq r10, r0, SKIP
         some computation
SKIP:    some computation
         addi r11, r11, 1
         blt r11, r12, DOWHILE

[Figure: fetch timing over cycles C1-C3 for I1 and the instruction fetched after it. During fetch the instruction opcode becomes available and the taken PC is calculated, after which the fetch stage selects PC+4 or the taken PC.]

ITERATION 1
DOWHILE: load in r10 a[i]
         beq r10, r0, SKIP       FIRST TIME SEEN → PREDICT NOT TAKEN → LEARN NOT TAKEN
         some computation
SKIP:    some computation
         addi r11, r11, 1
         blt r11, r12, DOWHILE   FIRST TIME SEEN → PREDICT NOT TAKEN → MISPREDICTION → LEARN TAKEN

ITERATION 2
DOWHILE: load in r10 a[i]
         beq r10, r0, SKIP       SEEN BEFORE → PREDICT “SAME AS LAST TIME”: NOT TAKEN → LEARN NOT TAKEN
         some computation
SKIP:    some computation
         addi r11, r11, 1
         blt r11, r12, DOWHILE   SEEN BEFORE → PREDICT “SAME AS LAST TIME”: TAKEN → LEARN TAKEN

Predictor table entry: PC | V | N

ITERATION 1 and ITERATION 2 (same code)
         DOWHILE: load in r10 a[i]
0x100             beq r10, r0, SKIP
                  some computation
         SKIP:    addi r11, r11, 1
0x200             blt r11, r12, DOWHILE

Predictor table before and after each branch:
Iteration 1, beq at 0x100: table empty before, predict not taken (default), entry 0x100 (V = 1) after
Iteration 1, blt at 0x200: entry 0x100 before, predict not taken (default), entries 0x100 and 0x200 after
Iteration 2, beq at 0x100: entries 0x100 and 0x200 before, predict not taken (table)
Iteration 2, blt at 0x200: entries 0x100 and 0x200 before, predict taken (table)

Accuracy
Accuracy = # correct predictions / # all predictions
Outcomes: 100 beq → all not taken; 100 blt → all taken except 1 not taken at the end
Predictions:
beq: all not taken (default) → 100 correct
blt: first prediction (not taken, default) is wrong, last prediction (taken) is wrong → 98 correct
Accuracy = (100 + 98) / 200 = 99%
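The 99% figure above can be checked with a small simulation. This is a minimal sketch, assuming a[i] != 0 on every iteration (so the beq is never taken) and using the branch addresses 0x100 and 0x200 from the walk-through; the function names are illustrative, not from the slides.

```python
def simulate_last_outcome(trace):
    """Run a 'same as last time' predictor over (pc, taken) pairs.

    Returns (correct_predictions, total_predictions).
    """
    last = {}  # pc -> last observed outcome; a missing entry means "never seen"
    correct = 0
    for pc, taken in trace:
        prediction = last.get(pc, False)  # default for unseen branches: not taken
        correct += prediction == taken
        last[pc] = taken  # learn the actual outcome
    return correct, len(trace)


def dowhile_trace(iterations=100):
    """Branch outcomes of the do-while loop, assuming a[i] != 0 throughout."""
    trace = []
    for i in range(iterations):
        trace.append((0x100, False))               # beq: never taken
        trace.append((0x200, i < iterations - 1))  # blt: taken except on loop exit
    return trace


correct, total = simulate_last_outcome(dowhile_trace())
print(f"{correct}/{total} = {correct / total:.0%}")  # 198/200 = 99%
```

The two mispredictions are both on the blt: the first time it is seen (default not taken) and at loop exit (last outcome was taken).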

How big does this need to be?
Entry: PC | V | N
4G addresses, 4 bytes per instruction, aligned → 1G possible branches
1G entries, each holding 4 bytes (PC) plus 2 bits (V and N) → TOO LARGE

How big does this need to be?
But if we had 1G entries, we would have a 1-to-1 mapping of PC to entry, so each of the 1G entries only needs V and N.
No need to store the PC.
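The sizes behind "too large" can be worked out directly. This sketch assumes "4G addresses" means a 32-bit byte-addressed space and that a stored PC takes the full 4 bytes; the helper name is illustrative.

```python
ADDRESS_SPACE_BYTES = 2**32  # 4G addresses (assumed: 32-bit byte addresses)
INSTRUCTION_BYTES = 4        # fixed-size, aligned instructions

# One potential branch per aligned instruction slot.
entries = ADDRESS_SPACE_BYTES // INSTRUCTION_BYTES  # 2**30 = 1G


def table_bytes(num_entries, bits_per_entry):
    """Total storage in bytes for a table of num_entries entries."""
    return num_entries * bits_per_entry / 8


GIB = 2**30
with_pc = table_bytes(entries, 32 + 2)  # PC (4 bytes) + V + N
without_pc = table_bytes(entries, 2)    # V + N only, PC implied by position

print(with_pc / GIB)     # 4.25 (GiB)
print(without_pc / GIB)  # 0.25 (GiB)
```

Dropping the PC removes most of the storage, but a quarter of a gibibyte is still far beyond what fits next to a fetch stage.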

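[Figure: on the left, the PC indexes a table of 1G entries of V and N directly. On the right, a hash function h() of the PC indexes a table with only a few entries, first holding V and N and then N alone. The two low-order bits of the PC are always 00.]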
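One simple choice for h() is to drop the two always-zero low bits of the PC and keep the remaining low-order bits. The slides do not say which hash is used, so the function and the table size below are assumptions for illustration only.

```python
TABLE_ENTRIES = 1024  # hypothetical "few entries" table size


def h(pc, entries=TABLE_ENTRIES):
    """Map a PC to a table index: drop the 00 alignment bits, then wrap."""
    return (pc >> 2) % entries


print(h(0x100), h(0x200))  # 64 128

# Two different branches can land on the same entry (aliasing):
alias = 0x100 + 4 * TABLE_ENTRIES
print(h(0x100) == h(alias))  # True
```

Without a stored PC the table cannot tell aliased branches apart, so they share and overwrite each other's prediction state.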

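2-bit counter states: 00 Strongly NT, 01 Weakly NT, 10 Weakly T, 11 Strongly T
[Figure: state diagram in which a taken branch (T) moves the counter toward 11 and a not-taken branch (NT) moves it toward 00. The table entry selected by h(PC) holds one such counter, for example moving from 01 to 10 on T.]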

      movi r18, 3          # max i
      movi r19, 2          # max j
      movi r8, 0           # i = 0
DOi:  movi r9, 0           # j = 0
DOj:  some computation
      addi r9, r9, 1
      blt r9, r19, DOj     # J branch
      addi r8, r8, 1
      blt r8, r18, DOi     # I branch

Outcomes:                   T       T       NT      T       T       NT      T       T       NT
Prediction (counter state): T (11)  T (11)  T (10)  T (11)  T (11)  T (10)  T (11)  T (11)  T (10)
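The outcome and counter-state sequence above can be reproduced with a saturating 2-bit counter. This is a sketch under two assumptions: the counter starts in state 11, and the state shown on the slide is the state after each update. The class and function names are illustrative.

```python
class TwoBitCounter:
    """Saturating counter: 00/01 predict not taken, 10/11 predict taken."""

    def __init__(self, state=0b11):  # assumed initial state: strongly taken
        self.state = state

    def predict(self):
        return self.state >= 0b10

    def update(self, taken):
        if taken:
            self.state = min(self.state + 1, 0b11)
        else:
            self.state = max(self.state - 1, 0b00)


def run(outcomes, counter):
    """Return (predicted_taken, state_after_update, correct) per outcome."""
    log = []
    for taken in outcomes:
        predicted = counter.predict()
        counter.update(taken)
        log.append((predicted, format(counter.state, "02b"), predicted == taken))
    return log


pattern = [True, True, False] * 3  # T T NT repeated three times
log = run(pattern, TwoBitCounter())
for predicted, state, correct in log:
    print("T" if predicted else "NT", state, "ok" if correct else "miss")
```

A single not-taken outcome only weakens the counter to 10, so it keeps predicting taken and misses exactly once per T T NT group, for 6 correct predictions out of 9.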

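[Figure: a 2-bit history register (older bit, younger bit), initially 0 0, is combined with the PC through h() to select a table entry.]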

      movi r9, 0
DOj:  some computation
      addi r9, r9, 1
      blt r9, r19, DOj

[Figure sequence: the loop above is shown three times in a row. For each execution of the blt, the slides show the history bits used together with the PC to select an entry (00 at first, then 01, 10 and 11), the prediction read from that entry, and the pattern learned thus far. Once the patterns have been learned, the predictions made with histories 10 and 11 are marked correct.]
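The learning process in these slides can be sketched as a 2-bit history register indexing a small table of last outcomes. This is a simplified model, assuming a single branch (so the PC is left out of the index), a repeating T T NT outcome stream, a default prediction of not taken, and one bit of state per history pattern; none of these details are confirmed by the slides.

```python
def simulate_history_predictor(outcomes, history_bits=2):
    """Predict each outcome from the last `history_bits` outcomes.

    outcomes: iterable of 1 (taken) / 0 (not taken).
    Returns (pattern_table, mispredictions).
    """
    mask = (1 << history_bits) - 1
    history = 0                                # older bit on the left, younger on the right
    pattern_table = [0] * (1 << history_bits)  # default: predict not taken
    mispredictions = 0
    for taken in outcomes:
        if pattern_table[history] != taken:
            mispredictions += 1
        pattern_table[history] = taken         # learn what followed this history
        history = ((history << 1) | taken) & mask
    return pattern_table, mispredictions


outcomes = [1, 1, 0] * 10  # T T NT repeated
table, misses = simulate_history_predictor(outcomes)
print(table)   # [1, 1, 1, 0]
print(misses)  # 3
```

After three warm-up mispredictions the table holds 00 → T, 01 → T, 10 → T, 11 → NT, and every later prediction in the stream is correct, including the not-taken outcome that a 2-bit counter alone keeps missing.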

[Figure: the PC feeds both a bimodal predictor and a gshare predictor.]
Which is best for this branch?

[Figure: the PC feeds a bimodal predictor, a gshare predictor and a meta predictor.]
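One common way to build the meta predictor is a 2-bit chooser per entry that drifts toward whichever component was right when the two disagree. The slides do not give the update rule, so the sketch below is an assumption; the function names and the 10-bit index width are illustrative. The gshare index shown is the usual XOR of PC bits with the global branch history.

```python
def gshare_index(pc, global_history, index_bits=10):
    """gshare table index: PC bits XORed with the global branch history."""
    return ((pc >> 2) ^ global_history) & ((1 << index_bits) - 1)


def choose(chooser, bimodal_prediction, gshare_prediction):
    """Chooser states 0-1 select bimodal, states 2-3 select gshare."""
    return gshare_prediction if chooser >= 2 else bimodal_prediction


def update_chooser(chooser, bimodal_correct, gshare_correct):
    """Move toward the component that was right; no change if they agree."""
    if gshare_correct and not bimodal_correct:
        return min(chooser + 1, 3)
    if bimodal_correct and not gshare_correct:
        return max(chooser - 1, 0)
    return chooser


chooser = 1  # start out weakly preferring bimodal
for _ in range(2):  # gshare right, bimodal wrong, twice in a row
    chooser = update_chooser(chooser, bimodal_correct=False, gshare_correct=True)
print(chooser)                      # 3
print(choose(chooser, False, True))  # True (gshare's prediction is used)
```

Because the chooser only moves when the components disagree, a branch that both predict well does not disturb the selection learned for branches that alias to the same chooser entry.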

Overwriting Prediction
[Figure: timing over cycles C1-C3 with two instructions in fetch, decode and exec. A fast prediction is available first, and the overwriting prediction arrives later.]

BTB: each entry holds PC | TARGET ADDRESS | V

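[Figure: the PC feeds the BTB and the direction predictor. Together they select the next PC, which is either PC+4 or the target address from the BTB.]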
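The next-PC selection in the figure can be sketched as follows. The BTB is modeled as a dictionary from branch PC to target address, and the direction predictor as any function from PC to a taken/not-taken guess; both are simplifications, and the target address in the example is made up.

```python
def next_pc(pc, btb, predict_taken):
    """Pick the next fetch address from the BTB and the direction predictor.

    btb: dict mapping a branch PC to its target (a missing key is a BTB miss).
    predict_taken: function pc -> bool, the direction prediction.
    """
    target = btb.get(pc)
    if target is not None and predict_taken(pc):
        return target  # BTB hit and predicted taken: fetch from the target
    return pc + 4      # otherwise fall through to the next instruction


btb = {0x200: 0x0F0}  # hypothetical entry: blt at 0x200 jumps back to 0x0F0

print(hex(next_pc(0x200, btb, lambda pc: True)))   # 0xf0
print(hex(next_pc(0x200, btb, lambda pc: False)))  # 0x204
print(hex(next_pc(0x100, btb, lambda pc: True)))   # 0x104 (BTB miss)
```

On a BTB miss the fetch stage has no target to use, so it falls through to PC+4 even if the direction predictor says taken.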

Calls and returns

if (error != 0)
    error_handle();

if (a[i] < threshold)
    a++;
else
    b++;

With a branch:
      Load a[i] in r8
      blt r8, r9, THEN      # r9 holds threshold
ELSE: addi r10, r10, 1      # b++
      br DONE
THEN: addi r11, r11, 1      # a++
DONE:

With predication:
      Load a[i] in r8
      cmplt c0, r8, r9      # condition register c0 = r8 < r9
 c0:  addi r11, r11, 1      # a++
!c0:  addi r10, r10, 1      # b++
