TIME C1 C2 C3 C4 C5 C6 C7 C8 C9 I1 branch decode exec mem wb bubble

Presentation transcript:

1 TIME C1 C2 C3 C4 C5 C6 C7 C8 C9
I1 branch decode exec mem wb
bubble bubble bubble bubble bubble
bubble bubble bubble bubble bubble
I4 fetch decode exec mem wb
I5 fetch decode exec mem wb
Redirected fetch

2 TIME C1 C2 C3 C4 C5 C6 C7 C8 C9
I1 branch decode exec mem wb
bubble bubble bubble bubble bubble
I2 fetch decode exec mem wb
I3 fetch decode exec mem wb
Redirected fetch

3 Predict PC + 4
Resolve if branch
Resolve if non-branch
TIME C1 C2 C3 C4 C5 C6 C7 C8 C9
I1 fetch decode exec mem wb
I2 fetch decode exec mem wb
I3 fetch decode exec mem wb
I4 fetch decode exec mem wb
I5 fetch decode exec mem wb

4 Predict PC + 4
Resolve next PC != PC + 4
TIME C1 C2 C3 C4 C5 C6 C7 C8 C9
I1 branch decode exec mem wb
squashed:
I2 fetch decode bubble bubble bubble
I3 fetch bubble bubble bubble bubble
I4 fetch decode exec mem wb
I5 fetch decode exec mem wb
Redirected fetch

5
do {
    if (a[i] != 0)
        some computation
    i++;
} while (i < 100);

DOWHILE: load in r10 a[i]
beq r10, r0, SKIP
some computation
SKIP: some computation
addi r11, r11, 1
blt r11, r12, DOWHILE

6 [Figure: fetch timing over cycles C1 C2 C3. Labels: Instruction opcode available; Calculate Taken PC; Select PC+4 or Taken PC. Pipeline rows: I1 fetch (cache) decode exec; next instruction fetch decode.]

7 ITERATION 1
DOWHILE: load in r10 a[i]
beq r10, r0, SKIP    FIRST TIME SEEN → PREDICT NOT TAKEN → LEARN NOT TAKEN
some computation
SKIP: some computation
addi r11, r11, 1
blt r11, r12, DOWHILE    FIRST TIME SEEN → PREDICT NOT TAKEN → MISPREDICTION → LEARN TAKEN

ITERATION
DOWHILE: load in r10 a[i]
beq r10, r0, SKIP    SEEN BEFORE → PREDICT “SAME AS LAST TIME”: NOT TAKEN → LEARN NOT TAKEN
some computation
SKIP: some computation
addi r11, r11, 1
blt r11, r12, DOWHILE    SEEN BEFORE → PREDICT “SAME AS LAST TIME”: PREDICT TAKEN → LEARN NOT TAKEN

8 PC V N

9 ITERATION 1
DOWHILE: load in r10 a[i]
0x100: beq r10, r0, SKIP
some computation
SKIP: addi r11, r11, 1
0x200: blt r11, r12, DOWHILE

ITERATION
DOWHILE: load in r10 a[i]
0x100: beq r10, r0, SKIP
some computation
SKIP: addi r11, r11, 1
0x200: blt r11, r12, DOWHILE

[Figure: table contents (PC column; entries 0x100 1 and 0x200) before and after each branch.]
ITERATION 1, beq: Predict not taken (default)
ITERATION 1, blt: Predict not taken (default)
ITERATION, beq: Predict not taken (table)
ITERATION, blt: Predict taken (table)

10 Accuracy
Accuracy = # correct predictions / # all predictions
100 beq → all not taken
100 blt → 1 not taken at the end
Predictions:
beq: all not taken (default), all correct
blt: first predicted not taken (default), wrong; last predicted taken, wrong
Accuracy = (100 + 98) / 200 = 99%
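
The 99% figure can be checked with a small simulation. This is a minimal sketch, not the presenter's code: the names `Entry`, `predict`, `learn` and `count_correct` are made up for illustration, and it assumes a[i] != 0 for every i, so the beq is never taken.

```c
#include <stdbool.h>

/* One predictor entry: V = branch seen before, N = last outcome (true = taken). */
typedef struct { bool valid; bool last_taken; } Entry;

/* Default is "not taken"; once the branch has been seen, repeat its last outcome. */
static bool predict(const Entry *e) { return e->valid && e->last_taken; }

static void learn(Entry *e, bool taken) { e->valid = true; e->last_taken = taken; }

/* Simulates the do-while loop and returns the number of correct predictions
   out of 2 * iterations (one beq and one blt per iteration). */
int count_correct(int iterations) {
    Entry beq = {false, false}, blt = {false, false};
    int correct = 0;
    for (int i = 0; i < iterations; i++) {
        bool beq_taken = false;                /* assumption: a[i] != 0 always */
        bool blt_taken = (i + 1 < iterations); /* loops back except at the very end */
        if (predict(&beq) == beq_taken) correct++;
        learn(&beq, beq_taken);
        if (predict(&blt) == blt_taken) correct++;
        learn(&blt, blt_taken);
    }
    return correct;
}
```

Under these assumptions, `count_correct(100)` counts 100 correct beq predictions and 98 correct blt predictions, matching the (100 + 98) / 200 on the slide.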

11 How big does this need to be? PC V N
4G addresses, 4 bytes per instruction, aligned → 1G possible branches
1G entries, each 4 bytes (PC) plus 2 bits (V & N): TOO LARGE
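
To put a number on "too large", the arithmetic can be worked out as follows. This assumes the full 32-bit PC is stored in each entry, as the slide states.

```latex
\begin{align*}
\text{entries} &= \frac{2^{32}\ \text{addresses}}{4\ \text{bytes per instruction}} = 2^{30} \approx 1\text{G} \\
\text{bits per entry} &= 32\ (\text{PC}) + 1\ (V) + 1\ (N) = 34 \\
\text{total} &= 2^{30} \times 34\ \text{bits} = 34\ \text{Gibit} = 4.25\ \text{GiB}
\end{align*}
```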

12 How big does this need to be? PC V N
But if we had 1G entries, we would have a 1-to-1 mapping of PC to entry (1G entries of V N): no need to store the PC

13 [Figure: the PC (low two bits 00) indexes a 1G-entry table of V N bits directly; alternatively a hash h() of the PC indexes a table with few entries. With few entries, each entry holds only the N bit.]
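
One possible h() is sketched below. It is only an illustration: the slide does not say how h() is computed, so the table size `TABLE_BITS` and the choice of simply keeping low-order PC bits are assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

#define TABLE_BITS 10u                 /* hypothetical: 1024 entries */
#define TABLE_SIZE (1u << TABLE_BITS)

/* One N bit per entry; zero-initialized, so every branch starts as "not taken". */
static bool table_n[TABLE_SIZE];

/* h(): drop the two low PC bits (always 00 for aligned 4-byte instructions)
   and keep the next TABLE_BITS bits as the table index. */
uint32_t h(uint32_t pc) { return (pc >> 2) & (TABLE_SIZE - 1u); }

bool predict_taken(uint32_t pc) { return table_n[h(pc)]; }

void learn(uint32_t pc, bool taken) { table_n[h(pc)] = taken; }
```

Because no PC is stored, two branches whose addresses differ only above the indexed bits share an entry and overwrite each other's N bit. That aliasing is the cost of the small table.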

14 [Figure: the PC goes through h() to select a 2-bit counter. States: 00 Strongly NT, 01 Weakly NT, 10 Weakly T, 11 Strongly T. Arrows labelled T and NT move between the states.]
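
The four states can be written as a 2-bit saturating counter. The sketch below assumes the usual transitions (one step up on taken, one step down on not taken, saturating at 00 and 11); the function names are illustrative only.

```c
#include <stdbool.h>
#include <stdint.h>

enum { STRONGLY_NT = 0, WEAKLY_NT = 1, WEAKLY_T = 2, STRONGLY_T = 3 };

/* Predict taken in the two "T" states (10 and 11). */
bool counter_predict(uint8_t c) { return c >= WEAKLY_T; }

/* Move one state toward the observed outcome, saturating at 00 and 11. */
uint8_t counter_update(uint8_t c, bool taken) {
    if (taken) return (uint8_t)(c < STRONGLY_T ? c + 1 : c);
    return (uint8_t)(c > STRONGLY_NT ? c - 1 : c);
}
```

With these transitions, a single not-taken outcome moves a branch in state 11 only to 10, so the next prediction is still taken. A loop branch therefore mispredicts once at loop exit rather than twice.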

15
movi r18, 3    # max i
movi r19, 2    # max j
movi r8, 0    # i = 0
DOi: movi r9, 0    # j = 0
DOj: some computation
addi r9, r9, 1
blt r9, r19, DOj    # J branch
addi r8, r8, 1
blt r8, r18, DOi    # I branch

T T NT T T NT T T NT
T (11) T(11) T(10) T(11) T (11) T(10) T(11) T (11) T(10)

16 [Figure: a history register (older outcome on one side, younger on the other, initially 0 0) and the PC (low bits 00) feed h().]

17 (inner loop, shown three times on the slide)
movi r9, 0
DOj: some computation
addi r9, r9, 1
blt r9, r19, DOj

history prediction Pattern learned PC 00 0 0 0 0 1
history prediction PC 00 1 0 1 0 1
history prediction PC 00 1 1 1 1

18 (inner loop, shown three times on the slide)
movi r9, 0
DOj: some computation
addi r9, r9, 1
blt r9, r19, DOj

history prediction Pattern learned PC 00 01 01 1
Learned thus far 1 0 1 1 1 0 1 1 0 0 1

19 (inner loop, shown three times on the slide)
movi r9, 0
DOj: some computation
addi r9, r9, 1
blt r9, r19, DOj

1 0 1 1 0 1 1 0 0 Learned thus far
history prediction PC 00 10 1 correct
PC 00 11 correct

20 1 0 1 1 0 1 1 0 0 Learned thus far
(inner loop, shown three times on the slide)
movi r9, 0
DOj: some computation
addi r9, r9, 1
blt r9, r19, DOj

history prediction PC 00 01 1
history prediction PC 00 10 1
PC 00 11
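
The idea of learning which outcome follows each history pattern can be sketched as below. This is a simplified model under stated assumptions: a single branch, a 2-bit history, one prediction bit per pattern, and a repeating T T NT outcome sequence chosen for illustration. The struct and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define HISTORY_BITS 2u
#define PATTERNS (1u << HISTORY_BITS)

typedef struct {
    uint8_t history;           /* last HISTORY_BITS outcomes, youngest in bit 0 */
    bool prediction[PATTERNS]; /* outcome learned for each history pattern */
} HistoryPredictor;

bool hp_predict(const HistoryPredictor *p) { return p->prediction[p->history]; }

/* Record the outcome for the current pattern, then shift it into the history. */
void hp_learn(HistoryPredictor *p, bool taken) {
    p->prediction[p->history] = taken;
    p->history = (uint8_t)(((p->history << 1) | (taken ? 1u : 0u)) & (PATTERNS - 1u));
}

/* Counts mispredictions over `rounds` repetitions of the sequence T T NT. */
int mispredictions(int rounds) {
    static const bool pattern[3] = {true, true, false};
    HistoryPredictor p = {0};
    int wrong = 0;
    for (int r = 0; r < rounds; r++) {
        for (int k = 0; k < 3; k++) {
            if (hp_predict(&p) != pattern[k]) wrong++;
            hp_learn(&p, pattern[k]);
        }
    }
    return wrong;
}
```

In this model the only mispredictions occur while the table is still being filled. Once the histories 01, 10 and 11 each have a learned outcome, every later repetition of T T NT is predicted correctly, which a single 2-bit counter cannot achieve.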

21 PC bimodal Which is best for this branch? gshare
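
The two predictors differ mainly in how the table index is formed. The sketch below assumes the common definitions (bimodal indexes by PC bits only, gshare XORs the PC bits with a global branch history); `INDEX_BITS` is a hypothetical table size.

```c
#include <stdint.h>

#define INDEX_BITS 12u                       /* hypothetical: 4096 counters */
#define INDEX_MASK ((1u << INDEX_BITS) - 1u)

/* Bimodal: the index depends only on the branch address. */
uint32_t bimodal_index(uint32_t pc) { return (pc >> 2) & INDEX_MASK; }

/* gshare: XOR the branch address with the global history of recent outcomes,
   so the same branch uses different counters in different contexts. */
uint32_t gshare_index(uint32_t pc, uint32_t global_history) {
    return ((pc >> 2) ^ global_history) & INDEX_MASK;
}
```

A branch that almost always goes one way is served well by the bimodal index, while a branch whose outcome follows a repeating pattern benefits from the history-dependent index.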

22 PC bimodal gshare meta
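
A meta predictor can be sketched as one more 2-bit counter that picks between the two component predictions. The slide gives no update rule, so the rule below is an assumption: the chooser moves toward whichever component was right only when they disagree in correctness.

```c
#include <stdbool.h>
#include <stdint.h>

/* Chooser values 0..1 favour bimodal, 2..3 favour gshare (assumed encoding). */
bool meta_select(uint8_t chooser, bool bimodal_pred, bool gshare_pred) {
    return chooser >= 2 ? gshare_pred : bimodal_pred;
}

/* Train the chooser only when exactly one component was correct. */
uint8_t meta_update(uint8_t chooser, bool bimodal_pred, bool gshare_pred, bool taken) {
    bool bimodal_ok = (bimodal_pred == taken);
    bool gshare_ok = (gshare_pred == taken);
    if (gshare_ok && !bimodal_ok && chooser < 3) return (uint8_t)(chooser + 1);
    if (bimodal_ok && !gshare_ok && chooser > 0) return (uint8_t)(chooser - 1);
    return chooser;
}
```

Leaving the chooser unchanged when both components agree keeps it from drifting on branches that either predictor handles equally well.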

23 Overwriting Prediction
[Figure: over cycles C1 C2 C3, a "Fast Prediction available" point and a later "Overwriting Prediction" point; pipeline rows: fetch decode exec, fetch decode.]

24 BTB PC TARGET ADDRESS V PC TARGET ADDRESS V PC TARGET ADDRESS V
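
A BTB with the three fields shown (PC, target address, V) might look like the sketch below. The direct-mapped organisation and the size `BTB_ENTRIES` are assumptions, and the function names are illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

#define BTB_ENTRIES 256u /* hypothetical size, must be a power of two */

typedef struct { bool valid; uint32_t pc; uint32_t target; } BtbEntry;

static BtbEntry btb[BTB_ENTRIES]; /* zero-initialized: every entry starts invalid */

static uint32_t btb_index(uint32_t pc) { return (pc >> 2) & (BTB_ENTRIES - 1u); }

/* On a hit, writes the stored target to *target and returns true. */
bool btb_lookup(uint32_t pc, uint32_t *target) {
    const BtbEntry *e = &btb[btb_index(pc)];
    if (e->valid && e->pc == pc) {
        *target = e->target;
        return true;
    }
    return false;
}

/* Records (or replaces) the target of the branch at pc. */
void btb_insert(uint32_t pc, uint32_t target) {
    BtbEntry *e = &btb[btb_index(pc)];
    e->valid = true;
    e->pc = pc;
    e->target = target;
}
```

Unlike the direction table, the stored PC is compared on lookup here, because jumping to another branch's target would fetch the wrong instructions.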

25 PC PC+4 Next PC BTB Direction Predictor
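
The selection of the next PC can be summarised in one function. This is a sketch assuming the simplest policy: use the BTB target only when the BTB hits and the direction predictor says taken, otherwise fall through to PC + 4.

```c
#include <stdbool.h>
#include <stdint.h>

/* btb_hit / btb_target come from the BTB, predict_taken from the direction predictor. */
uint32_t next_pc(uint32_t pc, bool btb_hit, uint32_t btb_target, bool predict_taken) {
    return (btb_hit && predict_taken) ? btb_target : pc + 4u;
}
```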

26 Calls and returns

27 if (error != 0) error_handle();

if (a[i] < threshold) a++; else b++;

With a branch:
Load a[i] in r8
blt r8, r9, THEN    # r9 holds threshold
ELSE: addi r10, r10, 1    # b++
br DONE
THEN: addi r11, r11, 1    # a++
DONE:

With predication:
Load a[i] in r8
cmplt c0, r8, r9    # condition register c0 = r8 < r9
c0: addi r11, r11, 1    # a++ when c0 is true
!c0: addi r10, r10, 1    # b++ when c0 is false
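
The same if-conversion can be shown at the C level. The sketch below is illustrative only: the function and parameter names are hypothetical, and whether a compiler actually emits branch-free code for either version depends on the target and optimisation settings.

```c
#include <stddef.h>

/* Branching version: one hard-to-predict branch per element. */
void count_with_branch(const int *a, size_t n, int threshold,
                       int *below, int *not_below) {
    for (size_t i = 0; i < n; i++) {
        if (a[i] < threshold) (*below)++;
        else (*not_below)++;
    }
}

/* Predicated version: compute the condition once, then let it guard both updates. */
void count_predicated(const int *a, size_t n, int threshold,
                      int *below, int *not_below) {
    for (size_t i = 0; i < n; i++) {
        int c0 = a[i] < threshold; /* 1 if true, 0 if false */
        *below += c0;              /* takes effect only when c0 is true */
        *not_below += !c0;         /* takes effect only when c0 is false */
    }
}
```

Both functions produce the same counts. The predicated form always executes both updates, trading a possible misprediction for a little extra work.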

