COSC6385 Advanced Computer Architecture Lecture 9. Branch Prediction


1 COSC6385 Advanced Computer Architecture Lecture 9. Branch Prediction
Instructor: Weidong Shi (Larry), PhD, Computer Science Department, University of Houston

2 Local Branch History Register
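for (i=0; i<4; i++) { …. }

    addi r10, r0, 4
    addi r1, r1, r0
L1: … …
    addi r1, r1, 1
    bne r1, r10, L1

Actual outcome of the branch, followed by the 3-bit local history register after that outcome is shifted in (register starts at 000):
T 001, T 011, T 111, T 111, NT 110, T 101, T 011, T 111, T 111, NT 110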
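The shift-register behaviour on this slide can be replayed in a few lines. This is a minimal sketch, not code from the lecture: the names `shift_history` and `replay` are hypothetical, and it assumes the newest outcome enters at the least significant bit (1 = taken, 0 = not taken), which matches the values listed on the slide.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Shift one outcome into a history register that keeps the last `bits` outcomes.
// Assumption: newest outcome goes into the least significant bit.
inline uint32_t shift_history(uint32_t history, bool taken, unsigned bits) {
    assert(bits > 0 && bits < 32);
    return ((history << 1) | (taken ? 1u : 0u)) & ((1u << bits) - 1u);
}

// Replay a sequence of outcomes and record the register value after each one.
inline std::vector<uint32_t> replay(const std::vector<bool>& outcomes, unsigned bits) {
    std::vector<uint32_t> trace;
    uint32_t h = 0;  // register starts at 000
    for (bool taken : outcomes) {
        h = shift_history(h, taken, bits);
        trace.push_back(h);
    }
    return trace;
}
```

Replaying the slide's outcome sequence T, T, T, T, NT, T with a 3-bit register gives 001, 011, 111, 111, 110, 101, the same values the slide lists.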

3 Correlated Branch Predictor [PanSoRahmeh’92]
2-bit Sat. Counter Scheme: the Branch PC is hashed to w bits to select one of 2^w 2-bit counters, which gives the Prediction
(2,2) Correlation Scheme: the hashed Branch PC selects an entry holding several 2-bit counters; a 2-bit shift register (global branch history) selects which of them gives the Prediction for the subsequent branch direction
(M,N) correlation scheme
M: shift register size (# bits)
N: N-bit counter

4 Two-Level Branch Predictor [YehPatt91,92,93]
Generalized correlated branch predictor
1st level keeps branch history in Branch History Register (BHR)
2nd level segregates pattern history in Pattern History Table (PHT)
Diagram: the N-bit Branch History Register (BHR; shift left when update, holding outcomes Rc-k … Rc-1) indexes the Pattern History Table (PHT) of 2^N entries (00…..00 through 11…..11). The selected entry's Current State gives the Prediction; the FSM Update Logic performs the PHT update from the Branch History Pattern and Rc.
Rc: Actual Branch Outcome

5 Pattern History Table
2^N entries addressed by N-bit BHR
Each entry keeps a counter (2-bit or more) for prediction
Counter update: the same as 2-bit counter
Can be initialized in alternate patterns (01, 10, 01, 10, ..)
Alias (or interference) problem
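A small sketch of such a table may help make the indexing and counter update concrete. The struct name `PatternHistoryTable` and its members are hypothetical; the sketch assumes 2-bit saturating counters whose upper bit is the prediction, and it uses the alternating 01, 10 initialization mentioned above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical PHT with 2^N two-bit saturating counters, addressed by an N-bit BHR.
struct PatternHistoryTable {
    unsigned n_bits;
    std::vector<uint8_t> counters;  // each value is in 0..3

    explicit PatternHistoryTable(unsigned n)
        : n_bits(n), counters(std::size_t{1} << n) {
        assert(n > 0 && n < 20);
        // Alternate initial values 01, 10, 01, 10, ...
        for (std::size_t i = 0; i < counters.size(); ++i)
            counters[i] = (i % 2 == 0) ? 1 : 2;
    }

    uint32_t mask() const { return (1u << n_bits) - 1u; }

    // Predict taken when the counter's upper bit is set (value 2 or 3).
    bool predict(uint32_t bhr) const { return counters[bhr & mask()] >= 2; }

    // Saturating increment on taken, saturating decrement on not taken.
    void update(uint32_t bhr, bool taken) {
        uint8_t& c = counters[bhr & mask()];
        if (taken) { if (c < 3) ++c; }
        else       { if (c > 0) --c; }
    }
};
```

Because every branch that produces the same history pattern shares one counter, two unrelated branches can push the same entry in opposite directions, which is the alias (interference) problem named above.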

6 Global History Schemes
GAg: Global BHR indexes a single Global PHT
GAs: Global BHR indexes Per-set PHTs (SPHTs); the PHT is selected by SetP(B)
GAp: Global BHR indexes Per-addr PHTs (PPHTs); the PHT is selected by Addr(B)
Set can be determined by branch opcode, compiler classification, or branch PC address.
* [PanSoRahmeh’92] similar to GAp

7 GAs Two-Level Branch Prediction
The 2 LSBs are insignificant for 32-bit instructions
Diagram: PC = 0x C selects the PHT; BHR = 0110 selects the entry; the entry holds 10, so MSB = 1: Predict Taken

8 Predictor Update (Actually, Not Taken)
Update Predictor after branch is resolved
Wrong Prediction: the PHT entry selected by PC = 0x C and BHR = 0110 is decremented from 10 to 01
BHR is shifted left with the not-taken outcome: 0110 becomes 1100
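The predict and update steps on the last two slides can be sketched together. This is an illustration under assumptions, not the lecture's code: `GAsPredictor` is a hypothetical name, the table sizes (4 history bits, 4 sets) are chosen only to match the 4-bit BHR in the example, and the set is taken from the PC bits just above the two insignificant LSBs.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical GAs predictor: one global BHR, several per-set PHTs.
struct GAsPredictor {
    static constexpr unsigned kHistoryBits = 4;  // assumption, matches BHR = 0110
    static constexpr unsigned kSetBits = 2;      // assumption
    uint32_t bhr = 0;
    std::array<std::array<uint8_t, (1u << kHistoryBits)>, (1u << kSetBits)> pht{};

    // Drop the 2 LSBs of the PC, then keep kSetBits bits to pick the PHT.
    static uint32_t set_of(uint32_t pc) { return (pc >> 2) & ((1u << kSetBits) - 1u); }

    // Predict taken when the MSB of the selected 2-bit counter is 1.
    bool predict(uint32_t pc) const {
        return (pht[set_of(pc)][bhr] & 0x2u) != 0;
    }

    // After the branch resolves: train the counter, then shift the outcome into the BHR.
    void update(uint32_t pc, bool taken) {
        assert(bhr < (1u << kHistoryBits));
        uint8_t& c = pht[set_of(pc)][bhr];
        if (taken) { if (c < 3) ++c; }
        else       { if (c > 0) --c; }
        bhr = ((bhr << 1) | (taken ? 1u : 0u)) & ((1u << kHistoryBits) - 1u);
    }
};
```

With the counter at 10 and the BHR at 0110, `predict` returns taken; a not-taken outcome then leaves the counter at 01 and the BHR at 1100, as on the slide.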

9 Per-Address History Schemes
PAg: Per-addr BHT (PBHT), indexed by Addr(B), with a single Global PHT
PAs: Per-addr BHT (PBHT), indexed by Addr(B), with Per-set PHTs (SPHTs) selected by SetP(B)
PAp: Per-addr BHT (PBHT), indexed by Addr(B), with Per-addr PHTs (PPHTs) selected by Addr(B)
Ex: Alpha 21264’s local predictor
Ex: P6, Itanium

10 PAs Two-Level Branch Predictor
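Diagram: the PC selects a BHT entry; the stored history (one of 000, 001, 010, 011, 100, 101, 110, 111) indexes the PHT; the selected counter is 11, so MSB = 1: Predict Taken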

11 PAs: Track the History of a Branch
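Diagram: for each PC the predictor stores the Previous Outcome and two counters (Counter if prev=0, Counter if prev=1); the previous outcome selects which counter gives the prediction.
Counter states: 0 SN, 1 WN, 2 WT, 3 ST
A selected counter of 1 gives prediction = N; a selected counter of 3 gives prediction = T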

12 Deeper History Covers More Patterns
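Diagram: for each PC the predictor stores the Last 3 Outcomes and one counter per history value (Counter if prev=000, Counter if prev=001, Counter if prev=010, …, Counter if prev=111)
What pattern has this branch predictor entry learned?
History 001: Prediction 1
History 011: Prediction 0
History 110: Prediction 0
History 100: Prediction 1
…
Answer: (0011)*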
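To see why three bits of per-branch history are enough for the (0011)* pattern, the following sketch trains one entry on that pattern and counts mispredictions. The names are hypothetical, and the starting state (history 000, all counters weakly not-taken) is an assumption made for the example.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical single-branch entry: 3 bits of local history, one 2-bit counter per pattern.
struct LocalPatternPredictor {
    uint32_t history = 0;             // last 3 outcomes of this branch
    std::array<uint8_t, 8> counters;  // indexed by the 3-bit history

    LocalPatternPredictor() { counters.fill(1); }  // assumption: start at WN

    bool predict() const { return counters[history] >= 2; }

    void update(bool taken) {
        uint8_t& c = counters[history];
        if (taken) { if (c < 3) ++c; }
        else       { if (c > 0) --c; }
        history = ((history << 1) | (taken ? 1u : 0u)) & 0x7u;
    }
};

// Feed the repeating pattern 0,0,1,1 and count mispredictions after a warm-up.
inline int mispredictions_after_warmup(int warmup_rounds, int measured_rounds) {
    const bool pattern[4] = {false, false, true, true};
    LocalPatternPredictor p;
    int misses = 0;
    for (int r = 0; r < warmup_rounds + measured_rounds; ++r) {
        for (bool outcome : pattern) {
            if (r >= warmup_rounds && p.predict() != outcome) ++misses;
            p.update(outcome);
        }
    }
    return misses;
}
```

In steady state each of the four histories 001, 011, 110, 100 is always followed by the same outcome, so every counter saturates in one direction and the branch is predicted perfectly. A single 2-bit counter without history cannot do this.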

13 PHT Indexing
Gselect 4/4: the PHT index is formed from 4 bits of the Branch addr and 4 bits of the Global history
Insufficient History
Tradeoff between more history bits and address bits
Too many bits needed in Gselect → sparse table entries

14 Gshare Branch Predictor [McFarling93]
Gselect 4/4: Index PHT by concatenating the low-order 4 bits of the Branch addr and of the Global history
Gshare 8/8: Index PHT by {Branch address XOR Global history}
Tradeoff between more history bits and address bits
Too many bits needed in Gselect → sparse table entries
Gshare → does not lose global history bits
Ex: AMD Athlon, MIPS R12000, Sun MAJC, Broadcom SiByte’s SB-1
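The two indexing functions differ by one operator, which a short sketch makes visible. The function names are hypothetical, and placing the address bits above the history bits in the Gselect index is an assumption (the slide only says the two fields are concatenated).

```cpp
#include <cassert>
#include <cstdint>

// Keep the low n bits of v.
inline uint32_t low_bits(uint32_t v, unsigned n) {
    assert(n > 0 && n < 32);
    return v & ((1u << n) - 1u);
}

// Gselect a/h: concatenate low address bits with low history bits.
// Assumption: address bits occupy the upper part of the index.
inline uint32_t gselect_index(uint32_t addr, uint32_t hist,
                              unsigned addr_bits, unsigned hist_bits) {
    return (low_bits(addr, addr_bits) << hist_bits) | low_bits(hist, hist_bits);
}

// Gshare n/n: XOR n address bits with n history bits to form an n-bit index.
inline uint32_t gshare_index(uint32_t addr, uint32_t hist, unsigned bits) {
    return low_bits(addr, bits) ^ low_bits(hist, bits);
}
```

For an 8-bit index, Gselect 4/4 sees only 4 bits of each input, so addresses 0x1B and 0x2B with history 0x0D both map to index 0xBD. Gshare 8/8 uses all 8 bits of both and maps them to 0x16 and 0x26.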

15 Gshare Branch Predictor
Diagram: PC Address XOR Global BHR indexes the PHT; the selected counter is 00, so MSB = 0: Predict Not Taken

16 Hybrid Branch Predictor [McFarling93]
Diagram: the Branch PC indexes predictors P0 and P1 and a Choice (or Meta) Predictor, which selects the Final Prediction
Some branches are correlated to global history, some to local history
Only update the meta-predictor when the 2 predictors disagree

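17 Hybrid Predictors
Pred0, Pred1, and a Meta-Predictor (table of 2-/3-bit counters) feed the Final Prediction
If meta-counter MSB = 0, use pred0 else use pred1
Meta Update, by which predictors were correct:
Pred0 and Pred1 agree (both correct or both wrong): ---
Only Pred1 correct: Inc
Only Pred0 correct: Dec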
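The selection and update rule for the meta-predictor can be sketched as below. `MetaCounter` and `final_prediction` are hypothetical names; the direction of Inc and Dec (increment toward pred1, decrement toward pred0) is an assumption chosen to agree with the rule that MSB = 0 selects pred0.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical 2-bit meta-counter for one hybrid-predictor entry.
struct MetaCounter {
    uint8_t value = 1;  // 0..3; assumption: start weakly favouring pred0

    // MSB = 0 selects pred0, MSB = 1 selects pred1.
    bool use_pred1() const { return (value & 0x2u) != 0; }

    // Update only when exactly one component predictor was correct.
    void update(bool pred0_correct, bool pred1_correct) {
        if (pred0_correct == pred1_correct) return;      // ---
        if (pred1_correct) { if (value < 3) ++value; }   // Inc
        else               { if (value > 0) --value; }   // Dec
    }
};

inline bool final_prediction(const MetaCounter& m, bool pred0, bool pred1) {
    return m.use_pred1() ? pred1 : pred0;
}
```

Skipping the update when both predictors agree keeps the counter from drifting on branches where the choice makes no difference.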

18 Alpha 21264 (EV6) Hybrid Predictor
A “tournament branch predictor”
Multi-predictor scheme w/
  Local predictor (~PAg): Self-correlation
  Global predictor: Inter-correlation
  Choice predictor as the decision maker: a 2-bit sat. counter to credit either local or global predictors
Die size impact
  History info tables ~2%
  BTB ~2.7% (associated with I-$ on a per-line basis)
2 cycle latency, we will discuss more later
Diagram:
  Local History Table, 1024 x 10 bits, indexed by PC
  Single Local Predictor, 1024 x 3 bits, indexed by the 10-bit local history
  Global Predictor, 4096 x 2 bits, indexed by the 12-bit Global history
  Choice Predictor, 4096 x 2 bits
  Local prediction, Global prediction, and Meta prediction produce the Final Branch Prediction
  Next Line/set Prediction for Single-cycle Prediction: L1 I-cache (64KB 2w) & TLB, Virtual address, 4 instr./cycle

19 Compaq Alpha 21264 - Global Predictor
Global predictor has 4K entries and is indexed by the history of the last 12 branches; each entry in the global predictor is a standard 2-bit predictor
12-bit pattern: ith bit 0 => ith prior branch not taken; ith bit 1 => ith prior branch taken
Diagram: the 12-bit Branch history register indexes the State table (4096 x 2), which gives the Global Prediction

20 Per Branch Pattern History Table (Two-Level Predictor)
first level - find history (pattern)
2nd level - predict branch for that pattern
Correlating predictors
Differs from global pattern history table as each branch has its own private history
Diagram: PC indexes the BHT; the selected history (e.g. 110110) indexes the PHT of 2 bit saturating counters, whose State gives the Prediction

21 Compaq Alpha 21264 – Local Predictor
Local predictor (2-level predictor), indexed by PC: Local History Table (1024x10) feeding Local Prediction (1024x3)
Top level: a local history table consisting of 1024 10-bit entries; each 10-bit entry corresponds to the most recent 10 branch outcomes for the entry. A 10-bit history allows patterns of up to 10 branches to be discovered and predicted.
Next level: the selected entry from the local history table is used to index a table of 1K entries consisting of 3-bit saturating counters, which provide the local prediction
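The two-step lookup can be sketched as follows. This is only an illustration with the table sizes from the slide: the name `LocalPredictorSketch`, the choice of PC bits used to index the history table, and the rule "predict taken when the 3-bit counter is 4 or more" are all assumptions.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical two-level local predictor: 1024 x 10-bit histories, 1024 x 3-bit counters.
struct LocalPredictorSketch {
    std::array<uint16_t, 1024> history{};   // top level: 10-bit local histories
    std::array<uint8_t, 1024> counters{};   // next level: 3-bit saturating counters

    // Assumption: drop the 2 LSBs of the PC and use the next 10 bits.
    static uint32_t slot(uint32_t pc) { return (pc >> 2) & 1023u; }

    // Assumption: the counter's upper bit (value >= 4) means "taken".
    bool predict(uint32_t pc) const { return counters[history[slot(pc)]] >= 4; }

    void update(uint32_t pc, bool taken) {
        uint16_t& h = history[slot(pc)];
        assert(h < 1024);
        uint8_t& c = counters[h];
        if (taken) { if (c < 7) ++c; }
        else       { if (c > 0) --c; }
        h = static_cast<uint16_t>(((h << 1) | (taken ? 1 : 0)) & 1023);
    }
};
```

For an always-taken branch starting from zeroed tables, the first 10 updates fill the history with ones, and four more updates raise the counter for that history to 4, so the sketch predicts taken from the 15th occurrence onward.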

22 Compaq Alpha 21264
Tournament predictor
Local: previous executions of this branch. PC indexes the Local History Table (1024x10), which indexes Local Prediction (1024x3)
Global: previous execution of all branches. Path History indexes Global Prediction (4096x2)
Choice Prediction (4096x2): 4K 2-bit counters to choose between the global predictor and the local predictor
Total size: 4K*2 + 4K*2 + 1K*10 + 1K*3 = 29K bits! (~180,000 transistors)
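The storage total can be checked directly from the four table shapes. The helper names below are hypothetical; the table dimensions are the ones given on the slide.

```cpp
#include <cassert>
#include <cstdint>

struct Table {
    uint32_t entries;
    uint32_t bits_per_entry;
};

constexpr uint32_t table_bits(Table t) { return t.entries * t.bits_per_entry; }

// Sum of the four predictor tables listed on the slide.
constexpr uint32_t total_predictor_bits() {
    return table_bits({4096, 2})    // global prediction: 4K x 2
         + table_bits({4096, 2})    // choice prediction: 4K x 2
         + table_bits({1024, 10})   // local history table: 1K x 10
         + table_bits({1024, 3});   // local prediction: 1K x 3
}

// 8K + 8K + 10K + 3K = 29K bits.
static_assert(total_predictor_bits() == 29u * 1024u, "total should be 29K bits");
```

The global and choice tables account for 16K of the 29K bits; the local side uses 13K.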

23 Branch Prediction Accuracy
Static branch prediction (compiler) - 70%
Per branch 2-bit saturating counters (no history) - 85%
Two-level predictor (with history) % accuracy
Tournament predictor – a little more accurate than Two-level

24 Accuracy v. Size (SPEC89) Slide from David Culler

25 Predicated Execution
Source:
    if (cond) { b = 0; } else { b = 1; }

(normal branch code): blocks A, B, C, D connected by taken (T) and not-taken (N) edges
    p1 = (cond)
    branch p1, TARGET
    mov b, 1
    jmp JOIN
    TARGET: mov b, 0

(predicated code): blocks A, B, C, D execute in sequence
    A: p1 = (cond)
    B: (!p1) mov b, 1
    C: (p1) mov b, 0
    D: add x, b, 1

Convert control flow dependency to data dependency
Pro: Eliminate hard-to-predict branches
Cons: (1) Fetch blocks B and C all the time (2) Wait until p1 is resolved
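The same transformation can be imitated in C++ to show control dependence turning into data dependence. This is only an analogy: the function names are hypothetical, and whether a compiler emits predicated or conditional-move instructions for either version depends on the target and optimization settings.

```cpp
#include <cassert>

// Branching version: control flow decides which assignment runs.
inline int add_with_branch(bool cond) {
    int b;
    if (cond) { b = 0; } else { b = 1; }
    return b + 1;  // add x, b, 1
}

// Predicated-style version: both values are available and p1 selects one as data.
inline int add_predicated(bool cond) {
    const unsigned p1 = cond ? 1u : 0u;             // p1 = (cond)
    const unsigned mask = 0u - p1;                  // all ones if p1, else zero
    const unsigned b = (0u & mask) | (1u & ~mask);  // (p1) mov b, 0 ; (!p1) mov b, 1
    return static_cast<int>(b) + 1;                 // add x, b, 1
}
```

Both functions return 1 when `cond` is true and 2 when it is false. The second one always does the work of both arms, which mirrors the cost listed under Cons.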

26 GPU – Branch Predication

27 GPU – Branch Predication
Pixel shader has a discard instruction. A pixel shader can decide to “kill” the current pixel, which means it won’t get written. If all pixels inside a batch get discarded, the shader unit can stop and go to another batch. If there’s at least one thread left standing, the rest will be dragged along.

28 HW - Branch Predictor
class local_predictor : public branch_predictor {
public:
    local_update u;

    local_predictor (void) { }

    // Starting point: always predict taken, with no target prediction.
    branch_update *predict (branch_info & b) {
        u.direction_prediction (true);
        u.target_prediction (0);
        return &u;
    }

    // Starting point: no predictor state is trained yet.
    void update (branch_update *u, bool taken, unsigned int target) {
    }
};

29 Gshare Implementation
class gshare_predictor : public branch_predictor {
public:
#define HISTORY_LENGTH 15
#define TABLE_BITS 15
    gshare_update u;
    branch_info bi;
    unsigned int history;                 // global branch history
    unsigned char tab[1<<TABLE_BITS];     // table of 2-bit counters

    gshare_predictor (void) : history(0) {
        memset (tab, 0, sizeof (tab));
    }

    branch_update *predict (branch_info & b) {
        bi = b;
        if (b.br_flags & BR_CONDITIONAL) {
            // Index = global history XOR low bits of the branch address.
            u.index = (history << (TABLE_BITS - HISTORY_LENGTH))
                    ^ (b.address & ((1<<TABLE_BITS)-1));
            // The upper bit of the 2-bit counter is the prediction.
            u.direction_prediction (tab[u.index] >> 1);
        } else {
            u.direction_prediction (true);
        }
        u.target_prediction (0);
        return &u;
    }

30 Gshare Implementation
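    void update (branch_update *u, bool taken, unsigned int target) {
        if (bi.br_flags & BR_CONDITIONAL) {
            // Saturating update of the 2-bit counter used for the prediction.
            unsigned char *c = &tab[((gshare_update*)u)->index];
            if (taken) {
                if (*c < 3) (*c)++;
            } else {
                if (*c > 0) (*c)--;
            }
            // Shift the outcome into the global history and trim to HISTORY_LENGTH bits.
            history <<= 1;
            history |= taken;
            history &= (1<<HISTORY_LENGTH)-1;
        }
    }
};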

31 Pentium M

32 Branch Prediction
Hybrid branch outcome predictors
  Bimodal predictor
  Global predictor
  Loop predictor
Branch target predictors
  BTB
  iBTB

33 Reverse Engineering

34 Branch Prediction in Pentium M

35 Branch Prediction in Pentium M

36 Bimodal Predictor
A table of Bimodal counters – 4096 counters
Indexed by the IP address bits [11:0]

37 Global Predictor
A 4-way cache structure with 2048 entries
Accessed with the hash function - PIR XOR conditional branch IP
Resultant 9 bits are used as the index, 6 bits as the tag in the Global predictor
PIR Organization: PIR is the same PIR as the iBTB PIR

38 PIR Organization
Width – 15 bits
Affected by the 15 bits of the conditional taken branch IP address
Affected by the 15 bits combined from the indirect branch IP address and the indirect branch target address.
PIR is shifted left by two bits prior to update (XOR) with the newly occurred program branch.
Unconditional, Conditional Not taken and Call/Return branches do not affect the PIR
Diagram: the 15 bits are split into 9 bits and 6 bits; inputs are the conditional taken branch and the indirect taken branch
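The shift-then-XOR update of the PIR, and the 9-bit index / 6-bit tag split used by the Global predictor, can be sketched as below. The slides do not say which 15 address bits feed the PIR or which end of the hash forms the index, so both are assumptions here, and all names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t kPirBits = 15;
constexpr uint32_t kPirMask = (1u << kPirBits) - 1u;

// Shift the PIR left by two bits, then XOR in 15 bits derived from the branch.
// Assumption: the caller supplies the 15 bits; their selection is not specified.
inline uint32_t pir_update(uint32_t pir, uint32_t branch_bits) {
    return ((pir << 2) ^ (branch_bits & kPirMask)) & kPirMask;
}

struct GlobalLookup {
    uint32_t index;  // 9 bits
    uint32_t tag;    // 6 bits
};

// Hash the PIR with the conditional branch IP and split the 15-bit result.
// Assumption: low 9 bits form the index, high 6 bits form the tag.
inline GlobalLookup global_lookup(uint32_t pir, uint32_t branch_ip) {
    const uint32_t h = (pir ^ branch_ip) & kPirMask;
    return GlobalLookup{h & 0x1FFu, (h >> 9) & 0x3Fu};
}
```

The 9-bit index is consistent with a 4-way structure of 2048 entries, which has 512 sets.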

39 Hash Function

40 Loop Branch Predictor Buffer
A cache structure named loop branch predictor buffer (Loop BPB) has two 6-bit counters in one cache entry
Counter MAX_VAL stores the loop branch maximum count value
Counter CURR_VAL stores the loop branch current iteration number
Loop BPB is a two-way structure organized in 64 sets
Index by the IP address bits [9:4]
Tag bits are IP address bits [15:10]
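The index and tag fields are plain bit slices of the branch IP, as this sketch shows. The function names are hypothetical; the bit ranges are the ones given above.

```cpp
#include <cassert>
#include <cstdint>

// Set index: IP address bits [9:4], giving 64 sets.
inline uint32_t loop_bpb_set(uint32_t ip) { return (ip >> 4) & 0x3Fu; }

// Tag: IP address bits [15:10].
inline uint32_t loop_bpb_tag(uint32_t ip) { return (ip >> 10) & 0x3Fu; }
```

For IP 0xABCD the set is 0x3C and the tag is 0x2A. Branches whose addresses differ only in bits [3:0] or above bit 15 map to the same set and tag in this scheme.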

41 Loop Branch Predictor Training
MAX_VAL counter must be set before loop prediction can work
Training done in Loop BPB after branch allocation
Shortcoming – Evicts an existing entry, but the new branch may turn out not to be a loop
128 branches may be trained at once
Allocate a branch in the loop BPB if the branch opposite outcome is detected
T, …T, nT, nT, T, T,…, allocation on nT
2 way buffer: LRU policy
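One way the two counters could cooperate is sketched below. The exact hardware behaviour is not given on the slides, so this is an assumed model with hypothetical names: CURR_VAL counts taken iterations, the first not-taken outcome records MAX_VAL, and afterwards the entry predicts not-taken exactly when CURR_VAL reaches MAX_VAL.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical loop-predictor entry with two 6-bit counters.
struct LoopEntry {
    uint8_t max_val = 0;   // learned trip count (saturates at 63)
    uint8_t curr_val = 0;  // taken iterations seen in the current loop run
    bool trained = false;  // MAX_VAL must be set before loop prediction works

    // Predict taken, except at the learned loop exit.
    bool predict() const { return !(trained && curr_val == max_val); }

    void update(bool taken) {
        if (taken) {
            if (curr_val < 63) ++curr_val;
        } else {
            max_val = curr_val;  // assumption: loop exit fixes MAX_VAL
            trained = true;
            curr_val = 0;
        }
    }
};

// Run one entry over a sequence of outcomes and count mispredictions.
inline int count_mispredictions(const std::vector<bool>& outcomes) {
    LoopEntry e;
    int misses = 0;
    for (bool taken : outcomes) {
        if (e.predict() != taken) ++misses;
        e.update(taken);
    }
    return misses;
}
```

On the repeating pattern T T T nT, this model mispredicts only the first loop exit, while MAX_VAL is still unset.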

42 Final Prediction
Branch pattern: T T T nT T T T nT T T T nT nT
Events marked along the pattern:
Loop allocated
MAX_VAL set
Loop BPB misprediction
Global Predictor correct prediction

