COSC3330 Computer Architecture Lecture 15. Branch Prediction Instructor: Weidong Shi (Larry), PhD Computer Science Department University of Houston
Topic Branch Prediction
Bimodal Branch Prediction
- 2^N entries addressed by N bits of the PC
- Each entry keeps a counter (2-bit or more) for prediction
- Counter update: the same as the 2-bit counter FSM
- [Figure: N bits of the PC address select one of 2^N table entries (each a 2-bit counter); the selected counter gives the prediction, and the update logic rewrites it using the actual outcome]
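The table lookup and counter update described on this slide can be sketched in a few lines of C++. This is only an illustration: the class name `BimodalSketch`, the initial counter value, and the use of the lowest N PC bits as the index are assumptions, not part of the slides.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch of a bimodal predictor: 2^N two-bit saturating
// counters indexed by the low N bits of the branch PC.
class BimodalSketch {
public:
    explicit BimodalSketch(unsigned n_bits)
        : table_(entry_count(n_bits), 1),  // assumption: start at 1 (weakly not taken)
          mask_(static_cast<std::uint32_t>(table_.size() - 1)) {}

    // Predict taken when the counter's MSB is set (states 2 and 3).
    bool predict(std::uint32_t pc) const { return table_[pc & mask_] >= 2; }

    // 2-bit counter FSM: count up on taken, down on not taken,
    // saturating at 0 and 3.
    void update(std::uint32_t pc, bool taken) {
        std::uint8_t &c = table_[pc & mask_];
        if (taken) { if (c < 3) ++c; }
        else       { if (c > 0) --c; }
    }

private:
    static std::size_t entry_count(unsigned n_bits) {
        assert(n_bits >= 1 && n_bits <= 20);  // keep the sketch's table small
        return std::size_t{1} << n_bits;
    }
    std::vector<std::uint8_t> table_;  // declared first so mask_ can use its size
    std::uint32_t mask_;
};
```

Two branches whose PCs share the same low N bits land on the same counter in this sketch, which is the alias (interference) problem in its simplest form.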
Gshare Branch Predictor
- [Figure: the PC address and the global BHR are XORed to index the PHT; the selected counter in the example is 00, so MSB = 0 and the prediction is Not Taken]
Pattern History Table
- 2^N entries addressed by an N-bit BHR
- Each entry keeps a counter (2-bit or more) for prediction
- Counter update: the same as the 2-bit counter FSM
- Can be initialized in alternating patterns (01, 10, 01, 10, ...)
- Alias (or interference) problem
Idea: Track the History of a Branch
- Each PC entry stores the previous outcome and two counters: one used if prev=0, one used if prev=1
- Counter states: 0 = SN, 1 = WN, 2 = WT, 3 = ST
- [Figure: animation stepping through a branch's outcomes; at each step the counter selected by prev gives the prediction (N or T) and is then updated with the actual outcome]
Deeper History Covers More Patterns
- Each PC entry stores the last 3 outcomes and one counter per history value (prev=000, prev=001, prev=010, ..., prev=111)
- What pattern has this branch predictor entry learned?
- History -> Prediction: 001 -> 1, 011 -> 0, 110 -> 0, 100 -> 1
- Answer: 00110011001… = (0011)*
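To see how such an entry comes to learn (0011)*, here is a minimal C++ sketch of one entry with a 3-bit history and eight 2-bit counters. The names `HistoryEntrySketch` and `count_mispredictions` are hypothetical, and starting every counter at 0 (SN) is an assumption made for the illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical single predictor entry: a 3-bit outcome history selects
// one of 8 two-bit saturating counters (all start at 0 = SN, an assumption).
struct HistoryEntrySketch {
    std::uint8_t history = 0;       // last 3 outcomes, newest in bit 0
    std::uint8_t counters[8] = {};  // one counter per history value

    bool predict() const {
        assert(history < 8);
        return counters[history] >= 2;  // MSB set means predict taken
    }

    void update(bool taken) {
        std::uint8_t &c = counters[history];
        if (taken) { if (c < 3) ++c; }
        else       { if (c > 0) --c; }
        // Shift the new outcome into the 3-bit history.
        history = static_cast<std::uint8_t>(((history << 1) | (taken ? 1 : 0)) & 0x7);
    }
};

// Feeds `rounds` repetitions of `pattern` to the entry and returns how
// many outcomes were mispredicted.
int count_mispredictions(HistoryEntrySketch &e, const bool *pattern,
                         int length, int rounds) {
    int wrong = 0;
    for (int r = 0; r < rounds; ++r) {
        for (int i = 0; i < length; ++i) {
            if (e.predict() != pattern[i]) ++wrong;
            e.update(pattern[i]);
        }
    }
    return wrong;
}
```

Under these assumptions, feeding the repeating outcomes 0,0,1,1 causes a few mispredictions while the counters warm up; after that, the counters for histories 001 and 100 sit at ST, those for 011 and 110 stay at SN, and the pattern is predicted without further errors.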
Tournament Predictors
- No predictor is clearly the best
- Different branches exhibit different behaviors: some “constant”, some global, some local
- Idea: Let’s have a predictor to predict which predictor will predict better
Tournament Hybrid Predictors
- Meta-predictor: a table of 2-/3-bit counters
- Final prediction: if meta-counter MSB = 0, use pred0, else use pred1
- Meta update:
    Pred0      Pred1      Meta Update
    wrong      wrong      ---
    wrong      correct    Inc
    correct    wrong      Dec
    correct    correct    ---
Hybrid Branch Predictor [McFarling93]
- Some branches are correlated to global history, some to local history
- Only update the meta-predictor when the 2 predictors disagree
- [Figure: the branch PC feeds P0, P1, and the choice (or meta) predictor, which selects the final prediction]
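The selection and update rules for the meta-predictor can be sketched for a single 2-bit meta-counter as follows. The struct name `MetaCounterSketch` and the initial value are assumptions; the increment-toward-pred1 direction follows from the rule that MSB = 0 selects pred0.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical 2-bit meta-counter for one tournament predictor entry.
struct MetaCounterSketch {
    std::uint8_t value = 1;  // assumption: start by weakly favoring pred0

    // MSB = 0 selects pred0, MSB = 1 selects pred1.
    bool choose(bool pred0, bool pred1) const {
        assert(value <= 3);
        return ((value >> 1) & 1) ? pred1 : pred0;
    }

    // Update only when the two component predictors disagree:
    // move toward pred1 if it was right, toward pred0 otherwise.
    void update(bool pred0, bool pred1, bool taken) {
        if (pred0 == pred1) return;  // both right or both wrong: no change
        if (pred1 == taken) { if (value < 3) ++value; }
        else                { if (value > 0) --value; }
    }
};
```

Skipping the update when the predictors agree keeps the counter from drifting on branches where the choice makes no difference.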
Compaq Alpha 21264
Compaq Alpha 21264
- Four-issue superscalar
- Out-of-order execution
- Speculative execution
- Branch predictors
- Average branch misprediction penalty is 11 cycles
Compaq Alpha 21264 - Global Predictor
- The global predictor has 4K entries and is indexed by the history of the last 12 branches; each entry is a standard 2-bit predictor
- 12-bit pattern: ith bit 0 => ith prior branch not taken; 1 => ith prior branch taken
- [Figure: the branch history register (e.g. 101101101101) indexes a 4096 x 2 table of state bits to produce the global prediction]
Per Branch Pattern History Table (Two-Level Predictor)
- First level: find the history (pattern) of the branch
- Second level: predict the branch for that pattern
- Correlating predictors
- Differs from a global pattern history table in that each branch has its own private history
- [Figure: the PC indexes the BHT; the selected history (e.g. 110110) indexes the PHT of 2-bit saturating counters, whose state gives the prediction]
Compaq Alpha 21264 – Local Predictor
- Local predictor (2-level predictor)
- Top level: a local history table of 1024 10-bit entries, indexed by the PC; each 10-bit entry holds the 10 most recent branch outcomes for that entry
- A 10-bit history allows patterns of up to 10 branches to be discovered and predicted
- Next level: the selected local history entry indexes a table of 1K 3-bit saturating counters, which provide the local prediction
- [Figure: PC -> Local History Table (1024x10) -> Local Prediction (1024x3)]
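The two-level lookup described for the local predictor might look like the following C++ sketch. The table sizes come from the slide; the class name `LocalPredictorSketch`, the PC-to-index mapping (dropping the low 2 bits), and the initial counter value are assumptions made only for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical sketch of a two-level local predictor with the table
// sizes quoted for the Alpha 21264: 1024 x 10-bit histories and
// 1024 x 3-bit saturating counters.
class LocalPredictorSketch {
public:
    LocalPredictorSketch() {
        histories_.fill(0);
        counters_.fill(3);  // assumption: start just below the taken threshold
    }

    // Level 1: PC selects a 10-bit history. Level 2: that history selects
    // a 3-bit counter; predict taken when its MSB is set (value >= 4).
    bool predict(std::uint32_t pc) const {
        return counters_[histories_[index(pc)]] >= 4;
    }

    void update(std::uint32_t pc, bool taken) {
        std::uint16_t &h = histories_[index(pc)];
        assert(h < 1024);
        std::uint8_t &c = counters_[h];
        if (taken) { if (c < 7) ++c; }
        else       { if (c > 0) --c; }
        // Shift the outcome into the 10-bit local history.
        h = static_cast<std::uint16_t>(((h << 1) | (taken ? 1u : 0u)) & 0x3FFu);
    }

private:
    // Assumption: 4-byte instructions, so the low 2 PC bits are dropped.
    static std::size_t index(std::uint32_t pc) { return (pc >> 2) & 0x3FFu; }

    std::array<std::uint16_t, 1024> histories_;
    std::array<std::uint8_t, 1024> counters_;
};
```

With a 10-bit history, a loop branch that repeats taken, taken, taken, not taken is predicted without errors once the relevant counters have been trained, because each 10-outcome window is always followed by the same outcome.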
Compaq Alpha 21264
- Tournament predictor
- Local: previous executions of this branch
- Global: previous executions of all branches
- Choice prediction: 4K 2-bit counters, indexed by the path history, that choose between the global predictor and the local predictor
- Total size: 4K*2 + 4K*2 + 1K*10 + 1K*3 = 29K bits! (~180,000 transistors)
- [Figure: PC -> Local History Table (1024x10) -> Local Prediction (1024x3); Path History -> Global Prediction (4096x2) and Choice Prediction (4096x2); the choice selects the final prediction]
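The storage total can be checked mechanically. The short C++ sketch below simply multiplies out the four table shapes listed on the slide; the names `table_bits` and `kAlpha21264TotalBits` are hypothetical.

```cpp
// Bits of storage in a table of `entries` entries, `width` bits each.
constexpr unsigned long table_bits(unsigned long entries, unsigned long width) {
    return entries * width;
}

constexpr unsigned long kAlpha21264TotalBits =
    table_bits(4096, 2) +   // global prediction
    table_bits(4096, 2) +   // choice prediction
    table_bits(1024, 10) +  // local history table
    table_bits(1024, 3);    // local prediction

// 8K + 8K + 10K + 3K = 29K bits
static_assert(kAlpha21264TotalBits == 29 * 1024, "total should be 29K bits");
```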
Branch Prediction Accuracy
- Static branch prediction (compiler): 70%
- Per-branch 2-bit saturating counters (no history): 85%
- Two-level predictor (with history): 90-95%
- Tournament predictor: a little more accurate than two-level
Accuracy v. Size (SPEC89) Slide from David Culler
HW2 Programming - Branch Predictor
    class local_predictor : public branch_predictor {
    public:
        local_update u;

        local_predictor (void) { }

        // Skeleton: always predicts taken.
        branch_update *predict (branch_info & b) {
            u.direction_prediction (true);
            u.target_prediction (0);
            return &u;
        }

        // Skeleton: the predictor state is trained here with the actual outcome.
        void update (branch_update *u, bool taken, unsigned int target) {
        }
    };
Gshare Implementation
    #include <string.h>  // memset

    class gshare_predictor : public branch_predictor {
    public:
    #define HISTORY_LENGTH 15
    #define TABLE_BITS     15
        gshare_update u;
        branch_info bi;
        unsigned int history;              // global branch history register
        unsigned char tab[1<<TABLE_BITS];  // 2-bit saturating counters

        gshare_predictor (void) : history(0) {
            memset (tab, 0, sizeof (tab));
        }

        branch_update *predict (branch_info & b) {
            bi = b;
            if (b.br_flags & BR_CONDITIONAL) {
                // Index = global history XOR low TABLE_BITS bits of the PC
                u.index = (history << (TABLE_BITS - HISTORY_LENGTH))
                        ^ (b.address & ((1<<TABLE_BITS)-1));
                // Predict taken if the counter's MSB is set
                u.direction_prediction (tab[u.index] >> 1);
            } else {
                // Non-conditional branches are predicted taken
                u.direction_prediction (true);
            }
            u.target_prediction (0);
            return &u;
        }

        void update (branch_update *u, bool taken, unsigned int target) {
            if (bi.br_flags & BR_CONDITIONAL) {
                // Train the counter that produced the prediction
                unsigned char *c = &tab[((gshare_update*)u)->index];
                if (taken) {
                    if (*c < 3) (*c)++;
                } else {
                    if (*c > 0) (*c)--;
                }
                // Shift the outcome into the global history
                history <<= 1;
                history |= taken;
                history &= (1<<HISTORY_LENGTH)-1;
            }
        }
    };
Pentium M
Branch Prediction
- Hybrid branch outcome predictors
    - Bimodal predictor
    - Global predictor
    - Loop predictor
- Branch target predictors
    - BTB
    - iBTB
Reverse Engineering
Branch Prediction in Pentium M
Bimodal Predictor
- A table of 4096 bimodal counters
- Indexed by IP address bits [11:0]
Global Predictor
- A 4-way cache structure with 2048 entries
- Accessed with the hash function: PIR XOR conditional branch IP
- Of the resulting bits, 9 are used as the index and 6 as the tag in the global predictor
- PIR organization: the PIR is the same PIR used by the iBTB
PIR Organization
- Width: 15 bits
- Affected by 15 bits of the IP address of each taken conditional branch
- Affected by 15 bits combined from the IP address and the target address of each indirect branch
- The PIR is shifted left by two bits before it is updated (XORed) with the newly occurred branch
- Unconditional branches, not-taken conditional branches, and calls/returns do not affect the PIR
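The PIR update rule and the global predictor's hashed access described above can be sketched as follows. The function and struct names are hypothetical. The slides do not say which address bits feed the PIR or which bits of the hash form the index and the tag, so the sketch takes the 15 branch bits as an input and assumes the low 9 hash bits are the index and the high 6 bits are the tag.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t kPirMask = 0x7FFF;  // the PIR is 15 bits wide

// PIR update: shift left by two bits, then XOR in 15 bits derived from
// the branch. `branch_bits` stands for the 15 bits taken from a taken
// conditional branch's IP, or combined from an indirect branch's IP and
// target; the caller selects them (the exact bit positions are not
// given in the slides).
std::uint32_t pir_update(std::uint32_t pir, std::uint32_t branch_bits) {
    assert(pir <= kPirMask);
    return ((pir << 2) ^ (branch_bits & kPirMask)) & kPirMask;
}

struct GlobalLookupSketch {
    std::uint32_t index;  // selects one of 512 sets (2048 entries / 4 ways)
    std::uint32_t tag;    // compared against the 4 ways of the set
};

// Global predictor access: hash = PIR XOR branch IP bits.
// Assumption: low 9 bits of the hash = index, high 6 bits = tag.
GlobalLookupSketch global_lookup(std::uint32_t pir, std::uint32_t ip_bits) {
    std::uint32_t hash = (pir ^ ip_bits) & kPirMask;
    return GlobalLookupSketch{hash & 0x1FF, (hash >> 9) & 0x3F};
}
```

Because only taken conditional branches and indirect branches call the update in this scheme, the PIR acts as a compressed path history and not as a plain taken/not-taken outcome history.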
Hash Function