Combining Branch Predictors. CS 7960-4, Lecture 7. Paper: "Combining Branch Predictors", Scott McFarling, WRL Tech. Report TN-36, 1993.
Bimodal Branch Prediction: Identifies the most common outcome in the recent past. Updates happen during commit. [Figure: 10 bits of the PC index a table of 1024 2-bit saturating counters.]
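The bimodal scheme can be sketched in a few lines of Python. This is a minimal illustration, not McFarling's simulator: the class name, the initial counter value of 1, and the assumption of 4-byte instructions (so the two low PC bits are dropped) are choices made here for the example.

```python
class BimodalPredictor:
    """Table of 2-bit saturating counters indexed by low PC bits."""

    def __init__(self, index_bits=10):
        self.mask = (1 << index_bits) - 1
        # Counters range over 0..3; starting at 1 (weakly not-taken) is an
        # arbitrary choice for this sketch.
        self.counters = [1] * (1 << index_bits)

    def _index(self, pc):
        # Assumption: 4-byte instructions, so the two low PC bits carry no information.
        return (pc >> 2) & self.mask

    def predict(self, pc):
        # Counter values 2 and 3 mean "predict taken".
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


bp = BimodalPredictor()
for outcome in [True, True, True, False, True]:
    bp.update(0x400100, outcome)
print(bp.predict(0x400100))
```

Because the counter saturates, the single not-taken outcome in the sequence only weakens the prediction; it does not flip it.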
Results: SPEC’89 programs simulated for 10M instructions (modern studies use hard-to-predict programs). A larger predictor reduces contention for counters. Prediction rates saturate at 93.5% (at 2K bytes) (Fig. 3).
Local Predictors: A two-level predictor. The first level holds per-branch history; the second level holds saturating counters. History gets updated immediately. [Figure: 10 bits of the PC index a 1024-entry table of 4-bit histories; the selected history indexes a 16-entry table of 2-bit saturating counters.]
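A small sketch of the two-level local scheme, using the sizes on this slide (1024 histories of 4 bits, 16 counters). The class name, the initial values, and the PC shift are assumptions for illustration. For simplicity the history is updated in the same call as the counter, whereas the slide notes that a real design updates history immediately and counters later.

```python
class LocalPredictor:
    """Level 1: per-branch history table. Level 2: counters indexed by that history."""

    def __init__(self, pc_bits=10, hist_bits=4):
        self.pc_mask = (1 << pc_bits) - 1
        self.hist_mask = (1 << hist_bits) - 1
        self.histories = [0] * (1 << pc_bits)    # 1024 entries of 4-bit history
        self.counters = [1] * (1 << hist_bits)   # 16 two-bit counters, shared by all branches

    def _slot(self, pc):
        # Assumption: 4-byte instructions, so the two low PC bits are dropped.
        return (pc >> 2) & self.pc_mask

    def predict(self, pc):
        return self.counters[self.histories[self._slot(pc)]] >= 2

    def update(self, pc, taken):
        slot = self._slot(pc)
        h = self.histories[slot]
        c = self.counters[h]
        self.counters[h] = min(3, c + 1) if taken else max(0, c - 1)
        # Shift the newest outcome into the low bit of this branch's history.
        self.histories[slot] = ((h << 1) | int(taken)) & self.hist_mask


lp = LocalPredictor()
pattern = [True, True, True, False]   # loop branch: taken three times, then falls through
pc = 0x400200
for i in range(16):                   # warm-up
    lp.update(pc, pattern[i % 4])
misses = 0
for i in range(40):
    taken = pattern[i % 4]
    misses += lp.predict(pc) != taken
    lp.update(pc, taken)
print(misses)
```

A 4-bit history is long enough to tell the four positions of this loop pattern apart, which is why the predictor stops missing once it has warmed up. A bimodal counter for the same branch would mispredict the loop exit every time.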
Results: For small predictors, there could be contention at both levels, resulting in inaccurate predictions. The predictor will also take longer to warm up after every context switch. It does very well at large sizes, saturating at 97.1%.
Global Predictors: A single history register is shared by all branches, since neighboring branches have correlated results. However, the PC is not used. [Figure: a 10-bit global history indexes a table of 1024 2-bit saturating counters.]
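A sketch of the global scheme follows. Note that `predict` and `update` take no PC at all. The class name, the initial values, and the two-branch test stream are assumptions made for this illustration.

```python
class GlobalPredictor:
    """One shared history register indexes a table of 2-bit saturating counters."""

    def __init__(self, hist_bits=10):
        self.mask = (1 << hist_bits) - 1
        self.history = 0
        self.counters = [1] * (1 << hist_bits)

    def predict(self):
        return self.counters[self.history] >= 2

    def update(self, taken):
        c = self.counters[self.history]
        self.counters[self.history] = min(3, c + 1) if taken else max(0, c - 1)
        # Shift the newest outcome into the low bit of the global history.
        self.history = ((self.history << 1) | int(taken)) & self.mask


# Hypothetical stream: branch A alternates taken / not taken, and branch B,
# which follows A, always goes the same way as A.
stream = [True, True, False, False]   # A, B, A, B
gp = GlobalPredictor()
for i in range(40):                   # warm-up
    gp.update(stream[i % 4])
misses = 0
for i in range(40):
    taken = stream[i % 4]
    misses += gp.predict() != taken
    gp.update(taken)
print(misses)
```

Each branch on its own is taken exactly half the time, so a per-PC bimodal counter cannot learn it, while the shared history captures both the alternation and the correlation.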
Do We Need the PC? Note that the global history can reveal which branch is being examined. Hence, the global predictor outdoes bimodal predictors when the transistor budget is large (Fig. 7). The local predictor does better still: it is more important to identify the PC and local history than the behavior of neighboring branches.
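Gselect: Use a combination of PC and global history bits to index the counters. Bimodal and global prediction are special cases (Fig. 9). [Figure: n PC bits and m global-history bits are concatenated into an (n+m)-bit index into a table of 1024 2-bit saturating counters; the example shows a 5-bit global history.]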
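The gselect index can be written as a one-line concatenation. In this sketch the function name is made up, `pc` is assumed to be a branch address already stripped of alignment bits, and placing the PC bits in the high half of the index is an assumption (the slide only says the two fields are combined).

```python
def gselect_index(pc, history, n, m):
    """Concatenate the low n PC bits (high half) with the low m history bits (low half)."""
    pc_part = pc & ((1 << n) - 1)
    hist_part = history & ((1 << m) - 1)
    return (pc_part << m) | hist_part


# 5 PC bits + 5 history bits -> 10-bit index into 1024 counters.
idx = gselect_index(0b1011011101, 0b11010, n=5, m=5)
print(f"{idx:010b}")

# Special cases: m = 0 gives a bimodal index, n = 0 gives a global index.
print(gselect_index(0b1011011101, 0b11010, n=10, m=0) == 0b1011011101)
print(gselect_index(0b1011011101, 0b11010, n=0, m=10) == 0b11010)
```

With a fixed table size, every bit given to the history is a bit taken from the PC, which is the trade-off that gshare tries to sidestep.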
GShare: XOR-ing 10 history bits with 10 PC bits carries more information than concatenating 5 bits of each, and more than either component alone.

Branch Address   Global History   Gselect 4/4   Gshare 8/8
00000000         00000001         00000001      00000001
11111111         10000000         11110000      01111111
01111110         00000001         11100001      01111111
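The two index functions compared on this slide can be checked with a short sketch. The function names are made up here, the address/history pairs in `rows` are read from the slide's example values, and the extra pair at the end is an illustrative input added for this example.

```python
def gselect_4_4(addr, hist):
    """Low 4 address bits concatenated with low 4 history bits."""
    return ((addr & 0xF) << 4) | (hist & 0xF)


def gshare_8_8(addr, hist):
    """8 address bits XOR-ed with 8 history bits."""
    return (addr ^ hist) & 0xFF


rows = [
    (0b00000000, 0b00000001),
    (0b11111111, 0b10000000),
    (0b01111110, 0b00000001),
]
for addr, hist in rows:
    print(f"{addr:08b} {hist:08b} {gselect_4_4(addr, hist):08b} {gshare_8_8(addr, hist):08b}")

# Illustrative extra case: same address, histories that differ only in a high bit.
a = 0b11111111
print(gselect_4_4(a, 0b00000000) == gselect_4_4(a, 0b10000000))  # gselect cannot tell them apart
print(gshare_8_8(a, 0b00000000) == gshare_8_8(a, 0b10000000))    # gshare can
```

Gselect 4/4 only sees the low four bits of each field, so it cannot use older history bits that gshare does use. XOR is still a hash: the last two rows map to the same gshare index, so gshare reduces aliasing rather than eliminating it.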
Terminology: GAG: global history indexes into a global array of saturating counters; PAG: per-address history indexes into a global array; GAP: global history indexes into each PC’s private array of counters (gselect); PAP: per-address history indexes into each PC’s private array of counters.
Trade-Offs: Some predictors warm up faster than others. Some programs benefit from global history, some from local history. Some programs have branches that interfere with each other. Note that a 64KB local predictor has fewer saturating counters than a 64KB bimodal predictor, so the former won’t be better for every program.
Combining Predictors: Use an array of saturating counters to pick the best available predictor for each PC. [Figure: the PC indexes a table of 1024 2-bit saturating counters that selects between the predictions of Predictor A and Predictor B.]
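The selection mechanism can be sketched as a chooser table wrapped around any two component predictors. This is an illustration under stated assumptions: the class names are made up, `Static` is a trivial stand-in for a real component such as bimodal or gshare, and the chooser moves only when exactly one component was right, which is one common update rule.

```python
class Static:
    """Hypothetical stand-in component: always predicts the same direction."""

    def __init__(self, direction):
        self.direction = direction

    def predict(self, pc):
        return self.direction

    def update(self, pc, taken):
        pass  # nothing to learn


class CombinedPredictor:
    def __init__(self, pred_a, pred_b, index_bits=10):
        self.a, self.b = pred_a, pred_b
        self.mask = (1 << index_bits) - 1
        # 2-bit chooser counters: 0-1 select A, 2-3 select B. Start at 1 (weakly A).
        self.chooser = [1] * (1 << index_bits)

    def _index(self, pc):
        # Assumption: 4-byte instructions, so the two low PC bits are dropped.
        return (pc >> 2) & self.mask

    def predict(self, pc):
        use_b = self.chooser[self._index(pc)] >= 2
        return self.b.predict(pc) if use_b else self.a.predict(pc)

    def update(self, pc, taken):
        i = self._index(pc)
        a_ok = self.a.predict(pc) == taken
        b_ok = self.b.predict(pc) == taken
        # Move the chooser only when exactly one component was correct.
        if b_ok and not a_ok:
            self.chooser[i] = min(3, self.chooser[i] + 1)
        elif a_ok and not b_ok:
            self.chooser[i] = max(0, self.chooser[i] - 1)
        self.a.update(pc, taken)
        self.b.update(pc, taken)


cp = CombinedPredictor(Static(True), Static(False))
for _ in range(2):
    cp.update(0x1004, True)    # a branch that is always taken
    cp.update(0x2008, False)   # a branch that is never taken
print(cp.predict(0x1004), cp.predict(0x2008))
```

Because the chooser is indexed by PC, each branch can settle on a different component, which is how a combined predictor can beat both of its parts.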
Results: The combination of local and gshare increases the prediction accuracy to 98.1% (Fig. 16). For smaller transistor budgets, the combination of bimodal and gshare is better (gshare is twice the size, to make sure the total is a power of two). A 1KB combined predictor does as well as a 16KB gselect predictor.
Future Work: Detect conflicts, correlations, and common predictions through profiling or compiler analysis. Functions that compress the information in the history or PC. Pipelined predictions: predict two branches ahead. Hierarchical predictors: get a quick prediction in a cycle and a more accurate one two cycles later.
Next Week’s Paper “Design Trade-Offs for the Alpha EV8 Conditional Branch Predictor”, Seznec et al., ISCA’02