Predicting Conditional Branches With Fusion-Based Hybrid Predictors
  Gabriel H. Loh, Yale University, Dept. of Computer Science
  Dana S. Henry, Yale University, Depts. of Elec. Eng. & Comp. Sci.
  This research was funded by NSF Grant MIP
The Branch Prediction Problem
  1 out of 5 instructions is a branch
  May require many cycles to resolve
    – P4 has 20 cycle branch resolution pipeline
    – Future pipeline depths likely to increase [Sprangle02]
  Predict branches to keep pipeline full
  [Diagram: pipeline from PC Compute to Branch resolution]
Bigger Predictors = More Accurate
  Larger predictors tend to yield more accurate predictions
  Faster cycle times force smaller branch predictors
  Overriding predictor couples small, fast predictor with a large, multi-cycle predictor [Jiménez2000]
    – performs close to ideal large-fast predictor
  (but bigger predictors = slower)
Hybrid Predictors
  Wide variety of branch prediction algorithms available
  Hybrid combines more than one “stand-alone” or component predictor [McFarling93]:
  [Diagram: P1, P2, Meta-Predictor, Final Prediction]
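A selection-based hybrid of the kind described on this slide can be sketched as follows. This is a minimal illustration, not the design from [McFarling93]: the class name `SelectionHybrid`, the table size, the PC-modulo indexing, and the 2-bit chooser counters are all assumptions made for the example. The component predictions `p1` and `p2` are passed in as booleans (True = taken).

```python
class SelectionHybrid:
    """Hypothetical two-component hybrid: a meta-predictor picks P1 or P2."""

    def __init__(self, entries=1024):
        self.entries = entries
        # One 2-bit chooser counter per entry: 0-1 prefer P1, 2-3 prefer P2.
        self.meta = [2] * entries

    def predict(self, pc, p1, p2):
        # The final prediction is exactly one component's prediction.
        return p2 if self.meta[pc % self.entries] >= 2 else p1

    def update(self, pc, p1, p2, taken):
        # Train the chooser only when the components disagree.
        if p1 == p2:
            return
        i = pc % self.entries
        if p2 == taken:
            self.meta[i] = min(3, self.meta[i] + 1)
        else:
            self.meta[i] = max(0, self.meta[i] - 1)


hybrid = SelectionHybrid()
first = hybrid.predict(0x40, p1=True, p2=False)   # chooser starts on P2
hybrid.update(0x40, p1=True, p2=False, taken=True)
hybrid.update(0x40, p1=True, p2=False, taken=True)
later = hybrid.predict(0x40, p1=True, p2=False)   # chooser has moved to P1
```

Note that the output is always a copy of one component's prediction; the meta-predictor never produces a direction that no component suggested.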
Multi-Hybrids
  [Diagram: “Multi-Hybrid” [Evers96]: P1, P2, …, Pn, Pr. Encoder, Final Prediction]
  [Diagram: “Quad-Hybrid” [Evers00]: P1, P2, P3, P4 with M1, M2, M3]
Our Idea: Prediction Fusion
  [Diagram: Prediction Selection: P1, P2, P3, …, Pn, with unselected predictions crossed out (X)]
  [Diagram: Prediction Fusion: P1, P2, P3, …, Pn]
Early Attempt from ML
  Weighted Majority algorithm [LW94]
    – Better predictors get assigned larger weights
    – Make final prediction with larger sum
  Predictor with largest weight not always correct
  [Diagram: P2, P6 and P7 say “not-taken”; P1, P3, P4, P5 and P8 say “taken”]
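The weighted-vote idea above can be sketched in a few lines. This is a simplified illustration of a weighted majority scheme, not the exact formulation in [LW94]: the multiplicative penalty `beta = 0.5`, the tie-breaking rule, and the starting weights are assumptions chosen to mirror the slide's split, where P2, P6 and P7 vote not-taken and the other five vote taken.

```python
def wm_predict(weights, votes):
    """Predict taken (True) if the taken-side weight sum is at least as large."""
    taken = sum(w for w, v in zip(weights, votes) if v)
    not_taken = sum(w for w, v in zip(weights, votes) if not v)
    return taken >= not_taken


def wm_update(weights, votes, outcome, beta=0.5):
    """Scale down the weight of every predictor that voted against the outcome."""
    return [w if v == outcome else w * beta for w, v in zip(weights, votes)]


# Votes for P1..P8: P2, P6 and P7 say not-taken, the rest say taken.
votes = [True, False, True, True, True, False, False, True]
# Hypothetical weights: the three not-taken voters are the heavy ones.
weights = [1, 4, 1, 1, 1, 4, 4, 1]

history = []
for _ in range(3):
    history.append(wm_predict(weights, votes))
    weights = wm_update(weights, votes, outcome=True)
# history == [False, False, True]
```

With these assumed weights the heavy minority wins the first two votes even though it is wrong, which is the slide's point that the largest weights are not always correct. The scheme only recovers after the weights have been penalized twice.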
Outline
  COLT Predictor
  Choosing parameters and components
  Performance
  Prediction distributions, component choice
COLT Organization
  [Diagram: Branch Address, Branch History; components P1, P2, P3, …, Pn; prediction bits 1010…; Mapping Table; VMT; Final Prediction]
Pathological Example
  [Diagram: P1, P2, P3; Actual outcome = 1 (taken)]
Example (cont’d)
  Selection: Outcome is always wrong
  COLT: Can recognize and remember this pattern
  [Diagram: P1, P2, P3 under Selection; P1, P2, P3 with VMT under COLT, output 1]
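The lookup and the pathological case can be sketched as below, under several loudly stated assumptions: the class name `ColtSketch`, the bit widths, the concatenation of address and history bits to pick a mapping table, and the counters' initial value are all hypothetical, and the slide's actual component outputs are not recoverable, so the example assumes all three components say not-taken while the branch is taken.

```python
class ColtSketch:
    """Hypothetical fusion predictor: component predictions index a mapping table."""

    def __init__(self, n_components, addr_bits=4, hist_bits=2, counter_bits=4):
        self.addr_bits = addr_bits
        self.hist_bits = hist_bits
        self.max = (1 << counter_bits) - 1
        self.threshold = 1 << (counter_bits - 1)
        n_tables = 1 << (addr_bits + hist_bits)
        # VMT: one mapping table (2**n saturating counters) per address/history index.
        # Counters start just below the threshold (an arbitrary choice).
        self.vmt = [[self.threshold - 1] * (1 << n_components)
                    for _ in range(n_tables)]

    def _select(self, pc, history):
        a = pc & ((1 << self.addr_bits) - 1)
        h = history & ((1 << self.hist_bits) - 1)
        return self.vmt[(a << self.hist_bits) | h]

    @staticmethod
    def _pattern(preds):
        idx = 0
        for p in preds:  # concatenate the component prediction bits
            idx = (idx << 1) | int(p)
        return idx

    def predict(self, pc, history, preds):
        return self._select(pc, history)[self._pattern(preds)] >= self.threshold

    def update(self, pc, history, preds, taken):
        mt = self._select(pc, history)
        i = self._pattern(preds)
        mt[i] = min(self.max, mt[i] + 1) if taken else max(0, mt[i] - 1)


colt = ColtSketch(n_components=3)
all_wrong = (False, False, False)       # every component says not-taken
before = colt.predict(0x1234, 0b01, all_wrong)
colt.update(0x1234, 0b01, all_wrong, taken=True)
after = colt.predict(0x1234, 0b01, all_wrong)
```

Because the counter is indexed by the whole pattern of component outputs, the table can learn "when all three say not-taken, predict taken", an answer that no selection among the three components could produce.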
COLT Lookup Delay
  [Diagram: timing of P1, P2, …, Pn (Prediction time) and MT Select; critical delay]
Design Choices
  # of branch address bits
  # of branch history bits
    } Determines number of mapping tables
  # of components
    } Determines size of individual MTs
  Choice of components
    – gshare, PAs, gskewed, …
    – History length, PHT size, …
Predictor Components
  Global History
    – gshare [McFarling93]
    – Bi-Mode [Lee97]
    – Enhanced gskewed [Michaud97]
    – YAGS [Eden98]
  Local History
    – PAs [Yeh94]
    – pskewed [Evers96]
  Other
    – 2bC (bimodal) [Smith81]
    – Loop [Chang95]
    – alloyed Perceptron [Jiménez02]
  History lengths optimized on test data sets
  Total of 59 configurations
  Sizes vary up to 64KB
Huge Search Space
  2^59 ways to choose components
  ways to choose COLT parameters
  We use a genetic search
  Gene format: [component bits …] [VMT Size] [history length] …
    – bit-k = 0 means don’t include P_k
    – bit-k = 1 means do include P_k
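The gene layout above could be decoded along these lines. This is a sketch under assumptions: the 59 component bits come from the slide, but the 4-bit widths for the VMT size and history length fields, the binary encoding of those fields, and the single-point `crossover` helper are hypothetical and not taken from the talk.

```python
N_CONFIGS = 59   # one inclusion bit per component configuration (from the slide)
VMT_BITS = 4     # hypothetical field width
HIST_BITS = 4    # hypothetical field width
GENE_LEN = N_CONFIGS + VMT_BITS + HIST_BITS


def bits_to_int(bits):
    """Interpret a list of 0/1 values as an unsigned integer, MSB first."""
    value = 0
    for b in bits:
        value = (value << 1) | b
    return value


def decode(gene):
    """Return (selected component indices, VMT size field, history length field)."""
    if len(gene) != GENE_LEN:
        raise ValueError("unexpected gene length")
    components = [k for k in range(N_CONFIGS) if gene[k] == 1]
    vmt_field = bits_to_int(gene[N_CONFIGS:N_CONFIGS + VMT_BITS])
    hist_field = bits_to_int(gene[N_CONFIGS + VMT_BITS:])
    return components, vmt_field, hist_field


def crossover(parent_a, parent_b, point):
    """Single-point crossover: head of one parent, tail of the other."""
    return parent_a[:point] + parent_b[point:]


gene = [0] * GENE_LEN
gene[0] = gene[58] = 1                       # include P_0 and P_58
gene[59:63] = [0, 1, 0, 1]                   # VMT size field = 5
gene[63:67] = [0, 0, 1, 1]                   # history length field = 3
decoded = decode(gene)
```

A fixed-length bit string keeps mutation and crossover trivial, which is one plausible reason to encode component choice as inclusion bits rather than as a variable-length list.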
Methodology
  SPEC2000 integer benchmarks
    – For tuning/optimization: 10M branches from test
    – For evaluation: 500M branches from train
  Skipped first 100M branches
    – Compiled with cc -arch ev6 -O4 -fast -non_shared
  SimpleScalar simulator
    – sim-safe for trace collection
    – MASE for ILP simulations
Genetic Search COLT Results
  Table columns: Name | Size (KB) | Components | VMT | Counter width | History length
  Size (KB) | Components
  16  | alpct(34/10), gskewed(12), gshare(8)
  32  | alpct(34/10), gshare(15), gshare(9), PAs(7)
  64  | alpct(40/14), gshare(16), YAGS(11), pskewed(6)
  128 | alpct(40/14), alpct(38/14), gshare(16), gskewed(13), YAGS(12), PAs(8)
  256 | alpct(50/18), alpct(34/10), gshare(18), Bi-Mode(16), gskewed(15), PAs(8)
Overall Predictor Performance
Per-Benchmark Performance
ILP Performance
  Simulated CPU:
    – 6-issue
    – 20 cycle pipeline
    – Same functional units, latencies, caches as Intel P4/NetBurst microarchitecture
  [Diagram: 1-cycle 2bC combined (+) with 4-cycle OR alpct or 4-cycle OR COLT; Ideal 1-cycle COLT]
ILP Impact
COLT Parameter Sensitivity
  Mapping table counter widths
  Number of mapping tables
  Number of history bits for VMT index
Counter Width
VMT Size
History Length
Explaining Choice of Components
  Parameter sensitivity results show GA performed well for the COLT parameters
  Why did it choose the component predictors that it did?
Classifying COLT Predictions
  We examined the (32KB) COLT config.
  For each mapping table lookup, we examine the neighboring entries:
  [Diagram: P1, P2, P3, P4 index the mapping table; entry 0001 = NT, entry 1001 = T, entry 1101 = T]
Classifying Predictions (cont’d)
  32KB COLT: gshare(9), gshare(14), PAs(7), alpct(34/10)
  Classes:
    – easy: all neighboring entries agree
    – short: only gshare(9) distinguishes
    – long: only gshare(14) distinguishes
    – local: only PAs(7) distinguishes
    – perceptron: only alpct(34/10) distinguishes
    – multi-length: mix of gshare(9), (14) or alpct
    – mixed: both global and local components
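One way to read the classification above is sketched below. The interpretation is an assumption: a "neighboring entry" is taken to be the mapping table entry reached by flipping a single component's prediction bit, and a component "distinguishes" when that flip changes the table's output. The component-to-bit order, the function names, and the representation of a mapping table as a list of 16 booleans (True = taken) are likewise hypothetical.

```python
# Assumed bit order, most significant bit first.
COMPONENTS = ["gshare(9)", "gshare(14)", "PAs(7)", "alpct(34/10)"]

SINGLE = {
    "gshare(9)": "short",
    "gshare(14)": "long",
    "PAs(7)": "local",
    "alpct(34/10)": "perceptron",
}


def distinguishing(table, idx):
    """Names of components whose bit flip changes the predicted direction."""
    n = len(COMPONENTS)
    return {COMPONENTS[k] for k in range(n)
            if table[idx ^ (1 << (n - 1 - k))] != table[idx]}


def classify(table, idx):
    names = distinguishing(table, idx)
    if not names:
        return "easy"
    if len(names) == 1:
        return SINGLE[next(iter(names))]
    # More than one component matters: local involvement makes it "mixed".
    return "mixed" if "PAs(7)" in names else "multi-length"


# Hypothetical tables whose output depends on chosen component bits only.
only_short = [bool(i & 0b1000) for i in range(16)]
short_and_long = [bool(i & 0b1000) != bool(i & 0b0100) for i in range(16)]
labels = (classify(only_short, 0b1001), classify(short_and_long, 0b1001))
# labels == ("short", "multi-length")
```

Under this reading, an "easy" lookup is one where the fused output would be the same no matter which single component changed its mind.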
Prediction Classifications
Related Work/Issues
  Alloyed history [Skadron00]
  Variable path history length [Stark98]
  Dynamic history length fitting [Juan98]
  Interference reduction [lots…]
  COLT handles all of these cases*
  * Doesn’t support partial update policies
Open Research
  Better individual components
  Augment with SBI [Manne99], agree [Sprangle97]
  Better fusion algorithms
  Hybrid fusion/selection algorithms
  Other domains (branch confidence prediction, value prediction, memory dependence prediction, instruction criticality prediction, …)
Summary
  Fusion is more powerful than selection
    – Combines multiple sources of information
  Branch behavior is very varied
    – Need long, short, global and local histories, multiple simultaneous lengths and types of history
  COLT is one possible fusion-based predictor
    – Combines multiple types of information
    – Current “best” purely dynamic predictor*
Questions?