1 A New Case for the TAGE Predictor André Seznec INRIA/IRISA.

2 Branch prediction
Simply the easiest way to improve processor core performance: replacing the branch predictor with a more accurate one does not require changing the rest of the design.

3 The TAGE branch predictor
- Introduced in 2006
- State-of-the-art global history predictor: CBP-2 (2006), CBP-3 (2011)

4 TAGE: multiple tables, global history predictor
The set of history lengths forms a geometric series, e.g. {0, 2, 4, 8, 16, 32, 64, 128}.
- Most of the storage is devoted to short histories!
- Still captures correlation on very long histories
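
As an illustration of the geometric series of history lengths, here is a minimal sketch. The function name and its parameters are hypothetical, and the rounding rule is an assumption; it only reproduces the example set given on the slide.

```python
def geometric_history_lengths(num_tagged_tables, min_length, max_length):
    """Return [0, L(1), ..., L(n)] where L(1)..L(n) form a geometric series.

    Entry 0 stands for the tagless base predictor, which uses no history.
    """
    if num_tagged_tables == 1:
        return [0, min_length]
    # Common ratio chosen so that L(1) = min_length and L(n) = max_length.
    ratio = (max_length / min_length) ** (1.0 / (num_tagged_tables - 1))
    lengths = [0]
    for i in range(num_tagged_tables):
        # Round to the nearest integer (assumed rounding rule).
        lengths.append(int(min_length * ratio ** i + 0.5))
    return lengths


print(geometric_history_lengths(7, 2, 128))
```

With 7 tagged tables, a shortest history of 2 and a longest of 128, the ratio is 2 and the result is the set shown on the slide.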

5 TAGE: tagged tables and prediction by the longest-history matching entry
[Figure: a tagless base predictor indexed with pc, plus tagged tables indexed with pc and h[0:L1], h[0:L2], h[0:L3]; each tagged entry holds (ctr, u, tag); the tag comparisons (=?) select the prediction.]
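
The lookup described on this slide can be sketched as follows. This is a simplified illustration, not the author's implementation: the hash functions (`fold`, `index`, `tag`), the table sizes and the signed-counter convention (taken when ctr >= 0) are assumptions, and the u field, allocation and update policies are left out.

```python
def fold(history, length, width):
    """XOR-fold the `length` most recent history bits into `width` bits."""
    h = history & ((1 << length) - 1)
    folded = 0
    while h:
        folded ^= h & ((1 << width) - 1)
        h >>= width
    return folded


class TaggedTable:
    def __init__(self, hist_len, index_bits=10, tag_bits=8):
        self.hist_len = hist_len
        self.index_bits = index_bits
        self.tag_bits = tag_bits
        self.entries = {}  # index -> (tag, ctr); a missing index is invalid

    def index(self, pc, history):
        mask = (1 << self.index_bits) - 1
        return (pc ^ fold(history, self.hist_len, self.index_bits)) & mask

    def tag(self, pc, history):
        mask = (1 << self.tag_bits) - 1
        return ((pc >> 2) ^ fold(history, self.hist_len, self.tag_bits)) & mask


def tage_predict(pc, history, base_counters, tables):
    """Return (taken, provider); provider is -1 for the tagless base table.

    `tables` must be ordered by increasing history length, so the last
    matching table is the longest-history match.
    """
    taken = base_counters.get(pc & 0x3FF, 0) >= 0
    provider = -1
    for i, table in enumerate(tables):
        entry = table.entries.get(table.index(pc, history))
        if entry is not None and entry[0] == table.tag(pc, history):
            taken, provider = entry[1] >= 0, i
    return taken, provider
```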

6 Why TAGE
State-of-the-art global history predictor, but also:
- Large cost-effective design space: 32 Kbits to 512 Kbits, 4 to 12 tables, 100 to 1000 history bits
- Confidence estimator for free [HPCA2011]

7 And more in this presentation
Cost-effective hardware complexity and energy consumption:
- Limited number of accesses to predictor tables
- Use of single-ported predictor tables
Improving TAGE accuracy with a small side predictor:
- Tracking statistical correlation
- Using local history

8 The implicit 3-access scenario in academic studies
A prediction on the right path:
- Read at prediction time
- Update at retire time: re-read, then update
3 accesses on the same predictor entry!
Might lead to the use of 3-ported components.

9 Why not only 2 accesses, by propagating the values read at prediction time?
Example: a loop, a bimodal predictor.
[Figure: counter values C seen between Fetch, Execute and Retire for successive iterations of the loop; labels: C=0, misprediction; C=1, misprediction; C=2, correct prediction.]
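
The loop example can be reproduced with a toy model. This sketch makes several assumptions not stated on the slide: a single 2-bit counter starting at 0, a fixed number of branches in flight, one fetch per step, and no pipeline flush on a misprediction. The function name and parameters are hypothetical.

```python
from collections import deque


def count_mispredictions(outcomes, depth, reread_at_retire):
    """Count mispredictions of one 2-bit bimodal counter.

    reread_at_retire=True  : 3 accesses (read, re-read at retire, write).
    reread_at_retire=False : 2 accesses; the value read at prediction
                             time is propagated and used for the update.
    """
    ctr = 0  # 2-bit counter in 0..3, predicts taken when ctr >= 2
    inflight = deque()  # (value read at prediction time, outcome)
    mispredictions = 0
    for taken in outcomes:
        value_read = ctr
        if (value_read >= 2) != taken:
            mispredictions += 1
        inflight.append((value_read, taken))
        if len(inflight) > depth:
            old_value, old_taken = inflight.popleft()
            base = ctr if reread_at_retire else old_value
            ctr = min(3, base + 1) if old_taken else max(0, base - 1)
    return mispredictions


loop = [True] * 20  # an always-taken loop branch
print(count_mispredictions(loop, depth=4, reread_at_retire=True))
print(count_mispredictions(loop, depth=4, reread_at_retire=False))
```

In this toy setting the propagated (stale) values make the counter train more slowly: 6 mispredictions with the re-read against 10 without it.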

10 Is it that important for global history predictors?
With only 2 accesses:
- 33 % increase in misprediction rate on gshare
- 17 % increase in misprediction rate on GEHL
- 4 % increase in misprediction rate on TAGE
Measured using the 3rd Championship Branch Prediction framework.

11 Reducing the number of predictor writes
At retire time:
- Lots of silent updates (rewriting saturated counters) [Baniasadi and Moshovos 2003]
- ~2.1 writes per misprediction for TAGE
- Fewer than 10 writes per 100 branches
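
A silent update is a write that would store the value already present, typically on a saturated counter. A minimal sketch of filtering such writes, assuming a 2-bit unsigned counter and a hypothetical `update_counter` helper:

```python
def update_counter(table, index, taken, ctr_max=3):
    """Update a saturating counter; return True only if a write was needed."""
    old = table[index]
    new = min(ctr_max, old + 1) if taken else max(0, old - 1)
    if new == old:
        return False  # silent update: the counter is saturated, skip the write
    table[index] = new
    return True


table = [3]
print(update_counter(table, 0, taken=True))   # saturated: no write needed
print(update_counter(table, 0, taken=False))  # value changes: write needed
```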

12 Eliminating most of the reads at retire time
On correct predictions, do not re-read, but use the values read at prediction time:
- gshare: +4.5 % mispredictions
- GEHL: +8.8 % mispredictions
- TAGE: +1.3 % mispredictions
TAGE: 1.13 accesses {prediction + (read at retire) + update} per prediction on the correct path.

13 Opens the opportunity to use single-ported memory components
Cycle stealing: wait for free cycles to update:
- Mispredictions
- Fetch gating
- Front-end stalls
Complex management... and impact on accuracy?

14 A simple and general scheme using single-ported components

15 4-way interleaved, single-ported
4 banks per predictor table.
Guarantee that 3 consecutive predictions are done by 3 different banks.
Bank for the prediction of Z, after X and Y:
    b(Z) = Z & 3
    while ((b(Z) == b(X)) || (b(Z) == b(Y)))
        b(Z) = (b(Z) + 1) & 3
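
The bank selection rule can be written as a small runnable sketch. The function name `select_bank` is hypothetical, and the increment is read as a wrap-around step to the next bank, modulo the number of banks.

```python
def select_bank(z, bank_x, bank_y, num_banks=4):
    """Pick the bank for prediction Z, given the banks used by X and Y.

    X and Y are the two preceding predictions. With 4 banks and at most
    2 excluded banks, the loop always terminates.
    """
    bank = z & (num_banks - 1)
    while bank == bank_x or bank == bank_y:
        bank = (bank + 1) & (num_banks - 1)
    return bank


banks = []
for z in [0, 0, 0, 4, 8, 1, 1, 5]:
    bank_x = banks[-2] if len(banks) >= 2 else None
    bank_y = banks[-1] if len(banks) >= 1 else None
    banks.append(select_bank(z, bank_x, bank_y))
print(banks)
```

Any three consecutive entries of `banks` are pairwise different, which is what lets the delayed retire-time read and the update find a free cycle on a single-ported bank.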

16 Read at prediction has priority
- Read at retire is delayed by at most one cycle
- Write update is delayed by at most two cycles

17 T=0: worst case for an update
[Figure: banks B0, B1, B2, B3 with accesses labeled Pa, Rt and Un.]
- Prediction has priority
- No prediction for at least 2 cycles
- No extra read at retire time and no update for 2 cycles

18 T=1
[Figure: banks B0, B1, B2, B3 with accesses labeled Rt and Un.]
- No prediction by construction
- Read at retire time

19 T=2
[Figure: banks B0, B1, B2, B3 with the access labeled Un.]
- No prediction and no read at retire time by construction

20 4-way interleaved vs 3-ported TAGE predictor
- 0.5 % increase in misprediction rate
- 3.3x decrease in silicon area of the predictor tables
- 2x decrease in energy per table access
Also works for the other global history predictors.

21 Improving TAGE accuracy with a small side predictor

22 Two classes of branches not that well predicted by TAGE
« Statistically » correlated branches:
- Not really correlated with the global history, but exhibit a bias
- Sometimes better predicted by a single wide PC-indexed counter than by TAGE
Branches correlated with local history:
- No problem if the global history is very regular
- TAGE cannot learn the pattern if it is irregular
- Not just the loops with constant iteration numbers

23 The Statistical Corrector predictor (from 3rd Championship Branch Prediction)
Poor correlation with global history, but some bias.
Track cases such that:
- « In this case (PC, history, prediction), TAGE is likely (>50 %) to mispredict »
- And reverse the prediction!
A tree of adders captures the « average behavior ».
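
The principle above can be sketched as a small adder-tree predictor. This is a toy illustration under stated assumptions, not the CBP-3 design: the class name, the index hash, the history lengths, the counter range and the threshold are all hypothetical, the counters are trained on every branch, and the TAGE counter value is not used as an input.

```python
class StatisticalCorrector:
    """Toy GEHL-like corrector: sums signed counters, may revert TAGE."""

    def __init__(self, history_lengths=(0, 4, 8, 16), index_bits=8, threshold=4):
        self.history_lengths = history_lengths
        self.mask = (1 << index_bits) - 1
        self.threshold = threshold
        self.tables = [[0] * (1 << index_bits) for _ in history_lengths]

    def _indices(self, pc, history, tage_taken):
        # Hypothetical hash of (PC, history, TAGE prediction), one per table.
        for length in self.history_lengths:
            h = history & ((1 << length) - 1)
            yield (((pc ^ (h * 0x9E5)) << 1) | int(tage_taken)) & self.mask

    def _sum(self, pc, history, tage_taken):
        indices = self._indices(pc, history, tage_taken)
        # The adder tree: each counter c contributes the centered value 2c+1.
        return sum(2 * t[i] + 1 for t, i in zip(self.tables, indices))

    def predict(self, pc, history, tage_taken):
        total = self._sum(pc, history, tage_taken)
        if (total >= 0) != tage_taken and abs(total) > self.threshold:
            return not tage_taken  # TAGE is likely wrong here: revert
        return tage_taken

    def update(self, pc, history, tage_taken, taken):
        indices = self._indices(pc, history, tage_taken)
        for t, i in zip(self.tables, indices):
            t[i] = min(31, t[i] + 1) if taken else max(-32, t[i] - 1)
```

After a few updates where TAGE predicted taken and the branch was not taken, `predict` returns not taken for the same (PC, history, prediction) case.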

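24 Statistical Corrector Predictor
[Figure: TAGE, indexed with the history (H) and the address (A), passes its prediction and counter value to a GEHL-like Stat. Corr., also indexed with H and A, whose adders deliver the final prediction (Pred).]
2.5 % misprediction rate decrease.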

25 Use the same principle for local history biased branches!

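26 Local Statistical Corrector Predictor
[Figure: TAGE, indexed with the history (H) and the address (A), passes its prediction and counter value to an LGEHL-like Local Stat. Corr., indexed with the local history (LH) and the address (A), which delivers the final prediction (Pred). Sizes shown: 478 Kbits and 30 Kbits.]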

27 Local Statistical Corrector Predictor
- 8-9 % misprediction rate decrease over TAGE: local history correlation and statistically biased branches
- No need for a loop predictor
- Small local history tables (32-64 entries)
State-of-the-art prediction accuracy, without the unrealistic tricks used at the 3rd CBP.

28 Managing speculative local history: not that easy
[Figure: a window of inflight branches, each entry holding a S(peculative) H(istory) and a P(rogram) C(ounter), alongside a direct-mapped local history table; the selected local history feeds the Stat. Corr., which also receives the TAGE prediction; the new entry is computed as SH = (SH << 1) + pred.]
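
The mechanism in the figure can be sketched as follows. The class and method names are hypothetical, and several simplifications are assumed: inflight entries are matched on the local history table slot rather than on the full PC, branches retire in fetch order, and recovery after a misprediction is not modeled.

```python
from collections import deque


class SpeculativeLocalHistory:
    def __init__(self, entries=32, history_bits=16):
        self.entries = entries
        self.mask = (1 << history_bits) - 1
        self.table = [0] * entries  # direct-mapped, committed histories
        self.inflight = deque()     # (slot, speculative history), oldest first

    def lookup(self, pc):
        """Associative search on inflight branches, youngest first."""
        slot = pc % self.entries
        for inflight_slot, sh in reversed(self.inflight):
            if inflight_slot == slot:
                return sh
        return self.table[slot]  # no inflight match: use committed history

    def predict(self, pc, pred):
        """Record a new inflight branch: SH = (SH << 1) + pred."""
        sh = ((self.lookup(pc) << 1) + int(pred)) & self.mask
        self.inflight.append((pc % self.entries, sh))
        return sh

    def retire(self, taken):
        """Commit the oldest inflight branch into the local history table."""
        slot, _ = self.inflight.popleft()
        self.table[slot] = ((self.table[slot] << 1) + int(taken)) & self.mask
```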

29 Major local history management cost
The associative search on the inflight branches.
It can be leveraged for another goal!

30 The « late update » mispredictions
Issue:
- Some mispredictions are due to late updates at retirement (later than resolution time)
Immediate Update Mimicker:
- Try to catch these cases

31 The Immediate Update Mimicker for TAGE
[Figure: the window of inflight branches, from Fetch back to a misprediction; each entry holds P(rediction) or E(xecuted), T(able) and A(ddress in the table); the search looks for the same table, same entry.]
1 % misprediction rate decrease.
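
One simplified reading of the figure is sketched below: when the entry providing the prediction (same table, same address) was also used by an inflight branch that has already executed but not yet retired, the executed outcome is used instead of the stale table content. The class, its methods and the dictionary layout are hypothetical, and the repair of the window after a misprediction is not modeled.

```python
from collections import deque


class ImmediateUpdateMimicker:
    def __init__(self):
        self.inflight = deque()  # oldest first

    def predict(self, table, address, tage_taken):
        """Return (prediction, window entry) for a newly fetched branch."""
        taken = tage_taken
        for entry in reversed(self.inflight):  # youngest first
            same_entry = entry["table"] == table and entry["address"] == address
            if same_entry and entry["executed"]:
                taken = entry["taken"]  # mimic the update not yet written
                break
        new_entry = {"table": table, "address": address,
                     "taken": taken, "executed": False}
        self.inflight.append(new_entry)
        return taken, new_entry

    def execute(self, entry, outcome):
        """Record the resolved outcome: P(rediction) becomes E(xecuted)."""
        entry["taken"] = outcome
        entry["executed"] = True

    def retire(self):
        """Remove the oldest branch; the real table update happens here."""
        return self.inflight.popleft()
```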

32 The Immediate Update Mimicker
- Marginal accuracy gain
- But can be combined with speculative local history management

33 [Figure: plot with axes MPPKI and storage budget.]

34 Against alternative predictors
Outperforms the (not so realistic) podium of the 3rd Championship Branch Prediction:
- ISL-TAGE
- FTL++ (GEHL+LGEHL based)
- OH-SNAP (piecewise linear + varying weights)
Particularly on the most predictable benchmarks.

35 Putting it all together
Complexity and energy:
- 4-way interleaved tables
- Reduced accesses at retire time
Accuracy:
- Local Statistical Corrector Predictor
- Immediate Update Mimicker
≈ State-of-the-art predictor. Cost-effective: silicon, energy.

36 Conclusion
Made a new case for TAGE.
Already known:
- State-of-the-art global history predictor
- Confidence estimation for free
Established:
- Area- and energy-effective implementation with single-ported components
- Accuracy improved with the Local Statistical Corrector Predictor

38 Some « hope » on less predictable benchmarks
[Figure: MPPKI plot.]

