1 A 64 Kbytes ITTAGE indirect branch predictor. André Seznec, INRIA/IRISA
2 Build on ITTAGE. ITTAGE was introduced at the same time as TAGE (JILP 2006) and is derived directly from the TAGE predictor: target prediction instead of direction prediction.
3 ITTAGE: multiple tables, global history predictor. The set of history lengths forms a geometric series, e.g. {0, 2, 4, 8, 16, 32, 64, 128}. What is important: L(i)-L(i-1) increases drastically with i, so most of the storage goes to short histories (!!) while correlation on very long histories is still captured.
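A minimal sketch of how such a geometric series of history lengths could be generated. The function name, the rounding rule, and the choice of minimum and maximum lengths are assumptions made for illustration; only the example series {0, 2, 4, ..., 128} comes from the slide.

```python
def geometric_history_lengths(l_min, l_max, n_tables):
    """Return n_tables history lengths growing geometrically from l_min to l_max.

    Hypothetical helper: the rounding (add 0.5, truncate) is an assumption.
    """
    if n_tables < 2 or l_min <= 0 or l_max <= l_min:
        raise ValueError("need n_tables >= 2 and 0 < l_min < l_max")
    ratio = (l_max / l_min) ** (1.0 / (n_tables - 1))
    return [int(l_min * ratio ** i + 0.5) for i in range(n_tables)]


# The base predictor uses no history (length 0); the tagged tables follow.
lengths = [0] + geometric_history_lengths(2, 128, 7)

# Gap between consecutive lengths, i.e. L(i) - L(i-1).
gaps = [b - a for a, b in zip(lengths, lengths[1:])]

print(lengths)
print(gaps)
```

The gaps grow with the table index, which is the property the slide stresses: several tables cover short histories closely, and only a few are needed to reach very long ones.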
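4 The ITTAGE predictor [Figure: a tagless base predictor indexed with pc, plus tagged tables indexed with pc and h[0:L1], pc and h[0:L2], pc and h[0:L3]; tag comparisons (=?) select the prediction]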
5 Prediction computation. General case: the longest matching component provides the prediction. Special case: newly allocated entries (weak Ctr) cause many mispredictions, and Altpred is sometimes (slightly) more accurate than Pred; this property is monitored dynamically through a single 4-bit counter. -2 % MPPKI
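The selection rule above can be sketched as follows. This is an illustration under stated assumptions, not the author's code: the encoding of a weak Ctr as 0, the threshold of 8 on the 4-bit counter, and the counter update rule are all assumed; the slide only states that a single 4-bit counter monitors whether Altpred beats Pred on weak entries.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    target: int
    ctr: int = 0  # 2-bit hysteresis counter; 0 means weak (assumed encoding)


USE_ALT_MAX = 15        # single 4-bit counter
USE_ALT_THRESHOLD = 8   # assumed: prefer Altpred from the midpoint upward


def predict(hits, base_target, use_alt_on_weak):
    """hits: matching tagged entries, ordered from shortest to longest history.

    Returns (final prediction, Pred, Altpred).
    """
    if not hits:
        return base_target, base_target, base_target
    pred = hits[-1].target                                    # longest match
    alt = hits[-2].target if len(hits) > 1 else base_target   # next longest
    if hits[-1].ctr == 0 and use_alt_on_weak >= USE_ALT_THRESHOLD:
        return alt, pred, alt
    return pred, pred, alt


def update_use_alt(counter, provider_weak, pred, alt, actual):
    """Train the 4-bit counter only when a weak provider disagrees with Altpred."""
    if provider_weak and pred != alt:
        if alt == actual:
            counter = min(USE_ALT_MAX, counter + 1)
        elif pred == actual:
            counter = max(0, counter - 1)
    return counter
```

With a low counter value the weak longest match is still trusted; once the counter crosses the assumed threshold, the alternate prediction is used for weak entries.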
6 A tagged table entry. Fields: Target | Tag | Ctr | U. Ctr: 2-bit hysteresis counter. U: 1-bit useful counter (was the entry recently useful?). Tag: partial tag. Target: the target, 32 bits, or some way to reconstruct it.
7 Allocate entries on mispredictions. Allocate entries in tables with longer history lengths, in entries with U unset; set Ctr to Weak and U to 0. HUGE STORAGE BUDGET: up to 3 entries allocated in different tables, for fast warming.
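The allocation policy above could look like the following sketch. The scan order (nearest longer table first), the data layout, and all names are assumptions for illustration; the slide only fixes the rules: longer-history tables, U unset, Ctr set to Weak, U set to 0, at most 3 entries.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    tag: int = 0
    target: int = 0
    ctr: int = 0  # 0 stands for Weak (assumed encoding)
    u: int = 0


def allocate_on_mispredict(tables, provider, indices, tags, target, max_alloc=3):
    """Allocate up to max_alloc entries in tables with longer history than provider.

    tables[t] is a list of Entry; indices[t] and tags[t] are the index and
    partial tag computed for table t. provider is -1 when the tagless base
    predictor gave the prediction. Returns the list of tables allocated in.
    """
    allocated = []
    for t in range(provider + 1, len(tables)):
        entry = tables[t][indices[t]]
        if entry.u == 0:                # only replace entries with U unset
            entry.tag = tags[t]
            entry.target = target
            entry.ctr = 0               # Weak
            entry.u = 0
            allocated.append(t)
            if len(allocated) == max_alloc:
                break
    return allocated
```

The returned list lets a caller tell whether allocation succeeded at all, which is the signal needed to monitor allocation failures.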
8 Managing the (U)seful bit. Set when the entry avoids a misprediction: (Pred = target) & (Alt ≠ target). Global reset when there are « difficulties » to allocate: dynamically monitor whether allocations fail more often than they succeed.
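A hedged sketch of the two mechanisms on this slide. The setting rule follows the slide directly. The monitor is an assumption: it uses an up/down counter that rises on failed allocations and falls on successful ones, and the reset limit of 256 is an arbitrary illustrative value.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    u: int = 0


def update_useful(provider, pred, alt, actual):
    """Set U when the provider avoided a misprediction the alternate would make."""
    if pred == actual and alt != actual:
        provider.u = 1


class AllocationMonitor:
    """Hypothetical monitor: resets every U bit when failures dominate."""

    def __init__(self, tables, limit=256):
        self.tables = tables
        self.limit = limit   # assumed threshold, not given in the source
        self.tick = 0

    def record(self, succeeded):
        """Record one allocation attempt; return True if a global reset fired."""
        if succeeded:
            self.tick = max(0, self.tick - 1)
        else:
            self.tick += 1
        if self.tick >= self.limit:
            for table in self.tables:
                for entry in table:
                    entry.u = 0
            self.tick = 0
            return True
        return False
```

Because successes decrement the counter, a reset only fires when failures outnumber successes by the limit, which matches the "more failures than successes" wording.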
9 Most of the storage space goes to targets: 32 bits per entry !! More than 12K (PC, target) pairs on CLIENT05, but only a maximum of 4038 different targets. Use 12-bit pointers + a 4K-entry table.
10 Let us be realistic: leverage target locality. All targets lie in a limited set of KB-sized regions. Use a 128-entry region table: fully associative, 240 bytes. Saves 7 bits per ITTAGE entry; would have saved 39 bits on a 64-bit architecture !!
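The region-table idea can be sketched as below. The 18-bit region offset is an assumption: it is the value for which a 7-bit pointer (128 entries) saves exactly 7 bits on 32-bit targets and 39 bits on 64-bit targets, as the slide states. The round-robin replacement and the one valid bit per region entry are also assumptions.

```python
OFFSET_BITS = 18        # assumed region offset width
REGION_ENTRIES = 128    # from the slide
POINTER_BITS = 7        # log2(128)


class RegionTable:
    """Fully associative table of target regions (high-order target bits)."""

    def __init__(self):
        self.regions = []
        self.next_victim = 0   # round-robin replacement (assumption)

    def encode(self, target):
        """Split a target into (region pointer, region offset)."""
        region = target >> OFFSET_BITS
        offset = target & ((1 << OFFSET_BITS) - 1)
        if region in self.regions:
            pointer = self.regions.index(region)
        elif len(self.regions) < REGION_ENTRIES:
            self.regions.append(region)
            pointer = len(self.regions) - 1
        else:
            pointer = self.next_victim
            self.regions[pointer] = region
            self.next_victim = (pointer + 1) % REGION_ENTRIES
        return pointer, offset

    def decode(self, pointer, offset):
        """Reconstruct the full target from a stored (pointer, offset) pair."""
        return (self.regions[pointer] << OFFSET_BITS) | offset


def bits_saved(target_bits):
    """Bits saved per ITTAGE entry compared with storing the full target."""
    return target_bits - (POINTER_BITS + OFFSET_BITS)


def region_table_bytes(target_bits=32):
    """Region table size, assuming one valid bit per entry."""
    return REGION_ENTRIES * (target_bits - OFFSET_BITS + 1) // 8
```

Under these assumptions `bits_saved(32)` is 7, `bits_saved(64)` is 39, and `region_table_bytes()` is 240, consistent with the figures on the slide. Replacing a region silently invalidates the entries that still point to it, which a real design would have to tolerate or handle.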
11 [Figure: tagged entry with fields Target, Tag, Ctr, U; the Target field is split into a region pointer and a region offset]
12 The global history -16 % MPPKI
13 The global history (2). Including all branches? Only indirect and calls: -2.5 % MPPKI. But this is not conclusive: without 2 branches in INT05 and INT06, the result goes just the other way.
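A sketch of a global history restricted to indirect branches and calls. The deck states elsewhere that the history should include (PC + target) for these branches, with 10 bits per indirect and 5 per call; the particular hash of PC and target, the branch-kind labels, and the 4096-bit cap are assumptions made for illustration.

```python
HISTORY_BITS_PER_KIND = {"indirect": 10, "call": 5}  # from the deck


def update_history(history, pc, target, kind, max_bits=4096):
    """Shift a few bits mixing PC and target into the global history.

    history is a Python int used as a bit vector. Branch kinds other than
    'indirect' and 'call' leave the history unchanged.
    """
    bits = HISTORY_BITS_PER_KIND.get(kind, 0)
    if bits == 0:
        return history
    # Assumed hash: fold PC and target together, keep the low `bits` bits.
    mix = (pc ^ (pc >> 4) ^ target ^ (target >> 3)) & ((1 << bits) - 1)
    return ((history << bits) | mix) & ((1 << max_bits) - 1)
```

Inserting more bits for an indirect branch than for a call gives the less predictable event more weight in the history, at the cost of consuming the history window faster.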
14 + the other tricks (for TAGE): Immediate Update Mimicker; storage space interleaving; picking the best set of history lengths. -1 % MPPKI
15 The Immediate Update Mimicker. Issue: some mispredictions are due to late updates at retirement. Immediate Update Mimicker: try to catch these cases.
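One way to picture the mechanism, as a sketch under assumptions rather than the design used in the predictor: each in-flight branch records which table and entry provided its prediction; if a later fetch hits the same table and entry while an earlier branch has already executed but not yet retired, the executed target is used in place of the stale table content. The class, its methods, and the "youngest matching record" policy are all hypothetical.

```python
class ImmediateUpdateMimicker:
    """Tracks in-flight branches between fetch and retirement."""

    def __init__(self):
        self.inflight = []   # oldest first

    def predict(self, table, index, table_prediction):
        """Return (prediction, record) for a branch predicted by (table, index)."""
        prediction = table_prediction
        for rec in reversed(self.inflight):          # youngest first
            if rec["table"] == table and rec["index"] == index:
                if rec["executed"]:
                    prediction = rec["actual"]       # mimic the pending update
                break
        record = {"table": table, "index": index,
                  "executed": False, "actual": None}
        self.inflight.append(record)
        return prediction, record

    def execute(self, record, actual_target):
        """Called when the branch resolves, before it retires."""
        record["executed"] = True
        record["actual"] = actual_target

    def retire(self):
        """Retire the oldest in-flight branch; the real table update happens here."""
        return self.inflight.pop(0)
```

A typical sequence: a branch mispredicts at execute, the table is only corrected at retirement, and any occurrence fetched in between would repeat the misprediction without this lookup.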
16 The Immediate Update Mimicker [Figure: in-flight branches from Fetch to Misprediction, each tagged PTA: P(rediction), T(able), A(ddress in the table); some are marked ETA; branches hitting the same table and same entry are highlighted]
17 For the competition: interleaving [Figure: storage accessed through a crossbar (Xbar) with history h[0:L1]; tag comparison (=?) yields the prediction]
18 For the competition. Guided selection of the best set of history lengths: 4K entries: 0; 4K entries: 0, 10; 4K entries: 16, 27, 44, 60, 96, 109, 219, 449; 2K entries: 487, 714, 1313, 2146, 3881. Remember: 10 bits per indirect, 5 per call.
19 Where is the limit? Less than 3 % MPPKI. Why did you not use the « 12-bit pointer » trick? It wins just 0.5 % MPPKI.
20 Summary. ITTAGE is directly derived from TAGE. History should include (PC + target) for indirect branches and calls. Locality on targets can be leveraged. Marginal tricks are not really worth it.