Slide 1: Revisiting the perceptron predictor
André Seznec, CAPS Team, IRISA/INRIA
Slide 2: Perceptron-based branch prediction (Jimenez and Lin, HPCA 2001)
- A radically new approach to branch prediction
- Associate a set of 8-bit counters (weights) with a branch address
- Use the global history vector as an input vector (+1, -1)
- Multiply-accumulate the weights by the inputs and use the sign of the sum as the prediction
- Selective update: increment/decrement on a misprediction, or when |sum| is lower than a threshold
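The scheme above can be sketched in a few lines. This is a minimal illustration of a Jimenez/Lin-style perceptron, not the paper's tuned design: the history length, threshold value, and weight range here are illustrative assumptions.

```python
# Minimal sketch of a perceptron branch predictor (illustrative
# parameters, not the HPCA 2001 tuned configuration).
HIST_LEN = 8
THRESHOLD = 20          # training threshold (hypothetical value)
WMAX, WMIN = 127, -128  # 8-bit saturating weights

class Perceptron:
    def __init__(self):
        # one weight per history bit, plus a bias weight w[0]
        self.w = [0] * (HIST_LEN + 1)

    def dot(self, hist):
        # hist: list of +1/-1 inputs; w[0] is the constant bias
        return self.w[0] + sum(wi * hi for wi, hi in zip(self.w[1:], hist))

    def predict(self, hist):
        return self.dot(hist) >= 0   # sign of the sum gives the prediction

    def update(self, hist, taken):
        s = self.dot(hist)
        t = 1 if taken else -1
        # selective update: train only on a misprediction,
        # or when |sum| is at or below the threshold
        if (s >= 0) != taken or abs(s) <= THRESHOLD:
            self.w[0] = max(WMIN, min(WMAX, self.w[0] + t))
            for i, hi in enumerate(hist):
                self.w[i + 1] = max(WMIN, min(WMAX, self.w[i + 1] + t * hi))
```

After a few training rounds on two opposite history patterns, the predictor separates them by the sign of the weighted sum.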
Slide 3: Perceptron predictor
[Figure: the weights are multiplied by the history inputs and summed; the sign of the sum gives the prediction]
Slide 4: Perceptron prediction works
+ Complexity is linear in the history length: it can capture correlation over a very long history
- But long latency: the multiply-accumulate tree!
- Inherently unable to discriminate between two histories that are not linearly separable: with 2 weights and 2 history bits, h0 ⊕ h1 is not recognized!
Can we do better?
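The non-separability claim is easy to check mechanically. The sketch below brute-forces small integer weights over ±1 inputs: no (bias, w1, w2) triple classifies all four XOR cases, while a separable function such as AND is found immediately. The weight range searched is an arbitrary illustration.

```python
# Brute-force check that h0 XOR h1 is not linearly separable
# with 2 weights plus a bias, while AND is.
from itertools import product

def separable(cases, rng=range(-8, 9)):
    # try every (bias, w1, w2) in a small integer range
    for w0, w1, w2 in product(rng, repeat=3):
        if all(((w0 + w1 * h0 + w2 * h1) >= 0) == out
               for (h0, h1), out in cases):
            return True
    return False

# inputs in +1/-1 form, outcome as taken / not-taken
xor_cases = [((+1, +1), False), ((+1, -1), True),
             ((-1, +1), True),  ((-1, -1), False)]
and_cases = [((+1, +1), True),  ((+1, -1), False),
             ((-1, +1), False), ((-1, -1), False)]
```

This is exactly the gap that the redundant-history idea on the next slide is meant to close.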
Slide 5: Use a redundant history
- Insert several bits per branch in the history to enhance linear separability: h0, h0 ⊕ h1, h0 ⊕ h2, ...
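A hypothetical sketch of this expansion: each history bit is inserted together with a few XOR combinations with older bits, so XOR-style correlations become linearly separable inputs. The exact combinations and the `degree` parameter are assumptions for illustration; the slide's precise bit pattern is not fully specified.

```python
# Hypothetical redundant-history expansion: emit each bit h_i plus
# h_i ^ h_{i+1}, h_i ^ h_{i+2}, ... (degree-1 XOR terms per bit).
def redundant_history(hist, degree=4):
    """hist: list of 0/1 bits, most recent first.
    Returns the expanded input vector in +1/-1 form."""
    out = []
    for i, h in enumerate(hist):
        out.append(h)                    # the plain bit
        for j in range(1, degree):       # XOR with a few older bits
            if i + j < len(hist):
                out.append(h ^ hist[i + j])
    return [1 if b else -1 for b in out]
```

The expanded vector feeds the same perceptron sum as before; only the input width grows.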
Slide 6: Redundant history perceptron
+ Significant misprediction reduction: > 30% for 12 out of 20 benchmarks
- But 256 weights: a 256-entry multiply-add tree, 2048 bits wide!! 256 counter updates!!
- Latency? Power consumption? Logic complexity?
Slide 7: 4 weights for 2 history bits = a single counter read
Inputs (0, h0, h1, h0 ⊕ h1), weights W0, W1, W2, W3.
Possible contributions to the branch prediction:
- h = 0, inputs (0,0,0,0): C0 = -W0 - W1 - W2 - W3
- h = 1, inputs (0,1,0,1): C1 = -W0 + W1 - W2 + W3
- h = 2, inputs (0,0,1,1): C2 = -W0 - W1 + W2 + W3
- h = 3, inputs (0,1,1,0): C3 = -W0 + W1 + W2 - W3
Update for h = 2 and outcome = 1: C2 += 4; C0, C1 and C3 are unchanged.
Let us store the multiply-accumulate contributions instead of the weights!!
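The key identity on this slide can be verified directly: with the four ±1 input vectors above, a weight update for history h changes only contribution C[h], and by exactly 4. The example weights below are arbitrary.

```python
# Check the slide's claim: storing the four contributions C[h] is
# equivalent to storing the four weights, and one update touches
# exactly one contribution (by +/-4).
def inputs(h):
    h0, h1 = h & 1, (h >> 1) & 1
    bits = [0, h0, h1, h0 ^ h1]        # the slide's input pattern
    return [1 if b else -1 for b in bits]

def contribution(w, h):
    return sum(wi * xi for wi, xi in zip(w, inputs(h)))

w = [3, -2, 5, 1]                       # arbitrary example weights
before = [contribution(w, h) for h in range(4)]

# perceptron weight update for history h = 2, outcome taken (t = +1)
t, h = 1, 2
w2 = [wi + t * xi for wi, xi in zip(w, inputs(h))]
after = [contribution(w2, hh) for hh in range(4)]
```

The equivalence holds because the four input vectors are mutually orthogonal, so the update's dot product with any other history's input vector is zero.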
Slide 8: MAC contribution: 4-way redundant history
- Represent blocks of 4 history bits per 16 weights
- There are only 16 possible multiply-accumulate contributions associated with these 16 weights
- Store the multiply-accumulate contributions instead of the weights!!
Slide 9: Redundant history perceptron predictor with MAC contributions
[Figure: N 16-to-1 MUXes, each driven by 4 of the 4N history bits, select one contribution per block; the selected values are summed and the sign gives the prediction]
Slide 10: Redundant history and MAC representation
- Replace a 16-entry multiply-add tree by a 16-to-1 MUX
- With saturated arithmetic, the width of the counters can be reduced to 6 bits
- A 256-entry 8-bit multiply-accumulate tree is replaced by a 16-entry 6-bit adder tree
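Putting the two ideas together, the lookup path reduces to table reads plus a narrow adder tree. The sketch below is an assumed software model of that organization: table contents and sizes are illustrative.

```python
# Sketch of the MAC-table organization: the 4N history bits are cut
# into N blocks of 4; each block selects one of 16 precomputed
# contributions (the 16-to-1 MUX), and the selected 6-bit saturated
# values are summed.
SAT_MAX, SAT_MIN = 31, -32            # 6-bit saturating counters

def sat_add(c, d):
    # saturated update of a stored contribution
    return max(SAT_MIN, min(SAT_MAX, c + d))

def mac_predict(tables, history_blocks):
    # tables[b] is a 16-entry contribution table for block b;
    # history_blocks[b] is a 4-bit value selecting one entry
    s = sum(tables[b][h] for b, h in enumerate(history_blocks))
    return s >= 0
```

The multiply-accumulate work has been done once at update time; prediction time only selects and adds.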
Slide 11: Redundant history and MAC representation
[Figure]
Slide 12: Back to finite-storage predictors
Slide 13: Redundant history perceptron vs. optimized 2bcgskew
- Optimized 2bcgskew: 1 Mbit, 72-36-9-9 history lengths + lots of tricks
- 768 Kbit redundant history perceptron
- 20 benchmarks (SPEC 2000 + SPEC 95): fifty-fifty!!
- The perceptron and 2bcgskew do not capture exactly the same kind of correlation!!
Slide 14: Towards the best of both worlds!
The redundant history skewed perceptron predictor
Slide 15: Self-aliasing on a perceptron predictor
1. Consider histories H and H' for a branch B that differ in their recent bits. If both behaviors are dictated by the same coinciding "old" history segment (e.g. bits 20-23), there is an aliasing effect on a counter!!
2. Most of the correlation is captured by the recent history: most counters associated with "old" history are wasted.
3. Enable the use of the whole spectrum of counters through multiple tables with different indices: skewing.
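A skewed table access can be sketched as one hash function per table, mixing the branch address and the history differently for each table. The particular hash mixes below are illustrative assumptions, not the talk's actual index functions.

```python
# Hypothetical skewed indexing: each table combines the branch PC
# and the history with a different mix, so a collision in one
# table is unlikely to repeat in the others.
def skewed_index(table_id, pc, hist, bits=10):
    mask = (1 << bits) - 1
    # per-table mix of the history (illustrative, not the real hash)
    h = (hist >> (table_id * 3)) ^ (hist * (2 * table_id + 1))
    return (pc ^ h ^ (table_id * 0x9E3779B9)) & mask
```

With four tables, the same (PC, history) pair reads four different entries, spreading the pressure that a single-table perceptron concentrates on one counter.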
Slide 16: Redundant history skewed perceptron predictor
[Figure: 4 tables accessed with different indices; the selected contributions are summed]
Slide 17: Redundant history skewed perceptron predictor
[Figure]
Slide 18: Further leveraging long history
- Some applications may benefit from history lengths up to 128 bits; many do not!!
- We don't want a wider adder tree
- For a fixed history length, the number of paths that lead to a single branch varies considerably: some history sections carry less information than others
- Repeating patterns waste space in the history
- Use a compressed form of the history!
Slide 19: Further leveraging long history (2)
- Replace repeating patterns (up to 5 bits) by narrower chains
- 1.5-3 compression ratio on our benchmark set
- Use half uncompressed history and half compressed history
- Significant benefit (> 25%) on several benchmarks; harmless for the others
- Essentially captures all the correlation associated with local history
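To give the flavor of the idea, here is the simplest (period-1) case of pattern collapsing: consecutive repeats of the same bit are folded into a single occurrence, so a loop branch does not flood the history register. The talk's actual encoding handles patterns up to 5 bits and is not reproduced here.

```python
# Hedged sketch of history compression, period-1 case only:
# collapse runs of identical outcome bits into one bit each.
def compress_runs(hist):
    """hist: list of 0/1 outcome bits. Returns the run-collapsed
    history; len(hist) / len(result) is the compression ratio."""
    out = []
    for b in hist:
        if not out or out[-1] != b:
            out.append(b)
    return out
```

On loop-dominated histories this already yields large ratios; the full scheme extends the same collapsing to short multi-bit patterns.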
Slide 20: RHSP and compressed history
[Figure]
Slide 21: Addressing the predictor latency
The ahead-pipelined redundant history perceptron predictor
Slide 22: The latency issue!
- A single-cycle prediction would be needed, but: 2-4 cycles for the table read, 2-4 cycles for the adder tree
- Ahead-pipelined 2bcgskew (Seznec and Fraboulet, ISCA 2003): on-the-fly insertion of information into the table indices; resolves mispredictions at execution time
- Path-based perceptron (Jiménez, MICRO 2003): "systolic-like" ahead-pipelined perceptron prediction; does not address the table read delay; resolves mispredictions at commit time, not at execution time
Slide 23: Ahead pipelining the RHSP: the challenges
- Use X-block-ahead information to initiate the branch prediction: the X-block-ahead address and global history
- Use intermediate path information to ensure prediction accuracy
- But in-flight insertion of table indices is not sufficient!?!
- Need to checkpoint all the information required to recompute, on the fly, any possible prediction for the X-1 intermediate blocks
- But avoid a checkpoint volume explosion
Slide 24: Ahead-pipelined redundant history skewed perceptron predictor
[Figure: the RHSP tables are read X blocks ahead; a sum over 14 counters is combined with 32 counters for the intermediate paths, selected by 5 bits of 1-block-ahead history]
Slide 25: Ahead-pipelined redundant history skewed perceptron predictor
- Compute a partial sum using only X-block-ahead information
- Discriminate only 32 possible paths: the 32 associated counters are read
- Compute the 32 possible sums
- Select the prediction in the last cycle
- Checkpoint the 32 possible predictions
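The late-select step above can be modeled as follows. This is an assumed software sketch: the partial sum comes from the X-block-ahead read, the 32 path counters cover the 5 intermediate path bits, and the actual path bits pick one candidate at the last moment.

```python
# Sketch of the ahead-pipelined selection: 32 candidate sums are
# prepared from X-block-ahead information, and the 5 path bits
# observed on the actual path select the prediction (a late MUX).
def ahead_predict(partial_sum, path_counters, path_bits):
    """partial_sum: sum from the X-block-ahead table reads.
    path_counters: 32 counters, one per intermediate path.
    path_bits: 5-bit value (0..31) identifying the actual path."""
    sums = [partial_sum + c for c in path_counters]   # 32 candidates
    return sums[path_bits] >= 0                       # last-cycle select
```

Only the 32 candidate predictions need to be checkpointed, which is what keeps the checkpoint volume bounded.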
Slide 26: Ahead-pipelined RHSP (768 Kbit)
[Figure]
Slide 27: Ahead-pipelined RHSP
- Very limited loss of accuracy for 6-block-ahead prediction: 5 bits of 1-block-ahead history are sufficient to discriminate among all the intermediate paths
- The loss of accuracy increases with the prediction distance: we no longer discriminate between all the paths; the number of paths originating from the same X-block-ahead block explodes; fewer and fewer predictions are performed by the low-order counters
Slide 28: Summary
Perceptron-based prediction improved:
- Prediction accuracy: use of redundant history, introduction of skewing, introduction of history compression
- MAC representation: a 16-entry 6-bit adder tree against a 256-entry 8-bit multiply-accumulate tree
- X-block-ahead RHSP: on-time prediction without sacrificing accuracy or misprediction penalty; misprediction resolution at the execution stage
Slide 29: Wide possible design space
To deal with implementation constraints, the designer can play with:
- The number of tables
- The width of the histories
- The compressed/uncompressed ratio
- The threshold and width of the counters: half threshold with 5-bit counters is not so bad
- Other MAC representations: 8 counters for 3 bits, 16 counters for 5 bits, ...
Slide 30: Bonus
An "objective" comparison of RHSP and 2bcgskew by their (common) inventor
Slide 31: 2bc-gskew: logical view
[Figure: e-gskew]
Slide 32: Optimized 2bcgskew
All the optimizations of the EV8 predictor:
- Different history lengths for all tables
- Different hysteresis and prediction table sizes
Plus a few other tricks:
- Sharing the prediction and hysteresis tables through banking
- Randomly enforcing the flipping of counters on mispredictions to avoid ping-pong phenomena
- No "guru"-designed hash functions: just good functions
- A 2**(N+11)-bit predictor with (N, N, 4N, 8N) history: (4, 4, 16, 32) for 32 Kbit, (9, 9, 36, 72) for 1 Mbit
Slide 33: 2bcgskew vs RHSP (1)
Efficiency of the prediction scheme:
- Both can use a very long history: extra local-history prediction brings very little benefit; not aware of any other predictor handling such long histories
- RHSP better tolerates/accommodates compressed history
- RHSP captures some extra correlation
Efficiency of the storage usage (small predictors, e.g. 32 Kbit):
- 2bcgskew is more efficient on a few demanding benchmarks: go, gcc95
- RHSP is surprisingly efficient on most benchmarks
Slide 34: 2bcgskew vs RHSP (2)
Accesses to the predictor:
- RHSP: up to three accesses, but not so many on correct predictions
- 2bcgskew: a single access to the prediction tables and a single access to the hysteresis tables on correct predictions
Slide 35: 2bcgskew vs RHSP (3)
Hardware logic cost:
- RHSP: adder tree + counter updates
- 2bcgskew: hash functions + small logic
Latency:
- RHSP: table read + adder tree
- 2bcgskew: table read + a few gates
Slide 36: That's the end, folks!
Backup slides:
Slide 38: RHSP and compressed history [Figure]
Slide 39: RHSP and compressed history (2) [Figure]
Slide 40: RHSP and compressed history (3) [Figure]
Slide 41: RHSP vs 2bcgskew storage effectiveness (1) [Figure]
Slide 42: RHSP vs 2bcgskew storage effectiveness (2) [Figure]
Slide 43: RHSP vs 2bcgskew storage effectiveness (3) [Figure]