1 Revisiting the perceptron predictor André Seznec IRISA/ INRIA.

Presentation transcript:

André Seznec Caps Team Irisa 2 Perceptron-based branch prediction (Jiménez and Lin, HPCA 2001): a radically new approach to branch prediction. Associate a set of 8-bit counters, or weights, with a branch address. Use the global history vector as an input vector of +1/-1 values. Multiply-accumulate the weights with the inputs and use the sign of the sum as the prediction. Selective update: increment/decrement the weights on a misprediction, or if the magnitude of the sum is lower than a threshold.
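The prediction and selective-update rules on this slide can be sketched as follows. This is a minimal illustration for a single branch, not the authors' implementation: the threshold value is an assumption, the bias weight used in the original perceptron predictor is omitted for brevity, and the function names are hypothetical.

```python
from typing import List

THRESHOLD = 20             # hypothetical training threshold (not given on the slide)
W_MIN, W_MAX = -128, 127   # 8-bit signed saturating weights, as on the slide


def perceptron_sum(weights: List[int], history: List[int]) -> int:
    """Multiply-accumulate; each history input is +1 (taken) or -1 (not taken)."""
    return sum(w * h for w, h in zip(weights, history))


def predict(weights: List[int], history: List[int]) -> bool:
    """The sign of the sum is the prediction (non-negative means taken)."""
    return perceptron_sum(weights, history) >= 0


def update(weights: List[int], history: List[int], taken: bool) -> None:
    """Selective update: train on a misprediction or on a low-magnitude sum."""
    s = perceptron_sum(weights, history)
    outcome = 1 if taken else -1
    mispredicted = (s >= 0) != taken
    if mispredicted or abs(s) < THRESHOLD:
        for i, h in enumerate(history):
            weights[i] = max(W_MIN, min(W_MAX, weights[i] + outcome * h))
```

Each weight moves towards agreement between its history bit and the outcome, so a weight grows positive when the branch tends to follow that history bit and negative when it tends to oppose it.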

André Seznec Caps Team Irisa 3 Perceptron predictor [figure: weights multiplied (X) by the history inputs and summed (∑); sign = prediction]

André Seznec Caps Team Irisa 4 Perceptron prediction works. (+) Complexity is linear in the history length: it can capture correlation on a very long history. (-) But: long latency (the multiply-accumulate tree!), and it is inherently unable to discriminate between two histories if they are not « linearly separable »: with 2 weights and 2 history bits, h0 ⊕ h1 is not recognized! Can we do better?

André Seznec Caps Team Irisa 5 Use a redundant history: insert several bits per branch in the history to enhance linear separability: h0, h0 ⊕ h1, h0 ⊕ h2, h0 ⊕ add
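A small brute-force check illustrates why the redundant bit helps. With inputs (h0, h1) alone, no pair of weights predicts "taken iff h0 ⊕ h1": the histories (1,1) and (0,0) would require w0 + w1 to be both negative and positive. Adding h0 ⊕ h1 as a third input makes the behavior separable. The search range and function names below are assumptions chosen for illustration.

```python
from itertools import product


def encode(bit: int) -> int:
    """Map a history bit to the perceptron input convention: 1 -> +1, 0 -> -1."""
    return 1 if bit else -1


def separable(redundant: bool, weight_range=range(-3, 4)) -> bool:
    """Search small integer weights for a vector predicting taken iff h0 XOR h1."""
    n_inputs = 3 if redundant else 2
    for weights in product(weight_range, repeat=n_inputs):
        ok = True
        for h0, h1 in product((0, 1), repeat=2):
            inputs = [encode(h0), encode(h1)]
            if redundant:
                inputs.append(encode(h0 ^ h1))  # the extra, redundant history bit
            s = sum(w * x for w, x in zip(weights, inputs))
            if (s >= 0) != bool(h0 ^ h1):
                ok = False
                break
        if ok:
            return True
    return False
```

For example, `separable(False)` finds no solution, while `separable(True)` succeeds with the weight vector (0, 0, 1).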

André Seznec Caps Team Irisa 6 Redundant history perceptron. (+) Significant misprediction reduction: > 30 % for 12 out of 20 benchmarks. Weights: a 256 multiply-add tree, 2048 bits wide!! 256 counter updates!! Latency? Power consumption? Logic complexity?

André Seznec Caps Team Irisa 7 4 weights for 2 history bits = a single counter read. Inputs (0, h0, h1, h0 ⊕ h1), weights W0, W1, W2, W3. Possible contributions to the branch prediction: h=0 → (0,0,0,0): C0 = -W0 - W1 - W2 - W3; h=1 → (0,1,0,1): C1 = -W0 + W1 - W2 + W3; h=2 → (0,0,1,1): C2 = -W0 - W1 + W2 + W3; h=3 → (0,1,1,0): C3 = -W0 + W1 + W2 - W3. Update for h=2 and Out=1: C2 += 4; C0, C1 and C3 unchanged. Let us store the multiply-accumulate contributions instead of the weights!!
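The claim that a weight update for h=2 changes only C2, and by exactly 4, can be checked numerically. The sketch below recomputes the four contributions from the weights before and after a standard perceptron update; the starting weights are arbitrary and the helper names are hypothetical.

```python
def inputs_for(h: int):
    """Redundant input vector (0, h0, h1, h0 XOR h1) as +1/-1 values."""
    h0, h1 = h & 1, (h >> 1) & 1
    return [1 if b else -1 for b in (0, h0, h1, h0 ^ h1)]


def contributions(weights):
    """Return [C0, C1, C2, C3]: the multiply-accumulate result for each h."""
    return [sum(w * x for w, x in zip(weights, inputs_for(h))) for h in range(4)]


def train(weights, h: int, outcome: int):
    """Classic perceptron update: each weight moves by outcome * input."""
    return [w + outcome * x for w, x in zip(weights, inputs_for(h))]


before = contributions([3, -2, 5, 1])           # arbitrary starting weights
after = contributions(train([3, -2, 5, 1], h=2, outcome=1))
delta = [a - b for a, b in zip(after, before)]  # only C2 moves
```

Because the four input vectors are mutually orthogonal, an update for one history value leaves the other three contributions untouched, which is what makes storing the contributions directly equivalent to storing the weights.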

André Seznec Caps Team Irisa 8 MAC contribution: 4-way redundant history. Let us really represent blocks of 4 history bits with 16 weights: there are only 16 possible multiply-accumulate contributions associated with these 16 weights. Store the multiply-accumulate contributions instead of the weights!!

André Seznec Caps Team Irisa 9 Redundant History Perceptron Predictor with MAC contribution [figure: 4N history bits drive N 16x1 MUXes; the selected contributions are summed (∑); sign = prediction]

André Seznec Caps Team Irisa 10 Redundant history and MAC representation. Replace a 16 multiply-add tree by a 16-to-1 MUX. Use of saturated arithmetic: one can reduce the width of the counters to 6 bits. The multiply-accumulate tree is replaced by a 16 6-bit adder tree.
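A minimal sketch of the MAC representation for one branch: each block of 4 history bits selects one of 16 stored contributions (the 16-to-1 MUX), and the selected 6-bit saturating counters are summed. The class name, the threshold value, and the unit update step are assumptions for illustration; the slides do not specify the step used with saturated counters.

```python
C_MIN, C_MAX = -32, 31  # 6-bit signed saturating counters (width from the slide)


class MacPerceptron:
    """One branch's predictor: n_blocks groups of 16 stored MAC contributions."""

    def __init__(self, n_blocks: int, threshold: int = 8):
        self.counters = [[0] * 16 for _ in range(n_blocks)]
        self.threshold = threshold  # hypothetical value

    def _selectors(self, history: int):
        # Block i is selected by history bits 4i .. 4i+3 (the 16-to-1 MUX input).
        return [(history >> (4 * i)) & 0xF for i in range(len(self.counters))]

    def total(self, history: int) -> int:
        """Adder tree over the selected counters, one per block."""
        return sum(self.counters[i][s] for i, s in enumerate(self._selectors(history)))

    def predict(self, history: int) -> bool:
        return self.total(history) >= 0

    def update(self, history: int, taken: bool) -> None:
        s = self.total(history)
        if (s >= 0) != taken or abs(s) < self.threshold:
            step = 1 if taken else -1  # assumed unit step
            for i, sel in enumerate(self._selectors(history)):
                c = self.counters[i][sel] + step
                self.counters[i][sel] = max(C_MIN, min(C_MAX, c))
```

Only one counter per block is read and written for a given history, so both prediction and update touch N counters rather than 16N weights.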

André Seznec Caps Team Irisa 11 Redundant history and MAC representation

André Seznec Caps Team Irisa 12 Back to finite storage predictors

André Seznec Caps Team Irisa 13 Redundant History Perceptron vs optimized 2bcgskew. Optimized 2bcgskew: 1Mbit history + lots of tricks. 768 Kbits redundant history perceptron. 20 benchmarks: SPEC SPEC 95. Fifty/fifty!! Perceptron and 2bcgskew do not capture exactly the same kind of correlation!!

André Seznec Caps Team Irisa 14 Towards the best of both worlds ! Redundant history skewed perceptron predictor

André Seznec Caps Team Irisa 15 Self-aliasing on a perceptron predictor. 1. Consider H and H' for a branch B differing on recent bits: if both behaviors are dictated by the same coinciding « old » history segment (e.g. bits 20-23), then there is an aliasing effect on a counter!! 2. Most of the correlation is captured by recent history: most counters associated with « old » history are « wasted ». 3. Let us enable the use of the whole spectrum of counters through the use of multiple tables with different indices: « SKEWING »

André Seznec Caps Team Irisa 16 Redundant History Skewed Perceptron Predictor [figure: 4 tables accessed with different indices; outputs summed (∑)]
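One way to picture "4 tables accessed with different indices" is sketched below. The hash is invented purely for illustration (the slides do not give the index functions): each table mixes the branch address with a different number of recent history bits, so two histories that differ in recent bits land in different entries.

```python
TABLE_BITS = 10  # hypothetical: 1024 entries per table
N_TABLES = 4     # number of tables, as on the slide


def fold(value: int, bits: int) -> int:
    """XOR-fold an arbitrary-width integer down to `bits` bits."""
    mask = (1 << bits) - 1
    out = 0
    while value:
        out ^= value & mask
        value >>= bits
    return out


def skewed_indices(pc: int, history: int) -> list:
    """Return one index per table; table t mixes in 4*(t+1) recent history bits."""
    indices = []
    for t in range(N_TABLES):
        recent = history & ((1 << (4 * (t + 1))) - 1)
        # Shift by t so the same history bit affects a different index bit per table.
        indices.append(fold(pc ^ (recent << t), TABLE_BITS))
    return indices
```

With this made-up hash, flipping the most recent history bit changes the index in every table, so the counters a branch uses for histories H and H' no longer collide.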

André Seznec Caps Team Irisa 17 Redundant History Skewed Perceptron Predictor

André Seznec Caps Team Irisa 18 Further leveraging long history. Some applications may benefit from history lengths up to 128 bits, many do not!! We do not want to use a wider adder tree. For a fixed history length, the number of paths that lead to a single branch varies considerably: there is less information in some history sections than in others. Repeating patterns « waste » space in the history. Use a compressed form of history!

André Seznec Caps Team Irisa 19 Further leveraging long history (2). Replace repeating patterns (up to 5 bits) by narrower chains: compression ratio on our benchmark set. Use half uncompressed history and half compressed history. Significant benefit (> 25 %) on several benchmarks; harmless for the others. Essentially captures all correlation associated with local history.

André Seznec Caps Team Irisa 20 RHSP and compressed history

André Seznec Caps Team Irisa 21 Addressing the predictor latency Ahead pipelined redundant history perceptron predictor

André Seznec Caps Team Irisa 22 The latency issue! Single-cycle prediction would be needed, but: 2-4 cycles for table read; 2-4 cycles for adder tree. Ahead pipelined 2bcgskew (Seznec and Fraboulet, ISCA 2003): on-the-fly information insertion in table indices; resolves mispredictions at execution time. Path-based perceptron (Jiménez, MICRO 2003): « systolic-like » ahead pipelined perceptron prediction; does not address table read delay; resolves mispredictions at commit time, not at execution time.

André Seznec Caps Team Irisa 23 Ahead pipelining the RHSP: the challenges. Use X-block ahead information to initiate the branch prediction: X-block ahead address and global history. Use intermediate path information to ensure prediction accuracy. But in-flight insertion in table indices is not sufficient!?! Need to checkpoint all the information required to recompute on the fly any possible prediction for the X-1 intermediate blocks, while avoiding checkpoint volume explosion.

André Seznec Caps Team Irisa 24 Ahead pipelined Redundant History Skewed Perceptron Predictor [figure: RHSP tables read with X-block ahead information; sum (∑) on 14 counters; + 32 counters for intermediate paths; 5 1-block ahead history]

André Seznec Caps Team Irisa 25 Ahead pipelined Redundant History Skewed Perceptron Predictor. Partial sum using only X-block ahead information. Discriminate only 32 possible paths: 32 associated counters are read. Compute 32 possible sums. Select the prediction on the last cycle. Checkpoint the 32 possible predictions.
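The late-select step described on this slide can be sketched as two stages: an early stage that adds each of the 32 path counters to the partial sum, and a final stage that picks one of the 32 precomputed predictions once the intermediate path is known. Function names and the example values are hypothetical; how the 5 path bits are formed is not modeled here.

```python
from typing import List

N_PATHS = 32  # number of intermediate paths discriminated, as on the slide


def early_stage(partial_sum: int, path_counters: List[int]) -> List[bool]:
    """Compute all 32 candidate predictions from X-block ahead information.

    The returned list is what would be checkpointed.
    """
    if len(path_counters) != N_PATHS:
        raise ValueError("expected one counter per intermediate path")
    return [partial_sum + c >= 0 for c in path_counters]


def late_select(predictions: List[bool], path_bits: int) -> bool:
    """Last cycle: a 5-bit path identifier selects one precomputed prediction."""
    return predictions[path_bits & (N_PATHS - 1)]


# Example: a slightly negative partial sum, counters ranging from -16 to 15.
preds = early_stage(-3, list(range(-16, 16)))
```

All of the arithmetic happens before the path is known; only a 32-to-1 selection remains on the critical cycle.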

André Seznec Caps Team Irisa 26 Ahead pipelined RHSP (768 Kbits)

André Seznec Caps Team Irisa 27 Ahead pipelined RHSP. Very limited loss of accuracy for 6-block ahead: 5 1-bit ahead history are sufficient to discriminate among all the intermediate paths. Loss of accuracy increases with the length of prediction: we do not discriminate between all the paths; explosion of the number of paths originating from the same X-block ahead block: fewer and fewer predictions are performed by low order counters.

André Seznec Caps Team Irisa 28 Summary. Perceptron-based prediction improved. Prediction accuracy: use of redundant history, introduction of skewing, introduction of history compression. MAC representation: 16 6-bit adder tree against a multiply-accumulate tree. X-block ahead RHSP: on-time prediction without sacrificing accuracy or misprediction penalty; misprediction resolution at execution stage.

André Seznec Caps Team Irisa 29 Wide possible design space. To deal with his/her implementation constraints, the designer can play with: number of tables; width of histories; compressed/uncompressed ratio; threshold/width of counters (half threshold with 5-bit counters is not so bad); use of other MAC representations (8 counters for 3 bits, 16 counters for 5 bits, ...).

André Seznec Caps Team Irisa 30 Bonus An « objective » comparison of RHSP and 2bcgskew by their (common) inventor

André Seznec Caps Team Irisa 31 2bc-gskew : logical view e-gskew

André Seznec Caps Team Irisa 32 Optimized 2bcgskew. All optimizations in the EV8 predictor: different history lengths for all tables; different hysteresis and prediction table sizes. Plus a few other tricks: sharing predictors and hysteresis tables through banking; randomly enforcing the flipping of counters on mispredictions to avoid ping-pong phenomena. No « guru » design hash functions: just good functions. 2**(N+11) bits predictor with (N,N,4N,8N) history: (4,4,16,32) for 32Kbit; (9,9,36,72) for 1Mbit.
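The sizing rule on this slide is easy to check: for a parameter N, the budget is 2**(N+11) bits and the four history lengths are (N, N, 4N, 8N). The helper below is a hypothetical convenience that just evaluates the formula.

```python
def config_2bcgskew(n: int) -> dict:
    """Evaluate the slide's sizing rule for a given N."""
    return {
        "bits": 2 ** (n + 11),             # total predictor budget in bits
        "history": (n, n, 4 * n, 8 * n),   # history length per table
    }


small = config_2bcgskew(4)   # the 32Kbit configuration
large = config_2bcgskew(9)   # the 1Mbit configuration
```

N=4 gives 2**15 = 32768 bits (32Kbit) and N=9 gives 2**20 = 1048576 bits (1Mbit), matching the two configurations quoted on the slide.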

André Seznec Caps Team Irisa 33 2bcgskew vs RHSP (1). Efficiency of the prediction scheme: both can use very long history (extra local history prediction brings very poor benefit; not aware of any other predictor handling such long history); RHSP better tolerates/accommodates compressed history; RHSP captures some extra correlation. Efficiency of the storage usage (small size predictors, e.g. 32Kbits): 2bcgskew is more efficient on a few demanding benchmarks (go, gcc95); RHSP is surprisingly efficient on most benchmarks.

André Seznec Caps Team Irisa 34 2bcgskew vs RHSP (2) Accesses to the predictor:  Up to three accesses on RHSP on correct predictions But not so many, accesses on correct predictions  Single access to prediction, single access to hysteresis on correct predictions on 2bcgskew

André Seznec Caps Team Irisa 35 2bcgskew vs RHSP (3). Hardware logic cost: adder tree + counter update for RHSP; hashing functions + small logic for 2bcgskew. Latency: table read + adder tree for RHSP; table read + a few gates for 2bcgskew.

André Seznec Caps Team Irisa 36 That’s the end folks !

André Seznec Caps Team Irisa 38 RHSP and compressed history

André Seznec Caps Team Irisa 39 RHSP and compressed history (2)

André Seznec Caps Team Irisa 40 RHSP and compressed history (3)

André Seznec Caps Team Irisa 41 RHSP vs 2bcgskew storage effectiveness (1)

André Seznec Caps Team Irisa 42 RHSP vs 2bcgskew storage effectiveness (2)

André Seznec Caps Team Irisa 43 RHSP vs 2bcgskew storage effectiveness (3)