Multilayer Perceptron-based Branch Predictor


Multilayer Perceptron-based Branch Predictor Manav Garg, Prannoy Ghosh, Chenxiao Guan

Perceptron Used for binary classification with supervised learning. A linear classifier (it creates a decision plane). Supports online learning (updates using one input at a time). It computes a weighted sum of its inputs and applies a threshold activation function on the fly.
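To make these mechanics concrete, here is a minimal perceptron sketch in Python. The function names, the +1/-1 target encoding, and the learning rate are illustrative assumptions, not parameters from the slides.

```python
# Minimal perceptron sketch: weighted sum + threshold, with online updates.
# Names and the learning-rate choice are illustrative, not from the slides.

def predict(weights, bias, x):
    """Return +1 (taken) if the weighted sum crosses the threshold, else -1."""
    s = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if s >= 0 else -1

def train(weights, bias, x, target, lr=1):
    """Online update: nudge the weights only when the prediction is wrong."""
    if predict(weights, bias, x) != target:
        bias += lr * target
        weights = [w + lr * target * xi for w, xi in zip(weights, x)]
    return weights, bias
```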

Decision Boundary (AND) 0 AND 0 = false 0 AND 1 = false 1 AND 0 = false 1 AND 1 = true
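Since AND is linearly separable, a single perceptron can realize this truth table. The weights below are one illustrative choice (an assumption, not values given on the slide):

```python
# One choice of weights that realizes AND with a >= 0 threshold:
# output = 1*x1 + 1*x2 - 1.5, which is >= 0 only when x1 = x2 = 1.
for x1 in (0, 1):
    for x2 in (0, 1):
        y = 1 * x1 + 1 * x2 - 1.5
        print(x1, "AND", x2, "->", y >= 0)
```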

History Perceptron algorithms are limited to linear classification. Jiménez et al. proposed the perceptron branch predictor at HPCA 2001. Drawbacks: it can only classify linearly separable functions (linear inseparability), and it adds latency (training time).

Organization of the Perceptron Predictor Keeps a table of m perceptron weight vectors. The table is indexed by the branch address modulo m.

Piecewise Linear Branch Prediction - Daniel Jimenez Forms a piecewise linear decision surface, with each piece determined by the path leading to the predicted branch. Avoids aliasing in the history bits: branches cannot alias one another, so accuracy is much better. Can solve more nonlinear problems than the perceptron-based approach. Drawback: the idealized version requires an unlimited amount of storage.
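A rough sketch of the piecewise linear idea in Python, using an idealized (unbounded) weight table. The names W, GA, GHR, the history length, and the indexing scheme are assumptions made for illustration, not the slides' or the paper's exact formulation.

```python
from collections import defaultdict

H = 8  # history length (assumption)

# Idealized storage: one small-integer weight per (branch PC, path address, position).
W = defaultdict(int)

def predict(pc, GA, GHR):
    """GA[i]: address of the i-th most recent branch; GHR[i]: its outcome as +1/-1."""
    y = W[(pc, 0, 0)]                         # bias weight for this branch
    for i in range(H):
        y += W[(pc, GA[i], i + 1)] * GHR[i]   # piece selected by the path element
    return y                                   # predict taken if y >= 0
```

Training would adjust these same weights toward the branch outcome, perceptron-style, which is why each path through the program effectively gets its own linear piece.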

Strided Sampling and Future Work Avoids training the ANN on contiguous history bits; instead uses strided, variable-length samples of the global history for classification. The implementations so far do not explore concepts such as backpropagation for training on the fly.

Artificial Neural Network Neural networks are made up of many artificial neurons. Each input to a neuron has its own associated weight, and a network of such neurons can learn to classify nonlinear decision functions.

Backpropagation Network Training For better accuracy on nonlinear cases:
1. Initialize the network with random weights.
2. For all training cases:
   a. Present the training inputs to the network and calculate the output.
   b. For all layers (starting with the output layer, back to the input layer):
      i. Compare the network output with the correct output (error function).
      ii. Adapt the weights in the current layer (using the activation function).
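A compact NumPy sketch of this loop for a single training case. The layer sizes, learning rate, and squared-error derivative are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid = 8, 16
W1 = rng.normal(0, 0.1, (n_in, n_hid)); b1 = np.zeros(n_hid)   # step 1: random init
W2 = rng.normal(0, 0.1, (n_hid, 1));    b2 = np.zeros(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def train_step(x, t, lr=0.1):
    """One backprop step for input vector x and target t in [0, 1]."""
    global W1, b1, W2, b2
    h = sigmoid(x @ W1 + b1)               # step 2a: forward pass, hidden layer
    y = sigmoid(h @ W2 + b2)               # forward pass, output layer
    d_out = (y - t) * y * (1 - y)          # step 2b.i: error at the output layer
    d_hid = (d_out @ W2.T) * h * (1 - h)   # error propagated back to the hidden layer
    W2 -= lr * np.outer(h, d_out); b2 -= lr * d_out   # step 2b.ii: adapt weights
    W1 -= lr * np.outer(x, d_hid); b1 -= lr * d_hid
```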

Branch Prediction using MLP/ANN Branch prediction can be treated as a nonlinear classification problem where the input is the last n global history bits and the output is the desired prediction (taken or not taken). Previous work by Jiménez et al. has shown promising results for neural-network-based predictors.

Branch-Predicting Perceptron Inputs (x’s) are the n branch history bits, encoded as -1 or +1. n + 1 small integer weights (w’s) are learned by online training. The output (y) is the dot product of the x’s and w’s; predict taken if y ≥ 0. Additional training is inserted to reduce the error rate. Training finds correlations between history and outcome.
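Putting the table organization and this prediction/training rule together, a hedged Python sketch might look as follows. The table size, history length, and training threshold are assumptions, not the slides' parameters.

```python
# Perceptron branch predictor sketch: a table of weight vectors indexed by
# PC mod M, a dot product with the global history, and threshold training.
M, N = 1024, 32                              # table entries and history length (assumptions)
THETA = int(1.93 * N + 14)                   # commonly used training threshold (assumption)
table = [[0] * (N + 1) for _ in range(M)]    # N weights + 1 bias per entry
history = [1] * N                            # global history as +1 (taken) / -1 (not taken)

def predict(pc):
    w = table[pc % M]
    y = w[0] + sum(wi * hi for wi, hi in zip(w[1:], history))
    return y, y >= 0                         # predict taken if y >= 0

def update(pc, taken):
    y, pred = predict(pc)
    t = 1 if taken else -1
    w = table[pc % M]
    if pred != taken or abs(y) <= THETA:     # train on a mispredict or low confidence
        w[0] += t
        for i, hi in enumerate(history):
            w[i + 1] += t * hi
    history.pop(); history.insert(0, t)      # shift the new outcome into the history
```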

Work Done Initial attempt: implemented Jiménez's perceptron-based predictors on the CBP framework and traces. Most of these implementations performed poorly compared to the latest TAGE predictor. Most of our work on testing the accuracy of the MLP is organized as two limit studies.

MLP Limit Study - 1 Experimental Setup: For the traces (provided by CBP-2016), the following steps were performed:
1. The branch PC with the most mispredictions under the baseline predictor for the trace input was recorded.
2. For each mispredicted instance of this branch PC in the trace file, a feature vector consisting of the last n global history outcomes was constructed.
3. Finally, all the instances (gathered from every mispredicted instance of this branch PC) with their feature vectors were fed as input to the MLP (CMU MLP toolkit). The instances were divided into three sets before feeding the MLP: training, validation, and test. (A sketch of this feature extraction appears after the variables list below.)

4. Thus, for this limit study we train one neural network for one branch PC (the most mispredicted one).
5. The main motivation is that if we can somehow switch from the baseline predictor to the MLP when a particular mispredicted instance appears in the trace, the MLP may provide a better prediction than the underlying base predictor.
Variables for this experiment: the baseline predictor, and the number of global history bits used as the feature vector for the MLP.
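A sketch of this feature-vector construction, assuming a trace can be read as (PC, outcome, baseline prediction) tuples; the trace format, history length n, and helper names are hypothetical.

```python
from collections import Counter

def build_dataset(trace, n=50):
    """trace: iterable of (pc, outcome, baseline_prediction) tuples (assumed format)."""
    mispredictions = Counter(pc for pc, outcome, pred in trace if pred != outcome)
    worst_pc = mispredictions.most_common(1)[0][0]    # step 1: most-mispredicted PC

    features, labels, ghr = [], [], []
    for pc, outcome, pred in trace:
        if pc == worst_pc and pred != outcome and len(ghr) >= n:
            features.append(ghr[-n:])                 # step 2: last n global history outcomes
            labels.append(outcome)                    # desired prediction
        ghr.append(outcome)                           # maintain the global history
    return features, labels                           # step 3: split and feed to the MLP
```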

Study Parameters Baseline predictor: GSHARE, TAGE-SC-L (from CBP-2014 by André Seznec). Number of global history bits: 50, 100, 200. We selected 22 diverse traces across the 4 segments to ensure good representation across all input types. CMU MLP toolkit characteristics: 1 hidden layer (number of nodes double the number of inputs), learning rate = 0.1, epochs = 200, sigmoid activation function, batch size = {256 training, 100 test}.
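For illustration only, here is a roughly equivalent configuration expressed with scikit-learn's MLPClassifier rather than the CMU MLP toolkit; the exact correspondence of options to the toolkit's settings is an assumption.

```python
from sklearn.neural_network import MLPClassifier

n_features = 50  # global history bits used as inputs (50/100/200 in the study)

# One hidden layer with twice as many nodes as inputs, sigmoid (logistic)
# activation, learning rate 0.1, 200 epochs, batch size 256 for training.
mlp = MLPClassifier(hidden_layer_sizes=(2 * n_features,),
                    activation='logistic',
                    solver='sgd',
                    learning_rate_init=0.1,
                    batch_size=256,
                    max_iter=200)
# mlp.fit(train_features, train_labels); mlp.score(test_features, test_labels)
```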

GSHARE baseline, 50 global history bits

Trace     MPKI
SM_30
SM_40     0.007497375
SM_50     36.96345551
SM_56     0.005660911
SM_57     0.100647017
SM_60     39.33143083
LM_8
LM_11     3.081210522
LM_15     0.019083059
LM_18     47.31370777
SS_30     25.13777745
SS_40     12.50939805
SS_50     31.04213776
SS_60     20.52403353
SS_70     27.95441
SS_80     8.99766646
SS_90     22.75946056
SS_123    0.098610489
SS_138    31.94021934
LS_1      15.49561699
LS_2      12.91063229
LS_3      20.83004628
Mean      16.22830465

TAGE-SC-L baseline, 50 global history bits

Trace     MPKI
SM_30     32.64853978
SM_40     6.349206349
SM_50     11.53321644
SM_56
SM_57     3.550295858
SM_60     40.62718536
LM_8      46.15384615
LM_11     50.58049536
LM_15     45.58426533
LM_18     43.0863485
SS_30     39.64785112
SS_40     18.745584
SS_50     29.47965784
SS_60     35.18075989
SS_70     12.86428798
SS_80     6.348764435
SS_90     46.99170124
SS_123    9.639143731
SS_138    39.2225632
LS_1      13.22571744
LS_2      0.7162753681
LS_3      6.366611428
Mean      24.47919622

TAGE-SC-L baseline, 100 global history bits

Trace     MPKI
SM_30     25.49513259
SM_40     6.349206349
SM_50     11.82993901
SM_56
SM_57     3.550295858
SM_60     40.23157306
LM_8      46.15384615
LM_11     48.91640867
LM_15     45.21789433
LM_18     44.2299667
SS_30     34.89307953
SS_40     19.85433991
SS_50     28.25811405
SS_60     34.78152382
SS_70     11.06955037
SS_80     6.681737697
SS_90     48.80705394
SS_123    10.22629969
SS_138    41.07847435
LS_1      10.43598234
LS_2      0.7361719061
LS_3      N/A
Mean      24.70459954

TAGE-SC-L baseline, 200 global history bits

Trace     MPKI
SM_30     20.89627392
SM_40     6.349206349
SM_50     11.44779631
SM_56
SM_57     3.550295858
SM_60     38.89588204
LM_8      46.15384615
LM_11     54.37306502
LM_15     43.93559583
LM_18     43.49833518
SS_30     N/A
SS_40     19.16680254
SS_50
SS_60     36.65774144
SS_70     12.44219864
SS_80     7.050346894
SS_90     45.79875519
SS_123    11.55963303
SS_138    43.34356276
LS_1      12.91390728
LS_2      0.6167926781
LS_3
Mean      24.13947564

Analysis The MPKI for TAGE-SC-L is nearly half that of Gshare, so it is the better candidate for the baseline predictor. Also, since it has a low misprediction rate, the scope for opportunity is comparatively smaller for TAGE-SC-L than for Gshare (24% error rate for TAGE-SC-L compared to 16% for Gshare). The MLP error rate for TAGE-SC-L is only 25%. Thus, even for TAGE, an ideal MLP built on top of it can produce significant results. Moreover, if we don't restrict ourselves to training only a single PC (we could train more if the memory budget permits), we can expect more improvement in performance. Increasing the number of global history bits rarely improves the miss rate (and occasionally worsens it!).

MLP Limit Study - 2 Experimental Setup: For the traces (provided by CBP-2016), the following steps were performed:
1. A set of traces was assigned to the training set.
2. Another set was created for testing.
3. A TAGE predictor with n tables was extended to (n+1) tables.
4. During the training phase (for all the traces in the training set), for any entry promoted to the (n+1)th table, its m-bit unfolded global history was recorded.
5. These instances were fed to the offline MLP (CMU toolkit) to generate the weights for all the entries in the MLP. These weights were then hard-coded in the predictor file.

6. During the testing phase (for the traces in the testing set), the baseline TAGE predictor was again extended to (n+1) tables.
7. Whenever there was a hit in the last table, the MLP prediction (computed from the hard-coded weights) was now used instead of the TAGE prediction (sketched below).
8. The new accuracy for the testing traces was then recorded.
Intuition: There might be a function common across traces, and by learning it offline, the existing TAGE predictor might perform better.
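A hedged sketch of the testing-phase override in steps 6-7, with the hard-coded MLP weights and the extended TAGE outcome passed in as parameters; all names and the single-hidden-layer forward pass are assumptions for illustration.

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def final_prediction(tage_pred, hit_last_table, unfolded_history, W1, b1, W2, b2):
    """W1/b1/W2/b2 stand in for the offline-trained, hard-coded MLP weights."""
    if not hit_last_table:
        return tage_pred                      # otherwise keep the TAGE prediction
    h = sigmoid(unfolded_history @ W1 + b1)   # MLP forward pass over m unfolded history bits
    y = sigmoid(h @ W2 + b2)
    return bool(y >= 0.5)                     # MLP prediction replaces the TAGE prediction
```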

Variables for this experiment: Number of tables in the baseline TAGE predictor (we used 12 and 15 tables). Number of global history bits used as the feature vector for the MLP (we used 50 and 100 bits of history). Different mixes of traces (for training and testing).

{TAGE, 50 history bits, 15 tables}

Training Traces     Testing Traces     Baseline TAGE   TAGE + MLP
LONG_MOBILE-5       SHORT_MOBILE_6     10.1786         10.1643
LONG_MOBILE-16      SHORT_MOBILE_7     9.0727          9.0585
LONG_MOBILE-18      SHORT_MOBILE_8     1.6325          1.6316
LONG_MOBILE-19      SHORT_MOBILE_9     0.0186          0.0572
SHORT_MOBILE-2      SHORT_MOBILE_11    0.0086
SHORT_MOBILE-17     SHORT_MOBILE_13    0.0082
SHORT_MOBILE-20     SHORT_MOBILE_17    0.0008
SHORT_MOBILE-33     SHORT_MOBILE_28    0.0078
SHORT_MOBILE-39     SHORT_MOBILE_33
SHORT_MOBILE-45     SHORT_MOBILE_41    0.7866          0.8054
SHORT_MOBILE-46     SHORT_MOBILE_49    2.2532          2.253
SHORT_MOBILE-59     SHORT_MOBILE_53    1.6798          1.6956
SHORT_MOBILE-6      SHORT_MOBILE_58    0.0352          0.0363
SHORT_MOBILE-8      SHORT_SERVER_7     0.2859          1.4922
SHORT_SERVER-1      SHORT_SERVER_8     0.4476          0.4955
SHORT_SERVER-116    SHORT_SERVER_9     0.0286          0.0285
SHORT_SERVER-132    SHORT_SERVER_10    0.0376          0.0482
SHORT_SERVER-137    SHORT_SERVER_11    1.3022          1.2984
SHORT_SERVER-2      SHORT_SERVER_12    0.0209          0.0212
SHORT_SERVER-46     SHORT_SERVER_13    0.0421          0.0425
SHORT_SERVER-53     SHORT_SERVER_14    0.0419          0.0435
SHORT_SERVER-58     SHORT_SERVER_15    0.0429          0.0469
SHORT_SERVER-64     SHORT_SERVER_16    6.2502          6.2481
SHORT_SERVER-9      SHORT_SERVER_17    3.5357          3.5354
SHORT_SERVER-99     SHORT_SERVER_18    12.5613         12.4891
                    SHORT_SERVER_19    2.9554          2.9564
                    SHORT_SERVER_20    3.8471          3.8475
                    SHORT_SERVER_21    10.8197         10.8648
                    LONG_MOBILE_4      0.0043          1.3862
                    LONG_MOBILE_6      5.9526          6.0029
MEAN:                                  2.461953333     2.552486667

{TAGE, 50 history bits, 12 tables}

Training Traces     Testing Traces     Baseline TAGE   TAGE + MLP
LONG_MOBILE-5       SHORT_MOBILE_6     10.27           10.2632
LONG_MOBILE-16      SHORT_MOBILE_7     9.068           9.0844
LONG_MOBILE-18      SHORT_MOBILE_8     1.6355          1.6363
SHORT_MOBILE-2      SHORT_MOBILE_9     0.0355          0.3309
SHORT_MOBILE-17     SHORT_MOBILE_11    0.0085
SHORT_MOBILE-20     SHORT_MOBILE_13    0.008
SHORT_MOBILE-33     SHORT_MOBILE_17    0.0008
SHORT_MOBILE-39     SHORT_MOBILE_28    0.0078
SHORT_MOBILE-45     SHORT_MOBILE_33
SHORT_MOBILE-46     SHORT_MOBILE_41    0.0212          0.805
SHORT_MOBILE-8      SHORT_MOBILE_49    2.2559          2.2637
SHORT_MOBILE-6      SHORT_MOBILE_53    1.6891          2.0776
SHORT_SERVER-116    SHORT_MOBILE_58    0.036           0.0396
SHORT_SERVER-132    SHORT_SERVER_7     0.2957          2.8338
SHORT_SERVER-137    SHORT_SERVER_8     0.5031          0.8142
SHORT_SERVER-2      SHORT_SERVER_9     0.0285          0.064
SHORT_SERVER-46     SHORT_SERVER_10    0.038           0.0384
SHORT_SERVER-53     SHORT_SERVER_11    1.3208          1.3157
SHORT_SERVER-58     SHORT_SERVER_12    0.0215          0.0225
SHORT_SERVER-99     SHORT_SERVER_13    0.0332          0.0421
SHORT_SERVER-9      SHORT_SERVER_14    0.0437
                    SHORT_SERVER_15    0.0394          0.0467
                    SHORT_SERVER_16    6.2338          6.2359
                    SHORT_SERVER_17    3.5402          3.5413
                    SHORT_SERVER_18    12.5706         12.5674
                    SHORT_SERVER_19    2.951           2.9517
                    SHORT_SERVER_20    3.8527          3.8612
                    SHORT_SERVER_21    10.9337         10.9625
                    LONG_MOBILE_4      0.0043
                    LONG_MOBILE_6      5.8709          5.8774
MEAN:                                  2.443656667     2.59162

{TAGE, 100 history bits, 12 tables}

Training Traces     Testing Traces     Baseline TAGE   TAGE + MLP
LONG_MOBILE-5       SHORT_MOBILE_6     10.27           11.4913
LONG_MOBILE-16      SHORT_MOBILE_7     9.068           9.0865
LONG_MOBILE-18      SHORT_MOBILE_8     1.6355          1.6348
SHORT_MOBILE-17     SHORT_MOBILE_9     0.0355          0.3655
SHORT_MOBILE-33     SHORT_MOBILE_11    0.0085          10.3988
SHORT_MOBILE-39     SHORT_MOBILE_13    0.008           6.6472
SHORT_MOBILE-45     SHORT_MOBILE_17    0.0008
SHORT_MOBILE-43     SHORT_MOBILE_28    0.0078          1.1414
SHORT_MOBILE-8      SHORT_MOBILE_33
SHORT_SERVER-116    SHORT_MOBILE_41    0.0212          0.0214
SHORT_SERVER-132    SHORT_MOBILE_49    2.2559          11.0998
SHORT_SERVER-137    SHORT_MOBILE_53    1.6891          1.9527
SHORT_SERVER-2      SHORT_MOBILE_58    0.036           5.6249
SHORT_SERVER-46     SHORT_SERVER_7     0.2957          2.3553
SHORT_SERVER-53     SHORT_SERVER_8     0.5031          0.8373
SHORT_SERVER-99     SHORT_SERVER_9     0.0285          0.0641
SHORT_SERVER-22     SHORT_SERVER_10    0.038           0.1026
SHORT_SERVER-9      SHORT_SERVER_11    1.3208          1.3172
                    SHORT_SERVER_12    0.0215          0.1511
                    SHORT_SERVER_13    0.0332          0.039
                    SHORT_SERVER_14    0.044
                    SHORT_SERVER_15    0.0394          0.0459
                    SHORT_SERVER_16    6.2338          6.2566
                    SHORT_SERVER_17    3.5402          3.5449
                    SHORT_SERVER_18    12.5706         13.7814
                    SHORT_SERVER_19    2.951           2.9546
                    SHORT_SERVER_20    3.8527          3.8753
                    SHORT_SERVER_21    10.9337         10.9817
                    LONG_MOBILE_4      0.0043          1.8101
                    LONG_MOBILE_6      5.8709          5.8808
MEAN:                                  2.443656667     3.783566667

Analysis This approach of training the instances offline and learning a function (by hard-coding weights) doesn't seem to work. The accuracy goes down (and this comparison is against a TAGE predictor with one fewer table; against a TAGE predictor of the same size it would be even worse). It appears there is no correlation among the different traces: the MLP is no better than making a random prediction at the last table. Moreover, contrary to our expectations, decreasing the number of TAGE tables doesn't help. We thought a TAGE with fewer tables would provide more constructive instances to the MLP and would also present more scope for improvement. Another surprising result: increasing the number of unfolded global history bits makes the accuracy even worse. This supports our belief that the MLP just returns a random function (with respect to the training traces).

Next Steps... ONLINE LEARNING OF MLP: As we couldn't find a function that works across all the traces, the only option now is to train the MLP on top of TAGE dynamically. We are extending the TAGE predictor (as before), but each run is divided into two parts: training and testing. This might help us learn a function tailored to each trace. Still implementing this scheme...

Conclusion Offline MLP learning might not be a good idea. The best bet would be to train it on the fly, and the results of Limit Study 1 indicate that there might be some scope if we switch from TAGE to the MLP at the right time. HOWEVER... an MLP implementation in hardware is difficult to realize (multiple hidden units, floating-point computations, activation functions, running multiple epochs), and building this on top of TAGE is crazy. Even if it were possible, the training time would explode, and the improvement might not justify such a complex setup.

QUESTIONS?