Multilayer Perceptron-Based Branch Predictor
Manav Garg, Prannoy Ghosh, Chenxiao Guan
Perceptron
Used for binary classification via supervised learning. A linear classifier (it creates a decision plane). Supports online learning (weights are updated one input at a time). It computes a weighted sum of its inputs and applies a threshold activation function on the fly.
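To make the last two points concrete, here is a minimal perceptron sketch (our own illustrative Python, not code from the predictor): a weighted sum, a threshold activation, and the classic online update rule.

```python
# Minimal perceptron: weighted sum + threshold, with online updates.
def predict(w, x):
    # w[0] is the bias weight; inputs x_i are -1 or +1
    y = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return 1 if y >= 0 else -1

def train_online(w, x, target):
    # classic perceptron rule: update only when the prediction is wrong
    if predict(w, x) != target:
        w[0] += target
        for i, xi in enumerate(x):
            w[i + 1] += target * xi
```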
Decision Boundary (AND)
0 AND 0 = false
0 AND 1 = false
1 AND 0 = false
1 AND 1 = true
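One concrete decision plane that realizes AND (hand-picked weights, chosen for illustration; many others work) is y = -1.5 + x1 + x2, which is non-negative only when both inputs are 1:

```python
# AND via a single perceptron with hand-picked weights (bias = -1.5).
for x1 in (0, 1):
    for x2 in (0, 1):
        y = -1.5 + 1.0 * x1 + 1.0 * x2
        print(x1, x2, y >= 0)  # True only for (1, 1)
```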
History
Perceptron algorithms are limited to linear classification. Jimenez et al. proposed the perceptron branch predictor at HPCA 2001. Drawbacks: it can only classify linearly separable functions (linear inseparability), and training adds latency.
Organization of the Perceptron Predictor
Keeps a table of m perceptron weight vectors. The table is indexed by the branch address modulo m.
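A minimal sketch of this organization (the table size and history length here are assumed values, not from the slides):

```python
NUM_PERCEPTRONS = 1024   # m, assumed table size
HISTORY_LEN = 32         # n, assumed history length

# m weight vectors, each with n + 1 entries (index 0 is the bias weight)
table = [[0] * (HISTORY_LEN + 1) for _ in range(NUM_PERCEPTRONS)]

def select_perceptron(branch_pc):
    # the table is indexed by branch address modulo m
    return table[branch_pc % NUM_PERCEPTRONS]
```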
Piecewise Linear Branch Prediction - Daniel Jimenez
Forms a piecewise linear decision surface, with each piece determined by the path leading to the predicted branch. This avoids aliasing due to history bits and can solve more nonlinear problems than the perceptron-based approach. Since branches cannot alias one another, accuracy is much better. Drawback: an unbounded amount of storage is required.
Strided Sampling and Future Work
Avoids using contiguous history bits to train the ANN; instead, strided, variable-length slices of the global history are used for classification. The implementations so far do not explore concepts such as backpropagation for on-the-fly training (a sketch of strided sampling follows below).
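A hedged sketch of what strided sampling could look like; the stride and sample count are our assumptions, since the slides don't specify them:

```python
def strided_features(ghr, stride=4, count=16):
    # sample every `stride`-th global-history bit instead of a contiguous
    # window; ghr[-1] is the most recent outcome. Assumes len(ghr) is at
    # least stride * count.
    return [ghr[-1 - i * stride] for i in range(count)]
```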
Artificial Neural Network
Neural networks are made up of many artificial neurons, and each input to a neuron has its own associated weight. Unlike a single perceptron, such a network can learn to classify nonlinear decision functions.
Backpropagation Network Training
For higher accuracy on nonlinear cases:
1. Initialize the network with random weights.
2. For all training cases:
a. Present the training inputs to the network and calculate the output.
b. For all layers (starting with the output layer, back to the input layer):
i. Compare the network output with the correct output (error function).
ii. Adapt the weights in the current layer (using the derivative of the activation function).
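To make the loop concrete, here is a self-contained backpropagation sketch: a toy two-layer sigmoid network trained on XOR. It is illustrative only, not the predictor's code, and may need more epochs or a different seed to converge fully.

```python
import math
import random

random.seed(0)
sig = lambda z: 1.0 / (1.0 + math.exp(-z))

n_in, n_hid = 2, 4
# hidden weights: one row per hidden unit, index 0 is the bias
w1 = [[random.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(n_hid)]
w2 = [random.uniform(-1, 1) for _ in range(n_hid + 1)]  # output weights + bias

data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]  # XOR
lr = 0.5
for epoch in range(5000):
    for x, t in data:
        # forward pass
        h = [sig(w[0] + w[1] * x[0] + w[2] * x[1]) for w in w1]
        y = sig(w2[0] + sum(w2[j + 1] * h[j] for j in range(n_hid)))
        # backward pass: output delta, then hidden deltas (before updating w2)
        dy = (y - t) * y * (1 - y)
        dh = [dy * w2[j + 1] * h[j] * (1 - h[j]) for j in range(n_hid)]
        # weight updates, layer by layer
        w2[0] -= lr * dy
        for j in range(n_hid):
            w2[j + 1] -= lr * dy * h[j]
            w1[j][0] -= lr * dh[j]
            w1[j][1] -= lr * dh[j] * x[0]
            w1[j][2] -= lr * dh[j] * x[1]
```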
Branch Prediction using MLP/ANN
Branch prediction can be treated as a nonlinear classification problem where the input is n global history bits and the output is the desired prediction. Previous work by Jimenez et al. has shown promising results for MLP-based predictors.
Branch-Predicting Perceptron
Inputs (x's) come from n branch history bits and are -1 or +1. n + 1 small integer weights (w's) are learned by online training. The output (y) is the dot product of the x's and w's; predict taken if y ≥ 0. Additional training is applied to reduce the error rate. Training finds correlations between history and outcome.
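Putting these bullets together, a sketch of the predict/train cycle (the training threshold θ is our addition; Jimenez and Lin suggest roughly θ = 1.93n + 14):

```python
N = 32                       # history length, assumed
THETA = int(1.93 * N + 14)   # training threshold from the literature

def perceptron_predict(w, ghr):
    # ghr: last N outcomes, True = taken; mapped to +1 / -1 inputs
    x = [1 if b else -1 for b in ghr]
    y = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return y >= 0, y

def perceptron_train(w, ghr, taken, y):
    t = 1 if taken else -1
    # train on a misprediction, or when the confidence |y| is below theta
    if (y >= 0) != taken or abs(y) <= THETA:
        w[0] += t
        for i, b in enumerate(ghr):
            w[i + 1] += t * (1 if b else -1)
```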
Work Done
Initial attempt: we implemented Jimenez's perceptron-based predictors on the CBP framework and traces. Most of his implementations performed poorly compared to the latest TAGE predictor. Most of our work tests the accuracy of the MLP across two limit studies.
MLP Limit Study - 1
Experimental setup: for the traces (provided by CBP-2016), the following steps were performed:
1. The branch PC with the most mispredictions under the baseline predictor was recorded for each trace input.
2. For each mispredicted instance of this branch PC in the trace file, a feature vector consisting of the last n global history outcomes was constructed.
3. Finally, all the instances (gathered from every mispredicted occurrence of this branch PC) with their feature vectors were fed as input to the MLP (CMU MLP toolkit). The instances were divided into three sets before feeding the MLP: training, validation, and test.
4. Thus, for this limit study, we train one neural network for one branch PC (the most mispredicted one).
5. The main motivation: if we can switch from the baseline predictor to the MLP whenever a particular mispredicted instance appears in the trace, the MLP may provide a better prediction than the underlying base predictor (a sketch of the dataset construction follows below).
Variables for this experiment: the baseline predictor, and the number of global history bits used as the feature vector for the MLP.
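A hedged sketch of steps 1 through 3; names like `inst.mispredicted` and `inst.history` are illustrative, not the CBP framework's actual API:

```python
def build_dataset(trace, worst_pc, n):
    # one (features, label) pair per mispredicted instance of worst_pc
    dataset = []
    for inst in trace:                    # dynamic branches in program order
        if inst.pc == worst_pc and inst.mispredicted:
            features = inst.history[-n:]  # last n global-history outcomes
            dataset.append((features, inst.taken))
    return dataset
```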
Study Parameters
Baseline predictor: GSHARE, TAGE-SC-L (from CBP-2014, by Andre Seznec).
Number of global history bits: 50, 100, 200.
We selected 22 diverse traces across the 4 segments to ensure good representation across all input types.
CMU MLP Toolkit characteristics: 1 hidden layer (twice as many nodes as inputs), learning rate = 0.1, epochs = 200, sigmoid activation function, batch size = 256 for training and 100 for test.
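For reference, the same configuration expressed with scikit-learn's MLPClassifier as a stand-in (the CMU MLP toolkit's actual interface differs):

```python
from sklearn.neural_network import MLPClassifier

n = 50  # number of global-history bits in the feature vector
clf = MLPClassifier(
    hidden_layer_sizes=(2 * n,),   # 1 hidden layer, nodes = 2x inputs
    activation="logistic",         # sigmoid
    learning_rate_init=0.1,
    max_iter=200,                  # epochs
    batch_size=256,                # training batch size
)
```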
{GSHARE, 50}
Per-trace results table (MPKI) over the 22 selected traces: SM_30, SM_40, SM_50, SM_56, SM_57, SM_60, LM_8, LM_11, LM_15, LM_18, SS_30, SS_40, SS_50, SS_60, SS_70, SS_80, SS_90, SS_123, SS_138, LS_1, LS_2, LS_3. Individual values are not recoverable apart from SM_40: 0.0075. Mean: 16.2283.
{TAGE-SC-L, 50}
Per-trace results table (MPKI) over the same 22 traces (SM_30 through LS_3). Individual values are not recoverable apart from SM_30: 32.6485. Mean: 24.4792.
{TAGE-SC-L, 100}
Per-trace results table (MPKI) over the same 22 traces (SM_30 through LS_3), with one entry N/A. Individual values are not recoverable apart from SM_30: 25.4951. Mean: 24.7046.
{TAGE-SC-L, 200}
Per-trace results table (MPKI) over the same 22 traces (SM_30 through LS_3), with SS_30 marked N/A. Individual values are not recoverable apart from SM_30: 20.8963. Mean: 24.1395.
Analysis
The MPKI for TAGE-SC-L is nearly half that of Gshare, making it the better candidate for the baseline predictor. However, because it already has a low misprediction rate, the scope for improvement is smaller for TAGE-SC-L than for Gshare (an MLP error rate of about 24% for TAGE-SC-L compared to 16% for Gshare). Even so, an MLP error rate of only about 25% means that, even for TAGE, an ideal MLP built on top of it could produce significant results. Moreover, if we don't restrict ourselves to training only a single PC (we could train more if the memory budget permits), we can expect further improvement. Increasing the number of global history bits rarely improves the miss rate (and occasionally worsens it).
MLP Limit Study - 2
Experimental setup: for the traces (provided by CBP-2016), the following steps were performed:
1. A set of traces was assigned to the training set.
2. Another set was created for testing.
3. The TAGE predictor with n tables was extended to n + 1 tables.
4. During the training phase (over all traces in the training set), for any entry promoted to the (n+1)th table, its m-bit unfolded global history was recorded.
5. These instances were fed to the offline MLP (CMU toolkit) to generate the weights for all the entries in the MLP; the weights were then hardcoded in the predictor file.
6. During the testing phase (for the traces in the testing set), the baseline TAGE predictor was again extended to n + 1 tables.
7. Whenever there was a hit in the last table, the MLP prediction (computed from the hardcoded weights) was used instead of the TAGE prediction (see the sketch below).
8. The new accuracy on the testing traces was then recorded.
Intuition: there might be a function common across traces, and by learning it offline, the existing TAGE predictor might perform better.
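A minimal sketch of the testing-phase override in steps 6 and 7, assuming hypothetical `tage` and weight objects (the real predictor is C++ inside the CBP framework):

```python
import math

NUM_TABLES = 16   # n + 1 after extending a 15-table TAGE (assumed)
M = 50            # unfolded global-history bits fed to the MLP

# dummy zero weights stand in for the hardcoded offline-trained values
W_IN = [[0.0] * (M + 1) for _ in range(8)]
W_OUT = [0.0] * 9

def mlp_predict(w_in, w_out, hist):
    # single hidden layer of sigmoid units over the unfolded history
    sig = lambda z: 1.0 / (1.0 + math.exp(-z))
    h = [sig(w[0] + sum(wi * x for wi, x in zip(w[1:], hist))) for w in w_in]
    y = w_out[0] + sum(w * hj for w, hj in zip(w_out[1:], h))
    return y >= 0

def predict(pc, ghr, tage):
    pred, provider = tage.predict(pc)   # provider = table that supplied the prediction
    if provider == NUM_TABLES:          # hit in the added (n+1)th table
        return mlp_predict(W_IN, W_OUT, ghr[-M:])
    return pred
```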
Variables for this experiment:
Number of tables in the baseline TAGE predictor (we used 12 and 15 tables).
Number of global history bits used as the feature vector for the MLP (we used 50 and 100 bits).
Different mixes of traces for training and testing.
{TAGE, 50, 15}
Training traces: LONG_MOBILE-5, LONG_MOBILE-16, LONG_MOBILE-18, LONG_MOBILE-19, SHORT_MOBILE-2, SHORT_MOBILE-17, SHORT_MOBILE-20, SHORT_MOBILE-33, SHORT_MOBILE-39, SHORT_MOBILE-45, SHORT_MOBILE-46, SHORT_MOBILE-59, SHORT_MOBILE-6, SHORT_MOBILE-8, SHORT_SERVER-1, SHORT_SERVER-116, SHORT_SERVER-132, SHORT_SERVER-137, SHORT_SERVER-2, SHORT_SERVER-46, SHORT_SERVER-53, SHORT_SERVER-58, SHORT_SERVER-64, SHORT_SERVER-9, SHORT_SERVER-99.
Testing trace | Baseline TAGE | TAGE + MLP
SHORT_MOBILE_6 | n/a | n/a
SHORT_MOBILE_7 | 9.0727 | 9.0585
SHORT_MOBILE_8 | 1.6325 | 1.6316
SHORT_MOBILE_9 | 0.0186 | 0.0572
SHORT_MOBILE_11 | 0.0086 | n/a
SHORT_MOBILE_13 | 0.0082 | n/a
SHORT_MOBILE_17 | 0.0008 | n/a
SHORT_MOBILE_28 | 0.0078 | n/a
SHORT_MOBILE_33 | n/a | n/a
SHORT_MOBILE_41 | 0.7866 | 0.8054
SHORT_MOBILE_49 | 2.2532 | 2.253
SHORT_MOBILE_53 | 1.6798 | 1.6956
SHORT_MOBILE_58 | 0.0352 | 0.0363
SHORT_SERVER_7 | 0.2859 | 1.4922
SHORT_SERVER_8 | 0.4476 | 0.4955
SHORT_SERVER_9 | 0.0286 | 0.0285
SHORT_SERVER_10 | 0.0376 | 0.0482
SHORT_SERVER_11 | 1.3022 | 1.2984
SHORT_SERVER_12 | 0.0209 | 0.0212
SHORT_SERVER_13 | 0.0421 | 0.0425
SHORT_SERVER_14 | 0.0419 | 0.0435
SHORT_SERVER_15 | 0.0429 | 0.0469
SHORT_SERVER_16 | 6.2502 | 6.2481
SHORT_SERVER_17 | 3.5357 | 3.5354
SHORT_SERVER_18 | n/a | n/a
SHORT_SERVER_19 | 2.9554 | 2.9564
SHORT_SERVER_20 | 3.8471 | 3.8475
SHORT_SERVER_21 | n/a | n/a
LONG_MOBILE_4 | 0.0043 | 1.3862
LONG_MOBILE_6 | 5.9526 | 6.0029
MEAN | n/a | n/a
{TAGE, 50, 12}
Training traces: LONG_MOBILE-5, LONG_MOBILE-16, LONG_MOBILE-18, SHORT_MOBILE-2, SHORT_MOBILE-17, SHORT_MOBILE-20, SHORT_MOBILE-33, SHORT_MOBILE-39, SHORT_MOBILE-45, SHORT_MOBILE-46, SHORT_MOBILE-8, SHORT_MOBILE-6, SHORT_SERVER-116, SHORT_SERVER-132, SHORT_SERVER-137, SHORT_SERVER-2, SHORT_SERVER-46, SHORT_SERVER-53, SHORT_SERVER-58, SHORT_SERVER-99, SHORT_SERVER-9.
Testing trace | Baseline TAGE | TAGE + MLP
SHORT_MOBILE_6 | 10.27 | n/a
SHORT_MOBILE_7 | 9.068 | 9.0844
SHORT_MOBILE_8 | 1.6355 | 1.6363
SHORT_MOBILE_9 | 0.0355 | 0.3309
SHORT_MOBILE_11 | 0.0085 | n/a
SHORT_MOBILE_13 | 0.008 | n/a
SHORT_MOBILE_17 | 0.0008 | n/a
SHORT_MOBILE_28 | 0.0078 | n/a
SHORT_MOBILE_33 | n/a | n/a
SHORT_MOBILE_41 | 0.0212 | 0.805
SHORT_MOBILE_49 | 2.2559 | 2.2637
SHORT_MOBILE_53 | 1.6891 | 2.0776
SHORT_MOBILE_58 | 0.036 | 0.0396
SHORT_SERVER_7 | 0.2957 | 2.8338
SHORT_SERVER_8 | 0.5031 | 0.8142
SHORT_SERVER_9 | 0.0285 | 0.064
SHORT_SERVER_10 | 0.038 | 0.0384
SHORT_SERVER_11 | 1.3208 | 1.3157
SHORT_SERVER_12 | 0.0215 | 0.0225
SHORT_SERVER_13 | 0.0332 | 0.0421
SHORT_SERVER_14 | 0.0437 | n/a
SHORT_SERVER_15 | 0.0394 | 0.0467
SHORT_SERVER_16 | 6.2338 | 6.2359
SHORT_SERVER_17 | 3.5402 | 3.5413
SHORT_SERVER_18 | n/a | n/a
SHORT_SERVER_19 | 2.951 | 2.9517
SHORT_SERVER_20 | 3.8527 | 3.8612
SHORT_SERVER_21 | n/a | n/a
LONG_MOBILE_4 | 0.0043 | n/a
LONG_MOBILE_6 | 5.8709 | 5.8774
MEAN | n/a | n/a
{TAGE, 100, 12}
Training traces: LONG_MOBILE-5, LONG_MOBILE-16, LONG_MOBILE-18, SHORT_MOBILE-17, SHORT_MOBILE-33, SHORT_MOBILE-39, SHORT_MOBILE-45, SHORT_MOBILE-43, SHORT_MOBILE-8, SHORT_SERVER-116, SHORT_SERVER-132, SHORT_SERVER-137, SHORT_SERVER-2, SHORT_SERVER-46, SHORT_SERVER-53, SHORT_SERVER-99, SHORT_SERVER-22, SHORT_SERVER-9.
Testing trace | Baseline TAGE | TAGE + MLP
SHORT_MOBILE_6 | 10.27 | n/a
SHORT_MOBILE_7 | 9.068 | 9.0865
SHORT_MOBILE_8 | 1.6355 | 1.6348
SHORT_MOBILE_9 | 0.0355 | 0.3655
SHORT_MOBILE_11 | 0.0085 | n/a
SHORT_MOBILE_13 | 0.008 | 6.6472
SHORT_MOBILE_17 | 0.0008 | n/a
SHORT_MOBILE_28 | 0.0078 | 1.1414
SHORT_MOBILE_33 | n/a | n/a
SHORT_MOBILE_41 | 0.0212 | 0.0214
SHORT_MOBILE_49 | 2.2559 | n/a
SHORT_MOBILE_53 | 1.6891 | 1.9527
SHORT_MOBILE_58 | 0.036 | 5.6249
SHORT_SERVER_7 | 0.2957 | 2.3553
SHORT_SERVER_8 | 0.5031 | 0.8373
SHORT_SERVER_9 | 0.0285 | 0.0641
SHORT_SERVER_10 | 0.038 | 0.1026
SHORT_SERVER_11 | 1.3208 | 1.3172
SHORT_SERVER_12 | 0.0215 | 0.1511
SHORT_SERVER_13 | 0.0332 | 0.039
SHORT_SERVER_14 | 0.044 | n/a
SHORT_SERVER_15 | 0.0394 | 0.0459
SHORT_SERVER_16 | 6.2338 | 6.2566
SHORT_SERVER_17 | 3.5402 | 3.5449
SHORT_SERVER_18 | n/a | n/a
SHORT_SERVER_19 | 2.951 | 2.9546
SHORT_SERVER_20 | 3.8527 | 3.8753
SHORT_SERVER_21 | n/a | n/a
LONG_MOBILE_4 | 0.0043 | 1.8101
LONG_MOBILE_6 | 5.8709 | 5.8808
MEAN | 2.4437 | 3.7836
Analysis
This approach of training on instances offline and learning a function (by hardcoding weights) doesn't seem to work: accuracy drops. (This comparison is against a TAGE predictor with one fewer table; against a TAGE predictor of the same size it would be even worse.) There appears to be no correlation among different traces, so the MLP is as good as a random prediction at the last table. Moreover, contrary to our expectations, decreasing the number of TAGE tables doesn't help; we thought TAGE with fewer tables would provide more constructive instances to the MLP and also present more scope for improvement. Another surprising result: increasing the number of unfolded global history bits makes the accuracy even worse. This supports our belief that the MLP just learns a random function (with respect to the training traces).
Next Steps... ONLINE LEARNING OF MLP:
As we couldn't find a function that works across all the traces, the only option now is to train the MLP on top of TAGE dynamically. We are extending the TAGE predictor (as before), but each run is divided into two parts: training and testing. This might help us learn a function tailored to each trace (a sketch follows below). We are still implementing this scheme.
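A sketch of how such a phased run could look, continuing the NUM_TABLES and M names from the earlier sketch; all other names and the 50/50 split are our assumptions, since the scheme is still being implemented:

```python
TRAIN_FRACTION = 0.5   # assumed split between training and testing phases

def handle_branch(i, total, pc, ghr, taken, tage, mlp):
    # i: index of this dynamic branch; total: branches in the trace
    pred, provider = tage.predict(pc)
    if provider == NUM_TABLES:                  # hit in the extended table
        if i < TRAIN_FRACTION * total:
            mlp.train_online(ghr[-M:], taken)   # first part: train on the fly
        else:
            pred = mlp.predict(ghr[-M:])        # second part: use learned weights
    tage.update(pc, taken)
    return pred
```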
Conclusion
Offline MLP learning might not be a good idea. Our best bet is to train it on the fly, and the results of Limit Study 1 indicate there might be some scope if we switch from TAGE to the MLP at the right time.
HOWEVER... an MLP implementation is difficult to realize in hardware (multiple hidden units, floating-point computations, activation functions, running multiple epochs), and building this on top of TAGE is wildly impractical. Even if it were possible, the training time would explode, and the improvement might not justify such a complex setup.
QUESTIONS?