1
Branch Prediction Dimitris Karteris, Rafael Pasvantidis
2
2 Outline: What are branches? Techniques for handling branches. Branch prediction: Why do we need branch prediction? Branch prediction schemes (static/dynamic). “Real” branch predictors.
3
3 Branches: Instructions which can alter the flow of instruction execution in a program.
4
4 Types of Branches
5
5 Techniques for handling branches (pipeline: IF, ID, EX, MEM, WB). Stalling. Branch delay slots: rely on the programmer/compiler to fill them; depend on being able to find suitable instructions; tie the resolution delay to a particular pipeline. Predication: transforms the control dependence into a data dependence on the branch condition.
6
6 Why aren’t these techniques acceptable? Branches are frequent (15-25%). Today’s pipelines are deeper and wider, so the performance penalty for stalling is higher. Misprediction penalty = issue width × resolution delay cycles. A lot of cycles can be wasted!
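The penalty formula on this slide can be made concrete with a short calculation. This is only a sketch: the machine parameters below (4-wide issue, 10-cycle resolution delay, 20% branches, 10% misprediction rate) are illustrative assumptions, not figures from the presentation, and both function names are hypothetical.

```python
def misprediction_penalty(issue_width: int, resolution_delay_cycles: int) -> int:
    """Issue slots lost per mispredicted branch (formula from the slide)."""
    return issue_width * resolution_delay_cycles


def wasted_slots_per_1000(issue_width: int, resolution_delay_cycles: int,
                          branch_fraction: float, mispredict_rate: float) -> float:
    """Expected issue slots lost per 1000 instructions (hypothetical helper)."""
    mispredicted_branches = 1000 * branch_fraction * mispredict_rate
    return mispredicted_branches * misprediction_penalty(
        issue_width, resolution_delay_cycles)


# Assumed machine: 4-wide issue, branches resolve 10 cycles after fetch.
penalty = misprediction_penalty(4, 10)
# Assumed workload: 20% branches, 10% of them mispredicted.
wasted = wasted_slots_per_1000(4, 10, branch_fraction=0.20, mispredict_rate=0.10)
print(penalty, wasted)
```

Under these assumptions each misprediction costs 40 issue slots, and about 800 slots are lost per 1000 instructions, which is why deeper and wider pipelines make stalling unattractive.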
7
7 Branch Prediction: Predicting the outcome of a branch. Direction: Taken / Not Taken, handled by direction predictors. Target address: PC+offset (Taken) / PC+4 (Not Taken), handled by target address predictors such as the Branch Target Address Cache (BTAC) or Branch Target Buffer (BTB).
8
8 Why do we need branch prediction? Branch prediction increases the number of instructions available for the scheduler to issue, increases instruction-level parallelism (ILP), and allows useful work to be completed while waiting for the branch to resolve.
9
9 A simple example which demonstrates the benefits: if (x > 0) { a = 0; b = 1; c = 2; } d = 3;
10
10 Classification of branch prediction schemes (1) Static schemes: The decision is made before run time (i.e. at compile time). Predict Branch Taken / Not Taken: the all-branches-taken scheme has a 34% average misprediction rate. Backward Taken/Forward Not Taken (BTFNT): an advantage in loops, but doesn’t work well on programs with irregular branches. The Ball and Larus enhancement works a little better.
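The BTFNT rule can be sketched in a few lines. This is a minimal illustration, assuming the branch and target addresses are both known when the prediction is made; the function names and the addresses used are hypothetical.

```python
def predict_btfnt(branch_pc: int, target_pc: int) -> bool:
    """Static BTFNT rule: predict taken only when the target lies behind the branch."""
    return target_pc < branch_pc


def loop_accuracy(iterations: int) -> float:
    """Accuracy of BTFNT on the backward branch that closes a loop.

    The branch is taken on every iteration except the last, so the
    static "taken" prediction is wrong exactly once per loop execution.
    """
    outcomes = [True] * (iterations - 1) + [False]
    prediction = predict_btfnt(branch_pc=0x120, target_pc=0x100)
    correct = sum(1 for taken in outcomes if taken == prediction)
    return correct / iterations


print(predict_btfnt(0x120, 0x100))  # backward branch: predicted taken
print(predict_btfnt(0x120, 0x140))  # forward branch: predicted not taken
print(loop_accuracy(10))
```

For a 10-iteration loop the rule is right on 9 of 10 executions of the closing branch, which is the "advantage in loops" noted on the slide; a forward branch that happens to be taken most of the time gets no such benefit.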
11
11 Classification of branch prediction schemes (2) Profiling: Branch prediction based on profiles created by earlier runs. Key observation: the behavior of branches is bimodally distributed. A static prediction bit is preset in the opcode. Doesn’t work well when the data sets seen at run time differ from the profiled ones. Static schemes are useful for: scheduling when the branch delays are exposed by the architecture; assisting dynamic predictors; determining frequent code paths.
12
12 Classification of branch prediction schemes (3) Dynamic schemes: Prediction decisions may change during the execution of the program. Branch Target Buffer (Lee and Smith): 2-bit saturating up-down counters collect history information. Static Training scheme: uses statistics collected from a pre-run of the program together with a history pattern consisting of the last N run-time executions of the branch.
13
13 What happens when a branch is mispredicted? On a mispredict: No speculative state may commit. Squash the instructions in the pipeline. Stores in the pipeline must not be allowed to occur: stores which would not have happened cannot be allowed to commit. Exceptions need to be handled appropriately.
14
14 Simple branch predictor: Accessed early in the pipeline using the address (PC) of the branch instruction.
15
15 2-bit branch prediction
16
16 2-bit predictor state diagram
17
17 2-bit branch prediction: A prediction must miss twice before it is changed. It is a specialization of the n-bit saturating counter scheme. The branch prediction buffer can be implemented as: a special cache accessed with the instruction address during IF, or a pair of bits attached to each block in the instruction cache.
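The n-bit saturating counter and a prediction buffer built from it can be sketched as follows. This is an illustration under stated assumptions rather than any particular hardware design: the class names are hypothetical, instructions are assumed to be 4 bytes (hence the shift by 2), and the buffer is indexed by low-order PC bits without tags.

```python
class SaturatingCounter:
    """n-bit up-down counter; the upper half of its range predicts taken."""

    def __init__(self, bits: int = 2, initial: int = 0):
        self.max_value = (1 << bits) - 1
        self.threshold = 1 << (bits - 1)
        self.value = initial

    def predict(self) -> bool:
        return self.value >= self.threshold

    def update(self, taken: bool) -> None:
        # Count up on taken, down on not taken, saturating at both ends.
        if taken:
            self.value = min(self.value + 1, self.max_value)
        else:
            self.value = max(self.value - 1, 0)


class BranchPredictionBuffer:
    """Untagged table of counters indexed by low-order PC bits (assumed layout)."""

    def __init__(self, entries: int = 4096, bits: int = 2):
        self.entries = entries
        self.counters = [SaturatingCounter(bits) for _ in range(entries)]

    def _slot(self, pc: int) -> SaturatingCounter:
        return self.counters[(pc >> 2) % self.entries]

    def predict(self, pc: int) -> bool:
        return self._slot(pc).predict()

    def update(self, pc: int, taken: bool) -> None:
        self._slot(pc).update(taken)


# A strongly-taken 2-bit counter needs two misses before it flips.
counter = SaturatingCounter(bits=2, initial=3)
counter.update(False)
after_one_miss = counter.predict()   # still predicts taken
counter.update(False)
after_two_misses = counter.predict() # now predicts not taken
print(after_one_miss, after_two_misses)
```

Because the table has no tags, two branches whose addresses share the same low-order bits will share a counter; that is a deliberate simplification here, not a claim about how any specific processor resolves aliasing.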
18
18 N-bit predictor scheme
19
19 Spec98 prediction accuracy (4K entry buffer)
20
20 Spec98 prediction accuracy, infinite buffer
21
21 Correlating (Two-Level) branch predictors (1)
Consider the sequence (1):
    if (aa==2) aa=0;
    if (bb==2) bb=0;
    if (aa!=bb) {
Consider the sequence (2):
    if (d==0) d=1;
    if (d==1)
MIPS assembly for (2):
        BNEZ R1,L1 ;branch b1
        DADDIU R1,R0,#1 ;d=1
    L1: DADDIU R3,R1,#-1
        BNEZ R3,L2 ;branch b2
        …
    L2:
22
22 1-bit correlation branch predictor: In (2), if b1 is NOT taken then b2 is NOT taken too! Consider a predictor with 1 bit of correlation, to capture the dependence of one branch on another. It keeps 2 prediction bits per branch: one used if the last branch executed was Not Taken, and one used if the last branch executed was Taken.
Pred bits   Pred if last branch not taken   Pred if last branch taken
NT/NT       NT                              NT
NT/T        NT                              T
T/NT        T                               NT
T/T         T                               T
23
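23 Comparison
1-bit predictor (initialized to NT), d alternating between 2 and 0:
d=?   b1 pred   b1 act   new b1 pred   b2 pred   b2 act   new b2 pred
2     NT        T        T             NT        T        T
0     T         NT       NT            T         NT       NT
2     NT        T        T             NT        T        T
0     T         NT       NT            T         NT       NT
1-bit predictor with 1 bit of correlation (initialized to NT/NT):
d=?   b1 pred   b1 act   new b1 pred   b2 pred   b2 act   new b2 pred
2     NT/NT     T        T/NT          NT/NT     T        NT/T
0     T/NT      NT       T/NT          NT/T      NT       NT/T
2     T/NT      T        T/NT          NT/T      T        NT/T
0     T/NT      NT       T/NT          NT/T      NT       NT/T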
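The comparison can be replayed with a short simulation of branches b1 and b2 from the d==0 / d==1 sequence. This is a sketch under assumptions: all prediction bits start as Not Taken, the branch executed before the first b1 is treated as Not Taken, and the function name is hypothetical.

```python
def count_mispredictions(d_values, correlated: bool) -> int:
    """Replay b1 and b2 for each d and count wrong predictions.

    Each branch keeps two 1-bit predictions: slot 0 is used when the last
    branch executed was not taken, slot 1 when it was taken. The plain
    1-bit predictor always uses slot 0.
    """
    NT, T = False, True
    prediction_bits = {"b1": [NT, NT], "b2": [NT, NT]}
    last_taken = NT  # assumption: branch before the first b1 was not taken
    mispredictions = 0
    for d in d_values:
        b1_taken = d != 0            # BNEZ R1,L1
        if not b1_taken:
            d = 1                    # DADDIU R1,R0,#1
        b2_taken = (d - 1) != 0      # DADDIU R3,R1,#-1 ; BNEZ R3,L2
        for name, actual in (("b1", b1_taken), ("b2", b2_taken)):
            slot = int(last_taken) if correlated else 0
            if prediction_bits[name][slot] != actual:
                mispredictions += 1
            prediction_bits[name][slot] = actual  # 1-bit: remember last outcome
            last_taken = actual
    return mispredictions


sequence = [2, 0, 2, 0]
plain = count_mispredictions(sequence, correlated=False)
with_correlation = count_mispredictions(sequence, correlated=True)
print(plain, with_correlation)
```

With d alternating between 2 and 0, the plain 1-bit predictor mispredicts all eight branches, while the version with one bit of correlation mispredicts only the two branches of the first iteration.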
24
24 Correlating branch predictors: 2 bits of global history means we look at the T/NT behavior of the last two branches to determine the behavior of THIS branch. The buffer can be implemented as a one-dimensional array. An (m,n) predictor uses the behavior of the last m branches to choose from 2^m predictors, each being an n-bit predictor. It takes 2^m × n × (number of entries selected by the branch address) bits.
25
25 Correlating branch predictors Q: How can we capture the behavior of the last n branches and adjust the prediction for the current branch accordingly? A: We use an n-bit shift register and shift the outcome of each branch into this register as it becomes known. (Figure: shift register holding 110, with the last branch outcome shifted in.)
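The shift register described in the answer can be sketched as below. The class name is hypothetical, and shifting the newest outcome into the least significant bit is an assumption; the slide does not say which end of the register receives it.

```python
class GlobalHistoryRegister:
    """n-bit shift register recording recent branch outcomes (1 = taken)."""

    def __init__(self, bits: int):
        self.bits = bits
        self.mask = (1 << bits) - 1
        self.value = 0

    def shift_in(self, taken: bool) -> None:
        # Newest outcome enters at bit 0; the oldest falls off the top.
        self.value = ((self.value << 1) | int(taken)) & self.mask


history = GlobalHistoryRegister(bits=3)
for outcome in (True, True, False):   # taken, taken, not taken
    history.shift_in(outcome)
pattern_after_three = format(history.value, "03b")
history.shift_in(True)                # the oldest "taken" is discarded
pattern_after_four = format(history.value, "03b")
print(pattern_after_three, pattern_after_four)
```

After taken, taken, not taken the register holds 110; one more taken branch pushes out the oldest bit and leaves 101.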
26
26 Correlating branch predictors: Higher prediction rates than the simple 2-bit predictor scheme, with only a trivial amount of additional HW (an m-bit shift register). NOTE: the buffer is NOT a cache, so counters may correspond to different branches at some point in time. The buffer can be implemented as a linear memory array that is n bits wide. Indexing is done by concatenating the global history bits with bits from the branch address.
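An (m,n) predictor with concatenated indexing can be sketched as one linear array of n-bit counters. This is a minimal model, not a description of real hardware: the class name is hypothetical, instructions are assumed to be 4 bytes, and placing the history bits in the low part of the index is one arbitrary choice of concatenation order.

```python
class CorrelatingPredictor:
    """(m, n) predictor: m global history bits select among 2**m n-bit counters."""

    def __init__(self, m: int, n: int, address_bits: int):
        self.m, self.n, self.address_bits = m, n, address_bits
        self.history = 0
        self.counter_max = (1 << n) - 1
        self.threshold = 1 << (n - 1)
        self.counters = [0] * (1 << (m + address_bits))

    def _index(self, pc: int) -> int:
        # Concatenate branch-address bits (high) with global history (low).
        address = (pc >> 2) & ((1 << self.address_bits) - 1)
        return (address << self.m) | self.history

    def predict(self, pc: int) -> bool:
        return self.counters[self._index(pc)] >= self.threshold

    def update(self, pc: int, taken: bool) -> None:
        i = self._index(pc)
        if taken:
            self.counters[i] = min(self.counters[i] + 1, self.counter_max)
        else:
            self.counters[i] = max(self.counters[i] - 1, 0)
        # Shift the outcome into the m-bit global history register.
        self.history = ((self.history << 1) | int(taken)) & ((1 << self.m) - 1)

    def size_in_bits(self) -> int:
        return len(self.counters) * self.n


# A (2,2) predictor with 16 entries selected by the branch address.
predictor = CorrelatingPredictor(m=2, n=2, address_bits=4)
print(predictor.size_in_bits())

# One branch that strictly alternates taken / not taken.
pattern = [True, False] * 20
correct_after_warmup = 0
for step, taken in enumerate(pattern):
    if step >= 20 and predictor.predict(0x40) == taken:
        correct_after_warmup += 1
    predictor.update(0x40, taken)
print(correct_after_warmup)
```

The history bits let the predictor tell "the last outcome was taken" apart from "the last outcome was not taken", so once warmed up it follows an alternating branch that a single 2-bit counter could not track. No tag check is performed, which is why counters can end up shared between unrelated branches.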
27
27 Correlating branch predictors: How many bits are there in a (0,2) predictor that has 4K entries selected by the branch address? 2^0 × 2 × 4K = 8K bits. How many bits does the example predictor have? 2^2 × 2 × 16 = 128 bits.
28
28 Correlating predictor performance
29
29 Hashing branch prediction algorithms: gselect, gshare
30
30 Gshare correlating predictor
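The two hashing schemes differ only in how the table index is formed: gselect concatenates address bits with global history bits, while gshare XORs them. The sketch below assumes 4-byte instructions and uses hypothetical function names and example values.

```python
def gselect_index(pc: int, history: int, address_bits: int, history_bits: int) -> int:
    """gselect: concatenate low address bits with the global history bits."""
    address = (pc >> 2) & ((1 << address_bits) - 1)
    return (address << history_bits) | (history & ((1 << history_bits) - 1))


def gshare_index(pc: int, history: int, index_bits: int) -> int:
    """gshare: XOR address bits with the global history, then truncate."""
    return ((pc >> 2) ^ history) & ((1 << index_bits) - 1)


pc = 0xB4          # hypothetical branch address; pc >> 2 == 0b101101
history = 0b0110   # hypothetical global history

# Both indices address a 256-entry table.
select_index = gselect_index(pc, history, address_bits=4, history_bits=4)
share_index = gshare_index(pc, history, index_bits=8)
print(bin(select_index), bin(share_index))
```

For the same table size, gselect has to split the index between address and history bits, whereas gshare can use the full index width for both, at the cost of some address/history combinations colliding after the XOR.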
31
31 Hybrid predictors: The basic idea is to use a META predictor to select among multiple predictors. Example: Local predictors are better on some branches; global predictors are better at utilizing correlation. Use a predictor to select the better predictor.
32
32 Tournament predictors: n/m means: n is the result of the left predictor and m the result of the right predictor, where 0 = incorrect and 1 = correct. The selected predictor must be incorrect twice before we switch to the other one.
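The selection rule can be sketched as a 2-bit saturating selector driven by the n/m outcomes. This is an illustrative model: the class name and the state encoding (0 and 1 choose the left predictor, 2 and 3 the right) are assumptions, not taken from the slide's figure.

```python
class TournamentSelector:
    """2-bit selector: states 0-1 use the left predictor, states 2-3 the right."""

    def __init__(self, state: int = 0):
        self.state = state

    def use_right(self) -> bool:
        return self.state >= 2

    def update(self, left_correct: bool, right_correct: bool) -> None:
        # Only the 1/0 and 0/1 cases move the selector; 0/0 and 1/1 leave it alone.
        if left_correct and not right_correct:
            self.state = max(self.state - 1, 0)
        elif right_correct and not left_correct:
            self.state = min(self.state + 1, 3)


selector = TournamentSelector(state=0)   # strongly prefers the left predictor
selector.update(left_correct=False, right_correct=True)   # 0/1
after_first = selector.use_right()       # still left
selector.update(left_correct=False, right_correct=True)   # 0/1 again
after_second = selector.use_right()      # switched to right
selector.update(left_correct=True, right_correct=True)    # 1/1: no change
print(after_first, after_second, selector.state)
```

Starting from the strong state, the chosen predictor has to lose to the other one twice before the selection changes, which mirrors the hysteresis of the 2-bit direction counter.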
33
33 Fractions of predictions coming from the local predictor: The tournament predictor selects between a local 2-bit predictor and a 2-bit Gshare predictor. Each predictor has 1024 entries, each 2 bits, for a total of 64K bits.
34
34 Misprediction rates
35
35 Need Address at Same Time as Prediction. Branch Target Buffer (BTB): the address of the branch indexes the buffer to get the prediction AND the branch target address (if taken). Note: we must check for a branch match now, since we can’t use the wrong branch’s target address.
36
36 A Branch Target Buffer (figure labels: Predicted PC; Branch Prediction: Taken or not Taken)
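A branch target buffer lookup with the tag check mentioned on the previous slide can be sketched as follows. This is a simplified model under assumptions: a direct-mapped table, 4-byte instructions, entries stored only for branches predicted taken, and hypothetical class and function names.

```python
class BranchTargetBuffer:
    """Direct-mapped BTB; each slot holds (branch_pc, predicted_target) or None."""

    def __init__(self, entries: int = 16):
        self.entries = entries
        self.table = [None] * entries

    def _slot(self, pc: int) -> int:
        return (pc >> 2) % self.entries

    def insert(self, branch_pc: int, target_pc: int) -> None:
        self.table[self._slot(branch_pc)] = (branch_pc, target_pc)

    def lookup(self, pc: int):
        entry = self.table[self._slot(pc)]
        # The stored branch address acts as a tag: without this match check
        # another instruction could pick up the wrong target address.
        if entry is not None and entry[0] == pc:
            return entry[1]
        return None


def next_fetch_pc(btb: BranchTargetBuffer, pc: int) -> int:
    """Predicted PC: the stored target on a hit, otherwise fall through to PC+4."""
    target = btb.lookup(pc)
    return target if target is not None else pc + 4


btb = BranchTargetBuffer(entries=16)
btb.insert(0x100, 0x200)
hit_pc = next_fetch_pc(btb, 0x100)    # hit: fetch from the stored target
alias_pc = next_fetch_pc(btb, 0x140)  # same slot, different tag: PC+4
print(hex(hit_pc), hex(alias_pc))
```

The second lookup maps to the same slot as the first branch but fails the tag comparison, so fetch simply continues sequentially instead of jumping to another branch's target.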
37
37 Return Address Predictors: Return addresses can be predicted with the BTB, but accuracy can be low because a procedure may be called from multiple sites. Solution: a small buffer operating as a stack. If the stack is large enough, it will predict returns perfectly.
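The stack-based return predictor can be sketched as below. This is an illustrative model: the class name is hypothetical, and dropping the oldest entry on overflow is one assumed policy among several that a design could use.

```python
class ReturnAddressStack:
    """Small fixed-depth stack of predicted return addresses."""

    def __init__(self, depth: int = 8):
        self.depth = depth
        self.stack = []

    def on_call(self, return_address: int) -> None:
        if len(self.stack) == self.depth:
            self.stack.pop(0)  # overflow: the oldest return address is lost
        self.stack.append(return_address)

    def predict_return(self):
        # Pop the most recent call's return address; None if nothing is stored.
        return self.stack.pop() if self.stack else None


def predicted_returns(depth: int, call_sites):
    """Nest one call per site, then unwind and collect the predictions."""
    ras = ReturnAddressStack(depth)
    for site in call_sites:
        ras.on_call(site + 4)   # assumed 4-byte call instruction
    return [ras.predict_return() for _ in call_sites]


sites = [0x100, 0x200, 0x300]
deep_enough = predicted_returns(depth=8, call_sites=sites)
too_shallow = predicted_returns(depth=2, call_sites=sites)
print(deep_enough, too_shallow)
```

With a depth of 8, all three nested returns are predicted in the right order; with a depth of 2, the outermost return address has been overwritten and the last prediction is missing, which illustrates why the stack must be at least as deep as the call nesting to predict perfectly.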
38
38 “Real” Branch Predictors: Alpha 21264, Sun UltraSPARC-III, Intel Pentium III, AMD Athlon K7
39
39 Alpha 21264: 8-stage pipeline, mispredict penalty 7 cycles. Hybrid predictor (Fetch): 12-bit GAg (4K-entry PHT, 2-bit counters) and 10-bit PAg (1K-entry BHT, 1K-entry PHT, 3-bit counters).
40
40 Alpha 21264 branch prediction mechanism
41
41 Sun UltraSPARC-III: 14-stage pipeline, bpred accessed in instruction fetch stages 2-3. 16K-entry 2-bit counter Gshare predictor: a bimodal predictor which XORs PC bits with the global history register (except the 3 lower-order bits) to reduce aliasing. Miss queue: halves the mispredict penalty by providing instructions for immediate use.
42
42 Intel Pentium with MMX
43
43 Intel Pentium III: Dynamic branch prediction: 512-entry BTB predicts direction and target; 4-bit history used with the PC to derive direction. Static branch predictor for BTB misses. Return Address Stack (RAS), 4/8 entries. Branch penalties: Not taken: no penalty. Correctly predicted taken: 1 cycle. Mispredicted: at least 9 cycles, as many as 26, average 10-15 cycles.
44
44 AMD Athlon K7: 10-stage integer, 15-stage fp pipeline; predictor accessed in fetch. 2K-entry bimodal predictor, 2K-entry BTAC, 12-entry RAS. Branch penalties: Correctly predicted taken: 1 cycle. Mispredict penalty: at least 10 cycles.
45
Q/A’s