Presentation is loading. Please wait.

Presentation is loading. Please wait.

Copyright 2016 Csaba Andras MoritzECE668 Power Aware Branching.1 Few slides adapted from Patterson, et al © UCB and Morgan Kaufmann Csaba Andras Moritz.

Similar presentations


Presentation on theme: "Copyright 2016 Csaba Andras MoritzECE668 Power Aware Branching.1 Few slides adapted from Patterson, et al © UCB and Morgan Kaufmann Csaba Andras Moritz."— Presentation transcript:

1 Copyright 2016 Csaba Andras MoritzECE668 Power Aware Branching.1 Few slides adapted from Patterson, et al © UCB and Morgan Kaufmann Csaba Andras Moritz UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering Computer Architecture ECE 668 Design of Power-Aware Dynamic Branch Prediction and Control Flow

2 Copyright 2016 Csaba Andras MoritzECE668 Power Aware Branching.2 Few slides adapted from Patterson, et al © UCB and Morgan Kaufmann Re-evaluating Correlation  Several SPEC benchmarks have less than a dozen branches responsible for 90% of taken branches: program branch % static# = 90% compress14%23613 eqntott25%4945 gcc15%95312020 mpeg10%5598532 real gcc13%173613214  Real programs + OS more like gcc  Small benefits of correlation beyond benchmarks?  Mispredict because either:  Wrong guess for that branch  Got branch history of wrong branch when indexing the table  For SPEC92, 4096 about as good as infinite table  Misprediction mostly due to wrong prediction  Can we improve using global history?

3 Copyright 2016 Csaba Andras MoritzECE668 Power Aware Branching.3 Few slides adapted from Patterson, et al © UCB and Morgan Kaufmann Gselect and Gshare predictors  Keep a global register (GR) with outcome of k branches  Use that in conjunction with PC to index into a table containing 2-bit predictors  Gselect – concatenate  Gshare – XOR (better) Copyright 2007 CAM

4 Copyright 2016 Csaba Andras MoritzECE668 Power Aware Branching.4 Few slides adapted from Patterson, et al © UCB and Morgan Kaufmann Tournament Predictors  Motivation for correlating branch predictors: 2-bit local predictor failed on important branches; by adding global information, performance improved  Tournament predictors: use two predictors, 1 based on global information and 1 based on local information, and combine with a selector  Hopes to select right predictor for right branch (or right context of branch)

5 Copyright 2016 Csaba Andras MoritzECE668 Power Aware Branching.5 Few slides adapted from Patterson, et al © UCB and Morgan Kaufmann Tournament Predictor in Alpha 21264  4K 2-bit counters to choose from among a global predictor and a local predictor  Global predictor also has 4K entries and is indexed by the history of the last 12 branches; each entry in the global predictor is a standard 2-bit predictor  12-bit pattern: ith bit is 0 => ith prior branch not taken; ith bit is 1 => ith prior branch taken ; 00,10,11 00,11 10 Use 1 Use 2 Use 1 00,01,11 00,11 10 01 4K  2 bits 1 3 2 12... Global (or #1)

6 Copyright 2016 Csaba Andras MoritzECE668 Power Aware Branching.6 Few slides adapted from Patterson, et al © UCB and Morgan Kaufmann Tournament Predictor in Alpha 21264  Local predictor consists of a 2-level predictor:  Top level a local history table consisting of 1024 10-bit entries; each 10-bit entry corresponds to the most recent 10 branch outcomes for the entry. 10-bit history allows patterns 10 branches to be discovered and predicted  Next level Selected entry from the local history table is used to index a table of 1K entries consisting a 3-bit saturating counters, which provide the local prediction  Total size: 4K*2 + 4K*2 + 1K*10 + 1K*3 = 29K bits! (~180K transistors) 1K  10 bits 1K  3 bits (Local or #2)

7 Copyright 2016 Csaba Andras MoritzECE668 Power Aware Branching.7 Few slides adapted from Patterson, et al © UCB and Morgan Kaufmann % of predictions from local predictor in Tournament Prediction Scheme 98% 100% 94% 90% 55% 76% 72% 63% 37% 69% 0%20%40%60%80%100% nasa7 matrix300 tomcatv doduc spice fpppp gcc espresso eqntott li

8 Copyright 2016 Csaba Andras MoritzECE668 Power Aware Branching.8 Few slides adapted from Patterson, et al © UCB and Morgan Kaufmann 94% 96% 98% 97% 100% 70% 82% 77% 82% 84% 99% 88% 86% 88% 86% 95% 99% 0%20%40%60%80%100% gcc espresso li fpppp doduc tomcatv Profile-based 2-bit counter Tournament Accuracy of Branch Prediction  Profile: branch profile from last execution (static in that is encoded in instruction, but profile) fig 3.40

9 Copyright 2016 Csaba Andras MoritzECE668 Power Aware Branching.9 Few slides adapted from Patterson, et al © UCB and Morgan Kaufmann Accuracy v. Size (SPEC89) 0% 1% 2% 3% 4% 5% 6% 7% 8% 9% 10% 081624324048566472808896104112120128 Total predictor size (Kbits) Conditional branch misprediction rate Local - 2 bit counters Correlating - (2,2) scheme Tournament

10 Copyright 2016 Csaba Andras MoritzECE668 Power Aware Branching.10 Few slides adapted from Patterson, et al © UCB and Morgan Kaufmann Need Address at Same Time as Prediction  Branch Target Buffer (BTB): Address of branch used as index to get prediction AND branch address (if taken)  Note: must check for branch match now, since can’t use wrong branch address Branch PCPredicted PC =? PC of instruction FETCH Prediction state bits Yes: instruction is branch; use predicted PC as next PC (if predict Taken) No: branch not predicted; proceed normally (PC+4)

11 Copyright 2016 Csaba Andras MoritzECE668 Power Aware Branching.11 Few slides adapted from Patterson, et al © UCB and Morgan Kaufmann Branch Target “Cache”  Branch Target cache - Only predicted taken branches  “Cache” - Content Addressable Memory (CAM) or Associative Memory (see figure)  Use a big Branch History Table & a small Branch Target Cache Branch PCPredicted PC =? Prediction state bits (optional) Yes: predicted taken branch found No: not found PC

12 Copyright 2016 Csaba Andras MoritzECE668 Power Aware Branching.12 Few slides adapted from Patterson, et al © UCB and Morgan Kaufmann Steps with Branch target Buffer Branch_CPI_Penalty = [Buffer_hit_rate x P{Incorrect_prediction}] x Penalty_Cycles + [(1- Buffer_hit_rate) x P{Branch_taken}] x Penalty_Cycles =.0.91x0.1x2 + 0.09x0.6x2 =.29 Taken Branch? Entry found in branch- target buffer? Send out predicted PC Is instruction a taken branch? Send PC to memory and branch-target buffer Enter branch instruction address and next PC into branch-target buffer Mispredicted branch, kill fetched instruction; restart fetch at other target; delete entry from target buffer Normal instruction execution Branch correctly predicted; continue execution with no stalls No Yes No ID IF EX for the 5-stage MIPS

13 Copyright 2016 Csaba Andras MoritzECE668 Power Aware Branching.13 Few slides adapted from Patterson, et al © UCB and Morgan Kaufmann  Avoid branch prediction by turning branches into conditionally executed instructions: if (x) then A = B op C else NOP  If false, then neither store result nor cause interference  Expanded ISA of Alpha, MIPS, PowerPC, SPARC have conditional move; PA-RISC can annul any following instruction  Drawbacks to conditional instructions  Still takes a clock even if “annulled”  Stall if condition evaluated late: Complex conditions reduce effectiveness since condition becomes known late in pipeline x A = B op C Predicated Execution

14 Copyright 2016 Csaba Andras MoritzECE668 Power Aware Branching.14 Few slides adapted from Patterson, et al © UCB and Morgan Kaufmann Special Case: Return Addresses  Register Indirect branch - hard to predict address  SPEC89 85% such branches for procedure return  Since stack discipline for procedures, save return address in small buffer that acts like a stack: 8 to 16 entries has small miss rate

15 Copyright 2016 Csaba Andras MoritzECE668 Power Aware Branching.15 Few slides adapted from Patterson, et al © UCB and Morgan Kaufmann How to Reduce Power  Reduce load capacitances switched  Smaller BTAC  Smaller local, global predictors  Less associativity  How do you know which branches? »Use static information »Combine runtime and compile time information »Add hints to be used at runtime  Also, predict statically  Branch folding »Runtime »Compile-time Copyright 2007 CAM & BlueRISC

16 Copyright 2016 Csaba Andras MoritzECE668 Power Aware Branching.16 Few slides adapted from Patterson, et al © UCB and Morgan Kaufmann Predict early/ahead  Early: Prefetch buffer like in the BlueRISC CPU and recent ARM CPUs allows branches to be prefetched. This means we can predict a branch dynamically in time before entering pipeline.  Ahead: BlueRISC’s patented approach allows predicting several cycles ahead of the branch. So, one would not need to predict with every PC, one can wait to predict only branch instructions.  Issues: needs ISA support.  Cannot be done unless there is time (not possible after a flush or short basic blocks, requires prefetch buffer in both cases)

17 Copyright 2016 Csaba Andras MoritzECE668 Power Aware Branching.17 Few slides adapted from Patterson, et al © UCB and Morgan Kaufmann Predict Statically  Use fixed static prediction when possible  Incurs limited power at runtime  Requires ISA support if not all branches are statically predicted

18 Copyright 2016 Csaba Andras MoritzECE668 Power Aware Branching.18 Few slides adapted from Patterson, et al © UCB and Morgan Kaufmann Ignore  Sometimes the best is to not predict – this especially for branches that are hard to predict  Also, ignore branches that are hard to predict but that are NOT on critical path  Requires ISA support to best implement

19 Copyright 2016 Csaba Andras MoritzECE668 Power Aware Branching.19 Few slides adapted from Patterson, et al © UCB and Morgan Kaufmann Phantom Branches  ARM removes branches to save power in the prefetch buffer  BlueRISC CPUs can remove them from the binary  Control-flow change 70% of the time needs fewer bits than a typical branch instruction would entail  This is most branches are short span requiring few bits only and a condition  Can be implemented cleverly and combined with other compiler exposed techniques

20 Copyright 2016 Csaba Andras MoritzECE668 Power Aware Branching.20 Few slides adapted from Patterson, et al © UCB and Morgan Kaufmann Hybrid History Tables  Use simple predictor on simple branches and complex when needed  Do not store all history everywhere  E.g., 2-bit combined with Tournament/gshare  Small complex predictor preserved for hard & critical branches  Needs compiler support

21 Copyright 2016 Csaba Andras MoritzECE668 Power Aware Branching.21 Few slides adapted from Patterson, et al © UCB and Morgan Kaufmann Hierarchical BTAC  Micro BTAC (mBTAC) a small CAM based BTAC with lower load capacitance  Use mBTAC when possible  Otherwise a BTAC can be similar to an I$ access and if you do every cycle it adds up  But, for every branch, either in hardware or with compiler we need to tell which predictor to use.  A compiler driven approach needs ISA support or some way to encode this. One could use a memory- mapped IO and preload this information but adds control which can impact clock speed if overdone 

22 Copyright 2016 Csaba Andras MoritzECE668 Power Aware Branching.22 Few slides adapted from Patterson, et al © UCB and Morgan Kaufmann Pitfall: Sometimes dumber is better  Alpha 21264 uses tournament predictor (29 Kbits)  Earlier 21164 uses a simple 2-bit predictor with 2K entries (or a total of 4 Kbits)  SPEC95 benchmarks, 21264 outperforms  21264 avg. 11.5 mispredictions per 1000 instructions  21164 avg. 16.5 mispredictions per 1000 instructions  Reversed for transaction processing (TP) !  21264 avg. 17 mispredictions per 1000 instructions  21164 avg. 15 mispredictions per 1000 instructions  TP code much larger & 21164 hold 2X branch predictions based on local behavior (2K vs. 1K local predictor in the 21264)  What about power?  Large predictors give some increase in prediction rate but for a large power cost as discussed

23 Copyright 2016 Csaba Andras MoritzECE668 Power Aware Branching.23 Few slides adapted from Patterson, et al © UCB and Morgan Kaufmann Finally  Understand that not all branches are made equal – those that are not on the critical path are simply not critical to do well on  Some branches are critical (part of critical loops and in top basic blocks) those need the best you got (advanced small predictors)  Performance is affected by branching a lot  The higher the issue width the more critical to do well  Next we look at some results based on these techniques

24 Copyright 2016 Csaba Andras MoritzECE668 Power Aware Branching.24 Few slides adapted from Patterson, et al © UCB and Morgan Kaufmann Power Consumption BlueRISC’s Compiler-driven Power-Aware Branch Prediction Comparison with 512 entry BTAC bimodal (patent-issued) Copyright 2007 CAM & BlueRISC

25 Copyright 2016 Csaba Andras MoritzECE668 Power Aware Branching.25 Few slides adapted from Patterson, et al © UCB and Morgan Kaufmann


Download ppt "Copyright 2016 Csaba Andras MoritzECE668 Power Aware Branching.1 Few slides adapted from Patterson, et al © UCB and Morgan Kaufmann Csaba Andras Moritz."

Similar presentations


Ads by Google