CS252 Graduate Computer Architecture Lecture 13 Prediction (Branches, Return Addrs, Dependencies) John Kubiatowicz Electrical Engineering and Computer.

Slides:



Advertisements
Similar presentations
Dynamic Branch Prediction (Sec 4.3) Control dependences become a limiting factor in exploiting ILP So far, we’ve discussed only static branch prediction.
Advertisements

Pipelining and Control Hazards Oct
Dynamic Branch PredictionCS510 Computer ArchitecturesLecture Lecture 10 Dynamic Branch Prediction, Superscalar, VLIW, and Software Pipelining.
CPE 631: Branch Prediction Electrical and Computer Engineering University of Alabama in Huntsville Aleksandar Milenkovic,
Dynamic Branch Prediction
Pipeline Hazards Pipeline hazards These are situations that inhibit that the next instruction can be processed in the next stage of the pipeline. This.
3.13. Fallacies and Pitfalls Fallacy: Processors with lower CPIs will always be faster Fallacy: Processors with faster clock rates will always be faster.
Copyright 2001 UCB & Morgan Kaufmann ECE668.1 Adapted from Patterson, Katz and Culler © UCB Csaba Andras Moritz UNIVERSITY OF MASSACHUSETTS Dept. of Electrical.
CPE 731 Advanced Computer Architecture ILP: Part II – Branch Prediction Dr. Gheith Abandah Adapted from the slides of Prof. David Patterson, University.
CPE 631: Branch Prediction Electrical and Computer Engineering University of Alabama in Huntsville Aleksandar Milenkovic,
W04S1 COMP s1 Seminar 4: Branch Prediction Slides due to David A. Patterson, 2001.
CS252 Graduate Computer Architecture Lecture 9 Prediction (Con’t) (Dependencies, Load Values, Data Values) February 22 nd, 2010 John Kubiatowicz Electrical.
1 COMP 206: Computer Architecture and Implementation Montek Singh Wed., Oct. 8, 2003 Topic: Instruction-Level Parallelism (Dynamic Branch Prediction)
1 COMP 206: Computer Architecture and Implementation Montek Singh Mon., Oct. 7, 2002 Topic: Instruction-Level Parallelism (Dynamic Branch Prediction)
1 Lecture 7: Out-of-Order Processors Today: out-of-order pipeline, memory disambiguation, basic branch prediction (Sections 3.4, 3.5, 3.7)
EECC551 - Shaaban #1 lec # 5 Fall Reduction of Control Hazards (Branch) Stalls with Dynamic Branch Prediction So far we have dealt with.
EENG449b/Savvides Lec /17/04 February 17, 2004 Prof. Andreas Savvides Spring EENG 449bG/CPSC 439bG.
1  2004 Morgan Kaufmann Publishers Chapter Six. 2  2004 Morgan Kaufmann Publishers Pipelining The laundry analogy.
Goal: Reduce the Penalty of Control Hazards
Branch Target Buffers BPB: Tag + Prediction
Computer Architecture Instruction Level Parallelism Dr. Esam Al-Qaralleh.
1 Lecture 8: Branch Prediction, Dynamic ILP Topics: static speculation and branch prediction (Sections )
1 COMP 740: Computer Architecture and Implementation Montek Singh Thu, Feb 19, 2009 Topic: Instruction-Level Parallelism III (Dynamic Branch Prediction)
Dynamic Branch Prediction
EENG449b/Savvides Lec /25/05 March 24, 2005 Prof. Andreas Savvides Spring g449b EENG 449bG/CPSC 439bG.
CIS 429/529 Winter 2007 Branch Prediction.1 Branch Prediction, Multiple Issue.
1 Lecture 7: Branch prediction Topics: bimodal, global, local branch prediction (Sections )
ENGS 116 Lecture 91 Dynamic Branch Prediction and Speculation Vincent H. Berk October 10, 2005 Reading for today: Chapter 3.2 – 3.6 Reading for Wednesday:
CPSC614 Lec 5.1 Instruction Level Parallelism and Dynamic Execution #4: Based on lectures by Prof. David A. Patterson E. J. Kim.
1/1/ / faculty of Electrical Engineering eindhoven university of technology Speeding it up Part 2: Pipeline problems & tricks dr.ir. A.C. Verschueren Eindhoven.
Ch2. Instruction-Level Parallelism & Its Exploitation 2. Dynamic Scheduling ECE562/468 Advanced Computer Architecture Prof. Honggang Wang ECE Department.
1 Dynamic Branch Prediction. 2 Why do we want to predict branches? MIPS based pipeline – 1 instruction issued per cycle, branch hazard of 1 cycle. –Delayed.
CSCI 6461: Computer Architecture Branch Prediction Instructor: M. Lancaster Corresponding to Hennessey and Patterson Fifth Edition Section 3.3 and Part.
Branch.1 10/14 Branch Prediction Static, Dynamic Branch prediction techniques.
CS252 Graduate Computer Architecture Lecture 9 Prediction/Speculation (Branches, Return Addrs) February 15 th, 2011 John Kubiatowicz Electrical Engineering.
Yiorgos Makris Professor Department of Electrical Engineering University of Texas at Dallas EE (CE) 6304 Computer Architecture Lecture #12 (10/27/15) Course.
CPE 631 Session 17 Branch Prediction Electrical and Computer Engineering University of Alabama in Huntsville.
CS 6290 Branch Prediction. Control Dependencies Branches are very frequent –Approx. 20% of all instructions Can not wait until we know where it goes –Long.
CS252 Graduate Computer Architecture Lecture 14 Prediction (Con’t) (Dependencies, Load Values, Data Values) John Kubiatowicz Electrical Engineering and.
Copyright 2016 Csaba Andras MoritzECE668 Power Aware Branching.1 Few slides adapted from Patterson, et al © UCB and Morgan Kaufmann Csaba Andras Moritz.
1 Adapted from UC Berkeley CS252 S01 Lecture 17: Reducing Cache Miss Penalty and Reducing Cache Hit Time Hardware prefetching and stream buffer, software.
Yiorgos Makris Professor Department of Electrical Engineering University of Texas at Dallas EE (CE) 6304 Computer Architecture Lecture #13 (10/28/15) Course.
Dynamic Branch Prediction
Computer Organization CS224
CS203 – Advanced Computer Architecture
Dynamic Branch Prediction
UNIVERSITY OF MASSACHUSETTS Dept
CMSC 611: Advanced Computer Architecture
CS252 Graduate Computer Architecture Lecture 9 Prediction (Con’t) (Dependencies, Load Values, Data Values) John Kubiatowicz Electrical Engineering and.
Lecture 8: ILP and Speculation Contd. Chapter 2, Sections 2. 6, 2
CS252 Graduate Computer Architecture Lecture 8 Prediction/Speculation (Branches, Return Addrs) February 14th, 2011 John Kubiatowicz Electrical Engineering.
So far we have dealt with control hazards in instruction pipelines by:
CPE 631: Branch Prediction
Dynamic Branch Prediction
/ Computer Architecture and Design
Pipelining and control flow
Control unit extension for data hazards
So far we have dealt with control hazards in instruction pipelines by:
Lecture 10: Branch Prediction and Instruction Delivery
So far we have dealt with control hazards in instruction pipelines by:
So far we have dealt with control hazards in instruction pipelines by:
Adapted from the slides of Prof
Control unit extension for data hazards
So far we have dealt with control hazards in instruction pipelines by:
So far we have dealt with control hazards in instruction pipelines by:
So far we have dealt with control hazards in instruction pipelines by:
Control unit extension for data hazards
So far we have dealt with control hazards in instruction pipelines by:
So far we have dealt with control hazards in instruction pipelines by:
CPE 631 Lecture 12: Branch Prediction
Presentation transcript:

CS252 Graduate Computer Architecture Lecture 13 Prediction (Branches, Return Addrs, Dependencies) John Kubiatowicz Electrical Engineering and Computer Sciences University of California, Berkeley

3/7/2007cs252-S07, Lecture [0][1][VLR-1] Vector Arithmetic Instructions ADDV v3, v1, v2 v3 v2 v1 Scalar Registers r0 r15 Vector Registers v0 v15 [0][1][2][VLRMAX-1] VLR Vector Length Register v1 Vector Load and Store Instructions LV v1, r1, r2 Base, r1Stride, r2 Memory Vector Register Review: Vector Programming Model

3/7/2007cs252-S07, Lecture 13 3 (as reported in Microprocessor Report, Vol 13, No. 5) –Emotion Engine: 6.2 GFLOPS, 75 million polygons per second –Graphics Synthesizer: 2.4 Billion pixels per second –Claim: Toy Story realism brought to games! Reality: Sony Playstation 2000

3/7/2007cs252-S07, Lecture 13 4 Playstation 2000 Continued Sample Vector Unit –2-wide VLIW –Includes Microcode Memory –High-level instructions like matrix- multiply Emotion Engine: –Superscalar MIPS core –Vector Coprocessor Pipelines –RAMBUS DRAM interface

3/7/2007cs252-S07, Lecture 13 5 Prediction has become essential to getting good performance from scalar instruction streams. We will discuss predicting branches. However, architects are now predicting everything: data dependencies, actual data, and results of groups of instructions: –At what point does computation become a probabilistic operation + verification? –We are pretty close with control hazards already… Why does prediction work? –Underlying algorithm has regularities. –Data that is being operated on has regularities. –Instruction sequence has redundancies that are artifacts of way that humans/compilers think about problems. Prediction  Compressible information streams? Review Prediction: Branches, Dependencies, Data

3/7/2007cs252-S07, Lecture 13 6 Review: Independent “Fetch” unit Instruction Fetch with Branch Prediction Out-Of-Order Execution Unit Correctness Feedback On Branch Results Stream of Instructions To Execute Instruction fetch decoupled from execution Often issue logic (+ rename) included with Fetch

3/7/2007cs252-S07, Lecture 13 7 Branches must be resolved quickly In our loop-unrolling example, we relied on the fact that branches were under control of “fast” integer unit in order to get overlap! Loop:LDF00R1 MULTDF4F0F2 SDF40R1 SUBIR1R1#8 BNEZR1Loop What happens if branch depends on result of multd?? –We completely lose all of our advantages! –Need to be able to “predict” branch outcome. –If we were to predict that branch was taken, this would be right most of the time. Problem much worse for superscalar machines!

3/7/2007cs252-S07, Lecture 13 8 Avoid branch prediction by turning branches into conditionally executed instructions: if (x) then A = B op C else NOP –If false, then neither store result nor cause exception –Expanded ISA of Alpha, MIPS, PowerPC, SPARC have conditional move; PA-RISC can annul any following instr. –IA-64: 64 1-bit condition fields selected so conditional execution of any instruction –This transformation is called “if-conversion” Drawbacks to conditional instructions –Still takes a clock even if “annulled” –Stall if condition evaluated late –Complex conditions reduce effectiveness; condition becomes known late in pipeline x A = B op C Predicated Execution

3/7/2007cs252-S07, Lecture 13 9 Dynamic Branch Prediction Problem Incoming stream of addresses Fast outgoing stream of predictions Correction information returned from pipeline Branch Predictor Incoming Branches { Address } Prediction { Address, Value } Corrections { Address, Value } History Information

3/7/2007cs252-S07, Lecture Case for Branch Prediction when Issue N instructions per clock cycle 1.Branches will arrive up to n times faster in an n- issue processor 2.Amdahl’s Law => relative impact of the control stalls will be larger with the lower potential CPI in an n-issue processor conversely, need branch prediction to ‘see’ potential parallelism

3/7/2007cs252-S07, Lecture Dynamic Branch Prediction Prediction could be “Static” (at compile time) or “Dynamic” (at runtime) –For our example, if we were to statically predict “taken”, we would only be wrong once each pass through loop –Static information passed through bits in opcode Is dynamic branch prediction better than static branch prediction? –Seems to be. Still some debate to this effect –Today, lots of hardware being devoted to dynamic branch predictors. Does branch prediction make sense for 5-stage, in- order pipeline? What about 8-stage pipeline? –Perhaps: eliminate branch delay slots –Then predict branches

3/7/2007cs252-S07, Lecture Review: Branch Target Buffer (BTB) Need Address at Same Time as Prediction Branch Target Buffer (BTB): Address of branch index to get prediction AND branch address (if taken) –Note: must check for branch match now, since can’t use wrong branch address Can predict branch while in Fetch stage!!!! –Before we know is it a branch! (Caches previous decoding of instruction) Branch PCPredicted PC =? PC of instruction FETCH Extra prediction state bits Yes: instruction is branch and use predicted PC as next PC No: branch not predicted, proceed normally (Next PC = PC+4)

3/7/2007cs252-S07, Lecture Review: Branch History Table BHT is a table of “Predictors” –Usually 2-bit, saturating counters –Indexed by PC address of Branch – without tags Combine Branch Target Buffer and History Tables –Branch Target Buffer (BTB): identify branches and hold taken addresses »Trick: identify branch before fetching instruction! »Must be careful not to misidentify branches or destinations –Branch History Table makes prediction »Can be complex prediction mechanisms with long history »No address check: Can be good, can be bad (aliasing) T NT T T Predictor 0 Predictor 7 Predictor 1 Branch PC

3/7/2007cs252-S07, Lecture Correlating Branches Hypothesis: recent branches are correlated; that is, behavior of recently executed branches affects prediction of current branch Two possibilities; Current branch depends on: –Last m most recently executed branches anywhere in program Produces a “GA” (for “global adaptive”) in the Yeh and Patt classification (e.g. GAg) –Last m most recent outcomes of same branch. Produces a “PA” (for “per-address adaptive”) in same classification (e.g. PAg) Idea: record m most recently executed branches as taken or not taken, and use that pattern to select the proper branch history table entry –A single history table shared by all branches (appends a “g” at end), indexed by history value. –Address is used along with history to select table entry (appends a “p” at end of classification) –If only portion of address used, often appends an “s” to indicate “set- indexed” tables (I.e. GAs)

3/7/2007cs252-S07, Lecture Correlating Branches (2,2) GAs predictor –First 2 means that we keep two bits of history –Second means that we have 2 bit counters in each slot. –Then behavior of recent branches selects between, say, four predictions of next branch, updating just that prediction –Note that the original two-bit counter solution would be a (0,2) GAs predictor –Note also that aliasing is possible here... Branch address 2-bits per branch predictors Prediction 2-bit global branch history register For instance, consider global history, set-indexed BHT. That gives us a GAs history table. Each slot is 2-bit counter

3/7/2007cs252-S07, Lecture What are Important Metrics? Clearly, Hit Rate matters –Even 1% can be important when above 90% hit rate Speed: Does this affect cycle time? Space: Clearly Total Space matters! –Papers which do not try to normalize across different options are playing fast and lose with data –Try to get best performance for the cost

3/7/2007cs252-S07, Lecture Accuracy of Different Schemes 4096 Entries 2-bit BHT Unlimited Entries 2-bit BHT 1024 Entries (2,2) BHT 0% 18% Frequency of Mispredictions

3/7/2007cs252-S07, Lecture BHT Accuracy Mispredict because either: –Wrong guess for that branch –Got branch history of wrong branch when index the table 4096 entry table programs vary from 1% misprediction (nasa7, tomcatv) to 18% (eqntott), with spice at 9% and gcc at 12% –For SPEC92, 4096 about as good as infinite table How could HW predict “this loop will execute 3 times” using a simple mechanism? –Need to track history of just that branch –For given pattern, track most likely following branch direction Leads to two separate types of recent history tracking: –GBHR (Global Branch History Register) –PABHR (Per Address Branch History Table) Two separate types of Pattern tracking –GPHT (Global Pattern History Table) –PAPHT (Per Address Pattern History Table)

3/7/2007cs252-S07, Lecture Yeh and Patt classification GBHR GPHT GAg GPHT PABHR PAg PAPHT PABHR PAp GAg: Global History Register, Global History Table PAg: Per-Address History Register, Global History Table PAp: Per-Address History Register, Per-Address History Table

3/7/2007cs252-S07, Lecture Two-Level Adaptive Schemes: History Registers of Same Length (6 bits) PAp best: But uses a lot more state! GAg not effective with 6-bit history registers –Every branch updates the same history register  interference PAg performs better because it has a branch history table

3/7/2007cs252-S07, Lecture Versions with Roughly same accuracy (97%) Cost: –GAg requires 18-bit history register –PAg requires 12-bit history register –PAp requires 6-bit history register PAg is the cheapest among these

3/7/2007cs252-S07, Lecture Why doesn’t GAg do better? Difference between GAg and both PA variants: –GAg tracks correllations between different branches –PAg/PAp track corellations between different instances of the same branch These are two different types of pattern tracking –Among other things, GAg good for branches in straight-line code, while PA variants good for loops Problem with GAg? It aliases results from different branches into same table –Issue is that different branches may take same global pattern and resolve it differently –GAg doesn’t leave flexibility to do this

3/7/2007cs252-S07, Lecture Other Global Variants: Try to Avoid Aliasing GAs: Global History Register, Per-Address (Set Associative) History Table Gshare: Global History Register, Global History Table with Simple attempt at anti-aliasing GAs GBHR PAPHT GShare GPHT GBHR Address 

3/7/2007cs252-S07, Lecture Is Global or Local better? Neither: Some branches local, some global –From: “An Analysis of Correlation and Predictability: What Makes Two-Level Branch Predictors Work,” Evers, Patel, Chappell, Patt –Difference in predictability quite significant for some branches!

3/7/2007cs252-S07, Lecture Dynamically finding structure in Spaghetti ? Consider complex “spaghetti code” Are all branches likely to need the same type of branch prediction? –No. What to do about it? –How about predicting which predictor will be best? –Called a “Tournament predictor”

3/7/2007cs252-S07, Lecture Tournament Predictors Motivation for correlating branch predictors is 2- bit predictor failed on important branches; by adding global information, performance improved Tournament predictors: use 2 predictors, 1 based on global information and 1 based on local information, and combine with a selector Use the predictor that tends to guess correctly addr history Predictor A Predictor B

3/7/2007cs252-S07, Lecture Tournament Predictor in Alpha K 2-bit counters to choose from among a global predictor and a local predictor Global predictor also has 4K entries and is indexed by the history of the last 12 branches; each entry in the global predictor is a standard 2-bit predictor –12-bit pattern: ith bit 0 => ith prior branch not taken; ith bit 1 => ith prior branch taken; Local predictor consists of a 2-level predictor: –Top level a local history table consisting of bit entries; each 10-bit entry corresponds to the most recent 10 branch outcomes for the entry. 10-bit history allows patterns 10 branches to be discovered and predicted. –Next level Selected entry from the local history table is used to index a table of 1K entries consisting a 3-bit saturating counters, which provide the local prediction Total size: 4K*2 + 4K*2 + 1K*10 + 1K*3 = 29K bits! (~180,000 transistors)

3/7/2007cs252-S07, Lecture % of predictions from local predictor in Tournament Scheme

3/7/2007cs252-S07, Lecture Accuracy of Branch Prediction Profile: branch profile from last execution (static in that in encoded in instruction, but profile) fig 3.40

3/7/2007cs252-S07, Lecture Accuracy v. Size (SPEC89)

3/7/2007cs252-S07, Lecture Pitfall: Sometimes bigger and dumber is better uses tournament predictor (29 Kbits) Earlier uses a simple 2-bit predictor with 2K entries (or a total of 4 Kbits) SPEC95 benchmarks, outperforms –21264 avg mispredictions per 1000 instructions –21164 avg mispredictions per 1000 instructions Reversed for transaction processing (TP) ! –21264 avg. 17 mispredictions per 1000 instructions –21164 avg. 15 mispredictions per 1000 instructions TP code much larger & hold 2X branch predictions based on local behavior (2K vs. 1K local predictor in the 21264)

3/7/2007cs252-S07, Lecture Special Case Return Addresses Register Indirect branch hard to predict address –SPEC89 85% such branches for procedure return –Since stack discipline for procedures, save return address in small buffer that acts like a stack: 8 to 16 entries has small miss rate BTB PC Predicted Next PC Fetch Unit Destination From Call Instruction [ On Fetch?] Select for Indirect Jumps [ On Fetch ] Return Address Stack Mux

3/7/2007cs252-S07, Lecture Performance: Return Address Predictor Cache most recent return addresses: –Call  Push a return address on stack –Return  Pop an address off stack & predict as new PC

3/7/2007cs252-S07, Lecture Review: Memory Disambiguation Question: Given a load that follows a store in program order, are the two related? –Trying to detect RAW hazards through memory –Stores commit in order (ROB), so no WAR/WAW memory hazards. Implementation –Keep queue of stores, in program order –Watch for position of new loads relative to existing stores –Typically, this is a different buffer than ROB! »Could be ROB (has right properties), but too expensive When have address for load, check store queue: –If any store prior to load is waiting for its address  ????? –If load address matches earlier store address (associative lookup), then we have a memory-induced RAW hazard: »store value available  return value »store value not available  return ROB number of source –Otherwise, send out request to memory Will relax exact dependency checking in later lecture

3/7/2007cs252-S07, Lecture Memory Dependence Prediction Important to speculate? Two Extremes: –Naïve Speculation: always let load go forward –No Speculation: always wait for dependencies to be resolved Compare Naïve Speculation to No Speculation –False Dependency: wait when don’t have to –Order Violation: result of speculating incorrectly Goal of prediction: –Avoid false dependencies and order violations From “Memory Dependence Prediction using Store Sets”, Chrysos and Emer.

3/7/2007cs252-S07, Lecture Said another way: Could we do better? Results from same paper: performance improvement with oracle predictor –We can get significantly better performance if we find a good predictor –Question: How to build a good predictor?

3/7/2007cs252-S07, Lecture Premise: Past indicates Future Basic Premise is that past dependencies indicate future dependencies –Not always true! Hopefully true most of time Store Set: Set of store insts that affect given load –Example: AddrInst 0Store C 4Store A 8Store B 12Store C 28Load B  Store set { PC 8 } 32Load D  Store set { (null) } 36Load C  Store set { PC 0, PC 12 } 40Load B  Store set { PC 8 } –Idea: Store set for load starts empty. If ever load go forward and this causes a violation, add offending store to load’s store set Approach: For each indeterminate load: –If Store from Store set is in pipeline, stall Else let go forward Does this work?

3/7/2007cs252-S07, Lecture How well does an infinite tracking work? “Infinite” here means to place no limits on: –Number of store sets –Number of stores in given set Seems to do pretty well –Note: “Not Predicted” means load had empty store set –Only Applu and Xlisp seems to have false dependencies

3/7/2007cs252-S07, Lecture How to track Store Sets in reality? SSIT: Assigns Loads and Stores to Store Set ID (SSID) –Notice that this requires each store to be in only one store set! LFST: Maps SSIDs to most recent fetched store –When Load is fetched, allows it to find most recent store in its store set that is executing (if any)  allows stalling until store finished –When Store is fetched, allows it to wait for previous store in store set »Pretty much same type of ordering as enforced by ROB anyway »Transitivity  loads end up waiting for all active stores in store set What if store needs to be in two store sets? –Allow store sets to be merged together deterministically »Two loads, multiple stores get same SSID Want periodic clearing of SSIT to avoid: –problems with aliasing across program –Out of control merging

3/7/2007cs252-S07, Lecture How well does this do? Comparison against Store Barrier Cache –Marks individual Stores as “tending to cause memory violations” –Not specific to particular loads…. Problem with APPLU? –Analyzed in paper: has complex 3-level inner loop in which loads occasionally depend on stores –Forces overly conservative stalls (i.e. false dependencies)

3/7/2007cs252-S07, Lecture Conclusion Prediction works because…. –Programs have patterns –Just have to figure out what they are –Basic Assumption: Future can be predicted from past! Correlation: Recently executed branches correlated with next branch. –Either different branches (GA) –Or different executions of same branches (PA). Two-Level Branch Prediction –Uses complex history (either global or local) to predict next branch –Two tables: a history table and a pattern table –Global Predictors: GAg, GAs, GShare –Local Predictors: PAg, Pap Best type of predictor depends on branch –Tournament style predictor: choose between multiple predictors –Trys to find best predictor for each branch Dependence Prediction: Try to predict whether load depends on stores before addresses are known –Store set: Set of stores that have had dependencies with load in the past