COMP 740: Computer Architecture and Implementation


COMP 740: Computer Architecture and Implementation Montek Singh Oct 10, 2016 Topic: Instruction-Level Parallelism - I (Dynamic Branch Prediction)

Instruction-Level Parallelism Exploit parallelism that can be “squeezed out” of programs written sequentially: overlap the execution of instructions to improve performance! Requires sophisticated hardware and software. Hardware: helps discover and exploit parallelism dynamically (at runtime); dominates the desktop/server markets (e.g., Intel Core series) but is creeping into mobile devices as well. Software: finds parallelism statically at compile time; used in scientific computing and also in personal mobile devices (where the hardware must be simpler and energy-efficient). Pipelining became universal around 1985; the goal now is to go beyond basic pipelining.

Exploiting ILP There are several techniques; this slide summarizes them.

Branch Prediction: the first technique we will study for exploiting ILP.

Why Do We Need Branch Prediction? Parallelism within a basic block is limited: basic blocks are short (3-6 instructions), so control dependences can become the bottleneck, and we must optimize across branches. Since branches disrupt the sequential flow of instructions, we need to predict branch behavior to avoid stalling the pipeline. What must we predict? Two things: the branch outcome (is the branch taken?) and the branch target address (what is the next PC value?).

A General Model of Branch Prediction Branch predictor accuracy and branch penalties: T: probability of a branch being taken; p: fraction of branches that are predicted to be taken; A: accuracy of prediction; j, k, m, n: associated delays (penalties) for the four prediction/outcome events (n is usually 0). The branch penalty of a particular prediction method is a weighted sum of these delays, with weights given by the probabilities of the four events.
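The penalty equation itself did not survive transcription. One plausible reconstruction, assuming j is the delay for a correctly predicted taken branch, k for a taken branch predicted not taken, m for a not-taken branch predicted taken, and n for a correctly predicted not-taken branch (a sketch, not the slide's exact formula):

```latex
% Hypothetical reconstruction of the expected branch penalty per branch:
\[
  \text{Penalty} \;=\; T A\, j \;+\; T(1-A)\, k \;+\; (1-T)(1-A)\, m \;+\; (1-T)A\, n
\]
% Consistency check: with perfect prediction (A = 1) and n = 0 this
% reduces to jT, the best-case limit derived on the next slide.
% The fraction predicted taken is then p = TA + (1-T)(1-A).
```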

Theoretical Limits of Branch Prediction Best case: branches are perfectly predicted (A = 1); also assume that n = 0. The minimum branch penalty is then j*T. Let s be the pipeline stage where the BTA becomes known; then j = s-1 (see the static prediction methods in the previous lecture). Thus, the performance of any branch prediction strategy is limited by s, the pipeline stage that develops the BTA, and A, the accuracy of the prediction.

Review: Static Branch Prediction Methods Several static prediction strategies: Predict all branches as NOT TAKEN Predict all branches as TAKEN Predict all branches with certain opcodes as TAKEN, and all others as NOT TAKEN Predict all forward branches as NOT TAKEN, and all backward branches as TAKEN Opcodes have default predictions, which the compiler may reverse by setting a bit in the instruction

Dynamic Branch Prediction Premise: History of a branch instr’s outcome matters! whether a branch will be taken depends greatly on the way previous dynamic instances of the same branch were decided Dynamic prediction methods: take advantage of this fact by making their predictions dependent on the past behavior of the same branch instr such methods are called Branch History Table (BHT) methods

BHT Methods for Branch Prediction

A One-Bit Predictor A two-state machine: State 1 (Predict Taken) and State 0 (Predict Not Taken); a taken outcome (T) moves it to State 1, a not-taken outcome (NT) moves it to State 0. The predictor misses twice on typical loop branches: once at the end of the loop, and once at the end of the 1st iteration of the next execution of the loop. The outcome sequence NT-T-NT-T makes it miss all the time.
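A minimal sketch (a toy helper, not from the slides) simulating this behavior; outcomes are 1 for taken, 0 for not taken:

```python
# 1-bit predictor: the state is simply the last outcome of the branch.
def one_bit_misses(outcomes, init=1):
    state, misses = init, 0          # 1 = predict taken, 0 = predict not taken
    for taken in outcomes:
        if state != taken:
            misses += 1
        state = taken                # remember only the most recent outcome
    return misses

# A 10-iteration loop executed 3 times: taken 9x, then not taken at exit.
loop = ([1] * 9 + [0]) * 3
print(one_bit_misses(loop))          # -> 5: one miss in run 1, then 2 per run
print(one_bit_misses([0, 1] * 5))    # -> 10: NT-T-NT-T misses every time
```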

A Two-Bit Predictor A four-state Moore machine: States 3 and 2 predict Taken, States 1 and 0 predict Not Taken; each taken outcome moves the state up, each not-taken outcome moves it down. The predictor misses only once on typical loop branches, hence its popularity. The outcome sequence NT-NT-T-T-NT-NT-T-T makes it miss all the time.
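The same loop branch through a 2-bit saturating counter (a sketch, not from the slides; predict taken when the counter is 2 or 3):

```python
# 2-bit saturating counter: taken increments, not taken decrements,
# clamped to [0, 3]; states 2 and 3 predict taken.
def two_bit_misses(outcomes, init=3):
    c, misses = init, 0
    for taken in outcomes:
        if (c >= 2) != bool(taken):
            misses += 1
        c = min(c + 1, 3) if taken else max(c - 1, 0)
    return misses

loop = ([1] * 9 + [0]) * 3           # same 10-iteration loop, run 3 times
print(two_bit_misses(loop))          # -> 3: only the loop-exit miss each run
```

The single loop-exit miss per run (versus two for the 1-bit scheme) is what makes the hysteresis worthwhile.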


Correlating Branch Outcome Predictors The history-based branch predictors seen so far base their predictions on the past history of the branch being predicted. A completely different idea: the outcome of a branch may well be predicted successfully based on the outcomes of the last k branches executed, i.e., the path leading to the branch being predicted. A much-quoted example from the SPEC92 benchmark eqntott: if (aa == 2) /*b1*/ aa = 0; if (bb == 2) /*b2*/ bb = 0; if (aa != bb) /*b3*/ { … } Here TAKEN(b1) && TAKEN(b2) implies NOT-TAKEN(b3).

Another Example of Branch Correlation if (d == 0) //b1 d = 1; if (d == 1) //b2 ... Assume multiple runs of the code fragment, with d alternating between 2 and 0. How would a 1-bit predictor initialized to state 0 behave? BNEZ R1, L1 ADDI R1, R0, 1 L1: SUBI R3, R1, 1 BNEZ R3, L2 … L2:
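A sketch of the answer (hypothetical harness; it models the assembly above, where each branch is taken when its register is nonzero): with d alternating 2, 0, each branch sees the outcome sequence T, NT, T, NT, and a standalone 1-bit predictor mispredicts every single execution.

```python
# Standalone 1-bit predictors, one per branch, initialized to state 0.
def run_one_bit(runs=10):
    pred = {"b1": 0, "b2": 0}        # 1-bit state per branch
    misses = 0
    for i in range(runs):
        d = 2 if i % 2 == 0 else 0
        b1 = int(d != 0)             # BNEZ R1: taken when d != 0
        d = d if b1 else 1           # fall-through executes d = 1
        b2 = int(d != 1)             # BNEZ R3: taken when d != 1
        for name, taken in (("b1", b1), ("b2", b2)):
            if pred[name] != taken:
                misses += 1
            pred[name] = taken
    return misses

print(run_one_bit(10))               # -> 20: every branch mispredicted
```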

A Correlating Branch Predictor Think of having a pair of 1-bit predictors [p0, p1] for each branch, where we choose between predictors (and update them) based on outcome of most recent branch (i.e., B1 for B2, and B2 for B1) if most recent br was not taken, use and update (if needed) predictor p0 If most recent br was taken, use and update (if needed) predictor p1 How would such (1,1) correlating predictors behave if initialized to [0,0]?
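A sketch of this (1,1) scheme on the same alternating-d example (hypothetical harness, initialized to [0,0] as the slide asks): after two warm-up misses, the correlation makes every subsequent prediction correct.

```python
# (1,1) correlating predictor: each branch keeps two 1-bit predictors,
# and the outcome of the most recently executed branch selects which
# one is consulted and updated.
def run_correlating(runs=100):
    pred = {"b1": [0, 0], "b2": [0, 0]}   # [p0: last br NT, p1: last br T]
    last, misses = 0, 0
    for i in range(runs):
        d = 2 if i % 2 == 0 else 0
        b1 = int(d != 0)                  # taken when d != 0
        b2 = int((d if b1 else 1) != 1)   # taken when d != 1 after b1
        for name, taken in (("b1", b1), ("b2", b2)):
            if pred[name][last] != taken:
                misses += 1
                pred[name][last] = taken  # update only the selected entry
            last = taken
    return misses

print(run_correlating(100))          # -> 2: misses only while warming up
```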

Organization of an (m,n) Correlating Predictor Uses the results of the last m branches: the 2^m possible outcome histories are kept in an m-bit shift register (the global branch history), and each table entry is an n-bit "self-history" predictor. The BHT is addressed two ways: the m bits of global history select a column (a particular predictor), and some lower bits of the branch address select a row (a particular branch instruction); each entry holds n bits of prediction state. Aliasing can occur since the BHT uses only a portion of the branch instruction address, so the state in the various predictors in a single row may correspond to different branches at different points in time. With m = 0, this is an ordinary BHT. [Figure: a table of 2-bit branch predictors; the branch address selects the row, and a 2-bit global branch history register selects among 4 columns.]
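A sketch of the global-history half of a (2,2) predictor for a single branch (address-bit row selection omitted, so aliasing does not arise; a toy model, not the slide's exact hardware): the 2-bit history selects one of four 2-bit counters.

```python
# (2,2) predictor for one branch: 2-bit global history register indexes
# four 2-bit saturating counters; counters 2 and 3 predict taken.
def run_2_2(pattern, repeats):
    counters = [0, 0, 0, 0]          # one 2-bit counter per history value
    hist, misses = 0, 0
    for taken in pattern * repeats:
        c = counters[hist]
        if (c >= 2) != bool(taken):
            misses += 1
        counters[hist] = min(c + 1, 3) if taken else max(c - 1, 0)
        hist = ((hist << 1) | taken) & 0b11   # shift in the new outcome
    return misses

# The pattern T, T, NT is fully determined by 2 bits of history, so the
# predictor stops missing once the counters warm up.
print(run_2_2([1, 1, 0], 100))       # -> 5: all misses are during warm-up
```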

Improved Dynamic Branch Prediction Recall that, even with perfect accuracy of prediction, branch penalty of a prediction method is (s-1)*T s is the pipeline stage where BTA is developed T is the frequency of taken branches Further improvements can be obtained only by using a cache storing BTAs, and accessing it simultaneously with the I-cache Such a cache is called a Branch Target Buffer (BTB) BHT and BTB can be used together Coupled: one table holds all the information Uncoupled: two independent tables

Branch Target Buffers Figure 3.21 A branch-target buffer. The PC of the instruction being fetched is matched against a set of instruction addresses stored in the first column; these represent the addresses of known branches. If the PC matches one of these entries, then the instruction being fetched is a taken branch, and the second field, predicted PC, contains the prediction for the next PC after the branch. Fetching begins immediately at that address. The third field, which is optional, may be used for extra prediction state bits.

How the BTB Is Used During instruction fetch: read the BTB concurrently with instruction memory; for recently taken branches, the BTB immediately provides the branch target address (BTA).

Using BTB and BHT Together Uncoupled solution BTB stores only the BTAs of taken branches recently executed No separate branch outcome prediction (the presence of an entry in BTB can be used as an implicit prediction of the branch being TAKEN next time) Use the BHT in case of a BTB miss Coupled solution Stores BTAs of all branches recently executed Has separate branch outcome prediction for each table entry Use BHT in case of BTB hit Predict NOT TAKEN otherwise
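The uncoupled lookup policy above can be sketched as follows (hypothetical structures: dictionaries standing in for the two tables, 4-byte instructions assumed for the fall-through PC):

```python
# Uncoupled BTB + BHT: a BTB entry is an implicit "taken" prediction;
# on a BTB miss, consult a 2-bit BHT; predict not taken by default.
def predict(pc, btb, bht):
    if pc in btb:                    # BTB hit: implicit predict-taken
        return ("taken", btb[pc])   # fetch from the stored BTA
    if bht.get(pc, 0) >= 2:         # BTB miss: fall back on the BHT
        return ("taken", None)      # taken, but BTA unknown until decode
    return ("not taken", pc + 4)    # default: fetch inline

btb = {0x1000: 0x2000}              # recently taken branch and its BTA
bht = {0x1040: 3}                   # branch known taken, but not in BTB
print(predict(0x1000, btb, bht))    # -> ('taken', 8192)
print(predict(0x1040, btb, bht))    # -> ('taken', None)
print(predict(0x1080, btb, bht))    # -> ('not taken', 4228)
```

Note the middle case: the BHT can say "taken" on a BTB miss, but the target is still unavailable until the instruction is decoded, which is exactly why the BTB is worth its cost.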

Parameters of Real Machines

Coupled BTB and BHT Access the BTB and I-cache together. Miss in BTB: predict not taken (fetch inline). If the instruction is not a branch: OK, zero penalty (Case 1). If it is a branch (next instr killed): branch not taken (Case 2); branch taken: enter it into the BTB (Case 3). Hit in BTB: if the BHT says Predict Not Taken: branch not taken (Case 4); branch taken: update the BTB (Case 5). If the BHT says Predict Taken, go to the BTA stored in the BTB: branch not taken: update BTB? (Case 6); branch taken with wrong BTA (Case 7); branch taken with correct BTA (Case 8).

Decoupled BTB and BHT Access the BTB and I-cache together. Miss in BTB (fetch inline, wait for opcode): if the instruction is not a branch: OK, zero penalty (Case 1). If it is a branch (next instr killed), Predict Not Taken (using the BHT): branch not taken (Case 2); branch taken: enter it into the BTB (Case 3). If the BHT instead says Predict Taken: branch not taken (Case 4); branch taken (Case 5). Hit in BTB: Predict Taken, go to the BTA stored in the BTB: branch not taken: remove from BTB (Case 6); branch taken with wrong BTA: update BTB (Case 7); branch taken with correct BTA (Case 8).

Reducing Misprediction Penalties We need to recover whenever a branch prediction is incorrect: discard all speculatively executed instructions, then resume execution along the alternative path (this is the costly step). Scenarios where recovery is needed: predict taken, branch is taken, BTA wrong (case 7); predict taken, branch is not taken (cases 4 and 6); predict not taken, branch is taken (case 3). Preparing for recovery involves working on the alternative path. On the instruction level: two fetch address registers per speculated branch (PPC 603 & 640), or two instruction buffers (IBM 360/91, SuperSPARC, Pentium). On the I-cache level: for predict-taken, also do next-line prefetching; for predict-not-taken, also do target-line prefetching.

Predicting Dynamic BTAs The vast majority of dynamic BTAs come from procedure returns (85% for SPEC95); the rest come from case/switch statements, indirect procedure calls, etc., which are common in OOP languages (C++, Java). Procedure call-return for the most part follows a stack discipline, so a specialized return address buffer operated as a stack gives high prediction accuracy: push the return address on a call, pop it on a return. The depth of the RAS should be as large as the maximum call depth to avoid mispredictions; 8-16 elements are generally sufficient.
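A minimal sketch of such a return address stack (hypothetical class; a fixed depth with the oldest entry silently dropped on overflow, as in shallow hardware stacks):

```python
from collections import deque

# Return address stack (RAS): push on call, pop on return to predict
# the return target; overflow discards the oldest entry.
class ReturnAddressStack:
    def __init__(self, depth=16):
        self.stack = deque(maxlen=depth)  # maxlen drops oldest on overflow
    def call(self, return_addr):          # executed on a procedure call
        self.stack.append(return_addr)
    def ret(self):                        # executed on a procedure return
        return self.stack.pop() if self.stack else None

ras = ReturnAddressStack(depth=2)
ras.call(0x100); ras.call(0x200); ras.call(0x300)   # 0x100 is dropped
print(hex(ras.ret()))                # -> 0x300
print(hex(ras.ret()))                # -> 0x200
print(ras.ret())                     # -> None: depth exceeded, mispredict
```

The overflow behavior in the usage example is exactly the misprediction the slide warns about when call depth exceeds the RAS depth.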