Branch Prediction Prof. Mikko H. Lipasti

Slides:

Advertisements

Similar presentations

Dynamic Branch Prediction

Advertisements

COMPSYS 304 Computer Architecture Speculation & Branching Morning visitors - Paradise Bay, Bay of Islands.

Computer Architecture Computer Architecture Processing of control transfer instructions, part I Ola Flygt Växjö University

8 Processing of control transfer instructions TECH Computer Science 8.1 Introduction 8.2 Basic approaches to branch handling 8.3 Delayed branching 8.4.

Computer Organization and Architecture

Copyright 2001 UCB & Morgan Kaufmann ECE668.1 Adapted from Patterson, Katz and Culler © UCB Csaba Andras Moritz UNIVERSITY OF MASSACHUSETTS Dept. of Electrical.

W04S1 COMP s1 Seminar 4: Branch Prediction Slides due to David A. Patterson, 2001.

EECE476: Computer Architecture Lecture 21: Faster Branches Branch Prediction with Branch-Target Buffers (not in textbook) The University of British ColumbiaEECE.

ENEE350 Ankur Srivastava University of Maryland, College Park Based on Slides from Mary Jane Irwin ( )

1 COMP 206: Computer Architecture and Implementation Montek Singh Wed., Oct. 8, 2003 Topic: Instruction-Level Parallelism (Dynamic Branch Prediction)

1 COMP 206: Computer Architecture and Implementation Montek Singh Mon., Oct. 7, 2002 Topic: Instruction-Level Parallelism (Dynamic Branch Prediction)

EECE476: Computer Architecture Lecture 20: Branch Prediction Chapter extra The University of British ColumbiaEECE 476© 2005 Guy Lemieux.

EECC551 - Shaaban #1 lec # 5 Fall Reduction of Control Hazards (Branch) Stalls with Dynamic Branch Prediction So far we have dealt with.

EENG449b/Savvides Lec /17/04 February 17, 2004 Prof. Andreas Savvides Spring EENG 449bG/CPSC 439bG.

Determining Branch Direction

Computer Architecture Instruction Level Parallelism Dr. Esam Al-Qaralleh.

1 COMP 740: Computer Architecture and Implementation Montek Singh Thu, Feb 19, 2009 Topic: Instruction-Level Parallelism III (Dynamic Branch Prediction)

The PowerPC Architecture  IBM, Motorola, and Apple Alliance  Based on the IBM POWER Architecture Facilitate parallel execution Scale well with advancing.

Dynamic Branch Prediction

EENG449b/Savvides Lec /25/05 March 24, 2005 Prof. Andreas Savvides Spring g449b EENG 449bG/CPSC 439bG.

CIS 429/529 Winter 2007 Branch Prediction.1 Branch Prediction, Multiple Issue.

Arvind and Joel Emer Computer Science and Artificial Intelligence Laboratory M.I.T. Branch Prediction.

CH12 CPU Structure and Function

1 Dynamic Branch Prediction. 2 Why do we want to predict branches? MIPS based pipeline – 1 instruction issued per cycle, branch hazard of 1 cycle. –Delayed.

Branch.1 10/14 Branch Prediction Static, Dynamic Branch prediction techniques.

Processor Structure and Function Chapter8:. CPU Structure  CPU must:  Fetch instructions –Read instruction from memory  Interpret instructions –Instruction.

Superscalar - summary Superscalar machines have multiple functional units (FUs) eg 2 x integer ALU, 1 x FPU, 1 x branch, 1 x load/store Requires complex.

Branch Prediction Prof. Mikko H. Lipasti University of Wisconsin-Madison Lecture notes based on notes by John P. Shen Updated by Mikko Lipasti.

CS 6290 Branch Prediction. Control Dependencies Branches are very frequent –Approx. 20% of all instructions Can not wait until we know where it goes –Long.

Adapted from Computer Organization and Design, Patterson & Hennessy, UCB ECE232: Hardware Organization and Design Part 13: Branch prediction (Chapter 4/6)

COMPSYS 304 Computer Architecture Speculation & Branching Morning visitors - Paradise Bay, Bay of Islands.

CSL718 : Pipelined Processors

Instruction-Level Parallelism and Its Dynamic Exploitation

CS203 – Advanced Computer Architecture

Computer Structure Advanced Branch Prediction

COMP 740: Computer Architecture and Implementation

Computer Architecture Advanced Branch Prediction

UNIVERSITY OF MASSACHUSETTS Dept

Morgan Kaufmann Publishers

CS5100 Advanced Computer Architecture Advanced Branch Prediction

Samira Khan University of Virginia Nov 13, 2017

Chapter 4 The Processor Part 4

Flow Path Model of Superscalars

CMSC 611: Advanced Computer Architecture

Module 3: Branch Prediction

So far we have dealt with control hazards in instruction pipelines by:

CPE 631: Branch Prediction

Branch statistics Branches occur every 4-6 instructions (16-25%) in integer programs; somewhat less frequently in scientific ones Unconditional branches.

15-740/ Computer Architecture Lecture 24: Control Flow

CSCE 432/832 High Performance Processor Architectures Instruction Flow & Branch Prediction Adopted from Lecture notes based in part on slides created by.

Dynamic Branch Prediction

Advanced Computer Architecture

/ Computer Architecture and Design

Pipelining and control flow

So far we have dealt with control hazards in instruction pipelines by:

Lecture 10: Branch Prediction and Instruction Delivery

Instruction Flow Techniques

So far we have dealt with control hazards in instruction pipelines by:

So far we have dealt with control hazards in instruction pipelines by:

Adapted from the slides of Prof

Dynamic Hardware Prediction

So far we have dealt with control hazards in instruction pipelines by:

So far we have dealt with control hazards in instruction pipelines by:

So far we have dealt with control hazards in instruction pipelines by:

So far we have dealt with control hazards in instruction pipelines by:

Chapter 11 Processor Structure and function

So far we have dealt with control hazards in instruction pipelines by:

CPE 631 Lecture 12: Branch Prediction

Presentation transcript:

Branch Prediction Prof. Mikko H. Lipasti University of Wisconsin-Madison Lecture notes based on notes by John P. Shen Updated by Mikko Lipasti

Lecture Overview Program control flow Branch Prediction Implicit sequential control flow Disruptions of sequential control flow Branch Prediction Branch instruction processing Branch instruction speculation Key historical studies on branch prediction UCB Study [Lee and Smith, 1984] IBM Study [Nair, 1992] Branch prediction implementation (PPC 604) BTAC and BHT design Fetch Address Generation

Program Control Flow Implicit Sequential Control Flow Static Program Representation Control Flow Graph (CFG) Nodes = basic blocks Edges = Control flow transfers Physical Program Layout Mapping of CFG to linear program memory Implied sequential control flow Dynamic Program Execution Traversal of the CFG nodes and edges (e.g. loops) Traversal dictated by branch conditions Dynamic Control Flow Deviates from sequential control flow Disrupts sequential fetching Can stall IF stage and reduce I-fetch bandwidth Draw simple CFG, map to sequential, write “branch instr. deviates from implied sequential control flow”

Program Control Flow Dynamic traversal of static CFG Mapping CFG to linear memory Show common path as loop on left

Disruption of Sequential Control Flow Penalty (increased lost opportunity cost): 1) target addr. Generation, 2) condition resolution Sketch Tyranny of Amdahl’s Law pipeline diagram

Branch Prediction Target address generation  Target Speculation Access register: PC, General purpose register, Link register Perform calculation: +/- offset, autoincrement, autodecrement Condition resolution  Condition speculation Condition code register, General purpose register Comparison of data register(s) Non-CC architecture requires actual computation to resolve condition

Target Address Generation Decode: PC = PC + DISP (adder) (1 cycle penalty) Dispatch: PC = (R2) ( 2 cycle penalty) Execute: PC = (R2) + offset (3 cycle penalty) Both cond & uncond Guess target address? Keep history based on branch address

Condition Resolution Dispatch: RD ( 2 cycle penalty) Execute: Status (3 cycle penalty) Only cond (not uncond) Guess taken vs. not taken

Branch Instruction Speculation Fetch: PC = PC + 4 BTB (cache) contains k cycles of history, generates guess Execute: corrects wrong guess

Branch/Jump Target Prediction 0x0348 0101 (NTNT) 0x0612 Branch Target Buffer: small cache in fetch stage Previously executed branches, address, taken history, target(s) Fetch stage compares current FA against BTB If match, use prediction If predict taken, use BTB target When branch executes, BTB is updated Optimization: Size of BTB: increases hit rate Prediction algorithm: increase accuracy of prediction Write: 0x0348, 0101 (NTNT), 0x0612

Branch Prediction: Condition Speculation Biased Not Taken Hardware prediction Does not affect ISA Not effective for loops Software Prediction Extra bit in each branch instruction Set to 0 for not taken Set to 1 for taken Bit set by compiler or user; can use profiling Static prediction, same behavior every time Prediction based on branch offset Positive offset: predict not taken Negative offset: predict taken Prediction based on dynamic history PowerPC ‘y’ bit: slight twist IBM Fortran anecdote: programmers got this wrong BTFN: if-then-else are 50/50 quite often, loop usually taken

UCB Study [Lee and Smith, 1984] Benchmarks used 26 programs (IBM 370, DEC PDP-11, CDC 6400) 6 workloads (4 IBM, 1 DEC, 1 CDC) Used trace-driven simulation Branch types Unconditional: always taken or always not taken Subroutine call: always taken Loop control: usually taken Decision: either way, if-then-else Computed goto: always taken, with changing target Supervisor call: always taken Execute: always taken (IBM 370) IBM1: compiler IBM2: cobol (business app) IBM3: scientific IBM4: supervisor (OS) Bottom-up hacking => no science Computed goto: C++ virtual fn calls Execute: mini subroutine call (ouch) IBM1-4: compiler, cobol, scientific, supervisor IBM1 IBM2 IBM3 IBM4 DEC CDC Avg T 0.640 0.657 0.704 0.540 0.738 0.778 0.676 NT 0.360 0.343 0.296 0.460 0.262 0.222 0.324

Branch Prediction Function Prediction function F(X1, X2, … ) X1 – opcode type X2 – history Prediction effectiveness based on opcode only, or history IBM1 IBM2 IBM3 IBM4 DEC CDC Opcode only 66 69 71 55 80 78 History 0 64 70 54 74 History 1 92 95 87 97 82 History 2 93 91 83 98 History 3 94 84 History 4 History 5 96 F > 0.5 => pred T F <= 0.5 => pred NT Draw line under 2, diminishing returns

Example Prediction Algorithm Hardware table remembers last 2 branch outcomes History of past several branches encoded by FSM Current state used to generate prediction Results: Subdivide state machine into T (3 states) vs. NT (1 state); highlight prediction and state names Draw FSM logic under “Info” column Workload IBM1 IBM2 IBM3 IBM4 DEC CDC Accuracy 93 97 91 83 98

Other Prediction Algorithms Combining prediction accuracy with BTB hit rate (86.5% for 128 sets of 4 entries each), branch prediction can provide the net prediction accuracy of approximately 80%. This implies a 5-20% performance enhancement. Hysteresis: => wrong twice before prediction changes T or t? results in predict taken N or N? results in not-taken prediction

IBM Study [Nair, 1992] Branch processing on the IBM RS/6000 Separate branch functional unit Five different branch types b: unconditional branch bl: branch and link (subroutine calls) bc: conditional branch bcr: conditional branch using link register (returns) bcc: conditional branch using count register Overlap of branch instructions with other instructions Zero cycle branches Two causes for branch stalls Unresolved conditions Branches downstream too close to unresolved branches Explain zero-cycle branches: if CC is known, and target can be computed and fetched (eager fetch), can have 0 pipeline delay

Branch Instruction Distribution % of each branch type % bc with penalty cycles Benchmark b bl bc bcr bcc 3 cyc 2 cyc 1 cyc spice2g6 7.86 0.30 12.58 0.32 13.82 3.12 0.76 doduc 1.00 0.94 8.22 1.01 10.14 1.76 2.02 matrix300 0.00 14.50 0.68 0.22 0.20 tomcatv 6.10 0.24 0.02 0.01 gcc 2.30 1.32 15.50 1.81 22.46 9.48 4.85 espresso 3.61 0.58 19.85 37.37 1.77 0.31 li 2.41 1.92 14.36 1.91 31.55 3.44 1.37 eqntott 0.91 0.47 32.87 0.51 5.01 11.01 0.80 Matrix300, tomcatv have almost no bc with penalty (loop closing branches) “control intensive” benchmarks make lots of decisions with “late” data (I.e. usually load-compare-branch)

Exhaustive Search for Optimal 2-bit Predictor There are 220 possible state machines of 2-bit predictors Some machines are uninteresting, pruning them out reduces the number of state machines to 5248 For each benchmark, determine prediction accuracy for all the predictor state machines Find optimal 2-bit predictor for each application Spice2g6 - weird Doduc, gcc, espresso – counters Li, eqntott – near counters Write in counter values – 97.0, 94.3, 89.1, 89.1, 86.8, 87.2

Number of History Bits Needed Prediction Accuracy (Overall CPI Overhead) Benchmark 3 bit 2 bit 1 bit 0 bit spice2g6 97.0 (0.009) 96.2 (0.013) 76.6 (0.031) doduc 94.2 (0.003) 94.3 (0.003) 90.2 (0.004) 69.2 (0.022) gcc 89.7 (0.025) 89.1 (0.026) 86.0 (0.033) 50.0 (0.128) espresso 89.5 (0.045) 89.1 (0.047) 87.2 (0.054) 58.5 (0.176) li 88.3 (0.042) 86.8 (0.048) 82.5 (0.063) 62.4 (0.142) eqntott 89.3 (0.028) 87.2 (0.033) 82.9 (0.046) 78.4 (0.049) 2- bit counter is enough, 1-bit often is not: Power4 found opposite conclusion, given fixed storage (16K x 1bit was better than 8K x 2 bits) Branch history table size: Direct-mapped array of 2k entries Some programs, like gcc, have over 7000 conditional branches In collisions, multiple branches share the same predictor Constructive and destructive interference Destructive interference Marginal gains beyond 1K entries (for these programs)

Branch Prediction Implementation (PPC 604)

BTAC and BHT Design (PPC 604) Fast (one cycle) and slow (two cycle) predictors BTAC: provides target address and condition (if hit) with fast turnaround; equivalent to single-bit history for condition BHT: provides 2-bit counters for better history/hysteresis

BTAC and BHT Design (PPC 604) BTAC prediction BHT prediction Branch on count register (loops) executed early PC+disp branch target and condition computed Exceptions detected at completion Many sources for next fetch address!