CSCE 432/832 High Performance Processor Architectures Instruction Flow & Branch Prediction Adopted from Lecture notes based in part on slides created by.

Slides:

Advertisements

Similar presentations

Dynamic Branch Prediction

Advertisements

Computer Organization and Architecture

Computer Architecture Computer Architecture Processing of control transfer instructions, part I Ola Flygt Växjö University

Limits on ILP. Achieving Parallelism Techniques – Scoreboarding / Tomasulo’s Algorithm – Pipelining – Speculation – Branch Prediction But how much more.

Computer Organization and Architecture

W04S1 COMP s1 Seminar 4: Branch Prediction Slides due to David A. Patterson, 2001.

EECE476: Computer Architecture Lecture 21: Faster Branches Branch Prediction with Branch-Target Buffers (not in textbook) The University of British ColumbiaEECE.

Computer Organization and Architecture The CPU Structure.

EENG449b/Savvides Lec /17/04 February 17, 2004 Prof. Andreas Savvides Spring EENG 449bG/CPSC 439bG.

Determining Branch Direction

The PowerPC Architecture  IBM, Motorola, and Apple Alliance  Based on the IBM POWER Architecture Facilitate parallel execution Scale well with advancing.

Dynamic Branch Prediction

Pipelining to Superscalar Prof. Mikko H. Lipasti University of Wisconsin-Madison Lecture notes based on notes by John P. Shen Updated by Mikko Lipasti.

CH12 CPU Structure and Function

Spring 2003CSE P5481 VLIW Processors VLIW (“very long instruction word”) processors instructions are scheduled by the compiler a fixed number of operations.

1 Dynamic Branch Prediction. 2 Why do we want to predict branches? MIPS based pipeline – 1 instruction issued per cycle, branch hazard of 1 cycle. –Delayed.

Branch.1 10/14 Branch Prediction Static, Dynamic Branch prediction techniques.

Processor Structure and Function Chapter8:. CPU Structure  CPU must:  Fetch instructions –Read instruction from memory  Interpret instructions –Instruction.

Superscalar - summary Superscalar machines have multiple functional units (FUs) eg 2 x integer ALU, 1 x FPU, 1 x branch, 1 x load/store Requires complex.

Oct. 18, 2000Machine Organization1 Machine Organization (CS 570) Lecture 4: Pipelining * Jeremy R. Johnson Wed. Oct. 18, 2000 *This lecture was derived.

Branch Prediction Prof. Mikko H. Lipasti University of Wisconsin-Madison Lecture notes based on notes by John P. Shen Updated by Mikko Lipasti.

CS 6290 Branch Prediction. Control Dependencies Branches are very frequent –Approx. 20% of all instructions Can not wait until we know where it goes –Long.

COMPSYS 304 Computer Architecture Speculation & Branching Morning visitors - Paradise Bay, Bay of Islands.

ECE/CS 552: Pipeline Hazards © Prof. Mikko Lipasti Lecture notes based in part on slides created by Mark Hill, David Wood, Guri Sohi, John Shen and Jim.

CSL718 : Pipelined Processors

Instruction-Level Parallelism Dynamic Branch Prediction

Instruction-Level Parallelism and Its Dynamic Exploitation

Computer Architecture: Branch Prediction (II) and Predicated Execution

Prof. Hsien-Hsin Sean Lee

CS203 – Advanced Computer Architecture

COMP 740: Computer Architecture and Implementation

William Stallings Computer Organization and Architecture 8th Edition

Computer Architecture Advanced Branch Prediction

Morgan Kaufmann Publishers

Chapter 14 Instruction Level Parallelism and Superscalar Processors

Samira Khan University of Virginia Nov 13, 2017

ECE/CS 552: Pipelining to Superscalar

Chapter 4 The Processor Part 4

Flow Path Model of Superscalars

Module 3: Branch Prediction

Lecture 14: Reducing Cache Misses

So far we have dealt with control hazards in instruction pipelines by:

CSCE 432/832 High Performance Processor Architectures Scalar to Superscalar Adopted from Lecture notes based in part on slides created by Mikko H. Lipasti,

Branch statistics Branches occur every 4-6 instructions (16-25%) in integer programs; somewhat less frequently in scientific ones Unconditional branches.

15-740/ Computer Architecture Lecture 24: Control Flow

Pipelining to Superscalar ECE/CS 752 Fall 2017

Dynamic Branch Prediction

/ Computer Architecture and Design

Pipelining and control flow

Control unit extension for data hazards

So far we have dealt with control hazards in instruction pipelines by:

Lecture 10: Branch Prediction and Instruction Delivery

Instruction Flow Techniques

CS203 – Advanced Computer Architecture

ECE/CS 552: Pipelining to Superscalar

So far we have dealt with control hazards in instruction pipelines by:

So far we have dealt with control hazards in instruction pipelines by:

Adapted from the slides of Prof

Branch Prediction Prof. Mikko H. Lipasti

Dynamic Hardware Prediction

So far we have dealt with control hazards in instruction pipelines by:

So far we have dealt with control hazards in instruction pipelines by:

So far we have dealt with control hazards in instruction pipelines by:

So far we have dealt with control hazards in instruction pipelines by:

Chapter 11 Processor Structure and function

So far we have dealt with control hazards in instruction pipelines by:

Presentation transcript:

CSCE 432/832 High Performance Processor Architectures Instruction Flow & Branch Prediction Adopted from Lecture notes based in part on slides created by Mikko H. Lipasti, John Shen, Mark Hill, David Wood, Guri Sohi, and Jim Smith CS252 S05

Instruction Flow Techniques Goal of Instruction Flow and Impediments Branch Types and Implementations What’s So Bad About Branches? What are Control Dependences? Impact of Control Dependences on Performance Improving I-Cache Performance 11/30/2018 CSCE 432/832, Superscalar -- Instruction Flow CS252 S05

Instruction Flow in Context 11/30/2018 CSCE 432/832, Superscalar -- Instruction Flow CS252 S05

CSCE 432/832, Superscalar -- Instruction Flow Goal and Impediments Goal of Instruction Flow Supply processor with maximum number of useful instructions every clock cycle Impediments Branches and jumps Finite I-Cache 11/30/2018 CSCE 432/832, Superscalar -- Instruction Flow CS252 S05

Branch Types and Implementation Types of Branches Conditional or Unconditional Save PC? How is target computed? Single (fixed-range) target (immediate, PC+immediate) Multiple (varied and flexible) targets (register) Branch Architectures Condition code or condition registers Register 3-typle that describes branch type Single target – fixed range, easy to compute Multiple targets – flexible, could be tough to compute 11/30/2018 CSCE 432/832, Superscalar -- Instruction Flow CS252 S05

Branch Types and Implementation 31 CR0 CR1 CR2 CR3 CR4 CR5 CR6 CR7 PowerPC 32-bit condition register – eight 4-bit fields (CR0-CR7) CR0 can be implicit result of integer op CR1 can be implicit result of FP op Compare ops set CR fields Special CR ops manipulate bits Conditional branch instructions test CR bits 11/30/2018 CSCE 432/832, Superscalar -- Instruction Flow CS252 S05

CSCE 432/832, Superscalar -- Instruction Flow Branches – PowerPC 12 Types of Branches Branch (unconditional, no save PC, PC+imm) Branch absolute (uncond, no save PC, imm) Branch and link (uncond, save PC, PC+imm) Branch abs and link (uncond, save PC, imm) Branch conditional (conditional, no save PC, PC+imm) Branch cond abs (cond, no save PC, imm) Branch cond and link (cond, save PC, PC+imm) Branch cond abs and link (cond, save PC, imm) Branch cond to link register (cond, don’t save PC, reg) Branch cond to link reg and link (cond, save PC, reg) Branch cond to count reg (cond, don’t save PC, reg) Branch cond to count reg and link (cond, save PC, reg) 11/30/2018 CSCE 432/832, Superscalar -- Instruction Flow CS252 S05

CSCE 432/832, Superscalar -- Instruction Flow Branches – Alpha Alpha 3 Types of Branches Conditional branch (cond, no save PC, PC+imm) Bxx Ra, disp Unconditional branch (uncond, Save PC, PC+imm) Br Ra, disp Jumps (uncond, save PC, Register) J Ra Digital Equipment Corp, 1992 (?) => Compaq acquired in late 90’s => HP & Compaq merged in 2002 “Ultimate RISC” ISA, so still interesting, even though HP has announced EOL 11/30/2018 CSCE 432/832, Superscalar -- Instruction Flow CS252 S05

CSCE 432/832, Superscalar -- Instruction Flow Branches – MIPS MIPS 6 Types of Branches Jump (uncond, no save PC, imm) Jump and link (uncond, save PC, imm) Jump register (uncond, no save PC, register) Jump and link register (uncond, save PC, register) Branch (conditional, no save PC, PC+imm) Branch and link (conditional, save PC, PC+imm) 11/30/2018 CSCE 432/832, Superscalar -- Instruction Flow CS252 S05

What’s So Bad About Branches? Effects of Branches Fragmentation of I-Cache lines Need to determine branch direction Need to determine branch target Use up execution resources Direction: T/NT, -> condition resolution which is data dependent Target: generate address, fetch, potentially data dependent Execution: wastes slots throughout pipeline if not filled at top 11/30/2018 CSCE 432/832, Superscalar -- Instruction Flow CS252 S05

What’s So Bad About Branches? Problem: Fetch stalls until direction is determined Solutions: Minimize delay Move instructions determining branch condition away from branch Make use of delay Non-speculative: Fill delay slots with useful safe instructions Execute both paths (eager execution) Speculative: Predict branch direction Minimize: condition code architecture Speculate: Fetch-Decode-Dispatch-Execute: keep pipeline full of speculative instructions 11/30/2018 CSCE 432/832, Superscalar -- Instruction Flow CS252 S05

What’s So Bad About Branches? Problem: Fetch stalls until branch target is determined Solutions: Minimize delay Generate branch target early Make use of delay: Predict branch target Single target Multiple targets Minimize: extra adder, simpler addressing modes (MIPS, Alpha), special-purpose registers (CTR, LR in PowerPC) read early Multiple: Return address stack, bl-push, blr-pop (draw picture of stack) Finite size, overflow, underflow What other cause for multiple targets? Virtual function calls—empirically still have most common target, but do cause problems 11/30/2018 CSCE 432/832, Superscalar -- Instruction Flow CS252 S05

CSCE 432/832, Superscalar -- Instruction Flow Control Dependences Control Flow Graph Shows possible paths of control flow through basic blocks 11/30/2018 CSCE 432/832, Superscalar -- Instruction Flow CS252 S05

CSCE 432/832, Superscalar -- Instruction Flow Control Dependences Control Dependence Node B is CD on Node A if A determines whether B executes 11/30/2018 CSCE 432/832, Superscalar -- Instruction Flow CS252 S05

Limits on Instruction Level Parallelism (ILP) Weiss and Smith [1984] 1.58 Sohi and Vajapeyam [1987] 1.81 Tjaden and Flynn [1970] 1.86 (Flynn’s bottleneck) Tjaden and Flynn [1973] 1.96 Uht [1986] 2.00 Smith et al. [1989] Jouppi and Wall [1988] 2.40 Johnson [1991] 2.50 Acosta et al. [1986] 2.79 Wedig [1982] 3.00 Butler et al. [1991] 5.8 Melvin and Patt [1991] 6 Wall [1991] 7 (Jouppi disagreed) Kuck et al. [1972] 8 Riseman and Foster [1972] 51 (no control dependences) Nicolau and Fisher [1984] 90 (Fisher’s optimism) Flynn’s bottleneck: within basic block, idealized Johnson 1991:Caches, general-purpose UNIX, realistic, NT branches double scope over Flynn Riseman and Foster: next slide Nicolau/Fisher: Scientific: unroll loops, many taken branches, data parallelism, nested loops, led to VLIW (Multiflow) 11/30/2018 CSCE 432/832, Superscalar -- Instruction Flow CS252 S05

Riseman and Foster’s Study 7 benchmark programs on CDC-3600 Assume infinite machines Infinite memory and instruction stack Infinite register file Infinite functional units True dependencies only at dataflow limit If bounded to single basic block, speedup is 1.72 (Flynn’s bottleneck) If one can bypass branches (hypothetically), then: Infinite: factor out other issues; controlled experiment, find limit or upper bound Speedups are: 1.72, 2.72, 3.62, 7.21, 14.8, 24.4, 51.2 Must get past branches to get ILP Predated branch prediction (seems obvious in retrospect) Branches Bypassed 1 2 8 32 128  Speedup 11/30/2018 CSCE 432/832, Superscalar -- Instruction Flow CS252 S05

Improving I-Cache Performance Larger Cache Size More associativity Larger line size Prefetching Next-line Target Markov Code layout Other types of cache organization Trace cache [Rotenberg paper on reading list] Fewer conflicts, spatial locality Trace cache folds out branches completely 11/30/2018 CSCE 432/832, Superscalar -- Instruction Flow CS252 S05

CSCE 432/832, Superscalar -- Instruction Flow Lecture Overview Program control flow Implicit sequential control flow Disruptions of sequential control flow Branch Prediction Branch instruction processing Branch instruction speculation Key historical studies on branch prediction UCB Study [Lee and Smith, 1984] IBM Study [Nair, 1992] Branch prediction implementation (PPC 604) BTAC and BHT design Fetch Address Generation 11/30/2018 CSCE 432/832, Superscalar -- Instruction Flow CS252 S05

CSCE 432/832, Superscalar -- Instruction Flow Program Control Flow Implicit Sequential Control Flow Static Program Representation Control Flow Graph (CFG) Nodes = basic blocks Edges = Control flow transfers Physical Program Layout Mapping of CFG to linear program memory Implied sequential control flow Dynamic Program Execution Traversal of the CFG nodes and edges (e.g. loops) Traversal dictated by branch conditions Dynamic Control Flow Deviates from sequential control flow Disrupts sequential fetching Can stall IF stage and reduce I-fetch bandwidth Draw simple CFG, map to sequential, write “branch instr. deviates from implied sequential control flow” 11/30/2018 CSCE 432/832, Superscalar -- Instruction Flow CS252 S05

CSCE 432/832, Superscalar -- Instruction Flow Program Control Flow Dynamic traversal of static CFG Mapping CFG to linear memory Show common path as loop on left 11/30/2018 CSCE 432/832, Superscalar -- Instruction Flow CS252 S05

Disruption of Sequential Control Flow Penalty (increased lost opportunity cost): 1) target addr. Generation, 2) condition resolution Sketch Tyranny of Amdahl’s Law pipeline diagram 11/30/2018 CSCE 432/832, Superscalar -- Instruction Flow CS252 S05

CSCE 432/832, Superscalar -- Instruction Flow Branch Prediction Target address generation  Target Speculation Access register: PC, General purpose register, Link register Perform calculation: +/- offset, autoincrement, autodecrement Condition resolution  Condition speculation Condition code register, General purpose register Comparison of data register(s) Non-CC architecture requires actual computation to resolve condition 11/30/2018 CSCE 432/832, Superscalar -- Instruction Flow CS252 S05

Target Address Generation Decode: PC = PC + DISP (adder) (1 cycle penalty) Dispatch: PC = (R2) ( 2 cycle penalty) Execute: PC = (R2) + offset (3 cycle penalty) Both cond & uncond Guess target address? Keep history based on branch address 11/30/2018 CS252 S05

Condition Resolution 11/30/2018 Dispatch: RD ( 2 cycle penalty) Execute: Status (3 cycle penalty) Both cond & uncond Guess taken vs. not taken 11/30/2018 CS252 S05

Branch Instruction Speculation Fetch: PC = PC + 4 BTB (cache) contains k cycles of history, generates guess Execute: corrects wrong guess 11/30/2018 CSCE 432/832, Superscalar -- Instruction Flow CS252 S05

Branch/Jump Target Prediction Branch Target Buffer: small cache in fetch stage Previously executed branches, address, taken history, target(s) Fetch stage compares current FA against BTB If match, use prediction If predict taken, use BTB target When branch executes, BTB is updated Optimization: Size of BTB: increases hit rate Prediction algorithm: increase accuracy of prediction Write: 0x0348, 0101 (NTNT), 0x0612 11/30/2018 CSCE 432/832, Superscalar -- Instruction Flow CS252 S05

Branch Prediction: Condition Speculation Biased Not Taken Hardware prediction Does not affect ISA Not effective for loops Software Prediction Extra bit in each branch instruction Set to 0 for not taken Set to 1 for taken Bit set by compiler or user; can use profiling Static prediction, same behavior every time Prediction based on branch offset Positive offset: predict not taken Negative offset: predict taken Prediction based on dynamic history PowerPC ‘y’ bit: slight twist IBM Fortran anecdote: programmers got this wrong BTFN: if-then-else are 50/50 quite often, loop usually taken 11/30/2018 CSCE 432/832, Superscalar -- Instruction Flow CS252 S05

UCB Study [Lee and Smith, 1984] Benchmarks used 26 programs (IBM 370, DEC PDP-11, CDC 6400) 6 workloads (4 IBM, 1 DEC, 1 CDC) Used trace-driven simulation Branch types Unconditional: always taken or always not taken Subroutine call: always taken Loop control: usually taken Decision: either way, if-then-else Computed goto: always taken, with changing target Supervisor call: always taken Execute: always taken (IBM 370) Bottom-up hacking => no science Computed goto: C++ virtual fn calls Execute: mini subroutine call (ouch) IBM1-4: compiler, cobol, scientific, supervisor IBM1 IBM2 IBM3 IBM4 DEC CDC Avg T 0.640 0.657 0.704 0.540 0.738 0.778 0.676 NT 0.360 0.343 0.296 0.460 0.262 0.222 0.324 11/30/2018 CSCE 432/832, Superscalar -- Instruction Flow CS252 S05

Branch Prediction Function Prediction function F(X1, X2, … ) X1 – opcode type X2 – history Prediction effectiveness based on opcode only, or history IBM1 IBM2 IBM3 IBM4 DEC CDC Opcode only 66 69 71 55 80 78 History 0 64 70 54 74 History 1 92 95 87 97 82 History 2 93 91 83 98 History 3 94 84 History 4 History 5 96 F > 0.5 => pred T F <= 0.5 => pred NT Draw line under 2, diminishing returns 11/30/2018 CSCE 432/832, Superscalar -- Instruction Flow CS252 S05

Example Prediction Algorithm Hardware table remembers last 2 branch outcomes History of past several branches encoded by FSM Current state used to generate prediction Results: Subdivide state machine into T (3 states) vs. NT (1 state); highlight prediction and state names Draw FSM logic under “Info” column IBM1 IBM2 IBM3 IBM4 DEC CDC 93 97 91 83 98 11/30/2018 CSCE 432/832, Superscalar -- Instruction Flow CS252 S05

Other Prediction Algorithms Combining prediction accuracy with BTB hit rate (86.5% for 128 sets of 4 entries each), branch prediction can provide the net prediction accuracy of approximately 80%. This implies a 5-20% performance enhancement. Hysteresis: => wrong twice before prediction changes T or t? results in predict taken N or N? results in not-taken prediction 11/30/2018 CSCE 432/832, Superscalar -- Instruction Flow CS252 S05

CSCE 432/832, Superscalar -- Instruction Flow IBM Study [Nair, 1992] Branch processing on the IBM RS/6000 Separate branch functional unit Five different branch types b: unconditional branch bl: branch and link (subroutine calls) bc: conditional branch bcr: condor branch using link register (returns) bcc: conditional branch using count register Overlap of branch instructions with other instructions Zero cycle branches Two causes for branch stalls Unresolved conditions Branches downstream too close to unresolved branches Explain zero-cycle branches: if CC is known, and target can be computed and fetched (eager fetch), can have 0 pipeline delay 11/30/2018 CSCE 432/832, Superscalar -- Instruction Flow CS252 S05

Branch Instruction Distribution % of each branch type % bc with penalty cycles Benchmark b bl bc bcr 3 cyc 2 cyc 1 cyc spice2g6 7.86 0.30 12.58 0.32 13.82 3.12 0.76 doduc 1.00 0.94 8.22 1.01 10.14 1.76 2.02 matrix300 0.00 14.50 0.68 0.22 0.20 tomcatv 6.10 0.24 0.02 0.01 gcc 2.30 1.32 15.50 1.81 22.46 9.48 4.85 espresso 3.61 0.58 19.85 37.37 1.77 0.31 li 2.41 1.92 14.36 1.91 31.55 3.44 1.37 eqntott 0.91 0.47 32.87 0.51 5.01 11.01 0.80 Matrix300, tomcatv have almost no bc with penalty (loop closing branches) “control intensive” benchmarks make lots of decisions with “late” data (I.e. usually load-compare-branch) 11/30/2018 CSCE 432/832, Superscalar -- Instruction Flow CS252 S05

Exhaustive Search for Optimal 2-bit Predictor There are 220 possible state machines of 2-bit predictors Some machines are uninteresting, pruning them out reduces the number of state machines to 5248 For each benchmark, determine prediction accuracy for all the predictor state machines Find optimal 2-bit predictor for each application Spice2g6 - weird Doduc, gcc, espresso – counters Li, eqntott – near counters Write in counter values – 97.0, 94.3, 89.1, 89.1, 86.8, 87.2 11/30/2018 CS252 S05

Exhaustive Search for Optimal 2-bit Predictor There are 220 possible state machines of 2-bit predictors Some machines are uninteresting, pruning them out reduces the number of state machines to 5248 For each benchmark, determine prediction accuracy for all the predictor state machines Find optimal 2-bit predictor for each application Spice2g6 - weird Doduc, gcc, espresso – counters Li, eqntott – near counters Write in counter values – 97.0, 94.3, 89.1, 89.1, 86.8, 87.2 11/30/2018 CS252 S05

Number of History Bits Needed Prediction Accuracy (Overall CPI Overhead) Benchmark 3 bit 2 bit 1 bit 0 bit spice2g6 97.0 (0.009) 96.2 (0.013) 76.6 (0.031) doduc 94.2 (0.003) 94.3 (0.003) 90.2 (0.004) 69.2 (0.022) gcc 89.7 (0.025) 89.1 (0.026) 86.0 (0.033) 50.0 (0.128) espresso 89.5 (0.045) 89.1 (0.047) 87.2 (0.054) 58.5 (0.176) li 88.3 (0.042) 86.8 (0.048) 82.5 (0.063) 62.4 (0.142) eqntott 89.3 (0.028) 87.2 (0.033) 82.9 (0.046) 78.4 (0.049) 2- bit counter is enough, 1-bit often is not: Power4 found opposite conclusion, given fixed storage (16K x 1bit was better than 8K x 2 bits) Branch history table size: Direct-mapped array of 2k entries Some programs, like gcc, have over 7000 conditional branches In collisions, multiple branches share the same predictor Constructive interference Destructive interference Marginal gains beyond 1K entries (for these programs) 11/30/2018 CSCE 432/832, Superscalar -- Instruction Flow CS252 S05

Branch Prediction Implementation (PPC 604) 11/30/2018 CSCE 432/832, Superscalar -- Instruction Flow CS252 S05

BTAC and BHT Design (PPC 604) Fast (one cycle) and slow (two cycle) predictors BTAC: provides target address and condition (if hit) with fast turnaround; equivalent to single-bit history for condition BHT: provides 2-bit counters for better history/hysteresis 11/30/2018 CSCE 432/832, Superscalar -- Instruction Flow CS252 S05

BTAC and BHT Design (PPC 604) BTAC prediction BHT prediction Branch on count register (loops) executed early PC+disp branch target and condition computed Exceptions detected at completion Many sources for next fetch address! 11/30/2018 CSCE 432/832, Superscalar -- Instruction Flow CS252 S05