CSCE 432/832 High Performance Processor Architectures Instruction Flow & Branch Prediction Adopted from Lecture notes based in part on slides created by Mikko H. Lipasti, John Shen, Mark Hill, David Wood, Guri Sohi, and Jim Smith CS252 S05
Instruction Flow Techniques Goal of Instruction Flow and Impediments Branch Types and Implementations What’s So Bad About Branches? What are Control Dependences? Impact of Control Dependences on Performance Improving I-Cache Performance 11/30/2018 CSCE 432/832, Superscalar -- Instruction Flow CS252 S05
Instruction Flow in Context 11/30/2018 CSCE 432/832, Superscalar -- Instruction Flow CS252 S05
CSCE 432/832, Superscalar -- Instruction Flow Goal and Impediments Goal of Instruction Flow Supply processor with maximum number of useful instructions every clock cycle Impediments Branches and jumps Finite I-Cache 11/30/2018 CSCE 432/832, Superscalar -- Instruction Flow CS252 S05
Branch Types and Implementation Types of Branches Conditional or Unconditional Save PC? How is target computed? Single (fixed-range) target (immediate, PC+immediate) Multiple (varied and flexible) targets (register) Branch Architectures Condition code or condition registers Register 3-typle that describes branch type Single target – fixed range, easy to compute Multiple targets – flexible, could be tough to compute 11/30/2018 CSCE 432/832, Superscalar -- Instruction Flow CS252 S05
Branch Types and Implementation 31 CR0 CR1 CR2 CR3 CR4 CR5 CR6 CR7 PowerPC 32-bit condition register – eight 4-bit fields (CR0-CR7) CR0 can be implicit result of integer op CR1 can be implicit result of FP op Compare ops set CR fields Special CR ops manipulate bits Conditional branch instructions test CR bits 11/30/2018 CSCE 432/832, Superscalar -- Instruction Flow CS252 S05
CSCE 432/832, Superscalar -- Instruction Flow Branches – PowerPC 12 Types of Branches Branch (unconditional, no save PC, PC+imm) Branch absolute (uncond, no save PC, imm) Branch and link (uncond, save PC, PC+imm) Branch abs and link (uncond, save PC, imm) Branch conditional (conditional, no save PC, PC+imm) Branch cond abs (cond, no save PC, imm) Branch cond and link (cond, save PC, PC+imm) Branch cond abs and link (cond, save PC, imm) Branch cond to link register (cond, don’t save PC, reg) Branch cond to link reg and link (cond, save PC, reg) Branch cond to count reg (cond, don’t save PC, reg) Branch cond to count reg and link (cond, save PC, reg) 11/30/2018 CSCE 432/832, Superscalar -- Instruction Flow CS252 S05
CSCE 432/832, Superscalar -- Instruction Flow Branches – Alpha Alpha 3 Types of Branches Conditional branch (cond, no save PC, PC+imm) Bxx Ra, disp Unconditional branch (uncond, Save PC, PC+imm) Br Ra, disp Jumps (uncond, save PC, Register) J Ra Digital Equipment Corp, 1992 (?) => Compaq acquired in late 90’s => HP & Compaq merged in 2002 “Ultimate RISC” ISA, so still interesting, even though HP has announced EOL 11/30/2018 CSCE 432/832, Superscalar -- Instruction Flow CS252 S05
CSCE 432/832, Superscalar -- Instruction Flow Branches – MIPS MIPS 6 Types of Branches Jump (uncond, no save PC, imm) Jump and link (uncond, save PC, imm) Jump register (uncond, no save PC, register) Jump and link register (uncond, save PC, register) Branch (conditional, no save PC, PC+imm) Branch and link (conditional, save PC, PC+imm) 11/30/2018 CSCE 432/832, Superscalar -- Instruction Flow CS252 S05
What’s So Bad About Branches? Effects of Branches Fragmentation of I-Cache lines Need to determine branch direction Need to determine branch target Use up execution resources Direction: T/NT, -> condition resolution which is data dependent Target: generate address, fetch, potentially data dependent Execution: wastes slots throughout pipeline if not filled at top 11/30/2018 CSCE 432/832, Superscalar -- Instruction Flow CS252 S05
What’s So Bad About Branches? Problem: Fetch stalls until direction is determined Solutions: Minimize delay Move instructions determining branch condition away from branch Make use of delay Non-speculative: Fill delay slots with useful safe instructions Execute both paths (eager execution) Speculative: Predict branch direction Minimize: condition code architecture Speculate: Fetch-Decode-Dispatch-Execute: keep pipeline full of speculative instructions 11/30/2018 CSCE 432/832, Superscalar -- Instruction Flow CS252 S05
What’s So Bad About Branches? Problem: Fetch stalls until branch target is determined Solutions: Minimize delay Generate branch target early Make use of delay: Predict branch target Single target Multiple targets Minimize: extra adder, simpler addressing modes (MIPS, Alpha), special-purpose registers (CTR, LR in PowerPC) read early Multiple: Return address stack, bl-push, blr-pop (draw picture of stack) Finite size, overflow, underflow What other cause for multiple targets? Virtual function calls—empirically still have most common target, but do cause problems 11/30/2018 CSCE 432/832, Superscalar -- Instruction Flow CS252 S05
CSCE 432/832, Superscalar -- Instruction Flow Control Dependences Control Flow Graph Shows possible paths of control flow through basic blocks 11/30/2018 CSCE 432/832, Superscalar -- Instruction Flow CS252 S05
CSCE 432/832, Superscalar -- Instruction Flow Control Dependences Control Dependence Node B is CD on Node A if A determines whether B executes 11/30/2018 CSCE 432/832, Superscalar -- Instruction Flow CS252 S05
Limits on Instruction Level Parallelism (ILP) Weiss and Smith [1984] 1.58 Sohi and Vajapeyam [1987] 1.81 Tjaden and Flynn [1970] 1.86 (Flynn’s bottleneck) Tjaden and Flynn [1973] 1.96 Uht [1986] 2.00 Smith et al. [1989] Jouppi and Wall [1988] 2.40 Johnson [1991] 2.50 Acosta et al. [1986] 2.79 Wedig [1982] 3.00 Butler et al. [1991] 5.8 Melvin and Patt [1991] 6 Wall [1991] 7 (Jouppi disagreed) Kuck et al. [1972] 8 Riseman and Foster [1972] 51 (no control dependences) Nicolau and Fisher [1984] 90 (Fisher’s optimism) Flynn’s bottleneck: within basic block, idealized Johnson 1991:Caches, general-purpose UNIX, realistic, NT branches double scope over Flynn Riseman and Foster: next slide Nicolau/Fisher: Scientific: unroll loops, many taken branches, data parallelism, nested loops, led to VLIW (Multiflow) 11/30/2018 CSCE 432/832, Superscalar -- Instruction Flow CS252 S05
Riseman and Foster’s Study 7 benchmark programs on CDC-3600 Assume infinite machines Infinite memory and instruction stack Infinite register file Infinite functional units True dependencies only at dataflow limit If bounded to single basic block, speedup is 1.72 (Flynn’s bottleneck) If one can bypass branches (hypothetically), then: Infinite: factor out other issues; controlled experiment, find limit or upper bound Speedups are: 1.72, 2.72, 3.62, 7.21, 14.8, 24.4, 51.2 Must get past branches to get ILP Predated branch prediction (seems obvious in retrospect) Branches Bypassed 1 2 8 32 128 Speedup 11/30/2018 CSCE 432/832, Superscalar -- Instruction Flow CS252 S05
Improving I-Cache Performance Larger Cache Size More associativity Larger line size Prefetching Next-line Target Markov Code layout Other types of cache organization Trace cache [Rotenberg paper on reading list] Fewer conflicts, spatial locality Trace cache folds out branches completely 11/30/2018 CSCE 432/832, Superscalar -- Instruction Flow CS252 S05
CSCE 432/832, Superscalar -- Instruction Flow Lecture Overview Program control flow Implicit sequential control flow Disruptions of sequential control flow Branch Prediction Branch instruction processing Branch instruction speculation Key historical studies on branch prediction UCB Study [Lee and Smith, 1984] IBM Study [Nair, 1992] Branch prediction implementation (PPC 604) BTAC and BHT design Fetch Address Generation 11/30/2018 CSCE 432/832, Superscalar -- Instruction Flow CS252 S05
CSCE 432/832, Superscalar -- Instruction Flow Program Control Flow Implicit Sequential Control Flow Static Program Representation Control Flow Graph (CFG) Nodes = basic blocks Edges = Control flow transfers Physical Program Layout Mapping of CFG to linear program memory Implied sequential control flow Dynamic Program Execution Traversal of the CFG nodes and edges (e.g. loops) Traversal dictated by branch conditions Dynamic Control Flow Deviates from sequential control flow Disrupts sequential fetching Can stall IF stage and reduce I-fetch bandwidth Draw simple CFG, map to sequential, write “branch instr. deviates from implied sequential control flow” 11/30/2018 CSCE 432/832, Superscalar -- Instruction Flow CS252 S05
CSCE 432/832, Superscalar -- Instruction Flow Program Control Flow Dynamic traversal of static CFG Mapping CFG to linear memory Show common path as loop on left 11/30/2018 CSCE 432/832, Superscalar -- Instruction Flow CS252 S05
Disruption of Sequential Control Flow Penalty (increased lost opportunity cost): 1) target addr. Generation, 2) condition resolution Sketch Tyranny of Amdahl’s Law pipeline diagram 11/30/2018 CSCE 432/832, Superscalar -- Instruction Flow CS252 S05
CSCE 432/832, Superscalar -- Instruction Flow Branch Prediction Target address generation Target Speculation Access register: PC, General purpose register, Link register Perform calculation: +/- offset, autoincrement, autodecrement Condition resolution Condition speculation Condition code register, General purpose register Comparison of data register(s) Non-CC architecture requires actual computation to resolve condition 11/30/2018 CSCE 432/832, Superscalar -- Instruction Flow CS252 S05
Target Address Generation Decode: PC = PC + DISP (adder) (1 cycle penalty) Dispatch: PC = (R2) ( 2 cycle penalty) Execute: PC = (R2) + offset (3 cycle penalty) Both cond & uncond Guess target address? Keep history based on branch address 11/30/2018 CS252 S05
Condition Resolution 11/30/2018 Dispatch: RD ( 2 cycle penalty) Execute: Status (3 cycle penalty) Both cond & uncond Guess taken vs. not taken 11/30/2018 CS252 S05
Branch Instruction Speculation Fetch: PC = PC + 4 BTB (cache) contains k cycles of history, generates guess Execute: corrects wrong guess 11/30/2018 CSCE 432/832, Superscalar -- Instruction Flow CS252 S05
Branch/Jump Target Prediction Branch Target Buffer: small cache in fetch stage Previously executed branches, address, taken history, target(s) Fetch stage compares current FA against BTB If match, use prediction If predict taken, use BTB target When branch executes, BTB is updated Optimization: Size of BTB: increases hit rate Prediction algorithm: increase accuracy of prediction Write: 0x0348, 0101 (NTNT), 0x0612 11/30/2018 CSCE 432/832, Superscalar -- Instruction Flow CS252 S05
Branch Prediction: Condition Speculation Biased Not Taken Hardware prediction Does not affect ISA Not effective for loops Software Prediction Extra bit in each branch instruction Set to 0 for not taken Set to 1 for taken Bit set by compiler or user; can use profiling Static prediction, same behavior every time Prediction based on branch offset Positive offset: predict not taken Negative offset: predict taken Prediction based on dynamic history PowerPC ‘y’ bit: slight twist IBM Fortran anecdote: programmers got this wrong BTFN: if-then-else are 50/50 quite often, loop usually taken 11/30/2018 CSCE 432/832, Superscalar -- Instruction Flow CS252 S05
UCB Study [Lee and Smith, 1984] Benchmarks used 26 programs (IBM 370, DEC PDP-11, CDC 6400) 6 workloads (4 IBM, 1 DEC, 1 CDC) Used trace-driven simulation Branch types Unconditional: always taken or always not taken Subroutine call: always taken Loop control: usually taken Decision: either way, if-then-else Computed goto: always taken, with changing target Supervisor call: always taken Execute: always taken (IBM 370) Bottom-up hacking => no science Computed goto: C++ virtual fn calls Execute: mini subroutine call (ouch) IBM1-4: compiler, cobol, scientific, supervisor IBM1 IBM2 IBM3 IBM4 DEC CDC Avg T 0.640 0.657 0.704 0.540 0.738 0.778 0.676 NT 0.360 0.343 0.296 0.460 0.262 0.222 0.324 11/30/2018 CSCE 432/832, Superscalar -- Instruction Flow CS252 S05
Branch Prediction Function Prediction function F(X1, X2, … ) X1 – opcode type X2 – history Prediction effectiveness based on opcode only, or history IBM1 IBM2 IBM3 IBM4 DEC CDC Opcode only 66 69 71 55 80 78 History 0 64 70 54 74 History 1 92 95 87 97 82 History 2 93 91 83 98 History 3 94 84 History 4 History 5 96 F > 0.5 => pred T F <= 0.5 => pred NT Draw line under 2, diminishing returns 11/30/2018 CSCE 432/832, Superscalar -- Instruction Flow CS252 S05
Example Prediction Algorithm Hardware table remembers last 2 branch outcomes History of past several branches encoded by FSM Current state used to generate prediction Results: Subdivide state machine into T (3 states) vs. NT (1 state); highlight prediction and state names Draw FSM logic under “Info” column IBM1 IBM2 IBM3 IBM4 DEC CDC 93 97 91 83 98 11/30/2018 CSCE 432/832, Superscalar -- Instruction Flow CS252 S05
Other Prediction Algorithms Combining prediction accuracy with BTB hit rate (86.5% for 128 sets of 4 entries each), branch prediction can provide the net prediction accuracy of approximately 80%. This implies a 5-20% performance enhancement. Hysteresis: => wrong twice before prediction changes T or t? results in predict taken N or N? results in not-taken prediction 11/30/2018 CSCE 432/832, Superscalar -- Instruction Flow CS252 S05
CSCE 432/832, Superscalar -- Instruction Flow IBM Study [Nair, 1992] Branch processing on the IBM RS/6000 Separate branch functional unit Five different branch types b: unconditional branch bl: branch and link (subroutine calls) bc: conditional branch bcr: condor branch using link register (returns) bcc: conditional branch using count register Overlap of branch instructions with other instructions Zero cycle branches Two causes for branch stalls Unresolved conditions Branches downstream too close to unresolved branches Explain zero-cycle branches: if CC is known, and target can be computed and fetched (eager fetch), can have 0 pipeline delay 11/30/2018 CSCE 432/832, Superscalar -- Instruction Flow CS252 S05
Branch Instruction Distribution % of each branch type % bc with penalty cycles Benchmark b bl bc bcr 3 cyc 2 cyc 1 cyc spice2g6 7.86 0.30 12.58 0.32 13.82 3.12 0.76 doduc 1.00 0.94 8.22 1.01 10.14 1.76 2.02 matrix300 0.00 14.50 0.68 0.22 0.20 tomcatv 6.10 0.24 0.02 0.01 gcc 2.30 1.32 15.50 1.81 22.46 9.48 4.85 espresso 3.61 0.58 19.85 37.37 1.77 0.31 li 2.41 1.92 14.36 1.91 31.55 3.44 1.37 eqntott 0.91 0.47 32.87 0.51 5.01 11.01 0.80 Matrix300, tomcatv have almost no bc with penalty (loop closing branches) “control intensive” benchmarks make lots of decisions with “late” data (I.e. usually load-compare-branch) 11/30/2018 CSCE 432/832, Superscalar -- Instruction Flow CS252 S05
Exhaustive Search for Optimal 2-bit Predictor There are 220 possible state machines of 2-bit predictors Some machines are uninteresting, pruning them out reduces the number of state machines to 5248 For each benchmark, determine prediction accuracy for all the predictor state machines Find optimal 2-bit predictor for each application Spice2g6 - weird Doduc, gcc, espresso – counters Li, eqntott – near counters Write in counter values – 97.0, 94.3, 89.1, 89.1, 86.8, 87.2 11/30/2018 CS252 S05
Exhaustive Search for Optimal 2-bit Predictor There are 220 possible state machines of 2-bit predictors Some machines are uninteresting, pruning them out reduces the number of state machines to 5248 For each benchmark, determine prediction accuracy for all the predictor state machines Find optimal 2-bit predictor for each application Spice2g6 - weird Doduc, gcc, espresso – counters Li, eqntott – near counters Write in counter values – 97.0, 94.3, 89.1, 89.1, 86.8, 87.2 11/30/2018 CS252 S05
Number of History Bits Needed Prediction Accuracy (Overall CPI Overhead) Benchmark 3 bit 2 bit 1 bit 0 bit spice2g6 97.0 (0.009) 96.2 (0.013) 76.6 (0.031) doduc 94.2 (0.003) 94.3 (0.003) 90.2 (0.004) 69.2 (0.022) gcc 89.7 (0.025) 89.1 (0.026) 86.0 (0.033) 50.0 (0.128) espresso 89.5 (0.045) 89.1 (0.047) 87.2 (0.054) 58.5 (0.176) li 88.3 (0.042) 86.8 (0.048) 82.5 (0.063) 62.4 (0.142) eqntott 89.3 (0.028) 87.2 (0.033) 82.9 (0.046) 78.4 (0.049) 2- bit counter is enough, 1-bit often is not: Power4 found opposite conclusion, given fixed storage (16K x 1bit was better than 8K x 2 bits) Branch history table size: Direct-mapped array of 2k entries Some programs, like gcc, have over 7000 conditional branches In collisions, multiple branches share the same predictor Constructive interference Destructive interference Marginal gains beyond 1K entries (for these programs) 11/30/2018 CSCE 432/832, Superscalar -- Instruction Flow CS252 S05
Branch Prediction Implementation (PPC 604) 11/30/2018 CSCE 432/832, Superscalar -- Instruction Flow CS252 S05
BTAC and BHT Design (PPC 604) Fast (one cycle) and slow (two cycle) predictors BTAC: provides target address and condition (if hit) with fast turnaround; equivalent to single-bit history for condition BHT: provides 2-bit counters for better history/hysteresis 11/30/2018 CSCE 432/832, Superscalar -- Instruction Flow CS252 S05
BTAC and BHT Design (PPC 604) BTAC prediction BHT prediction Branch on count register (loops) executed early PC+disp branch target and condition computed Exceptions detected at completion Many sources for next fetch address! 11/30/2018 CSCE 432/832, Superscalar -- Instruction Flow CS252 S05