Instruction Flow Techniques

Prof. Mikko H. Lipasti
University of Wisconsin-Madison
Lecture notes based on notes by John P. Shen
Updated by Mikko Lipasti

Instruction Flow Techniques
- Goal of Instruction Flow and Impediments
- Branch Types and Implementations
- What’s So Bad About Branches?
- What are Control Dependences?
- Impact of Control Dependences on Performance
- Improving I-Cache Performance

Instruction Flow in Context

Goal and Impediments
- Goal of Instruction Flow
  - Supply processor with maximum number of useful instructions every clock cycle
- Impediments
  - Branches and jumps
  - Finite I-Cache
    - Capacity
    - Bandwidth restrictions

Branch Types and Implementation
- Types of Branches
  - Conditional or Unconditional
  - Save PC?
  - How is target computed?
    - Single target (immediate, PC+immediate)
    - Multiple targets (register)
- Branch Architectures
  - Condition code or condition registers
  - Register
Notes:
- These three properties form a 3-tuple that describes the branch type
- Single target: fixed range, easy to compute
- Multiple targets: flexible, could be tough to compute

Branch Types and Implementation
[Figure: 32-bit condition register divided into fields CR0 CR1 CR2 CR3 CR4 CR5 CR6 CR7]
- PowerPC 32-bit condition register: eight 4-bit fields (CR0-CR7)
- CR0 can be implicit result of integer op
- CR1 can be implicit result of FP op
- Compare ops set explicit CR field
- Special CR ops manipulate bits
- Conditional branch instructions test CR bits
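
The field layout above can be made concrete with a little bit arithmetic. The following is a minimal sketch, assuming PowerPC's big-endian bit numbering (CR0 sits in the most significant four bits) and the LT/GT/EQ/SO meaning that a field has after an integer compare. The function names `cr_field` and `decode_field` are hypothetical helpers for illustration, not part of any PowerPC toolchain.

```python
# Sketch only: helper names are illustrative assumptions, not from the lecture.
CR_FLAGS = ("LT", "GT", "EQ", "SO")  # bit order inside a field, most significant first


def cr_field(cr: int, n: int) -> int:
    """Return the 4-bit field CRn from a 32-bit condition register value."""
    if not 0 <= n <= 7:
        raise ValueError("CR field index must be in 0..7")
    shift = 28 - 4 * n  # CR0 occupies the most significant four bits
    return (cr >> shift) & 0xF


def decode_field(field: int) -> dict:
    """Map a 4-bit field to named flags, as set by an integer compare."""
    return {name: bool(field & (8 >> i)) for i, name in enumerate(CR_FLAGS)}
```

A conditional branch that tests "CR0 equal" would, in this model, look at `decode_field(cr_field(cr, 0))["EQ"]`.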

Branches – PowerPC: 12 Types of Branches
- Branch (unconditional, no save PC, PC+imm)
- Branch absolute (uncond, no save PC, imm)
- Branch and link (uncond, save PC, PC+imm)
- Branch abs and link (uncond, save PC, imm)
- Branch conditional (conditional, no save PC, PC+imm)
- Branch cond abs (cond, no save PC, imm)
- Branch cond and link (cond, save PC, PC+imm)
- Branch cond abs and link (cond, save PC, imm)
- Branch cond to link register (cond, don’t save PC, reg)
- Branch cond to link reg and link (cond, save PC, reg)
- Branch cond to count reg (cond, don’t save PC, reg)
- Branch cond to count reg and link (cond, save PC, reg)
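
Each entry above is described by the same three properties: conditional or not, save PC or not, and how the target is computed. A minimal sketch of how that 3-tuple determines the next PC is shown here. The names `BranchType` and `resolve` are hypothetical, the 4-byte instruction width is an assumption that holds for fixed-width 32-bit ISAs, and the sketch saves the return address whenever the link property is set; whether a not-taken linking branch still writes the link register is ISA-specific.

```python
from typing import NamedTuple, Optional, Tuple


class BranchType(NamedTuple):
    conditional: bool
    saves_pc: bool
    target: str  # one of "imm", "pc+imm", "reg"


def resolve(branch: BranchType, pc: int, imm: int = 0, reg: int = 0,
            condition: bool = True, insn_bytes: int = 4) -> Tuple[int, Optional[int]]:
    """Return (next_pc, saved_pc) for one branch under the 3-tuple model."""
    fallthrough = pc + insn_bytes
    if branch.target == "imm":
        dest = imm            # absolute target taken from the instruction
    elif branch.target == "pc+imm":
        dest = pc + imm       # PC-relative target
    elif branch.target == "reg":
        dest = reg            # target read from a register (multiple targets)
    else:
        raise ValueError("unknown target kind: " + branch.target)
    taken = condition or not branch.conditional
    saved_pc = fallthrough if branch.saves_pc else None
    return (dest if taken else fallthrough), saved_pc
```

For example, a "branch and link" at `pc=0x100` with `imm=0x20` yields `(0x120, 0x104)` in this model: control moves to the target and the fall-through address is saved for the return.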

Branches – DEC Alpha: 3 Types of Branches
- Conditional branch (cond, no save PC, PC+imm): Bxx Ra, disp
- Unconditional branch (uncond, save PC, PC+imm): Br Ra, disp
- Jumps (uncond, save PC, register): J Ra
Notes:
- Digital Equipment Corp, 1992 (?) => Compaq acquired in late 90’s => HP & Compaq merged in 2002
- “Ultimate RISC” ISA, so still interesting, even though HP has announced EOL

Branches – MIPS: 6 Types of Branches
- Jump (uncond, no save PC, imm)
- Jump and link (uncond, save PC, imm)
- Jump register (uncond, no save PC, register)
- Jump and link register (uncond, save PC, register)
- Branch (conditional, no save PC, PC+imm)
- Branch and link (conditional, save PC, PC+imm)

What’s So Bad About Branches?
Effects of Branches
- Fragmentation of I-Cache lines
- Need to determine branch direction
- Need to determine branch target
- Use up execution resources
  - Pipeline drain/fill
Notes:
- Direction: taken or not taken (T/NT); requires condition resolution, which is data dependent
- Target: generate address, fetch; potentially data dependent
- Execution: wastes slots throughout pipeline if not filled at top

Problem: Fetch stalls until direction is determined
Solutions:
- Minimize delay
  - Move instructions determining branch condition away from branch (CC architecture)
- Make use of delay
  - Non-speculative:
    - Fill delay slots with useful safe instructions
    - Execute both paths (eager execution)
  - Speculative:
    - Predict branch direction
Notes:
- Minimize: condition code architecture
- Speculate: Fetch-Decode-Dispatch-Execute: keep pipeline full of speculative instructions

Problem: Fetch stalls until branch target is determined
Solutions:
- Minimize delay
  - Generate branch target early
- Make use of delay: Predict branch target
  - Single target
  - Multiple targets
Notes:
- Minimize: extra adder, simpler addressing modes (MIPS, Alpha), special-purpose registers (CTR, LR in PowerPC) read early
- Multiple: return address stack; bl pushes, blr pops (draw picture of stack)
  - Finite size, overflow, underflow
- What other cause for multiple targets? Virtual function calls: empirically they still have a most common target, but they do cause problems
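
The return address stack mentioned above (push on bl, pop on blr, finite size with overflow and underflow) can be sketched in a few lines. This is an illustrative model only: the class name `ReturnAddressStack`, the default depth, and the policy of discarding the oldest entry on overflow are assumptions, since the notes do not specify how a particular processor handles these cases.

```python
from typing import List, Optional


class ReturnAddressStack:
    """Toy model of a finite return address stack used for target prediction."""

    def __init__(self, depth: int = 4):
        self.depth = depth
        self.entries: List[int] = []

    def push(self, return_pc: int) -> None:
        """Called on a call (e.g. bl): remember the predicted return address."""
        if len(self.entries) == self.depth:
            self.entries.pop(0)  # overflow: assumed policy, drop the oldest entry
        self.entries.append(return_pc)

    def pop(self) -> Optional[int]:
        """Called on a return (e.g. blr): predict the target, None on underflow."""
        if not self.entries:
            return None  # underflow: no prediction available
        return self.entries.pop()
```

With a depth of 2, three nested calls followed by three returns predict the two innermost returns correctly and have no prediction for the outermost one, which is the overflow and underflow behavior the notes point at.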

Control Dependences
- Control Flow Graph
  - Shows possible paths of control flow through basic blocks

Control Dependences
- Control Dependence
  - Node B is CD on Node A if A determines whether B executes
  - If path 1 from A to exit includes B, and path 2 does not, then B is control-dependent on A
Notes:
- Basic block boundaries are determined by: after a branch or before a label (branch target); the two don’t always coincide (e.g. BB3 starts without a label, BB5 starts without a branch)
- Graph theory: if all paths from A to exit include B (or all exclude B), there is no control dependence (A and B are control independent)
- If some path from A to exit includes B, and some other path does not, B is control dependent on A
- Draw CD edges from BB1 to {BB2, BB3, BB4, BB5}; from BB2 to {BB3, BB4}; from BB5 to {BB2, BB3, BB4, BB5}
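
The path-based definition above translates directly into two reachability checks. The sketch below is a minimal illustration, not the lecture's method: the function names are hypothetical, and the example graph is an assumed if-then-else diamond rather than the BB1-BB5 graph from the slide, whose edges are not given in the text. "Includes B" is read as "B is reachable after leaving A", and "does not include B" as "exit is reachable after leaving A with B removed from the graph".

```python
from typing import Dict, Iterable, List, Optional, Set

CFG = Dict[str, List[str]]  # basic block name -> successor block names


def reachable(cfg: CFG, starts: Iterable[str], blocked: Optional[str] = None) -> Set[str]:
    """Blocks reachable from `starts`, never entering the `blocked` block."""
    seen: Set[str] = set()
    stack = [s for s in starts if s != blocked]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(n for n in cfg.get(node, []) if n != blocked)
    return seen


def control_dependent(cfg: CFG, a: str, b: str, exit_node: str) -> bool:
    """True if some path from a to exit includes b and some other path does not."""
    succs = cfg.get(a, [])
    some_path_includes_b = b in reachable(cfg, succs)
    some_path_avoids_b = exit_node in reachable(cfg, succs, blocked=b)
    return some_path_includes_b and some_path_avoids_b


# Hypothetical diamond: A branches to B or C, both rejoin at D.
DIAMOND: CFG = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["EXIT"], "EXIT": []}
```

In the diamond, B and C are control dependent on A, while D is not, because every path from A to the exit passes through D.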

Limits on Instruction Level Parallelism (ILP)
Study                        ILP     Comment
Weiss and Smith [1984]       1.58
Sohi and Vajapeyam [1987]    1.81
Tjaden and Flynn [1970]      1.86    Flynn’s bottleneck
Tjaden and Flynn [1973]      1.96
Uht [1986]                   2.00
Smith et al. [1989]
Jouppi and Wall [1988]       2.40
Johnson [1991]               2.50
Acosta et al. [1986]         2.79
Wedig [1982]                 3.00
Butler et al. [1991]         5.8
Melvin and Patt [1991]       6
Wall [1991]                  7       Jouppi disagreed
Kuck et al. [1972]           8
Riseman and Foster [1972]    51      no control dependences
Nicolau and Fisher [1984]    90      Fisher’s optimism
Notes:
- Flynn’s bottleneck: within basic block, idealized
- Johnson 1991: caches, general-purpose UNIX, realistic; not-taken (NT) branches double scope over Flynn
- Riseman and Foster: next slide
- Nicolau/Fisher: scientific code: unroll loops, many taken branches, data parallelism, nested loops; led to VLIW (Multiflow)

Riseman and Foster’s Study
- 7 benchmark programs on CDC-3600
- Assume infinite machines
  - Infinite memory and instruction stack
  - Infinite register file
  - Infinite functional units
  - True dependencies only at dataflow limit
- If bounded to single basic block, speedup is 1.72 (Flynn’s bottleneck)
- If one can bypass n branches (hypothetically), then:

Branches bypassed   0      1      2      8      32     128    ∞
Speedup             1.72   2.72   3.62   7.21   14.8   24.4   51.2

Notes:
- Infinite: factor out other issues; controlled experiment, find limit or upper bound
- Must get past branches to get ILP
- Predated branch prediction (seems obvious in retrospect)

Speculative Execution
- Riseman & Foster showed potential
  - But no idea how to reap benefit
- 1979: Jim Smith patents branch prediction at Control Data
  - Predict current branch based on past history
- Today: virtually all processors use branch prediction
© 2005 Mikko Lipasti
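
Predicting a branch from its own past history can be illustrated with a table of 2-bit saturating counters. This is a common textbook scheme chosen here as an assumption; the slide does not say which scheme is meant. The class name, table size, PC hashing, and initial counter value are all hypothetical choices for the sketch.

```python
from typing import Iterable, List


class TwoBitPredictor:
    """Table of 2-bit saturating counters indexed by low-order PC bits."""

    def __init__(self, entries: int = 16):
        self.entries = entries
        self.counters: List[int] = [1] * entries  # 0-1 predict not taken, 2-3 predict taken

    def _index(self, pc: int) -> int:
        return (pc >> 2) % self.entries  # assumes 4-byte aligned instructions

    def predict(self, pc: int) -> bool:
        return self.counters[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def count_mispredictions(pc: int, outcomes: Iterable[bool]) -> int:
    """Run one branch through a fresh predictor and count wrong guesses."""
    predictor = TwoBitPredictor()
    wrong = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            wrong += 1
        predictor.update(pc, taken)
    return wrong
```

For a loop branch that is taken nine times and then falls through, this model mispredicts twice: once while warming up and once at loop exit.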

Improving I-Cache Performance
- Larger cache size
  - Code compression
  - Instruction registers
- Increased associativity
  - Conflict misses less of a problem than in data caches
- Larger line size
  - Spatial locality inherent in sequential program I-stream
- Code layout
  - Maximize instruction stream’s spatial locality
- Cache prefetching
  - Next-line, streaming buffer
  - Branch target (even if not taken)
- Other types of I-cache organization
  - Trace cache [Ch. 9]
Notes:
- Fewer conflicts, spatial locality
- Trace cache folds out branches completely
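
The benefit of next-line prefetching on a sequential instruction stream can be seen in a toy miss counter. This is a rough sketch under stated assumptions: a direct-mapped cache, a prefetch of line N+1 issued on every access to line N, and prefetches that complete instantly. The function name `count_misses` and all sizes are hypothetical, and real prefetchers differ in when they trigger and how long a prefetch takes.

```python
from typing import Iterable, List, Optional


def count_misses(addresses: Iterable[int], num_lines: int = 4, line_bytes: int = 16,
                 next_line_prefetch: bool = False) -> int:
    """Count demand misses for a fetch address stream in a direct-mapped I-cache."""
    tags: List[Optional[int]] = [None] * num_lines
    misses = 0
    for addr in addresses:
        line = addr // line_bytes
        if tags[line % num_lines] != line:
            misses += 1                      # demand miss: fetch the line
            tags[line % num_lines] = line
        if next_line_prefetch:
            nxt = line + 1                   # assumed: prefetch completes instantly
            tags[nxt % num_lines] = nxt
    return misses
```

Fetching 64 sequential 4-byte instructions touches 16 lines, so the model reports 16 misses without prefetching and a single cold miss with it, which reflects the spatial locality of a sequential I-stream.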

Recap
- Branch types
- Control dependences
- Improving instruction cache performance
