Dynamic Hardware Prediction Importance of control dependences Branches and jumps are frequent Limiting factor as ILP increases (Amdahl’s law) Schemes to attack control dependences Static Basic (stall the pipeline) Predict-not-taken and predict-taken Delayed branch and canceling branch Dynamic predictors Effectiveness of dynamic prediction schemes Accuracy Cost
Basic Branch Prediction Buffers a.k.a. Branch History Table (BHT) - Small direct-mapped cache of T/NT bits Branch Instruction IR: + Branch Target PC: BHT T (predict taken) NT (predict not- taken) PC + 4
N-bit Branch Prediction Buffers Use an n-bit saturating counter Only the loop exit causes a misprediction 2-bit predictor almost as good as any general n-bit predictor Predict taken Predict taken 11 10 taken not taken Predict not taken Predict not taken 00 01 2-bit Predictor
Correlating Predictors a.k.a. Two-level Predictors – Use recent behavior of other (previous) branches Branch Instruction IR: + Branch Target PC: BHT T (predict taken) NT (predict not- taken) PC + 4 1-bit global branch history: (stores behavior of previous branch) NT/T NT T
Example . . . Basic one-bit predictor BNEZ R1, L1 ; branch b1 (d!=0) ADDI R1, R0, #1 L1: SUBUI R3, R1, #1 BNEZ R3, L2 ; branch b2 L2: . . . Basic one-bit predictor d=? b1 pred b1 action new b1 pred b2 pred b2 action new b2 pred 2 NT T T NT T T 0 T NT NT T NT NT One-bit predictor with one-bit correlation d=? b1 pred b1 action new b1 pred b2 pred b2 action new b2 pred 2 NT/NT T T/NT NT/NT T NT/T 0 T/NT NT T/NT NT/T NT NT/T 2 T/NT T T/NT NT/T T NT/T 0 T/NT NT T/NT NT/T NT NT/T
(m, n) Predictors Use behavior of the last m branches 2m n-bit predictors for each branch Simple implementation Use m-bit shift register to record the behavior of the last m branches (m,n) BPF m-bit GBH PC: n-bit predictor
Size of the Buffers Number of bits in a (m,n) predictor 2m x n x Number of entries in the table Example – assume 8K bits in the BHT (0,1): 8K entries (0,2): 4K entries (2,2): 1K entries (12,2): 1 entry! Does not use the branch address Relies only on the global branch history
Performance of 2-bit Predictors
Branch-Target Buffers Further reduce control stalls (hopefully to 0) Store the predicted address in the buffer Access the buffer during IF PC T/NT Predicted address Look up = NO: instruction is not a branch YES: instruction is a branch
Prediction with BTF IF ID EX Send PC to memory and BTF NO YES Entry found in BTF? Send out predicted address Is instr a taken branch? ID NO YES Taken branch? NO YES Update BTF Kill fetched instr; restart fetch at other target delete entry from BTF; EX
Target Instruction Buffers Store target instructions instead of addresses Advantages BTB access can take longer than time between IFs and BTB can be larger Branch folding Zero-cycle unconditional branches Replace branch with target instruction
Performance Issues Limitations of branch prediction schemes Prediction accuracy (80% - 95%) Type of program Size of buffer Penalty of misprediction Fetch from both directions to reduce penalty Memory system should: Dual-ported Have an interleaved cache Fetch from one path and then from the other
Software approaches to exploiting ILP Chapter 4
Instruction Level Parallelism Potential overlap among instructions Few possibilities in a basic block Blocks are small (6-7 instructions) Instructions are dependent Goal: Exploit ILP across multiple basic blocks Iterations of a loop for (i = 1000; i > 0; i=i-1) x[i] = x[i] + s;
Basic Pipeline Scheduling Find sequences of unrelated instructions Compiler’s ability to schedule Amount of ILP available in the program Latencies of the functional units Latency assumptions for the examples Standard MIPS integer pipeline No structural hazards (fully pipelined or duplicated units Latencies of FP operations: Instruction producing result Instruction using result Latency FP ALU op 3 SD 2 LD 1
Basic Scheduling for (i = 1000; i > 0; i=i-1) x[i] = x[i] + s; Sequential MIPS Assembly Code Loop: LD F0, 0(R1) ADDD F4, F0, F2 SD 0(R1), F4 SUBI R1, R1, #8 BNEZ R1, Loop for (i = 1000; i > 0; i=i-1) x[i] = x[i] + s; Pipelined execution: Loop: LD F0, 0(R1) 1 stall 2 ADDD F4, F0, F2 3 stall 4 stall 5 SD 0(R1), F4 6 SUBI R1, R1, #8 7 stall 8 BNEZ R1, Loop 9 stall 10 Scheduled pipelined execution: Loop: LD F0, 0(R1) 1 SUBI R1, R1, #8 2 ADDD F4, F0, F2 3 stall 4 BNEZ R1, Loop 5 SD 8(R1), F4 6
Loop Unrolling Unrolled loop (four copies): Scheduled Unrolled loop: Loop: LD F0, 0(R1) ADDD F4, F0, F2 SD 0(R1), F4 LD F6, -8(R1) ADDD F8, F6, F2 SD -8(R1), F8 LD F10, -16(R1) ADDD F12, F10, F2 SD -16(R1), F12 LD F14, -24(R1) ADDD F16, F14, F2 SD -24(R1), F16 SUBI R1, R1, #32 BNEZ R1, Loop Scheduled Unrolled loop: Loop: LD F0, 0(R1) LD F6, -8(R1) LD F10, -16(R1) LD F14, -24(R1) ADDD F4, F0, F2 ADDD F8, F6, F2 ADDD F12, F10, F2 ADDD F16, F14, F2 SD 0(R1), F4 SD -8(R1), F8 SUBI R1, R1, #32 SD 16(R1), F12 BNEZ R1, Loop SD 8(R1), F16
Loop Transformations Instruction independency is the key requirement for the transformations Example Determine that is legal to move SD after SUBI and BNEZ Determine that unrolling is useful (iterations are independent) Use different registers to avoid unnecessary constrains Eliminate extra tests and branches Determine that LD and SD can be interchanged Schedule the code, preserving the semantics of the code
Dependences If instructions are independent Types of dependences They are parallel They can be reordered Types of dependences Data Name Control
Data Dependences Instruction j is data dependent on instr. i if: i produces a result used by j j is data dependent on k and k is data dependent on i Loop: LD F0, 0(R1) ADDD F4, F0, F2 SD O(R1), F4 SUBI R1, R1, 8 BNEZ R1, Loop Dependences Indicate potential hazard (one or more RAW) Determine order of results Set upper bound on ILP
Techniques to Increase ILP Dependences are a property of programs Actual hazards are a property of the pipeline Techniques to avoid dependence limitations Maintain dependences but avoid hazards Code scheduling hardware software Eliminate dependences by code transformations Complex Compiler-based
Example: Dependence Elimination Loop: LD F0, 0(R1) ADDD F4, F0, F2 SD 0(R1), F4 SUBI R1, R1, #8 LD F6, 0(R1) ADDD F8, F6, F2 SD 0(R1), F8 LD F10, 0(R1) ADDD F12, F10, F2 SD 0(R1), F12 LD F14, 0(R1) ADDD F16, F14, F2 SD 0(R1), F16 BNEZ R1, Loop Data dependencies SUBI, LD, SD Force sequential execution of iterations Compiler removes this dependency by: Computing intermediate R1 values Eliminating intermediate SUBI Changing final SUBI Data flow analysis Can do on Registers Cannot do easily on memory locations 100(R1) = 20(R2)
Name Dependences Two instructions use the same register or memory location, but there is no flow of data Antidependence Corresponds to a WAR hazard Output Dependence Corresponds to a WAW hazard To eliminate the dependence: change the name! Register renaming (easy) Static or dynamic
Example: Name Dependences Loop: LD F0, 0(R1) ADDD F4, F0, F2 SD 0(R1), F4 LD F0, -8(R1) SD -8(R1), F4 LD F0, -16(R1) SD -16(R1), F4 LD F0, -24(R1) SD -24(R1), F4 SUBI R1, R1, #32 BNEZ R1, Loop Loop: LD F0, 0(R1) ADDD F4, F0, F2 SD 0(R1), F4 LD F6, -8(R1) ADDD F8, F6, F2 SD -8(R1), F8 LD F10, -16(R1) ADDD F12, F10, F2 SD -16(R1), F12 LD F14, -24(R1) ADDD F16, F14, F2 SD -24(R1), F16 SUBI R1, R1, #32 BNEZ R1, Loop Register Renaming
Example: Control Dependences Loop: LD F0, 0(R1) ADDD F4, F0, F2 SD 0(R1), F4 SUBI R1, R1, #8 BEQZ R1, Exit LD F6, 0(R1) ADDD F8, F6, F2 SD 0(R1), F8 LD F10, 0(R1) ADDD F12, F10, F2 SD 0(R1), F12 LD F14, 0(R1) ADDD F16, F14, F2 SD 0(R1), F16 BNEZ R1, Loop Exit: Intermediate BEQZ are never taken Eliminate!
Dealing with control stalls Properties of program correctness must be preserved when handling control dependencies Exception behavior Data flow Static techniques that alleviate Delayed branch scheduling reduce stalls Loop unrolling can enable reduction in control dependences Conditional execution or speculation
Loop-Level Parallelism Analysis at the source level Dependencies across iterations for (i=1000; i>0; i=i-1) x[i] = x[i] + s; for (i=1; i<=100; i=i+1) { x[i+1] = x[i] + z[i]; /* loop-carried dependence */ y[i+1] = y[i] + x[i+1]; }