Dynamic Hardware Prediction

Presentation transcript:

Dynamic Hardware Prediction

Importance of control dependences:
- Branches and jumps are frequent
- A limiting factor as ILP increases (Amdahl's law)

Schemes to attack control dependences:
- Static:
  - Basic (stall the pipeline)
  - Predict-not-taken and predict-taken
  - Delayed branch and canceling branch
- Dynamic predictors

Effectiveness of dynamic prediction schemes:
- Accuracy
- Cost

Basic Branch Prediction Buffers

a.k.a. Branch History Table (BHT): a small direct-mapped cache of taken/not-taken (T/NT) bits, indexed by the branch instruction's PC.

[Figure: the PC of the branch instruction indexes the BHT; a T entry predicts a fetch from the branch target, an NT entry predicts a fetch from PC + 4.]
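
A BHT is easy to model in software. Below is a minimal C sketch of a 1-bit, direct-mapped BHT; the table size, names, and index function are illustrative assumptions, not any particular machine's design.

    #include <stdbool.h>
    #include <stdint.h>

    #define BHT_ENTRIES 1024            /* illustrative size (power of two) */

    static uint8_t bht[BHT_ENTRIES];    /* 1 = predict taken, 0 = predict not taken */

    /* Index with the low-order bits of the (word-aligned) branch PC. */
    static unsigned bht_index(uint32_t pc) { return (pc >> 2) % BHT_ENTRIES; }

    bool bht_predict(uint32_t pc)            { return bht[bht_index(pc)] != 0; }

    void bht_update(uint32_t pc, bool taken) { bht[bht_index(pc)] = taken; }

Because the table is a small direct-mapped cache, two branches whose PCs alias to the same entry silently share the prediction bit.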

N-bit Branch Prediction Buffers

- Use an n-bit saturating counter per entry; predict taken when the counter is in the upper half of its range
- In a loop, only the loop exit causes a misprediction
- A 2-bit predictor is almost as good as any general n-bit predictor

[State diagram of the 2-bit predictor: states 11 and 10 predict taken, states 01 and 00 predict not taken; each taken outcome moves the counter toward 11, each not-taken outcome toward 00.]
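
A minimal C sketch of the update rule described by the state diagram (the function names are illustrative):

    #include <stdbool.h>
    #include <stdint.h>

    /* States 00 and 01 predict not taken; 10 and 11 predict taken. */
    bool predict_2bit(uint8_t counter) { return counter >= 2; }

    /* Saturating update: a single anomalous outcome (e.g., the loop
       exit) moves 11 -> 10 but does not flip the prediction. */
    uint8_t update_2bit(uint8_t counter, bool taken)
    {
        if (taken)
            return counter < 3 ? counter + 1 : 3;   /* saturate at 11 */
        else
            return counter > 0 ? counter - 1 : 0;   /* saturate at 00 */
    }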

Correlating Predictors

a.k.a. two-level predictors: use the recent behavior of other (previous) branches when predicting the current one.

[Figure: as in the basic BHT, the branch PC indexes the table, but a 1-bit global branch history (storing the behavior of the previous branch) selects one of two predictions (NT or T) in each entry.]

Example

        BNEZ  R1, L1        ; branch b1 (d != 0)
        ADDI  R1, R0, #1
    L1: SUBUI R3, R1, #1
        BNEZ  R3, L2        ; branch b2
    L2: ...

Basic one-bit predictor:

    d=? | b1 pred | b1 action | new b1 pred | b2 pred | b2 action | new b2 pred
     2  |   NT    |     T     |      T      |   NT    |     T     |      T
     0  |   T     |    NT     |     NT      |   T     |    NT     |     NT

With d alternating between 2 and 0, this pattern repeats: every execution of b1 and b2 is mispredicted.

One-bit predictor with one-bit correlation (each entry holds a pair of predictions, written as "prediction if the last branch was NT / prediction if it was T"):

    d=? | b1 pred | b1 action | new b1 pred | b2 pred | b2 action | new b2 pred
     2  |  NT/NT  |     T     |    T/NT     |  NT/NT  |     T     |    NT/T
     0  |  T/NT   |    NT     |    T/NT     |  NT/T   |    NT     |    NT/T
     2  |  T/NT   |     T     |    T/NT     |  NT/T   |     T     |    NT/T
     0  |  T/NT   |    NT     |    T/NT     |  NT/T   |    NT     |    NT/T

The only mispredictions occur in the first iteration; afterwards both branches are predicted correctly every time.

(m, n) Predictors

- Use the behavior of the last m branches to choose among 2^m n-bit predictors for each branch
- Simple implementation: an m-bit shift register records the behavior of the last m branches

[Figure: in an (m,n) branch prediction buffer, the PC indexes the table while the m-bit global branch history (GBH) selects one of the 2^m n-bit predictors in the entry.]
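
A minimal C sketch of an (m,n) predictor with n = 2, combining the pieces above; the sizes, names, and index function are illustrative assumptions:

    #include <stdbool.h>
    #include <stdint.h>

    #define M       2                        /* bits of global history */
    #define ENTRIES 1024                     /* table entries (illustrative) */

    static uint8_t table[ENTRIES][1 << M];   /* 2^M 2-bit counters per entry */
    static uint8_t ghist;                    /* m-bit global branch history */

    bool predict_mn(uint32_t pc)
    {
        return table[(pc >> 2) % ENTRIES][ghist] >= 2;
    }

    void update_mn(uint32_t pc, bool taken)
    {
        uint8_t *c = &table[(pc >> 2) % ENTRIES][ghist];
        if (taken  && *c < 3) (*c)++;                      /* saturating */
        if (!taken && *c > 0) (*c)--;
        ghist = ((ghist << 1) | taken) & ((1 << M) - 1);   /* shift in outcome */
    }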

Size of the Buffers

Number of bits in an (m,n) predictor: 2^m x n x (number of entries in the table).

Example, assuming an 8K-bit budget for the BHT:
- (0,1): 8K entries
- (0,2): 4K entries
- (2,2): 1K entries (2^2 x 2 x 1K = 8K)
- (12,2): 1 entry! (2^12 x 2 x 1 = 8K). Such a predictor does not use the branch address at all; it relies only on the global branch history.
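
The entry counts follow directly from the formula; a quick check in C, using the budget and configurations from the example above:

    #include <stdio.h>

    int main(void)
    {
        int budget   = 8 * 1024;                          /* 8K bits of state */
        int cfg[][2] = { {0,1}, {0,2}, {2,2}, {12,2} };   /* (m, n) pairs */

        for (int i = 0; i < 4; i++) {
            int m = cfg[i][0], n = cfg[i][1];
            /* bits = 2^m * n * entries  =>  entries = budget / (2^m * n) */
            printf("(%d,%d): %d entries\n", m, n, budget / ((1 << m) * n));
        }
        return 0;   /* prints 8192, 4096, 1024, 1 */
    }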

Performance of 2-bit Predictors

Branch-Target Buffers

- Further reduce control stalls (hopefully to 0)
- Store the predicted target address in the buffer
- Access the buffer during IF

[Figure: the fetch PC is looked up in the buffer, whose entries hold a branch PC, a T/NT prediction, and the predicted address. NO match: the instruction is not (known to be) a branch. YES: the instruction is a branch, and fetch continues at the predicted address.]
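
A minimal C sketch of a BTB lookup; the structure, names, and sizes are illustrative assumptions. A hit means the instruction is a known branch, so fetch can be redirected before the instruction is even decoded:

    #include <stdbool.h>
    #include <stdint.h>

    #define BTB_ENTRIES 256                  /* illustrative size */

    struct btb_entry {
        bool     valid;
        uint32_t branch_pc;                  /* tag: full PC of the branch */
        uint32_t target;                     /* predicted target address */
    };

    static struct btb_entry btb[BTB_ENTRIES];

    /* Called during IF. Returns true on a hit (a known branch, predicted
       taken if only taken branches are entered into the buffer); *next_pc
       is where to fetch from next. */
    bool btb_lookup(uint32_t pc, uint32_t *next_pc)
    {
        struct btb_entry *e = &btb[(pc >> 2) % BTB_ENTRIES];
        if (e->valid && e->branch_pc == pc) {
            *next_pc = e->target;
            return true;
        }
        *next_pc = pc + 4;                   /* miss: fall through as usual */
        return false;
    }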

Prediction with a BTB

IF: Send the PC to memory and to the BTB.
- Entry found in the BTB? If YES, send out the predicted address.

ID:
- If no entry was found: is the instruction a taken branch?
- If an entry was found: was the branch actually taken?

EX:
- Entry found and branch taken: the prediction was correct; continue with no stalls.
- Entry found but branch not taken (misprediction): kill the fetched instruction, restart fetch at the other target, and delete the entry from the BTB.
- No entry found but the instruction is a taken branch: update the BTB with the branch and its target.

Target Instruction Buffers

- Store target instructions instead of target addresses
- Advantages:
  - The buffer access can take longer than the time between successive IFs, and the buffer can be larger
  - Branch folding: zero-cycle unconditional branches, achieved by replacing the branch with its target instruction
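
As a hedged sketch (illustrative, not a specific machine's design), such an entry extends the BTB entry above with the instruction word found at the target:

    #include <stdbool.h>
    #include <stdint.h>

    /* Each entry caches the instruction at the target, not just its address. */
    struct tib_entry {
        bool     valid;
        uint32_t branch_pc;      /* tag: PC of the branch */
        uint32_t target_pc;      /* where fetch continues afterwards */
        uint32_t target_instr;   /* first instruction at the target */
    };

    /* Branch folding: on a hit for an unconditional branch, the fetch
       stage emits target_instr in place of the branch and continues at
       target_pc + 4, so the branch itself costs zero cycles. */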

Performance Issues

Limitations of branch prediction schemes:
- Prediction accuracy (80% - 95%), which depends on:
  - The type of program
  - The size of the buffer
- Penalty of misprediction

To reduce the penalty, fetch from both directions. The memory system should then:
- Be dual-ported, or
- Have an interleaved cache, or
- Fetch from one path and then from the other

Software Approaches to Exploiting ILP (Chapter 4)

Instruction Level Parallelism

- Potential overlap among instructions
- Few possibilities within a basic block:
  - Blocks are small (6-7 instructions)
  - Instructions are dependent
- Goal: exploit ILP across multiple basic blocks, e.g., across the iterations of a loop:

    for (i = 1000; i > 0; i = i - 1)
        x[i] = x[i] + s;

Basic Pipeline Scheduling

Find sequences of unrelated instructions. What matters:
- The compiler's ability to schedule
- The amount of ILP available in the program
- The latencies of the functional units

Latency assumptions for the examples: standard MIPS integer pipeline, no structural hazards (fully pipelined or duplicated units), and the following FP latencies:

    Instruction producing result | Instruction using result | Latency
    FP ALU op                    | FP ALU op                | 3
    FP ALU op                    | SD                       | 2
    LD                           | FP ALU op                | 1

Basic Scheduling

    for (i = 1000; i > 0; i = i - 1)
        x[i] = x[i] + s;

Sequential MIPS assembly code:

    Loop: LD    F0, 0(R1)
          ADDD  F4, F0, F2
          SD    0(R1), F4
          SUBI  R1, R1, #8
          BNEZ  R1, Loop

Pipelined execution:

    Loop: LD    F0, 0(R1)      1
          stall                2
          ADDD  F4, F0, F2     3
          stall                4
          stall                5
          SD    0(R1), F4      6
          SUBI  R1, R1, #8     7
          stall                8
          BNEZ  R1, Loop       9
          stall               10

Scheduled pipelined execution:

    Loop: LD    F0, 0(R1)      1
          SUBI  R1, R1, #8     2
          ADDD  F4, F0, F2     3
          stall                4
          BNEZ  R1, Loop       5
          SD    8(R1), F4      6

Loop Unrolling

Unrolled loop (four copies):

    Loop: LD    F0, 0(R1)
          ADDD  F4, F0, F2
          SD    0(R1), F4
          LD    F6, -8(R1)
          ADDD  F8, F6, F2
          SD    -8(R1), F8
          LD    F10, -16(R1)
          ADDD  F12, F10, F2
          SD    -16(R1), F12
          LD    F14, -24(R1)
          ADDD  F16, F14, F2
          SD    -24(R1), F16
          SUBI  R1, R1, #32
          BNEZ  R1, Loop

Scheduled unrolled loop:

    Loop: LD    F0, 0(R1)
          LD    F6, -8(R1)
          LD    F10, -16(R1)
          LD    F14, -24(R1)
          ADDD  F4, F0, F2
          ADDD  F8, F6, F2
          ADDD  F12, F10, F2
          ADDD  F16, F14, F2
          SD    0(R1), F4
          SD    -8(R1), F8
          SUBI  R1, R1, #32
          SD    16(R1), F12
          BNEZ  R1, Loop
          SD    8(R1), F16
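
At the source level the same transformation looks like the following C sketch (the function wrapper is illustrative); it assumes the trip count, 1000, is a multiple of 4:

    /* Four copies of the body per iteration: one induction-variable
       update and one branch per four elements instead of per element. */
    void add_s_unrolled(double x[], double s)
    {
        for (int i = 1000; i > 0; i -= 4) {
            x[i]     = x[i]     + s;
            x[i - 1] = x[i - 1] + s;
            x[i - 2] = x[i - 2] + s;
            x[i - 3] = x[i - 3] + s;
        }
    }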

Loop Transformations

Instruction independence is the key requirement for the transformations. In the example, the compiler must:
- Determine that it is legal to move SD after SUBI and BNEZ
- Determine that unrolling is useful (iterations are independent)
- Use different registers to avoid unnecessary constraints
- Eliminate the extra tests and branches
- Determine that LD and SD can be interchanged
- Schedule the code, preserving the semantics of the code

Dependences

If instructions are independent:
- They are parallel
- They can be reordered

Types of dependences:
- Data
- Name
- Control

Data Dependences

Instruction j is data dependent on instruction i if:
- i produces a result used by j, or
- j is data dependent on k, and k is data dependent on i

    Loop: LD    F0, 0(R1)
          ADDD  F4, F0, F2
          SD    0(R1), F4
          SUBI  R1, R1, #8
          BNEZ  R1, Loop

Dependences:
- Indicate a potential hazard (one or more RAW)
- Determine the order in which results must be produced
- Set an upper bound on the ILP

Techniques to Increase ILP

- Dependences are a property of programs
- Actual hazards are a property of the pipeline

Techniques to avoid dependence limitations:
- Maintain the dependences but avoid the hazards: code scheduling, in hardware or by software
- Eliminate the dependences by code transformations: complex, compiler-based

Example: Dependence Elimination

    Loop: LD    F0, 0(R1)
          ADDD  F4, F0, F2
          SD    0(R1), F4
          SUBI  R1, R1, #8
          LD    F6, 0(R1)
          ADDD  F8, F6, F2
          SD    0(R1), F8
          SUBI  R1, R1, #8
          LD    F10, 0(R1)
          ADDD  F12, F10, F2
          SD    0(R1), F12
          SUBI  R1, R1, #8
          LD    F14, 0(R1)
          ADDD  F16, F14, F2
          SD    0(R1), F16
          SUBI  R1, R1, #8
          BNEZ  R1, Loop

The data dependences through SUBI, LD, and SD force sequential execution of the iterations. The compiler removes this dependence by:
- Computing the intermediate R1 values (folding them into the LD/SD offsets)
- Eliminating the intermediate SUBIs
- Changing the final SUBI

This requires data flow analysis, which:
- Can be done on registers
- Cannot be done easily on memory locations: does 100(R1) refer to the same location as 20(R2)?

Name Dependences

Two instructions use the same register or memory location, but there is no flow of data between them.
- Antidependence: corresponds to a WAR hazard
- Output dependence: corresponds to a WAW hazard

To eliminate the dependence: change the name!
- Register renaming (easy): static or dynamic

Example: Name Dependences

Before renaming (F0 and F4 are reused, creating name dependences):

    Loop: LD    F0, 0(R1)
          ADDD  F4, F0, F2
          SD    0(R1), F4
          LD    F0, -8(R1)
          ADDD  F4, F0, F2
          SD    -8(R1), F4
          LD    F0, -16(R1)
          ADDD  F4, F0, F2
          SD    -16(R1), F4
          LD    F0, -24(R1)
          ADDD  F4, F0, F2
          SD    -24(R1), F4
          SUBI  R1, R1, #32
          BNEZ  R1, Loop

After register renaming:

    Loop: LD    F0, 0(R1)
          ADDD  F4, F0, F2
          SD    0(R1), F4
          LD    F6, -8(R1)
          ADDD  F8, F6, F2
          SD    -8(R1), F8
          LD    F10, -16(R1)
          ADDD  F12, F10, F2
          SD    -16(R1), F12
          LD    F14, -24(R1)
          ADDD  F16, F14, F2
          SD    -24(R1), F16
          SUBI  R1, R1, #32
          BNEZ  R1, Loop

Example: Control Dependences

    Loop: LD    F0, 0(R1)
          ADDD  F4, F0, F2
          SD    0(R1), F4
          SUBI  R1, R1, #8
          BEQZ  R1, Exit
          LD    F6, 0(R1)
          ADDD  F8, F6, F2
          SD    0(R1), F8
          SUBI  R1, R1, #8
          BEQZ  R1, Exit
          LD    F10, 0(R1)
          ADDD  F12, F10, F2
          SD    0(R1), F12
          SUBI  R1, R1, #8
          BEQZ  R1, Exit
          LD    F14, 0(R1)
          ADDD  F16, F14, F2
          SD    0(R1), F16
          SUBI  R1, R1, #8
          BNEZ  R1, Loop
    Exit:

If the trip count is a multiple of 4, the intermediate BEQZ branches are never taken: eliminate them!

Dealing with Control Stalls

Two properties of program correctness must be preserved when handling control dependences:
- Exception behavior
- Data flow

Static techniques that alleviate control stalls:
- Delayed branch scheduling reduces stalls
- Loop unrolling can reduce the number of control dependences
- Conditional execution or speculation

Loop-Level Parallelism

Analysis at the source level: look for dependences across iterations.

No loop-carried dependence (iterations are independent):

    for (i = 1000; i > 0; i = i - 1)
        x[i] = x[i] + s;

Loop-carried dependences (each iteration uses a value computed by the previous one):

    for (i = 1; i <= 100; i = i + 1) {
        x[i+1] = x[i] + z[i];   /* loop-carried dependence */
        y[i+1] = y[i] + x[i+1];
    }