Modern Computer Architecture (现代计算机体系结构)
Instructor: Prof. Gang Zhang, School of Computer Science, Tianjin University, 2017
Slides, assignments, and discussion: http://glearning.tju.edu.cn/
Email: gzhang@tju.edu.cn
Main Contents (课程主要内容)
Chapter 1. Fundamentals of Quantitative Design and Analysis
Chapter 2. Memory Hierarchy Design
Chapter 3. Instruction-Level Parallelism and Its Exploitation
Chapter 4. Data-Level Parallelism in Vector, SIMD, and GPU Architectures
Chapter 5. Thread-Level Parallelism
Chapter 6. Warehouse-Scale Computers to Exploit Request-Level and Data-Level Parallelism
Appendix A. Pipelining: Basic and Intermediate Concepts
Instruction-Level Parallelism (ILP)
Instructions are evaluated in parallel, as in pipelining.
Two approaches to exploiting ILP:
Dynamic and hardware-dependent: Intel Pentium series, Athlon, MIPS R10000/12000, Sun UltraSPARC III, PowerPC, ...
Static and software-dependent (Appendix A, Appendix G): IA-64 / Intel Itanium, embedded processors
Instruction-Level Parallelism (ILP)
Pipeline CPI = Ideal CPI + Pipeline stall clock cycles per instruction
             = Ideal CPI + Structural stalls + Data hazard stalls + Control stalls
Ideal CPI = 1
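As a numeric sketch of the CPI equation (the per-class stall rates below are assumed values for illustration only, not measurements from the course):

```python
# Pipeline CPI = ideal CPI + stalls per instruction from each hazard class.
# The stall rates below are hypothetical, chosen only to exercise the formula.
ideal_cpi = 1.0
structural_stalls = 0.05   # structural stalls per instruction (assumed)
data_hazard_stalls = 0.15  # data hazard stalls per instruction (assumed)
control_stalls = 0.10      # control stalls per instruction (assumed)

pipeline_cpi = ideal_cpi + structural_stalls + data_hazard_stalls + control_stalls
# Any nonzero stall rate pushes the CPI above the ideal of 1.
```

The techniques in the following slides each attack one of the three stall terms.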
Visualizing Pipelining
[pipeline diagram: successive instructions overlap in the Ifetch, Reg, ALU, DMem, and write-back stages across clock cycles 1-7]
The ideal CPI on a pipelined machine is almost always 1.
Techniques to Decrease Pipeline CPI
Forwarding and bypassing
Delayed branches and simple branch scheduling
Basic dynamic scheduling (scoreboarding)
Dynamic scheduling with renaming
Branch prediction
Issuing multiple instructions per cycle
Hardware speculation
Dynamic memory disambiguation
Techniques to Decrease Pipeline CPI
Loop unrolling
Basic compiler pipeline scheduling
Compiler dependence analysis, software pipelining, trace scheduling
Hardware support for compiler speculation
Parallel and Dependent Instructions
If two instructions are parallel, they can execute simultaneously.
If two instructions are dependent, they must be executed in order.
How do we determine whether an instruction is dependent on another instruction?
Dependences
There are three different types of dependences:
Data dependences (true data dependences)
Name dependences
Control dependences
Data Dependences
An instruction j is data dependent on instruction i if either:
i produces a result that may be used by j, or
j is data dependent on instruction k, and k is data dependent on i.
Data Dependences
Loop: L.D    F0, 0(R1)     ;F0 = array element
      ADD.D  F4, F0, F2    ;add scalar in F2
      S.D    F4, 0(R1)     ;store the result
      DADDIU R1, R1, #-8   ;decrement pointer by 8 bytes
      BNE    R1, R2, LOOP  ;branch if R1 != R2
Floating-point data dependences: L.D -> ADD.D -> S.D
Integer data dependence: DADDIU -> BNE
Data Dependences
The order must be preserved for correct execution. If two instructions are data dependent, they cannot execute simultaneously or be completely overlapped.
The data dependence between DADDIU and BNE matters because the MIPS pipeline performs the branch test in the ID stage (the 2nd stage).
Data Hazards
Consider the instruction sequence:
ADD R1,R2,R3
SUB R4,R1,R3
AND R6,R1,R7
OR  R8,R1,R9
XOR R10,R1,R11
The result in R1 is not written back until after it is required by the instructions that follow.
Data Hazard on R1
[pipeline diagram: add r1,r2,r3 writes R1 in its WB stage, but sub r4,r1,r3, and r6,r1,r7, and or r8,r1,r9 read R1 in earlier cycles; xor r10,r1,r11 reads it after the write]
Pipelined Datapath Forwarding and Bypassing
Data Dependences
A data dependence conveys three things:
it indicates the possibility of a hazard,
it determines the order in which results must be calculated, and
it sets an upper bound on how much parallelism can possibly be exploited.
How to Overcome a Dependence
Maintain the dependence but avoid a hazard: code scheduling (by the compiler or by the hardware)
Eliminate a dependence by transforming the code
Eliminate a dependence by forwarding
Data Hazard Even with Forwarding
[pipeline diagram: lw r1, 0(r2) produces R1 at the end of its DMem stage, but sub r4,r1,r6 needs it at the start of its ALU stage in the same cycle, so forwarding alone cannot avoid a stall]
Data Hazard Even with Forwarding
[pipeline diagram: a one-cycle bubble is inserted after lw r1, 0(r2); sub r4,r1,r6, and r6,r1,r7, and or r8,r1,r9 are each delayed one cycle so the loaded value can be forwarded]
Name Dependence
Occurs when two instructions use the same register or memory location (name), but there is no flow of data between the instructions associated with that name. When i precedes j in program order:
Antidependence: instruction j writes a register or memory location that instruction i reads.
Output dependence: instructions i and j write the same register or memory location.
No value is transmitted between the instructions.
Register Renaming
Instructions involved in a name dependence can execute simultaneously or be reordered if the name (register number or memory location) used in the instructions is changed so that the instructions do not conflict. Renaming is especially easy for register operands and can be done statically by the compiler or dynamically by the hardware.
Register Renaming Example
Antidependence:
R3 := R3 + R5  (I1)
R4 := R3 + 1   (I2)
R3 := R5 + 1   (I3)
R7 := R3 + R4  (I4)
I3 cannot complete before I2 starts, since I2 needs the value in R3 and I3 changes R3.
Register Renaming Example
R3b := R3a + R5a  (I1)
R4b := R3b + 1    (I2)
R3c := R5a + 1    (I3)
R7b := R3c + R4b  (I4)
A name without a subscript refers to the logical register in the instruction; with a subscript, to the hardware register allocated to it. Note that R3 is mapped to three different hardware registers: R3a, R3b, R3c.
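The renaming above can be mimicked with a small table-driven sketch in Python. The tuple encoding of instructions and the a/b/c suffix scheme are illustrative assumptions, not a hardware algorithm:

```python
# Minimal register-renaming sketch: each write to a logical register gets a
# fresh hardware register; reads use the most recent mapping. Instructions
# are (dest, source...) tuples over logical names like "R3".
def rename(instrs):
    mapping = {}   # logical register -> current hardware register name
    counter = {}   # number of writes seen so far per logical register
    renamed = []
    for dest, *srcs in instrs:
        # Reads see the latest mapping; "a" denotes the initial value.
        new_srcs = [mapping.get(s, s + "a") for s in srcs]
        # Each write allocates a fresh hardware register: b, c, d, ...
        n = counter.get(dest, 0) + 1
        counter[dest] = n
        mapping[dest] = dest + "abcdefgh"[n]   # sketch: at most 7 writes
        renamed.append((mapping[dest], *new_srcs))
    return renamed

code = [("R3", "R3", "R5"),   # I1: R3 := R3 + R5
        ("R4", "R3"),         # I2: R4 := R3 + 1
        ("R3", "R5"),         # I3: R3 := R5 + 1
        ("R7", "R3", "R4")]   # I4: R7 := R3 + R4
for line in rename(code):
    print(line)
```

Running this reproduces the renamed sequence on the slide: I1 writes R3b, I3 writes R3c, and I4 correctly reads R3c and R4b.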
Hazards
A hazard is created whenever there is a dependence between instructions, and they are close enough that the overlap during execution, caused by pipelining or other reordering of instructions, would change the order of access to the operands involved in the dependence.
Three Generic Data Hazards
The goal of the software and hardware techniques in this course is to maximize ILP by preserving program order only where it affects the outcome of the program.
When instruction i occurs before instruction j in program order:
RAW (Read After Write): j tries to read a source before i writes it.
WAW (Write After Write): j tries to write an operand before it is written by i.
WAR (Write After Read): j tries to write a destination before it is read by i.
Three Generic Data Hazards
Read After Write (RAW): instruction J tries to read an operand before instruction I writes it.
I: add r1,r2,r3
J: sub r4,r1,r3
Caused by a "dependence" (in compiler nomenclature). This hazard results from an actual need for communication.
Three Generic Data Hazards
Write After Read (WAR): instruction J writes an operand before instruction I reads it.
I: sub r4,r1,r3
J: add r1,r2,r3
K: mul r6,r1,r7
Called an "antidependence" by compiler writers. This results from reuse of the name "r1".
Cannot happen in the MIPS 5-stage pipeline because all instructions take 5 stages, reads are always in stage 2, and writes are always in stage 5.
Three Generic Data Hazards
Write After Write (WAW): instruction J writes an operand before instruction I writes it.
I: sub r1,r4,r3
J: add r1,r2,r3
K: mul r6,r1,r7
Called an "output dependence" by compiler writers. This also results from reuse of the name "r1".
Cannot happen in the MIPS 5-stage pipeline because all instructions take 5 stages and writes are always in stage 5.
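All three hazard types can be detected mechanically from each instruction's read and write register sets. A sketch (the (writes, reads) set encoding is an assumption for illustration):

```python
# Classify the data hazards from instruction i to a later instruction j.
# Each instruction is a pair (written_registers, read_registers), and i
# precedes j in program order.
def hazards(i, j):
    wi, ri = i
    wj, rj = j
    found = set()
    if wi & rj:
        found.add("RAW")   # j reads a register that i writes
    if wi & wj:
        found.add("WAW")   # j writes a register that i writes
    if ri & wj:
        found.add("WAR")   # j writes a register that i reads
    return found

add_i = ({"r1"}, {"r2", "r3"})   # I: add r1,r2,r3
sub_j = ({"r4"}, {"r1", "r3"})   # J: sub r4,r1,r3
print(hazards(add_i, sub_j))     # {'RAW'}
```

Swapping the two instructions gives the WAR case from the slide above: the later add writes r1, which the earlier sub reads.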
Control Dependences
Caused by branch instructions. For example:
if (p1) { S1; }
if (p2) { S2; }
S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1.
Control Dependences
There are two constraints imposed by control dependences:
An instruction that is control dependent on a branch cannot be moved before the branch.
An instruction that is not control dependent on a branch cannot be moved after the branch.
Consider this code sequence:
DADDU R2,R3,R4
BEQZ  R2,L1
LW    R1,0(R2)
L1:
Control Dependences
Control dependence is not the critical property that must be preserved. We may violate the control dependences if we can do so without affecting the correctness of the program (e.g., with branch prediction).
Basic Compiler Techniques for Exposing ILP
Goal: to keep a pipeline full.
To avoid a pipeline stall, a dependent instruction must be separated from the source instruction by a distance in clock cycles equal to the pipeline latency of that source instruction.
Basic Pipeline Scheduling and Loop Unrolling
Latencies
Latency: the number of clock cycles needed between a producer and a consumer to avoid a stall.
Instruction producing result | Instruction using result | Latency in clock cycles
FP ALU op                    | Another FP ALU op        | 3
FP ALU op                    | Store double             | 2
Load double                  | FP ALU op                | 1
Branch: 1; integer ALU op -> branch: 1; integer load: 1; integer ALU op -> integer ALU op: 0.
Functional units are fully pipelined or replicated, so there are no structural hazards.
Example
for (i = 1000; i > 0; i = i - 1)
    x[i] = x[i] + s;
MIPS assembly code:
Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      DADDIU R1, R1, #-8
      BNE    R1, R2, LOOP
Without Any Scheduling
                           Clock cycle issued
Loop: L.D    F0, 0(R1)     1
      stall                2
      ADD.D  F4, F0, F2    3
      stall                4
      stall                5
      S.D    F4, 0(R1)     6
      DADDIU R1, R1, #-8   7
      stall                8
      BNE    R1, R2, LOOP  9
      stall                10
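The 10-cycle count can be reproduced with a small in-order, single-issue cycle counter driven by the latency table from the earlier slide. The instruction encoding and the one-cycle delay after the branch are simplifying assumptions for this sketch:

```python
# Producer/consumer latencies from the latency table (pairs not listed are 0).
LATENCY = {("fp_alu", "fp_alu"): 3, ("fp_alu", "store"): 2,
           ("load", "fp_alu"): 1, ("int_alu", "branch"): 1}

# Count issue cycles for a single-issue, in-order pipeline: each consumer
# must issue at least `latency` cycles after its producer, stalling otherwise.
# Instruction: (name, dest_or_None, source_registers, instruction_class).
def count_cycles(instrs, branch_delay=1):
    ready = {}    # register -> (issue cycle of producer, producer class)
    cycle = 0
    for name, dest, srcs, cls in instrs:
        earliest = cycle + 1
        for s in srcs:
            if s in ready:
                prod_cycle, prod_cls = ready[s]
                lat = LATENCY.get((prod_cls, cls), 0)
                earliest = max(earliest, prod_cycle + lat + 1)
        cycle = earliest
        if dest:
            ready[dest] = (cycle, cls)
    return cycle + branch_delay   # one stall cycle after the branch

loop = [("L.D",    "F0", ["R1"],       "load"),
        ("ADD.D",  "F4", ["F0", "F2"], "fp_alu"),
        ("S.D",    None, ["F4", "R1"], "store"),
        ("DADDIU", "R1", ["R1"],       "int_alu"),
        ("BNE",    None, ["R1", "R2"], "branch")]
print(count_cycles(loop))  # 10
```

The counter inserts exactly the stalls shown above: one after the L.D, two after the ADD.D, one after the DADDIU, and one after the BNE.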
With Scheduling
                           Clock cycle issued
Loop: L.D    F0, 0(R1)     1
      DADDIU R1, R1, #-8   2
      ADD.D  F4, F0, F2    3
      stall                4
      BNE    R1, R2, LOOP  5
      S.D    F4, 8(R1)     6
Filling the branch delay slot with the S.D is not trivial: the compiler must recognize that the S.D can be moved past the DADDIU and BNE, and adjust its offset (0 becomes 8 because R1 has already been decremented).
With Scheduling
The actual work of operating on the array element takes 3 of the 6 cycles (load, add, store). The remaining 3 cycles are loop overhead (DADDIU, BNE) and a stall. To eliminate these 3 cycles, we need to get more operations within the loop relative to the number of overhead instructions => loop unrolling.
Reducing Loop Overhead: Loop Unrolling
A simple scheme for increasing the number of instructions relative to the branch and overhead instructions: simply replicate the loop body multiple times, adjusting the loop termination code.
It also improves scheduling: it allows instructions from different iterations to be scheduled together, using different registers for each iteration.
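At the source level, the transformation looks like this (sketched in Python rather than MIPS; the unroll factor of 4 matches the example that follows, and the iteration count is assumed to be a multiple of 4):

```python
# Original loop vs. the same loop unrolled by 4: one copy of the loop
# overhead (index update, loop test) now serves four array elements.
def add_scalar(x, s):
    for i in range(len(x)):
        x[i] = x[i] + s

def add_scalar_unrolled(x, s):
    assert len(x) % 4 == 0       # iteration count assumed a multiple of 4
    i = 0
    while i < len(x):            # one test and one update per four elements
        x[i]     = x[i]     + s
        x[i + 1] = x[i + 1] + s
        x[i + 2] = x[i + 2] + s
        x[i + 3] = x[i + 3] + s
        i += 4

a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
b = list(a)
add_scalar(a, 10.0)
add_scalar_unrolled(b, 10.0)
print(a == b)  # True
```

The unrolled body also exposes four independent element updates that a scheduler can interleave, as the next slides show at the instruction level.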
Unrolled Loop (No Scheduling)
                            Clock cycle issued
Loop: L.D    F0, 0(R1)      1
      stall                 2
      ADD.D  F4, F0, F2     3
      stall                 4
      stall                 5
      S.D    F4, 0(R1)      6
      L.D    F6, -8(R1)     7
      stall                 8
      ADD.D  F8, F6, F2     9
      stall                 10
      stall                 11
      S.D    F8, -8(R1)     12
      L.D    F10, -16(R1)   13
      stall                 14
      ADD.D  F12, F10, F2   15
      stall                 16
      stall                 17
      S.D    F12, -16(R1)   18
      L.D    F14, -24(R1)   19
      stall                 20
      ADD.D  F16, F14, F2   21
      stall                 22
      stall                 23
      S.D    F16, -24(R1)   24
      DADDIU R1, R1, #-32   25
      stall                 26
      BNE    R1, R2, LOOP   27
      stall                 28
The intermediate DADDIU and BNE instructions have been dropped: 28 cycles for 4 elements, i.e., 7 cycles per element.
Loop Unrolling
Loop unrolling is normally done early in the compilation process, so that redundant computations can be exposed and eliminated by the optimizer. Unrolling improves the performance of the loop by eliminating overhead instructions.
Loop Unrolling (Scheduling)
                            Clock cycle issued
Loop: L.D    F0, 0(R1)      1
      L.D    F6, -8(R1)     2
      L.D    F10, -16(R1)   3
      L.D    F14, -24(R1)   4
      ADD.D  F4, F0, F2     5
      ADD.D  F8, F6, F2     6
      ADD.D  F12, F10, F2   7
      ADD.D  F16, F14, F2   8
      S.D    F4, 0(R1)      9
      S.D    F8, -8(R1)     10
      DADDIU R1, R1, #-32   11
      S.D    F12, 16(R1)    12
      BNE    R1, R2, LOOP   13
      S.D    F16, 8(R1)     14
No stalls: 14 cycles for 4 elements, i.e., 3.5 cycles per element. The last two store offsets (16 and 8) are adjusted because the DADDIU has already decremented R1 by 32.
Summary
The key to most hardware and software ILP techniques is to know when and how the ordering among instructions may be changed. This process must be performed in a methodical fashion, either by a compiler or by hardware.
Summary
To obtain the final unrolled code, we must:
- Determine that it is legal to move the S.D after the DADDIU and BNE, and find the amount by which to adjust the S.D offset.
- Determine that unrolling the loop will be useful by finding that the loop iterations are independent, except for the loop maintenance code.
- Use different registers to avoid unnecessary constraints.
- Eliminate the extra test and branch instructions and adjust the loop termination and iteration code.
Summary
- Determine that the loads and stores in the unrolled loop can be interchanged by observing that the loads and stores from different iterations are independent. This transformation requires analyzing the memory addresses and finding that they do not refer to the same address.
- Schedule the code, preserving any dependences needed to yield the same result as the original code.
Summary
Three different effects limit the gains from loop unrolling:
- A decrease in the amount of overhead amortized with each unroll
- Code size limitations (e.g., instruction cache misses)
- Compiler limitations: register pressure and complexity
Loop Unrolling I (Unoptimized, No Delayed Branch)
Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      DADDIU R1, R1, #-8
      L.D    F0, 0(R1)
      DADDIU R1, R1, #-8
      BNE    R1, R2, LOOP
The intermediate DADDIU instructions can be removed by symbolically computing the intermediate values of R1.
Loop Unrolling I (Unoptimized, No Delayed Branch)
Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      L.D    F0, -8(R1)
      ADD.D  F4, F0, F2
      S.D    F4, -8(R1)
      L.D    F0, -16(R1)
      ADD.D  F4, F0, F2
      S.D    F4, -16(R1)
      L.D    F0, -24(R1)
      ADD.D  F4, F0, F2
      S.D    F4, -24(R1)
      DADDIU R1, R1, #-32
      BNE    R1, R2, LOOP
Reusing F0 and F4 in every iteration creates name dependences alongside the true dependences; remove the name dependences using register renaming.
Loop Unrolling II (Register Renaming)
Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      L.D    F6, -8(R1)
      ADD.D  F8, F6, F2
      S.D    F8, -8(R1)
      L.D    F10, -16(R1)
      ADD.D  F12, F10, F2
      S.D    F12, -16(R1)
      L.D    F14, -24(R1)
      ADD.D  F16, F14, F2
      S.D    F16, -24(R1)
      DADDIU R1, R1, #-32
      BNE    R1, R2, LOOP
Only the true dependences (load -> add -> store within each iteration) remain.
With the renaming, the copies of each loop body become independent and can be overlapped or executed in parallel.
Problem: a potential shortfall in registers, i.e., register pressure. It arises because scheduling code to increase ILP causes the number of live values to increase, and it may not be possible to allocate all the live values to registers. The combination of unrolling and aggressive scheduling can cause this problem.
Loop unrolling is a simple but useful method for increasing the size of straight-line code fragments that can be scheduled effectively.
One Memory Port / Structural Hazards
[pipeline diagram: with a single memory port, the Load's data-memory access in cycle 4 conflicts with Instr 3's instruction fetch in the same cycle]
One Memory Port / Structural Hazards
[pipeline diagram: the conflict is resolved by stalling Instr 3 for one cycle (a bubble), delaying its fetch until the memory port is free]
Reducing Branch Costs with Prediction
Static Branch Prediction
Delayed branch? To reorder code around branches, we need to predict branch behavior statically at compile time.
There are several different methods to statically predict branch behavior. The simplest scheme is to predict every branch as taken; the average misprediction rate then equals the untaken branch frequency, 34% on the SPEC benchmarks.
Static Branch Prediction
A more accurate scheme predicts branches using profile information collected from earlier runs, modifying the prediction based on the last run.
[figure: misprediction rates of profile-based prediction on SPEC integer and floating-point benchmarks]
Dynamic Branch Prediction
Why does prediction work?
The underlying algorithm has regularities.
The data being operated on has regularities.
The instruction sequence has redundancies that are artifacts of the way humans and compilers think about problems.
Is dynamic branch prediction better than static branch prediction? It seems to be: there are a small number of important branches in programs which have dynamic behavior.
Dynamic Branch Prediction
Performance = f(accuracy, cost of misprediction)
The simplest dynamic branch-prediction scheme is a branch-prediction buffer or branch history table (BHT): the lower bits of the branch instruction's address index a table of 1-bit values that say whether or not the branch was taken last time. There is no address check, so the entry may belong to a different branch.
Dynamic Branch Prediction
Problem: for a loop branch, a 1-bit BHT causes two mispredictions per loop execution:
at the end of the loop, when it exits instead of looping as before, and
on the first iteration the next time through the loop, when it predicts exit instead of looping.
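This behavior is easy to reproduce with a one-entry 1-bit predictor; the loop pattern below (taken 9 times, then not taken, executed twice) is an illustrative assumption:

```python
# 1-bit predictor for a single branch: predict whatever the branch did last.
def simulate_1bit(outcomes, initial=True):
    pred = initial
    mispredicts = 0
    for taken in outcomes:
        if pred != taken:
            mispredicts += 1
        pred = taken           # always retrain to the last outcome
    return mispredicts

# A 10-iteration loop branch: taken 9 times, then not taken at loop exit.
one_pass = [True] * 9 + [False]
print(simulate_1bit(one_pass * 2))
# 3: the exit of pass 1, then the entry and exit of pass 2 --
# i.e., two mispredictions per loop execution in steady state.
```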
Dynamic Branch Prediction
Solution: a 2-bit scheme that changes the prediction only after two consecutive mispredictions. This adds hysteresis to the decision-making process.
[state diagram: two "predict taken" and two "predict not taken" states, with T/NT transitions between them]
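The same experiment with a 2-bit saturating counter (states 0-3; predict taken when the counter is 2 or 3) shows the benefit of hysteresis:

```python
# 2-bit saturating counter: states 0,1 predict not taken; 2,3 predict taken.
def simulate_2bit(outcomes, state=3):
    mispredicts = 0
    for taken in outcomes:
        if (state >= 2) != taken:
            mispredicts += 1
        # Saturating update toward the actual outcome.
        state = min(3, state + 1) if taken else max(0, state - 1)
    return mispredicts

one_pass = [True] * 9 + [False]   # loop branch: taken 9 times, then exit
print(simulate_2bit(one_pass * 2))
# 2: only the exit branch mispredicts each pass; the single not-taken
# outcome is not enough to flip the prediction away from "taken".
```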
BHT Accuracy
Mispredictions occur because either:
the guess was wrong for that branch, or
the index selected the branch history of a different branch that maps to the same table entry.
[figure: misprediction rates for a 4096-entry BHT on SPEC integer and floating-point benchmarks]
Correlated Branch Prediction
Idea: record the m most recently executed branches as taken or not taken, and use that pattern to select the proper n-bit branch history table.
if (aa == 2)    /* b1 */
    aa = 0;
if (bb == 2)    /* b2 */
    bb = 0;
if (aa != bb) { /* b3 */
    ...
}
DADDIU R3,R1,#-2
BNEZ   R3,L1        ;branch b1 (aa != 2)
DADD   R1,R0,R0     ;aa = 0
L1: DADDIU R3,R2,#-2
    BNEZ   R3,L2    ;branch b2 (bb != 2)
    DADD   R2,R0,R0 ;bb = 0
L2: DSUBU  R3,R1,R2 ;R3 = aa - bb
    BEQZ   R3,L3    ;branch b3 (aa == bb)
Correlated Branch Prediction
In general, an (m,n) predictor records the last m branches to select between 2^m history tables, each with n-bit counters. The old 2-bit BHT is thus a (0,2) predictor.
Global branch history: an m-bit shift register keeping the taken/not-taken status of the last m branches. Each entry in the table has 2^m n-bit predictors.
Correlating Branches
(2,2) predictor: the branch address selects an entry holding four 2-bit per-branch predictors; the behavior of the two most recent branches (a 2-bit global branch history) selects which of the four predictions to use, and only that prediction is updated.
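A runnable sketch of an (m,n) correlating predictor with m = n = 2; the table size, the weakly-taken initialization, and the modulo indexing are simplifying assumptions:

```python
# (m,n) correlating predictor sketch: an m-bit global history selects one of
# 2**m n-bit saturating counters within each branch-table entry.
class CorrelatingPredictor:
    def __init__(self, m=2, n=2, entries=1024):
        self.m, self.n, self.entries = m, n, entries
        self.history = 0                  # m-bit global history shift register
        # Each entry holds 2**m counters, initialized to weakly taken.
        self.table = [[1 << (n - 1)] * (2 ** m) for _ in range(entries)]

    def predict(self, pc):
        counter = self.table[pc % self.entries][self.history]
        return counter >= (1 << (self.n - 1))   # upper half -> predict taken

    def update(self, pc, taken):
        row = self.table[pc % self.entries]
        c = row[self.history]
        row[self.history] = min((1 << self.n) - 1, c + 1) if taken else max(0, c - 1)
        # Shift the outcome into the global history register.
        self.history = ((self.history << 1) | int(taken)) % (1 << self.m)

p = CorrelatingPredictor()
miss = 0
for taken in [True, False] * 50:   # a strictly alternating branch
    if p.predict(0) != taken:
        miss += 1
    p.update(0, taken)
print(miss)  # 1
```

A plain 2-bit BHT mispredicts an alternating branch constantly, but here the history register separates the "after taken" and "after not-taken" contexts, so after one warm-up miss every prediction is correct.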
Correlating Predictors
Branch predictors that use the behavior of other branches to make a prediction are called correlating predictors or two-level predictors.
Accuracy of Different Schemes
[figure: frequency of mispredictions on SPEC89 benchmarks (nasa7, matrix300, tomcatv, doducd, spice, fpppp, gcc, espresso, eqntott, li) for three schemes: a 4096-entry 2-bit BHT, an unlimited-entry 2-bit BHT, and a 1024-entry (2,2) BHT; the (2,2) predictor is the most accurate]
Tournament Predictors
Tournament predictors use two predictors, one based on global information and one based on local information, and combine them with a selector, in the hope of selecting the right predictor for each branch.
Tournament Predictors
[state diagram of the selector: a transition labeled 1/0 means that the first predictor was right and the second predictor was wrong, moving the selector toward the first predictor]
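The selector can be sketched as a 2-bit saturating counter trained only when the two component predictors disagree in correctness. The component predictors below (always-taken, and a 1-bit last-outcome predictor) are deliberately simple stand-ins for the global and local predictors of a real design:

```python
# Tournament predictor sketch: a 2-bit chooser decides which component
# predictor to trust; it trains only when exactly one component was right.
class AlwaysTaken:                    # stand-in "global" component
    def predict(self): return True
    def update(self, taken): pass

class LastOutcome:                    # stand-in 1-bit "local" component
    def __init__(self): self.last = True
    def predict(self): return self.last
    def update(self, taken): self.last = taken

class Tournament:
    def __init__(self, pred_a, pred_b):
        self.a, self.b = pred_a, pred_b
        self.chooser = 2              # 0,1 -> prefer B; 2,3 -> prefer A

    def predict(self):
        return self.a.predict() if self.chooser >= 2 else self.b.predict()

    def update(self, taken):
        pa, pb = self.a.predict(), self.b.predict()
        if pa == taken and pb != taken:      # "1/0": first right, second wrong
            self.chooser = min(3, self.chooser + 1)
        elif pa != taken and pb == taken:    # "0/1": second right, first wrong
            self.chooser = max(0, self.chooser - 1)
        self.a.update(taken)
        self.b.update(taken)

t = Tournament(AlwaysTaken(), LastOutcome())
miss = 0
for taken in [False] * 20:            # a branch that is never taken
    if t.predict() != taken:
        miss += 1
    t.update(taken)
print(miss)  # 2
```

After two mispredictions the chooser has shifted to the local predictor, which has already learned that the branch is never taken, so the remaining predictions are correct.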
Tournament Predictors
Figure 3.4: The misprediction rate for three different predictors on SPEC89 as the total number of bits is increased.
Assignment 5: Problem 3.1 (5th edition, English version)