Lecture on High Performance Processor Architecture (CS05162)

Presentation transcript:

Lecture on High Performance Processor Architecture (CS05162)
Value Prediction and Instruction Reuse
An Hong  han@ustc.edu.cn
Fall 2007
University of Science and Technology of China
Department of Computer Science and Technology

Outline
- What are data hazards, and how are they solved?
- What makes data speculation possible?
- Value Prediction (VP)
- Instruction Reuse (IR)

A Taxonomy of Speculative Execution Techniques
What can we speculate on?
Speculative Execution
- Control Speculation
  - Branch Direction (binary: taken / not taken)
  - Branch Target (multi-valued: anywhere in the address space)
- Data Speculation
  - Data Location
    - Aliased (binary)
    - Address (multi-valued; prefetching)
  - Data Value (multi-valued)
What makes speculation possible?

What's the Problem? Data Hazards

Time              T1  T2  T3  T4  T5  T6  T7  T8  T9
Content of $2     10  10  10  10  10  -20 -20 -20 -20
sub $2, $1, $3    IF  ID  EX  ME  WB
and $12, $2, $3       IF  ID  EX  ME  WB
or  $13, $6, $2           IF  ID  EX  ME  WB
add $14, $5, $4               IF  ID  EX  ME  WB
sw  $15, 100($6)                  IF  ID  EX  ME  WB

Data hazard: the value of $2 is needed by the and and or instructions before the sub instruction writes it back to the register file in its WB stage.

Data Dependences (also called data dependencies)
Within a basic block of a program, data dependences take the following forms:
- True data dependence: data flows between the two instructions, so a genuine dependence exists.
  - RAW (Read After Write): for instructions i and j, instruction j is RAW-dependent on instruction i if (1) j uses the result produced by i, or (2) j is RAW-dependent on i and k is RAW-dependent on j, in which case k is RAW-dependent on i (transitivity).
- False data dependence (also called name dependence): the registers or memory locations an instruction uses are called names. Two instructions use the same name, but no data flows between them. Two cases:
  - WAR (Write After Read): for instructions i and j with i executing first, j writes a name that i reads (also called anti-dependence).
  - WAW (Write After Write): instructions i and j write the same name (also called output dependence).
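To make the three cases concrete, here is a minimal sketch in Python (the names Insn and classify_dependence are illustrative, not from the lecture) that classifies the dependence of a later instruction j on an earlier instruction i from their destination and source registers:

```python
from dataclasses import dataclass

@dataclass
class Insn:
    dest: str            # register written (the "name" that is written)
    srcs: tuple = ()     # registers read (the "names" that are read)

def classify_dependence(i, j):
    """Dependences of later instruction j on earlier instruction i."""
    kinds = []
    if i.dest in j.srcs:
        kinds.append("RAW (true dependence)")
    if j.dest in i.srcs:
        kinds.append("WAR (anti-dependence)")
    if j.dest == i.dest:
        kinds.append("WAW (output dependence)")
    return kinds or ["independent"]

# add r1,r2,r3 ; sub r2,r1,r4  ->  RAW on r1 and WAR on r2
print(classify_dependence(Insn("r1", ("r2", "r3")),
                          Insn("r2", ("r1", "r4"))))
```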

Data Hazard on r1: Read After Write (RAW)
add r1, r2, r3
sub r4, r1, r3
and r6, r1, r7
or  r8, r1, r9
xor r10, r1, r11

Data Hazard on r1: Read After Write (RAW)
Dependences backwards in time are hazards.

Time (clock cycles) ->  IF   ID/RF  EX   MEM  WB
add r1, r2, r3          Im   Reg    ALU  Dm   Reg
sub r4, r1, r3               Im     Reg  ALU  Dm   Reg
and r6, r1, r7                      Im   Reg  ALU  Dm   Reg
or  r8, r1, r9                           Im   Reg  ALU  Dm   Reg
xor r10, r1, r11                              Im   Reg  ALU  Dm   Reg

Data Hazard Solution (1): Stall the Pipeline
Introduce bubbles into the pipeline (a stall): hold the dependent instructions until the instruction that causes the dependence leaves the pipeline.

Time              T1  T2  T3  T4  T5  T6  T7  T8  T9  T10 T11 T12
sub $2, $1, $3    IF  ID  EX  ME  WB
and $12, $2, $3       IF  **  **  **  ID  EX  ME  WB
or  $13, $6, $2                       IF  ID  EX  ME  WB
add $14, $5, $4                           IF  ID  EX  ME  WB
sw  $15, 100($6)                              IF  ID  EX  ME  WB

Data Hazard Solution (2): Forwarding (or Bypassing)
"Forward" the result from one pipeline stage to another.

Time (clock cycles) ->  IF   ID/RF  EX   MEM  WB
add r1, r2, r3          Im   Reg    ALU  Dm   Reg
sub r4, r1, r3               Im     Reg  ALU  Dm   Reg
and r6, r1, r7                      Im   Reg  ALU  Dm   Reg
or  r8, r1, r9                           Im   Reg  ALU  Dm   Reg
xor r10, r1, r11                              Im   Reg  ALU  Dm   Reg

The ALU result of add is fed straight to the ALU inputs of the following instructions, so no stall is needed.
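A minimal sketch of the textbook forwarding decision follows (field names such as reg_write and rd are assumptions for illustration; real hardware compares pipeline-register fields in parallel): an ALU input that reads register rs takes its value from EX/MEM first, then MEM/WB, before falling back to the register file.

```python
def forward_select(rs, ex_mem, mem_wb):
    """Return which source feeds an ALU input that reads register rs."""
    if ex_mem["reg_write"] and ex_mem["rd"] != 0 and ex_mem["rd"] == rs:
        return "EX/MEM.ALUResult"   # forward the ALU result one stage ahead
    if mem_wb["reg_write"] and mem_wb["rd"] != 0 and mem_wb["rd"] == rs:
        return "MEM/WB.WriteData"   # forward the value about to be written back
    return "RegisterFile"           # no hazard: use the value read in ID

# sub $2,$1,$3 is in MEM while and $12,$2,$3 is in EX: forward from EX/MEM
print(forward_select(2, {"reg_write": True, "rd": 2},
                        {"reg_write": False, "rd": 0}))
```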

Forwarding: What about Loads?
Dependences backwards in time are hazards, and this one can't be solved with forwarding alone: the instruction dependent on the load must be delayed/stalled.

Time (clock cycles) ->  IF   ID/RF  EX   MEM  WB
lw  r1, 0(r2)           Im   Reg    ALU  Dm   Reg
sub r4, r1, r3               Im     Reg  ALU  Dm   Reg

The loaded value only exists after MEM, one cycle too late for sub's EX stage.

Forwarding (or Bypassing): What about Loads?
Dependences backwards in time are hazards; forwarding alone cannot solve a load-use hazard, so the dependent instruction must be stalled for one cycle.

Time (clock cycles) ->  IF   ID/RF  EX      MEM  WB
lw  r1, 0(r2)           Im   Reg    ALU     Dm   Reg
sub r4, r1, r3               Im     (stall) Reg  ALU  Dm  Reg

After the one-cycle stall, the loaded value can be forwarded from MEM to sub's EX stage.
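The stall decision itself is a small check. Here is a sketch under assumed field names (mem_read, rt), not the lecture's exact logic: stall when the instruction in EX is a load whose destination register is a source of the instruction in ID.

```python
def must_stall(id_ex, next_srcs):
    """True if the instruction in ID must stall behind a load in EX."""
    # Load-use hazard: the load's data arrives after MEM, so forwarding
    # cannot deliver it in time for the next instruction's EX stage.
    return id_ex["mem_read"] and id_ex["rt"] in next_srcs

# lw r1, 0(r2) is in EX while sub r4, r1, r3 is in ID -> one-cycle stall
print(must_stall({"mem_read": True, "rt": 1}, (1, 3)))   # True
```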

Data Hazard Solution (3): Out-of-Order Execution
- Data dependences must be detected at run time.
- Precise exceptions are needed: out-of-order execution, in-order completion.

Time              T1  T2  T3  T4  T5  T6  T7  T8  T9  T10
sub $2, $1, $3    IF  ID  EX  ME  WB
add $14, $5, $4       IF  ID  EX  ME  WB
sw  $15, 100($6)          IF  ID  EX  ME  WB
and $12, $2, $3               IF  **  ID  EX  ME  WB
or  $13, $6, $2                       IF  ID  EX  ME  WB

By executing the independent add and sw ahead of and, most of the stall cycles are hidden; only one bubble remains.

Data Hazard Solution (4): Data Speculation
In a wide-issue processor (e.g., 8-12 instructions per clock cycle), the issue width is larger than a basic block (5-7 instructions):
- Multiple branches per cycle: use multiple-branch prediction (e.g., a trace cache).
- Multiple data dependence chains: very hard to execute within the same clock cycle.
Value speculation is primarily used to resolve data dependences:
- within the same clock cycle, and
- across long-latency operations (e.g., load operations).

Data Hazard Solution (4): Data Speculation
Why is speculation useful? Speculation lets all of these instructions run in parallel on a superscalar machine:
addq $3 $1 $2
addq $4 $3 $1
addq $5 $3 $2
What is value prediction? Predicting the values of instructions before they are executed.
Compare with branch prediction:
- Branch prediction eliminates control dependences; the predicted data take just two values (taken or not taken).
- Value prediction eliminates data dependences; the predicted data are drawn from a much larger range of values.

Data Hazard Solution (4): Data Speculation
Value locality: the likelihood of a previously-seen value recurring repeatedly within a storage location. It is observed in all kinds of storage locations:
- Registers
- Cache memory
- Main memory
Most work focuses on values stored in registers, to break potential data dependences: register value locality.
Why value prediction?
- The results of many instructions can be accurately predicted before they are issued or executed.
- Dependent instructions are no longer bound by the serialization constraints imposed by data dependences, so more parallelism can be exploited.
- Predicting values for dependent instructions can lead to beneficial speculative execution.

Redundant Instructions
If the dynamic instances of every static instruction generated during program execution are cached, each result-producing dynamic instruction falls into one of three types:
- New-result instructions: dynamic instructions that produce a value for the first time (<5%)
- Repeated-result instructions: dynamic instructions whose result matches that of other dynamic instances of the same static instruction (80%~90%)
- Derivable instructions: dynamic instructions whose result can be derived from previous results (<5%)
Redundant instructions = repeated-result instructions + derivable instructions.

Sources of Value Locality (sources of value predictability)
How often does the same value result from the same instruction twice in a row? Where does value locality occur?
Each instruction type rates Yes (60-80%), Somewhat (40-50%), or No (20-30%):
- Single-cycle arithmetic (e.g., addq $1 $2)
- Single-cycle logical (e.g., bis $1 $2)
- Multi-cycle arithmetic (e.g., mulq $1 $2)
- Register move (e.g., cmov $1 $2)
- Integer load (e.g., ldq $1 8($2))
- Store with base register update
- FP multiply
- FP add
- FP move
- FP load

Sources of Value Locality (sources of predictability)
- Data redundancy: text files with white space, empty cells in spreadsheets
- Error checking
- Program constants
- Computed branches
- Virtual function calls
- Glue code: allows calling from one compilation unit to another
- Addressability: pointer tables store constant addresses loaded at runtime
- Call contexts: caller-saved/callee-saved registers
- Memory alias resolution: conservative compiler assumptions about aliasing
- Register spill code
- ...

Load Value Locality

Why is Value Prediction Possible? Value Locality

Why is Value Prediction Possible?
Register value locality = (number of times each static instruction writes a register value that matches a previously-seen value for that static instruction) / (total number of dynamic register writes in the program)
- With a history depth of one: 49% on average
- With a history depth of four: 61% on average
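As an illustration of how such averages could be computed, here is a rough sketch (the trace format and function names are assumptions, not the original paper's methodology) that measures register value locality over a trace of (PC, written value) pairs for a history depth k:

```python
from collections import defaultdict, deque

def value_locality(trace, k=1):
    """trace: iterable of (pc, written_value); returns the locality fraction."""
    history = defaultdict(lambda: deque(maxlen=k))  # last k values per static insn
    hits = total = 0
    for pc, value in trace:
        if value in history[pc]:    # matches a previously-seen value?
            hits += 1
        history[pc].append(value)
        total += 1
    return hits / total if total else 0.0

# A counter has no locality at depth 1; a constant has near-total locality.
print(value_locality([(0x400, i) for i in range(100)], k=1))   # 0.0
print(value_locality([(0x404, 7) for _ in range(100)], k=1))   # 0.99
```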

Register Value Locality by Instruction Type (Table 2, Figure 3)
- Integer and floating-point double loads are the most predictable frequently-occurring instructions.
- Single-cycle instructions: fewer input operands -> higher value locality.
- Multi-cycle instructions: more input operands -> lower value locality.


Value Sequence Types
Basic sequences:
- Constant: 3 3 3 3 3 3 ... (Δ = 0). Sources: surprisingly often.
- Stride: 1 2 3 4 5 6 7 ... (Δ = 1). Sources: the most common case; an array accessed in a regular fashion, loop induction variables.
- Non-stride: 29 31 12 34 56 ...
Composing sequences:
- Repeated stride: 1 2 3 1 2 3 1 2 3 ...
- Repeated non-stride: 1 13 35 41 1 13 35 41 ... Sources: nested loops, ...

Classification of Value Predictors
Computational predictors: use previous values to compute a prediction.
- Last-value predictors: predict the previous value. A saturating counter can be used so that values are predicted only above a confidence threshold, and a new value is adopted only after it occurs several times in succession.
- Stride predictors: predict V(N) = V(N-1) + (V(N-1) - V(N-2)).
  - 2-delta: keeps two strides, s1 and s2. s1 = V(N-1) - V(N-2) is always updated, while s2 is used to compute predictions and is updated only when s1 repeats.
Context-based predictors (history-based or pattern-based predictors): match the recent value history against previous value history.
- Finite context method (FCM) predictors: a k-th order predictor uses the previous k values to make the prediction; it counts the occurrences of each value following a pattern and predicts the value with the maximum count.
Hybrid predictors: combine predictors of the above types.
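To make the computational predictors concrete, here is a minimal sketch assuming one predictor instance per static instruction (a real design would index a table by PC): a last-value predictor and a 2-delta stride predictor as described above.

```python
class LastValuePredictor:
    def __init__(self):
        self.last = None
    def predict(self):
        return self.last              # predict the previously seen value
    def update(self, actual):
        self.last = actual

class TwoDeltaStridePredictor:
    def __init__(self):
        self.last = None
        self.s1 = 0    # most recent stride: V(N-1) - V(N-2), always updated
        self.s2 = 0    # confirmed stride, used for predictions
    def predict(self):
        return None if self.last is None else self.last + self.s2
    def update(self, actual):
        if self.last is not None:
            stride = actual - self.last
            if stride == self.s1:     # stride repeated: confirm it into s2
                self.s2 = stride
            self.s1 = stride
        self.last = actual

p = TwoDeltaStridePredictor()
for v in [1, 2, 3, 4, 5]:             # a stride-1 sequence
    print(p.predict(), end=" ")       # None 1 2 4 5
    p.update(v)
```

On the stride-1 sequence, the 2-delta predictor locks onto the stride after seeing it twice and predicts correctly from then on; unlike a plain stride predictor, a single irregular value cannot disturb the confirmed stride s2.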

The Performance Measures of a Value Predictor
Three factors determine efficacy:
- Accuracy: the ability to avoid mispredictions.
- Coverage: the ability to predict as many instruction outcomes as possible.
- Scope: the set of instructions that the predictor targets.
Relationships between the factors:
- Accuracy and coverage trade off against each other.
- Narrowing the scope lowers implementation cost, can achieve better accuracy and coverage on the targeted set, and eliminates mispredictions from useless predictions.

Exploiting Value Locality
Value Prediction (VP): "predict the results of instructions based on previously seen results"
- Pipeline: Fetch -> Decode -> Issue -> Execute -> Commit, with Predict Value at the front end and Verify (re-execute if mispredicted) at commit.
Instruction Reuse (IR): "recognize that a computation chain has been previously performed and therefore need not be performed again"
- Pipeline: Fetch -> Decode -> Issue -> Execute -> Commit, with Check for previous use at decode and Verify that the arguments are the same if reused.
Notice the differences:
- In IR, there is no need to execute a reused instruction and no need to verify at commit.
- In VP, the predicted value is available to other instructions immediately.

Value Prediction
Speculative prediction of register values:
- Values are predicted during fetch and dispatch, then forwarded to dependent instructions.
- Dependent instructions can be issued and executed immediately.
- Before committing a dependent instruction, the prediction must be verified; if it was wrong, the dependent instruction must restart with the correct values.
Pipeline: Fetch -> Decode -> Issue -> Execute -> Commit, with Predict Value at the front end and Verify (if mispredicted) at commit.

Value Prediction Units
(Figure: the PC indexes the prediction structures, which decide "Should I predict?")

How to Predict Values: the Value Prediction Table (VPT)
The predictor uses two structures indexed by PC: a Classification Table (CT, holding prediction history) and a Value Prediction Table (VPT, holding value history).
Value Prediction Table (VPT):
- A cache indexed by instruction address (PC).
- Each entry maps to one or more 64-bit values, which supply the prediction.
- Values are replaced (LRU) when an instruction is first encountered or when a prediction is incorrect.
- A 32 KB cache gives 4K 8-byte entries.

Estimating Prediction Accuracy: the Classification Table (CT)
Classification Table (CT):
- A cache indexed by instruction address (PC).
- Each entry maps to a 2-bit saturating counter, incremented when the prediction is correct and decremented when it is wrong:
  0, 1 = don't use the prediction
  2 = use the prediction
  3 = use the prediction, and don't replace the stored value if wrong
- 1K entries are sufficient.
The CT decides whether the VPT's stored value is used as the predicted value.
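Putting the two tables together, here is a simplified software model (a sketch under assumptions: direct-mapped tables, a single 64-bit value per VPT entry, simple masked indexing) of how the CT gates the VPT's prediction and how both are trained on the computed result:

```python
class ValuePredictionUnit:
    def __init__(self, vpt_entries=4096, ct_entries=1024):
        self.vpt = [None] * vpt_entries   # last value per (masked) PC
        self.ct = [0] * ct_entries        # 2-bit saturating counters
        self.vpt_mask = vpt_entries - 1
        self.ct_mask = ct_entries - 1

    def predict(self, pc):
        """Return a predicted value, or None if the CT says don't predict."""
        if self.ct[pc & self.ct_mask] >= 2:       # states 2 and 3 predict
            return self.vpt[pc & self.vpt_mask]
        return None

    def update(self, pc, actual):
        """Train on the computed result once the instruction executes."""
        ci, vi = pc & self.ct_mask, pc & self.vpt_mask
        state = self.ct[ci]
        if self.vpt[vi] == actual:
            self.ct[ci] = min(3, state + 1)       # correct: count up
        else:
            self.ct[ci] = max(0, state - 1)       # wrong: count down
            if state < 3:                         # state 3 keeps its value
                self.vpt[vi] = actual             # otherwise replace it
```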

Verifying Predictions
- The predicted instruction executes normally.
- A dependent instruction cannot commit until the predicted instruction has finished executing.
- The computed result is compared to the predicted one; if they match, the dependent instructions can commit.
- If not, the dependent instructions must reissue and execute with the computed value.
- Misprediction penalty: the result is available 1 cycle later than with no prediction.

Instruction Reuse
Obtain the results of instructions from their previous executions: if the previous results are still valid, don't execute the instruction again, just commit the results!
Non-speculative, early verification:
- Previous results are read in parallel with fetch.
- The reuse test runs in parallel with decode.
- The instruction is executed only if the reuse test fails.

How to Reuse Instructions: the Reuse Buffer
- A cache indexed by instruction address (PC).
- Stores the result of the instruction along with the information needed to establish reusability: the operand register names and a pointer chain of dependent instructions.
- Assume 4K entries, 4-way set-associative. Each entry takes about 4x as much space as a VPT entry, so this is comparable to a 16K-entry VPT.

Reuse Scheme
- Entries form dependent chains of results (each entry points to the previous instruction in its chain).
- An entry is reusable only if the entries it depends on have themselves been reused (no out-of-order reuse).
- The start of a chain is reusable if its "valid" bit is set; it is invalidated when its operand registers are overwritten.
- Loads and stores need special handling.
An instruction will not be reused if:
- its inputs are not ready at reuse-test time (the decode stage), or
- its operand registers differ from those recorded.
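A coarse sketch of a reuse buffer for simple register-register instructions follows. It is assumption-laden: it uses the value-comparison flavor of the reuse test plus invalidation on register writes, and it omits the pointer-chain machinery and load/store handling described above.

```python
class ReuseBuffer:
    def __init__(self):
        self.entries = {}          # pc -> (src_regs, src_values, result)

    def try_reuse(self, pc, regfile):
        """Return the cached result if the operands still match, else None."""
        entry = self.entries.get(pc)
        if entry is None:
            return None
        src_regs, src_values, result = entry
        current = tuple(regfile[r] for r in src_regs)
        return result if current == src_values else None   # the reuse test

    def record(self, pc, src_regs, regfile, result):
        """Store an executed instruction's operands and result."""
        values = tuple(regfile[r] for r in src_regs)
        self.entries[pc] = (tuple(src_regs), values, result)

    def invalidate_on_write(self, reg):
        """Drop entries whose sources were overwritten (valid-bit clearing)."""
        self.entries = {pc: e for pc, e in self.entries.items()
                        if reg not in e[0]}

rb = ReuseBuffer()
regs = {"r1": 5, "r3": 7}
rb.record(0x40, ("r1", "r3"), regs, result=12)   # add r4, r1, r3
print(rb.try_reuse(0x40, regs))                  # 12: operands unchanged
regs["r1"] = 9
print(rb.try_reuse(0x40, regs))                  # None: reuse test fails
```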

Comparing VP and IR
- VP: "predict the results of instructions based on previously seen results"
- IR: "recognize that a computation chain has been previously performed and therefore need not be performed again"

Comparing VP and IR: which captures more redundancy?
IR cannot reuse a result when:
- the inputs aren't ready at reuse-test time,
- the same result follows from different inputs, or
- VP simply makes a lucky guess.
But the statistics show that IR still captures 80-90% of the redundancy that VP does.

Comparing VP and IR: which handles misprediction better?
IR is non-speculative, so it never mispredicts.

Comparing VP and IR: which integrates best with branches?
IR:
- Mispredicted branches are detected earlier, at reuse time.
- Instructions from mispredicted paths can still be reused (they add to the reuse buffer).
VP:
- Causes more mispredictions (if executed speculatively), or doesn't speed things up at all (if you don't speculate).

Comparing VP and IR: which is better for resource contention?
IR might not even need to execute the instruction.

Comparing VP and IR: which is better for execution latency?
VP causes some instructions to be executed twice (when their values are mispredicted); IR executes an instruction once or not at all.

Comparing VP and IR
Possible class project: can we get the best of both techniques?

Summary
- 84-97% of redundant instructions are reusable.
- In a realistic configuration on a simulated (current and near-future) PowerPC, VP gave 4.5-6.8% speedups: 3-4x more speedup than devoting the extra space to cache.
- VP's speedups vary between benchmarks (grep: 60%).
- VP's potential speedups reach up to 70% in idealized configurations and can even exceed the dataflow limit (on an idealized machine). Are these really realistic?
- Net performance: VP is better on some benchmarks, IR on others; all speedups are typically 5-10%.
- The more interesting question: can the two schemes be combined?