Abstraction Question: General-purpose processors have an abstraction layer fixed at the ISA and have little control over the compilers or code run on the machine.

Presentation transcript:

Abstraction Question
General-purpose processors have an abstraction layer fixed at the ISA, and their designers have little control over the compilers or the code run on the machine. Embedded processor designers tend to control everything: the code that runs, the compiler (if any), the ISA, and the hardware, so breaking the abstraction layer makes sense. The bottom line is who controls which elements of the design.

Intel side note
We have been working with MIPS. How did Intel do this for the CISC x86 ISA? With micro-ops, from the Pentium Pro onward.

Branch Hazards, or “Which way did he go?”
Peer Instruction Lecture Materials for Computer Architecture by Dr. Leo Porter is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
Read 5.1, 5.2.

Control Dependence
Just as an instruction depends on other instructions to provide its operands (data dependence), it also depends on other instructions to determine whether it gets executed at all (control dependence or branch dependence). Control dependences are particularly critical with conditional branches.
    add $5, $3, $2
    sub $6, $5, $2
    beq $6, $7, somewhere
    and $9, $6, $1
    ...
somewhere:
    or  $10, $5, $2
    add $12, $11, $9
    ...

Dealing With Branch Hazards
Hardware
- stall until you know which direction
- reduce the hazard through earlier computation of the branch direction
- guess which direction
  - assume not taken (easiest)
  - make a more educated guess based on history (requires that you know it is a branch before it is even decoded!)
Hardware/Software
- no-ops, or instructions that get executed either way (delayed branch)

Stalling the pipeline
Given our current pipeline, let’s assume we stall until we know the branch outcome. How many cycles will you lose per branch?
Selection: cycles
A: 0
B: 1
C: 2
D: 3
E: 4

Stalling for Branch Hazards
[Pipeline diagram, CC1 to CC8: beq $4, $0, there is followed by and $12, $2, $5; or ...; add ...; sw ..., with pipeline bubbles inserted after the branch.]

Stalling for Branch Hazards
Seems wasteful, particularly when the branch isn’t taken. Makes all branches cost 4 cycles. What if we just assume branches aren’t taken? Why?

Assume Branch Not Taken
[Pipeline diagram, CC1 to CC8: beq $4, $0, there; and $12, $2, $5; or ...; add ...; sw ... proceed through the pipeline without bubbles.]
Works pretty well when you’re right.

Assume Branch Not Taken
[Pipeline diagram, CC1 to CC8: beq $4, $0, there is taken; the wrong-path instructions (and $12, $2, $5; or ...; add ...) are flushed, and execution continues at there: sub $12, $4, $2.]
Same performance as stalling when you’re wrong.

Stalling the pipeline
Let’s improve the pipeline so we move branch resolution to Decode. How many cycles would we lose then on a taken branch?
Selection: cycles
A: 0
B: 1
C: 2
D: 3
E: 4
(Add drawing of resolving in Decode.)

The Pipeline with flushing for taken branches
Notice the IF/ID flush line added.

Branch Hazards: Assume Not Taken
Great if most of your branches aren’t taken. What about loops, which are taken 95% of the time?
- We would like the option of assuming not taken for some branches and taken for others, depending on ???
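To put rough numbers on the 95% case, here is a minimal sketch that estimates the average cycles lost per branch under three policies. The function name, the policy labels, and the one-cycle penalty (a branch resolved in Decode) are assumptions made for illustration; the "assume_taken" case further assumes the branch target is already known at fetch.

```python
def cycles_lost_per_branch(policy, taken_fraction, penalty):
    """Average cycles lost per branch for a simple fixed policy.

    penalty is the assumed cost, in cycles, of waiting for the branch
    or of fetching down the wrong path until the branch resolves.
    """
    if policy == "stall":
        # Always wait for the outcome, whether or not the branch is taken.
        return float(penalty)
    if policy == "assume_not_taken":
        # Only taken branches fetch wrong-path instructions.
        return taken_fraction * penalty
    if policy == "assume_taken":
        # Only not-taken branches fetch wrong-path instructions.
        return (1.0 - taken_fraction) * penalty
    raise ValueError(f"unknown policy: {policy}")


# A loop branch taken 95% of the time, with an assumed 1-cycle penalty.
for policy in ("stall", "assume_not_taken", "assume_taken"):
    print(policy, round(cycles_lost_per_branch(policy, 0.95, 1), 2))
```

Under these assumptions, assuming not taken saves almost nothing over stalling for a loop branch, which is why the choice should depend on the branch.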

Branch Hazards: Predicting Taken
[Pipeline diagram, CC1 to CC8: beq $2, $1, here is followed directly by the instruction at its target, here: lw.]
Reading quiz. Required information to predict taken:
1. An instruction is a branch, known before decode
2. The target of the branch
3. The outcome of the branch
Selection: required knowledge
A: 2, 3
B: 1, 2, 3
C: 1, 2
D: 2
E: None of the above

Branch Target Buffer
Keeps track of the PCs of recently seen branches and their targets. Consult it during Fetch (in parallel with the Instruction Memory read) to determine:
- Is this a branch?
- If so, what is the target?
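The two questions above map naturally onto a small lookup table. The sketch below models a direct-mapped Branch Target Buffer in Python; the class name, the 16-entry size, and the indexing by word-aligned PC are illustrative assumptions, not details taken from the slides.

```python
class BranchTargetBuffer:
    """Toy direct-mapped BTB: each entry stores (branch PC, target PC)."""

    def __init__(self, entries=16):
        self.entries = entries
        self.table = {}  # index -> (tag_pc, target_pc)

    def _index(self, pc):
        # Assumes 4-byte instructions, so drop the two low bits first.
        return (pc >> 2) % self.entries

    def lookup(self, pc):
        """Return the predicted target if pc is a known branch, else None."""
        entry = self.table.get(self._index(pc))
        if entry is not None and entry[0] == pc:
            return entry[1]
        return None

    def update(self, pc, target):
        """Record a branch once it resolves, evicting any aliasing entry."""
        self.table[self._index(pc)] = (pc, target)


btb = BranchTargetBuffer()
print(btb.lookup(0x400))        # None: not yet seen, so treated as a non-branch
btb.update(0x400, 0x3F0)
print(hex(btb.lookup(0x400)))   # 0x3f0: a branch, and here is its target
```

A miss answers "is this a branch?" with "not as far as we know", so fetch simply continues at PC + 4.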

Branch Hazards: Predict Taken
Static policy:
- Forward branches (if statements): predict not taken
- Backward branches (loops): predict taken
Dynamic prediction (coming soon)
First: Branch Delay Slots
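The static policy above amounts to comparing the branch target with the branch's own address. A minimal sketch follows, where the function name and the example addresses are hypothetical:

```python
def predict_taken_static(branch_pc, target_pc):
    """Static policy: backward branches (loops) are predicted taken,
    forward branches (if statements) are predicted not taken."""
    return target_pc < branch_pc


# Hypothetical addresses: a loop branch jumping back, and an if jumping ahead.
print(predict_taken_static(branch_pc=0x120, target_pc=0x100))  # True
print(predict_taken_static(branch_pc=0x120, target_pc=0x140))  # False
```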

Eliminating the Branch Stall
There’s no rule that says we have to see the effect of the branch immediately. Why not wait an extra instruction before branching? The original SPARC and MIPS processors each used a single branch delay slot to eliminate single-cycle stalls after branches. The instruction after a conditional branch is always executed in those machines, regardless of whether the branch is taken or not!

Branch Delay Slot
[Pipeline diagram, CC1 to CC8: beq $4, $0, there; and $12, $2, $5 in the delay slot; then there: or ...; add ...; sw ...]
The branch delay slot instruction (the next instruction after a branch) is executed even if the branch is taken.

Filling the branch delay slot
1  add  $5, $3, $7
2  add  $9, $1, $3
3  sub  $6, $1, $4
4  and  $7, $8, $2
5  beq  $6, $7, there
   nop                /* branch delay slot */
6  add  $9, $1, $2
7  sub  $2, $9, $5
   ...
there:
8  mult $2, $10, $11
   ...
Selection: safe instructions
A: 1, 2
B: 2, 6
C: 6, 8
D: 1, 2, 7, 8
E: None of the above
Notes per instruction:
1: No (R7, WAR)
2: Safe, $1 and $3 are fine
3: No (R6)
4: No (R7)
6: Not safe ($9 on nt path)
7: Not safe (needs $9 not yet produced)
8: Not safe ($2 is used before overwritten)
E is the correct answer.
* It is not safe to assume anything about the … code

Filling the branch delay slot
The branch delay slot is only useful if you can find something to put there. If you can’t find anything, you must insert a no-op to ensure correctness.

Branch Delay Slots
This works great for this implementation of the architecture, but it becomes a permanent part of the ISA. What about the MIPS R10000, which has a 5-cycle branch penalty and executes 4 instructions per cycle? Bottom line: the delay slot exposes a detail of the hardware implementation to the ISA.

Dynamic Branch Prediction
Can we guess the outcome of branches? What should we base that guess on?

Branch Prediction
[Diagram: the program counter indexes a table of prediction bits.]
for (i=0;i<10;i++) { ... }
...
add $i, $i, #1
beq $i, #10, loop
Accuracy? Too easily swayed.

Two-bit predictors give better loop prediction
for (i=0;i<10;i++) { ... }
...
add $i, $i, #1
beq $i, #10, loop
States:
11 Strongly Taken
10 Weakly Taken
01 Weakly Not Taken
00 Strongly Not Taken
Increment when taken; decrement when not taken.
[Diagram: the branch address indexes the PHT; the entry shown holds 00.]
Better (less sway), but slower learning.
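As a rough illustration of "less sway, slower learning", the sketch below models one two-bit saturating counter and counts mispredictions for a 10-iteration loop that is executed several times. The class and function names are illustrative, and the loop branch is assumed to be taken on every iteration except the last.

```python
class TwoBitCounter:
    """Saturating counter: 0 = Strongly Not Taken ... 3 = Strongly Taken."""

    def __init__(self, state=0):
        self.state = state

    def predict(self):
        # States 10 and 11 (2 and 3) predict taken.
        return self.state >= 2

    def update(self, taken):
        # Increment when taken, decrement when not taken, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def loop_mispredictions(counter, executions, iterations=10):
    """Count mispredictions of the loop branch over repeated loop executions."""
    misses = 0
    for _ in range(executions):
        for i in range(iterations):
            taken = i < iterations - 1  # assumed: falls through only on exit
            if counter.predict() != taken:
                misses += 1
            counter.update(taken)
    return misses


# Starting at 00, the counter needs two taken branches to warm up
# (slower learning), then misses only once per loop execution.
print(loop_mispredictions(TwoBitCounter(0), executions=3))
```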

Suppose we have the following branch patterns for 3 branches (A, B, C). What is the accuracy of a 1-bit and a 2-bit Branch History Table? Assume initial values of 1 (1-bit) and 10 (2-bit).
2-bit states:
11 Strongly Taken
10 Weakly Taken
01 Weakly Not Taken
00 Strongly Not Taken
Increment when taken; decrement when not taken.
Patterns (the same for the 1-bit and 2-bit cases):
A. T T T T N
B. T T N T N
C. N T N T N
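One way to check an answer to this exercise is to simulate it. The sketch below assumes each branch has its own table entry (no aliasing between A, B, and C) and that an n-bit entry predicts taken when its value is in the upper half of its range; the function name is illustrative.

```python
def correct_predictions(outcomes, bits, initial):
    """Count correct predictions of one n-bit saturating counter."""
    max_state = (1 << bits) - 1
    threshold = 1 << (bits - 1)  # predict taken at or above this value
    state = initial
    correct = 0
    for outcome in outcomes:
        taken = outcome == "T"
        if (state >= threshold) == taken:
            correct += 1
        # Increment when taken, decrement when not taken, with saturation.
        state = min(max_state, state + 1) if taken else max(0, state - 1)
    return correct


patterns = {"A": "TTTTN", "B": "TTNTN", "C": "NTNTN"}
total = sum(len(p) for p in patterns.values())

one_bit = {name: correct_predictions(p, bits=1, initial=1)
           for name, p in patterns.items()}
two_bit = {name: correct_predictions(p, bits=2, initial=0b10)
           for name, p in patterns.items()}

print("1-bit:", one_bit, f"{sum(one_bit.values())}/{total}")
print("2-bit:", two_bit, f"{sum(two_bit.values())}/{total}")
```

With a 1-bit entry the counter simply remembers the last outcome, which is why the alternating pattern C is the worst case for it.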

Modern Branch Prediction: Pentium 4
Performance depends on accurate branch prediction.
20-stage pipeline, 3-way issue:
- 60 instructions in flight (12 branches)
- 17th stage is branch resolution
- ~17*3 = 51 instructions lost on a mispredict

Branch Prediction
The latest branch predictors are significantly more sophisticated, using more advanced correlating techniques, larger structures, and even AI techniques. They use patterns of a branch’s own outcomes (local history) and the recent history of other branches (global history) to make predictions.

Putting it all together
For a given program on our 5-stage MIPS pipeline processor:
- 20% of instructions are loads; 50% of instructions following a load are arithmetic instructions that depend on the load
- 20% of instructions are branches; using dynamic branch prediction, we achieve 80% prediction accuracy
What is the CPI of your program?
Selection: CPI
A: 0.76
B: 0.9
C: 1.0
D: 1.1
E: 1.14
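The calculation can be sketched as a base CPI of 1 plus the average stall cycles per instruction. The function name and the two penalty values are assumptions: a one-cycle load-use stall (with forwarding) and a one-cycle misprediction penalty (branch resolved in Decode). Other pipeline assumptions would give a different result.

```python
def pipeline_cpi(load_frac, load_use_frac, branch_frac, accuracy,
                 load_use_stall=1, mispredict_penalty=1):
    """Base CPI of 1 plus average stall cycles per instruction."""
    # Loads whose next instruction depends on the loaded value.
    load_stalls = load_frac * load_use_frac * load_use_stall
    # Branches that the predictor gets wrong.
    branch_stalls = branch_frac * (1.0 - accuracy) * mispredict_penalty
    return 1.0 + load_stalls + branch_stalls


# 20% loads, 50% of them followed by a dependent instruction,
# 20% branches, 80% prediction accuracy.
print(round(pipeline_cpi(0.20, 0.50, 0.20, 0.80), 2))  # 1.14
```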

Control Hazards: Key Points
Control (or branch) hazards arise because we must fetch the next instruction before we know if we are branching or where we are branching. Control hazards are detected in hardware. We can reduce the impact of control hazards through:
- early detection of branch address and condition
- branch prediction
- branch delay slots

Given our 5-stage MIPS pipeline, what is the steady-state CPI for the following code? Assume the branch is taken thousands of times. Recall: a processor is in steady state when all stages are active.
Loop: lw  r1, 0(r2)
      add r2, r3, r4
      sub r5, r1, r2
      beq r5, $zero, Loop
Selection: CPI
A: 1
B: 1.25
C: 1.5
D: 1.75
E: None of the above
Steady-state CPI = (#insts + #stalls + #flushed_insts) / #insts
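The steady-state formula is simple enough to express directly. In the sketch below the function name is illustrative, and the stall and flush counts passed in are assumed values for one loop iteration (one stall for the branch waiting on the sub result, one flushed instruction per taken branch); the right counts depend on where the pipeline resolves branches and which forwarding paths exist.

```python
def steady_state_cpi(insts, stalls, flushed):
    """Steady-state CPI = (#insts + #stalls + #flushed_insts) / #insts."""
    return (insts + stalls + flushed) / insts


# Four instructions per iteration, with assumed hazard counts.
print(steady_state_cpi(insts=4, stalls=1, flushed=1))  # 1.5
```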

Hardware engineers determine these to be the execution times per stage of the MIPS 5-stage pipeline processor:
IF = 200ps, ID = 100ps, EX = 100ps, M = 200ps, WB = 100ps
Consider splitting IF and M into 2 stages each (so IF1, IF2 and M1, M2). The most important code run by the company is (assume the branch is taken most of the time):
Loop: lw  r1, 0(r2)
      add r2, r3, r4
      sub r5, r1, r2
      beq r5, $zero, Loop
What would be the impact of the new 7-stage pipeline compared to the original 5-stage MIPS pipeline? Assume the pipeline has forwarding where available, predicts branch not taken, and resolves branches in ID.
Selection: CPI, CT
[Options table, cells partly missing: A Increase; B Decrease; C Increase; D Decrease; E Increase, No Change]
Isomorphic

Hardware engineers determine these to be the execution times per stage of the MIPS 5-stage pipeline processor:
IF = 200ps, ID = 100ps, EX = 200ps, M = 200ps, WB = 100ps
Consider splitting IF and M into 2 stages each (so IF1, IF2 and M1, M2). The most important code run by the company is (assume the branch is taken most of the time):
Loop: lw  r1, 0(r2)
      add r2, r3, r4
      sub r5, r1, r2
      beq r5, $zero, Loop
What would be the impact of the new 7-stage pipeline compared to the original 5-stage MIPS pipeline? Assume the pipeline has forwarding where available, predicts branch not taken, and resolves branches in ID.
Selection: CPI, CT
[Options table, cells partly missing: A Increase; B Decrease; C Increase; D Decrease; E Increase, No Change]
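For the cycle-time half of these two questions, the clock period is set by the slowest stage. The sketch below compares the two sets of stage latencies, assuming a split stage divides into two equal halves and ignoring pipeline-register overhead; the helper names are illustrative.

```python
def cycle_time(stages):
    """The clock period is set by the slowest stage."""
    return max(stages.values())


def split_evenly(stages, names):
    """Split each named stage into two halves (assumed equal)."""
    result = {}
    for name, ps in stages.items():
        if name in names:
            result[name + "1"] = ps / 2
            result[name + "2"] = ps / 2
        else:
            result[name] = ps
    return result


first = {"IF": 200, "ID": 100, "EX": 100, "M": 200, "WB": 100}
second = {"IF": 200, "ID": 100, "EX": 200, "M": 200, "WB": 100}

for label, stages in (("EX = 100ps", first), ("EX = 200ps", second)):
    deeper = split_evenly(stages, {"IF", "M"})
    print(label, "5-stage CT:", cycle_time(stages),
          "7-stage CT:", cycle_time(deeper))
```

With EX at 100ps the split halves the cycle time; with EX at 200ps the unsplit EX stage still limits the clock, so the cycle time does not change.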

7-stage Pipeline
Loop: lw  r1, 0(r2)
      add r2, r3, r4
      sub r5, r1, r2
      beq r5, $zero, Loop
For diagramming the code in the last slide. Drive home: more stages = more hazards.

Pipelining: Key Points
- Pipelining focuses on improving instruction throughput, not individual instruction latency.
- Data hazards can be handled by hardware or software, but most modern processors have hardware support for stalling and forwarding.
- Control hazards can be handled by hardware or software, but most modern processors use Branch Target Buffers and advanced dynamic branch prediction to reduce the hazard.
- ET = IC * CPI * CT
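The execution-time equation shows why a deeper pipeline can win even when its CPI is worse. The sketch below uses entirely hypothetical numbers (instruction count, CPI values, and cycle times are all made up for illustration) to compare two designs.

```python
def execution_time_ps(ic, cpi, ct_ps):
    """ET = IC * CPI * CT, with the cycle time given in picoseconds."""
    return ic * cpi * ct_ps


# Hypothetical designs running the same one million instructions.
shallow = execution_time_ps(ic=1_000_000, cpi=1.5, ct_ps=200)
deep = execution_time_ps(ic=1_000_000, cpi=2.0, ct_ps=100)

print("shallow pipeline:", shallow / 1e6, "microseconds")
print("deep pipeline:   ", deep / 1e6, "microseconds")
```

Here the deeper design pays more cycles per instruction for its extra hazards but still finishes sooner because each cycle is shorter; with different numbers the comparison could go the other way.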