The University of Adelaide, School of Computer Science

Computer Architecture: A Quantitative Approach, Fifth Edition
Appendix C Pipelining: Basic and Intermediate Concepts
25 April 2017

Basic Pipelining
Pipelining is the organizational implementation technique that has been responsible for the most dramatic increase in computer performance.
Overview of basic pipelining:
  What is pipelining?
  Computing pipeline speedup
  Clocking pipelines
  Pipelining MIPS
  Pipeline hazards
  Handling interrupts

Pipelining

Pipelining 3 Stages Assume a 2 ns flip-flop delay

Pipelining: Computing the speedup
Time per instruction: $\text{TPI} = \text{CPI} \times \text{cycle time}$
We can think of pipelining as reducing either CPI or cycle time.
Ideal speedup (equal to the number of pipeline stages) requires:
  All stages perfectly balanced
  No synchronization (latch, flip-flop) overhead
  No stall cycles
In practice the speedup from a pipeline is limited:
  $\text{CPI}_{\text{real}} = \text{CPI}_{\text{ideal}} + \text{CPI}_{\text{stall}}$
  $\text{CCT}_{\text{real}} = \text{Time}_{\text{longest pipestage}} + \text{Time}_{\text{latch overhead}}$
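The relations on this slide can be sketched as a small calculation. This is only an illustration: the function name, its parameters, and the stage times are hypothetical, and it assumes the unpipelined baseline runs each instruction in one long cycle equal to the sum of the stage times, with an ideal pipelined CPI of 1.

```python
def pipeline_speedup(stage_times_ns, latch_overhead_ns, cpi_stall=0.0):
    """Estimate speedup of a pipelined design over an unpipelined one.

    Assumption: the unpipelined machine has CPI = 1 and a cycle time equal
    to the sum of all stage times (no latch overhead).
    """
    unpipelined_tpi = sum(stage_times_ns)
    # Cycle time is set by the slowest stage plus latch overhead.
    cycle_time = max(stage_times_ns) + latch_overhead_ns
    # Real CPI is the ideal CPI (1) plus stall cycles per instruction.
    cpi_real = 1.0 + cpi_stall
    pipelined_tpi = cpi_real * cycle_time
    return unpipelined_tpi / pipelined_tpi


# Perfectly balanced, no overhead, no stalls: speedup equals the stage count.
balanced = pipeline_speedup([10, 10, 10, 10, 10], latch_overhead_ns=0)

# Unbalanced stages plus 2 ns of latch overhead reduce the speedup.
unbalanced = pipeline_speedup([10, 8, 10, 10, 7], latch_overhead_ns=2)
```

With these made-up numbers the balanced case gives the ideal speedup of 5, while the unbalanced case with latch overhead gives 45 / 12 = 3.75.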

MIPS Instruction Formats

Basic MIPS Pipeline

Basic MIPS Pipeline (simplified)

Pipelining By Adding Registers

MIPS Pipelined Execution
Instruction  1    2    3    4    5    6    7    8    9
i            IF   ID   EX   MEM  WB
i+1               IF   ID   EX   MEM  WB
i+2                    IF   ID   EX   MEM  WB
i+3                         IF   ID   EX   MEM  WB
i+4                              IF   ID   EX   MEM  WB
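A chart like this one can be generated mechanically, which makes the overlap explicit: instruction k enters IF in cycle k+1 and advances one stage per cycle. The sketch below assumes an ideal pipeline with no stalls; the function name is hypothetical.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_chart(num_instructions, stages=STAGES):
    """Return one row per instruction; each row has one cell per clock cycle.

    Assumes an ideal pipeline: one instruction issues per cycle, no stalls.
    """
    total_cycles = num_instructions + len(stages) - 1
    rows = []
    for i in range(num_instructions):
        row = [""] * total_cycles
        for s, name in enumerate(stages):
            row[i + s] = name  # instruction i is in stage s during cycle i + s
        rows.append(row)
    return rows


for i, row in enumerate(pipeline_chart(5)):
    label = "i" if i == 0 else f"i+{i}"
    print(f"{label:<5}" + "".join(f"{cell:<5}" for cell in row))
```

Five instructions in a five-stage pipeline occupy 5 + 5 - 1 = 9 cycles, matching the nine columns of the chart.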

Rules for pipeline registers
Each stage must be independent, so inter-stage registers must hold:
  Data values
  Control signals, including
    Decoded instruction fields
    MUX controls
    ALU controls
Think of the register file as two independent units:
  Read file, accessed in ID
  Write file, accessed in WB
There is no “final” set of registers after WB (WB/IF), because the instruction is finished and all results are recorded in permanent machine state (register file, memory, and PC).

A More Accurate Pipeline Schematic

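Pipeline Dataflow: the details
Instruction types: Reg-Reg ALU, Reg-immed, Load, Store, Branch, Jump
IF (all types)
  IR2 = IMEM[PC]; PC2 = PC = PC+4
ID (all types)
  A3 = Regs[IR2[25..21]]; B3 = Regs[IR2[20..16]]; IR3 = IR2; PC3 = PC2; IM3 = (IR2[15])^16 ## IR2[15..0]
EX (all types: IR4 = IR3; PC4 = PC3)
  Reg-Reg ALU:   ALU4 = A3 op B3
  Reg-immed:     ALU4 = A3 op IM3
  Load, Store:   ALU4 = A3 + IM3; Store also sets MD4 = B3
  Branch, Jump:  ALU4 = PC3 + IM3; Branch also sets CO4 = A3 op 0
MEM (all types: IR5 = IR4; PC5 = PC4)
  Load:    WB5 = DMEM[ALU4]
  Store:   DMEM[ALU4] = MD4
  Branch:  if (CO4) PC = ALU4
  Jump:    PC = ALU4
WB
  Din = WB5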

Problems with Pipelining (Dependencies and Hazards)
Dependencies: a property of the program
  Data dependencies: instruction j uses the result produced by instruction i
  Control dependencies: the execution of instruction j depends upon the result of instruction i

Dependencies and Hazards
Hazard: a result of dependencies in the pipeline
Hazards lead to pipeline stalls or the execution of the wrong instruction
  Data hazard: an instruction depends upon the result of an instruction still in the pipeline
  Structural hazard: two instructions try to use the same hardware resource in a single cycle
  Control hazard: caused by the delay in fetching an instruction and in deciding about changes in instruction flow

Structural hazards
When two instructions need to use the same hardware resource in the same cycle
  Resource is not duplicated: register file write ports
  Resource is not fully pipelined, i.e. takes more than one cycle: division, floating point
Fix #1: Stall the later instruction
  Low cost, but increases average CPI
  Best used for rare events
  Examples:
    MIPS R2000 multi-cycle multiply
    SPARC V1 single memory port for instruction and data
Fix #2: Duplicate the resource
  Increases cost, but preserves CPI
  Best used for cheap resources and/or frequent events

Structural hazards, continued
Example resource duplication
  Separate instruction and data memory
  Separate ALU and PC adders
  Register files with multiple ports
Fix #3: Pipeline the expensive resource
  Moderate cost compared to duplication, expensive compared to stalling
  Best used for high performance or specialty machines
  Example: fully pipelined floating point units for scientific machines
How to avoid structural hazards altogether
  Design the ISA so that each resource needed by an instruction:
    Is used once
    Is always used in the same pipeline stage
    Takes one cycle
  MIPS is designed with pipelining in mind, x86 is not

Types of Data Hazards
RAW (Read After Write)
  Only hazard for “fixed” pipelines
  Later instruction must read after the earlier instruction writes
WAW (Write After Write)
  Variable length pipeline
  Later instruction must write after the earlier instruction writes
WAR (Write After Read)
  Pipeline with late read
  Later instruction must write after the earlier instruction reads
Data hazards can also occur through memory locations
[Pipeline diagrams: a fixed F R A M W pipeline, and variable-length pipelines with execute stages 1 to 4 and 1 to 5]
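The three hazard types come down to which register sets of two instructions overlap. The sketch below classifies a pair from its read and write sets. The dictionary layout and function name are illustrative only. It reports the dependence type; whether that dependence becomes a hazard depends on the pipeline, as described above.

```python
def classify_dependences(earlier, later):
    """Classify register dependences between two instructions in program order.

    Each instruction is a dict with 'reads' and 'writes' sets of register names.
    """
    kinds = set()
    if earlier["writes"] & later["reads"]:
        kinds.add("RAW")  # later must read after earlier writes
    if earlier["writes"] & later["writes"]:
        kinds.add("WAW")  # later must write after earlier writes
    if earlier["reads"] & later["writes"]:
        kinds.add("WAR")  # later must write after earlier reads
    return kinds


# ADD R1, R2, R3 followed by ADD R4, R1, R5: R1 is written, then read.
first = {"reads": {"R2", "R3"}, "writes": {"R1"}}
second = {"reads": {"R1", "R5"}, "writes": {"R4"}}
print(classify_dependences(first, second))
```

For this pair the only overlap is R1, written by the first instruction and read by the second, so the result is a RAW dependence.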

Example RAW pipeline hazard

Stall for RAW hazards
Relatively cheap: just needs some extra compare and control logic
Detected in the ID stage by comparing the registers to be read with the registers to be written by the instructions currently in the EX, MEM, or WB stages
  Stall if a match is found
Increases the average CPI
  Would happen much too frequently
[Diagram: ADD R1, R2, R3 followed by ADD R4, R1, R5 in an F R X M W pipeline; a bubble delays the second ADD so that it reads R1 only after the first ADD has written it]
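The ID-stage check described here amounts to a handful of register-number comparisons. The following is a minimal sketch with hypothetical names. It ignores forwarding and assumes each in-flight instruction writes at most one register.

```python
def must_stall(id_source_regs, in_flight_dest_regs):
    """Decide whether the instruction in ID has to stall.

    id_source_regs: set of registers the instruction in ID wants to read.
    in_flight_dest_regs: destination registers of the instructions currently
        in EX, MEM and WB (None for an instruction that writes no register).
    """
    return any(
        dest is not None and dest in id_source_regs
        for dest in in_flight_dest_regs
    )


# ADD R4, R1, R5 is in ID while ADD R1, R2, R3 is in EX: stall.
print(must_stall({"R1", "R5"}, ["R1", None, None]))
```

One comparator is needed per source register per in-flight stage, which is why the slide calls the scheme cheap.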

Stall type #1: Freeze the whole pipeline
[Timing diagram: instructions I through I+6 over cycles 1 to 11, stages IF ID EX MEM WB]
Freeze all pipe stages for one or more cycles, and suppress writeback
Needs only one global stall signal, which suppresses all latching in all pipeline stages
Sometimes called a “fixed pipe” or “frozen pipe” stall
Works for cache misses
Will not work to remove pipeline hazards

Stall type #2: Delay completion of an instruction
[Timing diagram: instructions I through I+6 over cycles 1 to 11, stages IF ID EX MEM WB, with a stall inserted at I+2; the bubble then moves through EX, MEM, and WB]
Instruction progress stops for one cycle
  Earlier instructions continue towards completion
  Later instructions must suspend and make no more progress
An “elastic pipe” stall
Good when the need for stalling is only detected after decode, as for pipeline hazards

Bypass (Forwarding)
If data is available elsewhere in the pipeline, there is no need to stall
  Detect the condition
  Bypass (or forward) the data directly to the consuming pipeline stage
Bypass eliminates stalls for single-cycle operations
  Reduces the longest stall to N-1 cycles for N-cycle operations

Physical Forwarding Paths
The third forwarding path might not be necessary if the register file allows a value to be read in the same cycle in which it is written (read-after-write)

Example forwarding decisions
If EX has just finished an operation whose result ID wants to read as either operand, we must forward:
  If IR4.Will_Write_Reg and IR4.Write_Reg_Num == IR3.RS1_Reg_Num then ALUmuxA = SelectALU4
  If IR4.Will_Write_Reg and IR4.Write_Reg_Num == IR3.RS2_Reg_Num then ALUmuxB = SelectALU4
Need one comparison and multiplexer control for each forwarding path
Be careful: if more than one instruction could forward a value, choose the closest one in the pipeline
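The comparisons on this slide, together with the "choose the closest" rule, can be sketched as a priority check. The field names mirror the slide's pseudocode, but the dictionaries, the function name, and the returned labels are illustrative assumptions rather than the slide's actual control signals.

```python
def select_alu_input(src_reg, ex_mem, mem_wb):
    """Pick the source for one ALU operand.

    ex_mem: latch contents of the instruction one ahead (closest).
    mem_wb: latch contents of the instruction two ahead.
    Each is a dict with 'will_write_reg' (bool) and 'write_reg_num'.
    """
    # Check the closest producer first: it holds the most recent value.
    if ex_mem["will_write_reg"] and ex_mem["write_reg_num"] == src_reg:
        return "forward_from_EX/MEM"
    if mem_wb["will_write_reg"] and mem_wb["write_reg_num"] == src_reg:
        return "forward_from_MEM/WB"
    return "register_file"


# Both older instructions write R1; the closer one must win.
closest = {"will_write_reg": True, "write_reg_num": "R1"}
older = {"will_write_reg": True, "write_reg_num": "R1"}
print(select_alu_input("R1", closest, older))
```

Testing the nearer latch first is what implements the priority. Swapping the two checks would forward a stale value whenever two in-flight instructions write the same register.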

Physical Forwarding Paths

Forwarding Animation (1)

Forwarding Animation (2)

Forwarding Animation (3)

Forwarding Animation (4)

Other Data Hazards
WAR (Write After Read)
  Can happen if the instruction pipeline has early writes and/or late reads; something like:
    DIV (R1),        Suppose that it does not read the destination indirect until after the divide
    ADD ..,(R1)+     Incremented value of R1 is written before DIV has read the value of R1
  Cannot happen in DLX because all reads are early (ID) and all writes are late (WB)
WAW (Write After Write)
  Can happen when a fast operation follows a slow one; like
    LW  R1, 0(R2)     IF ID EX MEM WB
    ADD R1, R2, R3       IF ID EX WB
  Cannot happen in DLX (integer) because there is only one WB stage and instructions use it in order

One data hazard left
Loaded data is not available until the end of MEM, which is too late for the following instruction
Forwarding cannot help, so we must stall – or just “decree” that you cannot write code like this. Such a decree is called a “delayed load” and was used in the original MIPS R2000

Stalling to interlock

The software fix: instruction scheduling to avoid stalls
Since we cannot avoid a stall when a load is immediately followed by an instruction that uses its result, avoid the stall by rearranging the code (“pipeline scheduling”), if possible
Replace
  sub r4, r5, r7
  lw r1, 50(r2)
  add r3, r1, r4
With
  lw r1, 50(r2)
  sub r4, r5, r7
  add r3, r1, r4
This can improve the performance of a simple RISC machine
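The effect of scheduling can be checked with a small stall counter. The sketch below assumes a pipeline with full forwarding, in which the only remaining data stall is one cycle when an instruction reads a register loaded by the immediately preceding lw. The tuple encoding and the reordered sequence are illustrative assumptions.

```python
def count_load_use_stalls(program):
    """Count one-cycle stalls caused by a load followed directly by its user.

    program: list of (opcode, dest_reg, source_regs) tuples in issue order.
    """
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_sources = cur
        if prev_op == "lw" and prev_dest in cur_sources:
            stalls += 1
    return stalls


sub = ("sub", "r4", ("r5", "r7"))
lw = ("lw", "r1", ("r2",))
add = ("add", "r3", ("r1", "r4"))

original = [sub, lw, add]   # add uses r1 right after the load
scheduled = [lw, sub, add]  # the independent sub fills the load delay

print(count_load_use_stalls(original), count_load_use_stalls(scheduled))
```

Moving the load up is legal here because sub neither reads r1 nor writes a register that lw reads.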

The software fix: instruction scheduling to avoid stalls
But it is limited
  Usually limited to basic blocks between branches, 5-7 instructions
  Difficult to reorder accesses to variables referenced indirectly (pointer, array, or parameter) due to the risk of aliases

Branches and jumps
Control point: the stage at which both target and condition are known
  MEM control point
Branch penalty: number of pipeline stages to the control point
[Timing diagram: branch I proceeds through IF ID EX MEM WB over cycles 1 to 9; I+1 is stalled until the branch reaches the control point; I+2 and I+3 follow]
This 3-cycle penalty works, but since branches occur every 5-7 instructions, it kills performance.
What to do
  Determine the branch condition earlier than EX
  Compute the target address earlier than MEM

Characteristics of MIPS branches and jumps
The branch condition
  Only has EQ/NE comparison to zero
  Fast and cheap, no need for a full ALU
  Use a 32-bit NOR gate instead
The branch target
  Always PC-relative
  Needs only a 16-bit adder (and carry propagation)
The jump target
  Target = {PC[31:28], offset, 00}
All can be moved to the ID stage, at the cost of additional hardware (and maybe increased cycle time)
  Still requires one stall
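The three computations listed here are simple enough to express as bit operations. The sketch below assumes the standard MIPS32 encodings (a 16-bit signed word offset relative to PC+4 for branches, and a 26-bit word index for jumps). The slide itself only states that branches are PC-relative and gives the jump concatenation, so the shift amounts and function names are assumptions.

```python
MASK32 = 0xFFFFFFFF


def sign_extend_16(imm16):
    """Interpret the low 16 bits as a two's-complement value."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def is_zero(reg_value):
    """Zero test: what a 32-bit NOR of all register bits computes."""
    return (reg_value & MASK32) == 0


def branch_target(pc_plus_4, imm16):
    """PC-relative target, assuming the immediate counts 4-byte words."""
    return (pc_plus_4 + (sign_extend_16(imm16) << 2)) & MASK32


def jump_target(pc, offset26):
    """Target = {PC[31:28], offset, 00}: keep the top 4 PC bits."""
    return (pc & 0xF0000000) | ((offset26 & 0x03FFFFFF) << 2)


print(hex(branch_target(0x00400004, 0xFFFF)))  # offset of -1 word
print(hex(jump_target(0x10000008, 0x10)))
```

None of these needs the full ALU: the zero test is a wide NOR, the branch target is one addition, and the jump target is pure wiring, which is why all three can move into ID.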

Pipelining and Branch ISA Design
Simple branches
  Make an ID control point possible
  Maybe increase cycle time
  1-cycle branch penalty
Complex branches
  Require an EX control point
  Maybe lower cycle time
  2-cycle branch penalty

Reducing branch penalties (1)
Predict that the branch will not be taken
  Continue fetching from sequential addresses
  Cancel later if the branch was taken
Easy to do
  If the branch is not taken, continue
  If it is taken, change the following instruction into a NOP and thus take a 1-cycle penalty
Helps a little, but bets the wrong way for loops

Reducing branch penalties (2)
Predict that the branch will be taken
  Only useful if the target address is known before the branch condition – not true for MIPS
  Cancel later if the branch was not taken
  Always has some delay in fetching the branch target

Reducing branch penalties (3)
Change the ISA: delay the effect of the branch
  Always execute the instruction(s) after the branch or jump
  Depends on the compiler to find something useful to do in the branch delay slot(s)
An ugly dependence of ISA on implementation – may change
  Interaction with branch prediction, interrupts

Filling the branch delay slot

How useful are canceling branches?
  Integer: 35% of slots wasted
  Floating point: 25% of slots wasted

Performance of Branch schemes?
$\text{Effective CPI} = 1 + \%\text{branches} \times \text{average branch penalty}$
For integer MIPS: 20% of instructions are branches or jumps. 70% of them go to the target
Strategy            Branch penalty (taken / not taken)   Effective CPI
Stall               3                                    1.60
Branch in ID        1                                    1.20
Predict taken       -                                    -
Predict not taken   -                                    1.14
Delay slot          0.5                                  1.10
Cancel branch       0.3                                  1.06
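The effective CPI figures can be reproduced from the formula with a short calculation. In the sketch below, the split of each strategy's penalty into taken and not-taken cases is an assumption, since only one penalty per row is listed. The split was chosen so the results match the listed CPI values, and the function name is hypothetical.

```python
def effective_cpi(branch_fraction, taken_fraction,
                  taken_penalty, not_taken_penalty):
    """Effective CPI = 1 + %branches * average branch penalty."""
    average_penalty = (taken_fraction * taken_penalty
                       + (1 - taken_fraction) * not_taken_penalty)
    return 1 + branch_fraction * average_penalty


BRANCHES = 0.20  # fraction of instructions that are branches or jumps
TAKEN = 0.70     # fraction of those that go to the target

# (taken penalty, not-taken penalty): assumed values, see the note above.
strategies = {
    "Stall": (3, 3),
    "Branch in ID": (1, 1),
    "Predict not taken": (1, 0),
    "Delay slot": (0.5, 0.5),
    "Cancel branch": (0.3, 0.3),
}

for name, (taken_pen, not_taken_pen) in strategies.items():
    cpi = effective_cpi(BRANCHES, TAKEN, taken_pen, not_taken_pen)
    print(f"{name:<18} {cpi:.2f}")
```

For predict-not-taken, only the 70% of branches that are taken pay the one-cycle penalty, giving 1 + 0.20 × 0.70 × 1 = 1.14.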

Pipeline example
Consider the following pipeline, which implements a MIPS-like ISA. The only variation on the MIPS ISA is the support of full register compares in branch instructions
Instruction  1    2    3    4         5         6         7         8         9         10    11
I            IF   ID   EX1  EX2/MEM1  MEM2      WB
I+1               IF   ID   EX1       EX2/MEM1  MEM2      WB
I+2                    IF   ID        EX1       EX2/MEM1  MEM2      WB
I+3                         IF        ID        EX1       EX2/MEM1  MEM2      WB
I+4                                   IF        ID        EX1       EX2/MEM1  MEM2      WB
I+5                                             IF        ID        EX1       EX2/MEM1  MEM2  WB

The Pipeline stages
Stage      Function
IF         Instruction fetch
ID         Instruction decode; register fetch
EX1        Address generation (data and PC-target)
EX2/MEM1   ALU operation; branch condition resolution; first cycle of memory access
MEM2       Second cycle of memory access
WB         Register file writeback

Assumptions
  Writes to the register file occur in the first half of the clock cycle, while reads from the register file occur in the second half
  All bypass paths have been implemented to minimize pipeline stalls due to data hazards
  The pipeline implements hardware interlocks

Questions
  How many register file ports does the processor need to minimize structural hazards?
  Indicate all forwarding required to minimize stalls in the given pipeline. Also, specify the minimum number of comparators needed to implement forwarding.
  What is the worst-case delay due to RAW data hazards?
  What is the branch delay of this pipeline?

Instruction Dependencies
The frequencies in the table are presented as percentages of all instructions executed
Type  Instruction sequence                              Frequency
1     ALUop Rx,-,-
      ALUop -,-,Rx or ALUop -,Rx,-                      10%
2     Store Rx,-(-)                                     5%
3     Load -,-(Rx) or Store -,-(Rx)
4     JumpRegister Rx                                   1%
5     Branch Rx,-,# or Branch -,Rx,#                    2%
6     Load Rx,-(-)                                      15%
7     Load Rx,-(-)
      Load -,-(Rx) or Store -,-(Rx)                     3%
8
9

More Questions
  List the instruction sequences from the previous table that cause data stalls in the pipeline. Indicate the corresponding number of stall cycles.
  Compute the CPI for the pipeline due to data hazards only. Ignore instruction sequences that are not listed in the table.
  If the frequency of conditional branches is 10%, of which 65% are taken, and the frequency of unconditional branches is 6%, compute the overall CPI assuming a TAKEN branch prediction scheme.

Summary
Pipelining: overlaps execution of instructions
  Improves instruction throughput, which reduces the latency of a long program
Problem: structural, data, and control hazards
  Hazards occur if there are dependences and the pipeline exposes them
  Common solutions: stalling, forwarding, scheduling
Performance
  $\text{CPI}_{\text{real}} = \text{CPI}_{\text{ideal}} + \text{Stalls}_{\text{structural}} + \text{Stalls}_{\text{data}} + \text{Stalls}_{\text{control}}$
  $\text{Cycle time}_{\text{real}} = \text{Time}_{\text{longest pipestage}} + \text{Register overhead}$
What makes pipelining easier
  Simple instructions (load-stores, branches)
  Fixed length, encoding with few formats