
1 Pipelining Part I CS448

2 What is Pipelining? Like an automobile assembly line for instructions
–Each step does a little job of processing the instruction
–Ideally each step operates in parallel
Simple model
–Instruction Fetch
–Instruction Decode
–Instruction Execute

    F1 D1 E1
       F2 D2 E2
          F3 D3 E3
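The staggered fetch/decode/execute overlap in the diagram above can be generated mechanically. A small illustrative sketch (the function name and output format are mine, not from the slides):

```python
# Print the overlap of a 3-stage (Fetch/Decode/Execute) pipeline,
# one column per clock cycle, one row per instruction.
STAGES = ["F", "D", "E"]

def pipeline_diagram(n_instructions):
    """Return one row per instruction; instruction i enters the pipe at cycle i."""
    total_cycles = n_instructions + len(STAGES) - 1
    rows = []
    for i in range(n_instructions):
        row = ["."] * total_cycles           # "." marks an idle cycle
        for s, name in enumerate(STAGES):
            row[i + s] = f"{name}{i + 1}"    # stage s occupied at cycle i+s
        rows.append(" ".join(row))
    return rows

for line in pipeline_diagram(3):
    print(line)
```

In steady state every column (clock cycle) has one instruction in each stage, which is exactly the parallelism the slide describes.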

3 Ideal Pipeline Performance If stages are perfectly balanced:
    Time per instruction = Time per instruction unpipelined / Number of stages
The more stages the better?
–Each stage typically corresponds to a clock cycle
–Stages will not be perfectly balanced
–Synchronous: the slowest stage will dominate the cycle time
–Many hazards await us
Two ways to view pipelining
–Reduced CPI (when going from non-pipelined to pipelined)
–Reduced cycle time (when increasing pipeline depth)

4 Ideal Pipeline Performance Implemented completely in hardware
–Exploits parallelism in a sequential instruction stream
Invisible to the programmer!
–Not so for other forms of parallelism we will see
–Not invisible to a programmer looking to optimize, though
–The compiler must become aware of pipelining issues
All modern machines use pipelines
–Widely used in the 80’s
–Multiple pipelines in the 90’s

5 DLX Instructions I-Type: Opcode | rs1 | rd | Immediate
–Loads & stores: rd ← rs1 op immediate
–Conditional branches: rs1 is the condition register checked, rd is unused, and the immediate is the offset
R-Type: Opcode | rs1 | rs2 | rd | func
J-Type: Opcode | Offset added to PC
Mostly just look at I-Type for now

6 Unpipelined DLX Every DLX instruction can be executed in 5 steps
1. IF – Instruction Fetch
–IR ← Mem[PC]
–NPC ← PC + 4; Next Program Counter
2. ID – Instruction Decode / Register Fetch
–A ← Regs[IR6..10]; rs1
–B ← Regs[IR11..15]; rd
–Imm ← (IR16)^16 ## IR16..31; sign-extend the immediate
–Fetch operands in parallel for later use. They might not be used!
Fixed-field decoding makes this possible
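The ID-stage field extraction and sign extension above can be sketched in a few lines. This assumes the usual DLX I-type layout (6-bit opcode, 5-bit rs1, 5-bit rd, 16-bit immediate, with bit 0 as the most significant bit); the function name is illustrative:

```python
# Split a 32-bit I-type instruction word into fields and sign-extend
# the immediate, mirroring the ID-stage operations in the text.
def fields_itype(ir):
    """Return (rs1, rd, imm) for a 32-bit I-type instruction word."""
    rs1 = (ir >> 21) & 0x1F    # bits 6..10 in DLX big-endian bit numbering
    rd  = (ir >> 16) & 0x1F    # bits 11..15
    imm = ir & 0xFFFF          # bits 16..31
    if imm & 0x8000:           # replicate the sign bit 16 times
        imm -= 0x10000
    return rs1, rd, imm

# rs1=2, rd=3, imm=8 packed into one word
word = (2 << 21) | (3 << 16) | 8
print(fields_itype(word))
```

The sign-extension step is exactly the `(IR16)^16 ## IR16..31` concatenation from the slide: the top bit of the 16-bit immediate is copied into the upper 16 bits of the 32-bit value.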

7 Unpipelined DLX 3. EX – Execution / Effective Address Cycle
–There are four operations depending on the opcode decoded in the previous stage
–Memory reference: ALUOutput ← A + Imm; compute the effective address
–Register-Register ALU operation: ALUOutput ← A func B; e.g. R1 + R2
–Register-Immediate ALU operation: ALUOutput ← A op Imm
–Branch: ALUOutput ← NPC + Imm; PC-based offset
  Cond ← A op 0; e.g. op is == for BEQZ
–Note that the load/store architecture of DLX means the effective-address and execution cycles can be combined into one clock cycle, since no instruction needs to simultaneously calculate a data address and perform an ALU op
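The four EX-stage cases can be written as a small dispatch. A minimal sketch, where `a`, `b`, `imm`, and `npc` stand in for the A, B, Imm, and NPC latches from the text (the function signature and case names are mine):

```python
# One EX-stage cycle: compute ALUOutput for each of the four instruction
# classes described in the slide.
def execute(kind, a, b, imm, npc, op=None):
    if kind == "mem":        # load/store: ALUOutput <- A + Imm (effective address)
        return a + imm
    if kind == "reg-reg":    # ALUOutput <- A func B
        return op(a, b)
    if kind == "reg-imm":    # ALUOutput <- A op Imm
        return op(a, imm)
    if kind == "branch":     # ALUOutput <- NPC + Imm (PC-relative target)
        return npc + imm
    raise ValueError(f"unknown instruction class: {kind}")

# Effective address for a load with base 100 and displacement 8
assert execute("mem", a=100, b=0, imm=8, npc=0) == 108
```

Because only loads and stores compute an effective address, and only ALU instructions use the ALU for data, one adder suffices per cycle, which is the point the slide makes about combining the two cycles.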

8 Unpipelined DLX 4. MEM – Memory Access / Branch Completion
–There are two cases, one for memory references and one for branches
–Both cases: PC ← NPC; update the PC
–Memory reference:
  LMD ← Mem[ALUOutput]; for memory loads
  Mem[ALUOutput] ← B; for stores
  Note the address was previously computed in step 3
–Branch:
  If (cond) PC ← ALUOutput; PC gets the new address

9 Unpipelined DLX 5. WB – Write Back
–Writes data back to the REGISTER FILE
  Memory writes were done in step 4
–Three options
–Register-Register ALU: Regs[IR16..20] ← ALUOutput; rd for R-Type
–Register-Immediate ALU: Regs[IR11..15] ← ALUOutput; rd for I-Type
–Load instruction: Regs[IR11..15] ← LMD; LMD from step 4

10 Hardware Implementation of the DLX Datapath (datapath figure) Adding registers between the stages → pipelined

11 Unpipelined DLX Implementation Most instructions require five cycles
Branch and Store require only four clock cycles
–Which cycles aren’t needed?
–Reduces CPI to 4.83, using 12% branch and 5% store frequency
Other optimizations are possible
Control unit for the five cycles?
–Finite State Machine
–Microcode (as Intel has used)
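The CPI figure quoted above follows directly from the instruction frequencies: branches and stores each save one of the five cycles, so the average drops by the combined frequency. A quick check of the arithmetic:

```python
# Average CPI when branches (12%) and stores (5%) take 4 cycles
# instead of 5, and everything else takes 5.
base_cpi = 5
branch_freq, store_freq = 0.12, 0.05
cpi = base_cpi - (branch_freq + store_freq) * 1   # each saves one cycle

assert abs(cpi - 4.83) < 1e-6
```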

12 Why do we need Control? The clock pulse controls when cycles operate
–Control determines which stages can function and what data is passed on
–Registers are enabled or disabled via control
–Memory has its read or write lines set via control
–Multiplexers, the ALU, etc. must be selected
–COND selects whether the MUX for the new PC value is enabled
Control is mostly ignored in the book
–We’ll do the same, but remember… it’s a complex and important implementation issue

13 Adding Pipelining Run each stage concurrently
Need to add registers to hold data between stages
–Pipeline registers, also called pipeline latches
Rather than ~5 cycles per instruction, 1 cycle per instruction!
–That’s the ideal case. Is it really this simple?
–No, but it is a good idea… we’ll see the pitfalls shortly

14 Important Pipeline Characteristics Latency
–Time required for an instruction to propagate through the pipeline
–Equal to the number of stages × the cycle time
–Dominant if there are lots of exceptions / hazards, i.e. we constantly have to re-fill the pipeline
Throughput
–The rate at which instructions can start and finish
–Dominant if there are few exceptions and hazards, i.e. the pipeline stays mostly full
Note we need increased memory bandwidth over the non-pipelined processor
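The latency definition above is just stages times cycle time, while steady-state throughput is one instruction per cycle. Plugging in the numbers used in the next slide's example:

```python
# Latency vs. throughput for a 5-stage pipeline with an 11 ns cycle.
stages, cycle_ns = 5, 11
latency_ns = stages * cycle_ns        # time for one instruction to traverse the pipe
throughput = 1 / cycle_ns             # instructions completed per ns, steady state

assert latency_ns == 55
```

This is why pipelining raises the latency of any single instruction even as it raises throughput: 55 ns through the pipe versus 45 ns unpipelined, but a new instruction finishes every 11 ns.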

15 Pipelining Example Assume the 5 stages take 10 ns, 8 ns, 10 ns, 10 ns, and 7 ns respectively
Unpipelined
–Average instruction execution time = 10 + 8 + 10 + 10 + 7 = 45 ns
Pipelined
–Each stage introduces some overhead, say 1 ns per stage
–We can only go as fast as the slowest stage!
–Each stage then takes 11 ns; in steady state we execute each instruction in 11 ns
–Speedup = Unpipelined time / Pipelined time = 45 ns / 11 ns ≈ 4.1, or about a 4× speedup
Note: each individual instruction actually has a higher latency on the pipelined machine!
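The example's numbers can be reproduced directly: the unpipelined time is the sum of the stage times, and the pipelined cycle time is the slowest stage plus the 1 ns latch overhead.

```python
# Speedup calculation for the 5-stage example above.
stage_ns = [10, 8, 10, 10, 7]
overhead_ns = 1

unpipelined = sum(stage_ns)                # 45 ns per instruction
pipelined = max(stage_ns) + overhead_ns    # 11 ns per instruction in steady state
speedup = unpipelined / pipelined

assert round(speedup, 1) == 4.1
```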

16 Pipelining Hazards Unfortunately, the picture presented so far is a bit too good to be true… we have problems with hazards
Structural
–Resource conflicts when the hardware can’t support all combinations of overlapped stages
–e.g. one instruction might need the ALU to increment the PC while another needs it to execute an op
Data
–An instruction depends on the result of a previous instruction that is still being processed in the pipeline
–e.g. R1 = R2 + R3; R4 = R1 + R6; see the problem here?
Control
–Branches and other instructions that change the PC
–If we branch, we may have the wrong instructions in the pipeline
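The data-hazard example above (R1 = R2 + R3 followed by R4 = R1 + R6) is a read-after-write dependence: the second instruction reads R1 before the first has written it back. A toy check for this condition (function name is mine):

```python
# Detect a read-after-write (RAW) hazard between two adjacent instructions:
# the consumer reads a register the producer has not yet written back.
def raw_hazard(producer_dest, consumer_sources):
    return producer_dest in consumer_sources

# R1 = R2 + R3  then  R4 = R1 + R6  -> hazard on R1
assert raw_hazard("R1", ("R1", "R6"))
assert not raw_hazard("R1", ("R2", "R3"))
```

Real hardware performs essentially this comparison between the register fields held in adjacent pipeline registers, which is the basis for the stall and forwarding logic covered later.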

17 Structural Hazards Overlapped execution may require duplicated resources
Clock 4:
–Memory access for instruction i may conflict with IF for instruction i+4
–May be solved via separate caches/buffers for instructions and data
–IF might also use the ALU, which conflicts with EX

18 Dealing with Hazards One solution: stall
–Let the instructions later in the pipeline continue, and stall the earlier instruction
  It must be done in this order: if we stalled the later instructions instead, they would become a bottleneck and nothing could move out of the pipeline
–Once the problem is cleared, the stall is cleared
–Often called a pipeline bubble, since it floats through the pipeline but does no useful work
Stalls increase the CPI from its ideal value of 1
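The last point above quantifies easily: each stall adds bubble cycles on top of the ideal one cycle per instruction. A quick illustration with made-up numbers (30% of instructions stalling for one cycle each):

```python
# CPI with stalls: ideal CPI of 1 plus the average stall cycles per instruction.
ideal_cpi = 1.0
stall_freq, stall_cycles = 0.30, 1          # illustrative values, not from the slides
cpi = ideal_cpi + stall_freq * stall_cycles

assert abs(cpi - 1.3) < 1e-9
```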

19 Structural Hazard Example Consider a CPU with a single memory pipeline for both data and instructions
–If an instruction contains a data memory reference, it will conflict with the instruction fetch of a later instruction
–We introduce a bubble while the later instruction waits for the first one to finish its memory access
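A back-of-the-envelope model of this conflict: with one memory port, every data-memory reference steals a fetch slot, adding one bubble. Under the simplifying assumption that bubbles never overlap, the total cycle count is the pipeline fill time plus one cycle per remaining instruction plus one per memory reference (the function and its model are mine, not from the slides):

```python
# Approximate cycle count for a 5-stage pipeline with a single memory port,
# where each data-memory reference inserts one bubble.
def total_cycles(n_instr, n_mem_refs, depth=5):
    fill = depth                   # cycles to fill the pipeline for instruction 1
    issue = n_instr - 1            # one new completion per cycle after that
    bubbles = n_mem_refs           # one stall cycle per load/store
    return fill + issue + bubbles

# 4 instructions, one of them a load: 5 + 3 + 1 = 9 cycles
assert total_cycles(4, 1) == 9
```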

20 Structural Hazard Example

21 Structural Hazard Example What if instruction 1 is also a LOAD? Then no instruction finishes in clock cycle 8

22 Alternate Depiction of Stall

23 Avoiding Structural Hazards How can we avoid structural hazards?
–It’s an issue of cost for the designer
–E.g. allow multiple access paths to memory
  Separate access to instructions from data
–Build multiple ALUs or other functional units
Don’t forget the cost/performance tradeoff and Amdahl’s Law
–If we don’t encounter a structural hazard often, it might not be worth the expense of designing hardware to avoid it; instead just handle it with a stall or another method

24 Measuring Performance with Stalls
    Speedup = Average instruction time unpipelined / Average instruction time pipelined
We also know that:
    Average instruction time = CPI × Clock cycle time
Substitution yields:
    Speedup = (CPI unpipelined × Clock cycle unpipelined) / (CPI pipelined × Clock cycle pipelined)

25 Measuring Stall Performance Given:
    CPI pipelined = Ideal CPI + Pipeline stall cycles per instruction
The ideal CPI is just the value 1. Substituting this in:
    Speedup = [CPI unpipelined / (1 + Pipeline stall cycles per instruction)] × (Clock cycle unpipelined / Clock cycle pipelined)
Assuming no overhead in the pipelined clock cycle (i.e. no latch time), the clock cycle ratio is just 1, and the unpipelined CPI equals the pipeline depth, yielding:
    Speedup = Pipeline depth / (1 + Pipeline stall cycles per instruction)
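The final form of the speedup equation above is easy to turn into a helper (the assumptions match the slide: no latch overhead, and an unpipelined CPI equal to the pipeline depth):

```python
# Speedup = pipeline depth / (1 + stall cycles per instruction),
# under the no-overhead assumptions stated in the slide.
def pipeline_speedup(depth, stalls_per_instr):
    return depth / (1 + stalls_per_instr)

assert pipeline_speedup(5, 0) == 5.0                    # ideal: speedup = depth
assert round(pipeline_speedup(5, 0.25), 2) == 4.0      # a quarter-stall per instr
```

Note how quickly stalls eat the benefit: even 0.25 stall cycles per instruction cuts a 5-stage pipeline's speedup from 5× to 4×.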

26 How Realistic is the Pipeline Speedup Equation? Good for a ballpark figure; it comes close to a SWAG
–Overhead in the pipeline latches shouldn’t be ignored
Effects of pipeline depth
–Deeper pipelines have a higher probability of stalls
–They also require additional replicated resources and higher cost
Need to run simulations with memory, I/O systems, caches, etc. to get a better idea of speedup
Next we’ll examine the myriad problems from data hazards and control hazards that further complicate our simple pipeline