CMPUT 429 - Computer Systems and Architecture. CMPUT429/CMPE382, Winter 2001, Topic 3: Pipelining. José Nelson Amaral (adapted from David A. Patterson's CS252 lecture slides at Berkeley).

Slide 2: What is Pipelining? Pipelining is a key implementation technique used to build fast processors. It allows the execution of multiple instructions to overlap in time. A pipeline within a processor is similar to a car assembly line. Each assembly station is called a pipe stage or a pipe segment. The throughput of an instruction pipeline is the measure of how often an instruction exits the pipeline.

Slide 3: Pipelining: It's Natural! Laundry example: Ann, Brian, Cathy, and Dave each have one load of clothes to wash, dry, and fold.
- The washer takes 30 minutes.
- The dryer takes 40 minutes.
- The "folder" takes 20 minutes.

Slide 4: Sequential Laundry. Sequential laundry takes 6 hours for 4 loads. If they learned pipelining, how long would the laundry take?

Slide 5: Pipelined Laundry - start work ASAP. Pipelined laundry takes 3.5 hours for 4 loads. What is preventing them from doing it faster? How could we eliminate this limiting factor?

Slide 6: Pipelined Laundry - start work ASAP. Pipeline throughput: how many loads are completed per unit of time? Pipeline latency: once a load starts, how long does it take to complete?
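As a quick check on those two definitions, here are the numbers implied by the washer/dryer/folder times from slide 3 (a worked sketch, not part of the original slides):

```latex
\begin{align*}
\text{Sequential time for 4 loads:} \quad & 4 \times (30 + 40 + 20) = 360 \text{ min} = 6 \text{ h}\\
\text{Pipelined time for 4 loads:} \quad & 30 + 4 \times 40 + 20 = 210 \text{ min} = 3.5 \text{ h}\\
\text{Steady-state throughput:} \quad & 1 \text{ load per } 40 \text{ min (limited by the dryer)}\\
\text{Latency of the first load:} \quad & 30 + 40 + 20 = 90 \text{ min}
\end{align*}
```

The dryer is the answer to slide 5's questions: it is the slowest stage, so it limits how often a load can finish.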

Slide 7: [Diagram of the multicycle MIPS datapath: PC, instruction memory/register, register file, sign-extend and shift units, ALU, and data memory connected through multiplexers.]

Slide 8: 5 Steps of MIPS Datapath (Figure 3.1, page 130, CA:AQA 2e). [Datapath diagram showing the five phases: Instruction Fetch, Instruction Decode / Register Fetch, Execute / Address Calculation, Memory Access, and Write Back.]

Slide 9: 5 Steps of MIPS Datapath (Figure 3.4, page 134, CA:AQA 2e). [Pipelined datapath diagram with the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers separating the five stages.] Data stationary control: local decode for each instruction phase / pipeline stage.

Slide 10: Steps to Execute Each Instruction Type. [Table not reproduced in this transcript.]

Slide 11: Pipeline Stages. We can divide the execution of an instruction into the following stages:
- IF: Instruction Fetch
- ID: Instruction Decode
- EX: Execution
- MEM: Memory Access
- WB: Write Back

Slide 12: Pipeline Throughput and Latency. Stage delays: IF = 5 ns, ID = 4 ns, EX = 5 ns, MEM = 10 ns, WB = 4 ns. Consider the pipeline above with the indicated delays. We want to know the pipeline throughput and the pipeline latency. Pipeline throughput: instructions completed per unit of time. Pipeline latency: how long it takes to execute a single instruction in the pipeline.

Slide 13: Pipeline Throughput and Latency. Stage delays: IF = 5 ns, ID = 4 ns, EX = 5 ns, MEM = 10 ns, WB = 4 ns. Pipeline throughput: how often an instruction is completed. Pipeline latency: how long it takes to execute an instruction in the pipeline. Is this right?

Slide 14: Pipeline Throughput and Latency. Stage delays: IF = 5 ns, ID = 4 ns, EX = 5 ns, MEM = 10 ns, WB = 4 ns. Simply adding the stage latencies to compute the pipeline latency would only work for an isolated instruction: L(I1) = 28 ns, L(I2) = 33 ns, L(I3) = 38 ns, L(I4) = 43 ns. We are in trouble! The latency is not constant, because later instructions back up behind the 10 ns MEM stage of their predecessors. This happens because this is an unbalanced pipeline. The solution is to make every stage the same length as the longest one.

Slide 15: Pipeline Throughput and Latency. Stage delays: IF = 5 ns, ID = 4 ns, EX = 5 ns, MEM = 10 ns, WB = 4 ns. The slowest pipeline stage also limits the latency! With every stage stretched to 10 ns, L(I1) = L(I2) = L(I3) = L(I4) = 50 ns.

Slide 16: Pipeline Throughput and Latency. Stage delays: IF = 5 ns, ID = 4 ns, EX = 5 ns, MEM = 10 ns, WB = 4 ns. How long does it take to execute instructions in this pipeline? (Disregard bubbles caused by branches, cache misses, and hazards.) How long would it take using the same modules without pipelining?
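For a run of n instructions, with the 10 ns clock set by the slowest stage (MEM), the times work out roughly as follows (a worked sketch, not the slide's own figures):

```latex
\begin{align*}
T_{\text{pipelined}}(n) &\approx (n + 4) \times 10 \text{ ns} \quad \text{(4 cycles to fill the pipeline, then one instruction per cycle)}\\
T_{\text{non-pipelined}}(n) &= n \times (5 + 4 + 5 + 10 + 4) \text{ ns} = 28n \text{ ns}
\end{align*}
```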

Slide 17: Pipeline Throughput and Latency. Stage delays: IF = 5 ns, ID = 4 ns, EX = 5 ns, MEM = 10 ns, WB = 4 ns. Thus the speedup that we get from the pipeline is computed below. How can we improve this pipeline design? We need to reduce the imbalance to increase the clock speed.
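Working the speedup out from the two expressions above (a sketch; the limit holds for long instruction sequences):

```latex
\text{Speedup} = \frac{T_{\text{non-pipelined}}}{T_{\text{pipelined}}} \approx \frac{28n}{(n+4) \times 10} \;\longrightarrow\; \frac{28}{10} = 2.8 \quad (n \to \infty)
```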

Slide 18: Pipeline Throughput and Latency. Stage delays: IF = 5 ns, ID = 4 ns, EX = 5 ns, MEM1 = 5 ns, MEM2 = 5 ns, WB = 4 ns. Splitting MEM in two gives us one more pipeline stage, but the maximum latency of a single stage is cut in half. The new latency for a single instruction is computed below.
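Using the rule that the clock must match the slowest stage, the single-instruction latency of the six-stage design works out to (a worked sketch):

```latex
L = 6 \times \max(\text{stage delay}) = 6 \times 5 \text{ ns} = 30 \text{ ns}
```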

Slide 19: Pipeline Throughput and Latency. Stage delays: IF = 5 ns, ID = 4 ns, EX = 5 ns, MEM1 = 5 ns, MEM2 = 5 ns, WB = 4 ns. [Timing diagram: instructions I1 through I7 flowing through the six stages, with one new instruction entering the pipeline every 5 ns clock cycle.]

Slide 20: Pipeline Throughput and Latency. Stage delays: IF = 5 ns, ID = 4 ns, EX = 5 ns, MEM1 = 5 ns, MEM2 = 5 ns, WB = 4 ns. How long does it take to execute instructions in this pipeline? (Disregard bubbles caused by branches, cache misses, etc., for now.) Thus the speedup that we get from the pipeline is computed below.
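Repeating the earlier calculation with the new 5 ns clock and six stages (a sketch; again the limit is for long instruction sequences):

```latex
T_{\text{pipelined}}(n) \approx (n + 5) \times 5 \text{ ns}, \qquad
\text{Speedup} \approx \frac{28n}{(n+5) \times 5} \;\longrightarrow\; \frac{28}{5} = 5.6 \quad (n \to \infty)
```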

Slide 21: Pipeline Throughput and Latency. Stage delays: IF = 5 ns, ID = 4 ns, EX = 5 ns, MEM1 = 5 ns, MEM2 = 5 ns, WB = 4 ns. What have we learned from this example?
1. It is important to balance the delays in the stages of the pipeline.
2. The throughput of a pipeline is 1/max(delay).
3. The latency is N × max(delay), where N is the number of stages in the pipeline.
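A small sketch (plain Python, not from the lecture) that applies lessons 2 and 3 to the two designs above; the stage names and delays are taken straight from the slides:

```python
# Throughput and latency from the slide's rules of thumb:
#   throughput = 1 / max(stage delay), latency = N * max(stage delay)

def pipeline_stats(stage_delays_ns):
    slowest = max(stage_delays_ns.values())
    return {
        "clock_ns": slowest,                         # cycle time set by slowest stage
        "throughput_instr_per_ns": 1.0 / slowest,    # lesson 2
        "latency_ns": len(stage_delays_ns) * slowest # lesson 3
    }

five_stage = {"IF": 5, "ID": 4, "EX": 5, "MEM": 10, "WB": 4}
six_stage  = {"IF": 5, "ID": 4, "EX": 5, "MEM1": 5, "MEM2": 5, "WB": 4}

print(pipeline_stats(five_stage))  # clock 10 ns, latency 50 ns
print(pipeline_stats(six_stage))   # clock 5 ns,  latency 30 ns
```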

Slide 22: Pipelining Lessons.
- Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload.
- The pipeline rate is limited by the slowest pipeline stage.
- Multiple tasks operate simultaneously.
- Potential speedup = number of pipe stages.
- Unbalanced lengths of pipe stages reduce the speedup.
- Time to "fill" and "drain" the pipeline reduces the speedup.

Slide 23: Computer Pipelines.
- Processors execute billions of instructions, so throughput is what matters.
- Desirable features: all instructions the same length; registers located in the same place in the instruction format; memory operands only in loads or stores.

Slide 24: Visualizing Pipelining (Figure 3.3, page 133, CA:AQA 2e). [Pipeline timing diagram: successive instructions each occupy Ifetch, Reg, ALU, DMem, and Reg (write-back) in consecutive clock cycles 1 through 7.]

Slide 25: It's Not That Easy for Computers. Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle.
- Structural hazards: the hardware cannot support this combination of instructions (a single person to fold and put clothes away).
- Data hazards: an instruction depends on the result of a prior instruction still in the pipeline (missing sock).
- Control hazards: caused by the delay between the fetching of instructions and decisions about changes in control flow (branches and jumps).

Slide 26: One Memory Port / Structural Hazards (Figure 3.6, page 142, CA:AQA 2e). [Timing diagram: a Load followed by Instr 1-4; with a single memory port, the Load's data-memory access and a later instruction's instruction fetch need the memory in the same cycle.]

Slide 27: One Memory Port / Structural Hazards (Figure 3.7, page 143, CA:AQA 2e). [Timing diagram: the same sequence with a bubble inserted - Instr 3 stalls one cycle so its instruction fetch does not conflict with the Load's data-memory access.]

Slide 28: Data Hazard on R1 (Figure 3.9, page 147, CA:AQA 2e). Instruction sequence:
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or  r8,r1,r9
xor r10,r1,r11
[Timing diagram over stages IF, ID/RF, EX, MEM, WB: the add writes r1 in WB, but the following instructions read r1 in earlier cycles, before the write completes.]

Slide 29: Three Generic Data Hazards.
Read After Write (RAW): instruction J tries to read an operand before instruction I writes it.
I: add r1,r2,r3
J: sub r4,r1,r3
Caused by a "dependence" (in compiler nomenclature). This hazard results from an actual need for communication.

Slide 30: Three Generic Data Hazards.
Write After Read (WAR): instruction J writes an operand before instruction I reads it.
I: sub r4,r1,r3
J: add r1,r2,r3
K: mul r6,r1,r7
Called an "anti-dependence" by compiler writers. This results from reuse of the name "r1".

Slide 31: Three Generic Data Hazards.
Write After Write (WAW): instruction J writes an operand before instruction I writes it.
I: sub r1,r4,r3
J: add r1,r2,r3
K: mul r6,r1,r7
Called an "output dependence" by compiler writers. This also results from the reuse of the name "r1".

Slide 32: Three Generic Data Hazards. WAW can't happen in the MIPS 5-stage pipeline because:
- all instructions take 5 stages, and
- writes always happen in stage 5 (and therefore in program order).

Slide 33: Forwarding to Avoid Data Hazard. Instruction sequence (over clock cycles):
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or  r8,r1,r9
xor r10,r1,r11
[Timing diagram: the add's ALU result is forwarded directly to the ALU inputs of the following instructions instead of waiting for write-back.]

Slide 34: HW Change for Forwarding (Figure 3.20, page 161). [Datapath diagram: multiplexers at the ALU inputs select among the register-file outputs and the values held in the EX/MEM and MEM/WB pipeline registers.]
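The slide gives only the hardware picture; as a rough software sketch (not the lecture's own notation), the conditions such a forwarding unit checks each cycle for one ALU operand look roughly like this. The field names (id_ex_rs, ex_mem_rd, and so on) are invented for illustration; the check for the second operand is symmetric.

```python
def forward_a(id_ex_rs, ex_mem_reg_write, ex_mem_rd, mem_wb_reg_write, mem_wb_rd):
    """Decide where the ALU's first operand should come from."""
    # EX hazard: the instruction one ahead will write the register we read.
    if ex_mem_reg_write and ex_mem_rd != 0 and ex_mem_rd == id_ex_rs:
        return "EX/MEM"    # forward the ALU result waiting in the EX/MEM register
    # MEM hazard: the instruction two ahead will write the register we read.
    if mem_wb_reg_write and mem_wb_rd != 0 and mem_wb_rd == id_ex_rs:
        return "MEM/WB"    # forward the value waiting in the MEM/WB register
    return "REGFILE"       # no hazard: use the value read from the register file

# Example: add r1,r2,r3 followed by sub r4,r1,r3 -> r1 comes from EX/MEM.
print(forward_a(id_ex_rs=1, ex_mem_reg_write=True, ex_mem_rd=1,
                mem_wb_reg_write=False, mem_wb_rd=0))   # "EX/MEM"
```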

Slide 35: Data Hazard Even with Forwarding (Figure 3.12, page 153). Instruction sequence:
lw  r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
or  r8,r1,r9
[Timing diagram: the loaded value is not available until the end of the lw's MEM stage, too late for the sub's EX stage in the very next cycle.]

Slide 36: Data Hazard Even with Forwarding (Figure 3.13, page 154). Same instruction sequence:
lw  r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
or  r8,r1,r9
[Timing diagram: the sub is stalled for one cycle (a bubble is inserted), after which the loaded value can be forwarded.]
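A minimal sketch of the interlock that inserts that bubble, assuming the usual load-use hazard check; again, the register-field names are invented for illustration, not taken from the slides.

```python
def must_stall_for_load_use(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    """Stall (insert a bubble) when the instruction in EX is a load whose
    destination register is a source of the instruction being decoded."""
    return id_ex_mem_read and id_ex_rt in (if_id_rs, if_id_rt)

# Example: lw r1,0(r2) in EX while sub r4,r1,r6 is in ID -> stall one cycle.
print(must_stall_for_load_use(True, 1, 1, 6))   # True
```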

Slide 37: Software Scheduling to Avoid Load Hazards. Try producing fast code for
a = b + c;
d = e - f;
assuming a, b, c, d, e, and f are in memory.

Slow code:
LW  Rb,b
LW  Rc,c
ADD Ra,Rb,Rc
SW  a,Ra
LW  Re,e
LW  Rf,f
SUB Rd,Re,Rf
SW  d,Rd

Fast code:
LW  Rb,b
LW  Rc,c
LW  Re,e
ADD Ra,Rb,Rc
LW  Rf,f
SW  a,Ra
SUB Rd,Re,Rf
SW  d,Rd

Slide 38: Why is the fast code faster? In the slow code, ADD Ra,Rb,Rc immediately follows LW Rc,c and SUB Rd,Re,Rf immediately follows LW Rf,f, so each use stalls waiting for the loaded value; the fast code separates every load from its first use, avoiding the load-use stalls.

Slide 39: Control Hazard on Branches - Three Stage Stall. Instruction sequence:
10: beq r1,r3,36
14: and r2,r3,r5
18: or  r6,r1,r7
22: add r8,r1,r9
36: xor r10,r1,r11
[Timing diagram: the three instructions after the beq are fetched before the branch outcome is known, so a taken branch wastes three cycles.]

Slide 40: Example - Branch Stall Impact.
- If 30% of instructions are branches, a stall of 3 cycles is significant.
- Two-part solution: determine whether the branch is taken sooner, AND compute the taken-branch address earlier.
- MIPS branches test whether a register is equal to 0 or not equal to 0.
- MIPS solution: move the zero test to the ID/RF stage; add an adder to calculate the new PC in the ID/RF stage; the result is a 1-clock-cycle penalty for a branch instead of 3.
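To see why a 3-cycle stall on 30% of instructions is "significant", here is a back-of-the-envelope CPI estimate, assuming an ideal CPI of 1 (the baseline CPI is an assumption, not a number from the slide):

```latex
\begin{align*}
\text{CPI with a 3-cycle branch stall} &\approx 1 + 0.30 \times 3 = 1.9\\
\text{CPI with a 1-cycle branch penalty} &\approx 1 + 0.30 \times 1 = 1.3
\end{align*}
```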

Slide 41: Pipelined MIPS Datapath (Figure 3.22, page 163, CA:AQA 2/e). [Revised pipelined datapath: the zero test and the branch-target adder are moved into the ID stage, so the next PC can be selected at the end of ID.] Data stationary control: local decode for each instruction phase / pipeline stage.

Slide 42: Four Branch Hazard Alternatives.
#1: Stall until the branch direction is clear.
#2: Predict branch not taken.
- Execute successor instructions in sequence.
- "Squash" instructions in the pipeline if the branch is actually taken.
- Advantage of late pipeline state update.
- 47% of MIPS branches are not taken on average.
- PC+4 is already calculated, so use it to get the next instruction.
#3: Predict branch taken.
- 53% of MIPS branches are taken on average.
- But the branch target address has not yet been calculated in MIPS: MIPS still incurs a 1-cycle branch penalty. Other machines know the branch target before the outcome.

Slide 43: Four Branch Hazard Alternatives.
#4: Delayed branch - define the branch to take place AFTER a following instruction:
    branch instruction
    sequential successor 1
    sequential successor 2
    ...
    sequential successor n    (branch delay of length n)
    branch target if taken
- A 1-slot delay allows a proper decision and branch target address in the 5-stage pipeline.
- MIPS uses this.

Slide 44: Delayed Branch. Where to get instructions to fill the branch delay slot?
- From before the branch instruction.
- From the target address: only valuable when the branch is taken.
- From the fall-through path: only valuable when the branch is not taken.
- Canceling branches allow more slots to be filled: a canceling branch only executes the instruction in the delay slot if the predicted direction turns out to be correct.

Slide 45: Fig. 3.28. [Figure not reproduced in this transcript.]

Slide 46: Delayed Branch. Compiler effectiveness for a single branch delay slot:
- fills about 60% of branch delay slots;
- about 80% of the instructions executed in branch delay slots are useful in computation;
- so about 50% (60% x 80%) of slots are usefully filled.
Downside of delayed branches: they complicate deeper 7-8 stage pipelines and machines that issue multiple instructions per clock (superscalar).