1 COMP 206: Computer Architecture and Implementation Montek Singh Mon., Sep 9, 2002 Topic: Pipelining Basics.

2 Lecture Overview
Reading: Appendix A (HP3)
- A Pipelined Processor
  - Introduction to the concept of a pipelined processor
  - Pipelined Datapath
  - Pipeline example: Load Instruction
- Pipelined Datapath and Pipelined Control
- Pipeline Example: Interaction among Instructions

3 Pipelining: It's Natural!
Laundry Example:
- Ann, Brian, Cathy, Dave (A, B, C, D) each have one load of clothes to wash, dry, and fold
- Washer takes 30 minutes
- Dryer takes 40 minutes
- "Folder" takes 20 minutes

4 Sequential Laundry
[Figure: timeline from 6 PM to midnight; loads A, B, C, D in task order, run one after another]
- Sequential laundry takes 6 hours for 4 loads
- If they learned pipelining, how long would laundry take?

5 Pipelined Laundry: Start work ASAP
[Figure: timeline from 6 PM to midnight; loads A, B, C, D in task order, overlapped]
- Pipelined laundry takes 3.5 hours for 4 loads
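The 6-hour and 3.5-hour figures can be checked with a short calculation. The sketch below is illustrative only: the function names are made up for this example, and it assumes a load can wait between machines for as long as needed.

```python
# Stage durations in minutes, taken from the laundry example.
STAGES = {"wash": 30, "dry": 40, "fold": 20}
LOADS = 4  # Ann, Brian, Cathy, Dave


def sequential_minutes(stages, loads):
    """Each load finishes every stage before the next load starts."""
    return loads * sum(stages.values())


def pipelined_minutes(stages, loads):
    """Each load starts a stage as soon as it is ready and the machine is free."""
    machine_free = {name: 0 for name in stages}  # time each machine becomes free
    finish = 0
    for _ in range(loads):
        ready = 0  # time at which this load is ready for its next stage
        for name, duration in stages.items():
            start = max(ready, machine_free[name])
            ready = start + duration
            machine_free[name] = ready
        finish = ready
    return finish


sequential_hours = sequential_minutes(STAGES, LOADS) / 60
pipelined_hours = pipelined_minutes(STAGES, LOADS) / 60
print(sequential_hours, pipelined_hours)
```

The pipelined total is 30 + 4 x 40 + 20 minutes: the 40-minute dryer is the bottleneck, so every load after the first waits on it.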

6 Pipelining Lessons
[Figure: timeline from 6 PM to 9 PM; loads A, B, C, D in task order, overlapped]
- Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload
- Pipeline rate is limited by the slowest pipeline stage
- Multiple tasks operate simultaneously
- Potential speedup = number of pipe stages
- Unbalanced lengths of pipe stages reduce speedup
- Time to "fill" the pipeline and time to "drain" it reduce speedup
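These lessons can be made concrete with a small model. The sketch below assumes a clocked pipeline whose cycle time equals its slowest stage, so it is slightly more pessimistic than the laundry timeline; `pipeline_speedup` is a name invented for this illustration.

```python
def pipeline_speedup(stage_times, tasks):
    """Speedup of a clocked pipeline over running every task start to finish.

    Assumption: all stages advance in lockstep, so the cycle time is the
    slowest stage, and the pipeline needs (stages - 1) extra cycles to fill.
    """
    unpipelined = tasks * sum(stage_times)
    cycle = max(stage_times)
    pipelined = (len(stage_times) + tasks - 1) * cycle
    return unpipelined / pipelined


# Balanced 5-stage pipeline: speedup approaches the number of stages.
print(pipeline_speedup([1, 1, 1, 1, 1], 4))
print(pipeline_speedup([1, 1, 1, 1, 1], 1000))

# Unbalanced stages (the laundry times): the limit is 90 / 40, not 3.
print(pipeline_speedup([30, 40, 20], 1000))
```

With few tasks, fill and drain time dominates; with many tasks, the speedup is capped by the ratio of total work per task to the slowest stage.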

7 The Five Stages of a RISC Instruction
- Ifetch: Instruction Fetch
  - Fetch the instruction from the Instruction Memory
- Reg/Dec: Registers Fetch and Instruction Decode
- Exec: Calculate the memory address
- Mem: Read the data from the Data Memory
- WrB: Write the data back to the register file

        Cycle 1   Cycle 2   Cycle 3   Cycle 4   Cycle 5
Load    Ifetch    Reg/Dec   Exec      Mem       WrB

8 Key Ideas Behind Instruction Pipelining
Load:  Ifetch | Reg/Dec | Exec | Mem | WrB
- The load instruction has 5 stages:
  - Five independent functional units to work on each stage
    - Each functional unit is used only once!
  - A second load can start doing Ifetch as soon as the first load finishes its Ifetch stage
  - Each load still takes five cycles to complete
    - The latency of a single load is still 5 cycles
  - The throughput is much higher
    - CPI approaches 1
    - Cycle time is ~1/5th the cycle time of the single-cycle implementation
- Instructions start executing before previous instructions complete execution

9 Pipelining the LOAD Instruction
- The five independent pipeline stages are:
  - Read next instruction: The Ifetch stage
  - Decode instruction and fetch register values: The Reg/Dec stage
  - Execute the operation: The Exec stage
  - Access data memory: The Mem stage
  - Write data to destination register: The WrB stage
- One instruction enters the pipeline every cycle
  - One instruction comes out of the pipeline (completed) every cycle
  - The "effective" CPI is 7/3 for the three loads shown (tends to 1); ~1/5 cycle time

         Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5  Cycle 6  Cycle 7
1st lw   Ifetch   Reg/Dec  Exec     Mem      WrB
2nd lw            Ifetch   Reg/Dec  Exec     Mem      WrB
3rd lw                     Ifetch   Reg/Dec  Exec     Mem      WrB
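The schedule and the 7/3 figure can be reproduced programmatically. This is a minimal sketch assuming one instruction issues per cycle with no stalls; `schedule`, `effective_cpi`, and `render` are names chosen for this example.

```python
from fractions import Fraction

STAGES = ["Ifetch", "Reg/Dec", "Exec", "Mem", "WrB"]


def schedule(num_instructions, stages=STAGES):
    """Map instruction index -> {cycle: stage}, issuing one instruction per cycle."""
    return {
        i: {i + s + 1: stage for s, stage in enumerate(stages)}
        for i in range(num_instructions)
    }


def effective_cpi(num_instructions, stages=STAGES):
    """Total cycles divided by instructions, counting pipeline fill time."""
    total_cycles = len(stages) + num_instructions - 1
    return Fraction(total_cycles, num_instructions)


def render(num_instructions, stages=STAGES):
    """Return the schedule as text, one row per load."""
    total_cycles = len(stages) + num_instructions - 1
    width = max(len(s) for s in stages) + 1
    rows = []
    for i, by_cycle in schedule(num_instructions, stages).items():
        cells = [by_cycle.get(c, "").ljust(width) for c in range(1, total_cycles + 1)]
        rows.append(f"lw {i + 1}: " + "".join(cells).rstrip())
    return "\n".join(rows)


print(render(3))
print("effective CPI:", effective_cpi(3))     # 7/3
print("effective CPI:", effective_cpi(1000))  # 251/250, close to 1
```

Under these assumptions the effective CPI is (n + 4) / n for n loads, which is why it tends to 1 as the instruction count grows.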

10 The Four Stages of R-type
- Ifetch: Instruction fetch
  - Fetch the instruction from the instruction memory
- Reg/Dec: Registers fetch and instruction decode
- Exec: ALU operates on the two register operands
- WrB: Write the ALU output back to the register file

          Cycle 1   Cycle 2   Cycle 3   Cycle 4
R-type    Ifetch    Reg/Dec   Exec      WrB

11 Pipelining the R-type and Load Instructions
- We have a problem called a pipeline conflict or hazard
  - Two instructions try to write to the register file at the same time!
  - "Contention for a shared resource" (in OS terminology)
- It is no longer meaningful to talk about the execution of a single instruction in isolation
  - Execution is inherently concurrent; need to achieve serializability

          Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5  Cycle 6  Cycle 7  Cycle 8  Cycle 9
R-type    Ifetch   Reg/Dec  Exec     Wr
R-type             Ifetch   Reg/Dec  Exec     Wr
Load                        Ifetch   Reg/Dec  Exec     Mem      Wr
R-type                               Ifetch   Reg/Dec  Exec     Wr
R-type                                        Ifetch   Reg/Dec  Exec     Wr

OOPS! We have a problem!
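The write-port conflict can be found mechanically by recording the cycle in which each instruction reaches its Wr stage. The sketch below is illustrative; `write_port_conflicts` is a hypothetical helper, and it assumes one instruction issues per cycle starting at cycle 1.

```python
from collections import defaultdict

# Stage sequences as given on the slides: R-type writes in its 4th stage,
# Load in its 5th.
STAGES = {
    "R-type": ["Ifetch", "Reg/Dec", "Exec", "Wr"],
    "Load": ["Ifetch", "Reg/Dec", "Exec", "Mem", "Wr"],
}


def write_port_conflicts(program, stages=STAGES):
    """Return {cycle: [(index, kind), ...]} for cycles with more than one writer."""
    writers = defaultdict(list)
    for index, kind in enumerate(program):
        issue_cycle = index + 1  # one instruction enters the pipeline per cycle
        write_cycle = issue_cycle + stages[kind].index("Wr")
        writers[write_cycle].append((index, kind))
    return {cycle: users for cycle, users in writers.items() if len(users) > 1}


program = ["R-type", "R-type", "Load", "R-type", "R-type"]
print(write_port_conflicts(program))
```

For this instruction sequence the Load and the R-type that follows it both reach Wr in cycle 7, which is the collision the slide flags.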

12 Important Observations
- Each functional unit can only be used once per instruction
- Each functional unit must be used at the same stage for all instructions
  - Load uses Register File's Write Port during its 5th stage
  - R-type uses Register File's Write Port during its 4th stage
- How to resolve this pipeline hazard?

          1        2         3      4      5
Load      Ifetch   Reg/Dec   Exec   Mem    WrB
R-type    Ifetch   Reg/Dec   Exec   WrB

13 Solution: Delay R-type's Write by 1 Cycle
- Delay R-type's register write by one cycle:
  - Now R-type instructions also use Reg File's write port at Stage 5
  - Mem stage is a NO-OP stage: nothing is being done. Effective CPI?

R-type stages 1 to 5:  Ifetch | Reg/Dec | Exec | Mem | WrB

          Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5  Cycle 6  Cycle 7  Cycle 8  Cycle 9
R-type    Ifetch   Reg/Dec  Exec     Mem      WrB
R-type             Ifetch   Reg/Dec  Exec     Mem      WrB
Load                        Ifetch   Reg/Dec  Exec     Mem      WrB
R-type                               Ifetch   Reg/Dec  Exec     Mem      WrB
R-type                                        Ifetch   Reg/Dec  Exec     Mem      WrB
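One way to see why the padded Mem stage resolves the conflict, and what it does to the effective CPI, is to recompute the write cycles with every instruction using five stages. This is a sketch under the assumption of one issue per cycle and no other hazards; the helper names are invented here.

```python
from fractions import Fraction

PIPELINE_DEPTH = 5  # every instruction now passes through all five stages


def padded_write_cycles(program):
    """Cycle in which each instruction uses the register file write port (stage 5)."""
    return [index + PIPELINE_DEPTH for index in range(len(program))]


def effective_cpi(program):
    """Total cycles (including pipeline fill) divided by instruction count."""
    total_cycles = len(program) + PIPELINE_DEPTH - 1
    return Fraction(total_cycles, len(program))


program = ["R-type", "R-type", "Load", "R-type", "R-type"]
cycles = padded_write_cycles(program)
print(cycles)                  # one writer per cycle
print(effective_cpi(program))  # 9/5 for this five-instruction example
```

Because every instruction writes in its fifth stage, consecutive instructions write in consecutive cycles. The NO-OP stage adds latency to each R-type, but throughput is unchanged: one instruction still completes per cycle, so the effective CPI still tends to 1.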

14 The Four Stages of Store
- Ifetch: Instruction fetch
  - Fetch the instruction from the instruction memory
- Reg/Dec: Registers fetch and instruction decode
- Exec: Calculate the memory address
- Mem: Write the data into the data memory

         Cycle 1   Cycle 2   Cycle 3   Cycle 4
Store    Ifetch    Reg/Dec   Exec      Mem       (WrB slot shown in the figure but not used)

15 The Four Stages of Beq
- Ifetch: Instruction fetch
  - Fetch the instruction from the instruction memory
- Reg/Dec: Registers fetch and instruction decode
- Exec: ALU compares the two register operands
  - Adder calculates the branch target address
- Mem: If the registers we compared in the Exec stage are the same,
  - Write the branch target address into the PC

       Cycle 1   Cycle 2   Cycle 3   Cycle 4
Beq    Ifetch    Reg/Dec   Exec      Mem       (WrB slot shown in the figure but not used)
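The Exec and Mem behavior of beq can be sketched in a few lines. The slide does not spell out how the target is formed; the version below assumes standard MIPS semantics (target = PC+4 plus the sign-extended 16-bit immediate shifted left by 2), and the function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def sign_extend16(value):
    """Interpret a 16-bit field as a signed two's-complement integer."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def beq_next_pc(pc, rs_value, rt_value, imm16):
    """Return the PC after a beq located at address `pc`."""
    pc_plus_4 = (pc + 4) & MASK32
    # Exec: the ALU compares the operands while the adder forms the target.
    zero = ((rs_value - rt_value) & MASK32) == 0
    target = (pc_plus_4 + (sign_extend16(imm16) << 2)) & MASK32
    # Mem: the target is written into the PC only if the operands were equal.
    return target if zero else pc_plus_4


print(beq_next_pc(8, 5, 5, 3))       # taken: 12 + 3*4 = 24
print(beq_next_pc(8, 5, 6, 3))       # not taken: 12
print(beq_next_pc(8, 1, 1, 0xFFFF))  # taken, offset -1: 12 - 4 = 8
```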

16 A Pipelined Datapath
[Figure: pipelined datapath. Stages: Ifetch, Reg/Dec, Exec, Mem, WrB. Pipeline registers: IF/ID, ID/Ex, Ex/Mem, Mem/Wr. Units: PC, IF_Unit, RFile (Ra, Rb, Rw, Di; busA, busB), Exec Unit, Data Mem (RA, WA, Di, Do). Control signals: RegWr, ExtOp, ALUOp, ALUSrc, RegDst, MemWr, MemtoReg, Branch, Zero. Values carried forward: Rs, Rt, Rd, Imm16, PC+4. Clocked by Clk.]

17 The Instruction Fetch Stage
Location 8: lw $1, 0x100($2)    $1 <- Mem{($2) + 0x100}
[Figure: pipelined datapath with the Ifetch stage marked "You are here!"; PC = 12; the IF/ID register receives lw $1, 0x100($2)]

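18 Detailed View of the Instruction Fetch Unit
Location 8: lw $1, 0x100($2)
[Figure: Instruction Fetch Unit in the Ifetch stage ("You are here!"). Labels: PC = 12, instruction address "8" into the Instruction Memory, Adder with constant "4", Clk; outputs the instruction lw $1, 0x100($2) and the 32-bit PC+4 toward Reg/Dec.]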

19 The Decode / Register Fetch Stage
Location 8: lw $1, 0x100($2)    $1 <- Mem{($2) + 0x100}

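20 Detailed View of the Fetch/Decode Stage
[Figure: instruction fields OP, rs, rt, rd, func, and Imm16 with PC+4; Control block; Register File with ports Ra (rs), Rb (rt), Rw, and Din; outputs Bus-A, Bus-B, rt, rd, Imm16, and PC+4; clocked by Clk.]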

21 Load's Address Calculation Stage
Location 8: lw $1, 0x100($2)    $1 <- Mem{($2) + 0x100}

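22 Detailed View of the Execution Unit
[Figure: Execution Unit in the Exec stage ("You are here!"), between the ID/Ex register and the Ex/Mem register (Load's memory address). Labels: 32-bit busA and busB; 16-bit imm16 into the Extender (ExtOp=1); Mux selected by ALUSrc=1; ALU Control (ALUOp=Add) producing the 3-bit ALUctr; ALU outputs Zero and the 32-bit ALUout; a separate Adder combines PC+4 with the value shifted left by 2 (<< 2) to form the 32-bit Target; clocked by Clk.]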

23 Load's Memory Access Stage
Location 8: lw $1, 0x100($2)    $1 <- Mem{($2) + 0x100}

24 Load's Write Back Stage
Location 8: lw $1, 0x100($2)    $1 <- Mem{($2) + 0x100}
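The five-stage walk of the load at location 8 can be traced in software. The sketch below is a functional model, not the datapath itself: the initial value of $2 (0x2000) and the memory contents (42 at 0x2100) are made-up example values, and `run_load` is a hypothetical helper.

```python
def sign_extend16(value):
    """Interpret a 16-bit field as a signed two's-complement integer."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def run_load(pc, instruction_memory, registers, data_memory):
    """Step one lw through Ifetch, Reg/Dec, Exec, Mem, WrB; return (new_pc, trace)."""
    trace = []

    # Ifetch: read the instruction and compute PC+4.
    instruction = instruction_memory[pc]
    pc += 4
    trace.append(("Ifetch", f"PC = {pc}"))

    # Reg/Dec: read the base register (busA) and extend the immediate (ExtOp=1).
    bus_a = registers[instruction["rs"]]
    imm = sign_extend16(instruction["imm16"])
    trace.append(("Reg/Dec", f"busA = {bus_a:#x}, imm = {imm:#x}"))

    # Exec: the ALU adds base and offset to form the memory address.
    address = bus_a + imm
    trace.append(("Exec", f"address = {address:#x}"))

    # Mem: read the data memory at that address.
    data = data_memory[address]
    trace.append(("Mem", f"data = {data}"))

    # WrB: write the loaded value into the destination register rt.
    registers[instruction["rt"]] = data
    trace.append(("WrB", f"${instruction['rt']} = {data}"))
    return pc, trace


# Location 8: lw $1, 0x100($2). Register and memory contents are assumed values.
instruction_memory = {8: {"op": "lw", "rt": 1, "rs": 2, "imm16": 0x100}}
registers = {1: 0, 2: 0x2000}
data_memory = {0x2100: 42}

new_pc, trace = run_load(8, instruction_memory, registers, data_memory)
for stage, note in trace:
    print(f"{stage:8s} {note}")
```

The trace shows PC = 12 after the fetch, matching the figures, and $1 receives Mem{($2) + 0x100} in the final stage.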