Slides:



Advertisements
Similar presentations
PipelineCSCE430/830 Pipeline: Introduction CSCE430/830 Computer Architecture Lecturer: Prof. Hong Jiang Courtesy of Prof. Yifeng Zhu, U of Maine Fall,
Advertisements

1 IKI20210 Pengantar Organisasi Komputer Kuliah no. 25: Pipeline 10 Januari 2003 Bobby Nazief Johny Moningka
CMSC 611: Advanced Computer Architecture Pipelining Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted.
Chapter 8. Pipelining.
Review: Pipelining. Pipelining Laundry Example Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold Washer takes 30 minutes Dryer.
Pipelining I (1) Fall 2005 Lecture 18: Pipelining I.
Pipelining Hwanmo Sung CS147 Presentation Professor Sin-Min Lee.
Pipeline Computer Architecture Lecture 12: Designing a Pipeline Processor.
Computer Architecture
CS252/Patterson Lec 1.1 1/17/01 Pipelining: Its Natural! Laundry Example Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold Washer.
Pipelining - Hazards.
Mary Jane Irwin ( ) [Adapted from Computer Organization and Design,
ENEE350 Ankur Srivastava University of Maryland, College Park Based on Slides from Mary Jane Irwin ( )
Prof. John Nestor ECE Department Lafayette College Easton, Pennsylvania Computer Organization Pipelined Processor Design 1.
Computer ArchitectureFall 2007 © October 24nd, 2007 Majd F. Sakr CS-447– Computer Architecture.
ECE 232 L19.Pipeline2.1 Adapted from Patterson 97 ©UCBCopyright 1998 Morgan Kaufmann Publishers ECE 232 Hardware Organization and Design Lecture 19 Pipelining,
Computer ArchitectureFall 2007 © October 22nd, 2007 Majd F. Sakr CS-447– Computer Architecture.
1 CSE SUNY New Paltz Chapter Six Enhancing Performance with Pipelining.
Pipelining Datapath Adapted from the lecture notes of Dr. John Kubiatowicz (UC Berkeley) and Hank Walker (TAMU)
1 COMP 206: Computer Architecture and Implementation Montek Singh Mon., Sep 9, 2002 Topic: Pipelining Basics.
1 Atanasoff–Berry Computer, built by Professor John Vincent Atanasoff and grad student Clifford Berry in the basement of the physics building at Iowa State.
Pipelining - II Adapted from CS 152C (UC Berkeley) lectures notes of Spring 2002.
Computer ArchitectureFall 2008 © October 6th, 2008 Majd F. Sakr CS-447– Computer Architecture.
Ceg3420 L13.1 DAP Fa97,  U.CB CEG3420 Computer Design Introduction to Pipelining.
Prof. John Nestor ECE Department Lafayette College Easton, Pennsylvania ECE Computer Organization Lecture 17 - Pipelined.
Pipelining - II Rabi Mahapatra Adapted from CS 152C (UC Berkeley) lectures notes of Spring 2002.
Introduction to Pipelining Rabi Mahapatra Adapted from the lecture notes of Dr. John Kubiatowicz (UC Berkeley)
Lecture 12: Pipeline Datapath Design Professor Mike Schulte Computer Architecture ECE 201.
9.2 Pipelining Suppose we want to perform the combined multiply and add operations with a stream of numbers: A i * B i + C i for i =1,2,3,…,7.
CS1104: Computer Organisation School of Computing National University of Singapore.
Computer Science Education
B 0000 Pipelining ENGR xD52 Eric VanWyk Fall
EEL5708 Lotzi Bölöni EEL 5708 High Performance Computer Architecture Pipelining.
Pipelining (I). Pipelining Example  Laundry Example  Four students have one load of clothes each to wash, dry, fold, and put away  Washer takes 30.
Analogy: Gotta Do Laundry
CSE 340 Computer Architecture Summer 2014 Basic MIPS Pipelining Review.
CS.305 Computer Architecture Enhancing Performance with Pipelining Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005, and from.
1 Designing a Pipelined Processor In this Chapter, we will study 1. Pipelined datapath 2. Pipelined control 3. Data Hazards 4. Forwarding 5. Branch Hazards.
CSCI-365 Computer Organization Lecture Note: Some slides and/or pictures in the following are adapted from: Computer Organization and Design, Patterson.
ECE 232 L18.Pipeline.1 Adapted from Patterson 97 ©UCBCopyright 1998 Morgan Kaufmann Publishers ECE 232 Hardware Organization and Design Lecture 18 Pipelining.
EECS 322 March 27, 2000 Based on Dave Patterson slides Instructor: Francis G. Wolff Case Western Reserve University This presentation.
Cs 152 L1 3.1 DAP Fa97,  U.CB Pipelining Lessons °Pipelining doesn’t help latency of single task, it helps throughput of entire workload °Multiple tasks.
Chap 6.1 Computer Architecture Chapter 6 Enhancing Performance with Pipelining.
CSIE30300 Computer Architecture Unit 04: Basic MIPS Pipelining Hsin-Chou Chi [Adapted from material by and
Pipelining Example Laundry Example: Three Stages
CMSC 611: Advanced Computer Architecture Pipelining Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted.
CPE 442 hazards.1 Introduction to Computer Architecture CpE 442 Designing a Pipeline Processor (lect. II)
Pipelining CS365 Lecture 9. D. Barbara Pipeline CS465 2 Outline  Today’s topic  Pipelining is an implementation technique in which multiple instructions.
CS252/Patterson Lec 1.1 1/17/01 معماري کامپيوتر - درس نهم pipeline برگرفته از درس : Prof. David A. Patterson.
Lecture 9. MIPS Processor Design – Pipelined Processor Design #1 Prof. Taeweon Suh Computer Science Education Korea University 2010 R&E Computer System.
CMSC 611: Advanced Computer Architecture Pipelining Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted.
CSCI-365 Computer Organization Lecture Note: Some slides and/or pictures in the following are adapted from: Computer Organization and Design, Patterson.
EEL-4713 Ann Gordon-Ross 1 EEL-4713C Computer Architecture Designing a Pipelined Processor.
Advanced Computer Architecture CS 704 Advanced Computer Architecture Lecture 10 Computer Hardware Design (Pipeline Datapath and Control Design) Prof. Dr.
B 0001 Pipelining ENGR xD52 Eric VanWyk Fall
B1111 Pipelining ENGR xD52 Eric VanWyk Fall 2014.
Lecture 18: Pipelining I.
Pipelines An overview of pipelining
Review: Instruction Set Evolution
CMSC 611: Advanced Computer Architecture
ECE232: Hardware Organization and Design
Pipelining Lessons 6 PM T a s k O r d e B C D A 30
Chapter 3: Pipelining 순천향대학교 컴퓨터학부 이 상 정 Adapted from
Chapter 4 The Processor Part 2
CS-447– Computer Architecture Lecture 14 Pipelining (2)
Pipelining Lessons 6 PM T a s k O r d e B C D A 30
An Introduction to pipelining
Pipelining Appendix A and Chapter 3.
Recall: Performance Evaluation
Presentation transcript:

Pipelining A B C D Example: Doing the laundry Ann, Brian, Cathy, & Dave each have one load of clothes to wash, dry, and fold Washer takes 30 minutes Dryer takes 40 minutes “Folder” takes 20 minutes A B C D

Sequential Laundry 6 PM 7 8 9 10 11 Midnight 30 40 20 30 40 20 30 40 Time 30 40 20 30 40 20 30 40 20 30 40 20 T a s k O r d e A B C D Sequential laundry takes 6 hours for 4 loads If they learned pipelining, how long would laundry take?

Pipelined Laundry: Start work ASAP 6 PM 7 8 9 10 11 Midnight Time 30 40 20 T a s k O r d e A B C D Pipelined laundry takes 3.5 hours for 4 loads

Pipelining Lessons 6 PM 7 8 9 30 40 20 A B C D Pipelining doesn’t help latency of single task, it helps throughput of entire workload Pipeline rate limited by slowest pipeline stage Multiple tasks operating simultaneously using different resources Potential speedup = Number pipe stages Unbalanced lengths of pipe stages reduces speedup Time to “fill” pipeline and time to “drain” it reduces speedup Stall for Dependences 6 PM 7 8 9 Time 30 40 20 T a s k O r d e A B C D

Pipelined Execution Time IFetch Dcd Exec Mem WB IFetch Dcd Exec Mem WB Program Flow IFetch Dcd Exec Mem WB Now we just have to make it work

Single Cycle, Multiple Cycle, vs. Pipeline Clk Single Cycle Implementation: Load Store Waste Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 Cycle 10 Clk Multiple Cycle Implementation: Load Store R-type Ifetch Reg Exec Mem Wr Ifetch Reg Exec Mem Ifetch Pipeline Implementation: Load Ifetch Reg Exec Mem Wr Store Ifetch Reg Exec Mem Wr R-type Ifetch Reg Exec Mem Wr

Why Pipeline? Suppose we execute 100 instructions Single Cycle Machine 45 ns/cycle x 1 CPI x 100 inst = ____ ns Multicycle Machine 10 ns/cycle x 4.0 CPI x 100 inst = ____ ns Ideal pipelined machine 10 ns/cycle x (1 CPI x 100 inst + 4 cycle drain) = ____ ns

CPI for Pipelined Processors Ideal pipelined machine 10 ns/cycle x (1 CPI x 100 inst + 4 cycle drain) = ____ ns CPI in pipelined processor is “issue rate”. Ignore fill/drain, ignore latency. Example: A processor wastes 2 cycles after every branch, and 1 after every load, during which it cannot issue a new instruction. If a program has 10% branches and 30% loads, what is the CPI on this program?

Pipelined Datapath Divide datapath into multiple pipeline stages IF Instruction Fetch RF Register Fetch EX Execute MEM Data Memory WB Writeback Register Register Register Register PC Data Memory Instr. Register File

Pipelined Control The Main Control generates the control signals during Reg/Dec Control signals for Exec (ALUOp, ALUSrcA, …) are used 1 cycle later Control signals for Mem (MemWE, IorD, …) are used 2 cycles later Control signals for Wr (Mem2Reg, RegWE, …) are used 3 cycles later Reg/Dec Exec Mem Wr ALUSrcA ALUSrcA ALUSrcB ALUSrcB ALUOp ALUOp Main Control RegDst RegDst Ex/Mem Register IF/ID Register ID/Ex Register Mem/Wr Register MemWE MemWE MemWE IorD IorD IorD Mem2Reg Mem2Reg Mem2Reg Mem2Reg RegWE RegWE RegWE RegWE

Can pipelining get us into trouble? Yes: Pipeline Hazards structural hazards: attempt to use the same resource two different ways at the same time E.g., combined washer/dryer would be a structural hazard or folder busy doing something else (watching TV) data hazards: attempt to use item before it is ready E.g., one sock of pair in dryer and one in washer; can’t fold until get sock from washer through dryer instruction depends on result of prior instruction still in the pipeline control hazards: attempt to make decision before condition evaluated E.g., washing football uniforms and need to get proper detergent level; need to see after dryer before next load in branch instructions Can always resolve hazards by waiting pipeline control must detect the hazard take action (or delay action) to resolve hazards

Pipelining the Load Instruction The five independent functional units in the pipeline datapath are: Instruction Memory for the Ifetch stage Register File’s Read ports (bus A and busB) for the Reg/Dec stage ALU for the Exec stage Data Memory for the Mem stage Register File’s Write port (bus W) for the Wr stage Clock Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Ifetch Reg/Dec Exec Mem Wr 1st lw 2nd lw 3rd lw

The Four Stages of R-type Ifetch: Fetch the instruction from the Instruction Memory Reg/Dec: Register Fetch and Instruction Decode Exec: ALU operates on the two register operands Wr: Write the ALU output back to the register file Cycle 1 Cycle 2 Cycle 3 Cycle 4 Ifetch Reg/Dec Exec Wr R-type

Structural Hazard Interaction between R-type and loads causes structural hazard on writeback Clock Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 Ifetch Reg/Dec Exec Wr R-type Mem Load

Important Observation Each functional unit can only be used once per instruction Each functional unit must be used at the same stage for all instructions: Load uses Register File’s Write Port during its 5th stage R-type uses Register File’s Write Port during its 4th stage Solution: Delay R-type’s register write by one cycle: Now R-type instructions also use Reg File’s write port at Stage 5 Mem stage is a NOOP stage: nothing is being done. Ifetch Reg/Dec Exec Mem Wr Load 1 2 3 4 5 Ifetch Reg/Dec Exec Wr R-type 1 2 3 4 Ifetch Reg/Dec Exec Wr R-type Mem 1 2 3 4 5

Pipelining the R-type Instruction Clock Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 Ifetch Reg/Dec Mem Wr R-type Exec Load

The Four Stages of Store Ifetch: Fetch the instruction from the Instruction Memory Reg/Dec: Register Fetch and Instruction Decode Exec: Calculate the memory address Mem: Write the data into the Data Memory Wr: NOOP Compatible with Load & R-type instructions Cycle 1 Cycle 2 Cycle 3 Cycle 4 Ifetch Reg/Dec Exec Mem Store Wr

The Stages of Branch Ifetch: Fetch the instruction from the Instruction Memory Reg/Dec: Register Fetch and Instruction Decode, compute branch target Exec: Test condition & update the PC Mem: NOOP Wr: NOOP Cycle 1 Cycle 2 Cycle 3 Cycle 4 Ifetch Reg/Dec Exec Mem Beq Wr

Control Hazard Branch updates the PC at the end of the Exec stage. Clock Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 Ifetch Reg/Dec Mem Wr R-type beq Exec load

Accelerate Branches When can we compute branch target address? When can we compute beq condition? IF Instruction Fetch RF Register Fetch EX Execute MEM Data Memory WB Writeback Register Register Register Register PC Data Memory Instr. Register File

Control Hazard 2 Branch updates the PC at the end of the Reg/Dec stage. Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 Clock R-type Ifetch Reg/Dec Exec Mem Wr beq Ifetch Reg/Dec Exec Mem Wr load Ifetch Reg/Dec Exec Mem Wr R-type Ifetch Reg/Dec Exec Mem Wr R-type Ifetch Reg/Dec Exec Mem Wr Cycle 1 Cycle 2 Cycle 3 Cycle 4 Ifetch Reg/Dec Exec Mem Beq Wr

Solution #1: Stall Delay loading next instruction, load no-op instead CPI if all other instructions take 1 cycle, and branches are 20% of instructions? Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 Clock R-type Ifetch Reg/Dec Exec Mem Wr beq Ifetch Reg/Dec Exec Mem Wr Stall Bubble Bubble Bubble Bubble Ifetch Reg/Dec Exec Mem Wr R-type Ifetch Reg/Dec Exec Mem Wr R-type Ifetch Reg/Dec Exec Mem Wr

Solution #2: Branch Prediction Guess all branches not taken, squash if wrong CPI if 50% of branches actually not taken, and branch frequency 20%? Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 Clock R-type Ifetch Reg/Dec Exec Mem Wr beq Ifetch Reg/Dec Exec Mem Wr load Ifetch Reg/Dec Exec Mem Wr R-type Ifetch Reg/Dec Exec Mem Wr R-type Ifetch Reg/Dec Exec Mem Wr

Solution #3: Branch Delay Slot Redefine branches: Instruction directly after branch always executed Instruction after branch is the delay slot Compiler/assembler fills the delay slot add $t1, $t0, $t0 beq $t2, $t3, FOO sub $t2, $t0, $t3 add $t1, $t0, $t0 beq $t1, $t3, FOO add $t1, $t0, $t0 beq $t1, $t3, FOO add $t1, $t3, $t3 … FOO: add $t1, $t2, $t0 add $t1, $t0, $t0 beq $t1, $t3, FOO

Data Hazards Consider the following code: add $t0, $t1, $t2 sub $t3, $t0, $t4 and $t5, $t0, $t7 or $t8, $t0, $s0 xor $s1, $t0, $s2 Mem Wr Exec Clock Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 Ifetch Reg/Dec add sub and or xor

Design Register File Carefully What if reads see value after write during the same cycle? add $t0, $t1, $t2 sub $t3, $t0, $t4 and $t5, $t0, $t7 or $t8, $t0, $s0 xor $s1, $t0, $s2 Mem Wr Exec Clock Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 Ifetch Reg/Dec add sub and or xor

Forwarding Add logic to pass last two values from ALU output to ALU input(s) as needed Forward the ALU output to later instructions add $t0, $t1, $t2 sub $t3, $t0, $t4 and $t5, $t0, $t7 or $t8, $t0, $s0 xor $s1, $t0, $s2 Mem Wr Exec Clock Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 Ifetch Reg/Dec add sub and or xor

Forwarding (cont.) Requires values from last two ALU operations. Remember destination register for operation. Compare sources of current instruction to destinations of previous 2. IF Instruction Fetch RF Register Fetch EX Execute MEM Data Memory WB Writeback Register Register Register Register PC Data Memory Instr. Register File

Data Hazards on Loads load $t0, 0($t1) sub $t3, $t0, $t4 and $t5, $t0, $t7 or $t8, $t0, $s0 xor $s1, $t0, $s2 Mem Wr Exec Clock Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 Ifetch Reg/Dec load sub and or xor

Data Hazards on Loads Solution: Use same forwarding hardware & register file for hazards 2+ cycles later Force compiler to not allow register reads within a cycle of load Fill delay slot, or insert no-op.

Pipelined CPI, cycle time CPI, assuming compiler can fill 50% of delay slots Pipelined: cycle time = 1ns. Delay for 1M instr: Multicycle: CPI = 4.0, cycle time = 1ns. Delay for 1M instr: Single cycle: CPI = 1.0, cycle time = 4.5ns. Delay for 1M instr: Instruction Type Type Cycles Type Frequency Cycles * Freq ALU 50% Load 20% Store 10% Branch CPI:

Pipelined CPU Summary