
Review: Pipelining

Pipelining Laundry Example
Ann, Brian, Cathy, and Dave each have one load of clothes (A, B, C, D) to wash, dry, and fold.
Washer takes 30 minutes, dryer takes 40 minutes, “folder” takes 20 minutes.

Sequential Laundry
Sequential laundry takes 6 hours for 4 loads.
[Figure: task order vs. time, 6 PM to midnight, loads A–D run back to back]

Pipelined Laundry
Start work ASAP: pipelined laundry takes 3.5 hours for 4 loads.
[Figure: task order vs. time, 6 PM to midnight, loads A–D overlapped in the washer, dryer, and folder]

Pipelining: Observations
Multiple tasks operate simultaneously.
Pipelining doesn’t help the latency of a single task; it helps the throughput of the entire workload.
The pipeline rate is limited by the slowest pipeline stage.
Potential speedup = number of pipe stages.
Unbalanced lengths of pipe stages reduce speedup.
Time to “fill” the pipeline and time to “drain” it reduce speedup.
[Figure: task order vs. time, 6 PM to 9 PM]
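Here is a minimal Python sketch (not from the slides) of the laundry arithmetic behind these observations: the sequential total, the pipelined total limited by the slowest stage (the dryer), and why the resulting speedup falls short of the 3-stage ideal.

# Stage times in minutes from the example: washer, dryer, folder.
STAGES = [30, 40, 20]
LOADS = 4

# Sequential: each load finishes all three stages before the next one starts.
sequential = LOADS * sum(STAGES)                    # 4 * 90 = 360 min = 6 hours

# Pipelined: the first load fills the pipeline, then every additional load
# finishes one bottleneck-stage time (the 40-minute dryer) later.
bottleneck = max(STAGES)
pipelined = sum(STAGES) + (LOADS - 1) * bottleneck  # 90 + 3 * 40 = 210 min = 3.5 hours

print(sequential / 60, pipelined / 60)   # 6.0 3.5
print(sequential / pipelined)            # ~1.7x, below the 3-stage ideal because the
                                         # stages are unbalanced and the pipe must
                                         # fill and drain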

5 Steps of DLX Datapath (Figure 3.1)
Stages: Instruction Fetch, Instr. Decode / Reg. Fetch, Execute / Addr. Calc., Memory Access, Write Back.
[Datapath diagram: PC and +4 adder, instruction memory, IR and NPC, register file, sign extend, A/B/Imm latches, ALU with Zero?/Cond test, ALU output, data memory, LMD, and the operand/result muxes]

Pipelined DLX Datapath (Figure 3.4)
Stages: Instruction Fetch, Instr. Decode / Reg. Fetch, Execute / Addr. Calc., Memory Access, Write Back.
[Datapath diagram: PC and +4 adder, instruction memory, register file, 16-to-32-bit sign extend, ALU with Zero? test, data memory, operand/result muxes, and the pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB]

Visualizing Pipelining (Figure 3.3)
[Figure: instructions in program order (vertical) vs. time in clock cycles (horizontal), each instruction starting one cycle after its predecessor]
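Since the figure does not survive as text, here is a hypothetical Python sketch that prints the same kind of diagram: each instruction enters IF one cycle after its predecessor and then flows through ID, EX, MEM, and WB.

STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_diagram(n_instrs):
    """Print an instruction/cycle grid for an ideal pipeline with no hazards."""
    n_cycles = n_instrs + len(STAGES) - 1
    print("      " + " ".join(f"C{c + 1:<3}" for c in range(n_cycles)))
    for i in range(n_instrs):
        row = ["    "] * n_cycles
        for s, name in enumerate(STAGES):
            row[i + s] = f"{name:<4}"      # instruction i is in stage s during cycle i + s
        print(f"I{i + 1:<4} " + " ".join(row))

pipeline_diagram(4)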

Limits to Pipelining
Hazards prevent the next instruction from executing during its designated clock cycle:
– Structural hazards: the hardware cannot support this combination of instructions.
– Data hazards: an instruction depends on the result of a prior instruction still in the pipeline.
– Control hazards: pipelining of branches and other instructions that change the PC.
The common solution is to stall the pipeline until the hazard is resolved, inserting one or more “bubbles” into the pipeline.

One Memory Port / Structural Hazards (Figure 3.6)
[Figure: pipeline diagram over time in clock cycles for Load, Instr 1, Instr 2, Instr 3, Instr 4 sharing a single memory port; the Load’s data-memory access and a later instruction’s fetch need the memory in the same cycle]

One Memory Port / Structural Hazards (Figure 3.7)
[Figure: the same sequence with the conflict resolved by a stall; Instr 3’s fetch is delayed one cycle, inserting a bubble]
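As a rough illustration (not from the slides), the sketch below locates the cycle in which a load’s MEM access and a later instruction’s IF would both need the single memory port; the zero-based “instruction i fetches in cycle i” convention is an assumption for the example.

# Stage offsets in a 5-stage pipeline: IF=0, ID=1, EX=2, MEM=3, WB=4.
def mem_port_conflicts(is_load):
    """Return (load_index, fetched_index, cycle) triples that would clash on one memory port."""
    conflicts = []
    for i, load in enumerate(is_load):
        if not load:
            continue
        mem_cycle = i + 3          # cycle in which this load reads data memory
        j = mem_cycle              # index of the instruction fetched in that cycle
        if j < len(is_load):
            conflicts.append((i, j, mem_cycle))
    return conflicts

# Load followed by four ordinary instructions, as in Figures 3.6/3.7:
print(mem_port_conflicts([True, False, False, False, False]))
# [(0, 3, 3)]: the fourth instruction (the slide's Instr 3) must delay its fetch by a cycle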

Speed Up Equation for Pipelining
Speedup from pipelining = Ave Instr Time unpipelined / Ave Instr Time pipelined
  = (CPI unpipelined x Clock Cycle unpipelined) / (CPI pipelined x Clock Cycle pipelined)
  = (CPI unpipelined / CPI pipelined) x (Clock Cycle unpipelined / Clock Cycle pipelined)
Ideal CPI = CPI unpipelined / Pipeline depth, so
Speedup = (Ideal CPI x Pipeline depth / CPI pipelined) x (Clock Cycle unpipelined / Clock Cycle pipelined)

Speed Up Equation for Pipelining
CPI pipelined = Ideal CPI + Pipeline stall clock cycles per instruction
Speedup = (Ideal CPI x Pipeline depth) / (Ideal CPI + Pipeline stall CPI) x (Clock Cycle unpipelined / Clock Cycle pipelined)
What is the maximum possible speedup? With Ideal CPI = 1:
Speedup = Pipeline depth / (1 + Pipeline stall CPI) x (Clock Cycle unpipelined / Clock Cycle pipelined)

Example: Dual-port vs. Single-port
Machine A: dual-ported memory.
Machine B: single-ported memory, but its pipelined implementation has a 1.05 times faster clock rate.
Ideal CPI = 1 for both; loads are 40% of the instructions executed.
SpeedUp A = Pipeline Depth / (1 + 0) x (clock unpipe / clock pipe) = Pipeline Depth
SpeedUp B = Pipeline Depth / (1 + 0.4 x 1) x (clock unpipe / clock pipe)
  = (Pipeline Depth / 1.4) x 1.05
  = 0.75 x Pipeline Depth
SpeedUp A / SpeedUp B = Pipeline Depth / (0.75 x Pipeline Depth) = 1.33
Machine A is 1.33 times faster.
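The same arithmetic can be checked with a small Python sketch; the function name and the clock_ratio parameter are illustrative, and ideal CPI = 1 is assumed as in the example.

def pipeline_speedup(depth, stall_cpi, clock_ratio=1.0):
    """Speedup over the unpipelined machine: depth / (1 + stall CPI) x clock ratio."""
    return depth / (1.0 + stall_cpi) * clock_ratio

depth = 5                                      # any depth; it cancels in the comparison
speedup_a = pipeline_speedup(depth, stall_cpi=0.0)                        # dual-ported: no structural stalls
speedup_b = pipeline_speedup(depth, stall_cpi=0.4 * 1, clock_ratio=1.05)  # 40% loads stall 1 cycle, 1.05x clock
print(speedup_a / speedup_b)                   # ~1.33, so Machine A is about 1.33x faster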

Data Hazard on R1 Figure 3.9

Three Generic Data Hazards
Instr I followed by Instr J.
Read After Write (RAW): Instr J tries to read an operand before Instr I writes it.

Three Generic Data Hazards
Instr I followed by Instr J.
Write After Read (WAR): Instr J tries to write an operand before Instr I reads it.
Can’t happen in the DLX 5-stage pipeline because:
– all instructions take 5 stages,
– reads are always in stage 2, and
– writes are always in stage 5.

Three Generic Data Hazards
Instr I followed by Instr J.
Write After Write (WAW): Instr J tries to write an operand before Instr I writes it, leaving the wrong result (Instr I’s, not Instr J’s).
Can’t happen in the DLX 5-stage pipeline because:
– all instructions take 5 stages, and
– writes are always in stage 5.
We will see WAR and WAW hazards in later, more complicated pipelines.
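A small, hypothetical Python sketch of the three definitions, classifying the hazards between an earlier instruction I and a later instruction J from the registers each one reads and writes:

def classify_hazards(i_reads, i_writes, j_reads, j_writes):
    """Return the dependence hazards between earlier instruction I and later instruction J."""
    hazards = []
    if i_writes & j_reads:
        hazards.append("RAW")   # J reads a value I has not written yet
    if i_reads & j_writes:
        hazards.append("WAR")   # J writes a value I has not read yet
    if i_writes & j_writes:
        hazards.append("WAW")   # out-of-order writes would leave I's result, not J's
    return hazards

# ADD R1,R2,R3 followed by SUB R4,R1,R5: a RAW hazard on R1
print(classify_hazards(i_reads={"R2", "R3"}, i_writes={"R1"},
                       j_reads={"R1", "R5"}, j_writes={"R4"}))   # ['RAW']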

Forwarding to Avoid Data Hazard Figure 3.10

HW Change for Forwarding Figure 3.20
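The figure itself is not reproduced here, but the forwarding decision it adds can be sketched roughly as follows (standard DLX/MIPS-style conditions; the function and field names are illustrative):

def forward_select(src_reg, ex_mem_regwrite, ex_mem_rd, mem_wb_regwrite, mem_wb_rd):
    """Choose where one ALU source operand should come from."""
    if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == src_reg:
        return "EX/MEM"      # forward the ALU result computed in the previous cycle
    if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == src_reg:
        return "MEM/WB"      # forward the value about to be written back
    return "REGFILE"         # no hazard: read the register file normally

# ADD R1,R2,R3 immediately followed by SUB R4,R1,R5: take R1 from EX/MEM
print(forward_select(src_reg=1, ex_mem_regwrite=True, ex_mem_rd=1,
                     mem_wb_regwrite=False, mem_wb_rd=0))        # EX/MEM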

Data Hazard Even with Forwarding Figure 3.12

Data Hazard Even with Forwarding Figure 3.13
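Figures 3.12 and 3.13 are not reproduced here, but the situation they illustrate is the one forwarding cannot fix: a loaded value is available only at the end of MEM, so an instruction that needs it in the very next cycle must be stalled once. A simplified sketch of that interlock (field names are illustrative):

def must_stall(id_ex_memread, id_ex_rt, if_id_rs, if_id_rt):
    """Stall when the instruction in ID needs the register a load ahead of it is still fetching."""
    return id_ex_memread and id_ex_rt in (if_id_rs, if_id_rt)

# LW R1,0(R2) immediately followed by ADD R3,R1,R4: stall one cycle
print(must_stall(id_ex_memread=True, id_ex_rt=1, if_id_rs=1, if_id_rt=4))   # True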

Software Scheduling to Avoid Load Hazards
Try producing fast code for
  a = b + c;
  d = e - f;
assuming a, b, c, d, e, and f are in memory.

Slow code:
  LW  Rb,b
  LW  Rc,c
  ADD Ra,Rb,Rc
  SW  a,Ra
  LW  Re,e
  LW  Rf,f
  SUB Rd,Re,Rf
  SW  d,Rd

Fast code:
  LW  Rb,b
  LW  Rc,c
  LW  Re,e
  ADD Ra,Rb,Rc
  LW  Rf,f
  SW  a,Ra
  SUB Rd,Re,Rf
  SW  d,Rd
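As a rough check of the payoff (not from the slides), the sketch below counts load-use stalls, i.e. instructions that use a value loaded by the immediately preceding LW, assuming forwarding covers every other case:

def load_use_stalls(program):
    """Count one-cycle stalls where an instruction reads the destination of the LW just before it."""
    stalls = 0
    for (op, dst, _), (_, _, srcs) in zip(program, program[1:]):
        if op == "LW" and dst in srcs:
            stalls += 1
    return stalls

# (opcode, destination, sources) for the two sequences above
slow = [("LW", "Rb", []), ("LW", "Rc", []), ("ADD", "Ra", ["Rb", "Rc"]),
        ("SW", None, ["Ra"]), ("LW", "Re", []), ("LW", "Rf", []),
        ("SUB", "Rd", ["Re", "Rf"]), ("SW", None, ["Rd"])]
fast = [("LW", "Rb", []), ("LW", "Rc", []), ("LW", "Re", []),
        ("ADD", "Ra", ["Rb", "Rc"]), ("LW", "Rf", []),
        ("SW", None, ["Ra"]), ("SUB", "Rd", ["Re", "Rf"]), ("SW", None, ["Rd"])]

print(load_use_stalls(slow), load_use_stalls(fast))   # 2 0: scheduling removes both stalls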

Compiler Avoiding Load Stalls
Percentage of loads that stall the pipeline, scheduled vs. unscheduled code:
  tex:   25% scheduled, 65% unscheduled
  spice: 14% scheduled, 42% unscheduled
  gcc:   31% scheduled, 54% unscheduled

Control Hazard on Branches Three Stage Stall

Branch Stall Impact
If CPI = 1 and 30% of instructions are branches, a 3-cycle stall gives a new CPI of 1 + 0.3 x 3 = 1.9!
Two-part solution:
– determine whether the branch is taken or not sooner, AND
– compute the taken-branch address earlier.
DLX branches test whether a register = 0 or <> 0.
DLX solution:
– move the zero test to the ID/RF stage,
– add an adder to calculate the new PC in the ID/RF stage,
– 1 clock cycle penalty for a branch instead of 3.
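The arithmetic, as a quick Python check (the 1.3 figure for the reduced penalty is implied by the slide rather than stated):

def cpi_with_branches(branch_freq, branch_penalty, base_cpi=1.0):
    """Base CPI plus the average stall cycles contributed by branches."""
    return base_cpi + branch_freq * branch_penalty

print(cpi_with_branches(0.30, 3))   # 1.9 with the original 3-cycle stall
print(cpi_with_branches(0.30, 1))   # 1.3 once the test and adder move to ID/RF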

Pipelined DLX Datapath (Figure 3.22)
[Figure: the datapath revised as above, with the zero test and the branch-target adder moved into the ID/RF stage]

Four Branch Hazard Alternatives
#1: Stall until the branch direction is clear.
#2: Predict branch not taken
– Execute successor instructions in sequence.
– “Squash” instructions in the pipeline if the branch is actually taken.
– Takes advantage of the late pipeline state update.
– 47% of DLX branches are not taken on average.
– PC+4 is already calculated, so use it to get the next instruction.
#3: Predict branch taken
– 53% of DLX branches are taken on average.
– But the branch target address hasn’t been calculated yet in DLX.
  » DLX still incurs a 1-cycle branch penalty.

Four Branch Hazard Alternatives
#4: Delayed branch
– Define the branch to take place AFTER a following instruction:
    branch instruction
    sequential successor 1
    sequential successor 2
    ...
    sequential successor n    (branch delay of length n)
    branch target if taken
– A 1-slot delay allows a proper decision and branch target address in the 5-stage pipeline.
– DLX uses this.

Delayed Branch
Where to get instructions to fill the branch delay slot?
– From before the branch instruction.
– From the target address: only valuable when the branch is taken.
– From the fall-through path: only valuable when the branch is not taken.
Compiler effectiveness for a single branch delay slot:
– Fills about 60% of branch delay slots.
– About 80% of the instructions executed in branch delay slots are useful in computation.
– So about 50% (60% x 80%) of slots are usefully filled.
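A rough Python model of these numbers (the 30% branch frequency is only an illustration, carried over from the earlier branch-stall slide):

fill_rate = 0.60            # fraction of delay slots the compiler fills
useful_when_filled = 0.80   # fraction of filled slots that do useful work

usefully_filled = fill_rate * useful_when_filled
print(usefully_filled)      # 0.48, i.e. roughly 50% of slots usefully filled

branch_freq = 0.30          # illustrative branch frequency
wasted_cpi = branch_freq * (1 - usefully_filled)
print(wasted_cpi)           # ~0.16 extra CPI from delay slots that do no useful work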

Pipelining Summary
Pipelining just overlaps tasks, and it is easy if the tasks are independent.
Speedup vs. pipeline depth:
Speedup = Pipeline depth / (1 + Pipeline stall CPI) x (Clock Cycle unpipelined / Clock Cycle pipelined)
Hazards limit performance on computers:
– Structural: need more HW resources.
– Data: need forwarding and compiler scheduling.
– Control: discussed next time.