EEL5708 Lotzi Bölöni EEL 5708 High Performance Computer Architecture Pipelining.



EEL5708 Acknowledgements
All the lecture slides were adapted from the slides of David Patterson (1998, 2001) and David E. Culler (2001), Copyright University of California, Berkeley

EEL5708 Pipelining

EEL5708 Sequential Laundry
Sequential laundry takes 6 hours for 4 loads
If they learned pipelining, how long would laundry take?
[Figure: loads A, B, C, D in task order against time, ending at midnight]

EEL5708 Pipelined Laundry
Start work ASAP
Pipelined laundry takes 3.5 hours for 4 loads
[Figure: loads A, B, C, D in task order against time, from 6 PM to midnight]
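The 6-hour and 3.5-hour totals can be checked with a short calculation. This is only a sketch: the stage times (30-minute wash, 40-minute dry, 20-minute fold) are an assumption taken from the usual form of this laundry example and are not stated on the slide, and the function names are made up for illustration.

```python
# Assumed stage times in minutes (not given on the slide): wash, dry, fold.
STAGE_MINUTES = [30, 40, 20]


def sequential_minutes(loads, stages=STAGE_MINUTES):
    """Each load finishes every stage before the next load starts."""
    return loads * sum(stages)


def pipelined_minutes(loads, stages=STAGE_MINUTES):
    """A load enters a stage once it has left the previous stage and the
    load ahead of it has vacated that stage. Loads may wait between stages."""
    stage_free_at = [0] * len(stages)  # time at which each machine is next free
    for _ in range(loads):
        ready = 0  # time at which this load leaves the previous stage
        for s, duration in enumerate(stages):
            start = max(ready, stage_free_at[s])
            ready = start + duration
            stage_free_at[s] = ready
    return stage_free_at[-1]


print(sequential_minutes(4) / 60, "hours sequential")
print(pipelined_minutes(4) / 60, "hours pipelined")
```

Under these assumed times the dryer is the bottleneck: after the first load, one load completes every 40 minutes.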

EEL5708 Pipelining Lessons
Pipelining doesn’t help latency of single task, it helps throughput of entire workload
Pipeline rate limited by slowest pipeline stage
Multiple tasks operating simultaneously
Potential speedup = Number pipe stages
Unbalanced lengths of pipe stages reduce speedup
Time to “fill” pipeline and time to “drain” it reduce speedup
[Figure: loads A, B, C, D in task order against time, 6 PM to 9 PM]
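The effect of fill and drain time on speedup can be sketched numerically. The model below is an assumption made for illustration (perfectly balanced stages of one time unit each); `pipeline_speedup` is a hypothetical helper, not something from the slides.

```python
def pipeline_speedup(tasks, stages):
    """Speedup of a balanced pipeline over running the tasks one at a time.

    Assumes every stage takes one time unit, so a single task takes
    `stages` units and a full pipeline completes one task per unit.
    """
    unpipelined = tasks * stages
    pipelined = stages + (tasks - 1)  # fill the pipeline, then one task per unit
    return unpipelined / pipelined


for n in (4, 100, 1_000_000):
    print(n, "tasks:", round(pipeline_speedup(n, 5), 3))
```

With only a few tasks the fill and drain time dominates; as the task count grows the speedup approaches the number of stages without reaching it.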

EEL5708 Fast, Pipelined Instruction Interpretation
Stages: Next Instruction (NI), Instruction Fetch (IF), Decode & Operand Fetch (D), Execute (E), Store Results (W)
State passed between stages: Instruction Address, Instruction Register, Operand Registers, Result Registers, Registers or Mem
Time →
NI IF D  E  W
   NI IF D  E  W
      NI IF D  E  W
         NI IF D  E  W
            NI IF D  E  W

EEL5708 Instruction Pipelining
Execute billions of instructions, so throughput is what matters
–except when?
What is desirable in instruction sets for pipelining?
–Variable length instructions vs. all instructions same length?
–Memory operands part of any operation vs. memory operands only in loads or stores?
–Register operand many places in instruction format vs. registers located in same place?

EEL5708 Example: MIPS (Note register location)
Register-Register: Op | Rs1 | Rs2 | Rd | Opx
Register-Immediate: Op | Rs1 | Rd | immediate
Branch: Op | Rs1 | Rs2/Opx | immediate
Jump / Call: Op | target
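Fixed register locations mean the register numbers can be extracted before the opcode is fully decoded. The sketch below illustrates that idea; the bit positions are an assumption (the standard MIPS32 layout), since the slide shows only the field order, and `decode_fields` is a hypothetical helper.

```python
def decode_fields(word):
    """Split a 32-bit instruction word into fixed-position fields.

    Bit positions follow the standard MIPS32 layout (an assumption here):
    op[31:26], first source[25:21], second source[20:16], rd[15:11],
    low opcode-extension bits[5:0], and a 16-bit immediate[15:0] that
    overlaps the rd and extension fields.
    """
    return {
        "op": (word >> 26) & 0x3F,
        "rs1": (word >> 21) & 0x1F,   # same place in every format that has it
        "rs2": (word >> 16) & 0x1F,   # register-immediate forms put Rd here
        "rd": (word >> 11) & 0x1F,
        "opx": word & 0x3F,
        "imm": word & 0xFFFF,
    }


# 0x012A4020 is `add $t0, $t1, $t2` in MIPS32 (registers 8, 9 and 10).
fields = decode_fields(0x012A4020)
print(fields)
```

Every field is extracted unconditionally; the decoder then uses only the fields that the opcode says are meaningful, which is what lets register fetch overlap with decode.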

EEL5708 5 Steps of MIPS Datapath
Stages: Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back
[Figure: datapath showing Next PC, Next SEQ PC, the +4 adder, instruction memory (Address, Inst), register file (RS1, RS2, RD), sign extend of the immediate, ALU with Zero? test, data memory, MUXes, and WB Data]

EEL5708 5 Steps of MIPS Datapath
Stages: Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back
Pipeline registers between stages: IF/ID, ID/EX, EX/MEM, MEM/WB
Data stationary control – local decode for each instruction phase / pipeline stage
[Figure: the same datapath (Next PC, Next SEQ PC, +4 adder, Address, RS1, RS2, Imm, RD, register file, sign extend, ALU with Zero? test, data memory, MUXes, WB Data) with the pipeline registers inserted]

EEL5708 Visualizing Pipelining
Figure 3.3, Page 133, CA:AQA 2e
[Figure: instruction order (vertical axis) against time in clock cycles, Cycle 1 to Cycle 7 (horizontal axis); each instruction passes through Ifetch, Reg, ALU, DMem, Reg]
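A text version of this kind of diagram can be generated with a few lines of code. This is a minimal sketch that assumes an ideal pipeline with no stalls, with each instruction starting one cycle after the one before it; the function names are illustrative.

```python
STAGES = ["Ifetch", "Reg", "ALU", "DMem", "Reg"]


def pipeline_diagram(num_instructions, stages=STAGES):
    """Return one row per instruction; row i is shifted right by i cycles."""
    total_cycles = num_instructions + len(stages) - 1
    rows = []
    for i in range(num_instructions):
        row = [""] * total_cycles
        for s, name in enumerate(stages):
            row[i + s] = name  # instruction i occupies stage s in cycle i + s + 1
        rows.append(row)
    return rows


def render(rows):
    """Format the rows as fixed-width text with a cycle-number header."""
    header = "".join(f"{'C' + str(c + 1):<8}" for c in range(len(rows[0])))
    body = ["".join(f"{cell:<8}" for cell in row) for row in rows]
    return "\n".join([header] + body)


print(render(pipeline_diagram(3)))
```

Three instructions in a five-stage pipeline occupy seven clock cycles in total, which matches the cycle range on the slide.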

EEL5708 It’s Not That Easy for Computers
Limits to pipelining: Hazards prevent next instruction from executing during its designated clock cycle
–Structural hazards: HW cannot support this combination of instructions (single person to fold and put clothes away)
–Data hazards: Instruction depends on result of prior instruction still in the pipeline (missing sock)
–Control hazards: Pipelining of branches & other instructions that change the PC
–Common solution is to stall the pipeline until the hazard is resolved, inserting one or more “bubbles” in the pipeline

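EEL5708 Speed Up Equation for Pipelining
$$\text{CPI}_{\text{pipelined}} = \text{Ideal CPI} + \text{Pipeline stall clock cycles per instr}$$
$$\text{Speedup} = \frac{\text{Ideal CPI} \times \text{Pipeline depth}}{\text{Ideal CPI} + \text{Pipeline stall CPI}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}$$
With Ideal CPI = 1:
$$\text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}$$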
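One way to see where the speedup equation comes from is sketched here. It rests on two assumptions that the slide does not state: both machines execute the same number of instructions, and the unpipelined machine needs roughly Ideal CPI × Pipeline depth cycles per instruction.

```latex
\begin{align*}
\text{Speedup}
  &= \frac{\text{CPI}_{\text{unpipelined}} \times \text{Clock Cycle}_{\text{unpipelined}}}
          {\text{CPI}_{\text{pipelined}} \times \text{Clock Cycle}_{\text{pipelined}}} \\
\text{CPI}_{\text{unpipelined}}
  &\approx \text{Ideal CPI} \times \text{Pipeline depth} \\
\text{CPI}_{\text{pipelined}}
  &= \text{Ideal CPI} + \text{Pipeline stall CPI} \\
\text{Speedup}
  &= \frac{\text{Ideal CPI} \times \text{Pipeline depth}}
          {\text{Ideal CPI} + \text{Pipeline stall CPI}}
     \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}
                 {\text{Clock Cycle}_{\text{pipelined}}}
\end{align*}
```

Substituting the second and third lines into the first gives the last line; setting Ideal CPI to 1 gives the simplified form.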

EEL5708 Structural Hazard Example: Dual-port vs. Single-port
Machine A: Dual ported memory
Machine B: Single ported memory, but its pipelined implementation has a 1.05 times faster clock rate
Ideal CPI = 1 for both
Loads are 40% of instructions executed
$$\text{SpeedUp}_A = \frac{\text{Pipeline depth}}{1 + 0} \times \frac{\text{clock}_{\text{unpipe}}}{\text{clock}_{\text{pipe}}} = \text{Pipeline depth}$$
$$\text{SpeedUp}_B = \frac{\text{Pipeline depth}}{1 + 0.4 \times 1} \times \frac{\text{clock}_{\text{unpipe}}}{\text{clock}_{\text{unpipe}} / 1.05} = \frac{\text{Pipeline depth}}{1.4} \times 1.05 = 0.75 \times \text{Pipeline depth}$$
$$\frac{\text{SpeedUp}_A}{\text{SpeedUp}_B} = \frac{\text{Pipeline depth}}{0.75 \times \text{Pipeline depth}} = 1.33$$
Machine A is 1.33 times faster
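The arithmetic in this example can be reproduced directly. The sketch below assumes a pipeline depth of 5 purely to have a number to plug in (the depth cancels in the ratio), and `speedup` is a hypothetical helper written for this note.

```python
def speedup(depth, stall_cpi, relative_clock=1.0, ideal_cpi=1.0):
    """Pipeline speedup: ideal_cpi * depth / (ideal_cpi + stall_cpi),
    scaled by how much faster this machine's clock is than the baseline."""
    return ideal_cpi * depth / (ideal_cpi + stall_cpi) * relative_clock


DEPTH = 5  # assumed value; it cancels when the two speedups are divided

# Machine A: dual-ported memory, so loads never stall.
speedup_a = speedup(DEPTH, stall_cpi=0.0)

# Machine B: 40% of instructions are loads, each stalling 1 cycle,
# but the clock is 1.05 times faster.
speedup_b = speedup(DEPTH, stall_cpi=0.4 * 1, relative_clock=1.05)

ratio = speedup_a / speedup_b
print(f"A is {ratio:.2f} times faster than B")
```

The small clock-rate advantage of machine B does not make up for stalling on 40% of its instructions.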

EEL5708 Three Generic Data Hazards
Instr I followed by Instr J
Read After Write (RAW): Instr J tries to read operand before Instr I writes it

EEL5708 Three Generic Data Hazards
Instr I followed by Instr J
Write After Read (WAR): Instr J tries to write operand before Instr I reads it
–Gets wrong operand
Can’t happen in our 5 stage pipeline because:
–All instructions take 5 stages, and
–Reads are always in stage 2, and
–Writes are always in stage 5

EEL5708 Three Generic Data Hazards
Instr I followed by Instr J
Write After Write (WAW): Instr J tries to write operand before Instr I writes it
–Leaves wrong result (Instr I not Instr J)
Can’t happen in our 5 stage pipeline because:
–All instructions take 5 stages, and
–Writes are always in stage 5
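The three hazard types can be told apart by comparing the registers that two instructions read and write. The following is a minimal sketch: the `Instr` record and `dependences` function are made up for illustration, and it reports only which dependences exist between I and J, not whether a particular pipeline would actually stall on them.

```python
from typing import NamedTuple, Optional, Set, Tuple


class Instr(NamedTuple):
    dest: Optional[str]      # register written, or None (e.g. a store)
    srcs: Tuple[str, ...]    # registers read


def dependences(i: Instr, j: Instr) -> Set[str]:
    """Name the data dependences when instruction j follows instruction i."""
    found = set()
    if i.dest is not None and i.dest in j.srcs:
        found.add("RAW")  # j reads what i writes
    if j.dest is not None and j.dest in i.srcs:
        found.add("WAR")  # j writes what i reads
    if i.dest is not None and i.dest == j.dest:
        found.add("WAW")  # both write the same register
    return found


add = Instr(dest="R1", srcs=("R2", "R3"))   # ADD R1,R2,R3
sub = Instr(dest="R4", srcs=("R1", "R5"))   # SUB R4,R1,R5
print(dependences(add, sub))
```

In the 5-stage pipeline described here only the RAW case can turn into a hazard, because reads always happen in stage 2 and writes in stage 5.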

EEL5708 Software Scheduling to Avoid Load Hazards
Try producing fast code for
    a = b + c;
    d = e – f;
assuming a, b, c, d, e, and f in memory.
Slow code:
    LW  Rb,b
    LW  Rc,c
    ADD Ra,Rb,Rc
    SW  a,Ra
    LW  Re,e
    LW  Rf,f
    SUB Rd,Re,Rf
    SW  d,Rd
Fast code:
    LW  Rb,b
    LW  Rc,c
    LW  Re,e
    ADD Ra,Rb,Rc
    LW  Rf,f
    SW  a,Ra
    SUB Rd,Re,Rf
    SW  d,Rd
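The benefit of the reordering can be counted mechanically. The sketch below assumes a pipeline with forwarding in which the only remaining data stall is one cycle when an instruction uses a register loaded by the instruction immediately before it; that stall model and the tuple encoding are assumptions for illustration, not taken from the slide.

```python
# Each entry: (opcode, register written or None, registers read).
SLOW = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("ADD", "Ra", ("Rb", "Rc")),
    ("SW", None, ("Ra",)), ("LW", "Re", ()), ("LW", "Rf", ()),
    ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",)),
]
FAST = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("LW", "Re", ()),
    ("ADD", "Ra", ("Rb", "Rc")), ("LW", "Rf", ()), ("SW", None, ("Ra",)),
    ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",)),
]


def load_use_stalls(code):
    """Count instructions that read a register loaded by the previous LW."""
    stalls = 0
    for (op, dest, _), (_, _, reads) in zip(code, code[1:]):
        if op == "LW" and dest in reads:
            stalls += 1
    return stalls


print("slow:", load_use_stalls(SLOW), "stall cycles")
print("fast:", load_use_stalls(FAST), "stall cycles")
```

Both sequences contain the same eight instructions; the fast version only moves independent work into the slot after each load.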

EEL5708 Control Hazard on Branches
Three Stage Stall

EEL5708 Branch Stall Impact
If CPI = 1, 30% branch, Stall 3 cycles => new CPI = 1.9!
Two part solution:
–Determine branch taken or not sooner, AND
–Compute taken branch address earlier
Branch tests if register = 0 or <> 0
Solution:
–Move Zero test to ID/RF stage
–Adder to calculate new PC in ID/RF stage
–1 clock cycle penalty for branch versus 3
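The CPI figure follows from adding the average branch stall per instruction to the base CPI. A small sketch, with a hypothetical helper name, also shows what the 1-cycle penalty buys under the same 30% branch frequency:

```python
def cpi_with_branch_stalls(base_cpi, branch_fraction, penalty_cycles):
    """Effective CPI when every branch costs `penalty_cycles` stall cycles."""
    return base_cpi + branch_fraction * penalty_cycles


cpi_three_cycle = cpi_with_branch_stalls(1.0, 0.30, 3)  # original pipeline
cpi_one_cycle = cpi_with_branch_stalls(1.0, 0.30, 1)    # zero test moved to ID/RF

print(f"3-cycle penalty: CPI = {cpi_three_cycle:.1f}")
print(f"1-cycle penalty: CPI = {cpi_one_cycle:.1f}")
```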

EEL5708 Four Branch Hazard Alternatives
#1: Stall until branch direction is clear
#2: Predict Branch Not Taken
–Execute successor instructions in sequence
–“Squash” instructions in pipeline if branch actually taken
–Advantage of late pipeline state update
–47% branches not taken on average
–PC+4 already calculated, so use it to get next instruction
#3: Predict Branch Taken
–53% branches taken on average
–But haven’t calculated branch target address
»still incurs 1 cycle branch penalty
»Other machines: branch target known before outcome

EEL5708 Four Branch Hazard Alternatives
#4: Delayed Branch
–Define branch to take place AFTER a following instruction
    branch instruction
    sequential successor 1
    sequential successor 2
    ...
    sequential successor n
    branch target if taken
 (the n sequential successors form a branch delay of length n)
–1 slot delay allows proper decision and branch target address in 5 stage pipeline

EEL5708 Delayed Branch
Where to get instructions to fill branch delay slot?
–Before branch instruction
–From the target address: only valuable when branch taken
–From fall through: only valuable when branch not taken
–Cancelling branches allow more slots to be filled
Compiler effectiveness for single branch delay slot:
–Fills about 60% of branch delay slots
–About 80% of instructions executed in branch delay slots useful in computation
–About 50% (60% x 80%) of slots usefully filled
Delayed Branch downside: 7-8 stage pipelines, multiple instructions issued per clock (superscalar)

EEL5708 Evaluating Branch Alternatives
Table columns: Scheduling scheme, Branch penalty, CPI, Speedup v. unpipelined, Speedup v. stall
Table rows: Stall pipeline, Predict taken, Predict not taken, Delayed branch
Conditional & unconditional branches = 14% of instructions; 65% of them change the PC
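A comparison of this kind can be sketched from the two percentages given on the slide. The per-scheme penalties used below are assumptions inferred from the neighbouring slides (a 3-cycle stall, a 1-cycle penalty for predict taken, a 1-cycle penalty only on taken branches for predict not taken, and half a cycle for a delay slot that is usefully filled about 50% of the time); they are not read from this slide's table, and the names are illustrative.

```python
BRANCH_FRACTION = 0.14   # conditional & unconditional branches (from the slide)
TAKEN_FRACTION = 0.65    # share of those branches that change the PC (from the slide)

# Assumed average penalty per branch, in cycles (see the note above).
ASSUMED_PENALTY = {
    "Stall pipeline": 3.0,
    "Predict taken": 1.0,
    "Predict not taken": 1.0 * TAKEN_FRACTION,  # pay only when the branch is taken
    "Delayed branch": 0.5,
}


def branch_cpi(penalty, ideal_cpi=1.0):
    """CPI once the average branch penalty per instruction is added."""
    return ideal_cpi + BRANCH_FRACTION * penalty


def compare(penalties=ASSUMED_PENALTY):
    """Return {scheme: (CPI, speedup relative to always stalling)}."""
    stall_cpi = branch_cpi(penalties["Stall pipeline"])
    return {
        name: (branch_cpi(p), stall_cpi / branch_cpi(p))
        for name, p in penalties.items()
    }


for name, (cpi, vs_stall) in compare().items():
    print(f"{name:<18} CPI = {cpi:.2f}  speedup v. stall = {vs_stall:.2f}")
```

Under these assumed penalties every alternative improves on always stalling, with the delayed branch giving the lowest CPI.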

EEL5708 Pipelining Summary
Just overlap tasks, and easy if tasks are independent
Speedup ≤ Pipeline depth; if ideal CPI is 1, then:
$$\text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}$$
Hazards limit performance on computers:
–Structural: need more HW resources
–Data (RAW, WAR, WAW): need forwarding, compiler scheduling
–Control: delayed branch, prediction