ECE/CS 552: Pipelining Instructor: Mikko H Lipasti Fall 2010 University of Wisconsin-Madison Lecture notes based on set created by Mark Hill and John P.

Slides:

Advertisements

Similar presentations

Morgan Kaufmann Publishers The Processor

Advertisements

1 Pipelining Part 2 CS Data Hazards Data hazards occur when the pipeline changes the order of read/write accesses to operands that differs from.

Pipeline Computer Organization II 1 Hazards Situations that prevent starting the next instruction in the next cycle Structural hazards – A required resource.

Lecture Objectives: 1)Define pipelining 2)Calculate the speedup achieved by pipelining for a given number of instructions. 3)Define how pipelining improves.

Pipeline Hazards Pipeline hazards These are situations that inhibit that the next instruction can be processed in the next stage of the pipeline. This.

Pipelined Processor II (cont’d) CPSC 321

S. Barua – CPSC 440 CHAPTER 6 ENHANCING PERFORMANCE WITH PIPELINING This chapter presents pipelining.

ENEE350 Ankur Srivastava University of Maryland, College Park Based on Slides from Mary Jane Irwin ( )

ENEE350 Ankur Srivastava University of Maryland, College Park Based on Slides from Mary Jane Irwin ( )

Pipelining III Andreas Klappenecker CPSC321 Computer Architecture.

1  2004 Morgan Kaufmann Publishers Chapter Six. 2  2004 Morgan Kaufmann Publishers Pipelining The laundry analogy.

©UCB CS 162 Computer Architecture Lecture 3: Pipelining Contd. Instructor: L.N. Bhuyan

1 Chapter Six - 2nd Half Pipelined Processor Forwarding, Hazards, Branching EE3055 Web:

1 Stalling  The easiest solution is to stall the pipeline  We could delay the AND instruction by introducing a one-cycle delay into the pipeline, sometimes.

Pipeline Exceptions & ControlCSCE430/830 Pipeline: Exceptions & Control CSCE430/830 Computer Architecture Lecturer: Prof. Hong Jiang Courtesy of Yifeng.

EENG449b/Savvides Lec 4.1 1/22/04 January 22, 2004 Prof. Andreas Savvides Spring EENG 449bG/CPSC 439bG Computer.

DLX Instruction Format

1 Lecture 4: Advanced Pipelines Data hazards, control hazards, multi-cycle in-order pipelines (Appendix A.4-A.10)

Appendix A Pipelining: Basic and Intermediate Concepts

ENGS 116 Lecture 51 Pipelining and Hazards Vincent H. Berk September 30, 2005 Reading for today: Chapter A.1 – A.3, article: Patterson&Ditzel Reading for.

Lecture 15: Pipelining and Hazards CS 2011 Fall 2014, Dr. Rozier.

COSC 3430 L08 Basic MIPS Architecture.1 COSC 3430 Computer Architecture Lecture 08 Processors Single cycle Datapath PH 3: Sections

1 Pipelining Reconsider the data path we just did Each instruction takes from 3 to 5 clock cycles However, there are parts of hardware that are idle many.

Automobile Manufacturing 1. Build frame. 60 min. 2. Add engine. 50 min. 3. Build body. 80 min. 4. Paint. 40 min. 5. Finish.45 min. 275 min. Latency: Time.

Chapter 4 CSF 2009 The processor: Pipelining. Performance Issues Longest delay determines clock period – Critical path: load instruction – Instruction.

Comp Sci pipelining 1 Ch. 13 Pipelining. Comp Sci pipelining 2 Pipelining.

CSE 340 Computer Architecture Summer 2014 Basic MIPS Pipelining Review.

CS.305 Computer Architecture Enhancing Performance with Pipelining Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005, and from.

Computer Organization CS224 Chapter 4 Part b The Processor Spring 2010 With thanks to M.J. Irwin, T. Fountain, D. Patterson, and J. Hennessy for some lecture.

CMPE 421 Parallel Computer Architecture

1 Designing a Pipelined Processor In this Chapter, we will study 1. Pipelined datapath 2. Pipelined control 3. Data Hazards 4. Forwarding 5. Branch Hazards.

Electrical and Computer Engineering University of Cyprus LAB3: IMPROVING MIPS PERFORMANCE WITH PIPELINING.

CS 1104 Help Session IV Five Issues in Pipelining Colin Tan, S

Chapter 6: Pipelining Instructor: Mozafar Bag Mohammadi University of Ilam.

1 (Based on text: David A. Patterson & John L. Hennessy, Computer Organization and Design: The Hardware/Software Interface, 3 rd Ed., Morgan Kaufmann,

Introduction to Computer Organization Pipelining.

Lecture 9. MIPS Processor Design – Pipelined Processor Design #1 Prof. Taeweon Suh Computer Science Education Korea University 2010 R&E Computer System.

PROCESSOR PIPELINING YASSER MOHAMMAD. SINGLE DATAPATH DESIGN.

LECTURE 10 Pipelining: Advanced ILP. EXCEPTIONS An exception, or interrupt, is an event other than regular transfers of control (branches, jumps, calls,

ECE/CS 552: Pipeline Hazards © Prof. Mikko Lipasti Lecture notes based in part on slides created by Mark Hill, David Wood, Guri Sohi, John Shen and Jim.

Pipelining: Implementation CPSC 252 Computer Organization Ellen Walker, Hiram College.

Use of Pipelining to Achieve CPI < 1

CS 352H: Computer Systems Architecture

Exceptions Another form of control hazard Could be caused by…

ECE/CS 552: Pipelining © Prof. Mikko Lipasti

Morgan Kaufmann Publishers

Performance of Single-cycle Design

Pipeline Implementation (4.6)

ECS 154B Computer Architecture II Spring 2009

\course\cpeg323-08F\Topic6b-323

Pipelining: Advanced ILP

Review: MIPS Pipeline Data and Control Paths

Morgan Kaufmann Publishers The Processor

Lecture 6: Advanced Pipelines

Pipelining review.

Single-cycle datapath, slightly rearranged

Lecture 9. MIPS Processor Design – Pipelined Processor Design #2

Pipelining in more detail

\course\cpeg323-05F\Topic6b-323

Data Hazards Data Hazard

Lecture 5. MIPS Processor Design

Pipeline control unit (highly abstracted)

Control unit extension for data hazards

Instruction Execution Cycle

Pipeline control unit (highly abstracted)

Pipeline Control unit (highly abstracted)

Pipelining (II).

Introduction to Computer Organization and Architecture

Spring 2010 Ilam University

ELEC / Computer Architecture and Design Spring 2015 Pipeline Control and Performance (Chapter 6) Vishwani D. Agrawal James J. Danaher.

Presentation transcript:

ECE/CS 552: Pipelining Instructor: Mikko H Lipasti Fall 2010 University of Wisconsin-Madison Lecture notes based on set created by Mark Hill and John P. Shen Updated by Mikko Lipasti

Pipelining Forecast –Big Picture –Datapath –Control –Data Hazards Stalls Forwarding –Control Hazards –Exceptions

Motivation Single cycle implementation –CPI = 1 –Cycle = imem + RFrd + ALU + dmem + RFwr + muxes + control –E.g = 2000ps –Time/program = P x 2ns Instructions Cycles Program Instruction Time Cycle (code size) XX (CPI) (cycle time)

Multicycle Multicycle implementation: Cycle: Instr: iFDXMW i+1FDX i+2FDXM i+3F i+4

Multicycle Multicycle implementation –CPI = 3, 4, 5 –Cycle = max(memory, RF, ALU, mux, control) – = max(500,250,500) = 500ps –Time/prog = P x 4 x 500 = P x 2000ps = P x 2ns Would like: –CPI = 1 + overhead from hazards (later) –Cycle = 500ps + overhead –In practice, ~3x improvement

Big Picture Instruction latency = 5 cycles Instruction throughput = 1/5 instr/cycle CPI = 5 cycles per instruction Instead –Pipelining: process instructions like a lunch buffet –ALL microprocessors use it E.g. Core i7, AMD Barcelona, ARM11

Big Picture Instruction Latency = 5 cycles (same) Instruction throughput = 1 instr/cycle CPI = 1 cycle per instruction CPI = cycle between instruction completion = 1

Ideal Pipelining Bandwidth increases linearly with pipeline depth Latency increases by latch delays

Example: Integer Multiplier 9 16x16 combinational multiplier ISCAS-85 C6288 standard benchmark Tools: Synopsys DC/LSI Logic 110nm gflxp ASIC [Source: J. Hayes, Univ. of Michigan]

Example: Integer Multiplier ConfigurationDelayMPSArea (FF/wiring)Area Increase Combinational3.52ns (--/1759) 2 Stages1.87ns534 (1.9x)8725 (1078/1870)16% 4 Stages1.17ns855 (3.0x)11276 (3388/2112)50% 8 Stages0.80ns1250 (4.4x)17127 (8938/2612)127% 10 Pipeline efficiency 2-stage: nearly double throughput; marginal area cost 4-stage: 75% efficiency; area still reasonable 8-stage: 55% efficiency; area more than doubles Tools: Synopsys DC/LSI Logic 110nm gflxp ASIC

Ideal Pipelining Cycle: Instr: iFDXMW i+1FDXMW i+2FDXMW i+3FDXMW i+4FDXMW

Pipelining Idealisms Uniform subcomputations –Can pipeline into stages with equal delay Identical computations –Can fill pipeline with identical work Independent computations –No relationships between work units Are these practical? –No, but can get close enough to get significant speedup

Complications Datapath –Five (or more) instructions in flight Control –Must correspond to multiple instructions Instructions may have –data and control flow dependences –I.e. units of work are not independent One may have to stall and wait for another

Datapath

Datapath

Control Control –Set by 5 different instructions –Divide and conquer: carry IR down the pipe MIPS ISA requires the appearance of sequential execution –Precise exceptions –True of most general purpose ISAs

Program Dependences

Program Data Dependences True dependence (RAW) –j cannot execute until i produces its result Anti-dependence (WAR) –j cannot write its result until i has read its sources Output dependence (WAW) –j cannot write its result until i has written its result

Control Dependences Conditional branches –Branch must execute to determine which instruction to fetch next –Instructions following a conditional branch are control dependent on the branch instruction

Example (quicksort/MIPS) #for (; (j < high) && (array[j] < array[low]) ; ++j ); #$10 = j #$9 = high #$6 = array #$8 = low bgedone, $10, $9 mul$15, $10, 4 addu$24, $6, $15 lw$25, 0($24) mul$13, $8, 4 addu$14, $6, $13 lw$15, 0($14) bgedone, $25, $15 cont: addu$10, $10, 1... done: addu$11, $11, -1

Resolution of Pipeline Hazards Pipeline hazards –Potential violations of program dependences –Must ensure program dependences are not violated Hazard resolution –Static: compiler/programmer guarantees correctness –Dynamic: hardware performs checks at runtime Pipeline interlock –Hardware mechanism for dynamic hazard resolution –Must detect and enforce dependences at runtime

Pipeline Hazards Necessary conditions: –WAR: write stage earlier than read stage Is this possible in IF-RD-EX-MEM-WB ? –WAW: write stage earlier than write stage Is this possible in IF-RD-EX-MEM-WB ? –RAW: read stage earlier than write stage Is this possible in IF-RD-EX-MEM-WB? If conditions not met, no need to resolve Check for both register and memory

Pipeline Hazard Analysis Memory hazards –RAW: Yes/No? –WAR: Yes/No? –WAW: Yes/No? Register hazards –RAW: Yes/No? –WAR: Yes/No? –WAW: Yes/No?

RAW Hazard Earlier instruction produces a value used by a later instruction: –add $1, $2, $3 –sub $4, $5, $1 Cycle: Instr: addFDXMW subFDXMW

RAW Hazard - Stall Detect dependence and stall: –add $1, $2, $3 –sub $4, $5, $1 Cycle: Instr: addFDXMW subFDXMW

Control Dependence One instruction affects which executes next –sw $4, 0($5) –bne $2, $3, loop –sub $6, $7, $8 Cycle: Instr: swFDXMW bneFDXMW subFDXMW

Control Dependence - Stall Detect dependence and stall –sw $4, 0($5) –bne $2, $3, loop –sub $6, $7, $8 Cycle: Instr: swFDXMW bneFDXMW subFDXMW

Pipelined Datapath Start with single-cycle datapath Pipelined execution –Assume each instruction has its own datapath –But each instruction uses a different part in every cycle –Multiplex all on to one datapath –Latches separate cycles (like multicycle) Ignore hazards for now –Data –control

Pipelined Datapath

Instruction flow –add and load –Write of registers –Pass register specifiers Any info needed by a later stage gets passed down the pipeline –E.g. store value through EX

Pipelined Control IF and ID –None EX –ALUop, ALUsrc, RegDst MEM –Branch, MemRead, MemWrite WB –MemtoReg, RegWrite

Datapath Control Signals

Pipelined Control

All Together

Pipelined Control Controlled by different instructions Decode instructions and pass the signals down the pipe Control sequencing is embedded in the pipeline –No explicit FSM –Instead, distributed FSM

Pipelining Not too complex yet –Data hazards –Control hazards –Exceptions

RAW Hazards Must first detect RAW hazards –Pipeline analysis proves that WAR/WAW don’t occur ID/EX.WriteRegister = IF/ID.ReadRegister1 ID/EX.WriteRegister = IF/ID.ReadRegister2 EX/MEM.WriteRegister = IF/ID.ReadRegister1 EX/MEM.WriteRegister = IF/ID.ReadRegister2 MEM/WB.WriteRegister = IF/ID.ReadRegister1 MEM/WB.WriteRegister = IF/ID.ReadRegister2

RAW Hazards Not all hazards because –WriteRegister not used (e.g. sw) –ReadRegister not used (e.g. addi, jump) –Do something only if necessary

RAW Hazards Hazard Detection Unit –Several 5-bit (or 6-bit) comparators Response? Stall pipeline –Instructions in IF and ID stay –IF/ID pipeline latch not updated –Send ‘nop’ down pipeline (called a bubble) –PCWrite, IF/IDWrite, and nop mux

RAW Hazard Forwarding A better response – forwarding –Also called bypassing Comparators ensure register is read after it is written Instead of stalling until write occurs –Use mux to select forwarded value rather than register value –Control mux with hazard detection logic

© 2005 Mikko Lipasti 41 Forwarding Paths (ALU instructions)

Write before Read RF Register file design –2-phase clocks common –Write RF on first phase –Read RF on second phase Hence, same cycle: –Write $1 –Read $1 No bypass needed –If read before write or DFF-based, need bypass

© 2005 Mikko Lipasti 43 ALU Forwarding

© 2005 Mikko Lipasti 44 Forwarding Paths (Load instructions)

Implementation of Load Forwarding

Control Flow Hazards Control flow instructions –branches, jumps, jals, returns –Can’t fetch until branch outcome known –Too late for next IF

Control Flow Hazards What to do? –Always stall –Easy to implement –Performs poorly –1/6 th instructions are branches, each branch takes 3 cycles –CPI = x 1/6 = 1.5 (lower bound)

Control Flow Hazards Predict branch not taken Send sequential instructions down pipeline Kill instructions later if incorrect Must stop memory accesses and RF writes Late flush of instructions on misprediction –Complex –Global signal (wire delay)

Control Flow Hazards Even better but more complex –Predict taken –Predict both (eager execution) –Predict one or the other dynamically Adapt to program branch patterns Lots of chip real estate these days –Pentium III, 4, Alpha Current research topic –More later (lecture on branch prediction)

Control Flow Hazards Another option: delayed branches –Always execute following instruction –“delay slot” (later example on MIPS pipeline) –Put useful instruction there, otherwise ‘nop’ A mistake to cement this into ISA –Just a stopgap (one cycle, one instruction) –Superscalar processors (later) Delay slot just gets in the way (special case)

Exceptions and Pipelining add $1, $2, $3 overflows A surprise branch –Earlier instructions flow to completion –Kill later instructions –Save PC in EPC, set PC to EX handler, etc. Costs a lot of designer sanity –554 teams that try this sometimes fail

Exceptions Even worse: in one cycle –I/O interrupt –User trap to OS (EX) –Illegal instruction (ID) –Arithmetic overflow –Hardware error –Etc. Interrupt priorities must be supported

Review Big Picture Datapath Control –Data hazards Stalls Forwarding or bypassing –Control flow hazards Branch prediction Exceptions