Spring 2010 Ilam University

Chapter 6: Pipelining Instructor: Mozafar Bag Mohammadi University of Ilam.
Presentation transcript:

Chapter 6: Pipelining

Forecast: Pipelining. Big Picture; Datapath; Control; Data Hazards (Stalls, Forwarding); Control Hazards; Exceptions.

Motivation
  Time/Program = (Instructions/Program) x (Cycles/Instruction) x (Time/Cycle) = (code size) x (CPI) x (cycle time)
  Single-cycle implementation: CPI = 1
  Cycle = imem + RFrd + ALU + dmem + RFwr + muxes + control
  E.g. 500 + 250 + 500 + 500 + 250 + 0 + 0 = 2000ps
  Time/program = P x 2ns

Multicycle: Multicycle implementation [timing diagram: cycles 1 to 13 across, instructions i to i+4 down; instruction i passes through stages F, D, X, M, W]

Multicycle
  Multicycle implementation: CPI = 3, 4, or 5, depending on the instruction
  Cycle = max(memory, RF, ALU, mux, control) = max(500, 250, 500) = 500ps
  Time/prog = P x 4 x 500 = P x 2000ps = P x 2ns (taking an average CPI of about 4)
  Would like: CPI = 1 + overhead from hazards (later); Cycle = 500ps + overhead
  In practice, ~3x improvement
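The arithmetic on the Motivation and Multicycle slides can be checked with a short sketch. The stage delays are the ones from the slides; the function names are hypothetical, and the pipelined case assumes zero hazard and latch overhead, which is the ideal the slides call "would like", not a measured result.

```python
# Stage delays in picoseconds, taken from the example on the slides.
IMEM, RF_READ, ALU, DMEM, RF_WRITE = 500, 250, 500, 500, 250
STAGE_DELAYS = (IMEM, RF_READ, ALU, DMEM, RF_WRITE)


def time_per_program(instructions, cpi, cycle_ps):
    """Time/program = instructions x CPI x cycle time, in picoseconds."""
    return instructions * cpi * cycle_ps


def single_cycle(instructions):
    # One long cycle covers every stage, so CPI is exactly 1.
    return time_per_program(instructions, 1, sum(STAGE_DELAYS))


def multicycle(instructions, avg_cpi=4):
    # The cycle is set by the slowest unit; avg_cpi=4 is the slides' working average.
    return time_per_program(instructions, avg_cpi, max(STAGE_DELAYS))


def pipelined_ideal(instructions):
    # Assumption: no hazard or latch overhead, so CPI = 1 at the 500 ps cycle.
    return time_per_program(instructions, 1, max(STAGE_DELAYS))


if __name__ == "__main__":
    p = 1000
    print(single_cycle(p), multicycle(p), pipelined_ideal(p))
```

Under these assumptions the ideal pipeline is 4x faster than either alternative; the slides' "~3x in practice" reflects the hazard and latch overhead that this sketch leaves out.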

Big Picture: Instruction latency = 5 cycles. Instruction throughput = 1/5 instr/cycle. CPI = 5 cycles per instruction. Instead, pipelining: process instructions like a lunch buffet. Virtually all modern microprocessors use it, e.g. Pentium 4, Athlon, PowerPC G5.

Big Picture: Instruction latency = 5 cycles (same). Instruction throughput = 1 instr/cycle. CPI = 1 cycle per instruction, where CPI = cycles between instruction completions = 1.

Laundry Example

Ideal Pipelining: Bandwidth increases linearly with pipeline depth. Latency increases by latch delays.

Ideal Pipelining [timing diagram: cycles 1 to 13 across, instructions i to i+4 down; instruction i passes through stages F, D, X, M, W]
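A minimal sketch of what an ideal pipeline timing diagram looks like, assuming no hazards so that each instruction enters fetch one cycle after its predecessor. The helper names `ideal_schedule` and `render` are illustrative only.

```python
STAGES = ["F", "D", "X", "M", "W"]


def ideal_schedule(num_instructions):
    """Return {instruction index: {cycle: stage}} for a hazard-free pipeline."""
    schedule = {}
    for k in range(num_instructions):
        # Instruction k enters fetch in cycle k + 1 and advances one stage per cycle.
        schedule[k] = {k + 1 + s: STAGES[s] for s in range(len(STAGES))}
    return schedule


def render(schedule, total_cycles):
    """Draw one text row per instruction; '.' marks a cycle with no activity."""
    lines = []
    for k, row in schedule.items():
        label = "i" if k == 0 else f"i+{k}"
        cells = " ".join(row.get(c, ".") for c in range(1, total_cycles + 1))
        lines.append(f"{label:<4}{cells}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render(ideal_schedule(5), 13))
```

Once the pipeline is full, one instruction completes every cycle, which is the CPI = 1 of the Big Picture slide.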

Pipelining Idealisms: Uniform subcomputations: can pipeline into stages with equal delay. Identical computations: can fill the pipeline with identical work. Independent computations: no relationships between work units. Are these practical? No, but we can get close enough to get significant speedup.

Complications: Datapath: five (or more) instructions in flight. Control: must correspond to multiple instructions. Instructions may have data and control flow dependences, i.e. units of work are not independent; one may have to stall and wait for another.

Datapath (Fig. 6.10)

Datapath (Fig. 6.11)

Control: Set by 5 different instructions. Divide and conquer: carry IR down the pipe. The MIPS ISA requires the appearance of sequential execution (true of most general-purpose ISAs).

Program Dependences

Program Data Dependences
  True dependence (RAW): j cannot execute until i produces its result
  Anti-dependence (WAR): j cannot write its result until i has read its sources
  Output dependence (WAW): j cannot write its result until i has written its result
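The three definitions above can be turned into a small classifier. This is a sketch with a hypothetical interface: each instruction is reduced to a destination register (or `None`) and a list of source registers, with i the earlier instruction and j the later one.

```python
def dependences(i_dest, i_srcs, j_dest, j_srcs):
    """Classify the data dependences of later instruction j on earlier instruction i."""
    found = set()
    if i_dest is not None and i_dest in j_srcs:
        found.add("RAW")  # j reads what i writes
    if j_dest is not None and j_dest in i_srcs:
        found.add("WAR")  # j overwrites something i still has to read
    if i_dest is not None and i_dest == j_dest:
        found.add("WAW")  # both write the same register
    return found


if __name__ == "__main__":
    # add $1, $2, $3 followed by sub $4, $5, $1
    print(dependences("$1", ["$2", "$3"], "$4", ["$5", "$1"]))
```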

Control Dependences: Conditional branches: the branch must execute to determine which instruction to fetch next. Instructions following a conditional branch are control dependent on the branch instruction.

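Example (quicksort/MIPS)
  # for (; (j < high) && (array[j] < array[low]); ++j);
  # $10 = j, $9 = high, $6 = array, $8 = low
        bge  $10, $9, done      # exit loop if j >= high
        mul  $15, $10, 4        # byte offset of array[j]
        addu $24, $6, $15       # address of array[j]
        lw   $25, 0($24)        # $25 = array[j]
        mul  $13, $8, 4         # byte offset of array[low]
        addu $14, $6, $13       # address of array[low]
        lw   $15, 0($14)        # $15 = array[low]
        bge  $25, $15, done     # exit loop if array[j] >= array[low]
  cont: addu $10, $10, 1        # ++j
        . . .
  done: addu $11, $11, -1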

Resolution of Pipeline Hazards
  Pipeline hazards: potential violations of program dependences; must ensure program dependences are not violated
  Hazard resolution: static (compiler/programmer guarantees correctness) or dynamic (hardware performs checks at runtime)
  Pipeline interlock: hardware mechanism for dynamic hazard resolution; must detect and enforce dependences at runtime

Pipeline Hazards
  Necessary conditions:
    WAR: write stage earlier than read stage. Is this possible in IF-RD-EX-MEM-WB?
    WAW: write stage earlier than write stage
    RAW: read stage earlier than write stage. Is this possible in IF-RD-EX-MEM-WB?
  If the conditions are not met, there is no need to resolve
  Check for both register and memory

Pipeline Hazard Analysis: Memory hazards: RAW: Yes/No? WAR: Yes/No? WAW: Yes/No? Register hazards: the same three questions.
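One way to work through the yes/no questions is to encode the necessary conditions from the previous slide. This sketch assumes an in-order IF-RD-EX-MEM-WB pipeline in which registers are read in RD and written in WB, and memory is both read and written in MEM; the function name and stage numbering are illustrative.

```python
# Stage positions in the IF-RD-EX-MEM-WB pipeline.
IF, RD, EX, MEM, WB = range(5)


def possible_hazards(read_stages, write_stages):
    """Apply the necessary conditions to one resource (register file or memory)."""
    hazards = set()
    if any(r < w for r in read_stages for w in write_stages):
        hazards.add("RAW")  # a later instruction can read before an earlier one writes
    if any(w < r for r in read_stages for w in write_stages):
        hazards.add("WAR")  # a later instruction can write before an earlier one reads
    if len(set(write_stages)) > 1:
        hazards.add("WAW")  # writes can happen in different stages, so they can reorder
    return hazards


if __name__ == "__main__":
    print("registers:", possible_hazards([RD], [WB]))
    print("memory:   ", possible_hazards([MEM], [MEM]))
```

Under these assumptions only register RAW hazards are possible, which agrees with the later slide's remark that pipeline analysis shows WAR and WAW do not occur.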

RAW Hazard: an earlier instruction produces a value used by a later instruction:
  add $1, $2, $3
  sub $4, $5, $1
  [timing diagram: cycles 1 to 13, rows add and sub; add passes through F, D, X, M, W]

RAW Hazard - Stall: detect the dependence and stall:
  add $1, $2, $3
  sub $4, $5, $1
  [timing diagram: cycles 1 to 13, rows add and sub; add passes through F, D, X, M, W]
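A rough model of stall-only hazard resolution, to show how many bubbles the add/sub pair costs. It assumes a five-stage pipeline with no forwarding and a register file that writes before it reads within a cycle, so a consumer may decode in the same cycle its producer writes back. The function name and instruction encoding are hypothetical.

```python
def schedule_with_stalls(program):
    """program: list of (dest, srcs) in program order.

    Returns one dict per instruction with the cycle of its final decode
    (register read), the cycle of its writeback, and the stall cycles added.
    """
    last_writeback = {}  # register -> cycle in which its latest producer writes back
    result = []
    prev_decode = 1      # chosen so that the first instruction decodes in cycle 2
    for dest, srcs in program:
        earliest = prev_decode + 1
        # Every source must have been written back no later than the decode cycle.
        ready = max((last_writeback.get(r, 0) for r in srcs), default=0)
        decode = max(earliest, ready)
        writeback = decode + 3  # D is followed by X, M, W
        result.append({"decode": decode, "writeback": writeback,
                       "stalls": decode - earliest})
        if dest is not None:
            last_writeback[dest] = writeback
        prev_decode = decode
    return result


if __name__ == "__main__":
    # add $1, $2, $3 ; sub $4, $5, $1
    for row in schedule_with_stalls([("$1", ["$2", "$3"]), ("$4", ["$5", "$1"])]):
        print(row)
```

With a register file that cannot write and read in the same cycle, the consumer would need one more stall cycle than this model reports.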

Control Dependence: one instruction affects which instruction executes next:
  sw $4, 0($5)
  bne $2, $3, loop
  sub $6, $7, $8
  [timing diagram: cycles 1 to 13, rows sw, bne, sub; sw passes through F, D, X, M, W]

Control Dependence - Stall: detect the dependence and stall:
  sw $4, 0($5)
  bne $2, $3, loop
  sub $6, $7, $8
  [timing diagram: cycles 1 to 13, rows sw, bne, sub; sw passes through F, D, X, M, W]

Pipelined Datapath
  Start with the single-cycle datapath (Fig. 6.10)
  Pipelined execution: assume each instruction has its own datapath, but each instruction uses a different part in every cycle; multiplex all onto one datapath
  Latches separate cycles (like multicycle)
  Ignore hazards for now: data, control

Pipelined Datapath (Fig. 6.12)

Pipelined Datapath: Instruction flow: add and load. Write of registers: pass register specifiers. Any info needed by a later stage gets passed down the pipeline, e.g. the store value through EX.

Pipelined Control: control signals by stage
  IF and ID: none
  EX:  ALUop, ALUsrc, RegDst
  MEM: Branch, MemRead, MemWrite
  WB:  MemtoReg, RegWrite

Figure 6.25

Figure 6.29

Figure 6.30

Pipelined Control: Controlled by different instructions. Decode instructions and pass the signals down the pipe. Control sequencing is embedded in the pipeline: no explicit FSM; instead, a distributed FSM.

Pipelining: Not too complex yet. Complications to come: data hazards, control hazards, exceptions.

RAW Hazards
  Must first detect RAW hazards (pipeline analysis proves that WAR/WAW don’t occur). Compare:
    ID/EX.WriteRegister = IF/ID.ReadRegister1
    ID/EX.WriteRegister = IF/ID.ReadRegister2
    EX/MEM.WriteRegister = IF/ID.ReadRegister1
    EX/MEM.WriteRegister = IF/ID.ReadRegister2
    MEM/WB.WriteRegister = IF/ID.ReadRegister1
    MEM/WB.WriteRegister = IF/ID.ReadRegister2

RAW Hazards: Not every match is a hazard, because WriteRegister may not be used (e.g. sw) or a ReadRegister may not be used (e.g. addi does not read a second register; jump reads none). Do something only if necessary.
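The six register comparisons, qualified by whether each field is actually used, can be sketched as follows. The `InFlight` record and the `uses_read1`/`uses_read2` flags are hypothetical names standing in for whatever the decode logic provides.

```python
from dataclasses import dataclass


@dataclass
class InFlight:
    """An older instruction held in ID/EX, EX/MEM or MEM/WB."""
    write_register: int
    reg_write: bool  # False for instructions such as sw that write no register


def raw_hazard(read1, uses_read1, read2, uses_read2, in_flight):
    """True if the instruction in IF/ID reads a register an older one will write."""
    for older in in_flight:
        if not older.reg_write:
            continue  # the WriteRegister field is not used, so a match is meaningless
        if uses_read1 and older.write_register == read1:
            return True
        if uses_read2 and older.write_register == read2:
            return True
    return False


if __name__ == "__main__":
    # sub $4, $5, $1 in IF/ID while add $1, $2, $3 sits in ID/EX
    print(raw_hazard(5, True, 1, True, [InFlight(write_register=1, reg_write=True)]))
```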

RAW Hazards: Hazard Detection Unit: several 5-bit (or 6-bit) comparators. Response? Stall the pipeline: instructions in IF and ID stay; the IF/ID pipeline latch is not updated; send a ‘nop’ down the pipeline (called a bubble). Implemented with PCWrite, IF/IDWrite, and a nop mux.

RAW Hazard Forwarding: A better response: forwarding (also called bypassing). Comparators ensure the register is read after it is written. Instead of stalling until the write occurs, use a mux to select the forwarded value rather than the register value. Control the mux with the hazard detection logic.

Forwarding: Use temporary results; don’t wait for them to be written. Register file forwarding handles a read and write to the same register. ALU forwarding. (In the figure: what if this $2was $13?)
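The mux control for ALU forwarding is commonly written as a priority check: prefer the newest value in EX/MEM, then the one in MEM/WB, otherwise use the register file. The sketch below follows that usual textbook formulation, including the convention that MIPS register 0 is never forwarded; the function name and return labels are illustrative, not taken from the slides.

```python
def forward_select(src_reg, ex_mem_reg_write, ex_mem_rd, mem_wb_reg_write, mem_wb_rd):
    """Choose the source for one ALU operand: 'EX/MEM', 'MEM/WB' or 'REGFILE'."""
    # Newest result first: the instruction immediately ahead, now in EX/MEM.
    if ex_mem_reg_write and ex_mem_rd != 0 and ex_mem_rd == src_reg:
        return "EX/MEM"
    # Otherwise the instruction two ahead, now in MEM/WB.
    if mem_wb_reg_write and mem_wb_rd != 0 and mem_wb_rd == src_reg:
        return "MEM/WB"
    # No match: the value read from the register file is current.
    return "REGFILE"


if __name__ == "__main__":
    # Both older instructions write $1; the EX/MEM copy is the most recent.
    print(forward_select(1, True, 1, True, 1))
```

Checking EX/MEM first matters when two in-flight instructions write the same register, because only the younger result is the one the program order requires.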

Forwarding

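Forwarding Paths (ALU instructions) [figure: pipeline stages IF, ID, RD, ALU, MEM, WB with ALU forwarding paths a, b, c; instruction i writes R1, and i+1, i+2, i+3 read R1]
  (i → i+1): forwarding via path a
  (i → i+2): forwarding via path b
  (i → i+3): i writes R1 before i+3 reads R1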

Write before Read RF: Register file design: 2-phase clocks are common. Write RF on the first phase; read RF on the second phase. Hence, in the same cycle: write $1, read $1; no bypass needed. If read before write or DFF-based, a bypass is needed.

Implementation of ALU Forwarding

Can't always forward: Load word can still cause a hazard: an instruction tries to read a register following a load instruction that writes to the same register. Thus, we need a hazard detection unit to “stall” the instruction that follows the load.

Stalling: We can stall the pipeline by keeping an instruction in the same stage.
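A sketch of the load-use check and the resulting stall controls, assuming the usual textbook condition: the instruction in ID/EX is a load (MemRead set) and its destination matches a source of the instruction in IF/ID. PCWrite and IF/IDWrite are the signal names from the earlier hazard detection slide; `insert_bubble` is a hypothetical name for the nop mux select.

```python
def load_use_stall(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    """True when the instruction in IF/ID needs the result of a load still in EX."""
    return bool(id_ex_mem_read and id_ex_rt in (if_id_rs, if_id_rt))


def stall_controls(stall):
    """Keep IF and ID in place and send a bubble down the pipeline when stalling."""
    return {
        "PCWrite": not stall,     # hold the PC so the same instruction is refetched
        "IF/IDWrite": not stall,  # hold the IF/ID latch so decode repeats
        "insert_bubble": stall,   # zero the control signals entering ID/EX
    }


if __name__ == "__main__":
    # lw $2, 20($1) followed immediately by and $4, $2, $5
    stall = load_use_stall(True, 2, 2, 5)
    print(stall, stall_controls(stall))
```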

Forwarding Paths (Load instructions) [figure: pipeline stages IF, ID, RD, ALU, MEM, WB with load forwarding paths d and e; instruction i: R1 ← MEM[], and i+1, i+2 read R1]
  (i → i+1): stall i+1, then forwarding via path d
  (i → i+2): i writes R1 before i+2 reads R1

Control Flow Hazards: Control flow instructions: branches, jumps, jals, returns. Can’t fetch until the branch outcome is known; too late for the next IF.

Control Flow Hazards: What to do? Always stall: easy to implement, but performs poorly. If 1/6th of instructions are branches and each branch costs 3 stall cycles, CPI = 1 + 3 x 1/6 = 1.5 (lower bound).
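The CPI estimate generalises to other branch frequencies and penalties. In this sketch the function and parameter names are hypothetical; `penalized_fraction` is an added knob for schemes in which only some branches pay the penalty (for example, only mispredicted ones), and its default of 1.0 reproduces the always-stall case.

```python
def cpi_with_branch_penalty(branch_fraction, penalty_cycles,
                            penalized_fraction=1.0, base_cpi=1.0):
    """CPI = base CPI + (fraction of branches) x (fraction penalised) x (penalty)."""
    return base_cpi + branch_fraction * penalized_fraction * penalty_cycles


if __name__ == "__main__":
    # Always stall: every branch costs 3 cycles.
    print(cpi_with_branch_penalty(1 / 6, 3))
    # Hypothetical case in which only half of the branches pay the penalty.
    print(cpi_with_branch_penalty(1 / 6, 3, penalized_fraction=0.5))
```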

Control Flow Hazards: Predict branch not taken. Send sequential instructions down the pipeline; kill instructions later if incorrect. Must stop memory accesses and RF writes. Late flush of instructions on misprediction is complex: global signal (wire delay).

Branch Hazards: When we decide to branch, other instructions are in the pipeline! We are predicting “branch not taken”, so we need to add hardware for flushing instructions if we are wrong.

Flushing Instructions

Control Flow Hazards: Even better but more complex: predict taken; predict both (eager execution); predict one or the other dynamically (adapt to program branch patterns). Lots of chip real estate is spent on this these days (Pentium III, 4, Alpha 21264). Current research topic.

Dynamic branch prediction

Improving Performance
  Try to avoid stalls! E.g., reorder these instructions:
    lw $t0, 0($t1)
    lw $t2, 4($t1)
    sw $t2, 0($t1)
    sw $t0, 4($t1)
  Add a “branch delay slot”: always execute the instruction following the branch (the “delay slot”; later example on MIPS pipeline). Put a useful instruction there, otherwise a ‘nop’; rely on the compiler to “fill” the slot with something useful.
  Superscalar: start more than one instruction in the same cycle.
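To see why reordering helps, the sketch below counts load-use stalls in the four-instruction example before and after swapping the two stores. It assumes a forwarding pipeline in which any instruction that uses a register loaded by the instruction immediately before it stalls for one cycle; the tuple encoding and function name are illustrative.

```python
def count_load_use_stalls(program):
    """program: list of (op, reg, base) for 'lw reg, off(base)' / 'sw reg, off(base)'."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_reg, _ = prev
        cur_op, cur_reg, cur_base = cur
        # A load reads only its base register; a store also reads the data register.
        sources = {cur_base} | ({cur_reg} if cur_op == "sw" else set())
        if prev_op == "lw" and prev_reg in sources:
            stalls += 1
    return stalls


ORIGINAL = [("lw", "$t0", "$t1"), ("lw", "$t2", "$t1"),
            ("sw", "$t2", "$t1"), ("sw", "$t0", "$t1")]
# The two stores write different addresses, so swapping them is safe.
REORDERED = [("lw", "$t0", "$t1"), ("lw", "$t2", "$t1"),
             ("sw", "$t0", "$t1"), ("sw", "$t2", "$t1")]

if __name__ == "__main__":
    print(count_load_use_stalls(ORIGINAL), count_load_use_stalls(REORDERED))
```

Moving `sw $t0` between the second load and `sw $t2` gives the loaded `$t2` an extra cycle to arrive, so the stall disappears without adding instructions.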

Dynamic Scheduling: The hardware performs the “scheduling”: hardware tries to find instructions to execute; out-of-order execution is possible; speculative execution and dynamic branch prediction. All modern processors are very complicated: DEC Alpha 21264: 9-stage pipeline, 6-instruction issue; PowerPC and Pentium: branch history table. Compiler technology is important. This class has given you the background you need to learn more.

Exceptions: Even worse: in one cycle, several can occur: I/O interrupt, user trap to OS (EX), illegal instruction (ID), arithmetic overflow, hardware error, etc. Interrupt priorities must be supported.