Published by Sierra Fogg. Modified over 9 years ago.
Computer Architecture 2014 – Pipeline
Computer Architecture: Pipeline
By Yoav Etsion & Dan Tsafrir
Presentation based on slides by David Patterson, Avi Mendelson, Randi Katz, and Lihu Rappoport
Pipeline idea: keep everyone busy
Pipeline: more accurately…
- Expert in cutting bread
- Expert in placing roast beef
- Expert in placing tomato and closing the sandwich
Pipelining elsewhere
- Unix shell: grep string File | wc -l
- Assembling cars
- Whenever we want to keep functional units busy
Pipeline: microarchitecture (before)
[Figure: timing diagram of serial execution. Axes: time (ns, ticks every 2 ns) vs. program execution order. Each instruction passes through Inst Fetch, Reg, ALU, Data Access, Reg; a new instruction starts every 8 ns.]
Program execution order:
  lw R1, 100(R0)    // R1 = mem[0+100]
  lw R2, 200(R0)
  lw R3, 300(R0)
Steps of the first lw:
- fetch
- decode & bring regs to ALU
- 100+R0
- access mem
- write back result to R1
Pipeline: microarchitecture (before and after)
- Speed set by slowest component (instruction takes longer in pipeline)
- First commercial use in 1985
- In Intel chips since 486 (until then, serial execution)
[Figure: two timing diagrams, time (ns) vs. program execution order, for the same three instructions. Stages: Inst Fetch, Reg, ALU, Data Access, Reg. Before: a new instruction starts every 8 ns. After: a new instruction starts every 2 ns.]
Program execution order:
  lw R1, 100(R0)    // R1 = mem[0+100]
  lw R2, 200(R0)
  lw R3, 300(R0)
Steps of the first lw:
- fetch
- decode & bring regs to ALU
- 100+R0
- access mem
- write back result to R1
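The before/after timing can be checked with a little arithmetic. The sketch below is only an illustration: it assumes the figure's numbers (8 ns per instruction when serial, 5 stages with a 2 ns cycle when pipelined) and no stalls. The function names are hypothetical.

```python
def serial_time_ns(num_insts, inst_latency_ns):
    # Serial execution: each instruction starts only after the previous one ends.
    return num_insts * inst_latency_ns


def pipelined_time_ns(num_insts, num_stages, cycle_ns):
    # The first instruction needs num_stages cycles to finish;
    # every later instruction finishes one cycle after its predecessor.
    return (num_stages + num_insts - 1) * cycle_ns


print(serial_time_ns(3, 8))        # three lw instructions, serial
print(pipelined_time_ns(3, 5, 2))  # three lw instructions, pipelined
```

Note that a single instruction takes 10 ns in the pipeline (5 stages × 2 ns) rather than 8 ns, which is the "instruction takes longer in pipeline" remark above.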
MIPS
- Introduced in 1981 by Hennessy (of “Patterson & Hennessy”)
  - Idea suggested earlier, e.g., by John Cocke and friends at IBM in the 1970s, but not developed in full
- MIPS = Microprocessor without Interlocked Pipeline Stages
- RISC
- Often used in computer architecture courses
- Was very successful (e.g., inspired the Alpha ISA)
- Interlocks (“without interlocks”)
  - Mechanisms to allow stages to indicate they are busy
  - E.g., ‘divide’ & ‘multiply’ required interlocks => paused other stages upstream
  - With MIPS, every sub-phase of all instructions fits into 1 cycle
  - No die area wasted on pausing mechanisms => faster cycle
Pipeline: principles
- Ideal speedup = num of pipeline stages
- An instruction finishes every clock cycle
  - Namely, IPC of an ideal pipelined machine is 1
- Increase throughput rather than reduce latency
  - One instruction still takes the same time (or longer)
- Since max speedup = num of stages & latency is determined by the slowest stage, we should:
  - Partition pipe to many stages
  - Balance work across stages
  - Shorten longest stage as much as possible
Pipeline: overheads & limitations
- Can increase per-instruction latency
  - Due to stage imbalance
- Requires more logic than serial execution
- Time to “fill” pipe reduces speedup
- Time to “drain” pipe reduces speedup
  - E.g., upon interrupt or context switch
- Stalls when there are dependencies
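To see how the fill time eats into the ideal speedup, here is a minimal sketch. It assumes perfectly balanced stages, no stalls, and a serial machine that spends one cycle per stage on each instruction. The function name is hypothetical.

```python
def pipeline_speedup(num_insts, num_stages):
    # Serial machine: every instruction occupies all stages before the next starts.
    serial_cycles = num_insts * num_stages
    # Pipelined machine: num_stages cycles to fill, then one instruction per cycle.
    pipelined_cycles = num_stages + num_insts - 1
    return serial_cycles / pipelined_cycles


for n in (1, 3, 100, 1_000_000):
    print(n, pipeline_speedup(n, 5))
```

For a single instruction the speedup is 1, and it approaches the number of stages only when the instruction stream is long compared with the pipeline depth.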
Pipelined CPU
Pipeline: fetch
- Bring next instruction from memory; 4 bytes (32 bits) per instruction
- Instruction saved in register, in preparation for next pipe stage
- When not branching, next instruction is in next word
Pipeline: decode + regs fetch
- Decode source reg numbers
- Read their values from reg file
- Reg IDs are 5 bits (2^5 = 32)
Pipeline: decode + regs fetch
- Decode & sign-extend immediate (from 16 bits to 32)
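Sign extension copies bit 15 of the immediate into the upper 16 bits. A minimal sketch, with hypothetical helper names, where Python integers stand in for 32-bit words:

```python
def sign_extend16(imm16):
    # Keep only the low 16 bits, then replicate bit 15 into bits 31..16.
    imm16 &= 0xFFFF
    if imm16 & 0x8000:
        return imm16 | 0xFFFF0000
    return imm16


def to_signed32(word):
    # Interpret a 32-bit pattern as a two's-complement signed value.
    return word - (1 << 32) if word & 0x80000000 else word


print(hex(sign_extend16(0x0009)))         # positive immediate: upper bits are 0
print(hex(sign_extend16(0xFFFC)))         # negative immediate: upper bits are 1
print(to_signed32(sign_extend16(0xFFFC)))
```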
Pipeline: decode + regs fetch
- Decode destination reg (can be one of two, depending on op) & save in latch for next stage…
- …based on the op type, the next phase will determine which of the two regs is the destination
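The field extraction in the decode stage can be sketched as follows. This assumes the standard 32-bit MIPS encoding (opcode in bits 31-26, rs 25-21, rt 20-16, rd 15-11, immediate 15-0) and treats opcode 0 as "R" type. The function names are hypothetical, and the destination rule is a simplification: stores and branches do not write a register at all.

```python
def decode_fields(word):
    # Slice the 32-bit instruction word into its fixed-position fields.
    return {
        "opcode": (word >> 26) & 0x3F,
        "rs": (word >> 21) & 0x1F,   # source reg 1
        "rt": (word >> 16) & 0x1F,   # source reg 2, or destination for "I" ops
        "rd": (word >> 11) & 0x1F,   # destination for "R" ops
        "imm": word & 0xFFFF,        # 16-bit immediate for "I" ops
    }


def dest_reg(word):
    # Assumption: opcode 0 means "R" type (destination is rd);
    # any other opcode is treated as an "I" op that writes rt.
    fields = decode_fields(word)
    return fields["rd"] if fields["opcode"] == 0 else fields["rt"]


sub_word = (2 << 21) | (3 << 16) | (11 << 11) | 0x22   # sub R11,R2,R3
lw_word = (0x23 << 26) | (1 << 21) | (10 << 16) | 9    # lw R10,9(R1)
print(dest_reg(sub_word), dest_reg(lw_word))
```

Both candidate destination fields are read out of the word unconditionally, which is why the choice between them can be deferred to a later stage.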
Pipeline: execute
- ALU computes: “R” operation (the “shift” field is missing from this illustration)
- [Figure labels: inputs reg1, reg2, func (6 bits); result goes to reg3]
Pipeline: execute
- ALU computes: “I” operation (not branch & not load/store)
- [Figure labels: inputs reg1, imm, opcode; result goes to reg2]
Pipeline: execute
- ALU computes: “I” operation, conditional branch BEQ or BNE
  [ if (reg1==reg2) pc = pc+4 + (imm<<2) ]
- [Figure labels: inputs reg1, reg2, imm, opcode; output “Branch?”]
Pipeline: execute
- ALU computes: “I” operation, load (store is similar)
  ( reg2 = mem[reg1+imm] )
- [Figure labels: inputs reg1, imm; result goes to reg2]
Pipeline: updating PC
- No branch: just add 4 to PC
- Unconditional branch: add immediate to PC+4 (type J operation)
- Conditional branch: depends on result of ALU
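The no-branch and conditional-branch cases can be written as a small function. This is a sketch using the formula pc+4 + (imm<<2) from the execute slides; the names are hypothetical, and the unconditional (type J) case is left out.

```python
def signed_imm16(imm16):
    # Interpret the 16-bit immediate as a two's-complement signed value.
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def next_pc(pc, is_cond_branch=False, cond_true=False, imm16=0):
    fall_through = pc + 4
    if is_cond_branch and cond_true:
        # Taken branch: the offset counts instructions, so multiply by 4.
        return fall_through + (signed_imm16(imm16) << 2)
    return fall_through


print(next_pc(0))                  # sequential instruction
print(next_pc(4, True, True, 27))  # taken branch at address 4, immediate 27
print(next_pc(4, True, False, 27)) # same branch, not taken
```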
Pipelined CPU with Control
Pipeline Example: cycle 1
   0  lw  R10,9(R1)
   4  sub R11,R2,R3
   8  and R12,R4,R5
  12  or  R13,R6,R7
Pipeline Example: cycle 2
   0  lw  R10,9(R1)
   4  sub R11,R2,R3
   8  and R12,R4,R5
  12  or  R13,R6,R7
Pipeline Example: cycle 3
   0  lw  R10,9(R1)
   4  sub R11,R2,R3
   8  and R12,R4,R5
  12  or  R13,R6,R7
Pipeline Example: cycle 4
[Figure: pipelined datapath with control (PC, Instruction Memory, Register File, Sign extend, ALU, data Memory, muxes, and the IF/ID, ID/EX, EX/MEM, MEM/WB latches), annotated with the state at cycle 4: “or” is being fetched, “and” reads [R4] and [R5], “sub” computes [R2]-[R3], and “lw” reads data from memory address [R1]+9.]
   0  lw  R10,9(R1)
   4  sub R11,R2,R3
   8  and R12,R4,R5
  12  or  R13,R6,R7
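The cycle-by-cycle example can be reproduced with a tiny model. It assumes an ideal 5-stage pipeline with no stalls, where instruction i (counting from 0) enters IF in cycle i+1. The stage names and the function are illustrative only.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def stage_occupancy(program, cycle):
    # In cycle c (1-based), stage s (0-based) holds instruction c - 1 - s, if any.
    occupancy = {}
    for stage_index, stage in enumerate(STAGES):
        inst_index = cycle - 1 - stage_index
        in_range = 0 <= inst_index < len(program)
        occupancy[stage] = program[inst_index] if in_range else None
    return occupancy


program = ["lw R10,9(R1)", "sub R11,R2,R3", "and R12,R4,R5", "or R13,R6,R7"]
for cycle in range(1, 5):
    print(cycle, stage_occupancy(program, cycle))
```

In cycle 4 this places "or" in IF, "and" in ID, "sub" in EX, and "lw" in MEM, matching the annotated datapath.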
Pipeline Hazards: 1. Structural Hazards
Structural Hazard
- Two instructions attempt to use same resource simultaneously
- Problem: register file accessed in 2 stages
  - Write during stage 5 (WB)
  - Read during stage 2 (ID)
  - => Resource (RF) conflict
- Solution
  - Split stage into two sub-stages
  - Do write in first half
  - Do reads in second half
  - 2 read ports, 1 write port (separate)
Structural Hazard
- Problem: memory accessed in 2 stages
  - Fetch (stage 1), when reading instructions from memory
  - Memory (stage 4), when data is read/written from/to memory
  - Princeton architecture
- Solution
  - Split data/inst. memories
  - Harvard architecture
  - Today, separate instruction cache and data cache
Pipeline Hazards: 2. Data Hazards
Data Dependencies
- When two instructions access the same register
  - RAW: Read-After-Write (true dependency)
  - WAR: Write-After-Read (anti-dependency)
  - WAW: Write-After-Write (false dependency)
- Key problem with regular in-order pipelines is RAW
- We will also learn about out-of-order pipelines
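The three dependency kinds can be told apart by comparing destination and source registers. A minimal sketch: the (dest, sources) tuple representation and the function name are assumptions made for illustration.

```python
def classify_dependencies(first, second):
    """Each instruction is (dest_reg or None, [source regs]); first precedes second."""
    first_dest, first_srcs = first
    second_dest, second_srcs = second
    kinds = set()
    if first_dest is not None and first_dest in second_srcs:
        kinds.add("RAW")   # second reads what first writes
    if second_dest is not None and second_dest in first_srcs:
        kinds.add("WAR")   # second overwrites what first reads
    if first_dest is not None and first_dest == second_dest:
        kinds.add("WAW")   # both write the same register
    return kinds


sub_inst = ("R2", ["R1", "R3"])    # sub R2, R1, R3
and_inst = ("R12", ["R2", "R5"])   # and R12, R2, R5
print(classify_dependencies(sub_inst, and_inst))
```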
Data Dependencies
- Problem with starting next instruction before first is finished
  - Dependencies that “go backward in time” are data hazards
[Figure: pipeline diagram, time (clock cycles CC 1 to CC 9) vs. program execution order. The value of R2 is 10 until sub writes it back, and -20 afterwards.]
Program execution order:
  sub R2, R1, R3
  and R12, R2, R5
  or  R13, R6, R2
  add R14, R2, R2
  sw  R15, 100(R2)
RAW Hazard: HW Solution 1 - Add Stalls
- Let the hardware detect hazard and add stalls if needed
- Problem: slow!
- Solution: forwarding whenever possible
[Figure: pipeline diagram (IM, Reg, DM, Reg), time (clock cycles CC 1 to CC 9) vs. program execution order, with a stall (bubble) inserted after sub. The value of R2 is 10 until sub writes it back, and -20 afterwards.]
Program execution order:
  sub R2, R1, R3
  stall
  and R12, R2, R5
  or  R13, R6, R2
  add R14, R2, R2
  sw  R15, 100(R2)
RAW Hazard: HW Solution 2 - Forwarding
- Use temporary results, don’t wait for them to be written to the register file
  - Register file forwarding to handle read/write to same register
  - ALU forwarding
[Figure: pipeline diagram (IM, Reg, DM, Reg), time (clock cycles CC 1 to CC 9) vs. program execution order, with rows for the value of R2 (10, then -20), the value in EX/MEM, and the value in MEM/WB (X except when holding -20).]
Program execution order:
  sub R2, R1, R3
  and R12, R2, R5
  or  R13, R6, R2
  add R14, R2, R2
  sw  R15, 100(R2)
Forwarding Hardware
- Added 2 mux units before ALU
- Each mux gets 3 inputs, from:
  1. Prev stage (ID/EX)
  2. Next stage (EX/MEM)
  3. The one after (MEM/WB)
- Forward unit tells the 2 mux units which input to use
Forwarding Control
EX Hazard:
  A. if (EX/MEM.RegWrite and (EX/MEM.WriteReg = ID/EX.ReadReg1)) then ALUSelA = 1
  B. if (EX/MEM.RegWrite and (EX/MEM.WriteReg = ID/EX.ReadReg2)) then ALUSelB = 1
MEM Hazard:
  if (not A and MEM/WB.RegWrite and (MEM/WB.WriteReg = ID/EX.ReadReg1)) then ALUSelA = 2
  if (not B and MEM/WB.RegWrite and (MEM/WB.WriteReg = ID/EX.ReadReg2)) then ALUSelB = 2
How to read rule A:
- If, in memory stage, we’re writing the output to a register
- And the reg we’re writing to also happens to be inp_reg1 for the execute stage
- Then mux_A should select inp_1, namely, the ALU should feed itself
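The forwarding rules translate almost directly into code. The sketch below follows the four conditions as written. The function and parameter names are hypothetical, and the value 0 for "use the ID/EX value" is an assumption, since the slide only names selects 1 and 2.

```python
def forwarding_selects(ex_mem_reg_write, ex_mem_write_reg,
                       mem_wb_reg_write, mem_wb_write_reg,
                       id_ex_read_reg1, id_ex_read_reg2):
    # EX hazard: the instruction one ahead (now in MEM) writes a register we read.
    ex_a = ex_mem_reg_write and ex_mem_write_reg == id_ex_read_reg1
    ex_b = ex_mem_reg_write and ex_mem_write_reg == id_ex_read_reg2
    # MEM hazard: the instruction two ahead (now in WB) writes it,
    # and the more recent EX/MEM value does not already cover it.
    mem_a = (not ex_a) and mem_wb_reg_write and mem_wb_write_reg == id_ex_read_reg1
    mem_b = (not ex_b) and mem_wb_reg_write and mem_wb_write_reg == id_ex_read_reg2
    # Assumption: select 0 means "no forwarding, use the ID/EX value".
    sel_a = 1 if ex_a else 2 if mem_a else 0
    sel_b = 1 if ex_b else 2 if mem_b else 0
    return sel_a, sel_b


# "and R12,R2,R5" in EX while "sub R2,R1,R3" is in MEM: forward EX/MEM to input A.
print(forwarding_selects(True, 2, False, 0, 2, 5))
# "or R13,R6,R2" in EX, "and" (writes R12) in MEM, "sub" (writes R2) in WB.
print(forwarding_selects(True, 12, True, 2, 6, 2))
```

The "not A" and "not B" terms give priority to the EX/MEM value, which is the more recent result when both latches hold a write to the same register.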
Forwarding Hardware
Example: Bypassing From EX to Src1 and From WB to Src2
  lw  R11,9(R1)
  sub R10,R2,R3
  and R12,R10,R11
- load op => read from “1”
Forwarding Hardware
Example #2: Bypassing From WB to Src2
  sub R10,R2,R3
  xxx
  and R12,R10,R11
- not load op => read from “0”
RF Split => fewer forwarding paths
  sub R2, R1, R3
  xxx
  and R12,R2,R11
- Register file is written during first half of the cycle and read during second half of the cycle
  - Returns updated value
- Compiler must cleverly order instructions
- Ineffective if pipeline stages require more than 1 cycle
[Figure: pipeline diagram (IM, Reg, DM, Reg) for the instructions above.]
Can't always forward (stall inevitable)
- “load” op can cause “un-forwardable” hazards
  - Load value to R
  - In the next instruction, use R as input
- A bigger problem in longer pipelines
[Figure: pipeline diagram (IM, Reg, DM, Reg), time (clock cycles CC 1 to CC 9) vs. program execution order.]
Program execution order:
  lw  R2, 30(R1)
  and R12, R2, R5
  or  R13, R6, R2
  add R14, R2, R2
  sw  R15, 100(R2)
Stalling
- De-assert the enable to ID/EXE
  - The dependent instruction (and) stays another cycle in ID/EXE
- De-assert the enable to the IF/ID latch and to the PC
  - Freeze pipeline stages preceding the stalled instruction
- Issue a NOP into the EXE/MEM latch (instead of the stalled inst.)
- Allow the stalling instruction (lw) to move on
Hazard Detection (Stall) Logic
  if ( (ID/EX.RegWrite) and (ID/EX.opcode == lw) and
       ( (ID/EX.WriteReg == IF/ID.ReadReg1) or
         (ID/EX.WriteReg == IF/ID.ReadReg2) ) )
  then stall IF/ID
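The stall condition above can be expressed as a single boolean function. This is a sketch: the parameter names mirror the latch fields, and representing the opcode as the string "lw" is an assumption made for readability.

```python
def must_stall(id_ex_reg_write, id_ex_opcode, id_ex_write_reg,
               if_id_read_reg1, if_id_read_reg2):
    # Stall when the instruction ahead is a load whose destination
    # is read by the instruction currently being decoded.
    return bool(
        id_ex_reg_write
        and id_ex_opcode == "lw"
        and id_ex_write_reg in (if_id_read_reg1, if_id_read_reg2)
    )


# lw R2,30(R1) followed by and R12,R2,R5: load-use hazard.
print(must_stall(True, "lw", 2, 2, 5))
# sub R2,R1,R3 followed by and R12,R2,R5: forwarding is enough, no stall.
print(must_stall(True, "sub", 2, 2, 5))
```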
Forwarding + Hazard Detection Unit
Example: code for (assume all variables are in memory):
  a = b + c;
  d = e - f;
Slow code:
  LW  Rb,b
  LW  Rc,c
  Stall
  ADD Ra,Rb,Rc
  SW  a,Ra
  LW  Re,e
  LW  Rf,f
  Stall
  SUB Rd,Re,Rf
  SW  d,Rd
Fast code:
  LW  Rb,b
  LW  Rc,c
  LW  Re,e
  ADD Ra,Rb,Rc
  LW  Rf,f
  SW  a,Ra
  SUB Rd,Re,Rf
  SW  d,Rd
- Instruction order can be changed as long as correctness is kept (no dependencies violated)
- Compiler scheduling helps avoid load hazards (when possible)
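The difference between the two schedules can be counted mechanically. The sketch below assumes a pipeline with forwarding, where the only remaining stall is one cycle when an instruction reads the destination of the load immediately before it. The (op, dest, sources) tuples and the function name are hypothetical.

```python
def count_load_use_stalls(program):
    # One stall for each instruction that reads the result of the load right before it.
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, curr_srcs = curr
        if prev_op == "LW" and prev_dest in curr_srcs:
            stalls += 1
    return stalls


slow = [
    ("LW", "Rb", []), ("LW", "Rc", []), ("ADD", "Ra", ["Rb", "Rc"]),
    ("SW", None, ["Ra"]),
    ("LW", "Re", []), ("LW", "Rf", []), ("SUB", "Rd", ["Re", "Rf"]),
    ("SW", None, ["Rd"]),
]
fast = [
    ("LW", "Rb", []), ("LW", "Rc", []), ("LW", "Re", []),
    ("ADD", "Ra", ["Rb", "Rc"]), ("LW", "Rf", []), ("SW", None, ["Ra"]),
    ("SUB", "Rd", ["Re", "Rf"]), ("SW", None, ["Rd"]),
]
print(count_load_use_stalls(slow), count_load_use_stalls(fast))
```

In the fast schedule every load is separated from its first consumer by at least one independent instruction, so the count drops to zero.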
Pipeline Hazards: 3. Control Hazards
Branch, but where?
- The decision to branch happens deep within the pipeline
- Likewise, the target of the branch becomes known deep within the pipeline
- How does this affect the pipeline logic?
- For example…
Executing a BEQ Instruction (i)
BEQ R4, R5, 27 → if (R4-R5=0) then PC ← PC+4+SignExt(27)*4; else PC ← PC+4
   0  or
   4  beq R4, R5, 27
   8  and
  12  sw
  16  sub
- Assume this program state
Executing a BEQ Instruction (i)
BEQ R4, R5, 27 → if (R4-R5=0) then PC ← PC+4+SignExt(27)*4; else PC ← PC+4
- We know: values of registers
- We don’t know:
  - If branch will be taken
  - What its target is
Executing a BEQ Instruction (ii)
BEQ R4, R5, 27 → if (R4-R5=0) then PC ← PC+4+SignExt(27)*4; else PC ← PC+4
- Calculate branch condition = compute R4-R5 & compare to 0
- Calculate branch target
- …Now we know, but only in the next cycle will this affect the PC
Executing a BEQ Instruction (iii)
BEQ R4, R5, 27 → if (R4-R5=0) then PC ← PC+4+SignExt(27)*4; else PC ← PC+4
- Finally, if taken, branch sets the PC
Control Hazard on Branches
[Figure: pipeline diagram (IM, Reg, DM, Reg) for beq, and, sw, sub, and the instruction from the branch target, with the PC update marked.]
- Outcome: the 3 instructions following the branch are in the pipeline even if the branch is taken!
Traps, Exceptions and Interrupts
- Indication of events that require a higher authority to intervene (i.e., the operating system)
- Atomically changes the protection mode and branches to OS
  - Protection mode determines what the running program is allowed to do (access devices, memory, etc.)
- Traps: synchronous
  - The program asks for OS services (e.g., access a device)
- Exceptions: synchronous
  - The program did something bad (divide-by-zero; prot. violation)
- Interrupts: asynchronous
  - An external device needs OS attention (finished an operation)
- Can these be handled like regular branches?
Stall
- Easiest solution: stall pipe when branch encountered until resolved
- But there’s a price. Assume:
  - CPI = 1
  - 20% of instructions are branches (realistic)
  - Stall 3 cycles on every branch (extra 3 cycles for each branch)
- Then the price is:
  - CPI new = 1 + 0.2 × 3 = 1.6    // 1 = all instr., including branch
  - [ CPI new = CPI Ideal + avg. stall cycles / instr. ]
- Namely: a 60% slowdown!
Branch Prediction and Speculative Execution
Static prediction: branch not taken
- Execute instructions from the fall-through (not-taken) path
  - As if there is no branch
  - If the branch is not taken (~50%), no penalty is paid
- If branch actually taken
  - Flush the fall-through path instructions before they change the machine state (memory / registers)
  - Fetch the instructions from the correct (taken) path
- Assuming ~50% of branches are not taken on average
  - CPI new = 1 + (0.2 × 0.5) × 3 = 1.3
  - 30% slowdown instead of 60%
- What happens in longer pipelines?
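The CPI arithmetic for both the always-stall and the predict-not-taken cases follows one formula, CPI new = CPI ideal + avg. stall cycles per instruction. A minimal sketch with a hypothetical function name:

```python
def cpi_with_branch_penalty(branch_fraction, penalized_fraction,
                            penalty_cycles, ideal_cpi=1.0):
    # Average stall cycles per instruction =
    #   (fraction of instructions that are branches)
    #   x (fraction of those branches that pay the penalty)
    #   x (penalty in cycles).
    return ideal_cpi + branch_fraction * penalized_fraction * penalty_cycles


print(cpi_with_branch_penalty(0.2, 1.0, 3))  # stall on every branch
print(cpi_with_branch_penalty(0.2, 0.5, 3))  # predict not taken, ~50% taken
```

Because the penalty term scales with the number of cycles until the branch resolves, a deeper pipeline raises the cost of every penalized branch.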
Dynamic branch prediction
- Branch misprediction is a key impediment to performance
  - Modern processors employ complex branch predictors
  - Often achieve < 3% misprediction rate
- Given an instruction, we need to predict
  - Is it a branch?
  - Branch taken?
  - Target address?
- To avoid stalling
  - Prediction needed at end of ‘fetch’
  - Before we even know what the instruction is…
- A simple mechanism: Branch Target Buffer (BTB)
BTB: the idea
- Fast lookup table with fields: Branch PC, Target PC, History
- PC of fetched instruction ?= Branch PC
  - No => we don’t know, so we don’t predict
  - Yes => instruction is a branch, so let’s predict it
- Outputs
  - Predicted branch taken or not taken? (from the history of the last few times)
  - Predicted target
- (Works in a straightforward manner only for direct branches, otherwise target PC changes)
How it works in a nutshell
- Until proven otherwise, assume branches are not taken
  - Fall through instructions (assume branch has no effect)
- Upon the first time a branch is taken
  - Pay the price (in terms of stalls), but
  - Save the details of the branch in the BTB (= PC, target PC, and whether or not branch was taken)
- While fetching, HW checks in parallel
  - Whether PC is in BTB
  - If found, make a prediction
  - Taken? Address?
- Upon misprediction
  - Flush (throw out) pipeline content & start over from right PC
Prediction steps
1. Allocate
   - Insert instruction to BTB once identified as taken branch
     - Do not insert not-taken branches
     - Implicitly predict they’d continue not to be taken
   - Insert both conditional & unconditional
     - To identify, and to save arithmetic
2. Predict
   - BTB lookup done in parallel to PC-lookup, providing:
     - Indication whether PC is a branch (=> BTB “hit”)
     - Branch target
     - Branch direction (forward or backward in program)
     - Branch type (conditional or not)
3. Update (when branch taken & its outcome becomes known)
   - Branch target, history (taken or not)
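The allocate / predict / update steps can be modeled with a dictionary. This is only a sketch under strong assumptions: unlimited capacity, a history of a single bit (the last outcome), and hypothetical class and method names. A real BTB is a fixed-size hardware table.

```python
class BranchTargetBuffer:
    def __init__(self):
        # branch PC -> {"target": predicted target PC, "taken_last": last outcome}
        self.entries = {}

    def predict(self, pc):
        # Miss, or hit with "not taken" history: predict the fall-through path.
        entry = self.entries.get(pc)
        if entry is None or not entry["taken_last"]:
            return pc + 4
        return entry["target"]

    def update(self, pc, taken, target):
        # Called once the branch outcome is known.
        if pc in self.entries:
            self.entries[pc]["taken_last"] = taken
            if taken:
                self.entries[pc]["target"] = target
        elif taken:
            # Allocate only on the first taken occurrence.
            self.entries[pc] = {"target": target, "taken_last": True}


btb = BranchTargetBuffer()
print(btb.predict(4))       # unknown PC: fall through
btb.update(4, True, 116)    # branch at PC 4 was taken to 116
print(btb.predict(4))       # now predicted taken
```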
Misprediction
- Occurs when
  - Predict = not taken, reality = taken
  - Predict = taken, reality = not taken
  - Branch taken as predicted, but wrong target (indirect, as in the jmp register)
- Must flush pipeline
  - Reset pipeline registers (similar to turning all into NOPs)
  - Commonly, other flush methods are easier to implement
  - Set the PC to the correct path
  - Start fetching instructions from correct path
CPI
- Assuming a fraction of p correct predictions
  - CPI_new = 1 + (0.2 × (1-p)) × 3
- Example, p=0.7
  - CPI_new = 1 + (0.2 × 0.3) × 3 = 1.18
- Example, p=0.98
  - CPI_new = 1 + (0.2 × 0.02) × 3 = 1.012
- (But this is a simplistic model; in reality the price can sometimes be much higher.)
History & prediction algorithm
- “Always backward” prediction
  - Works for long loops
- Some branches exhibit “locality”
  - Typically behave as the last time they were invoked
  - Typically depend on their previous outcome (& it alone)
- Can save a history window
  - What happened last time, and before that, and before…
  - The bigger the window, the greater the complexity
- Some branches regularly alternate between taken & untaken
  - Taken, then untaken, then taken, …
  - Need only one history bit to identify this
- Some branches are correlated with previous branches
  - Those that lead to them
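The remark that one history bit suffices for an alternating branch can be illustrated by comparing two toy predictors for a single branch. Both are hypothetical sketches rather than any particular processor's design: the first repeats the last outcome, the second uses the last outcome as an index into a two-entry table that remembers what followed it.

```python
def misses_last_outcome(outcomes):
    # Predict that the branch behaves as it did the last time.
    last, misses = False, 0
    for outcome in outcomes:
        if last != outcome:
            misses += 1
        last = outcome
    return misses


def misses_one_history_bit(outcomes):
    # table[h] = what the branch did the last time its previous outcome was h.
    table = {False: False, True: False}
    last, misses = False, 0
    for outcome in outcomes:
        if table[last] != outcome:
            misses += 1
        table[last] = outcome   # learn what follows this history
        last = outcome
    return misses


alternating = [True, False] * 4   # taken, untaken, taken, ...
print(misses_last_outcome(alternating), misses_one_history_bit(alternating))
```

On the alternating pattern the last-outcome predictor is wrong every time, while the history-indexed one mispredicts only while warming up.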
Adding a BTB to the Pipeline
Using The BTB
[Flowchart, stages IF, ID, EXE:]
IF
- PC moves to next instruction
- Inst Mem gets PC and fetches new inst; BTB gets PC and looks it up
- BTB hit?
  - no: PC ← PC + 4
  - yes: Br taken?
    - yes: PC ← pred addr
    - no: PC ← PC + 4
ID
- IF/ID latch loaded with new inst (pred inst if predicted taken, seq. inst otherwise)
- Branch? (yes/no; continues in EXE)
Using The BTB (cont.)
[Flowchart, stages ID, EXE, MEM, WB:]
- Branch?
  - no: continue
  - yes: calculate br cond & trgt
- Correct pred?
  - yes: continue
  - no: flush pipe & update PC; IF/ID latch loaded with correct inst
- Update BTB
Prediction algorithm
- Can do an entire course on this issue
  - Still actively researched
- As noted, modern predictors can often achieve misprediction < 2%
  - Still, it has been shown that these 2% can sometimes significantly worsen performance
  - A real problem in out-of-order pipelines
- We did not talk about the issue of indirect branches
  - As in virtual function calls (object oriented)
  - Where the branch target is written in memory, elsewhere