Computer Architecture 2011 – Pipeline 1 Computer Architecture MIPS Pipeline Lihu Rappoport and Adi Yoaz Some of the slides were taken from: (1) Avi Mendelson.

Computer Architecture 2011 – Pipeline 1 Computer Architecture MIPS Pipeline Lihu Rappoport and Adi Yoaz Some of the slides were taken from: (1) Avi Mendelson (2) Randi Katz (3) Patterson

Computer Architecture 2011 – Pipeline 2 Data Access Data Access Data Access Data Access Data Access Pipelining Instructions Ideal speedup is number of stages in the pipeline. Do we achieve this? 24681012141618 24681012 14... Inst Fetch Reg ALU Reg Inst Fetch Reg ALU Reg Inst Fetch Inst Fetch Reg ALU Reg Inst Fetch Reg ALU Reg Inst Fetch Reg ALU Reg 2 ns 8 ns Time Program execution order lw R1, 100(R0) lw R2, 200(R0) lw R3, 300(R0) Time Program execution order lw R1, 100(R0) lw R2, 200(R0) lw R3, 300(R0)

Computer Architecture 2011 – Pipeline 3 Pipelined Car Assembly chassisenginefinish 1 hour2 hours 1 hour Car 1Car 2Car 3

Computer Architecture 2011 – Pipeline 4 Pipelining  Pipelining does not reduce the latency of single task, it increases the throughput of entire workload  Potential speedup = Number of pipe stages  Pipeline rate is limited by the slowest pipeline stage  Partition the pipe to many pipe stages  Make the longest pipe stage to be as short as possible  Balance the work in the pipe stages  Pipeline adds overhead (e.g., latches)  Time to “fill” pipeline and time to “drain” it reduces speedup  Stall for dependencies  Too many pipe-stages start to loose performance  IPC of an ideal pipelined machine is 1  Every clock one instruction finishes

Computer Architecture 2011 – Pipeline 5 Pipelined CPU

Computer Architecture 2011 – Pipeline 6 Pipelined CPU with Control

Computer Architecture 2011 – Pipeline 7 Structural Hazard  Attempt to use the same resource two different ways at the same time  Register File:  Accessed in 2 stages:  Read during stage 2 (ID)  Write during stage 5 (WB)  Solution: 2 read ports, 1 write port  Memory  Accessed in 2 stages:  Instruction Fetch during stage 1 (IF)  Data read/write during stage 4 (MEM)  Solution: separate instruction cache and data cache  Each functional unit can only be used once per instruction  Each functional unit must be used at the same stage for all instructions

Computer Architecture 2011 – Pipeline 8 Pipeline Example: cycle 1 0 lw R10,9(R1) 4 sub R11,R2,R3 8 and R12,R4,R5 12 or R13,R6,R7

Computer Architecture 2011 – Pipeline 11 Pipeline Example: cycle 4 ALUSrc 6 ALU result Zero Add result Add Shift left 2 ALU Control ALUOp RegDst RegWrite Read reg 1 Read reg 2 Write reg Write data Read data 1 Read data 2 Register File [15-0] [20-16] [15-11] Sign extend 16 32 ID/EX EX/MEM MEM/WB Instruction MemRead MemWrite Address Write Data Read Data Memory Branch PCSrc MemtoReg 4 Instruction Memory Address Add IF/ID 0 1 muxmux 0 1 muxmux 0 1 muxmux 1 0 muxmux Instruction lw PC 16 12 8 or [R4 ] Data from memory address [R1]+9 sub 4 [R5] 11 12 and 16 10 [R2]-[R3] 0 lw R10,9(R1) 4 sub R11,R2,R3 8 and R12,R4,R5 12 or R13,R6,R7

Computer Architecture 2011 – Pipeline 12  Problem with starting next instruction before first is finished  dependencies that “go backward in time” are data hazards Dependencies: RAW Hazard sub R2, R1, R3 and R12,R2, R5 or R13,R6, R2 add R14,R2, R2 sw R15,100(R2) Program execution order CC 1CC 2CC 3CC 4CC 5CC 6 Time (clock cycles) CC 7CC 8CC 9 1010/–20 Value of R2 10 -20

Computer Architecture 2011 – Pipeline 13 IM bubble IM IM RAW Hazard: HW Solution 1 - Add Stalls IMReg CC 1CC 2CC 3CC 4CC 5CC 6 Time (clock cycles) CC 7CC 8CC 9 1010/–20 Value of R2 DM Reg IMDMReg Reg IMReg IM Reg DMReg IMDMReg Reg Reg DM sub R2, R1, R3 stall and R12,R2, R5 or R13,R6, R2 add R14,R2, R2 sw R15,100(R2) Program execution order 10 -20 Have the hardware detect hazard and add stalls if needed Problem: this also slows us down!

Computer Architecture 2011 – Pipeline 14  Use temporary results, don’t wait for them to be written to the register file  register file forwarding to handle read/write to same register  ALU forwarding RAW Hazard: HW Solution 2 - Forwarding IMReg IMReg IMRegDMReg IMDMReg IMDMReg DMReg Reg Reg Reg XXX–20XXXXX XXXX– XXXX DM sub R2, R1, R3 and R12,R2, R5 or R13,R6, R2 add R14,R2, R2 sw R15,100(R2) Program execution order CC 1CC 2CC 3CC 4CC 5CC 6 Time (clock cycles) CC 7CC 8CC 9 1010/–20 Value of R2 10 -20 Value EX/MEM Value MEM/WB

Computer Architecture 2011 – Pipeline 15 Forwarding Hardware

Computer Architecture 2011 – Pipeline 16 Forwarding Control  EX Hazard:  if (EX/MEM.RegWrite and (EX/MEM.WriteReg = ID/EX.ReadReg1)) ALUSelA = 1  if (EX/MEM.RegWrite and (EX/MEM.WriteReg = ID/EX.ReadReg2)) ALUSelB = 1  MEM Hazard:  if (MEM/WB.RegWrite and ((not EX/MEM.RegWrite) or (EX/MEM.WriteReg  ID/EX.ReadReg1)) and (MEM/WB.WriteReg = ID/EX.ReadReg1)) ALUSelA = 2  if (MEM/WB.RegWrite and ((not EX/MEM.RegWrite) or (EX/MEM.WriteReg  ID/EX.ReadReg2)) and (MEM/WB.WriteReg = ID/EX.ReadReg2)) ALUSelB = 2

Computer Architecture 2011 – Pipeline 17 Forwarding Hardware Example: Bypassing From EX to Src1 and From WB to Src2 lw R11,9(R1) sub R10,R2, R3 and R12,R10,R11

Computer Architecture 2011 – Pipeline 18 Forwarding Hardware Example 2: Bypassing From WB to Src2 sub R10,R2, R3 xxx and R12,R10,R11

Computer Architecture 2011 – Pipeline 19 Register File Split sub R2, R1, R3 xxx and R12,R2,R11  Register file is written during first half of the cycle  Register file is read during second half of the cycle  Register file is written before it is read  returns the correct data IMReg IM Reg IMDMReg IMDMReg DM Reg Reg DM RegReg

Computer Architecture 2011 – Pipeline 20  Load word can still causes a hazard:  an instruction tries to read a register following a load instruction that writes to the same register  A hazard detection unit is needed to “stall” the load instruction Can't always forward Reg IM Reg Reg IM IMRegDMReg IMDMReg IMDMReg DMReg Reg Reg DM CC 1CC 2CC 3CC 4CC 5CC 6 Time (clock cycles) CC 7CC 8CC 9 Program execution order lw R2, 30(R1) and R12,R2, R5 or R13,R6, R2 add R14,R2, R2 sw R15,100(R2)

Computer Architecture 2011 – Pipeline 21  De-assert the enable to ID/EXE  The dependant instruction ( and ) stays another cycle in IF/EXE  De-assert the enable to the IF/ID latch, and to the PC  Freeze pipeline stages preceding the stalled instruction  Issue a NOP into the EXE/MEM latch (instead of the stalled inst.)  Allow the stalling instruction ( lw ) to move on Stalling

Computer Architecture 2011 – Pipeline 22 Hazard Detection (Stall) Logic if (ID/EX.RegWrite and (ID/EX.opcode = lw) and ( (ID/EX.WriteReg = IF/ID.ReadReg1) or (ID/EX.WriteReg = IF/ID.ReadReg2) ) then stall

Computer Architecture 2011 – Pipeline 23 Forwarding + Hazard Detection Unit

Computer Architecture 2011 – Pipeline 24 Example: code for (assume all variables are in memory): a = b + c; d = e – f; Slow code LW Rb,b LW Rc,c Stall ADD Ra,Rb,Rc SW a,Ra LW Re,e LW Rf,f Stall SUB Rd,Re,Rf SWd,Rd Instruction order can be changed as long as the correctness is kept Software Scheduling to Avoid Load Hazards Fast code LW Rb,b LW Rc,c LW Re,e ADD Ra,Rb,Rc LW Rf,f SW a,Ra SUB Rd,Re,Rf SWd,Rd

Computer Architecture 2011 – Pipeline 25 Control Hazards

Computer Architecture 2011 – Pipeline 26 Executing a BEQ Instruction (i) BEQ R4, R5, 27 ; if (R4-R5=0) then PC  PC+4+SignExt(27)*4 ; else PC  PC+4 0 or 4 beq R4, R5, 27 8 and 12 sw 16 sub Calculate branch condition

Computer Architecture 2011 – Pipeline 27 Executing a BEQ Instruction (ii) BEQ R4, R5, 27 ; if (R4-R5=0) then PC  PC+4+SignExt(27)*4 ; else PC  PC+4 0 or 4 beq R4, R5, 27 8 and 12 sw 16 sub

Computer Architecture 2011 – Pipeline 28 Executing a BEQ Instruction (iii) BEQ R4, R5, 27 ; if (R4-R5=0) then PC  PC+4+SignExt(27)*4 ; else PC  PC+4 0 or 4 beq R4, R5, 27 8 and 12 sw 16 sub

Computer Architecture 2011 – Pipeline 29 Control Hazard on Branches And Beq sub sw The 3 instructions following the branch get into the pipe even if the branch is taken Inst from target IMRegDM Reg PC IMRegDM Reg IMRegDM Reg IMRegDM Reg IMRegDM Reg

Computer Architecture 2011 – Pipeline 30 Control Hazard: Stall  Stall pipe when branch is encountered until resolved  Stall impact: assumptions  CPI = 1  20% of instructions are branches  Stall 3 cycles on every branch  CPI new = 1 + 0.2 × 3 = 1.6 (CPI new = CPI Ideal + avg. stall cycles / instr.) We loose 60% of the performance

Computer Architecture 2011 – Pipeline 31 Control Hazard: Predict Not Taken  Execute instructions from the fall-through (not-taken) path  As if there is no branch  If the branch is not-taken (~50%), no penalty is paid  If branch actually taken  Flush the fall-through path instructions before they change the machine state (memory / registers)  Fetch the instructions from the correct (taken) path  Assuming ~50% branches not taken on average CPI new = 1 + (0.2 × 0.5) × 3 = 1.3

Computer Architecture 2011 – Pipeline 32 Dynamic Branch Prediction Look up PC of inst in fetch ?= Branch predicted taken or not taken No:Inst is not pred to be branch Yes:Inst is pred to be branch Branch PC Target PC History Predicted Target  Add a Branch Target Buffer (BTB) the predicts (at fetch)  Instruction is a branch  Branch taken / not-taken  Taken branch target

Computer Architecture 2011 – Pipeline 33 BTB  Allocation  Allocate instructions identified as branches (after decode)  Both conditional and unconditional branches are allocated  Not taken branches need not be allocated  BTB miss implicitly predicts not-taken  Prediction  BTB lookup is done parallel to IC lookup  BTB provides  Indication that the instruction is a branch (BTB hits)  Branch predicted target  Branch predicted direction  Branch predicted type (e.g., conditional, unconditional)  Update (when branch outcome is known)  Branch target  Branch history (taken / not-taken)

Computer Architecture 2011 – Pipeline 34 BTB (cont.)  Wrong prediction  Predict not-taken, actual taken  Predict taken, actual not-taken, or actual taken but wrong target  In case of wrong prediction – flush the pipeline  Reset latches (same as making all instructions to be NOPs)  Select the PC source to be from the correct path  Need get the fall-through with the branch  Start fetching instruction from correct path  Assuming P% correct prediction rate CPI new = 1 + (0.2 × (1-P)) × 3  For example, if P=0.7 CPI new = 1 + (0.2 × 0.3) × 3 = 1.18

Computer Architecture 2011 – Pipeline 35 Adding a BTB to the Pipeline

Computer Architecture 2011 – Pipeline 36 Using The BTB PC moves to next instruction Inst Mem gets PC and fetches new inst BTB gets PC and looks it up IF/ID latch loaded with new inst BTB Hit ?Br taken ? PC  PC + 4PC  perd addr IF ID IF/ID latch loaded with pred inst IF/ID latch loaded with seq. inst Branch ? yesno yes noyes EXE

Computer Architecture 2011 – Pipeline 37 Using The BTB (cont.) ID EXE MEM WB Branch ? Calculate br cond & trgt Flush pipe & update PC Corect pred ? yesno IF/ID latch loaded with correct inst continue Update BTB yes no continue

Computer Architecture 2011 – Pipeline 38 Backup

Computer Architecture 2011 – Pipeline 39  R-type (register insts)  I-type (Load, Store, Branch, inst’s w/imm data)  J-type (Jump) op: operation of the instruction rs, rt, rd: the source and destination register specifiers shamt: shift amount funct: selects the variant of the operation in the “op” field address / immediate: address offset or immediate value target address: target address of the jump instruction op target address 0 2631 6 bits 26 bits oprs rtrdshamtfunct 061116 212631 6 bits 5 bits oprs rt immediate 016 212631 6 bits 16 bits5 bits MIPS Instruction Formats

Computer Architecture 2011 – Pipeline 40  Each memory location  is 8 bit = 1 byte wide  has an address  We assume 32 byte address  An address space of 2 32 bytes  Memory stores both instructions and data  Each instruction is 32 bit wide  stored in 4 consecutive bytes in memory  Various data types have different width The Memory Space 1 byte 00000000 00000001 00000002 00000003 00000004 00000005 00000006 FFFFFFFA FFFFFFFB FFFFFFFC FFFFFFFD FFFFFFFE FFFFFFFF

Computer Architecture 2011 – Pipeline 41 Register File RegWrite Read reg 1 Read reg 2 Write reg Write data Read data 1 Read data 2 Register File 32 5 5 5  The Register File holds 32 registers  Each register is 32 bit wide  The RF supports parallel  reading any two registers and  writing any register  Inputs  Read reg 1/2: #register whose value will be output on Read data 1/2  RegWrite: write enable  Write reg (relevant when RegWrite=1)  #register to which the value in Write data is written to  Write data (relevant when RegWrite=1)  data written to Write reg  Outputs  Read data 1/2: data read from Read reg 1/2

Computer Architecture 2011 – Pipeline 42 Memory Components  Inputs  Address: address of the memory location we wish to access  Read: read data from location  Write: write data into location  Write data (relevant when Write=1) data to be written into specified location  Outputs  Read data (relevant when Read=1) data read from specified location Write Address Write Data Read Data Memory Read 32 Cache  Memory components are slow relative to the CPU  A cache is a fast memory which contains only small part of the memory  Instruction cache stores parts of the memory space which hold code  Data Cache stores parts of the memory space which hold data

Computer Architecture 2011 – Pipeline 43 The Program Counter (PC)  Holds the address (in memory) of the next instruction to be executed  After each instruction, advanced to point to the next instruction  If the current instruction is not a taken branch, the next instruction resides right after the current instruction PC  PC + 4  If the current instruction is a taken branch, the next instruction resides at the branch target PC  target (absolute jump) PC  PC + 4 + offset×4 (relative jump)

Computer Architecture 2011 – Pipeline 44  Fetch  Fetch instruction pointed by PC from I-Cache  Decode  Decode instruction (generate control signals)  Fetch operands from register file  Execute  For a memory access: calculate effective address  For an ALU operation: execute operation in ALU  For a branch: calculate condition and target  Memory Access  For load: read data from memory  For store: write data into memory  Write Back  Write result back to register file  update program counter Instruction Execution Stages Instruction Fetch Instruction Decode Execute Memory Result Store

Computer Architecture 2011 – Pipeline 45 The MIPS CPU Instruction Decode / register fetch Instruction fetch Execute / address calculation Memory access Write back Control

Computer Architecture 2011 – Pipeline 46 Executing an Add Instruction R3 R5 3 5 R3 + R5 + [PC]+4 2 oprsrtrdshamtfunct 061116212631 35200 = AddALU Add R2, R3, R5 ; R2  R3+R5 ADD

Computer Architecture 2011 – Pipeline 47 Executing a Load Instruction oprsrtimmediate 016212631 2130LW LW R1, (30)R2 ; R1  Mem[R2+30]

Computer Architecture 2011 – Pipeline 48 Executing a Store Instruction oprsrtimmediate 016212631 2130SW SW R1, (30)R2 ; Mem[R2+30]  R1

Computer Architecture 2011 – Pipeline 49 Executing a BEQ Instruction BEQ R4, R5, 27 ; if (R4-R5=0) then PC  PC+4+SignExt(27)*4 ; else PC  PC+4 oprsrtimmediate 016212631 4527BEQ

Computer Architecture 2011 – Pipeline 50 Control Signals

Computer Architecture 2011 – Pipeline 51 Pipelined CPU: Load (cycle 1 – Fetch) lw oprsrtimmediate 016212631 2130LW LW R1, (30)R2 ; R1  Mem[R2+30] PC+4

Computer Architecture 2011 – Pipeline 52 Pipelined CPU: Load (cycle 2 – Dec) oprsrtimmediate 016212631 2130LW LW R1, (30)R2 ; R1  Mem[R2+30] PC+4 R2 30

Computer Architecture 2011 – Pipeline 53 Pipelined CPU: Load (cycle 3 – Exe) oprsrtimmediate 016212631 2130LW LW R1, (30)R2 ; R1  Mem[R2+30] R2+30

Computer Architecture 2011 – Pipeline 54 Pipelined CPU: Load (cycle 4 – Mem) oprsrtimmediate 016212631 2130LW LW R1, (30)R2 ; R1  Mem[R2+30] D

Computer Architecture 2011 – Pipeline 55 Pipelined CPU: Load (cycle 5 – WB) oprsrtimmediate 016212631 2130LW LW R1, (30)R2 ; R1  Mem[R2+30] D

Computer Architecture 2011 – Pipeline 56 Datapath with Control

Computer Architecture 2011 – Pipeline 57 Multi-Cycle Control  Pass control signals along just like the data

Computer Architecture 2011 – Pipeline 58 Five Execution Steps  Instruction Fetch  Use PC to get instruction and put it in the Instruction Register.  Increment the PC by 4 and put the result back in the PC. IR = Memory[PC]; PC = PC + 4;  Instruction Decode and Register Fetch  Read registers rs and rt  Compute the branch address A = Reg[IR[25-21]]; B = Reg[IR[20-16]]; ALUOut = PC + (sign-extend(IR[15-0]) << 2);  We aren't setting any control lines based on the instruction type (we are busy "decoding" it in our control logic)

Computer Architecture 2011 – Pipeline 59 Five Execution Steps (cont.) Execution ALU is performing one of three functions, based on instruction type:  Memory Reference: effective address calculation. ALUOut = A + sign-extend(IR[15-0]);  R-type: ALUOut = A op B;  Branch: if (A==B) PC = ALUOut; Memory Access or R-type instruction completion Write-back step

Computer Architecture 2011 – Pipeline 60 The Store Instruction  swrt, rs, imm16  mem[PC]Fetch the instruction from memory  Addr <- R[rs] + SignExt(imm16) Calculate the memory address  Mem[Addr] <- R[rt]Store the register into memory  PC <- PC + 4Calculate the next instruction’s address oprsrtimmediate 016212631 6 bits16 bits5 bits Mem[Rs + SignExt[imm16]] <- Rt Example: sw rt, rs, imm16

Computer Architecture 2011 – Pipeline 61 RAW Hazard: SW Solution IMReg CC 1CC 2CC 3CC 4CC 5CC 6 Time (clock cycles) CC 7CC 8CC 9 1010/–20 Value of R2 DM Reg IMDMReg Reg IMReg IM Reg DMReg IMDMReg Reg Reg DM sub R2, R1, R3 NOP and R12,R2, R5 or R13,R6, R2 add R14,R2, R2 sw R15,100(R2) Program execution order 10 -20 Have compiler avoid hazards by adding NOP instructions Problem: this really slows us down!

Computer Architecture 2011 – Pipeline 62 Delayed Branch  Define branch to take place AFTER n following instruction  HW executes n instructions following the branch regardless of branch is taken or not  SW puts in the n slots following the branch instructions that need to be executed regardless of branch resolution  Instructions that are before the branch instruction, or  Instructions from the converged path after the branch  If cannot find independent instructions, put NOP Original Code r3 = 23 R4 = R3+R5 If (r1==r2) goto x R1 = R4 + R5 X: R7 = R1 New Code If (r1==r2) goto x r3 = 23 R4 = R3 +R5 NOP R1 = R4 + R5 X: R7 = R1 

Computer Architecture 2011 – Pipeline 63 Delayed Branch Performance  Filling 1 delay slot is easy, 2 is hard, 3 is harder  Assuming we can effectively fill d% of the delayed slots CPI new = 1 + 0.2 × (3 × (1-d))  For example, for d=0.5, we get CPI new = 1.3  Mixing architecture with micro-arch  New generations requires more delay slots  Cause computability issues between generations

Computer Architecture 2011 – Pipeline 1 Computer Architecture MIPS Pipeline Lihu Rappoport and Adi Yoaz Some of the slides were taken from: (1) Avi Mendelson.

Similar presentations

Presentation on theme: "Computer Architecture 2011 – Pipeline 1 Computer Architecture MIPS Pipeline Lihu Rappoport and Adi Yoaz Some of the slides were taken from: (1) Avi Mendelson."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Computer Architecture 2011 – Pipeline 1 Computer Architecture MIPS Pipeline Lihu Rappoport and Adi Yoaz Some of the slides were taken from: (1) Avi Mendelson.

Similar presentations

Presentation on theme: "Computer Architecture 2011 – Pipeline 1 Computer Architecture MIPS Pipeline Lihu Rappoport and Adi Yoaz Some of the slides were taken from: (1) Avi Mendelson."— Presentation transcript:

Similar presentations

About project

Feedback