1
Appendix A Pipelining: Basic and Intermediate Concepts
吳俊興
Department of Computer Science and Information Engineering, National University of Kaohsiung
October 2004
EEF011 Computer Architecture
2
Outline
– Basic concept of pipelining
– The basic pipeline for MIPS
– The major hurdle of pipelining: pipeline hazards
3
What Is Pipelining? Laundry Example
– Ann, Betty, Cathy, and Dave each have one load of clothes to wash, dry, and fold
– Washer takes 30 minutes
– Dryer takes 40 minutes
– “Folder” takes 20 minutes
4
What Is Pipelining?
– Sequential laundry takes 6 hours for 4 loads
[Figure: task order (loads A to D) versus time, 6 PM to midnight; each load takes 30 + 40 + 20 minutes, one load after another]
– Want to reduce the time? Pipelining!
5
What Is Pipelining?
– Start work ASAP
– Pipelined laundry takes 3.5 hours for 4 loads
[Figure: task order (loads A to D) versus time, starting at 6 PM; the 30-, 40-, and 20-minute stages of different loads overlap]
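The 6-hour and 3.5-hour figures can be checked with a small sketch. It assumes each machine handles one load at a time and that a load moves to the next machine as soon as both the load and the machine are ready; the function names are illustrative and not from the slides.

```python
def sequential_time(stage_minutes, loads):
    """Total minutes when each load finishes all stages before the next starts."""
    return loads * sum(stage_minutes)


def pipelined_time(stage_minutes, loads):
    """Total minutes when loads overlap, one load per machine at a time."""
    stage_free = [0] * len(stage_minutes)  # time at which each machine is next free
    finish = 0
    for _ in range(loads):
        t = 0  # time at which this load is ready for its next stage
        for s, duration in enumerate(stage_minutes):
            start = max(t, stage_free[s])  # wait for the load and for the machine
            t = start + duration
            stage_free[s] = t
        finish = t
    return finish


stages = [30, 40, 20]  # washer, dryer, folder (minutes, from the slide)
print(sequential_time(stages, 4) / 60, "hours sequential")
print(pipelined_time(stages, 4) / 60, "hours pipelined")
```

The pipelined total is governed by the 40-minute dryer, the slowest stage: 30 + 4 x 40 + 20 = 210 minutes.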
6
What Is Pipelining?
– Pipelining is an implementation technique whereby multiple instructions are overlapped in execution
– It takes advantage of parallelism that exists among instructions => instruction-level parallelism
– It is the key implementation technique used to make fast CPUs
– Pipelining doesn’t help the latency of a single task; it helps the throughput of the entire workload
– Pipeline rate is limited by the slowest pipeline stage
– Multiple tasks operate simultaneously
– Potential speedup = number of pipe stages
  – Unbalanced lengths of pipe stages reduce speedup
7
MIPS Without Pipelining
– The execution of instructions is controlled by the CPU clock, with one specific function performed in each clock cycle.
– Every MIPS instruction takes at most 5 clock cycles, one for each of five different stages.
– Several temporary registers are introduced to implement the 5-stage structure.
8
MIPS Functions
Only load/store, BEQZ, and integer ALU instructions are considered.
Instruction Fetch (IF): Send out the PC and fetch the instruction from memory into the instruction register (IR); increment the PC by 4 to address the next sequential instruction and store the result in the next program counter register (NPC). IR holds the instruction that will be used in the next stage. NPC holds the value of the next PC.
Passed to next stage:
  IR <- Mem[PC]
  NPC <- PC + 4
9
MIPS Functions
Instruction Decode/Register Fetch (ID): Decode the instruction and access the register file to read the registers. The outputs of the general-purpose registers are read into two temporary registers (A and B) for use in later clock cycles. The lower 16 bits of the IR are sign-extended into another temporary register, Imm.
Passed to next stage:
  A <- Regs[rs]
  B <- Regs[rt]
  Imm <- ((IR16)^48 ## IR16..31)   (sign bit IR16 replicated 48 times, concatenated with bits 16..31 of IR)
10
MIPS Functions
Execution/Effective Address Calculation (EX): The ALU performs an operation (for an ALU instruction) or an address calculation (for a load/store or a branch). For an ALU instruction, the operation itself is carried out. For an address calculation, the address is computed and held for the next cycle.
Passed to next stage:
  Load/store: ALUOutput <- A + Imm
  Register-register ALU: ALUOutput <- A func B
  Register-immediate ALU: ALUOutput <- A op Imm
  Branch: ALUOutput <- NPC + (Imm << 2); Cond <- (A == 0)
11
MIPS Functions
Memory Access/Branch Completion (MEM): An ALU instruction does nothing in this stage. A load or store instruction accesses memory. A branch instruction updates the PC if its condition holds.
Passed to next stage:
  Load: LMD <- Mem[ALUOutput]
  Store: Mem[ALUOutput] <- B
  Branch: if (Cond) PC <- ALUOutput
12
MIPS Functions
Write-back (WB): Update the register file with either the ALU result or the loaded data.
Passed to next stage:
  Register-register ALU: Regs[rd] <- ALUOutput
  Register-immediate ALU: Regs[rt] <- ALUOutput
  Load: Regs[rt] <- LMD
13
The Classic Five-Stage Pipeline for MIPS
We can pipeline the execution with almost no changes by simply starting a new instruction on each clock cycle. Each clock cycle becomes a pipe stage, a cycle in the pipeline, which results in the typical execution pattern shown below. Although each instruction takes 5 clock cycles to complete, the hardware initiates a new instruction during each clock cycle and executes some part of five different instructions already in the pipeline. It may be hard to believe that pipelining is as simple as this.

                    Clock number
Instruction number  1    2    3    4    5    6    7    8    9
Instruction i       IF   ID   EX   MEM  WB
Instruction i+1          IF   ID   EX   MEM  WB
Instruction i+2               IF   ID   EX   MEM  WB
Instruction i+3                    IF   ID   EX   MEM  WB
Instruction i+4                         IF   ID   EX   MEM  WB
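The staggered pattern in the table can be generated mechanically: in an ideal pipeline with no stalls, instruction i+k occupies stage s during clock cycle k + s + 1. The sketch below assumes exactly that ideal case; pipeline_chart and format_chart are illustrative names, not part of the slides.

```python
STAGES = ("IF", "ID", "EX", "MEM", "WB")


def pipeline_chart(num_instructions, stages=STAGES):
    """Return one row per instruction; row[c] is the stage occupied in cycle c+1."""
    total_cycles = num_instructions + len(stages) - 1
    rows = []
    for i in range(num_instructions):
        row = [""] * total_cycles
        for s, name in enumerate(stages):
            row[i + s] = name  # instruction i starts one cycle after instruction i-1
        rows.append(row)
    return rows


def format_chart(rows):
    """Render the rows as fixed-width text with a clock-number header."""
    header = "".join(f"{c:<5}" for c in range(1, len(rows[0]) + 1))
    lines = [" " * 6 + header]
    for i, row in enumerate(rows):
        label = "i" if i == 0 else f"i+{i}"
        lines.append(f"{label:<6}" + "".join(f"{cell:<5}" for cell in row))
    return "\n".join(lines)


print(format_chart(pipeline_chart(5)))
```

Five instructions through five stages finish in 5 + 5 - 1 = 9 cycles, matching the nine clock columns of the table.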
14
Figure A.2: The pipeline can be thought of as a series of data paths shifted in time.
15
Simple MIPS Pipeline
The MIPS pipeline data path must deal with the problems that pipelining introduces in a real implementation.
– Instructions in different stages of the pipeline must not attempt to use the same hardware resource in the same clock cycle, for example by performing different operations with the same functional unit (such as the ALU).
– Instruction and data memories are separated into different caches (IM/DM).
– The register file is used in two stages: for reading in ID and for writing in WB. To handle a read and a write to the same register in the same clock cycle, the register write is performed in the first half of the clock cycle and the read in the second half.
16
Pipeline Implementation for MIPS
To ensure that instructions in different stages of the pipeline do not interfere with each other, the data path is pipelined by adding a set of registers, one between each pair of pipe stages.
– The registers convey values and control information from one stage to the next.
– Most of the data paths flow from left to right, that is, from earlier in time to later.
– The paths flowing from right to left (which carry the register write-back information and PC information on a branch) introduce complications into the pipeline.
17
Events on Pipe Stages of the MIPS Pipeline (Figure A.19)

Stage IF (any instruction):
  IF/ID.IR <- Mem[PC];
  IF/ID.NPC, PC <- (if ((EX/MEM.opcode == branch) & EX/MEM.cond) {EX/MEM.ALUOutput} else {PC + 4});

Stage ID (any instruction):
  ID/EX.A <- Regs[IF/ID.IR[rs]];
  ID/EX.B <- Regs[IF/ID.IR[rt]];
  ID/EX.NPC <- IF/ID.NPC;
  ID/EX.IR <- IF/ID.IR;
  ID/EX.Imm <- sign-extend(IF/ID.IR[immediate field]);

Stage EX:
  ALU instruction:
    EX/MEM.IR <- ID/EX.IR;
    EX/MEM.ALUOutput <- ID/EX.A func ID/EX.B; or
    EX/MEM.ALUOutput <- ID/EX.A op ID/EX.Imm;
  Load or store:
    EX/MEM.IR <- ID/EX.IR;
    EX/MEM.ALUOutput <- ID/EX.A + ID/EX.Imm;
    EX/MEM.B <- ID/EX.B;
  Branch:
    EX/MEM.ALUOutput <- ID/EX.NPC + (ID/EX.Imm << 2);
    EX/MEM.cond <- (ID/EX.A == 0);

Stage MEM:
  ALU instruction:
    MEM/WB.IR <- EX/MEM.IR;
    MEM/WB.ALUOutput <- EX/MEM.ALUOutput;
  Load or store:
    MEM/WB.IR <- EX/MEM.IR;
    MEM/WB.LMD <- Mem[EX/MEM.ALUOutput]; or
    Mem[EX/MEM.ALUOutput] <- EX/MEM.B;

Stage WB:
  ALU instruction:
    Regs[MEM/WB.IR[rd]] <- MEM/WB.ALUOutput; or
    Regs[MEM/WB.IR[rt]] <- MEM/WB.ALUOutput;
  Load only:
    Regs[MEM/WB.IR[rt]] <- MEM/WB.LMD;
18
Basic Performance Issues for Pipelining
Example: Assume that an unpipelined processor has a 1 ns clock cycle and that it uses 4 cycles for ALU operations and branches and 5 cycles for memory operations. Assume that the relative frequencies of these operations are 40%, 20%, and 40%, respectively. Suppose that, due to clock skew and setup, pipelining the processor adds 0.2 ns of overhead to the clock. Ignoring any latency impact, how much speedup in instruction execution time will we gain from the pipelined implementation?
Solution:
  Avg. instr. exec time unpipelined = Clock cycle time x Avg. CPI = 1 ns x (40% x 4 + 20% x 4 + 40% x 5) = 4.4 ns
In the ideal case with no stalls, the average CPI is 1 for all kinds of instructions and the clock cycle time is 1.0 ns + 0.2 ns = 1.2 ns, so
  Avg. instr. exec time pipelined = 1.2 ns x 1 = 1.2 ns
The speedup from pipelining is therefore 4.4 ns / 1.2 ns, or about 3.7 times.
What is the result if pipelining adds no overhead?
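The arithmetic of this example can be reproduced with a few lines. The sketch below uses only the numbers given in the example; average_cpi is an illustrative helper name. It also evaluates the closing question by setting the overhead to zero, under the same ideal assumption of a pipelined CPI of 1.

```python
def average_cpi(mix):
    """Weighted average CPI for a list of (frequency, cycles) pairs."""
    return sum(freq * cycles for freq, cycles in mix)


# (relative frequency, cycles): ALU operations, branches, memory operations
mix = [(0.40, 4), (0.20, 4), (0.40, 5)]
clock_ns = 1.0      # unpipelined clock cycle
overhead_ns = 0.2   # clock skew and setup added by pipelining

unpipelined_ns = clock_ns * average_cpi(mix)
pipelined_ns = (clock_ns + overhead_ns) * 1  # ideal pipelined CPI of 1
speedup = unpipelined_ns / pipelined_ns
speedup_no_overhead = unpipelined_ns / (clock_ns * 1)

print(f"unpipelined: {unpipelined_ns:.1f} ns, pipelined: {pipelined_ns:.1f} ns")
print(f"speedup with overhead: {speedup:.2f}")
print(f"speedup without overhead: {speedup_no_overhead:.2f}")
```

Without the 0.2 ns overhead the pipelined clock stays at 1 ns, so the speedup equals the unpipelined average CPI of 4.4.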
19
A.2 The Major Hurdle of Pipelining – Pipeline Hazards
Limits to pipelining: situations called hazards prevent the next instruction from executing during its designated clock cycle, and thus reduce performance from the ideal speedup. The three classes of hazards are:
– Structural hazards: arise from resource conflicts when the hardware cannot support all possible combinations of instructions simultaneously in overlapped execution, i.e., two different instructions use the same hardware in the same cycle.
– Data hazards: arise when an instruction depends on the result of a prior instruction still in the pipeline (RAW, WAR, and WAW).
– Control hazards: arise from the pipelining of branches and other instructions that change the PC.
A common solution is to stall the pipeline until the hazard is cleared, i.e., to insert one or more “bubbles” into the pipeline.
20
Performance of Pipelining with Stalls
The pipelined CPI: ignoring the cycle time overhead of pipelining, and assuming the stages are perfectly balanced (each occupies one clock cycle) and all instructions take the same number of cycles, we have speedup from pipelining:
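The formulas on the original slide did not survive in this transcript. A likely reconstruction, consistent with the speedup formula on the summary slide of this deck, is sketched below. It assumes an ideal pipelined CPI of 1 and an unpipelined CPI equal to the pipeline depth, which follows from the stated assumptions of balanced stages and equal-length instructions.

```latex
\begin{align}
\text{CPI}_{\text{pipelined}} &= \text{Ideal CPI} + \text{Pipeline stall CPI}
  = 1 + \text{Pipeline stall CPI} \\
\text{Speedup} &= \frac{\text{CPI}_{\text{unpipelined}}}{\text{CPI}_{\text{pipelined}}}
  = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}}
\end{align}
```

Here "Pipeline stall CPI" denotes the average number of stall cycles per instruction; with no stalls, the speedup equals the pipeline depth.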
21
Structural Hazards
– Occur when two or more different instructions want to use the same hardware resource in the same cycle, e.g., MEM uses the same memory port as IF, as shown in this slide.
– Solution: stall
[Figure: instruction order (Load, Instr 1 to Instr 4) versus time (clock cycles 1 to 7); each instruction passes through Ifetch, Reg, ALU, DMem, Reg]
22
Structural Hazards
This is another way of looking at the effect of a stall.
[Figure: instruction order (Load, Instr 1, Instr 2, Stall, Instr 3) versus time (clock cycles 1 to 7); the stall appears as a row of bubbles]
23
Structural Hazards
This is another way to represent the stall.
24
Dealing With Structural Hazards
Stall
– Low cost, simple
– Increases CPI
– Use for rare cases, since stalling hurts performance
Replicate resource
– Good performance
– Increases cost (and possibly interconnect delay)
– Useful for cheap or divisible resources
E.g., the MIPS pipeline uses separate instruction and data memories
25
Data Hazards
Data hazards occur when the pipeline changes the order of read/write accesses to operands (registers) so that the order differs from the order seen by sequentially executing instructions on an unpipelined processor.
The real trouble arises when instruction A is followed by instruction B, and B manipulates (reads or writes) data before A does. This violates the order of the instructions, since the architecture implies that A completes entirely before B is executed.
26
Data Hazards
Execution order is: Instr I, then Instr J
Read After Write (RAW): Instr J tries to read an operand before Instr I writes it
  I: dadd r1,r2,r3
  J: dsub r4,r1,r3
– Caused by a “dependence” (in compiler nomenclature). This hazard results from an actual need for communication.
27
Data Hazards
Execution order is: Instr I, then Instr J
Write After Read (WAR): Instr J tries to write an operand before Instr I reads it
  I: dsub r4,r1,r3
  J: dadd r1,r2,r3
  K: mul r6,r1,r7
– Instr I gets the wrong operand
– Called an “anti-dependence” by compiler writers. This results from reuse of the name “r1”.
– Can’t happen in the MIPS 5-stage pipeline because:
  – All instructions take 5 stages, and
  – Reads are always in stage 2, and
  – Writes are always in stage 5
28
Data Hazards
Execution order is: Instr I, then Instr J
Write After Write (WAW): Instr J tries to write an operand before Instr I writes it
  I: dsub r1,r4,r3
  J: dadd r1,r2,r3
  K: mul r6,r1,r7
– Leaves the wrong result (that of Instr I, not Instr J)
– Called an “output dependence” by compiler writers. This also results from the reuse of the name “r1”.
– Can’t happen in the MIPS 5-stage pipeline because:
  – All instructions take 5 stages, and
  – Writes are always in stage 5
– WAR and WAW hazards will appear in later, more complicated pipeline implementations
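The three cases can be told apart purely from the register names of two instructions in program order. The sketch below is a simplified illustration: it assumes three-operand instructions of the form "op dest,src1,src2" as in the slide examples, and the names Instr, parse, and dependences are hypothetical. It reports dependences, that is, potential hazards; whether a real hazard occurs depends on the pipeline, as the slides note for WAR and WAW.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    op: str
    dest: str
    srcs: tuple


def parse(text):
    """Parse 'op dest,src1,src2' into an Instr (three-operand form only)."""
    op, operands = text.split(None, 1)
    regs = [r.strip() for r in operands.split(",")]
    return Instr(op, regs[0], tuple(regs[1:]))


def dependences(i, j):
    """Dependences of later instruction j on earlier instruction i."""
    found = set()
    if i.dest in j.srcs:
        found.add("RAW")  # j reads what i writes
    if j.dest in i.srcs:
        found.add("WAR")  # j writes what i reads
    if j.dest == i.dest:
        found.add("WAW")  # both write the same register
    return found


print(dependences(parse("dadd r1,r2,r3"), parse("dsub r4,r1,r3")))
print(dependences(parse("dsub r4,r1,r3"), parse("dadd r1,r2,r3")))
print(dependences(parse("dsub r1,r4,r3"), parse("dadd r1,r2,r3")))
```

The three calls use the I/J pairs from the RAW, WAR, and WAW slides, in that order.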
29
Solutions to Data Hazards
Simple solution to RAW: hardware detects the RAW hazard and stalls until the result is written into the register
+ Low cost to implement, simple
- Reduces the number of instructions executed per cycle
Minimizing RAW stalls: forwarding (also called bypassing)
– Key insight: the result is not really needed by the current instruction until after the previous instruction actually produces it.
– The ALU result from both the EX/MEM and MEM/WB pipeline registers is always fed back to the ALU inputs.
– If the forwarding hardware detects that the previous ALU operation has written the register corresponding to a source for the current ALU operation, control logic selects the forwarded result as the ALU input rather than the value read from the register file.
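The selection step described above can be sketched as a small function. This is an illustration only, not the actual MIPS control logic: the function name and the (dest_reg, value) tuples standing in for the EX/MEM and MEM/WB pipeline registers are assumptions made for the example. It also assumes that EX/MEM takes priority over MEM/WB, because it holds the more recent result.

```python
def select_alu_input(src_reg, regfile_value, ex_mem=None, mem_wb=None):
    """Pick the value for one ALU input and report where it came from.

    ex_mem and mem_wb are (dest_reg, value) pairs for the ALU results held in
    the EX/MEM and MEM/WB pipeline registers, or None if no result is pending.
    """
    if ex_mem is not None and ex_mem[0] == src_reg:
        return ex_mem[1], "EX/MEM"   # result of the immediately preceding instruction
    if mem_wb is not None and mem_wb[0] == src_reg:
        return mem_wb[1], "MEM/WB"   # result of the instruction two ahead
    return regfile_value, "register file"


# dadd r1,r2,r3 followed by dsub r4,r1,r3: r1 is still in EX/MEM when dsub is in EX.
print(select_alu_input("r1", regfile_value=0, ex_mem=("r1", 42)))
# Two instructions later (and r6,r1,r7), the dadd result has moved on to MEM/WB.
print(select_alu_input("r1", regfile_value=0, mem_wb=("r1", 42)))
# r3 was not written recently, so the register file value is used.
print(select_alu_input("r3", regfile_value=7, ex_mem=("r1", 42)))
```

The stale register-file value (0 in the first two calls) is ignored whenever a newer result for the same register is waiting in a pipeline register.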
30
Data Hazards
The use of the result of the DADD instruction in the next two instructions causes a hazard, since the register is not written until after those instructions read it.
  dadd r1,r2,r3
  dsub r4,r1,r3
  and r6,r1,r7
  or r8,r1,r9
  xor r10,r1,r11
[Figure: instruction order versus time (clock cycles CC1 to CC9); each instruction passes through IF (Ifetch), ID (Reg), EX (ALU), MEM (DMem), WB (Reg)]
31
Forwarding to Avoid Data Hazards
Forwarding makes data available at the ALU inputs for subsequent instructions, even though the generating instruction has not yet reached WB to write the memory or registers.
  dadd r1,r2,r3
  dsub r4,r1,r3
  and r6,r1,r7
  or r8,r1,r9
  xor r10,r1,r11
[Figure: instruction order versus time (clock cycles CC1 to CC9); each instruction passes through Ifetch, Reg, ALU, DMem, Reg]
32
Data Hazards Requiring Stalls
Some hazards occur even with forwarding, e.g., when the data isn’t loaded until after the MEM stage.
  LD R1,0(R2)
  DSUB R4,R1,R6
  AND R6,R1,R7
  OR R8,R1,R9
[Figure: instruction order versus time (clock cycles CC1 to CC8); each instruction passes through Ifetch, Reg, ALU, DMem, Reg]
33
Data Hazards Requiring Stalls
A stall is necessary in this case.
  LD R1,0(R2)
  DSUB R4,R1,R6
  AND R6,R1,R7
  OR R8,R1,R9
[Figure: instruction order versus time (clock cycles CC1 to CC8); a bubble delays each of the three instructions after the load]
34
Another Representation of the Stall

LD R1, 0(R2)      IF    ID    EX    MEM   WB
DSUB R4, R1, R5         IF    ID    EX    MEM   WB
AND R6, R1, R7                IF    ID    EX    MEM   WB
OR R8, R1, R9                       IF    ID    EX    MEM   WB

LD R1, 0(R2)      IF    ID    EX    MEM   WB
DSUB R4, R1, R5         IF    ID    stall EX    MEM   WB
AND R6, R1, R7                IF    stall ID    EX    MEM   WB
OR R8, R1, R9                       stall IF    ID    EX    MEM   WB

In the top table, we can see why a stall is needed: the MEM cycle of the load produces a value that is needed in the EX cycle of the DSUB, which occurs at the same time. This problem is solved by inserting a stall, as shown in the bottom table.
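The bottom table can be derived from two rules: an instruction cannot enter a stage until its predecessor has left it, and an instruction that uses a loaded register cannot enter EX until the cycle after the load's MEM stage. The sketch below encodes only those two rules. It assumes forwarding handles every other dependence, and the function name and tuple layout are illustrative rather than taken from the slides.

```python
STAGES = ("IF", "ID", "EX", "MEM", "WB")
EX, MEM = STAGES.index("EX"), STAGES.index("MEM")


def schedule(program):
    """For each instruction, list the clock cycle in which it enters each stage.

    program is a list of (opcode, dest, sources) tuples in program order.
    """
    table = []
    for k, (_, _, srcs) in enumerate(program):
        enter = []
        for s in range(len(STAGES)):
            cycle = 1 if s == 0 else enter[s - 1] + 1
            if k > 0:
                prev = table[k - 1]
                # Stage s frees up when the previous instruction moves to stage s+1.
                freed = prev[s + 1] if s + 1 < len(STAGES) else prev[s] + 1
                cycle = max(cycle, freed)
            if s == EX:
                # Load-use rule: loaded data is available only after the load's MEM.
                for (p_op, p_dest, _), p_enter in zip(program[:k], table):
                    if p_op == "LD" and p_dest in srcs:
                        cycle = max(cycle, p_enter[MEM] + 1)
            enter.append(cycle)
        table.append(enter)
    return table


program = [
    ("LD", "R1", ("R2",)),
    ("DSUB", "R4", ("R1", "R5")),
    ("AND", "R6", ("R1", "R7")),
    ("OR", "R8", ("R1", "R9")),
]
for (op, _, _), cycles in zip(program, schedule(program)):
    print(f"{op:<5}", dict(zip(STAGES, cycles)))
```

DSUB enters EX in cycle 5 instead of cycle 4, and the instructions behind it are each pushed back by one cycle, which is the pattern of the bottom table.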
35
Control Hazards
A control hazard arises when the destination of a branch must be determined and no new instructions can be fetched until that destination is known.
– If instruction i is a taken branch, then the PC is normally not changed until the end of ID
Control hazards can cause a greater performance loss than data hazards do.
36
Control Hazard on Branches: Three-Cycle Stall
  12: beq r1,r3,36
  16: and r2,r3,r5
  20: or r6,r1,r7
  24: add r8,r1,r9
  36: xor r10,r1,r11
[Figure: instruction order versus time (clock cycles CC1 to CC9); each instruction passes through Ifetch, Reg, ALU, DMem, Reg]
37
Branch Stall Impact
– If CPI = 1, 30% of instructions are branches, and each branch stalls 3 cycles => new CPI = 1.9!
– Two solutions to this dramatic increase:
  – Determine whether the branch is taken or not sooner, AND
  – Compute the target address earlier
– MIPS branch tests whether a register = 0 or ≠ 0
– MIPS solution:
  – Move the zero test to the ID stage
  – Add an adder to calculate the target address in the ID stage
  – 1 clock cycle penalty for a branch versus 3
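The CPI figure follows from adding the average stall cycles per instruction to the base CPI. The sketch below uses the numbers on this slide; the function name is illustrative, and the second call simply applies the same formula to the 1-cycle penalty of the revised MIPS pipeline.

```python
def cpi_with_branch_stalls(base_cpi, branch_fraction, penalty_cycles):
    """Effective CPI when every branch stalls for penalty_cycles cycles."""
    return base_cpi + branch_fraction * penalty_cycles


three_cycle = cpi_with_branch_stalls(1.0, 0.30, 3)
one_cycle = cpi_with_branch_stalls(1.0, 0.30, 1)
print(f"3-cycle penalty: CPI = {three_cycle:.1f}")
print(f"1-cycle penalty: CPI = {one_cycle:.1f}")
```

Moving the zero test and the target-address adder into ID cuts the stall contribution from 0.9 to 0.3 cycles per instruction.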
38
The Pipeline of 1-Cycle Stall for Branch
39
Four Solutions to Branch Hazards
#1: Stall until branch direction is clear
– Simple for both software and hardware
– Branch penalty is fixed (1-cycle penalty for revised MIPS)

Branch instr.       IF   ID   EX   MEM  WB
Branch successor         IF        ID   EX   MEM  WB
Branch successor+1                 IF   ID   EX   MEM  WB
Branch successor+2                      IF   ID   EX   MEM  WB
40
Four Solutions to Branch Hazards
#2: Predict Branch Not Taken
– Continue to fetch instructions as if the branch were a normal instruction.
– If the branch is taken, turn the fetched instruction into a no-op and restart the fetch at the target address.

Untaken branch instr.  IF   ID   EX   MEM  WB
Branch successor            IF   ID   EX   MEM  WB
Branch successor+1               IF   ID   EX   MEM  WB
Branch successor+2                    IF   ID   EX   MEM  WB
Branch successor+3                         IF   ID   EX   MEM  WB

Taken branch instr.    IF   ID   EX   MEM  WB
Branch successor            IF   idle
Branch target                    IF   ID   EX   MEM  WB
Branch successor+1                    IF   ID   EX   MEM  WB
Branch successor+2                         IF   ID   EX   MEM  WB
41
Four Solutions to Branch Hazards
#3: Predict Branch Taken
– As soon as the branch is decoded and the target address is computed, we assume the branch to be taken and begin fetching and executing at the target.
– But in MIPS the target address is not known any earlier than the branch outcome
  – MIPS still incurs a 1-cycle branch penalty
  – Useful for other machines on which the target address is known before the branch outcome
42
Four Solutions to Branch Hazards
#4: Delayed Branch
– The execution cycle with a branch delay of one is
    branch instruction
    sequential successor_1
    branch target if taken
– The sequential successor is in the branch delay slot.
– The instruction in the branch delay slot is executed whether or not the branch is taken (for zero cycle penalty)
Where to get instructions to fill the branch delay slot?
– From before the branch instruction
– From the target address: only valuable when the branch is taken
– From the fall-through path: only valuable when the branch is not taken
– Canceling or nullifying branches allow more slots to be filled (non-zero cycle penalty; its value depends on the rate of correct prediction)
  – The delay-slot instruction is turned into a no-op if incorrectly predicted
43
Four Solutions to Branch Hazards
44
Pipelining Introduction Summary
– Just overlap tasks; this is easy if the tasks are independent
– Speedup vs. pipeline depth; if the ideal CPI is 1, then:
\[ \text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Clock cycle}_{\text{unpipelined}}}{\text{Clock cycle}_{\text{pipelined}}} \]
– Hazards limit performance on computers:
  – Structural: need more hardware resources
  – Data (RAW, WAR, WAW): need forwarding, compiler scheduling
  – Control: delayed branch, prediction