Deriving the throughput equation for a pipelined machine
  - Throughput = number of instructions per unit time (seconds, cycles, etc.)
  - The unit of time is set by the unit used in the denominator: cycles give instructions/cycle, seconds give instructions/second
  - Throughput of an unpipelined machine = 1 / time per instruction
  - Time per instruction = pipeline depth × time to execute a single stage
  - The time to execute a single stage can be rewritten as:
    $$\text{Time per stage} = \frac{\text{Time per instruction on unpipelined machine}}{\text{Pipeline depth}}$$
  - Throughput of a pipelined machine = 1 / time to execute a single stage (assuming all stages take the same time):
    $$\text{Throughput} = \frac{\text{Pipeline depth}}{\text{Time per instruction on unpipelined machine}}$$
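The throughput relations above can be checked numerically. The following is a minimal sketch, assuming perfectly balanced stages and no pipeline register overhead; the function names and the 10 ns / depth 5 figures are illustrative and do not come from the slides.

```python
def unpipelined_throughput(time_per_instruction):
    """Instructions per unit time when instructions run one after another."""
    return 1.0 / time_per_instruction


def pipelined_throughput(time_per_instruction, pipeline_depth):
    """Instructions per unit time with balanced stages and no overhead (assumption)."""
    stage_time = time_per_instruction / pipeline_depth
    return 1.0 / stage_time


if __name__ == "__main__":
    # Hypothetical machine: 10 ns per instruction, split into 5 equal stages.
    print(unpipelined_throughput(10.0))      # instructions per ns, unpipelined
    print(pipelined_throughput(10.0, 5))     # instructions per ns, pipelined
```

Under these assumptions the pipelined throughput is the unpipelined throughput multiplied by the pipeline depth.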
Physics of Clock Skew
  - Caused by the clock edge reaching different parts of the chip at different times
  - Capacitance charge/discharge rates
    - All wires, leads, transistors, etc. have capacitance
    - Longer wire, larger capacitance
    - Repeaters are used to drive current and handle fan-out problems
    - For a fixed I, the rate of change of V is inversely proportional to C
    - Time to charge/discharge adds to delay
    - Dominant problem at old integration densities
    - For a fixed C, the rate of change of V is proportional to I, so driving more current shortens the delay
    - The problem with this approach is that power requirements go up, and power dissipation becomes a problem
  - Speed-of-light propagation delays
    - Dominate at current integration densities, since capacitances are now much lower
    - Clock rates are also much faster now, so even small delays consume a large part of the clock cycle
  - Current research: asynchronous chip designs
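Both delay sources above can be put into numbers. The following is a rough sketch, assuming the ideal capacitor relation I = C dV/dt and the vacuum speed of light as an upper bound on signal speed; real on-chip signals travel more slowly. The capacitance, voltage, current and clock values are hypothetical.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0  # vacuum value, an upper bound for on-chip signals


def charge_time(capacitance_farads, delta_volts, current_amps):
    """Time to swing an ideal capacitor by delta_volts at constant current (I = C dV/dt)."""
    return capacitance_farads * delta_volts / current_amps


def light_distance_per_cycle(clock_hz):
    """Distance light covers in one clock period, in metres."""
    return SPEED_OF_LIGHT_M_PER_S / clock_hz


if __name__ == "__main__":
    # Hypothetical wire: 1 pF load, 1 V swing, driven with 1 mA and then 2 mA.
    print(charge_time(1e-12, 1.0, 1e-3))   # seconds
    print(charge_time(1e-12, 1.0, 2e-3))   # doubling the current halves the delay
    # Hypothetical 1 GHz clock.
    print(light_distance_per_cycle(1e9))   # metres per cycle
```

The first function shows why driving more current shortens the delay, and the second shows why faster clocks leave less room for propagation delay.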
Return to pipelining
It's Not That Easy for Computers
  - Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle
    - Structural hazards: the hardware cannot support this combination of instructions (a single person to fold and put clothes away)
    - Data hazards: an instruction depends on the result of a prior instruction still in the pipeline (missing sock)
    - Control hazards: pipelining of branches and other instructions that change the PC
  - Common solution: stall the pipeline until the hazard is resolved, inserting one or more “bubbles” in the pipeline
$$\text{Speedup} = \frac{\text{Average instruction time unpipelined}}{\text{Average instruction time pipelined}}$$
Remember that average instruction time = CPI × clock cycle time, and that the ideal CPI for a pipelined machine is 1.
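The speedup ratio above can be evaluated once stalls are taken into account. The following is a minimal sketch, assuming the pipelined CPI is the ideal CPI of 1 plus the average stall cycles per instruction; the function name and the sample figures are illustrative.

```python
def pipeline_speedup(cpi_unpipelined, cycle_unpipelined,
                     stall_cycles_per_instr, cycle_pipelined, ideal_cpi=1.0):
    """Speedup = average instruction time unpipelined / average instruction time pipelined."""
    time_unpipelined = cpi_unpipelined * cycle_unpipelined
    cpi_pipelined = ideal_cpi + stall_cycles_per_instr  # stalls push CPI above its ideal value
    time_pipelined = cpi_pipelined * cycle_pipelined
    return time_unpipelined / time_pipelined


if __name__ == "__main__":
    # Hypothetical case: unpipelined CPI of 5, same clock cycle, 0.25 stall cycles per instruction.
    print(pipeline_speedup(5.0, 1.0, 0.25, 1.0))
    # With no stalls, the speedup equals the unpipelined CPI.
    print(pipeline_speedup(5.0, 1.0, 0.0, 1.0))
```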
Structural Hazards
  - Overlapped execution of instructions:
    - Pipelining of functional units
    - Duplication of resources
  - Structural hazard: when the pipeline cannot accommodate some combination of instructions
  - Consequences
    - Stall
    - Increase of CPI from its ideal value (1)
Pipelining of Functional Units
  [Figure: an FP multiply unit with stages M1 to M5 alongside the IF, ID, EX, MEM, WB pipeline, in three variants: fully pipelined, partially pipelined, and not pipelined]
To Pipeline or Not to Pipeline
  - Elements to consider
    - Effects of pipelining and duplicating units
      - Increased costs
      - Higher latency (pipeline register overhead)
    - Frequency of structural hazard
  - Example: unpipelined FP multiply unit in DLX
    - Latency: 5 cycles
    - Impact on mdljdp2 program?
      - Frequency of FP instructions: 14%
      - Depends on the distribution of FP multiplies
        - Best case: uniform distribution
        - Worst case: clustered, back-to-back multiplies
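The best and worst cases in the mdljdp2 example can be bounded with a short calculation. The following is a rough sketch, assuming the 14% figure is the fraction of instructions that use the unpipelined multiplier, that evenly spaced multiplies stall only when they arrive faster than the unit can finish, and that every clustered multiply waits the full latency minus one cycle. These modelling choices are assumptions, not taken from the slides.

```python
def fp_multiply_cpi_bounds(multiply_fraction, latency_cycles, ideal_cpi=1.0):
    """Return (best_case_cpi, worst_case_cpi) for an unpipelined multiplier."""
    # Best case: multiplies evenly spaced, one every 1/fraction instructions.
    spacing = 1.0 / multiply_fraction
    best_stall = max(0.0, latency_cycles - spacing)
    best_cpi = ideal_cpi + multiply_fraction * best_stall
    # Worst case: multiplies back to back, each waiting for the previous one
    # (approximation: ignores the first multiply of each cluster).
    worst_cpi = ideal_cpi + multiply_fraction * (latency_cycles - 1)
    return best_cpi, worst_cpi


if __name__ == "__main__":
    best, worst = fp_multiply_cpi_bounds(0.14, 5)
    print(best, worst)
```

With a 5-cycle unit, evenly spaced multiplies cause no stalls as long as they make up no more than 20% of the instructions, which is why the 14% case has a best-case CPI of 1.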
Resource Duplication
  [Figure: pipeline diagram of Load, Inst 1, Inst 2, Stall, Inst 3 passing through the M, Reg, ALU, M, Reg stages]
Three Generic Data Hazards
InstrI followed by InstrJ
  - Read After Write (RAW): InstrJ tries to read an operand before InstrI writes it
  - Write After Read (WAR): InstrJ tries to write an operand before InstrI reads it
    - InstrI gets the wrong operand
    - Can't happen in the MIPS 5-stage pipeline because:
      - All instructions take 5 stages, and
      - Reads are always in stage 2, and
      - Writes are always in stage 5
  - Write After Write (WAW): InstrJ tries to write an operand before InstrI writes it
    - Leaves the wrong result (InstrI's, not InstrJ's)
    - Can't happen in the DLX 5-stage pipeline because:
      - All instructions take 5 stages, and
      - Writes are always in stage 5
  - Will see WAR and WAW in later, more complicated pipes
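The three definitions above reduce to set intersections on the registers each instruction reads and writes. The following is a minimal sketch; the dictionary layout and function name are illustrative. It reports potential hazards only, since whether one actually occurs depends on the pipeline timing.

```python
def classify_hazards(instr_i, instr_j):
    """Potential data hazards when instr_j follows instr_i.

    Each instruction is a dict with 'reads' and 'writes' sets of register names.
    """
    hazards = set()
    if instr_i["writes"] & instr_j["reads"]:
        hazards.add("RAW")  # J reads what I writes
    if instr_i["reads"] & instr_j["writes"]:
        hazards.add("WAR")  # J writes what I reads
    if instr_i["writes"] & instr_j["writes"]:
        hazards.add("WAW")  # J writes what I writes
    return hazards


if __name__ == "__main__":
    add = {"reads": {"R2", "R3"}, "writes": {"R1"}}   # ADD R1, R2, R3
    sub = {"reads": {"R1", "R5"}, "writes": {"R4"}}   # SUB R4, R1, R5
    print(classify_hazards(add, sub))
```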
Examples in more complicated pipelines

WAW (write after write)
  LW  R1, 0(R2)     IF  ID  EX  M1  M2  WB
  ADD R1, R2, R3        IF  ID  EX  WB

WAR (write after read)
  SW  0(R1), R2     IF  ID  EX  M1  M2  WB
  ADD R2, R3, R4        IF  ID  EX  WB

  This is a problem if register writes happen during the first half of the cycle and reads during the second half.
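The WAW example can be checked by computing when each instruction reaches WB. The following is a small sketch, assuming the ADD issues one cycle after the LW and that every stage takes one cycle; the helper names and the tuple layout are illustrative.

```python
LW_STAGES = ["IF", "ID", "EX", "M1", "M2", "WB"]   # longer pipeline with two memory stages
ADD_STAGES = ["IF", "ID", "EX", "WB"]              # ALU instruction skips the memory stages


def writeback_cycle(issue_cycle, stages):
    """Cycle in which an instruction fetched in issue_cycle reaches WB."""
    return issue_cycle + stages.index("WB")


def has_waw_violation(first, second):
    """first and second are (issue_cycle, stages, dest); second follows first in program order."""
    same_dest = first[2] == second[2]
    second_writes_first = writeback_cycle(second[0], second[1]) <= writeback_cycle(first[0], first[1])
    return same_dest and second_writes_first


if __name__ == "__main__":
    lw = (1, LW_STAGES, "R1")     # LW  R1, 0(R2)
    add = (2, ADD_STAGES, "R1")   # ADD R1, R2, R3
    print(writeback_cycle(*lw[:2]), writeback_cycle(*add[:2]))
    print(has_waw_violation(lw, add))
```

In this model the ADD writes R1 one cycle before the LW does, so R1 ends up holding the LW result instead of the ADD result.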
Data Hazards
  ADD R1, R2, R3
  SUB R4, R1, R5
  AND R6, R1, R7
  OR  R8, R1, R9
  XOR R10, R1, R11
  [Figure: pipeline diagram with each instruction passing through IM, Reg, ALU, DM, Reg]
Forwarding
  ADD R1, R2, R3
  SUB R4, R1, R5
  AND R6, R1, R7
  OR  R8, R1, R9
  XOR R10, R1, R11
  [Figure: pipeline diagram with each instruction passing through IM, Reg, ALU, DM, Reg]
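Which of the instructions after the ADD depend on forwarding can be worked out from the stage numbers. The following is a minimal sketch, assuming a 5-stage pipeline that reads registers in stage 2 and writes them in stage 5, with one instruction issued per cycle. The split-cycle register file (write in the first half of the cycle, read in the second) is an assumption controlled by a flag.

```python
ID_STAGE = 2   # registers are read in stage 2
WB_STAGE = 5   # registers are written in stage 5


def needs_forwarding(distance, split_cycle_regfile=True):
    """True if a consumer `distance` instructions after the producer would read a stale register."""
    producer_write_cycle = WB_STAGE                         # producer fetched in cycle 1
    consumer_read_cycle = 1 + distance + (ID_STAGE - 1)     # consumer fetched in cycle 1 + distance
    if split_cycle_regfile:
        # A write in the first half of the cycle is visible to a read in the second half.
        return consumer_read_cycle < producer_write_cycle
    return consumer_read_cycle <= producer_write_cycle


if __name__ == "__main__":
    consumers = ["SUB", "AND", "OR", "XOR"]   # instructions that read R1 after ADD R1, R2, R3
    print({name: needs_forwarding(i + 1) for i, name in enumerate(consumers)})
```

Under these assumptions only the two instructions immediately after the ADD need the forwarded value; the OR relies on the split-cycle register file and the XOR reads R1 normally.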
Stalls in spite of forwarding
  LW  R1, 0(R2)
  SUB R4, R1, R5
  AND R6, R1, R7
  OR  R8, R1, R9
  [Figure: pipeline diagram with each instruction passing through IM, Reg, ALU, DM, Reg]
Pipeline Interlocks
  [Figure: pipeline diagram (IM, Reg, ALU, DM, Reg) for the sequence below]

  LW  R1, 0(R2)   IF    ID    EX    MEM   WB
  SUB R4, R1, R5        IF    ID    stall EX    MEM   WB
  AND R6, R1, R7              IF    stall ID    EX    MEM   WB
  OR  R8, R1, R9                    stall IF    ID    EX    MEM   WB
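The interlock timing above can be reproduced with a small simulation. The following is a sketch under simplifying assumptions: a single in-order 5-stage pipeline with full forwarding, where the only stall is one cycle when an instruction uses the result of the load immediately before it. The tuple layout and function names are illustrative.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def timing_with_interlock(program):
    """program: list of (text, opcode, dest, sources). Returns [(text, {cycle: stage})]."""
    bubbles = set()   # cycles in which the front of the pipeline is frozen
    rows = []
    prev = None       # (opcode, dest) of the previous instruction
    prev_if = 0       # cycle in which the previous instruction was fetched
    for text, opcode, dest, sources in program:
        load_use = prev is not None and prev[0] == "LW" and prev[1] in sources
        timeline = {}
        cycle = prev_if + 1
        for stage in STAGES:
            while cycle in bubbles:          # wait out a stall caused by an older instruction
                timeline[cycle] = "stall"
                cycle += 1
            if stage == "EX" and load_use:   # loaded value is not ready: insert one bubble
                bubbles.add(cycle)
                timeline[cycle] = "stall"
                cycle += 1
            timeline[cycle] = stage
            if stage == "IF":
                prev_if = cycle
            cycle += 1
        rows.append((text, timeline))
        prev = (opcode, dest)
    return rows


def render(rows):
    """Format the timelines as an aligned text table, one column per cycle."""
    last = max(max(timeline) for _, timeline in rows)
    lines = []
    for text, timeline in rows:
        cells = [timeline.get(c, "").ljust(6) for c in range(1, last + 1)]
        lines.append((text.ljust(16) + "".join(cells)).rstrip())
    return "\n".join(lines)


if __name__ == "__main__":
    program = [
        ("LW  R1, 0(R2)", "LW", "R1", ("R2",)),
        ("SUB R4, R1, R5", "SUB", "R4", ("R1", "R5")),
        ("AND R6, R1, R7", "AND", "R6", ("R1", "R7")),
        ("OR  R8, R1, R9", "OR", "R8", ("R1", "R9")),
    ]
    print(render(timing_with_interlock(program)))
```

The bubble is recorded once, in the cycle where the dependent instruction would have entered EX, and every younger instruction waits out that same cycle.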
Software Scheduling to Avoid Load Hazards
Try producing fast code for
  a = b + c;
  d = e - f;
assuming a, b, c, d, e, and f are in memory.

Slow code:
  LW  Rb,b
  LW  Rc,c
  ADD Ra,Rb,Rc
  SW  a,Ra
  LW  Re,e
  LW  Rf,f
  SUB Rd,Re,Rf
  SW  d,Rd

Fast code:
  LW  Rb,b
  LW  Rc,c
  LW  Re,e
  ADD Ra,Rb,Rc
  LW  Rf,f
  SW  a,Ra
  SUB Rd,Re,Rf
  SW  d,Rd
Effect of Software Scheduling
Slow code:
  LW  Rb,b        IF ID EX MEM WB
  LW  Rc,c        IF ID EX MEM WB
  ADD Ra,Rb,Rc    IF ID EX MEM WB
  SW  a,Ra        IF ID EX MEM WB
  LW  Re,e        IF ID EX MEM WB
  LW  Rf,f        IF ID EX MEM WB
  SUB Rd,Re,Rf    IF ID EX MEM WB
  SW  d,Rd        IF ID EX MEM WB

Fast code:
  LW  Rb,b        IF ID EX MEM WB
  LW  Rc,c        IF ID EX MEM WB
  LW  Re,e        IF ID EX MEM WB
  ADD Ra,Rb,Rc    IF ID EX MEM WB
  LW  Rf,f        IF ID EX MEM WB
  SW  a,Ra        IF ID EX MEM WB
  SUB Rd,Re,Rf    IF ID EX MEM WB
  SW  d,Rd        IF ID EX MEM WB
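The difference between the two sequences can be counted directly. The following is a rough sketch, assuming a 5-stage pipeline with full forwarding in which the only stall is one cycle when an instruction uses the register loaded by the instruction immediately before it. The tuple encoding of the instructions is illustrative.

```python
PIPELINE_DEPTH = 5


def count_load_use_stalls(program):
    """program: list of (opcode, dest, sources). One stall per load followed by a user of its result."""
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        prev_opcode, prev_dest, _ = prev
        _, _, curr_sources = curr
        if prev_opcode == "LW" and prev_dest in curr_sources:
            stalls += 1
    return stalls


def total_cycles(program):
    """Cycles to drain the pipeline: fill time plus one cycle per instruction plus stalls."""
    return (PIPELINE_DEPTH - 1) + len(program) + count_load_use_stalls(program)


SLOW = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("ADD", "Ra", ("Rb", "Rc")), ("SW", None, ("Ra",)),
    ("LW", "Re", ()), ("LW", "Rf", ()), ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",)),
]
FAST = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("LW", "Re", ()), ("ADD", "Ra", ("Rb", "Rc")),
    ("LW", "Rf", ()), ("SW", None, ("Ra",)), ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",)),
]

if __name__ == "__main__":
    print(count_load_use_stalls(SLOW), total_cycles(SLOW))
    print(count_load_use_stalls(FAST), total_cycles(FAST))
```

In this model the slow code stalls twice (ADD after LW Rc, SUB after LW Rf) and the fast code never stalls, because an independent instruction always sits between each load and its first use.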
Compiler Scheduling
  - Eliminates load interlocks
  - Demands more registers
  - Simple scheduling
    - Basic block (sequential segment of code)
    - Good for simple pipelines
  - Percentage of loads that result in a stall
    - FP: 13%
    - Int: 25%
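The stall percentages above translate into a CPI penalty once the share of loads in the instruction mix is known. The following is a minimal sketch, assuming a one-cycle load stall and an ideal CPI of 1. The 20% load fraction used in the example is hypothetical and is not given in the slides.

```python
def cpi_with_load_stalls(load_fraction, stalling_load_fraction,
                         stall_cycles=1, ideal_cpi=1.0):
    """CPI when only load interlocks add stall cycles."""
    return ideal_cpi + load_fraction * stalling_load_fraction * stall_cycles


if __name__ == "__main__":
    hypothetical_load_fraction = 0.20   # assumption: 20% of instructions are loads
    print(cpi_with_load_stalls(hypothetical_load_fraction, 0.13))   # FP programs
    print(cpi_with_load_stalls(hypothetical_load_fraction, 0.25))   # integer programs
```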