CS 352 : Computer Organization and Design University of Wisconsin-Eau Claire Dan Ernst Pipelining Basics

The 5 Cycles in MIPS
MIPS steps:
1. Fetch the instruction from RAM
2. Decode and read the regs
3. Execute the operation or calculate the effective address
4. Read/write RAM; store the regs
5. Save a RAM read into regs
Pipelining principle: multiple instructions are overlapped in execution
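
The overlap can be drawn as a staggered diagram. The sketch below is only an illustration: it assumes the one-letter stage names F, D, E, M, W used in the later slides, no stalls, and hypothetical helper names (`pipeline_diagram`, `total_cycles`).

```python
STAGES = "FDEMW"  # Fetch, Decode, Execute, Memory, Write back (one cycle each)

def pipeline_diagram(names):
    """Return one text row per instruction; each row starts one cycle later."""
    width = max(len(name) for name in names)
    rows = []
    for i, name in enumerate(names):
        # Pad the name, then shift the stage letters right by i cycles.
        rows.append(f"{name:<{width}} " + " " * i + STAGES)
    return rows

def total_cycles(n_instructions):
    """Cycles to finish n instructions when no stalls occur."""
    return n_instructions + len(STAGES) - 1

for row in pipeline_diagram(["lw", "add", "nand"]):
    print(row)
print(total_cycles(3), "cycles in total")
```

Each column of the printed diagram is one clock cycle, so reading down a column shows which instructions are in flight at the same time.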

Basic Pipelining History
CDC 6600
– One of the first pipelined processors
– Dates back to 1964
– Designed by Seymour Cray
Most modern CPUs, even in PCs and embedded chips, now include pipelining.

Performance Possibilities
Consider 1000 instructions to be pipelined
Single cycle machine / non-pipelined
– CCT = 8 ns due to longest datapath
– CPI = 1 but 8 ns per instruction
– 8 ns * 1000 = 8000 ns
Multi-cycle machine / pipelined
– CCT = 2 ns due to longest stage in datapath
– 5 stages, so 10 ns per instruction
– 8 ns (to “fill” the pipeline) + 2 ns * 1000 = 2008 ns
Speedup = 8000 / 2008 = 3.98 ≈ 4
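
The arithmetic above can be checked with a few lines of Python. This is a sketch using the slide's numbers; the function names are made up for illustration, and the fill time is taken to be (stages - 1) cycles.

```python
def single_cycle_time_ns(n_instructions, cycle_ns=8):
    """Non-pipelined: every instruction takes one long 8 ns cycle."""
    return n_instructions * cycle_ns

def pipelined_time_ns(n_instructions, cycle_ns=2, stages=5):
    """Pipelined: (stages - 1) cycles to fill, then one instruction per cycle."""
    fill_ns = (stages - 1) * cycle_ns
    return fill_ns + n_instructions * cycle_ns

n = 1000
t_single = single_cycle_time_ns(n)   # 8000 ns
t_pipe = pipelined_time_ns(n)        # 8 ns fill + 2000 ns = 2008 ns
print(t_single, t_pipe, round(t_single / t_pipe, 2))  # 8000 2008 3.98
```

As the instruction count grows, the fill time matters less and the speedup approaches 8 / 2 = 4.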

Pipeline Performance
A single instruction takes more (or the same amount of) time
A group / sequence of instructions takes less time
Pipelining increases throughput rather than decreasing execution time for an individual instruction
Design principle:
– Good designs demand good compromises

CPI Revisited
CPI = (total # of cycles) / (total # of instructions)
Hypothetically, the CPI of a pipelined processor is 1
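
A small sketch of why the CPI is only "hypothetically" 1: the cycles spent filling the pipeline, plus any stall cycles, are counted in the numerator. The function name and the `stall_cycles` parameter are assumptions for illustration.

```python
def pipelined_cpi(n_instructions, stages=5, stall_cycles=0):
    """CPI = total cycles / total instructions for a simple in-order pipeline."""
    # (stages - 1) cycles to fill, one cycle per instruction, plus any stalls.
    total = (stages - 1) + n_instructions + stall_cycles
    return total / n_instructions

print(pipelined_cpi(1000))                     # just above 1
print(pipelined_cpi(1000, stall_cycles=200))   # stalls push the CPI up
```

With no stalls the CPI tends toward 1 as the number of instructions grows.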

Hazards: Limits to Pipelined Performance

Roadblocks to Pipelining
Structural hazards
– Multiple instructions vying for a single shared resource
– Ex: RAM, ALU
– Instruction!
Data hazards
– A later instruction uses the result of an earlier instruction
– Ex: lw followed by an add that uses the loaded data
Control hazards
– The fetch of a later instruction relies on the result of an earlier instruction to determine the correct control path
– Ex: conditional branches that are taken

Structural Hazards
Suppose a Princeton architecture: one RAM for both instructions and data
Structural hazard: two instructions require RAM in the same cycle
Need to use Harvard architecture to accommodate this
lw  FDEMW
     FDEMW
      FDEMW
       F
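
The conflict can be found mechanically by listing which memory each instruction needs in each cycle. The sketch below is a simplified model, not a description of real hardware: it assumes one instruction is fetched per cycle with no stalls, that only lw and sw touch data memory (in their M stage), and the helper name `ram_conflicts` is hypothetical.

```python
def ram_conflicts(ops, unified=True):
    """List (cycle, users) for cycles in which one memory has more than one user.

    unified=True models a single RAM (Princeton); unified=False models
    separate instruction and data memories (Harvard).
    """
    users = {}
    for i, op in enumerate(ops):
        fetch_cycle = i + 1  # instruction i (0-based) is fetched in cycle i + 1
        instr_mem = "ram" if unified else "imem"
        users.setdefault((fetch_cycle, instr_mem), []).append(f"{op}:F")
        if op in ("lw", "sw"):
            data_mem = "ram" if unified else "dmem"
            # The M stage comes three cycles after the F stage.
            users.setdefault((fetch_cycle + 3, data_mem), []).append(f"{op}:M")
    return [(cycle, who) for (cycle, _mem), who in sorted(users.items())
            if len(who) > 1]

program = ["lw", "add", "add", "add"]
print(ram_conflicts(program, unified=True))    # lw's M collides with the 4th fetch
print(ram_conflicts(program, unified=False))   # no conflict with split memories
```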

Structural Hazards
Which “instruction” is coming from the I-MEM in any given cycle?
– Need to replicate it!
Structural hazards can (usually) be removed by adding duplicate hardware

More Structural Hazards
How do I read and write to the register file at the same time?!?

Pipelining Requirements
Decode is performed in the second half of the D stage
– D stage involves a read from the register file
Write back is performed in the first half of the W stage
– W stage involves a write to the register file
Not actually how it is implemented (but the concept works)
lw  FDEMW
lw   FDEMW
lw    FDEMW

Control & Data Hazards: Solutions

Hazard #2 - Data Hazards
nand cannot read reg 1 until add has stored it
Since read/write can occur in the same cycle, must stall 2 cycles here before nand can proceed
add  FDEMW
nand  F--DEMW

Solutions to Data Hazards
Forwarding / bypassing
– Data is forwarded, as soon as it is available, from one stage to another
– Forwarding occurs prior to the M/W stages
Result of add is forwarded from the E stage (output reg from ALU of add) to the E stage of nand (back to the ALU again, but for nand this time)
– reg 1 is not written until the W stage, but its value is used earlier anyway
add  FDEMW
nand  FDEMW

More Forwarding / Bypassing
add 1 2 3  FDEMW
sw          FDEMW

add 1 2 3  FDEMW
sw          FDEMW

Data Hazards: Load Stalls
Cannot forward “back in time”: must permit a “load stall” to wait on the result of the load
– Forwarding can’t solve everything (unfortunately)
lw         FDEMW
add 2 1 3   FDEMW

lw         FDEMW
add 2 1 3   FD-EMW
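
The stall counts in the data hazard slides (2 cycles without forwarding, 0 with forwarding after an ALU instruction, 1 after a load) follow from simple cycle arithmetic. The sketch below assumes the consumer needs the value at its E stage when forwarding is used, and at its D stage (same cycle as the producer's W) otherwise; `stall_cycles` and its parameters are hypothetical names.

```python
def stall_cycles(producer, distance, forwarding):
    """Stalls needed by an instruction `distance` slots after `producer`.

    Cycles are counted from the producer's F stage = cycle 1, so the
    producer's stages F, D, E, M, W fall in cycles 1..5.
    """
    if forwarding:
        # Value exists at the end of M for a load, at the end of E otherwise.
        ready_cycle = 4 if producer == "lw" else 3
        consumer_e = 3 + distance            # consumer's E stage if not stalled
        return max(0, ready_cycle + 1 - consumer_e)
    # No forwarding: the consumer's D must not come before the producer's W
    # (W writes in the first half of the cycle, D reads in the second half).
    consumer_d = 2 + distance                # consumer's D stage if not stalled
    return max(0, 5 - consumer_d)

print(stall_cycles("add", 1, forwarding=False))  # 2, as in add / nand F--DEMW
print(stall_cycles("add", 1, forwarding=True))   # 0, forwarded E to E
print(stall_cycles("lw", 1, forwarding=True))    # 1, the load stall FD-EMW
```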

Test Yourself
Consider the following instruction sequence:
add 1 2 2  FDEMW
add 1 1 2   FDEMW
add 1 1 1    FDEMW
What are the forwarding paths required to correctly implement this sequence?
Are there any forwarding paths that conflict?

Hazard #3 - Control Hazards
The lw instruction should only complete if the branch fails!
add 4 5 6     FDEMW
beq 1 2 loop   FDEMW
lw              FDEMW

Control Hazards (2)
Stalls are “bubbles” in the pipeline: no useful work is accomplished in a stall
The multi-cycle machine “resolves” branches in the E stage
– Branch resolution could be completed in the D stage if we pass rA and rB through a special “subtractor” and bypass the A and B regs
– Resolving branches in the D stage requires only a single cycle of stalling in the pipeline (vs. 2 if we stick to branch resolution in E)
add 4 5 6         FDEMW
beq 1 2 loop       FDEMW
lw                  FD
add                  F
next instruction      FDEMW

Simple Solutions to Control Hazards
What to do about control hazards:
1. Always stall
– Resolve branches fast, in the D stage, to reduce the stall to 1 cycle
2. Guess! (ok, “predict”)
– Gamble on the most likely outcome of the branch test, and fetch the instruction that would be executed
– If wrong, undo the fetch and get the correct instruction
– Ex: always predict branch failure, or always predict branch success

Branch Prediction Example (1)
Predict failure: if correct, this sequence proceeds without a stall
Branch failure is equivalent to nop since the branch instruction does nothing
add 4 5 6     FDEMW
beq 1 2 loop   FDEMW
lw              FDEMW

Branch Prediction Example (2)
Predict failure: if incorrect, must clear out the incorrect lw instruction and refetch the correct next instruction instead
– Results in a 1 cycle stall when using early resolution
add 4 5 6            FDEMW
beq 1 2 loop          FDEMW
lw                     F----
correct instruction     FDEMW
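
To compare "always stall" with "predict failure", one can count the cycles lost to branches. This is a rough sketch under the assumptions of these slides: the penalty is 1 cycle when branches resolve in D and 2 when they resolve in E, and a correct prediction costs nothing. The function and policy names are hypothetical.

```python
STAGES = "FDEMW"

def branch_penalty(resolve_stage):
    """Cycles lost per stalled or mispredicted branch: D gives 1, E gives 2."""
    return STAGES.index(resolve_stage)

def wasted_cycles(n_branches, n_taken, policy, resolve_stage="D"):
    """Total bubbles caused by branches under a simple policy."""
    penalty = branch_penalty(resolve_stage)
    if policy == "always_stall":
        return n_branches * penalty      # every branch pays the penalty
    if policy == "predict_not_taken":
        return n_taken * penalty         # only taken branches are mispredicted
    raise ValueError(f"unknown policy: {policy}")

# Example: 10 branches, 9 of them taken.
print(wasted_cycles(10, 9, "always_stall"))            # 10
print(wasted_cycles(10, 9, "predict_not_taken"))       # 9
print(wasted_cycles(10, 9, "predict_not_taken", "E"))  # 18
```

Predicting failure never does worse than always stalling, but it helps little when most branches are taken, which motivates better predictors.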

(Somewhat) Clever Solutions to Control Hazards
3. Dynamic branch prediction
– Predict the next instruction based on the past history of the branch instruction
– Requires a table of recent results of all branches encountered: the “branch prediction table”
Could predict branches with a 1 bit predictor model:
– Save the result of a branch in a 1 bit buffer
– The buffer is a table indexed by the low order bits of the address of the branch instruction
If buffer contents = 0, predict branch not taken
If buffer contents = 1, predict branch taken
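
A minimal software model of such a table might look like the following. It is a sketch, not a hardware description: the table size (`index_bits=4`), the class name, and the use of word addresses are all assumptions made for illustration.

```python
class OneBitPredictor:
    """Branch prediction table with one bit per entry."""

    def __init__(self, index_bits=4):
        self.mask = (1 << index_bits) - 1
        self.table = [0] * (1 << index_bits)   # 0 = predict not taken

    def predict(self, branch_address):
        # Index with the low-order bits of the branch instruction's address.
        return self.table[branch_address & self.mask] == 1

    def update(self, branch_address, taken):
        # Remember only the most recent outcome of the branch.
        self.table[branch_address & self.mask] = 1 if taken else 0

predictor = OneBitPredictor()
predictor.update(0x12, taken=True)
print(predictor.predict(0x12))   # True: same branch, last outcome was taken
print(predictor.predict(0x22))   # True: different branch, same low-order bits
print(predictor.predict(0x13))   # False: entry never updated
```

The second lookup shows a side effect of indexing by low-order bits: two branches that share those bits also share a table entry.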

1 Bit Dynamic Branch Prediction
for (int i = 0; i < 10; i++) { … }
becomes
      lw 1 0 ten
      add 2 0 0
loop: beq 2 1 exit
      …
If using simple “not taken” prediction, we’re wrong 90% of the time!
With a 1-bit predictor:
– On first iteration, prediction is “not take branch”: incorrect
– On last iteration, prediction is “take branch”: incorrect
– 2 mispredictions out of 10 tests: 80% correct for 90% branch success
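
The percentages above can be reproduced with a short simulation. The sketch assumes a loop branch that is tested 10 times and taken on the first 9 (the "90% branch success" case), and a 1-bit predictor that starts out predicting "not taken"; the function names are hypothetical.

```python
def accuracy_static_not_taken(outcomes):
    """Fraction of branches guessed right when always predicting not taken."""
    return sum(1 for taken in outcomes if not taken) / len(outcomes)

def accuracy_one_bit(outcomes, last_outcome=False):
    """Fraction guessed right when predicting 'same as last time'."""
    correct = 0
    for taken in outcomes:
        if last_outcome == taken:
            correct += 1
        last_outcome = taken   # the single bit stores the latest outcome
    return correct / len(outcomes)

loop_branch = [True] * 9 + [False]   # taken 9 times, then falls through
print(accuracy_static_not_taken(loop_branch))  # 0.1 (wrong 90% of the time)
print(accuracy_one_bit(loop_branch))           # 0.8 (2 mispredictions in 10)
```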

2 Bit Dynamic Branch Prediction
Could predict branches with a 2 bit predictor/corrector FSM:
– (basically a 2-bit “saturating” counter)
On the same example, we get 90% correct with 90% branch success
[Figure: FSM state diagram with “Predict taken” and “Predict not taken” states marked (weak) or (strong!), and transitions labeled “taken” / “not taken”]
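
A sketch of the 2-bit scheme as a saturating counter follows. The state encoding (0 = strong not taken, 1 = weak not taken, 2 = weak taken, 3 = strong taken) is an assumption, as is the loop pattern of 9 taken branches followed by 1 not taken. Under these assumptions the 90% figure shows up once the counter has warmed up, that is, from the second run of the loop onward.

```python
def run_two_bit(outcomes, counter=0):
    """Simulate a 2-bit saturating counter; return (accuracy, final counter)."""
    correct = 0
    for taken in outcomes:
        predicted_taken = counter >= 2
        if predicted_taken == taken:
            correct += 1
        # Count up on taken, down on not taken, saturating at 0 and 3.
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return correct / len(outcomes), counter

loop_branch = [True] * 9 + [False]
first_pass, state = run_two_bit(loop_branch)           # cold start
second_pass, state = run_two_bit(loop_branch, state)   # warmed up
print(first_pass, second_pass)  # 0.7 0.9
```

Unlike the 1-bit predictor, the single not-taken outcome at loop exit only weakens the prediction, so the next run of the loop starts with a correct guess.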

Modern Branch Prediction
Modern branch prediction is extremely important!
– Long pipelines mean huge branch penalties
– We need to be right as much as possible.
Because of the importance, modern predictors are also
– Extremely complex (some mimic AI routines in hardware)
– Large (lots of memories to store historical information)

Branch Prediction: Some Stats
Predict:
– Not taken: ~50-60% accurate
– Not taken, but backward branches taken: ~65% accurate
– Same as last time: ~80% accurate
Actual Designs
– Pentium: ~85% accurate
– Pentium Pro: ~92% accurate
Researched Designs
– Papers have demonstrated 96-98% accuracy