Slide 1: Computing Systems. Pipelining: enhancing performance
claudio.talarico@mail.ewu.edu
Slide 2: Pipelining
- Technique in which multiple instructions are overlapped in execution
- An instruction's steps can be carried out in parallel with those of other instructions
- T_exec = 2400 ps (single-cycle) vs. T_exec = 1400 ps (pipelined)
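The 2400 ps vs. 1400 ps figures on slide 2 can be reproduced with a short calculation. This is a sketch: the individual stage latencies (200/100/200/200/100 ps) and the 3-instruction sequence are assumptions taken from the standard textbook example, not stated on the slide itself.

```python
# Speedup sketch: 3 instructions through a 5-stage pipeline.
# Assumed stage latencies in ps: IF, ID, EX, MEM, WB.
stages = [200, 100, 200, 200, 100]
n_instructions = 3

# Single-cycle: every instruction takes the sum of all stage latencies.
single_cycle_time = n_instructions * sum(stages)  # 3 * 800 = 2400 ps

# Pipelined: the clock must fit the slowest stage; once the pipeline is
# full, one instruction completes per cycle.
clock = max(stages)                                          # 200 ps
pipelined_time = (len(stages) + n_instructions - 1) * clock  # 7 * 200 = 1400 ps

print(single_cycle_time, pipelined_time)  # 2400 1400
```

Note that the speedup here (2400/1400 ≈ 1.7) is well below the ideal factor of 5, because the stages are imperfectly balanced and only 3 instructions flow through; the ratio approaches 800/200 = 4 as the instruction count grows.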
Slide 3: Pipelining
- Improves performance by increasing instruction throughput, as opposed to decreasing the execution time (latency) of an individual instruction
- Increasing throughput decreases the total time to complete the work
- Ideal speedup equals the number of pipeline stages. Do we achieve this?
  - Stages may be imperfectly balanced
  - Pipelining involves some overhead
Slide 4: Pipelining
- What makes it easy (designing instruction sets for pipelining)?
  - All instructions are the same length
  - Just a few instruction formats
  - Memory operands appear only in loads and stores
- What makes it hard?
  - Sometimes the next instruction cannot be started in the next cycle (hazards)
  - Structural hazards: suppose we had only one memory
  - Control hazards: need to worry about branch instructions
  - Data hazards: an instruction depends on a previous instruction
- We'll build a simple pipeline and look at these issues
  - Instructions supported: lw, sw, add, sub, and, or, slt, beq
Slide 5: Basic idea
- Take a single-cycle datapath and separate it into pieces
- [Figure: stylized datapath with multiplexers; the drawing leaves out some details]
Slide 6: Pipelined datapath
- There is a bug! Can you find it? What instructions can we execute to manifest the bug?
- Instructions and data move from left to right (with two exceptions)
Slide 7: Corrected datapath
- For the load instruction we need to preserve the destination register number until the data is read from the MEM/WB pipeline register
Slide 8: Graphically representing pipelines
- Pipelining can be difficult to understand: every clock cycle, many instructions are simultaneously executing in a single datapath
- To aid understanding, there are two basic styles of pipeline figures:
  - Multiple-clock-cycle pipeline diagrams
  - Single-clock-cycle pipeline diagrams
- These help answer questions like: how many cycles does it take to execute this code? What is the ALU doing during cycle 4? They also help in understanding datapaths
- We highlight the right half of registers or memory when they are being read, and the left half when they are being written
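Questions like "what is the ALU doing during cycle 4?" can be answered mechanically for an ideal (stall-free) pipeline. A minimal sketch, assuming 0-based instruction indices and 1-based cycle numbers; the `stage_of` helper is an illustration, not part of the slides:

```python
# Sketch: in an ideal 5-stage pipeline, instruction i (0-based) enters
# IF at cycle i+1, so during cycle c it occupies stage c-i-1 (if any).
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def stage_of(i, c):
    """Stage of instruction i (0-based) during cycle c (1-based), or None."""
    s = c - i - 1
    return STAGES[s] if 0 <= s < len(STAGES) else None

# Textual multiple-clock-cycle diagram for 3 instructions over 7 cycles:
for i in range(3):
    row = [(stage_of(i, c) or "...").ljust(4) for c in range(1, 8)]
    print(f"I{i}: " + " ".join(row))
```

During cycle 4, `stage_of(1, 4)` returns `"EX"`, so the ALU is executing the second instruction (with the first in MEM and the third in ID).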
Slide 9: Multiple-clock-cycle diagrams: graphical view
Slide 10: Multiple-clock-cycle diagrams: traditional view
Slide 11: Single-clock-cycle diagrams: the pipeline at a particular time instant
Slide 12: Pipeline operation
- One operation begins in every cycle
- Also, one operation completes in each cycle
- Each instruction takes 5 cycles (k cycles in general, where k is the depth of the pipeline)
- In one clock cycle, several instructions are active; different stages are executing different instructions
- When a stage is not used, no control needs to be applied
- Issue: how do we generate the control signals? We need to set the control values for each pipeline stage, for each instruction
Slide 13: Pipeline control
- Note: we moved the position of the destination register
Slide 14: Pipeline control
- We have 5 stages. What needs to be controlled in each stage?
  - Instruction fetch and PC increment: the control signals to read instruction memory and write the PC are always asserted, so there is nothing special to control in this stage
  - Instruction decode / register fetch: the same thing happens at every clock cycle, so there are no optional control lines to set
  - Execution / address calculation: control lines set in this stage are RegDst, ALUOp, and ALUSrc
  - Memory access: control lines set in this stage are Branch, MemRead, and MemWrite
  - Write back: control lines set in this stage are MemtoReg and RegWrite
Slide 15: Pipeline control
- Since the control signals are needed from the execution stage on, we can generate them during the instruction decode stage and pass them along the pipeline registers, just like the data
- We have nine control lines
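The idea of generating all nine control lines in ID and shifting them down the pipeline registers can be sketched in a few lines. This is an illustration only: the dict-based "pipeline registers" and the single-opcode `decode_control` helper are assumptions, and the lw control values follow the standard single-cycle MIPS datapath.

```python
# Sketch: control signals are produced once in ID, then each stage
# consumes its own slice and forwards the remainder.
EX_CTRL  = ("RegDst", "ALUOp", "ALUSrc")
MEM_CTRL = ("Branch", "MemRead", "MemWrite")
WB_CTRL  = ("MemtoReg", "RegWrite")

def decode_control(opcode):
    """Full control bundle generated in the ID stage (lw only, as a demo)."""
    if opcode == "lw":
        return {"RegDst": 0, "ALUOp": 0, "ALUSrc": 1,
                "Branch": 0, "MemRead": 1, "MemWrite": 0,
                "MemtoReg": 1, "RegWrite": 1}
    raise NotImplementedError(opcode)

ctrl   = decode_control("lw")
id_ex  = dict(ctrl)                                  # ID/EX carries all nine
ex_mem = {k: id_ex[k] for k in MEM_CTRL + WB_CTRL}   # EX drops its own bits
mem_wb = {k: ex_mem[k] for k in WB_CTRL}             # MEM drops its own bits
print(mem_wb)  # {'MemtoReg': 1, 'RegWrite': 1}
```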
Slide 16: Pipelined datapath with control
Slide 17: Dependencies
- Problem: we start the next instruction before the first has finished
- Dependencies that "go backward in time" are data hazards
Slide 18: Software solution
- Have the compiler guarantee no hazards. Where do we insert the nops?
    sub $2,  $1, $3
    and $12, $2, $5
    or  $13, $6, $2
    add $14, $2, $2
    sw  $15, 100($2)
- Two nops go here, right after the sub!
- Problem: this really slows us down!
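The "where do the nops go?" question can be answered mechanically. A sketch, under the slide's assumptions: no forwarding hardware, and the register file writes in the first half of the cycle and reads in the second, so a result is safe to read two instructions after its writer. The tuple encoding of instructions is an illustration.

```python
# Sketch: insert nops so no instruction reads a register written by
# either of the two instructions immediately before it.
# Each instruction is (opcode, destination, [source registers]).
program = [
    ("sub", "$2",  ["$1", "$3"]),
    ("and", "$12", ["$2", "$5"]),
    ("or",  "$13", ["$6", "$2"]),
    ("add", "$14", ["$2", "$2"]),
    ("sw",  None,  ["$15", "$2"]),
]

def insert_nops(prog, distance=2):
    """Pad so every source is at least `distance`+1 slots after its writer."""
    out = []
    for op, dest, srcs in prog:
        # While a conflicting write sits in the last `distance` slots, pad.
        while any(d is not None and d in srcs for _, d, _ in out[-distance:]):
            out.append(("nop", None, []))
        out.append((op, dest, srcs))
    return out

scheduled = insert_nops(program)
print([op for op, _, _ in scheduled])
# ['sub', 'nop', 'nop', 'and', 'or', 'add', 'sw']
```

Exactly two nops land after the sub: `and` reads $2 one cycle too early, and once it is pushed back two slots, `or`, `add`, and `sw` are already far enough from the sub to read the written-back $2.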
Slide 19: Hardware solution: forwarding
- Use temporary results; don't wait for them to be written back
  - ALU forwarding (EX hazard)
  - Read/write to the same register (MEM hazard)
- What if this $2 were $13?
Slide 20: Forwarding logic
Forwarding from the EX/MEM register:
    if (EX/MEM.RegWrite                          // instruction writes to a register
        and (EX/MEM.RegisterRd != 0)             // destination is not $zero
        and (EX/MEM.RegisterRd = ID/EX.RegisterRs))
      ForwardA = 10
    if (EX/MEM.RegWrite                          // instruction writes to a register
        and (EX/MEM.RegisterRd != 0)             // destination is not $zero
        and (EX/MEM.RegisterRd = ID/EX.RegisterRt))
      ForwardB = 10
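The EX-hazard conditions above translate directly into executable form. A sketch for illustration: the field names mirror the slide (RegWrite, RegisterRd, RegisterRs/Rt), while the dict-based pipeline registers and the function name are assumptions.

```python
# Executable sketch of the EX-hazard forwarding conditions only.
def forward_ex(ex_mem, id_ex):
    """Return (ForwardA, ForwardB) considering only the EX/MEM register."""
    fa = fb = "00"
    if ex_mem["RegWrite"] and ex_mem["RegisterRd"] != 0:
        if ex_mem["RegisterRd"] == id_ex["RegisterRs"]:
            fa = "10"   # forward the prior ALU result to ALU input A
        if ex_mem["RegisterRd"] == id_ex["RegisterRt"]:
            fb = "10"   # forward the prior ALU result to ALU input B
    return fa, fb

# sub $2,$1,$3 followed by and $12,$2,$5: $2 must be forwarded to input A.
ex_mem = {"RegWrite": True, "RegisterRd": 2}
id_ex  = {"RegisterRs": 2, "RegisterRt": 5}
print(forward_ex(ex_mem, id_ex))  # ('10', '00')
```

The `RegisterRd != 0` guard matters: $zero always reads as 0, so a (useless) write to it must never be forwarded.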
Slide 21: Forwarding logic
Forwarding from the MEM/WB register:
    if (MEM/WB.RegWrite                          // instruction writes to a register
        and (MEM/WB.RegisterRd != 0)             // destination is not $zero
        and (MEM/WB.RegisterRd = ID/EX.RegisterRs))
      ForwardA = 01
    if (MEM/WB.RegWrite                          // instruction writes to a register
        and (MEM/WB.RegisterRd != 0)             // destination is not $zero
        and (MEM/WB.RegisterRd = ID/EX.RegisterRt))
      ForwardB = 01
- Almost true: there is a bug!
Slide 22: Forwarding logic
- Consider a sequence of instructions that all read and write the same register
- According to the previous policy, since MEM/WB.RegisterRd = ID/EX.RegisterRs, we "should" forward from MEM/WB
- But this time the more recent result is in the EX/MEM register
- Thus we have to forward from the EX/MEM register (fortunately, we already know how to do that!)
Slide 23: Forwarding logic
Forwarding from the MEM/WB register (corrected version):
    if (MEM/WB.RegWrite                          // instruction writes to a register
        and (MEM/WB.RegisterRd != 0)             // destination is not $zero
        and (MEM/WB.RegisterRd = ID/EX.RegisterRs)
        and (EX/MEM.RegisterRd != ID/EX.RegisterRs))  // latest value not in EX/MEM
      ForwardA = 01
    if (MEM/WB.RegWrite                          // instruction writes to a register
        and (MEM/WB.RegisterRd != 0)             // destination is not $zero
        and (MEM/WB.RegisterRd = ID/EX.RegisterRt)
        and (EX/MEM.RegisterRd != ID/EX.RegisterRt))  // latest value not in EX/MEM
      ForwardB = 01
- Make sure the latest value is not in EX/MEM
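The EX-hazard and corrected MEM-hazard rules combine into one forwarding unit. A sketch under the same dict-register assumption as before; here the priority is expressed by ordering (check MEM/WB first, then let a matching EX/MEM overwrite it), which is equivalent to the slide's extra "latest value not in EX/MEM" condition.

```python
# Combined forwarding unit sketch: the EX/MEM result, being more recent,
# takes priority over the MEM/WB result for the same register.
def forwarding_unit(ex_mem, mem_wb, id_ex):
    fa = fb = "00"
    # MEM hazard first...
    if mem_wb["RegWrite"] and mem_wb["RegisterRd"] != 0:
        if mem_wb["RegisterRd"] == id_ex["RegisterRs"]:
            fa = "01"
        if mem_wb["RegisterRd"] == id_ex["RegisterRt"]:
            fb = "01"
    # ...then the EX hazard overwrites it, giving EX/MEM priority.
    if ex_mem["RegWrite"] and ex_mem["RegisterRd"] != 0:
        if ex_mem["RegisterRd"] == id_ex["RegisterRs"]:
            fa = "10"
        if ex_mem["RegisterRd"] == id_ex["RegisterRt"]:
            fb = "10"
    return fa, fb

# Back-to-back writes to $1 with a read of $1: the EX/MEM value must win.
ex_mem = {"RegWrite": True, "RegisterRd": 1}
mem_wb = {"RegWrite": True, "RegisterRd": 1}
id_ex  = {"RegisterRs": 1, "RegisterRt": 4}
print(forwarding_unit(ex_mem, mem_wb, id_ex))  # ('10', '00')
```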
Slide 24: Forwarding unit
- The main idea (some details not shown): the ForwardA and ForwardB signals drive the multiplexers on the two ALU inputs
Slide 25: Forwarding unit
    Mux control     Source    Comment
    ForwardA = 00   ID/EX     The first ALU operand comes from the register file
    ForwardA = 10   EX/MEM    The first ALU operand is forwarded from the prior ALU result
    ForwardA = 01   MEM/WB    The first ALU operand is forwarded from data memory or an earlier ALU result
    ForwardB = 00   ID/EX     The second ALU operand comes from the register file
    ForwardB = 10   EX/MEM    The second ALU operand is forwarded from the prior ALU result
    ForwardB = 01   MEM/WB    The second ALU operand is forwarded from data memory or an earlier ALU result
Slide 26: Can't always forward!
- A load word instruction can still cause a hazard: an instruction tries to read a register immediately after a load instruction that writes the same register
- Thus we need a hazard detection unit to stall the instruction that follows the load
- This hazard cannot be solved by forwarding; we must stall (insert a bubble)
Slide 27: Stall logic
Hazard detection unit:
    if (ID/EX.MemRead
        and ((ID/EX.RegisterRt = IF/ID.RegisterRs)
          or (ID/EX.RegisterRt = IF/ID.RegisterRt)))
      stall the pipeline
- We can stall by letting an instruction that won't do anything go forward:
  - Deassert the control lines (the instruction then has no effect and acts like a bubble in the pipeline)
  - Prevent the following instructions from being fetched, accomplished simply by preventing the PC register and the IF/ID register from changing
- The only instruction that reads data memory is load
- The destination register of a load is in the Rt field
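The stall condition above is small enough to execute directly. A sketch: field names mirror the slide, while the dict-based pipeline registers and the function name are illustrative assumptions.

```python
# Executable sketch of the load-use hazard detection condition.
def must_stall(id_ex, if_id):
    """True when the instruction in EX is a load whose destination (Rt)
    is read by the instruction currently in ID."""
    return (id_ex["MemRead"]
            and id_ex["RegisterRt"] in (if_id["RegisterRs"],
                                        if_id["RegisterRt"]))

# lw $2, 20($1) immediately followed by and $4, $2, $5: stall required.
print(must_stall({"MemRead": True, "RegisterRt": 2},
                 {"RegisterRs": 2, "RegisterRt": 5}))  # True

# An ALU instruction in EX never triggers this stall (MemRead is false).
print(must_stall({"MemRead": False, "RegisterRt": 2},
                 {"RegisterRs": 2, "RegisterRt": 5}))  # False
```

When `must_stall` is true, the real hardware zeroes the control lines in ID/EX and holds the PC and IF/ID register, so the dependent instruction is decoded again one cycle later, by which time forwarding from MEM/WB can supply the loaded value.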
Slide 28: Pipeline with hazard detection unit (some details not shown)
Slide 29: Branch hazards (control hazards)
- When we decide to branch, other instructions are already in the pipeline!
Slide 30: Solutions to the branch hazard
- Branch stalling (software): easy but inefficient
- Static branch prediction: assume "branch not taken"
  - We need to add hardware for flushing instructions if we are wrong
  - We must discard the instructions in the IF, ID, and EX stages (change the control values to 0)
- Reducing the branch delay penalty: move the branch decision earlier (to the ID stage)
  - Compare the two registers read in the ID stage; comparison for equality requires only a few extra gates
  - We still need to flush the instruction in the IF/ID register: clearing the register transforms the fetched instruction into a nop
- Make the hazard into a feature: the delayed branch slot
  - Always execute the instruction following the branch
Slide 31: Branch detection in the ID stage
- The branch target computation has been moved ahead
Slide 32: Delayed branch (MIPS)
- A "branch delay slot" that the compiler tries to fill with a useful instruction (making the one-cycle delay part of the ISA)
- [Figure: scheduling choices for the delay slot; filling it from before the branch is the best solution, filling it from the target pays off when the branch is mostly taken]
Slide 33: Branches
- If the branch is taken, we may have a penalty of one cycle; for our simple design, this is reasonable
- With deeper pipelines the penalty increases, and static branch prediction drastically hurts performance
- Solution: dynamic branch prediction (keep track of branch history)
- Branch prediction is especially important because it enables other, more advanced pipelining techniques to be effective!
- Modern processors predict correctly about 95% of the time
- Example: a loop branch is taken 9 times in a row, then is not taken. Assume a 1-bit predictor. We mispredict the first and the last execution, so prediction accuracy is 80%
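The 80% figure from the loop example can be checked by simulating the predictor. A sketch: a 1-bit predictor simply remembers the last outcome; the initial "not taken" state is an assumption (it makes the first prediction one of the two mispredictions, matching the slide).

```python
# Sketch: accuracy of a 1-bit predictor on a branch outcome sequence.
def one_bit_accuracy(outcomes, initial=False):
    """outcomes: list of booleans (True = taken). Returns fraction correct."""
    predict, correct = initial, 0
    for taken in outcomes:
        correct += (predict == taken)
        predict = taken          # 1-bit predictor: remember the last outcome
    return correct / len(outcomes)

# Loop branch: taken 9 times, then not taken.
loop = [True] * 9 + [False]
print(one_bit_accuracy(loop))  # 0.8
```

The predictor mispredicts the first branch (state says "not taken") and the final exit branch (state says "taken"): 8 correct out of 10. Note that if the loop runs repeatedly, the exit misprediction also poisons the first iteration of the next run, so the steady-state accuracy stays at 80%.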
Slide 34: Improving performance
- Try to avoid stalls! E.g., reorder these instructions:
    lw $t0, 0($t1)
    lw $t2, 4($t1)
    sw $t2, 0($t1)
    sw $t0, 4($t1)
- Dynamic pipeline scheduling
  - The hardware is organized differently and chooses which instructions to execute next
  - It executes instructions out of order (it doesn't wait for a dependency to be resolved, but rather keeps going!)
  - It speculates on branches and keeps the pipeline full (and may need to roll back if a prediction is incorrect)
- Trying to exploit instruction-level parallelism
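Why does swapping the two stores help? The second `lw` writes $t2, and the very next instruction stores $t2: a load-use hazard that even forwarding cannot hide. A sketch that counts such stalls before and after reordering; the tuple encoding of instructions is an illustrative assumption.

```python
# Sketch: count load-use stalls (the one hazard forwarding cannot hide)
# for a straight-line sequence. Instructions are (op, dest, sources).
def load_use_stalls(prog):
    stalls = 0
    for prev, curr in zip(prog, prog[1:]):
        if prev[0] == "lw" and prev[1] in curr[2]:
            stalls += 1          # loaded value arrives too late; bubble needed
    return stalls

original = [("lw", "$t0", ["$t1"]),
            ("lw", "$t2", ["$t1"]),
            ("sw", None,  ["$t2", "$t1"]),   # reads $t2 right after its load
            ("sw", None,  ["$t0", "$t1"])]

# Swap the two stores: each loaded value now has an extra cycle to arrive.
reordered = [original[0], original[1], original[3], original[2]]

print(load_use_stalls(original), load_use_stalls(reordered))  # 1 0
```

This is exactly the transformation a scheduling compiler (or a dynamically scheduled pipeline) performs: the reordering is legal because the two stores touch different addresses, and it eliminates the bubble entirely.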
Slide 35: Dynamically scheduled pipeline
Slide 36: Advanced pipelining
- Trying to exploit instruction-level parallelism
- Increase the depth of the pipeline (overlap more instructions)
- Replicate internal functional units to start more than one instruction each cycle (multiple issue)
  - Static multiple issue (decision at compile time)
  - Dynamic multiple issue (decision at execution time)
- Loop unrolling to expose more ILP (better scheduling)
- "Superscalar" processors, e.g., the DEC Alpha 21264: 9-stage pipeline, 6-instruction issue
- All modern processors are superscalar and issue multiple instructions, usually with some limitations (e.g., different "pipes")
- VLIW (very long instruction word): static multiple issue that relies more on compiler technology
Slide 37: Summary
- Pipelined processors divide execution into multiple steps
- Pipelining improves instruction throughput, not the inherent execution time (latency) of individual instructions
- However, pipeline hazards reduce performance: structural, data, and control hazards
- Structural hazards are resolved by duplicating resources
- Data forwarding helps resolve data hazards, but not all hazards can be resolved (a load followed by a dependent R-type instruction); some data hazards require nop insertion (bubbles)
- The control hazard delay penalty can be reduced by branch prediction: always not taken, delayed slots, dynamic prediction
Slide 38: Concluding remarks
- Pipelined processors are not easy to design
- Technology affects implementation
- Instruction set design affects performance and design difficulty
- More stages do not necessarily lead to higher performance
- Pipelining and multiple issue both attempt to exploit ILP