Slide 1: Computing Systems. Pipelining: enhancing performance
claudio.talarico@mail.ewu.edu
Slide 2: Pipelining
- Technique in which multiple instructions are overlapped in execution
- An instruction's steps can be carried out in parallel with those of other instructions
- T_exec = 2400 ps (single-cycle) vs. T_exec = 1400 ps (pipelined)
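The 2400 ps vs. 1400 ps figures on slide 2 can be reproduced with a short calculation. This is a sketch: the individual stage latencies (200/100/200/200/100 ps) and the 3-instruction sequence are assumptions taken from the standard textbook example, not stated on the slide itself.

```python
# Speedup sketch: 3 instructions through a 5-stage pipeline.
# Assumed stage latencies in ps: IF, ID, EX, MEM, WB.
stages = [200, 100, 200, 200, 100]
n_instructions = 3

# Single-cycle: every instruction takes the sum of all stage latencies.
single_cycle_time = n_instructions * sum(stages)  # 3 * 800 = 2400 ps

# Pipelined: the clock must fit the slowest stage; once the pipeline is
# full, one instruction completes per cycle.
clock = max(stages)                                          # 200 ps
pipelined_time = (len(stages) + n_instructions - 1) * clock  # 7 * 200 = 1400 ps

print(single_cycle_time, pipelined_time)  # 2400 1400
```

Note that the speedup here (2400/1400 ≈ 1.7) is well below the ideal factor of 5, because the stages are imperfectly balanced and only 3 instructions flow through; the ratio approaches 800/200 = 4 as the instruction count grows.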
Slide 3: Pipelining
- Improves performance by increasing instruction throughput, as opposed to decreasing the execution time (latency) of an individual instruction
- Increasing throughput decreases the total time to complete the work
- Ideal speedup equals the number of pipeline stages. Do we achieve this?
  - Stages may be imperfectly balanced
  - Pipelining involves some overhead
Slide 4: Pipelining
- What makes it easy (designing instruction sets for pipelining)?
  - All instructions are the same length
  - Just a few instruction formats
  - Memory operands appear only in loads and stores
- What makes it hard?
  - Sometimes the next instruction cannot be started in the next cycle (hazards)
  - Structural hazards: suppose we had only one memory
  - Control hazards: need to worry about branch instructions
  - Data hazards: an instruction depends on a previous instruction
- We'll build a simple pipeline and look at these issues
  - Instructions supported: lw, sw, add, sub, and, or, slt, beq
Slide 5: Basic idea
- Take a single-cycle datapath and separate it into pieces
- [Figure: stylized datapath with multiplexers; the drawing leaves out some details]
Slide 6: Pipelined datapath
- There is a bug! Can you find it? What instructions can we execute to manifest the bug?
- Instructions and data move from left to right (with two exceptions)
Slide 7: Corrected datapath
- For the load instruction we need to preserve the destination register number until the data is read from the MEM/WB pipeline register
Slide 8: Graphically representing pipelines
- Pipelining can be difficult to understand: every clock cycle, many instructions are simultaneously executing in a single datapath
- To aid understanding, there are two basic styles of pipeline figures:
  - Multiple-clock-cycle pipeline diagrams
  - Single-clock-cycle pipeline diagrams
- These help answer questions like: how many cycles does it take to execute this code? What is the ALU doing during cycle 4? They also help in understanding datapaths
- We highlight the right half of registers or memory when they are being read, and the left half when they are being written
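Questions like "what is the ALU doing during cycle 4?" can be answered mechanically for an ideal (stall-free) pipeline. A minimal sketch, assuming 0-based instruction indices and 1-based cycle numbers; the `stage_of` helper is an illustration, not part of the slides:

```python
# Sketch: in an ideal 5-stage pipeline, instruction i (0-based) enters
# IF at cycle i+1, so during cycle c it occupies stage c-i-1 (if any).
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def stage_of(i, c):
    """Stage of instruction i (0-based) during cycle c (1-based), or None."""
    s = c - i - 1
    return STAGES[s] if 0 <= s < len(STAGES) else None

# Textual multiple-clock-cycle diagram for 3 instructions over 7 cycles:
for i in range(3):
    row = [(stage_of(i, c) or "...").ljust(4) for c in range(1, 8)]
    print(f"I{i}: " + " ".join(row))
```

During cycle 4, `stage_of(1, 4)` returns `"EX"`, so the ALU is executing the second instruction (with the first in MEM and the third in ID).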
Slide 9: Multiple-clock-cycle diagrams: graphical view
Slide 10: Multiple-clock-cycle diagrams: traditional view
Slide 11: Single-clock-cycle diagrams: the pipeline at a particular time instant
Slide 12: Pipeline operation
- One operation begins in every cycle
- Also, one operation completes in each cycle
- Each instruction takes 5 cycles (k cycles in general, where k is the depth of the pipeline)
- In one clock cycle, several instructions are active; different stages are executing different instructions
- When a stage is not used, no control needs to be applied
- Issue: how do we generate the control signals? We need to set the control values for each pipeline stage, for each instruction
Slide 13: Pipeline control
- Note: we moved the position of the destination register
Slide 14: Pipeline control
- We have 5 stages. What needs to be controlled in each stage?
  - Instruction fetch and PC increment: the control signals to read instruction memory and write the PC are always asserted, so there is nothing special to control in this stage
  - Instruction decode / register fetch: the same thing happens at every clock cycle, so there are no optional control lines to set
  - Execution / address calculation: control lines set in this stage are RegDst, ALUOp, and ALUSrc
  - Memory access: control lines set in this stage are Branch, MemRead, and MemWrite
  - Write back: control lines set in this stage are MemtoReg and RegWrite
Slide 15: Pipeline control
- Since the control signals are needed from the execution stage on, we can generate them during the instruction decode stage and pass them along the pipeline registers, just like the data
- We have nine control lines
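The idea of generating all nine control lines in ID and shifting them down the pipeline registers can be sketched in a few lines. This is an illustration only: the dict-based "pipeline registers" and the single-opcode `decode_control` helper are assumptions, and the lw control values follow the standard single-cycle MIPS datapath.

```python
# Sketch: control signals are produced once in ID, then each stage
# consumes its own slice and forwards the remainder.
EX_CTRL  = ("RegDst", "ALUOp", "ALUSrc")
MEM_CTRL = ("Branch", "MemRead", "MemWrite")
WB_CTRL  = ("MemtoReg", "RegWrite")

def decode_control(opcode):
    """Full control bundle generated in the ID stage (lw only, as a demo)."""
    if opcode == "lw":
        return {"RegDst": 0, "ALUOp": 0, "ALUSrc": 1,
                "Branch": 0, "MemRead": 1, "MemWrite": 0,
                "MemtoReg": 1, "RegWrite": 1}
    raise NotImplementedError(opcode)

ctrl   = decode_control("lw")
id_ex  = dict(ctrl)                                  # ID/EX carries all nine
ex_mem = {k: id_ex[k] for k in MEM_CTRL + WB_CTRL}   # EX drops its own bits
mem_wb = {k: ex_mem[k] for k in WB_CTRL}             # MEM drops its own bits
print(mem_wb)  # {'MemtoReg': 1, 'RegWrite': 1}
```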
Slide 16: Pipelined datapath with control
Slide 17: Dependencies
- Problem: we start the next instruction before the first has finished
- Dependencies that "go backward in time" are data hazards
Slide 18: Software solution
- Have the compiler guarantee no hazards. Where do we insert the nops?
    sub $2,  $1, $3
    and $12, $2, $5
    or  $13, $6, $2
    add $14, $2, $2
    sw  $15, 100($2)
- Two nops go here, right after the sub!
- Problem: this really slows us down!
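The "where do the nops go?" question can be answered mechanically. A sketch, under the slide's assumptions: no forwarding hardware, and the register file writes in the first half of the cycle and reads in the second, so a result is safe to read two instructions after its writer. The tuple encoding of instructions is an illustration.

```python
# Sketch: insert nops so no instruction reads a register written by
# either of the two instructions immediately before it.
# Each instruction is (opcode, destination, [source registers]).
program = [
    ("sub", "$2",  ["$1", "$3"]),
    ("and", "$12", ["$2", "$5"]),
    ("or",  "$13", ["$6", "$2"]),
    ("add", "$14", ["$2", "$2"]),
    ("sw",  None,  ["$15", "$2"]),
]

def insert_nops(prog, distance=2):
    """Pad so every source is at least `distance`+1 slots after its writer."""
    out = []
    for op, dest, srcs in prog:
        # While a conflicting write sits in the last `distance` slots, pad.
        while any(d is not None and d in srcs for _, d, _ in out[-distance:]):
            out.append(("nop", None, []))
        out.append((op, dest, srcs))
    return out

scheduled = insert_nops(program)
print([op for op, _, _ in scheduled])
# ['sub', 'nop', 'nop', 'and', 'or', 'add', 'sw']
```

Exactly two nops land after the sub: `and` reads $2 one cycle too early, and once it is pushed back two slots, `or`, `add`, and `sw` are already far enough from the sub to read the written-back $2.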
Slide 19: Hardware solution: forwarding
- Use temporary results; don't wait for them to be written back
  - ALU forwarding (EX hazard)
  - Read/write to the same register (MEM hazard)
- What if this $2 were $13?
Slide 20: Forwarding logic
Forwarding from the EX/MEM register:
    if (EX/MEM.RegWrite                          // instruction writes to a register
        and (EX/MEM.RegisterRd != 0)             // destination is not $zero
        and (EX/MEM.RegisterRd = ID/EX.RegisterRs))
      ForwardA = 10
    if (EX/MEM.RegWrite                          // instruction writes to a register
        and (EX/MEM.RegisterRd != 0)             // destination is not $zero
        and (EX/MEM.RegisterRd = ID/EX.RegisterRt))
      ForwardB = 10
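The EX-hazard conditions above translate directly into executable form. A sketch for illustration: the field names mirror the slide (RegWrite, RegisterRd, RegisterRs/Rt), while the dict-based pipeline registers and the function name are assumptions.

```python
# Executable sketch of the EX-hazard forwarding conditions only.
def forward_ex(ex_mem, id_ex):
    """Return (ForwardA, ForwardB) considering only the EX/MEM register."""
    fa = fb = "00"
    if ex_mem["RegWrite"] and ex_mem["RegisterRd"] != 0:
        if ex_mem["RegisterRd"] == id_ex["RegisterRs"]:
            fa = "10"   # forward the prior ALU result to ALU input A
        if ex_mem["RegisterRd"] == id_ex["RegisterRt"]:
            fb = "10"   # forward the prior ALU result to ALU input B
    return fa, fb

# sub $2,$1,$3 followed by and $12,$2,$5: $2 must be forwarded to input A.
ex_mem = {"RegWrite": True, "RegisterRd": 2}
id_ex  = {"RegisterRs": 2, "RegisterRt": 5}
print(forward_ex(ex_mem, id_ex))  # ('10', '00')
```

The `RegisterRd != 0` guard matters: $zero always reads as 0, so a (useless) write to it must never be forwarded.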
Slide 21: Forwarding logic
Forwarding from the MEM/WB register:
    if (MEM/WB.RegWrite                          // instruction writes to a register
        and (MEM/WB.RegisterRd != 0)             // destination is not $zero
        and (MEM/WB.RegisterRd = ID/EX.RegisterRs))
      ForwardA = 01
    if (MEM/WB.RegWrite                          // instruction writes to a register
        and (MEM/WB.RegisterRd != 0)             // destination is not $zero
        and (MEM/WB.RegisterRd = ID/EX.RegisterRt))
      ForwardB = 01
- Almost true: there is a bug!
Slide 22: Forwarding logic
- Consider a sequence of instructions that all read and write the same register
- According to the previous policy, since MEM/WB.RegisterRd = ID/EX.RegisterRs, we "should" forward from MEM/WB
- But this time the more recent result is in the EX/MEM register
- Thus we have to forward from the EX/MEM register (fortunately, we already know how to do that!)
Slide 23: Forwarding logic
Forwarding from the MEM/WB register (corrected version):
    if (MEM/WB.RegWrite                          // instruction writes to a register
        and (MEM/WB.RegisterRd != 0)             // destination is not $zero
        and (MEM/WB.RegisterRd = ID/EX.RegisterRs)
        and (EX/MEM.RegisterRd != ID/EX.RegisterRs))  // latest value not in EX/MEM
      ForwardA = 01
    if (MEM/WB.RegWrite                          // instruction writes to a register
        and (MEM/WB.RegisterRd != 0)             // destination is not $zero
        and (MEM/WB.RegisterRd = ID/EX.RegisterRt)
        and (EX/MEM.RegisterRd != ID/EX.RegisterRt))  // latest value not in EX/MEM
      ForwardB = 01
- Make sure the latest value is not in EX/MEM
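The EX-hazard and corrected MEM-hazard rules combine into one forwarding unit. A sketch under the same dict-register assumption as before; here the priority is expressed by ordering (check MEM/WB first, then let a matching EX/MEM overwrite it), which is equivalent to the slide's extra "latest value not in EX/MEM" condition.

```python
# Combined forwarding unit sketch: the EX/MEM result, being more recent,
# takes priority over the MEM/WB result for the same register.
def forwarding_unit(ex_mem, mem_wb, id_ex):
    fa = fb = "00"
    # MEM hazard first...
    if mem_wb["RegWrite"] and mem_wb["RegisterRd"] != 0:
        if mem_wb["RegisterRd"] == id_ex["RegisterRs"]:
            fa = "01"
        if mem_wb["RegisterRd"] == id_ex["RegisterRt"]:
            fb = "01"
    # ...then the EX hazard overwrites it, giving EX/MEM priority.
    if ex_mem["RegWrite"] and ex_mem["RegisterRd"] != 0:
        if ex_mem["RegisterRd"] == id_ex["RegisterRs"]:
            fa = "10"
        if ex_mem["RegisterRd"] == id_ex["RegisterRt"]:
            fb = "10"
    return fa, fb

# Back-to-back writes to $1 with a read of $1: the EX/MEM value must win.
ex_mem = {"RegWrite": True, "RegisterRd": 1}
mem_wb = {"RegWrite": True, "RegisterRd": 1}
id_ex  = {"RegisterRs": 1, "RegisterRt": 4}
print(forwarding_unit(ex_mem, mem_wb, id_ex))  # ('10', '00')
```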
Slide 24: Forwarding unit
- The main idea (some details not shown): the ForwardA and ForwardB signals drive the multiplexers on the two ALU inputs
Slide 25: Forwarding unit
    Mux control     Source    Comment
    ForwardA = 00   ID/EX     The first ALU operand comes from the register file
    ForwardA = 10   EX/MEM    The first ALU operand is forwarded from the prior ALU result
    ForwardA = 01   MEM/WB    The first ALU operand is forwarded from data memory or an earlier ALU result
    ForwardB = 00   ID/EX     The second ALU operand comes from the register file
    ForwardB = 10   EX/MEM    The second ALU operand is forwarded from the prior ALU result
    ForwardB = 01   MEM/WB    The second ALU operand is forwarded from data memory or an earlier ALU result
Slide 26: Can't always forward!
- A load word instruction can still cause a hazard: an instruction tries to read a register immediately after a load instruction that writes the same register
- Thus we need a hazard detection unit to stall the instruction that follows the load
- This hazard cannot be solved by forwarding; we must stall (insert a bubble)
Slide 27: Stall logic
Hazard detection unit:
    if (ID/EX.MemRead
        and ((ID/EX.RegisterRt = IF/ID.RegisterRs)
          or (ID/EX.RegisterRt = IF/ID.RegisterRt)))
      stall the pipeline
- We can stall by letting an instruction that won't do anything go forward:
  - Deassert the control lines (the instruction then has no effect and acts like a bubble in the pipeline)
  - Prevent the following instructions from being fetched, accomplished simply by preventing the PC register and the IF/ID register from changing
- The only instruction that reads data memory is load
- The destination register of a load is in the Rt field
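The stall condition above is small enough to execute directly. A sketch: field names mirror the slide, while the dict-based pipeline registers and the function name are illustrative assumptions.

```python
# Executable sketch of the load-use hazard detection condition.
def must_stall(id_ex, if_id):
    """True when the instruction in EX is a load whose destination (Rt)
    is read by the instruction currently in ID."""
    return (id_ex["MemRead"]
            and id_ex["RegisterRt"] in (if_id["RegisterRs"],
                                        if_id["RegisterRt"]))

# lw $2, 20($1) immediately followed by and $4, $2, $5: stall required.
print(must_stall({"MemRead": True, "RegisterRt": 2},
                 {"RegisterRs": 2, "RegisterRt": 5}))  # True

# An ALU instruction in EX never triggers this stall (MemRead is false).
print(must_stall({"MemRead": False, "RegisterRt": 2},
                 {"RegisterRs": 2, "RegisterRt": 5}))  # False
```

When `must_stall` is true, the real hardware zeroes the control lines in ID/EX and holds the PC and IF/ID register, so the dependent instruction is decoded again one cycle later, by which time forwarding from MEM/WB can supply the loaded value.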
Slide 28: Pipeline with hazard detection unit (some details not shown)
Slide 29: Branch hazards (control hazards)
- When we decide to branch, other instructions are already in the pipeline!
Slide 30: Solutions to the branch hazard
- Branch stalling (software): easy but inefficient
- Static branch prediction: assume "branch not taken"
  - We need to add hardware for flushing instructions if we are wrong
  - We must discard the instructions in the IF, ID, and EX stages (change the control values to 0)
- Reducing the branch delay penalty: move the branch decision earlier (to the ID stage)
  - Compare the two registers read in the ID stage; comparison for equality requires only a few extra gates
  - We still need to flush the instruction in the IF/ID register: clearing the register transforms the fetched instruction into a nop
- Make the hazard into a feature: the delayed branch slot
  - Always execute the instruction following the branch
Slide 31: Branch detection in the ID stage
- The branch target computation has been moved ahead
Slide 32: Delayed branch (MIPS)
- A "branch delay slot" that the compiler tries to fill with a useful instruction (making the one-cycle delay part of the ISA)
- [Figure: scheduling choices for the delay slot; filling it from before the branch is the best solution, filling it from the target pays off when the branch is mostly taken]
Slide 33: Branches
- If the branch is taken, we may have a penalty of one cycle; for our simple design, this is reasonable
- With deeper pipelines the penalty increases, and static branch prediction drastically hurts performance
- Solution: dynamic branch prediction (keep track of branch history)
- Branch prediction is especially important because it enables other, more advanced pipelining techniques to be effective!
- Modern processors predict correctly about 95% of the time
- Example: a loop branch is taken 9 times in a row, then is not taken. Assume a 1-bit predictor. We mispredict the first and the last execution, so prediction accuracy is 80%
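The 80% figure from the loop example can be checked by simulating the predictor. A sketch: a 1-bit predictor simply remembers the last outcome; the initial "not taken" state is an assumption (it makes the first prediction one of the two mispredictions, matching the slide).

```python
# Sketch: accuracy of a 1-bit predictor on a branch outcome sequence.
def one_bit_accuracy(outcomes, initial=False):
    """outcomes: list of booleans (True = taken). Returns fraction correct."""
    predict, correct = initial, 0
    for taken in outcomes:
        correct += (predict == taken)
        predict = taken          # 1-bit predictor: remember the last outcome
    return correct / len(outcomes)

# Loop branch: taken 9 times, then not taken.
loop = [True] * 9 + [False]
print(one_bit_accuracy(loop))  # 0.8
```

The predictor mispredicts the first branch (state says "not taken") and the final exit branch (state says "taken"): 8 correct out of 10. Note that if the loop runs repeatedly, the exit misprediction also poisons the first iteration of the next run, so the steady-state accuracy stays at 80%.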
Slide 34: Improving performance
- Try to avoid stalls! E.g., reorder these instructions:
    lw $t0, 0($t1)
    lw $t2, 4($t1)
    sw $t2, 0($t1)
    sw $t0, 4($t1)
- Dynamic pipeline scheduling
  - The hardware is organized differently and chooses which instructions to execute next
  - It executes instructions out of order (it doesn't wait for a dependency to be resolved, but rather keeps going!)
  - It speculates on branches and keeps the pipeline full (and may need to roll back if a prediction is incorrect)
- Trying to exploit instruction-level parallelism
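Why does swapping the two stores help? The second `lw` writes $t2, and the very next instruction stores $t2: a load-use hazard that even forwarding cannot hide. A sketch that counts such stalls before and after reordering; the tuple encoding of instructions is an illustrative assumption.

```python
# Sketch: count load-use stalls (the one hazard forwarding cannot hide)
# for a straight-line sequence. Instructions are (op, dest, sources).
def load_use_stalls(prog):
    stalls = 0
    for prev, curr in zip(prog, prog[1:]):
        if prev[0] == "lw" and prev[1] in curr[2]:
            stalls += 1          # loaded value arrives too late; bubble needed
    return stalls

original = [("lw", "$t0", ["$t1"]),
            ("lw", "$t2", ["$t1"]),
            ("sw", None,  ["$t2", "$t1"]),   # reads $t2 right after its load
            ("sw", None,  ["$t0", "$t1"])]

# Swap the two stores: each loaded value now has an extra cycle to arrive.
reordered = [original[0], original[1], original[3], original[2]]

print(load_use_stalls(original), load_use_stalls(reordered))  # 1 0
```

This is exactly the transformation a scheduling compiler (or a dynamically scheduled pipeline) performs: the reordering is legal because the two stores touch different addresses, and it eliminates the bubble entirely.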
Slide 35: Dynamically scheduled pipeline
Slide 36: Advanced pipelining
- Trying to exploit instruction-level parallelism
- Increase the depth of the pipeline (overlap more instructions)
- Replicate internal functional units to start more than one instruction each cycle (multiple issue)
  - Static multiple issue (decision at compile time)
  - Dynamic multiple issue (decision at execution time)
- Loop unrolling to expose more ILP (better scheduling)
- "Superscalar" processors, e.g., the DEC Alpha 21264: 9-stage pipeline, 6-instruction issue
- All modern processors are superscalar and issue multiple instructions, usually with some limitations (e.g., different "pipes")
- VLIW (very long instruction word): static multiple issue that relies more on compiler technology
Slide 37: Summary
- Pipelined processors divide execution into multiple steps
- Pipelining improves instruction throughput, not the inherent execution time (latency) of individual instructions
- However, pipeline hazards reduce performance: structural, data, and control hazards
- Structural hazards are resolved by duplicating resources
- Data forwarding helps resolve data hazards, but not all hazards can be resolved (a load followed by a dependent R-type instruction); some data hazards require nop insertion (bubbles)
- The control hazard delay penalty can be reduced by branch prediction: always not taken, delayed slots, dynamic prediction
Slide 38: Concluding remarks
- Pipelined processors are not easy to design
- Technology affects implementation
- Instruction set design affects performance and design difficulty
- More stages do not necessarily lead to higher performance
- Pipelining and multiple issue both attempt to exploit ILP