Appendix A Pipelining: Basic and Intermediate Concepts


1 Appendix A Pipelining: Basic and Intermediate Concepts

2 Pipelining
An implementation technique whereby multiple instructions are overlapped in execution. Each step in the pipeline (called a pipe stage) completes a part of an instruction. Because all stages proceed at the same time, the length of a processor (clock) cycle is determined by the time required for the slowest pipe stage.

3 Pipelining
The designer's goal is to balance the length of each pipeline stage. If the stages are perfectly balanced, the time per instruction on the pipelined processor is

    Time per instruction (pipelined) = Time per instruction (unpipelined) / Number of pipe stages

so the speedup from pipelining equals the number of pipe stages.
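
To make the formula concrete, here is a small Python sketch with idealized, made-up numbers (a 5 ns unpipelined instruction split into five perfectly balanced stages); it illustrates the equation above, not a real machine.

    unpipelined_time = 5.0                           # ns per instruction (assumed)
    stages = 5                                       # perfectly balanced pipe stages
    pipelined_time = unpipelined_time / stages       # 1.0 ns per instruction (ideal)
    speedup = unpipelined_time / pipelined_time      # 5.0 = number of pipe stages
    print(pipelined_time, speedup)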

4 RISC Instruction Set (MIPS64)
The 64-bit version of the MIPS instruction set, with 32 registers and three classes of instructions:
ALU instructions: DADD, DSUB, …
Load and store instructions: LD, SD, …
Branches and jumps

5 Implementation of a RISC (Unpipelined, Multicycle)
An implementation of an integer subset of a RISC architecture in which every instruction takes at most 5 clock cycles:
Instruction Fetch (IF)
Instruction Decode/Register Fetch (ID)
Execution/Effective Address Calculation (EX)
Memory Access (MEM)
Write-Back (WB)

6 Instruction Format (32-bit Version)
All MIPS instructions are 32 bits long.
R-format (add, sub, …):  OP | rs | rt | rd | sa | funct
I-format (lw, sw, …):    OP | rs | rt | immediate
J-format (j):            OP | jump target
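
The field boundaries above can be made explicit with a small decoder. This sketch uses the standard MIPS field widths (6/5/5/5/5/6 bits for R-format, a 16-bit immediate for I-format, a 26-bit target for J-format); the function names are illustrative, not from the lecture.

    def decode_r(word):
        # R-format: OP | rs | rt | rd | sa | funct
        return {
            "op":    (word >> 26) & 0x3F,
            "rs":    (word >> 21) & 0x1F,
            "rt":    (word >> 16) & 0x1F,
            "rd":    (word >> 11) & 0x1F,
            "sa":    (word >> 6)  & 0x1F,
            "funct": word & 0x3F,
        }

    def decode_i(word):
        # I-format: OP | rs | rt | 16-bit immediate (sign-extended when used)
        return {
            "op":  (word >> 26) & 0x3F,
            "rs":  (word >> 21) & 0x1F,
            "rt":  (word >> 16) & 0x1F,
            "imm": word & 0xFFFF,
        }

    def decode_j(word):
        # J-format: OP | 26-bit jump target
        return {"op": (word >> 26) & 0x3F, "target": word & 0x03FFFFFF}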

7 Instruction Fetch Cycle (IF)
Send the program counter (PC) to memory. Fetch the current instruction from memory. Update the PC to the next sequential PC by adding 4 to the PC.

8 Instruction Decode/Register Fetch Cycle (ID)
Decode the instruction and read the registers from the register file. Do the equality test on the registers for a possible branch. Sign-extend the offset field of the instruction in case it is needed. Compute the possible branch target address by adding the sign-extended offset to the incremented PC.

9 Execution/Effective Address Calculation (EX)
The ALU operates on the operands prepared in the prior cycle.
Memory reference: the ALU adds the base register and the offset to form the effective address.
Register-Register: the ALU performs the operation specified by the ALU opcode on the values read from the register file.
Register-Immediate: the ALU performs the operation specified by the opcode on the first value from the register file and the sign-extended immediate.

10 Memory Access (MEM)
If the instruction is a load, memory does a read using the effective address computed in the previous cycle. If it is a store, memory writes the value from the second register read from the register file to the effective address.

11 Write-Back Cycle (WB)
For a register-register ALU instruction or a load instruction, write the result into the register file.
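
A compact way to see the division of work in slides 7 through 11 is to walk one instruction through the five steps in software. The sketch below is purely illustrative: instructions are kept pre-decoded as dictionaries, memories and registers are plain Python dicts, and only DADD, LD, and SD are handled.

    def run_one(pc, imem, dmem, regs):
        # IF: fetch the instruction and compute the next sequential PC.
        inst = imem[pc]
        next_pc = pc + 4

        # ID: read the source registers from the register file.
        a = regs[inst["rs"]]
        b = regs.get(inst.get("rt"), 0)

        # EX: effective address for loads/stores, ALU result otherwise.
        if inst["op"] in ("LD", "SD"):
            ex_out = a + inst["imm"]
        else:                           # e.g. DADD rd, rs, rt
            ex_out = a + b

        # MEM: only loads and stores touch data memory.
        if inst["op"] == "LD":
            mem_out = dmem[ex_out]
        elif inst["op"] == "SD":
            dmem[ex_out] = b

        # WB: ALU instructions and loads write the register file.
        if inst["op"] == "DADD":
            regs[inst["rd"]] = ex_out
        elif inst["op"] == "LD":
            regs[inst["rt"]] = mem_out

        return next_pc

    # Example: DADD R3, R1, R2 with R1 = 8 and R2 = 5 leaves R3 = 13.
    regs = {1: 8, 2: 5, 3: 0}
    run_one(0, {0: {"op": "DADD", "rs": 1, "rt": 2, "rd": 3}}, {}, regs)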

12 In this implementation, branch instructions require 2 cycles, store instructions require 4 cycles, and all other instructions require 5 cycles. Assuming a branch frequency of 12% and a store frequency of 10%, what is the overall CPI?
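
A worked answer to the question above, assuming the remaining 78% of instructions are the 5-cycle ones (the slide gives only the branch and store frequencies):

    branch_frac, store_frac = 0.12, 0.10
    other_frac = 1.0 - branch_frac - store_frac        # 0.78
    cpi = branch_frac * 2 + store_frac * 4 + other_frac * 5
    print(cpi)                                         # -> 4.54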

13 Classic 5-Stage Pipeline for a RISC Processor

14 Performance Issues in Pipelining
Pipelining increases the CPU instruction throughput (the number of instructions completed per unit of time), but it does not decrease the execution time of an individual instruction. In fact, it usually slightly increases each instruction's execution time because of overhead in the control of the pipeline (clock skew and pipeline register delay).

15 Example (p. A-10)
Consider the unpipelined processor. Assume that it has a 1 ns clock cycle and that it uses 4 cycles for ALU operations and branches and 5 cycles for memory operations. Assume that the relative frequencies of these operations are 40%, 20%, and 40%, respectively. Suppose that, due to clock skew and setup, pipelining the processor adds 0.2 ns of overhead to the clock cycle. Ignoring any latency impact, how much speedup in the instruction execution rate will we gain from a pipeline?
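
A worked solution following the slide's numbers (average unpipelined instruction time versus the 1.2 ns pipelined clock):

    freq   = {"alu": 0.40, "branch": 0.20, "mem": 0.40}
    cycles = {"alu": 4,    "branch": 4,    "mem": 5}
    unpipelined_ns = 1.0 * sum(freq[k] * cycles[k] for k in freq)   # 4.4 ns per instruction
    pipelined_ns   = 1.0 + 0.2                                      # slowest stage + overhead
    print(unpipelined_ns / pipelined_ns)                            # ~3.7x speedup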

16 Classic 5-Stage Pipeline for a RISC Processor

17 Classic 5-Stage Pipeline
What happens in the pipeline?
One resource cannot be used for two different operations in the same clock cycle. => Separate instruction and data memories.
The register file is used in two stages: ID (two reads) and WB (one write). => Write the register file in the first half of the clock cycle and read it in the second half.

18 Pipeline Hazards

19 Pipeline Hazards
Situations that prevent the next instruction in the instruction stream from executing during its designated clock cycle. Hazards reduce the performance from the ideal speedup gained by pipelining and can make it necessary to stall the pipeline. Three classes:
Structural hazards
Data hazards
Control hazards

20 Pipeline Hazards
When an instruction is stalled, all instructions issued later than the stalled instruction are also stalled. No new instructions are fetched during the stall.

21 Structural Hazards
The hardware cannot support the combination of instructions that we want to execute in the same clock cycle. Example: a single memory for instructions and data instead of two separate memories, so an instruction fetch and a data access can collide in the same cycle.

22 Control Hazards
These arise from the need to make a decision based on the results of one instruction (a branch) while others are executing. The simple response is a pipeline stall (or bubble). How can we overcome this problem?

23 Branch Hazards
To minimize the branch penalty, put in enough hardware so that we can test registers, calculate the branch target address, and update the PC during the second stage (ID).

24 Example
Estimate the impact on the CPI of stalling on branches. Assume all other instructions have a CPI of 1.
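
The slide does not give a branch frequency, so the sketch below uses a placeholder value purely for illustration; the point is the shape of the calculation, CPI = 1 + (branch frequency × stall cycles per branch).

    base_cpi       = 1.0      # all non-branch instructions (given)
    branch_freq    = 0.17     # hypothetical frequency, not from the slide
    branch_penalty = 1        # one bubble if branches are resolved in ID (slide 23)
    print(base_cpi + branch_freq * branch_penalty)    # 1.17 with these assumptions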

25 Branch Prediction
Computers do indeed use prediction to handle branches. Simplest: always predict that branches will not be taken. When the prediction is right, the pipeline proceeds at full speed. Dynamic hardware predictors make their guesses based on the behavior of each branch. A popular scheme: keep a history for each branch as taken or untaken, and use the past to predict the future. => about 90% accuracy
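
A minimal sketch of the per-branch history idea mentioned above (a 1-bit predictor keyed by branch address); the class and method names are mine, and real predictors typically use 2-bit saturating counters rather than a single bit.

    class OneBitPredictor:
        def __init__(self):
            self.history = {}                  # branch PC -> last outcome (True = taken)

        def predict(self, pc):
            # Unseen branches default to not taken, the "simplest" policy above.
            return self.history.get(pc, False)

        def update(self, pc, taken):
            self.history[pc] = taken           # remember only the most recent outcome

    # A loop-closing branch taken 9 times and then not taken once is mispredicted
    # twice per pass through the loop with this 1-bit scheme.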

26 Branch Prediction
When the guess is wrong, the pipeline must make sure that the instructions following the mispredicted branch have no effect, and it must restart the pipeline from the proper branch address.

27 Delayed Branch
A form of delayed decision, used in MIPS. The delayed branch always executes the next sequential instruction, with the branch taking effect after that one-instruction delay.

28 (figure)

29 MIPS software will place an instruction immediately after the delayed branch instruction that is not affected by the branch, and a taken branch changes the address of the instruction that follows this safe instruction. Compilers typically fill about 50% of the branch delay slots with useful instructions.

30 Data Hazards
An instruction depends on the result of a previous instruction still in the pipeline, e.g.
add $s0, $t0, $t1
sub $t2, $s0, $t3
The add instruction doesn't write its result until the 5th stage. => 3 bubbles

31 Solution
Forwarding (or bypassing): getting the missing item early from the internal resources. For example, as soon as the ALU creates the sum for the add, we can supply it as an input for the subtract.
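
The lecture does not spell out the bypass conditions, so the sketch below shows the usual EX/MEM and MEM/WB forwarding checks for one ALU input; the pipeline-register field names (reg_write, rd) are assumptions for illustration, not the lecture's notation.

    def forward_a(id_ex_rs, ex_mem, mem_wb):
        # Choose where the ALU's first operand should come from.
        if ex_mem["reg_write"] and ex_mem["rd"] != 0 and ex_mem["rd"] == id_ex_rs:
            return "EX/MEM"      # result just computed by the ALU (e.g. the add above)
        if mem_wb["reg_write"] and mem_wb["rd"] != 0 and mem_wb["rd"] == id_ex_rs:
            return "MEM/WB"      # result about to be written back
        return "REGFILE"         # no hazard: use the value read during ID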

32 (figure)

33 Load-Use Data Hazard

34 Even with forwarding, we still have to stall for one cycle on a load-use data hazard.
Delayed loads: the compiler tries to follow a load with an instruction that is independent of that load.
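
A sketch of the stall condition for this case: if the instruction currently in EX is a load and the instruction being decoded needs its destination register, insert one bubble. The field names are illustrative, not the lecture's notation.

    def must_stall(id_ex, if_id):
        # id_ex: the possible load in EX; if_id: the instruction being decoded.
        return (id_ex["mem_read"] and
                id_ex["rt"] in (if_id["rs"], if_id["rt"]))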

35 (figure)

36 Implementation of the MIPS Datapath

37 Events on Every Pipe Stage of the MIPS Pipeline
See Figure A.19 on page A-32.

38 Revised Datapath

39 Revised Pipeline Structure
See Figure A.25 on page A-39.

40 Extending the MIPS to Handle Multicycle Operations

41 Floating-Point Operations
The floating-point pipeline allows a longer latency for operations: the EX cycle may be repeated as many times as needed to complete the operation, and the number of repetitions can vary for different operations. There may be multiple floating-point functional units.

42 Assumptions
Main integer unit: handles loads and stores, integer ALU operations, and branches.
FP and integer multiplier.
FP adder: handles FP add, subtract, and conversion.
FP and integer divider.
The EX stages of these functional units are not pipelined.

43 MIPS with 3 FP Functional Units

44 Because EX is not pipelined, no other instruction using that functional unit may issue until the previous instruction leaves EX. Instruction issue (p. A-33): the process of letting an instruction move from the ID stage into the EX stage of the pipeline. If an instruction cannot proceed to the EX stage, the entire pipeline behind that instruction will be stalled.

45 Latency: the number of intervening cycles between an instruction that produces a result and an instruction that uses the result.
Initiation interval: the number of cycles that must elapse between issuing two operations of a given type.

46 Example (Figure A.30)
Functional unit                        Latency   Initiation interval
Integer ALU                               0              1
Data memory (integer and FP loads)        1              1
FP add                                    3              1
FP multiply (also integer multiply)       6              1
FP divide (also integer divide)          24             25
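
One way to read the table: a dependent instruction issued d cycles after its producer sees max(0, latency - (d - 1)) stall cycles. The helper below encodes that reading; it is an illustration of the definitions on the previous slide, not the pipeline's actual interlock logic.

    latency = {"int_alu": 0, "load": 1, "fp_add": 3, "fp_mul": 6, "fp_div": 24}

    def stall_cycles(producer, issue_distance):
        # issue_distance = 1 means the consumer issues immediately after the producer.
        return max(0, latency[producer] - (issue_distance - 1))

    print(stall_cycles("fp_add", 1))   # 3: an FP add's result needs 3 intervening cycles
    print(stall_cycles("load", 2))     # 0: one independent instruction hides a load's latency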

47 Since most operations consume their operands at the beginning of the EX stage, the latency is usually the number of stages after EX in which an instruction produces its result: 0 for integer ALU operations, 1 for loads. Pipeline latency is essentially 1 cycle less than the depth of the execution pipeline, i.e. the number of stages from the EX stage to the stage that produces the result.

48 To achieve a higher clock rate, fewer logic levels are put in each pipe stage, so more complex operations require more pipe stages. The penalty for the faster clock rate is a longer latency (in cycles) for operations.

49 Supporting Multiple FP Operations
(figure)

