Download presentation
Presentation is loading. Please wait.
1
Advanced Computer Architecture 2019
A Quantitative Approach, Fifth Edition Advanced Computer Architecture 2019 27 May 2019 Appendix C Pipelining: Basic and Intermediate Concepts Advanced Computer Architecture Utsunomiya University Appendix C: Pipelining: Basic and Intermediate Concepts
2
Advanced Computer Architecture 2019 @ Utsunomiya University
Outline A “Typical” RISC ISA (RISC-V) 5 stage pipelining Structural Hazards Data Hazards & Forwarding Branch Hazards Handling Exceptions Floating Point Operations Advanced Computer Architecture Utsunomiya University
3
Advanced Computer Architecture 2019 @ Utsunomiya University
A "Typical" RISC ISA 32-bit fixed format instruction 32 32-bit GPR (x0 contains zero) 3-address, reg-reg arithmetic instruction Single addressing mode for load/store: base + displacement Simple branch conditions (Delayed branch) see: SPARC, MIPS, HP PA-Risc, DEC Alpha, IBM PowerPC, CDC 6600, CDC 7600, Cray-1, Cray-2, Cray-3 Advanced Computer Architecture Utsunomiya University
4
Summary of RISC-V ISA (RV32I)
x0 x1 x31 Programmable storage 2^32 x bytes 31 x 32-bit GPRs (x0 always zero) PC 32-bit instructions on word boundary PC Arithmetic logical add, sub, and, or, xor, slt, sltu, addi, slti, sltiu, andi, ori, xori, lui, auipc sll, srl, sra, slli, srli, srai Memory Access lb, lbu, lh, lhu, lw sb, sh, sw Control jal, jalr beq, bne, blt, bge, bltu, bgeu Advanced Computer Architecture Utsunomiya University
5
Instruction Format (RV32I)
- Location of each register operand field is fixed - MSB indicates sign of immediate value Advanced Computer Architecture Utsunomiya University
6
Advanced Computer Architecture 2019 @ Utsunomiya University
Datapath vs Control Datapath Controller signals Control Points Datapath: Storage, FU, interconnect sufficient to perform the desired functions Inputs are Control Points Outputs are signals Controller: State machine to orchestrate operation on the data path Based on desired function and signals Advanced Computer Architecture Utsunomiya University
7
5 Steps of RISC-V Datapath
27 May 2019 5 Steps of RISC-V Datapath Instruction Fetch Instr. Decode Reg. Fetch Execute Addr. Calc Memory Access Write Back Next PC Next SEQ PC MUX Next SEQ PC 4 Adder Adder RS1 Reg File Address z Memory RS2 ALU Inst RD Memory Data MUX MUX Sign Extend Imm WB Data Advanced Computer Architecture Utsunomiya University Appendix C: Pipelining: Basic and Intermediate Concepts
8
Inst. Set Processor Controller
IR <= mem[PC]; PC <= PC + 4 Ifetch A <= Reg[IRrs1]; B <= Reg[IRrs2] opFetch Jump (and link) Branch LD RR RI if bop(A,b) PC <= PC + IRimm r <= PC PC <= PC +IRimma r <= A opIRop IRimm r <= A + IRimm r <= A opIRop B WB <= Mem[r] WB <= r WB <= r WB <= r Reg[IRrd] <= WB Reg[IRrd] <= WB Reg[IRrd] <= WB Reg[IRrd] <= WB Advanced Computer Architecture Utsunomiya University
9
Advanced Computer Architecture 2019 @ Utsunomiya University
Outline A “Typical” RISC ISA (RISC-V) 5 stage pipelining Structural Hazards Data Hazards & Forwarding Branch Hazards Handling Exceptions Floating Point Operations Advanced Computer Architecture Utsunomiya University
10
5 Steps of RISC-V Datapath & Stage Registers
27 May 2019 5 Steps of RISC-V Datapath & Stage Registers Instruction Fetch Instr. Decode Reg. Fetch Execute Addr. Calc Memory Access Write Back Next PC Next SEQ PC IF/ID ID/EX MEM/WB EX/MEM MUX Next SEQ PC 4 Adder Adder RS1 Reg File Address z Memory RS2 ALU Memory Data MUX MUX Sign Extend Imm WB Data RD RD RD Advanced Computer Architecture Utsunomiya University Appendix C: Pipelining: Basic and Intermediate Concepts
11
Visualizing Pipelining
Time (clock cycles) Reg ALU DMem Ifetch Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 6 Cycle 7 Cycle 5 I n s t r. O r d e Advanced Computer Architecture Utsunomiya University
12
Pipelining is not quite that easy!
Limits to pipelining: Hazards prevent next instruction from executing during its designated clock cycle Structural hazards: A required resource is busy Data hazards: Instruction depends on the result of prior instruction still in the pipeline Control hazards: Caused by delay between the fetching of instructions and decisions about changes in control flow Advanced Computer Architecture Utsunomiya University
13
Advanced Computer Architecture 2019 @ Utsunomiya University
Outline A “Typical” RISC ISA (RISC-V) 5 stage pipelining Structural Hazards Data Hazards & Forwarding Branch Hazards Handling Exceptions Floating Point Operations Advanced Computer Architecture Utsunomiya University
14
One Memory Port/Structural Hazards
Time (clock cycles) Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 ALU I n s t r. O r d e Load Ifetch Reg DMem Reg Reg ALU DMem Ifetch Instr 1 Reg ALU DMem Ifetch Instr 2 ALU Instr 3 Ifetch Reg DMem Reg Reg ALU DMem Ifetch Instr 4 Advanced Computer Architecture Utsunomiya University
15
One Memory Port/Structural Hazards
Time (clock cycles) Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 ALU I n s t r. O r d e Load Ifetch Reg DMem Reg Reg ALU DMem Ifetch Instr 1 Reg ALU DMem Ifetch Instr 2 Bubble Stall Reg ALU DMem Ifetch Instr 3 How do you “bubble” the pipe? Advanced Computer Architecture Utsunomiya University
16
Speed Up Equation for Pipelining
27 May 2019 Speed Up Equation for Pipelining For simple RISC pipeline, CPI = 1: Advanced Computer Architecture Utsunomiya University Appendix C: Pipelining: Basic and Intermediate Concepts
17
Example: Dual-port vs. Single-port
Machine A: Dual ported memory (“Harvard Architecture”) Machine B: Single ported memory, but its pipelined implementation has a 1.05 times faster clock rate Ideal CPI = 1 for both Loads are 40% of instructions executed SpeedUpA = Pipeline Depth/(1 + 0) x (clockunpipe/clockpipe) = Pipeline Depth SpeedUpB = Pipeline Depth/( x 1) x (clockunpipe/(clockunpipe / 1.05) = (Pipeline Depth/1.4) x 1.05 = 0.75 x Pipeline Depth SpeedUpA / SpeedUpB = Pipeline Depth/(0.75 x Pipeline Depth) = 1.33 Machine A is 1.33 times faster Advanced Computer Architecture Utsunomiya University
18
Advanced Computer Architecture 2019 @ Utsunomiya University
Outline A “Typical” RISC ISA (RISC-V) 5 stage pipelining Structural Hazards Data Hazards & Forwarding Branch Hazards Handling Exceptions Floating Point Operations Advanced Computer Architecture Utsunomiya University
19
Advanced Computer Architecture 2019 @ Utsunomiya University
Data Hazard on R1 Time (clock cycles) IF ID/RF EX MEM WB I n s t r. O r d e add x1,x2,x3 sub x4,x1,x3 and x6,x1,x7 or x8,x1,x9 xor x10,x1,x11 Reg ALU DMem Ifetch Advanced Computer Architecture Utsunomiya University
20
Three Generic Data Hazards
Read After Write (RAW) InstrJ tries to read operand before InstrI writes it Caused by a “Dependence” (in compiler nomenclature). This hazard results from an actual need for communication. I: add x1,x2,x3 J: sub x4,x1,x3 Advanced Computer Architecture Utsunomiya University
21
Three Generic Data Hazards
Write After Read (WAR) InstrJ writes operand before InstrI reads it Called an “anti-dependence” by compiler writers. This results from reuse of the name “x1”. Can’t happen in RISC-V 5 stage pipeline because: All instructions take 5 stages, and Reads are always in stage 2, and Writes are always in stage 5 I: sub x4,x1,x3 J: add x1,x2,x3 K: add x6,x1,x7 Advanced Computer Architecture Utsunomiya University
22
Three Generic Data Hazards
Write After Write (WAW) InstrJ writes operand before InstrI writes it. Called an “output dependence” by compiler writers This also results from the reuse of name “x1”. Can’t happen in RISC-V stage pipeline because: All instructions take 5 stages, and Writes are always in stage 5 Will see WAR and WAW in more complicated pipes I: sub x1,x4,x3 J: add x1,x2,x3 K: add x6,x1,x7 Advanced Computer Architecture Utsunomiya University
23
Forwarding to Avoid Data Hazard
Time (clock cycles) I n s t r. O r d e add x1,x2,x3 sub x4,x1,x3 and x6,x1,x7 or x8,x1,x9 xor x10,x1,x11 Reg ALU DMem Ifetch Advanced Computer Architecture Utsunomiya University
24
HW Change for Forwarding
ID/EX EX/MEM MEM/WR NextPC mux Registers ALU Data Memory mux mux Immediate What circuit detects and resolves this hazard? Advanced Computer Architecture Utsunomiya University
25
Forwarding to Avoid LW-SW Data Hazard
Time (clock cycles) I n s t r. O r d e add x1,x2,x3 lw x4, 0(x1) sw x4,12(x1) or x8,x6,x9 xor x10,x9,x11 Reg ALU DMem Ifetch Advanced Computer Architecture Utsunomiya University
26
Data Hazard Even with Forwarding
27 May 2019 Time (clock cycles) Reg ALU DMem Ifetch I n s t r. O r d e lw x1,0(x2) sub x4,x1,x6 and x6,x1,x7 or x8,x1,x9 MIPS actutally didn’t interlecok: MPU without Interlocked Pipelined Stages Advanced Computer Architecture Utsunomiya University Appendix C: Pipelining: Basic and Intermediate Concepts
27
Data Hazard Even with Forwarding
Time (clock cycles) I n s t r. O r d e Reg ALU DMem Ifetch lw x1, 0(x2) Reg Ifetch ALU DMem Bubble sub x4,x1,x6 Ifetch ALU DMem Reg Bubble and x6,x1,x7 Bubble or x8,x1,x9 Ifetch Reg ALU DMem How is this detected? Advanced Computer Architecture Utsunomiya University
28
Software Scheduling to Avoid Load Hazards
Try producing fast code for a = b + c; d = e – f; assuming a, b, c, d ,e, and f in memory. Slow code: LW Rb,b LW Rc,c ADD Ra,Rb,Rc SW a,Ra LW Re,e LW Rf,f SUB Rd,Re,Rf SW d,Rd Fast code: LW Rb,b LW Rc,c LW Re,e ADD Ra,Rb,Rc LW Rf,f SW a,Ra SUB Rd,Re,Rf SW d,Rd Compiler optimizes for performance. Hardware checks for safety. Advanced Computer Architecture Utsunomiya University
29
Advanced Computer Architecture 2019 @ Utsunomiya University
Outline A “Typical” RISC ISA (RISC-V) 5 stage pipelining Structural Hazards Data Hazards & Forwarding Branch Hazards Handling Exceptions Floating Point Operations Advanced Computer Architecture Utsunomiya University
30
Control Hazard on Branches Three Stage Stall
Reg ALU DMem Ifetch 10: beq x1,x3,36 14: and x2,x3,x5 18: or x6,x1,x7 22: add x8,x1,x9 36: xor x10,x1,x11 What do you do with the 3 instructions in between? How do you do it? Where is the “commit”? Advanced Computer Architecture Utsunomiya University
31
Advanced Computer Architecture 2019 @ Utsunomiya University
Branch Stall Impact If CPI = 1, 30% branch, Stall 3 cycles => new CPI = 1.9! Two part solution: Determine branch taken or not sooner, AND Compute taken branch address earlier RISC-V Solution: Move Zero-test to ID stage Adder to calculate new PC in ID stage 1 clock cycle penalty for branch versus 3 Advanced Computer Architecture Utsunomiya University
32
Pipelined RISC-V Datapath
27 May 2019 Instruction Fetch Instr. Decode Reg. Fetch Execute Addr. Calc Memory Access Write Back Next PC Next SEQ PC ID/EX EX/MEM MEM/WB MUX 4 Adder IF/ID Adder Z RS1 Address Reg File Memory RS2 ALU Memory Data MUX MUX Sign Extend Imm WB Data RD RD RD Advanced Computer Architecture Utsunomiya University Appendix C: Pipelining: Basic and Intermediate Concepts
33
Four Branch Hazard Alternatives
#1: Stall until branch direction is clear #2: Predict Branch Not Taken Execute successor instructions in sequence “Squash” instructions in pipeline if branch actually taken Advantage of late pipeline state update 47% branches not taken on average PC+4 already calculated, so use it to get next instruction ARM and RISC-V use this #3: Predict Branch Taken 53% branches taken on average But haven’t calculated branch target address in RISC-V RISC-V still incurs 1 cycle branch penalty Advanced Computer Architecture Utsunomiya University
34
Four Branch Hazard Alternatives
#4: Delayed Branch Define branch to take place AFTER a following instruction branch instruction sequential successor1 sequential successor sequential successorn target of taken branch 1 slot delay allows proper decision and branch target address in 5 stage pipeline Branch delay of length n Advanced Computer Architecture Utsunomiya University
35
Scheduling Branch Delay Slots
27 May 2019 A. From before branch B. From branch target C. From fall through add x1,x2,x3 if x2=0 then add x1,x2,x3 if x1=0 then sub x4,x5,x6 delay slot delay slot add x1,x2,x3 if x1=0 then or x7,x8,x9 delay slot sub x4,x5,x6 becomes becomes becomes if x2=0 then add x1,x2,x3 Limitations on delayed-branch scheduling come from 1) restrictions on the instructions that can be moved/copied into the delay slot and 2) limited ability to predict at compile time whether a branch is likely to be taken or not. In B and C, the use of $1 prevents the add instruction from being moved to the delay slot In B the sub may need to be copied because it could be reached by another path. B is preferred when the branch is taken with high probability (such as loop branches add x1,x2,x3 if x1=0 then sub x4,x5,x6 add x1,x2,x3 if x1=0 then or x7,x8,x9 sub x4,x5,x6 A is the best choice, fills delay slot & reduces instruction count (IC) In B, the sub instruction may need to be copied, increasing IC In B and C, must be okay to execute sub when branch fails Advanced Computer Architecture Utsunomiya University Appendix C: Pipelining: Basic and Intermediate Concepts
36
Advanced Computer Architecture 2019 @ Utsunomiya University
Delayed Branch Compiler effectiveness for single branch delay slot: Fills about 60% of branch delay slots About 80% of instructions executed in branch delay slots useful in computation About 50% (60% x 80%) of slots usefully filled Delayed Branch downside: As processor go to deeper pipelines and multiple issue, the branch delay grows and need more than one delay slot Delayed branching has lost popularity compared to more expensive but more flexible dynamic approaches Growth in available transistors has made dynamic approaches relatively cheaper Advanced Computer Architecture Utsunomiya University
37
Evaluating Branch Alternatives
Assume 4% unconditional branch, 6% conditional branch- untaken, 10% conditional branch-taken Scheduling Branch CPI speedup v. speedup v. scheme penalty unpipelined stall Stall pipeline Predict taken Predict not taken Delayed branch Advanced Computer Architecture Utsunomiya University
38
Advanced Computer Architecture 2019 @ Utsunomiya University
Outline A “Typical” RISC ISA (RISC-V) 5 stage pipelining Structural Hazards Data Hazards & Forwarding Branch Hazards Handling Exceptions Floating Point Operations Advanced Computer Architecture Utsunomiya University
39
Advanced Computer Architecture 2019 @ Utsunomiya University
Exception: An unusual event happens to an instruction during its execution Examples: divide by zero, undefined opcode Interrupt: Hardware signal to switch the processor to a new instruction stream Example: a sound card interrupts when it needs more audio output samples (an audio “click” happens if it is left waiting) Problem: It must appear that the exception or interrupt must appear between 2 instructions (Ii and Ii+1) The effect of all instructions up to and including Ii is totally completed No effect of any instruction after Ii can take place The interrupt (exception) handler either aborts program or restarts at instruction Ii+1 Advanced Computer Architecture Utsunomiya University
40
Precise Exceptions in Static Pipelines
Key observation: architected state only change in memory and register write stages. Advanced Computer Architecture Utsunomiya University
41
Advanced Computer Architecture 2019 @ Utsunomiya University
Outline A “Typical” RISC ISA (RISC-V) 5 stage pipelining Structural Hazards Data Hazards & Forwarding Branch Hazards Handling Exceptions Floating Point Operations Advanced Computer Architecture Utsunomiya University
42
Floating Point Operations
FP pipeline will allow for a longer latency It is impractical to require all FP operations complete in 1 clock cycle Slow clock rate Enormous amounts of logic circuit Multiple FUs (functional units) Integer unit, FP/Int multiplier, FP adder, FP/Int divider Two alternatives Multicycle (unpipelined) FP unit Pipelined FP unit Latency and initiation interval (or repeat interval) Advanced Computer Architecture Utsunomiya University
43
Advanced Computer Architecture 2019 @ Utsunomiya University
Multicycle (unpipelined) FP units Advanced Computer Architecture Utsunomiya University
44
Advanced Computer Architecture 2019 @ Utsunomiya University
Pipelined FP units WAR and WAW hazards may appear because instructions have different lengths Advanced Computer Architecture Utsunomiya University
45
Advanced Computer Architecture 2019 @ Utsunomiya University
Latency and initiation interval Functional unit Latency Initiation interval Integer ALU 1 Data memory (load) FP add 3 FP/Int multiply 6 FP/Int divide 24 25 Can start every FP-add operation but each result is obtained after 3 cycles FP divider is not pipelined requires 24 cycles to complete Advanced Computer Architecture Utsunomiya University
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.