CpE 442 Designing a Pipeline Processor (lect. II)

Slides:



Advertisements
Similar presentations
CMSC 611: Advanced Computer Architecture Pipelining Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted.
Advertisements

Review: Pipelining. Pipelining Laundry Example Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold Washer takes 30 minutes Dryer.
ENEE350 Ankur Srivastava University of Maryland, College Park Based on Slides from Mary Jane Irwin ( )
ECE 361 Computer Architecture Lecture 13: Designing a Pipeline Processor Start X:40.
Chapter 5 Pipelining and Hazards
©UCB CS 162 Computer Architecture Lecture 3: Pipelining Contd. Instructor: L.N. Bhuyan
1 COMP 206: Computer Architecture and Implementation Montek Singh Mon., Sep 9, 2002 Topic: Pipelining Basics.
ENGS 116 Lecture 51 Pipelining and Hazards Vincent H. Berk September 30, 2005 Reading for today: Chapter A.1 – A.3, article: Patterson&Ditzel Reading for.
1 第三章 Instruction-Level Parallelism and Its Dynamic Exploitation 陈文智 浙江大学计算机学院 2011 年 09 月.
Pipelining. 10/19/ Outline 5 stage pipelining Structural and Data Hazards Forwarding Branch Schemes Exceptions and Interrupts Conclusion.
CPE 731 Advanced Computer Architecture Pipelining Review Dr. Gheith Abandah Adapted from the slides of Prof. David Patterson, University of California,
EEL5708 Lotzi Bölöni EEL 5708 High Performance Computer Architecture Pipelining.
CSE 340 Computer Architecture Summer 2014 Basic MIPS Pipelining Review.
CS.305 Computer Architecture Enhancing Performance with Pipelining Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005, and from.
CMPE 421 Parallel Computer Architecture
1 Designing a Pipelined Processor In this Chapter, we will study 1. Pipelined datapath 2. Pipelined control 3. Data Hazards 4. Forwarding 5. Branch Hazards.
CMPE 421 Parallel Computer Architecture Part 2: Hardware Solution: Forwarding.

HazardsCS510 Computer Architectures Lecture Lecture 7 Pipeline Hazards.
CPE 442 hazards.1 Introduction to Computer Architecture CpE 442 Designing a Pipeline Processor (lect. II)
CS252/Patterson Lec 1.1 1/17/01 معماري کامپيوتر - درس نهم pipeline برگرفته از درس : Prof. David A. Patterson.
Pipelining: Implementation CPSC 252 Computer Organization Ellen Walker, Hiram College.
EEL4713C Ann Gordon-Ross.1 EEL-4713C Computer Architecture Pipelined Processor - Hazards.
Problem with Single Cycle Processor Design
Computer Organization
Stalling delays the entire pipeline
Note how everything goes left to right, except …
Review: Instruction Set Evolution
Lecture 4: Pipeline Complications: Data and Control Hazards
Pipelining: Hazards Ver. Jan 14, 2014
Performance of Single-cycle Design
CMSC 611: Advanced Computer Architecture
5 Steps of MIPS Datapath Figure A.2, Page A-8
Appendix C Pipeline implementation
ECE232: Hardware Organization and Design
Pipelining Lessons 6 PM T a s k O r d e B C D A 30
Appendix A - Pipelining
Chapter 3: Pipelining 순천향대학교 컴퓨터학부 이 상 정 Adapted from
Forwarding Now, we’ll introduce some problems that data hazards can cause for our pipelined processor, and show how to handle them with forwarding.
Chapter 4 The Processor Part 3
Review: MIPS Pipeline Data and Control Paths
Chapter 4 The Processor Part 2
CMSC 611: Advanced Computer Architecture
SOLUTIONS CHAPTER 4.
Pipelining review.
Single-cycle datapath, slightly rearranged
CS 704 Advanced Computer Architecture
Pipelining Chapter 6.
A pipeline diagram Clock cycle lw $t0, 4($sp) IF ID
Pipelining in more detail
CS-447– Computer Architecture Lecture 14 Pipelining (2)
Pipelining Lessons 6 PM T a s k O r d e B C D A 30
Pipeline control unit (highly abstracted)
The Processor Lecture 3.6: Control Hazards
The Processor Lecture 3.5: Data Hazards
Instruction Execution Cycle
Overview What are pipeline hazards? Types of hazards
CpE 242 Computer Architecture and Engineering Designing a Pipeline Processor Start X:40.
Pipeline control unit (highly abstracted)
Pipeline Control unit (highly abstracted)
Instructors: Randy H. Katz David A. Patterson
Introduction to Computer Organization and Architecture
Throughput = #instructions per unit time (seconds/cycles etc.)
CMCS Computer Architecture Lecture 20 Pipelined Datapath and Control April 11, CMSC411.htm Mohamed.
Guest Lecturer: Justin Hsia
Recall: Performance Evaluation
©2003 Craig Zilles (derived from slides by Howard Huang)
Pipelined datapath and control
Pipelining Hazards.
Presentation transcript:

CpE 442 Designing a Pipeline Processor (lect. II) Start X:40

Review: Single Cycle, Multiple Cycle, vs. Pipeline Clk Single Cycle Implementation: Load Store Waste Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 Cycle 10 Clk Multiple Cycle Implementation: Load Store R-type Ifetch Reg Exec Mem Wr Ifetch Reg Exec Mem Ifetch Here are the timing diagrams showing the differences between the single cycle, multiple cycle, and pipeline implementations. For example, in the pipeline implementation, we can finish executing the Load, Store, and R-type instruction sequence in seven cycles. In the multiple clock cycle implementation, however, we cannot start executing the store until Cycle 6 because we must wait for the load instruction to complete. Similarly, we cannot start the execution of the R-type instruction until the store instruction has completed its execution in Cycle 9. In the Single Cycle implementation, the cycle time is set to accommodate the longest instruction, the Load instruction. Consequently, the cycle time for the Single Cycle implementation can be five times longer than the multiple cycle implementation. But may be more importantly, since the cycle time has to be long enough for the load instruction, it is too long for the store instruction so the last part of the cycle here is wasted. +2 = 77 min. (X:57) Pipeline Implementation: Load Ifetch Reg Exec Mem Wr Store Ifetch Reg Exec Mem Wr R-type Ifetch Reg Exec Mem Wr

Review: A Pipelined Datapath Clk Ifetch Reg/Dec Exec Mem Wr RegWr ExtOp ALUOp Branch 1 PC+4 PC+4 PC PC+4 Imm16 Imm16 Rs Zero busA Data Mem A Ra The pipelined datapath consists of combination logic blocks separated by pipeline registers. If you get rid of all these registers (not the PC), this pipelined datapath is reduced to the single-cycle datapath. This should give you extra incentive to do a good job on your single cycle processor design homework because you can build your pipeline design based on your single cycle design. Anyway, the registers mark the beginning and the end of a pipe stage. In the multiple clock cycle lecture, I recommended that the best way to think about a logic clock cycle is that it begins slightly after the clock tick and ends right at the next clock tick. For example here, the Reg/Decode stage begins slightly after this clock tick when the output of the IF/ID register has stabilized to its new value AND ends RIGHT at the next clock tick when the output of the register file is clocked into the ID/Exec register. At the end of the Reg/Decode stage, the register output that just clocked into the ID/Exec register has NOT yet propagate to the register output yet. It takes a Clk-to-Q delay. When the new value we just clocked in (points to the clock tick) has propagate to the register output, then we have reach the beginning of the Exec stage. Notice that the Wr stage of the pipeline starts here (last cycle) but there is no corresponding datapath underneath it because the Wr stage of the pipeline is handled by the same part of the pipeline that handles the Register Read stage. This part of the datapath (Reg File) is the only part that is used by more than one stage of the pipeline. This is OK because the register file has independent Read and Write ports. More specifically, the Reg/Decode stage of the pipeline uses the register file’s read port while the Write Back stage of the pipeline uses the register file’s write port. +3 = 32 min. (Y:12) busB Exec Unit Rb RA Do IUnit IF/ID Register ID/Ex Register Ex/Mem Register Mem/Wr Register Mux 1 Rt RFile WA Di Rt Rw Di I Rd 1 RegDst ALUSrc MemWr MemtoReg

Review: Pipeline Control “Data Stationary Control” The Main Control generates the control signals during Reg/Dec Control signals for Exec (ExtOp, ALUSrc, ...) are used 1 cycle later Control signals for Mem (MemWr Branch) are used 2 cycles later Control signals for Wr (MemtoReg MemWr) are used 3 cycles later Reg/Dec Exec Mem Wr ExtOp ExtOp ALUSrc ALUSrc ALUOp ALUOp Main Control RegDst RegDst The main control here is identical to the one in the single cycle processor. It generate all the control signals necessary for a given instruction during that instruction’s Reg/Decode stage. All these control signals will be saved in the ID/Exec pipeline register at the end of the Reg/Decode cycle. The control signals for the Exec stage (ALUSrc, ... etc.) come from the output of the ID/Exec register. That is they are delayed ONE cycle from the cycle they are generated. The rest of the control signals that are not used during the Exec stage is passed down the pipeline and saved in the Exec/Mem register. The control signals for the Mem stage (MemWr, Branch) come from the output of the Exec/Mem register. That is they are delayed two cycles from the cycle they are generated. Finally, the control signals for the Wr stage (MemtoReg & RegWr) come from the output of the Exec/Wr register: they are delayed three cycles from the cycle they are generated. +2 = 45 min. (Y:45) Ex/Mem Register IF/ID Register ID/Ex Register Mem/Wr Register MemWr MemWr MemWr Branch Branch Branch MemtoReg MemtoReg MemtoReg MemtoReg RegWr RegWr RegWr RegWr

Review: Pipeline Summary Pipeline Processor: Natural enhancement of the multiple clock cycle processor Each functional unit can only be used once per instruction If a instruction is going to use a functional unit: it must use it at the same stage as all other instructions Pipeline Control: Each stage’s control signal depends ONLY on the instruction that is currently in that stage Let me summarized the different processor implementations we have learned in the last few lectures. First the single cycle processor has two disadvantages: (1) long cycle time. (2) And may be more importantly, this long cycle time is too long for all instructions except the load. The multiple clock cycle processor solve these problems by: (1) Dividing the instructions into smaller steps. (2) Then execute each step (instead of the entire instruction) in one clock cycle. The pipeline processor is just a natural extension of the multiple clock cycle processor. In order to convert the multiple clock cycle processor into a pipeline processor, we need to make sure each functional unit is only used once per instruction. Furthermore, if a instruction is going to use a functional unit, it must use it at the SAME stage as all other instructions use it. Finally, for the pipeline control, the most important thing you need to remember is that the control signals at each stage depends ONLY on the instruction that is currently in that stage. +2 = 75 min. (Y:55)

Outline of Today’s Lecture Recap and Introduction (5 minutes) Introduction to Hazards (15 minutes) Forwarding (25 minutes) 1 cycle Load Delay (5 minutes) 1 cycle Branch Delay (15 minutes) What makes pipelining hard Summary (5 minutes) Here is the outline of today’s lecture. In the next 15 minutes or so, I will give you an overview of the pipelining idea. Then I will show you the pipelined datapath as well as the pipelined control. One problem will come up again is the potential race condition between the address lines and the write enable line of data memory and register file. I will show you how to avoid that. I will also show you how the pipelined datapath works with multiple instructions in it. Finally, I will summarize the differences between the single cycle, multiple cycle and pipeline implementations of the processor. +1 = 6 min. (X:46)

Introduction to Hazards Limits to pipelining: Hazards prevent next instruction from executing during its designated clock cycle structural hazards: HW cannot support this combination of instructions data hazards: instruction depends on result of prior instruction still in the pipeline control hazards: pipelining of branches & other instructionsCommon solution is to stall the pipeline until the hazardbubbles” in the pipeline

Single Memory is a Structural Hazard Time (clock cycles) ALU I n s t r. O r d e Mem Reg Mem Reg Load ALU Mem Reg Instr 1 ALU Mem Reg Instr 2 ALU Instr 3 Mem Reg Mem Reg ALU Mem Reg Instr 4

Option 1: Stall to resolve Memory Structural Hazard Time (clock cycles) ALU I n s t r. O r d e Mem Reg Mem Reg Load ALU Mem Reg Instr 1 ALU Mem Reg Instr 2 Instr 3(stall) ALU Mem Reg bubble Instr 4 ALU Mem Reg

Option 2: Duplicate to Resolve Structural Hazard Separate Instruction Cache (Im) & Data Cache (Dm) Time (clock cycles) ALU Im Reg Dm I n s t r. O r d e Load ALU Im Reg Dm Instr 1 ALU Im Reg Dm Instr 2 Im ALU Reg Dm Instr 3 ALU Im Reg Dm Instr 4

add r1 ,r2,r3 sub r4, r1 ,r3 and r6, r1 ,r7 or r8, r1 ,r9 Data Hazard on r1 add r1 ,r2,r3 sub r4, r1 ,r3 and r6, r1 ,r7 or r8, r1 ,r9 xor r10, r1 ,r11

Data Hazard on r1: (Figure 6.30, page 397, P&H) Dependencies backwards in time are hazards Time (clock cycles) IF ID/RF EX MEM WB add r1,r2,r3 Im Reg ALU Reg Dm I n s t r. O r d e ALU sub r4,r1,r3 Im Reg Dm Reg ALU and r6,r1,r7 Im Reg Dm Reg or r8,r1,r9 Im Reg ALU Dm Reg ALU Im Reg Dm xor r10,r1,r11

Option1: HW Stalls to Resolve Data Hazard Dependencies backwards in time are hazards Time (clock cycles) IF ID/RF EX MEM WB add r1,r2,r3 Im Reg ALU Reg Dm I n s t r. O r d e sub r4, r1,r3 bubble Im Reg ALU Dm Reg and r6,r1,r7 ALU Im Reg Dm or r8,r1,r9 ALU Im Reg xor r10,r1,r11 Im Reg

But recall use of “Data Stationary Control” The Main Control generates the control signals during Reg/Dec Control signals for Exec (ExtOp, ALUSrc, ...) are used 1 cycle later Control signals for Mem (MemWr Branch) are used 2 cycles later Control signals for Wr (MemtoReg MemWr) are used 3 cycles later Reg/Dec Exec Mem Wr ExtOp ExtOp ALUSrc ALUSrc ALUOp ALUOp Main Control RegDst RegDst The main control here is identical to the one in the single cycle processor. It generate all the control signals necessary for a given instruction during that instruction’s Reg/Decode stage. All these control signals will be saved in the ID/Exec pipeline register at the end of the Reg/Decode cycle. The control signals for the Exec stage (ALUSrc, ... etc.) come from the output of the ID/Exec register. That is they are delayed ONE cycle from the cycle they are generated. The rest of the control signals that are not used during the Exec stage is passed down the pipeline and saved in the Exec/Mem register. The control signals for the Mem stage (MemWr, Branch) come from the output of the Exec/Mem register. That is they are delayed two cycles from the cycle they are generated. Finally, the control signals for the Wr stage (MemtoReg & RegWr) come from the output of the Exec/Wr register: they are delayed three cycles from the cycle they are generated. +2 = 45 min. (Y:45) Ex/Mem Register IF/ID Register ID/Ex Register Mem/Wr Register MemWr MemWr MemWr Branch Branch Branch MemtoReg MemtoReg MemtoReg MemtoReg RegWr RegWr RegWr RegWr

Option 1: How HW really stalls pipeline HW doesn’t change PC => keeps fetching same instruction & sets control signals to to benign values (0) Time (clock cycles) IF ID/RF EX MEM WB add r1,r2,r3 Im Reg ALU Reg Dm I n s t r. O r d e bubble Im stall bubble Im stall bubble Im stall sub r4,r1,r3 ALU Im Reg Dm and r6,r1,r7 ALU Im Reg Dm

Option 2: SW inserts indepdendent instructions Worst case inserts NOP instructions Time (clock cycles) IF ID/RF EX MEM WB add r1,r2,r3 Im Reg ALU Reg Dm I n s t r. O r d e ALU nop Im Reg Dm Reg ALU Im Dm Reg nop Reg Im Reg ALU Dm Reg nop sub r4,r1,r3 ALU Im Reg Dm and r6,r1,r7 ALU Im Reg Dm

Option 3 Insight: Data is available! (Figure 6.41, page 413, P&H) Pipeline registers already contain needed data Time (clock cycles) IF ID/RF EX MEM WB add r1,r2,r3 Im Reg ALU Reg Dm I n s t r. O r d e ALU sub r4,r1,r3 Im Reg Dm Reg ALU and r6,r1,r7 Im Reg Dm Reg or r8,r1,r9 Im Reg ALU Dm Reg ALU Im Reg Dm xor r10,r1,r11

HW Change for “Forwarding” (Bypassing):(Figure 6.42, page 414, P&H) Increase multiplexors to add paths from pipeline registers Assumes register read during write gets new value (otherwise more results to be forwarded)

From Last Lecture: The Delay Load Phenomenon Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Clock I0: Load Ifetch Reg/Dec Exec Mem Wr Plus 1 Ifetch Reg/Dec Exec Mem Wr Plus 2 Ifetch Reg/Dec Exec Mem Wr Plus 3 Ifetch Reg/Dec Exec Mem Wr Plus 4 Ifetch Reg/Dec Exec Mem Wr Although Load is fetched during Cycle 1: The data is NOT written into the Reg File until the end of Cycle 5 We cannot read this value from the Reg File until Cycle 6 3-instruction delay before the load take effect Similarly, although the load instruction is fetched during cycle 1, the data is not written into the register file until the end of Cycle 5. Consequently, the earliest time we can read this value from the register file is in Cycle 6. In other words, there is a 3-instruction delay between the load instruction and the instruction that can use the result of the load. This is referred to as Data Hazard in the text book. We will show in the next lecture that by clever design techniques, we can reduce this delay to ONE instruction. That is if the load instruction is issued in Cycle 1, the instruction comes right next to it (Plus 1) cannot use the result of this load but the next-next instruction (Plus 2) can. +2 = 73 min. (Y:53)

Forwarding reduces Data Hazard to 1 cycle: (Figure 6.47, page 420 P&H) Time (clock cycles) IF ID/RF EX MEM WB lw r1, 0(r2) Im Reg ALU Reg Dm I n s t r. O r d e ALU sub r4,r1,r6 Im Reg Dm Reg ALU and r6,r1,r7 Im Reg Dm Reg or r8,r1,r9 Im Reg ALU Dm Reg

Option1: HW Stalls to Resolve Data Hazard “Interlock”: checks for hazard & stalls Time (clock cycles) IF ID/RF EX MEM WB lw r1, 0(r2) Im Reg ALU Reg Dm I n s t r. O r d e bubble Im stall ALU sub r4,r1,r3 Im Reg Dm Reg ALU and r6,r1,r7 Im Reg Dm Reg or r8,r1,r9 Im Reg ALU Dm Reg

Option 2: SW inserts independent instructions Worst case inserts NOP instructions MIPS I solution: No HW checking Time (clock cycles) IF ID/RF EX MEM WB ALU Im Reg Dm lw r1, 0(r2) I n s t r. O r d e ALU Im Reg Dm nop ALU sub r4,r1,r3 Im Reg Dm Reg ALU and r6,r1,r7 Im Reg Dm Reg or r8,r1,r9 Im Reg ALU Dm Reg

Software Scheduling to Avoid Load Hazards Try producing fast code for a = b + c; d = e – f; assuming a, b, c, d ,e, and f in memory. Slow code: LW Rb,b LW Rc,c ADD Ra,Rb,Rc SW a,Ra LW Re,e LW Rf,f SUB Rd,Re,Rf SW d,Rd

Slow code: LW Rb,b LW Rc,c ADD Ra,Rb,Rc SW a,Ra LW Re,e LW Rf,f SUB Rd,Re,Rf SW d,Rd Fast code: LW Rb,b LW Rc,c LW Re,e ADD Ra,Rb,Rc LW Rf,f SW a,Ra SUB Rd,Re,Rf SW d,Rd

From Last Lecture: The Delay Branch Phenomenon Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 Cycle 10 Cycle 11 Clk 12: Beq (target is 1000) Ifetch Reg/Dec Exec Mem Wr Ifetch Reg/Dec Exec Mem Wr 16: R-type 20: R-type Ifetch Reg/Dec Exec Mem Wr 24: R-type Ifetch Reg/Dec Exec Mem Wr 1000: Target of Br Ifetch Reg/Dec Exec Mem Wr Although Beq is fetched during Cycle 4: Target address is NOT written into the PC until the end of Cycle 7 Branch’s target is NOT fetched until Cycle 8 3-instruction delay before the branch take effect Notice that although the branch instruction is fetched during Cycle 4, its target address is NOT written into the Program Counter until the end of clock Cycle 7. Consequently, the branch target instruction is not fetched until clock Cycle 8. In other words, there is a 3-instruction delay between the branch instruction is issued and the branch effect is felt in the program. This is referred to as Branch Hazard in the text book. And as we will show in the next lecture, by clever design techniques, we can reduce the delay to ONE instruction. That is if the branch instruction in Location 12 is issued in Cycle 4, we will only execute one more sequential instruction (Location 16) before the branch target is executed. +2 = 71 min. (Y:51)

Control Hazard on Branches: 3 stage stall

Branch Stall Impact If CPI = 1, 30% branch, Stall 3 cycles => new CPI = 1.9! 2 part solution: Determine branch taken or not sooner, AND Compute taken branch address earlier MIPS branch tests = 0 or ° 0 Solution Option 1: Move Zero test to ID/RF stage Adder to calculate new PC in ID/RF stage 1 clock cycle penalty for branch vs. 3

Option 1: move HW forward to reduce branch delay Execute Addr. Calc. Memory Access Instr. Decode Reg. Fetch Instruction Fetch Write Back

Branch Delay now 1 clock cycle Instruction Fetch Instr. Decode Reg. Fetch Execute Addr. Calc. Memory Access Write Back

Option 2: Define Branch as Delayed Worst case, SW inserts NOP into branch delay Where to get instructions to fill branch delay slot? Before branch instruction, example sw r1,0(r2); beqd r0,r2,T change to, beqd r0,r2,T; sw r1,0(r2) From the target address: only valuable when branch From fall through: only valuable when don’t branch Compiler effectiveness for single branch delay slot: Fills about 60% of branch delay slots About 80% of instructions executed in branch delay slots useful in computation about 50% (60% x 80%) of slots usefully filled

Complete data Path with Hazard detection and Forwarding Figure 6.51 in the text

Example Text Figure 6.52

When is pipelining hard? Interrupts: 5 instructions executing in 5 stage pipeline How to stop the pipeline? Restrart? Who caused the interrupt? Stage Problem interrupts occurring IF Page fault on instruction fetch; misaligned memory access; memory-protection violation ID Undefined or illegal opcode EX Arithmetic interrupt MEM Page fault on data fetch; misaligned memory access; memory-protection violation Load with data page fault, Add with instruction page fault? Solution 1: interrupt vector/instruction 2: interrupt ASAP, restart everything incomplete

Data path with Exception Handling, Text Figure 6.55, add a Cause register, an Exception PC, and constant addr. of Exception Handeling routine

Review: Summary of Pipelining Basics Speed Up Š Pipeline Depth (number of stages); if ideal CPI is 1, then: Hazards limit performance on computers: structural: need more HW resources data: need forwarding, compiler scheduling control: early evaluation & PC, delayed branch, prediction Increasing length of pipe increases impact of hazards since pipelining helps instruction bandwidth, not latency Compilers key to reducing cost of data and control hazards load delay slots branch delay slots Exceptions, Instruction Set, FP makes pipelining harder Longer pipelines => Branch prediction, more instruction parallelism?