Appendix A Pipelining: Basic and Intermediate Concept

Slides:



Advertisements
Similar presentations
PipelineCSCE430/830 Pipeline: Introduction CSCE430/830 Computer Architecture Lecturer: Prof. Hong Jiang Courtesy of Prof. Yifeng Zhu, U of Maine Fall,
Advertisements

CMSC 611: Advanced Computer Architecture Pipelining Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted.
CMSC 611: Advanced Computer Architecture Pipelining Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted.
Chapter 8. Pipelining.
Review: Pipelining. Pipelining Laundry Example Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold Washer takes 30 minutes Dryer.
Pipelining I (1) Fall 2005 Lecture 18: Pipelining I.
Pipelining Hwanmo Sung CS147 Presentation Professor Sin-Min Lee.
Computer Architecture
Pipelining Preview Basics & Challenges
CS252/Patterson Lec 1.1 1/17/01 Pipelining: Its Natural! Laundry Example Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold Washer.
Pipelining - Hazards.
COMP381 by M. Hamdi 1 Pipeline Hazards. COMP381 by M. Hamdi 2 Pipeline Hazards Hazards are situations in pipelining where one instruction cannot immediately.
CPSC 614 Computer Architecture Lec 3 Pipeline Review EJ Kim Dept. of Computer Science Texas A&M University Adapted from CS 252 Spring 2006 UC Berkeley.
CSCE 430/830 Computer Architecture Basic Pipelining & Performance
Prof. John Nestor ECE Department Lafayette College Easton, Pennsylvania Computer Organization Pipelined Processor Design 1.
Chapter 5 Pipelining and Hazards
EENG449b/Savvides Lec 3.1 1/20/04 January 20, 2004 Prof. Andreas Savvides Spring EENG 449bG/CPSC 439bG Computer.
Computer ArchitectureFall 2007 © October 24nd, 2007 Majd F. Sakr CS-447– Computer Architecture.
Computer ArchitectureFall 2007 © October 22nd, 2007 Majd F. Sakr CS-447– Computer Architecture.
DLX Instruction Format
1 COMP 206: Computer Architecture and Implementation Montek Singh Mon., Sep 9, 2002 Topic: Pipelining Basics.
Pipelining - II Adapted from CS 152C (UC Berkeley) lectures notes of Spring 2002.
Appendix A Pipelining: Basic and Intermediate Concepts
Computer ArchitectureFall 2008 © October 6th, 2008 Majd F. Sakr CS-447– Computer Architecture.
ENGS 116 Lecture 51 Pipelining and Hazards Vincent H. Berk September 30, 2005 Reading for today: Chapter A.1 – A.3, article: Patterson&Ditzel Reading for.
Prof. John Nestor ECE Department Lafayette College Easton, Pennsylvania ECE Computer Organization Lecture 17 - Pipelined.
Pipeline Hazard CT101 – Computing Systems. Content Introduction to pipeline hazard Structural Hazard Data Hazard Control Hazard.
CSC 4250 Computer Architectures September 15, 2006 Appendix A. Pipelining.
Lecture 7: Pipelining Review Kai Bu
Pipelining. 10/19/ Outline 5 stage pipelining Structural and Data Hazards Forwarding Branch Schemes Exceptions and Interrupts Conclusion.
CPE 731 Advanced Computer Architecture Pipelining Review Dr. Gheith Abandah Adapted from the slides of Prof. David Patterson, University of California,
EECS 252 Graduate Computer Architecture Lecture 3  0 (continued) Review of Instruction Sets, Pipelines, Caches and Virtual Memory January 25 th, 2012.
Appendix A - Pipelining CSCI/ EENG – W01 Computer Architecture 1 Prof. Babak Beheshti Slides based on the PowerPoint Presentations created by David.
Lecture 05: Pipelining Basics & Hazards Kai Bu
Pipeline Review. 2 Review from last lecture Tracking and extrapolating technology part of architect’s responsibility Expect Bandwidth in disks, DRAM,
Computer Science Education
Integrated Circuits Costs
CSC 7080 Graduate Computer Architecture Lec 3 – Pipelining: Basic and Intermediate Concepts (Appendix A) Dr. Khalaf Notes adapted from: David Patterson.
EEL5708 Lotzi Bölöni EEL 5708 High Performance Computer Architecture Pipelining.
Pipelining (I). Pipelining Example  Laundry Example  Four students have one load of clothes each to wash, dry, fold, and put away  Washer takes 30.
CSE 340 Computer Architecture Summer 2014 Basic MIPS Pipelining Review.
CS.305 Computer Architecture Enhancing Performance with Pipelining Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005, and from.
1 Designing a Pipelined Processor In this Chapter, we will study 1. Pipelined datapath 2. Pipelined control 3. Data Hazards 4. Forwarding 5. Branch Hazards.
ECE 232 L18.Pipeline.1 Adapted from Patterson 97 ©UCBCopyright 1998 Morgan Kaufmann Publishers ECE 232 Hardware Organization and Design Lecture 18 Pipelining.

CSIE30300 Computer Architecture Unit 04: Basic MIPS Pipelining Hsin-Chou Chi [Adapted from material by and
Oct. 18, 2000Machine Organization1 Machine Organization (CS 570) Lecture 4: Pipelining * Jeremy R. Johnson Wed. Oct. 18, 2000 *This lecture was derived.
Pipelining Example Laundry Example: Three Stages
CMSC 611: Advanced Computer Architecture Pipelining Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted.
CMPUT Computer Systems and Architecture1 CMPUT429/CMPE382 Winter 2001 Topic3-Pipelining José Nelson Amaral (Adapted from David A. Patterson’s CS252.
CS252/Patterson Lec 1.1 1/17/01 معماري کامپيوتر - درس نهم pipeline برگرفته از درس : Prof. David A. Patterson.
11 Pipelining Kosarev Nikolay MIPT Oct, Pipelining Implementation technique whereby multiple instructions are overlapped in execution Each pipeline.
Lecture 9. MIPS Processor Design – Pipelined Processor Design #1 Prof. Taeweon Suh Computer Science Education Korea University 2010 R&E Computer System.
CMSC 611: Advanced Computer Architecture Pipelining Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted.
Advanced Computer Architecture CS 704 Advanced Computer Architecture Lecture 10 Computer Hardware Design (Pipeline Datapath and Control Design) Prof. Dr.
Lecture 18: Pipelining I.
Review: Instruction Set Evolution
CMSC 611: Advanced Computer Architecture
5 Steps of MIPS Datapath Figure A.2, Page A-8
ECE232: Hardware Organization and Design
Appendix A - Pipelining
Chapter 3: Pipelining 순천향대학교 컴퓨터학부 이 상 정 Adapted from
CMSC 611: Advanced Computer Architecture
Serial versus Pipelined Execution
CSC 4250 Computer Architectures
An Introduction to pipelining
Pipelining Appendix A and Chapter 3.
Pipelining Hazards.
Presentation transcript:

Appendix A Pipelining: Basic and Intermediate Concept Computer Architecture 計算機結構 Appendix A Pipelining: Basic and Intermediate Concept Ping-Liang Lai (賴秉樑)  

Outline A.1 Introduction A.2 The Major Hurdle of Pipelining – Pipeline Hazards

Pipelining: Its Natural! Laundry Example Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold Washer takes 30 minutes Dryer takes 40 minutes “Folder” takes 20 minutes A B C D

Sequential Laundry 6 PM 7 8 9 10 11 Midnight 30 40 20 30 40 20 30 40 Time 30 40 20 30 40 20 30 40 20 30 40 20 T a s k O r d e A B C D Sequential laundry takes 6 hours for 4 loads If they learned pipelining, how long would laundry take?

Pipelined Laundry Start Work ASAP 6 PM 7 8 9 10 11 Midnight Time 30 40 20 T a s k O r d e A B C D Pipelined laundry takes 3.5 hours for 4 loads

Computer Pipelines Pipeline properties Execute billions of instructions, so throughput is what matters. Pipelining doesn’t help latency of single task, it helps throughput of entire workload; Pipeline rate limited by slowest pipeline stage; Multiple tasks operating simultaneously; Potential speedup = Number pipe stages; Unbalanced lengths of pipe stages reduces speedup; Time to “fill” pipeline and time to “drain” it reduces speedup. The time per instruction on the pipelined processor in ideal conditions is equal to, However, the stages will not be perfectly balanced. Pipelining yields a reduction in the average execution time per instruction.

The Basics of a RISC Instruction Set (1/2) In this book, we use a RISC (Reduced Instruction Set Computer) architecture or load-store architecture to illustrate the basic concepts. All operations on data apply to data in register and typically change the entire register (32 or 64 bits per register). The only operations that affect memory are load and store operation. The instruction formats are few in number with all instructions typically being one size. These simple three properties lead to dramatic simplifications in the implementation of pipelining.

The Basics of a RISC Instruction Set (2/2) MIPS 64 Instructions have a D on the start or end of the mnemonic, for example: DADD, LD; 32 registers, and R0 = 0; Three classes of instructions ALU instruction: add (DADD), subtract (DSUB), and logical operations (such as AND or OR); Load and store instructions: Branches and jumps:

RISC Instruction Set Every instruction can be implemented in at most 5 clock cycles Instruction fetch cycle (IF): send PC to memory, fetch the current instruction from memory, and update PC to the next sequential PC by adding 4 to the PC. Instruction decode/register fetch cycle (ID): decode the instruction, read the registers corresponding to register source specifiers from the register file. Execution/effective address cycle (EX): perform Memory reference, Register-Register ALU instruction and Register-Immediate ALU instruction. Memory access (MEM): perform load/store instructions. Write-back cycle (WB): Register-Register ALU instruction or Load instruction. In this implementation, branch requires 2 cycles, store requires 4 cycles, and all other instructions require 5 cycles.

5 Steps of MIPS Datapath 4 MUX Adder MUX MUX MUX IR <= mem[PC]; Instruction Fetch Instr. Decode Reg. Fetch Execute Addr. Calc Memory Access Write Back Next PC MUX 4 Adder Next SEQ PC Zero? RS1 Reg File Address RS2 MUX Memory ALU Inst Memory Data L M D RD MUX MUX Sign Extend Imm IR <= mem[PC]; PC <= PC + 4 WB Data Reg[IRrd] <= Reg[IRrs] opIRop Reg[IRrt]

Classic 5-Stage Pipeline for a RISC Each cycle the hardware will initiate a new instruction and will be executing some part of the five different instructions. Simple; However, be ensure that the overlap of instructions in the pipeline cannot cause such a conflict. (also called Hazard) Clock number Instruction number 1 2 3 4 5 6 7 8 9 Instruction i IF ID EX MEM WB Instruction i+1 Instruction i+2 Instruction i+3 Instruction i+4 This separation is done by introducing pipeline registers between successive stages of the pipeline.

Visualizing Pipelining Time (clock cycles or processor cycles) Reg ALU DMem Ifetch Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 6 Cycle 7 Cycle 5 I n s t r. O r d e

5 Steps of MIPS Datapath 4 MUX Adder MUX MUX MUX IR <= mem[PC]; Instruction Fetch Instr. Decode Reg. Fetch Execute Addr. Calc Memory Access Write Back Next PC IF/ID ID/EX MEM/WB EX/MEM MUX Next SEQ PC Next SEQ PC 4 Adder Zero? RS1 Reg File Address Memory RS2 MUX ALU Memory Data MUX MUX IR <= mem[PC]; PC <= PC + 4 Sign Extend Imm WB Data A <= Reg[IRrs]; B <= Reg[IRrt] RD RD RD rslt <= A opIRop B WB <= rslt Reg[IRrd] <= WB

RISC Pipelining Property Function units are used in different cycles, and hence overlapping the execution of multiple instructions introduces relatively few conflicts. Separate instruction and data memories: Eliminate a conflict for accessing a single memory. The Register file is used in the two stages Read from register in ID (second half of CC), and write to register in WB (first half of CC). Figure on page.12 does not deal with the PC Increment and store the PC every clock, and done it during the IF stage. A branch does not change the PC until the ID stage (have an adder to compute the potential branch target).

5 Steps of MIPS Datapath 4 Data stationary control MUX Adder MUX MUX Instruction Fetch Instr. Decode Reg. Fetch Execute Addr. Calc Memory Access Write Back Next PC IF/ID ID/EX MEM/WB EX/MEM MUX Next SEQ PC Next SEQ PC 4 Adder Zero? RS1 Reg File Address MUX Memory RS2 ALU Memory Data MUX MUX Sign Extend Imm WB Data RD RD RD Data stationary control local decode for each instruction phase / pipeline stage

Pipelining Performance (1/2) Pipelining increases throughput, not reduce the execution time of an individual instruction. In face, slightly increases the execution time (an instruction) due to overhead in the control of the pipeline. Practical depth of a pipeline is limits by increasing execution time. Pipeline overhead Unbalanced pipeline stage; Pipeline stage overhead; Pipeline register delay; Clock skew.

Pipeline Performance (2/2) Example 1 (p.A-10): Consider the unpipelined processor in previous section. Assume that it has a 1ns clock cycle and that it uses 4 cycles for ALU operations and branches, and 5 cycles for memory operations. Assume that the relative frequencies of these operations are 40%, 20%, and 40%, respectively. Suppose that due to clock skew and setup, pipelining the processor adds 0.2 ns of overhead to the clock. Ignoring any latency impact, how much speedup in the instruction execution rate will we gain from a pipeline? Answer The average instruction execution time on the unpipelined processor is In the pipeline, the clock must run at the speed of the slowest stage plus overhead, which will be 1+0.2 ns.

Outline A.1 Introduction A.2 The Major Hurdle of Pipelining – Pipeline Hazards

Pipeline Hazards Hazard, that prevent the next instruction in the instruction steam. Structural hazards: resource conflict. Data hazards: an instruction depends on the results of a previous instruction Control hazards: arise from the pipelining of branches and other instructions that change the PC. Hazards in pipelines can make it necessary to stall the pipeline. Stall will reduce pipeline performance.

Performance with Pipeline Stall (1/2)

Performance with Pipeline Stall (2/2)

Structure Hazards Structure Hazards If some combination of instructions cannot be accommodated because of resource conflict (resources are pipelining of functional units and duplication of resources). Occur when some functional unit is not fully pipelined, or No enough duplicated resources.

One Memory Port/Structural Hazards Time (clock cycles) Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 ALU I n s t r. O r d e Load Ifetch Reg DMem Reg Reg ALU DMem Ifetch Instr 1 Reg ALU DMem Ifetch Instr 2 ALU Instr 3 Ifetch Reg DMem Reg Reg ALU DMem Ifetch Instr 4

One Memory Port/Structural Hazards Time (clock cycles) Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 ALU I n s t r. O r d e Load Ifetch Reg DMem Reg Reg ALU DMem Ifetch Instr 1 Reg ALU DMem Ifetch Instr 2 Bubble Stall Reg ALU DMem Ifetch Instr 3 How do you “bubble” the pipe?

Performance on Structure Hazard Example 2 (p.A-13): Let’s see how much the load structure hazard might cost. Suppose that data reference constitute 40% of the mix, and that the ideal CPI of the pipelined processor, ignoring the structure hazard, is 1. Assume that the processor with the structure hazard has a clock rate that is 1.05 times higher than the clock rate of processor without the hazard. Disregarding any other performance losses, is the pipeline with or without the structure hazard faster, and by how much? Answer The average instruction execution time on the unpipelined processor is How to solve structure hazard? (Next slide)

Summary of Structure Hazard An alternative to this structure hazard, designer could provide a separate memory access for instructions. Splitting the cache into separate instruction and data caches, or Use a set of buffers, usually called instruction buffers, to hold instruction; However, it will increase cost overhead. Ex1: pipelining function units or duplicated resources is a high cost; Ex2: require twice bandwidth and often have higher bandwidth at the pins to support both an instruction and a data cache access every cycle; Ex3: a floating-point multiplier consumes lots of gates. If the structure hazard is rare, it may not be worth the cost to avoid it.

Data Hazards Data Hazards Occur when the pipeline changes the order of read/write accesses to operands so that the order differs from the order seen by sequentially executing instructions on an unpipelined processor. Occur when some functional unit is not fully pipelined, or No enough duplicated resources. A example of pipelined execution DADD R1, R2, R3 DSUB R4, R1, R5 AND R6, R1, R7 OR R8, R1, R9 XOR R10, R1, R11

Data Hazard on R1 DADD R1,R2,R3 DSUB R4,R1,R5 AND R6,R1,R7 OR R8,R1,R9 Time (clock cycles) IF ID/RF EX MEM WB I n s t r. O r d e Reg ALU DMem Ifetch DADD R1,R2,R3 DSUB R4,R1,R5 AND R6,R1,R7 OR R8,R1,R9 XOR R10,R1,R11

Three Generic Data Hazards (1/3) Read After Write (RAW) InstrJ tries to read operand before InstrI writes it Caused by a “dependence” (in compiler nomenclature). This hazard results from an actual need for communication. I: ADD R1,R2,R3 J: SUB R4,R1,R3

Three Generic Data Hazards (2/3) Write After Read (WAR) InstrJ writes operand before InstrI reads it Called an “anti-dependence” by compiler writers. This results from reuse of the name “R1”. Can’t happen in MIPS 5 stage pipeline because: All instructions take 5 stages, and Reads are always in stage 2, and Writes are always in stage 5 I: SUB R4,R1,R3 J: ADD R1,R2,R3 K: MUL R6,R1,R7

Three Generic Data Hazards (3/3) Write After Write (WAW) InstrJ writes operand before InstrI writes it. Called an “output dependence” by compiler writers This also results from the reuse of name “R1”. Can’t happen in MIPS 5 stage pipeline because: All instructions take 5 stages, and Writes are always in stage 5. Will see WAR and WAW in more complicated pipes. I: SUB R1,R4,R3 J: ADD R1,R2,R3 K: MUL R6,R1,R7

Forwarding to Avoid Data Hazard Time (clock cycles) I n s t r. O r d e Reg ALU DMem Ifetch DADD R1,R2,R3 DSUB R4,R1,R5 AND R6,R1,R7 OR R8,R1,R9 XOR R10,R1,R11

HW Change for Forwarding ID/EX EX/MEM MEM/WR NextPC mux Registers ALU Data Memory mux mux Immediate What circuit detects and resolves this hazard?

Forwarding to Avoid LW-SW Data Hazard Time (clock cycles) I n s t r. O r d e Reg ALU DMem Ifetch DADD R1,R2,R3 LD R4,0(R1) SD R4,12(R1) OR R8,R6,R9 XOR R10,R9,R11

Data Hazard Even with Forwarding Time (clock cycles) Reg ALU DMem Ifetch I n s t r. O r d e LD R1,0(R2) DSUB R4,R1,R6 DAND R6,R1,R7 OR R8,R1,R9

Data Hazard Even with Forwarding Time (clock cycles) Reg ALU DMem Ifetch I n s t r. O r d e LD R1,0(R2) Reg Ifetch ALU DMem Bubble DSUB R4,R1,R6 Ifetch ALU DMem Reg Bubble AND R6,R1,R7 Bubble Ifetch Reg ALU DMem OR R8,R1,R9 How is this detected?

Branch Hazards Recall a branch Taken: if a branch changes the PC to its target address, it is a taken branch. Not taken: if it falls through, it is not taken, or untaken. Figure shows that the simplest method of dealing with branches is to redo the fetch of instruction following a branch Detect the branch during ID (when instruction are decoded). Clock number Branch instruction IF ID EX MEM WB Branch successor Branch successor + 1 Branch successor + 2

Control Hazard on Branches Three Stage Stall Reg ALU DMem Ifetch 10: BEQ R1,R3,36 14: AND R2,R3,R5 18: OR R6,R1,R7 22: ADD R8,R1,R9 36: XOR R10,R1,R11 What do you do with the 3 instructions in between? How do you do it? Where is the “commit”?

Branch Stall Impact If CPI = 1, 30% branch, Two part solution Stall 3 cycles => new CPI = 1.9! Two part solution Determine branch taken or not sooner, AND Compute taken branch address earlier MIPS branch tests if register = 0 or  0 MIPS Solution Move Zero test to ID/RF stage Adder to calculate new PC in ID/RF stage 1 clock cycle penalty for branch versus 3

Pipelined MIPS Datapath Instruction Fetch Instr. Decode Reg. Fetch Execute Addr. Calc Memory Access Write Back Next PC Next SEQ PC ID/EX EX/MEM MEM/WB MUX 4 Adder IF/ID Adder Zero? RS1 Address Reg File Memory RS2 ALU Memory Data MUX MUX Sign Extend Imm WB Data RD RD RD Interplay of instruction set design and cycle time.

Four Branch Hazard Alternatives #1: Stall until branch direction is clear #2: Predict Branch Not Taken Execute successor instructions in sequence “Squash” instructions in pipeline if branch actually taken Advantage of late pipeline state update 47% MIPS branches not taken on average PC+4 already calculated, so use it to get next instruction #3: Predict Branch Taken 53% MIPS branches taken on average But haven’t calculated branch target address in MIPS MIPS still incurs 1 cycle branch penalty Other machines: branch target known before outcome

Four Branch Hazard Alternatives #4: Delayed Branch Define branch to take place AFTER a following instruction branch instruction sequential successor1 sequential successor2 ........ sequential successorn branch target if taken 1 slot delay allows proper decision and branch target address in 5 stage pipeline MIPS uses this

Scheduling Branch Delay Slots A. From before branch B. From branch target C. From fall through A is the best choice, fills delay slot & reduces instruction count (IC) In B, the sub instruction may need to be copied, increasing IC In B and C, must be okay to execute sub when branch fails add $1,$2,$3 if $2=0 then add $1,$2,$3 if $1=0 then sub $4,$5,$6 delay slot delay slot add $1,$2,$3 if $1=0 then delay slot sub $4,$5,$6 becomes becomes becomes if $2=0 then add $1,$2,$3 add $1,$2,$3 if $1=0 then sub $4,$5,$6 add $1,$2,$3 if $1=0 then sub $4,$5,$6

Delayed Branch Compiler effectiveness for single branch delay slot: Fills about 60% of branch delay slots. About 80% of instructions executed in branch delay slots useful in computation. About 50% (60% x 80%) of slots usefully filled. Delayed Branch downside: As processor go to deeper pipelines and multiple issue, the branch delay grows and need more than one delay slot Delayed branching has lost popularity compared to more expensive but more flexible dynamic approaches. Growth in available transistors has made dynamic approaches relatively cheaper.

Evaluating Branch Alternatives The effective pipeline speedup with branch penalties, assuming an ideal CPI of 1, is Because of the following: We obtain

Performance on Control Hazard (1/2) Example 3 (pA-25): for a deeper pipeline, such as that in a MIPS R4000, it takes at least three pipeline stages before the branch-target address is known and an additional cycle before the branch condition is evaluated, assuming no stalls on the registers in the conditional comparison, A three-stage delay leads to the branch penalties for the three simplest prediction schemes listed in the following Figure A.15. Find the effective additional to the CPI arising from branches for this pipeline, assuming the following frequencies: Unconditional branch 4% Conditional branch, untaken 6% Conditional branch, taken 10% Figure A.15 Branch scheme Penalty unconditional Penalty untaken Penalty taken Flush pipeline 2 3 Predicted taken Predicted un taken

Performance on Control Hazard (2/2) Answer Additions to the CPI from branch cost Branch scheme Unconditional branches Untaken conditional branches Taken conditional branches All branches Frequency of event 4% 6% 10% 20% Stall pipeline 0.08 0.18 0.30 0.56 Predicted taken 0.20 0.46 Predicted untaken 0.00 0.38