Lecture 7: Pipelining Review Kai Bu

Appendix C Lectures 4-6

Pipelining: start executing one instruction before completing the previous one.

Outline What’s Pipelining How Pipelining Works Pipeline Hazards Pipeline with Multicycle FP Operations

Laundry Example: Ann, Brian, Cathy, and Dave each have one load of clothes to wash, dry, and fold. Washer: 30 mins; dryer: 40 mins; folder: 20 mins.

Sequential Laundry: What would you do? [Figure: task order (A, B, C, D) against time in hours]

Pipelined Laundry: Observations. A task consists of a series of stages. Stages have dependencies, e.g., wash before dry. Multiple tasks run with overlapping stages, using different resources simultaneously to speed things up. The slowest stage determines the finish time. [Figure: task order (A, B, C, D) against time in hours]

Pipelined Laundry: Observations. There is no speedup for an individual task; e.g., A still takes 30 + 40 + 20 = 90 mins. But the average task execution time improves; e.g., 3.5 * 60 / 4 = 52.5 mins < 30 + 40 + 20 = 90 mins. [Figure: task order (A, B, C, D) against time in hours]
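The laundry arithmetic above can be checked with a short script. This is a minimal sketch, assuming each machine handles one load at a time and that a finished load can wait between machines; the function names are illustrative and not from the lecture.

```python
# Stage times in minutes, taken from the laundry example above.
STAGES = {"wash": 30, "dry": 40, "fold": 20}
LOADS = 4  # Ann, Brian, Cathy, Dave


def sequential_time(stages, loads):
    """Each load finishes every stage before the next load starts."""
    return loads * sum(stages.values())


def pipelined_time(stages, loads):
    """Loads overlap; each stage serves one load at a time, in order."""
    stage_free_at = [0] * len(stages)  # when each stage next becomes free
    for _ in range(loads):
        ready = 0  # when this load leaves the previous stage
        for i, duration in enumerate(stages.values()):
            start = max(ready, stage_free_at[i])
            stage_free_at[i] = start + duration
            ready = stage_free_at[i]
    return stage_free_at[-1]


seq = sequential_time(STAGES, LOADS)
pipe = pipelined_time(STAGES, LOADS)
print(f"sequential: {seq} mins, pipelined: {pipe} mins")
print(f"average per load when pipelined: {pipe / LOADS} mins")
```

The dryer, as the slowest stage, sets the pace: after the first load, one load finishes drying every 40 minutes.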

Assembly Line examples: auto, cola.

Pipelining: an implementation technique whereby multiple instructions are overlapped in execution, e.g., B washes while A dries. Essence: start executing one instruction before completing the previous one. Significance: makes CPUs fast.

Balanced Pipeline: equal-length pipe stages. E.g., wash, dry, and fold each take 40 mins, so the unpipelined laundry time is 40 x 3 mins per load, with 3 pipe stages (wash, dry, fold). [Figure: loads A, B, C, D entering the pipeline in successive 40-min time slots T1 to T4]

Balanced Pipeline: Performance. One task/instruction completes every 40 mins. Time per instruction on the pipelined machine = (time per instruction on the unpipelined machine) / (number of pipe stages). Speedup from pipelining = number of pipe stages.
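The speedup claim for a balanced pipeline holds in the limit of many tasks, because the first task still has to fill the pipeline. A minimal sketch under that idealized model (no stalls, no pipeline-register overhead); the function names are hypothetical:

```python
def unpipelined_time(n_tasks, n_stages, stage_time):
    """Every task runs all stages before the next task starts."""
    return n_tasks * n_stages * stage_time


def pipelined_time(n_tasks, n_stages, stage_time):
    """The first task needs n_stages slots; each later task finishes one slot later."""
    return (n_stages + n_tasks - 1) * stage_time


def speedup(n_tasks, n_stages, stage_time):
    return unpipelined_time(n_tasks, n_stages, stage_time) / pipelined_time(
        n_tasks, n_stages, stage_time
    )


# Four loads through the 3-stage, 40-minute laundry pipeline.
print(speedup(4, 3, 40))
# With many tasks the speedup approaches the number of pipe stages (3).
print(speedup(1_000_000, 3, 40))
```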

Pipelining Terminology
Latency: the time for an instruction to complete.
Throughput of a CPU: the number of instructions completed per second.
Clock cycle: everything in the CPU moves in lockstep, synchronized by the clock.
Processor cycle: the time required to move an instruction one step down the pipeline; = time required to complete a pipe stage; = max(times for completing all stages); = one or two clock cycles, but rarely more.
CPI: clock cycles per instruction.
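These definitions can be tied together numerically. The sketch below assumes hypothetical stage times in nanoseconds and ignores pipeline-register overhead; it shows that the processor cycle is set by the slowest stage, so pipelining raises throughput while slightly lengthening the latency of a single instruction.

```python
def processor_cycle(stage_times):
    """All stages advance in lockstep, so the slowest stage sets the cycle."""
    return max(stage_times)


def pipelined_latency(stage_times):
    """An instruction spends one processor cycle in each stage."""
    return len(stage_times) * processor_cycle(stage_times)


def throughput(stage_times):
    """Instructions completed per time unit once the pipeline is full."""
    return 1 / processor_cycle(stage_times)


# Hypothetical stage times (ns) for IF, ID, EX, MEM, WB; not from the lecture.
stage_times_ns = [10, 8, 10, 10, 7]

print("unpipelined latency:", sum(stage_times_ns), "ns")
print("processor cycle:", processor_cycle(stage_times_ns), "ns")
print("pipelined latency:", pipelined_latency(stage_times_ns), "ns")
print("throughput:", throughput(stage_times_ns), "instructions per ns")
```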

Outline What’s Pipelining How Pipelining Works Pipeline Hazards Pipeline with Multicycle FP Operations

RISC: Five-Stage Pipeline: How it works. Use separate instruction and data memories to eliminate conflicts over a single memory between instruction fetch (IF) and data memory access (MEM).

RISC: Five-Stage Pipeline: How it works. The register file is used in two stages (read in ID, write in WB), each taking half a clock cycle; within one clock cycle, the write happens before the read.

RISC: Five-Stage Pipeline How it works introduce pipeline registers between successive stages; pipeline registers store the results of a stage and use them as the input of the next stage.

RISC: Five-Stage Pipeline How it works

RISC: Five-Stage Pipeline: How it works. The diagram omits pipeline registers for simplicity, but they are required in an implementation.
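The overlapped execution that the five-stage diagrams depict can be reproduced as a text chart. A minimal sketch, assuming an ideal pipeline with no stalls; `pipeline_chart` is an illustrative helper, not part of the lecture.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_chart(n_instructions):
    """Return one row per instruction; column c holds the stage occupied in cycle c+1."""
    total_cycles = n_instructions + len(STAGES) - 1
    chart = []
    for i in range(n_instructions):
        row = [""] * total_cycles
        for offset, stage in enumerate(STAGES):
            row[i + offset] = stage  # instruction i enters IF in cycle i+1
        chart.append(row)
    return chart


for i, row in enumerate(pipeline_chart(4)):
    print(f"instr i+{i}: " + " ".join(f"{cell:>3}" for cell in row))
```

Reading a column top to bottom shows which stage each instruction occupies in that clock cycle, which is where hazards become visible.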

RISC: Reduced Instruction Set Computer. Each instruction takes at most 5 clock cycles: IF, ID, EX, MEM, WB.

Cycle 1, IF (Instruction Fetch): send the PC to memory; fetch the current instruction from memory; PC = PC + 4 (each instruction is 4 bytes).

Cycle 2, ID (Instruction Decode/register fetch): decode the instruction; read the registers corresponding to the register source specifiers.

Cycle 3, EX (Execution/effective address): the ALU operates on the operands from ID, performing one of 3 functions depending on the instruction type. Memory reference: the ALU adds the base register and the offset to form the effective address. Register-Register ALU instruction: the ALU performs the operation specified by the opcode on the values read from the register file. Register-Immediate ALU instruction: the ALU operates on the first value read from the register file and the sign-extended immediate.

Cycle 4, MEM (Memory access): for a load, the memory does a read using the effective address; for a store, the memory writes the data from the second register using the effective address.

Cycle 5, WB (Write-Back): for a Register-Register ALU or load instruction, write the result into the register file, whether it comes from the memory (for a load) or from the ALU (for an ALU instruction).

RISC: Reduced Instruction Set Computer. 3 classes of instructions:

Class 1, ALU (Arithmetic Logic Unit) instructions: operate on two registers, or on a register and a sign-extended immediate; store the result into a third register; e.g., add (DADD), subtract (DSUB), and the logical operations AND and OR.

Class 2, Load (LD) and store (SD) instructions: the operands are a base register and an offset; their sum (called the effective address) is used as a memory address. Load: a second register operand is the destination for the data loaded from memory. Store: a second register operand is the source of the data stored into memory.

Class 3, Branches and jumps: conditional transfers of control. A branch specifies the branch condition with a set of condition bits, or with a comparison between two registers or between a register and zero; it decides the branch destination by adding a sign-extended offset to the current PC (program counter).

MIPS Instruction: at most 5 clock cycles per instruction (IF, ID, EX, MEM, WB).

IF: IR ← Mem[PC]; NPC ← PC + 4;

ID: A ← Regs[rs]; B ← Regs[rt]; Imm ← sign-extended immediate field of IR (lower 16 bits);

EX: memory reference: ALUOutput ← A + Imm; register-register ALU: ALUOutput ← A func B; register-immediate ALU: ALUOutput ← A op Imm; branch: ALUOutput ← NPC + (Imm << 2); Cond ← (A == 0);

MEM: load: LMD ← Mem[ALUOutput]; store: Mem[ALUOutput] ← B; branch: if (Cond) PC ← ALUOutput;

WB: register-register ALU: Regs[rd] ← ALUOutput; register-immediate ALU: Regs[rt] ← ALUOutput; load: Regs[rt] ← LMD;
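The register-transfer steps above can be traced with a small unpipelined interpreter. This is an illustrative sketch only: the dictionary instruction encoding, the `step` function, and the `ALU_OPS` table are assumptions made for the example, while the temporaries mirror the slide names (NPC, A, B, Imm, ALUOutput, LMD, Cond).

```python
import operator


def sign_extend16(value):
    """Interpret the low 16 bits of value as a signed number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


ALU_OPS = {"add": operator.add, "sub": operator.sub,
           "and": operator.and_, "or": operator.or_}


def step(instr, regs, mem, pc):
    """Run one instruction through IF, ID, EX, MEM, WB; return the next PC."""
    kind = instr["kind"]
    npc = pc + 4                                   # IF: NPC <- PC + 4
    a = regs[instr.get("rs", 0)]                   # ID: A <- Regs[rs]
    b = regs[instr.get("rt", 0)]                   # ID: B <- Regs[rt]
    imm = sign_extend16(instr.get("imm", 0))       # ID: Imm
    if kind in ("load", "store"):                  # EX
        alu_output = a + imm
    elif kind == "alu_rr":
        alu_output = ALU_OPS[instr["op"]](a, b)
    elif kind == "alu_imm":
        alu_output = ALU_OPS[instr["op"]](a, imm)
    elif kind == "branch":
        alu_output = npc + (imm << 2)
        cond = (a == 0)
    if kind == "load":                             # MEM
        lmd = mem[alu_output]
    elif kind == "store":
        mem[alu_output] = b
    elif kind == "branch" and cond:
        npc = alu_output
    if kind == "alu_rr":                           # WB
        regs[instr["rd"]] = alu_output
    elif kind == "alu_imm":
        regs[instr["rt"]] = alu_output
    elif kind == "load":
        regs[instr["rt"]] = lmd
    return npc


regs = [0] * 8
regs[2] = 100
mem = {108: 42}
pc = step({"kind": "load", "rs": 2, "rt": 1, "imm": 8}, regs, mem, 0)
print(pc, regs[1])
```

A pipelined implementation performs the same transfers, but holds the temporaries in the pipeline registers between stages so that five instructions can be in flight at once.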

MIPS Instruction Demo (Prof. Gurpur Prabhu, Iowa State Univ): torial/PIPELINE/DLXimplem.html. Covers: load, store, register-register ALU, register-immediate ALU, branch.

Load

Store

Register-Register ALU

Register-Immediate ALU

Branch

Outline What’s Pipelining How Pipelining Works Pipeline Hazards Pipeline with Multicycle FP Operations

When Pipeline Is Stuck: LD R1, 0(R2); DSUB R4, R1, R5. DSUB needs R1 before LD has loaded it.

Structural Hazard. Example: with 1 memory port, the data access of a load (MEM) conflicts with the instruction fetch (IF) of instruction i+3.

Structural Hazard: stall instruction i+3 until clock cycle 5.
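The single-memory-port conflict can be modeled in a few lines. A minimal sketch, assuming the only stalls are fetch stalls caused by this port conflict, and that MEM occurs three cycles after IF as in the five-stage pipeline; `fetch_cycles` is a hypothetical helper.

```python
def fetch_cycles(uses_data_memory):
    """Return the clock cycle in which each instruction is fetched.

    uses_data_memory[i] is True if instruction i is a load or store.
    With one memory port, a fetch cannot share a cycle with a data access.
    """
    fetch = []
    port_busy = set()  # cycles in which the port serves a data access
    cycle = 1
    for uses_mem in uses_data_memory:
        while cycle in port_busy:
            cycle += 1  # stall: the port is serving an earlier load/store
        fetch.append(cycle)
        if uses_mem:
            port_busy.add(cycle + 3)  # MEM is three stages after IF
        cycle += 1
    return fetch


# A load followed by three instructions that do not access data memory.
print(fetch_cycles([True, False, False, False]))
```

In this run the fourth instruction (i+3) is fetched in cycle 5 instead of cycle 4, matching the stall described above.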

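Data Hazard: DADD R1, R2, R3; DSUB R4, R1, R5; AND R6, R1, R7; OR R8, R1, R9; XOR R10, R1, R11. All four later instructions read R1, which DADD writes. OR sees no hazard because the register file is written in the 1st half of the cycle and read in the 2nd half.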

Data Hazard: Solution: forwarding. Feed the results held in the EX/MEM and MEM/WB pipeline registers directly back to the ALU inputs. If the forwarding hardware detects that a previous ALU operation has written the register corresponding to a source of the current ALU operation, control logic selects the forwarded result as the ALU input.

Data Hazard: Forwarding. DADD R1, R2, R3; DSUB R4, R1, R5; AND R6, R1, R7; OR R8, R1, R9; XOR R10, R1, R11. R1 is forwarded to DSUB from the EX/MEM pipeline register and to AND from the MEM/WB pipeline register.
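The selection that the forwarding control logic makes for one ALU source operand can be sketched as follows. The function name and the `(dest_reg, value)` representation of the pipeline registers are assumptions for illustration; real hardware implements this as multiplexers and comparators.

```python
def select_alu_input(src_reg, regfile_value, ex_mem=None, mem_wb=None):
    """Pick the value for one ALU source operand.

    ex_mem and mem_wb are (dest_reg, value) pairs held in the EX/MEM and
    MEM/WB pipeline registers, or None if that instruction writes no register.
    """
    if src_reg != 0:  # R0 is hardwired to zero in MIPS, so it is never forwarded
        if ex_mem is not None and ex_mem[0] == src_reg:
            return ex_mem[1], "EX/MEM"  # the most recent producer wins
        if mem_wb is not None and mem_wb[0] == src_reg:
            return mem_wb[1], "MEM/WB"
    return regfile_value, "register file"


STALE = -1  # placeholder for the out-of-date register file contents of R1

# DSUB R4, R1, R5 in EX while the DADD result (say 30) sits in EX/MEM.
print(select_alu_input(1, STALE, ex_mem=(1, 30)))
# AND R6, R1, R7 in EX one cycle later: DADD is in MEM/WB, DSUB (R4) in EX/MEM.
print(select_alu_input(1, STALE, ex_mem=(4, 5), mem_wb=(1, 30)))
```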

Data Hazard: Forwarding Generalized forwarding pass a result directly to the functional unit that requires it; forward results to not only ALU inputs but also other types of functional units;

Data Hazard: Forwarding. Generalized forwarding example: DADD R1, R2, R3; LD R4, 0(R1); SD R4, 12(R1). Both R1 and R4 are forwarded.

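Data Hazard: sometimes a stall is necessary. LD R1, 0(R2); DSUB R4, R1, R5. The loaded value of R1 reaches the MEM/WB pipeline register only after DSUB's EX stage has already begun. Forwarding cannot go backward in time, so the pipeline has to stall.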
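A pipeline interlock has to detect this load-use case. The sketch below counts the stalls it would insert, assuming full forwarding so that only a load immediately followed by a consumer of its result costs one stall cycle; the tuple encoding of instructions is hypothetical.

```python
def count_load_use_stalls(program):
    """program: list of (opcode, dest_reg, source_regs) tuples in program order."""
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, curr_sources = curr
        # A load's data arrives at the end of MEM, too late for the next EX.
        if prev_op == "LD" and prev_dest in curr_sources:
            stalls += 1
    return stalls


hazard = [
    ("LD", "R1", ("R2",)),          # LD   R1, 0(R2)
    ("DSUB", "R4", ("R1", "R5")),   # DSUB R4, R1, R5
]
print(count_load_use_stalls(hazard))
```

Placing an independent instruction between the load and its consumer removes the stall, which is the idea behind compiler instruction scheduling.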

Branch Hazard: redo IF once the branch is resolved, which is essentially a stall. If the branch is untaken, the stall is unnecessary.

Branch Hazard: Solutions. 4 simple compile time schemes:

Scheme 1, Freeze or flush the pipeline: hold or delete any instructions after the branch until the branch destination is known; i.e., redo IF without the first IF.

Scheme 2, Predicted-untaken: simply treat every branch as untaken. When the branch is untaken, the pipeline proceeds as if there were no hazard. If the branch is taken, turn the fetched instruction into a no-op (idle) and restart the IF at the branch target address.

Scheme 3, Predicted-taken: simply treat every branch as taken. This does not apply to the five-stage pipeline; it applies when the branch target address is known before the branch outcome.

Scheme 4, Delayed branch: delay the branch execution until after the next instruction. Pipelining sequence: branch instruction; sequential successor; branch target if taken. The next instruction (the sequential successor) sits in the branch delay slot.

Branch Hazard: Solutions Delayed branch
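The cost of the first two schemes can be compared with a simple CPI model. This is a sketch under stated assumptions: an ideal CPI of 1, a one-cycle branch penalty, and hypothetical values for the branch frequency and taken fraction. None of these numbers come from the lecture, and delayed branches are not modeled because their cost depends on how well the delay slot is filled.

```python
def pipeline_cpi(branch_freq, taken_fraction, taken_penalty, untaken_penalty=0):
    """CPI = ideal CPI of 1 plus average branch stall cycles per instruction."""
    stall_per_branch = (taken_fraction * taken_penalty
                        + (1 - taken_fraction) * untaken_penalty)
    return 1.0 + branch_freq * stall_per_branch


def pipeline_speedup(depth, cpi):
    """Speedup over an unpipelined machine, assuming perfectly balanced stages."""
    return depth / cpi


BRANCH_FREQ = 0.2      # hypothetical: 20% of instructions are branches
TAKEN_FRACTION = 0.6   # hypothetical: 60% of branches are taken

freeze = pipeline_cpi(BRANCH_FREQ, TAKEN_FRACTION, taken_penalty=1, untaken_penalty=1)
untaken = pipeline_cpi(BRANCH_FREQ, TAKEN_FRACTION, taken_penalty=1)

print(f"freeze/flush:      CPI {freeze:.2f}, speedup {pipeline_speedup(5, freeze):.2f}")
print(f"predicted-untaken: CPI {untaken:.2f}, speedup {pipeline_speedup(5, untaken):.2f}")
```

Predicted-untaken only pays the penalty on taken branches, so it is never worse than freezing the pipeline under this model.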

Outline What’s Pipelining How Pipelining Works Pipeline Hazards Pipeline with Multicycle FP Operations

Multicycle FP Operation. The FP pipeline allows a longer latency for operations, with two changes over the integer pipeline: repeat EX as many times as needed; use multiple FP functional units.

FP Pipeline functional units: main integer unit (loads and stores, integer ALU operations, branches); FP adder (FP add, FP subtract, FP conversion); FP and integer multiplier; FP and integer divider.

Generalized FP Pipeline: EX is pipelined (except for the FP divider); additional pipeline registers are needed, e.g., ID/A1; the FP divider takes 24 CCs.

Generalized FP Pipeline Example italics: stage where data is needed bold: stage where a result is available

Hazard Divider is not fully pipelined – structural hazard

Hazard: instructions have varying running times, so more than one register write may be needed in the same cycle: a structural hazard.
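Write-port contention of this kind can be detected by tabulating when each instruction reaches WB. The sketch assumes hypothetical pipeline depths (a 4-stage FP adder and a 7-stage multiplier, with MEM and WB after the execute stages) and no stalls; the table and function names are illustrative.

```python
from collections import Counter

# Cycles from fetch (IF) to write-back (WB). Assumed values, not from the slides:
# L.D:   IF ID EX MEM WB            -> 4 cycles after fetch
# ADD.D: IF ID A1..A4 MEM WB        -> 7 cycles after fetch
# MUL.D: IF ID M1..M7 MEM WB        -> 10 cycles after fetch
CYCLES_TO_WB = {"L.D": 4, "ADD.D": 7, "MUL.D": 10}


def write_port_conflicts(schedule, write_ports=1):
    """schedule: list of (opcode, fetch_cycle) pairs.

    Returns {cycle: number_of_writes} for cycles needing more ports than exist.
    """
    writes = Counter(fetch + CYCLES_TO_WB[op] for op, fetch in schedule)
    return {cycle: n for cycle, n in writes.items() if n > write_ports}


schedule = [("MUL.D", 1), ("ADD.D", 4), ("L.D", 7)]
print(write_port_conflicts(schedule))
```

With these assumed depths, all three instructions reach WB in cycle 11, so a single write port forces two of them to stall.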

Hazard Instructions no longer reach WB in order – Write after write (WAW) hazard

Hazard Instructions may complete in a different order than they were issued – exceptions

Hazard Longer latency of operations – more frequent stalls for RAW hazards

RAW Hazards

Structural Hazards

WAW Hazards: if L.D were issued one cycle earlier, it would write F2 one cycle earlier than ADD.D, which is a WAW hazard. What if another instruction uses F2 between them? Then there is no WAW hazard.
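The WAW condition described above amounts to a later instruction reaching WB before an earlier instruction that writes the same register. A minimal sketch, assuming hypothetical fetch-to-WB distances (4 cycles for L.D, 7 for ADD.D with a 4-stage FP adder) and no stalls; the fetch cycles in the example are also illustrative.

```python
# Cycles from fetch (IF) to write-back (WB); assumed values, not from the slides.
CYCLES_TO_WB = {"L.D": 4, "ADD.D": 7}


def waw_hazards(schedule):
    """schedule: list of (opcode, dest_reg, fetch_cycle) in program order.

    Returns (earlier_index, later_index) pairs where the later instruction
    would write the same register before the earlier one does.
    """
    hazards = []
    for i, (op_i, dest_i, fetch_i) in enumerate(schedule):
        write_i = fetch_i + CYCLES_TO_WB[op_i]
        for j in range(i + 1, len(schedule)):
            op_j, dest_j, fetch_j = schedule[j]
            write_j = fetch_j + CYCLES_TO_WB[op_j]
            if dest_i == dest_j and write_j < write_i:
                hazards.append((i, j))
    return hazards


# ADD.D F2 fetched in cycle 4 writes in cycle 11; L.D F2 fetched in cycle 6 writes in cycle 10.
print(waw_hazards([("ADD.D", "F2", 4), ("L.D", "F2", 6)]))
# Fetched one cycle later, L.D writes in cycle 11: no WAW (a write-port conflict instead).
print(waw_hazards([("ADD.D", "F2", 4), ("L.D", "F2", 7)]))
```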

All in MIPS R4000

MIPS R4000: 5-stage -> 8-stage pipeline; higher clock rate.

MIPS R4000 pipeline stages:

IF: first half of instruction fetch; PC selection; initiation of instruction cache access.

IS: second half of instruction fetch; completion of instruction cache access.

RF: instruction decode and register fetch; hazard checking; instruction cache hit detection.

EX: execution; effective address calculation; ALU operation; branch-target computation and condition evaluation.

DF: data fetch; first half of data cache access.

DS: second half of data fetch; completion of data cache access.

TC: tag check; determine whether the data cache access hit.

WB: write back for loads and register-register operations.

MIPS R4000: 2-cycle load delay

MIPS R4000: 3-cycle branch delay

MIPS R4000 FP unit with eight different stages

MIPS R4000 FP operations: latency and initiation interval

MIPS R4000 FP operations Example 1 FP multiply + FP add

MIPS R4000 FP operations Example 2 FP add + FP multiply

MIPS R4000 FP operations Example 3: divide + add

MIPS R4000 FP operations Example 4 FP add + FP divide

?