MIPS Pipeline Hakim Weatherspoon CS 3410, Spring 2013 Computer Science Cornell University See P&H Chapter 4.6
A Processor (Review: single-cycle processor) — datapath: PC, +4, instruction memory, register file, immediate extend, ALU, compare (=?), data memory (addr, d_in, d_out), control, and the jump/branch target and offset logic that produces the new PC.
What determines the performance of a processor? A) Critical path B) Clock cycle time C) Cycles per instruction (CPI) D) All of the above E) None of the above
Review: Single Cycle Processor
Advantages: a single cycle per instruction keeps the logic and clock simple.
Disadvantages: since instructions take different amounts of time to finish, memory and the functional units are not efficiently utilized, and the cycle time is set by the longest delay (the load instruction). The best possible CPI is 1; however, lower MIPS and a longer clock period (lower clock frequency) mean lower performance.
Review: Multi Cycle Processor
Advantages: better MIPS and a smaller clock period (higher clock frequency); hence, better performance than the single-cycle processor.
Disadvantages: higher CPI than the single-cycle processor.
Pipelining: we want better performance, meaning a small CPI (close to 1) together with high MIPS and a short clock period (high clock frequency).
CPU time = instruction count x CPI x clock cycle time
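To make the equation concrete, here is a small C sketch that simply plugs in made-up numbers; the instruction count, CPI, and cycle time below are hypothetical, chosen only to illustrate the formula.

#include <stdio.h>

int main(void) {
    /* Hypothetical numbers, for illustration only:
     * 1,000,000 instructions, CPI = 1, clock cycle = 2 ns (500 MHz). */
    double instruction_count = 1e6;
    double cpi = 1.0;
    double cycle_time_ns = 2.0;

    /* CPU time = instruction count x CPI x clock cycle time */
    double cpu_time_ns = instruction_count * cpi * cycle_time_ns;
    printf("CPU time = %.0f ns (%.3f ms)\n", cpu_time_ns, cpu_time_ns / 1e6);
    return 0;
}

Halving either the cycle time or the CPI halves the CPU time, which is exactly the trade-off the single-cycle and multi-cycle designs make in opposite directions.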
Single Cycle vs Pipelined Processor See: P&H Chapter 4.5
The Kids: Alice and Bob. They don’t always get along…
The Bicycle
The Materials Saw Drill Glue Paint
The Instructions: N pieces, each built following the same sequence: Saw → Drill → Glue → Paint
Design 1: Sequential Schedule. Alice owns the room; Bob can enter when Alice is finished. Repeat for the remaining tasks. No possibility for conflicts.
Sequential Performance: Elapsed time for Alice: 4 hours. Elapsed time for Bob: 4 hours. Total elapsed time: 4*N hours. Can we do better?
Latency: 4 hours/task. Throughput: 1 task / 4 hours. Concurrency: 1. CPI = 1.
Design 2: Pipelined Design. Partition the room into stages of a pipeline; one person owns a stage at a time. 4 stages, 4 people working simultaneously. Everyone moves right in lockstep: Alice, Bob, Carol, Dave.
Pipelined Performance: Latency: 4 hours/task. Throughput: 1 task/hour. Concurrency: 4. CPI = 1.
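To spell out the arithmetic behind these numbers: building N bikes sequentially takes 4N hours, while the 4-stage pipeline finishes the first bike after 4 hours and then one more every hour, for a total of 4 + (N - 1) = N + 3 hours. The speedup is therefore 4N / (N + 3), which approaches 4, the number of stages, as N grows.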
Lessons Principle: Throughput increased by parallel execution Pipelining: Identify pipeline stages Isolate stages from each other Resolve pipeline hazards (Thursday)
A Processor (Review: single-cycle processor) — the same single-cycle datapath as before: PC, +4, instruction memory, register file, immediate extend, ALU, compare (=?), data memory, control, and the jump/branch target logic producing the new PC.
A Processor — the single-cycle datapath divided into five stages: Instruction Fetch, Instruction Decode, Execute, Memory, and Write-Back (PC, +4, instruction memory, register file, control, extend, ALU, data memory, and the logic that computes jump/branch targets and the new PC).
Basic Pipeline: five-stage “RISC” load-store architecture.
1. Instruction Fetch (IF): get instruction from memory, increment PC
2. Instruction Decode (ID): translate opcode into control signals and read registers
3. Execute (EX): perform ALU operation, compute jump/branch targets
4. Memory (MEM): access memory if needed
5. Writeback (WB): update register file
Time Graphs (x-axis: clock cycle): five instructions (the first two labeled add and lw) each pass through IF, ID, EX, MEM, WB, one stage per clock cycle, each starting one cycle after the previous.
Latency: 5 cycles. Throughput: 1 instr/cycle. Concurrency: 5. CPI = 1.
Principles of Pipelined Implementation Break instructions across multiple clock cycles (five, in this case) Design a separate stage for the execution performed during each clock cycle Add pipeline registers (flip-flops) to isolate signals between different stages
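A minimal C sketch of what those pipeline registers might hold; the struct and field names are hypothetical, chosen to mirror the labels used in the datapath figures rather than taken from any real implementation.

#include <stdint.h>

/* Each struct models the bank of flip-flops that isolates one stage
 * from the next; all four are written together on the clock edge. */
typedef struct { uint32_t inst, pc_plus4; } if_id_t;                                   /* IF  -> ID  */
typedef struct { uint32_t pc_plus4, a, b, imm; uint8_t rd; uint16_t ctrl; } id_ex_t;    /* ID  -> EX  */
typedef struct { uint32_t target, alu_out, b; uint8_t rd; uint16_t ctrl; } ex_mem_t;    /* EX  -> MEM */
typedef struct { uint32_t mem_out, alu_out; uint8_t rd; uint16_t ctrl; } mem_wb_t;      /* MEM -> WB  */

The per-stage sketches that follow reuse these hypothetical structs, re-declaring them in each snippet so each compiles on its own.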
Pipelined Processor See: P&H Chapter 4.6
Pipelined Processor — the datapath with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB separating the Instruction Fetch, Instruction Decode, Execute, Memory, and Write-Back stages; each register latches the values (A, B, imm, D, M, ctrl, jump/branch targets, PC+4) that one stage passes to the next.
IF Stage 1: Instruction Fetch Fetch a new instruction every cycle Current PC is index to instruction memory Increment the PC at end of cycle (assume no branches for now) Write values of interest to pipeline register (IF/ID) Instruction bits (for later decoding) PC+4 (for later computing branch targets)
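A minimal C sketch of the IF stage, using the hypothetical struct names introduced earlier; imem and pc are stand-ins for the instruction memory and program counter, and this is an illustration, not the course's reference implementation.

#include <stdint.h>

#define IMEM_WORDS 1024
static uint32_t imem[IMEM_WORDS];        /* instruction memory (stand-in) */

typedef struct { uint32_t inst, pc_plus4; } if_id_t;

/* IF: index instruction memory with the current PC, latch the
 * instruction bits and PC+4 into IF/ID, and advance the PC
 * (no branches yet, matching the slide's assumption). */
static uint32_t fetch(uint32_t pc, if_id_t *if_id) {
    if_id->inst     = imem[(pc / 4) % IMEM_WORDS];
    if_id->pc_plus4 = pc + 4;
    return pc + 4;                       /* next PC */
}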
IF stage datapath: the PC addresses instruction memory (memory control = 00, read word); the fetched instruction and PC+4 are latched into the IF/ID register (write enable = 1) for the rest of the pipeline, and pcsel chooses the next PC from PC+4, pcreg, pcrel, or pcabs.
ID Stage 2: Instruction Decode On every cycle: Read IF/ID pipeline register to get instruction bits Decode instruction, generate control signals Read from register file Write values of interest to pipeline register (ID/EX) Control information, Rd index, immediates, offsets, … Contents of Ra, Rb PC+4 (for computing branch targets later)
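A corresponding ID-stage sketch in C; the field extraction assumes the MIPS R/I field positions shown later in the lecture, and decode_ctrl is a hypothetical placeholder for the real control logic.

#include <stdint.h>

typedef struct { uint32_t inst, pc_plus4; } if_id_t;
typedef struct { uint32_t pc_plus4, a, b, imm; uint8_t rd; uint16_t ctrl; } id_ex_t;

static uint32_t regfile[32];             /* register file (stand-in) */

/* Hypothetical decode helper: derive control signals from the opcode. */
static uint16_t decode_ctrl(uint32_t op) { return (uint16_t)op; /* placeholder */ }

/* ID: read the IF/ID register, decode, read Ra/Rb from the register
 * file, sign-extend the immediate, and latch everything into ID/EX. */
static void decode(const if_id_t *if_id, id_ex_t *id_ex) {
    uint32_t inst = if_id->inst;
    uint8_t ra = (inst >> 21) & 0x1f, rb = (inst >> 16) & 0x1f;
    id_ex->ctrl     = decode_ctrl(inst >> 26);
    id_ex->a        = regfile[ra];
    id_ex->b        = regfile[rb];
    id_ex->imm      = (uint32_t)(int32_t)(int16_t)(inst & 0xffff);   /* sign extend */
    id_ex->rd       = (inst >> 11) & 0x1f;   /* R-type dest; an I-type would use rt */
    id_ex->pc_plus4 = if_id->pc_plus4;
}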
ID stage datapath: the instruction and PC+4 arrive from IF/ID (written by Stage 1: Instruction Fetch); the decoder generates control signals and the destination index Rd from the instruction bits, the register file is read at Ra and Rb (producing A and B), and the immediate is extended; A, B, imm, PC+4, the dest/result indices, and ctrl are latched into ID/EX for the rest of the pipeline.
EX Stage 3: Execute On every cycle: Read ID/EX pipeline register to get values and control bits Perform ALU operation Compute targets (PC+4+offset, etc.) in case this is a branch Decide if jump/branch should be taken Write values of interest to pipeline register (EX/MEM) Control information, Rd index, … Result of ALU operation Value in case this is a memory store instruction
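A hedged EX-stage sketch in C; the ctrl bit assignments (ALUSrc, branch) are made up for illustration, and only an add ALU operation and a beq-style comparison are shown.

#include <stdint.h>

typedef struct { uint32_t pc_plus4, a, b, imm; uint8_t rd; uint16_t ctrl; } id_ex_t;
typedef struct { uint32_t target, alu_out, b; uint8_t rd; uint16_t ctrl; } ex_mem_t;

/* EX: perform the ALU operation, compute the PC-relative branch
 * target, decide whether the branch is taken, and latch the results
 * into EX/MEM.  The control encoding here is hypothetical; a real
 * datapath decodes it from the opcode/func bits. */
static void execute(const id_ex_t *id_ex, ex_mem_t *ex_mem) {
    uint32_t alu_b = (id_ex->ctrl & 0x1) ? id_ex->imm : id_ex->b;    /* ALUSrc (hypothetical bit) */
    ex_mem->alu_out = id_ex->a + alu_b;                    /* add, as the example ALU op */
    ex_mem->target  = id_ex->pc_plus4 + (id_ex->imm << 2); /* PC-relative branch target */
    int taken = (id_ex->ctrl & 0x2) && (id_ex->a == id_ex->b);       /* beq-style decision */
    (void)taken;              /* redirecting the PC is not modeled in this sketch */
    ex_mem->b    = id_ex->b;  /* store data, in case this is a sw */
    ex_mem->rd   = id_ex->rd;
    ex_mem->ctrl = id_ex->ctrl;
}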
EX stage datapath: A, B, imm, PC+4, and ctrl arrive from ID/EX (written by Stage 2: Instruction Decode); the ALU computes its result, an adder computes the PC-relative branch target (pcrel), pcabs and pcreg give jump targets, and branch? decides whether to redirect pcsel; the ALU result (D), B, target, and ctrl are latched into EX/MEM for the rest of the pipeline.
MEM Stage 4: Memory On every cycle: Read EX/MEM pipeline register to get values and control bits Perform memory load/store if needed –address is ALU result Write values of interest to pipeline register (MEM/WB) Control information, Rd index, … Result of memory operation Pass result of ALU operation
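A MEM-stage sketch in C along the same lines; dmem and the control bits for stores and loads are hypothetical stand-ins.

#include <stdint.h>

#define DMEM_WORDS 1024
static uint32_t dmem[DMEM_WORDS];        /* data memory (stand-in) */

typedef struct { uint32_t target, alu_out, b; uint8_t rd; uint16_t ctrl; } ex_mem_t;
typedef struct { uint32_t mem_out, alu_out; uint8_t rd; uint16_t ctrl; } mem_wb_t;

/* MEM: use the ALU result as the address for a load or store (when
 * the hypothetical ctrl bits say so), and pass both the loaded value
 * and the ALU result on to MEM/WB. */
static void mem_stage(const ex_mem_t *ex_mem, mem_wb_t *mem_wb) {
    uint32_t idx = (ex_mem->alu_out / 4) % DMEM_WORDS;
    if (ex_mem->ctrl & 0x4)              /* this is a store (e.g., sw) */
        dmem[idx] = ex_mem->b;
    mem_wb->mem_out = (ex_mem->ctrl & 0x8) ? dmem[idx] : 0;   /* this is a load (e.g., lw) */
    mem_wb->alu_out = ex_mem->alu_out;
    mem_wb->rd      = ex_mem->rd;
    mem_wb->ctrl    = ex_mem->ctrl;
}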
MEM stage datapath: the ALU result (D) from EX/MEM (written by Stage 3: Execute) addresses data memory (B supplies the store data, mc the memory control); branch?, pcrel, pcabs, and pcreg drive pcsel back at the front of the pipeline; the memory data (M), D, and ctrl are latched into MEM/WB for the rest of the pipeline.
WB Stage 5: Write-back On every cycle: Read MEM/WB pipeline register to get values and control bits Select value and write to register file
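And a WB-stage sketch in C; the register-write and result-select bits are again hypothetical control encodings.

#include <stdint.h>

typedef struct { uint32_t mem_out, alu_out; uint8_t rd; uint16_t ctrl; } mem_wb_t;

static uint32_t regfile[32];             /* register file (stand-in) */

/* WB: select either the loaded memory data or the ALU result and
 * write it to the register file, if the instruction writes a
 * register at all. */
static void writeback(const mem_wb_t *mem_wb) {
    if (mem_wb->ctrl & 0x10) {                              /* this instruction writes a register */
        uint32_t value = (mem_wb->ctrl & 0x8) ? mem_wb->mem_out   /* loads write memory data */
                                              : mem_wb->alu_out;  /* everything else, the ALU result */
        if (mem_wb->rd != 0)                                /* register 0 stays 0 */
            regfile[mem_wb->rd] = value;
    }
}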
WB stage datapath: the control bits in MEM/WB (written by Stage 4: Memory) select either the memory data (M) or the ALU result (D) as the result, which is written back to the register file at index dest.
Full pipelined datapath: PC, +4, instruction memory, register file (Ra, Rb, Rd, D, A, B), immediate/extend, ALU, and data memory, with the IF/ID, ID/EX, EX/MEM, and MEM/WB registers carrying PC+4, OP, A, B, imm, Rd/Rt, D, and M between stages.
Administrivia
Required: a partner for the group project.
Project 1 (PA1) and Homework 2 (HW2) are both out; the PA1 design doc and HW2 are due in one week, so start early. Work alone on HW2, but in a group for PA1.
Save your work! Save often, verify the file is non-zero, and periodically save to Dropbox.
Beware of Mac OS X 10.5 (Leopard) and 10.6 (Snow Leopard).
Use your resources: lab section, Piazza.com, office hours, the homework help session, class notes, the book, sections, and the CSUGLab.
Administrivia
Check the online syllabus/schedule for slides and reading for lectures, office hours, and homework and programming assignments.
Prelims (in evenings): Tuesday, February 26th; Thursday, March 28th; Thursday, April 25th.
Schedule is subject to change.
Collaboration, Late, Re-grading Policies
“Black Board” Collaboration Policy: you may discuss the approach together on a “black board,” but leave and write up the solution independently; do not copy solutions.
Late Policy: each person has a total of four “slip days,” with a maximum of two slip days for any individual assignment. Slip days are deducted first for any late assignment; you cannot selectively apply them. For projects, slip days are deducted from all partners. 20% is deducted per day late after slip days are exhausted.
Regrade policy: submit a written request to the lead TA, and the lead TA will pick a different grader; submit another written request and the lead TA will regrade directly; submit yet another written request for the professor to regrade.
Example: Sample Code (Simple)
Assume an eight-register machine. Run the following code on a pipelined datapath:
add r3, r1, r2 ; reg 3 = reg 1 + reg 2
nand r6, r4, r5 ; reg 6 = ~(reg 4 & reg 5)
lw r4, 20(r2) ; reg 4 = Mem[reg 2 + 20]
add r5, r2, r5 ; reg 5 = reg 2 + reg 5
sw r7, 12(r3) ; Mem[reg 3 + 12] = reg 7
Slides thanks to Sally McKee
Example: Sample Code (Simple) add r3, r1, r2; nand r6, r4, r5; lw r4, 20(r2); add r5, r2, r5; sw r7, 12(r3);
MIPS instruction formats: all MIPS instructions are 32 bits long and come in three formats.
R-type: op (6 bits), rs (5 bits), rt (5 bits), rd (5 bits), shamt (5 bits), func (6 bits)
I-type: op (6 bits), rs (5 bits), rt (5 bits), immediate (16 bits)
J-type: op (6 bits), immediate / target address (26 bits)
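These field boundaries translate directly into shift-and-mask extraction; here is a small C sketch (the helper and variable names are just for illustration).

#include <stdint.h>

/* Extract the bit field [hi:lo] of a 32-bit instruction word. */
static inline uint32_t bits(uint32_t w, int hi, int lo) {
    return (w >> lo) & ((1u << (hi - lo + 1)) - 1);
}

static void split_fields(uint32_t inst) {
    uint32_t op    = bits(inst, 31, 26);   /* all formats */
    uint32_t rs    = bits(inst, 25, 21);   /* R- and I-type */
    uint32_t rt    = bits(inst, 20, 16);
    uint32_t rd    = bits(inst, 15, 11);   /* R-type only */
    uint32_t shamt = bits(inst, 10, 6);
    uint32_t func  = bits(inst, 5, 0);
    uint32_t imm16 = bits(inst, 15, 0);    /* I-type immediate */
    uint32_t targ  = bits(inst, 25, 0);    /* J-type target */
    (void)op; (void)rs; (void)rt; (void)rd; (void)shamt;
    (void)func; (void)imm16; (void)targ;   /* a real decoder would use these */
}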
MIPS Instruction Types Arithmetic/Logical R-type: result and two source registers, shift amount I-type: 16-bit immediate with sign/zero extension Memory Access load/store between registers and memory word, half-word and byte operations Control flow conditional branches: pc-relative addresses jumps: fixed offsets, register absolute
Time Graphs (x-axis: clock cycle): add, nand, lw, add, sw each pass through IF, ID, EX, MEM, WB, one stage per clock cycle, each starting one cycle after the previous. Latency: ___ Throughput: ___ Concurrency: ___
Detailed pipelined datapath for the example: PC, instruction memory, register file (R0–R7, read ports regA/regB), sign extend, ALU, data memory, and muxes, with pipeline registers IF/ID (instruction, PC+4), ID/EX (op, Rt, imm, valA, valB, PC+4), EX/MEM (op, dest, target, valB, ALU result), and MEM/WB (op, dest, ALU result, mdata).
Cycle-by-cycle execution of the sample code on the pipelined datapath (one row per clock cycle, showing which instruction occupies each stage):
Time 1: IF: add r3, r1, r2
Time 2: IF: nand r6, r4, r5 | ID: add
Time 3: IF: lw r4, 20(r2) | ID: nand | EX: add
Time 4: IF: add r5, r2, r5 | ID: lw | EX: nand | MEM: add
Time 5: IF: sw r7, 12(r3) | ID: add r5 | EX: lw | MEM: nand | WB: add r3
Time 6: (no more instructions) ID: sw | EX: add r5 | MEM: lw | WB: nand
Time 7: EX: sw | MEM: add r5 | WB: lw
Time 8: MEM: sw | WB: add r5
Time 9: WB: sw
Slides thanks to Sally McKee
Pipelining Recap Powerful technique for masking latencies Logically, instructions execute one at a time Physically, instructions execute in parallel –Instruction level parallelism Abstraction promotes decoupling Interface (ISA) vs. implementation (Pipeline)
Summary figure: the five instructions (0: add r3, r1, r2; 1: nand r6, r4, r5; 2: lw r4, 20(r2); 3: add r5, r2, r5; 4: sw r7, 12(r3)) flow through the pipelined datapath, occupying successive stages and the IF/ID, ID/EX, EX/MEM, and MEM/WB registers from one cycle to the next.