6.823 Simple Instruction Pipelining
Krste Asanovic, Laboratory for Computer Science, Massachusetts Institute of Technology
Asanovic/Devadas, Spring 2002

Processor Performance Equation

Time/Program = (Instructions/Program) × (Cycles/Instruction) × (Time/Cycle)

- Instructions per program depends on source code, compiler technology, and the ISA
- The microcoded DLX from last lecture had a cycles per instruction (CPI) of around 7 minimum
- Time per cycle for the microcoded DLX is fixed by the microcode cycle time: mostly ROM access plus next-μPC select logic
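The equation above can be exercised with illustrative numbers. Everything below (instruction count, CPI values, cycle time) is made up for the example, not taken from the lecture:

```python
# Time/Program = Insts/Program * Cycles/Inst * Time/Cycle, with toy numbers.
def exec_time(instructions, cpi, cycle_time_ns):
    """Total execution time in nanoseconds."""
    return instructions * cpi * cycle_time_ns

insts = 1_000_000
microcoded = exec_time(insts, cpi=7, cycle_time_ns=10)  # 7e7 ns = 70 ms
pipelined = exec_time(insts, cpi=1, cycle_time_ns=10)   # 1e7 ns = 10 ms
print(microcoded / pipelined)  # 7.0x speedup at equal cycle time
```

At an equal cycle time, the CPI ratio translates directly into the speedup; in practice the cycle times also differ, which is what the rest of the lecture explores.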

Pipelined DLX

To pipeline DLX:
- First build an unpipelined DLX with CPI = 1
- Next, add pipeline registers to reduce the cycle time while maintaining CPI = 1

A Simple Memory Model

Reads and writes are always completed in one cycle:
- a Read can be done at any time (i.e., it is combinational)
- a Write is performed at the rising clock edge if it is enabled ⇒ the write address, data, and enable must be stable at the clock edge

[Figure: "MAGIC RAM" block with Address, WriteData, and ReadData buses, plus WriteEnable and Clock inputs]
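A minimal Python sketch of this memory model, with combinational reads and edge-triggered writes; the class and method names are my own, not the lecture's:

```python
# Model of the slide's "magic" RAM: reads are combinational (valid at any
# time), writes take effect only at the rising clock edge when enabled.
class MagicRAM:
    def __init__(self, size):
        self.mem = [0] * size

    def read(self, addr):
        # Combinational read: no clock involved.
        return self.mem[addr]

    def rising_edge(self, addr, data, write_enable):
        # addr/data/write_enable must be stable at the edge.
        if write_enable:
            self.mem[addr] = data

ram = MagicRAM(16)
ram.rising_edge(addr=3, data=42, write_enable=True)
ram.rising_edge(addr=4, data=99, write_enable=False)  # disabled: ignored
print(ram.read(3), ram.read(4))  # 42 0
```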

Datapath for ALU Instructions

[Figure: single-cycle datapath with PC, Inst. Memory, GPRs, Imm Ext, ALU, and ALU Control; control signals RegWrite, RegDst (rf2 / rf3), ExtSel, OpSel, BSrc (Reg / Imm)]

Register-register: rf3 ← (rf1) func (rf2)
Register-immediate: rf2 ← (rf1) op immediate

Datapath for Memory Instructions

Should program and data memory be separate?

Harvard style: separate (Aiken and Mark 1 influence)
- read-only program memory
- read/write data memory
- at some level the two memories have to be the same

Princeton style: the same (von Neumann's influence)
- a Load or Store instruction requires accessing the memory more than once during its execution

Load/Store Instructions: Harvard-Style Datapath

[Figure: datapath adding a Data Memory (we, addr, rdata, wdata) alongside the Inst. Memory; new control signals MemWrite and WBSrc (ALU / Mem)]

Addressing mode: (rf1) + displacement
- rf1 is the base register
- rf2 is the destination of a Load or the source for a Store

Memory Hierarchy c. 2002, Desktop & Small Server

Level                        Latency        Cycles        Capacity
On-chip caches               0.5-2 ns       2-3 clk       8~64 KB
On-chip SRAM/eDRAM           <10 ns         5~15 clk      MB-scale
Off-chip L3 cache            <25 ns         15~50 clk     1~8 MB
Interleaved banks of DRAM    ~150 ns        100~300 clk   64 MB~1 GB
Hard disk                    ~10 ms seek    ~10^7 clk     20~100 GB

Our memory model is a good approximation of the hierarchical memory system when we hit in the on-chip cache.

Conditional Branches

[Figure: datapath adding a zero? detector on the register read data and an adder computing the branch target from the PC; new control signal PCSrc (~j / j) selects the next PC, alongside RegWrite, MemWrite, and WBSrc]

Register-Indirect Jumps

[Figure: same datapath, extended so a register value can be selected as the next PC. Question posed: how do we support Jump & Link?]

Jump & Link

[Figure: datapath extended so the incremented PC can be written back to the register file; WBSrc becomes ALU / Mem / PC and RegDst becomes rf3 / rf2 / R31]

PC-Relative Jumps

[Figure: same datapath as Jump & Link; the PC-plus-offset adder result is selected as the next PC for PC-relative jumps]

Hardwired Control is Pure Combinational Logic: Unpipelined DLX

[Figure: combinational decode block with inputs op code and zero?, and outputs ExtSel, BSrc, OpSel, MemWrite, WBSrc, RegDst, RegWrite, PCSrc]

ALU Control & Immediate Extension

[Figure: decode map from Inst (Opcode) and Inst (Func) to ALUop OpSel (Func, Op, +, 0?) and to ExtSel (sExt16, uExt16, sExt26, High16)]

Hardwired Control Worksheet

[Figure: full datapath annotated with all control signals and their encodings: PCSrc (PCR / RInd / ~j), WBSrc (ALU / Mem / PC), RegDst (rf3 / rf2 / R31), ExtSel (sExt16 / uExt16 / sExt26 / High16), OpSel (Func / Op / + / 0?), BSrc (Reg / Imm), plus RegWrite and MemWrite]

Hardwired Control Table

Signal encodings: BSrc = Reg / Imm; WBSrc = ALU / Mem / PC; RegDst = rf2 / rf3 / R31; PCSrc1 = j / ~j; PCSrc2 = PCR / RInd

Opcode      ExtSel   BSrc  OpSel  MemWrite  RegWrite  WBSrc  RegDst  PCSrc
ALU         *        Reg   Func   no        yes       ALU    rf3     ~j
ALUu        *        Reg   Func   no        yes       ALU    rf3     ~j
ALUi        sExt16   Imm   Op     no        yes       ALU    rf2     ~j
ALUiu       uExt16   Imm   Op     no        yes       ALU    rf2     ~j
LW          sExt16   Imm   +      no        yes       Mem    rf2     ~j
SW          sExt16   Imm   +      yes       no        *      *       ~j
BEQZ z=0    sExt16   *     0?     no        no        *      *       PCR
BEQZ z=1    sExt16   *     0?     no        no        *      *       ~j
J           sExt26   *     *      no        no        *      *       PCR
JAL         sExt26   *     *      no        yes       PC     R31     PCR
JR          *        *     *      no        no        *      *       RInd
JALR        *        *     *      no        yes       PC     R31     RInd
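The control table can be sketched as a decode lookup. This is an illustrative Python rendering of a few rows, not the lecture's implementation; signal names follow the table and '*' marks a don't-care:

```python
# Hardwired control as a table lookup, one row per opcode class.
# Columns: ExtSel, BSrc, OpSel, MemWrite, RegWrite, WBSrc, RegDst, PCSrc.
CONTROL = {
    "ALU":  ("*",      "Reg", "Func", False, True,  "ALU", "rf3", "~j"),
    "ALUi": ("sExt16", "Imm", "Op",   False, True,  "ALU", "rf2", "~j"),
    "LW":   ("sExt16", "Imm", "+",    False, True,  "Mem", "rf2", "~j"),
    "SW":   ("sExt16", "Imm", "+",    True,  False, "*",   "*",   "~j"),
    "JAL":  ("sExt26", "*",   "*",    False, True,  "PC",  "R31", "PCR"),
    "JR":   ("*",      "*",   "*",    False, False, "*",   "*",   "RInd"),
}

def decode(opcode):
    """Return the control-signal settings for an opcode as a dict."""
    keys = ("ExtSel", "BSrc", "OpSel", "MemWrite",
            "RegWrite", "WBSrc", "RegDst", "PCSrc")
    return dict(zip(keys, CONTROL[opcode]))

print(decode("LW")["WBSrc"])     # Mem
print(decode("SW")["RegWrite"])  # False
```

The pure-lookup structure mirrors the slide's point: with the datapath fixed, control is just combinational logic (here, a dictionary) from opcode bits to signal values.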

Hardwired Unpipelined Machine

- Simple
- One instruction per cycle

Why wasn't this a popular machine style?

Unpipelined DLX

The clock period must be sufficiently long for all of the following steps to be "completed":
1. instruction fetch
2. decode and register fetch
3. ALU operation
4. data fetch, if required
5. register write-back setup time

⇒ tC > tIFetch + tRFetch + tALU + tDMem + tRWB

At the rising edge of the following clock, the PC, the register file, and the memory are updated.
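A quick numerical check of the inequality, assuming the illustrative stage latencies used later in the lecture (tIM = tDM = 10 units, tALU = 5, tRF = tRW = 1):

```python
# Unpipelined: the clock must cover the SUM of all step latencies.
# Pipelined (preview): the clock only covers the slowest stage.
t = {"IFetch": 10, "RFetch": 1, "ALU": 5, "DMem": 10, "RWB": 1}

t_clock_unpipelined = sum(t.values())  # 27 units
t_clock_pipelined = max(t.values())    # 10 units
print(t_clock_unpipelined, t_clock_pipelined)  # 27 10
```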

Pipelined DLX Datapath

The clock period can be reduced by dividing the execution of an instruction into multiple cycles:

tC > max {tIF, tRF, tALU, tDM, tRW} = tDM (probably)

However, CPI will increase unless instructions are pipelined.

[Figure: datapath cut into fetch, decode & reg-fetch, execute, memory, and write-back phases, with an IR after the fetch phase]

An Ideal Pipeline

- All objects go through the same stages
- No sharing of resources between any two stages
- Propagation delay through all pipeline stages is equal
- The scheduling of an object entering the pipeline is not affected by the objects in other stages

These conditions generally hold for industrial assembly lines. An instruction pipeline, however, cannot satisfy the last condition. Why?

Pipelining History

- Some very early machines had limited pipelined execution (e.g., Zuse Z4, WWII), usually overlapping the fetch of the next instruction with the current execution
- IBM Stretch was the first major "supercomputer" incorporating extensive pipelining, result bypassing, and branch prediction
  - project started in 1954, delivered in 1961
  - didn't meet its initial performance goal of 100x faster with 10x faster circuits
  - up to 11 macroinstructions in the pipeline at the same time
  - the microcode engine was highly pipelined as well (up to 6 microinstructions in the pipeline at the same time)
  - Stretch was the origin of the 8-bit byte and lower-case characters, carried on into the IBM 360

How to Divide the Datapath into Stages

Suppose memory is significantly slower than the other stages. In particular, suppose

tIM = tDM = 10 units
tALU = 5 units
tRF = tRW = 1 unit

Since the slowest stage determines the clock, it may be possible to combine some stages without any loss of performance.

Minimizing the Critical Path

tC > max {tIF, tRF, tALU, tDM, tRW}

[Figure: 4-stage division of the datapath: fetch | decode & reg-fetch & execute | memory | write-back]

The write-back stage takes much less time than the other stages. Suppose we combined it with the memory phase ⇒ this increases the critical path by only 10%.

Maximum Speedup by Pipelining

For the 4-stage pipeline, given tIM = tDM = 10 units, tALU = 5 units, tRF = tRW = 1 unit, tC could be reduced from 27 units to 10 units ⇒ speedup = 2.7

However, if tIM = tDM = tALU = tRF = tRW = 5 units, the same 4-stage pipeline can only reduce tC from 25 units to 10 units ⇒ speedup = 2.5

But since tIM = tDM = tALU = tRF = tRW, it is possible to achieve a higher speedup with more stages in the pipeline: a 5-stage pipeline can reduce tC from 25 units to 5 units ⇒ speedup = 5
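The arithmetic above can be reproduced by grouping steps into stages and taking the slowest stage as the clock. The grouping below, with register fetch and ALU sharing one stage in the 4-stage split, is one plausible reading of the example:

```python
# Pipeline speedup = (unpipelined period, sum of all steps)
#                  / (pipelined period, slowest stage's total).
def speedup(step_times, stages):
    """stages: lists of step indices, one list per pipeline stage."""
    t_unpipelined = sum(step_times)
    t_pipelined = max(sum(step_times[i] for i in st) for st in stages)
    return t_unpipelined / t_pipelined

steps = [10, 1, 5, 10, 1]        # tIM, tRF, tALU, tDM, tRW
four = [[0], [1, 2], [3], [4]]   # RF and ALU share a stage
print(speedup(steps, four))      # 27 / 10 = 2.7

equal = [5, 5, 5, 5, 5]
print(speedup(equal, four))      # 25 / 10 = 2.5
five = [[0], [1], [2], [3], [4]]
print(speedup(equal, five))      # 25 / 5 = 5.0
```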

Technology Assumptions

We will assume:
- a small amount of very fast memory (caches) backed up by a large, slower memory
- a fast ALU (at least for integers)
- multiported register files (slower!)

This makes the following timing assumption valid:

tIM ≈ tRF ≈ tALU ≈ tDM ≈ tRW

A 5-stage pipelined Harvard-style architecture will be the focus of our detailed design.

5-Stage Pipelined Execution

[Figure: pipelined datapath labeled with phases: fetch (IF), decode & reg-fetch (ID), execute (EX), memory (MA), write-back (WB)]

time          t0   t1   t2   t3   t4   t5   t6   t7  ...
instruction1  IF1  ID1  EX1  MA1  WB1
instruction2       IF2  ID2  EX2  MA2  WB2
instruction3            IF3  ID3  EX3  MA3  WB3
instruction4                 IF4  ID4  EX4  MA4  WB4
instruction5                      IF5  ID5  EX5  MA5  WB5
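The timing diagram above can be generated programmatically. This small helper is my own (with ".." marking cycles before an instruction enters the pipe) and assumes no stalls:

```python
# Print the stall-free pipeline timing diagram: instruction i enters IF
# in cycle i and advances one stage per cycle.
STAGES = ["IF", "ID", "EX", "MA", "WB"]

def diagram(n_insts):
    rows = []
    for i in range(n_insts):
        cells = [".."] * i + STAGES   # i idle cycles, then the 5 stages
        rows.append(f"I{i+1}: " + " ".join(cells))
    return "\n".join(rows)

print(diagram(3))
# I1: IF ID EX MA WB
# I2: .. IF ID EX MA WB
# I3: .. .. IF ID EX MA WB
```

With five stages and no hazards, instruction throughput reaches one completion per cycle once the pipeline is full, exactly as the diagram shows.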

5-Stage Pipelined Execution: Resource Usage Diagram

[Figure: same pipelined datapath, phases IF, ID, EX, MA, WB]

Resource      t0   t1   t2   t3   t4   t5   t6   t7  ...
IF            I1   I2   I3   I4   I5
ID                 I1   I2   I3   I4   I5
EX                      I1   I2   I3   I4   I5
MA                           I1   I2   I3   I4   I5
WB                                I1   I2   I3   I4   I5

Pipelined Execution: ALU Instructions

[Figure: pipelined datapath with pipeline registers, including MD1 and MD2 around the data-memory stage; the diagram is marked "not quite correct!"]

Pipelined Execution: Need for Several IRs

[Figure: same pipelined datapath; the instruction register must be replicated so that each stage holds the instruction it is currently processing]

IRs and Control Points

[Figure: pipelined datapath with an IR per stage]

Are the control points connected properly?
- Load/Store instructions
- ALU instructions

Pipelined DLX Datapath without Jumps

[Figure: pipelined datapath with control signals distributed across the stages: ExtSel, BSrc, OpSel, RegDst in decode/execute; MemWrite, WBSrc, RegWrite in the later stages]

How Instructions Can Interact with Each Other in a Pipeline

- An instruction in the pipeline may need a resource being used by another instruction in the pipeline: a structural hazard
- An instruction may produce data that is needed by a later instruction: a data hazard
- In the extreme case, an instruction may determine the next instruction to be executed: a control hazard (branches, interrupts, ...)
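To make the data-hazard case concrete, here is a toy checker. It is entirely illustrative: the instruction encoding, the 2-instruction hazard window, and all names are my assumptions, not the lecture's (the real window depends on register-file timing and on whether forwarding exists):

```python
# Toy data-hazard detector. Each instruction is (dest_reg, src_regs).
# Assume a register written by either of the two previous instructions
# has not yet been written back when the current instruction reads it.
def data_hazards(program):
    hazards = []
    for i, (_, srcs) in enumerate(program):
        for j in range(max(0, i - 2), i):  # writers still in flight
            dest = program[j][0]
            if dest is not None and dest in srcs:
                hazards.append((j, i, dest))
    return hazards

prog = [
    ("r1", ["r2", "r3"]),  # ADD r1, r2, r3
    ("r4", ["r1", "r5"]),  # ADD r4, r1, r5  (reads r1 too early)
    (None, ["r4"]),        # BEQZ r4         (reads r4 too early)
]
print(data_hazards(prog))  # [(0, 1, 'r1'), (1, 2, 'r4')]
```

Each reported pair (writer index, reader index, register) is a spot where the pipeline must stall, forward, or otherwise intervene, which is the subject of the feedback mechanism on the next slide.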

Feedback to Resolve Hazards

Controlling the pipeline in this manner works provided the instruction at stage i+1 can complete without any interference from instructions in stages 1 to i (otherwise deadlocks may occur).

Feedback to previous stages is used to stall or kill instructions.

[Figure: pipeline stages with feedback paths FB1, FB2, FB3, FB4 from later stages back to earlier stages]