CS61CL L10 CPU II: Control & Pipeline (1) Huddleston, Summer 2009 © UCB Jeremy Huddleston inst.eecs.berkeley.edu/~cs61c CS61CL : Machine Structures Lecture.

Slides:

Advertisements

Similar presentations

Pipelining I (1) Fall 2005 Lecture 18: Pipelining I.

Advertisements

Pipelining II (1) Fall 2005 Lecture 19: Pipelining II.

CS61C L29 CPU Design : Pipelining to Improve Performance II (1) Garcia, Fall 2006 © UCB Running away with it  Our #8 football team had a great effort.

CS152 Lec9.1 CS152 Computer Architecture and Engineering Lecture 9 Designing Single Cycle Control.

CS61C L24 Review Pipeline © UC Regents 1 CS61C - Machine Structures Lecture 24 - Review Pipelined Execution November 29, 2000 David Patterson

Inst.eecs.berkeley.edu/~cs61c UCB CS61C : Machine Structures Lecture 29 – CPU Design : Pipelining to Improve Performance II Designed for the.

CS Computer Architecture 1 CS 430 – Computer Architecture Pipelined Execution - Review William J. Taffe using slides of David Patterson.

Inst.eecs.berkeley.edu/~cs61c UCB CS61C : Machine Structures Lecture 29 – CPU Design : Pipelining to Improve Performance II Cal researcher Marty.

inst.eecs.berkeley.edu/~cs61c UCB CS61C : Machine Structures Lecture 28 – CPU Design : Pipelining to Improve Performance The College Board.

361 datapath Computer Architecture Lecture 8: Designing a Single Cycle Datapath.

CS61C L29 CPU Design : Pipelining to Improve Performance (1) Garcia, Spring 2007 © UCB Wirelessly recharge batt  Powercast & Philips have developed a.

CS61C L19 CPU Design : Designing a Single-Cycle CPU (1) Beamer, Summer 2007 © UCB Scott Beamer Instructor inst.eecs.berkeley.edu/~cs61c CS61C : Machine.

CS61C L26 Single Cycle CPU Datapath II (1) Garcia © UCB Lecturer PSOE Dan Garcia inst.eecs.berkeley.edu/~cs61c CS61C : Machine.

CS61C L30 CPU Design : Pipelining to Improve Performance II (1) Garcia, Spring 2007 © UCB E-voting bill in congress!  Rep Rush Holt (D-NJ) has a bill.

CS61C L20 Single-Cycle CPU Control (1) Beamer, Summer 2007 © UCB Scott Beamer Instructor inst.eecs.berkeley.edu/~cs61c CS61C : Machine Structures Lecture.

CS61C L26 CPU Design : Designing a Single-Cycle CPU II (1) Garcia, Spring 2007 © UCB 3.6 TB DVDs? Maybe!  Researchers at Harvard have found a way to use.

Savio Chau Single Cycle Controller Design Last Time: Discussed the Designing of a Single Cycle Datapath Control Datapath Memory Processor (CPU) Input Output.

Ceg3420 control.1 ©UCB, DAP’ 97 CEG3420 Computer Design Lecture 9.2: Designing Single Cycle Control.

CS 61C L35 Single Cycle CPU Control II (1) Garcia, Spring 2004 © UCB Lecturer PSOE Dan Garcia inst.eecs.berkeley.edu/~cs61c.

CS 61C L34 Single Cycle CPU Control I (1) Garcia, Spring 2004 © UCB Lecturer PSOE Dan Garcia inst.eecs.berkeley.edu/~cs61c.

Lecturer PSOE Dan Garcia

CS61C L28 CPU Design : Pipelining to Improve Performance I (1) Garcia, Fall 2006 © UCB 100 Msites!  Sometimes it’s nice to stop and reflect. The web was.

Microprocessor Design

ECE 232 L13. Control.1 ©UCB, DAP’ 97 ECE 232 Hardware Organization and Design Lecture 13 Control Design

CS152 / Kubiatowicz Lec8.1 2/22/99©UCB Spring 1999 CS152 Computer Architecture and Engineering Lecture 8 Designing Single Cycle Control Feb 22, 1999 John.

CS 61C L17 Control (1) A Carle, Summer 2006 © UCB inst.eecs.berkeley.edu/~cs61c/su06 CS61C : Machine Structures Lecture #17: CPU Design II – Control

CS61C L21 Pipeline © UC Regents 1 CS61C - Machine Structures Lecture 21 - Introduction to Pipelined Execution November 15, 2000 David Patterson

CS151B Computer Systems Architecture Winter 2002 TuTh 2-4pm BH Instructor: Prof. Jason Cong Lecture 8 Designing a Single Cycle Control.

CS61C L31 Pipelined Execution, part II (1) Garcia, Fall 2004 © UCB Lecturer PSOE Dan Garcia inst.eecs.berkeley.edu/~cs61c.

CS61C L26 CPU Design : Designing a Single-Cycle CPU II (1) Garcia, Fall 2006 © UCB Lecturer SOE Dan Garcia inst.eecs.berkeley.edu/~cs61c.

Inst.eecs.berkeley.edu/~cs61c UCB CS61C : Machine Structures Lecture 25 CPU design (of a single-cycle CPU) Intel is prototyping circuits that.

CS430 – Computer Architecture Introduction to Pipelined Execution

CS 61C L29 Single Cycle CPU Control II (1) Garcia, Fall 2004 © UCB Andrew Schultz inst.eecs.berkeley.edu/~cs61c-tb inst.eecs.berkeley.edu/~cs61c CS61C.

CS61C L21 Pipelining I (1) Chae, Summer 2008 © UCB Albert Chae, Instructor inst.eecs.berkeley.edu/~cs61c CS61C : Machine Structures Lecture #21 – Pipelining.

Scott Beamer, Instructor

CS61C L27 Single-Cycle CPU Control (1) Garcia, Spring 2010 © UCB inst.eecs.berkeley.edu/~cs61c UC Berkeley CS61C : Machine Structures Lecture 27 Single-cycle.

CS 61C L30 Introduction to Pipelined Execution (1) Garcia, Fall 2004 © UCB Lecturer PSOE Dan Garcia inst.eecs.berkeley.edu/~cs61c.

CS 61C L16 Datapath (1) A Carle, Summer 2004 © UCB inst.eecs.berkeley.edu/~cs61c/su05 CS61C : Machine Structures Lecture #16 – Datapath Andy.

CS61C L20 Single Cycle Datapath, Control (1) Chae, Summer 2008 © UCB Albert Chae, Instructor inst.eecs.berkeley.edu/~cs61c CS61C : Machine Structures Lecture.

CS 61C L18 Pipelining I (1) A Carle, Summer 2005 © UCB inst.eecs.berkeley.edu/~cs61c/su05 CS61C : Machine Structures Lecture #18: Pipelining

361 control Computer Architecture Lecture 9: Designing Single Cycle Control.

CS61C L26 CPU Design : Designing a Single-Cycle CPU II (1) Garcia, Spring 2010 © UCB inst.eecs.berkeley.edu/~cs61c UC Berkeley CS61C : Machine Structures.

ECE 232 L12.Datapath.1 Adapted from Patterson 97 ©UCBCopyright 1998 Morgan Kaufmann Publishers ECE 232 Hardware Organization and Design Lecture 12 Datapath.

CS 61C L28 Single Cycle CPU Control I (1) Garcia, Fall 2004 © UCB Lecturer PSOE Dan Garcia inst.eecs.berkeley.edu/~cs61c.

CS61C L27 Single Cycle CPU Control (1) Garcia, Fall 2006 © UCB Wireless High Definition?  Several companies will be working on a “WirelessHD” standard,

CS61C L19 Introduction to Pipelined Execution (1) Garcia, Fall 2005 © UCB Lecturer PSOE, new dad Dan Garcia inst.eecs.berkeley.edu/~cs61c.

CS 61C L38 Pipelined Execution, part II (1) Garcia, Spring 2004 © UCB Lecturer PSOE Dan Garcia inst.eecs.berkeley.edu/~cs61c.

CS3350B Computer Architecture Winter 2015 Lecture 5.6: Single-Cycle CPU: Datapath Control (Part 1) Marc Moreno Maza [Adapted.

CS 61C: Great Ideas in Computer Architecture (Machine Structures) Pipelining Instructors: Krste Asanovic & Vladimir Stojanovic

CS1104: Computer Organisation School of Computing National University of Singapore.

EEM 486: Computer Architecture Designing Single Cycle Control.

EEM 486: Computer Architecture Designing a Single Cycle Datapath.

CS 61C: Great Ideas in Computer Architecture Pipelining & Hazards 1 Instructors: John Wawrzynek & Vladimir Stojanovic

CS3350B Computer Architecture Winter 2015 Lecture 5.7: Single-Cycle CPU: Datapath Control (Part 2) Marc Moreno Maza [Adapted.

CSIE30300 Computer Architecture Unit 04: Basic MIPS Pipelining Hsin-Chou Chi [Adapted from material by and

CS 61C L5.2.2 Pipelining I (1) K. Meinz, Summer 2004 © UCB CS61C : Machine Structures Lecture Pipelining I Kurt Meinz inst.eecs.berkeley.edu/~cs61c.

Instructor: Senior Lecturer SOE Dan Garcia CS 61C: Great Ideas in Computer Architecture Pipelining Hazards 1.

CDA 3101 Summer 2003 Introduction to Computer Organization Pipeline Control And Pipeline Hazards 17 July 2003.

Csci 136 Computer Architecture II –Single-Cycle Datapath Xiuzhen Cheng

EEM 486: Computer Architecture Lecture 3 Designing Single Cycle Control.

CS 61C: Great Ideas in Computer Architecture (Machine Structures) Single-Cycle CPU Datapath & Control Part 2 Instructors: Krste Asanovic & Vladimir Stojanovic.

Single Cycle Controller Design

CS 110 Computer Architecture Lecture 11: Single-Cycle CPU Datapath & Control Instructor: Sören Schwertfeger School of Information.

Lecture 18: Pipelining I.

CDA 3101 Spring 2016 Introduction to Computer Organization

Pipelining concepts, datapath and hazards

CS 61C: Great Ideas in Computer Architecture Control and Pipelining

Inst.eecs.berkeley.edu/~cs61c CS61C : Machine Structures Lecture 20 CPU Design: Control II & Pipelining I TA Noah Johnson Greet class.

Lecturer: Alan Christopher

Instructors: Randy H. Katz David A. Patterson

Presentation transcript:

CS61CL L10 CPU II: Control & Pipeline (1) Huddleston, Summer 2009 © UCB Jeremy Huddleston inst.eecs.berkeley.edu/~cs61c CS61CL : Machine Structures Lecture #9 – Single Cycle CPU Design

CS61CL L10 CPU II: Control & Pipeline (2) Huddleston, Summer 2009 © UCB Review: A Single Cycle Datapath 32 ALUctr clk busW RegWr 32 busA 32 busB 55 RwRaRb RegFile Rs Rt Rd RegDst Extender 3216 imm16 ALUSrcExtOp MemtoReg clk Data In 32 MemWr = ALU 0 1 WrEnAdr Data Memory 5 Instruction Imm16RdRtRs nPC_sel instr fetch unit clk We have everything except control signals

CS61CL L10 CPU II: Control & Pipeline (3) Huddleston, Summer 2009 © UCB Recap: Meaning of the Control Signals nPC_sel: “+4” 0  PC <– PC + 4 “br” 1  PC <– PC {SignExt(Im16), 00 } Later in lecture: higher-level connection between mux and branch condition “n”=next imm16 clk PC 00 1 nPC_sel PC Ext Adder Mux Inst Address 0 1

CS61CL L10 CPU II: Control & Pipeline (4) Huddleston, Summer 2009 © UCB Recap: Meaning of the Control Signals ExtOp:“zero”, “sign” ALUsrc:0  regB; 1  immed ALUctr:“ ADD ”, “ SUB ”, “ OR ” °MemWr:1  write memory °MemtoReg: 0  ALU; 1  Mem °RegDst:0  “rt”; 1  “rd” °RegWr:1  write register 32 ALUctr clk busW RegWr 32 busA 32 busB 55 RwRaRb RegFile Rs Rt Rd RegDst Extender 3216 imm16 ALUSrc ExtOp MemtoReg clk Data In 32 MemWr ALU 0 1 WrEnAdr Data Memory 5

CS61CL L10 CPU II: Control & Pipeline (5) Huddleston, Summer 2009 © UCB Instruction Fetch Unit at the Beginning of Add Fetch the instruction from Instruction memory: Instruction = MEM[PC] same for all instructions imm16 clk PC 00 1 nPC_sel PC Ext Adder Mux Inst Address Inst Memory Instruction

CS61CL L10 CPU II: Control & Pipeline (6) Huddleston, Summer 2009 © UCB The Single Cycle Datapath during Add R[rd] = R[rs] + R[rt] oprsrtrdshamtfunct ALUctr= ADD clk busW RegWr=1 32 busA 32 busB 55 RwRaRb RegFile Rs Rt Rd RegDst=1 Extender 3216 imm16 ALUSrc=0 ExtOp=x MemtoReg=0 clk Data In 32 MemWr= = ALU 0 1 WrEnAdr Data Memory 5 Instruction Imm16RdRtRs nPC_sel=+4 instr fetch unit clk

CS61CL L10 CPU II: Control & Pipeline (7) Huddleston, Summer 2009 © UCB Instruction Fetch Unit at the End of Add PC = PC + 4 This is the same for all instructions except: Branch and Jump imm16 clk PC 00 1 nPC_sel=+4 PC Ext Adder Mux Inst Address Inst Memory

CS61CL L10 CPU II: Control & Pipeline (8) Huddleston, Summer 2009 © UCB Single Cycle Datapath during Or Immediate? oprsrtimmediate R[rt] = R[rs] OR ZeroExt[Imm16] 32 ALUctr= clk busW RegWr= 32 busA 32 busB 55 RwRaRb RegFile Rs Rt Rd RegDst= Extender 3216 imm16 ALUSrc= ExtOp= MemtoReg= clk Data In 32 MemWr= = ALU 0 1 WrEnAdr Data Memory 5 Instruction Imm16RdRtRs nPC_sel= instr fetch unit clk

CS61CL L10 CPU II: Control & Pipeline (9) Huddleston, Summer 2009 © UCB R[rt] = R[rs] OR ZeroExt[Imm16] oprsrtimmediate Single Cycle Datapath during Or Immediate? Instruction nPC_sel=+4instr fetch unit 32 ALUctr= OR clk busW RegWr=1 32 busA 32 busB 55 RwRaRb RegFile Rs Rt Rd RegDst=0 Extender 3216 imm16 ALUSrc=1 ExtOp=zero MemtoReg=0 clk Data In 32 MemWr= = ALU 0 1 WrEnAdr Data Memory 5 Imm16RdRtRs clk

CS61CL L10 CPU II: Control & Pipeline (10) Huddleston, Summer 2009 © UCB The Single Cycle Datapath during Load? R[rt] = Data Memory {R[rs] + SignExt[imm16]} oprsrtimmediate ALUctr= clk busW RegWr= 32 busA 32 busB 55 RwRaRb RegFile Rs Rt Rd RegDst= Extender 3216 imm16 ALUSrc= ExtOp= MemtoReg= clk Data In 32 MemWr= = ALU 0 1 WrEnAdr Data Memory 5 Instruction Imm16RdRtRs nPC_sel= instr fetch unit clk

CS61CL L10 CPU II: Control & Pipeline (11) Huddleston, Summer 2009 © UCB The Single Cycle Datapath during Load R[rt] = Data Memory {R[rs] + SignExt[imm16]} oprsrtimmediate ALUctr= ADD clk busW RegWr=1 32 busA 32 busB 55 RwRaRb RegFile Rs Rt Rd RegDst=0 Extender 3216 imm16 ALUSrc=1 ExtOp=sign MemtoReg=1 clk Data In 32 MemWr= = ALU 0 1 WrEnAdr Data Memory 5 Instruction Imm16RdRtRs nPC_sel=+4 instr fetch unit clk

CS61CL L10 CPU II: Control & Pipeline (12) Huddleston, Summer 2009 © UCB The Single Cycle Datapath during Store? oprsrtimmediate Data Memory {R[rs] + SignExt[imm16]} = R[rt] 32 ALUctr= clk busW RegWr= 32 busA 32 busB 55 RwRaRb RegFile Rs Rt Rd RegDst= Extender 3216 imm16 ALUSrc= ExtOp= MemtoReg= clk Data In 32 MemWr= = ALU 0 1 WrEnAdr Data Memory 5 Instruction Imm16RdRtRs nPC_sel= instr fetch unit clk

CS61CL L10 CPU II: Control & Pipeline (13) Huddleston, Summer 2009 © UCB The Single Cycle Datapath during Store Data Memory {R[rs] + SignExt[imm16]} = R[rt] oprsrtimmediate ALUctr= ADD clk busW RegWr=0 32 busA 32 busB 55 RwRaRb RegFile Rs Rt Rd RegDst=x Extender 3216 imm16 ALUSrc=1 ExtOp=sign MemtoReg=x clk Data In 32 MemWr= = ALU 0 1 WrEnAdr Data Memory 5 Instruction Imm16RdRtRs nPC_sel=+4 instr fetch unit clk

CS61CL L10 CPU II: Control & Pipeline (14) Huddleston, Summer 2009 © UCB The Single Cycle Datapath during Branch? if (R[rs] - R[rt] == 0) then Zero = 1 ; else Zero = 0 oprsrtimmediate ALUctr= clk busW RegWr= 32 busA 32 busB 55 RwRaRb RegFile Rs Rt Rd RegDst= Extender 3216 imm16 ALUSrc= ExtOp= MemtoReg= clk Data In 32 MemWr= = ALU 0 1 WrEnAdr Data Memory 5 Instruction Imm16RdRtRs nPC_sel= instr fetch unit clk

CS61CL L10 CPU II: Control & Pipeline (15) Huddleston, Summer 2009 © UCB The Single Cycle Datapath during Branch if (R[rs] - R[rt] == 0) then “=“ = 1 ; else “=“ = 0 oprsrtimmediate ALUctr= x clk busW RegWr=0 32 busA 32 busB 55 RwRaRb RegFile Rs Rt Rd RegDst=x Extender 3216 imm16 ALUSrc=0 ExtOp=x MemtoReg=x clk Data In 32 MemWr= = ALU 0 1 WrEnAdr Data Memory 5 Instruction Imm16RdRtRs nPC_sel=br instr fetch unit clk

CS61CL L10 CPU II: Control & Pipeline (16) Huddleston, Summer 2009 © UCB Instruction Fetch Unit at the End of Branch if (Equals == 1) then PC = PC SignExt[imm16]*4 ; else PC = PC + 4 oprsrtimmediate What is encoding of nPC_sel? Direct MUX select? Branch inst. / not branch Let’s pick 2nd option Adr Inst Memory nPC_sel Instruction Equal nPC_sel Q: What logic gate? imm16 clk PC 00 1 PC Ext Adder Mux 0 1 MUX ctrl

CS61CL L10 CPU II: Control & Pipeline (17) Huddleston, Summer 2009 © UCB 1. Analyze instruction set architecture (ISA)  datapath requirements meaning of each instruction is given by the register transfers datapath must include storage element for ISA registers datapath must support each register transfer 2. Select set of datapath components and establish clocking methodology 3. Assemble datapath meeting requirements 4. Analyze implementation of each instruction to determine setting of control points that effects the register transfer. 5. Assemble the control logic (hard part!) How to Design a Processor: step-by-step

CS61CL L10 CPU II: Control & Pipeline (18) Huddleston, Summer 2009 © UCB Step 4: Given Datapath: RTL  Control ALUctr RegDst ALUSrc ExtOp MemtoRegMemWr Instruction Imm16RdRsRt nPC_sel Adr Inst Memory DATA PATH Control Op Fun RegWr

CS61CL L10 CPU II: Control & Pipeline (19) Huddleston, Summer 2009 © UCB A Summary of the Control Signals addsuborilwswbeqjump RegDst ALUSrc MemtoReg RegWrite MemWrite nPCsel Jump ExtOp ALUctr x Add x Subtract Or Add x 1 x x 0 x x Subtract x x x 0 0 ? 1 x x optarget address oprsrtrdshamtfunct oprsrt immediate R-type I-type J-type add, sub ori, lw, sw, beq jump func op Appendix A See We Don’t Care :-)

CS61CL L10 CPU II: Control & Pipeline (20) Huddleston, Summer 2009 © UCB Boolean Expressions for Controller RegDst = add + sub ALUSrc = ori + lw + sw MemtoReg = lw RegWrite = add + sub + ori + lw MemWrite = sw nPCsel = beq Jump = jump ExtOp = lw + sw ALUctr[0] = sub + beq (assume ALUctr is 0 ADD, 01: SUB, 10: OR ) ALUctr[1] = or where, rtype = ~op 5  ~op 4  ~op 3  ~op 2  ~op 1  ~op 0, ori = ~op 5  ~op 4  op 3  op 2  ~op 1  op 0 lw = op 5  ~op 4  ~op 3  ~op 2  op 1  op 0 sw = op 5  ~op 4  op 3  ~op 2  op 1  op 0 beq = ~op 5  ~op 4  ~op 3  op 2  ~op 1  ~op 0 jump = ~op 5  ~op 4  ~op 3  ~op 2  op 1  ~op 0 add = rtype  func 5  ~func 4  ~func 3  ~func 2  ~func 1  ~func 0 sub = rtype  func 5  ~func 4  ~func 3  ~func 2  func 1  ~func 0 How do we implement this in gates?

CS61CL L10 CPU II: Control & Pipeline (21) Huddleston, Summer 2009 © UCB Controller Implementation add sub ori lw sw beq jump RegDst ALUSrc MemtoReg RegWrite MemWrite nPCsel Jump ExtOp ALUctr[0] ALUctr[1] “AND” logic “OR” logic opcodefunc

CS61CL L10 CPU II: Control & Pipeline (22) Huddleston, Summer 2009 © UCB Processor Performance Can we estimate the clock rate (frequency) of our single-cycle processor? We know: 1 cycle per instruction lw is the most demanding instruction. Assume these delays for major pieces of the datapath: -Instr. Mem, ALU, Data Mem : 2 ns each, regfile 1 ns -Instruction execution requires: = 8 ns  125 MHz What can we do to improve clock rate? Will this improve performance as well? We want increases in clock rate to result in programs executing quicker.

CS61CL L10 CPU II: Control & Pipeline (23) Huddleston, Summer 2009 © UCB Gotta Do Laundry Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, fold, and put away Washer takes 30 minutes Dryer takes 30 minutes “Folder” takes 30 minutes “Stasher” takes 30 minutes to put clothes into drawers ABCD

CS61CL L10 CPU II: Control & Pipeline (24) Huddleston, Summer 2009 © UCB Sequential Laundry Sequential laundry takes 8 hours for 4 loads TaskOrderTaskOrder B C D A 30 Time 30 6 PM AM

CS61CL L10 CPU II: Control & Pipeline (25) Huddleston, Summer 2009 © UCB Pipelined Laundry Pipelined laundry takes 3.5 hours for 4 loads! TaskOrderTaskOrder B C D A 12 2 AM 6 PM Time 30

CS61CL L10 CPU II: Control & Pipeline (26) Huddleston, Summer 2009 © UCB General Definitions Latency: time to completely execute a certain task for example, time to read a sector from disk is disk access time or disk latency Throughput: amount of work that can be done over a period of time

CS61CL L10 CPU II: Control & Pipeline (27) Huddleston, Summer 2009 © UCB Pipelining doesn’t help latency of single task, it helps throughput of entire workload Multiple tasks operating simultaneously using different resources Potential speedup = Number pipe stages Time to “fill” pipeline and time to “drain” it reduces speedup: 2.3X v. 4X in this example 6 PM 789 Time B C D A 30 TaskOrderTaskOrder Pipelining Lessons (1/2)

CS61CL L10 CPU II: Control & Pipeline (28) Huddleston, Summer 2009 © UCB Suppose new Washer takes 20 minutes, new Stasher takes 20 minutes. How much faster is pipeline? Pipeline rate limited by slowest pipeline stage Unbalanced lengths of pipe stages reduces speedup Pipelining Lessons (2/2) 6 PM 789 Time B C D A 30 TaskOrderTaskOrder

CS61CL L10 CPU II: Control & Pipeline (29) Huddleston, Summer 2009 © UCB 1) IFtch: Instruction Fetch, Increment PC 2) Dcd: Instruction Decode, Read Registers 3) Exec: Mem-ref:Calculate Address Arith-log: Perform Operation 4) Mem: Load: Read Data from Memory Store: Write Data to Memory 5) WB: Write Data Back to Register Steps in Executing MIPS

CS61CL L10 CPU II: Control & Pipeline (30) Huddleston, Summer 2009 © UCB Every instruction must take same number of steps, also called pipeline “stages”, so some will go idle sometimes IFtchDcdExecMemWB IFtchDcdExecMemWB IFtchDcdExecMemWB IFtchDcdExecMemWB IFtchDcdExecMemWB IFtchDcdExecMemWB Time Pipelined Execution Representation

CS61CL L10 CPU II: Control & Pipeline (31) Huddleston, Summer 2009 © UCB Use datapath figure to represent pipeline IFtchDcdExecMemWB ALU I$ Reg D$Reg PC instruction memory +4 rt rs rd registers ALU Data memory imm 1. Instruction Fetch 2. Decode/ Register Read 3. Execute4. Memory 5. Write Back Review: Datapath for MIPS

CS61CL L10 CPU II: Control & Pipeline (32) Huddleston, Summer 2009 © UCB I n s t r. O r d e r Load Add Store Sub Or I$ Time (clock cycles) I$ ALU Reg I$ D$ ALU Reg D$ Reg I$ D$ Reg ALU Reg D$ Reg D$ ALU (In Reg, right half highlight read, left half write) Reg I$ Graphical Pipeline Representation

CS61CL L10 CPU II: Control & Pipeline (33) Huddleston, Summer 2009 © UCB Suppose 2 ns for memory access, 2 ns for ALU operation, and 1 ns for register file read or write; compute instruction rate Nonpipelined Execution: lw : IF + Read Reg + ALU + Memory + Write Reg = = 8 ns add : IF + Read Reg + ALU + Write Reg = = 6 ns (recall 8ns for single-cycle processor) Pipelined Execution: Max(IF,Read Reg,ALU,Memory,Write Reg) = 2 ns Example

CS61CL L10 CPU II: Control & Pipeline (34) Huddleston, Summer 2009 © UCB Pipeline Hazard: Matching socks in later load A depends on D; stall since folder tied up TaskOrderTaskOrder B C D A E F bubble 12 2 AM 6 PM Time 30

CS61CL L10 CPU II: Control & Pipeline (35) Huddleston, Summer 2009 © UCB Administrivia Midterm Solutions Regrade Requests HW7 (Design Document)

CS61CL L10 CPU II: Control & Pipeline (36) Huddleston, Summer 2009 © UCB Problems for Pipelining CPUs Limits to pipelining: Hazards prevent next instruction from executing during its designated clock cycle Structural hazards: HW cannot support some combination of instructions (single person to fold and put clothes away) Control hazards: Pipelining of branches causes later instruction fetches to wait for the result of the branch Data hazards: Instruction depends on result of prior instruction still in the pipeline (missing sock) These might result in pipeline stalls or “bubbles” in the pipeline.

CS61CL L10 CPU II: Control & Pipeline (37) Huddleston, Summer 2009 © UCB Read same memory twice in same clock cycle I$ Load Instr 1 Instr 2 Instr 3 Instr 4 ALU I$ Reg D$Reg ALU I$ Reg D$Reg ALU I$ Reg D$Reg ALU Reg D$Reg ALU I$ Reg D$Reg I n s t r. O r d e r Time (clock cycles) Structural Hazard #1: Single Memory (1/2)

CS61CL L10 CPU II: Control & Pipeline (38) Huddleston, Summer 2009 © UCB Structural Hazard #1: Single Memory (2/2) Solution: infeasible and inefficient to create second memory (We’ll learn about this more next week) so simulate this by having two Level 1 Caches (a temporary smaller [of usually most recently used] copy of memory) have both an L1 Instruction Cache and an L1 Data Cache need more complex hardware to control when both caches miss

CS61CL L10 CPU II: Control & Pipeline (39) Huddleston, Summer 2009 © UCB Structural Hazard #2: Registers (1/2) Can we read and write to registers simultaneously? I$ sw Instr 1 Instr 2 Instr 3 Instr 4 ALU I$ Reg D$Reg ALU I$ Reg D$Reg ALU I$ Reg D$Reg ALU Reg D$Reg ALU I$ Reg D$Reg I n s t r. O r d e r Time (clock cycles)

CS61CL L10 CPU II: Control & Pipeline (40) Huddleston, Summer 2009 © UCB Structural Hazard #2: Registers (2/2) Two different solutions have been used: 1) RegFile access is VERY fast: takes less than half the time of ALU stage -Write to Registers during first half of each clock cycle -Read from Registers during second half of each clock cycle 2) Build RegFile with independent read and write ports Result: can perform Read and Write during same clock cycle

CS61CL L10 CPU II: Control & Pipeline (41) Huddleston, Summer 2009 © UCB Control Hazard: Branching (1/8) Where do we do the compare for the branch? I$ beq Instr 1 Instr 2 Instr 3 Instr 4 ALU I$ Reg D$Reg ALU I$ Reg D$Reg ALU I$ Reg D$Reg ALU Reg D$Reg ALU I$ Reg D$Reg I n s t r. O r d e r Time (clock cycles)

CS61CL L10 CPU II: Control & Pipeline (42) Huddleston, Summer 2009 © UCB Control Hazard: Branching (2/8) We had put branch decision-making hardware in ALU stage therefore two more instructions after the branch will always be fetched, whether or not the branch is taken Desired functionality of a branch if we do not take the branch, don’t waste any time and continue executing normally if we take the branch, don’t execute any instructions after the branch, just go to the desired label

CS61CL L10 CPU II: Control & Pipeline (43) Huddleston, Summer 2009 © UCB Control Hazard: Branching (3/8) Initial Solution: Stall until decision is made insert “no-op” instructions (those that accomplish nothing, just take time) or hold up the fetch of the next instruction (for 2 cycles). Drawback: branches take 3 clock cycles each (assuming comparator is put in ALU stage)

CS61CL L10 CPU II: Control & Pipeline (44) Huddleston, Summer 2009 © UCB Control Hazard: Branching (4/8) Optimization #1: insert special branch comparator in Stage 2 as soon as instruction is decoded (Opcode identifies it as a branch), immediately make a decision and set the new value of the PC Benefit: since branch is complete in Stage 2, only one unnecessary instruction is fetched, so only one no-op is needed Side Note: This means that branches are idle in Stages 3, 4 and 5.

CS61CL L10 CPU II: Control & Pipeline (45) Huddleston, Summer 2009 © UCB Control Hazard: Branching (5/8) Branch comparator moved to Decode stage. I$ beq Instr 1 Instr 2 Instr 3 Instr 4 ALU I$ Reg D$Reg ALU I$ Reg D$Reg ALU I$ Reg D$Reg ALU Reg D$Reg ALU I$ Reg D$Reg I n s t r. O r d e r Time (clock cycles)

CS61CL L10 CPU II: Control & Pipeline (46) Huddleston, Summer 2009 © UCB Control Hazard: Branching (6a/8) User inserting no-op instruction add beq nop ALU I$ Reg D$Reg ALU I$ Reg D$Reg ALU Reg D$Reg I$ I n s t r. O r d e r Time (clock cycles) bub ble Impact: 2 clock cycles per branch instruction  slow lw bub ble

CS61CL L10 CPU II: Control & Pipeline (47) Huddleston, Summer 2009 © UCB Control Hazard: Branching (6b/8) Controller inserting a single bubble add beq lw ALU I$ Reg D$Reg ALU I$ Reg D$Reg ALU Reg D$Reg I$ I n s t r. O r d e r Time (clock cycles) bub ble Impact: 2 clock cycles per branch instruction  slow

CS61CL L10 CPU II: Control & Pipeline (48) Huddleston, Summer 2009 © UCB Control Hazard: Branching (7/8) Optimization #2: Redefine branches Old definition: if we take the branch, none of the instructions after the branch get executed by accident New definition: whether or not we take the branch, the single instruction immediately following the branch gets executed (called the branch-delay slot) The term “Delayed Branch” means we always execute inst after branch This optimization is used with MIPS

CS61CL L10 CPU II: Control & Pipeline (49) Huddleston, Summer 2009 © UCB Control Hazard: Branching (8/8) Notes on Branch-Delay Slot Worst-Case Scenario: can always put a no-op in the branch-delay slot Better Case: can find an instruction preceding the branch which can be placed in the branch-delay slot without affecting flow of the program -re-ordering instructions is a common method of speeding up programs -compiler/assembler must be very smart in order to find instructions to do this -usually can find such an instruction at least 50% of the time -Jumps also have a delay slot…

CS61CL L10 CPU II: Control & Pipeline (50) Huddleston, Summer 2009 © UCB Example: Nondelayed vs. Delayed Branch add $1,$2,$3 sub $4, $5,$6 beq $1, $4, Exit or $8, $9,$10 xor $10, $1,$11 Nondelayed Branch add $1,$2,$3 sub $4, $5,$6 beq $1, $4, Exit or $8, $9,$10 xor $10, $1,$11 Delayed Branch Exit: MALTAL

CS61CL L10 CPU II: Control & Pipeline (51) Huddleston, Summer 2009 © UCB Data Hazards (1/2) Consider the following sequence of instructions add $t0, $t1, $t2 sub $t4, $t0,$t3 and $t5, $t0,$t6 or $t7, $t0,$t8 xor $t9, $t0,$t10

CS61CL L10 CPU II: Control & Pipeline (52) Huddleston, Summer 2009 © UCB Data Hazards (2/2) Data-flow backward in time are hazards sub $t4,$t0,$t3 ALU I$ Reg D$Reg and $t5,$t0,$t6 ALU I$ Reg D$Reg or $t7,$t0,$t8 I$ ALU Reg D$Reg xor $t9,$t0,$t10 ALU I$ Reg D$Reg add $t0,$t1,$t2 IFID/RFEXMEMWB ALU I$ Reg D$ Reg I n s t r. O r d e r Time (clock cycles)

CS61CL L10 CPU II: Control & Pipeline (53) Huddleston, Summer 2009 © UCB Data Hazard Solution: Forwarding Forward result from one stage to another sub $t4,$t0,$t3 ALU I$ Reg D$Reg and $t5,$t0,$t6 ALU I$ Reg D$Reg or $t7,$t0,$t8 I$ ALU Reg D$Reg xor $t9,$t0,$t10 ALU I$ Reg D$Reg add $t0,$t1,$t2 IFID/RFEXMEMWB ALU I$ Reg D$ Reg “ or ” hazard solved by register hardware

CS61CL L10 CPU II: Control & Pipeline (54) Huddleston, Summer 2009 © UCB Data Hazard: Loads (1/4) Dataflow backwards in time are hazards Can’t solve all cases with forwarding Must stall instruction dependent on load, then forward (more hardware) sub $t3,$t0,$t2 ALU I$ Reg D$Reg lw $t0,0($t1) IFID/RFEXMEMWB ALU I$ Reg D$ Reg

CS61CL L10 CPU II: Control & Pipeline (55) Huddleston, Summer 2009 © UCB Data Hazard: Loads (2/4) Hardware stalls pipeline Called “interlock” sub $t3,$t0,$t2 ALU I$ Reg D$Reg bub ble and $t5,$t0,$t4 ALU I$ Reg D$Reg bub ble or $t7,$t0,$t6 I$ ALU Reg D$ bub ble lw $t0, 0($t1) IFID/RFEXMEMWB ALU I$ Reg D$ Reg

CS61CL L10 CPU II: Control & Pipeline (56) Huddleston, Summer 2009 © UCB Data Hazard: Loads (3/4) Instruction slot after a load is called “load delay slot” If that instruction uses the result of the load, then the hardware interlock will stall it for one cycle. If the compiler puts an unrelated instruction in that slot, then no stall Letting the hardware stall the instruction in the delay slot is equivalent to putting a nop in the slot (except the latter uses more code space)

CS61CL L10 CPU II: Control & Pipeline (57) Huddleston, Summer 2009 © UCB Data Hazard: Loads (4/4) Stall is equivalent to nop sub $t3,$t0,$t2 and $t5,$t0,$t4 or $t7,$t0,$t6 I$ ALU Reg D$ lw $t0, 0($t1) ALU I$ Reg D$ Reg bub ble ALU I$ Reg D$ Reg ALU I$ Reg D$ Reg nop

CS61CL L10 CPU II: Control & Pipeline (58) Huddleston, Summer 2009 © UCB °5 steps to design a processor 1. Analyze instruction set  datapath requirements 2. Select set of datapath components & establish clock methodology 3. Assemble datapath meeting the requirements 4. Analyze implementation of each instruction to determine setting of control points that effects the register transfer. 5. Assemble the control logic Formulate Logic Equations Design Circuits Summary: Single-cycle Processor Control Datapath Memory Processor Input Output

CS61CL L10 CPU II: Control & Pipeline (59) Huddleston, Summer 2009 © UCB Things to Remember Optimal Pipeline Each stage is executing part of an instruction each clock cycle. One instruction finishes during each clock cycle. On average, execute far more quickly. What makes this work? Similarities between instructions allow us to use same stages for all instructions (generally). Each stage takes about the same amount of time as all others: little wasted time.

CS61CL L10 CPU II: Control & Pipeline (60) Huddleston, Summer 2009 © UCB “And in Conclusion..” Pipeline challenge is hazards Forwarding helps w/many data hazards Delayed branch helps with control hazard in 5 stage pipeline Load delay slot / interlock necessary More aggressive performance: Superscalar Out-of-order execution

CS61CL L10 CPU II: Control & Pipeline (61) Huddleston, Summer 2009 © UCB Bonus slides These are extra slides that used to be included in lecture notes, but have been moved to this, the “bonus” area to serve as a supplement. The slides will appear in the order they would have in the normal presentation

CS61CL L10 CPU II: Control & Pipeline (62) Huddleston, Summer 2009 © UCB RTL: The Add Instruction add rd, rs, rt MEM[PC]Fetch the instruction from memory R[rd] = R[rs] + R[rt]The actual operation PC = PC + 4Calculate the next instruction’s address oprsrtrdshamtfunct bits 5 bits

CS61CL L10 CPU II: Control & Pipeline (63) Huddleston, Summer 2009 © UCB A Summary of the Control Signals (1/2) inst Register Transfer add R[rd]  R[rs] + R[rt];PC  PC + 4 ALUsrc = RegB, ALUctr = “ ADD ”, RegDst = rd, RegWr, nPC_sel = “+4” sub R[rd]  R[rs] – R[rt];PC  PC + 4 ALUsrc = RegB, ALUctr = “ SUB ”, RegDst = rd, RegWr, nPC_sel = “+4” ori R[rt]  R[rs] + zero_ext(Imm16); PC  PC + 4 ALUsrc = Im, Extop = “Z”,ALUctr = “ OR ”, RegDst = rt,RegWr, nPC_sel =“+4” lw R[rt]  MEM[ R[rs] + sign_ext(Imm16)];PC  PC + 4 ALUsrc = Im, Extop = “sn”, ALUctr = “ ADD ”, MemtoReg, RegDst = rt, RegWr, nPC_sel = “+4” sw MEM[ R[rs] + sign_ext(Imm16)]  R[rs];PC  PC + 4 ALUsrc = Im, Extop = “sn”, ALUctr = “ ADD ”, MemWr, nPC_sel = “+4” beq if ( R[rs] == R[rt] ) then PC  PC + sign_ext(Imm16)] || 00 else PC  PC + 4 nPC_sel = “br”, ALUctr = “ SUB ”

CS61CL L10 CPU II: Control & Pipeline (64) Huddleston, Summer 2009 © UCB 32 ALUctr = Clk busW RegWr = 32 busA 32 busB 555 RwRaRb bit Registers Rs Rt Rd RegDst = Extender Mux imm16 ALUSrc = ExtOp = Mux MemtoReg = Clk Data In WrEn 32 Adr Data Memory 32 MemWr = ALU Instruction Fetch Unit Clk Zero Instruction Imm16RdRsRt New PC = { PC[31..28], target address, 00 } nPC_sel= The Single Cycle Datapath during Jump optarget address J-typejump 25 Jump= TA26

CS61CL L10 CPU II: Control & Pipeline (65) Huddleston, Summer 2009 © UCB The Single Cycle Datapath during Jump 32 ALUctr =x Clk busW RegWr = 0 32 busA 32 busB 555 RwRaRb bit Registers Rs Rt Rd RegDst = x Extender Mux imm16 ALUSrc = x ExtOp = x Mux MemtoReg = x Clk Data In WrEn 32 Adr Data Memory 32 MemWr = 0 ALU Instruction Fetch Unit Clk Zero Instruction RdRsRt New PC = { PC[31..28], target address, 00 } nPC_sel=? Jump=1 Imm16 TA26 optarget address J-typejump 25

CS61CL L10 CPU II: Control & Pipeline (66) Huddleston, Summer 2009 © UCB Instruction Fetch Unit at the End of Jump Adr Inst Memory Adder PC Clk 00 Mux 1 nPC_sel imm16 Instruction 0 1 Zero nPC_MUX_sel New PC = { PC[31..28], target address, 00 } optarget address J-typejump 25 How do we modify this to account for jumps? Jump

CS61CL L10 CPU II: Control & Pipeline (67) Huddleston, Summer 2009 © UCB Instruction Fetch Unit at the End of Jump Adr Inst Memory Adder PC Clk 00 Mux 1 nPC_sel imm16 Instruction 0 1 Zero nPC_MUX_sel New PC = { PC[31..28], target address, 00 } optarget address J-typejump 25 Mux 1 0 Jump TA 4 (MSBs) 00 Query Can Zero still get asserted? Does nPC_sel need to be 0? If not, what? 26

CS61CL L10 CPU II: Control & Pipeline (68) Huddleston, Summer 2009 © UCB Historical Trivia First MIPS design did not interlock and stall on load-use data hazard Real reason for name behind MIPS: Microprocessor without Interlocked Pipeline Stages Word Play on acronym for Millions of Instructions Per Second, also called MIPS

CS61CL L10 CPU II: Control & Pipeline (69) Huddleston, Summer 2009 © UCB Pipeline Hazard: Matching socks in later load A depends on D; stall since folder tied up; Note this is much different from processor cases so far. We have not had a earlier instruction depend on a later one. TaskOrderTaskOrder B C D A E F bubble 12 2 AM 6 PM Time 30

CS61CL L10 CPU II: Control & Pipeline (70) Huddleston, Summer 2009 © UCB Out-of-Order Laundry: Don’t Wait A depends on D; rest continue; need more resources to allow out-of-order TaskOrderTaskOrder 12 2 AM 6 PM Time B C D A 30 E F bubble

CS61CL L10 CPU II: Control & Pipeline (71) Huddleston, Summer 2009 © UCB Superscalar Laundry: Parallel per stage More resources, HW to match mix of parallel tasks? TaskOrderTaskOrder 12 2 AM 6 PM Time B C D A E F (light clothing) (dark clothing) (very dirty clothing) (light clothing) (dark clothing) (very dirty clothing) 30

CS61CL L10 CPU II: Control & Pipeline (72) Huddleston, Summer 2009 © UCB Superscalar Laundry: Mismatch Mix Task mix underutilizes extra resources TaskOrderTaskOrder 12 2 AM 6 PM Time 30 (light clothing) (dark clothing) (light clothing) A B D C