Problem with Single Cycle Processor Design

Slides:



Advertisements
Similar presentations
CS/COE1541: Introduction to Computer Architecture Datapath and Control Review Sangyeun Cho Computer Science Department University of Pittsburgh.
Advertisements

361 datapath Computer Architecture Lecture 8: Designing a Single Cycle Datapath.
The Processor: Datapath & Control
CS61C L26 Single Cycle CPU Datapath II (1) Garcia © UCB Lecturer PSOE Dan Garcia inst.eecs.berkeley.edu/~cs61c CS61C : Machine.
361 multipath..1 ECE 361 Computer Architecture Lecture 10: Designing a Multiple Cycle Processor.
Savio Chau Single Cycle Controller Design Last Time: Discussed the Designing of a Single Cycle Datapath Control Datapath Memory Processor (CPU) Input Output.
Microprocessor Design
ENEE350 Ankur Srivastava University of Maryland, College Park Based on Slides from Mary Jane Irwin ( )
CS61C L26 CPU Design : Designing a Single-Cycle CPU II (1) Garcia, Fall 2006 © UCB Lecturer SOE Dan Garcia inst.eecs.berkeley.edu/~cs61c.
Inst.eecs.berkeley.edu/~cs61c UCB CS61C : Machine Structures Lecture 25 CPU design (of a single-cycle CPU) Intel is prototyping circuits that.
EEM 486: Computer Architecture Lecture 3 Designing a Single Cycle Datapath.
361 control Computer Architecture Lecture 9: Designing Single Cycle Control.
EECC550 - Shaaban #1 Lec # 5 Spring CPU Design Steps 1. Analyze instruction set operations using independent RTN => datapath requirements.
CS3350B Computer Architecture Winter 2015 Lecture 5.6: Single-Cycle CPU: Datapath Control (Part 1) Marc Moreno Maza [Adapted.
Computer Organization CS224 Fall 2012 Lesson 26. Summary of Control Signals addsuborilwswbeqj RegDst ALUSrc MemtoReg RegWrite MemWrite Branch Jump ExtOp.
CASE STUDY OF A MULTYCYCLE DATAPATH. Alternative Multiple Cycle Datapath (In Textbook) Minimizes Hardware: 1 memory, 1 ALU Ideal Memory Din Address 32.
CPE 442 multipath..1 Intro. to Computer Architecture CpE 242 Computer Architecture and Engineering Designing a Multiple Cycle Processor.
EEM 486: Computer Architecture Designing Single Cycle Control.
Computer Organization CS224 Fall 2012 Lesson 22. The Big Picture  The Five Classic Components of a Computer  Chapter 4 Topic: Processor Design Control.
EEM 486: Computer Architecture Designing a Single Cycle Datapath.
Computer Architecture and Design – ECEN 350 Part 6 [Some slides adapted from A. Sprintson, M. Irwin, D. Paterson and others]
CPE 442 single-cycle datapath.1 Intro. To Computer Architecture CpE242 Computer Architecture and Engineering Designing a Single Cycle Datapath.
W.S Computer System Design Lecture 4 Wannarat Suntiamorntut.
Datapath and Control Unit Design
CS3350B Computer Architecture Winter 2015 Lecture 5.7: Single-Cycle CPU: Datapath Control (Part 2) Marc Moreno Maza [Adapted.
By Wannarat Computer System Design Lecture 4 Wannarat Suntiamorntut.
ECE-C355 Computer Structures Winter 2008 The MIPS Datapath Slides have been adapted from Prof. Mary Jane Irwin ( )
Datapath and Control AddressInstruction Memory Write Data Reg Addr Register File ALU Data Memory Address Write Data Read Data PC Read Data Read Data.
Datapath and Control AddressInstruction Memory Write Data Reg Addr Register File ALU Data Memory Address Write Data Read Data PC Read Data Read Data.
COM181 Computer Hardware Lecture 6: The MIPs CPU.
Single Cycle Controller Design
CS/EE 362 multipath..1 ©DAP & SIK 1995 CS/EE 362 Hardware Fundamentals Lecture 14: Designing a Multi-Cycle Datapath (Chapter 5: Hennessy and Patterson)
Chapter 4 From: Dr. Iyad F. Jafar Basic MIPS Architecture: Multi-Cycle Datapath and Control.
Computer Architecture Lecture 6.  Our implementation of the MIPS is simplified memory-reference instructions: lw, sw arithmetic-logical instructions:
CS Computer Architecture Week 10: Single Cycle Implementation

Single-Cycle Datapath and Control
IT 251 Computer Organization and Architecture
(Chapter 5: Hennessy and Patterson) Winter Quarter 1998 Chris Myers
Designing a Multicycle Processor
Computer Organization Fall 2017 Chapter 4A: The Processor, Part A
Basic MIPS Architecture
Processor (I).
CS/COE0447 Computer Organization & Assembly Language
CpE242 Computer Architecture and Engineering Designing a Single Cycle Datapath Start: X:40.
CPU Organization (Design)
Single Cycle CPU Design
SOLUTIONS CHAPTER 4.
CS 704 Advanced Computer Architecture
Single-Cycle CPU DataPath.
CS 704 Advanced Computer Architecture
Vladimir Stojanovic and Nicholas Weaver
Instructors: Randy H. Katz David A. Patterson
CS152 Computer Architecture and Engineering Lecture 8 Designing a Single Cycle Datapath Start: X:40.
The Processor Lecture 3.2: Building a Datapath with Control
Vishwani D. Agrawal James J. Danaher Professor
Guest Lecturer TA: Shreyas Chand
COMS 361 Computer Organization
COSC 2021: Computer Organization Instructor: Dr. Amir Asif
Processor: Multi-Cycle Datapath & Control
inst.eecs.berkeley.edu/~cs61c-to
Prof. Giancarlo Succi, Ph.D., P.Eng.
Instructors: Randy H. Katz David A. Patterson
Alternative datapath (book): Multiple Cycle Datapath
The Processor: Datapath & Control.
COMS 361 Computer Organization
What You Will Learn In Next Few Sets of Lectures
Designing a Single-Cycle Processor
Processor: Datapath and Control
CS161 – Design and Architecture of Computer Systems
Presentation transcript:

Problem with Single Cycle Processor Design The Root of the Single Cycle Processor’s Problem: The Cycle Time has to be Long Enough for the Slowest Instruction. Time is wasted in short instructions. This is a serious problem because short instructions occur much more often. Solution: Break the Instruction into Smaller Steps Execute Each Step (Instead of the Entire Instruction) in One Cycle Cycle Time: Time it Takes to Execute the Longest Step Keep All the Steps to a Similar Length Clock Jump R-Type Load Instr Fetch Instr decode PC write Instr decode R read ALU delay Reg write Mem read Time wasted Clocks Jump R-Type Load Instr Fetch Instr decode PC write Instr decode R read ALU delay Reg write Mem read

Advantages and Complications of Multi Cycle Data Path Cycle time is much faster Allows different instructions take different number of cycles to complete Load takes five cycles Jump only takes three cycles Allows a functional unit to be used more than once per instruction Complications Need to add intermediate registers to hold data between steps To Make Sure Intermediate Values Are Captured Before Next Clock Need more complicate controller Need to add more multiplexors for sharing function units

Purpose of Intermediate Registers Slow Clock R1 R2 X X 1 1 1 X . . . . X 1 X X 1 X 1 1 Clk Fast Clock R1 R2 X X X X . . . . 1 X X X 1 X 1 1 Clk Intermediate Register R1 R2 1 X 1 X 1 X . . . . 1 1 X X X 1 X 1 1 Clk

But Where to Put the Intermediate Registers? To Start With, Add the Intermediate Registers at the End of Each Step in the Instruction Execution Sequence Instruction Fetch Decode/Operand Fetch Execute Access Memory Store Results Possible places to put intermediate registers Next Instruction Warning: Make Sure All Paths Between Intermediate Registers Have Similar Delays. Otherwise, the Overall Performance Can Be Worse Than Single Cycle Data Path!

Basic Idea of Multi Cycle Data Path Control Basic Idea of Multi Cycle Data Path R-type 4 cycles Load 5 cycles Jump 3 cycles nPC_sel MemtoReg PC_Wr IR_Wr Op Code ALUSrc ALUctr ExtOp MemWr RegDst RegWr Store Result PC R mux A Exec Instruction Fetch Operand Fetch Next PC IR B Memory Access M

Reuse of Function Units in Multi Cycle Data Path Since intermediate results are stored in intermediate registers, function units can be doing different things at different time. Examples: Memory can be used to store both instructions and data ALU can be used to do arithmetic and calculate branch address Price to pay: extra registers (IR, ALUout) and multiplexors Instr Reg Load Word Instruction: Additional logic PC 1) Instruction Fetch Mem Data Reg Mem mux 2) Calculate Address & Read Memory ALU Data address 3) Register Mem Data Next address calculation Arithmetic During instruction fetch All PC During branch During calculation Reg A PC PC Additional Logic 4 Reg A 4 mux Reg file or mem Imm16 shift 2 ALUout Shift 2 bits for branch Reg B Reg File Reg B mux Instr IR Need to hold the output so ALU can be reused Instruction (15:0) (15:0) Single Cycle Data Path Multi Cycle Data Path

Dual- Port Ideal Memory Need to Change Memory to Dual Port Independent Read (RAdr, Dout) and Write (WAdr, Din) Ports Read and Write (to Different Location) Can Occur at the Same Cycle Read Port is a Combinational Path: Read Address Valid  Memory Read Access Delay  Data Out Valid Write Port is Falling Edge Triggered MemWrite = 1 Data In is Written Into Location[ WrAdr] at the Falling Clock Edge

General Steps to Design Multi Cycle Datapath Step 1: Start with a single cycle data path that is capable to perform all execution steps Step 2: Insert registers after each step in the instruction execution sequence. Make sure the delays in all steps are balanced. Step 3: Combine components if possible and add multiplexors Step 4: Work out clock by clock control signal sequence Note: Make sure IR is not changed before end of instruction See Example Questions 1 and 2

Step 1: Start with a Single Cycle Data Path Example: A Single Cycle Data Path for add and lw PC+4 PC Next Address Logic Critical Delay Path for lw = 165 ns Critical Delay Path for add = 120 ns 10 ns 20 ns Instruction Memory 20 ns rs R[rs] Rd add1 Data Memory rt Rd add2 Reg File ALU mux rd Wr add R[rt] 1 imm16 5 ns Wr data mux Read = 50 ns Write Setup = 30 ns Read = 30 ns Write Setup = 30 ns 50 ns 1 5 ns ext 20 ns mux 1 5 ns Assume all control signals arrive before data: Delay for add: 10 + 50 + 30 + 5 + 20 + 5 + 30 = 150 ns Execution Time for add = 1 clock  150 ns/clock = 150 ns Delay for lw: 10 + 50 + 30 + 20 + 50 + 5 + 30 = 195 ns (Make sure you check all paths) Clock Cycle of Data Path = 195 ns Execution Time for lw and all instructions = 1 clock  195 ns/clock = 195 ns

Step 2: Insert Intermediate Registers Example: Insert Registers Without Considering Delays PC+4 PC Next Address Logic 10 ns 20 ns Instruction Memory 20 ns rs R[rs] Rd add1 A Data Memory rt Rd add2 Reg File 10 ns ALU Out Reg Mem Data Reg ALU Instr Reg rd mux Wr add R[rt] 1 B imm16 5 ns Wr data mux Read = 50 ns Write Setup = 30 ns Read = 30 ns Write Setup = 30 ns 10 ns 10 ns 10 ns 50 ns 1 10 ns 5 ns ext 20 ns mux 1 Assume all control signals arrive before data: For add: PC  Instr Mem out = 10 + 50 = 60 ns Instr Reg  B reg = 10 + 30 = 40 ns (mux not in critical path since not writing yet) B reg  ALU output = 10 + 5 + 20 = 35 ns ALUOut Reg  Reg File Written = 10 + 5 + 30 = 45 ns (IR can’t be updated till Reg File is written) For lw: PC  Instr Mem out = 10 + 50 = 60 ns Instr Reg  B reg = 10 + 5 + 30 = 45 ns B Reg  ALU output = 10 + 5 + 20 = 35 ns ALU Out Reg  Memory output = 10 + 50 = 60 ns Mem Data Reg  Reg File Written = 10 + 5 + 30 = 45 ns (IR can’t be updated till Reg File is written) Clock cycle = longest stage = 60 ns Execution time for add = 4 clocks x 60 ns/clock = 240 ns PC updated during last instruct execution Execution time for lw = 5 clocks x 60 ns/clock = 300 ns PC updated during last instruct execution 5 ns

Step 2: Insert Intermediate Registers A More Balanced Multi-Cycle Data Path PC+4 PC Next Address Logic 10 ns 20 ns Instruction Memory 20 ns rs R[rs] Rd add1 Data Memory rt Rd add2 Reg File ALU Out Reg Mem Data Reg ALU Instr Reg mux rd Wr add R[rt] 1 imm16 5 ns Wr data mux Read = 50 ns Write Setup = 30 ns 10 ns 10 ns Read = 30 ns Write Setup = 30 ns 50 ns 1 10 ns 5 ns ext 20 ns mux 1 For add: PC  Instr Mem out = 10 + 50 = 60 ns Instr Reg  ALU output = 10 + 30 + 5 + 20 = 65 ns ALUOut Reg  Reg File Written = 10 + 5 + 30 = 45 ns (IR can’t be updated till Reg File is written) For lw: PC  Instr Mem out = 10 + 50 = 60 ns Instr Reg  ALU output = 10 + 30 + 5 + 20 = 45 ns ALU Out Reg  Memory output = 10 + 50 = 60 ns Mem Data Reg  Reg File Written = 10 + 5 + 30 = 45 ns (IR can’t be updated till Reg File is written) Clock cycle = longest stage = 65 ns Execution time for add = 3 clocks x 65 ns/clock = 195 ns PC updated during last instruct execution Execution time for lw = 4 clocks x 65 ns/clock = 260 ns PC updated during last instruct execution Note: The add instruction is faster than the single cycle data path but lw is slower 5 ns

Step 2: Insert Intermediate Registers Effect of Register Locations PC+4 PC Next Address Logic 10 ns 20 ns Instruction Memory 20 ns rs R[rs] Rd add1 Data Memory rt Rd add2 Reg File ALU Out Reg Mem Data Reg ALU Instr Reg mux rd Wr add R[rt] 1 imm16 5 ns Wr data mux 10 ns Read = 30 ns Write Setup = 30 ns Read = 50 ns Write Setup = 30 ns 10 ns 50 ns 1 10 ns 5 ns ext 20 ns mux 1 5 ns For add: PC  Instr Mem out = 10 + 50 = 60 ns Instr Reg  Reg File Written = 10 + 30 + 5 + 20 + 5 + 30 = 100 ns For lw: PC  Instr Mem out = 10 + 50 = 60 ns Instr Reg  ALU output = 10 + 30 + 5 + 20 = 45 ns ALU Out Reg  Memory output = 10 + 50 = 60 ns Mem Data Reg  Reg File Written = 10 + 5 + 30 = 45 ns Clock cycle = longest stage = 100 ns Execution time for add = 2 clocks x 100 ns/clock = 200 ns PC updated during last instruct execution Execution time for lw = 4 clocks x 100 ns/clock = 400 ns PC updated during last instruct execution Note: The add instruction is faster than last design but lw is much slower

Observation For single cycle data path For multi-cycle data path Execution Time for add = 1 clock  195 ns/clock = 195 ns Execution Time for lw = 1 clock  195 ns/clock = 195 ns For multi-cycle data path Case 1: 4 levels of intermediate registers Execution time for add = 4 clocks x 60 ns/clock = 240 ns Execution time for lw = 5 clocks x 60 ns/clock = 300 ns Case 2: 3 levels of intermediate registers Execution time for add = 3 clocks x 65 ns/clock = 195 ns Execution time for lw = 4 clocks x 65 ns/clock = 260 ns Case 3: 3 levels of intermediate registers, new location for ALUout Reg Execution time for add = 2 clocks x 100 ns/clock = 200 ns Execution time for lw = 4 clocks x 100 ns/clock = 400 ns Observations: 1. In this example, all multi-cycle data paths are slower than the single cycle data path! Why? Reason: The lw path length is not much longer than the path length for add. In order for a multi-cycle data to have significant performance over single cycle data path, the path length of long instructions has to be much longer than short instructions. (In fact, if all instructions have the same path length, the multi-cycle data path is always worse than a single cycle data path due to delays of immediate registers.) Case 2 has the best performance among the multi-cycle data path. Reason: it has the most balanced data path among the multi-cycle data path.

Step 3: Combining Components Next Address Logic Instruction Memory Data Memory Mem Data Reg rs R[rs] Rd add1 A rt Rd add2 Reg File ALU Out Reg Instr Reg ALU mux rd Wr add R[rt] 1 B imm16 Wr data mux 1 ext mux 1

Step 3: Combining Components Next Address Logic mux Instruction Memory Instruction and data Memory rs R[rs] Rd add1 A rt Rd add2 Reg File ALU Out Reg ALU Instr Reg mux rd Wr add R[rt] 1 B imm16 Wr data mux 1 ext mux 1 Mem Data Reg

Describing Multi-Cycel Data Path with Multi Cycle RTL Group all RTL statements by clock All register transfers in the same clock occur simultaneously Example: Multi Cycle RTL for the add Instruction Execution Sequence Clock RTL   Instruction Fetch: 1 { IR  Mem[PC] PC  PC + 4 } Operand Fetch: 2 { rs  IR<25:21> rt  IR<20:16> rd  IR<15:11> RA  R[rs] RB  R[rt] } Execute: 3 ALUOUT  RA + RB Store Result: 4 R[rd]  ALUOUT Compare to Single Cycle RTL for the add Instruction instr  mem[PC] rs  instr<25:21> rt  instr<20:16> rd  instr<15:11> R[rd]  R[rs] + R[rt] PC  PC + 4

Operation Details of Multi Cycle Data Path Will Look at the Details of Each Step in the Instruction Execution Sequence: Step 1: Instruction Fetch Step 2: Instruction Decode and Register Fetch Step 3: Execution, Memory Address Computation, or Branch Completion Step 4: R-Type Completion or Memory Access for Load/Store Instructions Step 5: Memory Read and Load Completion

Instruction Fetch Step Cycle Begins Right AFTER the Clock Tick Instr Reg  mem[PC]; PC<31: 0> + 4 PC+4 PC+8 Cycle Ends AT the Next Clock Tick IRmem[PC]; PC<31: 0>  PC<31: 0> + 4 PC+8 PC+12 One Clock Cycle ALUOp= Add, ALUSrcB= 01 x: PCWrCond, RegDst, MemtoReg, ExtOp 1: PCWr, IRWr; Others: 0 PC+4

Minimal Functionality Required for Instruction Decode and Register Fetch Step Idle

Decoding of Branch-if-Equal (beq) Instruction: Simultaneously Preparing for Branch Address ALUOp= Add, ALUSrcB= 11 1: ExtOp x: RegDst, PCSrc, IorD, MemtoReg Others: 0 Use the idle components to do something useful: branch address calculation Motivation: To take advantage of the idle components while decoding instruction to save one more cycle if the instruction happens to be a branch

If Branch Actually Occurs in Execution Step Registers holding operand when execution step begins Holding branch address computed during instruction decode

Benefit of Branch Address Preparation Branch Instruction Execution Without Branch Address Preparation: Instruction fetch Instruction decode Operands rs and rt are read. ALU compares rs and rt. The zero output of ALU is sent to Control Unit. If branch occurs, Control Unit selects PC and imm16 for the ALU to compute branch target address PC is updated with the branch target address With Branch Address Preparation: Instruction fetch Instruction decode Operands rs and rt are read. ALU compares rs and rt. Branch address calculated ahead of time. The zero output of ALU is sent to Control Unit. If branch occurs, Control Unit updates PC with branch target address. Otherwise, PC = PC + 4.

R-Type Instruction Decode Step Branch address preparation as discussed before (result may not be used but it is harmless if ALUout is not written to other state elements)

R-Type Execution Step The branch target address is useless; this is not a branch

R- Type Completion Step instruction is not a branch, pre-calculated branch address is overwritten by the add instruction

I-Type Instruction Decode Step (Ori)

I-Type (Ori) Execution Step The branch target address is useless; this is not a branch

I-Type Completion Step

Store Instruction Decode Step

Store Instruction Execution Step (Memory Address Calculation)

Store Instruction Completion Steps

Load Instruction Decode Step

Load Instruction Execution Step (Memory Address Calculation)

Load Instruction Execution Step (Memory Access)

Load Instruction Completion Steps

Jump Instruction Decode and Complete Steps • PC_ incr  PC + 4 PC<31: 2>  PC_ incr<31: 28> concat target<25: 0> JComplete 1: PCWrite PCsrc = 10 x: others PCwr = 1 PCWr=1 PCsrc = 2 PCsrc=2 2 1 J PC<31:28> Instr<25:0> 4 26

Putting it all Together: Multiple Cycle Datapath PCsrc 2 1 MUX

Multiple Cycle Datapath Control Diagram JComplete 1: PCWrite PCsrc = 10 x: others J

Register File & Memory Write Timing: Ideal vs. Reality Our Ideal Register File and Memory are Simplified Write Happens at the Falling Clock Edge Address, Data, and Write Enable Must be Stable One “Set- up” Time Before the Falling Clock Edge In Real Life: Neither Register File nor Ideal Memory Has Clock Edge Triggered Writes The Write Path is a Combinational Logic Delay Path: Write Enable Goes to 1 and Din Settles Down Memory Write Access Delay Din is Written Into mem[ address] Important: Address and Data must be Stable one “Set- up” Time BEFORE Write Enable Goes High (1) Ideal Memory WrEn Adr Din Dout 32 Real Memory WrEn Adr Din Dout 32

Race Condition Between Address and Write Enable This “Real” (no clock input) Register File may not Work Reliably in our Design Because: We cannot Guarantee Rw will be Stable one “Set- up” Time BEFORE RegWr= 1 There is a “race” between Rw (address) and RegWr (write enable) The “Real” (no clock input) Memory may not Work Reliably in our Design Because: We cannot Guarantee Address will be Stable one “Set- up” Time BEFORE WrEn = 1 There is a race between Addr and WrEn Reg File RegWr Ra busW 5 32 Rw busB busA Memory WrEn Adr Din Dout 32

How to Avoid this Race Condition? A Possible Solution: Have A Register Attached Directly to the Address and Data Inputs Store Address and Data info at the End of Cycle N Assert Write Enable Signal with Combinational Logic Delay into Cycle (N+ 1) where: Delay into Cycle N+1  clock- to- Q + setup Disadvantage: Extra Register Delay Extra Logic Circuit Addr reg Data reg Clock WrEn Delay Memory Adr Din Dout 32