IT 251 Computer Organization and Architecture Multi Cycle CPU Datapath Chia-Chi Teng.

Slides:



Advertisements
Similar presentations
1 Today  All HW1 turned in on time, this is great!  HW2 will be out soon —You will work on procedure calls/stack/etc.  Lab1 will be out soon (possibly.
Advertisements

1 Chapter Five The Processor: Datapath and Control.
The Processor: Datapath & Control
The Processor Data Path & Control Chapter 5 Part 2 - Multi-Clock Cycle Design N. Guydosh 2/29/04.
Chapter 5 The Processor: Datapath and Control Basic MIPS Architecture Homework 2 due October 28 th. Project Designs due October 28 th. Project Reports.
EECE476 Lecture 9: Multi-cycle CPU Datapath Chapter 5: Section 5.5 The University of British ColumbiaEECE 476© 2005 Guy Lemieux.
331 Lec 14.1Fall 2002 Review: Abstract Implementation View  Split memory (Harvard) model - single cycle operation  Simplified to contain only the instructions:
©UCB CS 161Computer Architecture Chapter 5 Lecture 11 Instructor: L.N. Bhuyan Adapted from notes by Dave Patterson (http.cs.berkeley.edu/~patterson)
Preparation for Midterm Binary Data Storage (integer, char, float pt) and Operations, Logic, Flip Flops, Switch Debouncing, Timing, Synchronous / Asynchronous.
ENEE350 Ankur Srivastava University of Maryland, College Park Based on Slides from Mary Jane Irwin ( )
331 W10.1Spring :332:331 Computer Architecture and Assembly Language Spring 2005 Week 10 Building a Multi-Cycle Datapath [Adapted from Dave Patterson’s.
CSE431 L05 Basic MIPS Architecture.1Irwin, PSU, 2005 CSE 431 Computer Architecture Fall 2005 Lecture 05: Basic MIPS Architecture Review Mary Jane Irwin.
1. 2 Multicycle Datapath  As an added bonus, we can eliminate some of the extra hardware from the single-cycle datapath. —We will restrict ourselves.
Dr. Iyad F. Jafar Basic MIPS Architecture: Multi-Cycle Datapath and Control.
Chapter 4 Sections 4.1 – 4.4 Appendix D.1 and D.2 Dr. Iyad F. Jafar Basic MIPS Architecture: Single-Cycle Datapath and Control.
COSC 3430 L08 Basic MIPS Architecture.1 COSC 3430 Computer Architecture Lecture 08 Processors Single cycle Datapath PH 3: Sections
1 A pipeline diagram  A pipeline diagram shows the execution of a series of instructions. —The instruction sequence is shown vertically, from top to bottom.
Lec 15Systems Architecture1 Systems Architecture Lecture 15: A Simple Implementation of MIPS Jeremy R. Johnson Anatole D. Ruslanov William M. Mongan Some.
Datapath and Control: MultiCycle Implementation. Performance of Single Cycle Machines °Assume following operation times: Memory units : 200 ps ALU and.
Gary MarsdenSlide 1University of Cape Town Chapter 5 - The Processor  Machine Performance factors –Instruction Count, Clock cycle time, Clock cycles per.
EE204 L12-Single Cycle DP PerformanceHina Anwar Khan EE204 Computer Architecture Single Cycle Data path Performance.
CPE232 Basic MIPS Architecture1 Computer Organization Multi-cycle Approach Dr. Iyad Jafar Adapted from Dr. Gheith Abandah slides
1 CS/COE0447 Computer Organization & Assembly Language Multi-Cycle Execution.
C HAPTER 5 T HE PROCESSOR : D ATAPATH AND C ONTROL M ULTICYCLE D ESIGN.
Computer Architecture and Design – ECEN 350 Part 6 [Some slides adapted from A. Sprintson, M. Irwin, D. Paterson and others]
1 A single-cycle MIPS processor  An instruction set architecture is an interface that defines the hardware operations which are available to software.
1 Processor: Datapath and Control Single cycle processor –Datapath and Control Multicycle processor –Datapath and Control Microprogramming –Vertical and.
D ATA P ATH OF A PROCESSOR (MIPS) Module 1.1 : Elements of computer system UNIT 1.
December 26, 2015©2003 Craig Zilles (derived from slides by Howard Huang) 1 A single-cycle MIPS processor  As previously discussed, an instruction set.
Performance of Single-cycle Design
1 Multicycle conclusion  My office hours, move to Mon or Wed?  Plan: Pipelining this and next week, maybe performance analysis  Today: —Microprogramming.
LECTURE 6 Multi-Cycle Datapath and Control. SINGLE-CYCLE IMPLEMENTATION As we’ve seen, single-cycle implementation, although easy to implement, could.
ECE-C355 Computer Structures Winter 2008 The MIPS Datapath Slides have been adapted from Prof. Mary Jane Irwin ( )
Datapath and Control AddressInstruction Memory Write Data Reg Addr Register File ALU Data Memory Address Write Data Read Data PC Read Data Read Data.
Chapter 4 From: Dr. Iyad F. Jafar Basic MIPS Architecture: Single-Cycle Datapath and Control.
February 22, 2016©2003 Craig Zilles (derived from slides by Howard Huang) 1 A single-cycle MIPS processor  As previously discussed, an instruction set.
1 CS/COE0447 Computer Organization & Assembly Language Chapter 5 Part 2.
Datapath and Control AddressInstruction Memory Write Data Reg Addr Register File ALU Data Memory Address Write Data Read Data PC Read Data Read Data.
COM181 Computer Hardware Lecture 6: The MIPs CPU.
1 Chapter 5: Datapath and Control (Part 2) CS 447 Jason Bakos.
Single Cycle Controller Design
Chapter 4 From: Dr. Iyad F. Jafar Basic MIPS Architecture: Multi-Cycle Datapath and Control.
1 The final datapath. 2 Control  The control unit is responsible for setting all the control signals so that each instruction is executed properly. —The.
Computer Architecture Lecture 6.  Our implementation of the MIPS is simplified memory-reference instructions: lw, sw arithmetic-logical instructions:
Design a MIPS Processor (II)
Multi-Cycle Datapath and Control
CS161 – Design and Architecture of Computer Systems
Single-Cycle Datapath and Control
Computer Architecture
IT 251 Computer Organization and Architecture
Performance of Single-cycle Design
Multi-Cycle CPU.
Multi-Cycle CPU.
Basic MIPS Architecture
CS/COE0447 Computer Organization & Assembly Language
Multiple Cycle Implementation of MIPS-Lite CPU
Single-cycle datapath, slightly rearranged
A pipeline diagram Clock cycle lw $t0, 4($sp) IF ID
The Processor Lecture 3.2: Building a Datapath with Control
Vishwani D. Agrawal James J. Danaher Professor
COSC 2021: Computer Organization Instructor: Dr. Amir Asif
COSC 2021: Computer Organization Instructor: Dr. Amir Asif
Processor: Multi-Cycle Datapath & Control
Chapter Four The Processor: Datapath and Control
5.5 A Multicycle Implementation
Control Unit (single cycle implementation)
A single-cycle MIPS processor
The Processor: Datapath & Control.
Processor: Datapath and Control
Presentation transcript:

IT 251 Computer Organization and Architecture Multi Cycle CPU Datapath Chia-Chi Teng

2 Summary - Single Cycle Datapath  A datapath contains all the functional units and connections necessary to implement an instruction set architecture. —For our single-cycle implementation, we use two separate memories, an ALU, some extra adders, and lots of multiplexers. —MIPS is a 32-bit machine, so most of the buses are 32-bits wide.  The control unit tells the datapath what to do, based on the instruction that’s currently being executed. —Our processor has ten control signals that regulate the datapath. —The control signals can be generated by a combinational circuit with the instruction’s 32-bit binary encoding as input.  Now we’ll see the performance limitations of this single-cycle machine and try to improve upon it.

3 Control  The control unit is responsible for setting all the control signals so that each instruction is executed properly. —The control unit’s input is the 32-bit instruction word. —The outputs are values for the blue control signals in the datapath.  Most of the signals can be generated from the instruction opcode alone, and not the entire 32-bit word.  To illustrate the relevant control signals, we will show the route that is taken through the datapath by R-type, lw, sw and beq instructions.

4 Control signal table  sw and beq are the only instructions that do not write any registers.  lw and sw are the only instructions that use the constant field. They also depend on the ALU to compute the effective memory address.  ALUOp for R-type instructions depends on the instructions’ func field.  The PCSrc control signal (not listed) should be set if the instruction is beq and the ALU’s Zero output is true. OperationRegDstRegWriteALUSrcALUOpMemWriteMemReadMemToReg add sub and or slt lw swX X beqX X

5 Generating control signals  The control unit needs 13 bits of inputs. —Six bits make up the instruction’s opcode. —Six bits come from the instruction’s func field. —It also needs the Zero output of the ALU.  The control unit generates 10 bits of output, corresponding to the signals mentioned on the previous page.  You can build the actual circuit by using big K-maps, big Boolean algebra, or big circuit design programs.  The textbook presents a slightly different control unit. Read address Instruction memory Instruction [31-0] Control I [ ] I [5 - 0] RegWrite ALUSrc ALUOp MemWrite MemRead MemToReg RegDst PCSrc Zero

Boolean Expressions for Controller RegDst = add + sub ALUSrc = ori + lw + sw MemtoReg = lw RegWrite = add + sub + ori + lw MemWrite = sw nPCsel = beq Jump = jump ExtOp = lw + sw ALUctr[0] = sub + beq (assume ALUctr is 0 ADD, 01: SUB, 10: OR ) ALUctr[1] = or where, rtype = ~op 5  ~op 4  ~op 3  ~op 2  ~op 1  ~op 0, ori = ~op 5  ~op 4  op 3  op 2  ~op 1  op 0 lw = op 5  ~op 4  ~op 3  ~op 2  op 1  op 0 sw = op 5  ~op 4  op 3  ~op 2  op 1  op 0 beq = ~op 5  ~op 4  ~op 3  op 2  ~op 1  ~op 0 jump = ~op 5  ~op 4  ~op 3  ~op 2  op 1  ~op 0 add = rtype  func 5  ~func 4  ~func 3  ~func 2  ~func 1  ~func 0 sub = rtype  func 5  ~func 4  ~func 3  ~func 2  func 1  ~func 0 How do we implement this in gates?

Controller Implementation add sub ori lw sw beq jump RegDst ALUSrc MemtoReg RegWrite MemWrite nPCsel Jump ExtOp ALUctr[0] ALUctr[1] “AND” logic “OR” logic opcodefunc

8 Multicycle datapath  We just saw a single-cycle datapath and control unit for our simple MIPS- based instruction set.  A multicycle processor fixes some shortcomings in the single-cycle CPU. —Faster instructions are not held back by slower ones. —The clock cycle time can be decreased. —We don’t have to duplicate any hardware units.  A multicycle processor requires a somewhat simpler datapath which we’ll see today, but a more complex control unit that we’ll see later.

9 The single-cycle design again… Read address Instruction memory Instruction [31-0] Read address Write address Write data Data memory Read data MemWrite MemRead 1Mux01Mux0 MemToReg 4 Shift left 2 PC Add 0Mux10Mux1 PCSrc Sign extend 0Mux10Mux1 ALUSrc Result Zero ALU ALUOp I [15 - 0] I [ ] I [ ] I [ ] 0Mux10Mux1 RegDst Read register 1 Read register 2 Write register Write data Read data 2 Read data 1 Registers RegWrite A control unit (not shown) generates all the control signals from the instruction’s “op” and “func” fields.

10 The example add from last time  Consider the instruction add $s4, $t1, $t2.  Assume $t1 and $t2 initially contain 1 and 2 respectively.  Executing this instruction involves several steps. 1.The instruction word is read from the instruction memory, and the program counter is incremented by 4. 2.The sources $t1 and $t2 are read from the register file. 3.The values 1 and 2 are added by the ALU. 4.The result (3) is stored back into $s4 in the register file oprsrtrdshamtfunc

I [ ] How the add goes through the datapath Read address Instruction memory Instruction [31-0] Read address Write address Write data Data memory Read data MemWrite MemRead 1Mux01Mux0 MemToReg 4 Shift left 2 PC Add 0Mux10Mux1 PCSrc Sign extend 0Mux10Mux1 ALUSrc Result Zero ALU ALUOp I [15 - 0] I [ ] I [ ] Mux10Mux1 RegDst Read register 1 Read register 2 Write register Write data Read data 2 Read data 1 Registers RegWrite PC+4 Which ones are state elements? clk

12 State elements  In an instruction like add $t1, $t1, $t2, how do we know $t1 is not updated until after its original value is read? Read address Write address Write data Data memory Read data MemWrite MemRead Read register 1 Read register 2 Write register Write data Read data 2 Read data 1 Registers RegWrite PC

13 State elements  In an instruction like add $t1, $t1, $t2, how do we know $t1 is not updated until after its original value is read? Read address Write address Write data Data memory Read data MemWrite MemRead Read register 1 Read register 2 Write register Write data Read data 2 Read data 1 Registers RegWrite PC Instruction Cycle

14 The datapath and the clock 1.STEP 1: A new instruction is loaded from memory. The control unit sets the datapath signals appropriately so that —registers are read, —ALU output is generated, —data memory is read and —branch target addresses are computed. 2.STEP 2: —The register file is updated for arithmetic or lw instructions. —Data memory is written for a sw instruction. —The PC is updated to point to the next instruction.  In a single-cycle datapath everything in Step 1 must complete within one clock cycle.

15 The slowest instruction...  If all instructions must complete within one clock cycle, then the cycle time has to be large enough to accommodate the slowest instruction.  For example, lw $t0, –4($sp) needs 8ns, assuming the delays shown here. 0Mux10Mux1 Read address Instruction memory Instruction [31-0] Read address Write address Write data Data memory Read data 1Mux01Mux0 Sign extend 0Mux10Mux1 Result Zero ALU I [15 - 0] I [ ] I [ ] I [ ] Read register 1 Read register 2 Write register Write data Read data 2 Read data 1 Registers 2 ns 1 ns 0 ns 8ns reading the instruction memory2ns reading the base register $sp1ns computing memory address $sp-42ns reading the data memory2ns storing data back to $t01ns

16...determines the clock cycle time  If we make the cycle time 8ns then every instruction will take 8ns, even if they don’t need that much time.  For example, the instruction add $s4, $t1, $t2 really needs just 6ns. 0Mux10Mux1 Read address Instruction memory Instruction [31-0] Read address Write address Write data Data memory Read data 1Mux01Mux0 Sign extend 0Mux10Mux1 Result Zero ALU I [15 - 0] I [ ] I [ ] I [ ] Read register 1 Read register 2 Write register Write data Read data 2 Read data 1 Registers 2 ns 1 ns 0 ns 6ns reading the instruction memory2ns reading registers $t1 and $t21ns computing $t1 + $t22ns storing the result into $s01ns

17 How bad is this?  With these same component delays, a sw instruction would need 7ns, and beq would need just 5ns.  Let’s consider the gcc instruction mix as an example  With a single-cycle datapath, each instruction would require 8ns.  But if we could execute instructions as fast as possible, the average time per instruction for gcc would be: (48% x 6ns) + (22% x 8ns) + (11% x 7ns) + (19% x 5ns) = 6.36ns  The single-cycle datapath with 8ns instruction cycle is about 26% slower! InstructionFrequency Arithmetic48% Loads22% Stores11% Branches19%

18 It gets worse...  We’ve made very optimistic assumptions about memory latency: —Main memory accesses on modern machines is >50ns. For comparison, an ALU on the Pentium4 takes ~0.3ns.  Our worst case cycle (loads/stores) includes 2 memory accesses —A modern single cycle implementation would be stuck at <10Mhz. —Caches will improve common case access time, not worst case.  Tying frequency to worst case path violates first law of performance!!

19 A multistage approach to instruction execution  We’ve informally described instructions as executing in several steps. 1.Instruction fetch and PC increment. 2.Reading sources from the register file. 3.Performing an ALU computation. 4.Reading or writing (data) memory. 5.Storing data back to the register file.  What if we made these stages explicit in the hardware design?

20 Performance benefits  Each instruction can execute only the stages that are necessary. —Arithmetic —Load —Store —Branches  This would mean that instructions complete as soon as possible, instead of being limited by the slowest instruction. Proposed execution stages 1.Instruction fetch and PC increment 2.Reading sources from the register file 3.Performing an ALU computation 4.Reading or writing (data) memory 5.Storing data back to the register file

21 The clock cycle  Things are simpler if we assume that each “stage” takes one clock cycle. —This means instructions will require multiple clock cycles to execute. —But since a single stage is fairly simple, the cycle time can be low.  For the proposed execution stages below and the sample datapath delays shown earlier, each stage needs 2ns at most. —This accounts for the slowest devices, the ALU and data memory. —A 2ns clock cycle time corresponds to a 500MHz clock rate! Proposed execution stages 1.Instruction fetch and PC increment 2.Reading sources from the register file 3.Performing an ALU computation 4.Reading or writing (data) memory 5.Storing data back to the register file

22 Cost benefits  As an added bonus, we can eliminate some of the extra hardware from the single-cycle datapath. —We will restrict ourselves to using each functional unit once per cycle, just like before. —But since instructions require multiple cycles, we could reuse some units in a different cycle during the execution of a single instruction.  For example, we could use the same ALU: —to increment the PC (first clock cycle), and —for arithmetic operations (third clock cycle). Proposed execution stages 1.Instruction fetch and PC increment 2.Reading sources from the register file 3.Performing an ALU computation 4.Reading or writing (data) memory 5.Storing data back to the register file

23 Two extra adders  Our original single-cycle datapath had an ALU and two adders.  The arithmetic-logic unit had two responsibilities. —Doing an operation on two registers for arithmetic instructions. —Adding a register to a sign-extended constant, to compute effective addresses for lw and sw instructions.  One of the extra adders incremented the PC by computing PC + 4.  The other adder computed branch targets, by adding a sign-extended, shifted offset to (PC + 4).

24 The extra single-cycle adders Read address Instruction memory Instruction [31-0] Read address Write address Write data Data memory Read data MemWrite MemRead 1Mux01Mux0 MemToReg 4 Shift left 2 PC Add 0Mux10Mux1 PCSrc Sign extend 0Mux10Mux1 ALUSrc Result Zero ALU ALUOp I [15 - 0] I [ ] I [ ] I [ ] 0Mux10Mux1 RegDst Read register 1 Read register 2 Write register Write data Read data 2 Read data 1 Registers RegWrite

25 Our new adder setup  We can eliminate both extra adders in a multicycle datapath, and instead use just one ALU, with multiplexers to select the proper inputs.  A 2-to-1 mux ALUSrcA sets the first ALU input to be the PC or a register.  A 4-to-1 mux ALUSrcB selects the second ALU input from among: —the register file (for arithmetic operations), —a constant 4 (to increment the PC), —a sign-extended constant (for effective addresses), and —a sign-extended and shifted constant (for branch targets).  This permits a single ALU to perform all of the necessary functions. —Arithmetic operations on two register operands. —Incrementing the PC. —Computing effective addresses for lw and sw. —Adding a sign-extended, shifted offset to (PC + 4) for branches.

26 The multicycle adder setup highlighted Result Zero ALU ALUOp 0Mux10Mux1 ALUSrcA ALUSrcB Read register 1 Read register 2 Write register Write data Read data 2 Read data 1 Registers RegWrite Sign extend Shift left 2 PC 4 0Mux10Mux1 RegDst 0Mux10Mux1 MemToReg 0Mux10Mux1 IorD Address Memory Mem Data Write data MemRead MemWrite PCWrite

27 Eliminating a memory  Similarly, we can get by with one unified memory, which will store both program instructions and data.  This memory is used in both the instruction fetch and data access stages, and the address could come from either: —the PC register (when we’re fetching an instruction), or —the ALU output (for the effective address of a lw or sw).  We add another 2-to-1 mux, IorD, to decide whether the memory is being accessed for instructions or for data. Proposed execution stages 1.Instruction fetch and PC increment 2.Reading sources from the register file 3.Performing an ALU computation 4.Reading or writing (data) memory 5.Storing data back to the register file

28 The new memory setup highlighted Result Zero ALU ALUOp 0Mux10Mux1 ALUSrcA ALUSrcB Read register 1 Read register 2 Write register Write data Read data 2 Read data 1 Registers RegWrite Sign extend Shift left 2 PC 4 0Mux10Mux1 RegDst 0Mux10Mux1 MemToReg 0Mux10Mux1 IorD Address Memory Mem Data Write data MemRead MemWrite PCWrite

29 Intermediate registers  Sometimes we need the output of a functional unit in a later clock cycle during the execution of one instruction. —The instruction word fetched in stage 1 determines the destination of the register write in stage 5. —The ALU result for an address computation in stage 3 is needed as the memory address for lw or sw in stage 4.  These outputs will have to be stored in intermediate registers for future use. Otherwise they would probably be lost by the next clock cycle. —The instruction read in stage 1 is saved in Instruction register. —Register file outputs from stage 2 are saved in registers A and B. —The ALU output will be stored in a register ALUOut. —Any data fetched from memory in stage 4 is kept in the Memory data register, also called MDR.

30 The final multicycle datapath Result Zero ALU ALUOp 0Mux10Mux1 ALUSrcA ALUSrcB Read register 1 Read register 2 Write register Write data Read data 2 Read data 1 Registers RegWrite Address Memory Mem Data Write data Sign extend Shift left 2 0Mux10Mux1 PCSource PCA 4[31-26] [25-21] [20-16] [15-11] [15-0] Instruction register Memory data register IRWrite 0Mux10Mux1 RegDst 0Mux10Mux1 MemToReg 0Mux10Mux1 IorD MemRead MemWrite PCWrite ALU Out B How may state elements do we have now?

31 Register write control signals  We have to add a few more control signals to the datapath.  Since instructions now take a variable number of cycles to execute, we cannot update the PC on each cycle. —Instead, a PCWrite signal controls the loading of the PC. —The instruction register also has a write signal, IRWrite. We need to keep the instruction word for the duration of its execution, and must explicitly re-load the instruction register when needed.  The other intermediate registers, MDR, A, B and ALUOut, will store data for only one clock cycle at most, and do not need write control signals.

32 Summary  A single-cycle CPU has two main disadvantages. —The cycle time is limited by the worst case latency. —It requires more hardware than necessary.  A multicycle processor splits instruction execution into several stages. —Instructions only execute as many stages as required. —Each stage is relatively simple, so the clock cycle time is reduced. —Functional units can be reused on different cycles.  We made several modifications to the single-cycle datapath. —The two extra adders and one memory were removed. —Multiplexers were inserted so the ALU and memory can be used for different purposes in different execution stages. —New registers are needed to store intermediate results.  Next time, we’ll look at controlling this datapath.