Computing Systems The Processor: Datapath and Control.

Computing Systems The Processor: Datapath and Control

2 Introduction
- The performance of a machine depends on three key factors:
  - instruction count (determined by the compiler and the ISA)
  - clock cycle time (determined by the hardware implementation)
  - clock cycles per instruction, CPI (determined by the hardware implementation)
- We implement a basic MIPS subset, simplified to contain only:
  - memory-reference instructions: lw, sw
  - arithmetic-logical instructions: add, sub, and, or, slt
  - control-flow instructions: beq, j

3 Overview of the implementation
- Generic implementation:
  - use the program counter (PC) to supply the instruction address
  - get the instruction from memory
  - read registers
  - use the instruction to decide exactly what to do
- The actions required to complete an instruction depend on the instruction class
- Even across instruction classes there are some similarities
  - e.g., all instructions use the ALU after reading the registers. Why? For memory-reference instructions? Arithmetic? Control flow?

4 Functional units to implement the processor
- Two types of functional units:
  - elements that operate on data values (combinational)
  - elements that contain state (sequential)
- Combinational units
  - the current outputs depend only on the current inputs
- Sequential units
  - the current outputs depend on the current inputs and also on past inputs
  - the element “remembers” its history, i.e., it can store the input it was given

5 State elements: latches and flip-flops
- The output is equal to the stored value (state) inside the element (no need to ask for permission to look at the value)
- A change of state (value) is based on the “control” signal
- Unclocked: latches
  - state update is level triggered
  - i.e., the state can change whenever the “control” signal changes
- Clocked: flip-flops
  - state update is edge triggered
  - i.e., the state can change only on clock edges
[Figure: clock waveform showing the rising edge, the falling edge, and one clock cycle]
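
The difference between level-triggered and edge-triggered updates can be sketched behaviorally. This is only an illustrative model, not part of the original slides; the class names `DLatch` and `DFlipFlop` and the sample waveforms are assumptions made for the example.

```python
class DLatch:
    """Level-triggered: the stored value follows d while the control is high."""

    def __init__(self):
        self.q = 0

    def step(self, d, control):
        if control:
            self.q = d
        return self.q


class DFlipFlop:
    """Edge-triggered: d is sampled only when the clock goes from 0 to 1."""

    def __init__(self):
        self.q = 0
        self._prev_clk = 0

    def step(self, d, clk):
        if clk and not self._prev_clk:  # rising edge detected
            self.q = d
        self._prev_clk = clk
        return self.q


# Hypothetical waveforms: the data input keeps changing while the clock is high.
clk_wave = [0, 1, 1, 1, 0]
d_wave = [1, 1, 0, 1, 0]

latch, flop = DLatch(), DFlipFlop()
latch_out = [latch.step(d, c) for d, c in zip(d_wave, clk_wave)]
flop_out = [flop.step(d, c) for d, c in zip(d_wave, clk_wave)]
print(latch_out)  # the latch tracks every change of d while the control is high
print(flop_out)   # the flip-flop keeps the value captured at the rising edge
```

With these inputs the latch output changes three times during the high phase, while the flip-flop changes once, which is the predictability the edge-triggered methodology relies on.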

6 Clocking methodology
[Figure: clock, state element 1, combinational logic, state element 2]
- Computer design cannot tolerate unpredictability
  - a clocking methodology is designed to prevent it
  - it specifies the timing of reads and writes of the state elements
- The easiest solution is to use a synchronous clocking scheme
  - edge-triggered methodology
- Typical execution:
  - read the contents of some state elements
  - send the values through some combinational logic
  - write the results to state elements

7 What functional units do we need?
- we need an ALU
- we need memory to store instructions and data
  - instruction memory takes an address and supplies an instruction
  - data memory takes an address and supplies data (lw)
  - data memory takes an address and data and writes the data into memory (sw)
- we need to manage a PC and its update
- we need a register file with 32 registers
  - read two operands and write a result back
  - sometimes an operand comes from the instruction itself
- we need additional support for the immediate class of instructions (sign extension)
- we need additional support for the jump instruction

8 Datapath building blocks

9 Register File (read circuits)
- Built using D flip-flops
- The clock signal is not shown
- Make sure you understand what the “mux” in the figure does

10 Register File (write circuits)  This is just a diagram to illustrate the “principle”. In practice never gate a clock signal !!!

11 Datapath implementation Use multiplexors to stitch the various functional units together

12 The datapath operation 1. Fetching instructions and incrementing PC

13 Arithmetic-logical instructions
2. Two registers are read from the register file
3. The ALU operates on the data from the two registers
4. The result from the ALU is written into the register file

14 Memory-reference instructions: sw
2. Two registers are read from the register file
3. The ALU adds the value read from one of the registers and the sign-extended lower 16 bits of the instruction (offset)
4. The value read from the second register is written into data memory at the address given by the sum computed by the ALU

15 Memory-reference instructions: lw
2. A register is read from the register file
3. The ALU computes the sum of the value read from the register file and the sign-extended lower 16 bits of the instruction (offset)
4. The sum from the ALU is used as the address for the data memory
5. The data from the memory is written into the register file at the destination register

16 Control flow instructions: beq
2. Two registers are read from the register file
3. The ALU performs a subtract on the values from the register file; to form the branch target address, the value of PC+4 is added to the sign-extended lower 16 bits of the instruction (offset) shifted left by 2
4. The Zero result from the ALU is used to decide which adder result to store into the PC
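
The branch steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function names `sign_extend16`, `branch_target`, and `next_pc` are invented for the example, and values are treated as 32-bit words.

```python
MASK32 = 0xFFFFFFFF


def sign_extend16(value):
    """Interpret the low 16 bits of value as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def branch_target(pc, offset16):
    """PC+4 plus the sign-extended offset shifted left by 2 (a word offset)."""
    return (pc + 4 + (sign_extend16(offset16) << 2)) & MASK32


def next_pc(pc, rs_value, rt_value, offset16):
    """beq: the ALU subtracts; its Zero output selects between the two adder results."""
    zero = ((rs_value - rt_value) & MASK32) == 0
    return branch_target(pc, offset16) if zero else (pc + 4) & MASK32


print(hex(next_pc(0x1000, 7, 7, 3)))       # taken: 0x1000 + 4 + 12
print(hex(next_pc(0x1000, 7, 8, 3)))       # not taken: 0x1000 + 4
print(hex(next_pc(0x1000, 7, 7, 0xFFFF)))  # offset -1 branches back to 0x1000
```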

17 Format of the instructions
- Arithmetic-logical: add, sub, and, or, slt (R-type)
  Example: add $t0,$t1,$t2    # $t1 in rs, $t2 in rt, $t0 in rd
  bits:   31:26     25:21  20:16  15:11  10:6   5:0
  field:  0         rs     rt     rd     shamt  funct
- Memory-reference: lw, sw (I-type)
  Example: lw $t0,offset($t1)    # $t1 in rs, $t0 in rt
           sw $t0,offset($t1)    # $t1 in rs, $t0 in rt
  bits:   31:26     25:21  20:16  15:0
  field:  35 or 43  rs     rt     offset
- Oops! The destination register can be in two possible places: for a load it is in bits 20:16 (rt), while for an R-type instruction it is in bits 15:11 (rd). We have a small bug!

18 Format of the instructions
- branch: beq (I-type)
  Example: beq $t0,$t1,label    # $t0 in rs, $t1 in rt
  bits:   31:26  25:21  20:16  15:0
  field:  4      rs     rt     address
- jump: j (J-type)
  Example: j address
  bits:   31:26  25:0
  field:  2      address
- We will leave the implementation of j out until the very end
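
The bit ranges in the format tables map directly onto shifts and masks. The sketch below is illustrative only: `decode` and `write_register` are hypothetical helper names, and the register numbers ($t0 = 8, $t1 = 9, $t2 = 10, $t3 = 11) and the funct value 32 for add come from the standard MIPS conventions rather than from these slides.

```python
def decode(instr):
    """Split a 32-bit MIPS instruction word into its fields."""
    return {
        "op": (instr >> 26) & 0x3F,     # bits 31:26
        "rs": (instr >> 21) & 0x1F,     # bits 25:21
        "rt": (instr >> 16) & 0x1F,     # bits 20:16
        "rd": (instr >> 11) & 0x1F,     # bits 15:11 (R-type only)
        "shamt": (instr >> 6) & 0x1F,   # bits 10:6  (R-type only)
        "funct": instr & 0x3F,          # bits 5:0   (R-type only)
        "offset": instr & 0xFFFF,       # bits 15:0  (I-type only)
        "address": instr & 0x03FFFFFF,  # bits 25:0  (J-type only)
    }


def write_register(fields):
    """Pick the destination field: rd for R-type (op 0), rt for a load (op 35)."""
    return fields["rd"] if fields["op"] == 0 else fields["rt"]


add_word = (0 << 26) | (9 << 21) | (10 << 16) | (8 << 11) | 32  # add $t0,$t1,$t2
lw_word = (35 << 26) | (11 << 21) | (10 << 16) | 0              # lw $t2,0($t3)

print(hex(add_word), decode(add_word))
print(write_register(decode(add_word)))  # register number taken from rd
print(write_register(decode(lw_word)))   # register number taken from rt
```

The `write_register` helper plays the role of the extra multiplexor that the “small bug” calls for: the same datapath has to pick the write register from a different field depending on the instruction.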

19 A small bug fix !!! We need to add a mux to select which field of the instruction is used to indicate the register to be written

20 The control unit
- Selecting the operations to perform (ALU, read/write of data memory and register file)
- Controlling the flow of data (multiplexor inputs)
- Information comes from the 32 bits of the instruction
  - Example: add $8, $17, $ (fields: op, rs, rt, rd, shamt, funct)
- The ALU's operation is based on the instruction type and function code

21 The control unit
- ALU control input (ALU operation lines):
  0000  and
  0001  or
  0010  add
  0110  subtract
  0111  set-on-less-than
  1100  nor
- Why is the code for subtract 0110 and not 0011?
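
A small behavioral model of an ALU driven by these six control codes might look as follows. It is a sketch, not the slide's circuit: the function names are invented, operands are assumed to be 32-bit words, and set-on-less-than is assumed to compare signed values.

```python
MASK32 = 0xFFFFFFFF


def to_signed(x):
    """Reinterpret a 32-bit word as a signed two's-complement integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def alu(control, a, b):
    """Return (result, zero) for the 4-bit ALU control codes listed above."""
    a &= MASK32
    b &= MASK32
    if control == 0b0000:
        result = a & b
    elif control == 0b0001:
        result = a | b
    elif control == 0b0010:
        result = (a + b) & MASK32
    elif control == 0b0110:
        result = (a - b) & MASK32
    elif control == 0b0111:
        result = 1 if to_signed(a) < to_signed(b) else 0
    elif control == 0b1100:
        result = ~(a | b) & MASK32
    else:
        raise ValueError(f"unsupported ALU control code: {control:04b}")
    return result, result == 0


print(alu(0b0010, 2, 3))           # add
print(alu(0b0110, 5, 5))           # subtract; the Zero output is what beq uses
print(alu(0b0111, 0xFFFFFFFF, 1))  # slt with -1 < 1
```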

22 The control unit
- The control unit must compute the 4-bit ALU control input:
  - given the function code for arithmetic instructions
  - given the instruction type
    - ALUop = 00 for lw, sw
    - ALUop = 01 for beq
    - ALUop = 10 for arithmetic
- Describe it using a truth table (which can be turned into gates)
- ALUop is an intermediate 2-bit code computed from the opcode field of the instruction; it simplifies the logic needed to compute the 4-bit ALU operation control input
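
The two-level decoding described here can be written as a small lookup. This is a hedged sketch: `alu_control` is a hypothetical name, and the funct codes (32 add, 34 sub, 36 and, 37 or, 42 slt) are the standard MIPS R-type function codes, which the slide text itself does not list.

```python
# Standard MIPS funct codes mapped to the 4-bit ALU control input.
FUNCT_TO_CONTROL = {
    0b100000: 0b0010,  # add
    0b100010: 0b0110,  # sub
    0b100100: 0b0000,  # and
    0b100101: 0b0001,  # or
    0b101010: 0b0111,  # slt
}


def alu_control(alu_op, funct=0):
    """Map the 2-bit ALUop (plus funct for R-type) to the 4-bit ALU control."""
    if alu_op == 0b00:  # lw, sw: add to form the memory address
        return 0b0010
    if alu_op == 0b01:  # beq: subtract and test the Zero output
        return 0b0110
    if alu_op == 0b10:  # arithmetic: the funct field decides
        return FUNCT_TO_CONTROL[funct]
    raise ValueError(f"unsupported ALUop: {alu_op:02b}")


for name, alu_op, funct in [("lw/sw", 0b00, 0), ("beq", 0b01, 0),
                            ("add", 0b10, 0b100000), ("slt", 0b10, 0b101010)]:
    print(f"{name:6s} -> {alu_control(alu_op, funct):04b}")
```

Because ALUop already summarizes the opcode, the funct field only needs to be examined in one of the three cases, which is the simplification the slide refers to.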

23 ALUop ALU operation

24 The control unit Simple combinational logic (truth tables)

25 Implementing jump

26 Single-cycle control structure
- Every instruction begins execution on one clock edge and completes execution on the next clock edge
- We use a single long clock cycle for every instruction
- All of the control logic is combinational
- We wait for everything to settle down, and for the right thing to be done
  - the ALU might not produce the “right answer” right away
  - we use write signals along with the clock to determine when to write
- The cycle time is determined by the length of the longest path

27 Single-cycle implementation
- Critical path for different instruction classes
  Instruction class   Functional units used by the instruction class
  R-type              instr. fetch, reg. access, ALU, reg. access
  Load                instr. fetch, reg. access, ALU, mem. access, reg. access
  Store               instr. fetch, reg. access, ALU, mem. access
  Branch              instr. fetch, reg. access, ALU
  Jump                instr. fetch

28 Performance of single-cycle machines
- Assume the major functional units of a machine have the following delays:
  - Memory units: 200 ps
  - ALU and adders: 100 ps
  - Register file (read or write): 50 ps
  - Muxes, control unit, PC accesses, sign extension unit: no delay
- Instruction mix
  - 25% loads, 10% stores, 45% ALU instructions, 15% branches, and 5% jumps
- What is the execution time for an implementation in which
  - every instruction operates in 1 clock cycle of a fixed length?
  - every instruction executes in 1 clock cycle using a variable-length clock?

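29 Timing for different instruction classes
  Instr. class   Instr. Mem.   Reg. Read   ALU      Data Mem.   Reg. Write   Total
  R-type         200 ps        50 ps       100 ps   0           50 ps        400 ps
  Load           200 ps        50 ps       100 ps   200 ps      50 ps        600 ps
  Store          200 ps        50 ps       100 ps   200 ps      0            550 ps
  Branch         200 ps        50 ps       100 ps   0           0            350 ps
  Jump           200 ps        0           0        0           0            200 ps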

30 Single-cycle machine: performance
- The clock cycle for a machine with a single clock for all instructions is determined by the longest instruction:
  - CPU clock cycle (single clock) = 600 ps
- The average clock cycle for a machine with a variable clock is:
  - CPU clock cycle (variable clock) = 600 x 25% + 550 x 10% + 400 x 45% + 350 x 15% + 200 x 5% = 447.5 ps
- CPU execution time = IC x CPI x clock cycle time
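
The numbers in this comparison can be recomputed from the unit delays and the instruction mix given earlier. The sketch below assumes the functional units listed for each class on the critical-path slide; the variable names are invented for the example, and the mix is kept in whole percent so that the arithmetic stays exact.

```python
# Delays in picoseconds, from the assumptions slide.
DELAY_PS = {"mem": 200, "alu": 100, "reg": 50}

# Functional units on the critical path of each instruction class.
UNITS = {
    "R-type": ["mem", "reg", "alu", "reg"],
    "Load": ["mem", "reg", "alu", "mem", "reg"],
    "Store": ["mem", "reg", "alu", "mem"],
    "Branch": ["mem", "reg", "alu"],
    "Jump": ["mem"],
}

# Instruction mix in percent.
MIX_PERCENT = {"Load": 25, "Store": 10, "R-type": 45, "Branch": 15, "Jump": 5}

latency = {cls: sum(DELAY_PS[u] for u in units) for cls, units in UNITS.items()}
fixed_cycle = max(latency.values())
variable_cycle = sum(MIX_PERCENT[cls] * latency[cls] for cls in MIX_PERCENT) / 100

print(latency)
print("fixed clock cycle (ps):", fixed_cycle)
print("average variable clock cycle (ps):", variable_cycle)
# With CPI = 1 and the same instruction count, the execution-time ratio
# reduces to the ratio of the two cycle times.
print("ratio fixed / variable:", round(fixed_cycle / variable_cycle, 2))
```

Since IC and CPI are identical in both designs, the variable-clock machine would be faster by the ratio of the cycle times, about 1.34 under these assumptions.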

31 Single-cycle problems
- Single-cycle problems:
  - the clock cycle is equal to the worst-case delay over all instructions
    - i.e., we violate the key design principle of making the common case fast
  - if we implement more complicated instructions (e.g., floating-point arithmetic), the performance penalty is unbearable!
  - some functional units must be duplicated (wasteful of area)
- One possible solution: a “multicycle” datapath
  - use a “smaller” cycle time
  - have different instructions take different numbers of cycles

32 Multi-cycle approach
- We will be reusing functional units
  - the ALU is used to compute addresses and to increment the PC
  - one memory is used for both instructions and data
- Break up the instructions into steps; each step takes one cycle
  - balance the amount of work to be done
  - restrict each cycle to use only one major functional unit
- At the end of a cycle
  - store values for use in later cycles (easiest thing to do)
  - introduce additional “internal” registers
- Our control signals will not be determined directly by the instruction
  - e.g., what should the ALU do for a “subtract” instruction?
- We'll use a finite state machine for control

33 Multi-cycle datapath with control lines (without jump)
[Figure: multi-cycle datapath; labels include PCLoad, IR, MDR, and a single memory used both for instructions and data]

34 Five execution steps
1. Instruction Fetch
2. Instruction Decode and Register Fetch
3. Execution, Memory Address Computation, or Branch Completion
4. Memory Access or R-type instruction completion
5. Write-back step
Instructions take from 3 to 5 cycles!

35 Step 1: instruction fetch
- Use the PC to get the instruction and put it in the Instruction Register (IR).
- Increment the PC by 4 and put the result back in the PC.
- Can be described succinctly using RTL, the "Register-Transfer Language":
    IR <= Memory[PC];
    PC <= PC + 4;
- Can we figure out the values of the control signals?
- What is the advantage of updating the PC now?

36 Step 2: instruction decode and register fetch
- Read registers rs and rt in case we need them
- Compute the branch address in case the instruction is a branch
- RTL:
    A <= Reg[IR[25:21]];
    B <= Reg[IR[20:16]];
    ALUOut <= PC + (sign-extend(IR[15:0]) << 2);
- We aren't setting any control lines based on the instruction type (the instruction is still being decoded in the control unit)

37 Step 3: instruction dependent
- The ALU performs one of three functions, based on the instruction type
- Memory reference (address computation):
    ALUOut <= A + sign-extend(IR[15:0]);
- R-type (execution of the operation):
    ALUOut <= A op B;
- Branch completion (write PC):
    if (A==B) PC <= ALUOut;
- Jump completion (write PC):
    PC <= {PC[31:28], IR[25:0], 2'b00};
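
The concatenation in the jump-completion transfer can be expressed with masks and shifts. A minimal sketch, assuming 32-bit words; `jump_target` is a name chosen for the example, and `pc` stands for the PC value at this step, which was already incremented by 4 in step 1.

```python
def jump_target(pc, instr):
    """Concatenate PC[31:28], IR[25:0], and two zero bits into a 32-bit address."""
    upper = pc & 0xF0000000              # PC[31:28] kept in place
    lower = (instr & 0x03FFFFFF) << 2    # IR[25:0] followed by 00
    return upper | lower


# Hypothetical values: the low 26 bits of the instruction word hold 0x10.
print(hex(jump_target(0x40000004, 0x10)))
```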

38 Step 4: R-type or memory-access
- Memory reference (loads and stores access memory):
    load:  MDR <= Memory[ALUOut];    (read access)
    store: Memory[ALUOut] <= B;      (write access; store completion step)
- R-type instruction completion step (write destination register):
    Reg[IR[15:11]] <= ALUOut;
- The write actually takes place at the end of the cycle, on the clock edge

39 Step 5: Write back
- Memory reference, load completion step:
    Reg[IR[20:16]] <= MDR;
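
To see how the five register transfers fit together, the following sketch walks one lw through them in order. It is a simplified illustration, not the slide's hardware: memory is modeled as a dictionary keyed by byte address, `run_lw` is a hypothetical helper, and the register numbers $t2 = 10 and $t3 = 11 follow the standard MIPS convention.

```python
def sign_extend16(value):
    """Interpret the low 16 bits of value as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def run_lw(pc, regs, memory):
    """Execute one lw in five steps; returns the updated PC."""
    # Step 1: instruction fetch
    ir = memory[pc]
    pc = pc + 4
    # Step 2: decode, register fetch, speculative branch address
    a = regs[(ir >> 21) & 0x1F]
    b = regs[(ir >> 16) & 0x1F]  # read "in case we need it"; unused by lw
    alu_out = pc + (sign_extend16(ir & 0xFFFF) << 2)
    # Step 3: memory address computation (overwrites the branch address)
    alu_out = a + sign_extend16(ir & 0xFFFF)
    # Step 4: memory read into the memory data register
    mdr = memory[alu_out]
    # Step 5: write back to the register named by IR[20:16]
    regs[(ir >> 16) & 0x1F] = mdr
    return pc


lw_word = (35 << 26) | (11 << 21) | (10 << 16) | 0  # lw $t2, 0($t3)
regs = [0] * 32
regs[11] = 0x100                                    # $t3 holds the base address
memory = {0x0: lw_word, 0x100: 42}

new_pc = run_lw(0x0, regs, memory)
print(new_pc, regs[10])
```

The locals `a`, `b`, `alu_out`, and `mdr` stand in for the “internal” registers that carry values from one cycle to the next.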

40 Summary:

41 Simple Questions
- How many cycles will it take to execute this code?
          lw  $t2, 0($t3)
          lw  $t3, 4($t3)
          beq $t2, $t3, Label    # assume not taken
          add $t5, $t2, $t3
          sw  $t5, 8($t3)
  Label:  ...
- What is going on during the 8th cycle of execution?
- In what cycle does the actual addition of $t2 and $t3 take place?
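
One way to check answers to these questions is to lay the instructions out cycle by cycle, using the step counts from the five-step breakdown (lw 5, sw 4, R-type 4, beq 3). The step labels and the `schedule` helper are assumptions made for this sketch.

```python
# Steps taken by each instruction in the multi-cycle design.
STEPS = {
    "lw": ["fetch", "decode", "address", "memory read", "write back"],
    "sw": ["fetch", "decode", "address", "memory write"],
    "add": ["fetch", "decode", "execute", "R-type completion"],
    "beq": ["fetch", "decode", "branch completion"],
}

program = ["lw", "lw", "beq", "add", "sw"]  # the beq is assumed not taken


def schedule(instructions):
    """Return one (instruction index, mnemonic, step) entry per clock cycle."""
    timeline = []
    for index, op in enumerate(instructions):
        for step in STEPS[op]:
            timeline.append((index, op, step))
    return timeline


timeline = schedule(program)
total_cycles = len(timeline)
eighth_cycle = timeline[8 - 1]
add_cycle = 1 + timeline.index((3, "add", "execute"))

print("total cycles:", total_cycles)
print("8th cycle:", eighth_cycle)
print("addition of $t2 and $t3 happens in cycle:", add_cycle)
```

Under these assumptions the sequence takes 5 + 5 + 3 + 4 + 4 cycles, the 8th cycle is the address computation of the second lw, and the add executes in its third step.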

42 Complete multi-cycle machine

43 Review: Finite State Machine (FSM)
- Finite state machines consist of:
  - a set of states
  - a next-state function (determined by the current state and the input)
  - an output function (determined by the current state and possibly the input)
- We'll use a Moore machine (output based only on the current state)
[Figure: block diagram with inputs, current state, next-state function, output function, outputs, and output register; labeled Mealy Machine and Moore Machine]

44 Implementing the control unit
- Note on the state diagram: a signal is don't care if it is not mentioned, asserted if only its name is given, and otherwise has the exact value shown
- How many state bits will we need?
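
The next-state function of such a control FSM can be sketched as a small table-driven function. The state names and their numbering below are assumptions chosen to match the five execution steps (the slide's own state diagram is not reproduced in this transcript), and instruction classes are passed as strings instead of opcodes to keep the example short.

```python
(FETCH, DECODE, MEM_ADDR, MEM_READ, WRITE_BACK,
 MEM_WRITE, EXECUTE, RTYPE_DONE, BRANCH_DONE, JUMP_DONE) = range(10)

# Where the machine goes after decoding, per instruction class.
AFTER_DECODE = {
    "lw": MEM_ADDR, "sw": MEM_ADDR, "rtype": EXECUTE,
    "beq": BRANCH_DONE, "j": JUMP_DONE,
}


def next_state(state, instr_class):
    """Next-state function: depends on the current state and the input."""
    if state == FETCH:
        return DECODE
    if state == DECODE:
        return AFTER_DECODE[instr_class]
    if state == MEM_ADDR:
        return MEM_READ if instr_class == "lw" else MEM_WRITE
    if state == MEM_READ:
        return WRITE_BACK
    if state == EXECUTE:
        return RTYPE_DONE
    return FETCH  # every completion state returns to instruction fetch


def cycles(instr_class):
    """Count the states visited before the machine returns to FETCH."""
    state, count = FETCH, 1
    while True:
        state = next_state(state, instr_class)
        if state == FETCH:
            return count
        count += 1


STATE_BITS = (10 - 1).bit_length()  # bits needed to encode 10 states

print({c: cycles(c) for c in AFTER_DECODE})
print("state bits:", STATE_BITS)
```

In a Moore machine the control outputs would be a second table indexed by the state alone; that table is left out here because the signal values live in the state diagram.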

45 Implementing the control unit
- The value of the control signals depends on:
  - what instruction is being executed
  - which step is being performed
- In each clock cycle, decide all the actions that need to be taken
- The control unit is the most complex part of the design
  - it can be “hard-wired”, ROM-based, or microprogrammed
- Simpler instructions lead to simpler control
  - sometimes a sequence of simple instructions is more effective than a single complex instruction
  - complex instructions may have to be maintained for compatibility reasons

46 Historical perspective
- Historical context of CISC:
  - too much logic to put on a single chip
  - use a ROM (or even RAM) to hold the microcode
  - it's easy to add new instructions
- Microprogramming
  - appropriate if there are hundreds of opcodes, modes, cycles, etc.
  - signals are specified symbolically using microinstructions

47 Microprogramming = IR[31:35]

48 Microprogramming detail