The Processor: Datapath and Control. Outline Goals in processor implementation Brief review of sequential logic design Pieces of the processor implementation.

1 The Processor: Datapath and Control

2 Outline: Goals in processor implementation; Brief review of sequential logic design; Pieces of the processor implementation puzzle; A simple implementation of a MIPS integer instruction subset (datapath, control logic design); A multi-cycle MIPS implementation (datapath, control logic design); Microcoded control; Exceptions; Some real microprocessor datapaths and control

3 Goals in processor implementation: Balance the rate of supply of instructions and data against the rate at which the execution core can consume them and update memory. (Figure: instruction supply and data supply feeding the execution core.)

4 Goals in processor implementation: Recall from Chapter 2 that CPU Time = INST x CPI x CT, where INST is the instruction count, CPI is the cycles per instruction, and CT is the clock cycle time. INST is largely a function of the ISA and compiler. Objective: minimize CPI x CT within design constraints (cost, power, etc.). Trading off CPI and CT is tricky. (Figure: multiplier logic.)

5 Brief review of sequential logic design: State elements are clocked devices (flip-flops, etc.). Combinational elements hold no state (ALU, caches, multiplier, multiplexers, etc.). In edge-triggered clocking, state elements are updated only on the (rising) edge of the clock pulse.

6 Brief review of sequential logic design: The same state element can be read at the beginning of a clock cycle and updated at the end. Example: incrementing the PC. (Figure: the PC register holds 8 and feeds an adder whose other input is 4; the adder output, 12, is loaded into the PC register on the next clock edge.)
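
The PC example on this slide can be sketched in a few lines of Python. This is only an illustration of edge-triggered behavior: the `Register` class and its `tick` method are hypothetical names, not part of the original slides, and the adder is modeled as a stateless function.

```python
class Register:
    """Edge-triggered state element: its output changes only on tick()."""

    def __init__(self, value=0):
        self.value = value       # output, readable during the whole cycle
        self.next_value = value  # input, sampled at the clock edge

    def tick(self):
        """Model the rising clock edge: latch the input."""
        self.value = self.next_value


def adder(a, b):
    """Combinational element: no state, output is a function of inputs."""
    return (a + b) & 0xFFFFFFFF


pc = Register(8)
trace = []
for _ in range(3):
    pc.next_value = adder(pc.value, 4)       # read PC early in the cycle
    trace.append((pc.value, pc.next_value))  # (adder input, adder output)
    pc.tick()                                # update PC at the clock edge
```

In the first cycle the adder sees 8 and produces 12, matching the values on the slide; the PC itself does not change until the clock edge.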

7 Our processor design progression (1) Instruction fetch, execute, and operand reads from data memory all take place in a single clock cycle (2) Instruction fetch, execute, and operand reads from data memory take place in successive clock cycles (3) A pipelined design

8 Pieces of the processor puzzle: Instruction fetch (instruction supply); Execution (execution core); Data memory (data supply)

9 Instruction fetch datapath: Memory to hold instructions; Register to hold the instruction memory address (PC); Logic to generate the next instruction address (PC + 4)

10 Execution datapath: Focus on only a subset of all MIPS instructions: add, sub, and, or; lw, sw; slt; beq, j. For all instructions except j, we read operands from the register file and perform an ALU operation. For all instructions except sw, beq, and j, we write a result into the register file.

11 Execution datapath: Register file block diagram. Read register 1, 2: source operand register numbers. Read data 1, 2: source operands (32 bits each). Write register: destination operand register number. Write data: data written into the register file. RegWrite: when asserted, enables the writing of Write data.

12 Execution datapath: Datapath for R-type (add, sub, and, or, slt). R-type instruction format: op [31-26], rs [25-21], rt [20-16], rd [15-11], shamt [10-6], funct [5-0]
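
As an illustration of the R-type field layout, the following Python sketch extracts each field from a 32-bit instruction word with shifts and masks. The function name `decode_r_type` is hypothetical; the example word encodes `add $4, $18, $30` (opcode 0, funct 100000).

```python
def decode_r_type(instr):
    """Split a 32-bit R-type instruction word into its six fields."""
    return {
        "op":    (instr >> 26) & 0x3F,  # bits 31-26
        "rs":    (instr >> 21) & 0x1F,  # bits 25-21
        "rt":    (instr >> 16) & 0x1F,  # bits 20-16
        "rd":    (instr >> 11) & 0x1F,  # bits 15-11
        "shamt": (instr >> 6) & 0x1F,   # bits 10-6
        "funct": instr & 0x3F,          # bits 5-0
    }


# add $4, $18, $30: rd = 4, rs = 18, rt = 30, funct = 0b100000
word = (0 << 26) | (18 << 21) | (30 << 16) | (4 << 11) | (0 << 6) | 0b100000
fields = decode_r_type(word)
```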

13 Execution datapath: Datapath for beq instruction. I-type instruction format: op [31-26], rs [25-21], rt [20-16], immediate [15-0]. The Zero ALU output indicates whether rs = rt (branch is taken/not taken). The branch target address is the sign-extended immediate, shifted left two positions and added to PC+4.
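
The branch target rule can be checked with a small Python sketch. The helper names (`sign_extend16`, `branch_target`, `next_pc_beq`) are hypothetical; the arithmetic follows the slide: sign-extend the 16-bit immediate, shift left by two, add to PC+4, and take the branch only when the ALU subtraction yields zero.

```python
MASK32 = 0xFFFFFFFF


def sign_extend16(imm):
    """Interpret the low 16 bits of imm as a signed two's-complement value."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm


def branch_target(pc, imm16):
    """Target = (PC + 4) + (sign-extended immediate << 2)."""
    return (pc + 4 + (sign_extend16(imm16) << 2)) & MASK32


def next_pc_beq(pc, rs_val, rt_val, imm16):
    """Model the Zero output of an ALU subtract selecting the next PC."""
    zero = ((rs_val - rt_val) & MASK32) == 0
    return branch_target(pc, imm16) if zero else (pc + 4) & MASK32
```

For example, a beq at address 0x1000 with immediate 3 targets 0x1010 when taken and falls through to 0x1004 otherwise.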

14 Data memory: Used for lw, sw (I-type format). Block diagram. Address: memory location to be read or written. Read data: data out of the memory on a load. Write data: data into the memory on a store. MemRead: indicates a read operation is to be performed. MemWrite: indicates a write operation is to be performed.

15 Execution datapath + data memory: Datapath for lw, sw. The address is the sign-extended immediate added to the source operand read out of the register file. sw: data written to memory from the specified register. lw: data written to the register file from the specified memory address.

16 Putting the pieces together: Single clock cycle for fetch, execute, and operand read from data memory. 3 MUXes: (1) register file operand or sign-extended immediate to ALU; (2) ALU or data memory output written to register file; (3) PC+4 or branch target address written to PC register.

17 Datapath for R-type instructions Example: add $4, $18, $30

18 Datapath for I-type ALU instructions Example: slti $7, $4, 100

19 Datapath for not taken beq instruction Example: beq $28, $13, EXIT

20 Datapath for taken beq instruction Example: beq $28, $13, EXIT

21 Datapath for load instruction Example: lw $8, 112($2)

22 Datapath for store instruction Example: sw $10, 0($3)

23 Control signals we need to generate

24 ALU operation control: ALU control input codes from Chapter 4, listed as control input: ALU operation (used for). 000: and; 001: or; 010: add (add, lw, sw); 110: subtract (sub, beq); 111: set on less than (slt). Two steps to generate the ALU control input: (1) use the opcode to distinguish R-type, lw and sw, and beq; (2) if R-type, use the funct field to determine the ALU control input.

25 ALU operation control: The opcode is used to generate a 2-bit signal called ALUOp with the following encodings. 00: lw or sw, perform an ALU add. 01: beq, perform an ALU subtract. 10: R-type, ALU operation is determined by the funct field. Funct to ALU control input for R-type: 100000 (add): 010; 100010 (sub): 110; 100100 (and): 000; 100101 (or): 001; 101010 (slt): 111.
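
The two-level decoding described on these slides can be written as a lookup, sketched here in Python. The function name `alu_control` and the dictionary name are hypothetical; the encodings are the ones listed in the slide tables.

```python
# funct field -> 3-bit ALU control input, for R-type instructions
FUNCT_TO_ALU_CONTROL = {
    0b100000: 0b010,  # add
    0b100010: 0b110,  # sub
    0b100100: 0b000,  # and
    0b100101: 0b001,  # or
    0b101010: 0b111,  # slt
}


def alu_control(alu_op, funct=0):
    """Return the ALU control input for a 2-bit ALUOp and a 6-bit funct."""
    if alu_op == 0b00:   # lw or sw: add to form the address
        return 0b010
    if alu_op == 0b01:   # beq: subtract to compare rs and rt
        return 0b110
    if alu_op == 0b10:   # R-type: funct selects the operation
        return FUNCT_TO_ALU_CONTROL[funct]
    raise ValueError("unsupported ALUOp encoding")
```

Splitting the decode this way keeps the main control unit small: it only needs to emit ALUOp, and the funct field is examined by the separate ALU control logic.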

26 Comparing instruction fields: Opcode, source registers, function code, and immediate fields are always in the same place. R-type: op = 0 [31-26], rs [25-21], rt [20-16], rd [15-11], shamt [10-6], funct [5-0]. beq: op = 4 [31-26], rs [25-21], rt [20-16], immediate (offset) [15-0]. lw (sw): op = 35 (43) [31-26], rs [25-21], rt [20-16], immediate (offset) [15-0]. The destination register is bits 15-11 (rd) for R-type and bits 20-16 (rt) for lw, so a MUX selects the right one.

27 Datapath with instr fields and ALU control

28 Main control unit design

29 Truth table (opcodes shown in the figure: 4 for beq, 0 for R-type, 35 for lw, 43 for sw)

30 Adding support for jump instructions: J-type format: op = 2 [31-26], target [25-0]. The next PC is formed by shifting the 26-bit target left two bits and combining it with the 4 high-order bits of PC+4. The next PC will now be one of PC+4, the beq target address, or the j target address, so we need another MUX and control bit.
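
A short Python sketch of the jump address formation and the resulting three-way next-PC selection. The function names and the `jump` / `branch_taken` parameters are hypothetical stand-ins for the control signals in the datapath figure.

```python
MASK32 = 0xFFFFFFFF


def jump_target(pc, target26):
    """Concatenate the top 4 bits of PC+4 with the 26-bit target << 2."""
    pc_plus_4 = (pc + 4) & MASK32
    return (pc_plus_4 & 0xF0000000) | ((target26 & 0x03FFFFFF) << 2)


def next_pc(pc, jump, branch_taken, branch_addr, target26):
    """Select the next PC the way the two cascaded MUXes would."""
    if jump:
        return jump_target(pc, target26)
    if branch_taken:
        return branch_addr
    return (pc + 4) & MASK32
```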

31 Adding support for jump instructions

32 Evaluation of the simple implementation: All instructions take one clock cycle (CPI = 1). Assume the following worst case delays. Instruction memory: 4 time units. Data memory: 4 time units (read), 2 time units (write). ALU: 4 time units. Adders: 3 time units. Register file: 2 time units (read), 1 time unit (write). MUXes, sign extension, gates, and shifters: 1 time unit. There is a large disparity in worst case delays among instruction types. R-type: 4+2+1+4+1+1 = 13 time units. beq: 4+2+1+4+1+1+1 = 14 time units. j: 4+1+1 = 6 time units. store: 4+2+4+2 = 12 time units. load: 4+2+4+4+1+1 = 16 time units.
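
The per-instruction sums above can be reproduced with a few lines of Python. The delay lists are copied from the slide; the variable names are hypothetical. The single-cycle clock period must cover the slowest instruction, so it is the maximum of the totals.

```python
# Component delays along each instruction's critical path, as on the slide
PATHS = {
    "R-type": [4, 2, 1, 4, 1, 1],
    "beq":    [4, 2, 1, 4, 1, 1, 1],
    "j":      [4, 1, 1],
    "store":  [4, 2, 4, 2],
    "load":   [4, 2, 4, 4, 1, 1],
}

totals = {name: sum(delays) for name, delays in PATHS.items()}

# Every instruction takes one cycle, so the cycle must fit the slowest one
cycle_time = max(totals.values())
slowest = max(totals, key=totals.get)
```

With these numbers the load instruction sets the cycle time, so a jump that needs only 6 time units still occupies a full 16-unit cycle.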

33 Evaluation of the simple implementation: The disparity would be worse in a real machine, which has even slower integer instructions (e.g., multiply/divide in MIPS) and floating point instructions. Simple instructions take as long as complex ones.

34 A multicycle implementation: Instruction fetch, register file access, etc. occur in separate clock cycles. Different instruction types take different numbers of cycles to complete. Clock cycle time should be shorter.

35 High level view of datapath: New registers store the results of each step (not programmer visible!). Hardware can be shared: one ALU for PC+4, branch target calculation, EA calculation, and arithmetic operations; one memory for instructions and data.

36 Detailed multi-cycle datapath

37 Multi-cycle control

38 First two cycles for all instructions: Instruction fetch (1st cycle): load the instruction into the IR register, IR = Memory[PC]; increment the PC, PC = PC+4. Instruction decode and register fetch (2nd cycle): read register file locations rs and rt into the A and B registers, A = Reg[IR[25-21]], B = Reg[IR[20-16]]; calculate the branch target address and load it into ALUOut, ALUOut = PC + (sign-extend(IR[15-0]) << 2).

39 Instruction fetch IR=Mem[PC]

40 Instruction fetch PC=PC+4

41 Instruction decode and register fetch A=Reg[IR[25-21]], B=Reg[IR[20-16]]

42 Instruction decode and register fetch ALUOut = PC+(sign-extend (IR[15-0]) <<2)

43 Additional cycles for R-type: Execution: ALUOut = A op B. Completion: Reg[IR[15-11]] = ALUOut

44 R-type execution cycle ALUOut = A op B

45 R-type completion cycle Reg[IR[15-11]] = ALUOut

46 Additional cycles for store: Address computation: ALUOut = A + sign-extend(IR[15-0]). Memory access: Memory[ALUOut] = B

47 Store address computation cycle ALUOut = A + sign-extend (IR[15-0])

48 Store memory access cycle Memory[ALUOut] = B

49 Additional cycles for load: Address computation: ALUOut = A + sign-extend(IR[15-0]). Memory access: MDR = Memory[ALUOut]. Read completion: Reg[IR[20-16]] = MDR

50 Load memory access cycle MDR = Memory[ALUOut]

51 Load read completion cycle Reg[IR[20-16]] = MDR

52 Additional cycle for beq: Branch completion: if (A == B) PC = ALUOut

53 Branch completion cycle for beq if (A == B) PC = ALUOut

54 Additional cycle for j: Jump completion: PC = PC[31-28] || (IR[25-0]<<2)

55 Jump completion cycle for j PC = PC[31-28] || (IR[25-0]<<2)

56 Control logic design: Implemented as a finite state machine. Inputs: 6 opcode bits. Outputs: 16 control signals. State: 4 bits for 10 states.
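
The next-state behavior of this finite state machine can be sketched in Python. This is an illustrative model only: the state names are hypothetical labels for the ten cycles described on the preceding slides, and the 16 control outputs are omitted. Opcodes follow the slides (R-type 0, j 2, beq 4, lw 35, sw 43).

```python
STATES = [
    "fetch", "decode", "mem_address", "load_access", "load_completion",
    "store_access", "r_execute", "r_completion",
    "branch_completion", "jump_completion",
]


def next_state(state, opcode):
    """Return the state that follows `state` for the given 6-bit opcode."""
    if state == "fetch":
        return "decode"
    if state == "decode":
        if opcode == 0:
            return "r_execute"
        if opcode in (35, 43):
            return "mem_address"
        if opcode == 4:
            return "branch_completion"
        if opcode == 2:
            return "jump_completion"
        raise ValueError("undefined instruction")
    if state == "mem_address":
        return "load_access" if opcode == 35 else "store_access"
    if state == "load_access":
        return "load_completion"
    if state == "r_execute":
        return "r_completion"
    return "fetch"  # every final state returns to instruction fetch


def cycles(opcode):
    """Count the clock cycles an instruction spends before the next fetch."""
    state, count = "fetch", 0
    while True:
        state = next_state(state, opcode)
        count += 1
        if state == "fetch":
            return count
```

Ten states need four state bits because three bits can encode only eight.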

57 High-level view of FSM

58 Instruction fetch cycle

59 Instruction decode/register fetch cycle

60 R-type execution cycle

61 R-type completion cycle

62 Memory address computation cycle

63 Store memory access cycle

64 Load memory access cycle

65 Load read completion cycle

66 beq branch completion cycle

67 j jump completion cycle

68 Complete FSM

69 Evaluation of the multi-cycle design: CPI is calculated from the instruction mix. For gcc (Figure 4.54): 23% loads (5 cycles each), 13% stores (4 cycles each), 19% branches (3 cycles each), 2% jumps (3 cycles each), 43% ALU (4 cycles each). CPI = 0.23*5 + 0.13*4 + 0.19*3 + 0.02*3 + 0.43*4 = 4.02. Cycle time is calculated from the longest delay path, assuming the same timing delays as before.
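
The CPI figure is a weighted average of cycle counts, which the following Python sketch recomputes from the gcc mix quoted on the slide. The dictionary and variable names are hypothetical.

```python
# instruction class -> (fraction of dynamic instructions, cycles each)
GCC_MIX = {
    "load":   (0.23, 5),
    "store":  (0.13, 4),
    "branch": (0.19, 3),
    "jump":   (0.02, 3),
    "alu":    (0.43, 4),
}

# Weighted average: sum of frequency x cycle count over all classes
cpi = sum(fraction * cycles for fraction, cycles in GCC_MIX.values())
total_fraction = sum(fraction for fraction, _ in GCC_MIX.values())
```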

70 Worst case datapath: branch target ALUOut = PC+(sign-extend (IR[15-0]) <<2) Delay = 7 time units (delay of simple = 16)

71 Evaluation of the multi-cycle design: Time per instruction (TPI) of simple and multi-cycle. TPI(simple) = CPI(simple) x cycle time(simple) = 1 x 16 = 16 time units. TPI(multi-cycle) = 4.02 x 7 = 28.1 time units. The simple single-cycle implementation is faster. A multi-cycle design with pipelining will be considerably faster than the single-cycle implementation.
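
The comparison follows directly from TPI = CPI x cycle time. A minimal Python check, using the CPI and cycle-time values from the slides (the function name is hypothetical):

```python
def time_per_instruction(cpi, cycle_time):
    """Average time per instruction, in the slide's abstract time units."""
    return cpi * cycle_time


tpi_simple = time_per_instruction(1.0, 16)   # single-cycle: CPI = 1
tpi_multi = time_per_instruction(4.02, 7)    # multi-cycle: gcc mix CPI

# How many times slower the multi-cycle design is for this mix
slowdown = tpi_multi / tpi_simple
```

Under these assumed delays the multi-cycle design is roughly 1.76 times slower per instruction, because the cycle time shrinks by less than the CPI grows.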

72 Exceptions: An exception is an event that causes a deviation from the normal execution of instructions. Types of exceptions: operating system call (e.g., read a file, print a file); input/output device request; page fault (request for instruction/data not in memory, see Ch 7); arithmetic error (overflow, underflow, etc.); undefined instruction; misaligned memory access (e.g., word access to an odd address); memory protection violation; hardware error; power failure. An exception is not usually due to an error! We need to be able to restart the program at the point where the exception was detected.

73 Handling exceptions: (1) Detect the exception. (2) Save enough information about the exception to handle it properly. (3) Save enough information about the program to resume it after the exception is handled. (4) Handle the exception. (5) Either terminate the program or resume executing it, depending on the exception type.

74 Detecting exceptions: Performed by hardware. Overflow: determined from the opcode and the overflow output of the ALU. Undefined instruction: determined from the opcode in the main control unit, and from the function code and ALUOp in the ALU control logic.

75 Detecting exceptions overflow undefined instruction

76 Saving exception information: Performed by hardware. We need the type of exception and the PC of the instruction when the exception occurred. In MIPS, the Cause register holds the exception type: we need an encoding for each exception type and a signal from the control unit to load it into the Cause register. The Exception Program Counter (EPC) register holds the PC: we need to subtract 4 from the PC register to get the correct PC (since we loaded PC+4 into the PC register during the Instruction Fetch cycle) and a signal from the control unit to load it into EPC.
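
The hardware actions on this slide can be modeled in a few lines of Python. This is a sketch under stated assumptions: the function name, the dictionary representation of processor state, and the example cause code and handler address are all hypothetical; the slides do not give specific encodings.

```python
MASK32 = 0xFFFFFFFF


def take_exception(state, cause_code, handler_address):
    """Return the new processor state after an exception is detected.

    `state` maps register names to values. PC already holds PC+4 because
    it was incremented during the instruction fetch cycle.
    """
    new_state = dict(state)
    new_state["EPC"] = (state["PC"] - 4) & MASK32  # address of the faulting instruction
    new_state["Cause"] = cause_code                # encoded exception type
    new_state["PC"] = handler_address              # next fetch starts the handler
    return new_state


# Hypothetical values: cause code 1 and a handler at 0xC0000000
before = {"PC": 0x00001004, "EPC": 0, "Cause": 0}
after = take_exception(before, cause_code=1, handler_address=0xC0000000)
```

Saving PC-4 rather than PC matters because the handler must be able to restart the program at the instruction that raised the exception, not the one after it.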

77 Saving exception information

78 Saving program information: Needed in order to restart the program from the point where the exception occurred. Performed by hardware and software. The EPC register holds the PC of the instruction that had the exception (where we will restart the program). The software routine that handles the exception saves any registers that it will need to the stack and restores them when it is done.

79 Handling the exception: Performed by hardware and software. We need to transfer control to a software routine that handles the exception (the exception handler). The exception handler runs in a privileged mode that allows it to use special instructions and access all of memory; our programs run in user mode. The hardware enables the privileged mode, loads the PC with the address of the exception handler, and transitions to the Fetch state.

80 Handling the exception Loading the PC with exception handler address

81 Exception handler: (1) Stores the values of the registers that it will need to the stack. (2) Handles the particular exception. Operating system call: calls the subroutine associated with the call. Underflow: sets the register to zero or uses denormalized numbers. I/O: handles the particular I/O request, e.g., keyboard input. (3) Restores registers from the stack (if the program is to be restarted). (4) Terminates the program, or resumes execution by loading the PC with EPC and transitioning to the Instruction Fetch state.

82 FSM modifications

83 The Intel Pentium processor: Introduced in 1993. Uses a multi-cycle datapath with the following steps for integer instructions. Prefetch (PF): read instruction from the instruction memory. Decode 1 (D1): first stage of instruction decode. Decode 2 (D2): second stage of instruction decode. Execute (E): perform the ALU operation. Write back (WB): write the result to the register file. Datapath usage varies by instruction type: simple instructions make one pass through the datapath using state machine control; complex instructions make multiple passes, reusing the same hardware elements under microcode control.

84 The Intel Pentium processor: The Pentium is a 2-way superscalar design, as two instructions can execute simultaneously. The ideal CPI for a 2-way superscalar is 0.5. Conditions for superscalar execution: both must be simple instructions; the result of the first instruction cannot be needed by the second; the two instructions cannot write the same register; the first instruction in program sequence cannot be a jump. (Figure: shared PF and D1 stages feeding two parallel D2, E, WB pipelines, the U pipe and the V pipe.)
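
The four pairing conditions listed on this slide can be expressed as a predicate. The Python sketch below is a simplified model, not a description of the actual Pentium issue logic: the `Instr` record and its fields are hypothetical, and only the conditions named on the slide are checked.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Instr:
    simple: bool                     # executes in one pass, no microcode
    dest: Optional[str] = None       # register written, if any
    sources: Tuple[str, ...] = ()    # registers read
    is_jump: bool = False


def can_pair(first: Instr, second: Instr) -> bool:
    """True if `first` (U pipe) and `second` (V pipe) may issue together."""
    if not (first.simple and second.simple):
        return False                 # both must be simple instructions
    if first.dest is not None and first.dest in second.sources:
        return False                 # second needs the first's result
    if first.dest is not None and first.dest == second.dest:
        return False                 # both write the same register
    if first.is_jump:
        return False                 # first in program order is a jump
    return True
```

For instance, two independent register-to-register operations pair, while an instruction that reads the register written by its predecessor does not.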

85 The Intel Pentium Pro processor: Introduced in 1995 as the successor to the Pentium. The basis for the Pentium II and Pentium III. Implements a 14-cycle, 3-way superscalar integer datapath; very high frequency is the goal. Uses out-of-order execution, in that instructions may execute out of their original program order. This is handled completely by hardware, transparently to software. Instructions execute as soon as their source operands become available. It complicates exception handling: some instructions before the excepting one may not have executed, while some after it may have executed.

86 The Intel Pentium Pro processor: Pentium Pro designers (and AMD designers before them) used innovative engineering to overcome the disadvantages of CISC ISAs. Many complex x86 instructions are internally translated by hardware into RISC-like micro-ops with state machine control. This achieves a very low CPI for simple integer operations, even on programs compiled for older implementations. The combination of high frequency and low CPI gave the Pentium Pro extremely competitive integer performance versus RISC microprocessors. The result has been that RISC CPUs have failed to gain the desktop market share that had been expected.

87 The Intel Pentium 4 processor: 20-cycle superscalar integer pipeline. Extremely high frequency (>3 GHz). Major effort to lower power dissipation. Clock gating: the clock to a unit is turned off when the unit is not in use. Trace cache: caches micro-ops of previously decoded complex instructions to avoid the power-consuming decode operation.
