Published by Denis Johns. Modified over 9 years ago.
1
The Processor: Datapath and Control
2
Outline
- Goals in processor implementation
- Brief review of sequential logic design
- Pieces of the processor implementation puzzle
- A simple implementation of a MIPS integer instruction subset
  - Datapath
  - Control logic design
- A multi-cycle MIPS implementation
  - Datapath
  - Control logic design
  - Microcoded control
- Exceptions
- Some real microprocessor datapath and control
3
Goals in processor implementation
- Balance the rate of supply of instructions and data against the rate at which the execution core can consume them and update memory
[Figure: instruction supply and data supply feeding the execution core]
4
Goals in processor implementation
- Recall from Chapter 2: CPU Time = INST x CPI x CT (instruction count x cycles per instruction x clock cycle time)
- INST is largely a function of the ISA and compiler
- Objective: minimize CPI x CT within design constraints (cost, power, etc.)
- Trading off CPI and CT is tricky
[Figure: multiplier logic]
5
Brief review of sequential logic design
- State elements are clocked devices
  - Flip-flops, etc.
- Combinatorial elements hold no state
  - ALU, caches, multiplier, multiplexers, etc.
- In edge-triggered clocking, state elements are updated only on the (rising) edge of the clock pulse
6
Brief review of sequential logic design
- The same state element can be read at the beginning of a clock cycle and updated at the end
- Example: incrementing the PC
[Figure: PC register feeding an adder with the constant 4; timing diagram of the clock, PC register (8, then 12), Add input (8), and Add output (12)]
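The PC example can be sketched in software. This is a minimal model under stated assumptions, not a hardware description: the `Register` class and `simulate_pc` function are hypothetical names introduced for illustration. The point is that the adder sees the old PC value for the whole cycle, and the register only takes the new value at the clock edge.

```python
class Register:
    """Edge-triggered state element (hypothetical model for illustration)."""

    def __init__(self, value=0):
        self.value = value        # output, stable during the whole cycle
        self.next_value = value   # input waiting for the clock edge

    def set_input(self, d):
        self.next_value = d

    def clock_edge(self):
        # State changes only here, at the (rising) clock edge.
        self.value = self.next_value


def simulate_pc(start, cycles):
    """Return (PC value, adder output) observed in each cycle."""
    pc = Register(start)
    trace = []
    for _ in range(cycles):
        add_output = pc.value + 4      # combinational adder reads the old PC
        pc.set_input(add_output)
        trace.append((pc.value, add_output))
        pc.clock_edge()                # PC is updated at the end of the cycle
    return trace


trace = simulate_pc(8, 3)
print(trace)
```

Starting from PC = 8, the first cycle reads 8 and presents 12 at the adder output, matching the values in the slide's timing diagram.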
7
Our processor design progression
(1) Instruction fetch, execute, and operand reads from data memory all take place in a single clock cycle
(2) Instruction fetch, execute, and operand reads from data memory take place in successive clock cycles
(3) A pipelined design
8
Pieces of the processor puzzle
- Instruction fetch
- Execution
- Data memory
[Figure: instruction supply and data supply feeding the execution core]
9
Instruction fetch datapath
- Memory to hold instructions
- Register to hold the instruction memory address (PC)
- Logic to generate the next instruction address (PC + 4)
10
Execution datapath
- Focus on only a subset of all MIPS instructions
  - add, sub, and, or
  - lw, sw
  - slt
  - beq, j
- For all instructions except j, we
  - Read operands from the register file
  - Perform an ALU operation
- For all instructions except sw, beq, and j, we write a result into the register file
11
Execution datapath
- Register file block diagram
  - Read register 1, 2: source operand register numbers
  - Read data 1, 2: source operands (32 bits each)
  - Write register: destination operand register number
  - Write data: data written into the register file
  - RegWrite: when asserted, enables the writing of Write data
12
Execution datapath
- Datapath for R-type (add, sub, and, or, slt)
- R-type instruction format:

  Field:  op      rs      rt      rd      shamt   funct
  Bits:   31-26   25-21   20-16   15-11   10-6    5-0
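The R-type field layout can be checked with a few shifts and masks. The sketch below is illustrative only; `encode_r_type` and `decode_r_type` are hypothetical helper names, and the sample instruction is `add $4, $18, $30`, with funct code 100000 for add.

```python
def encode_r_type(rs, rt, rd, shamt, funct, op=0):
    """Pack R-type fields into a 32-bit word (op in bits 31-26 ... funct in bits 5-0)."""
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct


def decode_r_type(word):
    """Split a 32-bit instruction word into its R-type fields."""
    return {
        "op": (word >> 26) & 0x3F,     # bits 31-26
        "rs": (word >> 21) & 0x1F,     # bits 25-21
        "rt": (word >> 16) & 0x1F,     # bits 20-16
        "rd": (word >> 11) & 0x1F,     # bits 15-11
        "shamt": (word >> 6) & 0x1F,   # bits 10-6
        "funct": word & 0x3F,          # bits 5-0
    }


# add $4, $18, $30: rd = 4, rs = 18, rt = 30, funct = 100000 (add)
word = encode_r_type(rs=18, rt=30, rd=4, shamt=0, funct=0b100000)
fields = decode_r_type(word)
print(hex(word), fields)
```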
13
Execution datapath
- Datapath for beq instruction
- I-type instruction format:

  Field:  op      rs      rt      immediate
  Bits:   31-26   25-21   20-16   15-0

- Zero ALU output indicates if rs = rt (branch is taken/not taken)
- Branch target address is the sign-extended immediate, shifted left two positions and added to PC+4
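The branch target calculation is small enough to write out. A minimal sketch, assuming 32-bit addresses; `sign_extend16` and `branch_target` are hypothetical names chosen for this example.

```python
def sign_extend16(imm):
    """Interpret the low 16 bits of imm as a signed two's complement value."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm


def branch_target(pc, imm16):
    """Branch target = (PC + 4) + (sign-extended immediate << 2), wrapped to 32 bits."""
    return (pc + 4 + (sign_extend16(imm16) << 2)) & 0xFFFFFFFF


forward = branch_target(0x1000, 3)        # three instructions past PC+4
backward = branch_target(0x1000, 0xFFFF)  # immediate of -1: back to the branch itself
print(hex(forward), hex(backward))
```

The shift by two converts the word offset in the instruction into a byte offset, since every MIPS instruction is 4 bytes long.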
14
Data memory
- Used for lw, sw (I-type format)
- Block diagram
  - Address: memory location to be read or written
  - Read data: data out of the memory on a load
  - Write data: data into the memory on a store
  - MemRead: indicates a read operation is to be performed
  - MemWrite: indicates a write operation is to be performed
15
Execution datapath + data memory
- Datapath for lw, sw
- Address is the sign-extended immediate added to the source operand read out of the register file
- sw: data written to memory from the specified register
- lw: data written to the register file from the specified memory address
16
Putting the pieces together
- Single clock cycle for fetch, execute, and operand read from data memory
- 3 MUXes
  - Register file operand or sign-extended immediate to ALU
  - ALU or data memory output written to register file
  - PC+4 or branch target address written to PC register
17
Datapath for R-type instructions Example: add $4, $18, $30
18
Datapath for I-type ALU instructions Example: slti $7, $4, 100
19
Datapath for not taken beq instruction Example: beq $28, $13, EXIT
20
Datapath for taken beq instruction Example: beq $28, $13, EXIT
21
Datapath for load instruction Example: lw $8, 112($2)
22
Datapath for store instruction Example: sw $10, 0($3)
23
Control signals we need to generate
24
ALU operation control
- ALU control input codes from Chapter 4
- Two steps to generate the ALU control input
  - Use the opcode to distinguish R-type, lw and sw, and beq
  - If R-type, use the funct field to determine the ALU control input

  ALU control input   ALU operation      Used for
  000                 and                and
  001                 or                 or
  010                 add                add, lw, sw
  110                 subtract           sub, beq
  111                 set on less than   slt
25
ALU operation control
- Opcode used to generate a 2-bit signal called ALUOp with the following encodings
  - 00: lw or sw, perform an ALU add
  - 01: beq, perform an ALU subtract
  - 10: R-type, ALU operation is determined by the funct field

  Funct    Instruction   ALU control input
  100000   add           010
  100010   sub           110
  100100   and           000
  100101   or            001
  101010   slt           111
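The two-level decoding (ALUOp first, then funct for R-type) can be written as a small lookup. This is a sketch of the tables on these slides, not of any particular gate-level implementation; the function name `alu_control` is an assumption.

```python
# funct field -> 3-bit ALU control input, taken from the table on the slide
FUNCT_TO_ALU_CONTROL = {
    0b100000: 0b010,  # add
    0b100010: 0b110,  # sub
    0b100100: 0b000,  # and
    0b100101: 0b001,  # or
    0b101010: 0b111,  # slt
}


def alu_control(alu_op, funct):
    """Map the 2-bit ALUOp and the 6-bit funct field to the ALU control input."""
    if alu_op == 0b00:   # lw or sw: add to form the address
        return 0b010
    if alu_op == 0b01:   # beq: subtract to compare rs and rt
        return 0b110
    if alu_op == 0b10:   # R-type: the funct field decides
        return FUNCT_TO_ALU_CONTROL[funct]
    raise ValueError("ALUOp encoding not defined on the slide")


print(format(alu_control(0b10, 0b101010), "03b"))  # slt
```

Note that funct is ignored for ALUOp 00 and 01, which is why the main control unit never needs to look at the funct field.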
26
Comparing instruction fields
- Opcode, source registers, function code, and immediate fields are always in the same place
- Destination register is
  - bits 15-11 (rd) for R-type
  - bits 20-16 (rt) for lw
  - MUX to select the right one

  Bits:      31-26     25-21   20-16   15-11   10-6    5-0
  R-type:    0         rs      rt      rd      shamt   funct

  Bits:      31-26     25-21   20-16   15-0
  beq:       4         rs      rt      immediate (offset)
  lw (sw):   35 (43)   rs      rt      immediate (offset)
27
Datapath with instr fields and ALU control
28
Main control unit design
29
Truth table
[Figure: main control truth table; opcode labels on the figure: beq (4), R-type (0), lw (35), sw (43)]
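The truth table itself did not survive in this transcript. As a stand-in, the sketch below uses the signal values commonly given in textbooks for this single-cycle MIPS subset; they are an assumption here, not a transcription of the slide. `None` marks a don't-care entry, and the dictionary name `MAIN_CONTROL` is hypothetical.

```python
# opcode -> control signals (assumed textbook values; None = don't care)
MAIN_CONTROL = {
    0: {   # R-type
        "RegDst": 1, "ALUSrc": 0, "MemtoReg": 0, "RegWrite": 1,
        "MemRead": 0, "MemWrite": 0, "Branch": 0, "ALUOp": 0b10,
    },
    35: {  # lw
        "RegDst": 0, "ALUSrc": 1, "MemtoReg": 1, "RegWrite": 1,
        "MemRead": 1, "MemWrite": 0, "Branch": 0, "ALUOp": 0b00,
    },
    43: {  # sw
        "RegDst": None, "ALUSrc": 1, "MemtoReg": None, "RegWrite": 0,
        "MemRead": 0, "MemWrite": 1, "Branch": 0, "ALUOp": 0b00,
    },
    4: {   # beq
        "RegDst": None, "ALUSrc": 0, "MemtoReg": None, "RegWrite": 0,
        "MemRead": 0, "MemWrite": 0, "Branch": 1, "ALUOp": 0b01,
    },
}


def control_signals(opcode):
    """Look up the control signals for one of the supported opcodes."""
    return MAIN_CONTROL[opcode]


print(control_signals(35))
```

The don't-care entries for sw and beq follow from the earlier slides: neither instruction writes the register file, so the destination-register and write-data MUX selections have no effect.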
30
Adding support for jump instructions
- J-type format

  Field:  op (2)   target
  Bits:   31-26    25-0

- Next PC formed by shifting the 26-bit target left two bits and combining it with the 4 high-order bits of PC+4
- Now the next PC will be one of
  - PC+4
  - beq target address
  - j target address
- We need another MUX and control bit
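The three-way choice of next PC can be sketched as two chained selections. The function names and the boolean arguments `branch_taken` and `jump` are assumptions standing in for the MUX control bits.

```python
def jump_target(pc, target26):
    """Combine the 4 high-order bits of PC+4 with the 26-bit target shifted left by 2."""
    return ((pc + 4) & 0xF0000000) | ((target26 & 0x3FFFFFF) << 2)


def next_pc(pc, branch_taken, branch_addr, jump, target26):
    """Select among PC+4, the beq target address, and the j target address."""
    candidate = branch_addr if branch_taken else (pc + 4) & 0xFFFFFFFF  # first MUX
    return jump_target(pc, target26) if jump else candidate             # added MUX


sequential = next_pc(0x40001000, False, 0, False, 0)
branched = next_pc(0x40001000, True, 0x40001010, False, 0)
jumped = next_pc(0x40001000, False, 0, True, 0x100)
print(hex(sequential), hex(branched), hex(jumped))
```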
31
Adding support for jump instructions
32
Evaluation of the simple implementation
- All instructions take one clock cycle (CPI = 1)
- Assume the following worst case delays
  - Instruction memory: 4 time units
  - Data memory: 4 time units (read), 2 time units (write)
  - ALU: 4 time units
  - Adders: 3 time units
  - Register file: 2 time units (read), 1 time unit (write)
  - MUXes, sign extension, gates, and shifters: 1 time unit
- Large disparity in worst case delays among instruction types
  - R-type: 4+2+1+4+1+1 = 13 time units
  - beq: 4+2+1+4+1+1+1 = 14 time units
  - j: 4+1+1 = 6 time units
  - store: 4+2+4+2 = 12 time units
  - load: 4+2+4+4+1+1 = 16 time units
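Because every instruction must fit in one cycle, the clock period is set by the slowest path. The sketch below just replays the per-instruction sums from this slide; the lists hold the delay terms in the order the slide gives them, and the variable names are assumptions.

```python
# Delay terms per instruction type, copied from the sums on the slide (time units)
PATH_DELAYS = {
    "R-type": [4, 2, 1, 4, 1, 1],
    "beq": [4, 2, 1, 4, 1, 1, 1],
    "j": [4, 1, 1],
    "store": [4, 2, 4, 2],
    "load": [4, 2, 4, 4, 1, 1],
}

totals = {name: sum(terms) for name, terms in PATH_DELAYS.items()}
slowest = max(totals, key=totals.get)
cycle_time = totals[slowest]   # single-cycle clock must cover the worst case

print(totals)
print(slowest, cycle_time)
```

The jump needs only 6 time units but still occupies a full 16-unit cycle, which is the disparity the slide points out.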
33
Evaluation of the simple implementation
- Disparity would be worse in a real machine
  - Even slower integer instructions (e.g., multiply/divide in MIPS)
  - Floating point instructions
- Simple instructions take as long as complex ones
34
A multicycle implementation
- Instruction fetch, register file access, etc. occur in separate clock cycles
- Different instruction types take different numbers of cycles to complete
- Clock cycle time should be shorter
35
High level view of datapath
- New registers store results of each step
  - Not programmer visible!
- Hardware can be shared
  - One ALU for PC+4, branch target calculation, EA calculation, and arithmetic operations
  - One memory for instructions and data
36
Detailed multi-cycle datapath
37
Multi-cycle control
38
First two cycles for all instructions
- Instruction fetch (1st cycle)
  - Load the instruction into the IR register: IR = Memory[PC]
  - Increment the PC: PC = PC+4
- Instruction decode and register fetch (2nd cycle)
  - Read register file locations rs and rt, results into the A and B registers: A = Reg[IR[25-21]], B = Reg[IR[20-16]]
  - Calculate the branch target address and load it into ALUOut: ALUOut = PC + (sign-extend(IR[15-0]) << 2)
39
Instruction fetch IR=Mem[PC]
40
Instruction fetch PC=PC+4
41
Instruction decode and register fetch A=Reg[IR[25-21]], B=Reg[IR[20-16]]
42
Instruction decode and register fetch ALUOut = PC+(sign-extend (IR[15-0]) <<2)
43
Additional cycles for R-type
- Execution: ALUOut = A op B
- Completion: Reg[IR[15-11]] = ALUOut
44
R-type execution cycle ALUOut = A op B
45
R-type completion cycle Reg[IR[15-11]] = ALUOut
46
Additional cycles for store
- Address computation: ALUOut = A + sign-extend(IR[15-0])
- Memory access: Memory[ALUOut] = B
47
Store address computation cycle ALUOut = A + sign-extend (IR[15-0])
48
Store memory access cycle Memory[ALUOut] = B
49
Additional cycles for load
- Address computation: ALUOut = A + sign-extend(IR[15-0])
- Memory access: MDR = Memory[ALUOut]
- Read completion: Reg[IR[20-16]] = MDR
50
Load memory access cycle MDR = Memory[ALUOut]
51
Load read completion cycle Reg[IR[20-16]] = MDR
52
Additional cycle for beq
- Branch completion: if (A == B) PC = ALUOut
53
Branch completion cycle for beq if (A == B) PC = ALUOut
54
Additional cycle for j
- Jump completion: PC = PC[31-28] || (IR[25-0]<<2)
55
Jump completion cycle for j PC = PC[31-28] || (IR[25-0]<<2)
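The register-transfer steps on the preceding slides can be strung together into a small behavioral model. This is a sketch under stated assumptions, not a cycle-accurate simulator: memory is a Python dict keyed by byte address and shared by instructions and data, and the function `run_instruction` and its return convention are hypothetical.

```python
MASK = 0xFFFFFFFF


def sign_extend16(value):
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def to_signed32(value):
    return value - 0x100000000 if value & 0x80000000 else value


# funct code -> ALU operation for the R-type subset on the slides
ALU_OPS = {
    0b100000: lambda a, b: a + b,                                   # add
    0b100010: lambda a, b: a - b,                                   # sub
    0b100100: lambda a, b: a & b,                                   # and
    0b100101: lambda a, b: a | b,                                   # or
    0b101010: lambda a, b: int(to_signed32(a) < to_signed32(b)),    # slt
}


def run_instruction(pc, regs, mem):
    """Execute one instruction step by step; return (new_pc, cycles_used)."""
    # Cycle 1, instruction fetch: IR = Memory[PC]; PC = PC + 4
    ir = mem[pc]
    pc = (pc + 4) & MASK
    # Cycle 2, decode and register fetch: A, B, and the branch target in ALUOut
    op = ir >> 26
    a = regs[(ir >> 21) & 0x1F]
    b = regs[(ir >> 16) & 0x1F]
    imm = sign_extend16(ir)
    alu_out = (pc + (imm << 2)) & MASK
    if op == 0:                                       # R-type
        alu_out = ALU_OPS[ir & 0x3F](a, b) & MASK     # cycle 3: ALUOut = A op B
        regs[(ir >> 11) & 0x1F] = alu_out             # cycle 4: Reg[IR[15-11]] = ALUOut
        return pc, 4
    if op in (35, 43):                                # lw or sw
        alu_out = (a + imm) & MASK                    # cycle 3: address computation
        if op == 35:
            mdr = mem[alu_out]                        # cycle 4: MDR = Memory[ALUOut]
            regs[(ir >> 16) & 0x1F] = mdr             # cycle 5: Reg[IR[20-16]] = MDR
            return pc, 5
        mem[alu_out] = b                              # cycle 4: Memory[ALUOut] = B
        return pc, 4
    if op == 4:                                       # beq
        if a == b:
            pc = alu_out                              # cycle 3: branch completion
        return pc, 3
    if op == 2:                                       # j
        pc = (pc & 0xF0000000) | ((ir & 0x3FFFFFF) << 2)   # cycle 3: jump completion
        return pc, 3
    raise ValueError("opcode outside the subset covered here")


# lw $8, 112($2) with $2 = 16, so the effective address is 128
regs = [0] * 32
regs[2] = 16
mem = {0: (35 << 26) | (2 << 21) | (8 << 16) | 112, 128: 0xDEADBEEF}
new_pc, cycles = run_instruction(0, regs, mem)
print(hex(regs[8]), new_pc, cycles)
```

The model does not protect register $0 from writes; a fuller version would need that guard.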
56
Control logic design
- Implemented as a finite state machine
- Inputs: 6 opcode bits
- Outputs: 16 control signals
- State: 4 bits for 10 states
57
High-level view of FSM
58
Instruction fetch cycle
59
Instruction decode/register fetch cycle
60
R-type execution cycle
61
R-type completion cycle
62
Memory address computation cycle
63
Store memory access cycle
64
Load memory access cycle
65
Load read completion cycle
66
beq branch completion cycle
67
j jump completion cycle
68
Complete FSM
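The state diagram is not reproduced in this transcript, but its next-state function follows from the ten cycle names on the preceding slides. A minimal sketch: the `State` enum names and the numeric values assigned by `auto()` are assumptions for illustration, not the state encoding of the actual design.

```python
from enum import Enum, auto


class State(Enum):
    FETCH = auto()
    DECODE = auto()
    R_EXECUTE = auto()
    R_COMPLETE = auto()
    MEM_ADDRESS = auto()
    STORE_ACCESS = auto()
    LOAD_ACCESS = auto()
    LOAD_COMPLETE = auto()
    BEQ_COMPLETE = auto()
    JUMP_COMPLETE = auto()


OP_RTYPE, OP_J, OP_BEQ, OP_LW, OP_SW = 0, 2, 4, 35, 43


def next_state(state, opcode):
    """Next-state function: depends on the current state and the 6-bit opcode."""
    if state is State.FETCH:
        return State.DECODE
    if state is State.DECODE:
        return {
            OP_RTYPE: State.R_EXECUTE,
            OP_LW: State.MEM_ADDRESS,
            OP_SW: State.MEM_ADDRESS,
            OP_BEQ: State.BEQ_COMPLETE,
            OP_J: State.JUMP_COMPLETE,
        }[opcode]
    if state is State.R_EXECUTE:
        return State.R_COMPLETE
    if state is State.MEM_ADDRESS:
        return State.LOAD_ACCESS if opcode == OP_LW else State.STORE_ACCESS
    if state is State.LOAD_ACCESS:
        return State.LOAD_COMPLETE
    return State.FETCH   # every completion state returns to instruction fetch


def cycles_for(opcode):
    """Count the states visited from one instruction fetch to the next."""
    state, cycles = State.FETCH, 0
    while True:
        state = next_state(state, opcode)
        cycles += 1
        if state is State.FETCH:
            return cycles


state_bits = (len(State) - 1).bit_length()   # bits needed to number 10 states
print(state_bits, {op: cycles_for(op) for op in (OP_LW, OP_SW, OP_RTYPE, OP_BEQ, OP_J)})
```

Walking the machine gives 5 cycles for lw, 4 for sw and R-type, and 3 for beq and j, and ten states need 4 state bits.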
69
Evaluation of the multi-cycle design
- CPI calculated based on the instruction mix
- For gcc (Figure 4.54)
  - 23% loads (5 cycles each)
  - 13% stores (4 cycles each)
  - 19% branches (3 cycles each)
  - 2% jumps (3 cycles each)
  - 43% ALU (4 cycles each)
- CPI = 0.23*5 + 0.13*4 + 0.19*3 + 0.02*3 + 0.43*4 = 4.02
- Cycle time is calculated from the longest delay path, assuming the same timing delays as before
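The CPI figure is a weighted average, which is easy to recompute. The sketch below uses the gcc mix quoted on this slide; the dictionary name `GCC_MIX` is an assumption.

```python
# instruction class -> (fraction of instructions, cycles per instruction), from the slide
GCC_MIX = {
    "load": (0.23, 5),
    "store": (0.13, 4),
    "branch": (0.19, 3),
    "jump": (0.02, 3),
    "alu": (0.43, 4),
}

total_fraction = sum(fraction for fraction, _ in GCC_MIX.values())
cpi = sum(fraction * cycles for fraction, cycles in GCC_MIX.values())

print(round(total_fraction, 2), round(cpi, 2))
```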
70
Worst case datapath: branch target
- ALUOut = PC + (sign-extend(IR[15-0]) << 2)
- Delay = 7 time units (delay of the simple implementation = 16)
71
Evaluation of the multi-cycle design
- Time per instruction (TPI) of simple and multi-cycle
  - TPI(simple) = CPI(simple) x cycle time(simple) = 1 x 16 = 16
  - TPI(multi-cycle) = 4.02 x 7 = 28.1
- Simple single-cycle implementation is faster
- Multicycle with pipelining will be considerably faster than the single-cycle implementation
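The comparison comes down to two multiplications. A short check, using the CPI and cycle times quoted on these slides (variable names are assumptions):

```python
cpi_simple, cycle_time_simple = 1, 16      # single-cycle: CPI = 1, cycle = worst-case load
cpi_multi, cycle_time_multi = 4.02, 7      # multi-cycle: gcc mix CPI, 7-unit cycle

tpi_simple = cpi_simple * cycle_time_simple
tpi_multi = cpi_multi * cycle_time_multi
slowdown = tpi_multi / tpi_simple

print(tpi_simple, round(tpi_multi, 2), round(slowdown, 2))
```

With these numbers the multi-cycle design spends roughly 1.76 times as long per instruction, because the cycle time shrank by less than the CPI grew.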
72
Exceptions
- An exception is an event that causes a deviation from the normal execution of instructions
- Types of exceptions
  - Operating system call (e.g., read a file, print a file)
  - Input/output device request
  - Page fault (request for instruction/data not in memory; Ch 7)
  - Arithmetic error (overflow, underflow, etc.)
  - Undefined instruction
  - Misaligned memory access (e.g., word access to odd address)
  - Memory protection violation
  - Hardware error
  - Power failure
- An exception is not usually due to an error!
- We need to be able to restart the program at the point where the exception was detected
73
Handling exceptions
- Detect the exception
- Save enough information about the exception to handle it properly
- Save enough information about the program to resume it after the exception is handled
- Handle the exception
- Either terminate the program or resume executing it, depending on the exception type
74
Detecting exceptions
- Performed by hardware
- Overflow: determined from the opcode and the overflow output of the ALU
- Undefined instruction: determined from
  - The opcode in the main control unit
  - The function code and ALUOp in the ALU control logic
75
Detecting exceptions overflow undefined instruction
76
Saving exception information
- Performed by hardware
- We need the type of exception and the PC of the instruction when the exception occurred
- In MIPS, the Cause register holds the exception type
  - Need an encoding for each exception type
  - Need a signal from the control unit to load it into the Cause register
- The Exception Program Counter (EPC) register holds the PC
  - Need to subtract 4 from the PC register to get the correct PC (since we loaded PC+4 into the PC register during the Instruction Fetch cycle)
  - Need a signal from the control unit to load it into EPC
77
Saving exception information
78
Saving program information
- Needed in order to restart the program from the point where the exception occurred
- Performed by hardware and software
- EPC register holds the PC of the instruction that had the exception (where we will restart the program)
- The software routine that handles the exception saves any registers that it will need to the stack and restores them when it is done
79
Handling the exception
- Performed by hardware and software
- Need to transfer control to a software routine to handle the exception (exception handler)
- The exception handler runs in a privileged mode that allows it to use special instructions and access all of memory
  - Our programs run in user mode
- The hardware enables the privileged mode, loads PC with the address of the exception handler, and transitions to the Fetch state
80
Handling the exception Loading the PC with exception handler address
81
Exception handler
- Stores the values of the registers that it will need to the stack
- Handles the particular exception
  - Operating system call: calls the subroutine associated with the call
  - Underflow: sets register to zero or uses denormalized numbers
  - I/O: handles the particular I/O request, e.g., keyboard input
- Restores registers from the stack (if program is to be restarted)
- Terminates the program, or resumes execution by loading the PC with EPC and transitioning to the Instruction Fetch state
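The hardware side of exception entry and return described on these slides can be sketched as two small state updates. Everything named here is hypothetical: the cause codes, the handler address constant, and the dictionary standing in for processor state are placeholders for illustration, not the encodings of any real MIPS implementation.

```python
CAUSE_UNDEFINED_INSTRUCTION = 0   # hypothetical encoding
CAUSE_OVERFLOW = 1                # hypothetical encoding
HANDLER_ADDRESS = 0xC0000000      # hypothetical handler entry point


def enter_exception(cpu, cause):
    """What the hardware does when an exception is detected."""
    cpu["cause"] = cause                          # record the exception type
    cpu["epc"] = (cpu["pc"] - 4) & 0xFFFFFFFF     # PC already holds PC+4 after fetch
    cpu["mode"] = "privileged"
    cpu["pc"] = HANDLER_ADDRESS
    cpu["state"] = "FETCH"                        # resume the FSM at instruction fetch


def return_from_exception(cpu):
    """Resume the interrupted program at the instruction recorded in EPC."""
    cpu["pc"] = cpu["epc"]
    cpu["mode"] = "user"
    cpu["state"] = "FETCH"


# An add at address 0x400 overflows; by then the PC has been incremented to 0x404.
cpu = {"pc": 0x404, "epc": 0, "cause": 0, "mode": "user", "state": "R_EXECUTE"}
enter_exception(cpu, CAUSE_OVERFLOW)
saved = dict(cpu)
return_from_exception(cpu)
print(hex(saved["epc"]), hex(saved["pc"]), hex(cpu["pc"]))
```

Saving and restoring general registers is left to the software handler, as the slides describe, so it does not appear in this hardware-level sketch.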
82
FSM modifications
83
The Intel Pentium processor
- Introduced in 1993
- Uses a multi-cycle datapath with the following steps for integer instructions
  - Prefetch (PF): read instruction from the instruction memory
  - Decode 1 (D1): first stage of instruction decode
  - Decode 2 (D2): second stage of instruction decode
  - Execute (E): perform the ALU operation
  - Write back (WB): write the result to the register file
- Datapath usage varies by instruction type
  - Simple instructions make one pass through the datapath using state machine control
  - Complex instructions make multiple passes, reusing the same hardware elements under microcode control
84
The Intel Pentium processor
- The Pentium is a 2-way superscalar design, as two instructions can execute simultaneously
  - Ideal CPI for a 2-way superscalar is 0.5
- Conditions for superscalar execution
  - Both must be simple instructions
  - The result of the first instruction cannot be needed by the second
  - Both instructions cannot write the same register
  - The first instruction in program sequence cannot be a jump
[Figure: PF and D1 stages followed by parallel D2, E, WB stages for the U pipe and the V pipe]
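The four pairing conditions listed on this slide can be expressed as a predicate over two adjacent instructions. This is only a sketch of those four rules: the `Instr` record and its fields are hypothetical, and real Pentium pairing rules involve more detail than the slide covers.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Instr:
    """Hypothetical summary of an instruction for the pairing check."""
    simple: bool
    dest: Optional[str] = None          # register written, if any
    sources: Tuple[str, ...] = ()       # registers read
    is_jump: bool = False


def can_pair(first, second):
    """Apply the four conditions from the slide to two adjacent instructions."""
    if not (first.simple and second.simple):
        return False                    # both must be simple instructions
    if first.dest is not None and first.dest in second.sources:
        return False                    # second needs the result of the first
    if first.dest is not None and first.dest == second.dest:
        return False                    # both write the same register
    if first.is_jump:
        return False                    # first instruction cannot be a jump
    return True


independent = can_pair(Instr(True, "eax", ("ebx",)), Instr(True, "ecx", ("edx",)))
dependent = can_pair(Instr(True, "eax", ("ebx",)), Instr(True, "ecx", ("eax",)))
print(independent, dependent)
```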
85
The Intel Pentium Pro processor
- Introduced in 1995 as the successor to the Pentium
  - The basis for the Pentium II and Pentium III
- Implements a 14-cycle, 3-way superscalar integer datapath
  - Very high frequency is the goal
- Uses out-of-order execution, in that instructions may execute out of their original program order
  - Completely handled by hardware, transparently to software
  - Instructions execute as soon as their source operands become available
- Complicates exception handling
  - Some instructions before the excepting one may not have executed, while some after it may have executed
86
The Intel Pentium Pro processor
- Pentium Pro designers (and AMD designers before them) used innovative engineering to overcome the disadvantages of CISC ISAs
  - Many complex x86 instructions are internally translated by hardware into RISC-like micro-ops with state machine control
  - Achieves a very low CPI for simple integer operations, even on programs compiled for older implementations
- Combination of high frequency and low CPI gave the Pentium Pro extremely competitive integer performance versus RISC microprocessors
  - Result has been that RISC CPUs have failed to gain the desktop market share that had been expected
87
The Intel Pentium 4 processor
- 20-cycle superscalar integer pipeline
- Extremely high frequency (>3 GHz)
- Major effort to lower power dissipation
  - Clock gating: clock to a unit is turned off when the unit is not in use
  - Trace cache: caches micro-ops of previously decoded complex instructions to avoid the power-consuming decode operation