Designing a Multicycle Processor Fudan University, Software School, 2017
Processor Design is a Process Bottom-up assemble components in target technology to establish critical timing Top-down specify component behavior from high-level requirements Iterative refinement establish partial solution, expand and improve Instruction Set Architecture processor datapath control Reg. File Mux ALU Reg Mem Decoder Sequencer Cells Gates 2
A Single Cycle Datapath Instruction<31:0> nPC_sel Instruction Fetch Unit Rd Rt <21:25> <16:20> <11:15> <0:15> RegDst Clk 1 Mux Rs Rt Rs Rt Rd Imm16 RegWr ALUctr 5 5 5 MemtoReg busA Equal MemWr Rw Ra Rb busW 32 32 32-bit Registers ALU 32 busB 32 Clk 32 Mux Mux 32 WrEn Adr 1 1 Data In 32 Data Memory imm16 Extender 32 16 Clk ALUSrc ExtOp 3
The “Truth Table” for the Main Control op 6 ALU (Local) func 3 ALUop ALUctr RegDst ALUSrc : R-type ori lw sw beq jump RegDst ALUSrc MemtoReg RegWrite MemWrite Branch Jump ExtOp ALUop (Symbolic) 1 x “R-type” Or Add Subtract xxx op 00 0000 00 1101 10 0011 10 1011 00 0100 00 0010 ALUop <2> ALUop <1> ALUop <0> 4
PLA Implementation of the Main Control op<0> op<5> . <0> R-type ori lw sw beq jump RegWrite ALUSrc MemtoReg MemWrite Branch Jump RegDst ExtOp ALUop<2> ALUop<1> ALUop<0> 5
Systematic Generation of Control OPcode Control Logic / Store (PLA, ROM) Decode microinstruction Conditions Instruction Control Points Datapath In our single-cycle processor, each instruction is realized by exactly one control command or “microinstruction” in general, the controller is a finite state machine microinstruction can also control sequencing (see later) 6
The Big Picture: Where are We Now? The Five Classic Components of a Computer Today’s Topic: Designing the Datapath for the Multiple Clock Cycle Datapath Processor Input Control Memory Datapath Output 7
Abstract View of our single cycle processor Main Control op ALU control fun ALUSrc Equal ExtOp MemRd MemWr nPC_sel RegDst RegWr MemWr ALUctr Reg. Wrt Register Fetch ALU Ext Mem Access PC Next PC Instruction Fetch Result Store Data Mem looks like a FSM with PC as state 8
What’s wrong with our CPI=1 processor? Arithmetic & Logical PC Inst Memory Reg File ALU mux mux setup Load PC Inst Memory Reg File ALU Data Mem mux mux setup Critical Path Store PC Inst Memory Reg File ALU Data Mem mux Branch PC Inst Memory Reg File cmp mux Long Cycle Time All instructions take as much time as the slowest Real memory is not as nice as our idealized memory cannot always get the job done in one (short) cycle 9
Memory Access Time Physics => fast memories are small (large memories are slow) => Use a hierarchy of memories Storage Array selected word line storage cell address bit line address decoder sense amps mem. bus proc. bus memory L2 Cache Cache Processor 1 time-period 20 - 50 time-periods 2-3 time-periods 10
Reducing Cycle Time Cut combinational dependency graph and insert register / latch Do same work in two fast cycles, rather than one slow one storage element Acyclic Combinational Logic Logic (A) Logic (B) May be able to short-circuit path and remove some components for some instructions! 11
An Abstract View of the Critical Path (Load) Register file and ideal memory: The CLK input is a factor ONLY during write operation During read operation, behave as combinational logic: Address valid => Output valid after “access time.” Critical Path (Load Operation) = PC’s Clk-to-Q + Instruction Memory’s Access Time + Register File’s Access Time + ALU to Perform a 32-bit Add + Data Memory Access Time + Setup Time for Register File Write + Clock Skew Ideal Instruction Memory Instruction Rd Rs Rt Imm 5 5 5 16 Instruction Address A Data Address Clk PC Rw Ra Rb ALU 32 32 32 Ideal Data Memory 32 32-bit Registers Next Address Data In B Clk Clk 32 12
Worst Case Timing (Load) Clk Clk-to-Q PC Old Value New Value Instruction Memoey Access Time Rs, Rt, Rd, Op, Func Old Value New Value Delay through Control Logic ALUctr Old Value New Value ExtOp Old Value New Value ALUSrc Old Value New Value MemtoReg Old Value New Value Register Write Occurs RegWr Old Value New Value Register File Access Time busA Old Value New Value Delay through Extender & Mux busB Old Value New Value ALU Delay Address Old Value New Value Data Memory Access Time busW Old Value New 13
Five stages of single cycle datapath 14
Ideal and Real Memory 15
Race condition 16
How to avoid the race condition 17
How to avoid the race condition 18
Basic Limits on Cycle Time Next address logic PC <= branch ? PC + offset : PC + 4 Instruction Fetch InstructionReg <= Mem[PC] Register Access A <= R[rs] ALU operation R <= A + B Control MemRd MemWr nPC_sel RegDst RegWr MemWr ALUSrc ALUctr ExtOp Reg. File Operand Fetch Exec Instruction Fetch Mem Access PC Next PC Result Store Data Mem 19
Partitioning the CPI=1 Datapath Add registers between smallest steps Place enables on all registers Equal MemRd MemWr nPC_sel RegDst RegWr MemWr ExtOp ALUSrc ALUctr Reg. File Operand Fetch Exec Instruction Fetch Mem Access PC Next PC Result Store Data Mem 20
Example Multicycle Datapath Ext ALU Reg. File Mem Access Data Result Store RegDst RegWr MemWr MemRd S M MemToReg Equal ALUctr ALUSrc ExtOp nPC_sel E Reg File A PC IR Next PC B Instruction Fetch Operand Fetch Advantages? 21
Recall: Step-by-step Processor Design Step 1: ISA => Logical Register Transfers Step 2: Components of the Datapath Step 3: RTL + Components => Datapath Step 4: Datapath + Logical RTs => Physical RTs Step 5: Physical RTs => Control 22
Step 4: R-rtype (add, sub, . . .) Logical Register Transfer Physical Register Transfers inst Logical Register Transfers ADDU R[rd] <– R[rs] + R[rt]; PC <– PC + 4 inst Physical Register Transfers IR <– MEM[pc] ADDU A<– R[rs]; B <– R[rt] S <– A + B R[rd] <– S; PC <– PC + 4 Time E Reg. File Reg File A S Exec PC IR Next PC Inst. Mem B Mem Access M Data Mem 23
Step 4: Logical immed Logical Register Transfer Physical Register Transfers inst Logical Register Transfers ORI R[rt] <– R[rs] OR ZExt(Im16); PC <– PC + 4 inst Physical Register Transfers IR <– MEM[pc] ORI A<– R[rs]; B <– R[rt] S <– A or ZExt(Im16) R[rt] <– S; PC <– PC + 4 Time E Reg. File Reg File A S Exec PC IR Next PC Inst. Mem B Mem Access M Data Mem 24
Step 4 : Load Logical Register Transfer Physical Register Transfers inst Logical Register Transfers LW R[rt] <– MEM[R[rs] + SExt(Im16)]; PC <– PC + 4 Logical Register Transfer Physical Register Transfers inst Physical Register Transfers IR <– MEM[pc] LW A<– R[rs]; B <– R[rt] S <– A + SExt(Im16) M <– MEM[S] R[rd] <– M; PC <– PC + 4 Time E Reg. File Reg File A S Exec PC IR Next PC Inst. Mem B Mem Access M Data Mem 25
Step 4 : Store Logical Register Transfer Physical Register Transfers inst Logical Register Transfers SW MEM[R[rs] + SExt(Im16)] <– R[rt]; PC <– PC + 4 Logical Register Transfer Physical Register Transfers inst Physical Register Transfers IR <– MEM[pc] SW A<– R[rs]; B <– R[rt] S <– A + SExt(Im16); MEM[S] <– B PC <– PC + 4 Time E Reg. File Reg File A S Exec PC IR Next PC Inst. Mem B Mem Access M Data Mem 26
Step 4 : Branch Logical Register Transfer Physical Register Transfers inst Logical Register Transfers BEQ if R[rs] == R[rt] then PC <= PC + 4+SExt(Im16) || 00 else PC <= PC + 4 inst Physical Register Transfers IR <– MEM[pc] BEQ E<– (R[rs] = R[rt]) if !E then PC <– PC + 4 else PC <–PC+4+SExt(Im16)||00 Time E Reg. File Reg File S A Exec PC IR Next PC Inst. Mem B Mem Access M Data Mem 27
Alternative datapath (book): Multiple Cycle Datapath Miminizes Hardware: 1 memory, 1 adder PCWr PCWrCond PCSrc BrWr Zero IorD MemWr IRWr RegDst RegWr ALUSelA 1 Target 32 32 Mux PC Mux 1 32 Zero Rs Mux 1 Ra 32 RAdr 5 32 Rt Rb busA 32 32 Ideal Memory ALU Instruction Reg 5 Reg File 32 ALU Out Mux 1 Rt 4 Rw 32 WrAdr 32 32 32 Rd 1 Din Dout busW busB 32 32 2 ALU Control Mux 1 3 << 2 Extend Imm 16 32 ALUOp ExtOp MemtoReg ALUSelB 28
Instruction Fetch (beginning) 29
Instruction Fetch (end) 30
Instruction Fetch 31
Instruction decode/register fetch 32
Instruction decode/register fetch 33
Branch Completion 34
Instruction decode (R-type) 35
The Execution of R-type 36
The Completion of R-type 37
Instruction decode (ORi) 38
The Execution of ORi 39
The Completion of ORi 40
Instruction decode (mem op) 41
Memory address computation 42
Memory Access (Store) 43
Memory Access (Load) 44
Write back 45
The directions are defined by a next-state function Finite State Machine A finite state machine states directions on how to change states. The directions are defined by a next-state function A signal that is not explicitly asserted is deasserted, rather than don’t care. For example, the RegWrite signal should be asserted only when a register file entry is to be written; when it is not explicitly asserted, it must be deasserted. 46
Our Control Model State specifies control points for Register Transfer Transfer occurs upon exiting state (same falling edge) inputs (conditions) Next State Logic State X Register Transfer Control Points Control State Depends on Input Output Logic outputs (control points) 47
The complete FSM 48
Step 4 Control Specification for multicycle proc “instruction fetch” IR <= MEM[PC] A <= R[rs] B <= R[rt] “decode / operand fetch” LW R-type ORi SW BEQ Execute Memory Write-back S <= A fun B S <= A or ZX S <= A + SX S <= A + SX PC <= Next(PC,Equal) M <= MEM[S] MEM[S] <= B PC <= PC + 4 R[rd] <= S PC <= PC + 4 R[rt] <= S PC <= PC + 4 R[rt] <= M PC <= PC + 4 49
Traditional FSM Controller next state state op cond control points Truth Table next State control points 11 Equal 6 State 4 op datapath State 50
Step 5 (datapath + state diagram control) Translate RTs into control points Assign states Then go build the controller 51
Mapping RTs to Control Points IR <= MEM[PC] “instruction fetch” imem_rd, IRen A <= R[rs] B <= R[rt] “decode” Aen, Ben, Een LW R-type ORi SW BEQ Execute Memory Write-back S <= A fun B ALUfun, Sen S <= A or ZX S <= A + SX S <= A + SX PC <= Next(PC,Equal) M <= MEM[S] MEM[S] <= B PC <= PC + 4 R[rd] <= S PC <= PC + 4 RegDst, RegWr, PCen R[rt] <= S PC <= PC + 4 R[rt] <= M PC <= PC + 4 52
Assigning States “instruction fetch” IR <= MEM[PC] 0000 “decode” A <= R[rs] B <= R[rt] “decode” 0001 LW R-type ORi SW BEQ Execute Memory Write-back S <= A fun B S <= A or ZX S <= A + SX S <= A + SX PC <= Next(PC) 0100 0110 1000 1011 0011 M <= MEM[S] MEM[S] <= B PC <= PC + 4 1001 1100 R[rd] <= S PC <= PC + 4 R[rt] <= S PC <= PC + 4 R[rt] <= M PC <= PC + 4 0101 0111 1010 53
(Mostly) Detailed Control Specification (missing0) State Op field Eq Next IR PC Ops Exec Mem Write-Back en sel A B E Ex Sr ALU S R W M M-R Wr Dst 0000 ?????? ? 0001 1 0001 BEQ x 0011 1 1 1 0001 R-type x 0100 1 1 1 0001 ORI x 0110 1 1 1 0001 LW x 1000 1 1 1 0001 SW x 1011 1 1 1 0011 xxxxxx 0 0000 1 0 x 0 x 0011 xxxxxx 1 0000 1 1 x 0 x 0100 xxxxxx x 0101 0 1 fun 1 0101 xxxxxx x 0000 1 0 0 1 1 0110 xxxxxx x 0111 0 0 or 1 0111 xxxxxx x 0000 1 0 0 1 0 1000 xxxxxx x 1001 1 0 add 1 1001 xxxxxx x 1010 1 0 1 1010 xxxxxx x 0000 1 0 1 1 0 1011 xxxxxx x 1100 1 0 add 1 1100 xxxxxx x 0000 1 0 0 1 0 -all same in Moore machine BEQ: R: ORi: LW: SW: 54
Performance Evaluation What is the average CPI? state diagram gives CPI for each instruction type workload gives frequency of each type Type CPIi for type Frequency Arith/Logic 4 40% Load 5 30% Store 4 10% branch 3 20% CPIi x freqIi = 1.6+ 1.5+0.4+0.6 Average CPI: 4.1 55
Use this structure to construct a simple “microsequencer” Controller Design The state digrams that arise define the controller for an instruction set processor are highly structured Use this structure to construct a simple “microsequencer” Control reduces to programming this very simple device microprogramming sequencer control datapath control micro-PC microinstruction 56
Example: Jump-Counter i i 0000 i+1 Map ROM op-code zero inc load Counter 57
Using a Jump Counter 58
Our Microsequencer taken datapath control Z I L Micro-PC op-code Map ROM 59
Microprogram Control Specification µPC Taken Next IR PC Ops Exec Mem Write-Back en sel A B Ex Sr ALU S R W M M-R Wr Dst 0000 ? inc 1 0001 0 load 1 1 0011 0 zero 1 0 0011 1 zero 1 1 0100 x inc 0 1 fun 1 0101 x zero 1 0 0 1 1 0110 x inc 0 0 or 1 0111 x zero 1 0 0 1 0 1000 x inc 1 0 add 1 1001 x inc 1 0 1 1010 x zero 1 0 1 1 0 1011 x inc 1 0 add 1 1100 x zero 1 0 0 1 0 BEQ R: ORi: LW: SW: 60
Designing a Microinstruction Set 1) Start with list of control signals 2) Group signals together that make sense (vs. random): called “fields” 3) Place fields in some logical order (e.g., ALU operation & ALU operands first and microinstruction sequencing last) 4) To minimize the width, encode operations that will never be used at the same time 5) Create a symbolic legend for the microinstruction format, showing name of field values and how they set the control signals Use computers to design computers 61
Again: Alternative multicycle datapath (book) Miminizes Hardware: 1 memory, 1 adder PCWr PCWrCond PCSrc Zero IorD MemWr IRWr RegDst RegWr ALUSelA 1 32 32 Mux PC Mux 1 32 Instruction Reg Zero Rs Mux 1 Ra 32 RAdr 5 32 Rt ALU Out 32 Rb busA A 32 Ideal Memory 32 ALU 5 Reg File Mux 1 Rt 4 Rw 32 WrAdr 32 B 32 Rd 32 1 32 Din Dout Mem Data Reg busW busB 32 2 ALU Control Mux 1 3 << 2 Extend Imm 16 32 ALUOp ExtOp MemtoReg ALUSelB 62
1&2) Start with list of control signals, grouped into fields Signal name Effect when deasserted Effect when asserted ALUSelA 1st ALU operand = PC 1st ALU operand = Reg[rs] RegWrite None Reg. is written MemtoReg Reg. write data input = ALU Reg. write data input = memory RegDst Reg. dest. no. = rt Reg. dest. no. = rd MemRead None Memory at address is read, MDR <= Mem[addr] MemWrite None Memory at address is written IorD Memory address = PC Memory address = S IRWrite None IR <= Memory PCWrite None PC <= PCSource PCWriteCond None IF ALUzero then PC <= PCSource PCSource PCSource = ALU PCSource = ALUout ExtOp Zero Extended Sign Extended Single Bit Control Signal name Value Effect ALUOp 00 ALU adds 01 ALU subtracts 10 ALU does function code 11 ALU does logical OR ALUSelB 00 2nd ALU input = 4 01 2nd ALU input = Reg[rt] 10 2nd ALU input = extended,shift left 2 11 2nd ALU input = extended Multiple Bit Control 63
5) Legend of Fields and Symbolic Names Field Name Values for Field Function of Field with Specific Value 64
Controller handles non-ideal memory “instruction fetch” IR <= MEM[PC] wait ~wait A <= R[rs] B <= R[rt] “decode / operand fetch” LW R-type ORi SW BEQ Execute Memory Write-back PC <= Next(PC) S <= A fun B S <= A or ZX S <= A + SX S <= A + SX M <= MEM[S] MEM[S] <= B ~wait wait wait ~wait R[rd] <= S PC <= PC + 4 R[rt] <= S PC <= PC + 4 R[rt] <= M PC <= PC + 4 PC <= PC + 4 65
Really Simple Time-State Control instruction fetch IR <= MEM[PC] wait ~wait A <= R[rs] B <= R[rt] decode LW R-type ORi SW BEQ Execute S <= A fun B S <= A or ZX S <= A + SX S <= A + SX Memory M <= MEM[S] MEM[S] <= B wait wait R[rd] <= S PC <= PC + 4 R[rt] <= S PC <= PC + 4 R[rt] <= M PC <= PC + 4 PC <= Next(PC) write-back PC <= PC + 4 66
Time-state Control Path Local decode and control at each stage Valid IRex IR IRwb Inst. Mem IRmem WB Ctrl Dcd Ctrl Ex Ctrl Mem Ctrl Equal Reg. File Reg File A S Exec PC Next PC B Mem Access M Data Mem 67
“microprogrammed control” Overview of Control Control may be designed using one of several initial representations. The choice of sequence control, and how logic is represented, can then be determined independently; the control can then be implemented with one of several methods using a structured logic technique. Initial Representation Finite State Diagram Microprogram Sequencing Control Explicit Next State Microprogram counter Function + Dispatch ROMs Logic Representation Logic Equations Truth Tables Implementation PLA ROM Technique “hardwired control” “microprogrammed control” 68
Disadvantages of the Single Cycle Processor Summary Disadvantages of the Single Cycle Processor Long cycle time Cycle time is too long for all instructions except the Load Multiple Cycle Processor: Divide the instructions into smaller steps Execute each step (instead of the entire instruction) in one cycle Partition datapath into equal size chunks to minimize cycle time ~10 levels of logic between latches Follow same 5-step method for designing “real” processor 69
Control is specified by finite state digram Summary (cont’d) Control is specified by finite state digram Specialize state-diagrams easily captured by microsequencer simple increment & “branch” fields datapath control fields Control design reduces to Microprogramming Control is more complicated with: complex instruction sets restricted datapaths (see the book) Simple Instruction set and powerful datapath simple control 70
Exceptions Exception = unprogrammed control transfer System user program System Exception Handler Exception: return from exception normal control flow: sequential, jumps, branches, calls, returns Exception = unprogrammed control transfer system takes action to handle the exception must record the address of the offending instruction record any other information necessary to return afterwards returns control to user must save & restore user state 71
Two Types of Exceptions: Interrupts and Traps caused by external events: Network, Keyboard, Disk I/O, Timer asynchronous to program execution Most interrupts can be disabled for brief periods of time Some (like “Power Failing”) are non-maskable (NMI) may be handled between instructions simply suspend and resume user program Traps caused by internal events exceptional conditions (overflow) errors (parity) faults (non-resident page) synchronous to program execution condition must be remedied by the handler instruction may be retried or simulated and program continued or program may be aborted 72
Big Picture: user / system modes Two modes of execution (user/system) : operating system runs in privileged mode and has access to all of the resources of the computer presents “virtual resources” to each user that are more convenient that the physical resources files vs. disk sectors virtual memory vs physical memory protects each user program from others protects system from malicious users. OS is assumed to “know best”, and is trusted code, so enter system mode on exception Exceptions allow the system to taken action in response to events that occur while user program is executing: Might provide supplemental behavior (dealing with denormal floating-point numbers for instance). “Unimplemented instruction” used to emulate instructions that were not included in hardware (I.e. MicroVax) 73
Additions to MIPS ISA to support Exceptions? Exception state is kept in “coprocessor 0”. Use mfc0 read contents of these registers Every register is 32 bits, but may be only partially defined BadVAddr (register 8) register contained memory address at which memory reference occurred Status (register 12) interrupt mask and enable bits Cause (register 13) the cause of the exception Bits 5 to 2 of this register encodes the exception type (e.g undefined instruction=10 and arithmetic overflow=12) EPC (register 14) address of the affected instruction (register 14 of coprocessor 0). Control signals to write BadVAddr, Status, Cause, and EPC Be able to write exception address into PC (8000 0080hex) May have to undo PC = PC + 4, since want EPC to point to offending instruction (not its successor): PC = PC - 4 74
Example: How Control Handles Traps in our FSD Undefined Instruction–detected when no next state is defined from state 1 for the op value. We handle this exception by defining the next state value for all op values other than lw, sw, 0 (R-type), jmp, beq, and ori as new state 12. Shown symbolically using “other” to indicate that the op field does not match any of the opcodes that label arcs out of state 1. Arithmetic overflow–detected on ALU ops such as signed add Used to save PC and enter exception handler External Interrupt – flagged by asserted interrupt line Again, must save PC and enter exception handler Note: Challenge in designing control of a real machine is to handle different interactions between instructions and other exception- causing events such that control logic remains small and fast. Complex interactions makes the control unit the most challenging aspect of hardware design 75
How add traps and interrupts to state diagram? “instruction fetch” EPC <= PC - 4 PC <= exp_addr cause <= 0(INT) Handle Interrupt Pending INT IR <= MEM[PC] PC <= PC + 4 0000 overflow EPC <= PC - 4 PC <= exp_addr cause <= 12 (Ovf) undefined instruction EPC <= PC - 4 PC <= exp_addr cause <= 10 (RI) other “decode” S<= PC +SX 0001 LW BEQ R-type ORi SW If A = B then PC <= S S <= A - B S <= A fun B S <= A op ZX S <= A + SX S <= A + SX 0100 0110 1000 1011 0010 M <= MEM[S] MEM[S] <= B 1001 Interrupts are precise because user-visible state committed after exceptions flagged! 1100 R[rd] <= S R[rt] <= S R[rt] <= M 0101 0111 1010 76
Where to get more information? D. Patterson, “Microprograming,” Scientific American, March 1983. D. Patterson and D. Ditzel, “The Case for the Reduced Instruction Set Computer,” Computer Architecture News 8, 6 (October 15, 1980) 77