Shift Instructions (1/4)
  Move (shift) all the bits in a word to the left or right by a number of bits.
  Example: shift right by 8 bits
    0001 0010 0011 0100 0101 0110 0111 1000
    0000 0000 0001 0010 0011 0100 0101 0110
  Example: shift left by 8 bits
    0001 0010 0011 0100 0101 0110 0111 1000
    0011 0100 0101 0110 0111 1000 0000 0000
Shift Instructions (2/4)
  MIPS shift instruction syntax: 1 2,3,4 where
    1) operation name
    2) register that will receive value
    3) first operand (register)
    4) shift amount (constant < 32, 5 bits)
  MIPS shift instructions:
    1. sll (shift left logical): shifts left and fills emptied bits with 0s
    2. srl (shift right logical): shifts right and fills emptied bits with 0s
    3. sra (shift right arithmetic): shifts right and fills emptied bits by sign extending
Shift Instructions (3/4)
  Example: shift right arith by 8 bits (sign bit 0)
    0001 0010 0011 0100 0101 0110 0111 1000
    0000 0000 0001 0010 0011 0100 0101 0110
  Example: shift right arith by 8 bits (sign bit 1)
    1001 0010 0011 0100 0101 0110 0111 1000
    1111 1111 1001 0010 0011 0100 0101 0110
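The three shifts can be checked against the bit patterns on these slides with a small simulation. This is a minimal sketch, assuming 32-bit words; the Python function names simply mirror the MIPS mnemonics and are not part of any library.

```python
MASK = 0xFFFFFFFF  # keep every result within one 32-bit word

def sll(x, n):
    """Shift left logical: emptied low bits are filled with 0s."""
    return (x << n) & MASK

def srl(x, n):
    """Shift right logical: emptied high bits are filled with 0s."""
    return (x & MASK) >> n

def sra(x, n):
    """Shift right arithmetic: emptied high bits copy the sign bit."""
    x &= MASK
    if x & 0x80000000:        # sign bit set: value is negative
        x -= 1 << 32          # reinterpret as a negative Python int
    return (x >> n) & MASK    # >> on a negative Python int sign-extends

print(f"{srl(0x12345678, 8):08X}")
print(f"{sll(0x12345678, 8):08X}")
print(f"{sra(0x92345678, 8):08X}")
```

The inputs 0x12345678 and 0x92345678 are the hexadecimal forms of the binary words used in the examples above.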
Shift Instructions (4/4)
  Since shifting may be faster than multiplication, a good compiler usually notices when C code multiplies by a power of 2 and compiles it to a shift instruction:
    a *= 8; (in C)
  would compile to:
    sll $s0,$s0,3 (in MIPS)
  Likewise, shift right to divide by powers of 2; remember to use sra so that the sign of a negative number is preserved
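The equivalence between shifting and multiplying or dividing by a power of 2 can be sketched as follows. The helper names are hypothetical. Python's `>>` on integers behaves like an arithmetic shift, so it stands in for sra here. One caveat worth knowing: an arithmetic right shift rounds toward negative infinity, while C integer division truncates toward zero, so the two differ for negative values that are not exact multiples.

```python
def mul_pow2(a, k):
    """Multiply by 2**k with a left shift (what sll does)."""
    return a << k

def div_pow2_sra(a, k):
    """Divide by 2**k with an arithmetic right shift (what sra does)."""
    return a >> k  # Python's >> keeps the sign, like sra

def c_style_div(a, b):
    """Integer division that truncates toward zero, as C does."""
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b >= 0) else -q

print(mul_pow2(5, 3))        # same as 5 * 8
print(div_pow2_sra(-16, 2))  # exact multiple: matches C division
print(div_pow2_sra(-7, 1), c_style_div(-7, 2))  # these two differ
```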
“Shift and Add” Signed Multiplier
  Sign-extend the partial product at each stage
  Final step is a subtract
  n clock cycles
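The reason the final step is a subtract is that the top bit of an n-bit two's-complement multiplier has weight -2^(n-1). A minimal software sketch of that idea, assuming n-bit operands, is shown here; it models the arithmetic only, not the hardware registers on the slide, and the function name is hypothetical.

```python
def signed_shift_add_multiply(a, b, n=8):
    """Multiply two n-bit two's-complement integers with shift and add."""
    lo, hi = -(1 << (n - 1)), (1 << (n - 1)) - 1
    if not (lo <= a <= hi and lo <= b <= hi):
        raise ValueError("operands must fit in n bits")
    b_bits = b & ((1 << n) - 1)      # multiplier as an n-bit pattern
    product = 0
    for i in range(n):               # one iteration per clock cycle
        if (b_bits >> i) & 1:
            partial = a << i         # Python ints keep the sign (sign extension)
            if i == n - 1:
                product -= partial   # sign bit: final step is a subtract
            else:
                product += partial
    return product

print(signed_shift_add_multiply(-3, 5))
print(signed_shift_add_multiply(7, -6))
```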
Fast multiplication hardware
Chap.5 The processor: Datapath and control Jen-Chang Liu, Spring 2006
Hierarchy of Machine Structures
  Software: Application (Netscape), Operating System (Windows 98), Compiler, Assembler
  Instruction Set Architecture
  Hardware: Processor, Memory, I/O system
  Datapath & Control
  Digital Design
  Circuit Design
  Transistors
Five components of computer Input, output, memory, datapath, control
Inside Mother board (for Pentium Pro)
Chapter overview
  Chap5: datapath and control
  Chap6: pipeline
  Chap7: memory hierarchy
  Chap8: I/O
  Chap9: multiprocessor
  (Figure label: Inside CPU)
Inside Processor: datapath and control
  Datapath: brawn of the processor
    Performs the arithmetic operations
  Control: brain of the processor
    Tells the datapath, memory, and I/O what to do
  (Analogy: a production line)
Inside Pentium Processor 1/3 cache
Inside Pentium Pro Processor
Clocking methodology
  Edge-triggered clocking: the contents of the state elements (flip-flops, registers, memory) change only on the active clock edge
  (Timing diagram: clock alternating high/low, with sample state values 100 101 001 111 110 001 100)
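Edge-triggered behaviour can be sketched with a tiny state element that samples its input only when the clock goes from low to high. This assumes the rising edge is the active edge, which the slide does not state; the class name is hypothetical.

```python
class EdgeTriggeredRegister:
    """State element that updates only on a rising clock edge (assumed active edge)."""

    def __init__(self, value=0):
        self.value = value
        self._prev_clock = 0

    def tick(self, clock, data_in):
        """Present a clock level and an input; return the stored value."""
        if self._prev_clock == 0 and clock == 1:  # rising edge detected
            self.value = data_in
        self._prev_clock = clock
        return self.value

reg = EdgeTriggeredRegister(0b100)
for clock, data in [(0, 0b101), (1, 0b101), (1, 0b001), (0, 0b001), (1, 0b001)]:
    print(clock, format(reg.tick(clock, data), "03b"))
```

The input may change while the clock is held high or low, but the stored value follows it only at the next rising edge.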
Timing constraint The clock period must be long enough to allow signals to be stable
Design Target: MIPS
  The instruction set architecture (ISA) determines the implementation
  We know how to execute MIPS code by hand; how do we design a circuit to execute it?
  We design a simple implementation that includes a subset of the MIPS instructions:
    Memory-reference inst.: lw, sw
    Arithmetic-logic inst.: add, sub, and, or, slt
    Branch: beq, j
Outline of chapter 5
  Building a datapath
    Instruction fetch
    R-type instructions
    Load/store
    Branch
  Single datapath implementation
  Multiple cycle implementation
Preview: How to carry out an instruction
  4 steps to execute an instruction
  Instruction         | Instruction fetch      | Data/register read | Instruction execution (ALU) | Memory/register read/write
  add $t0, $t1, $t2   | Read inst. from memory | $t1, $t2           | $t1 + $t2                   | Write to $t0
  lw $t0, 0($a0)      | Read inst. from memory | $a0                | $a0 + 0                     | Read from memory
  beq $t0, $t1, loop  | Read inst. from memory | $t0, $t1           | $t0 - $t1                   | Write PC
Abstract view of carrying out an instruction
  Instruction fetch -> Data/register read -> Instruction execution -> Memory/register read/write
How to build datapath for MIPS ISA?
  Datapath: path to perform an instruction
  Consider each major component
  Build a datapath for each instruction class
Outline
  Building a datapath
    1. Instruction fetch
    2. R-type instructions
    3. Load/store
    4. Branch
  Build a datapath for each instruction class, then combine them
1. Instruction fetch
  Instruction memory: place to store the instructions
  PC: place to store the address of the instructions
  Adder: increment the PC to the next instruction
Instruction fetch (cont.)
  The adder always adds, therefore it needs no control lines
  (Figure: fetch datapath with elements labeled 1, 2, 3)
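The fetch step can be sketched in a few lines: read the instruction at the address held in the PC, and compute PC + 4 with an adder that does nothing else. This is a minimal sketch, assuming byte addresses and a dictionary standing in for the instruction memory; the names are hypothetical.

```python
def fetch(pc, instruction_memory):
    """Return the instruction at pc and the address of the next instruction."""
    instruction = instruction_memory[pc]  # read from instruction memory
    next_pc = pc + 4                      # dedicated adder: always adds 4
    return instruction, next_pc

# Two instruction words stored at consecutive word addresses.
imem = {0x00400000: 0x014B4820, 0x00400004: 0x8D490000}
word, pc = fetch(0x00400000, imem)
print(hex(word), hex(pc))
```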
2. R-type instruction
  R-format instructions
  Arithmetic-logic instructions
    add, sub (ex. add $t1, $t2, $t3)
    and, or
    slt
  Format: Opcode (6) | rs (5) | rt (5) | rd (5) | shamt (5) | funct (6)
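Packing and unpacking the R-format fields shows how the six fields share one 32-bit word. The sketch below assumes the standard MIPS field order (opcode, rs, rt, rd, shamt, funct, from the most significant bits down); the function names are hypothetical.

```python
def encode_r_type(rs, rt, rd, shamt, funct, opcode=0):
    """Pack the R-format fields into a 32-bit instruction word."""
    return ((opcode << 26) | (rs << 21) | (rt << 16)
            | (rd << 11) | (shamt << 6) | funct)

def decode_r_type(word):
    """Split a 32-bit instruction word into its R-format fields."""
    return {
        "opcode": (word >> 26) & 0x3F,  # bits 31-26
        "rs":     (word >> 21) & 0x1F,  # bits 25-21
        "rt":     (word >> 16) & 0x1F,  # bits 20-16
        "rd":     (word >> 11) & 0x1F,  # bits 15-11
        "shamt":  (word >> 6) & 0x1F,   # bits 10-6
        "funct":  word & 0x3F,          # bits 5-0
    }

# add $t1, $t2, $t3 is add $9, $10, $11: rs=10, rt=11, rd=9, funct=32
word = encode_r_type(rs=10, rt=11, rd=9, shamt=0, funct=32)
print(f"{word:08X}")
print(decode_r_type(word))
```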
Datapath elements for R-type inst.
  1. Read register: input is the register no. to read, output is the data
  2. Write register: inputs are the register no. to write and the data, with RegWrite=1
Datapath for R-type inst.
  Format: Opcode (6) | rs (5) | rt (5) | rd (5) | shamt (5) | funct (6)
  (Figure: R-type datapath with steps labeled 1 to 4)
3. Load/store from/to memory
  I-format
  Load/store examples
    lw $t1, offset_value($t2)
    sw $t1, offset_value($t2)
  Format: Opcode (6) | rs (5) | rt (5) | Signed offset (16)
  (Figure: memory location addressed by base register $t2 plus offset)
Datapath elements for load/store
  lw $t1, offset_value($t2)
  Register file, ALU, and data memory
  Address = base + offset
  Store -> MemWrite, Load -> MemRead
  Sign-extend the 16-bit offset field
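The address calculation for lw and sw can be sketched as a sign extension followed by an ALU add. This assumes 32-bit addresses; the function names are hypothetical.

```python
def sign_extend16(value):
    """Sign-extend a 16-bit field to a Python integer."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

def effective_address(base, offset16):
    """Base register value plus the sign-extended 16-bit offset."""
    return (base + sign_extend16(offset16)) & 0xFFFFFFFF

print(hex(effective_address(0x1000, 8)))       # positive offset
print(hex(effective_address(0x1000, 0xFFFC)))  # 0xFFFC encodes -4
```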
Datapath for load/store
  Format: Opcode (6) | rs (5) | rt (5) | Signed offset (16)
  (Figure: load/store datapath with steps labeled 1, 2, 4)
4. Branch
  I-format
  Example: beq $t1, $t2, offset
  PC-relative addressing
  Format: Opcode (6) | rs (5) | rt (5) | Signed offset (16)
Details for branch: target address calculation
  Base address for offset: PC+4
  Instructions are word-aligned: the offset is shifted left 2 bits
  Target = (PC+4) + (sign-extended 16-bit offset with 00 appended)
  Format: Opcode (6) | rs (5) | rt (5) | Immediate (16)
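The target calculation can be written out directly: sign-extend the 16-bit field, shift it left by 2, and add it to PC+4. A minimal sketch, assuming 32-bit addresses and hypothetical function names:

```python
def sign_extend16(value):
    """Sign-extend a 16-bit field to a Python integer."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

def branch_target(pc, offset16):
    """Branch target: (PC + 4) + (sign-extended offset << 2)."""
    word_offset = sign_extend16(offset16)
    return (pc + 4 + (word_offset << 2)) & 0xFFFFFFFF

print(hex(branch_target(0x00400000, 3)))       # 3 instructions past PC+4
print(hex(branch_target(0x00400010, 0xFFFF)))  # offset -1: back to the branch itself
```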
Datapath for branch
  Format: Opcode (6) | rs (5) | rt (5) | Signed offset (16)
  (Figure: branch datapath with steps labeled 1, 2, 4)
How to combine these datapaths?
  We have shown datapaths for
    Instruction fetch
    R-type instructions
    Load/store
    Branch
  How to assemble the datapaths?
  How to handle control lines?
Outline
  Building a datapath
    Instruction fetch
    R-type instructions
    Load/store
    Branch
  Single datapath implementation
  Multiple cycle implementation
Single datapath implementation
  Attempt to execute all instructions in 1 clock cycle
  No datapath resource can be used more than once per instruction
    Duplicated units: e.g. separate memories for instructions and for data
    Shared units: use a multiplexor to select the input
  (Analogy: one production line shared by add, …; lw, sw; beq, …)
1. Combine R-type and lw/sw
  R-type: Opcode (6) | rs (5) | rt (5) | rd (5) | shamt (5) | funct (6)
  lw/sw:  Opcode (6) | rs (5) | rt (5) | Signed offset (16)
R-type + load/store
2. Add the instruction fetch
3. Add the branch unit
Simple datapath and control. See Fig 5.17 (p.307)
Trace the operation of the datapath!
  Explained in 4 steps, but they all actually operate within a single clock cycle
  Quiz later!
  Steps: Instruction fetch, Data/register read, Instruction execution, Memory/register read/write
add $t1,$t2,$t3 => add $9, $10, $11 => rs = 10, rt = 11, rd = 9, funct = 32
  Step 1. Instruction fetch
  Step 2. Read source registers
  Step 3. Instruction execution
  Step 4. Write result
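The four steps of the add trace can be followed in a small simulation. This is a sketch only: it handles just the add instruction, uses a list as the register file and a dictionary as instruction memory, and all names are hypothetical. In hardware the four steps happen within one clock cycle rather than in sequence.

```python
def execute_add(pc, instruction_memory, registers):
    """Trace one add instruction through the four datapath steps."""
    # Step 1. Instruction fetch (and PC + 4)
    word = instruction_memory[pc]
    next_pc = pc + 4
    rs = (word >> 21) & 0x1F
    rt = (word >> 16) & 0x1F
    rd = (word >> 11) & 0x1F
    # Step 2. Read source registers
    a, b = registers[rs], registers[rt]
    # Step 3. Instruction execution (ALU add, 32-bit wraparound)
    result = (a + b) & 0xFFFFFFFF
    # Step 4. Write result to the destination register
    registers[rd] = result
    return next_pc

regs = [0] * 32
regs[10], regs[11] = 7, 5            # $t2 = 7, $t3 = 5
imem = {0: 0x014B4820}               # add $t1, $t2, $t3
pc = execute_add(0, imem, regs)
print(regs[9], pc)                   # $t1 and the next PC
```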
lw $t1, 0($t2) => lw $9, 0($10) => opcode = 35, rs = 10, rt = 9, offset = 0
How to combine the datapaths?
  We have shown datapaths for
    Instruction fetch
    R-type instructions
    Load/store
    Branch
  How to assemble the datapaths?
  How to handle control lines?
Simple datapath and control. See Fig 5.19 (p.360)
How to generate control?
  Inputs: opcode (6 bits) and funct (6 bits)
  Truth table look-up
  Output: control signals (10 bits)
Hierarchy of control units
  Instructions (binary representation)
    -> Main control unit
       -> ALUop (2 bits) -> ALU control unit -> ALU control signals (3 bits)
       -> Other control signals (6 1-bit)
Why multiple levels of control?
  Purpose:
    Reduce the size of the main control unit
    Potentially increase the speed of the control unit
  ALUop (2 bits) classifies the instruction; it defines 3 classes of instructions:
    R-type
    Load/store
    Branch
Design main control unit
  Instructions (binary representation)
    -> Opcode[31-26] -> Main control unit
       -> ALUop (2 bits) -> ALU control unit -> ALU control signals (3 bits)
       -> Other control signals (6 1-bit)
Main control unit Observe instruction set
See Fig 5.19 Control signal for R-format?
Create truth table for main control unit
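A truth table for the main control unit can be sketched as a look-up keyed by opcode. The signal names and values below follow the common textbook single-cycle MIPS design and are an assumption here, since the slide's own table is not reproduced in the text and its exact signal set may differ. `None` marks a don't-care entry.

```python
# Main control truth table: opcode -> control signals (assumed textbook design).
MAIN_CONTROL = {
    0b000000: dict(RegDst=1, ALUSrc=0, MemtoReg=0, RegWrite=1,
                   MemRead=0, MemWrite=0, Branch=0, ALUOp=0b10),      # R-format
    0b100011: dict(RegDst=0, ALUSrc=1, MemtoReg=1, RegWrite=1,
                   MemRead=1, MemWrite=0, Branch=0, ALUOp=0b00),      # lw
    0b101011: dict(RegDst=None, ALUSrc=1, MemtoReg=None, RegWrite=0,
                   MemRead=0, MemWrite=1, Branch=0, ALUOp=0b00),      # sw
    0b000100: dict(RegDst=None, ALUSrc=0, MemtoReg=None, RegWrite=0,
                   MemRead=0, MemWrite=0, Branch=1, ALUOp=0b01),      # beq
}

def main_control(instruction):
    """Look up the control signals from Opcode[31-26] of an instruction word."""
    opcode = (instruction >> 26) & 0x3F
    return MAIN_CONTROL[opcode]

print(main_control(0x014B4820))  # add $t1, $t2, $t3
```

Only the opcode is consulted here; the funct field is left to the ALU control unit, which is the point of the two-level design.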
Design ALU control unit
  Instructions (binary representation)
    -> Opcode[31-26] -> Main control unit
       -> ALUop (2 bits) -> ALU control unit -> ALU control signals (3 bits)
       -> Other control signals (6 1-bit)
ALU control unit
  Input 1: ALUop (2 bits)
  Input 2: Instruction[5-0] (6 bits)
  Output: ALU control (3 bits)
  See Figure 4.20
ALU control signal
  ALU control lines (1 bit + 2 bits)   Function
  0 00                                 and
  0 01                                 or
  0 10                                 add
  1 10                                 sub
  1 11                                 slt
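The ALU control unit and the ALU it drives can be sketched together. The 3-bit codes come from the table above; the mapping from ALUop and the funct field to those codes follows the usual textbook design (ALUop 00 for load/store, 01 for branch, 10 for R-type) and is an assumption here, as are the function names.

```python
MASK = 0xFFFFFFFF

def alu_control(alu_op, funct):
    """Map ALUop (2 bits) and funct (6 bits) to the 3-bit ALU control."""
    if alu_op == 0b00:      # load/store: add base and offset
        return 0b010
    if alu_op == 0b01:      # branch: subtract to compare
        return 0b110
    funct_table = {         # R-type: the funct field decides
        0b100000: 0b010,    # add
        0b100010: 0b110,    # sub
        0b100100: 0b000,    # and
        0b100101: 0b001,    # or
        0b101010: 0b111,    # slt
    }
    return funct_table[funct]

def to_signed(x):
    """Interpret a 32-bit pattern as a two's-complement integer."""
    return x - (1 << 32) if x & 0x80000000 else x

def alu(control, a, b):
    """Perform the operation selected by the 3-bit ALU control lines."""
    if control == 0b000:
        return a & b
    if control == 0b001:
        return a | b
    if control == 0b010:
        return (a + b) & MASK
    if control == 0b110:
        return (a - b) & MASK
    if control == 0b111:
        return 1 if to_signed(a) < to_signed(b) else 0
    raise ValueError("unsupported ALU control code")

print(alu(alu_control(0b10, 0b100000), 7, 5))  # R-type add
print(alu(alu_control(0b01, 0), 5, 5))         # beq: zero means equal
```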
Instruction set formats
  The instruction set determines the ALU operation
Creating the truth table
Why is a single-cycle implementation not used?
  It is inefficient. Why?
  Single-cycle implementation => the clock cycle time is the same for every instruction
  Clock cycle = longest path = load
  Other instruction classes could fit in a shorter cycle!
Performance evaluation for single-cycle implementation
  Assume the operation times
    Memory units: 2 ns
    ALU: 2 ns
    Register file: 1 ns
  Calculate the necessary time for each instruction class
Memory units: 2 ns ALU: 2ns Register file: 1 ns
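The time per instruction class can be worked out by summing the delays of the units each class uses. The unit delays are the ones given above; which units each class passes through is an assumption that follows the usual textbook analysis (and ignores multiplexors, control, and wiring delays).

```python
# Unit delays in ns, from the slide.
UNIT_NS = {"imem": 2, "reg_read": 1, "alu": 2, "dmem": 2, "reg_write": 1}

# Units on the critical path of each instruction class (assumed).
PATHS = {
    "R-type": ["imem", "reg_read", "alu", "reg_write"],
    "lw":     ["imem", "reg_read", "alu", "dmem", "reg_write"],
    "sw":     ["imem", "reg_read", "alu", "dmem"],
    "beq":    ["imem", "reg_read", "alu"],
    "j":      ["imem"],
}

def class_times(paths=PATHS, unit_ns=UNIT_NS):
    """Total delay in ns for each instruction class."""
    return {name: sum(unit_ns[u] for u in units) for name, units in paths.items()}

times = class_times()
for name, ns in times.items():
    print(f"{name}: {ns} ns")
print("single-cycle clock period:", max(times.values()), "ns")
```

Under these assumptions the load sets the clock period, so every other class waits out time it does not need.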
How to improve single-cycle datapath?
  A variable-speed clock for each instruction class
    Difficult to implement
  Multi-cycle implementation