ENG3380 Computer Organization and Architecture: “MIPS: Data Path Design Part 1”. Winter 2017, S. Areibi, School of Engineering, University of Guelph.
Topics: Introduction; Data Path Design; Register File; Memory; Arithmetic Logic Unit; MIPS ALU; Summary. With thanks to W. Stallings, Hamacher, J. Hennessy, and M. J. Irwin for lecture slide contents. Many slides are adapted from the PPT slides accompanying the textbook and from the CSE331 course.
References: (1) “Computer Organization and Architecture: Designing for Performance”, 10th edition, by William Stallings, Pearson. (2) “Computer Organization and Design: The Hardware/Software Interface”, 4th edition, by D. Patterson and J. Hennessy, Morgan Kaufmann. (3) “Computer Organization and Architecture: Themes and Variations”, 2014, by Alan Clements, CENGAGE Learning.
Introduction
Processor’s building blocks: The PC provides the instruction address. The instruction is fetched into the IR. The instruction address generator updates the PC. Control circuitry interprets the instruction and generates the control signals needed to perform it. The register file stores the operands that the ALU manipulates. The ALU performs arithmetic and logic operations, among others.
Parts of the CPU: datapath and control unit. Datapath: registers, multiplexers, adders, subtractors, and the combinational logic that performs operations on them. Control unit: generates the signals that control the datapath and accepts status signals to perform sequencing.
Datapaths: guiding principles for basic datapaths. The set of registers can be a collection of individual registers, a set of registers with common access resources (called a register file), or a combination of the two. Microoperations are implemented with one or more shared resources: buses (shared transfer paths), the Arithmetic-Logic Unit (ALU, a shared resource for arithmetic and logic microoperations), and the shifter (a shared resource for shift microoperations).
Recall: a simple bus-based datapath with four registers, an ALU, and a shifter. Each register is connected to two multiplexers that form ALU input buses A and B (the register file). Another mux chooses between the registers and a constant. The functional unit consists of the ALU and the shifter. A final mux chooses between the functional unit and external data (memory).
Register File
Register File: a simple register file with four registers. Each register is connected to two multiplexers that form ALU input buses A and B.
Hardware components: Register file A 2-port register file is needed to read the two source registers at the same time. It may be implemented using a 2-port memory.
Alternative implementation of a 2-port register file: use two single-ported memory blocks.
A conceptual view – computational instructions: Both source operands and the destination location are in the register file. [RA] and [RB] denote the values of the registers identified by addresses A and B; new [RC] denotes the result that is stored in the register identified by address C.
A conceptual view – immediate instructions: One of the source operands is the immediate value in the IR; the other is [RA], and the result is new [RC].
Behavioral Description of a Register File (32 words of 32 bits; ports: write_cntrl, src1_addr, src2_addr, dst_addr, write_data, src1_data, src2_data)
  library IEEE;
  use IEEE.std_logic_1164.all;
  use IEEE.std_logic_arith.all;
  entity regfile is
    port(write_data: in std_logic_vector(31 downto 0);
         dst_addr, src1_addr, src2_addr: in UNSIGNED(4 downto 0);
         write_cntrl: in std_logic;
         src1_data, src2_data: out std_logic_vector(31 downto 0));
  end entity regfile;
Behavioral Description of a Register File, con’t
  architecture process_behavior of regfile is
    type reg_array is array(0 to 31) of std_logic_vector(31 downto 0);
  begin
    -- The process runs whenever a read address or write_cntrl changes.
    regfile_process: process(src1_addr, src2_addr, write_cntrl)
      -- All 32 registers start at zero.
      variable data_array: reg_array := (others => X"00000000");
      variable addrofsrc1, addrofsrc2, addrofdst: integer;
    begin
      addrofsrc1 := conv_integer(src1_addr);
      addrofsrc2 := conv_integer(src2_addr);
      addrofdst := conv_integer(dst_addr);
      if write_cntrl = '1' then
        data_array(addrofdst) := write_data;
      end if;
      src1_data <= data_array(addrofsrc1) after 10 ns;
      src2_data <= data_array(addrofsrc2) after 10 ns;
    end process regfile_process;
  end architecture process_behavior;
VHDL Implementation
  library IEEE;
  use IEEE.STD_LOGIC_1164.ALL;
  use IEEE.NUMERIC_STD.ALL;
  entity register_file is
    port ( DataOut1 : out std_logic_vector(15 downto 0);
           DataOut2 : out std_logic_vector(15 downto 0);
           DataIn : in std_logic_vector(15 downto 0);
           writeEnable : in std_logic;
           ReadAddr1 : in std_logic_vector(3 downto 0);
           ReadAddr2 : in std_logic_vector(3 downto 0);
           WriteAddr : in std_logic_vector(3 downto 0);
           Clk : in std_logic );
  end register_file;
VHDL Implementation
  architecture behavioral of register_file is
    type registerFile is array(0 to 15) of std_logic_vector(15 downto 0);
    signal registers : registerFile;
  begin
    regFile : process (Clk) is
    begin
      if rising_edge(Clk) then
        -- Read A and B before bypass
        DataOut1 <= registers(to_integer(unsigned(ReadAddr1)));
        DataOut2 <= registers(to_integer(unsigned(ReadAddr2)));
        -- Write and bypass
        if writeEnable = '1' then
          registers(to_integer(unsigned(WriteAddr))) <= DataIn; -- Write
          if ReadAddr1 = WriteAddr then -- Bypass for read A
            DataOut1 <= DataIn;
          end if;
          if ReadAddr2 = WriteAddr then -- Bypass for read B
            DataOut2 <= DataIn;
          end if;
        end if;
      end if;
    end process;
  end behavioral;
Memory
Memory and I/O: Control Unit + Data Path + Memory + Input/Output = Micro-computer System.
Behavioral Description of Memory: The VHDL code below implements a single-port RAM (ports: Clk, address, we, data_i, data_o). When you synthesize this design, XST uses block RAM by default to implement the memory. If you want the memory implemented in distributed RAM, add the following to the architecture declarations, after the ram signal:
  attribute ram_style: string;
  attribute ram_style of ram : signal is "distributed";
The entity:
  library IEEE;
  use IEEE.STD_LOGIC_1164.ALL;
  entity ram_example is
    port (Clk : in std_logic;
          address : in integer range 0 to 255; -- constrained to the 256-word RAM
          we : in std_logic;
          data_i : in std_logic_vector(7 downto 0);
          data_o : out std_logic_vector(7 downto 0) );
  end ram_example;
Cont … Behavioral Description of Memory
  architecture Behavioral of ram_example is
    -- Declaration of type and signal of a 256-element RAM
    -- with each element being 8 bits wide.
    type ram_t is array (0 to 255) of std_logic_vector(7 downto 0);
    signal ram : ram_t := (others => (others => '0'));
  begin
    -- Process for read and write operation.
    process (Clk)
    begin
      if rising_edge(Clk) then
        if we = '1' then
          ram(address) <= data_i;
        end if;
        data_o <= ram(address); -- synchronous read (old data on a write)
      end if;
    end process;
  end Behavioral;
Busses
Bus-Based Transfers: A bus is a shared transfer path. It is characterized by a set of common lines: (i) data, (ii) control, and (iii) status. The control signals select a single source and one or more destinations on any clock cycle.
Simple Case: using muxes! Signals S2, S1, S0 select the source. Signals L0, L1, L2 enable loading of the registers. The single-bus version (on the right) uses one mux and one output bus. Capabilities?
Three-State Bus: Remember that three-state drivers allow multiple outputs to share a wire. Note that the small inverted triangle denotes the three-state output of the register. A bus can be constructed with three-state buffers: many three-state buffer outputs can be connected together to form one bit line of a bus, with less delay than multiplexer-based systems.
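A minimal sketch of the three-state idea in VHDL, assuming three 8-bit register outputs and one-hot enables. The entity name, port names, and widths are illustrative assumptions, not taken from the slides. Each concurrent assignment models one three-state driver; a driver that is not enabled releases the line with 'Z', and the resolved std_logic type merges the drivers.

```vhdl
library IEEE;
use IEEE.std_logic_1164.all;

-- Hypothetical example: three register outputs share one 8-bit bus.
entity tristate_bus_sketch is
  port (
    r0, r1, r2 : in  std_logic_vector(7 downto 0); -- register outputs
    en         : in  std_logic_vector(2 downto 0); -- driver enables (assumed one-hot)
    bus_out    : out std_logic_vector(7 downto 0)  -- value seen on the shared bus
  );
end entity tristate_bus_sketch;

architecture rtl of tristate_bus_sketch is
  signal shared_bus : std_logic_vector(7 downto 0);
begin
  -- One three-state driver per register: drive the bus or release it ('Z').
  shared_bus <= r0 when en(0) = '1' else (others => 'Z');
  shared_bus <= r1 when en(1) = '1' else (others => 'Z');
  shared_bus <= r2 when en(2) = '1' else (others => 'Z');

  bus_out <= shared_bus;
end architecture rtl;
```

If two enables are high at once and the registers disagree, the resolved value is 'X', which is the simulation-level picture of bus contention. Adding a fourth source only adds one more driver line, which is the "easier to expand" property.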
Same Example with 3-State: Notice that both systems in the figure have the same capability in terms of transfers. However, the 3-state bus has fewer wires and is easier to expand!
Bus An example of an interconnection network. When functional units are connected to a common bus, tri-state drivers are needed.
A 3-bus interconnection network
Memory Transfer Point to an address in Memory Read data from the Memory and Write Into Register D2, D1, D0
ALU Design
Arithmetic/Logic Unit (ALU) The ALU is a combinational circuit that performs a set of basic arithmetic and logic operations. An adder can perform addition, subtraction, … Select lines are used to determine the operation to be performed.
ALU Design using Hierarchy: The ALU will have 2 control lines (S0, S1) for operation selection and 1 control line (S2) to select logic versus arithmetic operations. Start designing in parts.
Single Stage ALU: Design a 1-bit arithmetic unit. Design a 1-bit logic unit. Combine the two units to form a 1-bit arithmetic/logic unit. Replicate it n times to form an n-bit ALU.
Arithmetic Circuit: The basic component of an arithmetic circuit is an n-bit ripple-carry adder (parallel adder). By controlling the data inputs to the parallel adder, it is possible to obtain different types of arithmetic operations (Cin is also an input). Select lines S0, S1 can be used to control input Y. Why?
Looking Inside: What functionality can I achieve if I control the ‘Y’ value fed to the n-bit adder? The B input logic can pass values such as B or B’ (see the functionality table). How do we design the B input logic?
Design of B Select Logic: Use an 8-to-1 mux (the straightforward solution). Or … use a 4-to-1 mux! Can we do better? YES: simplify the expression from the truth table using a K-map.
1-bit (Single Stage) Arithmetic Circuit The B logic is nothing but a 2-to-1 Mux instead of the 4-to-1 Mux
4-Bit Arithmetic Circuit Duplicating the one stage four times will produce a 4-bit circuit
Logic Section Design Generous number of operations
Arithmetic/Logic Unit: The logic circuit can be combined with the arithmetic circuit to produce an ALU. Selection variables S1 and S0 can be common to both circuits. A third selection variable, S2, can be used to differentiate between the logic and arithmetic operations.
One Stage Arithmetic Circuit
One Stage Logic Circuit
One Stage ALU Mux to choose Arithmetic or Logic
n-bit ALU Duplicate the one stage n times!!
Resulting Control: The one-stage ALU can provide 8 arithmetic and 4 logic operations.
How to extend the ALU to support the MIPS ISA? (1) Support the set-on-less-than instruction (slt): use subtraction to determine if (a – b) < 0, which implies a < b. (2) Support the test for equality (bne, beq): again use subtraction, since (a - b) = 0 implies a = b. (3) Add overflow detection hardware, enabled only for add, addi, sub. Immediates are sign extended outside the ALU with wiring (i.e., no logic needed).
MIPS Data Path
Arithmetic. Where we've been: abstractions; Instruction Set Architecture (ISA); assembly and machine language. What's up ahead: implementing the architecture (in VHDL). [Figure: ALU block with 32-bit inputs A and B, 4-bit m (operation) input, 32-bit result, and 1-bit zero and ovf outputs]
ALU VHDL Representation
  library IEEE;
  use IEEE.std_logic_1164.all;
  use IEEE.numeric_std.all;
  entity ALU is
    port(A, B: in std_logic_vector(31 downto 0);
         m: in std_logic_vector(3 downto 0);
         result: out std_logic_vector(31 downto 0);
         zero: out std_logic;
         ovf: out std_logic);
  end entity ALU;
  architecture process_behavior of ALU is
    -- . . .
  begin
    ALU_process: process(A, B, m)
    begin
      -- result is a signal, so it is assigned with <=
      result <= std_logic_vector(unsigned(A) + unsigned(B));
    end process ALU_process;
  end architecture process_behavior;
Design the MIPS Arithmetic Logic Unit (ALU): It must support the arithmetic/logic operations of the ISA: add, addi, addiu, addu; sub, subu; mult, multu, div, divu; sqrt; and, andi, nor, or, ori, xor, xori; beq, bne, slt, slti, sltiu, sltu. With special handling for: sign extend – addi, addiu, slti, sltiu; zero extend – andi, ori, xori; overflow detection – add, addi, sub. Tradeoffs of cost and speed are based on frequency of occurrence and hardware budget. [Figure: ALU block with inputs A and B, 4-bit m (operation) input, 32-bit result, and 1-bit zero and ovf outputs]
MIPS Arithmetic and Logic Instructions
  R-type: op, Rs, Rt, Rd, funct. I-type: op, Rs, Rt, Immed (16 bits). Bit positions marked on the formats: 31, 25, 20, 15, 5.

  Type   op      funct
  ADDI   001000  xx
  ADDIU  001001  xx
  SLTI   001010  xx
  SLTIU  001011  xx
  ANDI   001100  xx
  ORI    001101  xx
  XORI   001110  xx
  LUI    001111  xx

  Type   op      funct
  ADD    000000  100000
  ADDU   000000  100001
  SUB    000000  100010
  SUBU   000000  100011
  AND    000000  100100
  OR     000000  100101
  XOR    000000  100110
  NOR    000000  100111
  --     000000  101000
  --     000000  101001
  SLT    000000  101010
  SLTU   000000  101011
  --     000000  101100

  The low four bits of funct give m = (b3, b2, b1, b0), where b0 tells whether it's signed (0) or not (1) (i.e., whether overflow is activated), b1 tells whether it's an add (0) or subtract (1) operation, b2 tells whether it's a logic operation (1) or arithmetic operation (0), and b3 tells whether it's an immediate operation (1) or not (0) (except for slt).
Design Trick: Divide & Conquer. Break the problem into simpler problems, solve them, and glue together the solution. Example: assume the immediates have been taken care of before the ALU. We are now down to 10 operations, which can be encoded in 4 bits (hex codes): 0 add, 1 addu, 2 sub, 3 subu, 4 and, 5 or, 6 xor, 7 nor, a slt, b sltu.
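The 4-bit encoding above can be sketched as a behavioral case statement. This is a hedged illustration, not the course's ALU: the entity name alu_ops_sketch is an assumption, the zero and ovf outputs are left out, and the signed and unsigned add/sub variants compute the same bits because they differ only in overflow handling.

```vhdl
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.numeric_std.all;

-- Hypothetical behavioral model of the 10 operations selected by m.
entity alu_ops_sketch is
  port (
    A, B   : in  std_logic_vector(31 downto 0);
    m      : in  std_logic_vector(3 downto 0);
    result : out std_logic_vector(31 downto 0)
  );
end entity alu_ops_sketch;

architecture behavioral of alu_ops_sketch is
begin
  alu_process : process (A, B, m)
    variable r : std_logic_vector(31 downto 0);
  begin
    r := (others => '0');  -- default, also used for unassigned codes
    case m is
      when "0000" | "0001" =>                       -- add, addu
        r := std_logic_vector(unsigned(A) + unsigned(B));
      when "0010" | "0011" =>                       -- sub, subu
        r := std_logic_vector(unsigned(A) - unsigned(B));
      when "0100" => r := A and B;                  -- and
      when "0101" => r := A or B;                   -- or
      when "0110" => r := A xor B;                  -- xor
      when "0111" => r := A nor B;                  -- nor
      when "1010" =>                                -- slt (signed compare)
        if signed(A) < signed(B) then r(0) := '1'; end if;
      when "1011" =>                                -- sltu (unsigned compare)
        if unsigned(A) < unsigned(B) then r(0) := '1'; end if;
      when others => null;                          -- codes not in the table
    end case;
    result <= r;
  end process alu_process;
end architecture behavioral;
```

A behavioral model like this is useful as a reference to check a gate-level, bit-sliced ALU against in simulation.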
Addition & Subtraction
  Just like in grade school (carry/borrow 1s):
      0111        0111        0110
    + 0110      - 0110      - 0101
    ------      ------      ------
      1101        0001        0001
  Two's complement operations are easy: do subtraction by negating and then adding:
      0111                  0111
    - 0110    becomes     + 1010
    ------                ------
      0001                1 0001
  Overflow (result too large for the finite computer word), e.g., adding two n-bit numbers does not yield an n-bit number:
      0111
    + 0001
    ------
      1000
Building a 1-bit Binary Adder: A 1-bit full adder has inputs A, B, and carry_in and outputs S and carry_out. S = A xor B xor carry_in. carry_out = A&B | A&carry_in | B&carry_in (majority function). How can we use it to build a 32-bit adder? How can we modify it easily to build an adder/subtractor?
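The two equations translate directly into concurrent VHDL assignments. A minimal sketch; the entity and port names are illustrative assumptions:

```vhdl
library IEEE;
use IEEE.std_logic_1164.all;

-- Hypothetical 1-bit full adder built from the sum and majority equations.
entity full_adder_sketch is
  port (
    a, b, carry_in : in  std_logic;
    s, carry_out   : out std_logic
  );
end entity full_adder_sketch;

architecture dataflow of full_adder_sketch is
begin
  -- Sum is the parity of the three inputs.
  s <= a xor b xor carry_in;
  -- Carry out is 1 when at least two of the three inputs are 1.
  carry_out <= (a and b) or (a and carry_in) or (b and carry_in);
end architecture dataflow;
```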
Building 32-bit Adder: Just connect the carry-out of the least significant bit FA to the carry-in of the next least significant bit and connect . . . The first FA takes A0, B0, and c0 = carry_in and produces S0 and c1; the next takes A1, B1, and c1 and produces S1 and c2; . . . ; the last takes A31, B31, and c31 and produces S31 and c32 = carry_out. Ripple Carry Adder (RCA) advantage: simple logic, so small (low cost). Disadvantage: slow, with lots of glitching (so lots of energy consumption).
A 32-bit Ripple Carry Adder/Subtractor: Remember that 2’s complement is just: complement all the bits, then add a 1 in the least significant bit. The add/sub control (0 = add, 1 = sub) selects the B operand of each stage (B0 if control = 0, !B0 if control = 1) and also drives c0 = carry_in; the carries ripple through c1, c2, c3, . . ., c31 to c32 = carry_out. Example: 0111 - 0110 is computed as 0111 + 1001 + 1 = 1 0001.
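A hedged structural sketch of the ripple-carry adder/subtractor, written with a generate loop so the full-adder equations are repeated once per bit. The entity name, the generic N, and the signal names are assumptions made for illustration. The add_sub control both inverts B (through an xor) and supplies the carry into bit 0, which together form the two's complement of B.

```vhdl
library IEEE;
use IEEE.std_logic_1164.all;

-- Hypothetical N-bit ripple-carry adder/subtractor (N defaults to 32).
entity rca_addsub_sketch is
  generic (N : positive := 32);
  port (
    a, b      : in  std_logic_vector(N - 1 downto 0);
    add_sub   : in  std_logic;  -- 0 = add, 1 = subtract
    s         : out std_logic_vector(N - 1 downto 0);
    carry_out : out std_logic
  );
end entity rca_addsub_sketch;

architecture structural of rca_addsub_sketch is
  signal c     : std_logic_vector(N downto 0);      -- c(i) is the carry into bit i
  signal b_eff : std_logic_vector(N - 1 downto 0);  -- B, or !B when subtracting
begin
  c(0) <= add_sub;  -- the "+1" of two's complement when subtracting

  stage : for i in 0 to N - 1 generate
    b_eff(i)  <= b(i) xor add_sub;
    s(i)      <= a(i) xor b_eff(i) xor c(i);
    c(i + 1)  <= (a(i) and b_eff(i)) or (a(i) and c(i)) or (b_eff(i) and c(i));
  end generate stage;

  carry_out <= c(N);
end architecture structural;
```

With N = 4, a = 0111, b = 0110, and add_sub = 1, the stages compute 0111 + 1001 + 1, giving s = 0001 with carry_out = 1, matching the slide's example.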
Overflow Detection and Effects: Overflow means the result is too large to represent in the number of bits allocated. When adding operands with different signs, overflow cannot occur! Overflow occurs when adding two positives yields a negative; or adding two negatives gives a positive; or subtracting a negative from a positive gives a negative; or subtracting a positive from a negative gives a positive. On overflow, an exception (interrupt) occurs: control jumps to a predefined address for the exception, and the interrupted address (the address of the instruction causing the overflow) is saved for possible resumption. We don't always want to detect (interrupt on) overflow.
Notes: Recall from some earlier slides that the biggest positive number you can represent using 4 bits is 7 and the smallest negative number you can represent is negative 8. So any time your addition results in a number bigger than 7 or less than negative 8, you have an overflow. Keep in mind that whenever you add two numbers that have different signs, that is, a negative number and a positive number, overflow can NOT occur. Overflow occurs when you add two positive numbers and the sum has a negative sign, or when you add two negative numbers and the sum has a positive sign. If you spend some time, you can convince yourself that if the carry into the most significant bit is NOT the same as the carry coming out of the MSB, you have an overflow.
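The carry-in versus carry-out rule can be sketched for the 4-bit case used in the notes. This is an illustrative assumption, not the course's circuit: the entity and signal names are made up, and the two carries are recovered from two widened additions instead of from a bit-sliced adder.

```vhdl
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.numeric_std.all;

-- Hypothetical 4-bit signed adder with overflow flag.
entity overflow_sketch is
  port (
    a, b : in  std_logic_vector(3 downto 0);
    sum  : out std_logic_vector(3 downto 0);
    ovf  : out std_logic
  );
end entity overflow_sketch;

architecture rtl of overflow_sketch is
  signal low  : unsigned(3 downto 0); -- sum of bits 2..0; low(3) = carry into the MSB
  signal full : unsigned(4 downto 0); -- full sum; full(4) = carry out of the MSB
begin
  low  <= resize(unsigned(a(2 downto 0)), 4) + resize(unsigned(b(2 downto 0)), 4);
  full <= resize(unsigned(a), 5) + resize(unsigned(b), 5);

  sum <= std_logic_vector(full(3 downto 0));
  -- Overflow exactly when the carry into the MSB differs from the carry out.
  ovf <= low(3) xor full(4);
end architecture rtl;
```

For a = 0111 (7) and b = 0011 (3), the carry into the MSB is 1 and the carry out is 0, so ovf = 1 and the sum 1010 reads as -6.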
New MIPS Instructions (op codes in hex; "0 and 21" means op 0 with funct 21)
  Category                     Instr                          Op Code   Example              Meaning
  Arithmetic (R & I format)    add unsigned                   0 and 21  addu $s1, $s2, $s3   $s1 = $s2 + $s3
                               sub unsigned                   0 and 23  subu $s1, $s2, $s3   $s1 = $s2 - $s3
                               add imm. unsigned              9         addiu $s1, $s2, 6    $s1 = $s2 + 6
  Data Transfer                ld byte unsigned               24        lbu $s1, 25($s2)     $s1 = Mem($s2+25)
                               ld half unsigned               25        lhu $s1, 25($s2)     $s1 = Mem($s2+25)
  Cond. Branch (I & R format)  set on less than unsigned      0 and 2b  sltu $s1, $s2, $s3   if ($s2<$s3) $s1=1 else $s1=0
                               set on less than imm unsigned  b         sltiu $s1, $s2, 6    if ($s2<6) $s1=1 else $s1=0
  The similarity of the binary representation of related instructions simplifies the hardware design.
  Sign extend – addi, addiu, slti, sltiu
  Zero extend – andi, ori, xori
  Overflow detected – add, addi, sub
Review: MIPS Arithmetic Instructions
  R-type: op, Rs, Rt, Rd, funct. I-type: op, Rs, Rt, Immed (16 bits). Bit positions marked on the formats: 31, 25, 20, 15, 5.
  Expand immediates to 32 bits before the ALU. That leaves 10 operations, which can be encoded in the 4-bit m (operation) input (hex codes): 0 add, 1 addu, 2 sub, 3 subu, 4 and, 5 or, 6 xor, 7 nor, a slt, b sltu.

  Type   op  funct
  ADD    00  100000
  ADDU   00  100001
  SUB    00  100010
  SUBU   00  100011
  AND    00  100100
  OR     00  100101
  XOR    00  100110
  NOR    00  100111
  --     00  101000
  --     00  101001
  SLT    00  101010
  SLTU   00  101011
  --     00  101100
Review: A 32-bit Adder/Subtractor. Built out of 32 full adders (FAs), with the add/subt control driving c0 = carry_in and the carries rippling from c1 through c31 to c32 = carry_out. Each 1-bit FA computes S = A xor B xor carry_in and carry_out = A&B | A&carry_in | B&carry_in (majority function). Small but slow!
Tailoring the ALU to the MIPS ISA: (1) Also need to support the logic operations (and, nor, or, xor): these are bitwise operations (no carry operation involved), so we need a logic gate for each function and a mux to choose the output. (2) Also need to support the set-on-less-than instruction (slt): use subtraction to determine if (a – b) < 0, which implies a < b. (3) Also need to support the test for equality (bne, beq): again use subtraction, since (a - b) = 0 implies a = b. (4) Also need to add overflow detection hardware, enabled only for add, addi, sub. Immediates are sign extended outside the ALU with wiring (i.e., no logic needed).
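Extending a 16-bit immediate outside the ALU can be sketched with the numeric_std resize function, which replicates the sign bit for signed values and pads with zeros for unsigned ones. The entity name and the sign_ext select input are hypothetical, added only so one block can show both the sign-extend and zero-extend cases.

```vhdl
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.numeric_std.all;

-- Hypothetical immediate extender placed in front of the ALU's B input.
entity imm_extend_sketch is
  port (
    imm16    : in  std_logic_vector(15 downto 0);
    sign_ext : in  std_logic;  -- assumed control: 1 = sign extend, 0 = zero extend
    imm32    : out std_logic_vector(31 downto 0)
  );
end entity imm_extend_sketch;

architecture rtl of imm_extend_sketch is
begin
  -- Sign extension copies imm16(15) into bits 31..16 (wiring only);
  -- zero extension ties bits 31..16 to '0'.
  imm32 <= std_logic_vector(resize(signed(imm16), 32)) when sign_ext = '1'
           else std_logic_vector(resize(unsigned(imm16), 32));
end architecture rtl;
```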
A Simple ALU Cell with Logic Op Support: Each cell takes A, B, carry_in, and the add/subt and op controls, contains a 1-bit FA, and produces result and carry_out. Notes: The old book shows the B input to the logic gates as the output of the inverter mux (xor in our case). That way you can also get A and !B, A or !B, A xor !B (which is A xnor B), and !(A or !B) (which is !A and B), in addition to A and B, A or B, A xor B, and A nor B, by setting the add/subt control correctly. Wouldn’t it be better to pull it directly from the B input? Yes, so the design is modified from the one presented in the (old) book. This leads to simpler decoding of the m bits (in ALUlogic) into the add_subt and op control lines. How many bits does op need to be?
Modifying the ALU Cell for slt: The cell gains a less input as an additional choice for the result mux selected by op. Remember that the “slt” instruction sets a register value to 1 if $s1 < $s2 and to 0 otherwise.
Modifying the ALU for slt: First perform the subtraction $s1 - $s2 (A - B). Make the result 1 if the subtraction yields a negative result, i.e., A < B. Make the result 0 if the subtraction yields a non-negative result, i.e., A >= B. To do this, tie the most significant sum bit (the sign bit, output as set) to the low-order less input. Why?
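A hedged behavioral sketch of this slt scheme: subtract, then route the sign bit of the difference to bit 0 of the result. The entity and signal names are illustrative assumptions. Like the scheme on the slide, it does not correct for overflow in the subtraction, so it gives the wrong answer when A - B overflows (for example, a large positive A and a large negative B).

```vhdl
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.numeric_std.all;

-- Hypothetical slt datapath: result is 1 when the difference is negative.
entity slt_sketch is
  port (
    a, b   : in  std_logic_vector(31 downto 0);
    result : out std_logic_vector(31 downto 0)
  );
end entity slt_sketch;

architecture rtl of slt_sketch is
  signal diff : signed(31 downto 0);
begin
  diff <= signed(a) - signed(b);  -- wraps around on overflow

  -- Bit 0 takes the sign bit ("set"); every other result bit is 0.
  result <= (0 => diff(31), others => '0');
end architecture rtl;
```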
Modifying the ALU for Zero: First perform the subtraction. Then insert additional logic to detect when all result bits (result0 through result31) are zero. Note that zero is a 1 when the result is all zeros.
Overflow Detection: Overflow occurs when the result is too large to represent in the number of bits allocated: adding two positives yields a negative; or adding two negatives gives a positive; or subtracting a negative from a positive gives a negative; or subtracting a positive from a negative gives a positive. On your own: prove you can detect overflow by: carry into MSB xor carry out of MSB. Examples in 4 bits:
      0111   ( 7)          1100   (-4)
    + 0011   ( 3)        + 1011   (-5)
    ------               ------
      1010   (-6)          0111   ( 7)
  In the first sum the carry into the MSB is 1 and the carry out is 0; in the second the carry into the MSB is 0 and the carry out is 1.
Modifying the ALU for Overflow: Modify the most significant cell to determine the overflow output setting. Enable overflow bit setting for signed arithmetic (add, addi, sub). Notes: For slt (and slti, sltiu, and sltu): “No integer overflow exception occurs under any circumstances. The comparison is valid even if the subtraction used during the comparison overflows.” The way I read this, if the result overflows during the subtraction, no attempt is made to correct the set line to reflect that! Otherwise, you would need additional logic in front of the set line to do the correction in case of overflow, such as XORing the overflow bit with the sign bit.
But What about Performance? The critical path of an n-bit ripple-carry adder is n*CP, because each 1-bit ALU's CarryOut feeds the next stage's CarryIn (CarryOut0 to CarryIn1, CarryOut1 to CarryIn2, CarryOut2 to CarryIn3, and so on). Design trick: throw hardware at it (carry lookahead).
Multiplication: More complicated than addition. It can be accomplished via shifting and adding:
        0010   (multiplicand)
      x 1011   (multiplier)
      ------
        0010
       0010    (partial product
      0000      array)
     0010
    --------
    00010110   (product)
  A double precision product is produced. It takes more time and more area to compute.
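The shift-and-add idea can be sketched as a combinational process for the 4-bit unsigned case in the example. The entity name and widths are assumptions, and this is only an illustration of the partial product array, not the dedicated multiplier hardware MIPS uses.

```vhdl
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.numeric_std.all;

-- Hypothetical 4-bit unsigned shift-and-add multiplier (8-bit product).
entity shift_add_mult_sketch is
  port (
    multiplicand, multiplier : in  std_logic_vector(3 downto 0);
    product                  : out std_logic_vector(7 downto 0)
  );
end entity shift_add_mult_sketch;

architecture behavioral of shift_add_mult_sketch is
begin
  mult_process : process (multiplicand, multiplier)
    variable acc : unsigned(7 downto 0);
  begin
    acc := (others => '0');
    for i in 0 to 3 loop
      -- Each 1 bit of the multiplier adds the multiplicand shifted left by i.
      if multiplier(i) = '1' then
        acc := acc + shift_left(resize(unsigned(multiplicand), 8), i);
      end if;
    end loop;
    product <= std_logic_vector(acc);
  end process mult_process;
end architecture behavioral;
```

For multiplicand 0010 and multiplier 1011, the loop adds 2, 4, and 16, giving 00010110 (22), the product shown on the slide.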
MIPS Multiply Instruction: Multiply produces a double precision product: mult $s0, $s1 # hi||lo = $s0 * $s1. The low-order word of the product is left in processor register lo and the high-order word is left in register hi. Instructions mfhi rd and mflo rd are provided to move the product to (user accessible) registers in the register file. Instruction format fields: op, rs, rt, rd, shamt, funct. multu does an unsigned multiply. Notes: Both multiplies ignore overflow, so it's up to the software to check whether the product is too big to fit into 32 bits. There is no overflow if hi is 0 for multu or the replicated sign of lo for mult. Multiplies are done by fast, dedicated hardware and are much more complex (and slower) than adders. Hardware dividers are even more complex and even slower; ditto for hardware square root.
Division: Division is just a bunch of quotient digit guesses and left shifts and subtracts. [Figure: long-division layout labeling the dividend, divisor, quotient, partial remainder array, and remainder, with n-bit widths marked]
MIPS Divide Instruction: Divide generates the remainder in hi and the quotient in lo: div $s0, $s1 # lo = $s0 / $s1, hi = $s0 mod $s1. Instructions mflo rd and mfhi rd are provided to move the quotient and remainder to (user accessible) registers in the register file. Instruction format fields: op, rs, rt, rd, shamt, funct. Notes: It seems odd to me that the machine doesn’t support a double precision dividend in hi || lo, but it looks like it doesn’t. As with multiply, divide ignores overflow, so software must determine if the quotient is too large. Software must also check the divisor to avoid division by 0.
Shift Operations: Shifts move all the bits in a word left or right:
  sll $t2, $s0, 8   # $t2 = $s0 << 8 bits
  srl $t2, $s0, 8   # $t2 = $s0 >> 8 bits (zero fill)
  sra $t2, $s0, 8   # $t2 = $s0 >> 8 bits (sign fill)
  Instruction format fields: op, rs, rt, rd, shamt, funct. Notice that a 5-bit shamt field is enough to shift a 32-bit value by up to 2^5 – 1 = 31 bit positions. Logical shifts fill with zeros; arithmetic right shifts fill with the sign bit. An arithmetic shift (sra) maintains the arithmetic correctness of the shifted value (i.e., a number shifted right one bit should be ½ of its original value; a number shifted left should be 2 times its original value). The shift operation is implemented by hardware separate from the ALU, using a barrel shifter (which would take lots of gates in discrete logic, but is pretty easy to implement in VLSI).
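The three shifts can be sketched with the numeric_std shift functions: shift_right fills with zeros on an unsigned operand and with the sign bit on a signed operand. The entity and port names below are assumptions for illustration, and this behavioral form leaves the barrel-shifter structure to the synthesis tool.

```vhdl
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.numeric_std.all;

-- Hypothetical shifter showing sll, srl, and sra side by side.
entity shift_sketch is
  port (
    x       : in  std_logic_vector(31 downto 0);
    shamt   : in  std_logic_vector(4 downto 0);   -- 0 to 31 positions
    sll_out : out std_logic_vector(31 downto 0);
    srl_out : out std_logic_vector(31 downto 0);
    sra_out : out std_logic_vector(31 downto 0)
  );
end entity shift_sketch;

architecture rtl of shift_sketch is
  signal n : natural range 0 to 31;
begin
  n <= to_integer(unsigned(shamt));

  sll_out <= std_logic_vector(shift_left(unsigned(x), n));   -- zeros enter from the right
  srl_out <= std_logic_vector(shift_right(unsigned(x), n));  -- zeros enter from the left
  sra_out <= std_logic_vector(shift_right(signed(x), n));    -- sign bit enters from the left
end architecture rtl;
```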
Wrap-Up: We can build an ALU to support the MIPS ISA: we can efficiently perform subtraction using two’s complement, and we can replicate a 1-bit ALU to produce a 32-bit ALU. Important points about hardware: all of the gates are always working (concurrently); the speed of a gate is affected by the number of inputs to the gate (fan-in) and the number of gates that the output is connected to (fan-out); the speed of a circuit is affected by the speed of and number of gates in series (on the “critical path”, or the “number of levels of logic”) and by the length of the wires interconnecting the gates. Our primary focus is comprehension; however, clever changes to organization can improve performance (similar to using better algorithms in software).
End Slides