Lecture 18: Datapath Functional Units
Outline
  Comparators
  Shifters
  Multi-input Adders
  Multipliers
Comparators
  0’s detector: A = 00…000
  1’s detector: A = 11…111
  Equality comparator: A = B
  Magnitude comparator: A < B
1’s & 0’s Detectors
  1’s detector: N-input AND gate
  0’s detector: NOTs + 1’s detector (equivalent to an N-input NOR)
  An asymmetric design favors the late-arriving inputs. Here, the delay from the latest bit, A7, is a single gate.
1’s & 0’s Detectors
  A fast detector can be built from a pseudo-nMOS or dynamic NOR gate.
  This works well for words up to about 16 bits.
  For larger words, the gates can be split into 8–16-bit chunks to reduce the parasitic delay.
Equality Comparator
  Check whether each pair of bits is equal (XNOR, aka equality gate)
  Apply a 1’s detector to the bitwise equality signals
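
The detectors and the equality comparator above can be modeled at the bit level. The following is a minimal Python sketch, not a gate-level design; the function names and the explicit word width `n` are assumptions made for illustration.

```python
def ones_detect(a: int, n: int) -> bool:
    """1's detector: behaves like an N-input AND over the n bits of a."""
    mask = (1 << n) - 1
    return (a & mask) == mask


def zeros_detect(a: int, n: int) -> bool:
    """0's detector: invert every bit, then 1's-detect (an N-input NOR)."""
    mask = (1 << n) - 1
    return ones_detect(~a & mask, n)


def equal(a: int, b: int, n: int) -> bool:
    """Equality comparator: bitwise XNOR followed by a 1's detector."""
    mask = (1 << n) - 1
    xnor = ~(a ^ b) & mask
    return ones_detect(xnor, n)
```

For example, `equal(0b1011, 0b1011, 4)` is true because every XNOR output is 1.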
Magnitude Comparator
  Compute B – A and look at the sign
  B – A = B + ~A + 1
  For unsigned numbers, the carry out plays the role of the sign bit
    If there is a carry out: A ≤ B. Otherwise: A > B.
  A zero detector on the result indicates that the numbers are equal.
Signed vs. Unsigned
  For signed numbers, comparison is harder
  C: carry out
  Z: zero (all bits of B – A are 0)
  N: negative (MSB of result)
  V: overflow (inputs had different signs, output sign ≠ B)
  S: N xor V (sign of result)
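
As a rough illustration of how these flags drive a comparison, the sketch below computes B – A as B + ~A + 1 on n-bit patterns and derives C, Z, N, V, and S. The function name and the dictionary return type are assumptions for this example only.

```python
def compare_flags(a: int, b: int, n: int) -> dict:
    """Compute B - A = B + ~A + 1 on n-bit patterns and return the flags."""
    mask = (1 << n) - 1
    a &= mask
    b &= mask
    total = b + (~a & mask) + 1        # B + ~A + 1, one bit wider than n
    result = total & mask
    c = total >> n                     # C: carry out of the adder
    z = int(result == 0)               # Z: all result bits are 0
    neg = result >> (n - 1)            # N: MSB of the result
    sign_a = a >> (n - 1)
    sign_b = b >> (n - 1)
    # V: inputs had different signs and the result sign differs from B
    v = int(sign_a != sign_b and neg != sign_b)
    s = neg ^ v                        # S: true sign of B - A
    return {"C": c, "Z": z, "N": neg, "V": v, "S": s}
```

With these flags, an unsigned A ≤ B corresponds to C = 1, and a signed A > B corresponds to S = 1 (the true difference B – A is negative).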
Counters
  Features of counters:
    Resettable: counter value is reset to 0 when RESET is asserted
    Loadable: counter value is loaded with an N-bit value when LOAD is asserted
    Enabled: counter counts only on clock cycles when EN is asserted
    Reversible: counter increments or decrements based on the UP/DOWN input
    Terminal Count: TC output is asserted when the counter overflows
Binary Counter
  Asynchronous ripple-carry counter
  The falling transition of each register clocks the subsequent register.
  Asynchronous circuits introduce a whole assortment of problems.
  Delay is long.
Binary Counter
  Synchronous up/down counter with reset, load, and enable
  Cycle time is limited by the ripple-carry delay
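
A behavioral sketch of such a counter is given below. The class name, the argument names, and the priority order (reset over load over enable) are assumptions for illustration; the slides do not specify a priority.

```python
class SyncCounter:
    """Behavioral model of an N-bit synchronous up/down counter."""

    def __init__(self, n: int):
        self.mask = (1 << n) - 1
        self.q = 0

    def tick(self, reset=False, load=False, d=0, en=True, up=True) -> bool:
        """Advance one clock cycle; return TC (True when the count wraps)."""
        tc = False
        if reset:                          # assumed highest priority
            self.q = 0
        elif load:                         # load the N-bit value d
            self.q = d & self.mask
        elif en:                           # count only when enabled
            raw = self.q + (1 if up else -1)
            tc = raw < 0 or raw > self.mask
            self.q = raw & self.mask
        return tc
```

Loading 14 into a 4-bit instance and ticking twice takes the count to 15 and then wraps it to 0, with TC returned on the wrapping cycle.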
Fast Binary Counter
  The speed of the previous counter is limited by the adder.
  This limit can be overcome by dividing the counter into two or more segments.
  A 32-bit counter could be constructed from:
    a 4-bit prescaler counter
    a 28-bit counter
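
The segmented arrangement can be checked with a small simulation. This sketch assumes the upper segment increments only on the cycle where the 4-bit prescaler wraps, so it changes once every 16 cycles; the function name and defaults are illustrative.

```python
def segmented_count(cycles: int, low_bits: int = 4, high_bits: int = 28) -> int:
    """Count clock cycles with a short prescaler plus a long upper counter."""
    low_mask = (1 << low_bits) - 1
    high_mask = (1 << high_bits) - 1
    low, high = 0, 0
    for _ in range(cycles):
        wrap = low == low_mask             # prescaler terminal count
        low = (low + 1) & low_mask
        if wrap:                           # upper segment steps once per wrap
            high = (high + 1) & high_mask
    return (high << low_bits) | low
```

The concatenated value matches a plain 32-bit count, for example `segmented_count(1000) == 1000`.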
Shifters
  Logical shift: shifts the number left or right and fills with 0’s
    1011 LSR1 = 0101    1011 LSL1 = 0110
  Arithmetic shift: shifts the number left or right; a right shift sign-extends
    1011 ASR1 = 1101    1011 ASL1 = 0110
  Rotate: shifts the number left or right and fills with the lost bits
    1011 ROR1 = 1101    1011 ROL1 = 0111
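
The six shift types can be written out for an n-bit word as follows. This is a minimal Python sketch; the function names mirror the mnemonics above and are otherwise illustrative.

```python
def lsr(a: int, k: int, n: int) -> int:
    """Logical shift right: fill vacated MSBs with 0."""
    return (a & ((1 << n) - 1)) >> k


def lsl(a: int, k: int, n: int) -> int:
    """Logical shift left: fill vacated LSBs with 0."""
    return (a << k) & ((1 << n) - 1)


def asr(a: int, k: int, n: int) -> int:
    """Arithmetic shift right: fill vacated MSBs with the sign bit."""
    mask = (1 << n) - 1
    a &= mask
    result = a >> k
    if a >> (n - 1):                       # negative: set the top k bits
        result |= mask & ~(mask >> k)
    return result


def asl(a: int, k: int, n: int) -> int:
    """Arithmetic shift left: same bit pattern as a logical shift left."""
    return lsl(a, k, n)


def ror(a: int, k: int, n: int) -> int:
    """Rotate right: bits shifted out of the LSB end re-enter at the MSB end."""
    mask = (1 << n) - 1
    a &= mask
    k %= n
    return ((a >> k) | (a << (n - k))) & mask


def rol(a: int, k: int, n: int) -> int:
    """Rotate left: bits shifted out of the MSB end re-enter at the LSB end."""
    mask = (1 << n) - 1
    a &= mask
    k %= n
    return ((a << k) | (a >> (n - k))) & mask
```

Applied to `0b1011` with `k = 1` and `n = 4`, these reproduce the six results listed on the slide.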
Funnel Shifter
  A funnel shifter can do all six types of shifts.
  It creates a (2N – 1)-bit input word Z from A and/or the kill values.
  It selects an N-bit field Y from the (2N – 1)-bit input.
  Shift by k bits (0 ≤ k < N)
  Logically involves N N:1 multiplexers
Funnel Source Generator
  Rotate Right
  Logical Right
  Arithmetic Right
  Rotate Left
  Logical/Arithmetic Left
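
One way to build Z for each shift type is sketched below. The specific arrangement (place A in the low bits and select at offset k for right shifts; place A in the high bits and select at offset N – 1 – k for left shifts) is an assumption chosen because it produces the correct results; the function name and the `kind` strings are hypothetical.

```python
def funnel_shift(a: int, k: int, n: int, kind: str) -> int:
    """Build a (2N-1)-bit word Z from A, then select the N-bit field Y."""
    mask = (1 << n) - 1
    a &= mask
    low = a & (mask >> 1)                  # A[N-2:0]
    high = a >> 1                          # A[N-1:1]
    sign = a >> (n - 1)                    # A[N-1]
    if kind == "lsr":
        z, offset = a, k                   # upper N-1 bits of Z are 0
    elif kind == "asr":
        z, offset = (((mask >> 1) if sign else 0) << n) | a, k
    elif kind == "ror":
        z, offset = (low << n) | a, k
    elif kind in ("lsl", "asl"):
        z, offset = a << (n - 1), n - 1 - k    # lower N-1 bits of Z are 0
    elif kind == "rol":
        z, offset = (a << (n - 1)) | high, n - 1 - k
    else:
        raise ValueError(f"unknown shift kind: {kind}")
    return (z >> offset) & mask            # Y = Z[offset+N-1 : offset]
```

For instance, `funnel_shift(0b1011, 1, 4, "rol")` builds Z = 1011101 and selects the field 0111.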
Array Funnel Shifter
  N N-input multiplexers
  Use 1-of-N hot select signals for the shift amount
  nMOS pass-transistor design (Vt drops!)
  Example shown for N = 4
Array Funnel Shifter
  The source generator consists of a (2N – 1)-bit 5:1 multiplexer controlled by the shift type and direction.
Logarithmic Funnel Shifter
  log N stages of 2-input muxes
  No select decoding needed
32-bit Logarithmic Funnel
  The first stage performs a coarse shift right by 0, 8, 16, or 24 bits.
  The second stage performs a fine shift right by 0–7 bits.
  The mux decode block conditionally inverts k for left shifts.
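
The coarse/fine split can be sketched as two successive selections on the 63-bit funnel input Z. The function name, the `left` flag, and the 39-bit intermediate width (32 bits plus up to 7 bits of fine shift) are assumptions derived from the description above rather than taken from the slide figure.

```python
def funnel32_select(z: int, k: int, left: bool = False) -> int:
    """Select a 32-bit field from a 63-bit word Z in two mux stages."""
    if left:
        k = ~k & 31                        # conditionally invert k (31 - k)
    coarse = (k >> 3) * 8                  # first stage: 0, 8, 16, or 24
    fine = k & 7                           # second stage: 0 to 7
    stage1 = (z >> coarse) & ((1 << 39) - 1)   # keep 32 + 7 bits
    return (stage1 >> fine) & 0xFFFFFFFF
```

For a logical left shift of a 32-bit word `a`, Z can be formed as `a << 31`, and `funnel32_select(a << 31, k, left=True)` then yields `a` shifted left by `k`.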
Barrel Shifter
  Barrel shifters perform right rotations using wrap-around wires.
  Left rotations are right rotations by N – k = ~k + 1 bits (modulo N).
  Shifts are rotations with the end bits masked off.
  Barrel shifter forms:
    Array
    Logarithmic: better for large shifts
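
The three statements above (right rotation as the primitive, left rotation as a right rotation by N – k, and shifts as masked rotations) can be sketched as follows. The function names are illustrative.

```python
def rotate_right(a: int, k: int, n: int) -> int:
    """Right rotation with wrap-around."""
    mask = (1 << n) - 1
    a &= mask
    k %= n
    return ((a >> k) | (a << (n - k))) & mask


def rotate_left(a: int, k: int, n: int) -> int:
    """Left rotation expressed as a right rotation by N - k = ~k + 1 (mod N)."""
    return rotate_right(a, (~k + 1) % n, n)


def shift_right(a: int, k: int, n: int) -> int:
    """Logical right shift: rotate right, then mask off the top k bits."""
    mask = (1 << n) - 1
    return rotate_right(a, k, n) & (mask >> k)


def shift_left(a: int, k: int, n: int) -> int:
    """Logical left shift: rotate left, then mask off the bottom k bits."""
    mask = (1 << n) - 1
    return rotate_left(a, k, n) & (mask << k) & mask
```

In Python, `~k + 1` equals `-k`, so the modulo reduces it to N – k for 0 < k < N and to 0 for k = 0.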
Logarithmic Barrel Shifter
  Right rotate only
  Right/Left Shift & Rotate
  Right/Left shift
32-bit Logarithmic Barrel
  Datapath never wider than 32 bits
  First stage preshifts by 1 to handle left shifts
Multi-input Adders
  Suppose we want to add k N-bit words
    Ex: 0001 + 0111 + 1101 + 0010 = 10111
  Straightforward solution: k – 1 N-bit CPAs
    Large and slow
Carry Save Addition
  A full adder sums 3 inputs and produces 2 outputs
    The carry output has twice the weight of the sum output (X + Y + Z = S + 2C)
  N full adders in parallel are called a carry-save adder (CSA)
    It produces N sums and N carry outs
CSA Application
  Use k – 2 stages of CSAs
    Keep the result in carry-save redundant form
  A final CPA computes the actual result
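
A word-level sketch of carry-save addition follows. It models each CSA as N full adders operating in parallel on whole integers; the function names and the returned stage count are illustrative additions.

```python
def carry_save_add(x: int, y: int, z: int):
    """N full adders in parallel: returns (S, C) with x + y + z == S + 2*C."""
    s = x ^ y ^ z                          # per-bit sum outputs
    c = (x & y) | (x & z) | (y & z)        # per-bit carry outputs
    return s, c


def add_words(words):
    """Add k words with k-2 CSA stages and one final CPA.

    Returns (total, number_of_csa_stages).
    """
    words = list(words)
    if len(words) < 3:
        return sum(words), 0
    s, c = carry_save_add(words[0], words[1], words[2])
    stages = 1
    for w in words[3:]:
        # carries have twice the weight, so shift them left by one column
        s, c = carry_save_add(s, c << 1, w)
        stages += 1
    return s + (c << 1), stages            # final carry-propagate addition
```

Running `add_words([0b0001, 0b0111, 0b1101, 0b0010])` uses two CSA stages and returns a total of `0b10111`, matching the four-word example.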
Multiplication
  Multiplication is less common than addition.
  A multiplier is essential for:
    Microprocessors
    Digital signal processors
    Graphics engines
Multiplication
  Example: M × N-bit multiplication
    Produce N M-bit partial products (PPs)
    Sum these to produce an (M + N)-bit product
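
The partial-product view can be made concrete with a short sketch for unsigned operands. The function names are illustrative; each partial product is the multiplicand ANDed with one multiplier bit and shifted into its column.

```python
def partial_products(y: int, x: int, m: int, n: int):
    """Return the N partial products of an M-bit Y and an N-bit X (unsigned)."""
    y &= (1 << m) - 1
    return [(y if (x >> i) & 1 else 0) << i for i in range(n)]


def multiply(y: int, x: int, m: int, n: int) -> int:
    """Sum the partial products to form the (M+N)-bit product."""
    return sum(partial_products(y, x, m, n))
```

For Y = 1100 and X = 0101, the partial products are 1100, 0, 110000, and 0, which sum to 111100 (60).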
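General Form
  Multiplicand: $Y = (y_{M-1}, y_{M-2}, \ldots, y_1, y_0)$
  Multiplier: $X = (x_{N-1}, x_{N-2}, \ldots, x_1, x_0)$
  Product: $P = \left(\sum_{j=0}^{M-1} y_j 2^j\right)\left(\sum_{i=0}^{N-1} x_i 2^i\right) = \sum_{i=0}^{N-1} \sum_{j=0}^{M-1} x_i y_j 2^{i+j}$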
Dot Diagram
  Large multiplications are illustrated using dot diagrams.
  Each dot represents a bit.
Multiplier design
  The choice of design depends on:
    Latency
    Throughput
    Energy
    Area
    Design complexity
  Simple approach: use an (M + 1)-bit carry-propagate adder (CPA) to add the first two partial products, then another CPA to add the third partial product to the running sum, and so forth
    Requires N – 1 CPAs and is slow
  Better approach: use arrays
Array Multiplier
  Fast multipliers use carry-save adders (CSAs).
  Example: a 4 × 4 array multiplier for unsigned numbers, built from an array of CSAs.
  Each cell contains:
    2-input AND gate
    Full adder (CSA)
  First row: converts the first PP into carry-save (CS) form
  Other rows: use a CSA to add a PP to the CS result of the previous row and generate a new CS result
  The N LSB outputs come directly from the sum outputs of the CSAs.
  The MSB outputs arrive in CS form and require an M-bit CPA to convert them into binary form.
  (CPA: carry-propagate adder; CSA: carry-save adder)
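
The row-by-row accumulation can be modeled at word level. This sketch keeps the running result as a (sum, carry) pair and adds one partial product per row; it does not model individual cells or the point at which LSB outputs leave the array. The function names are illustrative.

```python
def csa(a: int, b: int, c: int):
    """Carry-save adder on whole words: a + b + c == s + carry."""
    s = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1   # carries move up one column
    return s, carry


def array_multiply(y: int, x: int, m: int, n: int) -> int:
    """Unsigned multiply: one CSA row per partial product, then a final CPA."""
    y &= (1 << m) - 1
    s, c = 0, 0                            # first row adds its PP to a pair of 0s
    for i in range(n):
        pp = (y if (x >> i) & 1 else 0) << i     # AND gates form the PP
        s, c = csa(s, c, pp)               # result stays in carry-save form
    return s + c                           # CPA converts to binary
```

The sum and carry words are only combined once, at the end, which is where the single carry-propagate addition occurs.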
Rectangular Array
  Squash the array to fit a rectangular floorplan (the parallelogram shape wastes space).
Improvement
  The first row of CSAs adds the first partial product to a pair of 0s (Sin = 0 and Cin = 0).
    This leads to a regular structure but is inefficient.
  The first row of CSAs can instead be used to add the first three partial products together.
    This reduces the number of rows by two and reduces the adder propagation delay.
  Another improvement: replace the bottom row with a faster CPA, such as a lookahead or tree adder.
  The critical path of an array multiplier involves N – 2 CSAs and a CPA.
Two’s Complement Array Multiplication
  The MSB of a two’s complement number has a negative weight.
  Two of the PPs have negative weight and should be subtracted.
  Subtraction is handled by taking the two’s complement of the terms to be subtracted, i.e., inverting the bits and adding one.
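
The negative-weight bookkeeping can be checked numerically. The sketch below assumes both operands are n bits wide and groups the product terms into those with positive weight and the two groups with negative weight, then subtracts each negative group by inverting and adding one. It illustrates the arithmetic only, not the layout of a particular array; the function name is hypothetical.

```python
def signed_multiply(x: int, y: int, n: int) -> int:
    """Multiply two n-bit two's complement patterns; return the 2n-bit pattern."""
    width = 2 * n
    mask = (1 << width) - 1
    xb = [(x >> i) & 1 for i in range(n)]
    yb = [(y >> j) & 1 for j in range(n)]
    # Positive terms: all non-MSB bit products, plus the MSB-by-MSB product.
    pos = sum((xb[i] & yb[j]) << (i + j)
              for i in range(n - 1) for j in range(n - 1))
    pos += (xb[n - 1] & yb[n - 1]) << (2 * n - 2)
    # Negative terms: exactly one factor is an MSB.
    neg_x = sum((xb[n - 1] & yb[j]) << (j + n - 1) for j in range(n - 1))
    neg_y = sum((xb[i] & yb[n - 1]) << (i + n - 1) for i in range(n - 1))
    # Subtract by inverting the bits and adding one.
    total = pos + ((~neg_x & mask) + 1) + ((~neg_y & mask) + 1)
    return total & mask
```

For example, multiplying the 4-bit patterns 1111 and 1111 (both –1) yields the 8-bit pattern 00000001.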
Baugh-Wooley multiplier algorithm
  The upper parallelogram: multiplication of all but the MSBs of the inputs
  Next row: product of the MSBs (X5Y5)
  Next two pairs of rows: the inversions of the terms to be subtracted
    Each term has implicit leading and trailing zeros, which are inverted to leading and trailing ones.
    Extra ones must be added in the least significant column when taking the two’s complement.
Modified Baugh-Wooley multiplier
  Reduces the number of partial products by:
    precomputing the sums of the constant ones
    pushing some of the terms upward into extra columns
Modified Baugh-Wooley multiplier, squashed into a rectangle
  Almost identical to the unsigned multiplier
    AND gates are replaced by NAND gates in the hatched cells.
    1s are added in place of 0s at two of the unused inputs.
  Signed and unsigned arrays are so similar that a single array can be used for both.
    XOR gates conditionally invert some of the terms depending on the mode.
Fewer Partial Products
  An array multiplier requires N partial products.
  If we looked at groups of r bits, we could form N/r partial products.
    Faster and smaller?
    Called radix-2^r encoding
  Ex: r = 2: look at pairs of bits and form partial products of
    00    0
    01    Y
    10    2Y (simple shift)
    11    3Y (hard multiple, requiring a slow carry-propagate addition of Y + 2Y)
  The first three are easy, but 3Y requires an adder.
Booth Encoding
  Observe that:
    3Y = 4Y – Y
    2Y = 4Y – 2Y
  4Y in a radix-4 multiplier array is equivalent to Y in the next row
Booth Encoding
  Instead of 3Y, try –Y, then increment the next partial product to add 4Y
  Similarly, for 2Y, try –2Y + 4Y in the next partial product
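
The recoding can be sketched for an unsigned multiplier as follows. Each pair of bits, together with the MSB of the previous pair (which carries the "+4Y" increment), selects a digit from {–2, –1, 0, 1, 2}. The function names, and the extra top group added to absorb the final increment for unsigned inputs, are assumptions of this sketch.

```python
def booth_radix4_digits(x: int, n: int):
    """Recode an unsigned n-bit multiplier into radix-4 digits, LSB first."""
    digits = []
    prev = 0                               # MSB of the previous pair
    for i in range(0, n + 1, 2):           # last group absorbs the final +4Y
        b0 = (x >> i) & 1
        b1 = (x >> (i + 1)) & 1
        # pair 11 -> -1 and pair 10 -> -2; the deferred +4 arrives as prev
        digits.append(prev + b0 - 2 * b1)
        prev = b1
    return digits


def booth_multiply(x: int, y: int, n: int) -> int:
    """Sum the Booth-selected partial products (each is 0, +-Y, or +-2Y)."""
    return sum(d * y * 4 ** idx
               for idx, d in enumerate(booth_radix4_digits(x, n)))
```

For x = 0111 (7), the digits are –1, 2, 0, so the product is –Y + 2Y·4 = 7Y without ever forming 3Y.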
Example
Booth Hardware
  The Booth encoder generates control lines for each PP
  Booth selectors choose the PP bits
Sign Extension
  Partial products can be negative
    They require sign extension, which is cumbersome
    High fanout on the most significant bit
Simplified Sign Ext.
  Sign bits are either all 0’s or all 1’s
    Note that all 0’s equals all 1’s plus 1 in the proper column
    Use this to reduce loading on the MSB
Even Simpler Sign Ext.
  No need to add all the 1’s in hardware
    Precompute the answer!
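
The identity behind this trick can be verified with a short sketch. It assumes a partial product whose sign bit sits in column p and is extended up to a result width W: replicating the sign bit across columns p through W – 1 is equivalent, modulo 2^W, to adding a string of 1s in those columns plus the inverted sign bit in column p. The function names are hypothetical.

```python
def ones_from(p: int, width: int) -> int:
    """A word with 1s in columns p through width-1."""
    return ((1 << width) - 1) & ~((1 << p) - 1)


def extend_direct(pp: int, p: int, width: int) -> int:
    """Replicate the sign bit (column p) across all higher columns."""
    s = (pp >> p) & 1
    low = pp & ((1 << p) - 1)
    return low | (ones_from(p, width) if s else 0)


def extend_trick(pp: int, p: int, width: int) -> int:
    """Add all 1s, plus 1 in column p when the sign is 0 (inverted sign bit)."""
    s = (pp >> p) & 1
    low = pp & ((1 << p) - 1)
    return (low + ((s ^ 1) << p) + ones_from(p, width)) & ((1 << width) - 1)


def precomputed_constant(positions, width: int) -> int:
    """Sum the constant strings of 1s for several PPs ahead of time."""
    return sum(ones_from(p, width) for p in positions) & ((1 << width) - 1)
```

Because the strings of 1s do not depend on the data, their sum across all partial products is a single constant, leaving only one inverted sign bit per partial product to be added in hardware.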
Advanced Multiplication
  Signed vs. unsigned inputs
  Higher radix Booth encoding
  Array vs. tree CSA networks