CprE / ComS 583 Reconfigurable Computing

CprE / ComS 583 Reconfigurable Computing
Prof. Joseph Zambreno Department of Electrical and Computer Engineering Iowa State University Lecture #5 – FPGA Arithmetic

CprE 583 – Reconfigurable Computing
Quick Points HW #1 due Thursday at 12:00pm Any comments? September 4, 2007 CprE 583 – Reconfigurable Computing

Recap Cluster size of N = [6-8] is good, K = [4-5] September 4, 2007 CprE 583 – Reconfigurable Computing

LUT Mapping Techniques
As noted earlier, the decomposition of the network into gates can also affect the mapping. In the slide below, note that the topmost network is mapped naively into 4 LUTs (again K=4). The decompositions at the bottom show how different (smaller) mappings can be found, depending on how the OR gate is decomposed; the decomposition on the right does not allow a mapping into 2 LUTs. September 4, 2007 CprE 583 – Reconfigurable Computing

LUT Mapping Techniques (cont.)
Node replication in the network can also simplify the mapping; the network on the left cannot be mapped into fewer than 3 4-input LUTs; whereas the network on the right can fit in 2 LUTs. September 4, 2007 CprE 583 – Reconfigurable Computing

LUT Mapping Techniques (cont.)
We must also take advantage of reconvergence in the network; in the figure below, the mapping on the left does not look far back enough (towards the inputs) to find the superior mapping indicated on the right September 4, 2007 CprE 583 – Reconfigurable Computing

Outline Recap Motivation Carry / Cascade Logic Addition Ripple Carry Carry Bypass Carry Select Carry Lookahead Basic Multipliers September 4, 2007 CprE 583 – Reconfigurable Computing

Motivation Traditional microprocessors, DSPs, etc. don’t use LUTs Instead use a w-bit Arithmetic and Logic Unit (ALU) Carry connections are hard-wired No switches, no stubs, short wires (2) (1) AND2 OR2 XOR2 A B Cin ALU A B Out Op (1, 2) 2-LUT A B Out (1) 3-LUT 3-LUT Sum As carry connection is fast in ALU. Carry signals just ripple through and complete in less time. The carry path doesn't have extra switches, stubs etc which would add to the RC delay. Note that we are talking about the latency advantage here of the ALU, not computational density. Cout / Cin A B (2) ADD SUB CMP 3-LUT 3-LUT Sum Cout September 4, 2007 CprE 583 – Reconfigurable Computing

Adder Delays Assuming a ripple-carry adder: 32-bit ALU delay – 6ns 32-cascaded 4-LUTs – 32 x 2.5ns = 80ns Motivates extra hardware to accelerate carry operations 4-LUT delay Compare: 32-bit ALU (0.6λ) 2.0 ns Logic delay 16 ns Area optimized 0.5 ns Single channel delay 6 ns Delay optimized 2.5 ns per bit Here are some sample data points. The 32 bit ALU, by avoiding the programmable interconnect, can have delay as low as 16ns and 6ns. And for the LUT based design, the delay is high, motivating extra dedicated arithmetic hardware support. People have tried various ways to improve a pure LUT based computing structure, as will be explained in the next few slides. CprE 583 – Reconfigurable Computing September 4, 2007

Altera Flex 8000 Carry Chain
The first example here is carry and cascade support from Altera. Notice the extra carry-in and cascade-in inputs. The carry chain provides a very fast (less than 0.5ns) carry-forward function between LEs (logic elements). (Altera FPGA is organized in LABs (logic array block), each LAB has eight vertically stacked LEs) So, the carry-in signal from a lower-order bit moves forward into the higher-order bit via the carry chain, and feeds into both the LUT and the next portion of the carry chain. This feature allows Altera to implement very high-speed counters, adders and comparators of arbitrary width. September 4, 2007 CprE 583 – Reconfigurable Computing

Xilinx XC4000 Carry Chain Each Xilinx CLB can be used to make a 2x2 adder with fast-carry logic. The heart of the carry chain is a 2x1 mux that appears horizontally in the middle of the slide. To set up the carry chain, the CAD tools program the carry logic to work as follows: Cin is driven on the '1' input of the mux. Either A or B can be driven on the '0' input. For this example we will use B. The remaining multiplexers are set up to get the function A XOR B on the select line of the mux. The output of the mux now has this function: Cout = ((A XOR B) & Cin) | (~(A XOR B) & B) This reduces to: Cout = (~A & B & Cin) | (A & ~B & Cin) | (A & B) This is the correct equation for Cout. September 4, 2007 CprE 583 – Reconfigurable Computing

Cascades Large fanin operations (reductions): Decoding Matching Completion detection Many-to-one reductions Combining logic is simple Makes use of dedicated paths Some other opportunities here: cascades. Some example logic here is 9-input and, or etc. Note that carry is actually a particular instance of cascade. CprE 583 – Reconfigurable Computing September 4, 2007

Altera Cascade Logic LE delay – 2.4 ns Cascade delay – 0.6 ns A cascade example from Altera. In general, cascade is most likely useful when only a few of them are being used. But after a certain extend it might not has any advantage, as we can always implement a very wide input function by having a few LUTs operate in parallel, merging the intermediate results at the end. CprE 583 – Reconfigurable Computing September 4, 2007

Why Look at Arithmetic? Parallelization Specialization Architecture Size Inputs Adder problem – delay grows linearly with bit width Solutions for larger adders: Pipelining Carry bypass Carry select Parallelization is important. There's the issue of throughput vs. area, although on many FPGAs you get pipelining for free because of the registers on the logic blocks. CprE 583 – Reconfigurable Computing September 4, 2007

Adder Pipelining Not as practical in ASIC world (registers are expensive) Registers essentially “free” in FPGA logic blocks A1 B1 A2 B2 A3 B3 + Cout S1 + Cout This is not really done in ASICs because registers are almost as expensive as adders. This is not really an issue in FPGAs because, as mentioned above, there are registers for free on the outputs of the logic blocks. For chaining, after the initial registers are allocated for delay, register cost is light. S2 + S3 Cout September 4, 2007 CprE 583 – Reconfigurable Computing

Carry Bypass Adders If all the propagates are 1 (P0P1P2P3 = 1) then Cout3 = Cin Pi = Ai xor Bi Skip all the carry logic Inexpensive A0 B0 A1 B1 A2 B2 A3 B3 Cout0 Cout1 Cout2 + + + + Cin Cout3 1 P0P1P2P3 September 4, 2007 CprE 583 – Reconfigurable Computing

Carry Bypass Performance
Small hardware cost: 16-bit add – 4 CLB overhead 32-bit add – 9 CLB overhead Delay growth still linear, smaller slope Ripple Carry Carry Bypass Delay N-bit Adder CprE 583 – Reconfigurable Computing September 4, 2007

Carry Select Adders Precompute addition value for (Cin = 0) case and (Cin = 1) case Use mux to select between two with actual Cin value Cost of this approach? A0 B0 A1 B1 A2 B2 A3 B3 + + + + Cin A4 B4 A5 B5 A6 B6 A7 B7 + + + + This precomputes two answers - one for Cin=0 and one for Cin=1. A MUX is then used to select between the two when the carry arrives. This is expensive in terms of area because it nearly doubles the required number of adders. S4-7 1 + + + + 1 A4 B4 A5 B5 A6 B6 A7 B7 CprE 583 – Reconfigurable Computing September 4, 2007

Linear Carry-Select Adder delay = w, mux delay = 1 A31-24 A23-16 A15-8 A7-0 B31-24 B23-16 B15-8 B7-0 + + + + + + + + 1 1 1 1 t8 t8 t8 t8 t8 t8 t8 t8 Assumption: Adder delays = width of adder. Mux delays = 1. The critical path here is clearly the path across the MUXs, and this can be reduced as seen in the next slide. t11 t10 t9 1 1 1 1 t12 S31-24 S23-16 S15-8 S7-0 CprE 583 – Reconfigurable Computing September 4, 2007

Square-Root Carry Select
Each carry arrives when the corresponding sum is ready A31-30 A29-22 A21-15 A14-9 A8-4 A3-0 B31-30 B29-22 B21-15 B14-9 B8-4 B3-0 + + + + + + + + + + + + 1 1 1 1 1 1 This ensures that each carry arrives at the MUX when the corresponding sum is ready. The amount of hardware is more than double that required for the ripple carry adder. The amount of hardware is not any more than that required for the linear carry select adder. t8 t8 t7 t7 t6 t6 t5 t5 t4 t4 1 1 1 1 1 1 t9 t8 t7 t6 t5 t10 S31-30 S29-22 S21-15 S14-9 S8-4 S3-0 September 4, 2007 CprE 583 – Reconfigurable Computing

Constant Addition If one operand is constant: More speed? Less hardware? A0 A1 1 A2 A3 1 A0 S0 HA A2 S2 A3 S3 C3 A1 S1 HA FA FA FA C3 Constant addition doesn't gain you a whole lot in terms of speed, although it can reduce some full adders to half adders. Additionally, trailing zeros can be turned into wires. S0 S1 S2 S3 CprE 583 – Reconfigurable Computing September 4, 2007

Multiplication Shift and add operations Need N bit adder, M cycles 42 = Multiplicand (N bits) x 10 = x Multiplier (M bits) 000000 Partial products 420 = Product (N+M bits) CprE 583 – Reconfigurable Computing September 4, 2007

X3 X2 X1 X0 Array Multiplier Y0 Area – N x M cells Delay – O(N+M) Y1 X3 X2 X1 X0 + + + + Y2 X3 X2 X1 X0 + + + + Y3 X3 X2 X1 X0 + + + + Z2 Z1 Z0 September 4, 2007 CprE 583 – Reconfigurable Computing

X3 X2 X1 X0 Carry-Save Y0 Y1 X3 X2 X1 X0 + + + + Y2 + + + + There's one critical path here - along the right-hand side and across the bottom. Y3 + + + + Z2 Z1 Z0 CprE 583 – Reconfigurable Computing September 4, 2007

Multiplier Pipelining
Register cost: Multiplicand – (N bits/stage x M stages) Multiplier – (M2 + M) / 2 bits Early output values – (M2 + M) / 2 bits Total – M x (N + M + 1) bits Critical path = max: DFF + FA + setup Bottom-level adder September 4, 2007 CprE 583 – Reconfigurable Computing

Constant Multiplier Y0=0 Can greatly reduce the number of adders Removes all and gates Y1=1 X3 X2 X1 X0 Y2=0 Can eliminate adders with 0's on the inputs. This can drastically reduce the number of adders. In addition, the and gates can be eliminated. Y3=1 X3 X2 X1 X0 + + + + Z2 Z1 Z0 September 4, 2007 CprE 583 – Reconfigurable Computing

LUT-based Constant Multipliers
k-LUT can perform constant multiply of k-bit number Break operand into k-bit quantities Example: 8-bit x 8-bit constant, k=4 x NNNNNNNN AAAAAAAAAAA (N * 1011 (LSN)) + BBBBBBBBBBBBBBB (N * 1010 (MSN)) SSSSSSSSSSSSSSS Product A 4 bit constant with a 4 bit operand produces an 8 bit product. An array of LUTs can be used to implement a 4->8 multiplier. September 4, 2007 CprE 583 – Reconfigurable Computing

Summary Latency overhead of programmable logic Several approaches to reducing design latency: Fast carry Cascade Hardwired connections Multiplier optimization goals different from adder Other techniques: Logarithmic v. linear (Wallace Tree multiplier) Data encoding (Booth’s multiplier) September 4, 2007 CprE 583 – Reconfigurable Computing

CprE / ComS 583 Reconfigurable Computing

Similar presentations

Presentation on theme: "CprE / ComS 583 Reconfigurable Computing"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

CprE / ComS 583 Reconfigurable Computing

Similar presentations

Presentation on theme: "CprE / ComS 583 Reconfigurable Computing"— Presentation transcript:

Similar presentations

About project

Feedback