CprE / ComS 583 Reconfigurable Computing

Slides:



Advertisements
Similar presentations
Introduction So far, we have studied the basic skills of designing combinational and sequential logic using schematic and Verilog-HDL Now, we are going.
Advertisements

Penn ESE534 Spring DeHon 1 ESE534: Computer Organization Day 14: March 19, 2014 Compute 2: Cascades, ALUs, PLAs.
1 KU College of Engineering Elec 204: Digital Systems Design Lecture 9 Programmable Configurations Read Only Memory (ROM) – –a fixed array of AND gates.
Datorteknik ArithmeticCircuits bild 1 Computer arithmetic Somet things you should know about digital arithmetic: Principles Architecture Design.
ECE C03 Lecture 61 Lecture 6 Arithmetic Logic Circuits Hai Zhou ECE 303 Advanced Digital Design Spring 2002.
Chapter # 5: Arithmetic Circuits Contemporary Logic Design Randy H
Lecture 8 Arithmetic Logic Circuits
Chapter 5 Arithmetic Logic Functions. Page 2 This Chapter..  We will be looking at multi-valued arithmetic and logic functions  Bitwise AND, OR, EXOR,
Lecture # 12 University of Tehran
Aug Shift Operations Source: David Harris. Aug Shifter Implementation Regular layout, can be compact, use transmission gates to avoid threshold.
Chapter 6-2 Multiplier Multiplier Next Lecture Divider
Chapter # 5: Arithmetic Circuits
Chapter 6-1 ALU, Adder and Subtractor
Advanced VLSI Design Unit 05: Datapath Units. Slide 2 Outline  Adders  Comparators  Shifters  Multi-input Adders  Multipliers.
CprE / ComS 583 Reconfigurable Computing Prof. Joseph Zambreno Department of Electrical and Computer Engineering Iowa State University Lecture #4 – FPGA.
CDA 3101 Fall 2013 Introduction to Computer Organization The Arithmetic Logic Unit (ALU) and MIPS ALU Support 20 September 2013.
COMP541 Arithmetic Circuits
Lecture #23: Arithmetic Circuits-1 Arithmetic Circuits (Part I) Randy H. Katz University of California, Berkeley Fall 2005.
Addition and multiplication1 Arithmetic is the most basic thing you can do with a computer, but it’s not as easy as you might expect! These next few lectures.
Reconfigurable Computing - Performance Issues John Morris Chung-Ang University The University of Auckland ‘Iolanthe’ at 13 knots on Cockburn Sound, Western.
Combinational Circuits
Somet things you should know about digital arithmetic:
Lecture 12 Logistics Last lecture Today HW4 due today Timing diagrams
Digital Electronics Multiplexer
Topics SRAM-based FPGA fabrics: Xilinx. Altera..
Multiplier Design [Adapted from Rabaey’s Digital Integrated Circuits, Second Edition, ©2003 J. Rabaey, A. Chandrakasan, B. Nikolic]
Swamynathan.S.M AP/ECE/SNSCT
CSE 575 Computer Arithmetic Spring 2003 Mary Jane Irwin (www. cse. psu
Addition and multiplication
Instructor: Dr. Phillip Jones
Digital Electronics Multiplexer
CprE 583 – Reconfigurable Computing
Basics Combinational Circuits Sequential Circuits Ahmad Jawdat
Topics Number representation. Shifters. Adders and ALUs.
Combinatorial Logic Design Practices
Computer Organization and Design Arithmetic & Logic Circuits
CprE / ComS 583 Reconfigurable Computing
CSE Winter 2001 – Arithmetic Unit - 1
Topic 3b Computer Arithmetic: ALU Design
VLSI Arithmetic Adders & Multipliers
Lecture 14 Logistics Last lecture Today
King Fahd University of Petroleum and Minerals
Computer Organization and Design Arithmetic & Logic Circuits
Arithmetic Circuits (Part I) Randy H
CSE 370 – Winter 2002 – Comb. Logic building blocks - 1
Programmable Configurations
Topic 3b Computer Arithmetic: ALU Design
Basic Adders and Counters Implementation of Adders
CprE / ComS 583 Reconfigurable Computing
Part III The Arithmetic/Logic Unit
COMS 361 Computer Organization
Addition and multiplication
Lecture 14 Logistics Last lecture Today
UNIVERSITY OF MASSACHUSETTS Dept
Addition and multiplication
ECE 352 Digital System Fundamentals
Combinational Circuits
ECE 352 Digital System Fundamentals
ECE 352 Digital System Fundamentals
ECE 352 Digital System Fundamentals
UNIVERSITY OF MASSACHUSETTS Dept
Lecture 9 Digital VLSI System Design Laboratory
Arithmetic Building Blocks
CprE / ComS 583 Reconfigurable Computing
ECE 352 Digital System Fundamentals
Arithmetic Circuits.
Lecture 3 Combinational units. Adders
UNIVERSITY OF MASSACHUSETTS Dept
Instructor: Michael Greenbaum
Reconfigurable Computing (EN2911X, Fall07)
Presentation transcript:

CprE / ComS 583 Reconfigurable Computing Prof. Joseph Zambreno Department of Electrical and Computer Engineering Iowa State University Lecture #5 – FPGA Arithmetic

CprE 583 – Reconfigurable Computing Quick Points HW #1 due Thursday at 12:00pm Any comments? September 4, 2007 CprE 583 – Reconfigurable Computing

CprE 583 – Reconfigurable Computing Recap Cluster size of N = [6-8] is good, K = [4-5] September 4, 2007 CprE 583 – Reconfigurable Computing

LUT Mapping Techniques As noted earlier, the decomposition of the network into gates can also affect the mapping. In the slide below, note that the topmost network is mapped naively into 4 LUTs (again K=4). The decompositions at the bottom show how different (smaller) mappings can be found, depending on how the OR gate is decomposed; the decomposition on the right does not allow a mapping into 2 LUTs. September 4, 2007 CprE 583 – Reconfigurable Computing

LUT Mapping Techniques (cont.) Node replication in the network can also simplify the mapping; the network on the left cannot be mapped into fewer than 3 4-input LUTs; whereas the network on the right can fit in 2 LUTs. September 4, 2007 CprE 583 – Reconfigurable Computing

LUT Mapping Techniques (cont.) We must also take advantage of reconvergence in the network; in the figure below, the mapping on the left does not look far back enough (towards the inputs) to find the superior mapping indicated on the right September 4, 2007 CprE 583 – Reconfigurable Computing

CprE 583 – Reconfigurable Computing Outline Recap Motivation Carry / Cascade Logic Addition Ripple Carry Carry Bypass Carry Select Carry Lookahead Basic Multipliers September 4, 2007 CprE 583 – Reconfigurable Computing

CprE 583 – Reconfigurable Computing Motivation Traditional microprocessors, DSPs, etc. don’t use LUTs Instead use a w-bit Arithmetic and Logic Unit (ALU) Carry connections are hard-wired No switches, no stubs, short wires (2) (1) AND2 OR2 XOR2 A B Cin ALU A B Out Op (1, 2) 2-LUT A B Out (1) 3-LUT 3-LUT Sum As carry connection is fast in ALU. Carry signals just ripple through and complete in less time. The carry path doesn't have extra switches, stubs etc which would add to the RC delay. Note that we are talking about the latency advantage here of the ALU, not computational density. Cout / Cin A B (2) ADD SUB CMP 3-LUT 3-LUT Sum Cout September 4, 2007 CprE 583 – Reconfigurable Computing

CprE 583 – Reconfigurable Computing Adder Delays Assuming a ripple-carry adder: 32-bit ALU delay – 6ns 32-cascaded 4-LUTs – 32 x 2.5ns = 80ns Motivates extra hardware to accelerate carry operations 4-LUT delay Compare: 32-bit ALU (0.6λ) 2.0 ns Logic delay 16 ns Area optimized 0.5 ns Single channel delay 6 ns Delay optimized 2.5 ns per bit Here are some sample data points. The 32 bit ALU, by avoiding the programmable interconnect, can have delay as low as 16ns and 6ns. And for the LUT based design, the delay is high, motivating extra dedicated arithmetic hardware support. People have tried various ways to improve a pure LUT based computing structure, as will be explained in the next few slides. CprE 583 – Reconfigurable Computing September 4, 2007

Altera Flex 8000 Carry Chain The first example here is carry and cascade support from Altera. Notice the extra carry-in and cascade-in inputs. The carry chain provides a very fast (less than 0.5ns) carry-forward function between LEs (logic elements). (Altera FPGA is organized in LABs (logic array block), each LAB has eight vertically stacked LEs) So, the carry-in signal from a lower-order bit moves forward into the higher-order bit via the carry chain, and feeds into both the LUT and the next portion of the carry chain. This feature allows Altera to implement very high-speed counters, adders and comparators of arbitrary width. September 4, 2007 CprE 583 – Reconfigurable Computing

CprE 583 – Reconfigurable Computing Xilinx XC4000 Carry Chain Each Xilinx CLB can be used to make a 2x2 adder with fast-carry logic. The heart of the carry chain is a 2x1 mux that appears horizontally in the middle of the slide. To set up the carry chain, the CAD tools program the carry logic to work as follows: Cin is driven on the '1' input of the mux. Either A or B can be driven on the '0' input. For this example we will use B. The remaining multiplexers are set up to get the function A XOR B on the select line of the mux. The output of the mux now has this function: Cout = ((A XOR B) & Cin) | (~(A XOR B) & B) This reduces to: Cout = (~A & B & Cin) | (A & ~B & Cin) | (A & B) This is the correct equation for Cout. September 4, 2007 CprE 583 – Reconfigurable Computing

CprE 583 – Reconfigurable Computing Cascades Large fanin operations (reductions): Decoding Matching Completion detection Many-to-one reductions Combining logic is simple Makes use of dedicated paths Some other opportunities here: cascades. Some example logic here is 9-input and, or etc. Note that carry is actually a particular instance of cascade. CprE 583 – Reconfigurable Computing September 4, 2007

CprE 583 – Reconfigurable Computing Altera Cascade Logic LE delay – 2.4 ns Cascade delay – 0.6 ns A cascade example from Altera. In general, cascade is most likely useful when only a few of them are being used. But after a certain extend it might not has any advantage, as we can always implement a very wide input function by having a few LUTs operate in parallel, merging the intermediate results at the end. CprE 583 – Reconfigurable Computing September 4, 2007

CprE 583 – Reconfigurable Computing Why Look at Arithmetic? Parallelization Specialization Architecture Size Inputs Adder problem – delay grows linearly with bit width Solutions for larger adders: Pipelining Carry bypass Carry select Parallelization is important. There's the issue of throughput vs. area, although on many FPGAs you get pipelining for free because of the registers on the logic blocks. CprE 583 – Reconfigurable Computing September 4, 2007

CprE 583 – Reconfigurable Computing Adder Pipelining Not as practical in ASIC world (registers are expensive) Registers essentially “free” in FPGA logic blocks A1 B1 A2 B2 A3 B3 + Cout S1 + Cout This is not really done in ASICs because registers are almost as expensive as adders. This is not really an issue in FPGAs because, as mentioned above, there are registers for free on the outputs of the logic blocks. For chaining, after the initial registers are allocated for delay, register cost is light. S2 + S3 Cout September 4, 2007 CprE 583 – Reconfigurable Computing

CprE 583 – Reconfigurable Computing Carry Bypass Adders If all the propagates are 1 (P0P1P2P3 = 1) then Cout3 = Cin Pi = Ai xor Bi Skip all the carry logic Inexpensive A0 B0 A1 B1 A2 B2 A3 B3 Cout0 Cout1 Cout2 + + + + Cin Cout3 1 P0P1P2P3 September 4, 2007 CprE 583 – Reconfigurable Computing

Carry Bypass Performance Small hardware cost: 16-bit add – 4 CLB overhead 32-bit add – 9 CLB overhead Delay growth still linear, smaller slope Ripple Carry Carry Bypass Delay N-bit Adder CprE 583 – Reconfigurable Computing September 4, 2007

CprE 583 – Reconfigurable Computing Carry Select Adders Precompute addition value for (Cin = 0) case and (Cin = 1) case Use mux to select between two with actual Cin value Cost of this approach? A0 B0 A1 B1 A2 B2 A3 B3 + + + + Cin A4 B4 A5 B5 A6 B6 A7 B7 + + + + This precomputes two answers - one for Cin=0 and one for Cin=1. A MUX is then used to select between the two when the carry arrives. This is expensive in terms of area because it nearly doubles the required number of adders. S4-7 1 + + + + 1 A4 B4 A5 B5 A6 B6 A7 B7 CprE 583 – Reconfigurable Computing September 4, 2007

CprE 583 – Reconfigurable Computing Linear Carry-Select Adder delay = w, mux delay = 1 A31-24 A23-16 A15-8 A7-0 B31-24 B23-16 B15-8 B7-0 + + + + + + + + 1 1 1 1 t8 t8 t8 t8 t8 t8 t8 t8 Assumption: Adder delays = width of adder. Mux delays = 1. The critical path here is clearly the path across the MUXs, and this can be reduced as seen in the next slide. t11 t10 t9 1 1 1 1 t12 S31-24 S23-16 S15-8 S7-0 CprE 583 – Reconfigurable Computing September 4, 2007

Square-Root Carry Select Each carry arrives when the corresponding sum is ready A31-30 A29-22 A21-15 A14-9 A8-4 A3-0 B31-30 B29-22 B21-15 B14-9 B8-4 B3-0 + + + + + + + + + + + + 1 1 1 1 1 1 This ensures that each carry arrives at the MUX when the corresponding sum is ready. The amount of hardware is more than double that required for the ripple carry adder. The amount of hardware is not any more than that required for the linear carry select adder. t8 t8 t7 t7 t6 t6 t5 t5 t4 t4 1 1 1 1 1 1 t9 t8 t7 t6 t5 t10 S31-30 S29-22 S21-15 S14-9 S8-4 S3-0 September 4, 2007 CprE 583 – Reconfigurable Computing

CprE 583 – Reconfigurable Computing Constant Addition If one operand is constant: More speed? Less hardware? A0 A1 1 A2 A3 1 A0 S0 HA A2 S2 A3 S3 C3 A1 S1 HA FA FA FA C3 Constant addition doesn't gain you a whole lot in terms of speed, although it can reduce some full adders to half adders. Additionally, trailing zeros can be turned into wires. S0 S1 S2 S3 CprE 583 – Reconfigurable Computing September 4, 2007

CprE 583 – Reconfigurable Computing Multiplication Shift and add operations Need N bit adder, M cycles 42 = 101010 Multiplicand (N bits) x 10 = x 1010 Multiplier (M bits) 000000 101010 Partial products + 101010 420 = 111001110 Product (N+M bits) CprE 583 – Reconfigurable Computing September 4, 2007

CprE 583 – Reconfigurable Computing X3 X2 X1 X0 Array Multiplier Y0 Area – N x M cells Delay – O(N+M) Y1 X3 X2 X1 X0 + + + + Y2 X3 X2 X1 X0 + + + + Y3 X3 X2 X1 X0 + + + + Z2 Z1 Z0 September 4, 2007 CprE 583 – Reconfigurable Computing

CprE 583 – Reconfigurable Computing X3 X2 X1 X0 Carry-Save Y0 Y1 X3 X2 X1 X0 + + + + Y2 + + + + There's one critical path here - along the right-hand side and across the bottom. Y3 + + + + Z2 Z1 Z0 CprE 583 – Reconfigurable Computing September 4, 2007

Multiplier Pipelining Register cost: Multiplicand – (N bits/stage x M stages) Multiplier – (M2 + M) / 2 bits Early output values – (M2 + M) / 2 bits Total – M x (N + M + 1) bits Critical path = max: DFF + FA + setup Bottom-level adder September 4, 2007 CprE 583 – Reconfigurable Computing

CprE 583 – Reconfigurable Computing Constant Multiplier Y0=0 Can greatly reduce the number of adders Removes all and gates Y1=1 X3 X2 X1 X0 Y2=0 Can eliminate adders with 0's on the inputs. This can drastically reduce the number of adders. In addition, the and gates can be eliminated. Y3=1 X3 X2 X1 X0 + + + + Z2 Z1 Z0 September 4, 2007 CprE 583 – Reconfigurable Computing

LUT-based Constant Multipliers k-LUT can perform constant multiply of k-bit number Break operand into k-bit quantities Example: 8-bit x 8-bit constant, k=4 10101011 x NNNNNNNN AAAAAAAAAAA (N * 1011 (LSN)) + BBBBBBBBBBBBBBB (N * 1010 (MSN)) SSSSSSSSSSSSSSS Product A 4 bit constant with a 4 bit operand produces an 8 bit product. An array of LUTs can be used to implement a 4->8 multiplier. September 4, 2007 CprE 583 – Reconfigurable Computing

CprE 583 – Reconfigurable Computing Summary Latency overhead of programmable logic Several approaches to reducing design latency: Fast carry Cascade Hardwired connections Multiplier optimization goals different from adder Other techniques: Logarithmic v. linear (Wallace Tree multiplier) Data encoding (Booth’s multiplier) September 4, 2007 CprE 583 – Reconfigurable Computing