CprE / ComS 583 Reconfigurable Computing
Prof. Joseph Zambreno
Department of Electrical and Computer Engineering
Iowa State University
Lecture #6 – Modern FPGA Devices

Lect-06.2 | CprE 583 – Reconfigurable Computing | September 6, 2007

Quick Points
- HW #2 is out
  - Due Thursday, September 20 (12:00pm)
  - LUT mapping
  - Comparing FPGA devices
  - Synthesizing arithmetic operators
- [Figure: effort level between the assigned and due dates, "Standard" vs. "Preferred" for CprE 583]

Recap
- Hard-wired carry logic support
- [Figures: carry logic in the Altera FLEX 8000 and the Xilinx XC4000]

Recap (cont.)
- Square-root carry select adders
- [Figure: carry-select adder with blocks for bits 3-0, 8-4, 14-9 and higher (inputs A, B; outputs S); signal arrival times labeled t4 through t10]

Recap (cont.)
- If one operand is constant:
  - More speed?
  - Less hardware?
- [Figure: 4-bit adder (inputs A0-A3, outputs S0-S3 and C3) built from half adders (HA) and full adders (FA), shown before and after specializing for a constant operand]

Recap (cont.)
- Carry save multiplication
- [Figure: array multiplier with inputs X0-X3 and Y0-Y3, rows of adders, product bits Z0, Z1, Z2, ...]

Recap (cont.)
- If one operand is constant:
  - Can greatly reduce the number of adders
  - Removes all AND gates
- [Figure: the carry-save multiplier specialized for the constant Y0 = 0, Y1 = 1, Y2 = 0, Y3 = 1, with inputs X0-X3 and product bits Z0, Z1, Z2, ...]
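
As a rough illustration of why a constant operand removes the AND gates and most of the adders, the sketch below multiplies by the constant from this slide (Y3..Y0 = 1010) using one shifted copy of X per set bit. The function names are made up for this example and are not from the lecture.

```python
def multiply_by_constant(x, constant):
    """Multiply x by a fixed constant using one shifted copy of x per set bit."""
    total = 0
    shift = 0
    while constant >> shift:
        if (constant >> shift) & 1:
            # A set bit contributes x << shift; a clear bit contributes nothing,
            # so no AND gate or adder is needed for it.
            total += x << shift
        shift += 1
    return total


def adders_needed(constant):
    """Adders required to sum the shifted copies (one fewer than the set bits)."""
    return max(bin(constant).count("1") - 1, 0)


Y = 0b1010  # Y3 = 1, Y2 = 0, Y1 = 1, Y0 = 0, as on the slide
print(multiply_by_constant(13, Y), adders_needed(Y))
```

With Y = 1010 only two shifted copies of X remain, so a single adder is enough in this model.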
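
LUT-Based Constant Multipliers
- Constants can be changed in the LUTs to program new multipliers
- The 8-bit operand is split into a least significant nibble (LSN, 1011 in the example) and a most significant nibble (MSN, 1010 in the example)
  - AAAAAAAAAAAA = N * 1011 (LSN), 12 bits
  - BBBBBBBBBBBB = N * 1010 (MSN), 12 bits
  - SSSSSSSSSSSSSSSS = product, 16 bits
- [Figure: 4-LUTs addressed by bits 0-3 produce A0-A11, 4-LUTs addressed by bits 4-7 produce B4-B15, and an adder combines them into S0-S15]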
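
The nibble-wise scheme on this slide can be sketched in software as follows. This is a minimal model, assuming an 8-bit operand and an 8-bit constant; `build_nibble_lut` and `lut_multiply` are hypothetical names chosen for the example.

```python
def build_nibble_lut(constant):
    """Table of constant * nibble for all 16 nibble values (12-bit entries)."""
    return [constant * nibble for nibble in range(16)]


def lut_multiply(operand, lut):
    """Multiply an 8-bit operand by the constant baked into the table."""
    a = lut[operand & 0xF]         # partial product from the low nibble (A0-A11)
    b = lut[(operand >> 4) & 0xF]  # partial product from the high nibble
    return a + (b << 4)            # high partial product is weighted by 16 (B4-B15)


lut = build_nibble_lut(0xAB)  # "reprogramming" the multiplier means rebuilding this table
print(lut_multiply(0x3C, lut) == 0x3C * 0xAB)
```

Changing the constant only changes the table contents, which mirrors the slide's point that new multipliers are programmed by rewriting the LUTs.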

Outline
- Recap
- More Multiplication
- Handling Fractional Values
  - Fixed Point
  - Floating Point
- Some Modern FPGA Devices
  - Xilinx: XC5200, Virtex (-II / -II Pro / -4 / -5), Spartan (-II / -3)
  - *Altera: FLEX 10K, APEX (20K / II), ACEX 1K, Cyclone (II), Stratix (GX / II / II GX)

Partial Product Generation
- AND gates in multiplication are wasteful
  - Option 1: use cascade logic
  - Option 2: break into smaller (2x2) multipliers
- Example:
      42 = 101010      (multiplicand)
    x 11 =   1011      (multiplier)
    2-bit partial products: 0110 (10 x 11), 0100 (10 x 10), ...
     462 = 111001110   (product)
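
Option 2 can be sketched as below: both operands are cut into 2-bit digits, every digit pair goes through a 2x2 multiply, and the 4-bit results are shifted into place and summed. The helper names are illustrative only.

```python
def split_digits(value, bits_per_digit=2):
    """Split a non-negative integer into fixed-width digits, least significant first."""
    mask = (1 << bits_per_digit) - 1
    digits = []
    while value:
        digits.append(value & mask)
        value >>= bits_per_digit
    return digits or [0]


def multiply_2x2(a, b):
    """Multiply using only 2-bit x 2-bit partial products, then shift and add."""
    total = 0
    for i, digit_a in enumerate(split_digits(a)):
        for j, digit_b in enumerate(split_digits(b)):
            partial = digit_a * digit_b          # fits in 4 bits (at most 3 * 3 = 9)
            total += partial << (2 * (i + j))    # align to the digit positions
    return total


print(multiply_2x2(42, 11))
```

For 42 x 11 the digits are 10,10,10 and 10,11, so the only partial products that occur are 10 x 11 = 0110 and 10 x 10 = 0100, as on the slide.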

Representation Compression
- Multiplication can be simplified if the representation is compressed
  - Standard binary representation: digits {0,1} x 2^n
  - Canonical Signed Digit (CSD) representation: digits {-1,0,1} x 2^n
- To encode CSD:
  - Set C = B + (B << 1)
  - Calculate -2C = 2*(C >> 1)
  - D_i = B_i + C_i - 2C_{i+1}, where C_{i+1} is the carry-out of B_i + C_i
- Example: B = 61d = 0111101b encodes to D = 1000(-1)01, i.e. 64 - 4 + 1
- For any n-bit number, at most about n/2 digits of the CSD representation are nonzero, because no two adjacent digits are both nonzero
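
The sketch below produces CSD digits with the standard non-adjacent-form recurrence, which is not the carry-based formulation listed on the slide but yields the same digits for the slide's example. The function name `csd_digits` is an assumption made for this illustration.

```python
def csd_digits(value):
    """Return the CSD digits of a non-negative integer, most significant first."""
    digits = []
    while value:
        if value & 1:
            # Pick +1 when value % 4 == 1 and -1 when value % 4 == 3, so the
            # next digit is forced to be zero (no two adjacent nonzero digits).
            digit = 2 - (value & 3)
            value -= digit
        else:
            digit = 0
        digits.append(digit)
        value >>= 1
    return digits[::-1] or [0]


print(csd_digits(61))  # [1, 0, 0, 0, -1, 0, 1], i.e. 1000(-1)01
```

Each nonzero digit costs one add or subtract in a constant multiplier, so 61 needs two operations in CSD instead of four in plain binary.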

Booth Encoding
- Variation on CSD encoding: E_j = -2B_i + B_{i-1} + B_{i-2}
  - Select a group of 3 bits, add the two least significant bits, then subtract 2x the most significant bit
  - Groups overlap by one bit, so each E_j is a digit in {-2,-1,0,1,2} x 2^{2j}
- Example: B = 61d = 0111101b (with padding) encodes to E = 010(-1)1, i.e. 64 - 4 + 1
- Reduces the number of partial products for multiplication by half
- Can automatically handle negative numbers
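
A small sketch of the grouping rule, assuming two's-complement input of a stated width and an implicit 0 below the least significant bit. The name `booth_radix4` and the `width` parameter are choices made for this example.

```python
def booth_radix4(value, width):
    """Radix-4 Booth digits of a two's-complement value, least significant first.

    `width` is the operand width in bits (rounded up to an even number of bits
    by the loop); Python's arithmetic shift supplies the sign extension.
    """
    digits = []
    previous = 0  # the padding bit to the right of bit 0
    for j in range(0, width, 2):
        low = (value >> j) & 1
        high = (value >> (j + 1)) & 1
        digits.append(-2 * high + low + previous)
        previous = high  # groups overlap by one bit
    return digits


digits = booth_radix4(61, 10)
print(digits[::-1])  # [0, 1, 0, -1, 1], i.e. 010(-1)1
```

Ten bits produce five digits, which is the halving of partial products the slide refers to; negative inputs work without a separate sign step.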

Fractional Arithmetic
- Many important computations require fractional components
- Fractional arithmetic is often ignored in FPGA literature
  - Complex standards (ex. IEEE special cases)
  - Resource intensive and slow
- Why not just extend the binary representation past the binary point?

Fixed-Point Representation
- Separate the value into an integer part (I) and a fractional remainder (F)
- F bits represent {0,1} x 2^-n
- How large to make I and F depends on the application
  - Ex: Q16.16 has 16 bits of integer, [-2^15, 2^15), with 16 bits of fraction: increments of 2^-16, or about 1.5e-5
  - Ex: Q1.127 is a normalized value in [-1, 1) with 127 bits of fraction: increments of 2^-127, or about 5.9e-39
- [Figure: a word divided into I and F fields]
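
To make the Q-format idea concrete, here is a minimal sketch that stores a value as an integer scaled by 2^F. The helper names are hypothetical, and rounding to nearest is an assumption (hardware may truncate instead).

```python
def to_fixed(value, frac_bits):
    """Encode a real value as an integer with `frac_bits` fractional bits."""
    return int(round(value * (1 << frac_bits)))


def from_fixed(raw, frac_bits):
    """Decode a fixed-point integer back to a float."""
    return raw / (1 << frac_bits)


def resolution(frac_bits):
    """Smallest increment the format can represent."""
    return 2.0 ** -frac_bits


raw = to_fixed(1.5, 16)            # Q16.16 encoding of 1.5
print(raw, from_fixed(raw, 16))    # 98304 1.5
print(resolution(16), resolution(127))
```

The two resolutions printed at the end correspond to the Q16.16 and Q1.127 increments quoted on the slide.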

Fixed-Point Arithmetic
- Addition and subtraction are the same as for integers (Q4.4 example)
- Multiplication requires realignment: the product of two Q4.4 values carries 8 fraction bits and must be shifted right by 4 to return to Q4.4
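
The slide's worked numbers did not survive, so the sketch below uses operands of its own choosing (2.5 and 1.25) to show Q4.4 addition and the realignment step after multiplication. Values are held as raw integers scaled by 16; overflow handling is left out.

```python
FRAC_BITS = 4  # Q4.4: 4 integer bits, 4 fraction bits


def q_add(a, b):
    """Fixed-point addition is plain integer addition."""
    return a + b


def q_mul(a, b):
    """The raw product has 8 fraction bits; shift right by 4 to realign to Q4.4."""
    return (a * b) >> FRAC_BITS


a = 0b0010_1000  # 2.5  in Q4.4 (raw 40)
b = 0b0001_0100  # 1.25 in Q4.4 (raw 20)
print(q_add(a, b) / 16)  # 3.75
print(q_mul(a, b) / 16)  # 3.125
```

The right shift discards the low fraction bits, which is where the quantization error discussed next comes from.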

Fixed-Point Issues
- Overflow/underflow
- Quantization errors
  - After rounding down, the product from the previous example has 0.08% error
  - In Q4.4, 2 divided by 3 = 0.625 (6.25% error)
- Scaling
- Dynamic range needed for some applications
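
The 6.25% figure for 2/3 in Q4.4 can be checked with exact rational arithmetic. This is a small sketch; `quantize_down` and `relative_error` are names invented for the example.

```python
import math
from fractions import Fraction


def quantize_down(value, frac_bits):
    """Round a value down to the nearest multiple of 2^-frac_bits."""
    scale = 1 << frac_bits
    return Fraction(math.floor(value * scale), scale)


def relative_error(exact, approx):
    """Relative error of an approximation, as an exact fraction."""
    return abs(exact - approx) / abs(exact)


exact = Fraction(2, 3)
approx = quantize_down(exact, 4)   # Q4.4 keeps 4 fraction bits
print(float(approx), float(relative_error(exact, approx)))  # 0.625 0.0625
```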

IEEE 754 Floating Point
- Single precision: V = (-1)^S x 2^(E-127) x (1.F)
  - Fields: S (1 bit), E (8 bits), F (23 bits)
- Double precision: V = (-1)^S x 2^(E-1023) x (1.F)
  - Fields: S (1 bit), E (11 bits), F (52 bits)
- Special conditions: not a number (NaN), +-0, +-infinity
- Gradual underflow
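
The single-precision formula can be tried out by pulling the three fields out of a real float. The sketch uses Python's standard `struct` module; the function names are illustrative, and `value_of` covers normalized numbers only (no NaN, infinity, zero, or denormals).

```python
import struct


def decode_single(x):
    """Split a float, stored as IEEE 754 single precision, into (S, E, F)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    return sign, exponent, fraction


def value_of(sign, exponent, fraction):
    """Evaluate (-1)^S x 2^(E-127) x (1.F) for a normalized single."""
    return (-1) ** sign * 2.0 ** (exponent - 127) * (1 + fraction / 2 ** 23)


fields = decode_single(-6.25)
print(fields)             # (1, 129, 4718592)
print(value_of(*fields))  # -6.25
```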

Floating Point FPGA Hardware
- Xilinx XC4085: LUT counts reported for single-precision and double-precision versions of each unit
  - Addition
  - Multiplication
  - Division
- For double-precision, only two of the three units fit on a single device!
- See [Und04] for details

Capacity Trends
- [Figure: Xilinx device complexity by year, starting in 1985]
- Labeled data points include:
  - Early XC-series devices at 1K, 7.5K, 23K, and 250K gates
  - Spartan: 80 MHz, 40K gates
  - Spartan-II: 200 MHz, 200K gates
  - Spartan-3: 5M gates
  - Virtex: 200 MHz, 1M gates
  - Virtex-E: 240 MHz, 4M gates
  - Virtex-II: 450 MHz, 8M gates
  - Virtex-II Pro: 450 MHz, 8M gates*
  - Virtex-4: 16M gates*
  - Virtex-5: 24M gates*

Xilinx XC5200 FPGA
- Successor to the XC4000
- Relatively small number of CLBs with faster interconnect

Device   Logic Cells  Max Logic Gates  VersaBlock Array  CLBs  Flip-Flops  I/Os
XC5202   ...          3,000            8 x 8             ...   ...         ...
XC5204   ...          6,000            10 x 12           ...   ...         ...
XC5206   ...          10,000           14 x 14           ...   ...         ...
XC5210   1,296        16,000           18 x 18           ...   1,296       ...
XC5215   1,936        23,000           22 x ...          ...   1,936       ...

Xilinx XC5200 (cont.)
- Each CLB consists of four Logic Cells (LCs)
- Logic Cell = LUT + DFF
- 20 inputs, 12 outputs

Xilinx Spartan FPGAs
- Meant to be a low-power / low-cost version of the XC4000 series (on newer process technology)

Device  Logic Cells  Max Logic Gates  CLB Matrix  Total CLBs  Flip-Flops  I/Os  Dist. RAM Bits
XCS05   ...          5,000            ... x 10    ...         ...         ...   3,200
XCS10   ...          10,000           14 x 14     ...         ...         ...   6,272
XCS20   ...          20,000           20 x 20     ...         ...,120     ...   12,800
XCS30   ...,368      30,000           24 x 24     ...         1,536       ...   18,432
XCS40   1,862        40,000           28 x ...    ...         2,016       ...   25,088

Xilinx Spartan (cont.)
- Identical CLB to XC4000 series

Xilinx Spartan (cont.)
- Individual LUTs can be programmed as 16x1 RAMs and combined to form larger memory structures

Xilinx Virtex FPGAs

Device    Logic Cells  Max Logic Gates  CLB Array  I/O  Block RAM Bits  Select RAM+ Bits
XCV50     1,728        57,906           16 x ...   ...  ...,768         24,576
XCV100    2,700        108,904          20 x ...   ...  ...,960         38,400
XCV150    3,888        164,674          24 x ...   ...  ...,152         55,296
XCV200    5,292        238,666          28 x ...   ...  ...,844         75,264
XCV300    6,912        322,970          32 x ...   ...  ...,536         98,...
XCV400    10,800       468,252          40 x ...   ...  ...,920         ...
XCV600    15,552       661,111          48 x ...   ...  ...,304         ...
XCV800    21,168       888,439          56 x ...   ...  ...,688         ...
XCV1000   ...,648      1,124,022        64 x ...   ...  ...,072         ...,216

Xilinx Virtex (cont.)
- Four 4-LUTs / FFs per CLB
- Organized into 2 "slices"

Xilinx Virtex (cont.)
- Block Select+RAM: dedicated blocks of on-chip, true dual-port read/write synchronous RAM
- 4 Kbit of RAM per block, with different aspect ratios
- Faster, less flexible than distributed RAM using LUTs
- Virtex-E: updated, larger version of Virtex devices

Xilinx Spartan-II
- CLB structure similar to Virtex

Device   Logic Cells  System Gates  CLB Array  I/O  Distributed RAM Bits  Select RAM+ Bits
XC2S15   ...          15,000        8 x 12     86   6,...                 16,384
XC2S30   ...          30,000        12 x ...   ...  ...,824               24,576
XC2S50   1,728        50,000        16 x ...   ...  ...,576               32,768
XC2S100  2,700        100,000       20 x ...   ...  ...,400               40,960
XC2S150  3,888        150,000       24 x ...   ...  ...,296               49,152
XC2S200  5,292        200,000       28 x ...   ...  ...,264               57,344

Xilinx Virtex-II Platform FPGAs

Device    Max Logic Gates  CLB Array  Multiplier Blocks  Max I/O Pads  Block RAM Bits  Select RAM+ Bits
XC2V40    40K              8 x 8      4                  88            8K              72K
XC2V80    80K              16 x ...   ...                ...           ...K            144K
XC2V250   250K             24 x ...   ...                ...           ...K            432K
XC2V500   500K             32 x ...   ...                ...           ...K            576K
XC2V1000  1M               40 x ...   ...                ...           ...K            720K
XC2V1500  1.5M             48 x ...   ...                ...           ...K            864K
XC2V2000  2M               56 x ...   ...                ...           ...K            1,008K
XC2V3000  3M               64 x ...   ...                ...           ...K            1,728K
XC2V4000  4M               80 x ...   ...                ...           ...K            2,160K
XC2V6000  6M               96 x ...   ...                ...,104       1,056K          2,592K
XC2V8000  8M               112 x ...  ...                ...,108       1,456K          3,024K

- "Platform" FPGA == Multiplier??

Xilinx Virtex-II (cont.)
- 4 slices per CLB, two 4-LUTs per slice
  - 8 LUTs per CLB
- Block Select+RAMs now 18 Kbit each

Xilinx Virtex-II (cont.)
- Block multipliers (18b x 18b) arranged in columns near RAM

Block Multipliers
- Synthesis tools can take larger multipliers and break them down into 18x18 multipliers
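
As a rough picture of what such a decomposition involves, the sketch below splits unsigned operands into limbs, multiplies every limb pair, and sums the shifted results. It assumes the 18x18 blocks are signed, so unsigned limbs are kept to 17 bits; the names and the 32-bit default width are choices made for this example, not a description of any particular synthesis tool.

```python
LIMB_BITS = 17  # assumption: a signed 18x18 block handles 17-bit unsigned limbs


def split_limbs(value, width):
    """Split an unsigned `width`-bit value into 17-bit limbs, least significant first."""
    count = -(-width // LIMB_BITS)  # ceiling division
    mask = (1 << LIMB_BITS) - 1
    return [(value >> (LIMB_BITS * i)) & mask for i in range(count)]


def wide_multiply(a, b, width=32):
    """Multiply two unsigned values using only limb-sized block multiplies."""
    total = 0
    block_multiplies = 0
    for i, limb_a in enumerate(split_limbs(a, width)):
        for j, limb_b in enumerate(split_limbs(b, width)):
            total += (limb_a * limb_b) << (LIMB_BITS * (i + j))
            block_multiplies += 1
    return total, block_multiplies


product, blocks = wide_multiply(0xDEADBEEF, 0x12345678)
print(product == 0xDEADBEEF * 0x12345678, blocks)  # True 4
```

Under these assumptions a 32x32 multiply uses four block multipliers plus adders in the fabric to combine the partial products.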

Xilinx Virtex-II Pro FPGAs

Device    PowerPC CPU Blocks  Logic Cells  Multiplier Blocks  Max I/O Pads  Block RAM Bits  Select RAM+ Bits
XC2VP2    0                   3,...        ...                ...           ...K            216K
XC2VP4    1                   6,...        ...                ...           ...K            504K
XC2VP7    1                   11,...       ...                ...           ...K            792K
XC2VP20   2                   20,...       ...                ...           ...K            1,584K
XC2VP30   2                   30,...       ...                ...           ...K            2,448K
XC2VP40   2                   43,...       ...                ...           ...K            3,456K
XC2VP50   2                   53,...       ...                ...           ...K            4,176K
XC2VP70   2                   74,...       ...                ...           ...,034K        5,904K
XC2VP100  ...                 ...          ...                ...,164       1,378K          7,992K

Xilinx Virtex-II Pro (cont.)
- PowerPC processor block features:
  - 300+ MHz Harvard architecture (RISC)
  - Five-stage pipeline
  - Hardware multiply/divide
  - Thirty-two 32-bit GPRs
  - 16 KB two-way instruction cache
  - 16 KB two-way data cache
  - On-Chip Memory (OCM) interface
  - IBM CoreConnect (OPB, PLB) interfaces

Xilinx Virtex-II Pro (cont.)
- [Figure: PPC 405 details]

Xilinx Spartan-3 FPGAs
- CLB structure similar to Virtex-II

Device    System Gates  CLB Array  Multiplier Blocks  Max I/O Pads  Distr. RAM Bits  Select RAM+ Bits
XC3S50    50K           16 x ...   ...                ...           ...K             72K
XC3S200   200K          24 x ...   ...                ...           ...K             216K
XC3S400   400K          32 x ...   ...                ...           ...K             288K
XC3S1000  1M            48 x ...   ...                ...           ...K             432K
XC3S1500  1.5M          64 x ...   ...                ...           ...K             576K
XC3S2000  2M            80 x ...   ...                ...           ...K             720K
XC3S4000  4M            96 x ...   ...                ...           ...K             1,728K
XC3S5000  5M            104 x ...  ...                ...           ...K             1,872K

Xilinx Virtex-4 FPGAs
- Comes in three varieties:
  - Virtex-4 LX: the most LUTs
  - Virtex-4 FX: has PowerPCs like the Virtex-II Pro
  - Virtex-4 SX: the most XtremeDSP slices
- CLB structure similar to Virtex-II
  - Largest LX device: 89,088 slices = 178,176 4-LUTs!
  - FX devices limited to 2 PPC 405s, like Virtex-II Pro
- XtremeDSP slices:
  - Same 18x18 block multiplier, now with optional pipelining
  - Includes a built-in 48-bit accumulator for MAC operations
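
A multiply-accumulate (MAC) step with a 48-bit accumulator can be modeled roughly as below. This is a behavioral sketch only, assuming signed 18-bit operands and two's-complement wraparound on overflow; it does not model pipelining or any other detail of the actual XtremeDSP slice, and all names are made up for the example.

```python
ACC_BITS = 48       # accumulator width quoted on the slide
OPERAND_BITS = 18   # multiplier input width quoted on the slide


def wrap_signed(value, bits=ACC_BITS):
    """Wrap an integer into the two's-complement range of the given width."""
    value &= (1 << bits) - 1
    if value >> (bits - 1):
        value -= 1 << bits
    return value


def mac(acc, a, b):
    """One multiply-accumulate step: acc + a * b, wrapped to 48 bits."""
    limit = 1 << (OPERAND_BITS - 1)
    if not (-limit <= a < limit and -limit <= b < limit):
        raise ValueError("operands must fit in 18-bit signed range")
    return wrap_signed(acc + a * b)


def dot(xs, ys):
    """Dot product computed as a chain of MAC steps."""
    acc = 0
    for x, y in zip(xs, ys):
        acc = mac(acc, x, y)
    return acc


print(dot([1, 2, 3], [4, 5, 6]))  # 32
```

The wide accumulator leaves headroom above the 36-bit product, so many terms can be summed before overflow becomes a concern.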

Xilinx Virtex-5
- CLB slices use 6-input LUTs
- Block RAMs now 36 Kbit per block
- DSP slices now support 25x18 MAC
- Diagonal routing
- LX, LXT, SXT, FXT varieties

Summary
- Handling fractional math in hardware is important, and expensive
  - Data point: 3 double-precision dividers in a Xilinx XC2VP30
  - Data point: cannot fit a double-precision multiplier in a Xilinx XC3S50
  - Fixed point is an alternative, but not practical for all applications
- Xilinx FPGAs
  - 4-LUTs arranged in slices, CLBs (except for V5)
  - Physical SRAM blocks for fast memory
  - Physical multipliers for fast DSP operations
  - Some physical CPUs to manage embedded systems