Introduction.


1 Introduction

2 Objectives After completing this module, you will be able to:
Describe why parallelism enables such high performance
Describe the Virtex™-II, Virtex-II Pro™, and Spartan™-3 family architectures and how they lend themselves to an optimum implementation of DSP functions

3 Outline Power of Parallelism
Platform FPGA Virtex-II/Virtex-II Pro Series Spartan-3 Architecture Why should I use FPGAs for DSP?

4 Multiply Accumulate Single Engine
Sequential processing limits data throughput: time-shared MAC unit; high clock frequency creates a difficult system challenge. 256-tap FIR filter: 256 multiply and accumulate (MAC) operations per data sample; one output every 256 clock cycles. [Diagram: Data In, Reg, MAC unit (loop algorithm 256 times), Data Out]
Conventional general-purpose DSPs use a common architecture known as the Von Neumann architecture, or extensions thereof. This type of architecture is serial by nature, and performance is restricted as a result. Multiply and accumulate (MAC) units within conventional DSPs are typically shared resources. An FIR filter processes an incoming signal by multiplying each data sample by several constant values (coefficients) and accumulating the results. The more MAC operations applied to the incoming signal, the higher the accuracy of the filter. For example, a 256-tap FIR filter requires 256 MAC operations on each data sample before the next sample can be processed. Implementing a filter in the FPGA based on a sequential MAC can be very efficient for slower sample rates and is one of many techniques available for building a filter. One way to achieve very high sample rates is to use parallel processing techniques, and that is where Xilinx differentiates itself.
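The time-shared MAC loop described above can be sketched in software. This is a minimal behavioural model, not the vendor's implementation: the function name `fir_sequential`, the all-ones coefficients, and the choice to charge exactly one clock cycle per MAC are assumptions made for illustration.

```python
def fir_sequential(samples, coeffs):
    """Model a FIR filter built around one time-shared MAC unit.

    Returns (outputs, cycles), charging one clock cycle per MAC operation
    (an illustrative assumption).
    """
    taps = len(coeffs)
    delay_line = [0] * taps                  # most recent sample first
    outputs = []
    cycles = 0
    for x in samples:
        delay_line = [x] + delay_line[:-1]   # shift the new sample in
        acc = 0
        for k in range(taps):                # loop the single MAC unit
            acc += coeffs[k] * delay_line[k]
            cycles += 1
        outputs.append(acc)
    return outputs, cycles


coeffs = [1] * 256                           # hypothetical coefficients
outputs, cycles = fir_sequential([1, 2, 3], coeffs)
print(outputs, cycles)                       # 256 cycles are spent per sample
```

With 256 taps the cycle count grows by 256 for every input sample, which is the "one output every 256 clock cycles" behaviour of the slide.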

5 Sequential Processing Limits System Performance Algorithmic Complexity
[Plot: Sample Rate (MSamples/s), axis ticks 5 to 40, versus No. of coefficients (algorithmic complexity), with one curve for a single 300 MHz processor and one for two 300 MHz processors; channel density or sample rate falls as algorithmic complexity rises]
Max Sample Rate = Fixed Processor Clock Rate / Number of operations per sample
This foil shows the system performance achieved by one, and then two, DSP processors able to execute one MAC instruction per clock. With a single, fixed MAC unit, the sample rate and algorithmic complexity (number of cycles per sample) are related by the equation above. As the algorithmic complexity increases and more clock cycles are needed to process each sample, the sample rate has to fall. The only way to cater for both increasing algorithmic complexity and increasing sample rates is to use multiple processor engines, i.e., parallelism.
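The relationship between clock rate, operations per sample, and sample rate can be worked through numerically. The sketch below assumes one MAC per clock and, for the two-processor case, that the work splits evenly with no overhead; the function name `max_sample_rate` and the `engines` parameter are illustrative only.

```python
def max_sample_rate(clock_hz, ops_per_sample, engines=1):
    """Max sample rate = clock rate / operations per sample, per engine.

    `engines` is an illustrative parameter: it assumes the work divides
    evenly across identical processors with no overhead.
    """
    return engines * clock_hz / ops_per_sample


CLOCK_HZ = 300e6  # the 300 MHz processor from the plot
for taps in (8, 64, 256):
    single = max_sample_rate(CLOCK_HZ, taps) / 1e6
    dual = max_sample_rate(CLOCK_HZ, taps, engines=2) / 1e6
    print(f"{taps:3d} coefficients: {single:8.3f} MS/s single, {dual:8.3f} MS/s dual")
```

Under these assumptions a single 300 MHz processor running a 256-coefficient filter tops out a little above 1 MS/s, and doubling the processors only doubles that figure.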

6 Multiply Accumulate Multiple Engines
Parallel processing maximizes data throughput: supports any level of parallelism; optimal performance/cost tradeoff. 256-tap FIR filter: 256 multiply and accumulate (MAC) operations per data sample; one output every clock cycle. Flexible architecture: distributed DSP resources (LUTs, registers, multipliers, and memory). [Diagram: Data In feeds registers Reg0, Reg1, Reg2, ..., Reg255, each multiplied by its coefficient C0, C1, C2, ..., C255; all 256 MAC operations in one clock cycle; Data Out]
The performance of FPGAs can reach up to 500 billion MACs per second in the largest Xilinx Virtex™-II FPGA (XC2V8000), which is significantly higher than that of conventional DSPs available from mainstream suppliers. FPGAs achieve this by using a more parallel architecture to process the incoming signal. FPGAs give you the flexibility to design for a wide spectrum of sample rates, from multi-cycle to single-cycle implementations. With this architecture, each sample in the 256-tap FIR filter example can be processed in a single clock cycle, significantly improving DSP performance.
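The fully parallel structure (one register and one multiplier per tap) can be modelled as follows. This is a behavioural sketch under the assumption that every product and the final sum complete within one clock; the name `fir_parallel` and the 4-tap coefficients are hypothetical and chosen only to keep the example small.

```python
def fir_parallel(samples, coeffs):
    """Model a FIR filter with one register and one multiplier per tap.

    Each loop iteration stands for one clock cycle: the register chain
    shifts, every tap multiplies at once, and the sum forms one output.
    """
    regs = [0] * len(coeffs)                 # Reg0 .. Reg(N-1)
    outputs = []
    cycles = 0
    for x in samples:
        regs = [x] + regs[:-1]               # shift the register chain
        products = [c * r for c, r in zip(coeffs, regs)]  # all taps at once
        outputs.append(sum(products))        # adder tree
        cycles += 1
    return outputs, cycles


# Feeding an impulse returns the coefficients, one output per clock.
impulse_response, clocks = fir_parallel([1, 0, 0, 0, 0], [1, 2, 3, 4])
print(impulse_response, clocks)
```

The cycle count now equals the number of samples regardless of the number of taps; the cost has moved from time to hardware (one multiplier per coefficient).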

7 Outline Power of Parallelism
Platform FPGA Virtex-II/Virtex-II Pro Series Spartan-3 Architecture Why should I use FPGAs for DSP?

8 Virtex-II Platform FPGA (1)
Active Interconnect™: fully buffered; fast, predictable
Powerful CLB: 8 LUTs; 128b distributed RAM; wide input functions (32:1); support for slice-based multipliers
Block RAM: 18 Kbit, True Dual Port; up to 3 Mbits per device
Multipliers: 18b x 18b multiplier; 200+ MHz pipelined
[Diagram: switch matrix serving CLB, IOB, and DCM; a CLB with its switch matrix and slices S0 to S3; BRAM]
Typical DSP functions (e.g., FFT) are based on multipliers, adders, and memory (MAC). Key advancements in the Virtex™-II Platform FPGA make it very attractive for such DSP solutions: embedded multipliers and multiplier fabric within the CLB, larger block RAMs, and very fast distributed RAMs.

9 Virtex-II Platform FPGA (2)
16 Global Clocks: 16 clocks; eight clocks to any quadrant; switch glitch-free between clocks
DCM: zero delay clock; precision phase shift; frequency synthesis; duty cycle correction; clock multiply and divide
DCI: on-chip termination; guaranteed signal integrity; eliminates 100s of resistors
Other advantages come with the new clocking scheme, the clock managers, and the new Xilinx digitally controlled impedance (DCI) capability for board resistor matching. Key points:
1) Sixteen pre-engineered clock domains: Each Virtex™-II device has 16 pre-engineered clock domains to support the multiple frequency and multiple phase requirements of complex system designs. Each built-in, low-skew clock network eliminates complex clock tree analysis and simplifies the system design process.
2) The perfect choice for high-speed clock design: Virtex-II devices contain up to 12 Digital Clock Managers (DCMs). Each DCM provides phase shifting and frequency synthesis capabilities, which are ideally suited for systems with multiple clock domains and critical timing requirements. The DCM delivers unsurpassed flexibility for managing both on-chip and off-chip clock synchronization. Virtex-II DCMs support 400+ MHz clock outputs to enable leading-edge bus interface standards such as RapidIO and SPI-4. One very useful feature in DSP design is the ability to time-share some logic by clocking it on both the rising and falling edges of the clock.
3) DCI, the industry's first digitally controlled impedance technology: The Xilinx Digitally Controlled Impedance technology in the Virtex-II solution dynamically eliminates drive strength variation due to process, temperature, and voltage fluctuation. DCI uses two external high-precision resistors to incorporate equivalent input and output impedance internally for hundreds of I/O pins.

10 Virtex-II Memory Hierarchy
Distributed RAM
True-Dual Port™ synchronous block RAM: 16k x 1, 8k x 2, 4k x 4, 2k x 9, 1k x 18, 512 x 36
High-performance external memory interfaces: DDR SDRAM, ZBT® SRAM, QDR SRAM
Multiple types of memories and memory interfaces are available on the Virtex™-II platform FPGA. Distributed RAMs are extremely fast and available in each CLB, and hence accessible anywhere in the device. In DSP designs, they are very efficient and often used for small FIFOs, DSP coefficient storage, CAM, pipeline compensation, PN generators (LFSR), serial frame synchronizers, and z^-n sample delays. Block memories have true dual-port capabilities and allow the storage of large amounts of data. These are very useful for block-type algorithms where hundreds of samples must be stored before processing. They are often used in DSP designs to implement large sin/cos/log tables, large FIFOs, packet buffers, and video line buffers. Finally, high-performance external memory interfaces permit easy and fast access to external memories when the internal ones become a limitation, for uses such as packet buffering, program store, and sample storage.
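The block RAM aspect ratios listed above can be checked with a little arithmetic. The sketch below only multiplies depth by width for each configuration on the slide; the interpretation in the comment (that the extra bits in the x9, x18, and x36 modes are one parity bit per byte) is an assumption not stated in the transcript.

```python
# (depth, width) pairs as listed on the slide
CONFIGS = [(16 * 1024, 1), (8 * 1024, 2), (4 * 1024, 4),
           (2 * 1024, 9), (1024, 18), (512, 36)]


def total_bits(depth, width):
    """Number of storage bits addressed by one depth x width configuration."""
    return depth * width


for depth, width in CONFIGS:
    bits = total_bits(depth, width)
    # Assumption: widths that are multiples of 9 use one extra bit per byte
    # (commonly described as parity), which is why they reach 18 Kbit.
    print(f"{depth:6d} x {width:2d} = {bits:6d} bits ({bits / 1024:g} Kbit)")
```

The narrow modes address 16 Kbit and the x9, x18, and x36 modes address the full 18 Kbit, so every option fits in the same physical block.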

11 Virtex-II CLB
Flexible resources: wide-input functions (16:1 multiplexer in 1 CLB); fast arithmetic functions (two dedicated carry chains); cascadable shift registers in LUT (128-b shift register in 1 CLB)
Ease of performance: direct routing enabling high speed
[Diagram: switch matrix, TBUF, slices S0 to S3 with fast connects, CIN and COUT carry connections, SHIFT chain]
Become familiar with the FPGA terminology, because it is what will be used to define and compare design sizes.

12 Virtex-II Slice
Each slice contains two: four-input lookup tables (16-bit distributed SelectRAM, 16-bit shift register); registers (each register: D flip-flop or latch); dedicated logic (muxes, arithmetic logic, MULT_AND, carry chain)
[Diagram: G and F LUTs (RAM16, SRL16), MUXFx, MUXF5, carry (CY), arithmetic logic, and two registers]

13 Unique Distributed RAM
LUTs used as memory inside the fabric: flexible, can be used as RAM, ROM, or shift register; distributed memory with fast access time; cascadable with built-in CLB routing
Applications: linear feedback shift register; distributed arithmetic; time-shared registers; small FIFO; digital delay lines (z^-1)
Per LUT: RAM16 (16b) or SRL16 (16b shift register). Per CLB: 128b single-port RAM, 64b dual-port RAM, or 128b shift register
LUT: can be used as ROM for storing coefficients; implementing combinatorial logic. RAM/SRL16: for storing samples or as delay lines
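To make the "LUT as ROM" idea concrete, a four-input LUT can be viewed as a 16 x 1 memory addressed by its inputs. The sketch below is a software analogy only; the names `lut_read`, `AND4`, and `XOR4` are hypothetical, and packing the contents into a 16-bit integer with address 0 in the least significant bit is an assumption made for the example.

```python
LUT_BITS = 16        # one four-input LUT stores 16 bits
LUTS_PER_CLB = 8     # per the Virtex-II CLB slide


def lut_read(init, addr):
    """Read one bit of a four-input LUT whose 16-bit contents are `init`."""
    if not 0 <= addr < LUT_BITS:
        raise ValueError("address must be in the range 0..15")
    return (init >> addr) & 1


AND4 = 0x8000        # only address 15 (all inputs high) returns 1
XOR4 = 0x6996        # returns the parity of the four address bits

print(lut_read(AND4, 15), lut_read(AND4, 7))
print([lut_read(XOR4, a) for a in range(16)])
print(LUT_BITS * LUTS_PER_CLB, "bits of distributed RAM per CLB")
```

The same 16 bits serve as a truth table for logic or as a tiny coefficient ROM, and eight such LUTs account for the 128b per CLB quoted on the slide.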

14 The SRL16E
The SRL16E is excellent for implementing efficient DSP functions: a very efficient way to delay data samples; shifting samples and scanning at a faster rate. The SRLC16E is cascadable.
[Diagram: 16 flip-flops in a chain, each enabled by CE; input D; address A[3:0] (0000 to 1111) selecting output Q; cascade output Q15]
The 16 SRAM cells have been organized into a shift register. The CE input is used, in conjunction with the clock, to write data into the first flip-flop and to move all other data right by one position. Because this is a predictable operation, no address is required for writing. Reading the flip-flop contents is completely independent: the address selects which flip-flop is read. Notice again that the read process is asynchronous, but the dedicated flip-flop is available to synchronize it. In addition, the dedicated Q15 output pin permits cascading multiple SRL16Es to make a longer shift register. The SRL16E's shift and scan ability is perfect for creating data buffers. Note that the SRLC16E is not available in Virtex™ or Spartan™-II devices.
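The shift, addressed read, and Q15 cascade behaviour described above can be sketched as a small behavioural model. This is an illustration, not a vendor primitive: the class name `SRL16E`, its methods, and the helper `clock_cascade` are hypothetical, and initial contents of zero are assumed.

```python
class SRL16E:
    """Behavioural model of a 16-bit addressable shift register."""

    def __init__(self):
        self.cells = [0] * 16             # cells[0] holds the newest bit

    def clock(self, d, ce=1):
        """One clock edge: shift only when clock enable is asserted."""
        if ce:
            self.cells = [d] + self.cells[:-1]

    def q(self, addr):
        """Asynchronous read: A[3:0] selects which cell drives Q."""
        return self.cells[addr]

    @property
    def q15(self):
        """Fixed tap on the last cell, used for cascading."""
        return self.cells[15]


def clock_cascade(stages, d, ce=1):
    """Clock several stages together; each takes the previous stage's Q15."""
    inputs = [d] + [s.q15 for s in stages[:-1]]   # sampled before the edge
    for stage, bit in zip(stages, inputs):
        stage.clock(bit, ce)


srl = SRL16E()
for bit in [1, 0, 1, 1, 0]:
    srl.clock(bit)
print([srl.q(a) for a in range(5)])       # newest bit first
```

Reading address `a` returns the bit written `a` shifts before the most recent one, so changing the address changes the delay without touching the write side.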

15 Multiplier Unit
Embedded 18-bit x 18-bit multiplier: 2s complement signed operation; 4- to 18-bit operands; combinational and pipelined options; operates with block RAM and fabric to implement the MAC function. [Diagram: two 18-bit inputs, one 36-bit output]
Quantity: Virtex-II, up to 168 (2V40: 4; 2V8000: 168); Virtex-II Pro, up to 556 (2VP2: 12; 2VP125: 556)
Signed multiply performance, pipelined multiplier with registered inputs and outputs (preliminary V1.60 speeds file): Virtex-II Pro 18 x 18, 300 MHz; Virtex-II 18 x 18, 245 MHz
The multiplier unit is useful in digital filter implementations. The embedded multipliers can be used as an alternative resource when very high performance is not needed. For very high performance, multipliers can be implemented in LUTs, whereas for reasonable performance, embedded multipliers can be used, leaving LUTs available to implement logic and coefficients. Note that the performance figures above apply to multipliers used in isolation. In a complex design, the overall design may not run at that rate. See application note XAPP636, Optimal Pipelining of I/O Ports of the Virtex™-II Multiplier, to achieve the stated performance.
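The bit widths on this slide (18-bit operands, 36-bit product, 2s complement) can be illustrated with a short arithmetic model. It is a sketch of the number format only and says nothing about timing; the helper names `to_signed` and `mult18x18` are hypothetical.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of `value` as a 2s complement number."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):
        return value - (1 << bits)
    return value


def mult18x18(a, b):
    """Signed 18-bit x 18-bit multiply with a 36-bit signed result."""
    product = to_signed(a, 18) * to_signed(b, 18)
    return to_signed(product, 36)          # never overflows for 18-bit inputs


print(mult18x18(0x3FFFF, 2))               # 0x3FFFF is -1 as an 18-bit value
print(mult18x18(-131072, -131072))         # largest magnitude product
```

The most negative operand squared gives 2^34, which still fits in a 36-bit signed result, so the product needs no saturation or rounding.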

16 Virtex-II Family
Virtex-II part number: XC2V 40, XC2V 80, XC2V 250, ...
LUTs + FFs: 512 1,024 3,072 6,144 10,240 15,360 21,504 28,672 46,080 67,584 93,184
BRAM (kb): 72 144 432 576 720 864 1,008 1,728 2,160 2,592 3,024
Multipliers: 4 8 24 32 40 48 56 96 120 168
DCM Units: 12
Package and available SelectIO: CS144 88 92 FG256 172 FG456 200 264 324 FG676 392 456 484 FF896 528 624 FF1152 824 FF1517 912 1,104 1,108 BG575 328 408 BG728 516 BG957 684
Key points:
1. Eleven (11) members, all pin compatible (same package)
2. "Pinout-array compatibility" for FG256 and FG456, and for FF896 and FF1152
3. Notice the large number of embedded multipliers and the large amount of block RAM as you move up the family.
4. There are as many block RAMs as embedded multipliers.

17 Virtex-II Pro versus Virtex-II
Lower cost
Up to 24 RocketIO™ embedded multi-gigabit transceivers
Up to four PowerPC
More memory: 10 Mbits in block RAM; 1,738 Kbits in distributed RAM
More I/O pins per package (up to 1704)
More block RAMs and multiplier blocks: 556 embedded multipliers; 556 block RAMs
Smaller technology: 0.13 µm nine-layer copper process with 90 nm high-speed transistors
• Flexible Logic Resources
- Up to 111,232 internal registers/latches with Clock Enable
- Up to 111,232 look-up tables or cascadable variable (1 to 16 bits) shift registers
- Wide multiplexers and wide-input function support
- Horizontal cascade chain and Sum-of-Products support
- Internal 3-state busing
• SelectIO™-Ultra Technology
- Up to 1,200 user I/Os
- Twenty-two single-ended standards and six differential standards
- Programmable LVCMOS sink/source current (2 mA to 24 mA) per I/O
- Differential signaling
· 840 Mb/s LVDS I/O with current mode drivers
· Bus LVDS I/O
· HyperTransport (LDT) I/O with current driver buffers
· Built-in DDR input and output registers
- Proprietary high-performance SelectLink technology for communications between Xilinx devices
· High-bandwidth data path
· Double Data Rate (DDR) link
· Web-based HDL generation methodology

18 Virtex-II Pro Family
Virtex-II Pro: 2VP2 2VP4 2VP7 2VP20 2VP30 2VP40 2VP50 2VP70 2VP100 2VP125
Logic Cells: 3,168 6,768 11,088 20,880 30,816 43,632 53,136 74,448 99,216 125,136
PPC405: 1 2 4
MGT3.125Gb: 8 12* 16* 20 20* 24*
BRAM (kb): 216 504 792 1,584 2,448 3,456 4,176 5,904 7,992 10,008
Multipliers: 12 28 44 88 136 192 232 328 444 556
DCM Units / Package / MGT / Available SelectIO: FG256 140 FG456 156 248 FF672 204 348 396 FF896 FF1152 564 692 FF1148* 804 812 FF1517 16 852 964 FF1704 996 1040 FF1696* 1164 1200
* FF1148 and FF1696 special bond option: no MGT, with maximum SelectIO
Note: there are as many block RAMs as embedded multipliers.
PPC405: PowerPC 405 CPU core
MGT3.125Gb: multi-gigabit transceivers providing data rates of up to 3.125 Gbps
Logic Cells: fundamental logic elements; each cell consists of one LUT, one FF, and carry logic. Each cell is equivalent to half of a slice.
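The note that there are as many block RAMs as embedded multipliers can be checked against the table rows. The sketch below assumes each block RAM holds 18 Kbit, the size quoted on the Virtex-II slide and taken here to apply to Virtex-II Pro as well; the variable names are illustrative.

```python
BLOCK_KBITS = 18  # assumed block RAM size, from the Virtex-II slide

parts = ["2VP2", "2VP4", "2VP7", "2VP20", "2VP30",
         "2VP40", "2VP50", "2VP70", "2VP100", "2VP125"]
bram_kbits = [216, 504, 792, 1584, 2448, 3456, 4176, 5904, 7992, 10008]
multipliers = [12, 28, 44, 88, 136, 192, 232, 328, 444, 556]

# Divide each device's total block RAM by the block size to count blocks.
bram_blocks = [kb // BLOCK_KBITS for kb in bram_kbits]

for part, blocks, mults in zip(parts, bram_blocks, multipliers):
    print(f"{part:7s} block RAMs: {blocks:3d}  multipliers: {mults:3d}")
```

Under that assumption every BRAM total divides evenly by 18 and the resulting block count matches the multiplier count for all ten devices.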

19 Outline Power of Parallelism
Platform FPGA Virtex-II/Virtex-II Pro Series Spartan-3 Architecture Why should I use FPGAs for DSP?

20 Spartan-3 versus Virtex-II
Lower cost
Lower core voltage: Vccint = 1.2V versus 1.5V
Different I/O standard support: new standards (1.2V LVCMOS, 1.8V HSTL and SSTL); default is LVCMOS, versus LVTTL
Smaller technology: 0.09 micron versus 0.15 micron
More I/O pins per package
Only half of the slices support RAM or SRL16s (SLICEM)
Fewer block RAMs and multiplier blocks: same size and functionality
Half the global clock buffers: eight versus 16
Fewer DCM blocks
No internal three-state buffers: three-state buffers are in the I/O
SLICEM is described on the next page.
Block RAM and multipliers: All other devices contain two columns of block RAM and multiplier resources, which are located near the right and left edges of the die.
DCMs: The smallest Spartan™-3 device (XC3S50) does not contain any DCMs. All other devices contain four DCMs; these are located on the top and bottom edges of the die, above and below the block RAM and multiplier columns.
CLBs: Half the slices (the ones at the bottom or right of the CLB) have no RAM or shift register (SRL16) capability. There are 64 bits of RAM per CLB instead of 128. There is no 32x1 or 64x1 dual-port RAM in a single CLB, although these can be implemented across multiple CLBs. There is a 64-bit SRL per CLB instead of 128. The slice with RAM is referred to in the data sheet as the left-hand slice and in the software as the SLICEM. The slice without RAM is referred to in the data sheet as the right-hand slice and in the software as the SLICEL.

21 SLICEM and SLICEL
Each Spartan™-3 CLB contains four slices, similar to the Virtex™-II device
Slices are grouped in pairs: left-hand SLICEM (memory), whose LUTs can be configured as memory or SRL16; right-hand SLICEL (logic), whose LUTs can be used as logic only
[Diagram: switch matrix; left-hand SLICEM and right-hand SLICEL; slices X0Y0, X0Y1, X1Y0, X1Y1; fast connects; CIN and COUT; SHIFTIN and SHIFTOUT]
Less block RAM: same size (18K); fewer columns (2-4 per device vs 4-6 per device); same functionality and modes
Half to one-third the DCMs (4 vs. 8-12): no DSS Digital Spread Spectrum (not supported in Virtex-II)
Half the global clock buffers (8 instead of 16): all 8 route to any CLB; no BUFGS secondary buffers (GCLK pins are GCLK0, etc., instead of GCLKP0, etc.); different muxing of clock buffers (lower skew); any GCK on top goes to any DCM on top, and likewise for the bottom (Virtex-II is limited to quadrant)
Fewer interconnect lines: 96 hex and 32 double per channel instead of 120 hex and 40 double; less connectivity on interconnect

22 Spartan-3 Family
Spartan-3 part number: XC3S 50, XC3S 200, XC3S 400, ...
Logic Cells: 1,728 4,320 8,064 17,280 29,952 46,080 62,208 74,880
BRAM (kb): 72 216 288 432 576 720 1,872
Multipliers: 4 12 16 24 32 40 96 104
DCM Units: 2
Package and available SelectIO: VQ100 63 TQ144 97 PQ208 124 141 FT256 173 FG456 264 333 FG676 391 487 489 FG900 565 633 FG1156 712 784

23 Outline Power of Parallelism
Platform FPGA Virtex-II/Virtex-II Pro Series Spartan-3 Architecture Why should I use FPGAs for DSP?

24 FPGAs Mean Parallelism
Reason 1: FPGAs handle high computational workloads. 256-tap FIR filter example:
Conventional DSP device (Von Neumann architecture): Data In, Reg, MAC unit, Data Out; 256 loops needed to process samples
FPGA: Data In, Reg0, Reg1, Reg2, ..., Reg255 with coefficients C0, C1, C2, ..., C255, Data Out; all 256 MAC operations in 1 clock cycle

25 FPGAs are Ideal for Multi-channel DSP designs
FPGAs are also ideally suited for multi-channel DSP designs. Many low sample rate channels can be multiplexed (e.g. TDM) and processed in the FPGA at a high rate. Interpolation (using zeros) can also drive sample rates higher.
[Diagram: four channels (ch1 to ch4) of 20 MHz samples, each with its own LPF, alongside a single multi-channel filter (LPF) running on 80 MHz samples]
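The time-division multiplexing idea (several slow channels sharing one fast filter) can be sketched as below. This is a minimal software model under stated assumptions: the function names `tdm_interleave` and `multichannel_fir` are hypothetical, the 2-tap coefficients are placeholders, and the shared datapath is modelled by keeping one delay line per channel.

```python
def tdm_interleave(channels):
    """Interleave equal-length channels sample by sample (ch1, ch2, ...)."""
    return [s for frame in zip(*channels) for s in frame]


def multichannel_fir(stream, coeffs, n_channels):
    """Run one shared filter over an interleaved stream.

    The arithmetic is shared, but each channel keeps its own delay line
    so that samples from different channels never mix.
    """
    delay = [[0] * len(coeffs) for _ in range(n_channels)]
    out = []
    for i, x in enumerate(stream):
        ch = i % n_channels                      # whose time slot this is
        delay[ch] = [x] + delay[ch][:-1]
        out.append(sum(c * d for c, d in zip(coeffs, delay[ch])))
    return out


channels = [[1, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4]]
stream = tdm_interleave(channels)                # runs at 4 x the channel rate
filtered = multichannel_fir(stream, [1, 1], n_channels=4)
per_channel = [filtered[ch::4] for ch in range(4)]
print(per_channel)
print(4 * 20, "MHz shared rate for four 20 MHz channels")
```

Four 20 MHz channels interleave into one 80 MHz stream, so a single filter clocked four times faster does the work of four separate filters.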

26 Why FPGAs for DSP? (2)
Reason 2: Tremendous flexibility. Q = (A x B) + (C x D) + (E x F) + (G x H) can be implemented in parallel. [Diagram: inputs A to H feed multipliers whose products are summed into Q] But is this the only way in the FPGA?

27 Customize Architectures to Suit Your Ideal Algorithms
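FPGAs allow area (cost) / performance tradeoffs: parallel, semi-parallel, or serial implementations. Optimized for speed or for cost?
[Diagram: parallel, semi-parallel, and serial datapaths built from multipliers, adders, and D/Q registers, arranged along an axis from speed to cost]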
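The area versus performance trade-off can be illustrated with the sum of products Q = (A x B) + (C x D) + (E x F) + (G x H) from the preceding slide. The sketch below assumes each multiplier handles one product per clock and that an accumulator collects partial sums between clocks; the function name `sum_of_products` and the operand values are made up for the example.

```python
def sum_of_products(pairs, n_multipliers):
    """Compute sum(a * b) using a fixed number of multipliers.

    Returns (result, cycles). Each cycle uses up to `n_multipliers`
    multipliers in parallel and adds their products to an accumulator.
    """
    acc = 0
    cycles = 0
    for start in range(0, len(pairs), n_multipliers):
        batch = pairs[start:start + n_multipliers]
        acc += sum(a * b for a, b in batch)      # products formed this clock
        cycles += 1
    return acc, cycles


pairs = [(1, 2), (3, 4), (5, 6), (7, 8)]         # (A,B), (C,D), (E,F), (G,H)
for name, mults in (("parallel", 4), ("semi-parallel", 2), ("serial", 1)):
    q, cycles = sum_of_products(pairs, mults)
    print(f"{name:13s}: {mults} multiplier(s), {cycles} cycle(s), Q = {q}")
```

All three versions produce the same Q; halving the multipliers doubles the cycles, which is the cost versus speed dial the slide describes.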

28 Why FPGAs for DSP? (3)
Reason 3: Lower system cost through integration
[Diagram: a DSP card and a network card shown with discrete parts (FPGA, DSP processors, ASSPs, DDC, DUC, MACs, A/D, D/A, AFE, SDRAM, PowerPC, quad TRx, SSTL3 translators, control, and hundreds of termination resistors) and with an FPGA integrating PowerPC, MACs, DUCs, DDCs, logic, control, PL4, CORBA, and 3.125 Gbps links alongside A/D, D/A, and SDRAM]

