
1 Designing for 100+ MHz with Xilinx Virtex

2 1999 Designs Demand...  Higher system speed  Higher integration —smaller size, less power, better reliability  Lower cost  Shorter development time  Better product differentiation

3 Traditional Multi-Chip Boards  Discrete design components —CPU, memory —bus transceivers, PCI controller, FIFOs —Ethernet controller, Graphics accelerator, MPEG, DSP, etc. —programmable logic as glue and custom function  Advantages: —well-documented sophisticated functions —readily available as IP in silicon

4 Multi-Chip Board Problems  Physical size  Power consumption and reliability  PC board signal integrity  Limited flexibility —prevents design modifications and upgrades —prevents product diversification —prevents product customization  Poor product differentiation —standard parts = standard architecture

5 The FPGA Solution  4th-generation FPGA: logic + memory + routing  Multi-standard SelectI/O  Temperature sensing  Delay-locked loop for fast clock and I/O  3.3 ns synchronous dual-port SRAM  500 Mbps SelectMAP configuration

6 DLLs Maximize I/O Speed  Clock-to-output time plus set-up time determines the I/O speed and data bandwidth —min clock period = max clock-to-out + max set-up  Traditional solution: —use highly buffered, balanced clock trees –needed to reduce internal clock skew –cannot totally eliminate the delay  The Virtex solution: —use a Delay-Locked-Loop ( DLL ) –aligns the internal and external clocks –effectively eliminates the clock-distribution delay
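
The timing relationship on this slide can be checked numerically. The Python sketch below is illustrative only: the function name is invented for the example, and the 1.0 ns set-up time is an assumed placeholder because the slides do not give one. The 3.9 ns and 1.4 ns clock-to-output values are the LVTTL figures quoted on slide 8.

```python
def max_io_frequency_mhz(clock_to_out_ns: float, setup_ns: float) -> float:
    """Highest I/O clock rate allowed by: min period = clock-to-out + set-up."""
    min_period_ns = clock_to_out_ns + setup_ns
    return 1000.0 / min_period_ns  # a period in ns converts to MHz as 1000 / ns


ASSUMED_SETUP_NS = 1.0  # placeholder value, not taken from the slides

without_dll = max_io_frequency_mhz(3.9, ASSUMED_SETUP_NS)
with_dll = max_io_frequency_mhz(1.4, ASSUMED_SETUP_NS)
print(f"without DLL: {without_dll:.0f} MHz, with DLL: {with_dll:.0f} MHz")
```

Under these assumptions, removing the clock-distribution delay from the clock-to-output path roughly doubles the achievable I/O clock rate.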

7 Virtex Has 4 Independent DLLs  DLLs adjust the clock delay to align the internal and external clocks —digital closed-loop control —25 to 200-MHz range, 35-picosecond resolution  [Diagram labels: Clock, Data, Comparator, Error, Delay, CLB, IOB]

8 LVTTL Data Rate with DLL  1.4 ns measured clock-to-output delay  Output standard = LVTTL Fast 16 mA (OBUF_F_16)  Temp=100C, Vdd=2.375V, Vcco=3.3V  Waveforms: 1: CLKIN, 2: DATA OUT (no DLL), 3: DATA OUT (DLL deskewed)
Timing    w/o DLL    w/ DLL
r->r      3.9 ns     1.4 ns
r->f      3.9 ns     1.4 ns

9 Other DLL Functions  Double the incoming clock frequency —fast internal operation – slow external clock  Clock mirroring to the PCB  Divide clock by 1.5, 2, 2.5, 3, 4, 5, 8, or 16  Adjust clock duty cycle to 50-50  Create four quadrature clock phases —input four sequential bits per clock period

10 Duty Cycle Correction  ~25% duty cycle in, 50% duty cycle out  [Diagram labels: 25 MHz 25% Duty Cycle, Virtex FPGA, DLL (1X), 25 MHz 50% Duty Cycle]

11 Clock Doubling and Mirroring  Clock mirror with less than 100 ps skew —simplifies PCB clock distribution  Actual HDTV customer example  [Diagram labels: Virtex, Zero-Delay Internal Clock Buffer, 37 MHz System Clock (1 input load), DLL 1, DLL 2, 74 MHz #1, 74 MHz #2, 74 MHz Internal, 37 MHz Internal, SDRAM, SDRAM, Inside FPGA, Exactly Aligned]

12 Precise Clock Mirroring  2x system clock for board use  [Diagram labels: 66 MHz Clock, Virtex FPGA, DLL (2X), 132 MHz Clock]

13 Clock Division  Divide clock by 1.5, 2, 2.5, 3, 4, 5, 8, or 16 —maintain synchronous edges  [Diagram labels: CLKIn 200 MHz, CLKout 200 MHz, CLKDV 12.5 MHz]
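
As a quick check of the divider arithmetic, the following Python sketch lists the divided clock for each ratio named on the slide. The function name is invented for illustration. The 200 MHz input is the CLKIn value from the slide's diagram, and dividing it by 16 gives the 12.5 MHz CLKDV value shown there.

```python
# Divide ratios listed on the slide.
DIVIDE_RATIOS = (1.5, 2, 2.5, 3, 4, 5, 8, 16)


def divided_clocks_mhz(clkin_mhz: float) -> dict:
    """Map each supported divide ratio to the resulting clock frequency in MHz."""
    return {ratio: clkin_mhz / ratio for ratio in DIVIDE_RATIOS}


for ratio, freq_mhz in divided_clocks_mhz(200.0).items():
    print(f"divide by {ratio:>4}: {freq_mhz:7.2f} MHz")
```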

14 Multi-Standard SelectI/O  [Diagram labels: GTL+, 5V Tolerant, 2.5V, SSTL, 1.8V, 3.3V, LVTTL, 5V; connected devices: MicroProcessor, SRAM, DSP, Mixed Signal, Busses/Backplanes (3/5V PCI, ISA, GTL…), FLASH, SDRAM]

15 Mix & Match Output Standards  User-supplied voltages determine output swing —3.3 V, 2.5 V, 1.5 V —one voltage per bank —a bank is half of a chip edge  Output characteristics are programmable on a per-pin basis —push-pull or open-drain —LVTTL drive strength –2-mA to 24-mA sink and source current —LVTTL Slew rate

16 Mix & Match Input Standards  Internal or user-supplied threshold voltage —selectable on a per-pin basis —one user-supplied threshold voltage per bank  Programmable over-voltage protection —5-V tolerant or diode clamp to VCCO —selectable on a per-pin basis  [Diagram labels: Internal Reference, VREF, Input, VREF]

17 SSTL Clock-to-Out With DLL  200 MHz inter-chip data rate —SSTL 3, Class II (SSTL = Stub Series Terminated Logic) —IOB register to IOB register  [Diagram labels: Clock, Virtex FPGA, DLL, D, Q, 2.8 ns, 1.9 ns, 0.3 ns]

18 SSTL Data Rate with DLL  1.3 ns measured clock-to-output delay —much lower noise than LVTTL  Output standard = SSTL 3 Class 2 (OBUF_SSTL3_II)  Temp=100C, Vdd=2.375V, Vcco=3.3V, Vtt=1.5V  Waveforms: 1: CLKIN, 2: DATA OUT (no DLL), 3: DATA OUT (DLL deskewed)
Timing    w/o DLL    w/ DLL
r->r      3.5 ns     1.1 ns
r->f      3.8 ns     1.3 ns

19 ‘Redefining the FPGA’: From FPGA to System Component  "Virtex moves FPGAs from glue to system component" - Ron Neale, EE  [Diagram labels: GTL+ High Speed System Backplane, Low Voltage CPU, LVTTL SDRAM (133MHz), SSTL3 Cache SRAM (Mbytes), LVCMOS Chip 1, x1 CLK, x2 CLK]

20 Power and Thermal Issues  Power and heat are serious concerns  All CMOS power consumption is dynamic —proportional to VCC² —proportional to capacitance —proportional to frequency  Virtex conserves power —2.5-V supply voltage —small geometries and short interconnects reduce capacitance
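
The three proportionalities on this slide combine into a simple scaling rule. The Python sketch below is a rough illustration: the function name is invented, and the 3.3 V comparison point is an assumption chosen for the example, since the slide only states that Virtex uses a 2.5 V supply.

```python
def relative_dynamic_power(v_new: float, v_old: float,
                           capacitance_ratio: float = 1.0,
                           frequency_ratio: float = 1.0) -> float:
    """Scale factor for dynamic power, taking P proportional to C * VCC^2 * f."""
    return (v_new / v_old) ** 2 * capacitance_ratio * frequency_ratio


# Assumed comparison: the same capacitance and frequency at 3.3 V versus 2.5 V.
ratio = relative_dynamic_power(2.5, 3.3)
print(f"2.5 V draws about {ratio:.0%} of the dynamic power of 3.3 V")
```

With capacitance and frequency held equal, the quadratic voltage term alone accounts for a reduction of a little over 40%.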

21 Virtex Power Consumption  Virtex is designed to conserve power —100 MHz 16-bit counters –12.5 MHz average transition rate –6.5 mW per counter including clock distribution —100 MHz 8-bit counters –25 MHz average transition rate –5 mW per counter including clock distribution
Device     Design                   Total power
XCV300     384 16-bit counters      2.5 W
XCV300     768 8-bit counters       3.7 W
XCV1000    1536 16-bit counters     9.8 W
XCV1000    3072 8-bit counters      14.7 W
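
The average transition rates quoted on this slide follow from how a binary counter toggles: bit 0 changes on every clock, bit 1 on every second clock, and so on. The Python sketch below reproduces those figures. The function names are invented for illustration, and the total-power helper is just the per-counter figure multiplied by the counter count, which matches the slide's totals only approximately.

```python
def average_toggle_rate_mhz(clock_mhz: float, width: int) -> float:
    """Mean transition rate per bit of a free-running binary counter."""
    # Bit i of a binary counter changes once every 2**i clock cycles.
    per_bit_rates = [clock_mhz / 2 ** i for i in range(width)]
    return sum(per_bit_rates) / width


def total_power_watts(counters: int, milliwatts_per_counter: float) -> float:
    """Total power for a number of identical counters."""
    return counters * milliwatts_per_counter / 1000.0


print(f"16-bit: {average_toggle_rate_mhz(100.0, 16):.2f} MHz average")
print(f" 8-bit: {average_toggle_rate_mhz(100.0, 8):.2f} MHz average")
print(f"384 x 6.5 mW = {total_power_watts(384, 6.5):.2f} W")
```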

22 Thermal Management  Temperature-sensing diode —matched to the Maxim MAX1617 A/D —programmable alarms —similar to the Pentium II solution  [Diagram labels: Virtex FPGA, DXP, DXN, Maxim MAX1617, SMBCLK, SMBDATA, ALERT]

23 Power Supply Decoupling  CMOS power-supply current is dynamic —current pulse every active clock edge  Peak current can be 5x the average current —instantaneous current peaks can only be supplied by decoupling capacitors  Use one 0.1 µF ceramic chip capacitor for each power-supply pin —low L and R are more important than high C —double up for lower L and R if necessary —use direct vias to the supply planes, close to the power-supply pins

24 Virtex Configuration  New byte-wide SelectMAP mode —up to 528 Mbps at 66 MHz –simple handshake protocol —up to 400 Mbps at 50 MHz –no handshake required  Configuration bit-stream length —0.5 Mbits to 6.1 Mbits  [Diagram labels: Virtex FPGA, WE, CS, Data, Busy, Configuration EPROM, CS, Address, Control Logic (EPLD)]
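
The SelectMAP rates on this slide are the byte-wide bus width multiplied by the configuration clock. The Python sketch below shows that arithmetic and a best-case configuration time. It is a rough estimate under stated assumptions: one byte is transferred on every clock, protocol overhead and handshake stalls are ignored, and 1 Mbit is taken as 10^6 bits. The function names are invented for the example.

```python
BUS_WIDTH_BITS = 8  # SelectMAP is byte-wide


def selectmap_rate_mbps(clock_mhz: float) -> float:
    """Peak transfer rate, assuming one byte per configuration clock."""
    return BUS_WIDTH_BITS * clock_mhz


def configuration_time_ms(bitstream_mbits: float, clock_mhz: float) -> float:
    """Best-case load time for a bit-stream of the given length."""
    return bitstream_mbits / selectmap_rate_mbps(clock_mhz) * 1000.0


for clock_mhz in (66, 50):
    rate = selectmap_rate_mbps(clock_mhz)
    longest = configuration_time_ms(6.1, clock_mhz)
    print(f"{clock_mhz} MHz: {rate:.0f} Mbps, 6.1 Mbits in about {longest:.1f} ms")
```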

25 Volts, Amps, and Watts: Recap  PCB design issues —minimize capacitance for higher speed —terminate transmission lines to reduce ringing  Chip inputs and outputs —use DLLs to maximize I/O bandwidth —use SelectI/O to interface with different standards  Power and thermal considerations —use the sensing diode to manage chip temperature —decouple the power supply well  Configuration —configure faster with the SelectMAP mode

26 Spending the 10 ns Budget  Fast logic requires fast function generators —signals often pass through several function generators  Routing delays must also be kept short —there are routing delays between every function generator  Arithmetic delays are important —carry chains often create critical paths

27 You Don’t Have To Be An Expert  You don’t have to be an FPGA architecture expert to implement high-performance designs —the benefits of a good architecture are automatic –all the logic goes faster –software provides easy access to the features  You can achieve high-performance only with a good FPGA architecture —a good FPGA empowers its users  You’ll design better if you know the architecture —matching your design style to the available features increases performance and/or lowers cost

28 Virtex CLB  Logic and arithmetic delay reduction demands improvements in the CLB  Virtex CLB is divided into two slices, each with: –2 function generators –2 flip-flops –2 bits of carry logic  [Diagram: four function generators, each paired with carry logic]

29 Fast Function Generators  Each function generator emulates 2 to 3 levels of logic —a 10-level logic path typically requires 3 to 5 Function Generators in series —at 100 MHz, they must be less than 2 ns each including the routing  Virtex has 0.6-ns function generators —leaves 1.4 ns for each route
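
The per-stage budget on this slide can be reproduced with a few lines of arithmetic. The Python sketch below is a simplification: it splits the clock period evenly across the function generators in series and ignores flip-flop clock-to-output and set-up time, which the slide does not itemize. The function name is invented for the example.

```python
def route_budget_ns(clock_mhz: float, function_generators: int,
                    function_generator_delay_ns: float) -> float:
    """Routing delay left per stage after the function generator's own delay."""
    period_ns = 1000.0 / clock_mhz
    per_stage_ns = period_ns / function_generators
    return per_stage_ns - function_generator_delay_ns


# Worst case from the slide: 5 function generators in series at 100 MHz.
print(f"{route_budget_ns(100.0, 5, 0.6):.1f} ns per route")
# A 3-deep path leaves considerably more slack per route.
print(f"{route_budget_ns(100.0, 3, 0.6):.1f} ns per route")
```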

30 Connecting Function Generators  Some functions need several function generators —F5 MUXs connect pairs of function generators –functions with 5 to 9 inputs —F6 MUXs connect all 4 function generators –functions with 6 to 17 inputs  [Diagram labels: four function generators, F5, F6]

31 Fast Local Routing  Local routing provides fast interconnects —in a CLB, function generators connect with minimal routing delays —fast paths between adjacent CLBs increase flexibility  [Diagram: eight function generators, each paired with carry logic]

32 Use Pipelining for Speed  Shorter clock periods mean doing less each period —create a pipeline structure —pipeline stages operate concurrently —more functions are done at the same time —throughput increases  All function generators have output flip-flops —most pipeline support is “free”

33 16-Bit Pipeline in One LUT  In directly cascaded pipelines the flip-flops are not free  One SRLUT can implement up to 16 bits of delay —shift data in and select the appropriate tap  [Diagram labels: 16-Bit Shift Register, Input, Output, Delay Select]

34 Fast Logic Needs Fast Routing  Our typical design with 3 to 5 CLBs needed an average routing delay of 1.4 ns or less —the Virtex routing architecture delivers this performance  Delay is independent of direction —dependably short delays

35 Go Farther, Faster  Virtex achieves its speed through a hierarchy of highly buffered routing resources —wires span 1, 2, or 6 CLBs  The Virtex routing architecture is designed for large arrays —today’s FPGAs are big… but tomorrow’s will be even bigger  Virtex is designed to maintain its performance even in very large arrays

36 No Routing Congestion  For high-speed applications, routing must be dependably fast —not just capable of being fast  In the past, high device utilization has caused routing congestion —critical nets might be forced to meander  Virtex minimizes these problems —abundant resources prevent congestion If it needs to be fast, it will be fast – automatically!

37 CLB Built-in Tri-State Busses  Bi-directional busses are supported directly by tri-state buffers built into each CLB —two drivers per CLB —segmentable every four CLB columns

38 Arithmetic – A Special Case  Adders, accumulators, counters, and comparators all depend on carry chains  Carry-chain logic is usually much deeper than the rest of the design —32 levels for a 16-bit ripple adder —too deep to use function generators at 100 MHz —arithmetic delays would limit performance  Dedicated carry logic provides the desired speed —16-bit adders can operate at up to 200 MHz register-to-register

39 Wide Arithmetic  64-bit adders would require 128 levels of logic —expensive complex carry schemes would be needed to preserve performance  Virtex minimizes the carry propagation delay —100 ps per bit pair —zero routing delay between CLBs  Minimal performance loss for each extra bit 16-bit adders operate at up to 200 MHz 64-bit adders operate at up to 135 MHz
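
The two adder speeds on this slide are consistent with the quoted carry delay, which a short calculation shows. The Python sketch below assumes a purely linear model: it starts from the 16-bit adder's 200 MHz period and adds 100 ps for each additional bit pair. That linear extrapolation is an assumption made for illustration, not a timing model from the slides, and it is only meant for widths of 16 bits or more.

```python
BASE_WIDTH_BITS = 16
BASE_FMAX_MHZ = 200.0               # 16-bit adder, register-to-register
CARRY_DELAY_PER_BIT_PAIR_NS = 0.1   # 100 ps per bit pair


def adder_fmax_mhz(width_bits: int) -> float:
    """Estimate adder speed by adding carry delay to the 16-bit period."""
    base_period_ns = 1000.0 / BASE_FMAX_MHZ
    extra_bit_pairs = (width_bits - BASE_WIDTH_BITS) / 2
    period_ns = base_period_ns + extra_bit_pairs * CARRY_DELAY_PER_BIT_PAIR_NS
    return 1000.0 / period_ns


for width in (16, 32, 64):
    print(f"{width}-bit adder: about {adder_fmax_mhz(width):.0f} MHz")
```

For 64 bits the estimate is 5 ns plus 24 bit pairs at 0.1 ns each, or 7.4 ns, which corresponds to the 135 MHz figure on the slide.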

40 Fast Address Decoders  Wide address decoders could slow operation —wide AND gates with invertible inputs  Virtex carry-chain MUXs can act as AND gates —combine function generator ANDs  64-bit decoders operate at up to 155 MHz  [Diagram: chain of carry MUXs with 0/1 inputs]

41 Speed Is Never Wasted  You can never have too much performance —excess performance can always be traded for size and cost reduction  Replace single-cycle functions with smaller multi-cycle versions —a 2-cycle multiplier is half the cost of a single-cycle multiplier Reduce costs by designing down to the performance you need

42 Creating a High-Speed Clock  Logic sometimes needs to operate faster than the available clock —multiple RAM accesses in a single cycle —low-speed PCB clock distribution for power or noise reduction  Virtex DLLs can double and redouble incoming clocks  [Diagram labels: 45 MHz, DLL1 (2X), DLL2 (2X), 90 MHz, 180 MHz]

43 Optimized for the Future  Deep sub-micron technology permits larger and larger array sizes —poses new circuit-design challenges —changes the rules of FPGA architecture  Across-chip routing is the most vulnerable —could easily limit design performance  Virtex is designed for long-term growth —even long, across-chip routes will remain fast Virtex is tomorrow’s FPGA … today!

44 10 ns is Long Enough  Virtex CLBs can implement relatively complex functions in 10 ns — 0.6 ns per 4-input function generator  Virtex offers fast interconnections —even across-chip when fully utilized —fast tri-state buses  Support for very fast arithmetic operations —16-bit adders at 200MHz

45 Implement Designs Automatically  You don’t have to be an FPGA wizard to use Virtex  Virtex is optimized for automated implementation —uniform structure –efficient mapping/synthesis —ample routing –simple placement and no congestion —predictable performance –effective synthesis  IP cores speed design even more —validated functionality with guaranteed performance

46 100+ MHz Memory  Virtex memory operates up to 200 MHz  High-speed memory has two benefits —data storage –“work-in-progress” –input/output buffers, FIFOs —accelerating complex functions –store pre-computed values in look-up tables

47 Data Storage Hierarchy Virtex supports 3 levels of memory hierarchy  On-chip SelectRAM + —small-to-medium memories —0.6-ns read access time  On-chip Block SelectRAM + —larger memories —true dual-ported operation —3.3-ns read access time  Fast SelectI/O interfaces to external RAM —DLL boosts memory bandwidth

48 SelectRAM+  SelectRAM+ uses CLB LUTs as user memory —16-deep RAMs —32-deep RAMs —16-deep dual-ported RAMs —16-deep shift registers  Cascadable for larger memories —128 or more words deep —uses logic resources for expansion

49 Block SelectRAM+  Up to 32 dual-ported 4096-bit RAM Blocks —synchronous read and write  True dual-port memory —each port has full read and write capability —different clocks for each port  Configurable aspect ratio —trade width for depth –4096 x 1 bit to 256 x 16 bits —separate configurations for each port  Dedicated routing for memory expansion
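
The aspect-ratio trade on this slide keeps the block at 4096 bits while width and depth change in opposite directions. The Python sketch below enumerates the configurations between the two endpoints the slide names. Treating the intermediate widths as powers of two is an assumption of this example, and the function name is invented.

```python
BLOCK_BITS = 4096  # capacity of one Block SelectRAM+


def aspect_ratios(min_width: int = 1, max_width: int = 16) -> list:
    """List (depth, width) pairs that each use the full 4096-bit block."""
    ratios = []
    width = min_width
    while width <= max_width:
        ratios.append((BLOCK_BITS // width, width))
        width *= 2  # assumed: widths step in powers of two
    return ratios


for depth, width in aspect_ratios():
    print(f"{depth:>4} x {width}")
```

Because each port can be configured separately, a block written 16 bits wide and read 8 bits wide yields two reads per write, which is the width-for-rate conversion described on the later buffer slides.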

50 High-Speed Memory Interfaces  SelectI/O and DLLs together provide fast access to many types of external memory  Xilinx currently offers two reference designs —fully synthesized —automatic placement and routing  SDRAM: up to 125 MHz  ZBTRAM (Zero Bus Turn-around): up to 143 MHz

51 Input/Output Data Buffers  High-performance systems need data buffers to decouple internal operation from I/O activity —I/O may be sporadic (burst-mode busses) —I/O may be faster or slower —I/O may be wider or narrower  I/O buffers can take several forms —dual-ported RAMs —ping-pong buffers —FIFOs

52 Dual-ported I/O Buffers  Block SelectRAM+ is ideal for I/O buffers —dual-ported operation –independent clocks and controls –bridges between clock domains –simultaneous read and write —port-specific aspect-ratio control –built-in rate/width conversions  SelectRAM+ provides similar benefits on a smaller scale

53 Ping Pong Buffers  Ping-pong buffers are pairs of blocks that alternate between input and processing  SRLUT for small buffers —self-addressing input —0.6-ns read access  Larger buffers can use the dual-ported Block RAM —one address bit alternates read/write areas —3.3-ns read access  [Diagram labels: 16-Bit Shift Register, Select, Read Address, Input, Output]

54 Small FIFOs in SRLUTs  Small FIFOs can be implemented in SRLUTs —word count addresses the output data —increment and enable SRLUT to push —decrement to pop —enable only for both  16-byte FIFO in 4 CLBs —16 x 16 in 6 CLBs —200+ MHz  Expandable for deeper FIFOs  [Diagram labels: 16-Bit Shift Register, Input, Output, Word Counter, Up, Down, Push, Pop]
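
The push and pop rules on this slide can be modelled in software to see why a word counter is enough to address the output. The Python sketch below is a behavioural model only, not a hardware description: the class and method names are invented, and addressing the output at word count minus one is this model's assumption about how the counter selects the tap.

```python
class SrlFifo:
    """Behavioural model of a 16-deep FIFO built on an addressable shift register."""

    DEPTH = 16

    def __init__(self):
        self.shift_register = [0] * self.DEPTH  # index 0 holds the newest word
        self.word_count = 0

    def _shift_in(self, word):
        # Enabling the shift register moves every stored word one tap deeper.
        self.shift_register = [word] + self.shift_register[:-1]

    def _output(self):
        # The word count selects the tap holding the oldest stored word.
        return self.shift_register[self.word_count - 1]

    def push(self, word):
        if self.word_count == self.DEPTH:
            raise OverflowError("FIFO full")
        self._shift_in(word)     # enable the shift register
        self.word_count += 1     # and increment the counter

    def pop(self):
        if self.word_count == 0:
            raise IndexError("FIFO empty")
        word = self._output()
        self.word_count -= 1     # decrement only, no shift
        return word

    def push_pop(self, word):
        if self.word_count == 0:
            raise IndexError("FIFO empty")
        word_out = self._output()
        self._shift_in(word)     # enable only, counter unchanged
        return word_out


fifo = SrlFifo()
for value in (1, 2, 3):
    fifo.push(value)
print(fifo.pop(), fifo.push_pop(4), fifo.pop(), fifo.pop())
```

A simultaneous push and pop leaves the counter untouched because the shift moves the remaining oldest word into the tap that the counter already addresses.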

55 Large FIFOs in Block RAM  Large FIFOs can use the dual-ported block RAM —add read and write address counters  Asynchronous push and pop  Different port sizes give rate-for-width conversion  Block RAM FIFOs can operate at up to 170 MHz including flag logic  [Diagram labels: Block SelectRAM+, Input, Output, Push, Pop, Addrs, WE, Data, Counter, En, Control Logic, Full, Empty, Counter]

56 Pre-computing for Speed  Some functions are too complex for 10-ns logic implementation —pipelining is not always possible  An alternative is to pre-compute all the possible results and store them in memory —select a result according to the inputs  Function time is independent of complexity —0.6 ns SelectRAM + access time —3.3 ns Block SelectRAM + access time  The function table can be smaller than the logic

57 Multiplication By A Constant  Sometimes, data has to be “scaled” —multiplied by a constant value  A full multiplier is too expensive —it can multiply by a variable —unnecessarily general and too complex  Storing all multiples of the constant is a better alternative —smaller and much faster  [Diagram labels: Constant, Input, Multiplier Array, Scaled Data; Input, Product Table, Scaled Data]

58 16-bit Scaler  A 2^16-word product table is impractical —partition the input into nibbles –use 16-word LUTs for nibble products –combine the partial products in adders  Roughly half the CLBs of a full multiplier —for a 16-bit coefficient: 36 CLBs vs. 62 CLBs  Pipeline the adders for extra speed  [Diagram labels: Input, LUT, x16, x256, x4096, Scaled Data]
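
The nibble-partitioned scaler described on this slide can be expressed in a few lines of software. The Python sketch below is a behavioural illustration with invented function names: one 16-entry table stands in for the four LUT banks (all of which would hold the same products), and the shifts correspond to the x16, x256, and x4096 weights in the diagram.

```python
def build_product_table(constant: int) -> list:
    """Pre-compute the constant times every possible 4-bit nibble value."""
    return [constant * nibble for nibble in range(16)]


def scale(value: int, product_table: list) -> int:
    """Multiply a 16-bit value by the constant using nibble look-ups and adds."""
    if not 0 <= value < 1 << 16:
        raise ValueError("value must fit in 16 bits")
    result = 0
    for position in range(4):
        nibble = (value >> (4 * position)) & 0xF
        # Weight each partial product by 1, 16, 256, or 4096.
        result += product_table[nibble] << (4 * position)
    return result


table = build_product_table(12345)
print(scale(0xBEEF, table), 0xBEEF * 12345)
```

Changing the constant only requires rewriting the 16 table entries, which is what the SRLUT update on the next slide exploits.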

59 Changing the Constant  The SRLUT mode can be used to update the table —“push-only” stack —last 16 bits loaded define the table  A simple accumulator computes all products of a new constant  [Diagram labels: Constant, Change Constant, Load, Clear, Register, Register, Input, 16-Bit Shift Register, Output]

60 Large Function Tables  Larger functions can be implemented in the Block SelectRAM + —12-input functions —micro-coded state machines  Data tables can also be implemented —sine/cosine tables for DSP, for example —dual-ported access gives the sine and cosine simultaneously —a simple address offset gives 90º phase shift for accessing sine and cosine from a single table
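
The single-table sine and cosine look-up mentioned on this slide relies on the identity cos(x) = sin(x + 90º). The Python sketch below illustrates it with a 256-entry table of signed 16-bit values, a size chosen here because it matches the 256 x 16 block configuration. The table size, amplitude, and names are assumptions of this example. In hardware the two reads would use the two ports at the same time, while this model simply performs two indexed reads.

```python
import math

TABLE_SIZE = 256         # assumed: one 256 x 16 block
AMPLITUDE = 32767        # largest signed 16-bit value
QUARTER_TURN = TABLE_SIZE // 4

# One full sine period, quantized to 16-bit signed integers.
SINE_TABLE = [round(AMPLITUDE * math.sin(2 * math.pi * i / TABLE_SIZE))
              for i in range(TABLE_SIZE)]


def sine_and_cosine(address: int) -> tuple:
    """Read sine at the address and cosine a quarter turn later in the same table."""
    sine = SINE_TABLE[address % TABLE_SIZE]
    cosine = SINE_TABLE[(address + QUARTER_TURN) % TABLE_SIZE]
    return sine, cosine


print(sine_and_cosine(0))
print(sine_and_cosine(32))
```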

61 Block RAM/ROM Creation  CORE Generator software creates RAMs and ROMs —simple GUI interface  Initialization file is loaded into RAMs and ROMs at configuration time

62 Memory Summary  Virtex has two kinds of internal memory —distributed SelectRAM+ for small RAMs —Block SelectRAM+ for larger RAMs  SelectRAM+ —0.6 ns read access time —16- and 32-word RAMs / 16-word dual-ported RAMs —16-word shift registers –sequential write/random read FIFOs, pipelining, LUT functions  Dual-ported 4096-bit Block SelectRAM+ —3.3 ns read access time —true dual-ported operation –both ports are read/write / ports can be clocked asynchronously —configurable aspect ratio –4096 x 1 bit to 256 x 16 bits / configure ports differently for width/rate conversion  High-speed SelectI/O access to external RAM

63 Designing for 100+ MHz Volts, Amps, and Watts —DLLs and flexible I/O standards —fast inter-chip communication —simple rules for good signal integrity Ones and zeros —fast logic and fast interconnect —dependable high performance Bits and bytes —distributed SelectRAM + —dual-ported Block SelectRAM +

64 The Virtex Family The complete Virtex Data Sheet is on your AppLinx CD-ROM and at www.xilinx.com/partinfo/virtex.pdf

