Published by Blaze Mason. Modified over 9 years ago.
1
Designing for 100+ MHz with Xilinx Virtex
2
1999 Designs Demand... Higher system speed Higher integration —smaller size, less power, better reliability Lower cost Shorter development time Better product differentiation
3
Traditional Multi-Chip Boards Discrete design components —CPU, memory —bus transceivers, PCI controller, FIFOs —Ethernet controller, Graphics accelerator, MPEG, DSP, etc. —programmable logic as glue and custom function Advantages: —well-documented sophisticated functions —readily available as IP in silicon
4
Multi-Chip Board Problems Physical size Power consumption and reliability PC board signal integrity Limited flexibility —prevents design modifications and upgrades —prevents product diversification —prevents product customization Poor product differentiation —standard parts = standard architecture
5
The FPGA Solution: 4th Generation FPGA Logic+Memory+Routing; Multi-Standard SelectI/O; Temperature Sensing; Delay-Locked Loop for Fast Clock and I/O; 3.3 ns Synchronous Dual-Port SRAM; 500 Mbps SelectMAP Configuration
6
DLLs Maximize I/O Speed Clock-to-output time plus set-up time determines the I/O speed and data bandwidth —min clock period = max clock-to-out + max set-up Traditional solution: —use highly buffered, balanced clock trees –needed to reduce internal clock skew –cannot totally eliminate the delay The Virtex solution: —use a Delay-Locked-Loop ( DLL ) –aligns the internal and external clocks –effectively eliminates the clock-distribution delay
7
Virtex Has 4 Independent DLLs DLLs adjust clock delay to align internal and external clocks —digital closed-loop control —25 to 200-MHz range, 35-picosecond resolution [Figure: DLL loop with Clock, Delay, Comparator, and Error blocks; Data path through CLB and IOB]
8
LVTTL Data Rate with DLL
1.4 ns measured clock-to-output delay
Output standard = LVTTL Fast 16 mA (OBUF_F_16)
Temp = 100 C, Vdd = 2.375 V, Vcco = 3.3 V
Waveforms: 1: CLKIN 2: DATA OUT (no DLL) 3: DATA OUT (DLL deskewed)
Timing   w/o DLL   w/ DLL
r->r     3.9 ns    1.4 ns
r->f     3.9 ns    1.4 ns
9
Other DLL Functions Double the incoming clock frequency —fast internal operation – slow external clock Clock mirroring to the PCB Divide clock by 1.5, 2, 2.5, 3, 4, 5, 8, or 16 Adjust clock duty cycle to 50-50 Create four quadrature clock phases —input four sequential bits per clock period
10
Duty Cycle Correction ~25% duty cycle in, 50% duty cycle out [Figure: a 25 MHz clock at 25% duty cycle enters the Virtex FPGA DLL (1X) and leaves as a 25 MHz clock at 50% duty cycle]
11
Clock Doubling and Mirroring Clock mirror with less than 100 ps skew —simplifies PCB clock distribution Actual HDTV customer example [Figure: a 37 MHz system clock presents 1 input load to the Virtex; DLL 1 and DLL 2 act as zero-delay internal clock buffers, giving exactly aligned 37 MHz and 74 MHz internal clocks inside the FPGA and two 74 MHz outputs (#1, #2) to the SDRAMs]
12
Precise Clock Mirroring 2x system clock for board use [Figure: 66 MHz clock in, Virtex FPGA DLL (2X), 132 MHz clock out]
13
Clock Division Divide clock by 1.5, 2, 2.5, 3, 4, 5, 8, or 16 —maintain synchronous edges [Figure: CLKIN 200 MHz, CLKOUT 200 MHz, CLKDV 12.5 MHz]
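The figure's 12.5 MHz CLKDV value is the 200 MHz input divided by 16. A minimal sketch of that arithmetic for every divisor on the slide follows; the function name `clkdv_mhz` is hypothetical and only models the frequency, not the DLL itself.

```python
from fractions import Fraction

# Divisors listed on the slide for the DLL's divided clock output (CLKDV).
CLKDV_DIVISORS = [Fraction(3, 2), 2, Fraction(5, 2), 3, 4, 5, 8, 16]


def clkdv_mhz(clkin_mhz, divisor):
    """Return the divided clock frequency in MHz for one supported divisor."""
    if divisor not in CLKDV_DIVISORS:
        raise ValueError(f"unsupported divisor: {divisor}")
    return float(Fraction(clkin_mhz) / divisor)


# Print the divided frequency for each divisor with a 200 MHz input.
for d in CLKDV_DIVISORS:
    print(f"200 MHz / {float(d):g} = {clkdv_mhz(200, d):g} MHz")
```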
14
Multi-Standard SelectI/O [Figure: one Virtex FPGA interfacing to a MicroProcessor, SRAM, DSP, Mixed Signal devices, FLASH, SDRAM, and Busses/Backplanes (3/5V PCI, ISA, GTL…) using GTL+, SSTL, LVTTL, and 5V-tolerant I/O at 1.8V, 2.5V, 3.3V, and 5V]
15
Mix & Match Output Standards User-supplied voltages determine output swing —3.3 V, 2.5 V, 1.5 V —one voltage per bank —a bank is half of a chip edge Output characteristics are programmable on a per-pin basis —push-pull or open-drain —LVTTL drive strength –2-mA to 24-mA sink and source current —LVTTL Slew rate
16
Mix & Match Input Standards Internal or user-supplied threshold voltage —selectable on a per-pin basis —one user-supplied threshold voltage per bank Programmable over-voltage protection —5-V tolerant or diode clamp to VCCO —selectable on a per-pin basis [Figure: input buffer with an Internal Reference or a VREF Input]
17
SSTL (Stub Series Terminated Logic) Clock-to-Out With DLL 200 MHz inter-chip data rate —SSTL 3, Class II —IOB register to IOB register [Figure: Clock, DLL, and IOB registers (Q to D) between Virtex FPGAs, with delays of 2.8 ns, 1.9 ns, and 0.3 ns]
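The three delays in the figure add up to 5.0 ns, which matches the quoted 200 MHz rate. The sketch below only checks that sum. The transcript does not say which delay is clock-to-out, which is set-up, and which is board propagation, so the code treats them as anonymous path segments; `max_data_rate_mhz` is a hypothetical helper.

```python
def max_data_rate_mhz(delays_ns):
    """Maximum register-to-register rate when one clock period must cover every delay."""
    period_ns = sum(delays_ns)
    return 1000.0 / period_ns


# Figure values from the slide; the role of each individual number is not stated.
sstl_path_ns = [2.8, 1.9, 0.3]
print(f"period = {sum(sstl_path_ns):.1f} ns, "
      f"rate = {max_data_rate_mhz(sstl_path_ns):.0f} MHz")
```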
18
SSTL Data Rate with DLL
1.3 ns measured clock-to-output delay —much lower noise than LVTTL
Output standard = SSTL 3 Class 2 (OBUF_SSTL3_II)
Temp = 100 C, Vdd = 2.375 V, Vcco = 3.3 V, Vtt = 1.5 V
Waveforms: 1: CLKIN 2: DATA OUT (no DLL) 3: DATA OUT (DLL deskewed)
Timing   w/o DLL   w/ DLL
r->r     3.5 ns    1.1 ns
r->f     3.8 ns    1.3 ns
19
From FPGA to System Component ‘Redefining the FPGA’ “Virtex moves FPGAs from glue to system component” - Ron Neale, EE [Figure labels: GTL+, High Speed System Backplane, Low Voltage CPU, LVTTL, SDRAM (133MHz), SSTL3, Cache SRAM (Mbytes), LVCMOS, Chip 1, x1 CLK, x2 CLK]
20
Power and Thermal Issues Power and heat are serious concerns All CMOS power consumption is dynamic —proportional to VCC^2 —proportional to capacitance —proportional to frequency Virtex conserves power —2.5-V supply voltage —small geometries and short interconnects reduce capacitance
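The three proportionalities combine into the familiar relation P = C x V^2 x f. A small sketch of what the 2.5 V supply buys relative to 3.3 V is shown below. The capacitance and frequency values are made up for illustration and cancel out of the ratio; the constant of proportionality is taken as 1.

```python
def dynamic_power_w(c_farads, vcc_volts, f_hz):
    """CMOS dynamic power modeled as P = C * Vcc^2 * f."""
    return c_farads * vcc_volts ** 2 * f_hz


# Hypothetical values, chosen only for illustration; they cancel in the ratio.
C_SWITCHED = 1e-9   # 1 nF of switched capacitance (assumed)
F_TOGGLE = 100e6    # 100 MHz toggle rate (assumed)

p_33 = dynamic_power_w(C_SWITCHED, 3.3, F_TOGGLE)
p_25 = dynamic_power_w(C_SWITCHED, 2.5, F_TOGGLE)
print(f"3.3 V: {p_33:.3f} W, 2.5 V: {p_25:.3f} W, ratio = {p_25 / p_33:.2f}")
```

With everything else equal, the lower supply voltage alone cuts dynamic power to roughly 57% of the 3.3 V figure.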
21
Virtex Power Consumption Virtex is designed to conserve power —100 MHz 16-bit counters –12.5 MHz average transition rate –6.5 mW per counter including clock distribution —100 MHz 8-bit counters –25 MHz average transition rate –5 mW per counter including clock distribution
Device    Design                 Total power
XCV300    384 16-bit counters    2.5 W
XCV300    768 8-bit counters     3.7 W
XCV1000   1536 16-bit counters   9.8 W
XCV1000   3072 8-bit counters    14.7 W
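The average transition rates quoted on this slide follow from how a binary counter toggles: bit 0 changes every clock, bit 1 every second clock, and so on. The sketch below reproduces those rates and divides two of the slide's totals by their counter counts. Both function names are hypothetical.

```python
def average_transition_rate_mhz(clock_mhz, bits):
    """Mean toggle rate per bit of a binary up-counter.

    Bit k changes state once every 2**k clock cycles.
    """
    total = sum(clock_mhz / 2 ** k for k in range(bits))
    return total / bits


def per_counter_mw(total_watts, counters):
    """Total power divided evenly across the counters, in milliwatts."""
    return 1000.0 * total_watts / counters


print(average_transition_rate_mhz(100, 16))  # close to 12.5 MHz
print(average_transition_rate_mhz(100, 8))   # close to 25 MHz
print(per_counter_mw(2.5, 384), per_counter_mw(3.7, 768))
```

The per-counter results land near the slide's rounded 6.5 mW and 5 mW figures.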
22
Thermal Management Temperature-sensing diode —matched to Maxim MAX1617 A/D —programmable alarms —similar to the Pentium II solution [Figure: Virtex FPGA diode pins DXP and DXN wired to a Maxim MAX1617, which provides SMBCLK, SMBDATA, and ALERT]
23
Power Supply Decoupling CMOS power-supply current is dynamic —current pulse every active clock edge Peak current can be 5x the average current —instantaneous current peaks can only be supplied by decoupling capacitors Use one 0.1 µF ceramic chip capacitor for each power-supply pin —low L and R are more important than high C —double up for lower L and R if necessary —use direct vias to the supply planes, close to the power-supply pins
24
Virtex Configuration New byte-wide SelectMAP mode —up to 528 Mbps at 66 MHz –simple handshake protocol —up to 400 Mbps at 50 MHz –no handshake required Configuration bit-stream length —0.5 Mbits to 6.1 Mbits [Figure: Control Logic (EPLD) drives Address and CS to a Configuration EPROM and WE, CS to the Virtex FPGA; the EPROM supplies Data; the FPGA returns Busy]
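The quoted rates are one byte per configuration clock. A rough sketch of that arithmetic, plus a best-case load time for the shortest and longest bit-streams on the slide, follows. It ignores Busy stalls and start-up overhead, and the function names are hypothetical.

```python
def selectmap_mbps(clock_mhz, bus_bits=8):
    """Peak configuration throughput of a byte-wide port: one byte per clock."""
    return clock_mhz * bus_bits


def config_time_ms(bitstream_mbits, clock_mhz):
    """Best-case load time, ignoring any Busy stalls or start-up overhead."""
    return 1000.0 * bitstream_mbits / selectmap_mbps(clock_mhz)


print(selectmap_mbps(66), selectmap_mbps(50))
for mbits in (0.5, 6.1):
    print(f"{mbits} Mbit at 66 MHz: {config_time_ms(mbits, 66):.2f} ms")
```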
25
Volts, Amps, and Watts: Recap PCB design issues —minimize capacitance for higher speed —terminate transmission lines to reduce ringing Chip inputs and outputs —use DLLs to maximize I/O bandwidth —use SelectI/O to interface with different standards Power and thermal considerations —use the sensing diode to manage chip temperature —decouple the power supply well Configuration —configure faster with the SelectMAP mode
26
Spending the 10 ns Budget Fast logic requires fast function generators —signals often pass through several function generators Routing delays must also be kept short —there are routing delays between every function generator Arithmetic delays are important —carry chains often create critical paths
27
You Don’t Have To Be An Expert You don’t have to be an FPGA architecture expert to implement high-performance designs —the benefits of a good architecture are automatic –all the logic goes faster –software provides easy access to the features You can achieve high-performance only with a good FPGA architecture —a good FPGA empowers its users You’ll design better if you know the architecture —matching your design style to the available features increases performance and/or lowers cost
28
Virtex CLB Logic and arithmetic delay reduction demands improvements in the CLB Virtex CLB is divided into two slices, each with: –2 function generators –2 flip-flops –2 bits of carry logic [Figure: CLB with four function generators, each paired with carry logic]
29
Fast Function Generators Each function generator emulates 2 to 3 levels of logic —a 10-level logic path typically requires 3 to 5 Function Generators in series —at 100 MHz, they must be less than 2 ns each including the routing Virtex has 0.6-ns function generators —leaves 1.4 ns for each route
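The 2 ns and 1.4 ns figures come from dividing the 10 ns period across five function generators in series. A minimal sketch of that budget is below. It ignores flip-flop clock-to-out and set-up time, which the slide does not break out, and `routing_budget_ns` is a hypothetical helper.

```python
def routing_budget_ns(clock_mhz, levels, lut_delay_ns):
    """Routing time left per logic level after the function-generator delay."""
    period_ns = 1000.0 / clock_mhz
    per_level_ns = period_ns / levels
    return per_level_ns - lut_delay_ns


# Budget for 3, 4, and 5 function generators in series at 100 MHz.
for levels in (3, 4, 5):
    print(levels, round(routing_budget_ns(100, levels, 0.6), 2))
```

Five levels is the tight case: 2 ns per level leaves 1.4 ns for each route.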
30
Connecting Function Generators Some functions need several function generators —F5 MUXs connect pairs of function generators –functions with 5 to 9 inputs —F6 MUXs connect all 4 function generators –functions with 6 to 17 inputs [Figure: four function generators joined by F5 and F6 MUXs]
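One way to see why a pair of 4-input function generators plus a MUX covers any 5-input function is Shannon expansion: one LUT holds the function with the fifth input at 0, the other holds it with the fifth input at 1, and the fifth input drives the MUX select. The sketch below is a behavioral model under that assumption, with hypothetical helper names; it does not describe the actual CLB wiring.

```python
def make_lut4(func):
    """Precompute a 16-entry truth table for a 4-input Boolean function."""
    return [func(i & 1, (i >> 1) & 1, (i >> 2) & 1, (i >> 3) & 1) & 1
            for i in range(16)]


def lut4(table, a, b, c, d):
    """Look up one 4-input function generator."""
    return table[a | (b << 1) | (c << 2) | (d << 3)]


def five_input(func5):
    """Split a 5-input function into two LUT4s selected by the fifth input."""
    low = make_lut4(lambda a, b, c, d: func5(a, b, c, d, 0))
    high = make_lut4(lambda a, b, c, d: func5(a, b, c, d, 1))
    # The conditional plays the role of the F5 MUX.
    return lambda a, b, c, d, e: lut4(high if e else low, a, b, c, d)


# Example: 5-input odd parity built from two 16-entry tables.
parity5 = five_input(lambda a, b, c, d, e: a ^ b ^ c ^ d ^ e)
```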
31
Fast Local Routing Local routing provides fast interconnects —in a CLB, Function Generators connect with minimal routing delays —fast paths between adjacent CLBs increase flexibility [Figure: two adjacent CLBs, each with four function generators and carry logic]
32
Use Pipelining for Speed Shorter clock periods mean doing less each period —create a pipeline structure —pipeline stages operate concurrently —more functions are done at the same time —throughput increases All function generators have output flip-flops —most pipeline support is “free”
33
16-Bit Pipeline in One LUT In directly cascaded pipelines the flip-flops are not free One SRLUT can implement up to 16 bits of delay —shift data in and select the appropriate tap [Figure: 16-bit shift register with Input, Output, and Delay Select]
34
Fast Logic Needs Fast Routing Our typical design with 3 to 5 CLBs needed an average routing delay of 1.4 ns or less —the Virtex routing architecture delivers this performance Delay is independent of direction —dependably short delays
35
Go Farther, Faster Virtex achieves its speed through a hierarchy of highly buffered routing resources —wires span 1, 2, or 6 CLBs The Virtex routing architecture is designed for large arrays —today’s FPGAs are big… but tomorrow’s will be even bigger Virtex is designed to maintain its performance even in very large arrays
36
No Routing Congestion For high-speed applications, routing must be dependably fast —not just capable of being fast In the past, high device utilization has caused routing congestion —critical nets might be forced to meander Virtex minimizes these problems —abundant resources prevent congestion If it needs to be fast, it will be fast – automatically!
37
CLB Built-in Tri-State Busses Bi-directional busses are supported directly by tri-state buffers built into each CLB —two drivers per CLB —segmentable every four CLB columns
38
Arithmetic – A Special Case Adders, accumulators, counters, and comparators all depend on carry chains Carry-chain logic is usually much deeper than the rest of the design —32 levels for a 16-bit ripple adder —too deep to use function generators at 100 MHz —arithmetic delays would limit performance Dedicated carry logic provides the desired speed —16-bit adders can operate at up to 200 MHz register-to-register
39
Wide Arithmetic 64-bit adders would require 128 levels of logic —expensive complex carry schemes would be needed to preserve performance Virtex minimizes the carry propagation delay —100 ps per bit pair —zero routing delay between CLBs Minimal performance loss for each extra bit 16-bit adders operate at up to 200 MHz 64-bit adders operate at up to 135 MHz
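The two adder speeds on this slide are consistent with the 100 ps per bit pair figure. Going from 16 to 64 bits adds 24 bit pairs, or 2.4 ns, to the 5 ns period of a 200 MHz adder. The sketch below assumes a purely linear carry-delay model, which is a simplification, and `adder_fmax_mhz` is a hypothetical helper.

```python
CARRY_NS_PER_BIT_PAIR = 0.1  # 100 ps per bit pair, from the slide


def adder_fmax_mhz(bits, base_bits=16, base_mhz=200.0):
    """Estimate adder speed by adding carry delay for bits beyond a known baseline."""
    base_period_ns = 1000.0 / base_mhz
    extra_pairs = (bits - base_bits) / 2
    period_ns = base_period_ns + extra_pairs * CARRY_NS_PER_BIT_PAIR
    return 1000.0 / period_ns


# 64 bits: 5.0 ns + 24 * 0.1 ns = 7.4 ns period
print(round(adder_fmax_mhz(64)))
```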
40
Fast Address Decoders Wide address decoders could slow operation —wide AND gates with invertible inputs Virtex carry-chain MUXs can act as AND gates —combine function generator ANDs 64-bit decoders operate at up to 155 MHz [Figure: chain of carry MUXs with 0 and 1 inputs]
41
Speed Is Never Wasted You can never have too much performance —excess performance can always be traded for size and cost reduction Replace single-cycle functions with smaller multi-cycle versions —a 2-cycle multiplier is half the cost of a single-cycle multiplier Reduce costs by designing down to the performance you need
42
Creating a High-Speed Clock Logic sometimes needs to operate faster than the available clock —multiple RAM accesses in a single cycle —low-speed PCB clock distribution for power or noise reduction Virtex DLLs can double and redouble incoming clocks [Figure: DLL1 and DLL2, each 2X, with clocks of 45 MHz, 90 MHz, and 180 MHz]
43
Optimized for the Future Deep sub-micron technology permits larger and larger array sizes —poses new circuit-design challenges —changes the rules of FPGA architecture Across-chip routing is the most vulnerable —could easily limit design performance Virtex is designed for long-term growth —even long, across-chip routes will remain fast Virtex is tomorrow’s FPGA … today!
44
10 ns is Long Enough Virtex CLBs can implement relatively complex functions in 10 ns — 0.6 ns per 4-input function generator Virtex offers fast interconnections —even across-chip when fully utilized —fast tri-state buses Support for very fast arithmetic operations —16-bit adders at 200MHz
45
Implement Designs Automatically You don’t have to be an FPGA wizard to use Virtex Virtex is optimized for automated implementation —uniform structure –efficient mapping/synthesis —ample routing –simple placement and no congestion —predictable performance –effective synthesis IP cores speed design even more —validated functionality with guaranteed performance
46
100+ MHz Memory Virtex memory operates up to 200 MHz High-speed memory has two benefits —data storage –“work-in-progress” –input/output buffers, FIFOs —accelerating complex functions –store pre-computed values in look-up tables
47
Data Storage Hierarchy Virtex supports 3 levels of memory hierarchy On-chip SelectRAM + —small-to-medium memories —0.6-ns read access time On-chip Block SelectRAM + —larger memories —true dual-ported operation —3.3-ns read access time Fast SelectI/O interfaces to external RAM —DLL boosts memory bandwidth
48
SelectRAM+ SelectRAM+ uses CLB LUTs as user memory —16-deep RAMs —32-deep RAMs —16-deep dual-ported RAMs —16-deep shift registers Cascadable for larger memories —128 or more words deep —uses logic resources for expansion
49
Block SelectRAM+ Up to 32 dual-ported 4096-bit RAM Blocks —synchronous read and write True dual-port memory —each port has full read and write capability —different clocks for each port Configurable aspect ratio —trade width for depth –4096 x 1 bit to 256 x 16 bits —separate configurations for each port Dedicated routing for memory expansion
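Trading width for depth keeps the block at 4096 bits. The sketch below lists the shapes between the two endpoints given on the slide. It assumes the intermediate widths are powers of two, which the slide does not state, and both helper names are hypothetical.

```python
BLOCK_BITS = 4096


def aspect_ratios(min_width=1, max_width=16):
    """List (depth, width) pairs for power-of-two widths that fill the block exactly."""
    ratios = []
    width = min_width
    while width <= max_width:
        ratios.append((BLOCK_BITS // width, width))
        width *= 2
    return ratios


def width_conversion(write_width, read_width):
    """Ratio of port widths: narrow-port accesses needed per wide-port access."""
    return write_width / read_width


print(aspect_ratios())
```

With separate configurations per port, writing 16 bits wide and reading 4 bits wide gives `width_conversion(16, 4)` reads per write.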
50
High-Speed Memory Interfaces SelectI/O and DLLs together provide fast access to many types of external memory Xilinx currently offers two reference designs —fully synthesized —automatic placement and routing SDRAM … up to 125 MHz ZBTRAM … up to 143 MHz (Zero Bus-Turn-around)
51
Input/Output Data Buffers High-performance systems need data buffers to decouple internal operation from I/O activity —I/O may be sporadic (burst-mode busses) —I/O may be faster or slower —I/O may be wider or narrower I/O buffers can take several forms —dual-ported RAMs —ping-pong buffers —FIFOs
52
Dual-ported I/O Buffers Block SelectRAM+ is ideal for I/O buffers —dual-ported operation –independent clocks and controls –bridges between clock domains –simultaneous read and write —port-specific aspect-ratio control –built-in rate/width conversions SelectRAM+ provides similar benefits on a smaller scale
53
Ping Pong Buffers Ping-pong buffers are pairs of blocks that alternate between input and processing SRLUT for small buffers —self-addressing input —0.6-ns read access Larger buffers can use the dual-ported Block RAM —one address bit alternates read/write areas —3.3-ns read access [Figure: 16-bit shift register with Input, Output, Select, and Read Address]
54
Small FIFOs in SRLUTs Small FIFOs can be implemented in SRLUTs —word count addresses the output data —increment and enable SRLUT to Push —decrement to Pop —enable only, with no count change, to Push and Pop together 16-Byte FIFO in 4 CLBs —16 x 16 in 6 CLBs —200+ MHz Expandable for deeper FIFOs [Figure: 16-bit shift register with Input and Output, addressed by an up/down word counter driven by Push and Pop]
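The push and pop rules on this slide can be modeled in a few lines. The sketch below is a behavioral model only, assuming the newest word sits at tap 0 and the word count (minus one) selects the oldest word. The class name `SrlFifo` and its error handling are assumptions, not part of the slide.

```python
class SrlFifo:
    """Behavioral model of a FIFO built from a 16-deep addressable shift register."""

    DEPTH = 16

    def __init__(self):
        self.taps = [0] * self.DEPTH  # taps[0] holds the newest word
        self.count = 0                # number of valid words

    def clock(self, push=False, pop=False, data=0):
        """Apply one clock edge; return the word popped, or None if no pop."""
        if pop and self.count == 0:
            raise IndexError("pop from empty FIFO")
        if push and not pop and self.count == self.DEPTH:
            raise OverflowError("push to full FIFO")
        # The word count addresses the oldest valid word.
        popped = self.taps[self.count - 1] if pop else None
        if push:
            self.taps = [data] + self.taps[:-1]  # shift enable
        # Push alone counts up, pop alone counts down, both leave it unchanged.
        self.count += int(push) - int(pop)
        return popped
```

For example, pushing 10 and then 20 and popping once returns 10, the oldest word.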
55
Large FIFOs in Block RAM Large FIFOs can use the dual-ported block RAM —add read and write address counters Asynchronous push and pop Different port sizes give rate-for-width conversion Block RAM FIFOs can operate at up to 170 MHz including flag logic [Figure: Block SelectRAM+ with Input and Output data ports; Push and Pop enable the two address counters; Control Logic generates WE, Full, and Empty]
56
Pre-computing for Speed Some functions are too complex for 10-ns logic implementation —pipelining is not always possible An alternative is to pre-compute all the possible results and store them in memory —select a result according to the inputs Function time is independent of complexity —0.6 ns SelectRAM + access time —3.3 ns Block SelectRAM + access time The function table can be smaller than the logic
57
Multiplication By A Constant Sometimes, data has to be “scaled” —multiplied by a constant value A full multiplier is too expensive —it can multiply by a variable —unnecessarily general and too complex Storing all multiples of the constant is a better alternative —smaller and much faster [Figure: a Multiplier Array taking Constant and Input to produce Scaled Data, beside a Product Table taking Input to produce Scaled Data]
58
16-bit Scaler A 2^16-word product table is impractical —partition the input into nibbles –use 16-word LUTs for nibble products –combine the partial products in adders Roughly half the CLBs of a full multiplier —for a 16-bit Coefficient: 36 CLBs vs. 62 CLBs Pipeline the adders for extra speed [Figure: Input nibbles feed LUTs whose products are weighted x16, x256, and x4096 and summed into Scaled Data]
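The nibble partitioning can be sketched in software: one 16-word table of multiples serves all four nibbles, and each partial product is weighted by its nibble position before the sum. This is a behavioral illustration with hypothetical function names, not the CLB-level design behind the 36-CLB figure.

```python
def build_nibble_table(constant):
    """The 16-word product table: every multiple of the constant by 0..15."""
    return [constant * n for n in range(16)]


def scale16(x, table):
    """Multiply a 16-bit input by the table's constant, one nibble at a time."""
    if not 0 <= x < 1 << 16:
        raise ValueError("input must fit in 16 bits")
    result = 0
    for i in range(4):
        nibble = (x >> (4 * i)) & 0xF
        # Weight each partial product by 1, 16, 256, or 4096.
        result += table[nibble] << (4 * i)
    return result
```

In hardware the four lookups happen in parallel and the loop's additions become the adder tree that the slide suggests pipelining.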
59
Changing the Constant The SRLUT mode can be used to update the table —“push-only” stack —last 16 bits loaded define the table A simple accumulator computes all products of a new constant [Figure: Constant register and accumulator register with Load, Clear, and Change Constant controls feeding the Input of a 16-bit shift register; Output]
60
Large Function Tables Larger functions can be implemented in the Block SelectRAM+ —12-input functions —micro-coded state machines Data tables can also be implemented —sine/cosine tables for DSP, for example —dual-ported access gives the sine and cosine simultaneously —a simple address offset gives 90° phase shift for accessing sine and cosine from a single table
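The address-offset trick relies on cosine being sine advanced by a quarter period. A minimal sketch with a 256-entry, 16-bit table follows; the depth and scaling are assumptions chosen to fit one 256 x 16 block. In hardware the two lookups would use the two ports of the same block.

```python
import math

N = 256  # table depth; hypothetical, chosen to match a 256 x 16 configuration

# One full period of sine, scaled to signed 16-bit values.
SINE = [round(32767 * math.sin(2 * math.pi * i / N)) for i in range(N)]


def sin_cos(phase):
    """Return (sine, cosine) for a table index; cosine is sine a quarter period ahead."""
    s = SINE[phase % N]
    c = SINE[(phase + N // 4) % N]
    return s, c
```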
61
Block RAM/ROM Creation CORE Generator software creates RAMs and ROMs —simple GUI interface Initialization file is loaded into RAMs and ROMs at configuration time
62
Memory Summary Virtex has two kinds of internal memory —distributed SelectRAM+ for small RAMs —Block SelectRAM+ for larger RAMs SelectRAM+ —0.6 ns read access time —16- and 32-word RAMs / 16-word dual-ported RAMs —16-word shift registers –sequential write/random read FIFOs, pipelining, LUT functions Dual-ported 4096-bit Block SelectRAM+ —3.3 ns read access time —true dual-ported operation –both ports are read/write / ports can be clocked asynchronously —configurable aspect ratio –4096 x 1 bit to 256 x 16 bits / configure ports differently for width/rate conversion High-speed SelectI/O access to external RAM
63
Designing for 100+ MHz Volts, Amps, and Watts —DLLs and flexible I/O standards —fast inter-chip communication —simple rules for good signal integrity Ones and zeros —fast logic and fast interconnect —dependable high performance Bits and bytes —distributed SelectRAM + —dual-ported Block SelectRAM +
64
The Virtex Family The complete Virtex Data Sheet is on your AppLinx CD-ROM and at www.xilinx.com/partinfo/virtex.pdf