1
Programmable Logic Devices
Ernest Jamro Dept. Electronics AGH UST, Kraków Poland
2
PLD as a Black Box
[Figure: a PLD drawn as a black box of logic gates and programmable switches, with inputs (logic variables) entering and outputs (logic functions) leaving.]
3
Programmable Logic Devices (PLD)
PLA/PAL/GAL: very simple functions, up to roughly 30 input/output pins (EPROM/EEPROM based)
CPLD (complex PLD): incorporates many PAL/GAL structures (EEPROM based), medium-scale logic ( input/output pins)
FPGA (Field-Programmable Gate Array): large-scale PLD (50 to 2000 pins), SRAM based (requires configuration after power-on), System on a Chip (incorporates microprocessors, memory blocks, etc.)
4
Programmable Logic Array (PLA)
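The connections in the AND plane are programmable
The connections in the OR plane are programmable
[Figure: inputs x1 ... xn pass through input buffers and inverters into the AND plane, which produces product terms P1 ... Pk; the OR plane combines them into outputs f1 ... fm.]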
5
Gate Level Version of PLA
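[Figure: gate-level PLA with inputs x1, x2, x3, programmable connections in the AND plane producing product terms P1 to P4, and an OR plane producing f1 and f2.]
f1 = x1x2 + x1x3' + x1'x2'x3
f2 = x1x2 + x1'x2'x3 + x1x3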
6
Customary Schematic of a PLA
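[Figure: customary PLA schematic with inputs x1, x2, x3, product terms P1 to P4 in the AND plane, and outputs f1 and f2 from the OR plane.]
f1 = x1x2 + x1x3' + x1'x2'x3
f2 = x1x2 + x1'x2'x3 + x1x3
An x marks each connection left in place after programming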
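The two-plane structure can be mimicked in software. The following is a minimal Python sketch, assuming the four product terms P1 = x1x2, P2 = x1x3', P3 = x1'x2'x3 and P4 = x1x3 implied by the two equations; the names AND_PLANE, OR_PLANE and pla are illustrative and do not come from the slides.

```python
from itertools import product

# AND plane: each product term lists (input index, required value) pairs,
# i.e. the connections left in place after programming.
AND_PLANE = {
    "P1": [(0, 1), (1, 1)],          # x1 x2
    "P2": [(0, 1), (2, 0)],          # x1 x3'
    "P3": [(0, 0), (1, 0), (2, 1)],  # x1' x2' x3
    "P4": [(0, 1), (2, 1)],          # x1 x3
}

# OR plane: which product terms are connected to each output.
OR_PLANE = {"f1": ["P1", "P2", "P3"], "f2": ["P1", "P3", "P4"]}


def pla(x1, x2, x3):
    """Evaluate the PLA for one input combination."""
    x = (x1, x2, x3)
    terms = {
        name: int(all(x[i] == v for i, v in literals))
        for name, literals in AND_PLANE.items()
    }
    return {out: int(any(terms[p] for p in ps)) for out, ps in OR_PLANE.items()}


# Print the full truth table of both outputs.
for bits in product((0, 1), repeat=3):
    print(bits, pla(*bits))
```

Reprogramming the device corresponds to editing the two dictionaries; in a PAL only AND_PLANE would be editable.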
7
AND Plane Implementation with Floating Gate Transistors
8
Programmable Array Logic (PAL/GAL)
The connections in the AND plane are programmable
The connections in the OR plane are NOT programmable
PAL: one-time programmable (like EPROM)
GAL (Generic Array Logic): erasable and re-programmable (like EEPROM)
[Figure: inputs x1 ... xn pass through input buffers and inverters into the programmable AND plane, which produces product terms P1 ... Pk; fixed connections lead to the OR plane, which produces outputs f1 ... fm.]
9
Example Schematic of a PAL/GAL
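[Figure: PAL with inputs x1, x2, x3, an AND plane with product terms P1 to P4, and fixed OR gates producing f1 and f2.]
f1 = x1x2x3' + x1'x2x3
f2 = x1'x2' + x1x2x3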
10
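Macrocell
[Figure: the OR gate from the PAL feeds a D flip-flop driven by Clock; Select chooses between the OR gate output and the flip-flop output Q; Enable controls the buffer that drives the output f1; the selected signal is also fed back to the AND plane.]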
11
Macrocell Functions
Enable = 0 allows the output pin for f1 to be used as an additional input pin to the PAL
Enable = 1, Select = 0 is the normal setting for typical PAL operation
Enable = Select = 1 allows the PAL to synchronize output changes with a clock pulse
The feedback to the AND plane provides for multi-level design
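The three Enable/Select settings can be illustrated with a small behavioral model. This is a hedged Python sketch, not a description of any specific device: the class name Macrocell and its methods are hypothetical, and a disabled output is represented by None to indicate that the pin is free to be driven externally.

```python
class Macrocell:
    """Behavioral sketch: OR gate -> D flip-flop -> Select mux -> Enable buffer."""

    def __init__(self):
        self.q = 0  # flip-flop state

    def clock(self, or_value):
        """Rising clock edge: the flip-flop captures the OR gate output."""
        self.q = or_value

    def pin(self, or_value, select, enable):
        """Value driven onto the output pin for the given control settings."""
        if not enable:
            return None  # buffer disabled: pin can serve as an extra input
        # Select = 0 passes the OR gate directly, Select = 1 passes the registered value.
        return self.q if select else or_value


cell = Macrocell()
print(cell.pin(or_value=1, select=0, enable=1))  # combinational output follows the OR gate
print(cell.pin(or_value=1, select=1, enable=1))  # registered output still holds the old state
cell.clock(1)
print(cell.pin(or_value=0, select=1, enable=1))  # registered output changes only after a clock
```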
12
Multi-Level Design with PALs/GALs
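f = A'BC + A'B'C' + ABC' + AB'C = A'g + Ag', where g = BC + B'C' and C = h in the figure
[Figure: two PAL macrocells with inputs A, B and h; the first produces g, which is fed back to the AND plane and used by the second to produce f. The Select and Enable settings are marked on each macrocell.]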
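The factoring f = A'g + Ag' with g = BC + B'C' can be checked exhaustively. The short Python sketch below compares the flat sum-of-products form against the two-level form for all eight input combinations; the function names are illustrative.

```python
from itertools import product


def f_direct(a, b, c):
    """Flat form: f = A'BC + A'B'C' + ABC' + AB'C."""
    return ((not a and b and c) or (not a and not b and not c)
            or (a and b and not c) or (a and not b and c))


def g(b, c):
    """Intermediate function computed by the first macrocell: g = BC + B'C'."""
    return (b and c) or (not b and not c)


def f_two_level(a, b, c):
    """Multi-level form: f = A'g + Ag'."""
    gv = g(b, c)
    return (not a and gv) or (a and not gv)


equivalent = all(
    bool(f_direct(a, b, c)) == bool(f_two_level(a, b, c))
    for a, b, c in product((False, True), repeat=3)
)
print(equivalent)
```

The multi-level form needs two product terms per OR gate instead of four, which is the point of routing g back through the AND plane.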
13
Complex Programmable Logic Devices (CPLD)
SPLDs (PLA, PAL) are limited in size by the small number of input and output pins and the limited number of product terms
Combined number of inputs + outputs < 32 or so
CPLDs contain multiple circuit blocks on a single chip
Each block is like a PAL (a PAL-like block)
Connections between PAL-like blocks are provided by a programmable interconnection network
Each block is connected to an I/O block as well
14
Structure of a CPLD
[Figure: PAL-like blocks, each attached to an I/O block, joined by interconnection wires.]
15
Internal Structure of a PAL-like Block
Includes macrocells, usually about 16 each
Fixed OR planes; OR gates have a fan-in of 5 to 20
XOR gates provide negation ability; each XOR has a control input
[Figure: PAL-like block with D flip-flops in its macrocells.]
16
Programming a CPLD
A CPLD is erasable and re-programmable, like EEPROM
CPLDs have many pins; large ones have more than 200
Removing a CPLD from a PCB is difficult without breaking the pins
Use ISP (in-system programming) to program the CPLD
A JTAG (Joint Test Action Group) port is used to connect the CPLD to a computer
17
FPGA Principles
A Field-Programmable Gate Array (FPGA) is an integrated circuit that the user can configure to emulate any digital circuit, as long as there are enough resources
An FPGA can be seen as an array of Configurable Logic Blocks (CLBs) connected through programmable interconnect (Switch Boxes)
An FPGA is usually based on SRAM, so it can be re-programmed very quickly but must be programmed after each power-on
Copied from Dr. Konstantinos Tatas
18
FPGA structure
19
Simplified CLB Structure
20
Programmable Logic lab example
21
Example of RAM: 4-input AND gate
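[Table: truth table stored in the 16 x 1 RAM, with inputs A, B, C, D and output O; O = 1 only when A = B = C = D = 1.]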
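A look-up table is simply a small RAM addressed by the logic inputs. The Python sketch below, with illustrative helper names make_lut and read_lut, fills a 16 x 1 memory with the truth table of a 4-input AND gate and then reads it back; treating the first input as the most significant address bit is an assumption made for the example.

```python
def make_lut(func, n_inputs):
    """Fill a 2**n x 1 RAM with the truth table of func (first input = MSB of address)."""
    table = []
    for addr in range(2 ** n_inputs):
        bits = [(addr >> (n_inputs - 1 - i)) & 1 for i in range(n_inputs)]
        table.append(int(func(*bits)))
    return table


def read_lut(table, *inputs):
    """Use the input values as the RAM address and return the stored bit."""
    addr = 0
    for bit in inputs:
        addr = (addr << 1) | bit
    return table[addr]


# Configuration bits for a 4-input AND gate: only address 1111 holds a 1.
and4 = make_lut(lambda a, b, c, d: a and b and c and d, 4)
print(and4)
print(read_lut(and4, 1, 1, 1, 1), read_lut(and4, 1, 0, 1, 1))
```

Any other 4-input function is obtained by storing a different 16-bit pattern; the hardware does not change.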
22
Example 2: Find the configuration bits for the following circuit
[Figure: SRAM memory cell built from six transistors (T1 to T6), with supply U'DD and its access line.]
23
Interconnection Network
24
Example 3: Determine the configuration bits for the following circuit implementation in a 2x2 FPGA, with I/O constraints as shown in the following figure. Assume 2-input LUTs in each CLB.
25
CLBs required
26
Placement: Select CLBs
27
Routing: Select path
28
Configuration Bitstream
The configuration bitstream must include ALL CLBs and SBs, even unused ones
CLB0: 00011
CLB1: 01100
CLB2: XXXXX
CLB3: ?????
SB0:
SB1:
SB2:
SB3:
SB4:
29
The Virtex CLB
30
Details of One Virtex Slice
31
Implements any Two 4-input Functions
registered
32
Implements any 5-input Function
33
Implement Some Larger Functions
e.g. 9-input
34
Two Slices: Any 6-input Function
from other slice 6-input function
35
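Example: mod 10 counter
LUT contents for the next-state functions of outputs Q0, Q1, Q2, Q3, indexed by the current count (X = don't care):
Q0: 0 -> 1, 1 -> 0, 2 -> 1, 3 -> 0, ..., 8 -> 1, 9 -> 0, 10 -> X
Q3: 0 -> 0, 1 -> 0, ..., 6 -> 0, 7 -> 1, 8 -> 1, 9 -> 0, 10 -> X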
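The LUT contents for such a counter can be generated mechanically. The Python sketch below assumes an up-counter that wraps from 9 to 0, with one 4-input LUT per state bit addressed by the current count; unused states 10 to 15 are marked X (don't care). The function name counter_luts is illustrative.

```python
def counter_luts(modulus=10, n_bits=4):
    """Return, per output bit, the LUT column: next value of that bit for each current state."""
    luts = {f"Q{b}": [] for b in range(n_bits)}
    for state in range(2 ** n_bits):
        if state < modulus:
            nxt = (state + 1) % modulus  # count up and wrap at the modulus
            for b in range(n_bits):
                luts[f"Q{b}"].append(str((nxt >> b) & 1))
        else:
            for b in range(n_bits):
                luts[f"Q{b}"].append("X")  # state never reached: don't care
    return luts


for name, column in counter_luts().items():
    print(name, " ".join(column))
```

Each printed row is the 16-bit configuration of one LUT, with the leftmost entry corresponding to state 0.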
36
Ripple Carry Adder
$a_i + b_i + c_{i-1} = s_i + 2 c_i$
$s_i = a_i \oplus b_i \oplus c_{i-1}$
$c_i = a_i b_i + a_i c_{i-1} + b_i c_{i-1} = a_i b_i + c_{i-1}(a_i \oplus b_i)$
[Karnaugh maps for $s_i$ and $c_i$: rows $c_{i-1}$, columns $a_i b_i$ in the order 00, 01, 11, 10.]
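The sum and carry equations translate directly into a bit-level model. The Python sketch below chains full adders so that each carry ripples into the next stage; bit lists are stored least significant bit first, and the helper names are illustrative.

```python
def full_adder(a, b, c_in):
    """One stage: s = a xor b xor c_in, c_out = a*b + c_in*(a xor b)."""
    s = a ^ b ^ c_in
    c_out = (a & b) | (c_in & (a ^ b))
    return s, c_out


def ripple_carry_add(a_bits, b_bits, c_in=0):
    """Add two equal-length bit lists (LSB first); return (sum bits, carry out)."""
    carry = c_in
    s_bits = []
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)  # carry ripples to the next stage
        s_bits.append(s)
    return s_bits, carry


def to_bits(value, width):
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    return sum(bit << i for i, bit in enumerate(bits))


s, c = ripple_carry_add(to_bits(11, 4), to_bits(6, 4))
print(from_bits(s), c)  # 11 + 6 = 17 -> sum bits 1, carry out 1
```

The delay of this structure grows linearly with the word length because the last carry depends on every earlier stage.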
37
Ripple Carry Adders in FPGAs
$s_i = a_i \oplus b_i \oplus c_{i-1}$
[Figure: fragment of a Virtex Configurable Logic Block (CLB).]
38
Lookup Tables used as memory (16 x 2) Distributed Memory
Synchronous write, asynchronous read
39
Lookup Tables used as distributed memory (32 x 1)
40
Virtex-5 Logic Architecture
Advanced logic structure
True 6-input LUTs
Exclusive 64-bit distributed RAM option per LUT
Exclusive 32-bit or 16-bit x 2 shift register
Complete support from Xilinx and 3rd party tools
[Figure: four identical rows, each with a LUT6 (also usable as RAM64 or SRL32) followed by a register/latch.]
41
New Advanced Logic Structure
Improved slice
Four LUT6s and FFs per slice
Better local connection
True 6-input LUTs
Higher performance
Best logic compaction
Wide logic functions without MUX delays
65% higher capacity and one to two speed grades faster than Virtex-4 (4-input LUTs)
[Figure: slice built from LUT6s.]
42
Logic Compaction with LUT6
Use fewer LUTs, faster, less routing
[Figure: an 8-to-1 multiplexer and a 64-bit RAM, each shown implemented with LUT6s and with LUT4s for comparison.]
43
New 6-Input LUT with Two Outputs
6-input LUT with 2 outputs: inputs A1, A2, A3, A4, A5, A6; outputs O6 and O5
True 6-input LUT
Any function of 6 variables
No input shared with other LUTs
Second output adds functionality
Reduces average slice count by 10%
2 independent functions of 5 variables
1 function of 6 variables plus 1 subfunction of 5 variables
1 function of 3 variables plus 1 function of 2 other variables
Plus other combinations of subfunctions...
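The dual-output behavior can be sketched as a 64-bit table with two read paths. The Python model below is a simplification under stated assumptions: O6 reads the full table with all six address bits, O5 reads only the lower 32 entries using A1 to A5, and A1 is the least significant address bit. With A6 tied to 1 the two halves act as two functions of the same five variables. The names lut6_2 and table32 are chosen for this example.

```python
from itertools import product


def lut6_2(init, a):
    """init: 64-bit truth table; a: tuple (A1..A6). Returns (O6, O5)."""
    addr6 = sum(bit << i for i, bit in enumerate(a))  # A1 is the LSB
    o6 = (init >> addr6) & 1           # full 6-input lookup
    o5 = (init >> (addr6 & 0x1F)) & 1  # lower half only, A6 ignored
    return o6, o5


def table32(func):
    """Pack the truth table of a 5-input function into a 32-bit integer."""
    value = 0
    for addr in range(32):
        bits = [(addr >> i) & 1 for i in range(5)]
        value |= int(func(*bits)) << addr
    return value


PARITY5 = table32(lambda a1, a2, a3, a4, a5: a1 ^ a2 ^ a3 ^ a4 ^ a5)
AND5 = table32(lambda a1, a2, a3, a4, a5: a1 & a2 & a3 & a4 & a5)
INIT = (AND5 << 32) | PARITY5  # upper half: AND, lower half: parity

# With A6 = 1, O6 delivers the AND and O5 the parity of A1..A5.
print(lut6_2(INIT, (1, 1, 1, 1, 1, 1)))
```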
44
Virtex-5 Memory Options… The Right Memory for the Application
[Figure: memory options arranged from finest granularity to largest capacity.]
Distributed RAM/SRL32: very granular, localized memory; minimal impact on logic routing; great for small FIFOs; synchronous write, asynchronous read
On-chip BRAM/FIFO: efficient on-chip blocks; flexible, with optional FIFO logic; ideal for mid-sized FIFOs/buffers; synchronous write, synchronous read
Fast memory interfaces: cost-effective bulk storage; memory controller cores; large memory requirements
External memory types: DRAM (SDRAM, DDR SDRAM, FCRAM, RLDRAM), SRAM (Sync SRAM, DDR SRAM, ZBT, QDR), FLASH, EEPROM
45
Distributed memory can be placed anywhere in the FPGA
Distributed RAM
Distributed LUT memory
64-bit blocks throughout the FPGA
Single-port, dual-port, multi-port
Can be used as a 32-bit shift register
Very fast (sub-nanosecond)
Tightly coupled to logic
Synchronous write, asynchronous read
[Figure: a slice whose LUTs can each serve as logic, RAM, or shift register.]
46
32-bit Shift Registers in 1 LUT
Length is dynamically determined by the A inputs
Convenient way to dynamically change LUT content
[Figure: 32-bit shift register with data input D, clock CLK and last-stage output Q31; a 32-input MUX addressed by A selects the output Qn.]
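The addressable shift register can be modeled in a few lines. This Python sketch assumes a 5-bit address (0 to 31) that selects which stage appears on the output, so the effective length is the address plus one; the class name SRL32 and its methods are illustrative and do not reproduce any vendor primitive exactly.

```python
class SRL32:
    """Behavioral sketch of a 32-stage shift register with a selectable tap."""

    def __init__(self):
        self.bits = [0] * 32  # bits[0] holds the most recently shifted-in value

    def clock(self, d):
        """Rising clock edge: shift everything by one stage and insert d."""
        self.bits = [d] + self.bits[:-1]

    def q(self, a):
        """Output of the stage selected by address a (0..31)."""
        return self.bits[a]

    @property
    def q31(self):
        """Output of the last stage, usable for cascading."""
        return self.bits[31]


sr = SRL32()
sr.clock(1)           # shift in a single 1 ...
for _ in range(4):
    sr.clock(0)       # ... followed by four 0s
print(sr.q(4), sr.q(3))  # the 1 now sits in stage 4
```

Changing the address between clock edges changes the delay without reloading the register contents.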
47
Each RAM block can be configured as BRAM or FIFO
BRAM/FIFO Features
Independent read and write port widths
Multiple configurations: true dual-port, simple dual-port, single-port
Integrated logic for fast and efficient FIFOs
Synchronous write and read
[Figure: block usable as a FIFO or as dual-port BRAM.]
48
BRAM Mode Top Level View
True dual port: unrestricted flexibility
Read and write operations simultaneously and independently on port A and port B
Configurations: 32Kx1, 16Kx2, 8Kx4, 4Kx9, 2Kx18, 1Kx36
Each port can have a different width
In one clock cycle, 4 total operations can be performed
[Figure: 36 Kb memory array with two ports, A and B, each with an address input (Addr), 36-bit write data (Wdata) and 36-bit read data (Rdata).]
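The listed depth-by-width configurations can be checked with simple arithmetic. The Python sketch below computes the total bit count and the number of address bits for each; the helper name describe is illustrative. The x1, x2 and x4 shapes use 32 Kb of the array, while the x9, x18 and x36 shapes use the full 36 Kb.

```python
# (depth, width) pairs as listed on the slide.
CONFIGS = [
    (32 * 1024, 1),
    (16 * 1024, 2),
    (8 * 1024, 4),
    (4 * 1024, 9),
    (2 * 1024, 18),
    (1 * 1024, 36),
]


def describe(depth, width):
    """Return (total bits, address bits) for one configuration."""
    total_bits = depth * width
    addr_bits = (depth - 1).bit_length()  # bits needed to address `depth` words
    return total_bits, addr_bits


for depth, width in CONFIGS:
    total, addr = describe(depth, width)
    print(f"{depth // 1024}K x {width}: {total // 1024} Kb, {addr} address bits")
```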
49
Block RAM
# (the port data width) can be: 1, 2, 4, 8 (9), 16 (18), 32 (36)
50
Virtex IOB
51
Virtex 7 IOB Differential / Single Ended Standards
52
Virtex 7 IOB Digitally Controlled Impedance (DCI)
53
IO Standards
54
I/O Banking Architecture
Many banks per device: each bank has a separate supply voltage (in order to support different I/O standards)
[Figure: LX330 and LX30 device layouts, with areas labeled Bank, CMT, GClk and CFG.]
55
Edge-Aligned DDR Inputs, Opposite-Edge
[Figure: SelectIO™ DDR input in front of the FPGA fabric, with input D, clock CLK and outputs QA and QB; waveforms show CLK, DATA, QA, QA', QB and QB'.]
This slide highlights the new DDR input same-edge and same-edge-pipelined capabilities. The data stream has each pair of bits alternately colored so that it is easier to follow. When the slide comes up, you see the data for QB misaligned from QA by one half cycle. Click once to add the extra QB register, moving the data to the same edge as QA, but delayed one full cycle. Click again to add the extra QA register to align QA' and QB', but with an extra latency cycle introduced.
56
Need Frequency Conversion: Internal Must Be Lower Than External
[Figure: a 1 Gbps DDR data line and CLK enter the FPGA fabric and pass through successive flip-flop (FF) stages that widen the data by 4, giving n x 4 lines for n inputs; callouts: RLOCs or LOCs, Directed Routing Constraints, Timing can be tricky.]
Signals can't move through the fabric at 1 GHz, so the data path must be made wider and slower. The diagram shows a divide-by-4 example in which a single bit of DDR data is brought into the FPGA and then widened to lower the speed. Ultimately we wind up with th... Notice that the clock is divided in each successive stage as the data is widened. Also notice the clock polarity as the data is shifted from DDR to SDR. This handoff of data inside the FPGA creates tremendous timing challenges (the approach shown comes from XAPP622), and those challenges are most apparent in the first part of the circuit, where the DDR data is captured into the first set of FFs at full rate. As the data moves into the fabric to be widened, there are critical half-clock-cycle timing domains that may require special constraints to lock down FFs for the fastest timing. In really fast interfaces (350 MHz and above) the routing also becomes critical and requires special constraints to lock down the fastest routing path. The bottom line is that timing can become a nightmare to deal with. (It may be interesting to point out that Xilinx is the only company that allows the user to lock down the actual routing path so that there are no changes from one run to the next.)
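The widening step trades speed for width. The Python sketch below is a purely behavioral illustration, with no timing: it groups a serial bit stream into 4-bit words and computes the per-line rate after a 1:4 conversion. The function names deserialize and parallel_rate are illustrative.

```python
def deserialize(bits, factor=4):
    """Group a serial bit stream into parallel words of `factor` bits each."""
    usable = len(bits) - len(bits) % factor  # drop an incomplete trailing word
    return [tuple(bits[i:i + factor]) for i in range(0, usable, factor)]


def parallel_rate(serial_rate_bps, factor=4):
    """Rate of each parallel line after widening by `factor`."""
    return serial_rate_bps / factor


stream = [1, 0, 1, 1, 0, 0, 1, 0]
print(deserialize(stream))        # two 4-bit words
print(parallel_rate(1e9) / 1e6)   # 1 Gbps widened by 4 -> 250 Mbps per line
```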
57
Skew Affects Setup and Hold Times
[Figure: source and target devices linked through a connector by data and clock lines; waveforms CLK, DATA1 and DATA2 with setup and hold times tSU1, tH1, tSU2 and tH2.]
This is the first of two slides that describe why timing is difficult. It shows that channel-to-channel skew creates a different set of setup and hold requirements for each channel, which is hard to manage over wide buses (72-bit memory is an extreme example). Before clicking, describe the scene and point out the setup and hold time on the first channel. Click once to bring up the second channel.
58
Capture Window
59
Channel Timing Can Create Additional Clock Domains
[Figure: Channels 1 to 3, with stages labeled Fast Unaligned 1, 2 and 3, Frequency Reduction, Alignment, Fast Aligned and Slow Aligned.]
This slide motivates the need for lots of I/O and regional clock resources. Because of skews, each channel ends up having its own little clock domains, first at the original line frequency, then at the reduced frequency. All of the fast and slow frequencies are the same, but the domains all have a different phase relationship to the “master” clock. The channels all enter the same domain once they've been aligned.
60
ISERDES Manages Incoming Data
[Figure: ChipSync™ ISERDES between the data pin and the FPGA fabric, delivering n-bit data; CLK is distributed through BUFIO and the divided clock CLKDIV through BUFR.]
Frequency division: data width to 10 bits
Dynamic signal alignment: bit alignment, word alignment, clock alignment
Supports Dynamic Phase Alignment (DPA)
Shows the two major functions that ChipSync provides: frequency reduction and alignment. DPA is a term specifically used in SPI-4.2, including both bit and word alignment. The term is used more broadly than just SPI-4.2. Because every signal has this circuitry, including clocks, clocks can be aligned as well, making this the most flexible solution available. Click once for each first-level bullet.
61
Easy Bit Alignment
[Figure: ChipSync™ bit alignment. DATA passes through IDELAY into the ISERDES and on to the FPGA fabric; a state machine drives the IDELAY INC/DEC inputs; IDELAY CNTRL is fed by a 190-210 MHz calibration clock.]
64 delay elements of ~70 to 89 ps each
This shows how bit alignment is accomplished using DPA. Note that there is a 200 MHz clock; it is required to calibrate the delay elements and can be generated internally or by a separate clock. The state machine controls the delay setting, which will be shown conceptually in the next slide. The IDELAY block is part of the ISERDES block, but can be used even in situations where the SERDES itself isn't being used.
62
OSERDES Simplifies Frequency Multiplication
[Figure: ChipSync OSERDES between the FPGA fabric and the pin, with data buses labeled n and m, clocked by CLK and CLKDIV from a DCM/PMCD.]
This shows data leaving the chip. Just as it was divided down upon entering the chip, it must be multiplied up when leaving. The OSERDES does that. The OSERDES block also allows three-state control to be sped up, primarily for memory buses. The 1/2/4-bit settings allowed cover all the various memory configurations.
63
Gigabit Serial Signaling is Everywhere
Serial is faster than parallel
Very high multi-gigabit data rates
Embedded clock avoids clock/data skew
Reduction in EMI and power consumption
The preferred choice in many markets: telecom, datacom, computing, storage, video/imaging, instrumentation, etc.
Dominating all new standards activities
[Chart: percentage of engineers designing serial I/O systems: 64% in 2005, 92% in 2006. Source: EE Times Survey, 2005.]
Serial transceivers must be flexible, robust and easy to use
64
The Gigabit Transceiver
[Figure: GTP transceiver with transmit (Tx) and receive (Rx) paths, each consisting of PMA and PCS blocks, connected to the FPGA fabric interface.]
8 to 96 transceivers per device
Supporting data rates to 28 Gbps
65
Virtex-5 Delivers Powerful Clock Management
Combination of digital and analog technology
DCMs (Digital Clock Managers), based on a DLL (Delay-Locked Loop)
PLLs
Highest performance: 550 MHz global clocking
More than 2x jitter filtering
Select by: function, automatic HDL code, component
[Figure: clock buffers, DCM and PLL.]
66
Virtex-5 Clock Management Tile
CMT: up to 6 CMTs per device, each with 2 DCMs and 1 PLL
DCM: 5th-generation all-digital technology; provides most clocking functions
PLL: reduces internal clock jitter; supports higher jitter on reference clocks; replaces discrete PLLs and VCOs
Powerful combination of flexibility and precision
67
Filter Jitter Using the Virtex-5 PLL
[Figure: typical waveform examples. PLL input clock: a noisy 400 MHz clock with more than 400 ps peak-to-peak jitter. PLL output clock: less than 100 ps peak-to-peak jitter, giving a quiet FPGA.]
68
DCM (Digital Clock Manager) Features
[Figure: DCM block with ports CLKIN, CLKFB, RST, CLK0, CLK90, CLK180, CLK270, CLK2X, CLK2X180, CLKDV, CLKFX, CLKFX180 and LOCKED (DCM_BASE); DCM_ADV adds Phase Shift and DRP ports.]
Operate from 19 MHz to 550 MHz
Remove clock insertion delay (“zero delay clock buffer”)
Correct clock duty cycles
Synthesize Fout = Fin * M/D, with M and D values up to 32
Additional DCM_ADV features:
Dynamically phase shift clocks in increments of period/256 or with direct delay line control
Use the Dynamic Reconfiguration Port to adjust parameters without reconfiguring
Each DCM can be invoked with either the DCM_BASE or the DCM_ADV primitive
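The synthesis relation Fout = Fin * M/D is easy to explore numerically. The Python sketch below is a hedged helper, not a vendor tool: the function name dcm_clkfx is hypothetical, and applying the 19 to 550 MHz range to the synthesized output is a simplifying assumption, since actual limits depend on device, mode and speed grade.

```python
def dcm_clkfx(f_in_mhz, m, d, f_min=19.0, f_max=550.0):
    """Return the synthesized frequency Fin * M / D in MHz, or raise ValueError."""
    if not (1 <= m <= 32 and 1 <= d <= 32):
        raise ValueError("M and D must each be between 1 and 32")
    f_out = f_in_mhz * m / d
    if not (f_min <= f_out <= f_max):
        # Assumed range check; real devices specify separate input/output limits.
        raise ValueError(f"{f_out} MHz is outside the assumed {f_min}-{f_max} MHz range")
    return f_out


print(dcm_clkfx(100, 4, 1))  # 100 MHz * 4 / 1
print(dcm_clkfx(100, 3, 2))  # 100 MHz * 3 / 2
```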
69
DCM in VHDL
Library UNISIM;
use UNISIM.vcomponents.all;

-- DCM_SP: Digital Clock Manager
--         Spartan-6
-- Xilinx HDL Libraries Guide, version 11.2
DCM_SP_inst : DCM_SP
generic map (
   CLKDV_DIVIDE => 2.0,
      -- Specifies the extent to which the CLKDLL, CLKDLLE, CLKDLLHF, or
      -- DCM_SP clock divider (CLKDV output) is to be frequency divided.
   CLKFX_DIVIDE => 1,
      -- Specifies the frequency divider value for the CLKFX output.
   CLKFX_MULTIPLY => 4,
      -- Specifies the frequency multiplier value for the CLKFX output.
   CLKIN_DIVIDE_BY_2 => FALSE,
      -- Enables CLKIN divide by two features.
   CLKIN_PERIOD => "10.0",
      -- Specifies the input period to the DCM_SP CLKIN input in ns.
   CLKOUT_PHASE_SHIFT => "NONE",
      -- This attribute specifies the phase shift mode. NONE = No phase
      -- shift capability. Any set value has no effect. FIXED = DCM
      -- outputs are a fixed phase shift from CLKIN. Value is specified
      -- by PHASE_SHIFT attribute. VARIABLE = Allows the DCM outputs to
      -- be shifted in a positive and negative range relative to CLKIN.
      -- Starting value is specified by PHASE_SHIFT.
   CLK_FEEDBACK => "1X",
      -- Defines the DCM feedback mode. 1X: CLK0 as feedback. 2X: CLK2X
      -- as feedback.
   DESKEW_ADJUST => "SYSTEM_SYNCHRONOUS",
      -- Sets configuration bits affecting the clock delay alignment
      -- between the DCM_SP output clocks and an FPGA clock input pin.
   DLL_FREQUENCY_MODE => "LOW",
      -- AUTO mode allows DLL to do automatic frequency search to decide
      -- whether DLL will operate in LOW or HIGH mode. This is a legacy
      -- attribute where the high and low value has no effect; it is
      -- always in auto mode.
   DSS_MODE => "NONE",
   DUTY_CYCLE_CORRECTION => TRUE,
      -- Corrects the duty cycle of the CLK0, CLK90, CLK180, and CLK270
      -- outputs.
   PHASE_SHIFT => 0,
      -- Defines the amount of fixed phase shift from -255 to 255.
   STARTUP_WAIT => FALSE
      -- Delays configuration DONE until DCM LOCK.
)
port map (
   CLK0 => CLK0,         -- 1-bit Same frequency as CLKIN, 0 degree phase shift.
   CLK180 => CLK180,     -- 1-bit Same frequency as CLKIN, 180 degree phase shift.
   CLK270 => CLK270,     -- 1-bit Same frequency as CLKIN, 270 degree phase shift.
   CLK2X => CLK2X,       -- 1-bit Two times CLKIN frequency clock, aligned with CLK0.
   CLK2X180 => CLK2X180, -- 1-bit 180 degree shifted version of the CLK2X clock.
   CLK90 => CLK90,       -- 1-bit Same frequency as CLKIN, 90 degree phase shift.
   CLKDV => CLKDV,       -- 1-bit Divided version of CLK0. Divide value is programmable.
   CLKFX => CLKFX,       -- 1-bit Digital Frequency Synthesizer output (DFS).
   CLKFX180 => CLKFX180, -- 1-bit 180 degree shifted version of the CLKFX clock.
   LOCKED => LOCKED,     -- 1-bit Signal indicating when the DCM has LOCKed.
   PSDONE => PSDONE,     -- 1-bit Output signal that indicates variable phase shift is done.
   STATUS => STATUS,     -- 8-bit DCM Status Bits
   CLKFB => CLKFB,       -- 1-bit Feedback clock input to DCM. The feedback input is required unless the DFS
                         -- is used stand-alone. The source of CLKFB must be CLK0 or CLK2X output from the
                         -- DCM.
   CLKIN => CLKIN,       -- 1-bit Clock input for the DCM.
   DSSEN => DSSEN,
   PSCLK => PSCLK,       -- 1-bit Phase shift clock input. The PSCLK input pin provides the source clock for
                         -- the DCM phase shift.
   PSEN => PSEN,         -- 1-bit Variable Phase Shift enable signal, synchronous with PSCLK.
   PSINCDEC => PSINCDEC, -- 1-bit The phase shift increment/decrement (PSINCDEC) input signal must be
                         -- synchronous with PSCLK. The PSINCDEC signal is used to increment or decrement
                         -- the phase shift factor when PSEN is activated. The PSINCDEC is asserted HIGH for
                         -- increment and LOW for decrement.
   RST => RST            -- 1-bit The reset input pin (RST) resets the DCM circuitry. The RST signal is an
                         -- active HIGH asynchronous reset.
);
70
Using the DLL to De-Skew the Clock
71
Three Types of Clock Resources
[Figure: the three clock resource types: global clocks (with global muxes), regional clocks, and I/O clocks (in the I/O column).]
72
BUFG - Global (Clock) Buffer
This design element is a high-fanout buffer that connects signals to the global routing resources for low-skew distribution of the signal. BUFGs are typically used on clock nets.
Library UNISIM;
use UNISIM.vcomponents.all;

-- BUFG: Global Clock Buffer
--       Virtex-6
-- Xilinx HDL Libraries Guide, version 11.2
BUFG_inst : BUFG
port map (
   O => O, -- 1-bit Clock buffer output
   I => I  -- 1-bit Clock buffer input
);
73
BUFGCE
This design element is a global clock buffer with a single gated input. Its O output is "0" when clock enable (CE) is Low (inactive). When CE is High, the I input is transferred to the O output. This module is race condition free.
74
Can also be used for fast counters, barrel shifters, etc…
XtremeDSP in Virtex-5
Second-generation DSP slice architecture
25x18 multiplier
Per-bit logic functions (AND, OR, XOR, XNOR, ...)
High performance for DSP “heavy lifting”
550 MHz operation
[Figure: DSP slice.]
75
Virtex-5 DSP48E Full Custom Design Enabling Efficient DSP
Wider internal data path and 96-bit accumulated output enable higher precision
Pipeline registers enable 550 MHz performance
Pattern-detect circuitry increases functionality
25x18 input increases precision and efficiency
[Figure: DSP48E slice with inputs A (25-bit), B (18-bit) and C (48-bit), cascade inputs ACIN, BCIN, PCIN and cascade outputs ACOUT, BCOUT, PCOUT (48-bit), a multiplier, optional pipeline registers and routing logic, output P (48-bit, optionally 96-bit) and a pattern-detect comparator (=).]
Main point: the DSP slice in Virtex-4 is much more than an embedded multiplier. The DSP slice includes the following features:
Adder/subtractor logic to perform the adder portion of DSP filter design
48-bit output to support 18x18-bit multiply and add functions
Optional “C” input that is 48 bits wide, allowing users to feed in the 48-bit output or feed coefficients directly to the adder portion of the DSP slice
Input and output pipeline registers enable 500 MHz pipelined support
Cascade registers allow users to combine inputs and outputs between DSP slices within a column without using any programmable routing resources
Adder feedback allows users to implement accumulators using a single DSP slice
The DSP slice in Virtex-4 has been specially designed to be very high performance while at the same time consuming almost no power.
76
FPGAs For Massively Parallel DSP
Programmable DSP (sequential): data in, a single MAC unit fed with the coefficients, a register, data out; 640 clock cycles needed
FPGA (fully parallel implementation): data in, a chain of registers, one multiplier per coefficient (labeled C0, C1, C2, C3, ..., C192 in the figure) feeding adders, data out; 640 operations in 1 clock cycle
Don't be fooled by 1 GHz processors: the FPGA delivers higher performance thanks to parallelism
1 GHz / 640 clock cycles = 1.6 MSPS
550 MHz / 1 clock cycle = 550 MSPS
The 640-tap filter implementation is 340 times faster
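The throughput comparison is plain division and can be reproduced directly. The Python sketch below uses the clock rates and cycle counts quoted on the slide; the function name sample_rate_msps is illustrative. The unrounded ratio comes out at about 352; the slide's figure of 340 follows from rounding the sequential rate up to 1.6 MSPS before dividing.

```python
def sample_rate_msps(clock_mhz, cycles_per_sample):
    """Output sample rate in MSPS for a filter needing this many cycles per sample."""
    return clock_mhz / cycles_per_sample


sequential = sample_rate_msps(1000, 640)  # 1 GHz processor, one MAC per cycle
parallel = sample_rate_msps(550, 1)       # 550 MHz FPGA, all 640 MACs per cycle
speedup = parallel / sequential

print(f"sequential: {sequential:.4f} MSPS")
print(f"parallel:   {parallel:.0f} MSPS")
print(f"speedup:    {speedup:.0f}x")
```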
77
Xilinx, 7 Series Families
78
Zynq = FPGA + Processor