1
Design for Embedded Image Processing on FPGAs
Chapter 2 Field Programmable Gate Arrays
2
Outline
Programmable logic
Inside an FPGA
  Interconnect
  Input / Output
  Clocking
  Configuration
FPGAs and ASICs
FPGAs and image processing
FPGA families
3
FPGA Building Blocks
Programmable logic
Memory
Programmable interconnect
Input and output
Clocking and configuration
4
Programmable Logic
Any combinatorial logic function can be represented as a two-level OR of ANDs
Early programmable devices used a two-level array
  Mask or fuse programmable connections
(Figure: two-level array showing the programmable connections)
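As a rough illustration of the two-level form, the sketch below evaluates an OR of ANDs in Python. The product-term encoding and the names `eval_sop` and `XOR_TERMS` are assumptions made for this example and do not come from the slides.

```python
def eval_sop(product_terms, inputs):
    """Evaluate a two-level OR-of-ANDs (sum of products) expression.

    product_terms: list of dicts mapping input index -> required bit (0 or 1).
                   Each dict models the programmed connections of one AND gate.
    inputs:        tuple of input bits.
    """
    # AND plane: a term is true when every connected literal matches.
    and_outputs = [
        all(inputs[i] == v for i, v in term.items()) for term in product_terms
    ]
    # OR plane: the output is true when any product term is true.
    return int(any(and_outputs))


# XOR of two inputs: (a AND NOT b) OR (NOT a AND b)
XOR_TERMS = [{0: 1, 1: 0}, {0: 0, 1: 1}]
```

Calling `eval_sop(XOR_TERMS, (1, 0))` evaluates the first product term as true, so the OR plane outputs 1.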
5
Programmable Array Logic
Only input is programmable
  Output section is fixed
Advantages
  Reduces chip area
  Increases speed
Programmable connections can be made reprogrammable
  Using EEPROM rather than fuses
Architecture primarily used by CPLDs
6
PROM or Lookup Table
Only output is programmable
  Each row corresponds to a unique input combination
Implemented using memory or multiplexer tree
Architecture primarily used by FPGAs
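A lookup table can be modelled as a small memory addressed by the input bits. The sketch below is a minimal Python model under that assumption; `make_lut` and `lut_read` are hypothetical helper names, and the bit ordering (first input is the least significant address bit) is an arbitrary choice for the example.

```python
def make_lut(func, n_inputs):
    """Precompute the truth table of func: one row per input combination."""
    table = []
    for address in range(2 ** n_inputs):
        bits = [(address >> i) & 1 for i in range(n_inputs)]
        table.append(int(func(*bits)))
    return table


def lut_read(table, *bits):
    """Use the input bits as a memory address and return the stored output."""
    address = sum(bit << i for i, bit in enumerate(bits))
    return table[address]


# 3-input majority function stored as 8 table entries
majority = make_lut(lambda a, b, c: (a + b + c) >= 2, 3)
```

Changing the function only changes the stored table, not the structure that reads it, which is why only the output side needs to be programmable.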
7
Basic FPGA Architecture
Sea of logic blocks with programmable interconnect
(Figure: grid of programmable logic blocks and programmable interconnects, surrounded by I/O blocks, with configuration control and clock control)
8
Logic Cell
Smallest unit of logic on an FPGA
Based on a lookup table
  Output is an arbitrary function of its inputs
  Early devices had 3 or 4 inputs
  Modern devices have 5 or 6 inputs
  Used to implement logic, adders, counters, multiplexers, etc.
  More complex functions require multiple LUTs
Output also has a flip-flop
  Used to build registers, finite state machines, etc.
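The combination of a lookup table and a flip-flop can be sketched as follows. This is a simplified behavioural model, not any vendor's cell: the class name `LogicCell`, the `clock` method, and the toggle example are all assumptions for illustration.

```python
class LogicCell:
    """Behavioural model of a logic cell: a LUT followed by a flip-flop."""

    def __init__(self, table):
        self.table = table  # LUT contents, one entry per input combination
        self.q = 0          # flip-flop state

    def lut(self, *bits):
        # First argument is the least significant address bit.
        address = sum(bit << i for i, bit in enumerate(bits))
        return self.table[address]

    def clock(self, *bits):
        """Model a rising clock edge: the flip-flop captures the LUT output."""
        self.q = self.lut(*bits)
        return self.q


# 2-input LUT programmed as XOR: next_q = q XOR enable (a toggle flip-flop)
toggle = LogicCell([0, 1, 1, 0])
history = [toggle.clock(toggle.q, 1) for _ in range(4)]
```

Feeding the registered output back into the LUT is what turns the cell into a one-bit state machine; with the enable input held at 0 the state is held instead of toggled.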
9
Logic Block
Several logic cells combined together
  Typically 4-10 logic cells
  Share common control signals (clock, clock enable)
Outputs directly available to other inputs
  Reduces propagation delay for deeper logic
Additional dedicated logic
  Reduces propagation delay for more complex logic functions
Some FPGAs allow logic cells or logic blocks to be used as RAM or shift registers
10
Dedicated Logic
Carry chains
  A full adder requires 2 outputs
    Sum and carry
  Separate carry hardware allows 1 LUT per bit added
Multiplexer controls
  Combines outputs of multiple LUTs for wider functions
  Enables wide multiplexers and more complex logic functions to be built more efficiently
DSP blocks
  Hardware multiplier or multiply and accumulate
  Reduces logic required and propagation delay for DSP applications
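One way to see why dedicated carry hardware allows one LUT per bit is sketched below. It assumes a common arrangement in which the LUT computes only the propagate signal and a dedicated multiplexer and XOR gate handle the carry and sum; the slides do not specify this structure, and the function name `add_with_carry_chain` is hypothetical.

```python
def add_with_carry_chain(a, b, width):
    """Add two unsigned integers bit by bit, one LUT function per bit."""
    carry = 0
    result = 0
    for i in range(width):
        a_i = (a >> i) & 1
        b_i = (b >> i) & 1
        propagate = a_i ^ b_i                # the single LUT per bit
        sum_i = propagate ^ carry            # dedicated XOR gate
        carry = carry if propagate else a_i  # dedicated carry multiplexer
        result |= sum_i << i
    return result, carry
```

When the two input bits differ, the incoming carry is passed along; when they are equal, the carry out is simply that shared bit, so no second LUT is needed for the carry output.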
11
Fabric RAM
Adapts structure of LUTs to enable them to be used as small memories
  16×1 or 32×1 memory (depending on FPGA)
Adjacent blocks combine to make a dual-port memory
  True dual-port
    Both ports can be used for read and write
  Simple dual-port
    One port is read only and one write only
Used for banks of registers or coefficient memory
  Only one access per port per clock cycle
Some FPGAs also allow LUTs to be configured as short shift registers
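The shift register mode can be pictured as the LUT's storage cells chained together, with the LUT inputs selecting which cell to read. The sketch below is a behavioural model under that assumption; the class name `ShiftRegisterLUT` and the addressable tap are illustrative choices and are not described in the slides.

```python
class ShiftRegisterLUT:
    """Model of a LUT's storage cells reused as a short shift register."""

    def __init__(self, depth=16):
        self.cells = [0] * depth

    def shift(self, bit):
        """One clock cycle: shift a new bit in, drop the oldest bit."""
        self.cells = [bit] + self.cells[:-1]

    def read(self, tap):
        """Read the bit that was shifted in `tap` cycles before the latest."""
        return self.cells[tap]


sr = ShiftRegisterLUT(depth=4)
for bit in (1, 0, 1, 1):
    sr.shift(bit)
```

Here `sr.read(0)` returns the most recent bit and `sr.read(3)` returns the bit shifted in three cycles earlier, which is how such a structure can act as a short programmable delay.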
12
Other Memory Resources
Dual port block RAM
  Larger blocks
    512 bits to 576 Kbits depending on family
  Flexible word width
    E.g. a 36 Kbit block can be configured from 32K×1 to 512×72
  Each block is independent
    Local memories
    Potentially wide bandwidth
  Used for FIFO buffers, data caching, large lookup tables
External RAM
  Used for larger blocks (frame buffers, etc.)
  Access limited by bandwidth
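The capacity arithmetic for the two endpoint configurations quoted above can be checked with a short sketch. The helper names are illustrative. Note that 32K×1 addresses only 32 Kbit of the 36 Kbit block; a likely explanation, offered here as an assumption rather than something the slides state, is that the remaining bits are parity bits reachable only in the wider modes.

```python
KBIT = 1024


def capacity_bits(depth, width):
    """Total number of bits addressed by a depth x width configuration."""
    return depth * width


def address_bits(depth):
    """Number of address lines needed to select one of `depth` words."""
    return (depth - 1).bit_length()


narrow = (32 * 1024, 1)  # 32K x 1, from the slide
wide = (512, 72)         # 512 x 72, from the slide

narrow_kbit = capacity_bits(*narrow) // KBIT
wide_kbit = capacity_bits(*wide) // KBIT
```

The trade-off is between depth and width: the narrow configuration needs 15 address bits and delivers 1 bit per access, while the wide one needs 9 address bits and delivers 72 bits per access.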
13
Interconnect
Flexibly connects the logic resources
Based on a grid structure
  Crossbar switches enable connections between horizontal and vertical routing lines
Often a segmented structure is used
  Not every routing line is switched at every junction
  Reduces propagation delay
Some FPGAs have busses as well
  Requires tri-state drivers
  Only one source may drive the line at a time
14
Input and Output
Connects FPGA to external devices
Basic interface
  LVTTL and LVCMOS
Advanced signalling
  Double data rate (DDR)
    Data transferred on rising and falling edges of the clock
  Differential signalling (LVDS)
    Uses two I/O pins as a differential pair to improve noise immunity
High speed communication signals include parallel-to-serial and serial-to-parallel conversion
  Serialisation and deserialisation (SERDES) logic
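The parallel-to-serial conversion and the effect of double data rate can be sketched in a few lines. This is a behavioural illustration only: the function names, the LSB-first bit order, and the 8-bit word are assumptions for the example, and real SERDES blocks are vendor specific.

```python
def serialise(word, width):
    """Parallel-to-serial: split a word into bits, least significant first."""
    return [(word >> i) & 1 for i in range(width)]


def deserialise(bits):
    """Serial-to-parallel: reassemble a word from bits, LSB first."""
    return sum(bit << i for i, bit in enumerate(bits))


def ddr_cycles(bits):
    """Group bits as (rising-edge bit, falling-edge bit) per clock cycle.

    Assumes an even number of bits.
    """
    return [(bits[i], bits[i + 1]) for i in range(0, len(bits), 2)]


bits = serialise(0xA5, 8)
cycles = ddr_cycles(bits)
```

With DDR, the 8 bits are carried in 4 clock cycles rather than 8, which is the point of transferring data on both clock edges.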
15
Clocking
FPGAs are synchronous devices
  Each register, memory, I/O controlled by a clock
  Each block can be controlled by only one clock
  A clock domain is all of the logic controlled by a clock
FPGAs use a dedicated clock network
  Minimises the skew between parts of a design
  Manages the large fan-out required
Special clock control blocks
  Delay locked loops
    Synchronise clocks with external sources and minimise skew
  Phase locked loops
    Synthesise different clock frequencies from a reference clock
16
FPGA Configuration
FPGA contents controlled by SRAM cells
  Allows infinite reprogrammability
  Configuration is volatile
    Must be reloaded on power on
    Commonly from a small ROM
  Some FPGAs enable encryption and compression of configuration file
Configuration specifies
  LUT function
  Register controls
  Interconnect
  I/O configuration
  Memory contents
(Figure: ROM supplying the configuration file to the FPGA)
17
FPGAs vs other processors FPGAs for Vision
18
Where FPGAs fit
(Figure: spectrum from inherently serial software to inherently parallel hardware, with generally increasing performance: general purpose processors, DSPs, multiprocessor and multi-core architectures, FPGAs, dedicated hardware (ASICs))
FPGAs combine speed of hardware with the flexibility of software at a relatively low cost
19
FPGAs vs ASICs
Comparison with application specific integrated circuits
FPGA size: 20× to 40× silicon area of ASIC
  Flexibility means additional logic not used in a particular design
  Programmable interconnect takes space
  Configuration logic has a significant overhead
ASIC speed: 3× to 4× faster than FPGA
  Larger size means further for signals to travel
  Increased capacitance
  Interconnect switches introduce delay
FPGA power: 10× to 15× that of an ASIC
  Number of transistors which must be switched
20
FPGAs vs ASICs
Design and mask costs of FPGAs significantly lower than ASICs
  Programmability spreads cost over many chips
Unit cost per chip of ASICs is significantly lower
Structured ASICs fall in between
  Predefined sea of gates
  Routing added through one or more metal layers
  Compared to FPGA:
    Half the power required
    50% performance improvement
(Figure: unit cost against volume (~500, ~5k, ~50k) for FPGAs, structured ASICs and ASICs)
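The trade-off between up-front cost and unit cost can be made concrete with a break-even calculation. The sketch below uses a simple linear cost model, and every figure in it is hypothetical, chosen only to make the arithmetic easy to follow; none of the numbers come from the slides.

```python
def total_cost(nre, unit_cost, volume):
    """Up-front (design and mask) cost plus per-chip cost times volume."""
    return nre + unit_cost * volume


def break_even_volume(nre_a, unit_a, nre_b, unit_b):
    """Volume at which option B (higher NRE, lower unit cost) matches A."""
    return (nre_b - nre_a) / (unit_a - unit_b)


# Hypothetical figures, for illustration only
FPGA_NRE, FPGA_UNIT = 10_000, 50
ASIC_NRE, ASIC_UNIT = 1_000_000, 5

volume = break_even_volume(FPGA_NRE, FPGA_UNIT, ASIC_NRE, ASIC_UNIT)
```

Below the break-even volume the FPGA is cheaper overall because its up-front cost is small; above it, the lower unit cost of the ASIC wins.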
21
FPGAs for Image Processing
FPGAs are programmable hardware
  Each logic block is independent hardware
  Parallel algorithm implemented on parallel hardware
Able to exploit parallelism inherent in images
  Separate hardware built for each operation
    Coarse grained pipelining
  Partition image over several parallel function blocks
    Spatial parallelism / SIMD type parallelism
  Streaming can feed image data serially through a single function block
  Duplicated logic from unrolling inner loops
    Functional parallelism
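The idea of streaming pixels through a chain of separate operations can be sketched with Python generators, where each generator stands in for one block of dedicated hardware. This models only the dataflow: Python still executes serially, whereas on an FPGA all stages would operate at the same time on different pixels. The stage names, offset, threshold level, and sample image are all made up for the example.

```python
def stream_image(image):
    """Raster-scan the image, producing one pixel at a time."""
    for row in image:
        yield from row


def subtract_offset(pixels, offset):
    """Pipeline stage 1: remove a constant offset, clamping at zero."""
    for p in pixels:
        yield max(p - offset, 0)


def threshold(pixels, level):
    """Pipeline stage 2: binarise each pixel."""
    for p in pixels:
        yield 1 if p >= level else 0


image = [[10, 200, 30], [250, 40, 180]]
result = list(threshold(subtract_offset(stream_image(image), 20), 100))
```

Each pixel passes through every stage without the intermediate image ever being stored, which is the property that makes coarse grained pipelining attractive for stream processing.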
22
FPGAs for Embedded Vision
Parallelism enables a lower clock speed
  Often by 2 or 3 orders of magnitude
Slower clock enables a lower power design
If whole algorithm can be implemented on an FPGA
  Small form factor
    Only 1 or 2 chips (plus power supplies)
Enables vision to be embedded into a design
  Smart sensors and cameras
  Integrated applications
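A back-of-envelope calculation shows where a clock reduction of this size can come from. The resolution, frame rate, and operations per pixel below are hypothetical, blanking intervals are ignored, and the model assumes a fully pipelined design that accepts one pixel per clock cycle.

```python
def pixel_rate(width, height, fps):
    """Pixels per second for a video stream (ignoring blanking)."""
    return width * height * fps


def serial_clock_hz(rate, ops_per_pixel):
    """Clock needed if one processor performs every operation in sequence."""
    return rate * ops_per_pixel


# Hypothetical example: 640 x 480 video at 30 frames per second,
# with 300 operations required per pixel
rate = pixel_rate(640, 480, 30)
serial_clock = serial_clock_hz(rate, 300)
streamed_clock = rate  # one pixel per clock when all operations run in parallel
```

Under these assumptions the serial implementation needs a clock about 300 times faster than the streamed one (roughly 2.8 GHz against 9.2 MHz), which is in line with the 2 to 3 orders of magnitude quoted above.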
23
Benefits of FPGAs
FPGAs if used correctly can give
  Significant acceleration of the processing
  Improvements in both latency and throughput
While achieving
  Lower clock speed
  Lower power
Enabling
  Embedded vision
  Smart cameras
  Efficient real-time processing
24
FPGA Families
Xilinx: Spartan, Virtex, Artix, Kintex
Altera: Cyclone, Arria, Stratix
Lattice: ECP, XP, SC/M
Achronix: Speedster
SiliconBlue: iCE65
Tabula: ABAX
Actel: ProASIC3, IGLOO
Atmel: AT40K
QuickLogic: Eclipse
25
Xilinx FPGAs
Two main families:
  Low cost: Spartan, Artix and Kintex
  High performance: Virtex
Table columns: Generation, LUT size, Block size, Fabric RAM, Block RAM, DSP size
  Spartan-II, Virtex: 4 64 bit 4 Kbit none
  Spartan-III, Virtex-II, -4: 8 18 Kbit 18×18 18×18+48
  Spartan-6, Virtex-5, -6: 6 or 5×2 256 bit 36 Kbit 18× ×18+48
  Artix, Kintex, Virtex-7: 6 25×18+48
Cell labels: Spartan-III DSP & Virtex-4; Spartan-6; Virtex-5, -6
26
Altera FPGAs
Three families:
  Low cost: Cyclone
  High speed communication: Arria
  High performance: Stratix
Table columns: Generation, LUT size, Block size, Fabric RAM, Block RAM, DSP size
  Cyclone II: 4 10 none 4.5 Kbit 18×18
  Cyclone III, IV: 16 9 Kbit
  Stratix II, Arria: 8 ALM 8 576 bit 9 Kbit 576 Kbit 36×36
  Stratix III, IV, Arria II: 320 bit 640 bit 9 Kbit 144 Kbit 2×36×36
  Stratix V: 640 bit 20 Kbit 54×54
Cell label: Stratix III
27
Altera ALM
Adaptive Logic Module
  With large LUTs, often logic is underutilised
  Adapts effective size of LUTs to suit logic requirements
8 inputs, 2 outputs
Can be used for:
  Two independent 4-LUTs
  Independent 3-LUT and 5-LUT
  Two independent 5-LUTs with 2 shared inputs
  A single 6-LUT
  Two 6-LUTs with same function, and 4 shared inputs
Other features
  Two full adders
  Two flip-flops (4 in the Stratix V)
28
Other FPGAs Suited for Imaging
Table columns: Family, LUT size, Block size, Fabric RAM, Block RAM, DSP size
  Lattice ECP, XP: 4 8 128 bit 9 Kbit 36×36
  Lattice ECP2,3, XP2: 64 bit 18 Kbit 36×36 18×36+54
  Achronix Speedster: 18×18
  SiliconBlue iCE65: none 4 Kbit
  Tabula ABAX: 4:1 mux 16 576 bit 36 Kbit 72 Kbit 18×18+44
  Actel ProASIC3, IGLOO: 3-LUT 1
Speedster: high speed (1.5 GHz) self-synchronous pipelined
ABAX: time multiplexes multiple configurations at 1.6 GHz
iCE65, IGLOO: low power, with on-chip configuration flash
29
Summary
A detailed understanding of the internal architecture is not essential to program FPGAs
  Compiler and vendor mapping tools handle these details
  However, a basic knowledge can be used to develop a more efficient implementation
Basic architecture:
  Programmable logic, usually based on LUTs
  Memory available at a range of granularities
    Each memory block is parallel, enabling a wide bandwidth
  Programmable interconnect
FPGAs configured using SRAM to control each function
  Volatile, so must be reconfigured on power-on