1
Design for Embedded Image Processing on FPGAs
Chapter 5 Mapping Techniques
2
Outline
Timing constraints: pipelining, synchronisation, clock domains
Memory bandwidth constraints: caching and row buffering
Resource contention: busses, multiplexing, controllers, reconfigurability
Computational techniques: number systems, lookup tables, CORDIC, other techniques
3
Constraints Real-time implementation imposes constraints
Video rate input or output (timing) constraints: streamed processing; VGA output must produce 1 pixel every 40 ns. Memory bandwidth: memory can only be accessed once per clock cycle, which requires data caching. Resource constraints: contention between parallel hardware blocks; logic block usage (limited resources available).
4
Timing Constraints: problem, pipelining, process synchronisation, multiple clock domains
5
Timing Constraints Particularly important with stream processing
Source driven processes (example: processing data from a camera): timing is governed by the rate data is produced at the source; the processing hardware has little control over the arrival rate and must respond to data and events as they arrive. Sink driven processes (example: processing data to be displayed on a screen): timing is governed by the rate data is consumed; the processing must be scheduled to provide data and events at the right times.
6
Stream Processing All calculations performed at pixel clock rate
Delay can easily exceed a single clock cycle. Pipelining spreads the operation over several clock cycles, using intermediate registers to hold temporary results; any throughput can be achieved with sufficient resources. It introduces control complications: priming, stalling and flushing the pipeline, and synchronisation issues. [Figure: function blocks between input stream and output, with pipeline registers inserted between them]
7
Low-Level Pipelining Splits a calculation over several clock cycles
Example: a 2-stage pipeline registers the intermediate results; a 4-stage pipeline gives each stage a smaller propagation delay, which can be reduced further with retiming. A behavioural sketch of a 2-stage pipeline follows.
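As a loose software analogue (not HDL), the following Go sketch models a 2-stage pipeline computing y = (a + b) × c: stage 1 registers the sum, stage 2 multiplies, and a result emerges two ticks after its inputs. The function and operand values are illustrative assumptions.

```go
// Loose software analogue (not HDL) of a 2-stage pipeline for y = (a+b)*c.
package main

import "fmt"

type pipeline struct {
	sumReg, cReg int // stage-1 registers: the sum and the forwarded operand c
	outReg       int // stage-2 register: the final product
}

// tick models one clock cycle: every register loads from the preceding stage.
func (p *pipeline) tick(a, b, c int) int {
	out := p.outReg              // value leaving the pipeline this cycle
	p.outReg = p.sumReg * p.cReg // stage 2: multiply the registered operands
	p.sumReg, p.cReg = a+b, c    // stage 1: register the sum and forward c
	return out
}

func main() {
	var p pipeline
	for i := 1; i <= 6; i++ {
		// Feed (a, b, c) = (i, i, i); the matching output appears 2 ticks later,
		// so the first two outputs are priming values.
		fmt.Printf("cycle %d: out = %d\n", i, p.tick(i, i, i))
	}
}
```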
8
Retiming
In the last example the multiplication takes longer than the addition, so the propagation delay of the multiply dominates and the clock speed is limited by the slowest stage. Retiming balances the propagation delays by moving some of the multiplication logic to after the registers, speeding up the overall design.
9
Low-Level Pipelining Throughput Latency Priming Flushing Stalling
Throughput: 1 value per clock cycle. Latency: the time from an input to the corresponding output (4 clock cycles here). Priming: the period of invalid output while waiting for the first y. Flushing: waiting after the last valid input for the last y. Stalling: stopping the clock while the input is invalid. [Figure: timing diagram of x values entering and y values leaving the 4-stage pipeline over time]
10
Synchronisation
Operations in an application level algorithm are often pipelined. This is easy if all of the operations are streamed, but more difficult for random access and hybrid modes. The inputs and outputs of each operation in the pipeline must be synchronised, since the operations have different latencies and different priming requirements. [Figure: chain of operations, Operation 1 → Operation 2 → Operation 3]
11
Synchronisation Methods
Global scheduling: determine which operation requires data when; usually a global counter is maintained and events are scheduled by matching the count. Effective if all events are synchronous.
12
Synchronisation Methods
Data valid flags: each operation runs independently, and a data valid flag is associated with each output. The flag controls the operation of downstream function blocks. Best suited to source driven processes; sink driven processes may also have a "wait" flag propagating upstream. A minimal sketch follows. [Figure: Operation 1 → Operation 2 with Data, Data valid and Enable signals]
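A minimal behavioural sketch of the data valid flag idea: each sample carries a valid bit alongside its data, and downstream stages propagate the flag and ignore invalid samples. The stage function and sample values are assumptions for illustration.

```go
// Behavioural sketch of a data valid flag: each sample carries a valid bit,
// and a downstream stage only acts on samples that are flagged valid.
package main

import "fmt"

type sample struct {
	value int
	valid bool
}

// double is a stand-in processing stage that propagates the valid flag.
func double(in sample) sample {
	return sample{value: in.value * 2, valid: in.valid}
}

func main() {
	stream := []sample{{3, true}, {0, false}, {7, true}} // gap of invalid data
	for _, s := range stream {
		if out := double(s); out.valid {
			fmt.Println("output:", out.value)
		} else {
			fmt.Println("output: not valid, downstream stage idles")
		}
	}
}
```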
13
Synchronisation Methods
CSP channels guarantee synchronous communication: both transmitter and receiver must be ready for the data transfer, and one operation will block (stall) if the other is not ready. Suitable for either input or output driven processes. A channel may have an associated FIFO buffer, which relaxes the timing at the expense of resources. [Figure: Operation 1 → Operation 2 linked by a channel]
14
CSP Channels Synchronisation and transfer logic
Transfer takes place when both sender and receiver are ready
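Go's channels implement CSP-style rendezvous directly, so a short sketch can illustrate the behaviour described above: an unbuffered channel blocks whichever side is not ready, and giving the channel a capacity plays the role of the optional FIFO. The producer function and values are illustrative.

```go
// Sketch of CSP-style synchronisation: an unbuffered Go channel blocks both
// sides until sender and receiver are ready, mirroring the rendezvous above.
package main

import "fmt"

func producer(ch chan<- int) {
	for i := 0; i < 4; i++ {
		ch <- i * i // blocks (stalls) until the consumer is ready
	}
	close(ch)
}

func main() {
	ch := make(chan int) // unbuffered: strict rendezvous; make(chan int, 8) would add a FIFO
	go producer(ch)
	for v := range ch { // blocks until the producer offers a value
		fmt.Println("received", v)
	}
}
```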
15
Multiple Clock Domains
Not all external devices operate at the same clock frequency, so interfaces may need to be built in different clock domains, and the different domains are asynchronous relative to one another. Communication between clock domains uses channels, shared memory or FIFO buffers. Signals crossing between domains require synchronisation; failure to synchronise can result in metastability.
16
Cross Domain Synchronisation
A synchroniser with bi-directional handshaking ensures the data is stable before enabling the clock in the receiver, signalling by toggling the Request and Acknowledge lines. A complete transfer takes 3 or 4 clock cycles, which limits the throughput to this rate.
17
Cross Domain Synchronisation
Higher throughput requires a FIFO type buffer, built using a small dual-port memory. One port is in each domain, so the two sides can be clocked independently. When the buffer is half full, a synchronisation signal is sent; while one half is being unloaded, the other half is being filled.
18
Memory Bandwidth Constraints
Memory architectures Caching Row buffering FIFOs
19
Memory Bandwidth Constraints
Software architecture: read from memory, process, write results to memory; memory bandwidth is often the bottleneck. Streamed pipeline architecture: rather than writing results to memory, results are passed immediately to the next operation, eliminating many memory accesses. As much processing as possible should be performed while the data is passing through the FPGA.
20
Memory Bandwidth Constraints
Some operations require image buffering, when the computation order does not follow a raster scan. The required buffer size depends on the operation, and is potentially a whole frame (e.g. rotation by 90°). A frame buffer requires a large amount of memory: buffering a single PAL frame at 768 × 576 × 24 bpp takes about 1.3 MB. Off-chip memory is typically used for the frame buffer, and usually allows only a single access per clock cycle, which is a problem when more than 1 pixel must be accessed per clock cycle; higher speed pipelined memory complicates the design.
21
Parallel Memory Architectures
Multiple copies in parallel banks: each bank is accessed separately, duplicating the contents if necessary. Partition odd and even banks: distributes data over multiple banks, allowing 2×2 blocks to be accessed. Scramble addresses: allows both 1×4 and 4×1 blocks to be accessed.
22
Parallel Memory Architectures
Data packing: increase the word width so that each access reads or writes several pixels. Bank switching (double buffering): the upstream process writes to one bank while the downstream process reads from the other; the banks are swapped at the end of each frame, as in the sketch below. [Figure: frame buffer split into Bank 1 and Bank 2]
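A behavioural sketch of bank switching: the writer fills one bank while the reader uses the other, and the two are swapped at the end of each frame. The frame size and pixel values are illustrative assumptions.

```go
// Behavioural sketch of bank switching (double buffering): writer and reader
// use opposite banks and swap at the end of each frame.
package main

import "fmt"

const frameSize = 4

func main() {
	bankA := make([]int, frameSize)
	bankB := make([]int, frameSize)
	writeBank, readBank := bankA, bankB

	for frame := 0; frame < 3; frame++ {
		for i := range writeBank {
			writeBank[i] = frame*100 + i // upstream process fills one bank
		}
		fmt.Println("frame", frame, "downstream reads:", readBank) // other bank

		writeBank, readBank = readBank, writeBank // swap banks at end of frame
	}
}
```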
23
Increasing Bandwidth
Multi-port memory: 2 or more parallel accesses (R/R, R/W, W/W). Higher speed RAM clock: multiple sequential accesses per system clock cycle (multi-phase design); DDR RAM uses both rising and falling clock edges. [Figure: multi-port frame buffer, and timing with a RAM clock at 3× the system clock giving three RAM accesses per cycle]
24
Caching Stores data that will be used again in a secondary memory
Data from the cache can be read in parallel with data from the frame buffer. Caching data accessed from memory reduces memory bandwidth; caching previous results that will be used again reduces computation expense. On-chip RAM is often used for the cache: small blocks, with multiple parallel blocks giving a wide bandwidth. [Figure: off-chip RAM with an on-chip cache]
25
Cache Architectures Sequential cache Parallel cache
Sequential cache: a cache controller is responsible for ensuring data is available; usually a FIFO when interfacing with slow or burst memory. Parallel cache: data can be read from the cache in parallel with memory.
26
Row Buffering A common form of caching in image processing
Consider a 3×3 window filter: each output pixel value is a function of 9 input pixels. In a direct implementation each window position requires reading 9 pixels from memory, so each pixel is read 9 times (once for each window position it falls within). The resulting memory bandwidth limits the processing speed; this is overcome with caching.
27
Row Buffering for Filters
Streamed processing: there is significant overlap between successive windows, so data is reused by shifting it along the window registers and only the new pixels need to be loaded. Row buffers cache the incoming stream, so only one new pixel needs to be loaded for each window position. [Figure: window registers fed by the input stream and by row buffers]
28
Row Buffers Implementation
Row buffers can be placed in series or in parallel with the window (series shown here), and implemented as a shift register or as circular memory (dual-port), often indexed by the pixel x address; a behavioural sketch follows. [Figure: input stream feeding the window registers through row buffers, with the filter function producing the output; row buffer built from dual-port RAM with an address counter]
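The following Go sketch models the row-buffered 3×3 window behaviourally: two row buffers, indexed by the pixel x address, cache the previous two rows so each window position needs only the single new pixel from the input stream. The image width, pixel values and window function (a plain sum) are assumptions for illustration.

```go
// Behavioural sketch of row buffering for a streamed 3x3 window filter.
package main

import "fmt"

const width = 6 // illustrative image width

// windowSum stands in for the real filter function applied to the window.
func windowSum(w [3][3]int) int {
	s := 0
	for _, row := range w {
		for _, p := range row {
			s += p
		}
	}
	return s
}

func main() {
	var row1, row2 [width]int // row buffers: previous and second-previous rows
	var win [3][3]int         // window registers

	for i := 0; i < 3*width; i++ { // stream three rows of pixels
		x, px := i%width, i // pixel value = stream index, for illustration

		// Shift the window registers one column to the left.
		for r := 0; r < 3; r++ {
			win[r][0], win[r][1] = win[r][1], win[r][2]
		}
		// New right-hand column: two cached pixels plus the incoming one.
		win[0][2], win[1][2], win[2][2] = row2[x], row1[x], px
		// Advance the row buffers at this column (indexed by the x address).
		row2[x], row1[x] = row1[x], px

		if i >= 2*width+2 { // earlier outputs are pipeline priming
			fmt.Println("window output:", windowSum(win))
		}
	}
}
```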
29
Circular Memory FIFO Address counter wraps around
This makes the memory addressing circular. Logic detects the buffer full and empty conditions, based on comparisons between the read and write address counters, as in the sketch below.
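A software analogue of the circular memory FIFO: the write and read address counters wrap around the memory, and full/empty are detected by comparing the counters. The depth is an illustrative assumption.

```go
// Software analogue of a circular memory FIFO: wrapping address counters,
// with full/empty detected by comparing the write and read counters.
package main

import "fmt"

const depth = 4

type fifo struct {
	mem    [depth]int
	wr, rd int // free-running write and read address counters
}

func (f *fifo) empty() bool { return f.wr == f.rd }
func (f *fifo) full() bool  { return f.wr-f.rd == depth }

func (f *fifo) push(v int) bool {
	if f.full() {
		return false
	}
	f.mem[f.wr%depth] = v // counter wraps, making the addressing circular
	f.wr++
	return true
}

func (f *fifo) pop() (int, bool) {
	if f.empty() {
		return 0, false
	}
	v := f.mem[f.rd%depth]
	f.rd++
	return v, true
}

func main() {
	var f fifo
	for i := 0; i < 6; i++ {
		fmt.Println("push", i, "accepted:", f.push(i)) // last two pushes fail (full)
	}
	for v, ok := f.pop(); ok; v, ok = f.pop() {
		fmt.Println("pop", v)
	}
}
```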
30
Resource Constraints Contention Multiplexing and busses
Resource controllers Conflict arbitration Reconfigurability
31
Resource Constraints Resource contention
System resources are finite: local memory, off-chip memory, and implemented function blocks. Scheduling is needed if concurrent access is required, and concurrent processes writing to the same resource must be avoided (this is often not detected by current programming languages). Logic blocks must be used efficiently, not just to reduce usage but also propagation delay. Arithmetic operations (multiplication, division, square roots, trig functions, etc.) consume significant resources; use lookup tables and other computation techniques where appropriate.
32
Resource Contention Only one process can access a resource at a time
The algorithm must be designed to handle potential conflicts by reducing the coupling between parallel sections; arbitration logic must be built and respected. Contention strategies: carry on regardless (dangerous, risks corrupting the data of both processes); wait for the resource to become free, i.e. block (requires stalling any pipelines); or perform some other task while waiting (hard to implement in hardware).
33
Resource Sharing Expensive resources may be shared
Sharing is subject to bandwidth constraints: simultaneous access to a shared resource is prohibited, so some form of arbitration is required. Sharing is implemented using multiplexers; to be practical, the resource must be more expensive than the multiplexer and control logic. An alternative is to connect resources via a bus, which is less expensive than multiplexers, but the bus is then a resource in its own right.
34
Busses Not all FPGAs have internal tristate buffers
Can implement using distributed multiplexers
35
Resource Manager Encapsulate resource with arbitration logic
Access via interface Interface includes handshaking signals
36
Arbitration Methods Scheduled access Semaphore access Priority access
Scheduled access: the algorithm is designed to separate the accesses; works well for simple or synchronous algorithms, but asynchronous accesses are difficult to handle. Semaphore access: hardware is built to resolve conflicts; access may not be guaranteed. Priority access: different parts of the algorithm have different priorities; high priority sections are guaranteed access, while low priority sections only have access when the resource is free.
37
Scheduled Access Example
Bank switched memory: which RAM bank is used for reading or for writing is controlled by a state variable.
38
Prioritised Access Based on a priority encoder
Can be combined with a multiplexer to give a prioritised multiplexer; a sketch of the arbitration follows.
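A minimal sketch of prioritised arbitration, in the spirit of a priority encoder: the grant goes to the lowest-numbered (highest-priority) active request, and no grant is issued when nothing is requested.

```go
// Sketch of prioritised access: the grant goes to the lowest-numbered
// (highest-priority) active request, like a priority encoder.
package main

import "fmt"

// arbitrate returns the index of the winning requester, or -1 if none.
func arbitrate(requests []bool) int {
	for i, req := range requests {
		if req {
			return i // lower index = higher priority
		}
	}
	return -1
}

func main() {
	fmt.Println(arbitrate([]bool{false, true, true, false})) // grant to 1
	fmt.Println(arbitrate([]bool{false, false, false}))      // no grant: -1
}
```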
39
Semaphore Required if the access lasts longer than one cycle
40
Reconfigurability Compile time reconfigurability
Compile time reconfigurability: the complete functionality is determined at compile time. Run time reconfigurability: the algorithm is split into multiple sequential sections and the complete FPGA is reprogrammed as part of operation; data must be maintained off-chip so that it is not lost. Partial reconfigurability: only part of the FPGA is reprogrammed; the application is split into modules which interface with a static core through predefined interfaces. [Figure: static core with a dynamic module]
41
Computation Techniques
Number systems Lookup tables CORDIC Approximations Other techniques
42
Number Representation
Integers Unsigned Signed Real numbers Fixed point Floating point Logarithmic number system Residue number systems Redundant representations
43
Integers
The most common unsigned representation is based on binary numbers; use only as many bits as necessary. Signed number formats: sign magnitude (a separate sign bit, usually the MSB); two's complement (gives the MSB a negative weight); offset binary (adds an offset or bias to all numbers so that negative numbers become positive).
44
Fixed Point Real Numbers
A fixed point number scales an integer by a fraction, so the integer represents the number of fractions. Arithmetic operations are the same as for integers, but the scale factor must be tracked manually: it changes with multiplication, and words with different fractions must be aligned when adding. The alignment is by a fixed amount, a shift which is free in hardware. The dynamic range is limited by the size of the integer; this is overcome by allowing the fraction to vary (floating point). A small sketch follows.
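A small fixed-point sketch in an assumed Q8.8 format: values are integers scaled by 2^8, addition of like-scaled values is a plain integer add, and multiplication needs a shift to restore the scale factor.

```go
// Minimal fixed-point sketch in an assumed Q8.8 format: numbers are integers
// scaled by 2^8, addition is a plain add, and multiplication needs a shift
// to restore the scale factor.
package main

import "fmt"

const frac = 8 // number of fraction bits

func toFixed(x float64) int32 { return int32(x * (1 << frac)) }
func toFloat(x int32) float64 { return float64(x) / (1 << frac) }

func fxAdd(a, b int32) int32 { return a + b }                                 // operands share the same scale
func fxMul(a, b int32) int32 { return int32((int64(a) * int64(b)) >> frac) } // rescale after multiply

func main() {
	a, b := toFixed(1.5), toFixed(2.25)
	fmt.Println(toFloat(fxAdd(a, b))) // 3.75
	fmt.Println(toFloat(fxMul(a, b))) // 3.375
}
```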
45
Floating Point Numbers
Splits the number into two parts. Exponent: the position of the binary point, represented using offset binary. Significand: a binary fraction normalised between 1 and 2; the most significant bit is always 1, so it does not need to be stored. Two exponent values are reserved for special cases: all zeros (zero and denormalised numbers, where the binary fraction is not normalised) and all ones (overflow, infinity and NaN).
46
Floating Point Numbers
The sizes of the exponent and significand can be tailored to the accuracy of the calculation being performed. IEEE standard representations: 32 bit has 1 sign bit, 8 exponent bits and a 24 bit significand (23 bits stored plus the implicit leading 1); 64 bit has 1 sign bit, 11 exponent bits and a 53 bit significand. The standard also specifies the computation of arithmetic operations, making them processor independent. The logic for floating point is significantly more than for fixed point: performing normalisation, and detecting and managing all the different error conditions.
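For reference, the value of an IEEE normalised number is

$$x = (-1)^{s} \times 1.f \times 2^{\,e-\text{bias}}, \qquad \text{bias} = 127 \text{ (32 bit)}, \; 1023 \text{ (64 bit)}$$

where s is the sign bit, f the stored fraction bits and e the biased exponent.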
47
Logarithmic Number System
Represents a number by its logarithm (base 2), with a separate sign bit since logarithms of negative numbers are not defined. Similar precision and range to floating point. Attraction: multiplication and division become addition and subtraction. Disadvantage: addition and subtraction are significantly more complex; the second term (see below) is implemented either directly or using a lookup table.
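The "second term" referred to above comes from the standard logarithmic addition identity; writing X = log2 x and Y = log2 y with X ≥ Y,

$$\log_2(x + y) = X + \log_2\!\left(1 + 2^{\,Y-X}\right)$$

and the correction term $\log_2(1 + 2^{\,Y-X})$ is what is evaluated directly or taken from a lookup table.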
48
Residue Number Systems
Represents integers by their residues with respect to a number of co-prime moduli. Example: the moduli {13, 15, 17} can represent the range 0 to 13 × 15 × 17 − 1 = 3314. Addition, subtraction and multiplication are performed using modulo arithmetic on each residue independently, giving a shorter propagation delay and less logic (especially for multiplication). Conversion back to binary uses the Chinese remainder theorem.
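A short sketch of residue arithmetic using the slide's moduli {13, 15, 17}: each integer is held as three residues, and multiplication is performed independently on each residue with no carries between them. Conversion back to binary (Chinese remainder theorem) is omitted; the operand values are illustrative.

```go
// Sketch of residue arithmetic with the moduli {13, 15, 17}: each integer is
// held as three residues and multiplied independently modulo each base, with
// no carries between residues. CRT reconstruction is omitted.
package main

import "fmt"

var moduli = [3]int{13, 15, 17}

func toRNS(x int) (r [3]int) {
	for i, m := range moduli {
		r[i] = x % m
	}
	return
}

func mulRNS(a, b [3]int) (r [3]int) {
	for i, m := range moduli {
		r[i] = (a[i] * b[i]) % m // short, independent modular multiplies
	}
	return
}

func main() {
	a, b := toRNS(100), toRNS(23)
	fmt.Println(mulRNS(a, b))    // residues of 100*23 computed in RNS
	fmt.Println(toRNS(100 * 23)) // same residues, computed directly
}
```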
49
Redundant Representations
Arithmetic speed is limited by carry propagation. Using a redundant representation limits the length of carry propagation, giving significant speed improvements; variants include standard binary, signed digit and asymmetric signed digit representations. Disadvantages: the additional digit values require wider signals (2 bits per digit for the radix-2 signed digit representation), which requires more logic to implement.
50
Lookup Tables Many functions are expensive to calculate
Precalculate the function and store the results in a lookup table. Table size can be traded for accuracy by dropping least significant bits of the input; the accuracy of the approximation then depends on the slope of the function. The input value does not need to be rounded; the value stored in the table is adjusted instead. [Figure: lookup table indexed by the truncated input]
51
Interpolated Lookup Tables
For “smooth” functions, table size can be traded for computational resources and latency. Two truncated tables are used, holding values and slopes; the slope is used to interpolate between stored values, so the accuracy depends on the curvature of the function. LUTs in parallel are effectively a single LUT with a wider data path. A sketch follows. [Figure: value and slope tables with linear interpolation]
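A sketch of an interpolated lookup table for sin(x) on [0, π/2): the upper bits of the input index small value and slope tables, and the dropped lower bits linearly interpolate between entries. The input width and table size are illustrative assumptions, and floating point is used for clarity where hardware would use fixed point.

```go
// Sketch of an interpolated lookup table for sin(x) on [0, pi/2): the upper
// input bits index small value and slope tables; the dropped lower bits
// interpolate linearly between entries. Sizes are illustrative.
package main

import (
	"fmt"
	"math"
)

const (
	inBits   = 12               // total input precision
	idxBits  = 6                // bits used to index the tables
	fracBits = inBits - idxBits // bits used for interpolation
)

var value, slope [1 << idxBits]float64

func init() {
	step := (math.Pi / 2) / (1 << idxBits)
	for i := range value {
		value[i] = math.Sin(float64(i) * step)
		slope[i] = math.Sin(float64(i+1)*step) - value[i] // change over one segment
	}
}

// lutSin approximates sin for a 12-bit input spanning [0, pi/2).
func lutSin(x int) float64 {
	idx := x >> fracBits                                   // truncated table index
	frac := float64(x&((1<<fracBits)-1)) / (1 << fracBits) // interpolation fraction
	return value[idx] + frac*slope[idx]
}

func main() {
	x := 1 << 11 // halfway through the input range, i.e. pi/4
	fmt.Println("LUT:", lutSin(x), " reference:", math.Sin(math.Pi/4))
}
```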
52
Interpolated Lookup Tables
Using dual-port RAM to reduce table width Allows two table accesses Calculate slope, rather than store in table Higher order interpolation More tables in parallel to give interpolation coefficients
53
Bipartite Lookup Tables
With smoothly varying functions, slopes are often similar Rather than multiply by slope, results of multiplication are stored in a lookup table Offsets from the slope are shared for several segments Symmetry can also be exploited to further reduce the size of the offset LUT
54
Higher Order Approximations
Lookup tables: table size grows exponentially with precision. Polynomial approximations: coefficients are chosen to minimise the approximation error; size grows approximately linearly with precision. Summary comparison:
Precision     Best method
< 10 bits     Direct lookup table
8-14 bits     Bipartite lookup table
12-20 bits    Interpolated lookup table
> 16 bits     Polynomial approximation
55
CORDIC Method for calculating trigonometric functions
Based on incremental rotations; the trick is choosing the rotation factors to be powers of 2, so that the multiplications become shifts. Here dk is the direction of each micro-rotation; the result rotates the input vector by the accumulated angle, with a gain of approximately 1.647. The standard iteration equations are given below.
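The standard CORDIC rotation iteration that these bullets describe (reconstructed here, since the slide's equations are not reproduced in the text) is

$$x_{k+1} = x_k - d_k\, y_k\, 2^{-k}, \qquad y_{k+1} = y_k + d_k\, x_k\, 2^{-k}, \qquad d_k \in \{-1, +1\}$$

which rotates the input vector by $\sum_k d_k \arctan(2^{-k})$ with an overall gain of $K = \prod_k \sqrt{1 + 2^{-2k}} \approx 1.647$.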
56
CORDIC Implementation
Add an additional accumulator z to hold the angle Load signal resets counter and loads initial value Mode signal selects the rotation direction Small ROM contains the angles for each iteration
57
CORDIC Operation Modes
Rotation mode: starts with the angle in z and chooses dk = sign(zk) to converge the angle to zero; the result is to rotate the vector by that angle, which can be used to calculate the sine and cosine of an angle. Vectoring mode: aligns the vector with the x axis by choosing dk = −sign(yk) to converge y to zero; the result is the magnitude and angle of the vector, which can be used to calculate the arctangent. Each iteration gives 1 bit of the result. A sketch of rotation mode follows.
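A behavioural sketch of rotation mode computing sin and cos: floating point is used for clarity where a hardware implementation would use shifts on fixed-point registers and a small ROM of arctangent values; the iteration count and the gain compensation are illustrative.

```go
// Behavioural sketch of CORDIC in rotation mode computing sin and cos.
// Floating point is used for clarity; hardware would use fixed-point
// registers, shifts, and a small ROM of arctan(2^-i) values.
package main

import (
	"fmt"
	"math"
)

const iterations = 16

func cordicSinCos(angle float64) (sin, cos float64) {
	// Pre-scale by 1/K so the final result is free of the CORDIC gain (~1.647).
	k := 1.0
	for i := 0; i < iterations; i++ {
		k *= math.Sqrt(1 + math.Pow(2, -2*float64(i)))
	}
	x, y, z := 1/k, 0.0, angle

	for i := 0; i < iterations; i++ {
		d := 1.0
		if z < 0 {
			d = -1 // rotate in the direction that drives the residual angle to zero
		}
		p := math.Pow(2, -float64(i)) // 2^-i: a simple shift in hardware
		x, y = x-d*y*p, y+d*x*p
		z -= d * math.Atan(p)
	}
	return y, x
}

func main() {
	s, c := cordicSinCos(math.Pi / 6)
	fmt.Println("sin:", s, "expected:", math.Sin(math.Pi/6))
	fmt.Println("cos:", c, "expected:", math.Cos(math.Pi/6))
}
```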
58
Unrolled CORDIC Builds separate hardware for each iteration
Avoids the need for barrel shift This is an expensive operation Shift is fixed for an iteration so may be hard wired Angles added are constants so may be optimised If necessary may be pipelined for speed
59
CORDIC Variations
The problem with CORDIC is the gain factor. Compensated CORDIC introduces a scale factor with each iteration that makes the overall gain equal to 1; the scale factor requires one extra addition for x and y per iteration. Linear CORDIC: long multiplication and non-restoring division. Hyperbolic CORDIC: hyperbolic sine, cosine and arctangent. Related shift-and-add techniques can also be used for the exponential and logarithm.
60
Iterative Approximations
CORDIC and related algorithms converge slowly, at one bit per iteration. Newton-Raphson is a root finding algorithm with quadratic convergence: the number of correct bits doubles with each iteration. Example, square root: form the equation to solve, form the iteration, and determine an initial approximation (one common formulation is given below). The number of iterations depends on the accuracy of the initial approximation; a lookup table or polynomial approximation reduces the number of iterations.
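One common division-free formulation for the square root (an assumption here; the slide's own equations are not reproduced in the text) iterates on the reciprocal square root:

$$f(x) = \frac{1}{x^{2}} - N = 0, \qquad x_{k+1} = \tfrac{1}{2}\, x_k \left(3 - N x_k^{2}\right), \qquad \sqrt{N} = N\, x$$

with the number of correct bits approximately doubling each iteration.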
61
Other Techniques Bit-serial processing Incremental update Separability
Bit-serial processing: useful when resources are scarce and latency is not a problem; processes one bit per clock cycle. Incremental update: stream processing evaluates pixels successively, so the result from the previous pixel can be reused with an appropriate adjustment, as in the sketch below. Separability: separates a 2D operation into separate operations in each of the X and Y directions.
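A sketch of incremental update for a 1-D running (box) sum of width w: rather than re-summing w pixels at every position, the previous sum is reused by adding the incoming pixel and subtracting the one leaving the window. The pixel values and window width are illustrative.

```go
// Sketch of incremental update for a 1-D running (box) sum of width w:
// reuse the previous sum, add the incoming pixel, subtract the one leaving.
package main

import "fmt"

func runningSum(pixels []int, w int) []int {
	out := make([]int, 0, len(pixels))
	sum := 0
	for i, p := range pixels {
		sum += p
		if i >= w {
			sum -= pixels[i-w] // incremental adjustment: drop the oldest pixel
		}
		if i >= w-1 {
			out = append(out, sum)
		}
	}
	return out
}

func main() {
	fmt.Println(runningSum([]int{1, 2, 3, 4, 5, 6}, 3)) // [6 9 12 15]
}
```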
62
Summary Mapping requires matching computation to resources available
Constrained by: timing, overcome with pipelining and appropriate synchronisation; memory bandwidth, overcome with memory architecture and caching; and resources, shared using arbitration and resource controllers, and reduced using appropriate computation techniques. A range of number systems and computational techniques have been reviewed.