Design for Embedded Image Processing on FPGAs
Chapter 5: Mapping Techniques

Outline
- Timing constraints: pipelining, synchronisation, clock domains
- Memory bandwidth constraints: caching and row buffering
- Resource contention: busses, multiplexing, controllers, reconfigurability
- Computational techniques: number systems, lookup tables, CORDIC, other techniques

Constraints
Real-time implementation imposes constraints:
- Video-rate input or output (timing) constraints: streamed processing, e.g. VGA output must produce 1 pixel every 40 ns.
- Memory bandwidth: memory can only be accessed once per clock cycle, requiring data caching.
- Resource constraints: contention between parallel hardware blocks; logic block usage (limited resources available).

Timing Constraints
- Problem
- Pipelining
- Process synchronisation
- Multiple clock domains

Timing Constraints
Particularly important with stream processing.
- Source driven processes (example: processing data from a camera): timing is governed by the rate data is produced at the source; the processing hardware has little control over the arrival rate and must respond to data and events as they arrive.
- Sink driven processes (example: processing data to be displayed on screen): timing is governed by the rate data is consumed; the processor must be scheduled to provide data and events at the right times.

Stream Processing
- All calculations are performed at the pixel clock rate, but the delay can easily exceed a single clock cycle.
- Pipelining spreads an operation over several cycles: any throughput can be achieved with sufficient resources, using intermediate registers to hold temporary results.
- Introduces control complications: priming, stalling and flushing the pipeline, synchronisation issues.
(Figure: function blocks between input stream and output, with and without a pipeline register)

Low-Level Pipelining
- Splits a calculation over several clock cycles.
- Example: a 2 stage pipeline registers intermediate results; a 4 stage pipeline gives each stage a smaller propagation delay.
- Delay can be reduced further with retiming.

Retiming
- Uneven timing issues: in the last case, the multiplication takes longer than the addition, so the propagation delay of the multiply dominates; the clock speed is limited by the slowest stage.
- Retiming balances the propagation delays: it moves some of the multiplication logic to after the registers, speeding up the overall design.

Low-Level Pipelining
- Throughput: 1 value per clock cycle.
- Latency: time from an input to the corresponding output (4 clock cycles here).
- Priming: period of invalid output while waiting for the first y.
- Flushing: waiting after the last valid input for the last y.
- Stalling: stopping the clock for invalid input.
(Figure: timing diagram of values moving through a 4 stage pipeline)

Synchronisation
- Operations in an application level algorithm are often pipelined: easy if all of the operations are streamed, more difficult for random access and hybrid modes.
- The inputs and outputs of each operation in the pipeline must be synchronised: operations have different latencies and different priming requirements.

Synchronisation Methods Global scheduling Determine which operation requires data when Usually have a global counter and schedule events by matching count Effective if all events are synchronous

Synchronisation Methods
Data valid flags
- Each operation runs independently, with a data valid flag associated with each output.
- The flag controls the operation of downstream function blocks.
- Best suited for source driven processes; sink driven processes may also have a “wait” flag propagating upstream.

Synchronisation Methods
CSP channels
- Guarantee synchronous communication: both transmitter and receiver must be ready for the data transfer, and one operation will block (stall) if the other is not ready.
- Suitable for either input or output driven processes.
- A channel may have an associated FIFO buffer, which relaxes timing at the expense of resources.
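As a behavioural illustration only (not hardware code), the blocking semantics of a CSP-style channel with an optional FIFO can be sketched in Python; the Channel class and depth parameter are hypothetical names introduced for this example, and a depth of 1 only approximates a true rendezvous (the sender may run one item ahead).

import queue, threading

class Channel:
    """Sketch of a CSP-style channel: put() blocks until there is space,
    get() blocks until data arrives, keeping sender and receiver in step."""
    def __init__(self, depth=1):        # depth > 1 models an associated FIFO buffer
        self._q = queue.Queue(maxsize=depth)
    def put(self, value):               # transmitter side: stalls if the receiver is behind
        self._q.put(value, block=True)
    def get(self):                      # receiver side: stalls until data is available
        return self._q.get(block=True)

# Usage: a producer and a consumer operation linked by a channel
ch = Channel(depth=1)
threading.Thread(target=lambda: [ch.put(x) for x in range(4)], daemon=True).start()
print([ch.get() for _ in range(4)])     # [0, 1, 2, 3]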

CSP Channels Synchronisation and transfer logic Transfer takes place when both sender and receiver are ready

Multiple Clock Domains
- Not all external devices operate at the same clock frequency, so interfaces may need to be built in different clock domains; different domains are asynchronous relative to one another.
- Communicating between clock domains: channels, shared memory, FIFO buffers.
- Signals crossing between domains require synchronisation; failure to synchronise can result in metastability.

Cross Domain Synchronisation Synchroniser with bi-directional handshaking Ensures data is stable before enabling clock in receiver Signals by toggling Request and Acknowledge lines Complete transfer takes 3 or 4 clock cycles Throughput limited to this rate

Cross Domain Synchronisation Higher throughput requires a FIFO type buffer Build using small dual-port memory One port is in each domain, so can be clocked independently When buffer is half-full, synchronisation signal is sent While one half is being unloaded, other half is being filled

Memory Bandwidth Constraints Memory architectures Caching Row buffering FIFOs

Memory Bandwidth Constraints
- Software architecture: read from memory, process, write the results to memory; memory bandwidth is often the bottleneck.
- Streamed pipeline architecture: rather than writing results to memory, results are passed immediately to the next operation, eliminating many memory accesses.
- As much processing as possible should be performed while the data is passing through the FPGA.

Memory Bandwidth Constraints
- Some operations require image buffering, when the computation order does not follow a raster scan; the required buffer size depends on the operation, potentially a whole frame (e.g. rotation by 90°).
- A large amount of memory is required for a frame buffer: buffering a single PAL frame takes 768 x 576 x 24 bpp ≈ 1.3 MB.
- Off-chip memory is typically used for the frame buffer: usually a single access per clock cycle, so accessing more than 1 pixel per clock cycle is a problem, and higher speed pipelined memory complicates the design.

Parallel Memory Architectures
- Multiple copies in parallel banks: each bank is accessed separately; duplicate the contents if necessary.
- Partition odd and even banks: distributes data over multiple banks; allows 2×2 blocks to be accessed.
- Scramble addresses: allows 1×4 and 4×1 blocks to be accessed.

Parallel Memory Architectures
- Data packing: increase the word width so that each access reads or writes several pixels.
- Bank switching (double buffering): the upstream process writes to one bank while the downstream process reads from the other; swap banks at the end of the frame.

Increasing Bandwidth
- Multi-port memory: 2 or more parallel accesses (R/R, R/W, W/W).
- Higher speed RAM clock: multiple sequential accesses per system clock (multi-phase design); DDR RAM uses both rising and falling clock edges.
(Figures: multi-port frame buffer; RAM clocked at 3x the system clock giving 3 RAM accesses per system cycle)

Caching
- A cache stores data that will be used again in a secondary memory; data from the cache can be read in parallel with data from the frame buffer.
- Caching data accessed from memory reduces memory bandwidth; caching previous results that will be used again reduces computation expense.
- On-chip RAM is often used for the cache: small blocks, with multiple parallel blocks giving wide bandwidth.

Cache Architectures
- Sequential cache and parallel cache.
- The cache controller is responsible for ensuring data is available; the cache is usually a FIFO when interfacing with slow or burst memory.
- Parallel cache: data can be read from the cache in parallel with memory.

Row Buffering
- A common form of caching in image processing.
- Consider a 3×3 window filter: each output pixel value is a function of 9 input pixels.
- Direct implementation: each window position requires reading 9 pixels from memory, and each pixel is read 9 times (for different window positions); the resulting memory bandwidth limits the processing speed.
- Overcome with caching.

Row Buffering for Filters
- Streamed processing: there is significant overlap between successive windows, so reuse data by shifting it along; only the new pixels need to be loaded.
- Use row buffers to cache the incoming stream: only one new pixel needs to be loaded for each window position.
(Figure: window registers fed from the input stream and from the row buffers)
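A minimal software model of this caching scheme (the function and parameter names such as stream_3x3 and filter_fn are our own, and priming and image borders are ignored for simplicity; a real design would hold the row buffers in on-chip RAM):

from collections import deque

def stream_3x3(image, width, height, filter_fn):
    """Model of a streamed 3x3 window filter using two row buffers.
    Only one new pixel is read per clock cycle; the row buffers supply
    the corresponding pixels from the two previous rows."""
    row_buf1 = deque([0] * width)          # caches row y-1
    row_buf2 = deque([0] * width)          # caches row y-2
    window = [[0] * 3 for _ in range(3)]   # 3x3 window registers
    out = []
    for y in range(height):
        for x in range(width):
            pixel = image[y][x]            # single memory access per cycle
            # column entering the window: row y-2, row y-1, current row
            col = (row_buf2.popleft(), row_buf1.popleft(), pixel)
            row_buf2.append(col[1])        # row y-1 becomes row y-2 one row later
            row_buf1.append(pixel)
            for r in range(3):             # shift the window registers along
                window[r] = [window[r][1], window[r][2], col[r]]
            out.append(filter_fn(window))  # e.g. sum, min, max, weighted sum
    return out

# Usage: a 3x3 mean filter over a small test image
# stream_3x3(img, W, H, lambda w: sum(sum(row) for row in w) // 9)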

Row Buffers
Implementation: in series or in parallel with the window (series shown here).
- Shift register.
- Circular memory (dual-port), often indexed by the pixel x address.
(Figures: row buffers feeding the window and filter function; circular buffer built from dual-port RAM with an address counter)

Circular Memory FIFO Address counter wraps around Makes the memory addressing circular Logic to detect buffer full and empty Based on comparisons between address counters
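A behavioural sketch of such a FIFO (illustrative Python only; a hardware implementation would derive full and empty directly from comparisons of the read and write address counters):

class CircularFifo:
    """FIFO built on a fixed-size memory with wrap-around addressing."""
    def __init__(self, depth):
        self.mem = [None] * depth
        self.depth = depth
        self.wr = 0          # write address counter
        self.rd = 0          # read address counter
        self.count = 0       # occupancy, used here to detect full and empty
    def empty(self):
        return self.count == 0
    def full(self):
        return self.count == self.depth
    def push(self, value):
        assert not self.full(), "write to full FIFO"
        self.mem[self.wr] = value
        self.wr = (self.wr + 1) % self.depth   # address counter wraps around
        self.count += 1
    def pop(self):
        assert not self.empty(), "read from empty FIFO"
        value = self.mem[self.rd]
        self.rd = (self.rd + 1) % self.depth
        self.count -= 1
        return value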

Resource Constraints Contention Multiplexing and busses Resource controllers Conflict arbitration Reconfigurability

Resource Constraints
- Resource contention: system resources are finite (local memory, off-chip memory, implemented function blocks) and need scheduling if concurrent access is required; avoid concurrent processes writing to the same resource, which is often not detected in current programming languages.
- Must make efficient use of logic blocks, not just to reduce usage but also propagation delay.
- Arithmetic operations (multiplication, division, square roots, trig functions, etc.) consume significant resources; use look-up tables and other computation techniques where appropriate.

Resource Contention
- Only one process can access a resource at a time; the algorithm must be designed to handle potential conflicts by reducing coupling between parallel sections, and arbitration logic must be built and respected.
- Contention strategies:
  - Carry on regardless: dangerous, as it risks corrupting the data for both processes.
  - Wait for the resource to be free (block): need to stall any pipelines.
  - Perform some other task while waiting: hard to implement in hardware.

Resource Sharing
- Expensive resources may be shared, subject to bandwidth constraints; simultaneous access to a shared resource is prohibited, which requires some form of arbitration.
- Sharing is implemented using multiplexers: to be practical, the resource must be more expensive than the multiplexers and control logic.
- The alternative is to connect resources via a bus: less expensive than multiplexers, but the bus is a resource in its own right.

Busses Not all FPGAs have internal tristate buffers Can implement using distributed multiplexers

Resource Manager Encapsulate resource with arbitration logic Access via interface Interface includes handshaking signals

Arbitration Methods
- Scheduled access: the algorithm is designed to separate accesses; works okay for simple or synchronous algorithms, but asynchronous accesses are difficult to handle.
- Semaphore access: builds hardware to resolve conflicts; access may not be guaranteed.
- Priority access: different parts of the algorithm have different priority; high priority sections are guaranteed access, while low priority sections only have access when the resource is free.

Scheduled Access Example
Bank switched memory: which RAM bank is used for reading or writing is controlled by a state variable.

Prioritised Access Based on a priority encoder Can combine with multiplexer to give prioritised multiplexer
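A sketch of the idea in behavioural Python (not hardware; the function names are ours): a prioritised multiplexer grants the resource to the highest priority requester.

def priority_grant(requests):
    """Priority encoder: return the index of the highest priority asserted
    request (index 0 = highest priority), or None if nothing is requesting."""
    for i, req in enumerate(requests):
        if req:
            return i
    return None

def prioritised_mux(requests, data_inputs):
    """Prioritised multiplexer: route the data of the granted requester."""
    grant = priority_grant(requests)
    return None if grant is None else data_inputs[grant]

# Example: requesters 1 and 2 are both asking; requester 1 wins
print(prioritised_mux([False, True, True], ["a", "b", "c"]))   # -> "b"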

Semaphore Required if the access lasts longer than one cycle

Reconfigurability
- Compile time reconfigurability: the complete functionality is determined at compile time.
- Run time reconfigurability: the algorithm is split into multiple sequential sections and the complete FPGA is reprogrammed as part of operation; data must be maintained off-chip so that it is not lost.
- Partial reconfigurability: only part of the FPGA is reprogrammed; the application is split into modules which interface with a static core through predefined interfaces.

Computation Techniques Number systems Lookup tables CORDIC Approximations Other techniques

Number Representation
- Integers: unsigned, signed.
- Real numbers: fixed point, floating point.
- Logarithmic number system.
- Residue number systems.
- Redundant representations.

Integers
- The most common unsigned representation is based on binary numbers; use only as many bits as necessary.
- Signed number formats:
  - Sign magnitude: separate sign bit (usually the MSB).
  - Two's complement: gives the MSB a negative weight.
  - Offset binary: adds an offset or bias to all numbers so that negative numbers become positive.

Fixed Point Real Numbers
- Scales an integer by a constant fraction, so the integer counts multiples of that fraction.
- Arithmetic operations are the same as for integers, but the scale factor must be tracked manually: it changes with multiplication, and words with different fractions must be aligned when adding. The alignment is by a fixed amount, a shift which is free in hardware.
- Dynamic range is limited by the size of the integer; this is overcome by allowing the fraction to vary (floating point).
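For example (a small Python illustration with hypothetical Q-format helpers), a value can be held as an integer count of 2^-8 fractions:

FRAC_BITS = 8                     # Q8 format: scale factor 2^-8

def to_fixed(x):                  # real -> fixed point integer
    return int(round(x * (1 << FRAC_BITS)))

def to_real(x_fx):                # fixed point integer -> real
    return x_fx / (1 << FRAC_BITS)

a, b = to_fixed(3.25), to_fixed(0.5)
s = a + b                         # addition: same scale, plain integer add
p = (a * b) >> FRAC_BITS          # multiplication: scale doubles, shift back by 8
print(to_real(s), to_real(p))     # 3.75 1.625 (within quantisation error)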

Floating Point Numbers
- Splits the number into two parts:
  - Exponent: the position of the binary point, represented using offset binary.
  - Significand: a binary fraction normalised between 1 and 2; the most significant bit is always 1, so it does not need to be represented.
- Two exponent values are reserved for special cases: all zeros for denormalised numbers (the binary fraction is not normalised), and all ones for overflow, infinity and NaN.

Floating Point Numbers
- The size of the exponent and significand can be tailored to the accuracy of the calculation being performed.
- IEEE standard representation: 32 bit has 1 sign bit, 8 exponent bits and a 24 bit significand; 64 bit has 1 sign bit, 11 exponent bits and a 53 bit significand. The standard also specifies the computation of arithmetic operations, making them processor independent.
- The logic for floating point is significantly more than for fixed point: performing normalisation, and detecting and managing all the different error conditions.
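To make the field layout concrete, a small Python illustration that pulls apart an IEEE 754 single precision value (the 24th significand bit is the implicit leading 1):

import struct

def fields_single(x):
    """Unpack an IEEE 754 single precision float into its fields."""
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF            # offset binary, bias 127
    fraction = bits & 0x7FFFFF                # 23 stored bits; the leading 1 is implicit
    return sign, exponent - 127, 1 + fraction / (1 << 23)

print(fields_single(6.5))    # (0, 2, 1.625): 6.5 = +1.625 * 2^2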

Logarithmic Number System
- Represents a number by its logarithm (base 2), with a separate sign bit since logarithms of negative numbers are not defined.
- Similar precision and range to floating point.
- Attraction: multiplication and division become addition and subtraction.
- Disadvantage: addition and subtraction are significantly more complex, requiring a correction term of the form log2(1 ± 2^(y-x)); this second term is implemented either directly or using a lookup table.
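A small numerical sketch of these identities (plain Python, for illustration only):

from math import log2

x, y = log2(6.0), log2(3.0)            # LNS representations of 6 and 3

product  = x + y                        # multiply = add:      2^(x+y) = 18
quotient = x - y                        # divide   = subtract: 2^(x-y) = 2

# Addition needs the correction term log2(1 + 2^(y-x)),
# which in hardware would come from a small lookup table
total = x + log2(1 + 2 ** (y - x))      # log2(6 + 3) = log2(9)

print(2 ** product, 2 ** quotient, 2 ** total)   # approximately 18, 2 and 9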

Residue Number Systems
- Represents integers by their residues with respect to a number of co-prime moduli; for example, the moduli {13,15,17} can represent the range 0-3314.
- Addition, subtraction and multiplication are performed using modulo arithmetic on each residue independently: shorter propagation delay and less logic (especially for multiplication).
- Convert back to binary using the Chinese remainder theorem.
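A numeric illustration of these ideas in Python (the helper names are ours):

MODULI = (13, 15, 17)                      # co-prime, product 3315

def to_rns(x):
    return tuple(x % m for m in MODULI)

def rns_mul(a, b):                         # digit-wise modular multiply, no carries
    return tuple((ai * bi) % m for ai, bi, m in zip(a, b, MODULI))

def from_rns(r):
    """Chinese remainder theorem reconstruction back to binary."""
    M = 1
    for m in MODULI:
        M *= m
    x = 0
    for ri, mi in zip(r, MODULI):
        Mi = M // mi
        x += ri * Mi * pow(Mi, -1, mi)     # pow(..., -1, mi) is the modular inverse
    return x % M

a, b = 50, 61
print(from_rns(rns_mul(to_rns(a), to_rns(b))))   # 3050 = 50 * 61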

Redundant Representations
- Arithmetic speed is limited by carry propagation; using a redundant representation limits the length of carry propagation, giving significant speed improvements.
- Examples include radix-2 signed digit and asymmetric signed digit representations (alongside standard binary).
- Disadvantages: using additional digits requires wider signals (2 bits per digit for a radix-2 signed digit representation), which requires more logic to implement.

Lookup Tables
- Many functions are expensive to calculate: precalculate and store the results in a lookup table.
- Can trade table size for accuracy by dropping least significant bits of the input; the accuracy of the approximation then depends on the slope of the function.
- The input value does not need to be rounded; just adjust the values stored in the table.

Interpolated Lookup Tables
- For “smooth” functions, table size can be traded for computational resources and latency.
- Two truncated tables hold values and slopes; the slope is used to interpolate between values, and accuracy depends on the curvature.
- LUTs in parallel are simply a single LUT with a wider data path.
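A sketch of the value-plus-slope arrangement (Python, using log2 purely as an example function; the names and table sizes are illustrative assumptions):

from math import log2

IN_BITS, IDX_BITS = 12, 6                  # 12 bit input, 64 entry tables
DROP = IN_BITS - IDX_BITS                  # least significant bits dropped

# Value and slope tables for f(x) = log2(1 + x/2^IN_BITS) on [0, 1)
values = [log2(1 + i / 2**IDX_BITS) for i in range(2**IDX_BITS)]
slopes = [log2(1 + (i + 1) / 2**IDX_BITS) - v for i, v in enumerate(values)]

def interp_lut(x):
    """Interpolated lookup: index from the top bits, interpolate with the rest."""
    idx = x >> DROP                        # table index (top bits)
    frac = (x & (2**DROP - 1)) / 2**DROP   # remaining bits as a fraction
    return values[idx] + frac * slopes[idx]

x = 2500                                   # input sample in [0, 4096)
print(interp_lut(x), log2(1 + x / 4096))   # approximation vs exact value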

Interpolated Lookup Tables Using dual-port RAM to reduce table width Allows two table accesses Calculate slope, rather than store in table Higher order interpolation More tables in parallel to give interpolation coefficients

Bipartite Lookup Tables With smoothly varying functions, slopes are often similar Rather than multiply by slope, results of multiplication are stored in a lookup table Offsets from the slope are shared for several segments Symmetry can also be exploited to further reduce the size of the offset LUT

Higher Order Approximations
- Lookup tables: table size grows exponentially with precision.
- Polynomial approximations: coefficients are chosen to minimise the approximation error; size grows approximately linearly with precision.
Summary comparison (precision : best method)
- < 10 bits : direct lookup table
- 8-14 bits : bipartite lookup table
- 12-20 bits : interpolated lookup table
- > 16 bits : polynomial approximation

CORDIC
- A method for calculating trigonometric functions, based on incremental rotations.
- The trick is choosing the rotation factors to be powers of 2, so the multiplications become shifts:
  x[k+1] = x[k] - d_k y[k] 2^(-k),  y[k+1] = y[k] + d_k x[k] 2^(-k)
  where d_k = ±1 is the direction of rotation and each step rotates by arctan(2^(-k)).
- The result rotates the input vector by the accumulated angle, the sum of d_k arctan(2^(-k)), with a gain of approximately 1.64676.

CORDIC Implementation Add an additional accumulator z to hold the angle Load signal resets counter and loads initial value Mode signal selects the rotation direction Small ROM contains the angles for each iteration

CORDIC Operation Modes
- Rotation mode: starts with the angle in z and chooses d_k = sign(z_k) to converge the angle to zero; the result is to rotate the vector by that angle. Can be used to calculate the sine and cosine of an angle.
- Vectoring mode: aligns the vector with the x axis, choosing d_k = -sign(y_k) to converge y to zero; the result is the magnitude and angle of the vector. Can be used to calculate the arctangent.
- Each iteration gives 1 bit of the result.
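A minimal software model of rotation mode (floating point for clarity; a hardware version would use fixed point, hard-wired shifts and a small angle ROM):

from math import atan, radians

def cordic_rotate(angle, iterations=16):
    """CORDIC rotation mode: rotate the vector (1, 0) by 'angle' radians.
    After dividing by the gain K ~ 1.64676, x ~ cos(angle) and y ~ sin(angle)."""
    x, y, z = 1.0, 0.0, angle
    K = 1.0                                    # accumulated gain
    for k in range(iterations):
        d = 1 if z >= 0 else -1                # d_k = sign(z_k)
        x, y = x - d * y * 2**-k, y + d * x * 2**-k
        z -= d * atan(2**-k)                   # angles held in a small ROM in hardware
        K *= (1 + 2**(-2 * k)) ** 0.5
    return x / K, y / K

cos_a, sin_a = cordic_rotate(radians(30))
print(round(cos_a, 4), round(sin_a, 4))        # ~0.866 ~0.5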

Unrolled CORDIC
- Builds separate hardware for each iteration, avoiding the need for a barrel shift (an expensive operation): the shift is fixed for each iteration so it may be hard wired, and the angles added are constants so they may be optimised.
- If necessary, the unrolled design may be pipelined for speed.

CORDIC Variations
- A problem with CORDIC is the gain factor: compensated CORDIC introduces a scale factor with each iteration which makes the gain equal to 1, at the cost of one extra addition for x and y per iteration.
- Linear CORDIC: long multiplication and non-restoring division.
- Hyperbolic CORDIC: hyperbolic sine, cosine and arctangent.
- Related shift and add techniques can also be used for exponential and logarithm.

Iterative Approximations
- CORDIC and related algorithms converge slowly: one bit per iteration.
- Newton-Raphson is a root finding algorithm with quadratic convergence: the number of bits doubles with each iteration.
- Example, square root: form the equation to solve, form the iteration, determine an initial approximation.
- The number of iterations depends on the accuracy of the initial approximation; a lookup table or polynomial approximation reduces the number of iterations.
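One common formulation (an assumption on our part, since the slide's equations are not reproduced here) avoids division by iterating towards the reciprocal square root:

def newton_sqrt(N, x0, iterations=4):
    """Newton-Raphson square root via the reciprocal square root.
    Solving f(x) = 1/x^2 - N = 0 gives x[k+1] = x[k]*(3 - N*x[k]^2)/2,
    which converges to 1/sqrt(N); multiplying by N then gives sqrt(N).
    Uses only multiplication and addition, so it maps well to hardware."""
    x = x0                              # initial approximation (LUT in hardware)
    for _ in range(iterations):
        x = x * (3 - N * x * x) / 2     # precision roughly doubles each pass
    return N * x

print(newton_sqrt(2.0, x0=0.7))         # ~1.41421356 (square root of 2)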

Other Techniques
- Bit-serial processing: processes one bit per iteration; useful when resources are scarce and latency is not a problem.
- Incremental update: stream processing evaluates pixels successively, so the result from the previous pixel can be reused with an appropriate adjustment (see the sketch below).
- Separability: separates a 2D operation into separate operations in each of the X and Y directions.
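As an illustration of incremental update (a running horizontal window sum; the function name is our own), each new output adjusts the previous one rather than re-summing the whole window:

def running_window_sum(row, w):
    """Sum over a sliding window of width w along one image row.
    Each output is formed from the previous result by adding the
    incoming pixel and subtracting the one leaving the window."""
    total = sum(row[:w])                    # prime the first window directly
    sums = [total]
    for i in range(w, len(row)):
        total += row[i] - row[i - w]        # incremental adjustment, O(1) per pixel
        sums.append(total)
    return sums

print(running_window_sum([1, 2, 3, 4, 5, 6], 3))   # [6, 9, 12, 15]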

Summary
Mapping requires matching the computation to the resources available. It is constrained by:
- Timing: overcome with pipelining and appropriate synchronisation.
- Bandwidth: overcome with memory architecture and caching.
- Resources: share resources using arbitration and resource controllers; reduce usage with appropriate computation techniques.
A range of number systems and computational techniques have been reviewed.