Video on DSP and FPGA
John Johansson
April 12, 2004
Agenda
► Overview of video processing
► A typical video encoder and the DCT
► Requirements of DCT
► Comparison of DSP and FPGA chips
► Analysis and conclusions
► Questions
Overview of Video Processing
Video processing generally involves
► Compression / Decompression
► Special Effects
► TV Broadcasting
► Focus on Compression
Video Encoding
Typical Video Encoder
► Focus on DCT algorithm
The Discrete Cosine Transform
► DCT maps spatial data (pixels) into the frequency domain, like the FFT
► Rearranges data into a more compressible format
► Typically done on 64 (8x8) pixels at a time
► Big nasty equation …
► … But no sharp teeth (optimizes extremely well)
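The slide's equation did not survive in this text, so the following is only a minimal sketch, assuming the common orthonormal DCT-II convention on an 8x8 block. It shows one reason the transform optimizes well: the 2-D transform separates into 8 row transforms followed by 8 column transforms, cutting the work from 64 x 64 = 4096 multiply-accumulate terms to 16 x 64 = 1024. The function names are illustrative and not taken from the talk.

```python
import math

N = 8  # block edge length; the slides work on 8x8 blocks


def alpha(k):
    """Scale factor for the orthonormal DCT-II (an assumed convention)."""
    return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)


def dct_1d(vec):
    """1-D DCT-II of N samples."""
    return [
        alpha(k) * sum(vec[n] * math.cos((2 * n + 1) * k * math.pi / (2 * N))
                       for n in range(N))
        for k in range(N)
    ]


def dct_2d_direct(block):
    """2-D DCT straight from the definition: 64 terms per output value."""
    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            acc = 0.0
            for x in range(N):
                for y in range(N):
                    acc += (block[x][y]
                            * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                            * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            out[u][v] = alpha(u) * alpha(v) * acc
    return out


def dct_2d_separable(block):
    """Same result using 8 row transforms, then 8 column transforms."""
    rows = [dct_1d(r) for r in block]             # transform along each row
    cols = [dct_1d(list(c)) for c in zip(*rows)]  # transform along each column
    return [list(r) for r in zip(*cols)]          # transpose back to [u][v]


# A flat block puts all of its energy into the single DC coefficient.
flat = [[128] * N for _ in range(N)]
coeffs = dct_2d_separable(flat)
print(round(coeffs[0][0], 6))  # prints 1024.0
```

The 16 one-dimensional transforms are independent within each pass, which is what makes the row/column form attractive for parallel hardware.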
Requirements for DCT
Basic Idea
► Read in data (64 values, 8-24 bits, signed / unsigned)
► Do transformation
► Write out data
► Profit!!!
► Easy, right?
Requirements for DCT
Memory Limitations
► Load an entire frame?
► One uncompressed frame can vary from 50 KB to 50 MB in size
► External memory is much slower, but more plentiful
► Do the DCT in chunks (one 8x8 block at a time)
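To make the memory argument concrete, here is a small sketch that sizes an uncompressed frame and walks it one 8x8 block at a time, so only 64 values need to sit in fast internal memory at once. The 720x480, 3 bytes per pixel frame is just an illustrative assumption, as are the function names; the sketch also assumes frame dimensions that are multiples of 8.

```python
BLOCK = 8  # block edge length used by the DCT


def frame_bytes(width, height, bytes_per_pixel):
    """Size of one uncompressed frame in bytes."""
    return width * height * bytes_per_pixel


def iter_blocks(frame):
    """Yield (top, left, block) for each 8x8 tile of a frame.

    `frame` is a list of equal-length rows holding one channel;
    dimensions are assumed to be multiples of 8.
    """
    height, width = len(frame), len(frame[0])
    for top in range(0, height, BLOCK):
        for left in range(0, width, BLOCK):
            block = [row[left:left + BLOCK] for row in frame[top:top + BLOCK]]
            yield top, left, block


# Illustrative numbers only: a 720x480 frame at 3 bytes per pixel.
print(frame_bytes(720, 480, 3))        # prints 1036800
print((720 // BLOCK) * (480 // BLOCK))  # prints 5400 (blocks per channel)
```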
Requirements for DCT
Degree of Parallelism
► DCT can be done serially, or broken up and done in parallel
► Parallelism depends largely on available memory
► Price / Performance tradeoffs
The Challengers
Xilinx Spartan-3 FPGA
► 50K – 5M gates
► 326 MHz
► 100 KB – 2.3 MB internal memory
► Dedicated multipliers
► Oodles of I/O pins (up to 784)
Look at XC3S1000
► 1M gates, 560 KB memory, 24 multipliers, 376 I/O pins
The Challengers
ADSP-BF5xx Blackfin Processor
► 200 – 750 MHz
► Single or dual core
► DMA memory controller
► 52 KB – 326 KB internal memory
► Other processor goodies
Look at ADSP-BF533
► 500 MHz, single core, 148 KB memory
Performance
How do we correctly benchmark an algorithm between two completely different processors?
► I don't really know
► Look at some rough performance indicators and try to draw a conclusion
Performance
FPGA
► Varies from 1 to 25 cycles / pixel for DCT
► Reading and writing of data takes additional time
► Clock speed limited by degree of parallelism
DSP
► Roughly 5 cycles / pixel for DCT
► DMA controller allows parallel reading and writing with some setup overhead
(Ideal) Performance
Spartan-3
► 64 read + 64 compute + 64 write = 192 cycles / block
► At 326 MHz: about 1.70 Mblocks / second
Blackfin
► 319 compute + 10 DMA transfer = 329 cycles / block
► At 500 MHz: about 1.52 Mblocks / second
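The ideal figures can be rechecked with a few lines of arithmetic. This sketch takes the per-stage cycle counts and clock rates listed on the slide, sums the stages, and divides the clock rate by the cycles per block; the function name is illustrative.

```python
def blocks_per_second(clock_hz, cycles_per_block):
    """Ideal throughput: one block completes every `cycles_per_block` cycles."""
    return clock_hz / cycles_per_block


# Per-stage cycle counts as listed on the slide.
spartan3_cycles = 64 + 64 + 64  # read + compute + write = 192
blackfin_cycles = 319 + 10      # compute + DMA transfer = 329

spartan3_rate = blocks_per_second(326e6, spartan3_cycles)
blackfin_rate = blocks_per_second(500e6, blackfin_cycles)

print(f"Spartan-3: {spartan3_rate / 1e6:.2f} Mblocks/s")  # Spartan-3: 1.70 Mblocks/s
print(f"Blackfin:  {blackfin_rate / 1e6:.2f} Mblocks/s")  # Blackfin:  1.52 Mblocks/s
```

Both numbers ignore stalls, DMA setup, and external memory latency, so they are upper bounds rather than measured results.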
Advantages
FPGA
► Potential for very high parallelism
► Existing video designs available for purchase
► Good middleman functionality
DSP
► Higher potential clock speed
► Much more flexible design
► DMA memory controller
Disadvantages
FPGA
► Low flexibility
► Hard to optimize
► Limited logic blocks
DSP
► Difficult to achieve full utilization
► Higher power consumption
Conclusions
FPGA
► Best for well-defined roles, like DCT
► Faster in situations where throughput matters
► Can be very expensive
DSP
► Better for more flexible roles, like a full encoder
► Suited to situations where large amounts of (additional) memory are needed
Questions?