DSP Algorithms on FPGA Part II: Digital Image Processing


1 DSP Algorithms on FPGA Part II: Digital Image Processing

2 Content
Overview: image processing and FPGA
Algorithm to FPGA Mapping Flow
Nested Loop Algorithms and MODG
Example: Motion Estimation
Conclusion and Future Trends

3 Video Signal in Different Formats
Format: resolution (pixels), frame rate (f/s), pixel rate (Mp/s)
PAL: …*576
NTSC: 720*…
HDTV: …
Common delivery forms: analog (cable), USB, FireWire

4 Image Processing Characteristics
Need to maximize the available logic by supporting N-D, multiple configurable devices
For example: Image * 1 2 4

5 Challenges
How to...?
Appropriately partition algorithms between hardware and software
Exploit spatial and temporal parallelism
Integrate the configurable computer into the software framework
Select a suitable configuration strategy
How shall we deal with these challenges?

6 Why SRAM-Based FPGAs? (Pros)
Higher logic/storage capacity:
* Fast carry chain for adders/subtractors
* Built-in XOR gates/LUT
* Array of bit-parallel multipliers
* Fast and local storage: array of SRAM blocks
* Interconnect supports: three-state buffers/LUT
Equivalent to fine-grained reconfigurable hardware:
* Finer-grained pipelining can help preserve the performance at low power supply voltage
More mature CMOS manufacturing technology

7 Algorithm to FPGA Mapping Flow

8 The Matrix Multiplication MODG
The same algorithm can be carried out in a number of different execution orders.
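As a rough illustration of that point, the sketch below runs the matrix multiplication with the three loops nested in every possible order and checks that the result does not change. The function name `matmul_in_order` and the sample matrices are hypothetical and are not taken from the slides.

```python
from itertools import permutations


def matmul_in_order(a, b, order):
    """Multiply a (M x K) by b (K x N) with the i, j, k loops nested as `order`.

    `order` is a sequence such as ("k", "i", "j"): outermost loop first.
    """
    M, K, N = len(a), len(b), len(b[0])
    bounds = {"i": M, "j": N, "k": K}
    c = [[0] * N for _ in range(M)]
    outer, middle, inner = order
    for x in range(bounds[outer]):
        for y in range(bounds[middle]):
            for z in range(bounds[inner]):
                # Recover i, j, k regardless of which loop carries which index.
                idx = {outer: x, middle: y, inner: z}
                i, j, k = idx["i"], idx["j"], idx["k"]
                c[i][j] += a[i][k] * b[k][j]
    return c


# Hypothetical sample data: all six loop orders give the same product.
a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
results = [matmul_in_order(a, b, order) for order in permutations("ijk")]
assert all(r == results[0] for r in results)
```

Integer inputs are used so that the comparison is exact; with floating-point data the different accumulation orders could differ by rounding.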

9 Nested Do Loop Algorithms and Inter-Iteration Dependence Graph
Do i=1 to M
  Do j=1 to N
    c[i,j]=0;
    Do k=1 to K
      c[i,j]=c[i,j]+a[i,k]*b[k,j];
    EndDo k
  EndDo j
EndDo i
Dependence vectors:
da = (i,j,k)^t = (0,1,0)^t
db = (i,j,k)^t = (1,0,0)^t
dc = (i,j,k)^t = (0,0,1)^t
Index space: J^3 = {(i,j,k)^t : 1 ≤ i,j,k ≤ 3} (M=N=K=3)
Inter-iteration data dependence graph (DG)
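A minimal sketch of how the index space and the dependence graph for this loop nest could be enumerated, assuming the three dependence vectors listed on the slide. The helper names `index_space` and `dependence_edges` are hypothetical.

```python
from itertools import product

# Dependence vectors from the slide, in (i, j, k) order.
DEPENDENCES = {"a": (0, 1, 0), "b": (1, 0, 0), "c": (0, 0, 1)}


def index_space(M, N, K):
    """All iteration points (i, j, k) with 1 <= i <= M, 1 <= j <= N, 1 <= k <= K."""
    return list(product(range(1, M + 1), range(1, N + 1), range(1, K + 1)))


def dependence_edges(M, N, K):
    """Edges (variable, source, target) of the inter-iteration dependence graph."""
    nodes = set(index_space(M, N, K))
    edges = []
    for node in sorted(nodes):
        for name, d in DEPENDENCES.items():
            target = tuple(p + q for p, q in zip(node, d))
            if target in nodes:  # edges that leave the index space are dropped
                edges.append((name, node, target))
    return edges


nodes = index_space(3, 3, 3)
edges = dependence_edges(3, 3, 3)
print(len(nodes), "nodes,", len(edges), "edges")
```

Each dependence vector contributes an edge wherever both endpoints lie inside the index space, so the graph is uniform: every interior node has the same three outgoing edges.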

10 Systolic Mapping (space-time) of Matrix Multiplication
The 3-D DG (dependence graph) is mapped onto a 2-D processor array P.
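One possible space-time mapping can be sketched as follows. The projection direction (0,0,1) and the linear schedule t = i + j + k are assumptions chosen for illustration, not values stated on the slides; this choice keeps each c[i,j] in a single processing element. The sketch checks the usual validity conditions: every dependence advances in time, the schedule is not orthogonal to the projection, and no two iterations occupy the same PE in the same time step.

```python
from itertools import product

PROJECTION = (0, 0, 1)  # assumed: collapse the k axis, so PE(i, j) owns c[i, j]
SCHEDULE = (1, 1, 1)    # assumed linear schedule: t = i + j + k
DEPENDENCES = [(0, 1, 0), (1, 0, 0), (0, 0, 1)]  # da, db, dc


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def space_time(node):
    """Map an iteration (i, j, k) to (processor coordinates, time step)."""
    i, j, _k = node
    return (i, j), dot(SCHEDULE, node)


def check_mapping(n):
    """Return (valid, number of PEs, number of time steps) for an n*n*n index space."""
    nodes = list(product(range(1, n + 1), repeat=3))
    causal = all(dot(SCHEDULE, d) > 0 for d in DEPENDENCES)
    not_orthogonal = dot(SCHEDULE, PROJECTION) != 0
    slots = [space_time(node) for node in nodes]
    conflict_free = len(set(slots)) == len(slots)
    pes = {pe for pe, _ in slots}
    times = [t for _, t in slots]
    valid = causal and not_orthogonal and conflict_free
    return valid, len(pes), max(times) - min(times) + 1


print(check_mapping(3))
```

Under these assumptions the 27 iterations of the 3x3x3 example are spread over a 3x3 array, with each PE executing one iteration per time step.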

11 Systolic Mapping of Matrix Multiplication, cont.
[Figure: 3x3 processor array with one PE per output element C11 to C33; the elements a11 to a33 and b11 to b33 are shown passing through the PEs.]

12 Why Is Space-Time Mapping Suitable for FPGAs?
It bridges nested Do loop signal/image processing algorithms and the processor array implementation.
The space-time array matches the modular and regular FPGA structure.
The localized/pipelined interprocessor links can overcome the long programmable interconnect delay.
The size of the configuration storage can be significantly reduced because the processing elements and the interconnect structure are almost identical.

13 Problems with Existing Design Methodologies/Tools
The dependence graphs of many other algorithms are not uniform and must be predetermined by human designers.
Existing methodologies:
* cannot handle these complex algorithms
* use unrealistic cost functions (metrics)
* incorporate no built-in features of FPGAs
Longer interconnect delay in deep submicron CMOS technology
Much lower hardware utilization due to programmable interconnect delay in FPGAs
There is another problem: speed

14 What is Intra-PE Pipelining?
The interconnect delay of FPGAs results in an even longer clock period.
To enhance the overall throughput, intra-iteration parallelism must be exploited.
Example: a simple vector dot product array
It can be observed that the utilization of each operator is increased.
Of course, the control mechanism is more complex.
Tech done example
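The effect of intra-PE pipelining on a dot product can be illustrated with a small cycle-level model. This is a sketch under stated assumptions: a two-stage PE (a multiplier followed by an adder with a register in between), hypothetical function names, and delay values that are illustrative parameters rather than measured FPGA figures. Register overhead is ignored.

```python
def pipelined_dot(x, y):
    """Cycle-by-cycle model of a two-stage multiply-accumulate PE.

    Returns (result, cycles, multiplier busy cycles, adder busy cycles).
    """
    pairs = list(zip(x, y))
    product_reg = None  # pipeline register between multiplier and adder
    acc = 0
    pos = cycles = mult_busy = add_busy = 0
    while pos < len(pairs) or product_reg is not None:
        # Stage 2: the adder consumes the product latched in the previous cycle.
        if product_reg is not None:
            acc += product_reg
            add_busy += 1
        # Stage 1: in the same cycle the multiplier handles the next operand pair.
        if pos < len(pairs):
            xa, ya = pairs[pos]
            product_reg = xa * ya
            mult_busy += 1
            pos += 1
        else:
            product_reg = None
        cycles += 1
    return acc, cycles, mult_busy, add_busy


def total_time(n, t_mul, t_add, pipelined):
    """Time for n multiply-accumulates; t_mul and t_add are assumed stage delays."""
    if pipelined:
        return (n + 1) * max(t_mul, t_add)  # one extra cycle to drain the pipeline
    return n * (t_mul + t_add)              # clock period covers both operators


print(pipelined_dot([1, 2, 3, 4], [5, 6, 7, 8]))
```

In the unpipelined case each operator sits idle for part of every clock period; in the pipelined model both operators are busy in all but the first and last cycle, at the cost of the extra control needed to fill and drain the pipeline.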

15 Examples of Nested Do Loop Algorithms
Motion estimation
* One of the most time-consuming operations (tasks) in digital video compression
Stereo matching
* Used to build a disparity map for 3D robot/computer navigation
Matrix/vector multiplication
* FFT, DCT, 2D/3D graphics, etc.
2D linear transforms/operations
* 2D FFT, 2D DCT, etc.

16 Tennis frame 0

17 Tennis frame 1

18 Motion Vectors of 8x8-Pixel Blocks

19 Reconstructed Frame 1 from Frame 0 and Motion Vectors

20 Illustration of Full-Search Block-Matching Motion Estimation (6-Level Nested Do Loop)
Motion vector=(m,n)

21 Example: A Simpler PE Microarchitecture
MAD(m,n) = MAD(m,n) + |x(hN+i,vN+j) - y(hN+i+m-p,vN+j+n-p)|
Xilinx Core Generator System
Critical path delay = 25 ns, based on Xilinx Virtex data
1,500-2,000 equivalent gate count
The critical path (blue line) can be shortened further by intra-PE pipelining
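The MAD accumulation above, placed inside the six nested loops of full-search block matching, can be sketched in software as follows. Several details are assumptions made for illustration: the function name `full_search`, the use of the first array axis for the first index of x and y, and the choice to skip candidate blocks that fall outside the reference frame. The sketch models the algorithm only, not the PE hardware.

```python
def full_search(cur, ref, N, p):
    """Full-search block matching with N x N blocks and search range p.

    cur (x) and ref (y) are equally sized 2-D lists. Returns a dict mapping
    each block index (h, v) to the displacement (m - p, n - p) with minimum MAD.
    """
    rows, cols = len(cur), len(cur[0])
    vectors = {}
    for h in range(rows // N):              # loop 1: block index, first axis
        for v in range(cols // N):          # loop 2: block index, second axis
            best = None
            for m in range(2 * p + 1):      # loop 3: displacement + p, first axis
                for n in range(2 * p + 1):  # loop 4: displacement + p, second axis
                    top, left = h * N + m - p, v * N + n - p
                    # Assumption: candidates leaving the frame are skipped.
                    if top < 0 or left < 0 or top + N > rows or left + N > cols:
                        continue
                    mad = 0
                    for i in range(N):      # loop 5: pixel row within the block
                        for j in range(N):  # loop 6: pixel column within the block
                            mad += abs(cur[h * N + i][v * N + j] - ref[top + i][left + j])
                    if best is None or mad < best[0]:
                        best = (mad, (m - p, n - p))
            vectors[(h, v)] = best[1]
    return vectors


# Hypothetical test frames: one bright pixel that moves by (+1, +1).
ref = [[0] * 8 for _ in range(8)]
cur = [[0] * 8 for _ in range(8)]
ref[4][5] = 9
cur[5][6] = 9
print(full_search(cur, ref, N=4, p=1)[(1, 1)])
```

The zero displacement is always inside the frame, so every block receives a vector; ties are resolved in favour of the first candidate visited.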

22 Significance of the Contributions
The MODG representation for nested Do loop algorithms:
* The actual execution is not constrained to any predetermined order.
* It keeps track of every variable instance so that there is no redundant memory access, which saves I/O, bandwidth, and power consumption.
* It can be automated using memory.
Without the MODG:
* motion estimation and many other nested Do loop algorithms can be written as many different DGs,
* a human must be involved to formulate a DG, and
* the built-in ROM/RAM of the FPGA may not be exploited.

23 Significance of the Contributions, cont.
Space-time mapping for the MODG can be applied to:
* any SRAM-based FPGA (architecture constraints and practical cost functions)
* any coarse-grained architecture
Intra-PE pipelining enhances/preserves the throughput rate in low-power mode.

24 Conclusion
Users demand more communication/multimedia processing capabilities on resource-limited Internet appliances.
Reconfigurable SOC is the ultimate solution for designing the challenging low-power/high-performance platform.
Its success depends on the embedded high-density FPGA core as reconfigurable (programmable) accelerating hardware.
As technology (supply voltage) scales down, logic (transistors) is virtually free, while the interconnect becomes the bottleneck and the main power consumer.
Parallel execution of nested Do loop algorithms by an array of localized processing elements at a moderate clock frequency is a viable solution.
It can balance the three main issues: design time, power consumption, and performance.

25 Future Trends
Memory (storage) organization should be investigated, because multiple reads per clock cycle are needed to sustain such high throughput.
The control mechanism of the entire array is one of the aspects that will determine its success.
A given MODG may need to be partitioned so that the resulting array fits the on-chip reconfigurable FPGA core.

