
ELEC692 VLSI Signal Processing Architecture
Lecture 7: VLSI Architecture for Block Matching Algorithm for Video Compression
* Part of the notes is taken from the course notes of Prof. Bing Zeng's ELEC 533.

Reference
P. Pirsch, N. Demassieux, W. Gehrke, "VLSI architectures for video compression – a survey", Proceedings of the IEEE, vol. 83, no. 2, Feb. 1995.
T. Komarek, P. Pirsch, "Array architectures for block matching algorithms", IEEE Transactions on Circuits and Systems, vol. 36, no. 10, Oct. 1989.

Interframe Coding/Motion Estimation of Video Sequence

Interframe Transform/Predictive Coding

Prediction is based on a previously processed frame.
Prediction is accomplished by motion estimation (ME).
Motion estimation is done in the spatial domain.
The 2-D DCT has to be inside the coding loop, and a 2-D IDCT is needed to convert the frequency-domain information back to the spatial domain.

Motion Compensated Prediction

Block Matching Method

Search window

Block Matching Criterion
Mean Square Error (MSE)
Mean Absolute Difference (MAD)
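For orientation, the two criteria are commonly written as below. This is the usual textbook form, not a formula recovered from the slide. The symbols are assumptions borrowed from the exhaustive-search slide: x is the N × N reference block, y the search area, and (m,n) the candidate displacement.

```latex
\mathrm{MSE}(m,n) = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N}\bigl(x(i,j) - y(i+m,\,j+n)\bigr)^2
\qquad
\mathrm{MAD}(m,n) = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N}\bigl|x(i,j) - y(i+m,\,j+n)\bigr|
```

MAD avoids the multiplier that MSE needs for each pixel pair, which is one reason it is attractive for hardware. The constant factor 1/N^2 does not change which displacement wins, so it can be dropped when only the minimum is needed.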

Important factors for BM Motion Estimation
Block size
 – 8×8, 16×16, variable
Size of the search window
 – Depends on frame differences, speed of moving objects, resolution, etc.
Matching criterion
 – Accuracy vs. complexity, use of truncated pixels
Search strategy
 – Full search, hierarchical search, subsampling of the motion field
Hardware considerations

Real-time processing for BMA
Example: block size = 16×16, window size = 32×32, CIF frames at 30 frames/s.
For larger formats such as CCIR 601 or HDTV, full search requires several to tens of GOPS.
So full search has to be implemented in dedicated hardware.
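The order of magnitude can be checked with a back-of-the-envelope estimate. The sketch below is not the slide's own calculation. It assumes a 352 × 288 CIF luminance frame, a maximum displacement w = 8 (a 32 × 32 window around a 16 × 16 block), and three operations per pixel pair (subtract, absolute value, accumulate). The function name and the 720 × 480 CCIR 601 frame size are likewise illustrative assumptions.

```python
def full_search_ops_per_second(width, height, fps, n=16, w=8, ops_per_pixel=3):
    """Rough operation rate of full-search block matching (assumed cost model)."""
    blocks_per_frame = (width // n) * (height // n)   # reference blocks per frame
    candidates = (2 * w + 1) ** 2                     # candidate blocks per reference block
    ops_per_block = candidates * n * n * ops_per_pixel
    return blocks_per_frame * ops_per_block * fps


cif = full_search_ops_per_second(352, 288, 30)
ccir601 = full_search_ops_per_second(720, 480, 30)    # assumed 720 x 480 at 30 frames/s
print(f"CIF:      {cif / 1e9:.2f} GOPS")              # about 2.64 GOPS under these assumptions
print(f"CCIR 601: {ccir601 / 1e9:.2f} GOPS")
```

A larger search range raises these figures quadratically, which is consistent with the "several to tens of GOPS" range quoted for larger formats.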

Exhaustive Search Block Matching
A block of size N × N of the current image (reference block, denoted by X) is matched with all the blocks located within a search window (candidate blocks, denoted by Y).
Maximum displacement: w
The match is evaluated by computing the mean absolute difference (MAD) between the blocks.
The matching distance D is given by
$$D(m,n) = \sum_{i=1}^{N}\sum_{j=1}^{N} \left| x(i,j) - y(i+m,\,j+n) \right|, \qquad -w \le m,n \le w$$
The motion vector v is the displacement (m,n) that minimizes D(m,n).
Number of candidate blocks to be considered: (2w+1)^2

Algorithm to find the motion vector
Dmin = MAXVALUE
Vmin = (0,0)
for m = -w to +w
  for n = -w to +w
    D(m,n) = 0
    for i = 1 to N
      for j = 1 to N
        D(m,n) = D(m,n) + |x(i,j) - y(i+m,j+n)|
      endfor
    endfor
    if D(m,n) < Dmin then
      Dmin = D(m,n)
      Vmin = (m,n)
    endif
  endfor
endfor
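The loop nest above can be sketched in runnable form as follows. This is a minimal illustration rather than the lecture's implementation. It assumes 0-based lists of lists, with the search window y stored as an (N+2w) × (N+2w) array so that displacement (m,n) maps to the offset (m+w, n+w). The function name full_search is hypothetical.

```python
def full_search(x, y, w):
    """Exhaustive block matching: return (Vmin, Dmin) for reference block x.

    x: N x N reference block; y: (N + 2w) x (N + 2w) search window.
    """
    size = len(x)
    d_min = float("inf")
    v_min = (0, 0)
    for m in range(-w, w + 1):
        for n in range(-w, w + 1):
            d = 0
            for i in range(size):
                for j in range(size):
                    d += abs(x[i][j] - y[i + m + w][j + n + w])
            if d < d_min:               # keep the first minimum found
                d_min, v_min = d, (m, n)
    return v_min, d_min


# Synthetic test data: every window pixel has a distinct value.
N, W = 4, 2
y = [[100 * r + c for c in range(N + 2 * W)] for r in range(N + 2 * W)]
# Cut the reference block out of the window at displacement (1, -2).
x = [[y[i + 1 + W][j - 2 + W] for j in range(N)] for i in range(N)]
print(full_search(x, y, W))  # ((1, -2), 0)
```

The four nested loops make the N^2 · (2w+1)^2 cost per reference block explicit.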

Dependency graph
Calculating MAD
Calculating s_i(m,n) and s(m,n)
Calculating D_min and v

Dependency graph
The BM algorithm can be described by several different dependency graphs.
Example 1
AD = absolute difference and addition
M = minimum value computation

Dependency graph Example 2

Data input
Line scan and block scan
Line scan
 – TV lines are run through as a whole, from the upper to the lower side of the frame
Block scan
 – Square blocks of n × n pixels are run through in a block-line manner
 – Well suited if the data are supplied by a memory with block scan output
 – Pixels within a block are traversed column by column
 – E.g. for a (3×3)-pixel block, data are read in the order x(1,1), x(2,1), x(3,1), x(1,2), x(2,2), x(3,2), x(1,3), x(2,3), x(3,3)
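The column-by-column traversal inside one block can be written out as a small generator. This is an illustrative sketch. It assumes 1-based indices where the first index of x(i,j) runs along a column, which matches the (3×3) example. The name block_scan_order is hypothetical.

```python
def block_scan_order(n):
    """Yield the (i, j) indices of an n x n block in column-by-column order."""
    for j in range(1, n + 1):        # second index: advances once per column
        for i in range(1, n + 1):    # first index: runs fastest, down the column
            yield (i, j)


order = list(block_scan_order(3))
print(order)
# [(1, 1), (2, 1), (3, 1), (1, 2), (2, 2), (3, 2), (1, 3), (2, 3), (3, 3)]
```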

Mapping BMA onto Systolic Arrays
Decompose the algorithm into its basic operations and convert it into a form where each result is assigned to a unique variable.
Formulate it as an n-dimensional dependence graph (DG) of computation nodes and data dependence arcs.
One straightforward mapping implements a dedicated PE for each node of the DG and a communication link for each edge of the DG.
A more efficient design with higher processor utilization results if each PE executes the operations of multiple computation nodes.
 – This needs a time schedule and an assignment of multiple nodes to a single PE by projection.
 – The PEs need to be programmable to some extent.

Mapping BMA onto Systolic Arrays
The BMA is defined over a 4-dimensional index space (i,j,m,n).
The BMA can be decomposed into two parts, each defined over a two-dimensional index space.
 – The first is spanned by the indices i and j: computing the sum D(m,n)
 – The second is defined over m and n: the minimum search and the selection of the displacement vector

Transforming it into a 2-D array
The computation of D(m,n) is mapped onto a 2-D array of PEs.
The computation of v(X,Y) is mapped into time.

Realistic implementation of the 2-D array
Reduction of the cycle time
 – Pipelining of the computation of D(m,n).
I/O management
 – Each AD-PE receives a new value of y(m+i,n+j) at each clock cycle. Transmitting these N^2 values from an external memory at every cycle is not feasible.
 – We can take advantage of the fact that these values belong to the search window. A portion of the search window of size N·(2w+N) is stored in the circuit in a 2-D bank of shift registers, able to shift in the up, down, and right directions. Each AD-PE has one of these registers and can obtain the value of y(m+i,n+j) that it needs at each cycle.
 – To update this register bank, a new column of 2w+N pixels of the search area is serially entered into the circuit and inserted into the bank of registers.
 – To load a new reference block with low I/O overhead, double buffering of x(i,j) is required: the pixels x'(i,j) of the new reference block are serially loaded during the computation of the current reference block.

Implementation of the 2-D array

2-D array
An alternative projection of the DG onto the (i,j) plane gives the architecture AB2.
The current-frame data x(i,j) remain fixed in the AD PEs, so they have to be loaded into the array beforehand.
Time required = (2w+1)·(2w+1)

Mapping to a 1-D array
A more efficient design with higher processor utilization results if each PE executes the operations of multiple computation nodes.
The DG is mapped onto a 1-D array of PEs, which computes in parallel the partial distortion along one row.
D(m,n) is computed in N cycles.

1-D array
Project the DG along the i-axis onto a one-dimensional signal flow graph.
The result, called the AB1 array, has the width of one block (N PEs).
Computing all (2w+1)^2 candidate blocks consecutively for one displacement vector requires N·(2w+1)^2 time instances.
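The time behaviour of the 1-D array can be illustrated with a simplified software model. This is a sketch under assumptions, not the AB1 data path. It gives each of the N PEs its own accumulator and sums them at the end, whereas a real array may pass partial sums along a chain. The search window y is assumed to be stored with offset w, and the function name ab1_distortion is hypothetical.

```python
def ab1_distortion(x, y, m, n, w):
    """Return (D(m, n), cycles) for one candidate block on a 1-D array of N PEs."""
    size = len(x)
    partial = [0] * size             # one accumulator per PE (modelling assumption)
    cycles = 0
    for i in range(size):            # one block row is processed per cycle
        for j in range(size):        # the N PEs work in parallel within the cycle
            partial[j] += abs(x[i][j] - y[i + m + w][j + n + w])
        cycles += 1
    return sum(partial), cycles


# Synthetic data: reference block cut from the window at displacement (1, -2).
N, W = 4, 2
y = [[100 * r + c for c in range(N + 2 * W)] for r in range(N + 2 * W)]
x = [[y[i + 1 + W][j - 2 + W] for j in range(N)] for i in range(N)]
print(ab1_distortion(x, y, 1, -2, W))   # (0, 4): perfect match found in N cycles
```

Repeating this for every candidate gives the N·(2w+1)^2 total stated above.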

Another way of mapping: search-area based
The dependency graph for computing v(X,Y) is mapped onto a 2-D array of (2w+1)^2 PEs, while the dependency graph for computing D(m,n) is mapped into time.
Each PE, working in parallel, keeps track of one particular distortion computation and sequentially steps through the reference block.
At each cycle, each PE receives a different value of y(m+i,n+j), and all the PEs receive the value of one pixel of the reference block, which is broadcast to the array.
After N^2 cycles, each of the (2w+1)^2 PEs holds one value of D(m,n), corresponding to a particular displacement (m,n).
To find the minimum distortion value, first find the minimum of each column by down-shifting the D(m,n) values in the PEs, then find the final minimum by left-shifting the results in the M-PEs.
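The search-area-based mapping can be mimicked in software to see where the N^2 cycles come from. The sketch below is only a behavioural model under assumptions. It keeps one accumulator per PE, treats each broadcast reference pixel as one cycle, and replaces the shift-based minimum search with a plain column-then-row minimum. The function name and data layout (window stored with offset w) are hypothetical.

```python
def search_area_array(x, y, w):
    """Behavioural model: return (Vmin, Dmin, cycles) for a (2w+1) x (2w+1) PE array."""
    size_n = len(x)
    side = 2 * w + 1
    # acc[pm][pn] is the accumulator of the PE assigned to displacement (pm - w, pn - w).
    acc = [[0] * side for _ in range(side)]
    cycles = 0
    for i in range(size_n):
        for j in range(size_n):
            ref = x[i][j]                    # one reference pixel, broadcast to all PEs
            for pm in range(side):           # all PEs update in the same cycle
                for pn in range(side):
                    acc[pm][pn] += abs(ref - y[i + pm][j + pn])
            cycles += 1
    # Minimum of each column first, then the minimum over the column results.
    column_best = []
    for pn in range(side):
        best_pm = min(range(side), key=lambda pm: acc[pm][pn])
        column_best.append((acc[best_pm][pn], (best_pm - w, pn - w)))
    d_min, v_min = min(column_best, key=lambda item: item[0])
    return v_min, d_min, cycles


N, W = 4, 2
y = [[100 * r + c for c in range(N + 2 * W)] for r in range(N + 2 * W)]
x = [[y[i + 1 + W][j - 2 + W] for j in range(N)] for i in range(N)]
print(search_area_array(x, y, W))   # ((1, -2), 0, 16)
```

Compared with the block-based arrays, the roles of space and time are swapped: the candidates are spread over PEs and the block pixels over cycles.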

2-D search-area-based architecture
Part of the search area, of size w·(2w+N), needs to be stored in order to reduce I/O.

1-D search-area-based architecture
An array of (2w+1) processing elements computes, in N^2 cycles, the distortions D(m,n) corresponding to one line (or one column) of possible motion vectors.
This process is repeated sequentially 2w+1 times to compute all the distortions.

Another architecture
Requires only a sequential data input.
Dummy data, denoted by dots, are inserted into the stream of reference data to guarantee a regular data flow without any data permutation within the array.
Time required = (2w+1)·(2w+1)·N
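The cycle counts quoted for the different architectures can be put side by side with a short script. This is a rough comparison only. It uses the time figures stated on the slides and ignores pipeline fill latency, I/O, and the final minimum search. The function name, the dictionary labels, and the example values N = 16, w = 8 are assumptions.

```python
def cycle_counts(n, w):
    """Cycles per reference block for each architecture, using the stated formulas."""
    candidates = (2 * w + 1) ** 2            # number of candidate blocks
    return {
        "AB2 (2-D block-based array)": candidates,
        "AB1 (1-D block-based array)": n * candidates,
        "2-D search-area-based array": n * n,
        "1-D search-area-based array": n * n * (2 * w + 1),
        "sequential-input architecture": candidates * n,
    }


counts = cycle_counts(16, 8)
for name, cycles in counts.items():
    print(f"{name:32s} {cycles:6d} cycles")
```

The 2-D arrays buy their short run time with N^2 or (2w+1)^2 processing elements, while the 1-D variants trade time for area.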