Download presentation
Presentation is loading. Please wait.
Published byCorey Williams Modified over 6 years ago
1
Sum of Absolute Differences Hardware Accelerator
Mark Lodermeier 9:13 PM
2
Outline Overview of Motion Estimation and MPEG4 – Part 10 - AVC
Approach Tasks Performed To Do 9:13 PM
3
Motion Estimation Used for video compression – block matching between successive frames Search for best matching block (find motion vectors) Used to create model of current frame using reference frame(s) from either previous or future frames Motion vectors found by determining the minimum SAD MV(r,s) = argmin[SAD(x,y,r,s)] 9:13 PM
4
Motion Estimation Full Search algorithm produces the best results.
However, computationally expensive. Motion Estimation accounts for 50-70% of computational complexity in MPEG-4 video encoding/decoding For a small search range of [-8, +7] in each direction with 16x16 macroblocks, there are 16*16 pixel comparisons performed 16*16 times = 65,536 additions of absolute differences for a single 16x16 block Real-time video of a 480x640 9:13 PM
5
MPEG-4 Part 10 – AVC Variable Block Sizes
Each 16x16 Macroblock can be split in half into two 16x8 or 8x16 blocks or into four 8x8 sub-blocks These sub-blocks can then be split in half into two 8x4 or 4x8 blocks or into four 4x4 blocks. 9:13 PM
6
MPEG-4 Part 10 – AVC Previous 16x16 macroblock split into smaller blocks 9:13 PM
7
MPEG4 Variable Block Sizes
Purpose: Many small blocks requires large amount of bits to encode Few large blocks may produce poor quality Can produce higher efficiency with same quality Challenges: Generate motion vectors for all block sizes - Increase computation cost to an already intensive algorithm Choose the correct block size among many choices to balance bandwidth and quality 9:13 PM
8
Approach How to efficiently generate all MV’s for variable sized blocks? Take full advantage of parallel nature of both Motion Estimation and the generation of variable sized blocks Maintain high processor utilization 9:13 PM
9
Tasks Performed Implemented 1-D systolic array in VHDL
16 Processing Elements, each with: Absolute Difference unit 9 to 2 Compressor 3 to 1 Compressor 9:13 PM
10
Absolute Difference Unit
9:13 PM
11
Absolute Difference Math behind absolute difference unit:
Just check condition - A > B B + B_not = 2n – 1 B_not = 2n – 1 – B 2n-1+|A-B| is the value of the sum of the two outputs of the absolute difference unit. Need to add a correction term of m to get rid of the 2n-1, where m is equal to the number of absolute difference units used. 9:13 PM
12
Single Processing Element
Abs Diff Unit C B A 9 to 2 Adder Reduction Tree 3 to 1 Reduction Tree Latch Correction term - 4 9:13 PM
13
… … PE0 PE1 PE2 PE15 Systolic Array C B A D D D D control
4x4 SAD and MV 4x4 SAD and MV 4x4 SAD and MV 4x4 SAD and MV Shift Register Shift Register Shift Register Shift Register … control 16 4x4 SAD values and MV’s 8 4x8 SAD values and MV’s 8 8x4 SAD values and MV’s Back-End Adder Array for Variable Blocks 4 8x8 SAD values and MV’s 2 8x16 SAD values and MV’s 2 16x8 SAD values and MV’s 1 16x16 SAD value and MV 9:13 PM
14
Back-End Adder Array Diagram
16 4x4 SAD’s and MV’s 8 8x4 SAD’s and MV’s 8 4x8 SAD’s and MV’s 4 8x8 SAD’s and MV’s 2 8x16 and 2 16x8 SAD’s and MV’s 16x16 SAD and MV Macroblock with 16 4x4 sub-blocks - Each dot represents the following: Min Latch 9:13 PM
15
Another way to represent the generation of motion vectors for the variable sized blocks
p x p Matrix SAD values for one 4x4 block in all search positions p x p Matrix SAD values for one 4x8 block in all search positions p x p Matrix SAD values for one 4x4 block in all search positions First you take the pxp matrices from the specified 4x4 blocks to generate the matrix for the 4x8 block (where p is the total search range) To find the corresponding motion vector, you just search for the minimum value in the matrix 9:13 PM
16
The Schedule for the Accelerator:
The top box is the current frame data and the bottom box is the reference frame data Every 4 clock cycles a PE will produce a 4x4 SAD value It takes 16 clock cycles to fill the systolic array and have 100% Processor Utilization After 64 clock cycles PE0 will have completed a 16x16 macroblock search for one position, Every other PE will then do the same on the following cycle So, to do a full search of (-8, 7) in the x and y positions, a total of 64x16 = 1024 cycles are needed. 9:13 PM
17
Questions? 9:13 PM
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.