An Area-Efficient VLSI Architecture for Variable Block Size Motion Estimation of H.264/AVC Hoai-Huong Nguyen Le' and Jongwoo Bae 1 1 Department of Information.

Slides:

Advertisements

Similar presentations

1 A HIGH THROUGHPUT PIPELINED ARCHITECTURE FOR H.264/AVC DEBLOCKING FILTER Kefalas Nikolaos, Theodoridis George VLSI Design Lab. Electrical & Computer.

Advertisements

A Performance Analysis of the ITU-T Draft H.26L Video Coding Standard Anthony Joch, Faouzi Kossentini, Panos Nasiopoulos Packetvideo Workshop 2002 Department.

-1/20- MPEG 4, H.264 Compression Standards Presented by Dukhyun Chang

Scalable and Low Cost Design Approach for Variable Block Size Motion Estimation Hadi Afshar, Philip Brisk, Paolo Ienne EPFL Hadi Afshar, Philip Brisk,

 Understanding the Sources of Inefficiency in General-Purpose Chips.

Hardware Implementation of Transform & Quantization Blocks in H.264/AVC Video Coding Standard By: Hoda Roodaki Instructor: Dr. Fakhraei Custom Implementation.

An Early Block Type Decision Method for Intra Prediction in H.264/AVC Jungho Do, Sangkwon Na and Chong-Min Kyung VLSI Systems Lab. Korea Advanced Institute.

Efficient Bit Allocation and CTU level Rate Control for HEVC Picture Coding Symposium, 2013, IEEE Junjun Si, Siwei Ma, Wen Gao Insitute of Digital Media,

H.264/AVC Baseline Profile Decoder Complexity Analysis Michael Horowitz, Anthony Joch, Faouzi Kossentini, and Antti Hallapuro IEEE TRANSACTIONS ON CIRCUITS.

1 Adaptive slice-level parallelism for H.264/AVC encoding using pre macroblock mode selection Bongsoo Jung, Byeungwoo Jeon Journal of Visual Communication.

{ Fast Disparity Estimation Using Spatio- temporal Correlation of Disparity Field for Multiview Video Coding Wei Zhu, Xiang Tian, Fan Zhou and Yaowu Chen.

CABAC Based Bit Estimation for Fast H.264 RD Optimization Decision

2009/04/07 Yun-Yang Ma.  Overview  What is CUDA ◦ Architecture ◦ Programming Model ◦ Memory Model  H.264 Motion Estimation on CUDA ◦ Method ◦ Experimental.

Wei Zhu, Xiang Tian, Fan Zhou and Yaowu Chen IEEE TCE, 2010.

Yu-Han Chen, Tung-Chien Chen, Chuan-Yung Tsai, Sung-Fang Tsai, and Liang-Gee Chen, Fellow, IEEE IEEE CSVT

High Speed Hardware Implementation of an H.264 Quantizer. Alex Braun Shruti Lakdawala.

Efficient Moving Object Segmentation Algorithm Using Background Registration Technique Shao-Yi Chien, Shyh-Yih Ma, and Liang-Gee Chen, Fellow, IEEE Hsin-Hua.

Low-complexity mode decision for MVC Liquan Shen, Zhi Liu, Ping An, Ran Ma and Zhaoyang Zhang CSVT

Department of Computer Engineering University of California at Santa Cruz Video Compression Hai Tao.

1 Single Reference Frame Multiple Current Macroblocks Scheme for Multiple Reference IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY Tung-Chien.

Analysis, Fast Algorithm, and VLSI Architecture Design for H

A Flexible Parallel Architecture Adapted to Block-Matching Motion-Estimation Algorithms Santanu Dutta, and Wayne Wolf IEEE Trans. On CSVT, vol. 6, NO.

FAST MULTI-BLOCK SELECTION FOR H.264 VIDEO CODING Chang, A.; Wong, P.H.W.; Yeung, Y.M.; Au, O.C.; Circuits and Systems, ISCAS '04. Proceedings of.

Multi-Frame Reference in H.264/AVC 卓傳育. Outline Introduction to Multi-Frame Reference in H.264/AVC Multi-Frame Reference Problem Two papers propose to.

1 An Efficient Mode Decision Algorithm for H.264/AVC Encoding Optimization IEEE TRANSACTION ON MULTIMEDIA Hanli Wang, Student Member, IEEE, Sam Kwong,

BY AMRUTA KULKARNI STUDENT ID : UNDER SUPERVISION OF DR. K.R. RAO Complexity Reduction Algorithm for Intra Mode Selection in H.264/AVC Video.

A Low-Power VLSI Architecture for Full-Search Block-Matching Motion Estimation Viet L. Do and Kenneth Y. Yun IEEE Transactions on Circuits and Systems.

HARDEEPSINH JADEJA UTA ID: What is Transcoding The operation of converting video in one format to another format. It is the ability to take.

PROJECT PROPOSAL HEVC DEBLOCKING FILTER AND ITS IMPLIMENTATION RAKESH SAI SRIRAMBHATLA UTA ID: EE 5359 Under the guidance of DR. K. R. RAO.

Platform-based Design for MPEG-4 Video Encoder Presenter: Yu-Han Chen.

1 Efficient Reference Frame Selector for H.264 Tien-Ying Kuo, Hsin-Ju Lu IEEE CSVT 2008.

Abdullah Aldahami ( ) Feb26, Introduction 2. Feedback Switch Logic 3. Arithmetic Logic Unit Architecture a.Ripple-Carry Adder b.Kogge-Stone.

1 Data Partition for Wavefront Parallelization of H.264 Video Encoder Zhuo Zhao, Ping Liang IEEE ISCAS 2006.

ELEC692 VLSI Signal Processing Architecture Lecture 7 VLSI Architecture for Block Matching Algorithm for Video compression * Part of the notes is taken.

Diploma Project Real Time Motion Estimation on HDTV Video Streams (using the Xilinx FPGA) Supervisor :Averena L.I. Student:Das Samarjit.

Low-Power H.264 Video Compression Architecture for Mobile Communication Student: Tai-Jung Huang Advisor: Jar-Ferr Yang Teacher: Jenn-Jier Lien.

MOTION ESTIMATION IMPLEMENTATION IN VERILOG

Radix-2 2 Based Low Power Reconfigurable FFT Processor Presented by Cheng-Chien Wu, Master Student of CSIE,CCU 1 Author: Gin-Der Wu and Yi-Ming Liu Department.

COARSE GRAINED RECONFIGURABLE ARCHITECTURE FOR VARIABLE BLOCK SIZE MOTION ESTIMATION 03/26/

Compression video overview 演講者：林崇元. Outline Introduction Fundamentals of video compression Picture type Signal quality measure Video encoder and decoder.

- By Naveen Siddaraju - Under the guidance of Dr K R Rao Study and comparison between H.264.

-BY KUSHAL KUNIGAL UNDER GUIDANCE OF DR. K.R.RAO. SPRING 2011, ELECTRICAL ENGINEERING DEPARTMENT, UNIVERSITY OF TEXAS AT ARLINGTON FPGA Implementation.

Study and Optimization of the Deblocking Filter in H.265 and its Advantages over H.264 By: Valay Shah Under the guidance of: Dr. K. R. Rao.

Fast motion estimation and mode decision for H.264 video coding in packet loss environment Li Liu, Xinhua Zhuang Computer Science Department, University.

UNDER THE GUIDANCE DR. K. R. RAO SUBMITTED BY SHAHEER AHMED ID : Encoding H.264 by Thread Level Parallelism.

-BY KUSHAL KUNIGAL UNDER GUIDANCE OF DR. K.R.RAO. SPRING 2011, ELECTRICAL ENGINEERING DEPARTMENT, UNIVERSITY OF TEXAS AT ARLINGTON FPGA Implementation.

Improvement of Schema-Informed XML Binary Encoding Using Schema Optimization Method BumSuk Jang and Young-guk Ha' Konkuk University, Department of Computer.

Block-based coding Multimedia Systems and Standards S2 IF Telkom University.

COMPARATIVE STUDY OF HEVC and H.264 INTRA FRAME CODING AND JPEG2000 BY Under the Guidance of Harshdeep Brahmasury Jain Dr. K. R. RAO ID MS Electrical.

BLFS: Supporting Fast Editing/Writing for Large- Sized Multimedia Files Seung Wan Jung 1, Seok Young Ko 2, Young Jin Nam 3, Dae-Wha Seo 1, 1 Kyungpook.

Motion Estimation Multimedia Systems and Standards S2 IF Telkom University.

Time Optimization of HEVC Encoder over X86 Processors using SIMD Kushal Shah Advisor: Dr. K. R. Rao Spring 2013 Multimedia.

Hierarchical Systolic Array Design for Full-Search Block Matching Motion Estimation Noam Gur Arie,August 2005.

1 Hierarchical Parallelization of an H.264/AVC Video Encoder A. Rodriguez, A. Gonzalez, and M.P. Malumbres IEEE PARELEC 2006.

VLSI Design of View Synthesis for 3DVC/FTV Jongwoo Bae' and Jinsoo Cho 2, 1 Department of Information and Communication Engineering, Myongji University.

Implementation and comparison study of H.264 and AVS china EE 5359 Multimedia Processing Spring 2012 Guidance : Prof K R Rao Pavan Kumar Reddy Gajjala.

NCTU, CS VLSI Information Processing Research Lab 研究生 : ABSTRACT Introduction NEW Recursive DFT/IDFT architecture Low computation cycle  1/2: Chebyshev.

Fine-granular Motion Matching for Inter-view Motion Skip Mode in Multi-view Video Coding Haitao Yanh, Yilin Chang, Junyan Huo CSVT.

1שידור ווידיאו ואודיו ברשת האינטרנט Dr. Ofer Hadar Communication Systems Engineering Department Ben-Gurion University of the Negev URL:

E ARLY TERMINATION FOR TZ SEARCH IN HEVC MOTION ESTIMATION PRESENTED BY: Rajath Shivananda ( ) 1 EE 5359 Multimedia Processing Individual Project.

Computational Controlled Mode Selection for H.264/AVC June Computational Controlled Mode Selection for H.264/AVC Ariel Kit & Amir Nusboim Supervised.

Quality Evaluation and Comparison of SVC Encoders

An Implementation Method of the Box Filter on FPGA

SoC and FPGA Oriented High-quality Stereo Vision System

Sum of Absolute Differences Hardware Accelerator

Study and Optimization of the Deblocking Filter in H

PROJECT PROPOSAL HEVC DEBLOCKING FILTER AND ITS IMPLIMENTATION RAKESH SAI SRIRAMBHATLA UTA ID: EE 5359 Under the guidance of DR. K. R. RAO.

Fast Decision of Block size, Prediction Mode and Intra Block for H

Jian Huang, Matthew Parris, Jooheung Lee, and Ronald F. DeMara

Bongsoo Jung, Byeungwoo Jeon

Presentation transcript:

An Area-Efficient VLSI Architecture for Variable Block Size Motion Estimation of H.264/AVC Hoai-Huong Nguyen Le' and Jongwoo Bae 1 1 Department of Information and Communication Engineering, Myongji University San 38-2 Namdong, Cheoin-Gu, Yongin-Si, Gyeonggi-Do , Korea. {hoaihuong, Abstract. In this paper, a highly efficient architecture for H.264/AVC variable block size motion estimation is proposed. The proposed architecture is based on the Propagate Partial Sum of Absolute Difference (PPSAD) architecture. A pipelined architecture is used to optimize the circuit of Processing Element (PE) and Row Adder Tree. The new architecture improves the system latency and hardware cost when removing small modes (4x4 and 4x8 modes) without affecting the coding quality. The proposed architecture is implemented in TSMC LVT 90 process, showing 26.47% improvement in the hardware cost and 18.3% in the frequency, compared to the previous PPSAD architecture. Keywords: Motion estimation, H.264/AVC, variable block size, VLSI design. 1 Introduction The H.264/Advance Video Coding (AVC) is an industry standard for video compression. The H.264/AVC standard was first published in 2003 [1], demonstrating the better compression efficiency and higher flexibility in video compression. Variable block size motion estimation (VBSME) which improves the coding efficiency by varying the size of blocks is adopted in H264/AVC. Generally, a video frame is divided into macroblocks (MBs) of size 16x16. Then each macroblock is segmented into subblocks of size 16x8, 8x16, 8x8, 8x4, 4x8, and 4x4 as shown in Fig. 1. Therefore, there are 41 subblocks in a macroblock. Although the VBSME achieves good quality, its computational complexity is high compared with other algorithms. Various fast algorithms such as three-step search, four-step search, diamond search and hexagonal search have been proposed to reduce the computational complexity in motion estimation technique. Some of the improved fast motion estimation algorithms were adopted by H.264/AVC JM Reference Software. The hardware architecture for VBSME widely used in H.264/AVC is PPSAD architecture. We show a new architecture for PPSAD architecture in this paper. 38 Computers, Networks, Systems, and Industrial lippications ://ww

1 16x x8 2 8x16 4 8x8 8 8x4 8 4x8 16 4x4 Fig. 1. Block sizes supported by H.264 for ME This paper is organized as follows. In Section II, the background for the PPSAD architecture is explained first and then the proposed architecture is explained in detail. In Section III, experimental results are shown. Finally, Section IV concludes the paper. 2 Proposed Method 2.1 Background Motion estimation is the key technique of video compression to reduce the temporal redundancies of image sequences for higher compression efficiency. Video encoding algorithms typically process one macroblock at a time. During the encoding process, the goal of motion estimation is to find the best match in the current frame from a reference frame by minimizing the cost function. Various functions have been proposed and analyzed in the literatures such as the mean absolute difference (MAD), mean square error (MSE) and sum of absolute difference (SAD). Due to its good performance and ease of hardware implementation, SAD is chosen for video coding as shown in Eq. (1). M —1N-1 SAD (i,j)=EEIR(m + i,n + — C(n 1,01. ( 1 ) m=0 n=0 In Eq. (1), MxN is the size of the current block, (i, j) represents the motion vector. R(m,n) and C(m,n) are the pixels of the reference and current block, respectively. The PPSAD architecture speeds up the computing process of VBSME algorithms by simultaneously calculating SAD values, and provides an efficient datapath. The original PPSAD architecture is first designed in [2] and the Parallel Improved PPSAD (PI-PPSAD) architecture is proposed in [3]. In fact, the dataflow of the PI-PPSAD is similar to that of the original PPSAD architecture. The current pixels are stored in the PE array. The 16x1 reference pixels are broadcasted to the PE array in every clock cycle. Each PE row Session1B39 ://ww

computes the SAD value of one row in a macroblock. One Row Adder Tree is accumulated with the partial SAD propagated in and then the result is propagated to the next stage vertically. 2.2 Proposed Architecture In H.264 encoding process, more than 50% of the computation power is consumed by VBSME 7 modes algorithm. Many studies and researches have been performed on the VLSI design of VBSME. Especially in the case of HDTV applications, the large resolution and frame rate increase the system latency and hardware cost. By employing the mode elimination technique [4], the small modes can be ignored without affecting the coding quality. Since the 4x4 mode negligibly affects the coding quality, it is eliminated in our architecture. The availability of the 4x8 or 8x4 mode has enough ability to handle fine motion and results in good quality. Our proposed PPSAD architecture employs 5 modes: 16x16, 16x8, 8x16, 8x8 and 8x4 are shown in Fig. 2. Broadcast Each Pixel to 1x16PE's 0 Processing Element f\N Shift Register for 8x4 Partial SAD Row Adder Tree Shift Register for 8x8 Partial SAD Fig. 2. Our proposed PPSAD architecture In our PPSAD architecture, each PE contains an 8-bit register to store one pixel of the current macroblock. The reference pixels are broadcasted to every 1x16 PEs array in B16x16_0 B8x4_30 B8x810 B8x4 00 B8x4 01  * 4 4 °M' Oit WOWOW B8x411 B8x16_1 o B8x4_31 B8x8_11 _ 0:0"0.0:00:0:0 kto _ 0 _ e _ 0 _ 0 CO ___ 40 Computers, Networks, Systems, and Industrial lippications ://ww

every cycle. The PE is used to calculate the difference between the stored current pixel and the reference pixel, which is chosen from the reference memory through the input bus. Every Row Adder Tree is accumulated with the partial SAD values and then the result is propagated to the next stage in the vertical direction. Two SAD values of B8x4_00 and B8x4_01 are generated in the first four cycles. For the next four cycles, B8x4 10 and B8x4_11 are generated. Then, these SAD values are delivered to the below adder to form the large SAD values: B8x8_00, B8x8_01 and B16x8_0. Similarly, in the next eight cycles, SAD values of B8x4_20, B8x4_21, B8x4_30, B8x4_31, B8x8_10, B8x8_11, B 16x8_1, B8x16_0, B8x16_1 and B16x16_0 are generated. To reduce the system latency and hardware cost, the pipeline operation is applied in order to optimize the PE and the adder tree in the PE row. The hardware implementation of the absolute difference operation is described in [5]. The last adder in every PE can be merged in the Carry Save Adder tree (CSA) of the Row Adder Tree. With the last adder eliminated in every PE, the number of remaining PE is 256, which leads to significant reduction in the hardware cost. The detailed design of Row Adder Tree for the Horizontal PPSAD architecture is shown in Fig. 3. AlIS4 AM ATS2 A3 S 1 P S. [ 7:0 ] V v 7:0 7:0 7:0 7:0 CSA...8 8:1 `11 1SN {SCZ,C6} 8:0 CSA 9 fit.0 {PSADISJ,S4} 1SC3,C 155 Nif :0 8:0 8:0 CS4. 9 8:0 NI4SC6,C3 JPSADI91,S61 1SC5,C4} 8:0 9;1 7:41 7:0 7: 8:1 ISA 8 7:0 7:11 7:0 (SA x 7:0 9:0 9:0 9:0 CSA 1 0 o f ISC7 enNie +10[ESAD [10],87] 10:0 ADDER 10:01 ( __{C11 Stir[11:14 Fig. 3. Our proposed Row Adder Tree for our PPSAD Architecture Session1B 41

3 Experimental results Our proposed architecture has been implemented in TSMC LVT90 process. The experimental results demonstrate the improvement in the system latency and the hardware cost compared to the previous PPSAD designs. The original PPSAD architecture is shown in [3]. All 16 partial SAD values of 4x4 blocks are propagated through the shift registers which becomes the critical path. The SAD values are accumulated to generate other SAD values resulting in longer system latency. The PI-PPSAD is designed in [2], which optimized the shift register for 4x4 blocks to reduce the hardware cost and system latency. Besides, this architecture shows the hardware of PE array and Row Adder Tree to save the adders in every PE array. By employing the mode reduction technique, our architecture eliminates the trivial modes without affecting the coding quality. In addition, the improvement of the Row Adder Tree architecture reduces 256 adders in the whole system. The compared performances between our architecture and two previous PPSAD architectures are shown in Table 1. The PPSAD architecture can operate at MHz. The hardware cost of the Horizontal PPSAD is improved by 33.35% when compared with the design in [3]. Compared to the architecture cost in [2], we reduce the hardware cost by 26.47% and increase the operating clock frequency by 18.3%. 4 Conclusion In this paper, a highly efficient PPSAD and Row Adder Tree architecture are presented to improve the hardware cost and the operating clock frequency for VBSME of H.264/AVC. By eliminating 4x4 and 4x8 modes without affecting the coding quality, we reduce the hardware cost by 26.47% and increase the frequency by 18.3 %. Various architectures are compared with the proposed one. The experimental results show that the proposed architecture is the best choice considering the frequency and size. The proposed architecture can achieve the real-time encoding capability for high resolution applications such as HDTV and Ultra HDTV. Table 1. The comparison with the previous architectures. DesignPropagate Partial SAD [3]PI-PPSAD [2]Our PP SAD The number of PE256 Clock (MHz) PE array & Cur. MB (gates) 79k71.6k52.65k 42 Computers, Networks, Systems, and Industrial Appications

Acknowledgements This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (grant number ). References 1.Draft ITU-T recommendation and final draft international standard of joint video specification (ITU-T Rec. H.264/ISO/IEC AVC). In Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG, JVT-G050 (2003) 2.Y. W. Huang, T. C. Wang, B. Y. Hsieh, L. G. Chen: Hardware architecture design for variable block size motion estimation in MPEG-4 AVC/JVT/ITU-T H.264. In Proceedings of ISCAS 2003, volume 2, pages (2003) 3.Zhenyu Liu, Yiqing Huang, Yang Song, Satoshi Goto, Takeshi Ikenaga: Hardware Efficient Propagate Partial SAD Architecture for Variable Block Size Motion Estimation in H.264/AVC. In GLSVLSI'07, pp (2007) 4.Yiqing Huang, Zhenyu Liu, Yang Song, Satoshi Goto, Takeshi Ikenaga: Parallel Improved HDTV720p Targeted Propagate Partial SAD Architecture for Variable Block Size Motion Estimation in H.264/AVC. In IEICE Trans. Fundamentals, vol.E91-A, no.4 (2008) 5.J.Vanne, E. Aho, T. D. Hamalainen, K. Kuusilinna: A high-performance sum of absolute difference implementation for motion estimation. In IEEE Transactions on Circuits and Systems for Video Technology, 16(7): (2006) Session 1B43 ://ww