1 Data Partition for Wavefront Parallelization of H.264 Video Encoder
Zhuo Zhao, Ping Liang
IEEE ISCAS 2006

2 Outline
- Introduction
- Data Dependencies in H.264
- Data Partition and Task Priority
- Experimental Results
- Conclusions

3 Introduction: Background Knowledge (1/7)
- Video compression technologies
  - Spatial redundancy
  - Temporal redundancy
- H.264/AVC new features
  - Quarter-pel ME, variable block sizes, multiple reference frames, intra-prediction, CAVLC, CABAC, in-loop deblocking filter, etc.

4 Introduction: Background Knowledge (2/7)
- In [1], compared with the MPEG-4 Simple profile:
  - Up to 50% bitrate reduction is achieved at the cost of more than four times the computation.
  - Lower bitrate, higher computational complexity.
- Hardware and software acceleration for real-time applications

5 Introduction: Background Knowledge (3/7)
- In [2], a single-chip H.264 encoder using a four-stage macroblock pipeline architecture is presented.
  - A satisfactory R-D tradeoff is reported.
  - The coding mode of the current MB is found by approximation from neighboring coding information.

6 Introduction: Background Knowledge (4/7)
- In [3], an H.264 encoder using the Hyper-Threading architecture is reported.
  - A frame is split into several slices, which are processed by multiple threads.
  - Heavy overhead: slicing impairs the data dependencies among MBs.

7 Introduction: Background Knowledge (5/7)
- [Figure: slice-based multithreaded encoder. Labels: Input File, Image buffer, Slice Queue 0 (I/P), Slice Queue 1 (B), Thread 0 to Thread 4, Output File]

8 Introduction: Background Knowledge (6/7)
- In [4], a frame is divided into many small partitions with overlapping areas, and the partitions are processed concurrently.
  - Not feasible for H.264.
  - The overlapping (redundant) data is used to form the complete search data.

9 Introduction: Background Knowledge (7/7)
- In [5][6], temporal parallelism is exploited at the GOP level.
  - A large number of frames must be ready before encoding actually starts.
  - Temporal parallelism is limited to coding standards with a GOP structure.

10 Introduction: Main Purpose (1/2)
- This paper presents a new method for parallel processing in an H.264 video encoder:
  - Data partition
  - Task scheduling
- The new method outperforms prior approaches in both encoding speed and compression efficiency.

11 Introduction: Main Purpose (2/2)
- This paper gives the relations between:
  - the number of parallel processing elements and the theoretical encoding time;
  - the number of processors and the number of concurrently processed frames.
- The results show that this method achieves the same compression efficiency as a sequential encoder.

12 Data Dependencies in H.264: Overview (1/2)
- Reference software: JM 9.0
  - Sequential processing of MBs
  - Data dependencies
  - Produces the optimal bitstream in terms of coding efficiency, i.e. the highest compression ratio

13 Data Dependencies in H.264: Overview (2/2)
- Objective
  - Explore which elements of the encoder can be processed in parallel.
  - Maximally exploit the temporal and spatial data dependencies for optimal coding efficiency.

14 Data Dependencies in H.264
- Predicted motion vector (PMV)
  - In inter-prediction, the PMV defines the search center of motion estimation.
  - It is useful for maintaining the continuity of the motion field.
  - It is determined by the MVs of the neighboring sub-blocks and the corresponding reference indexes.

15 Data Dependencies in H.264
- Intra-frame data dependencies
  - Only the difference (MVD) between the final optimal MV (MV') and the PMV is encoded.
  - [Figure: current MB with its neighbors MB A, MB B, MB C and MB D]
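
The PMV/MVD relation on slides 14 and 15 can be sketched as follows. This is a simplified illustration, not the paper's code: it assumes the usual H.264 component-wise median over the left (A), top (B) and top-right (C) neighbors and ignores the reference-index and partition-shape special cases the slide alludes to. The tuple representation and function names are hypothetical.

```python
from typing import Tuple

# Hypothetical representation: a motion vector as (x, y) in quarter-pel units.
MV = Tuple[int, int]


def median3(a: int, b: int, c: int) -> int:
    """Median of three integers."""
    return sorted((a, b, c))[1]


def predict_mv(mv_a: MV, mv_b: MV, mv_c: MV) -> MV:
    """Simplified PMV: component-wise median of neighbors A, B and C."""
    return (median3(mv_a[0], mv_b[0], mv_c[0]),
            median3(mv_a[1], mv_b[1], mv_c[1]))


def mv_difference(mv_final: MV, pmv: MV) -> MV:
    """MVD = MV' - PMV; only this difference is written to the bitstream."""
    return (mv_final[0] - pmv[0], mv_final[1] - pmv[1])


pmv = predict_mv((4, 0), (6, -2), (5, 8))
mvd = mv_difference((7, 1), pmv)
```

Because the PMV depends on A, B and C, the current MB cannot start motion estimation until those neighbors have final motion vectors, which is the dependency the wavefront scheme has to respect.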

16 Data Dependencies in H.264
- Inter-prediction and mode decision
  - H.264 needs the reconstructed images of already encoded frames as references to exploit temporal redundancy.
  - At least the co-located MB and its eight neighboring MBs must be available before the current MB can be encoded.
  - [Figure: reference frame and current frame]
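
A minimal sketch of the availability check described on slide 16, assuming the search window stays within the co-located MB and its eight neighbors (the slide's "at least" case). MB coordinates are 0-based (row, column); the function names and the set-based bookkeeping are illustrative assumptions.

```python
def required_reference_mbs(row, col, mb_rows, mb_cols):
    """Co-located MB plus its (up to eight) neighbors, clipped to the frame."""
    return [(r, c)
            for r in range(row - 1, row + 2)
            for c in range(col - 1, col + 2)
            if 0 <= r < mb_rows and 0 <= c < mb_cols]


def reference_ready(row, col, reconstructed, mb_rows, mb_cols):
    """True if every needed MB of the reference frame is already reconstructed.

    `reconstructed` is a set of (row, col) pairs of the reference frame.
    """
    return all(mb in reconstructed
               for mb in required_reference_mbs(row, col, mb_rows, mb_cols))


# QCIF has 9 MB rows and 11 MB columns.
corner = required_reference_mbs(0, 0, 9, 11)
inner = required_reference_mbs(4, 5, 9, 11)
```

For the top-left MB only four reference MBs are needed, which is why the next frame can start well before the reference frame is finished.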

17 Data Dependencies in H.264
- Quarter-pel interpolation
  - Before the reconstructed result of the current MB can be used as a reference, it must be interpolated to obtain the values at the ½- and ¼-pel positions.
  - The boundary area of the current MB needs 3 rows/columns of pixel values from its neighboring MBs.

18 Data Dependencies in H.264
- Quarter-pel interpolation
  - [Figure: sample grid for quarter-pel interpolation; upper-case letters mark integer-pel samples, lower-case letters mark half- and quarter-pel samples]
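
The need for pixels from neighboring MBs follows from the interpolation filter. The sketch below uses the 6-tap luma filter (1, -5, 20, 20, -5, 1) of the H.264 standard for half-pel samples and the rounded average for quarter-pel samples; it is not taken from the slides, it handles one dimension only, and the function names are hypothetical.

```python
def half_pel(samples):
    """Half-pel value between the 3rd and 4th of six consecutive integer-pel samples.

    The filter reaches three integer samples to each side, so samples near an
    MB boundary need pixels that belong to the neighboring MB.
    """
    e, f, g, h, i, j = samples
    acc = e - 5 * f + 20 * g + 20 * h - 5 * i + j
    return min(255, max(0, (acc + 16) >> 5))  # round, divide by 32, clip to 8 bits


def quarter_pel(a, b):
    """Quarter-pel value: rounded average of two adjacent integer/half-pel samples."""
    return (a + b + 1) >> 1


flat = half_pel([10, 10, 10, 10, 10, 10])
edge = half_pel([0, 0, 0, 255, 255, 255])
```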

19 Data Dependencies in H.264
- 4×4 and 16×16 intra-prediction and mode decision

20 Data Dependencies in H.264
- Intra-prediction data dependencies
  - [Figure: MB(i, j) with its neighbors MB(i, j-1) and MB(i-1, j)]

21 Data Dependencies in H.264
- Number of skipped MBs before the current MB
  - In the H.264/AVC standard: mb_skip_run
  - It indicates how many MBs before the current MB in raster-scan order are skipped.
  - The encoder therefore needs to know the encoding status of the previous MBs.
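
As an illustration of why mb_skip_run creates a raster-order dependency, the sketch below derives the run lengths from a list of per-MB skip decisions. The list-of-booleans input and the function name are assumptions made for this example only.

```python
def skip_runs(skipped_flags):
    """Compute mb_skip_run values from per-MB skip decisions in raster-scan order.

    Returns (runs, trailing), where `runs` holds one (mb_index, mb_skip_run)
    pair per coded MB and `trailing` counts skipped MBs after the last coded MB.
    """
    runs = []
    run = 0
    for index, skipped in enumerate(skipped_flags):
        if skipped:
            run += 1          # extend the current run of skipped MBs
        else:
            runs.append((index, run))
            run = 0           # a coded MB ends the run
    return runs, run


runs, trailing = skip_runs([True, True, False, False, True, False, True])
```

The value attached to a coded MB is only known once the skip decision of every preceding MB has been made, so this syntax element has to be resolved in raster order even when the MBs themselves are encoded out of order.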

22 Data Partition & Task Priority: Data Partition (1/5)
- MBs in different frames can be processed concurrently only if the reconstructed MBs they need from the reference frame are all available.
- MBs from different MB rows of the same frame can be processed concurrently only if their neighboring MBs in the row above have all been encoded and reconstructed.

23 Data Partition & Task Priority: Data Partition (2/5)
- Concurrently processed MBs
  - [Figure: wavefront parallelization across frames, axis "Frame number". Legend: MBs which have already been encoded; MBs which are being encoded now; MBs which have not been encoded yet]

24 Data Partition & Task Priority: Data Partition (3/5)
- Wavefront parallelization can achieve a constant frame rate for any video format (e.g. QCIF, CIF, HDTV720), provided that:
  - the number of processors is sufficient;
  - the video sequence is long enough.

25 Data Partition & Task Priority: Data Partition (4/5)
- Example
  - As the number of frames increases, the average encoding time per frame approaches 4 TMB.
  - The number of processor units needed to achieve this is: (formula not reproduced in this transcript)
  - [Figure: axis "Frame number"]

26 Data Partition & Task Priority: Data Partition (5/5)
- Each frame is first partitioned into MB rows.
  - An MB can't be processed until its left neighbor in the same row has been encoded.
  - Partitioning by rows reduces data exchange between processors.
- [Figure: current frame]
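
The intra-frame rules from the data-partition slides (left neighbor in the same row, neighboring MBs in the row above) can be checked with a small simulation. This is a sketch under stated assumptions: unlimited processors, one time step per MB, and "neighboring MBs in the row above" taken to mean top-left, top and top-right. All names are hypothetical.

```python
def intra_frame_ready(row, col, done, mb_cols):
    """True if the left, top-left, top and top-right neighbors are all encoded.

    Neighbors outside the frame are ignored. `done` is a set of (row, col).
    """
    deps = [(row, col - 1), (row - 1, col - 1), (row - 1, col), (row - 1, col + 1)]
    return all((r, c) in done for r, c in deps if r >= 0 and 0 <= c < mb_cols)


def wavefront_steps(mb_rows, mb_cols):
    """Earliest time step (in units of one MB time) at which each MB can start."""
    done, step_of, step = set(), {}, 0
    while len(done) < mb_rows * mb_cols:
        ready = [(r, c)
                 for r in range(mb_rows) for c in range(mb_cols)
                 if (r, c) not in done and intra_frame_ready(r, c, done, mb_cols)]
        for mb in ready:
            step_of[mb] = step
        done.update(ready)   # all MBs started in this step finish together
        step += 1
    return step_of


steps = wavefront_steps(3, 4)
```

Under these assumptions every MB starts at step 2 * row + col, so each MB row trails the row above by two MB times, which gives the diagonal wavefront.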

27 Data Partition & Task Priority: Task assigning and priorities (1/5)
- Task assignment timing diagram
  - [Figure: task assigning schedule. Marked times: t, t+2T, t+4T. Tasks: Frame i, MB row j; Frame i, MB row j+1; Frame i, MB row j+2; Frame i+1, MB row j]

28 Data Partition & Task Priority: Task assigning and priorities (2/5)
- Example
  - [Figure: task assigning schedule with an interval of 4 TMB marked. Tasks in the order shown: Frame 1, MB row 1; Frame 1, MB row 2; Frame 1, MB row 3; Frame 2, MB row 1; Frame 1, MB row 4; Frame 2, MB row 2; Frame 1, MB row 5; Frame 2, MB row 3; Frame 3, MB row 1; Frame 2, MB row 4; Frame 3, MB row 2; Frame 2, MB row 5; Frame 3, MB row 3; Frame 4, MB row 1]
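
The task order in the example is consistent with a simple start-time rule: each MB row starts 2 TMB after the row above it, and each frame starts 4 TMB after the previous frame. The sketch below encodes that reading of the timing diagram; it is an assumption inferred from the slide labels, with TMB taken as the time to encode one MB, and the function names are hypothetical.

```python
def row_start_time(frame, row, t_mb=1):
    """Start time of an MB row; frame and row are 1-based, as in the slides."""
    return (4 * (frame - 1) + 2 * (row - 1)) * t_mb


def schedule(num_frames, rows_per_frame, t_mb=1):
    """All (start_time, frame, row) tasks, ordered by time, then frame, then row."""
    tasks = [(row_start_time(f, r, t_mb), f, r)
             for f in range(1, num_frames + 1)
             for r in range(1, rows_per_frame + 1)]
    return sorted(tasks)


order = [(frame, row) for _, frame, row in schedule(4, 5)][:14]
```

With four frames of five MB rows, the first fourteen entries of `order` reproduce the sequence listed on the slide.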

29 Data Partition & Task Priority: Task assigning and priorities (3/5)
- To achieve the optimal encoding speed:
  - QCIF requires 25 processors
  - CIF requires 99 processors
  - HDTV720 requires 900 processors
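
The three processor counts agree with a back-of-the-envelope estimate: if one frame completes every 4 TMB and each frame holds N macroblocks of work, about N / 4 processors must be busy at any time. The formula below is an inference that happens to match the slide's numbers, not a formula quoted from the paper; the frame sizes (176×144, 352×288, 1280×720) are the standard ones and the function names are hypothetical.

```python
def macroblocks(width, height, mb_size=16):
    """Number of 16x16 macroblocks in a frame (dimensions rounded up)."""
    return -(-width // mb_size) * -(-height // mb_size)


def processors_needed(width, height, frame_interval_in_mb_times=4):
    """Assumed estimate: MBs per frame divided by the frame interval, rounded up."""
    return -(-macroblocks(width, height) // frame_interval_in_mb_times)


qcif = processors_needed(176, 144)
cif = processors_needed(352, 288)
hd720 = processors_needed(1280, 720)
```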

30 Data Partition & Task Priority: Task assigning and priorities (4/5)
- In practice, a large number of processor units is not available, hence priority-based task scheduling.
- Priorities are defined at two levels:
  - Inter-frame level
  - Intra-frame level

31 Data Partition & Task Priority: Task assigning and priorities (5/5)
- Inter-frame level
  - If several MBs belonging to different frames are ready to be encoded concurrently, the MBs in the frame with the smaller frame number should be encoded first.
- Intra-frame level
  - If several MBs belonging to different MB rows in the same frame are ready to be encoded concurrently, the MBs in the row with the smaller row index should be encoded first.
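
The two priority levels amount to ordering ready MBs by (frame number, row index) and handing the first few to the available processors. A minimal sketch, assuming each ready task is a (frame, row, col) tuple; the function name and task representation are illustrative, not the paper's.

```python
import heapq


def pick_tasks(ready, num_processors):
    """Select up to `num_processors` ready MBs by priority.

    Lower frame number wins first (inter-frame level); within a frame,
    lower row index wins (intra-frame level).
    """
    return heapq.nsmallest(num_processors, ready, key=lambda t: (t[0], t[1]))


ready = [(2, 0, 3), (1, 4, 0), (1, 2, 5), (3, 0, 0)]
chosen = pick_tasks(ready, 2)
```

Favoring the oldest frame keeps the number of frames in flight small, and favoring upper rows releases the dependencies of the rows below as early as possible.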

32 Experimental Results: Overview (1/1)
- The wavefront simulator is written in C and runs on a PC with a 2.8 GHz Pentium 4 processor and 512 MB of memory.
- The simulation results are compared with JM 9.0:
  - H.264 baseline profile
  - Search range = ±10
  - One reference frame, Hadamard transform, full R-D optimization, CAVLC entropy coding

33 Experimental Results
- The relationship between the number of processors and the number of concurrently processed frames

34 Experimental Results
- Theoretical processing time per frame

35 Experimental Results
- Simulation results
  - Sequences: Grandma.YUV (QCIF), Paris.YUV (CIF)
  - Table columns: Avg encoding time per frame, SnrY, SnrU, SnrV, # of bytes, Speed-up
  - Table rows: Wavefront simulator, JM
  - Surviving values: the wavefront simulator's average encoding time per frame is 273 ms in the first table and 1272 ms in the second; the remaining cells are not preserved.

36 Conclusions
- This paper presents the new wavefront parallelization method for the H.264 encoder.
- Analysis and simulation results show that it can achieve optimal compression at a frame rate that increases approximately linearly with the number of parallel processing elements.

References
[1] T.-C. Chen, Y.-W. Huang, and L.-G. Chen, "Analysis and design of macroblock pipelining for H.264/AVC VLSI architecture," in Proceedings of the 2004 International Symposium on Circuits and Systems, vol. 2, May 2004, pp. II
[2] Y.-W. Huang, T.-C. Chen, C.-H. Tsai, C.-Y. Chen, T.-W. Chen, C.-S. Chen, C.-F. Shen, S.-Y. Ma, T.-C. Wang, B.-Y. Hsieh, H.-C. Fang, and L.-G. Chen, "A 1.3TOPS H.264/AVC single-chip encoder for HDTV applications," in IEEE Int. Conf. Solid-State Circuits, Feb 2005, pp
[3] Y.-K. Chen, T. X, S. Ge, and G. M, "Towards efficient multi-level threading of H.264 encoder on Intel Hyper-Threading architectures," in 18th Int. Parallel and Distributed Processing Symposium, Apr 2004, p. 63
[4] S. M. Akramulah, I. Ahmad, and M. L. Liou, "Parallelization of MPEG-2 video encoder for parallel and distributed computing systems," in Proceedings of the 38th Midwest Symposium on Circuits and Systems, vol. 2, Aug 1995, pp
[5] P. Tiwari and E. Viscito, "A parallel MPEG-2 video encoder with look-ahead rate control," in Int. Conf. Acoustics, Speech, and Signal Processing, vol. 4, May 1996, pp
[6] K. Shen, L. A. Rowe, and E. J. Delp, "Parallel implementation of an MPEG-1 encoder: faster than real time," in SPIE, vol. 2419, Feb 1995, pp