Highly Parallel Rate-Distortion Optimized Intra-Mode Decision on Multicore Graphics Processors Ngai-Man Cheung, Oscar C. Au, Senior Member, IEEE, Man-Cheung.


Highly Parallel Rate-Distortion Optimized Intra-Mode Decision on Multicore Graphics Processors Ngai-Man Cheung, Oscar C. Au, Senior Member, IEEE, Man-Cheung Kung, Peter H.W. Wong, Senior Member, IEEE, and Chun Hung Liu CSVT NOVEMBER

Outline
Introduction
Intra-Prediction
Parallel RD Optimized Intra-Mode Decision
Experiments
Conclusion

Introduction
Multicore graphics processors
▫Graphics processing units (GPUs)
▫Coprocessing units that help CPUs accelerate numerical and signal processing applications, thanks to their high-performance multicore and pipelined architectures
This work investigates the use of GPUs to perform RD optimized intra-mode selection in AVS and H.264

Difficulties
Intra-mode decision
▫Dependency between the current block and its adjacent blocks
▫Determining the encoding bit-rate for each candidate mode may require conditional branching

Contributions
Analyze the dependency constraints in intra-mode decision
Propose a strategy to determine the mode decisions of video blocks in parallel
▫Encode the blocks in novel orders
Extend a bit-rate approximation method to estimate the rate in the RD cost computation

Intra-prediction in H.264 (4 × 4)
(a) 4 × 4 current block (pixels a-p) and its neighboring reconstructed pixels (A-M):
M A B C D E F G H
I a b c d
J e f g h
K i j k l
L m n o p
(b) Prediction directions and their corresponding modes, plus the DC mode (figure not reproduced)

Intra-prediction in AVS 1.0 (8 × 8)
Vertical mode
Horizontal mode
DC mode
Down-right mode
Down-left mode: bidirectional prediction

Dependency Analysis
Dependency constraints on block encoding order
▫Prediction direction
The RD costs of the current block cannot readily be determined until all of its candidate reference blocks have been encoded and reconstructed
▫Pixel filtering (AVS)
Filtering may be applied to the reconstructed pixels of adjacent blocks before they are used in prediction; because this filtering may involve pixels from several blocks, it adds further block dependencies

Dependency Analysis
Dependency between the four 8 × 8 blocks (K1-K4) in the current macroblock and their spatially adjacent neighbor blocks (T1-T4, L1, L2) in AVS intra-prediction

Dependency Analysis
Dependency between the four 4 × 4 blocks (K1-K4) in the current 8 × 8 block and their spatially adjacent neighbor blocks (T1-T4, L1, L2) in H.264 intra-prediction

Dependency Analysis
The dependency relationships form directed acyclic graphs
▫Parallelize the RD cost computation of the four constituent blocks of the same 16 × 16 MB
▫Compute in parallel the RD costs of blocks from different 16 × 16 MBs

Greedy-Based Block Encoding Order
In each iteration, encode every block whose reference reconstructed pixels are all available
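The greedy rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' GPU implementation: the helper names `greedy_order` and `grid_deps` are hypothetical, and the dependency model (every block references its left, top-left, top and top-right neighbors) is a simplification that ignores the codec-specific availability rules, so its iteration count need not match the figures reported for AVS or H.264.

```python
def greedy_order(deps):
    """Group blocks into iterations; every block in an iteration has all
    of its reference blocks processed in earlier iterations."""
    done = set()
    remaining = set(deps)
    iterations = []
    while remaining:
        # A block is ready once its whole reference set is already done.
        ready = sorted(b for b in remaining if deps[b] <= done)
        if not ready:
            raise ValueError("dependency graph contains a cycle")
        iterations.append(ready)
        done.update(ready)
        remaining.difference_update(ready)
    return iterations


def grid_deps(rows, cols):
    """Assumed, simplified model: block (r, c) references its left,
    top-left, top and top-right neighbors when they lie inside the frame."""
    deps = {}
    for r in range(rows):
        for c in range(cols):
            candidates = [(r, c - 1), (r - 1, c - 1), (r - 1, c), (r - 1, c + 1)]
            deps[(r, c)] = {
                (rr, cc) for rr, cc in candidates
                if 0 <= rr < rows and 0 <= cc < cols
            }
    return deps


schedule = greedy_order(grid_deps(3, 4))
for i, blocks in enumerate(schedule, start=1):
    print(i, blocks)
```

All blocks listed in one iteration are independent of each other, so their RD costs could be computed by parallel threads.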

Greedy-Based Block Encoding Order: AVS example (figure)

Greedy-Based Block Encoding Order: AVS example, modified version
▫Postpone the encoding of several blocks along the left frame boundary
▫All four constituent blocks of any MB can then be encoded consecutively
▫Does not incur any execution time penalty

Greedy-Based Block Encoding Order: AVS example (figure)

Optimality
Lemma 1: The proposed greedy-based encoding order can process all bottleneck path(s) P∗ in exactly n∗ iterations
Proof of Lemma 1 (by contradiction):
▫Suppose the greedy-based order requires more than n∗ iterations to process P∗
▫Then there is at least one processing gap of length w, between consecutive blocks Ki and Ki+1 of P∗, during which P∗ is not being processed
▫Since the order is greedy, Ki+1 must have been waiting for an immediate parent block Bm
▫Continuing to backtrack from Bm, one eventually reaches some block Kj in P∗; call the backtracked path P1 and the segment of P∗ between Kj and Ki+1 P0
▫P1 has no processing gap, so P1 is longer than P0
▫Replacing P0 with P1 in P∗ would give a path longer than P∗, a contradiction

Optimality
Theorem 1: The proposed greedy-based order can process all the video blocks in a frame in n∗ iterations
▫Let P∗ = {K1, K2, ..., Kn∗}; by Lemma 1, its last block Kn∗ is processed in the n∗th iteration
▫Since all the paths also end in Kn∗, all the blocks can be processed within n∗ iterations
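A small numerical check may make the theorem more concrete. The sketch below uses a hypothetical six-block dependency graph (`DEPS` is invented for illustration, not taken from the slides) and compares the number of blocks on the longest dependency path with the number of iterations a greedy schedule needs.

```python
from functools import lru_cache

# Hypothetical toy graph: block -> the blocks it references.
DEPS = {
    "A": (),
    "B": ("A",),
    "C": ("A",),
    "D": ("B", "C"),
    "E": ("D",),
    "F": ("A",),
}


@lru_cache(maxsize=None)
def longest_path_to(block):
    """Number of blocks on the longest dependency path ending at `block`."""
    return 1 + max((longest_path_to(p) for p in DEPS[block]), default=0)


def greedy_iterations():
    """Count iterations when every ready block is processed immediately."""
    done = set()
    count = 0
    while len(done) < len(DEPS):
        ready = {b for b in DEPS if b not in done and set(DEPS[b]) <= done}
        done |= ready
        count += 1
    return count


n_star = max(longest_path_to(b) for b in DEPS)
print(n_star, greedy_iterations())
```

On this toy graph both numbers agree, which is the behavior the theorem states for the greedy order.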

Performance estimation
One of the longest paths in H.264 4 × 4 intra-prediction (figure)
The length can be found to be
n∗ = ((V/4)/2) × 2 + H/4 − 2 = V/4 + H/4 − 2
n∗ = (V/4) × 2 + (H/4)/2 − 2 = (V/4) × 2 + H/8 − 2
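As a worked example, the two expressions can be evaluated for a concrete frame size. The sketch assumes that V and H denote the frame height and width in pixels and that both are multiples of 8. The slide does not say which case each expression covers, so the function names `n_star_first` and `n_star_second` are neutral placeholders.

```python
def n_star_first(V, H):
    """First expression on the slide: ((V/4)/2) x 2 + H/4 - 2."""
    return ((V // 4) // 2) * 2 + H // 4 - 2


def n_star_second(V, H):
    """Second expression on the slide: (V/4) x 2 + (H/4)/2 - 2."""
    return (V // 4) * 2 + (H // 4) // 2 - 2


# Assumed example: a CIF frame, 352 pixels wide and 288 pixels high.
V, H = 288, 352
print(n_star_first(V, H), n_star_second(V, H))
```

For comparison, a CIF frame contains (288/4) × (352/4) = 6336 blocks of 4 × 4 pixels, so either iteration count is far smaller than the number of blocks.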

Bit-Rate Estimation
Lagrangian cost function
Entropy coding may involve many branching instructions, which are hard to implement efficiently on a pipelined architecture
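The slide's formula did not survive extraction. The usual form of a Lagrangian RD cost is J = D + λ·R, where D is the distortion, R the rate in bits and λ the Lagrange multiplier. A minimal sketch of mode selection with that cost follows; the mode names and numbers are made up for illustration.

```python
def rd_cost(distortion, rate_bits, lam):
    """Standard Lagrangian cost J = D + lambda * R."""
    return distortion + lam * rate_bits


def best_mode(candidates, lam):
    """candidates maps mode name -> (distortion, rate_bits).
    Returns the mode with the smallest RD cost."""
    return min(candidates, key=lambda m: rd_cost(*candidates[m], lam))


# Made-up distortion/rate pairs for three candidate modes.
candidates = {
    "vertical": (120.0, 30),
    "horizontal": (100.0, 45),
    "dc": (140.0, 22),
}
print(best_mode(candidates, lam=2.0))
print(best_mode(candidates, lam=0.5))
```

A larger λ penalizes rate more heavily, which is why the chosen mode can change between the two calls.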

Fast Bit Rate Estimation for Mode Decision
Tc: number of nonzero coefficients
Tz: number of zeros before the last nonzero coefficient
|Lk|: absolute value of the kth nonzero coefficient
Fk: frequency of the kth nonzero coefficient
[33] M. G. Sarwer and L.-M. Po, “Fast bit rate estimation for mode decision of H.264/AVC,” IEEE Trans. Circuits Syst. Video Technol., vol. 17, no. 10, pp. 1402–1407, Oct
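The four quantities can be gathered in a single branch-light pass over a block's quantized coefficients, as the sketch below shows. Several things here are assumptions rather than content of the slide or of [33]: the coefficients are taken to be in zigzag scan order, Fk is interpreted as the scan position of the kth nonzero coefficient, and the features are combined linearly with placeholder weights of 1.0. The actual estimation formula and its weights are not shown on the slide.

```python
def rate_features(coeffs):
    """Return (Tc, Tz, sum of |Lk|, sum of Fk) for one block.
    coeffs: quantized coefficients, assumed to be in zigzag scan order."""
    nonzero = [(k, c) for k, c in enumerate(coeffs) if c != 0]
    tc = len(nonzero)
    if tc == 0:
        return 0, 0, 0, 0
    last = nonzero[-1][0]
    tz = (last + 1) - tc                    # zeros before the last nonzero
    sum_abs = sum(abs(c) for _, c in nonzero)
    sum_freq = sum(k for k, _ in nonzero)   # assumed meaning of Fk
    return tc, tz, sum_abs, sum_freq


def estimate_rate(coeffs, weights=(1.0, 1.0, 1.0, 1.0)):
    """Hypothetical linear combination; the weights are placeholders."""
    return sum(w * f for w, f in zip(weights, rate_features(coeffs)))


# Made-up 4x4 block (16 coefficients) for illustration.
coeffs = [7, 0, -2, 1, 0, 0, 3] + [0] * 9
print(rate_features(coeffs), estimate_rate(coeffs))
```

Because the estimate needs only counts and sums, it avoids the table lookups and conditional branches of real entropy coding, which is the property the slides rely on for the GPU.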

Experiments
PC equipped with one GeForce 8800 GTS PCIe graphics card with 96 stream processors
Intel Pentium GHz processor with 1GB DDR2 memory
Reference software: H.264 JM 14.0 and AVS RM 6.2

Encoding Bit-Rate Estimation (figure)

Parallel RD Optimized Intra-Mode Decision (H.264)
More than 80 times reduction in execution time
QP has no significant effect

Parallel RD Optimized Intra-Mode Decision (H.264)
Parallelism within a MB

Parallel RD Optimized Intra-Mode Decision (H.264)
Similar speedups when RDO is disabled

Parallel RD Optimized Intra-Mode Decision (H.264): two further figure-only slides

Parallel RD Optimized Intra-Mode Decision (AVS): two figure-only slides

Parallel RD Optimized Intra-Mode Decision
39: 96 (processors) × 2 (threads) / 5 (modes) = 38.4
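The arithmetic on this slide can be checked directly. The interpretation used below is an assumption: the 39 is read as the number of blocks that must be processed in parallel before every hardware thread has work, given that each block evaluates 5 candidate modes (the five AVS modes listed earlier). The variable names are illustrative.

```python
processors = 96          # stream processors on the GeForce 8800 GTS
threads_per_proc = 2     # threads per processor, as stated on the slide
modes_per_block = 5      # candidate modes evaluated per block

total_threads = processors * threads_per_proc
blocks_exact = total_threads / modes_per_block
# Integer ceiling: smallest whole number of blocks that fills all threads.
blocks_needed = -(-total_threads // modes_per_block)

print(total_threads, blocks_exact, blocks_needed)
```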

Conclusion
Based on the dependency analysis of intra-mode decision, the video blocks are encoded following the greedy orders, leading to highly parallel RD cost computations
More than 80 times speedup for GPU-based intra-prediction; the GPU can be used to offload intra-prediction from the CPU
To facilitate implementation on the GPU, a bit-rate approximation method is used to estimate the rate in the RD cost computation
The approximation errors have only a small impact on coding performance: no more than 0.12 dB loss in PSNR and 0.98% bit-rate increase