Parallelization of HEVC Deblocking filters using CUDA GPU


1 Parallelization of HEVC Deblocking filters using CUDA GPU
A PROJECT UNDER THE GUIDANCE OF DR. K. R. RAO
COURSE: EE MULTIMEDIA PROCESSING, SPRING 2016
SUBMISSION DATE: 28th April 2016
SUBMITTED BY ARPITA YAGNIK
UT ARLINGTON ID:

2 TABLE OF CONTENTS
1 ACRONYMS
2 OBJECTIVE AND ACTION PLAN
3 WHAT ARE BLOCKING ARTIFACTS
4 WHY DO BLOCKING ARTIFACTS OCCUR
5 HOW TO REMOVE BLOCKING ARTIFACTS
6 HEVC DECODER
7 HEVC DEBLOCKING FILTER
8 TIMING ANALYSIS FOR ENCODING
9 WHY PARALLEL PROCESSING
10 ADVANTAGES OF PARALLELIZATION OF HEVC DEBLOCKING FILTER
11 WHAT IS GPU ACCELERATED COMPUTING
12 GPU VS CPU
13 INTRODUCTION TO CUDA GPU
14 OVERVIEW OF CUDA MEMORY
15 IMAGE TO GRID MAPPING
16 STEPS OF PROGRAMMING GPU
17 GPU GRID COMPUTING
18 THE EXISTING ALGORITHM FOR PARALLEL DEBLOCKING
19 EXPECTED RESULTS
20 TEST ENVIRONMENT
21 PROPOSED NEW ALGORITHM
22 CONCLUSIONS AND FUTURE WORK
23 REFERENCES

3 Acronyms
AVC: Advanced Video Coding
BS: Boundary Strength
CODEC: COder/DECoder
Chroma: Chrominance
CPU: Central Processing Unit
CTU: Coding Tree Unit
CU: Coding Unit
CUDA: Compute Unified Device Architecture
DBF: Deblocking Filter
DCT: Discrete Cosine Transform
DFT: Discrete Fourier Transform
GPU: Graphics Processing Unit
HEVC: High Efficiency Video Coding
IEC: International Electrotechnical Commission
ISO: International Organization for Standardization
ITU-T: International Telecommunication Union (Telecommunication Standardization Sector)
JBIG: Joint Bi-level Image Experts Group
JPEG: Joint Photographic Experts Group
JCT-VC: Joint Collaborative Team on Video Coding
LOT: Lapped Orthogonal Transform
Luma: Luminance
MB: Macro Block
MPEG: Moving Picture Experts Group
OBMC: Overlapped Block Motion Compensation
PU: Prediction Unit
QP: Quantization Parameter
SAO: Sample Adaptive Offset
TU: Transform Unit

4 Objective and Implementation steps
Objective: To implement the parallelization of the deblocking filter for the HEVC CODEC using a CUDA GPU.
Implementation steps:
1) Run the HM code and measure the time taken by the deblocking filter.
2) Set up a way to program the CUDA GPU.
3) Design a novel algorithm for parallel processing of the deblocking filter.

5 What are blocking artifacts?
Blocking artifacts are visible discontinuities at block boundaries in the reconstructed signal of a coding scheme that uses block-based prediction and transform coding. These artifacts are especially annoying at low bit rates.

6 Contd…
The effect of blocking artifacts [2] (panel label: With blocking artifacts)

7 Why do Blocking Artifacts occur?
Modern video coding standards try to remove as much redundancy from the coded representation of video as possible. To do this, they use block-based motion prediction and transform coding. Blocks are coded relatively independently of their neighbouring blocks, and each approximates the original signal with some degree of similarity. Since coded blocks only approximate the original signal, the differences between these approximations may cause discontinuities at the prediction and transform block boundaries. [2] [28]

8 How to remove blocking artifacts?
Approaches to reduce blocking artifacts are implemented as deblocking algorithms or filters. Deblocking algorithms can be classified into four types:
1) In-loop filtering
2) Post-processing
3) Pre-processing
4) Overlapped block methods

9 HEVC Decoder HEVC Decoder Block Diagram [28]

10 Contd…
The three stages of HEVC video decoding:
1) Entropy decoding and picture reconstruction
2) DBF
3) SAO filtering

11 HEVC Deblocking filter
The deblocking filter in HEVC has been designed to improve subjective quality while reducing complexity. The HEVC deblocking filter is well suited to parallel processing: it has been designed to prevent spatial dependencies across the picture, which, together with other design features, enables easy parallelization.

12 Contd…
Deblocking in HEVC has been designed to prevent spatial dependencies of the deblocking process across the picture. There is no overlap between the filtering operations for one block edge, which can modify at most 3 pixels from the block edge, and the filtering decisions for the neighboring parallel block edge, which involve at most 4 pixels from the block edge. Hence any vertical block edge in the picture can be deblocked in parallel with any other vertical edge. The same holds for horizontal edges.
The picture is divided into non-overlapping 8x8 blocks of samples. Each of those 8x8 blocks contains all the data required for its deblocking. Consequently, deblocking can be performed independently for each of the 8x8 blocks. Moreover, the order of vertical and horizontal filtering for each block is exactly the same irrespective of the block position.
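The independence argument on this slide can be checked with a small sketch. The Python snippet below is illustrative only: the helper names are hypothetical, and the spans assume that filtering modifies at most 3 samples, and filtering decisions read at most 4 samples, on each side of an edge.

```python
def edge_positions(width, grid=8):
    """Vertical edge x-coordinates on the deblocking grid (picture border excluded)."""
    return list(range(grid, width, grid))


def modified_span(x, n_mod=3):
    """Sample columns that filtering of the edge at x may change."""
    return set(range(x - n_mod, x + n_mod))


def decision_span(x, n_read=4):
    """Sample columns that the filtering decisions for the edge at x may read."""
    return set(range(x - n_read, x + n_read))


def edges_are_independent(width, grid=8):
    """True if no edge's decisions read samples that another edge may modify."""
    edges = edge_positions(width, grid)
    return all(
        not (modified_span(a) & decision_span(b))
        for a in edges
        for b in edges
        if a != b
    )


print(edges_are_independent(352, grid=8))  # CIF width on the 8x8 grid
print(edges_are_independent(352, grid=4))  # a 4x4 grid, for comparison
```

Under these assumptions the check passes on the 8x8 grid and fails on a 4x4 grid, which is why edges spaced 8 samples apart can be handed to separate threads.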

13 Contd… 8x8 grid for deblock filtering [1]

14 Full flow chart for luma as well as chroma deblocking
[19]

15 Contd… [19]

16 Contd… [1]

17 Implement HM to get the timings for deblocking filter

18 Timing Analysis for encoding
Comparison of encoding timings for different sequences[52]

19 Timing without Deblock filtering
Sequence: waterfall_cif.yuv (352x288)
FPS: 25
No. of frames: 50
QP: 32
Configuration: random access main
The combined PSNR (PSNR_YUV) is calculated as the weighted sum of the per-picture PSNR of the individual components (PSNR_Y, PSNR_U and PSNR_V); it is valid for the 4:2:0 format only:
$$\mathrm{PSNR}_{YUV} = \frac{6 \cdot \mathrm{PSNR}_Y + \mathrm{PSNR}_U + \mathrm{PSNR}_V}{8} \qquad (1)$$
where PSNR_Y, PSNR_U and PSNR_V are each computed as
$$\mathrm{PSNR} = 10 \cdot \log_{10}\!\left(\frac{(2^B - 1)^2}{\mathrm{MSE}}\right) \qquad (2)$$
where B = 8 is the number of bits per sample of the video signal to be coded and the MSE is the sum of squared differences divided by the number of samples in the signal.
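The PSNR definitions on this slide can be sketched in Python as follows. This is a minimal illustration, not the HM implementation: the function names and the toy sample values are made up, and the divisor 8 in the combined PSNR is taken to be the sum of the weights 6 + 1 + 1.

```python
import math


def mse(original, reconstructed):
    """Sum of squared differences divided by the number of samples."""
    if len(original) != len(reconstructed) or not original:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)


def psnr(mse_value, bit_depth=8):
    """Equation (2): PSNR in dB for samples of the given bit depth."""
    if mse_value == 0:
        return math.inf  # identical signals
    peak = (1 << bit_depth) - 1
    return 10.0 * math.log10(peak * peak / mse_value)


def psnr_yuv(psnr_y, psnr_u, psnr_v):
    """Equation (1): weighted combination for 4:2:0 video (weights sum to 8)."""
    return (6.0 * psnr_y + psnr_u + psnr_v) / 8.0


# Toy example with made-up sample values and made-up chroma PSNRs
y_psnr = psnr(mse([100, 102, 98, 101], [101, 102, 97, 101]))
print(round(y_psnr, 2), round(psnr_yuv(y_psnr, 45.0, 46.0), 2))
```

The luma term carries six of the eight weights, so PSNR_YUV tracks PSNR_Y closely.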

20 Timing with Deblocking filter ON

21 Implementation Results for Timing
(1) Random Access Configuration
Column headings: Test Sequence | Total time of encoding (Sec) | YUV PSNR (dB) | Bit rate (Kbps) | Deblocking filter ON | Deblocking filter OFF
Waterfall_cif
Akiyo_cif 39.660 39.972
Container_cif 75.508 75.690
(2) Intra main Configuration
Column headings: Test Sequence | Total time of encoding (Sec) | YUV PSNR (dB) | Bit rate (Kbps) | Deblocking filter ON | Deblocking filter OFF
Waterfall_cif 51.913 30.355
Akiyo_cif 24.749 24.340
Container_cif 27.922 27.733

22 Why Parallel Processing
It has been shown that the HEVC deblocking filter is responsible for 14% of the time consumption in the random access configuration, for Full HD video sequences. Compared to the DBF of AVC, DBF of HEVC is computationally less complex and offers more parallelization possibilities.
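As a rough illustration of what the 14% figure implies, Amdahl's law bounds the overall gain from accelerating only the deblocking filter. This is a back-of-the-envelope sketch: the function name and the example speedup factors are illustrative assumptions, not measurements from this project.

```python
def overall_speedup(fraction, stage_speedup):
    """Amdahl's law: overall speedup when `fraction` of the run time
    is accelerated by a factor of `stage_speedup`."""
    return 1.0 / ((1.0 - fraction) + fraction / stage_speedup)


DBF_FRACTION = 0.14  # share of time attributed to the deblocking filter

for s in (2, 10, 100):
    print(f"DBF {s}x faster -> overall {overall_speedup(DBF_FRACTION, s):.3f}x")

# Upper bound when the deblocking time goes to zero
print(f"limit -> {1.0 / (1.0 - DBF_FRACTION):.3f}x")
```

The bound shows why the later slides also care about transfer overhead and about combining the deblocking filter with other stages: the deblocking share alone caps the achievable gain.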

23 Advantages of Parallelization of Deblocking Filter
Reduced hardware complexity, as the order of filtering the block boundaries does not change with different orders of CTU decoding.
Useful for parallel processing on multi-core processors.
Improves throughput and greatly reduces the bandwidth requirement for multi-core HEVC implementations.
A highly parallelized HEVC deblocking filter provides enough cycle margin to combine the deblocking filter and SAO in the same building block.

24 Implementation of GPU programming

25 What is GPU accelerated computing
GPU-accelerated computing is the use of a graphics processing unit (GPU) together with a CPU to accelerate scientific, analytics, engineering, consumer, and enterprise applications. Pioneered in 2007 by NVIDIA, GPU accelerators now power energy-efficient data centers in government labs, universities, enterprises, and small-and-medium businesses around the world. GPUs are accelerating applications in platforms ranging from cars, to mobile phones and tablets, to drones and robots.[51]

26 Concept of GPU accelerated computing
Concept of GPU Parallel Processing [51]

27 GPU vs CPU
A simple way to understand the difference between a CPU and GPU is to compare how they process tasks. A CPU consists of a few cores optimized for sequential serial processing while a GPU has a massively parallel architecture consisting of thousands of smaller, more efficient cores designed for handling multiple tasks simultaneously. [51]
GPU vs CPU video

28 Contd…
Parallel processing through GPU [51]

29 Introduction to CUDA GPU
CUDA® (Compute Unified Device Architecture) is a parallel computing platform and programming model invented by NVIDIA. It enables dramatic increases in computing performance by harnessing the power of the graphics processing unit (GPU).[51] CUDA is widely used in general purpose computation, such as astronomical calculation, computational fluid dynamics simulation, image processing and video codec.

30 Contd… CUDA Programming model[52]

31 Contd… Thread view[54]

32 Overview of CUDA memory
Memory interface[54]

33 Image to grid mapping
[53]

34 Steps of programming GPU
[54]

35 GPU Grid Computing [53]

36 Contd… [53]

37 The existing Algorithms for parallel deblocking
[55]

38 Contd… [55]

39 Contd…
From the perspective of the decoder, the steps for deblocking are as follows:
1) Perform horizontal filtering of all vertical edges in the picture. The steps in horizontal filtering are:
a) Calculate the boundary strength (BS) of all the prediction unit edges and transform unit edges lying on the 8x8 grid.
b) If BS > 0, then for each four-sample part of the 8x8 block, make the filtering decision (weak filter, strong filter or no filter) and apply the filtering to the samples.
2) Perform vertical filtering of all horizontal edges in the picture. The filtering steps are similar to those of horizontal filtering; however, the samples modified by horizontal filtering are used as input to the vertical filtering.
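The two-pass order described on this slide can be sketched as below. This is a simplified illustration under stated assumptions: `placeholder_filter` is a made-up smoothing step and not the HEVC weak or strong filter, the `bs` callback is a hypothetical stand-in for the boundary strength computation, and the weak/strong/no-filter decision is reduced to the BS > 0 gate.

```python
GRID = 8      # deblocking grid spacing
SEGMENT = 4   # decisions are made per four-sample part of an edge


def placeholder_filter(p0, q0):
    """Stand-in smoothing of the two samples next to an edge (NOT the HEVC filter)."""
    d = (q0 - p0) // 4
    return p0 + d, q0 - d


def deblock(pic, bs):
    """Filter `pic` (list of rows) in place; bs(kind, x, y) gives the boundary strength."""
    height, width = len(pic), len(pic[0])
    # Pass 1: horizontal filtering of all vertical edges
    for x in range(GRID, width, GRID):
        for y0 in range(0, height, SEGMENT):
            if bs("vertical", x, y0) > 0:
                for y in range(y0, min(y0 + SEGMENT, height)):
                    pic[y][x - 1], pic[y][x] = placeholder_filter(pic[y][x - 1], pic[y][x])
    # Pass 2: vertical filtering of all horizontal edges, reading pass-1 output
    for y in range(GRID, height, GRID):
        for x0 in range(0, width, SEGMENT):
            if bs("horizontal", x0, y) > 0:
                for x in range(x0, min(x0 + SEGMENT, width)):
                    pic[y - 1][x], pic[y][x] = placeholder_filter(pic[y - 1][x], pic[y][x])
    return pic


# 16x16 toy picture with a step across the vertical edge at x = 8
picture = [[0] * 8 + [100] * 8 for _ in range(16)]
deblock(picture, lambda kind, x, y: 1)
print(picture[0][6:10])
```

Within each pass, every (edge, segment) pair touches its own samples, so the two inner loops of a pass are the natural candidates for mapping onto GPU threads; only the ordering between pass 1 and pass 2 has to be preserved.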

40 Algorithm2/Expected Results
[59]

41 Test Environment
CUDA toolkit and nvcc compiler
NVIDIA Quadro 4000 CUDA GPU, max threads: 1356
HM code 16.9
Windows bit
Visual Studio 2013
Encoding profiles: all intra, random access

42 Technical specifications: Quadro 4000K
Quadro 4000K specifications [61]

43 Proposed new CTU based algorithm

44 PROPOSED NEW ALGORITHM
Get the CTU from the buffer and assign 8x8 threads to the 8x8 blocks of GPU memory for the 64x64 CTU.
Call the kernel function for vertical and horizontal edge deblocking; a pointer to the CTU is passed as the input parameter.
Vertical and horizontal edge filtering kernel:
1) Check the BS conditions.
2) If BS > 0, check the other conditions for luminance deblocking.
3) If condition 1 is false, the filter is off; return the CTU.
4) If conditions 2-7 are false, apply the weak filter.
5) If conditions 2-7 are true, apply the strong filter.
6) If BS = 2, apply the chrominance strong filter.
7) Change the thread map and repeat for horizontal edge filtering.
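One way to read the thread assignment on this slide is one thread per 8x8 block, so that an 8x8 arrangement of threads covers a 64x64 CTU. The sketch below models that mapping in plain Python; the function names are hypothetical, and in a real CUDA kernel the indices `tx` and `ty` would come from `threadIdx.x` and `threadIdx.y`.

```python
CTU_SIZE = 64   # luma samples per CTU side
BLOCK = 8       # deblocking grid block size


def thread_to_block(tx, ty):
    """Top-left sample of the 8x8 block handled by thread (tx, ty) inside the CTU."""
    return tx * BLOCK, ty * BLOCK


def ctu_thread_map():
    """Map each of the 8x8 thread indices to its 8x8 block origin."""
    n = CTU_SIZE // BLOCK
    return {(tx, ty): thread_to_block(tx, ty) for ty in range(n) for tx in range(n)}


def covered_samples(thread_map):
    """All sample coordinates touched when every thread owns one full block."""
    return {
        (x0 + dx, y0 + dy)
        for (x0, y0) in thread_map.values()
        for dy in range(BLOCK)
        for dx in range(BLOCK)
    }


mapping = ctu_thread_map()
print(len(mapping), mapping[(7, 7)], len(covered_samples(mapping)))
```

With this layout the 64 threads cover every sample of the CTU exactly once, which is the property the kernel relies on when each thread filters its own block.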

45 Contd…
Call the kernel function for CTU edge deblocking for the existing CTU in GPU memory; pixels are allocated along each boundary.
For each 8x4 and 4x8 block around the boundary:
1) Check the BS conditions.
2) If BS > 0, check the other conditions for luminance deblocking.
3) If condition 1 is false, the filter is off; return the CTU.
4) If conditions 2-7 are false, apply the weak filter.
5) If conditions 2-7 are true, apply the strong filter.
6) If BS = 2, apply the chrominance strong filter.
7) Change the thread map and repeat for horizontal edge filtering.
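The per-edge decision steps listed for the kernels can be sketched as a small selection function. This is a hedged illustration: the function and parameter names are hypothetical, the individual conditions are passed in as already-evaluated booleans, and the sketch assumes that any failed condition among 2-7 selects the weak filter.

```python
def select_luma_filter(bs, condition_1, strong_conditions):
    """Return 'off', 'weak' or 'strong' for one four-sample edge segment.

    bs                -- boundary strength of the edge
    condition_1       -- outcome of the on/off check (condition 1 in the steps)
    strong_conditions -- outcomes of conditions 2-7
    """
    if bs <= 0 or not condition_1:
        return "off"
    if all(strong_conditions):
        return "strong"
    return "weak"  # assumption: any failed strong condition selects the weak filter


def chroma_filter_enabled(bs):
    """Chroma is filtered only when the boundary strength equals 2."""
    return bs == 2


print(select_luma_filter(1, True, [True] * 6))
print(select_luma_filter(2, True, [True, False, True, True, True, True]))
print(select_luma_filter(0, True, [True] * 6), chroma_filter_enabled(2))
```

Keeping the decision as a pure function of the edge's own inputs is what lets each thread evaluate it without communicating with its neighbours.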

46 Method of implementation
Thread allocation and implementation design

47 Conclusion and Future work
The deblocking filter of HEVC takes around 10%-17% of the total encoding time, but it has built-in support for parallelization, which can be exploited to achieve lower encoding times. A CUDA GPU can process the video in parallel and can reduce the time taken compared to serial processing on the CPU.
Future work:
1) Incorporate the algorithm in the HM code. Currently the CUDA code cannot be included inside the HM C++ code and hence needs to be compiled separately.
2) Decrease the overhead of transferring data to the GPU.
3) Develop parallelization of other parts of the HEVC CODEC.

48 References
Andrey Norkin et al., “HEVC Deblocking Filter”, IEEE Transactions on CSVT, vol. 22, no. 12, pp , Dec. 2012
Wei-Yi Wei, “Deblocking Algorithms in Video and Image Compression Coding”, National Taiwan University, Taipei, Taiwan, ROC
B. Bross, et al., High Efficiency Video Coding (HEVC) Text Specification Draft 8, ITU-T SG16 WP3 and ISO/IEC JTC1/SC29/WG11 document JCTVC-J1003, Joint Collaborative Team on Video Coding (JCTVC), Stockholm, Sweden, Jul
ITU-T and ISO/IEC JTC 1, Advanced Video Coding for Generic Audiovisual Services, ITU-T Rec. H.264 and ISO/IEC (AVC), May 2003 (and subsequent editions).
T. Wedi and H. G. Musmann, “Motion and aliasing compensated prediction for hybrid video coding,” IEEE Trans. on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 577–586, Jul
P. List, et al., “Adaptive deblocking filter,” IEEE Trans. on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 614–619, Jul
K. Ugur, K. R. Andersson, and A. Fuldseth, Video Coding Technology Proposal by Tandberg, Nokia, and Ericsson, ITU-T SG16 WP3 and ISO/IEC JTC1/SC29/WG11 document JCTVC-A119, Joint Collaborative Team on Video Coding (JCTVC), Dresden, Germany, Apr
A. Norkin, et al., CE12: Ericsson’s and MediaTek’s Deblocking Filter, ITU-T SG16 WP3 and ISO/IEC JTC1/SC29/WG11 document JCTVC-F118, Joint Collaborative Team on Video Coding (JCTVC), Turin, Italy, Jul
M. Ikeda and T. Suzuki, Non-CE10: Introduction of Strong Filter Clipping in Deblocking Filter, ITU-T SG16 WP3 and ISO/IEC JTC1/SC29/WG11 document JCTVC-H0275, Joint Collaborative Team on Video Coding (JCTVC), San Jose, CA, Feb
M. Ikeda, J. Tanaka, and T. Suzuki, CE12 Subset2: Parallel Deblocking Filter, ITU-T SG16 WP3 and ISO/IEC JTC1/SC29/WG11 document JCTVC-E181, Joint Collaborative Team on Video Coding (JCTVC), Geneva, Switzerland, Mar

49 Contd…
M. Narroschke, S. Esenlik, and T. Wedi, CE12 Subtest 1: Results for Modified Decisions for Deblocking, ITU-T SG16 WP3 and ISO/IEC JTC1/SC29/WG11 document JCTVC-G590, Joint Collaborative Team on Video Coding (JCTVC), Geneva, Switzerland, Nov
A. Norkin, CE10.3: Deblocking Filter Simplifications: Bs Computation and Strong Filtering Decision, ITU-T SG16 WP3 and ISO/IEC JTC1/SC29/WG11 document JCTVC-H0473, Joint Collaborative Team on Video Coding (JCTVC), San Jose, CA, Feb
A. Fuldseth, et al., Tiles, ITU-T SG16 WP3 and ISO/IEC JTC1/SC29/WG11 document JCTVC-F335, Joint Collaborative Team on Video Coding (JCTVC), Turin, Italy, Jul
T. Yamakage, et al., CE12: Deblocking Filter Parameter Adjustment in Slice Level, ITU-T SG16 WP3 and ISO/IEC JTC1/SC29/WG11 document JCTVC-G174, Joint Collaborative Team on Video Coding (JCTVC), Geneva, Switzerland, Nov. 2011
G. Van der Auwera, et al. (Panasonic), Support of Varying QP in Deblocking, ITU-T SG16 WP3 and ISO/IEC JTC1/SC29/WG11 document JCTVC-G1031, Joint Collaborative Team on Video Coding (JCTVC), Geneva, Switzerland, Nov
M. Zhou, O. Sezer, and V. Sze, CE12 Subset 2: Test Results and Architectural Study on De-Blocking Filter Without Parallel on/off Filter Decision, ITU-T SG16 WP3 and ISO/IEC JTC1/SC29/WG11 document JCTVC-G088, Joint Collaborative Team on Video Coding (JCTVC), Geneva, Switzerland, Nov
G. Bjontegaard, Calculation of Average PSNR Differences Between RD-Curves, ITU-T SG16 document VCEG-M33, Joint Collaborative Team on Video Coding (JCTVC), 2001.
F. Bossen, Common Test Conditions, JCTVC-H1100, Joint Collaborative Team on Video Coding (JCTVC), San Jose, CA, 2012.
Po-Kai Hsu and Chung-An Shen, The VLSI Architecture of a Highly Efficient Deblocking Filter for HEVC Systems, DOI /TCSVT , IEEE Transactions on Circuits and Systems for Video Technology
HEVC presentation:
Overview of H.264/AVC:
Detailed overview of HEVC/H.265:

50 Contd…
I.E.G. Richardson, “Video Codec Design: Developing Image and Video Compression Systems”, Wiley, 2002.
I.E.G. Richardson, “The H.264 advanced video compression standard”, 2nd Edition, Hoboken, NJ, Wiley, 2010.
K. Sayood, “Introduction to Data compression”, Third Edition, Morgan Kaufmann Series in Multimedia Information and Systems, San Francisco, CA, 2005.
V. Sze and M. Budagavi, “Design and Implementation of Next Generation Video Coding Systems (H.265/HEVC Tutorial)”, IEEE International Symposium on Circuits and Systems (ISCAS), Melbourne, Australia, June 2014.
V. Sze, M. Budagavi and G.J. Sullivan (Editors), “High Efficiency Video Coding (HEVC): Algorithms and Architectures”, Springer, 2014.
G. J. Sullivan et al, “Overview of the High Efficiency Video Coding (HEVC) Standard”, IEEE Trans. on Circuits and Systems for Video Technology, Vol. 22, No. 12, pp , Dec
G. J. Sullivan et al, “Standardized Extensions of High Efficiency Video Coding (HEVC)”, IEEE Journal of selected topics in Signal Processing, vol. 7, pp , Dec
K.R. Rao, D.N. Kim and J.J. Hwang, “Video Coding Standards: AVS China, H.264/MPEG-4 Part 10, HEVC, VP6, DIRAC and VC-1”, Springer, 2014.
D. Grois, B. Bross and D. Marpe, “HEVC/H.265 Video Coding Standard (Version 2) including the Range Extensions, Scalable Extensions, and Multiview Extensions,” (Tutorial, Sunday 27 Sept 2015, 9:00 am to 12:30 pm), IEEE ICIP, Quebec City, Canada, 27 – 30 Sept
Generic quadtree based approach for block partitioning
The tutorial below is for personal use only [Password: a2FazmgNK ]
Please find the links to YouTube videos on the tutorial - HEVC/H.265 Video Coding Standard including the Range Extensions Scalable Extensions and Multiview Extensions below:
HEVC tutorial by I.E.G. Richardson:
“Special issue on HEVC extensions and efficient HEVC implementations”, IEEE Trans. on Circuits and Systems for Video Technology, Vol. 26, pp , Jan
K.R. Rao and J.J. Hwang, “Techniques and standards for image/video/audio coding”, Prentice Hall, 1996.

51 Contd…
Video lectures from IITs and IISC: http://nptel.iitm.ac.in/
Image and video processing courses at UT Arlington (EE 5351, EE 5355, EE 5356 and EE 5359):
HEVC chapter 1:
Online course on fundamentals of digital image and video processing from Coursera:
Access to HM 16.0 Software Manual:
Test Sequences: ftp://ftp.kw.bbc.co.uk/hevc/hm-11.0-anchors/bitstreams/
HEVC white paper-Ittiam Systems:
HEVC white paper-Elemental Technologies:
Access to HM 16.0 Reference Software:
Han W-J, et al. (2010), “Improved video compression efficiency through flexible unit representation and corresponding extension of coding tools”, IEEE Trans. on Circuits and Systems for Video Technology, Vol. 20, no.12, pp , Dec
Norkin A (2012) Non-CE1: non-normative improvement to deblocking filtering, Joint Collaborative Team on Video Coding (JCT-VC), Document JCTVC-K0289, Shanghai, Oct. 2012
Norkin A, Andersson K, Fuldseth A, Bjøntegaard G (2012) HEVC deblocking filtering and decisions. In: Proc. SPIE. 8499, Applications of Digital Image Processing XXXV, no , Oct. 2012
Norkin A, Andersson K, Kulyk V (2013) “Two HEVC encoder methods for block artifact reduction”. In: Proceedings of the IEEE international conference on visual communications and image processing (VCIP) 2013, Kuching, Sarawak, pp. 1–6, Nov. 2013
Norkin A, Andersson K, Sjöberg R (2013) AHG6: on deblocking filter and parameters signaling, Joint Collaborative Team on Video Coding (JCT-VC), Document JCTVC-L0232, Geneva, Jan. 2013
Information on GPU accelerated computing:
X. Sun et al, “Accelerating IEEE 1857 Deblocking Filter on GPU using CUDA”, IEEE International Conference on Multimedia Big Data, pp , Apr

52 Contd…
53. Course on parallel computing:
Course on heterogeneous parallel programming
A.M. Kotra et al, “Comparison of different parallel implementations for deblocking filter of HEVC”, IEEE International conference on Acoustics, speech and signal processing, pp , Mar
Test sequence:
NVIDIA Quadro 4000 information guide:
NVCC compiler and CUDA tool kit guide:
D. F. de Souza, et al, “Cooperative CPU+GPU Deblocking filter parallelization for high performance HEVC CODECS”, IEEE International conference on Acoustics, speech and signal processing, pp , May
Instruction guides for running CUDA locally on machine: Windows:
Quadro 4000 technical specifications:

53 Thank you

