Software Architecture of High Efficiency Video Coding for Many-Core Systems with Power-Efficient Workload Balancing


Software Architecture of High Efficiency Video Coding for Many-Core Systems with Power-Efficient Workload Balancing
Muhammad Usman Karim Khan, Muhammad Shafique, Jörg Henkel
Chair for Embedded Systems (CES), Karlsruhe Institute of Technology (KIT), Germany
Design, Automation and Test in Europe Conference and Exhibition (DATE), 2014

Outline
• Introduction
• Analysis of HEVC Encoder
• Proposed method
• Experimental Result
• Conclusion


Introduction (1/4)
• By targeting a 50% bit-rate reduction at the same subjective video quality as H.264, HEVC has become a prime candidate to replace H.264 encoders.
• This gain in compression efficiency comes at a high computational cost, because HEVC adds numerous encoding tools.

Introduction (2/4)
• In real-world video encoding systems, the video must be compressed under tight time-budget and output bit-rate constraints.
• The additional tools and timing constraints make it challenging to implement an HEVC system on a hardware platform.

Introduction (3/4)
• Distributing the HEVC encoder's workload across multiple cores reduces the total encoding time, which may also improve overall energy efficiency.
• The HEVC standard provides parallel encoding tools, such as slices and tiles, that help fulfill these requirements.

Introduction (4/4)
• An HEVC video encoder on a many-core system must exploit the changing workload to reduce total power consumption while meeting quality-of-service demands.
• Power consumption can be reduced dynamically with a workload-driven operating frequency adaptation scheme.


Analysis of HEVC Encoder (1/3)
• Tiles are at the lowest level of the coding hierarchy. They therefore consume the least system memory and are the fastest of the parallelization tools.
• Unlike slices, tiles have no associated headers, so they can provide relatively better output video quality than slices.

Analysis of HEVC Encoder (2/3)

Analysis of HEVC Encoder (3/3)
• A single tile per frame generates the best quality.
• For T² tiles per frame, a T×T tile structure results in the best video quality [24].
• The total number of tiles within a slice must be minimized, and the numbers of tile rows and tile columns within a slice must be equal.

[24] C. Chi et al., “Improving the parallelization efficiency of HEVC decoding,” in ICIP, pp. 213–216, 2012.
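
The equal rows and columns guideline can be illustrated with a small helper that, for a given tile count, picks the rows × columns factorization closest to square. This is only a sketch: the function name and the fallback for non-square counts are illustrative assumptions, not something the slides specify.

```python
import math


def squarest_tile_grid(num_tiles):
    """Return (rows, cols) with rows * cols == num_tiles and cols - rows minimal.

    Hypothetical helper: the slides only state that a T x T layout is best
    for T^2 tiles; the behaviour for other counts is an assumption.
    """
    if num_tiles < 1:
        raise ValueError("num_tiles must be positive")
    best = (1, num_tiles)
    # Every divisor up to the integer square root gives a valid grid;
    # the largest such divisor gives the grid closest to square.
    for rows in range(1, math.isqrt(num_tiles) + 1):
        if num_tiles % rows == 0:
            best = (rows, num_tiles // rows)
    return best
```

For a perfect square the helper returns the T×T layout from the slide; for other counts it falls back to the nearest factor pair.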


Proposed method (1/2)
• Workload Estimation
  ◦ Selects the tile structure and the maximum workload of each core, considering the operating frequency, the total number of cores, and the frames per second.
• Workload Allocator
  ◦ Allocates workload to each core by exploiting the user's tolerance of the output bit-rate.
• Workload Manager
  ◦ Manages the workload by adapting the operating frequency of each core to reduce power consumption.
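
The frequency adaptation idea can be sketched as follows: a core that must finish a given number of cycles within one frame period needs a clock of at least cycles × fps, so the lowest available level above that bound is enough. The function name, the discrete frequency levels, and the numbers in the usage note are hypothetical; the slides do not list the platform's actual levels.

```python
def lowest_feasible_frequency(cycles_per_frame, fps, levels_hz):
    """Pick the lowest clock that completes cycles_per_frame within 1/fps seconds.

    Hypothetical helper: levels_hz is an assumed list of discrete
    frequency levels. Returns None when no level meets the deadline.
    """
    required_hz = cycles_per_frame * fps  # cycles that must execute per second
    feasible = [f for f in sorted(levels_hz) if f >= required_hz]
    return feasible[0] if feasible else None
```

For example, 40 million cycles per frame at 30 frames per second requires 1.2 GHz, so with assumed levels of 1.0, 1.5, and 2.0 GHz the helper selects 1.5 GHz.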

Proposed method (2/2)

Proposed method – Workload Estimation (1/3)
• Tile Formation and Maximum Workload Estimation
  ◦ The number of cores across which the HEVC-Intra application's workload is distributed is determined.
  ◦ The number of Intra directions is adjusted to curtail the computational complexity.
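
A back-of-the-envelope version of the core-count decision, assuming the per-frame workload divides evenly across cores and ignoring parallelization overhead (both simplifications, and not the estimation model used in the presentation):

```python
def min_cores(total_cycles_per_frame, fps, core_hz):
    """Smallest number of cores that can sustain the frame rate.

    Assumes an evenly divisible workload and zero overhead; all
    arguments are integers so the ceiling division is exact.
    """
    cycles_per_second = total_cycles_per_frame * fps
    # Ceiling division without floating point.
    return -(-cycles_per_second // core_hz)
```

With an assumed 250 million cycles per frame, 30 frames per second, and 2 GHz cores, the demand is 7.5 billion cycles per second, which rounds up to four cores.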

Proposed method – Workload Estimation (2/3)
• Workload is given by:

Proposed method – Workload Estimation (3/3)

Proposed method – Workload Allocator (1/6)
• Workload Allocator
  ◦ For workload balancing, an adaptation interval is defined.

Proposed method – Workload Allocator (2/6)
• The starting tile of this interval is always a fully searched tile (θ = θ_init,k) to achieve the best compression.
• The workload of the tile in future frames is gradually reduced (θ ≤ θ_init,k) to lower the workload and power consumption.
• If the total number of compressed bytes for the current NKT rises above a certain threshold, θ is increased, which increases the workload.

Proposed method – Workload Allocator (3/6)
• The threshold is set statistically using the following equation:
  B: total number of compressed bytes.
  μ: the average bit-rate.
  υ: the variance of B.

Proposed method – Workload Allocator (4/6)
• If a certain number of frames have been processed or B exceeds a threshold, adaptation and KT insertion are required.
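
The adaptation interval described on the Workload Allocator slides can be sketched as a simple control loop. This is a deliberately simplified illustration, not the presented algorithm: it tracks one tile per frame, takes the byte threshold as a given parameter instead of deriving it statistically, restarts the interval on a fixed frame count only, and uses a fixed step size. The function name and all parameter names are hypothetical.

```python
def theta_schedule(frame_bytes, theta_init, theta_min, byte_threshold,
                   interval_len, step=1):
    """Return the theta applied to each frame of one tile.

    Assumptions (not from the slides): fixed step, fixed interval length,
    externally supplied byte_threshold.
    """
    schedule = []
    theta = theta_init
    for index, size in enumerate(frame_bytes):
        if index % interval_len == 0:
            theta = theta_init  # each interval starts with a fully searched tile
        schedule.append(theta)
        if size > byte_threshold:
            # Compressed output grew too large: search more directions next time.
            theta = min(theta_init, theta + step)
        else:
            # Output is within tolerance: shrink the search to save cycles.
            theta = max(theta_min, theta - step)
    return schedule
```

With frame sizes `[100, 100, 100, 500, 100, 100]`, `theta_init=8`, `theta_min=4`, `byte_threshold=300`, and `interval_len=5`, θ decays from 8 to 5, rises to 6 after the oversized fourth frame, and resets to 8 when the next interval begins.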

Proposed method – Workload Allocator (5/6)
• For every CTU, if the threshold in equation 4 is satisfied, θ is adjusted as:
  u: a user-defined parameter.

Proposed method – Workload Allocator (6/6)
• We can estimate the total cycles consumed per CTU:

Proposed method – Workload Manager (1/3)
• Workload Manager
  ◦ Adjusting θ increases or decreases the workload.
  ◦ The selected intra prediction mode corresponds to the direction of texture flow.
  ◦ The most probable prediction is determined, and θ is centered on this prediction.
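
One way to picture "θ centered on the prediction" is to treat θ as the number of angular intra directions to test and to take a window of that size around the predicted mode. HEVC defines angular intra modes 2 through 34; treating θ as a window size and clamping the window at the ends of that range are assumptions made for this sketch, and the function name is hypothetical.

```python
FIRST_ANGULAR, LAST_ANGULAR = 2, 34  # HEVC angular intra prediction modes


def candidate_modes(center, theta):
    """Return theta consecutive angular modes centred on `center`.

    Assumption: theta counts angular directions only, and the window is
    shifted (not truncated) when it would run past mode 2 or mode 34.
    """
    theta = max(1, min(theta, LAST_ANGULAR - FIRST_ANGULAR + 1))
    start = center - theta // 2
    start = max(FIRST_ANGULAR, min(start, LAST_ANGULAR - theta + 1))
    return list(range(start, start + theta))
```

For a predicted mode of 26 and θ = 5, the candidates are modes 24 to 28; lowering θ narrows the window and therefore the number of directions evaluated per block.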

Proposed method – Workload Manager (2/3)
• This prediction/direction is obtained by sorting a histogram created from the gradients of each individual pixel [22][26].
• We propose a much simpler solution, similar to the one presented in [27].

[22] W. Jiang, H. Ma, Y. Chen, “Gradient based fast mode decision algorithm for intra prediction in HEVC,” in CECNet, pp. 1836–1840.
[26] M. Shafique, B. Molkenthin, J. Henkel, “An HVS-based Adaptive Computational Complexity Reduction Scheme for H.264/AVC video encoder using Prognostic Early Mode Exclusion,” in DATE, pp. 1713–1718.
[27] M. U. K. Khan, J. M. Borrmann, L. Bauer, M. Shafique, J. Henkel, “An H.264 Quad-FullHD low-latency intra video encoder,” in DATE, pp. 115–120, 2013.
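
For context, the gradient-histogram approach that the slide contrasts against can be sketched roughly as below. This illustrates the general technique only: it is neither the exact method of [22] or [26] nor the simpler solution the authors propose. The central-difference gradient, the eight orientation bins, and the function name are all assumptions, and the mapping from a bin to an HEVC mode is left out.

```python
import math


def dominant_gradient_bin(block, n_bins=8):
    """Return the index of the strongest gradient-orientation bin.

    `block` is a 2D list of pixel values. Orientations are folded into
    [0, pi) and weighted by |gx| + |gy|. Illustrative sketch only.
    """
    hist = [0.0] * n_bins
    for y in range(1, len(block) - 1):
        for x in range(1, len(block[0]) - 1):
            gx = block[y][x + 1] - block[y][x - 1]  # horizontal central difference
            gy = block[y + 1][x] - block[y - 1][x]  # vertical central difference
            magnitude = abs(gx) + abs(gy)
            if magnitude == 0:
                continue  # flat pixel carries no direction information
            angle = math.atan2(gy, gx) % math.pi
            hist[int(angle / math.pi * n_bins) % n_bins] += magnitude
    return max(range(n_bins), key=hist.__getitem__)
```

The texture flow runs perpendicular to the dominant gradient, so a block of vertical stripes produces a horizontal gradient (bin 0) and a block of horizontal stripes produces a vertical gradient (bin 4 of 8).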

Proposed method – Workload Manager (3/3)


Experimental Result (1/4)
• We have developed a C++-based multi-threaded HEVC Intra encoder in our lab.
• In the 1-tile (single-thread) configuration, our software is approximately 13× faster than the HM-9.2 reference software for full-HD (1920×1080) video sequences.
• Hardware platform simulation is performed with the Sniper many-core simulator [30].

[30] T. E. Carlson, W. Heirman, L. Eeckhout, “Sniper: Exploring the level of abstraction for scalable and accurate parallel multi-core simulation,” in SC, pp. 1–12, 2011.

Experimental Result (2/4)

Experimental Result (3/4)

Experimental Result (4/4)


Conclusion
• A novel software architecture for HEVC-Intra encoding with run-time, power-efficient workload balancing on many-core systems is presented.
• The adjusted workload is used to adapt the operating frequency of each core, thereby reducing the power consumption of the many-core system.