1 Hierarchical Parallelization of an H.264/AVC Video Encoder. A. Rodriguez, A. Gonzalez, and M.P. Malumbres, IEEE PARELEC 2006.


2 Outline: Introduction; Performance Analysis; Hierarchical H.264 Parallel Encoder; Experimental Results; Conclusions.

3 Introduction: Background Knowledge (1/5). Video Communication.

4 Introduction: Background Knowledge (2/5). H.264/AVC removes redundant and perceptually less sensitive information; reaching the limits of compression efficiency requires intensive computation. Applications: video on demand, video conferencing, live broadcasting, etc.

5 Introduction: Background Knowledge (3/5). The H.264/AVC encoder has high CPU demand but must deliver low latency and real-time response, so it calls for platforms with supercomputing capabilities: clusters, multiprocessors, or special-purpose devices.

6 Introduction: Background Knowledge (4/5). Cluster: a group of linked computers that improves performance and/or availability over a single computer. Categories: high-availability clusters, load-balancing clusters, and high-performance clusters.

7 Introduction: Background Knowledge (5/5). Message-passing parallelism: message-passing runtimes and libraries such as MPI. Multithread parallelism: OpenMP. Optimized libraries for SIMD extensions and the graphics processing unit: Intel IPP, AMD ACML, etc.

8 Introduction: Main Purpose (1/6). Apply parallel processing to H.264 encoders in order to cope with their computational intensity at a given video quality, bit rate, image resolution, frame rate, and latency.

9 Introduction: Main Purpose (2/6). Hierarchical parallelization of the H.264 encoder: a two-level MPI message-passing parallelization at the GOP level and the slice level.

10 Introduction: Main Purpose (3/6). GOP-level parallelism: good speed-up, but high latency. [Figure: the video sequence split into a series of GOPs encoded in parallel.]

11 Introduction: Main Purpose (4/6). Example of latency: 1 GOP = 10 frames, frame rate = 30 frames/sec, time for encoding 1 GOP = 3 seconds. We have to encode 9 GOPs in parallel in order to achieve real-time response; latency = 3 seconds.

12 Introduction: Main Purpose (5/6). Slice-level parallelism: low latency, but lower coding efficiency.

13 Introduction: Main Purpose (6/6). Combining both approaches gives both speed-up and efficiency.

14 Performance Analysis: Overview (1/2). Prior work: “Performance evaluation of parallel MPEG-4 video coding algorithms on clusters of workstations” and “A Parallel implementation of H.26L video encoder”. The proposed approach combines them to obtain scalability and low latency.

15 Performance Analysis: Overview (2/2). [Figure: processing flow of the video sequence as a series of GOPs.] Goals: increase throughput and reduce latency.

16 Performance Analysis: Equation definition. Little's law: N = X * R, where N is the number of GOPs processed in parallel, X is the number of GOPs encoded per second, and R is the elapsed time between the moment a GOP enters the system and the moment it is completely encoded.
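
Applied to the figures quoted on slide 11: a 10-frame GOP at 30 frames/sec spans 1/3 s of video, so real-time operation needs X = 3 GOPs/s, and with R = 3 s per GOP:

```latex
N = X \cdot R = 3\ \text{GOPs/s} \times 3\ \text{s} = 9\ \text{GOPs encoded in parallel}
```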

17 Performance Analysis: Analysis (1/2). If we have n_p nodes in the cluster and every GOP is decomposed into n_s slices, then N = n_p / n_s and R = R_SEQ / (n_s * E_s), where R_SEQ is the sequential encoding time of a GOP and E_s is the parallel efficiency at the slice level.

18 Performance Analysis: Analysis (2/2). GOP throughput of the combined parallel encoder: if E_s is significantly less than 1, throughput is affected negatively.
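
The throughput expression itself appeared only as a figure on the original slide; it follows by algebra from the relations on slides 16 and 17:

```latex
X = \frac{N}{R}
  = \frac{n_p / n_s}{R_{SEQ} / (n_s E_s)}
  = \frac{n_p \, E_s}{R_{SEQ}}
```

so any drop in the slice-level efficiency E_s translates directly into lost GOP throughput.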

19 Performance Analysis: Example (1/4). Video sequence in HDTV format at 1280*720, frame rate = 60 frames/sec. We suppose that the H.264 sequential encoder encodes one GOP (15 frames) in 5 seconds, and that only one slice per frame is defined.

20 Performance Analysis: Example (2/4). To get real-time response, X has to equal 60 frames/sec, i.e. 4 GOPs/sec → n_p = 4 * 5 = 20 nodes.

21 Performance Analysis: Example (3/4). Combined with slice-level parallelism: maximum allowed latency = 1 sec, slice parallelism efficiency = 0.8.

22 Performance Analysis: Example (4/4). We set n_s to 7 and N to 4, so the number of required nodes is adjusted to 28, which satisfies both the throughput and the latency requirements.
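
A quick check of these figures using the relations from slides 16-18 (the exact throughput and latency values were only shown graphically on the original slide):

```latex
R   = \frac{R_{SEQ}}{n_s E_s} = \frac{5}{7 \times 0.8} \approx 0.89\ \text{s} \le 1\ \text{s}, \qquad
n_p = N \cdot n_s = 4 \times 7 = 28, \qquad
X   = \frac{N}{R} \approx 4.5\ \text{GOPs/s} \ge 4\ \text{GOPs/s}
```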

23 Performance Analysis: Efficiency Estimation (1/5). Why do we have to estimate E_s? Because it affects both throughput and latency. How do we estimate E_s? Using the PAMELA (PerformAnce ModEling LAnguage) model.

24 Performance Analysis: Efficiency Estimation (2/5). The DPB (Decoded Picture Buffer) must be updated in every node, using MPI_Allgather. In this PAMELA model, MPI_Allgather is assumed to be implemented using a binary tree.
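
A minimal sketch of how such an exchange might look with the C MPI bindings, assuming each of the n_s slice processes holds a fixed-size buffer of reconstructed samples for its slice (buffer sizes, names, and the surrounding encoder state are illustrative, not taken from the paper):

```c
#include <mpi.h>

/* Illustrative size only: the luma samples of one of 8 horizontal slices
 * of a 1280x720 frame. Real slice sizes vary, and chroma is omitted. */
#define SLICE_BYTES (1280 * 720 / 8)

/* After each process encodes and reconstructs its own slice, every process
 * needs the whole reconstructed frame in its DPB before motion estimation
 * of the next frame can start. MPI_Allgather performs that collective
 * update; the slide's PAMELA model assumes a binary-tree implementation. */
void update_dpb(MPI_Comm slice_comm,
                unsigned char *my_recon_slice, /* local reconstructed slice    */
                unsigned char *dpb_frame)      /* receives all slices in order */
{
    MPI_Allgather(my_recon_slice, SLICE_BYTES, MPI_UNSIGNED_CHAR,
                  dpb_frame,      SLICE_BYTES, MPI_UNSIGNED_CHAR,
                  slice_comm);
}
```

Since real slices rarely have identical sizes, MPI_Allgatherv with per-rank counts would be the more realistic call; the fixed-size version just keeps the sketch short.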

25 Performance Analysis: Efficiency Estimation (3/5). The PAMELA model for parallel encoding of one frame is:

L = par (p = 1…n_s) delay(t_s); delay(t_w);
    seq (i = 0…log2(n_s)-1) par (j = 1…n_s) delay(t_L + t_c * 2^i)

where n_s is the number of slices processed in parallel, t_s the mean slice encoding time, t_w the mean wait time due to variations in t_s and global synchronization, t_L the start-up time, and t_c the transmission time of one encoded slice.

26 Performance Analysis: Efficiency Estimation (4/5). The parallel time obtained by solving this model is T(L) = t_s + t_w + t_AG, where t_AG = log2(n_s) * t_L + (n_s - 1) * t_c. Efficiency can then be computed from T(L).
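
The efficiency expression itself was only shown graphically on the original slide; a plausible reconstruction, assuming the sequential time for one frame is approximately n_s * t_s (the n_s slices encoded one after another on a single node), would be:

```latex
E_s = \frac{T_{seq}}{n_s \, T(L)}
    \approx \frac{n_s \, t_s}{n_s \,\left(t_s + t_w + t_{AG}\right)}
    = \frac{t_s}{t_s + t_w + t_{AG}}
```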

27 Performance Analysis: Efficiency Estimation (5/5). [Table: experimental estimations of the parameter values t_L, t_c, t_s, and t_w, plus the resulting t_AG and the estimated efficiency for a slice-based parallel encoder.]
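
Given measured values of these parameters, the model is cheap to evaluate. A small helper using the formulas from the previous slide and the efficiency expression reconstructed above; the parameter values below are purely illustrative, since the paper's measurements were only shown in the table:

```c
#include <math.h>
#include <stdio.h>

/* Evaluate the PAMELA-derived frame time and slice-level efficiency:
 *   t_AG = log2(n_s) * t_L + (n_s - 1) * t_c
 *   T(L) = t_s + t_w + t_AG
 *   E_s  = t_s / T(L)   (assuming sequential frame time ~ n_s * t_s)   */
static double slice_efficiency(int n_s, double t_s, double t_w,
                               double t_L, double t_c)
{
    double t_ag = log2((double)n_s) * t_L + (n_s - 1) * t_c;
    double T_L  = t_s + t_w + t_ag;
    return t_s / T_L;
}

int main(void)
{
    /* Illustrative timings only (seconds), not the paper's measurements. */
    for (int n_s = 2; n_s <= 16; n_s *= 2)
        printf("n_s = %2d  E_s = %.2f\n",
               n_s, slice_efficiency(n_s, 0.30, 0.03, 0.001, 0.002));
    return 0;
}
```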

28 Performance Analysis: Slice Parallelism Scalability (1/4). The feasible number of slices depends on the video resolution. [Figure: bit rate increment (%) vs. number of MBs per slice.]

29 Performance Analysis: Slice Parallelism Scalability (2/4). [Figure: bit rate overhead vs. number of slices per frame.]

30 Performance Analysis: Slice Parallelism Scalability (3/4). [Figure: PSNR loss vs. number of slices per frame.]

31 Performance Analysis: Slice Parallelism Scalability (4/4). [Figure: encoding time vs. number of slices per frame.]

32 Hierarchical Parallel Encoder: Overview. In order to achieve scalability and low latency, GOP- and slice-level parallelism are combined. At the first level, the sequence is divided into GOPs (15 frames each); every GOP is assigned to a processor group inside the cluster, and each group encodes its GOP independently (a sketch of how such groups might be formed follows).
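
A minimal sketch of how per-GOP processor groups could be set up with MPI_Comm_split; the group size, rank layout, and variable names are illustrative assumptions rather than details from the paper (in particular, the paper's global manager process is ignored here):

```c
#include <mpi.h>

int main(int argc, char **argv)
{
    int world_rank;
    int n_s = 4;             /* assumed number of slice processes per GOP group */
    MPI_Comm gop_comm;       /* communicator local to one GOP group             */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Ranks 0..n_s-1 form group 0, ranks n_s..2*n_s-1 form group 1, and so on.
     * Each group encodes one GOP at a time, slice-parallel inside gop_comm.  */
    int color = world_rank / n_s;
    MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &gop_comm);

    /* gop_comm carries the slice-level collectives (e.g., the DPB update),
     * while MPI_COMM_WORLD carries the GOP-assignment messages.             */

    MPI_Comm_free(&gop_comm);
    MPI_Finalize();
    return 0;
}
```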

33 Hierarchical Parallel Encoder: GOP assignment method. Each group has a local manager that communicates with a global manager; the global manager informs the assignment by sending a message with the GOP number to the requesting local manager. The scheme is simple and load-balanced. A sketch of this request/assign exchange follows.
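
A sketch of what this exchange could look like in MPI, assuming the global manager runs as world rank 0; the message tags, the termination sentinel, and the encode_gop callback are illustrative assumptions rather than the paper's actual protocol:

```c
#include <mpi.h>

#define TAG_REQUEST 1   /* local manager -> global manager: "give me work" */
#define TAG_ASSIGN  2   /* global manager -> local manager: GOP number     */
#define DONE       -1   /* sentinel GOP number meaning "no more GOPs"      */

/* Global manager: hand out GOP numbers on demand until all are assigned. */
void global_manager(int num_gops, int num_groups)
{
    int next_gop = 0, finished = 0, dummy;
    MPI_Status st;

    while (finished < num_groups) {
        MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, TAG_REQUEST,
                 MPI_COMM_WORLD, &st);
        int gop = (next_gop < num_gops) ? next_gop++ : DONE;
        if (gop == DONE) finished++;
        MPI_Send(&gop, 1, MPI_INT, st.MPI_SOURCE, TAG_ASSIGN, MPI_COMM_WORLD);
    }
}

/* Local manager of one GOP group: request GOPs until told to stop. */
void local_manager(void (*encode_gop)(int))
{
    int gop, dummy = 0;

    for (;;) {
        MPI_Send(&dummy, 1, MPI_INT, 0, TAG_REQUEST, MPI_COMM_WORLD);
        MPI_Recv(&gop, 1, MPI_INT, 0, TAG_ASSIGN,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        if (gop == DONE) break;
        encode_gop(gop);  /* encode this GOP with the group's slice processes */
    }
}
```

Because each local manager asks for the next GOP only when its group becomes idle, faster groups naturally take more GOPs, which is the load-balancing property the slide refers to.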

34 Hierarchical Parallel Encoder: Framework. [Figure: hierarchical H.264 parallel encoder, with a global manager coordinating several processor groups, each containing processes P0, P1, and P2.]

35 Experimental Results: Environments (1/2). Mozart: 4 biprocessor nodes with AMD Opteron 246 at 2 GHz, interconnected by switched Gigabit Ethernet. Aldebaran: SGI Altix 3700 with 44 Itanium II nodes, interconnected by a high-performance proprietary network.

36 Experimental Results: Environments (2/2). 720 * 480 standard sequence Ayersroc, composed of 16 GOPs.

Configuration  Cluster    #Groups  #Slices
01_Gr_08S1     Mozart     1        8
02_Gr_04S1     Mozart     2        4
04_Gr_02S1     Mozart     4        2
08_Gr_01S1     Mozart     8        1
01_Gr_16S1     Aldebaran  1        16
02_Gr_08S1     Aldebaran  2        8
04_Gr_04S1     Aldebaran  4        4
08_Gr_02S1     Aldebaran  8        2
16_Gr_01S1     Aldebaran  16       1

37 Experimental Results: System Speedup (1/2). [Figure: speed-up in Mozart.]

38 Experimental Results: System Speedup (2/2). [Figure: speed-up in Aldebaran.]

39 Experimental Results: Encoding Latency. [Figure: mean GOP encoding time.]

40 Conclusions. A hierarchical parallel video encoder based on H.264/AVC was proposed. The experimental results confirm the previous analysis, showing that a scalable, low-latency H.264 encoder can be obtained. Some issues remain open, as mentioned in the previous section.

41 References
[1] J.C. Fernández and M.P. Malumbres, “A Parallel implementation of H.26L video encoder”, in Proc. of the Euro-Par 2002 Conference (LNCS 2400), pp. 830-833, Paderborn.
[2] A. Rodriguez, A. González and M.P. Malumbres, “Performance evaluation of parallel MPEG-4 video coding algorithms on clusters of workstations”, IEEE Int. Conference on Parallel Computing in Electrical Engineering (PARELEC), pp. 354-357, Dresden.
[3] A.J.C. van Gemund, “Symbolic Performance Modeling of Parallel Systems”, IEEE Transactions on Parallel and Distributed Systems, vol. 14, no. 2, Feb. 2003.
[4] P.S. Pacheco, Parallel Programming with MPI, Morgan Kaufmann Publishers, Inc.