1 Slice-Balancing H.264 Video Encoding for Improved Scalability of Multicore Decoding Michael Roitzsch Technische Universität Dresden ACM & IEEE international.

Slides:



Advertisements
Similar presentations
Parallel Scalability and Efficiency of HEVC Parallelization Approaches
Advertisements

Introduction to H.264 / AVC Video Coding Standard Multimedia Systems Sharif University of Technology November 2008.
Parallel H.264 Decoding on an Embedded Multicore Processor
VLIW Very Large Instruction Word. Introduction Very Long Instruction Word is a concept for processing technology that dates back to the early 1980s. The.
Scalable Multi-Cache Simulation Using GPUs Michael Moeng Sangyeun Cho Rami Melhem University of Pittsburgh.
INTERNATIONAL CONFERENCE ON TELECOMMUNICATIONS, ICT '09. TAREK OUNI WALID AYEDI MOHAMED ABID NATIONAL ENGINEERING SCHOOL OF SFAX New Low Complexity.
Accurately Approximating Superscalar Processor Performance from Traces Kiyeon Lee, Shayne Evans, and Sangyeun Cho Dept. of Computer Science University.
Development of Parallel Simulator for Wireless WCDMA Network Hong Zhang Communication lab of HUT.
PARALLEL PROCESSING COMPARATIVE STUDY 1. CONTEXT How to finish a work in short time???? Solution To use quicker worker. Inconvenient: The speed of worker.
VIPER DSPS 1998 Slide 1 A DSP Solution to Error Concealment in Digital Video Eduardo Asbun and Edward J. Delp Video and Image Processing Laboratory (VIPER)
Software Architecture of High Efficiency Video Coding for Many-Core Systems with Power- Efficient Workload Balancing Muhammad Usman Karim Khan, Muhammad.
H.264/AVC Baseline Profile Decoder Complexity Analysis Michael Horowitz, Anthony Joch, Faouzi Kossentini, and Antti Hallapuro IEEE TRANSACTIONS ON CIRCUITS.
1 Adaptive slice-level parallelism for H.264/AVC encoding using pre macroblock mode selection Bongsoo Jung, Byeungwoo Jeon Journal of Visual Communication.
2009/04/07 Yun-Yang Ma.  Overview  What is CUDA ◦ Architecture ◦ Programming Model ◦ Memory Model  H.264 Motion Estimation on CUDA ◦ Method ◦ Experimental.
Evaluation of Data-Parallel Splitting Approaches for H.264 Decoding
Sliding-Window Digital Fountain Codes for Streaming of Multimedia Contents Matta C.O. Bogino, Pasquale Cataldi, Marco Grangetto, Enrico Magli, Gabriella.
Michael A. Baker, Pravin Dalale, Karam S. Chatha, Sarma B. K. Vrudhula
1 Efficient Multithreading Implementation of H.264 Encoder on Intel Hyper- Threading Architectures Steven Ge, Xinmin Tian, and Yen-Kuang Chen IEEE Pacific-Rim.
Multiscalar processors
Scalable Rate Control for MPEG-4 Video Hung-Ju Lee, Member, IEEE, Tihao Chiang, Senior Member, IEEE, and Ya-Qin Zhang, Fellow, IEEE IEEE TRANSACTIONS ON.
Reducing Cache Misses 5.1 Introduction 5.2 The ABCs of Caches 5.3 Reducing Cache Misses 5.4 Reducing Cache Miss Penalty 5.5 Reducing Hit Time 5.6 Main.
Statistical Multiplexer of VBR video streams By Ofer Hadar Statistical Multiplexer of VBR video streams By Ofer Hadar.
Multi-core processors. History In the early 1970’s the first Microprocessor was developed by Intel. It was a 4 bit machine that was named the 4004 The.
EEL 6935 Embedded Systems Long Presentation 2 Group Member: Qin Chen, Xiang Mao 4/2/20101.
An Introduction to H.264/AVC and 3D Video Coding.
1. 1. Problem Statement 2. Overview of H.264/AVC Scalable Extension I. Temporal Scalability II. Spatial Scalability III. Complexity Reduction 3. Previous.
CPU Performance Assessment As-Bahiya Abu-Samra *Moore’s Law *Clock Speed *Instruction Execution Rate - MIPS - MFLOPS *SPEC Speed Metric *Amdahl’s.
 Coding efficiency/Compression ratio:  The loss of information or distortion measure:
Video Coding. Introduction Video Coding The objective of video coding is to compress moving images. The MPEG (Moving Picture Experts Group) and H.26X.
Audio Compression Usha Sree CMSC 691M 10/12/04. Motivation Efficient Storage Streaming Interactive Multimedia Applications.
Performance Evaluation of Parallel Processing. Why Performance?
MS Thesis Defense “IMPROVING GPU PERFORMANCE BY REGROUPING CPU-MEMORY DATA” by Deepthi Gummadi CoE EECS Department April 21, 2014.
Parallel Processing - introduction  Traditionally, the computer has been viewed as a sequential machine. This view of the computer has never been entirely.
Image Processing and Computer Vision: 91. Image and Video Coding Compressing data to a smaller volume without losing (too much) information.
Adaptive Multi-path Prediction for Error Resilient H.264 Coding Xiaosong Zhou, C.-C. Jay Kuo University of Southern California Multimedia Signal Processing.
April 26, CSE8380 Parallel and Distributed Processing Presentation Hong Yue Department of Computer Science & Engineering Southern Methodist University.
Codec structuretMyn1 Codec structure In an MPEG system, the DCT and motion- compensated interframe prediction are combined. The coder subtracts the motion-compensated.
8. 1 MPEG MPEG is Moving Picture Experts Group On 1992 MPEG-1 was the standard, but was replaced only a year after by MPEG-2. Nowadays, MPEG-2 is gradually.
Parallelization and Characterization of Pattern Matching using GPUs Author: Giorgos Vasiliadis 、 Michalis Polychronakis 、 Sotiris Ioannidis Publisher:
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 1 Basic Parallel Programming Concepts Computational.
1 A high-level simulator for the H.264/AVC decoding process in multi-core systems Florian H. Seitner, Ralf M. Schreier, Michael Bleyer, Margrit Gelautz.
Adaptive Multi-Threading for Dynamic Workloads in Embedded Multiprocessors 林鼎原 Department of Electrical Engineering National Cheng Kung University Tainan,
Processor Architecture
UNDER THE GUIDANCE DR. K. R. RAO SUBMITTED BY SHAHEER AHMED ID : Encoding H.264 by Thread Level Parallelism.
Copyright © Curt Hill Parallelism in Processors Several Approaches.
1 Modular Refinement of H.264 Kermin Fleming. 2 What is H.264? Mobile Devices Low bit-rate Video Decoder –Follow on to MPEG-2 and H.26x Operates on pixel.
The World Leader in High Performance Signal Processing Solutions Multi-core programming frameworks for embedded systems Kaushal Sanghai and Rick Gentile.
A Flexible Interleaved Memory Design for Generalized Low Conflict Memory Access Laurence S.Kaplan BBN Advanced Computers Inc. Cambridge,MA Distributed.
Computer Organization CS224 Fall 2012 Lesson 52. Introduction  Goal: connecting multiple computers to get higher performance l Multiprocessors l Scalability,
1 Yu Liu 1, Feng Wu 2 and King Ngi Ngan 1 1 Department of Electronic Engineering, The Chinese University of Hong Kong 2 Microsoft Research Asia, Beijing,
Parallel Computing Presented by Justin Reschke
Hierarchical Systolic Array Design for Full-Search Block Matching Motion Estimation Noam Gur Arie,August 2005.
1 Hierarchical Parallelization of an H.264/AVC Video Encoder A. Rodriguez, A. Gonzalez, and M.P. Malumbres IEEE PARELEC 2006.
Page 1 2P13 Week 1. Page 2 Page 3 Page 4 Page 5.
CPU (Central Processing Unit). The CPU is the brain of the computer. Sometimes referred to simply as the processor or central processor, the CPU is where.
1 Munther Abualkibash University of Bridgeport, CT.
Computer Orgnization Rabie A. Ramadan Lecture 9. Cache Mapping Schemes.
Introduction to H.264 / AVC Video Coding Standard Multimedia Systems Sharif University of Technology November 2008.
Computer Architecture Chapter (14): Processor Structure and Function
Automatic Video Shot Detection from MPEG Bit Stream
Parallel Processing - introduction
Multi-core processors
Fault-Tolerant NoC-based Manycore system: Reconfiguration & Scheduling
Steven Ge, Xinmin Tian, and Yen-Kuang Chen
Multi-core processors
Video Compression - MPEG
COMP60621 Fundamentals of Parallel and Distributed Systems
Bongsoo Jung, Byeungwoo Jeon
COMP60611 Fundamentals of Parallel and Distributed Systems
Presentation transcript:

1 Slice-Balancing H.264 Video Encoding for Improved Scalability of Multicore Decoding Michael Roitzsch Technische Universität Dresden ACM & IEEE international conference on Embedded software(EMSOFT 2007)

2 Outlines Introduction Parallelizing H.264 decoding Applying decoding time prediction Evaluation Conclusion

3 Introduction The advent of multicore processor technology. The reduced power consumption caused by distributing computations across multiple slower- clock cores. We will present a technique that increases the scalability of H.264 video decoding by modifying only the encoder stage. The key idea is to equalize the potentially differing decoding times of one frame’s slices by applying decoding time prediction at the encoder stage. This paper also contributes a way to accurately predict H.264 decoding times with average relative errors down to 1 %.

4 Parallelizing H.264 decoding Slices are the most promising candidates for independent decoding by multiple cores: (1) Individual frames have complex interdependencies due to the very flexible usage of reference pictures in H.264. (2) Other than frames, slices are the only syntactical bitstream element, whose boundaries can be found without decompressing the entropy coding layer of the video stream. (3) H.264 uses spatial prediction, which extrapolates already decoded parts of the final picture into yet to be decoded areas to predict their appearance. (4) For global picture coding parameters (e.g., video resolution), which must be known before a slice can be decoded, the standard ensures that they do not change between different slices of the same frame. (5) H.264 also uses a mandatory deblocking filter. This filter can operate across slice boundaries, which would defer the deblocking to the end of the decoding process of each frame, outside the slice context. (6) Decoders usually organize the final picture and any temporary per- macroblock data storage maps as two-dimensional arrays in memory.

5 Parallelizing H.264 decoding We parallelized the open-source H.264 decoder from the FFmpeg project [8] to decode multiple slices simultaneously in concurrent POSIX threads. Each thread decodes a single slice. We use the x264 encoder [17] to encode an ensemble of H.264 test sequences. Every one of the uncompressed source sequences was encoded with 1, 2, 4, 8, 16, 32, 64, 128, 256,512, and 1024 slices per frame, keeping the quality constant. All results presented in this paper have been obtained on a 2GHz Intel Core Duo machine.

6

7 Scalability concerns In the uniprocessor case, a frame is complete, when all slices of that frame are fully decoded. In the multiprocessor case, each frame’s decoding is finished after the slice with the longest execution time is fully decoded. For each encoded video, the speedup can be calculated by dividing the time required on a uniprocessor by the time required on a multiprocessor.

8

9 Scalability concerns One of the goals of multicore computing is to reduce the clock speed of the individual cores to reduce power consumption. The target clock speed of the system must be designed for the peak load, which is the frame that takes the longest time to decode.

10

11 Scalability concerns Parallel efficiency suffers because of sequential portions of the code that cannot be parallelized or because of synchronization overhead or idle time. The frame is not fully decoded until the last of its slices is finished. Inter-frame dependencies Uniform slices Because the time it takes to decode a slice largely depends on the coding features that are used, which are chosen by the encoder according to properties of the frame’s content like speed, direction and diversity of motion in the scene

12 Scalability concerns Balanced Slices To reduce waiting times by encoding the slices for balanced decoding time: The slice boundaries are placed so that the decoding times of all slices of that frame are equal. Slice boundaries in adjacent frames will generally not be at the same position H.264 standard allows different slice boundaries for each frame without any penalty

13 Applying decoding time prediction Balancing the slices according to their decoding time is possible with a feedback process: The encoding is done in a first pass with uniform slices, then information about the resulting decoding times of the slices is fed back into the encoder. so it can iteratively change the slice boundaries to approach equal decoding times Balanced slice The video is first encoded traditionally, resulting in uniform slices. For each frame of the resulting video, decoding time prediction is applied to each macroblock. The total decoding time t of a frame is the sum of its per-macroblock decoding times. If that frame should be divided into n balanced slices, each slice has to contain so many macroblocks that their cumulative decoding time is as close to t/n as possible. This idea is easily implemented by iterating over all macroblocks of one frame in raster-scan order and accounting their decoding time. We propose to use decoding time prediction to determine the decoding times.

14 37% 29% 23%

15 H.264 Decoder Model

16 Parallel H.264 Decoding(yyma’s) The H.264 Decoder The H.264 decoding process Stream Parsing Entropy Decoder Inverse Quantization Inverse DCT Spatial Prediction Motion Compensation Reference Frames Deblocking + Encoded Bitstream Parser Reconstructor Data-Parallel Processing

17

18 Decoding Time Prediction To balance the slices of one frame for equalized decoding times, we have to pass decoding time information to the encoder. The encoder can then use the training data obtained on the target hardware to balance the slices’ decoding time in the resulting H.264 video. The encoding uses no time measurements, but decoding time prediction only. Decoding time prediction is trained on separate hardware. The prediction can be applied on the macroblock level.

19 With average relative errors between -4.54% and %, the frame-level prediction is very accurate. Frame-level Evaluation

20 The prediction does not only work in average, but closely follows decoding time fluctuations of individual frames.

21 It is most likely due to the noisier behavior on the macroblock-level caused by effects like cache misses. With average relative errors for macroblock-level prediction as low as 0.86 %, the results are promising. Macroblock-level

22 Parkrun sequence Using decoding time prediction, we reencoded a balanced 2-slices version of the Parkrun sequence. Figure 10 visualizes slice boundaries and per-slice decoding times before and after balancing. The slice boundaries move between subsequent frames, resulting in more equalized decoding times.

23 Figure 11: Speedup of parallel decoding with balanced slices. The plots show practically achieved speedup with uniform slices and balanced slices as well as the hypothetical speedup with perfectly balanced slices.

24 Figure 12: Clock speed envelope of parallel decoding with balanced slices. Scalability improvements offer the potential of reducing the clock speed of the individual cores. Because the cores must still be fast enough to decode the frame with the longest decoding time, the 95% quantile of the decoding times is an interesting indicator.

25 Conclusion We presented a new technique to improve parallel efficiency of multithreaded H.264 decoding. By using slices balanced for decoding time, this method can achieve improvements in terms of scalability or clock speed reduction.

26 References [1] BBC Motion Gallery Reel. [2] High-Definition Test Sequences. [8] FFmpeg Project. [12] Roitzsch, M. Slice-Balancing H.264 Video Encoding for Improved Scalability of Multicore Decoding. In Proceedings of the 27th IEEE Real-Time Systems Symposium (RTSS 06) (Rio de Janeiro, Brazil, December 2006), IEEE, pp. 77–80. [13] Roitzsch, M., and Pohlack, M. Principles for the Prediction of Video Decoding Times applied to MPEG-1/2 and MPEG-4 Part 2 Video. In Proceedings of the 27th IEEE Real-Time Systems Symposium (RTSS 06) (Rio de Janeiro, Brazil, December 2006), IEEE, pp. 271–280. [17] x264 Project.