Efficient Multithreading Implementation of H.264 Encoder on Intel Hyper-Threading Architectures. Steven Ge, Xinmin Tian, and Yen-Kuang Chen. IEEE Pacific-Rim Conference on Multimedia 2003.

Presentation transcript:

1 Efficient Multithreading Implementation of H.264 Encoder on Intel Hyper-Threading Architectures. Steven Ge, Xinmin Tian, and Yen-Kuang Chen. IEEE Pacific-Rim Conference on Multimedia 2003.

2 Outline: Introduction; Background Knowledge; Main Purpose; Multithreaded Implementation; Performance Results and Analysis; Conclusions.

3 Introduction: Background Knowledge (1/6). H.264: high quality at a low bit rate; hybrid block-based motion compensation (MC) and transform coding model; computationally intensive workload.

4 Introduction: Background Knowledge (2/6). SIMD technology [1]: Single Instruction, Multiple Data; data-level parallelism. [Figure labels: Instruction Pool, Data Pool, PU]

5 Introduction: Background Knowledge (3/6). SIMD example (interpolation): two 128-bit operands hold a0..a7 and b0..b7; the result holds the four sums a0×b0+a1×b1, a2×b2+a3×b3, a4×b4+a5×b5, and a6×b6+a7×b7.
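The pairwise multiply-add on this slide can be sketched in C as follows. This is an illustrative sketch, not the authors' code: it assumes the eight values per operand are signed 16-bit integers (which is what fits eight to a 128-bit operand), and the function name `madd8` is hypothetical. On x86 compilers with SSE2 it uses the `_mm_madd_epi16` intrinsic, which produces all four sums with one instruction; elsewhere it falls back to an equivalent scalar loop.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics */
#endif

/* Hypothetical helper: multiply eight 16-bit pairs and add adjacent
 * products, i.e. out[i] = a[2i]*b[2i] + a[2i+1]*b[2i+1] for i = 0..3. */
void madd8(const int16_t a[8], const int16_t b[8], int32_t out[4])
{
    assert(a && b && out);
#if defined(__SSE2__)
    __m128i va = _mm_loadu_si128((const __m128i *)a);   /* a0..a7 */
    __m128i vb = _mm_loadu_si128((const __m128i *)b);   /* b0..b7 */
    __m128i vr = _mm_madd_epi16(va, vb);                /* four 32-bit sums */
    _mm_storeu_si128((__m128i *)out, vr);
#else
    /* Scalar fallback with the same result. */
    for (int i = 0; i < 4; i++)
        out[i] = (int32_t)a[2 * i] * b[2 * i]
               + (int32_t)a[2 * i + 1] * b[2 * i + 1];
#endif
}
```

Calling `madd8` with a = {1, 2, 3, 4, 5, 6, 7, 8} and b = {1, 1, 1, 1, 2, 2, 2, 2} gives the four sums 3, 7, 22, and 30.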

6 Introduction: Background Knowledge (4/6). Multiprocessor and Hyper-Threading: simultaneous multithreading (SMT); the logical processors share the physical execution resources, and the architectural state is duplicated. [Figure labels: Logical Processor; Sharing execution resources]

7 Introduction: Background Knowledge (5/6). Multiprocessor and Hyper-Threading. [Figure: a traditional dual-CPU system next to a Hyper-Threading technology-capable dual-CPU system; labels: Arch State, Processor Execution Resource]

8 Introduction: Background Knowledge (6/6). OpenMP programming model: high-level application programming interface; supports shared-memory multiprocessing.
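A minimal sketch of the OpenMP model in C: one pragma asks the runtime to split a loop across threads that share the same memory. The workload here, a sum of absolute differences between two byte buffers, is chosen only for illustration and is not taken from the slides; the function name `sad` is hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical example: sum of absolute differences between two
 * buffers of n bytes, computed by a team of OpenMP threads. */
long sad(const unsigned char *cur, const unsigned char *ref, int n)
{
    assert(cur && ref && n >= 0);
    long total = 0;
    /* Each thread sums a chunk of the index range; the reduction clause
     * combines the per-thread partial sums. A compiler without OpenMP
     * support ignores the pragma and runs the loop serially. */
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < n; i++)
        total += abs((int)cur[i] - (int)ref[i]);
    return total;
}
```

With GCC or Clang the pragma takes effect when the file is compiled with `-fopenmp`; the result is the same either way.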

9 Introduction: Main Purpose (1/1). Goal: speed up H.264 encoder performance. Parallelize the encoder with the OpenMP programming model; use a multi-level (frame- and slice-level) data partition scheme; the experiments show the speedups on Intel Xeon(TM) systems.

10 Implementation: Overview (1/1). Divide the H.264 encoding process into multiple threads via data-domain decomposition; the candidate levels are GOP, frame, slice, and MB. Assess the thread granularity of each level and propose an implementation.

11 Implementation: Thread Granularity (1/5). Slice-level parallelism: slices are independent; splitting a frame into slices breaks the dependency between MBs, so the bit rate increases.

12 Implementation: Thread Granularity (2/5). Slice-level parallelism.
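One way to picture the slice-level split is a function that assigns macroblock rows to slices. The sketch below assumes that each slice is a run of whole macroblock rows and that rows are spread as evenly as possible; the slides do not state how the authors cut frames into slices, so both the layout and the names `SliceRange` and `slice_range` are hypothetical.

```c
#include <assert.h>

/* Hypothetical type: a slice covering num_rows macroblock rows,
 * starting at row first_row. */
typedef struct {
    int first_row;
    int num_rows;
} SliceRange;

/* Rows given to slice k when a frame of mb_rows macroblock rows is cut
 * into num_slices slices. When the division is not exact, the earliest
 * slices receive one extra row each. */
SliceRange slice_range(int mb_rows, int num_slices, int k)
{
    assert(num_slices > 0 && num_slices <= mb_rows);
    assert(k >= 0 && k < num_slices);
    int base = mb_rows / num_slices;
    int extra = mb_rows % num_slices;
    SliceRange r;
    r.num_rows = base + (k < extra ? 1 : 0);
    r.first_row = k * base + (k < extra ? k : extra);
    return r;
}
```

For a frame with 18 macroblock rows and 4 slices, the slices start at rows 0, 5, 10, and 14 and hold 5, 5, 4, and 4 rows. Each slice can then be handed to a different thread, at the cost in bit rate noted on the previous slide.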

13 Implementation: Thread Granularity (3/5). Frame-level parallelism: IBBP structure. [Figure: frames 0 (I), 3 (P), 6 (P), 9 (P), and 12 (P) are I/P-frames; frames 1, 2, 4, 5, 7, and 8 are B-frames.]

14 Implementation: Thread Granularity (4/5). Frame-level parallelism: encode the I- and P-frames first; there is no bit-rate increase; dependences among frames limit thread scalability.
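The frame dependences behind this ordering can be sketched in C. The sketch assumes the repeating I B B P pattern shown on the previous slide and assumes that each B-frame refers to the nearest I/P-frame on either side, while each P-frame refers to the previous I/P-frame; that is the conventional arrangement, but the slides do not spell it out. The names `frame_type` and `frame_refs` are hypothetical.

```c
#include <assert.h>

/* Frame type at display position n for an I B B P B B P ... sequence. */
char frame_type(int n)
{
    assert(n >= 0);
    if (n == 0)
        return 'I';
    return (n % 3 == 0) ? 'P' : 'B';
}

/* Store in refs[] the display positions of the frames that frame n
 * depends on (assumed: nearest I/P-frames) and return how many. */
int frame_refs(int n, int refs[2])
{
    switch (frame_type(n)) {
    case 'I':
        return 0;                 /* intra-coded: no references */
    case 'P':
        refs[0] = n - 3;          /* previous I/P-frame */
        return 1;
    default:                      /* 'B' */
        refs[0] = n - n % 3;      /* preceding I/P-frame */
        refs[1] = refs[0] + 3;    /* following P-frame */
        return 2;
    }
}
```

Under these assumptions frames 4 and 5 both depend only on frames 3 and 6, so once the I- and P-frames are encoded the B-frames between them can be encoded concurrently. The chain of P-frames, each waiting on the one before it, is the dependence that limits scalability.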

15 Implementation: Thread Granularity (5/5). Combine the two approaches above: exploit the parallelism among frames; exploit the parallelism among slices; reach the upper limit of the thread number.

16 Implementation: Multithreaded Implementation (1/4). Input preprocessing: read uncompressed images; issue the images to the encoding threads. Encoding: use two slice buffers to distinguish the priority of I/P-frames from that of B-frames. Post-processing: check the encoding status; commit the result to the bit-stream.

17 Implementation: Multithreaded Implementation (2/4). [Figure labels: Thread 0, Thread 1, Thread 2, Thread 3, Thread 4; Input File; Output File; Image buffer; Slice Queue 0 (I/P); Slice Queue 1 (B)]

18 Implementation: Multithreaded Implementation (3/5). Pseudo code:
#pragma omp parallel sections
{
  #pragma omp section   // issue frames and commit results
  {
    while (there is a frame to encode) {
      if (there is a free entry in the image buffer)
        issue a new frame to the image buffer
      else if (there are encoded frames in the image buffer)
        commit the encoded frame and release its entry
      else
        wait;
    }
  }
  // the second section follows on the next slide

19 Implementation: Multithreaded Implementation (4/5). Pseudo code (continued):
  #pragma omp section   // encode slices
  {
    #pragma omp parallel num_threads(number of encoding threads)
    {
      while (1) {
        if (there is a slice in slice queue 0)
          encode one slice   // higher priority for I/P-frames
        else if (there is a slice in slice queue 1)
          encode one slice   // lower priority for B-frames
        else if (all frames are encoded)
          exit;
        else
          wait;
      }
    }
  }
}
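The two-queue priority rule in the pseudo code can be sketched as runnable C with OpenMP. This is a simplified illustration, not the authors' implementation: it assumes both queues are filled before the workers start, so the "wait" branch disappears and a worker simply exits when both queues are empty. The names `SliceQueue`, `push`, `pop_by_priority`, and `drain_queues` are hypothetical, and the actual slice encoding is left as a comment.

```c
#include <assert.h>

#define MAX_SLICES 64

/* Hypothetical fixed-size FIFO of slice identifiers. */
typedef struct {
    int items[MAX_SLICES];
    int head, tail;
} SliceQueue;

void push(SliceQueue *q, int slice_id)
{
    assert(q->tail < MAX_SLICES);
    q->items[q->tail++] = slice_id;
}

/* Take from queue 0 (I/P slices) first, then from queue 1 (B slices).
 * Returns -1 when both queues are empty. */
int pop_by_priority(SliceQueue *q0, SliceQueue *q1)
{
    if (q0->head < q0->tail)
        return q0->items[q0->head++];
    if (q1->head < q1->tail)
        return q1->items[q1->head++];
    return -1;
}

/* Drain both queues with num_workers threads. order[] records the
 * sequence in which slices were dequeued; the return value is the
 * number of slices processed. */
int drain_queues(SliceQueue *q0, SliceQueue *q1, int num_workers, int order[])
{
    int done = 0;
    #pragma omp parallel num_threads(num_workers)
    {
        for (;;) {
            int id;
            /* Only one thread at a time may touch the queues. */
            #pragma omp critical(slice_queue)
            {
                id = pop_by_priority(q0, q1);
                if (id >= 0)
                    order[done++] = id;
            }
            if (id < 0)
                break;   /* both queues empty: this worker exits */
            /* A real encoder would encode slice `id` here, outside
             * the critical section, so that workers run in parallel. */
        }
    }
    return done;
}
```

Because a B slice is dequeued only when queue 0 is empty, the I/P slices that later frames depend on are always served first, whatever the number of worker threads.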

20 Performance Results: Environment (1/5). Dell 530 MT workstation: dual Intel Xeon processors running at 2.0 GHz with HT enabled; 512 KB L2 cache, 1 GB memory. IBM 360 server: quad Intel Xeon processors running at 1.5 GHz with HT enabled; 256 KB L2 cache, 512 KB L3 cache, 2 GB memory.

21 Performance Results: Encoder profile (1/5). All inter search types are enabled; only the nearest previous frame is used for inter motion search; the maximum search range is 16; 1/4-pel motion vector resolution is used; the quantization parameter is set to 16 for all frames.

22 Performance Results: SIMD Technology (1/5). Speedups of the key modules in the H.264 encoder.

23 Performance Results: Speedup & Compression Efficiency. Speedup and bit rate vs. number of slices in a frame.

24 Performance Results: Performance with HT Technology. Encoder speedups on different sequences after multithreading.

25 Performance Results: Performance with HT Technology. Enabling HT yields a 1.2x speedup.

26 Conclusions. The paper presents an efficient multithreaded implementation of an H.264 encoder. It is the first to consider compression-efficiency degradation as well as parallel speedup. Speedups range from 4.31x to 4.69x on a 4-CPU system with HT. The work demonstrates that HT can yield a performance gain of about 20%.

27 Reference
[1] X. Zhou, E. Q. Li, and Y.-K. Chen, "Implementation of H.264 Decoder on General-Purpose Processors with Media Instructions," in Proc. of SPIE Conf. on Image and Video Communications and Processing, Jan
[2] Y.-K. Chen, M. Holliman, E. Debes, S. Zheltov, A. Knyazev, S. Bratanov, R. Belenov, and I. Santos, "Media Applications on Hyper-Threading Technology," Intel Technology Journal, pp , Feb
[3] D. Marr, F. Binns, D. L. Hill, G. Hinton, D. A. Koufaty, I. A. Miller, and M. Upton, "Hyper-Threading Technology Microarchitecture and Architecture," Intel Technology Journal, Vol. 6, Q1, 2002.