Steven Ge, Xinmin Tian, and Yen-Kuang Chen

Efficient Multithreading Implementation of H.264 Encoder on Intel Hyper-Threading Architectures
Steven Ge, Xinmin Tian, and Yen-Kuang Chen
IEEE Pacific-Rim Conference on Multimedia 2003

Outline
Introduction: Background Knowledge, Main Purpose
Multithreaded Implementation
Performance Results and Analysis
Conclusions

Introduction Background Knowledge (1/6)
H.264: high quality at a low bit rate. Hybrid block-based motion compensation (MC) and transform coding model. Computationally intensive workload.

SIMD technology [1]: Single Instruction, Multiple Data. Exploits data-level parallelism.
[Figure: an instruction pool and a data pool feeding the processing units (PUs)]

SIMD example (Interpolation)
128-bit input A: a0 a1 a2 a3 a4 a5 a6 a7
128-bit input B: b0 b1 b2 b3 b4 b5 b6 b7
128-bit result: a0xb0+a1xb1, a2xb2+a3xb3, a4xb4+a5xb5, a6xb6+a7xb7
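The packed multiply-add on this slide can be sketched in portable scalar C. This is only an illustration of what one SIMD instruction computes across its lanes, assuming 16-bit inputs and 32-bit results; the function name `madd_pairs` is hypothetical and is not taken from the paper. On x86, the SSE2 intrinsic `_mm_madd_epi16` has the same shape.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar emulation of a packed multiply-add over eight 16-bit lanes:
   out[i] = a[2i]*b[2i] + a[2i+1]*b[2i+1] for i = 0..3.
   A SIMD unit produces all four 32-bit sums with one instruction. */
void madd_pairs(const int16_t a[8], const int16_t b[8], int32_t out[4])
{
    assert(a && b && out);
    for (int i = 0; i < 4; i++) {
        /* Widen to 32 bits before multiplying so the products do not wrap. */
        out[i] = (int32_t)a[2 * i] * b[2 * i]
               + (int32_t)a[2 * i + 1] * b[2 * i + 1];
    }
}
```

For an interpolation filter, `a` would hold pixel samples and `b` the filter taps, so each output lane is a partial sum of the filter.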

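Multiprocessor & Hyper-Threading: Hyper-Threading is a form of simultaneous multithreading (SMT). It shares the physical execution resources and duplicates the architectural state, so one physical processor appears as two logical processors.
[Figure: two logical processors sharing one set of execution resources]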

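Multiprocessor & Hyper-Threading
[Figure: a traditional dual-CPU system, where each processor pairs one architectural state with its own execution resources, compared with a Hyper-Threading-capable dual-CPU system, where each processor holds duplicated architectural states that share the processor's execution resources]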

OpenMP programming model: a high-level application programming interface that supports shared-memory multiprocessing.
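As a rough illustration of the OpenMP style, the sketch below parallelizes a sum of absolute differences (SAD) loop with a single directive. The function `sad` and its parameters are hypothetical and are not taken from the paper's encoder; the point is only that the serial loop is kept as written and the pragma asks the compiler to split it across threads. Without OpenMP support enabled, the pragma is ignored and the loop runs serially with the same result.

```c
#include <assert.h>
#include <stdlib.h>

/* Sum of absolute differences between two pixel blocks of n samples.
   The reduction clause gives each thread a private partial sum and
   combines the partial sums when the loop ends. */
long sad(const unsigned char *cur, const unsigned char *ref, int n)
{
    assert(cur && ref && n >= 0);
    long sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += abs((int)cur[i] - (int)ref[i]);
    return sum;
}
```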

Introduction Main Purpose (1/1)
Goal: speed up the H.264 encoder. Parallelize the encoder with the OpenMP programming model. Use a multi-level (frame- and slice-level) data partitioning scheme. Experiments show the speedups on an Intel Xeon(TM) system.

Implementation Overview (1/1)
Divide the H.264 encoding process into multiple threads via data-domain decomposition. Candidate levels: GOP, frame, slice, macroblock (MB). Assess the thread granularity at each level and propose an implementation.

Implementation Thread Granularity (1/5)
Slice-Level Parallelism: the slices of a frame are independent of each other. Breaking the dependencies between MBs at slice boundaries increases the bit rate (bit rate ↑).
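Slice-level partitioning can be sketched as splitting the macroblocks of a frame into contiguous ranges, one per slice. The slides do not say how slices are sized, so the even split below is an assumption, and `slice_range` is a hypothetical helper rather than part of the paper's encoder.

```c
#include <assert.h>

/* Compute the macroblock range [*first, *first + *count) of slice s when
   total_mbs macroblocks are split as evenly as possible into n_slices
   contiguous slices. The first (total_mbs % n_slices) slices get one extra MB. */
void slice_range(int total_mbs, int n_slices, int s, int *first, int *count)
{
    assert(n_slices > 0 && s >= 0 && s < n_slices);
    int base = total_mbs / n_slices;
    int extra = total_mbs % n_slices;
    *first = s * base + (s < extra ? s : extra);
    *count = base + (s < extra ? 1 : 0);
}
```

Each range can then be handed to a different thread. More slices give more independent work items, at the cost of the bit-rate increase noted above.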

Slice-Level Parallelism

Frame-Level Parallelism: IBBP structure
[Figure: frames in display order. Anchor frames: 0 (I), 3 (P), 6 (P), 9 (P), 12 (P). B-frames: 1, 2, 4, 5, 7, 8.]

Frame-Level Parallelism: encode the I- and P-frames first. There is no bit-rate increase. Dependences among frames limit thread scalability.
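One way to see why the I- and P-frames go first is to list an encode order in which every anchor frame precedes the B-frames that depend on it. The sketch below assumes the fixed I B B P pattern from the slide (an anchor every third frame); `frame_type` and `encode_order` are hypothetical names, and the paper's encoder uses priority queues rather than a precomputed order.

```c
#include <assert.h>

/* Frame type at display index i in an I B B P B B P ... sequence. */
char frame_type(int i)
{
    if (i == 0) return 'I';
    return (i % 3 == 0) ? 'P' : 'B';
}

/* Fill order[] with display indices so that each anchor (I or P) comes
   before the B-frames between it and the previous anchor. Returns the
   number of entries written. Trailing B-frames with no later anchor are
   omitted, since they have no backward reference in this simple model. */
int encode_order(int n_frames, int order[])
{
    assert(n_frames >= 0 && order);
    int k = 0;
    int prev_anchor = -1;
    for (int i = 0; i < n_frames; i++) {
        if (frame_type(i) == 'B')
            continue;
        order[k++] = i;                      /* anchor first */
        for (int b = prev_anchor + 1; b < i; b++)
            order[k++] = b;                  /* then the B-frames it unblocks */
        prev_anchor = i;
    }
    return k;
}
```

In this order the two B-frames that follow each P-frame depend only on frames already encoded, so they can be encoded concurrently with each other and with the next P-frame.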

Combine the two approaches above: first exploit the parallelism among frames; when that reaches the upper limit on the number of threads, exploit the parallelism among slices.

Implementation Multithreaded Implementation (1/4)
Input preprocessing: read uncompressed images; issue the images to the encoding threads.
Encoding: use two slice buffers to distinguish the priority of I-, P-, and B-frames.
Post-processing: check the encoding status; commit the result to the bit-stream.

[Figure: Thread 0 moves frames between the input file, the image buffer, and the output file; Threads 1 to 4 take slices from Slice Queue 0 (I/P) and Slice Queue 1 (B)]

Pseudo code
    #pragma omp parallel sections
    {
      #pragma omp section
      while (there is a frame to encode)
        if (there is a free entry in the image buffer)
          issue a new frame to the image buffer
        else if (there are encoded frames in the image buffer)
          commit the encoded frame, release the entry
        else
          wait;
    }

Pseudo code
    #pragma omp section
    {
      #pragma omp parallel num_threads(# of encoding threads)
      while (1)
        if (there is a slice in slice queue 0)
          encode one slice    // higher priority for I/P-frames
        else if (there is a slice in slice queue 1)
          encode one slice    // lower priority for B-frames
        else if (all frames are encoded)
          exit;
        else
          wait;
    }
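The priority rule in the encoding loop can be sketched as a small function that always drains the I/P queue before the B queue. Everything here is illustrative: `SliceTask`, `SliceQueue`, `queue_push`, and `next_slice` are hypothetical names, the queue is a fixed-size array, and the code is single-threaded. In a real multithreaded encoder the pop would need to be protected, for example with `#pragma omp critical`.

```c
#include <assert.h>

#define QUEUE_CAP 64

typedef struct { int frame; int slice; } SliceTask;

typedef struct {
    SliceTask items[QUEUE_CAP];
    int head;   /* index of the next task to pop */
    int tail;   /* index of the next free slot */
} SliceQueue;

int queue_empty(const SliceQueue *q) { return q->head == q->tail; }

void queue_push(SliceQueue *q, SliceTask t)
{
    assert(q->tail < QUEUE_CAP);
    q->items[q->tail++] = t;
}

/* Pick the next slice to encode: queue 0 (I/P-frames) has priority over
   queue 1 (B-frames). Returns 1 and writes *out if a task was found,
   or 0 if both queues are empty and the caller should wait or exit. */
int next_slice(SliceQueue *ip_queue, SliceQueue *b_queue, SliceTask *out)
{
    if (!queue_empty(ip_queue)) {
        *out = ip_queue->items[ip_queue->head++];
        return 1;
    }
    if (!queue_empty(b_queue)) {
        *out = b_queue->items[b_queue->head++];
        return 1;
    }
    return 0;
}
```

Serving I/P slices first finishes the reference frames early, which in turn makes more B-frame slices available for the remaining threads.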

Performance Results Environment (1/5)
Dell 530 MT workstation: dual Intel Xeon processors running at 2.0 GHz with HT enabled, 512 KB L2 cache, 1 GB memory.
IBM 360 server: quad Intel Xeon processors running at 1.5 GHz with HT enabled, 256 KB L2 cache, 512 KB L3 cache, 2 GB memory.

Performance Results Encoder profile (1/5)
All inter search types are enabled. Only the nearest previous frame is used for inter motion search. The maximum search range is 16. 1/4-pel motion vector resolution is used. The quantization parameter is set to 16 for all frames.

Performance Results SIMD Technology (1/5)
Speedups of the key modules in H.264 encoder

Performance Results Speedup & Compression Efficiency
Speedup and bit rate vs. number of slices per frame

Performance Results Performance with HT Technology
Encoder speedups on different sequences after multithreading

Performance Results Performance with HT Technology
Enabling HT yields about a 1.2x speedup.

Conclusions: This paper presents an efficient multithreaded implementation of an H.264 encoder. It is the first to consider compression-efficiency degradation as well as parallel speedup. Speedups range from 4.31x to 4.69x on a 4-CPU system with HT. The work demonstrates that HT can yield a performance gain of about 20%.

Reference
[1] X. Zhou, E. Q. Li, and Y.-K. Chen, "Implementation of H.264 Decoder on General-Purpose Processors with Media Instructions," in Proc. of SPIE Conf. on Image and Video Communications and Processing, Jan
[2] Y.-K. Chen, M. Holliman, E. Debes, S. Zheltov, A. Knyazev, S. Bratanov, R. Belenov, and I. Santos, "Media Applications on Hyper-Threading Technology," Intel Technology Journal, pp , Feb
[3] D. Marr, F. Binns, D. L. Hill, G. Hinton, D. A. Koufaty, I. A. Miller, and M. Upton, "Hyper-Threading Technology Microarchitecture and Architecture," Intel Technology Journal, Vol. 6, Q1, 2002.
