1 Hierarchical Parallelization of an H.264/AVC Video Encoder. A. Rodriguez, A. Gonzalez, and M.P. Malumbres. IEEE PARELEC 2006.
2 Outline: Introduction, Performance Analysis, Hierarchical H.264 Parallel Encoder, Experimental Results, Conclusions.
3 Introduction Background Knowledge (1/5) Video Communication
4 Introduction Background Knowledge (2/5) H.264/AVC removes redundant information from the video signal in order to approach the limits of compression efficiency, at the cost of intensive computation. Target applications: video on demand, video conferencing, live broadcasting, etc.
5 Introduction Background Knowledge (3/5) The H.264/AVC encoder has a high CPU demand, so low latency and real-time response call for platforms with supercomputing capabilities: clusters, multiprocessors, or special-purpose devices.
6 Introduction Background Knowledge (4/5) A cluster is a group of linked computers that improves performance and/or availability over a single computer. Common categorizations: high-availability clusters, load-balancing clusters, and high-performance clusters.
7 Introduction Background Knowledge (5/5) Message-passing parallelism: message-passing runtimes and libraries such as MPI. Multithread parallelism: OpenMP. Optimized libraries exploiting SIMD extensions and graphics processing units: Intel IPP, AMD ACML, etc.
8 Introduction Main Purpose (1/6) Apply parallel processing to H.264 encoders in order to cope with their computational intensity for a given video quality and bit rate, image resolution, frame rate, and latency.
9 Introduction Main Purpose (2/6) Hierarchical parallelization of the H.264 encoder: a two-level MPI message-passing parallelization, at the GOP level and at the slice level.
10 Introduction Main Purpose (3/6) GOP-level parallelism: good speed-up, but high latency. (Figure: a sequence of consecutive GOPs distributed across encoding nodes.)
11 Introduction Main Purpose (4/6) Example of latency: 1 GOP = 10 frames, frame rate = 30 frames/s, time for encoding 1 GOP = 3 s. GOPs therefore arrive at 3 GOPs/s while each one takes 3 s to encode, so we have to encode 9 GOPs in parallel in order to achieve real-time response, and the latency is 3 s.
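A quick check of this figure (the same reasoning is formalized as Little's law on a later slide; the arithmetic below just multiplies the GOP arrival rate by the per-GOP encoding time):

```latex
\[
X = \frac{30\ \text{frames/s}}{10\ \text{frames/GOP}} = 3\ \text{GOP/s},
\qquad
N = X \cdot R = 3\ \text{GOP/s} \times 3\ \text{s} = 9\ \text{GOPs in parallel}.
\]
```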
12 Introduction Main Purpose (5/6) Slice-level parallelism: low latency, but lower coding efficiency.
13 Introduction Main Purpose (6/6) Combining both approaches aims at both speed-up and efficiency.
14 Performance Analysis Overview (1/2) Starting points: "Performance evaluation of parallel MPEG-4 video coding algorithms on clusters of workstations" [2] and "A parallel implementation of H.26L video encoder" [1]. Combining them targets scalability and low latency.
15 Performance Analysis Overview (2/2) Processing flow: the video sequence is divided into GOPs that flow through the encoder. (Figure: pipeline of GOPs.) Goals: increase throughput and reduce latency.
16 Performance Analysis Equation definition. Little's law: N = X * R, where N is the number of GOPs processed in parallel, X is the number of GOPs encoded per second, and R is the elapsed time between a GOP entering the system and the same GOP being completely encoded.
17 Performance Analysis Analysis (1/2) If we have n_p nodes in the cluster and every GOP is decomposed into n_s slices: N = n_p / n_s and R = R_SEQ / (n_s * E_s), where R_SEQ is the sequential encoding time of a GOP and E_s is the parallel efficiency at the slice level.
18 Performance Analysis Analysis (2/2) GOP throughput of the combined parallel encoder: X = N / R = n_p * E_s / R_SEQ. If E_s is significantly less than 1, the throughput is affected negatively.
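These relations are easy to collect into a small model. Below is a minimal sketch (the function and struct names are my own, not the paper's) that evaluates N, R, and X for a given configuration; the sample call reproduces the GOP-only case of the following example (20 nodes, 1 slice per frame, 5 s per GOP):

```c
#include <stdio.h>

/* Combined GOP/slice performance model from the previous slides:
 *   N = n_p / n_s                  (GOPs processed in parallel)
 *   R = R_seq / (n_s * E_s)        (latency of one GOP, seconds)
 *   X = N / R = n_p * E_s / R_seq  (throughput, GOPs per second)   */
typedef struct {
    double N;   /* GOPs in flight      */
    double R;   /* latency per GOP (s) */
    double X;   /* throughput (GOP/s)  */
} model_t;

static model_t evaluate(int n_p, int n_s, double R_seq, double E_s)
{
    model_t m;
    m.N = (double)n_p / n_s;
    m.R = R_seq / (n_s * E_s);
    m.X = m.N / m.R;            /* equals n_p * E_s / R_seq */
    return m;
}

int main(void)
{
    model_t m = evaluate(20, 1, 5.0, 1.0);
    printf("N = %.2f GOPs, R = %.2f s, X = %.2f GOP/s\n", m.N, m.R, m.X);
    return 0;
}
```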
19 Performance Analysis Example (1/4) Video sequence in HDTV format at 1280*720, frame rate = 60 frames/s. We suppose that the H.264 sequential encoder encodes one GOP (15 frames) in 5 seconds and that only one slice per frame is defined.
20 Performance Analysis Example (2/4) To get real-time response, X has to equal 60 frames/s, i.e., 4 GOPs/s, so n_p = X * R_SEQ = 4 * 5 = 20 nodes.
21 Performance Analysis Example (3/4) Combined with slice-level parallelism: maximum allowed latency = 1 s, slice parallelism efficiency = 0.8.
22 Performance Analysis Example (4/4) We set n_s to 7 (the smallest value satisfying the latency constraint R = R_SEQ / (n_s * E_s) <= 1 s) and N to 4 (from N = X * R), so the number of required nodes is adjusted to n_p = N * n_s = 28. Both the throughput and the latency requirements are then met.
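As a usage note, these values can be reproduced mechanically from the constraints (target throughput, maximum latency, slice efficiency). The sketch below restates the arithmetic with variable names of my own choosing:

```c
#include <math.h>
#include <stdio.h>

int main(void)
{
    const double R_seq = 5.0;  /* sequential encoding time of one GOP (s)           */
    const double E_s   = 0.8;  /* slice-level parallel efficiency                   */
    const double X     = 4.0;  /* required throughput: 60 fps / 15 frames = 4 GOP/s */
    const double R_max = 1.0;  /* maximum allowed latency (s)                       */

    /* latency constraint: R_seq / (n_s * E_s) <= R_max  =>  n_s >= R_seq / (E_s * R_max) */
    int n_s  = (int)ceil(R_seq / (E_s * R_max));   /* = 7        */
    double R = R_seq / (n_s * E_s);                /* ~= 0.89 s  */

    /* throughput constraint (Little's law), rounded up to whole GOPs */
    int N    = (int)ceil(X * R);                   /* = 4        */
    int n_p  = N * n_s;                            /* = 28 nodes */

    printf("n_s = %d, R = %.2f s, N = %d, n_p = %d\n", n_s, R, N, n_p);
    return 0;
}
```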
23 Performance Analysis Efficiency Estimation (1/5) Why do we have to estimate E_s? Because it determines both the throughput and the latency. How do we estimate E_s? With a PAMELA (PerformAnce ModEling LAnguage) model.
24 Performance Analysis Efficiency Estimation (2/5) The DPB (Decoded Picture Buffer) is updated in every node using MPI_Allgather. In this PAMELA model, MPI_Allgather is implemented using a binary tree.
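The slides do not show the exchange itself, so the following is only a sketch of how one reconstructed slice per node could be shared with MPI_Allgather; the buffer layout, equal slice sizes, and function name are assumptions, not the paper's code:

```c
#include <mpi.h>

/* Each of the n_s nodes holds the reconstructed pixels of its own slice.
 * After a frame is encoded, every node needs the whole reconstructed
 * frame in its DPB so it can serve as a reference.  One MPI_Allgather
 * per frame does this, assuming all slices are padded to the same size. */
void update_dpb(unsigned char *my_slice,  /* local reconstructed slice          */
                unsigned char *frame,     /* n_s * slice_bytes, gathered result */
                int slice_bytes,
                MPI_Comm slice_comm)      /* communicator of the slice group    */
{
    MPI_Allgather(my_slice, slice_bytes, MPI_UNSIGNED_CHAR,
                  frame,    slice_bytes, MPI_UNSIGNED_CHAR,
                  slice_comm);
    /* "frame" now holds slices 0..n_s-1 back to back; it can be copied
     * into the decoded picture buffer used for motion compensation.   */
}
```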
25 Performance Analysis Efficiency Estimation (3/5) The PAMELA model to parallel encode one frame is:
L = par (p = 1..n_s) { delay(t_s); delay(t_w); seq (i = 0..log2(n_s)-1) par (j = 1..n_s) delay(t_L + t_c * 2^i) }
where n_s is the number of slices processed in parallel, t_s is the mean slice encoding time, t_w is the mean wait time due to variations in t_s and to global synchronization, t_L is the start-up time, and t_c is the transmission time of one encoded slice.
26 Performance Analysis Efficiency Estimation (4/5) The parallel time obtained by solving this model is T(L) = t_s + t_w + t_AG, with t_AG = log2(n_s) * t_L + (n_s - 1) * t_c. The efficiency can then be computed as the usual ratio of sequential time to n_s times the parallel time, E_s = T_SEQ / (n_s * T(L)).
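To make this estimation concrete, a small helper can evaluate the model; the names are my own, the sequential reference time is assumed to be n_s * t_s (so E_s = t_s / T(L)), and the parameter values in the sample loop are purely illustrative, not the paper's measurements:

```c
#include <math.h>
#include <stdio.h>

/* PAMELA-style estimate of the parallel time for one frame:
 *   t_AG = log2(n_s) * t_L + (n_s - 1) * t_c
 *   T(L) = t_s + t_w + t_AG
 * Efficiency relative to a sequential time of n_s * t_s:
 *   E_s  = (n_s * t_s) / (n_s * T(L)) = t_s / T(L)                  */
static double slice_efficiency(int n_s, double t_s, double t_w,
                               double t_L, double t_c)
{
    double t_AG = log2((double)n_s) * t_L + (n_s - 1) * t_c;
    double T    = t_s + t_w + t_AG;
    return t_s / T;
}

int main(void)
{
    for (int n_s = 2; n_s <= 16; n_s *= 2)          /* illustrative values only */
        printf("n_s = %2d  ->  E_s = %.3f\n",
               n_s, slice_efficiency(n_s, 4.0e5, 4.0e4, 6.0, 1.3e4));
    return 0;
}
```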
27 Performance Analysis Efficiency Estimation (5/5) Experimental estimations of the parameter values and the resulting estimated efficiency for a slice-based parallel encoder. (Table: measured values of t_L, t_c, t_s, t_w, and t_AG.)
28 Performance Analysis Slice Parallelism Scalability (1/4) The feasible number of slices will depend on the video resolution. (Figure: bit rate increment (%) vs. number of MBs per slice.)
29 Performance Analysis Slice Parallelism Scalability (2/4) Bit rate overhead vs. number of slices per frame
30 Performance Analysis Slice Parallelism Scalability (3/4) PSNR loss vs. number of slices per frame
31 Performance Analysis Slice Parallelism Scalability (4/4) Encoding time vs. number of slices per frame
32 Hierarchical Parallel Encoder Overview In order to achieve scalability and low latency, GOP-level and slice-level parallelism are combined. At the first level, the sequence is divided into GOPs (15 frames each); every GOP is assigned to a processor group inside the cluster, and each group encodes its GOPs independently (one way to form such groups is sketched below).
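The slides do not show how the groups are created. One conventional way to carve a cluster into per-GOP processor groups in MPI is MPI_Comm_split; the sketch below assumes that layout, with world rank 0 reserved for the global manager introduced on the next slide:

```c
#include <mpi.h>

/* Split MPI_COMM_WORLD into a global manager (world rank 0) and several
 * encoding groups of group_size ranks each.  Every group gets its own
 * communicator, used for slice-level parallelism inside one GOP.      */
MPI_Comm make_group_comm(int group_size, int *group_id)
{
    int rank;
    MPI_Comm group_comm;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        *group_id = -1;   /* the global manager belongs to no group */
        MPI_Comm_split(MPI_COMM_WORLD, MPI_UNDEFINED, 0, &group_comm);
    } else {
        *group_id = (rank - 1) / group_size;   /* color = group index */
        MPI_Comm_split(MPI_COMM_WORLD, *group_id, rank, &group_comm);
    }
    return group_comm;    /* MPI_COMM_NULL on the global manager */
}
```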
33 Hierarchical Parallel Encoder GOP assignment method: each group has a local manager that communicates with the global manager; the global manager informs the requesting local manager of the assignment by sending a message with the GOP number. The scheme is simple and achieves load balance; a sketch of the exchange follows.
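A minimal sketch of this demand-driven assignment, assuming the world rank 0 process acts as the global manager and each group's leading process as its local manager; the message tags and the -1 "no more GOPs" sentinel are my own conventions, not taken from the paper:

```c
#include <mpi.h>

#define TAG_REQUEST 1
#define TAG_ASSIGN  2

/* Global manager: hand out GOP numbers 0..n_gops-1 on demand, then
 * answer every further request with -1 so the local managers stop.   */
void global_manager(int n_gops, int n_groups)
{
    int next = 0, done = 0;
    while (done < n_groups) {
        int dummy;
        MPI_Status st;
        MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, TAG_REQUEST,
                 MPI_COMM_WORLD, &st);
        int gop = (next < n_gops) ? next++ : -1;
        if (gop < 0)
            done++;
        MPI_Send(&gop, 1, MPI_INT, st.MPI_SOURCE, TAG_ASSIGN,
                 MPI_COMM_WORLD);
    }
}

/* Local manager: keep requesting GOPs and encoding them with its group
 * (slice-level parallelism) until the global manager answers -1.      */
void local_manager(MPI_Comm group_comm)
{
    for (;;) {
        int request = 0, gop;
        MPI_Send(&request, 1, MPI_INT, 0, TAG_REQUEST, MPI_COMM_WORLD);
        MPI_Recv(&gop, 1, MPI_INT, 0, TAG_ASSIGN, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        if (gop < 0)
            break;
        /* encode_gop(gop, group_comm);  -- hypothetical encoding call */
        (void)group_comm;
    }
}
```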
34 Hierarchical Parallel Encoder Framework of the hierarchical H.264 parallel encoder. (Figure: a global manager coordinating several processor groups, each composed of processors P0, P1, and P2.)
35 Experimental Results Environments (1/2) Mozart: 4 biprocessor nodes with AMD Opteron 246 at 2 GHz, interconnected by switched Gigabit Ethernet. Aldebaran: SGI Altix 3700 with 44 Itanium II nodes, interconnected by a high-performance proprietary network.
36 Experimental Results Environments (2/2) 720 * 480 standard sequence Ayersroc, composed of 16 GOPs.
Configuration  Cluster    #Groups  #Slices
01_Gr_08S1     Mozart        1        8
02_Gr_04S1     Mozart        2        4
04_Gr_02S1     Mozart        4        2
08_Gr_01S1     Mozart        8        1
01_Gr_16S1     Aldebaran     1       16
02_Gr_08S1     Aldebaran     2        8
04_Gr_04S1     Aldebaran     4        4
08_Gr_02S1     Aldebaran     8        2
16_Gr_01S1     Aldebaran    16        1
37 Experimental Results System Speedup (1/2) Speed up in Mozart
38 Experimental Results System Speedup (2/2) Speed up in Aldebaran
39 Experimental Results Encoding Latency Mean GOP encoding time
40 Conclusions A hierarchical parallel video encoder based on H.264/AVC was proposed. Experimental results confirm the previous analysis, showing that a scalable, low-latency H.264 encoder can be obtained. Some issues remain open, as mentioned in the previous sections.
41 References
[1] J.C. Fernández and M.P. Malumbres, "A parallel implementation of H.26L video encoder", in Proc. of the Euro-Par 2002 Conference (LNCS 2400), pp. 830-833, Paderborn, 2002.
[2] A. Rodriguez, A. González, and M.P. Malumbres, "Performance evaluation of parallel MPEG-4 video coding algorithms on clusters of workstations", IEEE Int. Conference on Parallel Computing in Electrical Engineering, pp. 354-357, Dresden, 2004.
[3] A.J.C. van Gemund, "Symbolic Performance Modeling of Parallel Systems", IEEE Transactions on Parallel and Distributed Systems, vol. 14, no. 2, Feb. 2003.
[4] P.S. Pacheco, Parallel Programming with MPI, Morgan Kaufmann Publishers, Inc.