1 Hierarchical Parallelization of an H.264/AVC Video Encoder. A. Rodriguez, A. Gonzalez, and M.P. Malumbres. IEEE PARELEC 2006.
2 Outline: Introduction, Performance Analysis, Hierarchical H.264 Parallel Encoder, Experimental Results, Conclusions.
3 Introduction Background Knowledge (1/5) Video Communication
4 Introduction Background Knowledge (2/5) H.264/AVC removes redundant information from the video signal in order to approach the limits of compression efficiency, at the cost of intensive computation. Target applications: video on demand, video conferencing, live broadcasting, etc.
5 Introduction Background Knowledge (3/5) The H.264/AVC encoder has a high CPU demand, so low latency and real-time response call for platforms with supercomputing capabilities: clusters, multiprocessors, or special-purpose devices.
6 Introduction Background Knowledge (4/5) A cluster is a group of linked computers that improves performance and/or availability over a single computer. Common categorizations: high-availability clusters, load-balancing clusters, and high-performance clusters.
7 Introduction Background Knowledge (5/5) Message-passing parallelism: message-passing runtimes and libraries such as MPI. Multithread parallelism: OpenMP. Optimized libraries exploiting SIMD extensions and graphics processing units: Intel IPP, AMD ACML, etc.
8 Introduction Main Purpose (1/6) Apply parallel processing to H.264 encoders in order to cope with their computational intensity for a given video quality and bit rate, image resolution, frame rate, and latency.
9 Introduction Main Purpose (2/6) Hierarchical parallelization of the H.264 encoder: a two-level MPI message-passing parallelization, at the GOP level and at the slice level.
10 Introduction Main Purpose (3/6) GOP-level parallelism: good speed-up, but high latency. (Figure: a sequence of consecutive GOPs distributed across encoding nodes.)
11 Introduction Main Purpose (4/6) Example of latency: 1 GOP = 10 frames, frame rate = 30 frames/s, time for encoding 1 GOP = 3 s. GOPs therefore arrive at 3 GOPs/s while each one takes 3 s to encode, so we have to encode 9 GOPs in parallel in order to achieve real-time response, and the latency is 3 s.
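A quick check of this figure (the same reasoning is formalized as Little's law on a later slide; the arithmetic below just multiplies the GOP arrival rate by the per-GOP encoding time):

```latex
\[
X = \frac{30\ \text{frames/s}}{10\ \text{frames/GOP}} = 3\ \text{GOP/s},
\qquad
N = X \cdot R = 3\ \text{GOP/s} \times 3\ \text{s} = 9\ \text{GOPs in parallel}.
\]
```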
12 Introduction Main Purpose (5/6) Slice-level parallelism: low latency, but lower coding efficiency.
13 Introduction Main Purpose (6/6) Combining both approaches aims at both speed-up and efficiency.
14 Performance Analysis Overview (1/2) Starting points: "Performance evaluation of parallel MPEG-4 video coding algorithms on clusters of workstations" [2] and "A parallel implementation of H.26L video encoder" [1]. Combining them targets scalability and low latency.
15 Performance Analysis Overview (2/2) Processing flow: the video sequence is divided into GOPs that flow through the encoder. (Figure: pipeline of GOPs.) Goals: increase throughput and reduce latency.
16 Performance Analysis Equation definition. Little's law: N = X * R, where N is the number of GOPs processed in parallel, X is the number of GOPs encoded per second, and R is the elapsed time between a GOP entering the system and the same GOP being completely encoded.
17 Performance Analysis Analysis (1/2) If we have n_p nodes in the cluster and every GOP is decomposed into n_s slices: N = n_p / n_s and R = R_SEQ / (n_s * E_s), where R_SEQ is the sequential encoding time of a GOP and E_s is the parallel efficiency at the slice level.
18 Performance Analysis Analysis (2/2) GOP throughput of the combined parallel encoder: X = N / R = n_p * E_s / R_SEQ. If E_s is significantly less than 1, the throughput is affected negatively.
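These relations are easy to collect into a small model. Below is a minimal sketch (the function and struct names are my own, not the paper's) that evaluates N, R, and X for a given configuration; the sample call reproduces the GOP-only case of the following example (20 nodes, 1 slice per frame, 5 s per GOP):

```c
#include <stdio.h>

/* Combined GOP/slice performance model from the previous slides:
 *   N = n_p / n_s                  (GOPs processed in parallel)
 *   R = R_seq / (n_s * E_s)        (latency of one GOP, seconds)
 *   X = N / R = n_p * E_s / R_seq  (throughput, GOPs per second)   */
typedef struct {
    double N;   /* GOPs in flight      */
    double R;   /* latency per GOP (s) */
    double X;   /* throughput (GOP/s)  */
} model_t;

static model_t evaluate(int n_p, int n_s, double R_seq, double E_s)
{
    model_t m;
    m.N = (double)n_p / n_s;
    m.R = R_seq / (n_s * E_s);
    m.X = m.N / m.R;            /* equals n_p * E_s / R_seq */
    return m;
}

int main(void)
{
    model_t m = evaluate(20, 1, 5.0, 1.0);
    printf("N = %.2f GOPs, R = %.2f s, X = %.2f GOP/s\n", m.N, m.R, m.X);
    return 0;
}
```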
19 Performance Analysis Example (1/4) Video sequence in HDTV format at 1280*720, frame rate = 60 frames/s. We suppose that the H.264 sequential encoder encodes one GOP (15 frames) in 5 seconds and that only one slice per frame is defined.
20 Performance Analysis Example (2/4) To get real-time response, X has to equal 60 frames/s, i.e., 4 GOPs/s, so n_p = X * R_SEQ = 4 * 5 = 20 nodes.
21 Performance Analysis Example (3/4) Combined with slice-level parallelism: maximum allowed latency = 1 s, slice parallelism efficiency = 0.8.
22 Performance Analysis Example (4/4) We set n_s to 7 (the smallest value satisfying the latency constraint R = R_SEQ / (n_s * E_s) <= 1 s) and N to 4 (from N = X * R), so the number of required nodes is adjusted to n_p = N * n_s = 28. Both the throughput and the latency requirements are then met.
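As a usage note, these values can be reproduced mechanically from the constraints (target throughput, maximum latency, slice efficiency). The sketch below restates the arithmetic with variable names of my own choosing:

```c
#include <math.h>
#include <stdio.h>

int main(void)
{
    const double R_seq = 5.0;  /* sequential encoding time of one GOP (s)           */
    const double E_s   = 0.8;  /* slice-level parallel efficiency                   */
    const double X     = 4.0;  /* required throughput: 60 fps / 15 frames = 4 GOP/s */
    const double R_max = 1.0;  /* maximum allowed latency (s)                       */

    /* latency constraint: R_seq / (n_s * E_s) <= R_max  =>  n_s >= R_seq / (E_s * R_max) */
    int n_s  = (int)ceil(R_seq / (E_s * R_max));   /* = 7        */
    double R = R_seq / (n_s * E_s);                /* ~= 0.89 s  */

    /* throughput constraint (Little's law), rounded up to whole GOPs */
    int N    = (int)ceil(X * R);                   /* = 4        */
    int n_p  = N * n_s;                            /* = 28 nodes */

    printf("n_s = %d, R = %.2f s, N = %d, n_p = %d\n", n_s, R, N, n_p);
    return 0;
}
```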
23 Performance Analysis Efficiency Estimation (1/5) Why do we have to estimate E_s? Because it determines both the throughput and the latency. How do we estimate E_s? With a PAMELA (PerformAnce ModEling LAnguage) model.
24 Performance Analysis Efficiency Estimation (2/5) The DPB (Decoded Picture Buffer) is updated in every node using MPI_Allgather. In this PAMELA model, MPI_Allgather is implemented using a binary tree.
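The slides do not show the exchange itself, so the following is only a sketch of how one reconstructed slice per node could be shared with MPI_Allgather; the buffer layout, equal slice sizes, and function name are assumptions, not the paper's code:

```c
#include <mpi.h>

/* Each of the n_s nodes holds the reconstructed pixels of its own slice.
 * After a frame is encoded, every node needs the whole reconstructed
 * frame in its DPB so it can serve as a reference.  One MPI_Allgather
 * per frame does this, assuming all slices are padded to the same size. */
void update_dpb(unsigned char *my_slice,  /* local reconstructed slice          */
                unsigned char *frame,     /* n_s * slice_bytes, gathered result */
                int slice_bytes,
                MPI_Comm slice_comm)      /* communicator of the slice group    */
{
    MPI_Allgather(my_slice, slice_bytes, MPI_UNSIGNED_CHAR,
                  frame,    slice_bytes, MPI_UNSIGNED_CHAR,
                  slice_comm);
    /* "frame" now holds slices 0..n_s-1 back to back; it can be copied
     * into the decoded picture buffer used for motion compensation.   */
}
```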
25 Performance Analysis Efficiency Estimation (3/5) The PAMELA model to parallel encode one frame is:
L = par (p = 1..n_s) { delay(t_s); delay(t_w); seq (i = 0..log2(n_s)-1) par (j = 1..n_s) delay(t_L + t_c * 2^i) }
where n_s is the number of slices processed in parallel, t_s is the mean slice encoding time, t_w is the mean wait time due to variations in t_s and to global synchronization, t_L is the start-up time, and t_c is the transmission time of one encoded slice.
26 Performance Analysis Efficiency Estimation (4/5) The parallel time obtained by solving this model is T(L) = t_s + t_w + t_AG, with t_AG = log2(n_s) * t_L + (n_s - 1) * t_c. The efficiency can then be computed as the usual ratio of sequential time to n_s times the parallel time, E_s = T_SEQ / (n_s * T(L)).
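To make this estimation concrete, a small helper can evaluate the model; the names are my own, the sequential reference time is assumed to be n_s * t_s (so E_s = t_s / T(L)), and the parameter values in the sample loop are purely illustrative, not the paper's measurements:

```c
#include <math.h>
#include <stdio.h>

/* PAMELA-style estimate of the parallel time for one frame:
 *   t_AG = log2(n_s) * t_L + (n_s - 1) * t_c
 *   T(L) = t_s + t_w + t_AG
 * Efficiency relative to a sequential time of n_s * t_s:
 *   E_s  = (n_s * t_s) / (n_s * T(L)) = t_s / T(L)                  */
static double slice_efficiency(int n_s, double t_s, double t_w,
                               double t_L, double t_c)
{
    double t_AG = log2((double)n_s) * t_L + (n_s - 1) * t_c;
    double T    = t_s + t_w + t_AG;
    return t_s / T;
}

int main(void)
{
    for (int n_s = 2; n_s <= 16; n_s *= 2)          /* illustrative values only */
        printf("n_s = %2d  ->  E_s = %.3f\n",
               n_s, slice_efficiency(n_s, 4.0e5, 4.0e4, 6.0, 1.3e4));
    return 0;
}
```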
27 Performance Analysis Efficiency Estimation (5/5) Experimental estimations of the parameter values and the resulting estimated efficiency for a slice-based parallel encoder. (Table: measured values of t_L, t_c, t_s, t_w, and t_AG.)
28 Performance Analysis Slice Parallelism Scalability (1/4) The feasible number of slices will depend on the video resolution. (Figure: bit rate increment (%) vs. number of MBs per slice.)
29 Performance Analysis Slice Parallelism Scalability (2/4) Bit rate overhead vs. number of slices per frame
30 Performance Analysis Slice Parallelism Scalability (3/4) PSNR loss vs. number of slices per frame
31 Performance Analysis Slice Parallelism Scalability (4/4) Encoding time vs. number of slices per frame
32 Hierarchical Parallel Encoder Overview In order to achieve scalability and low latency, GOP-level and slice-level parallelism are combined. At the first level, the sequence is divided into GOPs (15 frames each); every GOP is assigned to a processor group inside the cluster, and each group encodes its GOPs independently (one way to form such groups is sketched below).
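The slides do not show how the groups are created. One conventional way to carve a cluster into per-GOP processor groups in MPI is MPI_Comm_split; the sketch below assumes that layout, with world rank 0 reserved for the global manager introduced on the next slide:

```c
#include <mpi.h>

/* Split MPI_COMM_WORLD into a global manager (world rank 0) and several
 * encoding groups of group_size ranks each.  Every group gets its own
 * communicator, used for slice-level parallelism inside one GOP.      */
MPI_Comm make_group_comm(int group_size, int *group_id)
{
    int rank;
    MPI_Comm group_comm;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        *group_id = -1;   /* the global manager belongs to no group */
        MPI_Comm_split(MPI_COMM_WORLD, MPI_UNDEFINED, 0, &group_comm);
    } else {
        *group_id = (rank - 1) / group_size;   /* color = group index */
        MPI_Comm_split(MPI_COMM_WORLD, *group_id, rank, &group_comm);
    }
    return group_comm;    /* MPI_COMM_NULL on the global manager */
}
```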
33 Hierarchical Parallel Encoder GOP assignment method: each group has a local manager that communicates with the global manager; the global manager informs the requesting local manager of the assignment by sending a message with the GOP number. The scheme is simple and achieves load balance; a sketch of the exchange follows.
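A minimal sketch of this demand-driven assignment, assuming the world rank 0 process acts as the global manager and each group's leading process as its local manager; the message tags and the -1 "no more GOPs" sentinel are my own conventions, not taken from the paper:

```c
#include <mpi.h>

#define TAG_REQUEST 1
#define TAG_ASSIGN  2

/* Global manager: hand out GOP numbers 0..n_gops-1 on demand, then
 * answer every further request with -1 so the local managers stop.   */
void global_manager(int n_gops, int n_groups)
{
    int next = 0, done = 0;
    while (done < n_groups) {
        int dummy;
        MPI_Status st;
        MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, TAG_REQUEST,
                 MPI_COMM_WORLD, &st);
        int gop = (next < n_gops) ? next++ : -1;
        if (gop < 0)
            done++;
        MPI_Send(&gop, 1, MPI_INT, st.MPI_SOURCE, TAG_ASSIGN,
                 MPI_COMM_WORLD);
    }
}

/* Local manager: keep requesting GOPs and encoding them with its group
 * (slice-level parallelism) until the global manager answers -1.      */
void local_manager(MPI_Comm group_comm)
{
    for (;;) {
        int request = 0, gop;
        MPI_Send(&request, 1, MPI_INT, 0, TAG_REQUEST, MPI_COMM_WORLD);
        MPI_Recv(&gop, 1, MPI_INT, 0, TAG_ASSIGN, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        if (gop < 0)
            break;
        /* encode_gop(gop, group_comm);  -- hypothetical encoding call */
        (void)group_comm;
    }
}
```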
34 Hierarchical Parallel Encoder Framework of the hierarchical H.264 parallel encoder. (Figure: a global manager coordinating several processor groups, each composed of processors P0, P1, and P2.)
35 Experimental Results Environments (1/2) Mozart: 4 biprocessor nodes with AMD Opteron 246 at 2 GHz, interconnected by switched Gigabit Ethernet. Aldebaran: SGI Altix 3700 with 44 Itanium II nodes, interconnected by a high-performance proprietary network.
36 Experimental Results Environments (2/2) 720 * 480 standard sequence Ayersroc, composed of 16 GOPs.
Configuration  Cluster    #Groups  #Slices
01_Gr_08S1     Mozart        1        8
02_Gr_04S1     Mozart        2        4
04_Gr_02S1     Mozart        4        2
08_Gr_01S1     Mozart        8        1
01_Gr_16S1     Aldebaran     1       16
02_Gr_08S1     Aldebaran     2        8
04_Gr_04S1     Aldebaran     4        4
08_Gr_02S1     Aldebaran     8        2
16_Gr_01S1     Aldebaran    16        1
37 Experimental Results System Speedup (1/2) Speed up in Mozart
38 Experimental Results System Speedup (2/2) Speed up in Aldebaran
39 Experimental Results Encoding Latency Mean GOP encoding time
40 Conclusions A hierarchical parallel video encoder based on H.264/AVC was proposed. Experimental results confirm the previous analysis, showing that a scalable, low-latency H.264 encoder can be obtained. Some issues remain open, as mentioned in the previous sections.
41 References
[1] J.C. Fernández and M.P. Malumbres, "A parallel implementation of H.26L video encoder", in Proc. of the Euro-Par 2002 Conference (LNCS 2400), pp. 830-833, Paderborn, 2002.
[2] A. Rodriguez, A. González, and M.P. Malumbres, "Performance evaluation of parallel MPEG-4 video coding algorithms on clusters of workstations", IEEE Int. Conference on Parallel Computing in Electrical Engineering, pp. 354-357, Dresden, 2004.
[3] A.J.C. van Gemund, "Symbolic Performance Modeling of Parallel Systems", IEEE Transactions on Parallel and Distributed Systems, vol. 14, no. 2, Feb. 2003.
[4] P.S. Pacheco, Parallel Programming with MPI, Morgan Kaufmann Publishers, Inc.