Evaluation of Data-Parallel Splitting Approaches for H.264 Decoding Florian H. Seitner, Michael Bleyer, Ralf M. Schreier, Margrit Gelautz International Conference on Advances in Mobile & Multimedia (MoMM 2008)
Outline Introduction Parallel H.264 Decoding Evaluated Methods Experimental Results Conclusions
Introduction The H.264 video standard is currently used in a wide range of video-related areas, such as video content distribution and television broadcasting. Its high coding efficiency (quarter-pel motion estimation, variable block sizes, multiple reference frames) comes at the cost of significantly increased CPU and memory loads.
Introduction Multi-core systems can be used to increase system performance. How should the H.264 decoding algorithm be distributed among multiple processing units? The decoding load should be distributed equally while dealing with data-dependency issues, inter-core communication, and synchronization.
Introduction The aim of this work is to evaluate the behavior of different parallel decoding approaches in terms of run-time complexity, efficient core usage, and data transfers.
Parallel H.264 Decoding Functional and Data-parallel Splitting In a functionally partitioned decoding system, decoding tasks are assigned to individual processing cores. Each processing unit can be optimized for a certain task, but the workload distribution is unequal and inter-communication requires high transfer rates.
Parallel H.264 Decoding Functional and Data-parallel Splitting A data-parallel decoding system distributes macroblocks (MBs) among multiple processing units. Data dependencies between different cores must be minimized, and the MB distribution onto the processing cores must achieve an equal workload balance.
Parallel H.264 Decoding The H.264 Decoder [Block diagram: the H.264 decoding process. The parser performs stream parsing and entropy decoding of the encoded bitstream; the reconstructor performs inverse quantization, inverse DCT, spatial prediction, motion compensation using reference frames, and deblocking. Data-parallel processing targets the reconstructor.]
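As an illustration of where the data-parallel split applies, the following sketch separates the sequential parser stage from the reconstructor stage; all type and function names are illustrative placeholders, not the decoder implementation used in the paper.

```cpp
// Sketch of the decoder split shown in the block diagram: a sequential parser
// followed by the reconstructor, which is the stage the evaluated approaches
// distribute over macroblocks.  Stage bodies are illustrative stubs only.
#include <vector>

struct MacroblockData { /* parsed coefficients, prediction modes, motion vectors ... */ };

// Parser: stream parsing and entropy decoding (kept on a single core).
std::vector<MacroblockData> parse_frame(const unsigned char* bitstream, int num_mbs)
{
    std::vector<MacroblockData> mbs(num_mbs);
    (void)bitstream; // ... entropy decoding would fill each MacroblockData ...
    return mbs;
}

// Reconstructor: inverse quantization, inverse DCT, spatial prediction /
// motion compensation and deblocking per MB -- the part split across cores.
void reconstruct_frame(std::vector<MacroblockData>& mbs)
{
    for (MacroblockData& mb : mbs) {
        // inverse_quantize(mb); inverse_transform(mb);
        // intra MB ? spatial_prediction(mb) : motion_compensation(mb);
        // deblock(mb);
        (void)mb;
    }
}
```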
Parallel H.264 Decoding Macroblock Dependencies Data-parallel splitting of the decoder's reconstruction module is challenging due to spatial and temporal dependencies: intra prediction, deblocking, and inter prediction.
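A minimal sketch of the resulting spatial readiness test, assuming the usual left / top-left / top / top-right MB neighbourhood and an illustrative array of per-MB completion flags (temporal inter-prediction dependencies on reference frames are not modelled here):

```cpp
// Spatial dependency check implied by intra prediction and deblocking:
// MB (x, y) may only be reconstructed once its left, top-left, top and
// top-right neighbours are finished.  The done[] flags are illustrative.
bool mb_spatially_ready(int x, int y, int width_mb,
                        const bool* done /* width_mb * height_mb flags */)
{
    const int dx[] = {-1, -1, 0, +1};
    const int dy[] = { 0, -1, -1, -1};
    for (int k = 0; k < 4; ++k) {
        int nx = x + dx[k], ny = y + dy[k];
        if (nx < 0 || nx >= width_mb || ny < 0)
            continue;                     // neighbour lies outside the frame
        if (!done[ny * width_mb + nx])
            return false;                 // neighbour not yet reconstructed
    }
    return true;
}
```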
Evaluated Methods Overview We compare the performance of five different approaches for data-parallel splitting of the decoder's reconstructor module: the single-row approach, the multi-column approach, the blocking slice-parallel method, the non-blocking slice-parallel method, and the diagonal approach.
Evaluated Methods Single Row Approach The assignment of MBs to processors (illustrated for 2, 4, and 8 cores). N is the number of processors. Processor i (i = 0, 1, …, N − 1) is responsible for decoding the y-th row of MBs if (y mod N) = i.
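A minimal sketch of this mapping (the function name is illustrative):

```cpp
// Single-row (SR) assignment: processor i decodes MB row y iff (y mod N) == i.
int sr_owner(int mb_row, int num_cores)
{
    return mb_row % num_cores;
}
// e.g. with 4 cores: rows 0,4,8,... -> core 0, rows 1,5,9,... -> core 1, ...
```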
Evaluated Methods Single Row Approach An example of the SR approach (2 cores), assuming that processing a macroblock takes a constant 1 unit of time. [Figure: decoding progress snapshots at T = 2, 3, 8, 10, and 34.]
Evaluated Methods Single Row Approach Advantages: simplicity and only a small start-up delay. Disadvantage: many dependencies across processor assignment borders.
Evaluated Methods Multi-column Approach The assignment of MBs to processors (illustrated for 2, 4, and 8 cores). w is the width of a multi-column. Processor i (i = 0, 1, …, N − 1) is responsible for decoding a MB of the x-th column if iw ≤ x < (i + 1)w.
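A sketch of the column-to-core mapping, assuming the stripe width w is chosen so that N stripes cover the frame width (any ragged last stripe is clamped to the last core):

```cpp
// Multi-column (MC) assignment: the frame is split into N vertical stripes of
// w MB columns; processor i handles columns i*w <= x < (i+1)*w.
int mc_owner(int mb_col, int num_cores, int stripe_width_mb)
{
    int owner = mb_col / stripe_width_mb;
    return owner < num_cores ? owner : num_cores - 1; // clamp a ragged last stripe
}
```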
Evaluated Methods Multi-column Approach An example of the MC approach (2 cores). Advantages: fewer dependencies across processors; a processor has to wait for another processor's results only at the column boundaries. [Figure: decoding progress snapshots at T = 4, 5, 8, and 36.]
Evaluated Methods Slice-parallel Approach The assignment of MBs to processors (illustrated for 2, 4, and 8 cores). h is the height of a slice in MB rows. Processor i (i = 0, 1, …, N − 1) is responsible for decoding a MB of the y-th row if ih ≤ y < (i + 1)h.
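The analogous row-to-core mapping, again assuming the slice height h is chosen so that N slices cover the frame height:

```cpp
// Slice-parallel (SP) assignment: the frame is split into N horizontal slices
// of h MB rows; processor i handles rows i*h <= y < (i+1)*h.
int sp_owner(int mb_row, int num_cores, int slice_height_mb)
{
    int owner = mb_row / slice_height_mb;
    return owner < num_cores ? owner : num_cores - 1; // clamp a ragged last slice
}
```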
Evaluated Methods Slice-parallel Approach An example of the SP approach in the blocking version (2 cores). Disadvantages: long delay; cores sit idle, resulting in low core usage. [Figure: decoding progress snapshots at T = 26, 32, and 58.]
Evaluated Methods Slice-parallel Approach An example of the SP approach in the non-blocking version (2 cores). No dependencies are considered across slice boundaries (slices are decoded completely independently). NBSP requires full control over the encoder. [Figure: decoding progress snapshots at T = 1 and 32.]
Evaluated Methods Diagonal Approach The assignment of MBs to processors (illustrated for 2, 4, and 8 cores). The first line of MBs is divided into equally sized columns; the assignment for each subsequent line is derived by left-shifting the assignment of the line above.
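A sketch of the diagonal mapping, assuming the shift is one MB per line and that the frame width in MBs divides evenly into N column groups (both are simplifying assumptions for illustration):

```cpp
// Diagonal (DG) assignment: row 0 is split into N equally sized column groups;
// each following row reuses the assignment of the row above shifted left by
// one MB, wrapping around at the frame border.
int dg_owner(int mb_col, int mb_row, int width_mb, int group_width_mb)
{
    int shifted_col = (mb_col + mb_row) % width_mb; // shift left by mb_row, with wrap
    return shifted_col / group_width_mb;
}
```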
Evaluated Methods Diagonal Approach An example of the DG approach. [Figure: decoding progress snapshots at T = 4, 10, 12, 13, 16, 18, 20, 23, 24, and 43.]
Evaluated Methods Diagonal Approach Comparing the inter-processor dependencies introduced by the DG and MC approaches: in the diagonal approach, the dependencies of CPU 2 originate solely from MBs assigned to CPU 1, whereas in the multi-column approach, MBs assigned to CPU 2 also depend on CPU 3.
Experimental Results Overview Test sequences and encoding parameters: GOP size of 14, search range of ±16 pixels, 5 reference frames.
Experimental Results Run-time Complexity Two major indicators for the efficiency of a multi-core decoding system: the decoder's run-time (a low run-time indicates high decoding performance) and the number of data-dependency stalls occurring during the decoding process (the number of stalls provides an estimate of how efficiently the system's computational resources are used).
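To make the stall metric concrete, the following toy simulation counts, per core, the cycles spent waiting on MBs owned by other cores. It assumes a constant cost of one time unit per MB (as in the earlier examples), the spatial left / top-left / top / top-right dependency pattern, the single-row assignment, and an arbitrary frame size; it illustrates the metric rather than reproducing the paper's measurement setup.

```cpp
// Toy simulation of the two indicators: total decoding time and stall cycles
// caused by inter-core data dependencies.  One MB costs one time unit; a core
// stalls while a spatial neighbour owned by another core is not yet finished.
#include <cstdio>
#include <utility>
#include <vector>

int main()
{
    const int W = 11, H = 9, N = 2;                        // frame size in MBs, cores (arbitrary)
    auto owner = [&](int /*x*/, int y) { return y % N; };  // single-row assignment

    // Per-core work lists in raster order.
    std::vector<std::vector<std::pair<int, int>>> work(N);
    for (int y = 0; y < H; ++y)
        for (int x = 0; x < W; ++x)
            work[owner(x, y)].push_back({x, y});

    std::vector<std::vector<bool>> done(H, std::vector<bool>(W, false));
    auto ready = [&](int x, int y) {                       // left / top-left / top / top-right
        const int dx[] = {-1, -1, 0, 1}, dy[] = {0, -1, -1, -1};
        for (int k = 0; k < 4; ++k) {
            int nx = x + dx[k], ny = y + dy[k];
            if (nx >= 0 && nx < W && ny >= 0 && !done[ny][nx])
                return false;
        }
        return true;
    };

    std::vector<int> next(N, 0), stalls(N, 0);
    int time = 0, remaining = W * H;
    while (remaining > 0) {
        std::vector<std::pair<int, int>> finished;         // commit after the time step
        for (int c = 0; c < N; ++c) {
            if (next[c] >= (int)work[c].size())
                continue;                                  // this core is already done
            auto [x, y] = work[c][next[c]];
            if (ready(x, y)) { finished.push_back({x, y}); ++next[c]; --remaining; }
            else             ++stalls[c];                  // waiting on another core
        }
        for (auto [x, y] : finished) done[y][x] = true;
        ++time;
    }
    std::printf("total time: %d\n", time);
    for (int c = 0; c < N; ++c)
        std::printf("core %d stall cycles: %d\n", c, stalls[c]);
    return 0;
}
```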
Experimental Results Run-time Complexity Speed-up in run-time: the speed increase of each parallelization approach, expressed in multiples of the single-core performance.
Experimental Results Run-time Complexity Stall cycles caused by data dependencies between the cores
Experimental Results Inter-communication Memory transfers to and from the external DRAM and between the cores' local memories are expensive in terms of power consumption and transfer time. Core inter-communication: loading reference data and deblocking pixels.
Experimental Results Inter-communication Data transfer volume for reference data and deblocking information.
Conclusions In this study, we have evaluated five data-parallel approaches for the H.264 decoder. The run-time of each parallelization approach is influenced by the sizes and shapes of the frame partitions. Large, dependency-minimizing partitions cause less inter-communication between cores.