Analysis, Fast Algorithm, and VLSI Architecture Design for H.264/AVC Intra Frame Coder Yu-Wen Huang, Bing-Yu Hsieh, Tung-Chien Chen, and Liang-Gee Chen IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY 2005
Outline: Introduction; H.264/AVC Intra Coding; Computation Reduction; Hardware Architecture
Introduction [Figure: H.264/AVC coder block diagram with embedded decoder. Blocks: input video signal split into macroblocks of 16x16 pixels; coder control; transform/scaling/quantization; scaling and inverse transform; intra-frame prediction; motion estimation and motion compensation (multiple reference frames and variable block sizes); intra/inter switch; de-blocking filter; output video signal; entropy coding of control data, quantized transform coefficients, and motion data.]
Introduction Coding pipeline: Source -> Prediction (4x4/16x16 luma, 8x8 chroma) -> Transform (4x4 DCT) -> Quantization (scalar nonuniform Q, lossy) -> Entropy Coding (CAVLC or CABAC, lossless) -> Compressed Data (bits per pixel)
Introduction H.264/AVC I-Frame Coder (CAVLC) vs. JPEG2000 (DWT 5/3) Computational Complexity Block-based coding vs. Frame-based coding DWT 5/3 Hardware-friendly Memory-wasting
Introduction Comparison between different image coding standards at 0.225 bpp: JPEG, JPEG 2000 (DWT 5/3), H.264 I-Frame (CAVLC)
Introduction Two solutions for platform-based design of H.264/AVC intra frame coder. (1) Fast algorithm for software implementation: reduces complexity by 45% with a PSNR drop of 0.3 dB. (2) Hardware accelerator: max. clock rate 55 MHz, 31 fps for 4:2:0 SDTV (all intra frames).
H.264/AVC Intra Coding Intra Prediction I4MB (4x4): directional modes + DC. I16MB (16x16): directional modes + DC + plane. [Figure: prediction directions around the current block, labeled 1, 3, 4, 5, 6, 7, 8]
H.264/AVC Intra Coding Mode Decision Low complexity mode: distortion = SATD (original pels - predictors); rate = bits of mode information. High complexity mode: distortion = MSE (original pels - reconstructed pels); rate = bits of mode information + residual.
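The low complexity cost can be sketched as follows. This is a minimal illustration, not the reference encoder: the function names, the unnormalized 4x4 Hadamard transform, and the weight `lam` that combines distortion and rate are assumptions made for this example.

```python
# Symmetric 4x4 Hadamard matrix (unnormalized); an assumption of this sketch.
H4 = [[1, 1, 1, 1],
      [1, 1, -1, -1],
      [1, -1, -1, 1],
      [1, -1, 1, -1]]


def matmul4(a, b):
    """Multiply two 4x4 integer matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def satd4x4(original, predictor):
    """Sum of absolute Hadamard-transformed differences of a 4x4 block."""
    diff = [[original[i][j] - predictor[i][j] for j in range(4)]
            for i in range(4)]
    coeffs = matmul4(matmul4(H4, diff), H4)  # H4 is symmetric, so H4 == H4^T
    return sum(abs(c) for row in coeffs for c in row)


def low_complexity_cost(original, predictor, mode_bits, lam):
    """Hypothetical cost: SATD plus a weighted count of mode bits."""
    return satd4x4(original, predictor) + lam * mode_bits
```

A mode decision loop would evaluate this cost for every candidate predictor and keep the mode with the smallest value.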
H.264/AVC Intra Coding Transform and Quantization 4x4 integer transform: Hadamard transform, DCT-based integer transform
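As an illustration of the DCT-based integer transform, the sketch below applies the 4x4 forward core matrix of H.264/AVC to a residual block. The function names are chosen for this example, and the post-scaling that the standard folds into quantization is left out.

```python
# Forward core transform matrix of the H.264/AVC 4x4 integer transform.
CF = [[1, 1, 1, 1],
      [2, 1, -1, -2],
      [1, -1, -1, 1],
      [1, -2, 2, -1]]


def matmul4(a, b):
    """Multiply two 4x4 integer matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def transpose4(m):
    """Transpose a 4x4 matrix."""
    return [[m[j][i] for j in range(4)] for i in range(4)]


def forward_core_transform(block):
    """Compute W = CF * X * CF^T for a 4x4 residual block X (no scaling)."""
    return matmul4(matmul4(CF, block), transpose4(CF))
```

A flat block of ones maps to a single DC coefficient of 16 with all other coefficients zero, which is a quick sanity check on the matrix.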
H.264/AVC Intra Coding Entropy Coding Context-Based Adaptive Binary Arithmetic Coding (CABAC) Context-Based Adaptive Variable Length Coding (CAVLC)
H.264/AVC Intra Coding Run-time percentage 720x480, 4:2:0, 30 fps: 10829 MIPS
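To put the 10829 MIPS figure in perspective, the arithmetic below converts it into an instruction budget per macroblock. It is a back-of-the-envelope sketch; the variable names are chosen for this example and only the frame size, frame rate, and MIPS value come from the slide.

```python
WIDTH, HEIGHT, FPS = 720, 480, 30   # SDTV format from the slide
MIPS = 10829                        # million instructions per second
MB_SIZE = 16                        # macroblock is 16x16 luma pixels

mbs_per_frame = (WIDTH // MB_SIZE) * (HEIGHT // MB_SIZE)
mbs_per_second = mbs_per_frame * FPS
instructions_per_mb = MIPS * 1_000_000 / mbs_per_second

print(mbs_per_frame, mbs_per_second, round(instructions_per_mb))
```

Under these numbers there are 1350 macroblocks per frame and 40500 per second, so the software encoder spends on the order of 267,000 instructions on each macroblock.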
Computation Reduction Intra Prediction Table look-up Cost generation Sub-sampling
Computation Reduction Fast Intra Prediction The smaller the mode number, the more likely the mode is to occur. However, global statistics cannot reflect the correlation of local modes, so local statistics of neighboring blocks are applied.
Computation Reduction Fast Intra Prediction Skip unlikely candidates
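One way to skip unlikely candidates using local statistics is sketched below. The selection rule is hypothetical and is not taken from the slides: it keeps the modes of the left and upper neighboring 4x4 blocks, always keeps DC (mode 2 in I4MB), and fills any remaining slots with the lowest mode numbers. The function name and parameters are assumptions of this example.

```python
DC_MODE = 2        # I4MB mode numbering: 0 vertical, 1 horizontal, 2 DC
NUM_I4_MODES = 9   # nine I4MB prediction modes in H.264/AVC


def candidate_modes(left_mode, up_mode, num_candidates):
    """Return a shortlist of I4MB modes to search (hypothetical rule).

    left_mode / up_mode are the modes of the neighboring blocks,
    or None when a neighbor is unavailable.
    """
    candidates = []
    # Local statistics first: neighbor modes, then DC as a safe fallback.
    for mode in (left_mode, up_mode, DC_MODE):
        if mode is not None and mode not in candidates:
            candidates.append(mode)
    # Global statistics second: smaller mode numbers are more likely.
    for mode in range(NUM_I4_MODES):
        if len(candidates) >= num_candidates:
            break
        if mode not in candidates:
            candidates.append(mode)
    return candidates[:num_candidates]
```

For example, `candidate_modes(5, 5, 4)` yields modes 5, 2, 0, and 1, so only four of the nine predictors need to be generated and costed.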
Computation Reduction [Figure: rate-distortion under different numbers of local-searched I4MB modes without insertion of full-search blocks; curves: all modes, 6 modes, 4 modes, 2 modes, 1 mode, DC]
Computation Reduction Fast Intra Prediction Prevention of error propagation: periodic insertion of full-search 4x4 blocks; adaptive threshold on the distortion for a MB. If min SATD of P > THMinSATD, then search all modes. THMinSATD = (min SATD of F) = 2.0. Block pattern in a MB (F = full search, P = partial search): F P F P / P P P P / F P F P / P P P P
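The two safeguards can be combined in a small decision function, sketched below. The pattern string follows the F/P sequence listed on the slide; the function name is hypothetical, and the threshold is passed in as a parameter because its exact derivation from the full-search blocks is not reproduced here.

```python
# F = full-search block, P = partial-search block, in the order on the slide.
PATTERN = "FPFPPPPPFPFPPPPP"


def needs_full_search(block_index, partial_min_satd, threshold):
    """Decide whether a 4x4 block must search all I4MB modes.

    block_index: position of the block within PATTERN (0..15).
    partial_min_satd: best SATD found by the partial search.
    threshold: THMinSATD, supplied by the caller (assumption of this sketch).
    """
    if PATTERN[block_index] == "F":
        return True                       # periodic insertion of full search
    return partial_min_satd > threshold   # adaptive fallback for P blocks
```

A P block whose best partial-search SATD stays at or below the threshold keeps its shortlisted result, which is where the computation saving comes from.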
Computation Reduction Subsampling Patterns
Computation Reduction Saved Computation and PSNR Drop PSNR drop < 0.3 dB. Global: subsampling + partial search using global statistics. Local: subsampling + partial search. Proposed: subsampling + partial search + periodic insertion of full search + adaptive SATD threshold.
Hardware Architecture Assumptions A RISC can execute one instruction per cycle, except multiplication, which requires two. A processing element (PE) can generate the predictors of one pixel per cycle.
Hardware Architecture Solutions [Table: luma and chroma at 30 fps; produce all modes per cycle vs. produce one mode per cycle; # of modes; avg. cycles per predictor]
Hardware Architecture Comparisons in different degrees of parallelism
Hardware Architecture [Figure: neighboring pixels A-M of the current block, with DRAM and register storage]
Hardware Architecture Four-Parallel Reconfigurable Intra Prediction Generator 8-bit adder 9-bit adder
Hardware Architecture [Figure: neighboring pixels A-M feeding the Intra Prediction Generator]
Hardware Architecture I16MB DC Prediction Mode (T = top neighbors, L = left neighbors)
PE0: Cycle 1: T0+T4+T8+T12; Cycle 2: +L0+L4+L8; Cycle 3: +L12
PE1: Cycle 1: T1+T5+T9+T13; Cycle 2: +L1+L5+L9; Cycle 3: +L13
PE2: Cycle 1: T2+T6+T10+T14; Cycle 2: +L2+L6+L10; Cycle 3: +L14
PE3: Cycle 1: T3+T7+T11+T15; Cycle 2: +L3+L7+L11; Cycle 3: +L15
Cycle 4: the four PE partial sums are added together
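The four-cycle schedule can be checked with a short software model. This is a sketch of the accumulation order only: the function name is hypothetical, and the final rounding `(sum + 16) >> 5` is the standard DC rule for the case where both top and left neighbors are available, which the slide does not show explicitly.

```python
def i16mb_dc(top, left):
    """Model the 4-PE accumulation of the I16MB DC predictor.

    top, left: lists of the 16 reconstructed neighbors above and to the left.
    """
    partial = [0, 0, 0, 0]
    for pe in range(4):
        # Cycle 1: each PE adds four top neighbors with stride 4.
        partial[pe] += top[pe] + top[pe + 4] + top[pe + 8] + top[pe + 12]
        # Cycle 2: accumulate three left neighbors.
        partial[pe] += left[pe] + left[pe + 4] + left[pe + 8]
        # Cycle 3: accumulate the last left neighbor.
        partial[pe] += left[pe + 12]
    # Cycle 4: combine the four partial sums, then round and divide by 32.
    total = partial[0] + partial[1] + partial[2] + partial[3]
    return (total + 16) >> 5
```

Every neighbor is touched exactly once, so the result equals the direct formula `(sum(top) + sum(left) + 16) >> 5`.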
Hardware Architecture I16MB Plane Prediction Mode

$$\mathrm{Pred}[y, x] = \mathrm{Clip1}\big((a + b\,(x - 7) + c\,(y - 7) + 16) \gg 5\big)$$
$$a = 16\,(p[-1, 15] + p[15, -1])$$
$$b = (5H + 32) \gg 6, \qquad c = (5V + 32) \gg 6$$
$$H = \sum_{x'=0}^{7} (x' + 1)\,\big(p[-1, 8 + x'] - p[-1, 6 - x']\big)$$
$$V = \sum_{y'=0}^{7} (y' + 1)\,\big(p[8 + y', -1] - p[6 - y', -1]\big)$$

[Figure labels: Pred[0,0], Pred[0,4], Pred[0,8], Pred[0,12]; A0, A1, A2, A3]
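A direct software evaluation of the plane predictor is sketched below, using the slide's [row, column] indexing and including the +16 rounding term of the standard before the shift by 5. The function and argument names are assumptions of this example, and 8-bit samples are assumed for Clip1.

```python
def clip1(value):
    """Clip to the 8-bit sample range (assumes 8-bit video)."""
    return max(0, min(255, value))


def i16mb_plane(p_top, p_left, p_corner):
    """Return the 16x16 plane prediction as pred[y][x].

    p_top[x]  = p[-1, x]  (row above the macroblock), x = 0..15
    p_left[y] = p[y, -1]  (column left of the macroblock), y = 0..15
    p_corner  = p[-1, -1]
    """
    def top(x):
        return p_corner if x < 0 else p_top[x]

    def left(y):
        return p_corner if y < 0 else p_left[y]

    h = sum((i + 1) * (top(8 + i) - top(6 - i)) for i in range(8))
    v = sum((i + 1) * (left(8 + i) - left(6 - i)) for i in range(8))
    a = 16 * (p_top[15] + p_left[15])
    b = (5 * h + 32) >> 6
    c = (5 * v + 32) >> 6
    return [[clip1((a + b * (x - 7) + c * (y - 7) + 16) >> 5)
             for x in range(16)]
            for y in range(16)]
```

Because the predictor is linear in x and y, neighboring outputs differ by the constants b and c before the shift, so a hardware PE can step from a seed value with additions instead of multiplications.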
Hardware Architecture Transform (implemented by shifters and adders): DCT, iDCT, Hadamard
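To show why shifters and adders are enough, the sketch below computes one 1-D pass of the 4x4 forward integer transform as a butterfly with no multiplications. The function name is chosen for this example; the slides do not give the exact datapath, so this is only one plausible arrangement.

```python
def forward_1d(x0, x1, x2, x3):
    """One 1-D pass of the H.264/AVC 4x4 forward core transform.

    Equivalent to multiplying by the rows
    [1, 1, 1, 1], [2, 1, -1, -2], [1, -1, -1, 1], [1, -2, 2, -1],
    but using only additions, subtractions, and left shifts.
    """
    s03 = x0 + x3          # stage 1: sums and differences
    d03 = x0 - x3
    s12 = x1 + x2
    d12 = x1 - x2
    y0 = s03 + s12         # stage 2: combine, doubling via shift
    y1 = (d03 << 1) + d12
    y2 = s03 - s12
    y3 = d03 - (d12 << 1)
    return y0, y1, y2, y3
```

Applying this pass to the four rows and then to the four columns of a residual block gives the full 2-D transform with eight additions and two shifts per pass.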