1 A HIGH THROUGHPUT PIPELINED ARCHITECTURE FOR H.264/AVC DEBLOCKING FILTER
Kefalas Nikolaos, Theodoridis George
VLSI Design Lab., Electrical & Computer Eng. Department, University of Patras, Greece
2 Outline
1. Deblocking filter algorithm
2. Filtering ordering
3. Memory organization
4. Pipelined architecture
5. Synthesis results and comparisons
6. Conclusions and future work
3 Deblocking Filter Algorithm (1/3)
The deblocking filter is used in H.264/AVC to reduce blocking artifacts
– Improves subjective and objective quality and typically reduces the bit rate by 5-10%
It is performed on a macroblock (MB) basis after the macroblock reconstruction stage
It includes a large number of data-dependent branches
– Each 4x4 pixel area is processed up to four times
It accounts for over one-third (1/3) of the total decoding time
4 Deblocking Filter Algorithm (2/3)
Each MB is processed in 4x4 blocks
The vertical edges are filtered first, from left to right
– From edge V0 to edge V3
Then the horizontal edges are filtered, from top to bottom
– From edge H0 to edge H3
Eight pixels across two adjacent 4x4 sub-blocks are filtered at the same time
– The same process repeats for the chroma components
5 Deblocking Filter Algorithm (3/3)
Each sub-edge shares a single BS value
The BS, along with two thresholds α and β, decides the filtering strength of each sub-edge
– A filter-samples flag is calculated
Three filter types are used
– Strong filter (4- or 5-tap filter)
– Weak filter
– No filtering
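The decision described on this slide can be sketched as follows. This is a minimal illustration that follows the sample-level condition of the H.264/AVC standard; the function names and the string labels are hypothetical and do not come from the slides.

```python
def filter_samples_flag(bs, p1, p0, q0, q1, alpha, beta):
    """Return True when the samples across the edge should be filtered.

    p1, p0 are the two samples nearest the edge on one side,
    q0, q1 the two nearest on the other side.
    """
    return (bs != 0
            and abs(p0 - q0) < alpha
            and abs(p1 - p0) < beta
            and abs(q1 - q0) < beta)


def filter_type(bs, flag):
    """Map the BS value and the flag to one of the three filter types."""
    if not flag:
        return "none"
    return "strong" if bs == 4 else "weak"
```

A large step across the edge (for example p0 = 100, q0 = 150 with α = 10) is treated as a real image edge, so the flag is false and no filtering is applied.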
7 Filtering Order
During filtering, all four sub-edges of each sub-block are filtered, and almost all pixels are involved and updated
A suitable filtering order is needed to:
– Reduce the size of the on-chip memory for buffering intermediate data
– Increase data reuse
– Reduce the external memory accesses
– Simplify control and steering logic
– Avoid pipeline stalls due to data and resource hazards
8 Proposed Filtering Order
The vertical sub-edges are filtered in raster-scan order, followed by the horizontal ones
The filtering direction does not change until all vertical edges of luma and chroma have been filtered
The proposed order is in accordance with the standard
10 Memory Organization (1/2)
Four single-port memories are employed (sizes in bits; FW: frame width)
– Current-A (CM-A): 96x32
– Current-B (CM-B): 96x32
– Left_mem (LM): 32x32
– Upper_mem (UM): 2xFWx32 + 2x(FW/16)x32
Transpose buffers TR-P and TR-Q (4x32)
– Typical systolic array
All internal buses are 32 bits wide
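To give a feel for the sizes listed above, the following sketch adds up the four memories for a given frame width. It assumes that FW is measured in luma pixels, which the slides do not state explicitly; the function name and dictionary keys are illustrative only, and the transpose buffers are left out.

```python
WORD_BITS = 32  # all memories are organized in 32-bit words


def on_chip_memory_bits(frame_width):
    """Return the size in bits of each memory and their total.

    frame_width is assumed to be in luma pixels (an assumption).
    """
    sizes = {
        "CM-A": 96 * WORD_BITS,
        "CM-B": 96 * WORD_BITS,
        "LM": 32 * WORD_BITS,
        # 2xFWx32 for the upper neighbours plus 2x(FW/16)x32 extra bits
        "UM": 2 * frame_width * WORD_BITS
              + 2 * (frame_width // 16) * WORD_BITS,
    }
    sizes["total"] = sum(sizes.values())
    return sizes
```

Under this assumption, a 1920-pixel-wide frame needs 130,560 bits of upper memory, which dominates the 137,728-bit total.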
11 Memory Organization (2/2)
13 Algorithm Features
Computationally intensive operations of the deblocking filter algorithm
– LUT operations: retrieve the values α(IndexA), β(IndexB), c1(IndexA, BS)
– BS calculation
– Weak filter, BS(1~3): filtering, δ calculation, and clipping operations
– Strong filter, BS(4)
The introduced pipeline exploits specific algorithmic features
– BS is the same for all micro-edges of a sub-edge for the luma component
– BS of the luma component is reused for the chroma components
– For the 4:2:0 format, BS changes every 2 micro-edges in the chroma components
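The δ calculation and clipping of the weak filter can be sketched as below, using the standard H.264/AVC equations for the two samples nearest the edge. The sketch assumes 8-bit samples and takes the clipping bound `c` as already derived from the c1 table value; the function names are illustrative and not from the slides.

```python
def clip3(lo, hi, x):
    """Clamp x to the inclusive range [lo, hi]."""
    return max(lo, min(hi, x))


def weak_filter_p0_q0(p1, p0, q0, q1, c):
    """Filter the two samples nearest the edge for BS in 1..3.

    c is the clipping bound (assumed already derived from c1).
    Returns the updated p0, the updated q0, and the clipped delta.
    """
    delta = clip3(-c, c, (((q0 - p0) << 2) + (p1 - q1) + 4) >> 3)
    new_p0 = clip3(0, 255, p0 + delta)  # 8-bit samples assumed
    new_q0 = clip3(0, 255, q0 - delta)
    return new_p0, new_q0, delta
```

For a step of 8 across the edge (p = 100, q = 108), the unclipped δ is 3; a bound of c = 2 limits the correction to 2 on each side.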
14 Proposed Pipeline Organization
15 Pipeline Operation
Each sub-block needs 4 cycles to be processed
The BS unit spends 4 cycles (BS calculation and LUT operations)
– BS and LUT operations do not depend on pixel values
BS calculation and LUT operations are overlapped with the filtering operations for the luma component
Four initialization cycles are needed to calculate the BS and the α, β, c1 for the first luma sub-block
16 BS=4 Filtering
Filter equations are modified to improve delay and area
BS=4
– 13 adders instead of 28
Total components
– Adders: 31
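The idea of rewriting the BS=4 equations to share partial sums can be illustrated with the standard luma strong-filter equations for the p side. The rearrangement below is only one possible factoring, written for illustration; it is not claimed to be the authors' 13-adder design, and the function names are hypothetical.

```python
def strong_p_direct(p3, p2, p1, p0, q0, q1):
    """Standard H.264/AVC luma strong filter, p side, written directly."""
    new_p0 = (p2 + 2 * p1 + 2 * p0 + 2 * q0 + q1 + 4) >> 3
    new_p1 = (p2 + p1 + p0 + q0 + 2) >> 2
    new_p2 = (2 * p3 + 3 * p2 + p1 + p0 + q0 + 4) >> 3
    return new_p0, new_p1, new_p2


def strong_p_shared(p3, p2, p1, p0, q0, q1):
    """Same outputs, with partial sums reused across the three results."""
    s = p1 + p0 + q0  # appears in all three outputs
    t = s + p2        # reused by p0', p1' and p2'
    new_p0 = (t + s + q1 + 4) >> 3
    new_p1 = (t + 2) >> 2
    new_p2 = (((p3 + p2) << 1) + t + 4) >> 3
    return new_p0, new_p1, new_p2
```

Both functions return identical results for any inputs, while the second needs fewer additions because `s` and `t` are computed once.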
17 Pipeline Benefits
LUT operations and BS calculation are not squeezed into a single pipeline stage
– The BS unit has 4 cycles available
The filtering operations are spread over three pipeline stages
The BS values are reused for filtering the chroma components
The original filtering equations are modified (improved performance and area)
The proposed ordering eases control logic and memory addressing, avoiding any potential critical-path increase
18 Edge Filter Process
Block Cycle: 0 1 2 3 4
Filtered Sub-edge: 0 1 2 3 4
PIN: L0 B0 B1 B2 L1
QIN: B0 B1 B2 B3 B4
TR_P-W: B0 B1 B2
TR_P-R: B0 B1
TR_Q-W: B3
TR_Q-R:
CM_A-R: B0 B1 B2 B3 B4
CM_B-W: B0 B1
LM-R: L0 L1
LM-W:
UPM-W:
Ext_M-W: L0
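The PIN and QIN rows above are consistent with a raster-scan walk over the luma sub-blocks, where the first block of each row is paired with a left-neighbour block. The sketch below reproduces that pairing for the luma vertical sub-edges only; the function name is hypothetical, and the assumption that blocks B0..B15 are numbered in raster order is inferred from the table rather than stated in the slides.

```python
def luma_vertical_pairs(blocks_per_row=4, rows=4):
    """List (PIN, QIN) block pairs for the luma vertical sub-edges.

    Blocks B0..B15 are assumed to be numbered in raster order and
    L0..L3 are assumed to be the left-neighbour blocks of each row.
    """
    pairs = []
    for r in range(rows):
        for c in range(blocks_per_row):
            q = f"B{r * blocks_per_row + c}"
            # The leftmost block of a row is filtered against the left MB
            p = f"L{r}" if c == 0 else f"B{r * blocks_per_row + c - 1}"
            pairs.append((p, q))
    return pairs
```

The first five pairs are (L0, B0), (B0, B1), (B1, B2), (B2, B3), (L1, B4), matching the five sub-edges listed in the table.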
19 Vertical Edge Filter Process
Total cycles = 4 x 27 = 108
– If a two-port memory were used, the total would be 4 x 24 = 96 cycles, which is the optimum
Block Cycle:
Filtered Sub-edge:
PIN: L0 B0 B1 B2 L1 B4 B5 … L3 B12 B13 … L1 B22
QIN: B0 B1 B2 B3 B4 B5 B6 B12 B13 B14 B22 B23
TR_P-W: B0 B1 B2 B4 … B10 L3 B12 … B20 L1 B22
TR_P-R: B0 B1 B2 B9 B10 L3 B20 B22
TR_Q-W: B3 … B11 … B21 B23
TR_Q-R: 3 B11 B19 B21 B23
CM_A-R: B0 B1 B2 B3 B4 B5 B6 … B12 B13 B14 … B22 B23
CM_B-W: B0 B1 B2 B3 B9 B10 B11 B19 B20 B21 B22 B23
LM-R: L0 L1 … L3 … L1
LM-W:
UPM-W: L3 L1
Ext_M-W: L0 L1
20 Processing Cycles
Vertical edges: 108 cycles
Horizontal edges: 108 cycles
Initialization: 10 cycles
– 6 to fetch coding information and initialize control
– 4 for the first BS calculation
Normal operation: 226 cycles
For the last row (edges 27, 31, 35, 41, 45): 5 x 4 = 20 extra cycles
– Resource hazard (bus conflict)
For the last MB in a frame, 12 extra cycles are needed (edges 39, 43, 47)
– Resource hazard (bus conflict)
Worst-case total: 258 cycles
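The cycle budget on this slide can be checked with a few lines of arithmetic. The sketch below only restates the numbers given above; the function and constant names are illustrative, and it assumes that the last MB of a frame also pays the last-row penalty, which is what the 258-cycle worst case implies.

```python
VERTICAL_CYCLES = 4 * 27    # 27 sub-edge slots, 4 cycles each
HORIZONTAL_CYCLES = 4 * 27
INIT_CYCLES = 6 + 4         # fetch/initialize plus first BS calculation
LAST_ROW_EXTRA = 5 * 4      # five edges, 4 cycles each
LAST_MB_EXTRA = 12


def cycles_per_mb(last_row=False, last_mb=False):
    """Return the cycles needed to filter one macroblock."""
    total = VERTICAL_CYCLES + HORIZONTAL_CYCLES + INIT_CYCLES
    if last_row or last_mb:  # the last MB also lies in the last row
        total += LAST_ROW_EXTRA
    if last_mb:
        total += LAST_MB_EXTRA
    return total
```

This gives 226 cycles in normal operation, 246 for a last-row MB, and 258 for the last MB of the frame.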
22 Experimental Setup
Synthesis setup
– Synopsys Design Compiler
– TSMC 0.18 um
FPGA proven
– Stand-alone, compared with the JM reference software
– It has also been verified as part of an H.264 hardware encoder
– It achieves 280 MHz on a Virtex-5, speed grade 3
23 Synthesis Results and Comparisons
Columns: [5] (2008), [6] (2008), [7] (2009), [8] (2006), Proposed
Pipeline stages: 5, 5, 4, 5, 5
Filtering order: Hybrid Impr. Sequential
Local RAMs (bits): 1P 1 2x96x32 1P 96x32, 2P 1 32x32 1P 32x32 1P 96x32, 1P 32x32 1P 96x32, 2P 32x32 1P 2x96x32, 1P 32x32
Upper neighbour RAM (bits): 1P 2FWx32, N/A, 1P 2FW 2 x32, 1P 1.5FWx32, 1P 2FWx32
Coding information RAM (bits): N/A 2(FW/16)x32 7
Transpose buffers (4x32 bits): 7, 1, 5, 2, 2
Technology (μm): 0.18
Gate count (10^3 gates):
Kernel processing (cycles/MB): 204 210/ / /246 6
Max frequency (MHz): (1.8x up to 4x)
Throughput (10^3 MB/s): (1.5x up to 3.8x)
Fps – Full HD (1920x1080):
Fps – Ultra HD (3840x2160):
1: 1P: Single-Port, 2P: Two-Port
2: FW: Frame width
3: Filtering cycles only
4: Filtering cycles only
5: It takes 246 cycles to filter a MB at the right frame boundary
6: It takes 246 cycles to filter a MB at the bottom frame row
7: The 2x(FW/16)x32 bits are stored in the upper memory
24 Conclusions
A novel high-speed pipelined architecture for the H.264/AVC deblocking filter is proposed
It operates at 400 MHz and occupies 19.2 Kgates in 0.18 um CMOS technology
It achieves 216 and 54 fps for Full HD and Ultra HD frames, respectively
Only single-port memories are employed
No external memory accesses are needed during filtering
– Parameters and neighbors are stored internally
– Only fully filtered data are written to external memories
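The frame rates quoted above can be related to the clock frequency and the per-MB cycle count with a short calculation. The sketch assumes that every MB takes the 226 normal-operation cycles (ignoring the extra cycles at frame boundaries) and that frame dimensions are padded up to whole 16x16 macroblocks; both are simplifying assumptions, and the function name is hypothetical.

```python
import math


def frames_per_second(width, height, clock_hz=400_000_000, cycles_per_mb=226):
    """Estimate the frame rate from the clock and the cycles per MB.

    Assumes a constant cycle count per MB and frames padded to
    whole 16x16 macroblocks.
    """
    mbs_per_frame = math.ceil(width / 16) * math.ceil(height / 16)
    return clock_hz / (cycles_per_mb * mbs_per_frame)
```

Under these assumptions the estimate is roughly 216.9 fps for 1920x1080 and 54.6 fps for 3840x2160, in line with the 216 and 54 fps stated on the slide.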
25 Questions ???
26 Hardware Architecture (Pipeline organization) 5/ Threshold Calculation
27 BS=4 Filtering
28 Deblocking Filter Algorithm 3/3
Each sub-edge between two adjacent 4x4 luma sub-blocks shares a Boundary Strength (BS)
The BS value, along with two threshold variables α and β, decides the filtering strength of the sub-edge
29 Hardware Architecture (Pipeline organization) 5/ Bs 1,2,3 filter
30 Deblocking Filter Algorithm 4/4
Boundary strength across horizontal edges
– The boundary strength is calculated for each sub-edge of the luma component
– It is reused for the chroma components in a 2:1 ratio for the 4:2:0 format
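The 2:1 reuse can be sketched as an index mapping. In 4:2:0, an 8-sample chroma edge corresponds to a 16-sample luma edge made of four 4-sample sub-edges, so each luma BS value covers two chroma samples. The function name and argument layout below are illustrative assumptions, not taken from the slides.

```python
def chroma_bs(luma_bs, k):
    """Return the BS used for chroma sample k along an MB edge (4:2:0).

    luma_bs holds the four BS values of the luma sub-edges along the
    same 16-sample MB edge; k is the chroma sample index, 0..7.
    """
    # Chroma sample k sits at luma position 2k, i.e. sub-edge (2k) // 4
    return luma_bs[k // 2]
```

With luma BS values [4, 3, 1, 0], the eight chroma samples use 4, 4, 3, 3, 1, 1, 0, 0, so the BS changes every two chroma micro-edges.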