1 A HIGH THROUGHPUT PIPELINED ARCHITECTURE FOR H.264/AVC DEBLOCKING FILTER Kefalas Nikolaos, Theodoridis George VLSI Design Lab. Electrical & Computer.

Presentation transcript:

1 A HIGH THROUGHPUT PIPELINED ARCHITECTURE FOR H.264/AVC DEBLOCKING FILTER Kefalas Nikolaos, Theodoridis George VLSI Design Lab. Electrical & Computer Eng. Department University of Patras, Greece

2 Outline
1. Deblocking filter algorithm
2. Filtering ordering
3. Memory organization
4. Pipelined architecture
5. Synthesis results and comparisons
6. Conclusions and future work

3 Deblocking Filter Algorithm (1/3)
• The deblocking filter is used in H.264/AVC to reduce blocking artifacts
  – It improves subjective and objective quality and typically reduces the bit-rate by 5-10%
• It is performed on a macroblock (MB) basis after the macroblock reconstruction stage
• It includes a large number of data-dependent branches
  – Each 4x4 pixel area is processed up to four times
• It accounts for over one-third (1/3) of the total decoding time

4 Deblocking Filter Algorithm (2/3)
• Each MB is processed in 4x4 blocks
• The vertical edges are filtered first, moving rightwards
  – From edge V0 to edge V3
• The horizontal edges are filtered next, moving downwards
  – From edge H0 to edge H3
• The 8 pixels across the boundary of two adjacent 4x4 sub-blocks are filtered at the same time
  – The same process repeats for the chroma components

5 Deblocking Filter Algorithm (3/3)
• Each sub-edge has a single boundary strength (BS) value
• The BS, together with two thresholds α and β, decides the filtering strength of each sub-edge
  – A filter samples flag is calculated
• Three filter types are used
  – Strong filter (4- or 5-tap filter)
  – Weak filter
  – No filtering
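The decision described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' hardware: the condition follows the filter samples flag defined in the H.264/AVC standard, the function names are hypothetical, and the mapping of BS=4 to the strong filter and BS 1 to 3 to the weak filter is taken from the later slides of this deck.

```python
def filter_samples_flag(bs, alpha, beta, p1, p0, q0, q1):
    """True when the samples on both sides of the edge should be filtered.

    p1, p0 are the two samples nearest the edge on the P side,
    q0, q1 the two nearest on the Q side (8-bit values).
    """
    return (bs != 0
            and abs(p0 - q0) < alpha
            and abs(p1 - p0) < beta
            and abs(q1 - q0) < beta)


def select_filter(bs, alpha, beta, p1, p0, q0, q1):
    """Return 'none', 'weak' (BS 1..3) or 'strong' (BS 4)."""
    if not filter_samples_flag(bs, alpha, beta, p1, p0, q0, q1):
        return "none"
    return "strong" if bs == 4 else "weak"
```

A large step across the edge (for example p0=50, q0=120 with α=10) fails the first threshold test, so a real image edge is left unfiltered regardless of BS.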


7 Filtering Order
• During filtering, all four sub-edges of each sub-block are filtered, and almost all pixels are involved and updated
• A suitable filtering order is needed to:
  – Reduce the size of the on-chip memory for buffering intermediate data
  – Increase data reuse
  – Reduce the external memory accesses
  – Simplify control and steering logic
  – Avoid pipeline stalls due to data and resource hazards

8 Proposed Filtering Order
• The vertical sub-edges are filtered in raster-scan order, followed by the horizontal ones
• The filtering direction is not changed until all vertical edges of luma and chroma have been filtered
• The proposed order complies with the standard
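As a rough illustration of the raster-scan order, the sketch below lists the (P, Q) sub-block pairs for the vertical luma sub-edges of one MB. The names B0..B15 and L0..L3 follow the PIN/QIN rows of the schedule slides later in the deck; the function name and the restriction to luma are assumptions made for this example.

```python
def luma_vertical_pairs():
    """(P, Q) sub-block pairs for the vertical luma sub-edges, raster order.

    B0..B15 are the 4x4 luma sub-blocks of the current MB in raster order,
    L0..L3 are the left-neighbour sub-blocks, one per row of sub-blocks.
    """
    pairs = []
    for row in range(4):
        for col in range(4):
            q = f"B{4 * row + col}"
            # The first sub-block of each row is filtered against the left neighbour.
            p = f"L{row}" if col == 0 else f"B{4 * row + col - 1}"
            pairs.append((p, q))
    return pairs


CYCLES_PER_SUB_EDGE = 4
LUMA_SUB_EDGES = len(luma_vertical_pairs())   # 16
CHROMA_SUB_EDGES = 2 * 4                      # Cb and Cr, four 4x4 blocks each (4:2:0)
ideal_cycles = CYCLES_PER_SUB_EDGE * (LUMA_SUB_EDGES + CHROMA_SUB_EDGES)
```

The first pairs come out as (L0, B0), (B0, B1), (B1, B2), (B2, B3), (L1, B4), and `ideal_cycles` evaluates to 96, the two-port optimum quoted on the vertical edge schedule slide.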


10 Memory Organization (1/2)
• Four single-port memories are employed (sizes in bits)
  – Current-A (CM-A): 96x32
  – Current-B (CM-B): 96x32
  – Left_mem (LM): 32x32
  – Upper_mem (UM): 2xFWx32 + 2x(FW/16)x32, where FW is the frame width
• Transpose buffers TR-P and TR-Q (4x32)
  – Typical systolic array
• All internal buses are 32 bits
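The memory depths on this slide are consistent with counting 4x4 sub-blocks of 8-bit samples, four 32-bit words per sub-block. The sketch below reproduces them under three assumptions that are not stated in the slides: 4:2:0 sampling, FW measured in pixels, and two 32-bit coding-information words per MB.

```python
WORD_BITS = 32
WORDS_PER_BLOCK = 4        # one 4x4 block of 8-bit samples = four 32-bit words

LUMA_BLOCKS_PER_MB = 16    # 16x16 luma
CHROMA_BLOCKS_PER_MB = 8   # 8x8 Cb + 8x8 Cr in 4:2:0

# Current MB buffer (CM-A or CM-B): every sub-block of one MB.
CURRENT_MB_WORDS = (LUMA_BLOCKS_PER_MB + CHROMA_BLOCKS_PER_MB) * WORDS_PER_BLOCK

# Left neighbour: one column of sub-blocks (4 luma + 2 Cb + 2 Cr).
LEFT_WORDS = (4 + 2 + 2) * WORDS_PER_BLOCK


def upper_mem_words(frame_width):
    """Depth of the upper memory in 32-bit words for a frame width in pixels."""
    mbs_per_row = frame_width // 16
    # One row of sub-blocks per MB (4 luma + 2 Cb + 2 Cr): equals 2 * FW words.
    neighbour_words = mbs_per_row * (4 + 2 + 2) * WORDS_PER_BLOCK
    coding_info_words = 2 * mbs_per_row      # the 2 x (FW/16) term
    return neighbour_words + coding_info_words
```

For a 1920-pixel-wide frame this gives 4080 words, or 4080 x 32 = 130,560 bits of upper memory.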

11 Memory Organization (2/2)


13 Algorithm Features
• Computationally intensive operations of the deblocking filter algorithm
  – LUT operations: retrieve the values α(IndexA), β(IndexB), c1(IndexA, BS)
  – BS calculation
  – Weak filter, BS (1~3): filtering, δ calculation and clipping operations
  – Strong filter, BS (4)
• The introduced pipeline exploits specific algorithmic features
  – BS is the same for all micro-edges of a sub-edge for the luma component
  – BS of the luma component is reused for the chroma components
  – For the (4:2:0) format, BS changes every 2 micro-edges in the chroma components
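The δ calculation and clipping mentioned for the weak filter can be sketched as follows. The formula is the one in the H.264/AVC standard for BS below 4; the function names are hypothetical, `c` stands for the final clipping bound derived from the LUT value, and the optional p1/q1 updates are left out for brevity.

```python
def clip3(lo, hi, x):
    """Clamp x to the inclusive range [lo, hi]."""
    return max(lo, min(hi, x))


def weak_filter(p1, p0, q0, q1, c):
    """Weak (BS 1..3) filtering of the two samples adjacent to the edge.

    Returns (new_p0, new_q0, delta) for 8-bit samples.
    """
    # Python's >> on negative integers is an arithmetic shift, as in the standard.
    delta = clip3(-c, c, (((q0 - p0) << 2) + (p1 - q1) + 4) >> 3)
    new_p0 = clip3(0, 255, p0 + delta)
    new_q0 = clip3(0, 255, q0 - delta)
    return new_p0, new_q0, delta
```

With p1=p0=100, q0=q1=108 and c=2, the unclipped δ is 3, which is clipped to 2, giving 102 and 106: the step across the edge shrinks from 8 to 4.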

14 Proposed Pipeline Organization

15 Pipeline Operation
• Each sub-block needs 4 cycles to be processed
• The BS unit spends 4 cycles (BS calculation & LUT operations)
  – BS and LUT operations do not depend on pixel values
• BS calculation & LUT operations are overlapped with the filtering operations for the luma component
• Four initialization cycles are needed to calculate the BS and the α, β, c1 values for the first luma sub-block

16 BS=4 Filtering
• Filter equations are modified to improve delay & area
• BS=4
  – 13 adders instead of 28
• Total components
  – Adders: =31
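The slide does not reproduce the modified equations, so the sketch below is only an illustration of the general idea of sharing partial sums. `strong_p_reference` uses the standard's luma strong-filter equations for the P side (the case where all three P samples are updated); `strong_p_shared` is a hypothetical regrouping that reuses two partial sums and is not claimed to be the authors' formulation or to match their adder count.

```python
def strong_p_reference(p3, p2, p1, p0, q0, q1):
    """Standard luma strong filter (BS=4), P side, written term by term."""
    new_p0 = (p2 + 2 * p1 + 2 * p0 + 2 * q0 + q1 + 4) >> 3
    new_p1 = (p2 + p1 + p0 + q0 + 2) >> 2
    new_p2 = (2 * p3 + 3 * p2 + p1 + p0 + q0 + 4) >> 3
    return new_p0, new_p1, new_p2


def strong_p_shared(p3, p2, p1, p0, q0, q1):
    """Same results with shared partial sums (illustrative regrouping)."""
    s = p1 + p0 + q0          # shared by all three outputs
    t = p2 + s                # shared by all three outputs
    new_p0 = (t + s + q1 + 4) >> 3
    new_p1 = (t + 2) >> 2
    new_p2 = (((p3 + p2) << 1) + t + 4) >> 3
    return new_p0, new_p1, new_p2
```

Both functions return (40, 35, 25) for the samples p3..q1 = 10, 20, 30, 40, 50, 60; the shared version needs fewer additions because `s` and `t` are computed once.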

17 Pipeline Benefits
• LUT operations and BS calculation are not squeezed into a single pipeline stage
  – The BS unit has 4 cycles
• The filtering operations are spread over three pipeline stages
• The BS values are reused for filtering the chroma components
• The original filtering equations are modified (improves performance & area)
• The proposed ordering eases control logic and memory addressing, avoiding any potential critical path increase

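18 Edge Filter Process
(Schedule table, one row per signal; entries are listed in cycle order, and the column positions of partially filled rows did not survive extraction.)
Block cycle: 0 1 2 3 4
Filtered sub-edge: 0 1 2 3 4
PIN: L0 B0 B1 B2 L1
QIN: B0 B1 B2 B3 B4
TR_P-W: B0 B1 B2
TR_P-R: B0 B1
TR_Q-W: B3
TR_Q-R:
CM_A-R: B0 B1 B2 B3 B4
CM_B-W: B0 B1
LM-R: L0 L1
LM-W:
UPM-W:
Ext_M-W: L0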

19 Vertical Edge Filter Process
• Total cycles = 4*27 = 108
  – If a two-port memory had been used, the total would be 4x24 = 96 cycles, which is the optimum
(Schedule table, one row per signal; entries are listed in cycle order, and the column positions did not survive extraction.)
PIN: L0 B0 B1 B2 L1 B4 B5 … L3 B12 B13 … L1 B22
QIN: B0 B1 B2 B3 B4 B5 B6 B12 B13 B14 B22 B23
TR_P-W: B0 B1 B2 B4 … B10 L3 B12 … B20 L1 B22
TR_P-R: B0 B1 B2 B9 B10 L3 B20 B22
TR_Q-W: B3 … B11 … B21 B23
TR_Q-R: 3 B11 B19 B21 B23
CM_A-R: B0 B1 B2 B3 B4 B5 B6 … B12 B13 B14 … B22 B23
CM_B-W: B0 B1 B2 B3 B9 B10 B11 B19 B20 B21 B22 B23
LM-R: L0 L1 … L3 … L1
LM-W:
UPM-W: L3 L1
Ext_M-W: L0 L1

20 Processing Cycles
• Vertical edges: 108 cycles
• Horizontal edges: 108 cycles
• Initialization: 10 cycles
  – 6 to fetch coding info and initialize control
  – 4 for the first BS calculation
• Normal operation: 226 cycles
• For the last row (edges 27, 31, 35, 41, 45): 5x4 = 20 extra cycles
  – Resource hazard (bus conflict)
• For the last MB in the frame, 12 extra cycles are needed (edges 39, 43, 47)
  – Resource hazard (bus conflict)
• Worst case total cycles: 258
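The cycle budget on this slide adds up as shown in the short check below. The constant names are hypothetical; the numbers are the ones quoted on the slides, with 4 cycles per sub-edge and 27 sub-edge slots per filtering direction.

```python
CYCLES_PER_SUB_EDGE = 4
SLOTS_PER_DIRECTION = 27                 # from "4*27 = 108" on the schedule slide

vertical = CYCLES_PER_SUB_EDGE * SLOTS_PER_DIRECTION      # 108
horizontal = CYCLES_PER_SUB_EDGE * SLOTS_PER_DIRECTION    # 108
initialization = 6 + 4                   # coding info/control + first BS calculation

normal_mb = vertical + horizontal + initialization        # 226
last_row_mb = normal_mb + 5 * CYCLES_PER_SUB_EDGE         # 5 stalled edges -> 246
last_mb_in_frame = last_row_mb + 3 * CYCLES_PER_SUB_EDGE  # 3 more edges -> 258
```

The 246-cycle figure matches the frame-boundary cases in the footnotes of the comparison table, and 258 is the worst case quoted above.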


22 Experimental Setup
• Synthesis setup
  – Synopsys Design Compiler
  – TSMC 0.18 µm
• FPGA proven
  – Stand-alone, compared with the JM reference software
  – It has also been verified as part of an H.264 hardware encoder
  – It achieves 280 MHz in Virtex 5 speed grade 3

23 Synthesis Results and Comparisons
(Comparison table, one row per line; several numeric cells and the column alignment did not survive extraction, and the surviving entries are kept in their original order.)
Designs compared: [5] (2008), [6] (2008), [7] (2009), [8] (2006), Proposed
Pipeline stages: 5, 5, 4, 5, 5
Filtering order: Hybrid Impr. Sequential
Local RAMs (bits): 1P(1) 2x96x32 1P 96x32, 2P(1) 32x32 1P 32x32 1P 96x32, 1P 32x32 1P 96x32, 2P 32x32 1P 2x96x32, 1P 32x32
Upper neighbour RAM (bits): 1P 2FWx32, N/A, 1P 2FW(2)x32, 1P 1.5FWx32, 1P 2FWx32
Coding information RAM (bits): N/A 2(FW/16)x32 (7)
Transpose buffers (4x32 bits): 71522
Technology (μm): 0.18
Gate count (10^3 gates):
Kernel processing (cycles/MB): 204210/ / /246 (6)
Max frequency (MHz): (1.8x up to 4x)
Throughput (10^3 MB/s): (1.5x up to 3.8x)
Fps, Full HD (1920x1080):
Fps, Ultra HD (3840x2160):
Notes: (1) 1P: single-port, 2P: two-port. (2) FW: frame width. (3) Filtering cycles only. (4) Filtering cycles only. (5) It takes 246 cycles to filter an MB at the right frame boundary. (6) It takes 246 cycles to filter an MB at the bottom frame row. (7) The 2x(FW/16)x32 bits are stored in the upper memory.

24 Conclusions
• A novel high-speed pipelined architecture for the H.264/AVC deblocking filter is proposed
• It operates at 400 MHz and occupies 19.2 Kgates in 0.18 µm CMOS technology
• It achieves 216 and 54 fps for Full HD and Ultra HD frames, respectively
• Only single-port memories are employed
• No external memory accesses are needed during filtering
  – Parameters and neighbors are stored internally
  – Only fully filtered data are written to external memories
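The frame rates quoted here can be reproduced with a back-of-the-envelope calculation. The sketch assumes 226 cycles per MB (the normal-operation figure from the Processing Cycles slide, ignoring the extra boundary cycles), 16x16 MBs, and a frame height rounded up to a whole number of MBs; the function name is hypothetical.

```python
def frames_per_second(clock_hz, cycles_per_mb, width, height):
    """Approximate deblocking frame rate for a given clock and frame size."""
    mbs_wide = -(-width // 16)       # ceiling division
    mbs_high = -(-height // 16)      # 1080 lines -> 68 MB rows
    mbs_per_frame = mbs_wide * mbs_high
    mbs_per_second = clock_hz / cycles_per_mb
    return mbs_per_second / mbs_per_frame


full_hd = frames_per_second(400e6, 226, 1920, 1080)    # about 216.9
ultra_hd = frames_per_second(400e6, 226, 3840, 2160)   # about 54.6
```

Truncating the two results gives 216 and 54 fps, in line with the figures on this slide; at 400 MHz the MB rate is roughly 1.77 million MB/s.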

25 Questions ???

26 Hardware Architecture (Pipeline organization) 5/ Threshold Calculation

27 BS=4 Filtering

28 Deblocking Filter Algorithm 3/3
• Each sub-edge between two adjacent 4x4 luma sub-blocks has a single Boundary Strength (BS) value
• The BS value, along with two threshold variables, α and β, decides the filtering strength of the sub-edge

29 Hardware Architecture (Pipeline organization) 5/ Bs 1,2,3 filter

30 Deblocking Filter Algorithm 4/4
• Boundary strength across horizontal edges
  – The boundary strength is calculated for each sub-edge for the luma component
  – It is reused for the chroma components in a 2:1 ratio for the 4:2:0 format