Low Power and Low Area Transform–Quant & Inverse Quant–Inverse Transform Hardware Design for H.264 Encoder.

Outline
I. H.264 TQ & IQIT
II. DESIGNED HARDWARE
III. RESULTS

H.264 TQ & IQIT

Each residual macroblock is transformed and quantized. Previous standards such as MPEG-1, MPEG-2, MPEG-4 and H.263 used the 8x8 Discrete Cosine Transform (DCT) as the basic transform. The “baseline” profile of H.264 uses three transforms, depending on the type of residual data:
1) a transform for the 4x4 array of luma DC coefficients in intra macroblocks (predicted in 16x16 mode),
2) a transform for the 2x2 array of chroma DC coefficients (in any macroblock),
3) a transform for all other 4x4 blocks in the residual data.

Work accomplished... ( T, Q, IQ, IT)... Future work ( MC, toplevel,...)

Data within a macroblock are transmitted in the order shown in the figure. If the macroblock is coded in 16x16 Intra mode, then the block labelled “-1” is transmitted first, containing the DC coefficient of each 4x4 luma block. Next, the luma residual blocks 0-15 are transmitted in the order shown (with the DC coefficient set to zero in a 16x16 Intra macroblock). Blocks 16 and 17 contain a 2x2 array of DC coefficients from the Cb and Cr chroma components respectively. Finally, the chroma residual blocks (with zero DC coefficients) are sent. The entire process of transform and quantization can be carried out using 16-bit integer arithmetic.

4x4 Integer Transform & Inverse Transform

It is an integer transform. The core part of the transform is multiply-free: it requires only additions and shifts. A scaling multiplication (part of the complete transform) is integrated into the quantizer, reducing the total number of multiplications.
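The multiply-free property can be illustrated with a small sketch of the 1-D core transform. It assumes the 4-point core matrix with rows (1, 1, 1, 1), (2, 1, -1, -2), (1, -1, -1, 1), (1, -2, 2, -1); the name `forward_1d` is hypothetical, and the butterfly grouping is one possible arrangement, not necessarily the one used in the designed hardware.

```python
def forward_1d(a, b, c, d):
    """1-D core transform of four samples using only adds, subtracts and shifts."""
    s0 = a + d          # outer sum
    s1 = b + c          # inner sum
    d0 = a - d          # outer difference
    d1 = b - c          # inner difference
    return (
        s0 + s1,            # a + b + c + d
        (d0 << 1) + d1,     # 2a + b - c - 2d
        s0 - s1,            # a - b - c + d
        d0 - (d1 << 1),     # a - 2b + 2c - d
    )


print(forward_1d(1, 2, 3, 4))
```

Each output costs at most one shift and a few additions, so a 2-D transform built from this stage (rows, then columns) needs no multiplier.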

4x4 Forward Integer Transform [(x0+x4+x8+x12) + (x1+x5+x9+x13) + (x2+x6+x10+x14) + (x3+x7+x11+x15), 2*(x0+x4+x8+x12) + (x1+x5+x9+x13) - (x2+x6+x10+x14) - 2*(x3+x7+x11+x15), (x0+x4+x8+x12) - (x1+x5+x9+x13) - (x2+x6+x10+x14) + (x3+x7+x11+x15), (x0+x4+x8+x12) - 2*(x1+x5+x9+x13) + 2*(x2+x6+x10+x14) - (x3+x7+x11+x15); (2*x0+x4-x8-2*x12) + (2*x1+x5-x9-2*x13) + (2*x2+x6-x10-2*x14) + (2*x3+x7-x11-2*x15), 2*(2*x0+x4-x8-2*x12) + (2*x1+x5-x9-2*x13) - (2*x2+x6-x10-2*x14) - 2*(2*x3+x7-x11-2*x15), (2*x0+x4-x8-2*x12) - (2*x1+x5-x9-2*x13) - (2*x2+x6-x10-2*x14) + (2*x3+x7-x11-2*x15), (2*x0+x4-x8-2*x12) - 2*(2*x1+x5-x9-2*x13) + 2*(2*x2+x6-x10-2*x14) - (2*x3+x7-x11-2*x15); (x0-x4-x8+x12) + (x1-x5-x9+x13) + (x2-x6-x10+x14) + (x3-x7-x11+x15), 2*(x0-x4-x8+x12) + (x1-x5-x9+x13) - (x2-x6-x10+x14) - 2*(x3-x7-x11+x15), (x0-x4-x8+x12) - (x1-x5-x9+x13) - (x2-x6-x10+x14) + (x3-x7-x11+x15), (x0-x4-x8+x12) - 2*(x1-x5-x9+x13) + 2*(x2-x6-x10+x14) - (x3-x7-x11+x15); (x0-2*x4+2*x8-x12) + (x1-2*x5+2*x9-x13) + (x2-2*x6+2*x10-x14) + (x3-2*x7+2*x11-x15), 2*(x0-2*x4+2*x8-x12) + (x1-2*x5+2*x9-x13) - (x2-2*x6+2*x10-x14) - 2*(x3-2*x7+2*x11-x15), (x0-2*x4+2*x8-x12) - (x1-2*x5+2*x9-x13) - (x2-2*x6+2*x10-x14) + (x3-2*x7+2*x11-x15), (x0-2*x4+2*x8-x12) - 2*(x1-2*x5+2*x9-x13) + 2*(x2-2*x6+2*x10-x14) - (x3-2*x7+2*x11-x15)]
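The expanded expression above is the element-by-element form of a matrix product Cf · X · Cf^T, with the inputs x0..x15 stored row by row. The sketch below, which assumes that row-major layout and uses hypothetical helper names, computes the same result in matrix form so individual entries can be checked against the expansion.

```python
# Core forward transform matrix (rows match the coefficients in the expansion).
CF = [
    [1, 1, 1, 1],
    [2, 1, -1, -2],
    [1, -1, -1, 1],
    [1, -2, 2, -1],
]


def matmul(a, b):
    """Plain 4x4 integer matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def transpose(m):
    return [list(col) for col in zip(*m)]


def forward_transform(x):
    """Unscaled 4x4 forward transform: Cf * X * Cf^T."""
    return matmul(matmul(CF, x), transpose(CF))


# Test block whose sample values equal their indices: x[r][c] = 4*r + c.
x = [[4 * r + c for c in range(4)] for r in range(4)]
y = forward_transform(x)
print(y[0])
```

With this test block the first output is simply the sum of all sixteen inputs, which matches the first entry of the expansion.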

4x4 Inverse Integer Transform [(y0 + y4 + y8 + y12/2) + (y1 + y5 + y9 + y13/2) + (y2 + y6 + y10 + y14/2) + 1/2 * (y3 + y7 + y11 + y15/2), (y0 + y4 + y8 + y12/2) + 1/2 * (y1 + y5 + y9 + y13/2) - (y2 + y6 + y10 + y14/2) - (y3 + y7 + y11 + y15/2), (y0 + y4 + y8 + y12/2) - 1/2 * (y1 + y5 + y9 + y13/2) - (y2 + y6 + y10 + y14/2) + (y3 + y7 + y11 + y15/2), (y0 + y4 + y8 + y12/2) - (y1 + y5 + y9 + y13/2) + (y2 + y6 + y10 + y14/2) - 1/2 * (y3 + y7 + y11 + y15/2); (y0 + y4/2 - y8 - y12) + (y1 + y5/2 - y9 - y13) + (y2 + y6/2 - y10 - y14) + 1/2 * (y3 + y7/2 - y11 - y15), (y0 + y4/2 - y8 - y12) + 1/2 * (y1 + y5/2 - y9 - y13) - (y2 + y6/2 - y10 - y14) - (y3 + y7/2 - y11 - y15), (y0 + y4/2 - y8 - y12) - 1/2 * (y1 + y5/2 - y9 - y13) - (y2 + y6/2 - y10 - y14) + (y3 + y7/2 - y11 - y15), (y0 + y4/2 - y8 - y12) - (y1 + y5/2 - y9 - y13) + (y2 + y6/2 - y10 - y14) - 1/2 * (y3 + y7/2 - y11 - y15); (y0 - y4/2 - y8 + y12) + (y1 - y5/2 - y9 + y13) + (y2 - y6/2 - y10 + y14) + 1/2 * (y3 - y7/2 - y11 + y15), (y0 - y4/2 - y8 + y12) + 1/2 * (y1 - y5/2 - y9 + y13) - (y2 - y6/2 - y10 + y14) - (y3 - y7/2 - y11 + y15), (y0 - y4/2 - y8 + y12) - 1/2 * (y1 - y5/2 - y9 + y13) - (y2 - y6/2 - y10 + y14) + (y3 - y7/2 - y11 + y15), (y0 - y4/2 - y8 + y12) - (y1 - y5/2 - y9 + y13) + (y2 - y6/2 - y10 + y14) - 1/2 * (y3 - y7/2 - y11 + y15); (y0 - y4 + y8 - y12/2) + (y1 - y5 + y9 - y13/2) + (y2 - y6 + y10 - y14/2) + 1/2 * (y3 - y7 + y11 - y15/2), (y0 - y4 + y8 - y12/2) + 1/2 * (y1 - y5 + y9 - y13/2) - (y2 - y6 + y10 - y14/2) - (y3 - y7 + y11 - y15/2), (y0 - y4 + y8 - y12/2) - 1/2 * (y1 - y5 + y9 - y13/2) - (y2 - y6 + y10 - y14/2) + (y3 - y7 + y11 - y15/2), (y0 - y4 + y8 - y12/2) - (y1 - y5 + y9 - y13/2) + (y2 - y6 + y10 - y14/2) - 1/2 * (y3 - y7 + y11 - y15/2)]
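The inverse expression above corresponds to Ci^T · Y · Ci. The following sketch checks that this inverts the forward core transform once the coefficients are scaled. The per-position scaling used here (1/16, 1/20 or 1/25) is an assumption derived from Cf · Cf^T = diag(4, 10, 4, 10); it is not taken from the slides, and in the codec the scaling is folded into quantization and rescaling rather than applied as exact fractions.

```python
from fractions import Fraction

HALF = Fraction(1, 2)
CF = [[1, 1, 1, 1], [2, 1, -1, -2], [1, -1, -1, 1], [1, -2, 2, -1]]
CI = [[1, 1, 1, 1], [1, HALF, -HALF, -1], [1, -1, -1, 1], [HALF, -1, 1, -HALF]]
# Diagonal scaling so that Ci^T * diag(D) is the exact inverse of Cf.
D = [Fraction(1, 4), Fraction(1, 5), Fraction(1, 4), Fraction(1, 5)]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def transpose(m):
    return [list(col) for col in zip(*m)]


def forward(x):
    """Unscaled forward core transform: Cf * X * Cf^T."""
    return matmul(matmul(CF, x), transpose(CF))


def inverse(y):
    """Inverse core transform as in the expression above: Ci^T * Y * Ci."""
    return matmul(matmul(transpose(CI), y), CI)


def scale(y):
    """Exact per-position scaling (1/16, 1/20 or 1/25)."""
    return [[y[i][j] * D[i] * D[j] for j in range(4)] for i in range(4)]


def round_trip(x):
    return inverse(scale(forward(x)))


x = [[5, 11, 8, 10], [9, 8, 4, 12], [1, 10, 11, 4], [19, 6, 15, 7]]
print(round_trip(x) == x)
```

The factors of 1/2 in the inverse are single right shifts in hardware, which is why the inverse core is also multiply-free.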

Quantization

>> indicates a binary shift right. In the reference model software, f is 2^qbits/3 for Intra blocks or 2^qbits/6 for Inter blocks. For QP > 5, the factors MF remain unchanged but the divisor 2^qbits increases by a factor of 2 for each increment of 6 in QP.
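A minimal sketch of the quantizer described above, assuming the commonly published form |Z| = (|W| · MF + f) >> qbits with qbits = 15 + floor(QP/6). The quantization equation, the MF table and the function names are not preserved in this transcript; the table values below are the ones usually quoted for H.264 and should be checked against the standard before use.

```python
# MF for QP % 6 = 0..5; columns: (even,even) positions, (odd,odd) positions, others.
MF_TABLE = [
    (13107, 5243, 8066),
    (11916, 4660, 7490),
    (10082, 4194, 6554),
    (9362, 3647, 5825),
    (8192, 3355, 5243),
    (7282, 2893, 4559),
]


def mf(qp, i, j):
    """Multiplication factor for coefficient position (i, j)."""
    even_even, odd_odd, other = MF_TABLE[qp % 6]
    if i % 2 == 0 and j % 2 == 0:
        return even_even
    if i % 2 == 1 and j % 2 == 1:
        return odd_odd
    return other


def quantize(w, qp, i, j, intra=True):
    """Quantize one transform coefficient w at position (i, j)."""
    qbits = 15 + qp // 6                       # assumed definition of qbits
    f = (1 << qbits) // (3 if intra else 6)    # rounding offset
    z = (abs(w) * mf(qp, i, j) + f) >> qbits   # one multiply, one add, one shift
    return -z if w < 0 else z                  # restore the sign


print(quantize(100, 4, 0, 0), quantize(100, 10, 0, 0))
```

Because MF depends only on QP % 6, a six-row table covers every QP, and each step of 6 in QP only adds one to the shift amount.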

Inverse Quantization

The rescaled output increases by a factor of 2 for every increment of 6 in QP. The rescaling includes a further constant scaling factor of 64 to avoid rounding errors. The values at the output of the inverse transform are divided by 64 to remove this scaling factor.
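The doubling per 6 QP steps and the final division by 64 can be sketched as follows. The rescaling form W' = Z · V · 2^floor(QP/6), the V table and the rounding offset of 32 are assumptions based on the commonly published description of H.264; they are not preserved in this transcript, and the function names are hypothetical.

```python
# V for QP % 6 = 0..5; columns: (even,even) positions, (odd,odd) positions, others.
V_TABLE = [
    (10, 16, 13),
    (11, 18, 14),
    (13, 20, 16),
    (14, 23, 18),
    (16, 25, 20),
    (18, 29, 23),
]


def v(qp, i, j):
    """Rescaling factor for coefficient position (i, j)."""
    even_even, odd_odd, other = V_TABLE[qp % 6]
    if i % 2 == 0 and j % 2 == 0:
        return even_even
    if i % 2 == 1 and j % 2 == 1:
        return odd_odd
    return other


def rescale(z, qp, i, j):
    """Rescale a quantized coefficient; the output doubles every 6 QP steps."""
    return (z * v(qp, i, j)) << (qp // 6)


def descale(x):
    """Remove the constant factor of 64 after the inverse transform (rounded)."""
    return (x + 32) >> 6


print(rescale(25, 4, 0, 0), descale(400))
```

For a DC-only block the inverse core transform copies the rescaled DC value to every output, so a quantized DC level of 25 at QP 4 reconstructs to 6 per sample after `descale`.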

4x4 luma DC coefficient Transform & Quantization

Used in 16x16 Intra mode only. During decoding, an inverse Hadamard transform is applied, followed by rescaling (note that the order is not reversed as might be expected). The rescaling formula depends on whether QP is greater than or equal to 12 or less than 12.
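The two rescaling cases can be sketched as below. The formulas themselves are not preserved in this transcript; the sketch assumes the commonly published form, in which the position (0,0) rescaling factor V is used for every DC coefficient and the shift direction changes at QP = 12. The table values and the name `rescale_luma_dc` are assumptions and should be verified against the standard.

```python
# Assumed rescaling factor V at position (0,0) for QP % 6 = 0..5.
V00 = [10, 11, 13, 14, 16, 18]


def rescale_luma_dc(w, qp):
    """Rescale one inverse-Hadamard-transformed luma DC coefficient."""
    k = qp // 6
    if qp >= 12:
        # Left shift by floor(QP/6) - 2.
        return (w * V00[qp % 6]) << (k - 2)
    # QP < 12: right shift by 2 - floor(QP/6), with a rounding offset.
    return (w * V00[qp % 6] + (1 << (1 - k))) >> (2 - k)


for qp in (0, 6, 12, 18):
    print(qp, rescale_luma_dc(10, qp))
```

Under these assumptions the output still doubles for every increase of 6 in QP across the QP = 12 boundary, which is the behaviour the two cases are meant to preserve.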

4x4 Forward & Inverse Hadamard Transform [(z0+z4+z8+z12) + (z1+z5+z9+z13) + (z2+z6+z10+z14) + (z3+z7+z11+z15), (z0+z4+z8+z12) + (z1+z5+z9+z13) - (z2+z6+z10+z14) - (z3+z7+z11+z15), (z0+z4+z8+z12) - (z1+z5+z9+z13) - (z2+z6+z10+z14) + (z3+z7+z11+z15), (z0+z4+z8+z12) - (z1+z5+z9+z13) + (z2+z6+z10+z14) - (z3+z7+z11+z15); (z0+z4-z8-z12) + (z1+z5-z9-z13) + (z2+z6-z10-z14) + (z3+z7-z11-z15), (z0+z4-z8-z12) + (z1+z5-z9-z13) - (z2+z6-z10-z14) - (z3+z7-z11-z15), (z0+z4-z8-z12) - (z1+z5-z9-z13) - (z2+z6-z10-z14) + (z3+z7-z11-z15), (z0+z4-z8-z12) - (z1+z5-z9-z13) + (z2+z6-z10-z14) - (z3+z7-z11-z15); (z0-z4-z8+z12) + (z1-z5-z9+z13) + (z2-z6-z10+z14) + (z3-z7-z11+z15), (z0-z4-z8+z12) + (z1-z5-z9+z13) - (z2-z6-z10+z14) - (z3-z7-z11+z15), (z0-z4-z8+z12) - (z1-z5-z9+z13) - (z2-z6-z10+z14) + (z3-z7-z11+z15), (z0-z4-z8+z12) - (z1-z5-z9+z13) + (z2-z6-z10+z14) - (z3-z7-z11+z15); (z0-z4+z8-z12) + (z1-z5+z9-z13) + (z2-z6+z10-z14) + (z3-z7+z11-z15), (z0-z4+z8-z12) + (z1-z5+z9-z13) - (z2-z6+z10-z14) - (z3-z7+z11-z15), (z0-z4+z8-z12) - (z1-z5+z9-z13) - (z2-z6+z10-z14) + (z3-z7+z11-z15), (z0-z4+z8-z12) - (z1-z5+z9-z13) + (z2-z6+z10-z14) - (z3-z7+z11-z15)]
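The expression above is the element-wise form of H · Z · H for the 4x4 matrix H with entries of ±1 only. A short sketch, assuming z0..z15 are stored row by row and using hypothetical helper names, shows why the same datapath can serve both directions: applying the transform twice returns the input multiplied by 16.

```python
# 4x4 Hadamard-type matrix matching the signs in the expansion; it is symmetric.
H4 = [
    [1, 1, 1, 1],
    [1, 1, -1, -1],
    [1, -1, -1, 1],
    [1, -1, 1, -1],
]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def hadamard(z):
    """Unscaled 4x4 transform H * Z * H (additions and subtractions only)."""
    return matmul(matmul(H4, z), H4)


z = [[4 * r + c for c in range(4)] for r in range(4)]   # z[r][c] equals its index
once = hadamard(z)
twice = hadamard(once)
print(once[0][0], twice[1][2])
```

Since H · H = 4I, forward and inverse differ only by a constant scale factor, so no separate inverse matrix is needed.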

2x2 chroma DC coefficient Transform & Quantization

[ (z0+z2) + (z1+z3), (z0+z2) - (z1+z3); (z0-z2) + (z1-z3), (z0-z2) - (z1-z3)]

The inverse transform is identical to the forward transform. During decoding, the inverse transform is applied before rescaling. The rescaling formula depends on whether QP is greater than or equal to 6 or less than 6. The rescaled coefficients are put back in their respective 4x4 blocks of chroma coefficients.
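A sketch of the 2x2 transform and the two rescaling cases follows. The transform matches the expression above. The rescaling formulas are not preserved in this transcript, so the code assumes the commonly published form using the position (0,0) factor V; the table values and function names are assumptions to be checked against the standard.

```python
# Assumed rescaling factor V at position (0,0) for QP % 6 = 0..5.
V00 = [10, 11, 13, 14, 16, 18]


def chroma_dc_transform(z):
    """2x2 transform of [[z0, z1], [z2, z3]]; forward and inverse are identical."""
    (z0, z1), (z2, z3) = z
    return [
        [(z0 + z2) + (z1 + z3), (z0 + z2) - (z1 + z3)],
        [(z0 - z2) + (z1 - z3), (z0 - z2) - (z1 - z3)],
    ]


def rescale_chroma_dc(w, qp):
    """Rescale one inverse-transformed chroma DC coefficient."""
    if qp >= 6:
        return (w * V00[qp % 6]) << (qp // 6 - 1)
    return (w * V00[qp % 6]) >> 1


dc = [[1, 2], [3, 4]]
print(chroma_dc_transform(dc), chroma_dc_transform(chroma_dc_transform(dc)))
```

Applying the transform twice returns the input multiplied by 4, which is why one small adder network covers both the encoder and decoder paths.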

DESIGNED HARDWARE

Problems encountered

Signed arithmetic.
Initially designed for 100 MHz.
Creating a dual-purpose datapath introduced extra MUX delays.
Hardware specified in the standard to avoid rounding errors.
An error in the book “H.264 and MPEG-4 Video Compression”.
An unpredicted and unbelievable routing error.

RESULTS

Designed hardware supports up to H.264 level fps). A dual-purpose datapath is designed. Transform and quantization of a 4x4 block is completed in 36 clock cycles. Inverse quantization of a 4x4 block takes 18 clock cycles. Inverse transform of a 4x4 block is done in 36 clock cycles. It takes nearly 2400 cycles to complete an intra 16x16 predicted macroblock. Working at 80 MHz, the designed hardware can process up to mb’s per second.
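As a back-of-the-envelope check on the figures quoted above, the macroblock rate follows from the clock frequency and the cycles per macroblock. The sketch takes "nearly 2400 cycles" as exactly 2400, which is an assumption, so the result is only an estimate and not necessarily the value shown on the original slide; the function name is hypothetical.

```python
def macroblocks_per_second(clock_hz, cycles_per_macroblock):
    """Whole macroblocks completed per second at a given clock frequency."""
    return clock_hz // cycles_per_macroblock


# 80 MHz clock, roughly 2400 cycles per intra 16x16 macroblock (assumed exact here).
print(macroblocks_per_second(80_000_000, 2400))
```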

Synthesis Results

Synthesis is done with LeonardoSpectrum. Clock frequency is 80 MHz.

Number of ports : 68
Number of nets : 212
Number of instances : 30
Number of references to this view : 0
Total accumulated area :
Number of Dffs or Latches : 493
Number of Function Generators : 2688
Number of MUX CARRYs : 148
Number of MUXF5 : 608
Number of MUXF6 : 184
Number of accumulated instances : 3847
Number of global buffers used: 0

Device Utilization for 2V8000ff1152

Columns: Resource, Used, Avail, Utilization (%)
Rows: IOs, Global Buffers, Function Generators, CLB Slices, Dffs or Latches, Block RAMs, Block Multipliers

FPGA & ASIC

The design can be used for either FPGA or ASIC. Only one multiplier is used (the 2V8000ff1152 has 168 block multipliers). A clock frequency of 80 MHz is achieved on the FPGA. To reach 80 MHz, many pipeline stages were added. The designed hardware may work at a clock frequency of up to 200 MHz in an ASIC. Removing pipeline registers will decrease the area and power consumption.

Thanks...

Questions?