L28: Lower Power Algorithm for Multimedia Systems (2), 1999. 8, Sungkyunkwan University, Jun-Dong Cho


Low Power Video Processor
Uzi Zangi, Technion - VLSI Systems Research Center, 1997

Asynchronous logic to save power
- Didn’t work because:
  - Slow design (13.5 MHz) and small circuit (<100K gates): the clock load is small.
  - Adding asynchronous control costs more than clocking.

Gated clock
- Didn’t work because:
  - Frequency is very low (13.5 MHz).
  - Register activity is very high.
  - No need for a clock tree.

Minimizing bus switching
- Transfer either the value or its bitwise complement on the bus, whichever toggles fewer bus lines.
- Add one extra bit to indicate the polarity of the bus.
- Good for buses with:
  - a large number of bits (more than 10);
  - high capacitance (more than 2 pF);
  - high toggle activity (more than 1/2).
- Overheads:
  - routing of one more bit;
  - extra logic for the decision (timing, area).
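The polarity scheme described on this slide is commonly known as bus-invert coding. The following is a minimal Python sketch of it, not the chip's implementation: the function names, the 8-bit default width, and the assumption that the bus starts at all zeros are illustrative choices.

```python
def bus_invert_encode(words, width=8):
    """Encode words as (bus_value, invert_bit) pairs.

    A word is sent complemented when sending it as-is would toggle
    more than half of the data lines (assumed initial bus state: 0).
    """
    mask = (1 << width) - 1
    prev = 0
    encoded = []
    for w in words:
        w &= mask
        toggles = bin(w ^ prev).count("1")
        if toggles > width // 2:
            sent, invert = (~w) & mask, 1
        else:
            sent, invert = w, 0
        encoded.append((sent, invert))
        prev = sent
    return encoded


def bus_invert_decode(encoded, width=8):
    """Recover the original words from (bus_value, invert_bit) pairs."""
    mask = (1 << width) - 1
    return [((~v) & mask) if inv else v for v, inv in encoded]


def data_line_toggles(values, start=0):
    """Count bit transitions on the data lines for a sequence of bus values."""
    total, prev = 0, start
    for v in values:
        total += bin(v ^ prev).count("1")
        prev = v
    return total
```

For the sequence `[0x00, 0xFF, 0x00, 0xFE]` the raw bus toggles 23 data lines, while the encoded bus toggles 1 data line plus 3 transitions on the polarity line. The gain depends entirely on toggle activity, which is why the slide lists high activity as a precondition.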

Minimizing bus switching (Cont.)
- Didn’t work because:
  - The largest bus is 8 bits wide.
  - Capacitance is less than 1 pF.
  - Toggle activity is not very high.

Power Reduction in InfoPad

Power Management by Gated Clock
- Power management scheme by enabling the clock
- Power management scheme by adding a clock generation block

Method That Works: Pixel Differentials
- Pixel values show spatial locality: neighboring pixels tend to have similar values.
- This locality is exploited most heavily in compression (to save on storage and transmission).
- Most of the functions are linear, so they can work on differences.
- The entire algorithm was rewritten (interpolations, filters, matrices, etc.).
- The new algorithm differs from the original by no more than 1 LSB per pixel.
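The idea of working on pixel differences can be illustrated with a short Python sketch. It is not the rewritten algorithm from the slides: the helper names and the sample row are hypothetical, and the "linear function" is a plain integer gain chosen only to show that a linear operation applied to differences reconstructs to the same result as applying it to the pixels.

```python
def to_differences(row):
    """Keep the first pixel, then store each pixel minus its left neighbor."""
    return [row[0]] + [b - a for a, b in zip(row, row[1:])]


def from_differences(diffs):
    """Rebuild pixel values by accumulating the differences."""
    out, acc = [], 0
    for d in diffs:
        acc += d
        out.append(acc)
    return out


def max_magnitude_bits(values):
    """Bits needed for the largest magnitude in values (sign excluded)."""
    return max(abs(v).bit_length() for v in values)


row = [100, 102, 101, 105, 110, 108]   # hypothetical neighboring pixels
diffs = to_differences(row)            # [100, 2, -1, 4, 5, -2]

# A linear operation (here a gain of 3) commutes with differencing.
scaled_via_diffs = from_differences([3 * d for d in diffs])
scaled_directly = [3 * p for p in row]
```

In this row the pixel values need 7 magnitude bits while the differences after the first pixel need only 3, so a datapath working on differences switches far fewer bits per sample.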

Methodology (flow diagram; only the box labels survive): C++ simulator, algorithm, image, image compare, Verilog simulator, RTL, Synopsys, netlist, 0.35 Lib (Compass), P&R (Cadence Opus), SPICE netlist, EPIC PowerMill, currents and power.

Pixel Difference

Pixel Differentials Algorithm Results

Summary
- Attempted to save power on a battery-operated chip with application-specific algorithmic and architectural techniques: asynchronous logic, gated clock, minimizing bus switching.
- All of these attempts failed. The methods may still apply to very large, very fast chips and to variable-load applications.
- Successfully applied an algorithmic change inspired by image compression. It may not work on non-compressible data, but it works exceptionally well on images.
- Easily saved 80% of the power; potentially more than 90% can be saved.

A SINGLE-CHIP DIGITAL CAMERA
Teresa H. Meng, “Low-Power Wireless Video System”, IEEE Communications Magazine, June 1998
◈ Given recent developments in CMOS RF transceiver design, wireless transmission at bandwidths above 10 Mb/s will soon become possible using next-generation CMOS technology.
◈ The paper describes the design of a low-power, large-scale parallel MPEG-2 encoder architecture for use in a single-chip digital CMOS video camera.
◈ The single-chip digital camera architecture includes a 640 x 480 array of CMOS photodiodes, embedded DRAM for storing four frames of color data, and a parallel array processor for video signal processing.
◈ The parallel processor architecture is designed to implement computationally intensive image and video processing tasks such as color conversion, discrete cosine transform (DCT), and motion estimation for MPEG-2.

A SINGLE-CHIP DIGITAL CAMERA

Energy per operation at a 1.5 V supply in 0.8 µm CMOS technology

A SINGLE-CHIP DIGITAL CAMERA
◈ Design Considerations
- The proposed architecture considers three algorithms commonly used in video coding standards: red-green-blue (RGB) to luminance-chrominance (YUV) conversion, discrete cosine transform (DCT), and motion estimation.
- To reduce power consumption, as many parallel processors as practically feasible should be used to reduce the clock frequency, because a reduced clock frequency permits a lower supply voltage.
- For MPEG-2 encoding, the computational demand of motion estimation (1.6 BOPS, billion operations per second, for 30 frames/s, based on the algorithm proposed by Chalidabhongse and Kuo) limits the number of columns in each processor domain to 16; otherwise the required clock speed for each processor would be too high for a low-power design.
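The argument for parallelism can be made concrete with the standard first-order model of dynamic CMOS power. The symbols below (activity factor \alpha, switched capacitance C, supply voltages V and V', clock frequency f, processor count N) are not defined in the slides and are introduced only for this sketch; overheads such as extra routing and control are ignored.

```latex
\begin{align*}
P_{1} &= \alpha\, C\, V^{2} f
  && \text{one processor at clock } f \text{ and supply } V \\
P_{N} &= N \cdot \alpha\, C\, V'^{2}\, \frac{f}{N} = \alpha\, C\, V'^{2} f
  && N \text{ processors at clock } f/N \text{ and supply } V' < V \\
\frac{P_{N}}{P_{1}} &= \left(\frac{V'}{V}\right)^{2}
  && \text{same throughput, power reduced quadratically in the supply}
\end{align*}
```

Under this model the throughput is unchanged, and all of the saving comes from the lower supply voltage that the slower clock makes possible.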

A SINGLE-CHIP DIGITAL CAMERA
◈ Performance
- To sustain this computational demand, each processor must run at a clock frequency of 40 MHz or higher.
- When implemented in a 0.2 µm CMOS technology, a 1 V supply voltage should be more than enough to support 40 MHz operation.
- Under these conditions, this parallel processor architecture delivers 1.6 BOPS of processing throughput with a power consumption of 40 mW.
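The figures quoted on these slides can be cross-checked with a few lines of arithmetic. The Python sketch below uses only numbers from the slides (640 columns, 16 columns per processor domain, 40 MHz, 1.6 BOPS, 40 mW); the function name and the reading that each processor domain maps to one processor are assumptions.

```python
def camera_budget(columns=640, columns_per_processor=16,
                  clock_hz=40e6, throughput_ops=1.6e9, power_w=40e-3):
    """Derive processor count, operations per cycle, and energy per operation."""
    processors = columns // columns_per_processor
    ops_per_cycle = throughput_ops / (processors * clock_hz)
    energy_per_op_pj = power_w / throughput_ops * 1e12
    return processors, ops_per_cycle, energy_per_op_pj


processors, ops_per_cycle, energy_per_op_pj = camera_budget()
```

With these inputs the array has 40 processors, each completing about one operation per cycle at 40 MHz, and the quoted power corresponds to roughly 25 pJ per operation.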

Vector Quantization
A lossy compression technique that exploits the correlation between neighboring samples by quantizing several samples together as a vector.

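Complexity of VQ Encoding
The distortion metric between an input vector X and a codebook vector C_i is computed as follows (the slide's equation did not survive extraction; the squared-error form below matches the per-distortion operation counts given for full search, with 16-element vectors):
$D_i = \sum_{j=0}^{15} (X_j - C_{ij})^2$
Three VQ encoding algorithms are evaluated: full search, tree search, and differential codebook tree search.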

Full Search
Brute-force VQ: the distortion between the input vector and every entry in the codebook is computed, and the code index that corresponds to the minimum distortion is determined and sent to the decoder. Each distortion computation takes 16 8-bit memory accesses (to fetch the entries of the codeword), 16 subtractions, 16 multiplications, and 15 additions. In addition, the minimum of the 256 distortion values must be determined, which takes 255 comparisons.
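A minimal Python sketch of full-search encoding follows. It assumes the distortion is the sum of squared differences, which is consistent with the operation counts on this slide; the function names are illustrative, and the operation tally simply multiplies the per-distortion counts by the codebook size.

```python
def distortion(x, c):
    """Sum of squared differences between input vector x and codeword c."""
    return sum((xj - cj) ** 2 for xj, cj in zip(x, c))


def full_search(x, codebook):
    """Return (index, distortion) of the codeword closest to x."""
    best_index, best_d = 0, distortion(x, codebook[0])
    for i in range(1, len(codebook)):
        d = distortion(x, codebook[i])
        if d < best_d:                      # one comparison per remaining entry
            best_index, best_d = i, d
    return best_index, best_d


def full_search_op_counts(vector_len=16, codebook_size=256):
    """Operation totals for encoding one input vector by full search."""
    return {
        "memory_accesses": vector_len * codebook_size,
        "subtractions": vector_len * codebook_size,
        "multiplications": vector_len * codebook_size,
        "additions": (vector_len - 1) * codebook_size,
        "comparisons": codebook_size - 1,
    }
```

For 16-element vectors and a 256-entry codebook this gives 4096 memory accesses, 4096 subtractions, 4096 multiplications, 3840 additions, and 255 comparisons per input vector.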

Tree-structured Vector Quantization
If, for example, the input vector is closer to the left entry at level 1, then the right portion of the tree is never compared below level 2, and an index bit 0 is transmitted. For a 256-entry codebook this requires only 2 x log2(256) = 16 distortion calculations and 8 comparisons.
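The binary tree search can be sketched in Python as below. The tree layout is an assumption made for this example: `levels[k]` holds the node vectors at depth k + 1, stored left to right, so the children of node `i` at the next level are at positions `2*i` and `2*i + 1`. The squared-error distortion is likewise assumed.

```python
def distortion(x, c):
    """Sum of squared differences between input vector x and node vector c."""
    return sum((xj - cj) ** 2 for xj, cj in zip(x, c))


def tree_search(x, levels):
    """Descend a binary codebook tree; return (leaf_index, index_bits).

    Two distortions and one comparison are evaluated per level, so an
    8-level tree (256 leaves) costs 16 distortions and 8 comparisons.
    """
    index, bits = 0, []
    for nodes in levels:
        left, right = nodes[2 * index], nodes[2 * index + 1]
        bit = 0 if distortion(x, left) <= distortion(x, right) else 1
        bits.append(bit)
        index = 2 * index + bit
    return index, bits


# Tiny two-level example with one-element vectors (hypothetical values).
levels = [
    [[2], [10]],
    [[1], [3], [9], [11]],
]
leaf, bits = tree_search([9], levels)
```

The search is greedy, so it can miss the globally nearest codeword; this is the performance degradation the slides trade for the lower operation count.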

Algorithmic Optimization
Minimizing the number of operations
- Example: a video data stream encoded with the vector quantization (VQ) algorithm, using the distortion metric
- Full-search VQ
  - exhaustive full search
  - distortion calculations: 256
  - value comparisons: 255
- Tree-structured VQ
  - binary tree search
  - some performance degradation
  - distortion calculations: 16 (2 x log2 256)
  - value comparisons: 8

Differential Codebook Tree-structured Vector Quantization
Only the distortion difference between the left and right node needs to be computed. This equation can be manipulated to reduce the number of operations.
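One way to carry out this manipulation, assuming the squared-error distortion, is to expand the difference for left and right node vectors L and R: $D_L - D_R = \sum_j (L_j^2 - R_j^2) + 2\sum_j X_j (R_j - L_j)$. The first sum and the factors $2(R_j - L_j)$ depend only on the codebook, so they can be stored in place of the node vectors. The Python sketch below follows that reading; the names are illustrative and the slides' own equation is not preserved.

```python
def differential_entry(left, right):
    """Precompute, offline, the stored form of one left/right node pair."""
    coeffs = [2 * (r - l) for l, r in zip(left, right)]
    const = sum(l * l - r * r for l, r in zip(left, right))
    return coeffs, const


def choose_branch(x, coeffs, const):
    """Return 0 (left) or 1 (right) from the sign of D_left - D_right.

    Per level this needs one multiply and one add per vector element,
    with no subtractions at encode time.
    """
    diff = const
    for xj, kj in zip(x, coeffs):
        diff += xj * kj
    return 0 if diff <= 0 else 1


def direct_difference(x, left, right):
    """Reference: D_left - D_right computed from two full distortions."""
    d_left = sum((xj - lj) ** 2 for xj, lj in zip(x, left))
    d_right = sum((xj - rj) ** 2 for xj, rj in zip(x, right))
    return d_left - d_right
```

Under these assumptions a 16-element vector costs 16 multiplications and 16 additions per level, roughly half the arithmetic of computing two separate distortions.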

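Algorithmic Optimization
- Differential codebook tree-structured VQ: modify the equation to optimize the number of operations.
- Comparison table (the numeric entries are not preserved in this transcript):
  - rows: full search, tree search, differential tree search
  - columns: # of memory accesses, # of multiplications, # of additions, # of subtractions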

Multiplication with Constants
Techniques and tools have been developed to scale coefficients so that they contain as few 1’s as possible, which minimizes the number of shift-add operations.
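The link between the number of 1's in a coefficient and the number of shift-add operations can be shown with a short Python sketch. It uses canonical signed-digit (CSD) recoding, a related and well-known way to reduce nonzero digits; this is a substitute illustration, not the coefficient-scaling tools the slide refers to, and the function names are hypothetical.

```python
def csd_digits(c):
    """Canonical signed-digit form of a non-negative integer, LSB first.

    Each digit is -1, 0, or +1, and no two adjacent digits are nonzero.
    """
    digits = []
    while c:
        if c & 1:
            d = 2 - (c & 3)     # +1 when c % 4 == 1, -1 when c % 4 == 3
            c -= d
        else:
            d = 0
        digits.append(d)
        c >>= 1
    return digits


def multiply_by_constant(x, c):
    """Multiply x by constant c using only shifts, adds, and subtracts."""
    acc = 0
    for shift, d in enumerate(csd_digits(c)):
        if d == 1:
            acc += x << shift
        elif d == -1:
            acc -= x << shift
    return acc
```

For example, 7 is `111` in binary and needs three shifted terms, while its CSD form is 8 - 1 and needs two.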

Gated clocks to shut down modules when not used.