PRESENTED BY: MOHAMAD HAMMAM ALSAFRJALANI UFL ECE Dept. 3/31/2010 UFL ECE Dept 1 CACHE OPTIMIZATION FOR AN EMBEDDED MPEG-4 VIDEO DECODER.

Slides:

Advertisements

Similar presentations

Automatic Data Movement and Computation Mapping for Multi-level Parallel Architectures with Explicitly Managed Memories Muthu Baskaran 1 Uday Bondhugula.

Advertisements

Accessing I/O Devices Processor Memory BUS I/O Device 1 I/O Device 2.

Introduction to H.264 / AVC Video Coding Standard Multimedia Systems Sharif University of Technology November 2008.

Philips Research ICS 252 class, February 3, The Trimedia CPU64 VLIW Media Processor Kees Vissers Philips Research Visiting Industrial Fellow

DSPs Vs General Purpose Microprocessors

T.Sharon-A.Frank 1 Multimedia Compression Basics.

Multi-Level Caches Vittorio Zaccaria. Preview What you have seen: Data organization, Associativity, Cache size Policies -- how to manage the data once.

Yaron Doweck Yael Einziger Supervisor: Mike Sumszyk Spring 2011 Semester Project.

Data Marshaling for Multi-Core Architectures M. Aater Suleman Onur Mutlu Jose A. Joao Khubaib Yale N. Patt.

Reporter :LYWang We propose a multimedia SoC platform with a crossbar on-chip bus which can reduce the bottleneck of on-chip communication.

INTERNATIONAL CONFERENCE ON TELECOMMUNICATIONS, ICT '09. TAREK OUNI WALID AYEDI MOHAMED ABID NATIONAL ENGINEERING SCHOOL OF SFAX New Low Complexity.

System Design Tricks for Low-Power Video Processing Jonah Probell, Director of Multimedia Solutions, ARC International.

Distributed Multimedia Systems

Technion - IIT Dept. of Electrical Engineering Signal and Image Processing lab Transrating and Transcoding of Coded Video Signals David Malah Ran Bar-Sella.

 Understanding the Sources of Inefficiency in General-Purpose Chips.

H.264/AVC Baseline Profile Decoder Complexity Analysis Michael Horowitz, Anthony Joch, Faouzi Kossentini, and Antti Hallapuro IEEE TRANSACTIONS ON CIRCUITS.

Evaluation of Data-Parallel Splitting Approaches for H.264 Decoding

Spring 2003CSE P5481 Introduction Why memory subsystem design is important CPU speeds increase 55% per year DRAM speeds increase 3% per year rate of increase.

1 Audio Compression Techniques MUMT 611, January 2005 Assignment 2 Paul Kolesnik.

1 Lecture 14: Cache Innovations and DRAM Today: cache access basics and innovations, DRAM (Sections )

SWE 423: Multimedia Systems Chapter 7: Data Compression (1)

1 Single Reference Frame Multiple Current Macroblocks Scheme for Multiple Reference IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY Tung-Chien.

Associative Cache Mapping A main memory block can load into any line of cache Memory address is interpreted as tag and word (or sub-address in line) Tag.

Memory Organization.

Presenter : Cheng-Ta Wu Antti Rasmus, Ari Kulmala, Erno Salminen, and Timo D. Hämäläinen Tampere University of Technology, Institute of Digital and Computer.

EEL 6935 Embedded Systems Long Presentation 2 Group Member: Qin Chen, Xiang Mao 4/2/20101.

An Introduction to H.264/AVC and 3D Video Coding.

On Error Preserving Encryption Algorithms for Wireless Video Transmission Ali Saman Tosun and Wu-Chi Feng The Ohio State University Department of Computer.

1 Instant replay  The semester was split into roughly four parts. —The 1st quarter covered instruction set architectures—the connection between software.

Research on cloud computing application in the peer-to-peer based video-on-demand systems Speaker : 吳靖緯 MA0G rd International Workshop.

JPEG C OMPRESSION A LGORITHM I N CUDA Group Members: Pranit Patel Manisha Tatikonda Jeff Wong Jarek Marczewski Date: April 14, 2009.

MPEG-2 Standard By Rigoberto Fernandez. MPEG Standards MPEG (Moving Pictures Experts Group) is a group of people that meet under ISO (International Standards.

Motivation Mobile embedded systems are present in: –Cell phones –PDA’s –MP3 players –GPS units.

ECE 526 – Network Processing Systems Design Network Processor Architecture and Scalability Chapter 13,14: D. E. Comer.

CPU Cache Prefetching Timing Evaluations of Hardware Implementation Ravikiran Channagire & Ramandeep Buttar ECE7995 : Presentation.

Simultaneous Multithreading: Maximizing On-Chip Parallelism Presented By: Daron Shrode Shey Liggett.

Basics and Architectures

MPEG: (Moving Pictures Expert Group) A Video Compression Standard for Multimedia Applications Seo Yeong Geon Dept. of Computer Science in GNU.

Picture typestMyn1 Picture types There are three types of coded pictures. I (intra) pictures are fields or frames coded as a stand-alone still image. These.

Microprocessor-based systems Curse 7 Memory hierarchies.

© Janice Regan, CMPT 300, May CMPT 300 Introduction to Operating Systems Principles of I/0 hardware.

A Flexible Multi-Core Platform For Multi-Standard Video Applications Soo-Ik Chae Center for SoC Design Technology Seoul National University MPSoC 2009.

Data Compression. Compression? Compression refers to the ways in which the amount of data needed to store an image or other file can be reduced. This.

Adaptive Multi-path Prediction for Error Resilient H.264 Coding Xiaosong Zhou, C.-C. Jay Kuo University of Southern California Multimedia Signal Processing.

8. 1 MPEG MPEG is Moving Picture Experts Group On 1992 MPEG-1 was the standard, but was replaced only a year after by MPEG-2. Nowadays, MPEG-2 is gradually.

Compression video overview 演講者：林崇元. Outline Introduction Fundamentals of video compression Picture type Signal quality measure Video encoder and decoder.

Abdullah Aldahami ( ) March 23, Introduction 2. Background 3. Simulation Techniques a.Experimental Settings b.Model Description c.Methodology.

1 DSP handling of Video sources and Etherenet data flow Supervisor: Moni Orbach Students: Reuven Yogev Raviv Zehurai Technion – Israel Institute of Technology.

Accessing I/O Devices Processor Memory BUS I/O Device 1 I/O Device 2.

Advances in digital image compression techniques Guojun Lu, Computer Communications, Vol. 16, No. 4, Apr, 1993, pp

UNDER THE GUIDANCE DR. K. R. RAO SUBMITTED BY SHAHEER AHMED ID : Encoding H.264 by Thread Level Parallelism.

-BY KUSHAL KUNIGAL UNDER GUIDANCE OF DR. K.R.RAO. SPRING 2011, ELECTRICAL ENGINEERING DEPARTMENT, UNIVERSITY OF TEXAS AT ARLINGTON FPGA Implementation.

Von Neumann Computers Article Authors: Rudolf Eigenman & David Lilja

The World Leader in High Performance Signal Processing Solutions Multi-core programming frameworks for embedded systems Kaushal Sanghai and Rick Gentile.

IT3002 Computer Architecture

Jeffrey Ellak CS 147. Topics What is memory hierarchy? What are the different types of memory? What is in charge of accessing memory?

MPEG CODING PROCESS. Contents  What is MPEG Encoding?  Why MPEG Encoding?  Types of frames in MPEG 1  Layer of MPEG1 Video  MPEG 1 Intra frame Encoding.

Architectural Effects on DSP Algorithms and Optimizations Sajal Dogra Ritesh Rathore.

Hierarchical Systolic Array Design for Full-Search Block Matching Motion Estimation Noam Gur Arie,August 2005.

Memory Management memory hierarchy programs exhibit locality of reference - non-uniform reference patterns temporal locality - a program that references.

Visit for more Learning Resources

Data Compression.

Cache Memory Presentation I

Vector Processing => Multimedia

Software Equipment Survey

VLIW DSP vs. SuperScalar Implementation of a Baseline H.263 Encoder

Standards Presentation ECE 8873 – Data Compression and Modeling

Wavelet “Block-Processing” for Reduced Memory Transfers

Govt. Polytechnic Dhangar(Fatehabad)

What Choices Make A Killer Video Processor Architecture?

Presentation transcript:

PRESENTED BY: MOHAMAD HAMMAM ALSAFRJALANI UFL ECE Dept. 3/31/2010 UFL ECE Dept 1 CACHE OPTIMIZATION FOR AN EMBEDDED MPEG-4 VIDEO DECODER

Outline UFL ECE Dept 2  Introduction  Quick MPEG-4 Background  Original design  Optimized design  Conclusion 3/31/2010

Introduction 3/31/2010 UFL ECE Dept 3  Video decoders are popular  Streaming video and audio applications  Video surveillance  The predominant format is MPEG-4  MPEG-4 decoder specifications are narrow Unlike the encoder, designer’s options are limited

The Challenge UFL ECE Dept 4  The high data rate, large sizes, and distinctive memory access patterns of MPEG-4 exert a particular strain on cache  Miss rates are tolerable  But they generate excess cache-memory traffic  Multimedia application suffer due cache inefficiency  The goal is optimizing the algorithm for a better cache-SRAM, SRMA-subsystem interaction 3/31/2010

MPEG-4 Background UFL ECE Dept 5  MPEG-4 achieves the compression of video data using two orthogonal lines  In the spatial domain, it extracts redundancies from each individual image, operating on a frame by frame basis  In the temporal domain, the algorithm operates between frames, taking advantage of visual content common to adjacent frames.  Each frame is broken down into 8 × 8 pixel image fragments known as blocks 3/31/2010

 With spatial compression, each block is filtered through the discrete cosine transform and quantized, and then undergoes variable length coding  Temporal compression uses motion estimation, constructing frames from pieces of other frames translated as a group from their location in the source image MPEG-4 Background UFL ECE Dept 6 The most relevant for cache performance known as reference frames 3/31/2010

MPEG-4 Background UFL ECE Dept 7  For MPEG-4 simple profile, the motion estimation and compensation are organized using two different types of frames.  I (intra) frames contain a spatially compressed image without motion-compensated elements.  P (predicted) frames are built primarily of pixels from the closest previous I or P frame.  The ability to reference data in past frames provides additional opportunities for compression, but introduces data dependencies 3/31/2010

The DM642 UFL ECE Dept 8  The DM642 is the first integrated media processor based on the C64x VLIW DSP core.  Optimized for video processing including both 8-way VLIW parallelism and packed data processing (SIMD) within each functional unit.  Key elements  Two-level cache architecture  Enhanced DMA (EDMA) controller  64-bit external memory interface  Three 20-bit video ports  Ethernet MAC 3/31/2010

Cache Architecture UFL ECE Dept 9 two-way set associative to account for multiple input sources allows large amounts of data to be brought on chip. regions of reference data for motion estimation can be retained on-chip across multiple blocks minimizing redundant external I/O. 3/31/2010

Decoder Architecture UFL ECE Dept 10  3 parts  Receive task, Decode task and Output task  DSP decodes and writes the video streams into the SDRAM through its cache hierarchy  The DSP reads the encoded data from and writes the decoded video into the SDRAM  Another EDMA-I/O transfers the decoded video from SDRAM to the display peripheral. 3/31/2010

Original Design UFL ECE Dept 11  Parameters: size, associatively, levels  Size  Size: 16k, 32k, 64k, etc  Size performance. But very large size, performance improvement is not so much  The DM642, size L1 is 16k for instruction and 16 for data. L2 size is 256k  Associatively  Direct mapped -> 2-way, may performance by 50%  The DM462 L1 instruction is direct mapped, and data is 2- way 3/31/2010

Original Design UFL ECE Dept 12  Levels:  In general, addition of L2 decreases the bus traffic and memory latency.  The DM642 utilizes two-level cache architecture, namely L1P, L1D and L2 cache 3/31/2010

Optimized Design UFL ECE Dept 13  The optimal memory allocation method is based on the Cache- SRAM-SDRAM structure.  The SRAM is in the middle and it negotiates the unbalance of the cache and the SDRAM in speed and capacity, so it affects the system efficiency greatly.  SRAM is divided into three sections:  Data exchanging section  two data buffers, solves the video data storage and exchanging problems  The core code and variables storage section  Used to store the frequently called functions  Cache section  For the other codes and data management 3/31/2010

More on the SRAM UFL ECE Dept 14  Additionally, the code structure is properly modularized, the code fragments in teamwork are adjusted to be stored continuously, and the related data are stored together.  These actions increase the hit rate of the cache in the SRAM. 3/31/2010

Allocation of DES UFL ECE Dept 15  Data exchange section must be large:  The system decodes 8 ch CIF (Common Intermediate Format) (352 × 288 pixels)  Size of 8 ch’s is 8x352x288x1.5, additional space needed for source code, etc.. Total needed, 2.85MB  Much bigger than the 256k we have  The buffer must be also able to store the decoded video data reference frame data and the relevant structure variables of each GMBL (Group of  Macro Block Lines) 3/31/2010

Decoding unit UFL ECE Dept 16  The unit of MPEG-4 decoding is a 16 × 16 pixel fragment known as a macro block, covering the area of six 8× 8 blocks. Subsampling  Decoders take the GMBL the decoding unit  Access to SDRAM is reduced since buffer is ale to store the decoded video data  In order to process multiple blocks at a time to obtain optimal cache performance and EDMA bandwidth efficiency, a ping-pang data buffer, namely BUF1  and BUF2, is used to store the decoded data in turn. 3/31/2010

MPEG Decoder, Based on GMBL UFL ECE Dept 17 3/31/2010

Allocation of code and cache settings UFL ECE Dept 18  The method that utilizes SRAM to save video data is to solve the video data storage and exchanging problems.  Unless the data exchange can be realized in time, the performance of the decoder still may not be improved.  Therefore, it’s necessary to allocate parts of SRAM for storing the relevant variables of code and program.  Repeating operations is very common in MPEG-4 decoders  => L1P must allocate space for frequent functions  Space is allocated to store cache settings (2 way, or 4 way) 3/31/2010

Proposed Modes 2/26/2010 UFL ECE Dept 19  Mode 1, ALL SRAM of L2:  All the spaces of On-Chip memory are configured to SRAM that is used to store codes, data, and global variables  This mode is feasible because the direction of data streams is clear and the data exchange can be completed by scheduling EDMA.  Yet, it will consume a lot of time once DSP accesses external memory

Proposed Modes 2/26/2010 UFL ECE Dept 20  Mode 2: Cache based memory mode  The storage space of 83KB size SRAM is divided into two parts with one of 64KB size for cache  The surplus is for saving codes and data which are in common use.  In this mode, the L2 cache is four-way set associative.

Proposed Modes 2/26/2010 UFL ECE Dept 21  Mode 3:  The storage space of 83KB size SRAM is divided into two parts with one of 32KB size for cache  The surplus is for saving codes and data which are in common use.  In this way, the access to SDRAM is possibly completed in a high speed taking advantage of the cache  The L2 cache is two-way set associative.

Which mode is the best? 2/26/2010 UFL ECE Dept 22  Compared with mode 1, both mode2 and 3 has a higher miss rates when CPU read block which could waste lots of time because of the impossibility of manual scheduler resulted from the invisible map program.  The miss rates decrease significantly as the associativity level is increased from 2- to 4-way

The winner 2/26/2010 UFL ECE Dept 23  the second mode is finally the optimum one and is used in this paper which is called Cache-based memory mode  It separates the 64KB SRAM into L2 levels and surplus 19KB SRAM is used for store core codes and variables

Conclusion 2/26/2010 UFL ECE Dept 24  Memory access has great influence on system performance  Cache-based memory allocation mode is proposed to optimally allocate the limited on–chip memory  of the DM642 processor  Hardware characteristics  Software characteristics of MPEG-4 video decoders  The obtained MPEG-4 decoders can simultaneously decode eight channels CIF video frames on DM642.  The experiment results show that MPEG-4 decoding performance can be improved by almost 25% compared to the no cache optimization, with the standard video streams.

Questions? 3/19/2010 UFL ECE Dept 25