Improving cache performance of MPEG video codec


Improving cache performance of MPEG video codec Anne Pratoomtong Aug,2002

Motivation Limited memory bandwidth and space: multimedia applications are data-intensive, so data storage and transfer have a dominant impact on the power and area cost of the system. The memory access patterns of most multimedia algorithms are complicated but predictable. In computation-intensive algorithms such as motion estimation, memory bandwidth preservation and on-chip/off-chip memory organization should be investigated further.

Techniques Hardware prefetching Software prefetching Code positioning Data organization to reduce memory traffic

Memory access pattern

Hardware prefetching One block lookahead (OBL) When fetching block i, also fetch block i+1. Efficient with instruction caches, where the access pattern is mostly 1-D consecutive. Stream buffer A FIFO queue that sits on the refill path to the main cache.
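A minimal sketch of OBL with a stream buffer, in Python, can make the refill path concrete. This is illustrative only: the class and method names are invented, eviction is arbitrary rather than direct-mapped, and a real buffer would prefetch again on stream-buffer hits.

```python
# Toy sketch of one-block-lookahead (OBL) prefetching: on a miss to
# block i, block i+1 is placed in a small FIFO stream buffer sitting
# on the refill path to the main cache. (Illustrative names/sizes.)
from collections import deque

BLOCK = 64  # bytes per cache block (assumed)

class OBLCache:
    def __init__(self, n_blocks=8, buf_depth=4):
        self.blocks = set()                    # resident block numbers
        self.n_blocks = n_blocks
        self.stream = deque(maxlen=buf_depth)  # FIFO stream buffer
        self.hits = self.misses = 0

    def access(self, addr):
        b = addr // BLOCK
        if b in self.blocks:
            self.hits += 1
        elif b in self.stream:            # hit in stream buffer:
            self.hits += 1                # promote block to the cache
            self.stream.remove(b)
            self._fill(b)
        else:
            self.misses += 1
            self._fill(b)
            self.stream.append(b + 1)     # OBL: prefetch the next block

    def _fill(self, b):
        if len(self.blocks) >= self.n_blocks:
            self.blocks.pop()             # arbitrary eviction (sketch)
        self.blocks.add(b)

c = OBLCache()
for a in range(0, 64 * 16, 64):           # sequential 1-D access pattern
    c.access(a)
```

On this sequential pattern every other access is caught by the stream buffer, which is why OBL suits instruction streams.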

Hardware prefetching Stream buffer (cont.) On the miss that causes block i to be brought into the cache, blocks i+1, i+2, …, i+n are also fetched into the stream buffer. Not very efficient for non-unit-stride access patterns. Stride prediction table (SPT) A table, indexed by instruction address, that holds the address of each load's last access. LRU replacement policy. Works well with medium to large cache sizes.
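The SPT mechanism can be sketched as a PC-indexed table with LRU replacement. The field layout and the two-confirmation rule below are common in the prefetching literature but are assumptions here, not details from the slides.

```python
# Minimal sketch of a stride prediction table (SPT): indexed by the
# load instruction's address (PC), each entry remembers the last data
# address that instruction touched and its observed stride; once the
# stride repeats, last + stride is prefetched. LRU replacement.
class SPT:
    def __init__(self, entries=64):
        self.entries = entries
        self.table = {}      # pc -> (last_addr, stride)
        self.order = []      # usage order for LRU replacement

    def lookup(self, pc, addr):
        """Record this access and return a prefetch address, or None."""
        prefetch = None
        if pc in self.table:
            last, stride = self.table[pc]
            new_stride = addr - last
            if new_stride == stride and stride != 0:
                prefetch = addr + stride     # stride confirmed twice
            self.table[pc] = (addr, new_stride)
            self.order.remove(pc)
        else:
            if len(self.table) >= self.entries:
                victim = self.order.pop(0)   # evict the LRU entry
                del self.table[victim]
            self.table[pc] = (addr, 0)
        self.order.append(pc)
        return prefetch

spt = SPT()
# One load instruction (pc=0x40) walking an array with stride 8:
hints = [spt.lookup(0x40, 0x1000 + 8 * i) for i in range(4)]
```

After two accesses establish the stride, every subsequent access yields a prefetch hint one stride ahead.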

Hardware prefetching

Hardware prefetching Stream cache A small additional cache accessed in parallel with the main cache. Prefetched data goes into the stream cache instead of the main cache. Used with the SPT to overcome the cache pollution that occurs when using the SPT with a very small cache. For medium to large caches, performance is approximately the same as the SPT-only implementation.

Hardware prefetching 2D prefetching Prefetch with a constant stride. The stride value depends on the data structure (the image size in this case). An image reference table maintains information on the displacement with respect to the physical address in order to find the next block to prefetch.
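The address arithmetic behind 2-D prefetching is simple to sketch: the stride is the image row pitch, so a reference at (row, col) triggers prefetches of the same column in the following rows. The function and parameter names below are illustrative, and the image reference table is reduced to a single base/pitch record.

```python
# Sketch of 2-D constant-stride prefetching for image data: the
# stride equals the image row pitch, so a reference into the image
# prefetches the corresponding bytes in the next row(s).
def prefetch_addrs(addr, base, pitch, lookahead=1):
    """Return 2-D prefetch addresses for a reference into an image
    whose rows are `pitch` bytes apart, starting at `base`."""
    offset = addr - base
    row, col = divmod(offset, pitch)          # position inside the image
    return [base + (row + k) * pitch + col    # same column, rows below
            for k in range(1, lookahead + 1)]

# Hypothetical 720-byte-wide luminance plane at 0x8000;
# reference pixel (row 2, col 128):
addrs = prefetch_addrs(0x8000 + 2 * 720 + 128,
                       base=0x8000, pitch=720, lookahead=2)
```

This is why the stride must come from a per-image table: a 1-D unit-stride prefetcher would fetch the rest of row 2 instead of the vertically adjacent pixels a block-based codec touches next.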

Hardware prefetching Cache with 1D blocks and 2D prefetch

Code positioning A direct-mapped (DM) cache offers the highest storage capacity on a given silicon area and requires a short access time. Real-time signal-processing applications typically involve a limited set of functions executed periodically to process the incoming data. A heuristic approach to reduce high I-cache miss rates in DM caches: Requires trace-profiling ability. Rearranges functions in memory, based on trace data, so as to minimize cache line conflicts. Partitions lookup tables into smaller tables.
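One simple way to realize such a heuristic is a greedy chaining pass: functions that alternate most often in the trace are laid out contiguously so they occupy disjoint I-cache lines rather than conflicting. This particular greedy scheme is my illustration, not the heuristic from the slides.

```python
# Hedged sketch of trace-driven code positioning for a direct-mapped
# I-cache: lay out the hottest caller/callee pairs back to back so
# their lines do not alias. (Greedy heuristic is illustrative only.)
from collections import Counter

def position(trace, sizes):
    """trace: sequence of executed function names;
    sizes: name -> code size in bytes.
    Returns name -> start address in the new layout."""
    pairs = Counter(zip(trace, trace[1:]))    # adjacency frequencies
    order, placed = [], set()
    for (a, b), _ in pairs.most_common():     # hottest pairs first
        for f in (a, b):
            if f not in placed:
                order.append(f)
                placed.add(f)
    for f in sizes:                           # cold functions last
        if f not in placed:
            order.append(f)
    layout, addr = {}, 0
    for f in order:                           # pack contiguously
        layout[f] = addr
        addr += sizes[f]
    return layout

trace = ["dct", "quant", "dct", "quant", "vlc"]
layout = position(trace, {"dct": 512, "quant": 256, "vlc": 1024})
```

Here `dct` and `quant`, which alternate every iteration, end up adjacent, so in a DM cache larger than 768 bytes they can never evict each other.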

Data organization to reduce memory traffic Selective caching Line locking Locality hints Scratch memory Loop merging
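Of the techniques listed, loop merging is the easiest to show in code: fusing two passes over the same frame means each pixel is streamed through the cache once instead of twice, halving memory traffic. The clip-then-scale example is a toy of my own, not from the slides.

```python
# Loop merging (fusion): two passes over a frame become one, so each
# pixel is touched while it is still resident in the cache.
def separate(frame):
    clipped = [min(p, 255) for p in frame]   # pass 1: clip to 8 bits
    return [p >> 1 for p in clipped]         # pass 2: scale by 1/2

def merged(frame):
    # single pass: same result, half the memory traffic
    return [min(p, 255) >> 1 for p in frame]

frame = [100, 300, 255]
```

For frames larger than the cache, `separate` re-fetches every pixel from memory in the second pass; `merged` does not.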

HW/SW Co-Synthesis for SOC Chooses cache sizes and allocates tasks to caches as part of co-synthesis. Assumes that only a one-level DM cache is modeled and that tasks are well contained in the level-1 cache. Partitions an application into an acyclic task graph, in which nodes represent tasks and directed edges represent data dependencies between tasks.

HW/SW Co-Synthesis for SOC Initial solution: assign each task graph the fastest PE available for its tasks. PE and cache cost reduction: try to eliminate lightly loaded PEs by moving their tasks to other PEs that provide the best performance for those tasks. Then try to implement the remaining unmovable tasks with a cheaper PE. If no such PE can be found, the current PE is kept, but an attempt is made to cut its instruction and data cache sizes where applicable.
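The PE-elimination step can be sketched as a greedy pass: pick the PE with the fewest tasks, move each of its tasks to whichever remaining PE runs it fastest, and drop the emptied PE. The data structures and names below are illustrative; a real co-synthesis loop would also check schedulability and deadlines before committing the move.

```python
# Hedged sketch of the "eliminate lightly loaded PEs" step of the
# co-synthesis cost-reduction phase. (Structures are illustrative.)
def reduce_pes(alloc, exec_time):
    """alloc: pe -> list of tasks currently mapped to it;
    exec_time: (task, pe) -> execution time on that PE.
    Empties the PE with the fewest tasks, if another PE exists."""
    victim = min(alloc, key=lambda pe: len(alloc[pe]))
    others = [pe for pe in alloc if pe != victim]
    if not others:
        return alloc                          # nothing to merge into
    for task in alloc[victim]:
        # move each task to the PE that runs it fastest
        best = min(others, key=lambda pe: exec_time[(task, pe)])
        alloc[best].append(task)
    del alloc[victim]                         # lightly loaded PE removed
    return alloc

alloc = {"fast_pe": ["me", "dct", "vlc"], "slow_pe": ["q"]}
times = {("q", "fast_pe"): 3}
alloc = reduce_pes(alloc, times)
```

After the pass, `slow_pe` is gone and its quantization task rides on the remaining PE, saving that PE's area and cache cost.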

Future work While hardware prefetching is widely implemented on GPP platforms, is it worth the extra area/power/complexity in an SOC implementation, or is it even necessary? These techniques were applied to software encoder/decoder applications running on a GPP or multimedia processor; can they be applied efficiently to a hardware/software co-design implementation of an encoder/decoder on a reconfigurable platform?

Future work Characterize the memory traffic and access patterns across various video codec algorithms and standards, and design techniques that can adapt to changes in the application.