A Self-Tuning Cache Architecture for Embedded Systems. Chuanjun Zhang, Frank Vahid, and Roman Lysecky. Proceedings of the Design, Automation and Test in Europe Conference.


A Self-Tuning Cache Architecture for Embedded Systems. Chuanjun Zhang, Frank Vahid, and Roman Lysecky. Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE), Vol. 1, pp. 142–147, Feb. 2004.

Abstract
 Memory accesses can account for about half of a microprocessor system's power consumption. Customizing a microprocessor cache's total size, line size, and associativity to a particular program is well known to have tremendous benefits for performance and power. Until recently, customizing caches was restricted to core-based flows, in which a new chip is fabricated. However, several configurable cache architectures have recently been proposed for use in pre-fabricated microprocessor platforms. Tuning those caches to a program is still a cumbersome task left to designers, assisted in part by recent computer-aided design (CAD) tuning aids.
 We propose to move that CAD on-chip, which can greatly increase the acceptance of configurable caches. We introduce on-chip hardware implementing an efficient cache-tuning heuristic that can automatically, transparently, and dynamically tune the cache to an executing program. We carefully designed the heuristic to avoid any cache flushing, since flushing is costly in both power and performance. By simulating numerous Powerstone and MediaBench benchmarks, we show that such a dynamic self-tuning cache can reduce memory-access energy by 45% to 55% on average, and by as much as 97%, compared with a four-way set-associative base cache, completely transparently to the programmer.

What's the Problem
 Tuning a configurable cache to an application benefits both power and performance
 How do we obtain the best cache configuration?
  Increasing cache size (or associativity) sometimes improves performance only slightly while increasing energy greatly
  Determining the best configuration via simulation is straightforward, but slow, and cannot capture runtime behavior
 Thus, it is essential to automatically tune a configurable cache dynamically as an application executes

Introduction
 Previous work by this team: a highly configurable cache architecture [13],[14]
  Four parameters that designers can configure:
  1) Cache total size: 8, 4, or 2 KB
  2) Associativity: 4, 2, or 1 way for 8 KB; 2 or 1 way for 4 KB; 1 way for 2 KB
  3) Cache line size: 64, 32, or 16 bytes
  4) Way prediction: ON or OFF
 The proposed dynamic cache tuning method
  A cache-tuning heuristic implemented in on-chip hardware
  Does not exhaustively try all possible cache configurations
  Dynamically tunes the cache to an executing program
  Automates the process of finding the best cache configuration, even when the configuration space is much larger

Energy Evaluation
 Equation for total memory-access energy consumption
  E_hit: cache hit energy per cache access (related to cache size and associativity)
  E_miss: cache miss energy (related to cache line size)
  E_static_per_cycle: static energy dissipation per cycle (related to cache size)
 Equation for the heuristic cache tuner's own energy consumption
  Time_total: the total time used to finish one cache-configuration search
  NumSearch: the number of cache configurations searched
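The equations themselves appeared as images on the original slide; from the variable definitions above, the energy model can be sketched roughly as follows. Function and variable names here are illustrative, not the paper's notation.

```python
# Sketch of the energy model implied by the variable definitions above.
# Names are illustrative; the exact equations appear in the DATE'04 paper.

def memory_access_energy(hits, misses, cycles,
                         E_hit, E_miss, E_static_per_cycle):
    """Total memory-access energy: dynamic (hit + miss) plus static."""
    dynamic = hits * E_hit + misses * E_miss
    static = cycles * E_static_per_cycle
    return dynamic + static

def tuner_energy(P_tuner, time_total, num_search):
    """Tuner energy: power x time per configuration search x searches."""
    return P_tuner * time_total * num_search
```

Plugging in the tuner figures reported later (2.69 mW, 164 cycles at 200 MHz, 5.4 searches) gives about 11.9 nJ.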

Problem Overview
 A naive tuning approach: exhaustively try all possible cache configurations
  Two main drawbacks:
  - Involves too many configurations
  - Requires too many cache flushes (searching in an arbitrary order may require flushing the cache)
 Goal: develop a self-tuning heuristic that
  - Minimizes the number of cache configurations examined
  - Minimizes cache flushing
  while still finding a near-optimal cache configuration
 The tuner 1) tunes dynamically during execution, and 2) can be enabled or disabled by software

Heuristic Development Through Analysis
 Energy dissipation for benchmark parser at cache sizes from 1 KB to 1 MB
  At small sizes, increasing the cache improves performance and decreases total energy, because the energy dissipation of off-chip memory decreases rapidly
  Beyond a tradeoff point, further increases improve performance only slightly but increase energy significantly
 This tradeoff point differs across applications, and exists not only for cache size but also for associativity and line size
 Therefore, the goal of the search heuristic is to find the configuration at that tradeoff point

Determining the Impact of Each Parameter
 The parameter with the greatest impact is configured first
  Varying cache size has the biggest impact on miss rate and energy
  Varying line size causes little energy variation for the I-cache but more for the D-cache
  Varying associativity has the smallest impact on energy consumption
 Conclusion: develop a search heuristic that finds the best cache size first, then the best line size, and finally the best associativity

Minimizing Cache Flushing
 The order in which each parameter's values are varied matters: one order may require flushing, a different order may not
 Cache-flush analysis when changing cache size (example: an 8-byte memory with 3-bit addresses)
  Increasing the cache size is preferable to decreasing it
  When decreasing the cache size, an original hit may turn into a miss
   EX: addresses 000 (index=00) and 110 (index=10) miss after ways are shut down
   For the D-cache, dirty data in the shut-down ways must be written back
  Increasing the cache size does not require flushing
   EX: addresses 100 (index=0) and 010 (index=0)
   No write-back is needed, and flushing is thus avoided
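The index-recomputation argument can be seen with a toy model of the slide's 8-byte-memory example; `cache_index` below is a hypothetical helper, not part of the paper.

```python
# Toy model of the 8-byte-memory example: when half the sets are shut down,
# the index is taken over fewer bits, so blocks cached in the powered-down
# sets are lost (and must be written back first if dirty).

def cache_index(addr, num_sets, block_bytes=1):
    """Set index of an address in a direct-mapped cache."""
    return (addr // block_bytes) % num_sets

# 3-bit addresses, 4 sets shrunk to 2 sets:
print(cache_index(0b110, num_sets=4))  # set 2 in the large cache...
print(cache_index(0b110, num_sets=2))  # ...set 0 after shrinking: the copy
                                       # in set 2 is powered off, so a former
                                       # hit becomes a miss
# Growing the cache loses no resident blocks, so no write-back is needed.
```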

Minimizing Cache Flushing (cont.)
 Cache-flush analysis when changing associativity
  Increasing the associativity is preferable to decreasing it
  Decreasing the associativity may turn a hit into a miss
   EX: addresses 000 (index=0) and 100 (index=0)
  Increasing the associativity causes no extra misses
   EX: addresses 000 (index=00) and 010 (index=10)
   Both still hit after the associativity is increased

Search Heuristic for Determining the Best Cache Configuration
 Inputs to the heuristic
  Cache sizes C[i], 1 ≤ i ≤ n; n=3 in our configurable cache: C[1]=2 KB, C[2]=4 KB, C[3]=8 KB
  Line sizes L[j], 1 ≤ j ≤ p; p=3: L[1]=16 bytes, L[2]=32 bytes, L[3]=64 bytes
  Associativities A[k], 1 ≤ k ≤ m; m=3: A[1]=1 way, A[2]=2 way, A[3]=4 way
  Way prediction: W[1]=OFF, W[2]=ON
 Search order: first cache size, then line size, then associativity, and finally way prediction
  Keep increasing the cache size as long as doing so decreases total energy
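As a software sketch of the slide's greedy, one-parameter-at-a-time search (the real tuner is hardware; `measure_energy` is a hypothetical callback standing in for running the application under a configuration and evaluating its energy):

```python
# Sketch of the search heuristic: tune one parameter at a time, in order of
# impact, trying values in increasing order to avoid flushes.

def tune(measure_energy,
         sizes=(2, 4, 8),          # KB
         lines=(16, 32, 64),       # bytes
         assocs=(1, 2, 4),         # ways
         way_pred=(False, True)):
    best = {"size": sizes[0], "line": lines[0],
            "assoc": assocs[0], "wp": way_pred[0]}
    best_e = measure_energy(**best)

    def sweep(param, values):
        nonlocal best_e
        for v in values[1:]:
            cand = dict(best, **{param: v})
            e = measure_energy(**cand)
            if e >= best_e:        # stop as soon as energy stops improving
                break
            best.update(cand)
            best_e = e

    sweep("size", sizes)           # biggest impact first
    sweep("line", lines)
    sweep("assoc", assocs)
    sweep("wp", way_pred)
    return best, best_e
```

Each sweep stops at the first value that fails to lower energy, so at most m values per parameter are measured rather than every combination.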

The Efficiency of the Search Heuristic
 Suppose there are n configurable parameters, each with m values
  Brute force: m^n different combinations in total
  Our heuristic searches at most m*n combinations
 EX: 10 configurable parameters, each with 10 values
  Brute-force search: 10^10 combinations
  Our search heuristic: only 100 combinations
 Thus, our search heuristic minimizes the number of cache configurations examined and avoids most cache flushing
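The search-space arithmetic from this slide, spelled out:

```python
# n parameters, m values each: brute force grows exponentially,
# the one-sweep-per-parameter heuristic only linearly.
n, m = 10, 10
brute_force = m ** n   # every combination
heuristic = m * n      # at most one greedy sweep per parameter
print(brute_force, heuristic)
```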

Implementing the Heuristic in Hardware
 A hardware-based approach is preferable to software
  A software approach would change not only the runtime behavior of the application but also the cache behavior itself
 FSMD of the cache tuner
  E_hit values (application-independent): one each for 8 KB 4-way, 2-way, and 1-way; 4 KB 2-way and 1-way; 2 KB 1-way
  E_miss values: one each for line sizes of 16, 32, and 64 bytes
  E_static_per_cycle values: one each for cache sizes of 8, 4, and 2 KB
  Hit and miss counts are collected as runtime information; the energy calculation's result is compared against the lowest of the configurations tested
  Configuration register (7 bits wide), used to configure the cache: 2 bits for cache size, 2 bits for line size, 2 bits for associativity, and 1 bit for way prediction

Implementing the Heuristic in Hardware (cont.)
 FSM of the cache tuner: composed of three smaller state machines
  PSM: tunes each cache parameter in turn (cache size, line size, associativity, way prediction); determines the best cache size first
  VSM: determines the energy for the possible values of each parameter
   EX: if the current PSM state is P1 (cache size), VSM state V1 determines the energy of the 2 KB cache, V2 of the 4 KB cache, and V3 of the 8 KB cache
  CSM: controls the calculation of energy
   Why do we need the CSM? Because there are three multiplications but only one multiplier, four states are used to compute the energy
 PSM states depend on the VSM, and VSM states depend on the CSM
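A software analogue of the three nested state machines may help: PSM steps through the parameters, VSM through each parameter's values, and CSM accumulates the energy through a single shared multiplier, one product per state. The structure and names below are illustrative, not the actual FSMD.

```python
# Illustrative analogue of the tuner's nested state machines.

def csm_energy(hits, misses, cycles, E_hit, E_miss, E_static):
    """CSM: three multiplications sequenced through one multiplier."""
    acc = 0
    for a, b in ((hits, E_hit), (misses, E_miss), (cycles, E_static)):
        acc += a * b               # the single multiplier, reused per state
    return acc

def tuner_fsm(stats_for, params):
    """PSM (outer loop) and VSM (inner loop): yield the energy of every
    parameter value tried. stats_for is a hypothetical callback returning
    the counts and per-access energies for one configuration."""
    for name, values in params:    # PSM states P1..P4
        for v in values:           # VSM states V1..Vn
            yield name, v, csm_energy(*stats_for(name, v))
```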

Results of the Search Heuristic
 Searches 5.8 configurations on average, compared to 27 configurations for exhaustive search
 Finds the optimal configuration in nearly all cases, except
  the D-cache configuration of pjpeg
  the D-cache configuration of mpeg2

The Reason for the Inaccuracy
 A larger cache consumes more dynamic and static energy
  A larger cache is only preferable if the reduction in E_off_chip_mem outweighs the energy increase of the larger cache
  For mpeg2, with an 8 KB cache, the reduction in E_off_chip_mem is not large enough to overcome the energy added by the larger cache, so the heuristic selects a cache size of 4 KB
  However, when associativity is then considered (increased from 1 way to 2 way), the miss rate of the 8 KB cache drops significantly
 The heuristic misses the optimal configuration because, when determining the best cache size, it does not predict what will happen when associativity is later increased

Area and Power of the Tuning Hardware
 The cache tuner occupies about 4,000 gates in 0.18 um technology
  An area increase of just 3% over a MIPS 4Kp with cache
 The tuner's power consumption is 2.69 mW at 200 MHz
  Only 0.5% of the power consumed by a MIPS processor
 Average energy consumption of the cache tuner
  164 cycles to evaluate one cache configuration; 5.4 configurations searched on average
  Tuner energy = 2.69 mW * (164/200 MHz) * 5.4 = 11.9 nJ: negligible, given that the benchmarks' average energy dissipation is 2.34 J
 Impact of avoiding flushes through careful ordering of the search
  If cache size were instead configured in order from 8 KB down to 2 KB, the average energy of writing back dirty data would be 5.38 mJ
  Thus, searching cache sizes from largest to smallest would spend about 480,000 times the tuner's own energy on cache flushes
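A quick numeric check of this slide's figures (the ratio computed here comes out near 450,000x, the same order of magnitude as the slide's 480,000x, which presumably reflects slightly different rounding):

```python
# Checking the arithmetic on this slide.
power_w  = 2.69e-3      # 2.69 mW tuner power
cycles   = 164          # cycles per configuration evaluated
freq_hz  = 200e6        # 200 MHz
searches = 5.4          # average configurations searched

tuner_j = power_w * (cycles / freq_hz) * searches
print(tuner_j * 1e9)    # ~11.9 nJ, matching the slide

flush_j = 5.38e-3       # write-back energy if sizes were searched large-to-small
print(flush_j / tuner_j)  # ~4.5e5: flushing dwarfs the tuner's own energy
```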

Conclusions
 Proposed a self-tuning on-chip CAD method that finds the best cache configuration automatically
  Relieves designers of the burden of determining the best configuration
  Increases the usefulness and acceptance of configurable caches
 The cache-tuning heuristic
  Minimizes the number of configurations examined
  Minimizes the need for cache flushing
  Reduces memory-access energy by 40% on average, compared to a standard cache