1 Energy-efficiency potential of a phase-based cache resizing scheme for embedded systems G. Pokam and F. Bodin.

2 Plan
- Motivation
- Previous work
- Our approach
- Cache model
- Trace-based analysis
- Experimental setup
- Program behavior
- Preliminary results
- Conclusions and future work

3 Motivation (1/3)
- High performance is hard to reconcile with low power.
- Consider the cache hierarchy, for instance. Benefits of large caches:
  - maintain the embedded code + data workload on-chip
  - reduce off-chip memory traffic
- However:
  - caches account for ~80% of the transistor count
  - we usually devote half of the chip area to caches

4 Motivation (2/3)
- Cache impact on energy consumption:
  - static energy is disproportionately large compared to the rest of the chip: 80% of the transistors contribute steadily to leakage power
  - dynamic energy (transistor switching activity) represents an important fraction of the total energy, due to the high access frequency of caches
- Cache design is therefore critical in the context of high-performance embedded systems.

5 Motivation (3/3)

6 Previous work (1/2)
- Some configurable cache proposals that apply to embedded systems include:
  - Albonesi [MICRO'99]: selective cache ways, to disable/enable individual cache ways of a highly set-associative cache
  - Zhang et al. [ISCA'03]: way-concatenation, to reduce the cache associativity while still maintaining the full cache capacity

7 Previous work (2/2)
- These approaches only consider configuration on a per-application basis.
- Problems:
  - empirically, no single best cache size exists for a given application
  - the dynamic cache behavior varies within an application, and from one application to another
- Therefore, these approaches do not adapt well to program phase changes.

8 Our approach
- Objective: emphasize application-specific cache architectural parameters.
- To do so, we consider a cache with a fixed line size and a modulus set-mapping function; power/performance is dictated by size and associativity.
- Not all dynamic program phases have the same cache size and associativity requirements, so we dynamically vary size and associativity to leverage the power/performance tradeoff at phase level.

9 Cache model (1/8)
- Baseline cache model: the way-concatenation cache [Zhang ISCA'03].
- Functionality of the way-concatenation cache:
  - on each cache lookup, a logic selects the number of active cache ways m out of the n available cache ways
  - virtually, each active cache way is a multiple of the size of a single bank in the base n-way cache
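The bank-selection idea above can be sketched in software. The following is a hypothetical illustration of way-concatenation indexing, not the paper's hardware: the bank count, set count, and bit positions are assumptions chosen to match a 32 KB cache with 32 B lines.

```python
# Hypothetical sketch of way-concatenation bank selection: a 32 KB
# cache built from four 8 KB banks. Reducing associativity from 4 to
# m ways keeps all banks powered; extra index bits steer each set to
# a subset of the banks, so full capacity is preserved.

BANKS = 4
BANK_SETS = 256          # 8 KB bank / 32 B lines = 256 sets per bank
LINE_BITS = 5            # 32 B line size

def active_banks(addr, m):
    """Return the banks probed for `addr` when m of 4 ways are active."""
    extra = addr >> LINE_BITS  # bits above the line offset
    if m == 4:                           # 4-way: probe all banks
        return list(range(BANKS))
    if m == 2:                           # 2-way: one extra index bit
        half = (extra >> 8) & 1          # picks the bank pair
        return [half * 2, half * 2 + 1]
    # 1-way: two extra index bits select a single bank
    return [(extra >> 8) & 3]
```

For example, with all four ways enabled every bank is probed, while in 1-way mode the two bits above the bank set index pick exactly one bank, so the same address always maps to the same single bank.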

10 Cache model (2/8)
- Our proposal:
  - modify the associativity while guaranteeing cache coherency
  - modify the cache size while preserving data availability in unused cache portions

11 Cache model (3/8)
- First enhancement: associativity level.
- Problem with the baseline model; consider the following scenario (four banks, 0-3):
  - Phase 0: 32K 2-way, active banks are 0 and 2; a line in an active bank is modified
  - Phase 1: 32K 1-way, a single bank is active; the old copy left in a now-inactive bank is stale and needs invalidation

12 Cache model (4/8)
- Proposed solution:
  - assume a write-through cache
  - the unused tag and status arrays must remain accessible on a write to ensure coherency across cache configurations => associative tag array
  - action of the cache controller: access all tag arrays on a write request and set the corresponding status bit to invalid
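The controller action above can be illustrated with a toy model. This is a behavioral sketch only, with invented class names; the real mechanism is hardware, and the write-through path to memory is not modeled.

```python
# Illustrative sketch of the coherency rule: on a write, the
# controller probes the tag arrays of ALL banks, active or not, and
# invalidates any stale copy outside the bank being written.

class Bank:
    def __init__(self):
        self.lines = {}                 # tag -> (valid, data)

class WayConcatCache:
    def __init__(self, n_banks=4):
        self.banks = [Bank() for _ in range(n_banks)]

    def write(self, bank, tag, data):
        # write-through: the write also goes to memory (not modeled)
        self.banks[bank].lines[tag] = (True, data)
        # probe every other bank's tag array; mark matches invalid
        for i, b in enumerate(self.banks):
            if i != bank and tag in b.lines:
                valid, d = b.lines[tag]
                b.lines[tag] = (False, d)   # status bit -> invalid
```

With write-through, no dirty data can be stranded in an inactive bank; invalidating the stale tag is sufficient, which is what makes this scheme cheap.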

13 Cache model (5/8)
- Second enhancement: cache size level.
- Problem with the baseline model: gated-Vdd is used to disconnect a bank => data are not preserved across two configurations!
- Proposed solution:
  - unused cache ways are put in a low-power mode => drowsy mode [Flautner et al. ISCA'02]
  - the tag portion is left unchanged
- Main advantage: we can reduce the cache size and preserve the state of the unused memory cells across program phases, while still reducing leakage energy.
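The contrast between the two leakage techniques can be made concrete with a toy model. This is an assumed, simplified behavior (invented names, no real voltages): gated-Vdd cuts the supply and loses contents, while drowsy mode keeps cells at a retention voltage, so data survive and only a wake-up penalty applies on the next access.

```python
# Toy contrast of gated-Vdd vs. drowsy mode for an unused cache bank.

class BankLine:
    def __init__(self, data):
        self.data = data
        self.drowsy = False

def gate_vdd(bank):
    """Disconnect the supply: contents are lost."""
    return [BankLine(None) for _ in bank]

def set_drowsy(bank):
    """Lower the supply to a retention voltage: contents survive."""
    for line in bank:
        line.drowsy = True
    return bank

def read(line):
    """Reading a drowsy line first wakes it up (1 extra cycle)."""
    extra_cycles = 1 if line.drowsy else 0
    line.drowsy = False
    return line.data, extra_cycles
```

This is why the scheme can shrink the effective cache size per phase without discarding the working set of a later phase.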

14 Cache model (6/8)
- Overall cache model

15 Cache model (7/8)
- Modified cache line (DVS is assumed)

16 Cache model (8/8)
- The drowsy circuitry accounts for less than 3% of the chip area.
- Accessing a line in drowsy mode requires a 1-cycle delay [Flautner et al. ISCA'02].
- ISA extension: we assume the ISA can be extended with a reconfiguration instruction having the following effects on the WCR:

17 Trace-based analysis (1/3)
- Goal: extract performance and energy profiles from the trace in order to adapt the cache structure to the dynamic application requirements.
- Assumptions:
  - LRU replacement policy
  - no prefetching

18 Trace-based analysis (2/3)
- Define: a sample interval, a set-mapping function (for varying the associativity), and an LRU-stack distance d (for varying the cache size).
- Then, define the LRU-stack profiles:
  - performance: for each pair, this expression defines the number of dynamic references that hit in caches with the given LRU-stack distance
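The performance profile above can be sketched as a small simulator; this is an illustrative reimplementation of the standard LRU-stack distance technique, not the authors' tool, and the trace format is an assumption.

```python
# Illustrative LRU-stack profile: for a trace of (set_index, tag)
# references, count, per stack distance d, the references that would
# hit in any cache keeping at least d lines per set under LRU.
from collections import defaultdict

def lru_stack_profile(trace):
    """trace: iterable of (set_index, tag). Returns {distance: hits}."""
    stacks = defaultdict(list)      # per-set LRU stacks, MRU first
    hits = defaultdict(int)
    for s, tag in trace:
        stack = stacks[s]
        if tag in stack:
            d = stack.index(tag) + 1   # stack distance (1 = MRU)
            hits[d] += 1
            stack.remove(tag)
        stack.insert(0, tag)           # the tag becomes MRU
    return dict(hits)
```

A single pass over the trace then yields the hit count of every candidate cache size at once: a configuration with a lines per set hits exactly the references whose stack distance d satisfies d <= a, which is what makes this kind of offline profiling cheap.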

19 Trace-based analysis (3/3)
- energy: the energy profile combines the cache energy, the tag energy, the drowsy-transition energy, and the memory energy
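The way these components might combine can be sketched as follows; the function name, parameters, and the per-event cost values are placeholders, not the paper's CACTI/HotLeakage numbers.

```python
# Minimal sketch of a per-configuration energy profile built from the
# four components named on the slide. Per-event energies are inputs.

def energy_profile(hits, misses, drowsy_transitions,
                   e_cache, e_tag, e_drowsy, e_mem):
    """Total energy for one phase and one cache configuration."""
    accesses = hits + misses
    return (accesses * e_cache              # data array access energy
            + accesses * e_tag              # tag array access energy
            + drowsy_transitions * e_drowsy # drowsy transition energy
            + misses * e_mem)               # off-chip memory energy
```

Fed with the hit/miss counts from the LRU-stack performance profile, such a function lets every candidate (size, associativity) pair be ranked by energy as well as by performance.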

20 Experimental setup (1/2)
- Focus on the data cache.
- Simulation platform:
  - 4-issue VLIW processor [Faraboschi et al. ISCA'00]
  - 32KB 4-way data cache, 32B block size, 20 cycles miss penalty
- Benchmarks:
  - MiBench: fft, gsm, susan
  - MediaBench: mpeg, epic
  - PowerStone: summin, whestone, v42bis

21 Experimental setup (2/2)
- CACTI 3.0 to obtain energy values; we extend it to provide leakage energy values for each simulated cache configuration.
- HotLeakage, from which we adapted the leakage energy calculation for each simulated leakage reduction technique.
- Estimated memory ratio = 50; drowsy energy from [Flautner et al. ISCA'02].

22 Program behavior (1/4)
- GSM (figure, log10 scale): profiles for all 32K configurations, all 16K configurations, and the 8K configuration; annotations mark a capacity-miss effect, a tradeoff region, a sensitive region, and an insensitive region.

23 Program behavior (2/4)
- FFT

24 Program behavior (3/4)
- Working-set size sensitivity property: the working set can be partitioned into clusters with similar cache sensitivity.
- Capturing sensitivity through working-set size clustering: the partitioning is done relative to the base cache configuration.
- We use a simple metric based on the Manhattan distance between two points.
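The clustering idea above can be sketched as follows; the greedy strategy, the threshold, and the 2-D (performance, energy) point representation are assumptions for illustration, since the slide does not spell out the exact algorithm.

```python
# Hedged sketch of phase clustering by Manhattan distance: phases
# whose profile points lie within a threshold of an existing cluster
# centre are assumed to share the same cache configuration.

def manhattan(p, q):
    """Manhattan (L1) distance between two equal-length points."""
    return sum(abs(a - b) for a, b in zip(p, q))

def cluster_phases(points, threshold):
    """Greedy clustering: each point joins the first cluster whose
    centre is within `threshold`, otherwise it starts a new cluster."""
    centres, clusters = [], []
    for pt in points:
        for i, c in enumerate(centres):
            if manhattan(pt, c) <= threshold:
                clusters[i].append(pt)
                break
        else:
            centres.append(pt)
            clusters.append([pt])
    return clusters
```

One configuration per cluster (rather than per phase) keeps the number of reconfiguration instructions, and hence their overhead, small.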

25 Program behavior (4/4)
- More energy/performance profiles: summin, whestone

26 Results (1/3)
- Dynamic energy reduction

27 Results (2/3)
- Leakage energy savings (0.07um); the better cases are due to gated-Vdd

28 Results (3/3)
- Performance: worst-case degradation of 65%, due to drowsy transitions

29 Conclusions and future work
- We can do better for improving performance:
  - reduce the frequency of drowsy transitions within a phase with refined cache-bank access policies
  - manage reconfiguration at the compiler level: insert basic-block annotations in the trace and exploit feedback-directed compilation
- A promising scheme for embedded systems.