1 Drowsy Caches: Simple Techniques for Reducing Leakage Power. Krisztián Flautner, Nam Sung Kim, Steve Martin, David Blaauw, Trevor Mudge

2 Motivation
On-chip caches are responsible for 15%~20% of total chip power, and leakage can exceed 50% of total cache power according to our projection using the Berkeley Predictive Models.
Leakage power keeps increasing as feature size shrinks: as V_t scales down, leakage current grows exponentially.

3 Processor power trends
(Chart based on the ITRS roadmap and transistor-count estimates.) Total power in this projection is unsustainable in practice.

4 An observation about L1 data caches
Working set: the fraction of cache lines accessed in a time window (window size = 2000 cycles). Only a small fraction of lines is accessed in any given window.
(Figure: working set of the current window, and of the current plus 1, 8, and 32 previous windows.)
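To make the working-set measurement concrete, here is a minimal C sketch. It assumes a hypothetical per-cycle trace of accessed line indices (-1 meaning no access), a 1024-line cache, and the 2000-cycle window from the slide; none of these names or parameters come from the original experiments.

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define NUM_LINES   1024    /* illustrative cache size (e.g. 32 KB / 32 B lines) */
#define WINDOW_SIZE 2000    /* cycles per observation window, as on the slide    */

/* Print the fraction of distinct cache lines touched in each window.
 * trace[i] is the line index accessed in cycle i, or -1 for no access. */
void working_set_per_window(const int *trace, long num_cycles)
{
    bool touched[NUM_LINES];
    long cycle = 0;

    while (cycle < num_cycles) {
        memset(touched, 0, sizeof touched);
        int distinct = 0;
        long end = cycle + WINDOW_SIZE;

        for (; cycle < end && cycle < num_cycles; cycle++) {
            int line = trace[cycle];
            if (line >= 0 && line < NUM_LINES && !touched[line]) {
                touched[line] = true;
                distinct++;
            }
        }
        printf("window ending at cycle %ld: %.1f%% of lines touched\n",
               cycle, 100.0 * distinct / NUM_LINES);
    }
}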

5 The Drowsy Cache approach
Optimize across the circuit/microarchitecture boundary: use of the appropriate circuit technique enables simplified microarchitectural control. Requirement: state preservation in the low-leakage mode.
Instead of being sophisticated about predicting the working set, reduce the penalty for being wrong.
Algorithm: periodically put all lines in the cache into drowsy mode; when a line is accessed, wake it up.
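A minimal C sketch of this control loop, assuming a global per-cycle tick and a per-line drowsy bit; the sizes, names, and structure here are illustrative, not the authors' actual hardware.

#include <stdbool.h>

#define NUM_LINES   1024   /* illustrative cache size                        */
#define WINDOW_SIZE 2000   /* cycles between global drowsy sweeps            */

typedef struct {
    bool drowsy;           /* line held at the low, state-preserving voltage */
    /* tag, data, valid bit, ... */
} cache_line_t;

static cache_line_t cache[NUM_LINES];
static long window_counter;

/* Called once per cycle.  There is no per-line prediction state: when the
 * window expires, every line is simply put into drowsy mode, and lines in
 * the working set wake themselves up again on their next access. */
void drowsy_tick(void)
{
    if (++window_counter >= WINDOW_SIZE) {
        window_counter = 0;
        for (int i = 0; i < NUM_LINES; i++)
            cache[i].drowsy = true;
    }
}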

6 Access control flow – Awake tags
(Flow diagram: on an awake-tag match (hit), a drowsy line is woken up and then accessed; on a tag miss, the line is replaced from memory and woken up.)
A drowsy hit or miss adds at most 1 cycle of latency; an access to an awake line is not penalized.
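A behavioural C model of this flow, with the tags always awake so that hit/miss can be resolved without delay. The helpers index_of, tag_of, and fill_from_memory are hypothetical placeholders, and a direct-mapped cache is assumed for simplicity; none of this comes from the paper.

#include <stdbool.h>

typedef struct {
    unsigned tag;
    bool     valid;
    bool     drowsy;   /* data array is in low-voltage mode; the tag stays awake */
} cache_line_t;

extern cache_line_t cache[];                      /* direct-mapped for simplicity */
extern int      index_of(unsigned addr);          /* hypothetical address helpers */
extern unsigned tag_of(unsigned addr);
extern void     fill_from_memory(int index, unsigned addr);

/* Returns the extra latency, in cycles, beyond a normal awake access. */
int cache_access(unsigned addr)
{
    int idx = index_of(addr);
    cache_line_t *line = &cache[idx];

    /* Tags are awake, so the hit/miss decision itself costs nothing extra. */
    bool hit = line->valid && (line->tag == tag_of(addr));
    if (!hit)
        fill_from_memory(idx, addr);              /* replacement from memory      */

    int extra = 0;
    if (line->drowsy) {                           /* wake the data array          */
        line->drowsy = false;
        extra = 1;                                /* drowsy hit/miss: at most 1 cycle */
    }
    return extra;                                 /* awake hit: no penalty        */
}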

7 Access control flow – Drowsy tags
A drowsy-tags implementation is more complicated. Is the complexity worth it?
–Tags use only about 7% of the data bits (32-bit address).
–Only a small incremental leakage reduction.
–Worst case: 3 cycles of extra latency.
(Flow diagram: the tags are woken up first and then matched; on a hit the line is woken up and accessed; on a miss the line is replaced from memory and woken up; unneeded tags and lines are put back into drowsy mode.)

8 Low-leakage circuit techniques
Gated-V_DD – Pros: largest leakage reduction, fast mode switching, easy implementation. Cons: loses cell state.
ABB-MTCMOS – Pros: retains cell state. Cons: slow mode switching.
DVS – Pros: retains cell state, fast mode switching, more power reduction than ABB. Cons: more susceptible to SEU noise.

9 Drowsy memory using DVS
Use a low supply voltage for inactive memory cells.
–Low voltage reduces the leakage current too!
–Quadratic reduction in leakage power: P_leak = I_leak × V_DD.
(Circuit diagram: the cell's leakage path, with one supply voltage for normal mode and a lower one for drowsy mode.)
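The reasoning behind the quadratic claim, written out as a rough back-of-the-envelope derivation; treating the leakage current as roughly proportional to the supply voltage over the range of interest is a simplifying assumption for this sketch, not the paper's device model:

\[
P_{\text{leak}} = I_{\text{leak}}(V_{DD}) \cdot V_{DD},
\qquad
I_{\text{leak}} \propto V_{DD}
\;\Rightarrow\;
P_{\text{leak}} \propto V_{DD}^{2}
\]

Under this approximation, lowering the drowsy supply to, say, 60% of the normal V_DD would cut leakage power to roughly 36% of its original value, before any additional sub-threshold effects.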

10 Leakage reduction using DVS
High-V_t devices for the access transistors reduce leakage power but increase the cache access time.
The right trade-off point: 91% leakage reduction for a 6% cycle-time increase.
(Projections for a 0.07 μm process.)

11 Drowsy cache line architecture

12 Energy reduction (projections for a 0.07 μm process)
High leakage: lines have to be powered up when accessed.
Drowsy circuit:
–Without a high-V_t device (in the SRAM): 6x leakage reduction, no access delay.
–With a high-V_t device: 10x leakage reduction, 6% access-time increase.

13 1-cycle vs. 2-cycle wake-up
Fast wake-up is important, but easy to accomplish!
–Cache access time: 0.57 ns (for 0.07 μm, from CACTI using a 0.18 μm baseline).
–Wake-up speed depends on the voltage-controller size: 64 x L_eff – 0.28 ns (half cycle at 4 GHz), 32 x L_eff – 0.42 ns, 16 x L_eff – 0.77 ns.
The impact of drowsy tags is quite similar to a two-cycle wake-up.

14 Policy comparison
(Chart comparing drowsy policies: 'simple' with a 2000-cycle window, 'simple' with a 4000-cycle window, and 'noaccess' with a 4000-cycle window.)

15 Energy reduction
(Table: normalized total energy and normalized leakage energy for DVS versus the theoretical minimum, together with the run-time increase, for both awake tags and drowsy tags.)
The theoretical minimum assumes zero leakage in drowsy mode. The total energy reduction is within 0.1 of the theoretical minimum.
–Diminishing returns for better leakage-reduction techniques.
The figures above assume a 6x leakage reduction; 10x is possible with a small additional run-time impact.
> 50% total energy reduction; > 70% leakage energy reduction.
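A hedged worked example of why the returns diminish (the 90% drowsy-occupancy figure below is a number invented for illustration, not a result from the paper): if a fraction d of lines is drowsy at any instant and the drowsy mode leaks 1/r as much as normal, the normalized leakage energy is roughly

\[
E_{\text{leak}} \approx (1 - d) + \frac{d}{r}
\]

With d = 0.9 this gives about 0.25 for r = 6 and 0.19 for r = 10, against a floor of 0.10 as r goes to infinity, so a modest leakage-reduction ratio already captures most of the available benefit.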

16 Conclusions
Simple circuit technique:
–Needs high-V_t transistors and a low-V_DD supply.
Simple architecture:
–No need to keep counter/predictor state for each line.
–A periodic global counter asserts the drowsy signal.
–The window size (for the periodic drowsy transition) depends on the core: ~4000 cycles gives a good energy-delay trade-off.
The technique also works well on in-order processors:
–The memory subsystem is already latency tolerant.
The drowsy circuit is good enough:
–Diminishing returns on further leakage reduction.
–The focus is back on dynamic energy.