Drowsy Caches: Simple Techniques for Reducing Leakage Power — Krisztián Flautner (ARM Ltd), Advanced Computer Architecture Lab, The University of Michigan


Drowsy Caches: Simple Techniques for Reducing Leakage Power
Authors: Krisztián Flautner (ARM Ltd); Nam Sung Kim, Steve Martin, David Blaauw & Trevor Mudge (Advanced Computer Architecture Lab, The University of Michigan)
Published in the Proceedings of the 29th Annual International Symposium on Computer Architecture (ISCA), 2002
In-class presentation on 11/24/2008 by Harshit Khanna
Arizona State University, CSE 520 Advanced Computer Architecture

Outline
- Summary
- Motivation
- Circuit techniques
  - Traditional circuit techniques: Gated-VDD, ABB-MTCMOS, Dynamic VDD Scaling (DVS)
  - Comparison of the low-leakage circuit techniques
  - Proposed circuit technique
- Policies
  - Implementation of the drowsy cache line
  - Additions to the traditional cache line
  - Basic working description
  - Working-set characteristics
  - Observations
  - Results
- Policy evaluation
  - Test setup
  - Energy consumption
- Future work

Summary
Even the simplest policy, in which cache lines are periodically put into a low-power mode without regard to their access histories, can reduce the cache's static power consumption by more than 80%. The total energy consumed in the cache is reduced by an average of 54%. The leakage fraction of cache energy drops from an average of 76% in projected conventional caches to an average of 50% in the drowsy cache. Performance degradation is 9% for crafty and under 4% for equake.

Motivation
Rising clock speeds and transistor densities drive up leakage (static) power consumption. Leakage power already accounts for 15%-20% of the total power on chips. As process technology moves below 0.1 micron, static power consumption is set to increase exponentially, putting it on the path to dominating the total power used by the CPU. The on-chip caches are one of the main candidates for leakage reduction, since they contain a significant fraction of the processor's transistors.

Circuit Techniques

Traditional Circuit Techniques
Gated-VDD
- Working: reduces leakage power by using a high-threshold (high-Vt) transistor to turn off the power to the memory cell when the cell is put into low-power mode.
- Advantages: leakage is significantly reduced.
- Disadvantages: any information stored in the cell is lost when it is switched into low-leakage mode; there is a performance penalty; special high-Vt devices are required for the control logic.

Traditional Circuit Techniques (contd.)
ABB-MTCMOS
- Working: the threshold voltages of the transistors in the cell are dynamically increased when the cell is put into drowsy mode, by raising the source-to-body voltage of the transistors in the circuit.
- Advantages: leakage is significantly reduced.
- Disadvantages: the supply voltage of the circuit is increased, offsetting some of the gain in total leakage power; special high-Vt devices are required for the control logic.

Dynamic VDD Scaling (DVS)
- Advantages: retains cell information in low-power mode; fast switching between power modes; easy implementation; more power reduction than ABB-MTCMOS.
- Disadvantages: dependent on process variation; more susceptible to noise.

Comparison of the various low-leakage circuit techniques

Proposed circuit technique
- Choose between two different supply voltages for each cache line.
- The DVS technique has been used in the past to trade off dynamic power consumption against performance; here, voltage scaling is exploited to reduce static power consumption.
- Due to short-channel effects in deep-submicron processes, leakage current drops significantly as the supply voltage is scaled down.
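The last point can be illustrated with a toy numerical model. This is an assumption-laden sketch, not the paper's circuit analysis: it simply assumes subthreshold leakage current grows roughly with the cube of VDD in deep-submicron processes, so leakage power scales roughly with the fourth power of VDD.

```python
# Toy model (an assumption, not from the paper): subthreshold leakage
# current ~ VDD^3 in deep-submicron processes, so leakage power ~ VDD^4.
def leakage_power(vdd, exponent=4.0):
    """Relative leakage power at supply voltage vdd (arbitrary units)."""
    return vdd ** exponent

active = leakage_power(1.0)   # active-mode supply, e.g. 1.0 V (illustrative)
drowsy = leakage_power(0.3)   # drowsy-mode supply, e.g. 0.3 V (illustrative)
print(f"drowsy/active leakage ratio: {drowsy / active:.4f}")
```

Under this model, dropping the line supply from 1.0 V to 0.3 V cuts leakage power by roughly two orders of magnitude while, unlike Gated-VDD, the cell still retains its state.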

Policies

Implementation of the drowsy cache line
The technique is applied to L1 drowsy data caches. All lines in an L2 cache can be kept in drowsy mode without significant impact on performance.

Additions to the cache line
- Word-line gating circuit: prevents accesses while the line is in drowsy mode, since unchecked accesses to a drowsy line could destroy the memory's contents.
- Voltage controller: determines the operating voltage of the array of memory cells in the cache line; it switches the array voltage between the high (active) and low (drowsy) supply voltages depending on the state of the drowsy bit.
- Drowsy bit: controls the voltage supplied to the memory cells.
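The three additions can be sketched as per-line state. This is a hypothetical sketch (class and constant names are mine, voltages are illustrative), not the paper's circuit:

```python
# Hypothetical sketch of the per-line state a drowsy cache adds: a drowsy
# bit that drives the voltage controller, plus word-line gating that blocks
# accesses while the line sits at the low supply voltage.
from dataclasses import dataclass

VDD_HIGH = 1.0   # active-mode supply (illustrative value)
VDD_LOW = 0.3    # drowsy-mode supply (illustrative value)

@dataclass
class DrowsyCacheLine:
    tag: int
    data: bytes
    drowsy: bool = False          # the added drowsy bit

    @property
    def supply_voltage(self) -> float:
        # Voltage controller: selects the array supply from the drowsy bit.
        return VDD_LOW if self.drowsy else VDD_HIGH

    def wordline_enabled(self) -> bool:
        # Word-line gating: accesses are allowed only in active mode.
        return not self.drowsy
```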

Basic working description If a drowsy cache line is accessed, the drowsy bit is cleared, and consequently the supply voltage is switched to high VDD. The wordline gating circuit is used to prevent accesses when in drowsy mode, since the supply voltage of the drowsy cache line is lower than the bit line precharge voltage; unchecked accesses to a drowsy line could destroy the memory’s contents. Whenever a cache line is accessed, the cache controller monitors the condition of the voltage of the cache line by reading the drowsy bit. If (accessed line == normal mode) Then read the contents of the cache line (without losing any performance because the power mode of the line can be checked by reading the drowsy bit concurrently with the read and comparison of the tag). If (accessed line == drowsy mode) Then prevent the discharge of the bit lines of the memory array (because it may read out incorrect data). The line is woken up automatically during the next cycle, and the data can be accessed during consecutive cycles. Arizona State University CSE 520 Advanced Computer Architecture 14

Working-set characteristics
- execfactor: the expected worst-case execution-time increase for the baseline algorithm.
- accs: the number of accesses.
- wakelatency: the wakeup latency (1 cycle).
- accsperline: the number of accesses per line.
- memimpact: how much impact a single memory access has on overall performance. The assumption is that an increase in cache access latency translates one-for-one into an increase in execution time, so memimpact is set to 1.
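The slide lists the inputs but not the formula itself. One plausible reconstruction (my assumption; verify against the paper before relying on it) is that, in the worst case, every line pays the wakeup latency on its first access, so the average per-access overhead is wakelatency / accsperline, scaled by memimpact:

```python
# Reconstructed worst-case slowdown estimate (an assumption, not verbatim
# from the paper): each access pays, on average, wakelatency / accsperline
# extra cycles, and memimpact converts cache latency into execution time.
def execfactor(memimpact, wakelatency, accsperline):
    return 1.0 + memimpact * wakelatency / accsperline

# With the slide's assumptions (memimpact = 1, wakelatency = 1 cycle) and
# a hypothetical 10 accesses per line, the worst-case slowdown is 10%.
print(execfactor(1.0, 1, 10))
```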

Observations
Should the tags be put into drowsy mode along with the data?
- In both cases, no extra latency is involved when an awake line is accessed.
- In direct-mapped caches there is no performance advantage to keeping the tags awake: there is only one possible line for each index, so if that line is drowsy it must be woken up immediately to be accessed.
- A drowsy access takes at least three cycles to complete.

Results
The fraction of unique cache lines accessed during an update window is relatively small: on most benchmarks, more than 90% of the lines can be in drowsy mode at any one time. Performance degradation is 9% for crafty and under 4% for equake.
Advantages:
- The static power consumption of the cache is significantly reduced.
- Prediction techniques for controlling the drowsy cache are unnecessary, provided the cache can transition between drowsy and awake modes relatively quickly.

Policy evaluation

The following parameters can be varied:
- Update window size: specifies, in cycles, how frequently decisions are made about which lines to put into drowsy mode.
- Simple or noaccess policy: the policy that uses no per-line access history is referred to as the simple policy; under it, all lines in the cache are put into drowsy mode periodically (the period is the window size). Under the noaccess policy, only lines that have not been accessed during a window are put into drowsy mode.
- Awake or drowsy tags: specifies whether the tags in the cache may be drowsy.
- Transition time: the number of cycles needed to wake up or put to sleep a cache line. Only 1- or 2-cycle transition times are considered, since circuit simulation indicates these are reasonable assumptions.
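The two policies can be contrasted with a toy trace-driven sketch. This is my own simplification (one access per cycle, wakeups merely counted, helper names of my choosing), not the authors' simulator:

```python
# Toy comparison of the two policies above: "simple" puts every line to
# sleep at each window boundary; "noaccess" only puts to sleep lines that
# were not touched during the last window.
def run_policy(trace, num_lines, window, policy="simple"):
    drowsy = [False] * num_lines
    accessed = set()          # lines touched in the current window
    wakeups = 0
    for cycle, line in enumerate(trace):  # trace: line index per cycle
        if cycle and cycle % window == 0:  # window boundary: apply policy
            for i in range(num_lines):
                if policy == "simple" or i not in accessed:
                    drowsy[i] = True
            accessed.clear()
        if drowsy[line]:
            wakeups += 1       # this access pays the wakeup latency
            drowsy[line] = False
        accessed.add(line)
    return wakeups

# A trace that keeps re-touching line 0: simple pays a wakeup at every
# window boundary, noaccess leaves the hot line awake.
print(run_policy([0] * 10, num_lines=2, window=4, policy="simple"))
print(run_policy([0] * 10, num_lines=2, window=4, policy="noaccess"))
```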

Test setup
Various benchmarks from the SPEC2000 suite were run on SimpleScalar using the Alpha instruction set. All simulations were run for 1 billion instructions. The simulator configurations are summarized below:
- OO4: 4-wide superscalar pipeline; 32K direct-mapped L1 icache with 32-byte lines and 1-cycle hit latency; 32K 4-way set-associative L1 dcache with 32-byte lines and 1-cycle hit latency; 8-cycle L2 cache latency.
- IO2: 2-wide in-order pipeline; cache parameters same as for OO4.

Energy consumption
The authors find that the simple policy with a window size of 4000 cycles reaches a reasonable compromise between simplicity of implementation, power savings, and performance. The impact of this policy on leakage energy is characterized by:
- Normalized total energy: the total energy used in the drowsy cache divided by the total energy consumed in a regular cache.
- Normalized leakage energy: the ratio of leakage energy in the drowsy cache to leakage energy in a normal cache.
- The DVS columns report the energy savings achieved by the scaled-VDD (DVS) circuit technique.
- The theoretical-minimum column assumes that leakage in low-power mode can be reduced to zero without losing state, i.e. it estimates the energy savings with the best possible hypothetical circuit technique.
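The two normalized metrics can be made concrete with placeholder numbers (illustrative only, not figures from the paper). Leakage is split between time spent awake at full leakage and time spent drowsy at reduced leakage:

```python
# Sketch of the normalization above; all inputs are placeholders, not
# measurements from the paper.
def cache_energy(dynamic, leak_active, drowsy_fraction, drowsy_leak_ratio):
    """Return (total, leakage) energy for one cache configuration.

    drowsy_leak_ratio: drowsy-mode leakage power relative to active mode
    (0.0 models the theoretical-minimum column; a DVS circuit is higher).
    """
    leakage = leak_active * ((1 - drowsy_fraction)
                             + drowsy_fraction * drowsy_leak_ratio)
    return dynamic + leakage, leakage

# Regular cache: never drowsy. Drowsy cache: 90% of lines asleep at 10%
# of active leakage (both numbers are hypothetical).
regular_total, regular_leak = cache_energy(1.0, 3.0, 0.0, 1.0)
drowsy_total, drowsy_leak = cache_energy(1.0, 3.0, 0.9, 0.1)
print("normalized total energy  :", drowsy_total / regular_total)
print("normalized leakage energy:", drowsy_leak / regular_leak)
```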

The drowsy cache implementation reduces the total energy consumed in the data cache by more than 50% without significantly impacting performance. Total leakage energy is reduced by:
- an average of 71% when the tags are always awake;
- an average of 76% using the drowsy-tag scheme.

Future work
- The proposed scheme is not a solution for all caches in the processor: the L1 instruction cache does not do as well with the proposed algorithm.
- Investigate instruction-prefetch algorithms combined with the drowsy circuit technique.
- Extend these techniques to other memory structures, such as branch predictors.
- Study the impact of an adaptive window size.

Thank you! Questions?