Dept. of Computer Science, UC Irvine

Presentation transcript:

ZZ-HVS: Zig-Zag Horizontal and Vertical Sleep Transistor Sharing to Reduce Leakage Power in On-Chip SRAM Peripheral Circuits. Houman Homayoun, Avesta Makhzan, and Alex Veidenbaum, Dept. of Computer Science, UC Irvine, hhomayou@ics.uci.edu

Outline. Cache Power Dissipation; Why Cache Peripherals?; Proposed Circuit Technique to Reduce Leakage in Cache Peripherals; Circuit Evaluation; Proposed Architecture to Control the Circuit; Results; Conclusion

On-chip Caches and Power. On-chip caches in high-performance processors are large, taking more than 60% of the chip budget. They dissipate a significant portion of power via leakage. Much of it was in the SRAM cells, and many architectural techniques have been proposed to remedy this. Today, there is also significant leakage in the peripheral circuits of an SRAM (cache), in part because cell design has been optimized. (Pentium M processor die photo, courtesy of intel.com)

Peripherals? Data input/output driver, address input/output driver, row pre-decoder, wordline driver, row decoder. Others: sense-amp, bitline pre-charger, memory cells, decoder logic.

Why Peripherals? Cells use minimum-sized transistors for area reasons, while peripherals use larger, faster, and accordingly leakier transistors to satisfy timing requirements. Cells use high-Vt transistors, whereas peripherals use typical-threshold-voltage transistors.

Leakage Power Components of L2 Cache. SRAM peripheral circuits dissipate more than 90% of the total leakage power.

Circuit Techniques Address Leakage in the SRAM Cell. Gated-Vdd, Gated-Vss; Voltage Scaling (DVFS); ABB-MTCMOS; Forward Body Biasing (FBB), RBB; Sleepy Stack; Sleepy Keeper. All target the SRAM memory cell.

Architectural Techniques. Way Prediction, Way Caching, Phased Access: predict or cache recently accessed ways, or read the tag first. Drowsy Cache: keeps cache lines in a low-power state, with data retention. Cache Decay: evicts lines not used for a while, then powers them down. DVS, Gated-Vdd, and Gated-Vss are applied to the memory cell, with much architectural support for doing so. All target the cache SRAM memory cell.

Sleep Transistor Stacking Effect. Subthreshold current is an inverse exponential function of threshold voltage. Stacking transistor N with slpN: the source-to-body voltage (VM) of transistor N increases, which reduces its subthreshold leakage current when both transistors are off. Drawbacks: rise time, fall time, wakeup delay, area, dynamic power, instability.
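The stacking effect can be illustrated with a simplified textbook subthreshold model. This is only a sketch: the prefactor i0, the slope factor n, the linearized body-effect coefficient, and the 0.3 V threshold are illustrative assumptions and do not come from the slides, and VM is treated as a free parameter rather than solved from the circuit.

```python
import math

THERMAL_VOLTAGE = 0.0259  # kT/q in volts near room temperature (assumed)


def subthreshold_current(v_gs, v_ds, v_th, i0=1e-7, n=1.5):
    """Simplified subthreshold current in amperes.

    i0 and n are illustrative fitting parameters, not values from the slides.
    """
    return (i0 * math.exp((v_gs - v_th) / (n * THERMAL_VOLTAGE))
            * (1.0 - math.exp(-v_ds / THERMAL_VOLTAGE)))


def stacked_leakage(v_dd, v_m, v_th0, body_coeff=0.2):
    """Leakage of the off transistor N when its source sits at voltage v_m.

    The gate is at 0 V, so V_gs = -v_m and V_ds = v_dd - v_m. The
    source-to-body bias raises the threshold; body_coeff is an assumed
    linearized body-effect coefficient.
    """
    v_th = v_th0 + body_coeff * v_m
    return subthreshold_current(-v_m, v_dd - v_m, v_th)


if __name__ == "__main__":
    # Sweep the intermediate node voltage VM at the 1.08 V supply used later.
    for v_m in (0.0, 0.05, 0.10, 0.15):
        print(f"VM = {v_m:.2f} V -> leakage = {stacked_leakage(1.08, v_m, 0.3):.3e} A")
```

Under these assumptions, raising VM lowers the leakage through three effects at once: a negative gate-to-source voltage, a higher threshold voltage, and a smaller drain-to-source voltage.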

Source of Subthreshold Leakage in the Peripheral Circuitry. The inverter chain has to drive a logic value 0 to the pass transistors when a memory row is not selected. N1, N3 and P2, P4 are in the off state and are leaking.

A Redundant Circuit Approach. Drawback: impact on the wordline driver output rise time, fall time, and propagation delay.

Impact on Rise Time and Fall Time. The rise time and fall time of the output of an inverter are proportional to Rpeq * CL and Rneq * CL, respectively. Inserting the sleep transistors increases both Rneq and Rpeq. An increase in rise time impacts performance; an increase in fall time impacts memory functionality.
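The proportionality above can be sketched with a first-order RC model. The function names, the resistor and capacitor values, and the treatment of the sleep transistor as a plain series resistance are assumptions made for illustration; the factor 2.2 (ln 9) is the standard 10% to 90% transition time of a first-order RC response.

```python
def transition_time(r_eq_ohm, c_load_farad, k=2.2):
    """10%-90% transition time of a first-order RC stage, in seconds."""
    return k * r_eq_ohm * c_load_farad


def with_sleep_transistor(r_eq_ohm, r_sleep_ohm):
    """Equivalent resistance once a sleep transistor is stacked in series."""
    return r_eq_ohm + r_sleep_ohm


if __name__ == "__main__":
    r_peq = 10e3      # assumed pull-up equivalent resistance (ohms)
    r_sleep = 2e3     # assumed on-resistance of the sleep transistor (ohms)
    c_load = 10e-15   # assumed load capacitance CL (farads)

    base = transition_time(r_peq, c_load)
    slowed = transition_time(with_sleep_transistor(r_peq, r_sleep), c_load)
    print(f"rise time without sleep transistor: {base:.3e} s")
    print(f"rise time with sleep transistor:    {slowed:.3e} s")
```

In this model the transition time grows by the same fraction as the series resistance, which is why a wider (lower-resistance) sleep transistor costs area but hurts timing less.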

Fall Time Increase Impact. A fall time increase lengthens the pass transistor active period (read operation). The bitline over-discharges and the memory content over-charges during the read operation. Such over-discharge increases the dynamic power dissipation of the bitlines and can cause a cell content flip if the over-discharge period is large. The sense amplifier timing circuit and the wordline pulse generator circuit need to be redesigned!

A Zig-Zag Circuit. Rpeq for the first and third inverters and Rneq for the second and fourth inverters do not change. The fall time of the circuit does not change.
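The zig-zag placement can be sketched as a small rule: in standby, each inverter gets a sleep transistor only on the side whose transistor is off and leaking. The function name and the string labels below are hypothetical; the rule itself follows from the earlier slide stating that N1, N3 and P2, P4 are off when the chain drives a 0.

```python
def zigzag_sleep_placement(standby_input, n_stages):
    """Return (stage, placement) pairs for an inverter chain in standby.

    standby_input is the logic value (0 or 1) at the input of the first
    inverter while the row is not selected.
    """
    placement = []
    level = standby_input
    for stage in range(1, n_stages + 1):
        if level == 0:
            # Input low: the PMOS is on, the NMOS is off and leaking,
            # so the sleep transistor goes under the NMOS (footer).
            placement.append((stage, "NMOS footer"))
        else:
            # Input high: the NMOS is on, the PMOS is off and leaking,
            # so the sleep transistor goes above the PMOS (header).
            placement.append((stage, "PMOS header"))
        level = 1 - level  # each inverter flips the logic value
    return placement


if __name__ == "__main__":
    # Four-stage wordline driver whose output must be 0 in standby,
    # which means the chain input is also 0.
    for stage, where in zigzag_sleep_placement(0, 4):
        print(f"inverter {stage}: {where}")
```

With a standby input of 0, the fourth inverter only receives a PMOS header, so its pull-down path (Rneq) is untouched, which matches the claim that the output fall time does not change.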

A Zig-Zag Share Circuit. To improve the leakage reduction and area efficiency of the zig-zag scheme, one set of sleep transistors is shared between multiple stages of inverters. Two variants: Zig-Zag Horizontal Sharing, and Zig-Zag Horizontal and Vertical Sharing.

Zig-Zag Horizontal Sharing. Comparing zz-hs with the zig-zag scheme at the same area overhead: zz-hs has less impact on rise time, and both reduce leakage by almost the same amount.

Zig-Zag Horizontal and Vertical Sharing

Leakage Reduction of Zig-Zag Horizontal and Vertical Sharing. An increase in the virtual ground voltage increases the leakage reduction.

Circuit Evaluation. Test experiment: a wordline inverter chain drives 256 one-bit memory cells. Layout uses Mentor Graphics IC Station in TSMC 65nm technology. Simulation uses Synopsys HSPICE with a supply voltage of 1.08V at the typical corner (25 °C). The empirical results presented are for leakage current, rise time and fall time, propagation delay, dynamic power, and area.

Zig-zag Horizontal Sharing: Power Results. Dynamic power increase of 1.5% to 3.5%. Maximum leakage reduction of 94%.

Zig-zag Horizontal Sharing: Latency Results. The wordline driver fall time is not affected for either zig-zag or zig-zag share. zz-hs-2W has the least impact on rise time and propagation delay.

Zig-zag Horizontal Sharing: Area Results. Area increase varies significantly, from 25% for the zz-hs-1W circuit to 115% for the redundant scheme.

ZZ-HVS Evaluation: Power Result. Increasing the number of wordline rows that share sleep transistors increases the leakage reduction and reduces the area overhead. Leakage power reduction varies from 10X to 100X when 1 to 10 wordlines share the same sleep transistors. This is 2~10X more leakage reduction compared to the zig-zag scheme.

ZZ-HVS Evaluation: Area Result. zz-hvs has the least impact on area: 4~25%, depending on the number of wordline rows shared.

ZZ-HVS Circuit Evaluation: Sleep Transistor Sizing. There is a trade-off between the leakage savings and the impact on the wordline driver propagation delay. zz-hvs-3W (3X) shows an optimal trade-off: a 40X reduction in leakage at a 5% increase in propagation delay.

Wakeup Latency. To benefit the most from the leakage savings of stacking sleep transistors, keep the bias voltage of the NMOS sleep transistor as low as possible (and for PMOS as high as possible). Drawback: impact on the wakeup latency of the wordline drivers. The wakeup latency associated with the zz-hvs-3W circuit is 1.3 ns, about 4 processor cycles at 3.3 GHz. For a large memory, such as a 2MB L2 cache, the overall wakeup latency can be as high as 6 to 10 cycles.
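The conversion from wakeup time to processor cycles is a single multiplication. A minimal sketch, with a hypothetical helper name, using the 1.3 ns and 3.3 GHz figures from the slide:

```python
def latency_in_cycles(latency_ns, clock_ghz):
    """Convert a latency in nanoseconds to clock cycles at clock_ghz.

    1 ns at 1 GHz is exactly one cycle, so the product is the cycle count.
    """
    return latency_ns * clock_ghz


if __name__ == "__main__":
    cycles = latency_in_cycles(1.3, 3.3)
    print(f"zz-hvs-3W wakeup: {cycles:.2f} cycles")
```

The result is slightly above 4 cycles, so a controller that must guarantee the circuit is awake would round up rather than down.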

Impact on Propagation Delay. The zz-hvs scheme increases the propagation delay of the peripheral circuit by 5% when applied to wordline drivers, input/output drivers, etc. This translates to roughly a 5% reduction in the maximum operating clock frequency of the memory in a single-pipeline memory. Deeply pipelined memories such as L1 and L2 caches can hide this small increase in peripheral circuit latency.

Sleep-Share: ZZ-HVS + Architectural Control. When an L2 cache miss occurs, the processor executes a number of miss-independent instructions and then ends up stalling. The processor stays idle until the L2 cache miss is serviced. This may take hundreds of cycles (300 cycles for our processor architecture). During such a stall period there is no access to the L1 and L2 caches, and they can be put into low-power mode.

Detecting Processor Idle Period. The instruction queue and functional units of the processor are monitored after an L2 miss. If the instruction queue has not issued any instructions and the functional units have not executed any instructions for K consecutive cycles (K=10), the sleep signal is asserted. The sleep signal is de-asserted 10 cycles before the miss service is completed. Assumption: memory access latency is deterministic. No performance loss.
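The detection rule above can be sketched as a small cycle-driven controller. The class and method names are hypothetical, and the model is a simplification: it tracks one outstanding miss and assumes the deterministic 300-cycle miss latency mentioned on the previous slide.

```python
class SleepController:
    """Sketch of the Sleep-Share idle detector (names are hypothetical)."""

    def __init__(self, k_idle=10, wake_lead=10, miss_latency=300):
        self.k_idle = k_idle              # K consecutive idle cycles
        self.wake_lead = wake_lead        # wake this many cycles early
        self.miss_latency = miss_latency  # assumed deterministic latency
        self.miss_start = None
        self.idle_count = 0

    def on_l2_miss(self, cycle):
        """Start monitoring when an L2 miss is detected."""
        self.miss_start = cycle
        self.idle_count = 0

    def tick(self, cycle, issued, executed):
        """Advance one cycle and return the value of the sleep signal."""
        if self.miss_start is None:
            return False
        done = self.miss_start + self.miss_latency
        if cycle >= done:
            # Miss serviced: stop monitoring.
            self.miss_start = None
            self.idle_count = 0
            return False
        if issued or executed:
            self.idle_count = 0
        else:
            self.idle_count += 1
        if cycle >= done - self.wake_lead:
            # De-assert early so the wakeup latency is hidden.
            return False
        return self.idle_count >= self.k_idle


if __name__ == "__main__":
    ctrl = SleepController()
    ctrl.on_l2_miss(0)
    # Assume the processor stays busy for 20 cycles after the miss.
    asleep = [c for c in range(1, 301)
              if ctrl.tick(c, issued=(c <= 20), executed=(c <= 20))]
    print(f"sleep asserted from cycle {asleep[0]} to cycle {asleep[-1]}")
```

Waking 10 cycles early covers the 6 to 10 cycle wakeup latency quoted for a large L2 cache, which is how the scheme avoids a performance loss when the miss latency is predictable.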

Simulated Processor Architecture. SimpleScalar 4.0. SPEC2K benchmarks, compiled with the -O4 flag using the Compaq compiler targeting the Alpha 21264 processor, fast-forwarded for 3 billion instructions, then fully simulated for 4 billion instructions using the reference data sets.

L1 and L2 Leakage Power Reduction. Leakage reduction of 30% for the L2 cache and 28% for the L1 cache.

Conclusion. Studied the breakdown of leakage in L2 cache components and showed that the peripheral circuits leak considerably. Proposed zig-zag share to reduce leakage in SRAM memory peripheral circuits. Zig-zag share reduces peripheral leakage by up to 40X with only a small increase in memory area and delay. Proposed Sleep-Share to control zig-zag share circuits in L1 and L2 cache peripherals. Leakage reduction of 30% for the L2 cache and 28% for the L1 cache.

T H A N K S