Dynamic Fine-Grain Leakage Reduction Using Leakage-Biased Bitlines Seongmoo Heo, Kenneth Barr, Mark Hampton, and Krste Asanović Computer Architecture Group,

Slides:

Advertisements

Similar presentations

CS 7810 Lecture 4 Overview of Steering Algorithms, based on Dynamic Code Partitioning for Clustered Architectures R. Canal, J-M. Parcerisa, A. Gonzalez.

Advertisements

Reducing Leakage Power in Peripheral Circuits of L2 Caches Houman Homayoun and Alex Veidenbaum Dept. of Computer Science, UC Irvine {hhomayou,

Leakage Energy Management in Cache Hierarchies L. Li, I. Kadayif, Y-F. Tsai, N. Vijaykrishnan, M. Kandemir, M. J. Irwin, and A. Sivasubramaniam Penn State.

Semiconductor Memory Design. Organization of Memory Systems Driven only from outside Data flow in and out A cell is accessed for reading by selecting.

University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science University of Michigan.

1 Advanced Computer Architecture Limits to ILP Lecture 3.

Keeping Hot Chips Cool Ruchir Puri, Leon Stok, Subhrajit Bhattacharya IBM T.J. Watson Research Center Yorktown Heights, NY Circuits R-US.

Leakage-Biased Domino Circuits for Dynamic Fine- Grain Leakage Reduction Seongmoo Heo and Krste Asanović Massachusetts Institute of Technology Lab for.

Adaptive Techniques for Leakage Power Management in L2 Cache Peripheral Circuits Houman Homayoun Alex Veidenbaum and Jean-Luc Gaudiot Dept. of Computer.

Introduction to CMOS VLSI Design Lecture 13: SRAM

Introduction to CMOS VLSI Design Lecture 18: Design for Low Power David Harris Harvey Mudd College Spring 2004.

Chuanjun Zhang, UC Riverside 1 Low Static-Power Frequent-Value Data Caches Chuanjun Zhang*, Jun Yang, and Frank Vahid** *Dept. of Electrical Engineering.

11/29/2004EE 42 fall 2004 lecture 371 Lecture #37: Memory Last lecture: –Transmission line equations –Reflections and termination –High frequency measurements.

S. Reda EN160 SP’08 Design and Implementation of VLSI Systems (EN1600) Lecture 14: Power Dissipation Prof. Sherief Reda Division of Engineering, Brown.

June 20 th 2004University of Utah1 Microarchitectural Techniques to Reduce Interconnect Power in Clustered Processors Karthik Ramani Naveen Muralimanohar.

CS 7810 Lecture 12 Power-Aware Microarchitecture: Design and Modeling Challenges for Next-Generation Microprocessors D. Brooks et al. IEEE Micro, Nov/Dec.

Super-Drowsy Caches Single-V DD and Single-V T Super-Drowsy Techniques for Low- Leakage High-Performance Instruction Caches Nam Sung Kim, Krisztián Flautner,

Introduction to CMOS VLSI Design SRAM/DRAM

LOW-LEAKAGE REPEATERS FOR NETWORK-ON-CHIP INTERCONNECTS Arkadiy Morgenshtein, Israel Cidon, Avinoam Kolodny, Ran Ginosar Technion – Israel Institute of.

CS 7810 Lecture 3 Clock Rate vs. IPC: The End of the Road for Conventional Microarchitectures V. Agarwal, M.S. Hrishikesh, S.W. Keckler, D. Burger UT-Austin.

On-Line Adjustable Buffering for Runtime Power Reduction Andrew B. Kahng Ψ Sherief Reda † Puneet Sharma Ψ Ψ University of California, San Diego † Brown.

Techniques for Efficient Processing in Runahead Execution Engines Onur Mutlu Hyesoon Kim Yale N. Patt.

S. Reda EN160 SP’07 Design and Implementation of VLSI Systems (EN0160) Lecture 13: Power Dissipation Prof. Sherief Reda Division of Engineering, Brown.

1 Drowsy Caches Simple Techniques for Reducing Leakage Power Krisztián Flautner Nam Sung Kim Steve Martin David Blaauw Trevor Mudge

Lecture 5 – Power Prof. Luke Theogarajan

Lecture 19: SRAM.

Lecture 7: Power.

Feb 14 th 2005University of Utah1 Microarchitectural Wire Management for Performance and Power in partitioned architectures Rajeev Balasubramonian Naveen.

Author: D. Brooks, V.Tiwari and M. Martonosi Reviewer: Junxia Ma

Parts from Lecture 9: SRAM Parts from

Feb 14 th 2005University of Utah1 Microarchitectural Wire Management for Performance and Power in Partitioned Architectures Rajeev Balasubramonian Naveen.

Power, Energy and Delay Static CMOS is an attractive design style because of its good noise margins, ideal voltage transfer characteristics, full logic.

Case Study - SRAM & Caches

6.893: Advanced VLSI Computer Architecture, September 28, 2000, Lecture 4, Slide 1. © Krste Asanovic Krste Asanovic

EE466: VLSI Design Power Dissipation. Outline Motivation to estimate power dissipation Sources of power dissipation Dynamic power dissipation Static power.

Low Power Techniques in Processor Design

1 Lecture 21: Core Design, Parallel Algorithms Today: ARM Cortex A-15, power, sort and matrix algorithms.

Power Reduction for FPGA using Multiple Vdd/Vth

CS Lecture 4 Clock Rate vs. IPC: The End of the Road for Conventional Microarchitectures V. Agarwal, M.S. Hrishikesh, S.W. Keckler, D. Burger UT-Austin.

Dept. of Computer Science, UC Irvine

Drowsy Caches: Simple Techniques for Reducing Leakage Power Authors: ARM Ltd Krisztián Flautner, Advanced Computer Architecture Lab, The University of.

Warped Gates: Gating Aware Scheduling and Power Gating for GPGPUs

Exploiting Program Hotspots and Code Sequentiality for Instruction Cache Leakage Management J. S. Hu, A. Nadgir, N. Vijaykrishnan, M. J. Irwin, M. Kandemir.

Power Management in High Performance Processors through Dynamic Resource Adaptation and Multiple Sleep Mode Assignments Houman Homayoun National Science.

A Centralized Cache Miss Driven Technique to Improve Processor Power Dissipation Houman Homayoun, Avesta Makhzan, Jean-Luc Gaudiot, Alex Veidenbaum University.

Z. Feng MTU EE4800 CMOS Digital IC Design & Analysis 12.1 EE4800 CMOS Digital IC Design & Analysis Lecture 12 SRAM Zhuo Feng.

Advanced VLSI Design Unit 06: SRAM

Energy-Effective Issue Logic Hasan Hüseyin Yılmaz.

Complexity-Effective Superscalar Processors S. Palacharla, N. P. Jouppi, and J. E. Smith Presented by: Jason Zebchuk.

Multiple Sleep Mode Leakage Control for Cache Peripheral Circuits in Embedded Processors Houman Homayoun, Avesta Makhzan, Alex Veidenbaum Dept. of Computer.

Houman Homayoun, Sudeep Pasricha, Mohammad Makhzan, Alex Veidenbaum Center for Embedded Computer Systems, University of California, Irvine,

Leakage reduction techniques Three major leakage current components 1. Gate leakage ; ~ Vdd 4 2. Subthreshold ; ~ Vdd 3 3. P/N junction.

Feb 14 th 2005University of Utah1 Microarchitectural Wire Management for Performance and Power in Partitioned Architectures Rajeev Balasubramonian Naveen.

Basics of Energy & Power Dissipation

1 Energy-Efficient Register Access Jessica H. Tseng and Krste Asanović MIT Laboratory for Computer Science, Cambridge, MA 02139, USA SBCCI2000.

Dan Ernst – ISCA-30 – 6/10/03 Advanced Computer Architecture Lab The University of Michigan Cyclone: A Low-Complexity Broadcast-Free Dynamic Instruction.

Z. Feng MTU EE4800 CMOS Digital IC Design & Analysis 6.1 EE4800 CMOS Digital IC Design & Analysis Lecture 6 Power Zhuo Feng.

Seok-jae, Lee VLSI Signal Processing Lab. Korea University

Click to edit Master title style Progress Update Energy-Performance Characterization of CMOS/MTJ Hybrid Circuits Fengbo Ren 05/28/2010.

FaridehShiran Department of Electronics Carleton University, Ottawa, ON, Canada SmartReflex Power and Performance Management Technologies.

Cache Pipelining with Partial Operand Knowledge Erika Gunadi and Mikko H. Lipasti Department of Electrical and Computer Engineering University of Wisconsin—Madison.

Presented by Rania Kilany.  Energy consumption  Energy consumption is a major concern in many embedded computing systems.  Cache Memories 50%  Cache.

Dynamic Zero Compression for Cache Energy Reduction Luis Villa Michael Zhang Krste Asanovic

Power-Optimal Pipelining in Deep Submicron Technology

Temperature and Power Management

Lecture 19: SRAM.

Warped Gates: Gating Aware Scheduling and Power Gating for GPGPUs

Microarchitectural Techniques for Power Gating of Execution Units

Fine-Grain CAM-Tag Cache Resizing Using Miss Tags

A High Performance SoC: PkunityTM

Presentation transcript:

Dynamic Fine-Grain Leakage Reduction Using Leakage-Biased Bitlines Seongmoo Heo, Kenneth Barr, Mark Hampton, and Krste Asanović Computer Architecture Group, MIT LCS ISCA 2002

Growing impact of leakage power –Increase of leakage power due to scaling of transistor lengths and threshold voltages –Power budget limits use of fast leaky transistors Challenge: –How to maintain performance scaling in face of increasing leakage power? Leakage Power

Static: Design-time Selection of Slow Transistors (SSST) for non-critical paths –Replace fast transistors with slow ones on non-critical paths –Tradeoff between delay and leakage power Dynamic: Run-time Deactivation of Fast Transistors (DDFT) for critical paths –DDFT switches critical path transistors between inactive and active modes Leakage Reduction Techniques

Critical paths dominate leakage after applying SSST techniques Example: PowerPC 750 –5% of transistor width is low Vt, but these account for >50% of total leakage.  DDFT could give large leakage savings Observation:

Body Biasing –V t increase by reverse-biased body effect –Large transition time and wakeup latency due to well cap and resistance Power Gating –Sleep transistor between supply and virtual supply lines –Increased delay due to sleep transistor Sleep Vector –Input vector which minimizes leakage –Increased delay due to mux and active energy due to spurious toggles after applying sleep vector V body > V dd Gate Source Drain Body Existing DDFT Circuit Techniques Sleep signal Virtual Vdd Vdd Logic cells 0 0

Have to turn off small pieces of an active processor for short periods of time –Difficult to turn off large pieces for long periods  Fine-grain DDFT techniques Requirements of Fine-grain DDFT techniques –Circuits with low active delay penalty, low energy moving in and out of sleep, and fast wakeup time –Micro-architectural scheduling to keep the sleep time as long and often as possible Compare to coarse-grain DDFT techniques –O.S. puts whole processor to sleep for a long time  doesn’t save power when running code –Low steady-state leakage only concern. Fine-Grain DDFT Techniques

Highlights of This Work We introduce metrics for comparing fine- grain dynamic deactivation techniques –Steady-stage leakage, Transition time, Fixed transition energy, Breakeven time We present a new circuit-level leakage reduction technique, Leakage-Biased Bitlines (LBB) –Low deactivation energy and fast wakeup We save leakage power of I-Cache and Multiported regfile by LBB –I-cache: idle subbank deactivation –Multiported regfile: idle read ports and dead register deactivation

Outline 1.Methodology and DDFT Metrics 2.Cache Leakage Saving Idle subbank deactivation 3.Multiported Regfile Leakage Saving Dead reg deactivation (Horizontal) Idle read port deactivation (Vertical) 4.Conclusion

Methodology Process Technology –180nm DVT process modeled after 0.18um TSMC LVT and MVT processes –Scaled to 130, 100, and 70nm processes based on SIA roadmap –Optimistic/pessimistic leakage prediction: 2x/4x increase of leakage current density (nA/um) Evaluation with SimpleScalar –Modified to model unified physical register file –4 issue, 100 integer physical regs, 16KB/4-Way/32- B block I-Cache and D-Cache, Unified L-2 Cache –SPECint95 refs Energy measurements – Hspice simulation for 180nm process and scaled to other processes accordingly

Metrics for Fine-Grain DDFT Techniques Wakeup Latency Active delay and power Original Leakage Steady-state Sleep Leakage Time Leakage Current Length of Sleep Leakage Energy Transition Time Break-Even Time Fixed Active Transition Energy DDFT Leakage Original Leakage DDFT applied

L1 Cache and Multiported Regfile Good targets for Fine-grain DDFT techniques –Timing-critical Contrast: L2 cache is a better target for SSST (long channel or HVT transistors) –Large leakage current Cache: Large number of fast transistors Multiported Regfile: Ever increasing number of registers and ports –Alpha register file is 5x larger than 64KB data cache

LBB for Caches Local bitlines (32-bit cells) disconnected from senseamp by local-global switch. LBB for Caches: If a subbank is not in use, turn off precharge transistors and delay precharging. Modern cache structure : Hierarchical Bitlines –To save active power –To reduce delay –To reduce bitline noise Local Bitline SenseAmp Subbank Local-Global Switch Global Bitline

Cache: Dual Vt SRAM cell 0 1 BIT BIT_BAR GLOBAL BIT GLOBAL BIT_BAR 0 WL 1 1 HVT transistors: green-colored

Cache: Dual Vt SRAM cell 0 1 BIT BIT_BAR GLOBAL BIT GLOBAL BIT_BAR 0 WL 1 1

Cache: Dual Vt SRAM cell Bitline leakage depends on the stored value 0 1 BIT BIT_BAR GLOBAL BIT GLOBAL BIT_BAR 0 WL 1 1

Cache: Dual Vt SRAM cell Bitline leakage depends on the stored value 0 1 BIT BIT_BAR GLOBAL BIT GLOBAL BIT_BAR 0 WL 1 1 Our Target

Forcing 0 Forcing 1 Forcing ?

Leakage-Biased Bitlines (LBB) LBB lets bitlines float by turning off the local HVT NMOS precharge transistors –No static current draw because local bitline isolated –LBB uses leakage itself to bias bitlines to the voltage which minimizes leakage! A good fine-grain dynamic technique –Minimal transition energy: Same number of precharges (delayed precharge) –Minimal transition time: Wakeup latency is only that of precharge phase Discharge to 0 Stay at 1 Discharge to an intermediate value between 0 and

LBB finds the minimal leakage state. –Always better than sleep vectors LBB versus Sleep Vector

32-row x 32B SRAM subbank (optimistic leakage current used. 75% zero assumed) Cumulative Leakage Energy LBB Original Dynamic energy cost: Need to replace the lost charge -LBB curve increases fast in the beginning Decrease of Breakeven time -180nm: 200 cycles, 70nm: less than a cycle -Active energy scales down faster than leakage energy

Performance Issues for LBB Caches Subbank must be precharged before use –Case 1 (best): subbank decode and precharge happen before more complex word-line decode, therefore no penalty. –Case 2 (worst): add additional pipeline stage for precharge One cycle increase in branch misprediction penalty –Focus on I-Cache because any latency increase can be partly hidden by branch prediction

Case 2 (worst) assumption (adding additional pipeline stage)  2.5% IPC decrease on average I-Cache Subbank Deactivation

Multiported Regfile Cell Simplified but active/leakage power-aware baseline READ[0:7] WWL[0:3] RWL[0:7] WRITE[0:3] WRITEB[0:3] x8 x4 8R, 4W unbalanced DVT reg cell HVT transistors: green-colored

LBB for Multiported Regfiles LBB for Multiported Regfiles: Turn off the precharge transistor on idle subbank read ports –Leakage current discharges bitlines to 0 if any bits are holding 1.

Horizontal technique Dead registers = Registers in free list If all registers in a subbank are dead, all read ports in the subbank are turned off by LBB No performance penalty since there is ample time to re-precharge between allocation and write. Dead Register Deactivation Readport 0 Readport 1 Readport 2 Subbank 1

Dead Register Deactivation Readport 0 Readport 1 Readport 2 Subbank 1 Horizontal technique Dead registers = Registers in free list If all registers in a subbank are dead, all read ports in the subbank are turned off by LBB No performance penalty since there is ample time to re-precharge between allocation and write.

Alternative horizontal DDFT To turn off dead registers using NMOS sleep transistors (NST) Advantage: registers can be turned off individually Disadvantage: increased read access time –Set delay penalty to 5% (tradeoff between delay and leakage) NMOS Sleep Transistor (NST) Readport 0 Readport 1 Readport 2 Register 1 1

NMOS Sleep Transistor (NST) Readport 0 Readport 1 Readport 2 Register 1 0 Alternative horizontal DDFT To turn off dead registers using NMOS sleep transistors (NST) Advantage: registers can be turned off individually Disadvantage: increased read access time –Set delay penalty to 5% (tradeoff between delay and leakage)

Vertical technique Idle read ports when fewer than max # of instructions are issued in a superscalar machine Idle read ports deactivated by LBB No performance penalty since it is known whether a read port is needed before it is known which register will be accessed in the pipeline. Idle Readport Deactivation Readport 0 Readport 1 Readport 2

Idle Readport Deactivation Readport 0 Readport 1 Readport 2 Vertical technique Idle read ports when fewer than max # of instructions are issued in a superscalar machine Idle read ports deactivated by LBB No performance penalty since it is known whether a read port is needed before it is known which register will be accessed in the pipeline.

32 x 32-b Regfile subbank (75% zero assumed. Optimistic leakage current used.) Comparison of DDFTs Original NMOS Sleep Transistor Leakage-Biased Bitlines Sleep Vector NMOS Sleep Transistor Original Sleep Vector Leakage-Biased Bitlines

Comparison of DDFTs Blowup: 70nm Original NMOS Sleep Transistor Leakage-Biased Bitlines Sleep Vector

Dead Register/Subbank Deactivation Policies Free list policies for NST (NMOS Sleep Transistor): queue and stack –queue: conventional –stack: keeps some regs dead for longer % greater savings than queue at 70nm Benefit increases as feature sizes shrink Subbank allocation policy for LBB: stack –Allocate a new subbank only when the previous bank is empty of dead registers

Dead Reg Deactivation (Horizontal) NST stack better than NST queue, LBB stack better than either NST Colored: optimistic White: pessimistic

Read Port Deactivation (Vertical) More energy saving for wider issue processors Readport deactivation can be combined with dead subbank deactivation.

Conclusion Most leakage power is in critical paths –Dynamic leakage reduction (DDFT) desired LBB allows Fine-grain dynamic leakage reduction with zero or minimal performance penalty. –0% performance penalty for multiported regfiles Sleep time can be improved by changing micro-architectural scheduling policies. –Stack better than queue for free list policy Follow on work: –Leakage-biased domino logic to save leakage power in critical ALUs [ VLSI Symposium 2002 ]

Acknowledgments Thanks to Christopher Batten, Ronny Krashinsky, Rajesh Kumar, and anonymous reviewers Funded by DARPA PAC/C award F , NSF CAREER award CCR , and a donation from Infineon Technologies.

DDFT Examples Body Biasing Power GatingSleep Vector Steady-state leakage power Less than 5% (depends on V body ) Less than 5% (depends on sleep transistor) Less than 50% (depends on the circuit) Transition time, Wakeup latency 0.1~100usLess than a cycle Transition energy,Breakeven time Well cap switching energy Sleep transistor gate cap switching energy Active energy consumed due to spurious toggling after sleep vector Delay Impact NoYes. Due to sleep transistor Yes. Due to mux Etc Area for sleep transistor and virtual supplies Finding sleep vector is hard