Dynamic Zero Compression for Cache Energy Reduction
Luis Villa, Michael Zhang, Krste Asanovic

Conventional Cache Structure
Energy dissipation:
- Bitlines (~75%)
- Decoders
- I/O drivers
- Wordlines
[Figure: SRAM array with address decoder (addr), wordline (wl), differential bitlines (bit, bit_b), I/O, and bus]

Existing Energy Reduction Techniques
- Sub-banking
- Hierarchical bitlines
- Low-swing bitlines
  - Only for reads; writes are performed at full swing.
- Wordline gating
[Figure: sub-banked cache with address decoder (addr), global wordline (gwl), offset decoders (offset) driving local wordlines (lwl), SRAM cells, sense amps, I/O, and bus]

Asymmetry of Bits in Cache
- More than 70% of the bits in D-cache accesses are “0”s
  - Measured from SPECint95 and MediaBench
  - Examples: small values, data types
- Differential bitlines are preferred in large SRAM designs.
  - Better noise immunity
  - Faster sensing
- Related work with single-ended bitlines
  - [Tseng and Asanovic ’00]: used in register file design with single-ended bitlines.
  - [Chang et al. ’99]: used in ROM and small RAM with single-ended bitlines.
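
The zero-bit asymmetry on this slide can be illustrated with a short Python sketch that counts the fraction of zero bits in a buffer. The function name `zero_bit_fraction` and the sample data are hypothetical; this is not the measurement setup used for SPECint95 or MediaBench, only an illustration of why small integer values are dominated by zeros.

```python
def zero_bit_fraction(data: bytes) -> float:
    """Return the fraction of bits in `data` that are 0."""
    if not data:
        raise ValueError("data must be non-empty")
    ones = sum(bin(b).count("1") for b in data)
    total = 8 * len(data)
    return (total - ones) / total


# Hypothetical sample: four 32-bit little-endian words holding small integers.
sample = b"".join(v.to_bytes(4, "little") for v in (1, 2, 3, 255))

if __name__ == "__main__":
    print(f"zero bits: {zero_bit_fraction(sample):.1%}")
```

Small values stored in full-width words leave the upper bytes entirely zero, which is the case the rest of the talk targets.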

Dynamic Zero Compression
Zero Indicator Bit (ZIB)
- One bit per grouping of bits
- Set if the bits are all zeros
- Controls wordline gating
[Figure: address-controlled path (address decoder, offset decoder, local wordline lwl) and data-controlled path (ZIB gating the local wordline of its SRAM cells), with sense amps, I/O, and bus]
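
A behavioral sketch of the ZIB idea in Python, assuming one ZIB per byte (the talk later refers to zero bytes and a byte slice of the sub-bank). The names `store_word` and `read_word` are hypothetical, and the model only tracks which bytes would need their data bitlines activated; it says nothing about the actual circuit.

```python
from typing import List, Tuple


def store_word(word: bytes) -> Tuple[List[bool], bytes]:
    """Compute one ZIB per byte: True when the byte is all zeros."""
    zibs = [b == 0 for b in word]
    return zibs, word


def read_word(zibs: List[bool], stored: bytes) -> Tuple[bytes, int]:
    """Rebuild the word and count the bytes whose data cells were accessed."""
    out = bytearray()
    accessed = 0
    for zib, b in zip(zibs, stored):
        if zib:
            out.append(0)   # wordline gated off; the value is implied by the ZIB
        else:
            out.append(b)   # normal access to the data cells
            accessed += 1
    return bytes(out), accessed


if __name__ == "__main__":
    zibs, stored = store_word((5).to_bytes(4, "little"))
    data, accessed = read_word(zibs, stored)
    print(data.hex(), accessed)
```

In this model a 32-bit word holding the value 5 touches the data cells of only one byte out of four; the other three are reconstructed from their ZIBs.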

Data Cache Bitline Swing Reduction
[Chart: % bitline swing reduction]
- Calculation includes the bitline swings introduced by the ZIB
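
One way to account for the ZIB's own swings is sketched below in Python. The model is an assumption, not the authors' calculation: it charges one bitline swing per bit read, one ZIB per byte that is always read, and eight data-bit swings only for nonzero bytes. The function name `swing_reduction` is hypothetical.

```python
def swing_reduction(word: bytes) -> float:
    """Fractional reduction in bitline swings for one read of `word`.

    Assumed model: the baseline swings 8 bitline pairs per byte; with
    zero compression every byte swings its ZIB bitline, and only
    nonzero bytes swing their 8 data bitline pairs.
    """
    if not word:
        raise ValueError("word must be non-empty")
    n = len(word)
    nonzero = sum(1 for b in word if b != 0)
    baseline = 8 * n
    with_zib = n + 8 * nonzero
    return 1.0 - with_zib / baseline


if __name__ == "__main__":
    print(swing_reduction(bytes([5, 0, 0, 0])))   # mostly-zero word
    print(swing_reduction(bytes([1, 2, 3, 4])))   # no zero bytes
```

Under these assumptions a word with no zero bytes gives a negative result, since the ZIB reads are pure overhead; the technique pays off only when zero bytes are common.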

Hardware Modifications
- Zero Indicator Bit
- Wordline gating circuitry
- Sense amplifier
- CPU store driver
- Cache output driver

ZIB and Wordline Gating Circuitry
- Small delay overhead
[Figure: ZIB cell and wordline gating circuitry; signals LWL, BWL, Bit, Bit_b, ZIB, ZIB_b, W_EN; wordline capacitance labels C_wl and C_wl/4; address decoder (addr), bwl, SRAM cells, sense amplifiers, I/O, and bus]

Sense Amplifier Modification
Zero-valued data:
- Not driven onto the bus
- Not in the critical path
- ZIB read without delay
[Figure: modified sense amp for data bits (Bit, Bit_b, zero, sense) and sense amp for the ZIB (ZIB, ZIB_b); address decoder (addr), bwl, SRAM cells, sense amplifiers, I/O, and bus]

CPU Store and Cache Output Drivers
- Reduce data bus energy dissipation
[Figure: CPU store driver and cache output driver exchanging data bits and ZIB over a low-swing bus; signals W_EN, LWL, write data; ZIB routed to the WLG]

Area Overhead
Area overhead: 9%
- Zero indicator bits
- Sense amplifiers
- WLG circuitry
- I/O circuitry
[Figure: byte slice of the sub-bank (data, ZIB, WLG)]

Delay Overhead
- No delay overhead for writes
  - Zero check performed in parallel with tag check
- 2 FO4 gate delays for reads
  - A pessimistic 7% worst-case delay
[Figure: data bits, ZIB, LWL, and low-swing bus]

Data Cache Energy Savings
[Chart: % of energy savings]
- Savings obtained for a low-power cache with sub-banking, wordline gating, and low-swing bitlines

Bits Distribution for Instruction Cache
- Zeros are not as prevalent in the I-cache.
- Use a recoding scheme to increase the number of zero bytes in the I-cache.
- [Panich ’99]: IWLG technique that compacts the instructions.
  - Use two-address form when src reg = dest reg
  - Shorter immediates
  - Three different instruction lengths: short, medium, long
  - Gate the unused portion of the instruction to avoid bitline swing
  - Faster read-out for top two bytes (opcode, reg. acc., inter-locks)
[Figure labels: Optimal; lwl; s/m; m/l]

IWLG to Dynamic Zero Compression
Adopting the IWLG technique for Dynamic Zero Compression
- Small modification to the instruction format
  - Use [format figure] instead of [format figure]
- Upper two bytes are zero-detected
- Lower two bytes are usage-detected
- Able to eliminate bitline swings of zero-valued bytes in the 2 upper bytes
  - Example: Opcode
- Slower than IWLG due to wordline gating in the critical path
[Figure labels: s/m; m/l; 0?; lwl]

Instruction Cache Bit Swing Reduction
[Chart: % reduction of bitline swings]

Instruction Cache Energy Savings
[Chart: % energy savings]

Conclusion
- A novel hardware technique to reduce cache energy by eliminating the access of zero bytes.
  - Small area and delay overhead
    - Area: 9%; delay: 2 FO4 gate delays
  - Average energy saving
    - D-cache: 26%; I-cache: 18%
    - Processor-wide: ~10% for typical embedded processors
  - Completely orthogonal to existing energy reduction techniques
- Dynamic Zero Compression is applicable to
  - Second-level caches
  - DRAM
  - Datapath [Canal et al., Micro-33]

Thank You!