Highly-Associative Caches for Low-Power Processors
Michael Zhang and Krste Asanovic

Motivation

Caches consume 30-60% of processor energy in embedded systems.
- Example: 43% for StrongARM-1

Many academic studies on caches:
- [Albera, Bahar '98] - power and performance trade-offs
- [Amrutur, Horowitz '98, '00] - speed and power scaling
- [Bellas, Hajj, Polychronopoulos '99] - dynamic cache management
- [Ghose, Kamble '99] - power reduction through sub-banking, etc.
- [Inoue, Ishihara, Murakami '99] - way-predicting set-associative cache
- [Kin, Gupta, Mangione-Smith '97] - filter cache
- [Ko, Balsara, Nanda '98] - multilevel caches for RISC and CISC
- [Wilton, Jouppi '94] - CACTI cache model

Many industrial low-power processors use CAM (content-addressable memory) tags:
- ARM3 - 64-way set-associative [Furber et al. '89]
- StrongARM - 32-way set-associative [Santhanam et al. '98]
- Intel XScale - 32-way set-associative ('01)

CAM: fast and energy-efficient.

Talk Outline
- Structural comparison
- Area and delay comparison
- Energy comparison
- Related work
- Conclusion

Set-Associative RAM-tag Cache

- Not energy-efficient: all ways are read out in parallel
- Two-phase approach (read and compare tags first, then read only the matching way) is more energy-efficient, but roughly 2x the latency

[Diagram: Tag/Status/Data arrays per way with tag comparators; address split into Tag | Index | Offset fields]
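The tag/index/offset split above can be sketched in a few lines. This is an illustrative model, not the paper's design; the 32-byte line size and 64-set geometry here are hypothetical placeholders.

```python
# Sketch: how a RAM-tag set-associative cache splits an address into
# tag | index | offset. Geometry is hypothetical: 32-bit addresses,
# 32-byte lines (5 offset bits), 64 sets (6 index bits).
LINE_BYTES = 32
NUM_SETS = 64

OFFSET_BITS = LINE_BYTES.bit_length() - 1   # log2(32) = 5
INDEX_BITS = NUM_SETS.bit_length() - 1      # log2(64) = 6

def split_address(addr: int):
    offset = addr & (LINE_BYTES - 1)                 # byte within the line
    index = (addr >> OFFSET_BITS) & (NUM_SETS - 1)   # selects one set
    tag = addr >> (OFFSET_BITS + INDEX_BITS)         # compared in each way
    return tag, index, offset
```

The index directly selects one set; the tag is what each way's RAM-tag array must read out and compare, which is why a conventional access reads all ways.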

Set-Associative RAM-tag Sub-bank

- Not energy-efficient: all ways are read out
- Two-phase approach: more energy-efficient, but ~2x latency
- Sub-banking: 1 sub-bank = 1 way
- Low-swing bitlines: used only for reads; writes are performed full-swing
- Wordline gating

[Diagram: sub-banked RAM-tag array; address decoder drives global wordlines (gwl) and local wordlines (lwl); offset decoders select words from the tag and data SRAM cell arrays through sense amps and a tag comparator onto the I/O bus]

CAM-tag Cache

- Only one sub-bank is activated per access
- Associativity is provided within the sub-bank
- Easy to implement high associativity

[Diagram: CAM tag/status arrays with per-way HIT? match lines driving the data array; address split into Tag | Bank | Offset | Word fields]
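A CAM-tag lookup can be sketched as: the bank field selects one sub-bank, then every tag in that sub-bank is compared at once. The linear scan below stands in for the parallel CAM search, and all sizes and names are hypothetical, not the paper's circuit.

```python
# Sketch of a CAM-tag lookup: bank select, then an associative tag
# search within the selected sub-bank. Hypothetical geometry:
# 32-byte lines (5 offset bits), 8 banks (3 bank bits), 32 ways/bank.
NUM_BANKS = 8
WAYS_PER_BANK = 32
LINE_BYTES = 32

class CamTagBank:
    def __init__(self):
        self.tags = [None] * WAYS_PER_BANK   # one CAM entry per line

    def search(self, tag):
        # In hardware all rows compare in parallel; at most one
        # match (word) line is asserted. A scan models that here.
        for way, stored in enumerate(self.tags):
            if stored == tag:
                return way       # index of the asserted match line
        return None              # miss: no match line asserted

banks = [CamTagBank() for _ in range(NUM_BANKS)]

def lookup(addr):
    offset = addr & (LINE_BYTES - 1)
    bank = (addr >> 5) & (NUM_BANKS - 1)
    tag = addr >> 8
    return banks[bank].search(tag), offset

banks[3].tags[7] = 0x123   # pretend the line with tag 0x123 is resident
```

Note the contrast with the RAM-tag case: there is no index field at all; the bank bits pick a sub-bank, and the tag search itself locates the way.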

CAM-tag Cache Sub-bank

- Only one sub-bank activated
- Associativity within the sub-bank
- Easy to implement high associativity

[Diagram: sub-bank with the broadcast tag searched in the CAM-tag array; the matching entry drives the global wordline (gwl) and local wordlines (lwl), and offset decoders select words from the SRAM data arrays through sense amps onto the I/O bus]

CAM Functionality and Energy Usage

- 10-T CAM cell with separate write/search lines and a low-swing match line: an SRAM cell plus an XOR comparator
- On a match, the match line stays asserted; on a mismatch, the cell discharges it
- CAM energy dissipation:
  - Search lines
  - Match lines
  - Drivers

[Diagram: 10-T CAM cell showing the wordline (WL), Bit/Bit_b write bitlines, SBit/SBit_b search lines, and the match line, in both the match and mismatch cases]
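The cell's compare function above (SRAM bit XORed against the broadcast search bit, any mismatch discharging the shared match line) reduces to a one-line predicate. This is a toy functional model, with a hypothetical 24-bit tag width:

```python
# Toy model of a CAM row's match line: each cell XORs its stored bit
# with the broadcast search bit; any mismatching cell pulls the shared
# match line down. The 24-bit tag width is a hypothetical placeholder.
TAG_BITS = 24
TAG_MASK = (1 << TAG_BITS) - 1

def match_line(stored_tag: int, search_tag: int) -> bool:
    # XOR yields a 1 in every mismatching bit position; the match line
    # remains asserted only when no cell discharges it (all zeros).
    return (stored_tag ^ search_tag) & TAG_MASK == 0
```

The energy asymmetry follows from the hardware, not this model: the search lines toggle on every access regardless of outcome, while a mismatching row also discharges its match line, which is why low-swing match lines help.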

CAM-tag Cache Sub-bank Layout

- 10% area overhead over a RAM-tag cache
- 1-KB cache sub-bank implemented in 0.25 µm CMOS technology

[Layout: 2x12x32 CAM array alongside a 32x64 RAM array]

Delay Comparison

RAM-tag cache critical path:
- Index bits -> global wordline decoding (gwl) -> local wordline decoding (lwl) -> decoded offset -> tag and data readout -> tag comparison -> data out

CAM-tag cache critical path:
- Tag bits broadcast -> tag comparison -> gwl -> local wordline decoding (lwl) -> decoded offset -> data readout -> data out

The two access delays are within 3% of each other.

Hit Energy Comparison

[Chart: hit energy per access (pJ) for an 8KB cache, plotted against associativity and implementation]

Miss Rate Results

[Chart: miss rates for the LZW, ijpeg, m88ksim, perl, pegwit, and gcc benchmarks]

Total Access Energy (pegwit)

- Pegwit retains a high miss rate even at high associativity

[Chart: total energy per access (pJ) for an 8KB cache; miss energy expressed in multiples of the 32-bit read access energy]

Total Access Energy (perl)

- Perl has a very low miss rate at high associativity

[Chart: total energy per access (pJ) for an 8KB cache; miss energy expressed in multiples of the 32-bit read access energy]
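The trade-off in these total-energy results can be sketched as a one-line model: total energy per access is the hit energy plus the miss rate times the miss energy, with miss energy expressed as a multiple of a 32-bit read access. All numbers below are made-up placeholders, not the paper's measurements.

```python
# Sketch of the total-access-energy accounting. All pJ values and the
# miss cost are illustrative placeholders, not measured results.
def total_energy_per_access(hit_energy_pj, miss_rate,
                            miss_cost_reads, read_energy_pj):
    # Miss energy is expressed in multiples of a 32-bit read access.
    miss_energy_pj = miss_cost_reads * read_energy_pj
    return hit_energy_pj + miss_rate * miss_energy_pj

# Hypothetical low- vs high-associativity points: the higher hit energy
# of the associative cache can be repaid by its lower miss rate.
low_assoc  = total_energy_per_access(100.0, 0.05, 50, 20.0)  # 150.0 pJ
high_assoc = total_energy_per_access(120.0, 0.01, 50, 20.0)  # 130.0 pJ
```

This is why the benchmark matters: for a perl-like workload the miss-rate term shrinks quickly with associativity, while a pegwit-like workload keeps paying the miss term regardless.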

Other Advantages of CAM-tags

- Hit signal is generated earlier, which simplifies pipelines
- Simplified store operation:
  - Wordline is only enabled on a hit
  - Stores complete in a single cycle
  - No write buffer necessary

Related Work

CACTI and CACTI2 [Wilton and Jouppi '94], [Reinman and Jouppi '99]:
- Accurate delay estimates: results within 10% of ours
- Energy estimates are not suited to low-power designs; typical low-power features are not modeled:
  - Sub-banking
  - Low-swing bitlines
  - Wordline gating
  - Separate CAM search lines
  - Low-swing match lines
- CACTI's energy estimate is 10x greater than our model for one CAM-tag cache sub-bank

Our results closely agree with [Amrutur and Horowitz '98].

Conclusion

CAM tags offer high performance and low power:
- Energy consumption of a 32-way CAM-tag cache is lower than that of a 2-way RAM-tag cache
- Easy to implement highly-associative tags
- Low area overhead (~10%)
- Comparable access delay
- Better CPI through a reduced miss rate

Thank You!