Survey of Cache Compression


Outline
- Background & Motivation
- Block-based cache compression: FPC, ZCA, BDI, SC2, HyComp
- Stream-based cache compression: MORC

Background
Cache is important… The conflict: a larger LLC means more area, more latency, and more energy, while a limited LLC means more off-chip accesses.
Compression: trade off (de)compression latency for fewer off-chip accesses.

Frequent Pattern Compression
An example: the line 1 2 3 5 8 13 21 … stored as 32-bit words. The value 8 is 0000 0000 0000 0000 0000 0000 0000 1000, which matches the "4-bit sign-extended" pattern, so only 4 bits of content need to be stored. Compression ratio: ~2.

Frequent Pattern Compression
Patterns: each 32-bit word is encoded as a 3-bit prefix identifying the matched pattern, followed by 4~32 bits of content depending on the pattern.
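
A minimal sketch of this pattern matching in Python, assuming 32-bit words and a simplified subset of the pattern table (zero word plus 4-, 8-, and 16-bit sign-extended); the function names are illustrative:

```python
# Minimal FPC-style sketch: 3-bit prefix + 4..32 bits of content per
# 32-bit word. Simplified subset of the pattern table, not the full FPC.

def sign_extends(value: int, bits: int) -> bool:
    """True if the 32-bit word `value` fits in `bits` bits, sign-extended."""
    signed = value - (1 << 32) if value & 0x80000000 else value
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return lo <= signed <= hi

def fpc_encode_word(word: int) -> tuple[str, int]:
    """Return (3-bit prefix, content size in bits) for one 32-bit word."""
    if word == 0:
        return "000", 0          # zero word: prefix only
    if sign_extends(word, 4):
        return "001", 4          # 4-bit sign-extended
    if sign_extends(word, 8):
        return "010", 8          # one byte, sign-extended
    if sign_extends(word, 16):
        return "011", 16         # halfword, sign-extended
    return "111", 32             # uncompressible word

# Example: the Fibonacci line from the slide, stored as 32-bit words.
line = [1, 2, 3, 5, 8, 13, 21, 34]
bits = sum(3 + fpc_encode_word(w)[1] for w in line)
print(f"{bits} bits vs {32 * len(line)} uncompressed "
      f"-> ratio ~{32 * len(line) / bits:.1f}")
```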

Zero-content Augmentation
Common in real code and data: many "blank" (all-zero) blocks. ZCA uses only a few bits, essentially a tag plus a valid bit, to represent such a block instead of a full data line.
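
A minimal sketch of the idea, with a plain dict/set standing in for ZCA's tag-only zero-content structure (a real design would bound its size and handle eviction):

```python
# Minimal ZCA-style sketch: all-zero blocks are tracked with just a tag in
# a small zero-content directory, so no data storage is spent on them.

BLOCK = 64  # cache block size in bytes

class ZeroContentCache:
    def __init__(self):
        self.zero_tags = set()   # block addresses known to be all zero
        self.data = {}           # normal (non-zero) blocks

    def write(self, addr: int, block: bytes):
        if block == bytes(BLOCK):          # detect a "blank" block
            self.zero_tags.add(addr)
            self.data.pop(addr, None)
        else:
            self.zero_tags.discard(addr)
            self.data[addr] = block

    def read(self, addr: int):
        if addr in self.zero_tags:
            return bytes(BLOCK)            # synthesize zeros: no data access
        return self.data.get(addr)

cache = ZeroContentCache()
cache.write(0x1000, bytes(BLOCK))          # stored as a tag only
print(cache.read(0x1000) == bytes(BLOCK))  # True
```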

Base Delta Immediate
A simple example: the values 0x8048004, 0x8048008, 0x80480c0, 0x8048000 are stored as the base 0x8048004 plus narrow signed deltas (+0x0, +0x4, +0xbc, -0x4), so each 4-byte value needs only a small delta field.
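
A minimal sketch of the encoding and decoding steps, assuming 4-byte values and 2-byte signed deltas; `bdi_compress`/`bdi_decompress` are illustrative names, not the paper's hardware:

```python
# Minimal base+delta sketch: encode a line of 4-byte values as one base
# plus narrow signed deltas, and decode by adding them back.

def bdi_compress(values, delta_bytes=2):
    """Return (base, deltas) if the line fits base+delta, else None."""
    base = values[0]                       # first value serves as the base
    lo = -(1 << (8 * delta_bytes - 1))
    hi = (1 << (8 * delta_bytes - 1)) - 1
    deltas = [v - base for v in values]
    if all(lo <= d <= hi for d in deltas):
        return base, deltas
    return None                            # not compressible at this width

def bdi_decompress(base, deltas):
    """Decompression is just adding each delta back to the base."""
    return [base + d for d in deltas]

# The example from the slide: pointers sharing their high-order bits.
line = [0x8048004, 0x8048008, 0x80480C0, 0x8048000]
enc = bdi_compress(line)                   # base 0x8048004, deltas
print(enc)                                 # [0x0, 0x4, 0xBC, -0x4]
assert bdi_decompress(*enc) == line
# Size: 4 bytes (base) + 4 x 2 bytes (deltas) = 12 bytes vs 16 uncompressed.
```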

Base Delta Immediate
Multiple bases: it is clear that not every cache line can be represented as base+delta with one base. However, having more than two bases provides no additional improvement in compression ratio, so BDI settles on two (the second base is implicitly zero). How do we make use of the saved space?

Base Delta Immediate
Organization: the number of tags is doubled and compression-encoding bits are added to every tag; the data storage stays the same size but is partitioned into segments.
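
A minimal sketch of the bookkeeping this implies, assuming 8-byte segments and 64 bytes of data per set; the names and sizes are illustrative, not the paper's exact parameters:

```python
# Minimal sketch of a segmented data array: a compressed block occupies
# ceil(size / SEGMENT) contiguous segments, and doubling the tags lets a
# set name up to twice as many compressed blocks in the same data space.

import math

SEGMENT = 8          # bytes per segment (assumed)
SET_BYTES = 64       # data space per set, unchanged by compression

def segments_needed(compressed_size: int) -> int:
    return math.ceil(compressed_size / SEGMENT)

def blocks_fit(sizes) -> bool:
    """Can blocks with these compressed sizes share one set's data space?"""
    return sum(segments_needed(s) for s in sizes) <= SET_BYTES // SEGMENT

# Two 24-byte blocks and one 16-byte block fit in the space of a single
# 64-byte uncompressed block -- three tags in use instead of one:
print(blocks_fit([24, 24, 16]))   # True
```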

Base Delta Immediate
Decompression is simple: add each delta back to the base (a vector addition), giving lower decompression latency than prior schemes. Compression ratio: ~2.

Statistical Cache Compression
Huffman encoding over value statistics, with a "heap" used for sampling value frequencies and building the code. The most mathematical method so far, in my opinion. The circuit is too complex to develop fully in class.

Statistical Cache Compression
Decompression takes 10 cycles. Compression ratio: 3~4.
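
A software sketch of the core idea, building a Huffman code over sampled 32-bit values with a heap as on the slide; this illustrates the statistics, not SC2's hardware pipeline:

```python
# Minimal sketch of SC2's idea: sample word-value frequencies, then build
# a Huffman code over them with a heap. Frequent values get short codes.

import heapq
from collections import Counter

def huffman_code(samples):
    """Map each sampled value to a variable-length bit string."""
    freq = Counter(samples)
    # Heap entries: (frequency, unique tie-breaker, {value: code-so-far}).
    heap = [(f, i, {v: ""}) for i, (v, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, i, c2 = heapq.heappop(heap)
        merged = {v: "0" + c for v, c in c1.items()}
        merged.update({v: "1" + c for v, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, i, merged))
    return heap[0][2]

# Skewed value distribution, as is common in cache contents (many zeros):
samples = [0] * 50 + [1] * 20 + [0xFFFFFFFF] * 20 + list(range(2, 12))
code = huffman_code(samples)
print(code[0], code[1])   # the most frequent values get the shortest codes
```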

HyComp
FP-H compression: a compression method specialized for floating-point numbers, based on SC2. Compression is usually not on the critical path…

HyComp
FP-H parallel decompression: …decompression, however, is. Because Huffman codes are not fixed-size, the offset of a given segment cannot be recorded (otherwise the compression ratio drops). A non-parallel decompressor therefore processes mL, exp, and mH sequentially, while a parallel decompressor processes mL and mH simultaneously in phase one and then exp in phase two.
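
A minimal sketch of the field split FP-H works on for a 64-bit double: sign, exponent (exp), mantissa-high (mH), and mantissa-low (mL). exp and mH are highly biased across nearby values (good Huffman candidates) while mL is nearly random; the mH/mL split point below is an illustrative assumption, not the paper's exact choice:

```python
# Minimal sketch of FP-H's field split for an IEEE-754 double.

import struct

MH_BITS = 16                     # assumed width of the high mantissa part
ML_BITS = 52 - MH_BITS           # remaining low mantissa bits

def fp_h_fields(x: float):
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63
    exp = (bits >> 52) & 0x7FF
    mant = bits & ((1 << 52) - 1)
    m_hi = mant >> ML_BITS       # biased: compress with Huffman
    m_lo = mant & ((1 << ML_BITS) - 1)   # near-random: store as-is
    return sign, exp, m_hi, m_lo

# Neighboring values in an array often share sign, exp, and mH, differing
# only in mL -- which is why compressing exp and mH pays off:
for x in (3.14159, 3.14160, 3.14161):
    print(["{:x}".format(f) for f in fp_h_fields(x)])
```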

HyComp
Hybrid compression: heuristics predict the data type of each block and select the best-suited compression method. Performs better on floating-point numbers. Compression ratio: ~4, with 12-cycle decompression.
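
A minimal sketch of the dispatch idea; the type-prediction rules, thresholds, and compressor mapping below are illustrative stand-ins, not HyComp's actual heuristics:

```python
# Minimal sketch of hybrid dispatch: guess a block's dominant data type
# with cheap heuristics, then pick a compressor suited to it.

def looks_like_double(word64: int) -> bool:
    exp = (word64 >> 52) & 0x7FF
    return exp not in (0, 0x7FF) and word64 != 0   # plausible normal float

def looks_like_pointer(word64: int) -> bool:
    return word64 >> 48 == 0 and word64 > 0x10000  # small canonical address

def choose_compressor(words64):
    if all(w == 0 for w in words64):
        return "ZCA"           # all-zero block
    if sum(map(looks_like_double, words64)) > len(words64) // 2:
        return "FP-H"          # mostly floating point
    if sum(map(looks_like_pointer, words64)) > len(words64) // 2:
        return "BDI"           # pointer-like values: narrow deltas
    return "SC2"               # fall back to statistical compression

print(choose_compressor([0x8048004, 0x8048008, 0x80480C0, 0x8048000]))
# -> "BDI"
```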

MORC
Log-based cache. In fact, the picture on the slide is somewhat misleading: MORC is look-up-table based…

MORC LMT

MORC
LMT: valid bits per address.
If valid: decompress the tag and check it.
If hit: decompress the data; otherwise check the next tag.
Sequentially? Yes, because most tags will miss!
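
A minimal sketch of this sequential lookup; the `LogEntry` layout and `decompress_*` helpers are illustrative stand-ins for MORC's compressed formats:

```python
# Minimal sketch of the sequential LMT lookup from the slide: check valid
# bits, decompress and compare one tag at a time, and only decompress data
# on a hit.

from dataclasses import dataclass

@dataclass
class LogEntry:
    valid: bool
    compressed_tag: bytes
    compressed_data: bytes

def decompress_tag(blob: bytes) -> int:
    return int.from_bytes(blob, "little")    # stand-in tag decompressor

def decompress_data(blob: bytes) -> bytes:
    return blob                              # stand-in data decompressor

def lmt_lookup(log, addr_tag):
    for entry in log:                # sequential: most tags will miss,
        if not entry.valid:          # so serial checks are cheap on average
            continue
        if decompress_tag(entry.compressed_tag) == addr_tag:
            return decompress_data(entry.compressed_data)   # hit
    return None                      # miss: go off-chip

log = [LogEntry(True, (0xBEEF).to_bytes(4, "little"), b"payload"),
       LogEntry(False, b"", b""),
       LogEntry(True, (0xCAFE).to_bytes(4, "little"), b"hit!")]
print(lmt_lookup(log, 0xCAFE))       # b'hit!'
```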

MORC
Throughput oriented: MORC (Manycore-Oriented Compressed Cache). As cores accumulate, off-chip bandwidth limits performance, and for throughput-oriented workloads, reducing off-chip accesses is more important than reducing latency. Fewer off-chip accesses also save energy. Compression ratio: ~6.