Base-Delta-Immediate Compression: Practical Data Compression for On-Chip Caches Gennady Pekhimenko, Vivek Seshadri, Onur Mutlu, Todd C. Mowry, Phillip B. Gibbons, Michael A. Kozuch

Presentation transcript:

Base-Delta-Immediate Compression: Practical Data Compression for On-Chip Caches
Gennady Pekhimenko, Vivek Seshadri, Onur Mutlu, Todd C. Mowry, Phillip B. Gibbons*, Michael A. Kozuch*

Executive Summary
Off-chip memory latency is high
– Large caches can help, but at significant cost
Compressing data in the cache enables a larger effective cache at low cost
Problem: decompression is on the execution critical path
Goal: design a new compression scheme that has 1. low decompression latency, 2. low cost, 3. high compression ratio
Observation: many cache lines contain low-dynamic-range data
Key idea: encode a cache line as a base plus multiple differences
Solution: Base-Delta-Immediate compression with low decompression latency and high compression ratio
– Outperforms three state-of-the-art compression mechanisms

Motivation for Cache Compression
Significant redundancy in data, e.g.: 0x00000000 0x0000000B 0x00000000 …
How can we exploit this redundancy?
– Cache compression helps
– Provides the effect of a larger cache without making it physically larger

Background on Cache Compression
Key requirements:
– Fast (low decompression latency)
– Simple (avoid complex hardware changes)
– Effective (good compression ratio)
(Diagram: CPU and uncompressed L1 cache backed by a compressed L2 cache; on an L2 hit, the line is decompressed before filling the L1)

Shortcomings of Prior Work

Compression Mechanism | Decompression Latency | Complexity | Compression Ratio
Zero                  | low                   | low        | low (zero lines only)
Frequent Value        | high                  | high       | limited
Frequent Pattern      | high                  | high       | mixed

Our proposal: BΔI

Outline
Motivation & Background
Key Idea & Our Mechanism
Evaluation
Conclusion

Key Data Patterns in Real Applications
Zero Values: initialization, sparse matrices, NULL pointers (e.g., 0x00000000 0x00000000 0x00000000 …)
Repeated Values: common initial values, adjacent pixels (e.g., 0x000000FF 0x000000FF …)
Narrow Values: small values stored in a big data type (e.g., 0x00000000 0x0000000B 0x00000000 …)
Other Patterns: pointers to the same memory region (e.g., 0xC04039C0 0xC04039C8 0xC04039D0 0xC04039D8 …)

How Common Are These Patterns?
SPEC2006, databases, web workloads, 2MB L2 cache
43% of the cache lines belong to these key patterns (“Other Patterns” include Narrow Values)

Key Data Patterns in Real Applications
Zero, repeated, narrow, and same-region pointer values all exhibit Low Dynamic Range: the differences between values are significantly smaller than the values themselves

Key Idea: Base+Delta (B+Δ) Encoding
32-byte uncompressed cache line (eight 4-byte values): 0xC04039C0 0xC04039C8 0xC04039D0 … 0xC04039F8
12-byte compressed cache line: 4-byte base 0xC04039C0 plus eight 1-byte deltas 0x00 0x08 0x10 … 0x38, saving 20 bytes
Fast decompression: vector addition
Simple hardware: arithmetic and comparison
Effective: good compression ratio
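As a rough illustration of the B+Δ idea, the following sketch compresses a line of fixed-width integers against a single base (the function name and signature are mine, not from the paper):

```python
def compress_base_delta(values, delta_bytes=1):
    """Compress a list of fixed-width integers with a single base.

    Returns (base, deltas) if every value lies within a signed
    `delta_bytes` range of the base, or None if the line does not fit.
    """
    base = values[0]
    limit = 1 << (8 * delta_bytes - 1)      # signed delta range
    deltas = [v - base for v in values]
    if all(-limit <= d < limit for d in deltas):
        return base, deltas
    return None

# The slide's example: eight 4-byte pointers 0xC04039C0 .. 0xC04039F8.
# A 4-byte base plus eight 1-byte deltas takes 12 bytes instead of 32.
line = [0xC04039C0 + 8 * i for i in range(8)]
base, deltas = compress_base_delta(line)
```

Note the deltas are signed, so values slightly below the base also compress.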

Can We Do Better?
Uncompressible cache line (with a single base): 0x00000000 0x09A40178 0x0000000B 0x09A4A838 …
Key idea: use more bases, e.g., two instead of one
Pro:
– More cache lines can be compressed
Cons:
– Unclear how to find these bases efficiently
– Higher overhead (due to additional bases)

B+Δ with Multiple Arbitrary Bases
Two bases are the best option based on our evaluations

How to Find Two Bases Efficiently?
1. First base: the first element in the cache line (the Base+Delta part)
2. Second base: an implicit base of 0 (the Immediate part)
Advantages over two arbitrary bases:
– Better compression ratio
– Simpler compression logic
Together, these form Base-Delta-Immediate (BΔI) Compression
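A minimal sketch of the two-base compressibility check described above, assuming the first element as one base and an implicit zero base (function names are my own):

```python
def bdi_compressible(values, delta_bytes=1):
    """Check BΔI compressibility with two bases: the line's first
    element and an implicit base of zero. Each value must be within
    a signed `delta_bytes` range of one of the two bases."""
    base = values[0]
    limit = 1 << (8 * delta_bytes - 1)

    def near(v, b):
        return -limit <= v - b < limit

    return all(near(v, base) or near(v, 0) for v in values)

# Pointers cluster around the first element; small integers and NULLs
# cluster around the implicit zero base.
mixed_line = [0x09A40178, 0x00000000, 0x09A40180, 0x0000000B]
```

With a single base, `mixed_line` would not compress; the implicit zero base covers its narrow values for free, with no extra base stored.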

B+Δ (with two arbitrary bases) vs. BΔI
The average compression ratios are close, but BΔI is simpler

BΔI Implementation
Decompressor design – low latency
Compressor design – low cost and complexity
BΔI cache organization – modest complexity

BΔI Decompressor Design
The compressed cache line stores the base B0 and the deltas Δ0, Δ1, Δ2, Δ3. The uncompressed values are recovered by a single vector addition: Vi = B0 + Δi, computed for all deltas in parallel.
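The vector addition above can be sketched in a few lines (hardware performs all additions in parallel; this sequential version is only illustrative):

```python
def bdi_decompress(base, deltas):
    """Recover the uncompressed line with one vector addition:
    Vi = B0 + delta_i, one adder per delta in the real hardware."""
    return [base + d for d in deltas]

# 0xC04039C0 + 0x38 = 0xC04039F8, matching the slide's example line
values = bdi_decompress(0xC04039C0, [0x00, 0x08, 0x10, 0x38])
```

Because decompression is just narrow additions, it adds little latency on the critical path, which is the central design goal.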

BΔI Compressor Design
The 32-byte uncompressed cache line is fed to eight compression units (CUs) in parallel:
– 8-byte base with 1-byte, 2-byte, or 4-byte Δ
– 4-byte base with 1-byte or 2-byte Δ
– 2-byte base with 1-byte Δ
– Zero CU and Repeated-Values CU
Each CU outputs a compression flag and a compressed cache line (CFlag & CCL); the compression selection logic picks the successful encoding with the smallest compressed size.
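A sketch of the selection logic under stated simplifications: it tries only the six base+delta units (the zero and repeated-values units are omitted), running them sequentially rather than in parallel; names are mine:

```python
def split(line, width):
    """Reinterpret a cache line (bytes) as little-endian integers of `width` bytes."""
    return [int.from_bytes(line[i:i + width], "little")
            for i in range(0, len(line), width)]

def encoded_size(line, base_bytes, delta_bytes):
    """Compressed size in bytes if every element fits in a signed
    `delta_bytes` delta from the first element; None if this unit fails."""
    vals = split(line, base_bytes)
    base = vals[0]
    limit = 1 << (8 * delta_bytes - 1)
    if all(-limit <= v - base < limit for v in vals):
        return base_bytes + delta_bytes * len(vals)
    return None

def select(line):
    """Run the base+delta units and keep the smallest successful encoding."""
    best = None
    for base_bytes, delta_bytes in [(8, 1), (8, 2), (8, 4), (4, 1), (4, 2), (2, 1)]:
        size = encoded_size(line, base_bytes, delta_bytes)
        if size is not None and (best is None or size < best[0]):
            best = (size, base_bytes, delta_bytes)
    return best  # None means the line is stored uncompressed (32 bytes)

# Eight 4-byte pointers into the same region compress to 4 + 8 = 12 bytes
line = b"".join((0xC04039C0 + 8 * i).to_bytes(4, "little") for i in range(8))
```

The same 32-byte line is reinterpreted at each element width, mirroring how each hardware CU views the line independently.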

BΔI Compression Unit: 8-byte B0, 1-byte Δ
Given a 32-byte uncompressed cache line with 8-byte elements V0, V1, V2, V3:
1. Take B0 = V0 as the base
2. Compute the deltas Δi = Vi − B0
3. Is every element within 1-byte range of the base?
– Yes: emit the compressed line (B0, Δ0, Δ1, Δ2, Δ3)
– No: this unit fails for this line

BΔI Cache Organization
Conventional design: a 2-way cache with 32-byte cache lines; tag storage holds two tags per set (Tag0, Tag1) and data storage holds two 32-byte lines per set.
BΔI: a 4-way cache with 8-byte segmented data; twice as many tags per set (Tag0–Tag3), each extended with compression-encoding bits (C); data storage is divided into 8-byte segments (S0–S7), and each tag maps to multiple adjacent segments.
Overhead: 2.3% for a 2MB cache
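With 8-byte data segments, the number of adjacent segments a line occupies is a simple ceiling division; a small sketch (my own helper, not from the paper):

```python
def segments_needed(compressed_size, segment_bytes=8):
    """Number of adjacent 8-byte data segments a (compressed) line occupies."""
    return -(-compressed_size // segment_bytes)   # ceiling division

# A 12-byte BΔI-compressed line occupies 2 segments;
# an uncompressed 32-byte line needs all 4.
```

Freeing segments this way is what lets twice as many tags share the same data array, giving the larger effective cache.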

Qualitative Comparison with Prior Work
Zero-based designs
– ZCA [Dusser+, ICS’09]: zero-content augmented cache
– ZVC [Islam+, PACT’09]: zero-value cancelling
– Limited applicability (only zero values)
FVC [Yang+, MICRO’00]: frequent value compression
– High decompression latency and complexity
Pattern-based compression designs
– FPC [Alameldeen+, ISCA’04]: frequent pattern compression; high decompression latency (5 cycles) and complexity
– C-Pack [Chen+, T-VLSI Systems’10]: practical implementation of an FPC-like algorithm; high decompression latency (8 cycles)

Outline
Motivation & Background
Key Idea & Our Mechanism
Evaluation
Conclusion

Methodology
Simulator
– x86 event-driven simulator based on Simics [Magnusson+, Computer’02]
Workloads
– SPEC2006 benchmarks, TPC, Apache web server
– 1–4 core simulations of 1 billion representative instructions
System parameters
– L1/L2/L3 cache latencies from CACTI [Thoziyoor+, ISCA’08]
– 4GHz x86 in-order cores, 512kB–16MB L2, simple memory model (300-cycle latency for row misses)

Compression Ratio: BΔI vs. Prior Work
SPEC2006, databases, web workloads, 2MB L2
BΔI achieves the highest compression ratio

Single-Core: IPC and MPKI
(Figure: per-configuration IPC and MPKI results)
BΔI achieves the performance of a 2X-size cache
Performance improves due to the decrease in MPKI

Multi-Core Workloads
Applications are classified by:
– Compressibility: effective cache size increase (Low Compr. (LC) = 1.40)
– Sensitivity: performance gain with more cache (Low Sens. (LS) = 1.10; 512kB -> 2MB)
Three classes of applications: LCLS, HCLS, HCHS (no LCHS applications)
For 2-core runs: random mixes of each possible class pair (20 each, 120 total workloads)

Multi-Core: Weighted Speedup
BΔI’s performance improvement is the highest (9.5%)
If at least one application is sensitive, performance improves

Other Results in the Paper
IPC comparison against upper bounds
– BΔI almost achieves the performance of a 2X-size cache
Sensitivity study of having more than 2X tags
– Up to 1.98 average compression ratio
Effect on bandwidth consumption
– 2.31X decrease on average
Detailed quantitative comparison with prior work
Cost analysis of the proposed changes
– 2.3% L2 cache area increase

Conclusion
A new Base-Delta-Immediate compression mechanism
Key insight: many cache lines can be efficiently represented using base+delta encoding
Key properties:
– Low-latency decompression
– Simple hardware implementation
– High compression ratio with high coverage
Improves cache hit ratio and performance of both single-core and multi-core workloads
– Outperforms state-of-the-art cache compression techniques: FVC and FPC

Base-Delta-Immediate Compression: Practical Data Compression for On-Chip Caches
Gennady Pekhimenko, Vivek Seshadri, Onur Mutlu, Todd C. Mowry, Phillip B. Gibbons*, Michael A. Kozuch*

Backup Slides

B+Δ: Compression Ratio
SPEC2006, databases, web workloads, 2MB L2 cache
Good average compression ratio (1.40), but some benchmarks have a low compression ratio

Single-Core: Effect on Cache Capacity
Fixed L2 cache latency
BΔI achieves performance close to the upper bound

Multiprogrammed Workloads - I

Cache Compression Flow
CPU ↔ L1 data cache (uncompressed) ↔ L2 cache (compressed) ↔ memory (uncompressed)
– L1 hit: no (de)compression needed
– L1 miss, L2 hit: decompress the line before filling the L1
– L1 writeback to L2: compress
– L2 writeback to memory: decompress; fill from memory into L2: compress

Example of Base+Delta Compression
Narrow values (taken from h264ref):