18-742 Parallel Computer Architecture Lecture 11: Caching in Multi-Core Systems Prof. Onur Mutlu and Gennady Pekhimenko Carnegie Mellon University Fall 2012, 10/01/2012

Review: Multi-core Issues in Caching. How does the cache hierarchy change in a multi-core system? Private cache: the cache belongs to one core. Shared cache: the cache is shared by multiple cores. [Figure: four cores with private L2 caches vs. four cores sharing a single L2 cache, both backed by a DRAM memory controller]

Outline: Multi-cores and Caching (Review); Utility-based partitioning; Cache compression (Frequent Value, Frequent Pattern, Base-Delta-Immediate); Main memory compression (IBM MXT, Linearly Compressed Pages (LCP)).

Review: Shared Caches Between Cores. Advantages: dynamic partitioning of available cache space; no fragmentation due to static partitioning; easier to maintain coherence (shared data and locks do not ping-pong between caches). Disadvantages: cores incur conflict misses due to other cores' accesses (misses due to inter-core interference); some cores can destroy the hit rate of other cores (what kind of access patterns could cause this?); guaranteeing a minimum level of service (or fairness) to each core is harder (how much space, how much bandwidth?); high bandwidth is harder to obtain (N cores imply N ports?).

Shared Caches: How to Share? Free-for-all sharing: placement/replacement policies are the same as in a single-core system (usually LRU or pseudo-LRU); not thread/application aware, so an incoming block evicts a block regardless of which threads the blocks belong to. Problems: a cache-unfriendly application can destroy the performance of a cache-friendly application; not all applications benefit equally from the same amount of cache, and free-for-all might prioritize those that do not benefit; the result is reduced performance and reduced fairness.

Problem with Shared Caches. [Figure: threads t1 and t2 run on two processor cores, each with a private L1 cache, sharing an L2 cache] t2's throughput is significantly reduced due to unfair cache sharing.

Controlled Cache Sharing. Utility-based cache partitioning: Qureshi and Patt, "Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches," MICRO 2006; Suh et al., "A New Memory Monitoring Scheme for Memory-Aware Scheduling and Partitioning," HPCA 2002. Fair cache partitioning: Kim et al., "Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture," PACT 2004. Shared/private mixed cache mechanisms: Qureshi, "Adaptive Spill-Receive for Robust High-Performance Caching in CMPs," HPCA 2009.

Utility Based Shared Cache Partitioning. Goal: maximize system throughput. Observation: not all threads/applications benefit equally from caching, so simple LRU replacement is not good for system throughput. Idea: allocate more cache space to applications that obtain the most benefit from more space. The high-level idea can be applied to other shared resources as well. Qureshi and Patt, "Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches," MICRO 2006. Suh et al., "A New Memory Monitoring Scheme for Memory-Aware Scheduling and Partitioning," HPCA 2002.

Utility Based Cache Partitioning (I). Utility of growing an allocation from a ways to b ways: U_a^b = (misses with a ways) - (misses with b ways). [Figure: misses per 1000 instructions vs. number of allocated ways in a 16-way 1MB L2, illustrating low-utility, high-utility, and saturating-utility applications]

Utility Based Cache Partitioning (II). [Figure: misses per 1000 instructions (MPKI) vs. allocated ways for equake and vpr under the UTIL and LRU allocations] Idea: give more cache to the application that benefits more from cache.

Utility Based Cache Partitioning (III). [Figure: two cores, each with private I$ and D$, share an L2 cache backed by main memory; a utility monitor (UMON1, UMON2) per core feeds the partitioning algorithm (PA)] Three components: utility monitors (UMON) per core, a partitioning algorithm (PA), and replacement support to enforce partitions.
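
To make the mechanism concrete, here is a minimal Python sketch of utility-driven way allocation, assuming each UMON provides a per-core miss curve (misses as a function of allocated ways). The greedy loop is a simplification of the lookahead algorithm in the UCP paper, and all names and the example curves are illustrative.

```python
def partition_ways(miss_curves, total_ways, min_ways=1):
    """Greedily assign cache ways to cores based on marginal utility.

    miss_curves[c][w] = misses core c would incur with w ways allocated
    (as estimated by its UMON shadow tags). Returns the number of ways
    given to each core. Simplified sketch, not the exact lookahead
    algorithm from the UCP paper (no bounds checking, single-step gains).
    """
    num_cores = len(miss_curves)
    alloc = [min_ways] * num_cores          # every core gets a minimum share
    remaining = total_ways - sum(alloc)
    for _ in range(remaining):
        # Marginal utility of one more way: U = misses(w) - misses(w + 1)
        gains = [miss_curves[c][alloc[c]] - miss_curves[c][alloc[c] + 1]
                 for c in range(num_cores)]
        winner = max(range(num_cores), key=lambda c: gains[c])
        alloc[winner] += 1
    return alloc

# Example: core 0 saturates quickly (low utility), core 1 keeps benefiting.
curves = [
    [100, 40, 38, 37, 36, 36, 36, 36, 36],   # core 0 misses vs. ways 0..8
    [120, 90, 70, 55, 45, 38, 33, 30, 28],   # core 1 misses vs. ways 0..8
]
print(partition_ways(curves, total_ways=8))  # most ways go to core 1
```

The same greedy structure carries over to other shared resources: whoever has the largest marginal benefit gets the next unit of the resource.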

Cache Capacity: how can we get more cache without making it physically larger? Idea: data compression for on-chip caches.

Base-Delta-Immediate Compression: Practical Data Compression for On-Chip Caches. Gennady Pekhimenko, Vivek Seshadri, Onur Mutlu, Todd C. Mowry, Phillip B. Gibbons*, Michael A. Kozuch* (*Intel Labs Pittsburgh). This is joint work between CMU and Intel Labs Pittsburgh.

Executive Summary Off-chip memory latency is high Large caches can help, but at significant cost Compressing data in cache enables larger cache at low cost Problem: Decompression is on the execution critical path Goal: Design a new compression scheme that has 1. low decompression latency, 2. low cost, 3. high compression ratio Observation: Many cache lines have low dynamic range data Key Idea: Encode cachelines as a base + multiple differences Solution: Base-Delta-Immediate compression with low decompression latency and high compression ratio Outperforms three state-of-the-art compression mechanisms

Motivation for Cache Compression: there is significant redundancy in data, e.g., 0x00000000 0x0000000B 0x00000003 0x00000004 … How can we exploit this redundancy? Cache compression helps: it provides the effect of a larger cache without making it physically larger.

Background on Cache Compression. [Figure: CPU with an uncompressed L1 cache and a compressed L2 cache; an L2 hit passes through decompression before the uncompressed line reaches the L1] Key requirements: fast (low decompression latency), simple (avoid complex hardware changes), and effective (good compression ratio).

Zero Value Compression Advantages: Low decompression latency Low complexity Disadvantages: Low average compression ratio
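
As a small, hypothetical illustration of why zero-value compression is so cheap: all the hardware has to do is detect an all-zero line and record that fact in a flag. The function names below are made up for the sketch.

```python
def zvc_compress(line: bytes):
    """Zero-value compression sketch: an all-zero line collapses to a flag.

    Returns (is_zero, payload). Illustrative only; a real design keeps the
    flag in the tag/metadata rather than in the data array.
    """
    if all(b == 0 for b in line):
        return True, b""          # nothing to store besides the flag
    return False, line            # stored uncompressed

def zvc_decompress(is_zero: bool, payload: bytes, line_size: int = 32) -> bytes:
    return bytes(line_size) if is_zero else payload

assert zvc_decompress(*zvc_compress(bytes(32))) == bytes(32)
```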

Shortcomings of Prior Work: comparing compression mechanisms on decompression latency, complexity, and compression ratio. Zero Value Compression: low decompression latency and low complexity, but a low compression ratio.

Frequent Value Compression Idea: encode cache lines based on frequently occurring values Advantages: Good compression ratio Disadvantages: Needs profiling High decompression latency High complexity
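
A hypothetical sketch of the frequent value idea: words that hit a small, profiled table of frequent values are replaced by short indices, while other words are kept verbatim. The 4-entry table and names below are illustrative, not the published FVC encoding.

```python
# Profiled table of frequent values (illustrative); a 2-bit index selects one.
FREQUENT_VALUES = [0x00000000, 0xFFFFFFFF, 0x00000001, 0x000000FF]

def fvc_compress(words):
    encoded = []
    for w in words:
        if w in FREQUENT_VALUES:
            encoded.append(("idx", FREQUENT_VALUES.index(w)))   # short index
        else:
            encoded.append(("raw", w))                          # full 32 bits
    return encoded

def fvc_decompress(encoded):
    return [FREQUENT_VALUES[v] if kind == "idx" else v for kind, v in encoded]

line = [0x0, 0x12345678, 0xFFFFFFFF, 0x1]
assert fvc_decompress(fvc_compress(line)) == line
```

The need for a profiled value table is exactly where the profiling requirement and the table-lookup latency/complexity of FVC come from.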

Shortcomings of Prior Work (continued): Zero Value Compression offers low decompression latency and low complexity but a low compression ratio; Frequent Value Compression offers a good compression ratio but high decompression latency and high complexity.

Frequent Pattern Compression Idea: encode cache lines based on frequently occurring patterns, e.g., half word is zero Advantages: Good compression ratio Disadvantages: High decompression latency (5-8 cycles) High complexity (for some designs)
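
For intuition, here is a simplified sketch of FPC-style word classification: each 32-bit word is tagged with the pattern it matches, and only the bits the pattern cannot infer would be stored. The pattern set below is a subset of the published FPC patterns, chosen for brevity; it is not the exact algorithm.

```python
def to_signed(w: int) -> int:
    """Interpret a 32-bit word as a signed integer."""
    return w - (1 << 32) if w & (1 << 31) else w

def classify_word(w: int) -> str:
    if w == 0:
        return "zero"                        # all-zero word
    if -(1 << 7) <= to_signed(w) < (1 << 7):
        return "sign-extended byte"          # fits in 8 bits
    if -(1 << 15) <= to_signed(w) < (1 << 15):
        return "sign-extended halfword"      # fits in 16 bits
    if w & 0xFFFF == 0:
        return "halfword padded with zeros"  # low half is zero
    return "uncompressed"

for w in (0x00000000, 0x0000007F, 0xFFFF8000, 0x12340000, 0xDEADBEEF):
    print(hex(w), "->", classify_word(w))
```

Matching and re-expanding many such patterns per word is where the 5-8 cycle decompression latency and the design complexity come from.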

Shortcomings of Prior Work (continued): Frequent Pattern Compression also achieves a good compression ratio, but with high decompression latency and moderate-to-high complexity (depending on the design).

Shortcomings of Prior Work (summary): Zero Value Compression is fast and simple but compresses little; Frequent Value and Frequent Pattern Compression achieve good ratios but at high decompression latency and complexity. Our proposal, BΔI, targets low decompression latency, low complexity, and a high compression ratio at the same time.

Outline Motivation & Background Key Idea & Our Mechanism Evaluation Conclusion

Key Data Patterns in Real Applications Zero Values: initialization, sparse matrices, NULL pointers 0x00000000 0x00000000 0x00000000 0x00000000 … Repeated Values: common initial values, adjacent pixels 0x000000FF 0x000000FF 0x000000FF 0x000000FF … Narrow Values: small values stored in a big data type 0x00000000 0x0000000B 0x00000003 0x00000004 … Other Patterns: pointers to the same memory region 0xC04039C0 0xC04039C8 0xC04039D0 0xC04039D8 …

How Common Are These Patterns? For SPEC2006, databases, and web workloads with a 2MB L2 cache, 43% of the cache lines belong to the key patterns ("Other Patterns" include Narrow Values).

Key Data Patterns in Real Applications (revisited): zero values, repeated values, narrow values, and pointers to the same memory region all share a common property, Low Dynamic Range: the differences between values are significantly smaller than the values themselves.

Key Idea: Base+Delta (B+Δ) Encoding. A 32-byte uncompressed cache line of 4-byte values (0xC04039C0, 0xC04039C8, 0xC04039D0, …, 0xC04039F8) becomes a 12-byte compressed line: one 4-byte base (0xC04039C0) plus eight 1-byte deltas (0x00, 0x08, 0x10, …, 0x38), saving 20 bytes. Fast decompression (vector addition), simple hardware (arithmetic and comparison), effective (good compression ratio).
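
A minimal Python sketch of single-base B+Δ, using the pointer-like line from the slide. The byte-level layout here is illustrative, not the exact hardware format; only the base-plus-small-deltas idea is what matters.

```python
import struct

def bdelta_compress(line: bytes, value_size: int = 4, delta_size: int = 1):
    """Single-base B+Delta sketch: base = first value, one signed delta per
    value. Returns (base, deltas) if every delta fits in `delta_size` bytes,
    else None (uncompressible with these parameters)."""
    fmt = {1: "B", 2: "H", 4: "I", 8: "Q"}[value_size]
    values = struct.unpack("<%d%s" % (len(line) // value_size, fmt), line)
    base = values[0]
    lo, hi = -(1 << (8 * delta_size - 1)), (1 << (8 * delta_size - 1)) - 1
    deltas = [v - base for v in values]
    if all(lo <= d <= hi for d in deltas):
        return base, deltas        # value_size + N * delta_size bytes total
    return None

def bdelta_decompress(base, deltas, value_size: int = 4) -> bytes:
    fmt = {1: "B", 2: "H", 4: "I", 8: "Q"}[value_size]
    return struct.pack("<%d%s" % (len(deltas), fmt), *(base + d for d in deltas))

# The pointer-like line from the slide: eight 4-byte values, 8 bytes apart.
line = struct.pack("<8I", *(0xC04039C0 + 8 * i for i in range(8)))
base, deltas = bdelta_compress(line)              # 4 + 8 = 12 bytes
assert bdelta_decompress(base, deltas) == line    # round-trips losslessly
print(hex(base), deltas)                          # 0xc04039c0 [0, 8, ..., 56]
```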

B+Δ: Compression Ratio. For SPEC2006, databases, and web workloads with a 2MB L2 cache, B+Δ achieves a good average compression ratio (1.40), but some benchmarks have a low compression ratio.

Can We Do Better? Some cache lines are uncompressible with a single base, e.g., 0x00000000 0x09A40178 0x0000000B 0x09A4A838 … Key idea: use more bases, e.g., two instead of one. Pro: more cache lines can be compressed. Cons: it is unclear how to find these bases efficiently, and there is higher overhead (due to the additional bases).

B+Δ with Multiple Arbitrary Bases: two bases is the best option based on evaluations.

How to Find Two Bases Efficiently? First base: the first element in the cache line (the Base+Delta part). Second base: an implicit base of 0 (the Immediate part). Advantages over two arbitrary bases: better compression ratio and simpler compression logic. Together, this is Base-Delta-Immediate (BΔI) Compression.
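
A minimal sketch of the two-base idea, assuming an implicit zero base for small values and the first "large" value as the explicit base. The per-value (select bit, delta) encoding and the example values are illustrative, not the exact BΔI format.

```python
def bdi_two_base_compress(values, delta_size=1):
    """BΔI sketch with two bases: an implicit base of 0 (the 'immediate'
    part) and the first value that is not near zero (the 'base+delta' part).
    Each value becomes (base_select_bit, small signed delta). Returns None
    if some value fits neither base with the given delta size."""
    lo, hi = -(1 << (8 * delta_size - 1)), (1 << (8 * delta_size - 1)) - 1
    base = next((v for v in values if not (lo <= v <= hi)), 0)
    out = []
    for v in values:
        if lo <= v <= hi:
            out.append((0, v))            # delta from the implicit zero base
        elif lo <= v - base <= hi:
            out.append((1, v - base))     # delta from the explicit base
        else:
            return None                   # uncompressible with these sizes
    return base, out

# A line mixing small integers with nearby pointers (illustrative values).
vals = [0x00000000, 0x09A40178, 0x0000000B, 0x09A40180]
print(bdi_two_base_compress(vals))
```

Fixing the first base to the first element and the second to zero avoids searching for the best base pair, which is exactly the "simpler compression logic" claim.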

B+Δ (with two arbitrary bases) vs. BΔI Average compression ratio is close, but BΔI is simpler

BΔI Implementation: decompressor design (low latency), compressor design (low cost and complexity), and BΔI cache organization (modest complexity).

BΔI Decompressor Design. [Figure: the compressed cache line, consisting of base B0 and deltas Δ0-Δ3, is expanded by vector addition: each value Vi = B0 + Δi, reconstructing the uncompressed cache line]
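
Decompression is just one parallel addition of the selected base to every delta (the immediate part adds zero). A tiny sketch continuing the illustrative (base, (select, delta)) encoding from the BΔI sketch above:

```python
# Decompression sketch: reconstructing every value is one parallel addition,
# the operation the figure calls "vector addition". Encoding format as in the
# illustrative two-base sketch earlier (not the exact hardware layout).
def bdi_decompress(base, encoded):
    return [(base if select else 0) + delta for select, delta in encoded]

base, encoded = 0x09A40178, [(0, 0), (1, 0), (0, 11), (1, 8)]
print([hex(v) for v in bdi_decompress(base, encoded)])
```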

BΔI Compressor Design. [Figure: the 32-byte uncompressed cache line is fed in parallel to eight compression units (CUs): 8-byte B0 with 1-, 2-, or 4-byte Δ; 4-byte B0 with 1- or 2-byte Δ; 2-byte B0 with 1-byte Δ; Zero; and Repeated Values. Each CU produces a compression flag and compressed cache line (CFlag & CCL); the compression selection logic picks the encoding with the smallest compressed size]

BΔI Compression Unit: 8-byte B0, 1-byte Δ. [Figure: the 32-byte line is viewed as four 8-byte values V0-V3; B0 = V0, and each Δi = Vi - B0 is checked against the 1-byte range. If every element is within the 1-byte range, the line is stored as B0 plus Δ0-Δ3; otherwise this unit reports the line as uncompressible]

BΔI Cache Organization. [Figure: a conventional 2-way cache with 32-byte cache lines vs. the BΔI organization: a 4-way tag store whose entries carry compression-encoding bits (C), and a data store split into 8-byte segments] Twice as many tags; tags map to multiple adjacent segments; 2.3% overhead for a 2 MB cache.
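
A hypothetical sketch of the segmented data store bookkeeping: a compressed line occupies a variable number of adjacent 8-byte segments, and each (extra) tag records where the line starts and how it is encoded. Field names are illustrative.

```python
import math

SEGMENT_BYTES = 8   # data store managed in 8-byte segments (per the slide)

def segments_needed(compressed_size_bytes: int) -> int:
    """Number of adjacent 8-byte segments a compressed line occupies."""
    return math.ceil(compressed_size_bytes / SEGMENT_BYTES)

def make_tag_entry(tag: int, start_segment: int, encoding: str) -> dict:
    """Hypothetical tag-entry layout: the usual address tag plus the fields
    needed to locate and decode the line in the segmented data store."""
    return {
        "tag": tag,                      # usual address tag
        "start_segment": start_segment,  # first of the adjacent segments
        "encoding": encoding,            # the compression-encoding bits (C)
    }

# A 12-byte B+Delta line needs 2 segments; an uncompressed 32-byte line needs 4.
print(segments_needed(12), segments_needed(32))   # -> 2 4
```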

Qualitative Comparison with Prior Work Zero-based designs ZCA [Dusser+, ICS’09]: zero-content augmented cache ZVC [Islam+, PACT’09]: zero-value cancelling Limited applicability (only zero values) FVC [Yang+, MICRO’00]: frequent value compression High decompression latency and complexity Pattern-based compression designs FPC [Alameldeen+, ISCA’04]: frequent pattern compression High decompression latency (5 cycles) and complexity C-pack [Chen+, T-VLSI Systems’10]: practical implementation of FPC-like algorithm High decompression latency (8 cycles)

Outline Motivation & Background Key Idea & Our Mechanism Evaluation Conclusion

Methodology. Simulator: x86 event-driven simulator based on Simics [Magnusson+, Computer'02]. Workloads: SPEC2006 benchmarks, TPC, Apache web server; 1-4 core simulations for 1 billion representative instructions. System parameters: L1/L2/L3 cache latencies from CACTI [Thoziyoor+, ISCA'08]; 4GHz x86 in-order core, 512kB-16MB L2, simple memory model (300-cycle latency for row misses).

Compression Ratio: BΔI vs. Prior Work. For SPEC2006, databases, and web workloads with a 2MB L2, BΔI achieves the highest compression ratio (1.53).

Single-Core: IPC and MPKI. [Figure: per-configuration IPC improvements and MPKI reductions] BΔI achieves the performance of a 2X-size cache; performance improves due to the decrease in MPKI.

Single-Core: Effect on Cache Capacity (fixed L2 cache latency). [Figure: performance across cache capacities] BΔI achieves performance close to the upper bound.

Multi-Core Workloads. Applications are classified by Compressibility, the effective cache size increase (Low Compr. (LC) < 1.40, High Compr. (HC) >= 1.40), and Sensitivity, the performance gain with more cache (Low Sens. (LS) < 1.10, High Sens. (HS) >= 1.10; 512kB -> 2MB). Three classes of applications occur: LCLS, HCLS, HCHS; there are no LCHS applications. For 2-core runs: random mixes of each possible class pair (20 each, 120 total workloads).

Multi-Core: Weighted Speedup. If at least one application is sensitive, then performance improves; BΔI's performance improvement is the highest (9.5%).

Other Results in Paper: sensitivity study of having more than 2X tags (up to 1.98 average compression ratio); effect on bandwidth consumption (2.31X decrease on average); detailed quantitative comparison with prior work; cost analysis of the proposed changes (2.3% L2 cache area increase).

Conclusion A new Base-Delta-Immediate compression mechanism Key insight: many cache lines can be efficiently represented using base + delta encoding Key properties: Low latency decompression Simple hardware implementation High compression ratio with high coverage Improves cache hit ratio and performance of both single-core and multi-core workloads Outperforms state-of-the-art cache compression techniques: FVC and FPC

Linearly Compressed Pages: A Main Memory Compression Framework with Low Complexity and Low Latency Gennady Pekhimenko, Vivek Seshadri, Yoongu Kim, Hongyi Xin, Onur Mutlu, Phillip B. Gibbons*, Michael A. Kozuch*, Todd C. Mowry

Executive Summary Main memory is a limited shared resource Observation: Significant data redundancy Idea: Compress data in main memory Problem: How to avoid latency increase? Solution: Linearly Compressed Pages (LCP): fixed-size cache line granularity compression 1. Increases capacity (69% on average) 2. Decreases bandwidth consumption (46%) 3. Improves overall performance (9.5%)

Challenges in Main Memory Compression Address Computation Mapping and Fragmentation Physically Tagged Caches

Address Computation. [Figure: in an uncompressed page, cache line Li (64B) starts at a fixed offset of i*64 (0, 64, 128, …, (N-1)*64); in a compressed page with variable-size lines L0, L1, L2, …, LN-1, the address offset of each line is unknown without additional information]
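
A small worked example of why this matters: with a fixed line size the offset is a single multiply, but with variable compressed sizes locating line i requires summing the sizes of all preceding lines. The per-line sizes below are made up for illustration.

```python
LINE_BYTES = 64

def uncompressed_offset(i: int) -> int:
    # Trivial in an uncompressed page: a shift/multiply by the line size.
    return i * LINE_BYTES

def variable_size_offset(i: int, compressed_sizes) -> int:
    # With variable-size compressed lines, locating line i requires summing
    # the sizes of all preceding lines: extra metadata and a chain of
    # additions on the critical access path.
    return sum(compressed_sizes[:i])

sizes = [17, 64, 9, 33, 64, 12]            # illustrative per-line sizes
print(uncompressed_offset(3))               # 192
print(variable_size_offset(3, sizes))       # 90: depends on all earlier lines
```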

Mapping and Fragmentation. [Figure: a 4kB virtual page maps, via the virtual-to-physical translation, to a physical page of unknown compressed size, which creates fragmentation in physical memory]

Physically Tagged Caches. [Figure: the core issues a virtual address; address translation through the TLB produces the physical address used to tag L2 cache lines, so translation sits on the critical path of every cache access]

Shortcomings of Prior Work: main memory compression mechanisms are compared on access latency, decompression latency, complexity, and compression ratio. IBM MXT [IBM J.R.D. '01] suffers from high access latency and high decompression latency.

Shortcomings of Prior Work (continued): Robust Main Memory Compression [ISCA'05] improves on IBM MXT but still adds non-trivial address computation and design complexity to locate a compressed cache line.

Shortcomings of Prior Work (continued): LCP, our proposal, targets low access latency, low decompression latency, low complexity, and a good compression ratio.

Linearly Compressed Pages (LCP): Key Idea. [Figure: a 4kB uncompressed page (64 cache lines of 64B) is compressed 4:1 into a 1kB region of fixed-size compressed lines, followed by 64B of metadata (which lines are compressible and where the exceptions live) and an exception storage region for lines that do not compress]
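
Because every compressible line in an LCP page has the same fixed size, locating line i is again a single multiply-add; uncompressible lines are redirected through the metadata to slots in the exception region. A minimal sketch assuming the 4:1 layout in the figure; the slot sizes, metadata format, and names are illustrative, not the paper's exact encoding.

```python
COMPRESSED_SLOT = 16                      # 1kB / 64 lines (assumed 4:1 layout)
COMPRESSED_REGION = 64 * COMPRESSED_SLOT  # 1kB of fixed-size compressed slots
METADATA_BYTES = 64                       # per the figure
EXCEPTION_SLOT = 64                       # exceptions stored uncompressed

def lcp_line_address(page_base, i, metadata):
    """metadata[i] = None if line i is compressible, else its exception index
    (illustrative metadata format)."""
    if metadata[i] is None:
        return page_base + i * COMPRESSED_SLOT                 # one multiply-add
    exc_base = page_base + COMPRESSED_REGION + METADATA_BYTES
    return exc_base + metadata[i] * EXCEPTION_SLOT             # via metadata

meta = [None] * 64
meta[5] = 0                    # line 5 did not compress; exception slot 0
print(hex(lcp_line_address(0x100000, 3, meta)))   # fixed-size slot: 0x100030
print(hex(lcp_line_address(0x100000, 5, meta)))   # redirected to exceptions
```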

LCP Overview: page table entry extension (compression type and size, extended physical base address); operating system management support (4 memory pools: 512B, 1kB, 2kB, 4kB); changes to cache tagging logic (physical page base address + cache line index within a page); handling of page overflows; compression algorithms: BDI [PACT'12] and FPC [ISCA'04].

LCP Optimizations: metadata cache (avoids additional requests to metadata); memory bandwidth reduction (a compressed line can be fetched in 1 transfer instead of 4); zero pages and zero cache lines handled separately, with 1 bit in the TLB and 1 bit per cache line in the metadata; integration with cache compression (BDI and FPC).

Methodology. Simulators: x86 event-driven simulators; Simics-based [Magnusson+, Computer'02] for CPU and Multi2Sim [Ubal+, PACT'12] for GPU. Workloads: SPEC2006 benchmarks, TPC, Apache web server, GPGPU applications. System parameters: L1/L2/L3 cache latencies from CACTI [Thoziyoor+, ISCA'08]; 512kB-16MB L2; simple memory model.

Compression Ratio Comparison. For SPEC2006, databases, and web workloads with a 2MB L2 cache, LCP-based frameworks achieve average compression ratios competitive with prior work.

Bandwidth Consumption Decrease. SPEC2006, databases, and web workloads with a 2MB L2 cache; measured in BPKI (bandwidth per kilo-instruction; lower is better). LCP frameworks significantly reduce bandwidth (46%).

Performance Improvement:
Cores | LCP-BDI | (BDI, LCP-BDI) | (BDI, LCP-BDI+FPC-fixed)
1     | 6.1%    | 9.5%           | 9.3%
2     | 13.9%   | 23.7%          | 23.6%
4     | 10.7%   | 22.6%          | 22.5%
LCP frameworks significantly improve performance.

Conclusion: a new main memory compression framework called LCP (Linearly Compressed Pages). Key idea: a fixed size for compressed cache lines within a page and a fixed compression algorithm per page. LCP evaluation: increases capacity (69% on average), decreases bandwidth consumption (46%), improves overall performance (9.5%), and decreases energy of the off-chip bus (37%).