Enabling Transparent Memory-Compression for Commodity Memory Systems

Presentation transcript:

Enabling Transparent Memory-Compression for Commodity Memory Systems
HPCA 2019
Vinson Young*, Sanjay Kariyappa*, Moinuddin Qureshi
*These authors contributed equally to this work

MOORE'S LAW HITS THE BANDWIDTH WALL
[Figure: on-chip bandwidth demand vs. memory channel bandwidth]
Bandwidth demand has been steadily increasing, so we need practical solutions for scaling memory bandwidth.

Memory Compression for Capacity and Bandwidth
Compression packs more lines into a given amount of physical space, which improves memory capacity. While this seems useful, compressed memory has not seen widespread adoption: with compression the effective capacity fluctuates, and the OS must be involved to manage that variable capacity. The capacity benefits therefore require coordinated OS and hardware support from multiple vendors, which hinders adoption. Instead, we focus solely on improving memory bandwidth, with Transparent Memory Compression: hardware compression for bandwidth without OS support.

Transparent Memory Compression (TMC)
TMC obtains bandwidth benefits without OS support by compressing lines in place, e.g., MemZip*. MemZip changes the DIMM organization so that a compressed line can be read with a half-line access, saving accesses when lines are compressible:
✔ Bandwidth benefits (compressed lines read with half the bandwidth)
✔ OS transparent
✘ Non-commodity memory (DIMM changes to support half-line accesses)
✘ Significant metadata overhead (expensive metadata accesses to learn the compressed mapping), and performance degradation on incompressible workloads
Our work has a simple goal: more bandwidth out of commodity memory systems. Current TMC proposals require non-commodity memory and significant metadata overhead.
*Ali Shafiee, Meysam Taassori, Rajeev Balasubramonian, and Al Davis, "MemZip," HPCA 2014

Goal: Practical Transparent Memory Compression
Our goal is to enable Practical TMC: the bandwidth benefits of compression without its costs. The design should be OS transparent, work with commodity memory, have negligible metadata overhead, and deliver robust performance under any workload. Prior TMC proposals provide the bandwidth benefits and OS transparency, but fail on commodity memory, metadata overhead, and robustness.

Overview
Background | Proposal: Address Mapping, In-line Metadata, Location Prediction, Dynamic Policy | Results
Next: Address Mapping (useful for commodity memory)

Problem of TMC on Commodity Memories
Conventional memory interfaces transfer 64B on every access, and internal bandwidth constraints rule out partial-line transfers. Even if a line is compressed, reading it still costs a full 64B transfer, so TMC on commodity memory does not improve memory bandwidth.

Enabling TMC on Commodity Memories
Approach: relocate compressed lines together into one location. Since the access granularity cannot be changed, we pack two compressed lines into one cacheline-sized location and retrieve both with a single access, storing both into the last-level cache (a bandwidth-free adjacent-line prefetch). If both lines are useful, this gives 2x effective bandwidth while the access length stays unmodified. Pair-wise remapping compression enables 2x effective bandwidth and works on commodity DRAM (a mapping sketch follows below).
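As a rough sketch of what this pair-wise remapping could look like in the memory controller (the names and the exact pairing/placement policy here are illustrative assumptions, not the paper's hardware):

```cpp
#include <cstdint>

// Minimal sketch of pair-wise remapping. Adjacent lines (A, A^1) form a pair;
// if the pair compresses into one 64B slot, both lines live in the first
// line's slot, otherwise each line stays in its own slot as in a conventional
// memory. Line addresses are in units of 64B lines.
struct LineLocation {
    uint64_t slot;        // 64B-aligned slot index to read
    bool     holds_pair;  // true if the slot contains both lines of the pair
};

LineLocation locate_line(uint64_t line_addr, bool pair_is_compressed) {
    uint64_t first_of_pair = line_addr & ~1ULL;  // even line of the pair
    if (pair_is_compressed) {
        // Both compressed lines are packed into the first slot, so one access
        // returns the requested line plus its buddy (a free prefetch).
        return { first_of_pair, true };
    }
    // Uncompressed: conventional one-line-per-slot layout.
    return { line_addr, false };
}
```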

Overview
Background | Proposal: Address Mapping, In-line Metadata, Location Prediction, Dynamic Policy | Results
Next: In-line Metadata (targeting metadata overhead)

Understanding Metadata Lookup
The location of Line B changes depending on compressibility. On a read request for Line B:
- Uncompressed (M=0): read the metadata, which indicates the uncompressed mapping, then read location-2; a double access to read one line.
- Compressed (M=1): read the metadata, which indicates the compressed mapping, then read location-1, which holds Line A and Line B together; a double access to read two lines.
Prior approaches rely on reading metadata first to learn the mapping, which costs bandwidth and latency (a sketch of this flow follows).
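For contrast with the proposal that follows, a minimal sketch of this metadata-first read flow (the helper names and metadata encoding are illustrative assumptions, not a specific design's interface):

```cpp
#include <cstdint>

// Sketch of the metadata-first read used by prior TMC designs: the compression
// status (M-bit) must be known before the data location is, so a metadata-cache
// miss adds a serialized memory access in front of every data read.
struct Line64B { uint8_t bytes[64]; };

extern Line64B dram_read(uint64_t slot);               // one 64B access
extern bool    read_metadata_bit(uint64_t line_addr);  // may itself need a DRAM read

Line64B read_line_metadata_first(uint64_t line_addr) {
    bool compressed = read_metadata_bit(line_addr);    // access #1 on a metadata miss
    uint64_t slot = compressed ? (line_addr & ~1ULL)   // pair packed in the first slot
                               : line_addr;            // conventional slot
    return dram_read(slot);                            // access #2
}
```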

TMC with Metadata (+ Metadata Cache)
[Results across SPEC, GAP (graph), and MIX workloads, evaluated on an 8-core system with 2 channels of DRAM]
Metadata lookup limits performance (41% slowdown): even with a dedicated metadata cache, TMC suffers from constantly accessing metadata and slows down on average.

Insight: Can We Store Metadata in the Line?
The double access for metadata is harmful; can we eliminate it? If the metadata is stored with the line itself, a single access retrieves both the line and its metadata, avoiding the separate metadata lookup. Storing metadata with the line enables single-access memory reads.

In-line Compression-status Marker
Two lines are compressible together if their combined size is under 64B, and compressed lines often do not use all of that space. Insight: we can repurpose space inside the compressed line to store a small 4-byte marker (e.g., 0xdeadbeef) that denotes compressibility. To make space for this metadata, a pair is treated as compressible only if both lines fit in under 60B, which guarantees 4B for the marker and costs little compressibility. A separate invalid-marker value is written to the vacated location to avoid having multiple valid copies of the data. On reading a line that carries the marker, we know it is a compressed line: the marker informs compressibility without a metadata lookup (a packing sketch follows).
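A minimal sketch of how two compressed lines and the 4-byte marker could be packed into one 64B slot. The marker placement (last 4 bytes) and the fixed value are assumptions for illustration; the paper derives per-line markers with a hash, and the compressed sizes are assumed recoverable from the compressed encoding itself:

```cpp
#include <array>
#include <cstdint>
#include <cstring>
#include <optional>

// Pack two compressed lines plus a 4-byte compression-status marker into one
// 64B slot. Compression succeeds only if both lines fit in 60B, which leaves
// room for the marker.
constexpr uint32_t kMarker = 0xdeadbeef;

std::optional<std::array<uint8_t, 64>> pack_pair(const uint8_t* compr_a, size_t len_a,
                                                 const uint8_t* compr_b, size_t len_b) {
    if (len_a + len_b > 60) return std::nullopt;   // must leave 4B for the marker
    std::array<uint8_t, 64> slot{};
    std::memcpy(slot.data(),         compr_a, len_a);
    std::memcpy(slot.data() + len_a, compr_b, len_b);
    std::memcpy(slot.data() + 60,    &kMarker, sizeof(kMarker));
    return slot;
}

bool slot_has_marker(const std::array<uint8_t, 64>& slot) {
    uint32_t tail;
    std::memcpy(&tail, slot.data() + 60, sizeof(tail));
    return tail == kMarker;   // subject to marker collisions (next slide)
}
```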

Marker Collision
An uncompressed line could coincidentally store the marker value; we call this a marker collision. The chance is 1 in 2^32 (about 1 in 4 billion) per line. Solution: track the addresses of lines that coincidentally store the marker value in a small SRAM structure, a 16-entry (64B) table of colliding addresses. The average time until this collision table overflows is about 10 million years, and on such a rare event the marker can be changed and memory recompressed. See the paper for a more detailed collision analysis (a sketch of the table follows).
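A sketch of the collision-table bookkeeping on writes, assuming a simple 16-entry list of colliding line addresses (the structure and interface are illustrative):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Marker Collision Table: uncompressed lines whose data coincidentally equals
// the marker value are recorded here, so a marker match on such an address is
// not misread as "compressed". A 16-entry table (~64B of SRAM) suffices;
// overflow is astronomically rare.
class MarkerCollisionTable {
public:
    bool contains(uint64_t line_addr) const {
        return std::find(entries_.begin(), entries_.end(), line_addr) != entries_.end();
    }
    // Returns false on overflow; the design can then change the marker value
    // and recompress memory, but this is expected essentially never to happen.
    bool insert(uint64_t line_addr) {
        if (contains(line_addr)) return true;
        if (entries_.size() >= 16) return false;
        entries_.push_back(line_addr);
        return true;
    }
    void erase(uint64_t line_addr) {
        entries_.erase(std::remove(entries_.begin(), entries_.end(), line_addr),
                       entries_.end());
    }
private:
    std::vector<uint64_t> entries_;
};
```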

Metadata with the Line: How to Locate It?
The marker travels with the line, so before the access we do not know which location to read: Line B sits in location-2 if uncompressed, or is packed with Line A in location-1 (with an invalid marker left in location-2) if compressed. Reading all possible locations would waste bandwidth. Solution: predict the compressibility, and hence the location, so the line can be read and interpreted in a single access when the prediction is accurate.

Overview
Background | Proposal: Address Mapping, In-line Metadata, Location Prediction, Dynamic Policy | Results
Next: Location Prediction (targeting metadata overhead)

Page-based Line-Location Predictor
Lines within a page tend to store similar data and thus have similar compressibility. The predictor hashes the page address and stores the last compressibility seen for that page: if the last line was compressed (M=1), predict compressed indexing; otherwise (M=0), predict base indexing. A simple last-time predictor is good enough, augmented to per-page tracking when multiple pages are accessed simultaneously. The page-based predictor achieves 98% location-prediction accuracy by exploiting spatial locality in compressibility (a predictor sketch follows).
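A sketch of the page-based last-compressibility predictor. The 1024-entry table matches the 128B predictor budget quoted later (1 bit per entry), but the hash and indexing are assumptions:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Remember the compression status last seen for each (hashed) page and predict
// that the next line accessed in that page has the same status, exploiting
// spatial locality in compressibility.
class PageLocationPredictor {
public:
    bool predict_compressed(uint64_t line_addr) const {
        return table_[index(line_addr)];
    }
    // Train with the actual status discovered from the in-line marker.
    void update(uint64_t line_addr, bool was_compressed) {
        table_[index(line_addr)] = was_compressed;
    }

private:
    static constexpr std::size_t kEntries = 1024;
    std::array<bool, kEntries> table_{};   // conceptually 1 bit per entry (128B SRAM)

    static std::size_t index(uint64_t line_addr) {
        uint64_t page = line_addr >> 6;            // 64 cache lines per 4KB page
        return (page ^ (page >> 10)) % kEntries;   // simple page-address hash
    }
};
```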

Verifying Predictions with the Marker
Correct prediction: a single access reaches the right location, the marker confirms how to interpret the data, and Line B is found with no metadata lookup. Incorrect prediction: the first access returns a marker (or its absence) indicating that the line is actually elsewhere, so a second access is needed, costing some bandwidth. Fortunately, mispredictions are rare. With in-line metadata and location prediction, Practical TMC avoids most of the metadata overhead (a read-path sketch follows).
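Putting the predictor and marker together, a sketch of the resulting read path. The helper functions are assumed stubs for the pieces sketched above, and the fallback semantics are an illustrative interpretation of the slide:

```cpp
#include <cstdint>

// PTMC read path sketch: read the predicted location first, use the in-line
// marker to verify the guess, and issue a second access only on a
// misprediction (rare).
struct Slot64B { uint8_t bytes[64]; };

enum class SlotTag { Uncompressed, CompressedPair, Invalidated };

extern Slot64B dram_read(uint64_t slot);                        // one 64B access
extern SlotTag classify(const Slot64B& s, uint64_t slot);       // marker + collision check
extern bool    predict_compressed(uint64_t line_addr);          // page-based predictor
extern void    train(uint64_t line_addr, bool was_compressed);  // predictor update

Slot64B read_line_ptmc(uint64_t line_addr, int& accesses) {
    const uint64_t pair_slot = line_addr & ~1ULL;   // where a compressed pair lives
    const uint64_t own_slot  = line_addr;           // conventional location
    const bool guess = predict_compressed(line_addr);

    Slot64B data = dram_read(guess ? pair_slot : own_slot);
    accesses = 1;

    SlotTag tag = classify(data, guess ? pair_slot : own_slot);
    // Invalidated means "this slot's data moved to the pair slot", i.e. compressed.
    const bool compressed = (tag != SlotTag::Uncompressed);

    if (compressed != guess && pair_slot != own_slot) {   // misprediction: one more access
        data = dram_read(compressed ? pair_slot : own_slot);
        accesses = 2;
    }
    train(line_addr, compressed);
    return data;   // one line, or a compressed pair for the cache hierarchy to unpack
}
```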

Overview
Background | Proposal: Address Mapping, In-line Metadata, Location Prediction, Dynamic Policy | Results
Next: Results

Methodology
Core: 3.2GHz, 4-wide, out-of-order
Chip: 8 cores, 8MB shared last-level cache
Compression: FPC + BDI; 4-to-1 compression also supported (see paper)
The compression algorithms used are Frequent Pattern Compression (FPC) and Base-Delta-Immediate (BDI); the choice of algorithm is orthogonal to the study and any algorithm can be substituted.

Methodology (cont.)
Commodity DRAM: 16GB capacity, DDR 1.6GHz 64-bit bus, 2 channels, 25 GBps bandwidth, 35ns latency
Two channels of DRAM modeled with DRAMSim2; other sensitivities are in the paper.

TMC and Practical TMC Performance
[Results across SPEC, GAP (graph), and MIX workloads, evaluated on an 8-core system with 2 channels of DRAM; see paper for more workloads]
Metadata lookup limits TMC performance. PTMC eliminates most metadata lookups with the in-line marker and line-location prediction, enabling speedup. However, some cases are still degraded: significant location mispredictions waste bandwidth and latency.

Overview
Background | Proposal: Address Mapping, In-line Metadata, Location Prediction, Dynamic Policy | Results
Next: Dynamic Policy (enabling robust performance)

Bandwidth Benefits and Costs of Compression
Benefit: a useful prefetch delivers two useful lines with one access. Costs: a location misprediction yields one useful line at the price of two accesses, and bandwidth is spent relocating lines when their compressibility changes. Compression should be disabled when the costs outweigh the benefits.

Dynamic PTMC Implementation
Sampling: about 1% of the sets of memory always attempt compression, and a utility counter observes their benefits (increments) versus costs (decrements). The remaining 99% of sets compress or not depending on the counter: if the cost of compression exceeds the benefit, Dynamic-PTMC disables compaction. The scheme extends to multi-core with per-thread counters. Note that disabling compaction does not help metadata-based approaches: they must clean all possibly compressed lines from memory before a line can be read confidently without a metadata access. Dynamic-PTMC instead simply stops compressing lines together, and location-prediction accuracy naturally rises as memory becomes mostly uncompressed (a sketch of the utility counter follows).
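A sketch of the set-sampling utility counter that drives this decision. The roughly 1% sampling and the 4096 counter ceiling follow the slide; the per-event weights and interface are assumptions:

```cpp
#include <cstdint>

// Dynamic-PTMC policy sketch: a small fraction of sets always attempt
// compression and charge benefits (useful prefetches) versus costs (extra
// accesses on mispredictions, relocation traffic) to one saturating utility
// counter; the remaining sets compress only while the counter shows a net gain.
class DynamicPtmcPolicy {
public:
    bool is_sampled_set(uint64_t set_index) const { return set_index % 100 == 0; }  // ~1%

    // Called from the sampled sets only.
    void on_useful_prefetch()   { bump(+1); }   // benefit: 2 useful lines, 1 access
    void on_misprediction()     { bump(-1); }   // cost: extra access
    void on_relocation_access() { bump(-1); }   // cost: repacking when compressibility changes

    // Followed by the non-sampled (follower) sets.
    bool compression_enabled() const { return utility_ > 0; }

private:
    void bump(int delta) {
        int32_t next = utility_ + delta;
        if (next >  kMax) next =  kMax;
        if (next < -kMax) next = -kMax;
        utility_ = next;
    }
    static constexpr int32_t kMax = 4096;   // saturating counter
    int32_t utility_ = 0;
};
```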

Dynamic PTMC Performance
[Results across SPEC, GAP (graph), and MIX workloads]
Dynamic-PTMC disables compression when it is harmful (e.g., when location-prediction accuracy is low), removing all slowdown and ensuring robust performance: no slowdown across all workloads tested.

Hardware Cost of Proposed Dynamic-PTMC
Memory controller modifications: compression/decompression logic plus a small amount of additional SRAM storage in the controller.
SRAM storage: Markers 72B, Collision Table 64B, Location Predictor 128B, Dynamic-PTMC counters 12B; total 276B.
Dynamic-PTMC enables robust speedup with minimal cost (276B of SRAM, single-point modification to the memory controller).

Practical Transparent Memory Compression
✔ Bandwidth benefits
✔ OS transparent (hardware-only approach)
✔ Commodity memory (modified address mapping)
✔ Negligible metadata lookup (in-line marker + location prediction)
✔ Robust performance (dynamic solution)
Thank you!

Additional Slides

Security of Markers
A marker collision occurs when the data stored in a line coincidentally matches the marker. We use 32-bit markers, created per-line with a cryptographic collision-resistant hash, so a line has a 1/2^32 chance of coincidentally storing its marker value. On average, the memory space holds about one marker collision. We provision storage for 16 such exceptions; overflow is unlikely (more than 10 million years per overflow event, even assuming all of memory is written every nanosecond), and on overflow the marker can be changed and memory recompressed.

Dynamic Solution on Prior Metadata-based Methods
In prior approaches, the major cost is the metadata access, which occurs even for incompressible lines. Even if active compaction is disabled, metadata must still be read until all possibly compressed lines are cleaned from memory. Dynamic-PTMC, in contrast, removes the cost of compression simply by no longer compressing data: lines sit at their uncompressed index and their location is easily predicted.

Location Prediction Accuracy vs. Metadata Cache Hit Rate
The metadata cache achieves a 72% hit rate, versus 98% accuracy for location prediction; prior approaches pay a high bandwidth overhead for metadata.

Enabling 4-to-1 Compression
On a write, compressed lines are reorganized together in memory, restricted to 3 possible remappings: 4-to-1, 2-to-1, and uncompressed. [Table in slide: possible locations of lines A, B, C, D under each of the three remappings; e.g., Line B has two possible locations.] On a read, a line therefore has 1 to 3 possible locations and compression statuses. The page-based last-compressibility line-location predictor remains effective; an additional 4-to-1 marker value is needed.

Invert Line on Collision: Only Compressed Lines Store the Marker
When an uncompressed line to be installed in memory happens to contain the marker value, store it in inverted form and record the inversion in a small Line Inversion Table (LIT); otherwise install it as-is. With this scheme only compressed lines ever store the marker, which reduces the collision checking needed for compressed lines (a sketch of the install-time check follows).
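A sketch of the install-time inversion check. The marker position, fixed marker value, and LIT interface are illustrative assumptions:

```cpp
#include <array>
#include <cstdint>
#include <cstring>

// Line inversion on marker collision: if an uncompressed line about to be
// installed happens to contain the marker value, store it bit-inverted and
// record that fact in the Line Inversion Table, so that only genuinely
// compressed slots ever hold the marker.
struct Slot64B { std::array<uint8_t, 64> bytes; };

constexpr uint32_t kMarker = 0xdeadbeef;

inline bool contains_marker(const Slot64B& s) {
    uint32_t tail;
    std::memcpy(&tail, s.bytes.data() + 60, sizeof(tail));
    return tail == kMarker;
}

inline Slot64B invert(Slot64B s) {
    for (auto& b : s.bytes) b = static_cast<uint8_t>(~b);
    return s;
}

// Returns the data to actually write to DRAM; lit_bit records whether the
// stored copy is inverted (consulted again on reads to undo the inversion).
Slot64B install_uncompressed(const Slot64B& line, bool& lit_bit) {
    if (contains_marker(line)) {   // would otherwise be misread as "compressed"
        lit_bit = true;
        return invert(line);       // the inverted copy cannot match the marker
    }
    lit_bit = false;
    return line;
}
```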