1
Enabling Technologies for Memory Compression: Metadata, Mapping and Prediction
Arjun Deb*, Ali Shafiee*, Rajeev Balasubramonian*, Paolo Faraboschi†, Naveen Muralimanohar†, Robert Schreiber†
*University of Utah †Hewlett Packard Labs
Good morning everyone, my name is Arjun Deb. Today I’ll be talking about my paper, which deals with technologies that make memory compression better suited to commercial adoption.
2
Executive Summary
Prior works on memory compression have required OS involvement.
To eliminate OS involvement and facilitate commercial adoption:
Integrate compression metadata in ECC.
Introduce compressibility prediction for memory reads.
Map cache lines to promote uniform activity in chips.
The problem with past compression-based designs is that they have always required some form of intervention from the OS. Our approach focuses on eliminating that need. We do this by encoding compression metadata into the ECC. We go a step further by introducing a mechanism to predict the compressed size of the data to be read. We also explore mapping cache lines to chips in a smart manner, so as to allow uniform activity across chips.
3
Background Reasons for OS Involvement:
Managing non-uniform page sizes (IBM MXT, Ekman et al.).
Managing metadata (Shafiee et al., Sathish et al.).
Our Solution: Confine metadata management to hardware.
Compressing data in main memory leads to non-uniform page sizes, which complicates the paging mechanism. It also gives the OS the additional responsibility of managing compression metadata. Our solution focuses on eliminating the need to involve the OS. We do this by making metadata management a “hardware only” function. This is achieved by using a mechanism to predict the compression metadata.
4
Compression Metadata
Pertains to data regarding compression sizes.
Needed to ensure the right number of reads and writes.
Can be determined on the fly while writing back.
Required before issuing a read.
Compression metadata is simply the data that tells us how far each cache line residing in main memory has been compressed. To save power while reading and writing, this data lets us issue the exact number of reads. Issuing fewer reads than required leads to functional errors, and issuing more than required wastes energy. During write-backs to main memory the size can be determined at the time of writing, but it is unknown before issuing a read. There are two ways to handle this: we can either cache this data inside the memory controller, as in IBM MXT, MemZip, Ekman et al. and Sathish et al., or we can predict it.
5
Baseline System
[Figure: a memory controller connected over a shared ADDR/CMD bus and DATA bus to the chips of one rank.]
A conventional “by 8” DRAM rank with ECC has 9 chips, with a common command/address bus and data bus. All 9 chips in the rank are accessed in unison.
6
Sub-Ranked System
[Figure: a memory controller (RD: ECC check + decompress; WR: ECC code + compress) connected over ADDR/CMD bus, DATA bus and CHIP SELECT lines to the sub-ranks of one rank.]
The idea of “sub-ranking” has been introduced in prior work. It is a slight deviation from the existing JEDEC protocol (animation). Sub-ranking adds chip select signals at the DRAM interface to allow the memory controller to access individual chips in a rank. This fine-grained access to chips, coupled with compression, has the potential to save power during reads and writes by minimizing the number of chips affected by each read or write.
7
Metadata Elimination
We use a 7-bit SECDED code to protect 57 bits of data.
An uncompressed cache line (512 bits) would require 9x7 bits of ECC. This gives us 1 bit in the data field to indicate incompressibility.
A cache line compressed with BΔI (<= 455 bits) would require 8x7 bits of ECC. This frees up 7 more bits in the ECC field to store compression information.
I’ve been talking about eliminating metadata, and I’m going to shed some light on that now. Instead of using an 8-bit SECDED code to protect 64-bit chunks of data, we use a 7-bit SECDED code to protect 57-bit chunks. An uncompressed cache line needs 9x7 bits of ECC, which can protect 513 bits of data, leaving one spare bit to indicate incompressibility.
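The bit budget on this slide can be checked with a few lines of arithmetic. The sketch below is only an illustration; the constant and function names are assumptions and do not come from the paper.

```python
CHUNK_DATA_BITS = 57  # data bits protected by one SECDED codeword
CHUNK_ECC_BITS = 7    # check bits per codeword
LINE_BITS = 512       # uncompressed cache line


def ecc_budget(payload_bits):
    """Return (codewords, ecc_bits, spare_data_bits) for a payload."""
    codewords = -(-payload_bits // CHUNK_DATA_BITS)  # ceiling division
    ecc_bits = codewords * CHUNK_ECC_BITS
    spare_data_bits = codewords * CHUNK_DATA_BITS - payload_bits
    return codewords, ecc_bits, spare_data_bits


uncompressed = ecc_budget(LINE_BITS)  # 9 codewords, one spare data bit
compressed = ecc_budget(455)          # largest compressed payload on the slide
freed_ecc_bits = uncompressed[1] - compressed[1]
print(uncompressed, compressed, freed_ecc_bits)
```

The spare data bit in the uncompressed case is the incompressibility flag, and the difference in ECC bits between the two cases is the space the slide reuses for compression information.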
8
Leveraging Metadata Elimination
To know the exact size of a cache line, the metadata block has to be read.
This metadata block can be placed in chips in two ways:
Confine a cache line to one chip (vertical interleaving).
Distribute a cache line across all chips (horizontal interleaving).
(Read from slide.)
9
Leveraging Metadata Elimination Vertical Interleaving
[Figure: the memory controller issues a series of reads (RD) to one sub-rank. First read: get ECC + metadata. Decode the metadata and determine the reads needed (e.g., 8 reads). Read the remaining data.]
Now I’m going to describe vertical interleaving. In this case the first 8 bytes of a cache line contain the compression metadata, and the first read to the sub-rank fetches them (press arrow). This metadata is then decoded to determine the exact number of subsequent reads needed (press arrow).
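The read sequence for vertical interleaving can be sketched as follows. This is a minimal illustration under one assumption that the slides do not state explicitly: each read to a sub-rank returns one 64-bit codeword (57 data bits plus 7 check bits). The function name is hypothetical.

```python
CODEWORD_DATA_BITS = 57  # data bits per 64-bit SECDED codeword (assumed one per read)


def vertical_reads(payload_bits):
    """Return (first_read, follow_up_reads) for a line stored in one sub-rank.

    The first read fetches the metadata; once it is decoded, the controller
    knows how many further reads the line needs.
    """
    codewords = -(-payload_bits // CODEWORD_DATA_BITS)  # ceiling division
    return 1, codewords - 1


# Uncompressed line: 512 data bits plus the incompressibility flag.
print(vertical_reads(513))
```

Under this assumption an uncompressed line needs the first read plus 8 follow-up reads, which matches the 9 back-to-back reads mentioned for incompressible workloads.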
10
Leveraging Metadata Elimination Vertical Interleaving
This graph compares the performance of an NVM-based system using vertical interleaving against an NVM baseline system. It also shows how vertical interleaving in DRAM fares against a baseline DRAM system. For DRAM, compressible workloads benefit, whereas incompressible workloads don’t, because they issue 9 back-to-back reads. Some NVM systems have no row buffer, which, coupled with high read latencies, makes these reads really slow. With vertical interleaving, DRAM performance is bad and NVM performance is worse.
11
Leveraging Metadata Elimination Horizontal Interleaving
[Figure: a memory controller (RD: ECC check + decompress; WR: ECC code + compress) connected over ADDR/CMD bus, DATA bus and CHIP SELECT lines to the chips of one rank, with reads (RD) issued to individual chips.]
Now I’m going to explain what horizontal interleaving is. In this scheme 1/9th of a cache line’s data resides in each chip. Doing a first read to get the metadata amounts to issuing 2 reads, which is slow. Therefore we need prediction. Let’s consider an example where the initial prediction is to read 4 chips (press arrow). After the reads are done, an ECC check is performed on the data to determine whether it is correct. If the ECC check fails, the data is incomplete. In that case all the remaining chips are read to get the complete data (press arrow).
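The predict, check, and fall-back sequence described for horizontal interleaving can be sketched as below. The ECC check is modelled very simply, as passing only when the first read covered every chip that holds part of the line; the function and constant names are assumptions made for illustration.

```python
TOTAL_CHIPS = 9  # chips in the rank, including the ECC chip


def horizontal_read(predicted_chips, needed_chips):
    """Return (chips_read_first, chips_read_second) for one cache-line read."""
    # Stand-in for the ECC check: it fails if the first read was too narrow.
    ecc_ok = predicted_chips >= needed_chips
    # On failure, all the remaining chips are read to complete the line.
    second = 0 if ecc_ok else TOTAL_CHIPS - predicted_chips
    return predicted_chips, second


print(horizontal_read(4, 4))  # prediction was sufficient
print(horizontal_read(4, 6))  # prediction was too small
```

In the second call the prediction of 4 chips is too small, so the remaining 5 chips are fetched in a second read.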
12
Prediction Mechanism
[Figure: an incoming request indexes a history table (PC 0, PC 1, ...). On a hit, the matching entry’s saturating counters, C2(i), C3(i), C4(i), C6(i), C9(i) for entries i = 0 to 63, produce the final prediction. ECC feedback updates the counters.]
We use a history table to store the PC values of load instructions. The history table for the PC-based predictor has 64 entries and is fully associative with LRU replacement. For page-based predictors we add the prediction logic to the TLB. The incoming request is used to index into this table (press arrow). Each PC entry in the history table has an array of 5 saturating 2-bit counters, one for each compression size of BΔI. On a hit in the history table, the corresponding counter array is looked up, and the block size whose counter has the largest value is taken as the final prediction. On a miss, the cache line is treated as uncompressed (press arrow). After a read completes, feedback from the ECC checker is used to update the counter values. On a correct prediction, the counter for the correct size is incremented. On a wrong prediction, the counter for the correct size is incremented and the counter that caused the wrong prediction is decremented.
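A minimal sketch of the PC-based predictor described above follows. Several details are assumptions that the slides do not specify: the labels C2 to C9 are read as the number of chips each size class needs, an entry is allocated when feedback arrives for a PC that is not in the table, and ties between counters resolve toward the smaller size. The class and method names are hypothetical.

```python
from collections import OrderedDict

SIZES = (2, 3, 4, 6, 9)  # assumed meaning of C2..C9: chips needed per size class
COUNTER_MAX = 3          # 2-bit saturating counter
TABLE_ENTRIES = 64       # fully associative, LRU replacement


class SizePredictor:
    def __init__(self):
        self.table = OrderedDict()  # PC -> counter list, oldest entry first

    def predict(self, pc):
        counters = self.table.get(pc)
        if counters is None:
            return SIZES[-1]        # miss: treat the line as uncompressed
        self.table.move_to_end(pc)  # mark as most recently used
        best = max(range(len(SIZES)), key=lambda i: counters[i])
        return SIZES[best]

    def update(self, pc, predicted, actual):
        """Apply ECC feedback: reward the correct size, penalize a wrong guess."""
        if pc not in self.table:
            if len(self.table) >= TABLE_ENTRIES:
                self.table.popitem(last=False)  # evict least recently used
            self.table[pc] = [0] * len(SIZES)
        counters = self.table[pc]
        self.table.move_to_end(pc)
        right = SIZES.index(actual)
        counters[right] = min(COUNTER_MAX, counters[right] + 1)
        if predicted != actual:
            wrong = SIZES.index(predicted)
            counters[wrong] = max(0, counters[wrong] - 1)


predictor = SizePredictor()
first_guess = predictor.predict(0x400)     # miss, so uncompressed
predictor.update(0x400, first_guess, 2)    # the line actually needed 2 chips
second_guess = predictor.predict(0x400)
print(first_guess, second_guess)
```

A page-based variant would use the page number as the key in place of the PC; the slides place that logic in the TLB.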
13
Pitfalls of No Data Mapping
[Figure: two reads (RD) to the same chips of a rank, labelled “tCCD + Read Latency”.]
Another contribution of our paper is data mapping. In a baseline system all cache lines are mapped to chips 0-8, which can cause conflicts. To understand this, let’s take the example of a cache line needing 2 chips (press arrow). Now let’s assume that the next cache line to be read needs 2 chips as well (press arrow). With no mapping, both cache lines end up reading the same chips, which reduces the scope for parallelism.
14
Data Mapping Simple Data Mapping
[Figure: even-numbered cache lines occupy chips in ascending order and odd-numbered cache lines in descending order (8, 7, ..., 1), so two back-to-back reads (RD) proceed with no delay.]
With simple mapping, even-numbered cache lines are mapped to chips 0-8, and odd-numbered cache lines are mapped to chips 8-0 (press arrow). This ensures that adjacent cache line accesses need mutually exclusive sets of chips, but it leads to non-uniform write activity. We evaluated a large number of mapping permutations for read parallelism and uniform write activity, using synthetic access patterns, to arrive at the best mapping scheme possible. With permuted mapping, adjacent cache lines are mutually exclusive in terms of the chips they need, and the permutation of chips ensures that write activity is uniformly distributed.
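The difference between no mapping and simple mapping for two adjacent 2-chip cache lines can be sketched as follows. The function names are hypothetical, and the chip numbering 0-8 follows the speaker notes.

```python
TOTAL_CHIPS = 9


def chips_no_mapping(line_index, chips_needed):
    # Every line starts at chip 0, regardless of its index.
    return list(range(chips_needed))


def chips_simple_mapping(line_index, chips_needed):
    # Even lines fill chips 0, 1, 2, ...; odd lines fill chips 8, 7, 6, ...
    if line_index % 2 == 0:
        order = range(TOTAL_CHIPS)
    else:
        order = range(TOTAL_CHIPS - 1, -1, -1)
    return list(order)[:chips_needed]


def conflict(chips_a, chips_b):
    """True if two accesses touch at least one common chip."""
    return bool(set(chips_a) & set(chips_b))


print(conflict(chips_no_mapping(0, 2), chips_no_mapping(1, 2)))
print(conflict(chips_simple_mapping(0, 2), chips_simple_mapping(1, 2)))
```

Note that in this sketch adjacent lines stay disjoint only while their combined size is at most 9 chips; two 5-chip lines would still share chip 4.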
15
Data Mapping Permuted Data Mapping
[Figure: cache lines A through H laid out across chips 1-8 under permuted mapping.]
This diagram shows a mapping scheme for a set of 8 consecutive cache lines (press arrow).
16
Data Mapping Permuted Data Mapping
[Figure: the same layout of cache lines A through H across chips 1-8, marked “No Conflicts”.]
(press arrow) Our example of back-to-back reads requiring 2 chips each benefits heavily from such an arrangement.
17
Write Activity
This graph shows the write activity of each sub-rank, normalized to a system with no mapping. Default mapping always causes writes on chips 0 and 1, since the minimum compressed size is 2 chips. Permuted mapping is the best in terms of write uniformity, with the least variance in write activity between chips.
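The effect of mapping on write uniformity can be illustrated with a small synthetic count of writes per chip. The permutation used in the paper is not given in these slides, so the sketch substitutes a simple rotation as a hypothetical stand-in; the size mix is also synthetic, and all names are assumptions.

```python
from collections import Counter

TOTAL_CHIPS = 9


def no_mapping(line_index, chips_needed):
    # Default layout: every line starts at chip 0.
    return list(range(chips_needed))


def rotated_mapping(line_index, chips_needed):
    # Hypothetical stand-in, NOT the paper's permutation:
    # each line starts one chip further along than the previous one.
    return [(line_index + c) % TOTAL_CHIPS for c in range(chips_needed)]


def write_counts(sizes, mapping):
    """Count how many writes land on each chip for a sequence of line sizes."""
    counts = Counter({chip: 0 for chip in range(TOTAL_CHIPS)})
    for line_index, chips_needed in enumerate(sizes):
        counts.update(mapping(line_index, chips_needed))
    return [counts[chip] for chip in range(TOTAL_CHIPS)]


sizes = [2, 3, 4, 6, 9] * 18  # synthetic mix of five size classes
baseline = write_counts(sizes, no_mapping)
rotated = write_counts(sizes, rotated_mapping)
print(baseline)
print(rotated)
```

With no mapping, chips 0 and 1 are written on every access, while the rotation spreads the same writes far more evenly across chips.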
18
Methodology
Processor Model: 8-core OOO model using Simics.
Memory Model: 1 channel, 4 ranks, 8 banks, 9 devices/rank.
Modified USIMM to model Memristor & DRAM.
Evaluated on SPEC2006 & NPB workloads.
For our simulations we use the Simics cycle-accurate simulator and the USIMM memory model.
19
DRAM Results Prediction Accuracy
[Figure: two graphs, PC Based Predictor and Page Based Predictor.]
These graphs show the prediction accuracies of the PC-based and page-based predictors for permuted mapping. Their accuracies are 93% and 97% respectively. Overestimations are bad for energy, and underestimations are bad for performance.
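The three prediction outcomes mentioned here can be tallied as in the sketch below. The function names and the sample data are illustrative assumptions, not results from the paper.

```python
def classify(predicted_chips, actual_chips):
    """Label one prediction by comparing predicted and actual chip counts."""
    if predicted_chips == actual_chips:
        return "correct"
    # Reading too many chips wastes energy; reading too few forces a second read.
    return "overestimate" if predicted_chips > actual_chips else "underestimate"


def summarize(pairs):
    """Return the fraction of predictions falling into each outcome."""
    totals = {"correct": 0, "overestimate": 0, "underestimate": 0}
    for predicted_chips, actual_chips in pairs:
        totals[classify(predicted_chips, actual_chips)] += 1
    return {label: count / len(pairs) for label, count in totals.items()}


# Made-up (predicted, actual) pairs, for illustration only.
sample = [(2, 2), (4, 2), (2, 4), (9, 9)]
print(summarize(sample))
```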
20
DRAM Results Performance
Mapping has a significant effect on performance. Simple mapping improves performance by 6%. Permuted mapping improves it by a further 1%. Due to the high accuracy of our predictors, the performance of our predictor-based systems is within 1% of oracular schemes. Workloads with high compressibility (on the left) show an average improvement of 11%.
21
DRAM Results Energy
Energy savings are high for workloads with high compressibility (left side of the graph), due to fewer chip accesses and reduced execution time. The average system energy savings are around 12% with permuted mapping.
22
NVM Results Performance
We’ve evaluated an NVM-based system for our paper as well, and the energy and performance improvements are better than those for DRAM. Simple mapping improves performance by 7.2%. Permuted mapping improves it by a further 1%. Workloads with high compressibility (on the left) show an average improvement of 13.5%. The best-case average energy reduction is 14%.
23
Conclusions
Eliminated OS involvement: metadata integrated with ECC.
Improved read efficiency with PC or page-based predictions.
Improved activity profile with permutation based mapping.
Improves performance by 8%, system energy by 14%, and activity variance by 18x.