Presentation transcript:

Enabling Technologies for Memory Compression: Metadata, Mapping and Prediction
Arjun Deb*, Ali Shafiee*, Rajeev Balasubramonian*, Paolo Faraboschi†, Naveen Muralimanohar†, Robert Schreiber†
*University of Utah  †Hewlett Packard Labs
Good morning everyone, my name is Arjun Deb. Today I'll be talking about my paper, which deals with technologies that make memory compression more amenable to commercial adoption.

Executive Summary
- Prior works on memory compression have required OS involvement.
- To eliminate OS involvement and facilitate commercial adoption:
  - Integrate compression metadata in ECC.
  - Introduce compressibility prediction for memory reads.
  - Map cache lines to promote uniform activity across chips.
The problem with compression-based designs of the past is that they have always required some form of intervention from the OS. Our approach focuses on eliminating the need to involve the OS. We do this by encoding compression metadata into the ECC, and we go a step further by introducing a mechanism to predict the compressed size of the data to be read. We also explore mapping cache lines to chips in a smart manner, so as to allow uniform activity across chips.

Background
Reasons for OS involvement:
- Managing non-uniform page sizes (IBM MXT, Ekman et al.).
- Managing metadata (Shafiee et al., Sathish et al.).
Our solution: confine metadata management to hardware.
Compressing data in main memory leads to non-uniform page sizes, thereby complicating the paging mechanism. It also gives the OS the additional responsibility of managing compression-related metadata. Our solution focuses on eliminating the need to involve the OS by ensuring that metadata management is a hardware-only function. This is achieved by using a mechanism to predict the compression metadata.

Compression Metadata
- Pertains to data regarding compression sizes.
- Needed to ensure the right number of reads and writes.
- Can be determined on the fly while writing back.
- Required before issuing a read.
Compression metadata is simply the data that tells us the extent to which cache lines residing in main memory are compressed. To save power while reading and writing, this data lets us issue exactly the number of reads required: issuing fewer reads than required leads to functional issues, and issuing more than required leads to wasteful energy consumption. During write-backs to main memory this metadata can be determined at the time of writing, but it is unknown before issuing a read. There are two ways to handle this: we can either cache this data inside the memory controller, as in IBM MXT, MemZip, Ekman et al., and Sathish et al., or we can predict it.

Baseline System
[Slide diagram: memory controller connected to a rank of 9 chips (chips 0-8) over a shared ADDR/CMD bus and data bus.]
A conventional "by 8" DRAM rank with ECC has 9 chips, with a common command/address and data bus, and all 9 chips in the rank are accessed in unison.

Sub-Ranked System
RD: ECC check + decompress. WR: ECC code + compress.
[Slide diagram: memory controller connected to 9 sub-ranks (0-8) over an ADDR/CMD bus, a data bus, and per-chip chip-select signals.]
The idea of sub-ranking has been introduced in prior work; it is a slight deviation from the existing JEDEC protocol. Sub-ranking advocates the addition of chip-select signals at the DRAM interface so that the memory controller can access individual chips in a rank. This fine-grained access to chips, coupled with compression, has the potential to save power during reads and writes by minimizing the number of chips affected by each read or write.

Metadata Elimination
- We use a 7-bit SECDED code to protect 57 bits of data.
- An uncompressed cache line (512 bits) requires 9x7 bits of ECC. This leaves 1 bit in the data field to indicate incompressibility.
- A cache line compressed with BΔI (<= 455 bits) requires only 8x7 bits of ECC. This frees up 7 more bits in the ECC field to store compression information.
I've been talking about eliminating metadata, and I'm going to shed some light on that now. Instead of using an 8-bit SECDED code to protect chunks of data that are 64 bits wide, we use a 7-bit SECDED code to protect chunks that are 57 bits wide. For an uncompressed cache line we need 9x7 bits of ECC, which can protect 513 bits of data, leaving one bit to indicate incompressibility.
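
The bit accounting behind this slide can be checked in a few lines. A minimal sketch follows, with the constants taken from the slide; the helper `ecc_budget` is just illustrative arithmetic, not part of the design:

```python
import math

SECDED_BITS = 7        # check bits per SECDED code word
DATA_PER_CODE = 57     # data bits protected by each 7-bit SECDED code
CACHE_LINE_BITS = 512

def ecc_budget(payload_bits):
    """Return (SECDED codes needed, ECC bits used, spare data bits)."""
    codes = math.ceil(payload_bits / DATA_PER_CODE)
    spare = codes * DATA_PER_CODE - payload_bits
    return codes, codes * SECDED_BITS, spare

print(ecc_budget(CACHE_LINE_BITS))  # (9, 63, 1): uncompressed line; the 1 spare
                                    # data bit flags "incompressible"
print(ecc_budget(455))              # (8, 56, 1): a BΔI-compressed line needs only
                                    # 8 codes, freeing 7 bits of the 63-bit ECC
                                    # field for compression metadata
```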

Leveraging Metadata Elimination
- To know the exact size of a cache line, the metadata block has to be read.
- This metadata block can be placed in chips in two ways:
  - Confine a cache line to one chip (vertical interleaving).
  - Distribute a cache line across all chips (horizontal interleaving).
(Read from slide.)

Leveraging Metadata Elimination: Vertical Interleaving
[Slide diagram: the memory controller issues a first read to a sub-rank to get ECC + metadata, decodes the metadata to determine the reads needed (e.g., 8 reads), then reads the remaining data.]
Now I'm going to describe vertical interleaving. In this case the first 8 bytes of a cache line contain the compression metadata, and the first read to the sub-rank fetches it. This metadata is then decoded to determine the exact number of subsequent reads needed.
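
As a rough illustration of this two-phase read flow, here is a minimal sketch; `read_beat`, `decode_metadata`, and `decompress` are hypothetical stand-ins for the memory-controller logic, not the paper's implementation:

```python
# Sketch of the vertical-interleaving read flow: the first beat carries the
# compression metadata, which tells us how many further beats to fetch.
def read_line_vertical(read_beat, decode_metadata, decompress, line_addr):
    first = read_beat(line_addr, 0)           # first 8B: ECC + compression metadata
    beats_needed = decode_metadata(first)     # e.g. 8 beats for an uncompressed line
    beats = [first] + [read_beat(line_addr, b) for b in range(1, beats_needed)]
    return decompress(beats)                  # compressible lines need fewer beats
```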

Leveraging Metadata Elimination: Vertical Interleaving
[Graph: performance of vertical interleaving versus the baseline, for DRAM and NVM.]
This graph compares an NVM-based system with vertical interleaving against an NVM baseline, and also shows how vertical interleaving in DRAM fares against a baseline DRAM system. In DRAM, compressible workloads benefit, whereas incompressible workloads do not (because of the 9 back-to-back reads). Some NVM systems have no row buffer, which, coupled with high read latencies, makes reads very slow. Performance for DRAM is bad; for NVM it is worse.

Leveraging Metadata Elimination: Horizontal Interleaving
RD: ECC check + decompress. WR: ECC code + compress.
[Slide diagram: memory controller with ADDR/CMD bus, data bus, and chip selects reading a predicted subset of the 9 chips in a rank.]
Now I'm going to explain horizontal interleaving. In this scheme, 1/9th of a cache line's data resides in each chip. Doing a first read just to get the metadata amounts to issuing 2 reads, which is slow; therefore we need prediction. Consider an example where the initial prediction is to read 4 chips. After those reads are done, an ECC check is performed on the data to determine whether it is correct. If the ECC check fails, it means the data is incomplete, and all the remaining chips are read to get the complete data.
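
A minimal sketch of this predict-then-verify read flow follows; the helpers `read_chip` and `ecc_ok` and the `predictor` object are hypothetical stand-ins, and the data-mapping details discussed later are ignored here:

```python
NUM_CHIPS = 9  # 8 data chips plus ECC in the sub-ranked rank

def read_line_horizontal(read_chip, ecc_ok, predictor, pc, line_addr):
    predicted = predictor.predict(pc)                      # e.g. 4 chips
    chunks = [read_chip(line_addr, c) for c in range(predicted)]
    if not ecc_ok(chunks):
        # Under-prediction: the data is incomplete, so read the remaining chips.
        chunks += [read_chip(line_addr, c) for c in range(predicted, NUM_CHIPS)]
    predictor.update(pc, predicted, actual=len(chunks))    # ECC feedback trains it
    return chunks
```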

Prediction Mechanism
[Slide diagram: an incoming request indexes a history table of PCs; each entry holds saturating counters C2, C3, C4, C6, C9; a hit produces the final prediction; ECC feedback updates the counters.]
We use a history table to store the PC values of load instructions. The history table for the PC-based predictor has 64 entries and is fully associative with LRU replacement; for page-based predictors we add the prediction logic into the TLB. An incoming request is used to index into this table. Each PC entry in the history table has an array of five 2-bit saturating counters, one for each compression size of BΔI. On a hit in the history table, the corresponding counter array is looked up, and the block size corresponding to the counter with the largest value is taken as the final prediction. On a miss, the cache line is treated as uncompressed. After a read completes, the feedback from the ECC checker is used to update the counter values: on a correct prediction, the counter corresponding to the correct size is incremented; on a wrong prediction, the counter corresponding to the correct size is incremented and the counter that caused the wrong prediction is decremented.
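
A minimal sketch of the PC-based predictor described above, under a few assumptions beyond the slide (initial counter values of zero, ties broken by counter order):

```python
from collections import OrderedDict

SIZES = [2, 3, 4, 6, 9]        # BΔI size classes in chips (C2, C3, C4, C6, C9)
TABLE_ENTRIES = 64             # fully associative, LRU-replaced history table

class CompressibilityPredictor:
    def __init__(self):
        self.table = OrderedDict()            # PC -> {size: 2-bit counter}

    def predict(self, pc):
        if pc not in self.table:
            return 9                          # miss: treat the line as uncompressed
        self.table.move_to_end(pc)            # LRU touch
        counters = self.table[pc]
        return max(SIZES, key=lambda s: counters[s])   # size with the largest counter

    def update(self, pc, predicted, actual):
        counters = self.table.setdefault(pc, {s: 0 for s in SIZES})
        if len(self.table) > TABLE_ENTRIES:
            self.table.popitem(last=False)    # evict the LRU entry
        counters[actual] = min(counters[actual] + 1, 3)            # reinforce correct size
        if predicted != actual:
            counters[predicted] = max(counters[predicted] - 1, 0)  # penalize the miss
```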

Pitfalls of No Data Mapping
[Slide diagram: two back-to-back reads both hitting chips 0 and 1, separated by tCCD + read latency.]
Another contribution of our paper is data mapping. In a baseline system all cache lines are mapped to chips 0-8, and that has the potential to cause conflicts. To understand this, take the example of a cache line needing 2 chips, and assume that the next cache line to be read needs 2 chips as well. With no mapping, both cache lines end up reading the same chips, thereby decreasing the scope for parallelism.

Data Mapping: Simple Data Mapping
[Slide diagram: even cache lines mapped to chips 0-8 and odd cache lines mapped to chips 8-0, so back-to-back reads proceed with no delay.]
With simple mapping, even-numbered cache lines are mapped to chips 0-8 and odd-numbered cache lines are mapped to chips 8-0. This ensures mutual exclusiveness in terms of the chips needed for adjacent cache-line accesses, but leads to non-uniform write activity. We evaluated a large number of mapping permutations in terms of read parallelism and uniform write activity, using synthetic access patterns, to arrive at the best mapping scheme possible. With permuted mapping, adjacent cache lines are mutually exclusive in terms of the chips needed, and the permutation of chips ensures that write activity is uniformly distributed. A small sketch of the simple scheme follows.
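
This is a minimal sketch of the simple mapping on this slide; the permuted scheme replaces the two fixed chip orders with per-line permutations chosen offline, and the exact permutations from the paper are not reproduced here:

```python
NUM_CHIPS = 9

def chips_for_line(line_number, chips_needed):
    # Even-numbered lines fill chips starting from chip 0; odd-numbered lines
    # fill from chip 8 downward, so adjacent compressed lines touch disjoint chips.
    order = list(range(NUM_CHIPS))
    if line_number % 2 == 1:
        order.reverse()
    return order[:chips_needed]

print(chips_for_line(0, 2))   # [0, 1]
print(chips_for_line(1, 2))   # [8, 7] -> no conflict with the previous read
```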

Data Mapping: Permuted Data Mapping
[Slide diagram: permuted chip orderings for a set of 8 consecutive cache lines, Cache Line A through Cache Line H.]
This diagram shows a mapping scheme for a set of 8 consecutive cache lines.

Data Mapping: Permuted Data Mapping
[Slide diagram: the same 8 cache lines, highlighting that back-to-back accesses touch disjoint chips, with no conflicts.]
Our example of back-to-back reads requiring 2 chips benefits heavily from such an arrangement.

Write Activity
This graph shows the write activity of each sub-rank, normalized to a system with no mapping.
- Default mapping always causes writes on chips 0 and 1, since the minimum compression size is 2 chips.
- Permuted mapping is the best in terms of write uniformity, with the least variance in write activity between chips.

Methodology
Processor model:
- 8-core OOO model using Simics.
- Modified USIMM to model Memristor and DRAM.
- Evaluated on SPEC2006 and NPB workloads.
Memory model:
- 1 channel, 4 ranks, 8 banks, 9 devices/rank.
For our simulations we use the Simics simulator and the USIMM memory model.

DRAM Results: Prediction Accuracy
[Graphs: prediction accuracy of the PC-based and page-based predictors with permuted mapping.]
These graphs show the prediction accuracies of the PC-based and page-based predictors with permuted mapping. Their accuracies are 93% and 97% respectively. Overestimations are bad for energy, and underestimations are bad for performance.

DRAM Results: Performance
- Mapping has a significant effect on performance: simple mapping improves performance by 6%, and permuted mapping improves it by a further 1%.
- Due to the high accuracy of our predictors, the performance of our predictor-based systems is within 1% of oracular schemes.
- Workloads with high compressibility (on the left) show an average improvement of 11%.

DRAM Results: Energy
- Energy savings are high for workloads with high compressibility (left side of the graph), due to fewer chip accesses and reduced execution time.
- The average system energy savings are around 12% with permuted mapping.

NVM Results: Performance
We evaluated an NVM-based system for our paper as well, and its energy and performance improvements are better than those for DRAM.
- Simple mapping improves performance by 7.2%; permuted mapping improves it by a further 1%.
- Workloads with high compressibility (on the left) show an average improvement of 13.5%.
- The best-case average energy reduction is 14%.

Conclusions
- Eliminated OS involvement: metadata integrated with ECC.
- Improved read efficiency with PC- or page-based prediction.
- Improved activity profile with permutation-based mapping.
- Improves performance by 8%, system energy by 14%, and activity variance by 18x.