A Robust Main-Memory Compression Scheme (ISCA 06) Magnus Ekman and Per Stenström, Chalmers University of Technology, Göteborg, Sweden. Speaker: 雋中.

Outline
- Introduction
- Contributions
- Effectiveness of Zero-Aware Compressors
- The Proposed Compression Scheme
- Performance Results
- Conclusions

Introduction
- Memory resources are wasted to compensate for the increasing processor/memory/disk speed gap:
  - >50% of die size is occupied by caches
  - >50% of the cost of a server is DRAM (and increasing)
- Lossless data compression techniques have the potential to free up more than 50% of memory resources.
- Unfortunately, compression introduces several challenging design and performance issues.

[Figure: baseline system (core, L1 cache, L2 cache); a request goes directly to the main memory space, which returns the data.]

[Figure: compressed system (core, L1 cache, L2 cache); a request first passes through address translation via a translation table, and data returned from the fragmented, compressed main-memory space passes through a decompressor.]

Contributions
A low-overhead main-memory compression scheme:
- Low decompression latency, by using a simple and fast (zero-aware) algorithm
- Fast address translation, by a proposed small translation structure that fits on the processor die
- Reduction of fragmentation, through occasional relocation of data when compressibility varies
Overall, our compression scheme frees up 30% of the memory at a marginal performance loss of 0.2%!

Frequency of zero-valued locations
- 12% of all 8KB pages contain only zeros
- 30% of all 64B blocks contain only zeros
- 42% of all 4B words contain only zeros
- 55% of all bytes are zero!
Zero-aware compression schemes have great potential!
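To make these granularities concrete, here is a minimal C sketch of how one might gather such statistics over a memory snapshot. It is our own illustration, not the paper's methodology; the buffer and its contents are made up.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Count all-zero bytes, 4B words, and 64B blocks in a buffer. The paper
 * reports these statistics over full application memory images; a toy
 * buffer stands in here. */
static void zero_stats(const uint8_t *buf, size_t len) {
    size_t zero_bytes = 0, zero_words = 0, zero_blocks = 0;
    for (size_t i = 0; i < len; i++)
        zero_bytes += (buf[i] == 0);
    for (size_t i = 0; i + 4 <= len; i += 4) {
        uint32_t w;
        memcpy(&w, buf + i, 4);            /* safe unaligned read */
        zero_words += (w == 0);
    }
    for (size_t i = 0; i + 64 <= len; i += 64) {
        int all_zero = 1;
        for (size_t j = 0; j < 64; j++)
            if (buf[i + j] != 0) { all_zero = 0; break; }
        zero_blocks += all_zero;
    }
    printf("zero bytes: %zu/%zu, zero 4B words: %zu, zero 64B blocks: %zu\n",
           zero_bytes, len, zero_words, zero_blocks);
}

int main(void) {
    static uint8_t page[8192] = {0};       /* one mostly-zero 8KB page */
    page[100] = 42;
    zero_stats(page, sizeof page);
    return 0;
}
```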

System Overview

Evaluated Algorithms
Zero-aware algorithms:
- FPC (Alameldeen and Wood) + 3 simplified versions
For comparison, we also consider:
- X-Match Pro (efficient hardware implementations exist)
- LZSS (popular algorithm, previously used by IBM for memory compression)
- Deflate (upper bound on compressibility)

Frequent Pattern Compression (Alameldeen and Wood)
Each 32-bit word is coded using a prefix plus data:

Prefix | Pattern encoded                          | Data size
000    | Zero run                                 | 3 bits (for runs of up to 8 zeros)
001    | 4-bit sign-extended                      | 4 bits
010    | One byte sign-extended                   | 8 bits
011    | Halfword sign-extended                   | 16 bits
100    | Halfword padded with a zero halfword     | 16 bits
101    | Two halfwords, each a byte sign-extended | 16 bits
110    | Word consisting of repeated bytes        | 8 bits
111    | Uncompressed                             | 32 bits

(The slide builds this up from a simpler four-pattern table with 2-bit prefixes: 00 zero word, 01 one byte sign-extended, 10 halfword sign-extended, 11 uncompressed.)
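As a rough illustration of the table, the following C sketch classifies one 32-bit word into an FPC pattern. It is our own rendering, not the paper's hardware: zero runs (prefix 000) span consecutive words and are left to the caller, and pattern 100 is assumed to mean the low halfword is zero.

```c
#include <stdint.h>

/* Return the FPC prefix (1..7) for one 32-bit word and set *data_bits to
 * the number of data bits that follow the prefix. Prefix 0 (zero run) is
 * handled by the caller, since runs span multiple words. */
static int fpc_classify(uint32_t w, int *data_bits) {
    int32_t s = (int32_t)w;
    if (s >= -8 && s <= 7)         { *data_bits = 4;  return 1; } /* 001 */
    if (s >= -128 && s <= 127)     { *data_bits = 8;  return 2; } /* 010 */
    if (s >= -32768 && s <= 32767) { *data_bits = 16; return 3; } /* 011 */
    if ((w & 0xFFFFu) == 0)        { *data_bits = 16; return 4; } /* 100 */
    int16_t hi = (int16_t)(w >> 16), lo = (int16_t)(w & 0xFFFFu);
    if (hi >= -128 && hi <= 127 &&
        lo >= -128 && lo <= 127)   { *data_bits = 16; return 5; } /* 101 */
    if (w == (w & 0xFFu) * 0x01010101u)
                                   { *data_bits = 8;  return 6; } /* 110 */
    *data_bits = 32; return 7;                                    /* 111 */
}
```

Note that an all-zero word falls into the 4-bit case here; a real encoder would first fold consecutive zero words into zero runs.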

Resulting Compressed Sizes
Main observations:
- FPC and all its variations can free up about 45% of memory
- LZSS and X-MatchPro are only marginally better, in spite of their complexity
- Deflate can free up about 80% of memory, but it is not clear how to exploit it
Fast and efficient compression algorithms exist!
[Figure: compressed sizes for the SpecInt, SpecFP, and Server workloads.]

Three FPC versions
- FPC only zeros: uses the eight patterns but does not code sign-extended negative numbers
- FPC simple: uses four patterns (zero run, one byte sign-extended, halfword sign-extended, and uncompressed)
- FPC simple on zeros: uses four patterns but does not allow sign-extended negative numbers

Address translation
[Figure: uncompressed data, compressed data, and compressed fragmented data, annotated with a block size vector; the translation path uses the Block Size Table (BST), the TLB, and an address calculator.]
- A block is assigned one out of n predefined sizes.
- OS changes:
  - The block size vector is kept in the page table
  - Each page is assigned one out of k predefined sizes; the physical address grows by log2(k) bits
The Block Size Table enables fast translation!
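The address calculation the BST enables can be sketched as a prefix sum: the byte offset of a block inside its compressed page is the sum of the sizes of the blocks before it. The C below is a minimal software rendering under assumed size classes (the paper's exact predefined sizes are not preserved in this transcript); hardware would do the summation with parallel adders.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical predefined sizes (bytes) for a 64B block; size class 0
 * marks an all-zero block that occupies no space. */
static const uint32_t kBlockSize[4] = {0, 22, 44, 64};

/* Byte offset of block `idx` within its compressed page, given the 2-bit
 * size class of every block (the page's block size vector). */
static uint32_t block_offset(const uint8_t *size_class, size_t idx) {
    uint32_t off = 0;
    for (size_t i = 0; i < idx; i++)
        off += kBlockSize[size_class[i] & 3];
    return off;
}
```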

Size changes and compaction
[Figure: a page divided into sub-page 0 and sub-page 1, with slack left after each compressed block (block slack), at the end of each sub-page (sub-page slack), and at the end of the page (page slack); a block that grows past its slack causes a block overflow, and one that shrinks causes a block underflow.]
Terminology: block overflow/underflow, sub-page overflow/underflow, page overflow/underflow.

Handling of overflows/underflows
- Block and sub-page overflows/underflows imply moving data within a page
- On a page overflow/underflow, the entire page is moved, to avoid having to move several pages
- Block and sub-page overflows/underflows are handled in hardware by an off-chip DMA engine
- On a page overflow/underflow, a trap is taken and the mapping for the page is changed
The processor has to stall if it accesses data that is being moved!
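To sketch the control flow at the block level, the C below models a write-back that changes a block's size class: either the new size still fits the sub-page, in which case trailing blocks are shifted (by the DMA engine in the real design), or the sub-page overflows and the event escalates toward a sub-page/page move or an OS trap. Structure names, size classes, and capacities are all our own assumptions.

```c
#include <stdint.h>
#include <stdio.h>

#define NBLOCKS 16                      /* blocks per sub-page (assumed) */
static const uint32_t kBlockSize[4] = {0, 22, 44, 64};  /* assumed sizes */

struct subpage {
    uint8_t  cls[NBLOCKS];   /* size class of each block (BST contents) */
    uint32_t capacity;       /* bytes allocated to this sub-page        */
};

static uint32_t used_bytes(const struct subpage *sp) {
    uint32_t u = 0;
    for (int i = 0; i < NBLOCKS; i++) u += kBlockSize[sp->cls[i]];
    return u;
}

/* A write-back changed block `b` to size class `nc`. Returns 0 if the
 * block fits (after shifting trailing blocks), -1 on a sub-page overflow
 * that the caller must escalate (sub-page/page move, possibly OS trap). */
static int on_writeback(struct subpage *sp, int b, uint8_t nc) {
    uint8_t old = sp->cls[b];
    if (nc == old) return 0;            /* same class: write in place */
    sp->cls[b] = nc;
    if (used_bytes(sp) > sp->capacity) {/* sub-page overflow */
        sp->cls[b] = old;
        return -1;
    }
    /* Blocks b+1.. shift by the size delta; the off-chip DMA engine does
     * this, and the processor stalls if it touches data in flight. */
    printf("DMA: shift blocks %d..%d by %+d bytes\n", b + 1, NBLOCKS - 1,
           (int)kBlockSize[nc] - (int)kBlockSize[old]);
    return 0;
}
```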

Putting it all together
[Figure: core with L1 and L2 caches and the on-chip BST and address calculator; a compressor/decompressor and a DMA engine sit in front of a main memory laid out as pages (Page 0) divided into sub-pages (sub-page 0, sub-page 1).]

Experimental Methodology
Key issues to experimentally evaluate:
- Compressibility, and the impact of fragmentation
- Performance losses for the proposed approach

Architectural Parameters Instr. Issue4-w ooo Exec units4 int, 2 int mul/div Branch pred.16-k entr. gshare, 2k BTB L1 I-cache32 k, 2-w, 2-cycle L1 D-cache32 k, 4-w, 2-cycle L2 cache512k/thread, 8-w, 16 cycles Memory latency150 cycles 1 Block lock-out4000 cycles 2 Subpage lock-out23000 cycles 2 Page lock-out23000 cycles 2 Predefined sizes Block Subpage Page Loads to a block only containing zeros can retire without accessing memory!

Benchmarks
SPEC2000 ran with the reference set; SAP and SpecJBB ran 4 billion instructions per thread.
- SpecInt2000: Bzip, Gap, Gcc, Gzip, Mcf, Parser, Perlbmk, Twolf, Vpr
- SpecFP2000: Ammp, Art, Equake, Mesa
- Server: SAP S&D, SpecJBB

Fragmentation - Results

Detailed Performance Results

Conclusion
It is possible to free up significant amounts of memory resources with virtually zero performance overhead. This was achieved by:
- exploiting zero-valued bytes, which account for as much as 55% of the memory contents
- leveraging a fast compression/decompression scheme
- a fast translation mechanism
- a hierarchical memory layout which offers some slack at the block, sub-page, and page level
Overall, 30% of memory could be freed up at an average performance loss of 0.2%.