Linearly Compressed Pages: A Main Memory Compression Framework with Low Complexity and Low Latency
Gennady Pekhimenko
Advisers: Todd C. Mowry & Onur Mutlu

Executive Summary
Main memory is a limited shared resource.
Observation: significant data redundancy in memory.
Idea: compress data in main memory.
Problem: how to avoid an increase in access latency?
Solution: Linearly Compressed Pages (LCP), compression at cache-line granularity with a fixed compressed line size per page. LCP:
1. Increases capacity (69% on average)
2. Decreases bandwidth consumption (46%)
3. Improves overall performance (9.5%)

Challenges in Main Memory Compression
1. Address computation
2. Mapping and fragmentation
3. Physically tagged caches

Address Computation
In an uncompressed page, the address offset of cache line Li is simply i * 64 (0, 64, 128, ..., (N-1)*64). In a compressed page, cache lines have variable sizes, so the offset of each line is unknown without additional computation.
[Figure: byte offsets of cache lines L0 ... LN-1 in an uncompressed page versus unknown offsets ("?") in a compressed page]
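
The contrast can be made concrete with a minimal C sketch (illustrative only; names are ours, not from the talk): with fixed 64B lines the offset is one multiply, while variable-size compressed lines force a serial walk over all preceding line sizes.

```c
#include <stddef.h>
#include <stdint.h>

#define LINE_SIZE 64u

/* Uncompressed page: the offset of line i is a single multiply. */
static inline size_t uncompressed_offset(size_t i) {
    return i * LINE_SIZE;
}

/* Conventionally compressed page: lines have variable sizes, so
 * locating line i requires summing the sizes of all earlier lines,
 * a serial dependence that is expensive to resolve in hardware. */
static size_t compressed_offset(const uint8_t line_sizes[], size_t i) {
    size_t offset = 0;
    for (size_t j = 0; j < i; j++)
        offset += line_sizes[j];
    return offset;
}
```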

Mapping and Fragmentation
[Figure: a 4kB virtual page is translated from a virtual address to a physical address, but the corresponding physical page has an unknown (compressed) size, leading to fragmentation of physical memory]

Physically Tagged Caches
[Figure: the core issues a virtual address; address translation through the TLB produces the physical address used to tag L2 cache lines, so translation sits on the critical path of cache access]

Shortcomings of Prior Work
[Table comparing prior compression mechanisms on access latency, decompression complexity, and compression ratio]
- IBM MXT [IBM J.R.D. '01]
- Robust Main Memory Compression [ISCA '05]
- LCP: Our Proposal

Linearly Compressed Pages (LCP): Key Idea
[Figure: a 4kB uncompressed page (64 cache lines x 64B) is compressed 4:1 into 1kB of compressed data; every compressible line occupies the same fixed compressed size, a 64B metadata entry (M) tracks per-line state, and incompressible lines are placed in a separate exception storage region (E)]
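
Because every compressible line within an LCP page occupies the same fixed compressed size, locating a line is again a single multiply; incompressible lines are fetched from the exception region instead. A minimal sketch, assuming hypothetical field names and layout (the slides do not give an exact encoding):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-page LCP state; field names and widths are
 * illustrative assumptions, not the paper's exact encoding. */
struct lcp_page {
    uint64_t comp_base;          /* base of the compressed-data region      */
    uint32_t comp_line_size;     /* fixed compressed line size (16B at 4:1) */
    uint64_t exception_base;     /* base of the exception storage region    */
    uint8_t  exception_slot[64]; /* per-line slot index, if incompressible  */
    bool     compressible[64];   /* from the 64B metadata entry (M)         */
};

/* Locating cache line i is O(1) either way: a multiply into the
 * compressed region, or an indexed lookup into exception storage. */
static uint64_t lcp_line_addr(const struct lcp_page *p, size_t i) {
    if (p->compressible[i])
        return p->comp_base + (uint64_t)i * p->comp_line_size;
    return p->exception_base + (uint64_t)p->exception_slot[i] * 64u;
}
```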

LCP Overview
- Page table entry extension: compression type and size; extended physical base address
- Operating system management support: 4 memory pools (512B, 1kB, 2kB, 4kB)
- Changes to cache tagging logic: physical page base address + cache line index (within a page)
- Handling page overflows
- Compression algorithms: BDI [PACT'12], FPC [ISCA'04]
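
One way to picture the page table entry extension is as a handful of extra PTE bits; the field widths below are assumptions for illustration, not the paper's exact encoding:

```c
#include <stdint.h>

/* Sketch of an LCP-extended page table entry. Widths are illustrative. */
struct lcp_pte {
    uint64_t phys_base : 40;  /* extended physical base address          */
    uint64_t comp_type :  2;  /* compression algorithm, e.g. BDI or FPC  */
    uint64_t pool_size :  2;  /* one of the 4 OS pools: 512B/1kB/2kB/4kB */
    uint64_t reserved  : 20;
};
```

The cache tagging change then follows directly: a line's physical tag is formed from phys_base plus the cache line index within the page.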

LCP Optimizations
- Metadata cache: avoids additional memory requests for LCP metadata
- Memory bandwidth reduction: 1 transfer of a compressed line instead of 4 separate 64B transfers
- Zero pages and zero cache lines: handled separately with 1 bit in the TLB (per page) and 1 bit per cache line in the metadata
- Integration with cache compression: BDI and FPC
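
A sketch of how the zero bits cut traffic (hypothetical names; the real logic lives in the memory controller): if either the TLB's per-page zero bit or the metadata's per-line zero bit is set, the request is served without any DRAM transfer.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative only: serve a 64B line from the zero fast path when
 * either the page-level TLB bit or the per-line metadata bit is set. */
static bool serve_zero_fast_path(bool tlb_zero_page,
                                 uint64_t line_zero_bits, /* 1 bit per line */
                                 size_t line_idx,
                                 uint8_t out[64]) {
    if (tlb_zero_page || ((line_zero_bits >> line_idx) & 1u)) {
        memset(out, 0, 64);  /* no DRAM access, no decompression */
        return true;
    }
    return false;            /* fall through to the normal LCP path */
}
```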

Methodology
- Simulators: x86 event-driven; Simics-based [Magnusson+, Computer'02] for CPU, Multi2Sim [Ubal+, PACT'12] for GPU
- Workloads: SPEC2006 benchmarks, TPC, Apache web server, GPGPU applications
- System parameters: L1/L2/L3 cache latencies from CACTI [Thoziyoor+, ISCA'08]; 512kB - 16MB L2; simple memory model

Compression Ratio Comparison
[Figure: compression ratios for SPEC2006, database, and web workloads with a 2MB L2 cache]
LCP-based frameworks achieve average compression ratios competitive with prior work.

Bandwidth Consumption Decrease
[Figure: BPKI (bandwidth per kilo-instruction; lower is better) for SPEC2006, database, and web workloads with a 2MB L2 cache]
LCP frameworks significantly reduce bandwidth consumption (46% on average).
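
For reference, one plausible reading of the BPKI metric (an assumption; the slide only expands the acronym) is bytes moved over the memory bus per thousand retired instructions:

```c
/* BPKI: memory-bus traffic normalized by work done; lower is better. */
static double bpki(double bytes_transferred, double instructions_retired) {
    return bytes_transferred / (instructions_retired / 1000.0);
}
```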

Performance Improvement

Cores   LCP-BDI   (BDI, LCP-BDI)   (BDI, LCP-BDI+FPC-fixed)
1       6.1%      9.5%             9.3%
2       13.9%     23.7%            23.6%
4       10.7%     22.6%            22.5%

LCP frameworks significantly improve performance.

Conclusion
A new main memory compression framework: LCP (Linearly Compressed Pages).
Key idea: a fixed size for compressed cache lines within a page and a fixed compression algorithm per page.
LCP evaluation:
1. Increases capacity (69% on average)
2. Decreases bandwidth consumption (46%)
3. Improves overall performance (9.5%)
4. Decreases off-chip bus energy (37%)

Linearly Compressed Pages: A Main Memory Compression Framework with Low Complexity and Low Latency
Gennady Pekhimenko
Advisers: Todd C. Mowry & Onur Mutlu