A Row Buffer Locality-Aware Caching Policy for Hybrid Memories
HanBin Yoon, Justin Meza, Rachata Ausavarungnirun, Rachael Harding, Onur Mutlu
Overview
Emerging memories such as PCM offer higher density than DRAM, but have drawbacks.
Hybrid memories aim to achieve the best of both.
We identify row buffer locality (RBL) as a key criterion for data placement.
– We develop a policy that caches to DRAM rows with low RBL and high reuse.
Result: 50% performance improvement over all-PCM memory; within 23% of all-DRAM memory performance.
Demand for Memory Capacity
Increasing cores and thread contexts
– Intel Sandy Bridge: 8 cores (16 threads)
– AMD Abu Dhabi: 16 cores
– IBM POWER7: 8 cores (32 threads)
– Sun T4: 8 cores (64 threads)
Modern data-intensive applications operate on huge datasets
Emerging High Density Memory
DRAM density scaling is becoming costly.
Phase change memory (PCM)
+ Projected 3−12× denser than DRAM [1]
However, PCM cannot simply replace DRAM:
− Higher access latency (4−12× DRAM) [2]
− Higher access energy (2−40× DRAM) [2]
− Limited write endurance (~10^8 writes) [2]
Use DRAM as a cache to PCM memory [3]
[1 Mohan HPTS’09; 2 Lee+ ISCA’09; 3 Qureshi+ ISCA’09]
Hybrid Memory
Benefits from both DRAM and PCM
– DRAM: Low latency, high endurance
– PCM: High capacity
Key question: Where to place data between these heterogeneous devices?
To help answer this question, let’s take a closer look at these technologies.
Hybrid Memory: A Closer Look
[Diagram: CPU and memory controller (MC) connected over a memory channel to DRAM (small capacity cache) and PCM (large capacity memory); each device contains banks with row buffers]
Row Buffers and Latency
Memory cells are organized in columns and rows.
Row buffers store the last accessed row.
– Hit: Access data from the row buffer → fast
– Miss: Access data from the cell array → slow
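As a minimal sketch, the hit/miss distinction can be modeled as follows; the latency values are illustrative placeholders, not the actual DRAM/PCM timings from the paper:

```python
# Minimal model of a memory bank with a row buffer.
# Latency constants are illustrative, not the paper's exact timings.

class Bank:
    def __init__(self, hit_latency, miss_latency):
        self.open_row = None          # row currently held in the row buffer
        self.hit_latency = hit_latency
        self.miss_latency = miss_latency

    def access(self, row):
        """Return the latency of accessing 'row' and update the row buffer."""
        if row == self.open_row:
            return self.hit_latency   # row buffer hit: served from the buffer
        self.open_row = row           # row buffer miss: fetch row from the cell array
        return self.miss_latency

# Row buffer hits cost roughly the same in DRAM and PCM;
# misses are far more expensive in PCM.
dram_bank = Bank(hit_latency=40, miss_latency=80)
pcm_bank  = Bank(hit_latency=40, miss_latency=300)
```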
Key Observation
Row buffers exist in both DRAM and PCM.
– Row buffer hit latency is similar in DRAM and PCM [2]
– Row buffer miss latency is small in DRAM
– Row buffer miss latency is large in PCM
Place in DRAM data which:
– Frequently miss in the row buffer (low row buffer locality) → miss penalty is smaller in DRAM
– Are reused many times → data worth the caching effort (contention in memory channel and DRAM)
[2 Lee+ ISCA’09]
Data Placement Implications
Let’s say a processor accesses four rows: Row A, Row B, Row C, Row D.
Data Placement Implications
Let’s say a processor accesses four rows with different row buffer localities (RBL):
– Row A, Row B: Low RBL (frequently miss in row buffer)
– Row C, Row D: High RBL (frequently hit in row buffer)
RBL-Unaware Policy
A row buffer locality-unaware policy could place these rows in the following manner:
– DRAM: Row C, Row D (high RBL)
– PCM: Row A, Row B (low RBL)
RBL-Unaware Policy
Access pattern to main memory: A (oldest), B, C, C, C, A, B, D, D, D, A, B (youngest)
[Timeline: DRAM services rows C and D; PCM services rows A and B, which keep missing in the row buffer]
Stall time: 6 PCM device accesses
RBL-Aware Policy
A row buffer locality-aware policy would place these rows in the following manner:
– DRAM: Row A, Row B (low RBL) → access data at the lower row buffer miss latency of DRAM
– PCM: Row C, Row D (high RBL) → access data at the low row buffer hit latency of PCM
RBL-Aware Policy
Access pattern to main memory: A (oldest), B, C, C, C, A, B, D, D, D, A, B (youngest)
[Timeline: DRAM services rows A and B; PCM services rows C and D, mostly as row buffer hits]
Stall time: 6 DRAM device accesses → saved cycles
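A small self-contained sketch of this scenario, reusing the illustrative latencies from the earlier row buffer model and assuming a single row buffer per device (a simplification): the six accesses to the low-RBL rows A and B pay PCM’s large miss latency under the RBL-unaware placement, but only DRAM’s smaller miss latency under the RBL-aware one.

```python
# Illustrative latencies (not the paper's exact timings).
HIT, DRAM_MISS, PCM_MISS = 40, 80, 300

accesses = ["A", "B", "C", "C", "C", "A", "B", "D", "D", "D", "A", "B"]

def total_latency(placement):
    """Sum access latencies given a row -> ("DRAM"|"PCM") placement,
    with one row buffer per device (a simplification)."""
    open_row = {"DRAM": None, "PCM": None}
    total = 0
    for row in accesses:
        device = placement[row]
        if open_row[device] == row:
            total += HIT
        else:
            total += DRAM_MISS if device == "DRAM" else PCM_MISS
            open_row[device] = row
    return total

rbl_unaware = {"A": "PCM", "B": "PCM", "C": "DRAM", "D": "DRAM"}
rbl_aware   = {"A": "DRAM", "B": "DRAM", "C": "PCM", "D": "PCM"}

print(total_latency(rbl_unaware))  # low-RBL rows A and B keep missing in PCM
print(total_latency(rbl_aware))    # those misses now pay DRAM's smaller penalty
```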
Our Mechanism: DynRBLA
1. For a subset of recently used rows in PCM:
– Count row buffer misses as an indicator of row buffer locality (RBL)
2. Cache to DRAM rows with misses ≥ threshold
– Row buffer miss counts are periodically reset (only cache rows with high reuse)
3. Dynamically adjust the threshold to adapt to workload/system characteristics
– Interval-based cost-benefit analysis
(A policy sketch follows below.)
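A minimal sketch of steps 1–2 plus a simplified version of step 3’s threshold adjustment; the class and variable names are illustrative, the exact cost-benefit computation from the paper is not reproduced, and the real mechanism tracks only a subset of recently used PCM rows:

```python
# Sketch of an RBLA-style caching policy (names and structure are
# illustrative; this is not the authors' exact implementation).

class RBLACache:
    def __init__(self, threshold=2, reset_interval=10_000):
        self.threshold = threshold          # misses needed before caching a row
        self.reset_interval = reset_interval
        self.miss_counts = {}               # PCM row -> recent row buffer misses
        self.accesses = 0

    def on_pcm_row_buffer_miss(self, row):
        """Called on each PCM row buffer miss; returns True if the row
        should be migrated (cached) to DRAM."""
        self.accesses += 1
        self.miss_counts[row] = self.miss_counts.get(row, 0) + 1
        cache_it = self.miss_counts[row] >= self.threshold

        # Periodic reset: only rows that accumulate enough misses within one
        # interval (i.e., low RBL and high reuse) ever reach the threshold.
        if self.accesses % self.reset_interval == 0:
            self.miss_counts.clear()

        return cache_it

    def adjust_threshold(self, benefit_cycles, cost_cycles):
        """Simplified stand-in for DynRBLA's interval-based cost-benefit
        analysis: cache more aggressively when recent migrations saved more
        cycles than they cost, and more conservatively otherwise."""
        if benefit_cycles > cost_cycles:
            self.threshold = max(1, self.threshold - 1)
        else:
            self.threshold += 1
```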
Evaluation Methodology
Cycle-level x86 CPU-memory simulator
– CPU: 16 out-of-order cores, 32KB private L1 per core, 512KB shared L2 per core
– Memory: DDR MT/s, 1GB DRAM, 16GB PCM, 1KB migration granularity
Multi-programmed server and cloud workloads
– Server (18 workloads): TPC-C, TPC-H
– Cloud (18 workloads): Webserver, Video, TPC-C/H
Comparison Points and Metrics
DynRBLA: Adaptive RBL-aware caching
RBLA: Row buffer locality-aware caching
Freq [4]: Frequency-based caching
DynFreq: Adaptive frequency-based caching
Weighted speedup (performance) = sum of each application’s speedup versus running alone
Maximum slowdown (fairness) = highest slowdown experienced by any thread
[4 Jiang+ HPCA’10]
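Expressed in code (using the standard definitions these slides describe; the IPC variable names are illustrative):

```python
# ipc_shared[i]: IPC of application i when all applications run together
# ipc_alone[i]:  IPC of application i when it runs alone on the system

def weighted_speedup(ipc_shared, ipc_alone):
    # Higher is better: sums each application's speedup relative to running alone.
    return sum(s / a for s, a in zip(ipc_shared, ipc_alone))

def maximum_slowdown(ipc_shared, ipc_alone):
    # Lower is better: slowdown of the most-penalized application.
    return max(a / s for s, a in zip(ipc_shared, ipc_alone))
```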
Performance
[Chart: Weighted speedup (normalized)]
Benefit 1: Reduced memory bandwidth consumption due to stricter caching criteria
Benefit 2: Increased row buffer locality (RBL) in PCM by moving low RBL data to DRAM
Fairness
[Chart: Maximum slowdown (normalized)]
Lower contention for the row buffer in PCM and for the memory channel
Energy Efficiency
[Chart: Performance per Watt (normalized)]
Increased performance and reduced data movement between DRAM and PCM
Compared to All-PCM/DRAM
[Charts: Weighted speedup (normalized), Maximum slowdown (normalized)]
Our mechanism achieves 50% better performance than all-PCM, and comes within 23% of all-DRAM performance
Conclusion
Demand for huge main memory capacity
– PCM offers greater density than DRAM
– Hybrid memories achieve the best of both
We identify row buffer locality (RBL) as a key metric for caching decisions.
We develop a policy that caches to DRAM rows with low RBL and high reuse.
This enables high-performance, energy-efficient hybrid main memories.
Thank you! Questions?