Quantifying and Controlling Impact of Interference at Shared Caches and Main Memory. Lavanya Subramanian, Vivek Seshadri, Arnab Ghosh, Samira Khan, Onur Mutlu.

Presentation transcript:

Quantifying and Controlling Impact of Interference at Shared Caches and Main Memory – Application Slowdown Model. Lavanya Subramanian, Vivek Seshadri, Arnab Ghosh, Samira Khan, Onur Mutlu.

Problem: Interference at Shared Resources. (Diagram: multiple cores sharing a cache and main memory.)

Impact of Shared Resource Interference: 1. High application slowdowns. 2. Unpredictable application slowdowns. (Charts: slowdowns of gcc and mcf when run together.) Our Goal: Achieve High and Predictable Performance.

Outline. 1. Quantify Impact of Interference (Slowdown) – Key Observation – Estimating Cache Access Rate Alone – ASM: Putting it All Together – Evaluation. 2. Control Slowdown – Slowdown-aware Cache Capacity Allocation – Slowdown-aware Memory Bandwidth Allocation – Coordinated Cache/Memory Management.

Quantifying Impact of Shared Resource Interference. (Diagram: execution time when run Alone, with no interference, vs. Shared, with interference; the increase in execution time is the impact of interference.)

Slowdown: Definition.
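Written out (a reconstruction of the standard definition used in this line of work, not necessarily the slide's exact equation):

    \[
      \text{Slowdown} \;=\; \frac{\text{Performance}_{\text{Alone}}}{\text{Performance}_{\text{Shared}}}
      \;=\; \frac{\text{Execution Time}_{\text{Shared}}}{\text{Execution Time}_{\text{Alone}}}
    \]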

Approach: Impact of Interference on Performance. (Diagram: Alone vs. Shared execution timelines.) Previous Approach: Estimate the impact of interference at a per-request granularity; this is difficult due to request overlap. Our Approach: Estimate the impact of interference aggregated over requests.

Outline. 1. Quantify Slowdown – Key Observation – Estimating Cache Access Rate Alone – ASM: Putting it All Together – Evaluation. 2. Control Slowdown – Slowdown-aware Cache Capacity Allocation – Slowdown-aware Memory Bandwidth Allocation – Coordinated Cache/Memory Management.

Observation: Shared Cache Access Rate is a Proxy for Performance. Performance is proportional to shared cache access rate (measured on an Intel Core i5 with 4 cores). Shared Cache Access Rate Shared is easy to measure directly; Shared Cache Access Rate Alone is difficult to obtain while the application runs alongside others.

Outline. 1. Quantify Slowdown – Key Observation – Estimating Cache Access Rate Alone – ASM: Putting it All Together – Evaluation. 2. Control Slowdown – Slowdown-aware Cache Capacity Allocation – Slowdown-aware Memory Bandwidth Allocation – Coordinated Cache/Memory Management.

Estimating Cache Access Rate Alone. (Diagram: core, shared cache, main memory.) Challenge 1: Main memory bandwidth interference. Challenge 2: Shared cache capacity interference.

Estimating Cache Access Rate Alone. Challenge 1: Main memory bandwidth interference.

Highest Priority Minimizes Memory Bandwidth Interference. The impact of main memory interference can be minimized by giving the application highest priority at the memory controller (Subramanian et al., HPCA 2013). Highest priority means little interference: the application runs almost as if it were alone. 1. Highest priority minimizes interference. 2. It enables estimation of miss service time (used to account for shared cache interference).
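A minimal sketch of this idea (Python, with illustrative names and an assumed epoch length, not the authors' implementation): rotate highest priority at the memory controller among the applications, so that each one's shared-cache access rate can be sampled during its own high-priority epochs, when it behaves almost as if it were running alone.

    EPOCH_CYCLES = 10_000  # hypothetical epoch length, not a value from the paper

    class HighPriorityEpochScheduler:
        """Rotates highest memory-controller priority among applications."""
        def __init__(self, app_ids):
            self.app_ids = list(app_ids)
            self.current = 0          # index of the app currently given highest priority
            self.cycles_in_epoch = 0

        def tick(self):
            """Advance one cycle; move to the next application when the epoch ends."""
            self.cycles_in_epoch += 1
            if self.cycles_in_epoch >= EPOCH_CYCLES:
                self.cycles_in_epoch = 0
                self.current = (self.current + 1) % len(self.app_ids)

        def highest_priority_app(self):
            return self.app_ids[self.current]

Counters used for the CAR Alone estimate would then be updated only on cycles in which highest_priority_app() matches the application issuing the access.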

Estimating Cache Access Rate Alone. Challenge 2: Shared cache capacity interference.

Cache Capacity Contention. (Diagram: cores contending for the shared cache.) Applications evict each other's blocks from the shared cache.

Shared Cache Interference is Hard to Minimize Through Priority. (Diagram: a prioritized core still finds the shared cache occupied by another core's blocks.) Priority at the memory controller takes effect instantly, but it takes a long time for a core to benefit from priority at the shared cache: a long warmup period, with lots of interference to other applications in the meantime.

Our Approach: Quantify and Remove Cache Interference. 1. Quantify the impact of shared cache interference. 2. Remove the impact of shared cache interference from CAR Alone estimates.

1. Quantify Shared Cache Interference. (Diagram: an auxiliary tag store alongside the shared cache.) A block evicted because of another application's interference is still present in the auxiliary tag store; count the number of such contention misses.
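A minimal sketch of the counting idea (Python; the class names, the LRU stand-in for the auxiliary tag store, and the per-access interface are illustrative assumptions, not the paper's hardware, which would use set sampling):

    from collections import OrderedDict

    class AuxTagStore:
        """Tiny stand-in for a per-application auxiliary tag store: an LRU set of
        tags updated only with this application's own accesses."""
        def __init__(self, capacity):
            self.capacity = capacity
            self.tags = OrderedDict()

        def lookup(self, addr):
            return addr in self.tags

        def update(self, addr):
            self.tags[addr] = True
            self.tags.move_to_end(addr)
            if len(self.tags) > self.capacity:
                self.tags.popitem(last=False)   # evict the least recently used tag

    class ContentionMissCounter:
        def __init__(self, aux_tag_store):
            self.aux = aux_tag_store
            self.contention_misses = 0

        def on_shared_cache_access(self, addr, hit_in_shared_cache):
            # A shared-cache miss that would have hit when run alone (i.e., the
            # tag is still in the auxiliary tag store) is a contention miss
            # caused by other applications' interference.
            if not hit_in_shared_cache and self.aux.lookup(addr):
                self.contention_misses += 1
            self.aux.update(addr)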

2. Remove Cycles to Serve Contention Misses from CAR Alone Estimates. The contention miss count comes from the auxiliary tag store, and the average miss service time is measured while the application is given high priority. The resulting cache contention cycles are removed when estimating Cache Access Rate Alone (CAR Alone).

Accounting for Memory and Shared Cache Interference.
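A reconstruction of the accounting this slide summarizes (an approximation consistent with the surrounding slides, not necessarily the slide's exact equation): the access and cycle counts come from the application's high-priority epochs, and the cycles spent serving contention misses are subtracted out.

    \[
      \mathit{CAR}_{\text{Alone}} \;\approx\;
      \frac{\text{\# shared-cache accesses during high-priority epochs}}
           {\text{\# high-priority cycles} \;-\;
            \text{\# contention misses} \times \text{average miss service time}}
    \]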

Outline. 1. Quantify Slowdown – Key Observation – Estimating Cache Access Rate Alone – ASM: Putting it All Together – Evaluation. 2. Control Slowdown – Slowdown-aware Cache Capacity Allocation – Slowdown-aware Memory Bandwidth Allocation – Coordinated Cache/Memory Management.

Application Slowdown Model (ASM).
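In equation form (reconstructed from the key observation that performance is proportional to shared cache access rate, not necessarily the slide's exact wording):

    \[
      \text{Slowdown} \;=\; \frac{\text{Performance}_{\text{Alone}}}{\text{Performance}_{\text{Shared}}}
      \;\approx\; \frac{\mathit{CAR}_{\text{Alone}}}{\mathit{CAR}_{\text{Shared}}}
    \]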

ASM: Interval-Based Operation. (Timeline: execution is divided into intervals; within each interval, CAR Shared is measured and CAR Alone is estimated, and a slowdown estimate is produced at the end of the interval.)
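A minimal end-of-interval computation sketch (Python, simplified, with illustrative parameter names; it only combines the quantities introduced on the previous slides):

    def estimate_slowdown(shared_accesses, interval_cycles,
                          high_prio_accesses, high_prio_cycles,
                          contention_misses, avg_miss_service_time):
        """End-of-interval slowdown estimate (simplified)."""
        # CAR Shared: measured directly while the application shares the system.
        car_shared = shared_accesses / interval_cycles
        # CAR Alone: estimated from high-priority epochs, with cycles spent
        # serving contention misses removed (the cache-interference correction).
        alone_cycles = high_prio_cycles - contention_misses * avg_miss_service_time
        car_alone = high_prio_accesses / max(alone_cycles, 1)
        return car_alone / max(car_shared, 1e-12)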

A More Accurate and Simple Model. More accurate: takes into account request overlap behavior – implicit through aggregate estimation of cache access rate and miss service time – unlike prior works that estimate per-request interference. Simpler hardware: amenable to set sampling in the auxiliary tag store – only the contention miss count needs to be measured – unlike prior works that need to know whether each request is a contention miss or not.

Outline. 1. Quantify Slowdown – Key Observation – Estimating Cache Access Rate Alone – ASM: Putting it All Together – Evaluation. 2. Control Slowdown – Slowdown-aware Cache Capacity Allocation – Slowdown-aware Memory Bandwidth Allocation.

Previous Work on Slowdown Estimation – STFM (Stall Time Fair Memory) Scheduling [Mutlu et al., MICRO '07] – FST (Fairness via Source Throttling) [Ebrahimi et al., ASPLOS '10] – Per-thread Cycle Accounting [Du Bois et al., HiPEAC '13]. Basic Idea: Count interference cycles experienced by each request.

Methodology. Configuration of our simulated system – 4 cores – 1 channel, 8 banks/channel – DDR DRAM – 2MB shared cache. Workloads – SPEC CPU2006 and NAS – 100 multiprogrammed workloads.

Model Accuracy Results. (Chart: slowdown estimation error for select applications.) Average error of ASM's slowdown estimates: 10%. Previous models have 29%/40% average error.

Outline. 1. Quantify Slowdown – Key Observation – Estimating Cache Access Rate Alone – ASM: Putting it All Together – Evaluation. 2. Control Slowdown – Slowdown-aware Cache Capacity Allocation – Slowdown-aware Memory Bandwidth Allocation – Coordinated Cache/Memory Management.

Outline. 1. Quantify Slowdown – Key Observation – Estimating Cache Access Rate Alone – ASM: Putting it All Together – Evaluation. 2. Control Slowdown – Slowdown-aware Cache Capacity Allocation – Slowdown-aware Memory Bandwidth Allocation – Coordinated Cache/Memory Management.

Cache Capacity Partitioning. (Diagram: cores sharing the cache and main memory.) Previous partitioning schemes mainly focus on miss count reduction. Problem: Miss count reduction does not directly translate to performance and slowdowns.

ASM-Cache: Slowdown-aware Cache Capacity Partitioning. Goal: Achieve high fairness and performance through slowdown-aware cache partitioning. Key Idea: Allocate more cache space to applications whose slowdowns reduce the most with more cache space.
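A minimal sketch of this key idea as a greedy way-allocation loop (Python; the function names and the slowdown-vs-ways estimator are illustrative assumptions, not the paper's exact algorithm):

    def partition_ways(apps, total_ways, estimated_slowdown):
        """estimated_slowdown(app, ways) -> predicted slowdown of `app` with `ways` ways."""
        alloc = {app: 1 for app in apps}          # start by giving every application one way
        remaining = total_ways - len(apps)
        while remaining > 0:
            # Give the next way to the application whose slowdown drops the most.
            best = max(apps, key=lambda a: estimated_slowdown(a, alloc[a]) -
                                           estimated_slowdown(a, alloc[a] + 1))
            alloc[best] += 1
            remaining -= 1
        return alloc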

Outline. 1. Quantify Slowdown – Key Observation – Estimating Cache Access Rate Alone – ASM: Putting it All Together – Evaluation. 2. Control Slowdown – Slowdown-aware Cache Capacity Allocation – Slowdown-aware Memory Bandwidth Allocation – Coordinated Cache/Memory Management.

Memory Bandwidth Partitioning. (Diagram: cores sharing the cache and main memory.) Goal: Achieve high fairness and performance through slowdown-aware bandwidth partitioning.

ASM-Mem: Slowdown-aware Memory Bandwidth Partitioning. Key Idea: Prioritize an application proportionally to its slowdown; application i's requests are prioritized at the memory controller for a fraction of time determined by its slowdown.
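The natural formulation of this key idea (a reconstruction, not necessarily the slide's exact expression):

    \[
      \text{High-priority fraction}_i \;=\; \frac{\text{Slowdown}_i}{\sum_j \text{Slowdown}_j}
    \]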

Outline. 1. Quantify Slowdown – Key Observation – Estimating Cache Access Rate Alone – ASM: Putting it All Together – Evaluation. 2. Control Slowdown – Slowdown-aware Cache Capacity Allocation – Slowdown-aware Memory Bandwidth Allocation – Coordinated Cache/Memory Management.

Coordinated Resource Allocation Schemes. (Diagram: cache capacity-aware bandwidth allocation across cores, shared cache, and main memory.) 1. Employ ASM-Cache to partition cache capacity. 2. Drive ASM-Mem with slowdowns from ASM-Cache.
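A minimal sketch of the coordination (Python, illustrative names): the cache is partitioned first, and the slowdowns ASM estimates under that partitioning drive the bandwidth shares.

    def coordinated_bandwidth_shares(slowdown_under_cache_partition):
        """slowdown_under_cache_partition: dict mapping each application to the
        slowdown ASM estimates after ASM-Cache has partitioned the shared cache."""
        total = sum(slowdown_under_cache_partition.values())
        # Each application's share of memory bandwidth is proportional to its slowdown.
        return {app: s / total for app, s in slowdown_under_cache_partition.items()}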

Fairness and Performance Results. (Charts: fairness and performance across 100 multiprogrammed workloads on the simulated multi-core system.) 14%/8% unfairness reduction on 1-/2-channel systems compared to PARBS+UCP, with similar performance.

Other Results in the Paper. Distribution of slowdown estimation error. Sensitivity to system parameters – core count, memory channel count, cache size. Sensitivity to model parameters. Impact of prefetching. A case study showing ASM's potential for providing slowdown guarantees.

Summary. Problem: Uncontrolled memory interference causes high and unpredictable application slowdowns. Goal: Quantify and control slowdowns. Key Contribution: ASM, an accurate slowdown estimation model with an average error of 10%. Key Ideas: shared cache access rate is a proxy for performance; Cache Access Rate Alone can be estimated by minimizing memory interference and quantifying cache interference. Applications of Our Model: slowdown-aware cache and memory management to achieve high performance, fairness, and performance guarantees. Source Code Release by January.

Quantifying and Controlling Impact of Interference at Shared Caches and Main Memory – Application Slowdown Model. Lavanya Subramanian, Vivek Seshadri, Arnab Ghosh, Samira Khan, Onur Mutlu.