Reducing Memory Interference in Multicore Systems
Lavanya Subramanian, Department of ECE
11/04/2011
Hello. My name is Lavanya Subramanian. Today, I am going to talk about Application-Aware Memory Channel Partitioning.
Main Memory is a Bottleneck
[Slide diagram: multiple cores accessing main memory over a shared channel.]
Main memory latency is long; stalling on it reduces core performance. In a multicore system, applications running on multiple cores share the main memory.
Problem of Inter-Application Interference
[Slide diagram: requests from multiple cores contending for the memory channel.]
Applications' requests interfere at the main memory, and this inter-application interference degrades system performance. The problem is further exacerbated by fast-growing core counts and limited off-chip pin bandwidth.
Talk Summary
Goal: Address the problem of inter-application interference at main memory, with the aim of improving performance.
Outline of this talk: Background/Motivation, Previous Approaches, Our Approach.
The goal of this talk is to motivate, describe, and address the problem of inter-application interference at main memory. I shall give some background on main memory organization and operation, describe previous approaches and their shortcomings, and explain why we need our approach: memory channel partitioning.
Background: Main Memory Organization
DRAM Main Memory Organization
[Slide diagram: a core connected to DRAM banks through a channel.]
The processor accesses the off-chip DRAM main memory through one or more channels; here, I show a single channel. The smallest accessible unit within a channel is a bank. There are other levels in the hierarchy, such as ranks and DIMMs, which I shall not go into in detail. Accesses to multiple banks can proceed in parallel, but only one bank can send data on the channel at the same time.
DRAM Organization: Bank Organization
[Slide diagram: a bank as a 2D cell array, with row decoder, row buffer, and column mux; each row is 4 KB, each column 8 bytes.]
Each bank is a 2D array of DRAM cells. The x dimension is a row; a row is organized as several columns.
DRAM Organization: Accessing Data
[Slide diagram: a row of columns A–F, with the required piece of data highlighted.]
Now, I want to access the highlighted piece of data.
DRAM Organization: The Row Buffer
[Slide diagram: the row A–F destructively read from the array into the row buffer, and the required data sent onto the channel.]
The entire row is read from the array into the row buffer; the read is destructive. Then the required piece of data is read from the row buffer and sent on the channel. The data of that row is now present in the row buffer.
DRAM Organization: Row Hit
[Slide diagram: a second column of the same row served directly from the row buffer.]
A subsequent access to another column of data from the same row is serviced from the row buffer; an array access is not required. This is called a row hit.
DRAM Organization: Row Miss
[Slide diagram: (1) the row-buffer contents written back to the array, (2) the new row destructively read into the row buffer.]
On the other hand, if a subsequent access is to data in another row, (i) the row-buffer contents have to be written back into the array, and (ii) the new row is read into the row buffer. This is called a row-buffer miss.
Row miss latency = 2 x row hit latency.
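The row-buffer behavior above can be sketched as a tiny model of a single bank. The latency constants are illustrative placeholders chosen only to preserve the 2x relationship between a miss and a hit:

```python
# Minimal sketch of row-buffer behavior in one DRAM bank.
# Latency values are placeholders preserving: row miss = 2 x row hit.
ROW_HIT_LATENCY = 1   # array access not required: serve from row buffer
ROW_MISS_LATENCY = 2  # write back old row + destructive read of new row

class Bank:
    def __init__(self):
        self.open_row = None  # row currently held in the row buffer

    def access(self, row):
        """Return the latency of accessing `row`, updating the row buffer."""
        if row == self.open_row:
            return ROW_HIT_LATENCY   # row hit: data already in row buffer
        self.open_row = row          # row miss: replace row-buffer contents
        return ROW_MISS_LATENCY

bank = Bank()
# Two accesses each to rows 0, then 1, then back to 0:
latencies = [bank.access(r) for r in [0, 0, 1, 1, 0]]  # [2, 1, 2, 1, 2]
```

Back-to-back accesses to the same row pay the miss latency only once; switching rows pays it again, which is exactly why schedulers try to exploit row hits.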
The Memory Controller
[Slide diagram: request buffer between the core and the DRAM banks.]
The memory controller is the medium between the core and the main memory. It buffers memory requests from the core in a request buffer, and re-orders and schedules requests to the main memory banks.
FR-FCFS (Rixner et al., ISCA'00)
[Slide diagram: service timelines contrasting FCFS with FR-FCFS on one bank.]
FR-FCFS exploits row hits to minimize overall DRAM access latency: it prioritizes row-hit requests over row-miss requests, and older requests over younger ones.
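The FR-FCFS selection rule can be sketched as follows, assuming a simplified request representation of (arrival order, row) for one bank:

```python
# Sketch of FR-FCFS request selection for a single bank.
# A request is a tuple (arrival_order, row) -- a simplification.
def fr_fcfs_pick(requests, open_row):
    """Pick the next request: row hits first, then oldest (FCFS)."""
    hits = [r for r in requests if r[1] == open_row]
    pool = hits if hits else requests   # prefer row-hit requests
    return min(pool, key=lambda r: r[0])  # oldest within the chosen pool

# An older request to row 5 loses to a younger row hit on open row 7:
queue = [(0, 5), (1, 7), (2, 7)]
fr_fcfs_pick(queue, open_row=7)  # -> (1, 7)
fr_fcfs_pick(queue, open_row=5)  # -> (0, 5)
```

This is the behavior the next slide exploits: a stream of row hits from one application can repeatedly win over another application's single request.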
Memory Scheduling in Multicore Systems
[Slide diagram: FR-FCFS service timeline for two applications sharing a bank.]
Application 2's single request starves behind three of Application 1's requests: the low memory-intensity application 2 starves behind application 1. Minimizing overall DRAM access latency != system performance.
Need for Application Awareness
The memory scheduler needs to be aware of application characteristics. Thread Cluster Memory (TCM) scheduling (Kim et al., MICRO'10) is the current best application-aware memory scheduling policy. TCM always prioritizes low memory-intensity applications and shuffles between high memory-intensity applications.
Strength: provides good system performance.
Shortcoming: high hardware complexity due to ranking and prioritization logic.
Modern Systems Have Multiple Channels
[Slide diagram: two memory controllers, each driving its own channel to memory.]
Allocation of data to channels is a new degree of freedom.
Interleaving Rows Across Channels
[Slide diagram: consecutive rows mapped to alternate channels.]
This enables parallelism in access to rows on different channels.
Interleaving Cache Lines Across Channels
[Slide diagram: consecutive cache lines mapped to alternate channels.]
This enables finer-grained parallelism at the cache-line granularity.
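The two interleaving schemes amount to two address-to-channel mappings. A sketch, assuming hypothetical sizes of 4 KB rows, 64-byte cache lines, and 2 channels:

```python
# Sketch of row- vs. cache-line interleaving across channels.
# Sizes below are assumptions for illustration, not from the talk.
NUM_CHANNELS = 2
ROW_BYTES = 4096   # 4 KB row (matches the bank-organization slide)
LINE_BYTES = 64    # typical cache-line size (assumed)

def row_interleaved_channel(addr):
    # Consecutive 4 KB rows go to alternating channels.
    return (addr // ROW_BYTES) % NUM_CHANNELS

def line_interleaved_channel(addr):
    # Consecutive 64-byte cache lines go to alternating channels.
    return (addr // LINE_BYTES) % NUM_CHANNELS

row_interleaved_channel(4096)    # -> 1 (second row, other channel)
line_interleaved_channel(64)     # -> 1 (second line, other channel)
```

Row interleaving keeps a whole row, and hence its row-buffer locality, on one channel; line interleaving spreads even a single row's lines across channels.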
Key Insight 1
High memory-intensity applications interfere with low memory-intensity applications in shared memory channels.
[Slide diagram: service timelines (time units 1–5) for App A on Core 0 and App B on Core 1, under conventional page mapping vs. channel partitioning.]
Solution: Map the data of low and high memory-intensity applications to different channels.
Key Insight 2
[Slide diagram: request buffer states and service orders (requests A–E, service slots 1–6) under conventional page mapping vs. channel partitioning.]
Mapping applications to separate channels also improves the service order: requests that would queue behind each other on a shared channel are serviced in parallel on different channels.
Memory Channel Partitioning (MCP)
1. Profile applications (hardware)
2. Classify applications into groups (system software)
3. Partition available channels between groups (system software)
4. Assign a preferred channel to each application (system software)
5. Allocate application pages to the preferred channel (system software)
Profile/Classify Applications
Profiling: collect last-level-cache Misses Per Kilo Instruction (MPKI) and row-buffer hit rate (RBH) of applications online.
Classification: if MPKI > MPKIt, the application is high intensity; otherwise, low intensity. Among high-intensity applications, if RBH > RBHt, the application has high row-buffer locality; otherwise, low row-buffer locality.
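The classification step can be sketched as a pair of threshold tests. The threshold values below are assumptions for illustration; MPKIt and RBHt are tunable parameters in the scheme:

```python
# Sketch of online application classification by MPKI and RBH.
# Threshold values are assumed for illustration, not from the talk.
MPKI_T = 10.0   # memory-intensity threshold (hypothetical value)
RBH_T = 0.5     # row-buffer hit-rate threshold (hypothetical value)

def classify(mpki, rbh):
    """Classify one application from its profiled MPKI and RBH."""
    if mpki <= MPKI_T:
        return "low-intensity"
    if rbh > RBH_T:
        return "high-intensity, high row-buffer locality"
    return "high-intensity, low row-buffer locality"

classify(0.5, 0.9)    # -> "low-intensity"
classify(20.0, 0.9)   # -> "high-intensity, high row-buffer locality"
```

Note that row-buffer locality only matters for the high-intensity group; low-intensity applications are not subdivided further.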
Partition Between Low- and High-Intensity Groups
[Slide diagram: channels 1–2 assigned to the low-intensity group, channels 3–4 to the high-intensity group.]
Channels are assigned proportional to the number of applications in each group.
Partition Between Low- and High-RBH Groups
[Slide diagram: channel 3 assigned to the high-intensity, low row-buffer-locality group; channel 4 to the high-intensity, high row-buffer-locality group.]
Within the high-intensity group, channels are assigned proportional to the bandwidth demand of each subgroup.
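Both partitioning steps are proportional splits with different weights: application count between the intensity groups, bandwidth demand within the high-intensity group. A sketch, with the minimum-one-channel-per-group guarantee as an assumed simplification:

```python
# Sketch of proportional channel partitioning between two groups.
# The "at least one channel per group" floor is an assumed simplification.
def split_channels(num_channels, weight_a, weight_b):
    """Assign channels to two groups proportionally to their weights."""
    a = round(num_channels * weight_a / (weight_a + weight_b))
    a = min(max(a, 1), num_channels - 1)  # each group gets >= 1 channel
    return a, num_channels - a

# Step 1 -- between intensity groups, weighted by application count:
split_channels(4, 2, 2)   # 2 low- and 2 high-intensity apps -> (2, 2)
# Step 2 -- within the high-intensity group, weighted by bandwidth demand
# (e.g., summed MPKI of the low-RBH vs. high-RBH subgroups):
split_channels(2, 1, 1)   # equal demand -> (1, 1)
```

The same helper serves both steps; only the meaning of the weights changes.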
Preferred Channel Assignment/Allocation
Load-balance each group's bandwidth demand across the group's allocated channels; each application now has a preferred channel. Pages are allocated to the preferred channel on first touch: the operating system assigns a page on the preferred channel if a free page is available; else, it uses a modified replacement policy to preferentially choose a replacement candidate from the preferred channel.
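The first-touch allocation policy can be sketched as follows. The per-channel free lists and the `evict_candidate` callback are hypothetical simplifications of the OS page allocator and the modified replacement policy:

```python
# Sketch of preferred-channel page allocation on first touch.
# Per-channel free lists and evict_candidate are assumed abstractions.
def allocate_page(preferred, free_lists, evict_candidate):
    """Return (channel, page frame): prefer a free page on `preferred`."""
    if free_lists[preferred]:
        return preferred, free_lists[preferred].pop()
    # No free page on the preferred channel: the modified replacement
    # policy preferentially picks a victim frame from that channel.
    return preferred, evict_candidate(preferred)

free = {0: [100, 101], 1: []}
allocate_page(0, free, evict_candidate=lambda ch: -1)  # free page on ch 0
allocate_page(1, free, evict_candidate=lambda ch: -1)  # must evict on ch 1
```

Because placement happens on first touch, no pages need to be migrated later; the partitioning takes effect as the application faults its pages in.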
Integrating Partitioning and Scheduling
[Slide diagram: memory scheduling and memory partitioning shown as two approaches to inter-application interference mitigation, combined into integrated memory partitioning and scheduling.]
Integrated Memory Partitioning and Scheduling (IMPS)
Applications with very low memory intensities (< 1 MPKI) do not need dedicated bandwidth; in fact, dedicating bandwidth to them results in wastage. These applications need short access latencies and interfere minimally with other applications.
Solution: always prioritize them in the scheduler, and handle the other applications via memory channel partitioning.
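The IMPS scheduling rule can be sketched on top of a plain oldest-first order; the request representation (arrival order, application id) is a simplification:

```python
# Sketch of the IMPS scheduler-side rule: requests from very-low-intensity
# applications (< 1 MPKI) always win; other applications are handled by
# MCP's channel partitioning and an ordinary scheduling order.
VERY_LOW_MPKI = 1.0

def imps_pick(requests, app_mpki):
    """requests: list of (arrival_order, app_id).
    Oldest very-low-intensity request wins; else oldest overall."""
    low = [r for r in requests if app_mpki[r[1]] < VERY_LOW_MPKI]
    pool = low if low else requests
    return min(pool, key=lambda r: r[0])

mpki = {"A": 0.2, "B": 25.0}          # A is very low intensity
queue = [(0, "B"), (1, "B"), (2, "A")]
imps_pick(queue, mpki)                 # -> (2, "A"): A jumps the queue
```

This keeps the hardware addition tiny (a single priority bit per request, conceptually), which is where IMPS's "minimal extra hardware complexity" claim comes from.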
Methodology
Core model: 4 GHz out-of-order processor, 128-entry instruction window, 512 KB cache/core.
Memory model: DDR2, 1 GB capacity, 4 channels, 4 banks/channel, row interleaved. Row hit: 200 cycles; row miss: 400 cycles.
Comparison to Previous Scheduling Policies
MCP performs 1% better than TCM (the best previous scheduler) at no extra hardware complexity. IMPS performs 5% better than TCM at minimal extra hardware complexity. Both perform consistently well across all intensity categories.
Comparison to AFT/DPM (Awasthi et al., PACT'11)
MCP and IMPS outperform AFT and DPM by 7% and 12.4%, respectively (across 40 workloads). Application-aware page allocation mitigates inter-application interference better.
Future Work
Further exploration of integrated memory partitioning and scheduling for system performance. Integrated partitioning and scheduling for fairness. Workload-aware memory scheduling.