MASK: Redesigning the GPU Memory Hierarchy

Presentation transcript:

MASK: Redesigning the GPU Memory Hierarchy to Support Multi-Application Concurrency
Rachata Ausavarungnirun, Vance Miller, Joshua Landgraf, Saugata Ghose, Jayneel Gandhi, Adwait Jog, Christopher J. Rossbach, Onur Mutlu

Performance Impact of Address Translation
Address translation enables GPU sharing, but a page is shared by many threads, so a single translation miss can stall multiple warps. Parallel page walks further reduce the GPU's latency-hiding capability.
[Figure: a GPU core with a private TLB, a shared TLB, page table walkers, and the page table; high TLB contention and high-latency page walks affect both private and shared data]
Overall: 45.6% performance degradation from address translation.

Issues with State-of-the-Art Address Translation Designs
Address translation data is latency-sensitive, yet the GPU has no awareness of it.
- High shared TLB miss rate
- Inefficient L2 data cache design: it caches only part of the translation requests, and suffers thrashing both among multiple address translation requests and from normal data demand requests
- Translation-oblivious memory scheduler: address translation requests suffer from high DRAM latency, highly parallel page walks lead to cache thrashing, and the GPU becomes unable to hide memory latency
[Figure: per-workload results for 3DS_HISTO, CONS_LPS, MUM_HISTO, and RED_RAY]

MASK: A Translation-Aware Memory Hierarchy for GPUs

Methodology
- GPGPU-Sim-based model of a GTX 750 Ti
- Multiple GPGPU applications can execute concurrently
- Models page walks and page tables
- Models the virtual-to-physical address mapping
- Available at https://github.com/CMU-SAFARI/Mosaic
- 35 pairs of workloads, in 3 categories based on TLB hit rate

Results
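The latency problem the transcript describes — one translation miss stalling every warp that touches the shared page — can be sketched with a toy two-level TLB model. All class names, sizes, and cycle counts below are illustrative assumptions, not the authors' simulator or the paper's parameters:

```python
# Toy model of the two-level GPU TLB hierarchy described above.
# Latencies and capacities are assumed values for illustration only.

PRIVATE_TLB_HIT = 1    # cycles (assumed)
SHARED_TLB_HIT = 10    # cycles (assumed)
PAGE_WALK = 100        # cycles (assumed cost of a full page walk)

class TLB:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = {}  # virtual page number -> physical page number

    def lookup(self, vpn):
        return self.entries.get(vpn)

    def fill(self, vpn, ppn):
        if len(self.entries) >= self.capacity:
            self.entries.pop(next(iter(self.entries)))  # crude FIFO eviction
        self.entries[vpn] = ppn

def translate(vpn, private_tlb, shared_tlb, page_table):
    """Return (ppn, latency_cycles) for one warp's translation request."""
    ppn = private_tlb.lookup(vpn)
    if ppn is not None:
        return ppn, PRIVATE_TLB_HIT
    ppn = shared_tlb.lookup(vpn)
    if ppn is not None:
        private_tlb.fill(vpn, ppn)
        return ppn, PRIVATE_TLB_HIT + SHARED_TLB_HIT
    ppn = page_table[vpn]  # miss in both levels: walk the page table
    shared_tlb.fill(vpn, ppn)
    private_tlb.fill(vpn, ppn)
    return ppn, PRIVATE_TLB_HIT + SHARED_TLB_HIT + PAGE_WALK

# One page shared by many warps: the first request pays a full page
# walk, and every warp waiting on that translation stalls with it.
page_table = {0: 42}
priv, shared = TLB(16), TLB(64)
first = translate(0, priv, shared, page_table)
second = translate(0, priv, shared, page_table)
print(first[1], second[1])  # prints "111 1"
```

The two printed latencies show the asymmetry the slides point out: a miss costs two orders of magnitude more than a hit, which is why a single shared-page miss can erase the GPU's latency-hiding capability.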
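The "translation-oblivious memory scheduler" issue hints at the kind of fix a translation-aware hierarchy applies: service page-walk (translation) requests ahead of ordinary data requests, since many warps block on each translation. A minimal priority-queue sketch of that idea, with request fields and priority values invented for illustration:

```python
import heapq

# Minimal sketch of a translation-aware memory scheduler: translation
# (page-walk) requests are drained before ordinary data requests.
# Priority encoding and request fields are assumptions, not the
# paper's actual scheduler design.

TRANSLATION, DATA = 0, 1  # lower value = higher scheduling priority

class MemoryScheduler:
    def __init__(self):
        self._queue = []
        self._seq = 0  # tie-breaker preserves FIFO order within a class

    def enqueue(self, kind, addr):
        heapq.heappush(self._queue, (kind, self._seq, addr))
        self._seq += 1

    def next_request(self):
        kind, _, addr = heapq.heappop(self._queue)
        return kind, addr

sched = MemoryScheduler()
sched.enqueue(DATA, 0x1000)
sched.enqueue(TRANSLATION, 0x2000)  # arrives later, but served first
sched.enqueue(DATA, 0x3000)
order = [sched.next_request() for _ in range(3)]
print(order)  # translation request first, then data in arrival order
```

Serving translations first keeps page walks off the critical path of the many warps stalled behind them, at the cost of slightly delaying individual data requests, which the GPU's thread-level parallelism can usually absorb.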