MASK: Redesigning the GPU Memory Hierarchy

Presentation transcript:

MASK: Redesigning the GPU Memory Hierarchy to Support Multi-Application Concurrency
Rachata Ausavarungnirun, Vance Miller, Joshua Landgraf, Saugata Ghose, Jayneel Gandhi, Adwait Jog, Christopher J. Rossbach, Onur Mutlu

Performance Impact of Address Translation
Address translation enables GPU sharing, but a page is shared by many threads, so a single translation miss can stall multiple warps. Parallel page walks further reduce the GPU's latency-hiding capability.
[Figure: a GPU core with a private TLB, a shared TLB, page table walkers, and the page table; high TLB contention and high-latency page walks affect both private and shared data]
Overall: 45.6% performance degradation from address translation.

Issues with State-of-the-Art Address Translation Designs
Address translation data is latency-sensitive, yet the GPU has no awareness of it.
- High shared TLB miss rate
- Inefficient L2 data cache design: it caches only part of the translation requests, and suffers thrashing both among multiple address translation requests and from normal data demand requests
- Translation-oblivious memory scheduler: address translation requests suffer from high DRAM latency, highly parallel page walks lead to cache thrashing, and the GPU becomes unable to hide memory latency
[Figure: per-workload results for 3DS_HISTO, CONS_LPS, MUM_HISTO, and RED_RAY]

MASK: A Translation-Aware Memory Hierarchy for GPUs

Methodology
- GPGPU-Sim-based model of a GTX 750 Ti
- Multiple GPGPU applications can execute concurrently
- Models page walks and page tables
- Models the virtual-to-physical address mapping
- Available at https://github.com/CMU-SAFARI/Mosaic
- 35 pairs of workloads, in 3 categories based on TLB hit rate

Results
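The latency problem the transcript describes — one translation miss stalling every warp that touches the shared page — can be sketched with a toy two-level TLB model. All class names, sizes, and cycle counts below are illustrative assumptions, not the authors' simulator or the paper's parameters:

```python
# Toy model of the two-level GPU TLB hierarchy described above.
# Latencies and capacities are assumed values for illustration only.

PRIVATE_TLB_HIT = 1    # cycles (assumed)
SHARED_TLB_HIT = 10    # cycles (assumed)
PAGE_WALK = 100        # cycles (assumed cost of a full page walk)

class TLB:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = {}  # virtual page number -> physical page number

    def lookup(self, vpn):
        return self.entries.get(vpn)

    def fill(self, vpn, ppn):
        if len(self.entries) >= self.capacity:
            self.entries.pop(next(iter(self.entries)))  # crude FIFO eviction
        self.entries[vpn] = ppn

def translate(vpn, private_tlb, shared_tlb, page_table):
    """Return (ppn, latency_cycles) for one warp's translation request."""
    ppn = private_tlb.lookup(vpn)
    if ppn is not None:
        return ppn, PRIVATE_TLB_HIT
    ppn = shared_tlb.lookup(vpn)
    if ppn is not None:
        private_tlb.fill(vpn, ppn)
        return ppn, PRIVATE_TLB_HIT + SHARED_TLB_HIT
    ppn = page_table[vpn]  # miss in both levels: walk the page table
    shared_tlb.fill(vpn, ppn)
    private_tlb.fill(vpn, ppn)
    return ppn, PRIVATE_TLB_HIT + SHARED_TLB_HIT + PAGE_WALK

# One page shared by many warps: the first request pays a full page
# walk, and every warp waiting on that translation stalls with it.
page_table = {0: 42}
priv, shared = TLB(16), TLB(64)
first = translate(0, priv, shared, page_table)
second = translate(0, priv, shared, page_table)
print(first[1], second[1])  # prints "111 1"
```

The two printed latencies show the asymmetry the slides point out: a miss costs two orders of magnitude more than a hit, which is why a single shared-page miss can erase the GPU's latency-hiding capability.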
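The "translation-oblivious memory scheduler" issue hints at the kind of fix a translation-aware hierarchy applies: service page-walk (translation) requests ahead of ordinary data requests, since many warps block on each translation. A minimal priority-queue sketch of that idea, with request fields and priority values invented for illustration:

```python
import heapq

# Minimal sketch of a translation-aware memory scheduler: translation
# (page-walk) requests are drained before ordinary data requests.
# Priority encoding and request fields are assumptions, not the
# paper's actual scheduler design.

TRANSLATION, DATA = 0, 1  # lower value = higher scheduling priority

class MemoryScheduler:
    def __init__(self):
        self._queue = []
        self._seq = 0  # tie-breaker preserves FIFO order within a class

    def enqueue(self, kind, addr):
        heapq.heappush(self._queue, (kind, self._seq, addr))
        self._seq += 1

    def next_request(self):
        kind, _, addr = heapq.heappop(self._queue)
        return kind, addr

sched = MemoryScheduler()
sched.enqueue(DATA, 0x1000)
sched.enqueue(TRANSLATION, 0x2000)  # arrives later, but served first
sched.enqueue(DATA, 0x3000)
order = [sched.next_request() for _ in range(3)]
print(order)  # translation request first, then data in arrival order
```

Serving translations first keeps page walks off the critical path of the many warps stalled behind them, at the cost of slightly delaying individual data requests, which the GPU's thread-level parallelism can usually absorb.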