Presentation transcript:

MASK: Redesigning the GPU Memory Hierarchy to Support Multi-Application Concurrency

Rachata Ausavarungnirun, Vance Miller, Joshua Landgraf, Saugata Ghose, Jayneel Gandhi, Adwait Jog, Christopher J. Rossbach, Onur Mutlu

GPU 2 (Virginia EF), Tuesday 2PM-3PM

Enabling GPU Sharing with Address Translation

[Slide: four GPU cores shared between App 1 and App 2; virtual addresses flow to the page table walkers, which access the page table in main memory. Callout: high-latency page walks]

In this example, the GPU contains four cores and is shared between two applications, App 1 and App 2. When the GPU wants to load or store data, it first needs to translate the virtual address into a physical address. To do this, the page table walkers walk the page table using the virtual address. This process incurs multiple reads to main memory, making it a high-latency operation.

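To make that cost concrete, below is a minimal C++ sketch of such a walk. It assumes a 4-level radix page table with 4 KiB pages and 9 index bits per level (an x86-64-like organization; the talk does not specify one), and it models each page-table read as a call into main memory. The point it illustrates is the slide's: one translation requires several dependent memory reads before the actual data access can even start.

```cpp
#include <cstdint>
#include <functional>
#include <optional>

// Minimal sketch of a 4-level radix page table walk (organization assumed).
// `read_pte` models one access to the page table in main memory; the walk
// issues up to four such dependent reads per translation.
using ReadPte = std::function<uint64_t(uint64_t table_base, uint64_t index)>;

std::optional<uint64_t> walk_page_table(uint64_t root, uint64_t vaddr,
                                        const ReadPte& read_pte) {
    constexpr int kLevels = 4, kPageShift = 12, kIndexBits = 9;
    uint64_t table = root;
    for (int level = kLevels - 1; level >= 0; --level) {
        int shift      = kPageShift + level * kIndexBits;
        uint64_t index = (vaddr >> shift) & ((1ull << kIndexBits) - 1);
        uint64_t pte   = read_pte(table, index);  // one main-memory read
        if (!(pte & 1)) return std::nullopt;      // present bit clear: fault
        table = pte & ~0xFFFull;                  // next table, or final frame
    }
    return table | (vaddr & ((1ull << kPageShift) - 1));  // physical address
}
```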
State-of-the-Art Translation Support in GPUs

[Slide: each GPU core gains a private TLB; a shared TLB sits between the private TLBs and the page table walkers]

To alleviate this performance penalty, modern processors, including state-of-the-art GPU designs, employ translation lookaside buffers (TLBs) to reduce the latency of address translation. These include a private per-core TLB, as shown in this example, and a second-level shared TLB. With these structures, the page table walkers only need to translate addresses that miss in both the private and the shared TLB. However, we found that when the GPU is shared across multiple applications, each application generates contention at the shared TLB, decreasing the effectiveness of these components.

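The lookup path this slide describes can be sketched as follows, again in C++ simulator style. The fill policy shown (insert into both levels after a walk, promote to the private TLB on a shared-TLB hit) is a common choice and an assumption here, not something the slide specifies.

```cpp
#include <cstdint>
#include <optional>
#include <unordered_map>

// Toy TLB: a map from virtual page number (vpn) to physical page number (ppn).
struct Tlb {
    std::unordered_map<uint64_t, uint64_t> entries;
    std::optional<uint64_t> lookup(uint64_t vpn) const {
        auto it = entries.find(vpn);
        if (it == entries.end()) return std::nullopt;
        return it->second;
    }
    void fill(uint64_t vpn, uint64_t ppn) { entries[vpn] = ppn; }
};

// Translate one vpn; `walk` stands in for the page table walkers, the
// high-latency path reached only when both TLB levels miss.
uint64_t translate(Tlb& private_tlb, Tlb& shared_tlb, uint64_t vpn,
                   uint64_t (*walk)(uint64_t vpn)) {
    if (auto ppn = private_tlb.lookup(vpn)) return *ppn;  // fast path
    if (auto ppn = shared_tlb.lookup(vpn)) {              // shared-TLB hit
        private_tlb.fill(vpn, *ppn);
        return *ppn;
    }
    uint64_t ppn = walk(vpn);                             // slow path
    shared_tlb.fill(vpn, ppn);
    private_tlb.fill(vpn, ppn);
    return ppn;
}
```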
Three Sources of Inefficiency in Translation

1. High TLB contention
2. Inefficient caching of translation data (motivating a bypass)
3. Address translation is latency-sensitive

These observations motivate our solution, MASK: a translation-aware memory hierarchy.


Our Solution: MASK, A Translation-aware Memory Hierarchy

Our design goals are the following: we would like to achieve high TLB reach with low demand paging latency. In addition, we want our mechanism to be application-transparent, so that programmers do not need to modify GPGPU applications to take advantage of our design.

Three Components of MASK

Component 1: TLB-fill Tokens, which reduce contention at the shared TLB.

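As a rough illustration of the token idea, the sketch below admits shared-TLB fills only from requestors that currently hold a token, so a thrashing application cannot evict everyone else's entries. This is a minimal sketch built from the slide's one-line description; the granting policy, the per-requestor granularity, and all names are assumptions rather than the paper's actual mechanism.

```cpp
#include <cstdint>
#include <unordered_map>
#include <unordered_set>

// Sketch of token-gated fills into the shared TLB. Only token holders may
// insert entries; fills from everyone else are simply dropped here (a real
// design might redirect them to a small separate structure instead).
struct SharedTlbWithTokens {
    std::unordered_map<uint64_t, uint64_t> entries;  // vpn -> ppn
    std::unordered_set<uint32_t> token_holders;      // requestor ids w/ tokens

    void grant_token(uint32_t id)  { token_holders.insert(id); }
    void revoke_token(uint32_t id) { token_holders.erase(id); }

    // Accept a fill only from a requestor that currently holds a token,
    // throttling how much of the shared TLB each application can occupy.
    void try_fill(uint32_t id, uint64_t vpn, uint64_t ppn) {
        if (token_holders.count(id)) entries[vpn] = ppn;
    }
};
```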
Component 2: Translation-aware L2 Bypass, which improves utilization of the L2 data cache by letting translation requests bypass it.

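One plausible way to realize such a bypass, sketched below, is to track the L2 hit rate of translation requests and route them straight to main memory once caching them stops paying off, freeing L2 capacity for data. The running-average bookkeeping and the threshold value are assumptions; only the idea of bypassing the L2 for translation traffic comes from the slide.

```cpp
// Sketch of a translation-aware L2 bypass decision. The policy keeps an
// exponential moving average of the L2 hit rate seen by translation
// requests and bypasses the L2 when that rate falls below a threshold.
struct L2BypassPolicy {
    double hit_rate = 1.0;                      // running estimate
    static constexpr double kAlpha = 0.05;      // smoothing factor (assumed)
    static constexpr double kMinHitRate = 0.5;  // bypass threshold (assumed)

    // Update the estimate after each translation request probes the L2.
    void record(bool hit) { hit_rate = (1 - kAlpha) * hit_rate + kAlpha * hit; }

    // True if the next translation request should skip the L2 entirely.
    bool should_bypass() const { return hit_rate < kMinHitRate; }
};
```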
Component 3: Address-space-aware Memory Scheduler, which lowers address translation latency by handling translation and data requests to main memory separately.

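Below is a minimal sketch of that idea, under the assumption that the scheduler simply keeps translation requests in a separate, higher-priority queue: each translation typically blocks many dependent data accesses, so serving it first lowers translation latency. The two-queue structure and the names are illustrative, not the paper's exact design.

```cpp
#include <cstdint>
#include <deque>
#include <optional>

struct MemRequest { uint64_t addr; bool is_translation; };

// Sketch of an address-space-aware memory scheduler: translation
// (page-walk) traffic is queued separately and always served before
// ordinary data traffic.
struct MemoryScheduler {
    std::deque<MemRequest> translation_q;  // high priority
    std::deque<MemRequest> data_q;         // normal priority

    void enqueue(const MemRequest& r) {
        (r.is_translation ? translation_q : data_q).push_back(r);
    }

    // Pick the next request to issue to DRAM, translations first.
    std::optional<MemRequest> next() {
        auto pop = [](std::deque<MemRequest>& q) {
            MemRequest r = q.front();
            q.pop_front();
            return r;
        };
        if (!translation_q.empty()) return pop(translation_q);
        if (!data_q.empty())        return pop(data_q);
        return std::nullopt;
    }
};
```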
Together, these three components make the shared TLB, the L2 data cache, and the memory scheduler translation-aware: MASK improves performance by 57.8%.

MASK: Redesigning the GPU Memory Hierarchy to Support Multi-Application Concurrency

Rachata Ausavarungnirun, Vance Miller, Joshua Landgraf, Saugata Ghose, Jayneel Gandhi, Adwait Jog, Christopher J. Rossbach, Onur Mutlu

GPU 2 (Virginia EF), Tuesday 2PM-3PM