Interplay between Hardware Prefetcher and Page Eviction Policy in CPU-GPU Unified Virtual Memory
ISCA 2019
Debashis Ganguly, Ziyu Zhang, Jun Yang, Rami Melhem

Why do we need Hardware Prefetchers?
[Animation: timeline of kernel execution, far faults, and data migration across Stream 0 and Stream 1.]
User-directed prefetch raises the questions: what and when to prefetch, and how do I synchronize between streams?
Hardware prefetch takes away the programming effort, follows the spatio-temporal locality of past accesses, and overlaps kernel execution with data migration.

Different Hardware Prefetchers
Random Prefetcher (Rp): randomly prefetch a 4KB page local to the 2MB large page to which the current faulty page belongs.
Sequential-local 64KB Prefetcher (SLp), a variation of sequential and locality-aware prefetching: prefetch the 64KB basic block to which the current faulty page belongs.
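The two prefetchers above can be sketched in a few lines. This is an illustrative model only; the constants and function names are mine, not the driver's:

```python
import random

PAGE = 4 * 1024          # 4KB page
BLOCK = 64 * 1024        # 64KB basic block
LARGE = 2 * 1024 * 1024  # 2MB large page

def rp_candidate(fault_addr, rng=random):
    """Rp: pick a random 4KB page inside the 2MB large page
    that contains the faulting address."""
    base = fault_addr - (fault_addr % LARGE)
    return base + rng.randrange(LARGE // PAGE) * PAGE

def slp_candidates(fault_addr):
    """SLp: prefetch every 4KB page of the 64KB basic block
    that contains the faulting address."""
    base = fault_addr - (fault_addr % BLOCK)
    return [base + i * PAGE for i in range(BLOCK // PAGE)]
```

For a fault at `0x12345000`, SLp pulls in the sixteen 4KB pages of the block starting at `0x12340000`, while Rp picks one page anywhere in the enclosing 2MB large page.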

Tree-based Neighborhood Prefetcher (TBNp)
[Animation: a binary tree tracks the valid-page percentage over 64KB basic blocks. Each far fault migrates the faulting 4KB page and fills the rest of its 64KB block; whenever an ancestor subtree rises above 50% valid, its remaining invalid pages are prefetched, so the effective prefetch granularity grows up the tree until the full region is resident.]
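The tree walk in the frames above can be modeled as follows. This is a toy reconstruction from the animation, assuming a neighborhood of eight 64KB blocks and a strict >50% threshold; the real driver's tree shape and thresholds may differ:

```python
BLOCK_PAGES = 16   # one 64KB basic block = 16 x 4KB pages
NUM_BLOCKS = 8     # model a neighborhood of eight 64KB blocks

class TBNp:
    def __init__(self):
        self.valid = [False] * (NUM_BLOCKS * BLOCK_PAGES)

    def fault(self, page):
        """Migrate the faulting 4KB page, fill its 64KB block, then
        prefetch any ancestor subtree that rises above 50% valid."""
        self.valid[page] = True
        lo = (page // BLOCK_PAGES) * BLOCK_PAGES
        fetched = {page} | self._fill(lo, lo + BLOCK_PAGES)
        span = BLOCK_PAGES
        while span < len(self.valid):
            span *= 2
            lo = (page // span) * span
            if 2 * sum(self.valid[lo:lo + span]) > span:
                fetched |= self._fill(lo, lo + span)
        return sorted(fetched)

    def _fill(self, lo, hi):
        """Mark every invalid page in [lo, hi) valid; return the new ones."""
        new = {i for i in range(lo, hi) if not self.valid[i]}
        for i in new:
            self.valid[i] = True
        return new
```

In this model, faulting on one page in each of blocks 0, 1, and 2 ends with block 3 prefetched as well, matching the frames where crossing 50% valid pulls in the rest of the subtree.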

When the working set fits in device memory
TBNp achieves a 1-2 order of magnitude performance improvement over no prefetching: larger transfer sizes yield higher bandwidth, and the number of far faults is reduced.

What happens under device memory oversubscription?
Hardware prefetchers are disabled to avoid displacing heavily referenced pages. Pre-eviction maintains a free-page buffer to hide write-back latency. But pre-eviction disables the prefetcher early, leading to ~100x performance degradation with just 110% oversubscription.

Interplay between Prefetcher and Naïve Eviction Policies
[Animation: LRU eviction at 4KB and 2MB granularities against a prefetcher working at 64KB and 2MB granularities.]
4KB LRU eviction leaves no contiguous free space to prefetch into, rendering the prefetcher ineffective. 2MB LRU eviction displaces heavily referenced pages, causing heavy thrashing.
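The fragmentation half of this argument can be demonstrated with a toy free-frame model (my own construction, not from the talk): freeing the same number of 4KB frames scattered across a 2MB region almost never leaves a 64KB-sized contiguous hole, while freeing them as one run trivially does.

```python
import random

PAGE_COUNT = 512   # model a 2MB region as 512 x 4KB frames
BLOCK_PAGES = 16   # a 64KB prefetch needs 16 contiguous free frames

def max_free_run(free):
    """Length of the longest run of contiguous free frames."""
    best = run = 0
    for f in free:
        run = run + 1 if f else 0
        best = max(best, run)
    return best

# 4KB LRU-style eviction: 64 freed frames scattered across the region
rng = random.Random(42)
free = [False] * PAGE_COUNT
for i in rng.sample(range(PAGE_COUNT), 64):
    free[i] = True
scattered = max_free_run(free)

# 2MB-style eviction: the same 64 frames freed contiguously
free = [False] * PAGE_COUNT
for i in range(64):
    free[i] = True
contiguous = max_free_run(free)
```

With scattered 4KB frees, `scattered` stays well below the 16 contiguous frames a 64KB prefetch needs, so the prefetcher has nowhere to land; the contiguous case leaves a 64-frame hole.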

Prefetcher Inspired Eviction Policies
Random Eviction (Re): randomly evict a 4KB page from the entire virtual address space.
Sequential-local 64KB Pre-eviction (SLe): pre-evict the 64KB basic block corresponding to the 4KB LRU candidate.
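These two eviction policies deliberately mirror the Rp and SLp prefetchers; a sketch in the same illustrative style (names and constants are mine):

```python
import random

PAGE = 4 * 1024    # 4KB page
BLOCK = 64 * 1024  # 64KB basic block

def re_candidate(resident_pages, rng=random):
    """Re: evict a random resident 4KB page drawn from the
    entire virtual address space."""
    return rng.choice(sorted(resident_pages))

def sle_candidates(lru_page):
    """SLe: pre-evict the whole 64KB basic block that contains
    the 4KB LRU candidate."""
    base = lru_page - (lru_page % BLOCK)
    return [base + i * PAGE for i in range(BLOCK // PAGE)]
```

The symmetry is the point: `sle_candidates` frees exactly the contiguous 64KB region that an SLp prefetch would later want to fill.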

Tree-based Neighborhood Pre-eviction (TBNe)
[Animation: the mirror image of TBNp. Evicting the 4KB LRU candidate pre-evicts the rest of its 64KB basic block, and the valid-page percentages propagate up the tree; as subtrees empty out past the threshold, their remaining valid pages are pre-evicted, so the effective eviction granularity grows up the tree until the region is freed.]

Combining Pre-evictions (4KB Granularity) and Prefetchers
The TBNp and TBNe combination yields an order of magnitude performance improvement, with no additional coordination required: the two policies respecting each other's granularity pays off.

Combining Pre-evictions (2MB Granularity) and Prefetchers
TBNe yields an average 18.5% performance improvement: its dynamic eviction granularity reduces thrashing.

Conclusion
Leverages the existing framework for the hardware prefetcher: no additional implementation or performance overhead. Builds on generic concepts; vendor agnostic. Opportunistically decides on a dynamic eviction granularity, navigating between the two extremes of 4KB and 2MB and overcoming the limitations of static granularity. Micro-benchmarks, UVM benchmarks, and the simulator are public for future collaboration: https://github.com/DebashisGanguly/gpgpu-sim_UVMSmart

Interplay between Hardware Prefetcher and Page Eviction Policy in CPU-GPU Unified Virtual Memory
Debashis Ganguly, Ph.D. Student
debashis@cs.pitt.edu
https://people.cs.pitt.edu/~debashis/