Interplay between Hardware Prefetcher and Page Eviction Policy in CPU-GPU Unified Virtual Memory
Debashis Ganguly, Ziyu Zhang, Jun Yang, Rami Melhem
ISCA 2019
Why do we need Hardware Prefetchers?
  [Figure: timeline of Kernel Execution, Far fault, and Data Migration on Stream 0 and Stream 1]
  User Directed Prefetch
    What and when to prefetch?
    How do I synchronize between streams?
  Hardware Prefetch
    Takes away the programming effort
    Follows spatio-temporal locality of past accesses
    Overlap kernel execution and data migration
Different Hardware Prefetchers
  [Figure: 2MB large pages and 64KB basic blocks]
  Random Prefetcher (Rp)
    Randomly prefetch a 4KB page local to the 2MB large page to which the current faulty page belongs
  Sequential-local 64KB Prefetcher (SLp) [Variation of Sequential and Locality-aware]
    Prefetch the 64KB basic block to which the current faulty page belongs
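A minimal Python sketch of how Rp and SLp might pick their prefetch candidates from a faulting address. The function names and the address arithmetic are illustrative assumptions, not the authors' simulator code; the sketch also does not exclude the faulting page or pages that are already valid.

```python
import random

PAGE = 4 * 1024            # 4KB page
BLOCK = 64 * 1024          # 64KB basic block
LARGE = 2 * 1024 * 1024    # 2MB large page


def random_prefetch(fault_addr, rng=random):
    """Rp: pick one random 4KB page inside the 2MB large page of the fault."""
    base = fault_addr - fault_addr % LARGE
    return base + rng.randrange(LARGE // PAGE) * PAGE


def sequential_local_prefetch(fault_addr):
    """SLp: list every 4KB page of the 64KB basic block holding the fault."""
    base = fault_addr - fault_addr % BLOCK
    return [base + i * PAGE for i in range(BLOCK // PAGE)]
```

For a fault 70KB into a large page, `sequential_local_prefetch` returns the 16 pages of the second 64KB block of that large page, while `random_prefetch` returns a single page-aligned address somewhere in the same 2MB range.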
Tree-based Neighborhood Prefetcher (TBNp)
  [Figure: animated binary tree over eight 64KB basic blocks; each node shows the percentage of valid pages beneath it. Legend: Invalid, Page Access, Far fault, Prefetch]
  Root occupancy at each step of the animation:
    0%     initial state, all pages invalid
    12.5%  far fault 1: 4KB faulty page migrated, remaining 60KB of its basic block prefetched
    25%    far fault 2: 4KB + 60KB in another basic block
    37.5%  far fault 3: 4KB + 60KB in another basic block
    50%    prefetch 4: one whole 64KB basic block
    62.5%  far fault 5: 4KB + 60KB in another basic block
    100%   prefetch 5: the three remaining 64KB basic blocks
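The animation can be read as the following rule, sketched here in Python at 64KB basic-block granularity: migrate the block containing the far fault, then walk up the tree and prefetch the rest of any ancestor that has become more than half valid. The strict "more than 50%" threshold is an assumption read off the percentages in the figure, and `tbnp_fault` is a hypothetical helper name, not the authors' implementation.

```python
def tbnp_fault(valid, leaf):
    """Handle a far fault in 64KB basic block `leaf`.

    `valid` holds one boolean per basic block (a power-of-two count) and is
    updated in place. Returns the blocks migrated, in order.
    """
    migrated = []
    if not valid[leaf]:
        valid[leaf] = True          # 4KB faulty page + 60KB prefetch
        migrated.append(leaf)
    # Walk from the leaf's parent up to the root. A node at height h
    # covers 2**h consecutive basic blocks.
    span = 2
    while span <= len(valid):
        start = leaf - leaf % span
        group = range(start, start + span)
        filled = sum(valid[i] for i in group)
        if 2 * filled > span:       # strictly more than 50% valid
            for i in group:
                if not valid[i]:
                    valid[i] = True
                    migrated.append(i)
        span *= 2
    return migrated
```

Replaying the slide's sequence on eight blocks (faults in blocks 1, 3, 0, then 4) gives the same cascade as the figure: the third fault pulls in block 2 as well, and the fourth fault pushes the root past 50% so the three remaining blocks are prefetched.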
When working set fits in device memory
  TBNp improves performance by 1-2 orders of magnitude over no prefetching
    Larger the transfer size, higher the bandwidth
    Reduced number of far-faults
What happens under device memory oversubscription?
  Disable hardware prefetchers
    To avoid displacement of heavily referenced pages
  Pre-eviction to maintain free-page buffer
    To avoid write-back latency
  Early disabling of prefetcher by pre-eviction
    ~100x performance degradation with just 110% oversubscription
Interplay between Prefetcher and Naïve Eviction Policies
  [Figure: 2MB large pages and 64KB basic blocks under LRU 4KB and LRU 2MB eviction]
  LRU 4KB
    No contiguous free space to prefetch
    Renders prefetcher ineffective
  LRU 2MB
    Displace heavily referenced pages
    Causes large thrashing
Prefetcher Inspired Eviction Policies
  [Figure: 2MB large pages and 64KB basic blocks]
  Random Eviction (Re)
    Randomly evict a 4KB page from the entire virtual address space
  Sequential-local 64KB Pre-eviction (SLe)
    Pre-evict the 64KB basic block containing the 4KB LRU candidate
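A minimal Python sketch of the two policies, assuming device-resident pages are tracked as a set of 4KB-aligned addresses. That bookkeeping, the function names, and restricting Re to resident pages are illustrative assumptions rather than details taken from the talk.

```python
import random

PAGE = 4 * 1024     # 4KB page
BLOCK = 64 * 1024   # 64KB basic block


def random_evict(resident_pages, rng=random):
    """Re: choose one resident 4KB page uniformly at random."""
    return rng.choice(sorted(resident_pages))


def sequential_local_evict(resident_pages, lru_page):
    """SLe: select every resident page in the 64KB block of the LRU candidate."""
    base = lru_page - lru_page % BLOCK
    return sorted(p for p in resident_pages if base <= p < base + BLOCK)
```

SLe frees a contiguous 64KB region in one step, which is the same unit SLp later wants to fill.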
Tree-based Neighborhood Pre-eviction (TBNe)
  [Figure: animated binary tree over eight 64KB basic blocks; each node shows the percentage of valid pages beneath it. Legend: Valid, LRU Candidate, LRU Eviction, Pre-eviction]
  Root occupancy at each step of the animation:
    100%   initial state, all pages valid
    87.5%  eviction 1: 4KB LRU candidate evicted, remaining 60KB of its basic block pre-evicted
    75%    eviction 2: 4KB + 60KB in another basic block
    62.5%  eviction 3: 4KB + 60KB in another basic block
    50%    eviction 4: 4KB + 60KB in another basic block
    37.5%  pre-eviction 5: one whole 64KB basic block
    0%     pre-eviction 6: the three remaining 64KB basic blocks
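TBNe mirrors the prefetcher's tree walk. A minimal Python sketch at 64KB basic-block granularity: evict the block containing the LRU candidate, then walk up the tree and pre-evict whatever is left under any ancestor that has dropped below half valid. The strict "less than 50%" threshold is an assumption read off the figure, and `tbne_evict` is a hypothetical helper name, not the authors' implementation.

```python
def tbne_evict(valid, leaf):
    """Evict the 64KB basic block `leaf` that holds the LRU candidate.

    `valid` holds one boolean per basic block (a power-of-two count) and is
    updated in place. Returns the blocks evicted, in order.
    """
    evicted = []
    if valid[leaf]:
        valid[leaf] = False         # 4KB LRU candidate + 60KB pre-eviction
        evicted.append(leaf)
    # Walk from the leaf's parent up to the root. A node at height h
    # covers 2**h consecutive basic blocks.
    span = 2
    while span <= len(valid):
        start = leaf - leaf % span
        group = range(start, start + span)
        filled = sum(valid[i] for i in group)
        if 2 * filled < span:       # strictly less than 50% valid
            for i in group:
                if valid[i]:
                    valid[i] = False
                    evicted.append(i)
        span *= 2
    return evicted
```

Replaying the slide's sequence on eight fully valid blocks (LRU candidates in blocks 1, 3, 4, then 0) reproduces the cascade in the figure: the fourth eviction drops the left half to 25%, which pre-evicts block 2, and the root then falls to 37.5%, which pre-evicts the three remaining blocks.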
Combining Pre-evictions (4KB Granularity) and Prefetchers
  Order of magnitude performance improvement by TBNp and TBNe combo
    No additional co-ordination required
    Respecting each other pays off
Combining Pre-evictions (2MB Granularity) and Prefetchers
  Average 18.5% performance improvement by TBNe
    Dynamic eviction granularity
    Reduced thrashing
Conclusion
  Leverages the framework for hardware prefetcher
    No additional implementation and performance overhead
  Builds on generic concepts
    Vendor agnostic
  Opportunistically decide on dynamic eviction granularity
    Navigates between two extremes: 4KB and 2MB
    Overcomes limitations with static granularity
  Micro-benchmarks, UVM benchmarks, and simulator
    Public for future collaboration
    https://github.com/DebashisGanguly/gpgpu-sim_UVMSmart
Interplay between Hardware Prefetcher and Page Eviction Policy in CPU-GPU Unified Virtual Memory
Debashis Ganguly
Ph.D. Student
debashis@cs.pitt.edu
https://people.cs.pitt.edu/~debashis/