1
Interplay between Hardware Prefetcher and Page Eviction Policy in CPU-GPU Unified Virtual Memory
ISCA 2019
Debashis Ganguly, Ziyu Zhang, Jun Yang, Rami Melhem
6
Why do we need Hardware Prefetchers?
Figure labels: Kernel Execution, Far fault, Data Migration, Stream 0, Stream 1.
User-directed prefetch: What and when to prefetch? How do I synchronize between streams?
Hardware prefetch:
Takes away the programming effort
Follows spatio-temporal locality of past accesses
Overlaps kernel execution and data migration
8
Different Hardware Prefetchers
Random Prefetcher (Rp): randomly prefetch a 4KB page local to the 2MB large page to which the current faulting page belongs.
Sequential-local 64KB Prefetcher (SLp) [variation of sequential and locality-aware]: prefetch the 64KB basic block to which the current faulting page belongs.
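The address arithmetic behind Rp and SLp can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names are hypothetical, addresses are plain byte offsets, and the choice to exclude the faulting page itself from Rp's random pick is an assumption.

```python
import random

PAGE = 4 * 1024           # 4KB page
BLOCK = 64 * 1024         # 64KB basic block
LARGE = 2 * 1024 * 1024   # 2MB large page


def slp_pages(fault_addr):
    """SLp sketch: all 4KB pages of the 64KB basic block holding fault_addr."""
    base = fault_addr - fault_addr % BLOCK
    return [base + i * PAGE for i in range(BLOCK // PAGE)]


def rp_page(fault_addr, rng=random):
    """Rp sketch: one random 4KB page inside the 2MB large page holding fault_addr.

    Assumption: the faulting page itself is never chosen, since it is
    migrated anyway.
    """
    base = fault_addr - fault_addr % LARGE
    fault_page = fault_addr - fault_addr % PAGE
    while True:
        candidate = base + rng.randrange(LARGE // PAGE) * PAGE
        if candidate != fault_page:
            return candidate
```

For a fault at byte offset 0x21234, slp_pages returns the sixteen page addresses from 0x20000 to 0x2F000, while rp_page returns a single page-aligned address somewhere in the first 2MB.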
9
Tree-based Neighborhood Prefetcher (TBNp)
[Figure: TBNp tree over eight 64KB basic blocks. Legend: Invalid Page, Access, Far fault, Prefetch. All nodes at 0% occupancy.]
10
Tree-based Neighborhood Prefetcher (TBNp)
[Figure: TBNp tree over eight 64KB basic blocks. Legend: Invalid Page, Access, Far fault, Prefetch. Root occupancy 12.5%: one 64KB block is valid (a 4KB far-faulted page plus the remaining 60KB).]
11
Tree-based Neighborhood Prefetcher (TBNp)
[Figure: TBNp tree over eight 64KB basic blocks. Legend: Invalid Page, Access, Far fault, Prefetch. Root occupancy 25%: two 64KB blocks are valid.]
12
Tree-based Neighborhood Prefetcher (TBNp)
[Figure: TBNp tree over eight 64KB basic blocks. Legend: Invalid Page, Access, Far fault, Prefetch. Root occupancy 37.5%: one half of the tree is at 75%, the other at 0%.]
13
Tree-based Neighborhood Prefetcher (TBNp)
[Figure: TBNp tree over eight 64KB basic blocks. Legend: Invalid Page, Access, Far fault, Prefetch. Root occupancy 50%: one half of the tree reaches 100% after a whole 64KB block is prefetched.]
14
Tree-based Neighborhood Prefetcher (TBNp)
[Figure: TBNp tree over eight 64KB basic blocks. Legend: Invalid Page, Access, Far fault, Prefetch. Root occupancy 62.5%: one half of the tree is at 100%, the other at 25%.]
15
Tree-based Neighborhood Prefetcher (TBNp)
[Figure: TBNp tree over eight 64KB basic blocks. Legend: Invalid Page, Access, Far fault, Prefetch. Root occupancy 100%: the three remaining 64KB blocks are prefetched.]
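The sequence in the TBNp figures can be reproduced with a small simulation. This is a hedged sketch rather than the authors' simulator: it assumes the rule suggested by the percentages in the figures, namely that whenever a tree node becomes more than 50% valid, the rest of that node is prefetched. The function name tbn_prefetch, the list-of-booleans representation (one entry per 64KB basic block), and the fault order are illustrative choices.

```python
def tbn_prefetch(valid, block, threshold=0.5):
    """Handle a far fault on 64KB block index `block`.

    valid: list of booleans, one per 64KB basic block; its length must be
    a power of two so the blocks form the leaves of a full binary tree.
    Returns the block indices migrated by this fault, faulting block first.
    """
    valid[block] = True
    fetched = [block]
    size = 2  # number of leaves under the current ancestor node
    while size <= len(valid):
        start = (block // size) * size
        node = range(start, start + size)
        occupancy = sum(valid[i] for i in node) / size
        if occupancy > threshold:
            # Assumed rule: a node that is more than half valid is filled.
            for i in node:
                if not valid[i]:
                    valid[i] = True
                    fetched.append(i)
        size *= 2
    return fetched


# Replay a fault order consistent with the figures: blocks 1, 3, 0, then 4.
blocks = [False] * 8
history = [tbn_prefetch(blocks, b) for b in (1, 3, 0, 4)]
```

Under these assumptions, the third fault also pulls in block 2 because its four-block subtree reaches 75%, and the fourth fault fills the remaining three blocks because the root reaches 62.5%.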
16
When working set fits in device memory
TBNp improves performance by 1-2 orders of magnitude over no prefetching.
The larger the transfer size, the higher the bandwidth.
Reduced number of far faults.
17
What happens under device memory oversubscription?
Disable hardware prefetchers: to avoid displacement of heavily referenced pages.
Pre-eviction to maintain a free-page buffer: to avoid write-back latency.
Early disabling of the prefetcher by pre-eviction.
~100x performance degradation with just 110% oversubscription.
24
Interplay between Prefetcher and Naïve Eviction Policies
LRU 4KB eviction: no contiguous free space to prefetch; renders the prefetcher ineffective.
LRU 2MB eviction: displaces heavily referenced pages; causes heavy thrashing.
26
Prefetcher Inspired Eviction Policies
Random Eviction (Re): randomly evict a 4KB page from the entire virtual address space.
Sequential-local 64KB Pre-eviction (SLe): pre-evict the 64KB basic block corresponding to the 4KB LRU candidate.
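Re and SLe can be sketched over a toy resident set. This is an illustration under stated assumptions, not the authors' code: resident pages are kept in an OrderedDict keyed by page address with the least recently used page first, Re is restricted to pages that are actually resident, and the function names are hypothetical.

```python
import random
from collections import OrderedDict

PAGE = 4 * 1024    # 4KB page
BLOCK = 64 * 1024  # 64KB basic block


def random_evict(resident, rng=random):
    """Re sketch: evict one randomly chosen resident 4KB page."""
    victim = rng.choice(list(resident))
    del resident[victim]
    return victim


def sle_evict(resident):
    """SLe sketch: evict every resident page in the LRU page's 64KB block.

    resident: OrderedDict of page address -> None, oldest entry first.
    """
    lru_page = next(iter(resident))
    base = lru_page - lru_page % BLOCK
    victims = [p for p in resident if base <= p < base + BLOCK]
    for p in victims:
        del resident[p]
    return victims
```

With resident pages 0x10000, 0x20000, 0x11000, 0x30000 in LRU order, sle_evict removes 0x10000 and 0x11000 together because they share a 64KB block, which frees contiguous space in one step.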
27
Tree-based Neighborhood Pre-eviction (TBNe)
[Figure: TBNe tree over eight 64KB basic blocks. Legend: Valid, LRU Candidate, LRU Eviction, Pre-eviction. All nodes at 100% occupancy.]
28
Tree-based Neighborhood Pre-eviction (TBNe)
[Figure: TBNe tree over eight 64KB basic blocks. Legend: Valid, LRU Candidate, LRU Eviction, Pre-eviction. Root occupancy 87.5%: one 64KB block is evicted (a 4KB LRU candidate plus the remaining 60KB).]
29
Tree-based Neighborhood Pre-eviction (TBNe)
[Figure: TBNe tree over eight 64KB basic blocks. Legend: Valid, LRU Candidate, LRU Eviction, Pre-eviction. Root occupancy 75%: two 64KB blocks are evicted.]
30
Tree-based Neighborhood Pre-eviction (TBNe)
[Figure: TBNe tree over eight 64KB basic blocks. Legend: Valid, LRU Candidate, LRU Eviction, Pre-eviction. Root occupancy 62.5%: one half of the tree is at 50%, the other at 75%.]
31
Tree-based Neighborhood Pre-eviction (TBNe)
[Figure: TBNe tree over eight 64KB basic blocks. Legend: Valid, LRU Candidate, LRU Eviction, Pre-eviction. Root occupancy 50%: one half of the tree is at 25%, the other at 75%.]
32
Tree-based Neighborhood Pre-eviction (TBNe)
[Figure: TBNe tree over eight 64KB basic blocks. Legend: Valid, LRU Candidate, LRU Eviction, Pre-eviction. Root occupancy 37.5%: one half of the tree drops to 0% after a whole 64KB block is pre-evicted.]
33
Tree-based Neighborhood Pre-eviction (TBNe)
[Figure: TBNe tree over eight 64KB basic blocks. Legend: Valid, LRU Candidate, LRU Eviction, Pre-eviction. Root occupancy 0%: the three remaining 64KB blocks are pre-evicted.]
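The TBNe figures mirror the prefetcher, and the same kind of simulation reproduces them. As a hedged sketch, it assumes the rule implied by the percentages in the figures: whenever a tree node drops below 50% valid, the rest of that node is pre-evicted. The function name tbn_evict, the block-level representation, and the eviction order are illustrative assumptions.

```python
def tbn_evict(valid, block, threshold=0.5):
    """Evict the 64KB block holding the LRU candidate, then pre-evict.

    valid: list of booleans, one per 64KB basic block; its length must be
    a power of two so the blocks form the leaves of a full binary tree.
    Returns the block indices evicted, LRU candidate's block first.
    """
    valid[block] = False
    evicted = [block]
    size = 2  # number of leaves under the current ancestor node
    while size <= len(valid):
        start = (block // size) * size
        node = range(start, start + size)
        occupancy = sum(valid[i] for i in node) / size
        if occupancy < threshold:
            # Assumed rule: a node that is less than half valid is emptied.
            for i in node:
                if valid[i]:
                    valid[i] = False
                    evicted.append(i)
        size *= 2
    return evicted


# Replay an eviction order consistent with the figures: blocks 1, 3, 4, 0.
blocks = [True] * 8
history = [tbn_evict(blocks, b) for b in (1, 3, 4, 0)]
```

Under these assumptions, the fourth eviction cascades: its four-block subtree falls to 25% and is emptied, after which the root is at 37.5% and the last three blocks are pre-evicted as well.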
34
Combining Pre-evictions (4KB Granularity) and Prefetchers
An order-of-magnitude performance improvement from combining TBNp and TBNe.
No additional coordination required.
Respecting each other pays off.
35
Combining Pre-evictions (2MB Granularity) and Prefetchers
Average 18.5% performance improvement by TBNe.
Dynamic eviction granularity.
Reduced thrashing.
36
Conclusion
Leverages the framework for the hardware prefetcher: no additional implementation or performance overhead.
Builds on generic concepts: vendor agnostic.
Opportunistically decides on dynamic eviction granularity: navigates between two extremes, 4KB and 2MB; overcomes the limitations of static granularity.
Micro-benchmarks, UVM benchmarks, and simulator: public for future collaboration.
37
Interplay between Hardware Prefetcher and Page Eviction Policy in CPU-GPU Unified Virtual Memory
Debashis Ganguly Ph.D. Student