1 Interplay between Hardware Prefetcher and Page Eviction Policy in CPU-GPU Unified Virtual Memory
ISCA 2019
Debashis Ganguly, Ziyu Zhang, Jun Yang, Rami Melhem

2-6 Why do we need Hardware Prefetchers?
[Figure: timelines of kernel execution, far faults, and data migration; the user-directed prefetch case uses Stream 0 and Stream 1.]
User-directed prefetch: What and when to prefetch? How do I synchronize between streams?
Hardware prefetch: Takes away the programming effort. Follows spatio-temporal locality of past accesses. Overlaps kernel execution and data migration.

7-8 Different Hardware Prefetchers
Random Prefetcher (Rp): Randomly prefetch a 4KB page local to the 2MB large page to which the current faulty page belongs.
Sequential-local 64KB Prefetcher (SLp) [Variation of Sequential and Locality-aware]: Prefetch the 64KB basic block to which the current faulty page belongs.
[Figure: 2MB large pages, each divided into 64KB basic blocks.]
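The address arithmetic behind Rp and SLp can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the faulting page itself is assumed to be excluded from the candidates, and page validity is ignored.

```python
import random

PAGE = 4 * 1024                 # 4KB page
BASIC_BLOCK = 64 * 1024         # 64KB basic block
LARGE_PAGE = 2 * 1024 * 1024    # 2MB large page


def random_prefetch(fault_addr, rng=random):
    """Rp sketch: pick one random 4KB page inside the faulting page's 2MB large page."""
    base = fault_addr // LARGE_PAGE * LARGE_PAGE
    fault_page = fault_addr // PAGE * PAGE
    # Assumption: the faulting page is migrated anyway, so it is not a candidate.
    candidates = [a for a in range(base, base + LARGE_PAGE, PAGE) if a != fault_page]
    return rng.choice(candidates)


def sequential_local_prefetch(fault_addr):
    """SLp sketch: the other 4KB pages of the faulting page's 64KB basic block."""
    base = fault_addr // BASIC_BLOCK * BASIC_BLOCK
    fault_page = fault_addr // PAGE * PAGE
    return [a for a in range(base, base + BASIC_BLOCK, PAGE) if a != fault_page]
```

A fault at address 0x21000, for example, falls in the basic block starting at 0x20000, so the SLp sketch returns the 15 remaining pages of that block (60KB in total).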

9 Tree-based Neighborhood Prefetcher (TBNp)
[Figure: full binary tree over eight 64KB basic blocks; every node starts at 0% valid. Legend: Invalid, Page Access, Far fault, Prefetch.]

10 Tree-based Neighborhood Prefetcher (TBNp)
[Figure: TBNp tree after far fault 1. The faulted 4KB page and the remaining 60KB of its 64KB basic block are migrated; the root is 12.5% valid.]

11 Tree-based Neighborhood Prefetcher (TBNp)
[Figure: TBNp tree after far fault 2 in a second basic block (4KB faulted, 60KB prefetched); the root is 25% valid and the left half is 50% valid.]

12 Tree-based Neighborhood Prefetcher (TBNp)
[Figure: TBNp tree after far fault 3 in a third basic block; the root is 37.5% valid and the left half is 75% valid.]

13 Tree-based Neighborhood Prefetcher (TBNp)
[Figure: TBNp tree after prefetch 4. Because the left half exceeded 50% valid, its remaining 64KB basic block is prefetched; the left half is 100% valid and the root is 50% valid.]

14 Tree-based Neighborhood Prefetcher (TBNp)
[Figure: TBNp tree after far fault 5 in the right half (4KB faulted, 60KB prefetched); the root is 62.5% valid and the right half is 25% valid.]

15 Tree-based Neighborhood Prefetcher (TBNp)
[Figure: TBNp tree after prefetch 5. Because the root exceeded 50% valid, the three remaining 64KB basic blocks are prefetched; every node is 100% valid.]
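The walkthrough on slides 9-15 can be reproduced with a small simulation. This is a sketch under stated assumptions, not the authors' implementation: validity is tracked per 64KB basic block (a far fault migrates the whole block, as drawn on the slides), the threshold is taken to be strictly more than 50% valid, and the name `tbn_prefetch` is hypothetical.

```python
def tbn_prefetch(valid, leaf):
    """Mark basic block `leaf` valid after a far fault, then apply the >50% rule bottom-up.

    `valid` holds one boolean per 64KB basic block; its length must be a power of two.
    Returns the indices of the additional basic blocks that get prefetched.
    """
    n = len(valid)
    valid[leaf] = True
    prefetched = []
    span = 2  # number of leaves under the current ancestor node
    while span <= n:
        start = leaf // span * span
        node = range(start, start + span)
        filled = sum(valid[i] for i in node)
        if filled * 2 > span:  # assumption: strictly more than 50% valid
            for i in node:
                if not valid[i]:
                    valid[i] = True
                    prefetched.append(i)
        span *= 2
    return prefetched


# Replay the fault order shown on the slides over eight basic blocks.
blocks = [False] * 8
history = [tbn_prefetch(blocks, leaf) for leaf in (1, 3, 0, 4)]
```

With the fault order from the slides (blocks 1, 3, 0, 4), the third fault pulls in block 2 and the fifth step pulls in blocks 5, 6, and 7, leaving the whole tree valid.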

16 When working set fits in device memory
TBNp achieves a one to two orders of magnitude performance improvement over no prefetching.
The larger the transfer size, the higher the bandwidth.
Fewer far faults.

17 What happens under device memory oversubscription?
Disable hardware prefetchers: To avoid displacement of heavily referenced pages.
Pre-eviction to maintain free-page buffer: To avoid write-back latency.
Early disabling of prefetcher by pre-eviction.
~100x performance degradation with just 110% oversubscription.

18-24 Interplay between Prefetcher and Naïve Eviction Policies
[Figure: device memory under LRU eviction at 4KB granularity and at 2MB granularity, with 2MB large pages and 64KB basic blocks marked.]
LRU 4KB: No contiguous free space to prefetch. Renders prefetcher ineffective.
LRU 2MB: Displaces heavily referenced pages. Causes large thrashing.

25-26 Prefetcher Inspired Eviction Policies
Random Eviction (Re): Randomly evict a 4KB page from the entire virtual address space.
Sequential-local 64KB Pre-eviction (SLe): Pre-evict the 64KB basic block containing the 4KB LRU candidate.
[Figure: 2MB large pages, each divided into 64KB basic blocks.]
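SLe can be illustrated with an ordered map standing in for the LRU list. This is a sketch only: the `sle_evict` name and the use of `OrderedDict` are assumptions made for illustration, and write-back of dirty pages is not modeled.

```python
from collections import OrderedDict

PAGE = 4 * 1024          # 4KB page
BASIC_BLOCK = 64 * 1024  # 64KB basic block


def sle_evict(lru):
    """SLe sketch: evict every resident page in the LRU candidate's 64KB basic block.

    `lru` is an OrderedDict keyed by resident page address, least recently used first.
    Returns the evicted page addresses in ascending order.
    """
    candidate = next(iter(lru))  # the 4KB LRU candidate
    base = candidate // BASIC_BLOCK * BASIC_BLOCK
    victims = [a for a in range(base, base + BASIC_BLOCK, PAGE) if a in lru]
    for a in victims:
        del lru[a]
    return victims


resident = OrderedDict.fromkeys([0x11000, 0x20000, 0x10000, 0x1F000])
evicted = sle_evict(resident)
```

Here the LRU candidate 0x11000 takes its block neighbors 0x10000 and 0x1F000 with it, which frees space in basic-block-sized contiguous chunks rather than as scattered 4KB holes.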

27 Tree-based Neighborhood Pre-eviction (TBNe)
[Figure: full binary tree over eight 64KB basic blocks; every node starts at 100% valid. Legend: Valid, LRU Candidate, LRU Eviction, Pre-eviction.]

28 Tree-based Neighborhood Pre-eviction (TBNe)
[Figure: TBNe tree after eviction 1. The 4KB LRU candidate is evicted and the remaining 60KB of its 64KB basic block is pre-evicted; the root is 87.5% valid.]

29 Tree-based Neighborhood Pre-eviction (TBNe)
[Figure: TBNe tree after eviction 2 in a second basic block; the root is 75% valid and the left half is 50% valid.]

30 Tree-based Neighborhood Pre-eviction (TBNe)
[Figure: TBNe tree after eviction 3 in the right half; the root is 62.5% valid, the left half 50%, and the right half 75%.]

31 Tree-based Neighborhood Pre-eviction (TBNe)
[Figure: TBNe tree after eviction 4 in the left half; the root is 50% valid, the left half 25%, and the right half 75%.]

32 Tree-based Neighborhood Pre-eviction (TBNe)
[Figure: TBNe tree after pre-eviction 5. Because the left half fell below 50% valid, its remaining 64KB basic block is pre-evicted; the left half is 0% valid and the root is 37.5% valid.]

33 Tree-based Neighborhood Pre-eviction (TBNe)
[Figure: TBNe tree after pre-eviction 6. Because the root fell below 50% valid, the three remaining 64KB basic blocks are pre-evicted; every node is 0% valid.]
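The pre-eviction walkthrough on slides 27-33 mirrors the prefetcher and can be simulated the same way. As before, this is a sketch under assumptions rather than the authors' implementation: validity is tracked per 64KB basic block, the threshold is taken to be strictly less than 50% valid, and the name `tbn_pre_evict` is hypothetical.

```python
def tbn_pre_evict(valid, leaf):
    """Evict basic block `leaf`, then apply the <50% rule bottom-up.

    `valid` holds one boolean per 64KB basic block; its length must be a power of two.
    Returns the indices of the additional basic blocks that get pre-evicted.
    """
    n = len(valid)
    valid[leaf] = False
    pre_evicted = []
    span = 2  # number of leaves under the current ancestor node
    while span <= n:
        start = leaf // span * span
        node = range(start, start + span)
        filled = sum(valid[i] for i in node)
        if filled * 2 < span:  # assumption: strictly less than 50% valid
            for i in node:
                if valid[i]:
                    valid[i] = False
                    pre_evicted.append(i)
        span *= 2
    return pre_evicted


# Replay the eviction order shown on the slides over eight basic blocks.
blocks = [True] * 8
history = [tbn_pre_evict(blocks, leaf) for leaf in (1, 3, 4, 0)]
```

With the eviction order from the slides (blocks 1, 3, 4, 0), nothing extra is pre-evicted until the fourth eviction, which cascades through block 2 and then blocks 5, 6, and 7, emptying the tree.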

34 Combining Pre-evictions (4KB Granularity) and Prefetchers
Order of magnitude performance improvement by the TBNp and TBNe combination.
No additional coordination required.
Respecting each other pays off.

35 Combining Pre-evictions (2MB Granularity) and Prefetchers
Average 18.5% performance improvement by TBNe.
Dynamic eviction granularity.
Reduced thrashing.

36 Conclusion
Leverages the framework for hardware prefetcher: No additional implementation and performance overhead.
Builds on generic concepts: Vendor agnostic.
Opportunistically decides on dynamic eviction granularity: Navigates between two extremes, 4KB and 2MB. Overcomes limitations with static granularity.
Micro-benchmarks, UVM benchmarks, and simulator: Public for future collaboration.

37 Interplay between Hardware Prefetcher and Page Eviction Policy in CPU-GPU Unified Virtual Memory
Debashis Ganguly Ph.D. Student

