1
Executive Summary
Problem: Overheads of virtual memory can be high -- 1D (native): 42%; 2D (virtualized): 97%
Idea: Reduce the dimensionality of page walks
I: Virtualized Direct Segments -- 2D → 0D (8 slides): reduces overheads of virtualized virtual memory to less than 1%
II: Agile Paging -- 2D → 1D (12 slides): reduces overheads of virtualized virtual memory to less than 4% of native
III: Redundant Memory Mappings -- 1D → 0D (11 slides): reduces overheads of native virtual memory to less than 1%
2
Virtual Memory Refresher
[Diagram: per-process virtual address spaces (Process 1, Process 2) mapped through page tables to physical memory, with a TLB (Translation Lookaside Buffer) caching recent translations]
Challenge: How to reduce costly page table walks?
3
Two Technology Trends
TLB reach is limited:

Year   Processor     L1 DTLB entries
1999   Pentium III   72
2001   Pentium 4     64
2008   Nehalem       96
2012   IvyBridge     100
2014   Haswell       -
2015   Skylake       -

[Chart: memory price per GB over time; *Inflation-adjusted 2011 USD, from: jcmit.com]
4
Use of Vast Memory
Big-memory applications (ever-increasing data sets)
5
Native x86 Translation: 1D
[Diagram: native x86-64 translation walks a four-level page table from CR3 to turn a virtual address (VA) into a physical address (PA)]
Up to 4 memory accesses per TLB miss.
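To make the 1D walk concrete, below is a minimal C sketch of a 4-level radix page walk (a software model only: 9 index bits per level and 4 KB pages as in x86-64; the node layout and function names are illustrative, not kernel or hardware structures). It touches exactly one page-table node per level, which is where the "up to 4 memory accesses" comes from.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define LEVELS 4        /* x86-64 walks a 4-level radix tree      */
#define IDX_BITS 9      /* 9 VA bits index each level (512 slots) */
#define PAGE_SHIFT 12   /* 4 KB pages                             */

/* Simplified software page-table node: non-leaf levels use next[],
 * the leaf level uses frame[]. Illustrative only. */
typedef struct node {
    struct node *next[1 << IDX_BITS];
    uint64_t     frame[1 << IDX_BITS];
} node_t;

/* One memory access per level: at most 4 for a native (1D) walk. */
static int walk_1d(node_t *cr3, uint64_t va, uint64_t *pa, int *accesses)
{
    node_t *n = cr3;
    for (int level = LEVELS - 1; level >= 0; level--) {
        int idx = (int)((va >> (PAGE_SHIFT + level * IDX_BITS)) & ((1 << IDX_BITS) - 1));
        (*accesses)++;                 /* read one page-table entry */
        if (level == 0) {
            *pa = (n->frame[idx] << PAGE_SHIFT) | (va & ((1u << PAGE_SHIFT) - 1));
            return 0;
        }
        n = n->next[idx];
        if (!n)
            return -1;                 /* not present: page fault   */
    }
    return -1;
}

int main(void)
{
    node_t *levels[LEVELS];
    for (int i = 0; i < LEVELS; i++)
        levels[i] = calloc(1, sizeof(node_t));
    for (int i = LEVELS - 1; i > 0; i--)   /* link the four levels for VPN 0     */
        levels[i]->next[0] = levels[i - 1];
    levels[0]->frame[0] = 0x1234;          /* map virtual page 0 -> frame 0x1234 */

    uint64_t pa = 0;
    int accesses = 0;
    if (walk_1d(levels[LEVELS - 1], 0xabc, &pa, &accesses) == 0)
        printf("pa=0x%llx after %d memory accesses\n",
               (unsigned long long)pa, accesses);
    return 0;
}
```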
6
Virtual Machines: Enabling many more startups
7
Why do we need more dimensions?
[Diagram: the guest page table maps a guest virtual address (gVA) to a guest physical address (gPA); the nested page table then maps the gPA to a host physical address (hPA)]
8
Inception: Effects of Adding Dimensions
[Diagram: application / OS / VMM / hardware stack. 2D translation ("jog"): 70% overhead; 1D ("run"): 42%; 0D ("sprint"): 0%]
9
Goal: Can we have "zero" overheads of virtualizing memory at any level of virtualization?
10
Reaching the Goal
I: Virtualized Direct Segments (2D → 0D) -- MICRO'14
II: Agile Paging (2D → 1D) -- ISCA'16
III: Redundant Memory Mappings (1D → 0D) -- ISCA'15 + MICRO TOP PICKS'16
(Axes: Virtual Machine / Native Machine / Direct Execution)
11
Outline
I: Virtualized Direct Segments (2D → 0D) -- MICRO'14
II: Agile Paging (2D → 1D) -- ISCA'16
III: Redundant Memory Mappings (1D → 0D) -- ISCA'15 + MICRO TOP PICKS'16
12
I: Virtualized Direct Segments – Goal
TLB reach is limited, and the cost of a TLB miss under virtualization is very high.
Can we eliminate virtualized TLB misses entirely? (2D → 0D)
[Gandhi et al. -- MICRO'14]
13
Virtualized Page Walk: 2D
[Diagram: the 2D walk starts at gCR3; each of the four guest page table levels requires a nested walk to translate its gPA, and the final guest physical address needs one more nested walk to reach the hPA]
At most 5 + 5 + 5 + 5 + 4 = 24 memory accesses.
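A small counting model of how the worst case reaches 24 accesses, assuming each of the four guest page-table levels needs a full nested walk plus a read of the guest PTE, and the final guest physical address needs one more nested walk (the function names are illustrative):

```c
#include <stdio.h>

#define GUEST_LEVELS 4   /* guest page table levels  */
#define NESTED_LEVELS 4  /* nested page table levels */

/* Every guest-physical address the walker touches (each guest PTE and the
 * final data address) must itself be translated by a nested walk. */
static int nested_walk_cost(void) { return NESTED_LEVELS; }

static int two_d_walk_cost(void)
{
    int accesses = 0;
    for (int level = 0; level < GUEST_LEVELS; level++) {
        accesses += nested_walk_cost(); /* translate the gPA of this guest PTE */
        accesses += 1;                  /* read the guest PTE itself           */
    }
    accesses += nested_walk_cost();     /* translate the final guest PA        */
    return accesses;                    /* 4*(4+1) + 4 = 24                    */
}

int main(void)
{
    printf("worst-case 2D page walk: %d memory accesses\n", two_d_walk_cost());
    return 0;
}
```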
14
Direct Segments Review
[Diagram: (1) conventional paging versus (2) a direct segment defined by BASE, LIMIT, and OFFSET that maps a contiguous virtual range straight to physical memory]
Why a direct segment? It matches big-memory workload needs: no TLB lookups, so no TLB misses.
[Basu and Gandhi et al. -- ISCA'13]
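A hedged C sketch of the direct-segment check (the BASE/LIMIT/OFFSET arithmetic follows the formula on the backup slide, "if BASE ≤ V < LIMIT then P = V + OFFSET"; the struct and function names here are a software model, not the proposed hardware registers):

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

/* Per-core direct-segment state from the ISCA'13 proposal, modeled as fields. */
struct direct_segment {
    uint64_t base;    /* first virtual address covered by the segment */
    uint64_t limit;   /* one past the last covered virtual address    */
    uint64_t offset;  /* added to VA to form PA                       */
};

/* If the VA falls inside [base, limit), translation is pure arithmetic:
 * no TLB lookup and therefore no TLB miss. Otherwise fall back to paging. */
static bool ds_translate(const struct direct_segment *ds, uint64_t va, uint64_t *pa)
{
    if (va >= ds->base && va < ds->limit) {
        *pa = va + ds->offset;
        return true;          /* 0D: zero page-walk memory accesses        */
    }
    return false;             /* caller uses the conventional 1D page walk */
}

int main(void)
{
    struct direct_segment ds = { .base = 0x10000000ULL, .limit = 0x90000000ULL,
                                 .offset = 0x700000000ULL };
    uint64_t pa;
    if (ds_translate(&ds, 0x10000abcULL, &pa))
        printf("segment hit: pa=0x%llx\n", (unsigned long long)pa);
    return 0;
}
```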
15
Direct Segments Review
[Diagram: native paging translates VA → PA with a 1D page walk; a direct segment translates VA → PA with no walk at all (1D → 0D)]
16
Three Virtualized Modes
1. VMM Direct: 2D → 1D
2. Dual Direct: 2D → 0D
3. Guest Direct: 2D → 1D
(Details in the backup slides.)
17
Methodology
Measure the cost of page walks on real hardware:
- Intel 12-core Sandy Bridge with 96 GB memory
- Prototype VMM and OS; emulate the proposed hardware in Linux
- BadgerTrap, released for online analysis of TLB misses [Gandhi et al. -- CAN'14]
- Linear model to predict performance
- Workloads: big-memory workloads, SPEC 2006, BioBench, PARSEC
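The slide does not specify the form of the linear performance model, so the sketch below is only one plausible shape of such a model (overhead predicted from measured page-walk cycles, with the evaluated technique removing some fraction of them); the formula, names, and example numbers are assumptions:

```c
#include <stdio.h>

/* Hypothetical linear model: if a technique eliminates a fraction of the
 * measured page-walk cycles, predicted execution time shrinks by exactly
 * those cycles (everything else unchanged). */
static double predicted_overhead(double total_cycles,
                                 double walk_cycles,
                                 double fraction_walks_eliminated)
{
    double remaining_walk = walk_cycles * (1.0 - fraction_walks_eliminated);
    double base_cycles    = total_cycles - walk_cycles;  /* "useful" work       */
    return remaining_walk / base_cycles;                 /* overhead vs. ideal  */
}

int main(void)
{
    /* Example: 100e9 total cycles, 40e9 in page walks, technique removes 95%. */
    printf("predicted overhead: %.1f%%\n",
           100.0 * predicted_overhead(100e9, 40e9, 0.95));
    return 0;
}
```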
18
Results: VMM Direct achieves near-native performance
19
Results: Dual Direct eliminates most TLB misses, achieving better-than-native performance
20
Results: Guest Direct achieves near-native performance while preserving flexibility at the VMM
21
Results: The same trend holds across all workloads (more workloads in the thesis)
22
I: Summary – Virtualized Direct Segments
Problem: TLB misses in virtual machines; the hardware-virtualized MMU has high overheads
Solution: Segmentation to bypass paging -- extend Direct Segments to virtualization, with three modes offering different tradeoffs
Results: Near-native or better-than-native performance
23
Outline
I: Virtualized Direct Segments (2D → 0D) -- MICRO'14
II: Agile Paging (2D → 1D) -- ISCA'16
III: Redundant Memory Mappings (1D → 0D) -- ISCA'15 + MICRO TOP PICKS'16
24
II: Agile Paging -- Goal
Virtualized Direct Segments sacrificed paging support and required substantial hardware and software changes.
Can we make the virtualized page walk faster while retaining paging? (2D → 1D)
[Gandhi et al. -- ISCA'16]
25
Virtualizing Memory
[Diagram: the application issues a guest virtual address (gVA); the guest OS's page table maps it to a guest physical address (gPA); the VMM's nested page table maps the gPA to a host physical address (hPA) used by the hardware]
26
Virtualizing Memory
Two techniques manage both page tables:
- Nested Paging -- hardware
- Shadow Paging -- software
Evaluated on two axes: page walk latency and page table updates.
27
1. Nested Paging – Hardware
[Diagram: the walk starts at gCR3; each of the four guest page table levels requires a nested walk to translate its gPA, and the final guest physical address needs one more nested walk to reach the hPA]
At most 5 + 5 + 5 + 5 + 4 = 24 memory accesses.
28
2. Shadow Paging – Software
[Diagram: the VMM marks the guest page table read-only and combines it with its nested page table into a shadow page table that maps gVA directly to hPA for the hardware]
29
2. Shadow Paging – Software
[Diagram: on a TLB miss, hardware walks only the shadow page table from sCR3 -- a native-length walk]
At most 4 memory accesses.
30
Page Table Updates
1. Nested Paging: the guest updates its page table in place -- fast, no VMM involvement.
2. Shadow Paging: the guest page table is read-only, so every update traps to the VMM -- a slow, mediated update.
Updates that matter: copy-on-write, page migration, accessed bits, dirty bits, page sharing, working-set sampling, and many more.
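To illustrate the "slow, mediated update" under shadow paging, here is a toy, single-level C model of what a VMM might do when the guest writes its write-protected page table; the tables and handler are hypothetical, not any real hypervisor's code:

```c
#include <stdint.h>
#include <stdio.h>

#define NPAGES 16   /* tiny address spaces for the model */

/* Toy single-level tables: index = virtual/physical page number. */
static uint64_t guest_pt[NPAGES];    /* gVA page -> gPA page (guest-owned) */
static uint64_t nested_pt[NPAGES];   /* gPA page -> hPA page (VMM-owned)   */
static uint64_t shadow_pt[NPAGES];   /* gVA page -> hPA page (VMM-derived) */

/* Hypothetical VMM handler for the write-protection fault taken when the
 * guest writes its own (read-only) page table under shadow paging. */
static void on_guest_pte_write(unsigned gvpn, uint64_t new_gppn)
{
    guest_pt[gvpn]  = new_gppn;               /* 1. emulate the guest's write    */
    shadow_pt[gvpn] = nested_pt[new_gppn];    /* 2. re-merge guest+nested entry  */
    /* Each such update costs a VM exit plus this work: the "slow, mediated
     * update". Nested paging lets the guest write its table in place instead. */
}

int main(void)
{
    nested_pt[3] = 7;            /* VMM placed gPA page 3 in hPA page 7   */
    on_guest_pte_write(0, 3);    /* guest maps gVA page 0 -> gPA page 3   */
    printf("shadow: gVA page 0 -> hPA page %llu\n",
           (unsigned long long)shadow_pt[0]);
    return 0;
}
```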
31
Key Observation
A fully static guest virtual address space favors shadow paging; a fully dynamic one favors nested paging. In reality, only a small fraction of the guest virtual address space is dynamic.
32
Key Observation
[Diagram: most of the guest page table (from gCR3) can be handled in shadow mode, with only parts handled in nested mode]
33
Agile Paging
Start the page walk in shadow mode -- achieving fast TLB misses.
Optionally switch to nested mode -- allowing fast in-place updates.
Two parts of the design: 1. Mechanism, 2. Policy.
34
1. Mechanism
[Diagram: the walk begins in the shadow page table at sCR3; a per-level switch lets it continue through the guest page table (gCR3) plus the nested page table, and guest page table levels still covered by shadow mode are kept read-only]
35
1. Mechanism: Example Page Walk
[Diagram: example walk that switches at level 4 of the guest page table -- the first three levels are served by the shadow page table, the last level in nested mode]
At most 1 + 1 + 1 + 5 = 8 memory accesses.
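A counting sketch of the agile walk; the per-level accounting below is an assumption chosen so that the model reproduces the numbers stated in the talk (24 accesses for a pure nested walk, 8 for this slide's switch point, 4 for a pure shadow walk):

```c
#include <stdio.h>

#define LEVELS 4      /* guest page table levels  */
#define NESTED 4      /* nested page table levels */

/* Counting model for an agile-paging walk with `s` levels served from the
 * shadow page table before switching to nested mode. */
static int agile_walk_cost(int s)
{
    if (s >= LEVELS)
        return LEVELS;                       /* pure shadow: native-length walk   */

    int cost = s;                            /* one access per shadow level       */
    for (int level = s; level < LEVELS; level++) {
        if (level == s && s > 0)
            cost += 1;                       /* shadow portion already located    */
                                             /* this guest-table level in hPA     */
        else
            cost += NESTED + 1;              /* nested walk + guest PTE read      */
    }
    return cost + NESTED;                    /* nested walk of the final gPA      */
}

int main(void)
{
    for (int s = 0; s <= LEVELS; s++)
        printf("switch after %d shadow level(s): %2d accesses\n",
               s, agile_walk_cost(s));       /* prints 24, 16, 12, 8, 4 */
    return 0;
}
```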
36
2. Policy: Shadow → Nested
Start in shadow mode. The first write to the page table traps to the VMM: stay in shadow (1 write). A second write (another VMM trap) switches that part to nested mode, where subsequent writes need no VMM traps.
37
2. Policy: Nested → Shadow
Dirty bits track writes to the guest page table. On a timeout, parts that are not dirty are moved back to shadow mode.
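The switching policy from the last two slides written as a small state machine; the states and events come from the slides, the encoding is illustrative:

```c
#include <stdio.h>

/* Per-page-table-part mode under agile paging. */
enum mode  { SHADOW, SHADOW_ONE_WRITE, NESTED };
enum event { GUEST_PT_WRITE, TIMEOUT_NOT_DIRTY };

/* Policy from the slides:
 *  - Shadow -> (write, VMM trap) -> Shadow after 1 write
 *  - another write (VMM trap)    -> Nested: further writes need no traps
 *  - timeout with no dirty bits  -> move back to Shadow                   */
static enum mode step(enum mode m, enum event e)
{
    switch (m) {
    case SHADOW:           return e == GUEST_PT_WRITE    ? SHADOW_ONE_WRITE : SHADOW;
    case SHADOW_ONE_WRITE: return e == GUEST_PT_WRITE    ? NESTED           : SHADOW_ONE_WRITE;
    case NESTED:           return e == TIMEOUT_NOT_DIRTY ? SHADOW           : NESTED;
    }
    return m;
}

int main(void)
{
    enum mode m = SHADOW;
    m = step(m, GUEST_PT_WRITE);      /* first write: trap, stay in shadow      */
    m = step(m, GUEST_PT_WRITE);      /* second write: trap, switch to nested   */
    m = step(m, TIMEOUT_NOT_DIRTY);   /* quiet period: move back to shadow      */
    printf("final mode: %d (0 = SHADOW)\n", (int)m);
    return 0;
}
```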
38
Results
Legend: B = Unvirtualized, N = Nested Paging, S = Shadow Paging, A = Agile Paging.
Modeled with the BadgerTrap emulator; measured using performance counters.
Solid bottom bar: page walk overhead. Hashed top bar: VMM overheads.
39
Results
Nested Paging has high TLB-miss overheads -- the effect of the longer page walk.
[Graph: nested-paging overheads of 28%, 19%, 18%, and 6% across the shown workloads]
40
Results
Shadow Paging has high overheads from VMM interventions.
[Graph: overheads shown include 28%, 70%, 11%, 19%, 30%, 18%, 6%, 6%]
41
Results
Agile paging consistently performs better than both techniques.
[Graph: overheads shown include 28%, 70%, 11%, 2%, 19%, 30%, 18%, 6%, 2%, 4%, 6%, 3%; THP]
42
II: Summary -- Agile Paging
Problem: Virtualization is valuable but has high overheads for larger workloads (up to 70% slower than native).
Existing choices:
- Nested Paging: slow page walk but fast page table updates
- Shadow Paging: fast page walk but slow page table updates
Can we get the best of both for the same address space (even the same page walk)? Yes: Agile Paging uses shadow paging and sometimes switches to nested paging within the same page walk (at most 4% slower than native).
43
Outline
I: Virtualized Direct Segments (2D → 0D) -- MICRO'14
II: Agile Paging (2D → 1D) -- ISCA'16
III: Redundant Memory Mappings (1D → 0D) -- ISCA'15 + MICRO TOP PICKS'16
44
III: Redundant Memory Mappings -- Goal
TLB reach is limited, and in-memory workload sizes keep increasing.
How can we increase the reach of each TLB entry? (1D → 0D)
[Gandhi et al. -- ISCA'15 and IEEE MICRO TOP PICKS'16]
45
Key Observation
[Diagram: a process's virtual memory mapped onto physical memory]
46
Key Observation
Virtual memory is composed of large contiguous regions -- code, heap, stack, shared libraries -- and they are limited in number: only a handful.
47
Compact Representation: Range Translation
[Diagram: a range translation described by BASE1, LIMIT1, and OFFSET1 mapping a contiguous virtual region to a contiguous physical region]
A range translation is a mapping from contiguous virtual pages to contiguous physical pages with uniform protection.
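A direct C rendering of the range-translation definition on this slide (field and function names are illustrative):

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

/* A range translation: contiguous virtual pages mapped to contiguous
 * physical pages with uniform protection (this slide's definition). */
struct range_translation {
    uint64_t base;      /* first virtual page number in the range        */
    uint64_t limit;     /* one past the last virtual page number         */
    int64_t  offset;    /* physical page number = virtual page + offset  */
    uint32_t prot;      /* uniform protection bits for the whole range   */
};

static bool range_covers(const struct range_translation *r, uint64_t vpn)
{
    return vpn >= r->base && vpn < r->limit;
}

static uint64_t range_translate(const struct range_translation *r, uint64_t vpn)
{
    return (uint64_t)((int64_t)vpn + r->offset);   /* arbitrary reach, O(1) */
}

int main(void)
{
    struct range_translation heap = { .base = 1000, .limit = 5000,
                                      .offset = 250000, .prot = 0x3 };
    uint64_t vpn = 1234;
    if (range_covers(&heap, vpn))
        printf("vpn %llu -> pfn %llu\n", (unsigned long long)vpn,
               (unsigned long long)range_translate(&heap, vpn));
    return 0;
}
```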
48
Redundant Memory Mappings
[Diagram: five range translations covering most of the virtual address space]
Map most of a process's virtual address space redundantly with a modest number of range translations, in addition to the page mappings.
49
Design: Redundant Memory Mappings
Three components:
A. Caching range translations
B. Managing range translations
C. Facilitating range translations
50
A. Caching Range Translations
[Diagram: translation hardware -- L1 DTLB, L2 DTLB, a Range TLB, and an enhanced page table walker, translating virtual pages V1..V12 to physical pages P1..P12]
51
A. Caching Range Translations
[Diagram: an access that hits in the L1 DTLB is translated immediately]
52
A. Caching Range Translations
[Diagram: on an L1 DTLB miss, the L2 DTLB and the Range TLB are probed; an L2 DTLB hit refills the L1 DTLB]
53
A. Caching Range Translations
[Diagram: on an L1 DTLB miss, the Range TLB is probed in parallel with the L2 DTLB; a Range TLB hit refills the L1 DTLB]
54
A. Caching Range Translations
[Diagram: Range TLB detail -- each entry holds BASE_i and LIMIT_i comparators (BASE_i ≤ VA < LIMIT_i), an OFFSET_i, and protection bits; on a hit, the L1 TLB entry generator produces (virtual address + OFFSET) with the range's protection]
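A software model of the Range TLB hit path sketched above: each entry is a BASE/LIMIT comparison, and a hit synthesizes an ordinary 4 KB L1 TLB entry as (virtual page + OFFSET) with the range's protection. The 32-entry, fully associative sizing matches the evaluation assumptions later in the talk; everything else is illustrative, not RTL:

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define RANGE_TLB_ENTRIES 32    /* fully associative, as assumed in the results */

struct range_entry  { uint64_t base, limit; int64_t offset; uint32_t prot; bool valid; };
struct l1_tlb_entry { uint64_t vpn, pfn;    uint32_t prot; };

static struct range_entry range_tlb[RANGE_TLB_ENTRIES];

/* On an L1 TLB miss, probe every range entry in parallel (modeled as a loop).
 * A hit refills the L1 TLB with a per-page entry generated from the range. */
static bool range_tlb_lookup(uint64_t vpn, struct l1_tlb_entry *out)
{
    for (size_t i = 0; i < RANGE_TLB_ENTRIES; i++) {
        const struct range_entry *e = &range_tlb[i];
        if (e->valid && vpn >= e->base && vpn < e->limit) {
            out->vpn  = vpn;
            out->pfn  = (uint64_t)((int64_t)vpn + e->offset);  /* VA + OFFSET   */
            out->prot = e->prot;                               /* uniform prot  */
            return true;    /* refill L1 TLB; no page walk needed               */
        }
    }
    return false;           /* fall through to the page walker                  */
}

int main(void)
{
    range_tlb[0] = (struct range_entry){ .base = 0x100, .limit = 0x500,
                                         .offset = 0x7000, .prot = 0x3, .valid = true };
    struct l1_tlb_entry e;
    if (range_tlb_lookup(0x123, &e))
        printf("generated L1 entry: vpn=0x%llx pfn=0x%llx\n",
               (unsigned long long)e.vpn, (unsigned long long)e.pfn);
    return 0;
}
```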
55
A. Caching Range Translations
[Diagram: if the L1 DTLB, L2 DTLB, and Range TLB all miss, the enhanced page table walker handles the miss]
56
B. Managing Range Translations
All range translations are stored in an OS-managed structure -- the range table -- which is per-process, like the page table.
[Diagram: range table organized as a tree of nodes (RTA..RTG) rooted at the CR-RT register]
57
B. Managing Range Translations
On an L2 + Range TLB miss, which structure should be walked? (A) the page table, (B) the range table, (C) both, or (D) either? Whether the virtual page is part of a range is not known at miss time.
58
B. Managing Range Translations
Redundancy to the rescue: one bit in the page table entry denotes that the page is part of a range.
(1) Walk the page table (from CR3) and insert the translation into the L1 TLB; (2) the application resumes its memory access; (3) if the entry is marked as part of a range, walk the range table (from CR-RT) in the background and insert the range into the Range TLB.
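A sketch of this miss-handling flow, assuming a single software-available "part of a range" bit in the PTE; the walkers here are toy stand-ins, not the real page-table or range-table code:

```c
#include <stdint.h>
#include <stdio.h>

#define PTE_IN_RANGE (1ULL << 11)   /* hypothetical software-available PTE bit */

/* Toy stand-ins for the real structures and walkers in this sketch. */
static uint64_t page_table_walk(uint64_t vpn)
{
    /* Pretend every page maps to frame (vpn + 0x700) and belongs to a range. */
    return ((vpn + 0x700) << 12) | PTE_IN_RANGE | 0x1 /* present */;
}
static void insert_l1_tlb(uint64_t vpn, uint64_t pte) { (void)vpn; (void)pte; }
static void range_table_walk_background(uint64_t vpn)
{
    /* Background work: walk the per-process range table (rooted at CR-RT) and
     * install the covering range translation into the Range TLB. */
    printf("background: range-table walk for vpn 0x%llx\n", (unsigned long long)vpn);
}

/* On an L2 + Range TLB miss: walk the page table first so the application can
 * resume immediately; only if the PTE is marked as part of a range, walk the
 * range table in the background and refill the Range TLB. */
static void handle_tlb_miss(uint64_t vpn)
{
    uint64_t pte = page_table_walk(vpn);     /* (1) page table walk              */
    insert_l1_tlb(vpn, pte);                 /*     refill L1; (2) app resumes   */
    if (pte & PTE_IN_RANGE)
        range_table_walk_background(vpn);    /* (3) background range-table walk  */
}

int main(void) { handle_tlb_miss(0x3); return 0; }
```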
59
C. Facilitating Range Translations
Demand paging: physical pages are allocated on first touch, which does not produce the physical-page contiguity needed to create ranges.
60
C. Facilitating Range Translations
Eager paging: allocate physical pages when the virtual memory is allocated. This increases range sizes and reduces the number of ranges.
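Eager paging is an OS allocation policy; as a rough user-level analogy only (not the thesis's kernel implementation), Linux's MAP_POPULATE flag asks the kernel to back a mapping at allocation time instead of on first touch, which is the same "allocate physical pages when virtual memory is allocated" idea:

```c
#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdio.h>
#include <stddef.h>

int main(void)
{
    size_t len = 64UL << 20;   /* 64 MB */

    /* Demand paging: physical pages arrive one fault at a time, in whatever
     * frames happen to be free, so little physical contiguity results. */
    void *lazy = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    /* Eager-style allocation: MAP_POPULATE asks the kernel to back the whole
     * region up front, giving the allocator a chance to hand out larger
     * contiguous chunks (fewer, bigger ranges). */
    void *eager = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);

    printf("lazy=%p eager=%p\n", lazy, eager);
    munmap(lazy, len);
    munmap(eager, len);
    return 0;
}
```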
61
Results
- 4KB: Baseline using 4KB paging
- THP: Transparent Huge Pages using 2MB paging [Transparent Huge Pages]
- CTLB: Clustered TLB with clusters of 8 4KB entries [HPCA'14]
- DS: Direct Segments [ISCA'13 and MICRO'14]
- RMM: Our proposal, Redundant Memory Mappings [ISCA'15]
62
Performance Results
Assumptions: CTLB is a 512-entry fully-associative clustered TLB; the RMM Range TLB has 32 fully-associative entries; both are accessed in parallel with the L2 TLB.
Measured using performance counters; modeled with the BadgerTrap emulator [Gandhi & Basu et al. -- CAN'14].
Showing 5 of 14 workloads; the rest are in the thesis.
63
Performance Results: Overheads of using 4KB pages are very high
64
Performance Results: Clustered TLB works well but is limited by its 8x reach
65
Performance Results: 2MB pages help with 512x reach, but overheads are still not very low
66
Performance Results: Direct Segments are perfect for some workloads but not all
67
Performance Results: RMM achieves low overheads robustly across all workloads
68
III: Summary -- RMM
Problem: Native virtual memory overheads are high.
Proposal: Redundant Memory Mappings (1D → 0D)
- A compact representation called the range translation: an arbitrarily large contiguous mapping
- Range translations are effectively cached, managed, and facilitated
- Retains the flexibility of 4KB paging
Result: Reduced overheads of native virtual memory to less than 1%.
69
Outline
I: Virtualized Direct Segments (2D → 0D) -- MICRO'14
II: Agile Paging (2D → 1D) -- ISCA'16
III: Redundant Memory Mappings (1D → 0D) -- ISCA'15 + MICRO TOP PICKS'16
70
Conclusion
Problem: Paging is valuable but costly to use today, and virtualization with paging has a much higher cost.
Opportunity: Reduce the dimensionality of the page walk.
Result: Almost zero overheads of virtual memory.
71
Publications
- Jayneel Gandhi, Mark D. Hill, Michael M. Swift. Agile Paging for Efficient Virtualized Page Walks. ISCA 2016.
- Jayneel Gandhi, Vasileios Karakostas, Furkan Ayar, Adrián Cristal, Mark D. Hill, Kathryn S. McKinley, Mario Nemirovsky, Michael M. Swift, Osman Ünsal. Redundant Memory Mappings for Fast Access to Large Memories. IEEE MICRO TOP PICKS 2016.
- Vasileios Karakostas, Jayneel Gandhi, Adrián Cristal, Mark D. Hill, Kathryn S. McKinley, Mario Nemirovsky, Michael M. Swift, Osman Ünsal. Energy-Efficient Address Translation. HPCA 2016.
- Vasileios Karakostas, Jayneel Gandhi, Furkan Ayar, Adrián Cristal, Mark D. Hill, Kathryn S. McKinley, Mario Nemirovsky, Michael M. Swift, Osman Ünsal. Redundant Memory Mappings for Fast Access to Large Memories. ISCA 2015; accepted for IEEE MICRO TOP PICKS 2016.
- Jayneel Gandhi, Arkaprava Basu, Mark D. Hill, Michael M. Swift. Efficient Memory Virtualization: Reducing Dimensionality of Nested Page Walks. MICRO 2014; received an honorable mention for IEEE MICRO TOP PICKS 2015.
- Jayneel Gandhi, Arkaprava Basu, Mark D. Hill, Michael M. Swift. BadgerTrap: A Tool to Instrument x86-64 TLB Misses. CAN 2014.
- Arkaprava Basu, Jayneel Gandhi, Jichuan Chang, Mark D. Hill, Michael M. Swift. Efficient Virtual Memory for Big Memory Servers. ISCA 2013.
- Niket Kumar Choudhary, Salil V. Wadhavkar, Tanmay A. Shah, Hiran Mayukh, Jayneel Gandhi, Brandon H. Dwiel, Sandeep Navada, Hashem Hashemi Najaf-abadi, Eric Rotenberg. FabScalar: Automating Superscalar Core Design. IEEE MICRO TOP PICKS 2012.
- Niket Kumar Choudhary, Salil V. Wadhavkar, Tanmay A. Shah, Hiran Mayukh, Jayneel Gandhi, Brandon H. Dwiel, Sandeep Navada, Hashem Hashemi Najaf-abadi, Eric Rotenberg. FabScalar: Composing Synthesizable RTL Designs of Arbitrary Cores within a Canonical Superscalar Template. ISCA 2011; accepted for IEEE MICRO TOP PICKS 2012.
72
Collaborators and Contributors
Mark D. Hill, UW-Madison
Michael M. Swift, UW-Madison
Arkaprava Basu, AMD Research
Vasilis Karakostas, BSC Barcelona
Adrian Cristal, BSC Barcelona
Mario Nemirovsky, BSC Barcelona
Osman Unsal, BSC Barcelona
Kathryn S. McKinley, Microsoft Research
Benjamin Serebrin, Google
And many more…
73
Questions?
74
Backup Slides
75
Three Virtualized Modes
1. VMM Direct (2D → 1D) -- Features: maps almost the whole gPA; 4 memory accesses; near-native performance; helps any application.
76
Three Virtualized Modes
2. Dual Direct (2D → 0D) -- Features: 0 memory accesses; better-than-native performance; suits big-memory applications.
77
Three Virtualized Modes
3. Guest Direct (2D → 1D) -- Features: 4 memory accesses; suits big-memory applications; flexible enough to provide VMM services.
78
Compatibility -- 1. VMM Direct: application and guest OS unmodified; VMM and hardware modified.
79
Compatibility -- 2. Dual Direct: big-memory application with minimal guest OS changes; VMM and hardware modified.
80
Compatibility -- 3. Guest Direct: big-memory application with minimal guest OS changes; minimal VMM changes; hardware modified.
81
Tradeoffs: Summary
Properties compared across Base Virtualized, VMM Direct, Dual Direct, and Guest Direct:
- Dimension / memory accesses: 2D/24 (Base Virtualized), 1D/4 (VMM Direct), 0D/0 (Dual Direct), 1D/4 (Guest Direct)
- Guest OS modifications: none vs. required
- VMM modifications: none vs. minimal
- Applications: any vs. big-memory only
- VMM services allowed: yes vs. no
82
Results: 2MB and 1GB at VMM
83
0. Page-based Translation
[Diagram: per-page TLB entries (VPN0 → PFN0) mapping virtual to physical memory]
Benefits: most flexible; only 4KB alignment required.
Disadvantages: requires more TLB entries; limited TLB reach.
84
1. Multipage Mapping (Sub-blocked TLB / CoLT / Clustered TLB)
[Diagram: one TLB entry maps a small cluster, VPN(0-3) → PFN(0-3), with a bitmap]
Benefits: increases TLB reach by 8x-32x; allows clustering of pages.
Disadvantages: requires size alignment; contiguity required.
[ASPLOS'94, MICRO'12 and HPCA'14]
85
2. Large Pages
[Diagram: one large-page TLB entry (VPN0 → PFN0)]
Benefits: increases TLB reach; 2MB and 1GB page sizes in x86-64.
Disadvantages: size alignment restriction; contiguity required.
[Transparent Huge Pages and libhugetlbfs]
86
3. Direct Segments
[Diagram: one (BASE, LIMIT, OFFSET) segment; if BASE ≤ V < LIMIT, then P = V + OFFSET]
Benefits: unlimited reach; no size alignment required.
Disadvantages: one direct segment per program; not transparent to the application; applicable only to big-memory workloads.
[Basu & Gandhi et al. -- ISCA'13; Gandhi & Basu et al. -- MICRO'14]
87
Can we get the best of many worlds?
Desired properties: flexible alignment, arbitrary reach, multiple entries, transparency to applications, applicability to all workloads. Multipage Mapping, Large Pages, and Direct Segments each provide only some of these; our proposal aims to provide all of them.
88
Why low overheads? Virtual Contiguity

Benchmark    # of ranges (4KB + 2MB THP paging)    # of ideal RMM ranges to cover >99% of memory
cactusADM    112                                    49
canneal      77                                     4
graph500     86                                     3
mcf          55                                     1
tigr         16                                     -

Only 10s-100s of ranges per application (versus the 1000s of TLB entries required); only a few ranges cover more than 99% of memory.
89
Transparent Huge Pages (2MB)
Legend: B = Unvirtualized, N = Nested Paging, S = Shadow Paging, A = Agile Paging.
[Graph: overheads shown include 68%, 13%, 14%, 4%, 2%, 14%, 5%, 2%, 10%, 6%, 3%, 2%]
Solid bottom bar: page walk overhead. Hashed top bar: VMM overheads.