Executive Summary
Problem: overheads of virtual memory can be high. 1D (native): 42%; 2D (virtualized): 97%.
Idea: reduce the dimensionality of page walks.
I: Virtualized Direct Segments (2D → 0D): reduces overheads of virtual memory (virtualized) to less than 1%.
II: Agile Paging (2D → 1D): reduces overheads of virtual memory (virtualized) to less than 4% over native.
III: Redundant Memory Mappings (1D → 0D): reduces overheads of virtual memory (native) to less than 1%.
PhD Defense
Virtual Memory Refresher Virtual Address Space Page Table Physical Memory Process 1 Challenge: How to reduce costly page table walks? Process 2 TLB (Translation Lookaside Buffer) PhD Defense
Two Technology Trends
TLB reach is limited:
Year | Processor | L1 DTLB entries
1999 | Pentium III | 72
2001 | Pentium 4 | 64
2008 | Nehalem | 96
2012 | Ivy Bridge | 100
2014 | Haswell |
2015 | Skylake |
*Inflation-adjusted 2011 USD, from: jcmit.com
PhD Defense
Use of Vast Memory Big-memory applications (ever increasing data sets) PhD Defense
Native x86 Translation: 1D
Virtual Address (VA) → Page Table (rooted at CR3) → Physical Address (PA)
Up to 4 memory accesses
PhD Defense
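A native 1D walk is just four dependent page-table reads. As a rough sketch (toy fanout and an invented mapping, not the real x86-64 table format), the pointer chasing looks like:

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of a native (1D) 4-level page walk. Real x86-64 tables
 * have 512 entries indexed by 9-bit VA slices; the fanout and the
 * translation below are invented purely to show the pointer chasing. */
#define FANOUT 4
#define LEVELS 4

typedef struct Table {
    struct Table *next[FANOUT];   /* non-leaf entries: next-level table */
    uint64_t pfn[FANOUT];         /* leaf entries: physical frame number */
} Table;

/* Walk one index per level; count one memory access per level. */
static uint64_t walk(Table *root, const unsigned idx[LEVELS], int *accesses) {
    Table *t = root;
    *accesses = 0;
    for (int level = 0; level < LEVELS - 1; level++) {
        (*accesses)++;                    /* read a page-table entry */
        t = t->next[idx[level]];
    }
    (*accesses)++;                        /* leaf read yields the PFN */
    return t->pfn[idx[LEVELS - 1]];
}

/* Build a single translation path: indices {1,2,3,0} map to PFN 42. */
static uint64_t demo_translate(int *accesses) {
    static Table l0, l1, l2, l3;
    l0.next[1] = &l1; l1.next[2] = &l2; l2.next[3] = &l3;
    l3.pfn[0] = 42;
    const unsigned idx[LEVELS] = {1, 2, 3, 0};
    return walk(&l0, idx, accesses);
}
```

A TLB miss touches exactly one entry per level, matching the slide's "up to 4 memory accesses".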
Virtual Machines
Enabling many more startups
PhD Defense
Why do we need more dimensions? Guest Virtual Address Guest Physical Address Host Physical Address 1 2 gVA gPA hPA Guest Page Table Nested Page Table PhD Defense
Inception: Effects of Adding Dimensions
APP on OS on VMM on Hardware:
2D (Jog): 70% overhead
1D (Run): 42% overhead
0D (Sprint): 0% overhead
PhD Defense
Goal Can we have “zero” overheads of virtualizing memory at any level of virtualization? PhD Defense
Reaching the Goal
I: Virtualized Direct Segments (2D → 0D) [MICRO’14]
II: Agile Paging (2D → 1D) [ISCA’16]
III: Redundant Memory Mappings (1D → 0D) [ISCA’15 + MICRO TOP PICKS’16]
(Figure axes: Virtual Machine / Native Machine vs. Direct Execution)
PhD Defense
Outline
I: Virtualized Direct Segments (2D → 0D) [MICRO’14]
II: Agile Paging (2D → 1D) [ISCA’16]
III: Redundant Memory Mappings (1D → 0D) [ISCA’15 + MICRO TOP PICKS’16]
(Figure axes: Virtual Machine / Native Machine vs. Direct Execution)
PhD Defense
I: Virtualized Direct Segments – Goal
TLB reach is limited and the cost of a TLB miss with virtualization is very high.
Can we eliminate virtualized TLB misses entirely? (2D → 0D)
[Gandhi et al. – MICRO’14]
PhD Defense
Virtualized Page Walk: 2D
gVA → Guest Page Table + Nested Page Table (rooted at gCR3) → hPA: a longer page walk
At most 5 + 5 + 5 + 5 + 4 = 24 memory accesses
PhD Defense
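The 24-access bound can be derived with a small counting helper (a counting model only; the function and its parameters are invented for illustration): each of the 4 guest levels holds a guest-physical pointer that needs a full 4-level nested walk before the guest entry itself can be read (4 + 1 = 5), and the final guest PA needs one last nested walk (4 more).

```c
#include <assert.h>

/* Counting model for a 2D nested page walk. Each guest level costs
 * nested_levels accesses to translate its gPA pointer plus 1 access
 * to read the guest entry; the final gPA costs one more nested walk. */
static int nested_walk_accesses(int guest_levels, int nested_levels) {
    int per_guest_level = nested_levels + 1;
    return guest_levels * per_guest_level + nested_levels;
}
```

With no nesting (nested_levels = 0) the formula collapses to the native 4-access walk.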
Direct Segments Review
1. Conventional Paging  2. Direct Segment: BASE, LIMIT, and OFFSET map a virtual address to a physical address
Why Direct Segments? Matches big-memory workload needs: no TLB lookups => no TLB misses
[Basu and Gandhi et al. – ISCA’13]
PhD Defense
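The direct-segment check itself is tiny. A minimal sketch (register values in the test are invented, and the paging fallback is left out):

```c
#include <assert.h>
#include <stdint.h>

/* Direct segment: if BASE <= VA < LIMIT, then PA = VA + OFFSET,
 * with no TLB lookup at all; anything outside falls back to paging. */
typedef struct { uint64_t base, limit, offset; } DirectSeg;

/* Returns 1 and fills *pa on a segment hit; 0 means "use paging". */
static int ds_translate(const DirectSeg *ds, uint64_t va, uint64_t *pa) {
    if (va >= ds->base && va < ds->limit) {
        *pa = va + ds->offset;
        return 1;
    }
    return 0;
}
```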
Direct Segments Review VA VA PA PA Base Native 1D Direct Segments 1D 0D PhD Defense
Three Virtualized Modes (Details)
1. VMM Direct (2D → 1D)
2. Dual Direct (2D → 0D)
3. Guest Direct (2D → 1D)
PhD Defense
Methodology
Measure cost of page walks on real hardware: Intel 12-core Sandy Bridge with 96GB memory
Prototype VMM and OS; emulate hardware in Linux
BadgerTrap for online analysis of TLB misses
Released: http://research.cs.wisc.edu/multifacet/BadgerTrap
Linear model to predict performance
Workloads: big-memory workloads, SPEC 2006, BioBench, PARSEC
[Gandhi et al. – CAN’14]
PhD Defense
Results
VMM Direct achieves near-native performance
PhD Defense
Results
Dual Direct eliminates most of the TLB misses, achieving better-than-native performance
PhD Defense
Results Guest Direct achieves near-native performance while providing flexibility at VMM PhD Defense
Results Same trend across all workloads (More workloads in the thesis) PhD Defense
I: Summary – Virtualized Direct Segments Problem: TLB misses in virtual machines Hardware-virtualized MMU has high overheads Solution: segmentation to bypass paging Extend Direct Segments for virtualization Three modes with different tradeoffs Results Near- or better-than-native performance PhD Defense
Outline
I: Virtualized Direct Segments (2D → 0D) [MICRO’14]
II: Agile Paging (2D → 1D) [ISCA’16]
III: Redundant Memory Mappings (1D → 0D) [ISCA’15 + MICRO TOP PICKS’16]
(Figure axes: Virtual Machine / Native Machine vs. Direct Execution)
PhD Defense
II: Agile Paging – Goal
Virtualized Direct Segments sacrificed paging support and required substantial hardware and software support.
Can we make the virtualized page walk faster while retaining paging? (2D → 1D)
[Gandhi et al. – ISCA’16]
PhD Defense
Virtualizing Memory
APP: gVA (Guest Virtual Address) → Guest OS: Guest Page Table → gPA (Guest Physical Address) → VMM: Nested Page Table → hPA (Host Physical Address)
PhD Defense
Virtualizing Memory
Two techniques to manage both page tables (gVA → gPA via the Guest Page Table; gPA → hPA via the Nested Page Table):
1. Nested Paging (hardware)
2. Shadow Paging (software)
Evaluated on two axes: page walk latency and page table updates
PhD Defense
1. Nested Paging – Hardware
gVA → Guest Page Table + Nested Page Table (rooted at gCR3) → hPA: a longer page walk
At most 5 + 5 + 5 + 5 + 4 = 24 memory accesses
PhD Defense
2. Shadow Paging – Software
APP: gVA → Guest OS: Guest Page Table (kept read-only) → gPA
VMM: combines the Guest and Nested Page Tables into a Shadow Page Table → hPA
PhD Defense
2. Shadow Paging – Software
gVA → Shadow Page Table (rooted at sCR3) → hPA: a shorter page walk
At most 4 memory accesses
PhD Defense
Page Table Updates
1. Nested Paging: in-place fast updates to the guest page table
2. Shadow Paging: slow mediated updates; each write to the read-only guest page table traps to the VMM to update the shadow page table
Updates include: copy-on-write, page migration, accessed bits, dirty bits, page sharing, working set sampling, and many more
PhD Defense
Key Observation
If the guest virtual address space were fully static, Shadow Paging would be preferred; if fully dynamic, Nested Paging would be preferred.
Reality: only a small fraction of the address space is dynamic.
PhD Defense
Key Observation Guest Page Table gCR3 Nested Shadow PhD Defense 35
Agile Paging Start page walk in shadow mode -- Achieving fast TLB misses Optionally switch to nested mode -- Allowing fast in-place updates Two parts of design: 1. Mechanism 2. Policy PhD Defense 36
1. Mechanism
gVA → gPA → hPA via the Guest Page Table and Nested Page Table (rooted at gCR3), with a Shadow Page Table (rooted at sCR3) covering the upper levels; the shadowed portion of the guest page table is kept read-only.
PhD Defense
1. Mechanism: Example Page Walk
Starting from sCR3 and switching modes at level 4 of the guest page table:
At most 1 + 1 + 1 + 5 = 8 memory accesses
PhD Defense
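One plausible cost model consistent with the slide's numbers (a sketch, not the paper's exact accounting; the function and its parameter are invented): if the shadow table covers the first k of 4 guest levels, each covered level costs one access, and the remaining levels pay nested-mode prices.

```c
#include <assert.h>

/* Hypothetical access-count model for an agile-paging walk where the
 * shadow table covers the first k of 4 guest levels (1 access each).
 * Extremes match the slides: k=4 -> 4 (pure shadow), k=0 -> 24 (pure
 * nested), and switching at the last guest level (k=3) -> 8. */
static int agile_walk_accesses(int k) {
    const int levels = 4, nested = 4;
    if (k >= levels)                       /* pure shadow walk */
        return levels;
    if (k == 0)                            /* pure nested walk */
        return levels * (nested + 1) + nested;
    /* k shadow reads; the first post-switch guest table is reachable
     * through a host-physical pointer (1 read); each deeper guest
     * level costs nested+1; the final gPA costs one nested walk. */
    return k + 1 + (levels - 1 - k) * (nested + 1) + nested;
}
```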
2. Policy: Shadow → Nested
Start in shadow mode. On a write to the page table (VMM trap), remain in shadow mode (1 write); on another write (VMM trap), switch to nested mode, so subsequent writes take no VMM traps.
PhD Defense
2. Policy: Nested → Shadow
On a timeout, move non-dirty parts of the page table back to shadow mode, using dirty bits to track writes to the guest page table.
PhD Defense
Results
B: Unvirtualized  N: Nested Paging  S: Shadow Paging  A: Agile Paging
Modeled based on emulator: BadgerTrap; measured using performance counters
Solid bottom bar: page walk overhead; hashed top bar: VMM overheads
PhD Defense
Results
B: Unvirtualized  N: Nested Paging  S: Shadow Paging  A: Agile Paging
Nested Paging has high overheads from TLB misses: the effect of the longer page walk (28%, 19%, 18%, 6%)
Solid bottom bar: page walk overhead; hashed top bar: VMM overheads
PhD Defense
Results
B: Unvirtualized  N: Nested Paging  S: Shadow Paging  A: Agile Paging
Shadow Paging has high overheads from VMM interventions (28%, 70%, 11%, 19%, 30%, 18%, 6%, 6%)
Solid bottom bar: page walk overhead; hashed top bar: VMM overheads
PhD Defense
Results
B: Unvirtualized  N: Nested Paging  S: Shadow Paging  A: Agile Paging
Agile Paging consistently performs better than both techniques (28%, 70%, 11%, 2%, 19%, 30%, 18%, 6%, 2%, 4%, 6%, 3%; with THP)
Solid bottom bar: page walk overhead; hashed top bar: VMM overheads
PhD Defense
II: Summary – Agile Paging
Problem: virtualization is valuable but has high overheads with larger workloads (at most 70% slower than native)
Existing choices:
Nested Paging: slow page walk but fast page table updates
Shadow Paging: fast page walk but slow page table updates
Can we get the best of both for the same address space (or even the same page walk)?
Yes, Agile Paging: use shadow paging and sometimes switch to nested paging within the same page walk (at most 4% slower than native)
PhD Defense
Outline
I: Virtualized Direct Segments (2D → 0D) [MICRO’14]
II: Agile Paging (2D → 1D) [ISCA’16]
III: Redundant Memory Mappings (1D → 0D) [ISCA’15 + MICRO TOP PICKS’16]
(Figure axes: Virtual Machine / Native Machine vs. Direct Execution)
PhD Defense
III: Redundant Memory Mappings – Goal
TLB reach is limited and in-memory workload size is increasing.
How can we increase the reach of each TLB entry? (1D → 0D)
[Gandhi et al. – ISCA’15 and IEEE MICRO TOP PICKS’16]
PhD Defense
Key Observation Virtual Memory Physical Memory PhD Defense
Key Observation
Large contiguous regions of virtual memory: code, heap, stack, shared libraries
Limited in number: only a handful
PhD Defense
Compact Representation: Range Translation
A range translation is a mapping between contiguous virtual pages and contiguous physical pages with uniform protection, described by BASE, LIMIT, and OFFSET.
PhD Defense
Redundant Memory Mappings Range Translation 3 Virtual Memory Range Translation 2 Range Translation 4 Range Translation 1 Range Translation 5 Physical Memory Map most of process’s virtual address space redundantly with modest number of range translations in addition to page mappings PhD Defense
Design: Redundant Memory Mappings Three Components: A. Caching Range Translations B. Managing Range Translations C. Facilitating Range Translations PhD Defense
A. Caching Range Translations V47 …………. V12 L1 DTLB L2 DTLB Range TLB Enhanced Page Table Walker Page Table Walker P47 …………. P12 PhD Defense
A. Caching Range Translations V47 …………. V12 Hit L1 DTLB L2 DTLB Range TLB Enhanced Page Table Walker P47 …………. P12 PhD Defense
A. Caching Range Translations V47 …………. V12 Miss L1 DTLB Refill L2 DTLB Range TLB Hit Enhanced Page Table Walker P47 …………. P12 PhD Defense
A. Caching Range Translations V47 …………. V12 Refill Miss L1 DTLB L2 DTLB Range TLB Hit Enhanced Page Table Walker P47 …………. P12 PhD Defense
A. Caching Range Translations Miss V47 …………. V12 P47 …………. P12 L1 DTLB Range TLB L2 DTLB Hit Refill Entry 1 BASE 1 ≤ > LIMIT 1 OFFSET 1 Protection 1 Entry N BASE N ≤ > LIMIT N OFFSET N Protection N L1 TLB Entry Generator Logic: (Virtual Address + OFFSET) Protection PhD Defense
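In software terms, the Range TLB comparison drawn on the slide can be sketched like this (entry count and contents are invented; real hardware compares all entries in parallel rather than looping):

```c
#include <assert.h>
#include <stdint.h>

#define RANGE_TLB_ENTRIES 4

/* One range translation: BASE/LIMIT bound the virtual range, OFFSET
 * gives the physical placement, prot is the uniform protection. */
typedef struct { uint64_t base, limit, offset; unsigned prot; int valid; }
    RangeEntry;

typedef struct { RangeEntry e[RANGE_TLB_ENTRIES]; } RangeTLB;

/* On a hit, generate the L1 TLB entry as (VA + OFFSET) + protection. */
static int range_tlb_lookup(const RangeTLB *t, uint64_t va,
                            uint64_t *pa, unsigned *prot) {
    for (int i = 0; i < RANGE_TLB_ENTRIES; i++) {
        const RangeEntry *r = &t->e[i];
        if (r->valid && va >= r->base && va < r->limit) {
            *pa = va + r->offset;
            *prot = r->prot;
            return 1;                      /* hit: refill the L1 DTLB */
        }
    }
    return 0;                              /* miss: walk the tables */
}
```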
A. Caching Range Translations V47 …………. V12 Miss L1 DTLB L2 DTLB Range TLB Miss Miss Enhanced Page Table Walker P47 …………. P12 PhD Defense
B. Managing Range Translations
Stores all the range translations in an OS-managed structure, the Range Table (rooted at CR-RT); per-process, like the page table.
PhD Defense
B. Managing Range Translations
On an L2+Range TLB miss, which structure should be walked? A) Page Table B) Range Table C) Both A) and B) D) Either?
Whether a virtual page is part of a range is not known at miss time.
PhD Defense
B. Managing Range Translations
Redundancy to the rescue: one bit in the page table entry denotes that the page is part of a range.
1. Page table walk (CR3): insert into L1 TLB
2. Application resumes memory access
3. Range table walk (CR-RT) in the background: insert into Range TLB
PhD Defense
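The three steps above can be sketched as follows (the names and the counter standing in for the background walker are invented for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* RMM miss handling: the PTE carries one extra bit, "part of a
 * range". The page-table walk refills the L1 TLB and the application
 * resumes immediately; only when the bit is set is a range-table
 * walk scheduled in the background to refill the Range TLB. */
typedef struct { uint64_t pfn; int part_of_range; } PTE;

static int background_range_walks = 0;  /* stands in for the HW walker */

static uint64_t handle_tlb_miss(const PTE *pte) {
    if (pte->part_of_range)
        background_range_walks++;   /* off the critical path */
    return pte->pfn;                /* L1 TLB refill; app resumes */
}
```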
C. Facilitating Range Translations Demand Paging Virtual Memory Physical Memory Does not facilitate physical page contiguity for range creation PhD Defense
C. Facilitating Range Translations Eager Paging Virtual Memory Physical Memory Allocate physical pages when virtual memory is allocated Increases range sizes Reduces number of ranges PhD Defense
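For intuition only: stock Linux can approximate eager allocation with mmap's MAP_POPULATE flag, which prefaults the whole mapping up front instead of on first touch; unlike RMM's eager paging, though, it does not guarantee that the backing physical pages are contiguous.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Allocate 'bytes' of anonymous memory and ask the kernel to
 * populate (prefault) it immediately rather than on first touch. */
static void *alloc_populated(size_t bytes) {
    void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}
```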
Results 4KB: Baseline using 4KB paging THP: Transparent Huge Pages using 2MB paging [Transparent Huge Pages] CTLB: Clustered TLB with cluster of 8 4KB entries [HPCA’14] DS: Direct Segments [ISCA’13 and MICRO’14] RMM: Our proposal: Redundant Memory Mappings [ISCA’15] PhD Defense
Performance Results
Assumptions: CTLB: 512-entry fully associative; RMM: 32-entry fully associative; both looked up in parallel with L2
Measured using performance counters; modeled based on emulator: BadgerTrap
5/14 workloads shown; rest in thesis
BadgerTrap: [Gandhi & Basu et al. – CAN’14]
PhD Defense
Performance Results
Overheads of using 4KB pages are very high
PhD Defense
Performance Results
Clustered TLB works well, but is limited by its 8x reach
PhD Defense
Performance Results
2MB pages help with 512x reach, but overheads are still not very low
PhD Defense
Performance Results
Direct Segments are perfect for some but not all workloads
PhD Defense
Performance Results
RMM achieves low overheads robustly across all workloads
PhD Defense
III: Summary – RMM
Problem: virtual memory (native) overheads are high
Proposal: Redundant Memory Mappings (1D → 0D)
Compact representation called range translation: an arbitrarily large contiguous mapping
Effectively cached, managed, and facilitated range translations
Retains the flexibility of 4KB paging
Result: reduced overheads of native virtual memory to less than 1%
PhD Defense
Outline
I: Virtualized Direct Segments (2D → 0D) [MICRO’14]
II: Agile Paging (2D → 1D) [ISCA’16]
III: Redundant Memory Mappings (1D → 0D) [ISCA’15 + MICRO TOP PICKS’16]
(Figure axes: Virtual Machine / Native Machine vs. Direct Execution)
PhD Defense
Conclusion Problem Paging is valuable but costly to use today Virtualization with paging has much higher cost Opportunity Reduce dimensionality of page walk Result Almost zero-overheads of virtual memory PhD Defense
Publications Jayneel Gandhi, Mark D. Hill, Michael M. Swift, Agile Paging for Efficient Virtualized Page Walks, ISCA 2016. Jayneel Gandhi, Vasileios Karakostas, Furkan Ayar, Adrián Cristal, Mark D. Hill, Kathryn S. McKinley, Mario Nemirovsky, Michael M. Swift, Osman Ünsal, Redundant Memory Mappings for Fast Access to Large Memories, IEEE MICRO TOP PICKS 2016. Vasileios Karakostas, Jayneel Gandhi, Adrián Cristal, Mark D. Hill, Kathryn S. McKinley, Mario Nemirovsky, Michael M. Swift, Osman Ünsal, Energy-Efficient Address Translation, HPCA 2016. Vasileios Karakostas, Jayneel Gandhi, Furkan Ayar, Adrián Cristal, Mark D. Hill, Kathryn S. McKinley, Mario Nemirovsky, Michael M. Swift, Osman Ünsal, Redundant Memory Mappings for Fast Access to Large Memories, ISCA 2015. Accepted for IEEE MICRO TOP PICKS 2016 Jayneel Gandhi, Arkaprava Basu, Mark D. Hill, Michael M. Swift, Efficient Memory Virtualization: Reducing Dimensionality of Nested Page Walks, MICRO 2014. Received honorable mention on IEEE MICRO TOP PICKS 2015. Jayneel Gandhi, Arkaprava Basu, Mark D. Hill, Michael M. Swift, BadgerTrap: A Tool to Instrument x86-64 TLB Misses, CAN 2014. Arkaprava Basu, Jayneel Gandhi, Jichuan Chang, Mark D. Hill, Michael M. Swift, Efficient Virtual Memory for Big Memory Servers, ISCA 2013. Niket Kumar Choudhary, Salil V. Wadhavkar, Tanmay A. Shah, Hiran Mayukh, Jayneel Gandhi, Brandon H. Dwiel, Sandeep Navada, Hashem Hashemi Najaf-abadi, Eric Rotenberg, FabScalar: Automating Superscalar Core Design, IEEE MICRO TOP PICKS 2012. Niket Kumar Choudhary, Salil V. Wadhavkar, Tanmay A. Shah, Hiran Mayukh, Jayneel Gandhi, Brandon H. Dwiel, Sandeep Navada, Hashem Hashemi Najaf-abadi, Eric Rotenberg, FabScalar: Composing Synthesizable RTL Designs of Arbitrary Cores within a Canonical Superscalar Template, ISCA 2011. Accepted for IEEE MICRO TOP PICKS 2012 PhD Defense
Collaborators and Contributors Mark D. Hill, UW-Madison Michael M. Swift, UW-Madison Arkaprava Basu, AMD Research Vasilis Karakostas, BSC Barcelona Adrian Cristal, BSC Barcelona Mario Nemirovsky, BSC Barcelona Osman Unsal, BSC Barcelona Kathryn S. McKinley, Microsoft Research Benjamin Serebrin, Google And many more… PhD Defense
Questions ? PhD Defense
Backup Slides PhD Defense
Three Virtualized Modes 1 Features Maps almost whole gPA 4 memory accesses Near-native performance Helps any application VMM Direct 2D 1D PhD Defense
Three Virtualized Modes 1 2 Features 0 memory accesses Better-than native performance Suits big-memory applications VMM Direct 2D 1D Dual Direct 2D 0D PhD Defense
Three Virtualized Modes 1 2 3 Features 4 memory accesses Suits big-memory applications Flexible to provide VMM services VMM Direct 2D 1D Dual Direct 2D 0D Guest Direct 2D 1D PhD Defense
Compatibility VMM Hardware 1 OS Unmodified Modified (VMM Direct) APP PhD Defense
Compatibility VMM VMM Hardware Hardware 1 2 OS OS Minimal Unmodified APP APP Big-Memory OS OS Modified VMM VMM Modified Hardware (VMM Direct) Hardware (Dual Direct) PhD Defense
Compatibility VMM VMM VMM Hardware Hardware Hardware 1 2 3 OS OS OS Minimal Minimal Unmodified APP APP Big-Memory Big-Memory Modified OS OS OS Modified VMM VMM Minimal VMM Modified Modified Hardware (VMM Direct) Hardware (Dual Direct) Hardware (Guest Direct) PhD Defense
Dimension/ Memory accesses Guest OS modifications Tradeoffs: Summary Back Properties Base Virtualized VMM Direct Dual Direct Guest Direct Dimension/ Memory accesses 2D/24 1D/4 0D/0 Guest OS modifications none required VMM modifications minimal Applications Any Big-memory VMM services allowed Yes No
Results: 2MB and 1GB at VMM
0. Page-based Translation Virtual Memory TLB VPN0 PFN0 Benefits + Most flexible + Only 4KB alignment required Disadvantage ─ Requires more TLB entries ─ Limited TLB reach Physical Memory PhD Defense
1. Multipage Mapping Sub-blocked TLB/CoLT Clustered TLB Virtual Memory VPN(0-3) PFN(0-3) Map Bitmap Benefits + Increased the TLB reach by 8X-32X + Clustering of pages allowed Disadvantages ─ Requires size alignment ─ Contiguity required Physical Memory [ASPLOS’94, MICRO’12 and HPCA’14] PhD Defense
2. Large Pages Large Page TLB Virtual Memory VPN0 PFN0 Benefits + Increases TLB reach + 2MB and 1GB page size in x86-64 Disadvantages ─ Size alignment restriction ─ Contiguity required Physical Memory [Transparent Huge Pages and libhugetlbfs] PhD Defense
3. Direct Segments Direct Segment BASE LIMIT Virtual Memory (BASE,LIMIT) OFFSET OFFSET If BASE ≤ V < LIMIT P = V + OFFSET Benefits + Unlimited reach + No size alignment required Disadvantages ─ One direct segment per program ─ Not transparent to application ─ Applicable to big-memory workloads Physical Memory [Basu & Gandhi et al. -- ISCA’13; Gandhi & Basu et al. -- MICRO’14] PhD Defense
Can we get best of many worlds? Multipage Mapping Large Pages Direct Segments Our Proposal Flexible alignment Arbitrary reach Multiple entries Transparent to applications Applicable to all workloads PhD Defense
Why low overheads? Virtual Contiguity
Benchmark | # of ranges: paging (4KB + 2MB THP) | # of ranges: ideal | # of RMM ranges to cover more than 99% of memory
cactusADM | 1365 + 333 | 112 | 49
canneal | 10016 + 359 | 77 | 4
graph500 | 8983 + 35725 | 86 | 3
mcf | 1737 + 839 | 55 | 1
tigr | 28299 + 235 | 16 |
Only 10s-100s of ranges per application where 1000s of TLB entries would be required; only a few ranges for more than 99% coverage
PhD Defense
Transparent Huge Page (2MB) B: Unvirtualized N: Nested Paging S: Shadow Paging A: Agile Paging 68% 13% 14% 4% 2% 14% 5% 2% 10% 6% 3% 2% Solid bottom bar: Page walk overhead Hashed top bar: VMM overheads Back