1
Executive Summary
Problem: Overheads of virtual memory can be high -- 1D (native): 42%; 2D (virtualized): 97%
Idea: Reduce the dimensionality of page walks
I: Virtualized Direct Segments -- 2D → 0D (8 slides): reduces overheads of virtualized virtual memory to less than 1%
II: Agile Paging -- 2D → 1D (12 slides): reduces overheads of virtualized virtual memory to less than 4% of native
III: Redundant Memory Mappings -- 1D → 0D (11 slides): reduces overheads of native virtual memory to less than 1%
2
Virtual Memory Refresher
[Diagram: per-process virtual address spaces (Process 1, Process 2) mapped through page tables to physical memory, with a TLB (Translation Lookaside Buffer) caching recent translations]
Challenge: How to reduce costly page table walks?
3
Two Technology Trends
TLB reach is limited:

Year   Processor     L1 DTLB entries
1999   Pentium III   72
2001   Pentium 4     64
2008   Nehalem       96
2012   IvyBridge     100
2014   Haswell       -
2015   Skylake       -

[Chart: memory price per GB over time; *Inflation-adjusted 2011 USD, from: jcmit.com]
4
Use of Vast Memory
Big-memory applications (ever-increasing data sets)
5
Native x86 Translation: 1D
[Diagram: native x86-64 translation walks a four-level page table from CR3 to turn a virtual address (VA) into a physical address (PA)]
Up to 4 memory accesses per TLB miss.
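To make the 1D walk concrete, below is a minimal C sketch of a 4-level radix page walk (a software model only: 9 index bits per level and 4 KB pages as in x86-64; the node layout and function names are illustrative, not kernel or hardware structures). It touches exactly one page-table node per level, which is where the "up to 4 memory accesses" comes from.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define LEVELS 4        /* x86-64 walks a 4-level radix tree      */
#define IDX_BITS 9      /* 9 VA bits index each level (512 slots) */
#define PAGE_SHIFT 12   /* 4 KB pages                             */

/* Simplified software page-table node: non-leaf levels use next[],
 * the leaf level uses frame[]. Illustrative only. */
typedef struct node {
    struct node *next[1 << IDX_BITS];
    uint64_t     frame[1 << IDX_BITS];
} node_t;

/* One memory access per level: at most 4 for a native (1D) walk. */
static int walk_1d(node_t *cr3, uint64_t va, uint64_t *pa, int *accesses)
{
    node_t *n = cr3;
    for (int level = LEVELS - 1; level >= 0; level--) {
        int idx = (int)((va >> (PAGE_SHIFT + level * IDX_BITS)) & ((1 << IDX_BITS) - 1));
        (*accesses)++;                 /* read one page-table entry */
        if (level == 0) {
            *pa = (n->frame[idx] << PAGE_SHIFT) | (va & ((1u << PAGE_SHIFT) - 1));
            return 0;
        }
        n = n->next[idx];
        if (!n)
            return -1;                 /* not present: page fault   */
    }
    return -1;
}

int main(void)
{
    node_t *levels[LEVELS];
    for (int i = 0; i < LEVELS; i++)
        levels[i] = calloc(1, sizeof(node_t));
    for (int i = LEVELS - 1; i > 0; i--)   /* link the four levels for VPN 0     */
        levels[i]->next[0] = levels[i - 1];
    levels[0]->frame[0] = 0x1234;          /* map virtual page 0 -> frame 0x1234 */

    uint64_t pa = 0;
    int accesses = 0;
    if (walk_1d(levels[LEVELS - 1], 0xabc, &pa, &accesses) == 0)
        printf("pa=0x%llx after %d memory accesses\n",
               (unsigned long long)pa, accesses);
    return 0;
}
```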
6
Virtual Machines: Enabling many more startups
7
Why do we need more dimensions?
[Diagram: the guest page table maps a guest virtual address (gVA) to a guest physical address (gPA); the nested page table then maps the gPA to a host physical address (hPA)]
8
Inception: Effects of Adding Dimensions
[Diagram: application / OS / VMM / hardware stack. 2D translation ("jog"): 70% overhead; 1D ("run"): 42%; 0D ("sprint"): 0%]
9
Goal: Can we have "zero" overheads of virtualizing memory at any level of virtualization?
10
Reaching the Goal
I: Virtualized Direct Segments (2D → 0D) -- MICRO'14
II: Agile Paging (2D → 1D) -- ISCA'16
III: Redundant Memory Mappings (1D → 0D) -- ISCA'15 + MICRO TOP PICKS'16
(Axes: Virtual Machine / Native Machine / Direct Execution)
11
Outline
I: Virtualized Direct Segments (2D → 0D) -- MICRO'14
II: Agile Paging (2D → 1D) -- ISCA'16
III: Redundant Memory Mappings (1D → 0D) -- ISCA'15 + MICRO TOP PICKS'16
12
I: Virtualized Direct Segments – Goal
TLB reach is limited, and the cost of a TLB miss under virtualization is very high.
Can we eliminate virtualized TLB misses entirely? (2D → 0D)
[Gandhi et al. -- MICRO'14]
13
Virtualized Page Walk: 2D
[Diagram: the 2D walk starts at gCR3; each of the four guest page table levels requires a nested walk to translate its gPA, and the final guest physical address needs one more nested walk to reach the hPA]
At most 5 + 5 + 5 + 5 + 4 = 24 memory accesses.
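A small counting model of how the worst case reaches 24 accesses, assuming each of the four guest page-table levels needs a full nested walk plus a read of the guest PTE, and the final guest physical address needs one more nested walk (the function names are illustrative):

```c
#include <stdio.h>

#define GUEST_LEVELS 4   /* guest page table levels  */
#define NESTED_LEVELS 4  /* nested page table levels */

/* Every guest-physical address the walker touches (each guest PTE and the
 * final data address) must itself be translated by a nested walk. */
static int nested_walk_cost(void) { return NESTED_LEVELS; }

static int two_d_walk_cost(void)
{
    int accesses = 0;
    for (int level = 0; level < GUEST_LEVELS; level++) {
        accesses += nested_walk_cost(); /* translate the gPA of this guest PTE */
        accesses += 1;                  /* read the guest PTE itself           */
    }
    accesses += nested_walk_cost();     /* translate the final guest PA        */
    return accesses;                    /* 4*(4+1) + 4 = 24                    */
}

int main(void)
{
    printf("worst-case 2D page walk: %d memory accesses\n", two_d_walk_cost());
    return 0;
}
```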
14
Direct Segments Review
[Diagram: (1) conventional paging versus (2) a direct segment defined by BASE, LIMIT, and OFFSET that maps a contiguous virtual range straight to physical memory]
Why a direct segment? It matches big-memory workload needs: no TLB lookups, so no TLB misses.
[Basu and Gandhi et al. -- ISCA'13]
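A hedged C sketch of the direct-segment check (the BASE/LIMIT/OFFSET arithmetic follows the formula on the backup slide, "if BASE ≤ V < LIMIT then P = V + OFFSET"; the struct and function names here are a software model, not the proposed hardware registers):

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

/* Per-core direct-segment state from the ISCA'13 proposal, modeled as fields. */
struct direct_segment {
    uint64_t base;    /* first virtual address covered by the segment */
    uint64_t limit;   /* one past the last covered virtual address    */
    uint64_t offset;  /* added to VA to form PA                       */
};

/* If the VA falls inside [base, limit), translation is pure arithmetic:
 * no TLB lookup and therefore no TLB miss. Otherwise fall back to paging. */
static bool ds_translate(const struct direct_segment *ds, uint64_t va, uint64_t *pa)
{
    if (va >= ds->base && va < ds->limit) {
        *pa = va + ds->offset;
        return true;          /* 0D: zero page-walk memory accesses        */
    }
    return false;             /* caller uses the conventional 1D page walk */
}

int main(void)
{
    struct direct_segment ds = { .base = 0x10000000ULL, .limit = 0x90000000ULL,
                                 .offset = 0x700000000ULL };
    uint64_t pa;
    if (ds_translate(&ds, 0x10000abcULL, &pa))
        printf("segment hit: pa=0x%llx\n", (unsigned long long)pa);
    return 0;
}
```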
15
Direct Segments Review
[Diagram: native paging translates VA → PA with a 1D page walk; a direct segment translates VA → PA with no walk at all (1D → 0D)]
16
Three Virtualized Modes
1. VMM Direct: 2D → 1D
2. Dual Direct: 2D → 0D
3. Guest Direct: 2D → 1D
(Details in the backup slides.)
17
Methodology
Measure the cost of page walks on real hardware:
- Intel 12-core Sandy Bridge with 96 GB memory
- Prototype VMM and OS; emulate the proposed hardware in Linux
- BadgerTrap, released for online analysis of TLB misses [Gandhi et al. -- CAN'14]
- Linear model to predict performance
- Workloads: big-memory workloads, SPEC 2006, BioBench, PARSEC
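The slide does not specify the form of the linear performance model, so the sketch below is only one plausible shape of such a model (overhead predicted from measured page-walk cycles, with the evaluated technique removing some fraction of them); the formula, names, and example numbers are assumptions:

```c
#include <stdio.h>

/* Hypothetical linear model: if a technique eliminates a fraction of the
 * measured page-walk cycles, predicted execution time shrinks by exactly
 * those cycles (everything else unchanged). */
static double predicted_overhead(double total_cycles,
                                 double walk_cycles,
                                 double fraction_walks_eliminated)
{
    double remaining_walk = walk_cycles * (1.0 - fraction_walks_eliminated);
    double base_cycles    = total_cycles - walk_cycles;  /* "useful" work       */
    return remaining_walk / base_cycles;                 /* overhead vs. ideal  */
}

int main(void)
{
    /* Example: 100e9 total cycles, 40e9 in page walks, technique removes 95%. */
    printf("predicted overhead: %.1f%%\n",
           100.0 * predicted_overhead(100e9, 40e9, 0.95));
    return 0;
}
```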
18
Results: VMM Direct achieves near-native performance
19
Results: Dual Direct eliminates most TLB misses, achieving better-than-native performance
20
Results: Guest Direct achieves near-native performance while preserving flexibility at the VMM
21
Results: The same trend holds across all workloads (more workloads in the thesis)
22
I: Summary – Virtualized Direct Segments
Problem: TLB misses in virtual machines; the hardware-virtualized MMU has high overheads
Solution: Segmentation to bypass paging -- extend Direct Segments to virtualization, with three modes offering different tradeoffs
Results: Near-native or better-than-native performance
23
Outline
I: Virtualized Direct Segments (2D → 0D) -- MICRO'14
II: Agile Paging (2D → 1D) -- ISCA'16
III: Redundant Memory Mappings (1D → 0D) -- ISCA'15 + MICRO TOP PICKS'16
24
II: Agile Paging -- Goal
Virtualized Direct Segments sacrificed paging support and required substantial hardware and software changes.
Can we make the virtualized page walk faster while retaining paging? (2D → 1D)
[Gandhi et al. -- ISCA'16]
25
Virtualizing Memory
[Diagram: the application issues a guest virtual address (gVA); the guest OS's page table maps it to a guest physical address (gPA); the VMM's nested page table maps the gPA to a host physical address (hPA) used by the hardware]
26
Virtualizing Memory
Two techniques manage both page tables:
- Nested Paging -- hardware
- Shadow Paging -- software
Evaluated on two axes: page walk latency and page table updates.
27
1. Nested Paging – Hardware
[Diagram: the walk starts at gCR3; each of the four guest page table levels requires a nested walk to translate its gPA, and the final guest physical address needs one more nested walk to reach the hPA]
At most 5 + 5 + 5 + 5 + 4 = 24 memory accesses.
28
2. Shadow Paging – Software
[Diagram: the VMM marks the guest page table read-only and combines it with its nested page table into a shadow page table that maps gVA directly to hPA for the hardware]
29
2. Shadow Paging – Software
[Diagram: on a TLB miss, hardware walks only the shadow page table from sCR3 -- a native-length walk]
At most 4 memory accesses.
30
Page Table Updates
1. Nested Paging: the guest updates its page table in place -- fast, no VMM involvement.
2. Shadow Paging: the guest page table is read-only, so every update traps to the VMM -- a slow, mediated update.
Updates that matter: copy-on-write, page migration, accessed bits, dirty bits, page sharing, working-set sampling, and many more.
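To illustrate the "slow, mediated update" under shadow paging, here is a toy, single-level C model of what a VMM might do when the guest writes its write-protected page table; the tables and handler are hypothetical, not any real hypervisor's code:

```c
#include <stdint.h>
#include <stdio.h>

#define NPAGES 16   /* tiny address spaces for the model */

/* Toy single-level tables: index = virtual/physical page number. */
static uint64_t guest_pt[NPAGES];    /* gVA page -> gPA page (guest-owned) */
static uint64_t nested_pt[NPAGES];   /* gPA page -> hPA page (VMM-owned)   */
static uint64_t shadow_pt[NPAGES];   /* gVA page -> hPA page (VMM-derived) */

/* Hypothetical VMM handler for the write-protection fault taken when the
 * guest writes its own (read-only) page table under shadow paging. */
static void on_guest_pte_write(unsigned gvpn, uint64_t new_gppn)
{
    guest_pt[gvpn]  = new_gppn;               /* 1. emulate the guest's write    */
    shadow_pt[gvpn] = nested_pt[new_gppn];    /* 2. re-merge guest+nested entry  */
    /* Each such update costs a VM exit plus this work: the "slow, mediated
     * update". Nested paging lets the guest write its table in place instead. */
}

int main(void)
{
    nested_pt[3] = 7;            /* VMM placed gPA page 3 in hPA page 7   */
    on_guest_pte_write(0, 3);    /* guest maps gVA page 0 -> gPA page 3   */
    printf("shadow: gVA page 0 -> hPA page %llu\n",
           (unsigned long long)shadow_pt[0]);
    return 0;
}
```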
31
Key Observation
A fully static guest virtual address space favors shadow paging; a fully dynamic one favors nested paging. In reality, only a small fraction of the guest virtual address space is dynamic.
32
Key Observation
[Diagram: most of the guest page table (from gCR3) can be handled in shadow mode, with only parts handled in nested mode]
33
Agile Paging
Start the page walk in shadow mode -- achieving fast TLB misses.
Optionally switch to nested mode -- allowing fast in-place updates.
Two parts of the design: 1. Mechanism, 2. Policy.
34
1. Mechanism
[Diagram: the walk begins in the shadow page table at sCR3; a per-level switch lets it continue through the guest page table (gCR3) plus the nested page table, and guest page table levels still covered by shadow mode are kept read-only]
35
1. Mechanism: Example Page Walk
[Diagram: example walk that switches at level 4 of the guest page table -- the first three levels are served by the shadow page table, the last level in nested mode]
At most 1 + 1 + 1 + 5 = 8 memory accesses.
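A counting sketch of the agile walk; the per-level accounting below is an assumption chosen so that the model reproduces the numbers stated in the talk (24 accesses for a pure nested walk, 8 for this slide's switch point, 4 for a pure shadow walk):

```c
#include <stdio.h>

#define LEVELS 4      /* guest page table levels  */
#define NESTED 4      /* nested page table levels */

/* Counting model for an agile-paging walk with `s` levels served from the
 * shadow page table before switching to nested mode. */
static int agile_walk_cost(int s)
{
    if (s >= LEVELS)
        return LEVELS;                       /* pure shadow: native-length walk   */

    int cost = s;                            /* one access per shadow level       */
    for (int level = s; level < LEVELS; level++) {
        if (level == s && s > 0)
            cost += 1;                       /* shadow portion already located    */
                                             /* this guest-table level in hPA     */
        else
            cost += NESTED + 1;              /* nested walk + guest PTE read      */
    }
    return cost + NESTED;                    /* nested walk of the final gPA      */
}

int main(void)
{
    for (int s = 0; s <= LEVELS; s++)
        printf("switch after %d shadow level(s): %2d accesses\n",
               s, agile_walk_cost(s));       /* prints 24, 16, 12, 8, 4 */
    return 0;
}
```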
36
2. Policy: Shadow → Nested
Start in shadow mode. The first write to the page table traps to the VMM: stay in shadow (1 write). A second write (another VMM trap) switches that part to nested mode, where subsequent writes need no VMM traps.
37
2. Policy: Nested → Shadow
Dirty bits track writes to the guest page table. On a timeout, parts that are not dirty are moved back to shadow mode.
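The switching policy from the last two slides written as a small state machine; the states and events come from the slides, the encoding is illustrative:

```c
#include <stdio.h>

/* Per-page-table-part mode under agile paging. */
enum mode  { SHADOW, SHADOW_ONE_WRITE, NESTED };
enum event { GUEST_PT_WRITE, TIMEOUT_NOT_DIRTY };

/* Policy from the slides:
 *  - Shadow -> (write, VMM trap) -> Shadow after 1 write
 *  - another write (VMM trap)    -> Nested: further writes need no traps
 *  - timeout with no dirty bits  -> move back to Shadow                   */
static enum mode step(enum mode m, enum event e)
{
    switch (m) {
    case SHADOW:           return e == GUEST_PT_WRITE    ? SHADOW_ONE_WRITE : SHADOW;
    case SHADOW_ONE_WRITE: return e == GUEST_PT_WRITE    ? NESTED           : SHADOW_ONE_WRITE;
    case NESTED:           return e == TIMEOUT_NOT_DIRTY ? SHADOW           : NESTED;
    }
    return m;
}

int main(void)
{
    enum mode m = SHADOW;
    m = step(m, GUEST_PT_WRITE);      /* first write: trap, stay in shadow      */
    m = step(m, GUEST_PT_WRITE);      /* second write: trap, switch to nested   */
    m = step(m, TIMEOUT_NOT_DIRTY);   /* quiet period: move back to shadow      */
    printf("final mode: %d (0 = SHADOW)\n", (int)m);
    return 0;
}
```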
38
Results
Legend: B = Unvirtualized, N = Nested Paging, S = Shadow Paging, A = Agile Paging.
Modeled with the BadgerTrap emulator; measured using performance counters.
Solid bottom bar: page walk overhead. Hashed top bar: VMM overheads.
39
Results
Nested Paging has high TLB-miss overheads -- the effect of the longer page walk.
[Graph: nested-paging overheads of 28%, 19%, 18%, and 6% across the shown workloads]
40
Results
Shadow Paging has high overheads from VMM interventions.
[Graph: overheads shown include 28%, 70%, 11%, 19%, 30%, 18%, 6%, 6%]
41
Results
Agile paging consistently performs better than both techniques.
[Graph: overheads shown include 28%, 70%, 11%, 2%, 19%, 30%, 18%, 6%, 2%, 4%, 6%, 3%; THP]
42
II: Summary -- Agile Paging
Problem: Virtualization is valuable but has high overheads for larger workloads (up to 70% slower than native).
Existing choices:
- Nested Paging: slow page walk but fast page table updates
- Shadow Paging: fast page walk but slow page table updates
Can we get the best of both for the same address space (even the same page walk)? Yes: Agile Paging uses shadow paging and sometimes switches to nested paging within the same page walk (at most 4% slower than native).
43
Outline
I: Virtualized Direct Segments (2D → 0D) -- MICRO'14
II: Agile Paging (2D → 1D) -- ISCA'16
III: Redundant Memory Mappings (1D → 0D) -- ISCA'15 + MICRO TOP PICKS'16
44
III: Redundant Memory Mappings -- Goal
TLB reach is limited, and in-memory workload sizes keep increasing.
How can we increase the reach of each TLB entry? (1D → 0D)
[Gandhi et al. -- ISCA'15 and IEEE MICRO TOP PICKS'16]
45
Key Observation
[Diagram: a process's virtual memory mapped onto physical memory]
46
Key Observation
Virtual memory is composed of large contiguous regions -- code, heap, stack, shared libraries -- and they are limited in number: only a handful.
47
Compact Representation: Range Translation
[Diagram: a range translation described by BASE1, LIMIT1, and OFFSET1 mapping a contiguous virtual region to a contiguous physical region]
A range translation is a mapping from contiguous virtual pages to contiguous physical pages with uniform protection.
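A direct C rendering of the range-translation definition on this slide (field and function names are illustrative):

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

/* A range translation: contiguous virtual pages mapped to contiguous
 * physical pages with uniform protection (this slide's definition). */
struct range_translation {
    uint64_t base;      /* first virtual page number in the range        */
    uint64_t limit;     /* one past the last virtual page number         */
    int64_t  offset;    /* physical page number = virtual page + offset  */
    uint32_t prot;      /* uniform protection bits for the whole range   */
};

static bool range_covers(const struct range_translation *r, uint64_t vpn)
{
    return vpn >= r->base && vpn < r->limit;
}

static uint64_t range_translate(const struct range_translation *r, uint64_t vpn)
{
    return (uint64_t)((int64_t)vpn + r->offset);   /* arbitrary reach, O(1) */
}

int main(void)
{
    struct range_translation heap = { .base = 1000, .limit = 5000,
                                      .offset = 250000, .prot = 0x3 };
    uint64_t vpn = 1234;
    if (range_covers(&heap, vpn))
        printf("vpn %llu -> pfn %llu\n", (unsigned long long)vpn,
               (unsigned long long)range_translate(&heap, vpn));
    return 0;
}
```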
48
Redundant Memory Mappings
[Diagram: five range translations covering most of the virtual address space]
Map most of a process's virtual address space redundantly with a modest number of range translations, in addition to the page mappings.
49
Design: Redundant Memory Mappings
Three components:
A. Caching range translations
B. Managing range translations
C. Facilitating range translations
50
A. Caching Range Translations
[Diagram: translation hardware -- L1 DTLB, L2 DTLB, a Range TLB, and an enhanced page table walker, translating virtual pages V1..V12 to physical pages P1..P12]
51
A. Caching Range Translations
[Diagram: an access that hits in the L1 DTLB is translated immediately]
52
A. Caching Range Translations
[Diagram: on an L1 DTLB miss, the L2 DTLB and the Range TLB are probed; an L2 DTLB hit refills the L1 DTLB]
53
A. Caching Range Translations
[Diagram: on an L1 DTLB miss, the Range TLB is probed in parallel with the L2 DTLB; a Range TLB hit refills the L1 DTLB]
54
A. Caching Range Translations
[Diagram: Range TLB detail -- each entry holds BASE_i and LIMIT_i comparators (BASE_i ≤ VA < LIMIT_i), an OFFSET_i, and protection bits; on a hit, the L1 TLB entry generator produces (virtual address + OFFSET) with the range's protection]
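A software model of the Range TLB hit path sketched above: each entry is a BASE/LIMIT comparison, and a hit synthesizes an ordinary 4 KB L1 TLB entry as (virtual page + OFFSET) with the range's protection. The 32-entry, fully associative sizing matches the evaluation assumptions later in the talk; everything else is illustrative, not RTL:

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define RANGE_TLB_ENTRIES 32    /* fully associative, as assumed in the results */

struct range_entry  { uint64_t base, limit; int64_t offset; uint32_t prot; bool valid; };
struct l1_tlb_entry { uint64_t vpn, pfn;    uint32_t prot; };

static struct range_entry range_tlb[RANGE_TLB_ENTRIES];

/* On an L1 TLB miss, probe every range entry in parallel (modeled as a loop).
 * A hit refills the L1 TLB with a per-page entry generated from the range. */
static bool range_tlb_lookup(uint64_t vpn, struct l1_tlb_entry *out)
{
    for (size_t i = 0; i < RANGE_TLB_ENTRIES; i++) {
        const struct range_entry *e = &range_tlb[i];
        if (e->valid && vpn >= e->base && vpn < e->limit) {
            out->vpn  = vpn;
            out->pfn  = (uint64_t)((int64_t)vpn + e->offset);  /* VA + OFFSET   */
            out->prot = e->prot;                               /* uniform prot  */
            return true;    /* refill L1 TLB; no page walk needed               */
        }
    }
    return false;           /* fall through to the page walker                  */
}

int main(void)
{
    range_tlb[0] = (struct range_entry){ .base = 0x100, .limit = 0x500,
                                         .offset = 0x7000, .prot = 0x3, .valid = true };
    struct l1_tlb_entry e;
    if (range_tlb_lookup(0x123, &e))
        printf("generated L1 entry: vpn=0x%llx pfn=0x%llx\n",
               (unsigned long long)e.vpn, (unsigned long long)e.pfn);
    return 0;
}
```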
55
A. Caching Range Translations
[Diagram: if the L1 DTLB, L2 DTLB, and Range TLB all miss, the enhanced page table walker handles the miss]
56
B. Managing Range Translations
All range translations are stored in an OS-managed structure -- the range table -- which is per-process, like the page table.
[Diagram: range table organized as a tree of nodes (RTA..RTG) rooted at the CR-RT register]
57
B. Managing Range Translations
On an L2 + Range TLB miss, which structure should be walked? (A) the page table, (B) the range table, (C) both, or (D) either? Whether the virtual page is part of a range is not known at miss time.
58
B. Managing Range Translations
Redundancy to the rescue: one bit in the page table entry denotes that the page is part of a range.
(1) Walk the page table (from CR3) and insert the translation into the L1 TLB; (2) the application resumes its memory access; (3) if the entry is marked as part of a range, walk the range table (from CR-RT) in the background and insert the range into the Range TLB.
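A sketch of this miss-handling flow, assuming a single software-available "part of a range" bit in the PTE; the walkers here are toy stand-ins, not the real page-table or range-table code:

```c
#include <stdint.h>
#include <stdio.h>

#define PTE_IN_RANGE (1ULL << 11)   /* hypothetical software-available PTE bit */

/* Toy stand-ins for the real structures and walkers in this sketch. */
static uint64_t page_table_walk(uint64_t vpn)
{
    /* Pretend every page maps to frame (vpn + 0x700) and belongs to a range. */
    return ((vpn + 0x700) << 12) | PTE_IN_RANGE | 0x1 /* present */;
}
static void insert_l1_tlb(uint64_t vpn, uint64_t pte) { (void)vpn; (void)pte; }
static void range_table_walk_background(uint64_t vpn)
{
    /* Background work: walk the per-process range table (rooted at CR-RT) and
     * install the covering range translation into the Range TLB. */
    printf("background: range-table walk for vpn 0x%llx\n", (unsigned long long)vpn);
}

/* On an L2 + Range TLB miss: walk the page table first so the application can
 * resume immediately; only if the PTE is marked as part of a range, walk the
 * range table in the background and refill the Range TLB. */
static void handle_tlb_miss(uint64_t vpn)
{
    uint64_t pte = page_table_walk(vpn);     /* (1) page table walk              */
    insert_l1_tlb(vpn, pte);                 /*     refill L1; (2) app resumes   */
    if (pte & PTE_IN_RANGE)
        range_table_walk_background(vpn);    /* (3) background range-table walk  */
}

int main(void) { handle_tlb_miss(0x3); return 0; }
```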
59
C. Facilitating Range Translations
Demand paging: physical pages are allocated on first touch, which does not produce the physical-page contiguity needed to create ranges.
60
C. Facilitating Range Translations
Eager paging: allocate physical pages when the virtual memory is allocated. This increases range sizes and reduces the number of ranges.
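Eager paging is an OS allocation policy; as a rough user-level analogy only (not the thesis's kernel implementation), Linux's MAP_POPULATE flag asks the kernel to back a mapping at allocation time instead of on first touch, which is the same "allocate physical pages when virtual memory is allocated" idea:

```c
#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdio.h>
#include <stddef.h>

int main(void)
{
    size_t len = 64UL << 20;   /* 64 MB */

    /* Demand paging: physical pages arrive one fault at a time, in whatever
     * frames happen to be free, so little physical contiguity results. */
    void *lazy = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    /* Eager-style allocation: MAP_POPULATE asks the kernel to back the whole
     * region up front, giving the allocator a chance to hand out larger
     * contiguous chunks (fewer, bigger ranges). */
    void *eager = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);

    printf("lazy=%p eager=%p\n", lazy, eager);
    munmap(lazy, len);
    munmap(eager, len);
    return 0;
}
```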
61
Results
- 4KB: Baseline using 4KB paging
- THP: Transparent Huge Pages using 2MB paging [Transparent Huge Pages]
- CTLB: Clustered TLB with clusters of 8 4KB entries [HPCA'14]
- DS: Direct Segments [ISCA'13 and MICRO'14]
- RMM: Our proposal, Redundant Memory Mappings [ISCA'15]
62
Performance Results
Assumptions: CTLB is a 512-entry fully-associative clustered TLB; the RMM Range TLB has 32 fully-associative entries; both are accessed in parallel with the L2 TLB.
Measured using performance counters; modeled with the BadgerTrap emulator [Gandhi & Basu et al. -- CAN'14].
Showing 5 of 14 workloads; the rest are in the thesis.
63
Performance Results: Overheads of using 4KB pages are very high
64
Performance Results: Clustered TLB works well but is limited by its 8x reach
65
Performance Results: 2MB pages help with 512x reach, but overheads are still not very low
66
Performance Results: Direct Segments are perfect for some workloads but not all
67
Performance Results: RMM achieves low overheads robustly across all workloads
68
III: Summary -- RMM
Problem: Native virtual memory overheads are high.
Proposal: Redundant Memory Mappings (1D → 0D)
- A compact representation called the range translation: an arbitrarily large contiguous mapping
- Range translations are effectively cached, managed, and facilitated
- Retains the flexibility of 4KB paging
Result: Reduced overheads of native virtual memory to less than 1%.
69
Outline
I: Virtualized Direct Segments (2D → 0D) -- MICRO'14
II: Agile Paging (2D → 1D) -- ISCA'16
III: Redundant Memory Mappings (1D → 0D) -- ISCA'15 + MICRO TOP PICKS'16
70
Conclusion
Problem: Paging is valuable but costly to use today, and virtualization with paging has a much higher cost.
Opportunity: Reduce the dimensionality of the page walk.
Result: Almost zero overheads of virtual memory.
71
Publications
- Jayneel Gandhi, Mark D. Hill, Michael M. Swift. Agile Paging for Efficient Virtualized Page Walks. ISCA 2016.
- Jayneel Gandhi, Vasileios Karakostas, Furkan Ayar, Adrián Cristal, Mark D. Hill, Kathryn S. McKinley, Mario Nemirovsky, Michael M. Swift, Osman Ünsal. Redundant Memory Mappings for Fast Access to Large Memories. IEEE MICRO TOP PICKS 2016.
- Vasileios Karakostas, Jayneel Gandhi, Adrián Cristal, Mark D. Hill, Kathryn S. McKinley, Mario Nemirovsky, Michael M. Swift, Osman Ünsal. Energy-Efficient Address Translation. HPCA 2016.
- Vasileios Karakostas, Jayneel Gandhi, Furkan Ayar, Adrián Cristal, Mark D. Hill, Kathryn S. McKinley, Mario Nemirovsky, Michael M. Swift, Osman Ünsal. Redundant Memory Mappings for Fast Access to Large Memories. ISCA 2015; accepted for IEEE MICRO TOP PICKS 2016.
- Jayneel Gandhi, Arkaprava Basu, Mark D. Hill, Michael M. Swift. Efficient Memory Virtualization: Reducing Dimensionality of Nested Page Walks. MICRO 2014; received an honorable mention for IEEE MICRO TOP PICKS 2015.
- Jayneel Gandhi, Arkaprava Basu, Mark D. Hill, Michael M. Swift. BadgerTrap: A Tool to Instrument x86-64 TLB Misses. CAN 2014.
- Arkaprava Basu, Jayneel Gandhi, Jichuan Chang, Mark D. Hill, Michael M. Swift. Efficient Virtual Memory for Big Memory Servers. ISCA 2013.
- Niket Kumar Choudhary, Salil V. Wadhavkar, Tanmay A. Shah, Hiran Mayukh, Jayneel Gandhi, Brandon H. Dwiel, Sandeep Navada, Hashem Hashemi Najaf-abadi, Eric Rotenberg. FabScalar: Automating Superscalar Core Design. IEEE MICRO TOP PICKS 2012.
- Niket Kumar Choudhary, Salil V. Wadhavkar, Tanmay A. Shah, Hiran Mayukh, Jayneel Gandhi, Brandon H. Dwiel, Sandeep Navada, Hashem Hashemi Najaf-abadi, Eric Rotenberg. FabScalar: Composing Synthesizable RTL Designs of Arbitrary Cores within a Canonical Superscalar Template. ISCA 2011; accepted for IEEE MICRO TOP PICKS 2012.
72
Collaborators and Contributors
Mark D. Hill, UW-Madison
Michael M. Swift, UW-Madison
Arkaprava Basu, AMD Research
Vasilis Karakostas, BSC Barcelona
Adrian Cristal, BSC Barcelona
Mario Nemirovsky, BSC Barcelona
Osman Unsal, BSC Barcelona
Kathryn S. McKinley, Microsoft Research
Benjamin Serebrin, Google
And many more…
73
Questions?
74
Backup Slides
75
Three Virtualized Modes
1. VMM Direct (2D → 1D) -- Features: maps almost the whole gPA; 4 memory accesses; near-native performance; helps any application.
76
Three Virtualized Modes
2. Dual Direct (2D → 0D) -- Features: 0 memory accesses; better-than-native performance; suits big-memory applications.
77
Three Virtualized Modes
3. Guest Direct (2D → 1D) -- Features: 4 memory accesses; suits big-memory applications; flexible enough to provide VMM services.
78
Compatibility -- 1. VMM Direct: application and guest OS unmodified; VMM and hardware modified.
79
Compatibility -- 2. Dual Direct: big-memory application with minimal guest OS changes; VMM and hardware modified.
80
Compatibility -- 3. Guest Direct: big-memory application with minimal guest OS changes; minimal VMM changes; hardware modified.
81
Tradeoffs: Summary
Properties compared across Base Virtualized, VMM Direct, Dual Direct, and Guest Direct:
- Dimension / memory accesses: 2D/24 (Base Virtualized), 1D/4 (VMM Direct), 0D/0 (Dual Direct), 1D/4 (Guest Direct)
- Guest OS modifications: none vs. required
- VMM modifications: none vs. minimal
- Applications: any vs. big-memory only
- VMM services allowed: yes vs. no
82
Results: 2MB and 1GB at VMM
83
0. Page-based Translation
[Diagram: per-page TLB entries (VPN0 → PFN0) mapping virtual to physical memory]
Benefits: most flexible; only 4KB alignment required.
Disadvantages: requires more TLB entries; limited TLB reach.
84
1. Multipage Mapping (Sub-blocked TLB / CoLT / Clustered TLB)
[Diagram: one TLB entry maps a small cluster, VPN(0-3) → PFN(0-3), with a bitmap]
Benefits: increases TLB reach by 8x-32x; allows clustering of pages.
Disadvantages: requires size alignment; contiguity required.
[ASPLOS'94, MICRO'12 and HPCA'14]
85
2. Large Pages
[Diagram: one large-page TLB entry (VPN0 → PFN0)]
Benefits: increases TLB reach; 2MB and 1GB page sizes in x86-64.
Disadvantages: size alignment restriction; contiguity required.
[Transparent Huge Pages and libhugetlbfs]
86
3. Direct Segments
[Diagram: one (BASE, LIMIT, OFFSET) segment; if BASE ≤ V < LIMIT, then P = V + OFFSET]
Benefits: unlimited reach; no size alignment required.
Disadvantages: one direct segment per program; not transparent to the application; applicable only to big-memory workloads.
[Basu & Gandhi et al. -- ISCA'13; Gandhi & Basu et al. -- MICRO'14]
87
Can we get the best of many worlds?
Desired properties: flexible alignment, arbitrary reach, multiple entries, transparency to applications, applicability to all workloads. Multipage Mapping, Large Pages, and Direct Segments each provide only some of these; our proposal aims to provide all of them.
88
Why low overheads? Virtual Contiguity

Benchmark    # of ranges (4KB + 2MB THP paging)    # of ideal RMM ranges to cover >99% of memory
cactusADM    112                                    49
canneal      77                                     4
graph500     86                                     3
mcf          55                                     1
tigr         16                                     -

Only 10s-100s of ranges per application (versus the 1000s of TLB entries required); only a few ranges cover more than 99% of memory.
89
Transparent Huge Pages (2MB)
Legend: B = Unvirtualized, N = Nested Paging, S = Shadow Paging, A = Agile Paging.
[Graph: overheads shown include 68%, 13%, 14%, 4%, 2%, 14%, 5%, 2%, 10%, 6%, 3%, 2%]
Solid bottom bar: page walk overhead. Hashed top bar: VMM overheads.