Agile Paging: Exceeding the Best of Nested and Shadow Paging
  Jayneel Gandhi, Mark D. Hill, Michael M. Swift
Executive Summary
  Problem: Virtualization is valuable but has high overheads with larger workloads (up to 70% slower than native)
  Existing choices:
    Nested paging: slow page walk but fast page table updates
    Shadow paging: fast page walk but slow page table updates
  Can we get the best of both for the same address space (or the same page walk)?
  Yes, agile paging: use shadow paging and sometimes switch to nested paging within the same page walk (at most 4% slower than native)
Outline Motivation Agile Paging Results Summary
Virtualization Overview
  [Diagram: applications (APP) run on a guest OS, on top of the VMM, on top of the hardware]
  Benefits:
    Foundation of our cloud infrastructure
    Provides on-demand virtual instances
    Helps server consolidation
  Problem: the overhead of virtualizing memory is high (up to 70% slower than unvirtualized)
Virtualizing Memory
  [Diagram: software stack with the address used at each layer]
  APP: gVA (guest virtual address)
  Guest OS: guest page table, gVA to gPA (guest physical address)
  VMM: nested page table, gPA to hPA (host physical address)
  Hardware
Virtualizing Memory
  Translation: gVA -> guest page table -> gPA -> nested page table -> hPA
  Two techniques to manage both page tables:
    1. Nested Paging – Hardware
    2. Shadow Paging – Software
  Evaluated on two axes: page walk latency and page table updates
Unvirtualized x86-64 Translation
  [Diagram: applications on the OS on hardware; the page walk starts at CR3 and translates a VA (virtual address) to a PA (physical address)]
  At most 4 memory accesses
1. Nested Paging – Hardware
  [Diagram: page walk from gCR3 through the guest page table (gVA to gPA) and the nested page table (gPA to hPA)]
  Longer page walk
  At most 5 + 5 + 5 + 5 + 4 = 24 memory accesses
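The 5 + 5 + 5 + 5 + 4 count can be reproduced with a short sketch. It assumes four-level guest and nested page tables (as on x86-64) and no hits in page walk caches or nested TLBs, so it gives the worst case. The function name is illustrative and not from the talk.

```python
def nested_walk_accesses(guest_levels: int = 4, nested_levels: int = 4) -> int:
    """Worst-case memory accesses for a two-dimensional (nested) page walk."""
    # Each guest level: translate the gPA of the guest page table page through
    # the nested page table (nested_levels accesses), then read the guest
    # entry itself (1 access).
    per_guest_level = nested_levels + 1
    # The final gPA produced by the guest walk needs one more nested walk.
    final_translation = nested_levels
    return guest_levels * per_guest_level + final_translation


print(nested_walk_accesses())  # 4 * 5 + 4 = 24
```

Setting `nested_levels=0` in this sketch removes the second dimension and gives back the 4 accesses of the unvirtualized walk.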
2. Shadow Paging – Software
  [Diagram: software stack with the page tables used at each layer]
  APP: gVA
  Guest OS: guest page table (read-only, RO), gVA to gPA
  VMM: shadow page table (gVA to hPA) and nested page table (gPA to hPA)
  Hardware
2. Shadow Paging – Software
  [Diagram: the shadow page table combines the read-only guest page table and the nested page table; the walk starts at sCR3 and translates gVA directly to hPA]
  Shorter page walk
  At most 4 memory accesses
Page Table Updates
  1. Nested paging: fast in-place update
    The guest writes its own page table directly (gVA -> guest page table -> gPA -> nested page table -> hPA)
  2. Shadow paging: slow mediated update
    The guest page table is read-only, so each write traps to the VMM, which updates the shadow page table
Key Observation
  [Diagram: guest virtual address space]
  Fully static address space: shadow paging preferred
  Fully dynamic address space: nested paging preferred
  Reality: a small fraction of the address space is dynamic
Key Observation
  [Diagram: guest page table rooted at gCR3, with parts labeled Nested and parts labeled Shadow]
Agile Paging
  Start the page walk in shadow mode: achieves fast TLB misses
  Optionally switch to nested mode: allows fast in-place updates
  Two parts of the design:
    1. Mechanism
    2. Policy
1. Mechanism
  [Diagram: shadow page table rooted at sCR3 (gVA to hPA), guest page table rooted at gCR3 (read-only, gVA to gPA), and nested page table (gPA to hPA); a shadow entry marked 1 switches the walk to the guest and nested page tables]
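1. Mechanism: Example Page Walk
  [Diagram: the walk for a gVA starts at sCR3 in shadow mode, switches to the guest page table (gCR3), and ends at the hPA]
  Switch modes at level 4 of the guest page table
  At most 1 + 1 + 1 + 5 = 8 memory accesses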
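The 1 + 1 + 1 + 5 example generalizes to any switch point. The sketch below is an illustration under stated assumptions: four-level guest and nested page tables, no page walk cache hits, and a hypothetical parameter `levels_in_nested_mode` counting how many of the lowest guest levels are walked in nested mode.

```python
def agile_walk_accesses(levels_in_nested_mode: int,
                        guest_levels: int = 4,
                        nested_levels: int = 4) -> int:
    """Worst-case memory accesses when a walk starts in shadow mode and
    finishes its last `levels_in_nested_mode` guest levels in nested mode."""
    if not 0 <= levels_in_nested_mode <= guest_levels:
        raise ValueError("levels_in_nested_mode must be between 0 and guest_levels")
    # Levels walked in the shadow page table cost one access each.
    shadow_part = guest_levels - levels_in_nested_mode
    # Each level walked in nested mode costs one guest entry read plus a
    # nested walk of the gPA that entry yields.
    nested_part = levels_in_nested_mode * (nested_levels + 1)
    return shadow_part + nested_part


for k in range(5):
    print(k, agile_walk_accesses(k))  # 4, 8, 12, 16, 20 accesses
```

Switching at level 4 corresponds to `levels_in_nested_mode=1`, giving 3 + 5 = 8. Full nested paging costs 24 rather than 20 because gCR3 itself must also be translated, which this sketch does not model.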
2. Policy: Shadow to Nested
  [State diagram]
  Start -> Shadow
  Shadow -> Shadow (1 write): write to page table (VMM trap)
  Shadow (1 write) -> Nested: write to page table (VMM trap)
  Nested: subsequent writes (no VMM traps)
2. Policy: Nested to Shadow
  [State diagram]
  Start -> Shadow
  Shadow -> Shadow (1 write): write to page table (VMM trap)
  Shadow (1 write) -> Nested: write to page table (VMM trap)
  Nested: subsequent writes (no VMM traps)
  Nested -> Shadow: on timeout, move non-dirty parts back
  Use dirty bits to track writes to the guest page table
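The two policy slides describe a small state machine, sketched below for a single guest page table page. This is an illustration, not the authors' implementation: the class and method names are hypothetical, the `dirty` flag stands in for the hardware dirty bit, and clearing that flag at each timeout is an assumption the slides do not spell out.

```python
from enum import Enum


class Mode(Enum):
    SHADOW = "shadow"
    SHADOW_ONE_WRITE = "shadow (1 write)"
    NESTED = "nested"


class PageTablePagePolicy:
    """Tracks the mode of one guest page table page (hypothetical helper)."""

    def __init__(self) -> None:
        self.mode = Mode.SHADOW
        self.dirty = False   # stands in for the hardware dirty bit
        self.vmm_traps = 0

    def guest_write(self) -> None:
        if self.mode is Mode.SHADOW:
            # First write traps to the VMM; stay in shadow mode.
            self.vmm_traps += 1
            self.mode = Mode.SHADOW_ONE_WRITE
        elif self.mode is Mode.SHADOW_ONE_WRITE:
            # Second write traps again and moves the page to nested mode.
            self.vmm_traps += 1
            self.mode = Mode.NESTED
        else:
            # Nested mode: in-place update, no trap; only the dirty bit is set.
            self.dirty = True

    def timeout(self) -> None:
        # The slides only define the timeout transition for nested mode.
        if self.mode is not Mode.NESTED:
            return
        if self.dirty:
            # Assumption: clear the bit and check again at the next timeout.
            self.dirty = False
        else:
            # No writes since the last check: move back to shadow mode.
            self.mode = Mode.SHADOW


page = PageTablePagePolicy()
for _ in range(5):
    page.guest_write()
print(page.mode, page.vmm_traps)  # nested mode after two trapped writes
```

In this sketch, a page that keeps being written stays in nested mode across timeouts, while a page that goes quiet returns to shadow mode after at most two timeouts.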
Methodology
  Measure cost of page walks on real hardware
    Intel 12-core Sandy Bridge with 96GB memory
    64-entry L1 TLB + 512-entry L2 TLB, 4-way associative, for 4KB pages
    32-entry L1 TLB, 4-way associative, for 2MB pages
  Prototype VMM and emulate hardware in Linux v3.12.13
    BadgerTrap for online analysis of TLB misses and to emulate agile paging
    Linear model to predict performance
  Workloads
    Big-memory workloads, SPEC 2006, BioBench, PARSEC
Performance Results
  [Bar chart. B: unvirtualized, N: nested paging, S: shadow paging, A: agile paging]
  Modeled based on emulator: BadgerTrap
  Measured using performance counters
  Solid bottom bar: page walk overhead
  Hashed top bar: VMM overheads
Performance Results
  Nested paging has high overheads from TLB misses (effect of the longer page walk)
  [Bar chart. B: unvirtualized, N: nested paging, S: shadow paging, A: agile paging. Labeled values: 28%, 19%, 18%, 6%]
  Solid bottom bar: page walk overhead
  Hashed top bar: VMM overheads
Performance Results
  Shadow paging has high overheads from VMM interventions
  [Bar chart. B: unvirtualized, N: nested paging, S: shadow paging, A: agile paging. Labeled values: 28%, 70%, 11%, 19%, 30%, 18%, 6%, 6%]
  Solid bottom bar: page walk overhead
  Hashed top bar: VMM overheads
Performance Results
  Agile paging consistently performs better than both techniques
  [Bar chart. B: unvirtualized, N: nested paging, S: shadow paging, A: agile paging. Labeled values: 28%, 70%, 11%, 2%, 19%, 30%, 18%, 6%, 2%, 4%, 6%, 3%]
  Solid bottom bar: page walk overhead
  Hashed top bar: VMM overheads
Summary
  Problem: Virtualization is valuable but has high overheads with larger workloads (up to 70% slower than native)
  Existing choices:
    Nested paging: slow page walk but fast page table updates
    Shadow paging: fast page walk but slow page table updates
  Can we get the best of both for the same address space (or the same page walk)?
  Yes, agile paging: use shadow paging and sometimes switch to nested paging within the same page walk (at most 4% slower than native)
Questions?
Can we get the best of both worlds?
                         Nested Paging   Shadow Paging      Agile Paging
  Dimensions             2D              1D
  # of memory accesses   24              4                  ~4-5
  Page table updates     Fast in-place   Slow out of place
Short-Lived Processes
  Issue: the cost of creating the shadow page table is high
  Solution:
    Start shadow mode after 1 sec for agile paging
    Give user mode access to run only in nested mode
Accessed/Dirty Bits
  Issue: shadow mode is slow for setting A/D bits
    Coherence between shadow and guest page tables causes VMM traps
  Solution: hardware optimization
    Intel sets accessed/dirty bits on both guest and nested page tables
    Broadwell supports multiple page table walkers per core
    We propose that hardware write A/D bits on all three page tables
Context Switches
  Issue: intra-guest context switches with shadow mode are slower
    The guest OS does not know the shadow page table exists, so a context switch causes a VMM trap
  Solution: hardware optimization
    Add a small VMM-managed cache of guest CR3 to shadow CR3 mappings
    Hardware looks up a matching entry on a context switch
    On a hit, no VMM trap is required
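The proposed guest CR3 to shadow CR3 cache can be modeled in a few lines. This is a software sketch of the idea only: the class name, the capacity of 4, the LRU replacement, and the `vmm_table` dictionary standing in for the VMM's own bookkeeping are all assumptions, not details from the talk.

```python
from collections import OrderedDict


class Cr3Cache:
    """Toy model of a small VMM-managed gCR3 -> sCR3 cache (hypothetical)."""

    def __init__(self, vmm_table: dict, capacity: int = 4) -> None:
        self.vmm_table = vmm_table      # full mapping kept by the VMM
        self.capacity = capacity        # assumed size; the slide says "small"
        self.entries = OrderedDict()    # cached gCR3 -> sCR3 pairs
        self.vmm_traps = 0

    def context_switch(self, gcr3: int) -> int:
        """Return the shadow CR3 to load when the guest switches to gcr3."""
        if gcr3 in self.entries:
            # Hit: hardware finds the shadow root without a VMM trap.
            self.entries.move_to_end(gcr3)
            return self.entries[gcr3]
        # Miss: trap to the VMM, which supplies the mapping and fills the cache.
        self.vmm_traps += 1
        scr3 = self.vmm_table[gcr3]
        self.entries[gcr3] = scr3
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # assumed LRU eviction
        return scr3


cache = Cr3Cache({0x1000: 0xA000, 0x2000: 0xB000})
for gcr3 in (0x1000, 0x2000, 0x1000, 0x2000):
    cache.context_switch(gcr3)
print(cache.vmm_traps)  # only the first switch to each process traps
```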
Why does agile paging work?
  Switch level   Shadow   L4     L3     L2   L1   Nested   Avg.
  Mem. acc.      4        8      12     16   20   24
  graph500       99.8%    0.2%   -                         4.01
  memcached      88.2%    4.5%   7.3%                      4.76
  canneal        94.7%    4.6%   0.7%                      4.24
  dedup          91.4%    2.2%   6.4%                      4.60
  Brings the average number of memory accesses down to ~4-5 from 24
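The Avg. column follows from the rest of the table if each percentage is read as the fraction of page walks that switch at that level and weighted by the memory accesses for that level. That reading is an assumption, but the short check below reproduces the reported averages. The dictionary and function names are illustrative.

```python
# Memory accesses per walk for each switch level, from the table.
ACCESSES = {"shadow": 4, "L4": 8, "L3": 12, "L2": 16, "L1": 20, "nested": 24}

# Fraction of walks at each switch level, from the table.
WALK_MIX = {
    "graph500":  {"shadow": 0.998, "L4": 0.002},
    "memcached": {"shadow": 0.882, "L4": 0.045, "L3": 0.073},
    "canneal":   {"shadow": 0.947, "L4": 0.046, "L3": 0.007},
    "dedup":     {"shadow": 0.914, "L4": 0.022, "L3": 0.064},
}


def average_accesses(mix: dict) -> float:
    """Weighted average of memory accesses per page walk."""
    return sum(fraction * ACCESSES[level] for level, fraction in mix.items())


for name, mix in WALK_MIX.items():
    print(f"{name}: {average_accesses(mix):.2f}")
# graph500: 4.01, memcached: 4.76, canneal: 4.24, dedup: 4.60
```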
Transparent Huge Page (2MB)
  [Bar chart. B: unvirtualized, N: nested paging, S: shadow paging, A: agile paging. Labeled values: 68%, 13%, 14%, 4%, 2%, 14%, 5%, 2%, 10%, 6%, 3%, 2%]
  Solid bottom bar: page walk overhead
  Hashed top bar: VMM overheads
Design Components
  Hardware
    Three page table pointers: point to each of the page tables
    Enhanced page table walker: interprets the switching bit, bridges the two state machines
  VMM
    Manage three page tables: incremental from shadow paging
    Policies for changing modes: encapsulate policies in the VMM