1
Performance Implications of Extended Page Tables on Virtualized x86 Processors
Tim Merrifield and Reza Taheri © 2014 VMware Inc. All rights reserved.
2
The Cost of “Virtualizing” Virtual Memory
Managing virtual memory and performing address translation can be expensive operations, even more so on virtualized servers. Address translation with hardware-supported virtual MMUs (vMMUs) requires a so-called 2-D page walk; in the worst case, the 2-D walk requires a 6x increase in memory loads. Several recent research proposals start from the premise that address translation in a VM is expensive. What is the cost of TLB miss processing?
3
The Cost of “Virtualizing” Virtual Memory
Several recent research proposals start from the premise that address translation in a VM is expensive, but those results are based on older hardware/software and on 4KB hypervisor mappings. What is the cost of TLB miss processing in a virtual environment?
4
Executive Summary of Our Findings
We evaluated TLB miss processing with EPT on TPC-C with 475 GB of memory, the multi-VM VMmark benchmark, 49 programs from SPEC CPU2006, Parsec 3 and SPLASH-2x, and a microbenchmark. For TPC-C, the percentage of total cycles spent in address translation is only 4.2% higher than bare metal. For the VMmark benchmark, only 4.3% of total cycles can be attributed to EPT. The increase in address translation costs on virtual hardware is not as severe as one might expect, though total address translation costs can still be large. Expanding TLB reach (both native and virtual) is still important.
5
The Rest of Our Talk: how address translation works in a virtual environment; a study of the EPT mechanism through a microbenchmark; Parsec/SPLASH/SPEC results; TPC-C and VMmark results; conclusion.
6
Address Translation in a Virtual Environment
7
“Virtualizing” Virtual Memory
A guest does not have direct access to physical memory. The hypervisor preserves guest isolation by controlling the mappings to physical memory; guests can only access physical memory allocated by the hypervisor. With shadow page tables, the guest page tables are unused by the hardware: the hypervisor maintains separate page tables that the hardware uses for address translation, which requires costly traps into the hypervisor to keep the shadows up to date.
8
“Virtualizing” Virtual Memory
Alternatively, Intel and AMD provide hardware support through EPT and NPT, respectively. Both the guest OS and hypervisor page tables are used during address translation, in a so-called 2-D page walk. There are no more traps to maintain shadows, but hardware page walks are more complex. How does all this work?
9
Native Address Translation
10
4-Level Hierarchical Page Table
Level   Intel Name   Each Entry Maps
L4      PML4         512GB
L3      PDPT         1GB
L2      PD           2MB
L1      PT           4KB
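Since each level consumes 9 bits of the virtual address (512 eight-byte entries per 4KB table) above a 12-bit page offset, the region mapped by a single entry at level k is:

\[ \mathrm{coverage}(L_k) = 2^{\,12 + 9(k-1)} \text{ bytes}: \quad L_1 = 2^{12} = 4\,\mathrm{KB},\; L_2 = 2^{21} = 2\,\mathrm{MB},\; L_3 = 2^{30} = 1\,\mathrm{GB},\; L_4 = 2^{39} = 512\,\mathrm{GB}. \]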
11
Basic Native Page Walk
12
Native Page Walk
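The walk diagram for this slide is not reproduced in the transcript. As a minimal sketch of the native walk it depicts (read_phys() is a hypothetical helper standing in for the hardware's physical-memory loads; present/permission bits and large-page early exits are omitted):

    #include <stdint.h>

    /* Hypothetical helper: an 8-byte load from a physical address; not a real API. */
    extern uint64_t read_phys(uint64_t pa);

    /* Minimal sketch of a native 4-level x86-64 page walk for 4KB pages. */
    uint64_t native_walk(uint64_t cr3, uint64_t va)
    {
        uint64_t table = cr3 & ~0xFFFULL;              /* physical base of the PML4    */
        for (int level = 4; level >= 1; level--) {     /* PML4 -> PDPT -> PD -> PT     */
            uint64_t idx   = (va >> (12 + 9 * (level - 1))) & 0x1FF;
            uint64_t entry = read_phys(table + idx * 8);   /* one memory load per level */
            table = entry & ~0xFFFULL;                 /* next table, or the 4KB frame */
        }
        return table | (va & 0xFFF);                   /* frame base + 12-bit offset   */
    }

In the worst case this is 4 memory loads per TLB miss; the TLB and the paging-structure caches exist to avoid most of them.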
13
Adding the Second Dimension (EPT)
14
Adding the Second Dimension (Simplified)
15
Putting it all together - EPT
16
Putting it all together - EPT
24 possible memory loads
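Where the 24 comes from: each of the four guest page-table references, and then the final guest-physical address of the data, must itself be translated through a four-level EPT walk:

\[ 4 \times (4 + 1) + 4 = 24 \text{ memory loads.} \]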
17
TLB and Page Size: the TLB (and the page structure caches) provide combined mappings, gVA -> hPA. To see TLB hit-rate improvements from using large pages, both the guest OS and the hypervisor must map the page as 2MB.
18
EPT Evaluation and Analysis
19
Our Experimental Setup
2-socket Haswell-EP (E5 series), 36 cores / 72 threads, 512GB of memory. VMware ESXi, vSphere Release 6.0. One VM (RHEL 7.1) with 64 vCPUs and 475GB of memory. Secondary machine (used for VMmark): 2-socket Haswell-EP E5-2687W v3.
20
TLB and Cache Characteristics
TLB Specifications (entries)
L1 ITLB: 4KB pages: 128; 2MB pages: 8 per thread
L1 DTLB: 4KB pages: 64; 2MB pages: 32; 1GB pages: 4
L2 Unified TLB: Haswell: 1024 entries (4KB or 2MB); Sandy Bridge: 512 entries (4KB only, none for 2MB)

Cache Specifications (Haswell)
L1 Data: 32KB; L1 Instruction: 32KB; L2: 256KB; L3: 45MB
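For context, the reach of the 1024-entry unified L2 TLB alone (a back-of-the-envelope figure that ignores the smaller L1 TLBs) is

\[ 1024 \times 4\,\mathrm{KB} = 4\,\mathrm{MB} \quad\text{vs.}\quad 1024 \times 2\,\mathrm{MB} = 2\,\mathrm{GB}, \]

so even with 2MB pages the TLB covers only a small fraction of the 475GB VM used later, which is why walk costs matter for these workloads.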
21
(Partial) List of Relevant Performance Counters
Counter                                       Description
DTLB_(LOAD|STORE)_MISSES.WALK_DURATION        Cycles spent on TLB miss processing (EPT inclusive)
DTLB_(LOAD|STORE)_MISSES.WALK_COMPLETED       Number of native/guest-level walks
EPT.WALK_CYCLES                               Cycles spent on the EPT portion of the walk
PAGE_WALKER_LOADS.DTLB_(L1|L2|L3|MEMORY)      Memory loads performed during native/guest-level walks
LONGEST_LAT_CACHE.REFERENCE                   LLC references (hits + misses)
LONGEST_LAT_CACHE.MISS                        LLC misses
DTLB_LOAD_MISSES.PDE_CACHE_MISS               Native/guest-level misses in the L2 page structure cache
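As an illustration of how such raw counts turn into the percentages quoted later, here is a minimal sketch with placeholder values (collecting the counters themselves, e.g. with perf or a PMU library, is outside the scope of this sketch; the variable names are ours, not counter APIs):

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint64_t total_cycles = 1000000000ULL;  /* unhalted core cycles over the run   */
        uint64_t walk_cycles  = 80000000ULL;    /* DTLB_*_MISSES.WALK_DURATION, summed */
        uint64_t ept_cycles   = 30000000ULL;    /* EPT.WALK_CYCLES                     */
        uint64_t walks        = 4000000ULL;     /* DTLB_*_MISSES.WALK_COMPLETED        */

        printf("%% of cycles in TLB miss processing: %.2f%%\n",
               100.0 * walk_cycles / total_cycles);
        printf("%% of cycles in the EPT portion:     %.2f%%\n",
               100.0 * ept_cycles / total_cycles);
        printf("average cycles per completed walk:   %.1f\n",
               (double)walk_cycles / walks);
        return 0;
    }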
22
Virtual MMU Simulation
We would like to better understand the EPT-level walk, since there are not many counters for EPT, so we built a simple simulator using Intel Pin. It instruments program loads and stores, models the TLB and page structure cache characteristics, and was verified against the performance counters.
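The simulator itself is built on Intel Pin and is not reproduced here. As a minimal standalone sketch of the kind of structure it models, here is a direct-mapped TLB keyed by virtual page number (the parameters are illustrative, not Haswell's exact organization; the miss path is where a real model would charge guest-level and EPT walk loads):

    #include <stdint.h>
    #include <stdbool.h>

    #define TLB_ENTRIES 1024
    #define PAGE_SHIFT  12          /* 4KB pages; a real model also tracks 2MB/1GB */

    struct tlb {
        uint64_t vpn[TLB_ENTRIES];
        bool     valid[TLB_ENTRIES];
        uint64_t hits, misses;
    };

    static void tlb_access(struct tlb *t, uint64_t va)
    {
        uint64_t vpn = va >> PAGE_SHIFT;
        unsigned set = vpn % TLB_ENTRIES;        /* direct-mapped for simplicity      */
        if (t->valid[set] && t->vpn[set] == vpn) {
            t->hits++;
        } else {                                 /* miss: simulate a walk, then fill  */
            t->misses++;
            t->vpn[set]   = vpn;
            t->valid[set] = true;
        }
    }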
23
Microbenchmark Evaluation
24
Microbenchmark Program
Performs random accesses to an mmap()'d array of integers, mapped with MAP_ANONYMOUS | MAP_POPULATE. We experiment with: the size of the array, the page size of the array (2MB or 4KB), and the hypervisor page mapping size (2MB or 4KB). Results are shown using the notation (execution environment)-(guest page size)-(hypervisor page size), e.g. Virtual-4KB-2MB.
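A minimal sketch of such a microbenchmark (our own illustration, not the exact program used in the paper; the array size comes from the command line, and the guest/hypervisor page sizes are controlled outside the program, e.g. via transparent huge pages and hypervisor settings):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    int main(int argc, char **argv)
    {
        size_t bytes = (argc > 1) ? strtoull(argv[1], NULL, 0) : (1ULL << 30);
        size_t n     = bytes / sizeof(int);

        /* Anonymous, prefaulted mapping, as on the slide. */
        int *a = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
        if (a == MAP_FAILED) { perror("mmap"); return 1; }

        unsigned long long sum = 0, x = 88172645463325252ULL;
        for (long i = 0; i < 100000000L; i++) {
            x ^= x << 13; x ^= x >> 7; x ^= x << 17;   /* xorshift: cheap random index   */
            sum += a[x % n];                           /* random access -> TLB misses    */
        }
        printf("%llu\n", sum);                         /* keep the loop from being elided */
        munmap(a, bytes);
        return 0;
    }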
25
Address Translation Cycles - Array w/ Large Pages
26
Cost of Virtual Address Translation - EPT walks
The number of EPT walks is determined by the length of the guest-level walk: the gVAs and the page structure caches determine where the guest walk begins, and the guest-level page size determines where it ends.
27
Counting EPT Walks/Guest loads (from simulation)
28
Ex: 256GB Array - EPT Walks/Guest Loads
29
Cost of Virtual Address Translation - EPT loads
The number of EPT loads: the gPAs and the page structure caches determine where each EPT walk begins, and the host-level page size determines where it ends.
30
Counting EPT Loads (from simulation)
31
Ex: 256GB Array - EPT Loads
32
Cost of Virtual Address Translation - EPT walks
Memory load latency is determined by the size of the page tables (guest and host).
33
Takeaways From Microbenchmark Experiments
Page size at the guest and hypervisor level changes the cost of address translation dramatically, even setting aside the reduced TLB reach when the hypervisor maps at 4KB. With 2MB hypervisor mappings the EPT cost is modest, because the entire EPT can fit in the LLC.
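As a rough sanity check of that last point (assuming 8-byte EPT entries and the 475GB VM from our setup): with 2MB host mappings the EPT needs one leaf entry per 2MB of guest-physical memory, while with 4KB host mappings it needs one per 4KB:

\[ \frac{475\,\mathrm{GB}}{2\,\mathrm{MB}} \times 8\,\mathrm{B} \approx 1.9\,\mathrm{MB} \quad\text{vs.}\quad \frac{475\,\mathrm{GB}}{4\,\mathrm{KB}} \times 8\,\mathrm{B} \approx 1\,\mathrm{GB}. \]

About 1.9MB of leaf-level EPT (plus a few KB of upper levels) fits comfortably in the 45MB LLC; roughly 1GB of leaf entries does not.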
34
Experiments with SPEC, Parsec 3 and SPLASH-2x
35
Address Translation Cycles
Only 7 of the 49 programs spent more than 5% of their cycles on address translation when virtualized. Mean increase in address translation cycles (as a % of total cycles): 3.7%. Relative increase in address translation cycles: 80%, or 50% when you exclude dedup.
36
System level benchmarks
37
TPC-C: the industry-standard OLTP benchmark, and our biggest hammer. Experimental results; not comparable to published results. 72-way Haswell-EP; ESX 6.0 GA, RHEL 7.1, Oracle 12c R1; 64-way VM on the 72-way host. Throughput was 90% of native, with 4.8% of cycles in EPT processing.
38
VMmark: the de facto industry-standard consolidation benchmark. 16 tiles, 128 VMs; more than half are 2GB VMs, so TLB reach is less of a problem, but the benchmark is affected by working-set sampling. Run on a Haswell-EP E5-2687W v3 (top-bin frequency, but only 10 cores) at 97% CPU utilization. 11.8% of cycles go to TLB miss processing and 3.9% to EPT: easier on the TLB than TPC-C.
39
TLB miss processing costs for system-level benchmarks
40
Benefits of the hypervisor using prototype 1GB pages
41
TLB and LLC stats for 2MB and 1GB page backing
42
Page splintering: the hypervisor may choose to splinter a host 2MB page into its constituent 4KB pages. Reasons include estimating the size of a VM's working set (with an upper bound of 400 large pages splintered on ESXi), preparing a VM for live migration, ballooning, etc. (transient), and Transparent Page Sharing, which only works on 4KB pages (the user's choice). This could be a major performance problem: a 4KB page mapping can drastically increase EPT costs.
43
Page splintering in practice: TPC-C sees no difference with and without splintering, since at most about 800MB of splintered pages is negligible against 512GB. VMmark does experience an impact, because over half the VMs are 2GB: disabling working-set sampling gives VMmark a 1% drop in EPT cycles and a 1% boost in throughput. This is hard to analyze since the 2MB DTLB miss counters are broken under virtualization; even in the default case with splintering, EPT is only 3.9% of cycles.
44
Conclusions: EPT is still a substantial component of virtualization overhead, but its absolute value is shrinking, and so is the overall overhead. Time spent in TLB misses is still significant. Could it get worse? Memory sizes grow with Moore's Law, but TLB sizes grow only linearly and page sizes have stagnated. Page splintering is not a problem in the common case.
45
Follow-up research topics
The EPT-PDE cache goes unused for 2MB host mappings; perhaps it could be reduced in favor of a larger EPT-PDPTE cache, an EPT-level TLB, or reusing the EPT PD cache for 2MB pages. Also: expand the investigation to AMD and other hypervisors.
46
Acknowledgements: Jim Mattson, Michael Ho, Yury Baskakov, Rajesh Venkatasubramanian, James Zubb, Fei Guo, Seongbeom Kim, Chris Rossbach, Vish Viswanathan, Sajjid Reza.
47
The 24-step page walk (backup slide): translating a gVA begins with gCR3, which is a guest-physical address and must itself go through the four nested levels nL4, nL3, nL2, nL1 (steps 1-4) before the guest PML4 entry gL4 can be read (step 5). The same pattern repeats to obtain gL3 (steps 6-10), gL2 (steps 11-15), and gL1 (steps 16-20), and the resulting final gPA takes one last nested walk (steps 21-24) to produce the final hPA.
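A minimal sketch of this nesting in code, under the same simplifications as the earlier native-walk sketch (hypothetical read_phys(), 4KB pages at both levels, no paging-structure caches):

    #include <stdint.h>

    extern uint64_t read_phys(uint64_t hpa);   /* hypothetical 8-byte host-physical load */

    /* Translate a guest-physical address through the 4-level EPT (nL4..nL1). */
    static uint64_t ept_walk(uint64_t ept_root, uint64_t gpa)
    {
        uint64_t table = ept_root & ~0xFFFULL;
        for (int level = 4; level >= 1; level--) {
            uint64_t idx = (gpa >> (12 + 9 * (level - 1))) & 0x1FF;
            table = read_phys(table + idx * 8) & ~0xFFFULL;   /* 4 loads per EPT walk */
        }
        return table | (gpa & 0xFFF);
    }

    /* The 2-D walk: every guest-level reference is itself translated through the EPT. */
    uint64_t two_dim_walk(uint64_t ept_root, uint64_t gcr3, uint64_t gva)
    {
        uint64_t gpa = gcr3 & ~0xFFFULL;                      /* gCR3 holds a guest PA    */
        for (int level = 4; level >= 1; level--) {            /* gL4 -> gL3 -> gL2 -> gL1 */
            uint64_t idx   = (gva >> (12 + 9 * (level - 1))) & 0x1FF;
            uint64_t hpa   = ept_walk(ept_root, gpa);         /* 4 EPT loads              */
            uint64_t entry = read_phys(hpa + idx * 8);        /* 1 guest PTE load         */
            gpa = entry & ~0xFFFULL;
        }
        return ept_walk(ept_root, gpa) | (gva & 0xFFF);       /* final gPA -> hPA: 4 loads */
    }

Four iterations of (4 + 1) loads plus the final 4-load EPT walk gives the 24 loads counted above; the paging-structure caches on the following backup slides exist precisely to skip most of these steps.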
48
In practice, 2 EPT walks with 2M guest + host pages. [Backup diagram: with 2MB pages at both levels and a hit in the guest PML4/paging-structure caches, most of the nested steps are skipped and only about two full EPT walks remain.]
49
2 IA steps + 3 EPT steps with 2M guest + host pages. [Backup diagram: same configuration as the previous slide, but with EPT-PDPTE and EPT-PML4E cache hits shortening the remaining nested walks, leaving 2 guest (IA) steps and 3 EPT steps.]
50
2 IA steps + 2 EPT steps with 1GB host pages. [Backup diagram: with 1GB host pages the nested walks terminate a level earlier, and with EPT-PML4E cache hits only 2 guest (IA) steps and 2 EPT steps remain.]