Accelerating Two-Dimensional Page Walks for Virtualized Systems Jun Ma
Introduction Native non-virtualized system: An OS runs on a physical system and communicates with it directly. Address mapping: Virtual Address (VA): the address used by the OS and application software. Physical Address (PA): the address in the physical machine. For a native system: VA->PA.
Introduction Virtualization: Multiple OSes can run simultaneously but separately on one physical system. Hypervisor: the underlying software that inserts abstractions into the virtualized system and mediates communication between each OS and the physical system.
Introduction Virtualization: Address mapping for a virtual machine. Guest OS: Guest Virtual Address (GVA), Guest Physical Address (GPA). Physical system: System Physical Address (SPA). Address translation: GVA->GPA->SPA
Introduction Virtualization: Traditional approach to memory translation: handled by the hypervisor. Drawback: the hypervisor intercepts the operation, exits the guest, emulates the operation, performs the memory translation, and then returns to the guest. -> high overhead. Alternative: perform the translation in hardware, so the hypervisor is not involved and that overhead is saved.
Background X86 Native Page Translation Page table: hierarchical address-translation tables map VA to PA. Page walk: an iterative process. To get the final PA from a VA, a page walk traverses every level of the page table hierarchy.
Background X86 Native Page Translation The walk goes from level 4 down to level 1. At each level, the physical address from the level above is used as the table base address and 9 bits of the VA are used as the index into that table. The TLB (translation look-aside buffer) caches the final physical address to reduce the frequency of page walks.
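The level-by-level indexing described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: 4 KB pages with a 12-bit page offset, a 48-bit virtual address, and a hypothetical `read_entry(table_base, index)` callback standing in for a memory read. None of these names come from the slides.

```python
PAGE_OFFSET_BITS = 12  # assumed 4 KB pages
INDEX_BITS = 9         # 9 bits of the VA select one of 512 entries per table
LEVELS = 4             # L4 down to L1


def split_virtual_address(va):
    """Return ([L4, L3, L2, L1] indices, page offset) for a 48-bit VA."""
    offset = va & ((1 << PAGE_OFFSET_BITS) - 1)
    indices = []
    for level in range(LEVELS, 0, -1):
        shift = PAGE_OFFSET_BITS + INDEX_BITS * (level - 1)
        indices.append((va >> shift) & ((1 << INDEX_BITS) - 1))
    return indices, offset


def native_walk(root, va, read_entry):
    """Walk L4 -> L1; each entry read yields the base address for the next level."""
    indices, offset = split_virtual_address(va)
    base = root
    for index in indices:
        base = read_entry(base, index)  # hypothetical memory read
    return base + offset


# Toy page tables: (table base, index) -> next base. Purely illustrative values.
tables = {(0x1000, 1): 0x2000, (0x2000, 2): 0x3000,
          (0x3000, 3): 0x4000, (0x4000, 4): 0x9000}
va = (1 << 39) | (2 << 30) | (3 << 21) | (4 << 12) | 0xABC
pa = native_walk(0x1000, va, lambda base, index: tables[(base, index)])
```

In this toy setup the walk makes four entry reads, one per level, which is the cost the TLB avoids on a hit.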
Background Memory Management for Virtualization Without hardware support, the hypervisor must perform this translation in software (using a shadow page table to map GVA to SPA). This is an important source of hypervisor overhead. Hardware mechanism: the same idea as the x86 page walk, extended to a 2D page walk. Nested paging: a nested page table maps GPA to SPA.
Background Memory Management for Virtualization The guest page table is traversed to translate GVA to GPA. At each guest level (gL), the GPA of the guest page table to be read must first be translated to an SPA by walking the nested page table. The TLB caches the final SPA to reduce page walk overhead.
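To make the shape of the 2D walk concrete, the sketch below enumerates the page entry references in order, using the `{column,row}` naming that appears later in the slides (for example `{nL1,gL4}` and `{G,gL1}`). It assumes four guest levels and four nested levels. The function name is made up for illustration.

```python
GUEST_LEVELS = 4
NESTED_LEVELS = 4


def two_d_walk_references():
    """List the (column, row) page entry references of one 2D page walk."""
    refs = []
    for g in range(GUEST_LEVELS, 0, -1):
        row = f"gL{g}"
        # The guest table's GPA must be translated to an SPA first:
        # one full nested walk (nL4..nL1) per guest level.
        for n in range(NESTED_LEVELS, 0, -1):
            refs.append((f"nL{n}", row))
        # Then the guest page entry itself is read.
        refs.append(("G", row))
    # The final guest physical address (gPA) needs one more nested walk.
    for n in range(NESTED_LEVELS, 0, -1):
        refs.append((f"nL{n}", "gPA"))
    return refs


refs = two_d_walk_references()
total = len(refs)  # 4 * (4 + 1) + 4 references
```

Under these assumptions the walk touches 24 page entries, compared with 4 for a native walk, which is why caching the walk matters so much more in the virtualized case.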
Background Large page size advantages:
* Memory saving: With 4 KB pages, the OS must use an entire L1 table, which is itself 4 KB large, to map a 2 MB region. If the 512 4 KB pages are combined into one contiguous 2 MB block, the L1 table is skipped, saving the 4 KB it would occupy.
* Reduction in TLB pressure: Each large page table entry can be stored in a single TLB entry, while the corresponding regular page entries require 512 TLB entries to map the same 2 MB range of virtual addresses.
* Shorter page walk: Because the entire L1 level is skipped, the page walk is shorter and some overhead is saved.
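The numbers behind these three advantages follow from simple arithmetic. The sketch below assumes 8-byte page table entries, as on x86-64. The variable names are illustrative only.

```python
SMALL_PAGE = 4 * 1024          # 4 KB regular page
LARGE_PAGE = 2 * 1024 * 1024   # 2 MB large page
PTE_SIZE = 8                   # assumed bytes per x86-64 page table entry

# Regular pages (and TLB entries) needed to cover one 2 MB region.
entries_per_l1 = LARGE_PAGE // SMALL_PAGE

# Size of the L1 table that a large page makes unnecessary.
l1_table_bytes = entries_per_l1 * PTE_SIZE

# Page walk depth: L4..L1 with 4 KB pages, L4..L2 with 2 MB pages.
walk_levels_small = 4
walk_levels_large = walk_levels_small - 1
```

So one large-page TLB entry replaces 512 regular entries, and the skipped L1 table is exactly one 4 KB page.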
Page walk characterization Page walk cost Perfect TLB Opportunity: the performance improvement that could be achieved with a perfect TLB, which eliminates cold misses as well as conflict and capacity misses.
Page walk characterization Page entry reuses
Page walk characterization Page entry reuses Nested page tables have much higher reuse than guest page tables, in part due to the inherent redundancy of the nested page walk. There are many more nested accesses than guest accesses in a 2D page walk. Each level of the nested page table hierarchy must be accessed for each guest level. In many cases the same nested page entries are accessed multiple times in a 2D page walk (high reuse rate).
Page walk characterization Page entry reuses {G,gL1} and {nL1,gPA} both have a high number of unique page entries because both map guest data into their respective address spaces. {G,gL1} maps GVA -> GPA. {nL1,gPA} maps GPA -> SPA. These two are therefore the most difficult to cache.
Page Walk Acceleration AMD Opteron Translation Caching: Page walk cache (PWC): stores page entries from all page table levels except L1, which is stored in the TLB. All page entries are initially brought into the L2 cache. On a PWC miss, the page entry data may reside in the L2 cache, the L3 cache (if present), or main memory.
Page Walk Acceleration Translation caching for 2D page walks
Page Walk Acceleration Translation caching for 2D page walks One-Dimensional PWC (1D PWC): Only page entry data from the guest dimension are stored in the PWC, and the entries are tagged with their system physical address. The lowest-level guest page table entry {G,gL1} is not cached in the PWC because of its low reuse rate. Two-Dimensional PWC (2D PWC): Extends 1D PWC into the nested dimension of the 2D page walk, turning the 20 unconditional cache hierarchy accesses into 16 likely PWC hits (dark-filled references in Figure 5(b)) and four possible PWC hits (checkered references). As in 1D PWC, all page entries are tagged with their system physical address and {G,gL1} is not cached.
Page Walk Acceleration Translation caching for 2D page walks Two-Dimensional PWC with Nested Translations (2D PWC+NT): Augments 2D PWC with a dedicated GPA-to-SPA translation buffer, the Nested TLB (NTLB), which reduces the average number of page entry references that take place during a 2D page walk. The NTLB uses the guest physical address of the guest page entry to cache the corresponding nL1 entry. The page walk begins by accessing the NTLB with the guest physical address of {G,gL4}, which produces the data of {nL1,gL4} and allows nested references 1-4 to be skipped. On an NTLB hit, the system physical address of {G,gL4} needed for the PWC access is calculated.
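The NTLB behaviour described on this slide can be sketched as a small GPA-page to SPA-page map consulted before the nested walk. This is a simplified model, not the hardware design. It assumes 4 KB pages, an unbounded dictionary in place of a fixed-size buffer, and a hypothetical `nested_walk(gpa)` callback that performs the four nested references (nL4..nL1). The class and function names are invented for illustration.

```python
PAGE_MASK = 0xFFF  # assumed 4 KB pages


class NestedTLB:
    """Toy NTLB: maps a guest physical page number to its system physical page."""

    def __init__(self):
        self.entries = {}

    def lookup(self, gpa):
        return self.entries.get(gpa >> 12)

    def fill(self, gpa, spa_page):
        self.entries[gpa >> 12] = spa_page


def translate_guest_entry(gpa, ntlb, nested_walk):
    """Return (spa, nested references used) for the GPA of a guest page entry."""
    spa_page = ntlb.lookup(gpa)
    if spa_page is not None:
        # NTLB hit: the SPA is computed directly and the nested walk is skipped.
        return spa_page | (gpa & PAGE_MASK), 0
    # NTLB miss: do the full nested walk (nL4..nL1), then remember the result.
    spa = nested_walk(gpa)
    ntlb.fill(gpa, spa & ~PAGE_MASK)
    return spa, 4


# Toy nested mapping used only for this example: SPA = GPA + 1 MB.
ntlb = NestedTLB()
toy_nested_walk = lambda gpa: gpa + 0x100000
first = translate_guest_entry(0x5008, ntlb, toy_nested_walk)   # miss
second = translate_guest_entry(0x5010, ntlb, toy_nested_walk)  # hit, same GPA page
```

In this model a hit saves all four nested references for that guest level, which matches the "references 1-4 skipped" case for {G,gL4}. A real NTLB has limited capacity and a replacement policy, which this sketch leaves out.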
Result Benchmarks used in the following slides:
Result The three hardware-only page walk caching schemes improve performance by turning page entry memory hierarchy references into lower latency PWC accesses and, in the case of 2D PWC+NT, skipping some page entry references entirely.
Result Left side: The G column is not skipped, so it does not change, and the same holds for the gPA row. The gL1 row is skipped in 2D_PWC+NT even though it has a low reuse rate, so it occupies a smaller share in 2D_PWC+NT than in 2D_PWC. Right side: The NTLB eliminates many of the PWC accesses, but it does not eliminate a significant portion of the accesses that have the highest penalty.
Result The first data column states that L2 accesses incurred during a 2D page walk using the 2D PWC+NT configuration generate more L2 misses than the native page walk, by the factor listed in that column. This increase is primarily because the native page walk has fewer entries that are difficult to cache (L1 and sometimes L2) compared to the 2D page walk ({G,gL1}, {nL1,gPA} and sometimes {G,gL2}, {nL2,gPA}, {nL1,gL1}, and {nL2,gL1}). The second data column shows the L2 cache miss percentage due only to page entries from the 2D page walk. The miss percentages are relatively high because the PWC and NTLB have filtered out the easy-to-cache accesses, and the remaining accesses are difficult to cache.
Result The 8096 w/(G, gL1) configuration is unique in that it writes the gL1 guest page entry to the PWC.
Result Large pages allow the TLB to cover a larger data region with fewer translations, which leads to fewer TLB misses. With large pages, the nL1 references for the gPA, gL1, gL2, gL3, and gL4 levels are all eliminated. The ability to eliminate poor-locality references, like {nL1,gL1} and {nL1,gPA}, reduces the number of L2 cache misses by 60%-64%.
Conclusion Nested paging is a hardware technique that reduces the complexity of software memory management during system virtualization. Nested page tables, which map GPA to SPA, combine with the guest page tables to translate GVA to SPA, resulting in a two-dimensional (2D) page walk that can be accelerated with the 2D_PWC and 2D_PWC+NT schemes. A hypervisor is no longer required to trap on all guest page table updates, so significant virtualization overhead is eliminated. However, nested paging can introduce new overhead due to the increase in page entry references. Therefore, nested paging improves the overall performance of a virtualized system when the eliminated hypervisor memory management overhead is greater than the new 2D page walk overhead.