1
Devirtualizing Memory in Heterogeneous Systems
Swapnil Haria, Mark D. Hill, Michael M. Swift University of Wisconsin-Madison
2
Growing Diversity in Computer Systems
Modern systems are diversifying compute: general-purpose cores (CPUs), GPUs, and specialized accelerators such as the Tensor Processing Unit (Google), Nervana (Intel), and the Database Accelerator (Oracle). [Diagram: ACC, CPU, and GPU on an interconnect to system memory.] We are increasingly seeing diverse compute resources alongside the conventional general-purpose CPUs. The graphics processing unit has been here for some time now, and lately we have seen various specialized accelerators emerging from industry. These accelerators currently sit on the PCIe bus, but we firmly believe that moving them to the memory bus makes sense, as it provides shared memory between the CPU and the accelerators, which eliminates explicit data copying. Shared memory allows the CPU to initialize data structures and then simply offload only the computation to the ACC without needing to explicitly copy any data.
3
3Ps for Memory Management
Physical Addressing vs. Virtual Addressing: Protection, Performance, Programmability. Currently, there are only two major ways of managing the shared memory: using physical addresses, which was common in some older accelerators, or using virtual memory, as is present in some modern GPUs. We will compare these two on three metrics: protection, performance, and programmability.
4
Physical Addressing — Direct Access to Private Data: Insecure! [Diagram: CPU with MMU, TLB, and caches; ACC with direct access to system memory holding kernel data and shared data.] On the left is a CPU using conventional virtual memory with a Translation Lookaside Buffer (TLB); on the right is the accelerator. The simplest approach is to simply have the ACC use PAs to access memory directly. This avoids the need for any additional hardware, and it is fast, as the ACC directly accesses shared memory. However, direct access allows the ACC to reference even private data. As a result, protection needs to be enforced either via device drivers or by making sure that only trusted code is executed. Moreover, pointers are not useful to the ACC, as they refer to VAs.
5
Virtual Addressing. [Diagram: CPU with MMU, TLB, and caches; ACC with its own TLB and an IOMMU; both connected to system memory.]
To allow the ACC to dereference pointers and, in general, use VAs, we need an IO memory management unit (IOMMU) and a TLB to cache such translations. This is easy to program, as the ACC can directly dereference CPU-side pointers.
6
Virtual Addressing — Access Shared Data: Address Translation, Permissions. [Diagram: the ACC's access to shared data causes a TLB miss in the IOMMU; a page walk over the page tables returns the translation, and only then does the data fetch for the shared data proceed.] However, address translation can be rather slow, as we see here. If the translation is not cached in the TLB, a page walk is started. After the long, slow page walk, the translation is returned, and the ACC can access the shared data.
7
3Ps for Memory Management
Physical Addressing: Protection ×, Performance ✓, Programmability ×. Virtual Addressing: Protection ✓, Performance ×, Programmability ✓. How do we get the best of both worlds? To summarize, PA is good for performance but is hard to program for and doesn't offer protection. Conversely, VA supports protection and improves programmability but adds high overheads. So wouldn't it be nice to get all three properties?
8
Devirtualizing Memory
Physical Addressing / Virtual Addressing / Our Proposal (DVM):
Protection: × / ✓ / ✓
Performance: ✓ / × / ✓
Programmability: × / ✓ / ✓
Use virtual addressing and (usually) set VA == PA. We use virtual addresses, so we get protection and programmability; for performance, we allocate data with its VA and PA equal. This allows us to skip address translation and simply perform a validation step for protection.
9
I don’t need no Translation
Translate VA Load
10
I don’t need no Translation
Validate Load
11
Outline Motivation Devirtualized Memory
a. Allocating VA == PA (Mostly Software) b. Exploiting PA == VA (Mostly Hardware) Evaluation. DVM has two main parts: first, allocating data such that VA == PA, which requires OS changes; and second, exploiting the PA == VA property to expedite memory accesses via hardware changes.
12
How is Memory Allocated Today?
APP: Application requests memory. VA: Allocator selects a suitable free VA range. PA: On first access to a page, the OS backs it with an available PA. [Diagram: virtual address range 0x1000:0x1FFF backed by physical pages at 0x1800 and 0x2000.] Typically, only data on the heap is shared with the accelerator, so let's first understand how heap memory is allocated. When the application requests memory, it doesn't care what VAs the allocated memory is located at. For small requests, the memory allocator tries to satisfy the request from its own pool of free memory; for larger requests, the memory allocator turns to the OS. The OS selects a suitable VA range that is so far unmapped and returns the base address to the application. PAs are allocated lazily, at the time of first access to each page.
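To make this lazy allocation flow concrete, here is a minimal user-space sketch (not from the slides) using mmap: the kernel picks the virtual range up front, while physical frames are assigned only when each page is first touched. The sizes are illustrative.

```c
/* Minimal user-space illustration of today's lazy allocation:
 * the OS picks a free virtual range up front, but physical frames
 * are only committed when each page is first written. */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void) {
    size_t len = 4 * 4096;                       /* four 4 KB pages */
    /* The OS chooses the virtual address (first argument is NULL). */
    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) return 1;

    printf("VA range granted at %p\n", (void *)buf);
    /* No physical memory is committed yet.  The first write to each
     * page faults, and only then does the OS back it with some PA
     * that generally has no relation to the VA. */
    memset(buf, 0, len);
    munmap(buf, len);
    return 0;
}
```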
13
Allocating VA == PA (Identity Mapping)
APP: Application requests memory. PA: Allocate PAs eagerly and contiguously (if possible). VA: Set VAs equal to the allocated PAs. [Diagram: virtual and physical address spaces with ranges 0x1000:0x1FFF and 0x3000:0x4FFF identity mapped.] DVM allocates memory such that VA == PA, which we call identity mapping. When the allocator turns to the OS, the OS now flips the order of operations: first it allocates PAs eagerly and contiguously during the initial allocation itself, and then it sets the VAs equal to the allocated PAs. For small allocations, we ensure that the allocator's free pool is also identity mapped, and each large allocation is individually identity mapped; this avoids creating a single large contiguous heap. Of course, identity mapping is not always possible, so DVM falls back to demand paging.
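The flipped allocation order can be sketched with a small, self-contained toy model; the frame allocator, page-table array, and sizes below are invented for illustration and are not the Linux implementation described in the talk.

```c
/* Toy model of identity mapping: reserve contiguous physical frames
 * first, then map them at the identical virtual address (VA == PA),
 * falling back to conventional demand paging when that fails. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NFRAMES 64               /* model a tiny physical memory       */
#define PAGE    4096u

static bool frame_used[NFRAMES];    /* physical allocator state        */
static uint64_t va_to_pa[NFRAMES];  /* model page table, 0 = unmapped  */

/* Find `n` free contiguous frames; return first frame index or -1. */
static int reserve_contiguous(int n) {
    for (int s = 0; s + n <= NFRAMES; s++) {
        int i = 0;
        while (i < n && !frame_used[s + i]) i++;
        if (i == n) {
            for (int j = 0; j < n; j++) frame_used[s + j] = true;
            return s;
        }
    }
    return -1;
}

/* Identity-map an allocation of n pages; returns the VA (== PA). */
static uint64_t dvm_alloc(int n) {
    int frame = reserve_contiguous(n);   /* PAs are picked eagerly     */
    if (frame < 0) return 0;             /* would fall back to ordinary
                                            demand paging here         */
    for (int j = 0; j < n; j++)
        va_to_pa[frame + j] = (uint64_t)(frame + j) * PAGE;  /* VA==PA */
    return (uint64_t)frame * PAGE;
}

int main(void) {
    uint64_t va = dvm_alloc(4);
    printf("identity-mapped 4 pages at VA=PA=0x%llx\n",
           (unsigned long long)va);
    return 0;
}
```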
14
Outline Motivation Devirtualized Memory
a. Allocating VA == PA (Mostly Software) b. Exploiting PA == VA (Mostly Hardware) Evaluation. Let me describe how we exploit this property to validate accesses fast. Remember, we still have to support translation in some cases, so we have to support page tables and page walks; but, as we will see, identity mapping lets us make those walks shorter and cheaper.
15
Memory Accesses with DVM
The Good (PA == VA): validation succeeds and the load proceeds directly. The Bad (permission violation): validation fails and an exception is raised. The Ugly (PA != VA): validation fails and the VA must be translated before the load. On every memory access, we can have three cases: the good, the bad, and the ugly. When the address is identity mapped, we perform a quick permissions check and add only low overheads. When some private data is accessed, we quickly get a permissions violation, and an exception is raised. Finally, if an access is made to a page that is not identity mapped, validation (DAV) fails, and we need to translate the address before we can fetch the data.
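A conceptual sketch of this three-way decision is below; the types and encodings are hypothetical and only model the behavior, not the paper's hardware.

```c
/* Self-contained model of the three outcomes of access validation. */
#include <stdint.h>
#include <stdio.h>

typedef enum { PERM_NONE = 0, PERM_R = 1, PERM_W = 2 } perm_t;
typedef enum { LOAD_DIRECT, FAULT, TRANSLATE_THEN_LOAD } outcome_t;

typedef struct {
    int identity_mapped;   /* does PA == VA hold for this page?    */
    int perms;             /* permission bits recorded in the PTE  */
} pte_info_t;

/* Decide what happens to a read at `va`, given its validation state. */
static outcome_t validate_read(uint64_t va, pte_info_t e) {
    (void)va;
    if (!(e.perms & PERM_R))      /* the bad: permission violation   */
        return FAULT;
    if (e.identity_mapped)        /* the good: use the VA as the PA  */
        return LOAD_DIRECT;
    return TRANSLATE_THEN_LOAD;   /* the ugly: fall back to a walk   */
}

int main(void) {
    pte_info_t good = {1, PERM_R | PERM_W}, bad = {1, PERM_NONE},
               ugly = {0, PERM_R};
    printf("%d %d %d\n", validate_read(0x1000, good),
           validate_read(0x2000, bad), validate_read(0x3000, ugly));
    return 0;
}
```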
16
Rethinking Page Tables
Page Global Directory (L4), Page Directory Pointer (L3), Page Directory (L2), Page Table (L1). [Diagram: a 4-level page walk ending at a last-level PTE that maps an identity-mapped (PA == VA) virtual page.] Here's a page walk in a 4-level page table ending at a last-level PTE. This PTE maps the virtual page shown in the address space at the bottom of the slide. If this page is identity mapped, we don't need to store the PA in the PTE. We are also only doing this in ACC-side page tables and for heap data, so we can ignore the metadata bits or store them at a coarser granularity. We are left with only 2 permission bits in the entire 64-bit PTE. Now, if this page is part of a larger memory allocation, the PTEs for the adjacent pages will also be adjacent in the page tables, and all of these pages will be identity mapped with the same permissions. So, to conserve space, we can actually store these permissions in a higher level of the page table itself.
17
Sixteen 2-bit Permissions
Permission Entry (PE): bit 62 is the PE identifier (PE = 1), and the low bits (31:30, 29:28, ..., 3:2, 1:0) hold sixteen 2-bit permissions, one per aligned VA sub-range (e.g., RW, RW, -, ..., RW). A PE contains a bit to differentiate it from other PTEs, plus the sixteen 2-bit permissions. At level 2 of the page table, each entry covers a 2MB range, so each of the sixteen permissions covers a 128KB sub-range. We can do this at higher levels of the page table as well, and we can also handle holes. So validation (PA == VA? Perms OK?) succeeds if the page walk ends in a PE with the right permissions.
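As an illustration, the following sketch decodes a level-2 Permission Entry for a given VA; the exact bit assignment (PE identifier in bit 62, chunk i in bits 2i+1:2i) follows the slide's layout but should be read as an assumption of this example.

```c
/* Decoding a Permission Entry at level 2 of the page table: an L2
 * entry covers a 2 MB region, so its sixteen 2-bit fields each
 * describe one aligned 128 KB chunk. */
#include <stdint.h>
#include <stdio.h>

#define PE_ID_BIT   (1ULL << 62)      /* marks the entry as a PE     */
#define CHUNK_SHIFT 17                /* 128 KB = 2^17 bytes         */

/* Returns the 2-bit permission for `va`, or -1 if not a PE. */
static int pe_perms(uint64_t l2_entry, uint64_t va) {
    if (!(l2_entry & PE_ID_BIT))
        return -1;                              /* ordinary PTE       */
    unsigned chunk = (va >> CHUNK_SHIFT) & 0xF; /* which 128 KB chunk */
    return (int)((l2_entry >> (2 * chunk)) & 0x3);
}

int main(void) {
    /* Hypothetical PE: read-write (0b11) for every chunk of the 2 MB. */
    uint64_t pe = PE_ID_BIT | 0xFFFFFFFFULL;
    printf("perms for VA 0x240000: %d\n", pe_perms(pe, 0x240000));
    return 0;
}
```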
18
Walk a little faster: improved caching of page-table entries — (64×) less information stored and retrieved from the page tables — and shorter page walks. The permission entries help us walk the page tables faster. First, for identity-mapped pages we store and retrieve much less information from the page tables, which improves the efficacy of the page walk cache (PWC). Second, we reduce the number of levels we have to walk.
19
Outline Motivation Devirtualized Memory
a. Allocating VA == PA (Mostly Software) b. Exploiting PA == VA (Mostly Hardware) Evaluation
20
Methodology Identity Mapping prototyped in Linux v4.10
Simulated system with CPU & graph accelerator (Graphicionado) Performance evaluated using full-system mode
21
System Configuration.
DVM: NO TLB! A 128-entry, 4-way set-associative cache (the AVC) holding L1-L4 PTEs and PEs.
Conventional VM: a 128-entry fully associative TLB and a 128-entry, 4-way set-associative page walk cache (PWC) holding L2-L4 PTEs.
[Diagram: an ACC whose MMU has a TLB and PWC (conventional VM) versus an ACC whose MMU has only an AVC (DVM), both attached to shared memory.]
22
Performance Evaluation
[Chart: execution time normalized to unsafe physical addressing (lower is better); DVM incurs 4% overheads.]
23
Conclusion We propose Devirtualized Memory
Often allocate memory with VA==PA (identity mapping); replace translation with permission checks; modify page table structures to exploit PA==VA; within 4% of ideal (unsafe) direct access. Here is an overview of our work. Accelerators such as the TPU are now becoming common, and we believe that programming such accelerators greatly benefits from shared memory between the CPU and the accelerator. However, existing memory management techniques are unsuitable. For instance, accessing the shared memory using physical addresses offers fast, direct access to memory, but this can be abused to access even private or kernel memory. On the other hand, using virtual addresses allows memory protection but requires address translation, which degrades performance. Our idea is that if we allocate data such that its VA and PA are equal, we don't need to perform full address translation: in the common case, we can verify permissions quickly and reduce the overheads of supporting VM on the accelerator from about 140% to less than 2.1%.
24
BACKUP
25
Thanks! Physical Addressing / Virtual Addressing / Our Proposal (DVM):
Protection: × / ✓ / ✓
Performance: ✓ / × / ✓
Programmability: × / ✓ / ✓
26
Thanks! Heterogeneous Systems are (finally) becoming mainstream
- Benefit from efficient shared memory
- Plagued by unsuitable memory management techniques: direct access to physical memory (PM) is fast but unsafe; VM offers protection, but is slow and expensive
We propose DVM to provide the best of both worlds:
- Allocate and access memory with PA==VA
- Fetch data in parallel with enforcing protection
- Improves performance by 2.1X, within 2% of ideal
27
Dynamic Energy spent in VM mechanisms
[Chart: dynamic energy spent in VM mechanisms for configurations including 4K pages with TLB+PWC (lower is better).]
28
Comparison with Direct Segment and Ranges (RMM)
DVM vs. Direct Segment vs. RMM:
Heap — DVM: discontiguous heap; Direct Segment: single, contiguous heap.
Hardware — DVM: no TLB, only page table walkers and a PWC; Direct Segment: TLB + comparators; RMM: TLB + range TLB, range table walkers + page table walkers.
29
Comparison with Huge Pages
DVM breaks the serialization of translation and data fetch. DVM exploits finer granularities of contiguity rather than just 2MB, 1GB, etc. DVM requires much less hardware. Huge pages have to be mapped entirely, whereas DVM allows holes.
30
Performance
31
Energy
32
Fragmentation Aggravated, but ...
- We identity map individual allocations separately, not the whole heap at once
- High-performance systems are configured not to swap anyway
- NVM's higher capacity makes this scenario less likely in the future
33
Code changes to Linux v4.10
34
ARM TrustZone: coarse-grained, either secure world or non-secure. An accelerator with direct access to PM can read/write data for other processes in the same 'world'. High hardware overheads for supporting virtual processors/MMUs, one in each world.
35
Page Table Entry bits:
0 – Present
1 – R/W (0: read only, 1: read + write)
2 – User/Supervisor (basically a read bit, as supervisor mode is not on the ACC)
3 – Page-level write-through (memory type)
4 – Page-level cache-disable
5, 6 – Accessed, Dirty
7 – Reserved
8 – Global
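For reference, these bit positions can be written as masks; this is a small header-style sketch based on the list above and the standard x86-64 PTE encoding, not code from the talk.

```c
/* Bit masks matching the PTE bit positions listed above. */
#include <stdint.h>
#include <stdbool.h>

#define PTE_PRESENT   (1ULL << 0)
#define PTE_RW        (1ULL << 1)   /* 0: read-only, 1: read+write    */
#define PTE_USER      (1ULL << 2)   /* acts as a read bit on the ACC  */
#define PTE_PWT       (1ULL << 3)   /* page-level write-through       */
#define PTE_PCD       (1ULL << 4)   /* page-level cache-disable       */
#define PTE_ACCESSED  (1ULL << 5)
#define PTE_DIRTY     (1ULL << 6)
#define PTE_GLOBAL    (1ULL << 8)

/* Example check: can the accelerator write through this mapping? */
static inline bool pte_allows_write(uint64_t pte) {
    return (pte & PTE_PRESENT) && (pte & PTE_RW);
}
```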
36
Identity Mapping Allocate Memory such that PA==VA (almost always)
- ACCs: Heap
- CPUs: Heap, Stack, Shared Objects, Code
Leverage existing support in Linux:
- Address Space Layout Randomization (ASLR)
- Position-Independent Executables (PIE)
Less intrusive changes, and there are bound to be more conflicts on PA than on VA.
37
Graphicionado High-Performance graph analytics accelerator (Micro ‘16)
Specialized-while-flexible HW pipeline; application-specific pipeline:
S1: Read Active SRC Property
S2: Read Edge Pointer
S3: Read Edges for given SRC
S4: Process Edge
S5: Control Atomic Update
S6: Read Temp DST Property
S7: Reduce
S8: Write Temp DST Property
38
Access Validation Cache
Replaces the Page Walk Cache (PWC) and TLB. 128-entry, 4-way set-associative cache (1 KB in size); physically indexed, physically tagged. Fewer leaf PTEs; caches both intermediate and leaf PTEs; 0 main-memory accesses in the best case. The page walk proceeds in parallel with the data fetch and can tolerate a 4-cycle in-cache page walk. [Diagram: IOMMU with page walker and AVC, backed by the page tables.]
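A rough sketch of such an organization is shown below; the 32-set, 4-way structure matches the slide, while the entry payload, indexing granularity, and tag split are assumptions of this illustration.

```c
/* Sketch of a 128-entry, 4-way set-associative, physically indexed,
 * physically tagged cache holding page-table entries and PEs. */
#include <stdint.h>
#include <stdbool.h>

#define AVC_WAYS 4
#define AVC_SETS 32                     /* 128 entries / 4 ways        */

typedef struct {
    bool     valid;
    uint64_t tag;                       /* high bits of the PTE address */
    uint64_t entry;                     /* cached PTE or PE             */
} avc_line_t;

static avc_line_t avc[AVC_SETS][AVC_WAYS];

/* Look up the cached page-table entry stored at physical address
 * `pte_pa` (8-byte aligned); returns true on hit. */
static bool avc_lookup(uint64_t pte_pa, uint64_t *entry) {
    uint64_t idx = (pte_pa >> 3) & (AVC_SETS - 1);   /* physical index */
    uint64_t tag = pte_pa >> 8;                      /* physical tag   */
    for (int w = 0; w < AVC_WAYS; w++) {
        if (avc[idx][w].valid && avc[idx][w].tag == tag) {
            *entry = avc[idx][w].entry;
            return true;
        }
    }
    return false;                  /* miss: walk out to the page tables */
}
```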
39
Performance Evaluation
[Chart: execution time normalized to unsafe physical addressing (lower is better); the DVM configurations shown incur 4% and 2.1% overheads.]
40
Exploiting PA==VA IF VA == PA, no need to store PA in PTE
For heap accesses from the ACC, most metadata bits are unimportant, so we can record the permission (R/W) bits densely. [x86-64 page-table entry layout: NX (63), AVAIL (62:52), PHYS ADDR (51:32 and 31:12), AVAIL (11:9), and the flag bits G, PS, D, A, PCD, PWT, U/S, R/W, P.] We can perform access validation using conventional multi-level paging, but we showed how that can be slow. Luckily, we can leverage identity mapping to make this more efficient. The key idea is that we don't have to store page-level translations for identity-mapped pages. So we can do one of two things: we can simply store permissions for all physical pages in a single bitmap, something like Border Control; or, if we find it acceptable to make changes to the page tables, we can do something better. Let's briefly look at each of these.
41
Physical Addressing — Access Private Data: Insecure!
CPU-side:
int main() {
    int *a = contiguous_alloc();
    ...
    init(a, b, c);
    add<<1, N>>(VAtoPA(a), VAtoPA(b), VAtoPA(c));
    return 0;
}
ACC-side:
void add(int *a, int *b, int *c) {
    c[threadID.idx] = a[threadID.idx] + b[threadID.idx];
}
[Diagram: CPU with MMU, TLB, and caches; ACC accessing system memory directly, including private data and arrays a, b, c.]
42
Virtual Addressing
CPU-side:
int main() {
    int a[N], b[N], c[N];
    init(a, b, c);
    add<<1, N>>(a, b, c);
    return 0;
}
ACC-side:
void add(int *a, int *b, int *c) {
    c[threadID.idx] = a[threadID.idx] + b[threadID.idx];
}
[Diagram: CPU with MMU, TLB, and caches; ACC with TLB and IOMMU; arrays a, b, c in system memory.]
43
Virtual Addressing Access Shared Data Address Translation, Permissions
[Diagram: the ACC's access to shared data (a, b, c) causes a TLB miss in the IOMMU; a page walk over the page tables returns the translation, and only then does the data fetch proceed.]
44
Virtual Addressing (animation step 2/2)
int main() {
    int a[N], b[N], c[N];
    init(a, b, c);
    add<<1, N>>(a, b, c);
    return 0;
}
void add(int *a, int *b, int *c) {
    c[threadID.idx] = a[threadID.idx] + b[threadID.idx];
}
[Diagram: CPU (MMU, TLB, caches) and ACC (TLB, IOMMU) sharing arrays a, b, c.]
45
Insufficient Permissions! Virtual Addressing: when the ACC attempts to access private data, the address translation and permission check fail with insufficient permissions. [Diagram: CPU (MMU, TLB, caches) and ACC (TLB, IOMMU); all of the permission checking is done at the IOMMU; private data and arrays a, b, c in memory.]
46
Optional Preloads on Reads
Key idea: access memory (preload) assuming VA==PA, in parallel with DAV. Success: use the preloaded value; DAV overheads are hidden. Failure: discard the preloaded value and redo the access to the translated PA.
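The preload policy can be modeled in a few lines; the toy memory, page size, and encodings below are invented for illustration and merely mimic in software what the hardware does in parallel.

```c
/* Self-contained model of optional preloads on reads: issue the load
 * at PA = VA speculatively, alongside validation; keep the value only
 * if the page turns out to be identity mapped, otherwise redo the
 * access at the translated PA. */
#include <stdint.h>
#include <stdio.h>

#define NPAGES 8
#define PAGE   4096u

static uint8_t phys_mem[NPAGES * PAGE];
/* Toy per-page validation state: readable? identity mapped? real PA. */
static struct { int readable, identity; uint32_t pa_base; } vpage[NPAGES] = {
    [1] = {1, 1, 1 * PAGE},   /* identity mapped           */
    [2] = {1, 0, 5 * PAGE},   /* mapped, but PA != VA      */
};

static int read_with_preload(uint32_t va, uint8_t *out) {
    uint8_t speculative = phys_mem[va];      /* preload at PA == VA    */
    unsigned p = va / PAGE;                  /* validation, logically
                                                overlapped in hardware */
    if (!vpage[p].readable) return -1;       /* protection fault       */
    *out = vpage[p].identity
         ? speculative                              /* preload correct */
         : phys_mem[vpage[p].pa_base + va % PAGE];  /* redo at real PA */
    return 0;
}

int main(void) {
    phys_mem[1 * PAGE] = 7; phys_mem[5 * PAGE] = 9;
    uint8_t v;
    read_with_preload(1 * PAGE, &v); printf("identity page: %d\n", v);
    read_with_preload(2 * PAGE, &v); printf("remapped page: %d\n", v);
    return 0;
}
```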
47
Compact Page Tables
Permission Entries: replace regular page table entries (PTEs) at any level; access validation ends on encountering a PE.
Shorter walks: the page walk ends at an L1 PTE if VA != PA and returns the translated PA*; eliminates the sub-tree below the replaced PTE.
Smaller page tables:
Input Graph: FR, Wiki, LJ, S24
Page Tables (in KB): 616, 2520, 4280, 13340
% occupied by L1 PTEs: 0.948, 0.987, 0.992, 0.996
Compact Page Tables (in KB): 48, 60, ...
48
Separate Address Spaces
[Diagram: CPU with MMU, TLB, and caches attached to its own CPU memory; ACC attached to separate ACC memory.]
49
Separate Address Spaces
CPU-side (special allocation and special copies):
int main() {
    int a[N], b[N], c[N];
    int *d_a = cudaMalloc(...);
    ...
    init(a, b, c);
    cudaMemcpy(d_a, a, ...);
    add<<1, N>>(d_a, d_b, d_c);
    cudaMemcpy(c, d_c, ...);
    return 0;
}
ACC-side:
void add(int *a, int *b, int *c) {
    c[threadID.idx] = a[threadID.idx] + b[threadID.idx];
}
[Diagram: private data and arrays a, b, c on the CPU side; d_a, d_b, d_c in ACC memory.]
50
I. Permission Bitmap 2 permission bits per physical page
Store permissions ONLY for identity-mapped pages; if no permissions are found, a page walk is required. [Diagram: virtual and physical address spaces with identity-mapped (PA == VA) and non-identity (PA != VA) regions, alongside the permission bitmap.] This is simple and does not require any changes to the existing page tables. However, the permission bitmap wastes space, especially if you have a large amount of sparsely used physical memory.
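A small self-contained sketch of such a bitmap lookup follows; the 2-bit encoding (0 meaning "no entry, walk required") and the sizes are assumptions of this example.

```c
/* Two permission bits per physical page, stored only for
 * identity-mapped pages; an empty entry forces a regular page walk. */
#include <stdint.h>
#include <stdio.h>

#define NPAGES     (1u << 20)           /* pages covered by the bitmap */
#define PAGE_SHIFT 12

static uint8_t perm_bitmap[NPAGES / 4]; /* 2 bits per physical page    */

static void set_perms(uint64_t pa, unsigned perms) {
    uint64_t pfn = pa >> PAGE_SHIFT;
    unsigned shift = (pfn & 3) * 2;
    perm_bitmap[pfn / 4] = (uint8_t)
        ((perm_bitmap[pfn / 4] & ~(3u << shift)) | ((perms & 3u) << shift));
}

/* Returns the 2-bit permissions for a VA interpreted as PA == VA;
 * 0 means "no entry: fall back to a page walk". */
static unsigned lookup_perms(uint64_t va) {
    uint64_t pfn = va >> PAGE_SHIFT;
    return (perm_bitmap[pfn / 4] >> ((pfn & 3) * 2)) & 3u;
}

int main(void) {
    set_perms(0x5000, 3);                       /* identity-mapped, RW */
    printf("0x5000 -> %u, 0x6000 -> %u\n",
           lookup_perms(0x5000), lookup_perms(0x6000));
    return 0;
}
```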
51
Methodology OS changes prototyped in Linux v4.10
Performance evaluated using gem5 full-system mode
52
Memory Mapping Segment
[Diagram: process address space from the top (0xFFF...) down to the bottom (0x000...): random offset, Stack (down to the stack limit), random offset, Memory Mapping Segment, Heap, random offset, BSS, Data, Text.]
53
Programming accelerators
CPU-side:
int main() {
    int a[N], b[N], c[N];
    int *d_a = cudaMalloc(N*sizeof(int));
    ...
    init(a, b, c);
    cudaMemcpy(d_a, a, N*sizeof(int), HtD);
    add<<1, N>>(d_a, d_b, d_c);
    cudaMemcpy(c, d_c, N*sizeof(int), DtH);
    return 0;
}
Accelerator-side:
void add(int *a, int *b, int *c) {
    c[threadID.idx] = a[threadID.idx] + b[threadID.idx];
}
54
Programming accelerators
CPU-side:
int main() {
    int *a = contiguous_alloc(...);
    ...
    init(a, b, c);
    add<<1, N>>(VAtoPA(a), VAtoPA(b), VAtoPA(c));
    return 0;
}
Accelerator-side:
void add(int *a, int *b, int *c) {
    c[threadID.idx] = a[threadID.idx] + b[threadID.idx];
}
55
Programming accelerators
CPU-side:
int main() {
    int a[N], b[N], c[N];
    init(a, b, c);
    add<<1, N>>(a, b, c);
    return 0;
}
Accelerator-side:
void add(int *a, int *b, int *c) {
    c[threadID.idx] = a[threadID.idx] + b[threadID.idx];
}
56
Shared Memory, Virtual Addressing
int main() {
    int a[N], b[N], c[N];
    init(a, b, c);
    add<<1, N>>(a, b, c);
    return 0;
}
void add(int *a, int *b, int *c) {
    c[threadID.idx] = a[threadID.idx] + b[threadID.idx];
}
[Diagram: CPU (MMU, TLB, caches) and ACC (TLB, IOMMU) sharing arrays a, b, c in system memory.]