Executive Summary
Problem: overheads of virtual memory can be high. 1D (native): 42%; 2D (virtualized): 97%.
Idea: reduce the dimensionality of page walks.
I: Virtualized Direct Segments (2D → 0D): reduces overheads of virtual memory (virtualized) to less than 1%.
II: Agile Paging (2D → 1D): reduces overheads of virtual memory (virtualized) to less than 4% over native.
III: Redundant Memory Mappings (1D → 0D): reduces overheads of virtual memory (native) to less than 1%.
PhD Defense
Virtual Memory Refresher Virtual Address Space Page Table Physical Memory Process 1 Challenge: How to reduce costly page table walks? Process 2 TLB (Translation Lookaside Buffer) PhD Defense
Two Technology Trends
TLB reach is limited:
Year | Processor | L1 DTLB entries
1999 | Pentium III | 72
2001 | Pentium 4 | 64
2008 | Nehalem | 96
2012 | Ivy Bridge | 100
2014 | Haswell |
2015 | Skylake |
*Inflation-adjusted 2011 USD, from: jcmit.com
PhD Defense
Use of Vast Memory Big-memory applications (ever increasing data sets) PhD Defense
Native x86 Translation: 1D
Virtual Address (VA) → Page Table (rooted at CR3) → Physical Address (PA)
Up to 4 memory accesses
PhD Defense
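A native 1D walk is just four dependent page-table reads. As a rough sketch (toy fanout and an invented mapping, not the real x86-64 table format), the pointer chasing looks like:

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of a native (1D) 4-level page walk. Real x86-64 tables
 * have 512 entries indexed by 9-bit VA slices; the fanout and the
 * translation below are invented purely to show the pointer chasing. */
#define FANOUT 4
#define LEVELS 4

typedef struct Table {
    struct Table *next[FANOUT];   /* non-leaf entries: next-level table */
    uint64_t pfn[FANOUT];         /* leaf entries: physical frame number */
} Table;

/* Walk one index per level; count one memory access per level. */
static uint64_t walk(Table *root, const unsigned idx[LEVELS], int *accesses) {
    Table *t = root;
    *accesses = 0;
    for (int level = 0; level < LEVELS - 1; level++) {
        (*accesses)++;                    /* read a page-table entry */
        t = t->next[idx[level]];
    }
    (*accesses)++;                        /* leaf read yields the PFN */
    return t->pfn[idx[LEVELS - 1]];
}

/* Build a single translation path: indices {1,2,3,0} map to PFN 42. */
static uint64_t demo_translate(int *accesses) {
    static Table l0, l1, l2, l3;
    l0.next[1] = &l1; l1.next[2] = &l2; l2.next[3] = &l3;
    l3.pfn[0] = 42;
    const unsigned idx[LEVELS] = {1, 2, 3, 0};
    return walk(&l0, idx, accesses);
}
```

A TLB miss touches exactly one entry per level, matching the slide's "up to 4 memory accesses".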
Virtual Machines
Enabling many more startups
PhD Defense
Why do we need more dimensions? Guest Virtual Address Guest Physical Address Host Physical Address 1 2 gVA gPA hPA Guest Page Table Nested Page Table PhD Defense
Inception: Effects of Adding Dimensions
APP on OS on VMM on Hardware:
2D (Jog): 70% overhead
1D (Run): 42% overhead
0D (Sprint): 0% overhead
PhD Defense
Goal Can we have “zero” overheads of virtualizing memory at any level of virtualization? PhD Defense
Reaching the Goal
I: Virtualized Direct Segments (2D → 0D) [MICRO’14]
II: Agile Paging (2D → 1D) [ISCA’16]
III: Redundant Memory Mappings (1D → 0D) [ISCA’15 + MICRO TOP PICKS’16]
(Figure axes: Virtual Machine / Native Machine vs. Direct Execution)
PhD Defense
Outline
I: Virtualized Direct Segments (2D → 0D) [MICRO’14]
II: Agile Paging (2D → 1D) [ISCA’16]
III: Redundant Memory Mappings (1D → 0D) [ISCA’15 + MICRO TOP PICKS’16]
(Figure axes: Virtual Machine / Native Machine vs. Direct Execution)
PhD Defense
I: Virtualized Direct Segments – Goal
TLB reach is limited and the cost of a TLB miss with virtualization is very high.
Can we eliminate virtualized TLB misses entirely? (2D → 0D)
[Gandhi et al. – MICRO’14]
PhD Defense
Virtualized Page Walk: 2D
gVA → Guest Page Table + Nested Page Table (rooted at gCR3) → hPA: a longer page walk
At most 5 + 5 + 5 + 5 + 4 = 24 memory accesses
PhD Defense
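The 24-access bound can be derived with a small counting helper (a counting model only; the function and its parameters are invented for illustration): each of the 4 guest levels holds a guest-physical pointer that needs a full 4-level nested walk before the guest entry itself can be read (4 + 1 = 5), and the final guest PA needs one last nested walk (4 more).

```c
#include <assert.h>

/* Counting model for a 2D nested page walk. Each guest level costs
 * nested_levels accesses to translate its gPA pointer plus 1 access
 * to read the guest entry; the final gPA costs one more nested walk. */
static int nested_walk_accesses(int guest_levels, int nested_levels) {
    int per_guest_level = nested_levels + 1;
    return guest_levels * per_guest_level + nested_levels;
}
```

With no nesting (nested_levels = 0) the formula collapses to the native 4-access walk.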
Direct Segments Review
1. Conventional Paging  2. Direct Segment: BASE, LIMIT, and OFFSET map a virtual address to a physical address
Why Direct Segments? Matches big-memory workload needs: no TLB lookups => no TLB misses
[Basu and Gandhi et al. – ISCA’13]
PhD Defense
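The direct-segment check itself is tiny. A minimal sketch (register values in the test are invented, and the paging fallback is left out):

```c
#include <assert.h>
#include <stdint.h>

/* Direct segment: if BASE <= VA < LIMIT, then PA = VA + OFFSET,
 * with no TLB lookup at all; anything outside falls back to paging. */
typedef struct { uint64_t base, limit, offset; } DirectSeg;

/* Returns 1 and fills *pa on a segment hit; 0 means "use paging". */
static int ds_translate(const DirectSeg *ds, uint64_t va, uint64_t *pa) {
    if (va >= ds->base && va < ds->limit) {
        *pa = va + ds->offset;
        return 1;
    }
    return 0;
}
```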
Direct Segments Review VA VA PA PA Base Native 1D Direct Segments 1D 0D PhD Defense
Three Virtualized Modes (Details)
1. VMM Direct (2D → 1D)
2. Dual Direct (2D → 0D)
3. Guest Direct (2D → 1D)
PhD Defense
Methodology
Measure cost of page walks on real hardware: Intel 12-core Sandy Bridge with 96GB memory
Prototype VMM and OS; emulate hardware in Linux
BadgerTrap for online analysis of TLB misses
Released: http://research.cs.wisc.edu/multifacet/BadgerTrap
Linear model to predict performance
Workloads: big-memory workloads, SPEC 2006, BioBench, PARSEC
[Gandhi et al. – CAN’14]
PhD Defense
Results
VMM Direct achieves near-native performance
PhD Defense
Results
Dual Direct eliminates most of the TLB misses, achieving better-than-native performance
PhD Defense
Results Guest Direct achieves near-native performance while providing flexibility at VMM PhD Defense
Results Same trend across all workloads (More workloads in the thesis) PhD Defense
I: Summary – Virtualized Direct Segments Problem: TLB misses in virtual machines Hardware-virtualized MMU has high overheads Solution: segmentation to bypass paging Extend Direct Segments for virtualization Three modes with different tradeoffs Results Near- or better-than-native performance PhD Defense
Outline
I: Virtualized Direct Segments (2D → 0D) [MICRO’14]
II: Agile Paging (2D → 1D) [ISCA’16]
III: Redundant Memory Mappings (1D → 0D) [ISCA’15 + MICRO TOP PICKS’16]
(Figure axes: Virtual Machine / Native Machine vs. Direct Execution)
PhD Defense
II: Agile Paging – Goal
Virtualized Direct Segments sacrificed paging support and required substantial hardware and software support.
Can we make the virtualized page walk faster while retaining paging? (2D → 1D)
[Gandhi et al. – ISCA’16]
PhD Defense
Virtualizing Memory
APP: gVA (Guest Virtual Address) → Guest OS: Guest Page Table → gPA (Guest Physical Address) → VMM: Nested Page Table → hPA (Host Physical Address)
PhD Defense
Virtualizing Memory
Two techniques to manage both page tables (gVA → gPA via the Guest Page Table; gPA → hPA via the Nested Page Table):
1. Nested Paging (hardware)
2. Shadow Paging (software)
Evaluated on two axes: page walk latency and page table updates
PhD Defense
1. Nested Paging – Hardware
gVA → Guest Page Table + Nested Page Table (rooted at gCR3) → hPA: a longer page walk
At most 5 + 5 + 5 + 5 + 4 = 24 memory accesses
PhD Defense
2. Shadow Paging – Software
APP: gVA → Guest OS: Guest Page Table (kept read-only) → gPA
VMM: combines the Guest and Nested Page Tables into a Shadow Page Table → hPA
PhD Defense
2. Shadow Paging – Software
gVA → Shadow Page Table (rooted at sCR3) → hPA: a shorter page walk
At most 4 memory accesses
PhD Defense
Page Table Updates
1. Nested Paging: in-place fast updates to the guest page table
2. Shadow Paging: slow mediated updates; each write to the read-only guest page table traps to the VMM to update the shadow page table
Updates include: copy-on-write, page migration, accessed bits, dirty bits, page sharing, working set sampling, and many more
PhD Defense
Key Observation
If the guest virtual address space were fully static, Shadow Paging would be preferred; if fully dynamic, Nested Paging would be preferred.
Reality: only a small fraction of the address space is dynamic.
PhD Defense
Key Observation Guest Page Table gCR3 Nested Shadow PhD Defense 35
Agile Paging Start page walk in shadow mode -- Achieving fast TLB misses Optionally switch to nested mode -- Allowing fast in-place updates Two parts of design: 1. Mechanism 2. Policy PhD Defense 36
1. Mechanism
gVA → gPA → hPA via the Guest Page Table and Nested Page Table (rooted at gCR3), with a Shadow Page Table (rooted at sCR3) covering the upper levels; the shadowed portion of the guest page table is kept read-only.
PhD Defense
1. Mechanism: Example Page Walk
Starting from sCR3 and switching modes at level 4 of the guest page table:
At most 1 + 1 + 1 + 5 = 8 memory accesses
PhD Defense
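One plausible cost model consistent with the slide's numbers (a sketch, not the paper's exact accounting; the function and its parameter are invented): if the shadow table covers the first k of 4 guest levels, each covered level costs one access, and the remaining levels pay nested-mode prices.

```c
#include <assert.h>

/* Hypothetical access-count model for an agile-paging walk where the
 * shadow table covers the first k of 4 guest levels (1 access each).
 * Extremes match the slides: k=4 -> 4 (pure shadow), k=0 -> 24 (pure
 * nested), and switching at the last guest level (k=3) -> 8. */
static int agile_walk_accesses(int k) {
    const int levels = 4, nested = 4;
    if (k >= levels)                       /* pure shadow walk */
        return levels;
    if (k == 0)                            /* pure nested walk */
        return levels * (nested + 1) + nested;
    /* k shadow reads; the first post-switch guest table is reachable
     * through a host-physical pointer (1 read); each deeper guest
     * level costs nested+1; the final gPA costs one nested walk. */
    return k + 1 + (levels - 1 - k) * (nested + 1) + nested;
}
```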
2. Policy: Shadow → Nested
Start in shadow mode. On a write to the page table (VMM trap), remain in shadow mode (1 write); on another write (VMM trap), switch to nested mode, so subsequent writes take no VMM traps.
PhD Defense
2. Policy: Nested → Shadow
On a timeout, move non-dirty parts of the page table back to shadow mode, using dirty bits to track writes to the guest page table.
PhD Defense
Results
B: Unvirtualized  N: Nested Paging  S: Shadow Paging  A: Agile Paging
Modeled based on emulator: BadgerTrap; measured using performance counters
Solid bottom bar: page walk overhead; hashed top bar: VMM overheads
PhD Defense
Results
B: Unvirtualized  N: Nested Paging  S: Shadow Paging  A: Agile Paging
Nested Paging has high overheads from TLB misses: the effect of the longer page walk (28%, 19%, 18%, 6%)
Solid bottom bar: page walk overhead; hashed top bar: VMM overheads
PhD Defense
Results
B: Unvirtualized  N: Nested Paging  S: Shadow Paging  A: Agile Paging
Shadow Paging has high overheads from VMM interventions (28%, 70%, 11%, 19%, 30%, 18%, 6%, 6%)
Solid bottom bar: page walk overhead; hashed top bar: VMM overheads
PhD Defense
Results
B: Unvirtualized  N: Nested Paging  S: Shadow Paging  A: Agile Paging
Agile Paging consistently performs better than both techniques (28%, 70%, 11%, 2%, 19%, 30%, 18%, 6%, 2%, 4%, 6%, 3%; with THP)
Solid bottom bar: page walk overhead; hashed top bar: VMM overheads
PhD Defense
II: Summary – Agile Paging
Problem: virtualization is valuable but has high overheads with larger workloads (at most 70% slower than native)
Existing choices:
Nested Paging: slow page walk but fast page table updates
Shadow Paging: fast page walk but slow page table updates
Can we get the best of both for the same address space (or even the same page walk)?
Yes, Agile Paging: use shadow paging and sometimes switch to nested paging within the same page walk (at most 4% slower than native)
PhD Defense
Outline
I: Virtualized Direct Segments (2D → 0D) [MICRO’14]
II: Agile Paging (2D → 1D) [ISCA’16]
III: Redundant Memory Mappings (1D → 0D) [ISCA’15 + MICRO TOP PICKS’16]
(Figure axes: Virtual Machine / Native Machine vs. Direct Execution)
PhD Defense
III: Redundant Memory Mappings – Goal
TLB reach is limited and in-memory workload size is increasing.
How can we increase the reach of each TLB entry? (1D → 0D)
[Gandhi et al. – ISCA’15 and IEEE MICRO TOP PICKS’16]
PhD Defense
Key Observation Virtual Memory Physical Memory PhD Defense
Key Observation
Large contiguous regions of virtual memory: code, heap, stack, shared libraries
Limited in number: only a handful
PhD Defense
Compact Representation: Range Translation
A range translation is a mapping between contiguous virtual pages and contiguous physical pages with uniform protection, described by BASE, LIMIT, and OFFSET.
PhD Defense
Redundant Memory Mappings Range Translation 3 Virtual Memory Range Translation 2 Range Translation 4 Range Translation 1 Range Translation 5 Physical Memory Map most of process’s virtual address space redundantly with modest number of range translations in addition to page mappings PhD Defense
Design: Redundant Memory Mappings Three Components: A. Caching Range Translations B. Managing Range Translations C. Facilitating Range Translations PhD Defense
A. Caching Range Translations V47 …………. V12 L1 DTLB L2 DTLB Range TLB Enhanced Page Table Walker Page Table Walker P47 …………. P12 PhD Defense
A. Caching Range Translations V47 …………. V12 Hit L1 DTLB L2 DTLB Range TLB Enhanced Page Table Walker P47 …………. P12 PhD Defense
A. Caching Range Translations V47 …………. V12 Miss L1 DTLB Refill L2 DTLB Range TLB Hit Enhanced Page Table Walker P47 …………. P12 PhD Defense
A. Caching Range Translations V47 …………. V12 Refill Miss L1 DTLB L2 DTLB Range TLB Hit Enhanced Page Table Walker P47 …………. P12 PhD Defense
A. Caching Range Translations Miss V47 …………. V12 P47 …………. P12 L1 DTLB Range TLB L2 DTLB Hit Refill Entry 1 BASE 1 ≤ > LIMIT 1 OFFSET 1 Protection 1 Entry N BASE N ≤ > LIMIT N OFFSET N Protection N L1 TLB Entry Generator Logic: (Virtual Address + OFFSET) Protection PhD Defense
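In software terms, the Range TLB comparison drawn on the slide can be sketched like this (entry count and contents are invented; real hardware compares all entries in parallel rather than looping):

```c
#include <assert.h>
#include <stdint.h>

#define RANGE_TLB_ENTRIES 4

/* One range translation: BASE/LIMIT bound the virtual range, OFFSET
 * gives the physical placement, prot is the uniform protection. */
typedef struct { uint64_t base, limit, offset; unsigned prot; int valid; }
    RangeEntry;

typedef struct { RangeEntry e[RANGE_TLB_ENTRIES]; } RangeTLB;

/* On a hit, generate the L1 TLB entry as (VA + OFFSET) + protection. */
static int range_tlb_lookup(const RangeTLB *t, uint64_t va,
                            uint64_t *pa, unsigned *prot) {
    for (int i = 0; i < RANGE_TLB_ENTRIES; i++) {
        const RangeEntry *r = &t->e[i];
        if (r->valid && va >= r->base && va < r->limit) {
            *pa = va + r->offset;
            *prot = r->prot;
            return 1;                      /* hit: refill the L1 DTLB */
        }
    }
    return 0;                              /* miss: walk the tables */
}
```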
A. Caching Range Translations V47 …………. V12 Miss L1 DTLB L2 DTLB Range TLB Miss Miss Enhanced Page Table Walker P47 …………. P12 PhD Defense
B. Managing Range Translations
Stores all the range translations in an OS-managed structure, the Range Table (rooted at CR-RT); per-process, like the page table.
PhD Defense
B. Managing Range Translations
On an L2+Range TLB miss, which structure should be walked? A) Page Table B) Range Table C) Both A) and B) D) Either?
Whether a virtual page is part of a range is not known at miss time.
PhD Defense
B. Managing Range Translations
Redundancy to the rescue: one bit in the page table entry denotes that the page is part of a range.
1. Page table walk (CR3): insert into L1 TLB
2. Application resumes memory access
3. Range table walk (CR-RT) in the background: insert into Range TLB
PhD Defense
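The three steps above can be sketched as follows (the names and the counter standing in for the background walker are invented for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* RMM miss handling: the PTE carries one extra bit, "part of a
 * range". The page-table walk refills the L1 TLB and the application
 * resumes immediately; only when the bit is set is a range-table
 * walk scheduled in the background to refill the Range TLB. */
typedef struct { uint64_t pfn; int part_of_range; } PTE;

static int background_range_walks = 0;  /* stands in for the HW walker */

static uint64_t handle_tlb_miss(const PTE *pte) {
    if (pte->part_of_range)
        background_range_walks++;   /* off the critical path */
    return pte->pfn;                /* L1 TLB refill; app resumes */
}
```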
C. Facilitating Range Translations Demand Paging Virtual Memory Physical Memory Does not facilitate physical page contiguity for range creation PhD Defense
C. Facilitating Range Translations Eager Paging Virtual Memory Physical Memory Allocate physical pages when virtual memory is allocated Increases range sizes Reduces number of ranges PhD Defense
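For intuition only: stock Linux can approximate eager allocation with mmap's MAP_POPULATE flag, which prefaults the whole mapping up front instead of on first touch; unlike RMM's eager paging, though, it does not guarantee that the backing physical pages are contiguous.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Allocate 'bytes' of anonymous memory and ask the kernel to
 * populate (prefault) it immediately rather than on first touch. */
static void *alloc_populated(size_t bytes) {
    void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}
```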
Results 4KB: Baseline using 4KB paging THP: Transparent Huge Pages using 2MB paging [Transparent Huge Pages] CTLB: Clustered TLB with cluster of 8 4KB entries [HPCA’14] DS: Direct Segments [ISCA’13 and MICRO’14] RMM: Our proposal: Redundant Memory Mappings [ISCA’15] PhD Defense
Performance Results
Assumptions: CTLB: 512-entry fully associative; RMM: 32-entry fully associative; both looked up in parallel with L2
Measured using performance counters; modeled based on emulator: BadgerTrap
5/14 workloads shown; rest in thesis
BadgerTrap: [Gandhi & Basu et al. – CAN’14]
PhD Defense
Performance Results
Overheads of using 4KB pages are very high
PhD Defense
Performance Results
Clustered TLB works well, but is limited by its 8x reach
PhD Defense
Performance Results
2MB pages help with 512x reach, but overheads are still not very low
PhD Defense
Performance Results
Direct Segments are perfect for some but not all workloads
PhD Defense
Performance Results
RMM achieves low overheads robustly across all workloads
PhD Defense
III: Summary – RMM
Problem: virtual memory (native) overheads are high
Proposal: Redundant Memory Mappings (1D → 0D)
Compact representation called range translation: an arbitrarily large contiguous mapping
Effectively cached, managed, and facilitated range translations
Retains the flexibility of 4KB paging
Result: reduced overheads of native virtual memory to less than 1%
PhD Defense
Outline
I: Virtualized Direct Segments (2D → 0D) [MICRO’14]
II: Agile Paging (2D → 1D) [ISCA’16]
III: Redundant Memory Mappings (1D → 0D) [ISCA’15 + MICRO TOP PICKS’16]
(Figure axes: Virtual Machine / Native Machine vs. Direct Execution)
PhD Defense
Conclusion Problem Paging is valuable but costly to use today Virtualization with paging has much higher cost Opportunity Reduce dimensionality of page walk Result Almost zero-overheads of virtual memory PhD Defense
Publications Jayneel Gandhi, Mark D. Hill, Michael M. Swift, Agile Paging for Efficient Virtualized Page Walks, ISCA 2016. Jayneel Gandhi, Vasileios Karakostas, Furkan Ayar, Adrián Cristal, Mark D. Hill, Kathryn S. McKinley, Mario Nemirovsky, Michael M. Swift, Osman Ünsal, Redundant Memory Mappings for Fast Access to Large Memories, IEEE MICRO TOP PICKS 2016. Vasileios Karakostas, Jayneel Gandhi, Adrián Cristal, Mark D. Hill, Kathryn S. McKinley, Mario Nemirovsky, Michael M. Swift, Osman Ünsal, Energy-Efficient Address Translation, HPCA 2016. Vasileios Karakostas, Jayneel Gandhi, Furkan Ayar, Adrián Cristal, Mark D. Hill, Kathryn S. McKinley, Mario Nemirovsky, Michael M. Swift, Osman Ünsal, Redundant Memory Mappings for Fast Access to Large Memories, ISCA 2015. Accepted for IEEE MICRO TOP PICKS 2016 Jayneel Gandhi, Arkaprava Basu, Mark D. Hill, Michael M. Swift, Efficient Memory Virtualization: Reducing Dimensionality of Nested Page Walks, MICRO 2014. Received honorable mention on IEEE MICRO TOP PICKS 2015. Jayneel Gandhi, Arkaprava Basu, Mark D. Hill, Michael M. Swift, BadgerTrap: A Tool to Instrument x86-64 TLB Misses, CAN 2014. Arkaprava Basu, Jayneel Gandhi, Jichuan Chang, Mark D. Hill, Michael M. Swift, Efficient Virtual Memory for Big Memory Servers, ISCA 2013. Niket Kumar Choudhary, Salil V. Wadhavkar, Tanmay A. Shah, Hiran Mayukh, Jayneel Gandhi, Brandon H. Dwiel, Sandeep Navada, Hashem Hashemi Najaf-abadi, Eric Rotenberg, FabScalar: Automating Superscalar Core Design, IEEE MICRO TOP PICKS 2012. Niket Kumar Choudhary, Salil V. Wadhavkar, Tanmay A. Shah, Hiran Mayukh, Jayneel Gandhi, Brandon H. Dwiel, Sandeep Navada, Hashem Hashemi Najaf-abadi, Eric Rotenberg, FabScalar: Composing Synthesizable RTL Designs of Arbitrary Cores within a Canonical Superscalar Template, ISCA 2011. Accepted for IEEE MICRO TOP PICKS 2012 PhD Defense
Collaborators and Contributors Mark D. Hill, UW-Madison Michael M. Swift, UW-Madison Arkaprava Basu, AMD Research Vasilis Karakostas, BSC Barcelona Adrian Cristal, BSC Barcelona Mario Nemirovsky, BSC Barcelona Osman Unsal, BSC Barcelona Kathryn S. McKinley, Microsoft Research Benjamin Serebrin, Google And many more… PhD Defense
Questions ? PhD Defense
Backup Slides PhD Defense
Three Virtualized Modes 1 Features Maps almost whole gPA 4 memory accesses Near-native performance Helps any application VMM Direct 2D 1D PhD Defense
Three Virtualized Modes 1 2 Features 0 memory accesses Better-than native performance Suits big-memory applications VMM Direct 2D 1D Dual Direct 2D 0D PhD Defense
Three Virtualized Modes 1 2 3 Features 4 memory accesses Suits big-memory applications Flexible to provide VMM services VMM Direct 2D 1D Dual Direct 2D 0D Guest Direct 2D 1D PhD Defense
Compatibility VMM Hardware 1 OS Unmodified Modified (VMM Direct) APP PhD Defense
Compatibility VMM VMM Hardware Hardware 1 2 OS OS Minimal Unmodified APP APP Big-Memory OS OS Modified VMM VMM Modified Hardware (VMM Direct) Hardware (Dual Direct) PhD Defense
Compatibility VMM VMM VMM Hardware Hardware Hardware 1 2 3 OS OS OS Minimal Minimal Unmodified APP APP Big-Memory Big-Memory Modified OS OS OS Modified VMM VMM Minimal VMM Modified Modified Hardware (VMM Direct) Hardware (Dual Direct) Hardware (Guest Direct) PhD Defense
Dimension/ Memory accesses Guest OS modifications Tradeoffs: Summary Back Properties Base Virtualized VMM Direct Dual Direct Guest Direct Dimension/ Memory accesses 2D/24 1D/4 0D/0 Guest OS modifications none required VMM modifications minimal Applications Any Big-memory VMM services allowed Yes No
Results: 2MB and 1GB at VMM
0. Page-based Translation Virtual Memory TLB VPN0 PFN0 Benefits + Most flexible + Only 4KB alignment required Disadvantage ─ Requires more TLB entries ─ Limited TLB reach Physical Memory PhD Defense
1. Multipage Mapping Sub-blocked TLB/CoLT Clustered TLB Virtual Memory VPN(0-3) PFN(0-3) Map Bitmap Benefits + Increased the TLB reach by 8X-32X + Clustering of pages allowed Disadvantages ─ Requires size alignment ─ Contiguity required Physical Memory [ASPLOS’94, MICRO’12 and HPCA’14] PhD Defense
2. Large Pages Large Page TLB Virtual Memory VPN0 PFN0 Benefits + Increases TLB reach + 2MB and 1GB page size in x86-64 Disadvantages ─ Size alignment restriction ─ Contiguity required Physical Memory [Transparent Huge Pages and libhugetlbfs] PhD Defense
3. Direct Segments Direct Segment BASE LIMIT Virtual Memory (BASE,LIMIT) OFFSET OFFSET If BASE ≤ V < LIMIT P = V + OFFSET Benefits + Unlimited reach + No size alignment required Disadvantages ─ One direct segment per program ─ Not transparent to application ─ Applicable to big-memory workloads Physical Memory [Basu & Gandhi et al. -- ISCA’13; Gandhi & Basu et al. -- MICRO’14] PhD Defense
Can we get best of many worlds? Multipage Mapping Large Pages Direct Segments Our Proposal Flexible alignment Arbitrary reach Multiple entries Transparent to applications Applicable to all workloads PhD Defense
Why low overheads? Virtual Contiguity
Benchmark | # of ranges: paging (4KB + 2MB THP) | # of ranges: ideal | # of RMM ranges to cover more than 99% of memory
cactusADM | 1365 + 333 | 112 | 49
canneal | 10016 + 359 | 77 | 4
graph500 | 8983 + 35725 | 86 | 3
mcf | 1737 + 839 | 55 | 1
tigr | 28299 + 235 | 16 |
Only 10s-100s of ranges per application where 1000s of TLB entries would be required; only a few ranges for more than 99% coverage
PhD Defense
Transparent Huge Page (2MB) B: Unvirtualized N: Nested Paging S: Shadow Paging A: Agile Paging 68% 13% 14% 4% 2% 14% 5% 2% 10% 6% 3% 2% Solid bottom bar: Page walk overhead Hashed top bar: VMM overheads Back