Memory Resource Management in Vmware ESX Server Author: Carl A. Waldspurger Vmware, Inc. Present: Jun Tao.

Slides:



Advertisements
Similar presentations
Remus: High Availability via Asynchronous Virtual Machine Replication
Advertisements

CS 443 Advanced OS Fabián E. Bustamante, Spring 2005 Memory Resource Management in VMware ESX Server Carl A. Waldspurger VMware, Inc. Appears in SOSDI.
VMWare ESX Memory Management Dr. Sanjay P. Ahuja, Ph.D FIS Distinguished Professor of Computer Science School of Computing, UNF.
Difference Engine: Harnessing Memory Redundancy in Virtual Machines by Diwaker Gupta et al. presented by Jonathan Berkhahn.
CSE 598B: Self-* Systems Memory Resource Management in VMware ESX Server by Carl A. Waldspurger Presented by: Arjun R. Nath (slide material adapted from.
Segmentation and Paging Considerations
XENMON: QOS MONITORING AND PERFORMANCE PROFILING TOOL Diwaker Gupta, Rob Gardner, Ludmila Cherkasova 1.
Introduction to Virtualization
CS-3013 & CS-502, Summer 2006 Virtual Machine Systems1 CS-502 Operating Systems Slides excerpted from Silbershatz, Ch. 2.
OS Fall ’ 02 Introduction Operating Systems Fall 2002.
G Robert Grimm New York University Disco.
Disco Running Commodity Operating Systems on Scalable Multiprocessors.
Informationsteknologi Friday, November 16, 2007Computer Architecture I - Class 121 Today’s class Operating System Machine Level.
OS Spring’03 Introduction Operating Systems Spring 2003.
1 Disco: Running Commodity Operating Systems on Scalable Multiprocessors Edouard Bugnion, Scott Devine, and Mendel Rosenblum, Stanford University, 1997.
1 Last Class: Introduction Operating system = interface between user & architecture Importance of OS OS history: Change is only constant User-level Applications.
Computer Organization and Architecture
Virtualization and the Cloud
A. Frank - P. Weisberg Operating Systems Structure of Operating Systems.
November 1, 2004Introduction to Computer Security ©2004 Matt Bishop Slide #29-1 Chapter 33: Virtual Machines Virtual Machine Structure Virtual Machine.
Virtualization for Cloud Computing
Xen and the Art of Virtualization. Introduction  Challenges to build virtual machines Performance isolation  Scheduling priority  Memory demand  Network.
Basics of Operating Systems March 4, 2001 Adapted from Operating Systems Lecture Notes, Copyright 1997 Martin C. Rinard.
Tanenbaum 8.3 See references
Windows 2000 Memory Management Computing Department, Lancaster University, UK.
Computer System Architectures Computer System Software
Rensselaer Polytechnic Institute CSC 432 – Operating Systems David Goldschmidt, Ph.D.
Cellular Disco: resource management using virtual clusters on shared memory multiprocessors Published in ACM 1999 by K.Govil, D. Teodosiu,Y. Huang, M.
Microkernels, virtualization, exokernels Tutorial 1 – CSC469.
SAIGONTECH COPPERATIVE EDUCATION NETWORKING Spring 2009 Seminar #1 VIRTUALIZATION EVERYWHERE.
A Cloud is a type of parallel and distributed system consisting of a collection of inter- connected and virtualized computers that are dynamically provisioned.
Disco : Running commodity operating system on scalable multiprocessor Edouard et al. Presented by Jonathan Walpole (based on a slide set from Vidhya Sivasankaran)
CS533 Concepts of Operating Systems Jonathan Walpole.
Chapter 1. Introduction What is an Operating System? Mainframe Systems
Chapter 6 Operating System Support. This chapter describes how middleware is supported by the operating system facilities at the nodes of a distributed.
Virtual Machine Monitors: Technology and Trends Jonathan Kaldor CS614 / F07.
Embedded System Lab. 오명훈 Memory Resource Management in VMware ESX Server Carl A. Waldspurger VMware, Inc. Palo Alto, CA USA
Virtualization Part 2 – VMware. Virtualization 2 CS5204 – Operating Systems VMware: binary translation Hypervisor VMM Base Functionality (e.g. scheduling)
Our work on virtualization Chen Haogang, Wang Xiaolin {hchen, Institute of Network and Information Systems School of Electrical Engineering.
Virtualization Infrastructure Administration
Disco: Running Commodity Operating Systems on Scalable Multiprocessors Edouard et al. Madhura S Rama.
Virtualization Part 2 – VMware Hardware Support. Virtualization 2 CS 5204 – Fall, 2008 VMware: binary translation Hypervisor VMM Base Functionality (e.g.
Supporting Multi-Processors Bernard Wong February 17, 2003.
 Virtual machine systems: simulators for multiple copies of a machine on itself.  Virtual machine (VM): the simulated machine.  Virtual machine monitor.
Disco : Running commodity operating system on scalable multiprocessor Edouard et al. Presented by Vidhya Sivasankaran.
02/09/2010 Industrial Project Course (234313) Virtualization-aware database engine Final Presentation Industrial Project Course (234313) Virtualization-aware.
1 Virtual Machine Memory Access Tracing With Hypervisor Exclusive Cache USENIX ‘07 Pin Lu & Kai Shen Department of Computer Science University of Rochester.
By Teacher Asma Aleisa Year 1433 H.   Goals of memory management  To provide a convenient abstraction for programming.  To allocate scarce memory.
A. Frank - P. Weisberg Operating Systems Structure of Operating Systems.
VMWare MMU Ranjit Kolkar. Designed for efficient use of resources. ESX uses high-level resource management policies to compute a target memory allocation.
MEMORY RESOURCE MANAGEMENT IN VMWARE ESX SERVER 김정수
Processes and Virtual Memory
Full and Para Virtualization
Operating-System Structures
Protection of Processes Security and privacy of data is challenging currently. Protecting information – Not limited to hardware. – Depends on innovation.
Cloud Computing Lecture 5-6 Muhammad Ahmad Jan.
Page Buffering, I. Pages to be replaced are kept in main memory for a while to guard against poorly performing replacement algorithms such as FIFO Two.
Disco: Running Commodity Operating Systems on Scalable Multiprocessors Presented by: Pierre LaBorde, Jordan Deveroux, Imran Ali, Yazen Ghannam, Tzu-Wei.
Memory Resource Management in VMware ESX Server By Carl A. Waldspurger Presented by Clyde Byrd III (some slides adapted from C. Waldspurger) EECS 582 –
Chapter 7 Memory Management Eighth Edition William Stallings Operating Systems: Internals and Design Principles.
Running Commodity Operating Systems on Scalable Multiprocessors Edouard Bugnion, Scott Devine and Mendel Rosenblum Presentation by Mark Smith.
Virtualization.
Memory Resource Management in VMware ESX Server
Presented by Yoon-Soo Lee
Chapter 1: Introduction
Disco: Running Commodity Operating Systems on Scalable Multiprocessors
1. 2 VIRTUAL MACHINES By: Satya Prasanna Mallick Reg.No
Introduction to Operating Systems
Virtual Memory: Working Sets
System Virtualization
Presentation transcript:

Memory Resource Management in Vmware ESX Server Author: Carl A. Waldspurger Vmware, Inc. Present: Jun Tao

 Introduction  Memory Virtualization  Reclamation Mechanisms  Sharing Memory  Share vs. Working Sets  Allocation Policies  I/O Page Remapping  Related Work  Conclusions

Introduction  Vmware ESX Server: a thin software layer designed to multiplex hardware resources efficiently among virtual machines  Virtualizes the Intel IA-32 architecture  Runs existing operating systems without modification  IBM’s mainframe division & Disco prototypes  Vmware Workstation: uses a hosted virtual machine architecture that takes advantage of a pre-existing operating system for portable I/O device support

Memory Virtualization  Terminology  Machine address: actual hardware memory  Physical address: a software abstraction used to provide the illusion of hard ware memory to a virtual machine  Pmap: for each VM to translate “physical” page numbers (PPN) to machine page numbers (MPN)  Shadow page tables: contain virtual-to- machine page mappings

Reclamation Mechanisms  Memory allocation  Overcommitment of memory  The total size configured for all running virtual machines exceeds the total amount of actual machine memory  Max size  A configuration parameter that represents the maximum amount of machine memory it can be allocated.  Constant after booting a guest OS  A VM will be allocated its max size when memory is not overcommitted

Page Replacement Issues  When memory is overcommitted, ESX Server must employ some mechanism to reclaim space from one or more virtual machines.

 Standard approach  Introduce another level of paging, moving some VM “physical” pages to a swap area on disk  Disadvantages:  Requires a meta-level page replacement policy: VMM must make relatively uninformed resource management decisions and choose the least valuable pages.  Introduces performance anomalies due to unintended interactions with native memory management policies in guest operating systems.  Double paging problem: after the meta-level OS policy selecting a page to reclaim and paging it out, the guest OS may choose the very same page to write to its own virtual paging device.

 Ballooning  A technique used by ESX Server to coax the guest OS into reclaiming memory when possible by making it think it has been configured with less memory.  How it works  A small balloon module is loaded into the guest OS as a pseudo-device driver or kernel service.  Inflate: allocating pinned physical pages within the VM, using appropriate native interfaces.  Deflate: instructing it to deallocate previously- allocated pages.  Balloon driver communicates PPN to ESX Server, which may then reclaim the corresponding machine page. Deflating the balloon frees up memory for general use within the guest OS.

 Future guest OS support for hot-pluggable memory cards would enable an additional form of coarse grained ballooning. Virtual memory cards could be inserted into or remove from a VM in order to rapidly adjust its physical memory size.

 Effectiveness  Black bars: performance when the VM is configured with main memory sizes ranging from 128 MB to 256 MB  Grey bars: performance of the same VM configured with 256 MB, ballooned down to the specified size

 Disadvantages  The balloon driver may be uninstalled, disabled explicitly, unavailable while a guest OS is booting.  Temporarily unable to reclaim memory quickly enough to satisfy current system demands.  Upper bounds on reasonable balloon sizes may be imposed by various guest OS limitations.  Paging  A mechanism employed when ballooning is not possible or insufficient.  ESX Server swap daemon (Disk And Execution MONitor)  A randomized page replacement policy is used and more sophisticated algorithms are being investigated.

Sharing Memory  Server consolidation presents numerous opportunities for sharing memory between virtual machines.  Transparent Page Sharing  Introduced by Disco to eliminate redundant copies of pages, such as code or read-only data.  Disco required several guest OS modifications to identify redundant copies as they were created.

 Content-Based Page Sharing  Identify page copies by their contents. Pages with identical contents can be shared regardless of when, where or how those contents were generated.  Advantages:  Eliminates the need to modify, hook or even understanding guest OS code.  Able to identify more opportunities for sharing.  Cost of simple matching is very expensive  Comparing each page with every other page in the system would be prohibitively expensive  Naive matching would required O(n 2 ) page comparisons

 Instead, hashing is used to identify pages with potentially-identical contents.  How it works  A hash value that summarizes a page’s contents is used as a lookup key into a hash table containing entries for other pages that have already been marked copy-on-write (COW).  If hash value matches, a full comparison of the page contents will follow.  If the full comparison verifies the pages to be identical, a share frame in the hash table will be created or modified in response.  If no match is found, an unshared page will be tagged as a special hint entry.  Frames in the hash table are modified in response to new matching and hash changing.

 Page Sharing Performance  Sharing metrics for a series of experiments consisting of identical Linux VMs running SPEC95 benchmarks.  The left graph indicates the absolute amounts of memory shared and saved increase smoothly with the number of concurrent VMs.  The right graph plots these metrics as a percentage of aggregate VM memory.

 The CPU overhead due to page sharing was negligible. An identical set of experiments with page sharing disabled and enabled were run respectively. Over all runs, the aggregate throughput was actually 0.5% higher with page sharing enabled, and ranged from 1.6% lower to 1.8% higher.

 Real-World Page Sharing  Sharing metrics from production deployments of ESX Server.  Ten Windows NT VMs serving users at a Fortune 50 company, running a variety of database (Oracle, SQL Server), web (IIS, Websphere), development (Java, VB), and other applications.  Nine Linux VMs serving a large user community for a nonprofit organization, executing a mix of web (Apache), mail (Majordomo, Postfix, POP/IMAP, MailArmor), and other servers.

 Five Linux VMs providing web proxy (Squid), mail (Postfix, RAV), and remote access (ssh) services to VMware employees.

Shares vs. Working set  Due to the need to provide quality-of- service guarantees to clients of varying importance.  Share-Based Allocation  Resource rights are encapsulated by shares.  represent relative resource rights that depend on the total number of shares contending for a resource.  A client is entitled to consume resources proportional to its share allocation.  Both randomized and deterministic algorithms are proposed for proportional-share allocation.

 Dynamic min-funding revocation algorithm  When one client demands more space, a replacement algorithm selects a victim client that relinquishes some of its previously-allocated space.  Memory is revoked from the client that owns fewest share per allocated page.  Limitation  Pure proportional-share algorithms do not incorporate any information about active memory usage or working sets.

 Idle memory tax strategy  Charge a client more for an idle page than for one it is actively using. When memory is scarce, pages will be reclaimed preferentially from client that are not actively using their full allocations.  Min-funding revocation is extended to used an adjusted shares-per-page ratio: where S and P are number of shares and allocated pages owned by a client, respectively, f is the fraction that is active and k=1/(1-T) for a given tax rate 0<=T<1.

 Measuring Idle Memory  ESX Server uses a statistical sampling approach to obtain aggregate VM working set estimates directly, without any guest involvement. Each VM is sampled independently.  A small number n of the virtual machine’s “physical” pages are selected randomly using a uniform distribution.  For each time the guest access to a sampled page, a touched page count t is incremented.  A statistical estimate of the fraction f of memory actively accessed by the VM is f=t/n.  By default, ESX Server samples 100 pages for each 30 second period.

 Experiment  To balance stability and agility, separate exponentially weighted moving average with different gain parameters are maintained.  A slow moving average is used to produce a smooth, stable estimate (gray dotted line).  A fast moving average adapts quickly to working set changes (gray dashed line).

 The solid black line indicates the amount of memory repeatedly touched by a simple memory application named toucher.  Max is the maximum value of these three values to estimate the amount of memory being actively used by the guest.  Result  As expected, the statistical estimate of active memory usage responds quickly as more memory is touched, tracking the fast moving average, and more slowly as less memory is touched, tracking the slow moving average.  The spike is due to the Windows “zero page thread”.

 Performance of Idle Memory Tax  Two VMs with identical share allocations are each configured with 256 MB in an overcommitted system.  VM1 (gray) runs Windows, and remains idle after booting. VM2 (black) executes a memory-intensive Linux workload. For each VM, ESX Server allocations are plotted as solid lines, and estimated memory usage is indicated by dotted lines.

Allocation Policies  ESX Server computes a target memory allocation for each VM based on both its share- based entitlement and an estimate of its working set. This target is achieved via the ballooning and paging mechanisms. Page sharing runs as an additional background activity that reduce overall memory pressure on the system.  Parameters  Min size: a guaranteed lower bound on the amount of memory that will be allocated to the VM, even when memory is overcommitted.  Max size: the amount of “physical” memory configured for use by the guest OS running in the VM.

 Memory shares entitle a VM to a fraction of physical memory, based on a proportional-share allocation policy.  Admission Control  A policy that ensures that sufficient unreserved memory and server swap space is available before a VM is allowed to power on.  Machine memory must be reserved for the guaranteed min size, as well as additional overhead memory required for virtualization, for a total of min + overhead (typically to be 32 MB).  Disk swap space must be reserved for the remaining VM memory; i.e. max - min. This reservation ensures the system is able to preserve VM memory under any circumstances.

 Dynamic Reallocation  ESX Server recomputes memory allocations dynamically in response to:  Changes to system-wide or per-VM allocation parameters by a system administrator  Addition or removal of a VM from the system  Changes in the amount of free memory that cross predefined thresholds.  ESX Server uses 4 thresholds to reflect different reclamation states: high, soft, hard, and low, which default to 6%, 4%, 2% and 1% of system memory, respectively.  High – sufficient  Soft – Ballooning  Hard – Paging  Low – Paging and blocking some execution

 Memory allocation metrics over time for a consolidated workload consisting of five Windows VMs: Microsoft Exchange (separate server and client load generator VMs), Citrix MetaFrame (separate server and client load generator VMs), and Microsoft SQL Server.  (a) ESX Server allocation state transitions.  (b) Aggregate allocation metrics summed over all five VMs.  (c) Allocation metrics for MetaFrame Server VM.  (d) Allocation metrics for SQL Server VM.

I/O Page Remapping  IA-32 processors support a physical address extension (PAE) mode that allows the hardware to address up to 64 GB of memory. However many device support only 4 GB of memory.  Hardware solution: using a I/O MMU to copy data through a temporary bounce buffer from “high” memory to “low” memory.  Pose significant overhead  ESX Server maintains statistics to track “hot” pages in high memory that are involved in repeated I/O operation. And remap some hot pages, of which the count of accesses exceeds a specified threshold.  Make low memory a scarce resource.

Related Work  Disco and Cellular Disco  Vmware Workstation  Uses a hosted architecture  Self-paging of the Nemesis system  Similar to Ballooning  Requires applications to handle their own virtual memory operations  Transparent page sharing work in Disco  IBM’s MXT memory compression technology  Hardware approach

 Disco’s techniques for replication and migration to improve locality and fault containment in NUMA multi processors  Similar to the techniques of transparently remapping “physical” pages

Conclusion  Ballooning technique reclaims memory from a VM by implicitly causing the guest OS to invoke its own memory management routines  Idle memory tax solves an open problem in share-based management of space- shared resources enabling both performance isolation and efficient memory utilization.  Idleness is measured via a statistical working set estimator.

 Content-based transparent page sharing exploits sharing opportunities within and between VMs without any guest OS involvement.  Page remapping is also leveraged to reduce I/O copying overheads in large- memory systems.  A high-level dynamic reallocation policy coordinates these diverse techniques to efficiently support virtual machine workloads that overcommit memory