Xen and the Art of Virtualization Paul Barham*, Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris, Alex Ho, Rolf Neugebauer, Ian Pratt, Andrew Warfield *Microsoft Research Cambridge, UK; University of Cambridge Computer Laboratory 19th ACM Symposium on Operating Systems Principles (SOSP'03) 1
Introduction Resurgence of interest in VM technology (2003) – Modern computers are sufficiently powerful to use virtualization. In this paper we present Xen: – a high-performance, resource-managed virtual machine monitor (VMM) 2
Problems to Solve VM isolation: – It is not acceptable for the execution of one VM to adversely affect the performance of another. Support for different operating systems: – To accommodate the heterogeneity of popular applications. Low performance overhead: – The performance overhead introduced by virtualization should be small. 3
XEN: APPROACH & OVERVIEW Traditional approach: full virtualization: – The virtual hardware exposed is functionally identical to the underlying machine. – Benefit: allows unmodified operating systems to be hosted. – Drawback: support for full virtualization was never part of the x86 architectural design. 4
XEN: APPROACH & OVERVIEW (cont'd) Improvement: – Paravirtualization: presenting a virtual machine abstraction that is similar but not identical to the underlying hardware. Improved performance. Requires modifications to the guest operating system. – But no modifications are required to guest applications. » ABI: an application binary interface describes the low-level interface between an application (or any other type of program) and the operating system or another application. 5
The Virtual Machine Interface Overview of the paravirtualized x86 interface: – Memory management – CPU – Device I/O Why x86? – x86 represents a worst case: its TLB is hardware-managed and untagged, and (at the time) it had no architectural support for virtualization. 6
Memory management Software-managed TLB vs. hardware-managed TLB (see the TLB appendix slides) Two decisions: – To ensure safety and isolation: guest OSes are responsible for allocating and managing the hardware page tables, with minimal involvement from Xen. – To avoid a TLB flush when entering and leaving the hypervisor: Xen exists in a 64MB section at the top of every address space. 7
Memory management (cont'd) Method: 1. A guest OS requires a new page table, e.g. because a new process is being created. 2. It allocates and initializes a page from its own memory reservation and registers it with Xen. 3. The guest OS relinquishes direct write privileges to the page-table memory; all subsequent updates must be validated by Xen. – Note: guest OSes may batch update requests to amortize the overhead of entering the hypervisor, as the sketch below illustrates. 8
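A minimal C sketch of this batching, modeled on Xen's mmu_update hypercall; the mmu_update_t layout and DOMID_SELF value follow Xen's public interface, while QUEUE_SIZE, queue_pte_update() and flush_pte_updates() are illustrative names of our own:

#include <stdint.h>

/* Layout follows Xen's public interface: the low bits of `ptr`
 * encode the update command, the rest is the machine address of
 * the page-table entry to modify. */
typedef struct mmu_update {
    uint64_t ptr;   /* machine address of the PTE */
    uint64_t val;   /* proposed new PTE value, validated by Xen */
} mmu_update_t;

/* Provided by the hypervisor ABI: traps into Xen, which validates
 * and applies each queued update. */
extern int HYPERVISOR_mmu_update(mmu_update_t *req, unsigned int count,
                                 unsigned int *done, uint16_t domid);

#define DOMID_SELF 0x7FF0   /* Xen's "this domain" identifier */
#define QUEUE_SIZE 128      /* illustrative batch size */

static mmu_update_t queue[QUEUE_SIZE];
static unsigned int queued;

void flush_pte_updates(void)
{
    unsigned int done;
    if (queued > 0)
        HYPERVISOR_mmu_update(queue, queued, &done, DOMID_SELF);
    queued = 0;
}

/* Queue one PTE update; only enter the hypervisor when the batch
 * is full, amortizing the cost of the world switch. */
void queue_pte_update(uint64_t pte_maddr, uint64_t new_val)
{
    queue[queued].ptr = pte_maddr;
    queue[queued].val = new_val;
    if (++queued == QUEUE_SIZE)
        flush_pte_updates();
}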
CPU In order to paravirtualize the CPU, the hypervisor must have a higher privilege level than the guest OS. – This prevents the guest OS from directly executing privileged instructions, which is essential for isolation. EX: the memory management design discussed before. – The x86 processor has 4 privilege levels in hardware, generally described as rings. – Rings run from ring 0 to ring 3 (0 is the most privileged). Therefore Xen runs in ring 0, the guest OS in ring 1, and applications in ring 3. – Any OS which follows this common arrangement can be ported to Xen by modifying it to execute in ring 1. 9
CPU (cont'd) Exception handling: – EX: page faults and software exceptions. – A table describing the handler for each type of exception is registered with Xen for validation. This validation adds overhead. Safety is ensured by validating exception handlers: – Xen validates that the handler's code segment does not specify execution in ring 0. 10
CPU (cont'd) Exception handling (cont'd): – Dealing with the overhead: only two types of exception occur frequently enough to affect system performance: – system calls (usually implemented via a software exception) – page faults (these must still be delivered via Xen, since only ring 0 code can read the faulting address from register CR2) System calls can be registered as a `fast' exception handler: – accessed directly by the processor without indirecting via ring 0. 11
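A hedged sketch of how a guest might register its exception handlers, modeled on Xen's trap_info/set_trap_table interface; the handler symbols and the GUEST_CS selector value are illustrative assumptions:

#include <stdint.h>

#define GUEST_CS 0x11   /* assumed guest-kernel code segment (ring 1) */

typedef struct trap_info {
    uint8_t  vector;        /* exception vector number */
    uint8_t  flags;         /* lowest ring allowed to raise this trap */
    uint16_t cs;            /* handler code segment: Xen rejects ring 0 */
    unsigned long address;  /* handler entry point */
} trap_info_t;

extern int HYPERVISOR_set_trap_table(trap_info_t *table);

extern void page_fault_handler(void);  /* always delivered via Xen (CR2) */
extern void syscall_handler(void);     /* eligible for the fast path */

static trap_info_t traps[] = {
    { 14,   0, GUEST_CS, (unsigned long)page_fault_handler },
    { 0x80, 3, GUEST_CS, (unsigned long)syscall_handler },  /* int 0x80 */
    { 0, 0, 0, 0 }                      /* zero-terminated table */
};

void install_trap_table(void)
{
    /* Xen validates every entry once, at registration time, instead
     * of on each exception. */
    HYPERVISOR_set_trap_table(traps);
}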
Device I/O Fully virtualized environments: – emulate existing hardware devices. Paravirtualized: – Xen exposes a set of clean and simple device abstractions. Objective: protection and isolation. – I/O data is transferred to and from each VM via Xen (described later) in order to perform validation checks. EX: checking that buffers are contained within a domain's memory reservation. 12
Detailed Design Control Transfer: – Hypercalls and Events Data Transfer: – I/O Rings Subsystem Virtualization: – CPU scheduling – Virtual address translation – Network – Disk 13
Control Transfer: Hypercalls and Events Hypercalls: – Synchronous calls from a VM to Xen, used to perform privileged operations. EX: a VM requests a set of page-table updates. Events: – Notifications delivered to a VM from Xen using an asynchronous event mechanism. Replaces the usual delivery mechanism for device interrupts. EX: indicates that new data has been received over the network. The guest OS may specify an event-callback handler to respond to the notification. 14
Control Transfer: Hypercalls and Events Events (cont'd): – Pending events are stored in a per-domain bitmask which is updated by Xen. – How can a guest defer events? By setting a Xen-readable software flag. – This is analogous to disabling interrupts on a real processor. 15
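A minimal sketch of the guest side of this mechanism; the shared_info layout and all field names here are illustrative assumptions, not Xen's actual structures:

#include <stdint.h>

struct shared_info {
    volatile uint8_t  events_masked;   /* set by guest: defer upcalls */
    volatile uint64_t pending;         /* one bit per event, set by Xen */
};

extern struct shared_info *shared;     /* page shared with Xen */
extern void (*handlers[64])(void);     /* guest-registered callbacks */

/* Analogous to disabling/enabling interrupts on a real processor:
 * Xen checks this flag before making an upcall. */
void events_disable(void) { shared->events_masked = 1; }
void events_enable(void)  { shared->events_masked = 0; }

/* Entry point Xen upcalls into; a real implementation would clear
 * the pending bits atomically. */
void event_callback(void)
{
    while (shared->pending) {
        int port = __builtin_ctzll(shared->pending);  /* lowest set bit */
        shared->pending &= ~(1ULL << port);           /* acknowledge */
        if (handlers[port])
            handlers[port]();   /* e.g. "new packets have arrived" */
    }
}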
Data Transfer: I/O Rings Main ideas of the data transfer mechanism: – Allow data to move vertically through the system with as little overhead as possible. – Minimize the work required to demultiplex data to a specific VM when an interrupt is received from a device. 16
Data Transfer: I/O Rings I/O data buffers are allocated out-of-band by the guest OS. – Zero copy: only pointers to the buffers, together with access permissions, are passed through the ring; the data itself is never copied. 17
Data Transfer: I/O Rings Ordering: – There is no requirement that requests be processed by Xen in order. The guest OS associates a unique identifier with each request, which is reproduced in the associated response. Reason: this lets Xen reorder I/O operations due to scheduling or priority considerations. A sketch of the ring layout follows. 18
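A sketch of a descriptor ring in this style; the paper's rings use two producer/consumer pointer pairs, one for requests (guest produces, Xen consumes) and one for responses (Xen produces, guest consumes), but the struct and helper names below are illustrative:

#include <stdint.h>

#define RING_SIZE 64   /* illustrative; a power of two */

struct ring_desc {
    uint64_t id;        /* unique ID, copied from request into response */
    uint64_t buf_maddr; /* out-of-band buffer allocated by the guest */
    uint32_t len;
    uint32_t op;        /* e.g. read / write / transmit */
};

struct io_ring {
    volatile uint32_t req_prod;   /* advanced by the guest */
    volatile uint32_t req_cons;   /* advanced by Xen */
    volatile uint32_t rsp_prod;   /* advanced by Xen */
    volatile uint32_t rsp_cons;   /* advanced by the guest */
    struct ring_desc ring[RING_SIZE];
};

/* Guest side: enqueue a request descriptor. The hypercall that
 * notifies Xen can be deferred, batching several requests per entry
 * into the hypervisor; responses may return in any order and are
 * matched by `id`. */
int ring_put_request(struct io_ring *r, const struct ring_desc *d)
{
    if (r->req_prod - r->rsp_cons == RING_SIZE)
        return -1;                            /* ring is full */
    r->ring[r->req_prod % RING_SIZE] = *d;    /* descriptor only, no data copy */
    r->req_prod++;
    return 0;
}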
CPU scheduling Scheduling algorithm: – Borrowed Virtual Time (BVT) scheduling algorithm [11] work-conserving has a special mechanism for low-latency wake-up when a VM receives an event [11] K. J. Duda and D. R. Cheriton. Borrowed-Virtual-Time (BVT) scheduling: supporting latency-sensitive threads in a general-purpose scheduler. In Proceedings of the 17th ACM SIGOPS Symposium on Operating Systems Principles (SOSP'99), Kiawah Island Resort, SC, USA, Dec. 1999. 19
BVT scheduling (slides 20-23) [Figure: each domain's virtual time E_i plotted against real time for two domains with weights w=2/3 and w=1/3; virtual time advances in inverse proportion to weight.]
Low latency dispatch (slides 24-26) [Figure: BVT warping example plotting virtual time E_i against real time. An MPEG domain wakes at t=5 and t=15 and executes for 2.5 time units each time; it runs first because it is warped back 50 virtual units, until its warp-time limit L_i is exceeded.]
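A compact sketch of BVT's selection and accounting rules as described by Duda and Cheriton; the field names and the scaling constant are illustrative:

/* Each domain's actual virtual time A_i advances in inverse
 * proportion to its weight w_i; the scheduler always runs the
 * runnable domain with the smallest effective virtual time
 * E_i = A_i - (warping ? W_i : 0). Warping back by W_i gives a
 * freshly woken, latency-sensitive domain immediate dispatch. */
struct bvt_dom {
    long avt;      /* A_i: actual virtual time */
    long warp;     /* W_i: warp amount for low-latency wake-up */
    int  warping;  /* bounded in real time by the limit L_i */
    long weight;   /* w_i: relative CPU share */
};

static long evt(const struct bvt_dom *d)   /* effective virtual time */
{
    return d->avt - (d->warping ? d->warp : 0);
}

struct bvt_dom *pick_next(struct bvt_dom *doms, int n)
{
    struct bvt_dom *best = &doms[0];
    for (int i = 1; i < n; i++)
        if (evt(&doms[i]) < evt(best))
            best = &doms[i];
    return best;
}

void account(struct bvt_dom *d, long mcu_ran)   /* after it runs */
{
    d->avt += mcu_ran * 1000 / d->weight;   /* 1000 is an illustrative scale */
}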
Virtual address translation Xen need only be involved in page-table updates: – to prevent guest OSes from making unacceptable changes. Approach: – Xen registers guest OS page tables directly with the Memory Management Unit (MMU) and restricts guest OSes to read-only access. – Page-table updates are passed to Xen via a hypercall (a sketch of the validation checks follows). 27
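To make the validation step concrete, here is a hedged sketch of the kind of checks the hypervisor might apply to a proposed page-table entry; the frame-table layout and names are assumptions, not Xen's actual code:

#include <stdbool.h>
#include <stdint.h>

struct frame {                 /* per-machine-frame bookkeeping */
    uint16_t owner;            /* domain that owns this frame */
    bool     is_pagetable;     /* frame currently used as a page table */
};
extern struct frame frame_table[];

#define PTE_WRITABLE 0x2       /* x86 R/W bit */

bool pte_is_acceptable(uint16_t dom, uint64_t new_pte)
{
    uint64_t mfn = new_pte >> 12;   /* target machine frame */
    if (frame_table[mfn].owner != dom)
        return false;   /* may only map frames the domain owns */
    if ((new_pte & PTE_WRITABLE) && frame_table[mfn].is_pagetable)
        return false;   /* page tables must never be guest-writable */
    return true;
}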
Network Each VM has one or more virtual network interfaces (VIFs). – VIFs are attached to a virtual firewall-router (VFR). – Domain0 is responsible for inserting and removing rules on the VFR. A VIF contains: – two I/O rings of buffer descriptors, one for transmit and one for receive. – Zero copy: the guest OS exchanges an unused page frame for each packet it receives (sketched below). Fairness: – Xen implements a simple round-robin packet scheduler. 28
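A hedged sketch of the receive-side page exchange; reassign_frame() and the slot layout are assumed names for illustration:

#include <stdint.h>

extern void reassign_frame(uint64_t mfn, uint16_t new_owner);  /* assumed helper */

struct rx_slot { uint64_t guest_mfn; };   /* empty frame posted by the guest */

/* Instead of copying packet data, swap page ownership: the frame
 * holding the packet goes to the guest, and the spare frame the
 * guest donated replaces it. */
void deliver_packet(uint16_t dom, uint64_t pkt_mfn, struct rx_slot *slot)
{
    uint64_t spare = slot->guest_mfn;
    reassign_frame(pkt_mfn, dom);   /* packet page now belongs to the guest */
    reassign_frame(spare, 0);       /* donated page refills the free pool */
    slot->guest_mfn = pkt_mfn;      /* response descriptor points at the data */
}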
Disk Only Domain0 has direct unchecked access to physical (IDE and SCSI) disks. – VMs access persistent storage through the abstraction of virtual block devices (VBDs), using the I/O ring mechanism. – A translation table is maintained within the hypervisor for each VBD, mapping a (VBD identifier, offset) pair to the corresponding sector address and physical device (sketched below). – Xen services batches of requests from competing domains in a simple round-robin fashion. 29
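A sketch of what such a translation table might look like, assuming an extent-based layout (an illustrative choice; the paper does not specify the data structure):

#include <stdint.h>

struct vbd_extent {
    uint64_t vstart, length;   /* sector range within the VBD */
    uint16_t phys_dev;         /* backing IDE/SCSI device */
    uint64_t pstart;           /* first physical sector of the extent */
};

struct vbd {
    struct vbd_extent *extents;
    int nr_extents;
};

/* Translate (VBD, virtual sector) to (device, physical sector);
 * requests falling outside the table are rejected, enforcing
 * isolation between domains' storage. */
int vbd_translate(const struct vbd *v, uint64_t vsec,
                  uint16_t *dev, uint64_t *psec)
{
    for (int i = 0; i < v->nr_extents; i++) {
        const struct vbd_extent *e = &v->extents[i];
        if (vsec >= e->vstart && vsec < e->vstart + e->length) {
            *dev  = e->phys_dev;
            *psec = e->pstart + (vsec - e->vstart);
            return 0;
        }
    }
    return -1;   /* out of range */
}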
Evaluation Environment Hardware: – Dell 2650 dual-processor 2.4GHz Xeon server – 2GB RAM – a Broadcom Tigon 3 Gigabit Ethernet NIC – a single Hitachi DK32EJ 146GB 10k RPM SCSI disk OS: – Linux version 2.4.21 30
Evaluation: Relative Performance 31 Compares the performance of a VM with "bare metal". Bare metal: a pure Linux OS installed directly on the physical machine.
Evaluation: Concurrent Virtual Machines 32
Conclusion This paper presents Xen, an x86 virtual machine monitor which: – allows multiple commodity operating systems to share conventional hardware – without sacrificing either performance or functionality, as the experimental results show. Ongoing work: – Porting the BSD and Windows XP kernels to operate over Xen. 33
Comment Paravirtualization indeed delivers good performance. However, Domain-0 may be the bottleneck: – much of the work requires Domain-0 to validate or execute it. The guest OS needs modification in order to be installed in a Xen VM. 34
TLB (1/3) First we take a look at how an application uses memory: Every process has its own address space. The memory management unit (MMU) translates a virtual address into 2 indices and 1 offset (the indices select the page-directory and page-table entries; the offset selects a byte within the page). 35
TLB (2/3) A translation lookaside buffer (TLB): – a CPU cache that memory management hardware uses to improve virtual address translation speed. – The TLB caches these virtual-to-physical mappings. 36
TLB (3/3) Hardware-managed TLB (x86 architecture): – The whole TLB must be flushed whenever the address space changes. Software-managed TLB: – Tagged TLB: associates an address-space identifier tag with each TLB entry. – Allows the hypervisor and each guest OS to efficiently coexist in separate address spaces, with no need to flush the TLB. A sketch follows. 37
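A small sketch illustrating why a tagged TLB removes the need to flush; the entry layout is illustrative:

#include <stdbool.h>
#include <stdint.h>

struct tlb_entry {
    uint64_t vpn;    /* virtual page number */
    uint64_t pfn;    /* physical frame number */
    uint16_t asid;   /* address-space identifier tag */
    bool     valid;
};

/* A hit requires both the VPN and the current ASID to match, so a
 * context switch (or a hypervisor entry) only changes current_asid;
 * entries belonging to other address spaces stay cached. */
bool tlb_lookup(const struct tlb_entry *tlb, int n, uint16_t current_asid,
                uint64_t vpn, uint64_t *pfn)
{
    for (int i = 0; i < n; i++) {
        if (tlb[i].valid && tlb[i].asid == current_asid && tlb[i].vpn == vpn) {
            *pfn = tlb[i].pfn;
            return true;
        }
    }
    return false;   /* miss: hardware page walk or OS refill */
}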
Introduction of Domain-0 Domain-0: – A special privileged domain (VM) – Serves as an administrative interface to Xen – The first domain launched when the system is booted Note: – Domain-0 (Dom0) = privileged domain – Domain-U (DomU) = unprivileged domain 38
A simple Xen architecture 39 [Figure: (1) Dom0 has direct physical access to all hardware; (2) Dom0 exports the simplified generic class devices to each DomU; (3) Dom0 provides the configuration and monitoring interface.]
TCP working flow 40 [Figure: example TCP working flow illustrating zero-copy data transfer.]