Hardware Assisted Virtualization

Argentina Software Development Center Software and Solutions Group 21 July 2008

Agenda
- Challenges of running a VMM
- SW Solution for IA-32 arch without Intel-VT
- Virtualization challenges
- Software workarounds to support Ring Deprivileging
- Intel® Virtualization Technology
- VT-x Modes
- VT-x transition mechanisms
- Virtual Machine Control Structure (VMCS)
- Solving Virtualization Challenges with VT-x
- VT-x New instructions
- VT-x Extensions
- VT-d: Intel® Virtualization Technology for Directed I/O
- VT-c: Intel® Virtualization Technology for Connectivity
- Intel® VT vs AMD-V
- Conclusions

Challenges of running a Virtual Machine Monitor (VMM)
- OS and apps in a VM do not know that the VMM exists or that they share CPU resources with other VMs.
- The VMM should isolate guest SW stacks from one another.
- The VMM should run protected from all guest software.
- The VMM should present a virtual platform interface to guest SW.

Virtual machine monitors face several challenges. They must be completely transparent: neither the user applications nor the operating systems running inside the virtual machines should be aware of the VMM, or that they share resources with other virtual machines. The VMM must fully isolate the software stacks of the different virtual machines. The VMM must run protected from all software running in the virtual machines. Finally, it must present the virtual machines with an interface for accessing the hardware resources.

SW Solution for IA-32 arch without Intel-VT
Ring Deprivileging
- A technique that runs all guest software at a privilege level greater than 0.
- Privileged instructions generate faults.
- The VMM runs in ring 0 as a collection of fault handlers.
- The VMM interprets in software the privileged instructions that an OS would execute.
- Any non-privileged instruction issued by an OS or application environment is executed directly by the machine.
- A guest OS can be deprivileged in two distinct ways: it can run at privilege level 1 (the 0/1/3 model) or at privilege level 3 (the 0/3/3 model).

[Figure: VM0 and VM1, each with applications in ring 3 and a guest OS (Guest OS0, Guest OS1) in ring 1, on top of the VM Monitor in ring 0 and the platform hardware.]

To deal with the problems presented above on an IA-32 architecture without virtualization support, the virtual machine monitor must run in a mode with more privileges than the modes in which the guest operating systems run. The solution used is called ring deprivileging: the guest operating systems run at a less privileged level (ring 1) and the applications run at ring 3. This way, every time a guest operating system executes a privileged operation, a fault is generated because the guest is not running in supervisor mode, and the VMM is responsible for catching those faults, interpreting them, and doing whatever is needed for each type of fault.
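The fault-handler view of the VMM can be illustrated with a small model. This is only a sketch: the instruction names, the `PRIVILEGED` set, and the `VMM` class are hypothetical and do not correspond to any real VMM's code.

```python
# Hypothetical model of ring deprivileging (trap and emulate).
# Instruction names and the handler logic are illustrative assumptions.
PRIVILEGED = {"CLI", "STI", "LGDT", "MOV_TO_CR3"}


class VMM:
    def __init__(self):
        self.virtual_if = {}  # per-guest virtual interrupt flag
        self.emulated = []    # log of (guest, instruction) pairs the VMM handled

    def handle_fault(self, guest, insn):
        # The VMM interprets the privileged instruction on the guest's behalf.
        if insn == "CLI":
            self.virtual_if[guest] = 0
        elif insn == "STI":
            self.virtual_if[guest] = 1
        self.emulated.append((guest, insn))


def execute(vmm, guest, insn, ring):
    """Return 'direct' if the CPU runs the instruction, 'emulated' if it faults to the VMM."""
    if insn in PRIVILEGED and ring > 0:
        vmm.handle_fault(guest, insn)
        return "emulated"
    return "direct"


vmm = VMM()
# The guest OS runs at ring 1 (the 0/1/3 model).
results = [execute(vmm, "guest0", insn, ring=1) for insn in ["ADD", "CLI", "MOV", "STI"]]
```

Only the two privileged instructions reach the VMM; the others run directly, which is the property that keeps the technique usable.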

Virtualization challenges
Ring Aliasing
- Problems that arise when software runs at a privilege level other than the one for which it was written.
- Example: the CS register, which points to the code segment. If the PUSH instruction is executed with the CS register, the contents of that register (which include the current privilege level) are pushed on the stack. A guest OS could easily determine that it is not running at privilege level 0.

Address-Space Compression
- OSs expect to have access to the processor's full virtual address space (in IA-32, the linear address space).
- The VMM could run entirely within the guest's virtual address space, but the VMM's instructions and data structures would use a substantial amount of it.
- The VMM could run in a separate address space, but it must still use a minimal amount of the guest's virtual address space for the control structures that manage transitions between guest software and the VMM (the IDT and GDT for IA-32).
- The VMM must prevent guest access to those portions of the guest's virtual address space that the VMM is using. Otherwise the VMM's integrity could be compromised.

Implementing a virtual machine monitor is complex because of the many problems that the ring deprivileging technique creates. These are some of the main problems that a VMM implementation must deal with.

The first weakness of ring deprivileging is called ring aliasing. For example, if the PUSH instruction is executed with the segment register CS, the segment selector and the current privilege level are pushed on the stack.

Address-space compression refers to the challenges of protecting the portions of the virtual address space used by the VMM and supporting guest accesses to them. In either case there are problems: the VMM's integrity could be compromised if the guest can write to those portions, and if the guest can read them it could detect that it is running in a VM. Guest attempts to access these portions of the address space must generate transitions to the VMM, which can emulate or otherwise support them.

Excessive Faulting
- Ring deprivileging can interfere with the effectiveness of facilities in the IA-32 architecture that accelerate the delivery and handling of transitions to OS software.
- Example: the IA-32 SYSENTER and SYSEXIT instructions support low-latency system calls. SYSENTER always effects a transition to privilege level 0, and SYSEXIT faults if executed outside that ring.
- The VMM must emulate every execution of SYSENTER and SYSEXIT, causing serious performance problems.

Non-Trapping Instructions
- Some instructions access privileged state and do not fault when executed with insufficient privilege.
- Example: the IA-32 registers GDTR, IDTR, LDTR, and TR contain pointers to data structures that control CPU operation. Software can execute the instructions that read (store) these registers at any privilege level.

Excessive faulting (adverse impact on guest system calls): the IA-32 architecture has two instructions that were created to make transitions from privilege level 3 to 0 and back. These instructions, SYSENTER and SYSEXIT, are used to implement fast system calls. SYSENTER changes from ring 3 to ring 0 and SYSEXIT changes from ring 0 to ring 3, and this behavior is hardcoded in the instructions. So executions of SYSENTER by a guest application cause transitions to the VMM and not to the guest OS, and the VMM must emulate every guest execution of SYSENTER. Executions of SYSEXIT by a guest OS cause faults to the VMM, and the VMM must emulate every guest execution of SYSEXIT.

Non-trapping instructions (non-faulting access to privileged state): the registers in the example can only be modified at ring 0, but they can be read at any privilege level. If the VM reads any of these registers, it could read invalid information (for instance, information related to another virtual machine). Because the read succeeds at any privilege level, the operation does not fault and the VMM is not aware of it.

Interrupt Virtualization
- Masking external interrupts, which prevents their delivery when the OS is not ready for them, is a big challenge for VMM design.
- The VMM must manage interrupt masking so that a guest OS cannot mask external interrupts and thereby prevent other guests from receiving them.
- Example: IA-32 uses the interrupt flag (IF) in the EFLAGS register to control interrupt masking. A value of 0 indicates that interrupts are masked.

Access to Hidden State
- Some components of the processor state are not represented in any software-accessible register.
- Example: IA-32 has hidden descriptor caches for the segment registers. A segment-register load copies a descriptor from the GDT or LDT into this cache, which is not modified if software later writes to the descriptor tables.

Interrupt virtualization: even if it were possible to prevent guest modifications of interrupt masking without intercepting each attempt, challenges would remain when a VMM has a "virtual interrupt" to deliver to a guest. A virtual interrupt should be delivered only when the guest has unmasked interrupts. To deliver virtual interrupts in a timely way, a VMM should intercept some but not all attempts by a guest to modify interrupt masking. Doing so could significantly complicate the design of a VMM.

Access to hidden state: IA-32 does not provide a mechanism for saving and restoring hidden components of a guest context when changing VMs or for preserving them while the VMM is running.

Ring Compression
- Ring deprivileging uses privilege-based mechanisms to protect the VMM from guest software. IA-32 includes two such mechanisms: segment limits and paging.
- Segment limits do not apply in 64-bit mode, so paging must be used.
- Problem: IA-32 paging does not distinguish privilege levels 0-2, so the guest OS must run at privilege level 3 (the 0/3/3 model). The guest OS is then not protected from the guest applications.

Frequent Access to Privileged Resources
- Performance is compromised if privileged resources are accessed many times, generating many faults that the VMM must intercept.
- Example: the task-priority register (TPR), located in the advanced programmable interrupt controller (APIC) in IA-32, is accessed with very high frequency by some OSs.

Frequent access to privileged resources: the TPR controls interrupt prioritization. Accesses to this register require VMM intervention only if they cause the TPR to drop below a value determined by the VMM.

Intel® Virtualization Technology
- VT-x: support for IA-32 processor virtualization
- VT-i: support for Itanium processor virtualization

Intel introduced a change to its IA-32 architecture, called VT-x, to solve in hardware many of the problems that the VMM previously had to solve in software. The goal was to add a new mode of operation that lets guest operating systems run in ring 0 with fewer privileges than the VMM has. The most important changes were therefore the new modes of operation and the new mechanisms for transitioning from guest software to the VMM and back.

VT-x Modes
- VMX root operation: fully privileged, intended for the Virtual Machine Monitor.
- VMX non-root operation: not fully privileged, intended for guest software.
- Both forms of operation support all four privilege levels from 0 to 3.

VT-x augments IA-32 with two new forms of CPU operation: VMX root operation and VMX non-root operation. VMX root operation is intended for use by a VMM, and its behavior is very similar to that of IA-32 without VT-x. VMX non-root operation provides an alternative IA-32 environment controlled by a VMM and designed to support a VM.

VT-x transition mechanisms
- VM exit: from VMX non-root operation to VMX root operation.
- VM entry: from VMX root operation to VMX non-root operation.

[Figure: VM0 and VM1 run in VMX non-root operation, each with applications in ring 3 and a guest OS (Guest OS0, Guest OS1) in ring 0. The VMM runs in VMX root operation at ring 0 on the platform hardware. VM exits transfer control from a guest to the VMM and VM entries transfer it back.]

Virtual Machine Control Structure (VMCS)
Data structure that manages VM entries and VM exits. The VMCS is logically divided into:
- Guest-state area
- Host-state area
- VM-execution control fields
- VM-exit control fields
- VM-entry control fields
- VM-exit information fields

VM entries load processor state from the guest-state area. VM exits save processor state to the guest-state area, record the exit reason, and then load processor state from the host-state area.

VM entries and VM exits are managed by a new data structure called the virtual-machine control structure. Processor operation is changed substantially in VMX non-root operation. The most important change is that many instructions and events cause VM exits. Some instructions (e.g. INVD) cause VM exits unconditionally and thus can never be executed in VMX non-root operation. Other instructions (e.g. INVLPG) and all events can be configured to do so conditionally using VM-execution control fields in the VMCS.

- Guest-state area. Processor state is saved into the guest-state area on VM exits and loaded from there on VM entries.
- Host-state area. Processor state is loaded from the host-state area on VM exits.
- VM-execution control fields. These fields control processor behavior in VMX non-root operation. They determine in part the causes of VM exits.
- VM-exit control fields. These fields control VM exits.
- VM-entry control fields. These fields control VM entries.
- VM-exit information fields. These fields receive information on VM exits and describe the cause and the nature of VM exits. They are read-only.
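The way state moves between the processor and the VMCS areas on each transition can be sketched in a few lines. This is a simplified model under stated assumptions: the `VMCS` class, the dictionary standing in for processor state, and the register names used are illustrative, not the real VMCS layout.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class VMCS:
    """Toy VMCS: only the guest-state area, host-state area, and one exit-information field."""
    guest_state: dict = field(default_factory=dict)
    host_state: dict = field(default_factory=dict)
    exit_reason: Optional[str] = None


def vm_entry(cpu, vmcs):
    # VM entry: load processor state from the guest-state area.
    cpu.clear()
    cpu.update(vmcs.guest_state)


def vm_exit(cpu, vmcs, reason):
    # VM exit: save guest state, record the reason, then load host state.
    vmcs.guest_state = dict(cpu)
    vmcs.exit_reason = reason
    cpu.clear()
    cpu.update(vmcs.host_state)


cpu = {"CR3": 0x1000, "RIP": 0xFFF0}  # VMM is running
vmcs = VMCS(guest_state={"CR3": 0x2000, "RIP": 0x100},
            host_state={"CR3": 0x1000, "RIP": 0xFFF0})

vm_entry(cpu, vmcs)
state_in_guest = dict(cpu)
cpu["RIP"] = 0x104                     # guest makes progress
vm_exit(cpu, vmcs, "INVD")             # INVD exits unconditionally
```

After the exit, the guest's updated instruction pointer sits in the guest-state area, ready for the next entry, and the processor is back on the VMM's state.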

VT-x Operations
[Figure: VM 1 through VM n run in VMX non-root operation, each with rings 0 and 3 and its own VMCS (VMCS 1 through VMCS n). The VMM runs in VMX root operation, also with rings 0 and 3. VMXON moves the processor from ordinary IA-32 operation into VMX root operation, VMLAUNCH and VMRESUME enter a VM, and a VM exit returns control to VMX root operation.]

Solving Virtualization Challenges with VT-x
Address-Space Compression
- With VT-x, every transition between guest software and the VMM can change the linear-address space, allowing guest software full use of its own address space.
- The VMX transitions are managed by the VMCS, which resides in the physical-address space, not the linear-address space.

Ring Aliasing and Ring Compression
VT-x allows the VMM to run guest software at its intended privilege level. This:
- Eliminates ring aliasing problems: an instruction such as PUSH (of CS) cannot reveal that software is running in a VM.
- Eliminates ring compression problems that arise when a guest OS executes at the same privilege level as guest applications.

Address-space compression: OSs expect to have access to the processor's full virtual-address space (known as the linear-address space in IA-32). A VMM must reserve for itself some portion of the guest's virtual-address space. It could run entirely within the guest's virtual-address space, which allows it easy access to guest data, but the VMM's instructions and data structures would use a substantial amount of the guest's virtual-address space. Alternatively, the VMM can run in a separate address space, but even in that case, the VMM must use a minimal amount of the guest's virtual-address space for the control structures that manage transitions between guest software and the VMM. For IA-32, these structures include the interrupt-descriptor table (IDT) and the global-descriptor table (GDT), which reside in the linear-address space. For the Itanium architecture, the structures include the interruption vector table (IVT), which resides in the virtual-address space. The VMM must prevent guest access to those portions of the guest's virtual-address space that the VMM is using. Otherwise, the VMM's integrity could be compromised (if the guest can write to those portions) or the guest could detect that it is running in a VM (if it can read those portions). Guest attempts to access these portions of the address space must generate transitions to the VMM, which can emulate or otherwise support them. The term address-space compression refers to the challenges of protecting these portions of the virtual-address space and supporting guest accesses to them.

Hardware virtualization: VT-x and VT-i provide two different techniques for solving address-space compression problems. With VT-x, every transition between guest software and the VMM can change the linear-address space, allowing guest software full use of its own address space. The VMX transitions are managed by the VMCS, which resides in the physical-address space, not the linear-address space. With VT-i, the VMM has a virtual-address bit that guest software cannot use. A VMM can conceal hardware support for this bit by intercepting guest calls to the PAL procedure that reports the number of implemented virtual-address bits. As a result, the guest will not expect to use this uppermost bit, and hardware will not allow it to do so, thus providing the VMM exclusive use of half of the virtual-address space.

Nonfaulting Access to Privileged State
VT-x avoids this problem in two ways:
- It generates VM exits on each sensitive execution.
- It lets the VMM configure the disposition of interrupts and exceptions.

Guest System Calls
- Problems occur with the IA-32 instructions SYSENTER and SYSEXIT when a guest OS runs outside privilege level 0.
- This problem is solved because, with VT-x, a guest OS can run at privilege level 0.

Guest system calls: the SYSENTER and SYSEXIT instructions are hardcoded around ring 0. SYSENTER always transitions to privilege level 0, and SYSEXIT executes successfully only from privilege level 0.

Interrupt Virtualization
VT-x provides explicit support for interrupt virtualization:
- An external-interrupt exiting VM-execution control. When this control is set to 1, all external interrupts cause VM exits and guest software cannot mask them, so the VMM keeps control of interrupt masking without intercepting every guest attempt to modify EFLAGS.IF.
- An interrupt-window exiting VM-execution control. When this control is set to 1, a VM exit occurs whenever guest software is ready to receive interrupts. A VMM can set this control when it has a virtual interrupt to deliver to a guest.

Access to Hidden State
VT-x includes, in the guest-state area of the VMCS, fields corresponding to CPU state not represented in any software-accessible register. The processor loads values from these VMCS fields on every VM entry and saves into them on every VM exit.

Frequent Access to Privileged Resources
VT-x allows a VMM to avoid the overhead of high-frequency guest access to the TPR. A VMM can configure the VMCS so that the VMM is invoked only when required: when the value of the TPR shadow associated with the VMCS drops below that of a TPR threshold in the VMCS.
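The threshold rule can be made concrete with a small model. It is only a sketch: the dictionary keys `tpr_shadow` and `tpr_threshold` are illustrative names chosen here, not real VMCS field encodings.

```python
def guest_writes_tpr(vmcs, new_value):
    """Update the TPR shadow and report whether the VMM must be invoked (VM exit)."""
    vmcs["tpr_shadow"] = new_value
    # The VMM is needed only when the shadow drops below the threshold.
    return new_value < vmcs["tpr_threshold"]


vmcs = {"tpr_shadow": 8, "tpr_threshold": 4}
writes = [9, 5, 4, 3, 7]
exits = [guest_writes_tpr(vmcs, value) for value in writes]
```

In this run only one of the five guest writes would involve the VMM, which is the saving the slide describes.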

VT-x New instructions
- VMXON and VMXOFF: enter and exit VMX root operation.
- VMLAUNCH: used on the initial transition from VMM to guest. Enters VMX non-root operation.
- VMRESUME: used on subsequent entries. Loads guest state and exit criteria from the VMCS.
- VM exit: the transition from guest to VMM. This is a hardware transition rather than an instruction. It enters VMX root operation, saves guest state in the VMCS, and loads VMM state from the VMCS.
- VMPTRST and VMPTRLD: read (store) and write (load) the VMCS pointer.
- VMREAD, VMWRITE, VMCLEAR: read from, write to, and clear a VMCS.

The VT-x instructions are available in VMX root operation only, apart from VMXON, which is the instruction executed to enter it.

VT-x Extensions
- CPUID spoofing (Flex Migration)
- Extended Page Table (EPT)
- Virtual Processor IDs (VPID)
- Guest Preemption Timer

The extensions are architectural enhancements geared to delivering more powerful virtualization solutions.

VT-x extension: CPUID spoofing (Flex Migration)
- Allows software to "spoof" the CPUID feature bits (that is, make the CPUID feature bits appear different from what they really are).
- This is the same as the CPUID spoofing feature that current VT processors have.

[Figure: Live VM Migration between older / existing servers and newer servers. Labels: 32 bit single core, 64 bit single core, 64 bit dual, quad-core; Pre 2004, 2004+, 2006+ (Intel® Core™).]

Flex Migration provides the ability to mask certain bits of the CPUID instruction's output, which allows live migration of virtual machines between different generations of Intel processors, starting with the Intel Core architecture processors and their successors. The technology relies on "spoofing" the CPUID instruction by masking the bits that indicate that new instructions are present (e.g. SSE4). The VMM determines which bits to mask; the VMM user can select which processor to define as the "lowest common denominator". The VMM then masks the appropriate bits whenever a CPUID instruction is executed and returns the modified value to the application, so that the application does not use instructions that are missing on other machines. This makes the OS and applications compatible with live migration across generations of Intel CPUs. Note that the ability to mask CPUID bits has been available since Intel VT was launched, on platforms that have Intel VT enabled. All VMM vendors that support and depend on Intel VT (e.g. Xen, Viridian, etc.) should be able to implement CPUID spoofing today. With the Penryn-based processors, the same capability is being added so that it can be used whether Intel VT is enabled or disabled.
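The "lowest common denominator" masking amounts to a bitwise AND between what the real processor reports and what the chosen baseline processor would report. The sketch below assumes the feature-flag positions of CPUID leaf 1, register ECX; the function name and the two example hosts are hypothetical.

```python
# CPUID leaf 1, ECX feature-flag bit positions.
SSE3, SSSE3, SSE4_1, SSE4_2 = 1 << 0, 1 << 9, 1 << 19, 1 << 20


def spoofed_ecx(real_ecx, baseline_ecx):
    """Value the VMM returns to the guest: only features the baseline also has."""
    return real_ecx & baseline_ecx


newer_host = SSE3 | SSSE3 | SSE4_1   # hypothetical newer server
older_host = SSE3 | SSSE3            # hypothetical migration-pool baseline
guest_view = spoofed_ecx(newer_host, older_host)
```

A guest that sees `guest_view` never detects SSE4.1, so it can later be moved to the older host without using an instruction that host lacks.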

VT-x extension: Extended Page Table (EPT)
- All guest-physical addresses go through the extended page tables, including the addresses in CR3, in PDEs, in PTEs, etc.
- Reduces the frequency of VM exits to the VMM.
- Adds one more level of indirection to memory addresses.
- AMD calls its equivalent feature NPT (nested page tables). The net effect of both implementations (EPT or NPT) is to let the guest OS own and manage its own page tables without forcing the host to get involved.

With EPT, the ordinary IA-32 page tables (referenced by control register CR3) translate from linear addresses to guest-physical addresses. A separate set of page tables (the EPT tables) translates from guest-physical addresses to the host-physical addresses that are used to access memory. As a result, guest software can be allowed to modify its own IA-32 page tables and directly handle page faults. This allows a VMM to avoid the VM exits associated with page-table virtualization, which are a major source of virtualization overhead without EPT.
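The two translation stages can be sketched with single-level tables. This is a deliberately simplified model: real page tables are multi-level, and the guest's own page-table addresses are themselves translated through EPT, which the sketch omits. The table contents are made-up values.

```python
PAGE = 4096


def translate(table, addr):
    """Map an address through a one-level page table {page number: frame number}."""
    frame = table[addr // PAGE]  # a missing entry (KeyError) models a fault
    return frame * PAGE + addr % PAGE


guest_page_table = {0x10: 0x2}  # linear page 0x10 -> guest-physical page 0x2 (guest-owned)
ept = {0x2: 0x7F}               # guest-physical page 0x2 -> host-physical page 0x7F (VMM-owned)


def linear_to_host(linear):
    guest_physical = translate(guest_page_table, linear)
    return translate(ept, guest_physical)


host_physical = linear_to_host(0x10123)
```

The guest can edit `guest_page_table` freely without a VM exit, because only `ept` decides which host memory is actually reachable.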

VT-x extension: Virtual Processor IDs (VPID)
- The idea of a tagged TLB is that each TLB entry is "tagged" with an identifier.
- The tag means TLB entries do not have to be flushed when switching between the host and a guest.
- VPID is activated if the new "enable VPID" control bit is set in the VMCS.
- The TLB caches linear-to-physical address translations.

Virtual-processor identifiers (VPIDs): this feature allows a VMM to assign a different non-zero VPID to each virtual processor (the zero VPID is reserved for the VMM). The CPU can use VPIDs to tag translations in the TLBs. This feature eliminates the need for TLB flushes on every VM entry and VM exit and eliminates the adverse impact of those flushes on performance.
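A tagged TLB can be modeled as a cache keyed by the pair (VPID, linear page) instead of by linear page alone. The class below is an illustrative sketch, not a description of the hardware structure.

```python
class TaggedTLB:
    """Toy TLB whose entries are tagged with a VPID (0 is reserved for the VMM)."""

    def __init__(self):
        self.entries = {}

    def insert(self, vpid, linear_page, physical_page):
        self.entries[(vpid, linear_page)] = physical_page

    def lookup(self, vpid, linear_page):
        # Entries of other VPIDs never match, so no flush is needed on a switch.
        return self.entries.get((vpid, linear_page))


tlb = TaggedTLB()
tlb.insert(0, 5, 100)  # VMM translation for linear page 5
tlb.insert(1, 5, 200)  # guest with VPID 1, same linear page, different frame
```

Both translations for linear page 5 coexist, so switching between the VMM and the guest keeps each side's cached entries warm.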

VT-x extension: Guest Preemption Timer
- Allows the VMM to preempt guest execution.
- Can bound guest execution time.
- Programmable by the VMM.
- Causes a VM exit when the timer expires.
- No impact on interrupt architecture.
- VMM-specific and platform-independent.
- No need to share with the guest OS.

It helps when you need to switch tasks or must allocate a certain amount of CPU time to a task. For telecom and networking applications, it makes virtualization a useful tool and possibly a must-have feature. At the other end of the spectrum, it can help media applications such as media PCs and TiVo-type devices. For the business world, it offers less benefit.

VT-d: Intel® Virtualization Technology for Directed I/O
Provides the capability to ensure improved isolation of I/O resources for greater reliability, security, and availability. Supports the remapping of I/O DMA transfers and device-generated interrupts. Provides flexibility to support multiple usage models that may run un-modified, special-purpose, or "virtualization aware" guest OSs.

VT-d Feature: DMA Remapping
- DMA remapping translates the address of the incoming DMA request to the correct physical memory address and checks for permission to access that physical address.
- The DMA-remapping hardware logic in the chipset sits between the DMA-capable peripheral I/O devices and the computer's physical memory.

Intel VT-d enables protection by restricting direct memory access (DMA) of the devices to pre-assigned domains or physical memory regions. This is achieved by a hardware capability known as DMA remapping. The VT-d DMA-remapping hardware logic in the chipset sits between the DMA-capable peripheral I/O devices and the computer's physical memory. It is programmed by the computer system software. In a virtualization environment the system software is the VMM. In a native environment where there is no virtualization software, the system software is the native OS. DMA remapping translates the address of the incoming DMA request to the correct physical memory address and checks for permission to access that physical address, based on the information provided by the system software.

Intel VT-d enables system software to create multiple DMA protection domains. Each protection domain is an isolated environment containing a subset of the host physical memory. Depending on the software usage model, a DMA protection domain may represent memory allocated to a virtual machine (VM), or the DMA memory allocated by a guest-OS driver running in a VM or as part of the VMM itself. The VT-d architecture enables system software to assign one or more I/O devices to a protection domain. DMA isolation is achieved by using address-translation tables to restrict access to a protection domain's physical memory from I/O devices not assigned to it. This provides the isolation needed to keep each virtual machine's resources separate.

When an I/O device tries to access a memory location, the DMA-remapping hardware looks up the address-translation tables to check that device's access permission for that protection domain. If the device tries to access memory outside the range it is permitted to access, the DMA-remapping hardware blocks the access and reports a fault to the system software.
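The lookup-and-check behavior described above can be sketched as follows. The class, the device and domain names, and the one-level translation table are all assumptions made for illustration; the real hardware uses the multi-level structures defined by the VT-d architecture.

```python
PAGE = 4096


class DmaFault(Exception):
    """Raised when the remapping logic blocks a DMA request."""


class DmaRemapper:
    def __init__(self):
        self.device_domain = {}  # device id -> protection domain
        self.tables = {}         # domain -> {dma page: (host page, permissions)}
        self.faults = []         # faults reported to system software

    def assign(self, device, domain):
        self.device_domain[device] = domain

    def map_page(self, domain, dma_page, host_page, permissions):
        self.tables.setdefault(domain, {})[dma_page] = (host_page, permissions)

    def translate(self, device, dma_addr, access):
        domain = self.device_domain.get(device)
        entry = self.tables.get(domain, {}).get(dma_addr // PAGE)
        if entry is None or access not in entry[1]:
            # Block the access and record a fault for system software.
            self.faults.append((device, dma_addr, access))
            raise DmaFault(f"{device}: {access} at {dma_addr:#x} blocked")
        host_page, _ = entry
        return host_page * PAGE + dma_addr % PAGE


iommu = DmaRemapper()
iommu.assign("nic0", "vm1")
iommu.map_page("vm1", dma_page=0x1, host_page=0x80, permissions="rw")
iommu.map_page("vm2", dma_page=0x2, host_page=0x90, permissions="rw")
host_addr = iommu.translate("nic0", 0x1010, "w")  # inside nic0's domain
```

A request from `nic0` for DMA page 0x2 would be blocked even though that page is mapped, because the mapping belongs to a different protection domain.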

VT-d Feature: Interrupt Remapping
- The interrupt requests generated by I/O devices must be controlled by the VMM.
- When an interrupt occurs, the VMM must present it to the guest. The existing interrupt architecture does not accomplish this in hardware.
- The VT-d interrupt-remapping architecture addresses this problem by redefining the interrupt-message format.
- Interrupt requests specify a requester-ID and an interrupt-ID, and the remapping hardware transforms these requests into a physical interrupt.

For proper device isolation in a virtualized system, the interrupt requests generated by I/O devices must be controlled by the VMM. In the existing interrupt architecture for Intel platforms, a device may generate either a legacy interrupt (which is routed through I/O interrupt controllers) or may directly issue message signaled interrupts (MSIs) [20]. MSIs are issued as DMA write transactions to a pre-defined architectural address range, and the interrupt attributes (such as vector, destination processor, delivery mode, etc.) are encoded in the address and data of the write request. Since the interrupt attributes are encoded in the request issued by devices, the existing interrupt architecture does not offer interrupt isolation across protection domains. The VT-d interrupt-remapping architecture addresses this problem by redefining the interrupt-message format. The new interrupt message continues to be a DMA write request, but the write request itself contains only a "message identifier" and not the actual interrupt attributes. The write request, like any DMA request, specifies the requester-id of the hardware function generating the interrupt. DMA write requests identified as interrupt requests by the hardware are subject to interrupt remapping. The requester-id of interrupt requests is remapped through the table structure.
Each entry in the interrupt-remapping table corresponds to a unique interrupt message identifier from a device and includes all the necessary interrupt attributes (such as destination processor, vector, delivery mode, etc.). The architecture supports remapping interrupt messages from all sources including I/O interrupt controllers (IOAPICs), and all flavors of MSI and MSI-X interrupts defined in the PCI specifications. (there is no mention of legacy interrupts) For more info about MSI refer to:

VT-d Feature: Address Translation Services
The VT-d architecture defines a multi-level page-table structure for DMA address translation. The multi-level page tables are similar to IA-32 processor page tables, enabling software to manage memory at 4 KB or larger page granularity. Hardware implements the page-walk logic and traverses these structures using the address from the DMA request. The number of page-table levels that must be traversed is specified through the context-entry referencing the root of the page table. The page directory and page-table entries specify independent read and write permissions, and hardware computes the cumulative read and write permissions encountered in a page walk as the effective permissions for a DMA request. The page-table and page-directory structures are always 4 KB in size, and larger page sizes (2 MB, 1 GB, etc.) are enabled through super-page support.
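The cumulative-permission rule can be illustrated with a walk over nested tables. The sketch uses Python dictionaries with made-up keys (`r`, `w`, `next`) in place of the real entry formats, and a two-level table chosen only to keep the example short.

```python
def walk(root, indices):
    """Walk nested tables, AND-ing the read and write permissions found at each level."""
    readable = writable = True
    node = root
    for index in indices:
        entry = node[index]
        readable = readable and entry["r"]
        writable = writable and entry["w"]
        node = entry["next"]  # next-level table, or the final frame number
    return node, readable, writable


# Directory entry allows read and write; the page-table entry is read-only.
root = {
    0: {"r": True, "w": True,
        "next": {3: {"r": True, "w": False, "next": 0x55}}},
}
frame, can_read, can_write = walk(root, [0, 3])
```

The effective permission is read-only because a single level that denies writes is enough to deny them for the whole walk.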

VT-c: Intel® Virtualization Technology for Connectivity
Improves overall system performance by improving communication between host CPU and I/O devices within the virtual server. This enables a lowering of CPU utilization, a reduction of system latency and improved networking and I/O throughput. VT-c includes: Intel® I/O Acceleration Technology. Virtual Machine Device Queues (VMDq). Single Root I/O Virtualization (SR-IOV) implementation in Intel® devices.

VT-c: Intel® I/O Acceleration Technology
Intel® I/O Acceleration Technology (Intel® I/OAT) is a suite of features that improves data acceleration across the platform, from I/O and networking devices to the memory and processors, which helps improve system performance.
- Intel® QuickData Technology: designed to maximize the throughput of server data traffic across a broader range of configurations and server environments to achieve faster, scalable, and more reliable I/O.
- Direct Cache Access (DCA): enables the CPU to pre-fetch data, avoiding cache misses and improving application response times.
- MSI-X: helps load-balance I/O network interrupts.
- Low latency interrupts: automatically tune interrupt interval times depending on the latency sensitivity of the data.
- Receive Side Coalescing (RSC): provides lightweight coalescing of receive packets, which increases the efficiency of the host network stack.

VT-c: Virtual Machine Device Queues (VMDq)
- In addition to consolidating CPU processes, virtualization also effectively consolidates I/O bandwidth and switch processing onto the same platform.
- The overhead of this switching limits bandwidth, adds CPU overhead, and effectively reduces the benefits of server virtualization.
- In some cases it creates a new problem: an I/O bottleneck.

This slide describes the current problem: the hypervisor must manage the traffic from the NIC to each of the VMs and back.

VT-c: Virtual Machine Device Queues (VMDq)
On the receive path, VMDq provides a hardware ‘sorter' or classifier that essentially does the pre-work for the VMM of directing which end VM the packets should go to. The NIC or LAN silicon is performing a hardware assist for the VMM layer. The VMDq capability is located in the NIC silicon. This feature is supported in Intel® Gigabit Ethernet Controller and Intel® Gigabit Ethernet Controller, and requires virtualization software enabling.
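The receive-side sorter can be pictured as a function that places each incoming packet into a per-VM queue before the VMM sees it. In the sketch below, classifying by destination MAC address is an assumption made for illustration, as are the function name and the fallback queue for unmatched traffic.

```python
from collections import defaultdict


def sort_receive_packets(packets, mac_to_vm, default_queue="vmm"):
    """Place each (destination MAC, payload) packet into the queue of its target VM."""
    queues = defaultdict(list)
    for dst_mac, payload in packets:
        queues[mac_to_vm.get(dst_mac, default_queue)].append(payload)
    return dict(queues)


mac_to_vm = {"00:00:00:00:00:01": "vm1", "00:00:00:00:00:02": "vm2"}
packets = [
    ("00:00:00:00:00:01", "a"),
    ("00:00:00:00:00:02", "b"),
    ("00:00:00:00:00:01", "c"),
    ("ff:ff:ff:ff:ff:ff", "d"),  # no dedicated queue: left to the VMM
]
queues = sort_receive_packets(packets, mac_to_vm)
```

With the sorting done in NIC silicon, the VMM only has to hand each already-separated queue to its VM instead of inspecting every packet.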

VT-c: Single Root I/O Virtualization
The Virtualization Intermediary (VI) switches and manages data streams between System Images (SIs) and I/O devices. The VI has to:
- configure and set up I/O devices
- copy data streams SI ↔ VI ↔ I/O devices
- switch I/O access from and to SIs
- handle messages/interrupts I/O ↔ VI ↔ SI
- ensure secure data streams and messages between SIs

SW-based virtualization of I/O is time consuming, which limits performance.

VT-c: Single Root I/O Virtualization
Single Root I/O Virtualization (SR-IOV) is a Peripheral Component Interconnect Special Interest Group (PCI-SIG) specification. SR-IOV provides a standard mechanism for devices to advertise their ability to be simultaneously shared among multiple virtual machines. SR-IOV allows a PCI function to be partitioned into many virtual interfaces so that the resources of a PCI Express* (PCIe) device can be shared in a virtual environment.

With SR-IOV:
- SIs get direct access to PCIe device functions.
- The hypervisor (VI) no longer needs to manage all system resources.
- PCIe devices have multiple virtual functions (VFs) usable by multiple SIs.
- A single SI may also use multiple virtual functions.
- Security of I/O streams is ensured by:
  - independent control structures for each VF within one PCIe device
  - I/O address translation services
  - interrupt remapping mechanisms

SR-IOV is a specification, not a feature.

Intel® VT vs AMD-V
Although the architectures are different, AMD's virtualization technology provides a level of assistance to VMMs equivalent to that of Intel® VT. Intel® and AMD's virtualization technology roadmaps include equivalent extensions to accelerate and optimize virtualization software. AMD-V Rapid Virtualization Indexing improves performance in virtualized environments and is equivalent to Intel® VT Extended Page Tables. AMD-V Extended Migration is equivalent to VT Flex Migration.

Conclusions
VT reduces guest OS dependency: eliminates the need for binary patching / translation; facilitates support for legacy OSes. VT improves robustness: eliminates the need for complex SW techniques; simpler and smaller VMMs; smaller trusted-computing base. VT improves performance: fewer unwanted Guest ↔ VMM transitions.

Backup

The VMCS guest area
The guest-state area holds elements of the state of the virtual CPU associated with that VMCS. For proper VMM operation, certain registers must be loaded by every VM exit. These include the IA-32 registers that manage the operation of the processor, such as the segment registers (which map logical to linear addresses), CR3 (which maps linear to physical addresses), IDTR (used for event delivery), and many others. The guest-state area contains fields for these registers so that their values can be saved as part of each VM exit. It also contains fields that are not held in any software-accessible register, such as the processor's interruptibility state, which indicates whether external interrupts are temporarily masked and whether non-maskable interrupts (NMIs) are masked because software is handling an earlier NMI. The guest-state area does not contain fields for registers that the VMM can save and load itself (e.g. general-purpose registers such as AX, BX, CX, DX, SP, BP, SI, and DI). Excluding these registers improves the performance of VM entries and VM exits: software can manage them more efficiently because it knows better than the CPU when they need to be saved and loaded.

The VMCS control fields
The VMCS contains a number of fields that control VMX non-root operation by specifying the instructions and events that cause VM exits. The VMCS includes controls that support interrupt virtualization. External-interrupt exiting: if set, all external interrupts cause VM exits, and the guest cannot mask these interrupts. Interrupt-window exiting: if set, a VM exit occurs whenever guest software is ready to receive interrupts. Use TPR shadow: if set, accesses to the APIC's TPR through control register CR8 are handled in a special way: executions of MOV CR8 access a TPR shadow referenced by a pointer in the VMCS. The VMCS also includes a TPR threshold; a VM exit occurs after any instruction that reduces the TPR shadow below the TPR threshold (Flex Priority). CR0 and CR4 virtualization: there are VM-execution control fields that support efficient virtualization of the IA-32 control registers CR0 and CR4. Each of these registers comprises a set of bits controlling processor operation. A VMM may wish to retain control of some of these bits (e.g. those that control floating-point instructions). The VMCS includes, for each of these registers, a guest/host mask that a VMM can use to indicate which bits it wants to protect. Guest writes can freely modify the unmasked bits, but an attempt to modify a masked bit causes a VM exit. The VMCS also includes, for each of these registers, a read shadow whose value is returned to guest reads of the register.
Interrupt-window exiting: a guest OS may not be interruptible at a given moment (e.g. inside a critical section). Interrupt-window exiting lets the guest OS run until it has enabled interrupts (via EFLAGS.IF); the VMM is then notified through a VM exit and can deliver the pending interrupt. In a virtualized environment, if a guest OS is handling an NMI and another NMI arrives, the VMM must keep it pending and determine when the guest OS is ready. Different VMMs used to handle this in non-standard ways with workarounds. Now the VMM is notified (via a VM exit) that the guest is ready for the pending NMI.
Flex Priority: the TPR (Task Priority Register) is the register that stores the interrupt priority. It is part of the APIC and is memory-mapped, so the OS accesses it with ordinary memory reads and writes. The VMM must control accesses to the TPR, so without hardware assistance every access to that memory region must cause a VM exit. With a TPR shadow, VM exits occur only when the TPR is written and the new value is below the TPR threshold (also held in the VMCS), not when it is read, which removes much of the overhead of the context switches that each VM exit involves. In addition to the controls mentioned, there are VM-execution controls that support flexible VM exiting for a number of privileged instructions.
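The guest/host mask, read shadow and TPR threshold rules above amount to a few bitwise checks. The Python sketch below models them under simplifying assumptions: the function names are hypothetical, the registers are treated as plain integers, and the TPR is treated as a single priority number. It is a model of the decision logic only, not of how the processor implements it.

```python
def guest_read_cr(real, mask, shadow):
    """Value a guest sees when it reads CR0 or CR4.

    Masked (VMM-owned) bits come from the read shadow;
    unmasked (guest-owned) bits come from the real register.
    """
    return (shadow & mask) | (real & ~mask)


def guest_write_causes_exit(new_value, mask, shadow):
    """True if the write tries to modify a masked bit.

    A masked bit counts as modified when the written value differs
    from the read shadow in that bit position.
    """
    return ((new_value ^ shadow) & mask) != 0


def tpr_write_causes_exit(new_tpr, threshold):
    """True if a write lowers the TPR shadow below the TPR threshold."""
    return new_tpr < threshold


CR0_PE = 1 << 0  # protection enable
CR0_TS = 1 << 3  # task switched; affects floating-point instructions

mask = CR0_TS            # hypothetical policy: the VMM owns only TS
shadow = 0               # the guest believes TS is clear
real = CR0_PE | CR0_TS   # the VMM actually keeps TS set

seen_by_guest = guest_read_cr(real, mask, shadow)
```

Here the guest reads CR0 and sees TS clear even though the VMM keeps it set, and only a write that tries to flip TS leaves the guest and enters the VMM.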

The VMCS control fields
Exception bitmap: 32 entries, one for each IA-32 exception, specifying which exceptions cause VM exits and which do not. I/O bitmaps: one entry for each port in the 16-bit I/O space. An I/O instruction causes a VM exit if it attempts to access a port whose entry is set in the I/O bitmaps. MSR bitmaps: two entries (read and write) for each model-specific register (MSR) currently in use. An execution of RDMSR (or WRMSR) causes a VM exit if it attempts to read (or write) an MSR whose read bit (or write bit) is set in the MSR bitmaps.
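The bitmap lookups described above can be modeled as simple bit tests. The following Python sketch assumes a flat array with one bit per port and a 32-bit integer for the exception bitmap; the helper names and the example policy (one trapped port, one trapped exception vector) are hypothetical, and the actual in-memory layout used by the hardware is not modeled.

```python
IO_PORTS = 1 << 16  # 16-bit I/O space: 65,536 ports


def make_io_bitmap():
    """One bit per port, all clear: no port access causes a VM exit."""
    return bytearray(IO_PORTS // 8)


def trap_port(bitmap, port):
    """Set the entry for a port so that accesses to it cause VM exits."""
    bitmap[port // 8] |= 1 << (port % 8)


def io_causes_exit(bitmap, port):
    """True if the entry for this port is set in the I/O bitmap."""
    return bool(bitmap[port // 8] & (1 << (port % 8)))


def exception_causes_exit(exception_bitmap, vector):
    """True if bit `vector` is set in the 32-entry exception bitmap."""
    return bool((exception_bitmap >> vector) & 1)


io_bitmap = make_io_bitmap()
trap_port(io_bitmap, 0x3F8)   # hypothetical policy: trap a single port
exception_bitmap = 1 << 13    # hypothetical policy: trap only vector 13 (#GP)
```

One bit per port gives a bitmap of 8,192 bytes for the whole 16-bit I/O space, so the VMM can select exactly the ports it needs to intercept and let the rest run without a VM exit.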

VMCS location
The VMCS is referenced with a physical address. This eliminates the need to locate it in the guest's linear-address space (which may be different from that of the VMM). The format and memory layout of the VMCS are not architecturally defined. This allows implementation-specific optimizations that improve performance in VMX non-root operation and reduce the latency of VM entries and VM exits. VT-x defines a set of new instructions that allow software to access the VMCS in an implementation-independent manner.
