Virtualisation Front Side Buses SMP systems COMP Jamie Curtis
Virtualisation Allows a “Host” OS to execute a “Guest” OS as a task No need for “Guest” OS’s cooperation Host is often called a Hypervisor or VMM VMM controls devices Often presents virtual devices to guest OS’s
Virtualisation cont. PC architecture makes this tricky Protected mode introduced privilege rings 4 Levels – 0,1,2,3 with decreasing privilege Operating systems assume full control (ring 0) and userspace at ring 3 There is no control on reading this state Some instructions silently fail if not run at the correct privilege level instead of causing a fault
Virtualisation cont. 4 approaches to fix this Emulation Paravirtualisation Binary Translation Hardware Assisted Emulation Fake the entire machine Very slow but doesn’t require the same host architecture as the guest
Paravirtualisation Alter the guest so that it doesn’t use the “bad” instructions. Guest instead calls out to the VMM Very fast with minimal overhead Requires support from all guest OS’s Effectively limited to open source OS’s Championed by Xen
Binary Translation The VMM now dynamically translates the byte stream before it’s executed, replacing “bad” instructions as it goes Lots of optimisations make this not nearly as bad as it sounds. Allows running un-modified OS’s Performance hit can be anything from 5 – 60% depending on workload. Championed by VMware
Hardware Virtualisation Adds another privilege level for the VMM Hardware maintains state for each guest VMM gets very flexible control of what causes faults Intel and AMD again have similar specs (but incompatible !) AMD – “Pacifica”, Intel – “Vanderpool”
Hardware Virtualisation cont. Initial solutions don’t deal with enough in hardware, causing them often to be slower than BT Transition into and out of guests is very slow VMware and Xen both have added support Hardware support is getting more complete in newer versions
Virtual I/O Devices Typically VMMs provide virtual hardware drivers to the guests These may then map onto real hardware inside the VMM Allowing secure and separated direct access to hardware is difficult
AMD 10h Family AMD’s latest architecture adds a number of important virtualisation enhancements Primarily focused around making fewer calls into the VMM Three main enhancements Nested Page Tables Tagged TLB Device Exclusion Vector (DEV)
Nested Page Tables Current designs use shadowed page tables The MMU is setup for the current guest / process Changes to the MMU are trapped by the VMM Nested page tables replicate the page table per VM. Looks similar to the virtual address space provided by the OS to processes, just one more level up Makes lookups through the page table slower
Tagged TLB TLB’s cache lookups through the page tables Typically they are flushed each time you switch process or VM Tagged TLB’s add a tag to reference which VM this tag relates to No longer have to flush TLB’s when switching VM Makes VM – VMM – VM transitions cheaper Reduces the performance hit of nested page tables
Device Exclusion Vector Contains a mapping of what pages a device can access Allows the VMM to give DMA access to specific pages Allows a device to do DMA directly into a specific VM’s memory space
Virtualisation Future Linux now has 4 full virtualisation packages VMWare, Xen, KVM, Lguest Intel and AMD working on next-gen hardware support Dense, multi-core solutions making virtualisation very attractive Power and space efficient Keeps getting more and more important
Traditional Intel Architecture Front Side Bus Northbridge Southbridge
Clock vs. Data rates Clock rates no longer equal data rates Watch specifications as both can look identical e.g. Intel's FSB is normally referred to as an 1333MHz bus It is actually a 333MHz clock with a quad-pumped data rate Should really be 1333MT/s
Frontside Bus & Dual Core 64 bit, quad-pumped 333MHz 8 * 4 * 333 = 10.6GB/s Half Duplex Traditional P4 FSB has been point to point Xeon’s SMP requires chipsets support a proper multipoint bus for the FSB. How did a dual core Pentium D work ?
Intel Dual Core Slap two cores on the same die ! How do they communicate ? Over the FSB like a Xeon ! Requires a new chipset to support it.
Caches There are two sets of L1 and L2 cache, one for each core. What happens if both cores are caching the same memory location ? Need a protocol to make sure this doesn’t cause problems Cache Coherency Protocol
Cache Coherency Intel use the MESI Protocol Modified 1 cache contains a modified copy of the location Exclusive 1 cache contains a un-modified copy of the location Shared 2 or more caches contain un-modified copies of the location Invalid Another cache contains a modified copy of the location
MESI What happens if a core has an Invalid entry but tries to access it ? The cache with the Modified entry needs to write the entry back to memory and become a Shared entry This means an Intel Pentium D has to involve the FSB in all of it’s cache coherency updates even though they are on the same die
Intel Core Core architecture is designed for dual-core Cache is now shared between cores Cache coherency is now between each L1 and L2 Bus Control L2 Cache L2 Control Core 1 Core 2 FSB
Intel Core 2 Quad Current quad core design adds two dual core designs together Cache coherency between dies again happens over the FSB
AMD K8 Architecture Integrated Memory Controller Very low latency CPU determines memory technology Requires both CPU and Motherboard to be changed for a new type of memory Athlon 64, Socket 754 Single channel, 64bit DDR at 200MHz 8 * 2 * 200 = 3.2GB/s
AMD K8 Memory Interfaces Athlon 64, Socket 939 Dual channel, 64bit DDR at 200MHz 2 * 8 * 2 * 200 = 6.4 GB/s Athlon 64, Socket AM2 Dual channel, 64bit DDR2 at 400MHz 2 * 8 * 2 * 400 = 12.8 GB/s Opteron, Socket 940 Dual channel, 64bit DDR at 200MHz 2 * 8 * 2 * 200 = 6.4 GB/s
AMD K8 FSB AMD use the HyperTransport technology Don’t confuse with the very different Intel Hyper Threading Technology Packet based, point to point link that provides the lowest possible latency Full duplex bi-directional link Available in 2, 4, 8, 16 or 32 bits wide 50MHz – 1.4GHz clock rates Clock rates and bit widths can be asymmetric Double-pumped data rate 1.4GHz * 2 * 4 = 11.2GB/s per direction
Athlon 64 FSB The Athlon 64 has a single HyperTransport link to connect to I/O subsystem 16bit, bi-directional, DDR at 800MHz (754) or 1GHz (939) 2 * 2 * 2 * 1000 = 8GB/s Can have either a single chip solution or stay with two chip, northbridge, southbridge combination
Opteron Uses HT for both the I/O interconnect and CPU interconnect CPU interconnect requires an additional Cache Coherency protocol addition to HT Come in three variants 1xx – Memory controller and 3 HT links 2xx – Memory controller and 3 HT links 8xx – Memory controller and 3 HT links
Opteron What makes them different ? Number of HT busses that support the CC protocol 1xx – Zero 2xx – One 8xx – Three This allows you to scale to different numbers of processors 1xx – 1 way, 2xx – 2way, 8xx – 4 or 8 way
AMD MOESI Cache Coherency slightly different to Intel’s Owner – This cache owns this memory location and it (not memory) services all requests for it from other caches The request goes across the high speed dedicated HT bus
AMD Dual Core In the dual core situation, it becomes even better:
AMD Dual Core Requests now go across the System Request Interface This runs at CPU core frequency System Request Interface also controls HT – Memory access as well as HT – HT communication in 2xx and 8xx Opterons
AMD Quad Core AMD again targeted native quad core instead of dual-die, dual core Introduced shared L3 cache