Plugging the Hypervisor Abstraction Leaks Caused by Virtual Networking
Alex Landau, David Hadas, Muli Ben-Yehuda
IBM Research – Haifa
Alex Landau, 25 May 2010
© 2010 IBM Corporation

Hypervisor leaks
Original goal of hypervisors: a complete replica of the physical hardware
An application running on the host should be able to run in a guest
Host details leaked to the guest
  – Instruction set extensions
  – Bridged networking: leaks IP address, subnet mask, etc.
  – NAT: not suitable for many applications

Why are leaks bad?
Why is that a problem?
  – Checkpoint / restart
  – Cloning
  – Live migration
Example:
  – Guest acquires an IP address from DHCP
  – Guest is live-migrated to a different data center
  – Guest keeps using the old IP address in the new network
Current solution:
  – Defer the problem to guests and network equipment
  – E.g., VLANs
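
The live-migration example above can be made concrete with a small check. This is only an illustrative sketch: the addresses are hypothetical (taken from the documentation ranges), and the helper name `address_valid_after_migration` is not from the talk.

```python
import ipaddress

def address_valid_after_migration(guest_ip: str, subnet: str) -> bool:
    """Return True if the guest's existing address belongs to the given subnet."""
    return ipaddress.ip_address(guest_ip) in ipaddress.ip_network(subnet)

# Hypothetical values: the lease the guest got before migration,
# and the subnets of the old and new data centers.
lease_ip = "192.0.2.17"
old_subnet = "192.0.2.0/24"
new_subnet = "198.51.100.0/24"

print(address_valid_after_migration(lease_ip, old_subnet))  # True
print(address_valid_after_migration(lease_ip, new_subnet))  # False
```

The second result is the leak in action: the guest still holds an address that only made sense on the network of the host it left.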

Packet flow today (in KVM)
[Diagram: two guests, one using an emulated network adapter and one using VIRTIO. Labels, from the guest application down to the host:
  – Guest application, Socket Interface, Guest Network Stack
  – Guest Kernel: Network Adapter Driver or VIRTIO Frontend
  – QEMU: Emulated Network Adapter or VIRTIO Backend
  – Host Kernel: Virtual Network Interface (TAP)
  – Host Network Services (e.g., bridge or VAN central services)]

How to avoid leaks?
The hypervisor, not the network, is responsible for avoiding leaks
Guests should be:
  – Offered an isolated virtual environment
  – Independent of physical network characteristics (e.g., topology)
  – Independent of physical location (e.g., IP addresses)
Example:
  – Guest should receive an IP address independent of:
      Host running the guest
      Data center containing the host
      Network configuration of the host

Avoiding leaks – Encapsulation
Guest produces a Layer-2 frame
Host encapsulates it in a UDP packet
Host finds the destination host
  – By peeking at the destination (guest) MAC address
  – And “somehow” finding the destination host
Host transmits the UDP packet
Receiver host receives the UDP packet
Receiver host decapsulates the Layer-2 frame from the UDP packet
Receiver host passes the Layer-2 frame to the guest
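
The encapsulation steps above can be sketched in user space as follows. This is a minimal illustration, not the authors' implementation: the slides leave the MAC-to-host lookup open ("somehow"), so a plain dictionary stands in for it, and the MAC addresses, helper names, and loopback setup are all assumptions made for the example.

```python
import socket

ETH_HEADER_LEN = 14  # destination MAC (6) + source MAC (6) + EtherType (2)

def build_frame(dst_mac: bytes, src_mac: bytes, ethertype: int, payload: bytes) -> bytes:
    """Assemble a bare Ethernet frame (no preamble, no FCS)."""
    return dst_mac + src_mac + ethertype.to_bytes(2, "big") + payload

def dest_mac(frame: bytes) -> bytes:
    """Peek at the destination MAC: the first 6 bytes of the frame."""
    if len(frame) < ETH_HEADER_LEN:
        raise ValueError("frame is shorter than an Ethernet header")
    return frame[:6]

def send_encapsulated(sock: socket.socket, frame: bytes, forwarding: dict) -> tuple:
    """Look up the destination host and send the whole frame as a UDP payload."""
    host_addr = forwarding[dest_mac(frame)]  # hypothetical lookup table
    sock.sendto(frame, host_addr)
    return host_addr

def receive_decapsulated(sock: socket.socket, bufsize: int = 65535) -> bytes:
    """Receive one UDP datagram; its payload is the original Layer-2 frame."""
    frame, _sender = sock.recvfrom(bufsize)
    return frame

def loopback_roundtrip(frame: bytes) -> bytes:
    """Play both hosts on 127.0.0.1 and return the frame the receiver recovers."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as rx, \
         socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as tx:
        rx.bind(("127.0.0.1", 0))   # let the OS pick a free port
        rx.settimeout(2.0)          # do not hang if the datagram is lost
        forwarding = {dest_mac(frame): rx.getsockname()}
        send_encapsulated(tx, frame, forwarding)
        return receive_decapsulated(rx)

# Hypothetical guest MAC addresses.
GUEST_A = bytes.fromhex("525400000001")
GUEST_B = bytes.fromhex("525400000002")
frame = build_frame(GUEST_B, GUEST_A, 0x0800, b"hello from guest A")
print(loopback_roundtrip(frame) == frame)
```

The frame travels unmodified, so the guest addresses inside it never have to match the addressing of the physical network that carries the UDP packet.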

Proposed packet flow – Dual Stack
[Diagram labels:
  – Guest application, Socket Interface, Guest Network Stack
  – Guest Kernel: Network Adapter Driver or VIRTIO Frontend Driver
  – QEMU: Emulated Network Adapter or VIRTIO Backend
  – Traffic Encapsulation (one per guest), Host Network Stack, Host Kernel
  – Inset: App., Guest Stack, Driver, (Glue), Host Stack, Net Driver, Isolation]

Performance
The path from guest to wire is long
Latency comes from:
  – Packet copies
  – VM exits and entries
  – User/kernel mode switches
  – Host QEMU process scheduling

Large packets
Transport and network layers are capable of packets up to 64KB
Ethernet limit is 1500 bytes
  – Ignoring jumbo frames
But there is no Ethernet wire between guest and host!
Set MTU to 64KB in the guest
64KB packets are transferred from guest to host
  – Inhibit TCP/UDP checksum calculation and verification

Large packets – Flow
Application writes 64KB to a TCP socket
TCP and IP check the MTU (= 64KB) and create 1 TCP segment, 1 IP packet
Guest virtual NIC driver copies the entire 64KB frame to the host
Host writes the 64KB frame into a UDP socket
Host stack creates one 64KB UDP packet
If packet destination = VM on local host
  – Transfer the 64KB packet directly over the loopback interface
If packet destination = other host
  – Host NIC segments the 64KB packet in hardware
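
A rough count shows why the large MTU matters: every TCP segment the guest emits has to travel the long guest-to-host path on its own. The sketch below assumes 20-byte IPv4 and 20-byte TCP headers with no options, and treats the slides' "64KB" as approximate by using 65,495 bytes, the largest TCP payload that fits in one IPv4 packet under those assumptions. The function name is illustrative.

```python
import math

IP_HEADER = 20   # IPv4 header without options (assumption)
TCP_HEADER = 20  # TCP header without options (assumption)

def segments_needed(write_size: int, mtu: int) -> int:
    """Number of TCP segments needed to carry write_size bytes at a given MTU."""
    mss = mtu - IP_HEADER - TCP_HEADER
    return math.ceil(write_size / mss)

write_size = 65535 - IP_HEADER - TCP_HEADER  # 65,495 bytes, roughly "64KB"

print(segments_needed(write_size, 1500))   # 45 guest-to-host transfers
print(segments_needed(write_size, 65535))  # 1 guest-to-host transfer
```

With the Ethernet-sized MTU the per-packet costs (copies, VM exits, mode switches) are paid 45 times for the same application write; with the large MTU they are paid once.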

CPU affinity and pinning
QEMU process contains 2 threads
  – CPU thread (actually, one CPU thread per guest vCPU)
  – IO thread
Linux process scheduler selects the core(s) to run threads on
The scheduler often made poor decisions
  – Scheduling both threads on the same core
  – Constantly rescheduling (core 0 -> 1 -> 0 -> 1 -> …)
Solution/workaround: pin the CPU thread to core 0 and the IO thread to core 1
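
The pinning workaround can be illustrated with Python's `os.sched_setaffinity`, which is available on Linux only. This sketch pins two of its own worker threads, standing in for the CPU and IO threads; it does not touch QEMU, and the helper names are hypothetical. It picks the two lowest-numbered cores the process is allowed to use rather than hard-coding cores 0 and 1.

```python
import os
import threading

def pick_two_cores(available):
    """Choose one core for the CPU thread and one for the IO thread."""
    cores = sorted(available)
    if not cores:
        raise ValueError("no cores available")
    cpu_core = cores[0]
    io_core = cores[1] if len(cores) > 1 else cores[0]  # fall back on single-core hosts
    return cpu_core, io_core

def pin_current_thread(core):
    """Pin the calling thread to a single core and return its new affinity set."""
    os.sched_setaffinity(0, {core})  # 0 means "the caller" on Linux
    return os.sched_getaffinity(0)

def demo():
    """Start two threads, pin each to its own core, and report the affinities."""
    cpu_core, io_core = pick_two_cores(os.sched_getaffinity(0))
    results = {}

    def worker(name, core):
        results[name] = pin_current_thread(core)

    threads = [
        threading.Thread(target=worker, args=("cpu", cpu_core)),
        threading.Thread(target=worker, args=("io", io_core)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

if hasattr(os, "sched_setaffinity"):  # skip on platforms without this call
    print(demo())
```

Once each thread is confined to one core, the scheduler can no longer place both on the same core or bounce them back and forth.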

Flow control
Guest does not anticipate flow control at Layer 2
Thus, the host should not provide flow control
  – Otherwise, bad effects similar to those of TCP-in-TCP encapsulation will occur
Lacking flow control, the host should have large enough socket buffers
Example:
  – Guest uses TCP
  – Host buffers should be at least the guest TCP connection's bandwidth × delay product
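
The buffer-sizing rule above amounts to a bandwidth-delay product calculation. The sketch below uses hypothetical figures (1 Gbit/s and a 10 ms round-trip time) and requests a matching receive buffer on a UDP socket; the operating system may adjust or cap the requested size, so the value is read back rather than assumed.

```python
import socket

def bandwidth_delay_product(bandwidth_bits_per_s: int, rtt_ms: int) -> int:
    """Bytes in flight on a path: bandwidth (bit/s) x round-trip time (ms), in bytes."""
    return bandwidth_bits_per_s * rtt_ms // 8000  # /1000 for ms, /8 for bits

# Hypothetical link: 1 Gbit/s with a 10 ms round-trip time.
bdp = bandwidth_delay_product(1_000_000_000, 10)
print(bdp)  # 1250000 bytes

with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
    # Ask for a receive buffer at least as large as the bandwidth-delay product.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, bdp)
    granted = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)

print(granted)  # what the OS actually granted; may differ from the request
```

If the granted size is smaller than the bandwidth-delay product, the host may drop encapsulated frames before the guest's TCP has a chance to fill the path.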

Performance results
[Charts: Throughput; Receiver CPU Utilization]

Thank you!