
1 vSnoop: Improving TCP Throughput in Virtualized Environments via Acknowledgement Offload
Ardalan Kangarlou, Sahan Gamage, Ramana Kompella, Dongyan Xu. Department of Computer Science, Purdue University. Thank you, Dr. Kielmann.

2 Cloud Computing and HPC
We have recently seen growing adoption of cloud resources for high-performance computing applications, with headlines such as: cloud computing comes to HPC, exploring cloud resources for scientific research, cloud providers provisioning for HPC applications, and even one claiming that cloud computing and HPC are twins separated at birth. The truth is that the cloud may not be the optimal platform for every HPC application, but for many workloads in academic research and the pure sciences, weather modeling, automobile design simulation, aerospace, energy, and the life sciences, as well as for the newer class of business intelligence (BI) and analytics applications, it can be quite useful.

3 Background and Motivation
Virtualization is a key enabler of cloud computing (Amazon EC2, Eucalyptus) and is increasingly adopted in other real systems: high performance computing (NERSC's Magellan system) and grid/cyberinfrastructure computing (In-VIGO, Nimbus, Virtuoso).

4 VM Consolidation: A Common Practice
Multiple VMs hosted by one physical host, multiple VMs sharing the same core: flexibility, scalability, and economy. Key observation: VM consolidation negatively impacts network performance!
(Figure: VM 1 through VM 4 running on top of the virtualization layer and hardware of a single host.)
A common practice in virtualized systems is VM consolidation: multiple VMs run on the same host, each customized with different applications, libraries, and operating systems. Often these VMs share the same core or CPU to achieve greater economies of scale, which leads to more efficient use of hardware resources and energy savings; for instance, for Amazon EC2 medium instances, two VMs are run per core, and this number may increase in the future. Beyond these benefits, VM migration and server consolidation are two attractive features of virtualized clouds: migration enables moving a virtual machine from one physical server to another, for example in anticipation of maintenance or to move computation closer to data, while consolidation packs multiple VMs onto the same server to free up physical resources and save energy during periods of light load. Unfortunately, as we will see, consolidation hurts network performance.

5 Investigating the Problem
(Figure: a non-virtualized client (the sender) connected to a server hosting VM 1, VM 2, and VM 3 on top of the virtualization layer and hardware.)
To demonstrate and investigate why network performance suffers as a result of VM consolidation, we conducted a series of experiments with this setup, which we use to answer three questions.

6 Q1: How does CPU Sharing affect RTT?
RTT increases in proportion to the VM scheduling slice (30 ms).
(Figure: RTT in ms versus the number of VMs per core (2 to 5), with reference latencies for US East-West, US East-Europe, and US West-Australia.)
In the first experiment, a non-virtualized host sends ping packets to a VM while 2, 3, 4, and 5 VMs run on the same host. RTT grows in proportion to the 30 ms VM scheduling slice in Xen. To put things in perspective, the RTT in the 2-VM scenario is about 60 ms, on the order of the latency between the US east and west coasts, even though sender and receiver are only one hop apart; the sub-millisecond latency expected in a LAN is gone. This observation is the main motivation for vSnoop: server consolidation and running multiple VMs on the same core adversely affect TCP throughput, particularly for the small TCP flows that constitute the majority of flows in data center and cloud environments. The next few slides walk through experiments that pin down the cause so we can devise an effective solution.
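
As a back-of-envelope check (my own illustration, not from the paper), assume strict round-robin scheduling with a fixed 30 ms slice: a packet that arrives just after its target VM is descheduled can wait up to (N-1) x 30 ms before that VM runs again, which tracks the trend in the plot. A minimal sketch:

```c
#include <stdio.h>

/* Illustrative model only: worst-case wait for the receiving VM to be
 * scheduled, assuming strict round-robin with a fixed 30 ms slice.
 * Measured RTTs also include dom0 processing and the return path. */
int main(void) {
    const double slice_ms = 30.0;
    for (int nvms = 2; nvms <= 5; nvms++) {
        double worst_wait = (nvms - 1) * slice_ms;  /* all other VMs run a full slice first */
        printf("%d VMs/core: worst-case scheduling wait ~%.0f ms (one direction)\n",
               nvms, worst_wait);
    }
    return 0;
}
```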

7 Q2: What is the Cause of RTT Increase?
VM scheduling latency dominates virtualization overhead!
(Figure: Xen's split-driver I/O path (sender, device driver in the driver domain dom0, per-VM shared buffers, VM 1 through VM 3) alongside a CDF of per-packet delay: '+' marks dom0 processing time, 'x' marks wait time in the shared buffer, with steps at 30 ms intervals.)
The second question is what causes the RTT increase: is it network device virtualization or something else? First, a brief overview of network device virtualization in paravirtualized Xen. A special domain, the driver domain (dom0), hosts the actual hardware device drivers, including the network driver. When a packet arrives at the NIC, the device driver in dom0 processes it and places it in a buffer shared between the driver domain and the recipient VM. In this experiment we trace packets from the network card all the way to the VM to see where they spend most of their time. The CDF of dom0 processing time is almost zero, whereas the CDF of buffer wait time shows that packets spend most of their time in the shared buffer, waiting for the recipient VM to be scheduled. With 3 VMs on the same core, the CDF jumps at 30 ms intervals, matching Xen's scheduling slice. This confirms that VM scheduling, not driver-domain processing, is the main reason for the RTT increase; the driver domain itself is scheduled more frequently than the VMs so that it can process their I/O.
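
The CDF in the figure can be derived from per-packet timestamps taken when a packet is enqueued into the shared buffer and when the guest dequeues it. A minimal sketch (my own illustration with made-up timestamps, not the paper's instrumentation):

```c
#include <stdio.h>
#include <stdlib.h>

/* Build an empirical CDF of per-packet buffer wait times from hypothetical
 * (enqueue, dequeue) timestamp pairs, in milliseconds. */
static int cmp_double(const void *a, const void *b) {
    double d = *(const double *)a - *(const double *)b;
    return (d > 0) - (d < 0);
}

int main(void) {
    double enq[] = { 0.0, 1.2, 2.5, 40.1, 41.3 };    /* made-up enqueue times  */
    double deq[] = { 30.2, 30.3, 30.4, 60.5, 60.6 }; /* made-up dequeue times  */
    size_t n = sizeof enq / sizeof enq[0];

    double *wait = malloc(n * sizeof *wait);
    for (size_t i = 0; i < n; i++)
        wait[i] = deq[i] - enq[i];                   /* time spent in the shared buffer */

    qsort(wait, n, sizeof *wait, cmp_double);
    for (size_t i = 0; i < n; i++)                   /* CDF point (wait, fraction <= wait) */
        printf("%5.1f ms  %.2f\n", wait[i], (double)(i + 1) / n);

    free(wait);
    return 0;
}
```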

8 Q3: What is the Impact on TCP Throughput?
Connection to the VM is much slower than to dom0!
(Figure: TCP sequence number versus time for a 1 MB transfer; '+' marks the connection to dom0, 'x' the connection to the VM.)
Having identified that RTT increases and that VM scheduling is the main reason, let's examine how VM scheduling affects TCP throughput. We compare tcpdump traces of a 1 MB file transfer to the driver domain (dom0) and to a guest VM (domU in Xen terminology). The x-axis is time and the y-axis is the sequence number, which shows the progress of each connection. TCP slow start advances much more quickly for the connection to dom0, which makes sense intuitively: dom0 is scheduled more frequently than the guest VMs in order to process their I/O.

9 Our Solution: vSnoop
vSnoop alleviates the negative effect of VM scheduling on TCP throughput. It is implemented within the driver domain to accelerate TCP connections, requires no modifications to the VM, does not violate end-to-end TCP semantics, and is applicable across a wide range of VMMs (Xen, VMware, KVM, etc.).

10 TCP Connection to a VM
(Figure: timeline of a SYN traveling from the sender through the driver domain into VM1's buffer while VM1, VM2, and VM3 take turns on the core; the SYN,ACK is delayed by the VM scheduling latency, which inflates the RTT.)
Before presenting the main idea behind vSnoop's design, here are the steps in the progress of a TCP connection to a VM. Three VMs run on the same core and the sender establishes a connection to VM1. The driver domain places the SYN in VM1's buffer, but VM1 cannot respond with a SYN,ACK until it is scheduled, so the handshake, and every subsequent round trip, absorbs the VM scheduling latency.

11 Key Idea: Acknowledgement Offload
(Figure: with vSnoop, the driver domain returns the SYN,ACK and subsequent ACKs on VM1's behalf while VM2 and VM3 hold the core, giving faster progress during TCP slow start; by the time VM1 is scheduled, it finds more packets waiting in its shared buffer.)
We observed that VM scheduling dominates the RTT of a TCP packet. To address this, vSnoop offloads acknowledgement to the driver domain: the driver domain acknowledges in-order packets on the VM's behalf as they arrive, instead of waiting for the VM to be scheduled. The VM's own redundant ACKs are later suppressed, and vSnoop's presence remains completely transparent to both the sender and the receiver VM.
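
A minimal sketch of the early-acknowledgement decision (illustrative only; the structure layout, field names, and helper functions are hypothetical, not the vSnoop source):

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical per-flow bookkeeping kept by the driver domain. */
typedef struct {
    uint32_t expected_seq;   /* next in-order sequence number for this flow    */
    uint32_t buf_free;       /* free space in the VM's shared buffer, in bytes */
    uint16_t rwnd_clamp;     /* largest receive window vSnoop will advertise   */
} flow_t;

/* Only in-order data that fits in the shared buffer is eligible for an early
 * ACK; everything else is passed to the VM untouched, as the talk describes. */
bool vsnoop_may_early_ack(const flow_t *f, uint32_t seq, uint32_t len) {
    return seq == f->expected_seq && len <= f->buf_free;
}

/* On an early ACK: advance the expected sequence, charge the buffer, and
 * advertise a window no larger than the remaining buffer space. */
void vsnoop_early_ack(flow_t *f, uint32_t len, uint32_t *ack_seq, uint16_t *adv_wnd) {
    f->expected_seq += len;
    f->buf_free     -= len;
    *ack_seq = f->expected_seq;
    *adv_wnd = (f->buf_free < f->rwnd_clamp) ? (uint16_t)f->buf_free : f->rwnd_clamp;
}

int main(void) {
    flow_t f = { .expected_seq = 1000, .buf_free = 8192, .rwnd_clamp = 65535 };
    uint32_t ack; uint16_t wnd;
    if (vsnoop_may_early_ack(&f, 1000, 1460)) {      /* in-order, MSS-sized segment */
        vsnoop_early_ack(&f, 1460, &ack, &wnd);
        printf("early ACK %u, advertised window %u\n", (unsigned)ack, (unsigned)wnd);
    }
    return 0;
}
```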

12 vSnoop’s Impact on TCP Flows
TCP slow start: early acknowledgements help connections progress faster, with the most significant benefit for short transfers, which are the most prevalent in data centers [Kandula IMC'09], [Benson WREN'09]. TCP congestion avoidance and fast retransmit: large flows in steady state can also benefit from vSnoop, though not as much as during slow start.
A TCP connection has a slow-start phase and a steady-state phase. vSnoop expedites slow start, which is particularly important for small flows that spend their entire lifetime in slow start; many recent studies have shown that such flows constitute the majority of flows in data centers. vSnoop does not hurt large flows that spend most of their time in steady state, and can still benefit them, for instance during congestion avoidance and fast retransmit.
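
A rough illustration of why slow start dominates short transfers (my own numbers and assumptions, not the paper's: an initial window of 3 segments, a 1460-byte MSS, the window doubling every RTT, and no losses):

```c
#include <stdio.h>

/* Number of RTTs slow start needs to deliver a transfer under the stated
 * assumptions. Every RTT inflated by VM scheduling delay is paid once per
 * round, so short flows, which live entirely in slow start, suffer most. */
static int slowstart_rounds(long bytes, int mss, int init_segs) {
    long segs_needed = (bytes + mss - 1) / mss;
    long cwnd = init_segs, delivered = 0;
    int rounds = 0;
    while (delivered < segs_needed) {
        delivered += cwnd;
        cwnd *= 2;
        rounds++;
    }
    return rounds;
}

int main(void) {
    long sizes[] = { 50L * 1024, 100L * 1024, 1024L * 1024, 10L * 1024 * 1024 };
    for (int i = 0; i < 4; i++)
        printf("%8ld KB transfer: ~%d RTTs in slow start\n",
               sizes[i] / 1024, slowstart_rounds(sizes[i], 1460, 3));
    return 0;
}
```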

13 Challenges
Challenge 1: out-of-order and special packets (SYN, FIN). Solution: let the VM handle these packets. Challenge 2: packet loss after vSnoop. Solution: vSnoop acknowledges only if there is room in the buffer. Challenge 3: ACKs generated by the VM. Solution: suppress or rewrite ACKs already generated by vSnoop. Challenge 4: the receive window must be throttled to keep vSnoop online. Solution: adjust it according to the buffer size.
What I just described sounds simple, but many details need to be accounted for. First, when out-of-order packets arrive, vSnoop cannot simply acknowledge them because TCP acknowledgements are cumulative; to stay simple and lightweight, vSnoop does not buffer and reorder such packets but lets the VM handle them as it normally would. This works well in practice because most packets arrive in order in data center environments. Special packets such as SYN and FIN must likewise be acknowledged by the VM, not vSnoop. Second, packets must not be lost en route to the VM after vSnoop has acknowledged them; the paper explains in detail how several factors collectively prevent such loss, one being that vSnoop acknowledges a packet only when there is room for it in the shared buffer between the VM and the driver domain. Third, since the VM is unaware of vSnoop, it still generates ACKs for packets vSnoop has already acknowledged; to keep unnecessary duplicate ACKs from reaching the sender, vSnoop suppresses empty ACKs that correspond to those packets. Finally, to comply with TCP semantics, and because the shared buffer is a scarce resource, vSnoop must be judicious about the receive window it advertises in its acknowledgements.
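
A minimal sketch of the duplicate-ACK suppression from challenge 3 (illustrative only; type and function names are hypothetical, and real code would also have to handle window updates and options):

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint32_t last_ack_sent;   /* highest ACK number vSnoop already sent on the VM's behalf */
} ack_flow_t;

/* Drop pure ACKs from the VM whose acknowledgement number does not move past
 * what vSnoop has already acknowledged; never drop ACKs piggybacked on data. */
bool vsnoop_suppress_vm_ack(const ack_flow_t *f, uint32_t vm_ack, bool has_payload) {
    if (has_payload)
        return false;
    /* Wrap-around-safe sequence comparison via signed 32-bit difference. */
    return (int32_t)(vm_ack - f->last_ack_sent) <= 0;
}

int main(void) {
    ack_flow_t f = { .last_ack_sent = 5000 };
    printf("%d\n", vsnoop_suppress_vm_ack(&f, 4000, false));  /* 1: already covered  */
    printf("%d\n", vsnoop_suppress_vm_ack(&f, 6000, false));  /* 0: new information  */
    return 0;
}
```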

14 State Machine Maintained Per-Flow
Early acknowledgements for in-order packets. The per-flow state machine has three states: Active (online), entered when the flow starts, where each in-order packet with buffer space available is acknowledged early; No buffer (offline), entered when an in-order packet arrives but the shared buffer is full; and Unexpected sequence (offline), entered when an out-of-order packet arrives, in which case vSnoop does not acknowledge and simply passes the packet to the VM.
This state machine handles the special cases just described, namely out-of-order packets and a full shared buffer: whenever either occurs, vSnoop goes offline and packets are handled exactly as if vSnoop were not present, so the TCP implementation is never made less reliable or more aggressive. The design is generic and can be applied to virtualization platforms other than Xen, such as VMware and VirtualBox. vSnoop maintains one such state machine per flow (connection).
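
A compact sketch of such a per-flow state machine (my reconstruction from the slide; state names, fields, and the resynchronization comment are assumptions, not the vSnoop source):

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef enum { VS_ACTIVE, VS_NO_BUFFER, VS_UNEXPECTED_SEQ } vs_state_t;

typedef struct {
    vs_state_t state;
    uint32_t   expected_seq;   /* next in-order sequence number             */
    uint32_t   buf_free;       /* free space in the shared buffer, in bytes */
} vs_flow_t;

/* Process one incoming segment; returns true if vSnoop acknowledges it early. */
bool vs_handle_segment(vs_flow_t *f, uint32_t seq, uint32_t len) {
    bool in_order  = (seq == f->expected_seq);
    bool has_space = (len <= f->buf_free);

    if (in_order && has_space) {     /* online: early ACK and stay in (or return to) Active */
        f->state = VS_ACTIVE;
        f->expected_seq += len;
        f->buf_free     -= len;
        return true;
    }
    /* Offline: don't acknowledge, pass the packet to the VM unchanged. A real
     * implementation would resynchronize expected_seq (e.g. by watching the
     * VM's own ACKs) before going back online after an out-of-order packet. */
    f->state = in_order ? VS_NO_BUFFER : VS_UNEXPECTED_SEQ;
    return false;
}

int main(void) {
    vs_flow_t f = { VS_ACTIVE, 1, 4096 };
    printf("%d\n", vs_handle_segment(&f, 1, 1460));     /* 1: in order, space -> early ACK */
    printf("%d\n", vs_handle_segment(&f, 9999, 1460));  /* 0: out of order -> offline      */
    return 0;
}
```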

15 vSnoop Implementation in Xen
(Figure: VM1 through VM3, each with a tuned netfront, connected via per-VM shared buffers to netback instances and to the Linux bridge, with vSnoop inside the bridge in the driver domain, dom0.)
Our vSnoop implementation in paravirtualized Xen only entails changes to the Linux bridge in the driver domain; we made no changes to the Xen hypervisor. This keeps the design general, so it can be applied to other VMMs such as VMware and KVM. We also tune netfront, which enables placing more packets in the shared buffer between the VM and the driver domain.

16 Evaluation
We evaluate vSnoop along three dimensions: the overhead of vSnoop itself, TCP throughput speedup, and application-level speedup for a multi-tier web service (RUBiS) and MPI benchmarks (Intel MPI Benchmark and High-Performance Linpack). The next slide describes the evaluation setup.

17 Evaluation – Setup
VM hosts: 3.06 GHz Intel Xeon CPUs, 4 GB RAM, with only one core/CPU enabled; Xen 3.3 with Linux for the driver domain (dom0) and the guest VMs. Client machine: 2.4 GHz Intel Core 2 Quad CPU, 2 GB RAM, Linux. The machines are connected by a Gigabit Ethernet switch.

18 vSnoop Overhead
Per-packet CPU overhead for vSnoop routines in dom0, profiled with Xenoprof [Menon VEE'05]; the multiple-stream case uses 100 concurrent connections.

vSnoop routine            Single stream          Multiple streams
                          Cycles    CPU %        Cycles    CPU %
vSnoop_ingress()          509       3.03         516       3.05
vSnoop_lookup_hash()      74        0.44         91        0.51
vSnoop_build_ack()        52        0.32         -         -
vSnoop_egress()           104       0.61         -         -

Xenoprof supports profiling CPU usage at the fine granularity of individual processes and routines executed in the Xen VMM, driver domain, and guest VMs. We use it to measure the overhead of the different vSnoop routines in terms of CPU cycles and CPU percentage, and we additionally instrument the routines to record the number of packets they process, which gives the per-packet cost of each routine at a given point in time. The aggregate CPU overhead is minimal: since the driver domain's sole purpose is to perform I/O on behalf of guest VMs and it runs no applications, about 3% is a very small cost.

19 TCP Throughput Improvement
(Figure: CDFs of TCP throughput, on a log scale, for 1000 transfers of a 100 KB file to the VM under three configurations: '+' vanilla Xen, 'x' Xen+tuning, '*' Xen+tuning+vSnoop. Median throughputs: 0.192 MB/s, 0.778 MB/s, and 6.003 MB/s, a roughly 30x improvement.)
How much improvement do vSnoop and the netfront tuning yield? The answer is not simple: depending on the timing of packet transmissions relative to VM scheduling, throughput varies widely. Here 3 VMs are consolidated on the same core, each with 60% CPU load, and we transfer a 100 KB file from the client to the VM 1000 times under vanilla Xen, Xen with netfront tuning, and Xen with netfront tuning plus vSnoop. The x-axis is TCP throughput on a log scale and the y-axis is the CDF. Because of the large variation, comparing averages is not meaningful; instead we compare medians: 0.192 MB/s for vanilla Xen, 0.778 MB/s for Xen+tuning, and 6.003 MB/s for Xen+tuning+vSnoop, a 30x improvement. Particularly interesting is that for about 35% of the transfers vSnoop achieves optimal throughput: all packets are placed in the shared buffer before the VM is even scheduled, so the measured throughput can exceed the network link rate.

20 TCP Throughput: 1 VM/Core
(Figure: median TCP throughput for transfer sizes from 50 KB to 100 MB under Xen, Xen+tuning, and Xen+tuning+vSnoop with 1 VM per core, normalized to the Xen+tuning+vSnoop value.)
The median TCP throughput of each configuration is normalized to the Xen+tuning+vSnoop value. vSnoop is particularly effective for smaller flows, as they are more susceptible to VM scheduling delays.

21 TCP Throughput: 2 VMs/Core
(Figure: the same comparison with 2 VMs per core: normalized median TCP throughput for transfer sizes from 50 KB to 100 MB under Xen, Xen+tuning, and Xen+tuning+vSnoop.)

22 TCP Throughput: 3 VMs/Core
(Figure: the same comparison with 3 VMs per core.)

23 TCP Throughput: 5 VMs/Core
(Figure: the same comparison with 5 VMs per core.) vSnoop's benefit rises with higher VM consolidation.

24 TCP Throughput: Other Setup Parameters
We also varied the CPU load of the VMs and the number of TCP connections to the VM, placed the driver domain on a separate core, and used a VM as the sender. vSnoop consistently achieves significant TCP throughput improvement across these settings.

25 Application-Level Performance: RUBiS
(Figure: RUBiS setup. Client threads run on a separate client machine; Apache runs in a guest VM (dom1) on Server1 and MySQL in a guest VM (dom1) on Server2, each server also hosting a second guest (dom2) and vSnoop in dom0.)

26 RUBiS Results

RUBiS Operation          Count w/o vSnoop   Count w/ vSnoop   % Gain
Browse                   421                505               19.9%
BrowseCategories         288                357               23.9%
SearchItemsInCategory    3498               4747              35.7%
BrowseRegions            128                141               10.1%
ViewItem                 2892               3776              30.5%
ViewUserInfo             732                846               15.6%
ViewBidHistory           339                398               17.4%
Others                   3939               4815              22.2%
Total                    12237              15585             27.4%
Average Throughput       29 req/s           37 req/s          27.5%

27 Application-level Performance – MPI Benchmarks
Intel MPI Benchmark: network intensive. High-Performance Linpack: CPU intensive.
(Figure: four servers, each hosting an MPI node in dom1 plus a second guest, dom2, added to study the effect of VM consolidation, with vSnoop running in dom0 of every server.)

28 Intel MPI Benchmark Results: Broadcast
(Figure: normalized execution time of the MPI Broadcast operation for message sizes from 64 KB to 8 MB under Xen, Xen+tuning, and Xen+tuning+vSnoop; lower is better.)
The execution time of each MPI network operation is normalized to its duration under the vanilla Xen configuration. vSnoop yields up to a 40% improvement.

29 Intel MPI Benchmark Results: All-to-All
(Figure: normalized execution time of the MPI All-to-All operation for message sizes from 64 KB to 8 MB under Xen, Xen+tuning, and Xen+tuning+vSnoop; lower is better.)
Execution times are again normalized to the vanilla Xen configuration.

30 HPL Benchmark Results
(Figure: Gflops achieved for problem size and block size combinations (N,NB) from (4K,2) to (8K,16) under Xen and Xen+tuning+vSnoop.)
HPL interleaves computation and communication and is largely CPU-bound, which is a less favorable case for vSnoop; still, in the configurations where communication weighs more heavily, vSnoop improves the achieved Gflops by up to 40% in the best case.

31 Related Work Optimizing virtualized I/O path
Menon et al. [USENIX ATC'06, '08; ASPLOS'09]. Improving intra-host VM communications: XenSocket [Middleware'07], XenLoop [HPDC'08], Fido [USENIX ATC'09], XWAY [VEE'08], IVC [SC'07]. I/O-aware VM scheduling: Govindan et al. [VEE'07], DVT [SoCC'10]. vSnoop differs from these existing solutions: we identify VM CPU sharing as a major source of TCP performance degradation, analyze the problem thoroughly, and address it by offloading TCP acknowledgement to the driver domain.

32 Conclusions Problem: VM consolidation degrades TCP throughput
Solution: vSnoop leverages acknowledgement offloading, does not violate end-to-end TCP semantics, is transparent to applications and the OS in VMs, and is generically applicable to many VMMs. Results: a 30x improvement in median TCP throughput, about a 30% improvement in the RUBiS benchmark, and a 40-50% reduction in execution time for the Intel MPI benchmark; vSnoop achieves even higher gains at the tail of the throughput distribution.

33 Thank You
For more information, Google "vSnoop Purdue".

34 TCP Benchmarks cont. Testing different scenarios:
a) 10 concurrent connections, b) the sender also subject to VM scheduling, and c) the driver domain on a separate core (one result panel per scenario).

35 TCP Benchmarks cont. Varying CPU load for 3 consolidated VMs:

