Ardalan Kangarlou, Sahan Gamage, Ramana Kompella, Dongyan Xu

Presentation transcript:

vSnoop: Improving TCP Throughput in Virtualized Environments via Acknowledgement Offload. Ardalan Kangarlou, Sahan Gamage, Ramana Kompella, Dongyan Xu, Department of Computer Science, Purdue University. Thank you, Dr. Kielmann.

Cloud Computing and HPC We have recently seen a growing adoption of cloud resources for high performance computing applications, with headlines such as "Cloud computing to HPC", "Exploring cloud resources for scientific research", "Cloud providers provisioning for HPC applications", and even one claiming "Cloud computing and HPC are twins separated at birth." The truth is that the cloud may not be the optimal platform for every HPC application, but it can be quite useful for many workloads: academic research in the pure sciences, weather modeling, automobile design simulation, aerospace, energy, life sciences, and the newer class of business intelligence (BI) and analytics applications.

Background and Motivation Virtualization is a key enabler of cloud computing (Amazon EC2, Eucalyptus) and is increasingly adopted in other real systems: high performance computing (NERSC's Magellan system) and grid/cyberinfrastructure computing (In-VIGO, Nimbus, Virtuoso).

VM Consolidation: A Common Practice Multiple VMs are hosted by one physical host, often sharing the same core, for flexibility, scalability, and economy. (Diagram: VMs 1-4 running on a virtualization layer over the hardware.) Key observation: VM consolidation negatively impacts network performance! A common practice in virtualized systems is VM consolidation: we run multiple VMs on the same host, each customized with different applications, libraries, and operating systems. These VMs often share the same core or CPU to achieve greater economy of scale, which leads to more efficient use of hardware resources and energy savings; for instance, Amazon EC2 medium instances run two VMs per core, and this number may increase in the future. Server consolidation refers to packing multiple VMs on the same server and is commonly used to free up physical resources and save energy during periods of light computation; VM migration, another attractive feature of virtualized clouds, enables moving a virtual machine from one physical server to another, for example in anticipation of maintenance or to move computation closer to data. Unfortunately, we observe that consolidation comes at a cost to network performance.

Investigating the Problem (Setup diagram: a client/sender communicating with a server that hosts VMs 1-3 on a virtualization layer over the hardware.) To investigate why network performance suffers as a result of VM consolidation, we conduct a series of experiments with this setup and use it to answer three questions.

Q1: How does CPU sharing affect RTT? (Plot: RTT in ms vs. number of consolidated VMs, with reference lines for US East – West, US East – Europe, and US West – Australia latencies.) RTT increases in proportion to the VM scheduling slice (30 ms). The first question is how CPU sharing by multiple VMs during consolidation affects the RTT of network packets. On the x-axis we have the number of VMs and on the y-axis the round-trip time of ping packets sent by a non-virtualized host to a VM while 2, 3, 4, and 5 VMs run on the same host. RTT grows in proportion to the 30 ms VM scheduling slice in Xen. To put things in perspective, the RTT in the 2-VM scenario is about 60 ms, on the order of the latency between the US east and west coasts; even though sender and receiver are just one hop apart, we no longer see the sub-millisecond latency expected in a LAN. This observation motivates vSnoop: server consolidation, i.e., running multiple VMs on the same core or CPU, adversely affects TCP throughput, particularly for the small TCP flows that constitute the majority of flows in data center and cloud environments. The next few slides work through experiments to understand the problem before presenting the solution.
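A simple round-robin model, my own illustrative simplification rather than an equation from the talk, captures why RTT grows with the number of consolidated VMs; here N is the number of VMs sharing the core and T_slice is the 30 ms scheduling slice mentioned above.

% Illustrative round-robin model of VM scheduling delay (an assumption for
% intuition, not taken from the vSnoop slides). With N VMs sharing one core
% and a scheduling slice of T_slice = 30 ms, a packet arriving while its
% target VM is descheduled waits for the remaining slices of the other VMs:
\[
  0 \le d_{\mathrm{sched}} \le (N-1)\,T_{\mathrm{slice}},
  \qquad
  \mathbb{E}\!\left[d_{\mathrm{sched}}\right] \approx \tfrac{1}{2}(N-1)\,T_{\mathrm{slice}}
\]
% Each additional co-located VM therefore adds on the order of tens of
% milliseconds to the observed RTT, matching the trend in the measurements.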

Q2: What is the cause of the RTT increase? VM scheduling latency dominates the virtualization overhead! (Figure: Xen's network I/O path, with the device driver in the driver domain (dom0) placing packets into per-VM shared buffers; CDF comparing dom0 processing time with the time packets wait in the buffer, which steps up at 30 ms intervals.) The second question is whether the RTT increase is caused by network device virtualization or by something else. Briefly, in paravirtualized Xen a special domain, the driver domain (dom0), hosts the actual hardware device drivers, including the network driver. When a packet arrives at the NIC, the device driver in dom0 processes it and places it in a buffer shared between the driver domain and the recipient VM. In this experiment we trace packets from the network card all the way to the VM to see where they spend most of their time. The CDF of the driver-domain processing time is almost zero, whereas packets spend most of their time in the shared buffer waiting for the recipient VM to be scheduled; with 3 VMs running on the same core, the CDF jumps at 30 ms intervals, which is again the VM scheduling slice in Xen. This confirms that VM scheduling, not dom0 processing, is the main reason behind the RTT increase. We can also see that the driver domain is scheduled more frequently than the VMs to process outgoing packets.

Q3: What is the impact on TCP throughput? (Plot: TCP sequence number over time for transfers to dom0 (+) and to the VM (x).) The connection to the VM is much slower than the connection to dom0! Now that we have identified that RTT increases and that VM scheduling is the main reason behind it, let's examine how VM scheduling affects TCP throughput. We compare tcpdump traces of a 1 MB file transfer to the driver domain (dom0) and to a guest VM (domU in Xen terminology); the x-axis is time and the y-axis is the sequence number showing the progress of the TCP connection. TCP slow start advances much more quickly for the connection to dom0, which makes sense intuitively because dom0 is scheduled more frequently than domU to process I/O on behalf of the guest VMs.

Our Solution: vSnoop It alleviates the negative effect of VM scheduling on TCP throughput, is implemented within the driver domain to accelerate TCP connections, requires no modifications to the VM, does not violate end-to-end TCP semantics, and is applicable across a wide range of VMMs (Xen, VMware, KVM, etc.).

TCP Connection to a VM (Timeline diagram: the sender's SYN is placed in VM1's buffer by the driver domain and is answered with a SYN,ACK only once VM1 is scheduled, so the scheduling latency of the co-located VMs VM2 and VM3 is added to the RTT.) Before presenting the main idea behind vSnoop's design, here are the steps in the progress of a TCP connection to a VM. We have 3 VMs running on the same core and the sender wants to establish a connection to VM1. The SYN waits in VM1's shared buffer until VM1 is scheduled, and only then is the SYN,ACK returned to the sender.

Key Idea: Acknowledgement Offload (Timeline diagram: with vSnoop, the driver domain acknowledges the sender's packets on behalf of VM1 while the VM is descheduled, so the connection makes faster progress during TCP slow start.) We observed that VM scheduling dominates the RTT of a TCP packet. To address this problem, we propose vSnoop. The key idea behind vSnoop is to offload acknowledgement to the driver domain: the driver domain now acknowledges incoming packets, so by the time the VM is scheduled it finds more packets already waiting in the buffer. This is the main intuition behind vSnoop's design and implementation; vSnoop's presence is completely transparent to both the sender and the receiver VM.
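To make the idea concrete, here is a minimal user-space sketch of the early-acknowledgement decision. All types, field names, and helpers (flow_state, offload_ack, etc.) are hypothetical illustrations, not the actual vSnoop code in dom0.

/* Sketch of acknowledgement offload in the driver domain (hypothetical,
 * not the real vSnoop implementation). dom0 may ACK on the VM's behalf only
 * when the segment is in order and the VM's shared buffer can hold it. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct flow_state {
    uint32_t next_seq;   /* next in-order sequence number expected        */
    uint32_t buf_free;   /* free bytes in the buffer shared with the VM   */
};

struct early_ack {
    uint32_t ack_seq;    /* cumulative ACK to send back to the sender     */
    uint16_t window;     /* advertised window, clamped to buffer space    */
};

/* Returns true if dom0 should generate the ACK itself; false means the
 * packet is simply queued and the VM acknowledges it when scheduled. */
static bool offload_ack(struct flow_state *f, uint32_t seq, uint32_t len,
                        struct early_ack *out)
{
    if (seq != f->next_seq || len > f->buf_free)
        return false;                 /* out of order or no room: stay silent */

    f->next_seq += len;               /* data will sit in the shared buffer   */
    f->buf_free -= len;
    out->ack_seq = f->next_seq;
    out->window  = (uint16_t)(f->buf_free > 65535 ? 65535 : f->buf_free);
    return true;
}

int main(void)
{
    struct flow_state f = { .next_seq = 1000, .buf_free = 8192 };
    struct early_ack ack;
    if (offload_ack(&f, 1000, 1448, &ack))   /* in-order segment arrives */
        printf("early ACK %u, window %u\n",
               (unsigned)ack.ack_seq, (unsigned)ack.window);
    return 0;
}

The buffer-space check is what preserves end-to-end TCP semantics in this sketch: a byte is only acknowledged early if it is already guaranteed to reach the VM.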

vSnoop's Impact on TCP Flows TCP slow start: early acknowledgements help connections progress faster, with the most significant benefit for short transfers, which are prevalent in data centers [Kandula IMC'09], [Benson WREN'09]. TCP congestion avoidance and fast retransmit: large flows in the steady state can also benefit from vSnoop, though not as much as during slow start. A TCP connection has a slow-start phase and a steady-state phase. vSnoop expedites TCP slow start, which is particularly important for small flows that spend their entire lifetime in slow start; many recent studies have shown that such flows constitute the majority of flows in data centers. vSnoop does not hurt large flows that spend most of their time in the steady state, and it can even benefit them, for instance during congestion avoidance and fast retransmit.
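A rough back-of-the-envelope model, my own simplification rather than an equation from the talk, shows why RTT inflation hurts short transfers the most and why early ACKs during slow start matter.

% Idealized slow start (illustrative assumption): the congestion window
% starts at one segment and doubles every RTT, so moving F bytes with
% segment size MSS takes roughly
\[
  T_{\mathrm{transfer}} \;\approx\;
  \mathrm{RTT}\cdot\Big\lceil \log_2\!\big(\tfrac{F}{\mathrm{MSS}}+1\big) \Big\rceil .
\]
% Inflating the RTT from sub-millisecond to tens of milliseconds (VM
% scheduling) inflates a short transfer's completion time by the same
% factor; early ACKs from dom0 restore a small effective RTT in this phase.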

Challenges Challenge 1: Out-of-order and special packets (SYN, FIN). Solution: let the VM handle these packets. Challenge 2: Packet loss after vSnoop has acknowledged. Solution: let vSnoop acknowledge only if there is room in the buffer. Challenge 3: ACKs generated by the VM. Solution: suppress or rewrite VM ACKs for packets already acknowledged by vSnoop. Challenge 4: Throttling the receive window to keep vSnoop online. Solution: adjust the advertised window according to the buffer size. What was just described sounds simple, but many details must be accounted for. When out-of-order packets arrive, vSnoop cannot simply acknowledge them because of the cumulative nature of TCP acknowledgement; to keep the design simple and lightweight, vSnoop does not buffer and reorder these packets but lets the VM handle them as it normally would. This works well in practice because most packets arrive in order in data center environments. Special packets such as SYN and FIN must also be acknowledged by the VM, not vSnoop. The second challenge is that packets cannot be lost en route to the VM after vSnoop has acknowledged them; as the paper explains in detail, several factors collectively prevent such loss, one being that vSnoop acknowledges packets only when there is room in the shared buffer between the VM and the driver domain. The third issue is duplicate ACKs: since the VM is unaware of vSnoop, it still generates ACKs for packets vSnoop has already acknowledged, so vSnoop suppresses empty ACKs corresponding to those packets to keep unnecessary duplicates from reaching the sender. Finally, to comply with TCP semantics, and because the shared buffer is a scarce resource, vSnoop must be judicious about the receive window it advertises in its acknowledgements.
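To illustrate Challenge 3, here is a hypothetical sketch, not the actual vSnoop code, of filtering ACKs on the VM's egress path: pure ACKs that only repeat what dom0 already acknowledged are suppressed, everything else passes through.

/* Hypothetical egress-side ACK filter (Challenge 3); types and names are
 * illustrative, not taken from the vSnoop source. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct vm_packet {
    uint32_t ack_seq;      /* cumulative ACK produced by the guest VM       */
    uint32_t payload_len;  /* 0 means this is a pure ACK                    */
};

/* last_offloaded: highest sequence number already acknowledged by dom0.
 * Returns true if the packet should still be forwarded to the sender. */
static bool forward_vm_packet(const struct vm_packet *p, uint32_t last_offloaded)
{
    /* Sequence-number comparison that tolerates 32-bit wrap-around. */
    bool already_acked = (int32_t)(p->ack_seq - last_offloaded) <= 0;

    if (p->payload_len == 0 && already_acked)
        return false;      /* suppress: duplicate of an ACK dom0 already sent */

    return true;           /* data segments and newer ACKs pass through       */
}

int main(void)
{
    struct vm_packet pure_ack = { .ack_seq = 5000, .payload_len = 0 };
    struct vm_packet data_pkt = { .ack_seq = 5000, .payload_len = 512 };
    printf("%d %d\n",
           forward_vm_packet(&pure_ack, 6000),  /* 0: suppressed duplicate  */
           forward_vm_packet(&data_pkt, 6000)); /* 1: data always forwarded */
    return 0;
}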

State Machine Maintained Per-Flow (Diagram: three states, "Active (online)", "No buffer (offline)", and "Unexpected sequence (offline)", with transitions driven by whether an arriving packet is in order and whether buffer space is available; in the offline states vSnoop does not acknowledge and passes out-of-order packets to the VM.) Early acknowledgements are generated only for in-order packets. The state machine takes care of the special cases just described, namely out-of-order packets and a full shared buffer. vSnoop maintains one state machine per flow or connection; if the buffer is full or packets arrive out of order, vSnoop goes offline and packets are handled normally, as if vSnoop were not present, so the TCP implementation is never made less reliable or more aggressive. vSnoop has a very generic design that can be applied to virtualization platforms other than Xen, such as VMware, VirtualBox, etc.
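The per-flow logic on the slide could be written roughly as follows; this is a sketch with hypothetical names, not the real dom0 code. The flow is online only while packets arrive in order and the shared buffer has room; otherwise vSnoop steps aside.

/* Sketch of the per-flow state machine (hypothetical names, not vSnoop's
 * actual source). Early ACKs are generated only in the ACTIVE state. */
#include <stdbool.h>
#include <stdio.h>

typedef enum {
    FLOW_ACTIVE,        /* online: in-order packet and buffer space available */
    FLOW_NO_BUFFER,     /* offline: shared buffer full, do not acknowledge    */
    FLOW_UNEXPECTED_SEQ /* offline: out-of-order seen, pass packets to the VM */
} flow_mode;

static flow_mode next_mode(bool in_order, bool buffer_space)
{
    if (!in_order)
        return FLOW_UNEXPECTED_SEQ;  /* never early-ACK out-of-order data   */
    if (!buffer_space)
        return FLOW_NO_BUFFER;       /* an early ACK could then be lost     */
    return FLOW_ACTIVE;              /* safe to acknowledge on VM's behalf  */
}

int main(void)
{
    printf("%d %d %d\n",
           next_mode(true, true),    /* -> FLOW_ACTIVE          */
           next_mode(true, false),   /* -> FLOW_NO_BUFFER       */
           next_mode(false, true));  /* -> FLOW_UNEXPECTED_SEQ  */
    return 0;
}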

vSnoop Implementation in Xen (Diagram: VM1-VM3 each with a netfront driver and a shared buffer connected to a netback driver in the driver domain (dom0); vSnoop sits in the Linux bridge of dom0, alongside netfront tuning.) Our vSnoop implementation in paravirtualized Xen only entails changes to the Linux bridge implementation in dom0; we did not make any changes to the Xen hypervisor. This design is very general and can be applied to other VMMs, such as VMware, KVM, etc. We also apply netfront tuning, which enables placing more packets in the shared buffer between the VM and the driver domain.

Evaluation We evaluate the overheads of vSnoop, the TCP throughput speedup, and the application-level speedup with a multi-tier web service (RUBiS) and MPI benchmarks (Intel MPI Benchmark and High-Performance Linpack). We have performed an extensive set of evaluations to study vSnoop; the setup follows.

Evaluation – Setup VM hosts: 3.06 GHz Intel Xeon CPUs, 4 GB RAM, with only one core/CPU enabled; Xen 3.3 with Linux 2.6.18 for the driver domain (dom0) and the guest VMs. Client machine: 2.4 GHz Intel Core 2 Quad CPU, 2 GB RAM, Linux 2.6.19. The machines are connected through a Gigabit Ethernet switch.

vSnoop Overhead Profiling per-packet vSnoop overhead using Xenoprof [Menon VEE'05]. Per-packet CPU overhead for vSnoop routines in dom0 (the multiple-streams case uses 100 concurrent connections):

vSnoop Routine           Single Stream          Multiple Streams
                         Cycles    CPU %        Cycles    CPU %
vSnoop_ingress()         509       3.03         516       3.05
vSnoop_lookup_hash()     74        0.44         91        0.51
vSnoop_build_ack()       52        0.32         -         -
vSnoop_egress()          104       0.61         -         -

Xenoprof supports profiling CPU usage at the fine granularity of individual processes and routines executed in the Xen VMM, the driver domain, and the guest VMs. We use Xenoprof to measure the overhead of the different vSnoop routines in terms of the CPU cycles and percentage they consume, and we additionally instrument the routines to record the number of packets they process, which gives the per-packet cost at a given point in time. The cost incurred by vSnoop is negligible: the sole purpose of the driver domain is to perform I/O on behalf of the guest VMs and it runs no applications, so the roughly 3% aggregate CPU overhead is minimal.

TCP Throughput Improvement 3 VMs consolidated on one core, each with 60% CPU load; 1000 transfers of a 100 KB file from the client to the VM under vanilla Xen (+), Xen+tuning (x), and Xen+tuning+vSnoop (*). (Plot: CDFs of TCP throughput, with throughput on a log-scale x-axis.) Because of the timing of VM scheduling there is a large variation in throughput, so comparing averages across the three scenarios is not meaningful; instead we compare their distributions and medians. The median throughput values are 0.192 MB/s for vanilla Xen, 0.778 MB/s for Xen+tuning, and 6.003 MB/s for Xen+tuning+vSnoop, a roughly 30x improvement. Particularly interesting is that for about 35% of the transfers vSnoop achieves the optimal throughput: all packets are placed in the shared buffer before the VM is scheduled, yielding throughput values that can even exceed the network link rate.

TCP Throughput: 1 VM/Core (Bar chart: normalized median TCP throughput for transfer sizes from 50 KB to 100 MB under Xen, Xen+tuning, and Xen+tuning+vSnoop.) The median TCP throughput for each transfer size is normalized to the value for the Xen+tuning+vSnoop configuration. vSnoop is particularly effective for smaller flows, as they are more susceptible to VM scheduling delays.

TCP Throughput: 2 VMs/Core (Bar chart: normalized median TCP throughput for transfer sizes from 50 KB to 100 MB under Xen, Xen+tuning, and Xen+tuning+vSnoop.)

TCP Throughput: 3 VMs/Core (Bar chart: normalized median TCP throughput for transfer sizes from 50 KB to 100 MB under Xen, Xen+tuning, and Xen+tuning+vSnoop.)

TCP Throughput: 5 VMs/Core (Bar chart: normalized median TCP throughput for transfer sizes from 50 KB to 100 MB under Xen, Xen+tuning, and Xen+tuning+vSnoop.) vSnoop's benefit rises with higher VM consolidation.

TCP Throughput: Other Setup Parameters We also vary the CPU load of the VMs and the number of TCP connections to the VM, place the driver domain on a separate core, and make the sender itself a VM. vSnoop consistently achieves significant TCP throughput improvement.

Application-Level Performance: RUBiS (Setup diagram: RUBiS client threads on a client machine drive a two-tier service; Server1 hosts Apache and Server2 hosts MySQL, each running in dom1 alongside a second guest VM (dom2), with vSnoop in dom0 on both servers.)

RUBiS Results

RUBiS Operation          Count w/o vSnoop   Count w/ vSnoop   % Gain
Browse                   421                505               19.9%
BrowseCategories         288                357               23.9%
SearchItemsInCategory    3498               4747              35.7%
BrowseRegions            128                141               10.1%
ViewItem                 2892               3776              30.5%
ViewUserInfo             732                846               15.6%
ViewBidHistory           339                398               17.4%
Others                   3939               4815              22.2%
Total                    12237              15585             27.4%
Average Throughput       29 req/s           37 req/s          27.5%

Application-Level Performance – MPI Benchmarks Intel MPI Benchmark: network intensive. High-Performance Linpack: CPU intensive. (Setup diagram: four servers, each hosting an MPI node in dom1 plus a second guest VM (dom2) to study the effects of VM consolidation, with vSnoop running in dom0 on every server.)

Intel MPI Benchmark Results: Broadcast (Bar chart: normalized execution time for message sizes from 64 KB to 8 MB under Xen, Xen+tuning, and Xen+tuning+vSnoop; lower is better. Up to 40% improvement.) The execution time of each MPI operation is normalized to the duration of that operation under the vanilla Xen configuration.

Intel MPI Benchmark Results: All-to-All (Bar chart: normalized execution time for message sizes from 64 KB to 8 MB under Xen, Xen+tuning, and Xen+tuning+vSnoop.) The execution time of each MPI operation is normalized to the duration of that operation under the vanilla Xen configuration.

HPL Benchmark Results (Bar chart: Gflops achieved for problem size and block size combinations (N,NB) from (4K,2) to (8K,16), comparing Xen and Xen+tuning+vSnoop.) HPL is a less favorable workload for vSnoop because it is CPU intensive and interleaves computation with communication; even so, vSnoop yields up to a 40% improvement in the achieved Gflops in the best case.

Related Work Optimizing the virtualized I/O path: Menon et al. [USENIX ATC'06,'08; ASPLOS'09]. Improving intra-host VM communication: XenSocket [Middleware'07], XenLoop [HPDC'08], Fido [USENIX ATC'09], XWAY [VEE'08], IVC [SC'07]. I/O-aware VM scheduling: Govindan et al. [VEE'07], DVT [SoCC'10]. vSnoop differs from these existing solutions: we identify VM CPU sharing as a major source of TCP performance degradation, perform a thorough analysis of the problem, and propose a solution based on offloading TCP acknowledgement to the driver domain.

Conclusions Problem: VM consolidation degrades TCP throughput. Solution: vSnoop, which leverages acknowledgement offloading, does not violate end-to-end TCP semantics, is transparent to applications and the OS in the VMs, and is generically applicable to many VMMs. Results: 30x improvement in median TCP throughput, about 30% improvement in the RUBiS benchmark, and 40-50% reduction in execution time for the Intel MPI benchmark.

Thank you. For more information: http://friends.cs.purdue.edu/dokuwiki/doku.php?id=vsnoop or Google "vSnoop Purdue".

TCP Benchmarks (cont.) Testing different scenarios: (a) 10 concurrent connections, (b) the sender also subject to VM scheduling, (c) the driver domain on a separate core. (Three corresponding throughput plots.)

TCP Benchmarks (cont.) Varying the CPU load for 3 consolidated VMs.