CloudMirror: Application-Driven Bandwidth Guarantees in Datacenters


CloudMirror: Application-Driven Bandwidth Guarantees in Datacenters
JK Lee, Yoshio Turner, Myungjin Lee (1), Lucian Popa (2), Sujata Banerjee, Joon-Myung Kang, Puneet Sharma
HP Labs, (1) University of Edinburgh, (2) Databricks
Presented by Jack Clark

Overview: What problem does CloudMirror try to solve? How do other solutions solve the problem? How does CloudMirror work? How well does CloudMirror solve the problem?

The Problem

Clients want predictable performance from their applications: consistent throughput and bounded tail latencies. Modern cloud application performance is highly dependent on network performance.

Underprovisioning => Contention => Unpredictable Performance: Cloud providers want to squeeze as many tenants onto their infrastructure as possible, which leads to underprovisioning of network resources.

Why is Bandwidth So Important? If you have more data to send than can fit in the pipe, you have to wait. This intuition is backed up by Parley, which demonstrates that adequate bandwidth is critical for low tail latencies.

Idea: Provide bandwidth guarantees to clients. Clients are happy because they get more predictable performance. Cloud providers are happy because clients are more confident about moving to the cloud (and providers get a new dimension for billing).

Existing Solutions

Pipe Model: Specify a bandwidth requirement between every pair of VMs. This makes efficient VM placement slow, ~O(n^3). It also makes it difficult for clients to specify their requirements. Peak bandwidth must be provisioned for each VM.
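To make the specification burden concrete, the following minimal sketch counts how many guarantees a tenant would have to state under the pipe model, compared with one aggregate value per VM under a hose-style model. The function names are hypothetical, and counting directed pairs is an assumption; the slide does not say whether pairs are directed.

```python
def pipe_model_guarantees(num_vms: int) -> int:
    """Directed VM pairs, each needing its own guarantee (assumed directed)."""
    return num_vms * (num_vms - 1)


def hose_model_guarantees(num_vms: int) -> int:
    """One aggregate guarantee per VM."""
    return num_vms


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, pipe_model_guarantees(n), hose_model_guarantees(n))
```

The pairwise count grows quadratically with the number of VMs, which is why per-pair specification quickly becomes impractical for large tenants.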

Hose Model, Problem #1: The hose model fails to guarantee bandwidth in the case of congestion! TCP-like fair allocation would split the bandwidth 300:200 instead of 400:100.
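The figure behind this slide is not part of the transcript, so the scenario below is purely an assumption chosen to reproduce the 300:200 split: a 500 Mbps hose shared by five equal-rate flows, three from one source tier and two from another. It is a sketch of per-flow fair sharing, not the paper's actual setup; the tier names and flow counts are hypothetical.

```python
def per_flow_fair_share(capacity_mbps, flows_per_source):
    """Split capacity equally per flow, then total the result per source."""
    total_flows = sum(flows_per_source.values())
    share = capacity_mbps / total_flows
    return {src: share * count for src, count in flows_per_source.items()}


# Hypothetical scenario: 3 flows from tier_x, 2 flows from tier_y, 500 Mbps hose.
allocation = per_flow_fair_share(500, {"tier_x": 3, "tier_y": 2})
print(allocation)
```

Under these assumptions tier_x receives 300 Mbps and tier_y 200 Mbps, regardless of whether the application actually needs 400 and 100. A single per-VM hose value cannot express that distinction.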

Problem #2: The hose model would provision 2x the bandwidth actually required on L2.

CloudMirror: A Tenant Application Graph (TAG) for tenants to specify the bandwidth guarantees they desire, and a VM placement algorithm for the cloud provider to efficiently fill tenant requests.

Tenant Application Graph (TAG): The TAG model allows users to specify their desired bandwidth guarantees in terms of the communication structure of their application.
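One way to picture such a graph is sketched below. The representation is an assumption for illustration: tiers are vertices with a VM count, and each directed edge carries a per-VM send guarantee and a per-VM receive guarantee. The class and method names are hypothetical, and the aggregate formula (the smaller of the two sides) is also an assumption, since the slide does not define it.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class TenantApplicationGraph:
    """Hypothetical TAG sketch: tiers as vertices, guarantees on directed edges."""
    tier_sizes: Dict[str, int] = field(default_factory=dict)
    # (src, dst) -> (per-VM send Mbps at src, per-VM receive Mbps at dst)
    edges: Dict[Tuple[str, str], Tuple[float, float]] = field(default_factory=dict)

    def add_tier(self, name: str, num_vms: int) -> None:
        self.tier_sizes[name] = num_vms

    def add_edge(self, src: str, dst: str, send_mbps: float, recv_mbps: float) -> None:
        self.edges[(src, dst)] = (send_mbps, recv_mbps)

    def aggregate_guarantee(self, src: str, dst: str) -> float:
        """Tier-to-tier total, assumed to be limited by the smaller side."""
        send, recv = self.edges[(src, dst)]
        return min(send * self.tier_sizes[src], recv * self.tier_sizes[dst])


tag = TenantApplicationGraph()
tag.add_tier("web", 4)
tag.add_tier("db", 2)
tag.add_edge("web", "db", send_mbps=100, recv_mbps=150)
print(tag.aggregate_guarantee("web", "db"))
```

The tenant describes only tiers and the traffic between them, so the size of the specification depends on the number of tiers rather than the number of VMs.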

VM Placement Algorithm: Goal: deploy as many tenants as possible onto the topology while maintaining bandwidth guarantees. This is NP-hard. Insights/heuristics: 1. Core bandwidth can be saved if more than 1/2 of the VMs from communicating tiers are placed together (collocate function). 2. If no bandwidth saving is possible, collocate high- and low-bandwidth tiers (balance function).
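The first insight can be illustrated with the usual hose-model reservation rule, under the assumption that the VMs of one tier talk to each other with a per-VM guarantee: the uplink of a subtree only needs to carry what the smaller side can send or receive. This is a sketch of that rule, not CloudMirror's collocate function, and the function name is hypothetical.

```python
def hose_uplink_reservation(tier_size, vms_inside, per_vm_mbps):
    """Bandwidth to reserve on a subtree uplink: the smaller side bounds the traffic."""
    vms_outside = tier_size - vms_inside
    return min(vms_inside, vms_outside) * per_vm_mbps


if __name__ == "__main__":
    # Hypothetical tier of 10 VMs with a 100 Mbps guarantee each.
    for inside in (2, 5, 6, 8, 10):
        print(inside, hose_uplink_reservation(10, inside, 100))
```

The reservation peaks when exactly half of the VMs sit inside the subtree and shrinks as more than half are placed together, which is the saving the collocate heuristic aims for.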

Surely This is Slow... Time complexity is O(T^2): it scales with the number of tiers rather than the number of VMs. Runs in < 1 sec on bing.com data.

High Availability: Worst Case Survivability (WCS) = fraction of VMs in a tier that remain alive during the failure of a subtree. A huge assumption is made that this should be at the server level! The authors claim that because core switches are usually fault tolerant, it is OK to ensure WCS only across individual servers. However, most outages in a modern data center are caused not by switch failures but by operator error, e.g. bad configuration.
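The WCS definition above can be computed directly from a placement. The sketch below assumes the failure domain is a single server, as the slide describes, and takes a hypothetical mapping from server name to the number of the tier's VMs placed on it.

```python
def worst_case_survivability(vms_per_server):
    """Fraction of a tier's VMs that survive the worst single-server failure."""
    total = sum(vms_per_server.values())
    if total == 0:
        raise ValueError("tier has no VMs placed")
    # The worst case is losing the server that hosts the most VMs of the tier.
    return 1 - max(vms_per_server.values()) / total


if __name__ == "__main__":
    print(worst_case_survivability({"s1": 2, "s2": 1, "s3": 1}))
    print(worst_case_survivability({"s1": 4}))
```

Spreading a tier across more servers raises WCS, while packing it onto one server drives WCS to zero. This pulls against the collocation heuristic used to save bandwidth.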

Evaluation: Uses the bing.com workload on a 3-level tree topology with 2048 hosts and 25 VM slots per host. 1. For a given number of requests, how much bandwidth does CloudMirror require compared to rival solutions?

Evaluation: 2. For a given amount of bandwidth, how many tenant requests can CloudMirror accept compared to rival solutions?

Criticisms and Future Work: How can this be integrated with other resources such as CPU and memory? Would this actually work for HA? WCS assumes that failures will be simple mechanical failures of switches. What if a bad network configuration causes a problem? It would have been nice to see the authors actually run applications on their system and verify that the bandwidth guarantees are honoured. Pricing model.

Questions?