CloudMirror: Application-Driven Bandwidth Guarantees in Datacenters

CloudMirror: Application-Driven Bandwidth Guarantees in Datacenters
JK Lee, Yoshio Turner, Myungjin Lee [1], Lucian Popa [2], Sujata Banerjee, Joon-Myung Kang, Puneet Sharma
HP Labs, [1] University of Edinburgh, [2] Databricks
Presented by Jack Clark

Overview
What problem does CloudMirror try to solve?
How do other solutions solve the problem?
How does CloudMirror work?
How well does CloudMirror solve the problem?

The Problem

Clients want predictable performance from their applications: consistent throughput and bounded tail latencies. Modern cloud application performance is highly dependent on network performance.

Cloud providers want to squeeze as many tenants onto their infrastructure as possible, which leads to underprovisioning of network resources.
Underprovisioning => Contention => Unpredictable Performance

Why is Bandwidth So Important?
If you have more data to send than can fit in the pipe, you are going to have to wait. This intuition is backed up by Parley, which demonstrates that adequate bandwidth is critical for low tail latencies.

Idea
Provide bandwidth guarantees to clients. Clients are happy because they get more predictable performance. Cloud providers are happy because clients are more confident about moving to the cloud (and they get a new dimension for billing).

Existing Solutions

Pipe Model
Specify a bandwidth requirement between every pair of VMs. This makes efficient VM placement slow, roughly O(n^3). It also makes it difficult for the client to specify requirements, and peak bandwidth must be provisioned for each VM.
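To make the specification burden concrete, here is a minimal sketch that counts how many bandwidth values a tenant would have to supply under a pipe model compared with a model that needs only one value per VM. The function names are hypothetical, and the pipe count assumes one value per ordered VM pair.

```python
def pipe_entries(n_vms: int) -> int:
    """Bandwidth values needed when every ordered VM pair gets its own pipe."""
    return n_vms * (n_vms - 1)


def per_vm_entries(n_vms: int) -> int:
    """Bandwidth values needed when each VM carries a single aggregate value."""
    return n_vms


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(f"{n} VMs: {pipe_entries(n)} pipe values, {per_vm_entries(n)} per-VM values")
```

The quadratic growth in the pipe count is one way to see why per-pair specification becomes impractical for large tenants.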

Hose Model
Problem #1: The hose model fails to guarantee bandwidth in the case of congestion! TCP-like fair allocation would split the bandwidth 300:200 instead of 400:100.

Problem #2: The hose model would provision 2x the bandwidth actually required on L2.

CloudMirror
Tenant Application Graph (TAG): for tenants to specify the bandwidth guarantees they desire.
VM placement algorithm: for the cloud provider to efficiently fill tenant requests.

Tenant Application Graph (TAG)
The TAG model allows users to specify their desired bandwidth guarantees in terms of the communication structure of their application.
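One way to picture a TAG is as a small directed graph whose vertices are application tiers and whose edges carry bandwidth guarantees. The sketch below is illustrative only: the class and field names are hypothetical, and the per-VM send/receive labels and the `min(...)` aggregate are assumptions based on a reading of the CloudMirror paper, not something these slides spell out.

```python
from dataclasses import dataclass, field


@dataclass
class Edge:
    src: str      # sending tier
    dst: str      # receiving tier
    send: float   # assumed per-VM sending guarantee for VMs in src (Mbps)
    recv: float   # assumed per-VM receiving guarantee for VMs in dst (Mbps)


@dataclass
class TAG:
    tiers: dict = field(default_factory=dict)   # tier name -> number of VMs
    edges: list = field(default_factory=list)

    def add_tier(self, name: str, size: int) -> None:
        self.tiers[name] = size

    def add_edge(self, src: str, dst: str, send: float, recv: float) -> Edge:
        if src not in self.tiers or dst not in self.tiers:
            raise KeyError("both tiers must be added before connecting them")
        edge = Edge(src, dst, send, recv)
        self.edges.append(edge)
        return edge

    def aggregate_guarantee(self, edge: Edge) -> float:
        """Total tier-to-tier bandwidth, limited by the smaller side (assumption)."""
        return min(self.tiers[edge.src] * edge.send,
                   self.tiers[edge.dst] * edge.recv)


tag = TAG()
tag.add_tier("web", 4)
tag.add_tier("db", 2)
web_to_db = tag.add_edge("web", "db", send=100.0, recv=150.0)
print(tag.aggregate_guarantee(web_to_db))
```

Here the tenant describes two tiers and one edge instead of a value for every VM pair, which is the sense in which the specification follows the application's communication structure.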

VM Placement Algorithm
Goal: deploy as many tenants as possible onto the topology while maintaining bandwidth guarantees. This is NP-hard.
Insights/heuristics:
1. Core bandwidth can be saved by placing more than ½ of the VMs from communicating tiers together (collocate function).
2. If no bandwidth saving is possible, collocate high- and low-bandwidth tiers (balance function).
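The "more than ½" insight can be illustrated with a simplified calculation. The sketch below is not CloudMirror's placement algorithm. It assumes a single tier whose VMs communicate among themselves with a uniform per-VM guarantee, and uses the usual hose-style accounting in which the traffic crossing a subtree's uplink is limited by the smaller of the two groups. The function name and parameters are hypothetical.

```python
def uplink_bandwidth(inside: int, total: int, per_vm_bw: float) -> float:
    """Bandwidth to reserve on a subtree's uplink under the stated assumptions.

    inside    -- VMs of the tier placed within the subtree
    total     -- VMs in the tier overall
    per_vm_bw -- per-VM bandwidth guarantee
    """
    if not 0 <= inside <= total:
        raise ValueError("inside must be between 0 and total")
    outside = total - inside
    # Traffic across the uplink cannot exceed what the smaller group can send or receive.
    return min(inside, outside) * per_vm_bw


if __name__ == "__main__":
    for k in range(0, 11, 2):
        print(k, uplink_bandwidth(k, 10, 100.0))
```

Under these assumptions the reservation peaks when exactly half of the VMs are inside the subtree and shrinks as more are collocated, which is why crossing the halfway point starts to save bandwidth higher in the tree.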

Surely This is Slow...
Time complexity is O(T^2): it scales with the number of tiers T rather than the number of VMs. Runs in < 1 sec on bing.com data.

High Availability
Worst Case Survivability (WCS) = fraction of VMs in a tier that remain alive during the failure of a subtree. A huge assumption is made that this should be at the server level! The authors claim that because core switches are usually fault tolerant, it is OK to ensure WCS only amongst individual servers. However, most outages in a modern data center are caused not by switch failures but by operator error, e.g. bad configuration.
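The WCS definition above can be turned into a short calculation. This is a minimal sketch assuming server-level failure domains, as the slide describes, and a placement given as a mapping from server name to the number of the tier's VMs on that server. The function name and input format are hypothetical.

```python
def worst_case_survivability(vms_per_server: dict) -> float:
    """Fraction of a tier's VMs that survive the worst single-server failure."""
    total = sum(vms_per_server.values())
    if total == 0:
        raise ValueError("the tier has no VMs placed")
    # The worst case is losing the server that hosts the most VMs of this tier.
    worst_loss = max(vms_per_server.values())
    return (total - worst_loss) / total


if __name__ == "__main__":
    print(worst_case_survivability({"s1": 2, "s2": 1, "s3": 1}))
```

Spreading a tier across more servers raises WCS, which pulls in the opposite direction from collocating VMs to save bandwidth.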

Evaluation
Uses the bing.com workload on a 3-level tree topology with 2048 hosts and 25 VM slots per host.
1. For a given number of requests, how much bandwidth does CloudMirror require compared to rival solutions?

Evaluation
2. For a given amount of bandwidth, how many tenant requests can CloudMirror accept compared to rival solutions?

Criticisms and Future Work
How to integrate this with other resources such as CPU and memory?
Would this actually work for HA? WCS assumes that failures will be simple mechanical failures of switches. What if a bad network configuration causes a problem?
It would have been nice to see them actually run applications on their system and verify that the bandwidth guarantees are honoured.
Pricing model.

Questions?