Towards Reliable Application Deployment in the Cloud


Towards Reliable Application Deployment in the Cloud. Ruichuan Chen. Joint work with Istemi Ekin Akkus, Bimal Viswanath, Ivica Rimac, and Volker Hilt.

How can an application be deployed reliably in the cloud today? Applications are moving from self-maintained infrastructure to the cloud. How is high reliability achieved in the cloud? Redundancy! [Figure: one App instance in each of Region 1, Region 2, ..., Region N, above a layer labeled Common Infrastructure (Storage, Power Supply, etc.).]

Cloud service outages are still common

Existing efforts. Cloud providers detect, localize, or tolerate failures via diagnosis systems (e.g., NSDMiner, Orion, Sherlock, Sieve) and fault-tolerant systems (e.g., F10, NetPilot). These efforts address the problem only after an outage occurs: they require human intervention and lead to prolonged failure recovery.

Our proposal: reCloud. reCloud takes proactive action to prevent cloud service outages. It enables a cloud provider to deploy applications with a user-specified reliability level; works with complex applications such as micro-service applications; balances reliability, application performance, and resource utilization; and achieves all of the above with no changes to existing cloud infrastructure.

reCloud workflow. [Figure: the user supplies reliability requirements (e.g., three 9s, 2-of-3 redundancy, etc.). Dependency acquisition collects topology, failure probabilities, etc. from the cloud system into a dependency DB. Inside the reCloud system: generate an initial deployment plan, assess its reliability, and check whether the user requirements are met. If no, evolve a new deployment plan and assess again; if yes, pass the plan to the application deployment engine of the cloud system.]

Step 0: specify reliability requirements. The user specifies four parameters. N: the total number of application instances to be deployed. K: the minimum number of deployed instances that must be alive. Rdesired: the desired reliability score, i.e., the probability that at least K out of N deployed instances are alive. Tmax: the maximum time to search for a reliable deployment plan.
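The four inputs can be captured in a small record. The following is a minimal sketch, not reCloud's actual interface: the class name, the field names, the validation rules, and the choice of seconds as the unit of Tmax are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReliabilityRequirements:
    """User-supplied inputs from Step 0 (names are illustrative)."""
    n: int            # N: total number of application instances to deploy
    k: int            # K: minimum number of instances that must be alive
    r_desired: float  # Rdesired: target probability that >= K of N are alive
    t_max: float      # Tmax: search time budget (seconds assumed here)

    def __post_init__(self):
        if not 1 <= self.k <= self.n:
            raise ValueError("need 1 <= K <= N")
        if not 0.0 < self.r_desired <= 1.0:
            raise ValueError("Rdesired must be a probability in (0, 1]")
        if self.t_max <= 0:
            raise ValueError("Tmax must be positive")


# 2-of-3 redundancy at "three 9s", with a 30-second search budget.
req = ReliabilityRequirements(n=3, k=2, r_desired=0.999, t_max=30.0)
```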

Step 0’: acquire dependency information. There are three types of infrastructure components: hardware, software, and network components. Cloud providers normally use cloud management platforms to monitor the topology among the components and to measure the failure probability of each component. Example: a cloud data center organized as a fat-tree. Switches and hosts may share additional common dependencies. [Figure: fat-tree connected to the Internet, with core switches, Agg switches, edge switches, and hosts.]

Step 1: generate initial deployment plan. reCloud generates an initial deployment plan by placing application instances onto random hosts. A deployment plan is a choice of hosts on which to deploy the application instances. Example: the user requires 1-of-2 redundancy. [Figure: fat-tree of core, Agg, and edge switches and hosts, with the hosts chosen for deployment highlighted.]
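Random initial placement is simple enough to sketch in a few lines. The function name and host labels below are hypothetical, and the sketch assumes one instance per host (distinct hosts), which the slide does not state explicitly.

```python
import random


def initial_plan(hosts, n, rng=random):
    """Pick n distinct hosts uniformly at random (one instance per host)."""
    if n > len(hosts):
        raise ValueError("more instances than hosts")
    return tuple(sorted(rng.sample(hosts, n)))


hosts = [f"host{i}" for i in range(8)]
# 1-of-2 redundancy means N = 2 instances are placed.
plan = initial_plan(hosts, n=2, rng=random.Random(0))
```

Passing a seeded `random.Random` makes the starting plan reproducible, which helps when comparing search runs.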

Step 2: assess reliability of a deployment plan. Fix the application’s deployment plan. Then generate failure states for all infrastructure components based on their failure probabilities. [Figure: fat-tree with the hosts for deployment highlighted and the failed switches and hosts marked.]

Step 2: assess reliability of a deployment plan. Test reliability in the generated topology with its failed components, taking into account the routing protocol and the user-specified K-of-N redundancy. In the pictured round, one deployment host is reachable and the other is unreachable, so this deployment plan is considered reliable because the user requires 1-of-2 redundancy. Generate component failure states for X rounds and test reliability in each round. If the deployment plan is considered reliable in Y of these rounds, its reliability score is Y/X.
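The Y/X score can be sketched as a plain sampling loop. This is a simplified illustration, not reCloud's implementation: the toy topology, the component names, and the 0.1 failure probabilities are made up, and routing is reduced to the assumption that each host has a single fixed path of switches it depends on.

```python
import random


def assess(plan, deps, fail_prob, k, rounds, rng):
    """Estimate reliability as Y/X: the fraction of rounds in which at
    least k instances of `plan` are reachable."""
    reliable_rounds = 0
    for _ in range(rounds):
        # Draw one failure state per component for this round.
        failed = {c for c, p in fail_prob.items() if rng.random() < p}
        # An instance is alive if its host and every switch on its path are up.
        alive = sum(1 for host in plan
                    if host not in failed and not (deps[host] & failed))
        if alive >= k:
            reliable_rounds += 1
    return reliable_rounds / rounds


# Hypothetical toy topology: each host lists the switches on its upward path.
deps = {
    "h1": {"edge1", "agg1", "core"},
    "h2": {"edge1", "agg1", "core"},
    "h3": {"edge2", "agg2", "core"},
}
components = set(deps) | set().union(*deps.values())
fail_prob = {c: 0.1 for c in sorted(components)}

rng = random.Random(42)
same_rack = assess(("h1", "h2"), deps, fail_prob, k=1, rounds=20000, rng=rng)
spread = assess(("h1", "h3"), deps, fail_prob, k=1, rounds=20000, rng=rng)
```

In this toy setup the plan that spreads its two instances across different edge and Agg switches should score noticeably higher than the plan that puts both under the same edge switch, because fewer single failures take out both instances at once.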

Step 2: assess reliability of a deployment plan. Failure states must be generated for every component in every round, which is quite expensive. reCloud therefore uses dagger sampling to generate failure states. Example: a component fails with probability 0.2, meaning 1 failure every 5 rounds on average. Monte-Carlo sampling generates 5 random numbers to produce failure states for 5 rounds. Dagger sampling generates only 1 random integer in [1,5] to decide in which of the 5 rounds the component fails.
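A minimal sketch of dagger sampling for a single component follows. It uses the standard formulation with one uniform draw per block of floor(1/p) rounds, which reduces to the slide's "one integer in [1,5]" example when p = 0.2; the function name and the 0-based round indices are choices made here, not taken from reCloud.

```python
import math
import random


def dagger_failures(p, rounds, rng):
    """Return the 0-based round indices in which a component with per-round
    failure probability p fails, using one random draw per block of rounds."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    block = math.floor(1.0 / p)   # rounds covered by a single draw
    failures = set()
    for start in range(0, rounds, block):
        u = rng.random()          # the only random number for this block
        j = int(u / p)            # index of the length-p slice of [0, 1) hit
        # Slice j < block means "fail in round j of this block"; landing in
        # the leftover part of [0, 1) means no failure in this block.
        if j < block and start + j < rounds:
            failures.add(start + j)
    return failures


# p = 0.2: 50 rounds need only 10 draws, versus 50 for per-round sampling.
fails = dagger_failures(0.2, 50, random.Random(1))
```

Each round still fails with marginal probability p, but the failures of one component within a block are no longer independent of each other, which is the trade-off that buys the reduction in random draws.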

Step 3: search for reliable deployment plan. The number of potential deployment plans is huge. reCloud uses simulated annealing to search for a reliable one: it evolves new deployment plans and accepts not only more reliable plans but also, with some probability, less reliable ones. The search ends when it finds a deployment plan that satisfies the user-specified reliability, or when it times out.
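The search loop can be sketched as follows. The slides do not give the neighbor move, the acceptance formula, or the cooling schedule, so the choices below (move one instance to an unused host, accept a worse plan with probability exp(delta/T), geometric cooling) are textbook simulated-annealing assumptions; the rack-diversity `score` is a toy stand-in for the reliability assessment of Step 2.

```python
import math
import random
import time


def search(hosts, n, score, r_desired, t_max, rng, temp=0.1, cooling=0.99):
    """Simulated annealing over deployment plans (lists of n distinct hosts)."""
    if n >= len(hosts):
        raise ValueError("need at least one unused host to evolve a plan")
    current = rng.sample(hosts, n)
    current_score = score(current)
    best, best_score = list(current), current_score
    deadline = time.monotonic() + t_max
    # Stop on success (Rdesired reached) or on time-out (Tmax elapsed).
    while best_score < r_desired and time.monotonic() < deadline:
        # Evolve: move one instance to a host not already in the plan.
        candidate = list(current)
        unused = [h for h in hosts if h not in current]
        candidate[rng.randrange(n)] = rng.choice(unused)
        cand_score = score(candidate)
        delta = cand_score - current_score
        # Always accept improvements; accept regressions with prob e^(delta/T).
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current, current_score = candidate, cand_score
            if current_score > best_score:
                best, best_score = list(current), current_score
        temp = max(temp * cooling, 1e-6)
    return best, best_score


# Toy setup: hosts are (rack, index) pairs; the score rewards rack diversity.
hosts = [(rack, i) for rack in range(4) for i in range(4)]
score = lambda plan: len({rack for rack, _ in plan}) / len(plan)
best, best_score = search(hosts, 3, score, r_desired=1.0, t_max=2.0,
                          rng=random.Random(3))
```

Accepting occasional regressions lets the search leave plans that look good locally but cannot be improved by a single move.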

Step 3: search for reliable deployment plan. Cloud data centers are normally designed to be symmetric. reCloud uses a network-transformation technique to check whether multiple deployment plans are equivalent. A plan that is equivalent to one already assessed does not need to be assessed again.
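One way to exploit symmetry is to reduce each plan to a canonical signature and assess each signature only once. The sketch below is an illustration of that idea, not the network-transformation technique from the paper. It assumes hosts are labeled (pod, edge, host), that all components at the same level have identical failure probabilities, and that pods, edge switches within a pod, and hosts under an edge switch are interchangeable.

```python
from collections import defaultdict


def signature(plan):
    """Canonical form of a plan whose hosts are (pod, edge, host) tuples:
    for each used pod, the sorted instance counts per edge switch."""
    per_edge = defaultdict(int)
    for pod, edge, _host in plan:
        per_edge[(pod, edge)] += 1
    per_pod = defaultdict(list)
    for (pod, _edge), count in per_edge.items():
        per_pod[pod].append(count)
    return tuple(sorted(tuple(sorted(counts)) for counts in per_pod.values()))


def assess_cached(plan, assess, cache):
    """Reuse the score of any previously assessed equivalent plan."""
    key = signature(plan)
    if key not in cache:
        cache[key] = assess(plan)
    return cache[key]


same_edge_a = [(0, 0, 0), (0, 0, 1)]   # both instances under one edge switch
same_edge_b = [(2, 1, 3), (2, 1, 0)]   # same shape, different pod and edge
same_pod = [(0, 0, 0), (0, 1, 0)]      # one pod, two edge switches
two_pods = [(0, 0, 0), (1, 0, 0)]      # two different pods
```

Here `same_edge_a` and `same_edge_b` share a signature and would be assessed once, while `same_pod` and `two_pods` each get their own.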

reCloud workflow (recap). The user's reliability requirements (e.g., three 9s, 2-of-3 redundancy, etc.) and the dependency data acquired from the cloud system (topology, failure probabilities, etc.) feed the reCloud system, which generates an initial deployment plan, assesses its reliability, and evolves new deployment plans until the user requirements are met. The resulting plan goes to the application deployment engine of the cloud system.

Evaluation. We have implemented a functional prototype (~5.3K lines of Java code). We evaluate reCloud on 4 data center topologies, ranging from tiny scale to large scale.

Evaluation. How efficient is dagger sampling at generating failure states for components? It is 1 to 2 orders of magnitude faster than Monte-Carlo sampling.

Evaluation. How efficient is reCloud at assessing a given deployment plan? An assessment takes ~270ms even in a large-scale data center, and the redundancy level does not affect performance significantly.

Evaluation. How efficient is reCloud at searching for a reliable deployment plan? In a large-scale data center with 27K hosts, it needs only 30 seconds to find a deployment plan that is at least 10X more reliable than the current practice (CP). Example: to achieve 4-of-5 redundancy, CP finds a deployment plan with 99.62% reliability (i.e., 33.3 hours of downtime per year), whereas reCloud finds a deployment plan with 99.97% reliability (i.e., 2.6 hours of downtime per year) within 30 seconds.
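The downtime figures follow directly from the reliability percentages. A quick check, assuming a 365-day year (the helper name is illustrative):

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours(reliability):
    """Expected hours per year during which the K-of-N requirement is unmet."""
    return (1.0 - reliability) * HOURS_PER_YEAR


cp = downtime_hours(0.9962)        # about 33.3 hours per year
recloud = downtime_hours(0.9997)   # about 2.6 hours per year
ratio = cp / recloud               # about 12.7, consistent with "at least 10X"
```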

Summary. reCloud finds a reliable deployment plan for an application that fulfills the user’s requirements before the application gets deployed. It uses dagger sampling to generate failures when assessing the reliability of a given deployment plan, simulated annealing to explore the huge space of potential deployment plans, and network transformations to check the equivalence of different deployment plans. reCloud can also work with complex applications such as micro-service applications; balance reliability, application performance, and resource utilization; and achieve all of the above with no changes to existing cloud infrastructure. Please refer to the paper 