Published by Joleen Curtis. Modified over 6 years ago.
1
Towards Reliable Application Deployment in the Cloud
Ruichuan Chen
Joint work with Istemi Ekin Akkus, Bimal Viswanath, Ivica Rimac, Volker Hilt
2
How can an application be deployed reliably in the cloud today?
Applications are moving from self-maintained infrastructure to the cloud.
How can they achieve high reliability in the cloud? Redundancy!
[Figure: application instances deployed in Region 1, Region 2, ..., Region N, all resting on common infrastructure (storage, power supply, etc.).]
3
Cloud service outages are still common
4
Existing efforts
Cloud providers detect, localize, or tolerate failures via:
Diagnosis systems, e.g., NSDMiner, Orion, Sherlock, Sieve.
Fault-tolerant systems, e.g., F10, NetPilot.
These efforts address the problem only after an outage occurs: they require human intervention and lead to prolonged failure recovery.
5
Our proposal -- reCloud
reCloud takes proactive actions to prevent cloud service outages. It:
Enables cloud providers to deploy applications with a user-specified reliability level.
Works with complex applications such as micro-service applications.
Balances reliability, application performance, and resource utilization.
Achieves all of the above with no changes to existing cloud infrastructure.
6
reCloud workflow
[Figure: reCloud workflow. The user supplies reliability requirements (e.g., three 9s, 2-of-3 redundancy, etc.). Dependency acquisition collects topology, failure probabilities, etc. from the cloud system into a dependency DB. The reCloud system generates an initial deployment plan, assesses its reliability, and checks whether the user requirements are met. If no, it evolves a new deployment plan and assesses again; if yes, the plan is passed to the application deployment engine in the cloud system.]
7
Step 0: specify reliability requirements
The user specifies the reliability requirements:
N: total number of application instances to be deployed.
K: minimum number of deployed instances that must be alive.
Rdesired: desired reliability score, i.e., the probability that at least K out of N deployed instances are alive.
Tmax: maximum time to search for a reliable deployment plan.
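The four parameters above could be captured in a small record. This is only an illustrative sketch: the class name, the field names, and the choice of seconds as the unit for Tmax are assumptions, not part of reCloud.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReliabilityRequirements:
    """Hypothetical container for the user's Step 0 input."""
    n: int            # N: total number of instances to deploy
    k: int            # K: minimum number of instances that must be alive
    r_desired: float  # Rdesired: P(at least K of N instances are alive)
    t_max_s: float    # Tmax: search time budget (seconds is an assumed unit)

    def __post_init__(self):
        # Reject combinations that cannot describe a K-of-N requirement.
        if not 1 <= self.k <= self.n:
            raise ValueError("need 1 <= K <= N")
        if not 0.0 < self.r_desired <= 1.0:
            raise ValueError("Rdesired must be in (0, 1]")
        if self.t_max_s <= 0:
            raise ValueError("Tmax must be positive")


# "Three 9s" with 2-of-3 redundancy and a 30-second search budget.
req = ReliabilityRequirements(n=3, k=2, r_desired=0.999, t_max_s=30.0)
```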
8
Step 0’: acquire dependency information
Three types of infrastructure components: hardware, software, and network components.
Cloud providers normally use cloud management platforms to:
Monitor the topology among various components.
Measure the failure probability of various components.
Example: a cloud data center organized as a fat-tree. Switches and hosts may share additional common dependencies.
[Figure: fat-tree with core switches, aggregation (Agg) switches, edge switches, and hosts, connected to the Internet.]
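One possible way to hold the acquired dependency information is a map from each component to the component it depends on, plus a map of failure probabilities. The sketch below is a deliberate simplification: it builds a single-rooted tree (a real fat-tree has multiple core and aggregation switches and redundant paths), and the function names, component names, and probabilities are all made up for illustration.

```python
def build_tree(pods=2, edges_per_pod=2, hosts_per_edge=2,
               p_switch=0.01, p_host=0.02):
    """Build a toy core -> agg -> edge -> host tree (not a full fat-tree)."""
    parent = {"core": None}         # component -> upstream component
    fail_prob = {"core": p_switch}  # component -> assumed failure probability
    for p in range(pods):
        agg = f"agg{p}"
        parent[agg], fail_prob[agg] = "core", p_switch
        for e in range(edges_per_pod):
            edge = f"edge{p}.{e}"
            parent[edge], fail_prob[edge] = agg, p_switch
            for h in range(hosts_per_edge):
                host = f"host{p}.{e}.{h}"
                parent[host], fail_prob[host] = edge, p_host
    return parent, fail_prob


def path_to_core(component, parent):
    """List the component and everything it depends on, up to the core."""
    path = []
    while component is not None:
        path.append(component)
        component = parent[component]
    return path


parent, fail_prob = build_tree()
```

Shared dependencies such as a common power supply could be added to such a structure as extra components that several switches or hosts point to.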
9
Step 1: generate initial deployment plan
reCloud generates an initial deployment plan by placing application instances onto random hosts.
A deployment plan is a choice of hosts on which to deploy application instances.
Example: the user requires 1-of-2 redundancy.
[Figure: fat-tree (core, aggregation, and edge switches; hosts) with the hosts chosen for deployment highlighted.]
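Random placement can be sketched in a few lines. The function name and host labels are hypothetical, and the sketch assumes that each instance goes on a distinct host, which the slide does not state explicitly.

```python
import random


def initial_plan(hosts, n, rng=None):
    """Pick n distinct hosts uniformly at random as the starting plan."""
    if rng is None:
        rng = random.Random()
    return tuple(sorted(rng.sample(hosts, n)))


hosts = [f"host{i}" for i in range(8)]
plan = initial_plan(hosts, 2, random.Random(0))  # 1-of-2 redundancy: N = 2
```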
10
Step 2: assess reliability of a deployment plan
Fix the application's deployment plan.
Generate failure states for all infrastructure components based on their failure probabilities.
[Figure: the fat-tree with the hosts chosen for deployment and the failed switches and hosts marked.]
11
Step 2: assess reliability of a deployment plan
Test reliability in the generated topology with failed components.
Take the routing protocol and the user-specified K-of-N redundancy into account.
In the pictured round, one deployment host is reachable and the other is unreachable; the deployment plan is still considered reliable because the user requires only 1-of-2 redundancy.
Generate component failure states for X rounds, and test reliability in each round.
If the deployment plan is considered reliable in Y rounds, then its reliability score is Y/X.
[Figure: the fat-tree with failed switches and hosts marked, and the deployment hosts labeled reachable or unreachable.]
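The Y/X score can be illustrated with plain Monte-Carlo sampling on a toy topology. Everything in this sketch is an assumption made for illustration: the topology, the uniform failure probability of 0.05, and the reachability rule (a host counts as reachable if it and every component on its single path to the core are alive), which stands in for a real routing protocol.

```python
import random

# Hypothetical toy topology: component -> upstream component it depends on.
PARENT = {"core": None, "agg0": "core", "agg1": "core",
          "edge0": "agg0", "edge1": "agg1",
          "h0": "edge0", "h1": "edge0", "h2": "edge1", "h3": "edge1"}
FAIL_PROB = {c: 0.05 for c in PARENT}  # assumed, identical for simplicity


def reachable(host, failed):
    """A host is reachable if nothing on its path to the core has failed."""
    node = host
    while node is not None:
        if node in failed:
            return False
        node = PARENT[node]
    return True


def assess(plan, k, rounds, rng):
    """Return Y/X: the fraction of rounds with at least k reachable hosts."""
    reliable_rounds = 0
    for _ in range(rounds):
        # Draw an independent failure state for every component.
        failed = {c for c, p in FAIL_PROB.items() if rng.random() < p}
        alive = sum(reachable(h, failed) for h in plan)
        if alive >= k:
            reliable_rounds += 1
    return reliable_rounds / rounds


spread = assess(("h0", "h2"), k=1, rounds=20000, rng=random.Random(1))
same_edge = assess(("h0", "h1"), k=1, rounds=20000, rng=random.Random(1))
```

In this toy setup, spreading the two instances across different aggregation switches should score higher than placing both under the same edge switch, because the latter shares more single points of failure.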
12
Step 2: assess reliability of a deployment plan
A failure state must be generated for each component in each round, which is quite expensive.
reCloud uses dagger sampling to generate failure states.
Example: a component fails with probability 0.2, meaning 1 failure every 5 rounds on average.
Monte-Carlo sampling: generate 5 random numbers to produce failures for 5 rounds.
Dagger sampling: generate only 1 random integer in [1,5] to decide in which round the component fails.
[Figure: five rounds per component; Monte-Carlo sampling decides failed/alive in every round, while dagger sampling marks one round as failed and the others as alive.]
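A minimal sketch of dagger sampling for one component follows. It uses the common formulation with a single uniform random number split into sub-intervals of width p, which for p = 0.2 is equivalent to picking the failing round as an integer in [1,5]. The function name is hypothetical and this is not taken from the reCloud prototype.

```python
import random


def dagger_block(p, rng):
    """Failure states for floor(1/p) consecutive rounds from ONE random number.

    Round j (0-based) is marked failed if the draw lands in [j*p, (j+1)*p).
    If 1/p is not an integer, the leftover interval means "no failure".
    """
    s = int(1 / p + 1e-9)  # number of rounds covered; epsilon tolerates float rounding
    u = rng.random()       # the single random number for the whole block
    states = [False] * s   # False = alive, True = failed
    j = int(u / p)
    if j < s:
        states[j] = True
    return states


rng = random.Random(0)
blocks = [dagger_block(0.2, rng) for _ in range(1000)]
```

With p = 0.2 every block covers 5 rounds and contains exactly one failure, so 1000 random numbers produce failure states for 5000 rounds, whereas plain Monte-Carlo sampling would draw 5000.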
13
Step 3: search for reliable deployment plan
There is a huge number of potential deployment plans.
reCloud uses simulated annealing to search for a reliable deployment plan:
Evolve new deployment plans.
Accept not only more reliable deployment plans, but also less reliable ones with some probability.
The search ends when it finds a deployment plan that satisfies the user-specified reliability, or when it times out.
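The search loop can be sketched as a generic simulated-annealing routine. The neighbor move (relocate one instance to an unused host), the temperature schedule, and the toy scoring function are all assumptions chosen for illustration; in particular, the `score` callback here is a stand-in for the sampling-based reliability assessment of Step 2.

```python
import math
import random
import time


def anneal(hosts, n, score, r_desired, t_max_s, rng, temp=0.1, cooling=0.95):
    """Search for a plan with score >= r_desired within t_max_s seconds.

    Assumes n < len(hosts) so that a free host always exists.
    """
    current = rng.sample(hosts, n)
    current_score = score(current)
    best, best_score = list(current), current_score
    deadline = time.monotonic() + t_max_s
    while best_score < r_desired and time.monotonic() < deadline:
        # Evolve: move one instance to a host that the plan does not use yet.
        candidate = list(current)
        free = [h for h in hosts if h not in current]
        candidate[rng.randrange(n)] = rng.choice(free)
        cand_score = score(candidate)
        delta = cand_score - current_score
        # Always accept improvements; accept worse plans with some probability.
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current, current_score = candidate, cand_score
            if current_score > best_score:
                best, best_score = list(current), current_score
        temp = max(temp * cooling, 1e-6)  # cool down, but never reach zero
    return best, best_score


# Toy stand-in score: fraction of instances placed on distinct racks.
hosts = [f"r{r}h{h}" for r in range(4) for h in range(4)]
toy_score = lambda plan: len({h[:2] for h in plan}) / len(plan)
best, best_score = anneal(hosts, 3, toy_score, 1.0, 2.0, random.Random(0))
```

Accepting occasional worse plans early on lets the search leave local optima; as the temperature drops, the loop behaves more and more like a greedy search.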
14
Step 3: search for reliable deployment plan
Cloud data centers are normally designed to be symmetric.
reCloud uses a network transformation technique to check whether multiple deployment plans are equivalent.
Equivalent deployment plans do not need to be assessed again.
[Figure: two deployment plans in the fat-tree marked as equivalent.]
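The idea of skipping equivalent plans can be illustrated with a canonical signature. The sketch below is not the network transformation technique from the paper; it is a much simpler stand-in that assumes a perfectly symmetric tree in which all pods, all edge switches within a pod, and all hosts under an edge switch are interchangeable and have identical failure probabilities. The `(pod, edge, host)` tuple encoding is hypothetical.

```python
from collections import Counter


def canonical(plan):
    """Signature that is identical for plans equal up to symmetry.

    plan: iterable of (pod, edge, host) index tuples.
    """
    # Count instances per edge switch; host identity under an edge is irrelevant.
    per_edge = Counter((pod, edge) for pod, edge, _ in plan)
    per_pod = {}
    for (pod, _), count in per_edge.items():
        per_pod.setdefault(pod, []).append(count)
    # Sort within pods and across pods so that labels no longer matter.
    return tuple(sorted(tuple(sorted(counts)) for counts in per_pod.values()))


a = [(0, 0, 0), (0, 1, 3)]  # same pod, different edge switches
b = [(2, 1, 1), (2, 0, 0)]  # same shape in another pod
c = [(0, 0, 0), (0, 0, 1)]  # both under one edge switch
d = [(0, 0, 0), (1, 0, 0)]  # different pods
```

A search loop could keep a dictionary from signature to reliability score and reuse the cached score whenever a new plan has a signature it has already seen.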
15
reCloud workflow (recap)
[Figure: the reCloud workflow diagram from slide 6, repeated.]
16
Evaluation
We have implemented a functional prototype (~5.3K lines of Java code).
We evaluate reCloud with 4 data center topologies, from tiny scale to large scale.
17
Evaluation
How efficient is dagger sampling at generating failure states for components?
It is 1 to 2 orders of magnitude faster than Monte-Carlo sampling.
18
Evaluation
How efficient is reCloud at assessing a given deployment plan?
An assessment takes ~270 ms even in a large-scale data center.
The redundancy level does not affect performance significantly.
19
Evaluation
How efficient is reCloud at searching for a reliable deployment plan?
reCloud needs only 30 seconds to find a deployment plan that is (at least) 10X more reliable than the current practice (CP) in a large-scale data center with 27K hosts.
Example: to achieve 4-of-5 redundancy, the current practice (CP) can find a deployment plan with 99.62% reliability (i.e., 33.3 hours of downtime per year).
reCloud can find a deployment plan with 99.97% reliability (i.e., 2.6 hours of downtime per year) within 30 seconds.
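The downtime figures follow directly from the reliability scores, assuming a 365-day year and treating unreliability as the fraction of the year the application is down. A quick check (the function name is hypothetical):

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, assuming a non-leap year


def downtime_hours(reliability):
    """Expected yearly downtime if (1 - reliability) of the year is lost."""
    return (1.0 - reliability) * HOURS_PER_YEAR


cp = downtime_hours(0.9962)       # current practice
recloud = downtime_hours(0.9997)  # reCloud
print(round(cp, 1), round(recloud, 1))  # prints: 33.3 2.6
```

The ratio of the two downtimes is about 12.7, which is consistent with the "(at least) 10X more reliable" claim.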
20
Summary
reCloud finds a reliable deployment plan for an application that fulfills the user's requirements, before the application gets deployed. It uses:
Dagger sampling to generate failures when assessing the reliability of a given deployment plan.
Simulated annealing to explore the huge space of potential deployment plans.
Network transformations to check the equivalence of different deployment plans.
reCloud can also:
Work with complex applications such as micro-service applications.
Balance reliability, application performance, and resource utilization.
Achieve all of the above with no changes to existing cloud infrastructure.
Please refer to the paper