WANTED: CLOUD SOLUTION

Presentation on theme: "WANTED: CLOUD SOLUTION"— Presentation transcript:

1 WANTED: CLOUD SOLUTION
WANTED: CLOUD SOLUTION. Fun loving developer seeks cloud platform for long-term relationship. Must be able to handle ups and downs and be able to commit to sometimes massive amounts of inquiries. (The slide is styled as a newspaper classifieds page; the surrounding ads, GARAGE SALE, MAN SEEKS TECH SAVVY WOMAN, WANTED: NERD, and so on, are placeholder filler.) Availability: as application and solution architects, there is something we have come to take for granted over the last few years, namely that our network infrastructure is where we need to focus our attention when we have concerns about the availability of the services we deliver. But in an increasingly distributed world, we find we are often affected by incidents in our external dependencies, and nowhere is this more true than in the realm of cloud computing. It's about the uptime, silly!

2 “Failure is always an option.”
Hardware fails. Software has bugs. People make mistakes. Adam Savage of Mythbusters has been quoted as saying, “Failure is ALWAYS an option.” While in his context he is referring to tests having unintended results, the cloud developer can take that to heart as well. There WILL be failures in your cloud solution, either through your own code or in the services your solution relies on. The trick to dealing with failure is having a plan of action for when it occurs and learning from past issues. Image: Discovery Channel, used here under Fair Use.

3 What are we looking for?
Protection from: hardware failure, data corruption, network failure, loss of facilities. Accessible vs. available: reachable by clients, possibly with degraded performance or function. Ultimately, what we're after is protection from outages. These can be hardware failures, data corruption, network failures, or even a disaster that results in the loss of facilities. This also raises the distinction between being accessible and being available. A service that is experiencing an outage and isn't even accessible is the worst possible scenario: there's no way for it to communicate to its end users that there is an issue. If we instead make sure it remains accessible, though perhaps not fully available, running in a degraded state, it at least has options. So the next step is to meet our goal of… (next slide) Images: Office ClipArt; Godzilla is copyright of Godzilla Releasing Corp (as far as I can tell) and is used here under Fair Use.

4 What we’re trying to achieve
Architecting resilient solutions (definition courtesy of Bing Dictionary). Resilient solutions are capable of adapting to an outage: they not only recover when normal operating conditions return, but can change their functionality, reducing what they do to, at worst, a bare-function mode. Image: free to use, from MorgueFile.

5 What is an SLA? A negotiated agreement or contract.
Defines vendor commitment. Penalties for violation. Not a guarantee! What we really want is availability, not promises, and protection from loss of revenue. The number/percentage is only an indicator; drill down to what it really means. An SLA is a promise, and much like an insurance policy, that promise usually means financial compensation if its terms are not met. But much like the insurance policy on our home, we really hope we never have to make a claim. We would rather take preventative measures, like keeping a fire extinguisher in the kitchen. An investment in preventative measures can save you and your end users huge headaches in the long run. Image: Office ClipArt.

6 Monitoring. Resilient solutions. Revised SLAs. These three items are key to ensuring availability, and they all need to be present. In this session we'll discuss each.

7 Monitoring

8 Detection - Seek out Issues
If you do not monitor for issues, how can you react when they happen? Be an active participant. Use multiple notification channels. Leverage “runtime governance.” Raise the alarm before failures occur. One of the biggest issues in dealing with failure is realizing what went wrong in the first place, or knowing that something is currently going wrong before the calls from your users start pouring in. You need health monitoring in place to detect when operations are not working as expected, when queues are getting behind, or when one of those identified points of failure has indeed rolled over and died. Image: Office ClipArt.
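A minimal sketch of what "raise the alarm before failures occur" can look like in practice: a probe that watches a work-queue backlog and alerts at a warning threshold well before the critical one. The names (GetQueueDepth, RaiseAlarm) and thresholds are placeholders for whatever queue API and notification channels you actually use; this is illustrative, not from the deck.

```csharp
// Health-probe sketch: poll a work queue and raise an alarm *before*
// the backlog becomes an outage. GetQueueDepth and RaiseAlarm are
// stand-ins for your queue API and notification channels.
using System;
using System.Threading;

class BacklogMonitor
{
    const int WarningDepth = 500;    // raise a flag before things fail
    const int CriticalDepth = 5000;  // page a human

    static void Main()
    {
        while (true)
        {
            int depth = GetQueueDepth();

            if (depth >= CriticalDepth)
                RaiseAlarm("CRITICAL", $"Queue backlog at {depth} messages.");
            else if (depth >= WarningDepth)
                RaiseAlarm("WARNING", $"Queue backlog climbing: {depth} messages.");

            Thread.Sleep(TimeSpan.FromMinutes(1));
        }
    }

    // Wire these to your real queue and to more than one channel
    // (email, SMS, dashboard) so there is more than one way to hear about it.
    static int GetQueueDepth() => 0;
    static void RaiseAlarm(string level, string message) =>
        Console.WriteLine($"[{level}] {message}");
}
```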

9 Functional Transparency
Properly instrument your applications. These are remote machines, so visibility into failures may be reduced. Allow remote interaction and tweaks to instrumentation behavior.
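One way to support "tweaks to instrumentation behavior" is to route all tracing through a source whose verbosity can be changed while the process is running. A sketch using the standard .NET TraceSource follows; the SetVerbosity hook and the OrderService/OrderProcessor names are invented for illustration, standing in for whatever remote management channel you expose.

```csharp
// Instrumentation sketch: a TraceSource whose verbosity can be raised at
// runtime (e.g. from a remote management call or a config refresh), so you
// can turn up logging on a misbehaving remote instance without redeploying.
using System;
using System.Diagnostics;

static class Telemetry
{
    public static readonly TraceSource Source =
        new TraceSource("OrderService", SourceLevels.Warning);

    // Called by whatever remote control channel you expose (hypothetical hook).
    public static void SetVerbosity(SourceLevels level) =>
        Source.Switch.Level = level;
}

class OrderProcessor
{
    public void Process(string orderId)
    {
        Telemetry.Source.TraceEvent(TraceEventType.Verbose, 0,
            "Processing order {0}", orderId);
        try
        {
            // ... real work would happen here ...
        }
        catch (Exception ex)
        {
            Telemetry.Source.TraceEvent(TraceEventType.Error, 1,
                "Order {0} failed: {1}", orderId, ex);
            throw;
        }
    }
}
```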

10 Runtime Governance Here we have a couple of screen grabs from a monitoring tool known as BlueStripe (not an endorsement). Tools like this provide a way to view everything that's going on in a solution. This level of “functional transparency” into not just what is happening but how can be critical in helping you diagnose what's happening in your solution before an outage occurs. It's also helpful in identifying how things behave when conditions are optimal, letting you tune things to further improve the system.

11 Diagnostics Diagnostics transfer to storage on demand or at specific intervals. Because of hardware failures you will NOT get all of your log messages. Storage accounts have performance limits, so keep a diagnostics account separate from your other data. Good logging levels and configurable notifications are a must. Three options: capture EVERYTHING; capture “critical errors” with adjustable logging levels; or stick your head in the sand. Windows Azure Diagnostics is a topic that can take up an hour all by itself. Yes, Windows Azure does have mechanisms in place to let you gather log messages locally in Cloud Services (web and worker roles) and then transfer them to a storage account; however, there are some things to keep in mind here.
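For reference, a sketch of the classic Windows Azure Diagnostics setup (the SDK 1.x era RoleEntryPoint pattern) that covers the points above: scheduled transfer, an adjustable log-level filter, and a dedicated diagnostics storage account. Treat the exact types and setting names as a sketch of that legacy API rather than a copy-paste recipe, and verify against the SDK version you are actually running.

```csharp
// Sketch of legacy Windows Azure Diagnostics configuration in a role's
// OnStart: buffer trace logs locally, then transfer only what you need
// (warnings and above here) to a *dedicated* diagnostics storage account
// on a schedule.
using System;
using Microsoft.WindowsAzure.Diagnostics;
using Microsoft.WindowsAzure.ServiceRuntime;

public class WorkerRole : RoleEntryPoint
{
    public override bool OnStart()
    {
        var config = DiagnosticMonitor.GetDefaultInitialConfiguration();

        // Push buffered logs to storage every 5 minutes...
        config.Logs.ScheduledTransferPeriod = TimeSpan.FromMinutes(5);
        // ...but only Warning and above; bump this when diagnosing an issue.
        config.Logs.ScheduledTransferLogLevelFilter = LogLevel.Warning;

        // The connection string should point at a storage account used only
        // for diagnostics, kept separate from application data.
        DiagnosticMonitor.Start(
            "Microsoft.WindowsAzure.Plugins.Diagnostics.ConnectionString",
            config);

        return base.OnStart();
    }
}
```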

12 Resilience

13 This addresses the symptom; it does not resolve the underlying problem
Try/catch != resilient. Now if you've slung code, you can likely guess what this code snippet is doing, and I know I've written this kind of exception handling block hundreds if not thousands of times. But this approach addresses the symptom, not the problem. The solution doesn't react to the exception or take any action on it, aside from perhaps logging the issue. This is where resiliency comes into play…
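The transcript doesn't include the snippet shown on the slide; a minimal stand-in for the pattern it describes (catch, log, carry on) might look like the following. Every name here is invented for illustration.

```csharp
// The classic "catch, log, move on" block: it records that something broke
// but changes nothing about what the solution does next.
using System;

class NaiveHandler
{
    // Hypothetical dependency standing in for whatever the slide used.
    static void UploadBlob(string name, byte[] payload)
    {
        throw new TimeoutException("storage not reachable");
    }

    static void Main()
    {
        try
        {
            UploadBlob("report.csv", new byte[0]);
        }
        catch (Exception ex)
        {
            // Symptom handled (we know it failed); problem untouched:
            // no retry, no fallback, no degraded mode, no signal upstream.
            Console.Error.WriteLine($"Upload failed: {ex.Message}");
        }
    }
}
```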

14 Remember: Failure is always an option.
Common points of failure: machine/application crashes, throttling (exceeding capacity), connectivity/network, external service dependencies. Traditionally we've been able to depend on the uptime of hardware, but in an increasingly complex world where everything is interconnected, we're learning that this isn't always enough. We need to understand the different points of failure and how to address them. Machine/app crashes: have multiple copies and redirect traffic. Throttling: know the limits of the services and resources you are using and how to handle the errors that occur. Connectivity/network: resource connectivity is much more fluid, so you need to know how to adjust as things move around. External service dependencies: as you build dependencies on external services, what happens when they fail? Learn to adjust and move on. Highscalability.com: every site will hit a point where it fails; know what that point is and handle it gracefully. Focus less on the uptime of hardware and more on how the solution handles it WHEN something fails!

15 Classifications
Components: low state vs. high state. Failures: transient vs. long-lived. Considerations: failures and capacity. Images: Office ClipArt; MC Hammer from press photos.

16 Request Buffering
Retry policies: wait and try again, or queue until capacity is available. Queuing enables asynchronous workloads, temporal decoupling, and load levelling. The first lesson we need to learn is how to handle transient, temporary issues. These can be caused by temporarily exceeding capacity or by momentary losses of connectivity. By implementing approaches that allow for request buffering, we can throttle back and use disconnected designs that let work be queued up and processed as capacity becomes available. One approach is to implement retry policies using something like the Windows Azure Transient Fault Handling Application Block. This allows errors to be retried and even supports backoff policies so we can slow the frequency with which we retry. Another approach is to go entirely asynchronous and simply log the request so we can process it later. The techniques can even be combined. Note: if you aren't investing in designing asynchronous workflows, especially for long-running processes, start looking into it now.
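A hand-rolled retry-with-backoff sketch of the idea above. This is not the Transient Fault Handling Application Block itself (which packages the same pattern with pluggable detection and backoff strategies); the helper name and the choice of TimeoutException as the "transient" error are assumptions for illustration.

```csharp
// Retry with exponential backoff: only retry errors you believe are
// transient; let everything else propagate.
using System;
using System.Threading;

static class Retry
{
    public static T WithBackoff<T>(Func<T> operation,
                                   int maxAttempts = 5,
                                   int baseDelayMs = 200)
    {
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return operation();
            }
            catch (TimeoutException) when (attempt < maxAttempts)
            {
                // Backoff: 200ms, 400ms, 800ms, ... slows the retry
                // frequency so we don't pile onto a struggling service.
                Thread.Sleep(baseDelayMs * (1 << (attempt - 1)));
            }
        }
    }
}

// Usage (hypothetical storage client):
// var blob = Retry.WithBackoff(() => storage.Download("reports/latest.csv"));
```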

17 Capacity Buffering: enables recovery during outages or spikes in load
Content delivery networks (CDNs), a distributed application cache, and a local content cache all help. Another often overlooked way to avoid capacity-based request throttling is to use various caching strategies. We can overcome temporary capacity constraints by buffering content: offloading delivery work to things like content delivery networks, using caches to store frequently accessed materials, or even leveraging local disk-based caches for large files so we're not constantly retrieving them from other storage systems. We'll talk more about CDNs during our Scale and Reach discussion as well. Image: free to use, from MorgueFile.
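A read-through cache sketch showing the local-cache idea: serve frequently requested content from memory so a hiccup or throttling event in the backing store doesn't immediately reach users. The ContentCache type and the 30-minute expiry are assumptions; the same shape applies to a distributed cache or a CDN origin-pull setup.

```csharp
// Read-through cache: check the local cache first, fall through to the
// "slow"/fragile storage path only on a miss.
using System;
using System.Runtime.Caching;

class ContentCache
{
    readonly MemoryCache _cache = MemoryCache.Default;
    readonly Func<string, byte[]> _loadFromStorage;  // the fragile path

    public ContentCache(Func<string, byte[]> loadFromStorage) =>
        _loadFromStorage = loadFromStorage;

    public byte[] Get(string key)
    {
        if (_cache.Get(key) is byte[] cached)
            return cached;  // buffered copy: a storage blip is never noticed

        var content = _loadFromStorage(key);
        _cache.Set(key, content, new CacheItemPolicy
        {
            // Stale-but-available beats fresh-but-down for most content.
            AbsoluteExpiration = DateTimeOffset.Now.AddMinutes(30)
        });
        return content;
    }
}
```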

18 N+1 - Extra Capacity Carry extra capacity to help even out spikes
If you fail over, service degrades but doesn't fail completely. Buy time to react. Speed recovery. Leverage for rolling upgrades. Extra capacity is another way to buffer capacity issues. By carrying just a little extra capacity, you buy yourself time when sudden spikes occur. This speeds your recovery and offers a much better user experience, though at additional cost. Question: what is a rolling upgrade? Answer: when you deploy changes, a group of your resources is taken offline to be upgraded while traffic is directed to the remaining resources. This allows updates to be deployed without taking the entire “cluster” offline. It does come at a cost, though: if you are deploying new versions of services, or even database changes, the old versions need to be able to handle the new changes seamlessly.

19 Always carry a spare: 50% more capacity than needed
Normal state: 100% of the load spread across 150% of needed capacity, two clusters each at 75% capacity carrying half of the load. SYSTEM FAILURE! One cluster drops to 0% capacity and all load is redirected to the survivor, which is now over-allocated but still functioning. With 50% more capacity than needed you can absorb temporary spikes and have time to react if you need to add capacity. Degrade, but don't fail. Carrying extra capacity can also help in the case of an outage: if you have two clusters, each running some extra capacity, when one fails you can redirect the traffic to the remaining cluster. While it may be over-utilized and running in a degraded state, you are at least able to keep running while you work to increase your capacity, yet again providing capacity buffering.

20 Degrade, but don't fail “Due to higher than average volumes, processing of your request may be delayed.” Image copyright of we SINGS. This leads us to our next lesson: degrade, but don't fail. When issues happen, it's not enough to trap the exception and log it. We need to bubble that failure back to our end users so they at least get some indication of what's going on; some message is better than no message. A great example of this is Netflix's recommendation engine. They have a comprehensive recommendation system, but when it fails they have a backup: they just recommend items that are popular. This is a less effective recommendation, or put another way, a degraded experience. But it's good enough that their end users rarely, if ever, realize there was an outage. 404/503 error vs. placeholder content. Try, try, and try yet again.
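A sketch of the fallback idea, modeled loosely on the Netflix example in the notes: if the personalized recommendation call fails, return a cached list of generally popular titles instead of surfacing an error. All type and service names here are hypothetical.

```csharp
// "Degrade, but don't fail": serve something useful when the primary
// dependency is unavailable.
using System;
using System.Collections.Generic;

class Recommendations
{
    // Refreshed offline and cached locally; good enough when degraded.
    static readonly string[] PopularFallback = { "Title A", "Title B", "Title C" };

    public static IReadOnlyList<string> ForUser(
        string userId, Func<string, IReadOnlyList<string>> personalizedService)
    {
        try
        {
            return personalizedService(userId);
        }
        catch (Exception)
        {
            // Degraded experience: less relevant, but the page still renders
            // and most users never notice the outage.
            return PopularFallback;
        }
    }
}
```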

21 Virtualization and Automation
Virtualization provides greater flexibility to move workloads. Automation reduces mean time to recovery. Don't forget the human factor! Automation can be used to detect when a failure occurs and take corrective action. This not only reduces the impact on our end users, it also reduces or entirely removes the need for you to execute a manual recovery process, cutting the mean time to recovery we talked about earlier. However, we want to be careful about having automation without checks and balances. After all, we're not out to create the next SkyNet. Image from the film Terminator 2.
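A sketch of automation with a built-in check, anticipating the "human intervention" point on the next slide: recycle a failed instance automatically to cut mean time to recovery, but stop and escalate once failures keep recurring rather than "healing" the fleet into the ground. RestartInstance and PageOnCallEngineer are placeholders for your own tooling.

```csharp
// Auto-recovery with a safety valve: after too many consecutive failures,
// suspend automation and ask for a human.
using System;

class AutoHealer
{
    const int HumanInterventionThreshold = 3;  // the "HI" point
    int _consecutiveFailures;

    public void OnInstanceUnhealthy(string instanceId)
    {
        _consecutiveFailures++;

        if (_consecutiveFailures >= HumanInterventionThreshold)
        {
            // Something systemic is wrong; corrective action may be making
            // it worse. Stop automating and escalate.
            PageOnCallEngineer($"Auto-heal suspended after {_consecutiveFailures} failures.");
            return;
        }

        RestartInstance(instanceId);
    }

    public void OnInstanceHealthy() => _consecutiveFailures = 0;

    static void RestartInstance(string id) => Console.WriteLine($"Restarting {id}");
    static void PageOnCallEngineer(string msg) => Console.WriteLine($"PAGE: {msg}");
}
```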

22 The “HI” Point
On leap day in 2012, Windows Azure experienced its largest and most visible service disruption to date. Outages happen, but what's important is that in this case the automated recovery systems had a “human intervention,” or “HI,” point. When the systems reached this point, they stopped taking action and alerted the support teams that something was dramatically wrong. (Speak to the animation and the cascading failure the invalid certificates caused: deploying an infra update or customer VM; normal “service healing” migrates VMs, as with “normal” hardware failure; leap day starts; VMs cause nodes to fail; the cascade goes viral; the system reaches a point where it asks for human intervention/review.) Animation from the TechEd NA session “Windows Azure Internals” by Mark Russinovich.

23 Service Levels
One startup offered an SLA that gave back all monthly fees if they failed to meet it. They are no longer in business. It's important to set SLAs that are realistic.

24 We need to be back up within 5 minutes!
Mean time to recovery: don't set an artificial limit… “We need to be back up within 5 minutes!” You can't put a definite time limit on the unknown. Doing so just adds a LOT of stress to a situation that is already not good. Image: free to use, from MorgueFile.

25 Total outage duration = time to detect + time to diagnose + time to decide + time to act
When calculating how fast you can be “back online,” you need to take all of these activities into account. If you have to navigate a complex, manual process, this will slow things down. But if your solution can react in an automated manner, you reduce outages and in some cases even hide some of them entirely. By introducing the proper outage behaviors into your solutions, and taking steps to ensure that your organization's processes support rather than hinder reacting to issues, you can minimize downtime. This session focuses on what you as a solution architect can do. Image: Office ClipArt.

26 SLAs – the sum of all parts
Let's do some math. Five services, each with 99.95% uptime, all critical to the uptime of your solution; failure of any one service will take down the entire solution. The solution then has a mean availability of roughly 99.75%. So back to SLAs: our scenario-based SLA has to take into account the sum of all our parts. Assume we have a solution composed of five services, and we've been guaranteed (yeah, that's just another promise) that each of those services has 99.95% uptime. This aggregate view applies when your solution depends on multiple services, each with its own SLA. The total uptime is NOT (as you'd commonly think) 99.95%. Since the uptime of the solution depends on the uptime of all the component services, we actually have to use the aggregate total amount of downtime. Each service could have up to roughly 262 minutes of downtime each year, which means the solution could be impacted by up to 1,314 minutes per year, or about 99.75% uptime. The key to this view is that the components share one or more common points of potential failure: the power supply, the internet connectivity, and so on. Image: Office ClipArt.
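The arithmetic from the notes, written out so it can be checked. The additive-downtime view (each dependent service contributes its own outage minutes) is what the notes use; the multiplicative view is shown alongside for comparison and lands in the same place.

```csharp
// Composite availability for five dependent services at 99.95% each.
using System;

class CompositeSla
{
    static void Main()
    {
        const double perService = 0.9995;   // 99.95% uptime
        const int services = 5;
        const double minutesPerYear = 365 * 24 * 60;

        double perServiceDowntime = (1 - perService) * minutesPerYear;      // ~263 min
        double worstCaseDowntime = services * perServiceDowntime;           // ~1,314 min
        double additiveAvailability = 1 - services * (1 - perService);      // 99.75%
        double multiplicativeAvailability = Math.Pow(perService, services); // ~99.75%

        Console.WriteLine($"Per-service downtime/yr : {perServiceDowntime:F0} min");
        Console.WriteLine($"Worst-case downtime/yr  : {worstCaseDowntime:F0} min");
        Console.WriteLine($"Composite availability  : {additiveAvailability:P2} (additive)");
        Console.WriteLine($"Composite availability  : {multiplicativeAvailability:P2} (multiplicative)");
    }
}
```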

27 Dept. of Redundancy Dept.
Have a backup, somewhere else. More than one? What's the cost-to-benefit ratio? Ready state: hot = full capacity; warm = scaled down, but ready to grow; cold = mothballed, starts from zero. Another approach to increasing availability is running additional copies of your solution, usually somewhere else, in other words, increased redundancy. These copies are spun up completely isolated (in another datacenter, even) with no common points of failure. They can exist in different ready states depending on the financial investment you want to make: the hotter the ready state, the more costly the solution is likely to be, so pick a scenario that meets your requirements. Question: why wouldn't we want to run a redundant copy in the same datacenter? Answer: that creates an aggregate view, since both copies now share common points of failure. (The title is a Firesign Theatre reference.) Image: Office ClipArt.

28 Redundancy - It's about probability
1 box at 95% uptime: 5% downtime, or 438 hours per year (that's 18½ days!). 2 boxes: 5/100 × 5/100 = 25/10,000 = 0.25% downtime, or about 22 hours per year. 4 boxes: (5/100)^4 = 625/100,000,000 ≈ 0.000625% downtime, or mere minutes per year. Calculating a redundant SLA: say I have completely separate copies of my solution that are in no way dependent on each other, perhaps in different datacenters, each with 95% uptime. Because a failure in one doesn't impact the other, an outage of the overall solution requires outages to intersect in both silos. This can be represented as 5/100 × 5/100, which gives a probability of a complete outage of approximately 0.25%, or 99.75% availability. There is an important “but” to this figure: since we arrive at it via a probability calculation, we are gambling on the chance of an outage that impacts both copies at once.
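The flip side of the previous calculation: with independent redundant copies and no shared points of failure, the whole system is only down when every copy is down at once, so availability is 1 − (1 − A)^n. A small sketch that reproduces the numbers on the slide:

```csharp
// Availability of n independent copies, each 95% available.
using System;

class RedundantSla
{
    static void Main()
    {
        const double singleCopy = 0.95;   // 95% uptime per copy
        const double hoursPerYear = 365 * 24;

        for (int copies = 1; copies <= 4; copies++)
        {
            double downtimeProbability = Math.Pow(1 - singleCopy, copies);
            double availability = 1 - downtimeProbability;
            Console.WriteLine(
                $"{copies} cop{(copies == 1 ? "y" : "ies")}: " +
                $"{availability:P4} available, " +
                $"~{downtimeProbability * hoursPerYear:F1} hrs down/yr");
        }
        // 1 copy  : 95%        available, ~438 hrs down/yr
        // 2 copies: 99.75%     available, ~22 hrs down/yr
        // 4 copies: ~99.9994%  available, ~0.1 hrs (a few minutes) down/yr
    }
}
```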

29 Change the SLA
Component based: “Our email server must have 99% uptime.” Little business context, hard to articulate the value. Scenario based: “99% of our emails will be sent in 5 minutes or less.” Directly relates to business value and provides flexibility in achieving objectives. The first step is to work with your business stakeholders to change the way you think about SLAs. Today, most SLAs are component based: they focus on a single component, like an email server. This lacks business context and therefore makes it difficult to quantify what the impact to the organization is (from a business standpoint). By moving to a scenario-based SLA, we shift the focus to the business process. This lets us measure exactly what the benefit to the business is, and define a solution that meets that need without being tied to a specific component. In our example, instead of having to create a highly available email server, we could build a cluster of email servers with enough capacity to meet business demand, or better yet, keep a virtualized “backup” machine that we know we can deploy in under 5 minutes when our monitoring solution detects a problem with the primary server. Either solution meets the scenario-based SLA, but not the component-based one. Image: Office ClipArt.

30 “Don't be too proud of this technological terror you've constructed…”
Root cause analysis: read other vendors' root cause analyses too. ADMIT: your solution WILL fail at some point, and you can learn from others just as well as from yourself. DON'T: get cocky, or stick your head in the sand. Try as you might, you won't be successful in rooting out all the potential failure points in your solution; there will always be some surprise that comes up. When this happens, take the time to really dig into what went wrong and then determine a course of action to mitigate the issue in the future. You'll notice that after significant outages, both Windows Azure and Amazon EC2/S3 publish root cause analyses. Read these and become familiar with them. Ask yourself: is this something that can happen to my code? Or, if your solution is based on some of those services (whether you were affected or not), what should we do if that happens to us? Put something in place to deal with the issue in the future. For example, don't be the Empire, who had a flaw only 2 meters wide in V1 of their product and then, in the second version, left spaces big enough to fly ships through. If you have users or customers, it's probably best to be quick and very forthright about what happened; share your root cause analysis with them. In February of 2012 Windows Azure suffered a severe outage in many of its services, including the management API. The issue boiled down to a simple code error around date calculation; someone did something bad in code. It might be fun to laugh and say, “wow, can't they get date math right?”, but then again, did you do a sweep of your own code looking for the same possible problem? I think you'd be really surprised what little gems haunt even the code of “senior” developers. When you see someone has made a mistake in code, in a design, etc., learn from it and make sure you won't suffer the same fate.

31 Your entire organization must be committed.
Do, or do not! Your entire organization must be committed. This will take time. This will be expensive. You will still make mistakes; plan for and learn from them. So in closing, some words of caution: if you're going to build resilient solutions, it's important to get your entire organization behind it. It will take more time and it can increase costs. And the hardest pill to swallow is that no matter what you do, there will likely be a time when something fails that you haven't accounted for. The point is to celebrate the “silent failures” and make sure folks see value in the investments that have already been made. And when an outage does occur that the solution could not cope with, learn from it and share that learning!

32 As you scale your system you need to identify possible points of failure. For each point of failure, assess a risk level and determine if and how you will deal with it. Think about the recent landing of the Mars Curiosity rover by NASA. That landing process was extremely complex, with any number of things that could go wrong. Each one was reviewed and analyzed, and a conscious decision was made to either deal with the problem or accept the risk. In some cases the full cost of the mission ($2.5 billion) was risked because the cost of providing a backup was either deemed too high or simply impossible to manage. The best advice I can give you for dealing with failure is to actually deal with it. Have plans in place. Do failure assessments on your designs to find all the holes and possible points of failure. Then, just like NASA, assess each issue and decide how you would fix it (if you even can) and how much effort that would be. Then, using the same risk-versus-cost discussion we've been having, decide whether you plan to address it or not. Finally, document the recovery plans for how to deal with the failure. Image: NASA.

33 Brent Stineman DX/TED, Azure Specialist, Cloud Evangelist Web: Creating bugs since 1992!

34 Discussion Scenarios
Upgrades/updates, planned component outages, unplanned outages, capacity spikes (denial of service).

