This presentation can be distributed under a Creative Commons License
Image: xkcd.com Dependable Cloud Mike Wood
Mike Wood Thanks
“Failure is always an option.” Image: Discovery Channel, Fair Use
What are we looking for? Check out: Images: Office ClipArt & Godzilla Releasing Corp (Fair Use) Hardware Failure, Data Corruption, Network Failure, Loss of Facilities
Image: FOX, Fair Use Human Error
What we’re trying to achieve 1.Monitoring 2.Resilient Solutions Image: Cohdra
Image: Office ClipArt Cost vs Risk % $1, …, To get more 9’s here add more 0’s here.
Image: NASA
Functional Transparency Image: Office ClipArt Logging Messages Hardware Health Dependent Services Health
Telemetry
Image: NASA Analyze your Data
Image: Office ClipArt
Remember: Failure is always an option. Common Points of Failure: Machine/application crashes; Throttling (exceeding capacity); Connectivity/Network; External service dependencies
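Try/catch != Resilient
// Requires: using System.Diagnostics; using System.IO;
private void createFile(string fileName)
{
    try
    {
        // Dispose the returned FileStream so the file handle is released.
        using (File.Create(fileName)) { }
    }
    catch (DirectoryNotFoundException ex)
    {
        // Logging and rethrowing records the failure but does not recover from it.
        Trace.WriteLine(String.Format("Unable to create {0}. {1}", fileName, ex));
        throw;
    }
}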
Image: Michael Wood Decompose your system…
Capacity Buffering: Content Delivery Networks (CDNs); Distributed Application Cache; Local Content Cache. Enables recovery during outages or spikes in load. Image: jepler
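One way a local content cache can help during an outage is to keep the last good copy of each item and serve it when the origin is unavailable. The sketch below is only an illustration of that idea: `FallbackCache`, its `fetch` delegate, and the choice to fall back on any exception are assumptions, not something shown in the slides, and a real cache would also need expiry and size limits.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical helper for illustration; not part of the original presentation.
public sealed class FallbackCache<TKey, TValue>
{
    private readonly ConcurrentDictionary<TKey, TValue> _lastGood =
        new ConcurrentDictionary<TKey, TValue>();
    private readonly Func<TKey, TValue> _fetch;

    public FallbackCache(Func<TKey, TValue> fetch)
    {
        _fetch = fetch ?? throw new ArgumentNullException(nameof(fetch));
    }

    // Returns fresh content when the source responds; otherwise returns the
    // last good copy. Rethrows only when there is nothing to fall back on.
    public TValue Get(TKey key)
    {
        try
        {
            TValue fresh = _fetch(key);
            _lastGood[key] = fresh;   // remember the latest successful result
            return fresh;
        }
        catch (Exception)
        {
            if (_lastGood.TryGetValue(key, out TValue stale))
            {
                return stale;         // degrade: serve possibly stale content
            }
            throw;                    // no cached copy, so surface the failure
        }
    }
}
```

Callers get slightly stale content during an outage instead of an error, which is the trade the slide describes.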
Always carry a spare 75% Capacity, half of our load 50% more capacity than needed Can absorb temporary spikes Time to react if we need to add capacity 100% of load, 150% Capacity 0% Capacity, redirect all load Over allocated, but still functioning Degrade, but don’t fail SYSTEM FAILURE!!! Image: Kevin Rosseel
Request Buffering Image: Joe Shlabotnik Queues Retry Policies Async Workloads
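A retry policy can be sketched in a few lines. The version below retries an operation a fixed number of times and doubles the wait after each failure. `RetrySketch`, its parameters, and the decision to treat every `IOException` as transient are assumptions made for illustration; a real policy would be more selective about which failures are worth retrying and would usually add jitter.

```csharp
using System;
using System.IO;
using System.Threading;

// Hypothetical helper for illustration; not part of the original presentation.
public static class RetrySketch
{
    // Runs action, making at most maxAttempts attempts in total and doubling
    // the wait after each failed attempt. Only IOException is retried here.
    public static T Execute<T>(Func<T> action, int maxAttempts, TimeSpan baseDelay)
    {
        if (action == null) throw new ArgumentNullException(nameof(action));
        if (maxAttempts < 1) throw new ArgumentOutOfRangeException(nameof(maxAttempts));

        TimeSpan delay = baseDelay;
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return action();
            }
            catch (IOException) when (attempt < maxAttempts)
            {
                // Assumed transient: wait, then try again with a longer delay.
                Thread.Sleep(delay);
                delay = TimeSpan.FromTicks(delay.Ticks * 2);
            }
            // On the final attempt the filter is false, so the exception
            // propagates to the caller unchanged.
        }
    }
}
```

A call might look like `RetrySketch.Execute(() => File.ReadAllBytes(path), 3, TimeSpan.FromMilliseconds(200))`, where `path` is whatever file the caller needs.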
Dept. of Redundancy Dept. Have a backup, somewhere else More than one? Cost to benefit Ratio? Ready State Hot = full capacity Warm = scaled down, but ready to grow Cold = mothballed, starts from zero Image: Mr. White
Redundancy - It's about probability
95% uptime per box
1 box: 5% downtime, or 438 hrs per year (that’s 18 ¼ days!)
2 boxes: 5/100 * 5/100 = 25/10,000 = 0.25% downtime, or 22 hrs per year
4 boxes: 5/100 * 5/100 * 5/100 * 5/100 = 625/100,000,000 = 0.000625% downtime, or about 3.3 MINUTES per year
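The arithmetic on this slide rests on two assumptions that are worth stating: each box fails independently of the others, and the service is down only when every box is down at the same time. Boxes that share a rack, a power feed, or a bad deployment will not meet the first assumption, so treat these figures as a best case. With per-box availability $a = 0.95$, $n$ boxes, and 8760 hours in a year:

```latex
\begin{align*}
P(\text{all down}) &= (1-a)^n, \qquad \text{hours down per year} = 8760\,(1-a)^n \\
n = 1 &: \quad 0.05 = 5\% \;\Rightarrow\; 438\ \text{h} \approx 18.25\ \text{days} \\
n = 2 &: \quad 0.05^2 = 0.0025 = 0.25\% \;\Rightarrow\; 21.9\ \text{h} \\
n = 4 &: \quad 0.05^4 = 6.25\times 10^{-6} = 0.000625\% \;\Rightarrow\; 0.055\ \text{h} \approx 3.3\ \text{min}
\end{align*}
```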
Total Outage duration = Time to Detect + Time to Diagnose + Time to Decide + Time to Act Image: Office ClipArt
Dynamic Addressing & Configuration
What about your data? Image: barrymieny
Availability via Degradation Image: Michael Wood
Images: Gizmodo Virtualization and Automation
Images: Orion Pictures owns Terminator Franchise
The “HI” Point Images: Office Clip Art
Image: NASA
“Don't be too proud of this technological terror you've constructed…” ADMIT: Your solution WILL fail at some point. You can learn from others' failures just as well as from your own. DO: Root cause analysis. Read others' root cause analyses. Plan for failure. DON’T: Get cocky. Stick your head in the sand. Images: LucasFilm, Fair Use
Mike Wood Thanks