Published by Nichole Willow. Modified over 9 years ago.
1
The Case for Drill-Ready Cloud Computing
Vision Paper
Tanakorn Leesatapornwongsa and Haryadi S. Gunawi
2
Cloud Services
- Cheap
- Convenient
- Reliable
3
Yahoo Mail Disruption
- Hardware failures
- Wrong failover
- Disruptions
  - Some users could not access the service
  - Some users saw wrong notifications
  - Recovery took several days
4
Outlook Disruption
- Hardware failures
  - Caching server
- Failover to the backend servers worked correctly
- Requests flooded the servers
- Service went down
- Microsoft needed to change its software infrastructure
5
Cloud Outages
Outage       | Root event        | Supposedly tolerable failure | Incorrect recovery      | Major outage
Amazon EBS   | Network misconfig | Network partition            | Re-mirroring storm      | Clusters collapsed
Gmail        | Upgrade event     | Servers offline              | Bad request routing     | All routing servers down
App Engine   | Power failure     | 25% machines offline         | Bad failover            | All user apps were degraded
Skype        | Overload          | 30% nodes failed             | Positive feedback loop  | Almost all nodes failed
Google Drive | Network bug       | Network offline              | Timeout during failover | 33% requests affected
Outlook      | Caching failure   | Failover to backend          | Request flooding        | 7-hour outage
Yahoo Mail   | Hardware failures | Servers offline              | Buggy failover          | 1% of users affected
6
Journey of Cloud Dependability Research
7
Fault-Tolerant Systems
- Complex failures
- Hard to handle and implement correctly
- Recovery protocols are very complex
- Recovery code is one of the buggiest parts
8
Offline Testing
- Thoroughly verify recovery mechanisms
9
Offline Testing
- Thoroughly verify recovery mechanisms
- Fault injection, model checking, stress testing, etc.
- "Mini cluster" that represents production runs
- Testing and production environments are different
  - Cluster, workload, failures
[Figure labels: Mini cluster, Test workload; Production run, Real workload]
10
Offline Testing
- Thoroughly verify recovery mechanisms
- Fault injection, model checking, stress testing, etc.
- "Mini cluster" that represents production runs
- Testing and production environments are different
  - Cluster, workload, failures
- Orders of magnitude difference in scale
  - Facebook used 100 machines to mimic a 3000-machine production run (2011)
- Small start-ups forgo the luxury
  - Many tests are much smaller than this
11
Diagnosis
- Helps administrators pinpoint and reproduce the causes of outages
- BUT
  - Post-mortem: does not prevent disruptions
  - Passive approach: waits for outages to happen before diagnosing
12
Online Testing and Failure Drills
- "Inject failures online"
- Users outnumber testers
- Real, deep scenarios
[Figure labels: Requests, Customers, Test, Administrators]
13
A Missing Piece
Employee: "Boss, let's inject failures online using Chaos Monkey."
Boss: "Hmm …"
"Dear beloved customers, thank you for trusting our services, but we accidentally lost your data because of the failure drills that we ran..."
14
Future of Failure Drill
- Current drill: a team of engineers standing by
- Drill-ready clouds
15
Drill-Ready Cloud Computing
- Automatic failure drills and automatic cancellation
- In a safe, efficient, and easy manner
- Ideally, no engineering effort required
16
Drill-Ready Cloud Computing
[Figure labels: Administrator, Drill Spec, Drill-Ready System, Drill Mode]
- Drill spec: kill 25%; if it disrupts, revert back
Drill-ready cloud computing: systems take care of failure injection and cancellation
17
Outline
- Safety
- Efficiency
- Usability
- Generality
18
Safety
- Learn about failure implications without suffering through them
- Learn whether data can be lost
  - But do not lose the data
- Learn whether the SLA can be violated
  - But do not violate it for a long time
19
Safety Solutions
- Normal and drill states
[Figure label: Not drill aware]
20
Safety Solutions
- Normal and drill states
  - "Maintaining 2 states": a normal topology and a drill topology
  - Revert back to the normal state easily
Normal and drill states: the first and most important requirement for drill-ready clouds
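A minimal sketch of "maintaining 2 states", under the assumption that a drill only changes a separate drill view of the topology and never touches the normal one. The class and method names (`Cluster`, `start_drill`, `cancel_drill`, `active_topology`) are hypothetical and do not come from the slides.

```python
class Cluster:
    """Keeps a normal topology and, during a drill, a separate drill topology."""

    def __init__(self, nodes):
        self.normal_topology = set(nodes)  # never modified by a drill
        self.drill_topology = None         # None means no drill is running

    def start_drill(self, nodes_to_fail):
        # Build the drill view as a new set so the normal view stays intact.
        self.drill_topology = self.normal_topology - set(nodes_to_fail)

    def cancel_drill(self):
        # Reverting is just discarding the drill view.
        self.drill_topology = None

    def active_topology(self):
        if self.drill_topology is not None:
            return self.drill_topology
        return self.normal_topology


cluster = Cluster(["n1", "n2", "n3", "n4"])
cluster.start_drill(["n4"])
during = sorted(cluster.active_topology())
cluster.cancel_drill()
after = sorted(cluster.active_topology())
print(during, after)
```

Because the normal topology is never mutated, cancellation costs nothing more than dropping the drill view, which is the "revert back easily" property the slide asks for.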
21
Safety Solutions
- Drill state isolation
- Self cancellation
  - Real failures can occur during the drill
  - Drill master and drill agents
  - The drill master commands the agents
  - What if there is a network partition? Agents are left in a limbo state
  - Agents self-cancel when they cannot contact the master
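The self-cancellation idea can be sketched as a timeout on the agent side. This is an illustration only: the `DrillAgent` class, its method names, and the heartbeat-with-timeout mechanism are assumptions, since the slides do not say how an agent detects that the master is unreachable. Time is passed in explicitly so the example runs without a real clock.

```python
class DrillAgent:
    """Hypothetical agent that cancels its drill if the master goes silent."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.in_drill = False
        self.last_contact = None

    def start_drill(self, now):
        self.in_drill = True
        self.last_contact = now

    def on_master_heartbeat(self, now):
        # Any message from the drill master counts as contact.
        self.last_contact = now

    def tick(self, now):
        # Called periodically; leave drill mode if the master has been
        # unreachable for longer than the timeout (e.g. a network partition).
        if self.in_drill and now - self.last_contact > self.timeout_s:
            self.in_drill = False
        return self.in_drill


agent = DrillAgent(timeout_s=5)
agent.start_drill(now=0)
agent.on_master_heartbeat(now=3)
still_drilling = agent.tick(now=7)   # 4 s since last contact
after_silence = agent.tick(now=9)    # 6 s since last contact
print(still_drilling, after_silence)
```

The agent never waits for a cancel command, so a partition cannot leave it in limbo for longer than the timeout.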
22
Safety Solutions
- Drill state isolation
- Self cancellation
- Safe drill specification
  - Drill spec: What failures? How long? Cancellation conditions? Etc.
  - Example: kill 25%; if the SLA is violated, revert back
Safe drill specification: check whether the specification can run safely
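One possible reading of "check whether the specification can run safely" is a static check run before the drill starts. The sketch below is hedged accordingly: the `DrillSpec` fields and the three rules in `check_spec` are illustrative assumptions, not rules stated in the slides.

```python
from dataclasses import dataclass, field


@dataclass
class DrillSpec:
    kill_fraction: float   # e.g. 0.25 for "kill 25%"
    max_duration_s: float  # how long the drill may run
    cancel_conditions: list = field(default_factory=list)


def check_spec(spec):
    """Return a list of reasons the spec looks unsafe (empty list = passes)."""
    problems = []
    if not 0 < spec.kill_fraction < 1:
        problems.append("kill_fraction must be between 0 and 1")
    if spec.max_duration_s <= 0:
        problems.append("max_duration_s must be positive")
    if not spec.cancel_conditions:
        problems.append("at least one cancellation condition is required")
    return problems


safe = check_spec(DrillSpec(0.25, 600, ["sla_violated"]))
unsafe = check_spec(DrillSpec(0.25, 600))
print(safe, unsafe)
```

A real checker would presumably also reason about replication and capacity; those checks are omitted here because the slides give no detail on them.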
23
Efficiency
- Failures trigger data migration
- Monetary cost
  - Bandwidth
  - Storage space
- System performance
  - Affects users
24
Efficiency Solutions
- Low-overhead drill setup and cleanup
  - Do we need to do a real key re-balance?
  - Depends on the objective of the test
  - Yes, if we want to see the impact of background re-balancing
[Figure: key ranges [1-10], [11-20], [21-30], [31-40], [41-50], [51-60], with sub-ranges [11-15], [16-20], [41-45], [46-50] being re-balanced. Labels: Read / Write data; SLA okay?]
25
Efficiency Solutions
- Low-overhead drill setup and cleanup
  - Do we need to do a real key re-balance?
  - Depends on the objective of the test
  - No, if we only want to measure performance when we lose 2 nodes
[Figure: key ranges [1-15], [16-30], [31-45], [46-60]. Labels: Read / Write [46-60]; No key [11]; SLA okay?]
Low-overhead setup and cleanup: the cost depends on the drill objectives, so drill objectives must be part of the drill specification
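The key ranges in the two re-balance figures are consistent with splitting keys 1 to 60 evenly across six nodes and then across four. The sketch below reproduces that arithmetic; `split_keys` and `owner` are hypothetical helpers, and even range partitioning is an assumption about the example, not a statement about any particular storage system.

```python
def split_keys(first_key, last_key, num_nodes):
    """Split an inclusive key range into num_nodes contiguous, equal ranges."""
    total = last_key - first_key + 1
    if total % num_nodes != 0:
        raise ValueError("sketch assumes the keys divide evenly")
    size = total // num_nodes
    return [(first_key + i * size, first_key + (i + 1) * size - 1)
            for i in range(num_nodes)]


def owner(ranges, key):
    """Return the index of the range that contains key."""
    for index, (low, high) in enumerate(ranges):
        if low <= key <= high:
            return index
    raise KeyError(key)


six_nodes = split_keys(1, 60, 6)
four_nodes = split_keys(1, 60, 4)
print(six_nodes)
print(four_nodes)
print(owner(six_nodes, 11), owner(four_nodes, 11))
```

Key 11 changes owner between the two layouts. If the drill updates only the range map and moves no data, the new owner does not hold that key, which is the trade-off the "No key [11]" label points at.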
26
Efficiency Solutions
- Low-overhead drill setup and cleanup
- Cheap drill specification
  - Smarter and cheaper drill specifications
  - If replication is 50% correct, assume that the rest is correct
  - Stop halfway and report success
[Figure label: Replicating progress status]
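The "stop halfway and report success" rule can be sketched as an early exit over progress readings. The function name `run_replication_drill`, the 0.0 to 1.0 progress scale, and the returned tuple are assumptions made for illustration.

```python
def run_replication_drill(progress_samples, stop_at=0.5):
    """Consume replication progress readings (0.0 to 1.0) and stop early.

    Returns (result, readings_consumed, last_progress).
    """
    steps = 0
    last = 0.0
    for progress in progress_samples:
        steps += 1
        last = progress
        if progress >= stop_at:
            # Enough of the recovery has been observed; end the drill here.
            return ("success", steps, last)
    return ("incomplete", steps, last)


early = run_replication_drill([0.1, 0.3, 0.5, 0.8])
short = run_replication_drill([0.1, 0.2])
print(early, short)
```

The saving comes from the readings that are never consumed: in the first call the drill ends before the remaining replication work is done.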
27
Usability Solutions
- Declarative drill specification language
  - Need a declarative language
  - Describes results
  - Easy to read and write
Drill Specification
  During peak load
  Kill 5% machines
  If SLA violated > 1 mins
    Cancel the drill
  If recovery is 50% good
    Stop the drill
    Report success
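To show what a declarative specification buys, the example spec can be written as plain data and evaluated by a small interpreter. Everything here is a sketch: the dictionary layout, the metric names (`sla_violated_mins`, `recovery_good_pct`), the action strings, and reading "50% good" as "at least 50" are assumptions, not the language proposed by the authors.

```python
import operator

# The slide's example spec, encoded as data rather than code.
DRILL_SPEC = {
    "when": "peak_load",
    "inject": {"kill_machines_pct": 5},
    "rules": [
        {"if": ("sla_violated_mins", ">", 1), "then": "cancel"},
        {"if": ("recovery_good_pct", ">=", 50), "then": "stop_and_report_success"},
    ],
}

OPS = {">": operator.gt, ">=": operator.ge}


def decide(spec, metrics):
    """Return the action of the first rule whose condition holds, else 'continue'."""
    for rule in spec["rules"]:
        name, op, threshold = rule["if"]
        if OPS[op](metrics[name], threshold):
            return rule["then"]
    return "continue"


cancelled = decide(DRILL_SPEC, {"sla_violated_mins": 2, "recovery_good_pct": 10})
finished = decide(DRILL_SPEC, {"sla_violated_mins": 0, "recovery_good_pct": 60})
ongoing = decide(DRILL_SPEC, {"sla_violated_mins": 0, "recovery_good_pct": 10})
print(cancelled, finished, ongoing)
```

The administrator states only the desired outcomes; the order of the rules puts cancellation ahead of the success check, so an SLA violation wins if both conditions hold.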
28
Generality Solutions
- Elasticity drill
- Configuration change drill
- Software upgrade drill
- Security attack drill
29
Conclusion
- Drill-ready cloud computing
  - New reliability paradigm
- Sketching a first draft
- We want your FEEDBACK
30
Thank You
http://ucare.cs.uchicago.edu