The Case for Drill-Ready Cloud Computing
  Vision Paper
  Tanakorn Leesatapornwongsa and Haryadi S. Gunawi
Cloud Services
  Cheap
  Convenient
  Reliable
Yahoo Mail Disruption
  Hardware failures
  Wrong failover
  Disruptions
    – Some users could not access the service
    – Some users saw wrong notifications
    – Several days to recover
Outlook Disruption
  Hardware failures
    – Caching server
  Failed over to backend servers correctly
  Requests flooded the servers
  Service went down
  Microsoft needed to change its software infrastructure
Cloud Outages
  Outage       | Root Event        | Supposedly tolerable failure | Incorrect Recovery      | Major Outage
  Amazon EBS   | Network misconfig | Network partition            | Re-mirroring storm      | Clusters collapsed
  Gmail        | Upgrade event     | Servers offline              | Bad request routing     | All routing servers down
  App Engine   | Power failure     | 25 % machines offline        | Bad failover            | All user apps were degraded
  Skype        | Overload          | 30 % nodes failed            | Positive feedback loop  | Almost all nodes failed
  Google Drive | Network bug       | Network offline              | Timeout during failover | 33 % requests affected
  Outlook      | Caching failure   | Failover to backend          | Request flooding        | 7-hour outage
  Yahoo Mail   | Hardware failures | Servers offline              | Buggy failover          | 1 % of users affected
Journey of Cloud Dependability Research
Fault-Tolerant Systems
  Complex failures
  Hard to handle and implement correctly
  Recovery protocols are very complex
  Recovery code is one of the buggiest parts
Offline Testing
  Thoroughly verify recovery mechanism
Offline Testing
  Fault injection, model checking, stress testing, etc.
  “Mini cluster” that represents production runs
  Testing and production environments differ
    – Cluster, workload, failure
  [Figure: mini cluster with test workload vs. production run with real workload]
Offline Testing
  Orders of magnitude different in scale
    – Facebook used 100 machines to mimic a 3000-machine production run [2011]
  Small start-ups forgo the luxury
    – Many tests are much smaller than this
Diagnosis
  Helps administrators pinpoint and reproduce the causes of outages
  BUT
    – Post-mortem: does not prevent disruptions
    – Passive approach: waits for outages to happen before diagnosing
Online Testing and Failure Drills
  “Inject failures online”
  Users outnumber testers
  Real, deep scenarios
  [Figure: customers (requests) and administrators (test)]
A Missing Piece
  Employee: “Boss, let's inject failures online using Chaos Monkey.”
  Boss: “Hmm …”
  “Dear beloved customers, thank you for trusting our services, but we accidentally lost your data because the failure drills that we run...”
Future of Failure Drill
  Current drill: a team of engineers standing by
  Future: drill-ready clouds
Drill-Ready Cloud Computing
  Automatic failure drill and automatic cancellation
  In a safe, efficient, and easy manner
  Ideally, no engineering effort required
Drill-Ready Cloud Computing
  [Figure: an administrator gives a drill spec to a drill-ready system in drill mode]
  Drill Spec
    – Kill 25 %
    – If it disrupts, revert back
  Drill-ready cloud computing: systems take care of failure injection and cancellation
Outline
  Safety
  Efficiency
  Usability
  Generality
Safety
  Learn about failure implications without suffering through them
  Learn whether data can be lost
    – But do not lose the data
  Learn whether the SLA can be violated
    – But do not violate it for a long time
Safety Solutions
  Normal and drill states
  [Figure: a system that is not drill aware]
Safety Solutions
  Normal and drill states
    – “Maintaining 2 states”: normal topology and drill topology
    – Revert back to normal state easily
  Normal and drill states: the first and most important thing for drill-ready clouds
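The “maintaining 2 states” idea can be illustrated with a minimal sketch. The slides do not describe an implementation, so the class and method names below are hypothetical; the only point is that a drill changes a separate drill view, so reverting means discarding that view.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass
class DrillAwareCluster:
    """Hypothetical cluster view that keeps a normal and a drill topology."""
    normal_topology: FrozenSet[str]
    drill_topology: Optional[FrozenSet[str]] = None  # None: no drill running

    def start_drill(self, nodes_to_kill):
        # Only the drill view changes; the normal view stays untouched.
        self.drill_topology = self.normal_topology - frozenset(nodes_to_kill)

    def cancel_drill(self):
        # Reverting is cheap: drop the drill view.
        self.drill_topology = None

    def active_topology(self):
        if self.drill_topology is not None:
            return self.drill_topology
        return self.normal_topology


cluster = DrillAwareCluster(frozenset({"n1", "n2", "n3", "n4"}))
cluster.start_drill({"n2"})
during_drill = cluster.active_topology()
cluster.cancel_drill()
after_cancel = cluster.active_topology()
```

In this sketch the "killed" node n2 disappears only from the drill view, and cancelling the drill restores the full node set without any repair step.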
Safety Solutions
  Drill state isolation
  Self cancellation
    – Real failures can happen during the drill
    – Drill master and drill agents
    – The drill master commands the agents
    – What if the network partitions? Agents are left in a limbo state
    – Agents cancel the drill themselves when they cannot contact the master
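Self cancellation can be sketched as a timeout on the agent side. This is an illustration under assumptions, not the authors' protocol: the `DrillAgent` class, the command strings, and the timeout value are all hypothetical, and time is passed in explicitly so the logic is easy to follow.

```python
class DrillAgent:
    """Hypothetical drill agent that leaves drill mode if the master goes silent."""

    def __init__(self, timeout_s, now):
        self.timeout_s = timeout_s
        self.in_drill = False
        self.last_contact = now

    def on_master_message(self, command, now):
        # Any message from the drill master counts as contact.
        self.last_contact = now
        if command == "start":
            self.in_drill = True
        elif command == "cancel":
            self.in_drill = False

    def tick(self, now):
        """Called periodically; returns True if the agent cancelled the drill itself."""
        if self.in_drill and now - self.last_contact > self.timeout_s:
            self.in_drill = False  # do not stay in limbo: revert to normal state
            return True
        return False


agent = DrillAgent(timeout_s=5.0, now=0.0)
agent.on_master_message("start", now=1.0)
still_running = not agent.tick(now=4.0)    # 3 s of silence: keep drilling
self_cancelled = agent.tick(now=7.0)       # 6 s of silence: cancel locally
```

The agent needs no message from the master to cancel, which is what makes the scheme usable when a real network partition happens during the drill.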
Safety Solutions
  Drill state isolation
  Self cancellation
  Safe drill specification
    – Drill specification: What failures? How long? Cancellation conditions, etc.
    – Example: Kill 25 %. If SLA is violated, revert back
    – Safe drill specification: check whether the specification can run safely
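The slides do not say what a safety check on a specification looks like. One possible check, shown here purely as an assumption, is to refuse a drill whose kill set would leave some data item with too few live replicas. The function name and the placement map are hypothetical.

```python
def spec_is_safe(placement, nodes_to_kill, min_live_replicas=1):
    """Return True if every key keeps at least min_live_replicas replicas
    outside the set of nodes the drill wants to kill (assumed safety rule)."""
    killed = set(nodes_to_kill)
    for replicas in placement.values():
        if len(set(replicas) - killed) < min_live_replicas:
            return False
    return True


# Hypothetical replica placement: key -> nodes holding a copy.
placement = {
    "a": ["n1", "n2", "n3"],
    "b": ["n2", "n3", "n4"],
}

safe_small_drill = spec_is_safe(placement, ["n1"])
unsafe_large_drill = spec_is_safe(placement, ["n2", "n3", "n4"])
```

A real check would also have to cover duration and cancellation conditions; this sketch covers only the "what failures?" part of the specification.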
Efficiency
  Failures trigger data migration
  Monetary cost
    – Bandwidth
    – Storage space
  System performance
    – Affects users
Efficiency Solutions
  Low-overhead drill setup and cleanup
    – Do we need to do real key re-balance?
    – Depends on the objective of the test
    – Yes, if we want to see the impact of background re-balance
  [Figure: key ranges [1-10], [11-20], [21-30], [31-40], [41-50], [51-60]; sub-ranges [11-15], [16-20], [41-45], [46-50]; Read / Write data; SLA okay?]
Efficiency Solutions
  Low-overhead drill setup and cleanup
    – Do we need to do real key re-balance?
    – Depends on the objective of the test
    – No, if we want to measure performance when we lose 2 nodes
  [Figure: key ranges [1-15], [16-30], [31-45], [46-60]; Read / Write [46-60]; No key [11]; SLA okay?]
  Low-overhead setup and cleanup: the cost depends on the drill objectives, and drill objectives must be part of the drill specification
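The key ranges in the two figures are consistent with keys 1 to 60 split evenly over six nodes and then over four. A minimal sketch of computing the drill-time ranges as metadata only, without moving any data, is below. The slides do not name a partitioning scheme, so the even split and the function name are assumptions.

```python
def even_ranges(first_key, last_key, num_nodes):
    """Split [first_key, last_key] into num_nodes contiguous ranges
    of near-equal size (assumed partitioning scheme)."""
    total = last_key - first_key + 1
    base, extra = divmod(total, num_nodes)
    ranges = []
    start = first_key
    for i in range(num_nodes):
        size = base + (1 if i < extra else 0)  # spread any remainder
        ranges.append((start, start + size - 1))
        start += size
    return ranges


normal_ranges = even_ranges(1, 60, 6)  # six live nodes
drill_ranges = even_ranges(1, 60, 4)   # same keys, two nodes "lost"
```

Here `drill_ranges` comes out as [1-15], [16-30], [31-45], [46-60], matching the second figure. Only the range table changes, which is the cheap path when the drill objective does not include re-balance traffic.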
Efficiency Solutions
  Low-overhead drill setup and cleanup
  Cheap drill specification
    – Smarter and cheaper drill specification
    – If replication is 50 % correct, assume that the rest is correct
    – Stop halfway and report success
  [Figure: replication progress status]
Usability Solutions
  Declarative drill specification language
    – Need a declarative language
    – Describe results
    – Easy to read and write
  Drill Specification
    During peak load
      Kill 5% of machines
    If SLA violated > 1 min
      Cancel the drill
    If recovery is 50% good
      Stop the drill
      Report success
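The example specification can be read as data plus two rules. The sketch below encodes it that way; the slides do not define a concrete language, so the `DrillSpec` fields, the action strings, and the choice to check cancellation before success are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrillSpec:
    """Hypothetical encoding of the example drill specification."""
    when: str                            # condition for starting the drill
    kill_fraction: float                 # fraction of machines to kill
    cancel_after_sla_violation_s: float  # cancel if SLA is violated longer than this
    success_recovery_fraction: float     # stop and report success at this progress


def next_action(spec, sla_violated_for_s, recovery_fraction):
    # Cancellation is checked first so that safety wins over early success.
    if sla_violated_for_s > spec.cancel_after_sla_violation_s:
        return "cancel"
    if recovery_fraction >= spec.success_recovery_fraction:
        return "report success"
    return "continue"


# Values taken from the example: peak load, 5% of machines, 1 minute, 50%.
SLIDE_SPEC = DrillSpec("peak load", 0.05, 60.0, 0.5)

early = next_action(SLIDE_SPEC, sla_violated_for_s=0.0, recovery_fraction=0.2)
bad = next_action(SLIDE_SPEC, sla_violated_for_s=61.0, recovery_fraction=0.9)
done = next_action(SLIDE_SPEC, sla_violated_for_s=10.0, recovery_fraction=0.5)
```

The administrator states only conditions and outcomes; how failures are injected and reverted is left to the system, which is the point of a declarative specification.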
Generality Solutions
  Elasticity drill
  Configuration change drill
  Software upgrade drill
  Security attack drill
Conclusion
  Drill-ready cloud computing
    – New reliability paradigm
  Sketching a first draft
  We want your FEEDBACK
Thank You