Published by Dylan Hensley. Modified over 9 years ago.
1
Operating 24x7
Amin Vahdat, on behalf of John Jannotti, Jeff Mogul, Larry Peterson, Joe Touch, Paulo Verissimo, Werner Vogels, Bill Weihl
2
24x7 Availability: Goals
Holistic approach
  Not just individual computers, but services
  Need to consider operators, etc.
Sustainability (24x7 for how long?)
Need to handle a variety of failure models
  Understand what is and what is not correlated
Real-time, noisy, chaotic environment
3
24x7 Availability: Goals
Self-configuration
Evolvability
Managing the availability/consistency tradeoff
  We live in a probabilistic world
Monitoring needs to be built in from the ground up
Predict and quantify the cost of delivering certain levels of availability
  Including management, auditing, etc.
  With infinite cost, operating 24x7 is easy
4
New Models
Fault-tolerant software
  BFT (Byzantine fault tolerance) is insufficient because it assumes independent failures
  Multi-version programming is insufficient
    e.g., all versions working from the same bad spec
100k nodes running more or less the same thing
  Extremely tolerant of hardware faults
  But what if traffic causes the software to fail?
    Bohr bug (deterministic, reproducible)
No spare capacity in the current power grid
  Interference is another problem in the power grid
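The point about independence can be made concrete with a small calculation. The sketch below is illustrative only: the replica count, the per-replica failure probability `p`, and the common-mode probability `q` are assumed numbers, not figures from the slides. It compares the chance that more than `f` of `n` replicas fail when failures are independent against the case where a shared bug (for example, one triggered by the same input on every node) takes down all replicas at once.

```python
from math import comb

def prob_more_than_f_fail(n, f, p):
    """P(more than f of n replicas fail) when each fails independently with probability p."""
    at_most_f = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(f + 1))
    return 1.0 - at_most_f

def prob_with_common_mode(n, f, p, q):
    """Same quantity, but with probability q a shared bug fails every replica at once."""
    return q + (1 - q) * prob_more_than_f_fail(n, f, p)

# Hypothetical configuration: 4 replicas tolerating 1 fault (n = 3f + 1).
independent = prob_more_than_f_fail(4, 1, 0.01)
correlated = prob_with_common_mode(4, 1, 0.01, 0.01)
print(f"independent: {independent:.6f}, with common-mode bug: {correlated:.6f}")
```

Under these assumed numbers the common-mode term dominates the result, which is the sense in which replication alone does not help when all nodes run more or less the same software.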
5
Dealing with Attacks
Techniques to divert the traffic (/dev/null it)
Isolate the attack traffic toward sacrificial machines
Distinguish attack traffic from non-attack traffic
Legal and financial models are the primary technique for fighting attacks
Distinguishing humans from bots
Contracts distinguish between internal failures and acts of God/war
6
Living with Failure
Services must behave within expectations even when individual components fail
  Graceful degradation
Probabilistic reasoning, statistical models
  Statistical guarantees given failure models
Must express assumptions about system behavior
  Expressing assumptions can be very difficult
  Mapping high-level system behavior to failure scenarios
MTTR is just as important as MTTF
Tail (99.9%) of the response curve must be within bounds
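The last two points can be illustrated with the standard steady-state availability formula, availability = MTTF / (MTTF + MTTR). The sketch below uses hypothetical numbers that are not from the slides. It shows that halving MTTR buys the same availability as doubling MTTF, and adds a nearest-rank percentile check for the tail of a latency distribution; the function names and the latency bound are assumptions made for illustration.

```python
import math

def availability(mttf_hours, mttr_hours):
    """Steady-state availability: fraction of time the service is up."""
    return mttf_hours / (mttf_hours + mttr_hours)

def percentile(samples, p):
    """Nearest-rank percentile of a non-empty list, for p in (0, 100]."""
    ordered = sorted(samples)
    # Round before ceil so float noise (e.g. 999.0000000000001) does not bump the rank.
    rank = math.ceil(round(p * len(ordered) / 100, 9))
    return ordered[max(rank, 1) - 1]

def tail_within_bound(latencies_ms, p, bound_ms):
    """True if the p-th percentile latency does not exceed the bound."""
    return percentile(latencies_ms, p) <= bound_ms

# Hypothetical baseline: fail every 1000 hours, take 2 hours to repair.
base = availability(1000.0, 2.0)
double_mttf = availability(2000.0, 2.0)  # failures half as frequent
half_mttr = availability(1000.0, 1.0)    # repairs twice as fast
print(base, double_mttf, half_mttr)

# Hypothetical latency sample: mostly fast, with two slow outliers.
latencies = [10] * 998 + [500, 900]
print(tail_within_bound(latencies, 99.9, 600))
```

Note that the mean of the latency sample hides the outliers entirely, which is why the bound is stated on the tail rather than on the average.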
7
Evolvability
Easier for centralized services, much more difficult in distributed environments
Before deploying a new version, the old version must remain available so it can be redeployed quickly
  What if a database schema update was required?
Special-case answers in some scenarios
  Tunneling in networks
Huge amount of resources dedicated to test and development
Regimented versus ad hoc environments
  Do you value reliability or innovation?
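The requirement to keep the old version ready for redeployment can be sketched as a deploy step that retains the previous version and reverts if a health check fails. This is a minimal illustration, not a method described in the slides: the `Deployer` class, the version strings, and the `health_check` callback are all hypothetical, and the sketch ignores the harder case raised above, where a schema change makes the old version incompatible with the new data.

```python
class Deployer:
    """Tracks the active version and keeps the previous one for fast rollback."""

    def __init__(self, initial_version):
        self.active = initial_version
        self.previous = None

    def deploy(self, new_version, health_check):
        """Switch to new_version; revert if health_check(new_version) is false."""
        self.previous = self.active
        self.active = new_version
        if not health_check(new_version):
            self.rollback()
            return False
        return True

    def rollback(self):
        """Reactivate the retained previous version."""
        if self.previous is None:
            raise RuntimeError("no previous version retained")
        self.active, self.previous = self.previous, None

# Hypothetical usage: v2 fails its health check, v3 passes.
d = Deployer("v1")
ok_v2 = d.deploy("v2", lambda version: False)
after_failed_deploy = d.active
ok_v3 = d.deploy("v3", lambda version: True)
print(ok_v2, after_failed_deploy, ok_v3, d.active)
```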
8
Sustainability
Operating 24x7 for how many weeks?
Economic incentives
Decentralized control can lead to longer-term system reliability
  The Internet partially succeeded because of decentralization
  Decentralization may help with evolvability, though it can cut both ways
9
Infrastructure Support
Virtualization
  Exporting appropriate failure models
Fault injection
  Dependent/independent failures
What is the minimal set of nodes required to predict the behavior of a much larger-scale system?
Evaluation techniques in general
  Simulated or emulated environments
  Including error models
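As one way to picture fault injection in an emulated environment, the sketch below wraps calls to a service in an injector that fails a configurable fraction of them, then measures how often a retrying client still succeeds. Everything here is an assumption made for illustration: the `FaultInjector` class, the use of `ConnectionError` as the injected fault, the independent-failure error model, and the failure rates. A seeded random generator keeps runs repeatable.

```python
import random

class FaultInjector:
    """Fails each call independently with probability failure_rate."""

    def __init__(self, failure_rate, seed=0):
        self.failure_rate = failure_rate
        self.rng = random.Random(seed)  # seeded so experiments are repeatable

    def call(self, fn, *args):
        if self.rng.random() < self.failure_rate:
            raise ConnectionError("injected fault")
        return fn(*args)

def call_with_retries(injector, fn, arg, attempts):
    """Return fn(arg), retrying on injected faults; None if all attempts fail."""
    for _ in range(attempts):
        try:
            return injector.call(fn, arg)
        except ConnectionError:
            continue
    return None

def success_fraction(failure_rate, attempts, trials=10000, seed=0):
    """Fraction of trials in which the retrying client got an answer."""
    injector = FaultInjector(failure_rate, seed)
    successes = 0
    for i in range(trials):
        if call_with_retries(injector, lambda x: x + 1, i, attempts) is not None:
            successes += 1
    return successes / trials

print(success_fraction(0.5, 1), success_fraction(0.5, 3))
```

Replacing the independent error model with a dependent one (for example, faults that arrive in bursts) would be the natural next experiment, since the two can give very different results for the same average failure rate.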