1 CSSE 377 – Intro to Availability & Reliability Part 2 Steve Chenoweth Tuesday, 9/13/11 Week 2, Day 2 Right – Pictorial view of how to achieve high availability through duplication of resources. Or is it instead a picture of how not to try using resources for some different activity?
2 Today Tactics for software availability engineering… –Bass’s Ch 5 (pp ) Project 2, part 2 – tonight Biweekly quiz – second half hour Thursday HW 2 (individual)
3 Availability Tactics Try one of these 3 Strategies: –Fault detection –Fault recovery –Fault prevention See next slides for details on each
4 Fault Detection Strategy – Recognize when things are going sour: Ping/echo – Ok – A central monitor checks resource availability Heartbeat – Ok – The resources report this automatically Exceptions – Not ok – Someone gets negative reporting (often at low level, then “escalated” if serious) Right – Everyone likes early fault detection. In hardware systems, multivariate analysis is used to isolate the source of deviations in system performance.
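A minimal sketch of the first two detection tactics, to make the difference concrete: with ping/echo the monitor asks, with heartbeat the resource reports on its own. The names `HeartbeatMonitor`, `ping_all`, the `echo` callback and the timeout value are all hypothetical, chosen for illustration only; a real monitor would talk over a network.

```python
import time


class HeartbeatMonitor:
    """Heartbeat: each resource reports in; silence past the timeout means trouble."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock          # injectable so the sketch can be tested
        self.last_seen = {}         # resource name -> time of last heartbeat

    def beat(self, resource):
        # Called by the resource itself ("I am alive").
        self.last_seen[resource] = self.clock()

    def failed(self):
        # Resources whose last heartbeat is older than the timeout.
        now = self.clock()
        return sorted(r for r, t in self.last_seen.items()
                      if now - t > self.timeout_s)


def ping_all(resources, echo):
    """Ping/echo: the central monitor asks each resource and records the reply.

    `echo` is a caller-supplied function (an assumption of this sketch) that
    returns True when the resource answers. An exception counts as no answer.
    """
    status = {}
    for r in resources:
        try:
            status[r] = bool(echo(r))
        except Exception:
            status[r] = False
    return status
```

The exception handling inside `ping_all` is a small instance of the third tactic: a low-level failure is caught and turned into a negative report that a higher level can act on.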
5 Fault Recovery - Preparation Strategy – Plan what to do when things go sour: Voting – Analyze which is faulty Active redundancy (hot backup) – Multiple resources with instant switchover Passive redundancy (warm backup) – Backup needs time to take over a role Spare – A very cool backup, but lets 1 box back up many different ones
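The voting tactic can be sketched as a simple majority vote over the outputs of redundant components. This is one possible voter under stated assumptions, not the book's algorithm: `majority_vote` and the replica names are hypothetical, and outputs are assumed to be hashable values that can be compared for equality.

```python
from collections import Counter


def majority_vote(outputs):
    """Return (winning_value, suspects) for a dict of replica name -> output.

    Replicas that disagree with the majority are reported as suspects.
    Raises ValueError when no value has a strict majority.
    """
    if not outputs:
        raise ValueError("no outputs to vote on")
    counts = Counter(outputs.values())
    value, votes = counts.most_common(1)[0]
    if votes * 2 <= len(outputs):
        raise ValueError("no strict majority")
    suspects = sorted(name for name, out in outputs.items() if out != value)
    return value, suspects
```

For example, `majority_vote({"A": 42, "B": 42, "C": 41})` accepts 42 and flags replica `C` as the one to investigate.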
6 Fault Recovery - Reintroduction Strategy – Do the recovery of a failed component - carefully: Shadow operation – Watch it closely as it comes back up, let it “pretend” to operate State resynchronization – Restore missing data – Often a big problem! –Special mode to resynch before it goes “live” –Problem of multiple machines with partial data Checkpoint/rollback – Verify it’s in a consistent state
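A minimal in-memory sketch of checkpoint/rollback: save a known-consistent copy of the state, try an update, and fall back to the copy if the update crashes or leaves the state inconsistent. The class `CheckpointedState` and the `is_consistent` callback are hypothetical; a real system would persist checkpoints rather than keep them in memory.

```python
import copy


class CheckpointedState:
    """Keeps the last known-consistent copy of some mutable state."""

    def __init__(self, state):
        self.state = state
        self._checkpoint = copy.deepcopy(state)

    def checkpoint(self):
        # Record the current state as the new known-good copy.
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        # Discard the current state and restore the known-good copy.
        self.state = copy.deepcopy(self._checkpoint)

    def apply(self, update, is_consistent):
        """Run update(state) in place; roll back on a crash or a bad result."""
        try:
            update(self.state)
        except Exception:
            self.rollback()
            return False
        if not is_consistent(self.state):
            self.rollback()
            return False
        self.checkpoint()
        return True
```

The consistency check is the important design choice here: a new checkpoint is only taken once the state has been verified, so a rollback never restores something that was already broken.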
7 Fault Prevention (in book) Runtime Strategy – Don’t even let it happen! Removal from service – Other components decide to take one out of service if it’s “close to failure” Transactions – Ensure consistency across servers. “ACID” model* is: –Atomicity –Consistency –Isolation –Durability Process monitor – Make a new instance (like of a process) *ACID Model - See for example
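To illustrate the atomicity part of ACID, here is a small sketch using Python's standard `sqlite3` module. It uses a single in-memory database, so it does not show consistency across servers; the `accounts` table and the helpers `make_accounts`, `balances` and `transfer` are hypothetical names made up for this example.

```python
import sqlite3


def make_accounts(initial):
    """Create an in-memory database with one row per account (sketch only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", list(initial.items()))
    conn.commit()
    return conn


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts"))


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; either both updates happen or neither."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
            (left,) = conn.execute("SELECT balance FROM accounts WHERE name = ?",
                                   (src,)).fetchone()
            if left < 0:
                raise ValueError("insufficient funds")
        return True
    except ValueError:
        return False
```

When the check fails, both updates have already been executed, yet neither is visible afterwards: the rollback undoes the whole unit of work.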
8 Fault Prevention (not in book) Construction Strategy – spend time on the software that’s most critical to availability. Let’s assume you have a fixed amount of time for developing the software. Divide the components into 3 classes: –Gold – The top feature, starting the system, backup & recovery, software needed for testing, … –Silver – Other key features –Bronze – Everything else Spend almost all your time achieving quality on the Gold!
9 Hardware basics Know your availability model! But which one do you really have? Two components in series (a1 feeding a2): $A = a_1 \cdot a_2$. Two components in parallel (a1 alongside a2): $A = 1 - (1 - a_1)(1 - a_2)$. Three components in parallel (a1, a2, a3): $A = 1 - (1 - a_1)(1 - a_2)(1 - a_3)$.
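The availability models on this slide come in two forms: the plain product (components in series, where all must work) and one minus the product of the unavailabilities (redundant components in parallel, where any one suffices). A short sketch that evaluates both; the function names `series` and `parallel` are just labels chosen here, and the formulas assume components fail independently.

```python
from math import prod


def series(*a):
    """All components must be up: A = a1 * a2 * ..."""
    return prod(a)


def parallel(*a):
    """At least one component must be up: A = 1 - (1 - a1)(1 - a2)..."""
    return 1 - prod(1 - x for x in a)
```

With two components at 0.99 each, `series(0.99, 0.99)` gives about 0.9801 while `parallel(0.99, 0.99)` gives about 0.9999, which is why it matters which model you really have.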
10 Interesting observations In duplicated systems, most crashes occur when one part already is down – why? Most software testing, for a release, is done until the system runs without severe errors for some designated period of time (Figure: number of failures plotted against time, marking the predicted time when the target is reached.)
11 What’s next on Project 2? Continuing this project, –Determine the availability of the current system, and –Implement a tactic to improve it by a designated amount! And a next step to take today: Decide on a strategy to test the current availability of the system (or features). Some “stimulator,” etc., that you can build over the weekend. Pick a tactic which you believe you can implement to improve on it. –Pick a method from Bass’s Ch5, specifically in the categories of fault detection, recovery or prevention, and be particular about it. Turn it in, in your team journal, by 11:55 PM tonight.
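One possible shape for such a test driver, offered as a sketch rather than the required approach: call the system repeatedly and report the fraction of calls that succeed. The function `measure_availability` and the `probe` callback are hypothetical; note that this measures per-request success, which is only one way to define availability (uptime over total time is another).

```python
def measure_availability(probe, attempts):
    """Call probe() `attempts` times; return the fraction of calls that succeeded.

    A call succeeds when probe() returns a truthy value without raising.
    """
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    ok = 0
    for _ in range(attempts):
        try:
            if probe():
                ok += 1
        except Exception:
            pass  # a raised exception counts as a failed request
    return ok / attempts
```

Running the same driver before and after implementing a tactic gives two numbers to compare against the designated improvement.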
12 Warning – you’re looking for problems speculatively Not every idea is a good one – just ask Zog from the Far Side…