Evaluating Undo: Human-Aware Recovery Benchmarks Aaron Brown with Leonard Chung, Calvin Ling, and William Kakes January 2004 ROC Retreat
Slide 2 Recap: ROC Undo We have developed & built a ROC Undo Tool – a recovery tool for human operators – lets operators take a system back in time to undo damage, while preserving end-user work We have evaluated its feasibility via performance and overhead benchmarks Now we must answer the key question: – does Undo-based recovery improve dependability?
Slide 3 Approach: Recovery Benchmarks Recovery benchmarks measure the dependability impact of recovery – behavior of the system during the recovery period – speed of recovery
[Figure: performability timeline — normal behavior, fault/error injection, recovery period with performability impact (performance, correctness), recovery time, recovery complete]
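To make the metrics in the figure concrete, here is a minimal sketch (not the actual ROC harness; the Sample structure, field names, and the 95% "recovered" threshold are assumptions) of how recovery time and the performability penalty could be computed from per-interval measurements taken during a run:

```python
# Hypothetical sketch of the recovery-benchmark metrics above: given time-ordered
# samples of throughput and correctness, estimate recovery time and the
# performability penalty relative to pre-fault behavior.
from dataclasses import dataclass

@dataclass
class Sample:
    t: float            # seconds since start of run
    throughput: float   # e.g., requests handled per second
    correctness: float  # fraction of requests handled correctly (0.0-1.0)

def recovery_metrics(samples, fault_time, baseline_throughput, ok_fraction=0.95):
    """Return (recovery_time, performability_penalty) for a fault injected at fault_time."""
    recovered_at = None
    penalty = 0.0
    prev_t = fault_time
    for s in samples:                      # samples assumed sorted by time
        if s.t < fault_time:
            continue                       # only the post-fault window counts
        dt = s.t - prev_t
        prev_t = s.t
        # Penalty accumulates lost, correctness-weighted throughput vs. baseline.
        penalty += max(0.0, baseline_throughput - s.throughput * s.correctness) * dt
        # Recovery is complete once performability returns near the baseline.
        if recovered_at is None and s.throughput * s.correctness >= ok_fraction * baseline_throughput:
            recovered_at = s.t
    recovery_time = (recovered_at - fault_time) if recovered_at is not None else float("inf")
    return recovery_time, penalty
```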
Slide 4 What About the People? Existing recovery/dependability benchmarks ignore the human operator – inappropriate for undo, where human drives recovery To measure Undo, we need benchmarks that capture human-driven recovery – by including people in the benchmarking process
Slide 5 Outline Introduction Methodology – overview – faultload development – managing human subjects Evaluation of Undo Discussion and conclusions
Slide 6 Methodology Combine traditional recovery benchmarks with human user studies – apply workload and faultload – measure system behavior during recovery from faults – run multiple trials with a pool of human subjects acting as system operators Benchmark measures system, not humans – indirectly captures human aspects of recovery » quality of situational awareness, applicability of tools, usability & error-proneness of recovery procedures
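The trial structure described above could be driven by a harness along these lines; this is an illustrative sketch only, and all function names (apply_workload, inject_fault, observe) are stand-ins for site-specific tooling rather than the actual ROC benchmark code:

```python
# Hypothetical harness loop: each session applies the workload, injects a fault
# scenario, lets the human subject drive recovery, and records system-level
# behavior for later scoring.
import random

def run_trial(subject_id, scenario, apply_workload, inject_fault, observe):
    apply_workload()                        # steady-state load on the test system
    inject_fault(scenario)                  # simulate the failure from the faultload
    samples = observe(duration_s=30 * 60)   # watch system behavior while the subject
                                            # performs recovery (30-minute cap)
    return {"subject": subject_id, "scenario": scenario, "samples": samples}

def run_benchmark(subjects, scenarios, **hooks):
    results = []
    for subject in subjects:
        scenario = random.choice(scenarios)  # random scenario per session
        results.append(run_trial(subject, scenario, **hooks))
    return results
```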
Slide 7 Human-Aware Recovery Benchmarks Key components – workload: reuse performance benchmark – faultload: survey plus cognitive walkthrough – metrics: performance, correctness, and availability – human operators: handle non-self-healing recovery tasks/tools
[Figure: performability timeline — normal behavior, fault/error injection, recovery period with performability impact (performance, correctness), recovery time, recovery complete]
Slide 8 Developing the Faultload ROC approach combines surveys and cognitive walkthrough – surveys to establish common failure modes, symptoms, and error-prone administrative tasks » domain-specific, system-independent – cognitive walkthrough to translate to system-specific faultload Faultload specifies generic errors and events – provides system-independence, broader applicability – cognitive walkthrough maps to system-specific faults
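One way to keep the faultload system-independent is to separate the generic errors from the per-system injections produced by the cognitive walkthrough. The sketch below is illustrative only; the structure and names are assumptions, and the sendmail entry echoes the -DMILTER example on the next slides:

```python
# Generic, system-independent error classes distilled from the survey.
GENERIC_FAULTLOAD = [
    {"id": "config-error",   "symptom": "mail misfiltered or rejected"},
    {"id": "failed-upgrade", "symptom": "feature silently lost after an upgrade"},
    {"id": "software-crash", "symptom": "service process exits unexpectedly"},
]

# Cognitive-walkthrough output for one target system (sendmail on Linux).
SENDMAIL_MAPPING = {
    "config-error":   "install an over-broad spam-filter rule in the MTA configuration",
    "failed-upgrade": "rebuild sendmail without milter support (omit -DMILTER), "
                      "silently disabling spam filtering",
    "software-crash": "kill the sendmail daemon process",
}

def concrete_faultload(generic, mapping):
    """Join generic errors with their system-specific injection descriptions."""
    return [dict(g, injection=mapping[g["id"]]) for g in generic if g["id"] in mapping]
```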
Slide 9 Example: Service Faultload Web-based survey of admins – core questions: » “Describe any incidents in the past 3 months where data was lost or the service was unavailable.” » “Describe any administrative tasks you performed in the past 3 months that were particularly challenging.” – cost: 4 x $50 gift certificate to amazon.com » raffled off as incentive for participation – response: 68 respondents from SAGE mailing list
Slide 10 Survey Results
[Chart: breakdown of common tasks (151 total), challenging tasks (68 total), and data-loss problems (12 total) by category — configuration, deployment/upgrade, other — and by whether the problem is undoable]
– results dominated by » configuration errors (e.g., mail filters) » botched software/platform upgrades » hardware & environmental failures – Undo potentially useful for majority of problems
Slide 11 From Survey to Faultload Cognitive walkthrough example: SW upgrade – platform: sendmail on Linux – task: upgrade the installed sendmail to a newer version – approach: 1. configure/locate an existing sendmail-on-Linux system 2. clone the system to a test machine (or use a virtual machine) 3. attempt the upgrade, identifying possible failure points » benchmarker must understand the system to do this 4. simulate failures and select those that match symptom reports from the task survey – sample result: simulate a failed upgrade that disables spam filtering by omitting the -DMILTER compile-time flag
Slide 12 Human-Aware Recovery Benchmarks Key components – workload: reuse performance benchmark – faultload: survey plus cognitive walkthrough – metrics: performance, correctness, and availability – human operators: handle non-self-healing recovery tasks/tools
[Figure: performability timeline — normal behavior, fault/error injection, recovery period with performability impact (performance, correctness), recovery time, recovery complete]
Slide 13 Human Subject Protocol Benchmarks structured as human trials Protocol – human subject plays the role of system operator – subjects complete multiple sessions – in each session: » apply workload to the test system » select a random scenario and simulate the problem » give the human subject 30 minutes to complete recovery Results reflect a statistical average across subjects
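Averaging across the subject pool could look like the following sketch (field names are hypothetical):

```python
# Aggregate per-session results into the benchmark score: mean and spread of
# recovery time and correctness across all subjects.
import statistics

def aggregate(sessions):
    times = [s["recovery_time_s"] for s in sessions]
    correctness = [s["correctness"] for s in sessions]
    return {
        "mean_recovery_time_s":  statistics.mean(times),
        "stdev_recovery_time_s": statistics.stdev(times) if len(times) > 1 else 0.0,
        "mean_correctness":      statistics.mean(correctness),
        "stdev_correctness":     statistics.stdev(correctness) if len(correctness) > 1 else 0.0,
    }
```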
Slide 14 The Variability Challenge Must control human variability to get reproducible, meaningful results Techniques – subject pool selection – screening – training – self-comparison » each subject faces same recovery scenario on all systems » system’s score determined by fraction of subjects with better recovery behavior » powerful, but only works for comparison benchmarks
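One reading of the self-comparison rule above is sketched below: each subject runs the same scenario on both systems, and a system is scored by the fraction of subjects whose recovery was better on it than on the alternative. The "better" predicate and field names are assumptions for illustration, not the study's actual scoring code:

```python
def self_comparison_score(paired_results, better):
    """paired_results: list of (outcome_on_system_under_test, outcome_on_alternative)."""
    wins = sum(1 for a, b in paired_results if better(a, b))
    return wins / len(paired_results)

# Example predicate: higher correctness wins; ties broken by faster recovery.
def better(a, b):
    if a["correctness"] != b["correctness"]:
        return a["correctness"] > b["correctness"]
    return a["recovery_time_s"] < b["recovery_time_s"]
```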
Slide 15 Outline Introduction Methodology Evaluation of Undo – setup – per-subject results – aggregate results Discussion and conclusions
Slide 16 Evaluating Undo: Setup Faultload scenarios 1. SPAM filter configuration error 2. failed server upgrade 3. simple software crash (undo not useful here) Subject pool (after screening) – 12 UCB Computer Science graduate students Self-comparison protocol – each subject given same scenario in each of 2 sessions » undo available in first session only » imposes learning bias against undo, but lowers variability
Slide 17 Sample Single User Result Undo significantly improves correctness – with some (partially-avoidable) availability cost
[Charts: single-subject recovery behavior, without Undo vs. with Undo]
Slide 18 Overall Evaluation Undo significantly improves correctness – and reduces variance across operators – statistically justified (p-value) Undo hurts IMAP availability – several possible workarounds exist Overall, Undo has a positive impact on dependability
[Chart: results across sessions where Undo was used]
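The slide does not say which statistical test was applied; for this within-subject, self-comparison design a paired non-parametric test such as the Wilcoxon signed-rank test would be one reasonable choice. The sketch below shows the mechanics with placeholder numbers, not data from the study:

```python
# Paired significance test for with-Undo vs. without-Undo correctness,
# one pair per subject. Values below are placeholders for illustration only.
from scipy.stats import wilcoxon

correctness_with_undo    = [0.98, 0.95, 0.99, 0.97, 0.96, 0.98, 0.99, 0.94, 0.97]
correctness_without_undo = [0.80, 0.70, 0.90, 0.65, 0.85, 0.75, 0.88, 0.60, 0.82]

stat, p_value = wilcoxon(correctness_with_undo, correctness_without_undo)
print(f"Wilcoxon signed-rank: statistic={stat:.1f}, p={p_value:.4f}")
```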
Slide 19 Outline Introduction Methodology Evaluation of Undo Discussion and conclusions
Slide 20 Discussion Undo-based recovery improves dependability – reduces incorrectly-handled mail in common failure cases More can still be done – tweaks to Undo implementation will reduce availability impact Benchmark methodology is effective at controlling human variability – self-comparison protocol gives statistically-justified results with 9 subjects (vs 15+ for random design)
Slide 21 Future Directions: Controlling Cost Human subject experiments are still costly – recruiting and compensating participants – extra time spent on training, multiple benchmark runs – extra demands on benchmark infrastructure – less than a user study, more than a perf. benchmark A necessary price to pay! Techniques for cost reduction – best-case results using best-of-breed operator – remote web-based participation – avoid human trials: extended cognitive walkthrough
Evaluating Undo: Human-Aware Recovery Benchmarks For more info: – paper: A. Brown, L. Chung et al. "Dependability Benchmarking of Human-Assisted Recovery Processes." Submitted to DSN 2004, June 2004.
Backup Slides
Slide 24 Example: Service Faultload Results of task survey
Challenging tasks (68 total): Filter Installation 37%, Platform Change/Upgrade 26%, Configuration 13%, Architecture Changes 7%, Tool Development 6%, Other 6%, User Education 4%
Data-loss incidents (12 reports): Configuration problems 25%, Hardware/Environment 17%, Upgrade-related 17%, Operator error 8%, User error 8%, External resource 8%, Software error 8%, Unknown 8%
Slide 25 Full Summary Dataset