Evaluating Undo: Human-Aware Recovery Benchmarks
Aaron Brown, with Leonard Chung, Calvin Ling, and William Kakes
January 2004 ROC Retreat
Slide 2 Recap: ROC Undo
We have developed & built a ROC Undo Tool
– a recovery tool for human operators
– lets operators take a system back in time to undo damage, while preserving end-user work
We have evaluated its feasibility via performance and overhead benchmarks
Now we must answer the key question:
– does Undo-based recovery improve dependability?
Slide 3 Approach: Recovery Benchmarks
Recovery benchmarks measure the dependability impact of recovery
– behavior of system during recovery period
– speed of recovery
[figure: performability timeline – normal behavior, fault/error injection, performability impact (performance, correctness), recovery time, recovery complete]
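The two quantities in the timeline above (recovery time and performability impact) can be made concrete with a minimal sketch. The sample format, the performability definition (throughput weighted by correctness), and all names below are illustrative assumptions for this transcript, not definitions from the original benchmark.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Sample:
    t: float            # seconds since the start of the run
    throughput: float   # requests/sec observed over this interval
    correctness: float  # fraction of responses handled correctly (0.0-1.0)

def performability(s: Sample) -> float:
    # One illustrative choice: delivered throughput weighted by correctness.
    return s.throughput * s.correctness

def recovery_metrics(trace: List[Sample], t_fault: float, baseline: float,
                     tolerance: float = 0.95) -> Tuple[Optional[float], float]:
    """Return (recovery_time, performability_impact) for one run.

    recovery_time: seconds from fault injection until performability is back
    within `tolerance` of the pre-fault baseline (None if it never recovers).
    performability_impact: performability lost relative to the baseline over
    that window (a simple discrete integral).
    """
    recovery_time, lost, prev_t = None, 0.0, t_fault
    for s in (x for x in trace if x.t >= t_fault):   # samples after injection
        p = performability(s)
        lost += max(0.0, baseline - p) * (s.t - prev_t)
        prev_t = s.t
        if p >= tolerance * baseline:
            recovery_time = s.t - t_fault
            break
    return recovery_time, lost
```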
Slide 4 What About the People?
Existing recovery/dependability benchmarks ignore the human operator
– inappropriate for Undo, where the human drives recovery
To measure Undo, we need benchmarks that capture human-driven recovery
– by including people in the benchmarking process
Slide 5 Outline
Introduction
Methodology
– overview
– faultload development
– managing human subjects
Evaluation of Undo
Discussion and conclusions
Slide 6 Methodology
Combine traditional recovery benchmarks with human user studies
– apply workload and faultload
– measure system behavior during recovery from faults
– run multiple trials with a pool of human subjects acting as system operators
Benchmark measures the system, not the humans
– indirectly captures human aspects of recovery
» quality of situational awareness, applicability of tools, usability & error-proneness of recovery procedures
Slide 7 Human-Aware Recovery Benchmarks
Key components
– workload: reuse performance benchmark
– faultload: survey plus cognitive walkthrough
– metrics: performance, correctness, and availability
– human operators: handle non-self-healing recovery
[figure: performability timeline – normal behavior, fault/error injection, performability impact (performance, correctness), recovery time, recovery complete]
Slide 8 Developing the Faultload
ROC approach combines surveys and cognitive walkthrough
– surveys to establish common failure modes, symptoms, and error-prone administrative tasks
» domain-specific, system-independent
– cognitive walkthrough to translate to a system-specific faultload
Faultload specifies generic errors and events
– provides system-independence, broader applicability
– cognitive walkthrough maps to system-specific faults
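As an illustration of the split this slide describes between generic, survey-derived faults and their system-specific injection, here is a minimal sketch; the class names, record fields, and example entry are assumptions made for this transcript, not artifacts of the ROC work.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

@dataclass(frozen=True)
class GenericFault:
    """System-independent faultload entry, derived from the admin survey."""
    name: str
    category: str   # e.g. "configuration", "deployment/upgrade", "other"
    symptom: str    # what operators reported observing

@dataclass
class Faultload:
    """Binds generic faults to system-specific injectors (the walkthrough's output)."""
    faults: Dict[str, GenericFault] = field(default_factory=dict)
    injectors: Dict[str, Callable[[], None]] = field(default_factory=dict)

    def bind(self, fault: GenericFault, injector: Callable[[], None]) -> None:
        self.faults[fault.name] = fault
        self.injectors[fault.name] = injector

    def inject(self, name: str) -> None:
        self.injectors[name]()   # trigger the system-specific simulation

# Hypothetical example: the failed-upgrade fault developed on a later slide,
# bound to a placeholder injector for a sendmail-on-Linux test system.
faultload = Faultload()
faultload.bind(
    GenericFault(name="upgrade-disables-filtering",
                 category="deployment/upgrade",
                 symptom="spam filtering silently stops after an upgrade"),
    injector=lambda: print("simulate an upgrade built without milter support"),
)
```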
Slide 9 Example: E-mail Service Faultload
Web-based survey of e-mail admins
– core questions:
» “Describe any incidents in the past 3 months where data was lost or the service was unavailable.”
» “Describe any administrative tasks you performed in the past 3 months that were particularly challenging.”
– cost: 4 x $50 gift certificates to amazon.com
» raffled off as incentive for participation
– response: 68 respondents from the SAGE mailing list
Slide 10 E-mail Survey Results
[chart: Common Tasks (151 total), Challenging Tasks (68 total), and Lost e-mail problems (12 total), each broken down into configuration, deployment/upgrade, and other causes, with an undoable vs. non-undoable split]
– results dominated by
» configuration errors (e.g., mail filters)
» botched software/platform upgrades
» hardware & environmental failures
– Undo potentially useful for majority of problems
Slide 11 From Survey to Faultload
Cognitive walkthrough example: SW upgrade
– platform: sendmail on Linux
– task: upgrade from sendmail-8.12.9 to sendmail-8.12.10
– approach:
1. configure/locate existing sendmail-on-Linux system
2. clone system to a test machine (or use a virtual machine)
3. attempt upgrade, identifying possible failure points
» benchmarker must understand the system to do this
4. simulate failures and select those that match symptom reports from the task survey
– sample result: simulate a failed upgrade that disables spam filtering by omitting the -DMILTER compile-time flag
Slide 12 Human-Aware Recovery Benchmarks
Key components
– workload: reuse performance benchmark
– faultload: survey plus cognitive walkthrough
– metrics: performance, correctness, and availability
– human operators: handle non-self-healing recovery
[figure: performability timeline – normal behavior, fault/error injection, performability impact (performance, correctness), recovery time, recovery complete]
Slide 13 Human Subject Protocol
Benchmarks structured as human trials
Protocol
– human subject plays the role of system operator
– subjects complete multiple sessions
– in each session:
» apply workload to test system
» select random scenario and simulate problem
» give human subject 30 minutes to complete recovery
Results reflect statistical average across subjects
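A minimal sketch of one session under this protocol is shown below, assuming hypothetical hooks for injecting the chosen scenario, checking whether the service has recovered, and sampling system behavior; none of these names come from the slides.

```python
import random
import time
from typing import Callable, Dict, List, Sequence

SESSION_LIMIT_SECONDS = 30 * 60   # the protocol's 30-minute recovery window

def run_session(subject_id: str,
                scenarios: Sequence[str],
                inject: Callable[[str], None],
                recovered: Callable[[], bool],
                sample_metrics: Callable[[], Dict[str, float]],
                poll_seconds: float = 10.0) -> Dict[str, object]:
    """Drive one session: pick a scenario, simulate the problem, and observe
    the system while the human subject attempts recovery."""
    scenario = random.choice(list(scenarios))     # "select random scenario"
    inject(scenario)                              # "simulate problem"
    start = time.monotonic()
    samples: List[Dict[str, float]] = []
    while time.monotonic() - start < SESSION_LIMIT_SECONDS:
        samples.append(sample_metrics())          # system behavior during recovery
        if recovered():
            break
        time.sleep(poll_seconds)
    return {"subject": subject_id,
            "scenario": scenario,
            "recovery_seconds": time.monotonic() - start,
            "samples": samples}
```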
Slide 14 The Variability Challenge
Must control human variability to get reproducible, meaningful results
Techniques
– subject pool selection
– screening
– training
– self-comparison
» each subject faces same recovery scenario on all systems
» system’s score determined by fraction of subjects with better recovery behavior
» powerful, but only works for comparison benchmarks
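The self-comparison scoring rule (“fraction of subjects with better recovery behavior”) can be read as follows; the exact comparison the authors used is not spelled out on the slide, so the function below is one plausible interpretation with hypothetical names and invented example numbers.

```python
from typing import Dict

def self_comparison_scores(results: Dict[str, Dict[str, float]],
                           lower_is_better: bool = True) -> Dict[str, float]:
    """Score each system by how often a subject's runs on other systems beat it.

    `results` maps subject -> {system: behavior value}, where every subject
    faced the same recovery scenario on every system.  Averaging the
    per-subject comparisons cancels much of the person-to-person skill
    variation, which is what makes the protocol usable with small pools.
    """
    totals: Dict[str, float] = {}
    counts: Dict[str, int] = {}
    for runs in results.values():
        for system, value in runs.items():
            others = [v for s, v in runs.items() if s != system]
            if not others:
                continue
            if lower_is_better:
                better = sum(1 for v in others if v < value)
            else:
                better = sum(1 for v in others if v > value)
            totals[system] = totals.get(system, 0.0) + better / len(others)
            counts[system] = counts.get(system, 0) + 1
    return {s: totals[s] / counts[s] for s in totals}

# Invented example: recovery time in minutes, per subject and per system.
example = {"s1": {"with-undo": 8.0, "without-undo": 14.0},
           "s2": {"with-undo": 11.0, "without-undo": 9.0}}
print(self_comparison_scores(example))   # lower score = fewer better runs elsewhere
```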
Slide 15 Outline
Introduction
Methodology
Evaluation of Undo
– setup
– per-subject results
– aggregate results
Discussion and conclusions
Slide 16 Evaluating Undo: Setup
Faultload scenarios
1. SPAM filter configuration error
2. failed e-mail server upgrade
3. simple software crash (Undo not useful here)
Subject pool (after screening)
– 12 UCB Computer Science graduate students
Self-comparison protocol
– each subject given the same scenario in each of 2 sessions
» Undo available in first session only
» imposes learning bias against Undo, but lowers variability
Slide 17 Sample Single User Result
Undo significantly improves correctness
– with some (partially-avoidable) availability cost
[figure: single-subject results, without Undo vs. with Undo]
Slide 18 Overall Evaluation
Undo significantly improves correctness
– and reduces variance across operators
– statistically justified, p-value 0.045
Undo hurts IMAP availability
– several possible workarounds exist
Overall, Undo has a positive impact on dependability
[figure: results for sessions where Undo was used]
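The slide does not say which statistical test produced the p-value of 0.045. Purely as an illustration of how a paired, per-subject comparison like the self-comparison protocol could be checked, here is a Wilcoxon signed-rank test over invented numbers (SciPy required); neither the test choice nor the data comes from the study.

```python
# Illustration only: the test choice and all numbers below are assumptions,
# not the study's data or analysis.
from scipy.stats import wilcoxon

# Per-subject counts of incorrectly handled messages, paired by subject.
without_undo = [14, 9, 22, 17, 11, 30, 12, 19, 25]
with_undo    = [ 2, 0,  5,  3,  1,  8,  2,  4,  6]

stat, p = wilcoxon(without_undo, with_undo)   # paired, non-parametric
print(f"Wilcoxon signed-rank statistic={stat}, p={p:.3f}")
```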
Slide 19 Outline
Introduction
Methodology
Evaluation of Undo
Discussion and conclusions
Slide 20 Discussion
Undo-based recovery improves dependability
– reduces incorrectly-handled mail in common failure cases
More can still be done
– tweaks to Undo implementation will reduce availability impact
Benchmark methodology is effective at controlling human variability
– self-comparison protocol gives statistically-justified results with 9 subjects (vs 15+ for random design)
Slide 21 Future Directions: Controlling Cost
Human subject experiments are still costly
– recruiting and compensating participants
– extra time spent on training, multiple benchmark runs
– extra demands on benchmark infrastructure
– less than a user study, more than a perf. benchmark
A necessary price to pay!
Techniques for cost reduction
– best-case results using best-of-breed operator
– remote web-based participation
– avoid human trials: extended cognitive walkthrough
Evaluating Undo: Human-Aware Recovery Benchmarks
For more info:
– abrown@cs.berkeley.edu
– http://roc.cs.berkeley.edu/
– paper: A. Brown, L. Chung et al. “Dependability Benchmarking of Human-Assisted Recovery Processes.” Submitted to DSN 2004, June 2004.
Backup Slides
Slide 24 Example: E-mail Service Faultload
Results of e-mail task survey
Lost E-mail (12 reports):
– Configuration problems (25%)
– Hardware/Env’t (17%)
– Upgrade-related (17%)
– Operator error (8%)
– User error (8%)
– External resource (8%)
– Software error (8%)
– Unknown (8%)
Challenging Tasks (68 total):
– Filter Installation (37%)
– Platform Change/Upgrade (26%)
– Config. (13%)
– Architecture Changes (7%)
– Tool Dev. (6%)
– Other (6%)
– User Ed. (4%)
Slide 25 Full Summary Dataset