Fabián E. Bustamante, Winter 2006 Understanding and dealing with operator mistakes in Internet services K. Nagaraja, F. Oliveira, R. Bianchini, R. Martin,


1 Fabián E. Bustamante, Winter 2006
Understanding and dealing with operator mistakes in Internet services
K. Nagaraja, F. Oliveira, R. Bianchini, R. Martin, T. Nguyen, Rutgers University
OSDI 2004
Vivo Project http://vivo.cs.rutgers.edu
(based on slides from the authors’ OSDI presentation)

2 CS 395/495 Autonomic Computing Systems EECS, Northwestern University
Motivation
Internet services are ubiquitous, e.g., Google, Yahoo!, eBay, etc.
–Expect 24 x 7 availability, but service outages still happen!
A significant number of outages in Internet services are the result of operator actions
1: Architecture is complex
2: Systems are constantly evolving
3: Lack of tools for operators to reason about the impact of their actions: offline testing, emulation, simulation
Very little detail on operator mistakes
–Details strongly guarded by companies and administrators

3 This work
Understanding: Gather detailed data on operators’ mistakes
–What categories of mistakes?
–What’s the impact on the service?
–How do mistakes correlate with experience, impact?
–Caveat: this is not a complete study of operator behavior
Approaches to deal with operator mistakes: prevention, recovery, automation
Validation: Allow operators to evaluate the correctness of their actions prior to exposing them to the service
–Like offline testing, but with: a virtual environment (extension of online environment); real workload; migration back and forth with minimal operator involvement

4 Contributions
Detailed information on operator tasks and mistakes
–43 experiments: detailed data on operator behavior, including 42 mistakes
–64% immediately degraded throughput
–57% were software configuration mistakes
–Human experiments are possible and valuable!
Designed and prototyped a validation infrastructure
–Implemented on 2 cluster-based services: a cooperative Web server (PRESS) and a multi-tier auction service
–2 techniques to allow operators to validate their actions
Demonstrated that validation is a promising technique for reducing the impact of operator mistakes
–66% of all mistakes observed in the operator study caught
–6/9 mistakes caught in live operator experiments with validation
–Successfully tested with synthetically injected mistakes

5 Talk outline
Approach and contributions
Operator study: Understanding the mistakes
–Representative environment
–Choice of human subjects and experiments
–Results
Validation: Preventing exposure of mistakes
Conclusion and future work

6 Multi-tiered Internet services
[Figure: three-tier service. Tier 1: Web Server; Tier 2: Application Servers; Tier 3: Database. A client emulator exercises the service.]
Code from the DynaServer project!
On-line auction service ~ eBay

7 Tasks, operators & training
Tasks – two categories
–Scheduled maintenance tasks (proactive), e.g., software upgrade
–Diagnose-and-repair tasks (reactive), e.g., disk failure
Operator composition
–14 computer science graduate students
–5 professional programmers (Ask Jeeves)
–2 sysadmins from our department
Categorization of operators – based on a filled-in questionnaire
–11 novices – some familiarity with the setup
–5 intermediates – experience with a similar service
–5 experts – in charge of a service requiring high uptime
Operator training
–Novice operators given warm-up tasks
–Material describing the service, and detailed steps for tasks

8 Experimental setup
Service
–3-tier auction service and client emulator from Rice University’s DynaServer Project
–Loaded at 35% of capacity
Machines
–2 Web servers (Apache)
–5 application servers (Tomcat)
–1 database machine (MySQL)
Operator assistance & data capture
–Monitor service throughput
–Modified bash shell for command and result trace
Manual observation
–Noting anomalies in operator behavior
–Bailing out ‘lost’ operators

9 Example trace
Task: Add an application server
–Mistake: Apache misconfiguration
–Impact: Degraded throughput
[Figure: service trace annotated with three events: application server added; first Apache misconfigured and restarted; second Apache misconfigured and restarted]

10 Sampling of other mistakes
Adding a new application server
–Omission of new application server from backend member list
–Syntax errors, duplicate entries, wrong hostnames
–Launching the wrong version of software
Migrating the database for performance upgrade
–Incorrect privileges for accessing the database (security vulnerability)
–Database installed on wrong disk

11 Operator mistakes: Category vs. impact
64% of all mistakes had immediate impact on service performance
–36% resulted in latent faults
Obs. #1: A significant number of mistakes can be caught by testing in a realistic environment
Obs. #2: Undetectable latent errors will still require online recovery techniques

12 Operator mistakes
Misconfigurations account for 57% of all errors
–Configuration mistakes spanning multiple components (global misconfigurations) are more likely
Obs. #1: Tools to manipulate & check configurations are crucial
Obs. #2: Multiple versions of software need careful maintenance

13 Operator categories
Experts also made mistakes!
–Complexity of tasks executed by experts was higher

14 Summary of operator study
43 experiments, 42 mistakes
27 (64%) mistakes caused immediate impact on service performance
24 (57%) were software configuration mistakes
Mistakes were made across all operator categories
Trace of operator commands & service performance for all experiments
–Available at http://vivo.cs.rutgers.edu
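
The percentages on the summary slide follow directly from the reported counts. A small check, assuming (as the category-vs-impact slide suggests) that every mistake without immediate impact counts as a latent fault; the helper name `percent` is illustrative only:

```python
TOTAL_MISTAKES = 42  # mistakes observed across the 43 experiments


def percent(count: int, total: int = TOTAL_MISTAKES) -> int:
    """Share of `total`, rounded to the nearest whole percent."""
    return round(100 * count / total)


immediate = percent(27)                    # mistakes with immediate impact
configuration = percent(24)                # software configuration mistakes
latent = percent(TOTAL_MISTAKES - 27)      # assumption: the rest are latent

print(immediate, configuration, latent)
```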

15 Talk outline
Approach and contributions
Operator study: Understanding the mistakes
Validation: Preventing exposure of mistakes
–Technique
–Experimental evaluation
Conclusion and future work

16 Validation of operator’s actions
Validation
–Allow the operator to check the correctness of his/her actions prior to exposing their impact to the service interface (clients)
–Correctness is tested by: migrating the component(s) to a virtual sandbox environment, subjecting them to a real load, and comparing their behavior to a known correct one
–Migrate back to the online environment
Types of validation:
–Replica-based: Compare with online replica (real time)
–Trace-based: Compare with logged behavior
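
The validation steps on this slide can be sketched in a few lines. This is a toy model under stated assumptions, not the authors' infrastructure: a component is modeled as a plain function from request to reply, and the names `Handler`, `ValidationReport`, `validate`, and `validate_then_expose` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List

Handler = Callable[[str], str]  # hypothetical: maps a request to a reply


@dataclass
class ValidationReport:
    mismatches: List[str] = field(default_factory=list)

    @property
    def passed(self) -> bool:
        return not self.mismatches


def validate(candidate: Handler, known_good: Handler,
             workload: Iterable[str]) -> ValidationReport:
    """Run the candidate in isolation and compare each reply to the reference."""
    report = ValidationReport()
    for request in workload:
        if candidate(request) != known_good(request):
            report.mismatches.append(request)
    return report


def validate_then_expose(candidate: Handler, known_good: Handler,
                         workload: Iterable[str],
                         expose: Callable[[Handler], None]) -> ValidationReport:
    """Call expose(candidate) only if validation found no mismatch."""
    report = validate(candidate, known_good, workload)
    if report.passed:
        expose(candidate)  # stands in for "migrate back to the online environment"
    return report
```

The point of the sketch is the ordering: the component is exposed to clients only after its behavior under load has matched the reference.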

17 Validating a component: Replica-based
[Figure: online slice and validation slice. Labels: Web Server (Tier 1), Application Servers (Tier 2), Database (Tier 3), Web Server Proxy, Database Proxy, Client Requests, Shunt, Compare, Application State]
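
A minimal sketch of the shunt-and-compare idea behind replica-based validation, assuming components can be modeled as functions from request to reply. The class `ReplicaShunt` and its fields are hypothetical names, not part of the authors' prototype.

```python
from typing import Callable, List, Tuple

Handler = Callable[[str], str]  # hypothetical: maps a request to a reply


class ReplicaShunt:
    """Duplicates each request to a component under validation (illustrative)."""

    def __init__(self, online: Handler, under_validation: Handler) -> None:
        self.online = online
        self.under_validation = under_validation
        # Each entry is (request, online reply, validation-slice reply).
        self.mismatches: List[Tuple[str, str, str]] = []

    def handle(self, request: str) -> str:
        expected = self.online(request)  # the reply the client actually sees
        try:
            actual = self.under_validation(request)
        except Exception as exc:  # a crash also counts as a mismatch
            actual = f"<error: {exc}>"
        if actual != expected:
            self.mismatches.append((request, expected, actual))
        return expected  # validation-slice output never reaches clients
```

Because the online replica always answers the client, a mistake in the validation slice shows up only in `mismatches`.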

18 Validating a component: Trace-based
[Figure: online slice and validation slice. Labels: Web Server (Tier 1), Application Servers (Tier 2), Database (Tier 3), Web Server Proxy, Database Proxy, Client Requests, Shunt, State, Compare]
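
Trace-based validation differs from the replica-based variant in that the reference replies come from a log rather than a live replica. A hedged sketch, again modeling a component as a function; `record_trace` and `replay_trace` are hypothetical helper names.

```python
from typing import Callable, List, Tuple

Handler = Callable[[str], str]   # hypothetical: maps a request to a reply
Trace = List[Tuple[str, str]]    # logged (request, reply) pairs


def record_trace(online: Handler, requests: List[str]) -> Trace:
    """Log the replies of a component that is known to behave correctly."""
    return [(request, online(request)) for request in requests]


def replay_trace(candidate: Handler, trace: Trace) -> List[str]:
    """Return the requests whose replayed reply differs from the logged one."""
    return [request for request, logged in trace if candidate(request) != logged]
```

Recording and replay are separated in time, so the candidate can be checked without a live replica to compare against.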

19 Implementation details
Shunting performed in middleware layer
–Each request tagged with a unique ID all along the request path
Component proxies can be constructed with little effort (MySQL proxy is ~384 NCSL (402 kNCSL))
–Reuse discovery and communication interfaces, common messaging core
State management requires well-defined export and import API
–Stateful servers often support such an API
Comparator functions to detect errors
–Simple throughput, flow, and content comparators
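
To make request tagging and comparators concrete, here is a small sketch. Everything in it is an assumption for illustration: the counter-based IDs, the dictionary layout (request ID to reply body), and the 10% throughput tolerance are not taken from the prototype.

```python
import itertools
from dataclasses import dataclass
from typing import Dict, List

_next_id = itertools.count(1)  # illustrative ID source; any unique ID would do


@dataclass(frozen=True)
class TaggedRequest:
    request_id: int
    payload: str


def tag(payload: str) -> TaggedRequest:
    """Attach a unique ID so replies from both slices can be paired up."""
    return TaggedRequest(next(_next_id), payload)


def content_mismatches(expected: Dict[int, str], actual: Dict[int, str]) -> List[int]:
    """Content comparator: IDs whose reply is missing or differs."""
    return sorted(rid for rid, body in expected.items() if actual.get(rid) != body)


def throughput_ok(reference_rps: float, observed_rps: float,
                  tolerance: float = 0.10) -> bool:
    """Throughput comparator: accept at most `tolerance` relative degradation."""
    return observed_rps >= reference_rps * (1 - tolerance)
```

Pairing replies by ID rather than by arrival order keeps the content comparator correct even when the two slices answer in different orders.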

20 Validating our prototype: results
Live operator experiments
–Operator given the option of type of validation, duration, and whether to skip validation
–Validation caught 6 out of 9 mistakes from 8 experiments with validation
Mistake-injection experiments
–Validation caught errors in data content (inaccessible files, corrupted files) and configuration mistakes (incorrect # of workers in Web server degraded throughput)
Operator-emulation experiments
–Operator command scripts derived from the 42 operator mistakes
–Both trace-based and replica-based validation caught 22 mistakes
Multi-component validation caught 4 latent (component interaction) mistakes

21 Reduction in impact with validation

22 Fewer mistakes with validation

23 Shunting & buffering overheads
Shunting overhead for replica-based validation: 39% additional CPU
–All requests and responses are captured and forwarded to validation slice
–Trace-based validation is slightly better: 32% additional CPU
–Overhead is incurred on a single component, and only during validation
Various optimizations can reduce overhead to 13-22%
–Examples: response summary (64 bytes), sampling (session boundaries)
Buffering capacity during state checkpointing and duplication
–Required to buffer only about 150 requests for small state sizes
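
The two optimizations named on this slide can be illustrated as follows. The slide does not say how the 64-byte summary is computed; SHA-512 is used here only because its digest happens to be 64 bytes, and the every-Nth-session sampling policy is likewise an assumption.

```python
import hashlib


def summarize(response: bytes) -> bytes:
    """Fixed-size stand-in for a full response (SHA-512 is an assumed choice)."""
    return hashlib.sha512(response).digest()


def sample_session(session_number: int, every: int) -> bool:
    """Decide at a session boundary whether the whole session is shunted."""
    return session_number % every == 0


# Comparing summaries instead of bodies means only 64 bytes per reply
# need to be forwarded to the comparator.
same = summarize(b"<html>item 42</html>") == summarize(b"<html>item 42</html>")
```

Sampling whole sessions, rather than individual requests, keeps each shunted session self-consistent for a stateful component.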

24 Caveats, limitations & open issues
Non-determinism increases complexity of comparators and proxies
–E.g., choice of back-end server, remote cache vs. local disk, pseudo-random session ID, time stamps
Hard state management may require operator intervention
–Component requires initialization prior to online migration
Bootstrapping the validation
–Validating an intended modification of service behavior – nothing to compare with!
How long to validate? What types of validation?
–Time spent in validation implies reduced online capacity
Future work: Taking validation further…
–Validate operator actions on databases, network components
–Combine validation with diagnosis for assisting operators
–Other validation techniques: model-based validation
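
One way a content comparator might cope with non-deterministic fields is to mask them before comparing. The sketch below assumes specific textual formats for session IDs and time stamps (`sessionid=<hex>` and `YYYY-MM-DD HH:MM:SS`); these formats and the function names are hypothetical, chosen only to show the idea.

```python
import re

# Assumed field formats; a real service would need its own patterns.
_SESSION_ID = re.compile(r"sessionid=[0-9a-f]+", re.IGNORECASE)
_TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")


def normalize(response: str) -> str:
    """Replace fields that legitimately differ between two correct replies."""
    response = _SESSION_ID.sub("sessionid=<ID>", response)
    return _TIMESTAMP.sub("<TIME>", response)


def same_modulo_nondeterminism(a: str, b: str) -> bool:
    """Content comparison that ignores the masked fields."""
    return normalize(a) == normalize(b)
```

Every masked field is one more thing the comparator cannot check, which is part of the added complexity the slide mentions.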

