Fabián E. Bustamante, Winter 2006 Understanding and dealing with operator mistakes in Internet services K. Nagaraja, F. Oliveira, R. Bianchini, R. Martin,


Understanding and dealing with operator mistakes in Internet services
K. Nagaraja, F. Oliveira, R. Bianchini, R. Martin, T. Nguyen, Rutgers University
OSDI 2003, Vivo Project
Fabián E. Bustamante, Winter 2006
(based on slides from the authors’ OSDI presentation)

CS 395/495 Autonomic Computing Systems, EECS, Northwestern University

Motivation
Internet services are ubiquitous, e.g., Google, Yahoo!, eBay
– Users expect 24x7 availability, but service outages still happen!
A significant number of outages in Internet services are the result of operator actions
1: Architecture is complex
2: Systems are constantly evolving
3: Lack of tools for operators to reason about the impact of their actions: offline testing, emulation, simulation
Very little detail on operator mistakes
– Details strongly guarded by companies and administrators

This work
Understanding: Gather detailed data on operators’ mistakes
– What categories of mistakes?
– What’s the impact on the service?
– How do mistakes correlate with experience, impact?
– Caveat: this is not a complete study of operator behavior
Approaches to deal with operator mistakes: prevention, recovery, automation
Validation: Allow operators to evaluate the correctness of their actions prior to exposing them to the service
– Like offline testing, but:
  Virtual environment (extension of online environment)
  Real workload
  Migration back and forth with minimal operator involvement

Contributions
Detailed information on operator tasks and mistakes
– 43 experiments: detailed data on operator behavior, including 42 mistakes
– 64% immediately degraded throughput
– 57% were software configuration mistakes
– Human experiments are possible and valuable!
Designed and prototyped a validation infrastructure
– Implemented on 2 cluster-based services: a cooperative Web server (PRESS) and a multi-tier auction service
– 2 techniques to allow operators to validate their actions
Demonstrated that validation is a promising technique for reducing the impact of operator mistakes
– 66% of all mistakes observed in the operator study caught
– 6/9 mistakes caught in live operator experiments with validation
– Successfully tested with synthetically injected mistakes

Talk outline
Approach and contributions
Operator study: Understanding the mistakes
– Representative environment
– Choice of human subjects and experiments
– Results
Validation: Preventing exposure of mistakes
Conclusion and future work

Multi-tiered Internet services
On-line auction service, similar to eBay (code from the DynaServer project)
[Figure: a client emulator exercises the service. Tier 1: Web server. Tier 2: application servers. Tier 3: database.]

Tasks, operators & training
Tasks: two categories
– Scheduled maintenance tasks (proactive), e.g., upgrading software
– Diagnose-and-repair tasks (reactive), e.g., disk failure
Operator composition
– 14 computer science graduate students
– 5 professional programmers (Ask Jeeves)
– 2 sysadmins from our department
Categorization of operators, based on a filled-in questionnaire
– 11 novices: some familiarity with the setup
– 5 intermediates: experience with a similar service
– 5 experts: in charge of a service requiring high uptime
Operator training
– Novice operators given warm-up tasks
– Material describing the service, and detailed steps for tasks

Experimental setup
Service
– 3-tier auction service and client emulator from Rice University’s DynaServer Project
– Loaded at 35% of capacity
Machines
– 2 Web servers (Apache)
– 5 application servers (Tomcat)
– 1 database machine (MySQL)
Operator assistance & data capture
– Monitor service throughput
– Modified bash shell for command and result trace
Manual observation
– Noting anomalies in operator behavior
– Bailing out ‘lost’ operators

Example trace
Task: Add an application server
– Mistake: Apache misconfiguration
– Impact: Degraded throughput
[Figure annotations: application server added; first Apache misconfigured and restarted; second Apache misconfigured and restarted.]

Sampling of other mistakes
Adding a new application server
– Omission of new application server from backend member list
– Syntax errors, duplicate entries, wrong hostnames
– Launching the wrong version of software
Migrating the database for performance upgrade
– Incorrect privileges for accessing the database
  Security vulnerability
– Database installed on wrong disk

Operator mistakes: Category vs. impact
64% of all mistakes had immediate impact on service performance
– 36% resulted in latent faults
Obs. #1: A significant number of mistakes can be caught by testing in a realistic environment
Obs. #2: Undetectable latent errors will still require online recovery techniques

Operator mistakes
Misconfigurations account for 57% of all errors
– Configuration mistakes spanning multiple components are more likely (global misconfigurations)
Obs. #1: Tools to manipulate and check configurations are crucial
Obs. #2: Care is needed when maintaining multiple versions of software

Operator categories
Experts also made mistakes!
– Complexity of tasks executed by experts was higher

Summary of operator study
43 experiments, 42 mistakes
27 (64%) mistakes caused immediate impact on service performance
24 (57%) were software configuration mistakes
Mistakes were made across all operator categories
Trace of operator commands & service performance for all experiments
– Available at
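The percentages in this summary follow directly from the counts. A short check, using only the numbers on the slide (the variable and function names are illustrative, not from the study):

```python
# Counts taken from the summary slide; names are illustrative only.
TOTAL_MISTAKES = 42
IMMEDIATE_IMPACT = 27
CONFIG_MISTAKES = 24


def share(part: int, whole: int) -> float:
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole


immediate_pct = share(IMMEDIATE_IMPACT, TOTAL_MISTAKES)
config_pct = share(CONFIG_MISTAKES, TOTAL_MISTAKES)
# Mistakes without immediate impact are the latent ones.
latent_pct = share(TOTAL_MISTAKES - IMMEDIATE_IMPACT, TOTAL_MISTAKES)

print(round(immediate_pct), round(config_pct), round(latent_pct))
```

The latent share (15 of 42) rounds to the 36% quoted on the category-vs-impact slide.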

Talk outline
Approach and contributions
Operator study: Understanding the mistakes
Validation: Preventing exposure of mistakes
– Technique
– Experimental evaluation
Conclusion and future work

Validation of operator’s actions
Validation
– Allow operator to check correctness of his/her actions prior to exposing their impact to the service interface (clients)
– Correctness is tested by:
  Migrating the component(s) to a virtual sandbox environment,
  Subjecting them to a real load,
  Comparing behavior to a known correct one, and
  Migrating back to the online environment
Types of validation:
– Replica-based: Compare with online replica (real time)
– Trace-based: Compare with logged behavior
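The four steps above can be sketched as a small control loop. This is a toy model under stated assumptions, not the authors' infrastructure: `Component`, `validate`, `replica_reference`, and `trace_reference` are hypothetical names, a component is reduced to a request-to-response function, and a failing component is simply left in the validation slice.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable

Handler = Callable[[str], str]


@dataclass
class Component:
    """Hypothetical stand-in for a service component."""
    name: str
    handler: Handler
    slice: str = "online"


def validate(component: Component, load: Iterable[str], reference: Handler) -> bool:
    """Run the component in the validation slice and compare with a reference."""
    component.slice = "validation"      # step 1: migrate to the sandbox
    for request in load:                # step 2: subject it to the load
        if component.handler(request) != reference(request):
            return False                # step 3: mismatch, keep it out of service
    component.slice = "online"          # step 4: migrate back
    return True


def replica_reference(replica: Component) -> Handler:
    """Replica-based: the reference is a known-good online replica."""
    return replica.handler


def trace_reference(trace: Dict[str, str]) -> Handler:
    """Trace-based: the reference is previously logged behavior."""
    return lambda request: trace[request]


load = ["bid", "browse"]
replica = Component("app-0", str.upper)
good = Component("app-1", str.upper)
bad = Component("app-2", str.lower)     # simulates a misconfigured component
trace = {"bid": "BID", "browse": "BROWSE"}

good_ok = validate(good, load, replica_reference(replica))
bad_ok = validate(bad, load, trace_reference(trace))
```

The only difference between the two validation types in this sketch is where the reference response comes from, which matches the slide's distinction between comparing with an online replica and comparing with logged behavior.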

Validating a component: Replica-based
[Figure: the service (Tier 1 Web server, Tier 2 application servers, Tier 3 database) is split into an online slice and a validation slice. Labels: client requests, shunt, Web server proxy, database proxy, application state, compare.]

Validating a component: Trace-based
[Figure: the service (Tier 1 Web server, Tier 2 application servers, Tier 3 database) is split into an online slice and a validation slice. Labels: client requests, shunt, Web server proxy, database proxy, state, compare.]

Implementation details
Shunting performed in middleware layer
– Each request is tagged with a unique ID all along the request path
Component proxies can be constructed with little effort (MySQL proxy is ~384 NCSL (402K NCSL))
– Reuse discovery and communication interfaces, common messaging core
State management requires a well-defined export and import API
– Stateful servers often support such an API
Comparator functions to detect errors
– Simple throughput, flow, and content comparators
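To make request tagging and comparators concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the function names, the use of a counter for IDs, and the 10% throughput tolerance are not taken from the prototype.

```python
import itertools
from typing import Dict, List, Tuple

_request_ids = itertools.count(1)


def tag(request: str) -> Tuple[int, str]:
    """Attach a unique ID so responses from both slices can be paired up."""
    return next(_request_ids), request


def content_mismatches(online: Dict[int, str], validation: Dict[int, str]) -> List[int]:
    """Content comparator: IDs whose validation response differs or is missing."""
    return sorted(rid for rid, body in online.items() if validation.get(rid) != body)


def throughput_ok(online_rps: float, validation_rps: float, tolerance: float = 0.1) -> bool:
    """Throughput comparator: validation may lag by at most `tolerance` (assumed 10%)."""
    return validation_rps >= online_rps * (1.0 - tolerance)


first = tag("GET /item/1")
second = tag("GET /item/2")
online_responses = {first[0]: "item 1", second[0]: "item 2"}
validation_responses = {first[0]: "item 1", second[0]: "error"}
mismatched = content_mismatches(online_responses, validation_responses)
```

Pairing by ID rather than by arrival order is what lets a comparator tolerate requests completing in a different order in the two slices.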

Validating our prototype: results
Live operator experiments
– Operator given the option to choose the type and duration of validation, or to skip validation
– Validation caught 6 out of 9 mistakes from 8 experiments with validation
Mistake-injection experiments
– Validation caught errors in data content (inaccessible files, corrupted files) and configuration mistakes (an incorrect number of workers in the Web server degraded throughput)
Operator-emulation experiments
– Operator command scripts derived from the 42 operator mistakes
– Both trace-based and replica-based validation caught 22 mistakes
  Multi-component validation caught 4 latent (component interaction) mistakes

Reduction in impact with validation

Fewer mistakes with validation

Shunting & buffering overheads
Shunting overhead for replica-based validation: 39% additional CPU
– All requests and responses are captured and forwarded to the validation slice
– Trace-based validation is slightly better: 32% additional CPU
– Overhead is incurred on a single component, and only during validation
Various optimizations can reduce overhead to 13-22%
– Examples: response summary (64-byte), sampling (session boundaries)
Buffering capacity during state checkpointing and duplication
– Required to buffer only about 150 requests for small state sizes
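One way to picture the "response summary" optimization is to forward a fixed-size digest instead of the full response body. The slides do not say how the 64-byte summary is computed; the sketch below assumes a SHA-512 digest purely because it happens to be 64 bytes long, and the function names are hypothetical.

```python
import hashlib

SUMMARY_BYTES = 64  # size quoted on the slide


def summarize(response: bytes) -> bytes:
    """Reduce a response of any size to a 64-byte summary (assumed: SHA-512)."""
    return hashlib.sha512(response).digest()


def matches(online_summary: bytes, validation_response: bytes) -> bool:
    """Compare a forwarded summary against the validation slice's own response."""
    return online_summary == summarize(validation_response)


page = b"<html>" + b"x" * 10_000 + b"</html>"
summary = summarize(page)
```

Under this assumption the shunt sends 64 bytes per response regardless of page size, at the cost of only detecting that two responses differ, not where.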

Caveats, limitations & open issues
Non-determinism increases complexity of comparators and proxies
– E.g., choice of back-end server, remote cache vs. local disk, pseudo-random session-id, time stamps
Hard state management may require operator intervention
– Component requires initialization prior to online migration
Bootstrapping the validation
– Validating an intended modification of service behavior: nothing to compare with!
How long to validate? What types of validation?
– Duration spent in validation implies reduced online capacity
Future work: Taking validation further…
– Validate operator actions on databases, network components
– Combine validation with diagnosis for assisting operators
– Other validation techniques: Model-based validation
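The non-determinism caveat can be illustrated with a content comparator that masks fields expected to differ between slices. This is a sketch only: the `sessionid=<hex>` and ISO-style timestamp formats are assumptions made for the example, not formats used by the services in the study.

```python
import re

# Assumed formats for the non-deterministic fields named on the slide.
_SESSION_ID = re.compile(r"sessionid=[0-9a-f]+")
_TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}")


def normalize(response: str) -> str:
    """Replace session IDs and timestamps with fixed placeholders."""
    response = _SESSION_ID.sub("sessionid=<masked>", response)
    return _TIMESTAMP.sub("<timestamp>", response)


def equivalent(online: str, validation: str) -> bool:
    """Content comparison that ignores the masked fields."""
    return normalize(online) == normalize(validation)


online_page = "sessionid=3fa9 generated 2003-12-08T10:15:00 price=42"
validation_page = "sessionid=b771 generated 2003-12-08T10:15:02 price=42"
wrong_page = "sessionid=b771 generated 2003-12-08T10:15:02 price=41"
```

Each additional masked field makes the comparator more tolerant but also widens what a mistake can hide behind, which is one reason the slide lists non-determinism as added complexity.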