A Simple Way to Estimate the Cost of Downtime Dave Patterson EECS Department University of California, Berkeley

Slides:



Advertisements
Similar presentations
Communication Transferring information from one person to another. Communication is used to instruct, clarify interpret, notify, warn, receive feedback,
Advertisements

Large-Scale Distributed Systems Andrew Whitaker CSE451.
A new standard in Enterprise File Backup. Contents 1.Comparison with current backup methods 2.Introducing Snapshot EFB 3.Snapshot EFB features 4.Organization.
Chapter 3 Program Management and Project Evaluation Professor Hossein Saiedian McGraw-Hill Education ISBN
Fabián E. Bustamante, Winter 2006 Recovery Oriented Computing Embracing Failure A. B. Brown and D. A. Patterson, Embracing failure: a case for recovery-
Key Challenges in Information Processing James Hamilton Microsoft SQL Server
Copyright ©2003 Digitask Consultants Inc., All rights reserved Storage Area Networks Digitask Seminar April 2000 Digitask Consultants, Inc.
Princeton PC Users Group Hard Drive Disaster! By Paul Kurivchack March 14, 2005.
Challenges in Large Enterprise Data Management James Hamilton Microsoft SQL Server
E-Commerce: The Second Wave Fifth Annual Edition Chapter 12: Planning for Electronic Commerce.
Failure Analysis of Two Internet Services Archana Ganapathi
Recovery Oriented Computing (ROC) Dave Patterson and a cast of 1000s: Aaron Brown, Pete Broadwell, George Candea †, Mike Chen, James Cutler †, Prof. Armando.
CalStan 3/2011 VIRAM-1 Floorplan – Tapeout June 01 Microprocessor –256-bit media processor –12-14 MBytes DRAM – Gops –2W at MHz –Industrial.
Recovery Oriented Computing (ROC) Dave Patterson and a cast of 1000s: Aaron Brown, Pete Broadwell, George Candea †, Mike Chen, James Cutler †, Prof. Armando.
Introduction to Systems Analysis and Design
VIRTUALIZATION AND YOUR BUSINESS November 18, 2010 | Worksighted.
1 CSE 403 Reliability Testing These lecture slides are copyright (C) Marty Stepp, They may not be rehosted, sold, or modified without expressed permission.
Introduction to Computer Administration System Administration
Latency as a Performability Metric for Internet Services Pete Broadwell
CompSci Self-Managing Systems Shivnath Babu.
What is E-Commerce? Section 8.1. What is E-commerce? E-commerce is the exchange of goods, services, information, or other businesses through electronic.
CNJohnson & Associates, Inc An Overview of Chargeback Best Practices.
Elite Networking & Consulting Presents: Everything You Wanted To Know About Data Insurance* * But Were Afraid To Ask Elite Networking & Consulting, LLC,
System Testing There are several steps in testing the system: –Function testing –Performance testing –Acceptance testing –Installation testing.
Adam Leidigh Brandon Pyle Bernardo Ruiz Daniel Nakamura Arianna Campos.
1 Faculty Council IT Committee C-13 February 4, /4/2010.
1 SYS366 Week 3 Lecture 1 Introduction to Requirements Gathering: Part 1.
© 2014 AT&T Intellectual Property. All rights reserved. AT&T, the AT&T logo and all other AT&T marks contained herein are trademarks of AT&T Intellectual.
DotHill Systems Data Management Services. Page 2 Agenda Why protect your data?  Causes of data loss  Hardware data protection  DMS data protection.
Eng. Mohammed Timraz Electronics & Communication Engineer University of Palestine Faculty of Engineering and Urban planning Software Engineering Department.
1 Web Commerce Definition Benefits Impacts Other Types of Electronic Commerce.
August 01, 2008 Performance Modeling John Meisenbacher, MasterCard Worldwide.
Recovery Oriented Computing (ROC) Aaron Brown*, Pete Broadwell, George Candea †, Mike Chen, Leonard Chung*, James Cutler †, Armando Fox †, Archana Ganapathi*,
1 Performance Evaluation of Computer Systems and Networks Introduction, Outlines, Class Policy Instructor: A. Ghasemi Many thanks to Dr. Behzad Akbari.
GRAPHITE PAYMENTS ATM PROGRAM 1. ATM STATISTICS 2013 Federal Reserve Payments Study  5.8 billion ATM withdrawals  1.9 billion from non-FI ATMs  $
1 BTEC HNC Systems Support Castle College 2007/8 Systems Analysis Lecture 13 Post-Implementation Training.
Failure Analysis of the PSTN: 2000 Patricia Enriquez Mills College Oakland, California Mentors: Aaron Brown David Patterson.
1. Tracking and oversight 2. Costing and Charging Assessment & Improvement of IT Services / IT Service Management Nynke de Vries.
Chapter 11 Maintaining the System System evolution Legacy systems Software rejuvenation.
Your business runs even when your server doesn’t DR Recommendation November 2011.
Metrics and Techniques for Evaluating the Performability of Internet Services Pete Broadwell
CompSci Self-Managing Systems Shivnath Babu.
Disaster Recovery and Business Continuity Planning.
Microsoft Reseach, CambridgeBrendan Murphy. Measuring System Behaviour in the field Brendan Murphy Microsoft Research Cambridge.
Evaluating Undo: Human-Aware Recovery Benchmarks Aaron Brown with Leonard Chung, Calvin Ling, and William Kakes January 2004 ROC Retreat.
Lecture 4. IS Planning & Acquisition To be covered: To be covered: – IS planning and its importance Cost-benefit analysis Cost-benefit analysis Funding.
Computing A2 Project Analysis. Company Background Set the scene – Company history – Staff – Turnover – Location(s) – Brief overview of product/services.
Hosted Voice & Hosted Contact Center
Fault Tolerance Benchmarking. 2 Owerview What is Benchmarking? What is Dependability? What is Dependability Benchmarking? What is the relation between.
Software Maintenance Speaker: Jerry Gao Ph.D. San Jose State University URL: Sept., 2001.
Mindcraft is a registered trademark of Mindcraft, Inc. October 26, 1998Copyright 1998 Mindcraft, Inc. A Strategy for Buying Directory Servers Bruce Weiner.
Ellis Paul Technical Solution Specialist – System Center Microsoft UK Operations Manager Overview.
Slide 1 Recovery-Oriented Computing Aaron Brown, Dan Hettenna, David Oppenheimer, Noah Treuhaft, Leonard Chung, Patty Enriquez, Susan Housand, Archana.
“The Role of Experience in Software Testing Practice” A Review of the Article by Armin Beer and Rudolf Ramler By Jason Gero COMP 587 Prof. Lingard Spring.
©2011 Quest Software, Inc. All rights reserved. Quick, Scalable Restore of Granular Objects Recovery Manager for Active Directory.
If you have a transaction processing system, John Meisenbacher
Online Shopping. Learning Objectives To learn how society has been affected by online shopping (e-Commerce)
Introduction to System Administration. System Administration  System Administration  Duties of System Administrator  Types of Administrators/Users.
INTRODUCTION TO DESKTOP SUPPORT
Managed IT Solutions More Reliable Networks Are Our Business
Azure Site Recovery For Hyper-V, VMware, and Physical Environments
Introduction to High Availability
Embracing Failure: A Case for Recovery-Oriented Computing
DISASTER RECOVERY INSTITUTE INTERNATIONAL
Leveraging the Cloud: A New Way of Managing Testing
Large Distributed Systems
Addressing Human Error with Undo
Recovery-Oriented Computing
HR Metrics 2: Staffing Metrics
Agenda The current Windows XP and Windows XP Desktop situation
Presentation transcript:

A Simple Way to Estimate the Cost of Downtime Dave Patterson EECS Department University of California, Berkeley November 2002

Slide 2 Motivation Our perspective: Dependability and Cost of Ownership are the upcoming challenges – Past challenges: Performance and Cost of Purchase Ideal: compare (purchase cost + outage cost) But companies claim most customers won’t pay much more for more dependable products 1. How do you measure product availability? 2. How much money would greater availability save? Researchers are starting to benchmark dependability: commonplace in 2 to 4 years? –Predict hours of downtime per year per product If customers cannot easily estimate downtime costs, who will pay more for dependability? –outage cost = downtime hours X cost/hour of downtime

Slide “Downtime Costs” (per Hour) Brokerage operations$6,450,000 Credit card authorization$2,600,000 Ebay (1 outage 22 hours)$225,000 Amazon.com$180,000 Package shipping services$150,000 Home shopping channel$113,000 Catalog sales center$90,000 Airline reservation center$89,000 Cellular service activation$41,000 On-line network fees$25,000 ATM service fees$14,000 Sources: InternetWeek 4/3/ Fibre Channel: A Comprehensive Introduction, R. Kembel 2000, p.8. ”...based on a survey done by Contingency Planning Research."

Slide 4 One Approach 1. Estimate on-line income lost during outages –(Online Income / quarter divided by hours / quarter) X hours of downtime Lost off-line sales? Employee productivity? 2. Interview employees after an outage, ask how many were idled by the outage, and calculate their salaries and benefits –How many employees would answer (honestly)? (Big Brother data collection?) –How many companies would spend the money to collect this information? But want CIO, system administrator to easily estimate costs to evaluate future purchases

Slide 5 A Simple Estimate Estimated Average Cost of hour of downtime = Employee Costs per Hour * Fraction Employees Affected by Outage + Average Revenue per Hour * Fraction Revenue Affected by Outage Employee Costs per Hour: total salaries and benefits of employees per week divided by the average number of working hours Average Revenue per Hour: total revenue per week divided by average number of open hours “Fraction Employees Affected by Outage” and “Fraction Revenue Affected by Outage” are just educated guesses or plausible ranges – Since evaluating purchases, estimates OK

Slide 6 Caveats Ignores cost of repair, such as cost of operator overtime or bringing in consultants Ignores daily, seasonal variations in revenue Indirect costs of outages can be as important as these more immediate costs –Company morale can suffer, reducing productivity for periods that far exceed the outage –Frequent outages can lead to a loss of confidence in the IT team and its skills (IT blamed for everything) –Can eventually lead to individual departments hiring their own IT people, which lead to higher direct costs Hence estimate tends to be conservative

Slide 7 3 Examples (2001) InstitutionRevenue Hours Employee Hours per Week per Week EECS Dept.--10 hrs x 5 days U.C. Berkeley Amazon24 hrs x 7 days10 x 5 (Online) Sun Microsystems24 x 510 x 5 (Offline, but Global)

Slide 8 Example 1: EECS Dept. at U.C.B. State funds: 68 $100k / week + 90 $200k/week (including benefits) External (federal) funds per week: –School year: 670 (students, $450k –Summer: 635 (students, faculty, $675k ~ $44,000,000 / year in Salaries & Benefits or ~ $850,000 / 50 hours / week => ~ $17,000 per hour If outage affects 80% employees, ~ $14,000 / hour Guess at 2002 outages: ~ 50 hours => ~ $680,000 per year

Slide 9 Example 2: Amazon 2001 Revenue $3.1B/year, with 7744 employees Revenue per hour (24x7): ~ $350,000 If outage affects 90% revenue: ~ $320,000 Public quarterly reports do not include salaries and benefits directly Assume avg. annual salary is $85,000 –UC staff ~$70,000; Amazon 20% higher? $656M / year, or $12.5M / week for all 50 hours / week => ~ $250,000 per hour If outage affects 80% employees: $200,000 Total: ~ $520,000 / hour Note: Employee cost/hour ~ revenue

Slide 10 Example 3: Sun Microsystems 2001 Revenue $12.5B/year, with 43,314 employees Revenue per hour (24x5): ~ $2,000k If outage affected 10% revenue: ~ $200,000 Assume avg. annual salary is $100,000 –More engineers than Amazon, so 20% higher $4331M/year, or $83M / week for all 50 hours / week => ~ $1,660,000 per hour If outage affects 50% employees: ~$830,000 Total: ~ $1,030,000 / hour Note: Employee costs 80% of outage cost

Slide 11 Purchase Example Comparing 2 RAID disk arrays, same capacity; Brand X Purchase Price: $200,000 Brand Y Purchase Price: $250,000 Dependability benchmarks suggest over 3-year product lifetime 10 hours of downtime for Brand X vs. 1 hour of downtime for Brand Y Using EECS downtime costs of $14k per hour: Brand X: $200,000+10x$14,000 = $340,000 Brand Y: $250,000+ 1x$14,000 = $264,000 Helps organization justify selecting more dependable but more expensive product

Slide 12 Observations Data was easy to collect –3 s inside EECS –Quarterly Reports for Amazon and Sun Quantifies cost difference of planned vs. unplanned downtime, which is not captured by availability ( %) Employee productivity costs, traditionally ignored in such estimates, are significant –Even for Internet companies like Amazon –Dominate traditional organizations like Sun Outages at universities and government organizations can be expensive, even without a computer-related revenue stream Include employee productivity in costs!

Slide 13 Is greater accuracy needed? Downtime can have subtle, difficult to measure effects on sales and productivity –Are sales simply re-ordered after downtime, Or do customers switch to more dependable company? –Do employees just do other work during downtime, Or does downtime result in lost work, psychological impact, so that it takes longer than the downtime to recover? Hence spending much more for accuracy may not be worthwhile –Also, metric is less likely to see widespread use if it’s much more difficult to generate

Slide 14 Conclusion Employee productivity costs are significant Goal: easy-to-calculate estimate to lays groundwork for buying dependable products Unclear what precision is needed or possible, so make easy for CIOs, SysAdmins to calculate Estimated Average Cost of 1 hour of downtime = Employee Costs per Hour * Fraction Employees Affected + Average Revenue per Hour * Fraction Revenue Affected Web page for downtime calculation for you:

Slide 15 FYI: ROC Project Recovery Oriented Computing (ROC) Project collecting real failure for Recovery Benchmarks –If interested in helping, let us know, and see

Slide 16 BACKUP SLIDES

Slide 17 Human error Human operator error is the leading cause of dependability problems in many domains Operator error cannot be eliminated –humans inevitably make mistakes: “to err is human” –automation irony tells us we can’t eliminate the human Source: D. Patterson et al. Recovery Oriented Computing (ROC): Motivation, Definition, Techniques, and Case Studies, UC Berkeley Technical Report UCB//CSD , March Public Switched Telephone Network Average of 3 Internet Sites Sources of Failure

Slide 18 ROC Part I: Failure Data Lessons about human operators Human error is largest single failure source –HP HA labs: human error is #1 cause of failures (2001) –Oracle: half of DB failures due to human error (1999) –Gray/Tandem: 42% of failures from human administrator errors (1986) –Murphy/Gent study of VAX systems (1993): Causes of system crashes Time ( ) % of System Crashes System management Software failure Hardware failure Other 53% 18% 10%

Slide 19 Total Cost of Ownership: Ownership vs. Purchase Source: "The Role of Linux in Reducing the Cost of Enterprise Computing“, IDC white paper, sponsored by Red Hat, by Al Gillen, Dan Kusnetzky, and Scott McLaron, Jan. 2002, available at HW/SW decrease vs. Salary Increase –142 sites, users/site, $2B/yr sales A B C D

Slide 20 Recovery benchmarks quantify system behavior under failures, maintenance, recovery They require –A realistic workload for the system –Quality of service metrics and tools to measure them –Fault-injection to simulate failures –Human operators to perform repairs Repair Time QoS degradation failure normal behavior (99% conf.) Recovery benchmarking 101 Source: A. Brown, and D. Patterson, “Towards availability benchmarks: a case study of software RAID systems,” Proc. USENIX, June 2000

Slide 21 Example: 1 fault in SW RAID Compares Linux and Solaris reconstruction –Linux: Small impact but longer vulnerability to 2nd fault –Solaris: large perf. impact but restores redundancy fast –Windows: did not auto-reconstruct! Linux Solaris

Slide 22 Dependability: Claims of 5 9s? % availability from telephone company? –AT&T switches < 2 hours of failure in 40 years Cisco, HP, Microsoft, Sun … claim % availability claims (5 minutes down / year) in marketing/advertising –HP-9000 server HW and HP-UX OS can deliver % availability guarantee “in certain pre- defined, pre-tested customer environments” –Environmental? Application? Operator? s from Jim Gray’s talk: “Dependability in the Internet Era”

Slide 23 “Microsoft fingers technicians for crippling site outages” By Robert Lemos and Melanie Austria Farmer, ZDNet News, January 25, 2001 Microsoft blamed its own technicians for a crucial error that crippled the software giant's connection to the Internet, almost completely blocking access to its major Web sites for nearly 24 hours… a "router configuration error" had caused requests for access to the company’s Web sites to go unanswered… "This was an operational error and not the result of any issue with Microsoft or third-party products, nor with the security of our networks," a Microsoft spokesman said. (5 9s possible if site stays up 250 years!) 99

Slide 24 The ironies of automation Automation doesn’t remove human influence –shifts the burden from operator to designer »designers are human too, and make mistakes »unless designer is perfect, human operator still needed Automation can make operator’s job harder –reduces operator’s understanding of the system »automation increases complexity, decreases visibility »no opportunity to learn without day-to-day interaction –uninformed operator still has to solve exceptional scenarios missed by (imperfect) designers »exceptional situations are already the most error-prone Need tools to help, not replace, operator Source: J. Reason, Human Error, Cambridge University Press, mention human-aware automation

Slide 25 Recovery-Oriented Computing Philosophy “If a problem has no solution, it may not be a problem, but a fact, not to be solved, but to be coped with over time” — Shimon Peres (“Peres’s Law”) People/HW/SW failures are facts, not problems Recovery/repair is how we cope with them Improving recovery/repair improves availability –UnAvailability = MTTR MTTF –1/10th MTTR just as valuable as 10X MTBF (assuming MTTR much less than MTTF) ROC also helps with maintenance/TCO –since major Sys Admin job is recovery after failure Since TCO is 5-10X HW/SW $, if necessary spend disk/DRAM/CPU resources for recovery

Slide 26 Five “ROC Solid” Principles 1.Given errors occur, design to recover rapidly 2.Extensive sanity checks during operation –To discover failures quickly (and to help debug) –Report to operator (and remotely to developers) 3.Tools to help operator find, fix problems –Since operator part of recovery; e.g., hot swap; undo; graceful, gradual SW upgrade/degrade 4.Any error message in HW or SW can be routinely invoked, scripted for regression test –To test emergency routines during development –To validate emergency routines in field –To train operators in field 5.Recovery benchmarks to measure progress –Recreate performance benchmark competition

Slide 27 Recovery Benchmarks (so far) Race to recover vs. race to finish line Recovery benchmarks involve people, but so do most research by social scientists –“Macro” benchmarks for competition, must be fair, hard to game, representative; use ~ 10 operators in routine maintenance and observe errors; insert realistic HW, SW errors stochastically –“Micro” benchmarks for development, must be cheap; inject typical human, HW, SW errors Many opportunities to compare commercial products and claims, measure value of research ideas, … with recovery benchmarks –initial recovery benchmarks find peculiarities –Lots of low hanging fruit (~ early RAID days)