1 Dependability in the Internet Era Jim Gray Microsoft Research High Dependability Computing Consortium Conference Santa Cruz, CA 7 May 2001 REVISED: 13 Feb 2005 Stanford, CA
2 Outline The glorious past (Availability Progress) The dark ages (current scene) Some recommendations
3 Preview The Last 10 Years: Availability Dark Ages. Ready for a Renaissance? Things got better, then things got a lot worse! [Chart: availability, from 99% to beyond 99.99%, over time through 2010, for Computer Systems, Telephone Systems, Cell phones, and the Internet]
4 DEPENDABILITY: The 3 ITIES
RELIABILITY / INTEGRITY: does the right thing. (also MTTF >> 1)
AVAILABILITY: does it now. (also MTTR/(MTTF+MTTR) << 1)
System Availability: if 90% of terminals are up & 99% of the DB is up, then only ~89% of transactions are serviced on time.
Holistic vs. Reductionist view
[Diagram: Security, Integrity, Reliability, Availability]
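A quick worked sketch of the arithmetic behind this slide (a minimal Python example; the 90%/99% figures are from the slide, while the helper function and the 1-month/1-hour module are illustrative assumptions):

# Availability of a single module from its MTTF and MTTR.
def availability(mttf_hours, mttr_hours):
    return mttf_hours / (mttf_hours + mttr_hours)

# A transaction needs BOTH the terminal and the database (serial composition),
# so availabilities multiply and the system is weaker than its weakest part.
terminal = 0.90
database = 0.99
print(terminal * database)       # 0.891 -> only ~89% of transactions serviced on time

# An illustrative module: MTTF of 1 month, MTTR of 1 hour.
print(availability(30 * 24, 1))  # ~0.9986, just under three 9s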
5 Fail-Fast is Good, Repair is Needed
Improving either MTTR or MTTF gives benefit.
Simple redundancy does not help much.
Lifecycle of a module: fail-fast gives short fault latency.
High Availability is low UN-Availability.
Unavailability ~ MTTR / MTTF
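To see why repair is needed for redundancy to pay off, here is a hedged back-of-the-envelope sketch in Python using the classic duplexed-pair approximation (pair MTTF ~ MTTF^2 / (2 * MTTR)); the one-month, one-hour, and one-week numbers are illustrative assumptions, not figures from the talk:

# Unavailability of a single module (good approximation when MTTR << MTTF).
def unavailability(mttf, mttr):
    return mttr / mttf

# Classic pair approximation: the duplexed pair fails only if the second
# module dies while the first is still being repaired.
def pair_mttf(mttf, mttr):
    return mttf * mttf / (2 * mttr)

mttf = 30 * 24.0        # one failure a month (hours)
fast_repair = 1.0       # 1 hour: fail-fast plus automatic restart
slow_repair = 7 * 24.0  # 1 week: fault lingers, manual repair

print(pair_mttf(mttf, fast_repair) / 8760)  # ~29.6 years between pair outages
print(pair_mttf(mttf, slow_repair) / 8760)  # ~0.18 years: redundancy barely helped

The same pair with week-long repairs is barely better than a single module, which is the slide's point: redundancy without fast repair does not help much.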
6 Fault Model Failures are independent So, single fault tolerance is a big win Hardware fails fast (dead disk, blue-screen) Software fails-fast (or goes to sleep) Software often repaired by reboot: –Heisenbugs Operations tasks: major source of outage –Utility operations –Software upgrades
7 Disks (RAID): the BIG Success Story
Duplex or parity masks faults; disks have ~1M-hour MTTF (~100 years).
But controllers fail, and big sites have 1,000s of disks.
Duplexing or parity, plus dual paths, gives "perfect disks".
Wal-Mart never lost a byte (thousands of disks, hundreds of failures).
Only software/operations mistakes are left.
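To make "parity masks a fault" concrete, here is a minimal XOR-parity sketch in Python (an illustrative toy only; real RAID controllers add striping, checksums, and dual pathing on top of this idea):

from functools import reduce

def parity(blocks):
    # The parity block is the bytewise XOR of all the blocks.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

data = [b"disk-0 block....", b"disk-1 block....", b"disk-2 block...."]
p = parity(data)

# Disk 1 dies: rebuild its block from the surviving disks plus the parity block.
rebuilt = parity([data[0], data[2], p])
assert rebuilt == data[1]   # the lost block is recovered exactly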
8 Fault Tolerance vs Disaster Tolerance Fault-Tolerance: mask local faults –RAID disks –Uninterruptible Power Supplies –Cluster Failover Disaster Tolerance: masks site failures –Protects against fire, flood, sabotage,.. –Also, software changes, site moves,… –Redundant system and service at remote site.
9 Availability
Un-managed.
Well-managed nodes: masks some hardware failures.
Well-managed packs & clones: masks hardware failures and operations tasks (e.g. software upgrades); masks some software failures.
Well-managed GeoPlex: masks site failures (power, network, fire, move, …); masks some operations failures.
10 Case Study - Japan
"Survey on Computer Security", Japan Info Dev Corp., March (trans: Eiichi Watanabe).
1,383 institutions reported (6/84 - 7/85): 7,517 outages, MTTF ~ 10 weeks, average duration ~ 90 MINUTES.
Cause                            Share of outages   MTTF
Vendor (hardware and software)   42%                5 months
Application software             25%                9 months
Communications lines             12%                1.5 years
Environment                      11.2%              2 years
Operations                       9.3%               2 years
To get a 10-year MTTF, must attack ALL these areas.
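As a sanity check, the ~10-week overall MTTF follows from summing the per-cause failure rates; a small Python sketch (the per-cause MTTFs are from the slide, the weeks-per-month constant is mine):

# Per-cause MTTFs from the survey, in months.
mttf_months = {
    "vendor": 5,
    "application software": 9,
    "communications lines": 18,  # 1.5 years
    "environment": 24,           # 2 years
    "operations": 24,            # 2 years
}

# Independent causes: failure rates add, so the combined MTTF is the
# reciprocal of the summed rates.
total_rate = sum(1.0 / m for m in mttf_months.values())
overall_mttf_weeks = (1.0 / total_rate) * 4.33   # ~4.33 weeks per month
print(round(overall_mttf_weeks, 1))              # ~9.6 weeks, matching the ~10-week MTTF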
11 Case Studies - Tandem Trends
MTTF improved.
Shift from Hardware & Maintenance (from 50% down to 10%) to Software (62%) & Operations (15%).
NOTE: systematic under-reporting of Operations errors.
[Chart: outage shares by cause over time: Environment, Operations, Application Software]
12 Dependability Status circa 1995
~4-year MTTF: five 9s for well-managed systems. Fault tolerance works.
Hardware is GREAT (maintenance and MTTF).
Software masks most hardware faults.
Many hidden software outages in operations: new software, utilities. Need to make all hardware/software changes ONLINE.
Software seems to define a 30-year MTTF ceiling.
Reasonable goal: 100-year MTTF (class 4 today => class 6 tomorrow).
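Reading "class" as the usual number of nines of availability (my assumption), a tiny Python table shows what each class allows in downtime per year:

# Downtime per year allowed by each availability class ("number of nines").
MINUTES_PER_YEAR = 365.25 * 24 * 60

for nines in range(2, 7):
    availability = 1 - 10 ** (-nines)
    downtime_min = MINUTES_PER_YEAR * (1 - availability)
    print(f"class {nines}: {availability:.6f}  ~{downtime_min:,.1f} minutes/year down")

# class 2 (99%):      ~5,260 minutes/year (~3.7 days)
# class 4 (99.99%):   ~53 minutes/year
# class 6 (99.9999%): ~0.5 minutes/year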
13 Honorable Mention
The nice folks at Tandem (now HP):
–Made failover fast (30 seconds or less).
–Made change online: add hardware/software, reorganize the database, rolling upgrades.
–Added at least one 9 to their story.
14 And Then? Hardware got better (& more complex). Software got better (& more complex). RAID is standard, snapshots are becoming standard. Cluster-in-a-box: commodity failover. Remote replication is standard.
15 Outline The glorious past (Availability Progress) The dark ages (current scene) Some recommendations
16 Progress?
MTTF improved; MTTR saw incremental improvements (failover).
Hardware and software online change (pNp) is now standard.
Then the Internet arrived:
–No project can take more than 3 months.
–Time to market is everything.
–Change is good.
[Chart: availability of Computer Systems, Telephone Systems, Cell phones, and the Internet]
17 The Internet Changed Expectations
1990: Phones delivered 99.999%; ATMs delivered 99.99%. Failures were front-page news. Few hackers. Outages lasted an "hour".
2005: Cell phones deliver 90%; web sites deliver 99%. Failures are business-page news. Many hackers. Outages last a "day".
This is progress?
18 Eric Brewer said it best: ACID vs BASE, the Internet litmus test (a copy of slide 8 of Brewer's talk)
ACID: Atomicity, Consistency, Isolation, Durability. Availability? Strong consistency. Isolation. Focus on commit. Conservative (pessimistic). Difficult evolution (e.g. schema). Nested transactions.
BASE: Basic Availability, Soft State, Eventual Consistency. Availability FIRST. Weak consistency: stale data is OK. Approximate answers OK. Best effort. Aggressive (optimistic). Easier evolution. Simpler! Faster.
I think it is a spectrum.
19 Why (1) Complexity
Internet sites are MUCH more complex:
–NAP
–Firewall/proxy/IP sprayer
–Web
–DMZ
–App server
–DB server
–Links to other sites
–tcp/http/html/dhtml/dom/xml/com/corba/cgi/sql/fs/os…
Skill level is much reduced.
20 A Data Center (500 servers)
21 A Schematic of HotMail
~7,000 servers; 100 back-end stores with 300 TB (cooked); many data centers.
Links to: Internet mail gateways, Ad-rotator, Passport, …
~5 B messages per day; 350 M mailboxes, 250 M active, ~1 M new per day.
New software every 3 months (small changes weekly).
[Diagram: switched Ethernet connecting Local Directors, front doors, incoming mail servers, AD servers, graphics servers, login servers, member directory, data stores (USTORES), and an Internet/telnet management gateway]
22 Why (2) Velocity
No project can take more than 13 weeks.
Time to market is everything. Functionality is everything. Faster, cheaper, …
[Diagram: Schedule / Quality / Functionality triangle, with the trend arrow]
23 Why (3) Hackers
Hackers are a new, increased threat. Any site can be attacked from anywhere. Motives include ego, malice, and greed. Complexity makes it hard to protect sites. Whole-Internet attacks: Slammer. Concentration of wealth makes an attractive target. Reporter: "Why did you rob banks?" Willie Sutton: "Cause that's where the money is!"
Note: Eric Raymond's "How to Become a Hacker" is the positive use of "hacker"; here I mean malicious and anti-social hackers. Black-hats, not white-hats.
24 How Bad Is It? Connectivity is poor.
25 How Bad Is It? Median monthly % ping packet loss for 2/99
26 And in 2006, about the same
27 Or In the US
28 Keynote measures Response Time and Up Time
Measures response time around the world. Business service is better than popular service. Has many proprietary services for SLAs.
Web sites with best performance averages:
  Week of April 22 - April 28, 2001: Ameritrade (65), Lycos (81), Yahoo! (81), Altavista (19), Go.com
  Previous week: Ameritrade (64), Lycos (80), Yahoo! (80), Ask Jeeves (7), Altavista (18)
  Worst average: (anonymous) 38.04 vs. (anonymous) 37.44 the previous week.
29 A typical site: 97.48% availability.
30 Netcraft’s Crisis-of-the-Day
31
32 Service Level Measurements
Many organizations are measured on SLAs. Example: 1-second response 99% of prime time.
Keynote, Netcraft, …
–offer to monitor your site (probe every few minutes); the probing can go deep into the tree to detect services.
–send alerts
–give monthly reports.
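A hedged sketch of what such an external probe looks like (a minimal Python example of the general idea, not Keynote's or Netcraft's actual service; the URL, threshold, and interval are placeholder assumptions):

import time
import urllib.request

URL = "https://example.com/"   # placeholder target
SLA_SECONDS = 1.0              # e.g. "1-second response"
PROBE_INTERVAL = 300           # probe every 5 minutes

def probe(url, timeout=10):
    """Return (ok, elapsed_seconds) for one HTTP GET probe."""
    start = time.time()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            ok = (resp.status == 200)
    except Exception:
        ok = False
    return ok, time.time() - start

results = []
for _ in range(3):             # a real monitor loops forever, alerts, and writes monthly reports
    ok, elapsed = probe(URL)
    results.append(ok and elapsed <= SLA_SECONDS)
    time.sleep(PROBE_INTERVAL)

print(f"SLA met on {sum(results)}/{len(results)} probes")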
33 In addition
Most large sites build their own instrumentation (several times). This instrumentation is elaborate and essential for the Network Operations Center (NOC). There are attempts now to systematize it: Tivoli, OpenView, NetIQ, WhatsUp, MOM, …
34 Microsoft.Com
Operations mis-configured a router; it took a day to diagnose and repair. DoS attacks cost a fraction of a day. Regular security patches.
35 Back-End Servers are More Stable
Generally deliver 99.99%.
TerraServer, for example: the single back-end failed after 2.5 years; went to a 4-node cluster that fails every 2 months, with transparent failover in 30 sec and online software upgrades.
So… % in backend…
Year 1 through 18 months:
–Down 30 hours in July (hardware stop, auto-restart failed, operations failure)
–Down 26 hours in September (backplane failure, I/O bus failure)
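For scale, a small hedged Python calculation of what the two listed outages alone imply for availability over an 18-month window (assuming no other downtime, which is an assumption, not something the slide states):

hours_in_window = 18 * 30 * 24   # roughly 18 months
downtime_hours = 30 + 26         # the two outages listed on the slide

availability = 1 - downtime_hours / hours_in_window
print(f"{availability:.4%}")     # ~99.57%: about two 9s, far short of 99.99%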
36 eBay: A very honest site
Publishes operations log.
Has 99% of scheduled uptime.
Schedules about 2 hours/week down.
Has had some operations outages.
Has had some DoS problems.
37 And 2006…
Welcome to eBay's System Board. Visit this board for information on scheduled site maintenance or system issues that are affecting Marketplace trading. For general eBay news, please see our General Announcements Board.
***Resolved - PayPal site slowness*** February 08, 2006 | 05:20PM PST/PT: For several hours today, members may have experienced slowness while trying to access the PayPal website. This issue has now been resolved. Thank you for your patience.
***PayPal site slowness*** February 08, 2006 | 02:38PM PST/PT: Members may be experiencing intermittent slowness while trying to access the PayPal website. We're aware of this issue and are working to fix it as quickly as possible. Thank you for your patience.
***Scheduled Maintenance For This Week*** February 08, 2006 | 02:03PM PST/PT: The eBay system will be undergoing general maintenance from approximately 23:00 PT on Thursday, February 9th to 01:00 PT on Friday, February 10th. During this maintenance period, certain eBay site features may be intermittently unavailable or slow.
38 Some Cool New Things
There are 100,000-node services.
Google File System shows the importance & benefit of triplexing.
DB replication & mirroring works (is easy).
Little things I have done:
–With Leslie Lamport: unified Paxos & 2PC.
–Measured mean-time-to-data-loss (and continue to measure things).
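A hedged sketch of the kind of mean-time-to-data-loss arithmetic this refers to, comparing mirrored and triplexed copies; the formulas are the standard independent-failure approximations, and the disk MTTF and one-day rebuild window are illustrative assumptions, not measurements from the talk:

# Mean time to data loss (MTTDL) under independent failures and prompt re-replication.
def mttdl_mirrored(mttf, mttr):
    # Data is lost only if the second copy dies during the repair window.
    return mttf ** 2 / (2 * mttr)

def mttdl_triplexed(mttf, mttr):
    # All three copies must die within overlapping repair windows.
    return mttf ** 3 / (6 * mttr ** 2)

HOURS_PER_YEAR = 8760
disk_mttf = 1_000_000   # ~1M-hour disks (the figure from the earlier RAID slide)
rebuild = 24            # assume a one-day re-replication window

print(mttdl_mirrored(disk_mttf, rebuild) / HOURS_PER_YEAR)    # ~2.4 million years
print(mttdl_triplexed(disk_mttf, rebuild) / HOURS_PER_YEAR)   # ~3.3e10 years

Under these (optimistic) independence assumptions, the third copy buys several more orders of magnitude of MTTDL.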
39 Outline The glorious past (Availability Progress) The dark ages (current scene) Some recommendations
40 Not to throw stones but…
Everyone has a serious problem. The BEST people publish their stats; the others HIDE their stats (check Netcraft to see who I mean).
We have good NODE-level availability: five 9s is reasonable.
We have TERRIBLE system-level availability: two 9s "scheduled" is the goal (!).
41 Gresham's Law: "bad money drives out good"
People WANT features! People WANT convenience! People WANT cheap!
In exchange, they seem willing to tolerate some
–un-availability (= inconvenience)
–"dirty data" that needs reconciliation
–insecurity.
I see it as our task to make it easier & cheaper to get high availability and security.
[Diagram: Schedule / Quality / Functionality triangle, with the trend arrow]
42 Recommendation #1
Continue progress on back-ends.
–Make management easier (AUTOMATE IT!!!)
–Measure.
–Compare best practices.
–Continue to look for better algorithms.
Live in fear:
–We are at 10,000-node servers.
–We are headed for 1,000,000-node servers.
43 Recommendation #2
The current security approach is unworkable:
–Anonymous clients
–The firewall is clueless
–Incredible complexity
We can't win this game! So change the rules (redefine the problem):
–No anonymity
–Unified authentication/authorization model
–Single-function devices (with simple interfaces)
–Only one kind of interface (uddi/wsdl/soap/…).
44 Recommendation #3
Dependability requires a holistic, not reductionist, approach. It's the WHOLE system (end-to-end, top-to-bottom).
Hard to publish in this area, hard to get tenure:
–Journals want theorem+proof and crisp statements.
–Companies want to make money, so they do not share their knowledge.
Dependability is an important social good, so dependability research needs government or philanthropic sponsorship.
45 References
Adams, E. (1984). "Optimizing Preventative Service of Software Products." IBM Journal of Research and Development 28(1).
Anderson, T. and B. Randell (1979). Computing Systems Reliability.
Garcia-Molina, H. and C. A. Polyzois (1990). "Issues in Disaster Recovery." 35th IEEE Compcon.
Gray, J. (1986). "Why Do Computers Stop and What Can We Do About It." 5th Symposium on Reliability in Distributed Software and Database Systems.
Gray, J. (1990). "A Census of Tandem System Availability between 1985 and 1990." IEEE Transactions on Reliability 39(4).
Gray, J. and A. Reuter (1993). Transaction Processing: Concepts and Techniques. San Mateo, Morgan Kaufmann.
Lampson, B. W. (1981). "Atomic Transactions." Distributed Systems -- Architecture and Implementation: An Advanced Course. ACM, Springer-Verlag.
Laprie, J. C. (1985). "Dependable Computing and Fault Tolerance: Concepts and Terminology." 15th FTCS.
Long, D. D., J. L. Carroll, and C. J. Park (1991). "A Study of the Reliability of Internet Sites." Proc. 10th Symposium on Reliable Distributed Systems, Pisa, September.
Siewiorek, D. and R. Swarz. The Theory and Practice of Reliable System Design.
Birman, K. P. Building Secure and Reliable Network Applications.
Long, D., A. Muir, and R. Golding (1995). "A Longitudinal Study of Internet Host Reliability." Proc. of the Symposium on Reliable Distributed Systems, Bad Neuenahr, Germany: IEEE, pp. 2-9.
They have even better for-fee data as well, but the for-free data is really excellent.
eBay is an excellent benchmark of best Internet practices.
"Empirical Measurements of Disk Failure Rates and Error Rates," with C. van Ingen; moving 2P with cheap iron.
"Consensus on Transaction Commit," with L. Lamport; unifies 2PC and Byzantine Paxos.