Terminology and empirical measures General methods to mask faults.


Heisenbugs: A Probabilistic Approach to Availability
Jim Gray, Microsoft Research
http://research.microsoft.com/~gray/Talks/
Half the slides are hidden, so view with PPT to see them all.

Outline
  Terminology and empirical measures
  General methods to mask faults
  Software-fault tolerance
  Summary

Heisenbugs: A Probabilistic Approach to Availability
There is considerable evidence that (1) production systems have about one bug per thousand lines of code, (2) these bugs manifest themselves stochastically: failures are due to a confluence of rare events, and (3) system mean-time-to-failure has a lower bound of a decade or so. To make highly available systems, architects must tolerate these failures by providing instant repair: unavailability is approximated by repair_time/time_to_fail, so cutting the repair time in half makes things twice as good. Ultimately, one builds a set of standby servers which have both design diversity and geographic diversity. This minimizes common-mode failures.
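
The repair_time/time_to_fail approximation can be checked numerically. The following is a minimal sketch and not part of the talk: the function name `unavailability` and the sample figures (a 10-year MTTF, a 2-hour repair) are assumptions chosen for illustration.

```python
HOURS_PER_YEAR = 8760

def unavailability(mttf_hours: float, mttr_hours: float) -> float:
    """Approximate unavailability as repair_time / time_to_fail (valid when MTTR << MTTF)."""
    return mttr_hours / mttf_hours

mttf = 10 * HOURS_PER_YEAR          # assumed: the system fails about once a decade
slow = unavailability(mttf, 2.0)    # assumed: 2-hour repair
fast = unavailability(mttf, 1.0)    # same system with the repair time cut in half
print(f"2 h repair: {slow:.2e}   1 h repair: {fast:.2e}   improvement: {slow / fast:.1f}x")
```

Halving the repair time halves the unavailability, which is the sense in which it "makes things twice as good".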

Dependability: The 3 ITIES
  Reliability / Integrity: does the right thing. (Also large MTTF.)
  Availability: does it now. (Also small MTTR / (MTTF + MTTR).)
  System Availability: if 90% of terminals are up and 99% of the DB is up, then about 89% of transactions are serviced on time.
  Holistic vs. reductionist view.
  [Figure: Security, Integrity, Reliability, Availability]
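
The 89% figure follows from multiplying the availabilities of the parts a transaction depends on. A small sketch, assuming the terminal and database failures are independent (the slide does not state this) and using a hypothetical helper name:

```python
from math import prod

def system_availability(component_availabilities):
    """Availability of a chain of components that must all be up at once.

    Assumes the components fail independently of each other.
    """
    return prod(component_availabilities)

terminals, database = 0.90, 0.99
print(f"{system_availability([terminals, database]):.1%} of transactions serviced on time")
```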

High Availability System Classes
Goal: Build Class 6 Systems

System Type              Unavailable (min/year)   Availability   Class
Unmanaged                50,000                   90%            1
Managed                  5,000                    99%            2
Well Managed             500                      99.9%          3
Fault Tolerant           50                       99.99%         4
High-Availability        5                        99.999%        5
Very-High-Availability   .5                       99.9999%       6
Ultra-Availability       .05                      99.99999%      7

UnAvailability = MTTR/MTBF: can cut it in half by halving MTTR or doubling MTBF.
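
The class number is simply the count of leading nines in the availability figure. The sketch below recomputes the table under that reading; the function names are hypothetical, and the exact downtime figures differ slightly from the slide, which rounds a year to about 500,000 minutes.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; the slide rounds this to about 500,000

def downtime_minutes_per_year(availability: float) -> float:
    """Expected minutes of outage per year at the given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR

def availability_class(availability: float) -> int:
    """Number of leading nines, e.g. 0.99999 -> class 5."""
    # The small epsilon guards against floating-point error in 1 - availability.
    return math.floor(-math.log10(1.0 - availability) + 1e-6)

for a in (0.9, 0.99, 0.999, 0.9999, 0.99999, 0.999999, 0.9999999):
    print(f"class {availability_class(a)}: {downtime_minutes_per_year(a):,.2f} min/year")
```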

Demo: looking at some nodes
Look at http://httpmonitor/
Internet node availability: 92% mean, 97% median
Darrell Long (UCSC), ftp://ftp.cse.ucsc.edu/pub/tr/
  ucsc-crl-90-46.ps.Z "A Study of the Reliability of Internet Sites"
  ucsc-crl-91-06.ps.Z "Estimating the Reliability of Hosts Using the Internet"
  ucsc-crl-93-40.ps.Z "A Study of the Reliability of Hosts on the Internet"
  ucsc-crl-95-16.ps.Z "A Longitudinal Survey of Internet Host Reliability"

Sources of Failures

                     MTTF          MTTR
Power Failure        2000 hr       1 hr
Phone Lines
  Soft               >.1 hr        .1 hr
  Hard               4000 hr       10 hr
Hardware Modules     100,000 hr    10 hr   (many are transient)

Software: 1 bug / 1000 lines of code (after vendor-user testing)
  => Thousands of bugs in the system!
  Most software failures are transient: dump & restart system.
Useful fact: 8,760 hrs/year ~ 10k hr/year

Case Studies - Tandem Trends
Reported MTTF by Component (years)

               1985   1987   1990
SOFTWARE          2     53     33
HARDWARE         29     91    310
MAINTENANCE      45    162    409
OPERATIONS       99    171    136
ENVIRONMENT     142    214    346
SYSTEM            8     20     21

Problem: systematic under-reporting

Many Software Faults are Soft
After
  Design Review
  Code Inspection
  Alpha Test
  Beta Test
  10k Hrs of Gamma Test (Production)
most software faults are transient:
  MVS Functional Recovery Routines   5:1
  Tandem Spooler                     100:1
  Adams                              >100:1
Terminology:
  Heisenbug: works on retry
  Bohrbug: faults again on retry
Adams: "Optimizing Preventive Service of Software Products", IBM J R&D, 28.1, 1984
Gray: "Why Do Computers Stop", Tandem TR85.7, 1985
Mourad: "The Reliability of the IBM/XA Operating System", 15 ISFTCS, 1985.

Summary of FT Studies
Current situation: ~4-year MTTF => fault tolerance works.
Hardware is GREAT (maintenance and MTTF).
Software masks most hardware faults.
Many hidden software outages in operations:
  New software.
  Utilities.
Must make all software ONLINE.
Software seems to define a 30-year MTTF ceiling.
Reasonable goal: 100-year MTTF.
  Class 4 today => class 6 tomorrow.

Fault Tolerance vs Disaster Tolerance
Fault-Tolerance: masks local faults
  RAID disks
  Uninterruptible Power Supplies
  Cluster Failover
Disaster Tolerance: masks site failures
  Protects against fire, flood, sabotage, ...
  Redundant system and service at remote site.
  Use design diversity

There have been a variety of technologies introduced to address your growing need for high-availability servers. The simplest of these is Data Mirroring, which continuously duplicates all disk writes onto a mirrored set of disks, possibly at a remote disaster recovery site. Today you can get Data Mirroring products for Windows NT Server from a few vendors, including Octopus (http://www.octopus.com) and Vinca (http://www.vinca.com). These solutions provide excellent protection for your data, even in the event of a metropolis-wide disaster. However, they’re not high-availability solutions that have the ability to detect all types of hardware or software failure, and they have at best limited abilities to automatically restart applications. (For example, users must manually reconnect to the new server, plus any applications running on the recovery server are canceled as if it had been the server that failed.)

Server Mirroring like Novell SFT III (Server Fault Tolerance) is a high-availability capability that both protects your data and provides for automatic detection of failures plus restart of selected applications. Server Mirroring provides excellent reliability, but at a very high cost since it requires an idle “standby” server that does no productive work except when the primary server fails. There are also very few applications which can take advantage of proprietary server mirroring solutions like Novell SFT III.

At the high end are true “fault tolerant” systems like the excellent “NonStop” systems from Tandem. These systems are able to detect and almost instantly recover from virtually any single hardware or software failure. Most bank transactions, for example, run on this type of system. This level of reliability comes with a very high price tag, however, and each solution is based on a proprietary, single-vendor set of hardware.

And, finally, there’s another high-availability technology which seems to offer the best of all these capabilities: clustering...

General methods to mask faults
Outline
  Terminology and empirical measures
  General methods to mask faults
  Software-fault tolerance
  Summary

Fault Tolerance Techniques
FAIL FAST MODULES: work or stop
SPARE MODULES: instant repair time
INDEPENDENT MODULE FAILS by design
  MTTF_pair ~ MTTF^2 / MTTR (so want tiny MTTR)
MESSAGE BASED OS: fault isolation
  software has no shared memory
SESSION-ORIENTED COMM: reliable messages
  detect lost/duplicate messages
  coordinate messages with commit
PROCESS PAIRS: mask hardware & software faults
TRANSACTIONS: give A.C.I.D. (simple fault model)
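
To see why a tiny MTTR matters for a pair, the slide's approximation can be evaluated with the hardware-module figures quoted earlier in the talk (100,000 hr MTTF, 10 hr MTTR). This is an illustrative sketch: the function name is hypothetical, the shorter repair times are assumed values, and the formula is the slide's rule of thumb for independently failing modules rather than an exact model.

```python
HOURS_PER_YEAR = 8760

def pair_mttf(module_mttf_hours: float, mttr_hours: float) -> float:
    """Slide's approximation: MTTF_pair ~ MTTF^2 / MTTR for independently failing modules."""
    return module_mttf_hours ** 2 / mttr_hours

# 10 hr is the MTTR quoted for hardware modules; 1 hr and 0.1 hr are assumed faster repairs.
for mttr in (10.0, 1.0, 0.1):
    years = pair_mttf(100_000, mttr) / HOURS_PER_YEAR
    print(f"MTTR {mttr:>4} hr -> pair MTTF ~ {years:,.0f} years")
```

Each tenfold cut in repair time buys a tenfold increase in the pair's MTTF, whereas the pair only fails if the second module fails while the first is still being repaired.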

Example: the FT Bank
Modularity & repair are KEY:
  von Neumann needed 20,000x redundancy in wires and switches.
  We use 2x redundancy.
Redundant hardware can support peak loads (so not redundant).

Fail-Fast is Good, Repair is Needed
[Figure: lifecycle of a module]
Fail-fast gives short fault latency.
High availability is low UN-availability:
  Unavailability ~ MTTR / MTTF
Improving either MTTR or MTTF gives benefit.
Simple redundancy does not help much.

Software-fault tolerance
Outline
  Terminology and empirical measures
  General methods to mask faults
  Software-fault tolerance
  Summary

Key Idea
{Architecture, Software, Distribution} masks {Hardware Faults, Environmental Faults, Maintenance}
Software automates / eliminates operators.
So, in the limit there are only software & design faults.
Software-fault tolerance is the key to dependability.
INVENT IT!

Software Techniques: Learning from Hardware
Recall that most outages are not hardware.
Most outages in fault-tolerant systems are SOFTWARE.
Fault avoidance techniques: good & correct design.
After that: software fault tolerance techniques:
  Modularity (isolation, fault containment)
  Design diversity
  N-Version Programming: N different implementations
  Defensive Programming: check parameters and data
  Auditors: check data structures in background
  Transactions: to clean up state after a failure
Paradox: need fail-fast software

Fail-Fast and High-Availability Execution
Software N-Plexing: Design Diversity
  N-Version Programming
    Write the same program N times (N > 3)
    Compare outputs of all programs and take majority vote
Process Pairs: instant restart (repair)
  Use defensive programming to make a process fail-fast
  Have restarted process ready in separate environment
  Second process “takes over” if primary faults
  Transaction mechanism can clean up distributed state if takeover occurs in the middle of a computation.
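
The N-version voting step can be sketched in a few lines. This is an illustration under stated assumptions, not the talk's implementation: the five integer-square-root "versions" are hypothetical stand-ins for independently written programs, outputs are assumed to be hashable and directly comparable, and a version that raises an exception is treated as fail-fast (it casts no vote).

```python
import math
from collections import Counter

def n_version_vote(versions, *args):
    """Run every version on the same input and return the strict-majority output."""
    outputs = []
    for version in versions:
        try:
            outputs.append(version(*args))
        except Exception:
            continue  # a fail-fast version that stops contributes no vote
    if not outputs:
        raise RuntimeError("all versions failed")
    value, votes = Counter(outputs).most_common(1)[0]
    if votes * 2 <= len(versions):
        raise RuntimeError("no majority among versions")
    return value

# Five hypothetical implementations of integer square root (N = 5).
def isqrt_float(n): return int(n ** 0.5)
def isqrt_math(n): return math.isqrt(n)
def isqrt_loop(n):
    r = 0
    while (r + 1) * (r + 1) <= n:
        r += 1
    return r
def isqrt_buggy(n): return n // 2                    # deliberately wrong
def isqrt_crashing(n): raise RuntimeError("fault")   # deliberately fail-fast

versions = [isqrt_float, isqrt_math, isqrt_loop, isqrt_buggy, isqrt_crashing]
print(n_version_vote(versions, 49))
```

The majority is counted against the total number of versions, so a crashed version weakens the quorum instead of silently shrinking it.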

What Is MTTF of N-Version Program?
First fails after MTTF/N
Second fails after MTTF/(N-1), ...
  so MTTF * (1/N + 1/(N-1) + ... + 1/2)
Harmonic series goes to infinity, but VERY slowly:
  for example, 100-version programming gives ~4 MTTF of 1-version programming
Reduces variance

N-Version Programming Needs REPAIR
If a program fails, must reset its state from other programs.
  => programs have common data/state representation.
How does this work for Database Systems? Operating Systems? Network Systems?
Answer: I don’t know.
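
The "~4 MTTF" figure for 100 versions can be reproduced by summing the series on the slide directly. A minimal sketch with a hypothetical function name:

```python
def n_version_mttf_factor(n: int) -> float:
    """Sum 1/N + 1/(N-1) + ... + 1/2: the slide's multiple of the single-version MTTF."""
    return sum(1.0 / k for k in range(2, n + 1))

for n in (3, 10, 100, 1000):
    print(f"N = {n:>4}: about {n_version_mttf_factor(n):.2f} x single-version MTTF")
```

Going from 100 to 1000 versions adds only a little over two more MTTFs, which is the slow harmonic growth the slide points at, and is why repair is needed rather than more versions.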

Why Process Pairs Mask Faults: Many Software Faults are Soft
After
  Design Review
  Code Inspection
  Alpha Test
  Beta Test
  10k Hrs of Gamma Test (Production)
most software faults are transient:
  MVS Functional Recovery Routines   5:1
  Tandem Spooler                     100:1
  Adams                              >100:1
Terminology:
  Heisenbug: works on retry
  Bohrbug: faults again on retry
Adams: "Optimizing Preventive Service of Software Products", IBM J R&D, 28.1, 1984
Gray: "Why Do Computers Stop", Tandem TR85.7, 1985
Mourad: "The Reliability of the IBM/XA Operating System", 15 ISFTCS, 1985.

Process Pair Repair Strategy
If the software fault (bug) is a Bohrbug, then there is no repair:
  “wait for the next release” or
  “get an emergency bug fix” or
  “get a new vendor”
If the software fault is a Heisenbug, then repair is
  reboot and retry, or
  switch to backup process (instant restart)
PROCESS PAIRS tolerate
  hardware faults
  Heisenbugs
Repair time is seconds, could be milliseconds if time is critical.
Flavors of process pair:
  Lockstep
  Automatic
  State Checkpointing
  Delta Checkpointing
  Persistent
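
The retry-on-backup idea can be sketched as follows. This is a toy, single-process illustration, not Tandem's mechanism: `ProcessPair`, `make_server`, and `bohrbug_server` are hypothetical names, and the transient fault is simulated by a server that faults a fixed number of times and then works.

```python
class ProcessPair:
    """Send each request to the active process; on a fault, switch to the standby and retry."""

    def __init__(self, primary, backup):
        self.active, self.standby = primary, backup

    def call(self, request):
        try:
            return self.active(request)
        except Exception:
            # Takeover: the standby becomes active and the request is retried once.
            self.active, self.standby = self.standby, self.active
            return self.active(request)  # a second fault propagates: probably a Bohrbug

def make_server(transient_faults=0):
    """Hypothetical server that doubles its input but faults the first few calls."""
    remaining = [transient_faults]
    def serve(request):
        if remaining[0] > 0:
            remaining[0] -= 1
            raise RuntimeError("transient fault (Heisenbug)")
        return request * 2
    return serve

def bohrbug_server(request):
    raise ValueError("deterministic fault (Bohrbug)")

pair = ProcessPair(make_server(transient_faults=1), make_server())
print(pair.call(21))  # the primary faults once, the backup answers
```

A pair built from two copies of `bohrbug_server` fails on every retry, which matches the slide: takeover masks Heisenbugs, while a Bohrbug has to wait for a fix.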

How Takeover Masks Failures
Server resets at takeover.
But what about application state? Database state? Network state?
Answer: use transactions to reset state!
  Abort transaction if process fails.
  Keeps network "up".
  Keeps system "up".
  Reprocesses some transactions on failure.
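
Abort-then-reprocess can be demonstrated with any transactional store. The sketch below uses Python's built-in `sqlite3` module as a stand-in for the transaction manager; the `account` table, the `transfer` function, and the simulated crash are assumptions made for the example, and a real takeover would happen in a separate process.

```python
import sqlite3

def transfer(conn, src, dst, amount, crash_midway=False):
    """Move money inside one transaction; an exception rolls back every update."""
    with conn:  # commits on success, rolls back if the block raises
        conn.execute("UPDATE account SET balance = balance - ? WHERE id = ?", (amount, src))
        if crash_midway:
            raise RuntimeError("simulated process failure mid-transaction")
        conn.execute("UPDATE account SET balance = balance + ? WHERE id = ?", (amount, dst))

def balances(conn):
    return dict(conn.execute("SELECT id, balance FROM account ORDER BY id"))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()

try:
    transfer(conn, "alice", "bob", 30, crash_midway=True)  # primary faults mid-transaction
except RuntimeError:
    pass
after_abort = balances(conn)          # state is reset, as if the transfer never started

transfer(conn, "alice", "bob", 30)    # after takeover, the transaction is reprocessed
after_retry = balances(conn)
print(after_abort, after_retry)
```

Because the abort restores a clean state, the process that takes over only needs to rerun the request; it does not have to reason about half-applied updates.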

PROCESS PAIRS - SUMMARY
Transactions give reliability.
Process pairs give availability.
Process pairs are expensive & hard to program.
Transactions + persistent process pairs => fault-tolerant sessions & execution.
When Tandem converted to this style:
  Saved 3x messages
  Saved 5x message bytes
  Made programming easier

SYSTEM PAIRS FOR HIGH AVAILABILITY
[Figure: Primary and Backup]
Programs, data, and processes are replicated at two sites.
  Pair looks like a single system.
  System becomes a logical concept.
Like process pairs: system pairs.
Backup receives transaction log (spooled if backup is down).
If primary fails or operator switches, backup offers service.
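
The log-shipping behavior, including spooling while the backup is down, can be sketched as below. The `Primary` and `Backup` classes, the key/value log records, and the in-memory spool are all hypothetical simplifications; a real system pair ships a durable transaction log between sites.

```python
class Backup:
    """Standby site: replays log records shipped from the primary."""

    def __init__(self):
        self.up = True
        self.state = {}

    def apply(self, record):
        key, value = record
        self.state[key] = value

class Primary:
    """Active site: applies each committed update locally and ships it to the backup."""

    def __init__(self, backup):
        self.backup = backup
        self.state = {}
        self.spool = []  # log records not yet delivered to the backup

    def commit(self, key, value):
        self.state[key] = value
        self.spool.append((key, value))
        self.ship_log()

    def ship_log(self):
        if not self.backup.up:
            return  # backup is down: records stay spooled
        while self.spool:
            self.backup.apply(self.spool.pop(0))

backup = Backup()
primary = Primary(backup)
primary.commit("acct-1", 100)
backup.up = False
primary.commit("acct-2", 250)      # spooled while the backup is down
spooled_while_down = list(primary.spool)
backup.up = True
primary.ship_log()                 # drain the spool once the backup returns
print(backup.state == primary.state)
```

Once the spool is drained the backup holds the same state as the primary, so it can offer service if the primary fails or the operator switches.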

SYSTEM PAIR BENEFITS
Protects against ENVIRONMENT: weather, utilities, sabotage
Protects against OPERATOR FAILURE: two sites, two sets of operators
Protects against MAINTENANCE OUTAGES: work on backup (software/hardware install/upgrade/move...)
Protects against HARDWARE FAILURES: backup takes over
Protects against TRANSIENT SOFTWARE ERRORS
Allows design diversity (different sites have different software/hardware)

Key Idea
{Architecture, Software, Distribution} masks {Hardware Faults, Environmental Faults, Maintenance}
Software automates / eliminates operators.
So, in the limit there are only software & design faults.
  Many are Heisenbugs.
Software-fault tolerance is the key to dependability.
INVENT IT!

References
Adams, E. (1984). “Optimizing Preventive Service of Software Products.” IBM Journal of Research and Development. 28(1): 2-14.
Anderson, T. and B. Randell. (1979). Computing Systems Reliability.
Garcia-Molina, H. and C. A. Polyzois. (1990). Issues in Disaster Recovery. 35th IEEE Compcon 90. 573-577.
Gray, J. (1986). Why Do Computers Stop and What Can We Do About It. 5th Symposium on Reliability in Distributed Software and Database Systems. 3-12.
Gray, J. (1990). “A Census of Tandem System Availability between 1985 and 1990.” IEEE Transactions on Reliability. 39(4): 409-418.
Gray, J. N. and A. Reuter. (1993). Transaction Processing Concepts and Techniques. San Mateo, Morgan Kaufmann.
Lampson, B. W. (1981). Atomic Transactions. Distributed Systems -- Architecture and Implementation: An Advanced Course. ACM, Springer-Verlag.
Laprie, J. C. (1985). Dependable Computing and Fault Tolerance: Concepts and Terminology. 15th FTCS. 2-11.
Long, D. D., J. L. Carroll, and C. J. Park. (1991). A Study of the Reliability of Internet Sites. Proc. 10th Symposium on Reliable Distributed Systems, Pisa, September 1991. 177-186.
Long, D., A. Muir, and R. Golding. (1995). A Longitudinal Study of Internet Host Reliability. Proceedings of the Symposium on Reliable Distributed Systems, Bad Neuenahr, Germany, IEEE, September 1995. 2-9.