Dependable Computing Systems. Jim Gray (Microsoft.com). UC Berkeley McKay Lecture, 25 April 1995. [Gray FT 4/24/95]


Slide 1: Dependable Computing Systems
Jim Gray, Microsoft.com
UC Berkeley McKay Lecture, 25 April 1995
Talk 1: Many little will win over few big, so parallel computers are in your future.
Talk 2: Database folks do parallelism with dataflow. They get near-linear scaleup and automatic parallelism.
Talk 3: Fault tolerance is important if you have thousands of parts (many little machines have many little failures).

Slide 2: The Airplane Rule
A two-engine airplane has twice as many engine problems.
A thousand-engine airplane has thousands of engine problems.
Fault tolerance is KEY: mask and repair faults.
Internet: a node fails every 2 weeks.
Vendors: a disk fails every 40 years.
Here (high-speed network, 10 Gb/s): a node fails every 20 minutes, a disk fails every 2 weeks.
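
The "here" figures are what you get by dividing one part's failure interval by the number of parts. The sketch below reproduces that arithmetic under two assumptions that are not stated on the slide: failures are independent, and the system has 1,000 nodes and 1,000 disks (a count chosen because it matches the quoted numbers). The function name is illustrative.

```python
MINUTES_PER_WEEK = 7 * 24 * 60
WEEKS_PER_YEAR = 52


def system_failure_interval(part_interval, n_parts):
    """Mean time between failures somewhere in a system of n_parts
    identical, independently failing parts."""
    return part_interval / n_parts


N_PARTS = 1000  # assumed part count, not given on the slide

# One node fails every 2 weeks; how often does *some* node fail?
node_minutes = system_failure_interval(2 * MINUTES_PER_WEEK, N_PARTS)

# One disk fails every 40 years; how often does *some* disk fail?
disk_weeks = system_failure_interval(40 * WEEKS_PER_YEAR, N_PARTS)

print(f"some node fails every {node_minutes:.1f} minutes")
print(f"some disk fails every {disk_weeks:.1f} weeks")
```

With these assumptions the results come out at roughly 20 minutes and 2 weeks, in line with the slide.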

Slide 3: Outline
Does fault tolerance work?
General methods to mask faults.
Software-fault tolerance
Summary

Slide 4: DEPENDABILITY: The 3 ITIES
RELIABILITY / INTEGRITY: does the right thing (also large MTTF).
AVAILABILITY: does it now (also large MTTF / (MTTF + MTTR)).
System availability: if 90% of terminals are up and 99% of the DB is up, then about 89% of transactions are serviced on time.
Holistic vs. reductionist view.
Diagram labels: Security, Integrity / Reliability, Availability.
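
The two availability figures on this slide can be checked with a few lines of arithmetic. This is a minimal sketch, assuming the terminal and database failures are independent so that their availabilities multiply; the function names and the 2000 hr / 1 hr example values are illustrative.

```python
def availability(mttf, mttr):
    """Fraction of time a module is up: MTTF / (MTTF + MTTR)."""
    return mttf / (mttf + mttr)


def series_availability(*component_availabilities):
    """Availability of a system that needs every component up at once,
    assuming the components fail independently."""
    result = 1.0
    for a in component_availabilities:
        result *= a
    return result


# 90% of terminals up and 99% of the database up.
system = series_availability(0.90, 0.99)
print(f"transactions serviced on time: {system:.1%}")

# A module with a 2000-hour MTTF and a 1-hour MTTR (illustrative values).
print(f"module availability: {availability(2000, 1):.4%}")
```

The product 0.90 x 0.99 = 0.891 is the "89%" on the slide.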

Slide 5: High Availability System Classes
Goal: build Class 6 systems.

System Type              Unavailable (min/year)   Availability   Availability Class
Unmanaged                50,000                   90.%           1
Managed                  5,000                    99.%           2
Well Managed             500                      99.9%          3
Fault Tolerant           50                       99.99%         4
High-Availability        5                        99.999%        5
Very-High-Availability   .5                       99.9999%       6
Ultra-Availability       .05                      99.99999%      7
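
The class numbers and the minutes-per-year column are tied together by simple arithmetic. The sketch below assumes the availability class is the number of leading nines, that is floor(log10(1 / (1 - availability))), which is an interpretation consistent with the 90%, 99%, 99.9% progression on the slide. Function names are illustrative.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def availability_class(availability):
    """Number of leading nines in the availability figure."""
    # The small epsilon guards against floating-point results
    # such as 2.9999999999999996 for an availability of 0.999.
    return math.floor(-math.log10(1.0 - availability) + 1e-9)


def downtime_minutes_per_year(cls):
    """Expected unavailable minutes per year for a given class."""
    return MINUTES_PER_YEAR / 10 ** cls


for cls in range(1, 8):
    print(f"class {cls}: {downtime_minutes_per_year(cls):,.2f} min/year")
```

Class 1 works out to 52,560 minutes per year, which the slide rounds to 50,000. Each additional class divides the downtime by ten.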

Slide 6: Sources of Failures

                      MTTF         MTTR
Power failure         2000 hr      1 hr
Phone lines, soft     >.1 hr       .1 hr
Phone lines, hard     4000 hr      10 hr
Hardware modules      100,000 hr   10 hr   (many are transient)

Software: 1 bug per 1000 lines of code (after vendor-user testing) => thousands of bugs in the system!
Most software failures are transient: dump & restart system.
Useful fact: 8,760 hrs/year ~ 10k hr/year

Slide 7: Case Studies - Japan
Source: "Survey on Computer Security", Japan Info Dev Corp., March (trans: Eiichi Watanabe).
MTTF by outage source:
  Vendor (hardware and software)   5 months
  Application software             9 months
  Communications lines             1.5 years
  Operations                       2 years
  Environment                      2 years
  Overall                          10 weeks
1,383 institutions reported (6/84 - 7/85): 7,517 outages, MTTF ~ 10 weeks, average duration ~ 90 minutes.
To get a 10-year MTTF, must attack all these areas.

Slide 8: Case Studies - Tandem
Outage reports to vendor. Totals:
  More than 7,000 customer years
  More than 30,000 system years
  More than 80,000 processor years
  More than 200,000 disc years
Systematic under-reporting, but ratios & trends are interesting.

Slide 9: Case Studies - Tandem Trends
MTTF improved: WOW! (Chart: outages per millennium.)
Shift from Hardware & Maintenance (from 50% to 10%) to Software (62%) & Operations (15%).
NOTE: systematic under-reporting of Environment, Operations errors, and Application Software.

Slide 10: Case Studies - Tandem Trends
Reported MTTF by component, in years:
  SOFTWARE
  HARDWARE
  MAINTENANCE
  OPERATIONS
  ENVIRONMENT
  SYSTEM: 8, 20, 21
Remember systematic under-reporting.

Slide 11: Summary
Current situation: ~4-year MTTF => fault tolerance works.
Hardware is GREAT (maintenance and MTTF).
Software masks most hardware faults.
Many hidden software outages in operations:
  – New System Software.
  – New Application Software.
  – Utilities.
Must make all software ONLINE.
Software seems to define a 30-year MTTF ceiling.
Reasonable goal: 100-year MTTF. Class 4 today => class 6 tomorrow.

Slide 12: Outline
Does fault tolerance work?
General methods to mask faults.
Software-fault tolerance
Summary

Slide 13: Key Idea
{Architecture, Software, Distribution} masks {Hardware Faults, Environmental Faults, Maintenance}.
Software automates / eliminates operators.
So, in the limit there are only software & design faults.
Software-fault tolerance is the key to dependability. INVENT IT!

Slide 14: Fault Tolerance Techniques
FAIL FAST MODULES: work or stop.
SPARE MODULES: instant repair time.
INDEPENDENT MODULE FAILS by design: MTTF(pair) ~ MTTF^2 / MTTR (so want tiny MTTR).
MESSAGE BASED OS: fault isolation; software has no shared memory.
SESSION-ORIENTED COMM: reliable messages; detect lost/duplicate messages; coordinate messages with commit.
PROCESS PAIRS: mask hardware & software faults.
TRANSACTIONS: give A.C.I.D. (simple fault model).
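
The pair formula shows why a tiny MTTR matters so much. The sketch below evaluates the slide's approximation MTTF^2 / MTTR for a module with a 10,000-hour MTTF and a few repair times. These input values are illustrative, the function name is hypothetical, and more detailed derivations often carry an extra constant factor that is ignored here.

```python
HOURS_PER_YEAR = 8760


def pair_mttf(module_mttf_hr, module_mttr_hr):
    """Approximate MTTF of a repairable pair of independently failing
    modules, using the slide's rule of thumb MTTF^2 / MTTR."""
    return module_mttf_hr ** 2 / module_mttr_hr


module_mttf = 10_000  # hours, illustrative
for mttr in (100, 10, 1):  # hours to repair the failed half of the pair
    pair = pair_mttf(module_mttf, mttr)
    print(
        f"MTTR {mttr:>3} hr: pair MTTF {pair:,.0f} hr "
        f"(~{pair / HOURS_PER_YEAR:,.0f} years, "
        f"{pair / module_mttf:,.0f}x one module)"
    )
```

Cutting the repair time by 10x raises the pair MTTF by 10x. With a 1-hour repair, the pair outlasts a single module by a factor of 10^4.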

Slide 15: Example: the FT Bank
Modularity & repair are KEY:
  von Neumann needed 20,000x redundancy in wires and switches.
  We use 2x redundancy.
Redundant hardware can support peak loads (so not redundant).

Slide 16: Fail-Fast is Good, Repair is Needed
Improving either MTTR or MTTF gives benefit.
Simple redundancy does not help much.
Lifecycle of a module: fail-fast gives short fault latency.
High availability is low UN-availability: Unavailability ~ MTTR / MTTF.

Slide 17: Hardware Reliability/Availability (how to make it fail fast)
Comparator strategies:
Duplex:
  Fail-Fast: fail if either fails (e.g. duplexed cpus)
  vs. Fail-Soft: fail if both fail (e.g. disc, atm, ...)
  Note: in recursive pairs, parent knows which is bad.
Triplex:
  Fail-Fast: fail if 2 fail (triplexed cpus)
  Fail-Soft: fail if 3 fail (triplexed FailFast cpus)

Slide 18: Redundant Designs have Worse MTTF!
THIS IS NOT GOOD: variance is lower but MTTF is worse.
Simple redundancy does not improve MTTF (sometimes hurts).
This is just an example of the airplane rule.
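
To see why redundancy without repair can hurt, compare the expected lifetimes of the duplex and triplex designs. The sketch assumes exponentially distributed, independent failures and no repair, which is a standard modelling assumption rather than something stated on the slide. Under it, the first of n live modules fails after MTTF/n on average, the next after MTTF/(n-1), and so on. The function name is illustrative.

```python
from fractions import Fraction


def mttf_until_kth_failure(n, k, mttf=Fraction(1)):
    """Expected time until k of n modules have failed, with no repair,
    for independent modules with exponential lifetimes of mean mttf."""
    return mttf * sum(Fraction(1, n - i) for i in range(k))


designs = {
    "duplex fail-fast (fails when 1 of 2 fails)": (2, 1),
    "duplex fail-soft (fails when 2 of 2 fail)": (2, 2),
    "triplex fail-fast (fails when 2 of 3 fail)": (3, 2),
    "triplex fail-soft (fails when 3 of 3 fail)": (3, 3),
}

for name, (n, k) in designs.items():
    print(f"{name}: {mttf_until_kth_failure(n, k)} x module MTTF")
```

The two fail-fast designs come out at 1/2 and 5/6 of a single module's MTTF, so they are worse than no redundancy at all. The fail-soft designs reach only 3/2 and 11/6, a small gain for two or three times the hardware.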

Slide 19: Add Repair: Get 10^4 Improvement

Slide 20: When To Repair?
Chances of tolerating a fault are 1000:1 (class 3).
A 1995 study: processor & disc rated at ~10khr MTTF.

  Computed single failures   Observed double fails   Ratio
  10k processor fails        14 double               ~ 1000 : 1
  40k disc fails             26 double               ~ 1000 : 1

Hardware maintenance: on-line maintenance "works" 999 times out of 1000.
The chance a duplexed disc will fail during maintenance? 1:1000.
Risk is 30x higher during maintenance => do it off peak hour.
Software maintenance: repair only virulent bugs; wait for next release to fix benign bugs.

Slide 21: OK: So Far
Hardware fail-fast is easy.
Redundancy plus repair is great (Class 7 availability).
Hardware redundancy & repair is via modules.
How can we get instant software repair?
We know how to get reliable storage: RAID, or dumps and transaction logs.
We know how to get available storage: fail-soft duplexed discs (RAID 1...N).
? How do we get reliable execution?
? How do we get available execution?

Slide 22: Outline
Does fault tolerance work?
General methods to mask faults.
Software-fault tolerance
Summary

Slide 23: Software Techniques: Learning from Hardware
Recall that most outages are not hardware. Most outages in fault-tolerant systems are SOFTWARE.
Fault avoidance techniques: good & correct design.
After that, software fault tolerance techniques:
  Modularity (isolation, fault containment)
  Design diversity
  N-Version Programming: N different implementations
  Defensive programming: check parameters and data
  Auditors: check data structures in background
  Transactions: to clean up state after a failure
Paradox: need fail-fast software.

Slide 24: Fail-Fast and High-Availability Execution
Software N-plexing: design diversity
  N-Version Programming: write the same program N times (N > 3).
  Compare outputs of all programs and take majority vote.
Process pairs: instant restart (repair)
  Use defensive programming to make a process fail-fast.
  Have restarted process ready in separate environment.
  Second process takes over if primary faults.
  Transaction mechanism can clean up distributed state if takeover occurs in the middle of a computation.
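
The voting step of N-version programming can be sketched in a few lines. This is a toy illustration, not a description of any production voter: the five "versions" below are hypothetical independent implementations of squaring, one of them deliberately buggy, and `majority_vote` is an illustrative name.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the output produced by a strict majority of versions,
    or fail fast if no such majority exists."""
    winner, votes = Counter(outputs).most_common(1)[0]
    if votes * 2 <= len(outputs):
        raise RuntimeError("no majority: versions disagree")
    return winner


# Five hypothetical implementations of "square x".
versions = [
    lambda x: x * x,
    lambda x: x ** 2,
    lambda x: pow(x, 2),
    lambda x: sum(x for _ in range(x)),      # repeated addition, valid for x >= 0
    lambda x: x * x + (1 if x == 3 else 0),  # deliberately buggy at x == 3
]

result = majority_vote([version(3) for version in versions])
print(result)
```

The buggy version is outvoted four to one, so the caller sees the correct answer. If the versions split without a majority, the voter raises an error instead of guessing, which keeps the ensemble fail-fast.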

Slide 25: What Is MTTF of N-Version Program?
First fails after MTTF/N.
Second fails after MTTF/(N-1), ...
so MTTF * (1/N + 1/(N-1) + ... + 1/2).
The harmonic series goes to infinity, but VERY slowly: for example, 100-version programming gives ~4 MTTF of 1-version programming.
Reduces variance.
N-Version Programming needs REPAIR:
  If a program fails, must reset its state from other programs.
  => programs have common data/state representation.
How does this work for database systems? Operating systems? Network systems?
Answer: I don't know.
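
The "~4 MTTF" figure can be checked numerically. The sketch reads the slide's expression as MTTF x (1/N + 1/(N-1) + ... + 1/2), an interpretation that reproduces the quoted value for N = 100. The function name is illustrative.

```python
def n_version_mttf_factor(n):
    """Multiple of the single-version MTTF obtained from n versions
    without repair: 1/n + 1/(n-1) + ... + 1/2."""
    return sum(1.0 / k for k in range(2, n + 1))


for n in (3, 10, 100, 1000):
    print(f"N = {n:>4}: {n_version_mttf_factor(n):.2f} x MTTF")
```

For N = 100 the factor is about 4.19. Going to N = 1000 adds only about 2.3 more, which is the "VERY slowly" on the slide.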

Slide 26: Why Process Pairs Mask Faults
Many software faults are soft.
After design review, code inspection, alpha test, beta test, and 10k hrs of gamma test (production), most software faults are transient:
  MVS Functional Recovery Routines   5:1
  Tandem Spooler                     100:1
  Adams                              >100:1
Terminology:
  Heisenbug: works on retry.
  Bohrbug: faults again on retry.
Sources:
  Adams: "Optimizing Preventative Service of Software Products", IBM J R&D, 28.1, 1984
  Gray: "Why Do Computers Stop", Tandem TR85.7, 1985
  Mourad: "The Reliability of the IBM/XA Operating System", 15 ISFTCS, 1985

Slide 27: Process Pair Repair Strategy
If the software fault (bug) is a Bohrbug, then there is no repair:
  wait for the next release, or
  get an emergency bug fix, or
  get a new vendor.
If the software fault is a Heisenbug, then repair is:
  reboot and retry, or
  switch to backup process (instant restart).
PROCESS PAIRS tolerate hardware faults and Heisenbugs.
Repair time is seconds, could be milliseconds if time is critical.
Flavors of process pair:
  Lockstep
  Automatic
  State Checkpointing
  Delta Checkpointing
  Persistent

Slide 28: How Takeover Masks Failures
Server resets at takeover.
But what about application state? Database state? Network state?
Answer: use transactions to reset state! Abort the transaction if the process fails.
Keeps network "up".
Keeps system "up".
Reprocesses some transactions on failure.
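
A toy simulation can make the takeover idea concrete. It is a sketch under strong simplifying assumptions, not Tandem's mechanism: both "processes" live in one Python program, a transaction is modelled as work on a private copy of a dictionary, and the transient fault is injected by a counter. All class and function names are hypothetical.

```python
class FailFast(Exception):
    """Raised when a process detects a fault and stops rather than continue."""


class Server:
    def __init__(self, name, transient_faults=0):
        self.name = name
        self.transient_faults = transient_faults  # simulated Heisenbugs left to fire

    def transfer(self, balances, src, dst, amount):
        # Defensive programming: reject bad input instead of corrupting state.
        if amount <= 0 or balances.get(src, 0) < amount:
            raise FailFast(f"{self.name}: invalid transfer")
        balances[src] -= amount
        if self.transient_faults > 0:  # fault strikes in mid-transaction
            self.transient_faults -= 1
            raise FailFast(f"{self.name}: transient fault")
        balances[dst] = balances.get(dst, 0) + amount


def run_pair(requests, primary, backup, initial):
    committed = dict(initial)
    active, standby = primary, backup
    takeovers = 0
    for request in requests:
        for _attempt in range(2):
            working = dict(committed)  # the transaction updates a private copy
            try:
                active.transfer(working, *request)
            except FailFast:
                # Abort: drop the half-done copy, then let the backup take over.
                active, standby = standby, active
                takeovers += 1
                continue
            committed = working  # commit
            break
        else:
            raise RuntimeError(f"{request} fails on retry too (Bohrbug)")
    return committed, takeovers


final, takeovers = run_pair(
    [("a", "b", 30), ("b", "a", 5)],
    Server("primary", transient_faults=1),
    Server("backup"),
    {"a": 100, "b": 0},
)
print(final, takeovers)
```

The primary faults after debiting account "a" but before crediting "b". Because the half-done work is discarded and the request is reprocessed by the backup, the committed balances still add up to 100 and the client sees one takeover rather than an outage.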

Slide 29: PROCESS PAIRS - SUMMARY
Transactions give reliability.
Process pairs give availability.
Process pairs are expensive & hard to program.
Transactions + persistent process pairs => fault-tolerant sessions and execution.
When Tandem converted to this style:
  Saved 3x messages
  Saved 5x message bytes
  Made programming easier

Slide 30: SYSTEM PAIRS FOR HIGH AVAILABILITY
Programs, data, and processes are replicated at two sites.
The pair looks like a single system: "system" becomes a logical concept.
Like process pairs: system pairs.
Backup receives the transaction log (spooled if backup is down).
If primary fails or operator switches, backup offers service.

Slide 31: SYSTEM PAIR CONFIGURATION OPTIONS
Mutual backup: each has 1/2 of database & application.
Hub: one site acts as backup for many others.
In general, can be any directed graph.
Stale replicas: lazy replication.

Slide 32: SYSTEM PAIRS FOR: SOFTWARE MAINTENANCE
Similar ideas apply to:
  Database reorganization
  Hardware modification (e.g. add discs, processors, ...)
  Hardware maintenance
  Environmental changes (rewire, new air conditioning)
  Move primary or backup to new location.

Slide 33: SYSTEM PAIR BENEFITS
Protects against ENVIRONMENT: different sites (weather, utilities, sabotage).
Protects against OPERATOR FAILURE: two sites, two sets of operators.
Protects against MAINTENANCE OUTAGES: work on backup (software/hardware install/upgrade/move...).
Protects against HARDWARE FAILURES: backup takes over.
Protects against TRANSIENT SOFTWARE ERRORS.
Commercial systems:
  Digital's Remote Transaction Router (RTR)
  Tandem's Remote Database Facility (RDF)
  IBM's Cross Recovery XRF (both in same campus)
  Oracle, Sybase, Informix, Microsoft... replication

Slide 34: SUMMARY
FT systems fail for the conventional reasons:
  Environment: mostly
  People: sometimes
  Software: mostly
  Hardware: rarely
MTTF of FT systems ~ 50x conventional (~ years vs. weeks).
Fail-fast modules + reconfiguration + repair => good hardware fault tolerance.
Transactions + process pairs => good software fault tolerance (repair).
System pairs hide many faults.
Challenge: tolerate human errors (make system simpler to manage, operate, and maintain).

Slide 35: Key Idea
{Architecture, Software, Distribution} masks {Hardware Faults, Environmental Faults, Maintenance}.
Software automates / eliminates operators.
So, in the limit there are only software & design faults.
Software-fault tolerance is the key to dependability. INVENT IT!

Slide 36: References
Adams, E. (1984). Optimizing Preventative Service of Software Products. IBM Journal of Research and Development. 28(1).
Anderson, T. and B. Randell. (1979). Computing Systems Reliability.
Garcia-Molina, H. and C. A. Polyzois. (1990). Issues in Disaster Recovery. 35th IEEE Compcon.
Gray, J. (1986). Why Do Computers Stop and What Can We Do About It. 5th Symposium on Reliability in Distributed Software and Database Systems.
Gray, J. (1990). A Census of Tandem System Availability between 1985 and. IEEE Transactions on Reliability. 39(4).
Gray, J. N., Reuter, A. (1993). Transaction Processing Concepts and Techniques. San Mateo, Morgan Kaufmann.
Lampson, B. W. (1981). Atomic Transactions. Distributed Systems -- Architecture and Implementation: An Advanced Course. ACM, Springer-Verlag.
Laprie, J. C. (1985). Dependable Computing and Fault Tolerance: Concepts and Terminology. 15th FTCS.
Long, D. D., J. L. Carroll, and C. J. Park (1991). A study of the reliability of Internet sites. Proc. 10th Symposium on Reliable Distributed Systems, Pisa, September 1991.