J. Gray, Dependability in the Internet Era (acknowledgement: slides from J.Gray, E.Brewer)

The Last 10 Years: Availability Dark Ages. Ready for a Renaissance?
Things got better, then things got a lot worse!
[Chart: Availability of Computer Systems, Telephone Systems, Cell phones, and the Internet. Surviving axis labels: 99%, 99.9%, 99.99%, 2010.]

DEPENDABILITY: The 3 ITIES
- RELIABILITY / INTEGRITY: Does the right thing. (also MTTF >> 1)
- AVAILABILITY: Does it now. (also 1 >> MTTR)
- Availability = MTTF / (MTTF + MTTR)
- System Availability: If 90% of terminals are up and 99% of the DB is up, then about 89% of transactions are serviced on time (0.90 × 0.99 ≈ 0.89).
- Holistic vs. Reductionist view
[Diagram labels: Security, Integrity, Reliability, Availability]
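The terminal/DB figure on this slide can be reproduced with a few lines of Python. This is a minimal sketch, assuming the components fail independently and that a transaction needs every component in the chain to be up; the function names are illustrative and not from the slides.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time a module is up: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def series_availability(*parts: float) -> float:
    """Availability of a chain in which every part must be up.

    Assumes the parts fail independently, so the fractions multiply.
    """
    result = 1.0
    for a in parts:
        result *= a
    return result


# Terminals up 90% of the time, database up 99% of the time.
print(f"{series_availability(0.90, 0.99):.3f}")
# A module that runs 999 hours between failures and takes 1 hour to repair.
print(f"{availability(999, 1):.3f}")
```

The product is why the slide contrasts a holistic view with a reductionist one: each part can look acceptable alone while the end-to-end figure is worse than any single part.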

Fail-Fast is Good, Repair is Needed
- Improving either MTTR or MTTF gives benefit.
- Lifecycle of a module: fail-fast gives short fault latency.
- High Availability is low UN-Availability.
- Unavailability ≈ MTTR / MTTF
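A short sketch of why improving either term helps: under the approximation Unavailability ≈ MTTR / MTTF, halving repair time and doubling time-to-failure have the same effect. The numbers below are made up for illustration and the function names are hypothetical.

```python
def unavailability_exact(mttf_hours: float, mttr_hours: float) -> float:
    """Exact fraction of time down: MTTR / (MTTF + MTTR)."""
    return mttr_hours / (mttf_hours + mttr_hours)


def unavailability_approx(mttf_hours: float, mttr_hours: float) -> float:
    """Approximation used on the slide, valid when MTTF >> MTTR."""
    return mttr_hours / mttf_hours


# Illustrative module: fails every 10,000 hours, takes 1 hour to repair.
baseline = unavailability_approx(10_000, 1.0)
faster_repair = unavailability_approx(10_000, 0.5)  # halve MTTR
longer_life = unavailability_approx(20_000, 1.0)    # double MTTF

print(baseline, faster_repair, longer_life)
# The approximation error is small when MTTF dominates MTTR.
print(abs(unavailability_exact(10_000, 1.0) - baseline))
```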

Disks (RAID): the BIG Success Story
- Duplex or Parity: masks faults
- 1M hours (~100 years)
- But controllers fail, and there are 1,000s of disks.
- Duplexing or parity, and dual path, gives "perfect disks".
- Wal-Mart never lost a byte (thousands of disks, hundreds of failures).
- Only software/operations mistakes are left.
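The tension between "1M hours" and "1,000s of disks" can be worked out as follows. This sketch assumes the 1M-hour figure is a per-disk MTTF and that disks fail independently at a constant rate; both are assumptions, and the helper names are illustrative.

```python
HOURS_PER_YEAR = 365.25 * 24  # 8766 hours


def mttf_years(mttf_hours: float) -> float:
    """Convert a per-device MTTF from hours to years."""
    return mttf_hours / HOURS_PER_YEAR


def fleet_hours_between_failures(mttf_hours: float, n_devices: int) -> float:
    """Expected hours between failures somewhere in a fleet.

    Assumes independent devices with a constant failure rate,
    so the fleet failure rate is n_devices / mttf_hours.
    """
    return mttf_hours / n_devices


per_disk_years = mttf_years(1_000_000)
fleet_hours = fleet_hours_between_failures(1_000_000, 1_000)

print(f"one disk: about {per_disk_years:.0f} years between failures")
print(f"1,000 disks: a failure about every {fleet_hours / 24:.0f} days")
```

Under these assumptions a single disk looks nearly immortal, yet a site with a thousand of them sees a failure roughly every six weeks, which is why duplexing or parity is needed to mask faults.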

Fault Tolerance vs Disaster Tolerance
Fault Tolerance: masks local faults
- RAID disks
- Uninterruptible Power Supplies
- Cluster Failover
Disaster Tolerance: masks site failures
- Protects against fire, flood, sabotage, ...
- Also, software changes, site moves, ...
- Redundant system and service at remote site.

Availability (from lowest to highest)
- Un-managed
- Well-managed nodes: masks some hardware failures.
- Well-managed packs & clones: masks hardware failures and operations tasks (e.g. software upgrades); masks some software failures.
- Well-managed GeoPlex: masks site failures (power, network, fire, move, ...); masks some operations failures.

Case Studies - Tandem Trends
- MTTF improved.
- Shift from Hardware & Maintenance (from 50% to 10%) to Software (62%) & Operations (15%).
- NOTE: Systematic under-reporting of Environment, Operations errors, and Application Software.

Dependability Status circa 1995
- ~4-year MTTF; 5 9s for well-managed systems.
- Fault Tolerance works.
- Hardware is GREAT (maintenance and MTTF).
- Software masks most hardware faults.
- Many hidden software outages in operations: new software, utilities.
- Need to make all hardware/software changes ONLINE.

Progress?
- MTTF improved from
- MTTR: incremental improvements (failover)
- Hardware and Software online change (pNp) is now standard.
Then the Internet arrived:
- No project can take more than 3 months.
- Time to market is everything.
- Change is good.
[Chart labels: Computer Systems, Telephone Systems, Cell phones, Internet]

The Internet Changed Expectations
1990:
- Phones delivered %
- ATMs delivered 99.99%
- Failures were front-page news.
- Few hackers.
- Outages last an "hour".
2005:
- Cell phones deliver 90%
- Web sites deliver 99%
- Failures are business-page news.
- Many hackers.
- Outages last a "day".
This is progress?
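To make the percentages on this slide concrete, the sketch below converts an availability figure into expected downtime per year. It assumes a 365-day year and reads availability as a long-run fraction of time up; the function name is illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a 365-day year


def downtime_minutes_per_year(availability: float) -> float:
    """Expected minutes of downtime per year at a given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


for label, a in [("ATMs, 1990", 0.9999),
                 ("Web sites, 2005", 0.99),
                 ("Cell phones, 2005", 0.90)]:
    minutes = downtime_minutes_per_year(a)
    print(f"{label}: {a:.2%} up, about {minutes / 60:.1f} hours down per year")
```

Each additional 9 cuts the yearly downtime by a factor of ten, so the drop from 99.99% to 99% is a hundredfold increase in time spent down.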


Eric Brewer said it best: ACID vs BASE, the Internet litmus test
ACID (Atomicity, Consistency, Isolation, Durability):
- Availability?
- Strong consistency
- Isolation
- Focus on commit
- Conservative (pessimistic)
- Difficult evolution (e.g. schema)
- Nested transactions
BASE (Basic Availability, Soft state, Eventual consistency):
- Availability FIRST
- Weak consistency: stale data is OK
- Approximate answers OK
- Best effort
- Aggressive (optimistic)
- Easier evolution
- Simpler! Faster
I think it is a spectrum.
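The "stale data is OK" point can be illustrated with a toy store. This is only a sketch of BASE-style behaviour under simplifying assumptions (one primary, one replica, explicit propagation); the class and method names are hypothetical and do not come from the slides or from any real system.

```python
class EventuallyConsistentStore:
    """Toy primary/replica pair with delayed write propagation."""

    def __init__(self) -> None:
        self.primary: dict[str, int] = {}
        self.replica: dict[str, int] = {}
        self.pending: list[tuple[str, int]] = []  # writes not yet propagated

    def write(self, key: str, value: int) -> None:
        """Accept the write at once (availability first), propagate later."""
        self.primary[key] = value
        self.pending.append((key, value))

    def read_replica(self, key: str):
        """Always answers, but may return stale or missing data."""
        return self.replica.get(key)

    def sync(self) -> None:
        """Apply pending writes in order; the replica then agrees with the primary."""
        for key, value in self.pending:
            self.replica[key] = value
        self.pending.clear()


store = EventuallyConsistentStore()
store.write("stock", 5)
print(store.read_replica("stock"))  # replica has not seen the write yet
store.sync()
print(store.read_replica("stock"))  # replica has converged
```

An ACID design would instead make the write wait until the replica had applied it, trading availability during a replica outage for the guarantee that no reader sees stale data.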