J. Gray, Dependability in the Internet Era (acknowledgement: slides from J.Gray, E.Brewer)
The Last 10 Years: Availability Dark Ages Ready for a Renaissance? Things got better, then things got a lot worse! % 99% 99.9% 99.99% % Computer Systems Telephone Systems Cell phones Interne t Availability 2010
DEPENDABILITY: The 3 ITIES RELIABILITY / INTEGRITY: Does the right thing. (also MTTF>>1) AVAILABILITY: Does it now. (also 1 >> MTTR ) MTTF+MTTR System Availability: If 90% of terminals up & 99% of DB up? (=>89% of transactions are serviced on time ). Holistic vs. Reductionist view Security Integrity Reliability Availability
Fail-Fast is Good, Repair is Needed Improving either MTTR or MTTF gives benefit Lifecycle of a module fail-fast gives short fault latency High Availability is low UN-Availability is low UN-Availability Unavailability ~ MTTR MTTF MTTF
Disks (raid) the BIG Success Story Duplex or Parity: masks faults 1M hours (~100 years) But –controllers fail and –have 1,000s of disks. Duplexing or parity, and dual path gives “perfect disks” Wal-Mart never lost a byte (thousands of disks, hundreds of failures). Only software/operations mistakes are left.
Fault Tolerance vs Disaster Tolerance Fault-Tolerance: mask local faults –RAID disks –Uninterruptible Power Supplies –Cluster Failover Disaster Tolerance: masks site failures –Protects against fire, flood, sabotage,.. –Also, software changes, site moves,… –Redundant system and service at remote site.
Availability well-managed nodes well-managed packs & clones well-managed GeoPlex Masks some hardware failures Masks hardware failures, Operations tasks (e.g. software upgrades) Masks some software failures Masks site failures (power, network, fire, move,…) Masks some operations failures Availability Un-managed
Case Studies - Tandem Trends MTTF improved Shiftfrom Hardware & Maintenance to from 50% to 10% toSoftware (62%) & Operations (15%) NOTE: Systematic under-reporting ofEnvironment Operations errors Application Software
Dependability Status circa 1995 ~4-year MTTF 5 9s for well-managed sys. Fault Tolerance Works. Hardware is GREAT (maintenance and MTTF). Software masks most hardware faults. Many hidden software outages in operations: New Software. Utilities. Need to make all hardware/software changes ONLINE.
Progress? MTTF improved from MTTR incremental improvements failover Hardware and Software online change (pNp) is now standard Then the Internet arrived: –No project can take more than 3 months. –Time to market is everything –Change is good. Computer Systems Telephone Systems Cell phones Internet
The Internet Changed Expectations 1990 Phones delivered % ATMs delivered 99.99% Failures were front-page news. Few hackers Outages last an “hour” 2005 Cell phones deliver 90% Web sites deliver 99% Failures are business-page news Many hackers. Outages last a “day” This is progress?
2006
Eric Brewer said it best: ACID vs BASE the internet litmus test A tomicity C onsistency I solation D urabilty Availability? Strong consistency Isolation Focus on commit Conservative (Pessimistic) Difficult evolution (e.g. schema) Nested transactions B asic A vailability S oft State E ventual Consistency Availability FIRST Weak consistency stale data is OK Approximate answers OK Best effort Aggressive (optimistic) Easier Evolution. Simpler! Faster I think it is a spectrum