Large Distributed Systems

Presentation on theme: "Large Distributed Systems"— Presentation transcript:

1 Large Distributed Systems
Andrew Whitaker CSE451

2 Textbook Definition “A distributed system is a collection of loosely coupled processors interconnected by a communication network.” Typically, the nodes run software to create an application or service; e.g., 1000s of Google nodes work together to build a search engine.

3 Challenge #1 Must handle partial failures
The system must stay up even when individual components fail (e.g., Amazon.com). Imagine giving out a 142 assignment: here’s a linked-list implementation that you’re free to use, but the list will fail 1% of the time.
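To make the flavor of partial failure concrete, here is a minimal Python sketch (hypothetical class and method names, not part of the course materials) of a list whose operations fail about 1% of the time, so callers can no longer assume success:

```python
import random

class FlakyList:
    """A linked-list stand-in whose operations fail ~1% of the time."""
    FAILURE_RATE = 0.01

    def __init__(self):
        self._items = []

    def append(self, value):
        if random.random() < self.FAILURE_RATE:
            raise RuntimeError("append failed (simulated partial failure)")
        self._items.append(value)

    def get(self, index):
        if random.random() < self.FAILURE_RATE:
            raise RuntimeError("get failed (simulated partial failure)")
        return self._items[index]

# Caller code can no longer assume success; it must retry or degrade gracefully.
def append_with_retry(lst, value, attempts=3):
    for _ in range(attempts):
        try:
            lst.append(value)
            return True
        except RuntimeError:
            continue
    return False
```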

4 Challenge #2 No global state
Machines can only communicate via messages. This makes it difficult to agree on anything: “What time is it?” “Which happened first, A or B?” Theory says consensus is slow and doesn’t work in the presence of failures, so we try to avoid needing to agree in the first place.
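The slides don’t prescribe a fix, but one standard technique for the “which happened first?” question is a Lamport logical clock, which orders events using only the messages machines exchange. A minimal sketch (illustrative only, not from the slides):

```python
class LamportClock:
    """Logical clock: orders events without a shared wall clock."""
    def __init__(self):
        self.time = 0

    def local_event(self):
        self.time += 1
        return self.time

    def send(self):
        self.time += 1
        return self.time          # timestamp carried on the outgoing message

    def receive(self, msg_time):
        self.time = max(self.time, msg_time) + 1
        return self.time

# Machine A sends to machine B; B's receive is ordered after A's send.
a, b = LamportClock(), LamportClock()
t_send = a.send()
t_recv = b.receive(t_send)
assert t_send < t_recv
```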

5 Internet Service Requirement: Availability
Basic goal: build a site that satisfies every user request. Detailed requirements: handle billions of transactions per day; be available 24/7; handle load spikes that are 10x normal capacity; do it with a random selection of mismatched hardware.

6 An Overview of HotMail (Jim Gray)
~7,000 servers; 100 backend stores with 300 TB (cooked); many data centers; links to the Internet, mail gateways, Ad-rotator, and Passport; ~5 billion messages per day; 350M mailboxes, 250M active, ~1M new per day; new software every 3 months (small changes weekly); 57,000 requests/sec.

7 Availability Strategy #1: Perfect Hardware
Pay extra $ for components that do not fail. People have tried this (“fault-tolerant computing”). This isn’t practical for Amazon / Google: it’s impossible to get rid of all faults, and software and administrative errors still exist.

8 Availability Strategy #2: Over-provision
Step 1: buy enough hardware to handle your workload. Step 2: buy more hardware. Replicate, replicate, replicate, replicate.

9 Benefits of Replication
Scalability. Guards against hardware failures. Guards against software failures (bugs).

10 Replication Meets Probability
p is the probability that a single machine fails. Assuming failures are independent, the probability that all n replicas fail is p^n, so site unavailability is p^n and site availability is 1 - p^n.
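A small sketch of that arithmetic (the 5% per-machine failure probability is an assumed example, not from the slides):

```python
def site_availability(p_fail, n):
    """Probability the site is up, assuming n independent replicas and
    that a single surviving replica is enough to serve requests."""
    return 1 - p_fail ** n

# Machines that are each down 5% of the time (p = 0.05):
for n in (1, 2, 3, 5):
    print(n, f"{site_availability(0.05, n):.7f}")
# 1 -> 0.9500000, 2 -> 0.9975000, 3 -> 0.9998750, 5 -> 0.9999997
```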

11 Availability in the Real World
Phone network: five 9’s (99.999% available). ATMs: four 9’s (99.99% available). What about Internet services? Not very good…
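For intuition, converting each “nines” level into allowed downtime over a 365-day year (simple arithmetic, not from the slides):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

for label, availability in [("two nines (99%)", 0.99),
                            ("three nines (99.9%)", 0.999),
                            ("four nines (ATMs)", 0.9999),
                            ("five nines (phone network)", 0.99999)]:
    downtime = (1 - availability) * MINUTES_PER_YEAR
    print(f"{label}: ~{downtime:.0f} minutes of downtime per year")
# two nines  -> ~5256 minutes (about 3.7 days)
# five nines -> ~5 minutes
```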

12 2006: Typical Availability Was 97.48%
Source: Jim Gray

13 What Gives? Why isn’t simple redundancy enough to give very high availability?

14 Failure in the Real World
[Diagram: clients on the Internet reach Amazon.com through a load balancer that fronts several servers.] The load balancer uses a “Least Connections” policy. A server fails by returning HTTP error 400. Net result: the “failed” server becomes a black hole.
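A toy simulation (hypothetical parameters; not Amazon’s actual setup) showing why least-connections turns a fast-failing server into a black hole: the broken server closes every connection immediately with an error, so it always looks least loaded and keeps winning the routing decision.

```python
import collections, random

def simulate(num_servers=4, broken_id=0, total_requests=10_000, service_time=10):
    """Toy time-stepped model: one request per tick, routed to the server
    with the fewest open connections ("least connections" policy)."""
    open_conns = [0] * num_servers
    completions = collections.defaultdict(list)  # tick -> servers finishing then
    served = [0] * num_servers

    for tick in range(total_requests):
        for s in completions.pop(tick, ()):      # close finished connections
            open_conns[s] -= 1

        # Least-connections choice, ties broken randomly.
        target = min(range(num_servers), key=lambda s: (open_conns[s], random.random()))
        served[target] += 1

        if target == broken_id:
            pass  # returns HTTP 400 immediately, so no connection stays open
        else:
            open_conns[target] += 1
            completions[tick + service_time].append(target)

    return served

print(simulate())  # the broken server ends up taking the large majority of requests
```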

15 Correlated Failures In practice, components often fail at the same time: natural disasters, security vulnerabilities, correlated manufacturing defects, human error…

16 Human error Human operator error is the leading cause of dependability problems in many domains. [Chart: sources of failure in the Public Switched Telephone Network and for an average of three Internet sites.] Source: D. Patterson et al., Recovery Oriented Computing (ROC): Motivation, Definition, Techniques, and Case Studies, UC Berkeley Technical Report UCB//CSD, March 2002.

17 Understanding Human Error
Administrator actions tend to involve many nodes at once: upgrading from Apache 1.3 to Apache 2.0, changing the root DNS server, network/router configuration. This can lead to (highly) correlated failures.

18 Learning to Live with Failures
If we can’t prevent failures outright, how can we make their impact less severe? Understanding availability: MTTF is the mean time to failure; MTTR is the mean time to repair. Availability = MTTF / (MTTF + MTTR), so unavailability is approximately MTTR / MTTF when MTTF is much larger than MTTR. Note: recovery time is just as important as failure time!
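A small sketch of the availability arithmetic (the MTTF and MTTR values are assumed examples); note that halving MTTR improves availability exactly as much as doubling MTTF, which is the slide’s point about recovery time:

```python
def availability(mttf_hours, mttr_hours):
    """Fraction of time the system is up: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)

# Hypothetical numbers: a node that fails about once a month (MTTF ~720 h)
# and takes 2 hours to repair (MTTR = 2 h).
print(availability(720, 2))    # ~0.99723
print(availability(720, 1))    # ~0.99861 -- halving MTTR...
print(availability(1440, 2))   # ~0.99861 -- ...buys the same as doubling MTTF
```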

