Large-Scale Distributed Systems Andrew Whitaker CSE451.

Textbook Definition
- “A distributed system is a collection of loosely coupled processors interconnected by a communication network”
- Typically, the nodes run software to create an application/service
  - e.g., 1000s of Google nodes work together to build a search engine

Why Not to Build a Distributed System (1)
- Must handle partial failures
  - System must stay up, even when individual components fail
- (Figure: Amazon.com)

Why Not to Build a Distributed System (2)
- No global state
  - Machines can only communicate with messages
- This makes it difficult to agree on anything
  - “What time is it?”
  - “Which happened first, A or B?”
- Theory: consensus is slow and doesn’t work in the presence of failure
  - So, we try to avoid needing to agree in the first place
- (Figure: events A and B)

Reasons to Build a Distributed System (1)
- The application or service is inherently distributed
- (Figure: Andrew Whitaker, Joan Whitaker)

Reasons to Build a Distributed System (2)
- Application requirements
  - Must scale to millions of requests / sec
  - Must be available despite component failures
- This is why Amazon, Google, eBay, etc. are all large distributed systems

Internet Service Requirements
- Basic goal: build a site that satisfies every user request
- Detailed requirements:
  - Handle billions of transactions per day
  - Be available 24/7
  - Handle load spikes that are 10x normal capacity
  - Do it with a random selection of mismatched hardware

An Overview of HotMail (Jim Gray)
- ~7,000 servers
- 100 backend stores with 300TB (cooked)
- Many data centers
- Links to
  - Internet Mail gateways
  - Ad-rotator
  - Passport
- ~5B messages per day
- 350M mailboxes, 250M active
- ~1M new per day
- New software every 3 months (small changes weekly)

Availability Strategy #1: Perfect Hardware
- Pay extra $$$ for components that do not fail
- People have tried this
  - “fault tolerant computing”
- This isn’t practical for Amazon / Google:
  - It’s impossible to get rid of all faults
  - Software and administrative errors still exist

Availability Strategy #2: Over-provision
- Step 1: buy enough hardware to handle your workload
- Step 2: buy more hardware
- Replicate

Benefits of Replication
- Scalability
- Guards against hardware failures
- Guards against software failures (bugs)

Replication Meets Probability
- p is the probability that a single machine fails
- Probability that all n machines fail at once (assuming independent failures): p^n
  - This is the site unavailability
- Site availability: 1 - p^n
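The arithmetic behind replication can be sketched in a few lines of Python. This is only an illustration under the assumption that machines fail independently, each with probability p; the function names are made up for this example and are not part of the original slides.

```python
def site_unavailability(p: float, n: int) -> float:
    """Probability that all n replicas are down at once.

    Assumes failures are independent, which later slides show is optimistic.
    """
    return p ** n


def site_availability(p: float, n: int) -> float:
    """Probability that at least one of the n replicas is up."""
    return 1.0 - site_unavailability(p, n)


if __name__ == "__main__":
    # Each machine is assumed to be down 1% of the time (p = 0.01).
    for n in (1, 2, 3):
        print(n, site_availability(0.01, n))
```

Under the independence assumption, each extra replica multiplies the unavailability by p, which is why simple redundancy looks so attractive on paper.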

Availability in the Real World
- Phone network: 5 9’s
  - 99.999% available
- ATMs: 4 9’s
  - 99.99% available
- What about Internet services?
  - Not very good…

2006: typical availability 97.48% (Source: Jim Gray)
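To make these percentages concrete, the sketch below converts an availability figure into expected downtime per year. It assumes a 365-day year and that downtime is simply the unavailable fraction of total time; the helper name is hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a non-leap year


def downtime_minutes_per_year(availability: float) -> float:
    """Expected minutes of downtime per year for a given availability (0..1)."""
    return (1.0 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    for label, a in [("phone network", 0.99999),
                     ("ATMs", 0.9999),
                     ("typical Internet service, 2006", 0.9748)]:
        print(f"{label}: {downtime_minutes_per_year(a):.1f} minutes/year")
```

Five 9’s works out to roughly five minutes of downtime a year, while 97.48% corresponds to more than nine days.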

Netcraft’s Crisis-of-the-Day

What Gives? Why isn’t simple redundancy enough to give very high availability?

Failure Modes
- Fail-stop failure: a component fails by stopping
  - It’s totally dead: doesn’t respond to input or output
  - Ideally, this happens fast
  - Like a light-bulb
- Byzantine failure: component fails in an arbitrary way
  - Produces unpredictable output

Byzantine Generals
- Basic goal: reach consensus in the presence of arbitrary failures
- Results:
  - More than 2/3 of the nodes must be “loyal”
    - 3t + 1 nodes with t traitors
  - Consensus is possible, but expensive
    - Lots of messages
    - Many rounds of communication
- In practice, people assume that failures are fail-stop, and hope for the best…
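The 3t + 1 bound can be read in both directions, as the small sketch below shows. It only does the arithmetic of the bound (it is not a consensus protocol), and the function names are invented for illustration.

```python
def min_nodes(t: int) -> int:
    """Smallest cluster size that can tolerate t Byzantine (traitor) nodes."""
    return 3 * t + 1


def max_traitors(n: int) -> int:
    """Largest number of traitors t such that n >= 3t + 1."""
    return (n - 1) // 3


if __name__ == "__main__":
    for n in (3, 4, 7, 10):
        print(f"{n} nodes tolerate up to {max_traitors(n)} traitor(s)")
```

Note that three nodes cannot tolerate even a single traitor, which gives a sense of how much hardware Byzantine tolerance costs compared with assuming fail-stop behavior.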

Example of a Non-Fail-Stop Failure
- (Figure: Internet, load balancer, and servers at Amazon.com)
- Load balancer uses a “Least Connections” policy
- Server fails by returning an HTTP error 400
- Net result: “failed” server becomes a black hole
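The black-hole effect can be reproduced with a toy simulation. This is a simplified model built on assumptions that are not in the slides: time advances in discrete ticks, a healthy server holds each connection for a fixed number of ticks, and the broken server answers with an error immediately, so it never has any open connections. All class and function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    name: str
    service_ticks: int  # ticks a request holds a connection; 0 models an instant error reply
    active: list = field(default_factory=list)  # remaining ticks per open connection
    handled: int = 0


def simulate(servers, requests_per_tick, ticks):
    """Route requests with a least-connections policy and count who gets them."""
    for _ in range(ticks):
        for _ in range(requests_per_tick):
            # Least connections: pick the server with the fewest open connections.
            target = min(servers, key=lambda s: len(s.active))
            target.handled += 1
            if target.service_ticks > 0:
                target.active.append(target.service_ticks)
        # Advance time by one tick and close connections that have finished.
        for s in servers:
            s.active = [r - 1 for r in s.active if r > 1]
    return {s.name: s.handled for s in servers}


servers = [Server("healthy-1", 5), Server("healthy-2", 5), Server("broken", 0)]
counts = simulate(servers, requests_per_tick=10, ticks=10)
print(counts)
```

Because the broken server always looks idle, the least-connections policy keeps choosing it, and it absorbs nearly all of the traffic while the healthy servers sit mostly unused.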

Correlated Failures
- In practice, components often fail at the same time
  - Natural disasters
  - Security vulnerabilities
  - Correlated manufacturing defects
  - Human error…

Human Error
- Human operator error is the leading cause of dependability problems in many domains
- (Chart: sources of failure for the Public Switched Telephone Network and for the average of 3 Internet sites)
- Source: D. Patterson et al. Recovery Oriented Computing (ROC): Motivation, Definition, Techniques, and Case Studies, UC Berkeley Technical Report UCB//CSD-02-1175, March 2002.

Understanding Human Error
- Administrator actions tend to involve many nodes at once:
  - Upgrade from Apache 1.3 to Apache 2.0
  - Change the root DNS server
  - Network / router misconfiguration
- This can lead to (highly) correlated failures

Learning to Live with Failures
- If we can’t prevent failures outright, how can we make their impact less severe?
- Understanding availability:
  - MTTF: Mean-time-to-failure
  - MTTR: Mean-time-to-repair
  - Availability = MTTF / (MTTF + MTTR)
  - Unavailability = MTTR / (MTTF + MTTR), approximately MTTR / MTTF when MTTR is much smaller than MTTF
- Note: recovery time is just as important as failure time!
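A short sketch shows why recovery time matters as much as failure time. It uses the standard definition, availability = MTTF / (MTTF + MTTR); the function names and the sample numbers (in hours) are assumptions made up for illustration.

```python
def availability(mttf: float, mttr: float) -> float:
    """Fraction of time the system is up, given mean time to failure and to repair."""
    return mttf / (mttf + mttr)


def unavailability(mttf: float, mttr: float) -> float:
    """Fraction of time the system is down; close to mttr / mttf when mttr << mttf."""
    return mttr / (mttf + mttr)


if __name__ == "__main__":
    baseline = availability(mttf=1000, mttr=1.0)
    fewer_failures = availability(mttf=2000, mttr=1.0)   # fail half as often
    faster_repair = availability(mttf=1000, mttr=0.5)    # recover twice as fast
    print(baseline, fewer_failures, faster_repair)
```

Availability depends only on the ratio of MTTF to MTTR, so halving the repair time buys exactly as much as doubling the time between failures, and it is often the easier of the two to achieve.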

Summary
- Large distributed systems are built from many flaky components
  - Key challenge: don’t let component failures become system failures
- Basic approach: throw lots of hardware at the problem; hope everything doesn’t fail at once
  - Try to decouple failures
  - Try to avoid single points-of-failure
  - Try to fail fast
- Availability is affected as much by recovery time as by error frequency
