Large-Scale Distributed Systems Andrew Whitaker CSE451.


Textbook Definition
"A distributed system is a collection of loosely coupled processors interconnected by a communication network."
Typically, the nodes run software to create an application or service, e.g., thousands of Google nodes work together to build a search engine.

Why Not to Build a Distributed System (1)
Must handle partial failures: the system must stay up even when individual components fail.

Why Not to Build a Distributed System (2)
No global state: machines can only communicate with messages.
This makes it difficult to agree on anything: "What time is it?" "Which happened first, A or B?"
Theory says consensus is slow and doesn't work in the presence of failures, so we try to avoid needing to agree in the first place.

Reasons to Build a Distributed System (1)
The application or service is inherently distributed (e.g., communication between two users at different sites).

Reasons to Build a Distributed System (2)
Application requirements: must scale to millions of requests per second, and must be available despite component failures.
This is why Amazon, Google, eBay, etc. are all large distributed systems.

Internet Service Requirements
Basic goal: build a site that satisfies every user request.
Detailed requirements:
- Handle billions of transactions per day
- Be available 24/7
- Handle load spikes that are 10x normal capacity
- Do it with a random selection of mismatched hardware

An Overview of HotMail (Jim Gray)
- ~7,000 servers
- 100 backend stores with 300 TB (cooked)
- Many data centers
- Links to Internet mail gateways, Ad-rotator, and Passport
- ~5B messages per day
- 350M mailboxes, 250M active, ~1M new per day
- New software every 3 months (small changes weekly)

Availability Strategy #1: Perfect Hardware
Pay extra $$$ for components that do not fail. People have tried this ("fault-tolerant computing").
This isn't practical for Amazon or Google: it's impossible to get rid of all faults, and software and administrative errors still exist.

Availability Strategy #2: Over-Provision
Step 1: buy enough hardware to handle your workload. Step 2: buy more hardware. Replicate.

Benefits of Replication
- Scalability
- Guards against hardware failures
- Guards against software failures (bugs)

Replication Meets Probability
Let p be the probability that a single machine fails. Assuming independent failures, the probability that all N replicas fail is p^N. So site unavailability is p^N, and site availability is 1 − p^N.
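The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration under the slide's independence assumption; real-world failures are often correlated, as later slides point out.

```python
def site_availability(p, n):
    """Availability of a site with n replicas, each failing
    independently with probability p. The site is down only
    if all n replicas fail at once."""
    unavailability = p ** n
    return 1 - unavailability

# Even with flaky machines (10% failure probability),
# a little replication helps quickly:
print(site_availability(0.1, 1))  # 0.9
print(site_availability(0.1, 3))  # one machine: 90%; three: 99.9%
```

Note how fast the exponent works in our favor: each added replica multiplies the unavailability by p.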

Availability in the Real World
Phone network: five 9's (99.999% available). ATMs: four 9's (99.99% available).
What about Internet services? Not very good…
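"Counting nines" becomes concrete when converted into a downtime budget. A quick sketch (figures are straightforward arithmetic, not from the slides):

```python
def downtime_per_year(availability):
    """Allowed downtime, in minutes per year, for a given
    availability fraction (e.g., 0.99999 for five nines)."""
    minutes_per_year = 365 * 24 * 60  # 525,600
    return (1 - availability) * minutes_per_year

print(round(downtime_per_year(0.99999), 1))  # five nines: ~5.3 min/year
print(round(downtime_per_year(0.9999), 1))   # four nines: ~52.6 min/year
print(round(downtime_per_year(0.9748), 1))   # ~13,245 min (~9.2 days)
```

The last line previews the next slide: a "typical" Internet site at 97.48% availability is down for more than a week per year.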

Availability in 2006: a typical Internet site was 97.48% available (source: Jim Gray).

Netcraft’s Crisis-of-the-Day

What Gives? Why isn’t simple redundancy enough to give very high availability?

Failure Modes
Fail-stop failure: a component fails by stopping. It's totally dead: doesn't respond to input or output. Ideally, this happens fast, like a light bulb.
Byzantine failure: a component fails in an arbitrary way and produces unpredictable output.

Byzantine Generals
Basic goal: reach consensus in the presence of arbitrary failures.
Results: more than 2/3 of the nodes must be "loyal" (3t + 1 nodes are needed to tolerate t traitors). Consensus is possible, but expensive: lots of messages and many rounds of communication.
In practice, people assume that failures are fail-stop, and hope for the best…
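The 3t + 1 bound is easy to turn into two helper functions, one from each direction. A small sketch of the arithmetic (function names are illustrative):

```python
def min_nodes(t):
    """Minimum total nodes needed to reach consensus when up to
    t nodes may be Byzantine ("traitors"): 3t + 1."""
    return 3 * t + 1

def tolerable_traitors(n):
    """Maximum traitors a group of n nodes can tolerate:
    floor((n - 1) / 3)."""
    return (n - 1) // 3

print(min_nodes(1))            # 4 generals can survive 1 traitor
print(tolerable_traitors(3))   # 3 generals tolerate 0 traitors
print(tolerable_traitors(10))  # 10 nodes tolerate 3 traitors
```

This is why the classic three-general scenario is unsolvable with a single traitor: 3 < 3(1) + 1.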

Example of a Non-Fail-Stop Failure
A load balancer sits between the Internet and a pool of servers (e.g., at Amazon.com), routing requests with a "least connections" policy. One server fails by immediately returning HTTP error 400. Because its connections finish instantly, it always appears least loaded.
Net result: the "failed" server becomes a black hole, attracting most of the traffic.
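The black-hole effect can be demonstrated with a tiny discrete-time simulation. This is an illustrative sketch, not the slide's actual system: server names, latencies, and the step counts are assumptions.

```python
def simulate(steps=1000):
    # Ticks each server holds a connection per request: the healthy
    # server does real work; the "failed" one returns HTTP 400
    # immediately, so it frees its connection right away.
    hold = {"healthy": 10, "failed": 1}
    served = {"healthy": 0, "failed": 0}
    pending = []  # (finish_time, server) for in-flight requests

    for t in range(steps):
        # Retire requests that have completed by time t.
        pending = [(f, s) for (f, s) in pending if f > t]
        conns = {"healthy": 0, "failed": 0}
        for _, s in pending:
            conns[s] += 1
        # Least-connections policy: route to the emptier server.
        target = min(conns, key=conns.get)
        served[target] += 1
        pending.append((t + hold[target], target))
    return served

print(simulate())  # the "failed" server absorbs ~90% of requests
```

Because the broken server never accumulates open connections, least-connections routing interprets fast failure as spare capacity, which is exactly backwards.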

Correlated Failures
In practice, components often fail at the same time:
- Natural disasters
- Security vulnerabilities
- Correlated manufacturing defects
- Human error…

Human Error
Human operator error is the leading cause of dependability problems in many domains.
[Chart: sources of failure for the Public Switched Telephone Network vs. an average of 3 Internet sites]
Source: D. Patterson et al., Recovery Oriented Computing (ROC): Motivation, Definition, Techniques, and Case Studies, UC Berkeley Technical Report UCB//CSD.

Understanding Human Error
Administrator actions tend to involve many nodes at once:
- Upgrade from Apache 1.3 to Apache 2.0
- Change the root DNS server
- Network / router misconfiguration
This can lead to (highly) correlated failures.

Learning to Live with Failures
If we can't prevent failures outright, how can we make their impact less severe?
Understanding availability:
- MTTF: mean time to failure
- MTTR: mean time to repair
- Availability = MTTF / (MTTF + MTTR); unavailability is approximately MTTR / MTTF when MTTF ≫ MTTR
Note: recovery time is just as important as failure time!
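The formula makes the slide's closing point concrete: cutting repair time improves availability just as surely as cutting failure frequency. A minimal sketch (the hour figures are illustrative):

```python
def availability(mttf, mttr):
    """Fraction of time the system is up:
    MTTF / (MTTF + MTTR)."""
    return mttf / (mttf + mttr)

# Same failure rate (MTTF = 1000 hours), 10x difference in repair time:
print(availability(1000, 10))  # ~0.990: roughly two nines
print(availability(1000, 1))   # ~0.999: three nines, same MTTF
```

A 10x faster repair buys about an extra "nine" without touching the hardware's failure rate, which is why recovery-oriented techniques focus on MTTR.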

Summary
Large distributed systems are built from many flaky components. Key challenge: don't let component failures become system failures.
Basic approach: throw lots of hardware at the problem and hope everything doesn't fail at once:
- Try to decouple failures
- Try to avoid single points of failure
- Try to fail fast
Availability is affected as much by recovery time as by error frequency.