Large Distributed Systems


Large Distributed Systems Andrew Whitaker CSE451

Textbook Definition: “A distributed system is a collection of loosely coupled processors interconnected by a communication network.” Typically, the nodes run software that together provides an application or service; e.g., thousands of Google nodes work together to build a search engine.

Challenge #1: Must handle partial failures. The system must stay up even when individual components fail (example: Amazon.com). Imagine giving a 142 assignment: “Here’s a linked-list implementation that you’re free to use. But the list will fail 1% of the time.”

Challenge #2: No global state. Machines can only communicate with messages, which makes it difficult to agree on anything: “What time is it?” “Which happened first, A or B?” Theory: consensus is slow and doesn’t work in the presence of failure. So, we try to avoid needing to agree in the first place.

Internet Service Requirement: Availability. Basic goal: build a site that satisfies every user request. Detailed requirements: handle billions of transactions per day; be available 24/7; handle load spikes that are 10x normal capacity; do it with a random selection of mismatched hardware.

An Overview of HotMail (Jim Gray): ~7,000 servers; 100 backend stores with 300 TB (cooked); many data centers; links to Internet mail gateways, ad-rotator, Passport; ~5 B messages per day; 350M mailboxes, 250M active; ~1M new per day; new software every 3 months (small changes weekly); 57,000 req/sec.

Availability Strategy #1: Perfect Hardware. Pay extra $ for components that do not fail. People have tried this: “fault tolerant computing.” This isn’t practical for Amazon / Google: it’s impossible to get rid of all faults, and software and administrative errors still exist.

Availability Strategy #2: Over-provision. Step 1: buy enough hardware to handle your workload. Step 2: buy more hardware. Replicate.

Benefits of Replication: scalability; guards against hardware failures; guards against software failures (bugs).

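Replication Meets Probability: p is the probability that a single machine fails. If failures are independent, the probability that all n replicas fail at once is p^n. That is the site unavailability, so site availability is 1 - p^n.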
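A minimal sketch of the replication arithmetic, assuming machine failures are independent and that the site is down only when every replica is down. The function names are illustrative, not from the slides.

```python
def site_unavailability(p: float, n: int) -> float:
    """Probability that all n replicas are down at once.

    Assumes each machine fails independently with probability p.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return p ** n


def site_availability(p: float, n: int) -> float:
    """Probability that at least one of the n replicas is up."""
    return 1.0 - site_unavailability(p, n)


if __name__ == "__main__":
    # With p = 0.01, each extra replica cuts unavailability by a factor of 100.
    for n in (1, 2, 3):
        print(n, site_unavailability(0.01, n), site_availability(0.01, n))
```

Under this model a handful of replicas already predicts extremely high availability, which is why the measured numbers for real services are surprising.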

Availability in the Real World. Phone network: 5 9’s (99.999% available). ATMs: 4 9’s (99.99% available). What about Internet services? Not very good…

2006: typical availability is 97.48% (source: Jim Gray).
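To make these percentages concrete, the sketch below converts an availability figure into expected downtime per year. It assumes a 365-day year; the helper name is illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Expected minutes of downtime per year for a given availability."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be a percentage between 0 and 100")
    return (1.0 - availability_percent / 100.0) * MINUTES_PER_YEAR


if __name__ == "__main__":
    for label, pct in [("five 9's", 99.999), ("four 9's", 99.99), ("2006 typical", 97.48)]:
        minutes = downtime_minutes_per_year(pct)
        print(f"{label}: {minutes:.1f} min/year ({minutes / (60 * 24):.2f} days)")
```

Five 9's allows about 5 minutes of downtime per year and four 9's about 53 minutes, while 97.48% corresponds to roughly 9 days.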

What Gives? Why isn’t simple redundancy enough to give very high availability?

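Failure in the Real World. [Figure: requests for Amazon.com arrive from the Internet at a load balancer, which distributes them across several servers.] The load balancer uses a “Least Connections” policy. One server fails by returning an HTTP error 400. Because the error response completes almost immediately, the failed server always has the fewest open connections, so the balancer keeps sending it new requests. Net result: the “failed” server becomes a black hole.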
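The black-hole effect can be reproduced with a toy simulation. This is a sketch under simplifying assumptions, not a model of any real load balancer: time advances in ticks, a fixed number of requests arrive per tick, healthy servers hold each connection for a fixed number of ticks, and the failed server answers with an error instantly. All names and parameter values are hypothetical.

```python
def simulate_least_connections(num_servers=4, failed=3, ticks=100,
                               arrivals_per_tick=10, service_ticks=5):
    """Return the number of requests routed to each server."""
    # open_conns[i] holds the remaining ticks of each open connection on server i.
    open_conns = [[] for _ in range(num_servers)]
    handled = [0] * num_servers
    for _ in range(ticks):
        # Age every open connection by one tick and drop the finished ones.
        open_conns = [[t - 1 for t in conns if t > 1] for conns in open_conns]
        for _ in range(arrivals_per_tick):
            # Least Connections: choose the server with the fewest open
            # connections (ties go to the lowest index).
            target = min(range(num_servers), key=lambda i: len(open_conns[i]))
            handled[target] += 1
            if target != failed:
                # A healthy server keeps the connection open while it works.
                open_conns[target].append(service_ticks)
            # The failed server returns its error at once, so it never
            # accumulates open connections.
    return handled


if __name__ == "__main__":
    counts = simulate_least_connections()
    total = sum(counts)
    for i, c in enumerate(counts):
        print(f"server {i}: {c} requests ({100 * c / total:.0f}%)")
```

With these parameters the failed server receives more than 90% of all requests, even though three healthy servers are available.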

Correlated Failures. In practice, components often fail at the same time: natural disasters, security vulnerabilities, correlated manufacturing defects, human error…
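One hedged way to see why correlation matters is a simple common-mode model. Assume that with probability q a single event (say, a bad configuration push) takes down every replica, and that otherwise each machine fails independently with probability p. Both the model and the parameter names are assumptions made for illustration.

```python
def unavailability_with_common_mode(p: float, n: int, q: float) -> float:
    """Site unavailability when a common-mode event can take down all replicas.

    q: probability of an event that fails every replica together (assumed).
    p: independent per-machine failure probability otherwise.
    """
    for name, value in (("p", p), ("q", q)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be a probability between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return q + (1.0 - q) * p ** n


if __name__ == "__main__":
    # Adding replicas drives the independent term toward zero,
    # but unavailability never drops below q.
    for n in (1, 2, 4, 8):
        print(n, unavailability_with_common_mode(p=0.01, n=n, q=0.001))
```

In this model extra replicas stop helping once p^n falls below q, so the correlated term sets the floor on unavailability.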

Human error. Human operator error is the leading cause of dependability problems in many domains. [Figure: sources of failure for the Public Switched Telephone Network and for an average of 3 Internet sites.] Source: D. Patterson et al. Recovery Oriented Computing (ROC): Motivation, Definition, Techniques, and Case Studies, UC Berkeley Technical Report UCB//CSD-02-1175, March 2002.

Understanding Human Error. Administrator actions tend to involve many nodes at once: upgrading from Apache 1.3 to Apache 2.0; changing the root DNS server; network / router configuration. This can lead to (highly) correlated failures.

Learning to Live with Failures. If we can’t prevent failures outright, how can we make their impact less severe? Understanding availability: MTTF is mean-time-to-failure; MTTR is mean-time-to-repair. Availability = MTTF / (MTTF + MTTR). Unavailability = MTTR / (MTTF + MTTR), which is approximately MTTR / MTTF when MTTR is much smaller than MTTF. Note: recovery time is just as important as failure time!
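A short worked example of the standard steady-state formula, availability = MTTF / (MTTF + MTTR). The numbers are made up for illustration; the point is that halving repair time buys the same availability as doubling time to failure.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the system is up."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


if __name__ == "__main__":
    baseline = availability(mttf_hours=1000, mttr_hours=1.0)
    faster_repair = availability(mttf_hours=1000, mttr_hours=0.5)
    rarer_failure = availability(mttf_hours=2000, mttr_hours=1.0)
    print(f"baseline:       {baseline:.6f}")
    print(f"half the MTTR:  {faster_repair:.6f}")
    print(f"twice the MTTF: {rarer_failure:.6f}")
```

The last two cases give the same availability, because only the ratio of MTTR to MTTF matters.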