Distributed Computing

Distributed Computing COEN 317 DC1: Introduction

COEN 317 JoAnne Holliday Email: jholliday@scu.edu (best way to reach me) Office: Engineering 247, (408) 551-1941 Office Hours: MW 3:30-4:45 and by appointment Class web page: http://www.cse.scu.edu/~jholliday/

Textbook: Distributed Systems: Concepts and Design by Coulouris, Dollimore, Kindberg (CDK). We follow the book only roughly. The chapters with material that corresponds to the lectures are 5 and 9 through 14. Read chapters 1 and 2. Review chapters 3 and 4 if needed for networks, threads, and processes.

Topics What is a Distributed System? What are the goals of a Distributed System? Scalability, Fault Tolerance, Transparency.

Definition of a Distributed System (1) A distributed system is: A collection of independent computers that appears to its users as a single coherent system.

Definition of a Distributed System (2) A distributed system organized as middleware (Figure 1.1). Note that the middleware layer extends over multiple machines.

Distributed systems "Distributed System" covers a wide range of architectures, from slightly more distributed than a centralized system to a truly distributed network of peers.

One Extreme: Centralized Centralized: a mainframe and dumb terminals. All of the computation is done on the mainframe. Each line or keystroke is sent from the terminal to the mainframe.

Moving Towards Distribution In a client-server system, the clients are workstations or computers in their own right and perform computations and formatting of the data. However, the data and the application that manipulates it ultimately reside on the server.

More Decentralization In a distributed-with-coordinator system, the nodes or sites depend on a coordinator node with extra knowledge or processing abilities. The coordinator might be used only in case of failures or other problems.

True Decentralization A truly distributed system has no distinguished node that acts as a coordinator; all nodes or sites are equals. The nodes may choose to elect one of their own to act as a temporary coordinator or leader.

Distributed Systems: Pro and Con Some things that were difficult in a centralized system become easier: doing tasks faster by doing them in parallel; avoiding a single point of failure (all eggs in one basket); geographical distribution. Some things become more difficult: transaction commit; snapshots, time, and causality; agreement (consensus).

Advantages of the True Distributed System Having no central server or coordinator means it is scalable. SDDS (Scalable Distributed Data Structures) attempt to move distributed systems from a small number of nodes to thousands of nodes. We need scalable algorithms to operate on these networks/structures, for example, peer-to-peer networks.

Homogeneous and tightly coupled vs. heterogeneous and loosely coupled. We will study heterogeneous and loosely coupled systems.

Software Concepts
DOS (Distributed Operating System): a tightly-coupled operating system for multiprocessors and homogeneous multicomputers. Main goal: hide and manage hardware resources.
NOS (Network Operating System): a loosely-coupled operating system for heterogeneous multicomputers (LAN and WAN). Main goal: offer local services to remote clients.
Middleware: an additional layer atop a NOS implementing general-purpose services. Main goal: provide distribution transparency.

Distributed Operating Systems May share memory or other resources. (Figure 1.14)

Network Operating System General structure of a network operating system. (Figure 1-19)

Two meanings of synchronous and asynchronous communication. Synchronous communication is where a process blocks after sending a message to wait for the answer, or blocks before receiving. Sync and async have also come to describe the communication channels with which they are used. Synchronous: message transit time is short and bounded; if a site does not respond within x seconds, it can be declared dead. This simplifies algorithms! Asynchronous: message transit time is unbounded; if a message is not received in a given time interval, the sender could just be slow.
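
A minimal sketch, in Python, of why the bounded-delay assumption matters. The `is_alive` helper and the timeout value are assumptions for illustration only: under the synchronous model a missing reply within the bound really does mean the site is dead, while under the asynchronous model the very same code can wrongly declare a slow site dead.

```python
import socket

def is_alive(host, port, timeout_secs=2.0):
    """Timeout-based failure detector (hypothetical helper).

    Correct only under the synchronous assumption that every reply
    arrives within timeout_secs; in an asynchronous system, a site
    that is merely slow will be misjudged as dead."""
    try:
        with socket.create_connection((host, port), timeout=timeout_secs):
            return True   # answered within the assumed bound
    except OSError:
        return False      # no answer within the bound: declared dead
```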

What’s the Point of Unrealistic Assumptions? So why should we study synchronous systems (which never occur in real life) or systems without failures? If something is difficult (intractable) in a synchronous system, it will be worse in an asynchronous system. If an algorithm works well in an asynchronous system, it will probably do even better in a real system.

Scalability Something is scalable if it "increases linearly with size," where size is usually number of nodes or distance. "X is scalable with the number of nodes." Example: every site (node) is directly connected to every other site through a communication channel. The number of channels is NOT scalable: for N sites there are N(N-1)/2 channels, which grows quadratically. Sites connected in a ring: the number of channels IS scalable (N channels for N sites).
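
A quick arithmetic check of the two topologies; the function names are just for illustration:

```python
def channels_complete(n):
    # Fully connected: one channel per unordered pair of sites.
    return n * (n - 1) // 2

def channels_ring(n):
    # Ring: each site has a single channel to its successor.
    return n

for n in (10, 100, 1000):
    print(n, channels_complete(n), channels_ring(n))
# 10 -> 45 vs 10;  100 -> 4950 vs 100;  1000 -> 499500 vs 1000
```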

Scalability Problems Examples of scalability limitations:
Centralized services: a single server for all users.
Centralized data: a single on-line telephone book.
Centralized algorithms: doing routing based on complete information.

Scaling Techniques (1) The difference between letting a server or a client check forms as they are being filled. (Figure 1.4)

Scaling Techniques (2) An example of dividing the DNS name space into zones. (Figure 1.5)

Characteristics of Scalable Distributed Algorithms
No machine (node, site) has complete information about the system state.
Sites make decisions based only on local information.
Failure of one site does not ruin the algorithm.
There is no implicit assumption that a global clock exists.

Transparency in a D.S.
Access: hide differences in data representation and how a resource is accessed.
Location: hide where a resource is located.
Migration: hide that a resource may move to another location.
Relocation: hide that a resource may be moved to another location while in use.
Replication: hide that copies of a resource exist and a user might use different ones at different times.
Concurrency: hide that a resource may be shared by several competing users.
Failure: hide the failure and recovery of a resource.
Persistence: hide whether a (software) resource is in memory or on disk.
Important: location, migration (relocation), replication, concurrency, failure.

Fault Tolerance An important goal in a DS is to make the system resilient to failures of some of its components. Fault tolerance (FT) is frequently one of the reasons for making a system distributed in the first place. Dependability includes: availability, reliability, safety, maintainability.

Fault Tolerance Goals Availability: can I use it now? The probability of being up at any given time. Reliability: will it be up as long as I need it? The ability to run continuously without failure. If a system crashes briefly every hour, it may still have good availability (it is up most of the time) but poor reliability, because it cannot run for very long before crashing. Safety: if it fails, nothing catastrophic happens. Maintainability: how easy is it to fix when it breaks?
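
The hourly-crash example can be made concrete with the standard steady-state availability formula A = MTTF / (MTTF + MTTR); the specific numbers below are assumptions for illustration.

```python
mttf = 3600.0  # assumed mean time to failure: crashes about once an hour
mttr = 5.0     # assumed mean time to repair: recovers in about 5 seconds

availability = mttf / (mttf + mttr)
print(f"availability = {availability:.4%}")  # ~99.86%: up "most of the time"
# Reliability is still poor: a task needing several hours of
# uninterrupted run time will almost certainly be hit by a crash.
```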

Faults A fault is the cause of an error. Fault tolerance: a system can continue to function even in the presence of faults. Classification of faults: transient faults occur once and then disappear; intermittent faults occur, go away, come back, go away, and so on; permanent faults don't go away by themselves (e.g., disk failures).

Failure Models
Crash failure (fail-stop): a server halts, but was working correctly until it halted.
Omission failure: a server fails to respond to incoming requests. Receive omission: a server fails to receive incoming messages. Send omission: a server fails to send messages.
Timing failure: a server's response lies outside the specified time interval.
Response failure: the server's response is incorrect. Value failure: the value of the response is wrong. State-transition failure: the server deviates from the correct flow of control.
Arbitrary (Byzantine) failure: a server may produce arbitrary responses at arbitrary times.

Failure Models
Fail-stop: a process works fine until it stops, then remains dead; other processes can detect this state.
Crash: works fine until it stops, but other processes may not be able to detect this.
Send or receive omission: a process fails to do one thing, but works fine otherwise.
Arbitrary (Byzantine): process behavior is unpredictable, maybe malicious.

Network Failures Link failure (one-way or two-way): site 5 can talk to site 6, but 6 cannot talk to 5. Network partitions: the network 1,2,3,4,5,6 is partitioned into 1,2,3,4 and 5,6.

Are The Models Realistic? No, of course not! Synch vs. asynch: the asynchronous model is too weak (real systems have clocks, and "most" timing meets expectations... but occasionally...); the synchronous model is too strong (real systems lack a way to implement synchronized rounds). Failure types: the crash-fail (fail-stop) model is too weak (systems usually display some odd behavior before dying); the Byzantine model is too strong (it assumes an adversary of arbitrary speed who designs the "ultimate attack").

Models: Justification If we can do something in the asynchronous model, we can probably do it even better in a real network: clocks and a priori knowledge can only help. If we can't do something in the synchronous model, we can't do it in a real network: after all, synchronized rounds are a powerful, if unrealistic, capability to introduce. If we can survive Byzantine failures, we can probably survive the failures of a real distributed system.

Fault Tolerance Strategies Redundancy (hardware, software, informational, temporal); hierarchy; confinement of errors.

Failure Masking by Redundancy Triple modular redundancy (TMR): voter circuits choose the majority of their inputs to determine the correct output.
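
A small software analogue of a TMR voter, as a sketch; the function name is illustrative:

```python
from collections import Counter

def tmr_vote(a, b, c):
    """Return the value reported by at least two of three replicated
    modules, masking a single faulty one."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count >= 2:
        return value
    raise RuntimeError("no majority: more than one module faulty")

print(tmr_vote(42, 42, 13))  # one faulty replica is masked -> 42
```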

Flat Groups versus Hierarchical Groups Communication in a flat group. Communication in a simple hierarchical group

Identical Processes, Fail-stop A system is K fault tolerant if it can withstand faults in K components and still produce correct results. Example: FT through replication, where each replica reports a result. If the nodes in a DS are fail-stop and there are K+1 identical processes, then the system can tolerate K failures: the result comes from the remaining one.

Identical Processes, Byzantine Failures If K failures are Byzantine (with K-collusion), then 2K+1 processes are needed for K fault tolerance. Example: K processes can be faulty and "lie" about their result. (If they simply fail to report a result, that is not a problem.) If there are 2K+1 processes, at least K+1 will be correct and report the same correct answer. So by taking the result reported by at least K+1 processes (a majority), we get the correct answer.
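
The K+1 majority argument, sketched in Python for a general K; names and sample values are hypothetical:

```python
from collections import Counter

def byzantine_majority(replies, k):
    """With 2K+1 replies of which at most K are lies, any value
    reported by at least K+1 replicas must be the correct one."""
    assert len(replies) >= 2 * k + 1, "need at least 2K+1 replies"
    value, count = Counter(replies).most_common(1)[0]
    if count >= k + 1:
        return value
    raise RuntimeError("no K+1 majority: more than K faulty replicas?")

# K = 2: five replicas, two of them lying, majority still yields the truth.
print(byzantine_majority([7, 7, 7, 99, 13], k=2))  # -> 7
```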

What makes Distributed Systems Difficult? Asynchrony – even “synchronous” systems have time lag. Limited local knowledge – algorithms can consider only information acquired locally. Failures – parts of the distributed system can fail independently, leaving some nodes operational and some not.

Example: Agreement with Unreliable Communication Introduced as a voting problem (Lamport, Shostak, Pease ’82). A and B can defeat the enemy iff both attack. A sends a message to B: Attack at Noon! (Diagram: General A and General B, with the enemy between them.)

Agreement with Unreliable Comm Agreement is impossible with unreliable networks. It becomes possible with some guarantees of reliability: guaranteed delivery within bounded time, limitations on the corruption of messages, or probabilistic guarantees (send multiple messages).
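
A back-of-the-envelope sketch of the probabilistic guarantee: if each copy of a message arrives independently with probability p (independence is an assumption here), sending n copies drives delivery probability toward 1.

```python
def delivery_probability(p, n):
    # Probability that at least one of n independent copies arrives.
    return 1.0 - (1.0 - p) ** n

for n in (1, 3, 10):
    print(f"n={n}: {delivery_probability(0.9, n):.10f}")
# n=1: 0.9,  n=3: 0.999,  n=10: ~0.9999999999
```

Note that this only raises the delivery probability; it never reaches certainty, which is why exact agreement over an unreliable channel remains impossible.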