Distributed Computing COEN 317 DC1: Introduction
COEN 317 JoAnne Holliday Email: jholliday@scu.edu (best way to reach me) Office: Engineering 247, (408) 551-1941 Office Hours: MW 3:30-4:45 and by appointment Class web page: http://www.cse.scu.edu/~jholliday/
Textbook: Distributed Systems, Concepts and Design By Coulouris, Dollimore, Kindberg (CDK) We follow the book only roughly. The chapters with material that corresponds to the lectures are chapters 5 and 9 - 14. Read chapter 1 and 2. Review chapters 3 and 4 if needed for networks, threads and processes.
Topics What is a Distributed System? What are the goals of a Distributed System? Scalability Fault Tolerance Transparency
Definition of a Distributed System (1) A distributed system is: A collection of independent computers that appears to its users as a single coherent system.
Definition of a Distributed System (2) 1.1 A distributed system organized as middleware. Note that the middleware layer extends over multiple machines.
Distributed systems “Distributed System” covers a wide range of architectures from slightly more distributed than a centralized system to a truly distributed network of peers.
One Extreme: Centralized Centralized: mainframe and dumb terminals All of the computation is done on the mainframe. Each line or keystroke is sent from the terminal to the mainframe.
Moving Towards Distribution In a client-server system, the clients are workstations or computers in their own right and perform computations and formatting of the data. However, the data and the application which manipulates it ultimately resides on the server.
More Decentralization In Distributed-with-Coordinator, the nodes or sites depend on a coordinator node with extra knowledge or processing abilities Coordinator might be used only in case of failures or other problems
True Decentralization A true Distributed system has no distinguished node which acts as a coordinator and all nodes or sites are equals. The nodes may choose to elect one of their own to act as a temporary coordinator or leader
Distributed Systems: Pro and Con Some things that were difficult in a centralized system become easier Doing tasks faster by doing them in parallel Avoiding a single point of failure (all eggs in one basket) Geographical distribution Some things become more difficult Transaction commit Snapshots, time and causality Agreement (consensus)
Advantages of the True Distributed System No central server or coordinator means it is scalable SDDS, Scalable Distributed Data Structures, attempt to move distributed systems from a small number of nodes to thousands of nodes We need scalable algorithms to operate on these networks/structures For example peer-to-peer networks
Homogeneous and tightly coupled vs heterogeneous and loosely coupled We will study heterogeneous and loosely coupled systems.
Software Concepts DOS (Distributed Operating Systems) Description Main Goal DOS Tightly-coupled operating system for multi-processors and homogeneous multicomputers Hide and manage hardware resources NOS Loosely-coupled operating system for heterogeneous multicomputers (LAN and WAN) Offer local services to remote clients Middleware Additional layer atop of NOS implementing general-purpose services Provide distribution transparency DOS (Distributed Operating Systems) NOS (Network Operating Systems) Middleware
Distributed Operating Systems May share memory or other resources. 1.14
Network Operating System General structure of a network operating system. 1-19
Two meanings of synchronous and asynchronous communications Synchronous communications is where a process blocks after sending a message to wait for the answer or before receiving. Sync and async have come to describe the communications channels with which they are used. Synchronous: message transit time is short and bounded. If site does not respond in x sec, site can be declared dead. Simplifies algorithms! Asynchronous: message transit time is unbounded. If a message is not received in a given time interval, it could just be slow.
What’s the Point of Unrealistic Assumptions? So why should we study synchronous systems (which never occur in real life) or systems without failures? If something is difficult (intractable) in a synchronous system it will be worse in an asynchronous system. If an algorithm works well in an asynchronous system, it probably will do even better in a real system.
Scalability Something is scalable if it “increases linearly with size” where size is usually number of nodes or distance. “X is scalable with the number of nodes” Every site (node) is directly connected to every other site through a communication channel. Number of channels is NOT scalable. For N sites there are N! channels. Sites connected in a ring. # of channels IS scalable. (N channels for N sites)
Examples of scalability limitations. Scalability Problems Concept Example Centralized services A single server for all users Centralized data A single on-line telephone book Centralized algorithms Doing routing based on complete information Examples of scalability limitations.
Scaling Techniques (1) 1.4 The difference between letting: a server or a client check forms as they are being filled
An example of dividing the DNS name space into zones. Scaling Techniques (2) 1.5 An example of dividing the DNS name space into zones.
Characteristics of Scalable Distributed Algorithms No machine (node, site) has complete information about the system state. Sites make decisions based only on local information. Failure of one site does not ruin the algorithm. There is no implicit assumption that a global clock exists.
Transparency in a D.S. Transparency Description Access Hide differences in data representation and how a resource is accessed Location Hide where a resource is located Migration Hide that a resource may move to another location Relocation Hide that a resource may be moved to another location while in use Replication Hide that copies of a resource exist and a user might use different ones at different times Concurrency Hide that a resource may be shared by several competitive users Failure Hide the failure and recovery of a resource Persistence Hide whether a (software) resource is in memory or on disk Important: location, migration (relocation), replication, concurrency, failure.
Fault Tolerance An important goal in DS is to make the system resilient to failures of some of the components. Fault tolerance (FT) is frequently one of the reasons for making it distributed in the first place. Dependability Includes: Availability Reliability Safety Maintainability
Fault Tolerance Goals Availability: Can I use it now? Probability of being up at any given time. Reliability: Will it be up as long as I need it? Ability to run continuously without failure. If system crashes briefly every hour, it may still have good availability (it is up most of the time) but has poor reliability because it cannot run for very long before crashing. Safety: If it fails, ensure nothing bad happens? Maintainability: How easy is it to fix if it breaks?
Faults FAULT A fault is the cause of an error FAULT TOLERANCE - A system can continue to function even in the presence of faults. Classification of faults: Transient faults - occur once then disappear. Intermittent faults - occurs, goes away, then comes back, goes away … Permanent faults - doesn't go away by itself, like disk failures.
Failure Models Type of failure Description Crash failure or fail-stop A server halts, but is working correctly until it halts Omission failure Receive omission Send omission A server fails to respond to incoming requests A server fails to receive incoming messages A server fails to send messages Timing failure A server's response lies outside the specified time interval Response failure Value failure State transition failure The server's response is incorrect The value of the response is wrong The server deviates from the correct flow of control Arbitrary or Byzantine A server may produce arbitrary responses at arbitrary times
Failure Models Fail-stop: Process works fine until it stops, then remains dead. Other processes can detect this state. Crash: Works fine until it stops. Other processes may not be able to detect this. Send or Receive Omission: Process fails to do one thing, but works fine otherwise. Arbitrary (Byzantine): Process behavior is unpredictable, maybe malicious.
Network Failures Link failure (one way or 2 way): 5 can talk to 6, but 6 can not talk to 5 Network partitions: the network 1,2,3,4,5,6 is partitioned into 1,2,3,4 and 5,6. 1 2 5 4 6 3
Are The Models Realistic? No, of course not! Synch vs Asynch Asynchronous model is too weak (real systems have clocks, “most” timing meets expectations… but occasionally…) Synchronous model is too strong (real systems lack a way to implement synchronize rounds) Failure Types Crash fail (fail-stop) model is too weak (systems usually display some odd behavior before dying) Byzantine model is too strong (assumes an adversary of arbitrary speed who designs the “ultimate attack”)
Models: Justification If we can do something in the asynchronous model, we can probably do it even better in a real network Clocks, a-priori knowledge can only help… If we can’t do something in the synchronous model, we can’t do it in a real network After all, synchronized rounds are a powerful, if unrealistic, capability to introduce If we can survive Byzantine failures, we can probably survive a real distributed system.
Fault Tolerance Strategies Redundancy Hardware, software, informational, temporal Hierarchy Confinement of errors
Failure Masking by Redundancy Triple modular redundancy. Voter circuits choose majority of inputs to determine correct output
Flat Groups versus Hierarchical Groups Communication in a flat group. Communication in a simple hierarchical group
Identical Processes, Fail-stop A system is K fault tolerant if it can withstand faults in K components and still produce correct results. Example: FT through replication - each replica reports a result. If the nodes in a DS are fail-stop and there are K+1 identical processes, then the system can tolerate K failures: the result comes from the remaining one. 1 2 5 3 4
Identical Processes, Byzantine Failures If K failures are Byzantine (with K-collusion) then 2K+1 processes are needed for K FT. Example: K processes can be faulty and "lie" about their result. (If they simply fail to report a result, that is not a problem). If there are 2K+1 processes, at least K+1 will be correct and report the same correct answer. So by taking the result reported by at least K+1 (which is a majority), we get the correct answer.
What makes Distributed Systems Difficult? Asynchrony – even “synchronous” systems have time lag. Limited local knowledge – algorithms can consider only information acquired locally. Failures – parts of the distributed system can fail independently leaving some nodes operational and some not.
Example: Agreement with Unreliable Communication Introduced as voting problem (Lamport, Shostak, Pease ’82) A and B can defeat enemy iff both attack A sends message to B: Attack at Noon! General A General B The Enemy
Agreement with Unreliable Comm Impossible with unreliable networks Possible if some guarantees of reliability Guaranteed delivery within bounded time Limitations on corruption of messages Probabilistic guarantees (send multiple messages)