Distributed Systems: Major Design Issues
Debraj De
Presentation, 09/14/2011
CS8320 – Advanced Operating Systems, Fall 2011
Section 2.6 Presentation
Distributed System Design Issues: Presentation Outline
  - Introduction
  - Distributed System Design Issues
      - Object Models and Naming Schemes
      - Distributed Coordination
      - Interprocess Communication
      - Distributed Resources
      - Fault Tolerance and Security
  - References
Introduction
  Management of a distributed system mainly consists of:
  - Coordination of concurrent distributed processes
  - Management and networking of distributed resources
  - Functioning of distributed algorithms
  But the network may be unreliable and components may be untrusted.
  These raise design and implementation issues, in particular how to support transparency.
Introduction
  The following need to be considered to resolve the design and implementation issues:
  - How objects in the system are modeled and identified
  - How to coordinate the interaction among objects and how they communicate with each other
  - How shared or replicated objects can be managed in a controlled fashion
  - How to protect objects and keep the system secure
Design & Implementation Issues
  - Object Models and Naming Schemes
  - Distributed Coordination
  - Interprocess Communication
  - Distributed Resources
  - Fault Tolerance and Security
[1] Object Models and Naming Schemes
  - Objects in a computer system: processes, data files, memory, devices, processors, and networks
  - Objects are encapsulated in servers: process servers, file servers, memory servers, etc.
  - A client is a null server that accesses object servers
[1] Object Models and Naming Schemes (Cont'd)
  Three possible ways to identify a server:
  - Identification by name (name server)
  - Identification by either physical or logical address (network server)
  - Identification by the service that the server provides
  The following all depend on the naming scheme for system objects:
  - Structure of the system, management of the name space, name resolution, access methods
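The identification options above can be sketched with a toy in-memory name server. This is a minimal illustration, not an implementation of any real naming service: the class `NameServer`, its methods, and the sample names and addresses are all hypothetical.

```python
class NameServer:
    """Toy name server: resolves a server name to an address, or finds servers by service."""

    def __init__(self):
        self._by_name = {}     # server name -> (host, port) address
        self._by_service = {}  # service type -> set of server names

    def register(self, name, address, service):
        """Record a server under its name and under the service it provides."""
        self._by_name[name] = address
        self._by_service.setdefault(service, set()).add(name)

    def resolve(self, name):
        """Identification by name: return the address (raises KeyError if unknown)."""
        return self._by_name[name]

    def find_by_service(self, service):
        """Identification by service: return the names of all matching servers."""
        return sorted(self._by_service.get(service, ()))


# Hypothetical servers, for illustration only.
ns = NameServer()
ns.register("fs1", ("10.0.0.5", 2049), service="file")
ns.register("fs2", ("10.0.0.6", 2049), service="file")
ns.register("ps1", ("10.0.0.7", 7000), service="process")
```

A client that knows only the name `"fs1"` calls `ns.resolve("fs1")` to obtain an address, while a client that only needs some file server can call `ns.find_by_service("file")` and pick one.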
[2] Distributed Coordination
  Processes require coordination to achieve synchronization.
  Types of synchronization requirement:
  - Barrier synchronization
  - Condition coordination
  - Mutual exclusion
[2] Distributed Coordination: Types of Synchronization
  - Barrier synchronization: processes must reach a common synchronization point before they can continue
  - Condition coordination: a process must wait for a condition that will be set asynchronously by other interacting processes, to maintain some ordering of execution
  - Mutual exclusion: concurrent processes must have mutually exclusive access to a critical shared resource
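The three requirements can be illustrated on a single machine with Python's `threading` primitives. This is only a sketch of the concepts: in a distributed system there is no shared memory, so the same guarantees have to be built from message passing. The worker function and counter are made up for the example.

```python
import threading

N = 3
barrier = threading.Barrier(N)  # barrier synchronization
ready = threading.Event()       # condition coordination
lock = threading.Lock()         # mutual exclusion
counter = 0


def worker():
    global counter
    barrier.wait()  # no worker continues until all N workers have arrived
    ready.wait()    # block until another thread sets the condition
    with lock:      # only one thread at a time updates the shared counter
        counter += 1


threads = [threading.Thread(target=worker) for _ in range(N)]
for t in threads:
    t.start()
ready.set()         # the condition is set asynchronously by the main thread
for t in threads:
    t.join()
```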
[2] Distributed Coordination: State information synchronization
  - No shared memory
  - State information carried in messages may be out of date, inaccurate, or incomplete
  - Centralized coordination, with the coordinator role shifting among processes
[2] Distributed Coordination: Deadlock
  - Process deadlock is a problem related to synchronization
  - Deadlock detection and recovery tools are needed
  - Four conditions must hold for deadlock to occur:
      - Exclusive use (mutual exclusion)
      - Hold and wait
      - No preemption
      - Cyclical (circular) wait
[2] Distributed Coordination: Handling deadlocks
  The problem of deadlocks can be handled in the following ways:
  - Prevention: ensure that deadlock is not possible
  - Avoidance: the system makes decisions while it is running to ensure that deadlocks will not occur
  - Detection: when a deadlock is detected, decide which process to roll back or abnormally terminate
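Detection is commonly framed as finding a cycle in a wait-for graph, where an edge P -> Q means process P is waiting for a resource held by Q. The sketch below assumes the whole graph is available in one place as a dictionary, which sidesteps the hard part of distributed detection (assembling a consistent graph from several sites). The function name and process names are illustrative.

```python
def find_cycle(wait_for):
    """Return a list of processes forming a wait-for cycle, or None if there is none.

    wait_for maps each process to the processes it is waiting on.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, fully explored
    color = {}
    path = []

    def visit(p):
        color[p] = GREY
        path.append(p)
        for q in wait_for.get(p, ()):
            state = color.get(q, WHITE)
            if state == GREY:              # q is already on the current path: cycle
                return path[path.index(q):]
            if state == WHITE:
                cycle = visit(q)
                if cycle:
                    return cycle
        path.pop()
        color[p] = BLACK
        return None

    for p in list(wait_for):
        if color.get(p, WHITE) == WHITE:
            cycle = visit(p)
            if cycle:
                return cycle
    return None


deadlocked = find_cycle({"P1": ["P2"], "P2": ["P3"], "P3": ["P1"]})
healthy = find_cycle({"P1": ["P2"], "P2": []})
```

Any process in the returned cycle is a candidate victim for rollback or termination; how to choose among them is a policy decision not covered here.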
[2] Distributed Coordination: Deadlock prevention
  - Preventing any one of the four conditions prevents deadlock
  - For example, impose an order on the resources and require processes to request resources in increasing order
  - This prevents cyclical wait and thus makes deadlock impossible
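Resource ordering can be sketched with two locks that each carry a rank. The helper names (`OrderedLock`, `acquire_in_order`, `release_all`) are hypothetical. Each task names the locks in a different order, but the helper always acquires them by increasing rank, so a cyclical wait cannot form.

```python
import threading


class OrderedLock:
    """A lock with a fixed rank that defines the global acquisition order."""

    def __init__(self, rank):
        self.rank = rank
        self.lock = threading.Lock()


def acquire_in_order(*locks):
    """Acquire all locks in increasing rank order, whatever order the caller listed."""
    ordered = sorted(locks, key=lambda l: l.rank)
    for l in ordered:
        l.lock.acquire()
    return ordered


def release_all(ordered):
    """Release in reverse acquisition order."""
    for l in reversed(ordered):
        l.lock.release()


a, b = OrderedLock(1), OrderedLock(2)
done = []


def task(first, second, name):
    for _ in range(1000):
        held = acquire_in_order(first, second)
        release_all(held)
    done.append(name)


# The two tasks list the locks in opposite orders, the classic deadlock setup.
t1 = threading.Thread(target=task, args=(a, b, "t1"))
t2 = threading.Thread(target=task, args=(b, a, "t2"))
t1.start()
t2.start()
t1.join()
t2.join()
```

If each task instead acquired the locks in the order it listed them, `t1` could hold `a` while waiting for `b` and `t2` could hold `b` while waiting for `a`.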
[2] Distributed Coordination: Real-world example
  - The Mars Pathfinder resets: a synchronization failure closely related to deadlock, caused by priority inversion
  - M. Jones, "What really happened on Mars Rover Pathfinder", The Risks Digest, 19(49), December 1997
[2] Distributed Coordination: Mars Rover frequent-reset issue
  - Data-gathering thread (low priority): lock.acquire(); write data; lock.release();
  - Info-bus thread (high priority): retrieve data
  - Communication thread (medium priority, long-running)
  - The information bus is shared memory protected by the lock
  - The info-bus thread waits for the data-gathering thread to release the lock
  - The communication thread preempts the data-gathering thread, so the lock stays held and the high-priority info-bus thread cannot run
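The scenario above can be reproduced with a small deterministic simulation of a priority scheduler. This is a sketch under made-up assumptions: the arrival times, work amounts, and one-tick scheduling model are invented for illustration and are not taken from the actual flight software. The `inherit` flag switches on priority inheritance, in which the lock holder temporarily runs at the priority of the highest-priority thread waiting for the lock.

```python
def simulate(inherit):
    """Return the tick at which the high-priority info-bus task finishes."""
    tasks = {
        "data_gathering": {"prio": 1, "arrive": 0, "left": 3, "lock": True},
        "info_bus":       {"prio": 3, "arrive": 1, "left": 1, "lock": True},
        "communication":  {"prio": 2, "arrive": 2, "left": 5, "lock": False},
    }
    holder = None  # name of the task currently holding the lock
    for tick in range(100):
        ready = [n for n, t in tasks.items()
                 if t["arrive"] <= tick and t["left"] > 0]
        if not ready:
            continue
        # A task that needs the lock is blocked while another task holds it.
        blocked = [n for n in ready
                   if tasks[n]["lock"] and holder not in (None, n)]
        runnable = [n for n in ready if n not in blocked]

        def effective(n):
            prio = tasks[n]["prio"]
            if inherit and n == holder:
                # Priority inheritance: borrow the priority of blocked waiters.
                prio = max([prio] + [tasks[b]["prio"] for b in blocked])
            return prio

        current = max(runnable, key=effective)
        if tasks[current]["lock"]:
            holder = current
        tasks[current]["left"] -= 1
        if tasks[current]["left"] == 0:
            if holder == current:
                holder = None          # finished: release the lock
            if current == "info_bus":
                return tick + 1
    return None


without_inheritance = simulate(inherit=False)
with_inheritance = simulate(inherit=True)
```

Without inheritance the medium-priority task runs to completion before the lock is released, so the high-priority task finishes late. With inheritance the lock holder is not preempted by the medium-priority task and the high-priority task finishes much earlier.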
[3] Interprocess Communication
  - Interprocess communication can be accomplished by using simple message-passing primitives
  - Higher-level logical communication methods provide transparency by hiding the physical details of message passing
  - Two important concepts:
      - The client/server model
      - Remote Procedure Call (RPC)
[3] Interprocess Communication: The client/server model
  - The client/server model is a programming paradigm for structuring processes in distributed systems
  [Figure: the client and server exchange a request and a reply as logical communication; the actual communication passes through the kernel on each side and the network]
[3] Interprocess Communication: Remote Procedure Call (RPC)
  The RPC model is similar to the local procedure call model, in which:
  - The caller places the arguments to a procedure in a specific location (such as a register)
  - The caller temporarily transfers control to the procedure
  - When the caller regains control, it obtains the results of the procedure from the specified location
  - The caller then continues program execution
[3] Interprocess Communication: RPC on the server side
  - A server process is dormant (inactive, sleeping), awaiting the arrival of a call message
  - When one arrives, the server process computes a reply that it then sends back to the requesting client
  - After this, the server process becomes dormant again
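The call and reply cycle described above can be sketched in a single process, using two queues in place of the network and JSON in place of a real marshalling format. Everything here is an assumption for illustration: the `PROCEDURES` table, the message layout, and the `remote_call` stub do not correspond to any particular RPC system.

```python
import json
import queue
import threading

requests = queue.Queue()  # stands in for the client-to-server network path
replies = queue.Queue()   # stands in for the server-to-client network path

# Procedures the server exports (hypothetical).
PROCEDURES = {"add": lambda a, b: a + b}


def server_loop():
    while True:
        message = requests.get()   # dormant until a call message arrives
        if message is None:        # sentinel used here to shut the server down
            break
        call = json.loads(message)
        result = PROCEDURES[call["proc"]](*call["args"])
        replies.put(json.dumps({"result": result}))  # send the reply back


def remote_call(proc, *args):
    """Client stub: marshal the arguments, send the call, block for the reply."""
    requests.put(json.dumps({"proc": proc, "args": args}))
    return json.loads(replies.get())["result"]


server = threading.Thread(target=server_loop)
server.start()
total = remote_call("add", 2, 3)   # looks like a local call to the caller
requests.put(None)
server.join()
```

The caller of `remote_call` sees the same sequence as a local call: pass arguments, give up control, get the result back, continue. A real system would add a transport, timeouts, and error handling.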
[4] Distributed Resources
  - Resources: data and processing capacity
  - Load distribution:
      - Static: multiprocessor scheduling
      - Dynamic: load distribution/sharing
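The difference between a static and a dynamic policy can be shown with two toy assignment functions. Both policies are assumptions chosen for brevity: round-robin stands in for a schedule fixed in advance, and "send each job to the currently least-loaded node" stands in for dynamic load sharing. Job costs are invented.

```python
def static_assign(jobs, n_nodes):
    """Static policy: the placement is fixed in advance (round-robin), ignoring load."""
    loads = [0] * n_nodes
    for i, cost in enumerate(jobs):
        loads[i % n_nodes] += cost
    return loads


def dynamic_assign(jobs, n_nodes):
    """Dynamic policy: each job goes to whichever node is least loaded right now."""
    loads = [0] * n_nodes
    for cost in jobs:
        target = loads.index(min(loads))  # ties go to the lowest-numbered node
        loads[target] += cost
    return loads


jobs = [8, 1, 7, 1, 6, 1]  # hypothetical job costs
static_loads = static_assign(jobs, 2)
dynamic_loads = dynamic_assign(jobs, 2)
```

With these costs the static placement puts every expensive job on the same node, while the dynamic placement ends up far more balanced. The dynamic version assumes current load information is available, which in a distributed system has its own cost.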
[4] Distributed Resources
  - Distributed shared memory
  - Distributed file systems
  - Issue: sharing and replication of data
  - Both require maintaining data consistency and coherency
  - Distributed file systems and distributed shared memory differ in implementation
[5] Fault Tolerance and Security
  - Distributed systems operate in an open environment
  - So they are vulnerable to failures and security threats
  - Faults: failures and security violations
[5] Fault Tolerance and Security: Failures
  The problem of failures can be alleviated through:
  - Redundancy
  - Transparent handling of failures (such as removal of machines, network links, and other resources) without loss of data or functionality
  - Roll-back recovery for execution states
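Roll-back recovery can be sketched as saving a checkpoint of the execution state before an operation and restoring it if the operation fails. The class `CheckpointedProcess`, its `apply` method, and the simulated failure are hypothetical; a real system would write checkpoints to stable storage and coordinate them across processes.

```python
import copy


class CheckpointedProcess:
    """Keeps one checkpoint of its state and can roll back to it."""

    def __init__(self, state):
        self.state = state
        self._checkpoint = copy.deepcopy(state)

    def checkpoint(self):
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        self.state = copy.deepcopy(self._checkpoint)

    def apply(self, key, delta, fail=False):
        """Update state[key]; on failure restore the last checkpoint."""
        self.checkpoint()
        try:
            self.state[key] += delta
            if fail:
                raise RuntimeError("simulated failure after a partial update")
        except RuntimeError:
            self.rollback()   # undo the partial update
            return False
        return True


p = CheckpointedProcess({"balance": 100})
ok = p.apply("balance", -30)                # succeeds, balance becomes 70
failed = p.apply("balance", -50, fail=True)  # fails, state rolls back to 70
```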
[5] Fault Tolerance and Security: Security
  - OS view: trustworthy communicating processes, confidentiality and integrity of messages and data
  - Authentication: clients, servers, and messages must all be authenticated
  - Authorization: access control has to be performed across a physical network with heterogeneous components, under different administrative units using different security models
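Message authentication and integrity can be sketched with a keyed hash (HMAC) from Python's standard library. The shared key and message contents are placeholders, and the sketch assumes both parties already hold the same secret key; distributing that key is a separate problem. It does not provide confidentiality, since the message itself is sent in the clear.

```python
import hashlib
import hmac

SHARED_KEY = b"demo-key-not-for-production"  # placeholder secret, assumed pre-shared


def sign(message: bytes) -> bytes:
    """Sender side: compute an authentication tag over the message."""
    return hmac.new(SHARED_KEY, message, hashlib.sha256).digest()


def verify(message: bytes, tag: bytes) -> bool:
    """Receiver side: recompute the tag and compare in constant time."""
    return hmac.compare_digest(sign(message), tag)


message = b"read /home/alice/report.txt"
tag = sign(message)
accepted = verify(message, tag)
tampered = verify(b"read /home/bob/report.txt", tag)
```

A receiver that accepts only messages for which `verify` returns `True` knows the message came from a key holder and was not modified in transit.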
[5] Fault Tolerance and Security: Ariane 5 failure
  - A software bug caused the European Space Agency's Ariane 5 rocket to crash 40 seconds into its first flight in 1996 (cost: half a billion dollars)
  - The bug came from a software component that was being reused from Ariane 4
  - A software exception occurred during a data conversion from a 64-bit floating-point value to a 16-bit signed integer value
  - The value was larger than 32,767, the largest integer storable in a 16-bit signed integer, so the conversion failed and the program raised an exception
  - For an earlier version of the Ariane rocket, engineers had chosen to leave this function running for the first 40 seconds of flight, to make it easy to restart the system in the event of a brief hold in the countdown
  [Source: http://www.ima.umn.edu/~arnold/disasters/ariane5rep.html]
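The failing conversion can be imitated in Python with the `struct` module, which refuses to pack a value outside the 16-bit signed range. This only illustrates the class of error; it is not the flight code. The saturating variant is one possible guard, included as an assumption for contrast, not as the fix that was actually adopted.

```python
import struct

INT16_MIN, INT16_MAX = -32768, 32767


def to_int16_unchecked(value: float) -> bytes:
    """Pack a float as a 16-bit signed integer; raises struct.error if out of range."""
    return struct.pack("<h", int(value))


def to_int16_saturating(value: float) -> int:
    """Clamp to the representable range instead of failing (hypothetical guard)."""
    return max(INT16_MIN, min(INT16_MAX, int(value)))


in_range = struct.unpack("<h", to_int16_unchecked(1234.0))[0]

try:
    to_int16_unchecked(40000.0)   # larger than 32,767
    overflowed = False
except struct.error:
    overflowed = True             # the conversion fails, as in the slide above

clamped = to_int16_saturating(40000.0)
```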
Summary
  Distributed systems raise unique design and implementation issues, which include:
  - Object models and naming schemes
  - Distributed coordination
  - Interprocess communication
  - Distributed resources
  - Fault tolerance and security
References
  [1] Randy Chow and Theodore Johnson, "Distributed Operating Systems & Algorithms", Addison-Wesley, 1997, pp. 45-50, 61-63.
  [2] Suresh Sridharan, "Distributed Operating Systems", University of Wisconsin, Madison, 2006. http://pages.cs.wisc.edu/~dusseau/Classes/CS739/Writeups/Survey.pdf
  [3] JoAnne L. Holliday and Amr El Abbadi, "Distributed Deadlock Detection". http://www.cse.scu.edu/~jholliday/dd_9_16.htm
  [4] List of distributed computing projects. http://en.wikipedia.org/wiki/List_of_distributed_computing_projects
Questions
Thank You
Email: dde1@student.gsu.edu