Download presentation
1
CS542: Topics in Distributed Systems
Diganta Goswami
2
Distributed System A collection of independent computers that appears to its users as a single coherent system. Autonomous computers Many components – connected by a network – sharing resources.
3
Distributed System A System of networked components that communicate and coordinate their actions only by passing messages concurrent execution of programs no global clock components fail independently of one another
4
Another definition You know you have a distributed system when the crash of a computer you’ve never heard of stops you from getting any work done. inter-dependencies shared state independent failure of components
5
A working definition for us
A distributed system is a collection of entities, each of which is autonomous, programmable, asynchronous and failure-prone, and which communicate through an unreliable communication medium using message passing. Entity=a process on a device (PC, PDA) Communication Medium=Wired or wireless network Our interest in distributed systems involves design and implementation, maintenance, algorithmics Designers and progammers interested in : algorithmics, maintenance. (informal definition, for us programmers of distributed systems) Ok: peer to peer systems Contradiction: a computer without ROM or disk drives that needs to boot over the network. Is a collection of these computers a distributed system?
6
“Important” Distributed Systems Issues
No global clock: no single global notion of the correct time (asynchrony) Unpredictable failures of components: lack of response may be due to either failure of a network component, network path being down, or a computer crash (failure-prone, unreliable) Highly variable bandwidth: from 16Kbps (slow modems or Google Balloon) to Gbps (Internet2) to Tbps (in between DCs of same big company) Possibly large and variable latency: few ms to several seconds Large numbers of hosts: 2 to several million
7
There are a range of interesting problems for Distributed System designers
Real distributed systems Cloud Computing, Peer to peer systems, Hadoop, distributed file systems, sensor networks, graph processing, … Classical Problems Failure detection, Asynchrony, Snapshots, Multicast, Consensus, Mutual Exclusion, Election, … Concurrency RPCs, Concurrency Control, Replication Control, … Security Byzantine Faults, … Others…
8
Typical Distributed Systems Design Goals
Common Goals: Heterogeneity – can the system handle a large variety of types of PCs and devices? Robustness – is the system resilient to host crashes and failures, and to the network dropping messages? Availability – are data+services always there for clients? Transparency – can the system hide its internal workings from the users? Concurrency – can the server handle multiple clients simultaneously? Efficiency – is the service fast enough? Does it utilize 100% of all resources? Scalability – can it handle 100 million nodes without degrading service? (nodes=clients and/or servers) Security – can the system withstand hacker attacks? Openness – is the system extensible?
9
Challenges and Goals of Distributed Systems
Heterogeneity Openness Security Scalability Failure handling Concurrency Transparency
10
Challenges Heterogeneity (variety and difference) of underlying network infrastructure, Internet consists of many different sorts of network – their differences are masked by the fact that all of the computers attached to them use the Internet Protocols for communication. e.g. a computer attached to an Ethernet has an implementation of the Internet Protocols over the Ethernet, whereas a computer on a different sort of network will need an implementation of the Internet Protocols for that network.
11
Heterogeneity Computer hardware and software
e.g., operating systems, compare UNIX socket and Winsock calls Programming languages : in particular, data representations
12
Some approaches: Middleware
A S/W layer that provides a programming abstraction as well as masking the heterogeneity of the underlying networks, H/W, O/S and programming languages. Middleware (e.g., CORBA): transparency of network, hard- and software and programming language heterogeneity. JAVA RMI In addition to solving the problems of heterogeneity, middleware provides a uniform computational model for use by the programmers of servers and distributed applications.
13
Positioning Middleware
General structure of a distributed system as middleware. 1-22
14
Openness Characteristic that determine whether the system can be extended and re-implemented in various ways. Determined primarily by the degree to which new resource sharing services can be added and be made available for use by a variety of client programs. Cannot be achieved unless the specification and documentation of the key s/w interfaces are made available to s/w developers (i.e. key interfaces are published)
15
Openness Designers of the Internet protocols introduced a series of documents called RFCs Specifications of the Internet communication protocols Specifications for applications run over them e.g., , telnet, file transfer, etc. (by the mid 80’s) RFCs are not the only means --- e.g. CORBA is published through a series of documents, including a complete specification of the interfaces of its services (
16
Openness Offering services according to standard rules that describe the syntax and semantics of those services e.g., Network protocol rules (RFCs) Services specified through interfaces Interface definition languages (IDLs) specifies names and available functions as well as parameters, return values, exceptions etc.
17
Security Distributed systems must protect the shared information and resources The openness of DS makes them vulnerable to security threats Providing security is a significant challenge for DS
18
Security. Integrity: protection against alteration or corruption
Privacy / Confidentiality: protection against disclosure to unauthorized individuals Integrity: protection against alteration or corruption Availability: protection against interference with the means to access the resources
19
Scalability Scalable system—system that can handle additional number of users/resources without suffering noticeable loss of performance Three metrics of a scalable system No of user/resources Distance between the farthest nodes in the system (network radius) Number of organizations exerting control over the pieces of the system
20
Challenges in designing scalable DS
Controlling the cost of physical resources: As the demand for a resource grows, it should be possible to extend the system, at reasonable cost, to meet it. e.g. it must be possible to add server computers to avoid the performance bottleneck that would arise if a single file server had to handle all file access request when the freq. of file access request grows in an intranet with the increase in users and computers. is more than one computer
21
Challenges in designing scalable DS
Controlling the performance loss: Management of a set of data whose size is proportional to the number of users or resources in the system e.g. the Domain Name System holds the table with the correspondence between domain names of computers and their Internet address Hierarchic structures scale better than linar structures.
22
An example of dividing the DNS name space into zones.
Scaling Techniques 1.5 An example of dividing the DNS name space into zones.
23
Challenges in designing scalable DS
Preventing s/w resources running out: Numbers used as Internet address bits was used in the late 70’s but may run out soon. Change from 32 bits to 128 bits? Difficult to predict the demand. Over-compensating for future growth may be worse than adapting to a change when we are forced to - large Internet address occupy extra space in messages and in computer storage.
24
Failure Handling Failure in a DS is partial
Some components fail while others continue to function This makes handling of failures difficult.
25
Techniques for dealing with failures
Detecting failures may be impossible – remote site crash or delay in message transmission? Some can be. Ex. - Checksums can be used to detect corrupted data
26
Techniques for dealing with failures
Masking failure Some can be hidden or made less severe Retransmission – when messages fail to arrive
27
Techniques for dealing with failures
Tolerating failures Would not be practical to detect and hide all of the failures. Can be designed to tolerate some of those e.g. timeouts when waiting for a web resource – clients give up after a predetermined number of attempts and take other actions & inform the user.
28
Failure Handling Recovery from failures Redundancy Rollback
Undo/Redo in transactions Redundancy Makes the system more available through replication of resources/data Redundant routes in the network Replication of name tables in multiple domain name servers
29
Concurrency In a distributed system it is possible that multiple machines/processes/users may try to access shared data/resource concurrently Can potentially lead to incorrect results and/or Deadlocks The operations must be synchronized/serialized so that the end result is correct
30
Transparency Concealing the heterogeneous and distributed nature of the system so that it appears to the user like one system Making the user believe that there is only a single, undivided system i.e., to hide the notion of distribution completely What are the challenges of transparency?
31
Transparency Categories
Access transparency - access local and remote resources using identical operations e.g., users of UNIX NFS can use the same commands and parameters for file system operations regardless of whether the accessed files are on a local or remote disk.
32
Transparency categories
Location Transparency: Access without knowledge of location of a resource e.g., URLs, addresses (hostname, IP addresses, etc. not required --- the part of the URL that identifies a web server domain name refers to a computer name in a domain, rather than to an Internet address)
33
Transparency Categories
Concurrency transparency: Allow several processes to operate concurrently using shared resources in a consistent fashion w/o interference between them. That is, users and programmers are unaware that components request services concurrently. Replication transparency Use replicated resource as if there was just one instance. Increase reliability and performance w/o knowledge of the replicas by users or application programmers.
34
Failure transparency Enables the concealment of faults, allowing users and application programs to complete their task despite failures of h/w or s/w components. Retransmit of messages – eventually delivered even when servers or communication links fail – it may even take several days.
35
Failure transparency Failure transparency depends on concurrency and replication transparency. Replication can be employed to achieve failure transparency Message transmission governed by TCP is a mechanism for providing failure transparency
36
Mobility Transparency
Mobility transparency: allow resources to move around w/o affecting the operation of users or programs e.g., 700 phone number – but URLs are not, because someone’s personal web page cannot move to their new place of work in a different domain – all of the links in other pages will still point to the original page!
37
Transparency Categories
Performance transparency: adaptation of the system to varying load situations without the user noticing it. Scaling transparency: allow system and applications to expand without need to change structure or application algorithms
38
Degree of transparency
There are systems in which attempting to blindly hide all distribution aspects from users is not always a good idea Requesting your electronic newspaper in your mailbox before 7 am local time – while you are at the other end of the world living in a different time zone (Your morning paper will not be the morning paper you are used to)
39
Degree of transparency
There is trade-off between a high degree of transparency and the performance of a system Masking transient server failure by retransmitting the request may slow down the system If it is necessary to guarantee that several replicas need to be consistent all the time, a single update may take a long time – something that cannot be hidden from the user.
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.