Coda is a descendant of AFS, developed by Mahadev Satyanarayanan and coworkers at Carnegie Mellon University since 1987. It is open source and uses advanced caching schemes that allow a client to continue operation despite being disconnected from a server. An important goal was to achieve a high degree of naming and location transparency, so that the system would appear to its users very much like a purely local file system.

Coda is a descendant of version 2 of the Andrew File System (AFS). AFS nodes are partitioned into two groups. One group consists of a relatively small number of dedicated Vice file servers, which are centrally administered. The other group consists of a very much larger collection of Virtue workstations that give users and processes access to the file system, as shown in the figure.

Communication RPC2 offers reliable RPCs on top of the (unreliable) UDP protocol. Each time a remote procedure is called, the RPC2 client code starts a new thread that sends an invocation request to the server and subsequently blocks until it receives an answer. As request processing may take an arbitrary time to complete, the server regularly sends back messages to the client to let it know it is still working on the request. If the server dies, sooner or later this thread will notice that the messages have ceased and report back failure to the calling application.

An interesting aspect of RPC2 is its support for side effects. A side effect is a mechanism by which client and server can communicate using an application-specific protocol. Consider, for example, a client opening a file at a video server: what is needed there is a continuous, isochronous data stream. RPC2 allows the client and the server to set up a separate connection for transferring the video data to the client in time. Connection setup is done as a side effect of an RPC call to the server.

Another feature of RPC2 that makes it different from other RPC systems is its support for multicasting. An important design issue in Coda is that servers keep track of which clients have a local copy of a file. When a file is modified, a server invalidates local copies by notifying the appropriate clients through an RPC.

Parallel RPCs are implemented by means of the MultiRPC system, which is part of the RPC2 package. An important aspect of MultiRPC is that the parallel invocation of RPCs is fully transparent to the callees: the receiver of a MultiRPC call cannot distinguish it from an ordinary RPC. MultiRPC is implemented by essentially executing multiple RPCs in parallel.
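
The sketch below illustrates the idea of issuing the same call to several parties in parallel while each callee sees an ordinary request. It is a minimal illustration using Python threads; the invoke() stub, multi_rpc() helper, and the "InvalidateFile" procedure name are hypothetical, not the real RPC2/MultiRPC API.

```python
# Illustrative sketch only: MultiRPC-style parallel invocation expressed with
# Python threads; invoke() and its arguments are hypothetical placeholders.
from concurrent.futures import ThreadPoolExecutor

def invoke(host, procedure, *args):
    """Placeholder for a single, blocking RPC to one host."""
    ...

def multi_rpc(hosts, procedure, *args):
    # Issue the same call to every host in parallel; each callee just sees an
    # ordinary RPC and cannot tell that it is part of a parallel invocation.
    with ThreadPoolExecutor(max_workers=len(hosts)) as pool:
        futures = {h: pool.submit(invoke, h, procedure, *args) for h in hosts}
    # One reply (or exception) per host.
    return {h: f.result() for h, f in futures.items()}

# Example (hypothetical procedure name): invalidate cached copies of a file
# at several clients at once.
# replies = multi_rpc(["clientA", "clientB"], "InvalidateFile", "/coda/f")
```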

Synchronization Sharing Files in Coda To accommodate file sharing, Coda uses a special allocation scheme that bears some similarities to share reservations in NFS. To understand how the scheme works, the following is important. When a client successfully opens a file f, an entire copy of f is transferred to the client’s machine. The server records that the client has a copy of f. This approach is similar to open delegation in NFS.

Now suppose client A has opened file f for writing. When another client B wants to open f as well, its open will fail. This failure is caused by the fact that the server has recorded that client A might have already modified f. Had client A opened f for reading, on the other hand, an attempt by client B to get a copy from the server for reading would succeed, and an attempt by B to open f for writing would succeed as well.
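
A minimal sketch of the server-side decision just described, assuming a hypothetical CodaServerSketch class; this is not Coda's actual code, only an illustration of the sharing rule.

```python
# Conceptual sketch (not Coda's actual code) of the open-time check described
# above; class and method names are made up for illustration.
class CodaServerSketch:
    def __init__(self):
        self.copies = {}          # file name -> {client: "read" | "write"}

    def open(self, client, fname, mode):
        holders = self.copies.setdefault(fname, {})
        # If some other client may already be writing the file, refuse the open.
        if any(m == "write" for c, m in holders.items() if c != client):
            raise PermissionError(f"{fname} may have been modified by another client")
        # Otherwise record that this client now holds a copy and ship the
        # entire file to it (whole-file transfer, as in Coda).
        holders[client] = mode
        return self.transfer_whole_file(fname)

    def transfer_whole_file(self, fname):
        ...                       # placeholder for the actual file transfer
```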

Caching and Replication Client Caching Client-side caching is crucial to the operation of Coda for two reasons. First, caching is done to achieve scalability. Second, caching provides a higher degree of fault tolerance, as the client becomes less dependent on the availability of the server. For these two reasons, clients in Coda always cache entire files: when a file is opened for either reading or writing, an entire copy of the file is transferred to the client, where it is subsequently cached.

Cache coherence is maintained by means of callbacks: when a server hands a client a copy of a file, it is said to record a callback promise for that client. When a client updates its local copy of the file for the first time, it notifies the server, which, in turn, sends an invalidation message to the other clients. Such an invalidation message is called a callback break, because the server then discards the callback promise it held for each client to which it just sent an invalidation.

The interesting aspect of this scheme is that as long as a client knows it has an outstanding callback promise at the server, it can safely access the file locally. In particular, suppose a client opens a file and finds it is still in its cache. It can then use that file provided the server still has a callback promise on the file for that client. The client will have to check with the server if that promise still holds. If so, there is no need to transfer the file from the server to the client again.
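
The following sketch shows this client-side decision in a compact form. It is only a conceptual illustration: the cache layout and the server methods (still_has_promise, fetch) are hypothetical names, not Coda's real interfaces.

```python
# Conceptual sketch of the client-side check described above; the cache
# structure and server methods are hypothetical.
def open_cached(fname, cache, server):
    entry = cache.get(fname)
    if entry is not None and entry["promise_valid"]:
        # Outstanding callback promise: the server would have sent a callback
        # break if someone else had changed the file, so use the cached copy.
        return entry["data"]
    if entry is not None and server.still_has_promise(fname):
        # We were unsure (e.g. after reconnecting), but the promise still
        # holds, so there is no need to transfer the file again.
        entry["promise_valid"] = True
        return entry["data"]
    # No usable copy: fetch the whole file and receive a fresh promise.
    cache[fname] = {"data": server.fetch(fname), "promise_valid": True}
    return cache[fname]["data"]
```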

Server-Side Replication Coda allows file servers to be replicated. As we mentioned, the unit of replication is a collection of files called a volume. The collection of Coda servers that have a copy of a volume is known as that volume's Volume Storage Group, or simply VSG. In the presence of failures, a client may not have access to all servers in a volume's VSG. A client's Accessible Volume Storage Group (AVSG) for a volume consists of those servers in that volume's VSG that the client can currently contact. If the AVSG is empty, the client is said to be disconnected.

Coda uses a replicated-write protocol to maintain consistency of a replicated volume. In particular, it uses a variant of Read-One, Write-All (ROWA). When a client needs to read a file, it contacts one of the members in its AVSG of the volume to which that file belongs. However, when closing a session on an updated file, the client transfers it in parallel to each member in the AVSG.
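
A minimal sketch of this ROWA variant, assuming hypothetical replica objects with fetch/store methods; it only illustrates "read from one accessible server, write to all accessible servers in parallel".

```python
# Conceptual sketch of the ROWA variant described above, not actual Coda code;
# the replica objects and their fetch/store methods are hypothetical.
from concurrent.futures import ThreadPoolExecutor

def read_file(avsg, fname):
    # Read-One: any currently accessible replica will do.
    return avsg[0].fetch(fname)

def close_updated_file(avsg, fname, new_contents):
    # Write-All (of the accessible servers): on closing the session, transfer
    # the updated file to every member of the AVSG in parallel.
    with ThreadPoolExecutor(max_workers=len(avsg)) as pool:
        list(pool.map(lambda replica: replica.store(fname, new_contents), avsg))
```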

This scheme works fine as long as there are no failures, that is, as long as each client's AVSG of a volume is the same as that volume's VSG. However, in the presence of failures, things may go wrong. Consider a volume that is replicated across three servers S1, S2, and S3. For client A, assume its AVSG covers servers S1 and S2, whereas client B has access only to server S3, as shown in the figure.

Coda uses an optimistic strategy for file replication. In particular, both A and B will be allowed to open a file f for writing, update their respective copies, and transfer their copy back to the members in their AVSG. Obviously, there will then be different versions of f stored in the VSG. The question is how this inconsistency can be detected and resolved.

The solution adopted by Coda is to deploy a versioning scheme. In particular, a server Si in a VSG maintains a Coda version vector CVVi(f) for each file f contained in that VSG. If CVVi(f)[j] = k, then server Si knows that server Sj has seen at least version k of file f. CVVi(f)[i] is the number of the current version of f stored at server Si, and an update of f at server Si will lead to an increment of CVVi(f)[i].

Returning to our three-server example, CVVi(f) is initially equal to [1,1,1] for each server Si. When client A reads f from one of the servers in its AVSG, say S1, it also receives CVV1(f). After updating f, client A multicasts f to each server in its AVSG, that is, S1 and S2. Both servers will then record that their respective copy has been updated, but not that of S3. In other words, CVV1(f) = CVV2(f) = [2,2,1].

Meanwhile, client B will be allowed to open a session in which it receives a copy of f from server S3, and subsequently to update f as well. When B closes its session and transfers the update to S3, server S3 updates its version vector to CVV3(f) = [1,1,2]. When the partition is healed, the three servers will need to reintegrate their copies of f. By comparing their version vectors, they will notice that a conflict has occurred that needs to be repaired.
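
The version-vector bookkeeping of this example can be replayed in a few lines. This is only a conceptual sketch (not Coda source code); indices 0, 1, 2 stand for servers S1, S2, S3, and the helper names are made up.

```python
# Conceptual sketch of Coda version vectors for the partition example above.
def install_update(cvv, updated_servers):
    """Every server that received the update increments its own entry."""
    return [v + 1 if i in updated_servers else v for i, v in enumerate(cvv)]

def compare(a, b):
    """Return which vector dominates, or report a conflict."""
    if all(x >= y for x, y in zip(a, b)):
        return "first is at least as new"
    if all(x <= y for x, y in zip(a, b)):
        return "second is at least as new"
    return "conflict: concurrent updates, repair needed"

cvv = [1, 1, 1]                                 # initial vector at every server
cvv_s1 = cvv_s2 = install_update(cvv, {0, 1})   # A's update reaches S1, S2 -> [2, 2, 1]
cvv_s3 = install_update(cvv, {2})               # B's update reaches only S3 -> [1, 1, 2]
print(compare(cvv_s1, cvv_s3))                  # -> conflict detected on reintegration
```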

XFS Design goals for this file system (from Silicon Graphics) centered on supporting intense I/O performance demands, large (media) files, and file systems with many files and many large files:
– terabytes of disk space (so many files and directories)
– huge files
– hundreds of MB/s of I/O bandwidth

Every machine involved in xFS can be a server for some files and a client for other files. xFS uses the storage technology called RAID (redundant array of independent disks) to spread data over multiple disks.

RAID The basic idea of RAID is file striping: each file is partitioned into multiple pieces, and each piece is stored on a different disk. The main advantage of file striping is parallelism: when accessing a file, all the pieces of that single file on different disks can be accessed in parallel, which can result in linear speedup. In addition, automatic load balancing comes for free, because popular files are distributed across multiple disks and therefore no single disk will be overloaded. However, file striping also has a disadvantage: when a disk fails, every file with a piece on that disk is lost. Assuming independent failures, the mean time to failure drops significantly as the number of disks increases. Therefore, redundancy must be added to make the system usable in practice.
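
The toy sketch below shows striping only (parity and redundancy are omitted) and is not a real RAID implementation; the stripe size and helper names are illustrative choices.

```python
# Toy illustration of file striping: a file is cut into fixed-size pieces that
# are placed round-robin over the disks, so they can be read back in parallel.
STRIPE_SIZE = 4096          # bytes per piece; illustrative value

def stripe(data: bytes, num_disks: int):
    pieces = [data[i:i + STRIPE_SIZE] for i in range(0, len(data), STRIPE_SIZE)]
    disks = [[] for _ in range(num_disks)]
    for i, piece in enumerate(pieces):
        disks[i % num_disks].append(piece)     # round-robin placement
    return disks

def unstripe(disks):
    # Read the pieces back in placement order; losing any one disk would lose
    # part of every large file, which is why redundancy must be added.
    pieces = []
    for slot in range(max(len(d) for d in disks)):
        for d in disks:
            if slot < len(d):
                pieces.append(d[slot])
    return b"".join(pieces)

assert unstripe(stripe(b"x" * 10000, 4)) == b"x" * 10000
```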

Overview of xFS. The xFS file system is based on a serverless model: the entire file system is distributed across machines, including clients. Each machine can run a storage server, a metadata server, and a client process.

A typical distribution of xFS processes across multiple machines.

Communication in xFS xFS replaced RPC with active messages: RPC performance was not good enough, and full decentralization is hard to manage with RPC. With active messages, when a message arrives, a handler is automatically invoked for execution.
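
The general active-message idea can be illustrated as follows. This is only a conceptual sketch, not the actual xFS code; the handler registry and message layout are made up.

```python
# Conceptual sketch of the active-message idea: every message names a handler,
# and the receiver runs that handler directly on arrival instead of queueing
# the message for a blocked RPC caller.
HANDLERS = {}

def handler(name):
    def register(fn):
        HANDLERS[name] = fn
        return fn
    return register

@handler("read_block")
def read_block(sender, block_id):
    print(f"reading block {block_id} on behalf of {sender}")

def deliver(message):
    # message = (handler name, sender, arguments...)
    name, sender, *args = message
    HANDLERS[name](sender, *args)      # invoke the named handler immediately

deliver(("read_block", "machine-7", 42))
```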

DISTRIBUTED COORDINATION-BASED SYSTEMS COORDINATION MODELS The coordination part of a distributed system handles the communication and cooperation between processes. It forms the glue that binds the activities performed by processes into a whole. We make a distinction between models along two different dimensions, temporal and referential, as shown in the figure.

When processes are temporally and referentially coupled, coordination takes place in a direct way, referred to as direct coordination. Referential coupling generally appears in the form of explicit referencing in communication: for example, a process can communicate only if it knows the name or identifier of the other processes it wants to exchange information with. Temporal coupling means that the communicating processes both have to be up and running. This coupling is analogous to transient message-oriented communication.

A different type of coordination occurs when processes are temporally decoupled but referentially coupled, which we refer to as mailbox coordination. In this case, there is no need for two communicating processes to execute at the same time in order for communication to take place. Instead, communication takes place by putting messages in a (possibly shared) mailbox.

The combination of referentially decoupled and temporally coupled systems forms the group of models for meeting-oriented coordination. In referentially decoupled systems, processes do not know each other explicitly: when a process wants to coordinate its activities with other processes, it cannot directly refer to another process. Instead, there is the concept of a meeting, in which processes temporarily group together to coordinate their activities. The model prescribes that the meeting processes execute at the same time.

Meeting-based systems are often implemented by means of events, like the ones supported by object-based distributed systems. A well-known mechanism for implementing meetings is publish/subscribe systems. In these systems, processes can subscribe to messages containing information on specific subjects, while other processes produce (i.e., publish) such messages. Most publish/subscribe systems require that communicating processes are active at the same time; hence there is a temporal coupling.

The most widely known coordination model is the combination of referentially and temporally decoupled processes, exemplified by generative communication as introduced in the Linda programming system. The key idea in generative communication is that a collection of independent processes makes use of a shared persistent dataspace of tuples. Tuples are tagged data records consisting of a number (possibly zero) of typed fields. Processes can put any type of record into the shared dataspace (i.e., they generate communication records).

An interesting feature of these shared dataspaces is that they implement an associative search mechanism for tuples. In other words, when a process wants to extract a tuple from the dataspace, it essentially specifies (some of) the values of the fields it is interested in. Any tuple that matches that specification is then removed from the dataspace and passed to the process. If no match can be found, the process can choose to block until there is a matching tuple.
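
A minimal single-process sketch of such a tuple space with associative matching follows. The operation names (out, rd, inp) follow Linda convention, but this is not a real Linda implementation: blocking and concurrency are left out, and None plays the role of a "don't care" field in templates.

```python
# Minimal sketch of a Linda-style tuple space with associative matching.
class TupleSpace:
    def __init__(self):
        self.tuples = []

    def out(self, *tup):                  # generate: put a tuple into the space
        self.tuples.append(tup)

    def _matches(self, tup, template):
        return (len(tup) == len(template) and
                all(t is None or t == field for t, field in zip(template, tup)))

    def rd(self, *template):              # read a matching tuple, leave it in place
        return next((t for t in self.tuples if self._matches(t, template)), None)

    def inp(self, *template):             # read and remove a matching tuple
        t = self.rd(*template)
        if t is not None:
            self.tuples.remove(t)
        return t

ts = TupleSpace()
ts.out("temperature", "room-12", 21.5)
print(ts.inp("temperature", "room-12", None))   # specify only the first two fields
```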

ARCHITECTURES Overall Approach Let us first assume that data items are described by a series of attributes. A data item is said to be published when it is made available for other processes to read. To that end a subscription needs to be passed to the middleware, containing a description of the data items that the subscriber is interested in. Such a description typically consists of some (attribute, value) pairs, possibly combined with (attribute, range) pairs. In the latter case, the specified attribute is expected to take on values within a specified range.
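
The sketch below shows how one published data item could be matched against such a subscription; it is only an illustration, and the matches() helper and the attribute names are made up.

```python
# Illustrative sketch only: matching one published data item (a dict of
# attributes) against a subscription built from exact (attribute, value)
# pairs and (attribute, range) pairs.
def matches(item, exact=None, ranges=None):
    exact, ranges = exact or {}, ranges or {}
    # Every (attribute, value) pair must match exactly...
    if any(item.get(attr) != value for attr, value in exact.items()):
        return False
    # ...and every (attribute, range) pair must fall within its range.
    return all(attr in item and lo <= item[attr] <= hi
               for attr, (lo, hi) in ranges.items())

item = {"symbol": "ACME", "price": 17.2}
print(matches(item, exact={"symbol": "ACME"}, ranges={"price": (10, 20)}))   # True
```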

We are now confronted with a situation in which subscriptions need to be matched against data items, as shown in the figure. When matching succeeds, there are two possible scenarios. In the first case, the middleware may decide to forward the published data to its current set of subscribers, that is, processes with a matching subscription. As an alternative, the middleware can also forward a notification, at which point subscribers can execute a read operation to retrieve the published data item.

In those cases in which data items are immediately forwarded to subscribers, the middleware will generally not offer storage of data. Storage is either explicitly handled by a separate service or is the responsibility of subscribers. In other words, we have a referentially decoupled but temporally coupled system. This situation is different when notifications are sent and subscribers need to explicitly read the published data: then the middleware will have to store data items, and there are additional operations for data management. It is also possible to attach a lease to a data item such that when the lease expires, the data item is automatically deleted.

Traditional Architectures The simplest solution for matching data items against subscriptions is to have a centralized client-server architecture. This is a typical solution currently adopted by many publish/subscribe systems, including IBM's WebSphere.

JINI Jini is a distributed system architecture developed by Sun Microsystems, Inc. Its main goal is "network plug and play". A Jini system is a distributed system based on the idea of joining together groups of users and the resources required by those users. The overall goal is to turn the network into a flexible, easily administered tool with which resources can be found by human and computational clients.

JINI Goals
– Enabling users to share services and resources over a network
– Providing users easy access to resources anywhere on the network while allowing the network location of the user to change
– Simplifying the task of building, maintaining, and altering a network of devices, software, and users

Jini and JavaSpaces Jini is a distributed system that consists of a mixture of different but related elements. It is strongly related to the Java programming language, although many of its principles can be implemented equally well in other languages. An important part of the system is formed by a coordination model for generative communication. Jini provides temporal and referential decoupling of processes through a coordination system called JavaSpaces. A JavaSpace is a shared dataspace that stores tuples representing a typed set of references to Java objects. Multiple JavaSpaces may coexist in a single Jini system.

When a tuple contains two different fields that refer to the same object, the tuple as stored in a JavaSpace implementation will hold two marshaled copies of that object (marshaling means converting an object into a data format suitable for storage or transmission). A tuple is put into a JavaSpace by means of a write operation, which first marshals the tuple before storing it. Each time the write operation is called on a tuple, another marshaled copy of that tuple is stored in the JavaSpace, as shown in the figure. We will refer to each marshaled copy as a tuple instance.

To read a tuple instance, a process provides another tuple that it uses as a template for matching tuple instances as stored in a JavaSpace. Like any other tuple, a template tuple is a typed set of object references. Only tuple instances of the same type as the template can be read from a JavaSpace. A field in the template tuple either contains a reference to an actual object or contains the value NULL.

When a tuple instance is found that matches the template tuple provided as part of a read operation, that tuple instance is unmarshaled and returned to the reading process. There is also a take operation that additionally removes the tuple instance from the JavaSpace. Both operations block the caller until a matching tuple is found. It is possible to specify a maximum blocking time. In addition, there are variants that simply return immediately if no matching tuple existed.
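
The read/take semantics just described can be sketched as follows. This is a conceptual Python sketch, not the real JavaSpaces (Java) API: templates use None for wildcard fields, read leaves the matching instance in place, take removes it, and both wait up to a maximum blocking time.

```python
# Conceptual sketch of JavaSpaces-style read/take with template matching.
import time

class JavaSpaceSketch:
    def __init__(self):
        self.instances = []

    def write(self, *fields):
        self.instances.append(tuple(fields))          # store a new tuple instance

    def _find(self, template):
        for inst in self.instances:
            if len(inst) == len(template) and all(
                    t is None or t == f for t, f in zip(template, inst)):
                return inst
        return None

    def read(self, template, timeout=0.0):
        deadline = time.time() + timeout
        while True:                                   # block until match or timeout
            inst = self._find(template)
            if inst is not None or time.time() >= deadline:
                return inst
            time.sleep(0.01)

    def take(self, template, timeout=0.0):
        inst = self.read(template, timeout)
        if inst is not None:
            self.instances.remove(inst)               # take also removes the instance
        return inst

space = JavaSpaceSketch()
space.write("temperature", "room-12", 21.5)
print(space.take(("temperature", None, None)))        # template with wildcard fields
```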

TIB/Rendezvous An alternative to using central servers is to immediately disseminate published data items to the appropriate subscribers using multicasting. This principle is used in TIB/Rendezvous, of which the basic architecture is shown in the figure. In this approach, a data item is a message tagged with a compound keyword describing its content, such as news.comp.os.books. A subscriber provides (parts of) a keyword, indicating the messages it wants to receive, such as news.comp.*.books. These keywords are said to indicate the subject of a message.

If it is known exactly where a subscriber resides, point-to-point messages will generally be used. Each host on such a network will run a rendezvous daemon, which takes care that messages are sent and delivered according to their subject. Whenever a message is published, it is multicast to each host on the network running a rendezvous daemon. Typically, multicasting is implemented using the facilities offered by the underlying network, such as IP multicasting or hardware broadcasting.

Processes that subscribe to a subject pass their subscription to their local daemon. The daemon constructs a table of (process, subject) entries, and whenever a message on subject S arrives, the daemon simply checks its table for local subscribers and forwards the message to each one. If there are no subscribers for S, the message is discarded immediately.
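
A minimal sketch of this daemon-side lookup is shown below. It is only an illustration (not the real TIB/Rendezvous API): the class and helper names are made up, and '*' is taken to match exactly one subject element.

```python
# Illustrative sketch: a daemon keeps (process, subject) entries and forwards
# each arriving message to the local subscribers whose pattern matches.
def subject_matches(pattern, subject):
    p, s = pattern.split("."), subject.split(".")
    return len(p) == len(s) and all(pe in ("*", se) for pe, se in zip(p, s))

class RendezvousDaemonSketch:
    def __init__(self):
        self.table = []                               # (process, subject pattern)

    def subscribe(self, process, pattern):
        self.table.append((process, pattern))

    def deliver(self, subject, message):
        receivers = [proc for proc, pattern in self.table
                     if subject_matches(pattern, subject)]
        for proc in receivers:                        # none found: message is dropped
            proc(subject, message)

daemon = RendezvousDaemonSketch()
daemon.subscribe(lambda subj, msg: print("got", subj, msg), "news.comp.*.books")
daemon.deliver("news.comp.os.books", "new OS book announced")
```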