
1 Distributed File Systems Andy Wang COP 5611 Advanced Operating Systems

2 Outline Basic concepts NFS Andrew File System Replicated file systems Ficus Coda Serverless file systems

3 Basic Distributed FS Concepts You are here, the file’s there, what do you do about it? Important questions What files can I access? How do I name them? How do I get the data? How do I synchronize with others?

4 What files can be accessed? Several possible choices Every file in the world Every file stored in this kind of system Every file in my local installation Selected volumes Selected individual files

5 What dictates the choice? Why not make every file available? Naming issues Scaling issues Local autonomy Security Network traffic

6 Naming Files in a Distributed System How much transparency? Does every user/machine/sub-network need its own namespace? How do I find a site that stores the file that I name? Is it implicit in the name? Can my naming scheme scale? Must everyone agree on my scheme?

7 How do I get remote files? Fetch it over the network? How much caching? Replication? What security is required for data transport?

8 Synchronization and Consistency Will there be trouble if multiple sites want to update a file? Can I get any guarantee that I always see consistent versions of data? i.e., will I ever see old data after new? How soon do I see new data?

9 NFS Network File System Provides distributed filing by remote access to files with a high degree of transparency Developed by Sun

10 NFS Characteristics Volume-level access RPC-based (uses XDR) Stateless remote file access Location (not name) transparent Implementation for many systems All interoperate, even non-Unix ones Currently based on VFS

11 VFS/Vnode Review VFS—Virtual File System Common interface allowing multiple file system implementations on one system Plugged in below user level Files represented by vnodes

12 NFS Diagram [figure: an NFS client’s tree (/, with /tmp and /mnt containing x and y) and an NFS server’s tree (/, with /bin and /home containing foo and bar); the remote volume is mounted into the client’s namespace]

13 NFS File Handles On clients, files are represented by vnodes The client internally represents remote files as handles Opaque to client But meaningful to server To name remote file, provide handle to server
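
To make the “opaque to client but meaningful to server” point concrete, here is a minimal Python sketch (not the real NFS client code; RemoteVnode, send_rpc, and the GETATTR-style call are illustrative): the client stores the handle in its vnode-like object, never interprets it, and echoes it back to the server on every request.

```python
# Hypothetical sketch: the client keeps the handle as opaque bytes and
# passes it back unchanged; only the server can decode what it refers to.

class RemoteVnode:
    """Client-side stand-in for a vnode that refers to a remote file."""
    def __init__(self, handle: bytes):
        self.handle = handle          # opaque to the client

def getattr_rpc(vnode: RemoteVnode, send_rpc):
    # The handle alone names the file to the server; no pathname or
    # open-file state travels with the request.
    return send_rpc("GETATTR", handle=vnode.handle)
```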

14 NFS Handle Diagram [figure: client side: a user process’s file descriptor maps to a vnode at the VFS level and a handle at the NFS level; server side: the NFS server maps the handle to a vnode at the VFS level and an inode in UFS]

15 How to make this work? Could integrate it into the kernel Non-portable, non-distributable Instead, use existing features to do the work VFS for common interface RPC for data transport

16 Using RPC for NFS Must have some process at server that answers the RPC requests Continuously running daemon process Somehow, must perform mounts over machine boundaries A second daemon process for this

17 NFS Processes nfsd daemons—server daemons that accept RPC calls for NFS rpc.mountd daemons—server daemons that handle mount requests biod daemons—optional client daemons that can improve performance

18 NFS from the Client’s Side User issues a normal file operation Like read() Passes through vnode interface to client-side NFS implementation Client-side NFS implementation formats and sends a single RPC packet to perform the operation Client blocks until the RPC returns

19 NFS RPC Procedures 16 RPC procedures to implement NFS Some for files, some for file systems Including directory ops, link ops, read, write, etc. Lookup() is the key operation Because it fetches handles Other NFS file operations use the handle
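
As a hedged sketch of the lookup-then-operate pattern (invented names, not the actual wire protocol): LOOKUP is issued once per path component, starting from the handle obtained at mount time, and the handle it returns is what every later operation uses to name the file.

```python
# Sketch: walking a path with repeated LOOKUP calls.
# send_rpc and mount_handle are placeholders for the client's RPC machinery.

def resolve(path: str, mount_handle: bytes, send_rpc) -> bytes:
    handle = mount_handle                         # the "primal" handle from mount
    for component in path.strip("/").split("/"):
        handle = send_rpc("LOOKUP", dir_handle=handle, name=component)
    return handle                                 # pass this to READ, WRITE, GETATTR, ...
```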

20 Mount Operations Must mount an NFS file system on the client before you can use it Requires local and remote operations Local ops indicate mount point has an NFS-type VFS at that point in hierarchy Remote operations go to remote rpc.mountd Mount provides “primal” file handle

21 NFS on the Server Side The server side is represented by the local VFS actually storing the data Plus rpc.mountd and nfsd daemons NFS is stateless—servers do not keep track of clients Each NFS operation must be self-contained (from server’s point of view)

22 Implications of Statelessness Self-contained NFS RPC requests NFS operations should be idempotent NFS should use a stateless transport protocol (e.g., UDP) Servers don’t worry about client crashes Server crashes won’t leave junk
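
To illustrate why idempotence matters, here is a small sketch (the request format is invented): because the read carries its own offset, resending it after a lost reply returns the same bytes, so the client can simply retry on timeout.

```python
# Sketch: a self-contained, idempotent read request that is safe to resend.

def read_request(handle: bytes, offset: int, count: int) -> dict:
    return {"proc": "READ", "handle": handle, "offset": offset, "count": count}

def read_with_retry(send, handle, offset, count, attempts=3):
    for _ in range(attempts):
        reply = send(read_request(handle, offset, count))   # repeating this is harmless
        if reply is not None:
            return reply
    raise TimeoutError("server not responding")
```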

23 More Implications of Statelessness Servers don’t know what files clients think are open Unlike in UFS, LFS, most local VFS file systems Makes it much harder to provide certain semantics Scales nicely, though

24 Preserving UNIX File Operation Semantics NFS works hard to provide identical semantics to local UFS operations Some of this is tricky Especially given statelessness of server E.g., how do you avoid discarding pages of unlinked file a client has open?

25 Sleazy NFS Tricks Used to provide desired semantics despite statelessness of the server E.g., if client unlinks open file, send rename to server rather than remove Perform actual remove when file is closed Won’t work if file removed on server Won’t work with cooperating clients

26 File Handles Method clients use to identify files Created by the server on file lookup Must be unique mappings of server file identifier to universal identifier File handles become invalid when server frees or reuses inode Inode generation number in handle shows when stale
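
A hedged sketch of the idea on this slide (the field layout and error text are invented, not actual NFS server code): the server packs a file system id, inode number, and generation number into the handle, and a handle whose generation no longer matches the inode is rejected as stale.

```python
# Sketch: server-side handle construction and staleness check.
import struct

def make_handle(fsid: int, inode: int, generation: int) -> bytes:
    return struct.pack("!III", fsid, inode, generation)

def lookup_by_handle(handle: bytes, inode_table: dict) -> dict:
    fsid, inode, generation = struct.unpack("!III", handle)
    current = inode_table.get((fsid, inode))
    if current is None or current["generation"] != generation:
        raise OSError("stale file handle")   # the inode was freed and reused
    return current

inodes = {(1, 42): {"generation": 7, "size": 1024}}
h = make_handle(1, 42, 7)
print(lookup_by_handle(h, inodes)["size"])   # 1024; a handle with generation 6 would fail
```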

27 NFS Daemon Processes nfsd daemon biod daemon rpc.mountd daemon rpc.lockd daemon rpc.statd daemon

28 nfsd Daemon Handle incoming RPC requests Often multiple nfsd daemons per site An nfsd daemon makes kernel calls to do the real work Allows multiple threads

29 biod Daemon Does readahead for clients To make use of kernel file buffer cache Only improves performance—NFS works correctly without biod daemon Also flushes buffered writes for clients

30 rpc.mountd Daemon Runs on server to handle VFS-level operations for NFS Particularly remote mount requests Provides initial file handle for a remote volume Also checks that incoming requests come from privileged ports (the source port in the UDP/IP header)

31 rpc.lockd Daemon NFS server is stateless, so it does not handle file locking rpc.lockd provides locking Runs on both client and server Client side catches request, forwards to server daemon rpc.lockd handles lock recovery when server crashes

32 rpc.statd Daemon Also runs on both client and server Used to check status of a machine Server’s rpc.lockd asks rpc.statd to store permanent lock information (in file system) And to monitor status of locking machine If client crashes, clear its locks from server

33 Recovering Locks After a Crash If server crashes and recovers, its rpc.lockd contacts clients to reestablish locks If client crashes, rpc.statd contacts client when it becomes available again Client has short grace period to revalidate locks Then they’re cleared

34 Caching in NFS What can you cache at NFS clients? How do you handle invalid client caches?

35 What can you cache? Data blocks read ahead by biod daemon Cached in normal file system cache area

36 What can you cache, con’t? File attributes Specially cached by NFS Directory attributes handled a little differently than file attributes Especially important because many programs get and set attributes frequently

37 Security in NFS NFS inherits RPC mechanism security Some RPC mechanisms provide decent security Some don’t Mount security provided via knowing which ports are permitted to mount what

38 The Andrew File System A different approach to remote file access Meant to service a large organization Such as a university campus Scaling is a major goal

39 Basic Andrew Model Files are stored permanently at file server machines Users work from workstation machines With their own private namespace Andrew provides mechanisms to cache user’s files from shared namespace

40 User Model of AFS Use Sit down at any AFS workstation anywhere Log in and authenticate who I am Access all files without regard to which workstation I’m using

41 The Local Namespace Each workstation stores a few files Mostly system programs and configuration files Workstations are treated as generic, interchangeable entities

42 Virtue and Vice Vice is the system run by the file servers Distributed system Virtue is the protocol client workstations use to communicate to Vice

43 Overall Architecture System is viewed as a WAN composed of LANs Each LAN has a Vice cluster server Which stores local files But Vice makes all files available to all clients

44 Andrew Architecture Diagram [figure: several LANs, each with its own Vice cluster server, connected by a WAN]

45 Caching the User Files Goal is to offload work from servers to clients When must servers do work? To answer requests To move data Whole files cached at clients

46 Why Whole-file Caching? Minimizes communications with server Most files used in entirety, anyway Easier cache management problem Requires substantial free disk space on workstations Doesn’t address huge file problems

47 The Shared Namespace An Andrew installation has a globally shared namespace All clients see the files in the namespace under the same names High degree of name and location transparency

48 How do servers provide the namespace? Files are organized into volumes Volumes are grafted together into overall namespace Each file has globally unique ID Volumes are stored at individual servers But a volume can be moved from server to server

49 Finding a File At high level, files have names Directory translates name to unique ID If client knows where the volume is, it simply sends unique ID to appropriate server

50 Finding a Volume What if you enter a new volume? How do you find which server stores the volume? Volume-location database stored on each server Once information on volume is known, client caches it
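
A small sketch of the lookup path the last two slides describe (all names are invented): a file’s global ID names its volume, the volume-location database, replicated at every server, maps the volume to the server storing it, and the client caches that mapping once learned.

```python
# Hypothetical sketch of AFS-style volume location with client-side caching.

volume_location_db = {            # replicated at each server in this sketch
    "vol.home.alice": "server3.cs.example.edu",
    "vol.proj.os":    "server7.cs.example.edu",
}

location_cache = {}               # this client's cache of volume -> server

def server_for(file_id: tuple) -> str:
    volume, vnode, uniquifier = file_id          # the global file ID names its volume
    if volume not in location_cache:             # ask a server only on a cache miss
        location_cache[volume] = volume_location_db[volume]
    return location_cache[volume]

print(server_for(("vol.home.alice", 42, 7)))     # server3.cs.example.edu
```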

51 Moving a Volume When a volume moves from server to server, update database Heavyweight distributed operation What about clients with cached information? Old server maintains forwarding info Also eases server update

52 Handling Cached Files Files fetched transparently when needed File system traps opens Sends them to local Venus process

53 The Venus Daemon Responsible for handling a single client’s cache Caches files on open Writes modified versions back on close Cached files saved locally after close Caches directory entry translations, too

54 Consistency for AFS If my workstation has a locally cached copy of a file, what if someone else changes it? Callbacks used to invalidate my copy Requires servers to keep info on who caches files

55 Write Consistency in AFS What if I write to my cached copy of a file? Need to get write permission from server Which invalidates other copies Permission obtained on open for write Need to obtain new data at this point

56 Write Consistency in AFS, Con’t Initially, written only to local copy On close, Venus sends update to server Extra mechanism to handle failures

57 Storage of Andrew Files Stored in UNIX file systems Client cache is a directory on local machine Low-level names do not match Andrew names

58 Venus Cache Management Venus keeps two caches Status Data Status cache kept in virtual memory For fast attribute lookup Data cache kept on disk

59 Venus Process Architecture Venus is single user process But multithreaded Uses RPC to talk to server RPC is built on low level datagram service

60 AFS Security Only server/Vice are trusted here Client machines might be corrupted No client programs run on Vice machines Clients must authenticate themselves to servers Encrypted transmissions

61 AFS File Protection AFS supports access control lists Each file has list of users who can access it And permitted modes of access Maintained by Vice Used to mimic UNIX access control

62 AFS Read-only Replication For volumes containing files that are used frequently, but not changed often E.g., executables AFS allows multiple servers to store read-only copies

63 Distributed FS, Continued Andy Wang COP 5611 Advanced Operating Systems

64 Outline Replicated file systems Ficus Coda Serverless file systems

65 Replicated File Systems NFS provides remote access AFS provides high quality caching Why isn’t this enough? More precisely, when isn’t this enough?

66 When Do You Need Replication? For write performance For reliability For availability For mobile computing For load sharing Optimistic replication increases these advantages

67 Some Replicated File Systems Locus Ficus Coda Rumor All optimistic: few conservative file replication systems have been built

68 Ficus Optimistic file replication based on peer-to-peer model Built in Unix context Meant to service large network of workstations Built using stackable layers

69 Peer-to-peer Replication All replicas are equal No replicas are masters, or servers All replicas can provide any service All replicas can propagate updates to all other replicas Client/server is the other popular model

70 Basic Ficus Architecture Ficus replicates at volume granularity Given volume can be replicated many times Performance limitations on scale Updates propagated as they occur On a single best-effort basis Consistency achieved by periodic reconciliation

71 Stackable Layers in Ficus Ficus is built out of stackable layers Exact composition depends on what generation of system you look at

72 Ficus Stackable Layers Diagram [figure: a Select layer and the Ficus Logical File System (FLFS) stacked above Ficus Physical File System (FPFS) and Storage layers; a remote replica’s FPFS and Storage layers are reached through a Transport layer]

73 Ficus Diagram [figure: three sites, A, B, and C, each storing one replica (1, 2, 3) of the same volume]

74 An Update Occurs [figure: the same three sites; an update is applied to replica 1 at Site A]

75 Reconciliation in Ficus Reconciliation process runs periodically on each Ficus site For each local volume replica Reconciliation strategy implies eventual consistency guarantee Frequency of reconciliation affects how long “eventually” takes

76 Steps in Reconciliation 1. Get information about the state of a remote replica 2. Get information about the state of the local replica 3. Compare the two sets of information 4. Change local replica to reflect remote changes
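
The four steps map naturally onto a pull-style loop. The sketch below is only meant to make the flow concrete; it tracks each file with a simple counter, whereas Ficus actually compares version vectors (discussed on the following slides), and all data structures here are invented.

```python
# Sketch of one reconciliation pass: fetch the remote state, compare with the
# local state, and adopt any changes the remote replica has seen but we have not.
# Each replica is a dict: filename -> (version_counter, contents).

def reconcile(local: dict, remote: dict) -> None:
    remote_state = dict(remote)                        # step 1: remote replica's state
    local_state = dict(local)                          # step 2: local replica's state
    for name, (r_ver, r_data) in remote_state.items(): # step 3: compare the two
        l_ver, _ = local_state.get(name, (0, None))
        if r_ver > l_ver:                              # step 4: reflect remote changes
            local[name] = (r_ver, r_data)

a = {"notes.txt": (3, "v3")}
b = {"notes.txt": (5, "v5"), "new.txt": (1, "hello")}
reconcile(a, b)
print(a)   # {'notes.txt': (5, 'v5'), 'new.txt': (1, 'hello')}
```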

77 Ficus Reconciliation Diagram [figure: Site C reconciles with Site A, so replica 3 picks up the update]

78 Ficus Reconciliation Diagram Con’t [figure: Site B reconciles with Site C and receives the update without contacting Site A]

79 Gossiping and Reconciliation Reconciliation benefits from the use of gossip In example just shown, an update originating at A got to B through communications between B and C So B can get the update without talking to A directly

80 Benefits of Gossiping Potentially less communications Shares load of sending updates Easier recovery behavior Handles disconnections nicely Handles mobile computing nicely Peer model systems get more benefit than client/server model systems

81 Reconciliation Topology Reconciliation in Ficus is pair-wise In the general case, which pairs of replicas should reconcile? Reconciling all pairs is unnecessary Due to gossip Want to minimize number of recons But propagate data quickly

82 Ring Reconciliation Topology

83 Adaptive Ring Topology

84 Problems in File Reconciliation Recognizing updates Recognizing update conflicts Handling conflicts Recognizing name conflicts Update/remove conflicts Garbage collection Ficus has solutions for all these problems

85 Recognizing Updates in Ficus Ficus keeps per-file version vectors Updates detected by version vector comparisons The data for the later version can then be propagated Ficus propagates full files

86 Recognizing Update Conflicts Concurrent updates can lead to update conflicts Version vectors permit detection of update conflicts Works for n-way conflicts, too
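
A minimal version-vector comparison, as a sketch (the dict-of-counters representation is illustrative): one vector dominates another if it is at least as large in every component; if neither dominates, the file was updated concurrently at different replicas and a conflict is flagged.

```python
# Sketch: detecting update conflicts with per-file version vectors.

def dominates(a: dict, b: dict) -> bool:
    """True if vector a has seen every update that b has seen."""
    return all(a.get(site, 0) >= count for site, count in b.items())

def compare(a: dict, b: dict) -> str:
    if dominates(a, b) and dominates(b, a):
        return "identical"
    if dominates(a, b):
        return "a is newer"        # propagate a's data
    if dominates(b, a):
        return "b is newer"
    return "conflict"              # concurrent updates: hand off to a resolver

print(compare({"A": 2, "B": 1}, {"A": 1, "B": 1}))   # a is newer
print(compare({"A": 2, "B": 1}, {"A": 1, "B": 2}))   # conflict
```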

87 Handling Update Conflicts Ficus uses resolver programs to handle conflicts Resolvers work on one pair of replicas of one file System attempts to deduce file type and call proper resolver If all resolvers fail, notify user Ficus also blocks access to file

88 Handling Directory Conflicts Directory updates have very limited semantics So directory conflicts are easier to deal with Ficus uses in-kernel mechanisms to automatically fix most directory conflicts

89 Directory Conflict Diagram [figure: Replica 1 of the directory contains Earth, Mars, Saturn; Replica 2 contains Earth, Mars, Sedna]

90 How Did This Directory Get Into This State? If we could figure out what operations were performed on each side that caused each replica to enter this state, we could produce a merged version But there are several possibilities

91 Possibility 1 1. Earth and Mars exist 2. Create Saturn at replica 1 3. Create Sedna at replica 2 Correct result is directory containing Earth, Mars, Saturn, and Sedna

92 The Create/delete Ambiguity This is an example of a general problem with replicated data Cannot be solved with per-file version vectors Requires per-entry information Ficus keeps such information Must save removed files’ entries for a while

93 Possibility 2 1. Earth, Mars, and Saturn exist 2. Delete Saturn at replica 2 3. Create Sedna at replica 2 Correct result is directory containing Earth, Mars, and Sedna And there are other possibilities

94 Recognizing Name Conflicts Name conflicts occur when two different files are concurrently given same name Ficus recognizes them with its per-entry directory info Then what? Handle similarly to update conflicts Add disambiguating suffixes to names
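
As a toy illustration of the suffix idea (the naming scheme and merge routine are invented, not Ficus code): when a directory merge finds a different file already holding the name, the incoming entry is kept under a suffixed name that records where it came from.

```python
# Sketch: resolving a name conflict by adding a disambiguating suffix.

def merge_entry(directory: dict, name: str, file_id: str, origin: str) -> None:
    existing = directory.get(name)
    if existing is None or existing == file_id:
        directory[name] = file_id                      # same file or no conflict
    else:
        # Two distinct files claim one name: keep both, suffix the incoming one.
        directory[f"{name}.from.{origin}"] = file_id

d = {"report": "file-17"}
merge_entry(d, "report", "file-42", origin="replicaB")
print(d)   # {'report': 'file-17', 'report.from.replicaB': 'file-42'}
```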

95 Internal Representation of Problem Directory [figure: the per-entry view kept by Ficus; Replica 1 lists Earth, Mars, Saturn; Replica 2 lists Earth, Mars, Saturn, Sedna]

96 Update/remove Conflicts Consider case where file “Saturn” has two replicas 1. Replica 1 receives an update 2. Replica 2 is removed What should happen? A matter of systems semantics, basically

97 Ficus’ No-lost-updates Semantics Ficus handles this problem by defining its semantics to be no-lost-updates In other words, the update must not disappear But the remove must happen Put “Saturn” in the orphanage Requires temporarily saving removed files

98 Removals and Hard Links Unix and Ficus support hard links Effectively, multiple names for a file Cannot remove a file’s bits until the last hard link to the file is removed Tricky in a distributed system

99 Link Example [figure: Replica 1 and Replica 2 each hold a directory foodir containing files red and blue]

100 Link Example, Part II [figure: the same two replicas; blue is updated at Replica 1]

101 Link Example, Part III [figure: at Replica 2, a directory bardir is created with a hard link to blue, and foodir/blue is then deleted]

102 What Should Happen Here? Clearly, the link named foodir/blue should disappear But what version of the data should the bardir link point to? No-lost-update semantics say it must be the update at replica 1

103 Garbage Collection in Ficus Ficus cannot throw away removed things at once Directory entries Updated files for no-lost-updates Non-updated files due to hard links When can Ficus reclaim the space these use?

104 When Can I Throw Away My Data Not until all links to the file disappear Global information, not local Moreover, just because I know all links have disappeared doesn’t mean I can throw everything away Must wait till everyone knows Requires two trips around the ring

105 Why Can’t I Forget When I Know There Are No Links I can throw the data away I don’t need it, nobody else does either But I can’t forget that I knew this Because not everyone knows it For them to throw their data away, they must learn So I must remember for their benefit

106 Coda A different approach to optimistic replication Inherits a lot from Andrew Basically, a client/server solution Developed at CMU

107 Coda Replication Model Files stored permanently at server machines Client workstations download temporary replicas, not cached copies Can perform updates without getting token from the server So concurrent updates possible

108 Detecting Concurrent Updates Workstation replicas only reconcile with their server At recon time, they compare their state of files with server’s state Detecting any problems Since workstations don’t gossip, detection is easier than in Ficus

109 Handling Concurrent Updates Basic strategy is similar to Ficus’ Resolver programs are called to deal with conflicts Coda allows resolvers to deal with multiple related conflicts at once Also has some other refinements to conflict resolution

110 Server Replication in Coda Unlike Andrew, writable copies of a file can be stored at multiple servers Servers have peer-to-peer replication Servers have strong connectivity, crash infrequently Thus, Coda uses simpler peer-to-peer algorithms than Ficus must

111 Why Is Coda Better Than AFS? Writes don’t lock the file Writes happen quicker More local autonomy Less write traffic on the network Workstations can be disconnected Better load sharing among servers

112 Comparing Coda to Ficus Coda uses simpler algorithms Less likely to be bugs Less likely to be performance problems Coda doesn’t allow client gossiping Coda has built-in security Coda garbage collection simpler

113 Serverless Network File Systems New network technologies are much faster, with much higher bandwidth In some cases, going over the net is quicker than going to local disk How can we improve file systems by taking advantage of this change?

114 Fundamental Ideas of xFS Peer workstations providing file service for each other High degree of location independence Make use of all machines’ caches Provide reliability in case of failures

115 xFS Developed at Berkeley Inherits ideas from several sources LFS Zebra (RAID-like ideas) Multiprocessor cache consistency Built for Network of Workstations (NOW) environment

116 What Does a File Server Do? Stores file data blocks on its disks Maintains file location information Maintains cache of data blocks Manages cache consistency for its clients

117 xFS Must Provide These Services In essence, every machine takes on some of the server’s responsibilities Any data or metadata might be located at any machine Key challenge is providing same services centralized server provided in a distributed system

118 Key xFS Concepts Metadata manager Stripe groups for data storage Cooperative caching Distributed cleaning processes

119 How Do I Locate a File in xFS? I’ve got a file name, but where is it? Assuming it’s not locally cached The file’s directory converts the name to a unique index number Consult the metadata manager to find out where the file with that index number is stored, via the manager map

120 The Manager Map Data structure that allows translation of index numbers to file managers Not necessarily file locations Kept by each metadata manager Globally replicated data structure Simply says what machine manages the file

121 Using the Manager Map Look up index number in local map Index numbers are clustered, so many fewer entries than files Send request to responsible manager
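
A sketch of the two-step lookup described above (the data layout and node names are invented): the globally replicated manager map takes a cluster of index numbers to a manager machine, and the client then sends its request to that manager.

```python
# Sketch: xFS-style manager map lookup.
# Index numbers are clustered into ranges, so the map is far smaller than the file count.
import bisect

manager_map = [(0, "node1"), (10_000, "node2"), (20_000, "node3")]   # globally replicated
starts = [start for start, _ in manager_map]

def manager_for(index_number: int) -> str:
    slot = bisect.bisect_right(starts, index_number) - 1
    return manager_map[slot][1]

print(manager_for(14_217))   # node2: send this file's requests to node2
```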

122 What Does the Manager Do? Manager keeps two types of information 1. imap information 2. caching information If some other site has the file in its cache, tell requester to go to that site Always use cache before disk Even if cache is remote

123 What if No One Caches the Block? Metadata manager for this file then must consult its imap Imap tells which disks store the data block Files are striped across disks stored on multiple machines Typically single block is on one disk

124 Writing Data xFS uses RAID-like methods to store data RAID sucks for small writes So xFS avoids small writes By using LFS-style operations Batch writes until you have a full stripe’s worth

125 Stripe Groups Set of disks that cooperatively store data in RAID fashion xFS uses single parity disk Alternative to striping all data across all disks
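
To make the batching-plus-parity idea concrete, here is a hedged sketch (block size, group width, and the write_stripe callback are invented): small writes accumulate in a log buffer, and only when a full stripe’s worth exists is it split into fragments and written with a single XOR parity fragment.

```python
# Sketch: batch small writes into full stripes, then write data plus parity.
from functools import reduce

BLOCK = 4096
DATA_DISKS = 3                      # stripe group: 3 data fragments + 1 parity fragment
buffer = bytearray()                # LFS-style log of writes not yet flushed

def append_write(data: bytes, write_stripe) -> None:
    buffer.extend(data)
    while len(buffer) >= BLOCK * DATA_DISKS:              # a full stripe is ready
        stripe = bytes(buffer[:BLOCK * DATA_DISKS])
        del buffer[:BLOCK * DATA_DISKS]
        fragments = [stripe[i * BLOCK:(i + 1) * BLOCK] for i in range(DATA_DISKS)]
        parity = bytes(reduce(lambda x, y: x ^ y, group)  # XOR across the fragments
                       for group in zip(*fragments))
        write_stripe(fragments, parity)                   # only large, sequential writes

written = []
append_write(b"x" * (BLOCK * DATA_DISKS),
             lambda frags, par: written.append((frags, par)))
print(len(written), len(written[0][1]))                   # 1 stripe written, 4096-byte parity
```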

126 Cooperative Caching Each site’s cache can service requests from all other sites Working from assumption that network access is quicker than disk access Metadata managers used to keep track of where data is cached So remote cache access takes 3 network hops

127 Getting a Block from a Remote Cache [figure: the client consults its manager map and sends the block request (1) to the metadata server; the server checks its cache-consistency state and forwards the request (2) to a site caching the block; that site’s Unix cache returns the block (3) to the client]

128 Providing Cache Consistency Per-block token consistency To write a block, client requests token from metadata server Metadata server retrieves token from whoever has it And invalidates other caches Writing site keeps token
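
A toy sketch of the per-block token protocol on this slide (class and method names are invented): the manager invalidates every other cached copy, records the writer as the token holder, and the writer keeps the token until it is recalled.

```python
# Sketch: a metadata manager granting per-block write tokens.

class BlockManager:
    def __init__(self):
        self.cachers = {}        # block_id -> set of sites caching the block
        self.token_holder = {}   # block_id -> site currently allowed to write

    def request_write_token(self, block_id: str, site: str) -> str:
        for other in self.cachers.get(block_id, set()) - {site}:
            self.invalidate(other, block_id)          # purge every other cached copy
        self.cachers[block_id] = {site}
        self.token_holder[block_id] = site            # writer keeps the token
        return "token granted"

    def invalidate(self, site: str, block_id: str) -> None:
        print(f"invalidate block {block_id} at {site}")

mgr = BlockManager()
mgr.cachers["b7"] = {"nodeA", "nodeB"}
print(mgr.request_write_token("b7", "nodeC"))   # invalidates nodeA and nodeB first
```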

129 Which Sites Should Manage Which Files? Could randomly assign equal number of file index groups to each site Better if the site using a file also manages it In particular, if most frequent writer manages it Can reduce network traffic by ~50%

130 Cleaning Up File data (and metadata) is stored in log structures spread across machines A distributed cleaning method is required Each machine stores info on its usage of stripe groups Each cleans up its own mess

131 Basic Performance Results Early results from incomplete system Can provide up to 10 times the bandwidth of file data as single NFS server Even better on creating small files Doesn’t compare xFS to multimachine servers

