1
EE324 INTRO TO DISTRIBUTED SYSTEMS L-20 More DFS
2
Andrew File System Let’s start with a familiar example: Andrew. 10,000s of machines, 10,000s of people, terabytes of disk. Goal: Have a consistent namespace for files across computers. Allow any authorized user to access their files from any computer.
3
AFS summary Client-side caching is a fundamental technique to improve scalability and performance But raises important questions of cache consistency Timeouts and callbacks are common methods for providing (some forms of) consistency. AFS picked close-to-open consistency as a good balance of usability (the model seems intuitive to users), performance, etc. AFS authors argued that apps with highly concurrent, shared access, like databases, needed a different model
4
Today's Lecture 4 Other types of DFS Coda – disconnected operation Programming assignment 4
5
Background 5 We are back in the 1990s. The network is slow and unstable. Terminals give way to a “powerful” client: 33MHz CPU, 16MB RAM, 100MB hard drive. Mobile users appear (1st IBM ThinkPad in 1992). We can do work at the client without the network.
6
CODA 6 Successor of the very successful Andrew File System (AFS) AFS: first DFS aimed at a campus-sized user community Key ideas include open-to-close consistency and callbacks
7
Hardware Model 7 CODA and AFS assume that client workstations are personal computers controlled by their user/owner Fully autonomous Cannot be trusted CODA allows owners of laptops to operate them in disconnected mode Opposite of ubiquitous connectivity
8
Accessibility 8 Must handle two types of failures Server failures: Data servers are replicated Communication failures and voluntary disconnections Coda uses optimistic replication and file hoarding
9
Design Rationale 9 Scalability Callback cache coherence (inherited from AFS) Whole file caching Portable workstations User’s assistance in cache management
10
Design Rationale – Replica Control 10 Pessimistic: Disable all partitioned writes - Requires a client to acquire control of a cached object prior to disconnection Optimistic: Assumes no one else is touching the file - more sophisticated: needs conflict detection + fact: low write-sharing in Unix + high availability: access anything in range without a lock
11
Pessimistic Replica Control 11 Would require client to acquire exclusive (RW) or shared (R) control of cached objects before accessing them in disconnected mode: Acceptable solution for voluntary disconnections Does not work for involuntary disconnections What if the laptop remains disconnected for a long time?
12
Leases 12 We could grant exclusive/shared control of the cached objects for a limited amount of time Works very well in connected mode Reduces server workload Server can keep leases in volatile storage as long as their duration is shorter than boot time Would only work for very short disconnection periods
13
Optimistic Replica Control (I) 13 Optimistic replica control allows access in every disconnected mode Tolerates temporary inconsistencies Promises to detect them later Provides much higher data availability
14
Optimistic Replica Control (II) 14 Defines an accessible universe: set of replicas that the user can access Accessible universe varies over time At any time, user Will read from the latest replica(s) in his accessible universe Will update all replicas in his accessible universe
15
Coda (Venus) States 15 1. Hoarding: Normal operation mode 2. Emulating: Disconnected operation mode 3. Reintegrating: Propagates changes and detects inconsistencies (State-transition diagram: Hoarding ↔ Emulating ↔ Reintegrating)
16
Hoarding 16 Hoard useful data for disconnection Balance the needs of connected and disconnected operation. Cache size is restricted Unpredictable disconnections Prioritized algorithm – cache management hoard walking – reevaluate objects
17
Prioritized algorithm 17 User-defined hoard priority p: how interesting/important is the object to the user? Recent usage q Object priority = f(p,q) Kick out the object with the lowest priority + Fully tunable: everything can be customized - Not tunable (?) - No idea how to customize
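To make the priority scheme concrete, here is a minimal C++ sketch of the eviction step of a hoard walk, assuming f(p,q) is a simple weighted sum; the actual Venus weighting function and cache bookkeeping are not specified on this slide.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Illustrative sketch: object priority as a weighted combination of the
// user-defined hoard priority p and the recent-usage score q. The real
// Venus weighting function is not given here; alpha is an assumption.
struct CacheObject {
    std::string path;
    double p;   // user-defined hoard priority
    double q;   // recent-usage score
};

double priority(const CacheObject& o, double alpha = 0.5) {
    return alpha * o.p + (1.0 - alpha) * o.q;
}

// Hoard walk: while the (restricted) cache is over its limit, evict the
// object with the lowest combined priority.
void evict_if_needed(std::vector<CacheObject>& cache, size_t limit) {
    while (cache.size() > limit) {
        auto victim = std::min_element(
            cache.begin(), cache.end(),
            [](const CacheObject& a, const CacheObject& b) {
                return priority(a) < priority(b);
            });
        cache.erase(victim);
    }
}
```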
18
Emulation 18 In emulation mode: Attempts to access files that are not in the client’s cache appear as failures to the application All changes are written to a persistent log, the client modification log (CML)
19
Persistence 19 Venus keeps its cache and related data structures in non-volatile storage
20
Reintegration 20 When the workstation gets reconnected, Coda initiates a reintegration process Performed one volume at a time Venus ships the replay log to all volumes Each volume performs a log replay algorithm Only write/write conflicts are detected Succeed? Yes: free the logs, reset priorities No: save the logs to a tar file and ask the user for help
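A minimal sketch of the CML/replay idea, assuming a simplified log record with a per-object base version; Venus’s real CML records (store, mkdir, rename, …) and its replay algorithm are considerably richer.

```cpp
#include <map>
#include <string>
#include <vector>

// Simplified client modification log (CML) entry. Field names are
// illustrative, not Venus's actual record format.
struct CmlRecord {
    std::string path;          // object the disconnected operation touched
    int         base_version;  // server version the client saw before updating
    std::string new_data;      // data to replay
};

struct ServerObject {
    int         version = 0;
    std::string data;
};

// Replay one volume's log. An operation conflicts if someone else updated
// the object since the client last saw it (only write/write conflicts are
// detected). On success the log is freed; on conflict the log would be
// saved aside for manual repair.
bool replay(std::vector<CmlRecord>& log,
            std::map<std::string, ServerObject>& volume) {
    for (const auto& rec : log) {
        ServerObject& obj = volume[rec.path];
        if (obj.version != rec.base_version)
            return false;                 // write/write conflict: ask for help
        obj.data = rec.new_data;
        obj.version++;
    }
    log.clear();                          // success: free the log
    return true;
}
```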
21
Performance 21 Duration of reintegration: a few hours of disconnection reintegrate in about 1 min, but sometimes much longer Cache size: 100MB at the client is enough for a “typical” workday Conflicts: no conflicts at all! Why? Over 99% of modifications are by the same person Two users modifying the same object within a day: <0.75%
22
Coda Summary 22 Puts scalability and availability before data consistency Unlike NFS Assumes that inconsistent updates are very infrequent Introduced disconnected operation mode and file hoarding
23
Today's Lecture 23 Other types of DFS Coda – disconnected operation Programming assignment 4
24
Filesystems Last time: Looked at how we could use RPC to split filesystem functionality between client and server But pretty much, we didn’t change the design We just moved the entire filesystem to the server and then added some caching on the client in various ways
25
You can go farther... But it requires ripping apart the filesystem functionality into modules and placing those modules at different computers on the network So now we need to ask... what does a filesystem do, anyway?
26
Well, there’s a disk... disks store bits. in fixed-length pieces called sectors or blocks but a filesystem has... files. and often directories. and maybe permissions. creation and modification time. and other stuff about the files. (“metadata”)
27
Filesystem functionality Directory management (maps entries in a hierarchy of names to files-on-disk) File management (adding, reading, changing, appending, deleting individual files) Space management: where on disk to store these things? Metadata management
28
Conventional filesystem Wraps all of these up together Useful concepts: [pictures] “Superblock” -- well-known location on disk where top-level filesystem info is stored (pointers to more structures, etc.) “Free list” or “Free space bitmap” -- data structures to remember what’s used on disk and what’s not. Why? Fast allocation of space for new files. “inode” - short for index node - stores all metadata about a file, plus information pointing to where the file is stored on disk Small files may be referenced entirely from the inode; larger files may have some indirection to blocks that list locations on disk Directory entries point to inodes “extent” - a way of remembering where on disk a file is stored. Instead of listing all blocks, list a starting block and a range. More compact representation, but requires large contiguous block allocation.
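As a rough illustration of these structures, here are hypothetical on-disk layouts in C++; real filesystems (FFS, ext2/3/4, …) differ in their exact fields and sizes.

```cpp
#include <cstdint>

// Hypothetical on-disk layouts illustrating the concepts above.
struct Superblock {
    uint32_t magic;            // identifies the filesystem type
    uint64_t block_count;      // total blocks on the device
    uint64_t free_bitmap_blk;  // where the free-space bitmap starts
    uint64_t inode_table_blk;  // where the inode table starts
    uint64_t root_inode;       // inode number of "/"
};

struct Inode {
    uint32_t mode;             // permissions + file type
    uint32_t uid, gid;         // ownership
    uint64_t size;             // length in bytes
    uint64_t mtime, ctime;     // modification / change times
    uint64_t direct[12];       // small files: block numbers stored inline
    uint64_t indirect;         // larger files: a block full of block numbers
};

// Extent-based alternative: record (start, length) runs instead of every block.
struct Extent {
    uint64_t start_block;
    uint32_t length;           // number of contiguous blocks
};

// A directory entry simply maps a name to an inode number.
struct DirEntry {
    char     name[255];
    uint64_t inode;
};
```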
29
Filesystem “VFS” ops VFS (‘virtual filesystem’): common abstraction layer inside kernels for building filesystems -- interface is common across FS implementations Think of this as an abstract data type for filesystems: it has both syntax (function names, return values, etc) and semantics (“don’t block on this call”, etc.) One key thing to note: The VFS itself may do some caching and other management... in particular: often maintains an inode cache
30
FUSE The lab will use FUSE FUSE is a way to implement filesystems in user space (as normal programs), but have them available through the kernel -- like normal files It has a kinda VFS-like interface
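For a feel of what a FUSE filesystem looks like, here is a minimal read-only example sketched against the FUSE 2.x C API and compiled as C++ (signatures differ in libfuse 3). It exposes a single file, /hello; the lab’s own skeleton code is what you should actually follow, this is only illustrative.

```cpp
#define FUSE_USE_VERSION 26
#include <fuse.h>
#include <sys/stat.h>
#include <cerrno>
#include <cstring>

static const char *hello_path = "/hello";
static const char *hello_body = "hello from user space\n";

// Metadata: report a root directory and one read-only regular file.
static int do_getattr(const char *path, struct stat *st) {
    memset(st, 0, sizeof(*st));
    if (strcmp(path, "/") == 0) {
        st->st_mode = S_IFDIR | 0755; st->st_nlink = 2;
    } else if (strcmp(path, hello_path) == 0) {
        st->st_mode = S_IFREG | 0444; st->st_nlink = 1;
        st->st_size = (off_t)strlen(hello_body);
    } else {
        return -ENOENT;
    }
    return 0;
}

// Directory listing for "/".
static int do_readdir(const char *path, void *buf, fuse_fill_dir_t filler,
                      off_t, struct fuse_file_info *) {
    if (strcmp(path, "/") != 0) return -ENOENT;
    filler(buf, ".", NULL, 0);
    filler(buf, "..", NULL, 0);
    filler(buf, hello_path + 1, NULL, 0);
    return 0;
}

// Serve reads out of the in-memory string.
static int do_read(const char *path, char *buf, size_t size, off_t off,
                   struct fuse_file_info *) {
    if (strcmp(path, hello_path) != 0) return -ENOENT;
    size_t len = strlen(hello_body);
    if ((size_t)off >= len) return 0;
    if (off + size > len) size = len - off;
    memcpy(buf, hello_body + off, size);
    return (int)size;
}

int main(int argc, char *argv[]) {
    struct fuse_operations ops = {};   // leave unimplemented callbacks NULL
    ops.getattr = do_getattr;
    ops.readdir = do_readdir;
    ops.read    = do_read;
    return fuse_main(argc, argv, &ops, NULL);
}
```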
31
Figure from FUSE documentation
32
Directory operations readdir(path) - return directory entries for each file in the directory mkdir(path) -- create a new directory rmdir(path) -- remove the named directory
33
File operations mknod(path, mode, dev) -- create a new “node” (generic: a file is one type of node; a device node is another) unlink(path) -- remove link to inode, decrementing inode’s reference count many filesystems permit “hard links” -- multiple directory entries pointing to the same file rename(path, newpath) open -- open a file, returning a file handle; read, write truncate -- cut off at particular length flush -- close one handle to an open file release -- completely close file handle
34
Metadata ops getattr(path) -- return metadata struct chmod / chown (ownership & perms)
35
Back to goals of DFS Users should have same view of system, be able to share files Last time: Central fileserver handles all filesystem operations -- consistency was easy, but overhead high, scalability poor Moved to NFS and then AFS: Added more and more caching at client; added cache consistency problems Solved using timeouts or callbacks to expire cached contents
36
Protocol & consistency Remember last time: NFS defined operations to occur on unique inode #s instead of names... why? idempotency. Wanted operations to be unique. Related example for today when we’re considering splitting up components: moving a file from one directory to another What if this is a complex operation (“remove from one”, “add to another”), etc. Can another user see intermediate state?? (e.g., file in both directories or file in neither?) Last time: Saw issue of when things become consistent Presented idea of close-to-open consistency as a compromise
37
Scaling beyond... What happens if you want to build AFS for all of CMU? More disks than one machine can handle; more users than one machine can handle Simplest idea: Partition users onto different servers How do we handle a move across servers? How to divide the users? Statically? What about load balancing for operations & for space? Some files become drastically more popular?
38
“Cluster” filesystems Lab inspired by Frangipani, a scalable distributed filesystem. Think back to our list of things that filesystems have to do Concurrency management Space allocation and data storage Directory management and naming
39
Frangipani design (Layer diagram: program → Frangipani file server → distributed lock service → Petal distributed virtual disk → physical disks) Petal aggregates many disks (across many machines) into one big “virtual disk”. Simplifying abstraction for both design & implementation. Exports extents - provides allocation, deallocation, etc. Internally: maps (virtual disk, offset) to (server, physical disk, offset) Frangipani stores all data (inodes, directories, data) in Petal; uses the lock server for consistency (e.g., creating a file)
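An illustrative C++ sketch of the address translation the slide describes: Petal-style extents mapping a (virtual disk, offset) to a (server, physical disk, offset). Petal’s real map is distributed and replicated; this single-process std::map version only shows the lookup logic.

```cpp
#include <cstdint>
#include <map>
#include <stdexcept>

// Where a piece of the virtual disk actually lives.
struct PhysicalLocation {
    int      server;          // which storage server holds the extent
    int      physical_disk;   // which disk on that server
    uint64_t physical_offset;
};

// A contiguous run of the virtual disk mapped to one physical location.
struct VirtualExtent {
    uint64_t vstart, length;
    PhysicalLocation loc;
};

class VirtualDisk {
    std::map<uint64_t, VirtualExtent> extents_;  // keyed by vstart
public:
    void add_extent(const VirtualExtent& e) { extents_[e.vstart] = e; }

    // Translate a virtual-disk offset to its physical location.
    PhysicalLocation translate(uint64_t voffset) const {
        auto it = extents_.upper_bound(voffset);
        if (it == extents_.begin()) throw std::runtime_error("unallocated");
        --it;                                    // last extent starting <= voffset
        const VirtualExtent& e = it->second;
        if (voffset >= e.vstart + e.length) throw std::runtime_error("unallocated");
        PhysicalLocation loc = e.loc;
        loc.physical_offset += voffset - e.vstart;
        return loc;
    }
};
```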
40
Consequential design
41
Compare with NFS/AFS In NFS/AFS, clients just relay all FS calls to the server; central server. Here, clients run enough code to know which server to direct things to; are active participants in filesystem. (n.b. -- you could, of course, use the Frangipani/Petal design to build a scalable NFS server -- and, in fact, similar techniques are how a lot of them actually are built. See upcoming lecture on RAID, though: replication and redundancy management become key)
42
Lab 2: YFS Yet-another File System. :) Simpler version of what we just talked about: only one extent server (you don’t have to implement Petal) and a single lock server
43
Each server written in C++ yfs_client interfaces with the OS through FUSE Following labs will build YFS incrementally, starting with the lock server and building up through supporting file & directory ops distributed around the network
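As a rough mental model of the split (not the lab’s actual RPC interfaces, which the handout defines), an extent server stores named byte strings and a lock server serializes access to them:

```cpp
#include <condition_variable>
#include <map>
#include <mutex>
#include <string>

// Hypothetical extent server: a flat store of named blobs (e.g., serialized
// inodes, directories, file contents). Real labs expose this over RPC.
class ExtentServer {
    std::map<std::string, std::string> store_;
public:
    bool put(const std::string& id, const std::string& data) {
        store_[id] = data; return true;
    }
    bool get(const std::string& id, std::string& out) {
        auto it = store_.find(id);
        if (it == store_.end()) return false;
        out = it->second; return true;
    }
    bool remove(const std::string& id) { return store_.erase(id) > 0; }
};

// Hypothetical lock server: clients acquire a named lock before mutating
// the extents that back a file or directory, then release it.
class LockServer {
    std::map<std::string, bool> held_;
    std::mutex mu_;
    std::condition_variable cv_;
public:
    void acquire(const std::string& name) {
        std::unique_lock<std::mutex> lk(mu_);
        cv_.wait(lk, [&] { return !held_[name]; });
        held_[name] = true;
    }
    void release(const std::string& name) {
        std::lock_guard<std::mutex> lk(mu_);
        held_[name] = false;
        cv_.notify_all();
    }
};
```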
44
Warning This lab is difficult. Assumes a bit more C++ than lab 1 did. Please please please get started early; ask course staff for help. It will not destroy you; it will make you stronger. But it may well take a lot of work and be pretty intensive.
45
46
Remember this slide? 46 We are back in the 1990s. The network is slow and unstable. Terminals give way to a “powerful” client: 33MHz CPU, 16MB RAM, 100MB hard drive. Mobile users appear (1st IBM ThinkPad in 1992).
47
What about now? 47 We are in the 2000s now. The network is fast and reliable on the LAN. The “powerful” client is now a very powerful client: 2.4GHz CPU, 1GB RAM, 120GB hard drive. Mobile users are everywhere. Do we still need disconnection? How many people are using Coda?
48
Do we still need disconnection? 48 WAN and wireless links are slow and not very reliable PDAs are not very powerful: 200MHz StrongARM, 128MB CF card Electric power is constrained LBFS (MIT) for WANs; Coda and Odyssey (CMU) for mobile users Adaptation is also important
49
What is the future? 49 High-bandwidth, reliable wireless everywhere Even PDAs are powerful: 2GHz, 1GB RAM/flash What will be the research topic in FS? P2P?
50
Today's Lecture 50 DFS design comparisons continued Topic 4: file access consistency NFS, AFS, Sprite, and DCE DFS Topic 5: Locking Other types of DFS Coda – disconnected operation LBFS – weakly connected operation
51
Low Bandwidth File System Key Ideas 51 A network file system for slow or wide-area networks Exploits similarities between files or versions of the same file Avoids sending data that can be found in the server’s file system or the client’s cache Also uses conventional compression and caching Requires 90% less bandwidth than traditional network file systems
52
Working on slow networks 52 Make local copies Must worry about update conflicts Use remote login Only for text-based applications Use LBFS instead Better than remote login Must deal with issues like auto-saves blocking the editor for the duration of a transfer
53
LBFS design 53 Provides close-to-open consistency Uses a large, persistent file cache at the client Stores the client’s working set of files The LBFS server divides the files it stores into chunks and indexes the chunks by hash value The client similarly indexes its file cache Exploits similarities between files LBFS never transfers chunks that the recipient already has
54
Indexing 54 Uses the SHA-1 algorithm for hashing It is collision resistant The central challenge in indexing file chunks is keeping the index at a reasonable size while dealing with shifting offsets Two straw-man approaches: indexing the hashes of fixed-size data blocks, or indexing the hashes of all overlapping blocks at all offsets
55
LBFS indexing solution 55 Considers only non-overlapping chunks Sets chunk boundaries based on file contents rather than on position within a file Examines every overlapping 48-byte region of the file to select boundary regions, called breakpoints, using Rabin fingerprints When the low-order 13 bits of a region’s fingerprint equal a chosen value, the region constitutes a breakpoint
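A sketch of content-defined chunking in this spirit: slide a 48-byte window over the data and cut a chunk whenever the window’s fingerprint, masked to its low 13 bits, hits a magic value. LBFS uses Rabin fingerprints and also enforces minimum/maximum chunk sizes; the polynomial rolling hash and the MAGIC constant below are illustrative stand-ins.

```cpp
#include <cstdint>
#include <string>
#include <vector>

static const size_t   WINDOW = 48;              // 48-byte region examined
static const uint64_t MASK   = (1u << 13) - 1;  // low-order 13 bits
static const uint64_t MAGIC  = 0x78;            // arbitrary breakpoint value

// Hash of data[end-WINDOW, end). A real implementation updates this
// incrementally (a rolling Rabin fingerprint) instead of rehashing.
static uint64_t window_hash(const std::string& data, size_t end) {
    uint64_t h = 0;
    for (size_t i = end - WINDOW; i < end; i++)
        h = h * 31 + (unsigned char)data[i];
    return h;
}

// Split data into non-overlapping, content-defined chunks.
std::vector<std::string> chunk(const std::string& data) {
    std::vector<std::string> chunks;
    size_t start = 0;
    for (size_t end = WINDOW; end <= data.size(); end++) {
        if ((window_hash(data, end) & MASK) == MAGIC) {  // breakpoint found
            chunks.push_back(data.substr(start, end - start));
            start = end;
        }
    }
    if (start < data.size())                              // trailing chunk
        chunks.push_back(data.substr(start));
    return chunks;
}
```

Because boundaries depend only on window contents, an edit near the start of a file shifts offsets but leaves later breakpoints, and hence later chunk hashes, unchanged — which is exactly the property the next slide’s figure illustrates.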
56
Effects of edits on file chunks 56 Chunks of a file before/after edits Grey shading shows edits Stripes show 48-byte regions with magic hash values creating chunk boundaries
57
More Indexing Issues 57 Pathological cases Very small chunks: sending hashes of chunks would consume as much bandwidth as just sending the file Very large chunks: cannot be sent in a single RPC LBFS imposes minimum and maximum chunk sizes
58
The Chunk Database 58 Indexes each chunk by the first 64 bits of its SHA-1 hash To avoid synchronization problems, LBFS always recomputes the SHA-1 hash of any data chunk before using it Simplifies crash recovery Recomputed SHA-1 values are also used to detect hash collisions in the database
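A sketch of that database, with sha1() and read_chunk() left as assumed helpers around a real hash library and the server’s file store: entries are keyed by the first 64 bits of the digest, and every hit is verified by rehashing the stored bytes, so stale or colliding entries are harmless.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Assumed helpers (not written out here): a wrapper over a real SHA-1
// implementation and a routine that reads a stored chunk back from disk.
std::string sha1(const std::string& bytes);        // 20-byte digest
uint64_t    first64(const std::string& digest);    // first 8 bytes as a key
struct ChunkLocation { std::string file; uint64_t offset, length; };
std::string read_chunk(const ChunkLocation& loc);

class ChunkDatabase {
    std::multimap<uint64_t, ChunkLocation> index_;  // 64-bit key -> location
public:
    void add(const std::string& chunk_bytes, const ChunkLocation& loc) {
        index_.insert({first64(sha1(chunk_bytes)), loc});
    }

    // Does the database hold a chunk whose full SHA-1 equals `digest`?
    bool lookup(const std::string& digest, ChunkLocation& out) {
        auto range = index_.equal_range(first64(digest));
        for (auto it = range.first; it != range.second; ++it) {
            // Recompute the hash of the stored bytes before trusting the entry.
            if (sha1(read_chunk(it->second)) == digest) {
                out = it->second;
                return true;
            }
        }
        return false;   // stale or colliding entries fall through harmlessly
    }
};
```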
59
Conclusion 59 Under normal circumstances, LBFS consumes 90% less bandwidth than traditional file systems. Makes transparent remote file access a viable and less frustrating alternative to running interactive programs on remote machines.
60
61
“File System Interfaces” vs. “Block Level Interfaces” 61 Data are organized in files, which in turn are organized in directories Compare these with disk-level access or “block” access interfaces: [Read/Write, LUN, block#] Key differences: Implementation of the directory/file structure and semantics Synchronization (locking)
62
Digression: “Network Attached Storage” vs. “Storage Area Networks” 62
                              NAS                  SAN
Access Methods                File access          Disk block access
Access Medium                 Ethernet             Fiber Channel and Ethernet
Transport Protocol            Layer over TCP/IP    SCSI/FC and SCSI/IP
Efficiency                    Less                 More
Sharing and Access Control    Good                 Poor
Integrity demands             Strong               Very strong
Clients                       Workstations         Database servers
63
Decentralized Authentication (1) 63 Figure 11-31. The organization of SFS.
64
Decentralized Authentication (2) 64 Figure 11-32. A self-certifying pathname in SFS.
65
General Organization (II) 65 Clients view Coda as a single location-transparent shared Unix file system Complements the local file system The Coda namespace is mapped to individual file servers at the granularity of subtrees called volumes Each client has a cache manager (Venus)
66
General Organization (III) 66 High availability is achieved through Server replication: Set of replicas of a volume is VSG (Volume Storage Group) At any time, client can access AVSG (Available Volume Storage Group) Disconnected Operation: When AVSG is empty
67
68
Protocol 68 Based on NFS version 3 LBFS adds extensions Leases Compresses all RPC traffic using conventional gzip compression New NFS procedures: GETHASH(fh, offset, count) – retrieve hashes of data chunks in a file MKTMPFILE(fd, fhandle) – create a temporary file for later use in an atomic update TMPWRITE(fd, offset, count, data) – write to the created temporary file CONDWRITE(fd, offset, count, sha_hash) – similar to TMPWRITE except it sends the SHA-1 hash of the data COMMITTMP(fd, target_fhandle) – commits the contents of the temporary file
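A hedged sketch of how a client might use these procedures to write a file back: stage each chunk into a temporary file, sending only a hash when the server may already have the chunk, then commit atomically. The rpc_* functions are assumed stubs for the procedures listed above; argument details and error handling follow the LBFS paper, not this sketch.

```cpp
#include <cstdint>
#include <string>
#include <vector>

struct Chunk { uint64_t offset; std::string data; std::string sha1_digest; };

// Assumed stubs for the extended NFS procedures (client-side RPC wrappers).
bool rpc_mktmpfile(int fd, const std::string& target_fhandle);
bool rpc_condwrite(int fd, uint64_t offset, uint64_t count,
                   const std::string& sha1_digest);   // true if server had it
bool rpc_tmpwrite(int fd, uint64_t offset, uint64_t count,
                  const std::string& data);
bool rpc_committmp(int fd, const std::string& target_fhandle);

// Write a file back to the server, shipping only chunks it does not have.
bool write_back(const std::string& target_fhandle,
                const std::vector<Chunk>& chunks) {
    const int fd = 1;                              // illustrative temp-file id
    if (!rpc_mktmpfile(fd, target_fhandle)) return false;
    for (const auto& c : chunks) {
        // First try to reconstruct the chunk from data the server already
        // holds (CONDWRITE); only ship the bytes if that fails.
        if (!rpc_condwrite(fd, c.offset, c.data.size(), c.sha1_digest))
            if (!rpc_tmpwrite(fd, c.offset, c.data.size(), c.data))
                return false;
    }
    return rpc_committmp(fd, target_fhandle);      // atomic update of the target
}
```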
69
File Consistency 69 The LBFS client performs whole-file caching Uses a 3-tiered scheme to check file status Whenever a client makes any RPC on a file it gets back a read lease on the file When the user opens the file, if the lease is valid and the file version is up to date, the open succeeds with no message exchange with the server When the user opens the file and the lease has expired, the client gets a new lease & attributes from the server. If the modification time has not changed, the client uses the version from its cache, else it gets new contents from the server No need for write leases due to close-to-open consistency
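A sketch of the open-time decision this scheme implies. The structures and the fetch_attributes / fetch_contents helpers are illustrative assumptions, not LBFS’s actual interface (which layers this on NFS attributes and read leases).

```cpp
#include <string>

struct CachedFile {
    std::string contents;
    long        version;       // modification time / version we cached
    double      lease_expiry;  // when our read lease runs out
};

struct ServerAttrs { long version; double new_lease_expiry; };

// Assumed RPC helpers and clock.
ServerAttrs fetch_attributes(const std::string& path);
std::string fetch_contents(const std::string& path);
double now();

std::string open_file(const std::string& path, CachedFile& c) {
    // Tier 1: lease still valid and version known current -> no server traffic.
    if (now() < c.lease_expiry)
        return c.contents;
    // Tier 2: lease expired -> refresh lease + attributes; reuse the cached
    // copy if the modification time has not changed.
    ServerAttrs a = fetch_attributes(path);
    c.lease_expiry = a.new_lease_expiry;
    if (a.version == c.version)
        return c.contents;
    // Tier 3: file changed on the server -> fetch new contents
    // (via the chunked transfer described earlier).
    c.contents = fetch_contents(path);
    c.version  = a.version;
    return c.contents;
}
```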
70
File read 70
71
File write 71
72
Security Considerations 72 It uses the security infrastructure from SFS Every server has a public key and the client specifies it when mounting the server. The entire LBFS protocol is gzip compressed, tagged with a MAC and then encrypted A user can check whether the file system contains a particular chunk hash by observing subtle timing differences in the server’s answer to a CONDWRITE request
73
Implementation 73 Client and server run at user level Client implements the FS using the xfs device driver Server uses NFS to access files Client-server communication done using RPC over TCP
74
Evaluation: repeated data in files 74
75
Evaluation (2) : bandwidth utilization 75
76
Evaluation (3) : application performance 76
77
Evaluation (4) 77
78
Conclusion 78 LBFS is a network file system for low-bandwidth networks Saves bandwidth by exploiting commonality between files Breaks files into variable-sized chunks based on contents Indexes file chunks by their hash value Looks up chunks to reconstruct files that contain the same data without sending that data over the network It consumes over an order of magnitude less bandwidth than traditional file systems Can be used where other file systems cannot be used Makes remote file access a viable alternative to running interactive applications on remote machines
79
UNIX sharing semantics 79 Centralized UNIX file systems provide one-copy semantics Every modification to every byte of a file is immediately and permanently visible to all processes accessing the file AFS uses open-to-close semantics Coda uses an even laxer model
80
Open-to-Close Semantics (I) 80 First version of AFS Revalidated cached file on each open Propagated modified files when they were closed If two users on two different workstations modify the same file at the same time, the user closing the file last will overwrite the changes made by the other user
81
Open-to-Close Semantics (II) 81 Example (timeline figure): starting from F, the first client saves F’ and the second client saves F”; F” overwrites F’
82
Open-to-Close Semantics (III) 82 Whenever a client opens a file, it always gets the latest version of the file known to the server Clients do not send any updates to the server until they close the file As a result Server is not updated until file is closed Client is not updated until it reopens the file
83
Callbacks (I) 83 AFS-1 required each client to call the server every time it was opening an AFS file Most of these calls were unnecessary as user files are rarely shared AFS-2 introduces the callback mechanism Do not call the server, it will call you!
84
Callbacks (II) 84 When a client opens an AFS file for the first time, server promises to notify it whenever it receives a new version of the file from any other client Promise is called a callback Relieves the server from having to answer a call from the client every time the file is opened Significant reduction of server workload
85
Callbacks (III) 85 Callbacks can be lost! Client will call the server every tau minutes to check whether it received all callbacks it should have received Cached copy is only guaranteed to reflect the state of the server copy up to tau minutes before the time the client opened the file for the last time
86
Coda semantics 86 Client keeps track of a subset s of servers it was able to connect to the last time it tried Updates s at least every tau seconds At open time, client checks it has the most recent copy of the file among all servers in s Guarantee weakened by use of callbacks Cached copy can be up to tau minutes behind the server copy