
1 EE324 INTRO TO DISTRIBUTED SYSTEMS L-20 More DFS

2 Andrew File System
- Let's start with a familiar example: Andrew
  - 10,000s of machines, 10,000s of people
- Goal: have a consistent namespace for files across computers
- Allow any authorized user to access their files from any computer
- Terabytes of disk

3 AFS summary
- Client-side caching is a fundamental technique to improve scalability and performance
  - But raises important questions of cache consistency
- Timeouts and callbacks are common methods for providing (some forms of) consistency
- AFS picked close-to-open consistency as a good balance of usability (the model seems intuitive to users), performance, etc.
- AFS authors argued that apps with highly concurrent, shared access, like databases, needed a different model

4 Today's Lecture
- Other types of DFS
  - Coda: disconnected operation
- Programming assignment 4

5 Background
- We are back in the 1990s
  - Networks are slow and unstable
- Terminal → "powerful" client
  - 33MHz CPU, 16MB RAM, 100MB hard drive
- Mobile users appeared
  - 1st IBM ThinkPad in 1992
- We can do work at the client without the network

6 CODA
- Successor of the very successful Andrew File System (AFS)
- AFS
  - First DFS aimed at a campus-sized user community
  - Key ideas include open-to-close consistency and callbacks

7 Hardware Model
- CODA and AFS assume that client workstations are personal computers controlled by their user/owner
  - Fully autonomous
  - Cannot be trusted
- CODA allows owners of laptops to operate them in disconnected mode
  - Opposite of ubiquitous connectivity

8 Accessibility
- Must handle two types of failures
  - Server failures: data servers are replicated
  - Communication failures and voluntary disconnections: Coda uses optimistic replication and file hoarding

9 Design Rationale
- Scalability
  - Callback cache coherence (inherited from AFS)
  - Whole-file caching
- Portable workstations
  - User's assistance in cache management

10 Design Rationale – Replica Control
- Pessimistic
  - Disable all partitioned writes
  - Requires a client to acquire control of a cached object prior to disconnection
- Optimistic
  - Assumes no one else is touching the file
  - (−) requires sophisticated conflict detection
  - (+) exploits the fact that write-sharing is rare in Unix
  - (+) high availability: access anything in range without a lock

11 Pessimistic Replica Control
- Would require the client to acquire exclusive (RW) or shared (R) control of cached objects before accessing them in disconnected mode
  - Acceptable solution for voluntary disconnections
  - Does not work for involuntary disconnections
- What if the laptop remains disconnected for a long time?

12 Leases
- We could grant exclusive/shared control of the cached objects for a limited amount of time
- Works very well in connected mode
  - Reduces server workload
  - Server can keep leases in volatile storage as long as their duration is shorter than boot time
- Would only work for very short disconnection periods

13 Optimistic Replica Control (I)
- Optimistic replica control allows access in every disconnected mode
- Tolerates temporary inconsistencies
  - Promises to detect them later
- Provides much higher data availability

14 Optimistic Replica Control (II)
- Defines an accessible universe: the set of replicas that the user can access
  - The accessible universe varies over time
- At any time, the user
  - Will read from the latest replica(s) in his accessible universe
  - Will update all replicas in his accessible universe

15 Coda (Venus) States
1. Hoarding: normal operation mode
2. Emulating: disconnected operation mode
3. Reintegrating: propagates changes and detects inconsistencies
[State-transition diagram: Hoarding ↔ Emulating ↔ Reintegrating]

16 Hoarding
- Hoard useful data in anticipation of disconnection
- Balance the needs of connected and disconnected operation
  - Cache size is restricted
  - Disconnections are unpredictable
- Prioritized algorithm for cache management
- Hoard walking: periodically re-evaluate which objects to keep

17 Prioritized algorithm
- User-defined hoard priority p: how interested the user is in the object
- Recent usage q
- Object priority = f(p, q)
- Evict the object with the lowest priority
- (+) Fully tunable: everything can be customized
- (−) In practice, users may have no idea how to customize it
A minimal sketch of this eviction rule is given below.
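To make the eviction rule concrete, here is a small sketch of priority-based cache management, assuming a simple weighted sum for f(p, q); Venus's real priority function and bookkeeping are more involved.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Minimal sketch of Coda-style hoard cache eviction (not Venus's actual code).
// Assumption: f(p, q) is a simple weighted sum; the real formula differs.
struct CachedObject {
    std::string name;
    int hoard_priority;    // p: user-assigned via the hoard profile
    int recent_usage;      // q: a recency/frequency score kept by the cache
};

int objectPriority(const CachedObject& o, int alpha = 10) {
    return alpha * o.hoard_priority + o.recent_usage;   // f(p, q)
}

// Evict the lowest-priority object when the cache is full.
void evictOne(std::vector<CachedObject>& cache) {
    if (cache.empty()) return;
    auto victim = std::min_element(cache.begin(), cache.end(),
        [](const CachedObject& a, const CachedObject& b) {
            return objectPriority(a) < objectPriority(b);
        });
    cache.erase(victim);
}
```

Hoard walking would then amount to periodically recomputing these priorities and refetching any high-priority objects that fell out of the cache.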

18 Emulation
- In emulation mode:
  - Attempts to access files that are not in the client cache appear as failures to applications
  - All changes are written to a persistent log, the client modification log (CML); a sketch follows below
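A minimal sketch of what a CML might look like; the record layout and field names are illustrative and do not match Coda's actual on-disk format.

```cpp
#include <cstdint>
#include <deque>
#include <string>

// Illustrative client modification log (CML) kept while disconnected.
enum class OpType { Create, Write, Remove, Rename };

struct CmlRecord {
    uint64_t    seq;        // monotonically increasing sequence number
    OpType      op;
    std::string path;       // object the operation applies to
    std::string new_path;   // used only for Rename
    int64_t     timestamp;  // client clock at the time of the operation
};

struct ModificationLog {
    std::deque<CmlRecord> records;   // kept in persistent storage in real Coda

    void append(CmlRecord r) {
        r.seq = records.empty() ? 1 : records.back().seq + 1;
        records.push_back(std::move(r));
        // Real Venus also optimizes the log here, e.g. a Write followed by a
        // Remove of the same file cancels out, shrinking what must be replayed.
    }
};
```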

19 Persistence
- Venus keeps its cache and related data structures in non-volatile storage

20 Reintegration
- When the workstation gets reconnected, Coda initiates a reintegration process
  - Performed one volume at a time
  - Venus ships the replay log to all volumes
  - Each volume performs a log replay algorithm
  - Only write/write conflicts are detected
- Outcome
  - Success: free the logs, reset priorities
  - Failure: save the logs to a tar file and ask the user for help

21 Performance
- Duration of reintegration
  - A few hours of disconnection → about 1 minute
  - But sometimes much longer
- Cache size
  - 100MB at the client is enough for a "typical" workday
- Conflicts
  - Almost no conflicts at all. Why?
  - Over 99% of modifications are made by the same person
  - Two users modifying the same object within a day: <0.75%

22 Coda Summary
- Puts scalability and availability before data consistency
  - Unlike NFS
- Assumes that inconsistent updates are very infrequent
- Introduced disconnected operation mode and file hoarding

23 Today's Lecture
- Other types of DFS
  - Coda: disconnected operation
- Programming assignment 4

24 Filesystems
- Last time: looked at how we could use RPC to split filesystem functionality between client and server
- But pretty much, we didn't change the design
  - We just moved the entire filesystem to the server
  - and then added some caching on the client in various ways

25 You can go farther...
- But it requires ripping apart the filesystem functionality into modules
  - and placing those modules at different computers on the network
- So now we need to ask... what does a filesystem do, anyway?

26
- Well, there's a disk...
  - Disks store bits, in fixed-length pieces called sectors or blocks
- But a filesystem has... files. And often directories. And maybe permissions, creation and modification times, and other stuff about the files ("metadata")

27 Filesystem functionality
- Directory management (maps entries in a hierarchy of names to files on disk)
- File management (adding, reading, changing, appending, deleting individual files)
- Space management: where on disk to store these things?
- Metadata management

28 Conventional filesystem
- Wraps all of these up together. Useful concepts: [pictures]
- "Superblock" -- well-known location on disk where top-level filesystem info is stored (pointers to more structures, etc.)
- "Free list" or "free space bitmap" -- data structures to remember what's used on disk and what's not. Why? Fast allocation of space for new files.
- "inode" (short for index node) -- stores all metadata about a file, plus information pointing to where the file is stored on disk
  - Small files may be referenced entirely from the inode; larger files may have some indirection to blocks that list locations on disk
  - Directory entries point to inodes
- "extent" -- a way of remembering where on disk a file is stored. Instead of listing all blocks, list a starting block and a range. More compact representation, but requires large contiguous block allocation.
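As an illustration of these structures, here is a sketch of what the on-disk types might look like; the field names and sizes are invented for the example and do not match any particular filesystem.

```cpp
#include <cstdint>

// Illustrative on-disk structures for a conventional Unix-like filesystem.
constexpr int NUM_DIRECT = 12;

struct Inode {
    uint32_t mode;                   // permissions + file type
    uint32_t uid, gid;               // ownership metadata
    uint64_t size;                   // length in bytes
    uint64_t mtime, ctime;           // modification / change times
    uint64_t direct[NUM_DIRECT];     // block numbers for small files
    uint64_t indirect;               // a block full of further block numbers
    uint64_t double_indirect;        // two levels of indirection for big files
};

// An extent replaces a long list of block numbers with (start, length),
// which is more compact but needs contiguous allocation on disk.
struct Extent {
    uint64_t start_block;
    uint32_t block_count;
};

// A directory entry simply maps a name to an inode number.
struct DirEntry {
    uint64_t inode_no;
    char     name[256];
};
```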

29 Filesystem "VFS" ops
- VFS ("virtual filesystem"): common abstraction layer inside kernels for building filesystems -- the interface is common across FS implementations
- Think of this as an abstract data type for filesystems
  - It has both syntax (function names, return values, etc.) and semantics ("don't block on this call", etc.)
- One key thing to note: the VFS itself may do some caching and other management...
  - in particular, it often maintains an inode cache

30 FUSE
- The lab will use FUSE
- FUSE is a way to implement filesystems in user space (as normal programs), but have them available through the kernel -- like normal files
- It has a roughly VFS-like interface

31 Figure from FUSE documentation

32 Directory operations
- readdir(path) -- return directory entries for each file in the directory
- mkdir(path) -- create a new directory
- rmdir(path) -- remove the named directory

33 File operations
- mknod(path, mode, dev) -- create a new "node" (generic: a file is one type of node; a device node is another)
- unlink(path) -- remove a link to an inode, decrementing the inode's reference count
  - many filesystems permit "hard links" -- multiple directory entries pointing to the same file
- rename(path, newpath)
- open -- open a file, returning a file handle; read; write
- truncate -- cut off at a particular length
- flush -- close one handle to an open file
- release -- completely close a file handle

34 Metadata ops
- getattr(path) -- return metadata struct
- chmod / chown (ownership & perms)
A small FUSE sketch covering a few of these operations follows below.
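To make these operations concrete, here is a minimal FUSE sketch (FUSE 2.x user-space API) that serves a single read-only file. It is illustrative only and is not the lab's YFS code.

```cpp
// Build (roughly): g++ hello_fuse.cc -o hello_fuse $(pkg-config fuse --cflags --libs)
#define FUSE_USE_VERSION 26
#include <fuse.h>
#include <sys/stat.h>
#include <algorithm>
#include <cerrno>
#include <cstring>
#include <string>

static const std::string kPath = "/hello";
static const std::string kData = "hello from FUSE\n";

static int my_getattr(const char *path, struct stat *st) {
    std::memset(st, 0, sizeof(*st));
    if (std::strcmp(path, "/") == 0) { st->st_mode = S_IFDIR | 0755; st->st_nlink = 2; return 0; }
    if (path == kPath) { st->st_mode = S_IFREG | 0444; st->st_nlink = 1; st->st_size = kData.size(); return 0; }
    return -ENOENT;
}

static int my_readdir(const char *path, void *buf, fuse_fill_dir_t filler,
                      off_t, struct fuse_file_info *) {
    if (std::strcmp(path, "/") != 0) return -ENOENT;
    filler(buf, ".", nullptr, 0);
    filler(buf, "..", nullptr, 0);
    filler(buf, kPath.c_str() + 1, nullptr, 0);   // strip the leading '/'
    return 0;
}

static int my_read(const char *path, char *buf, size_t size, off_t off,
                   struct fuse_file_info *) {
    if (path != kPath) return -ENOENT;
    if (off >= (off_t)kData.size()) return 0;
    size = std::min(size, kData.size() - (size_t)off);
    std::memcpy(buf, kData.data() + off, size);
    return (int)size;
}

int main(int argc, char *argv[]) {
    struct fuse_operations ops = {};
    ops.getattr = my_getattr;   // metadata op from the slide above
    ops.readdir = my_readdir;   // directory op
    ops.read    = my_read;      // file op
    return fuse_main(argc, argv, &ops, nullptr);   // mount point given on the command line
}
```

In the lab, yfs_client sits behind these callbacks and turns them into RPCs to the extent and lock servers instead of serving static data.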

35 Back to goals of DFS
- Users should have the same view of the system and be able to share files
- Last time:
  - Central fileserver handles all filesystem operations -- consistency was easy, but overhead high, scalability poor
  - Moved to NFS and then AFS: added more and more caching at the client; added cache consistency problems
  - Solved using timeouts or callbacks to expire cached contents

36 Protocol & consistency
- Remember last time: NFS defined operations on unique inode #s instead of names... why? Idempotency -- operations needed to be uniquely identifiable.
- Related example for today, when we're considering splitting up components: moving a file from one directory to another
  - What if this is a compound operation ("remove from one", "add to another")? Can another user see the intermediate state? (e.g., the file in both directories, or in neither?)
- Last time: saw the issue of when things become consistent
  - Presented close-to-open consistency as a compromise

37 Scaling beyond...
- What happens if you want to build AFS for all of CMU? More disks than one machine can handle; more users than one machine can handle
- Simplest idea: partition users onto different servers
  - How do we handle a move across servers?
  - How do we divide the users? Statically? What about load balancing for operations and for space? What if some files become drastically more popular?

38 "Cluster" filesystems
- Lab inspired by Frangipani, a scalable distributed filesystem
- Think back to our list of things that filesystems have to do:
  - Concurrency management
  - Space allocation and data storage
  - Directory management and naming

39 Frangipani design
- Layers (top to bottom): Program → Frangipani file server → distributed lock service → Petal distributed virtual disk → physical disks
- Petal aggregates many disks (across many machines) into one big "virtual disk"
  - Simplifying abstraction for both design and implementation
  - Exports extents; provides allocation, deallocation, etc.
  - Internally: maps (virtual disk, offset) to (server, physical disk, offset)
- Frangipani stores all data (inodes, directories, data) in Petal; uses the lock server for consistency (e.g., when creating a file)
A sketch of the Petal address translation follows below.
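Here is a minimal sketch of that (virtual disk, offset) → (server, physical disk, offset) translation; the table layout and the fixed 64 KB extent granularity are assumptions for illustration, not Petal's actual design.

```cpp
#include <cstdint>
#include <map>
#include <stdexcept>
#include <string>

// Where a piece of the virtual disk actually lives.
struct PhysicalLocation {
    std::string server;
    int         physical_disk;
    uint64_t    physical_offset;
};

class VirtualDisk {
public:
    static constexpr uint64_t kExtentSize = 64 * 1024;   // assumed granularity

    // Record a mapping when an extent is allocated.
    void mapExtent(uint64_t virtual_extent, PhysicalLocation loc) {
        table_[virtual_extent] = loc;
    }

    // Translate a virtual byte offset to its physical location.
    PhysicalLocation translate(uint64_t virtual_offset) const {
        auto it = table_.find(virtual_offset / kExtentSize);
        if (it == table_.end())
            throw std::runtime_error("unallocated virtual address");
        PhysicalLocation loc = it->second;
        loc.physical_offset += virtual_offset % kExtentSize;
        return loc;
    }

private:
    std::map<uint64_t, PhysicalLocation> table_;
};
```

Frangipani servers see only the flat virtual address space; replication, reallocation, and adding disks all happen behind this translation.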

40 Consequential design

41 Compare with NFS/AFS
- In NFS/AFS, clients just relay all FS calls to the server; central server
- Here, clients run enough code to know which server to direct things to; they are active participants in the filesystem
- (n.b. -- you could, of course, use the Frangipani/Petal design to build a scalable NFS server -- and, in fact, similar techniques are how a lot of them actually are built. See the upcoming lecture on RAID, though: replication and redundancy management become key)

42 Lab 2: YFS
- Yet-another File System :)
- Simpler version of what we just talked about: only one extent server (you don't have to implement Petal) and a single lock server

43
- Each server written in C++
- yfs_client interfaces with the OS through FUSE
- The following labs will build YFS incrementally, starting with the lock server and building up through supporting file & directory ops distributed around the network

44 Warning
- This lab is difficult
  - It assumes a bit more C++ than lab 1 did
- Please please please get started early; ask course staff for help
- It will not destroy you; it will make you stronger. But it may well take a lot of work and be pretty intensive.

45 45

46 Remember this slide?
- We are back in the 1990s
  - Networks are slow and unstable
- Terminal → "powerful" client
  - 33MHz CPU, 16MB RAM, 100MB hard drive
- Mobile users appear
  - 1st IBM ThinkPad in 1992

47 What about now?
- We are in the 2000s now
  - The network is fast and reliable in the LAN
- "Powerful" client → very powerful client
  - 2.4GHz CPU, 1GB RAM, 120GB hard drive
- Mobile users everywhere
- Do we still need disconnection?
  - How many people are using Coda?

48 Do we still need disconnection?
- WAN and wireless links are not very reliable, and they are slow
- PDAs are not very powerful
  - 200MHz StrongARM, 128MB CF card
  - Constrained by battery power
- LBFS (MIT) for WANs; Coda and Odyssey (CMU) for mobile users
- Adaptation is also important

49 What is the future?
- High-bandwidth, reliable wireless everywhere
- Even PDAs are powerful
  - 2GHz, 1GB RAM/Flash
- What will be the research topics in FS?
  - P2P?

50 Today's Lecture
- DFS design comparisons continued
  - Topic 4: file access consistency -- NFS, AFS, Sprite, and DCE DFS
  - Topic 5: locking
- Other types of DFS
  - Coda: disconnected operation
  - LBFS: weakly connected operation

51 Low Bandwidth File System: Key Ideas
- A network file system for slow or wide-area networks
- Exploits similarities between files or versions of the same file
  - Avoids sending data that can be found in the server's file system or the client's cache
- Also uses conventional compression and caching
- Requires 90% less bandwidth than traditional network file systems

52 Working on slow networks
- Make local copies
  - Must worry about update conflicts
- Use remote login
  - Only for text-based applications
- Use LBFS instead
  - Better than remote login
  - Must deal with issues like auto-saves blocking the editor for the duration of a transfer

53 LBFS design
- Provides close-to-open consistency
- Uses a large, persistent file cache at the client
  - Stores the client's working set of files
- The LBFS server divides the files it stores into chunks and indexes the chunks by hash value
  - The client similarly indexes its file cache
- Exploits similarities between files
  - LBFS never transfers chunks that the recipient already has

54 Indexing
- Uses the SHA-1 algorithm for hashing
  - It is collision resistant
- The central challenge in indexing file chunks is keeping the index at a reasonable size while dealing with shifting offsets
  - Two straw-man approaches: indexing the hashes of fixed-size data blocks, or indexing the hashes of all overlapping blocks at all offsets

55 LBFS indexing solution
- Considers only non-overlapping chunks
- Sets chunk boundaries based on file contents rather than on position within a file
- Examines every overlapping 48-byte region of the file and selects boundary regions, called breakpoints, using Rabin fingerprints
  - When the low-order 13 bits of a region's fingerprint equal a chosen value, the region constitutes a breakpoint
A sketch of this content-defined chunking follows below.
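Here is a sketch of content-defined chunking in that style. A simple Rabin-Karp rolling hash stands in for a true Rabin fingerprint, and the 2 KB / 64 KB chunk-size bounds are assumptions (LBFS does impose minimum and maximum sizes, as noted two slides below), so treat this as illustrative only.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Rolling hash over a 48-byte window (Rabin-Karp style stand-in for Rabin fingerprints).
class RollingHash {
public:
    static constexpr size_t kWindow = 48;

    void push(uint8_t byte, uint8_t evicted, bool window_full) {
        if (window_full)
            hash_ -= evicted * pow_;          // drop the byte leaving the window
        hash_ = hash_ * kBase + byte;         // bring in the new byte
    }
    uint64_t value() const { return hash_; }

private:
    static constexpr uint64_t kBase = 257;
    static uint64_t power() {                 // kBase^(kWindow-1), mod 2^64
        uint64_t p = 1;
        for (size_t i = 0; i + 1 < kWindow; ++i) p *= kBase;
        return p;
    }
    uint64_t hash_ = 0;
    uint64_t pow_  = power();
};

// Returns chunk boundaries (offsets one past the end of each chunk).
std::vector<size_t> chunkBoundaries(const std::vector<uint8_t>& data) {
    constexpr size_t   kMinChunk = 2 * 1024, kMaxChunk = 64 * 1024;  // assumed bounds
    constexpr uint64_t kMask  = (1u << 13) - 1;   // low-order 13 bits
    constexpr uint64_t kMagic = 0x0a5b;           // arbitrary chosen 13-bit value

    std::vector<size_t> boundaries;
    RollingHash rh;
    size_t chunk_start = 0;
    for (size_t i = 0; i < data.size(); ++i) {
        size_t in_chunk = i - chunk_start + 1;
        bool window_full = i >= RollingHash::kWindow;
        uint8_t evicted = window_full ? data[i - RollingHash::kWindow] : 0;
        rh.push(data[i], evicted, window_full);

        bool at_breakpoint = (rh.value() & kMask) == kMagic;
        if ((at_breakpoint && in_chunk >= kMinChunk) || in_chunk >= kMaxChunk) {
            boundaries.push_back(i + 1);
            chunk_start = i + 1;
        }
    }
    if (chunk_start < data.size()) boundaries.push_back(data.size());
    return boundaries;
}
```

Because boundaries depend only on local content, an insertion or deletion shifts at most the chunks around the edit, which is exactly what the next slide's figure illustrates.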

56 Effects of edits on file chunks
- Chunks of a file before/after edits
  - Grey shading shows edits
  - Stripes show 48-byte regions whose hash matches the magic value, creating chunk boundaries

57 More Indexing Issues
- Pathological cases
  - Very small chunks: sending hashes of chunks would consume as much bandwidth as just sending the file
  - Very large chunks: cannot be sent in a single RPC
- LBFS imposes minimum and maximum chunk sizes

58 The Chunk Database
- Indexes each chunk by the first 64 bits of its SHA-1 hash
- To avoid synchronization problems, LBFS always recomputes the SHA-1 hash of any data chunk before using it
  - Simplifies crash recovery
- Recomputed SHA-1 values are also used to detect hash collisions in the database
A sketch of such a database follows below.
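A minimal sketch of a chunk database with that verify-before-use step; sha1() and readChunkFromDisk() are hypothetical helpers, and the layout is not LBFS's actual database format.

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

// Where a known chunk can be found in the local file system / cache.
struct ChunkLocation {
    std::string file_path;
    uint64_t    offset;
    uint32_t    length;
};

// Hypothetical helpers assumed to exist elsewhere.
std::vector<uint8_t> sha1(const std::vector<uint8_t>& data);
std::vector<uint8_t> readChunkFromDisk(const ChunkLocation& loc);

class ChunkDatabase {
public:
    void insert(uint64_t key64, ChunkLocation loc) { index_[key64] = loc; }

    // Return the chunk only if its SHA-1 still matches: the index may be stale
    // after a crash, so the database is never trusted blindly.
    std::optional<std::vector<uint8_t>> lookup(uint64_t key64,
                                               const std::vector<uint8_t>& full_sha1) {
        auto it = index_.find(key64);
        if (it == index_.end()) return std::nullopt;
        std::vector<uint8_t> data = readChunkFromDisk(it->second);
        if (sha1(data) != full_sha1) {        // stale entry or 64-bit key collision
            index_.erase(it);
            return std::nullopt;
        }
        return data;
    }

private:
    std::unordered_map<uint64_t, ChunkLocation> index_;   // first 64 bits of SHA-1 -> location
};
```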

59 Conclusion
- Under normal circumstances, LBFS consumes 90% less bandwidth than traditional file systems
- Makes transparent remote file access a viable and less frustrating alternative to running interactive programs on remote machines

60 60

61 "File System Interfaces" vs. "Block Level Interfaces"
- Data are organized in files, which in turn are organized in directories
- Compare these with disk-level access or "block" access interfaces: [Read/Write, LUN, block#]
- Key differences:
  - Implementation of the directory/file structure and semantics
  - Synchronization (locking)

62 Digression: "Network Attached Storage" vs. "Storage Area Networks"

                                NAS                    SAN
    Access methods              File access            Disk block access
    Access medium               Ethernet               Fibre Channel and Ethernet
    Transport protocol          Layer over TCP/IP      SCSI/FC and SCSI/IP
    Efficiency                  Less                   More
    Sharing and access control  Good                   Poor
    Integrity demands           Strong                 Very strong
    Clients                     Workstations           Database servers

63 Decentralized Authentication (1)
- Figure 11-31. The organization of SFS.

64 Decentralized Authentication (2)
- Figure 11-32. A self-certifying pathname in SFS.

65 General Organization (II)
- Clients view Coda as a single location-transparent shared Unix file system
  - Complements the local file system
- The Coda namespace is mapped to individual file servers at the granularity of subtrees called volumes
- Each client has a cache manager (Venus)

66 General Organization (III)
- High availability is achieved through:
  - Server replication: the set of replicas of a volume is the VSG (Volume Storage Group)
    - At any time, a client can access the AVSG (Available Volume Storage Group)
  - Disconnected operation: when the AVSG is empty

67 67

68 Protocol
- Based on NFS version 3
- LBFS adds extensions
  - Leases
  - Compresses all RPC traffic using conventional gzip compression
  - New NFS procedures:
    - GETHASH(fh, offset, count) -- retrieve hashes of data chunks in a file
    - MKTMPFILE(fd, fhandle) -- create a temporary file for later use in an atomic update
    - TMPWRITE(fd, offset, count, data) -- write to the created temporary file
    - CONDWRITE(fd, offset, count, sha_hash) -- like TMPWRITE, except it sends only the SHA-1 hash of the data
    - COMMITTMP(fd, target_fhandle) -- commit the contents of the temporary file
A sketch of the client-side write-back path built from these calls follows below.
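Here is a sketch of how a client might combine these procedures when writing a file back. Only the RPC names come from the slide; the stub signatures, the Chunk type, and the success/failure semantics of CONDWRITE are assumptions for illustration.

```cpp
#include <cstdint>
#include <vector>

// One content-defined chunk of the file being written back.
struct Chunk {
    uint64_t             offset;
    std::vector<uint8_t> data;
    std::vector<uint8_t> sha1;   // 20-byte hash of data
};

// Hypothetical RPC stubs; assume each returns true when the server applied the call.
bool rpc_mktmpfile(int fd, uint64_t target_fhandle);
bool rpc_condwrite(int fd, uint64_t offset, uint32_t count,
                   const std::vector<uint8_t>& sha_hash);
bool rpc_tmpwrite(int fd, uint64_t offset, uint32_t count,
                  const std::vector<uint8_t>& data);
bool rpc_committmp(int fd, uint64_t target_fhandle);

bool writeBackFile(int fd, uint64_t target_fhandle, const std::vector<Chunk>& chunks) {
    if (!rpc_mktmpfile(fd, target_fhandle)) return false;
    for (const Chunk& c : chunks) {
        // First try to have the server reconstruct the chunk from data it already has.
        if (rpc_condwrite(fd, c.offset, c.data.size(), c.sha1))
            continue;
        // Server did not have this chunk: ship the bytes themselves.
        if (!rpc_tmpwrite(fd, c.offset, c.data.size(), c.data))
            return false;
    }
    // Atomically replace the target file with the assembled temporary file.
    return rpc_committmp(fd, target_fhandle);
}
```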

69 File Consistency
- The LBFS client performs whole-file caching
- Uses a 3-tiered scheme to check file status
  - Whenever a client makes any RPC on a file, it gets back a read lease on the file
  - When the user opens the file, if the lease is valid and the file version is up to date, the open succeeds with no message exchange with the server
  - When the user opens the file and the lease has expired, the client gets a new lease and attributes from the server
    - If the modification time has not changed, the client uses the version from its cache; otherwise it fetches the new contents from the server
- No need for write leases, due to close-to-open consistency
A sketch of this open-time check follows below.
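A minimal sketch of that three-tier check at open time; the cache-entry fields and RPC helpers are illustrative, and the cold-miss path is omitted.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

struct CacheEntry {
    uint64_t    lease_expiry = 0;   // read lease granted on the last RPC for this file
    uint64_t    server_mtime = 0;   // modification time seen when the copy was fetched
    std::string local_copy;         // path of the whole-file copy in the client cache
};

struct FileAttr { uint64_t mtime; };

// Hypothetical helpers assumed to exist elsewhere.
FileAttr    rpc_getattr_and_renew_lease(const std::string& path);
std::string rpc_fetch_whole_file(const std::string& path);

std::string openCached(std::unordered_map<std::string, CacheEntry>& cache,
                       const std::string& path, uint64_t now) {
    CacheEntry& e = cache[path];
    if (now < e.lease_expiry)
        return e.local_copy;                     // tier 1: lease valid, no RPC at all
    FileAttr a = rpc_getattr_and_renew_lease(path);
    if (a.mtime == e.server_mtime)
        return e.local_copy;                     // tier 2: lease renewed, cached copy still current
    e.local_copy   = rpc_fetch_whole_file(path); // tier 3: fetch new contents
    e.server_mtime = a.mtime;
    return e.local_copy;
}
```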

70 File read 70

71 File write 71

72 Security Considerations
- LBFS uses the security infrastructure from SFS
  - Every server has a public key, and the client specifies it when mounting the server
- The entire LBFS protocol is gzip-compressed, tagged with a MAC, and then encrypted
- A user could check whether the file system contains a particular chunk hash by observing subtle timing differences in the server's answer to a CONDWRITE request

73 Implementation
- Client and server run at user level
  - The client implements the FS using the xfs device driver
  - The server uses NFS to access files
- Client-server communication is done using RPC over TCP

74 Evaluation: repeated data in files 74

75 Evaluation (2) : bandwidth utilization 75

76 Evaluation (3) : application performance 76

77 Evaluation (4) 77

78 Conclusion
- LBFS is a network file system for low-bandwidth networks
- Saves bandwidth by exploiting commonality between files
  - Breaks files into variable-sized chunks based on contents
  - Indexes file chunks by their hash value
  - Looks up chunks to reconstruct files that contain the same data without sending that data over the network
- It consumes over an order of magnitude less bandwidth than traditional file systems
- Can be used where other file systems cannot be used
- Makes remote file access a viable alternative to running interactive applications on remote machines

79 UNIX sharing semantics
- Centralized UNIX file systems provide one-copy semantics
  - Every modification to every byte of a file is immediately and permanently visible to all processes accessing the file
- AFS uses open-to-close semantics
- Coda uses an even laxer model

80 Open-to-Close Semantics (I)
- First version of AFS
  - Revalidated cached files on each open
  - Propagated modified files when they were closed
- If two users on two different workstations modify the same file at the same time, the user closing the file last will overwrite the changes made by the other user

81 Open-to-Close Semantics (II)
- Example: [timeline figure: starting from F, the first client closes its copy as F' and the second client later closes its copy as F''; F'' overwrites F']

82 Open-to-Close Semantics (III)
- Whenever a client opens a file, it always gets the latest version of the file known to the server
- Clients do not send any updates to the server until they close the file
- As a result:
  - The server is not updated until the file is closed
  - The client is not updated until it reopens the file

83 Callbacks (I)
- AFS-1 required each client to call the server every time it opened an AFS file
  - Most of these calls were unnecessary, as user files are rarely shared
- AFS-2 introduces the callback mechanism
  - Do not call the server; it will call you!

84 Callbacks (II)
- When a client opens an AFS file for the first time, the server promises to notify it whenever it receives a new version of the file from any other client
  - This promise is called a callback
- Relieves the server from having to answer a call from the client every time the file is opened
  - Significant reduction of server workload

85 Callbacks (III)
- Callbacks can be lost!
  - The client will call the server every tau minutes to check whether it has received all the callbacks it should have received
  - The cached copy is only guaranteed to reflect the state of the server copy up to tau minutes before the time the client last opened the file
A sketch of this client-side bookkeeping follows below.
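A minimal sketch of client-side callback bookkeeping with the tau-interval probe; the data structures and RPC helpers are illustrative, not actual AFS/Coda code.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

struct CallbackState {
    bool     promise_held = false;   // server promised to notify us of new versions
    uint64_t last_probe   = 0;       // when we last checked in with the server (seconds)
};

// Hypothetical helper: asks the server to re-validate our callbacks for this file.
bool rpc_probe_server(const std::string& file);

// Returns true if the cached copy may be used without fetching from the server.
bool cachedCopyUsable(std::unordered_map<std::string, CallbackState>& cb,
                      const std::string& file, uint64_t now, uint64_t tau_seconds) {
    auto it = cb.find(file);
    if (it == cb.end() || !it->second.promise_held)
        return false;                                   // no callback: must ask the server
    if (now - it->second.last_probe > tau_seconds) {
        // Callbacks can be lost, so probe the server at most every tau seconds.
        it->second.promise_held = rpc_probe_server(file);
        it->second.last_probe   = now;
    }
    return it->second.promise_held;
}

// When the server breaks a callback (another client stored a new version),
// drop the promise; the next open will refetch the file.
void onCallbackBreak(std::unordered_map<std::string, CallbackState>& cb,
                     const std::string& file) {
    cb[file].promise_held = false;
}
```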

86 Coda semantics
- The client keeps track of the subset s of servers it was able to connect to the last time it tried
  - It updates s at least every tau seconds
- At open time, the client checks that it has the most recent copy of the file among all servers in s
  - The guarantee is weakened by the use of callbacks
  - The cached copy can be up to tau minutes behind the server copy

