
1 EE324 INTRO TO DISTRIBUTED SYSTEMS L-20 More DFS

2 Andrew File System  Let's start with a familiar example: Andrew  10,000s of machines, 10,000s of people  Goal: have a consistent namespace for files across computers  Allow any authorized user to access their files from any computer  Terabytes of disk

3 Callbacks  When a client opens an AFS file for the first time, the server promises to notify it whenever it receives a new version of the file from any other client  This promise is called a callback  Relieves the server from having to answer a call from the client every time the file is opened  Significant reduction of server workload (remember that NFS asks the server every 60 seconds)
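
To make the callback idea concrete, here is a minimal sketch of the bookkeeping an AFS-style server could keep (the names and the notify RPC are invented for illustration, not AFS source): remember which clients cache each file, and break the callback of every other client when a new version arrives.

    // Hedged sketch: per-file callback bookkeeping on an AFS-like server.
    #include <iostream>
    #include <map>
    #include <set>
    #include <string>

    // Placeholder for the real RPC that tells a client its cached copy is stale.
    static void notify_client(int client, const std::string& file) {
        std::cout << "break callback: client " << client << " file " << file << "\n";
    }

    struct CallbackTable {
        std::map<std::string, std::set<int>> holders;  // file -> clients holding a callback

        void register_callback(const std::string& file, int client) {
            holders[file].insert(client);              // promise made on first open
        }

        void store_new_version(const std::string& file, int writer) {
            for (int c : holders[file])
                if (c != writer) notify_client(c, file);
            holders[file] = {writer};                  // only the writer's copy stays valid
        }
    };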

4 AFS summary  Client-side caching is a fundamental technique to improve scalability and performance  But raises important questions of cache consistency  Timeouts and callbacks are common methods for providing (some forms of) consistency.  AFS picked session semantics (close-to-open consistency) as a good balance of usability (the model seems intuitive to users), performance, etc.  AFS authors argued that apps with highly concurrent, shared access, like databases, needed a different model

5 Today's Lecture  Other types of DFS  Coda – disconnected operation  Programming assignment 4

6 Background  We are back in the 1990s  Networks are slow and unstable  Terminal → "powerful" client: 33 MHz CPU, 16 MB RAM, 100 MB hard drive  Mobile users appeared: first IBM ThinkPad in 1992  We can do work at the client without the network

7 CODA  Successor of the very successful Andrew File System (AFS)  AFS: first DFS aimed at a campus-sized user community  Key ideas include session semantics (close-to-open consistency) and callbacks

8 Hardware Model  Similarity: CODA and AFS assume that client workstations are personal computers controlled by their user/owner  Fully autonomous  Cannot be trusted  Difference: CODA allows owners of laptops to operate them in disconnected mode  Opposite of ubiquitous connectivity

9 Coda  Must handle two types of failures  Server failures: data servers are replicated  Communication failures and voluntary disconnections: Coda uses optimistic replication and file hoarding

10 Design Rationale  Scalability  Callback cache coherence (inherited from AFS)  Whole-file caching  Portable workstations  User's assistance in cache management

11 Design Rationale – Replica Control  Pessimistic: disable all partitioned writes; requires a client to acquire control of a cached object prior to disconnection  Optimistic: assume no one else is touching the file  - needs sophisticated conflict detection  + fact: write-sharing is rare in Unix  + high availability: access anything in range without a lock

12 Pessimistic Replica Control  Would require the client to acquire exclusive (RW) or shared (R) control of cached objects before accessing them in disconnected mode  Acceptable solution for voluntary disconnections  Does not work for involuntary disconnections  What if the laptop remains disconnected for a long time?

13 Leases  We could grant exclusive/shared control of the cached objects for a limited amount of time  Works very well in connected mode  Reduces server workload  Server can keep leases in volatile storage as long as their duration is shorter than the boot time  Would only work for very short disconnection periods
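
A minimal sketch of the lease idea (illustrative only, not Coda code): the server stamps each grant with an expiry time and treats the grant as void once the clock passes it, so control held by a disconnected client simply lapses.

    // Hedged sketch of a time-limited lease kept in volatile server memory.
    #include <chrono>

    struct Lease {
        std::chrono::steady_clock::time_point expires;

        static Lease grant(std::chrono::seconds duration) {
            return Lease{std::chrono::steady_clock::now() + duration};
        }
        bool valid() const {   // lease lapses automatically; no revocation message needed
            return std::chrono::steady_clock::now() < expires;
        }
    };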

14 Optimistic Replica Control (I)  Optimistic replica control allows access even in disconnected mode  Tolerates temporary inconsistencies  Promises to detect them later  Provides much higher data availability

15 Optimistic Replica Control (II)  Defines an accessible universe: the set of replicas that the user can access  The accessible universe varies over time  At any time, the user  Will read from the latest replica(s) in his accessible universe  Will update all replicas in his accessible universe

16 Coda (Venus) States  1. Hoarding: normal operation mode  2. Emulating: disconnected operation mode  3. Reintegrating: propagates changes and detects inconsistencies  [State-transition diagram: Hoarding ↔ Emulating ↔ Reintegrating]
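
One way to picture the Venus state machine (a sketch with invented names, not Coda source): transitions are driven by losing and regaining connectivity to the servers and by finishing the replay.

    // Hedged sketch of the Venus state machine.
    enum class VenusState { Hoarding, Emulating, Reintegrating };

    VenusState next_state(VenusState s, bool connected, bool replay_done) {
        switch (s) {
        case VenusState::Hoarding:      // normal operation until the link drops
            return connected ? VenusState::Hoarding : VenusState::Emulating;
        case VenusState::Emulating:     // serve from cache, log changes, until reconnect
            return connected ? VenusState::Reintegrating : VenusState::Emulating;
        case VenusState::Reintegrating: // ship the log; back to hoarding when done
            return replay_done ? VenusState::Hoarding : VenusState::Reintegrating;
        }
        return s;
    }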

17 Hoarding  Hoard useful data for disconnection  Balance the needs of connected and disconnected operation  Cache size is restricted  Disconnections are unpredictable  Prioritized algorithm for cache management  Hoard walking: periodically reevaluate object priorities

18 Prioritized algorithm  User-defined hoard priority p: how interested the user is in the object  Recent usage q  Object priority = f(p, q)  Kick out the object with the lowest priority  + Fully tunable: everything can be customized  - But in practice users may have no idea how to customize it
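
A toy version of the priority computation and eviction step (the weighting f and all names are invented; the real policy is Coda's to define):

    // Hedged sketch: combine hoard priority p and recency q, evict the minimum.
    #include <algorithm>
    #include <string>
    #include <vector>

    struct CachedObject {
        std::string name;
        double p;   // user-defined hoard priority
        double q;   // recent-usage score
    };

    // One possible f(p, q): a weighted sum; the weights are a policy choice.
    static double priority(const CachedObject& o) { return 0.7 * o.p + 0.3 * o.q; }

    static void evict_lowest(std::vector<CachedObject>& cache) {
        if (cache.empty()) return;
        auto victim = std::min_element(cache.begin(), cache.end(),
            [](const CachedObject& a, const CachedObject& b) {
                return priority(a) < priority(b);
            });
        cache.erase(victim);   // kick out the lowest-priority object
    }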

19 Emulation  In emulation mode:  Attempts to access files that are not in the client cache appear as failures to applications  All changes are written to a persistent log, the client modification log (CML)
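
The CML can be pictured as an append-only list of records, roughly like this sketch (field names invented; the real log also records directory operations and more):

    // Hedged sketch of client modification log (CML) records kept during emulation.
    #include <cstdint>
    #include <string>
    #include <vector>

    struct CmlRecord {
        enum class Op { Create, Write, Remove, Rename } op;
        std::string path;
        std::uint64_t base_version;   // version the cached copy was derived from
        std::vector<char> data;       // new contents for Create/Write
    };

    struct ModificationLog {
        std::vector<CmlRecord> records;   // persisted, per the next slide
        void append(CmlRecord r) { records.push_back(std::move(r)); }
    };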

20 Persistence  Venus keeps its cache and related data structures in non-volatile storage

21 Reintegration  When the workstation gets reconnected, Coda initiates a reintegration process  Performed one volume at a time  Venus ships the replay log to all volumes  Each volume performs a log replay algorithm  Only write/write conflicts are detected  If replay succeeds: free the logs, reset priorities  If not: save the logs to a tar file and ask the user for help
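
Conceptually, replay can compare the version each log record was based on against the version the server currently holds and flag a write/write conflict when they differ. A sketch, reusing the CmlRecord type from the CML sketch above (all names invented):

    // Hedged sketch of log replay with write/write conflict detection.
    #include <cstdint>
    #include <map>
    #include <string>
    #include <vector>

    struct VolumeReplay {
        std::map<std::string, std::uint64_t> server_version;   // path -> current version

        // Returns the conflicting records; the caller saves them to a tar
        // file and asks the user to resolve them by hand.
        std::vector<CmlRecord> replay(const std::vector<CmlRecord>& log) {
            std::vector<CmlRecord> conflicts;
            for (const auto& r : log) {
                std::uint64_t& v = server_version[r.path];
                if (v != r.base_version) {
                    conflicts.push_back(r);   // someone else wrote while we were away
                } else {
                    ++v;                      // apply the update (details elided)
                }
            }
            return conflicts;
        }
    };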

22 Performance  Duration of reintegration: a few hours of disconnection → about 1 min, but sometimes much longer  Cache size: 100 MB at the client is enough for a “typical” workday  Conflicts: hardly any at all! Why?  Over 99% of modifications are by the same person  Two users modify the same object within a day: <0.75%

23 Coda Summary  Puts scalability and availability before data consistency  Unlike NFS  Assumes that inconsistent updates are very infrequent  Introduced disconnected operation mode and file hoarding

24 Today's Lecture  Other types of DFS  Coda – disconnected operation  Programming assignment 4  Note: slides and project borrowed from David Andersen (CMU)

25 Filesystems  Last time: Looked at how we could use RPC to split filesystem functionality between client and server  But pretty much, we didn’t change the design  We just moved the entire filesystem to the server  and then added some caching on the client in various ways

26 You can go farther...  But it requires ripping apart the filesystem functionality into modules  and placing those modules at different computers on the network  So now we need to ask... what does a filesystem do, anyway?

27  Well, there’s a disk...  disks store bits. in fixed-length pieces called sectors or blocks  but a filesystem has... files. and often directories. and maybe permissions. creation and modification time. and other stuff about the files. (“metadata”)

28 Filesystem functionality  Directory management (maps entries in a hierarchy of names to files on disk)  File management (adding, reading, changing, appending, deleting individual files)  Space management: where on disk to store these things?  Metadata management

29 Conventional filesystem  Useful concepts: [pictures]  “Superblock” -- well-known location on disk where top-level filesystem info is stored (pointers to more structures, etc.)  “Free list” or “Free space bitmap” -- data structures to remember what’s used on disk and what’s not. Why? Fast allocation of space for new files.  “inode” - short for index node - stores all metadata about a file, plus information pointing to where the file is stored on disk  Directory entries point to inodes  “extent” - a way of remembering where on disk a file is stored. Instead of listing all blocks, list a starting block and a range. More compact representation, but requires large contiguous block allocation.
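
To make those concepts concrete, here is a deliberately simplified sketch of the on-disk structures (layouts vary widely between real filesystems):

    // Hedged sketch of classic on-disk filesystem structures.
    #include <cstdint>

    struct Superblock {                  // lives at a well-known location, e.g. block 0
        std::uint32_t block_size;
        std::uint64_t inode_table_start; // pointers to the other on-disk structures
        std::uint64_t free_bitmap_start; // free-space bitmap: fast allocation
    };

    struct Extent {                      // "start + length" instead of listing every block
        std::uint64_t start_block;
        std::uint32_t block_count;
    };

    struct Inode {                       // all metadata about one file
        std::uint32_t mode;              // permissions + file type
        std::uint32_t uid, gid;
        std::uint64_t size;
        std::uint64_t mtime, ctime;
        Extent extents[4];               // where the data lives on disk (simplified)
    };

    struct DirEntry {                    // a directory entry points to an inode
        char name[28];                   // fixed-size name field for simplicity
        std::uint32_t inode_number;
    };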

30 Filesystem “VFS” ops  VFS: (‘virtual filesystem‘): common abstraction layer inside kernels for building filesystems -- interface is common across FS implementations  Think of this as an abstract data type for filesystems  has both syntax (function names, return values, etc) and semantics (“don’t block on this call”, etc.)  One key thing to note: The VFS itself may do some caching and other management...  in particular: often maintains an inode cache

31 FUSE  The lab will use FUSE  FUSE is a way to implement filesystems in user space (as normal programs), but have them available through the kernel -- like normal files  It has a kinda VFS-like interface

32 Figure from FUSE documentation

33 Directory operations  readdir(path) - return directory entries for each file in the directory  mkdir(path) -- create a new directory  rmdir(path) -- remove the named directory

34 File operations  mknod(path, mode, dev) -- create a new "node" (generic: a file is one type of node; a device node is another)  unlink(path) -- remove link to inode, decrementing inode's reference count  many filesystems permit "hard links" -- multiple directory entries pointing to the same file  rename(path, newpath)  open -- open a file, returning a file handle  read, write  truncate -- cut off at a particular length  flush -- close one handle to an open file  release -- completely close the file handle

35 Metadata ops  getattr(path) -- return metadata struct  chmod / chown (ownership & perms)
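
To connect these operations to code, a minimal read-only FUSE filesystem can look roughly like the sketch below. It assumes the FUSE 2.x high-level C API (FUSE_USE_VERSION 26; FUSE 3 adds extra parameters), and the function names and the toy "hello" file are mine, not part of the assignment.

    // Hedged sketch: one directory containing one empty read-only file.
    #define FUSE_USE_VERSION 26
    #include <fuse.h>
    #include <sys/stat.h>
    #include <cerrno>
    #include <cstring>

    static int sketch_getattr(const char* path, struct stat* st) {
        std::memset(st, 0, sizeof(*st));                       // metadata op: fill a struct stat
        if (std::strcmp(path, "/") == 0)      { st->st_mode = S_IFDIR | 0755; st->st_nlink = 2; return 0; }
        if (std::strcmp(path, "/hello") == 0) { st->st_mode = S_IFREG | 0444; st->st_size = 0;  return 0; }
        return -ENOENT;
    }

    static int sketch_readdir(const char* path, void* buf, fuse_fill_dir_t filler,
                              off_t, struct fuse_file_info*) {
        if (std::strcmp(path, "/") != 0) return -ENOENT;
        filler(buf, ".", nullptr, 0);                          // directory op: emit one entry per file
        filler(buf, "..", nullptr, 0);
        filler(buf, "hello", nullptr, 0);
        return 0;
    }

    int main(int argc, char* argv[]) {
        struct fuse_operations ops = {};                       // unimplemented ops stay null
        ops.getattr = sketch_getattr;
        ops.readdir = sketch_readdir;
        return fuse_main(argc, argv, &ops, nullptr);           // kernel routes VFS calls to us
    }

Compiled against libfuse and given a mountpoint argument, ls on that mountpoint would show the single hello entry; the lab replaces stubs like these with calls into yfs_client.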

36 Back to goals of DFS  Users should have same view of system, be able to share files  Last time:  Central fileserver handles all filesystem operations -- consistency was easy, but overhead high, scalability poor  Moved to NFS and then AFS: Added more and more caching at client; added cache consistency problems Solved using timeouts or callbacks to expire cached contents

37 Scaling beyond...  What happens if you want to build AFS for all of KAIST? More disks than one machine can handle; more users than one machine can handle  Simplest idea: Partition users onto different servers  How do we handle a move across servers?  How to divide the users? Statically? What about load balancing for operations & for space? Some files become drastically more popular?

38 “Cluster” filesystems  Lab inspired by Frangipani, a scalable distributed filesystem.  Think back to our list of things that filesystems have to do  Concurrency management  Space allocation and data storage  Directory management and naming

39 Frangipani design  [Layer diagram: Program → Frangipani file server → Distributed lock service + Petal distributed virtual disk → Physical disks]  Petal aggregates many disks (across many machines) into one big "virtual disk": a simplifying abstraction for both design & implementation  Exports extents - provides allocation, deallocation, etc.  Internally: maps (virtual disk, offset) to (server, physical disk, offset)  Frangipani stores all data (inodes, directories, data) in Petal; uses the lock server for consistency (e.g., when creating a file)
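
The Petal address translation can be imagined as a lookup from a position on the big virtual disk to a physical location, roughly like this sketch (fixed-size extents and all names are assumptions of mine, not Petal's actual design):

    // Hedged sketch of Petal-style virtual-to-physical address translation.
    #include <cstdint>
    #include <map>

    struct PhysicalLocation {
        int server_id;                   // which machine holds the blocks
        int disk_id;                     // which physical disk on that machine
        std::uint64_t disk_offset;
    };

    struct VirtualDisk {
        static constexpr std::uint64_t kExtentSize = 64 * 1024;
        // Key = starting virtual offset of a fixed-size extent.
        std::map<std::uint64_t, PhysicalLocation> extent_map;

        PhysicalLocation locate(std::uint64_t virtual_offset) const {
            std::uint64_t base = virtual_offset / kExtentSize * kExtentSize;
            PhysicalLocation loc = extent_map.at(base);   // throws if extent unallocated
            loc.disk_offset += virtual_offset - base;     // offset within the extent
            return loc;
        }
    };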

40 Consequential design

41 Compare with NFS/AFS  In NFS/AFS, clients just relay all FS calls to the server; central server.  Here, clients run enough code to know which server to direct things to; are active participants in filesystem.

42 Programming Assignment: YFS  Yet-another File System. :)  Simpler version of what we just talked about: only one extent server (you don’t have to implement Petal; single lock server)

43  Each server written in C++  yfs_client interfaces with OS through fuse  Following labs will build YFS incrementally, starting with the lock server and building up through supporting file & directory ops distributed around the network
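
For orientation only, the lock piece of the lab can be thought of as a small acquire/release interface that yfs_client calls over RPC; the actual class names and method signatures come from the assignment skeleton, so treat this as a sketch.

    // Hedged sketch of the kind of lock-client interface the lab builds.
    #include <cstdint>
    #include <string>

    class LockClientSketch {
    public:
        using LockId = std::uint64_t;

        // Connects to the (single) lock server named in the constructor argument.
        explicit LockClientSketch(const std::string& lock_server_addr);

        // Blocks until the lock server grants the named lock.
        void acquire(LockId lid);

        // Hands the lock back so other yfs_client instances can make progress.
        void release(LockId lid);
    };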

44 Warning  This lab is difficult.  Assumes a bit more C++  Please please please get started early; ask course staff for help.  It will not destroy you; it will make you stronger. But it may well take a lot of work and be pretty intensive.

