Example File Systems Using Replication CS 188 Distributed Systems February 10, 2015
Example Replicated File Systems NFS Coda Ficus
NFS Originally NFS did not have any replication capability Replication of read-only file systems added later Primary copy read/write replication added later
NFS Read-Only Replication Almost by hand Sysadmin ensures multiple copies of file systems are identical Typically on different machines Avoid writing to any replica E.g., mount them read-only Use automounting facilities to handle failover and load balancing
Primary Copy NFS Replication Commonly referred to as DRBD Typically two replicas Primarily for reliability One replica is the primary It can be written The other replica mirrors the primary Provides service if the primary is unavailable
Some Primary Copy Issues Handling updates How and when do they propagate? Determining failure Of the secondary copy Of the primary copy Handling recovery
Update Issues In DRBD Two choices: Synchronous Writes don’t return until both copies are updated Asynchronous Writes return once the primary is updated The secondary is updated later
Implications of Synchronous Writes Slower, since you can’t indicate success till both copies are written One is written across the network, ensuring slowness Fewer consistency issues If the write returned, both copies have it If not, neither does Really bad timing requires some cleanup
Implications of Asynchronous Writes Faster, since you only wait for the primary copy Almost always works just fine Almost always Problems when it doesn’t, though Different values of the same data at different copies May not be clear how it happened Perhaps even worse
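The tradeoff is easy to see in code. Below is a minimal sketch of the two write modes, with a toy key-value store standing in for the disk; the class and method names are illustrative, not DRBD's actual interface.

```python
import threading

class PrimaryReplica:
    """Sketch of primary-copy writes; all names are illustrative."""

    def __init__(self, secondary, synchronous=True):
        self.store = {}                 # the primary's local copy
        self.secondary = secondary      # the mirror's store (a plain dict here)
        self.synchronous = synchronous

    def write(self, block, value):
        self.store[block] = value       # the primary is always written first
        if self.synchronous:
            # Don't report success until the mirror (normally a network
            # round trip away) is also written: slower, but both copies
            # agree whenever write() returns.
            self.secondary[block] = value
        else:
            # Report success immediately and mirror in the background:
            # faster, but a crash in this window leaves the two copies
            # holding different values for the same block.
            threading.Thread(target=self.secondary.__setitem__,
                             args=(block, value)).start()
        return True
```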
Detecting Failures DRBD usually uses a heartbeat process Primary and secondary expect to communicate every few seconds E.g., every two seconds If too many heartbeats in a row are missed, declare the partner “dead” Might just be unreachable, though
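A sketch of that rule, with an illustrative interval and miss threshold rather than DRBD's actual defaults:

```python
import time

class HeartbeatMonitor:
    """Declares the partner dead after too many missed heartbeats."""

    def __init__(self, interval=2.0, allowed_misses=3):
        self.interval = interval            # partner should check in this often
        self.allowed_misses = allowed_misses
        self.last_heard = time.monotonic()

    def record_heartbeat(self):
        self.last_heard = time.monotonic()  # partner just checked in

    def partner_dead(self):
        # "Dead" really means "missed too many heartbeats in a row";
        # the partner might just be on the far side of a partition.
        missed = (time.monotonic() - self.last_heard) / self.interval
        return missed > self.allowed_misses
```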
Responding To Failures Switch service from the primary to the secondary Which becomes the primary Including write service Ensures continued operation after failure Update logging ensures the new primary is up to date
Recovery From Failures The recovered node becomes the secondary Receives missed updates from the primary Complications if a network partition caused the apparent failure The split brain problem
The Split Brain Problem [Diagram: a network partition separates the primary and secondary; the secondary declares the primary dead and promotes itself, so both sides accept writes and each applies its own updates (Update 1, Update 2, Update 3)] Now what?
The “Simple” Solution Prevent access to both Until sysadmin designates one of them as the new primary Throw away the other and reset to the designated primary Simple for the sysadmin, maybe not for the users
What Other Solution Is Possible? Try to figure out what the correct version of the data is In NFS case, chances are good writes are to different files In which case, you probably just need the most recent copy of each file But there are complex cases NFS replication doesn’t try to do this
Coda A follow-on to the Andrew File System (AFS) Using the basic philosophy of AFS But specifically to handle mobile computers
The AFS System [Diagram: client workstations request files from a server pool]
AFS Characteristics Files permanently stored at exactly one server Clients keep cached copies Writes cached until file close Asynchronous writes Other cached copies then invalidated Unless write conflicts Stateful servers
Adding Mobile Computers Just like AFS, except . . . Some of the clients are mobile [Diagram: a server pool and client workstations, some of them portable]
Why Does That Make a Difference? Mobile computers come and go Well, so do users at workstations But mobile computers take their files with them And expect to access them while they are gone What happens when they do?
The Mobile Problem for AFS The laptop downloads some files to its disk Then it disconnects from the network Then it uses the files And maybe writes them Now it reconnects
Why Is This Different Than Normal AFS? We might get write conflicts here Normal AFS might, too But normal AFS conflicts have a small window Truly concurrent writes only Caches are invalidated when someone closes the file For a laptop, the close could occur weeks before it reconnects
Handling Disconnected Operations Could use a solution like NFS Server has primary copy Client has secondary copy If client can’t access server, can’t write Or could use an optimistic solution Assume no one else is going to write your file, so go ahead yourself Detect problems and fix as needed
The Coda Approach Essentially optimistic When connected, operates much like AFS When disconnected, client is allowed to update cached files Access control permitting But unlike AFS, can’t propagate updates on file close After all, it’s disconnected Instead, remember this failure until later
AFS, Coda, and Caching Like AFS, client machines only cache files An AFS cache miss is just a performance penalty Get it from the server A Coda cache miss when disconnected is a disaster User can’t access his file
Avoiding Disconnected Cache Misses Really requires thinking ahead Initially Coda required users to do it Maintain a list of files they wanted to be sure to always cache In case of disconnected operations Eventually went to a hoarding solution We’ll discuss hoarding later
Coda Reintegration When a disconnected Coda client reconnects Tries to propagate updates occurring during disconnection to a server If no one else updated that file, just like a normal AFS update If someone else updated the file during disconnection, what then?
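A toy sketch of that reintegration step (all names hypothetical; Coda's real mechanism, a client modification log replayed at the server, is far more elaborate): updates made while disconnected are logged, then replayed on reconnection, and any file whose server version changed in the meantime is flagged.

```python
class Server:
    """Toy file server; the version counts committed stores."""
    def __init__(self):
        self.files = {}                        # name -> (version, data)

    def read(self, name):
        return self.files[name]

    def version(self, name):
        return self.files[name][0]

    def store(self, name, data):
        version, _ = self.files.get(name, (0, None))
        self.files[name] = (version + 1, data)

class DisconnectedClient:
    def __init__(self, server):
        self.server = server
        self.cache = {}        # name -> (version when fetched, current data)
        self.log = {}          # name -> data written while disconnected

    def fetch(self, name):
        self.cache[name] = self.server.read(name)

    def write_disconnected(self, name, data):
        version, _ = self.cache[name]
        self.cache[name] = (version, data)
        self.log[name] = data                  # remember the update for later

    def reintegrate(self):
        conflicts = []
        for name, data in self.log.items():
            fetched_version, _ = self.cache[name]
            if self.server.version(name) == fetched_version:
                self.server.store(name, data)  # just like a normal AFS update
            else:
                conflicts.append(name)         # someone else wrote it meanwhile
        self.log.clear()
        return conflicts

server = Server()
server.store("paper.tex", "v1")
laptop = DisconnectedClient(server)
laptop.fetch("paper.tex")
laptop.write_disconnected("paper.tex", "laptop edits")   # while disconnected
server.store("paper.tex", "office edits")                # concurrent update
print(laptop.reintegrate())                              # ['paper.tex']
```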
Coda and Conflicts Such update problems on Coda reintegration are conflicts Two (or more) users made concurrent writes to a file The original solution was that the later update was (mostly) lost The update on the server wins The other update is put in a special conflict directory The owning user or sysadmin is notified to take action Or not take action . . .
Later Coda Conflict Solutions Automated reconciliation of conflicts When possible User tools to help handle them when automation doesn’t work Can you think of particularly problematic issues here?
Ficus A more peer-oriented replicated file system A descendant of the Locus operating system Specifically designed for mobile computers
The Locus Computing Model System composed of many personal workstations And perhaps a few shared server machines Connected by a local area network All machines have dedicated storage But provide the illusion of a single file system Shared by all!
The Ficus Computing Model Just like the Locus model, except . . . Some of the workstations are portable computers Which might disconnect from the network Taking their storage with them
Ficus Shares Some Problems With Coda Portable computers can only access local disks while disconnected Updates involving disconnected computers are complicated And can even cause conflicts
Ficus Has Some Unique Problems, Too The shared file system is really the union of all the machines’ storage, including the portables’ What happens to this illusion when a portable’s storage goes away? And, unfortunately, other machines may still need the files it took with it
Handling the Problems Rely on replication Replicate the files that the portable needs while disconnected Replicate the files it’s taking away when it departs So everyone else can still see them
Updates in Ficus Ficus uses peer replication No primary copy All replicas are equally good So if access permissions allow update And you can get to a replica You can update it How does Ficus handle that?
The Easy Case All replicas are present and available Allow update to one of the replicas Make a reasonable effort to propagate the update to all others But not synchronously On a good day, this works and everything is identical
The Hard Case The best effort to propagate an update from the original replica fails Perhaps because you can’t reach one or more other replicas Perhaps because the portable computers holding them are elsewhere
Handling Updates With primary copies: If the copies are the same, no problem If they’re different, the primary always wins The only possible reason is that the secondary is old With peer copies: If they’re the same, still no problem But what if they’re different?
What Are the Possibilities? 1. One is old and the other is updated How do we tell which is the new one? Or . . . 2. Both have been updated Now what?
More Complicated If >2 Replicas Here’s just one example [Diagram: replica 1 is updated twice; one update propagates only to replica 2, the other only to replica 3; somehow you figure out replica 2 is newer than replica 3] What’s the right thing to do? And how do you figure that out?
Reconciliation Always an option in Locus and Ficus Much more important with disconnected operation The asynchronous operation that ensures eventual update propagation When a replica notices a previously unavailable replica They check for missing updates and trade information about them
Gossiping in Ficus Primary copy replication and systems like Coda always propagate updates the same way Other replicas give their updates to a single site And get new updates from that site Peer systems like Ficus have another option Any peer with later updates can pass them to you Even if they aren’t the primary and didn’t create the updates In file systems, this is called gossiping
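A minimal sketch of the idea (illustrative names; Ficus's actual reconciliation protocol is more involved): each replica remembers the updates it has seen, tagged by their origin, and an exchange hands each side whatever the other is missing, whether or not the sender created those updates itself.

```python
class GossipingReplica:
    def __init__(self, name):
        self.name = name
        self.updates = {}      # (origin, seq) -> data: every update seen so far
        self.seq = 0

    def local_update(self, data):
        self.seq += 1
        self.updates[(self.name, self.seq)] = data

    def gossip_with(self, peer):
        # Each side hands over whatever the other is missing -- including
        # updates it merely heard about and didn't create itself.
        for uid, data in self.updates.items():
            peer.updates.setdefault(uid, data)
        for uid, data in peer.updates.items():
            self.updates.setdefault(uid, data)

a, b, c = GossipingReplica("a"), GossipingReplica("b"), GossipingReplica("c")
a.local_update("new version of foo")
a.gossip_with(b)            # b hears it from the originator
b.gossip_with(c)            # c hears it from b, never talking to a: gossip
assert ("a", 1) in c.updates
```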
How Does Ficus Track Updates? Ficus uses version vectors An applied type of vector clock These clocks keep one vector element per replica With a vector clock stored at each replica Clocks “tick” only on updates
Version Vector Use Example [Diagram: three replicas with version vectors; replica 2 disconnects and misses an update that replicas 1 and 3 both apply, so their vectors dominate replica 2’s] When replica 2 comes back, its version will be recognized as old Compared to either replica 1 or replica 3
Version Vectors and Conflicts Ficus recognizes concurrent (and thus conflicting) writes Using version vectors If neither of two version vectors dominates the other, there’s a conflict Implying concurrent write Typically detected during reconciliation
For Example [Diagram: replica 1 and replica 2 each update the same file independently, so neither version vector dominates the other] CONFLICT!
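Here is a minimal version-vector sketch matching that example; the class and method names are illustrative.

```python
class VersionVector:
    """One counter per replica, ticked on each local update."""

    def __init__(self, replicas):
        self.clock = {r: 0 for r in replicas}

    def tick(self, replica):
        self.clock[replica] += 1    # a local update at `replica`

    def dominates(self, other):
        # We have seen everything the other vector has seen.
        return all(self.clock[r] >= other.clock[r] for r in self.clock)

    def conflicts_with(self, other):
        # Neither dominates: the writes were concurrent.
        return not self.dominates(other) and not other.dominates(self)

# The conflict above: each replica updates the file independently.
v1 = VersionVector(["r1", "r2"]); v1.tick("r1")   # <1,0>
v2 = VersionVector(["r1", "r2"]); v2.tick("r2")   # <0,1>
assert v1.conflicts_with(v2)   # neither vector dominates the other
```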
Now What? Conflicting files represent concurrent writes There is no “correct” order to apply them Use other techniques to resolve the conflicts Creating a semantically correct and/or acceptable version
Example Conflict Resolution Identical conflicts Same update made in two different places Easy to resolve Assuming updates in question are idempotent Conflicts involving append-only files Merge the appends Most Unix directory conflicts are automatically resolvable
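Two of those automatable cases as sketches; these are illustrative resolvers, not Ficus's actual ones.

```python
def merge_append_only(base, version_a, version_b):
    # Both replicas extended the same base: keep the base plus both
    # appended chunks. (Their relative order is a policy choice.)
    assert version_a.startswith(base) and version_b.startswith(base)
    return base + version_a[len(base):] + version_b[len(base):]

def merge_directories(dir_a, dir_b):
    # Most directory conflicts are independent inserts, so the union of
    # the entries is correct (same-name collisions still need a human).
    return {**dir_a, **dir_b}

merged = merge_append_only("log1\n", "log1\nlog2\n", "log1\nlog3\n")
assert merged == "log1\nlog2\nlog3\n"
```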
Ficus Replication Granularity NFS replicates volumes Coda replicates individual files Ficus replicates volumes Later, selective replication of files within volumes added
Hoarding A portable machine off the network must operate off its own disk Only! So it better replicate the files it needs If you know/predict portable disconnection, pre-replicate those files That’s called hoarding
Mechanics of Hoarding Mechanically easy if you replicate at file granularity E.g., Coda or Ficus with selective replication Simply replicate what you need Inefficient if you replicate at volume granularity
What Do You Hoard? Could be done manually Doesn’t work out well Could replicate every file the portable ever touches Might overfill its disk Could use LRU Experience shows that fails oddly
What Does Work Well? You might think clustering Identify files that are used together If one of them recently used, hoard them all Basic approach in Seer Actually, LRU plus some sleazy tricks works equally well And is much cheaper
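A sketch of the LRU half of that approach, assuming a file-count budget as a stand-in for real disk-space accounting (the "sleazy tricks" aren't shown):

```python
from collections import OrderedDict

class HoardList:
    def __init__(self, budget):
        self.budget = budget          # how many files fit on the portable
        self.files = OrderedDict()    # least recently used first

    def touch(self, name):
        self.files.pop(name, None)
        self.files[name] = True       # most recently used goes to the end
        while len(self.files) > self.budget:
            self.files.popitem(last=False)   # evict the LRU file

    def hoard_set(self):
        # Replicate these to the portable's disk before it disconnects.
        return list(self.files)
```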