Replication and Consistency (2)
Reference
- Replication in the Harp File System, Barbara Liskov, Sanjay Ghemawat, Robert Gruber, Paul Johnson, Liuba Shrira, Michael Williams, MIT
Highly Available, Reliable, Persistent (HARP) file system
- Replicated Unix file system accessible via the Virtual File System (VFS) interface
- VFS is a software abstraction in UNIX-like operating systems. It provides a unified interface to a number of different file systems
[Figure: VFS layered above NTFS, UFS, EXT2FS, and HARP, which sit above the low-level file system]
HARP
- Provides highly available, reliable storage for files
- Guarantees atomic file operations in spite of concurrency and failure
- Primary copy replication (eager master)
  - Master server is authoritative
  - Replicas are backup servers
  - Updates are sent to "enough" replicas to guarantee fail-safe behavior
- Log-structured updates
Using logs for throughput
- Disks are partitioned into tracks, sectors, and cylinders
- Writing a file might involve writing blocks in different tracks (slow because of seeks)
- Log-structured file systems let the writer append sequentially to disk. The log contains the transformations (the updates), which are applied to their final locations lazily.
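
The idea of sequential logging with lazy update can be sketched in a few lines. This is a toy model, not Harp's on-disk format: the class `LoggedStore`, its fields, and the `apply_some` method are names invented here for illustration.

```python
class LoggedStore:
    """Toy model: updates are appended to a sequential log and applied lazily."""

    def __init__(self):
        self.log = []      # sequential log of (block, data) records
        self.blocks = {}   # stands in for the in-place on-disk blocks
        self.applied = 0   # number of log records already applied

    def write(self, block, data):
        # Fast path: one sequential append, no seek to the block's home location.
        self.log.append((block, data))

    def read(self, block):
        # The newest unapplied log record wins over the in-place copy.
        for blk, data in reversed(self.log[self.applied:]):
            if blk == block:
                return data
        return self.blocks.get(block)

    def apply_some(self, n):
        # Lazy update: replay up to n pending records into their home locations.
        for blk, data in self.log[self.applied:self.applied + n]:
            self.blocks[blk] = data
        self.applied = min(self.applied + n, len(self.log))
```

Writes return as soon as the record is appended; the seek-heavy in-place update happens later in `apply_some`, off the writer's critical path.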
Outline
- Replication in the Harp File System, Barbara Liskov, Sanjay Ghemawat, Robert Gruber, Paul Johnson, Liuba Shrira, Michael Williams, MIT
  - Barbara Liskov: one of the first women in the US to receive a Ph.D. in computer science (1968, Stanford); on the MIT faculty since 1972
HARP: Operating Environment
- A set of nodes connected by a network
- Fail-stop: nodes fail by crashing (they don't limp along)
- The network may lose or duplicate messages, or deliver them out of order (but it does not corrupt them)
- Loosely synchronized clocks
- Each server is equipped with a small UPS
Replication Method
- Primary-copy method
- One node acts as the primary; clients send their requests to this primary
- Primary contacts the backups
- Two-phase modifications
  - Phase 1: Primary informs the backups about the modification. Backups acknowledge receipt of the modification. Primary commits and returns
  - Phase 2: Primary informs the backups of the commit in the background
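
The two phases above can be sketched as follows. This is a minimal in-process model, assuming every backup acknowledges; the classes `Primary` and `Backup` and their methods are hypothetical names, and real message passing, timeouts, and failure handling are left out.

```python
class Backup:
    def __init__(self):
        self.log = []        # records received in phase 1
        self.committed = 0   # number of records known to be committed

    def receive(self, op):
        self.log.append(op)
        return True          # acknowledgement of receipt

    def learn_commit(self, commit_point):
        self.committed = max(self.committed, commit_point)


class Primary:
    def __init__(self, backups):
        self.backups = backups
        self.log = []
        self.commit_point = 0

    def modify(self, op):
        # Phase 1: log the operation and send it to the backups.
        self.log.append(op)
        acks = [b.receive(op) for b in self.backups]
        if not all(acks):
            raise RuntimeError("not enough acknowledgements")
        # The primary commits and can reply to the client at this point.
        self.commit_point = len(self.log)
        return "ok"

    def background_commit(self):
        # Phase 2: tell the backups about the commit, off the client's critical path.
        for b in self.backups:
            b.learn_commit(self.commit_point)
```

The client waits only for phase 1. A backup may therefore hold a record it does not yet know to be committed until `background_commit` runs.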
Failover protocol
- The failover protocol is a view change
- View change
  - A failed node is removed from service
  - A recovered node is put back into service
- The primary of the new view may be different from that of the old view
- Client requests are sent to the primary of the new view
- 2n+1 servers are needed to tolerate n failures
Failover (contd.)
- Store n+1 copies
- n nodes witness the view change to make sure only one view is selected
- Each replica group manages one or more file systems
  - 1 designated primary
  - n designated backups
  - n designated witnesses
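
The group sizes fit together arithmetically: 1 primary, n backups, and n witnesses give 2n+1 servers, of which n+1 store data. A small sketch makes the counts concrete. The function name and the `majority` entry are illustrative assumptions: the slides do not state a quorum size, but a majority of n+1 out of 2n+1 is the usual way to ensure that two views cannot form at once.

```python
def group_layout(n):
    """Return role counts for a replica group that tolerates n failures."""
    return {
        "primary": 1,
        "backups": n,
        "witnesses": n,
        "total": 2 * n + 1,
        "copies": n + 1,     # primary plus n backups store the data
        "majority": n + 1,   # assumed quorum: any two majorities overlap
    }
```

For n = 1 this gives a three-server group: one primary, one backup, and one witness, with two copies of the data.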
Normal case processing
- Primary maintains a log in volatile memory in which it records modification operations
- Some records are still in phase 1, while others are for operations that have committed
- Primary maintains a commit point (CP): the index of the latest committed operation
- An apply process applies committed operations to the file system asynchronously; the application point (AP) tracks its progress
- A lower bound (LB) pointer marks the operations whose effects have been written to disk
- The global lower bound (GLB) is used to clean up the logs
[Figure: Volatile state of the modification logs. The primary and backup logs are drawn along a time axis, each with regions labeled "Written to disk", "Apply to disk", "Commit", and "Phase 1" and with the pointers LB, AP, and CP; the GLB is also marked.]
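
The relationship among the pointers can be sketched as a small in-memory log. This is an illustration under assumptions, not Harp's data structure: the class `ModLog`, its methods, and the rule that the GLB is the minimum LB across the group are names and simplifications chosen here. The sketch also never truncates the list, so that indices stay simple.

```python
class ModLog:
    """Toy in-memory log with the CP, AP, and LB pointers from the slide."""

    def __init__(self):
        self.records = []  # log records; position i (1-based) is records[i - 1]
        self.cp = 0        # commit point: latest committed record
        self.ap = 0        # application point: latest record applied to the file system
        self.lb = 0        # lower bound: latest record whose effects are on disk

    def append(self, op):
        self.records.append(op)   # phase-1 record, not yet committed
        return len(self.records)

    def commit(self, index):
        self.cp = max(self.cp, min(index, len(self.records)))

    def apply_next(self):
        # Only committed records may be applied.
        if self.ap < self.cp:
            self.ap += 1
            return self.records[self.ap - 1]
        return None

    def written_to_disk(self, index):
        # The disk has caught up with records up to index (never past AP).
        self.lb = max(self.lb, min(index, self.ap))


def global_lower_bound(logs):
    # Records at or below the smallest LB in the group are safe to discard.
    return min(log.lb for log in logs)
```

The invariant to keep in mind is LB ≤ AP ≤ CP ≤ end of log. Records above CP are still in phase 1, and records at or below the GLB are no longer needed by any member of the group.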
Normal case operations
- Operations that do not modify file system state (e.g., getting file attributes for ls) are handled entirely at the primary, against committed writes
- Handling non-modifying operations this way can cause problems under network partitions. Harp therefore relies on loosely synchronized clocks and holds off a view change until an interval has elapsed since the last message
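
One way to picture the timing rule is as a lease: the primary answers reads on its own only while it can be sure the backups have not yet started a view change. The sketch below is a simplification under stated assumptions. The class `ReadLease`, the `interval` parameter, and the use of plain floating-point timestamps are all hypothetical, and clock skew is ignored (a real check would subtract a skew bound).

```python
class ReadLease:
    """Toy model of the primary's permission to answer reads on its own."""

    def __init__(self, interval):
        self.interval = interval   # hypothetical hold-off interval, in seconds
        self.last_contact = None   # time of the last message exchanged with the backups

    def heard_from_backups(self, now):
        self.last_contact = now

    def can_read_locally(self, now):
        # Safe only while the backups are still holding off a view change.
        if self.last_contact is None:
            return False
        return now - self.last_contact < self.interval
```

Once the interval has passed without contact, the primary in this sketch must stop serving reads locally, since a new view with a different primary may already exist on the other side of a partition.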
View change
- Requirements
  - Correctness: operations must appear to be executed in commit order. E.g., create file, list file, delete file expects a certain ordering
  - Reliability: effects of committed operations must survive single failures and simultaneous failures
  - Availability:
  - Timeliness: failover must be done in a timely manner
- Event records are used to avoid problems when records are applied more than once
Discussion