Principles of Reliable Distributed Systems
Tutorial 12: Frangipani
Spring 2009
Alex Shraer

Frangipani File System
Thekkath, Mann, and Lee, SOSP 1997

Frangipani
Scalable file system built at SRC-DEC
Published in SOSP’97
Uses failure detection, Paxos, leases, …
Two layers:
  – Petal: virtual disk built from many “storage bricks”
  – Frangipani file system and lock service

Motivation
Large-scale distributed file systems are hard to administer
Hard to add/remove machines (servers)
Hard to add/remove disks (storage space)
Hard to manage the set of current components
Hard to manage locks

Petal: Distributed Virtual Disks
C. A. Thekkath and E. K. Lee
Systems Research Center, Digital Equipment Corporation
ASPLOS’96

Client’s View

Petal Overview
Petal provides virtual disks
  – Large (2^64 bytes), sparse virtual space
  – Disk storage allocated on demand
  – Accessible to all file servers over a network
Virtual disks are implemented by
  – Cooperating CPUs executing Petal software
  – Ordinary disks attached to the CPUs
  – A scalable interconnection network
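
A minimal sketch of the "sparse, allocated on demand" idea from the slide above. It is not Petal's implementation: the class name `SparseVirtualDisk`, the `BLOCK_SIZE` value, and the in-memory dictionary standing in for physical storage are all assumptions made for illustration.

```python
BLOCK_SIZE = 64 * 1024  # hypothetical allocation unit, chosen for illustration


class SparseVirtualDisk:
    """Toy virtual disk: a huge address space backed only where written."""

    def __init__(self, size=2**64):
        self.size = size
        self.blocks = {}  # block index -> bytearray, present only once written

    def write(self, offset, data):
        if offset < 0 or offset + len(data) > self.size:
            raise ValueError("write outside the virtual address space")
        pos = 0
        while pos < len(data):
            index, start = divmod(offset + pos, BLOCK_SIZE)
            chunk = data[pos:pos + BLOCK_SIZE - start]
            # Storage is allocated here, on first write to the block.
            block = self.blocks.setdefault(index, bytearray(BLOCK_SIZE))
            block[start:start + len(chunk)] = chunk
            pos += len(chunk)

    def read(self, offset, length):
        out = bytearray()
        pos = 0
        while pos < length:
            index, start = divmod(offset + pos, BLOCK_SIZE)
            n = min(BLOCK_SIZE - start, length - pos)
            block = self.blocks.get(index)
            # Unwritten regions read back as zeros and consume no storage.
            out += block[start:start + n] if block is not None else bytes(n)
            pos += n
        return bytes(out)
```

Writing a few bytes at a very high offset allocates a single block, which is why a file system on top can spread its structures across the address space without paying for the gaps.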

Petal Prototype

Global State Management
Uses Paxos
  – Global state is replicated across all servers
    (metadata such as disk allocation only!)
  – Consistent in the face of server and network failures
  – A majority is needed to update the global state
  – Any server can be added/removed in the presence of failed servers
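
The majority requirement can be illustrated with a small quorum check. This is only the counting rule, not Paxos itself; the function name `has_majority` and the server names are hypothetical.

```python
def has_majority(acks, servers):
    """True if acknowledgements came from a strict majority of known servers."""
    return len(set(acks) & set(servers)) > len(servers) // 2


servers = ["s1", "s2", "s3", "s4", "s5"]

# A network partition splits the servers into two groups.
side_a = ["s1", "s2", "s3"]
side_b = ["s4", "s5"]

can_update_a = has_majority(side_a, servers)  # this side may update global state
can_update_b = has_majority(side_b, servers)  # this side must wait
```

Because two disjoint groups cannot both hold a strict majority, at most one side of a partition can change the global state, which is what keeps the replicas consistent.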

Key Petal Features
Storage is incrementally expandable
Data is optionally mirrored over multiple servers
Metadata is replicated on all servers
Transparent addition and deletion of servers
Supports read-only snapshots of virtual disks
Client API looks like a block-level disk device
Throughput
  – Scales linearly with additional servers
  – Degrades gracefully with failures

Frangipani: A Scalable Distributed File System
C. A. Thekkath, T. Mann, and E. K. Lee
Systems Research Center, Digital Equipment Corporation
SOSP’97

Frangipani Features
Behaves like a local file system
  – Multiple machines cooperatively manage a Petal disk
  – Users on any machine see a consistent view of data
Exhibits good performance, scaling, and load balancing
Easy to administer

Ease of Administration
Frangipani machines are modular
  – Can be added and deleted transparently
Common free space pool
  – Users don’t have to be moved
Automatically recovers from crashes
Consistent backup without halting the system

Frangipani Structure
Distributed file system built atop a shared virtual disk (Petal)
Frangipani servers do not communicate with each other directly
  – Only through Petal
Simplifies management
  – Addition/removal of servers

Frangipani Layering

Standard Organization

Components of Frangipani
File system core
  – Implements the file system (FS) interface
  – Uses FS mechanisms (buffer cache etc.)
  – Exploits Petal’s large virtual space
Locks with leases
  – Granted for a finite time, must be refreshed
Write-ahead redo log
  – Performance optimization + failure recovery

Locks
Multiple reader/single writer
Granularity: one lock per entire file or directory
A lock is really a lease: it expires
  – After 30 seconds in their implementation
Assumption?
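
A rough sketch of multiple-reader/single-writer locks that expire like leases. The `LockTable` class and its method names are hypothetical, time is passed in explicitly so the example is deterministic, and the sketch simply forgets expired holders; it ignores the recovery that a real system must run before reusing a failed server's locks.

```python
LEASE_SECONDS = 30.0  # lease length quoted on the slide


class LockTable:
    """Toy lock table: many readers or one writer, each grant expires."""

    def __init__(self):
        self.holders = {}  # lock name -> {client: (mode, expiry time)}

    def _live(self, name, now):
        # Drop holders whose lease has run out.
        current = self.holders.get(name, {})
        live = {c: (m, e) for c, (m, e) in current.items() if e > now}
        self.holders[name] = live
        return live

    def acquire(self, name, client, mode, now):
        live = self._live(name, now)
        other_modes = [m for c, (m, _) in live.items() if c != client]
        if mode == "write" and other_modes:
            return False  # a writer needs exclusive access
        if mode == "read" and "write" in other_modes:
            return False  # readers cannot share with a writer
        live[client] = (mode, now + LEASE_SECONDS)
        return True


table = LockTable()
r1 = table.acquire("/dir/file", "A", "read", now=0.0)    # granted
r2 = table.acquire("/dir/file", "B", "read", now=1.0)    # granted, shared
w1 = table.acquire("/dir/file", "C", "write", now=2.0)   # refused, readers hold it
w2 = table.acquire("/dir/file", "C", "write", now=31.5)  # granted, leases expired
```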

Using Locks
Frangipani servers are clients of the lock service
Dirty data is written to disk (Petal) before the lock is given to another machine
Locks are cached by the servers that acquire them
  – Soft state: no need to explicitly release locks
  – Uses lease timeouts for lock recovery
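
The rule "dirty data is written to Petal before the lock is given away" could be sketched as a revocation handler. Everything here is an illustrative assumption: the `FileServer` class, the `on_revoke` callback name, and the plain dictionary standing in for Petal.

```python
class FileServer:
    """Toy Frangipani-style server that caches locks and the data they cover."""

    def __init__(self, petal):
        self.petal = petal   # dict standing in for the shared virtual disk
        self.held = set()    # locks currently cached by this server
        self.cache = {}      # lock name -> cached data
        self.dirty = set()   # lock names whose cached data is not yet on Petal

    def write(self, name, data):
        if name not in self.held:
            raise PermissionError("write lock not held for " + name)
        self.cache[name] = data
        self.dirty.add(name)  # the write stays in the cache for now

    def on_revoke(self, name):
        # Flush first, so the next lock holder reads up-to-date data.
        if name in self.dirty:
            self.petal[name] = self.cache[name]
            self.dirty.discard(name)
        # Only then drop the cached copy and give up the lock.
        self.cache.pop(name, None)
        self.held.discard(name)


petal = {}
server = FileServer(petal)
server.held.add("/a")
server.write("/a", b"new contents")
visible_before = "/a" in petal   # still only in the server's cache
server.on_revoke("/a")
visible_after = petal.get("/a")  # flushed before the lock was released
```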

Distributed Lock Management
A set of lock servers collaboratively manage locks
  – Run Paxos among them
  – Consensus on global state: the set of locks each server is responsible for, the list of current lock servers, lock allocation to clients
  – Need a majority to make progress
Using leases requires assuming loosely synchronized clocks
  – Expired leases should not be accepted
Why Paxos then?
  – To overcome network partitions
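
One way to cope with clocks that are only loosely synchronized is to stop relying on a lease some time before it nominally expires. The sketch below shows that check; the function name `lease_usable` and the margin value are illustrative assumptions, not figures taken from the slides.

```python
def lease_usable(expiry, now, margin=15.0):
    """True if the lease will still be valid `margin` seconds from `now`.

    The margin absorbs clock skew and the time an in-flight write may take.
    """
    return now + margin < expiry


lease_expiry = 30.0  # lease granted at t=0 for 30 seconds

ok_early = lease_usable(lease_expiry, now=10.0)  # 25 < 30, safe to write
ok_late = lease_usable(lease_expiry, now=20.0)   # 35 > 30, must renew first
```

A server that fails this check would renew its lease before issuing further writes, so a write is never sent under a lease that the lock service may already consider expired.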

Logging
Frangipani uses a write-ahead redo log for metadata
  – Log records are kept on Petal (why?)
Data is written to Petal
  – On sync, fsync, or every 30 seconds
  – On lock revocation or when the log wraps
Each server has a separate log
  – Reduces contention
  – Independent recovery
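
The write-ahead ordering can be sketched as follows: the log record is appended first, and the in-place metadata update happens later. The `MetadataLog` class is hypothetical, and plain Python containers stand in for the per-server log region and the metadata blocks on Petal.

```python
class MetadataLog:
    """Toy per-server write-ahead redo log for metadata updates."""

    def __init__(self):
        self.records = []  # stands in for this server's log region on Petal
        self.pending = []  # logged updates not yet written in place

    def update(self, block_id, new_value):
        record = (len(self.records), block_id, new_value)
        self.records.append(record)  # 1. the log record is written first
        self.pending.append(record)  # 2. the in-place write is deferred

    def flush(self, disk):
        # Called on sync, periodically, or on lock revocation in this sketch.
        for _, block_id, value in self.pending:
            disk[block_id] = value
        self.pending.clear()


disk = {}
log = MetadataLog()
log.update("inode-7", "size=10")
logged_only = ("inode-7" not in disk) and len(log.records) == 1
log.flush(disk)
```

Since the records describe the new values (redo), a crash between `update` and `flush` loses nothing: whoever reads the log can reapply the pending changes.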

Recovery
Recovery is initiated due to failure detection
  – By the lock service
  – Failure detection implemented using heartbeats
Any server can recover operations for a failed server
  – The log is available via Petal
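
Replaying a failed server's redo log could look like the sketch below. It assumes each metadata block carries a version number and that a record is applied only when it is newer than the block on disk; the `replay` function and the record layout are hypothetical.

```python
def replay(log_records, disk):
    """Apply redo records that are newer than what is already on disk.

    log_records: iterable of (block_id, version, value) in log order.
    disk: dict mapping block_id -> (version, value).
    Returns the number of records actually applied.
    """
    applied = 0
    for block_id, version, value in log_records:
        current_version, _ = disk.get(block_id, (0, None))
        if version > current_version:
            disk[block_id] = (version, value)
            applied += 1
    return applied


# Block "a" already reached disk before the crash; block "b" did not.
disk = {"a": (2, "new")}
failed_server_log = [("a", 1, "old"), ("a", 2, "new"), ("b", 1, "x")]

first_pass = replay(failed_server_log, disk)   # only "b" needs redoing
second_pass = replay(failed_server_log, disk)  # replaying again changes nothing
```

The version check makes replay idempotent, so it does not matter if recovery is interrupted and restarted, possibly by a different server.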

Conclusions
Fault-tolerance in the real world
Overcome crashes and network partitions using consensus-based replication
  – Paxos
Good performance when uncontended
  – Using locks
Implement locks as leases for robustness
Logging for recovery

