1
Taming Aggressive Replication in the Pangaea Wide-area File System Y. Saito, C. Karamanolis, M. Karlsson, M. Mahalingam Presented by Jason Waddle
2
Pangaea: Wide-area File System o Support the daily storage needs of distributed users. o Enable ad-hoc data sharing.
3
Pangaea Design Goals I. Speed Hide wide-area latency, file access time ~ local file system II. Availability & autonomy Avoid single point-of-failure Adapt to churn III. Network economy Minimize use of wide-area network Exploit physical locality
4
Pangaea Assumptions (Non-goals) Servers are trusted Weak data consistency is sufficient (consistency in seconds)
7
Symbiotic Design Autonomous Each server operates even when disconnected from the network. Cooperative When connected, servers cooperate to enhance overall performance and availability.
8
Pervasive Replication Replicate at file/directory level Aggressively create replicas: whenever a file or directory is accessed No single “master” replica A replica may be read / written at any time Replicas exchange updates in a peer-to-peer fashion
9
Graph-based Replica Management Replicas connected in a sparse, strongly-connected, random graph Updates propagate along edges Edges used for discovery and removal
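To make the graph idea concrete, here is a minimal Python sketch (names are illustrative, not Pangaea's code): each replica keeps a small set of peer edges, and an update issued at any replica floods along those edges, reaching every replica as long as the graph stays connected.

    # Sketch of graph-based update propagation; names are illustrative.
    from collections import deque

    class ReplicaNode:
        def __init__(self, node_id):
            self.node_id = node_id
            self.peers = set()     # sparse set of peer edges
            self.version = 0       # highest update seen so far

    def add_edge(a, b):
        # Adding (or removing) a replica touches only its few peers: O(1) work.
        a.peers.add(b)
        b.peers.add(a)

    def flood(update_version, origin):
        # As long as the graph is connected, the update reaches every replica.
        queue = deque([origin])
        while queue:
            node = queue.popleft()
            if node.version >= update_version:
                continue                      # already applied, stop here
            node.version = update_version
            queue.extend(node.peers)          # propagate along graph edges

    # Example: three gold replicas forming a clique.
    a, b, c = ReplicaNode("a"), ReplicaNode("b"), ReplicaNode("c")
    add_edge(a, b); add_edge(b, c); add_edge(a, c)
    flood(1, a)
    assert b.version == 1 and c.version == 1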
10
Benefits of Graph-based Approach Inexpensive Graph is sparse, adding/removing replicas O(1) Available update distribution As long as graph is connected, updates reach every replica Network economy High connectivity for close replicas, build spanning tree along fast edges
11
Optimistic Replica Coordination Aim for maximum availability over strong data-consistency Any node issues updates at any time Update transmission and conflict resolution in background
12
Optimistic Replica Coordination “Eventual consistency” (~ 5s in tests) No strong consistency guarantees: no support for locks, lock-files, etc.
13
Pangaea Structure [diagram]: servers (nodes) are grouped into regions of mutually close nodes (< 5 ms RTT).
14
Server Structure [diagram]: an application's I/O requests pass through the kernel NFS client to the user-space Pangaea server, whose NFS protocol handler, replication engine, log, and membership modules handle them and carry out inter-node communication.
17
Server Modules NFS protocol handler Receives requests from apps, updates local replicas, generates requests to the replication engine Replication engine Accepts local and remote requests Modifies replicas Forwards requests to other nodes Log module Transaction-like semantics for local updates
18
Server Modules Membership module maintains: List of regions, their members, estimated RTT between regions Location of root directory replicas Information coordinated by gossiping “Landmark” nodes bootstrap newly joining nodes Maintaining RTT information: main scalability bottleneck
19
File System Structure Gold replicas Listed in directory entries Form clique in replica graph Fixed number (e.g., 3) All replicas (gold and bronze) Unidirectional edges to all gold replicas Bidirectional peer-edges Backpointer to parent directory
20
File System Structure [diagram]: example replica graphs for the directory /joe and the file /joe/foo.
21
File System Structure
struct Replica
    fid: FileID
    ts: TimeStamp
    vv: VersionVector
    goldPeers: Set(NodeID)
    peers: Set(NodeID)
    backptrs: Set(FileID, String)
struct DirEntry
    fname: String
    fid: FileID
    downlinks: Set(NodeID)
    ts: TimeStamp
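For readability, the same two records rendered as Python dataclasses; the field names follow the slide, while the aliased types (FileID, NodeID, TimeStamp, VersionVector) are placeholders.

    from dataclasses import dataclass, field

    FileID = str           # placeholder for Pangaea's file identifier
    NodeID = str           # placeholder for a server identifier
    TimeStamp = int        # placeholder for an update timestamp
    VersionVector = dict   # NodeID -> update counter

    @dataclass
    class Replica:
        fid: FileID
        ts: TimeStamp
        vv: VersionVector
        goldPeers: set = field(default_factory=set)   # edges to all gold replicas
        peers: set = field(default_factory=set)       # bidirectional peer-edges
        backptrs: set = field(default_factory=set)    # (parent FileID, name) pairs

    @dataclass
    class DirEntry:
        fname: str
        fid: FileID
        downlinks: set = field(default_factory=set)   # nodes holding the gold replicas
        ts: TimeStamp = 0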
22
File Creation Select locations for g gold replicas (e.g., g=3) One on current server Others on random servers from different regions Create entry in parent directory Flood updates Update to parent directory File contents (empty) to gold replicas
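A sketch of the gold-placement step under an assumed membership table (region name to member nodes); the selection rule follows the slide: one gold on the creating server, the rest on random servers drawn from other regions.

    import random

    def pick_gold_replicas(local_node, local_region, regions, g=3):
        """Choose g gold nodes: the creating server plus random servers
        from distinct other regions, for failure independence."""
        golds = [local_node]
        other_regions = [r for r in regions if r != local_region]
        random.shuffle(other_regions)
        for region in other_regions:
            if len(golds) == g:
                break
            golds.append(random.choice(regions[region]))
        return golds

    # Example membership table (hypothetical).
    regions = {"eu": ["eu1", "eu2"], "us": ["us1", "us2"], "asia": ["asia1"]}
    print(pick_gold_replicas("eu1", "eu", regions, g=3))   # e.g. ['eu1', 'us2', 'asia1']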
23
Replica Creation Recursively get replicas for ancestor directories Find a close replica (shortcutting) Send request to the closest gold replica Gold replica forwards the request to its graph neighbor closest to the requester, which then sends its replica contents
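A sketch of the shortcutting step, assuming a membership-supplied RTT table: the request goes to the nearest gold replica, which redirects it to whichever of its graph neighbors (itself included) is closest to the requester.

    def closest(candidates, target, rtt):
        return min(candidates, key=lambda n: rtt[(n, target)])

    def find_source_replica(requester, gold_nodes, graph_neighbors, rtt):
        """The nearest gold replica redirects the request to whichever of its
        graph neighbors (or itself) is closest to the requester; that node
        then supplies the contents for the new replica."""
        gold = closest(gold_nodes, requester, rtt)
        return closest(graph_neighbors[gold] | {gold}, requester, rtt)

    # Toy symmetric RTT table in milliseconds (hypothetical values).
    rtt = {}
    for x, y, ms in [("c1", "g1", 80), ("c1", "g2", 200), ("c1", "b1", 5),
                     ("g1", "b1", 70), ("g1", "g2", 120)]:
        rtt[(x, y)] = rtt[(y, x)] = ms
    graph_neighbors = {"g1": {"b1", "g2"}}
    print(find_source_replica("c1", ["g1", "g2"], graph_neighbors, rtt))  # -> b1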
24
Replica Creation Select m peer-edges (e.g., m=4) Include a gold replica (for future shortcutting) Include closest neighbor from a random gold replica Get remaining nodes from random walks starting at a random gold replica Create m bidirectional peer-edges
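A sketch of peer-edge selection for the new replica (the graph, RTT table, and walk length are assumptions): keep one gold replica, add the gold's neighbor closest to the new node, and fill the rest of the m slots with short random walks started at a random gold.

    import random

    def random_walk(graph, start, steps=3):
        node = start
        for _ in range(steps):
            node = random.choice(sorted(graph[node]))
        return node

    def choose_peer_edges(new_node, golds, graph, rtt, m=4):
        peers = {random.choice(golds)}           # keep a gold for future shortcutting
        g = random.choice(golds)
        peers.add(min(graph[g], key=lambda n: rtt[(n, new_node)]))  # a close neighbor
        for _ in range(10 * m):                  # fill the rest via random walks
            if len(peers) >= m:
                break
            peers.add(random_walk(graph, random.choice(golds)))
        peers.discard(new_node)
        return peers                             # each edge is then made bidirectional

    # Example: graph maps each node to its current peer set (hypothetical).
    graph = {"g1": {"g2", "b1"}, "g2": {"g1", "b2"}, "b1": {"g1"}, "b2": {"g2"}}
    rtt = {("g1", "new"): 80, ("g2", "new"): 90, ("b1", "new"): 10, ("b2", "new"): 150}
    print(choose_peer_edges("new", ["g1", "g2"], graph, rtt, m=3))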
25
Bronze Replica Removal To recover disk space Using GD-Size algorithm, throw out largest, least-accessed replica Drop useless replicas Too many updates before an access (e.g., 4) Must notify peer-edges of removal; peers use random walk to choose new edge
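A simplified GreedyDual-Size-style sketch of the eviction rule (the cost model and bookkeeping here are illustrative, not Pangaea's exact policy): large, rarely accessed replicas end up with the lowest priority and are discarded first.

    # Simplified GreedyDual-Size-style eviction (a sketch, not Pangaea's implementation).
    class BronzeCache:
        def __init__(self):
            self.inflation = 0.0          # "L" value of GreedyDual-Size
            self.priority = {}            # replica id -> eviction priority
            self.size = {}                # replica id -> size in bytes

        def access(self, rid, size, cost=1.0):
            # On access (or creation), a replica's priority is reset relative
            # to the current inflation; small, hot replicas get high priority.
            self.size[rid] = size
            self.priority[rid] = self.inflation + cost / size

        def evict_one(self):
            # Evict the lowest-priority replica (tends to be large and rarely
            # accessed) and raise the inflation to its priority.
            victim = min(self.priority, key=self.priority.get)
            self.inflation = self.priority[victim]
            del self.priority[victim], self.size[victim]
            return victim

    cache = BronzeCache()
    cache.access("big.iso", size=700_000_000)
    cache.access("notes.txt", size=2_000)
    cache.access("notes.txt", size=2_000)     # re-access keeps it attractive
    print(cache.evict_one())                  # -> "big.iso"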
26
Replica Updates Flood entire file to replica graph neighbors Updates reach all replicas as long as the graph is strongly connected Optional: user can block on update until all neighbors reply (red-button mode) Network economy???
27
Optimized Replica Updates Send only differences (deltas) Include old timestamp, new timestamp Only apply delta to replica if old timestamp matches Revert to full-content transfer if necessary Merge deltas when possible
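A sketch of the delta rule with hypothetical structures: a delta carries the timestamps it bridges and is applied only when the replica's current timestamp matches the delta's old timestamp; otherwise the node falls back to a full-content transfer.

    from dataclasses import dataclass

    @dataclass
    class Delta:
        old_ts: int       # timestamp the delta was computed against
        new_ts: int       # timestamp after applying the delta
        patch: bytes      # the changed bytes (simplified)

    def apply_update(replica, delta, fetch_full_content):
        if replica["ts"] == delta.old_ts:
            # Timestamps match: the delta applies cleanly.
            replica["data"] = delta.patch          # simplified "patch" step
            replica["ts"] = delta.new_ts
        else:
            # Mismatch (e.g., a missed intermediate delta): fall back to
            # transferring the full, current file contents.
            replica["data"], replica["ts"] = fetch_full_content()

    replica = {"ts": 4, "data": b"old"}
    apply_update(replica, Delta(old_ts=4, new_ts=5, patch=b"new"), lambda: (b"full", 5))
    print(replica)   # delta applied: ts == 5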
28
Optimized Replica Updates Don’t send large (e.g., > 1KB) updates to each of m neighbors Instead, use harbingers to dynamically build a spanning-tree update graph Harbinger: small message with update’s timestamps Send updates along spanning-tree edges Happens in two phases
29
Optimized Replica Updates Exploit Physical Topology Before pushing a harbinger to a neighbor, add a random delay ~ RTT (e.g., 10*RTT) Harbingers propagate down fastest links first Dynamically builds an update spanning-tree with fast edges
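A toy discrete-event sketch of the two-phase scheme (the delay factor and tie-breaking are illustrative): each node adopts the sender of the first harbinger it receives as its parent, and because forwarding is delayed in proportion to link RTT, the resulting spanning tree favors fast edges; Phase 2 then ships the update body only along those tree edges.

    import heapq

    def build_update_tree(origin, edges, delay_factor=10):
        # edges maps (a, b) -> RTT in ms, one entry per direction.
        neighbors = {}
        for (a, b), ms in edges.items():
            neighbors.setdefault(a, []).append((b, ms))
        parent = {}                              # node -> sender of its first harbinger
        events = [(0.0, origin, None)]           # (arrival time, node, sender)
        while events:
            t, node, sender = heapq.heappop(events)
            if node in parent:
                continue                         # a faster harbinger got here first
            parent[node] = sender                # this edge joins the spanning tree
            for nbr, ms in neighbors.get(node, []):
                heapq.heappush(events, (t + delay_factor * ms, nbr, node))
        return parent                            # Phase 2 sends the body along these edges

    edges = {}
    for a, b, ms in [("A", "B", 5), ("A", "C", 100), ("B", "C", 5)]:
        edges[(a, b)] = edges[(b, a)] = ms
    print(build_update_tree("A", edges))         # C joins via B; the slow A-C edge is skipped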
30
Update Example (Phase 1) [diagram]: harbingers flood from the writer across the replica graph of nodes A, B, C, D, E, F.
36
Update Example (Phase 2) [diagram]: the update body follows the spanning-tree edges chosen in Phase 1 across nodes A, B, C, D, E, F.
39
Conflict Resolution Use a combination of version vectors and last-writer-wins to resolve conflicts If timestamps mismatch, the full content is transferred Missing update: just overwrite the replica
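A sketch of the resolution rule with illustrative record layouts: if one replica's version vector dominates, the other side simply missed updates and is overwritten; if the vectors are concurrent, the conflict falls back to last-writer-wins on the update timestamps.

    def dominates(vv_a, vv_b):
        """True if version vector vv_a includes every update known to vv_b."""
        return all(vv_a.get(n, 0) >= c for n, c in vv_b.items())

    def resolve(local, remote):
        """Each side is a dict with 'vv' (node -> counter), 'ts', and 'data'."""
        if dominates(local["vv"], remote["vv"]):
            return local                              # remote is stale, ignore it
        if dominates(remote["vv"], local["vv"]):
            return remote                             # we missed updates: overwrite
        # Concurrent updates: a genuine conflict, resolved by last-writer-wins.
        winner = dict(max(local, remote, key=lambda r: r["ts"]))
        winner["vv"] = {n: max(local["vv"].get(n, 0), remote["vv"].get(n, 0))
                        for n in set(local["vv"]) | set(remote["vv"])}
        return winner

    local  = {"vv": {"a": 2, "b": 1}, "ts": 105, "data": b"x"}
    remote = {"vv": {"a": 1, "b": 2}, "ts": 110, "data": b"y"}
    print(resolve(local, remote)["data"])   # concurrent -> last writer (remote) wins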
40
Regular File Conflict (Three Solutions) 1) Last-writer-wins, using update timestamps Requires server clock synchronization 2) Concatenate both updates Make the user fix it 3) Possibly application-specific resolution
43
Directory Conflict alice$ mv /foo /alice/foo bob$ mv /foo /bob/foo Let the child (foo) decide! Implement mv as a change to the file’s backpointer Single file resolves conflicting updates File then updates affected directories
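A sketch of the backpointer trick (field names and the last-writer-wins tie-break are illustrative): both renames become updates to the one backpointer stored at the child, so the child ends up in exactly one directory and then brings the affected directory entries into line.

    def apply_mv(file_replica, new_parent, new_name, ts):
        """A rename is an update to the file's single backpointer; concurrent
        renames become two updates to the same field, resolved last-writer-wins."""
        if ts > file_replica["backptr_ts"]:
            file_replica["backptr"] = (new_parent, new_name)
            file_replica["backptr_ts"] = ts
        return file_replica["backptr"]      # directories are then patched to match

    foo = {"backptr": ("/", "foo"), "backptr_ts": 0}
    apply_mv(foo, "/alice", "foo", ts=17)   # alice's mv
    apply_mv(foo, "/bob", "foo", ts=12)     # bob's concurrent mv, older timestamp
    print(foo["backptr"])                   # -> ('/alice', 'foo'): one winner, no duplicate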
44
Temporary Failure Recovery Log outstanding remote operations Update, random walk, edge addition, etc. Retry logged updates On reboot On recovery of another node Can create superfluous edges Retains m-connectedness
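A sketch of the retry path, with a hypothetical operation format: remote operations are logged until delivered and re-sent on reboot or when the peer recovers; re-sending may create superfluous edges, which the slide notes is harmless.

    class OutstandingLog:
        def __init__(self):
            self.pending = []                 # (target_node, operation) pairs

        def record(self, target, op):
            self.pending.append((target, op))

        def retry(self, send, is_alive):
            """Called on reboot or when a peer recovers: re-send whatever is
            still pending; duplicates may create superfluous edges, which is safe."""
            still_pending = []
            for target, op in self.pending:
                if not (is_alive(target) and send(target, op)):
                    still_pending.append((target, op))
            self.pending = still_pending

    log = OutstandingLog()
    log.record("node-b", ("add_edge", "fileX"))
    log.retry(send=lambda t, op: True, is_alive=lambda t: t == "node-b")
    print(log.pending)    # -> []: the logged update was delivered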
45
Permanent Failures A garbage collector (GC) scans for failed nodes Bronze replica on a failed node: the GC causes the replica’s neighbors to replace the dead link with a new peer found by random walk
46
Permanent Failures Gold replica on a failed node: discovered by another gold replica (the golds form a clique), which chooses a new gold by random walk Flood the choice to all replicas Update the parent directory to list the new gold replica nodes Resolve conflicts with last-writer-wins Expensive!
47
Performance – LAN Andrew-Tcl benchmarks, time in seconds
48
Performance – Slow Link The importance of local replicas
49
Performance – Roaming Compile on C1, then time the compile on C2. Pangaea utilizes fast links to a peer’s replicas.
50
Performance: Non-uniform Net A model of HP’s corporate network.
51
Performance: Non-uniform Net
52
Performance: Update Propagation Harbinger time is the window of inconsistency.
53
Performance: Large Scale HP: 3000-node, 7-region HP network U: 500 regions, 6 nodes per region, 200ms RTT, 5Mb/s Latency improves with more replicas.
54
Performance: Large Scale HP: 3000-node, 7-region HP network U: 500 regions, 6 nodes per region, 200ms RTT, 5Mb/s Network economy improves with more replicas.
55
Performance: Availability Numbers in parentheses are relative storage overhead.