1
Large Scale Sharing: GFS and PAST (Mahesh Balakrishnan)
2
Distributed File Systems
Traditional definition: data and/or metadata stored at remote locations, accessed by clients over the network.
- Various degrees of centralization, from NFS to xFS
GFS and PAST:
- Unconventional, specialized functionality
- Large scale in both data and nodes
3
The Google File System
Specifically designed for Google's backend needs:
- Web spiders append to huge files
- Application data patterns: multiple producer / multiple consumer, many-way merging
(Diagram: GFS vs. traditional file systems)
4
Design Space Coordinates
- Commodity components
- Very large files (multi-GB)
- Large sequential accesses
- Co-design of applications and file system
- Supports small files and random reads/writes, but not efficiently
5
GFS Architecture
Interface:
- Usual: create, delete, open, close, etc.
- Special: snapshot, record append
Files are divided into fixed-size chunks; each chunk is replicated at chunkservers.
A single master maintains metadata.
Master, chunkservers, and clients run as user-level processes on Linux workstations.
6
Client File Request
- The client translates a byte offset within the file into a chunk index
- The client sends the (file name, chunk index) request to the master
- The master returns the chunk handle and the chunkserver locations, as sketched below
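The offset-to-chunk translation above can be made concrete with a minimal sketch, assuming the 64 MB chunk size GFS uses and a hypothetical master.lookup() RPC; the names are illustrative, not GFS's actual API.

```python
# Minimal sketch of the client-side lookup, assuming 64 MB chunks and a
# hypothetical master.lookup() RPC (names are illustrative).
CHUNK_SIZE = 64 * 1024 * 1024  # GFS uses fixed 64 MB chunks

def locate_chunk(master, filename, byte_offset):
    # Translate the byte offset within the file into a chunk index.
    chunk_index = byte_offset // CHUNK_SIZE
    # The master maps (filename, chunk index) to an immutable chunk handle
    # and the current chunkserver locations of its replicas.
    chunk_handle, replica_locations = master.lookup(filename, chunk_index)
    return chunk_handle, replica_locations
```

Clients cache this mapping, so repeated accesses within the same chunk do not need to contact the master again.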
7
Design Choices: Master
A single master maintains all metadata...
- Simple design
- Global decision making for chunk replication and placement
But: bottleneck? Single point of failure?
8
Design Choices: Master
A single master maintains all metadata... in memory!
- Fast master operations
- Allows background scans over the entire metadata
But: memory limit? Fault tolerance?
9
Relaxed Consistency Model
File regions are:
- Consistent: all clients see the same data
- Defined: after a mutation, all clients see exactly what the mutation wrote
Ordering of concurrent mutations:
- For each chunk's replica set, the master grants one replica the primary lease
- The primary replica decides the ordering of mutations and sends it to the other replicas
10
Anatomy of a Mutation
1-2. The client gets chunkserver locations from the master.
3. The client pushes data to the replicas, in a chain.
4. The client sends the write request to the primary; the primary assigns a sequence number to the write and applies it.
5-6. The primary tells the other replicas to apply the write and collects their replies.
7. The primary replies to the client.
(A sketch of the primary's side of this flow follows.)
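A rough sketch of steps 4 through 7 from the primary replica's point of view, assuming hypothetical replica stubs and that the data push (step 3) has already buffered the bytes under a client-chosen data_id; this is illustrative, not GFS's actual code.

```python
# Sketch of the write path at the primary replica (illustrative stubs only).
class PrimaryReplica:
    def __init__(self, secondaries):
        self.secondaries = secondaries   # stubs for the other replicas
        self.next_seq = 0
        self.buffered = {}               # data_id -> bytes, filled by the data push

    def handle_write(self, data_id, chunk_handle, offset):
        # Step 4: holding the lease, the primary picks one serial order for
        # all concurrent mutations on this chunk and applies the write locally.
        seq = self.next_seq
        self.next_seq += 1
        self.apply_locally(chunk_handle, offset, seq, self.buffered[data_id])
        # Steps 5-6: forward the same order to every secondary; they apply the
        # data they already buffered during the push, then acknowledge.
        errors = [s for s in self.secondaries
                  if not s.apply(chunk_handle, offset, seq, data_id)]
        # Step 7: reply to the client; any secondary error leaves the region
        # inconsistent and the client is expected to retry.
        return ("ok", seq) if not errors else ("error", errors)

    def apply_locally(self, chunk_handle, offset, seq, data):
        pass  # write `data` at `offset` within the local copy of the chunk
```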
11
Connection with the Consistency Model
- If a secondary replica encounters an error while applying a write (step 5), the region is inconsistent.
- If client code breaks a single large write into multiple small writes, the region is consistent but undefined.
12
Special Functionality
Atomic record append:
- The primary appends to its own replica, then tells the other replicas to write at that offset (sketched below)
- If a secondary replica fails to write the data (step 5), the retry leaves duplicates in the successful replicas and padding in the failed ones
- The region is defined where the append succeeded and inconsistent where it failed
Snapshot:
- Copy-on-write: chunks are copied lazily, on the same chunkserver
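A hedged sketch of the record-append decision at the primary, assuming 64 MB chunks and hypothetical pad_to_end()/append_at() calls on replica stubs; it only illustrates why duplicates and padding can appear.

```python
# Illustrative sketch of atomic record append at the primary.
CHUNK_SIZE = 64 * 1024 * 1024

def record_append(primary, secondaries, chunk, record):
    if chunk.length + len(record) > CHUNK_SIZE:
        # Pad the current chunk on all replicas and have the client retry on
        # the next chunk; this padding is one source of inconsistent regions.
        for replica in [primary] + secondaries:
            replica.pad_to_end(chunk.handle)
        return "retry_next_chunk"
    offset = chunk.length                     # primary chooses the offset
    primary.append_at(chunk.handle, offset, record)
    failed = [s for s in secondaries
              if not s.append_at(chunk.handle, offset, record)]
    # On failure the client retries the append, which leaves duplicates in the
    # replicas that already succeeded: defined where successful, inconsistent
    # where the write failed.
    return offset if not failed else "retry_same_chunk"
```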
13
Master Internals
- Namespace management
- Replica placement
- Chunk creation, re-replication, rebalancing
- Garbage collection
- Stale replica detection
14
Dealing with Faults
High availability:
- Fast master and chunkserver recovery
- Chunk replication
- Master state replication: read-only shadow replicas
Data integrity:
- Each chunk is broken into 64 KB blocks, each with a 32-bit checksum
- Checksums are kept in memory and logged to disk
- Checksumming is optimized for appends, since no verification of existing data is required (see the sketch below)
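A minimal sketch of the per-block integrity check, assuming the 64 KB block size and 32-bit checksums from the slide; zlib.crc32 stands in for whatever checksum function GFS actually uses.

```python
# Per-block checksumming sketch: 64 KB blocks, 32-bit checksums (illustrative).
import zlib

BLOCK_SIZE = 64 * 1024

def checksum_blocks(chunk_bytes):
    # One 32-bit checksum per 64 KB block of the chunk.
    return [zlib.crc32(chunk_bytes[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk_bytes), BLOCK_SIZE)]

def verify_read(chunk_bytes, stored_checksums):
    # A chunkserver verifies the blocks covering a read before returning data,
    # so corruption is detected locally instead of propagating to other replicas.
    return checksum_blocks(chunk_bytes) == stored_checksums
```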
15
Micro-benchmarks
16
Storage Data for ‘real’ clusters
17
Performance
18
Workload Breakdown
(Figures: % of operations for a given size; % of bytes transferred for a given operation size)
19
GFS: Conclusion Very application-specific: more engineering than research
20
PAST
An Internet-based P2P global storage utility:
- Strong persistence
- High availability
- Scalability
- Security
Not a conventional file system:
- Files have unique IDs
- Clients can insert and retrieve files
- Files are immutable
21
PAST Operations
Nodes have random, unique nodeIds.
No searching, directory lookup, or key distribution.
Supported operations:
- Insert: (name, key, k, file) -> fileId; stores the file on the k nodes whose nodeIds are closest to the fileId in id space
- Lookup: (fileId) -> file
- Reclaim: (fileId, key)
(A sketch of insert and lookup follows.)
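A sketch of how a client might derive a fileId and route the operations, assuming (as in the PAST paper) that the fileId is a secure hash of the file name, the owner's public key, and a random salt; pastry.route() is a hypothetical stub for the routing substrate introduced on the next slide.

```python
# Illustrative sketch of PAST insert/lookup over a hypothetical Pastry stub.
import hashlib
import os

def insert(pastry, name, owner_public_key, k, file_bytes):
    # fileId = secure hash of (file name, owner's public key, random salt).
    salt = os.urandom(20)
    file_id = hashlib.sha1(name.encode() + owner_public_key + salt).hexdigest()
    # Pastry routes the insert toward the node whose nodeId is numerically
    # closest to the fileId; that node and its k-1 closest neighbours store replicas.
    pastry.route(key=file_id, msg=("insert", k, file_bytes))
    return file_id

def lookup(pastry, file_id):
    # Any replica along the route may answer; immutability makes this safe.
    return pastry.route(key=file_id, msg=("lookup",))
```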
22
Pastry: P2P routing substrate
- route(key, msg): routes to the node numerically closest to the key in fewer than log_{2^b}(N) steps
- Routing table size: (2^b - 1) * log_{2^b}(N) + 2l entries
- b determines the tradeoff between per-node state and lookup hop count
- l determines failure tolerance: delivery is guaranteed unless l/2 nodes with adjacent nodeIds fail simultaneously
(A worked example of these formulas follows.)
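A small worked example of the formulas above, for illustrative values N = 100,000 nodes, b = 4, and leaf-set parameter l = 16.

```python
# Worked example of the Pastry state and hop-count formulas (illustrative values).
import math

def pastry_state(N, b, l):
    rows = math.ceil(math.log(N, 2 ** b))      # about log_{2^b}(N) routing-table rows
    routing_entries = (2 ** b - 1) * rows      # (2^b - 1) populated entries per row
    return routing_entries + 2 * l             # plus the two leaf-set halves

print(pastry_state(100_000, 4, 16))            # 15 * 5 + 32 = 107 entries
print(math.ceil(math.log(100_000, 2 ** 4)))    # at most 5 routing hops expected
```

Raising b shrinks the hop count but enlarges each routing-table row, which is exactly the state-versus-lookup tradeoff the slide describes.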
23
Routing State of Node 10233102
(Figure: leaf set with |L|/2 larger and |L|/2 smaller nodeIds, routing table entries, and the |M| closest nodes)
24
PAST Operations / Security
Insert:
- A certificate is created with the fileId, a hash of the file content, and the replication factor, and is signed with the owner's private key
- The file and certificate are routed through Pastry
- The first node among the k closest accepts the file and forwards it to the other k-1
Security (smartcards):
- Hold public/private keys
- Generate and verify certificates
- Ensure the integrity of nodeId and fileId assignments
(A sketch of the insert certificate follows.)
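A sketch of the insert certificate a client's smartcard might produce, with hashlib standing in for the content hash and a hypothetical smartcard.sign() call for the private-key signature; field names are illustrative.

```python
# Illustrative sketch of a PAST insert certificate.
import hashlib

def make_insert_certificate(smartcard, file_id, file_bytes, k):
    payload = {
        "fileId": file_id,
        "content_hash": hashlib.sha1(file_bytes).hexdigest(),
        "replication_factor": k,
    }
    # Storing nodes verify this signature before accepting a replica, which
    # ties the fileId and content to the owner's key.
    payload["signature"] = smartcard.sign(repr(payload).encode())
    return payload
```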
25
Storage Management
Design goals:
- High global storage utilization
- Graceful degradation near maximum utilization
PAST tries to:
- Balance free storage space among nodes
- Maintain the k-closest-nodes replication invariant
Sources of storage load imbalance:
- Variance in the number of files assigned to a node
- Variance in the size distribution of inserted files
- Variance in the storage capacity of PAST nodes
26
Storage Management
Large-capacity storage nodes take multiple nodeIds.
Replica diversion:
- If node A cannot store a file, it stores a pointer to the file at a leaf-set node B that is not among the k closest
- What if A or B fails? A duplicate pointer is kept at the (k+1)-closest node
- Policies for diverting and accepting replicas: thresholds t_pri and t_div on the ratio of file size to free space (sketched below)
File diversion:
- If an insert fails, the client retries with a different fileId
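A sketch of the acceptance test behind the t_pri/t_div thresholds, assuming they bound the ratio of file size to a node's free space; the constants match the values used later in the evaluation slides, and the exact test is illustrative rather than PAST's actual code.

```python
# Illustrative replica-acceptance test using the t_pri / t_div thresholds.
T_PRI = 0.1    # threshold for nodes among the k numerically closest
T_DIV = 0.05   # stricter threshold for diverted replicas on leaf-set nodes

def accept_replica(file_size, free_space, is_primary_store):
    threshold = T_PRI if is_primary_store else T_DIV
    # Reject files that are large relative to the remaining space, so nearly
    # full nodes keep room for many small files rather than one big one.
    return file_size / free_space <= threshold
```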
27
Storage Management
Maintaining the replication invariant:
- Handled across node failures and joins
Caching:
- k-replication in PAST is for availability
- Extra cached copies reduce client latency and network traffic
- Unused disk space is utilized
- Greedy Dual-Size replacement policy (sketched below)
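A minimal sketch of Greedy Dual-Size replacement as a PAST node might apply it to cached copies; the cost term and data structure are illustrative, not PAST's actual implementation.

```python
# Illustrative Greedy Dual-Size cache: evict the entry with the smallest
# credit, where credit = inflation value L + cost / size.
class GDSCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0
        self.L = 0.0                 # inflation value, raised on each eviction
        self.entries = {}            # file_id -> (credit, size)

    def insert(self, file_id, size, cost=1.0):
        while self.used + size > self.capacity and self.entries:
            victim = min(self.entries, key=lambda f: self.entries[f][0])
            self.L = self.entries[victim][0]   # raise L to the evicted credit
            self.used -= self.entries[victim][1]
            del self.entries[victim]
        if size <= self.capacity:
            # Small files get larger credits, so they tend to stay cached longer.
            self.entries[file_id] = (self.L + cost / size, size)
            self.used += size
```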
28
Performance
Workloads:
- 8 web proxy logs
- Combined file systems
Parameters: k = 5, b = 4, number of nodes = 2250; node storage sizes drawn from 4 normal distributions.
Without replica and file diversion:
- 51.1% of insertions failed
- 60.8% global utilization
29
Effect of Storage Management
30
Effect of t_pri
t_div = 0.05, t_pri varied.
Lower t_pri: better utilization, more failures.
31
Effect of t_div
t_pri = 0.1, t_div varied.
Trend similar to t_pri.
32
File and Replica Diversions
(Figures: ratio of file diversions vs. utilization; ratio of replica diversions vs. utilization)
33
Distribution of Insertion Failures
(Figures: web proxy logs trace; file system trace)
34
Caching
35
Conclusion