Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung (Google)
Presented by Jaehyun Han
OUTLINE
Introduction
Design Overview
System Interactions
Master Operation
Fault Tolerance and Diagnosis
Measurements
Conclusions
INTRODUCTION
Motivation
◦ Rapidly growing demands of Google's data processing needs
Design choices
◦ Component failures are the norm
◦ Files are huge
◦ Most files are mutated by appending
DESIGN OVERVIEW
ASSUMPTIONS
Built from inexpensive components that often fail
Stores files that are typically 100 MB or larger
Workloads: large streaming reads and small random reads
Many large, sequential writes that append data to files
Atomicity with minimal synchronization overhead is essential (many clients append to the same file concurrently)
High sustained bandwidth is more important than low latency
INTERFACE
Files are organized hierarchically in directories and identified by pathnames
Does not implement a standard API such as POSIX
Usual operations
◦ create, delete, open, close, read, write
GFS-specific operations
◦ snapshot – copy of a file or a directory tree at low cost
◦ record append – append data to the file
ARCHITECTURE
(Figure: GFS architecture – one master, multiple chunkservers, multiple clients)
SINGLE MASTER
(Figure: a client asks the master for chunk locations, then reads file data directly from chunkservers)
CHUNK SIZE
Chunk size is 64 MB – much larger than typical file system block sizes
Advantages
◦ Reduces client-master interaction
◦ Reduces network overhead
◦ Reduces the size of metadata – less than 64 bytes per chunk
Disadvantages
◦ A small file can become a hot spot when many clients access the same file
◦ Not a major issue in practice
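A minimal sketch (hypothetical helper, not GFS code) of why a fixed 64 MB chunk size cuts client-master interaction: the client turns a file offset into a chunk index locally, so it contacts the master once per chunk rather than once per read.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, fixed for all chunks

def to_chunk_coords(file_offset: int) -> tuple[int, int]:
    """Translate a byte offset in a file into (chunk index, offset within chunk)."""
    return file_offset // CHUNK_SIZE, file_offset % CHUNK_SIZE

# Example: a read at byte 200,000,000 falls inside chunk 2
chunk_index, chunk_offset = to_chunk_coords(200_000_000)
print(chunk_index, chunk_offset)  # 2 65782272
```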
METADATA
Three major types of metadata
◦ File and chunk namespaces
◦ Mapping from files to chunks
◦ Locations of each chunk's replicas
All metadata is kept in the master's memory
◦ Master operations are fast and efficient
◦ Less than 64 bytes of metadata per chunk
The master does not store chunk locations persistently – it asks chunkservers at startup and via heartbeats
Operation log
◦ Contains a historical record of critical metadata changes
◦ Recovery by replaying the operation log
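A rough sketch, with hypothetical names, of the three metadata tables the master could keep in memory; the chunk-to-location table is rebuilt from chunkserver reports rather than persisted.

```python
# Hypothetical in-memory metadata tables on the master (names are illustrative).
namespace = {}          # full pathname -> file metadata, e.g. "/foo/bar" -> {...}
file_to_chunks = {}     # full pathname -> ordered list of chunk handles
chunk_locations = {}    # chunk handle -> set of chunkserver addresses (not persisted)

def report_chunks(chunkserver: str, handles: list[str]) -> None:
    """Rebuild chunk locations from a chunkserver's report (at startup or in heartbeats)."""
    for handle in handles:
        chunk_locations.setdefault(handle, set()).add(chunkserver)
```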
CONSISTENCY MODEL
Guarantees by GFS
◦ File namespace mutations (e.g. file creation) are atomic
File region states after a mutation
◦ Consistent – all clients will always see the same data, regardless of which replicas they read from
◦ Defined – consistent, and clients will see what the mutation writes in its entirety
◦ Inconsistent – different clients may see different data at different times
SYSTEM INTERACTIONS
LEASES AND MUTATION ORDER
Leases
◦ Ensure mutations leave data consistent and defined
◦ Minimize management load on the master
Primary chunkserver
◦ Master grants a chunk lease to one replica, the primary
◦ Primary serializes all mutation requests for the chunk
◦ All replicas apply mutations in that order
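A simplified sketch (hypothetical classes, not GFS code) of the ordering idea: the master grants a lease to one replica, and that primary assigns serial numbers that every replica applies in the same order. The 60-second initial lease timeout is the value given in the paper.

```python
import time

LEASE_DURATION = 60  # seconds; the paper uses an initial timeout of 60 s

class Master:
    def __init__(self):
        self.leases = {}  # chunk handle -> (primary address, expiry time)

    def grant_lease(self, handle: str, replicas: list[str]) -> str:
        """Grant (or return the currently valid) lease for a chunk to one replica, the primary."""
        primary, expiry = self.leases.get(handle, (None, 0.0))
        if time.time() >= expiry:
            primary = replicas[0]  # any up-to-date replica may be chosen
            self.leases[handle] = (primary, time.time() + LEASE_DURATION)
        return primary

class PrimaryReplica:
    def __init__(self):
        self.next_serial = 0

    def order_mutation(self, mutation):
        """Assign a serial number; secondaries apply mutations in this serial order."""
        self.next_serial += 1
        return self.next_serial, mutation
```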
LEASES AND MUTATION ORDER
(Figure: write control and data flow among the client, master, primary replica, and secondary replicas)
DATA FLOW
Decouple control flow and data flow
◦ Fully utilize network bandwidth
Each machine forwards the data to the closest machine
◦ Avoids network bottlenecks and high-latency links
Pipeline the data transfer
◦ Minimizes latency
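A toy sketch (hypothetical distance function) of the forwarding rule: each machine pushes the data to the nearest replica that has not yet received it, forming a linear chain along which transfers are pipelined.

```python
def forwarding_chain(source: str, replicas: list[str], distance) -> list[str]:
    """Order replicas into a push chain: each hop goes to the closest remaining machine.

    `distance(a, b)` is a hypothetical function, e.g. estimated from network addresses."""
    chain, current, remaining = [], source, set(replicas)
    while remaining:
        nearest = min(remaining, key=lambda r: distance(current, r))
        chain.append(nearest)
        remaining.remove(nearest)
        current = nearest
    return chain
```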
ATOMIC RECORD APPENDS
Record append: an atomic append operation
◦ Client specifies only the data; GFS chooses the offset
◦ Allows multiple concurrent writers
◦ Data is appended at the same offset on every replica
Failed appends may leave inconsistent regions between successful appends
◦ Applications can detect and skip them, e.g. using checksums in the records
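A simplified sketch (hypothetical function) of the decision the primary makes for a record append: if the record does not fit in the current chunk, the chunk is padded to its end and the client retries on the next chunk; record appends are limited to one quarter of the chunk size to keep padding waste small.

```python
CHUNK_SIZE = 64 * 1024 * 1024
MAX_RECORD = CHUNK_SIZE // 4   # record append is limited to 1/4 of the chunk size

def record_append_decision(chunk_used: int, record_len: int) -> tuple[str, int]:
    """What the primary does with a record append (simplified sketch).

    Returns ("append", offset): every replica writes the record at this offset.
    Returns ("pad_and_retry", pad_len): pad the chunk to its end and have the
    client retry the append on the next chunk."""
    assert record_len <= MAX_RECORD
    if chunk_used + record_len <= CHUNK_SIZE:
        return "append", chunk_used
    return "pad_and_retry", CHUNK_SIZE - chunk_used
```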
SNAPSHOT
Make a copy of a file or a directory tree at low cost
Uses standard copy-on-write techniques
◦ Only metadata is duplicated at snapshot time; chunk data is copied when a shared chunk is first written afterwards
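A minimal copy-on-write sketch (hypothetical structures and naming): the snapshot only duplicates metadata and bumps reference counts; a chunk with a reference count above one is cloned before its first subsequent write.

```python
file_to_chunks = {"/home/user/data": ["chunkA", "chunkB"]}
refcount = {"chunkA": 1, "chunkB": 1}

def snapshot(src: str, dst: str) -> None:
    """Duplicate only the metadata; both files now share the same chunks."""
    file_to_chunks[dst] = list(file_to_chunks[src])
    for handle in file_to_chunks[src]:
        refcount[handle] += 1

def write_chunk(path: str, index: int) -> str:
    """Copy-on-write: clone the chunk before the first write if it is shared."""
    handle = file_to_chunks[path][index]
    if refcount[handle] > 1:
        refcount[handle] -= 1
        new_handle = handle + "_copy"        # hypothetical new chunk handle
        refcount[new_handle] = 1
        file_to_chunks[path][index] = new_handle
        handle = new_handle
    return handle
```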
MASTER OPERATION
NAMESPACE MANAGEMENT AND LOCKING
Namespace
◦ A lookup table mapping full pathnames to metadata
Locking
◦ Many operations can be active at once; each acquires locks over regions of the namespace to ensure proper serialization
◦ Read locks on ancestor directories and a write lock only on the leaf allow concurrent mutations in the same directory
◦ Deadlock is prevented by acquiring locks in a consistent total order
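A small sketch (illustrative helper, not GFS code) of which locks one operation would take: read locks on every ancestor directory, then a write (or read) lock on the full pathname, acquired in a consistent total order.

```python
def lock_plan(pathname: str, write: bool) -> list[tuple[str, str]]:
    """List the namespace locks an operation takes (illustrative sketch).

    Creating /home/user/foo read-locks /home and /home/user and write-locks
    /home/user/foo, so other files in /home/user can be mutated concurrently."""
    parts = pathname.strip("/").split("/")
    ancestors = ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]
    plan = [(p, "read") for p in ancestors] + [(pathname, "write" if write else "read")]
    # Acquire in a consistent total order (by depth, then lexicographically) to prevent deadlock.
    return sorted(plan, key=lambda item: (item[0].count("/"), item[0]))

print(lock_plan("/home/user/foo", write=True))
# [('/home', 'read'), ('/home/user', 'read'), ('/home/user/foo', 'write')]
```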
REPLICA PLACEMENT
Maximize data reliability and availability
Maximize network bandwidth utilization
◦ Spread replicas across machines
◦ Spread chunk replicas across racks
CREATION, RE-REPLICATION, REBALANCING
Creation
◦ New replicas are placed on demand when writers create chunks
Re-replication
◦ Triggered when the number of available replicas falls below a user-specified goal
Rebalancing
◦ Replicas are moved periodically for better disk space and load balancing
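A small sketch (hypothetical scoring, loosely based on the factors the paper lists) of how re-replication candidates might be ordered: the further a chunk is below its replication goal, the sooner it is cloned.

```python
def rereplication_key(chunk) -> tuple:
    """Sort key for the re-replication queue (illustrative only).

    `chunk` is assumed to expose: goal (desired replicas), live (current replicas),
    file_deleted (bool), blocking_client (bool)."""
    return (
        -(chunk.goal - chunk.live),    # most urgent: furthest below its replication goal
        chunk.file_deleted,            # prefer chunks of live (not recently deleted) files
        -chunk.blocking_client,        # boost chunks that are blocking client progress
    )

# chunks.sort(key=rereplication_key)  # clone in this order
```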
GARBAGE COLLECTION
Lazy reclamation
◦ Deletion is logged immediately
◦ The file is renamed to a hidden name carrying the deletion timestamp
◦ Hidden files are removed 3 days later
◦ Undelete by renaming the file back to normal
Regular scan
◦ Heartbeat messages exchanged with each chunkserver report its chunks
◦ Orphaned chunks are identified and their metadata erased
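A minimal sketch (hypothetical naming scheme and tables) of lazy reclamation: delete is just a rename to a hidden name with a timestamp, and a later scan reclaims hidden files whose grace period has expired.

```python
import time

GRACE_PERIOD = 3 * 24 * 3600                 # hidden files are reclaimed after 3 days
namespace = {"/logs/old": "metadata"}        # hypothetical pathname -> metadata table

def delete_file(path: str) -> str:
    """Lazy delete: rename to a hidden name carrying the deletion timestamp."""
    hidden = f"{path}.__deleted__.{int(time.time())}"   # hypothetical naming scheme
    namespace[hidden] = namespace.pop(path)
    return hidden

def namespace_scan(now: float) -> None:
    """Regular scan: reclaim hidden files whose grace period has expired."""
    for name in list(namespace):
        if "__deleted__" in name:
            deleted_at = int(name.rsplit(".", 1)[1])
            if now - deleted_at > GRACE_PERIOD:
                del namespace[name]          # its chunks become orphaned and are GC'd later
```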
STALE REPLICA DETECTION
The master maintains a version number for each chunk
◦ Increased whenever the master grants a new lease on the chunk
◦ Replicas that missed mutations while their chunkserver was down report an old version number and are detected as stale
Stale replicas are removed in the regular garbage collection
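A tiny sketch (hypothetical tables) of the detection rule: any replica reporting an older version number than the master's current one is stale.

```python
master_version = {"chunkA": 7}                       # master's current version numbers
replica_version = {("cs1", "chunkA"): 7,             # (chunkserver, chunk) -> reported version
                   ("cs2", "chunkA"): 6}             # cs2 missed a mutation while down

def stale_replicas(handle: str) -> list[str]:
    """Replicas reporting an older version than the master's are stale."""
    return [cs for (cs, h), v in replica_version.items()
            if h == handle and v < master_version[handle]]

print(stale_replicas("chunkA"))   # ['cs2'] – removed in the regular garbage collection
```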
FAULT TOLERANCE AND DIAGNOSIS
HIGH AVAILABILITY
Fast recovery
◦ Master and chunkservers restore their state and start in seconds
Chunk replication
◦ Different replication levels can be set for different parts of the file namespace
◦ The master clones existing replicas as chunkservers go offline or when corrupted replicas are detected through checksum verification
HIGH AVAILABILITY
Master replication
◦ The operation log and checkpoints are replicated on multiple machines
◦ If the master machine or its disk fails, monitoring infrastructure outside GFS starts a new master process
Shadow masters
◦ Provide read-only access even when the primary master is down
DATA INTEGRITY
Checksums
◦ Used by each chunkserver to detect corruption of stored data
◦ One checksum for every 64 KB block in each chunk
◦ Kept in memory and stored persistently with logging
Read
◦ The chunkserver verifies the checksums of the requested blocks before returning data
Write (append)
◦ Incrementally update the checksum for the last partial block
◦ Compute new checksums for any new blocks filled by the append
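A small sketch of the read-path check (CRC32 is a stand-in; the actual checksum function is not specified here): every 64 KB block overlapping the read range is verified before any data is returned.

```python
import zlib

BLOCK = 64 * 1024   # checksum granularity: one checksum per 64 KB block

def checksum(block: bytes) -> int:
    return zlib.crc32(block)   # stand-in checksum function

def verify_read(chunk: bytes, stored: list[int], offset: int, length: int) -> bytes:
    """Verify the checksums of every block overlapping a read before returning data."""
    first, last = offset // BLOCK, (offset + length - 1) // BLOCK
    for i in range(first, last + 1):
        block = chunk[i * BLOCK:(i + 1) * BLOCK]
        if checksum(block) != stored[i]:
            raise IOError(f"checksum mismatch in block {i}: report to master, read another replica")
    return chunk[offset:offset + length]
```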
DATA INTEGRITY
Write (overwrite)
◦ Read and verify the first and last blocks of the range being overwritten, then perform the write
◦ Compute and record new checksums
During idle periods
◦ Chunkservers scan and verify inactive chunks to catch corruption in rarely read data
MEASUREMENTS
MICRO-BENCHMARKS
GFS cluster
◦ 1 master with 2 master replicas
◦ 16 chunkservers
◦ 16 clients
Machine specification
◦ 1.4 GHz Pentium III, 2 GB RAM
◦ 100 Mbps full-duplex Ethernet
Network
◦ Server machines are connected to one switch, client machines to another
◦ The two switches are connected by a 1 Gbps link
MICRO-BENCHMARKS
Figure 3: Aggregate throughputs. Top curves show theoretical limits imposed by the network topology; bottom curves show measured throughputs, with error bars giving 95% confidence intervals (illegible in some cases because of low variance in the measurements).
REAL WORLD CLUSTERS
Table 2: Characteristics of two GFS clusters
REAL WORLD CLUSTERS
Table 3: Performance metrics for two GFS clusters
REAL WORLD CLUSTERS
Recovery experiments in cluster B
Killed a single chunkserver containing 15,000 chunks (600 GB of data)
◦ All chunks were restored in 23.2 minutes
◦ Effective replication rate of 440 MB/s
Killed two chunkservers, each with roughly 16,000 chunks (660 GB of data)
◦ 266 chunks were left with only a single replica
◦ These were cloned at a higher priority
◦ Restored to at least two replicas within 2 minutes
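As a quick sanity check on the reported rate (assuming 1 GB = 1024 MB): 600 GB restored over 23.2 minutes gives roughly 600 × 1024 MB / (23.2 × 60 s) ≈ 441 MB/s, consistent with the stated effective replication rate of 440 MB/s.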
CONCLUSIONS
GFS demonstrates the qualities essential for supporting large-scale data processing workloads
◦ Treats component failure as the norm
◦ Optimizes for huge files
Fault tolerance is provided by
◦ Constant monitoring
◦ Replicating crucial data
◦ Fast and automatic recovery
◦ Checksums to detect data corruption
Delivers high aggregate throughput to a variety of tasks