Google File System
Robert Nishihara
What is GFS?
  A distributed filesystem for large-scale distributed applications
Setting
  Frequent hardware failures
  Large files
  Most writes are appends
  Most reads are sequential
  Throughput matters more than latency
Architecture
  Files divided into 64 MB “chunks”
  Chunkservers store/write/serve chunks
  Master maps files -> chunks
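As a rough illustration of the chunk layout above, the sketch below shows how a client could translate a byte range in a file into chunk indexes before asking the master where those chunks live. The function names are hypothetical and not part of any GFS API; only the 64 MB chunk size comes from the slides.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, as stated on the slide


def chunk_index(offset: int) -> int:
    """Return the index of the chunk that holds the given byte offset."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    return offset // CHUNK_SIZE


def chunks_for_range(offset: int, length: int) -> list[tuple[int, int, int]]:
    """Split a byte range into (chunk index, offset in chunk, byte count) pieces."""
    if length < 0:
        raise ValueError("length must be non-negative")
    pieces = []
    end = offset + length
    while offset < end:
        start_in_chunk = offset % CHUNK_SIZE
        # Read up to the end of this chunk or the end of the range.
        count = min(CHUNK_SIZE - start_in_chunk, end - offset)
        pieces.append((chunk_index(offset), start_in_chunk, count))
        offset += count
    return pieces
```

A read that straddles a chunk boundary becomes two pieces, one per chunk, so a large sequential read needs only a few lookups at the master.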
Design Decisions
  Large chunks (64 MB)
    – Pro: fewer client/master interactions
    – Pro: less metadata
  No caching of file data
  Writes optimized for “appends”
  Single master => simpler design, decisions made with global knowledge
  Master metadata stored in memory
    – Pro: master operations are fast
    – Con: limits number of files
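To make the "less metadata" point concrete, here is a back-of-the-envelope sketch of how much master memory chunk metadata might need. The 64 bytes per chunk figure is an assumption for illustration and does not appear in the slides; per-file namespace metadata is ignored.

```python
CHUNK_SIZE = 64 * 2**20          # 64 MB chunks, from the slides
METADATA_BYTES_PER_CHUNK = 64    # assumed upper bound, not from the slides


def chunk_metadata_bytes(total_data_bytes: int, chunk_size: int = CHUNK_SIZE) -> int:
    """Estimate master memory needed to track the chunks of the stored data."""
    num_chunks = -(-total_data_bytes // chunk_size)  # ceiling division
    return num_chunks * METADATA_BYTES_PER_CHUNK


one_pib = 2**50
print(chunk_metadata_bytes(one_pib) / 2**30)           # GiB with 64 MB chunks
print(chunk_metadata_bytes(one_pib, 2**20) / 2**30)    # GiB with 1 MB chunks
```

Under these assumptions, 1 PiB of data needs about 1 GiB of chunk metadata with 64 MB chunks, versus about 64 GiB with 1 MB chunks. Every file occupies at least one chunk, so many small files inflate the chunk count, which is one way to read the "limits number of files" con.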
Fault Tolerance
  Chunks replicated (3x by default)
  Master state replicated (both logs and checkpoints)
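The pairing of logs and checkpoints can be sketched as follows: a replica rebuilds master state by loading the latest checkpoint and replaying the log records written after it. The state layout (path -> list of chunk IDs) and the operation names are hypothetical, chosen only to keep the example small.

```python
def recover(checkpoint: dict[str, list[int]], log: list[tuple]) -> dict[str, list[int]]:
    """Rebuild a file -> chunk-ID mapping from a checkpoint plus later log records."""
    # Copy the checkpoint so replay does not mutate the saved snapshot.
    state = {path: list(chunks) for path, chunks in checkpoint.items()}
    for op, path, *args in log:
        if op == "create":          # hypothetical record: new empty file
            state[path] = []
        elif op == "add_chunk":     # hypothetical record: file gains a chunk
            state[path].append(args[0])
        elif op == "delete":        # hypothetical record: file removed
            state.pop(path, None)
        else:
            raise ValueError(f"unknown log operation: {op}")
    return state


checkpoint = {"/a": [1]}
log = [("create", "/b"), ("add_chunk", "/b", 7), ("add_chunk", "/a", 2)]
print(recover(checkpoint, log))
```

Checkpointing periodically keeps the log short, so replay (and therefore recovery) stays fast.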
Consistency
  Namespace mutations (e.g., file creation) are atomic
  Relaxed guarantees for file data (“inconsistent” regions may be interspersed between “consistent” ones)
  Clients can handle de-duplication
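One way a client could handle de-duplication is to have writers tag each record with a unique ID and have readers drop IDs they have already seen. The sketch below assumes records arrive as (record ID, payload) pairs; that record format is an assumption for illustration, not something the slides specify.

```python
from typing import Iterable, Iterator


def deduplicate(records: Iterable[tuple[str, bytes]]) -> Iterator[tuple[str, bytes]]:
    """Yield each record once, keeping the first occurrence of every record ID."""
    seen: set[str] = set()
    for record_id, payload in records:
        if record_id in seen:
            continue  # a retried append left a duplicate copy; skip it
        seen.add(record_id)
        yield record_id, payload


# A retried append has written record "r1" twice.
stream = [("r1", b"alpha"), ("r2", b"beta"), ("r1", b"alpha")]
print(list(deduplicate(stream)))
```

Pushing this work to the reader is what lets the filesystem offer relaxed guarantees on appends without forcing every application to tolerate duplicates.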
Conclusion
  GFS is a filesystem designed for large-scale distributed applications
  Optimized for appends and sequential reads
  Fault tolerance via replication, monitoring, fast recovery, checksumming