The Google File System
Presenter: Gladon Almeida
Authors: Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung
Year: October 2003
Google File System 1 4/9/2013

Key Design Considerations
• Component failures are the norm rather than the exception.
• Files are huge by traditional standards.
• Most files are mutated by appending new data rather than overwriting existing data.
• The applications and the file system API are co-designed.

Assumptions
• The system is built from inexpensive commodity hardware that often fails.
• A modest number of large files (typically ~100 MB) will be stored. Small files are supported but not optimized for.
• Reads: large streaming reads and small random reads.
• Writes: large sequential writes that append data to files.
• The system must efficiently implement well-defined semantics for multiple clients that concurrently append to the same file.
• High sustained bandwidth is more important than low latency.

Architecture
• Single master and multiple chunk servers.
• Each is a typical commodity Linux machine running a user-level server process.
• Files are divided into fixed-size chunks, each identified by a globally unique 64-bit chunk handle.
• Typically 3 replicas of each chunk are spread over different chunk servers.
• The master maintains all file system metadata:
    ◦ Namespace
    ◦ Access control information
    ◦ Mapping from files to chunks
    ◦ Current locations of chunks
• The master also controls system-wide activities such as:
    ◦ Lease management
    ◦ Garbage collection of orphaned chunks
    ◦ Chunk migration between chunk servers
    ◦ Periodic handshake (heartbeat) messages with chunk servers
• GFS client code implements the file system API and communicates with the master and the chunk servers.

Architecture
Figure: GFS Architecture

Design Overview
• Chunk size (64 MB, large by file system standards):
    ◦ Reduces client-master interactions
    ◦ Reduces network overhead
    ◦ Reduces the size of the metadata on the master
• Metadata:
    ◦ The namespace and the file-to-chunk mappings are kept persistent by logging mutations to an operation log on the master's disk, which is also replicated to remote machines
• Operation log:
    ◦ Historical record of critical metadata changes
    ◦ Defines the order of concurrent operations
    ◦ Critical, so it is replicated on multiple remote machines
    ◦ Changes are first made persistent locally and remotely, and only then made visible to clients
• Fast recovery:
    ◦ Load the latest checkpoint, then replay the operation log written after it
    ◦ Checkpoints are stored in a compact B-tree-like form; creating one takes about a minute for a few million files
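
To make the fixed chunk size concrete, here is a minimal sketch of how a client could translate a byte offset in a file into a chunk index before asking the master for that chunk's handle and locations. The function names `locate` and `chunks_touched` are illustrative only and do not come from GFS.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the chunk size given on the slide


def locate(byte_offset: int) -> tuple[int, int]:
    """Return (chunk index within the file, offset inside that chunk)."""
    if byte_offset < 0:
        raise ValueError("byte_offset must be non-negative")
    return divmod(byte_offset, CHUNK_SIZE)


def chunks_touched(byte_offset: int, length: int) -> range:
    """Chunk indexes covered by an access of `length` bytes at `byte_offset`."""
    if length <= 0:
        return range(0)
    first, _ = locate(byte_offset)
    last, _ = locate(byte_offset + length - 1)
    return range(first, last + 1)
```

A 1 GB sequential read touches only 16 chunks, so the client needs very few master lookups, which is the first benefit listed for the large chunk size.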

Consistency Model
• Consistent region: all clients always see the same data, regardless of which replica they read from.
• Defined region: after a file data mutation, the region is consistent and clients see what the mutation wrote in its entirety.

Consistency Model (contd.)
• After a sequence of successful mutations, the mutated file region is guaranteed to be defined. GFS achieves this by:
    1. Applying mutations to a chunk in the same order on all its replicas
    2. Using chunk version numbers to detect any replica that has become stale
• What if a client caches a stale chunk location?
    ◦ The window is limited by the cache entry's timeout
    ◦ Most files are append-only, so a stale replica usually returns a premature end of chunk rather than outdated data
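
The version-number check can be sketched as follows. This is a simplified model, not GFS code: the class `ChunkRecord` and its methods are hypothetical, and it assumes (as the GFS paper describes) that the master bumps a chunk's version whenever it grants a new lease and informs only the replicas that are reachable at that moment.

```python
from dataclasses import dataclass, field


@dataclass
class ChunkRecord:
    """Master-side view of one chunk (hypothetical structure)."""
    version: int = 0
    replicas: dict[str, int] = field(default_factory=dict)  # server -> version held

    def grant_lease(self, live_servers: list[str]) -> None:
        # Bump the version and record it only for replicas that are up.
        self.version += 1
        for server in live_servers:
            self.replicas[server] = self.version

    def stale_replicas(self) -> list[str]:
        # Any replica that missed a version bump missed mutations too.
        return sorted(s for s, v in self.replicas.items() if v < self.version)
```

For example, if server `c` is down during the second lease grant, `stale_replicas()` reports `["c"]`, and the master would stop handing that location to clients.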

System Interactions: Leases and Mutation Order
• Leases are used to maintain a consistent mutation order across replicas.
• The master grants a chunk lease to one of the replicas, called the primary.
• The primary picks a serial order for all mutations to the chunk.
• A lease has an initial timeout of 60 seconds, which can be extended.
• Extension requests are piggybacked on the heartbeat messages between the master and the chunk servers.
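
A minimal sketch of the lease bookkeeping described above, assuming a single chunk and an explicit `now` argument instead of a real clock. The names `Lease`, `extend`, and `next_serial` are hypothetical and chosen only for illustration.

```python
LEASE_TIMEOUT = 60.0  # seconds, the initial timeout given on the slide


class Lease:
    """Lease held by the primary replica of one chunk (illustrative only)."""

    def __init__(self, primary: str, now: float):
        self.primary = primary
        self.expires_at = now + LEASE_TIMEOUT
        self._serial = 0

    def is_valid(self, now: float) -> bool:
        return now < self.expires_at

    def extend(self, now: float) -> bool:
        # Models an extension request piggybacked on a heartbeat.
        if not self.is_valid(now):
            return False
        self.expires_at = now + LEASE_TIMEOUT
        return True

    def next_serial(self, now: float) -> int:
        # Only a primary with a valid lease may order mutations.
        if not self.is_valid(now):
            raise RuntimeError("lease expired; primary must stop ordering mutations")
        self._serial += 1
        return self._serial
```

Because every mutation receives its serial number from the one lease holder, all replicas can apply mutations in the same order without involving the master on each write.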

Control and Data Flow for a Write

Decoupling of Data and Control Flow
• Control flow:
    ◦ Client -> Primary replica -> Secondary replicas (the client first learns the replica locations from the master)
• Data flow:
    ◦ Decoupled from the control flow to use the network efficiently
    ◦ Data is pushed linearly along a chain of chunk servers in a pipelined fashion
    ◦ Each machine forwards the data to the closest machine that has not yet received it (distance is estimated from IP addresses)
    ◦ Each machine's outbound bandwidth is fully used, since the data is not split across a tree of recipients
    ◦ Pipelining relies on a switched network with full-duplex links
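
The benefit of pipelining can be quantified with the estimate given in the GFS paper. The symbols below are the paper's: B is the number of bytes transferred, R the number of replicas, T the network throughput, and L the latency between two machines. The numeric example assumes the paper's figures of 100 Mbps links and a latency far below 1 ms.

```latex
t_{\text{ideal}} = \frac{B}{T} + R\,L
\qquad\text{e.g.}\qquad
\frac{1\ \text{MB}}{100\ \text{Mbps}} = \frac{8 \times 10^{6}\ \text{bit}}{10^{8}\ \text{bit/s}} = 80\ \text{ms},
\quad R\,L \ll 80\ \text{ms}
```

So under these assumptions 1 MB reaches all replicas in roughly 80 ms, nearly the time it takes to send it to one.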

Atomic Record Appends (record append)
• Concurrent appends are serialized by GFS
• The client specifies only the data, not the offset
• GFS uses an append-at-least-once-atomically policy
• GFS appends the data at an offset of its choosing and returns that offset to the client
• Heavily used for:
    ◦ Multiple-producer / single-consumer queues
    ◦ Merging results from many different clients concurrently
• On failure, the client retries the operation
• GFS does not guarantee that all replicas of a chunk are bytewise identical; it only guarantees that the data is written at least once as an atomic unit
• Regions written by successful operations: defined
• Intervening regions: undefined
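
The at-least-once behaviour and its consequence for readers can be sketched as below. This is a deliberately simplified single-replica model: a failed attempt is simulated as one where the data landed but the client saw an error, and records are assumed to carry a unique identifier in their payload so that readers can drop duplicates. `Chunk`, `append_with_retry`, and `read_unique` are hypothetical names.

```python
class Chunk:
    """Toy chunk: the chunk, not the client, chooses the append offset."""

    def __init__(self):
        self.records = []      # list of (offset, payload)
        self.next_offset = 0

    def record_append(self, payload: bytes) -> int:
        offset = self.next_offset
        self.records.append((offset, payload))
        self.next_offset += len(payload)
        return offset


def append_with_retry(chunk, payload, failed_attempts=0, max_attempts=5):
    """Retry until an attempt is acknowledged; earlier attempts leave duplicates."""
    for attempt in range(max_attempts):
        offset = chunk.record_append(payload)
        if attempt >= failed_attempts:   # this attempt was acknowledged
            return offset
    raise RuntimeError("record append failed after retries")


def read_unique(chunk):
    """Reader-side filtering: keep the first copy of each record."""
    seen, out = set(), []
    for _, payload in chunk.records:
        if payload not in seen:
            seen.add(payload)
            out.append(payload)
    return out
```

The returned offset always points at a complete copy of the record, while the earlier partial or duplicate attempts are what the slide calls intervening regions.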

Snapshots
• Why?
    ◦ Makes a copy of a file or directory tree almost instantaneously
• Uses a copy-on-write technique
• Steps when a snapshot request is received:
    1. Revoke any outstanding leases on the chunks involved
    2. Log the operation to disk
    3. Duplicate the metadata for the source file or directory tree
• When a write request to one of these chunks is received:
    ◦ The master notices that the chunk's reference count is greater than 1
    ◦ A copy of the chunk is created locally on the chunk server that holds it
    ◦ The other replicas are told to do the same
    ◦ The new chunk handle is returned to the client
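
The copy-on-write bookkeeping can be sketched at the metadata level as follows. This is an illustrative model only: it tracks handles and reference counts on a hypothetical `Master` class and leaves out leases, logging, and the actual data copy on the chunk servers.

```python
import itertools


class Master:
    """Toy master metadata: file -> chunk handles, plus reference counts."""

    def __init__(self):
        self.files = {}      # path -> list of chunk handles
        self.refcount = {}   # chunk handle -> number of files referencing it
        self._handles = itertools.count(1)

    def create(self, path, n_chunks):
        handles = [next(self._handles) for _ in range(n_chunks)]
        for h in handles:
            self.refcount[h] = 1
        self.files[path] = handles

    def snapshot(self, src, dst):
        # Duplicate metadata only; both files now share the same chunks.
        self.files[dst] = list(self.files[src])
        for h in self.files[src]:
            self.refcount[h] += 1

    def handle_for_write(self, path, index):
        h = self.files[path][index]
        if self.refcount[h] > 1:
            # Shared chunk: allocate a new handle (chunk servers would copy
            # the data locally here) and point only this file at it.
            self.refcount[h] -= 1
            new = next(self._handles)
            self.refcount[new] = 1
            self.files[path][index] = new
            return new
        return h
```

Only the first write to a shared chunk pays for a copy; the snapshot itself touches no chunk data, which is why it is almost instantaneous.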

Master Operations: Namespace Management and Locking
• No per-directory data structure
• No support for aliases (hard or symbolic links)
• The namespace is a lookup table mapping full path names to metadata
• Each node in the namespace tree (file or directory) has an associated read-write lock
• Why locks?
    ◦ They let many master operations run concurrently while keeping conflicting ones properly serialized
• Example: to lock /d1/d2/d3/leaf for write
    ◦ Read locks on /d1, /d1/d2, /d1/d2/d3
    ◦ Write lock on /d1/d2/d3/leaf
• Example: how the mechanism prevents a file /home/user/foo from being created while /home/user is being snapshotted to /save/user
    ◦ Snapshot:
        - Read locks on /home, /save
        - Write locks on /home/user, /save/user
    ◦ Create:
        - Read locks on /home, /home/user
        - Write lock on /home/user/foo
    ◦ The two operations conflict on /home/user, so they are serialized
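
The locking rule and the snapshot/create example can be checked with a small sketch. It only computes which locks each operation would request and where they conflict; it does not implement real locking. The helper names `locks_for`, `merge`, and `conflicts` are hypothetical.

```python
def locks_for(path: str, leaf_mode: str = "write") -> dict[str, str]:
    """Read locks on every ancestor, `leaf_mode` lock on the full path."""
    parts = path.strip("/").split("/")
    prefixes = ["/" + "/".join(parts[:i]) for i in range(1, len(parts) + 1)]
    locks = {p: "read" for p in prefixes[:-1]}
    locks[prefixes[-1]] = leaf_mode
    return locks


def merge(*lock_sets: dict[str, str]) -> dict[str, str]:
    """Combine lock sets for one operation; a write lock wins over a read lock."""
    merged: dict[str, str] = {}
    for lock_set in lock_sets:
        for path, mode in lock_set.items():
            if mode == "write" or path not in merged:
                merged[path] = mode
    return merged


def conflicts(a: dict[str, str], b: dict[str, str]) -> list[str]:
    """Paths locked by both operations where at least one lock is a write."""
    return sorted(p for p in a.keys() & b.keys() if "write" in (a[p], b[p]))


snapshot = merge(locks_for("/home/user"), locks_for("/save/user"))
create = locks_for("/home/user/foo")
print(conflicts(snapshot, create))  # ['/home/user']
```

Both operations hold only a read lock on /home, so that path does not conflict; the clash is confined to /home/user, where the snapshot's write lock blocks the create's read lock.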

Policies:
• Chunk replica placement:
    1. Maximize data reliability and availability
    2. Maximize network bandwidth utilization
    3. Spread replicas across machines as well as racks
• New chunk creation:
    1. Place new replicas on chunk servers with below-average disk utilization
    2. Limit the number of "recent" creations on each chunk server
    3. Spread the replicas of a chunk across racks
• Re-replication:
    ◦ Starts as soon as the number of available replicas falls below a user-specified goal
    ◦ Priority considers:
        1. How far is the chunk from its replication goal?
        2. Is the chunk blocking a client?
        3. Does the chunk belong to a live file?
• Occasionally rebalance replicas
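
One way to turn the three re-replication factors into an ordering is sketched below. The slide lists the factors but not how they are combined, so the lexicographic ordering used here (replica deficit first, then client blocking, then file liveness) is an assumption, and `ChunkState` and its fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ChunkState:
    handle: int
    live_replicas: int
    goal: int = 3                  # default replication level
    blocking_client: bool = False
    file_is_live: bool = True      # False for chunks of recently deleted files


def priority_key(c: ChunkState):
    # Smaller key = re-replicate sooner. Ordering of factors is assumed.
    deficit = c.goal - c.live_replicas
    return (-deficit, not c.blocking_client, not c.file_is_live)


def rereplication_order(chunks: list[ChunkState]) -> list[int]:
    needy = [c for c in chunks if c.live_replicas < c.goal]
    return [c.handle for c in sorted(needy, key=priority_key)]
```

With this key, a chunk that has lost two replicas is always handled before one that has lost a single replica, and among equals a chunk that blocks a client goes first.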

Garbage Collection
• Lazy garbage collection
• Steps when a file is deleted:
    1. Log the deletion immediately
    2. Rename the file to a hidden name (it is removed after 3 days)
    3. During regular namespace scans, remove hidden files that are older than that
• Master's regular scans of the chunk namespace:
    ◦ Identify orphaned chunks and erase their metadata
    ◦ This information is exchanged with the chunk servers in heartbeat messages
    ◦ The chunk servers then delete these chunks
• Stale replica detection:
    ◦ Uses chunk version numbers
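
The lazy deletion steps can be sketched as below, assuming an explicit `now` value in seconds rather than a real clock. The hidden-name format, the `Namespace` class, and `heartbeat_reply` are hypothetical; only the three-day grace period and the overall flow come from the slide.

```python
GRACE_SECONDS = 3 * 24 * 3600  # hidden files are kept for 3 days


class Namespace:
    def __init__(self):
        self.files = {}    # name -> list of chunk handles
        self.hidden = {}   # hidden name -> (deleted_at, chunk handles)

    def delete(self, name, now):
        # Deletion is only a rename; the chunks stay referenced for now.
        self.hidden[f".deleted.{name}"] = (now, self.files.pop(name))

    def scan(self, now):
        # Regular scan: drop hidden files whose grace period has passed.
        expired = [n for n, (t, _) in self.hidden.items() if now - t >= GRACE_SECONDS]
        for n in expired:
            del self.hidden[n]

    def live_handles(self):
        handles = set()
        for hs in self.files.values():
            handles.update(hs)
        for _, hs in self.hidden.values():
            handles.update(hs)
        return handles


def heartbeat_reply(ns, reported_handles):
    """Handles a chunk server holds that the master no longer references."""
    return sorted(set(reported_handles) - ns.live_handles())
```

Until the scan removes the hidden file, the reply is empty and the file can still be recovered; afterwards the chunk server is told which handles it is free to delete.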

Fault Tolerance:
• High availability:
    ◦ Fast recovery: servers are designed to restart fast
    ◦ Chunk replication:
        - Different replication levels for different parts of the namespace
        - Default level = 3
    ◦ Master replication:
        - The operation log and checkpoints are replicated remotely
        - The master process is monitored externally
        - On failure, a new master process is started from the remotely saved checkpoint and logs
        - Clients reach the master through a canonical name
        - Shadow masters
• Data integrity:
    ◦ Each chunk is broken into 64 KB blocks, and each block has a 32-bit checksum
    ◦ The chunk server verifies the checksum before returning data, so corruption does not propagate
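
Block-level checksumming can be sketched as follows. The slide only says that each 64 KB block has a 32-bit checksum; using CRC-32 from Python's `zlib` is an assumption made for illustration, as are the function names.

```python
import zlib

BLOCK_SIZE = 64 * 1024  # 64 KB blocks, as on the slide


def checksums(data: bytes) -> list[int]:
    """One 32-bit checksum per 64 KB block (CRC-32 is an assumed choice)."""
    return [zlib.crc32(data[i:i + BLOCK_SIZE]) for i in range(0, len(data), BLOCK_SIZE)]


def verified_read(data: bytes, sums: list[int], offset: int, length: int) -> bytes:
    """Verify every block that overlaps the requested range before returning it."""
    if length <= 0:
        return b""
    first = offset // BLOCK_SIZE
    last = (offset + length - 1) // BLOCK_SIZE
    for b in range(first, last + 1):
        block = data[b * BLOCK_SIZE:(b + 1) * BLOCK_SIZE]
        if zlib.crc32(block) != sums[b]:
            raise IOError(f"checksum mismatch in block {b}")
    return data[offset:offset + length]
```

A read that spans a block boundary verifies both blocks, and a mismatch raises an error instead of returning data, so a corrupted replica never hands bad bytes to a client or to another chunk server.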

Relating it to CSCI-572
• GFS led to the development of the Hadoop Distributed File System (HDFS)
• HDFS is ideal for large workloads that can use the MapReduce framework to achieve a high degree of parallelism on commodity hardware
• Ideal for search engine workloads such as:
    ◦ Crawling
    ◦ Generating an inverted index
    ◦ PageRank calculation, etc.
• Other:
    ◦ Large-scale machine learning problems
    ◦ Clustering problems
    ◦ Large-scale graph computation

Pros and Cons
• Pros:
    ◦ The assumptions stated at the beginning of the paper are later backed up by experimental results
    ◦ A very important paper: it led to the development of the Hadoop Distributed File System (HDFS)
• Cons:
    ◦ The paper discusses only workloads dominated by sequential reads and file appends; GFS is not well suited to random reads and writes
    ◦ The authors do not provide any performance results for random reads and writes

Conclusion:
• GFS supports large-scale data processing workloads on commodity hardware
• The design decisions are specific to Google's needs, but many may apply to data processing tasks of a similar magnitude and cost consciousness
• Component failures are the norm rather than the exception
• Optimization priority:
    1. Concurrent appends
    2. Reads
• Fault tolerance through monitoring, replication, and fast recovery
• High aggregate throughput to many concurrent readers
