1 The Google File System
Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung (Google)
Vijay Kumar

2 Outline
Introduction
Goals
Architecture
Operations Supported
Master Operations
Performance
Conclusion
12/3/2018

3 File System for Google A new distributed file system developed to meet the demands of Google's application workloads and technological environment Shares many of the same goals as other distributed file systems
The Google File System (GFS) is a file system developed to serve the needs of applications at Google and to work efficiently in its technological environment. It is distributed in nature and highly scalable. Even though its design differs from existing file systems, it shares many of the same goals as other distributed file systems. Let us look at the nature of the application workloads and the technological environment at Google that necessitated the design of a new file system.

4 Technological environment
Clusters of more than 15,000 commodity-class PCs Hardware prone to failure Fault-tolerant software Source: "Web Search for a Planet: The Google Cluster Architecture"
Google's technological environment consists of large clusters. These clusters are made up of cheap commodity-class PCs rather than highly expensive, powerful servers. The PCs are prone to failure, so fault tolerance has to be achieved in software. A query issued to Google is sent to the 'Google Web Server' shown in the diagram, which is a set of servers, one of which is chosen to process the query. Each of the 'Index Servers' shown in the figure is a cluster of machines responsible for storing a part of the inverted index built by the Google crawler. They compare the terms in the query with the inverted index entries and return the docids of the matching documents. The results from the various 'Index Servers' are merged to produce the final ranking of the docids. These are then sent to the 'Document Servers', a set of clusters containing the documents, which return the URL and the contents of the matching documents. The snippet matching the query is then formed. In the meantime, the ad server and the spell checker do their jobs. So the new file system has to work in an environment where failures occur often and where large amounts of data have to be stored and accessed.

5 Application Workloads
Sequential reads: indexer reading the contents of web pages
Frequent appends: crawler appending new pages
Files used as producer-consumer queues: indexer waits for the crawler to retrieve contents
Files used for multi-way merging

6 GFS – Motivation Component failures – norm rather than exception
Efficient management of large files Optimization of frequently performed operations Flexibility of co-designing application and file system
Having looked at the technological environment and the nature of the application workloads, let us look at what motivated Google to develop a new file system rather than use an existing one. The clusters used at Google are made of commodity PCs, and failures are expected; so the file system shouldn't treat machine failures as an exceptional case, and it should recover from them quickly without performance degradation. The files handled by applications at Google are much larger than usual files, and the number of files is huge, so the file system must be capable of efficient file management. The file system should also provide optimal support for commonly performed operations; the most common operation on the files is the append. The data that Google usually works with, namely the contents of web pages, changes frequently, so Google found it wise to store the difference (diff) between the current version of a file and the new version at the end of the file, rather than writing the changes into the file. Support for this operation was therefore needed with the least overhead. Finally, co-designing the application and the file system API offers flexibility, and Google wanted to capitalize on this by designing a new file system rather than using an existing one.

7 GFS - Goals Reliability, availability, scalability…
Tolerance to hardware failures Managing numerous files of large size Optimizing commonly performed operations
The motivations that we just discussed led to these goals for the new file system. It had the goals of achieving reliability, availability, and scalability like other distributed file systems. It had to be tolerant to hardware failures, which were very common: it had to detect failures and recover from them quickly. It also had to manage many large files, much larger than those supported by traditional distributed file systems. Finally, the file system had to be optimized for data appends, because this operation was much more frequent than random writes.

8 GFS Architecture Single master Multiple chunkservers Multiple clients
Master stores metadata Chunkservers store data Communication via heartbeat messages
GFS consists of a single master and multiple chunkservers connected to it. The chunkservers are commodity Linux machines running a user-level server process. GFS is accessed by multiple clients. The master stores just the metadata and not the data; it is the chunkservers that store the data. Communication between the master and the chunkservers takes place via periodic heartbeat messages.

9 GFS Architecture (contd)
Files are broken into fixed-size chunks Chunks are identified by unique chunk handles Chunks are stored in chunkservers Fixed chunk size: 64 MB
The files stored in GFS are broken down into chunks of a fixed size, 64 MB. Each chunk has a 64-bit unique identifier (the chunk handle). The chunks are stored in the chunkservers as Linux files. Thus the contents of the files are distributed across the chunkservers, and the master just stores data about where the data is stored (metadata). The master is accessed by a client only to find out which chunkserver contains the data it is interested in. The client caches the information returned by the master, and for subsequent accesses it doesn't contact the master. Clients cache only the metadata, not the file data read from the chunkservers, because their working sets are large and they mostly work with streaming data. The master communicates with all the chunkservers by exchanging heartbeat messages, which serve several purposes as we shall see. The advantages of a large chunk size are: i) reduced client-master interaction, ii) a persistent TCP connection to the chunkserver, and iii) less metadata to be stored in the master.
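Because the chunk size is fixed, translating a file byte offset into a chunk index is simple arithmetic. A minimal sketch (illustrative helper functions, not GFS code):

```python
CHUNK_SIZE = 64 * 1024 * 1024  # fixed 64 MB chunk size

def chunk_index(byte_offset: int) -> int:
    """Chunk index within the file for a given byte offset."""
    return byte_offset // CHUNK_SIZE

def chunk_range(byte_offset: int, length: int) -> list:
    """All chunk indices touched by a read of `length` bytes at `byte_offset`."""
    first = chunk_index(byte_offset)
    last = chunk_index(byte_offset + length - 1)
    return list(range(first, last + 1))
```

A 20-byte read straddling the first chunk boundary touches chunks 0 and 1, so the client would ask the master about both.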

10 GFS Operations File open, close … Data reads Data Mutations Snapshot
Random writes Record appends Snapshot
Having looked at the architecture of GFS, let us look at the operations supported by GFS and how GFS actually implements a few of them. The file system API provided by GFS is not POSIX compliant. It supports the usual operations like file create, delete, open, close, read, and write. Apart from these, it supports the record append operation and the snapshot operation. Let us see what these operations are and how they work, after looking at how some of the usual operations like read and write are implemented.

11 Data Read To perform a read, the client converts the byte offset into a chunk index within the file, using the fixed chunk size. It passes the file name and the chunk index to the master and asks which chunkservers contain the required chunk. The master replies with the chunk handle and the names of the chunkservers containing the chunk. The client then contacts one of the chunkservers, usually the one closest to it, and specifies the chunk handle and the byte range that it is interested in reading. The chunkserver accesses its local file system, retrieves the data, and sends it to the requesting client. As we can see, the master is not involved in the actual data transfer. Such a design is the key to scalability. Source: Uma Murthy's presentation on GFS, Fall '04
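The read path just described can be sketched in a few lines. This is a toy model: the `Master` and `Client` classes and the in-memory chunkserver tables are invented for illustration, not GFS's actual interfaces.

```python
CHUNK_SIZE = 64 * 1024 * 1024

class Master:
    """Holds only metadata: (file name, chunk index) -> (chunk handle, replica list)."""
    def __init__(self, chunk_table):
        self.chunk_table = chunk_table

    def lookup(self, filename, index):
        return self.chunk_table[(filename, index)]

class Client:
    def __init__(self, master, chunkservers):
        self.master = master
        self.chunkservers = chunkservers  # server name -> {handle: chunk bytes}
        self.cache = {}                   # clients cache metadata, never file data

    def read(self, filename, offset, length):
        index = offset // CHUNK_SIZE
        key = (filename, index)
        if key not in self.cache:         # contact the master only on a cache miss
            self.cache[key] = self.master.lookup(filename, index)
        handle, replicas = self.cache[key]
        data = self.chunkservers[replicas[0]][handle]  # data flows client <-> chunkserver
        start = offset % CHUNK_SIZE
        return data[start:start + length]
```

Note that the master appears only in the metadata lookup; the bytes themselves move directly between client and chunkserver, which is what keeps the single master from becoming a bottleneck.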

12 Data Mutations Two kinds of data mutations are supported
Random writes Record appends Record appends much more frequent than random writes Concurrent appends are common Leases used to maintain consistent mutation order
Two types of data mutations are supported. The conventional write specifies the offset and the data to be written; the record append specifies just the data, and GFS chooses the offset at which the data is appended. Leases are used to maintain a consistent mutation order among the replicas. The master grants a lease to one of the chunkservers, the primary chunkserver, which picks a serial order for mutations to the chunk; the other replicas follow that order. A lease is usually granted for 60 seconds but can be extended. After a data mutation completes, the state of the region that was written to depends on several factors: the type of mutation (write or append), whether the mutation succeeded or failed, and whether there were other concurrent mutations.
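The lease bookkeeping on the master can be sketched as follows. The 60-second term is from the talk; the class itself and its policy for picking a primary are invented for illustration.

```python
import time

class LeaseManager:
    """Master-side lease bookkeeping: at most one primary per chunk at a time."""
    LEASE_SECS = 60  # typical lease term mentioned in the talk

    def __init__(self):
        self.leases = {}  # chunk handle -> (primary server, expiry time)

    def grant(self, handle, replicas, now=None):
        now = time.time() if now is None else now
        primary, expiry = self.leases.get(handle, (None, 0.0))
        if now < expiry:
            return primary      # an unexpired lease stays with its primary
        primary = replicas[0]   # pick a new primary (selection policy is a detail)
        self.leases[handle] = (primary, now + self.LEASE_SECS)
        return primary
```

While the lease is valid, every mutation to that chunk funnels through the same primary, which is what gives all replicas a single serial order to follow.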

13 Data Mutations (contd)
Relaxed consistency model to support applications Relatively simple and efficient to implement GFS guarantees Atomic file namespace mutations State of a file region after a data mutation depends on type of mutation, success or failure, presence or absence of concurrent mutations
In order to support the nature of the operations performed by the applications (especially record appends), GFS has a relaxed consistency model, which is also simple and efficient to implement. GFS guarantees only atomic file namespace mutations, and atomic appends when they succeed. It doesn't guarantee that the chunk replicas are byte-by-byte identical. As we will see after record append is explained, this model makes concurrent record appends easier, with no synchronization needed among the different clients, and it determines how record-append failures are dealt with.

14 Data Mutations (contd)
Consistent file region: all clients see the same data, regardless of the replica they read from Defined file region: it is consistent, and clients see what the mutation writes in its entirety Applications have to deal with the relaxed consistency
A file region can be in different states. A file region is consistent if all clients see the same data irrespective of the chunkserver they read it from. A file region is defined after a mutation if it is consistent and all clients can see what the mutation has written in its entirety. The states for the different cases are:
Serial write, success: defined
Concurrent writes, success: consistent but undefined
Record append, success (serial or concurrent): defined, interspersed with inconsistent regions
Any mutation, failure: inconsistent
For a write, if it is serial (only one client writing to the region at a time) and it succeeds, the region is defined. For concurrent writes that succeed, the region is consistent but undefined, because it may contain fragments from different writes. For record appends, the region is defined in both the serial and the concurrent case, but the defined regions are interspersed with inconsistent ones. If the mutation fails, the region is inconsistent. Applications are expected to cope with the relaxed consistency; they do this by writing self-identifying records.

15 Data Writes Client requests for lease holder and secondary replicas
Master responds Client pushes out data to the replicas Client issues write instruction to primary replica
The client asks the master which chunkserver holds the current lease for the chunk (the primary) and where the other replicas are. The master replies with the identity of the primary and the locations of the secondary replicas. The client then pushes the data out to all the replicas, which buffer it. Once all replicas acknowledge receiving the data, the client sends a write request to the primary.

16 Data Writes (contd) Primary forwards write request to other replicas
Secondaries respond to primary reporting the status of the write Primary reports the status to the client
The primary assigns the mutation a serial number, applies it locally, and forwards the write request to all the secondary replicas, which apply the mutation in the same serial order. The secondaries reply to the primary reporting the status of the write, and the primary reports the overall status to the client. If the write fails at any replica, the error is reported to the client, the modified region is left inconsistent, and the client retries the mutation.

17 Record Appends Appends a record to the file atomically (i.e., as a continuous sequence of bytes) Appended at an offset chosen by the primary Pads chunks if the append is expected to cross the chunk boundary; secondaries do so as well A new chunk has to be allocated in this case Offset returned to the client on success Client retries on failure
Record append is the second type of mutation that GFS supports. It appends a record to the file atomically, meaning as a continuous sequence of bytes. It is similar to the write operation, but since the offset is not specified, the primary has to choose the offset at which the record is appended. The primary checks whether appending the record would cause the chunk to exceed its maximum size. If so, the primary pads the chunk to the maximum size, instructs the secondaries to do the same, and replies to the client that the operation has to be retried on a new chunk. To limit the space wasted by padding, record sizes are restricted to at most one fourth of the chunk size. If the record fits in the chunk, the primary appends it and tells the secondaries the exact offset at which it appended, and they append the record at the same location. If the append succeeds at all the replicas, the primary returns the offset to the client, whereas if it fails, the client retries. The contents of the chunk at a replica where the append failed would not be exactly the same as at the other replicas; this is why failure leads to inconsistent regions. When the client retries, the record append may succeed; once it does, the region that was written to is defined, and the regions written to by the failed attempts are inconsistent.
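The primary's decision for each append can be captured in a small function. This is a sketch of the logic described above, not GFS's implementation; the return-value convention is invented.

```python
CHUNK_SIZE = 64 * 1024 * 1024
MAX_RECORD = CHUNK_SIZE // 4  # records capped at 1/4 chunk to bound padding waste

def primary_append(chunk_used: int, record_len: int):
    """What the primary does with an append to a chunk holding `chunk_used` bytes."""
    if record_len > MAX_RECORD:
        return ("reject", None)
    if chunk_used + record_len > CHUNK_SIZE:
        # Pad the chunk to its maximum size (secondaries do the same) and
        # have the client retry on the next chunk.
        return ("pad_and_retry", CHUNK_SIZE - chunk_used)
    return ("append", chunk_used)  # offset chosen by the primary
```

The 1/4-chunk cap bounds the worst case: at most a quarter of a chunk is ever lost to padding when an append would cross the boundary.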

18 Snapshot Makes a copy of a file or directory tree
Based on the copy-on-write technique Minimal overhead involved
This operation makes a copy of a file or a directory tree. It takes little time and minimizes interruptions to ongoing mutations. The copy-on-write technique is used to implement snapshot. The master first revokes the leases on the chunks of the files involved in the snapshot, so that future writes to them have to go through the master. The master then duplicates the metadata for the file or directory tree being snapshotted; the newly created metadata points to the same chunks as the original. When a client later wants to write to a chunk that was involved in the snapshot, the master instructs the chunkservers holding that chunk to create a local copy of it and assigns a new chunk handle to the copy. Future operations are performed on the new chunk. Since the chunks are copied locally on the chunkservers, this is efficient.
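The copy-on-write bookkeeping can be sketched with reference counts on chunk handles. This is a toy model; the class, its fields, and integer handles are invented for illustration.

```python
class SnapshotMeta:
    """Toy copy-on-write bookkeeping: file -> chunk handles, with reference counts."""
    def __init__(self):
        self.files = {}     # filename -> list of chunk handles
        self.refcount = {}  # chunk handle -> number of files referencing it
        self._next = 0

    def create(self, name, nchunks):
        self.files[name] = []
        for _ in range(nchunks):
            self.refcount[self._next] = 1
            self.files[name].append(self._next)
            self._next += 1

    def snapshot(self, src, dst):
        # Duplicate metadata only: both files now point at the same chunks.
        self.files[dst] = list(self.files[src])
        for h in self.files[dst]:
            self.refcount[h] += 1

    def write(self, name, i):
        # First write to a shared chunk triggers a (local) copy with a new handle.
        h = self.files[name][i]
        if self.refcount[h] > 1:
            self.refcount[h] -= 1
            h = self._next
            self._next += 1
            self.refcount[h] = 1
            self.files[name][i] = h
        return h
```

Snapshot itself touches no chunk data at all; the copying cost is deferred to the first write of each shared chunk, and that copy happens locally on the chunkserver.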

19 Master Operation Master is responsible for Metadata management
Namespace management Replica management Garbage Collection
The duties of the master can be broadly classified as metadata management, namespace management, replica management, and garbage collection. Metadata management involves storing and updating the metadata in the master's memory and maintaining the operation log. Let us look at each of these duties in turn.

20 Metadata Management Master stores the following metadata
File and chunk namespaces: stored persistently File-to-chunk mapping: stored persistently Chunk replica locations: chunkservers queried at startup Operation log: historic record of metadata changes Replicated on multiple remote machines Size kept small by checkpointing
The master stores the metadata needed for the operation of GFS. It stores the file and chunk namespaces; the namespace is stored very efficiently and hence doesn't occupy much space. The mapping between files and chunks is also stored in the master. The namespace and the mapping have to be stored persistently: only the master has this information, and it must not be lost if the master crashes. The third kind of metadata is the locations of the chunk replicas. This is not stored persistently, as the chunkservers can be queried for it at startup, which also reduces overhead. All the metadata is kept in the master's main memory; the metadata for the file namespace takes less than 64 bytes per file, and about 64 bytes are maintained per chunk. Storing the metadata in memory has many advantages: master operations are very fast, and the master can easily scan the metadata in the background, which is needed for many operations as we shall see. The operation log is a historic record of metadata changes. It also identifies file and chunk versions by the logical times at which they were created. The master responds to a client request only after flushing the corresponding log record to disk both locally and remotely. Since the log is critical to the operation of GFS, it is replicated on multiple machines. To keep its size small and minimize startup time, periodic checkpointing is done. At startup the master loads the latest checkpoint and replays the events in the log after it, so the smaller the log, the lower the startup overhead.
So the master checkpoints its state periodically, and the checkpoint is also stored locally and remotely. It creates a new log to record the events that occur after the checkpoint. Thus checkpointing reduces the size of the active operation log: we need only the latest checkpoint and the log of events occurring after it.
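Recovery is then "load checkpoint, replay the short tail of the log". A minimal sketch, with an invented `(op, key, value)` log-record format standing in for the real one:

```python
def recover(checkpoint: dict, log_entries: list) -> dict:
    """Rebuild master state: load the latest checkpoint, then replay the
    (shorter) operation log recorded after it."""
    state = dict(checkpoint)
    for op, key, value in log_entries:
        if op == "set":
            state[key] = value
        elif op == "delete":
            state.pop(key, None)
    return state
```

Because only records newer than the checkpoint are replayed, startup time is bounded by the checkpoint interval rather than by the full history of metadata changes.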

21 Name Space Management Does not have a per directory data structure
No support for aliases Namespace represented as a lookup table Files and directories have associated read and write locks
GFS does not maintain any per-directory data structure, and it doesn't support aliases either. It represents its namespace as a lookup table, and with prefix compression this table can be represented efficiently in memory. As some master operations take a long time, we don't want to delay operations that work on data not being used by them, so locks over the namespace are used. Each file and directory has a read and a write lock associated with it, and each master operation acquires the appropriate locks before it runs. An operation on /d1/d2/.../dn/leaf requires read locks on /d1, /d1/d2, ..., /d1/d2/.../dn, and either a read or a write lock on /d1/d2/.../dn/leaf, depending on the operation. (See the example in the paper.) These locks are acquired in a particular order to prevent deadlocks: they are ordered by level in the namespace tree and lexicographically within a level.
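The lock set and its deadlock-free ordering can be computed directly from the path. A sketch of the rule just stated (the function and tuple representation are invented for illustration):

```python
def locks_needed(path: str, write: bool) -> list:
    """Locks for an operation on `path`: read locks on every ancestor directory
    and a read or write lock on the leaf, ordered by namespace-tree level and
    then lexicographically so all operations acquire locks in the same order."""
    parts = path.strip("/").split("/")
    locks = [("read", "/" + "/".join(parts[:i])) for i in range(1, len(parts))]
    locks.append(("write" if write else "read", path))
    return sorted(locks, key=lambda lock: (lock[1].count("/"), lock[1]))
```

Because every operation sorts its lock set the same way before acquiring, two concurrent operations can never hold locks in opposite orders, which rules out deadlock.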

22 Replica Management Chunk replicas created for
New chunk creation Chunk re-replication Chunk rebalancing Chunk replica placement based on Maximizing data reliability and availability Maximizing network utilization Chunkserver's disk utilization, count of recent creations, and rack position affect decisions
Chunk replicas are created for three reasons: when new chunks are created for files; when the number of available replicas falls below a limit (re-replication); and when replicas are rebalanced periodically based on disk space and load. In all cases, the new replicas are placed on chunkservers so as to maximize data reliability and availability. To maximize these, it is not sufficient to spread the chunks across different machines; the replicas are also spread across racks, which increases reliability and availability. When creating new replicas, the disk utilization of the chunkservers is considered, and new replicas are placed on those with low disk utilization. Chunkservers with the fewest recent chunk creations are also preferred, since heavy write traffic is expected soon after a new replica is placed on a chunkserver. As just mentioned, rack positions also play a part in the decision.
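A placement decision combining these three signals can be sketched as a ranking. The server records and the exact scoring are invented for illustration; GFS's real policy is not specified at this level of detail.

```python
def place_replicas(servers, n=3):
    """Choose n chunkservers for a new chunk: prefer low disk utilization and
    few recent chunk creations, and never put two replicas on the same rack."""
    chosen, used_racks = [], set()
    # Lower disk utilization, then fewer recent creations, ranks better.
    ranked = sorted(servers, key=lambda s: (s["disk_util"], s["recent_creations"]))
    for s in ranked:
        if s["rack"] in used_racks:
            continue  # spread replicas across racks for reliability
        chosen.append(s["name"])
        used_racks.add(s["rack"])
        if len(chosen) == n:
            break
    return chosen
```

Avoiding servers with many recent creations matters because a freshly created chunk is usually about to receive heavy write traffic, so piling new replicas onto one server would concentrate that load.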

23 Garbage Collection Reclamation of physical storage
Files renamed to hidden names upon deletion File metadata deleted during namespace scan Physical storage freed during exchange of heartbeat messages Stale Replica Detection Stale chunks detected using chunk version numbers
Files that are deleted are renamed to a hidden name, and their storage is not reclaimed immediately. The master logs the deletion, and the deletion time is recorded. During its regular scan of the file system namespace, the master looks for files that have remained hidden for a specified time and removes them from the namespace. Until then, the deleted file can still be read under the new name and can be undeleted. Once a file is removed from the namespace, it loses its metadata. During the regular scan of the chunk namespace, the master identifies orphaned chunks and instructs the chunkservers to delete them. Such lazy deletion has many advantages: it is simple and reliable in a large-scale distributed system where failures are common; storage reclamation is merged with regular background activity; and it provides safety against accidental file deletions. Chunk replicas may become stale if a chunkserver fails during a mutation. Such stale replicas are identified using the chunk version number. When the master grants a new lease, it increases the chunk version number and informs the up-to-date replicas of the new number. The version numbers are stored persistently in the master and the chunkservers. The master detects stale replicas when a chunkserver that was down reports its set of chunks and their version numbers. Stale chunks are removed during regular garbage collection. The master also includes the version number in the reply it sends to a client requesting a write, so the chunkservers can check version numbers when the client sends them the write request.
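The stale-replica check itself is a simple version comparison. A sketch (the reporting format is invented; the rule is the one described above):

```python
def find_stale_replicas(master_version: int, reported: dict) -> list:
    """Replicas reporting a chunk version older than the master's are stale:
    they missed at least one mutation while their chunkserver was down."""
    return [server for server, version in reported.items()
            if version < master_version]
```

Stale replicas found this way are not repaired in place; they are simply left to the regular garbage-collection pass, which keeps the failure handling uniform.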

24 GFS Goals revisited Availability Scalability Fast recovery
Chunk replication Master replication Scalability Keeping master's involvement limited in data transactions
Availability is achieved by fast recovery of the master and the chunkservers: they store their state and load it upon restarting. Chunk replication, which we talked about, increases availability. To support the file system during failures of the primary master, there is a shadow master, which provides read-only access to the metadata. It reads the operation log and applies the same changes, and it also polls the chunkservers and exchanges messages with them. It depends on the primary master only for replica location updates.

25 GFS Goals revisited (contd)
Fault Tolerance Replication, constant monitoring, fast recovery Data Integrity Checksumming used to detect corruption Optimization for frequent operations Relaxed consistency model
Each of these goals is achieved as discussed: fault tolerance through replication, constant monitoring, and fast recovery; data integrity through checksumming, which detects corruption of stored data; and optimization of the frequent operations (record appends) through the relaxed consistency model.

26 Performance Micro benchmarks
Aggregate read rate: 75% of theoretical limit Aggregate write rate: 50% of theoretical limit
The performance was measured in the following setup: a cluster of 1 master, 2 master replicas, 16 chunkservers, and 16 clients. All machines were configured with dual 1.4 GHz PIII processors, 2 GB of memory, two 80 GB 5400 rpm disks, and a 100 Mbps full-duplex Ethernet connection to an HP 2524 switch. All 19 GFS server machines were connected to one switch and the clients to another; the two switches were connected by a 1 Gbps link.
Reads: N clients simultaneously read from the file system. Each client read a randomly selected 4 MB region from a 320 GB file set, repeated 256 times so that each client read 1 GB of data. The graph shows that the aggregate read rate reached around 94 MB/s, which is around 75% of the theoretical limit, for 16 clients. The efficiency drops because many clients are trying to read simultaneously from the same chunkserver.
Writes: N clients write simultaneously to N files. Each client writes 1 GB of data in a series of 1 MB writes. The aggregate write rate reaches 35 MB/s, which is about 50% of the theoretical limit. The reason the authors give is the same as for reads: multiple clients try to write to the same chunkserver. Another reason they give is that their network stack does not interact well with the pipelining scheme used for pushing data. But this does not significantly affect aggregate write rates in practice, so it does not bother Google.
Record appends: N clients simultaneously append to a single file. Since it is a record append, performance is limited by the network bandwidth of the chunkservers that store the last chunk of the file. But this does not matter in practice, since usually many clients write to many files, and a client can make progress on one file while the chunkservers of other files are busy.
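The 75% and 50% figures follow from the hardware numbers above. This back-of-the-envelope reconstruction of the accounting matches the rates quoted in the talk, though the exact derivation here is a sketch:

```python
# Back-of-the-envelope accounting for the micro-benchmark limits.
clients = 16
link_MBps = 125.0     # 1 Gbps inter-switch link = 125 MB/s
machine_MBps = 12.5   # 100 Mbps NIC = 12.5 MB/s per machine

# Reads: 16 clients could pull 16 * 12.5 = 200 MB/s, but all traffic crosses
# the 1 Gbps inter-switch link, so the limit is 125 MB/s.
read_limit = min(clients * machine_MBps, link_MBps)
read_efficiency = 94.0 / read_limit    # observed 94 MB/s

# Writes: each byte is sent to 3 replicas, sharing 16 chunkservers' input links.
chunkservers, replicas = 16, 3
write_limit = chunkservers * machine_MBps / replicas   # about 67 MB/s
write_efficiency = 35.0 / write_limit  # observed 35 MB/s
```

The write limit is lower than the read limit purely because of replication: every byte the clients write consumes input bandwidth on three chunkservers.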

27 Performance (contd) Real-world clusters
The tables in the paper report measurements from two clusters in use at Google: storage used, metadata sizes, read and write rates, and the load on the master.

28 Performance (contd) Workload breakdown
The tables in the paper break down the workload by operation type and size: the distribution of operations and bytes transferred for reads, writes, and record appends, and the breakdown of requests to the master.

29 Conclusion Demonstrates qualities needed to support large-scale data processing workloads on commodity hardware Delivers high throughput Successfully meets Google's storage needs


