The Google File System by S. Ghemawat, H. Gobioff, and S-T. Leung
CSCI 485 lecture by Shahram Ghandeharizadeh, Computer Science Department, University of Southern California

Primary Functionality of Google

Search content on the web in browsing mode.
Open world assumption: If your search with Google does not return results, it does not mean that the referenced content is non-existent. It only means that Google did not know about it when the search was issued.
 Google may retrieve results if the search is issued again.
 Do not index/find me: Google provides tags that enable an information provider to prevent Google from indexing its pages.
 No one gets angry if Google does not retrieve information known to exist on the Internet.
How is this different from financial applications?

Functionality
IR:
Search content on the web in browsing mode.
Open world assumption: If your search with Google does not return results, it does not mean that the referenced content is non-existent. It only means that Google did not know about it when the search was issued.
 Google may retrieve results if the search is issued again.
 Do not index/find me: Google provides tags that enable an information provider to prevent Google from indexing its pages.
 No one gets angry if Google does not retrieve information known to exist on the Internet.
DB:
Query based: Looking for a needle in a haystack.
Closed world assumption: A data item that is not known does not exist.
 A query must retrieve correct results 100% of the time!
 If a customer insists the bank cannot find his or her account because the customer has used Google's "do not find me" tags, the customer is kicked out!
 Customers become angry if the system retrieves incorrect data.

Key Observation
IR:
Okay to return no results or incorrect results.
Acceptable for a user search to observe stale data.
DB:
Not okay to return incorrect results.
A transaction must observe consistent data.
SQL front end.

Big Picture
A shared-nothing architecture consisting of thousands of nodes!
 A node is an off-the-shelf, commodity PC.
Layered software stack: Google File System, Google's Bigtable Data Model, Google's Map/Reduce Framework, Yahoo's Pig Latin, …

Big Picture
Shared-nothing architecture consisting of thousands of nodes!
 A node is an off-the-shelf, commodity PC.
Layered software stack: Google File System, Google's Bigtable Data Model, Google's Map/Reduce Framework, Yahoo's Pig Latin, …
Divide & Conquer

Big Picture
Source code for Pig and Hadoop is available for free download.
Layered software stack: Google File System, Google's Bigtable Data Model, Google's Map/Reduce Framework, Yahoo's Pig Latin, …
Open-source counterparts: Hadoop (for GFS and Map/Reduce) and Pig (for Pig Latin).

Data Shipping
Client retrieves data from the node.
Client performs computation locally.
Limitation: servers stay dumb, and shipping the raw data consumes the limited network bandwidth.
[Diagram: the node transmits its data to the client, which then processes f(x) locally.]

Function Shipping
Client ships the function to the node for processing.
Only the relevant data (the output of the function) is sent to the client.
Function f(x) should produce less data than the original data stored in the database.
Minimizes demand for network bandwidth.
[Diagram: the node processes f(x) locally and transmits only its output to the client.]
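The Python sketch below contrasts the two strategies under toy assumptions (a list of in-memory records stands in for the node's data, and string length stands in for bytes on the wire); it is an illustration, not Google code.

```python
# Toy illustration (not GFS code): data shipping vs. function shipping.
# The records and the byte accounting are hypothetical.

def data_shipping(node_records, f):
    """Client pulls all records over the network, then applies f locally."""
    shipped = list(node_records)                      # everything crosses the network
    bytes_moved = sum(len(str(r)) for r in shipped)
    return f(shipped), bytes_moved

def function_shipping(node_records, f):
    """Client ships f to the node; only f's output crosses the network."""
    result = f(node_records)                          # computed at the node
    bytes_moved = len(str(result))                    # only the (smaller) result is shipped
    return result, bytes_moved

if __name__ == "__main__":
    records = [{"id": i, "clicks": i % 7} for i in range(10_000)]
    f = lambda rs: sum(r["clicks"] for r in rs)       # aggregates to a single number

    r1, b1 = data_shipping(records, f)
    r2, b2 = function_shipping(records, f)
    assert r1 == r2
    print(f"data shipping moved ~{b1} bytes, function shipping ~{b2} bytes")
```

When f(x) is an aggregate like the one above, function shipping moves orders of magnitude fewer bytes, which is the point made on the slide.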

Google
Application (configured with a GFS client) may run on the same PC as the one hosting a chunkserver. Requirements:
 Machine resources are not overwhelmed.
 The lower reliability is acceptable.

References
Pig Latin
 Olston et al. Pig Latin: A Not-So-Foreign Language for Data Processing. In SIGMOD 2008.
Map Reduce
 Dean and Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM, Vol. 51, No. 1, January 2008.
Bigtable
 Chang et al. Bigtable: A Distributed Storage System for Structured Data. In OSDI 2006.
GFS
 Ghemawat et al. The Google File System. In SOSP 2003.

Overview: GFS
A highly available, distributed file system for inexpensive commodity PCs.
 Treats node failures as the norm rather than the exception.
 Stores and retrieves multi-GB files.
 Assumes files are append-only (instead of updates that modify existing data in place).
 Atomic append operation enables multiple clients to append to a file with minimal synchronization.
 Relaxed consistency model simplifies the file system and enhances performance.

Google File System: Assumptions

Google File System: Assumptions (Cont…)

GFS: Interfaces
Create, delete, open, close, read, and write files.
Snapshot a file:
 Create a copy of the file.
Record append operation:
 Allows multiple clients to append data to the same file concurrently, while guaranteeing the atomicity of each individual client's append.
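A minimal sketch of what a client-facing API with these operations might look like; the class and method names are hypothetical and do not reflect the real GFS client library.

```python
# Hypothetical client-facing API mirroring the operations listed above.
# Names and signatures are illustrative only.
from typing import Protocol

class FileHandle:
    """Opaque handle returned by open(); fields are illustrative."""
    def __init__(self, path: str) -> None:
        self.path = path

class GFSClient(Protocol):
    def create(self, path: str) -> None: ...
    def delete(self, path: str) -> None: ...
    def open(self, path: str) -> FileHandle: ...
    def close(self, handle: FileHandle) -> None: ...
    def read(self, handle: FileHandle, offset: int, length: int) -> bytes: ...
    def write(self, handle: FileHandle, offset: int, data: bytes) -> None: ...
    # Snapshot: create a copy of a file at low cost.
    def snapshot(self, source_path: str, dest_path: str) -> None: ...
    # Record append: GFS picks the offset, writes the record at least once
    # atomically, and returns the chosen offset to the caller.
    def record_append(self, handle: FileHandle, record: bytes) -> int: ...
```

Note that, unlike write(), record_append() takes no offset: the return value tells the client where GFS placed the record.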

GFS: Architecture
1 Master.
Multiple chunkservers.
A file is partitioned into fixed-size chunks.
Each chunk has a 64-bit chunk handle that is globally unique.
Each chunk is replicated on several chunkservers.
 Degree of replication is application specific; default is 3.
Software:
 Master maintains all file system metadata: namespace, access control info, mapping from files to chunks, current location of chunks.
 GFS client caches metadata about the file system.
 Client receives data from a chunkserver directly.
 Client and chunkserver do not cache file data.

GFS: Architecture
1 Master.
Multiple chunkservers.
A file is partitioned into fixed-size (64 MB) chunks.
Each chunk has a 64-bit chunk handle that is globally unique.
Each chunk is replicated on several chunkservers.
 Degree of replication is application specific; default is 3.
Software:
 Master maintains all file system metadata: namespace, access control info, mapping from files to chunks, current location of chunks.
 GFS client caches metadata about the file system.
 Client receives data from a chunkserver directly; the client chooses one of the replicas.
 Client and chunkserver do not cache file data.

GFS: Architecture
1 Master.
Multiple chunkservers.
A file is partitioned into fixed-size (64 MB) chunks.
 Unix allocates space lazily, so a large chunk size does not waste disk space.
 Many small logical files are stored in one file.
Each chunk has a 64-bit chunk handle that is globally unique.
Each chunk is replicated on several chunkservers.
 Degree of replication is application specific; default is 3.
Software:
 Master maintains all file system metadata: namespace, access control info, mapping from files to chunks, current location of chunks.
 GFS client caches metadata about the file system.
 Client receives data from a chunkserver directly; the client chooses one of the replicas.
 Client and chunkserver do not cache file data.
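The sketch below illustrates the read path implied by this architecture: the client turns a byte offset into a chunk index, asks the master for the chunk handle and replica locations, and then reads from one replica. The in-memory tables and the pick-a-replica rule are stand-ins for the real master state.

```python
# Illustrative read path: client -> master (metadata only) -> chunkserver (data).
# The dictionaries and the replica-selection rule are hypothetical.
import random

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB fixed-size chunks

# Master metadata: file name -> list of chunk handles (one per 64 MB region),
# and chunk handle -> chunkservers holding a replica (default degree 3).
file_to_chunks = {"/logs/crawl-00": ["handle-0001", "handle-0002"]}
chunk_locations = {
    "handle-0001": ["cs-a", "cs-b", "cs-c"],
    "handle-0002": ["cs-b", "cs-d", "cs-e"],
}

def read(path: str, offset: int, length: int) -> list[tuple[str, str, int, int]]:
    """Return (chunk handle, chosen replica, chunk-local offset, bytes to read) per chunk touched."""
    plan = []
    while length > 0:
        index = offset // CHUNK_SIZE                        # which chunk the offset falls in
        handle = file_to_chunks[path][index]                # master: (file, index) -> handle
        replica = random.choice(chunk_locations[handle])    # client picks one replica
        local_off = offset % CHUNK_SIZE
        n = min(length, CHUNK_SIZE - local_off)             # stop at the chunk boundary
        plan.append((handle, replica, local_off, n))
        offset, length = offset + n, length - n
    return plan

# An 8 MB read starting at 60 MB spans the boundary between the two chunks.
print(read("/logs/crawl-00", 60 * 1024 * 1024, 8 * 1024 * 1024))
```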

GFS Master
A single master simplifies the software design.
Master monitors the availability of chunkservers using heart-beat messages.
A single master is a single point of failure. Its metadata:
 File and chunk namespaces,
 Mapping from files to chunks,
 Location of each chunk's replicas.
Master does not store chunk location information persistently: when the master is started (and whenever a chunkserver joins), it asks each chunkserver about its chunks.
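A small sketch of how the master can rebuild chunk locations by polling chunkservers instead of persisting them; the inventory dictionaries stand in for heart-beat replies, and all names are made up.

```python
# Illustrative: the master does not persist chunk locations; it asks each
# chunkserver at startup (and when one joins) and refreshes via heartbeats.
from collections import defaultdict

# Simulated chunkserver inventories (what each server would report).
chunkserver_inventory = {
    "cs-a": {"handle-0001", "handle-0007"},
    "cs-b": {"handle-0001", "handle-0002"},
    "cs-c": {"handle-0002"},
}

def rebuild_chunk_locations(servers: dict[str, set[str]]) -> dict[str, set[str]]:
    """Invert each chunkserver's report into a handle -> replica-set map."""
    locations: dict[str, set[str]] = defaultdict(set)
    for server, handles in servers.items():     # stands in for a heart-beat round-trip
        for handle in handles:
            locations[handle].add(server)
    return locations

locations = rebuild_chunk_locations(chunkserver_inventory)
print(locations["handle-0001"])   # e.g. {'cs-a', 'cs-b'}
```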

Mutation = Update
A mutation is an operation that changes the contents of either metadata (delete or create a file) or a chunk (append a record).
Content mutation:
 Performed on all of a chunk's replicas.
 Master grants a chunk lease to one of the replicas, the primary.
 Primary picks a serial order for all mutations to the chunk.
Lease:
 Granted by the master, typically for 60 seconds.
 Primary may request extensions.
 If the master loses communication with a primary, it can safely grant a new lease to another replica after the current lease expires.
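The sketch below models the lease bookkeeping described above (60-second grants, extensions, and re-granting only after expiry); the class layout and clock handling are assumptions, not the paper's implementation.

```python
# Illustrative lease bookkeeping for one chunk. The master grants a lease to a
# primary replica and re-grants only after the old lease has expired.
import time

LEASE_SECONDS = 60.0   # typical grant

class ChunkLease:
    def __init__(self) -> None:
        self.primary = None
        self.expires_at = 0.0

    def grant(self, replica: str, now: float) -> bool:
        """Grant (or re-grant) the lease if no unexpired lease is held elsewhere."""
        if self.primary is not None and self.primary != replica and now < self.expires_at:
            return False                          # must wait out the old primary's lease
        self.primary, self.expires_at = replica, now + LEASE_SECONDS
        return True

    def extend(self, replica: str, now: float) -> bool:
        """The primary may request extensions before its lease expires."""
        if replica != self.primary or now >= self.expires_at:
            return False
        self.expires_at = now + LEASE_SECONDS
        return True

lease = ChunkLease()
now = time.time()
assert lease.grant("cs-a", now)               # cs-a becomes primary
assert not lease.grant("cs-b", now + 10)      # still within cs-a's lease
assert lease.grant("cs-b", now + 61)          # safe to re-grant after expiry
```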

Master & Logging
Master stores 3 types of metadata:
 1. The file and chunk namespaces,
 2. Mapping from files to chunks,
 3. Locations of each chunk's replicas.
First two types are kept persistent by:
 Logging mutations (updates) to an operation log stored on the master's local disk,
 Replicating the operation log on multiple machines.
 What is required to support logging?

Master & Logging
Master stores 3 types of metadata:
 1. The file and chunk namespaces,
 2. Mapping from files to chunks,
 3. Locations of each chunk's replicas.
First two types are kept persistent by:
 Logging mutations (updates) to an operation log stored on the master's local disk,
 Replicating the operation log on multiple machines.
 What is required to support logging?
 Uniquely identify transactions and data items.
 Checkpointing.

Master & Logging
Master stores 3 types of metadata:
 1. The file and chunk namespaces,
 2. Mapping from files to chunks,
 3. Locations of each chunk's replicas.
First two types are kept persistent by:
 Logging mutations (updates) to an operation log stored on the master's local disk,
 Replicating the operation log on multiple machines.
 Files and chunks, as well as their versions, are uniquely identified by the logical times at which they were created.
 GFS responds to a client operation only after flushing the log record to disk both locally and remotely.
 On failure, during the recovery phase, the master recovers its file system state by replaying the operation log.
 Checkpoints are fuzzy.
 The master maintains a few older checkpoints and log files, deleting the prior ones.
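A minimal sketch of the write-ahead operation log: append a record, "flush" it locally and to replica logs before acknowledging, and replay the log to rebuild the namespace. The record format and mutation types are hypothetical; a real checkpoint would periodically compact the log so replay stays short.

```python
# Illustrative operation log: flush locally and to remote replicas before
# acknowledging the client; replay the log to recover the namespace.
import json

class OperationLog:
    def __init__(self, replicas: list) -> None:
        self.local: list[str] = []        # stands in for the master's local disk
        self.replicas = replicas          # stands in for remote log replicas

    def append(self, mutation: dict) -> None:
        record = json.dumps(mutation)
        self.local.append(record)         # "flush" locally...
        for r in self.replicas:
            r.append(record)              # ...and remotely, before replying to the client

    def replay(self) -> dict:
        """Rebuild the (file -> chunk handles) namespace from the log."""
        namespace: dict = {}
        for record in self.local:
            m = json.loads(record)
            if m["op"] == "create":
                namespace[m["path"]] = []
            elif m["op"] == "add_chunk":
                namespace[m["path"]].append(m["handle"])
            elif m["op"] == "delete":
                namespace.pop(m["path"], None)
        return namespace

log = OperationLog(replicas=[[], []])
log.append({"op": "create", "path": "/logs/crawl-00"})
log.append({"op": "add_chunk", "path": "/logs/crawl-00", "handle": "handle-0001"})
print(log.replay())   # {'/logs/crawl-00': ['handle-0001']}
```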

Master & Locking
Namespace management:
 GFS represents its namespace as a lookup table mapping full pathnames to metadata.
 /d1/d2/…/dn/fileA consists of the following pathnames:
 /d1
 /d1/d2
 …
 /d1/d2/…/dn
 /d1/d2/…/dn/fileA

Master & Locking
Namespace management:
 GFS represents its namespace as a lookup table mapping full pathnames to metadata.
 Each node in the namespace tree has an associated read-write lock.
 Each master operation acquires a set of locks before it performs its read/mutation operation:
 Typically, an operation involving /d1/d2/…/dn/fileA will acquire read locks on /d1, /d1/d2, …, /d1/d2/…/dn and either a read or a write lock on /d1/d2/…/dn/fileA.
 A read lock is the same as a Shared (S) lock.
 A write lock is the same as an eXclusive (X) lock.

Example
Operation 1:
 Copy directory /home/user to /save/user
Operation 2:
 Create /home/user/foo

Example
Operation 1:
 Copy directory /home/user to /save/user
Operation 2:
 Create /home/user/foo
Could they have used IS and IX locks? (See the lock-set sketch below.)
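The sketch below computes the lock sets for the two operations, assuming hypothetical helper functions; it shows why Operation 1's write lock on /home/user serializes it with Operation 2, which needs a read lock on the same path.

```python
# Illustrative namespace locking: read-lock every ancestor directory path and
# read- or write-lock the full pathname itself.
def ancestors(path: str) -> list[str]:
    """'/d1/d2/fileA' -> ['/d1', '/d1/d2'] (every proper prefix)."""
    parts = path.strip("/").split("/")
    return ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]

def locks_needed(path: str, write: bool) -> list[tuple[str, str]]:
    """Lock set a master operation on `path` would acquire."""
    locks = [(p, "read") for p in ancestors(path)]
    locks.append((path, "write" if write else "read"))
    return locks

# Operation 1 (copy /home/user to /save/user) write-locks both /home/user and
# /save/user; Operation 2 (create /home/user/foo) only read-locks /home/user
# and write-locks the new file name.
op1 = locks_needed("/home/user", write=True) + locks_needed("/save/user", write=True)
op2 = locks_needed("/home/user/foo", write=True)
print("Operation 1:", op1)
print("Operation 2:", op2)
# The write lock on /home/user (op1) conflicts with the read lock on
# /home/user (op2), so the two operations are properly serialized.
```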

Atomic Record Appends
Background:
 With traditional writes, a client specifies the offset at which data is to be written.
 GFS cannot serialize concurrent writes to the same region.
With record append, the client specifies only the data. GFS appends the record to the file at least once atomically, at an offset of GFS's choosing, and returns that offset to the client.
What does "atomically" mean?

Atomic Record Appends
Background:
 With traditional writes, a client specifies the offset at which data is to be written.
 GFS cannot serialize concurrent writes to the same region.
With record append, the client specifies only the data. GFS appends the record to the file at least once atomically, at an offset of GFS's choosing, and returns that offset to the client.
What does "atomically" mean?
 The record is written as one sequence of bytes.
Does GFS write the record partially?

Atomic Record Appends
Background:
 With traditional writes, a client specifies the offset at which data is to be written.
 GFS cannot serialize concurrent writes to the same region.
With record append, the client specifies only the data. GFS appends the record to the file at least once atomically, at an offset of GFS's choosing, and returns that offset to the client.
What does "atomically" mean?
 The record is written as one sequence of bytes.
Does GFS write the record partially?
 Yes, a record might be written partially to a file replica.

Atomic Record Appends
How?
 Discuss how regular chunk mutations are supported.

Updates

Atomic Record Appends: How?
Client:
 Pushes data to all replicas of the last chunk of the file.
 Sends its write request to the primary.
Primary appends the data to its replica and tells the secondaries to write the data at the exact offset where it has written. If all secondaries succeed, the primary replies success to the client.
If a record append fails at any replica, the primary reports an error and the client retries the operation.
 One or more of the replicas may have succeeded fully (or written partially) → replicas of the same chunk may contain different data, including duplicates of the same record.
 GFS does not guarantee that all replicas are bytewise identical. GFS guarantees that the record is written at the same offset at least once in its entirety (as an atomic unit).
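The sketch below simulates one failed secondary and a client retry, with replicas modeled as byte arrays; it is only meant to show how retries can leave duplicates or padding on some replicas while the record still lands in its entirety at one common offset on all of them.

```python
# Illustrative record append with one injected secondary failure and a retry.
# Replicas are plain byte arrays; real GFS also handles leases, chunk padding, etc.
RECORD = b"[record-A]"

class Replica:
    def __init__(self, name: str) -> None:
        self.name = name
        self.data = bytearray()
        self.fail_next = False

    def write_at(self, offset: int, payload: bytes) -> bool:
        if self.fail_next:                            # injected failure for the demo
            self.fail_next = False
            return False
        if len(self.data) < offset:                   # pad up to the chosen offset
            self.data.extend(b"\x00" * (offset - len(self.data)))
        self.data[offset:offset + len(payload)] = payload
        return True

def record_append(primary: Replica, secondaries: list, payload: bytes) -> int:
    """Retry until the record lands at one common offset on every replica."""
    while True:
        offset = len(primary.data)                    # primary picks the offset
        ok = primary.write_at(offset, payload)
        ok = ok and all(s.write_at(offset, payload) for s in secondaries)
        if ok:
            return offset                             # success reported to the client

primary, s1, s2 = Replica("primary"), Replica("s1"), Replica("s2")
s2.fail_next = True                                   # first attempt fails on s2
off = record_append(primary, [s1, s2], RECORD)
print("returned offset:", off)
print("primary:", bytes(primary.data))               # contains the record twice (a duplicate)
print("s2:     ", bytes(s2.data))                     # padded gap from the failed attempt, then the record once
```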

Summary
File namespace mutations are managed by requiring the Master to implement ACID properties: locking guarantees atomicity, consistency, and isolation; the operation log provides durability.
The state of a file region after a data mutation depends on the type of mutation, whether it succeeds or fails, and whether there are concurrent mutations:
 A file region is consistent if all clients will always see the same data, regardless of which replicas they read from.
 A file region is defined after a mutation if it is consistent and clients will see what the mutation writes in its entirety.
 When a mutation succeeds without interference from concurrent writers, the affected region is defined and, by implication, consistent.
 Concurrent successful mutations leave the region undefined but consistent: all clients see the same data, but it may consist of mingled fragments from multiple mutations.
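As a compact restatement, the sketch below maps a mutation's outcome to the region state described above; the failure case ("inconsistent") follows the paper's consistency table rather than this slide.

```python
# Illustrative mapping from a write's outcome to the resulting file-region state.
def region_state(succeeded: bool, concurrent: bool) -> str:
    if not succeeded:
        return "inconsistent (and therefore undefined)"
    if concurrent:
        return "consistent but undefined (mingled fragments from several writers)"
    return "defined (and therefore consistent)"

for case in [(True, False), (True, True), (False, False)]:
    print(case, "->", region_state(*case))
```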