Cloud scale storage: The Google File System

Cloud scale storage: The Google File System
CS6410, Harjasleen Malvai

Where do the files go?
Machines on a network need to share and use data. This introduces a few problems:
- Plain old access
- Consistency/Reliability
- Availability
(Image source: Brown Daily Herald)

Version 1.0: Network File System
- Introduced by Sun in 1985 (Sandberg et al. at USENIX).
- The interface looks like the Unix file system: the machine actually holding the file becomes the "server", the machine requesting it becomes the "client".
- Single copies are stored.
- No locks, which can cause problems with concurrent modifications.
- There is a client-side cache.
- Unreliable: getting a file depends on the single server holding it, so if that server is down the file is simply unavailable.
Source: Sandberg, Russel, et al. "Design and Implementation of the Sun Network Filesystem." Proceedings of the Summer USENIX Conference, 1985.

Version 2.0: Sharing is Caring (p2p)
- Files are stored on many untrusted nodes that can come and go, e.g. Napster and Limewire for p2p file sharing.
- Napster (1999) and its contemporaries had to maintain some centralized store of where files were, or search all nodes for them, limiting scalability.
- Concurrent proposals (~2001) of various distributed hash tables: hash "keys" (e.g. file IDs) and/or node names, and use some structure to speed up the search for key locations (Chord, CAN, Tapestry, Pastry).
- Applications could include any distributed system where nodes leave, such as distributing nonce ranges to nodes in a mining pool!
- Using distributed hash tables (among other new tools), the issues from Napster could be overcome: systems such as Pond (2003) implemented scalable p2p data storage, and did not trust the hosts!

Why Google File System?
- Datacenter! Cheap commodity machines with high bandwidth run Google's operations.
- The machines are owned by Google and sit inside the data center, hence trusted!
- Need to design a file system that accounts for:
  - Large-scale distributed storage
  - Reliability
  - Availability

Assumptions
- Hardware: commodity hardware is used; component failures are common and need to be accounted for.
- Files: huge files are common, so the design needs to accommodate them.
- Writes: most mutations are appends, not overwrites; concurrent modifications must be accommodated.
- Reads: primarily large streaming reads and small random reads.
- Efficiency: high bandwidth > low latency; most applications process data at a high rate but do not have fast-response requirements.

Data Under The Hood
[Diagram: a file is split into fixed-size chunks, each identified by a chunk handle.]
Salient features:
- A chunk is stored as a plain Linux file on the hardware, so Linux caching is used implicitly.
- Data is written at an offset within a chunk.
- Chunk size is a parameter; they chose 64 MB.
- Many replicas of each chunk (more on this later).
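To make the chunking concrete, here is a minimal sketch (not from the paper; the function name `locate` and the example values are illustrative) of how a byte offset in a file maps to a chunk index and an offset within that chunk, assuming the 64 MB chunk size:

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the chunk size GFS chose

def locate(byte_offset: int) -> tuple[int, int]:
    """Translate an absolute byte offset within a file into
    (chunk index, offset within that chunk).  The client does this
    translation itself and then asks the master only for the chunk
    handle and replica locations of that chunk index."""
    return byte_offset // CHUNK_SIZE, byte_offset % CHUNK_SIZE

# Byte 200,000,000 of a file lands in the third chunk (index 2):
print(locate(200_000_000))  # -> (2, 65782272)
```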

Architecture
[Diagram: clients talk to a single master for metadata and to chunk servers (C1 ... Cn) for data/operations on chunks; one replica of each chunk acts as the primary.]

Client Interaction
1. The client wants to mutate a chunk (write or append).
2. The master grants an extendible 60 s lease for the chunk to a replica holding an up-to-date version (versions are checked against the master's metadata); that replica becomes the primary. The master replies to the client with the primary and the other replicas.
3. The client caches the primary and the other chunk servers holding replicas (the secondaries).
4. The client pushes the data to all replicas, then sends the write request to the primary.
5. The primary applies the mutation and builds an ordered list of write requests, accounting for multiple users sending write requests to the chunk.
6. The primary forwards the ordered list of writes to the secondaries, hence ensuring a consistent mutation order.
7. Any errors from secondary writes are sent to the client, which handles retries.
Source: The Google File System
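A rough sketch of this flow in code (hedged: `Master`, `ChunkServer`, and all method names here are invented for illustration, not GFS's actual API, and data pushing is shown naively rather than pipelined along a chain of servers):

```python
from dataclasses import dataclass, field

@dataclass
class ChunkServer:
    """Stand-in for a chunk server holding one replica of a chunk."""
    name: str
    staged: dict = field(default_factory=dict)       # data pushed but not yet applied
    chunk: bytearray = field(default_factory=bytearray)

    def push_data(self, data_id: str, data: bytes) -> None:
        # Data flows to every replica first, decoupled from the write order.
        self.staged[data_id] = data

    def apply(self, data_id: str, offset: int) -> None:
        data = self.staged.pop(data_id)
        if len(self.chunk) < offset:                  # pad a sparse region for the demo
            self.chunk.extend(b"\x00" * (offset - len(self.chunk)))
        self.chunk[offset:offset + len(data)] = data

@dataclass
class Master:
    """Stand-in for the master's metadata: chunk handle -> (primary, secondaries)."""
    table: dict

    def lookup(self, handle: str):
        return self.table[handle]

def client_write(master: Master, handle: str, offset: int, data: bytes, data_id: str) -> None:
    # 1. Ask the master (real clients cache the answer) who holds the lease.
    primary, secondaries = master.lookup(handle)
    # 2. Push the data to all replicas.
    for replica in [primary] + secondaries:
        replica.push_data(data_id, data)
    # 3. Send the write to the primary, which picks the mutation order and
    #    forwards it to the secondaries in that same order.
    primary.apply(data_id, offset)
    for replica in secondaries:
        replica.apply(data_id, offset)   # errors here are reported to the client, which retries

# Usage sketch:
primary = ChunkServer("cs-1")
secondaries = [ChunkServer("cs-2"), ChunkServer("cs-3")]
master = Master({"chunk-0xabc": (primary, secondaries)})
client_write(master, "chunk-0xabc", offset=0, data=b"hello", data_id="d1")
```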

Problems Posed By GFS

Synchronization I: the filesystem itself (namespace)
- File/directory names are saved as full pathnames in a lookup table, each with a read/write lock.
- Manipulating a file requires no write lock on its directory! Why? "Because the old directory is dead!" There is no per-directory data structure to protect; only read locks on the ancestor pathnames are taken.
- This implies:
  - The ability to snapshot while still writing to a "directory".
  - The ability to write concurrently to a "directory".
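A minimal sketch of this pathname-based locking, under stated assumptions: `RWLock`, `lock_for`, `ancestors`, and `create_file` are invented names, and Python's `threading` primitives stand in for whatever the master actually uses.

```python
import threading

class RWLock:
    """Minimal reader/writer lock (illustrative; no writer preference)."""
    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    def acquire_read(self):
        with self._cond:
            while self._writer:
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            self._cond.notify_all()

    def acquire_write(self):
        with self._cond:
            while self._writer or self._readers:
                self._cond.wait()
            self._writer = True

    def release_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()

# One lock per full pathname, as in a flat lookup table (no per-directory inode).
namespace_locks: dict[str, RWLock] = {}

def lock_for(path: str) -> RWLock:
    return namespace_locks.setdefault(path, RWLock())

def ancestors(path: str) -> list[str]:
    """'/home/user/f' -> ['/home', '/home/user']"""
    parts = path.strip("/").split("/")
    return ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]

def create_file(path: str) -> None:
    """Creating /home/user/f takes read locks on /home and /home/user and a
    write lock on /home/user/f only.  Because the parent is only read-locked,
    two clients can create different files in the same 'directory' at once.
    Locks are acquired ancestors-first (a consistent order) to avoid deadlock."""
    for p in ancestors(path):
        lock_for(p).acquire_read()
    lock_for(path).acquire_write()
    try:
        pass  # ... insert the new pathname into the lookup table here ...
    finally:
        lock_for(path).release_write()
        for p in reversed(ancestors(path)):
            lock_for(p).release_read()
```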

Synchronization II: multiple users editing a chunk
- Atomic record appends: since the primary is the authority on write operations, concurrent append requests are simply treated as a multi-user write queue. There are details about what happens when a chunk's size would be exceeded and a new chunk is needed (the chunk is padded and the client retries on the next chunk). Checksums contained in records let applications deal with the resulting inconsistencies.
- Snapshots for versioning: if a snapshot is requested, leases are revoked and new copies are created. The copies are created on the same machines to reduce network cost. The revoked leases prevent new writes that bypass the master in the meantime.
- Heartbeat messages keep the master's knowledge of chunks and servers current.
- An operation log of mutations is stored in replicated persistent storage for the master.
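An illustrative sketch of the primary's record-append decision (a simplification under assumed names `PrimaryChunk` and `record_append`; the real protocol also forwards the chosen offset and data to the secondary replicas):

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB

class PrimaryChunk:
    """Toy model of the primary replica deciding where a record append lands."""
    def __init__(self):
        self.used = 0  # bytes already written in this chunk

    def record_append(self, record: bytes):
        """Return ('appended', offset) or ('retry_next_chunk', None).

        The primary alone picks the offset, so concurrent appenders are
        serialized into a single order; if the record would not fit, the
        rest of the chunk is padded and the client retries on a fresh chunk."""
        if self.used + len(record) > CHUNK_SIZE:
            self.used = CHUNK_SIZE            # pad out the remainder of the chunk
            return "retry_next_chunk", None
        offset = self.used
        self.used += len(record)
        # ... forward (offset, record) to the secondary replicas here ...
        return "appended", offset
```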

Availability
- Chunk replication via chunk servers.
- Multi-level distribution: multiple copies of each chunk, with the aim of keeping copies on multiple racks in case specific routers fail.
- Master replication and logging.
- Re-replication in case of failure: priority depends on the degree of failure, and new replicas are distributed so as to reduce bottlenecks. A toy prioritization is sketched below.
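A toy scoring function for that re-replication priority (hedged: `ChunkState` and `rereplication_order` are invented, and the paper uses additional factors beyond the two modeled here):

```python
from dataclasses import dataclass

@dataclass
class ChunkState:
    handle: str
    goal: int                       # desired number of replicas (e.g. 3)
    live: int                       # replicas currently reachable
    blocking_client: bool = False   # a client is blocked waiting on this chunk

def rereplication_order(chunks: list[ChunkState]) -> list[ChunkState]:
    """Re-replicate first the chunks that are furthest from their replication
    goal, breaking ties in favor of chunks that are blocking client progress."""
    return sorted(
        chunks,
        key=lambda c: (-(c.goal - c.live), not c.blocking_client),
    )
```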

Recovery
- Primary down: reconnect or grant a new lease; heartbeat messages keep track of which chunkservers are alive.
- Master recovery: all mutations are saved to disk and are not considered complete until replicated to all the backup masters. Only background operations run purely in memory most of the time, so restarting the master or starting a new one is seamless.*
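A short sketch of that "durable before complete" rule for the master's operation log (assumptions: `OperationLog` and `commit` are invented names, and the remote log replicas are modeled as plain Python lists):

```python
import os

class OperationLog:
    """Toy write-ahead operation log for the master's metadata mutations."""
    def __init__(self, local_path: str, remote_replicas: list):
        self.local = open(local_path, "ab")
        # Stand-ins for remote log replicas; plain lists are enough for the sketch.
        self.remotes = remote_replicas

    def commit(self, mutation: bytes) -> None:
        """A mutation counts as complete only after the log record is flushed
        to local disk and copied to the remote replicas; only then is it
        applied to the in-memory metadata and acknowledged to the client."""
        self.local.write(mutation + b"\n")
        self.local.flush()
        os.fsync(self.local.fileno())
        for replica in self.remotes:
            replica.append(mutation)
        # ... apply the mutation to in-memory state and reply to the client ...
```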

Integrity
- Correctness of chunk mutations follows from the mutation order.
- Checksums are kept on the chunk servers, and chunk version numbers are stored on the master to detect stale replicas.
- Checksums are verified before data is returned to the client, ensuring integrity.
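A sketch of chunkserver-side checksumming (the 64 KB block size matches the paper; CRC32 and the function names `checksum_blocks` and `verified_read` are stand-ins chosen for illustration):

```python
import zlib

BLOCK = 64 * 1024  # checksums are kept per 64 KB block of a chunk

def checksum_blocks(chunk: bytes) -> list[int]:
    """Compute one checksum per 64 KB block of a chunk."""
    return [zlib.crc32(chunk[i:i + BLOCK]) for i in range(0, len(chunk), BLOCK)]

def verified_read(chunk: bytes, checksums: list[int], offset: int, length: int) -> bytes:
    """Verify the checksum of every block overlapping the read before returning
    data; a mismatch means this replica is corrupt and the read should be
    retried against another replica."""
    first, last = offset // BLOCK, (offset + length - 1) // BLOCK
    for b in range(first, last + 1):
        if zlib.crc32(chunk[b * BLOCK:(b + 1) * BLOCK]) != checksums[b]:
            raise IOError(f"checksum mismatch in block {b}: read from another replica")
    return chunk[offset:offset + length]
```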

Server Efficiency
- Memory efficiency: garbage collection and load balancing. The master's memory is used only for such "maintenance" operations.
- Data flow efficiency (utilizing bandwidth).
- Diagnostics.
- Atomic record appends for fast concurrent mutation.
- Avoiding bottlenecks by reducing the role of the master: once a primary is assigned, the client only interacts with the primary and the secondaries.

Measurements
- Included measurements from real use cases!
- Low memory overhead for the filesystem metadata (see figure in the paper). It would appear that memory bounds the master, but the experiments show this is not an issue in practice.
- Some experiments with recovery:
  - Killed a single chunkserver: new replicas were made in ~23 min.
  - Killed two chunkservers (each holding ~16,000 chunks), leaving some chunks with a single replica and hence a high copy priority: those chunks were re-replicated in ~2 min.

Comments/Questions
- The application design is specific to the assumptions! How does this extend? Which assumptions can we drop, or need to drop?
- Chunkserver recovery is analyzed but master recovery is not. Since the centralized controller itself seems like a dangerous idea from an availability perspective, to what extent is this worrisome?
- The trust model seems to be that clients are internal and will not try to launch a DoS on the master. Is this a good assumption? Granted, they do have the caveat of not trying to generalize.

Cloud scale storage: Spanner, Google's globally distributed database
CS6410, Harjasleen Malvai

Why Spanner?
- Based on Colossus (the successor to GFS)!
- Predecessors:
  - BigTable: low functionality (no general transactions), not strongly consistent. [Also uses GFS]
  - Megastore: strong consistency but low write throughput.
- Google needed a (third!) tool that addressed these drawbacks. In addition, on a global scale:
  - Client proximity matters for read latency.
  - Replica proximity matters for write latency.
  - The number of replicas matters for availability.

Spanner Solution
- Spanner solves this problem by implementing a derivative of BigTable with Paxos-replicated commits to support transactions.
- Spanner data is "chunked" into ranges of rows with the same or similar keys, which they call "tablets".
- A Spanner deployment is termed a "universe", with physically isolated units known as "zones".
- Each zone has a zonemaster that assigns data to spanservers, which serve it to clients; a universe-wide placement driver moves data between zones.
- Since the system is no longer in one physical location with a single master, time synchronization poses a problem. They address this using their new API, TrueTime.

TrueTime
- Each datacenter has various servers that provide time using GPS receivers and atomic clocks.
- Time is no longer returned as an absolute value but as an interval, with real time guaranteed to lie within the interval.
- Spanner holds off on certain serialized transactions when it must be certain that a given timestamp is in the past.
- This allows externally consistent snapshots.
- It also lets Paxos leader leases be made disjoint in time.
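A toy sketch of the TrueTime idea, i.e. the interval-valued clock and the commit-wait rule (assumptions: `tt_now`, `commit_wait`, and the `EPSILON` uncertainty bound are illustrative, not Google's implementation):

```python
import time
from dataclasses import dataclass

EPSILON = 0.007  # assumed worst-case clock uncertainty in seconds (illustrative)

@dataclass
class TTInterval:
    earliest: float
    latest: float

def tt_now() -> TTInterval:
    """TrueTime-style clock: returns an interval guaranteed to contain real time."""
    t = time.time()
    return TTInterval(t - EPSILON, t + EPSILON)

def commit_wait(commit_timestamp: float) -> None:
    """Commit wait: do not make a commit visible until commit_timestamp is
    certainly in the past, i.e. until tt_now().earliest > commit_timestamp.
    This is what makes timestamp order agree with real-time order
    (external consistency)."""
    while tt_now().earliest <= commit_timestamp:
        time.sleep(0.001)

# Example: pick a commit timestamp no less than the latest possible current
# time, then wait it out before acknowledging the commit.
ts = tt_now().latest
commit_wait(ts)
print("commit at", ts, "is now visible")
```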

Comments/Questions
- Fast distributed file systems and databases are possible, but may require limiting assumptions!
- To what extent are corporate-scale assumptions widely useful?