Data Management with Google File System, Pramod Bhatotia, wp.mpi-sws.org/~bhatotia


Data Management with Google File System
Pramod Bhatotia, wp.mpi-sws.org/~bhatotia
Distributed Systems, 2016

From the last class: How big is "Big Data"?
- Processes 20 PB a day (2008)
- Crawls 20B web pages a day (2012)
- >10 PB of data, 75B DB calls per day (6/2012)
- >100 PB of user data + 500 TB/day (8/2012)
- S3: 449B objects, peak 290k requests/second (7/2011); 1T objects (6/2012)

Distributed systems for "Big Data" management
- Single disk: bandwidth ~180 MB/sec, so reading 1 PB takes ~65 days!
- Multiple disks: aggregate bandwidth ~ (no. of disks) * 180 MB/sec
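The arithmetic behind the slide can be sketched as follows. This is a back-of-the-envelope calculation assuming the 180 MB/sec per-disk bandwidth quoted above; the disk counts are illustrative.

```python
# Back-of-the-envelope sketch of the single-disk vs. many-disk read time.
# Assumes 180 MB/sec sequential read bandwidth per disk, as on the slide.

PB = 10**15          # 1 petabyte in bytes
MB = 10**6           # 1 megabyte in bytes
DISK_BW = 180 * MB   # per-disk read bandwidth, bytes/sec

def read_time_days(data_bytes, num_disks=1):
    """Time to scan `data_bytes` with `num_disks` disks reading in parallel."""
    aggregate_bw = num_disks * DISK_BW
    return data_bytes / aggregate_bw / 86_400  # 86,400 seconds per day

print(f"1 disk:     {read_time_days(PB):.1f} days")       # ~64.3 days
print(f"1000 disks: {read_time_days(PB, 1000) * 24:.1f} hours")
```

With a thousand disks the same scan drops from about two months to about an hour and a half, which is the whole motivation for striping data across many machines.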

In today's class: How to manage "Big Data" in a distributed setting?
- Google File System (GFS)
- Open-source counterpart: Hadoop Distributed File System (HDFS)

Distributed file system
From Google search: "A distributed file system is a client/server-based application that allows clients to access and process data stored on the server as if it were on their own computer."
E.g.: NFS, AFS
[Diagram: a client sends a read request over the network; the server returns the file data]

Key design requirements
- Large files
- Scalability
- Performance
- Write once (or append only)
- Reliability
- Availability
- Namespace management
- Concurrency

Beauty of GFS/HDFS
[Diagram: a layered stack; applications such as MapReduce run on top of GFS/HDFS]

Interface/API
Simple shell interface:
$ hadoop dfs -<fs command> <arguments>
<fs command>: mv, cp, chmod, copyFromLocal, copyToLocal, rm, du, etc.
Additional commands for snapshots and appends.
The programming APIs to access HDFS are similar.
Not compatible with the POSIX API; HDFS only allows appends (no in-place updates).

Distributed file system: GFS/HDFS architecture
Master node (NameNode) responsibilities:
- Namespace management
- Access control (ACLs)
- Data management
- Lease management
- Garbage collection
- Reliability of chunks

Architecture
[Diagram: a single master node (NameNode) coordinates multiple data nodes, which store the data]
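The master's core state can be pictured as two in-memory maps: file names to chunk IDs, and chunk IDs to replica locations. The sketch below is illustrative, not actual GFS code; the 64 MB chunk size matches the GFS paper, but the class and field names are invented.

```python
# Illustrative sketch of the master's in-memory metadata (not real GFS code).
# Files are split into fixed-size chunks (64 MB in the GFS paper); the master
# maps paths to chunk IDs and chunk IDs to the chunkservers holding replicas.

from dataclasses import dataclass, field

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, as in the GFS paper

@dataclass
class Master:
    namespace: dict = field(default_factory=dict)  # path -> [chunk_id, ...]
    locations: dict = field(default_factory=dict)  # chunk_id -> [server, ...]
    _next_id: int = 0

    def create(self, path):
        self.namespace[path] = []

    def add_chunk(self, path, servers):
        """Allocate a new chunk for `path`, replicated on `servers`."""
        chunk_id = self._next_id
        self._next_id += 1
        self.namespace[path].append(chunk_id)
        self.locations[chunk_id] = list(servers)
        return chunk_id

    def lookup(self, path, offset):
        """Translate (path, byte offset) into (chunk_id, replica locations)."""
        chunk_id = self.namespace[path][offset // CHUNK_SIZE]
        return chunk_id, self.locations[chunk_id]

m = Master()
m.create("/logs/part-0")
m.add_chunk("/logs/part-0", ["dn1", "dn2", "dn3"])
print(m.lookup("/logs/part-0", 1024))  # (0, ['dn1', 'dn2', 'dn3'])
```

Because all of this metadata fits in the master's memory, a lookup is a cheap map access, which is what lets a single master coordinate thousands of chunkservers.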

Basic functioning
[Diagram: the client fetches metadata from the NameNode (and caches it), then reads/writes file data directly from/to the data nodes]

Read operation
[Diagram: the client asks the master (NameNode) for the metadata of the desired chunk, then reads that chunk directly from a data node]
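The read path above can be sketched as a two-step protocol: metadata from the master, data directly from a chunkserver. The classes below simulate this; the names are invented for illustration.

```python
# Sketch of the GFS/HDFS read path: metadata from the master, data from a
# chunkserver. This simulates the protocol; it is not a real client library.

CHUNK_SIZE = 4  # tiny chunks so the example is easy to follow

class ChunkServer:
    def __init__(self):
        self.chunks = {}  # chunk_id -> bytes
    def read(self, chunk_id):
        return self.chunks[chunk_id]

class Client:
    def __init__(self, master, servers):
        self.master = master   # path -> [(chunk_id, server_name), ...]
        self.servers = servers
        self.cache = {}        # clients cache metadata, never file data

    def read(self, path, offset):
        # 1. Ask the master (once) which chunk holds the offset, and where.
        if path not in self.cache:
            self.cache[path] = self.master[path]
        chunk_id, server = self.cache[path][offset // CHUNK_SIZE]
        # 2. Fetch the chunk directly from the chunkserver. The master is
        #    never on the data path, so it does not become a bottleneck.
        return self.servers[server].read(chunk_id)

dn = ChunkServer()
dn.chunks[7] = b"data"
client = Client({"/f": [(7, "dn1")]}, {"dn1": dn})
print(client.read("/f", 0))  # b'data'
```

Subsequent reads of the same file hit the cached metadata and go straight to the data node, so the master handles each file roughly once per client.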

Write operation
[Diagram: the client gets the chunk's metadata from the master, then pushes the data to the buffers of all three replicas (A, B, C) on the data nodes]

Commit order: using lease management by the master
The master grants a lease for the chunk to one replica, the primary (here replica A); the others (B and C) are secondaries.
[Diagram: concurrent updates arrive at the replicas' buffers in different orders: A buffers U1, U2, U3; B buffers U3, U2, U1; C buffers U1, U3, U2]

Primary commit order
[Diagram: the primary (replica A) picks one serial order (U1, U2, U3); all replicas commit the buffered updates in that order, and the primary reports "write complete" to the client]
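The point of the primary's serial order can be shown in a few lines: data may arrive at each replica's buffer in a different order, but every replica applies the updates in the order the primary chose. This is an illustrative simulation, not GFS code.

```python
# Sketch of primary commit order: arrival order at each replica differs,
# but all replicas commit in the serial order chosen by the primary.

class Replica:
    def __init__(self):
        self.buffer = {}  # update_id -> payload (arrival order is irrelevant)
        self.log = []     # committed updates, in the primary's order

    def receive(self, update_id, payload):
        self.buffer[update_id] = payload

    def commit(self, order):
        for uid in order:
            self.log.append(self.buffer[uid])

primary, b, c = Replica(), Replica(), Replica()
# Data reaches each replica's buffer in a different order, as on the slide:
for replica, arrivals in [(primary, ["U1", "U2", "U3"]),
                          (b, ["U3", "U2", "U1"]),
                          (c, ["U1", "U3", "U2"])]:
    for uid in arrivals:
        replica.receive(uid, f"payload-{uid}")

# The primary picks one serial order; every replica commits in that order:
serial_order = ["U1", "U2", "U3"]
for replica in (primary, b, c):
    replica.commit(serial_order)

assert primary.log == b.log == c.log  # all replicas end up identical
```

Decoupling data flow (pushed to buffers in any order) from control flow (one commit order from the primary) is what keeps the replicas identical without routing all data through a single node.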

Consistency semantics
Updates (only in GFS):
A file region may end up containing mingled fragments from different clients, e.g., because writes spanning different chunks may be ordered differently by their different primary replicas.
Thus, concurrent writes leave the region consistent but undefined in GFS.
Appends:
A record append causes the data to be appended atomically at least once.
The offset is chosen by GFS/HDFS, not by the client.
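The "atomically at least once" semantics can be sketched as follows: the system picks the offset, and a client retry after a lost acknowledgement leaves a duplicate record. This is an illustrative simulation of the semantics, not the actual implementation.

```python
# Sketch of atomic record append: the system (not the client) chooses the
# offset, and a retried append can land twice, hence "at least once".

class AppendOnlyChunk:
    def __init__(self):
        self.records = []

    def record_append(self, data):
        """Append `data` atomically; return the offset the system chose."""
        offset = len(self.records)
        self.records.append(data)
        return offset

chunk = AppendOnlyChunk()
off1 = chunk.record_append("recA")
# Suppose the client's acknowledgement was lost, so it retries:
off2 = chunk.record_append("recA")

# Both copies exist, each appended atomically at a system-chosen offset.
print(chunk.records, off1, off2)  # ['recA', 'recA'] 0 1
```

Applications built on GFS therefore have to tolerate duplicates, typically by embedding record identifiers and deduplicating on read.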

Discussion: other design details
- Chunk replication: placement and re-balancing
- Namespace management & locking
- Garbage collection (lazy)
- Data integrity
- Snapshots (using copy-on-write)
- Master replication
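Of the points above, the copy-on-write snapshot is easy to illustrate: taking a snapshot only duplicates metadata (the list of chunk references), and chunk data is shared until the next write. A minimal sketch, with invented names:

```python
# Sketch of a copy-on-write snapshot: snapshotting copies only the chunk
# reference list (metadata); chunk data is shared until it is next written.

class COWFile:
    def __init__(self, chunks):
        self.chunks = chunks  # list of chunk contents (references shared)

    def snapshot(self):
        # Cheap: copy the reference list only; no chunk data is duplicated.
        return COWFile(list(self.chunks))

    def write(self, index, data):
        # The file gets a fresh chunk; the snapshot keeps the old one.
        self.chunks[index] = data

f = COWFile(["c0", "c1"])
snap = f.snapshot()
f.write(0, "c0-new")
print(f.chunks, snap.chunks)  # ['c0-new', 'c1'] ['c0', 'c1']
```

This is why GFS snapshots are nearly instantaneous: the cost is deferred to the first write on each shared chunk.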

References
- GFS [SOSP'03]: the original Google File System paper
- Tachyon [SoCC'13]: a distributed in-memory file system for Spark
- TidyFS [USENIX ATC'11]: a distributed file system by Microsoft
Resources: www.hadoop.apache.org

Thanks!
Pramod Bhatotia, wp.mpi-sws.org/~bhatotia