Data in the Cloud – I: Parallel Databases, The Google File System, Parallel File Systems


Relational Database Technology
Row-stores are good for transactional workloads with many random reads and updates and strong concurrency-control requirements.
Column-stores are good for data-warehouse workloads, with batch data insert/append followed by many aggregation queries: SELECT … SUM() … GROUP BY …
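
As a rough illustration of why a column layout suits SUM ... GROUP BY queries, the sketch below stores the same small table both ways. The table, its column names, and its values are invented for this example; real column-stores add compression and vectorized execution that are not shown here.

```python
# Hypothetical sales table in row layout: one record per row.
rows = [
    {"region": "east", "product": "a", "amount": 10},
    {"region": "west", "product": "b", "amount": 20},
    {"region": "east", "product": "b", "amount": 5},
]

# The same data in column layout: one contiguous list per column.
columns = {name: [r[name] for r in rows] for name in rows[0]}


def sum_by_group_rowwise(table, key, value):
    """Aggregate by walking whole rows, including columns the query ignores."""
    totals = {}
    for row in table:
        totals[row[key]] = totals.get(row[key], 0) + row[value]
    return totals


def sum_by_group_columnar(cols, key, value):
    """Aggregate by scanning only the two columns the query needs."""
    totals = {}
    for k, v in zip(cols[key], cols[value]):
        totals[k] = totals.get(k, 0) + v
    return totals


totals = sum_by_group_columnar(columns, "region", "amount")
```

Both functions return the same totals; the difference is how much data each has to touch, which matters once rows are wide and tables are large.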

Parallel Databases
Automatic parallelization of SQL queries
Distributed 'two-phase commit' and locking for transaction atomicity and isolation
Partitioning of tables across disks (by rows or columns, as appropriate)
Typical parallel databases run on at most a few dozen servers
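
A minimal sketch of row partitioning with a parallelized aggregate, under stated assumptions: threads stand in for database servers, hash partitioning is one possible scheme, and the function names are hypothetical rather than any product's API.

```python
from concurrent.futures import ThreadPoolExecutor


def partition_by_hash(rows, key, n_parts):
    """Assign each row to a partition based on the hash of its key column."""
    parts = [[] for _ in range(n_parts)]
    for row in rows:
        parts[hash(row[key]) % n_parts].append(row)
    return parts


def partial_sum(part, key, value):
    """Per-partition SUM(value) GROUP BY key, run independently on each part."""
    totals = {}
    for row in part:
        totals[row[key]] = totals.get(row[key], 0) + row[value]
    return totals


def parallel_group_sum(rows, key, value, n_parts=4):
    """Partition the rows, aggregate the partitions in parallel, then merge."""
    parts = partition_by_hash(rows, key, n_parts)
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partials = list(pool.map(lambda p: partial_sum(p, key, value), parts))
    merged = {}
    for partial in partials:
        for k, v in partial.items():
            merged[k] = merged.get(k, 0) + v
    return merged
```

Because rows with the same key hash to the same partition, the merge step here is trivial; with a different partitioning column it would have to combine partial sums for the same group.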

File Systems
File systems are traditionally designed for random access by a single process
Parallelism is managed by the client application, e.g., parallel databases
Fault tolerance is limited to replication of block descriptions, not the data itself

Google File System
Motivation:
Large-scale distributed storage of massive files
Fault tolerance is a necessity, since $p(\text{system failure}) = 1 - (1 - p(\text{component failure}))^N \to 1$ for large $N$
Ensuring parallel block reads as well as parallel bulk appends:
  – such as being able to read indexes that are continuously being updated
High-volume reads and bulk appends
Acceptable level of consistency
Acceptable level of latency for random writes
Not included:
Parallel I/O (any parallelism is handled by the application)
High-volume random writes
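
The failure-probability expression on this slide can be evaluated directly. The sketch below assumes components fail independently and counts the system as failed if at least one component fails; the per-component probability of 0.01 is an illustrative value, not a figure from the slides.

```python
def p_system_failure(p_component, n):
    """Probability that at least one of n independent components fails."""
    return 1.0 - (1.0 - p_component) ** n


# Illustrative values only: the same per-component risk at growing cluster sizes.
for n in (1, 10, 100, 1000):
    print(n, p_system_failure(0.01, n))
```

The result climbs toward 1 as n grows, which is the slide's argument for treating component failure as the normal case.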

GFS Architecture
Single master controls the file namespace
Data is broken into 64 MB "chunks"
Chunks are replicated over many "chunk servers"
Client talks directly to chunk servers for data
Read:
1. Send full path and offset to the master; receive metadata for the chunk server of one replica
2. Client connects directly to that chunk server to read the data
3. GFS does not cache data; any caching is whatever the local client / chunk-server OS provides
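
The read path on this slide can be sketched as a toy in-memory model. The class names, method names, and dictionary-based metadata are hypothetical and are not the GFS client API; the sketch also ignores reads that cross a chunk boundary.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks, as on the slide


class Master:
    """Holds metadata only: path -> chunk ids, chunk id -> replica server names."""

    def __init__(self):
        self.files = {}      # path -> list of chunk ids, in file order
        self.locations = {}  # chunk id -> list of chunk-server names

    def lookup(self, path, offset):
        """Step 1: map (path, offset) to a chunk id and one replica's server."""
        chunk_id = self.files[path][offset // CHUNK_SIZE]
        return chunk_id, self.locations[chunk_id][0]


class ChunkServer:
    """Stores chunk contents; in this toy model, bytes held in memory."""

    def __init__(self):
        self.chunks = {}  # chunk id -> bytes

    def read(self, chunk_id, offset_in_chunk, length):
        return self.chunks[chunk_id][offset_in_chunk:offset_in_chunk + length]


def client_read(master, servers, path, offset, length):
    """Step 2: after the metadata lookup, data flows from the chunk server."""
    chunk_id, server_name = master.lookup(path, offset)
    return servers[server_name].read(chunk_id, offset % CHUNK_SIZE, length)
```

The point of the split is that the master answers small metadata queries only, so file data never passes through it.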

GFS Architecture (cont.)
Record-append (no offset specified; e.g., record = document, index block, etc.):
1. Send full path to the master; receive metadata for the chunk servers of all replicas
2. Client sends the data to be appended to the chunk server of each replica and waits for acknowledgement
3. Once all replicas have replied, the client informs the chunk server of the primary replica
4. The primary chunk server then chooses the offset at which to append the data (this may be beyond EOF if many clients are appending to the same file) and sends the offset to each replica
5. If not all replicas succeed in writing at the designated offset, the primary retries (so some replicas can have duplicate/junk data interleaved with 'defined' data)
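
Steps 4 and 5 can be simulated in a few lines to show where duplicate and missing data come from. This is a sketch under simplifying assumptions: offsets are slot numbers rather than byte positions, failures are injected through a hypothetical `fail_on` parameter, and every retry is placed at a fresh offset.

```python
class Replica:
    """Toy replica: a mapping from slot offset to record."""

    def __init__(self, fail_on=()):
        self.data = {}
        self.fail_on = set(fail_on)  # attempt numbers on which the write fails

    def write_at(self, offset, record, attempt):
        if attempt in self.fail_on:
            return False  # simulated failure: nothing is written at this offset
        self.data[offset] = record
        return True


def record_append(primary, secondaries, record, next_offset, max_attempts=3):
    """Return (offset at which the record is defined, new end-of-file offset)."""
    offset = next_offset
    for attempt in range(max_attempts):
        replicas = [primary] + secondaries
        results = [r.write_at(offset, record, attempt) for r in replicas]
        if all(results):
            return offset, offset + 1
        # The failed offset now differs across replicas; retry further along.
        offset += 1
    raise RuntimeError("record append failed on every attempt")
```

After one failed attempt, the replicas that succeeded hold the record twice and the one that failed holds it once, so the earlier offset is inconsistent while the final offset is defined.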

Consistency in GFS
Concurrent writers on a standard file system will yield consistent but undefined data
GFS data consists of defined segments interleaved with inconsistent ones
Consistent: all chunk servers have the same data
Defined: the result of the "last write operation" (such as a record-append) is available in its entirety
A defined chunk is also a consistent chunk
Questions:
  – Is data corrupt if it is inconsistent?
  – Is data corrupt if it is undefined?
  – Can applications use data in either state?
  – How?
[Figure: replica contents by record offset (n … n+3, m, m+1), with regions labelled consistent, defined, padding, and inconsistent]
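
One possible answer to the "How?" question is for the application itself to make records self-validating and self-identifying, so that a reader can skip padding, drop damaged records, and ignore duplicates. The sketch below assumes a hypothetical record format of (checksum, "id:payload") and represents padding as None; it illustrates the idea only and is not a GFS library interface.

```python
import zlib


def encode(record_id, payload):
    """Writer side: attach a CRC32 checksum and a unique id to each record."""
    body = f"{record_id}:{payload}".encode()
    return (zlib.crc32(body), body)


def read_valid_records(region):
    """Reader side: keep the first intact copy of each record, skip the rest."""
    seen = set()
    payloads = []
    for item in region:
        if item is None:  # padding
            continue
        checksum, body = item
        if zlib.crc32(body) != checksum:  # junk or partially written record
            continue
        record_id, payload = body.decode().split(":", 1)
        if record_id in seen:  # duplicate left behind by a retried append
            continue
        seen.add(record_id)
        payloads.append(payload)
    return payloads
```

With this convention, inconsistent regions are a nuisance for the reader rather than corruption of the application's data.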

Parallel File Systems
In GFS, the I/O bandwidth from one client to a particular file remains the same as for a normal networked file system
Parallel file systems distribute (stripe) blocks of a file across many storage servers, similar to RAID disks → high-bandwidth parallel I/O
Lustre:
  – Blocks of a file are striped across objects
  – Objects are distributed across many parallel object stores
Applications:
  – Scientific computing
  – Video recording and analysis
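
Striping can be illustrated with a round-robin mapping from file offset to storage server, in the style of RAID 0. The function names, stripe size, and server count below are assumptions made for this sketch; this is not Lustre's actual layout interface.

```python
def locate(offset, stripe_size, n_servers):
    """Map a file offset to (server index, offset within that server's object)."""
    stripe_index = offset // stripe_size
    server = stripe_index % n_servers
    offset_on_server = (stripe_index // n_servers) * stripe_size + offset % stripe_size
    return server, offset_on_server


def servers_for_range(offset, length, stripe_size, n_servers):
    """Set of servers that a single read of `length` bytes would touch."""
    first = offset // stripe_size
    last = (offset + length - 1) // stripe_size
    return {stripe % n_servers for stripe in range(first, last + 1)}
```

A read that spans several stripes touches several servers at once, which is where the extra bandwidth to a single client comes from.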