Google Distributed System and Hadoop Lakshmi Thyagarajan

What is Hadoop?
● A software platform that lets one write and run applications that process vast amounts of data.
● The cornerstones of Hadoop:
– Hadoop Distributed File System (HDFS), for distributed storage
– Hadoop Map/Reduce, for parallel computing
● An open-source implementation of the Google File System (GFS) and Map/Reduce.
● Written in Java.

What Does a Distributed File System Do?
1) The usual file system stuff: create, read, move, and find files.
2) Allows distributed access to files.
3) Stores files in a distributed fashion.
If you do only #1 and #2, you are a network file system. To do #3, it's a good idea to also provide fault tolerance.

How Is GFS Different from Other Distributed File Systems? GFS assumes:
● A high failure rate
● Huge files (optimized for GB+ files)
● Almost all writes are appends
● Reads are either small and random, or big and streaming

GFS: Architecture
[Architecture diagram: clients contact the single GFS master for chunk metadata, then talk directly to chunkservers 1 through N for data; each chunk (C0, C1, C2, C3, C5, ...) is replicated across several chunkservers.]

GFS Architecture: Chunks
● Files are divided into 64 MB chunks (the last chunk of a file may be smaller).
● Each chunk is identified by a unique 64-bit id.
● Chunks are stored as regular files on local disks.
● By default, each chunk is stored three times, preferably on more than one rack.
● To protect data integrity, each 64 KB block gets a 32-bit checksum that is checked on all reads (a short sketch follows this slide).
● When idle, a chunkserver scans inactive chunks for corruption.
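
A short sketch can make the per-block checksumming concrete. This is illustrative only: the slide does not name the checksum algorithm, so CRC32 stands in here as an assumption, and the class and method names are hypothetical.

```java
import java.util.zip.CRC32;

// Hypothetical sketch: one 32-bit checksum per 64 KB block of a chunk.
// Each checksum is recomputed and compared on every read of its block.
public class ChunkChecksums {
    static final int BLOCK_SIZE = 64 * 1024;  // 64 KB checksum granularity

    static long[] checksumChunk(byte[] chunk) {
        int blocks = (chunk.length + BLOCK_SIZE - 1) / BLOCK_SIZE;
        long[] sums = new long[blocks];
        CRC32 crc = new CRC32();
        for (int i = 0; i < blocks; i++) {
            int off = i * BLOCK_SIZE;
            int len = Math.min(BLOCK_SIZE, chunk.length - off);
            crc.reset();
            crc.update(chunk, off, len);
            sums[i] = crc.getValue();  // low 32 bits hold the CRC
        }
        return sums;
    }
}
```

The same routine would serve the idle-time scrubbing: a chunkserver recomputes each block's checksum and compares it against the stored value.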

GFS Architecture: Master
● Stores all metadata (namespace, access control).
● Stores the (file -> chunks) and (chunk -> locations) mappings (sketched below).
● All of the above is kept in memory for speed.
● Clients get chunk locations for a file from the master, and then talk directly to the chunkservers for the data.
● The master is also in charge of chunk management: migrating, replicating, garbage-collecting, and leasing chunks.
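
A minimal sketch of those two in-memory mappings, assuming hypothetical names and types (not the actual GFS data structures):

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the master's in-memory metadata.
public class MasterMetadata {
    // file path -> ordered list of 64-bit chunk handles
    final Map<String, List<Long>> fileToChunks = new ConcurrentHashMap<>();
    // chunk handle -> chunkservers currently holding a replica
    final Map<Long, List<String>> chunkToLocations = new ConcurrentHashMap<>();

    // A client lookup resolves in two steps, both served from memory;
    // the client then fetches the data from the chunkservers directly.
    List<String> locate(String path, int chunkIndex) {
        long handle = fileToChunks.get(path).get(chunkIndex);
        return chunkToLocations.get(handle);
    }
}
```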

GFS: Life of a Read
1) The client program asks for 1 GB of file “A” starting at the 200 millionth byte.
2) The client's GFS library asks the master for chunks 2 through 17 of file “A” (the chunk arithmetic is sketched below).
3) The master responds with all of the locations of chunks 2 through 17 of file “A”.
4) The client caches all of these locations (with their cache time-outs).
5) The client reads chunk 2 from the closest location.
6) The client reads chunk 3 from the closest location.
7) ...
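
The chunk range in steps 2 and 3 is plain integer division over 64 MB chunks; a worked sketch (assuming a decimal gigabyte and 0-indexed chunks):

```java
// Worked example of the client library's offset-to-chunk arithmetic.
public class ReadRange {
    static final long CHUNK_SIZE = 64L * 1024 * 1024;  // 64 MB chunks

    public static void main(String[] args) {
        long offset = 200_000_000L;    // the 200 millionth byte
        long length = 1_000_000_000L;  // 1 GB
        long first = offset / CHUNK_SIZE;                // chunk 2
        long last = (offset + length - 1) / CHUNK_SIZE;  // chunk 17
        System.out.println("chunks " + first + " through " + last);
    }
}
```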

GFS: Life of a Write
1) The client gets the locations of the chunk replicas, as before.
2) For each chunk, the client sends the write data to the nearest replica.
3) That replica sends the data to the replica nearest to it that has not yet received the data.
4) Once all of the replicas have received the data, it is safe for them to actually write it.
5) The master hands out a short-term (~1 minute) lease for a particular replica to be the primary one.
6) The primary replica assigns a serial number to each mutation so that every replica performs the mutations in the same order (see the sketch below).
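
A sketch of the serial-numbering idea in step 6, with hypothetical names (not GFS source): while it holds the lease, the primary stamps each mutation with a serial number, and every replica applies mutations in that order.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of mutation ordering at the primary replica.
class Mutation { byte[] data; long chunkOffset; }

interface Replica { void apply(Mutation m, long serial); }

class PrimaryReplica implements Replica {
    private final AtomicLong nextSerial = new AtomicLong();
    private final List<Replica> secondaries;

    PrimaryReplica(List<Replica> secondaries) { this.secondaries = secondaries; }

    @Override
    public void apply(Mutation m, long serial) { /* write m.data in serial order */ }

    // Called once all replicas have buffered the data (step 4).
    void commit(Mutation m) {
        long serial = nextSerial.getAndIncrement();        // one total order per chunk
        apply(m, serial);                                  // apply locally
        for (Replica r : secondaries) r.apply(m, serial);  // same order everywhere
    }
}
```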

GFS: Life of a Chunk
● Initial chunk placement balances several factors (a toy scoring sketch follows this slide):
– Use under-utilized chunkservers.
– Don't overload a chunkserver with lots of new chunks.
– Spread a chunk's replicas over more than one rack.
● A chunk is re-replicated if it has too few replicas (because a chunkserver failed, a checksum detected corruption, ...).
● The master also periodically rebalances chunk replicas.
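
A toy scoring sketch of the placement trade-offs above; the fields and the weight are invented for illustration, not taken from GFS:

```java
import java.util.Comparator;
import java.util.List;

// Hypothetical placement heuristic: prefer under-utilized servers,
// penalize servers with many recent chunk creations, and force the
// new replica onto a different rack than the existing ones.
class Chunkserver {
    double diskUtilization;  // fraction of disk in use
    int recentCreations;     // chunks created recently
    String rack;
}

class Placement {
    static Chunkserver pickServer(List<Chunkserver> candidates, String excludeRack) {
        return candidates.stream()
                .filter(cs -> !cs.rack.equals(excludeRack))  // rack diversity
                .min(Comparator.comparingDouble(
                        cs -> cs.diskUtilization + 0.1 * cs.recentCreations))
                .orElseThrow(IllegalStateException::new);
    }
}
```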

GFS: Master Failure
● The master stores its state via periodic checkpoints and a mutation log.
● Both are replicated.
● Master election and notification are implemented, so a replacement master can take over.
● The new master restores its state from the checkpoint and log (exception: chunk locations are rebuilt by asking the chunkservers; see the sketch below).
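
An illustrative recovery sketch with hypothetical types (not GFS source): the new master loads the latest checkpoint, replays the mutation log, then rebuilds chunk locations by polling the chunkservers, since locations are never persisted.

```java
import java.util.List;

interface LogRecord {}
interface MasterState {
    void apply(LogRecord r);
    void addLocations(String chunkserverId, List<Long> chunkHandles);
}
interface Checkpoint { MasterState load(); }
interface ChunkserverClient { String id(); List<Long> listChunks(); }

class MasterRecovery {
    MasterState recover(Checkpoint checkpoint, List<LogRecord> mutationLog,
                        List<ChunkserverClient> chunkservers) {
        MasterState state = checkpoint.load();            // last periodic snapshot
        for (LogRecord r : mutationLog) state.apply(r);   // replay newer mutations
        for (ChunkserverClient cs : chunkservers) {
            state.addLocations(cs.id(), cs.listChunks()); // ask, don't persist
        }
        return state;
    }
}
```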

Differences Between GFS and HDFS
HDFS does not yet implement:
● more than one simultaneous writer
● writes other than appends
● automatic master failover
● snapshots
● optimized chunk replica placement
● data rebalancing
GFS/HDFS terminology:
● Master (GFS) – Namenode (HDFS)
● Chunkserver, or slave (GFS) – Datanode (HDFS)

Map/Reduce
● A programming model and an associated implementation for processing and generating large data sets.
● A user-specified map function processes a key/value pair to generate a set of intermediate key/value pairs.
● A user-specified reduce function merges all intermediate values associated with the same intermediate key (the word-count example below makes this contract concrete).
● Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines.
● The MapReduce runtime environment takes care of:
– Partitioning the input data.
– Scheduling the program's execution across a set of machines.
– Handling machine failures.
– Managing the required inter-machine communication.
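
The canonical word count makes the map and reduce contracts concrete; this mirrors the WordCount example that ships with Hadoop (new mapreduce API):

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // map: (offset, line) -> (word, 1) for every word in the line
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);  // intermediate key/value pair
            }
        }
    }

    // reduce: (word, [1, 1, ...]) -> (word, count)
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();  // merge per-key values
            context.write(key, new IntWritable(sum));
        }
    }
}
```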

Map/Reduce, contd.
Suppose we wanted a distributed grep over a large data set:
cat file | grep "ip address" > matches.txt
Our input is really large, so we'll need to read it from a distributed file system anyway. A distributed grep with MapReduce would:
● Split the file into individual chunks.
● Send each chunk to a different machine.
● Have each machine run the same mapper on its data chunk (grep "ip address" in this case; a mapper sketch follows this slide).
● Collate the matches from the individual worker machines into an output file (the reducer, in this example the identity function).
Possible system optimizations:
● Data locality
● Network topology optimizations (rack diversity, etc.)
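
The grep mapper from the example above fits in a few lines of Hadoop; a sketch, with the class name hypothetical and the identity reducer left implicit:

```java
import java.io.IOException;
import java.util.regex.Pattern;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Each worker runs this over its own chunk and emits only matching lines;
// with an identity reducer, the job output is the collated matches.
public class GrepMapper extends Mapper<LongWritable, Text, Text, Text> {
    private static final Pattern PATTERN = Pattern.compile("ip address");
    private static final Text EMPTY = new Text("");

    @Override
    public void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        if (PATTERN.matcher(line.toString()).find()) {
            context.write(line, EMPTY);  // matching line becomes the output key
        }
    }
}
```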

More from the Hadoop stable:
● HBase – an open-source distributed storage system for structured data, built on top of HDFS. Google's equivalent: BigTable.
● ZooKeeper – a distributed, open-source coordination service for distributed applications.
● Mahout – for building scalable machine-learning libraries.