The Hadoop Distributed File System, by Dhruba Borthakur, and Related Work. Presented by Mohit Goenka.

The Hadoop Distributed File System: Architecture and Design

Requirement

- Need to process multi-petabyte datasets
- Expensive to build reliability into each application
- Nodes fail every day
- Need a common infrastructure

Introduction

- HDFS, the Hadoop Distributed File System, is designed to run on commodity hardware
- Built by engineers and contributors from Yahoo!, Facebook, Cloudera, and other companies
- Has grown into a very large project at Apache with a significant ecosystem

Commodity Hardware

- Typically a two-level architecture
  - Nodes are commodity PCs
  - nodes/rack
  - Uplink from rack is 3-4 gigabit
  - Rack-internal is 1 gigabit

Goals

- Very large distributed file system
  - 10K nodes, 100 million files, 10 PB
- Assumes commodity hardware
  - Files are replicated to handle hardware failure
  - Detects failures and recovers from them
- Optimized for batch processing
  - Data locations exposed so that computations can move to where the data resides
  - Provides very high aggregate bandwidth
- Runs in user space on heterogeneous operating systems

HDFS Basic Architecture

(Diagram: the Client sends a filename to the NameNode (1), receives the block IDs and DataNode locations (2), then reads the data directly from the DataNodes (3). A Secondary NameNode sits alongside the NameNode; DataNodes report cluster membership.)

- NameNode: maps a file to a file-id and a list of blocks
- DataNode: maps a block-id to a physical location on disk
- SecondaryNameNode: periodic merge of the transaction log

Distributed File System

- Single namespace for the entire cluster
- Data coherency
  - Write-once-read-many access model
  - Clients can only append to existing files
- Files are broken up into blocks
  - Typically 128 MB block size
  - Each block replicated on multiple DataNodes
- Intelligent client
  - Client can find the location of blocks
  - Client accesses data directly from the DataNode
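The block model above can be sketched in a few lines of Python (an illustration, not the actual HDFS code): a file of a given length simply maps onto a sequence of fixed-size blocks, with the last block holding the remainder.

```python
# Sketch: how a file of a given size is split into fixed-size blocks,
# mirroring HDFS's (typically) 128 MB block size.
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB

def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE):
    """Return (block_index, block_length) pairs covering the file."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((len(blocks), length))
        offset += length
    return blocks

# A 300 MB file occupies two full blocks plus a 44 MB final block.
blocks = split_into_blocks(300 * 1024 * 1024)
print(len(blocks))    # 3
print(blocks[-1][1])  # 46137344 (44 MB)
```

Each of these blocks would then be replicated on multiple DataNodes, as the slide notes.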

HDFS Core Architecture

(Diagram only in the original slide.)

NameNode Metadata

- Metadata kept in memory
  - The entire metadata is in main memory
  - No demand paging of metadata
- Types of metadata
  - List of files
  - List of blocks for each file
  - List of DataNodes for each block
  - File attributes, e.g. creation time, replication factor
- A transaction log
  - Records file creations, file deletions, etc.
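The metadata types above can be pictured as a pair of in-memory maps plus an append-only log. The sketch below is a loose Python illustration of that shape, not the real NameNode data structures:

```python
# Illustrative in-memory NameNode state (plain dicts standing in for the
# real data structures): file -> blocks, block -> DataNodes, plus an
# append-only transaction log of namespace changes.
file_to_blocks = {}      # filename -> list of block ids
block_to_datanodes = {}  # block id -> list of DataNode hostnames
transaction_log = []     # records file creations, deletions, etc.

def create_file(name, block_ids, replicas_per_block):
    file_to_blocks[name] = list(block_ids)
    for block_id, nodes in zip(block_ids, replicas_per_block):
        block_to_datanodes[block_id] = list(nodes)
    transaction_log.append(("create", name))

create_file("/logs/day1", ["blk_1", "blk_2"],
            [["dn1", "dn2", "dn3"], ["dn2", "dn4", "dn5"]])
print(file_to_blocks["/logs/day1"])  # ['blk_1', 'blk_2']
print(transaction_log[-1])           # ('create', '/logs/day1')
```

Because everything lives in main memory with no demand paging, lookups are fast, but total metadata size is bounded by the NameNode's RAM.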

Data Node

- A block server
  - Stores data in the local file system (e.g. ext3)
  - Stores metadata of a block (e.g. CRC)
  - Serves data and metadata to clients
- Block report
  - Periodically sends a report of all existing blocks to the NameNode
- Facilitates pipelining of data
  - Forwards data to other specified DataNodes

Block Placement

- Current strategy
  - One replica on the local node
  - Second replica on a remote rack
  - Third replica on the same remote rack
  - Additional replicas are randomly placed
- Clients read from the nearest replica
- Would like to make this policy pluggable
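A minimal sketch of the placement strategy just described (an illustration, not the real HDFS placement code): first replica on the writer's node, second on a node in a different rack, third on another node in that same remote rack, and any extras placed randomly.

```python
import random

def place_replicas(local_node, local_rack, racks, n_replicas=3):
    """racks: dict rack_name -> list of node names. Returns target nodes."""
    targets = [local_node]                      # 1st replica: local node
    remote_rack = random.choice([r for r in racks if r != local_rack])
    second, third = random.sample(racks[remote_rack], 2)
    targets += [second, third]                  # 2nd and 3rd: same remote rack
    # additional replicas are placed randomly across the remaining nodes
    spare = [n for r in racks.values() for n in r if n not in targets]
    targets += random.sample(spare, max(0, n_replicas - 3))
    return targets

racks = {"rack1": ["a1", "a2"], "rack2": ["b1", "b2", "b3"]}
placement = place_replicas("a1", "rack1", racks)
print(placement[0])  # 'a1' (local node first)
```

Keeping two replicas on one remote rack limits cross-rack traffic while still surviving the loss of an entire rack.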

Data Correctness

- Checksums used to validate data
  - Uses CRC32
- File creation
  - Client computes a checksum per 512 bytes
  - DataNode stores the checksums
- File access
  - Client retrieves the data and checksums from the DataNode
  - If validation fails, the client tries other replicas
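The per-chunk checksumming above is easy to demonstrate with Python's standard `zlib.crc32` (a sketch of the idea, not HDFS's own checksum code):

```python
import zlib

# A CRC32 is computed for every 512-byte chunk on write, then re-verified
# on read; a mismatch means the data is corrupt and another replica is tried.
CHUNK = 512

def checksums(data: bytes):
    return [zlib.crc32(data[i:i + CHUNK]) for i in range(0, len(data), CHUNK)]

def verify(data: bytes, stored):
    return checksums(data) == stored

payload = b"x" * 1500                      # three chunks: 512 + 512 + 476 bytes
stored = checksums(payload)
print(verify(payload, stored))             # True
print(verify(payload[:-1] + b"y", stored)) # False: corruption detected
```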

NameNode Failure

- A single point of failure
- Transaction log stored in multiple directories
  - A directory on the local file system
  - A directory on a remote file system (NFS/CIFS)
- Need to develop a real HA solution

Data Pipelining

- Client retrieves a list of DataNodes on which to place replicas of a block
- Client writes the block to the first DataNode
- The first DataNode forwards the data to the next DataNode in the pipeline
- When all replicas are written, the client moves on to write the next block of the file
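The forwarding chain above can be sketched as a toy pipeline (in-process Python, with a dict standing in for the DataNodes' disks):

```python
# Toy write pipeline: each DataNode stores the block locally and forwards
# it to the next node in the chain, as described in the slide.
def pipeline_write(block: bytes, datanodes, storage):
    """storage: dict node -> list of received blocks (stands in for disks)."""
    if not datanodes:
        return
    head, rest = datanodes[0], datanodes[1:]
    storage.setdefault(head, []).append(block)  # store locally
    pipeline_write(block, rest, storage)        # forward downstream

disks = {}
pipeline_write(b"block-0", ["dn1", "dn2", "dn3"], disks)
print(sorted(disks))  # ['dn1', 'dn2', 'dn3'] - every node got the block
```

Pipelining lets the client push each block once while the DataNodes fan it out amongst themselves.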

Rebalancer

- Goal: the percentage of disk used on each DataNode should be similar
- Usually run when new DataNodes are added
- The cluster stays online while the Rebalancer is active
- The Rebalancer is throttled to avoid network congestion
- Invoked as a command-line tool
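The goal stated above - keeping per-node disk usage similar - can be sketched as a loop that moves blocks from the fullest node to the emptiest until utilization is within a threshold of uniform (a toy model, not the real HDFS Balancer):

```python
# Toy rebalancer: move one block at a time from the fullest node to the
# emptiest until every node's utilization is within `threshold` of the rest.
def rebalance(used_blocks, capacity, threshold=0.1):
    """used_blocks: dict node -> block count; capacity: blocks per node."""
    def util(node):
        return used_blocks[node] / capacity
    while True:
        fullest = max(used_blocks, key=util)
        emptiest = min(used_blocks, key=util)
        if util(fullest) - util(emptiest) <= threshold:
            return used_blocks
        used_blocks[fullest] -= 1   # in HDFS this would be a (throttled)
        used_blocks[emptiest] += 1  # block transfer between DataNodes

result = rebalance({"dn1": 90, "dn2": 10, "dn3": 50}, capacity=100)
print(max(result.values()) - min(result.values()) <= 10)  # True
```

The real Rebalancer additionally throttles transfers and runs while the cluster stays online, per the slide.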

Hadoop Map / Reduce

- The Map-Reduce programming model
  - Framework for distributed processing of large data sets
  - Pluggable user code runs in a generic framework
- A common design pattern in data processing:
  cat * | grep | sort | uniq -c | cat > file
  input | map | shuffle | reduce | output
- Natural for:
  - Log processing
  - Web search indexing
  - Ad-hoc queries
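The input | map | shuffle | reduce | output pattern above can be shown with a tiny in-process word count (the same job `sort | uniq -c` does in the shell pipeline), purely as an illustration of the model:

```python
from collections import defaultdict

def map_phase(lines):
    # map: emit (word, 1) for every word in the input
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # shuffle: group all values by key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # reduce: sum the counts for each word
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle(map_phase(["the cat", "the hat"])))
print(counts)  # {'the': 2, 'cat': 1, 'hat': 1}
```

Hadoop runs the same three phases, but with the map and reduce functions distributed across the cluster and the shuffle moving data between nodes.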

Data Flow

(Diagram: Web Servers -> Scribe Servers -> Network Storage -> Hadoop Cluster -> Oracle RAC / MySQL)

Basic Operations

- Listing files
  ./bin/hadoop fs -ls
- Writing files
  ./bin/hadoop fs -put
- Running Map Reduce jobs
  mkdir input
  cp conf/*.xml input
  cat output/*

Hadoop Ecosystem Projects

- HBase - BigTable-like storage
- Hive - built at Facebook, provides a SQL interface
- Chukwa - log processing
- Pig - data analysis language
- ZooKeeper - distributed systems management

Limitations

- The system is designed for gigabytes to terabytes of data and can only be scaled down to a limited threshold
- Because that threshold is so high, the system is limited in many ways
- This hampers the system's efficiency during large computations or parallel data exchange

JSON Interface to Control HDFS An Open Source Project by Mohit Goenka

JSON

- JSON (JavaScript Object Notation) is a lightweight data-interchange format
- Can be easily read and written by humans
- Can be easily parsed by machines
- Written in text format
- Uses conventions similar to existing programming languages

JSON Data

- Based on two structures:
  - A collection of name/value pairs
  - An ordered list of values
- Concept: use the lightweight nature of JSON data to automate command execution on the HDFS interface
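The two structures named above map directly onto JSON objects and arrays, as Python's standard `json` module shows:

```python
import json

# An object (a collection of name/value pairs) containing an array
# (an ordered list of values) - the two structures JSON is built on.
doc = '{"command": "ls", "args": ["/user", "/tmp"]}'
parsed = json.loads(doc)
print(parsed["command"])   # 'ls'
print(parsed["args"][1])   # '/tmp'
print(json.dumps(parsed))  # round-trips back to text
```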

Goal

- Designing a JSON interface to control HDFS
- Development of two modules:
  - One for writing into the system
  - One for reading from the system

Outcome

- The user can specify execution commands directly in the JSON file along with the data
- Only the data gets stored into the system
- Commands are deleted from the file after execution
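The workflow above can be sketched as follows. Note this is hypothetical: the `"data"` and `"commands"` field names and the command handling are invented for illustration, since the project's actual file format is not shown in the slides.

```python
import json

def process(document: str):
    """Parse a JSON document carrying both data and commands; execute and
    strip the commands (hypothetical format), keeping only the data."""
    obj = json.loads(document)
    commands = obj.pop("commands", [])          # extract and delete commands
    log = [f"ran {c}" for c in commands]        # stand-in for real execution
    return json.dumps(obj), log                 # only the data survives

remaining, log = process('{"data": {"name": "file1"}, "commands": ["put", "ls"]}')
print(remaining)  # {"data": {"name": "file1"}}
print(log)        # ['ran put', 'ran ls']
```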

Sources and Acknowledgements

Sources

- Dhruba Borthakur, Apache Hadoop Developer, Facebook Data Infrastructure
- Matei Zaharia, Cloudera / Facebook / UC Berkeley RAD Lab
- Devaraj Das, Yahoo! Inc. Bangalore and Apache Software Foundation
- HDFS Java API: -
- HDFS source code: -

Acknowledgements

- Professor Chris Mattmann for guidance as and when required
- Hossein (Farshad) Tajalli for his continued support and help throughout the project
- All my classmates for providing valuable inputs throughout the work, especially through their presentations

That's All Folks!