Distributed and Parallel Processing Technology. Chapter 3: The Hadoop Distributed Filesystem. Kuldeep Gurjar, 19th March 2013.


Recap:
- Why we needed Hadoop
- Comparison between RDBMS and MapReduce
- Introduction to Hadoop (brief history)
- Introduction to MapReduce
- The process of analyzing data with Hadoop (Map & Reduce)
- Scaling out (data flow)

The Hadoop Distributed Filesystem
- Why is HDFS required? How do we define Big Data? "When a dataset outgrows the storage capacity of a single physical machine, it becomes necessary to partition it across a number of separate machines."
- Since the filesystem is network-based, the challenge is to make it tolerate node failure without any data loss.
- HDFS is the distributed filesystem for Hadoop.
- The origin comes from Google (about 95% of the architecture is implemented in the open-source counterparts):
  - The Google File System became HDFS
  - Google MapReduce became Apache MapReduce
  - Google Bigtable became Apache HBase

An image explaining the need for HDFS: a theoretical calculation in which a read that takes 45 minutes on a single machine drops to about 4.5 minutes when it is parallelized across the cluster.

The design of HDFS
- "HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware."
- Very large files (GBs, TBs and PBs): Hadoop clusters store petabytes of data today.
- Streaming data access (write once, read many times): a dataset is typically generated or copied from a source, and various analyses are performed on it over time. So the time to read the whole dataset matters more than the latency in reading the first record.
- Commodity hardware (clusters of commodity hardware): HDFS doesn't require expensive, highly reliable hardware; it is designed to carry on working without noticeable interruption when nodes fail.

Applications for which HDFS doesn't work well:
- Low-latency data access: applications that require low-latency access will not work well with HDFS, as it is optimized for delivering high throughput.
- Lots of small files: the limit to the number of files in a filesystem is governed by the amount of memory in the namenode; as a rule of thumb, each file, directory and block takes about 150 bytes of namenode memory (see the rough estimate below).
- Multiple writers, arbitrary file modifications: files in HDFS are written by a single writer; there is no support for multiple writers or for modifications at arbitrary offsets in the file.
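A back-of-the-envelope sketch of the small-files problem, assuming the roughly-150-bytes-per-object rule of thumb above (the exact figure varies by Hadoop version, and the file count is hypothetical):

  // Rough namenode memory estimate for many small files.
  // Assumption: each file contributes one file entry plus one block,
  // and each namespace object costs roughly 150 bytes of namenode heap.
  public class NamenodeMemoryEstimate {
      public static void main(String[] args) {
          long files = 1_000_000L;          // one million small files (hypothetical)
          long objectsPerFile = 2;          // file entry + single block
          long bytesPerObject = 150;        // rule-of-thumb cost per object
          long totalBytes = files * objectsPerFile * bytesPerObject;
          // Prints ~286 MB, i.e. roughly 300 MB of namenode heap.
          System.out.printf("~%d MB of namenode memory%n", totalBytes / (1024 * 1024));
      }
  }

The same volume of data stored in a few large files needs far fewer namespace objects, which is why HDFS favours large files.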

HDFS concepts: Blocks
- "A disk has a block size, which is the minimum amount of data that it can read or write."
- Filesystem blocks are typically a few kilobytes in size, while disk blocks are normally 512 bytes.
- HDFS blocks are much larger: 64 MB by default. Like in a filesystem for a single disk, files in HDFS are broken into block-sized chunks, which are stored as independent units.

Blocks:
- Why are blocks in HDFS so large? To minimize the cost of seeks. By making a block large enough, the time to transfer the data from the disk can be made significantly larger than the time to seek to the start of the block. Thus the time to transfer a large file made of multiple blocks operates at the disk transfer rate.
- "A quick calculation shows that if the seek time is around 10 ms, and the transfer rate is 100 MB/s, then to make the seek time 1% of the transfer time, we need to make the block size around 100 MB. The default is actually 64 MB, although many HDFS installations use 128 MB blocks. This figure will continue to be revised upward as transfer speeds grow with new generations of disk drives." (The calculation is sketched in code below.)
- This argument shouldn't be taken too far, however. Map tasks in MapReduce normally operate on one block at a time, so if you have too few tasks (fewer than nodes in the cluster), your jobs will run slower than they could otherwise.
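A minimal sketch of the quoted calculation, using the assumed figures of a 10 ms seek time and a 100 MB/s transfer rate:

  // Block size needed so that seek time is only 1% of transfer time.
  public class BlockSizeCalc {
      public static void main(String[] args) {
          double seekTimeSec = 0.010;        // 10 ms average seek (assumption)
          double transferRateMBs = 100.0;    // 100 MB/s sustained transfer (assumption)
          double seekFraction = 0.01;        // want seeks to cost only 1% of transfer time

          double transferTimeSec = seekTimeSec / seekFraction;     // 1 s of transfer per seek
          double blockSizeMB = transferTimeSec * transferRateMBs;  // ~100 MB
          System.out.printf("Block size of about %.0f MB%n", blockSizeMB);
      }
  }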

Benefits of blocks:
- First, a file can be larger than any single disk in the network. There's nothing that requires the blocks from a file to be stored on the same disk, so they can take advantage of any of the disks in the cluster.
- Second, making the unit of abstraction a block rather than a file simplifies the storage subsystem. Simplicity is something to strive for in all systems, but it is especially important for a distributed system in which the failure modes are so varied. The storage subsystem deals with blocks, simplifying storage management (since blocks are a fixed size, it is easy to calculate how many can be stored on a given disk) and eliminating metadata concerns (blocks are just chunks of data to be stored; file metadata such as permissions does not need to be stored with the blocks, so another system can handle metadata separately).
- Third, blocks fit well with replication for providing fault tolerance and availability. To insure against corrupted blocks and disk and machine failure, each block is replicated to a small number of physically separate machines (typically three). If a block becomes unavailable, a copy can be read from another location in a way that is transparent to the client. A block that is no longer available due to corruption or machine failure can be replicated from its alternative locations to other live machines to bring the replication factor back to the normal level. Similarly, some applications may choose to set a high replication factor for the blocks in a popular file to spread the read load over the cluster (see the sketch below).
- Like its disk filesystem cousin, HDFS's fsck command understands blocks. For example, running: % hadoop fsck / -files -blocks will list the blocks that make up each file in the filesystem.
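A minimal sketch of raising the replication factor for a popular file through the Java FileSystem API (the path and factor here are hypothetical; the same thing can be done from the shell with hadoop fs -setrep):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class BoostReplication {
      public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          FileSystem fs = FileSystem.get(conf);
          // Ask for more replicas of a frequently read file to spread the read load.
          Path popular = new Path("/user/tom/popular-dataset.txt");   // hypothetical path
          boolean requested = fs.setReplication(popular, (short) 5);  // replication is set per file
          System.out.println("Replication change requested: " + requested);
      }
  }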

HDFS nodes:
- An HDFS cluster has two types of node operating in a master-worker pattern: a namenode (the master) and a number of datanodes (workers).
- Namenode: the namenode manages the filesystem namespace. It maintains the filesystem tree and the metadata for all the files and directories in the tree. This information is stored persistently on the local disk in the form of two files: the namespace image and the edit log. The namenode also knows the datanodes on which all the blocks for a given file are located; however, it does not store block locations persistently, since this information is reconstructed from the datanodes when the system starts. A client accesses the filesystem on behalf of the user by communicating with the namenode and datanodes, so the user code does not need to know about the namenode and datanodes to function.
- Datanode: datanodes are the workhorses of the filesystem. They store and retrieve blocks when they are told to (by clients or the namenode), and they report back to the namenode periodically with lists of blocks that they are storing.

Master-slave architecture (diagram of the namenode and its datanodes).

The command-line interface:
- There are many other interfaces to HDFS, but the command line is one of the simplest and, to many developers, the most familiar.
- In order to run HDFS on a machine, we need to follow the instructions for setting up Hadoop given in Appendix A of the book.
- There are two properties that we set in the pseudo-distributed configuration that deserve further explanation. The first is fs.default.name, set to hdfs://localhost/, which is used to set a default filesystem for Hadoop. Filesystems are specified by a URI, and here we have used an hdfs URI to configure Hadoop to use HDFS by default. The HDFS daemons will use this property to determine the host and port for the HDFS namenode. We'll be running it on localhost, on the default HDFS port, and HDFS clients will use this property to work out where the namenode is running so they can connect to it.
- We set the second property, dfs.replication, to 1 so that HDFS doesn't replicate filesystem blocks by the default factor of three. When running with a single datanode, HDFS can't replicate blocks to three datanodes, so it would perpetually warn about blocks being under-replicated. This setting solves that problem. (A sketch of the configuration files appears below.)
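A minimal sketch of what this pseudo-distributed configuration might look like, in the style of the book's Appendix A (file names follow Hadoop 1.x conventions; the property names and values are the ones described above):

  <!-- core-site.xml: default filesystem for Hadoop -->
  <configuration>
    <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost/</value>
    </property>
  </configuration>

  <!-- hdfs-site.xml: single-datanode setup, so only one replica per block -->
  <configuration>
    <property>
      <name>dfs.replication</name>
      <value>1</value>
    </property>
  </configuration>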

Basic filesystem operations:
- We can do all of the usual filesystem operations, such as reading files, creating directories, moving files, deleting data, and listing directories.
- For example, we will start by copying a file from the local filesystem to HDFS: % hadoop fs -copyFromLocal input/docs/quangle.txt hdfs://localhost/user/tom/quangle.txt
- This command invokes Hadoop's filesystem shell command fs, which supports a number of subcommands; in this case, we are running -copyFromLocal. The local file quangle.txt is copied to the file /user/tom/quangle.txt on the HDFS instance running on localhost. A few more everyday commands are sketched below.
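A few more everyday shell commands in the same vein (the paths reuse the example above; the directory name is hypothetical):

  % hadoop fs -mkdir /user/tom/books                                # create a directory
  % hadoop fs -ls /user/tom                                         # list a directory
  % hadoop fs -copyToLocal /user/tom/quangle.txt quangle.copy.txt   # copy a file back to the local filesystem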

File permissions in HDFS:
- HDFS has a permissions model for files and directories that is much like POSIX.
- There are three types of permission: the read permission (r), the write permission (w), and the execute permission (x). The read permission is required to read files or list the contents of a directory. The write permission is required to write a file, or, for a directory, to create or delete files or directories in it. The execute permission is ignored for a file, since you can't execute a file on HDFS (unlike POSIX), and for a directory it is required to access its children.
- Each file and directory has an owner, a group, and a mode. The mode is made up of the permissions for the user who is the owner, the permissions for the users who are members of the group, and the permissions for users who are neither the owners nor members of the group. (A couple of shell examples follow.)
Note: some technical parts are skipped, as the main goal is to explain the concepts.
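A couple of illustrative shell commands for inspecting and changing permissions (the paths and group name are hypothetical; the commands mirror their POSIX namesakes):

  % hadoop fs -ls /user/tom                                 # shows mode, owner and group for each entry
  % hadoop fs -chmod 644 /user/tom/quangle.txt              # rw- for the owner, r-- for group and others
  % hadoop fs -chown tom:analysts /user/tom/quangle.txt     # change owner and group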

The Java interface
- Hadoop is written in Java, and all Hadoop filesystem interactions are mediated through the Java API. The filesystem shell, for example, is a Java application that uses the Java FileSystem class to provide filesystem operations.
- There are a few APIs available to help non-Java systems interact with the Hadoop filesystem: Thrift, C, FUSE.
- Reading data from a Hadoop URL: one of the simplest ways to read a file from a Hadoop filesystem is by using a java.net.URL object to open a stream to read the data from. The general idiom is shown here, and a fuller sketch of a small program built around it appears below:
    InputStream in = null;
    try {
        in = new URL("hdfs://host/path").openStream();
        // process in
    } finally {
        IOUtils.closeStream(in);
    }
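A fuller sketch along the lines of the book's URLCat example. Note that Java recognises the hdfs URL scheme only after the URL stream handler factory has been set to Hadoop's FsUrlStreamHandlerFactory, which can be done only once per JVM:

  import java.io.InputStream;
  import java.net.URL;
  import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
  import org.apache.hadoop.io.IOUtils;

  public class URLCat {
      static {
          // Teach the JVM how to handle hdfs:// URLs (may only be called once per JVM).
          URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
      }

      public static void main(String[] args) throws Exception {
          InputStream in = null;
          try {
              in = new URL(args[0]).openStream();             // e.g. hdfs://localhost/user/tom/quangle.txt
              IOUtils.copyBytes(in, System.out, 4096, false); // copy the stream to stdout
          } finally {
              IOUtils.closeStream(in);
          }
      }
  }

It could be run with something like: % hadoop URLCat hdfs://localhost/user/tom/quangle.txt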

The Java interface:
- Writing data
- The FileSystem class has a number of methods for creating a file. The simplest is the method that takes a Path object for the file to be created and returns an output stream to write to: public FSDataOutputStream create(Path f) throws IOException
- There are overloaded versions of this method that allow you to specify whether to forcibly overwrite existing files, the replication factor of the file, the buffer size to use when writing the file, the block size for the file, and file permissions.
- Directories
- FileSystem provides a method to create a directory: public boolean mkdirs(Path f) throws IOException
- This method creates all of the necessary parent directories if they don't already exist, just like java.io.File's mkdirs() method. It returns true if the directory (and all parent directories) was successfully created.
- Often, you don't need to explicitly create a directory, since writing a file, by calling create(), will automatically create any parent directories. (A short sketch putting these calls together follows.)
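A minimal sketch putting create() and mkdirs() together (the filesystem URI and paths are hypothetical):

  import java.net.URI;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class WriteExample {
      public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          FileSystem fs = FileSystem.get(URI.create("hdfs://localhost/"), conf);

          // create() returns a stream to write to and creates any missing parent directories.
          Path file = new Path("/user/tom/output/hello.txt");
          FSDataOutputStream out = fs.create(file);
          out.writeUTF("hello, hdfs");
          out.close();

          // mkdirs() behaves like java.io.File.mkdirs(), creating parents as needed.
          boolean created = fs.mkdirs(new Path("/user/tom/archive/2013"));
          System.out.println("Directory created: " + created);
      }
  }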

The Java interface:
- Querying the filesystem
- File metadata: FileStatus. An important feature of any filesystem is the ability to navigate its directory structure and retrieve information about the files and directories that it stores. The FileStatus class encapsulates filesystem metadata for files and directories, including file length, block size, replication, modification time, ownership, and permission information. The method getFileStatus() on FileSystem provides a way of getting a FileStatus object for a single file or directory.
- Deleting data
- Use the delete() method on FileSystem to permanently remove files or directories: public boolean delete(Path f, boolean recursive) throws IOException
- If f is a file or an empty directory, the value of recursive is ignored. A non-empty directory is only deleted, along with its contents, if recursive is true (otherwise an IOException is thrown). (A short sketch follows.)
Note: network topology, replica placement, etc. are to be discussed in the next class.
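A minimal sketch of querying metadata and deleting a directory (it assumes the fs handle and the paths from the previous sketch):

  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.Path;

  // ... given a FileSystem fs obtained as in the previous example ...
  FileStatus stat = fs.getFileStatus(new Path("/user/tom/output/hello.txt"));
  System.out.println("length: " + stat.getLen()
          + ", block size: " + stat.getBlockSize()
          + ", replication: " + stat.getReplication()
          + ", owner: " + stat.getOwner());

  // List a directory: listStatus() returns one FileStatus per child.
  for (FileStatus s : fs.listStatus(new Path("/user/tom"))) {
      System.out.println(s.getPath());
  }

  // Delete a non-empty directory; recursive must be true or an IOException is thrown.
  boolean deleted = fs.delete(new Path("/user/tom/archive"), true);
  System.out.println("Deleted: " + deleted);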

Thank you!