INTRODUCTION TO HADOOP

OUTLINE  What is Hadoop  The core of Hadoop  Structure of Hadoop Distributed File System  Structure of MapReduce Framework  The characteristics of Hadoop  The Distribution of a Hadoop Cluster  The Structure of a Small Hadoop Cluster  The Structure of Single Node  Case Study

WHAT IS HADOOP  A framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.  Versions  Apache  Cloudera  Yahoo

WHAT IS HADOOP  Apache  The official open-source version  Cloudera  Very popular distribution.  Relatively reliable, with commercial technical support.  Adds several useful patches on top of the Apache version.  Yahoo  Internal version used within Yahoo

THE CORE OF HADOOP  HDFS: Hadoop Distributed File System  MapReduce: Parallel Computation Framework

MAPREDUCE  The application is divided into many small fragments of work.  Each fragment may be executed or re-executed on any node in the cluster.  This approach exploits data locality: computation is moved to where the data lives, so the data is processed faster and more efficiently.  A minimal local sketch of the model is shown below.
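To make the division into map and reduce concrete, here is a purely local Python sketch of the word-count problem. It runs on a single machine with no Hadoop involved, and all names in it are illustrative only; in a real job, Hadoop performs the grouping ("shuffle") step between the two phases automatically and in a distributed fashion.

    from collections import defaultdict

    def map_phase(fragment):
        # Map: emit an intermediate (word, 1) pair for every word in one fragment of input.
        for word in fragment.split():
            yield (word, 1)

    def reduce_phase(word, counts):
        # Reduce: combine all intermediate values that share the same key.
        return (word, sum(counts))

    fragments = ["the quick brown fox", "the lazy dog"]

    # Shuffle: group intermediate pairs by key (done by hand here, by Hadoop in a real job).
    grouped = defaultdict(list)
    for fragment in fragments:
        for word, count in map_phase(fragment):
            grouped[word].append(count)

    result = [reduce_phase(word, counts) for word, counts in grouped.items()]
    print(result)  # e.g. [('the', 2), ('quick', 1), ('brown', 1), ...] (order may vary)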

STRUCTURE OF HDFS  A single NameNode with many DataNodes  The NameNode is in charge of:  Receiving requests from users  Maintaining the directory structure of the file system  Managing the mapping between files and blocks, and between blocks and DataNodes

 DataNodes are in charge of:  Storing files, split into blocks, on local disk  Making backups by replicating blocks  A brief client-side usage sketch follows below.
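As a small illustration of this division of labour, the sketch below uploads a file to HDFS and then lists the directory by shelling out to the standard hadoop fs commands. The local and HDFS paths are made-up examples and the target directory is assumed to exist; the point is that a listing is answered from the NameNode's metadata alone, while the file's contents are written to and read from DataNodes.

    import subprocess

    # Upload a local file; the NameNode records the file-to-block mapping and
    # chooses which DataNodes store (and replicate) each block.
    subprocess.run(["hadoop", "fs", "-put", "/tmp/ebook.txt", "/user/hduser/gutenberg/"],
                   check=True)

    # A directory listing is served from the NameNode's directory structure.
    subprocess.run(["hadoop", "fs", "-ls", "/user/hduser/gutenberg"], check=True)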

STRUCTURE OF MAPREDUCE  A single JobTracker with many TaskTrackers  The JobTracker is in charge of:  Receiving computation jobs from users  Assigning tasks of each job to TaskTrackers for execution  Monitoring the status of the TaskTrackers  The TaskTracker is in charge of:  Executing the tasks assigned to it by the JobTracker

CHARACTERISTICS OF A HADOOP CLUSTER  Scalable and economical  It can be built from ordinary commodity computers and can handle several thousand nodes in one cluster.  Efficient  By distributing data across the nodes, it can process the data in parallel.  Reliable  It keeps several copies of the data and automatically redeploys computation tasks when failures occur.

STRUCTURE OF A SMALL HADOOP CLUSTER  A small Hadoop cluster includes a single master and multiple worker nodes.  The master node consists of a JobTracker, TaskTracker, NameNode and DataNode.  A worker node acts as both a DataNode and TaskTracker, though it is possible to have data-only worker nodes and compute-only worker nodes.

CASE STUDY  Writing an Hadoop MapReduce Program in Python  The objective is to develop a program that reads text files and counts how often words occur.  Original tutorial developed by Michael G. Noll: hadoop-mapreduce-program-in-python/

INTRODUCTION  The Hadoop framework is written in Java.  Programs for Hadoop can also be developed in Python, but they normally have to be translated into a Java jar file.  With the Hadoop Streaming API, the MapReduce program can instead be written in a more Pythonic way.  Requirement: a running Hadoop (multi-node) cluster on a Linux system

PYTHON MAPREDUCE CODE  Use the Hadoop Streaming API to pass data between the Map and Reduce code via standard input and standard output  Assign execution permission to the mapper Python file: chmod +x /home/hduser/mapper.py  Assign execution permission to the reducer Python file: chmod +x /home/hduser/reducer.py  Sketches of both scripts follow below.
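The slides do not reproduce the two scripts themselves; the following is a minimal reconstruction in the spirit of Noll's word-count tutorial, not the exact original code. mapper.py reads lines from standard input and emits one tab-separated "word 1" pair per word; reducer.py assumes Hadoop Streaming has already sorted the mapper output by key, so all counts for a given word arrive consecutively.

    #!/usr/bin/env python
    # mapper.py: emit "word<TAB>1" for every word read from standard input.
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t%s" % (word, 1))

    #!/usr/bin/env python
    # reducer.py: sum the counts for each word; input arrives sorted by word.
    import sys

    current_word = None
    current_count = 0

    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue
        word, count = line.split("\t", 1)
        count = int(count)
        if word == current_word:
            current_count += count
        else:
            if current_word is not None:
                print("%s\t%d" % (current_word, current_count))
            current_word = word
            current_count = count

    if current_word is not None:
        print("%s\t%d" % (current_word, current_count))

Both scripts can be sanity-checked locally before submitting a job, for example with: echo "foo foo bar" | /home/hduser/mapper.py | sort | /home/hduser/reducer.py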

RUN THE MAPREDUCE JOB  Copy the local data (e.g. an eBook) to HDFS  Run the MapReduce job:
bin/hadoop jar contrib/streaming/hadoop-*streaming*.jar \
  -mapper /home/hduser/mapper.py \
  -reducer /home/hduser/reducer.py \
  -input /user/hduser/gutenberg/* \
  -output /user/hduser/gutenberg-output
 If the scripts are not already present at these paths on every node, they can be shipped to the cluster with the streaming -file option.  More details in the original tutorial post (see References).

SUMMARY  Hadoop is an open-source software framework for storing and processing big data in a distributed fashion on large clusters of commodity hardware.  Essentially, it accomplishes two tasks: massive data storage and faster processing.  MapReduce, as a programming model, makes it possible to process large data sets in parallel.  The Hadoop Streaming API makes it possible to write the MapReduce code in Python.

REFERENCES  Michael G. Noll, Writing an Hadoop MapReduce Program in Python: hadoop-mapreduce-program-in-python/  data/hadoop.html