INTRODUCTION TO HADOOP
OUTLINE
- What is Hadoop
- The Core of Hadoop
- Structure of the Hadoop Distributed File System
- Structure of the MapReduce Framework
- The Characteristics of Hadoop
- The Distribution of a Hadoop Cluster
- The Structure of a Small Hadoop Cluster
- The Structure of a Single Node
- Case Study
WHAT IS HADOOP
A framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
Versions:
- Apache
- Cloudera
- Yahoo
WHAT IS HADOOP
- Apache: the official version.
- Cloudera: very popular; relatively reliable, with tech support and several useful patches on top of Apache.
- Yahoo: an internal version used within Yahoo.
THE CORE OF HADOOP
- HDFS: Hadoop Distributed File System
- MapReduce: parallel computation framework
Image source: http://www.glennklockwood.com/di/hadoop-overview.php
MAPREDUCE
An application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. This approach takes advantage of data locality, allowing the data to be processed faster and more efficiently through distributed processing. A toy illustration of the model follows.
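To make the model concrete, here is a minimal single-process sketch of word counting in plain Python (no Hadoop involved; the names map_fn and reduce_fn are illustrative, not part of any Hadoop API). The map step emits (word, 1) pairs, a sort stands in for the shuffle, and the reduce step sums the counts per word:

    from itertools import groupby
    from operator import itemgetter

    def map_fn(line):
        # map: emit a (word, 1) pair for every word in the line
        return [(word, 1) for word in line.split()]

    def reduce_fn(word, counts):
        # reduce: sum the partial counts collected for one word
        return (word, sum(counts))

    lines = ["hello world", "hello hadoop"]                    # stand-in for an input split
    pairs = sorted(p for line in lines for p in map_fn(line))  # "shuffle": sorting groups equal keys
    results = [reduce_fn(word, [c for _, c in group])
               for word, group in groupby(pairs, key=itemgetter(0))]
    print(results)  # [('hadoop', 1), ('hello', 2), ('world', 1)]

On a real cluster the map and reduce calls run on different nodes and the framework performs the shuffle, but the data flow is the same.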
STRUCTURE OF HDFS
A single NameNode with many DataNodes.
The NameNode is in charge of:
- Receiving requests from users
- Maintaining the directory structure of the file system
- Managing the mapping between blocks and files, and between blocks and DataNodes
A DataNode is in charge of:
- Storing the files
- Splitting files into blocks to store them on disk
- Making backup copies (replicas) of blocks
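This division of labor is visible from the HDFS shell: a directory listing is answered from the NameNode's metadata, while reading a file streams blocks from the DataNodes. For instance (the paths are illustrative):

    bin/hadoop dfs -ls /user/hduser              # directory listing: metadata from the NameNode
    bin/hadoop dfs -cat /user/hduser/somefile    # file contents: blocks streamed from DataNodes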
STRUCTURE OF MAPREDUCE
One JobTracker with many TaskTrackers.
The JobTracker is in charge of:
- Receiving computation jobs from users
- Assigning jobs to TaskTrackers for execution
- Monitoring the status of the TaskTrackers
A TaskTracker is in charge of:
- Executing the computation jobs assigned by the JobTracker
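As a small illustration, the JobTracker can be queried from the command line with the classic (MRv1) job commands; the job ID below is a placeholder:

    bin/hadoop job -list                             # jobs currently tracked by the JobTracker
    bin/hadoop job -status job_201501011234_0001     # progress and state of one job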
CHARACTERISTICS OF A HADOOP CLUSTER
- Scalable: a single cluster can handle several thousand nodes.
- Economical: it can be built from ordinary commodity computers.
- Efficient: by assigning data to different nodes, it can process the data in parallel.
- Reliable: it keeps several copies of the data and redeploys computation tasks automatically on failure.
STRUCTURE OF A SMALL HADOOP CLUSTER
A small Hadoop cluster includes a single master and multiple worker nodes. The master node consists of a JobTracker, TaskTracker, NameNode, and DataNode. A worker node acts as both a DataNode and a TaskTracker, though it is possible to have data-only worker nodes and compute-only worker nodes.
(Diagram slides: the structure of a small Hadoop cluster and of a single node. Image sources: http://en.wikipedia.org/wiki/Apache_Hadoop, http://sens.tistory.com/256)
CASE STUDY
Writing an Hadoop MapReduce Program in Python. The objective is to develop a program that reads text files and counts how often words occur. Based on the original tutorial by Michael G. Noll: http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
INTRODUCTION
- The Hadoop framework is written in Java.
- Programs for Hadoop can be developed in Python, but normally they need to be translated into a Java jar file.
- The Hadoop Streaming API lets us write the MapReduce program in a more Pythonic way.
- Requirement: a running Hadoop (multi-node) cluster on a Linux system.
PYTHON MAPREDUCE CODE: MAPPER AND REDUCER
The Hadoop Streaming API passes data between the map and reduce code via standard input and standard output.
Give both scripts execution permission:
    chmod +x /home/hduser/mapper.py
    chmod +x /home/hduser/reducer.py
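The scripts themselves are omitted from these slides; the following are sketches in the spirit of Noll's tutorial. mapper.py reads lines from standard input and emits tab-separated <word, 1> pairs:

    #!/usr/bin/env python
    import sys

    # mapper.py: emit one tab-separated "word<TAB>1" pair per word
    for line in sys.stdin:
        for word in line.strip().split():
            print('%s\t%s' % (word, 1))

reducer.py relies on Hadoop Streaming sorting the mapper output by key before the reduce step, so all counts for a given word arrive consecutively and can be summed in a single pass:

    #!/usr/bin/env python
    import sys

    # reducer.py: sum the counts for each word; input arrives sorted by word
    current_word = None
    current_count = 0

    for line in sys.stdin:
        word, count = line.strip().split('\t', 1)
        try:
            count = int(count)
        except ValueError:
            continue  # ignore lines whose count is not a number
        if word == current_word:
            current_count += count
        else:
            if current_word is not None:
                print('%s\t%s' % (current_word, current_count))
            current_word = word
            current_count = count

    if current_word is not None:  # flush the last word
        print('%s\t%s' % (current_word, current_count))

A quick local test without Hadoop:
    echo "foo foo bar" | /home/hduser/mapper.py | sort -k1,1 | /home/hduser/reducer.py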
RUN THE MAPREDUCE JOB
Copy the local data (e.g. ebooks) to HDFS, then run the MapReduce job:
    bin/hadoop jar contrib/streaming/hadoop-*streaming*.jar \
      -mapper /home/hduser/mapper.py \
      -reducer /home/hduser/reducer.py \
      -input /user/hduser/gutenberg/* \
      -output /user/hduser/gutenberg-output
More details in the original tutorial post.
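The copy step and the inspection of the result use the HDFS shell; the local path /tmp/gutenberg follows the tutorial, and the output file part-00000 assumes a single reducer:

    bin/hadoop dfs -copyFromLocal /tmp/gutenberg /user/hduser/gutenberg   # local ebooks into HDFS
    bin/hadoop dfs -cat /user/hduser/gutenberg-output/part-00000          # view the word counts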
SUMMARY
- Hadoop is an open-source software framework for storing and processing big data in a distributed fashion on large clusters of commodity hardware. Essentially, it accomplishes two tasks: massive data storage and faster processing.
- MapReduce, as a programming model, makes it possible to process large data sets in parallel.
- The Hadoop Streaming API makes it possible to write MapReduce programs in Python.
Credit: http://www.sas.com/en_us/insights/big-data/hadoop.html
REFERENCES
- https://hadoop.apache.org/
- http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
- http://www.sas.com/en_us/insights/big-data/hadoop.html