INTRODUCTION TO HADOOP
OUTLINE
- What is Hadoop
- The Core of Hadoop
- Structure of the Hadoop Distributed File System
- Structure of the MapReduce Framework
- The Characteristics of Hadoop
- The Distribution of a Hadoop Cluster
- The Structure of a Small Hadoop Cluster
- The Structure of a Single Node
- Case Study
WHAT IS HADOOP
A framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
Versions:
- Apache
- Cloudera
- Yahoo
WHAT IS HADOOP
- Apache: the official version.
- Cloudera: very popular; relatively reliable, with tech support and several useful patches on top of Apache.
- Yahoo: an internal version used within Yahoo.
THE CORE OF HADOOP
- HDFS: Hadoop Distributed File System
- MapReduce: parallel computation framework
Image source: http://www.glennklockwood.com/di/hadoop-overview.php
MAPREDUCE
An application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. This approach takes advantage of data locality, allowing the data to be processed faster and more efficiently through distributed processing. A toy illustration of the model follows.
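To make the model concrete, here is a minimal single-process sketch of word counting in plain Python (no Hadoop involved; the names map_fn and reduce_fn are illustrative, not part of any Hadoop API). The map step emits (word, 1) pairs, a sort stands in for the shuffle, and the reduce step sums the counts per word:

    from itertools import groupby
    from operator import itemgetter

    def map_fn(line):
        # map: emit a (word, 1) pair for every word in the line
        return [(word, 1) for word in line.split()]

    def reduce_fn(word, counts):
        # reduce: sum the partial counts collected for one word
        return (word, sum(counts))

    lines = ["hello world", "hello hadoop"]                    # stand-in for an input split
    pairs = sorted(p for line in lines for p in map_fn(line))  # "shuffle": sorting groups equal keys
    results = [reduce_fn(word, [c for _, c in group])
               for word, group in groupby(pairs, key=itemgetter(0))]
    print(results)  # [('hadoop', 1), ('hello', 2), ('world', 1)]

On a real cluster the map and reduce calls run on different nodes and the framework performs the shuffle, but the data flow is the same.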
STRUCTURE OF HDFS
A single NameNode with many DataNodes.
The NameNode is in charge of:
- Receiving requests from users
- Maintaining the directory structure of the file system
- Managing the mapping between blocks and files, and between blocks and DataNodes
A DataNode is in charge of:
- Storing the files
- Splitting files into blocks to store them on disk
- Making backup copies (replicas) of blocks
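This division of labor is visible from the HDFS shell: a directory listing is answered from the NameNode's metadata, while reading a file streams blocks from the DataNodes. For instance (the paths are illustrative):

    bin/hadoop dfs -ls /user/hduser              # directory listing: metadata from the NameNode
    bin/hadoop dfs -cat /user/hduser/somefile    # file contents: blocks streamed from DataNodes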
STRUCTURE OF MAPREDUCE
One JobTracker with many TaskTrackers.
The JobTracker is in charge of:
- Receiving computation jobs from users
- Assigning jobs to TaskTrackers for execution
- Monitoring the status of the TaskTrackers
A TaskTracker is in charge of:
- Executing the computation jobs assigned by the JobTracker
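As a small illustration, the JobTracker can be queried from the command line with the classic (MRv1) job commands; the job ID below is a placeholder:

    bin/hadoop job -list                             # jobs currently tracked by the JobTracker
    bin/hadoop job -status job_201501011234_0001     # progress and state of one job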
CHARACTERISTICS OF A HADOOP CLUSTER
- Scalable: a single cluster can handle several thousand nodes.
- Economical: it can be built from ordinary commodity computers.
- Efficient: by assigning data to different nodes, it can process the data in parallel.
- Reliable: it keeps several copies of the data and redeploys computation tasks automatically on failure.
STRUCTURE OF A SMALL HADOOP CLUSTER
A small Hadoop cluster includes a single master and multiple worker nodes. The master node consists of a JobTracker, TaskTracker, NameNode, and DataNode. A worker node acts as both a DataNode and a TaskTracker, though it is possible to have data-only worker nodes and compute-only worker nodes.
(Diagram slides: the structure of a small Hadoop cluster and of a single node. Image sources: http://en.wikipedia.org/wiki/Apache_Hadoop, http://sens.tistory.com/256)
CASE STUDY
Writing an Hadoop MapReduce Program in Python. The objective is to develop a program that reads text files and counts how often words occur. Based on the original tutorial by Michael G. Noll: http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
INTRODUCTION
- The Hadoop framework is written in Java.
- Programs for Hadoop can be developed in Python, but normally they need to be translated into a Java jar file.
- The Hadoop Streaming API lets us write the MapReduce program in a more Pythonic way.
- Requirement: a running Hadoop (multi-node) cluster on a Linux system.
PYTHON MAPREDUCE CODE: MAPPER AND REDUCER
The Hadoop Streaming API passes data between the map and reduce code via standard input and standard output.
Give both scripts execution permission:
    chmod +x /home/hduser/mapper.py
    chmod +x /home/hduser/reducer.py
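The scripts themselves are omitted from these slides; the following are sketches in the spirit of Noll's tutorial. mapper.py reads lines from standard input and emits tab-separated <word, 1> pairs:

    #!/usr/bin/env python
    import sys

    # mapper.py: emit one tab-separated "word<TAB>1" pair per word
    for line in sys.stdin:
        for word in line.strip().split():
            print('%s\t%s' % (word, 1))

reducer.py relies on Hadoop Streaming sorting the mapper output by key before the reduce step, so all counts for a given word arrive consecutively and can be summed in a single pass:

    #!/usr/bin/env python
    import sys

    # reducer.py: sum the counts for each word; input arrives sorted by word
    current_word = None
    current_count = 0

    for line in sys.stdin:
        word, count = line.strip().split('\t', 1)
        try:
            count = int(count)
        except ValueError:
            continue  # ignore lines whose count is not a number
        if word == current_word:
            current_count += count
        else:
            if current_word is not None:
                print('%s\t%s' % (current_word, current_count))
            current_word = word
            current_count = count

    if current_word is not None:  # flush the last word
        print('%s\t%s' % (current_word, current_count))

A quick local test without Hadoop:
    echo "foo foo bar" | /home/hduser/mapper.py | sort -k1,1 | /home/hduser/reducer.py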
RUN THE MAPREDUCE JOB
Copy the local data (e.g. ebooks) to HDFS, then run the MapReduce job:
    bin/hadoop jar contrib/streaming/hadoop-*streaming*.jar \
      -mapper /home/hduser/mapper.py \
      -reducer /home/hduser/reducer.py \
      -input /user/hduser/gutenberg/* \
      -output /user/hduser/gutenberg-output
More details in the original tutorial post.
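The copy step and the inspection of the result use the HDFS shell; the local path /tmp/gutenberg follows the tutorial, and the output file part-00000 assumes a single reducer:

    bin/hadoop dfs -copyFromLocal /tmp/gutenberg /user/hduser/gutenberg   # local ebooks into HDFS
    bin/hadoop dfs -cat /user/hduser/gutenberg-output/part-00000          # view the word counts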
SUMMARY
- Hadoop is an open-source software framework for storing and processing big data in a distributed fashion on large clusters of commodity hardware. Essentially, it accomplishes two tasks: massive data storage and faster processing.
- MapReduce, as a programming model, makes it possible to process large data sets in parallel.
- The Hadoop Streaming API makes it possible to write MapReduce programs in Python.
Credit: http://www.sas.com/en_us/insights/big-data/hadoop.html
REFERENCES
- https://hadoop.apache.org/
- http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
- http://www.sas.com/en_us/insights/big-data/hadoop.html