CSE 482: Big Data Analysis
Lecture 16: Intro to MapReduce and Hadoop
Outline of Today’s Lecture
- What is Hadoop and MapReduce?
- Why use Hadoop/MapReduce?
- What is the Hadoop Distributed File System (HDFS)?
- How to use HDFS?
- How to run a MapReduce job on a Hadoop cluster?
MapReduce versus Hadoop
- MapReduce is a distributed programming model for writing and executing applications that require processing massive amounts of data; it was originally developed at Google
- Hadoop is its open-source implementation, whose development was led by Yahoo (it is now part of the Apache project)
- This class will focus on the Hadoop implementation
Why use Hadoop/MapReduce?
- Big data applications require huge computational resources beyond those provided by a single machine
- However, writing programs that run on multiple machines can be daunting
- Hadoop/MapReduce provides a computing environment that simplifies the process of writing distributed programs
Advantages of Hadoop/MapReduce
- Accessible: runs on large clusters of commodity (PC) machines instead of expensive parallel machines
- Robust: gracefully handles hardware failures
- Scalable: scales linearly to handle larger data by adding more nodes to the cluster
- Simple: lets users quickly write efficient parallel code that can be executed on a single machine or a cluster of machines
Key Concepts in MapReduce/Hadoop
- The Hadoop architecture includes both distributed storage and distributed computation
- Distributed storage is managed by the Hadoop Distributed File System (HDFS)
- Data is split into smaller blocks that are distributed and replicated across multiple machines in the cluster
- Data are stored as “key-value” pairs instead of relational tables
Key Concepts in MapReduce/Hadoop
- “Scale-out” rather than “scale-up” architecture: instead of using expensive high-end servers to process the massive data, Hadoop is designed to run on a cluster of commodity PC machines
- Move code to data (instead of data to code)
- Offline batch processing instead of online transaction processing (OLTP), although there has been some recent development for stream processing (e.g., Apache Storm)
Hadoop Architecture
- Hadoop employs a master/slave architecture for its distributed storage and computation
- When you launch Hadoop, the following “daemons” (resident programs) run in the background:
- For distributed storage: the master node runs a NameNode daemon, and each slave node runs a DataNode daemon
- For distributed computation: the master node launches a JobTracker daemon, and each slave node launches a TaskTracker daemon
NameNode and DataNodes
- The distributed storage system is called the Hadoop Distributed File System (HDFS)
- Each file is divided into blocks (the default block size is 64MB) that are stored and replicated on multiple machines
- The NameNode is the master of HDFS; it directs the slave DataNodes to perform low-level I/O tasks
- When your client program needs to read or write an HDFS file, the NameNode tells your client on which DataNode each block resides; your client then communicates directly with the DataNodes (this is done automatically, without the programmer having to manage it)
[Figure: NameNode and DataNodes]
JobTracker and TaskTracker
- The JobTracker serves as a liaison between your client program and Hadoop
- When you submit a job, the JobTracker determines the execution plan (which files to process and how to assign nodes to different tasks) and monitors the tasks for completion
- If a task fails, the JobTracker automatically relaunches it, possibly on a different node, up to a predefined limit of retries
- The TaskTracker manages the execution of individual tasks on each slave node
[Figure: JobTracker and TaskTracker]
Hadoop Distributed File System (HDFS)
- Important: before you run any Hadoop jobs, you must first upload your data set to HDFS
- After the job has completed, you can copy the results from HDFS to your local filesystem
- Thus, when executing a Hadoop program, you must be able to distinguish between the local filesystem and the Hadoop filesystem (HDFS)
Basic HDFS Commands
- Syntax: hadoop fs -<command> <arguments>  (example commands for each item are sketched below)
- To list files in the current directory
- To view the subdirectories
- To make a directory
- To copy a file from HDFS to the local filesystem
- To copy a file from the local filesystem to HDFS
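For reference, a typical set of commands for these operations, assuming the classic hadoop fs shell; flag spellings vary slightly across Hadoop versions, and mydir/data.txt are placeholder names:

    hadoop fs -ls                # list files in the current HDFS directory
    hadoop fs -ls -R             # list subdirectories recursively (older releases: -lsr)
    hadoop fs -mkdir mydir       # make a directory in HDFS
    hadoop fs -get data.txt .    # copy a file from HDFS to the local filesystem
    hadoop fs -put data.txt .    # copy a file from the local filesystem to HDFS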
Basic HDFS Commands
- To view a file
- If the file is too large, you can pipe it to a Linux command
- To display the last kilobyte of the file
- To delete a file
- To remove a directory
(Example commands are sketched below.)
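Typical commands for these operations, again with placeholder file names:

    hadoop fs -cat data.txt           # view a file
    hadoop fs -cat data.txt | head    # pipe a large file into a Linux command
    hadoop fs -tail data.txt          # display the last kilobyte of the file
    hadoop fs -rm data.txt            # delete a file
    hadoop fs -rm -r mydir            # remove a directory (older releases: -rmr)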
Hadoop Programs
- Next, we describe the basics of Hadoop/MapReduce programs
- How do Hadoop/MapReduce programs differ from standard programs?
- An example of running a Hadoop job will also be illustrated
Motivating Example
- Consider the problem of finding “hot items” in a data stream of tokens
- Example: find the most frequent words that appear in Twitter messages
- Imagine you write a program to do this on a single machine: keep a counter for each unique term, and increment the counter associated with each term that appears in the data stream
WordCount Example
- Pseudocode (serial algorithm): see the sketch below
- NOT SCALABLE: a single machine must scan the entire input and hold a counter for every unique term in memory
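A minimal serial sketch in Java; the input file name and whitespace tokenization are illustrative assumptions:

    import java.nio.file.*;
    import java.util.*;

    public class SerialWordCount {
        public static void main(String[] args) throws Exception {
            Map<String, Integer> counts = new HashMap<>();
            // One machine scans every line of the entire input
            for (String line : Files.readAllLines(Paths.get("tweet.txt"))) {
                for (String word : line.toLowerCase().split("\\s+")) {
                    counts.merge(word, 1, Integer::sum);  // increment this term's counter
                }
            }
            counts.forEach((w, c) -> System.out.println(w + "\t" + c));
        }
    }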
WordCount Example
- If you have millions of documents, one possibility is to distribute the work over several machines
- Pseudocode for a distributed approach (sketched below):
- Phase 1: each machine computes local word counts over its share of the documents
- Phase 2: the local results are combined on a single machine to obtain the global counts
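A sketch of the two phases, assuming the assignment of documents to machines is given:

    Phase 1 (on each machine, in parallel):
        localCounts = empty map
        for each document assigned to this machine:
            for each word w in the document:
                localCounts[w] = localCounts[w] + 1
        send localCounts to the collector machine

    Phase 2 (on a single collector machine):
        globalCounts = empty map
        for each localCounts map received:
            for each (w, c) in localCounts:
                globalCounts[w] = globalCounts[w] + c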
[Figure: Distributed WordCount Example, Phase 1 and Phase 2]
Challenges for Distributed WordCount
- Bottleneck if all the data files are stored on a single server; the solution is to store the files over many machines
- Phase 2 is performed by only one machine: can we do phase 2 in a distributed fashion?
- E.g., machine 1 processes all words starting with a-i, machine 2 processes words starting with j-r, and machine 3 processes words starting with s-z
- This requires partitioning the word counts and shuffling each partition to a different machine, so that each machine handles only one partition in phase 2
Hadoop
Hadoop provides an abstraction that hides many of the system-level implementation details, so that programmers do not have to worry about:
- How to partition a large job into smaller tasks that can be assigned to different machines and executed in parallel during phase 1 and phase 2
- How to divide the data among the machines
- How to coordinate synchronization among the different machines working in parallel
- How to collect and combine the partial results from the parallel machines
- How to accomplish all of the above in the face of software errors and hardware failures
Hadoop
- Hadoop provides built-in Java libraries to support implementation of distributed computing tasks
- The libraries include Java classes for mappers (phase 1) and reducers (phase 2); other classes include partitioners and combiners
- Programming in Hadoop requires you to decompose a computational problem into mapper and reducer tasks
- Once the problem has been decomposed, scaling the application to hundreds or thousands of machines is simple (it can be done without changing the code)
Mappers and Reducers
- Mappers: perform local computation on each machine
- Reducers: “combine” the local results obtained from different mappers to obtain the “global” solution (a sketch of both follows)
- Input data to mappers and reducers are in the form of <key, value> pairs
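To make this concrete, a mapper/reducer pair in the style of the stock Hadoop WordCount example; class names follow that example, and both are written as plain top-level classes so the sketch stands alone:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Mapper (phase 1): local computation -- emit a <word, 1> pair for every token
    class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);   // key-value output: <word, 1>
            }
        }
    }

    // Reducer (phase 2): combine the mappers' local results -- sum the counts per word
    class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();           // one partial count per mapper/combiner
            }
            result.set(sum);
            context.write(key, result);     // key-value output: <word, total count>
        }
    }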
[Figure: Hadoop Example]
Running a Hadoop Job
- Read the lecture16b supplementary material on how to create an account on Amazon Web Services and set up your private-public keys
- You should apply for AWS Education credit ($35) to run your Hadoop code on AWS
- After signing in and successfully setting up your account and private-public keys:
  - Launch an AWS EMR cluster
  - An EMR (Elastic MapReduce) cluster already has the software needed for this class (Hadoop, Hive, Spark, Pig)
  - Connect to the cluster using SSH
Running a Hadoop Job
- Directory that contains the Hadoop installation on AWS (see below)
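A way to locate the installation from the shell; the paths in the comments are typical for EMR but vary by release:

    which hadoop          # the hadoop launcher, e.g. /usr/bin/hadoop
    ls /usr/lib/hadoop    # a common Hadoop installation directory on EMR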
WordCount Example
- Part of the hadoop-examples.jar code (a driver sketch follows)
- Example: count the frequency of terms that appear in tweets from CDCgov
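A driver in the style of the stock WordCount example, assuming the mapper and reducer sketched earlier sit in the same package:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);   // phase 1: local counting
            job.setCombinerClass(IntSumReducer.class);   // optional local aggregation
            job.setReducerClass(IntSumReducer.class);    // phase 2: global sums
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. tweet.txt
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. output
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }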
WordCount Example
- Every line in the tweet.txt file corresponds to a separate tweet message
WordCount Example
- First, you need to locate the hadoop-examples.jar file
- Next, run the Hadoop wordcount program on the tweet.txt file (example commands below)
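A plausible command sequence, assuming the jar name used in this lecture (its location varies across Hadoop/EMR releases):

    find / -name 'hadoop*examples*.jar' 2>/dev/null    # locate the examples jar
    hadoop fs -put tweet.txt tweet.txt                 # upload the input file to HDFS
    hadoop jar hadoop-examples.jar wordcount tweet.txt output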
WordCount Example
- After the program has completed successfully, you can view the results in output/part-r-00000 (one way to do so follows)
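One way to inspect the result directly from HDFS, assuming the output path used above:

    hadoop fs -cat output/part-r-00000 | head    # first few <word, count> lines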
Summary
- We introduced the Hadoop/MapReduce distributed computing framework
- We introduced the Hadoop Distributed File System (HDFS)
- We illustrated a simple WordCount example and showed how to run the code on a Hadoop cluster on Amazon Web Services (AWS)