Lecture 16 (Intro to MapReduce and Hadoop)


CSE 482: Big Data Analysis
Lecture 16 (Intro to MapReduce and Hadoop)

Outline of Today’s Lecture
What is Hadoop and MapReduce?
Why use Hadoop/MapReduce?
What is the Hadoop Distributed File System (HDFS)?
How to use HDFS?
How to run a MapReduce job on a Hadoop cluster?

MapReduce versus Hadoop
MapReduce is a distributed programming model for writing and executing applications that process massive amounts of data. It was originally developed at Google. Hadoop is its open-source implementation, whose development was led at Yahoo! (it is now an Apache project). This class will focus on the Hadoop implementation.

Why use Hadoop/MapReduce?
Big data applications require computational resources beyond those provided by a single machine. However, writing programs that run on multiple machines can be daunting. Hadoop/MapReduce provides a computing environment that simplifies the process of writing distributed programs.

Advantages of Hadoop/MapReduce
Accessible: runs on large clusters of commodity (PC) machines instead of expensive parallel machines
Robust: can gracefully handle hardware failures
Scalable: scales linearly to handle larger data by adding more nodes to the cluster
Simple: allows users to quickly write efficient parallel code that can be executed on a single machine or a cluster of machines

Key Concepts in MapReduce/Hadoop
The Hadoop architecture includes both distributed storage and distributed computation
Distributed storage is managed by the Hadoop Distributed File System (HDFS)
Data is split into smaller blocks that are distributed and replicated across multiple machines in the cluster
Data is stored as “key-value” pairs (e.g., <word, count>) instead of relational tables

Key Concepts in MapReduce/Hadoop
“Scale-out” rather than “scale-up” architecture: instead of using expensive high-end servers to process the massive data, Hadoop is designed to run on a cluster of commodity PC machines
Move code to data (instead of data to code)
Offline batch processing instead of online transaction processing (OLTP), although there has been some recent development toward stream processing (e.g., Apache Storm)

Hadoop Architecture
Hadoop employs a master/slave architecture for its distributed storage and computation. When you launch Hadoop, the following “daemons” (resident programs) run in the background:
For distributed storage: the master node runs a NameNode daemon, and each slave node runs a DataNode daemon
For distributed computation: the master node runs a JobTracker daemon, and each slave node runs a TaskTracker daemon

NameNode and DataNodes
The distributed storage system is called the Hadoop Distributed File System (HDFS). Each file is divided into blocks (the default block size is 64MB), which are stored and replicated on multiple machines. The NameNode is the master of HDFS; it directs the slave DataNodes to perform low-level I/O tasks. When your client program needs to read or write an HDFS file, the NameNode tells the client on which DataNode each block resides; the client then communicates directly with those DataNodes. This happens automatically, without the programmer's involvement, as the short example below shows.
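To make this concrete, here is a minimal sketch (not from the lecture slides) of a client reading an HDFS file through the standard org.apache.hadoop.fs API. The NameNode lookups and the block transfers from individual DataNodes all happen inside fs.open() and the returned stream:

```java
import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        // Configuration picks up the cluster's config files,
        // which tell the client where the NameNode is
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Opening the file asks the NameNode where each block lives;
        // reading the stream then pulls the blocks from the DataNodes
        InputStream in = fs.open(new Path(args[0]));
        try {
            IOUtils.copyBytes(in, System.out, 4096, false); // dump file to stdout
        } finally {
            IOUtils.closeStream(in);
        }
    }
}
```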

NameNode and DataNodes (architecture diagram; not reproduced in the transcript)

JobTracker and TaskTracker
The JobTracker serves as a liaison between your client program and Hadoop. When you submit a job, the JobTracker determines the execution plan: which files to process and how to assign nodes to different tasks, and it monitors the tasks for completion. If a task fails, the JobTracker automatically relaunches it, possibly on a different node, up to a predefined limit of retries. The TaskTracker manages the execution of individual tasks on each slave node.

JobTracker and TaskTracker (architecture diagram; not reproduced in the transcript)

Hadoop Distributed File System (HDFS)
Important: before you run any Hadoop job, you must first upload your data set to HDFS. After the job has completed, you can copy the results from HDFS back to your local filesystem. Thus, when executing a Hadoop program, you must be able to distinguish between the local filesystem and the Hadoop filesystem (HDFS).
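The same upload/download workflow can also be done programmatically. Below is a minimal sketch using the FileSystem API's copyFromLocalFile and copyToLocalFile methods; the file names and paths are illustrative, not from the lecture:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCopy {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Upload the input before the job runs (local filesystem -> HDFS)
        fs.copyFromLocalFile(new Path("tweet.txt"), new Path("input/tweet.txt"));

        // ... submit and wait for the MapReduce job here ...

        // Download the results afterwards (HDFS -> local filesystem)
        fs.copyToLocalFile(new Path("output/part-r-00000"), new Path("results.txt"));
    }
}
```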

Basic HDFS Commands
Syntax: hadoop fs -<command> <arguments>
To list files in the current directory: hadoop fs -ls
To view the subdirectories (recursively): hadoop fs -ls -R
To make a directory: hadoop fs -mkdir <dir>
To copy a file from HDFS to the local filesystem: hadoop fs -get <src> <localdst>
To copy a file from the local filesystem to HDFS: hadoop fs -put <localsrc> <dst>

Basic HDFS Commands
To view a file: hadoop fs -cat <file>
If the file is too large, pipe it to a Linux command: hadoop fs -cat <file> | head
To display the last kilobyte of the file: hadoop fs -tail <file>
To delete a file: hadoop fs -rm <file>
To remove a directory: hadoop fs -rm -r <dir>

Hadoop Programs
Next, we describe the basics of Hadoop/MapReduce programs. How do Hadoop/MapReduce programs differ from standard programs? An example of running a Hadoop job will also be illustrated.

Motivating Example
Consider the problem of finding “hot items” in a data stream of tokens. Example: find the most frequent words that appear in Twitter messages. Imagine you write a program to do this on a single machine:
Keep a counter for each unique term
Increment the counter associated with each term that appears in the data stream

WordCount Example
Pseudocode for the serial algorithm (see the sketch below): NOT SCALABLE
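The slide's pseudocode is not reproduced in the transcript; the following is a minimal Java sketch of the serial algorithm it describes, with one in-memory counter per unique term:

```java
import java.util.HashMap;
import java.util.Map;

public class SerialWordCount {
    public static Map<String, Integer> count(Iterable<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            for (String word : line.toLowerCase().split("\\s+")) {
                if (word.isEmpty()) continue;
                // one counter per unique term, incremented on every occurrence
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts; // the entire counter table must fit in one machine's memory
    }
}
```

The last comment is exactly why this is not scalable: both the documents and the counter table are confined to a single machine.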

WordCount Example
If you have millions of documents, one possibility is to distribute the work over several machines. Pseudocode for a distributed approach (sketched below):
Phase 1: each machine counts the words in its own share of the documents
Phase 2: the partial counts from all machines are merged into global counts
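The transcript omits the slide's pseudocode; here is a rough sketch of the two phases under that reading, with all coordination and data-movement details left out:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DistributedWordCount {
    // Phase 1: each machine runs SerialWordCount.count() on its own
    // share of the documents, producing a local table of partial counts.

    // Phase 2: one machine merges all the local tables into a global one.
    public static Map<String, Integer> merge(List<Map<String, Integer>> partials) {
        Map<String, Integer> global = new HashMap<>();
        for (Map<String, Integer> partial : partials)
            partial.forEach((word, n) -> global.merge(word, n, Integer::sum));
        return global;
    }
}
```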

Distributed WordCount Example (diagram of Phase 1 and Phase 2; not reproduced in the transcript)

Challenges for Distributed WordCount
There is a bottleneck if all the data files are stored on a single server; the solution is to store the files over many machines
Phase 2 is performed by only one machine. Can we do phase 2 in a distributed fashion? E.g., machine 1 processes all words starting with a-i, machine 2 processes words starting with j-r, and machine 3 processes words starting with s-z
We need to partition the word counts and shuffle each partition to a different machine, so that each machine handles only one partition in phase 2 (a sketch of this partitioning idea follows)
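In practice, Hadoop's shuffle assigns each key to a phase-2 machine by hashing rather than by fixed alphabetical ranges. The sketch below mirrors the behavior of Hadoop's default HashPartitioner; it is an illustration of the idea, not code from the lecture:

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Every occurrence of the same word hashes to the same reducer, so each
// reducer can total its own partition independently of the others.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text word, IntWritable count, int numReducers) {
        // mask off the sign bit so the modulus is always non-negative
        return (word.hashCode() & Integer.MAX_VALUE) % numReducers;
    }
}
```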

Hadoop
Hadoop provides an abstraction that hides many of the system-level implementation details, so that programmers do not have to worry about:
How to partition a large job into smaller tasks that can be assigned to different machines and executed in parallel during phases 1 and 2
How to divide the data among the machines
How to coordinate synchronization among the different machines working in parallel
How to collect and combine the partial results from the parallel machines
How to accomplish all of the above in the face of software errors and hardware failures

Hadoop
Hadoop provides built-in Java libraries to support the implementation of distributed computing tasks. The libraries include Java classes for mappers (phase 1) and reducers (phase 2); other classes include partitioners and combiners. Programming in Hadoop requires you to decompose a computational problem into mapper and reducer tasks. Once it has been decomposed, scaling the application to hundreds or thousands of machines is simple and can be done without changing the code.

Mappers and Reducers
Mappers: perform local computation on each machine
Reducers: “combine” the local results obtained from different mappers to obtain the “global” solution
Input data to mappers and reducers are in the form of <key, value> pairs (see the sketch below)
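A minimal sketch of the two classes for WordCount, following the standard WordCount example that ships with Hadoop (the org.apache.hadoop.mapreduce API):

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Phase 1: emits a <word, 1> pair for every token in its input split
public class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);  // <key, value> = <word, 1>
        }
    }
}

// Phase 2: receives every count emitted for one word and sums them
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) sum += val.get();
        result.set(sum);
        context.write(key, result);  // <key, value> = <word, total count>
    }
}
```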

Hadoop Example (code slide; not reproduced in the transcript)
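In place of the slide's code, here is a minimal driver sketch that wires the mapper and reducer classes above into a job. This is where the decomposition is declared; scaling to more machines requires no changes here:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // optional local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input dir in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output dir in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```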

Running a Hadoop Job
Read the lecture16b supplementary material on how to create an account on Amazon Web Services (AWS) and set up your private/public keys. You should apply for AWS Education credit ($35) to run your Hadoop code on AWS. After signing in and successfully setting up your account and keys:
Launch an AWS EMR cluster; an EMR (Elastic MapReduce) cluster already has the software needed for this class (Hadoop, Hive, Spark, Pig)
Connect to the cluster using SSH

Running a Hadoop Job (screenshot; not reproduced in the transcript)

Running a Hadoop Job (screenshot of the directory that contains the Hadoop installation on AWS; not reproduced in the transcript)

WordCount Example
Part of the hadoop-examples.jar code (shown on the slide). Example task: count the frequency of terms that appear in tweets from CDCgov.

WordCount Example Every line in the tweet.txt file corresponds to a separate tweet message

WordCount Example
First, locate the hadoop-examples.jar file. Next, run the Hadoop wordcount program on the tweet.txt file; the general form of the command is hadoop jar <path-to-examples-jar> wordcount <input> <output>, where the jar path depends on the installation.

WordCount Example
After the program has completed successfully, you can view the solution in output/part-r-00000 (e.g., with hadoop fs -cat output/part-r-00000); each line contains a term and its count, separated by a tab.

Summary
We have introduced the Hadoop/MapReduce distributed computing framework
We have introduced the Hadoop Distributed File System (HDFS)
We have illustrated a simple word count example and shown how to run the code on a Hadoop cluster on Amazon Web Services (AWS)