Cloud Distributed Computing Platform 2

The content of this lecture is primarily from the book "Hadoop: The Definitive Guide" (2/e).
MapReduce Overview

MapReduce is a distributed programming model for processing large volumes of data. A program that follows the model is inherently parallel and distributed. We will illustrate how to write a program using the MapReduce framework to analyze a weather data set.
A Sample Weather Data Set

The weather data set is generated by many weather sensors that collect data every hour at many locations across the globe. The data set can be downloaded from the National Climatic Data Center (NCDC) at
The data is stored using a line-oriented ASCII format in which each line is one record. In the sample line shown on the original slide, the fields we care about (the year, the air temperature, and a quality code) sit at fixed column positions within the record.
Data files are organized by date and weather station:
- There is a directory for each year, from 1901 to 2001.
- Each directory contains one gzipped file per weather station.
What we would like to find out is: what is the highest recorded global temperature for each year?
Analyzing the Data with Unix Tools

Processing the complete data set with standard Unix tools takes 42 minutes in one run on a single EC2 High-CPU Extra Large instance.
Analyzing the Data with Hadoop

To take advantage of the distributed processing capability that Hadoop provides, we need to write our program using the MapReduce framework. MapReduce works by breaking the processing into two phases:
- Map phase
- Reduce phase
Correspondingly, a program using the MapReduce framework specifies two functions:
- a map function (of a Mapper class)
- a reduce function (of a Reducer class)
The inputs and outputs of both functions are (key, value) pairs.
Map function
- input: (line offset, line of text)
- output: (year, temperature) pairs
MapReduce framework
- sorts all the map output pairs and groups them by key into (year, list of temperatures) pairs
Reduce function
- input: (year, list of temperatures)
- output: (year, maximum temperature)
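A sketch of the two classes, following the MaxTemperature example from the book (new org.apache.hadoop.mapreduce API; the fixed column offsets for the year, temperature, and quality fields are those of the NCDC record format as described in the book):

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map function: emits a (year, temperature) pair for every valid record.
class MaxTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final int MISSING = 9999;  // sentinel for "no reading"

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    String year = line.substring(15, 19);       // year field
    int airTemperature;
    if (line.charAt(87) == '+') {               // parseInt doesn't accept a leading plus sign
      airTemperature = Integer.parseInt(line.substring(88, 92));
    } else {
      airTemperature = Integer.parseInt(line.substring(87, 92));
    }
    String quality = line.substring(92, 93);    // quality code
    if (airTemperature != MISSING && quality.matches("[01459]")) {
      context.write(new Text(year), new IntWritable(airTemperature));
    }
  }
}

// Reduce function: receives (year, list of temperatures) and emits
// (year, maximum temperature).
class MaxTemperatureReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int maxValue = Integer.MIN_VALUE;
    for (IntWritable value : values) {
      maxValue = Math.max(maxValue, value.get());
    }
    context.write(key, new IntWritable(maxValue));
  }
}
```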
There are two types of nodes that control the job execution process:
- one jobtracker
- a number of tasktrackers
The jobtracker coordinates all the jobs run on the system by scheduling tasks to run on tasktrackers. Tasktrackers run tasks and send progress reports to the jobtracker.
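A driver in the style of the book's example ties the two classes together and submits the job, which the jobtracker then schedules across the tasktrackers:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Configures the MapReduce job and submits it for execution.
public class MaxTemperature {
  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println("Usage: MaxTemperature <input path> <output path>");
      System.exit(-1);
    }
    Job job = new Job();
    job.setJarByClass(MaxTemperature.class);
    job.setJobName("Max temperature");

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(MaxTemperatureMapper.class);
    job.setReducerClass(MaxTemperatureReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // waitForCompletion() submits the job and polls for progress until it finishes.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```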
MapReduce data flow variants (shown as figures on the original slides):
- Single reducer: the sorted output of every map task is sent across the network to the one reduce task.
- Multiple reducers: each map task partitions its output by key, creating one partition per reducer.
- No reducer: the shuffle is skipped entirely and each map task writes its output directly to HDFS.
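Which variant a job uses follows from the number of reduce tasks configured in the driver. A minimal sketch, assuming the MaxTemperature job above (setNumReduceTasks() is the real Job API call; the wrapper class is hypothetical, for illustration only):

```java
import org.apache.hadoop.mapreduce.Job;

// Hypothetical helper: the number of reduce tasks set on the Job
// selects which data flow variant the job uses.
public class ReduceTaskCountExample {
  static void configure(Job job) {
    job.setNumReduceTasks(1);    // single reducer: all map output flows to one reduce task
    // job.setNumReduceTasks(8); // multiple reducers: map output partitioned by key
    // job.setNumReduceTasks(0); // no reducer: map tasks write output directly to HDFS
  }
}
```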