Cloud Distributed Computing Environment Hadoop


1 Cloud Distributed Computing Environment Hadoop

2 MapReduce Overview MapReduce is a distributed programming model for processing data of large volume. A program following the model is inherently distributed and parallel. We will illustrate how to write a program using the MapReduce framework to analyze a weather data set.
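To make the model concrete before the weather example, here is a minimal plain-Python sketch of the map, shuffle, and reduce flow. This is an illustration of the model only, not Hadoop code; the helper name `map_reduce` and the word-count usage are invented for this sketch.

```python
# A minimal, plain-Python illustration of the MapReduce model
# (not Hadoop code): map each input record to (key, value) pairs,
# group the pairs by key, then reduce each group to a final value.
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    # Map phase: emit (key, value) pairs from every record.
    pairs = []
    for record in records:
        pairs.extend(map_fn(record))
    # Shuffle: group values by key (the framework does this between phases).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Reduce phase: collapse each key's list of values to one result.
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Toy usage: count word occurrences across lines.
lines = ["hadoop map reduce", "map reduce map"]
counts = map_reduce(
    lines,
    map_fn=lambda line: [(word, 1) for word in line.split()],
    reduce_fn=lambda word, ones: sum(ones),
)
print(counts["map"])  # 3
```

Because the map calls are independent of one another (and likewise the reduce calls), a framework can run them on many machines at once, which is what makes programs in this model inherently parallel.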

3 A Sample Weather Data Set
The weather data set is generated by many weather sensors that collect data every hour at many locations across the globe. The data set can be downloaded from the National Climatic Data Center (NCDC).

4 The data is stored using a line-oriented ASCII format.
Below is a sample record of the data
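The record itself is a fixed-width ASCII line. The sketch below parses such a line; the field offsets (year at characters [15:19], signed temperature in tenths of a degree Celsius at [87:92], quality code at [92]) follow the commonly cited NCDC layout, but treat them as assumptions rather than an authoritative spec, and the sample line is synthetic, not real NCDC data.

```python
# Parse one NCDC-style weather record. The fixed-width offsets below
# (year at [15:19], signed temperature at [87:92] in tenths of a degree
# Celsius, quality code at [92]) are assumptions based on the commonly
# cited NCDC layout.
def parse_record(line):
    year = line[15:19]
    temp = int(line[87:92])          # e.g. "+0123" -> 123 (12.3 C)
    quality = line[92]
    return year, temp, quality

# Synthetic record padded out to the assumed offsets (not real data).
sample = "9" * 15 + "1950" + "9" * 68 + "+0123" + "1"
year, temp, quality = parse_record(sample)
print(year, temp, quality)  # 1950 123 1
```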

5 Data files are organized by date and weather station.
- There is a directory for each year from 1901 to 2001
- Each directory contains a gzipped file for each weather station

6 What we would like to find out is: “What is the highest global temperature for each year?”

7 Analyze the Data with Unix Tools
One run takes 42 minutes on a single EC2 High-CPU Extra Large Instance.
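The single-machine approach amounts to one sequential scan over every file. A plain-Python equivalent of that scan might look like the sketch below; the directory layout, field offsets, and quality-code filter are illustrative assumptions, not the exact script timed above.

```python
# A sequential, single-machine version of the analysis: walk every
# yearly directory, decompress each station file, and track the maximum
# temperature per year. Layout and offsets are illustrative assumptions.
import glob
import gzip

def max_temp_per_year(data_dir):
    maxima = {}
    for path in glob.glob(f"{data_dir}/*/*.gz"):
        with gzip.open(path, "rt") as f:
            for line in f:
                year = line[15:19]
                temp = int(line[87:92])
                quality = line[92]
                if temp != 9999 and quality in "01459":  # skip missing/bad readings
                    maxima[year] = max(maxima.get(year, temp), temp)
    return maxima
```

Because every record passes through one process, the run time grows linearly with the data volume, which is why distributing the scan is attractive.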

8 Analyzing the Data with Hadoop
To take advantage of the distributed processing capability that Hadoop provides, we need to write our program using the MapReduce framework. MapReduce works by breaking the processing into two phases:
- Map phase
- Reduce phase

9 Correspondingly, a program using the MapReduce framework specifies two functions:
- a map function (of a Mapper class)
- a reduce function (of a Reducer class)
The inputs and outputs of both functions are (key, value) pairs.

10 Map function
- input: (key, value) pairs
- output: (key, value) pairs
MapReduce framework
- sorts all the map output pairs and combines them into (key, list of values) pairs
Reduce function
- input: (key, list of values) pairs
- output: (key, value) pairs
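For the max-temperature job, the two functions can be sketched in plain Python as generators that yield (key, value) pairs. Hadoop's real API uses Java Mapper and Reducer classes; the function names and the NCDC field offsets here are assumptions made for illustration.

```python
# Plain-Python sketches of the two functions for the max-temperature
# job (Hadoop's real API uses Java Mapper/Reducer classes; the field
# offsets follow the assumed NCDC layout).
def map_fn(offset, line):
    # input: (file offset, raw record line)
    # output: (year, temperature) pairs
    year = line[15:19]
    temp = int(line[87:92])
    if temp != 9999:                 # skip missing readings
        yield (year, temp)

def reduce_fn(year, temps):
    # input: (year, list of temperatures grouped by the framework)
    # output: (year, maximum temperature)
    yield (year, max(temps))

# The framework sorts the map output and groups values by key before
# calling reduce_fn once per year.
pairs = list(map_fn(0, "9" * 15 + "1950" + "9" * 68 + "+0123" + "1"))
print(pairs)  # [('1950', 123)]
```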


14 There are two types of nodes that control the job execution process:
- one jobtracker
- a number of tasktrackers
The jobtracker coordinates all the jobs run on the system by scheduling tasks to run on tasktrackers. Tasktrackers run the tasks and send progress reports to the jobtracker.

15 Single reducer

16 Multiple reducers

17 No reducer

