Published by Alison McDaniel. Modified over 6 years ago.
1
Cloud Distributed Computing Environment Hadoop
2
MapReduce Overview
MapReduce is a distributed programming model for processing large volumes of data.
A program that follows the model is inherently distributed and parallel.
We will illustrate how to write a program with the MapReduce framework by analyzing a weather data set.
3
A Sample Weather Data Set
The weather data set is generated by many weather sensors that collect data every hour at many locations across the globe.
The data set can be downloaded from the National Climatic Data Center (NCDC).
4
The data is stored in a line-oriented ASCII format
Below is a sample record of the data
5
Data files are organized by date and weather station.
- There is a directory for each year from 1901 to 2001
- Each directory contains a gzipped file for each weather station
6
What we would like to find out is “what is the highest global temperature for each year?”
7
Analyze the Data with Unix Tools
One complete run takes 42 minutes on a single EC2 High-CPU Extra Large instance
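As a point of comparison, the single-machine approach amounts to scanning every record in sequence and keeping a running maximum per year. The following is a minimal Python sketch of that idea, not the Unix script from the slide. It assumes the records have already been reduced to hypothetical (year, temperature) pairs, and the sample readings are made up for illustration.

```python
def max_temperature_by_year(records):
    """Scan (year, temperature) pairs once, keeping the highest reading per year."""
    highest = {}
    for year, temperature in records:
        if year not in highest or temperature > highest[year]:
            highest[year] = temperature
    return highest


# Hypothetical readings, for illustration only (not taken from the NCDC data).
sample = [("1949", 111), ("1950", 0), ("1950", 22), ("1949", 78), ("1950", -11)]
print(max_temperature_by_year(sample))
```

Everything runs in one process here, which is why the run time grows with the total size of the data set.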
8
Analyzing the Data with Hadoop
To take advantage of the distributed processing capability that Hadoop provides, we need to write our program using the MapReduce framework.
MapReduce works by breaking the processing into two phases:
- Map phase
- Reduce phase
9
Correspondingly, a program that uses the MapReduce framework specifies two functions:
- Map function (of a Mapper class)
- Reduce function (of a Reducer class)
The inputs and outputs of both functions are (key, value) pairs.
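To make the two functions concrete for the weather question, here is a minimal sketch in plain Python rather than the Hadoop Java API. The column positions (year at 15-18, signed temperature at 87-91) and the 9999 missing-value sentinel are assumptions about the record layout and should be checked against the NCDC format documentation. `make_record` is a hypothetical helper that builds padded test lines.

```python
MISSING = 9999  # assumed sentinel for a missing temperature reading


def map_fn(offset, line):
    """Map: (line offset, record text) -> at most one (year, temperature) pair."""
    year = line[15:19]              # assumed column positions for the year
    temperature = int(line[87:92])  # assumed signed field, e.g. "+0022" or "-0011"
    if temperature != MISSING:
        yield year, temperature


def reduce_fn(year, temperatures):
    """Reduce: (year, all temperatures for that year) -> (year, highest temperature)."""
    yield year, max(temperatures)


def make_record(year, temperature):
    """Hypothetical helper: pad a line so the fields sit at the assumed columns."""
    return "0" * 15 + year + "0" * 68 + f"{temperature:+05d}" + "1"


print(list(map_fn(0, make_record("1950", 22))))
print(list(reduce_fn("1950", [0, 22, -11])))
```

The map function handles one record at a time and the reduce function handles one key at a time, which is what lets a framework run many copies of each in parallel.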
10
Map function
- input:
- output:
MapReduce framework
- sorts all the output pairs and combines them into the following (key, value) pairs
Reduce function
- input:
- output:
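The sort-and-combine step that the framework performs between map and reduce can be sketched in a few lines of Python. This is an in-memory illustration only, with made-up (year, temperature) pairs; the real framework does the same grouping across machines.

```python
from itertools import groupby
from operator import itemgetter


def shuffle(pairs):
    """Sort map output by key, then collect the values of each key into one list."""
    ordered = sorted(pairs, key=itemgetter(0))  # stable sort keeps value order per key
    return [
        (key, [value for _, value in group])
        for key, group in groupby(ordered, key=itemgetter(0))
    ]


# Hypothetical map output, for illustration only.
map_output = [("1950", 0), ("1950", 22), ("1950", -11), ("1949", 111), ("1949", 78)]
grouped = shuffle(map_output)
print(grouped)

# Each (key, list of values) pair is then handed to the reduce function.
print([(year, max(temps)) for year, temps in grouped])
```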
14
There are two types of nodes that control the job execution process:
- one jobtracker
- a number of tasktrackers
The jobtracker coordinates all the jobs run on the system by scheduling tasks to run on tasktrackers.
Tasktrackers run tasks and send progress reports to the jobtracker.
15
Single reducer
16
Multiple reducers
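With more than one reducer, each map output pair has to be routed to exactly one reducer, so that all values for a given key end up in the same place. A minimal sketch of that routing follows, assuming a hash-modulo rule. `zlib.crc32` is used here only as a deterministic stand-in hash for illustration and is not what Hadoop itself uses; the input pairs are made up.

```python
import zlib


def partition(key, num_reducers):
    """Pick a reducer index for a key; the same key always maps to the same index."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def split_among_reducers(pairs, num_reducers):
    """Distribute (key, value) pairs into one bucket per reducer."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    return buckets


# Hypothetical map output, for illustration only.
pairs = [("1949", 111), ("1950", 22), ("1949", 78), ("1951", 35), ("1950", -11)]
for index, bucket in enumerate(split_among_reducers(pairs, 2)):
    print(index, bucket)
```

Because a key never spans two buckets, each reducer can compute the maximum for its years independently of the others.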
17
No reducer