Cloud Distributed Computing Environment Hadoop



MapReduce Overview
- MapReduce is a distributed programming model for processing large volumes of data.
- A program that follows the model is inherently distributed and parallel.
- We will illustrate how to write a program with the MapReduce framework by analyzing a weather data set.

A Sample Weather Data Set
- The weather data set is generated by many weather sensors that collect data every hour at many locations across the globe.
- The data set can be downloaded from the National Climatic Data Center (NCDC) at http://www.ncdc.noaa.gov

The data is stored in a line-oriented ASCII format. Below is a sample record of the data.
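A fixed-width, line-oriented record of this kind can be parsed by slicing character offsets. The sketch below is a minimal illustration, not the lecture's own code: the offsets (year at 15-19, signed temperature in tenths of a degree Celsius at 87-92, quality code at 92), the 9999 "missing" sentinel, and the accepted quality codes are assumptions taken from the max-temperature example in "Hadoop: The Definitive Guide". The function name `parse_record` and the synthetic test line are hypothetical.

```python
import re

MISSING = 9999  # assumed sentinel for a missing temperature reading
GOOD_QUALITY = re.compile(r"[01459]")  # assumed set of acceptable quality codes


def parse_record(line):
    """Return (year, temperature) or None if the reading is unusable.

    Assumed fixed-width layout: year at [15:19], signed temperature
    (tenths of a degree Celsius) at [87:92], quality code at [92].
    """
    if len(line) < 93:
        return None  # too short to contain all three fields
    year = int(line[15:19])
    temperature = int(line[87:92])  # int() accepts a leading '+' or '-'
    quality = line[92]
    if temperature == MISSING or not GOOD_QUALITY.fullmatch(quality):
        return None
    return year, temperature


# Synthetic record built only to exercise the offsets; it is not real NCDC data.
sample = "0" * 15 + "1950" + "0" * 68 + "+0022" + "1"
print(parse_record(sample))  # (1950, 22)
```

Returning None for unusable readings lets a caller skip them without raising, which matters when a single bad line should not abort a long run.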

Data files are organized by date and weather station.
- There is a directory for each year from 1901 to 2001.
- Each directory contains a gzipped file for each weather station.

What we would like to find out is: what is the highest global temperature for each year?

Analyzing the Data with Unix Tools
A single run of the analysis takes 42 minutes on a single EC2 High-CPU Extra Large instance.

Analyzing the Data with Hadoop
To take advantage of the distributed processing capability that Hadoop provides, we need to write our program using the MapReduce framework. MapReduce works by breaking the processing into two phases:
- Map phase
- Reduce phase

Correspondingly, a program that uses the MapReduce framework specifies two functions:
- Map function (of a Mapper class)
- Reduce function (of a Reducer class)
The inputs and outputs of both functions are (key, value) pairs.
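The shape of the two functions can be sketched in a few lines of Python. This is an illustration of the (key, value) contract only, not Hadoop's Java Mapper and Reducer API. The names `map_fn` and `reduce_fn` are hypothetical, and the input is assumed to be simplified "year,temperature" lines rather than real NCDC records.

```python
def map_fn(key, value):
    """Map: (line offset, line text) -> zero or more (year, temperature) pairs.

    The input key (the line offset) is ignored; only the line text matters.
    """
    year, temperature = value.split(",")
    yield int(year), int(temperature)


def reduce_fn(key, values):
    """Reduce: (year, list of temperatures) -> (year, maximum temperature)."""
    yield key, max(values)


print(list(map_fn(0, "1950,22")))             # [(1950, 22)]
print(list(reduce_fn(1950, [0, 22, -11])))    # [(1950, 22)]
```

Both functions are generators because a map or reduce call may emit any number of output pairs, including none.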

Map function
- input:
- output:
MapReduce framework
- sorts all the map output pairs and combines them into the following (key, value) pairs
Reduce function
- input:
- output:
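The data flow above (map, then sort and group by key, then reduce) can be simulated in a single process. The sketch below is a toy driver, not how Hadoop is implemented. The name `run_mapreduce`, the helper functions, and the five sample readings are illustrative assumptions, and the input uses simplified "year,temperature" lines.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Run map, shuffle/sort, and reduce sequentially over (key, value) records."""
    # Map phase: each input pair may emit any number of intermediate pairs.
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))

    # Shuffle and sort: group every value under its key.
    grouped = defaultdict(list)
    for k, v in intermediate:
        grouped[k].append(v)

    # Reduce phase: keys are visited in sorted order, one call per key.
    output = []
    for k in sorted(grouped):
        output.extend(reduce_fn(k, grouped[k]))
    return output


def max_temp_map(offset, line):
    year, temperature = line.split(",")
    yield int(year), int(temperature)


def max_temp_reduce(year, temperatures):
    yield year, max(temperatures)


# Input pairs are (byte offset of the line, line text); values are made up.
records = [
    (0, "1950,0"),
    (7, "1950,22"),
    (15, "1950,-11"),
    (24, "1949,111"),
    (33, "1949,78"),
]
print(run_mapreduce(records, max_temp_map, max_temp_reduce))
# [(1949, 111), (1950, 22)]
```

After the grouping step the reduce function sees (1949, [111, 78]) and (1950, [0, 22, -11]), so all it has to do is take the maximum of each list.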

There are two types of nodes that control the job execution process:
- one jobtracker
- a number of tasktrackers
The jobtracker coordinates all the jobs run on the system by scheduling tasks to run on tasktrackers. Tasktrackers run tasks and send progress reports to the jobtracker.

Single reducer

Multiple reducers

No reducer
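The three configurations differ in where map output goes. The sketch below assumes hash partitioning, which is the idea behind Hadoop's default partitioner: a key is assigned to a reducer by its hash modulo the number of reducers, so all values for one key reach the same reducer. The function names and sample pairs are hypothetical.

```python
from collections import defaultdict


def partition(key, num_reducers):
    """Pick the reducer index for a key (assumed hash partitioning)."""
    return hash(key) % num_reducers


def route_map_output(pairs, num_reducers):
    """Return the input each reducer would receive.

    With zero reducers there is no shuffle: the map output is returned
    unchanged, as it would be written directly as the job output.
    """
    if num_reducers == 0:
        return list(pairs)
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        partitions[partition(key, num_reducers)][key].append(value)
    return [dict(p) for p in partitions]


pairs = [(1950, 0), (1950, 22), (1949, 111), (1949, 78)]
print(route_map_output(pairs, 1))  # single reducer: one partition holds every key
print(route_map_output(pairs, 2))  # multiple reducers: keys are split by hash
print(route_map_output(pairs, 0))  # no reducer: map output passes through
```

Integer keys are used here because Python hashes small integers to themselves, which keeps the partition assignment reproducible between runs.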