Cloud Distributed Computing Platform 2 Content of this lecture is primarily from the book “Hadoop, The Definite Guide 2/e)

Slides:



Advertisements
Similar presentations
Distributed and Parallel Processing Technology Chapter2. MapReduce
Advertisements

Lecture 12: MapReduce: Simplified Data Processing on Large Clusters Xiaowei Yang (Duke University)
The map and reduce functions in MapReduce are easy to test in isolation, which is a consequence of their functional style. For known inputs, they produce.
MAP REDUCE PROGRAMMING Dr G Sudha Sadasivam. Map - reduce sort/merge based distributed processing Best for batch- oriented processing Sort/merge is primitive.
MapReduce.
MapReduce Online Created by: Rajesh Gadipuuri Modified by: Ying Lu.
MapReduce Online Veli Hasanov Fatih University.
EHarmony in Cloud Subtitle Brian Ko. eHarmony Online subscription-based matchmaking service Available in United States, Canada, Australia and United Kingdom.
Developing a MapReduce Application – packet dissection.
Map reduce with Hadoop streaming and/or Hadoop. Hadoop Job Hadoop Mapper Hadoop Reducer Partitioner Hadoop FileSystem Combiner Shuffle Sort Shuffle Sort.
Hadoop: The Definitive Guide Chap. 2 MapReduce
Parallel K-Means Clustering Based on MapReduce The Key Laboratory of Intelligent Information Processing, Chinese Academy of Sciences Weizhong Zhao, Huifang.
L22: SC Report, Map Reduce November 23, Map Reduce What is MapReduce? Example computing environment How it works Fault Tolerance Debugging Performance.
MapReduce in the Clouds for Science CloudCom 2010 Nov 30 – Dec 3, 2010 Thilina Gunarathne, Tak-Lon Wu, Judy Qiu, Geoffrey Fox {tgunarat, taklwu,
Hadoop, Hadoop, Hadoop!!! Jerome Mitchell Indiana University.
Hadoop: The Definitive Guide Chap. 8 MapReduce Features
Big Data Analytics with R and Hadoop
Hadoop Ida Mele. Parallel programming Parallel programming is used to improve performance and efficiency In a parallel program, the processing is broken.
MapReduce.
THE HOG LANGUAGE A scripting MapReduce language. Jason Halpern Testing/Validation Samuel Messing Project Manager Benjamin Rapaport System Architect Kurry.
H ADOOP DB: A N A RCHITECTURAL H YBRID OF M AP R EDUCE AND DBMS T ECHNOLOGIES FOR A NALYTICAL W ORKLOADS By: Muhammad Mudassar MS-IT-8 1.
CS525: Special Topics in DBs Large-Scale Data Management Hadoop/MapReduce Computing Paradigm Spring 2013 WPI, Mohamed Eltabakh 1.
MapReduce: Hadoop Implementation. Outline MapReduce overview Applications of MapReduce Hadoop overview.
Apache Hadoop MapReduce What is it ? Why use it ? How does it work Some examples Big users.
Hadoop/MapReduce Computing Paradigm 1 Shirish Agale.
Introduction to Hadoop and HDFS
f ACT s  Data intensive applications with Petabytes of data  Web pages billion web pages x 20KB = 400+ terabytes  One computer can read
Optimizing Cloud MapReduce for Processing Stream Data using Pipelining 作者 :Rutvik Karve , Devendra Dahiphale , Amit Chhajer 報告 : 饒展榕.
Hadoop Ali Sharza Khan High Performance Computing 1.
Distributed Computing with Turing Machine. Turing machine  Turing machines are an abstract model of computation. They provide a precise, formal definition.
CPS216: Advanced Database Systems (Data-intensive Computing Systems) Introduction to MapReduce and Hadoop Shivnath Babu.
An Introduction to HDInsight June 27 th,
MapReduce 資工碩一 黃威凱. Outline Purpose Example Method Advanced 資工碩一 黃威凱.
A Hierarchical MapReduce Framework Yuan Luo and Beth Plale School of Informatics and Computing, Indiana University Data To Insight Center, Indiana University.
P ARALLEL A NALYSIS OF E GG D ATA WITH HADOOP ON FUTUREGRID Project Member: Rewati Ovalekar Project Guide : Gregor von Laszweski, Lizhe Wang.
Grid Computing at Yahoo! Sameer Paranjpye Mahadev Konar Yahoo!
Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA
Youngil Kim Awalin Sopan Sonia Ng Zeng.  Introduction  Concept of the Project  System architecture  Implementation – HDFS  Implementation – System.
CS525: Big Data Analytics MapReduce Computing Paradigm & Apache Hadoop Open Source Fall 2013 Elke A. Rundensteiner 1.
 Frequent Word Combinations Mining and Indexing on HBase Hemanth Gokavarapu Santhosh Kumar Saminathan.
IBM Research ® © 2007 IBM Corporation Introduction to Map-Reduce and Join Processing.
Hadoop/MapReduce Computing Paradigm 1 CS525: Special Topics in DBs Large-Scale Data Management Presented By Kelly Technologies
MapReduce Joins Shalish.V.J. A Refresher on Joins A join is an operation that combines records from two or more data sets based on a field or set of fields,
MapReduce Basics Chapter 2 Lin and Dyer & /tutorial/
Beyond Hadoop The leading open source system for processing big data continues to evolve, but new approaches with added features are on the rise. Ibrahim.
INTRODUCTION TO HADOOP. OUTLINE  What is Hadoop  The core of Hadoop  Structure of Hadoop Distributed File System  Structure of MapReduce Framework.
1 Student Date Time Wei Li Nov 30, 2015 Monday 9:00-9:25am Shubbhi Taneja Nov 30, 2015 Monday9:25-9:50am Rodrigo Sanandan Dec 2, 2015 Wednesday9:00-9:25am.
COMP7330/7336 Advanced Parallel and Distributed Computing MapReduce - Introduction Dr. Xiao Qin Auburn University
GSICS Annual Meeting Laurie Rokke, DPhil NOAA/NESDIS/STAR March 26, 201.
Lecture 3 – MapReduce: Implementation CSE 490h – Introduction to Distributed Computing, Spring 2009 Except as otherwise noted, the content of this presentation.
How to download, configure and run a mapReduce program In a cloudera VM Presented By: Mehakdeep Singh Amrit Singh Chaggar Ranjodh Singh.
Chapter 10 Data Analytics for IoT
Hadoop MapReduce Framework
MapReduce Types, Formats and Features
Spark Presentation.
Auburn University COMP7330/7336 Advanced Parallel and Distributed Computing MapReduce - Introduction Dr. Xiao Qin Auburn.
TABLE OF CONTENTS. TABLE OF CONTENTS Not Possible in single computer and DB Serialised solution not possible Large data backup difficult so data.
Lecture 17 (Hadoop: Getting Started)
MapReduce Computing Paradigm Basics Fall 2013 Elke A. Rundensteiner
Ministry of Higher Education
The Basics of Apache Hadoop
Cloud Distributed Computing Environment Hadoop
CS110: Discussion about Spark
Cloud Computing: Project Tutorial Hadoop Map-Reduce Programming
Lecture 16 (Intro to MapReduce and Hadoop)
Charles Tappert Seidenberg School of CSIS, Pace University
MAPREDUCE TYPES, FORMATS AND FEATURES
COS 518: Distributed Systems Lecture 11 Mike Freedman
MapReduce: Simplified Data Processing on Large Clusters
Map Reduce, Types, Formats and Features
Presentation transcript:

Cloud Distributed Computing Platform 2 Content of this lecture is primarily from the book “Hadoop, The Definite Guide 2/e)

MapReduce Overview MapReduce is a distributed program model for processing data with large volume A program following the model is inherently and distributed and parallel We will illustrate how to write a program by using MapReduce framework to analyze a weather data set

A Sample Weather Data Set The weather data set is generated by many weather sensors that collect data every hour at many locations acorss the globe The dataset can be downloaded from National Climate Data Center (NCDC) at

The data is stored using a line-oriended ASCII format Below is a sample line of the data

Data files are organized by date and weather station. - There is a directory for each year from 1901 to Each directory contains a gzipped file for each weather station

What we would like to find out is “what is the highest global temparature for each year”

Analyze the Data with Unix Tools It takes 42 minutes in one run on a single EC2 High-CPU Extra Large Instance

Analyzing the Data with Hadoop To take advantage of the distributed processing capability the hadoop provides, we need to write our program by using MapReduce framework. MapReduce works by breaking the processing into two phases - Map phase - Reduce phase

Correspondingly, the program using MapReduce framework will specify two functions: - Map function (of a Mapper class) - Reduce function (of a Reducer class) The inputs and outpus for both functions will be (key, value) pairs.

Map function - input: - output: MapReduce Framework, - sort all the output pairs and combine them into the following (key, value) pairs Reduce - input: - output:

There are two types of nodes that control the job execution process: - one jobtracker - a number of tasktrackers The jobtracker coordinates all the jobs run on the system by scheduling tasks to run on tasktrackers Tasktrackers run tasks and send progress report to the jobtracker.

Single reducer

Multiple reducers

no reducer