MapReduce

What is MapReduce? (1)
- A programming model for parallel processing of distributed data on a cluster
- It is an ideal solution for processing data on HDFS

Diagram (left to right):
- Input: Data Slice 1, Data Slice 2, ..., Data Slice X, stored on Node 1, Node 2, ..., Node X
- Mapping: a data processor per slice (extraction, filtering, transformation)
- Data shuffling
- Reducing: a data collector (grouping, aggregating, dismissing)
- Result

What is MapReduce? (2)
- Two-stage data processing: Map and Reduce
- Each stage emits key-value pairs as a result of its work

Programming MapReduce in Java: 3 classes
- Map
- Reduce (optional)
- Job configuration (with a 'main' function)

Example: The famous "word counting"
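The word-counting example can be sketched in plain Java. This is only an illustration of the model, not Hadoop code: it does not use the Hadoop API, and the class name WordCountSketch and its methods map, shuffle, reduce and run are made up for this sketch. On a real cluster the shuffle step is done by the framework.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, single-JVM sketch of word counting in the MapReduce style.
final class WordCountSketch {

    // Map stage: split one input line into words and emit (word, 1) for each.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                out.add(Map.entry(word, 1));
            }
        }
        return out;
    }

    // Shuffle: group the emitted values by key (the framework's job on a cluster).
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce stage: sum the values collected for one key.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Runs map, shuffle and reduce in sequence and returns (word, count) pairs.
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : lines) {
            mapped.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        shuffle(mapped).forEach((word, ones) -> result.put(word, reduce(word, ones)));
        return result;
    }

    public static void main(String[] args) {
        // Prints {be=2, do=1, not=1, or=1, to=3}
        System.out.println(run(List.of("to be or not to be", "to do")));
    }
}
```

Both stages emit key-value pairs: map emits (word, 1) and reduce emits (word, count).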

MapReduce on Hadoop
In V2, controlled by the YARN daemons: ResourceManager, NodeManager

Nodes in the diagram:
- Client node: JVM running the MR code (Job object)
- Management node: JVM running the ResourceManager
- Cluster nodes: JVMs running a NodeManager, plus a JVM running either the MR application master or a Map or Reduce task
- HDFS

Job flow:
1. Run the job
2. Get a new application/job id
3. Copy the job resources
4. Submit the application/job
5. Start the MR application master
6. Get the input splits
7. Request resources
8. Start the containers
9. Get the input data

MR hands on (1)
The problem
- Q: "What follows two rainy days in the Geneva region?"
- A: "Monday"

The goal
- Prove whether the theory is true or false

Solution
- Let's take meteo data from GVA and build a histogram of bad weather days by day of the week

Bad weather days count: Mon | Tue | Wed | Thu | Fri | Sat | Sun ?

MR hands on (2)
The source data
- Last 3 years of weather data taken at GVA airport
- CSV format

What is a bad weather day?
- Weather anomalies between 8am and 10pm

"Local time in Geneva (airport)";"T";"P0";"P";"U";"DD";"Ff";"ff10";"WW";"W'W'";"c";"VV";"Td";
" :50";"18.0";"730.4";"767.3";"100";"variable wind direction";"2";"";"";"";"No Significant Clouds";"10.0 and more";"18.0";
" :20";"18.0";"730.4";"767.3";"94";"variable wind direction";"1";"";"";"";"Few clouds (10-30%) 300 m, scattered clouds (40-50%) 3300 m";"10.0 and more";"17.0";
" :50";"19.0";"730.5";"767.3";"88";"Wind blowing from the west";"2";"";"";"";"Few clouds (10-30%) 300 m, broken clouds (60-90%) 5400 m";"10.0 and more";"17.0";
" :20";"19.0";"729.9";"766.6";"83";"Wind blowing from the south-east";"4";"";"";"";"Few clouds (10-30%) 300 m, scattered clouds (40-50%) 2400 m, overcast (100%) 4500 m";"10.0 and more";"16.0";
" :50";"19.0";"729.9";"766.6";"94";"Wind blowing from the east-northeast";"5";"";"Light shower(s), rain";"";"Few clouds (10-30%) 1800 m, scattered clouds (40-50%) 2400 m, broken clouds (60-90%) 3000 m";"10.0 and more";"18.0";
" :20";"20.0";"730.7";"767.3";"88";"Wind blowing from the north-west";"2";"";"Light shower(s), rain, in the vicinity thunderstorm";"";"Few clouds (10-30%) 1800 m, cumulonimbus clouds, broken clouds (60-90%) 2400 m";"10.0 and more";"18.0";
" :50";"22.0";"730.2";"766.6";"73";"Wind blowing from the south";"7";"";"Thunderstorm";"";"Few clouds (10-30%) 1800 m, cumulonimbus clouds, scattered clouds (40-50%) 2100 m, broken clouds (60-90%) 3000 m";"10.0 and more";"17.0";
" :20";"23.0";"729.6";"765.8";"78";"Wind blowing from the west-southwest";"4";"";"Light shower(s), rain, in the vicinity thunderstorm";"";"Few clouds (10-30%) 1740 m, cumulonimbus clouds, scattered clouds (40-50%) 2100 m, broken clouds (60-90%) 3000 m";"10.0 and more";"19.0";
" :50";"23.0";"728.8";"765.0";"65";"variable wind direction";"2";"";"In the vicinity thunderstorm";"";"Scattered clouds (40-50%) 1950 m, cumulonimbus clouds, scattered clouds (40-50%) 2100 m, broken clouds (60-90%) 3300 m";"10.0 and more";"16.0";
" :20";"23.0";"728.2";"764.3";"74";"Wind blowing from the west-northwest";"4";"";"Light thunderstorm, rain";"";"Scattered clouds (40-50%) 1950 m, cumulonimbus clouds, scattered clouds (40-50%) 2100 m, broken clouds (60-90%) 3300 m";"10.0 and more";"18.0";
" :50";"28.0";"728.0";"763.5";"45";"Wind blowing from the south-west";"5";"11";"Thunderstorm";"";"Scattered clouds (40-50%) 1950 m, cumulonimbus clouds, scattered clouds (40-50%) 2100 m, broken clouds (60-90%) 6300 m";"10.0 and more";"15.0";
" :20";"28.0";"728.0";"763.5";"42";"Wind blowing from the north-northeast";"2";"";"In the vicinity thunderstorm";"";"Few clouds (10-30%) 1950 m, cumulonimbus clouds, broken clouds (60-90%) 6300 m";"10.0 and more";"14.0";
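A record in this layout could be checked against the bad-weather definition roughly as follows. This is a sketch under assumptions: the timestamp pattern dd.MM.uuuu HH:mm and the demo date are hypothetical, because the dates are cut off in the sample above, and the names WeatherRecordSketch, fields, isBadWeather and sampleLine are invented for illustration. The weather-phenomena text is taken from the 9th column ("WW"), and the hour test uses strict bounds, 8 < HH < 22.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

// Hypothetical sketch: decide whether one CSV record counts as bad weather.
final class WeatherRecordSketch {

    // ASSUMPTION: the real timestamp layout is not visible in the sample data.
    static final DateTimeFormatter TIMESTAMP = DateTimeFormatter.ofPattern("dd.MM.uuuu HH:mm");

    // Splits a line of double-quoted fields separated by ';'.
    // Simplification: assumes no field contains the sequence ";" itself.
    static String[] fields(String line) {
        String s = line.trim();
        if (s.endsWith(";")) {
            s = s.substring(0, s.length() - 1);   // drop the trailing separator
        }
        if (s.length() >= 2 && s.startsWith("\"") && s.endsWith("\"")) {
            s = s.substring(1, s.length() - 1);   // drop the outer quotes
        }
        return s.split("\";\"", -1);              // -1 keeps trailing empty fields
    }

    // True when the record is within the hour window and reports a weather anomaly.
    static boolean isBadWeather(String line) {
        String[] f = fields(line);
        if (f.length < 9) {
            return false;                         // malformed record
        }
        try {
            int hour = LocalDateTime.parse(f[0], TIMESTAMP).getHour();
            return hour > 8 && hour < 22 && !f[8].isEmpty();   // f[8] is column 9, "WW"
        } catch (DateTimeParseException e) {
            return false;                         // header line or unreadable timestamp
        }
    }

    // Hypothetical helper: builds a line in the sample's layout for the demo below.
    static String sampleLine(String timestamp, String ww) {
        String[] values = {timestamp, "22.0", "730.2", "766.6", "73",
            "Wind blowing from the south", "7", "", ww, "",
            "Few clouds (10-30%) 1800 m", "10.0 and more", "17.0"};
        StringBuilder sb = new StringBuilder();
        for (String value : values) {
            sb.append('"').append(value).append("\";");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The date 14.07.2014 is made up for the demo.
        System.out.println(isBadWeather(sampleLine("14.07.2014 16:50", "Thunderstorm"))); // true
        System.out.println(isBadWeather(sampleLine("14.07.2014 23:50", "Thunderstorm"))); // false
        System.out.println(isBadWeather(sampleLine("14.07.2014 12:20", "")));             // false
    }
}
```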

MR hands on (3)
Designing the MapReduce flow

Input data:
" :50";"18.0";...
" :20";"18.0";...
" :50";"19.0";...

First job:
- Map, run in parallel on each part of the input data
  - Filtering: 1) 8 < HH < 22; 2) col9 != ""
  - Emitting the intermediate output
- Reduce
  - Grouping: 1) sum(Value) by key
  - Emitting the result

Second job:
- Map
  - Transforming: 1) date => day of the week
  - Emitting the intermediate output
- Reduce
  - Grouping: 1) sum(Value) by key
  - Emitting the final result

MR hands on (3), text alternative
Designing a MapReduce job

Mapper
- Filter the data: keep records within the hours of interest, drop records with no weather anomalies
- Emit pairs: (date, any value)

Reducer
- Aggregate the results by date
- Emit (date, any value)

And... we need another job.

Mapper
- Transform a date into a day of the week
- Emit: (day of the week, any value)

Reducer
- Aggregate the results by day of the week
- Emit (day of the week, count)
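The chaining of the two jobs can be sketched in plain Java, without the Hadoop API. This reflects one reading of the design: the first job collapses all bad-weather observations of a date into a single entry, so the second job counts each bad day once. The class name BadWeatherHistogramSketch is hypothetical, the method names only echo the job names, and the dates in the demo are made up.

```java
import java.time.DayOfWeek;
import java.time.LocalDate;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-JVM sketch of the two chained jobs.
final class BadWeatherHistogramSketch {

    // Job 1. Input: the date of every observation that passed the mapper's filter.
    // The mapper emits (date, 1); the reducer sums by date, so each date appears once.
    static Map<LocalDate, Integer> aggByDate(List<LocalDate> badObservations) {
        Map<LocalDate, Integer> perDate = new TreeMap<>();
        for (LocalDate date : badObservations) {
            perDate.merge(date, 1, Integer::sum);
        }
        return perDate;
    }

    // Job 2. Input: the output of job 1.
    // The mapper emits (day of week, 1) per date; the reducer sums by day of week.
    static Map<DayOfWeek, Integer> aggByDay(Map<LocalDate, Integer> perDate) {
        Map<DayOfWeek, Integer> perDay = new EnumMap<>(DayOfWeek.class);
        for (LocalDate date : perDate.keySet()) {
            perDay.merge(date.getDayOfWeek(), 1, Integer::sum);
        }
        return perDay;
    }

    public static void main(String[] args) {
        // Made-up observation dates: two Mondays and one Tuesday.
        LocalDate monday = LocalDate.of(2014, 7, 14);
        LocalDate tuesday = LocalDate.of(2014, 7, 15);
        LocalDate nextMonday = LocalDate.of(2014, 7, 21);
        List<LocalDate> observations =
            List.of(monday, monday, monday, tuesday, nextMonday, nextMonday);

        // Prints {MONDAY=2, TUESDAY=1}
        System.out.println(aggByDay(aggByDate(observations)));
    }
}
```

The output of the first job is the input of the second, which is why two separate job submissions are needed.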

MR hands on (4)
Loading the data to HDFS
    cd ~/tutorials; hdfs dfs -put data data

Compiling the MapReduce source code
    cd ~/tutorials/mapreduce; javac -classpath `hadoop classpath` *.java

Packing into a jar file
    jar -cvf GVA.jar *.class

Submitting the MapReduce jobs
    hadoop jar GVA.jar AggByDateJob data stage
    hadoop jar GVA.jar AggByDayJob stage result

Things not covered
- Types of YARN schedulers
- Combiner: a reducer-like step run just after the map stage
- Writing your own input splitters, data serializers, partitioners, etc.
- Hadoop streaming: mapper and reducer as external executables
- Distributed cache: caching of arbitrary files

Summary
- MapReduce is a model for parallel data processing on Hadoop in a batch fashion
- Two stages: Map and Reduce
- Job submission is not immediate
- Logic is written in Java (but not only)
- Developer skills are required
- Fully customizable
- Resource allocation is controlled by YARN