MapReduce.


What is MapReduce? (1)
- A programming model for parallel processing of distributed data on a cluster
- Well suited to processing data stored on HDFS

Data flow (diagram):
- Data Slice 1 ... Data Slice X are stored on Node 1 ... Node X
- Mapping (extraction, filtering, transformation): one data processor per data slice
- Data shuffling
- Reducing (grouping, aggregating, dismissing): data collectors
- Result

Example: the famous "word counting"
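A minimal sketch of word counting in plain Java, with no Hadoop dependency, to make the three steps (map, shuffle, reduce) concrete. The class and method names (WordCountSketch, map, shuffle, reduce, run) are illustrative assumptions and not part of the Hadoop API; everything runs in a single JVM instead of on a cluster.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class WordCountSketch {
    // Map stage: one input line in, one <word, 1> pair out per word.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle: collect all values that share a key; TreeMap keeps the keys sorted.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce stage: sum the values of one key.
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int count : counts) {
            sum += count;
        }
        return sum;
    }

    // Runs map, shuffle and reduce sequentially over all input lines.
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String line : lines) {
            intermediate.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        shuffle(intermediate).forEach((word, counts) -> result.put(word, reduce(word, counts)));
        return result;
    }

    public static void main(String[] args) {
        // prints {and=1, be=2, map=1, not=1, or=1, reduce=1, to=4}
        System.out.println(run(Arrays.asList("to be or not to be", "to map and to reduce")));
    }
}
```

On a real cluster the map calls run in parallel on the nodes holding the data slices, and the shuffle moves the pairs across the network; the sketch only keeps the logical order of the steps.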

What is MapReduce? (2)
- Data processing in 2 stages: Map and Reduce
- Each stage emits key-value pairs as the result of its work

Programming MapReduce in Java: 3 classes
- Map
- Reduce (optional)
- Job configuration (with a 'main' function)

MapReduce on Hadoop
In V2, controlled by the YARN daemons: ResourceManager and NodeManager

Job submission flow (diagram):
1. Run Job object (MR code in a JVM on the client node)
2. Get new application/job id (from the YARN Resource Manager)
3. Copy job resources (to HDFS)
4. Application/job submission (to the YARN Resource Manager)
5. Start MR App Master (in a JVM on a cluster node, through its YARN Node Manager)
6. Get input splits (MR application master, from HDFS)
7. Resource request (MR application master to the YARN Resource Manager)
8. Start containers (Node Managers on the cluster nodes start JVMs for Map or Reduce tasks)
9. Get local input data (Map or Reduce task, from HDFS)

MR hands on (1)
The problem
- Q: "What follows two rainy days in the Geneva region?"
- A: "Monday"
The goal
- Prove whether the theory is true or false
Solution
- Let's take meteo data from GVA and build a histogram of the days of the week that follow 2 or more bad weather days

Histogram (diagram): days count per day of the week (Mon | Tue | Wed | Thu | Fri | Sat | Sun)

MR hands on (2)
The source data (http://rp5.co.uk)
- Source: last 3 years of weather data taken at GVA airport
- CSV format
What is a bad weather day?
- Weather anomalies (col nr 9) between 8am and 10pm

"Local time in Geneva (airport)";"T";"P0";"P";"U";"DD";"Ff";"ff10";"WW";"W'W'";"c";"VV";"Td";
"06.06.2015 00:50";"18.0";"730.4";"767.3";"100";"variable wind direction";"2";"";"";"";"No Significant Clouds";"10.0 and more";"18.0";
"06.06.2015 00:20";"18.0";"730.4";"767.3";"94";"variable wind direction";"1";"";"";"";"Few clouds (10-30%) 300 m, scattered clouds (40-50%) 3300 m";"10.0 and more";"17.0";
"05.06.2015 23:50";"19.0";"730.5";"767.3";"88";"Wind blowing from the west";"2";"";"";"";"Few clouds (10-30%) 300 m, broken clouds (60-90%) 5400 m";"10.0 and more";"17.0";
"05.06.2015 23:20";"19.0";"729.9";"766.6";"83";"Wind blowing from the south-east";"4";"";"";"";"Few clouds (10-30%) 300 m, scattered clouds (40-50%) 2400 m, overcast (100%) 4500 m";"10.0 and more";"16.0";
"05.06.2015 22:50";"19.0";"729.9";"766.6";"94";"Wind blowing from the east-northeast";"5";"";"Light shower(s), rain";"";"Few clouds (10-30%) 1800 m, scattered clouds (40-50%) 2400 m, broken clouds (60-90%) 3000 m";"10.0 and more";"18.0";
"05.06.2015 22:20";"20.0";"730.7";"767.3";"88";"Wind blowing from the north-west";"2";"";"Light shower(s), rain, in the vicinity thunderstorm";"";"Few clouds (10-30%) 1800 m, cumulonimbus clouds , broken clouds (60-90%) 2400 m";"10.0 and more";"18.0";
"05.06.2015 21:50";"22.0";"730.2";"766.6";"73";"Wind blowing from the south";"7";"";"Thunderstorm";"";"Few clouds (10-30%) 1800 m, cumulonimbus clouds , scattered clouds (40-50%) 2100 m, broken clouds (60-90%) 3000 m";"10.0 and more";"17.0";
"05.06.2015 21:20";"23.0";"729.6";"765.8";"78";"Wind blowing from the west-southwest";"4";"";"Light shower(s), rain, in the vicinity thunderstorm";"";"Few clouds (10-30%) 1740 m, cumulonimbus clouds , scattered clouds (40-50%) 2100 m, broken clouds (60-90%) 3000 m";"10.0 and more";"19.0";
"05.06.2015 20:50";"23.0";"728.8";"765.0";"65";"variable wind direction";"2";"";"In the vicinity thunderstorm";"";"Scattered clouds (40-50%) 1950 m, cumulonimbus clouds , scattered clouds (40-50%) 2100 m, broken clouds (60-90%) 3300 m";"10.0 and more";"16.0";
"05.06.2015 20:20";"23.0";"728.2";"764.3";"74";"Wind blowing from the west-northwest";"4";"";"Light thunderstorm, rain";"";"Scattered clouds (40-50%) 1950 m, cumulonimbus clouds , scattered clouds (40-50%) 2100 m, broken clouds (60-90%) 3300 m";"10.0 and more";"18.0";
"05.06.2015 19:50";"28.0";"728.0";"763.5";"45";"Wind blowing from the south-west";"5";"11";"Thunderstorm";"";"Scattered clouds (40-50%) 1950 m, cumulonimbus clouds , scattered clouds (40-50%) 2100 m, broken clouds (60-90%) 6300 m";"10.0 and more";"15.0";
"05.06.2015 19:20";"28.0";"728.0";"763.5";"42";"Wind blowing from the north-northeast";"2";"";"In the vicinity thunderstorm";"";"Few clouds (10-30%) 1950 m, cumulonimbus clouds , broken clouds (60-90%) 6300 m";"10.0 and more";"14.0";
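A small sketch, in plain Java, of how one data line of this file could be turned into the two facts the exercise needs: the observation time and whether column 9 (WW) reports an anomaly. WeatherRecord and its methods are hypothetical helpers, not taken from the tutorial code. The sketch assumes that field values never contain a semicolon (true for the sample above) and it reads "between 8am and 10pm" as the strict test 8 < HH < 22; the header line is not handled.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

/** One observation from the semicolon-separated weather file (hypothetical helper). */
class WeatherRecord {
    private static final DateTimeFormatter LOCAL_TIME =
            DateTimeFormatter.ofPattern("dd.MM.yyyy HH:mm");

    final LocalDateTime time; // column 1: local time in Geneva (airport)
    final String anomaly;     // column 9 (WW): empty when nothing was reported

    WeatherRecord(LocalDateTime time, String anomaly) {
        this.time = time;
        this.anomaly = anomaly;
    }

    /** Parses one data line; passing the header line would fail on the date. */
    static WeatherRecord parse(String line) {
        // Limit -1 keeps empty fields; assumes no semicolons inside quoted values.
        String[] fields = line.split(";", -1);
        String time = unquote(fields[0]);
        String anomaly = unquote(fields[8]); // column 9 has zero-based index 8
        return new WeatherRecord(LocalDateTime.parse(time, LOCAL_TIME), anomaly);
    }

    private static String unquote(String field) {
        return field.trim().replace("\"", "");
    }

    /** True when the hour satisfies 8 < HH < 22. */
    boolean inDaytimeWindow() {
        int hour = time.getHour();
        return hour > 8 && hour < 22;
    }

    /** True when column 9 reports any weather anomaly. */
    boolean isBadWeather() {
        return !anomaly.isEmpty();
    }

    public static void main(String[] args) {
        WeatherRecord r = parse("\"05.06.2015 21:50\";\"22.0\";\"730.2\";\"766.6\";\"73\";"
                + "\"Wind blowing from the south\";\"7\";\"\";\"Thunderstorm\";\"\";"
                + "\"Few clouds\";\"10.0 and more\";\"17.0\";");
        System.out.println(r.time + " bad=" + r.isBadWeather() + " daytime=" + r.inDaytimeWindow());
    }
}
```

With this reading, the 21:50 thunderstorm counts towards a bad day, while the 22:50 shower falls outside the window and is ignored.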

MR hands on (3)
Designing the MapReduce flow

Job 1: aggregation by date
Input data:
  "06.06.2015 00:50";"18.0";...
  "06.06.2015 00:20";"18.0";...
  "05.06.2015 23:50";"19.0";...
Map
- Filtering: 8 < HH < 22
- col9 != "": bad=1; col9 = "": bad=0
- Emitting: <date, bad>
Intermediate output <Key, Value>:
  <2015.06.06, 1>
  <2015.05.06, 0>
Reduce
- Grouping: sum(Value) by date
- sum > 0: 'bad' date; sum == 0: 'good' date
- Emitting: <'good' date, preceding 'bad' day count>
Result <Key, Value>:
  <06.06.2015, 0>
  <05.06.2015, 3>

Job 2: aggregation by day of the week
Input data <Key, Value> (the result of job 1):
  <06.06.2015, 0>
  <05.06.2015, 3>
Map
- Transforming: date => day of the week
- Emitting: <day, 1>
Intermediate output <Key, Value>:
  <Sunday, 1>
  <Monday, 1>
Reduce
- Grouping: sum(Value) by day
- Emitting: <day, total count>
Result (final) <Key, Value>:
  <Sunday, 20>
  <Monday, 30>
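The two-job flow above can be sketched in plain Java, run in a single JVM with no Hadoop dependency. This is not the tutorial's AggByDateJob or AggByDayJob code: TwoStageSketch, Observation, aggByDate and aggByDay are hypothetical names. Two assumptions are made. First, the second job keeps only good dates preceded by at least minBadDays bad dates, a threshold taken from the stated goal ("2 or more bad weather days") and not from the slide. Second, dates missing from the input are not treated as a break in a run of bad days.

```java
import java.time.DayOfWeek;
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.util.Arrays;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class TwoStageSketch {
    /** Minimal stand-in for one parsed observation (hypothetical type). */
    static final class Observation {
        final LocalDateTime time;
        final boolean anomaly;

        Observation(LocalDateTime time, boolean anomaly) {
            this.time = time;
            this.anomaly = anomaly;
        }
    }

    // Job 1: returns <good date, number of bad dates immediately before it>.
    static Map<LocalDate, Integer> aggByDate(List<Observation> input) {
        // Map + shuffle: keep observations with 8 < HH < 22 and sum the bad flags per date.
        // TreeMap orders the dates chronologically, standing in for the sorted shuffle.
        Map<LocalDate, Integer> badPerDate = new TreeMap<>();
        for (Observation o : input) {
            int hour = o.time.getHour();
            if (hour > 8 && hour < 22) {
                badPerDate.merge(o.time.toLocalDate(), o.anomaly ? 1 : 0, Integer::sum);
            }
        }
        // Reduce: walk the dates in order, counting the run of bad dates before each good one.
        Map<LocalDate, Integer> result = new TreeMap<>();
        int precedingBad = 0;
        for (Map.Entry<LocalDate, Integer> e : badPerDate.entrySet()) {
            if (e.getValue() > 0) {
                precedingBad++;              // sum > 0: 'bad' date
            } else {
                result.put(e.getKey(), precedingBad); // sum == 0: 'good' date
                precedingBad = 0;
            }
        }
        return result;
    }

    // Job 2: returns <day of the week, count of qualifying good dates>.
    static Map<DayOfWeek, Integer> aggByDay(Map<LocalDate, Integer> stage, int minBadDays) {
        Map<DayOfWeek, Integer> histogram = new EnumMap<>(DayOfWeek.class);
        for (Map.Entry<LocalDate, Integer> e : stage.entrySet()) {
            if (e.getValue() >= minBadDays) {            // assumed filter, see lead-in
                histogram.merge(e.getKey().getDayOfWeek(), 1, Integer::sum); // <day, 1> summed
            }
        }
        return histogram;
    }

    public static void main(String[] args) {
        List<Observation> input = Arrays.asList(
                new Observation(LocalDateTime.of(2015, 6, 3, 10, 0), true),
                new Observation(LocalDateTime.of(2015, 6, 4, 12, 0), true),
                new Observation(LocalDateTime.of(2015, 6, 5, 12, 0), false),
                new Observation(LocalDateTime.of(2015, 6, 5, 23, 0), true), // outside the window
                new Observation(LocalDateTime.of(2015, 6, 6, 12, 0), false));
        Map<LocalDate, Integer> stage = aggByDate(input);
        System.out.println(stage);              // prints {2015-06-05=2, 2015-06-06=0}
        System.out.println(aggByDay(stage, 2)); // prints {FRIDAY=1}
    }
}
```

Counting preceding bad dates relies on the reducer seeing the dates in chronological order, which is why the sketch uses a sorted map; on a cluster this would require all dates to reach a single reducer in sorted key order.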

MR hands on (4)
- Loading the data to HDFS
- Getting the script and code
- Compiling the MapReduce source code
- Packing into a jar file
- Submitting the MapReduce jobs

    cd ~/tutorials; hdfs dfs -put data data; mkdir myMR; cd myMR
    wget https://cern.ch/test-zbaranow/script.txt (and MRtutorial.zip); unzip MRtutorial.zip
    javac -classpath `hadoop classpath` *.java
    jar -cvf GVA.jar *.class
    hadoop jar GVA.jar AggByDateJob data stage
    hadoop jar GVA.jar AggByDayJob stage result

Things that have not been covered
- Types of YARN schedulers
- Combiner: a reducer run just after the map
- Writing your own input splitters, data serializers, partitioners, etc.
- Hadoop streaming: map and reduce as external executables
- Distributed cache: caching of arbitrary files

Summary
- MapReduce is a model for parallel data processing on Hadoop in a batch fashion
- 2 stages
- Job submission is not immediate
- Logic written in Java (but not only)
- Developer skills required
- Fully customizable
- Resource allocation controlled by YARN