“Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters”, published in SIGMOD '07 by Yahoo!. Senthil Nathan N, IIT Bombay.

Presentation transcript:

Overview: MapReduce; Hadoop & MapReduce; HadoopDB & MapReduce; MapReduceMerge; Implementations of Relational Operators

MapReduce: Word Count [Figure: the 100 lines of the given document are divided into splits 1 to 4 and read by Mapper 1 and Mapper 2. A partitioner routes the map output to Reducer 1, Reducer 2, and Reducer 3 by range: words starting with A to H, words starting with I to M, and words starting with N to Z.]

MapReduce Example (map): map(“/path/wikipedia.org”, “to be or as”). The output of the map function is (potentially many) key/value pairs. In our case, it outputs (word, “1”) once per word in the document: (“to”, “1”), (“be”, “1”), (“or”, “1”), (“to”, “1”), (“be”, “1”), (“or”, “1”).

MapReduce Example (reduce): Each reducer reads the key/value pairs from one partition, then sorts and groups them by key. In our case, it computes the sum for each word. Partition 1 holds (be, 1), (as, 1), (be, 1), (be, 1), (as, 1); partition 2 holds (or, 1), (to, 1), (or, 1), (to, 1), (or, 1). Reducer 1: (be, [1, 1]), (as, [1, 1]) → [be 2, as 2]. Reducer 2: (or, [1, 1]), (to, [1, 1]) → [or 2, to 2].
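The word-count flow described on the slides above can be sketched in a few lines of Python. This is a single-process illustration, not Hadoop code: the function names (`map_fn`, `partition`, `reduce_fn`, `word_count`) are hypothetical, the three letter ranges follow the word-count figure, and the input text repeats each word twice (an assumption) so that the totals match the reducer output shown on the slide.

```python
from collections import defaultdict

def map_fn(doc_name, text):
    """Emit (word, 1) once per word occurrence in the document."""
    return [(word, 1) for word in text.split()]

def partition(word):
    """Hypothetical range partitioner on the first letter: A-H, I-M, N-Z."""
    first = word[0].lower()
    if first <= "h":
        return 0
    if first <= "m":
        return 1
    return 2

def reduce_fn(word, counts):
    """Sum the list of 1s collected for one word."""
    return (word, sum(counts))

def word_count(doc_name, text):
    # One dict per partition; each dict groups values by key.
    partitions = [defaultdict(list) for _ in range(3)]
    for word, one in map_fn(doc_name, text):
        partitions[partition(word)][word].append(one)
    results = []
    for part in partitions:
        # Each "reducer" sorts its keys, then reduces every group.
        for word in sorted(part):
            results.append(reduce_fn(word, part[word]))
    return results

# Assumed input: the slide's four words, each occurring twice.
counts = word_count("/path/wikipedia.org", "to be or as to be or as")
print(counts)
```

With this input, "be" and "as" land in the first partition and "or" and "to" in the third, mirroring the split between the two reducers on the slide.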

MapReduce Schematic

Overview: MapReduce; Hadoop & MapReduce (current section); HadoopDB & MapReduce; MapReduceMerge; Implementations of Relational Operators

Hadoop & MapReduce [Figure: a MapReduce job, with its Input Format implementation, runs on Hadoop core, which consists of HDFS and the MapReduce framework. The master node runs the Name Node and the Job Tracker; the worker nodes run Task Trackers and Data Nodes.]

Hadoop & MapReduce [Figure: the same architecture with Hive on top. Hive turns an SQL query into a MapReduce job, which runs on Hadoop core (HDFS and the MapReduce framework). The master node runs the Name Node and the Job Tracker; the worker nodes run Task Trackers and Data Nodes.]

Relational Data Processing using MapReduce: map(key emp_id, value bonus); reduce(key emp_id, value bonus_list)
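The two signatures above suggest a per-employee aggregation. A minimal sketch, assuming the reducer sums each employee's bonuses (the slide gives only the signatures, so the summing step, the `run_job` driver, and the sample rows are all hypothetical):

```python
from collections import defaultdict

def map_fn(emp_id, bonus):
    """Pass each bonus record through, keyed by employee id."""
    return [(emp_id, bonus)]

def reduce_fn(emp_id, bonus_list):
    """Assumed aggregate: total bonus per employee."""
    return (emp_id, sum(bonus_list))

def run_job(records):
    """Tiny in-memory stand-in for the shuffle: group map output by key."""
    groups = defaultdict(list)
    for emp_id, bonus in records:
        for key, value in map_fn(emp_id, bonus):
            groups[key].append(value)
    return [reduce_fn(key, groups[key]) for key in sorted(groups)]

# Hypothetical (emp_id, bonus) rows.
bonus_rows = [(1, 100), (2, 50), (1, 50), (3, 75), (2, 25)]
totals = run_job(bonus_rows)
print(totals)
```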

Overview: MapReduce; Hadoop & MapReduce; HadoopDB & MapReduce (current section); MapReduceMerge; Implementations of Relational Operators

HadoopDB & MapReduce

Overview: MapReduce; Hadoop & MapReduce; HadoopDB & MapReduce; MapReduceMerge (current section); Implementations of Relational Operators

Why extend MapReduce? It is well suited to unstructured data but not to structured data (RDBMS). Joins are inefficient in MapReduce. The workflow is strictly two-phase: Map is the first phase, Reduce is the second.

Extension to MapReduce: Add a merge phase to the MapReduce algorithm. This allows efficient processing of multiple datasets (RDBMS). Merge is made up of several components: Merge Function, Processor Function, Partition Selector, Configurable Iterator.

Components of Merge: Merger → same principle as map and reduce. Processor → processes data from one source. Partition Selector → selects the data that should go to the merger. Configurable Iterator → defines how to iterate through each list as the merging is done.
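One way to picture how the four components fit together is the sketch below. It is an illustration under stated assumptions, not the interface from the paper: every function name and signature is hypothetical, the partition selector simply pairs partitions with the same index, the processor sorts its input, and the iterator is a plain nested loop.

```python
def partition_selector(merger_index, right_partition_indices):
    """Select which right-hand partitions this merger reads (assumed: same index)."""
    return [i for i in right_partition_indices if i == merger_index]

def processor(records):
    """Process data from one source before merging (assumed: sort by key)."""
    return sorted(records)

def iterate(left, right):
    """Configurable iterator (assumed: nested loop over both lists)."""
    for left_rec in left:
        for right_rec in right:
            yield left_rec, right_rec

def merger(left_rec, right_rec):
    """Combine one record from each source when the keys match."""
    (left_key, left_val), (right_key, right_val) = left_rec, right_rec
    if left_key == right_key:
        return (left_key, (left_val, right_val))
    return None

def merge_phase(left_parts, right_parts):
    """Run one merger per left partition over the selected right partitions."""
    out = []
    for i, left in enumerate(left_parts):
        for j in partition_selector(i, range(len(right_parts))):
            for left_rec, right_rec in iterate(processor(left), processor(right_parts[j])):
                merged = merger(left_rec, right_rec)
                if merged is not None:
                    out.append(merged)
    return out

# Hypothetical reducer outputs from two sources, two partitions each.
left_parts = [[("a", 1), ("b", 2)], [("x", 9)]]
right_parts = [[("b", 20), ("a", 10)], [("y", 5)]]
merged = merge_phase(left_parts, right_parts)
print(merged)
```

Swapping `iterate` for a sorted two-pointer walk would change the merge strategy without touching the other three components, which is the point of keeping the iterator configurable.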

Implementation Overview

MapReduceMerge Example: Join the employee and department tables and compute bonuses. [Tables: Employee, Department]

Example: Map(employee dataset) [figure row: 1, B, $150]

Example: Map(department dataset) [figure row: 1, B, $150]

Example: Reduce(employee dataset) [figure row: 1, B, $150]

Example: Reduce(department dataset)

Example: Merge(employee, department) [figure row: 1, B, $150]
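The tables for this example did not survive in the transcript, so the following is a sketch on made-up data rather than a reconstruction of the slides. It assumes employee rows of the form (emp_id, dept_id, bonus), department rows of the form (dept_id, adjustment in percent), and a merge step that joins on dept_id and scales each employee's total bonus by the department's adjustment. All names and values are hypothetical.

```python
from collections import defaultdict

# Hypothetical inputs.
employees = [(1, "B", 100), (1, "B", 50), (2, "A", 80), (3, "B", 40)]
departments = [("A", 110), ("B", 120)]

def map_reduce(records, map_fn, reduce_fn):
    """In-memory stand-in for one MapReduce pass; output is sorted by key."""
    groups = defaultdict(list)
    for rec in records:
        for key, value in map_fn(rec):
            groups[key].append(value)
    return [reduce_fn(key, groups[key]) for key in sorted(groups)]

# Employee pass: key by (dept_id, emp_id) and sum the bonuses.
emp_out = map_reduce(
    employees,
    lambda r: [((r[1], r[0]), r[2])],
    lambda key, bonuses: (key, sum(bonuses)),
)

# Department pass: key by dept_id and keep the adjustment.
dept_out = map_reduce(
    departments,
    lambda r: [(r[0], r[1])],
    lambda key, adjustments: (key, adjustments[0]),
)

def merge(emp_sorted, dept_sorted):
    """Walk both sorted outputs and join employee totals to their department."""
    result = []
    i = j = 0
    while i < len(emp_sorted) and j < len(dept_sorted):
        (dept_id, emp_id), bonus = emp_sorted[i]
        other_dept, percent = dept_sorted[j]
        if dept_id == other_dept:
            result.append((emp_id, bonus * percent // 100))
            i += 1          # the department may match further employees
        elif dept_id < other_dept:
            i += 1          # employee with no matching department
        else:
            j += 1          # move on to the next department
    return result

adjusted = merge(emp_out, dept_out)
print(adjusted)
```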

Implementing Relational Algebra Operations: Projection; Aggregation; Selection; Set operations (Union, Intersection, Difference); Cartesian Product; Rename; Join

Projection: All we have to do is emit a subset of the data passed in. A mapper alone can do this: map(key emp_id, value emp_info){ emit(emp_id); }

Aggregation: By choosing appropriate keys, we can implement the “group by” and aggregate SQL operators in MapReduce. The reducer sorts and groups the intermediate key/value pairs by key.

Aggregation

Selection: If the selection condition involves only the attributes of one data source, it can be implemented in mappers: map(key emp_id, value emp_info){ if(emp_info.dept_id == "A") emit(emp_id, emp_info); }

Selection: If the condition is on aggregates or on a group of values contained in one data source, it can be implemented in reducers. If it involves attributes or aggregates from both data sources, implement it in mergers.
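A selection on an aggregate behaves like SQL's HAVING clause: the filter can only run once the reducer has seen the whole group. A minimal sketch, assuming (emp_id, bonus) rows and a hypothetical threshold on each employee's total bonus:

```python
from collections import defaultdict

def select_on_aggregate(records, threshold):
    """Keep only employees whose total bonus exceeds `threshold`."""
    groups = defaultdict(list)
    for emp_id, bonus in records:        # map: identity, keyed by emp_id
        groups[emp_id].append(bonus)
    selected = []
    for emp_id in sorted(groups):        # reduce: aggregate, then filter
        total = sum(groups[emp_id])
        if total > threshold:
            selected.append((emp_id, total))
    return selected

# Hypothetical rows and threshold.
rows = [(1, 100), (2, 50), (1, 50), (3, 75)]
high_earners = select_on_aggregate(rows, 70)
print(high_earners)
```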

Set Union: Let each of the two MapReduces emit a sorted list of unique elements. Mergers then just iterate simultaneously over the two lists: if there is a lesser value, store it and increment its iterator; if the two values are equal, store one of the two and increment both iterators.
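The union rule can be written as a two-pointer walk over the two sorted lists. The function name `merge_union` is hypothetical, and the final step that copies whatever remains once one list is exhausted is an addition the slide does not spell out.

```python
def merge_union(left, right):
    """Union of two sorted lists of unique elements."""
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] < right[j]:
            result.append(left[i])    # lesser value: store it, advance its iterator
            i += 1
        elif right[j] < left[i]:
            result.append(right[j])
            j += 1
        else:
            result.append(left[i])    # equal: store one, advance both iterators
            i += 1
            j += 1
    # Assumed final step: copy the tail of whichever list is not exhausted.
    result.extend(left[i:])
    result.extend(right[j:])
    return result

union = merge_union([1, 3, 5, 7], [2, 3, 6])
print(union)
```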

Set Intersection: Let each of the two MapReduces emit a sorted list of unique elements. Mergers then just iterate simultaneously over the two lists: if there is a lesser value, increment its iterator without storing anything; if the two values are equal, store one of the two and increment both iterators.
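The intersection rule differs from union only in what gets stored. A minimal sketch with a hypothetical function name:

```python
def merge_intersection(left, right):
    """Intersection of two sorted lists of unique elements."""
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] < right[j]:
            i += 1                    # lesser value: skip it
        elif right[j] < left[i]:
            j += 1
        else:
            result.append(left[i])    # equal: store one, advance both iterators
            i += 1
            j += 1
    return result

common = merge_intersection([1, 3, 5, 7], [2, 3, 6, 7])
print(common)
```

Nothing needs to be copied after the loop: once either list is exhausted, no further element can appear in both.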

Sort-Merge Join: Map: partition records into key ranges according to the values of the attributes on which you are sorting, aiming for an even distribution of values to reducers. Reduce: sort the data. Merge: join the sorted data for each key range.
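The three steps can be sketched on in-memory lists of (key, value) records. This is an illustration, not a distributed implementation: the function names are hypothetical, the range boundaries are passed in by hand, and the merge step is written to handle duplicate keys on both sides.

```python
from bisect import bisect_right

def range_partition(records, boundaries):
    """Map step: place each record in the key range it falls into."""
    parts = [[] for _ in range(len(boundaries) + 1)]
    for key, value in records:
        parts[bisect_right(boundaries, key)].append((key, value))
    return parts

def merge_join(left, right):
    """Merge step: join two lists that are already sorted by key."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][0], right[j][0]
        if left_key < right_key:
            i += 1
        elif left_key > right_key:
            j += 1
        else:
            # Find the run of right-hand records sharing this key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == left_key:
                j_end += 1
            # Pair every left record with that key against the whole run.
            while i < len(left) and left[i][0] == left_key:
                for _, right_val in right[j:j_end]:
                    out.append((left_key, left[i][1], right_val))
                i += 1
            j = j_end
    return out

def sort_merge_join(left, right, boundaries):
    out = []
    left_parts = range_partition(left, boundaries)
    right_parts = range_partition(right, boundaries)
    for left_part, right_part in zip(left_parts, right_parts):
        # Reduce step: sort each range; then merge the matching ranges.
        out.extend(merge_join(sorted(left_part), sorted(right_part)))
    return out

# Hypothetical inputs; keys up to 2 go to range 0, keys from 3 up to range 1.
left_rows = [(5, "e"), (1, "a"), (3, "c"), (3, "d")]
right_rows = [(3, "x"), (5, "y"), (2, "z"), (3, "w")]
joined = sort_merge_join(left_rows, right_rows, [3])
print(joined)
```

Because both sides use the same boundaries, records with equal keys always land in the same range, so each range can be joined independently.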

Conclusion: Map-Reduce-Merge adds the ability to execute arbitrary relational algebra queries. Next steps: develop an SQL-like interface and a query optimizer.

References: Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters, SIGMOD '07. MapReduce: Simplified Data Processing on Large Clusters, OSDI '04. HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads, VLDB '09. Hive: A Warehousing Solution over a MapReduce Framework, VLDB '09.