IBM Research ® © 2007 IBM Corporation Introduction to Map-Reduce and Join Processing.


IBM Research | India Research Lab

Hadoop – A Very Brief Introduction
- A framework for creating distributed applications that process huge amounts of data
- Design goals: scalability, fault tolerance, ease of programming
- Two main components:
  - HDFS, the Hadoop Distributed File System
  - Map-Reduce
- How is data organized on HDFS?
- How is data processed using Map-Reduce?

IBM Research | India Research Lab HDFS  Stores files in blocks across many nodes in a cluster  Replicates the blocks across nodes for durability  Default – 64 MB  Master/Slave Architecture  HDFS Master  NameNode Runs on a single node as a master process Directs client access to files in HDFS  HDFS Slave  DataNode Runs on all nodes in the cluster Block creation/replication/deletion Takes orders from the namenode

HDFS
[Figure: a file is divided into blocks A, B and C; with a replication factor of 3, each block is stored three times and all of these blocks are distributed across the cluster.]

HDFS
[Figure: a client puts File1.txt into HDFS; the NameNode records which DataNodes hold each block (node lists such as 2, 5, 6 and 2, 3, 4), and the blocks are written to those DataNodes.]

HDFS
- Read-Time = Transfer-Rate x Number of Machines
[Figure: a client reads a file; the NameNode reports which DataNodes hold each block, and the blocks are fetched from the DataNodes in parallel.]

IBM Research | India Research Lab HDFS  Fault-Tolerant  Handles Node Failures  Self-Healing  Rebalances files across cluster  Data from the remaining two nodes is automatically copied  Scalable  Just by adding new nodes NameNode Data Nodes Read File , 4, 5 2, 5, 6 2, 3, 4 3, 5, 6 2, 3, 6

IBM Research | India Research Lab Map-Reduce  Logical Functions : Mappers and Reducers  Developers write map and reducer functions then submit a jar to the Hadoop Cluster  Hadoop handles distributing the Map and Reduce tasks across the cluster

IBM Research | India Research Lab Map-Reduce  A map task is started for each split / 64 MB block. Each map task generates some intermediate data.  Hadoop collects the output of all map tasks, reorganizes them and passes the reorganized data to Reduce tasks  Reduce tasks process this re-organized data and generate the final output  Flow  HDFS Block to Map Task  Map Task to Hadoop Engine  Hadoop Shuffles and Sorts the Map output  Hadoop Engine to Reduce Tasks and Reduce Processing

HDFS to Map Tasks
- Records are read one by one from each block and passed to map for processing; the component responsible is the InputFormat / RecordReader
- A record is passed as a key-value pair: the key is an offset and the value is the record
- The offset is usually ignored by the map
[Figure: an InputFormat feeds three map tasks with pairs (0, R), (10, R), (20, R), ... , (140, R).]
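A hypothetical `record_reader` (the name is an assumption, mirroring Hadoop's RecordReader role) shows how a block of text becomes (offset, record) pairs:

```python
def record_reader(block_text):
    """Yield (byte-offset, record) key-value pairs, one per line of a block."""
    offset = 0
    for line in block_text.splitlines(keepends=True):
        yield (offset, line.rstrip("\n"))  # key = offset, value = the record
        offset += len(line)

pairs = list(record_reader("r1\nr2\nr3\n"))
# pairs == [(0, "r1"), (3, "r2"), (6, "r3")]
```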

IBM Research | India Research Lab Map Task  Takes in a key-value pair and transforms it to a set of key-value pairs {K1, V1} ==> [{K2, V2}] ( 0, R ) (10, R ) (20, R ) (30, R ) (40, R ) ( 0, R ) (10, R ) (20, R ) (30, R ) (50, R )) ( 0, R ) (10, R ) (20, R ) (30, R ) (50, R ) MAP-1 MAP-2 MAP-3 (2, 3) (2, 4) (2,4) (6, 4) (2, 9) (4, 9) (8, 9) (2, 3) (2, 5) (4, 5) (2, 7) Example: If the second column is an odd number, don’t do anything. If the second column is an even number generate as many pairs as the number of even divisors of the value in the second column. The key is the divisor and the value is the value in the third column

Hadoop Sorting and Shuffling
- Hadoop processes the key-value pairs output by the maps so that the values of all pairs with the same key are grouped together
- These groups are then passed to the reducers for processing
[Figure: the map outputs (2, 3), (2, 4), (2, 4), (6, 4), (2, 9), (4, 9), (8, 9), (2, 3), (2, 5), (4, 5), (2, 7) are shuffled into the groups (2, [3, 3, 3, 4, 4, 5, 7, 9]), (4, [5, 9]), (6, [4]), (8, [9]).]
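The grouping step can be sketched as a plain dictionary build; the `shuffle` function below is an illustrative stand-in for Hadoop's shuffle, not its implementation:

```python
from collections import defaultdict

def shuffle(map_outputs):
    """Group all (key, value) pairs emitted by the map tasks by key."""
    groups = defaultdict(list)
    for pairs in map_outputs:       # one list of pairs per map task
        for k, v in pairs:
            groups[k].append(v)     # values with the same key end up together
    return dict(groups)

# Two map tasks' outputs collapse into per-key groups
print(shuffle([[(2, 3), (2, 4)], [(6, 4), (2, 4)]]))
# {2: [3, 4, 4], 6: [4]}
```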

Hadoop Engine to Reduce Tasks and Reduce Processing
- Let the number of distinct keys (groups) be m and the number of reduce tasks be k
- The m groups are distributed across the k reduce tasks using a hash function
- Each reduce task processes its groups and generates the output; in this example, the reducer sums all the values
[Figure: Reducer 1 receives (2, [3, 4, 4, 9, 3, 3, 5, 7]) and (6, [4]) and emits (2, 38) and (6, 4); Reducer 2 receives (4, [9, 5]) and (8, [9]) and emits (4, 14) and (8, 9).]
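Both steps, routing groups to reduce tasks and summing within a group, fit in two small functions. The `% num_reducers` partitioner is a simplification of Hadoop's default hash partitioner, and the function names are assumptions:

```python
def partition(key, num_reducers):
    """Assign a key's group to one of k reduce tasks by hashing."""
    return hash(key) % num_reducers  # for ints, hash(n) == n, so this is stable

def reduce_sum(key, values):
    """Example reducer: sum all the values in one key group."""
    return (key, sum(values))

# With k = 2, key 2 goes to reducer 0 and key 3 to reducer 1
print(partition(2, 2), partition(3, 2))        # 0 1
print(reduce_sum(2, [3, 4, 4, 9, 3, 3, 5, 7])) # (2, 38), as on the slide
```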

Word-Count
[Example: the input lines "Hadoop Uses Map-Reduce", "There is a Map Phase", "There is a Reduce Phase" are mapped to pairs (Hadoop, 1), (Uses, 1), (Map, 1), (Reduce, 1), (There, 1), (is, 1), (a, 1), (Map, 1), (Phase, 1), (There, 1), (is, 1), (a, 1), (Reduce, 1), (Phase, 1); after shuffling, the groups (a, [1,1]), (Hadoop, [1]), (is, [1,1]), (map, [1,1]), (phase, [1,1]), (reduce, [1,1]), (there, [1,1]), (uses, [1]) are split across reducers by key range (A-I, J-Q, R-Z), and the reducers emit (a, 2), (hadoop, 1), (is, 2), (map, 2), (phase, 2), (reduce, 2), (there, 2), (uses, 1).]
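The whole word-count pipeline, map, shuffle and reduce, can be simulated in one function. This in-memory `wordcount` is a sketch of the idea, not Hadoop code:

```python
from collections import defaultdict

def wordcount(lines):
    """Simulate map, shuffle and reduce for word count over a list of lines."""
    # Map: emit (word, 1) for every word (lower-cased, as in the slide's output)
    mapped = [(w.lower(), 1) for line in lines for w in line.split()]
    # Shuffle: group the 1s by word
    groups = defaultdict(list)
    for w, n in mapped:
        groups[w].append(n)
    # Reduce: sum the counts for each word
    return {w: sum(ns) for w, ns in groups.items()}

print(wordcount(["a b a", "b c"]))  # {'a': 2, 'b': 2, 'c': 1}
```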

Map-Reduce Example: Aggregation
- Compute the average of B for each distinct value of A
[Figure: two map tasks emit the (A, B) pairs (1, 10), (2, 20), (1, 10), (1, 30), (3, 40), (2, 10), (1, 20); after shuffling, the groups (1, [10, 10, 30, 20]), (2, [10, 20]), (3, [40]) are split across two reducers, which emit (1, 17.5), (2, 15), (3, 40).]
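The aggregation above, restated in a single-process sketch (the function name and the (A, B, C) row encoding are assumptions):

```python
from collections import defaultdict

def avg_by_key(rows):
    """Average column B for each distinct value of column A."""
    # Map: emit (A, B); shuffle: group the B values by A
    groups = defaultdict(list)
    for a, b, _c in rows:
        groups[a].append(b)
    # Reduce: average each group
    return {a: sum(bs) / len(bs) for a, bs in groups.items()}

rows = [(1, 10, 0), (2, 20, 0), (1, 10, 0), (1, 30, 0),
        (3, 40, 0), (2, 10, 0), (1, 20, 0)]
print(avg_by_key(rows))  # {1: 17.5, 2: 15.0, 3: 40.0}, as on the slide
```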

Designing a Map-Reduce Algorithm
- Think in terms of map and reduce:
  - What data should be the key?
  - What data should be the values?
- Minimize cost:
  - Reading and map-processing cost
  - Communication cost
  - Processing cost at the reducers
- Balance the load:
  - All reducers should receive a similar volume of traffic
  - It should not happen that only a few machines are busy while the others sit idle

IBM Research | India Research Lab Join On Point Data  Select R.A, R.B, S.D where R.A==S.A ABC R R R R R ADE S S S S S MAP 1 MAP 2 (1, [R, 10]) (2, [R, 20]) (1, [R, 10]) (1, [R, 30]) (3, [R, 40]) (1, [S, 20]) (2, [S, 30]) (2, [S, 10]) (3, [S, 50]) (3, [S, 40]) (1, 10, 20) (1, 30, 20) (2, 20, 30) (2, 20, 10) (3, 40, 50) (3, 40, 40) (1, [(R, 10), (R, 10), (R, 30), (S, 20)] ) (2, [(R, 20), (S, 30), (S, 10)] ) (3, [(R, 40), (S, 50), (S, 40)] Reducer 1 Reducer 2

IBM Research | India Research Lab Join On Point Data  Select R.A, R.B, S.D where R.A==S.A  Attribute A range is divided into k parts. A hash function hashes the value of attribute A to [1,…,k] 1 2 … … k  A reducer is defined for each of the k part  A tuple from R and S is communicated to reducer k if the value of R.A or S.A hashes to bucket k  Each reducer computes the partial join output

IBM Research | India Research Lab Join On Point Data  Assume k = 3, h(1)=0, h(2)=1, h(3)=2 ABC R R R R R ADE S S S S S R R R S R S S R S S R1 S1 R3 S1 R4 S1 R2 S2 R2 S3 R5 S4 R5 S5

Map-Reduce Example: Inequality Join
- Select R.A, R.B, S.D where R.A <= S.A
- Consider a 3-node cluster
[Figure: each S tuple is sent only to the reducer for its own partition, e.g. (r1, [S, 1, 20]), (r2, [S, 2, 30]), (r2, [S, 2, 10]), (r3, [S, 3, 50]), (r3, [S, 3, 40]); each R tuple is replicated to its own partition and every higher one, e.g. (r1, [R, 1, 10]), (r2, [R, 1, 10]), (r3, [R, 1, 10]), (r2, [R, 2, 20]), (r3, [R, 2, 20]), ..., (r3, [R, 3, 40]); each reducer then checks R.A <= S.A locally and emits output tuples such as (1, 10, 20), (1, 30, 20), (2, 20, 50), (3, 40, 40).]
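The replication pattern above can be sketched as follows. The simple `a - 1` partitioner (assuming join keys in 1..k) stands in for the range partitioning on the slide, and the function name is an assumption:

```python
from collections import defaultdict

def inequality_join(R, S, k):
    """Join R(A, B) and S(A, D) on R.A <= S.A across k range partitions."""
    buckets = defaultdict(lambda: {"R": [], "S": []})
    # Map: an S tuple goes only to its own partition ...
    for a, d in S:
        buckets[a - 1]["S"].append((a, d))
    # ... while an R tuple is replicated to its partition and every higher one,
    # since it can join with any S tuple whose A is at least as large
    for a, b in R:
        for p in range(a - 1, k):
            buckets[p]["R"].append((a, b))
    # Reduce: each partition checks the predicate locally
    out = []
    for p in range(k):
        out.extend((ra, rb, sd)
                   for ra, rb in buckets[p]["R"]
                   for sa, sd in buckets[p]["S"]
                   if ra <= sa)
    return sorted(out)

R = [(1, 10), (2, 20), (1, 30), (3, 40)]
S = [(1, 20), (2, 30), (2, 10), (3, 50), (3, 40)]
result = inequality_join(R, S, 3)
# 16 output tuples; each appears exactly once because S is never replicated
```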

Why Is Join on Map-Reduce a Complex Task?
- Data for multiple relations is distributed across different machines
- Map-Reduce is inherently designed for processing a single dataset
- An output tuple can be generated only when all of its input tuples are collected at a common machine
- This needs to happen for all output tuples, which is non-trivial: a priori, we don't know which tuples are going to join to form an output tuple; that is precisely the join problem
- Ensuring it may involve a lot of replication, and hence a lot of communication
- Tuples from every candidate combination need to be collected at reducers, and the join predicates need to be checked