MapReduce and Hadoop Debapriyo Majumdar Data Mining – Fall 2014 Indian Statistical Institute Kolkata November 10, 2014

Let's keep the intro short. Modern data mining has to process immense amounts of data quickly, and it does so by exploiting parallelism. Traditional parallelism brings the data to the compute; MapReduce brings the compute to the data. (Pictures courtesy: Glenn K. Lockwood, glennklockwood.com)

The MapReduce paradigm

original input → [Split] → input chunks → [Map] → (key, value) pairs → [Shuffle and sort] → pairs grouped by keys → [Reduce] → output chunks → [Final] → final output

The input may already be split in the filesystem, and the output chunks may not need to be combined. The user only needs to write the map() and the reduce() functions.
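To make the pipeline concrete, here is a minimal single-machine sketch of the paradigm, which the later examples reuse. This is illustrative Python, not the Hadoop API; the helper name map_reduce and its signature are invented for this transcript.

```python
from itertools import groupby
from operator import itemgetter

def map_reduce(inputs, map_fn, reduce_fn):
    """Single-machine sketch of map -> shuffle and sort -> reduce."""
    # Map: each input record yields zero or more (key, value) pairs
    pairs = [kv for record in inputs for kv in map_fn(record)]
    # Shuffle and sort: order pairs by key, then group them by key
    pairs.sort(key=itemgetter(0))
    grouped = ((k, [v for _, v in g])
               for k, g in groupby(pairs, key=itemgetter(0)))
    # Reduce: each key and its list of values yields output records
    return [out for k, vs in grouped for out in reduce_fn(k, vs)]
```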

An example: word frequency counting

Problem: given a collection of documents, count the number of times each word occurs in the collection.

map: for each word w in a document, emit the pair (w, 1)
shuffle and sort: the (w, 1) pairs for the same word are grouped together
reduce: count the number n of pairs for each w and output (w, n)

An example: word frequency counting (illustrated)

[Figure: the word-count pipeline on a small fruit-word collection. Split divides the original input into chunks such as "apple orange peach orange plum"; Map emits pairs such as (apple,1) (orange,1) (peach,1); Shuffle and sort groups the pairs by word; Reduce produces the final counts, e.g. (apple,2), (orange,3), (guava,1), (plum,2), (cherry,2), (fig,2), (peach,3).]
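Using the map_reduce sketch from above, the word-count job fits in a few lines; the document contents here are made up for illustration:

```python
def wc_map(document):
    # map: for each word w, emit the pair (w, 1)
    for word in document.split():
        yield (word, 1)

def wc_reduce(word, ones):
    # reduce: count the number n of (w, 1) pairs and output (w, n)
    yield (word, sum(ones))

docs = ["apple orange peach orange plum",
        "orange apple guava cherry",
        "fig peach fig peach"]
print(map_reduce(docs, wc_map, wc_reduce))
# [('apple', 2), ('cherry', 1), ('fig', 2), ('guava', 1),
#  ('orange', 3), ('peach', 3), ('plum', 1)]
```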

HADOOP. Apache Hadoop: an open source MapReduce framework.

Hadoop

Two main components: the Hadoop Distributed File System (HDFS) to store data, and the MapReduce engine to process data. Both use a master-slave architecture built from commodity servers. In HDFS the master is the Namenode and the slaves are the Datanodes; in the MapReduce engine the master is the JobTracker and the slaves are the TaskTrackers.

HDFS: Blocks

[Figure: a big file is split into blocks, and each block is replicated across several datanodes.]

HDFS runs on top of an existing filesystem. Blocks are 64 MB by default (128 MB recommended). A single file can be larger than any single disk. Permissions are POSIX based. The system is fault tolerant.
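As a worked example (assuming HDFS's default replication factor of 3): a 1 GB file stored with 64 MB blocks is split into 1024/64 = 16 blocks, and replication keeps 3 copies of each block, so 48 block replicas are spread across the datanodes.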

HDFS: Namenode and Datanode

Namenode: only one per Hadoop cluster. It manages the filesystem namespace: the filesystem tree, an edit log, the datanode(s) in which each block is saved, and all the blocks residing in each datanode. A Secondary Namenode acts as a backup namenode.

Datanodes: many per Hadoop cluster. They control block operations, physically put the blocks on the nodes, and do the physical replication.

HDFS: an example. [Figure not reproduced in this transcript.]

MapReduce: JobTracker and TaskTracker

1. The JobClient submits the job to the JobTracker; the job binary is copied into HDFS.
2. The JobTracker talks to the Namenode.
3. The JobTracker creates an execution plan.
4. The JobTracker submits work to the TaskTrackers.
5. The TaskTrackers report progress via heartbeat.
6. The JobTracker updates the status.

Map, Shuffle and Reduce: internal steps

1. Split the data and send it to the mappers.
2. Transform the splits into key/value pairs.
3. Send key/value pairs with the same key to the same reducer.
4. Aggregate the key/value pairs based on user-defined code.
5. Determine how the results are saved.

Fault Tolerance

If the master fails: MapReduce fails, and the entire job has to be restarted.
If a map worker node fails: the master detects the failure (a periodic ping times out), and all the map tasks of that node have to be restarted; even if a map task had finished, its output was stored on the failed node.
If a reduce worker fails: the master sets the status of its currently executing reduce tasks to idle and reschedules these tasks on another reduce worker.

USING MAPREDUCE. Some algorithms using MapReduce.

Matrix-Vector Multiplication

Multiply M = (m_ij), an n × n matrix, by v = (v_j), an n-vector; the result x = Mv has entries x_i = Σ_j m_ij v_j. If n = 1000, there is no need for MapReduce!

Case 1: n is large and M does not fit into main memory, but v does.
Since v fits into main memory, v is available to every map task.
Map: for each matrix element m_ij, emit the key-value pair (i, m_ij v_j).
Shuffle and sort: groups all the values m_ij v_j with the same key i together.
Reduce: sum the values m_ij v_j over all j for the same i.
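A sketch of Case 1 reusing the map_reduce helper from earlier; the sparse triple representation of M and the sample values are made up for illustration:

```python
v = [1.0, 2.0, 3.0]                        # v fits in memory, visible to every map task
M = [(0, 0, 4.0), (0, 2, 1.0),             # sparse M as (i, j, m_ij) triples
     (1, 1, 5.0), (2, 0, 2.0), (2, 2, 6.0)]

def mv_map(entry):
    i, j, m_ij = entry
    yield (i, m_ij * v[j])                 # emit (i, m_ij * v_j)

def mv_reduce(i, products):
    yield (i, sum(products))               # x_i = sum over j of m_ij * v_j

print(map_reduce(M, mv_map, mv_reduce))    # [(0, 7.0), (1, 10.0), (2, 20.0)]
```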

Matrix-Vector Multiplication (continued)

Case 2: n is very large, so that even v does not fit into main memory.
Then every map task would need many disk accesses to fetch the parts of v it needs. Solution: work out how much of v will fit into memory, then partition v into segments of that size and M into the corresponding vertical stripes of columns, so that each segment of v fits into memory. Each map task takes the dot products of one segment of v with the matching stripe of M; map and reduce are otherwise the same as before.
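A sketch of the striping idea (the function name and stripe width are invented for illustration): each matrix entry (i, j, m_ij) is routed to the map task responsible for the segment of v containing v_j, so a task only ever loads its own small segment.

```python
from collections import defaultdict

def partition_into_stripes(M, v, width):
    """Route each (i, j, m_ij) to the stripe holding column j;
    each map task then loads only its own segment of v."""
    entries = defaultdict(list)
    for (i, j, m_ij) in M:
        entries[j // width].append((i, j, m_ij))
    segments = {s: v[s * width:(s + 1) * width] for s in entries}
    return entries, segments

entries, segments = partition_into_stripes(M, v, width=2)
# stripe 0 holds columns 0-1 and segment v[0:2]; stripe 1 holds column 2 and v[2:3]
```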

Relational Algebra

A relation R(A_1, A_2, …, A_n) is a relation with attributes A_i; its schema is the set of attributes.
Selection on a condition C: apply C to each tuple in R and output only those tuples which satisfy C.
Projection on a subset S of the attributes: output only the components for the attributes in S.
Other operations: union, intersection, join, …

Example relations: a four-attribute relation with tuples such as (xyz, abc, 1, true), (abc, xyz, 1, true), (xyz, def, 1, false), (bcd, def, 2, true); and a relation of links between URLs with schema (URL1, URL2) and tuples such as (url1, url2), (url1, url3), (url3, url5).

Selection using MapReduce

Trivial.
Map: for each tuple t in R, test whether t satisfies C. If so, produce the key-value pair (t, t).
Reduce: the identity function; it simply passes each key-value pair to the output.
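A sketch on the links relation, again with the map_reduce helper; the condition and data are illustrative:

```python
def make_selection_map(C):
    def sel_map(t):
        if C(t):                  # test the condition C on each tuple
            yield (t, t)
    return sel_map

def identity_reduce(t, values):
    yield (t, t)                  # identity: pass each key-value pair through

links = [("url1", "url2"), ("url1", "url3"), ("url3", "url5")]
print(map_reduce(links,
                 make_selection_map(lambda t: t[0] == "url1"),
                 identity_reduce))
# [(('url1', 'url2'), ('url1', 'url2')), (('url1', 'url3'), ('url1', 'url3'))]
```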

Union using MapReduce

Union of two relations R and S, which are assumed to have the same schema. Map tasks are generated from chunks of both R and S.
Map: for each tuple t, produce the key-value pair (t, t).
Reduce: only needs to remove duplicates. For each key t there will be either one or two values; output (t, t) in either case.
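The shuffle machinery deduplicates for free; a sketch with illustrative data:

```python
def union_map(t):
    yield (t, t)                  # identical map for tuples of R and of S

def union_reduce(t, values):
    yield (t, t)                  # one or two values arrive; emit t exactly once

R = [("url1", "url2"), ("url1", "url3")]
S = [("url1", "url3"), ("url3", "url5")]
print([k for k, _ in map_reduce(R + S, union_map, union_reduce)])
# [('url1', 'url2'), ('url1', 'url3'), ('url3', 'url5')]
```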

Natural join using MapReduce

Join R(A, B) with S(B, C) on attribute B.
Map: for each tuple t = (a, b) of R, emit the key-value pair (b, (R, a)); for each tuple t = (b, c) of S, emit the key-value pair (b, (S, c)).
Reduce: each key b is associated with a list of values of the form (R, a) or (S, c). Construct all pairs consisting of one value with first component R and one with first component S, say (R, a) and (S, c). The output for this key and value list is a sequence of key-value pairs; the key is irrelevant, and each value is a triple (a, b, c) such that (R, a) and (S, c) are on the input list of values.

Example: R = {(x, a), (y, b), (z, c), (w, d)} joined with S = {(a, 1), (c, 3), (d, 4), (g, 7)} yields (x, a, 1), (z, c, 3), (w, d, 4).
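A sketch of the join, tagging each tuple with its relation of origin so a single map function can serve both inputs (the tagging scheme is an illustrative choice):

```python
def join_map(tagged):
    name, t = tagged
    if name == "R":
        a, b = t
        yield (b, ("R", a))       # R(A, B): key on B, remember a
    else:
        b, c = t
        yield (b, ("S", c))       # S(B, C): key on B, remember c

def join_reduce(b, values):
    r_vals = [a for name, a in values if name == "R"]
    s_vals = [c for name, c in values if name == "S"]
    for a in r_vals:
        for c in s_vals:
            yield (a, b, c)       # one output triple per (R, S) pairing

R = [("x", "a"), ("y", "b"), ("z", "c"), ("w", "d")]
S = [("a", 1), ("c", 3), ("d", 4), ("g", 7)]
tagged = [("R", t) for t in R] + [("S", t) for t in S]
print(map_reduce(tagged, join_map, join_reduce))
# [('x', 'a', 1), ('z', 'c', 3), ('w', 'd', 4)]
```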

Grouping and Aggregation using MapReduce

Group and aggregate a relation R(A, B) with an aggregation function γ(B), grouping by A; in SQL: select A, sum(B) from R group by A.
Map: for each tuple t = (a, b) of R, emit the key-value pair (a, b).
Reduce: for each group {(a, b_1), …, (a, b_m)} represented by a key a, apply γ; for γ = SUM this gives b_a = b_1 + … + b_m. Output (a, b_a).

Example: R = {(x, 2), (y, 1), (z, 4), (z, 1), (x, 5)} gives (x, 7), (y, 1), (z, 5).
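A sketch of the SUM aggregation with the same helper:

```python
def group_map(t):
    a, b = t
    yield (a, b)                  # key on the grouping attribute A

def sum_reduce(a, bs):
    yield (a, sum(bs))            # gamma = SUM over the group's B values

R = [("x", 2), ("y", 1), ("z", 4), ("z", 1), ("x", 5)]
print(map_reduce(R, group_map, sum_reduce))
# [('x', 7), ('y', 1), ('z', 5)]
```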

Matrix multiplication using MapReduce

Compute C = AB, where A is an m × n matrix and B is an n × l matrix, so that C is m × l.
Think of a matrix as a relation with three attributes. For example, matrix A is represented by the relation A(I, J, V): for every non-zero entry (i, j, a_ij), the row number is the value of I, the column number is the value of J, and the entry itself is the value of V. An added advantage: most large matrices are sparse, so the relation has far fewer than m × n tuples.
The product is essentially a natural join followed by a grouping with aggregation.

Matrix multiplication using MapReduce (step 1)

Represent A as A(I, J, V) with tuples (i, j, a_ij) and B as B(J, K, W) with tuples (j, k, b_jk). The natural join of A(I, J, V) and B(J, K, W) produces tuples (i, j, k, a_ij, b_jk).
Map: for every (i, j, a_ij), emit the key-value pair (j, (A, i, a_ij)); for every (j, k, b_jk), emit the key-value pair (j, (B, k, b_jk)).
Reduce: for each key j, for each pair of values (A, i, a_ij) and (B, k, b_jk), produce the key-value pair ((i, k), a_ij b_jk).

Matrix multiplication using MapReduce (step 2)

The first MapReduce pass has produced the key-value pairs ((i, k), a_ij b_jk). A second MapReduce pass groups and aggregates them.
Map: the identity; just emit the key-value pair ((i, k), a_ij b_jk).
Reduce: for each key (i, k), produce the sum of all the values for that key: c_ik = Σ_j a_ij b_jk.
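The full two-pass computation as a sketch, reusing map_reduce; the 2 × 2 matrices are illustrative:

```python
A = [(0, 0, 1.0), (0, 1, 2.0), (1, 1, 3.0)]   # (i, j, a_ij) triples
B = [(0, 0, 4.0), (1, 0, 5.0), (1, 1, 6.0)]   # (j, k, b_jk) triples

def join_on_j(tagged):
    name, (r, c, val) = tagged
    if name == "A":
        yield (c, ("A", r, val))      # key a_ij by its column j
    else:
        yield (r, ("B", c, val))      # key b_jk by its row j

def multiply(j, values):
    a_vals = [(i, a) for name, i, a in values if name == "A"]
    b_vals = [(k, b) for name, k, b in values if name == "B"]
    for i, a in a_vals:
        for k, b in b_vals:
            yield ((i, k), a * b)     # one term a_ij * b_jk of c_ik

def identity_map(pair):
    yield pair                        # second pass: map is the identity

def sum_terms(ik, terms):
    yield (ik, sum(terms))            # c_ik = sum over j of a_ij * b_jk

tagged = [("A", t) for t in A] + [("B", t) for t in B]
terms = map_reduce(tagged, join_on_j, multiply)      # first pass
print(map_reduce(terms, identity_map, sum_terms))    # second pass
# [((0, 0), 14.0), ((0, 1), 12.0), ((1, 0), 15.0), ((1, 1), 18.0)]
```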

Matrix multiplication using MapReduce: Method 2

A method with a single MapReduce step.
Map: for every (i, j, a_ij), emit for all k = 1, …, l the key-value pair ((i, k), (A, j, a_ij)); for every (j, k, b_jk), emit for all i = 1, …, m the key-value pair ((i, k), (B, j, b_jk)).
Reduce: for each key (i, k), sort the values (A, j, a_ij) and (B, j, b_jk) by j so as to group them by j; for each j, multiply a_ij by b_jk; sum the products over j to produce c_ik = Σ_j a_ij b_jk for the key (i, k).
Caveat: the list of values for one key may not fit in main memory, forcing an expensive external sort!
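A sketch of Method 2, reusing A, B, and map_reduce from above; for brevity the reducer groups by j with dictionaries instead of the sort described on the slide:

```python
m, l = 2, 2    # dimensions of C = AB (A is m x n, B is n x l)

def one_step_map(tagged):
    name, (r, c, val) = tagged
    if name == "A":
        i, j = r, c
        for k in range(l):            # a_ij contributes to every c_ik
            yield ((i, k), ("A", j, val))
    else:
        j, k = r, c
        for i in range(m):            # b_jk contributes to every c_ik
            yield ((i, k), ("B", j, val))

def one_step_reduce(ik, values):
    a_by_j = {j: a for name, j, a in values if name == "A"}
    b_by_j = {j: b for name, j, b in values if name == "B"}
    # pair up the values with the same j and sum the products
    yield (ik, sum(a_by_j[j] * b_by_j[j] for j in a_by_j if j in b_by_j))

tagged = [("A", t) for t in A] + [("B", t) for t in B]
print(map_reduce(tagged, one_step_map, one_step_reduce))
# [((0, 0), 14.0), ((0, 1), 12.0), ((1, 0), 15.0), ((1, 1), 18.0)]
```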

References and acknowledgements

Mining of Massive Datasets, by Leskovec, Rajaraman and Ullman, Chapter 2.
Slides by Dwaipayan Roy.