

Google’s Map Reduce

Commodity Clusters
- Web data sets can be very large: tens to hundreds of terabytes.
- Standard architecture emerging:
  - Cluster of commodity Linux nodes
  - Gigabit Ethernet interconnect
- How to organize computations on this architecture?

Cluster Architecture
[Figure: racks of nodes, each node with its own Mem, Disk, and CPU, and one switch per rack.]
- Each rack contains multiple nodes.
- 1 Gbps between any pair of nodes in a rack.
- 2-10 Gbps backbone between racks.

Map Reduce
- Map-reduce is a high-level programming system that allows database processes to be written simply.
- The user writes code for two functions, map and reduce.
- A master controller divides the input data into chunks and assigns different processors to execute the map function on each chunk.
- Other processors, perhaps the same ones, are then assigned to perform the reduce function on pieces of the output from the map function.
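The division of labor described here can be sketched as a single-process driver. This is only an illustration under stated assumptions: `run_map_reduce`, `chunk_size`, and the toy `identity_map` / `max_reduce` functions are hypothetical names, and a plain loop stands in for the processors that a real master controller would assign.

```python
from collections import defaultdict


def run_map_reduce(records, map_fn, reduce_fn, chunk_size=2):
    """Single-process stand-in for the master controller (illustrative only)."""
    # Divide the input into chunks, as the master controller would.
    chunks = [records[i:i + chunk_size]
              for i in range(0, len(records), chunk_size)]

    # Map phase: each chunk would go to a different processor; here we loop.
    intermediate = []
    for chunk in chunks:
        for key, value in chunk:
            intermediate.extend(map_fn(key, value))

    # Collect the map output by key so each key is reduced exactly once.
    grouped = defaultdict(list)
    for key, value in intermediate:
        grouped[key].append(value)

    # Reduce phase: one call per distinct key.
    return {key: reduce_fn(key, values) for key, values in grouped.items()}


def identity_map(key, value):
    # Toy map function: pass each record through unchanged.
    return [(key, value)]


def max_reduce(key, values):
    # Toy reduce function: keep the largest value seen for the key.
    return max(values)


records = [("a", 3), ("b", 5), ("a", 7), ("b", 1), ("c", 4)]
result = run_map_reduce(records, identity_map, max_reduce)
```

The user supplies only the two functions; everything else (chunking, grouping, scheduling) belongs to the system.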

The Map Function
- Each map process is given a chunk of input data.
- Input is thought of as a set of key-value records.
- Output is a set of key-value pairs.
  - These are generally not the same as the key-value pairs in the input.
  - The "keys" are not true keys in the database sense. That is, there can be many pairs with the same key value.

Map Example: Constructing an Inverted Index
Map Function
- Input is a set of (i, d) pairs:
  - i is a document ID.
  - d is the corresponding document.
- The map function scans d and, for each word w it finds, emits the pair (w, i).
  - Notice that in the output, the word is the key and the document ID is the associated value.
- Output is a list of word-ID pairs.
  - It is not necessary to catch duplicate words in the document; this can be done later, at the reduce phase.

The system ensures that all key-value pairs with the same key end up in the same reduce instance.

The Reduce Function
- Input to reduce is a set of key-value pairs that were output by map instances.
  - All the key-value pairs with the same key end up in the same reduce instance.
  - A set (k, v_1), …, (k, v_n) of key-value pairs can be considered as (k, [v_1, …, v_n]).
- The reduce function combines the list of values associated with a given key k.
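One common way to route every pair with the same key to the same reduce instance is to hash the key. The sketch below assumes that approach; the slides do not say how the system does it, and `partition`, `shuffle`, and `num_reducers` are hypothetical names. `zlib.crc32` is used because it is stable across runs, whereas Python's built-in `hash()` on strings is randomized per process.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    # Same key always yields the same reduce-instance number.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(pairs, num_reducers):
    # reducers[r] maps each key routed to instance r to its list of values,
    # turning (k, v_1), ..., (k, v_n) into (k, [v_1, ..., v_n]).
    reducers = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        reducers[partition(key, num_reducers)][key].append(value)
    return [dict(r) for r in reducers]


pairs = [("k", "v1"), ("j", "x"), ("k", "v2")]
instances = shuffle(pairs, 3)
```

Each dictionary in `instances` is what one reduce instance would receive; a key never appears in more than one of them.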

Reduce Example: Constructing an Inverted Index
Reduce Function
- The intermediate result consists of pairs of the form (w, [i_1, i_2, …, i_n]), where the i's are a list of document IDs, one for each occurrence of word w.
- The reduce function takes a list of IDs, eliminates duplicates, and sorts the list of unique IDs.
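Putting the map and reduce halves of the inverted-index example together, a minimal in-memory sketch might look like this. The names `map_index`, `reduce_index`, and `build_inverted_index` are hypothetical, and splitting on whitespace after lowercasing is an assumed stand-in for real tokenization.

```python
from collections import defaultdict


def map_index(doc_id, text):
    # Emit (w, i) for every word occurrence; duplicates are left in on purpose.
    for word in text.lower().split():
        yield (word, doc_id)


def reduce_index(word, doc_ids):
    # Eliminate duplicate IDs and sort the unique ones.
    return sorted(set(doc_ids))


def build_inverted_index(docs):
    # Group map output by word, standing in for what the system would do.
    grouped = defaultdict(list)
    for doc_id, text in docs:
        for word, i in map_index(doc_id, text):
            grouped[word].append(i)
    return {w: reduce_index(w, ids) for w, ids in grouped.items()}


docs = [(2, "the cat sat"), (1, "the dog sat on the cat")]
index = build_inverted_index(docs)
# index["the"] is [1, 2] even though "the" occurs twice in document 1.
```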

Parallelism
- This organization of the computation makes excellent use of whatever parallelism is available.
- The map function works on a single document, so we could have as many processes and processors as there are documents in the database.
- The reduce function works on a single word, so we could have as many processes and processors as there are words in the database.
- Of course, it is unlikely that we would use so many processors in practice.

Another Example: Word Count
Construct a word count. For each word w that appears at least once in our database of documents, output the pair (w, c), where c is the number of times w appears among all the documents.
The map function
- Input is a document.
- Goes through the document, and each time it encounters another word w, it emits the pair (w, 1).
- Intermediate result is a list of pairs (w_1, 1), (w_2, 1), ….
The reduce function
- Input is a pair (w, [1, 1, …, 1]), with a 1 for each occurrence of word w.
- Sums the 1's, producing the count.
- Output is word-count pairs (w, c).
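The word-count example translates almost line for line into code. This is a sketch only: `map_count`, `reduce_count`, and `word_count` are hypothetical names, words are assumed to be whitespace-separated, and the grouping step is done in memory rather than by a distributed system.

```python
from collections import defaultdict


def map_count(document):
    # Emit (w, 1) each time a word is encountered.
    for word in document.split():
        yield (word, 1)


def reduce_count(word, ones):
    # Sum the 1's to get the count c for word w.
    return (word, sum(ones))


def word_count(documents):
    # Build (w, [1, 1, ..., 1]) for each word, then reduce each list.
    grouped = defaultdict(list)
    for document in documents:
        for word, one in map_count(document):
            grouped[word].append(one)
    return dict(reduce_count(w, ones) for w, ones in grouped.items())


counts = word_count(["a b a", "b c"])
```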

What about Joins?
R(A, B) ⋈ S(B, C)
The map function
- Input is key-value pairs (X, t), where X is either R or S, and t is a tuple of the relation named by X.
- Output is a single pair (b, (R, a)) or (b, (S, c)), depending on X:
  - b is the B-value of t.
  - a is the A-value of t (if X = R).
  - c is the C-value of t (if X = S).
The reduce function
- Input is a pair (b, [(R, a), (S, c), …]).
- Extracts all the A-values associated with R and all the C-values associated with S.
- These are paired in all possible ways, with b in the middle, to form the tuples of the result.
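The join can be sketched the same way. The code below assumes tuples of R are (a, b) pairs and tuples of S are (b, c) pairs; `map_join`, `reduce_join`, and `join` are hypothetical names, and the grouping by b is done in memory for illustration.

```python
from collections import defaultdict


def map_join(relation, t):
    # Key every tuple by its B-value and tag it with its relation name.
    if relation == "R":
        a, b = t
        yield (b, ("R", a))
    elif relation == "S":
        b, c = t
        yield (b, ("S", c))
    else:
        raise ValueError(f"unknown relation: {relation!r}")


def reduce_join(b, tagged_values):
    # Separate the A-values from the C-values, then pair them in all ways.
    a_values = [v for tag, v in tagged_values if tag == "R"]
    c_values = [v for tag, v in tagged_values if tag == "S"]
    return [(a, b, c) for a in a_values for c in c_values]


def join(records):
    grouped = defaultdict(list)
    for relation, t in records:
        for b, tagged in map_join(relation, t):
            grouped[b].append(tagged)
    result = []
    for b, tagged_values in grouped.items():
        result.extend(reduce_join(b, tagged_values))
    return result


records = [
    ("R", (1, "x")), ("R", (2, "x")), ("R", (3, "z")),
    ("S", ("x", 10)), ("S", ("y", 20)),
]
joined = sorted(join(records))
```

B-values that appear in only one relation ("y" and "z" here) produce an empty pairing, so they drop out of the result, as a natural join requires.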

Reading
- Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters
- Hadoop (Apache): open-source implementation of MapReduce