LECTURE 2 Map-Reduce for large scale similarity computation.

…from last lecture How to convert entities into high-dimensional numerical vectors. How to compute similarity between two vectors. For example, if x and y are two vectors, then sim(x, y) = (x · y) / (||x|| ||y||).

…from last lecture Example: X = (1,2,3); Y = (3,2,1)
- ||X|| = √(1 + 4 + 9) = √14 ≈ 3.74
- ||Y|| = ||X||
- Sim(X,Y) = (1·3 + 2·2 + 3·1)/(√14 · √14) = 10/14 = 5/7
We also learnt that for large data sets computing pair-wise similarity can be very time consuming.
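The worked example above can be checked with a short Python sketch of cosine similarity:

```python
import math

def cosine_sim(x, y):
    """Cosine similarity: dot(x, y) / (||x|| * ||y||)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

X = (1, 2, 3)
Y = (3, 2, 1)
print(cosine_sim(X, Y))  # 10/14 = 5/7 ≈ 0.714
```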

Map-Reduce
- Map-Reduce has become a popular framework for speeding up computations like pair-wise similarity.
- Map-Reduce was popularized by Google and then Yahoo! (through the Hadoop open-source implementation).
- Map-Reduce is a programming model built on top of “cluster computing”.

Cluster Computing Put simple (commodity) machines together, each with their own CPU, RAM and disk, for parallel computing.
[Diagram: a rack of commodity nodes, each with its own CPU, RAM and disk, connected by a switch.]

Map-Reduce Map-Reduce consists of two distinct entities:
- Distributed File System (DFS)
- Library to implement Mapper and Reducer functions
A DFS seamlessly manages files on the “cluster computer.”
- A file is broken into “chunks” and these chunks are replicated across the nodes of a cluster.
- If a node which contains chunk A fails, the system will re-start the computation on a node which contains a copy of chunk A.

Distributed File System A DFS will “chunk” files, replicate the chunks across several nodes, and keep track of where each chunk lives. Only practical when data is mostly read-only (e.g., historical data; not for live data like an airline reservation system). Example mapping: chunk 1 of a file stored on nodes 3, 2 and 18; chunk 2 on nodes 2, 6 and 7.

Node failure When several nodes are in play, the chance that at least one node goes down at any time goes up significantly. Suppose there are n nodes and let p be the probability that a single node will fail:
- (1 - p): probability that a single node will not fail
- (1 - p)^n: probability that none of the nodes fail
- 1 - (1 - p)^n: probability that at least one fails

Node failure The probability of at least one node failing is: f = 1 - (1 - p)^n. When n = 1, f = p. Suppose p = 0.0001 but n = 10000; then f = 1 - (1 - 0.0001)^10000 ≈ 1 - e^(-1) ≈ 0.63, since (1 - p)^n ≈ e^(-np) for small p. This is one of the most important formulas to know (in general).
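The failure formula is easy to verify numerically; a minimal sketch:

```python
# Probability that at least one of n nodes fails: f = 1 - (1 - p)^n.
def prob_any_failure(p, n):
    return 1 - (1 - p) ** n

print(prob_any_failure(0.0001, 1))      # ≈ 0.0001 (n = 1 gives f = p)
print(prob_any_failure(0.0001, 10000))  # ≈ 0.632, close to 1 - 1/e
```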

Example: “Hello World” of MR

Docid | Content
1 | Silent mind, holy mind
2 | road kill in Java
3 | Java programming is fun
4 | My mind in Java
5 | Where the fun rolls
6 | Silent road to Cairns

Task: Produce an output which, for each word in the file, counts the number of times it appears in the file. Answer: (Java, 3); (Silent, 2); (mind, 3); …

Example The file is chunked across machines. For example:
- {doc1, doc2} → machine 1
- {doc3, doc4} → machine 2
- {doc5, doc6} → machine 3
Each chunk is also duplicated on other machines.

Example Now apply the MAP operation at each node: for every word, emit the pair (word, 1). Thus doc1 emits:
- (silent,1); (mind,1); (holy,1); (mind,1)
Similarly doc6 emits:
- (silent,1); (road,1); (to,1); (Cairns,1)

Example Note that in the first chunk, which contains (doc1, doc2), each doc emits (key, value) pairs. We can think of each computer node as emitting a list of (key, value) pairs. This list is then “grouped” by key so that the REDUCE function can be applied.

Example Note now that the (key, value) pairs have no connection with the docs:
- (silent,1), (mind,1), (holy,1), (mind,1), (road,1), (to,1), (Cairns,1), (Java,1), (programming,1), (is,1), (fun,1), …
Now we apply a hash function h: {a..z}* → {0,1} to each key:
- Basically two REDUCE nodes
- The (key, value) pairs effectively become (key, list) pairs

Example For example, suppose the hash function maps {to, Java, road} to one node. Then:
- (to,1) remains (to,1)
- (Java,1); (Java,1); (Java,1) → (Java, [1,1,1])
- (road,1); (road,1) → (road, [1,1])
Now the REDUCE function converts (Java, [1,1,1]) → (Java, 3), etc. Remember this is a very simple example… the challenge is to take complex tasks and express them as Map and Reduce!
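The whole map → group-by-key → reduce pipeline for the word-count example can be simulated on a single machine; a minimal Python sketch using the toy documents from the table:

```python
from collections import defaultdict

docs = {
    1: "silent mind holy mind",
    2: "road kill in java",
    3: "java programming is fun",
    4: "my mind in java",
    5: "where the fun rolls",
    6: "silent road to cairns",
}

# MAP: each document emits (word, 1) for every word it contains.
mapped = [(word, 1) for text in docs.values() for word in text.split()]

# GROUP BY KEY (the "shuffle"): collect all values for the same word.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# REDUCE: sum the list of 1s for each word.
counts = {word: sum(ones) for word, ones in grouped.items()}
print(counts["java"], counts["silent"], counts["mind"])  # 3 2 3
```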

Schema of Map-Reduce Tasks [MMDS]
chunks → Map Task → (key, value) pairs → Group by keys → [k, (v, u, w, x, z)] → Reduce Task → Output

The similarity join problem Last time we discussed computing the pair-wise similarity of all articles/documents in Wikipedia. As we discussed, it is a time-consuming problem: if N is the number of documents and d is the length of each vector, then the running time is proportional to O(N^2 d). How can this problem be attacked using the Map-Reduce framework?

Similarity Join Assume we are given two documents (vectors) d1 and d2. Then (ignoring the denominator) sim(d1, d2) is the dot product of their term-weight vectors. Example:
- d1 = {silent mind to holy mind}; d2 = {silent road to cairns}
- sim(d1, d2) = w(silent, d1)·w(silent, d2) + w(to, d1)·w(to, d2) = 1·1 + 1·1 = 2
Exploit the fact that a term (word) only contributes if it belongs to at least two documents.
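One way to realize this idea, in the spirit of reference [2] but sketched here as a single-machine simulation rather than the authors' exact implementation: the map phase emits (term, (docid, weight)) postings, and the reduce phase emits a partial product only for terms shared by at least two documents, summing the partials per document pair.

```python
from collections import defaultdict

docs = {
    "d1": "silent mind to holy mind",
    "d2": "silent road to cairns",
}

# MAP: for each document, emit (term, (doc_id, weight)); weight = term frequency.
postings = defaultdict(list)
for doc_id, text in docs.items():
    tf = defaultdict(int)
    for term in text.split():
        tf[term] += 1
    for term, weight in tf.items():
        postings[term].append((doc_id, weight))

# REDUCE: a term contributes only if it occurs in >= 2 documents;
# emit a partial product per document pair, then sum the partials.
sims = defaultdict(int)
for term, plist in postings.items():
    if len(plist) < 2:
        continue  # terms in a single document contribute nothing
    for i in range(len(plist)):
        for j in range(i + 1, len(plist)):
            (da, wa), (db, wb) = plist[i], plist[j]
            sims[tuple(sorted((da, db)))] += wa * wb

print(sims[("d1", "d2")])  # silent: 1*1, to: 1*1 -> 2
```

Because only shared terms generate work, documents with no words in common never produce a key-value pair, which is exactly the savings the slide points to.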

Similarity Example [2] Notice, it requires some ingenuity to come up with the right key-value pairs. This is the key to using map-reduce effectively.

Amazon Map Reduce For this class we have received an educational grant from Amazon to run exercises on their Map Reduce servers. Terminology:
- EC2 – the name of Amazon’s compute cluster service
- S3 – the name of their storage service
- Elastic Map Reduce – the name of Amazon’s Hadoop implementation of Map-Reduce
Let’s watch this video.

References
1. Mining of Massive Datasets (Leskovec, Rajaraman, Ullman)
2. Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective (Elsayed, Lin, Oard)