Word Co-occurrence: Chapter 3, Lin and Dyer

Review 1: MapReduce Algorithm Design. "Simplicity" is the theme: a fast, simple operation applied over a large set of data. Most web/mobile/internet application data yields to embarrassingly parallel processing. The general idea: you write the Mapper and Reducer (and optionally a Combiner and Partitioner); the execution framework takes care of the rest. Of course, you still configure the job: the input splits, the number of reducers, the input path, the output path, and so on (see the driver sketch below).
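For concreteness, here is a minimal Hadoop driver sketch showing those configuration knobs. It is not from the text: the class names (PairsDriver, and the PairsMapper/PairsReducer sketched later in these slides) and the reducer count are illustrative.

    // Hypothetical driver: the knobs a programmer configures.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class PairsDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word co-occurrence");
            job.setJarByClass(PairsDriver.class);
            job.setMapperClass(PairsMapper.class);     // sketched later in these slides
            job.setCombinerClass(PairsReducer.class);  // sums are associative and commutative
            job.setReducerClass(PairsReducer.class);
            job.setNumReduceTasks(4);                  // the "# of reducers" is up to you
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // input path
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // output path
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }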

Review 2: The programmer has NO control over:
-- where a mapper or reducer runs (which node in the cluster)
-- when a mapper or reducer begins or finishes
-- which input key-value pairs are processed by a specific mapper
-- which intermediate key-value pairs are processed by a specific reducer

Review 3: However, what control DOES a programmer have?
1. The ability to construct complex structures as keys and values, to store and communicate partial results.
2. The ability to execute user-specified code at the beginning of a map or reduce task, and termination code at the end.
3. The ability to preserve state in both mappers and reducers across multiple input/intermediate values (e.g., counters).
4. The ability to control the sort order of intermediate keys, and therefore the order in which they reach the reducers.
5. The ability to partition the key space across reducers.
(A sketch of controls 2, 3, and 5 follows.)
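As a concrete illustration (not from the text), a minimal Hadoop sketch of controls 2, 3, and 5: setup/cleanup hooks, state preserved across map() calls plus a user counter, and a custom partitioner. All class and counter names here are made up for the example.

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class ControlsSketch {

      public static class StatefulMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        private Map<String, Integer> buffer;  // state preserved across map() calls

        @Override
        protected void setup(Context context) {        // control 2: runs once per map task
          buffer = new HashMap<>();
        }

        @Override
        protected void map(LongWritable key, Text value, Context context) {
          for (String token : value.toString().split("\\s+")) {
            if (token.isEmpty()) continue;
            buffer.merge(token, 1, Integer::sum);      // control 3: accumulate partial counts
            context.getCounter("app", "tokens").increment(1);  // a user counter
          }
        }

        @Override
        protected void cleanup(Context context)
            throws IOException, InterruptedException { // control 2: runs once at task end
          for (Map.Entry<String, Integer> e : buffer.entrySet()) {
            context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
          }
        }
      }

      // Control 5: decide which reducer receives which keys.
      public static class FirstLetterPartitioner
          extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
          return (key.toString().charAt(0) & Integer.MAX_VALUE) % numPartitions;
        }
      }
    }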

Let's move on to co-occurrence (Section 3.2). Word counting is not the only example. Another example: the word co-occurrence matrix of a large corpus ("corpora" is the plural): an n × n matrix, where n is the number of unique words in the corpus. With i and j as row and column indices, cell M(i, j) holds the number of times word w(i) co-occurred with word w(j). For example, if w(i) is <Winnie> and w(j) is <South Africa>, the count over today's Twitter feed might be 1000; over the feed from a month ago it would have been 0, while <Winnie, Pooh> would have scored higher. Let's look at the algorithm. You need this for your Lab 2.

Word Co-occurrence – Pairs version

class Mapper
  method Map(docid a, doc d)
    for all term w ∈ doc d do
      for all term u ∈ Neighbors(w) do
        Emit(pair (w, u), count 1)        // emit a count of 1 for each co-occurrence

class Reducer
  method Reduce(pair p, counts [c1, c2, ...])
    s ← 0
    for all count c ∈ counts [c1, c2, ...] do
      s ← s + c                           // sum co-occurrence counts
    Emit(pair p, count s)
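A Hadoop sketch of the pairs version, assuming the class names PairsMapper/PairsReducer used in the driver above; Neighbors(w) is approximated here as the other terms on the same input line, a stand-in for whatever window Lab 2 specifies.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class PairsMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        String[] terms = value.toString().split("\\s+");
        for (int i = 0; i < terms.length; i++) {
          for (int j = 0; j < terms.length; j++) {
            if (i == j || terms[i].isEmpty() || terms[j].isEmpty()) continue;
            // the pair (w, u) is encoded as a single tab-separated text key
            context.write(new Text(terms[i] + "\t" + terms[j]), ONE);
          }
        }
      }
    }

    public class PairsReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
      @Override
      protected void reduce(Text pair, Iterable<IntWritable> counts, Context context)
          throws IOException, InterruptedException {
        int sum = 0;                       // s <- 0
        for (IntWritable c : counts) {
          sum += c.get();                  // s <- s + c
        }
        context.write(pair, new IntWritable(sum));
      }
    }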

Word Co-occurrence – Stripes version

class Mapper
  method Map(docid a, doc d)
    for all term w ∈ doc d do
      H ← new AssociativeArray
      for all term u ∈ Neighbors(w) do
        H{u} ← H{u} + 1                   // tally words co-occurring with w
      Emit(term w, stripe H)

class Reducer
  method Reduce(term w, stripes [H1, H2, H3, ...])
    Hf ← new AssociativeArray
    for all stripe H ∈ stripes [H1, H2, H3, ...] do
      Sum(Hf, H)                          // element-wise sum of many small stripes
    Emit(term w, stripe Hf)
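A Hadoop sketch of the stripes version, using MapWritable as the associative array H; the class names StripesMapper/StripesReducer are assumed, and Neighbors(w) is again approximated by the other terms on the line.

    import java.io.IOException;
    import java.util.Map;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.MapWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class StripesMapper
        extends Mapper<LongWritable, Text, Text, MapWritable> {
      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        String[] terms = value.toString().split("\\s+");
        for (int i = 0; i < terms.length; i++) {
          if (terms[i].isEmpty()) continue;
          MapWritable stripe = new MapWritable();    // H <- new AssociativeArray
          for (int j = 0; j < terms.length; j++) {
            if (i == j || terms[j].isEmpty()) continue;
            Text u = new Text(terms[j]);
            IntWritable n = (IntWritable) stripe.get(u);
            stripe.put(u, new IntWritable(n == null ? 1 : n.get() + 1)); // H{u}++
          }
          context.write(new Text(terms[i]), stripe); // Emit(term w, stripe H)
        }
      }
    }

    public class StripesReducer
        extends Reducer<Text, MapWritable, Text, MapWritable> {
      @Override
      protected void reduce(Text term, Iterable<MapWritable> stripes, Context context)
          throws IOException, InterruptedException {
        MapWritable merged = new MapWritable();      // Hf <- new AssociativeArray
        for (MapWritable stripe : stripes) {         // element-wise Sum(Hf, H)
          for (Map.Entry<Writable, Writable> e : stripe.entrySet()) {
            IntWritable prev = (IntWritable) merged.get(e.getKey());
            int add = ((IntWritable) e.getValue()).get();
            merged.put(e.getKey(), new IntWritable(prev == null ? add : prev.get() + add));
          }
        }
        context.write(term, merged);
      }
    }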

Run both the pairs and stripes versions on AWS and evaluate the two approaches.

Summary/Observations
1. Word co-occurrence is proposed as a solution for evaluating word association.
2. Two methods are proposed: pairs and stripes.
3. An MR implementation is designed (pseudocode).
4. It is implemented on MR on the Amazon cloud.
5. The two approaches are evaluated and their relative performance studied (R² of the fit, runtime, scale).