Word Co-occurrence: Chapter 3, Lin and Dyer

Review 1: MapReduce Algorithm Design. "Simplicity" is the theme: a fast, simple operation applied over a large set of data. Most web/mobile/internet application data yields to embarrassingly parallel processing. The general idea: you write the Mapper and Reducer (and optionally a Combiner and Partitioner); the execution framework takes care of the rest. Of course, you still configure the job: the input splits, the number of reducers, the input path, the output path, and so on (see the driver sketch below).
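For concreteness, here is a minimal Hadoop driver sketch showing those configuration knobs. It is not from the text: the class names (PairsDriver, and the PairsMapper/PairsReducer sketched later in these slides) and the reducer count are illustrative.

    // Hypothetical driver: the knobs a programmer configures.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class PairsDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word co-occurrence");
            job.setJarByClass(PairsDriver.class);
            job.setMapperClass(PairsMapper.class);     // sketched later in these slides
            job.setCombinerClass(PairsReducer.class);  // sums are associative and commutative
            job.setReducerClass(PairsReducer.class);
            job.setNumReduceTasks(4);                  // the "# of reducers" is up to you
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // input path
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // output path
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }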

Review 2: The programmer has NO control over:
-- where a mapper or reducer runs (which node in the cluster)
-- when a mapper or reducer begins or finishes
-- which input key-value pairs are processed by a specific mapper
-- which intermediate key-value pairs are processed by a specific reducer

Review 3: However, what control DOES a programmer have?
1. The ability to construct complex structures as keys and values, to store and communicate partial results.
2. The ability to execute user-specified code at the beginning of a map or reduce task, and termination code at the end.
3. The ability to preserve state in both mappers and reducers across multiple input/intermediate values (e.g., counters).
4. The ability to control the sort order of intermediate keys, and therefore the order in which they reach the reducers.
5. The ability to partition the key space across reducers.
(A sketch of controls 2, 3, and 5 follows.)
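As a concrete illustration (not from the text), a minimal Hadoop sketch of controls 2, 3, and 5: setup/cleanup hooks, state preserved across map() calls plus a user counter, and a custom partitioner. All class and counter names here are made up for the example.

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class ControlsSketch {

      public static class StatefulMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        private Map<String, Integer> buffer;  // state preserved across map() calls

        @Override
        protected void setup(Context context) {        // control 2: runs once per map task
          buffer = new HashMap<>();
        }

        @Override
        protected void map(LongWritable key, Text value, Context context) {
          for (String token : value.toString().split("\\s+")) {
            if (token.isEmpty()) continue;
            buffer.merge(token, 1, Integer::sum);      // control 3: accumulate partial counts
            context.getCounter("app", "tokens").increment(1);  // a user counter
          }
        }

        @Override
        protected void cleanup(Context context)
            throws IOException, InterruptedException { // control 2: runs once at task end
          for (Map.Entry<String, Integer> e : buffer.entrySet()) {
            context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
          }
        }
      }

      // Control 5: decide which reducer receives which keys.
      public static class FirstLetterPartitioner
          extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
          return (key.toString().charAt(0) & Integer.MAX_VALUE) % numPartitions;
        }
      }
    }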

Let's move on to co-occurrence (Section 3.2). Word counting is not the only example. Another example: the word co-occurrence matrix of a large corpus ("corpora" is the plural): an n × n matrix, where n is the number of unique words in the corpus. With i and j as row and column indices, cell M(i, j) holds the number of times word w(i) co-occurred with word w(j). For example, if w(i) is <Winnie> and w(j) is <South Africa>, the count over today's Twitter feed might be 1000; over the feed from a month ago it would have been 0, while <Winnie, Pooh> would have scored higher. Let's look at the algorithm. You need this for your Lab 2.

Word Co-occurrence – Pairs version

class Mapper
  method Map(docid a, doc d)
    for all term w ∈ doc d do
      for all term u ∈ Neighbors(w) do
        Emit(pair (w, u), count 1)        // emit a count of 1 for each co-occurrence

class Reducer
  method Reduce(pair p, counts [c1, c2, ...])
    s ← 0
    for all count c ∈ counts [c1, c2, ...] do
      s ← s + c                           // sum co-occurrence counts
    Emit(pair p, count s)
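A Hadoop sketch of the pairs version, assuming the class names PairsMapper/PairsReducer used in the driver above; Neighbors(w) is approximated here as the other terms on the same input line, a stand-in for whatever window Lab 2 specifies.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class PairsMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        String[] terms = value.toString().split("\\s+");
        for (int i = 0; i < terms.length; i++) {
          for (int j = 0; j < terms.length; j++) {
            if (i == j || terms[i].isEmpty() || terms[j].isEmpty()) continue;
            // the pair (w, u) is encoded as a single tab-separated text key
            context.write(new Text(terms[i] + "\t" + terms[j]), ONE);
          }
        }
      }
    }

    public class PairsReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
      @Override
      protected void reduce(Text pair, Iterable<IntWritable> counts, Context context)
          throws IOException, InterruptedException {
        int sum = 0;                       // s <- 0
        for (IntWritable c : counts) {
          sum += c.get();                  // s <- s + c
        }
        context.write(pair, new IntWritable(sum));
      }
    }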

Word Co-occurrence – Stripes version

class Mapper
  method Map(docid a, doc d)
    for all term w ∈ doc d do
      H ← new AssociativeArray
      for all term u ∈ Neighbors(w) do
        H{u} ← H{u} + 1                   // tally words co-occurring with w
      Emit(term w, stripe H)

class Reducer
  method Reduce(term w, stripes [H1, H2, H3, ...])
    Hf ← new AssociativeArray
    for all stripe H ∈ stripes [H1, H2, H3, ...] do
      Sum(Hf, H)                          // element-wise sum of many small stripes
    Emit(term w, stripe Hf)
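A Hadoop sketch of the stripes version, using MapWritable as the associative array H; the class names StripesMapper/StripesReducer are assumed, and Neighbors(w) is again approximated by the other terms on the line.

    import java.io.IOException;
    import java.util.Map;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.MapWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class StripesMapper
        extends Mapper<LongWritable, Text, Text, MapWritable> {
      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        String[] terms = value.toString().split("\\s+");
        for (int i = 0; i < terms.length; i++) {
          if (terms[i].isEmpty()) continue;
          MapWritable stripe = new MapWritable();    // H <- new AssociativeArray
          for (int j = 0; j < terms.length; j++) {
            if (i == j || terms[j].isEmpty()) continue;
            Text u = new Text(terms[j]);
            IntWritable n = (IntWritable) stripe.get(u);
            stripe.put(u, new IntWritable(n == null ? 1 : n.get() + 1)); // H{u}++
          }
          context.write(new Text(terms[i]), stripe); // Emit(term w, stripe H)
        }
      }
    }

    public class StripesReducer
        extends Reducer<Text, MapWritable, Text, MapWritable> {
      @Override
      protected void reduce(Text term, Iterable<MapWritable> stripes, Context context)
          throws IOException, InterruptedException {
        MapWritable merged = new MapWritable();      // Hf <- new AssociativeArray
        for (MapWritable stripe : stripes) {         // element-wise Sum(Hf, H)
          for (Map.Entry<Writable, Writable> e : stripe.entrySet()) {
            IntWritable prev = (IntWritable) merged.get(e.getKey());
            int add = ((IntWritable) e.getValue()).get();
            merged.put(e.getKey(), new IntWritable(prev == null ? add : prev.get() + add));
          }
        }
        context.write(term, merged);
      }
    }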

Run both the pairs and stripes versions on AWS and evaluate the two approaches.

Summary/Observations
1. Word co-occurrence is proposed as a solution for evaluating word association.
2. Two methods are proposed: pairs and stripes.
3. An MR implementation is designed (pseudocode).
4. It is implemented on MR on the Amazon cloud.
5. The two approaches are evaluated and their relative performance studied (R² of the fit, runtime, scale).