The first part of today’s presentation is largely stolen from Ricky Ho’s.

Presentation transcript:

The first part of today’s presentation is largely stolen from Ricky Ho’s blog posting (but any mistakes are assuredly mine!):
Designing algorithms for Map Reduce (Sunday, August 29, 2010)
/640 July 1, 2014 (Gregory Kesden, Presenting)

Map-Reduce Overview

Where’s The Hype?
Parallelism is (mostly) in the maps, which are independent (unsynchronized).
– O(n) becomes O(1) with parallelism
Merge sorts are merge sorts
– O(n log n)
Reduces are linear
– O(n)

“Embarrassingly Parallel”
Some things are essentially only Maps
– Identity reduce
– Massively parallel
– No bottleneck
Examples:
– Filter to retain or exclude only certain patterns
– Reduce data size by random sampling
– Convert format, e.g. bold to italics
– Flag bad data, e.g. negative, out of range, etc.
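A minimal sketch of two of the map-only examples above (filtering bad data and random sampling), simulated in memory rather than on a real cluster. The names `map_filter`, `map_sample`, and `run_map_only` are hypothetical, as is the choice of "negative" as the bad-data test.

```python
import random

def map_filter(record):
    """Map step: emit the record only if it passes the check (assumed: non-negative)."""
    if record >= 0:
        yield record

def map_sample(record, rate, rng):
    """Map step: keep each record independently with probability `rate`."""
    if rng.random() < rate:
        yield record

def run_map_only(records, mapper, *args):
    """Apply the mapper to every record independently; the reduce is the identity."""
    return [out for rec in records for out in mapper(rec, *args)]

kept = run_map_only([5, -1, 12, -7, 3], map_filter)
sampled = run_map_only(range(1000), map_sample, 0.1, random.Random(0))
```

Because no record depends on any other, the loop in `run_map_only` could be split across any number of workers without coordination.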

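[Diagram: File1, File2, File…, Filen in the DFS are split across Map tasks; each Map writes partitions (Part) that are fetched via RPC/Download by identity reduces (IdRed)]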

Sorting
Map-Reduce is, in many ways, a distributed sorting engine that can do some useful work along the way
– The merge sort and reduce perform the sort
We can leverage this if we actually want to sort
– Identity map
– Identity reduce
One possible trick: Partition by range
– Simplifies merge.
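The partition-by-range trick can be sketched as follows, assuming the range boundaries are known in advance (in practice they would typically be estimated by sampling the keys). `partition_by_range` and `range_partitioned_sort` are hypothetical names, and the per-partition `sorted` call stands in for the framework's merge sort.

```python
import bisect

def partition_by_range(keys, boundaries):
    """Route each key to a partition by range.

    `boundaries` must be sorted; partition i receives the keys k with
    boundaries[i-1] <= k < boundaries[i].
    """
    parts = [[] for _ in range(len(boundaries) + 1)]
    for k in keys:
        parts[bisect.bisect_right(boundaries, k)].append(k)
    return parts

def range_partitioned_sort(keys, boundaries):
    parts = partition_by_range(keys, boundaries)
    # Each reducer sorts only its own partition.
    sorted_parts = [sorted(p) for p in parts]
    # Partitions are already in key order, so the final merge is a concatenation.
    return [k for p in sorted_parts for k in p]

result = range_partitioned_sort([9, 2, 7, 4, 1, 8], [3, 6])
```

With hash partitioning, every reducer's output would interleave with every other's, and a final global merge would be needed; range partitioning avoids that step.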

Inverted Indexes
Common index from key to location, e.g. from a word to where it occurs
Map emits <key, location>
Reduce produces <key, list of locations>
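A small in-memory sketch of an inverted index built this way. It assumes that a "location" is a (filename, word offset) pair and that words are split on whitespace; both are illustrative choices, not taken from the slides, and the function names are hypothetical.

```python
from collections import defaultdict

def map_doc(filename, text):
    """Map: emit (word, location) for every word occurrence."""
    for offset, word in enumerate(text.split()):
        yield word.lower(), (filename, offset)

def reduce_word(word, locations):
    """Reduce: collect all locations for one word."""
    return word, sorted(locations)

def build_inverted_index(docs):
    groups = defaultdict(list)          # stands in for the shuffle/sort phase
    for filename, text in docs.items():
        for word, loc in map_doc(filename, text):
            groups[word].append(loc)
    return dict(reduce_word(w, locs) for w, locs in groups.items())

index = build_inverted_index({"a.txt": "map reduce map", "b.txt": "reduce sort"})
```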

Simple Statistics
Where operation is both commutative and associative
Map does local computation
Reduce forms global computation
Examples: Min, max, count, sum (What about average?)
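One way to answer the "what about average?" question: an average of per-split averages is generally wrong when splits differ in size, but sum and count are each commutative and associative, so the map can emit both and the average can be taken once at the end. The sketch below assumes every split is non-empty; the function names are hypothetical.

```python
from functools import reduce

def map_stats(chunk):
    """Map: local computation over one (non-empty) split."""
    return {"min": min(chunk), "max": max(chunk),
            "count": len(chunk), "sum": sum(chunk)}

def combine(a, b):
    """Reduce: merge two partial results; order and grouping do not matter."""
    return {"min": min(a["min"], b["min"]), "max": max(a["max"], b["max"]),
            "count": a["count"] + b["count"], "sum": a["sum"] + b["sum"]}

def global_stats(chunks):
    total = reduce(combine, map(map_stats, chunks))
    total["avg"] = total["sum"] / total["count"]   # derived only at the very end
    return total

chunks = [[1, 2, 3], [10], [4, 6]]
stats = global_stats(chunks)
naive_avg = sum(sum(c) / len(c) for c in chunks) / len(chunks)  # average of averages
```

Here `stats["avg"]` is 26/6, while `naive_avg` is 17/3, because the single-element split is over-weighted.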

Histograms
Divide into different intervals.
Maps compute the count per interval.
Reduce will compute the total count per interval.
Note power is in map: Ability to classify in parallel
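A brief sketch of the histogram pattern, assuming fixed-width intervals labeled by their lower bound (the slides do not say how intervals are chosen). `map_histogram` and `reduce_histogram` are hypothetical names.

```python
from collections import Counter

def map_histogram(chunk, width):
    """Map: classify each value into an interval and count locally."""
    return Counter((v // width) * width for v in chunk)

def reduce_histogram(partials):
    """Reduce: add up the per-interval counts from all maps."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)

chunks = [[1, 12, 15], [3, 27, 11]]
histogram = reduce_histogram(map_histogram(c, 10) for c in chunks)
```

The classification work happens entirely in `map_histogram`; the reduce only sums a handful of counters.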

SELECT
Filter: result = SELECT c1, c2, c3, c4 FROM source WHERE conditions
– Implement in Map
Aggregation: SELECT sum(c3) as s1, avg(c4) as s2... FROM result GROUP BY c1, c2 HAVING conditions
– Implement in Reduce
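The two queries above might be mapped onto one map/reduce pass roughly as below. The slides leave the WHERE and HAVING conditions unspecified, so `c3 > 0` and `s1 >= 10` are placeholder conditions invented for this sketch, and the function names are hypothetical.

```python
from collections import defaultdict

def map_row(row):
    """Map: apply WHERE, then emit the GROUP BY columns as the key."""
    if row["c3"] > 0:                                  # placeholder WHERE condition
        yield (row["c1"], row["c2"]), (row["c3"], row["c4"])

def reduce_group(key, values):
    """Reduce: aggregate one group, then apply HAVING."""
    s1 = sum(c3 for c3, _ in values)
    s2 = sum(c4 for _, c4 in values) / len(values)
    if s1 >= 10:                                       # placeholder HAVING condition
        yield key, {"s1": s1, "s2": s2}

def run_query(rows):
    groups = defaultdict(list)                         # stands in for the shuffle
    for row in rows:
        for key, value in map_row(row):
            groups[key].append(value)
    return dict(out for k, vals in groups.items() for out in reduce_group(k, vals))

rows = [
    {"c1": "a", "c2": "x", "c3": 6, "c4": 2.0},
    {"c1": "a", "c2": "x", "c3": 5, "c4": 4.0},
    {"c1": "a", "c2": "x", "c3": -3, "c4": 9.0},   # removed by WHERE
    {"c1": "b", "c2": "y", "c3": 4, "c4": 1.0},    # group removed by HAVING
]
answer = run_query(rows)
```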

Simple Join

Kesden’s Additional Slides
The next few slides are from me, rather than the cited source for the rest of the presentation.

Normalize Format For Join
Map records to common format
Identity reduce
Identity map
Reduce to form cross-product
Filter to get results
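One possible reading of these steps, collapsed into a single in-memory pass: each record is normalized to a (join key, tagged record) pair, the reduce forms the cross-product of the two sides for each key, and a final filter selects the results. The source tags `"L"` and `"R"`, the equality join on one field, and all function names are assumptions made for the sketch.

```python
from collections import defaultdict
from itertools import product

def normalize(source, record, key_field):
    """Map: put records from either input into a common (key, (tag, record)) format."""
    return record[key_field], (source, record)

def reduce_join(key, tagged):
    """Reduce: cross-product of left and right records sharing this key."""
    left = [r for s, r in tagged if s == "L"]
    right = [r for s, r in tagged if s == "R"]
    for l, r in product(left, right):
        yield {**l, **r}

def join(left_rows, right_rows, key_field, keep=lambda row: True):
    groups = defaultdict(list)                     # stands in for the shuffle
    for source, rows in (("L", left_rows), ("R", right_rows)):
        for rec in rows:
            key, value = normalize(source, rec, key_field)
            groups[key].append(value)
    joined = [row for k, tagged in groups.items() for row in reduce_join(k, tagged)]
    return [row for row in joined if keep(row)]    # final filter step

users = [{"id": 1, "name": "ann"}, {"id": 2, "name": "bob"}]
orders = [{"id": 1, "item": "pen"}, {"id": 1, "item": "ink"}, {"id": 3, "item": "cup"}]
all_rows = join(users, orders, "id")
ink_only = join(users, orders, "id", keep=lambda row: row["item"] == "ink")
```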

Shortest Path
Form Graph as Adjacency List
– Map: <node, neighbor> for each adjacency
– Reduce: <node, list of neighbors>
Work from each node in parallel
Map
– Node n as a key and (D, points-to) as its value
– D is the distance from the start
– Points-to is a list of Nodes reachable from n, initially direct adjacencies
– Emits all points reachable from n via each node in points-to
Reduce
– Emits one Node n for each key, the one with the shortest D
Repeat Map and Reduce phases until no shorter distances found (nothing learned, nothing can be learned, convergence)
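The iteration might look like the following simplified sketch. It assumes unit edge weights, keeps points-to fixed at the direct adjacencies instead of growing it, and requires every node (including nodes with no outgoing edges) to appear as a key in the adjacency list. The function names are hypothetical.

```python
import math
from collections import defaultdict

def map_node(node, dist, neighbors):
    """Map: pass the node through, and offer dist + 1 to each node it points to."""
    yield node, (dist, neighbors)              # keeps the adjacency list alive
    if dist < math.inf:
        for m in neighbors:
            yield m, (dist + 1, None)          # candidate distance via this node

def reduce_node(node, values):
    """Reduce: keep the shortest distance seen for this node."""
    best = min(d for d, _ in values)
    neighbors = next(adj for _, adj in values if adj is not None)
    return node, (best, neighbors)

def shortest_paths(graph, start):
    state = {n: (0 if n == start else math.inf, adj) for n, adj in graph.items()}
    while True:
        groups = defaultdict(list)             # stands in for the shuffle
        for n, (d, adj) in state.items():
            for key, value in map_node(n, d, adj):
                groups[key].append(value)
        new_state = dict(reduce_node(n, vals) for n, vals in groups.items())
        if new_state == state:                 # nothing learned: converged
            return {n: d for n, (d, _) in new_state.items()}
        state = new_state

graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"], "E": [], "F": ["A"]}
distances = shortest_paths(graph, "A")
```

Each pass extends the known frontier by one hop, so the number of Map and Reduce rounds grows with the longest shortest path from the start, plus one final round to detect convergence.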