Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy Lin and Michael Schatz University of Maryland Tuesday, June 29, 2010 This work is licensed.

Slides:



Advertisements
Similar presentations
Lecture 12: MapReduce: Simplified Data Processing on Large Clusters Xiaowei Yang (Duke University)
Advertisements

Overview of this week Debugging tips for ML algorithms
Graphs (Part II) Shannon Quinn (with thanks to William Cohen and Aapo Kyrola of CMU, and J. Leskovec, A. Rajaraman, and J. Ullman of Stanford University)
Piccolo: Building fast distributed programs with partitioned tables Russell Power Jinyang Li New York University.
Overview of MapReduce and Hadoop
LIBRA: Lightweight Data Skew Mitigation in MapReduce
Parallel Computing MapReduce Examples Parallel Efficiency Assignment
大规模数据处理 / 云计算 Lecture 4 – Mapreduce Algorithm Design 彭波 北京大学信息科学技术学院 4/24/2011 This work is licensed under a Creative.
Data-Intensive Computing with MapReduce Jimmy Lin University of Maryland Thursday, February 21, 2013 Session 5: Graph Processing This work is licensed.
大规模数据处理 / 云计算 Lecture 6 – Graph Algorithm 彭波 北京大学信息科学技术学院 4/26/2011 This work is licensed under a Creative Commons.
Thanks to Jimmy Lin slides
APACHE GIRAPH ON YARN Chuan Lei and Mohammad Islam.
Cloud Computing Lecture #3 More MapReduce Jimmy Lin The iSchool University of Maryland Wednesday, September 10, 2008 This work is licensed under a Creative.
Distributed Computations
MapReduce Algorithm Design Data-Intensive Information Processing Applications ― Session #3 Jimmy Lin University of Maryland Tuesday, February 9, 2010 This.
Jimmy Lin The iSchool University of Maryland Wednesday, April 15, 2009
Cloud Computing Lecture #5 Graph Algorithms with MapReduce Jimmy Lin The iSchool University of Maryland Wednesday, October 1, 2008 This work is licensed.
MapReduce Algorithms CSE 490H. Algorithms for MapReduce Sorting Searching TF-IDF BFS PageRank More advanced algorithms.
Homework 2 In the docs folder of your Berkeley DB, have a careful look at documentation on how to configure BDB in main memory. In the docs folder of your.
Parallel K-Means Clustering Based on MapReduce The Key Laboratory of Intelligent Information Processing, Chinese Academy of Sciences Weizhong Zhao, Huifang.
1 COMP4332 Web Data Thanks for Raymond Wong’s slides.
Cloud Computing Lecture #4 Graph Algorithms with MapReduce Jimmy Lin The iSchool University of Maryland Wednesday, February 6, 2008 This work is licensed.
Lecture 2 – MapReduce CPE 458 – Parallel Programming, Spring 2009 Except as otherwise noted, the content of this presentation is licensed under the Creative.
Hadoop: The Definitive Guide Chap. 8 MapReduce Features
Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy Lin and Michael Schatz University of Maryland MLG, January, 2014 Jaehwan Lee.
CS506/606: Problem Solving with Large Clusters Zak Shafran, Richard Sproat Spring 2011 Introduction URL:
Graph Algorithms Ch. 5 Lin and Dyer. Graphs Are everywhere Manifest in the flow of s Connections on social network Bus or flight routes Social graphs:
MapReduce.
1 The Map-Reduce Framework Compiled by Mark Silberstein, using slides from Dan Weld’s class at U. Washington, Yaniv Carmeli and some other.
MapReduce and Graph Data Chapter 5 Based on slides from Jimmy Lin’s lecture slides ( (licensed.
Ahsanul Haque *, Swarup Chandra *, Latifur Khan * and Michael Baron + * Department of Computer Science, University of Texas at Dallas + Department of Mathematical.
CSE 486/586 CSE 486/586 Distributed Systems Graph Processing Steve Ko Computer Sciences and Engineering University at Buffalo.
Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy Lin and Michael Schatz (Slides by Tyler S. Randolph)
The PageRank Citation Ranking: Bringing Order to the Web Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd Presented by Anca Leuca, Antonis Makropoulos.
MapReduce Kristof Bamps Wouter Deroey. Outline Problem overview MapReduce o overview o implementation o refinements o conclusion.
Graph Algorithms. Graph Algorithms: Topics  Introduction to graph algorithms and graph represent ations  Single Source Shortest Path (SSSP) problem.
CS 347Notes101 CS 347 Parallel and Distributed Data Processing Distributed Information Retrieval Hector Garcia-Molina Zoltan Gyongyi.
大规模数据处理 / 云计算 Lecture 5 – Mapreduce Algorithm Design 彭波 北京大学信息科学技术学院 7/19/2011 This work is licensed under a Creative.
MapReduce Algorithm Design Based on Jimmy Lin’s slides
MapReduce and Data Management Based on slides from Jimmy Lin’s lecture slides ( (licensed.
Distributed Computing Seminar Lecture 5: Graph Algorithms & PageRank Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet Summer 2007 Except.
NETE4631 Network Information Systems (NISs): Big Data and Scaling in the Cloud Suronapee, PhD 1.
大规模数据处理 / 云计算 Lecture 3 – Mapreduce Algorithm Design 闫宏飞 北京大学信息科学技术学院 7/16/2013 This work is licensed under a Creative.
Data Structures and Algorithms in Parallel Computing
Big Data Infrastructure Week 5: Analyzing Graphs (2/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United.
Brief Overview on Bigdata, Hadoop, MapReduce Jianer Chen CSCE-629, Fall 2015.
Big Data Infrastructure Week 5: Analyzing Graphs (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United.
Ljiljana Rajačić. Page Rank Web as a directed graph  Nodes: Web pages  Edges: Hyperlinks 2 / 25 Ljiljana Rajačić.
Big Data Infrastructure Week 2: MapReduce Algorithm Design (2/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0.
Big Data Infrastructure Week 3: From MapReduce to Spark (2/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0.
大规模数据处理 / 云计算 05 – Graph Algorithm 闫宏飞 北京大学信息科学技术学院 7/22/2014 Jimmy Lin University of Maryland SEWMGroup This work.
Big Data Infrastructure Week 11: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0.
Jimmy Lin and Michael Schatz Design Patterns for Efficient Graph Algorithms in MapReduce Michele Iovino Facoltà di Ingegneria dell’Informazione, Informatica.
EpiC: an Extensible and Scalable System for Processing Big Data Dawei Jiang, Gang Chen, Beng Chin Ooi, Kian Lee Tan, Sai Wu School of Computing, National.
Big Data Infrastructure
Big Data Infrastructure
Big Data Infrastructure
Map Reduce.
PREGEL Data Management in the Cloud
Introduction to MapReduce and Hadoop
Data-Intensive Distributed Computing
湖南大学-信息科学与工程学院-计算机与科学系
MapReduce and Data Management
Cloud Computing Lecture #4 Graph Algorithms with MapReduce
MapReduce Algorithm Design Adapted from Jimmy Lin’s slides.
Graph Algorithms Ch. 5 Lin and Dyer.
Chapter 2 Lin and Dyer & MapReduce Basics Chapter 2 Lin and Dyer &
Parallel Applications And Tools For Cloud Computing Environments
Graph Algorithms Adapted from UMD Jimmy Lin’s slides, which is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States.
Graph Algorithms Ch. 5 Lin and Dyer.
Presentation transcript:

Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy Lin and Michael Schatz University of Maryland Tuesday, June 29, 2010 This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States See for details

@lintool

Talk Outline Graph algorithms Graph algorithms in MapReduce Making it efficient Experimental results Punch line: per-iteration running time -69% on 1.4b link webgraph!

What’s a graph? G = (V, E), where V represents the set of vertices (nodes) E represents the set of edges (links) Both vertices and edges may contain additional information Graphs are everywhere: E.g., hyperlink structure of the web, interstate highway system, social networks, etc. Graph problems are everywhere: E.g., random walks, shortest paths, MST, max flow, bipartite matching, clustering, etc.

Source: Wikipedia (Königsberg)

Graph Representation G = (V, E) Typically represented as adjacency lists: Each node is associated with its neighbors (via outgoing edges) : 2, 4 2: 1, 3, 4 3: 1 4: 1, 3

“Message Passing” Graph Algorithms Large class of iterative algorithms on sparse, directed graphs At each iteration: Computations at each vertex Partial results (“messages”) passed (usually) along directed edges Computations at each vertex: messages aggregate to alter state Iterate until convergence

A Few Examples… Parallel breadth-first search (SSSP) Messages are distances from source Each node emits current distance + 1 Aggregation = MIN PageRank Messages are partial PageRank mass Each node evenly distributes mass to neighbors Aggregation = SUM DNA Sequence assembly Michael Schatz’s dissertation Boring! Still boring!

PageRank in a nutshell…. Random surfer model: User starts at a random Web page User randomly clicks on links, surfing from page to page With some probability, user randomly jumps around PageRank… Characterizes the amount of time spent on any given page Mathematically, a probability distribution over pages

Given page x with inlinks t 1 …t n, where C(t) is the out-degree of t  is probability of random jump N is the total number of nodes in the graph PageRank: Defined X X t1t1 t1t1 t2t2 t2t2 tntn tntn …

Sample PageRank Iteration (1) n 1 (0.2) n 4 (0.2) n 3 (0.2) n 5 (0.2) n 2 (0.2) n 1 (0.066) n 4 (0.3) n 3 (0.166) n 5 (0.3) n 2 (0.166) Iteration 1

Sample PageRank Iteration (2) n 1 (0.066) n 4 (0.3) n 3 (0.166) n 5 (0.3) n 2 (0.166) n 1 (0.1) n 4 (0.2) n 3 (0.183) n 5 (0.383) n 2 (0.133) Iteration 2

PageRank in MapReduce n 5 [n 1, n 2, n 3 ]n 1 [n 2, n 4 ]n 2 [n 3, n 5 ]n 3 [n 4 ]n 4 [n 5 ] n2n2 n4n4 n3n3 n5n5 n1n1 n2n2 n3n3 n4n4 n5n5 n2n2 n4n4 n3n3 n5n5 n1n1 n2n2 n3n3 n4n4 n5n5 n 5 [n 1, n 2, n 3 ]n 1 [n 2, n 4 ]n 2 [n 3, n 5 ]n 3 [n 4 ]n 4 [n 5 ] Map Reduce

PageRank Pseudo-Code

Why don’t distributed algorithms scale?

Source:

Three Design Patterns In-mapper combining: efficient local aggregation Smarter partitioning: create more opportunities Schimmy: avoid shuffling the graph

In-Mapper Combining Use combiners Perform local aggregation on map output Downside: intermediate data is still materialized Better: in-mapper combining Preserve state across multiple map calls, aggregate messages in buffer, emit buffer contents at end Downside: requires memory management configure map close buffer

Better Partitioning Default: hash partitioning Randomly assign nodes to partitions Observation: many graphs exhibit local structure E.g., communities in social networks Better partitioning creates more opportunities for local aggregation Unfortunately… partitioning is hard! Sometimes, chick-and-egg But in some domains (e.g., webgraphs) take advantage of cheap heuristics For webgraphs: range partition on domain-sorted URLs

Schimmy Design Pattern Basic implementation contains two dataflows: Messages (actual computations) Graph structure (“bookkeeping”) Schimmy: separate the two data flows, shuffle only the messages Basic idea: merge join between graph structure and messages ST both relations sorted by join key S1S1 T1T1 S2S2 T2T2 S3S3 T3T3 both relations consistently partitioned and sorted by join key

S1S1 T1T1 Do the Schimmy! Schimmy = reduce side parallel merge join between graph structure and messages Consistent partitioning between input and intermediate data Mappers emit only messages (actual computation) Reducers read graph structure directly from HDFS S2S2 T2T2 S3S3 T3T3 Reducer intermediate data (messages) intermediate data (messages) intermediate data (messages) from HDFS (graph structure) from HDFS (graph structure) from HDFS (graph structure)

Experiments Cluster setup: 10 workers, each 2 cores (3.2 GHz Xeon), 4GB RAM, 367 GB disk Hadoop on RHELS 5.3 Dataset: First English segment of ClueWeb09 collection 50.2m web pages (1.53 TB uncompressed, 247 GB compressed) Extracted webgraph: 1.4 billion links, 7.0 GB Dataset arranged in crawl order Setup: Measured per-iteration running time (5 iterations) 100 partitions

Results “Best Practices”

Results +18% 1.4b 674m

Results +18% -15% 1.4b 674m

Results +18% -15% -60% 1.4b 674m 86m

Results +18% -15% -60% -69% 1.4b 674m 86m

Take-Away Messages Lots of interesting graph problems! Social network analysis Bioinformatics Reducing intermediate data is key Local aggregation Better partitioning Less bookkeeping

@lintool Source code available in Cloud 9 Complete details in Jimmy Lin and Michael Schatz. Design Patterns for Efficient Graph Algorithms in MapReduce. Proceedings of the 2010 Workshop on Mining and Learning with Graphs Workshop (MLG-2010), July 2010, Washington, D.C.