Design Patterns for Efficient Graph Algorithms in MapReduce. Jimmy Lin and Michael Schatz, University of Maryland (MLG 2010). Presented 23 January 2014 by Jaehwan Lee.



2 / 23 Outline
 Introduction
 Graph Algorithms
– Graph
– PageRank using MapReduce
 Algorithm Optimizations
– In-Mapper Combining
– Range Partitioning
– Schimmy
 Experiments
 Results
 Conclusions

3 / 23 Introduction
 Graphs are everywhere:
– e.g., hyperlink structure of the web, social networks, etc.
 Graph problems are everywhere:
– e.g., random walks, shortest paths, clustering, etc.
Contents from Jimmy Lin and Michael Schatz at Hadoop Summit Research Track

4 / 23 Graph Representation
 G = (V, E)
 Typically represented as adjacency lists:
– Each node is associated with its neighbors (via outgoing edges)
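A minimal sketch of this representation in Python (the node IDs and edges are illustrative, not the graph from the slide's figure):

```python
# Adjacency-list representation: each node keyed to the neighbors reached
# by its outgoing edges (node IDs here are illustrative).
graph = {
    1: [2, 4],
    2: [3, 5],
    3: [4],
    4: [5],
    5: [1, 2, 3],
}

def serialize(node_id, neighbors):
    # One MapReduce record per node, e.g. "1\t2,4"
    return "%d\t%s" % (node_id, ",".join(str(n) for n in neighbors))
```

One line per node is what lets MapReduce treat the graph as an ordinary key-value dataset.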

5 / 23 PageRank
(PageRank update formula shown as a figure in the original slide)
t : timesteps
d : damping factor
Contents from Jimmy Lin and Michael Schatz at Hadoop Summit Research Track
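The update formula on this slide survives only as a figure in the transcript. The standard PageRank recurrence it refers to, written with the slide's symbols (t the timestep, d the damping factor) and the usual conventions (V the node set, L(n) the nodes linking to n, C(m) the out-degree of m), is reconstructed here as a sketch:

```latex
P_{t+1}(n) = \frac{1 - d}{|V|} + d \sum_{m \in L(n)} \frac{P_t(m)}{C(m)}
```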

6 / 23 MapReduce

7 / 23 PageRank using MapReduce [1/4]
(figure of the example graph; node labels not preserved)

8 / 23 PageRank using MapReduce [2/4], at Iteration 0, where id = 1

Key  Value
1    V(2), V(4)   (graph structure itself)
2    1/8          (message)
…
4    1/8          (message)

9 / 23 PageRank using MapReduce [3/4]

Reducer input for key 3:
Key  Value
3    V(1)
3    1/8
3    1/8

Reducer output:
Key  Value
3    V(1)

10 / 23 PageRank using MapReduce [4/4]
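The four slides above can be condensed into a runnable sketch of one PageRank iteration (simplified: the damping factor and dangling-node mass are omitted; the `Node` class and function names are illustrative, not the paper's code):

```python
class Node:
    # Graph node: current PageRank mass plus outgoing adjacency list.
    def __init__(self, rank, adjacency):
        self.rank = rank
        self.adjacency = adjacency

def map_fn(node_id, node):
    # Pass the graph structure itself through the shuffle...
    yield node_id, node
    # ...and emit one rank-mass "message" per outgoing edge.
    if node.adjacency:
        mass = node.rank / len(node.adjacency)
        for neighbor in node.adjacency:
            yield neighbor, mass

def reduce_fn(node_id, values):
    node, total = None, 0.0
    for value in values:
        if isinstance(value, Node):
            node = value        # recover the structure record
        else:
            total += value      # sum the incoming rank mass
    node.rank = total           # damping/random-jump term omitted for brevity
    yield node_id, node
```

Note that both the structure record and the messages travel through the same shuffle, which is exactly the overhead the schimmy pattern later removes.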

11 / 23 Algorithm Optimizations
 Three Design Patterns
– In-Mapper combining: efficient local aggregation
– Smarter Partitioning: create more opportunities for local aggregation
– Schimmy: avoid shuffling the graph

12 / 23 In-Mapper Combining [1/3]

13 / 23 In-Mapper Combining [2/3]
 Use Combiners
– Perform local aggregation on map output
– Downside: intermediate data is still materialized
 Better: in-mapper combining
– Preserve state across multiple map calls, aggregate messages in a buffer, emit buffer contents at the end
– Downside: requires memory management

14 / 23 In-Mapper Combining [3/3]
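The pattern on these slides can be sketched as a mapper object (Hadoop mappers are Java classes with map() and cleanup() hooks; this Python analogue is illustrative, with close() playing the role of cleanup()):

```python
class InMapperCombiningMapper:
    # In-mapper combining: partial sums are held in memory across map()
    # calls and emitted once when the task finishes, instead of
    # materializing one intermediate record per edge.
    def __init__(self):
        self.buffer = {}  # destination node -> accumulated rank mass

    def map(self, node_id, rank, adjacency):
        if not adjacency:
            return
        mass = rank / len(adjacency)
        for neighbor in adjacency:
            # Local aggregation: nothing is written to disk here.
            self.buffer[neighbor] = self.buffer.get(neighbor, 0.0) + mass

    def close(self):
        # Corresponds to Hadoop's cleanup(): flush the buffered aggregates.
        for key, total in self.buffer.items():
            yield key, total
```

The "memory management" downside on the slide is visible here: `self.buffer` grows with the number of distinct destination nodes seen by this map task, so a real implementation must cap or periodically flush it.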

15 / 23 Smarter Partitioning [1/2]

16 / 23 Smarter Partitioning [2/2]
 Default: hash partitioning
– Randomly assign nodes to partitions
 Observation: many graphs exhibit local structure
– e.g., communities in social networks
– Smarter partitioning creates more opportunities for local aggregation
 Unfortunately, partitioning is hard!
– Sometimes a chicken-and-egg problem
– But some domains (e.g., webgraphs) can take advantage of cheap heuristics
– For webgraphs: range partition on domain-sorted URLs
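The contrast can be sketched as two partitioners (assuming, as on the slide, node IDs assigned in domain-sorted URL order so that nearby IDs share a domain; all names are illustrative):

```python
def hash_partition(node_id, num_partitions):
    # Default: spreads nodes pseudo-randomly, ignoring graph locality.
    return hash(node_id) % num_partitions

def range_partition(node_id, max_id, num_partitions):
    # Contiguous ID ranges stay together; with IDs assigned in
    # domain-sorted URL order, pages of one domain share a partition,
    # so intra-domain links become partition-local messages.
    return min(node_id * num_partitions // max_id, num_partitions - 1)
```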

17 / 23 Schimmy Design Pattern [1/3]
 Basic implementation contains two dataflows:
– 1) Messages (actual computations)
– 2) Graph structure (“bookkeeping”)
 Schimmy: separate the two dataflows, shuffle only the messages
– Basic idea: merge join between graph structure and messages

18 / 23 Schimmy Design Pattern [2/3]
 Schimmy = reduce-side parallel merge join between graph structure and messages
– Consistent partitioning between input and intermediate data
– Mappers emit only messages (actual computation)
– Reducers read graph structure directly from HDFS

19 / 23 Schimmy Design Pattern [3/3] load graph structure from HDFS
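A sketch of the reduce-side merge join behind schimmy (assuming, as in the paper, that the reducer's shuffled message stream and its HDFS graph partition are sorted on the same keys; the record format and names are illustrative):

```python
def schimmy_reduce(message_stream, graph_partition):
    # message_stream: (node_id, aggregated_mass) pairs, sorted by node_id;
    #   only these messages were shuffled across the network.
    # graph_partition: (node_id, node_dict) pairs in the same sort order,
    #   read directly from HDFS by this reducer, never shuffled.
    graph_iter = iter(graph_partition)
    for msg_id, mass in message_stream:
        # Advance through the graph file until the message's node appears;
        # nodes that received no messages are copied through unchanged.
        for node_id, node in graph_iter:
            if node_id == msg_id:
                node['rank'] = mass   # apply the aggregated incoming mass
                yield node_id, node
                break
            yield node_id, node
    # Flush any nodes remaining after the last message.
    for node_id, node in graph_iter:
        yield node_id, node
```

Consistent partitioning is what makes the single forward pass possible: because input and intermediate data use the same partitioner, each reducer knows exactly which HDFS graph file to stream alongside its messages.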

20 / 23 Experiments
 Cluster setup:
– 10 workers, each with 2 cores (3.2 GHz Xeon), 4 GB RAM, 367 GB disk
– Hadoop on RHELS 5.3
 Dataset:
– First English segment of ClueWeb09 collection
– 50.2m web pages (1.53 TB uncompressed, 247 GB compressed)
– Extracted webgraph: 1.4 billion links, 7.0 GB
– Dataset arranged in crawl order
 Setup:
– Measured per-iteration running time (5 iterations)
– 100 partitions

21 / 23 Dataset : ClueWeb09

22 / 23 Results

23 / 23 Conclusion
 Lots of interesting graph problems
– Social network analysis
– Bioinformatics
 Reducing intermediate data is key
– Local aggregation
– Smarter partitioning
– Less bookkeeping