Big Data Infrastructure
Jimmy Lin, University of Maryland
Monday, April 13, 2015
Session 10: Beyond MapReduce — Graph Processing
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License. See the license for details.

Today’s Agenda
- What makes graph processing hard?
- Graph processing frameworks
- Twitter case study

What makes graph processing hard?
What makes MapReduce “work”? Lessons learned so far:
- Partition
- Replicate
- Reduce cross-partition communication

Characteristics of Graph Algorithms
What are some common features of graph algorithms?
- Graph traversals
- Computations involving vertices and their neighbors
- Passing information along graph edges
What’s the obvious idea? Keep “neighborhoods” together!

Simple Partitioning Techniques
- Hash partitioning
- Range partitioning on some underlying linearization
  - Web pages: lexicographic sort of domain-reversed URLs
  - Social networks: sort by demographic characteristics
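The sketch below (hypothetical helper names, not tied to any particular framework) illustrates both ideas for the web-graph case: hash partitioning spreads vertices evenly regardless of structure, while range partitioning sorts on a domain-reversed URL key so that pages from the same domain land in the same partition.

#include <functional>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Hash partitioning: spread vertices evenly, ignoring graph structure.
int HashPartition(const std::string& vertex_id, int num_partitions) {
  return static_cast<int>(std::hash<std::string>{}(vertex_id) % num_partitions);
}

// Key for range partitioning of web pages: reverse the dot-separated domain so
// that pages from the same domain (and related subdomains) sort together.
// "www.umd.edu/about" -> "edu.umd.www/about"
std::string DomainReversedKey(const std::string& url) {
  std::string::size_type slash = url.find('/');
  std::string domain = url.substr(0, slash);
  std::string rest = (slash == std::string::npos) ? "" : url.substr(slash);

  std::vector<std::string> labels;
  std::stringstream ss(domain);
  std::string label;
  while (std::getline(ss, label, '.')) labels.push_back(label);

  std::string reversed;
  for (auto it = labels.rbegin(); it != labels.rend(); ++it) {
    if (!reversed.empty()) reversed += '.';
    reversed += *it;
  }
  return reversed + rest;
}

int main() {
  std::cout << HashPartition("www.umd.edu/about", 16) << "\n";  // some bucket in [0, 16)
  std::cout << DomainReversedKey("www.umd.edu/about") << "\n";  // edu.umd.www/about
}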

Country Structure in Facebook
Ugander et al. (2011) The Anatomy of the Facebook Social Graph.
Analysis of 721 million active users (May 2011): 54 countries with >1m active users and >50% penetration.

How much difference does it make?
(Chart: +18%, -15%, -60%, -69%; 1.4b, 674m, 86m)
Lin and Schatz. (2010) Design Patterns for Efficient Graph Algorithms in MapReduce. PageRank over webgraph (40m vertices, 1.4b edges)

Aside: Partitioning Geo-data

Geo-data = regular graph

Space-filling curves: Z-Order Curves
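A Z-order (Morton) curve linearizes 2-D coordinates by interleaving the bits of x and y, so points that are close in space tend to receive nearby keys; range-partitioning on the key then keeps spatial neighborhoods together. A minimal sketch, assuming 16-bit coordinates:

#include <cstdint>
#include <iostream>

// Spread the low 16 bits of v so that a zero bit sits between consecutive bits.
uint32_t Spread16(uint32_t v) {
  v &= 0xFFFF;
  v = (v | (v << 8)) & 0x00FF00FF;
  v = (v | (v << 4)) & 0x0F0F0F0F;
  v = (v | (v << 2)) & 0x33333333;
  v = (v | (v << 1)) & 0x55555555;
  return v;
}

// Morton (Z-order) key: interleave the bits of x and y.
uint32_t ZOrderKey(uint16_t x, uint16_t y) {
  return Spread16(x) | (Spread16(y) << 1);
}

int main() {
  // Nearby points tend to get nearby keys; partition the data by key ranges.
  std::cout << ZOrderKey(3, 5) << " " << ZOrderKey(4, 5) << "\n";  // 39 50
}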

Space-filling curves: Hilbert Curves

General-Purpose Graph Partitioning Karypis and Kumar. (1998) A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs.

Graph Coarsening Karypis and Kumar. (1998) A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs.

Partition

Partition + Replicate What’s the issue? Solutions?

Neighborhood Replication Ugander et al. (2011) The Anatomy of the Facebook Social Graph. What’s the cost of replicating n-hop neighborhoods? What’s the more general challenge?

What makes graph processing hard?
It’s tough to apply our “usual tricks”:
- Partition
- Replicate
- Reduce cross-partition communication

Graph Processing Frameworks Source: Wikipedia (Waste container)

Pregel: Computational Model
Based on Bulk Synchronous Parallel (BSP):
- Computational units encoded in a directed graph
- Computation proceeds in a series of supersteps
- Message passing architecture
Each vertex, at each superstep:
- Receives messages directed at it from the previous superstep
- Executes a user-defined function (modifying state)
- Emits messages to other vertices (for the next superstep)
Termination:
- A vertex can choose to deactivate itself
- It is “woken up” if new messages are received
- Computation halts when all vertices are inactive
Source: Malewicz et al. (2010) Pregel: A System for Large-Scale Graph Processing. SIGMOD.

Pregel (Figure: computation proceeding through supersteps t, t+1, t+2)
Source: Malewicz et al. (2010) Pregel: A System for Large-Scale Graph Processing. SIGMOD.

Pregel: Implementation
Master-worker architecture:
- Vertices are hash partitioned (by default) and assigned to workers
- Everything happens in memory
Processing cycle:
- Master tells all workers to advance a single superstep
- Each worker delivers messages from the previous superstep and executes vertex computation
- Messages are sent asynchronously (in batches)
- Each worker notifies the master of its number of active vertices
Fault tolerance:
- Checkpointing
- Heartbeat/revert
Source: Malewicz et al. (2010) Pregel: A System for Large-Scale Graph Processing. SIGMOD.
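The processing cycle above amounts to a simple coordination loop. Here is a minimal single-process sketch, with a hypothetical Worker interface standing in for the real distributed implementation:

#include <iostream>
#include <vector>

// Hypothetical Worker interface, standing in for a Pregel worker process.
struct Worker {
  void DeliverMessages() { /* hand last superstep's inbox to the vertices */ }
  void ComputeActiveVertices() {
    // Real workers call the user-defined Compute() on each active vertex and
    // track which vertices vote to halt; here we just count down a placeholder.
    if (active_ > 0) --active_;
  }
  void FlushOutgoingMessages() { /* messages go out asynchronously, in batches */ }
  long NumActiveVertices() const { return active_; }
  long active_ = 3;  // placeholder number of active vertices on this worker
};

void RunJob(std::vector<Worker>& workers) {
  int superstep = 0;
  long active = 1;
  while (active > 0) {
    active = 0;
    // The master tells every worker to advance a single superstep.
    for (auto& w : workers) {
      w.DeliverMessages();
      w.ComputeActiveVertices();
      w.FlushOutgoingMessages();
      active += w.NumActiveVertices();  // each worker reports back to the master
    }
    ++superstep;
    // (Periodically: checkpoint worker state; on failure, revert to a checkpoint.)
  }
  std::cout << "halted after " << superstep << " supersteps\n";
}

int main() {
  std::vector<Worker> workers(2);
  RunJob(workers);
}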

Pregel: PageRank

class PageRankVertex : public Vertex<double, void, double> {
 public:
  virtual void Compute(MessageIterator* msgs) {
    if (superstep() >= 1) {
      double sum = 0;
      for (; !msgs->Done(); msgs->Next())
        sum += msgs->Value();
      *MutableValue() = 0.15 / NumVertices() + 0.85 * sum;
    }
    if (superstep() < 30) {
      const int64 n = GetOutEdgeIterator().size();
      SendMessageToAllNeighbors(GetValue() / n);
    } else {
      VoteToHalt();
    }
  }
};

Source: Malewicz et al. (2010) Pregel: A System for Large-Scale Graph Processing. SIGMOD.
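Each vertex stores its current PageRank value. In every superstep after the first, it sums the contributions received from its in-neighbors, applies the damping factor (0.15 teleport, 0.85 link-following), and then sends its value divided by its out-degree to all out-neighbors. Note that this example simply runs a fixed 30 supersteps before voting to halt rather than checking an explicit convergence criterion.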

Pregel: SSSP

class ShortestPathVertex : public Vertex<int, int, int> {
  void Compute(MessageIterator* msgs) {
    int mindist = IsSource(vertex_id()) ? 0 : INF;
    for (; !msgs->Done(); msgs->Next())
      mindist = min(mindist, msgs->Value());
    if (mindist < GetValue()) {
      *MutableValue() = mindist;
      OutEdgeIterator iter = GetOutEdgeIterator();
      for (; !iter.Done(); iter.Next())
        SendMessageTo(iter.Target(), mindist + iter.GetValue());
    }
    VoteToHalt();
  }
};

Source: Malewicz et al. (2010) Pregel: A System for Large-Scale Graph Processing. SIGMOD.

Pregel: Combiners

class MinIntCombiner : public Combiner<int> {
  virtual void Combine(MessageIterator* msgs) {
    int mindist = INF;
    for (; !msgs->Done(); msgs->Next())
      mindist = min(mindist, msgs->Value());
    Output("combined_source", mindist);
  }
};

Source: Malewicz et al. (2010) Pregel: A System for Large-Scale Graph Processing. SIGMOD.
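Since the shortest-path computation only ever uses the minimum of the incoming distances at a vertex, this combiner can pre-aggregate messages bound for the same destination on the sending worker, substantially reducing the number of messages that cross the network.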

Giraph Architecture
Master – application coordinator:
- Synchronizes supersteps
- Assigns partitions to workers before each superstep begins
Workers – computation and messaging:
- Handle I/O: reading and writing the graph
- Perform computation and messaging for their assigned partitions
ZooKeeper:
- Maintains global application state
Giraph slides borrowed from Joffe (2013)

Giraph Dataflow (Figure):
(1) Loading the graph – the master and workers (Worker 0, Worker 1) read input splits (Split 0–4) through the input format and load/send graph partitions.
(2) Compute/Iterate – workers hold the in-memory graph as partitions (Part 0–3), compute and send messages in each superstep, then send stats to the master and iterate.
(3) Storing the graph – workers write their partitions (Part 0–3) through the output format.

Giraph Lifecycle (Figure):
Superstep loop: Input → Compute Superstep → “All vertices halted?” – if no, run another superstep; if yes, check “Master halted?” and, once halted, write Output.
Vertex Lifecycle: a vertex is either Active or Inactive – it becomes Inactive when it votes to halt and Active again when it receives a message.
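The vertex lifecycle is just a two-state machine; a minimal sketch of the bookkeeping (hypothetical names, not Giraph’s actual API):

#include <iostream>

// Sketch of the Pregel/Giraph vertex lifecycle: Active <-> Inactive.
class VertexState {
 public:
  bool IsActive() const { return active_; }

  // Called by the vertex itself, typically at the end of Compute().
  void VoteToHalt() { active_ = false; }

  // Called by the framework when a message arrives for this vertex.
  void OnMessageReceived() { active_ = true; }  // an inactive vertex is woken up

 private:
  bool active_ = true;  // every vertex starts the computation active
};

int main() {
  VertexState v;
  v.VoteToHalt();         // the vertex goes inactive
  v.OnMessageReceived();  // a new message reactivates it for the next superstep
  std::cout << (v.IsActive() ? "active" : "inactive") << "\n";  // prints "active"
}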

Giraph Example

Execution Trace (Figure: timeline of execution across Processor 1 and Processor 2)

GraphX: Motivation

GraphX = Spark for Graphs
- Integration of record-oriented and graph-oriented processing
- Extends RDDs to Resilient Distributed Property Graphs
Property graphs:
- Present different views of the graph (vertices, edges, triplets)
- Support map-like operations
- Support distributed Pregel-like aggregations

Property Graph: Example

Underneath the Covers

Today’s Agenda
- What makes graph processing hard?
- Graph processing frameworks
- Twitter case study

Source: Wikipedia (Japanese rock garden) Questions?