GraphX: Graph Analytics on Spark


GraphX: Graph Analytics on Spark
Joseph Gonzalez, Reynold Xin, Ion Stoica, Michael Franklin
Developed at the UC Berkeley AMPLab
AMPCamp: August 29, 2013

Graphs are Essential to Data Mining and Machine Learning: identify influential people and information; find communities; understand people’s shared interests; model complex data dependencies.

Predicting Political Bias: Conditional Random Field, Belief Propagation. [Figure: graph of posts and unlabeled (?) nodes, with some nodes labeled Liberal or Conservative.]

Triangle Counting: count the triangles passing through each vertex. This measures the “cohesiveness” of the local community: fewer triangles indicate a weaker community, more triangles a stronger community. [Figure: example graph on vertices 1, 2, 3, 4.]
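The per-vertex count described on this slide can be sketched in plain Scala on a small in-memory graph. This is only an illustration under stated assumptions: the object name TriangleCount, the method perVertex, and the sample edges are hypothetical, edges are treated as undirected, and nothing here reflects how GraphLab or GraphX implement the computation.

```scala
object TriangleCount {
  /** Counts, for every vertex, the triangles that pass through it.
    * Edges are treated as undirected; self-loops are ignored. */
  def perVertex(edges: Seq[(Int, Int)]): Map[Int, Int] = {
    // Undirected adjacency map: vertex -> set of neighbors.
    val adjacency: Map[Int, Set[Int]] = edges
      .filter { case (a, b) => a != b }
      .flatMap { case (a, b) => Seq(a -> b, b -> a) }
      .groupBy(_._1)
      .map { case (v, pairs) => v -> pairs.map(_._2).toSet }

    adjacency.map { case (v, nbrs) =>
      // A triangle through v is a pair of neighbors (a, b), a < b, that are connected.
      val count = nbrs.toSeq.map { a =>
        nbrs.count(b => a < b && adjacency(a).contains(b))
      }.sum
      v -> count
    }
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical sample: 1, 2, 3 form a triangle; 4 hangs off vertex 3.
    val edges = Seq((1, 2), (2, 3), (1, 3), (3, 4))
    println(perVertex(edges))
  }
}
```

In the sample graph, vertices 1, 2, and 3 each sit on one triangle while vertex 4 sits on none, which matches the "weaker community" reading for vertex 4.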

Collaborative Filtering [Figure: bipartite graph of Users and Items connected by Ratings.]

Many More Graph Algorithms: Collaborative Filtering, CoEM, Alternating Least Squares, Graph Analytics, Stochastic Gradient Descent, PageRank, Single Source Shortest Path, Tensor Factorization, SVD, Triangle-Counting, Structured Prediction, Graph Coloring, Loopy Belief Propagation, K-core Decomposition, Max-Product Linear Programs, Personalized PageRank, Classification, Gibbs Sampling, Neural Networks, Semi-supervised ML, Lasso, Graph SSL, …

Structure of Computation: Data-Parallel (a Table of Rows producing a Result) versus Graph-Parallel (a Dependency Graph, e.g., Pregel).

The Graph-Parallel Abstraction: A user-defined vertex program runs on each vertex. The graph constrains interaction along edges, either using messages (e.g., Pregel [PODC’09, SIGMOD’10]) or through shared state (e.g., GraphLab [UAI’10, VLDB’12]). Parallelism: run multiple vertex programs simultaneously.
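A message-passing vertex program can be sketched in plain Scala as a loop of supersteps. The example below is a hedged illustration, not Pregel's or GraphLab's API: the names VertexProgramSketch and minLabel are hypothetical, and the chosen vertex program (each vertex adopts the smallest label it has seen, which labels connected components) is just one convenient example.

```scala
object VertexProgramSketch {
  /** Each vertex starts with its own id as label and repeatedly adopts the
    * smallest label among its own and those received from neighbors. */
  def minLabel(vertices: Set[Int], edges: Seq[(Int, Int)]): Map[Int, Int] = {
    // Messages may only travel along edges (in both directions here).
    val neighbors: Map[Int, Set[Int]] = edges
      .flatMap { case (a, b) => Seq(a -> b, b -> a) }
      .groupBy(_._1)
      .map { case (v, ps) => v -> ps.map(_._2).toSet }

    var labels: Map[Int, Int] = vertices.map(v => v -> v).toMap
    var changed = true
    while (changed) {
      // Superstep, part 1: every vertex sends its label to each neighbor.
      val inbox: Map[Int, Seq[Int]] = labels.toSeq
        .flatMap { case (v, label) =>
          neighbors.getOrElse(v, Set.empty[Int]).toSeq.map(n => n -> label)
        }
        .groupBy(_._1)
        .map { case (v, msgs) => v -> msgs.map(_._2) }
      // Superstep, part 2: each vertex program runs on its own state and inbox only.
      val next = labels.map { case (v, label) =>
        v -> (label +: inbox.getOrElse(v, Seq.empty[Int])).min
      }
      changed = next != labels
      labels = next
    }
    labels
  }

  def main(args: Array[String]): Unit =
    println(minLabel(Set(1, 2, 3, 4, 5), Seq((1, 2), (2, 3), (4, 5))))
}
```

Because every vertex update in part 2 reads only that vertex's own label and inbox, the updates within one superstep are independent and could run in parallel.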

By exploiting graph structure, graph-parallel systems can be orders of magnitude faster.

Triangle Counting on Twitter: 40M users, 1.4 billion links; counted 34.8 billion triangles. Hadoop [WWW’11]: 1536 machines, 423 minutes. GraphLab: 64 machines, 15 seconds (1000x faster). Reference: S. Suri and S. Vassilvitskii, “Counting triangles and the curse of the last reducer,” WWW’11.

Specialized Graph Systems Pregel

Specialized Graph Systems: APIs to capture complex graph dependencies; exploit graph structure to reduce communication and computation.

Why GraphX?

The Bigger Picture: Time Spent in Data Pipeline. [Figure: a pipeline with the stages Graph Creation, Graph Algorithms, and Post Proc., run on Hadoop and GraphLab.]

Vertices

Edges

Limitations of Specialized Graph-Parallel Systems: no support for construction and post-processing; not interactive; require maintaining multiple platforms. Spark excels at these!

GraphX Unifies Data-Parallel and Graph-Parallel Systems. Spark Table API: RDDs, fault tolerance, and task scheduling. GraphLab Graph API: graph representation and execution. Graph construction, computation, and post-processing: one system for the entire graph pipeline.

Enable Joining Tables and Graphs. [Figure: tables (User Data, Product Ratings) and graphs (Friend Graph, Product Rec. Graph) connected by ETL, Join, and Inf. Prod. Rec. steps.]

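The GraphX Resilient Distributed Graph. Vertex table with columns Id and Attribute (V): ids Rxin, Jegonzal, Franklin, Istoica; attributes (Stu., Berk.), (PstDoc, Berk.), (Prof., Berk.). Edge table with columns SrcId, DstId, and Attribute (E): ids rxin, jegonzal, franklin, istoica; attributes Friend, Advisor, Coworker, PI. [Figure: graph on vertices R, J, F, I.]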
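The idea of storing a graph as a vertex table plus an edge table, and deriving a joined view from them, can be sketched with ordinary Scala collections standing in for RDDs. The names PropertyGraphTables and triplets are hypothetical, and the edge rows in the sample are made up for illustration because the slide's actual source and destination pairings are not recoverable from the transcript.

```scala
object PropertyGraphTables {
  type Id = String

  /** Joins every edge row with the attribute rows of its two endpoints.
    * Edges that reference an unknown vertex are dropped. */
  def triplets[V, E](vertices: Seq[(Id, V)],
                     edges: Seq[(Id, Id, E)]): Seq[((Id, V), (Id, V), E)] = {
    val attrOf: Map[Id, V] = vertices.toMap
    edges.flatMap { case (src, dst, e) =>
      for {
        sv <- attrOf.get(src)
        dv <- attrOf.get(dst)
      } yield ((src, sv), (dst, dv), e)
    }
  }

  def main(args: Array[String]): Unit = {
    val vertices = Seq("rxin" -> "Stu.", "jegonzal" -> "PstDoc", "franklin" -> "Prof.")
    // Hypothetical edge rows, for illustration only.
    val edges = Seq(("rxin", "jegonzal", "Friend"), ("franklin", "rxin", "Advisor"))
    triplets(vertices, edges).foreach(println)
  }
}
```

Each output row has the shape ((srcId, srcAttr), (dstId, dstAttr), edgeAttr), so table-style operations such as filters and maps can see both endpoints of an edge at once.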

GraphX API

    class Graph[V, E] {
      // Table Views -----------------
      def vertices: RDD[(Id, V)]
      def edges: RDD[(Id, Id, E)]
      def triplets: RDD[((Id, V), (Id, V), E)]

      // Transformations ------------------------------
      def reverse: Graph[V, E]
      def filterV(p: (Id, V) => Boolean): Graph[V, E]
      def filterE(p: Edge[V, E] => Boolean): Graph[V, E]
      def mapV[T](m: (Id, V) => T): Graph[T, E]
      def mapE[T](m: Edge[V, E] => T): Graph[V, T]

      // Joins ----------------------------------------
      def joinV[T](tbl: RDD[(Id, T)]): Graph[(V, Opt[T]), E]
      def joinE[T](tbl: RDD[(Id, Id, T)]): Graph[V, (E, Opt[T])]

      // Computation ----------------------------------
      def aggregateNeighbors[T](mapF: (Edge[V, E]) => T,
                                reduceF: (T, T) => T,
                                direction: EdgeDir): Graph[T, E]
    }

Aggregate Neighbors: map-reduce for each vertex, using mapF( ) and reduceF( , ). [Figure: vertex B and its neighbors.]

Example: Oldest Follower
What is the age of the oldest follower for each user?

    val followerAge = graph.aggregateNeighbors(
      e => e.src.age,   // mapF
      max(_, _),        // reduceF
      InEdges).vertices

[Figure: follower graph on vertices A to F with ages 23, 42, 30, 19, 75, and 16.]
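The same query can be tried without Spark using a local stand-in for the aggregation step. This is a minimal sketch under stated assumptions: AggregateNeighborsSketch, aggregateInEdges, and the Edge case class are hypothetical names, the result is a plain Map rather than a Graph, only the in-edge direction is covered, and the sample follower edges are invented because the slide's edges are not recoverable.

```scala
object AggregateNeighborsSketch {
  /** An edge together with the attributes of both endpoints. */
  final case class Edge[V, E](srcId: String, src: V, dstId: String, dst: V, attr: E)

  /** Map-reduce over the in-edges of each vertex: mapF turns every edge into a
    * value, reduceF combines the values arriving at the same destination.
    * Vertices without in-edges do not appear in the result. */
  def aggregateInEdges[V, E, T](vertices: Map[String, V], edges: Seq[(String, String, E)])
                               (mapF: Edge[V, E] => T)
                               (reduceF: (T, T) => T): Map[String, T] =
    edges
      .map { case (s, d, e) => d -> mapF(Edge(s, vertices(s), d, vertices(d), e)) }
      .groupBy(_._1)
      .map { case (d, vals) => d -> vals.map(_._2).reduce(reduceF) }

  def main(args: Array[String]): Unit = {
    val ages = Map("A" -> 23, "B" -> 42, "C" -> 30, "D" -> 19)
    // Hypothetical "follows" edges: (follower, followed, edge attribute).
    val follows = Seq(("A", "B", ()), ("C", "B", ()), ("D", "A", ()))
    val oldestFollower = aggregateInEdges(ages, follows)(e => e.src)((a, b) => math.max(a, b))
    println(oldestFollower)
  }
}
```

Swapping the two functions changes the query: mapping every edge to 1 and reducing with addition would give each vertex's in-degree instead.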

We can express both Pregel and GraphLab using aggregateNeighbors in 40 lines of code!
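To give a feel for why so little code is needed, an iterative algorithm can be written as a loop around one neighbor aggregation per round. The sketch below does this for PageRank in plain Scala. It is an assumption-laden illustration, not the GraphX implementation: PageRankSketch and its method are hypothetical names, the update rule 0.15 + 0.85 * sum is one common unnormalized variant, and every edge endpoint is assumed to be listed in the vertex set.

```scala
object PageRankSketch {
  /** Runs numIter rounds; each round is one map-reduce over the in-edges of every vertex. */
  def pageRank(vertices: Set[Int], edges: Seq[(Int, Int)], numIter: Int): Map[Int, Double] = {
    val outDegree: Map[Int, Int] = edges.groupBy(_._1).map { case (v, es) => v -> es.size }
    var ranks: Map[Int, Double] = vertices.map(v => v -> 1.0).toMap
    for (_ <- 1 to numIter) {
      // mapF: each edge carries rank(src) / outDegree(src) to its destination.
      // reduceF: contributions arriving at the same vertex are summed.
      val incoming: Map[Int, Double] = edges
        .map { case (s, d) => d -> (ranks(s) / outDegree(s)) }
        .groupBy(_._1)
        .map { case (d, cs) => d -> cs.map(_._2).sum }
      // Vertex update with an assumed damping factor of 0.85.
      ranks = ranks.map { case (v, _) => v -> (0.15 + 0.85 * incoming.getOrElse(v, 0.0)) }
    }
    ranks
  }

  def main(args: Array[String]): Unit =
    println(pageRank(Set(1, 2, 3), Seq((1, 3), (2, 3)), numIter = 10))
}
```

The loop body has the same shape as a Pregel superstep: gather values along edges, combine them per vertex, then update vertex state.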

Performance Optimizations: replicate and co-partition vertices with edges; GraphLab (PowerGraph) style vertex-cut partitioning; minimize communication by avoiding edge data movement in JOINs; in-memory hash index for fast joins.
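Vertex-cut partitioning can be illustrated with a small sketch: every edge is placed on exactly one partition, and a vertex is replicated onto each partition that holds one of its edges, so edge data never has to move during a join. The names VertexCutSketch, partition, and replicas are hypothetical, and the simple hash placement used here is an assumption chosen for brevity; it is not claimed to be the placement strategy GraphX or PowerGraph use.

```scala
object VertexCutSketch {
  /** Places each edge on exactly one partition using a simple hash of its endpoints. */
  def partition(edges: Seq[(Int, Int)], numParts: Int): Map[Int, Seq[(Int, Int)]] =
    edges.groupBy { case (src, dst) => Math.floorMod(31 * src + dst, numParts) }

  /** For each vertex, the partitions that need a copy of its attribute. */
  def replicas(parts: Map[Int, Seq[(Int, Int)]]): Map[Int, Set[Int]] =
    parts.toSeq
      .flatMap { case (p, es) => es.flatMap { case (s, d) => Seq(s -> p, d -> p) } }
      .groupBy(_._1)
      .map { case (v, ps) => v -> ps.map(_._2).toSet }

  def main(args: Array[String]): Unit = {
    val parts = partition(Seq((1, 2), (1, 3), (2, 3)), numParts = 2)
    println(parts)
    println(replicas(parts))
  }
}
```

The size of each replica set is the communication cost of updating that vertex, which is why placement strategies try to keep those sets small.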

Early Performance

In-Progress Optimizations. Byte-code inspection of user functions: e.g., if mapF does not need edge data, we can rewrite the query to delay the join. Execution strategies optimizer: scan edges, randomly accessing vertices; or scan vertices, randomly accessing edges.

Current Implementation. [Figure: software stack, listing PageRank (5), Connected Comp. (10), Shortest Path (10), ALS (40), Pregel (20), and GraphLab (20) above GraphX, which runs on Spark (relational operators).]

Demo: Reynold Xin

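    // Load the vertex and edge tables from HDFS.
    val vertices = spark.textFile("hdfs://path/pages.csv")
    val edges = spark.textFile("hdfs://path/to/links.csv")
      .map(line => new Edge(line.split('\t')))

    // Build the graph and keep it in memory.
    val g = new Graph(vertices, edges).cache
    println(g.vertices.count)
    println(g.edges.count)

    // Keep vertices whose third tab-separated field is "Berkeley", then run 10 PageRank iterations.
    val g1 = g.filterVertices(_.split('\t')(2) == "Berkeley")
    val ranks = Analytics.pageRank(g1, numIter = 10)
    println(ranks.vertices.sum)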


Summary: Graph-parallel primitives on Spark. Currently slower than GraphLab, but: no need for specialized systems; easier ETL and easier consumption of output; interactive graph data mining. Future work will bring performance closer to specialized engines. Sub-second

Status: Currently finalizing the APIs. Feedback wanted: http://bit.ly/graph-api. Also working on improving system performance. Will be part of Spark 0.9.

Questions? jegonzal@eecs.berkeley.edu rxin@eecs.berkeley.edu

Backup slides

Vertex Cut Partitioning


aggregateNeighbors


Example: Vertex Degree. [Figure: initial degree counts B: 0, C: 0, D: 0, E: 0, F: 0.]


Specialized Graph Systems. Messaging: Pregel [PODC’09, SIGMOD’10]. Shared state: GraphLab [UAI’10, VLDB’12]. Many others: Giraph, Stanford GPS, Signal-Collect, Combinatorial BLAS, BoostPGL, …

The Challenge: Expressive graph computation primitives implementable on Spark. Leveraging advanced properties and engine extensions to make these primitives fast: an optimizer for choosing execution strategies; controlled data partitioning; new index-based access methods and operators.
