GraphX: Graph Processing in a Distributed Dataflow Framework. Joseph Gonzalez, Postdoc, UC Berkeley AMPLab; Co-founder, GraphLab Inc. Joint work with Reynold Xin, Ankur Dave, Daniel Crankshaw, Michael Franklin, and Ion Stoica. OSDI 2014. The goal of this project was to distill the essential advances developed in the context of specialized graph-processing systems and lift them into general-purpose distributed dataflow systems, unifying graphs and tables and enabling a single system to easily and efficiently support both kinds of data across the entire analytics pipeline. To motivate the need for this unified approach, let me begin by illustrating modern analytics pipelines. UC BERKELEY
Modern Analytics. [Pipeline diagram: Raw Wikipedia (XML) → Link Table → Hyperlinks → PageRank → Top 20 Pages (Title, PR); Discussion Table (User, Disc.) → Editor Graph → Community Detection → User Community → Top Communities (Com., PR).]
Tables. [Same pipeline diagram, highlighting the table-oriented stages: Raw Wikipedia (XML) → Link Table; Top 20 Pages (Title, PR); Discussion Table (User, Disc.); User Community; Top Communities (Com., PR).] Modern analytics involves many stages and many views of the data.
Graphs. [Same pipeline diagram, highlighting the graph-oriented stages: Hyperlinks → PageRank; Editor Graph → Community Detection.] Modern analytics involves many stages and many views of the data.
Separate Systems Tables Graphs
Separate Systems Dataflow Systems Graphs Table Result Row
Separate Systems: Dataflow Systems (Table → Result Row) and Graph Systems (Dependency Graph → Pregel).
Difficult to Use Users must Learn, Deploy, and Manage multiple systems Leads to brittle and often complex interfaces
Inefficient: extensive data movement and duplication across the network and file system (XML, HDFS), and limited reuse of internal data structures across stages.
GraphX Unifies Computation on Tables and Graphs Table View Graph View GraphX Spark Dataflow Framework Enabling a single system to easily and efficiently support the entire pipeline
? Separate Systems Dataflow Systems Graph Systems Dependency Graph Pregel Table Row Row Result Row Row
Separate Systems Dataflow Systems Graph Systems Pregel Table Dependency Graph Row Row Result Row Row
PageRank on the Live-Journal Graph: Hadoop is 60x slower than GraphLab, and Spark is 16x slower than GraphLab.
Key Question: How can we naturally express and efficiently execute graph computation in a general-purpose dataflow framework? Distill the lessons learned from specialized graph systems.
Key Question: How can we naturally express and efficiently execute graph computation in a general-purpose dataflow framework? Representation and Optimizations.
[Pipeline diagram recap: Raw Wikipedia (XML) → Link Table → Hyperlinks → PageRank → Top 20 Pages (Title, PR); Discussion Table (Disc.) → Editor Graph → Community Detection (User, Com.) → Top Communities (PR).]
Example Computation: PageRank. Express the computation locally and iterate until convergence:
R(i) = 0.15 + \sum_{j \in \mathrm{Nbrs}(i)} w_{ji} R(j)
where R(i) is the rank of page i, the sum is the weighted sum of the neighbors' ranks, and 0.15 is the random-reset probability.
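To make the local update concrete, here is a minimal, self-contained Scala sketch of the iteration over an in-memory adjacency list. The toy graph, edge weights, and fixed iteration count are illustrative assumptions, not the GraphX implementation.

object LocalPageRank {
  def main(args: Array[String]): Unit = {
    // Toy in-neighbor lists with edge weights: page -> Seq((neighbor, weight)).
    val inNbrs: Map[Int, Seq[(Int, Double)]] = Map(
      1 -> Seq((2, 0.5), (3, 1.0)),
      2 -> Seq((1, 0.5)),
      3 -> Seq((1, 0.5), (2, 0.5)))
    var rank: Map[Int, Double] = Map(1 -> 1.0, 2 -> 1.0, 3 -> 1.0)

    // Iterate the local update R(i) = 0.15 + sum_j w_ji * R(j); a fixed number
    // of rounds stands in for a convergence test, for simplicity.
    for (_ <- 0 until 20) {
      rank = rank.map { case (i, _) =>
        val total = inNbrs.getOrElse(i, Seq.empty).map { case (j, w) => w * rank(j) }.sum
        i -> (0.15 + total)
      }
    }
    rank.foreach { case (i, r) => println(s"page $i -> rank $r") }
  }
}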
“Think like a Vertex.” - Malewicz et al., SIGMOD’10
Graph-Parallel Pattern Gonzalez et al. [OSDI’12] Gather information from neighboring vertices
Graph-Parallel Pattern Gonzalez et al. [OSDI'12] Apply an update to the vertex property
Graph-Parallel Pattern Gonzalez et al. [OSDI’12] Scatter information to neighboring vertices
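A minimal Scala sketch of the Gather-Apply-Scatter pattern as an interface. The trait and method names are illustrative assumptions about the shape of the pattern, not the exact API of GraphLab/PowerGraph.

// Hypothetical vertex-program interface capturing Gather-Apply-Scatter.
trait GatherApplyScatter[V, E, A] {
  // Gather: collect a value from one neighboring (edge, vertex) pair.
  def gather(self: V, edge: E, neighbor: V): A
  // Combine gathered values (should be commutative and associative).
  def merge(a: A, b: A): A
  // Apply: update the vertex property using the combined gather result.
  def apply(self: V, acc: Option[A]): V
  // Scatter: optionally push information back out along each edge,
  // e.g. to activate neighbors for the next round.
  def scatter(self: V, edge: E, neighbor: V): Option[A]
}

PageRank, for example, gathers weighted neighbor ranks, merges by summation, and applies R(i) = 0.15 + total.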
Many Graph-Parallel Algorithms. Machine Learning: Collaborative Filtering (Alternating Least Squares, Stochastic Gradient Descent, Tensor Factorization), Structured Prediction (Loopy Belief Propagation, Max-Product Linear Programs, Gibbs Sampling), Semi-supervised ML (Graph SSL, CoEM). Network Analysis: Community Detection (Triangle Counting, K-core Decomposition, K-Truss), Graph Analytics (PageRank, Personalized PageRank, Shortest Path, Graph Coloring).
Specialized Computational Pattern Specialized Graph Optimizations
Graph System Optimizations: specialized data structures, vertex-cut partitioning, remote caching / mirroring, message combiners, and active-set tracking. The price we paid for these optimizations was a specialized system; the question is how to make them work in a general-purpose dataflow framework.
Representation and Optimizations: the advances in graph-processing systems map onto dataflow concepts. Graphs → horizontally partitioned tables; vertex programs → dataflow operators (joins); graph-system optimizations → distributed join optimization and materialized view maintenance.
Property Graph Data Model. [Example property graph with vertices A-F.] Vertex properties: e.g., user profile, current PageRank value. Edge properties: e.g., weights, timestamps.
Encoding Property Graphs as Tables. [Diagram: the property graph (vertices A-F) is split by a vertex cut across two machines; each machine holds an edge-table partition, while the vertex table holds the properties and a routing table records where each vertex's edges live.] Vertex Table (RDD), Routing Table (RDD), Edge Table (RDD).
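A minimal sketch, in plain Scala, of how such a vertex-cut property graph might be laid out as three tables. The type aliases and field names are assumptions for illustration, not GraphX's internal classes.

object PropertyGraphTables {
  type VertexId = Long
  type PartitionId = Int

  // Vertex table: one row per vertex, holding only the vertex property.
  final case class VertexRow[V](id: VertexId, attr: V)

  // Edge table: partitioned by a vertex cut; the graph structure lives here.
  final case class EdgeRow[E](src: VertexId, dst: VertexId, attr: E, part: PartitionId)

  // Routing table: for each vertex, the edge partitions that reference it,
  // so vertex properties can be shipped to exactly those partitions.
  final case class RouteRow(id: VertexId, parts: Set[PartitionId])
}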
Separate Properties and Structure: reuse structural information across multiple graphs. [Diagram: an input graph and a transformed graph share the same routing table and edge table; only the vertex table changes when vertex properties are transformed.] The normalized representation is crucial for supporting iterative computation as well as multiple graphs and sub-graphs derived from the same base data.
Table Operators Table operators are inherited from Spark: map filter groupBy sort union join leftOuterJoin rightOuterJoin reduce count fold reduceByKey groupByKey cogroup cross zip sample take first partitionBy mapWith pipe save ...
Graph Operators (Scala)
class Graph[V, E] {
  def this(vertices: Table[(Id, V)], edges: Table[(Id, Id, E)])
  // Table Views -----------------
  def vertices: Table[(Id, V)]
  def edges: Table[(Id, Id, E)]
  def triplets: Table[((Id, V), (Id, V), E)]
  // Transformations ------------------------------
  def reverse: Graph[V, E]
  def subgraph(pV: (Id, V) => Boolean,
               pE: Edge[V, E] => Boolean): Graph[V, E]
  def mapV[T](m: (Id, V) => T): Graph[T, E]
  def mapE[T](m: Edge[V, E] => T): Graph[V, T]
  // Joins ----------------------------------------
  def joinV[T](tbl: Table[(Id, T)]): Graph[(V, T), E]
  def joinE[T](tbl: Table[(Id, Id, T)]): Graph[V, (E, T)]
  // Computation ----------------------------------
  def mrTriplets[T](mapF: Edge[V, E] => List[(Id, T)],
                    reduceF: (T, T) => T): Graph[T, E]
}
Graph Operators (Scala), continued: the mrTriplets operator captures the Gather-Scatter pattern from specialized graph-processing systems:
def mrTriplets[T](mapF: Edge[V, E] => List[(Id, T)], reduceF: (T, T) => T): Graph[T, E]
This API deliberately blurs the distinction between graphs and tables.
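As a usage sketch, here is how the transformation operators compose against the released Apache Spark GraphX API, which grew out of this work. The toy vertices and edges are made up, and the released method names (subgraph, mapEdges, triplets) differ slightly from the simplified API on the slide.

import org.apache.spark.SparkContext
import org.apache.spark.graphx.{Edge, Graph}

def subgraphExample(sc: SparkContext): Unit = {
  // Toy property graph: string vertex properties, double edge weights.
  val vertices = sc.parallelize(Seq((1L, "A"), (2L, "B"), (3L, "C")))
  val edges = sc.parallelize(Seq(Edge(1L, 2L, 1.0), Edge(2L, 3L, 0.0)))
  val graph = Graph(vertices, edges)

  val rescaled = graph
    .subgraph(epred = t => t.attr > 0.0)  // keep only positive-weight edges
    .mapEdges(e => e.attr / 2.0)          // rescale weights; structure is reused

  rescaled.triplets.collect().foreach(println)
}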
Triplets Join Vertices and Edges. The triplets operator joins vertices and edges: [diagram: the vertex table (A-D) and edge table are joined to form triplets (srcId, srcAttr, dstId, dstAttr, edgeAttr) for the edges A→B, A→C, B→C, C→D.]
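Logically, the triplets view is a relational join of the edge table with the vertex table on both endpoints. A minimal, self-contained Scala sketch over in-memory tables; the case class names are illustrative.

object TripletsJoin {
  final case class Vertex[V](id: Long, attr: V)
  final case class Edge[E](src: Long, dst: Long, attr: E)
  final case class Triplet[V, E](srcAttr: V, dstAttr: V, attr: E)

  // triplets = edges JOIN vertices ON src JOIN vertices ON dst
  def triplets[V, E](vertices: Seq[Vertex[V]], edges: Seq[Edge[E]]): Seq[Triplet[V, E]] = {
    val byId = vertices.map(v => v.id -> v.attr).toMap
    for {
      e <- edges
      s <- byId.get(e.src).toSeq   // join on the source vertex id
      d <- byId.get(e.dst).toSeq   // join on the destination vertex id
    } yield Triplet(s, d, e.attr)
  }
}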
Map-Reduce Triplets. mrTriplets collects information about the neighborhood of each vertex: a map function is applied to each triplet and emits messages addressed to the source or destination vertex, and a reduce function (a message combiner) merges the messages arriving at each vertex. [Diagram: mapping over the edges A→B, A→C, B→C, C→D produces messages for B, C, C, and D; the two messages for C are combined by the reduce function.]
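A minimal, self-contained Scala sketch of the mrTriplets semantics over in-memory collections; groupBy plus reduce stands in for Spark's shuffle with message combiners, and the signatures are assumptions for illustration.

// mapF emits (vertexId, message) pairs per edge triplet;
// reduceF combines the messages destined for the same vertex.
def mrTriplets[V, E, A](
    vertices: Map[Long, V],
    edges: Seq[(Long, Long, E)],
    mapF: (Long, V, Long, V, E) => Seq[(Long, A)],
    reduceF: (A, A) => A): Map[Long, A] =
  edges
    .flatMap { case (src, dst, attr) =>
      mapF(src, vertices(src), dst, vertices(dst), attr) }
    .groupBy(_._1)
    .map { case (id, msgs) => id -> msgs.map(_._2).reduce(reduceF) }

// Example: compute the in-degree of each vertex.
val names = Map(1L -> "A", 2L -> "B", 3L -> "C")
val links: Seq[(Long, Long, Unit)] = Seq((1L, 3L, ()), (2L, 3L, ()), (3L, 1L, ()))
val inDeg = mrTriplets[String, Unit, Int](names, links,
  (sid, s, did, d, _) => Seq((did, 1)),  // one message per edge, sent to the dst
  _ + _)                                  // combine messages by summation
println(inDeg)                            // e.g. Map(3 -> 2, 1 -> 1)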
Using these basic GraphX operators we implemented Pregel and GraphLab in under 50 lines of code!
The GraphX Stack (Lines of Code): Spark (30,000), Pregel API (34), PageRank (20), Connected Components (20), K-core (60), Triangle Count (50), LDA (220), SVD++ (110). The small set of GraphX operators can express existing graph-processing systems, and some algorithms are more naturally written directly against the GraphX primitive operators.
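To illustrate how a Pregel-style API can be layered on the join and mrTriplets primitives, here is a toy in-memory Scala sketch reusing the mrTriplets sketch shown earlier. It is an assumption-laden illustration of the idea, not the actual 34-line GraphX Pregel implementation.

def pregelSketch[V, E, A](
    init: Map[Long, V],
    edges: Seq[(Long, Long, E)],
    initialMsg: A,
    maxIters: Int)(
    vprog: (V, A) => V,
    sendMsg: (Long, V, Long, V, E) => Seq[(Long, A)],
    mergeMsg: (A, A) => A): Map[Long, V] = {
  // Superstep 0: every vertex receives the initial message.
  var vertices = init.map { case (id, v) => id -> vprog(v, initialMsg) }
  var msgs = mrTriplets(vertices, edges, sendMsg, mergeMsg)
  var i = 0
  while (msgs.nonEmpty && i < maxIters) {
    // "Join" the messages with the vertex table and run the vertex program
    // only where a message actually arrived.
    vertices = vertices.map { case (id, v) =>
      id -> msgs.get(id).map(m => vprog(v, m)).getOrElse(v)
    }
    msgs = mrTriplets(vertices, edges, sendMsg, mergeMsg)
    i += 1
  }
  vertices
}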
Representation and Optimizations (recap): graphs → horizontally partitioned tables; vertex programs → dataflow operators (joins); graph-system optimizations → distributed join optimization and materialized view maintenance.
Join Site Selection using Routing Tables. [Diagram: the routing table (RDD) records, for each vertex A-F, the edge partitions (1, 2) that reference it, so vertex attributes are shipped only to those partitions.] Never shuffle edges! The routing table is one of the essential ideas GraphX introduces to optimize distributed joins; never shuffling edges is a big win because the edge table is often orders of magnitude larger than the vertex table.
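A minimal Scala sketch of join-site selection: the routing table tells us which edge partitions reference each vertex, so only vertex attributes move and edges never do. The data layout and names are illustrative assumptions.

// routing(v) = set of edge partitions that contain an edge touching v.
def shipVertexAttrs[V](
    vertices: Map[Long, V],
    routing: Map[Long, Set[Int]]): Map[Int, Map[Long, V]] =
  vertices.toSeq
    .flatMap { case (id, attr) =>
      routing.getOrElse(id, Set.empty).map(part => (part, id, attr)) }
    .groupBy(_._1)                                   // group by edge partition
    .map { case (part, rows) =>
      part -> rows.map { case (_, id, attr) => id -> attr }.toMap }

// Each edge partition now has a local "mirror" of exactly the vertex
// attributes it needs; the (much larger) edge table never moves.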
Caching for Iterative mrTriplets. [Diagram: each edge-table partition keeps a mirror cache of the vertex attributes it needs; a reusable hash index lets the join with shipped vertex data proceed as a streaming scan of the edge table.] Vertex Table (RDD), Edge Table (RDD), Mirror Cache, Reusable Hash Index.
Incremental Updates for the Triplets View. [Diagram: when only vertices A and E change, only their new attributes are shipped to the mirror caches; the edge table is rescanned, but unchanged vertices are not re-sent.]
Aggregation for Iterative mrTriplets. [Diagram: messages are partially aggregated locally at each edge partition (local aggregate) before being shipped back to the vertex table, and only vertices whose values changed generate new messages.]
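A minimal Scala sketch of the incremental-maintenance idea behind the mirror caches: after each iteration, ship only vertices whose values actually changed and patch the cached mirrors in place. The change test and data layout are assumptions for illustration.

// Compare the old and new vertex tables and ship only the changed entries
// to the partitions that mirror them (using the routing table).
def shipDeltas[V](
    oldV: Map[Long, V],
    newV: Map[Long, V],
    routing: Map[Long, Set[Int]]): Map[Int, Map[Long, V]] = {
  val changed = newV.filter { case (id, v) => !oldV.get(id).contains(v) }
  changed.toSeq
    .flatMap { case (id, v) => routing.getOrElse(id, Set.empty).map(p => (p, id, v)) }
    .groupBy(_._1)
    .map { case (p, rows) => p -> rows.map(r => r._2 -> r._3).toMap }
}

// Each edge partition then merges its delta into its mirror cache,
// e.g. mirror = mirror ++ deltas(partitionId).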
Reduction in Communication Due to Cached Updates. [Chart.] Most vertices are within 8 hops of all vertices in their connected component.
Benefit of Indexing Active Vertices. [Chart: runtime per iteration with and without active-vertex tracking.]
Join Elimination: identify and bypass joins for unused triplet fields via Java bytecode inspection. [Chart: communication with and without join elimination; lower is better.] Factor of 2 reduction in communication.
Additional Optimizations. Indexing and bitmaps: accelerate joins across graphs and efficiently construct sub-graphs. Lineage-based fault tolerance: exploits Spark lineage to recover in parallel, eliminating the need for costly checkpoints. Substantial index and data reuse: reuse routing tables across graphs and sub-graphs, reuse edge adjacency information and indices, and split hash tables to reuse key sets across hash tables.
System Comparison. Goal: demonstrate that GraphX achieves performance parity with specialized graph-processing systems. Setup: 16-node EC2 cluster (m2.4xlarge; 8 GB RAM, 8 virtual cores) + 1GigE. Compared against GraphLab/PowerGraph (C++), Giraph (Java), and Spark (Java/Scala).
PageRank Benchmark. EC2 cluster of 16 x m2.4xlarge (8 cores) + 1GigE. [Chart: runtime in seconds on the Twitter graph (42M vertices, 1.5B edges) and the UK graph (106M vertices, 3.7B edges); 18x and 7x annotations.] GraphX performs comparably to state-of-the-art graph-processing systems.
Connected Components Benchmark. EC2 cluster of 16 x m2.4xlarge (8 cores) + 1GigE. [Chart: runtime in seconds on the Twitter graph (42M vertices, 1.5B edges) and the UK graph (106M vertices, 3.7B edges); 10x and 8x annotations; one system ran out of memory.] GraphX performs comparably to state-of-the-art graph-processing systems.
Graphs are just one stage…. What about a pipeline?
A Small Pipeline in GraphX. Raw Wikipedia (XML) on HDFS → Spark preprocessing → compute (PageRank on the hyperlink graph) → Spark post-processing → Top 20 Pages. [Chart: end-to-end runtimes of 1492, 605, 375, and 342 seconds for the compared configurations.] Timed end-to-end, GraphX is the fastest. It is also faster in human time, but that is hard to measure.
Adoption and Impact. GraphX is now part of Apache Spark and of the Cloudera Hadoop distribution. In production at Alibaba Taobao, with order-of-magnitude gains over Spark. Inspired GraphLab Inc.'s SFrame technology, which unifies tables and graphs on disk.
GraphX Unifies Tables and Graphs. New API: blurs the distinction between tables and graphs. New system: unifies data-parallel and graph-parallel systems, enabling users to easily and efficiently express the entire analytics pipeline.
What did we learn? Specialized Systems → Integrated Frameworks: Graph Systems → GraphX.
Future Work: Specialized Systems → Integrated Frameworks: Graph Systems → GraphX; Parameter Server → ?
Future Work: Specialized Systems → Integrated Frameworks: Graph Systems → GraphX; Parameter Server → ? (asynchrony, non-determinism, shared state).
Thank You http://amplab.cs.berkeley.edu/projects/graphx/ jegonzal@eecs.berkeley.edu Reynold Xin Ankur Dave Daniel Crankshaw Michael Franklin Ion Stoica
Related Work. Specialized graph-processing systems: GraphLab [UAI'10], Pregel [SIGMOD'10], Signal-Collect [ISWC'10], Combinatorial BLAS [IJHPCA'11], GraphChi [OSDI'12], PowerGraph [OSDI'12], Ligra [PPoPP'13], X-Stream [SOSP'13]. Alternative dataflow frameworks: Naiad [SOSP'13]: GraphLINQ; Hyracks: Pregelix [VLDB'15]. Distributed join optimization: Multicast Join [Afrati et al., EDBT'10]; Semi-Join in MapReduce [Blanas et al., SIGMOD'10].
Edge Files Have Locality. UK graph (106M vertices, 3.7B edges). [Chart: runtime in seconds; lower is better.] GraphLab rebalances the edge files (vertex cut) on load; GraphX preserves the on-disk layout through Spark.
Scalability. Twitter graph (42M vertices, 1.5B edges); runtime for 10 iterations; each node has 8 virtual cores. [Chart: runtime vs. number of EC2 nodes, with a linear-scaling reference line.] GraphX scales slightly better than PowerGraph/GraphLab.
Apache Spark Dataflow Platform (Zaharia et al., NSDI'12). Resilient Distributed Datasets (RDDs): HDFS → Load → RDD → Map → RDD → Reduce → HDFS.
Apache Spark Dataflow Platform (Zaharia et al., NSDI'12). Resilient Distributed Datasets (RDDs) can be persisted in memory with .cache(), which optimizes iterative access to data: HDFS → Load → RDD (cached) → Map → RDD → Reduce → HDFS.
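A minimal sketch against the public Spark RDD API showing why .cache() matters for iterative access. The file path, the per-line computation, and the number of iterations are placeholders.

import org.apache.spark.{SparkConf, SparkContext}

object CacheSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("cache-sketch").setMaster("local[*]"))

    // Load once and keep the deserialized partitions in memory.
    val ranks = sc.textFile("hdfs://web.txt")        // placeholder path
      .map(line => (line.hashCode.toLong, 1.0))
      .cache()

    // Each iteration re-reads `ranks` from memory instead of from HDFS.
    var total = 0.0
    for (_ <- 0 until 10) {
      total = ranks.map(_._2).reduce(_ + _)
    }
    println(total)
    sc.stop()
  }
}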
PageRank Benchmark. EC2 cluster of 16 x m2.4xlarge nodes + 1GigE. [Chart: runtime in seconds on the Twitter graph (42M vertices, 1.5B edges) and the UK graph (106M vertices, 3.7B edges).] GraphX performs comparably to state-of-the-art graph-processing systems.
Shared Memory Advantage. Spark uses a shared-nothing model: cores communicate through shuffle files. GraphLab uses shared memory: cores share a single de-serialized in-memory graph.
Shared Memory Advantage. Twitter graph (42M vertices, 1.5B edges). [Chart: runtime in seconds for Spark's shared-nothing model (shuffle files) vs. GraphLab with shared memory disabled (TCP/IP between cores).]
PageRank Benchmark Twitter Graph (42M Vertices,1.5B Edges) UK-Graph (106M Vertices, 3.7B Edges) GraphX performs comparably to state-of-the-art graph processing systems.
Connected Components Benchmark. EC2 cluster of 16 x m2.4xlarge nodes + 1GigE. [Chart: runtime in seconds on the Twitter graph (42M vertices, 1.5B edges) and the UK graph (106M vertices, 3.7B edges); 10x and 8x annotations; out-of-memory failures marked.] GraphX performs comparably to state-of-the-art graph-processing systems.
Fault-Tolerance Leverage Spark Fault-Tolerance Mechanism
Graph-Processing Systems: Pregel (Google), CombBLAS, Kineograph, X-Stream, GraphChi, Ligra, GPS. They expose a specialized API to simplify graph programming.
Vertex-Program Abstraction (Pregel):
Pregel_PageRank(i, messages):
  // Receive all the messages
  total = 0
  foreach (msg in messages):
    total = total + msg
  // Update the rank of this vertex
  R[i] = 0.15 + total
  // Send new messages to neighbors
  foreach (j in out_neighbors[i]):
    Send msg(R[i]) to vertex j
The Vertex-Program Abstraction (GraphLab), Low, Gonzalez, et al. [UAI'10]. [Diagram: vertex 1 gathers R[2]*w21 + R[3]*w31 + R[4]*w41 from its neighbors 2, 3, 4.]
GraphLab_PageRank(i):
  // Compute the sum over neighbors
  total = 0
  foreach (j in neighbors(i)):
    total += R[j] * wji
  // Update the PageRank
  R[i] = 0.15 + total
Example: Oldest Follower. Calculate the number of older followers for each user. [Diagram: follower graph over users A-F with ages 23, 42, 30, 19, 75, 16.]
val olderFollowerCounts = graph
  .mrTriplets(
    e => // Map
      if (e.src.age > e.dst.age) { List((e.srcId, 1)) }
      else { List.empty },
    (a, b) => a + b // Reduce
  )
  .vertices
Enhanced Pregel in GraphX: require message combiners and remove message computation from the vertex program.
pregelPR(i, messageSum):
  // Update the rank of this vertex
  R[i] = 0.15 + messageSum

combineMsg(a, b):
  // Compute the sum of two messages
  return a + b

sendMsg(ij, R[i], R[j], E[i,j]):
  // Compute a single message
  return msg(R[i] / E[i,j])
PageRank in GraphX:
// Load and initialize the graph
val graph = GraphBuilder.text("hdfs://web.txt")
val prGraph = graph.joinVertices(graph.outDegrees)

// Implement and run PageRank
val pageRank = prGraph.pregel(initialMessage = 0.0, iter = 10)(
  (oldV, msgSum) => 0.15 + 0.85 * msgSum,
  triplet => triplet.src.pr / triplet.src.deg,
  (msgA, msgB) => msgA + msgB)
Example Analytics Pipeline:
// Load raw data tables
val articles = sc.textFile("hdfs://wiki.xml").map(xmlParser)
val links = articles.flatMap(article => article.outLinks)

// Build the graph from tables
val graph = new Graph(articles, links)

// Run the PageRank algorithm
val pr = graph.PageRank(tol = 1.0e-5)

// Extract and print the top 20 articles
val topArticles = articles.join(pr).top(20).collect
for ((article, pageRank) <- topArticles) {
  println(article.title + "\t" + pageRank)
}
Apache Spark Dataflow Platform (Zaharia et al., NSDI'12). Resilient Distributed Datasets (RDDs) can be persisted in memory with .cache(). Dataflow: HDFS → Load → RDD → Map → RDD → Reduce → HDFS. Lineage: the Reduce output tracks its lineage back through Map and Load to HDFS, enabling recomputation on failure.