Topo Sort on Spark GraphX Lecturer: 苟毓川5120309391.

Slides:



Advertisements
Similar presentations
IKI 10100: Data Structures & Algorithms Ruli Manurung (acknowledgments to Denny & Ade Azurat) 1 Fasilkom UI Ruli Manurung (Fasilkom UI)IKI10100: Lecture10.
Advertisements

Graph Algorithms: Topological Sort The topological sorting problem: given a directed, acyclic graph G = (V, E), find a linear ordering of the vertices.
Chapter 8, Part I Graph Algorithms.
Spark: Cluster Computing with Working Sets
Connected Components, Directed Graphs, Topological Sort COMP171.
Lecture 16 CSE 331 Oct 9, Announcements Hand in your HW4 Solutions to HW4 next week Remember next week I will not be here so.
CSE 326: Data Structures Lecture #17 Trees and DAGs and Graphs, Oh MY! Bart Niswonger Summer Quarter 2001.
CSE 421 Algorithms Richard Anderson Lecture 4. What does it mean for an algorithm to be efficient?
Connected Components, Directed Graphs, Topological Sort Lecture 25 COMP171 Fall 2006.
CS344: Lecture 16 S. Muthu Muthukrishnan. Graph Navigation BFS: DFS: DFS numbering by start time or finish time. –tree, back, forward and cross edges.
Connected Components, Directed graphs, Topological sort COMP171 Fall 2005.
CSC 2300 Data Structures & Algorithms March 30, 2007 Chapter 9. Graph Algorithms.
CS2420: Lecture 36 Vladimir Kulyukin Computer Science Department Utah State University.
Important Problem Types and Fundamental Data Structures
Chapter 9 – Graphs A graph G=(V,E) – vertices and edges
Introduction to Graphs. Introduction Graphs are a generalization of trees –Nodes or verticies –Edges or arcs Two kinds of graphs –Directed –Undirected.
Data Engineering How MapReduce Works
Data Structures and Algorithms in Parallel Computing
Section 13  Questions  Graphs. Graphs  A graph representation: –Adjacency matrix. Pros:? Cons:? –Neighbor lists. Pros:? Cons:?
CSE 421 Algorithms Richard Anderson Winter 2009 Lecture 5.
ECE 250 Algorithms and Data Structures Douglas Wilhelm Harder, M.Math. LEL Department of Electrical and Computer Engineering University of Waterloo Waterloo,
CSE 421 Algorithms Richard Anderson Autumn 2015 Lecture 5.
Spanning Trees Alyce Brady CS 510: Computer Algorithms.
Graph Search Applications, Minimum Spanning Tree
Topological Sorting.
About Hadoop Hadoop was one of the first popular open source big data technologies. It is a scalable fault-tolerant system for processing large datasets.
Distributed Programming in “Big Data” Systems Pramod Bhatotia wp
Main algorithm with recursion: We’ll have a function DFS that initializes, and then calls DFS-Visit, which is a recursive function and does the depth first.
Pagerank and Betweenness centrality on Big Taxi Trajectory Graph
Introduction to Algorithms
Topological Sort In this topic, we will discuss: Motivations
Spark Presentation.
Parallel Programming By J. H. Wang May 2, 2017.
GRAPHS Lecture 16 CS2110 Fall 2017.
Introduction to Spark.
SINGLE-SOURCE SHORTEST PATHS IN DAGs
I206: Lecture 15: Graphs Marti Hearst Spring 2012.
Bipartite Matching and Other Graph Algorithms
CS120 Graphs.
Assessing the Performance Impact of Scheduling Policies in Spark
MST in Log-Star Rounds of Congested Clique
Apache Spark & Complex Network
Minimum Spanning Tree.
Methodology & Current Results
Paul Beame in lieu of Richard Anderson
Apache Spark Lecture by: Faria Kalim (lead TA) CS425, UIUC
"Learning how to learn is life's most important skill. " - Tony Buzan
Search Related Algorithms
Topological Sort CSE 373 Data Structures Lecture 19.
Connected Components, Directed Graphs, Topological Sort
Spark and Scala.
CSE 373 Graphs 4: Topological Sort reading: Weiss Ch. 9
Richard Anderson Autumn 2016 Lecture 5
Apache Spark Lecture by: Faria Kalim (lead TA) CS425 Fall 2018 UIUC
Chapter 15 Graphs © 2006 Pearson Education Inc., Upper Saddle River, NJ. All rights reserved.
Lecture 16 CSE 331 Oct 8, 2012.
Graph Algorithms "A charlatan makes obscure what is clear; a thinker makes clear what is obscure. " - Hugh Kingsmill CLRS, Sections 22.2 – 22.4.
Big Data Analytics: Exploring Graphs with Optimized SQL Queries
DryadInc: Reusing work in large-scale computations
CS639: Data Management for Data Science
CSE 417: Algorithms and Computational Complexity
GRAPHS Lecture 17 CS2110 Spring 2018.
Important Problem Types and Fundamental Data Structures
Fast, Interactive, Language-Integrated Cluster Computing
Richard Anderson Winter 2019 Lecture 5
Data Structures and Algorithms
Motivation Contemporary big data tools such as MapReduce and graph processing tools have fixed data abstraction and support a limited set of communication.
CS639: Data Management for Data Science
For Friday Read chapter 9, sections 2-3 No homework
Presentation transcript:

Topo Sort on Spark GraphX Lecturer: 苟毓川5120309391

Outline Necessary background knowledge on Spark GraphX Topo sort on stand-alone and Spark GraphX Distributed topo sort Lecturer: 苟毓川5120309391

Spark GraphX

Server cluster in prof. Wang’s group Spark Spark GraphX A fast engine for server cluster to process large-scale data. Server cluster in prof. Wang’s group

A simple example Word count result1 text1 large result large text Spark GraphX A simple example to understand Spark Word count result1 text1 computing node1 result2 large result computing node2 large text text2 result3 text3 computing node3

Spark GraphX A simple example to understand Spark Word count

(Resilient Distributed Dataset) RDD Spark GraphX (Resilient Distributed Dataset) Dataset is partitioned into subsets Subsets can be sent to any computing node to process Data can be cached in memory to reuse We can easily put data into RDD!

Spark Spark RDD high-level tools: Streaming, Mllib, SQL, GraphX Spark GraphX high-level tools: Streaming, Mllib, SQL, GraphX Spark RDD Spark Streaming MLlib SQL GraphX

Spark GraphX A new graph abstraction(extends the Spark RDD) A component for graphs and graph-parallel computation A new graph abstraction(extends the Spark RDD) A set of basic graph operations and optimized pregelAPI A growing collection of graph algorithms

Spark GraphX Partition the graph into several subgraphs Compute the subgraph in a distributed way

Topo sort (topological)

Motivation Topo sort Build a set of graph algorithm such as topo sort on Spark GraphX Research the graph algorithm in a distributed way Try to be a contributor to Spark community Learn Spark Graphx on a base level

What is topo sort? 2 3 5 7 Topo sort A DAG (directed acyclic graph) We can get the topo order as the cource schedule: 2, 5, 3, 7 Maths 2 3 5 7 Data Structure Program Design Data Base

The realization on stand-alone Topo sort Kahn algorithm

The realization on stand-alone Topo sort 1. Compute the indegrees of every vertex Maths 2 3 5 7 Data Structure Program Design Data Base

The realization on stand-alone Topo sort 1. Compute the indegrees of every vertex Maths 2 3 5 7 Data Structure Program Design 1 1 2 Data Base indegree of vertex.

The realization on stand-alone Topo sort Adjacent matrix(graph) Vertex1 Vertex2 … VertexN Vertex1 0/1 0/1 0/1 Vertex2 0/1 0/1 0/1 … VertexN 0/1 0/1 0/1 Add to get the indegree of vertex1 Edge from VertexN to Vertex2

The realization on stand-alone Topo sort The realization on stand-alone 2. Output the vertex1 of 0 indegree, and minus the indegree of following vertices Maths 2 3 5 7 Data Structure Program Design 1 1 2 Data Base indegree of vertex.

The realization on stand-alone Topo sort The realization on stand-alone 2. Output the vertex1 of 0 indegree, and minus the indegree of following vertices Output the vertex1 Maths 2 3 5 7 Data Structure Program Design 1 1 2 Data Base indegree of vertex.

The realization on stand-alone Topo sort The realization on stand-alone 2. Output the vertex1 of 0 indegree, and minus the indegree of following vertices Output the vertex1 Maths 1 Data Structure Program Design 2 -1 1 3 5 1 2 7 Data Base indegree of vertex.

The realization on stand-alone Topo sort The realization on stand-alone 2. Output the vertex1 of 0 indegree, and minus the indegree of following vertices Output the vertex1 Maths 1 Data Structure Program Design 2 1 3 5 1-1=0 2 7 Data Base indegree of vertex.

The realization on stand-alone Topo sort The realization on stand-alone 3. Repeat the step2 Output the vertex1 Maths 1 Data Structure Program Design 2 1 3 5 2 7 Data Base

The realization on stand-alone Topo sort The realization on stand-alone 3. Repeat the step2 Maths 1 Data Structure Program Design 2 -1 1 3 5 -1 Output the vertex2 2 7 Data Base

The realization on stand-alone Topo sort The realization on stand-alone 3. Repeat the step2 Maths 1 2 Program Design Data Structure 5 2 3 1 7 Data Base

The realization on stand-alone Topo sort The realization on stand-alone 3. Repeat the step2 Maths 1 2 Data Structure Program Design 2 3 5 -1 Output the vertex3 1 7 Data Base

The realization on stand-alone Topo sort The realization on stand-alone 3. Repeat the step2 Maths 1 2 Program Design 5 2 Data Structure 7 4 Data Base 3 3

The realization on stand-alone Topo sort The realization on stand-alone Program: Use adjacent matrix to reserve the graph Use array to reserve the vertices and indegrees Use for-loop to find the 0 indegree vertex Use array to make a queue for sorting 2 3 5 7 Maths Program Design Data Structure Data Base 1 1 2

The realization on Spark GraphX Topo sort The realization on Spark GraphX Program: Use RDD to reserve the vertices and indegrees Use RDD to reserve the vertices and adjacent IDs Use reduce() to find the 0 indegree vertex Use mapValues() to change the adjacent vertices’ indegree

The performance on Spark GraphX Topo sort The performance on Spark GraphX Not good enough! My simple test: For larger scale DAG data, its behavior is a little better than stand-alone. It runs a little faster.

Topo sort Why?

Why? 2 3 5 7 Topo sort Key step(find the 0 indegree vertex): Maths Output the vertex1 Maths 2 3 5 7 Stand-alone: use for-loop Spark GraphX: use reduce() in RDD Data Structure Program Design 1 1 2 Data Base

Why? 2 3 5 7 Topo sort Key step(find the 0 indegree vertex): Maths Output the vertex1 Maths 2 3 5 7 Data Structure Program Design We can only process one vertex at a time. Other vertices must wait until this step has been done. 1 1 2 Data Base

What we need? 2 3 5 7 Topo sort Maths A more proper algorithm model! A more proper algorithm model! Data Structure Program Design 1 1 2 Data Base

Topo sort What we need? Partition the graph and compute each parts at the same time. Other vertices don’t need to wait

Distributed topo sort (D-topo sort)

The algorithm 4 2 1 7 5 6 3 D-Topo sort 1. Choose the startnode, and send st=1 to all its neighbors 4 2 1 7 5 6 3

The algorithm 4 2 1 7 5 6 3 D-Topo sort 1. Choose the startnode, and send st=1 to all its neighbors start time 4 2 1 (0) 1 7 1 5 6 3

The algorithm 4 2 1 7 5 6 3 D-Topo sort 1. Choose the startnode, and send st=1 to all its neighbors start time (1) 4 2 1 (0) 1 7 1 5 6 3 (1)

The algorithm 4 2 1 7 5 6 3 D-Topo sort 2. When a vertex receives a st, st = st+1, and send new st to all its neighbors (1) 4 2 1 (0) 1 7 1 5 6 3 (1)

The algorithm 4 2 1 7 5 6 3 D-Topo sort 2. When a vertex receives a st, st = st+1, and send new st to all its neighbors (1) 4 2 2 1 2 (0) 1 2 7 1 2 5 2 6 3 (1)

The algorithm 4 2 1 7 5 6 3 D-Topo sort 2. When a vertex receives a st, st = st+1, and send new st to all its neighbors (1) (2,2) 4 2 2 1 2 (0) 1 2 7 1 2 (2) 5 2 6 3 (1) (2,2)

The algorithm 4 2 1 7 5 6 3 D-Topo sort 3. Repeat step2(receive st and plus 1, then send) (1) (2,2) 4 2 2 1 2 (0) 1 2 7 1 2 (2) 5 2 6 3 (1) (2,2)

The algorithm 4 2 1 7 5 6 3 D-Topo sort 3. Repeat step2(receive st and plus 1, then send) (1) (2,2) 4 2 2 1 2 (0) 1 2 3 7 1 3 2 (2) 5 2 6 3 3 (1) (2,2)

The algorithm 4 2 1 7 5 6 3 D-Topo sort 3. Repeat step2(receive st and plus 1, then send) (1) (2,2,3) 4 2 2 1 2 (3) (0) 1 2 3 7 1 3 2 (2) 5 2 6 3 3 (1) (2,2,3)

The algorithm 4 2 1 7 5 6 3 D-Topo sort 4. Get the maximum value of all nodes (1) (2,2,3) 4 2 2 1 2 (3) (0) 1 2 3 7 1 3 2 (2) 5 2 6 3 3 (1) (2,2,3)

The algorithm 4 2 1 7 5 6 3 D-Topo sort 4. Get the maximum value of all nodes (1) (2,2,3) 4 2 (3) (0) 1 7 (2) 5 6 3 (1) (2,2,3)

The algorithm 4 2 1 7 5 6 3 D-Topo sort 4. Get the maximum value of all nodes (1) (2,2,3) 4 2 (3) (0) 1 7 (2) 5 6 3 (1) (2,2,3)

The algorithm 4 2 1 7 5 6 3 D-Topo sort 5. Get the sort by the maximum value (1) (2,2,3) 4 2 (3) (0) 1 7 (2) 5 6 3 (1) (2,2,3)

The algorithm 2 4 1 3 7 5 6 D-Topo sort 5. Get the sort by the maximum value (2,2,3) 2 (1) 4 (2) (3) (0) 1 3 7 5 (1) 6 (2,2,3) Topo order

Advanatges 4 2 1 7 5 6 3 D-Topo sort (0) Don’t need to repeat the computing to find the 0 indegree vertex (1) (2,2,3) 4 2 2 1 2 (3) (0) 1 2 3 7 1 3 2 (2) 5 2 6 3 3 (1) (2,2,3)

Advanatges 4 2 1 7 5 6 3 D-Topo sort (0) Nodes are able to compute (send st to adjacent nodes) at the same time (1) (2,2,3) 4 2 2 1 2 (3) (0) 1 2 3 7 1 3 2 (2) 5 2 6 3 3 (1) (2,2,3)

D-Topo sort Advanatges Much more faster in a distributed way!

Key operation in Spark GraphX D-Topo sort Key operation in Spark GraphX aggregateMessages(sendMsg, mergeMsg)

Future work D-Topo sort Finish the D-topo sort on Spark GraphX Using GraphX to do something in application level such as community detection in social network

Thank you!

Q&A