Parallel Graph Algorithms

Sathish Vadhiyar

Graph Traversal Graph search plays an important role in analyzing large data sets; relationships between data objects are represented in the form of graphs. Breadth-first search is used in finding shortest paths or sets of paths.

Level-synchronized algorithm Proceeds level-by-level starting with the source vertex; the level of a vertex is its graph distance from the source. How to decompose the graph (vertices, edges and adjacency matrix) among processors?

Distributed BFS with 1D Partitioning Each vertex and the edges emanating from it are owned by one processor – 1-D partitioning of the adjacency matrix. The edges emanating from vertex v form its edge list: the list of vertex indices in row v of the adjacency matrix A.

1-D Partitioning At each level, each processor owns a set F – the set of frontier vertices owned by that processor. The edge lists of the vertices in F are merged to form a set of neighboring vertices, N. Some vertices of N are owned by the same processor, while others are owned by other processors; messages are sent to those processors to add these vertices to their frontier sets for the next level.
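As a rough illustration (my own minimal simulation, not code from the slides), one BFS level under 1-D partitioning can be sketched in Python: each simulated processor merges the edge lists of its frontier F and buckets the new vertices by owner.

```python
# Sketch of one BFS level with 1-D partitioning (hypothetical names).
P = 2
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}  # 4 vertices, 2 per processor

def owner(v):
    return v * P // len(adj)              # block distribution of vertices

def bfs_level(frontiers, visited):
    """One level: expand local frontiers, return the next frontier per owner."""
    next_frontiers = {p: set() for p in range(P)}
    for p in range(P):
        neighbours = set()
        for v in frontiers[p]:            # merge edge lists of local frontier F
            neighbours.update(adj[v])
        for u in neighbours - visited:    # "send" each new vertex to its owner
            next_frontiers[owner(u)].add(u)
    visited |= set().union(*next_frontiers.values())
    return next_frontiers

visited = {0}
frontiers = {0: {0}, 1: set()}            # source vertex 0 lives on processor 0
frontiers = bfs_level(frontiers, visited)  # level 1
```

In a real distributed implementation the bucketing step becomes the message exchange between processors.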

Lvs(v) – level of v, i.e., its graph distance from the source vs

2D Partitioning P = R×C processor mesh. The adjacency matrix is divided into R·C block rows and C block columns. A(i,j)(*) denotes the blocks owned by processor (i,j); each processor owns C blocks.

2D Partitioning Processor (i,j) owns the vertices belonging to block row (j-1)·R+i. Thus a process stores some of the edges incident on its own vertices, and some edges incident on vertices it does not own.

2D Partitioning Assume that the edge list for a given vertex is the column of the adjacency matrix. Each block in the 2D partitioning contains partial edge lists. Each processor has a frontier set of vertices, F, owned by that processor.

2D Partitioning Expand Operation Consider a vertex v in F. The owner of v sends messages to the other processors in its processor column announcing that v is in the frontier, since any of these processors may hold a partial edge list of v.

2D Partitioning Fold Operation Partial edge lists on each processor are merged to form N – the potential vertices in the next frontier. Vertices in N are sent to their owners to form the new frontier set F on those processors. These owner processors lie in the same processor row. This communication step is referred to as the fold operation.
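The sizes of the two communication groups can be made concrete with a tiny sketch (mesh shape and function names are my own): for a processor (i, j) in an R × C mesh, the expand step involves its processor column and the fold step its processor row.

```python
R, C = 2, 3   # hypothetical 2 x 3 processor mesh

def expand_group(i, j):
    """Processors that may hold partial edge lists of (i, j)'s frontier
    vertices: the processors in (i, j)'s column."""
    return [(r, j) for r in range(R)]

def fold_group(i, j):
    """Owners of the merged neighbour set N: the processors in (i, j)'s row."""
    return [(i, c) for c in range(C)]
```

This is the point of the 2D scheme: each communication step involves only R or C processors rather than all R·C of them.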

Analysis Advantage of 2D over 1D – processor-column and processor-row communications involve only R and C processors

BFS on GPUs

BFS on GPUs One GPU thread per vertex. In each iteration, each vertex looks at its entry in the frontier array; if true, it expands its neighbors into the next frontier. Severe load imbalance among the threads; scope for improvement.
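A sequential stand-in for the kernel (my own sketch, not the CUDA code from the cited paper) shows the one-thread-per-vertex pattern; each pass of the loop over v plays the role of one GPU thread.

```python
# One-thread-per-vertex BFS iterations, simulated sequentially.
adj = {0: [1, 2], 1: [3], 2: [3], 3: []}
n = len(adj)
frontier = [True, False, False, False]   # source is vertex 0
visited = [True, False, False, False]
level = [0, -1, -1, -1]

def bfs_iteration(depth):
    next_frontier = [False] * n
    for v in range(n):                   # each loop body = one GPU thread
        if frontier[v]:
            for u in adj[v]:             # a vertex with a long edge list does
                if not visited[u]:       # more work: the load imbalance the
                    visited[u] = True    # slide mentions
                    level[u] = depth
                    next_frontier[u] = True
    return next_frontier

frontier = bfs_iteration(1)
frontier = bfs_iteration(2)
```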

Parallel Depth First Search Easy to parallelize Left subtree can be searched in parallel with the right subtree Statically assign a node to a processor – the whole subtree rooted at that node can be searched independently. Can lead to load imbalance; Load imbalance increases with the number of processors

Dynamic Load Balancing (DLB) Difficult to estimate the size of the search space beforehand Need to balance the search space among processors dynamically In DLB, when a processor runs out of work, it gets work from another processor

Maintaining Search Space Each processor searches the space depth-first Unexplored states saved as stack; each processor maintains its own local stack Initially, the entire search space assigned to one processor

Work Splitting When a processor receives a work request, it splits its search space. Half-split: the stack is divided into two equal pieces – may result in load imbalance. Giving away stack space near the bottom of the stack tends to give away bigger trees; stack space near the top of the stack tends to hold small trees. To avoid sending very small amounts of work, nodes beyond a specified stack depth – the cutoff depth – are not given away.
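A toy version of the splitting step (the function name, stack layout and cutoff value are made up for illustration): the donor keeps everything beyond the cutoff depth and donates the bottom half of the shareable nodes, since those tend to root bigger subtrees.

```python
CUTOFF_DEPTH = 6   # nodes at this depth or deeper are never given away

def split_stack(stack):
    """stack: list of (node, depth) pairs, bottom of stack first.
    Returns (kept, donated)."""
    shareable = [e for e in stack if e[1] < CUTOFF_DEPTH]
    donated = shareable[: len(shareable) // 2]   # bottom half: bigger trees
    kept = [e for e in stack if e not in donated]
    return kept, donated

stack = [("a", 0), ("b", 2), ("c", 5), ("d", 7)]
kept, donated = split_stack(stack)   # only ("a", 0) is donated
```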

Strategies 1. Send nodes near the bottom of the stack 2. Send nodes near the cutoff depth 3. Send half the nodes between the bottom of the stack and the cutoff depth Example: Figures 11.5(a) and 11.9

Load Balancing Strategies Asynchronous round-robin: each processor has its own target processor to get work from; the target value is incremented modulo the number of processors. Global round-robin: a single target variable is maintained for all processors. Random polling: randomly select a donor.
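The three donor-selection rules can be caricatured in a few lines (hypothetical sketch; P is the processor count):

```python
import random

P = 4

def asynchronous_round_robin(local_target):
    """Each processor keeps its own target; returns (donor, new target)."""
    return local_target, (local_target + 1) % P

class GlobalRoundRobin:
    """One shared target variable for all processors (a contention point)."""
    def __init__(self):
        self.target = 0
    def next_donor(self):
        donor = self.target
        self.target = (self.target + 1) % P
        return donor

def random_polling():
    """Pick a donor uniformly at random."""
    return random.randrange(P)
```

The shared variable in the global scheme is what makes it scale poorly: every work request contends for it.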

Termination Detection Dijkstra's Token Termination Detection Algorithm Based on passing a token around a logical ring; P0 initiates a token when idle. A processor holds the token until it has completed its work, and then passes it to the next processor; when P0 receives the token again, all processors have completed. However, a processor may get more work after becoming idle.

Algorithm Continued…. Taken care of by using white and black colours. Initially the token is white; a processor j becomes black if it sends work to a processor i < j. When j completes its work, it passes the token on, colouring the token black if j itself is black; after passing the token, j turns white. When P0 receives a black token, it reinitiates the ring.
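A toy simulation of the two-colour ring (my own sketch; a real implementation interleaves this with the DFS):

```python
WHITE, BLACK = "white", "black"

def pass_ring(colours):
    """One full trip of the token from P0 around the ring.
    Returns the colour of the token when it returns to P0."""
    token = WHITE
    for p in range(len(colours)):
        if colours[p] == BLACK:
            token = BLACK        # a black processor taints the token...
        colours[p] = WHITE       # ...and turns white after passing it on
    return token

colours = [WHITE, BLACK, WHITE]  # P1 sent work "backwards" earlier
first_trip = pass_ring(colours)  # black: cannot conclude termination
second_trip = pass_ring(colours)  # white: all processors are done
```

Two trips are needed here: the first only clears the processor colours, the second certifies that no work moved while it circulated.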

Tree Based Termination Detection Uses weights. Initially processor 0 has weight 1. When a processor transfers work to another processor, the weight is halved between the two processors. When a processor finishes its work, it returns its weight. Termination is when processor 0 gets back a weight of 1. Goes with the DFS algorithm; no separate communication steps. Figure 11.10
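A sketch of the weight scheme, using exact fractions to dodge floating-point round-off (an implementation detail of my own; the slides only say "weights"):

```python
from fractions import Fraction

weights = {0: Fraction(1), 1: Fraction(0), 2: Fraction(0)}

def transfer_work(donor, recipient):
    """Halve the donor's weight; the recipient gets the other half."""
    half = weights[donor] / 2
    weights[donor] = half
    weights[recipient] += half

def finish(p):
    """An idle processor returns its weight (here, directly to processor 0)."""
    weights[0] += weights[p]
    weights[p] = Fraction(0)

transfer_work(0, 1)
transfer_work(1, 2)
finish(1)
finish(2)
terminated = weights[0] == 1     # processor 0 holds weight 1 again
```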

Minimal Spanning Tree, Single-Source and All-pairs Shortest Paths

Minimal Spanning Tree – Prim’s Algorithm A spanning tree of a graph G(V,E) is a tree containing all vertices of G; an MST is a spanning tree with minimum sum of edge weights. Vertices are added to a set Vt that holds the vertices of the MST; initially it contains an arbitrary vertex, r, as the root vertex.

Minimal Spanning Tree – Prim’s Algorithm An array d such that d[v], for v in (V−Vt), holds the weight of the least-weight edge between v and any vertex in Vt; initially d[v] = w[r,v]. Find the vertex in d with minimum weight and add it to Vt. Update d. Time complexity – O(n²)
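The O(n²) sequential procedure can be sketched as follows (the adjacency matrix and function name are my own; INF marks absent edges):

```python
INF = float("inf")

def prim_mst_weight(w, r=0):
    """Return the total weight of an MST of the graph with weight matrix w."""
    n = len(w)
    in_vt = [False] * n
    in_vt[r] = True
    d = w[r][:]                  # d[v] = lightest edge from v into Vt
    total = 0
    for _ in range(n - 1):
        u = min((v for v in range(n) if not in_vt[v]), key=lambda v: d[v])
        total += d[u]
        in_vt[u] = True
        for v in range(n):       # update d after adding u to Vt
            if not in_vt[v] and w[u][v] < d[v]:
                d[v] = w[u][v]
    return total

w = [[INF, 1, 3, INF],
     [1, INF, 1, 4],
     [3, 1, INF, 2],
     [INF, 4, 2, INF]]
mst = prim_mst_weight(w)         # edges 0-1, 1-2, 2-3: total weight 4
```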

Parallelization The vertex set V and the d array are partitioned across p processors. Each processor finds the local minimum in its part of d; then the global minimum across all of d is found by a reduction on one processor. That processor finds the next vertex u and broadcasts it to all processors.

Parallelization All processors update d; the owning processor of u marks u as belonging to Vt. The process responsible for v must know w[u,v] to update v – hence a 1-D block mapping of the adjacency matrix. Complexity – O(n²/p) computation plus O(n log p) communication

Single Source Shortest Path – Dijkstra’s Algorithm Finds the shortest paths from the source vertex to all vertices. Follows a similar structure to Prim’s. Instead of the d array, an array l that maintains the shortest path lengths is maintained. Follows a similar parallelization scheme.
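In the same array-based style as the Prim sketch, a minimal O(n²) Dijkstra (my own illustration; l[v] plays the role that d[v] plays in Prim's):

```python
INF = float("inf")

def dijkstra(w, s=0):
    """Shortest path lengths from source s, given weight matrix w."""
    n = len(w)
    done = [False] * n
    l = [INF] * n                # l[v] = shortest length found so far
    l[s] = 0
    for _ in range(n):
        u = min((v for v in range(n) if not done[v]), key=lambda v: l[v])
        done[u] = True
        for v in range(n):       # relax the edges out of u
            if not done[v] and l[u] + w[u][v] < l[v]:
                l[v] = l[u] + w[u][v]
    return l

w = [[INF, 2, 5, INF],
     [2, INF, 1, INF],
     [5, 1, INF, 3],
     [INF, INF, 3, INF]]
dist = dijkstra(w)               # shortest paths from vertex 0
```

The only difference from the Prim sketch is the relaxation line: path length l[u] + w[u][v] instead of the single edge weight w[u][v].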

Single Source Shortest Path on GPUs

SSSP on GPUs A single kernel is not enough since the cost array Ca cannot be updated while it is being accessed; hence costs are updated in a temporary array Ua

All-Pairs Shortest Paths To find the shortest paths between all pairs of vertices. Dijkstra's algorithm for single-source shortest paths can be used for all vertices. Two approaches

All-Pairs Shortest Paths Source-partitioned formulation: Partition the vertices across processors Works well if p<=n; No communication Can at best use only n processors Time complexity? Source-parallel formulation: Parallelize SSSP for a vertex across a subset of processors Do for all vertices with different subsets of processors Hierarchical formulation Exploits more parallelism

All-Pairs Shortest Paths Floyd’s Algorithm Consider a subset S = {v1, v2, …, vk} of vertices for some k <= n. Consider finding the shortest path between vi and vj. Consider all paths from vi to vj whose intermediate vertices belong to the set S; let p_ij^(k) be the minimum-weight path among them, with weight d_ij^(k).

All-Pairs Shortest Paths Floyd’s Algorithm If vk is not in the shortest path, then p_ij^(k) = p_ij^(k-1). If vk is in the shortest path, then the path is broken into two parts – from vi to vk, and from vk to vj – so d_ij^(k) = min{ d_ij^(k-1), d_ik^(k-1) + d_kj^(k-1) }. The length of the shortest path from vi to vj is given by d_ij^(n); in general, the solution is the matrix D^(n).
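The recurrence written out directly (a standard sequential sketch of Floyd's algorithm; updating D in place is safe because row k and column k do not change during iteration k):

```python
INF = float("inf")

def floyd_warshall(w):
    """All-pairs shortest path matrix D^(n) from weight matrix w."""
    n = len(w)
    d = [row[:] for row in w]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                # d_ij(k) = min(d_ij(k-1), d_ik(k-1) + d_kj(k-1))
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d

w = [[0, 3, INF],
     [3, 0, 1],
     [INF, 1, 0]]
d = floyd_warshall(w)            # d[0][2] becomes 4, via vertex 1
```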

Parallel Formulation 2-D Block Mapping Processors laid out in a 2D mesh. During the kth iteration, each process P_ij needs certain segments of the kth row and kth column of the D^(k-1) matrix. For d_lr^(k), the following are needed: d_lk^(k-1) (from a process along the same process row) and d_kr^(k-1) (from a process along the same process column). Figure 10.8

Parallel Formulation 2D Block Mapping During the kth iteration, each of the √p processes containing part of the kth row sends it to the other √p − 1 processes in the same processor column; similarly for the kth column along processor rows. Figure 10.8 Time complexity?

APSP on GPUs The space complexity of Floyd’s algorithm is O(V²) – impossible to go beyond a few vertices on GPUs. Uses V² threads. A single O(V) operation looping over O(V²) threads can exhibit slowdown due to high context-switching overhead between threads. Use Dijkstra's – run the SSSP algorithm from every vertex in the graph. This requires only the final output to be of size O(V²); intermediate outputs on the GPU can be O(V) and can be copied to CPU memory.

APSP on GPUs

Sources/References Paper: A Scalable Distributed Parallel Breadth-First Search Algorithm on BlueGene/L. Yoo et al. SC 2005. Paper: Accelerating Large Graph Algorithms on the GPU Using CUDA. Harish and Narayanan. HiPC 2007.

Speedup Anomalies in DFS The overall work (space searched) in parallel DFS can be smaller or larger than in sequential DFS Can cause superlinear or sublinear speedups Figures 11.18, 11.19

Parallel Formulation Pipelining In the 2D formulation, the kth iteration in all processes starts only after the (k-1)th iteration completes in all processes. With pipelining, a process can start working on the kth iteration as soon as it has computed the (k-1)th iteration and has the relevant parts of the D^(k-1) matrix. Example: Figure 10.9 Time complexity