
1 Parallel Graph Algorithms Sathish Vadhiyar

2 Graph Traversal  Graph search plays an important role in analyzing large data sets  Relationships between data objects are represented in the form of graphs  Breadth-first search (BFS) is used to find shortest paths or sets of paths

3 Level-synchronized algorithm  Proceeds level-by-level starting with the source vertex  Level of a vertex – its graph distance from the source  How to decompose the graph (vertices, edges and adjacency matrix) among processors?
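
A minimal serial sketch of the level-synchronized scheme may help before the distributed version; the dict-of-adjacency-lists representation is an assumption (the slides later partition an adjacency matrix instead):

```python
def level_synchronized_bfs(adj, source):
    """Level-synchronized BFS: expand the whole frontier in each step.

    adj: dict mapping each vertex to a list of its neighbors.
    Returns a dict mapping each reached vertex to its level."""
    level = {source: 0}
    frontier = {source}
    depth = 0
    while frontier:
        depth += 1
        # Merge the edge lists of all frontier vertices.
        neighbors = {w for v in frontier for w in adj[v]}
        # The next frontier is every neighbor not yet assigned a level.
        frontier = {w for w in neighbors if w not in level}
        for w in frontier:
            level[w] = depth
    return level
```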

4 Distributed BFS with 1D Partitioning  Each vertex and the edges emanating from it are owned by one processor  This corresponds to a 1-D partitioning of the adjacency matrix  The edges emanating from vertex v form its edge list = the list of vertex indices in row v of the adjacency matrix A

5 1-D Partitioning  At each level, each processor owns a set F – the set of frontier vertices it owns  The edge lists of the vertices in F are merged to form a set of neighboring vertices, N  Some vertices of N are owned by the same processor, while others are owned by other processors  Messages are sent to those processors to add these vertices to their frontier set for the next level
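
A sketch of one way this step can be organized, with message passing simulated by in-memory buffers; the owner(v) = v mod P mapping and the dict-of-adjacency-lists representation are assumptions for illustration:

```python
def distributed_bfs(adj, source, P):
    """1-D partitioned BFS, with message passing simulated by buffers.
    adj: dict mapping each vertex to a list of its neighbors."""
    def owner(v):
        return v % P                        # assumed owner mapping
    level = {source: 0}
    F = [set() for _ in range(P)]           # per-processor frontier sets
    F[owner(source)].add(source)
    depth = 0
    while any(F):
        depth += 1
        # outgoing[p][q] = vertices processor p asks processor q to enqueue
        outgoing = [[set() for _ in range(P)] for _ in range(P)]
        for p in range(P):
            # Merge the edge lists of the local frontier vertices into N.
            N = {w for v in F[p] for w in adj[v]}
            for w in N:
                outgoing[p][owner(w)].add(w)   # "send" w to its owner
        for p in range(P):
            received = set().union(*(outgoing[q][p] for q in range(P)))
            # Only vertices not yet visited join the next frontier.
            F[p] = {w for w in received if w not in level}
            for w in F[p]:
                level[w] = depth
    return level
```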

6 L_vs(v) – level of v, i.e., its graph distance from the source vertex vs

7 BFS on GPUs

8  One GPU thread per vertex  In each iteration, each thread checks its vertex's entry in the frontier array  If the entry is true, it expands the vertex's neighbors and adds them to the next frontier  Severe load imbalance among the threads, since edge-list lengths vary  Scope for improvement
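
A serial sketch of the one-thread-per-vertex scheme: on a GPU every iteration of the inner vertex loop would run as a separate thread, which is where the imbalance comes from (threads for high-degree frontier vertices do far more work). The adjacency-list representation is an assumption:

```python
def gpu_style_bfs(adj, n, source):
    """Frontier-array BFS in the one-thread-per-vertex style,
    simulated serially."""
    INF = float("inf")
    cost = [INF] * n
    frontier = [False] * n
    cost[source] = 0
    frontier[source] = True
    while any(frontier):
        next_frontier = [False] * n
        for v in range(n):                 # one GPU thread per vertex
            if frontier[v]:                # thread checks its frontier entry
                for w in adj[v]:
                    if cost[w] == INF:     # unvisited neighbor
                        cost[w] = cost[v] + 1
                        next_frontier[w] = True
        frontier = next_frontier
    return cost
```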

9 Parallel Depth First Search  Easy to parallelize  The left subtree can be searched in parallel with the right subtree  Statically assign a node to a processor – the whole subtree rooted at that node can be searched independently  Can lead to load imbalance, which increases with the number of processors

10 Dynamic Load Balancing (DLB)  Difficult to estimate the size of the search space beforehand  Need to balance the search space among processors dynamically  In DLB, when a processor runs out of work, it gets work from another processor

11 Maintaining Search Space  Each processor searches the space depth-first  Unexplored states are saved on a stack; each processor maintains its own local stack  Initially, the entire search space is assigned to one processor  When a processor's local stack is empty, it requests untried alternatives from another processor's stack

12 Work Splitting  When a processor receives a work request, it splits its search space  Half-split: the stack is divided into two equal pieces – may still result in load imbalance  Giving away nodes near the bottom of the stack tends to give away bigger trees  Nodes near the top of the stack tend to root small trees  To avoid sending very small amounts of work, nodes beyond a specified stack depth – the cutoff depth – are not given away

13 Strategies  1. Send nodes near the bottom of the stack  2. Send nodes near the cutoff depth  3. Send half the nodes between the bottom of the stack and the cutoff depth  Example: Figures 11.5(a) and 11.9
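
A sketch of strategy 3, under the assumption that the stack holds (node, depth) pairs; which nodes count as untried alternatives and the exact split policy vary by implementation:

```python
def split_stack(stack, cutoff_depth):
    """Donor-side work splitting, strategy 3 from the slides: give away
    half of the nodes lying between the bottom of the stack and the
    cutoff depth (strategies 1 and 2 would instead pick nodes nearest
    the bottom or nearest the cutoff depth, respectively).

    stack: list of (node, depth) pairs, bottom of the stack first.
    Returns (kept, given)."""
    eligible = [i for i, (_, depth) in enumerate(stack) if depth < cutoff_depth]
    give = set(eligible[::2])          # every other eligible node ~ half
    given = [stack[i] for i in sorted(give)]
    kept = [entry for i, entry in enumerate(stack) if i not in give]
    return kept, given
```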

14 Load Balancing Strategies  Asynchronous round-robin: each processor maintains its own target variable indicating where to request work from; the target is incremented modulo the number of processors after each request  Global round-robin: a single target variable is shared by all processors  Random polling: a donor is selected at random for each request
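
Minimal sketches of the three donor-selection rules; the class and function shapes are assumptions, and in a real code the global round-robin counter would need synchronized access, which is exactly why it becomes a contention point:

```python
import random

class AsyncRoundRobin:
    """Each processor keeps its own private target pointer."""
    def __init__(self, me, P):
        self.P = P
        self.target = (me + 1) % P
    def next_donor(self):
        donor = self.target
        self.target = (self.target + 1) % self.P  # increment modulo P
        return donor

class GlobalRoundRobin:
    """One target variable shared by all processors."""
    target = 0
    def __init__(self, P):
        self.P = P
    def next_donor(self):
        donor = GlobalRoundRobin.target
        GlobalRoundRobin.target = (donor + 1) % self.P
        return donor

def random_polling(me, P):
    """Pick a uniformly random donor other than ourselves (assumes P >= 2)."""
    donor = random.randrange(P - 1)
    return donor if donor < me else donor + 1
```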

15 Termination Detection  Dijkstra's Token Termination Detection Algorithm  Based on passing a token around a logical ring; P0 initiates the token when it becomes idle; a processor holds the token until it has completed its work, and then passes it to the next processor; when P0 receives the token again, all processors have completed  However, a processor may receive more work after becoming idle

16 Algorithm Continued…  This case is handled using white and black token colors  A processor can be in one of two states: black or white  Initially, the token is white and all processors are in the white state

17 Algorithm Continued…  If a processor Pj sends work to Pi with i < j, the token must traverse the ring again  Processor Pj becomes black when it sends work to such a Pi  When Pj completes its work, it colors the token black if Pj itself is black, passes the token to the next processor, and then turns white  When P0 receives a black token, it reinitiates the ring
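
A round-based simulation of these rules; the driver code and the simplification that all processors are already idle while the token circulates are assumptions for illustration:

```python
WHITE, BLACK = "white", "black"

def on_send_work(proc_color, j, i):
    """Record a work transfer from Pj to Pi."""
    if i < j:
        proc_color[j] = BLACK          # a backward send taints the sender

def token_round(proc_color, P):
    """One traversal of the ring (all processors assumed idle).
    proc_color is mutated in place; returns the token's final color."""
    token = WHITE
    for j in range(P):                 # token travels P0 -> P1 -> ... -> P0
        if proc_color[j] == BLACK:
            token = BLACK              # a backward work-send was observed
        proc_color[j] = WHITE          # processor turns white after passing
    return token

# Example: P3 sent work to P1 after going idle, so the first traversal
# returns a black token and P0 must reinitiate the ring.
colors = [WHITE] * 4
on_send_work(colors, 3, 1)
assert token_round(colors, 4) == BLACK   # not yet proven terminated
assert token_round(colors, 4) == WHITE   # second traversal succeeds
```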

18 Tree Based Termination Detection  Uses weights  Initially, processor 0 has weight 1  When a processor transfers work to another processor, the weight is split in half between the two processors  When a processor finishes its work, its weight is returned  Termination is detected when processor 0 gets back a weight of 1  Piggybacks on the DFS algorithm's own messages; no separate communication steps  Figure 11.10

19  Minimal Spanning Tree, Single-Source and All-pairs Shortest Paths

20 Minimal Spanning Tree – Prim's Algorithm  A spanning tree of a graph G(V, E) is a tree containing all vertices of G  MST – the spanning tree with the minimum sum of edge weights  Vertices are added to a set Vt that holds the vertices of the MST; initially it contains an arbitrary vertex r as the root vertex

21 Minimal Spanning Tree – Prim's Algorithm  An array d such that, for each v in (V − Vt), d[v] holds the weight of the least-weight edge between v and any vertex in Vt; initially d[v] = w[r, v]  Find the vertex u in (V − Vt) with minimum d[u] and add it to Vt  Update d  Time complexity – O(n^2)
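
A serial sketch of this O(n^2) formulation; the adjacency-matrix representation with float('inf') for missing edges is an assumption:

```python
def prim_mst(w, r=0):
    """Prim's algorithm on an adjacency matrix w (w[i][j] = edge weight,
    float('inf') if no edge). Returns the MST as a parent array."""
    n = len(w)
    in_tree = [False] * n
    d = w[r][:]              # d[v] = lightest edge weight from v into Vt
    parent = [r] * n
    in_tree[r] = True
    for _ in range(n - 1):
        # Find the closest vertex not yet in the tree: an O(n) scan.
        u = min((v for v in range(n) if not in_tree[v]), key=lambda v: d[v])
        in_tree[u] = True
        # Update d with the edges from the newly added vertex u.
        for v in range(n):
            if not in_tree[v] and w[u][v] < d[v]:
                d[v] = w[u][v]
                parent[v] = u
    return parent
```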

22 Parallelization  The vertex set V and the d array are partitioned across P processors  Each processor finds the local minimum in its portion of d  The global minimum across all portions of d is then obtained by a reduction onto one processor  That processor identifies the next vertex u and broadcasts it to all processors

23 Parallelization  All processors update d; the owner of u marks u as belonging to Vt  The process responsible for v must know w[u, v] to update d[v], hence the 1-D block mapping of the adjacency matrix  Complexity – O(n^2/P) for computation plus O(n log P) for communication
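
A serially simulated sketch of the minimum-selection round; in an MPI code the final min over local minima would typically be an MPI_Allreduce with the MPI_MINLOC operation rather than this loop:

```python
def parallel_prim_step(d, in_tree, P):
    """One minimum-selection round of parallel Prim's, simulated serially:
    each of P processors scans its block of d for a local minimum, then a
    reduction picks the global winner. Assumes some vertex is still
    outside the tree."""
    n = len(d)
    block = (n + P - 1) // P                # 1-D block size per processor
    local_minima = []
    for p in range(P):                      # each loop body = one processor
        lo, hi = p * block, min((p + 1) * block, n)
        candidates = [(d[v], v) for v in range(lo, hi) if not in_tree[v]]
        if candidates:
            local_minima.append(min(candidates))
    return min(local_minima)[1]             # the next vertex u to add to Vt
```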

24 Single-Source Shortest Path – Dijkstra's Algorithm  Finds the shortest paths from the source vertex to all other vertices  Follows a structure similar to Prim's  Instead of the d array, an array l that maintains the shortest known path lengths is used  Follows a similar parallelization scheme
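
Since the deck presents Dijkstra's as Prim's with a different update rule, a sketch mirroring the Prim's sketch above, changing only the relaxation step, may help:

```python
def dijkstra_sssp(w, s=0):
    """Dijkstra's algorithm in the O(n^2) array form used by the slides.
    w: adjacency matrix with float('inf') for missing edges."""
    n = len(w)
    done = [False] * n
    l = w[s][:]              # l[v] = shortest known length from s to v
    l[s] = 0
    done[s] = True
    for _ in range(n - 1):
        u = min((v for v in range(n) if not done[v]), key=lambda v: l[v])
        done[u] = True
        for v in range(n):
            # The only change from Prim's: compare path lengths,
            # not single edge weights.
            if not done[v] and l[u] + w[u][v] < l[v]:
                l[v] = l[u] + w[u][v]
    return l
```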

25 Single Source Shortest Path on GPUs

26 SSSP on GPUs  A single kernel is not enough, since the cost array Ca cannot be updated while it is being accessed  Hence costs are updated in a temporary array Ua and merged back into Ca by a second kernel
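
A serial sketch of the two-array scheme: phase 1 relaxes edges into Ua only, phase 2 merges Ua back into Ca, so Ca is never read and written in the same phase. The names Ca and Ua follow the slides; the mask array and the loop driver are assumptions modeled on the Harish and Narayanan formulation:

```python
def gpu_style_sssp(adj, n, source):
    """Two-array SSSP relaxation, simulated serially.
    adj: dict mapping each vertex to a list of (neighbor, weight) pairs."""
    INF = float("inf")
    Ca = [INF] * n           # settled cost array
    Ua = [INF] * n           # temporary "updating" cost array
    mask = [False] * n       # vertices whose cost changed last round
    Ca[source] = Ua[source] = 0
    mask[source] = True
    while any(mask):
        # Kernel 1: each masked vertex relaxes its edges into Ua only.
        for v in range(n):
            if mask[v]:
                mask[v] = False
                for w, wt in adj[v]:
                    Ua[w] = min(Ua[w], Ca[v] + wt)
        # Kernel 2: merge Ua back into Ca; re-mask improved vertices.
        for v in range(n):
            if Ua[v] < Ca[v]:
                Ca[v] = Ua[v]
                mask[v] = True
            Ua[v] = Ca[v]
    return Ca
```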

27 All-Pairs Shortest Paths  Find the shortest paths between all pairs of vertices  Dijkstra's single-source shortest path algorithm can be run from every vertex  Two approaches

28 All-Pairs Shortest Paths  Source-partitioned formulation: partition the vertices across processors; works well if p <= n; no communication; can at best use only n processors; time complexity?  Source-parallel formulation: parallelize SSSP for each vertex across a subset of processors, and do this for all vertices with different subsets of processors; a hierarchical formulation; exploits more parallelism; time complexity?

29 All-Pairs Shortest Paths – Floyd's Algorithm  Consider a subset S = {v1, v2, …, vk} of vertices for some k <= n  Consider finding the shortest path between vi and vj  Consider all paths from vi to vj whose intermediate vertices belong to the set S; let p_{i,j}^{(k)} be the minimum-weight path among them, with weight d_{i,j}^{(k)}

30 All-Pairs Shortest Paths – Floyd's Algorithm  If vk is not in the shortest path, then p_{i,j}^{(k)} = p_{i,j}^{(k-1)}  If vk is in the shortest path, then the path is broken into two parts – from vi to vk, and from vk to vj  So d_{i,j}^{(k)} = min{ d_{i,j}^{(k-1)}, d_{i,k}^{(k-1)} + d_{k,j}^{(k-1)} }  The length of the shortest path from vi to vj is given by d_{i,j}^{(n)}  In general, the solution is the matrix D^{(n)}
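
The recurrence translates directly into the classic triple loop; after the kth outer iteration, d[i][j] equals d_{i,j}^{(k)}:

```python
def floyd_apsp(w):
    """Floyd's algorithm. w: adjacency matrix with float('inf') for
    missing edges and 0 on the diagonal; modified in place and returned."""
    n = len(w)
    d = w
    for k in range(n):
        for i in range(n):
            for j in range(n):
                # d_{i,j}^{(k)} = min(d_{i,j}^{(k-1)},
                #                     d_{i,k}^{(k-1)} + d_{k,j}^{(k-1)})
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d
```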

31 Parallel Formulation – 2-D Block Mapping  Processors are laid out in a 2-D mesh  During the kth iteration, each process P_{i,j} needs certain segments of the kth row and kth column of the D^{(k-1)} matrix  To compute d_{l,r}^{(k)}, the following are needed: d_{l,k}^{(k-1)} (from a process along the same process row) and d_{k,r}^{(k-1)} (from a process along the same process column)  Figure 10.8

32 Parallel Formulation – 2-D Block Mapping  During the kth iteration, each of the sqrt(p) processes containing part of the kth row sends it to the sqrt(p) − 1 other processes in the same process column  Similarly for the kth column along process rows  Figure 10.8  Time complexity?

33 APSP on GPUs  The space complexity of Floyd's algorithm is O(V^2) – impossible to go beyond small graphs on GPUs  Uses V^2 threads  A single O(V) operation looping over O(V^2) threads can exhibit slowdown due to the high context-switching overhead between threads  Alternative: use Dijkstra's – run the SSSP algorithm from every vertex in the graph  This requires only the final output to be of size O(V^2)  Intermediate outputs on the GPU can be O(V) and copied out to CPU memory
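
A sketch of the Dijkstra-based scheme, reusing the gpu_style_sssp sketch from the SSSP slide above: each run keeps only O(V) working state, and only the accumulated result is O(V^2):

```python
def gpu_style_apsp(adj, n):
    """APSP by running SSSP from every vertex; each call needs only
    O(V) intermediate memory (on the GPU), and only the collected
    result grows to O(V^2)."""
    all_pairs = []
    for s in range(n):
        row = gpu_style_sssp(adj, n, s)   # O(V) intermediate state
        all_pairs.append(row)             # "copy row out to CPU memory"
    return all_pairs
```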

34 APSP on GPUs

35 Sources/References  Paper: A Scalable Distributed Parallel Breadth-First Search Algorithm on BlueGene/L. Yoo et al. SC 2005.  Paper: Accelerating large graph algorithms on the GPU using CUDA. Harish and Narayanan. HiPC 2007.

36 Speedup Anomalies in DFS  The overall work (space searched) in parallel DFS can be smaller or larger than in sequential DFS  Can cause superlinear or sublinear speedups  Figures 11.18, 11.19

37 Parallel Formulation – Pipelining  In the 2-D formulation, the kth iteration in all processes starts only after the (k−1)th iteration completes in all processes  With pipelining, a process can start working on the kth iteration as soon as it has computed the (k−1)th iteration and has the relevant parts of the D^{(k-1)} matrix  Example: Figure 10.9  Time complexity?


