Parallel graph algorithms
Antonio-Gabriel Sturzu, SCPD
Adela Diana Almasi, SCPD
Iulia Alexandra Floroiu, ISI
Contents
Floyd-Warshall
Dijkstra
Prim
Bellman-Ford
Connected Components
Floyd-Warshall
All-pairs shortest path algorithm based on dynamic programming.
It uses intermediate nodes in a shortest path between two nodes.
At step k it computes, for each pair of vertices, the shortest path whose intermediate nodes lie in the set {1, 2, ..., k}.
Floyd-Warshall
The serial algorithm is:
  for k := 0 to N-1
    for i := 0 to N-1
      for j := 0 to N-1
        I[i,j] = min(I[i,j], I[i,k] + I[k,j])
I is the cost matrix.
For the distributed algorithm we can divide the cost matrix among P processors in a one-dimensional manner.
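A minimal serial sketch of the triple loop in C (the matrix size and the INF sentinel are illustrative; INF is kept small enough that INF + INF does not overflow):

    #include <stdio.h>

    #define N 4
    #define INF 1000000  /* "no edge" sentinel */

    /* Serial Floyd-Warshall: I[i][j] becomes the shortest i->j distance. */
    void floyd_warshall(int I[N][N]) {
        for (int k = 0; k < N; k++)
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++)
                    if (I[i][k] + I[k][j] < I[i][j])
                        I[i][j] = I[i][k] + I[k][j];
    }

    int main(void) {
        int I[N][N] = {
            {0, 3, INF, 7},
            {8, 0, 2, INF},
            {5, INF, 0, 1},
            {2, INF, INF, 0}
        };
        floyd_warshall(I);
        printf("d(0,2) = %d\n", I[0][2]);  /* 0 -> 1 -> 2, cost 3 + 2 = 5 */
        return 0;
    }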
Floyd-Warshall
Each processor holds N/P rows of the cost matrix, where N is the dimension of the matrix and P is the number of processors.
If processor p holds the rows from pi to pj, the algorithm it executes looks like this:
  for k = 0 to N-1
    for i = pi to pj
      for j = 0 to N-1
        I[i,j] = min(I[i,j], I[i,k] + I[k,j])
At each step k the processors need the k-th row of the cost matrix.
For this, the processor that owns this row must broadcast it to the other processors; this can be done in log P steps.
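A sketch of that row-partitioned version in MPI, under the assumption that N is divisible by the number of processes; the owner of row k follows from the block distribution, and MPI_Bcast performs the log P-step broadcast mentioned above:

    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    /* Row-block distributed Floyd-Warshall: each rank owns rows_per_proc
       consecutive rows of the N x N cost matrix, stored flat in `local`. */
    void fw_mpi(int *local, int N, int rank, int nprocs) {
        int rows_per_proc = N / nprocs;        /* assumes N % nprocs == 0 */
        int *row_k = malloc(N * sizeof(int));
        for (int k = 0; k < N; k++) {
            int owner = k / rows_per_proc;     /* rank that holds row k */
            if (rank == owner)
                memcpy(row_k, &local[(k % rows_per_proc) * N], N * sizeof(int));
            MPI_Bcast(row_k, N, MPI_INT, owner, MPI_COMM_WORLD);
            for (int i = 0; i < rows_per_proc; i++)
                for (int j = 0; j < N; j++)
                    if (local[i * N + k] + row_k[j] < local[i * N + j])
                        local[i * N + j] = local[i * N + k] + row_k[j];
        }
        free(row_k);
    }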
Floyd-Warshall
The total time for this algorithm is T = c*N^3/P + N*log P*(Ts + Tw*N), where:
Ts is the time spent preparing a communication (startup latency);
Tw is the time needed to communicate one word of the message.
Floyd-Warshall
Results
For a 500x500 cost matrix we obtained the following results:
  Nr proc   Time (s)
  1         0.840
  2         0.558
  3         0.378
  4         0.283
  5         0.230
  6         0.194
  7         0.170
  8         0.153
  9         0.286
It can be observed that a bottleneck is reached starting from 9 processors.
Floyd-Warshall
For a 900x900 cost matrix we obtained the following results:
  Nr proc   Time (s)
  1         4.77
  2         3.191
  3         2.142
  4         1.609
  5         1.300
  6         1.085
  7         0.957
  8         0.847
  9         0.991
The bottleneck occurs when the communication time becomes too high as the number of processors increases.
Floyd-Warshall
We also used OpenMP to parallelize the algorithm.
Because the outer k loop can't be parallelized, we parallelized the i loop using the static schedule, because it has the least overhead (see the sketch below).
For the 900x900 cost matrix we obtained the following results:
  Nr threads  Time (s)
  1           5.33
  2           4.050
  3           2.7
  4           2.075
  5           1.644
  6           1.397
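A minimal sketch of that OpenMP variant (the global matrix layout is an assumption; schedule(static) matches what the slide describes):

    #include <omp.h>

    #define N 900

    int I[N][N];  /* cost matrix, assumed initialized elsewhere */

    void fw_openmp(void) {
        for (int k = 0; k < N; k++) {
            /* For a fixed k, each i iteration writes only its own row,
               and row k is unchanged by its own iteration, so the rows
               can be split statically among the threads. */
            #pragma omp parallel for schedule(static)
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++)
                    if (I[i][k] + I[k][j] < I[i][j])
                        I[i][j] = I[i][k] + I[k][j];
        }
    }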
Dijkstra
Greedy algorithm.
Maintains a priority queue with the vertices whose minimum distances to the source haven't been calculated yet.
At each step it extracts from the queue the vertex closest to the source and relaxes all the edges that leave it.
There are two possible implementations.
We parallelized the O(V^2) implementation, which is preferable for dense graphs.
Dijkstra
The O(E lg V) implementation uses a min-heap for the priority queue and is superior for sparse graphs.
The pseudocode for the O(V^2) implementation looks like this:
  for i = 1 to n
    d[i] = inf
  d[s] = 0; Q <- V[G]
  while (Q not empty)
    u = extract_min(Q)
    for each v adjacent to u
      if (d[u] + w(u,v) < d[v]) d[v] = d[u] + w(u,v)
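A self-contained C sketch of this O(V^2) variant, using the data layout described on the next slide (a distance vector plus an in-queue flag per vertex; sizes and names are illustrative):

    #define N 6
    #define INF 1000000

    /* O(V^2) Dijkstra over a cost matrix w (INF = no edge), source s. */
    void dijkstra(const int w[N][N], int s, int d[N]) {
        int in_queue[N];
        for (int i = 0; i < N; i++) { d[i] = INF; in_queue[i] = 1; }
        d[s] = 0;
        for (int step = 0; step < N; step++) {
            int u = -1;
            for (int v = 0; v < N; v++)      /* extract_min by linear scan */
                if (in_queue[v] && (u == -1 || d[v] < d[u])) u = v;
            if (u == -1 || d[u] == INF) break;  /* rest are unreachable */
            in_queue[u] = 0;
            for (int v = 0; v < N; v++)      /* relax all edges leaving u */
                if (in_queue[v] && d[u] + w[u][v] < d[v])
                    d[v] = d[u] + w[u][v];
        }
    }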
Dijkstra
We implemented the priority queue as a vector of distances, plus a vector that tells for each vertex whether it is in the queue or not.
For the serial implementation we represented the graph both with a cost matrix and with adjacency lists.
For the parallel version we used a cost matrix and partitioned it in blocks of columns.
So each processor keeps N/P columns of the matrix.
Dijkstra
Each processor also keeps its own distance vector and visited vector; the visited vector tells for each node whether the minimum distance to the source has been calculated yet.
With this partitioning scheme we can parallelize both the computation of the node closest to the source and the relaxation step.
In each step every processor computes the locally closest node to the source.
A global reduction is applied to compute the globally closest node to the source. After this, the vertex is broadcast to all nodes and each of them performs the relaxation step locally (see the sketch below).
Dijkstra
The processor that owns the globally closest node must also mark it as visited in its local visited vector.
The time complexity of this parallel algorithm is c*V^2/P + V*log P*(Ts + Tw).
Dijkstra
Results
For a complete graph with 10000 nodes we obtained the following results (AL = adjacency lists, CM = cost matrix):
  Nr proc   Time (s)
  1 (AL)    9.85
  1 (CM)    1.3
  2         0.790
  3         0.61
  4         0.494
  5         0.481
  6         0.436
  7         0.422
  8         0.40
As can be seen, from 5 processors on the communication overhead becomes too high and no significant speedup is observed.
Dijkstra
It can also be observed that for dense graphs the adjacency list implementation is much slower than the one with the cost matrix.
One possible explanation is that because the adjacency lists are so large, too many cache misses are produced, slowing down the algorithm.
Prim
Computes a minimum spanning tree in a greedy manner.
Grows a tree by adding, at each step, a minimum cost edge that joins some vertex in the tree with some vertex that isn't part of the tree.
It starts from an arbitrary vertex.
To efficiently find the minimum cost edge at each step, it maintains a priority queue with the minimum cost edges that connect the tree to the vertices that aren't part of it yet.
The priority queue can be implemented with a vector or with a heap, depending on the graph's structure.
Prim
The pseudocode looks like this:
  Q <- V[G]
  for each u of Q
    key[u] = inf
  key[r] = 0
  while (Q is not empty)
    u = extract_min(Q)
    for each v adjacent to u
      if v ∈ Q and w(u,v) < key[v]
        prev[v] = u; key[v] = w(u,v)
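A compact C sketch of the O(V^2) variant of this pseudocode over a cost matrix (key doubles as the vector-based priority queue described on the next slide; sizes and names are illustrative):

    #define N 6
    #define INF 1000000

    /* O(V^2) Prim from root r; prev[] records the chosen tree edges.
       Returns the total cost of the minimum spanning tree. */
    int prim(const int w[N][N], int r, int prev[N]) {
        int key[N], in_queue[N], total = 0;
        for (int v = 0; v < N; v++) { key[v] = INF; in_queue[v] = 1; prev[v] = -1; }
        key[r] = 0;
        for (int step = 0; step < N; step++) {
            int u = -1;
            for (int v = 0; v < N; v++)      /* extract_min by linear scan */
                if (in_queue[v] && (u == -1 || key[v] < key[u])) u = v;
            in_queue[u] = 0;
            total += key[u];
            for (int v = 0; v < N; v++)      /* lower the keys of fringe vertices */
                if (in_queue[v] && w[u][v] < key[v]) {
                    key[v] = w[u][v];
                    prev[v] = u;
                }
        }
        return total;
    }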
Prim
We will additionally need a vector that tells us whether a node is in the priority queue or not.
If the priority queue is implemented with a vector, the algorithm has O(V^2) complexity, which is good for dense graphs.
If implemented with a heap and adjacency lists, the algorithm yields O(E lg V) complexity, which is preferable for sparse graphs.
Prim
We parallelized the O(V^2) algorithm and represented the graph with a cost matrix.
As for Dijkstra, we partitioned the cost matrix in blocks of columns.
Each node also holds locally a priority queue, a prev vector, the vector that tells us whether a node is in the local priority queue or not, and a local variable that holds the local minimum cost of the spanning tree.
In this manner we can parallelize both the calculation of the node closest to the minimum spanning tree and the update of the values in the local priority queue and prev vector.
Prim
The node closest to the tree is computed with a global reduction operation that can be done in log P steps.
The node that holds the closest vertex must also mark it as removed from the local priority queue and update the local minimum cost of the spanning tree.
After the value of the closest vertex is broadcast, each node updates its local priority queue and prev vector.
Prim
The time complexity of this distributed algorithm is c*V^2/P + V*log P*(Ts + Tw).
For a 10000 node complete graph we obtained the following results (AL = adjacency lists, CM = cost matrix):
  Nr proc   Time (s)
  1 (AL)    9.6
  1 (CM)    1
  2         0.621
  3         0.497
  4         0.413
  5         0.402
  6         0.385
  7         0.370
  8         0.353
Prim
It can be seen that from 5 processors on very little speedup is obtained, because the communication time begins to dominate the computation time.
A possible optimization, found by researchers at the University of Illinois, is to reduce the number of global reductions by adding multiple nodes to the minimum spanning tree at each iteration.
Although Prim's algorithm adds one vertex per iteration, their paper proves that multiple vertices can be added per iteration, though not always and not a fixed number of them.
Prim
The idea is that each processor calculates the K nodes closest to the tree, and then a global reduction computes the K globally closest nodes.
After this, each processor performs two checks on the K vertices.
A vertex is checked only if all the preceding vertices have passed the checks.
The first check verifies whether any of the preceding vertices has a shorter distance to the current vertex.
The second one checks whether any of the preceding vertices has a shorter distance to a vertex not in the tree than the current vertex's distance to the spanning tree.
Prim
If both checks pass, the vertex is considered valid, and the process continues until an invalid vertex is found.
After each processor performs these checks, another global reduction computes the global set of vertices that can be added to the spanning tree.
Bellman-Ford
Can calculate single-source shortest paths with negative edge weights and detect negative cycles.
The pseudocode is straightforward:
  for i = 1 to V
    d[i] = inf
  d[s] = 0
  repeat V-1 times
    for each edge (u,v) of G[E]
      if (d[u] + w(u,v) < d[v]) d[v] = d[u] + w(u,v)
Bellman-Ford
Although its time complexity is O(VE), in practice a small optimization that dramatically reduces the number of passes through the edges (stopping as soon as a full pass relaxes nothing) can make it faster than Dijkstra with a heap.
In order to parallelize the algorithm we pass through the edges in this manner:
  for j = 1 to V
    for each v ∈ Adj[j]
      if (d[j] + w(j,v) < d[v]) d[v] = d[j] + w(j,v)
In this form we can't parallelize the outer j loop, because we could end up writing concurrently to the same locations, but we can parallelize the inner v loop (see the sketch below).
We used OpenMP for this, but the results weren't good because of the overhead of thread creation and destruction.
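A minimal OpenMP sketch of that form, including the early-exit optimization (the adjacency-list layout deg/adj/wgt is an assumption; re-opening the parallel region for every j also illustrates the thread-management overhead mentioned above):

    #include <omp.h>

    #define INF 1000000

    /* The outer j loop stays serial; the loop over j's neighbours is split
       among threads. Each relaxation writes a distinct d[v] (simple graph
       assumed), so no two threads race on the same slot. */
    void bellman_ford(int V, int s, const int *deg, int **adj, int **wgt, int *d) {
        for (int i = 0; i < V; i++) d[i] = INF;
        d[s] = 0;
        for (int pass = 0; pass < V - 1; pass++) {
            int changed = 0;
            for (int j = 0; j < V; j++) {
                if (d[j] == INF) continue;   /* nothing to relax from j yet */
                #pragma omp parallel for reduction(|:changed)
                for (int e = 0; e < deg[j]; e++) {
                    int v = adj[j][e];
                    if (d[j] + wgt[j][e] < d[v]) {
                        d[v] = d[j] + wgt[j][e];
                        changed = 1;
                    }
                }
            }
            if (!changed) break;  /* early exit: a full pass relaxed nothing */
        }
    }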
Connected Components
Can be computed using BFS or DFS.
To do this in a distributed manner we partition the graph among the nodes.
The adjacency matrix is partitioned in blocks of rows, and every node gets a subgraph of the initial graph.
After this, each processor computes the spanning forest of its local subgraph using DFS.
Connected Components
The next step is to merge all the spanning forests together in order to obtain the final spanning forest.
To merge forest A with forest B, processor A sends all the edges of its forest to processor B.
For each edge (u,v), processor B first performs find operations on nodes u and v to determine whether they are in the same tree.
If not, it executes a union operation to merge the two trees together.
Connected Components
find and union are operations on disjoint sets.
The find(x) operation retrieves the representative element of the set that contains x.
In pseudocode it works like this:
  find(x) {
    if (x != p[x]) p[x] = find(p[x])
    return p[x]
  }
This is known as the path compression heuristic.
Connected Components
The union(x,y) operation first finds the roots of the trees that contain x and y, and then makes the root of the tree with fewer nodes point to the root of the bigger tree.
This is known as the union by rank heuristic.
The computation of the local forest takes O(V^2/P) time and the merging of the forests takes O(V log P) time, where V is the number of vertices and P is the number of processors.
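A self-contained C sketch of the two heuristics exactly as described (union by tree size with path compression; union is a C keyword, so the function is named join, and the array names are illustrative):

    #include <stdio.h>

    #define MAXN 100
    int parent[MAXN], sz[MAXN];

    /* find with path compression: every node visited on the walk to the
       root is re-pointed directly at the root. */
    int find(int x) {
        if (x != parent[x]) parent[x] = find(parent[x]);
        return parent[x];
    }

    /* union by size: the root of the smaller tree is attached to the root
       of the bigger one. Returns 1 if a merge happened, 0 if u and v were
       already in the same tree. */
    int join(int u, int v) {
        int ru = find(u), rv = find(v);
        if (ru == rv) return 0;
        if (sz[ru] < sz[rv]) { int t = ru; ru = rv; rv = t; }
        parent[rv] = ru;
        sz[ru] += sz[rv];
        return 1;
    }

    int main(void) {
        for (int i = 0; i < MAXN; i++) { parent[i] = i; sz[i] = 1; }
        join(1, 2); join(2, 3);
        printf("%d\n", find(1) == find(3));  /* prints 1: same component */
        return 0;
    }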
Conclusions
Few parallel graph algorithms are faster than their serial counterparts, because they don't scale well as the number of processors increases.