All-Pairs-Shortest-Paths for Large Graphs on the GPU
Gary J Katz 1,2, Joe Kider 1
1 University of Pennsylvania, 2 Lockheed Martin IS&GS

1 What Will We Cover?
- Quick overview of Transitive Closure and All-Pairs Shortest Path
- Uses for Transitive Closure and All-Pairs
- GPUs: what are they and why do we care?
- The GPU problem with performing Transitive Closure and All-Pairs
- The solution: the Block Processing Method
- Memory formatting in global and shared memory
- Results

2 Previous Work
- "A Blocked All-Pairs Shortest-Paths Algorithm", Venkataraman et al.
- "Parallel FPGA-based All-Pairs Shortest-Paths in a Directed Graph", Bondhugula et al.
- "Accelerating Large Graph Algorithms on the GPU Using CUDA", Harish and Narayanan

3 NVIDIA GPU Architecture Issues
- No direct access to main (CPU) memory
- The programmer must explicitly stage data through the per-multiprocessor shared (L1-like) cache
- Multiprocessors cannot synchronize with one another during a kernel
- Compute cores are simpler than CPU cores and handle branching (if statements) poorly

4 Background
A graph G = (V, E) consists of vertices V and edges E. For every pair of vertices u, v in V we want a shortest path from u to v, where the weight of a path is the sum of the weights of its edges.

5 Adjacency Matrix

6 Quick Overview of Transitive Closure
The Transitive Closure of G is defined as the graph G* = (V, E*), where E* = {(i,j) : there is a path from vertex i to vertex j in G}. (Introduction to Algorithms, T. Cormen)
Simply stated: the Transitive Closure of a graph is the list of edges for any vertices that can reach each other.
Example edges: (1,5), (2,1), (4,2), (4,3), (6,3), (8,6)
Edges added by the closure: (2,5), (8,3), (7,6), (7,3)

7 Warshall's algorithm: transitive closure (Design and Analysis of Algorithms, Chapter 8)
Computes the transitive closure of a relation (alternatively: all paths in a directed graph).
(Example of transitive closure shown as a figure on the slide.)

8 Warshall's algorithm
Main idea: a path exists between two vertices i, j iff
- there is an edge from i to j; or
- there is a path from i to j going through vertex 1; or
- there is a path from i to j going through vertex 1 and/or 2; or
- ...
- there is a path from i to j going through vertices 1, 2, ... and/or k; or
- there is a path from i to j going through any of the other vertices.

9 Warshall's algorithm
Idea: dynamic programming.
- Let V = {1, ..., n} and, for k ≤ n, V_k = {1, ..., k}.
- For any pair of vertices i, j ∈ V, identify all paths from i to j whose intermediate vertices are all drawn from V_k: P_ij^k = {p1, p2, ...}. If P_ij^k is non-empty, then R^k[i,j] = 1.
- For any pair of vertices i, j the answer is R^n[i,j], i.e. the matrix R^n.
- Starting with R^0 = A, the adjacency matrix, compute R^1, ..., R^(k-1), R^k, ..., R^n.
(Figure: paths p1, p2 from i to j with all intermediate vertices inside V_k.)

10 Warshall's algorithm
Idea: dynamic programming.
- p ∈ P_ij^k: p is a path from i to j with all intermediate vertices in V_k.
- If k is not on p, then p is also a path from i to j with all intermediate vertices in V_(k-1): p ∈ P_ij^(k-1).
(Figure: a path p from i to j that avoids k, staying inside V_(k-1).)

11 Warshall's algorithm
Idea: dynamic programming.
- p ∈ P_ij^k: p is a path from i to j with all intermediate vertices in V_k.
- If k is on p, then we break p into p1 and p2, where
  - p1 is a path from i to k with all intermediate vertices in V_(k-1),
  - p2 is a path from k to j with all intermediate vertices in V_(k-1).
(Figure: a path p from i to j split at k into p1 and p2.)

12 Warshall's algorithm
In the k-th stage, determine whether a path exists between two vertices i, j using just vertices among 1, ..., k:
R^(k)[i,j] = R^(k-1)[i,j]                          (path using just 1, ..., k-1)
             or (R^(k-1)[i,k] and R^(k-1)[k,j])    (path from i to k and from k to j, each using just 1, ..., k-1)
(Figure: the k-th stage, routing i to j through k.)
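As a concrete reference, here is a minimal CPU sketch of this recurrence in C, computed in place on a dense 0/1 adjacency matrix; the row-major layout and the function name are illustrative assumptions, not the paper's code.

```c
/* Warshall's transitive closure, in place on a dense n x n 0/1 matrix R
 * (row-major: R[i*n + j] == 1 iff i reaches j). Implements
 *   R[i,j] = R[i,j] | (R[i,k] & R[k,j])  for k = 0 .. n-1. */
void warshall(unsigned char *R, int n)
{
    for (int k = 0; k < n; k++)
        for (int i = 0; i < n; i++)
            if (R[i * n + k])            /* row i can reach k: OR in row k */
                for (int j = 0; j < n; j++)
                    R[i * n + j] |= R[k * n + j];
}
```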

13 Quick Overview of All-Pairs Shortest Paths
The All-Pairs Shortest Path of G is defined for every pair of vertices u, v ∈ V as the shortest (least-weight) path from u to v, where the weight of a path is the sum of the weights of its constituent edges. (Introduction to Algorithms, T. Cormen)
Simply stated: the All-Pairs Shortest Path of a graph is the optimal list of vertices connecting any two vertices that can reach each other.
Direct edges: 1 → 5, 2 → 1, 4 → 2, 4 → 3, 6 → 3, 8 → 6
Composed shortest paths: 2 → 1 → 5, 8 → 6 → 3, 7 → 8 → 6, 7 → 8 → 6 → 3

14 Uses for Transitive Closure and All-Pairs

15 Floyd-Warshall Algorithm
Pass 1: finds all connections that are connected through vertex 1.
Pass 6: finds all connections that are connected through vertex 6.
Pass 8: finds all connections that are connected through vertex 8.
Running time: O(V^3)
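For reference, a minimal CPU sketch of the weighted variant; the dense row-major layout, FLT_MAX as "no edge", and the function name are assumptions for illustration.

```c
#include <float.h>

/* Classic O(V^3) Floyd-Warshall. D is a dense n x n row-major matrix of
 * path weights: FLT_MAX means "no edge yet", the diagonal is 0. After
 * the run, D[i*n + j] is the weight of the shortest path from i to j. */
void floyd_warshall(float *D, int n)
{
    for (int k = 0; k < n; k++)          /* pass k: allow routing through k */
        for (int i = 0; i < n; i++) {
            float dik = D[i * n + k];
            if (dik == FLT_MAX) continue;            /* i cannot reach k yet */
            for (int j = 0; j < n; j++) {
                if (D[k * n + j] == FLT_MAX) continue;  /* k cannot reach j */
                float via_k = dik + D[k * n + j];
                if (via_k < D[i * n + j])
                    D[i * n + j] = via_k;            /* relax i -> k -> j */
            }
        }
}
```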

16 Parallel Floyd-Warshall
There is a shortcoming to this algorithm, though: each processing element needs global access to memory. This can be an issue for GPUs.

17 The Question
How do we calculate the transitive closure on the GPU so as to:
- take advantage of shared memory
- accommodate data sizes that do not fit in memory
Can we perform partial processing of the data?

18 Block Processing of Floyd-Warshall
(Figure: the data matrix is split into blocks, each handled by a GPU multiprocessor with its own shared memory.)
What organizational structure allows block processing?

19 Block Processing of Floyd-Warshall

20 Block Processing of Floyd-Warshall
(Figure: example data matrix, N = 8.)

21 Block Processing of Floyd-Warshall
W[i,j] = W[i,j] | (W[i,k] && W[k,j])

K = 1:  [i,j]  ->  [i,k] & [k,j]
(5,1) -> (5,1) & (1,1)    (8,1) -> (8,1) & (1,1)
(5,4) -> (5,1) & (1,4)    (8,4) -> (8,1) & (1,4)

K = 4:  [i,j]  ->  [i,k] & [k,j]
(5,1) -> (5,4) & (4,1)    (8,1) -> (8,4) & (4,1)
(5,4) -> (5,4) & (4,4)    (8,4) -> (8,4) & (4,4)

For each pass k, the cells retrieved must already have been processed to at least pass k-1.

22 Block Processing of Floyd-Warshall
W[i,j] = W[i,j] | (W[i,k] && W[k,j])
Putting it all together: processing K = [1-4]
Pass 1: i = [1-4], j = [1-4]
Pass 2: i = [5-8], j = [1-4] and i = [1-4], j = [5-8]
Pass 3: i = [5-8], j = [5-8]
(A sketch of this tile-restricted update follows below.)
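A minimal CPU sketch of the update each pass performs on one tile; the helper name process_tile, the half-open index ranges, and the row-major layout are illustrative assumptions.

```c
/* One tile-restricted pass of W[i,j] = W[i,j] | (W[i,k] && W[k,j]):
 * apply the update only for k in [k0, k1) and cells (i, j) with
 * i in [i0, i1), j in [j0, j1). W is a dense n x n 0/1 matrix. */
void process_tile(unsigned char *W, int n,
                  int i0, int i1, int j0, int j1, int k0, int k1)
{
    for (int k = k0; k < k1; k++)
        for (int i = i0; i < i1; i++)
            if (W[i * n + k])                  /* i reaches k */
                for (int j = j0; j < j1; j++)
                    W[i * n + j] |= W[k * n + j];
}
```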

23 Block Processing of Floyd-Warshall
Computing k = [5-8] (N = 8)
Range: i = [5-8], j = [5-8], k = [5-8]

24 Block Processing of Floyd-Warshall
W[i,j] = W[i,j] | (W[i,k] && W[k,j])
Putting it all together: processing K = [5-8]
Pass 1: i = [5-8], j = [5-8]
Pass 2: i = [5-8], j = [1-4] and i = [1-4], j = [5-8]
Pass 3: i = [1-4], j = [1-4]
The transitive closure is now complete for k = [1-8].
(The whole schedule is written out as tile calls below.)
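As a usage sketch of the process_tile helper above, the full N = 8 schedule as explicit calls; indices here are 0-based, so the slide range [1-4] becomes [0,4) and [5-8] becomes [4,8).

```c
/* The two rounds above as eight tile calls on an 8 x 8 matrix W. */
void closure_8x8(unsigned char *W)
{
    /* Processing K = [1-4] */
    process_tile(W, 8, 0, 4, 0, 4, 0, 4);  /* pass 1: diagonal block    */
    process_tile(W, 8, 4, 8, 0, 4, 0, 4);  /* pass 2: lower-left block  */
    process_tile(W, 8, 0, 4, 4, 8, 0, 4);  /* pass 2: upper-right block */
    process_tile(W, 8, 4, 8, 4, 8, 0, 4);  /* pass 3: lower-right block */

    /* Processing K = [5-8] */
    process_tile(W, 8, 4, 8, 4, 8, 4, 8);  /* pass 1: diagonal block    */
    process_tile(W, 8, 4, 8, 0, 4, 4, 8);  /* pass 2: lower-left block  */
    process_tile(W, 8, 0, 4, 4, 8, 4, 8);  /* pass 2: upper-right block */
    process_tile(W, 8, 0, 4, 0, 4, 4, 8);  /* pass 3: upper-left block  */
}
```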

25 Increasing the Number of Blocks Pass 1  Primary blocks are along the diagonal  Secondary blocks are the rows and columns of the primary block  Tertiary blocks are all remaining blocks

26 Increasing the Number of Blocks Pass 2  Primary blocks are along the diagonal  Secondary blocks are the rows and columns of the primary block  Tertiary blocks are all remaining blocks

27 Increasing the Number of Blocks Pass 3  Primary blocks are along the diagonal  Secondary blocks are the rows and columns of the primary block  Tertiary blocks are all remaining blocks

28 Increasing the Number of Blocks Pass 4  Primary blocks are along the diagonal  Secondary blocks are the rows and columns of the primary block  Tertiary blocks are all remaining blocks

29 Increasing the Number of Blocks Pass 5  Primary blocks are along the diagonal  Secondary blocks are the rows and columns of the primary block  Tertiary blocks are all remaining blocks

30 Increasing the Number of Blocks Pass 6  Primary blocks are along the diagonal  Secondary blocks are the rows and columns of the primary block  Tertiary blocks are all remaining blocks

31 Increasing the Number of Blocks Pass 7  Primary blocks are along the diagonal  Secondary blocks are the rows and columns of the primary block  Tertiary blocks are all remaining blocks

32 Increasing the Number of Blocks Pass 8
In total: N passes, with 3 sub-passes per pass.
 Primary blocks are along the diagonal  Secondary blocks are the rows and columns of the primary block  Tertiary blocks are all remaining blocks
(A sketch of this driver loop follows below.)
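A CPU sketch of that driver, reusing the process_tile helper sketched earlier; the block size B (assumed to divide n, giving one pass per diagonal block) and the function name are illustrative assumptions.

```c
/* Blocked transitive closure: for each diagonal block b, run the
 * primary tile, then its block row and column (secondary), then all
 * remaining tiles (tertiary). Assumes process_tile() from the earlier
 * sketch and a block size B that divides n. */
void blocked_closure(unsigned char *W, int n, int B)
{
    int nb = n / B;                         /* number of diagonal blocks */
    for (int b = 0; b < nb; b++) {
        int k0 = b * B, k1 = k0 + B;

        /* Sub-pass 1: primary (diagonal) block, self-dependent only. */
        process_tile(W, n, k0, k1, k0, k1, k0, k1);

        /* Sub-pass 2: secondary blocks in the primary's row and column. */
        for (int t = 0; t < nb; t++) {
            if (t == b) continue;
            int t0 = t * B, t1 = t0 + B;
            process_tile(W, n, k0, k1, t0, t1, k0, k1);  /* block row    */
            process_tile(W, n, t0, t1, k0, k1, k0, k1);  /* block column */
        }

        /* Sub-pass 3: tertiary blocks, everything else. */
        for (int r = 0; r < nb; r++)
            for (int c = 0; c < nb; c++)
                if (r != b && c != b)
                    process_tile(W, n, r * B, r * B + B,
                                 c * B, c * B + B, k0, k1);
    }
}
```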

33 Running it on the GPU
Using CUDA: written by NVIDIA to access the GPU as a parallel processor; there is no need to go through the graphics API.
Memory indexing: CUDA provides the grid dimension, the block dimension, the block id, and the thread id.
(A sketch of this index computation follows below.)
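A minimal sketch of how those four quantities turn into a matrix cell: one thread per cell, one kernel launch per k step. This is the simple variant, not the blocked shared-memory kernel of the talk; kernel and variable names are illustrative assumptions.

```cuda
/* One closure step over the whole matrix: thread (i, j) updates cell
 * W[i*n + j] for a fixed k. Row k and column k cannot change at step k,
 * so the parallel launch is race-free. */
__global__ void fw_step(unsigned char *W, int n, int k)
{
    /* CUDA-provided indices: blockIdx (block id), blockDim (block
     * dimension), threadIdx (thread id); gridDim is the grid dimension. */
    int i = blockIdx.y * blockDim.y + threadIdx.y;   /* row    */
    int j = blockIdx.x * blockDim.x + threadIdx.x;   /* column */
    if (i < n && j < n)
        W[i * n + j] |= (W[i * n + k] && W[k * n + j]);
}

/* Host side: one launch per k, 16 x 16 threads per block. */
void closure_on_gpu(unsigned char *dW, int n)
{
    dim3 block(16, 16);
    dim3 grid((n + 15) / 16, (n + 15) / 16);
    for (int k = 0; k < n; k++)
        fw_step<<<grid, block>>>(dW, n, k);
}
```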

34 Partial Memory Indexing
(Figure: sub-pass index ranges SP1, SP2, SP3 laid out over rows 0, 1, ..., N.)

35 Memory Format for the All-Pairs Solution
All-Pairs requires twice the memory footprint of Transitive Closure: an N x N shortest-path distance matrix plus an N x N connecting-node matrix.
(A sketch of path read-back from these two matrices follows below.)
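A sketch of how the connecting-node matrix yields the actual path: record the intermediate vertex chosen when a pair is relaxed, then expand recursively. The names via and dist, and -1 for "direct edge", are assumptions for illustration.

```c
#include <stdio.h>

/* Print the vertices strictly after i on the shortest path i ~> j.
 * via[i*n + j] holds the connecting (intermediate) node recorded when
 * the pair (i, j) was last relaxed, or -1 if the direct edge is best;
 * the matching dist matrix holds the path weights. */
void print_path(const int *via, int n, int i, int j)
{
    int k = via[i * n + j];
    if (k < 0) {                  /* direct edge: nothing in between */
        printf(" %d", j);
        return;
    }
    print_path(via, n, i, k);     /* first half:  i ~> k */
    print_path(via, n, k, j);     /* second half: k ~> j */
}

/* Usage: printf("%d", i); print_path(via, n, i, j); prints the path. */
```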

36 Results: SM cache-efficient GPU implementation compared to a standard GPU implementation

37 Results: SM cache-efficient GPU implementation compared to a standard CPU implementation and a cache-efficient CPU implementation

38 Results: SM cache-efficient GPU implementation compared to the best variant of Han et al.'s tuned code

39 Conclusion
Advantages of the algorithm:
- Relatively easy to implement
- Cheap hardware
- Much faster than the standard CPU version
- Can work for any data size
Special thanks to NVIDIA for supporting our research.

40 Backup

41 CUDA
Compute Unified Device Architecture: an extension of C that automatically creates thousands of threads to run on a graphics card, and is used to create non-graphical applications.
Pros:
- Allows the user to design algorithms that will run in parallel
- Easy to learn: an extension of C
- Has a CPU version, implemented by kicking off threads
Cons:
- Low-level, C-like language
- Requires an understanding of GPU architecture to exploit fully
(Figure: compilation flow. Integrated source foo.cu passes through cudacc (EDG C/C++ frontend, Open64 global optimizer, OCG) to produce GPU assembly foo.s (G80 SASS, foo.sass) and CPU host code foo.cpp, which is compiled with gcc / cl.)