1 High-Performance Graph Computation via Sparse Matrices
John R. Gilbert, University of California, Santa Barbara
with Aydin Buluc, LBNL; Armando Fox, UCB; Shoaib Kamil, MIT; Adam Lugowski, UCSB; Lenny Oliker, LBNL; Sam Williams, LBNL
Dynamic Graphs Institute, April 17, 2013
Support: Intel, Microsoft, DOE Office of Science, NSF
2 Outline
Motivation
Sparse matrices for graph algorithms
CombBLAS: sparse arrays and graphs on parallel machines
KDT: attributed semantic graphs in a high-level language
Specialization: getting the best of both worlds
3 Top 500 list (November 2012)
Top500 Benchmark: Solve a large system of linear equations by Gaussian elimination (PA = LU)
4 Graph 500 list (November 2012) Graph500 Benchmark: Breadth-first search in a large power-law graph
5 Floating-point vs. graphs, November 2012
Top500 (PA = LU): 17.6 Petaflops
Graph500 (breadth-first search): 15.3 Terateps
17.6 Peta / 15.3 Tera is about 1,100
6 Floating-point vs. graphs, November 2012
Top500 (PA = LU): 17.6 Petaflops
Graph500 (breadth-first search): 15.3 Terateps
Nov 2012: 17.6 Peta / 15.3 Tera ~ 1,100
Nov 2010: 2.5 Peta / 6.6 Giga ~ 380,000
7 The challenge of the software stack
By analogy to numerical scientific computing...
Basic Linear Algebra Subroutines (BLAS): C = A*B, y = A*x, μ = x^T y
[Figure: Ops/Sec vs. Matrix Size]
What should the combinatorial BLAS look like?
9 Identification of Primitives
Sparse array-based primitives:
Sparse matrix-matrix multiplication (SpGEMM)
Element-wise operations
Sparse matrix-dense vector multiplication
Sparse matrix indexing
Matrices over various semirings: (+, ×), (min, +), (or, and), …
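The idea of running one primitive over different semirings can be illustrated with a small sketch. This is plain Python, not the Combinatorial BLAS API; the names `Semiring`, `spmv`, and the 3-vertex example matrix are hypothetical and chosen only for illustration.

```python
from collections import namedtuple

# A semiring here is an "add" operation, a "multiply" operation,
# and the identity element of "add". (Hypothetical helper type.)
Semiring = namedtuple("Semiring", ["add", "mul", "zero"])

PLUS_TIMES = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0)
MIN_PLUS = Semiring(min, lambda a, b: a + b, float("inf"))
OR_AND = Semiring(lambda a, b: a or b, lambda a, b: bool(a) and bool(b), False)


def spmv(A, x, n_rows, sr):
    """y = A * x over semiring sr; A is a dict {(row, col): value} of stored entries."""
    y = [sr.zero] * n_rows
    for (i, j), a_ij in A.items():
        y[i] = sr.add(y[i], sr.mul(a_ij, x[j]))
    return y


# Hypothetical example: A[i, j] is the weight of edge j -> i.
A = {(1, 0): 2.0, (2, 0): 5.0, (2, 1): 1.0}
inf = float("inf")

# (+, x): ordinary matrix-vector product.
y_plus_times = spmv(A, [1.0, 1.0, 1.0], 3, PLUS_TIMES)

# (min, +): cheapest way to reach each vertex by one edge from the given distances.
y_min_plus = spmv(A, [0.0, inf, inf], 3, MIN_PLUS)

# (or, and): vertices reachable in one step from the frontier {0}.
y_or_and = spmv(A, [True, False, False], 3, OR_AND)
```

The same loop serves all three cases; only the pair of operations changes, which is the point of phrasing graph primitives over semirings.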
10 Multiple-source breadth-first search
[Figure: sparse matrices A^T and X]
11 Multiple-source breadth-first search
[Figure: A^T times X gives A^T X]
Sparse array representation => space efficient
Sparse matrix-matrix multiplication => work efficient
Three possible levels of parallelism: searches, vertices, edges
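A serial sketch of multiple-source BFS may help make the A^T X step concrete. It assumes each column of X holds the current frontier of one search and stores that column as a Python set; the function name `multi_source_bfs` and the example graph are hypothetical, and nothing here is parallel.

```python
def multi_source_bfs(adj, sources):
    """Run one BFS per source in lockstep.

    adj maps each vertex to the list of its out-neighbours (the nonzero
    pattern of A). Column s of the frontier matrix X is stored as a set.
    Returns levels[s][v] = BFS depth of v in the search started at sources[s].
    """
    levels = [{s: 0} for s in sources]
    X = [{s} for s in sources]  # sparse frontier matrix, one column per search
    depth = 0
    while any(X):
        depth += 1
        next_X = []
        for col, frontier in enumerate(X):
            # One column of A^T X over the (or, and) semiring:
            # the union of the neighbour sets of the frontier vertices.
            reached = set()
            for u in frontier:
                reached.update(adj.get(u, ()))
            # Mask out vertices this search has already visited.
            new = {v for v in reached if v not in levels[col]}
            for v in new:
                levels[col][v] = depth
            next_X.append(new)
        X = next_X
    return levels


# Hypothetical example graph with two simultaneous searches.
adj = {0: [1, 2], 1: [3], 2: [3], 3: [4], 4: []}
levels = multi_source_bfs(adj, sources=[0, 3])
```

The three nested loops correspond to the three levels of parallelism named on the slide: over searches (columns), over frontier vertices, and over their edges.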
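Graph contraction via sparse triple product
[Figure: a graph whose vertex groups A1, A2, A3 are contracted; the contracted graph is computed as a product of three sparse matrices]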
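One common way to write contraction as a triple product is S * A * S^T, where S has a 1 in position (c, v) when vertex v belongs to cluster c. The slide does not spell out the factors, so this form is an assumption, and the helpers `spgemm`, `transpose`, and `contract` below are hypothetical plain-Python stand-ins rather than library calls.

```python
from collections import defaultdict


def spgemm(A, B):
    """C = A * B for sparse matrices stored as {(row, col): value} dicts."""
    B_rows = defaultdict(list)
    for (k, j), b in B.items():
        B_rows[k].append((j, b))
    C = defaultdict(int)
    for (i, k), a in A.items():
        for j, b in B_rows.get(k, ()):
            C[(i, j)] += a * b
    return dict(C)


def transpose(A):
    return {(j, i): v for (i, j), v in A.items()}


def contract(A, cluster_of):
    """Contract A, where cluster_of[v] is the cluster id of vertex v (assumed form S * A * S^T)."""
    S = {(c, v): 1 for v, c in enumerate(cluster_of)}
    return spgemm(spgemm(S, A), transpose(S))


# Hypothetical example: symmetric path 0 - 1 - 2 - 3, clusters {0, 1} and {2, 3}.
A = {(0, 1): 1, (1, 0): 1, (1, 2): 1, (2, 1): 1, (2, 3): 1, (3, 2): 1}
C = contract(A, cluster_of=[0, 0, 1, 1])
```

In the result, off-diagonal entries count edges between clusters and diagonal entries count the stored edges inside each cluster.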
Subgraph extraction via sparse triple product
[Figure: a subgraph is extracted as a product of three sparse matrices]
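Subgraph extraction can be sketched the same way, assuming the product has the form P * A * P^T with a selection matrix P that has a 1 in position (r, v) when vertex v becomes row r of the subgraph. The form of the factors and the helper names are assumptions made for illustration.

```python
from collections import defaultdict


def spgemm(A, B):
    """C = A * B for sparse matrices stored as {(row, col): value} dicts."""
    B_rows = defaultdict(list)
    for (k, j), b in B.items():
        B_rows[k].append((j, b))
    C = defaultdict(int)
    for (i, k), a in A.items():
        for j, b in B_rows.get(k, ()):
            C[(i, j)] += a * b
    return dict(C)


def extract(A, keep):
    """Induced subgraph on the vertices in keep, renumbered 0..len(keep)-1."""
    P = {(r, v): 1 for r, v in enumerate(keep)}      # selection matrix
    P_t = {(v, r): 1 for r, v in enumerate(keep)}    # its transpose
    return spgemm(spgemm(P, A), P_t)


# Hypothetical example: directed 4-cycle 0 -> 1 -> 2 -> 3 -> 0, keep vertices 1, 2, 3.
A = {(0, 1): 1, (1, 2): 1, (2, 3): 1, (3, 0): 1}
sub = extract(A, keep=[1, 2, 3])
```

Only the edges 1 -> 2 and 2 -> 3 survive, renumbered to 0 -> 1 and 1 -> 2; the edges touching vertex 0 are dropped by the selection.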
14 The Case for Sparse Matrices
Many irregular applications contain coarse-grained parallelism that can be exploited by abstractions at the proper level.
Traditional graph computations | Graphs in the language of linear algebra
Data driven, unpredictable communication | Fixed communication patterns
Irregular and unstructured, poor locality of reference | Operations on matrix blocks exploit memory hierarchy
Fine grained data accesses, dominated by latency | Coarse grained parallelism, bandwidth limited
Combinatorial BLAS class hierarchy
[Figure: class diagram with CommGrid, DCSC, CSC, CSB, Triples, SpMat, SpDistMat, DenseDistMat, DistMat, DenseDistVec, SpDistVec, FullyDistVec, ...; annotations: "HAS A", "Polymorphism", "Enforces interface only", "Combinatorial BLAS functions and operators"]
17 Some Combinatorial BLAS functions
Parallel sparse matrix-matrix multiplication: SpGEMM
[Figure: C = A x B with rectangular matrices; dimension labels 100K, 25K, 5K]
2D algorithm: Sparse SUMMA (based on dense SUMMA)
General implementation that handles rectangular matrices
C_ij += HyperSparseGEMM(A_recv, B_recv)
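The update C_ij += HyperSparseGEMM(A_recv, B_recv) can be illustrated with a serial simulation of the 2D SUMMA stages on a q x q process grid. This is a sketch under simplifying assumptions: there is no real communication, blocks keep global indices, and `local_spgemm` is a hypothetical stand-in for the hypersparse kernel, not its actual implementation.

```python
from collections import defaultdict


def local_spgemm(A_blk, B_blk, C_blk):
    """C_blk += A_blk * B_blk; blocks are {(row, col): value} dicts with global indices."""
    B_rows = defaultdict(list)
    for (k, j), b in B_blk.items():
        B_rows[k].append((j, b))
    for (i, k), a in A_blk.items():
        for j, b in B_rows.get(k, ()):
            C_blk[(i, j)] = C_blk.get((i, j), 0) + a * b


def split(M, n_rows, n_cols, q):
    """Distribute M over a q x q grid; block (i, j) owns one contiguous tile."""
    rb = -(-n_rows // q)  # ceil(n_rows / q) rows per tile
    cb = -(-n_cols // q)  # ceil(n_cols / q) columns per tile
    blocks = [[{} for _ in range(q)] for _ in range(q)]
    for (r, c), v in M.items():
        blocks[r // rb][c // cb][(r, c)] = v
    return blocks


def summa(A, B, m, k, n, q):
    """Serial simulation of 2D SUMMA for C (m x n) = A (m x k) * B (k x n)."""
    A_blocks = split(A, m, k, q)
    B_blocks = split(B, k, n, q)
    C_blocks = [[{} for _ in range(q)] for _ in range(q)]
    for stage in range(q):
        for i in range(q):
            for j in range(q):
                A_recv = A_blocks[i][stage]  # "broadcast" along process row i
                B_recv = B_blocks[stage][j]  # "broadcast" along process column j
                local_spgemm(A_recv, B_recv, C_blocks[i][j])
    return C_blocks


# Hypothetical rectangular example: A is 4 x 2, B is 2 x 4, on a 2 x 2 grid.
A = {(0, 0): 1, (1, 1): 2, (2, 0): 3, (3, 1): 4}
B = {(0, 0): 1, (0, 3): 2, (1, 1): 3, (1, 2): 4}
C_blocks = summa(A, B, m=4, k=2, n=4, q=2)

# Gather the distributed result into one dict for inspection.
C = {}
for block_row in C_blocks:
    for blk in block_row:
        C.update(blk)
```

Each process (i, j) only ever touches blocks from its own process row and column, which is what gives the 2D layout its communication pattern.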
1D vs. 2D scaling for sparse matrix-matrix multiplication
2D algorithms have the potential to scale, but not linearly.
SpSUMMA = 2-D data layout (Combinatorial BLAS)
EpetraExt = 1-D data layout (Trilinos)
Scaling to more processors…
Almost linear scaling until bandwidth costs start to dominate
Scaling proportional to √p afterwards
Parallel graph analysis software Discrete structure analysis Graph theory Computers
Parallel graph analysis software
Discrete structure analysis
Graph theory
Computers
Communication Support (MPI, GASNet, etc.)
Threading Support (OpenMP, Cilk, etc.)
Distributed Combinatorial BLAS
Shared-address space Combinatorial BLAS
Knowledge Discovery Toolbox (KDT)
HPC scientists and engineers
Graph algorithm developers
Domain scientists
KDT is higher level (graph abstractions)
Combinatorial BLAS is for performance
Domain expert vs. graph expert
(Semantic) directed graphs
– constructors, I/O
– basic graph metrics (e.g., degree())
– vectors
Clustering / components
Centrality / authority: betweenness centrality, PageRank
Hypergraphs and sparse matrices
Graph primitives (e.g., bfsTree())
SpMV / SpGEMM on semirings
[Figure: Markov Clustering pipeline: Input Graph, Largest Component, Graph of Clusters]
Domain expert vs. graph expert
[Figure: Markov Clustering pipeline: Input Graph, Largest Component, Graph of Clusters]
    comp = bigG.connComp()
    giantComp = comp.hist().argmax()
    G = bigG.subgraph(comp==giantComp)
    clusters = G.cluster('Markov')
    clusNedge = G.nedge(clusters)
    smallG = G.contract(clusters)
    # visualize
Domain expert vs. graph expert
[Figure: Markov Clustering pipeline: Input Graph, Largest Component, Graph of Clusters]
    […]
    L = G.toSpParMat()
    d = L.sum(kdt.SpParMat.Column)
    L = -L
    L.setDiag(d)
    M = kdt.SpParMat.eye(G.nvert()) - mu*L
    pos = kdt.ParVec.rand(G.nvert())
    for i in range(nsteps):
        pos = M.SpMV(pos)
A general graph library with operations based on linear algebraic primitives
Aimed at domain experts who know their problem well but don't know how to program a supercomputer
Easy-to-use Python interface
Runs on a laptop as well as a cluster with 10,000 processors
Open source software (New BSD license)
V0.3 release April 2013
28 A few KDT applications
Filtering a graph by attribute value
[Figure panels: graph of text & phone calls, with betweenness centrality; betweenness centrality on text messages; betweenness centrality on phone calls]
Attributed semantic graphs and filters
Example:
Vertex types: Person, Phone, Camera, Gene, Pathway
Edge types: PhoneCall, TextMessage, CoLocation, SequenceSimilarity
Edge attributes: Time, Duration
Calculate centrality just for […]s among engineers sent between given start and end times
    def onlyEngineers(self):
        return self.position == Engineer
    def timed(self, sTime, eTime):
        return ((self.type == […]) and (self.Time > sTime) and (self.Time < eTime))
    G.addVFilter(onlyEngineers)
    G.addEFilter(timed(start, end))
    # rank via centrality based on recent transactions among engineers
    bc = G.rank('approxBC')
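The effect of vertex and edge filters can be sketched without KDT at all. The code below is plain Python, not the KDT API: the function `filtered_out_degree`, the attribute dictionaries, and the example people are all hypothetical. It assumes filters are predicates evaluated on the fly during a pass over the edges, so no filtered copy of the graph is built.

```python
def filtered_out_degree(vertices, edges, v_filter, e_filter):
    """Count, per surviving vertex, the outgoing edges that pass both filters.

    vertices: {vertex id: attribute dict}
    edges: list of (src, dst, attribute dict)
    """
    degree = {v: 0 for v, attrs in vertices.items() if v_filter(attrs)}
    for src, dst, attrs in edges:
        # An edge counts only if both endpoints and the edge itself pass.
        if src in degree and dst in degree and e_filter(attrs):
            degree[src] += 1
    return degree


def only_engineers(attrs):
    return attrs["position"] == "Engineer"


def timed(s_time, e_time):
    # Returns a predicate closed over the time window.
    return lambda attrs: s_time < attrs["Time"] < e_time


# Hypothetical data.
vertices = {
    "ann": {"position": "Engineer"},
    "bob": {"position": "Engineer"},
    "cat": {"position": "Manager"},
}
edges = [
    ("ann", "bob", {"Time": 5}),
    ("ann", "bob", {"Time": 50}),
    ("ann", "cat", {"Time": 6}),
    ("bob", "ann", {"Time": 7}),
]
deg = filtered_out_degree(vertices, edges, only_engineers, timed(0, 10))
```

Because the predicates run once per vertex and per edge inside the traversal, their cost is paid in the innermost loop of every graph operation.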
32 On-the-fly filter performance issues
Write filters in Python and call back from CombBLAS
– Flexible and easy, but runs slowly
The issue is local computation, not parallelism or data movement
All user-written semirings face the same issue
33 Solution: Just-in-time specialization (SEJITS)
On first call, translate Python filter or semiring op to C++
Compile with GCC
Call the compiled code thereafter
(Lots of details omitted…)
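The "compile on first call, reuse thereafter" pattern can be sketched in a few lines. This is not how SEJITS works internally: Python's built-in `compile()` is used here only as a stand-in for the translate-to-C++ and GCC steps, and the class name `SpecializedFilter` is hypothetical.

```python
class SpecializedFilter:
    """Compile a filter expression on the first call, then reuse the compiled code."""

    def __init__(self, expr):
        self.expr = expr        # filter written as an expression string
        self._compiled = None   # filled in lazily on the first call
        self.compile_count = 0  # how many times the "compiler" actually ran

    def __call__(self, **variables):
        if self._compiled is None:
            # Stand-in for "translate to C++ and compile with GCC".
            self._compiled = compile(self.expr, "<filter>", "eval")
            self.compile_count += 1
        # Every later call goes straight to the cached compiled code.
        return eval(self._compiled, {"__builtins__": {}}, variables)


in_window = SpecializedFilter("lo < t < hi")
results = [in_window(t=t, lo=0, hi=10) for t in (5, 50, 7)]
```

The one-time compilation cost is paid on the first call only, so it is amortized over the many filter evaluations a graph traversal performs.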
Filtered BFS with SEJITS Time (in seconds) for a single BFS iteration on scale 25 RMAT (33M vertices, 500M edges) with 10% of elements passing filter. Machine is NERSC’s Hopper.
35 Conclusion Sparse arrays and matrices give useful primitives and algorithms for high-performance graph computation. Just-in-time specialization enables flexible graph analytics. It helps to look at things from two directions. kdt.sourceforge.net/ gauss.cs.ucsb.edu/~aydin/CombBLAS/