1 Sparse Matrices for High-Performance Graph Analytics John R. Gilbert University of California, Santa Barbara Oak Ridge National Laboratory October 3, 2014 Support: Intel, Microsoft, DOE Office of Science, NSF

2 Thanks … Lucas Bang (UCSB), Jon Berry (Sandia), Eric Boman (Sandia), Aydin Buluc (LBL), John Conroy (CCS), Kevin Deweese (UCSB), Erika Duriakova (Dublin), Armando Fox (UCB), Shoaib Kamil (MIT), Jeremy Kepner (MIT), Tristan Konolige (UCSB), Adam Lugowski (UCSB), Tim Mattson (Intel), Brad McRae (TNC), Dave Mizell (YarcData), Lenny Oliker (LBL), Carey Priebe (JHU), Steve Reinhardt (YarcData), Lijie Ren (Google), Eric Robinson (Lincoln), Viral Shah (UIDAI), Veronika Strnadova (UCSB), Yun Teng (UCSB), Joshua Vogelstein (Duke), Drew Waranis (UCSB), Sam Williams (LBL)

3 Outline Motivation: Graph applications Mathematics: Sparse matrices for graph algorithms Software: CombBLAS, QuadMat, KDT Standards: Graph BLAS Forum

4 Computational models of the physical world [Figures: cortical bone, trabecular bone]

5 Large graphs are everywhere… Internet structure; social interactions; scientific datasets: biological, chemical, cosmological, ecological, … [WWW snapshot courtesy Y. Hyun; yeast protein interaction network courtesy H. Jeong]

6 Co-author graph from 1993 Householder symposium Social network analysis (1993)

7 Facebook graph: > 1,000,000,000 vertices Social network analysis (2014)

8 Large-scale genomic mapping and sequencing [Strnadova, Buluc, Chapman, G, Gonzalez, Jegelska, Rokhsar, Oliker 2014]
– Problem: scale to millions of markers times thousands of individuals, with “unknown” rates > 50%
– Tools used or desired: spanning trees, approximate TSP, incremental connected components, spectral and custom clustering, k-nearest neighbors
– Results: using more data gives better genomic maps

9 Alignment and matching of brain scans [Conroy, G, Kratzer, Lyzinski, Priebe, Vogelstein 2014]
– Problem: match functional regions across individuals
– Tools: Laplacian eigenvectors, geometric spectral partitioning, clustering, and more...

10 Landscape connectivity modeling [McRae et al.]
– Habitat quality, gene flow, corridor identification, conservation planning
– Targeting larger problems: Yellowstone-to-Yukon corridor
– Tools: graph contraction, connected components, Laplacian linear systems
Figures courtesy of Brad McRae, NCEAS

11 Combinatorial acceleration of Laplacian solvers [Boman, Deweese, G 2014] (B^{-1/2} A B^{-1/2}) (B^{1/2} x) = B^{-1/2} b
– Problem: approximate target graph by sparse subgraph
– Ax = b in nearly linear time in theory [ST08, KMP10, KOSZ13]
– Tools: spanning trees, subgraph extraction and contraction, breadth-first search, shortest paths, ...
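The preconditioned system on this slide can be checked numerically. A minimal pure-Python sketch; the 2x2 matrix A, right-hand side b, and diagonal preconditioner B below are made-up illustrative values, not data from the talk:

```python
import math

# Check the identity: solving (B^{-1/2} A B^{-1/2}) (B^{1/2} x) = B^{-1/2} b
# recovers the same x as solving A x = b directly.
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
B = [2.0, 5.0]                      # diagonal preconditioner (illustrative)

def solve2(M, rhs):
    """Cramer's rule for a 2x2 linear system M z = rhs."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [(rhs[0] * M[1][1] - M[0][1] * rhs[1]) / det,
            (M[0][0] * rhs[1] - rhs[0] * M[1][0]) / det]

x_direct = solve2(A, b)

s = [1.0 / math.sqrt(d) for d in B]                       # B^{-1/2}
M = [[s[i] * A[i][j] * s[j] for j in range(2)] for i in range(2)]
y = solve2(M, [s[i] * b[i] for i in range(2)])            # y = B^{1/2} x
x_precond = [s[i] * y[i] for i in range(2)]               # x = B^{-1/2} y
```

The two solution vectors agree up to floating-point roundoff, which is the point of the similarity transform: the preconditioned operator is better conditioned but has the same solution.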

12 The middleware challenge for graph analysis [Diagram: continuous physical modeling → linear algebra → computers; discrete structure analysis → graph theory → computers]

13 The middleware challenge for graph analysis By analogy to numerical scientific computing, what should the combinatorial BLAS look like? Basic Linear Algebra Subroutines (BLAS): C = A*B, y = A*x, μ = x^T y [Figure: BLAS ops/sec vs. matrix size]

14 Sparse array primitives for graph manipulation Identification of primitives: sparse matrix-matrix multiplication (SpGEMM), sparse matrix-dense vector multiplication, element-wise operations, sparse matrix indexing. Matrices over various semirings: (+, ×), (min, +), (or, and), …

15 Multiple-source breadth-first search [Figure: sparse adjacency matrix A^T and sparse frontier matrix X]

16 Multiple-source breadth-first search (A^T X)
– Sparse array representation => space efficient
– Sparse matrix-matrix multiplication => work efficient
– Three possible levels of parallelism: searches, vertices, edges
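The idea above can be sketched in a few lines of plain Python. This is an illustrative toy, not the CombBLAS implementation: the graph is a dict of adjacency sets, and each search keeps its own frontier set (one column of X).

```python
# Multiple-source BFS as repeated "matrix product" over the boolean
# semiring (or, and): one bfs_step is the A^T X of the slide.

def bfs_step(adj, frontiers):
    """One level of multi-source BFS: next frontiers = A^T X over (or, and)."""
    return [set(w for v in f for w in adj.get(v, ())) for f in frontiers]

def multi_bfs(adj, sources):
    """Run all searches until every frontier is empty; return BFS levels
    as a dict keyed by (source, reached_vertex)."""
    visited = [{s} for s in sources]
    frontiers = [{s} for s in sources]
    levels = {(s, s): 0 for s in sources}
    level = 0
    while any(frontiers):
        level += 1
        # drop already-visited vertices from each new frontier
        frontiers = [nf - vis
                     for nf, vis in zip(bfs_step(adj, frontiers), visited)]
        for src, f, vis in zip(sources, frontiers, visited):
            for w in f:
                levels[(src, w)] = level
            vis |= f
    return levels
```

Because every search advances in the same sparse product, the three levels of parallelism mentioned on the slide (across searches, vertices, and edges) all appear as loops in `bfs_step`.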

17 Examples of semirings in graph algorithms
– Real field (R, +, ×): numerical linear algebra
– Boolean algebra ({0, 1}, |, &): graph traversal
– Tropical semiring (R ∪ {∞}, min, +): shortest paths
– (S, select, select): select subgraph, or contract nodes to form quotient graph
A schema for user-specified computation at vertices and edges (edge/vertex attributes, vertex data aggregation, edge data processing)
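A toy illustration of how one generic product routine serves several of these semirings; the function below is a hypothetical sketch, not an API from any of the libraries discussed:

```python
# Matrix-vector product parameterized by a semiring (add, mul, zero).
# With (min, +) it performs one shortest-path relaxation step; with
# (or, and) it performs one BFS step.
INF = float('inf')

def semiring_matvec(A, x, add, mul, zero):
    """y[i] = add over j of mul(A[i][j], x[j]) in the given semiring."""
    n = len(A)
    y = [zero] * n
    for i in range(n):
        acc = zero
        for j in range(n):
            acc = add(acc, mul(A[i][j], x[j]))
        y[i] = acc
    return y

# Tropical example: A[i][j] = weight of edge j -> i, zero-weight
# self-loops so known distances persist across steps.
A = [[0, INF, INF],
     [2, 0, INF],
     [INF, 3, 0]]
x0 = [0, INF, INF]                       # distances from vertex 0
step = lambda v: semiring_matvec(A, v, min, lambda a, b: a + b, INF)
x1 = step(x0)                            # after one relaxation
x2 = step(x1)                            # after two relaxations
```

Repeating `step` n-1 times is exactly the linear-algebraic form of Bellman-Ford over the (min, +) semiring.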

18 Graph contraction via sparse triple product [Figure: vertex groups A1, A2, A3 contracted to single nodes via the product of three sparse matrices]

19 Subgraph extraction via sparse triple product [Figure: an induced subgraph extracted via the product of three sparse matrices]
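Both contraction and extraction reduce to the same triple product S × A × S^T, where S is a selection/membership matrix: S[c][v] = 1 if vertex v belongs to group c (contraction) or is the c-th kept vertex (extraction). A dense pure-Python sketch, illustrative only — real implementations keep all three matrices sparse:

```python
def matmul(A, B):
    """Plain dense matrix product C = A * B."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(k):
            if A[i][j]:
                for l in range(m):
                    C[i][l] += A[i][j] * B[j][l]
    return C

def transpose(M):
    return [list(r) for r in zip(*M)]

def triple_product(S, A):
    """S * A * S^T: contracted or extracted adjacency matrix."""
    return matmul(matmul(S, A), transpose(S))

# 4-vertex path 0-1-2-3
A = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
S_contract = [[1, 1, 0, 0],   # group {0,1} -> node 0
              [0, 0, 1, 1]]   # group {2,3} -> node 1
S_extract = [[0, 1, 0, 0],    # keep vertex 1
             [0, 0, 1, 0]]    # keep vertex 2
```

In the contracted result, off-diagonal entries count edges between groups and diagonal entries count (twice) the edges inside each group; the extraction result is just the induced subgraph on the kept vertices.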

20 Counting triangles (clustering coefficient) Clustering coefficient: Pr(wedge i-j-k makes a triangle with edge i-k) = 3 × #triangles / #wedges = 3 × 4 / 19 ≈ 0.63 in the example; may want to compute it for each vertex j

21 Counting triangles (clustering coefficient) Cohen’s algorithm to count triangles:
– Count triangles by lowest-degree vertex.
– Enumerate “low-hinged” wedges.
– Keep wedges that close.

22 Counting triangles (clustering coefficient) A = L + U (hi→lo plus lo→hi edges); L × U = B (wedges with their low hinge); A ∧ B = C (closed wedges); sum(C)/2 = 4 triangles
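A minimal sketch of this triangle-counting recipe over dense 0/1 adjacency matrices. For simplicity it orders vertices by index rather than by degree, which changes the constants but not the count:

```python
# Cohen-style triangle counting: split A = L + U, form B = L * U
# (wedges through their low hinge), keep wedges that A closes, sum/2.

def count_triangles(A):
    n = len(A)
    # L: edges from higher to lower index; U: lower to higher
    L = [[A[i][j] if j < i else 0 for j in range(n)] for i in range(n)]
    U = [[A[i][j] if j > i else 0 for j in range(n)] for i in range(n)]
    # B[i][k] = number of wedges i - j - k whose hinge j is "low"
    B = [[sum(L[i][j] * U[j][k] for j in range(n)) for k in range(n)]
         for i in range(n)]
    # A ∧ B: keep only wedges closed by an edge i-k, then halve
    closed = sum(B[i][k] for i in range(n) for k in range(n) if A[i][k])
    return closed // 2
```

Each triangle is found exactly once per orientation of its closing edge, hence the final division by two.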

23 A few other graph algorithms we’ve implemented in linear algebraic style
– Maximal independent set (KDT/SEJITS) [BDFGKLOW 2013]
– Peer-pressure clustering (SPARQL) [DGLMR 2013]
– Time-dependent shortest paths (CombBLAS) [Ren 2012]
– Gaussian belief propagation (KDT) [LABGRTW 2011]
– Markov clustering (CombBLAS, KDT) [BG 2011, LABGRTW 2011]
– Betweenness centrality (CombBLAS) [BG 2011]
– Hybrid BFS/bully connected components (CombBLAS) [Konolige, in progress]
– Geometric mesh partitioning (Matlab) [GMT 1998]

24 Graph algorithms in the language of linear algebra
– Kepner et al. study [2006]: fundamental graph algorithms including min spanning tree, shortest paths, independent set, max flow, clustering, …
– SSCA#2 / centrality [2008]
– Basic breadth-first search / Graph500 [2010]
– Beamer et al. [2013]: direction-optimizing breadth-first search, implemented in CombBLAS

25 Combinatorial BLAS An extensible distributed-memory library offering a small but powerful set of linear algebraic operations specifically targeting graph analytics.
– Aimed at graph algorithm designers/programmers who are not expert in mapping algorithms to parallel hardware.
– Flexible templated C++ interface.
– Scalable performance from laptop to 100,000-processor HPC.
– Open source software; version released in January.

26 Combinatorial BLAS in distributed memory [Class diagram: CommGrid; local storage formats DCSC, CSC, CSB, Triples; matrix classes SpMat, SpDistMat, DenseDistMat, DistMat (enforces interface only); vector classes SpDistVec, DenseDistVec, FullyDistVec; HAS-A and polymorphism relationships to the Combinatorial BLAS functions and operators]

27 2D layout for sparse matrices & vectors Matrix/vector distributions, interleaved on each other; the default distribution in Combinatorial BLAS, scalable with increasing number of processes.
– 2D matrix layout wins over 1D with large core counts and with limited bandwidth/compute
– 2D vector layout sometimes important for load balance
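A sketch of how a 2D block distribution maps a matrix element to its owning process; the function name and blocking scheme below are illustrative assumptions (a perfect-square process count arranged as a square grid), not the CombBLAS code:

```python
import math

def owner_2d(i, j, n, p):
    """Which process in a sqrt(p) x sqrt(p) grid owns element A[i][j]
    of an n x n matrix distributed in square blocks?"""
    q = math.isqrt(p)                 # process grid is q x q
    assert q * q == p, "sketch assumes a perfect-square process count"
    block = -(-n // q)                # ceil(n / q) rows/cols per process
    return (i // block, j // block)
```

The payoff of the 2D layout is communication scaling: in a sparse matrix-vector or matrix-matrix product, each process exchanges data with the O(sqrt(p)) processes in its grid row and column rather than with all p processes, which is why 2D wins at large core counts.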

28 Benchmarking graph analytics frameworks Satish et al., Navigating the Maze of Graph Analytics Frameworks Using Massive Graph Datasets, SIGMOD. Combinatorial BLAS was fastest of all tested graph processing frameworks on 3 out of 4 benchmarks.

29 Combinatorial BLAS “users” (Sep 2014) IBM (T.J. Watson, Zurich, & Tokyo), Intel, Cray, Microsoft, Stanford, UC Berkeley, Carnegie Mellon, Georgia Tech, Ohio State, U Texas Austin, Columbia, U Minnesota, NC State, UC San Diego, Sandia Labs, SEI, Paradigm4, IHPC (Singapore), King Fahd U (Saudi Arabia), Tokyo Inst of Technology, Chinese Academy of Sciences, U Ghent (Belgium), Bilkent U (Turkey), U Canterbury (New Zealand), Purdue, Indiana U, UC Merced, Mississippi State

30 QuadMat shared-memory data structure [Lugowski, G] An m-rows × n-columns matrix is subdivided by dimension on power-of-2 indices. Blocks store enough matrix elements for meaningful computation; denser parts of the matrix have more blocks.
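A hypothetical sketch of that adaptive subdivision, with nonzeros stored as (row, column, value) triples and a made-up capacity threshold `cap` standing in for "enough elements for meaningful computation":

```python
# QuadMat-style quadtree sketch: a block splits into four quadrants at
# the index midpoint whenever it holds more than `cap` nonzeros, so
# denser regions of the matrix end up with more (smaller) blocks.

def subdivide(triples, r0, r1, c0, c1, cap):
    """Return a quadtree: either a leaf list of triples, or four child
    trees ordered [top-left, top-right, bottom-left, bottom-right]."""
    if len(triples) <= cap or r1 - r0 <= 1 or c1 - c0 <= 1:
        return triples
    rm, cm = (r0 + r1) // 2, (c0 + c1) // 2
    quads = [[], [], [], []]
    for (i, j, v) in triples:
        quads[2 * (i >= rm) + (j >= cm)].append((i, j, v))
    return [subdivide(quads[0], r0, rm, c0, cm, cap),
            subdivide(quads[1], r0, rm, cm, c1, cap),
            subdivide(quads[2], rm, r1, c0, cm, cap),
            subdivide(quads[3], rm, r1, cm, c1, cap)]
```

Splitting on power-of-2 midpoints means a block's bounds never need to be stored explicitly; they are implied by its path from the root.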

31 Pair-List QuadMat SpGEMM algorithm
– Problem: natural recursive matrix multiplication is inefficient due to a deep tree of sparse matrix additions.
– Solution: rearrange into block inner-product pair lists.
– A single matrix element can participate in pair lists with different block sizes.
– Symbolic phase followed by a computational phase.
– Multithreaded implementation in Intel TBB.

32 QuadMat compared to CSparse & CombBLAS

33 Knowledge Discovery Toolbox A general graph library with operations based on linear algebraic primitives.
– Aimed at domain experts who know their problem well but don’t know how to program a supercomputer
– Easy-to-use Python interface
– Runs on a laptop as well as a cluster with 10,000 processors
– Open source software (New BSD license)

34 Attributed semantic graphs and filters Example: vertex types: Person, Phone, Camera, Gene, Pathway; edge types: PhoneCall, TextMessage, CoLocation, SequenceSimilarity; edge attributes: Time, Duration. Calculate centrality just for s among engineers sent between given start and end times:

def onlyEngineers(self):
    return self.position == Engineer

def timed(self, sTime, eTime):
    return ((self.type == ) and
            (self.Time > sTime) and (self.Time < eTime))

G.addVFilter(onlyEngineers)
G.addEFilter(timed(start, end))
# rank via centrality based on recent transactions among engineers
bc = G.rank('approxBC')

35 SEJITS for filter/semiring acceleration
– Embedded DSL: Python for the whole application
– Introspect, translate Python to equivalent C++ code
– Call compiled/optimized C++ instead of Python

36 Filtered BFS with SEJITS [Figure: time in seconds for a single BFS iteration on a scale-25 RMAT graph (33M vertices, 500M edges) with 10% of elements passing the filter; machine is NERSC’s Hopper]

37 The (original) BLAS The Basic Linear Algebra Subroutines had a revolutionary impact on computational linear algebra. Experts in mapping algorithms to hardware tune the BLAS for specific platforms; experts in numerical linear algebra build software on top of the BLAS to get high performance “for free.” Today every computer, phone, etc. comes with /usr/lib/libblas.
– BLAS 1 (vector ops): Lawson, Hanson, Kincaid, Krogh, 1979 → LINPACK
– BLAS 2 (matrix-vector ops): Dongarra, Du Croz, Hammarling, Hanson, 1988 → LINPACK on vector machines
– BLAS 3 (matrix-matrix ops): Dongarra, Du Croz, Hammarling, Hanson, 1990 → LAPACK on cache-based machines

38 Can we standardize a “Graph BLAS”? No, it’s not reasonable to define a universal set of building blocks:
– Huge diversity in matching graph algorithms to hardware platforms.
– No consensus on data structures or linguistic primitives.
– Lots of graph algorithms remain to be discovered.
– Early standardization can inhibit innovation.
Yes, it is reasonable to define a common set of building blocks … for graphs as linear algebra:
– Representing graphs in the language of linear algebra is a mature field.
– Algorithms, high level interfaces, and implementations vary.
– But the core primitives are well established.

39 Graph BLAS Forum:

40 Matrix times matrix over semiring Implements C ⊕= A ⊕.⊗ B, i.e., for j = 1 : N, C(i,k) = C(i,k) ⊕ (A(i,j) ⊗ B(j,k)). If input C is omitted, implements C = A ⊕.⊗ B. Transpose flags specify operation on A^T, B^T, and/or C^T instead. Specific cases and function names:
– SpGEMM: sparse matrix times sparse matrix
– SpMSpV: sparse matrix times sparse vector
– SpMV: sparse matrix times dense vector
– SpMM: sparse matrix times dense matrix
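A semiring-parameterized matrix-matrix multiply can be sketched in a few lines; this toy uses dict-of-dicts sparse matrices and is an illustrative assumption, not the Graph BLAS API:

```python
# C ⊕= A ⊕.⊗ B over a user-supplied semiring (add, mul).
# Matrices are dicts of rows, each row a dict {col: value};
# absent entries are the semiring's implicit zero.

def mxm(C, A, B, add, mul):
    """Accumulate the semiring product A*B into C in place; return C."""
    for i, Ai in A.items():
        for j, aij in Ai.items():
            for k, bjk in B.get(j, {}).items():
                row = C.setdefault(i, {})
                row[k] = add(row[k], mul(aij, bjk)) if k in row else mul(aij, bjk)
    return C
```

The same routine covers the slide's special cases: with (+, ×) it is ordinary SpGEMM; with (min, +) it composes shortest-path distance matrices; with (or, and) it advances boolean reachability.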

41 Conclusion Matrix computation is beginning to repay a 50-year debt to graph algorithms. Graphs in the language of linear algebra are sufficiently mature to support a standard set of BLAS. It helps to look at things from two directions.