Photos placed in horizontal position with even amount of white space between photos and header Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL SAND NO C miniTri and a Task-Based Linear Algebra Building Blocks Approach for Scalable Graph Analytics Michael Wolf, Jon Berry, Dylan Stark IEEE HPEC 16 September, 2015
Miniapps Small, self contained applications Proxy for important characteristics of full applications Important co-design tool: performance analysis Target 1: existing applications on new and future architectures Target 2: new applications on existing architectures Strong partnership with industry Mantevo MiniApp Suite (Sandia): mantevo.org 2 phdMesh (Mantevo)
Graph BLAS Effort to standardize building blocks for graph algorithms in language of linear algebra Overloaded linear algebra kernels express most graph computations Matrix-graph duality – highly impactful in CS&E (partitioning, solvers,…) Promising for data sciences but many challenges Graph BLAS BoF – Buluc, 18:00 Eden Vale (C1) 3 CS&E = Computational Science and Engineering Eds. Kepner, Gilbert
Task Parallelism 4 Dependency between tasks T4 T2 T5T6T7T8T1T2 T3 T4 T1 T3 T1 T2 T3 T4T5 T6 T4 T7T8 T3 T2T3T4T1 T3 T4T5T6 T7T8T1 T2 Core 0 Core 1Core 2 Core 3 Kernel boundaries Dataflow of application expressed through tasks/dependencies (Avoid explicit barriers between kernels) Overdecomposition of problem into tasks (# tasks > #cores) Tasks scheduled and moved to appropriate compute resources
Task Parallelism 5 Dependency between tasks T4 T2 T5T6T7T8T1T2 T3 T4 T1 T3 T1 T2 T3 T4T5 T6 T4 T7T8 T3 T2T3T4T1 T3 T4T5T6 T7T8T1 T2 Core 0 Core 1Core 2 Core 3 Kernel boundaries Natural for data analytics (model intrinsically data-centric) Several task parallel models/libraries: HPX, Kokkos/Qthreads, Uintah, Legion, …
Background miniTri Linear Algebra-Based miniTri (miniTriLA) Task Parallel Approach to miniTriLA Summary Outline 6
miniTri: Data Analytics Miniapp 7 Triangle-based miniapp Proxy (Mantevo) for triangle based data analytics (e.g., graph noise- filtering) miniTri Steps: For each triangle, calculate triangle degrees for vertices and edges For each triangle, calculate integer k given triangle degree info 3 3 miniTri poses different challenges than traditional data analytics benchmarks such as Graph 500. miniTri poses different challenges than traditional data analytics benchmarks such as Graph 500. k: Max clique size of given triangle
miniTri: Overview 8 Developed 12+ variants Different methods: buckets data structure, set intersection, linear algebra Different programming models: OpenMP, MPI, HPX, Kokkos/Qthreads Focus of talk: linear algebra-based miniTri Related triangle enumeration work Azad, Buluç, Gilbert. “Parallel triangle counting and enumeration using matrix algebra”, Proc. of the IPDPSW, GABB Workshop, Challenge: Can miniTri be implemented efficiently using Graph BLAS-like building blocks? Challenge: Can miniTri be implemented efficiently using Graph BLAS-like building blocks? k:
Background miniTri Linear Algebra-Based miniTri (miniTriLA) Task Parallel Approach to miniTriLA Summary Outline 9
Linear Algebra Based miniTri (miniTriLA) Developed Graph BLAS-like formulation of miniTri Important stressor of Graph BLAS 10 A = adjacency matrix for graph, B = incidence matrix for graph, 1 = vector of ones miniTri 1 C = A * B 2 t v = C * 1 3 t e = C T * 1 4 kcount(C, t v, t e ) Adjacency matrix of G Incidence matrix of G G
miniTriLA: Triangle Enumeration 11 A = adjacency matrix for graph, B = incidence matrix for graph, 1 = vector of ones miniTri 1 C = A * B 2 t v = C * 1 3 t e = C T * 1 4 kcount(C, t v, t e ) Wedges implicitly stored in rows of A Adj matrix: wedges = pairwise combinations of nonzero column ids + row # E.g., row 5 wedges: {1,5,2}, {1,5,3}, {1,5,4}, {2,5,3}, {2,5,4}, {3,5,4} Enumerates each triangle 3 times (once: C=L*B, where L is lower triangle part of A) SPGEMM = Sparse Matrix Multiplication G Adjacency matrix of G Incidence matrix of G Matrix of Triangles
miniTriLA: Triangle Enumeration 12 A = adjacency matrix for graph, B = incidence matrix for graph, 1 = vector of ones miniTri 1 C = A * B 2 t v = C * 1 3 t e = C T * 1 4 kcount(C, t v, t e ) Columns of B tell us whether wedge is “closed” to form triangle Overloaded SpGEMM yields triangle enumeration: If A i,x =A i,y = 1 and B x,j =B y,j = 1 (or equivalently A i,* B *,j =2), C i,j = triangle {i,x,y} Else, no triangle Enumerates each triangle 3 times (once: C=L*B, where L is lower triangle part of A) SPGEMM = Sparse Matrix Multiplication G Adjacency matrix of G Incidence matrix of G Matrix of Triangles
miniTriLA: Triangle Degree Calculation 13 A = adjacency matrix for graph, B = incidence matrix for graph, 1 = vector of ones miniTri 1 C = A * B 2 t v = C * 1 3 t e = C T * 1 4 kcount(C, t v, t e ) Triangle degree calculation Each triangle represented once in C for each of its edges and vertices Triangle vertex and edge degrees are number of nz in rows and columns Triangles with v 4 Triangles with e 4,5 C= Overloaded SpMV to count triangles in rows
miniTriLA: Kcounts 14 A = adjacency matrix for graph, B = incidence matrix for graph, 1 = vector of ones miniTri 1 C = A * B 2 t e = C * 1 3 t v = C T * 1 4 kcount(C, t e, t v ) K123 triangle count 002 Compute k for each triangle (summarize in table) k-count table gives us upper bound on largest clique in graph Largest c such that Comb(c,3) triangles have k-counts at least c G k:
miniTriLA: Challenges 15 A = adjacency matrix for graph, B = incidence matrix for graph, 1 = vector of ones miniTri 1 C = A * B 2 t e = C * 1 3 t v = C T * 1 4 kcount(C, t e, t v ) Challenge 1: Computation difficult to load-balance Challenge 2: Typical graph BLAS approach forms C, which means storing all triangles in graph Worst case: O(|E| 3/2 ) triangles in graph, typical: triangles/edge Severely limits size of graph
Background miniTri Linear Algebra-Based miniTri (miniTriLA) Task Parallel Approach to miniTriLA Summary Outline 16
Task Parallel Approach Each block of linear algebra operations assigned to a task Block of rows, 2D block of elements Global barriers between kernels removed Replaced by dependencies between tasks Tasks in subsequent kernels can start/finish before first kernel finishes 17 Illustrative Example of GraphBLAS Approaches Traditional data parallel Task parallel
Task Parallel Implementation Implemented Task Parallel miniTriLA HPX (LSU), Kokkos/Qthreads (Sandia) Addressing Challenge #1: Load imbalance Task parallelism can be used to improve load balance 18 Initial results show slight improvement of task parallel approach over data parallel approach
Addressing Challenge #2: Storing all Triangles Key insight: Task parallelism can be used in resource constrained environments to reduce memory footprint Prioritize k-count tasks to free blocks of triangles from memory Need runtime system to support advanced resource management/priorities (ongoing effort: HPX and Kokkos/Qthreads) 19 Prioritize Task parallel approach allows Graph BLAS implementation of miniTri to solve much larger problems
Background miniTri Linear Algebra-Based miniTri (miniTriLA) Task Parallel Approach to miniTriLA Summary Outline 20
Summary Overview of new data analytics miniapp miniTri Will be released soon as part of Mantevo Presented linear algebra-based formulation of miniTri miniTri in 4 compact linear algebra-based operations Graph Algorithm Building Block (GABB) for triangle enumeration GABBs for calculating triangle vertex and edge degree miniTri poses challenges for Graph BLAS-like implementations Load balancing, Memory usage Presented task parallel approach that addresses these challenges Asynchrony is key Can use task priorities to constrain memory usage 21
Additional Info/Resources miniTri will be released soon under Mantevo Waiting for copyright approval mantevo.org has several other miniapps Graph BLAS HPX LSU – HPX-5 (IU) – Kokkos Qthreads 22
Extra 23
Traditional Data Parallelism 24 Data distributed across cores Global barriers between kernels Data distributed across cores Global barriers between kernels C1 Core 0 Core 1Core 2 Core 3 Kernel boundaries C2 C3C4 C1C2 C3C4 C1C2 C3 C4 C1 C2 C3 C4 C1C2 C3C4
kcount 25 Upper bound on largest clique in graph = largest c such that Comb(c,3) triangles have k-counts at least c Any v of k-clique, incident on Comb(k-1,2) triangles of that clique Any e of k-clique, incident on k-2 triangles of that clique argmax selects the largest k satisfying these condition (largest clique containing triangle) 3 3 k:
miniTriLA: Challenges 26 A = adjacency matrix for graph, B = incidence matrix for graph, 1 = vector of ones miniTri 1 C = A * B 2 t e = C * 1 3 t v = C T * 1 4 kcount(C, t e, t v ) Challenge 1: Computation difficult to load-balance Challenge 2: Graph BLAS approach forms C, which means storing all triangles in graph Worst case: O(|E| 3/2 ) triangles in graph, typical: triangles/edge Severely limits size of graph