Photos placed in horizontal position with even amount of white space between photos and header Sandia National Laboratories is a multi-program laboratory.

Slides:



Advertisements
Similar presentations
Copyright 2011, Data Mining Research Laboratory Fast Sparse Matrix-Vector Multiplication on GPUs: Implications for Graph Mining Xintian Yang, Srinivasan.
Advertisements

A NOVEL APPROACH TO SOLVING LARGE-SCALE LINEAR SYSTEMS Ken Habgood, Itamar Arel Department of Electrical Engineering & Computer Science GABRIEL CRAMER.
Efficient Sparse Matrix-Matrix Multiplication on Heterogeneous High Performance Systems AACEC 2010 – Heraklion, Crete, Greece Jakob Siegel 1, Oreste Villa.
Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation,
Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation,
Unstructured Data Partitioning for Large Scale Visualization CSCAPES Workshop June, 2008 Kenneth Moreland Sandia National Laboratories Sandia is a multiprogram.
Problem Uncertainty quantification (UQ) is an important scientific driver for pushing to the exascale, potentially enabling rigorous and accurate predictive.
OpenFOAM on a GPU-based Heterogeneous Cluster
Parallel Algorithms in STAPL Implementation and Evaluation Jeremy Vu, Mauro Bianco, Nancy Amato Parasol Lab, Department of Computer.
Revisiting a slide from the syllabus: CS 525 will cover Parallel and distributed computing architectures – Shared memory processors – Distributed memory.
Exploring Communication Options with Adaptive Mesh Refinement Courtenay T. Vaughan, and Richard F. Barrett Sandia National Laboratories SIAM Computational.
Advanced Topics in Algorithms and Data Structures An overview of the lecture 2 Models of parallel computation Characteristics of SIMD models Design issue.
Graph Analysis with High Performance Computing by Bruce Hendrickson and Jonathan W. Berry Sandria National Laboratories Published in the March/April 2008.
Chapter 9 Graph algorithms. Sample Graph Problems Path problems. Connectedness problems. Spanning tree problems.
Models of Parallel Computation Advanced Algorithms & Data Structures Lecture Theme 12 Prof. Dr. Th. Ottmann Summer Semester 2006.
Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy under contract.
SAND Number: P Sandia is a multi-program laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department.
To GPU Synchronize or Not GPU Synchronize? Wu-chun Feng and Shucai Xiao Department of Computer Science, Department of Electrical and Computer Engineering,
Improved results for a memory allocation problem Rob van Stee University of Karlsruhe Germany Leah Epstein University of Haifa Israel WADS 2007 WAOA 2007.
Graph Algorithms. Overview Graphs are very general data structures – data structures such as dense and sparse matrices, sets, multi-sets, etc. can be.
Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear.
L11: Sparse Linear Algebra on GPUs CS Sparse Linear Algebra 1 L11: Sparse Linear Algebra CS6235
Combinatorial Scientific Computing is concerned with the development, analysis and utilization of discrete algorithms in scientific and engineering applications.
LTE Review (September 2005 – January 2006) January 17, 2006 Daniel M. Dunlavy John von Neumann Fellow Optimization and Uncertainty Estimation (1411) (8962.
Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation,
Dax: Rethinking Visualization Frameworks for Extreme-Scale Computing DOECGF 2011 April 28, 2011 Kenneth Moreland Sandia National Laboratories SAND P.
Are their more appropriate domain-specific performance metrics for science and engineering HPC applications available then the canonical “percent of peak”
Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear.
Principles of Scalable HPC System Design March 6, 2012 Sue Kelly Sandia National Laboratories Abstract: Sandia National.
1 SOC Test Architecture Optimization for Signal Integrity Faults on Core-External Interconnects Qiang Xu and Yubin Zhang Krishnendu Chakrabarty The Chinese.
Data Intensive Computing at Sandia September 15, 2010 Andy Wilson Senior Member of Technical Staff Data Analysis and Visualization Sandia National Laboratories.
Software Pipelining for Stream Programs on Resource Constrained Multi-core Architectures IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEM 2012 Authors:
Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear.
Efficient and Scalable Computation of the Energy and Makespan Pareto Front for Heterogeneous Computing Systems Kyle M. Tarplee 1, Ryan Friese 1, Anthony.
The Red Storm High Performance Computer March 19, 2008 Sue Kelly Sandia National Laboratories Abstract: Sandia National.
Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation,
Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear.
Floating-Point Reuse in an FPGA Implementation of a Ray-Triangle Intersection Algorithm Craig Ulmer June 27, 2006 Sandia is a multiprogram.
Graph Algorithms. Definitions and Representation An undirected graph G is a pair (V,E), where V is a finite set of points called vertices and E is a finite.
After step 2, processors know who owns the data in their assumed partitions— now the assumed partition defines the rendezvous points Scalable Conceptual.
Strategies for Solving Large-Scale Optimization Problems Judith Hill Sandia National Laboratories October 23, 2007 Modeling and High-Performance Computing.
Compiling Several Classes of Communication Patterns on a Multithreaded Architecture Gagan Agrawal Department of Computer and Information Sciences Ohio.
Graphulo- 1 Graph Analytics in GraphBLAS Jeremy Kepner, Vijay Gadepally, Ben Miller 2014 December This material is based upon work supported by the National.
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 1 Basic Parallel Programming Concepts Computational.
JAVA AND MATRIX COMPUTATION
Task Graph Scheduling for RTR Paper Review By Gregor Scott.
1/24 Introduction to Graphs. 2/24 Graph Definition Graph : consists of vertices and edges. Each edge must start and end at a vertex. Graph G = (V, E)
Implicit Hitting Set Problems Richard M. Karp Erick Moreno Centeno DIMACS 20 th Anniversary.
Partitioning using Mesh Adjacencies  Graph-based dynamic balancing Parallel construction and balancing of standard partition graph with small cuts takes.
Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation,
Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation,
08/10/ NRL Hybrid QR Factorization Algorithm for High Performance Computing Architectures Peter Vouras Naval Research Laboratory Radar Division Professor.
Photos placed in horizontal position with even amount of white space between photos and header Sandia National Laboratories is a multi-program laboratory.
Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation,
Compressing Bi-Level Images by Block Matching on a Tree Architecture Sergio De Agostino Computer Science Department Sapienza University of Rome ITALY.
Uses some of the slides for chapters 3 and 5 accompanying “Introduction to Parallel Computing”, Addison Wesley, 2003.
What’s New for Epetra Michael A. Heroux Sandia National Laboratories Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin.
Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear.
OpenMP Runtime Extensions Many core Massively parallel environment Intel® Xeon Phi co-processor Blue Gene/Q MPI Internal Parallelism Optimizing MPI Implementation.
Performing Fault-tolerant, Scalable Data Collection and Analysis James Jolly University of Wisconsin-Madison Visualization and Scientific Computing Dept.
On the Path to Trinity - Experiences Bringing Codes to the Next Generation ASC Platform Courtenay T. Vaughan and Simon D. Hammond Sandia National Laboratories.
Photos placed in horizontal position with even amount of white space between photos and header Sandia National Laboratories is a multi-program laboratory.
LLNL-PRES This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.
COMP8330/7330/7336 Advanced Parallel and Distributed Computing Decomposition and Parallel Tasks (cont.) Dr. Xiao Qin Auburn University
Miraj Kheni Authors: Toyotaro Suzumura, Koji Ueno
Conception of parallel algorithms
Auburn University COMP7330/7336 Advanced Parallel and Distributed Computing Mapping Techniques Dr. Xiao Qin Auburn University.
= Michael M. Wolf, Sandia National Laboratories
Ray-Cast Rendering in VTK-m
Big Data Analytics: Exploring Graphs with Optimized SQL Queries
Presentation transcript:

Photos placed in horizontal position with even amount of white space between photos and header Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL SAND NO C miniTri and a Task-Based Linear Algebra Building Blocks Approach for Scalable Graph Analytics Michael Wolf, Jon Berry, Dylan Stark IEEE HPEC 16 September, 2015

Miniapps  Small, self contained applications  Proxy for important characteristics of full applications  Important co-design tool: performance analysis  Target 1: existing applications on new and future architectures  Target 2: new applications on existing architectures  Strong partnership with industry  Mantevo MiniApp Suite (Sandia): mantevo.org 2 phdMesh (Mantevo)

Graph BLAS  Effort to standardize building blocks for graph algorithms in language of linear algebra  Overloaded linear algebra kernels express most graph computations  Matrix-graph duality – highly impactful in CS&E (partitioning, solvers,…)  Promising for data sciences but many challenges  Graph BLAS BoF – Buluc, 18:00 Eden Vale (C1) 3 CS&E = Computational Science and Engineering Eds. Kepner, Gilbert

Task Parallelism 4 Dependency between tasks T4 T2 T5T6T7T8T1T2 T3 T4 T1 T3 T1 T2 T3 T4T5 T6 T4 T7T8 T3 T2T3T4T1 T3 T4T5T6 T7T8T1 T2 Core 0 Core 1Core 2 Core 3 Kernel boundaries  Dataflow of application expressed through tasks/dependencies (Avoid explicit barriers between kernels)  Overdecomposition of problem into tasks (# tasks > #cores)  Tasks scheduled and moved to appropriate compute resources

Task Parallelism 5 Dependency between tasks T4 T2 T5T6T7T8T1T2 T3 T4 T1 T3 T1 T2 T3 T4T5 T6 T4 T7T8 T3 T2T3T4T1 T3 T4T5T6 T7T8T1 T2 Core 0 Core 1Core 2 Core 3 Kernel boundaries  Natural for data analytics (model intrinsically data-centric)  Several task parallel models/libraries: HPX, Kokkos/Qthreads, Uintah, Legion, …

 Background  miniTri  Linear Algebra-Based miniTri (miniTriLA)  Task Parallel Approach to miniTriLA  Summary Outline 6

miniTri: Data Analytics Miniapp 7  Triangle-based miniapp  Proxy (Mantevo) for triangle based data analytics (e.g., graph noise- filtering)  miniTri Steps:  For each triangle, calculate triangle degrees for vertices and edges  For each triangle, calculate integer k given triangle degree info 3 3 miniTri poses different challenges than traditional data analytics benchmarks such as Graph 500. miniTri poses different challenges than traditional data analytics benchmarks such as Graph 500. k: Max clique size of given triangle

miniTri: Overview 8  Developed 12+ variants  Different methods: buckets data structure, set intersection, linear algebra  Different programming models: OpenMP, MPI, HPX, Kokkos/Qthreads  Focus of talk: linear algebra-based miniTri  Related triangle enumeration work  Azad, Buluç, Gilbert. “Parallel triangle counting and enumeration using matrix algebra”, Proc. of the IPDPSW, GABB Workshop, Challenge: Can miniTri be implemented efficiently using Graph BLAS-like building blocks? Challenge: Can miniTri be implemented efficiently using Graph BLAS-like building blocks? k:

 Background  miniTri  Linear Algebra-Based miniTri (miniTriLA)  Task Parallel Approach to miniTriLA  Summary Outline 9

Linear Algebra Based miniTri (miniTriLA)  Developed Graph BLAS-like formulation of miniTri  Important stressor of Graph BLAS 10 A = adjacency matrix for graph, B = incidence matrix for graph, 1 = vector of ones miniTri 1 C = A * B 2 t v = C * 1 3 t e = C T * 1 4 kcount(C, t v, t e ) Adjacency matrix of G Incidence matrix of G G

miniTriLA: Triangle Enumeration 11 A = adjacency matrix for graph, B = incidence matrix for graph, 1 = vector of ones miniTri 1 C = A * B 2 t v = C * 1 3 t e = C T * 1 4 kcount(C, t v, t e )  Wedges implicitly stored in rows of A  Adj matrix: wedges = pairwise combinations of nonzero column ids + row #  E.g., row 5 wedges: {1,5,2}, {1,5,3}, {1,5,4}, {2,5,3}, {2,5,4}, {3,5,4} Enumerates each triangle 3 times (once: C=L*B, where L is lower triangle part of A) SPGEMM = Sparse Matrix Multiplication G Adjacency matrix of G Incidence matrix of G Matrix of Triangles

miniTriLA: Triangle Enumeration 12 A = adjacency matrix for graph, B = incidence matrix for graph, 1 = vector of ones miniTri 1 C = A * B 2 t v = C * 1 3 t e = C T * 1 4 kcount(C, t v, t e )  Columns of B tell us whether wedge is “closed” to form triangle  Overloaded SpGEMM yields triangle enumeration:  If A i,x =A i,y = 1 and B x,j =B y,j = 1 (or equivalently A i,* B *,j =2), C i,j = triangle {i,x,y}  Else, no triangle Enumerates each triangle 3 times (once: C=L*B, where L is lower triangle part of A) SPGEMM = Sparse Matrix Multiplication G Adjacency matrix of G Incidence matrix of G Matrix of Triangles

miniTriLA: Triangle Degree Calculation 13 A = adjacency matrix for graph, B = incidence matrix for graph, 1 = vector of ones miniTri 1 C = A * B 2 t v = C * 1 3 t e = C T * 1 4 kcount(C, t v, t e )  Triangle degree calculation  Each triangle represented once in C for each of its edges and vertices  Triangle vertex and edge degrees are number of nz in rows and columns Triangles with v 4 Triangles with e 4,5 C= Overloaded SpMV to count triangles in rows

miniTriLA: Kcounts 14 A = adjacency matrix for graph, B = incidence matrix for graph, 1 = vector of ones miniTri 1 C = A * B 2 t e = C * 1 3 t v = C T * 1 4 kcount(C, t e, t v ) K123 triangle count 002  Compute k for each triangle (summarize in table)  k-count table gives us upper bound on largest clique in graph  Largest c such that Comb(c,3) triangles have k-counts at least c G k:

miniTriLA: Challenges 15 A = adjacency matrix for graph, B = incidence matrix for graph, 1 = vector of ones miniTri 1 C = A * B 2 t e = C * 1 3 t v = C T * 1 4 kcount(C, t e, t v )  Challenge 1: Computation difficult to load-balance  Challenge 2: Typical graph BLAS approach forms C, which means storing all triangles in graph  Worst case: O(|E| 3/2 ) triangles in graph, typical: triangles/edge  Severely limits size of graph

 Background  miniTri  Linear Algebra-Based miniTri (miniTriLA)  Task Parallel Approach to miniTriLA  Summary Outline 16

Task Parallel Approach  Each block of linear algebra operations assigned to a task  Block of rows, 2D block of elements  Global barriers between kernels removed  Replaced by dependencies between tasks  Tasks in subsequent kernels can start/finish before first kernel finishes 17 Illustrative Example of GraphBLAS Approaches Traditional data parallel Task parallel

Task Parallel Implementation  Implemented Task Parallel miniTriLA  HPX (LSU), Kokkos/Qthreads (Sandia)  Addressing Challenge #1: Load imbalance  Task parallelism can be used to improve load balance 18 Initial results show slight improvement of task parallel approach over data parallel approach

Addressing Challenge #2: Storing all Triangles  Key insight: Task parallelism can be used in resource constrained environments to reduce memory footprint  Prioritize k-count tasks to free blocks of triangles from memory  Need runtime system to support advanced resource management/priorities (ongoing effort: HPX and Kokkos/Qthreads) 19 Prioritize Task parallel approach allows Graph BLAS implementation of miniTri to solve much larger problems

 Background  miniTri  Linear Algebra-Based miniTri (miniTriLA)  Task Parallel Approach to miniTriLA  Summary Outline 20

Summary  Overview of new data analytics miniapp miniTri  Will be released soon as part of Mantevo  Presented linear algebra-based formulation of miniTri  miniTri in 4 compact linear algebra-based operations  Graph Algorithm Building Block (GABB) for triangle enumeration  GABBs for calculating triangle vertex and edge degree  miniTri poses challenges for Graph BLAS-like implementations  Load balancing, Memory usage  Presented task parallel approach that addresses these challenges  Asynchrony is key  Can use task priorities to constrain memory usage 21

Additional Info/Resources  miniTri will be released soon under Mantevo  Waiting for copyright approval  mantevo.org has several other miniapps  Graph BLAS   HPX  LSU –  HPX-5 (IU) –  Kokkos   Qthreads  22

Extra 23

Traditional Data Parallelism 24 Data distributed across cores Global barriers between kernels Data distributed across cores Global barriers between kernels C1 Core 0 Core 1Core 2 Core 3 Kernel boundaries C2 C3C4 C1C2 C3C4 C1C2 C3 C4 C1 C2 C3 C4 C1C2 C3C4

kcount 25 Upper bound on largest clique in graph = largest c such that Comb(c,3) triangles have k-counts at least c Any v of k-clique, incident on Comb(k-1,2) triangles of that clique Any e of k-clique, incident on k-2 triangles of that clique argmax selects the largest k satisfying these condition (largest clique containing triangle) 3 3 k:

miniTriLA: Challenges 26 A = adjacency matrix for graph, B = incidence matrix for graph, 1 = vector of ones miniTri 1 C = A * B 2 t e = C * 1 3 t v = C T * 1 4 kcount(C, t e, t v )  Challenge 1: Computation difficult to load-balance  Challenge 2: Graph BLAS approach forms C, which means storing all triangles in graph  Worst case: O(|E| 3/2 ) triangles in graph, typical: triangles/edge  Severely limits size of graph