Improving the Performance of Graph Analysis Through Partitioning with Sampling
Michael M. Wolf, Sandia National Laboratories
Benjamin A. Miller, MIT Lincoln Laboratory
IEEE HPEC 2015, September 16, 2015

Motivating Graph Analytics Applications
ISR: Graphs represent entities and relationships detected through multiple sources; 1,000s – 1,000,000s tracks and locations. GOAL: Identify anomalous patterns of life.
Social: Graphs represent relationships between individuals or documents; 10,000s – 10,000,000s individuals and interactions. GOAL: Identify hidden social networks.
Cyber: Graphs represent communication patterns of computers on a network; 1,000,000s – 1,000,000,000s network events. GOAL: Detect cyber attacks or malicious software.
Graph analysis has proven useful in a number of application areas, likely because of its ability to identify patterns of coordination. Applying graph analytics grows harder as the patterns of coordination become more subtle and the background graph grows larger and noisier. This work fits into a framework for detecting anomalies in massive datasets, in which the sparse matrix-vector (SpMV) product dominates runtime.
M. Wolf and B. Miller, "Sparse Matrix Partitioning for Parallel Eigenanalysis of Large Static and Dynamic Graphs," in IEEE HPEC, 2014.

MITLL Statistical Detection Framework for Graphs
(Figure: detection-theory diagram combining graph theory and detection theory; signal plus noise compared against a threshold under hypotheses H0 and H1.)
The graph analytics discussed here are part of an MIT Lincoln Laboratory effort called Signal Processing for Graphs (SPG), which aims to apply signal processing concepts to graph data: develop fundamental graph signal processing concepts, demonstrate them in simulation, and apply them to real data.
B.A. Miller, N.T. Bliss, P.J. Wolfe, and M.S. Beard, "Detection theory for graphs," Lincoln Laboratory Journal, vol. 20, no. 1, pp. 10–30, 2013.

Computational Focus: Dimensionality Reduction
Processing chain stages: graph model construction, residual decomposition, component selection, anomaly detection, identification, temporal integration; the decomposition stage solves an eigensystem.
Example: modularity matrix B = A - kk^T/(2|e|), where |e| is the number of edges in the graph G(A) and k is the degree vector (k_i = degree(v_i)); solve Bx = λx.
Dimensionality reduction dominates the SPG computation, and eigendecomposition is its key computational kernel. A parallel implementation is required for very large graph problems: the data must fit in memory and runtime must be minimized, so fast parallel eigensolvers are needed.
M. Wolf and B. Miller, "Sparse Matrix Partitioning for Parallel Eigenanalysis of Large Static and Dynamic Graphs," in IEEE HPEC, 2014.
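
A minimal serial sketch of this kernel, assuming SciPy is available (this is not the parallel implementation used in the paper; the function name is illustrative). It applies B = A - kk^T/(2|e|) implicitly so the sparse adjacency matrix is never densified:

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, eigsh

def leading_modularity_eigenvector(A):
    # A: symmetric sparse adjacency matrix (scipy.sparse)
    k = np.asarray(A.sum(axis=1)).ravel()    # degree vector k
    two_e = k.sum()                          # 2|e| (sum of degrees)

    def matvec(x):
        # B x = A x - k (k^T x) / (2|e|), without forming B densely
        return A @ x - k * (k @ x) / two_e

    B = LinearOperator(A.shape, matvec=matvec, dtype=float)
    vals, vecs = eigsh(B, k=1, which='LA')   # largest algebraic eigenpair
    return vals[0], vecs[:, 0]

# example: a small random symmetric graph
A = sp.random(1000, 1000, density=0.01, format='csr')
A = ((A + A.T) > 0).astype(float)
lam, x = leading_modularity_eigenvector(A)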

Outline
Motivation: Anomaly Detection in Very Large Graphs
Parallel Eigenanalysis Performance Challenges
Partitioning: Dynamic Graphs and Sampling
Impact of Graph Structure on Parallel SpMV Performance
Summary

Partitioning Objective
Ideally we would minimize the total execution time of SpMV, but we settle for easier objectives: balance the computational work and minimize a communication metric (total communication volume, number of messages).
Matrices can be partitioned in different ways (1D, 2D), and the problem can be modeled in different ways (graph, bipartite graph, hypergraph).
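
To make these metrics concrete, here is a simplified sketch (names are illustrative) that evaluates load balance, total communication volume, and message count for a given 1D row partition, counting only the expand phase and assuming the x and y vectors are distributed with the rows:

import numpy as np
import scipy.sparse as sp

def comm_metrics_1d_rows(A, row_part, P):
    # A: square sparse matrix; row_part[i] = process owning row i (and x[i], y[i])
    A = sp.csr_matrix(A)
    row_part = np.asarray(row_part)
    work = np.zeros(P)                    # nonzeros per process (computational load)
    needed = [set() for _ in range(P)]    # x entries each process must receive
    for i in range(A.shape[0]):
        p = row_part[i]
        cols = A.indices[A.indptr[i]:A.indptr[i + 1]]
        work[p] += len(cols)
        needed[p].update(j for j in cols if row_part[j] != p)
    volume = sum(len(s) for s in needed)                               # total x entries sent
    messages = sum(len(set(row_part[list(s)])) for s in needed if s)   # distinct senders per receiver
    imbalance = work.max() / work.mean()
    return volume, messages, imbalance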

1D Partitioning
1D Column: each process is assigned the nonzeros for a set of columns.
1D Row: each process is assigned the nonzeros for a set of rows.
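
For reference, the two 1D row assignments compared in the following slides (block and random) can be sketched as follows (illustrative helper functions, not the production code):

import numpy as np

def block_row_partition(n, P):
    # contiguous blocks: consecutive rows go to the same process
    sizes = np.diff(np.linspace(0, n, P + 1).astype(int))
    return np.repeat(np.arange(P), sizes)

def random_row_partition(n, P, seed=0):
    # rows assigned uniformly at random; smooths load for power-law graphs
    rng = np.random.default_rng(seed)
    return rng.integers(0, P, size=n)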

Weak Scaling: Runtime to Find 1st Eigenvector
Weak scaling on NERSC Hopper^, up to a 4-billion-vertex graph. In a weak scaling study, the amount of memory per core is kept constant as the number of cores increases; the hope is that the execution time remains constant as cores are added, although for many sparse problems this may not be possible.
R-MAT, 2^18 vertices/core; modularity matrix; NERSC Hopper^; 1D random partitioning. Solved the system for up to a 4-billion-vertex graph.
^ This research used resources of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.
M. Wolf and B. Miller, "Sparse Matrix Partitioning for Parallel Eigenanalysis of Large Static and Dynamic Graphs," in IEEE HPEC, 2014.

Strong Scaling: SpMV
Strong scaling result for the SpMV kernel on LLGrid. Performance degrades as the number of cores increases beyond a certain point (64): scalability is limited and runtime increases for large numbers of cores.
R-MAT, 2^23 vertices; NERSC Hopper*; 1D random partitioning.
SpMV = sparse matrix-vector multiplication.
* This research used resources of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.
M. Wolf and B. Miller, "Sparse Matrix Partitioning for Parallel Eigenanalysis of Large Static and Dynamic Graphs," in IEEE HPEC, 2014.

Communication Pattern: 1D Block Partitioning
2D finite difference matrix (9-point stencil); number of rows: 2^23; nonzeros/row: 9.
NNZ/process: min 1.17E+06, max 1.18E+06, avg 1.18E+06, max/avg 1.00. # Messages (Phase 1): total 126, max 2. Volume (Phase 1): total 2.58E+05, max 4.10E+03.
(Figure: communication matrix, source process vs. destination process, P = 64.)
Communication pattern for the SpMV operation resulting from 1D block partitioning of a 2D finite difference matrix. These very regular matrices have very nice properties for obtaining good parallel performance: great load balance, a small number of messages, and low communication volume.
M. Wolf and B. Miller, "Sparse Matrix Partitioning for Parallel Eigenanalysis of Large Static and Dynamic Graphs," in IEEE HPEC, 2014.

Communication Pattern: 1D Random Partitioning
R-MAT (0.5, 0.125, 0.125, 0.25); number of rows: 2^23; nonzeros/row: 8.
NNZ/process: min 1.05E+06, max 1.07E+06, avg 1.06E+06, max/avg 1.01. # Messages (Phase 1): total 4032, max 63. Volume (Phase 1): total 5.48E+07, max 8.62E+05.
(Figure: communication matrix, source process vs. destination process, P = 64; all-to-all communication.)
Communication pattern for the SpMV operation resulting from 1D random partitioning of an R-MAT matrix. These power-law graph matrices have very challenging properties that make obtaining good parallel performance difficult. Note the all-to-all communication, which can be prohibitively expensive for large numbers of processes. Random partitioning (compared to block) smooths out the communication hot spots and improves the computational load balance.
Nice properties: great load balance. Challenges: all-to-all communication.
M. Wolf and B. Miller, "Sparse Matrix Partitioning for Parallel Eigenanalysis of Large Static and Dynamic Graphs," in IEEE HPEC, 2014.

2D Partitioning
More flexibility: no particular part owns an entire row/column; more general sets of nonzeros. Use the flexibility of 2D partitioning to bound the number of messages.
2D Random Cartesian*: block Cartesian with rows/columns randomly distributed; cyclic striping to minimize the number of messages.
2D Cartesian Hypergraph**: use hypergraph partitioning to minimize communication volume. Con: more costly to partition than random.
1D partitioning tends not to be scalable for these power-law graph matrices. Alternatively, we can try 2D partitioning, where entire rows/columns are no longer assigned to a particular process, giving more flexibility. Several 2D methods have been developed: fine-grain hypergraph, Mondriaan (Vastenhouw and Bisseling, 2005: recursive bisection hypergraph partitioning, with 1D row or column partitioning at each level), coarse-grain, and many others. An important class of 2D methods are the "2D Cartesian" type methods that bound the number of messages communicated in the resulting SpMV operation; coarse-grain hypergraph partitioning also falls into this category.
*Hendrickson, et al.; Bisseling; Yoo, et al.
**Boman, Devine, Rajamanickam, "Scalable Matrix Computations on Large Scale-Free Graphs Using 2D Partitioning," SC2013.
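
A simplified sketch of the 2D random Cartesian idea (illustrative only; it omits the cyclic striping mentioned above and is not the Zoltan PHG implementation): randomly relabel rows/columns, then assign nonzero (i, j) to the process at grid position (row block of i, column block of j) on a √P x √P process grid, so each process communicates with at most √P - 1 others per phase (consistent with the max of 7 messages for P = 64 on the next slides):

import numpy as np

def cartesian_2d_random(n, P, seed=0):
    # returns owner(i, j): process owning nonzero (i, j) on a sqrt(P) x sqrt(P) grid
    q = int(np.sqrt(P))
    assert q * q == P, "sketch assumes a square process grid"
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)                                 # random row/column relabeling
    bounds = np.linspace(0, n, q + 1).astype(int)             # block boundaries
    block = np.searchsorted(bounds, perm, side='right') - 1   # block index per row/col
    def owner(i, j):
        return block[i] * q + block[j]                        # process grid, row-major
    return owner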

Communication Pattern: 2D Random Partitioning, Cartesian Blocks (2DR)
R-MAT (0.5, 0.125, 0.125, 0.25); number of rows: 2^23; nonzeros/row: 8.
NNZ/process: min 1.04E+06, max 1.05E+06, avg 1.05E+06, max/avg 1.01. # Messages (Phase 1): total 448, max 7. Volume (Phase 1): total 2.57E+07, max 4.03E+05.
(Figure: communication matrix, source process vs. destination process, P = 64.)
Nice properties: no all-to-all communication; total volume lower than 1DR.
1DR = 1D Random.

Communication Pattern: 2D Random Partitioning, Cartesian Blocks (2DR)
R-MAT (0.5, 0.125, 0.125, 0.25); number of rows: 2^23; nonzeros/row: 8.
NNZ/process: min 1.04E+06, max 1.05E+06, avg 1.05E+06, max/avg 1.01. # Messages (Phase 2): total 448, max 7. Volume (Phase 2): total 2.57E+07, max 4.03E+05.
(Figure: communication matrix, source process vs. destination process, P = 64.)
Nice properties: no all-to-all communication; total volume lower than 1DR.
1DR = 1D Random.

Communication Pattern: 2D Cartesian Hypergraph Partitioning
R-MAT (0.5, 0.125, 0.125, 0.25); number of rows: 2^23; nonzeros/row: 8.
NNZ/process: min 5.88E+05, max 1.29E+06, avg 1.05E+06, max/avg 1.23. # Messages (Phase 1): total 448, max 7. Volume (Phase 1): total 2.33E+07, max 4.52E+05.
(Figure: communication matrix, source process vs. destination process, P = 64.)
For this problem the total volume is actually slightly higher than with 2D random Cartesian blocks; however, for other problems it is often lower.
Nice properties: no all-to-all communication; total volume lower than 2DR. Challenges: imbalance worse than 2DR.
2DR = 2D Random Cartesian.

Improved Strong Scaling: SpMV
Strong scaling of the SpMV operation isolated from the eigensolver. Even a simple 2D distribution can improve scalability; the 2D methods show improved scalability.
R-MAT, 2^23 vertices/rows; NERSC Hopper^; Zoltan PHG hypergraph partitioner.
^ This research used resources of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.
*Hendrickson, et al.; Bisseling; Yoo, et al.
**Boman, Devine, Rajamanickam, "Scalable Matrix Computations on Large Scale-Free Graphs Using 2D Partitioning," SC2013.
M. Wolf and B. Miller, "Sparse Matrix Partitioning for Parallel Eigenanalysis of Large Static and Dynamic Graphs," in IEEE HPEC, 2014.

Improved Strong Scaling: Eigensolver
Runtime to find the 1st eigenvector. Even a simple 2D distribution can improve scalability; the 2D methods show improved scalability.
R-MAT, 2^23 vertices; modularity matrix; NERSC Hopper*.
* This research used resources of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.
M. Wolf and B. Miller, "Sparse Matrix Partitioning for Parallel Eigenanalysis of Large Static and Dynamic Graphs," in IEEE HPEC, 2014.

Challenge with Hypergraph Partitioning
The high partitioning cost of hypergraph methods must be amortized over many SpMV operations: here, roughly 40,000 SpMVs (R-MAT, 2^23 vertices, 1024 cores, NERSC Hopper). Anomaly detection*, however, requires at most thousands of SpMV operations.
*L1 norm method: computing 100 eigenvectors.
M. Wolf and B. Miller, "Sparse Matrix Partitioning for Parallel Eigenanalysis of Large Static and Dynamic Graphs," in IEEE HPEC, 2014.
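
The amortization argument can be written as a simple break-even count (a sketch with hypothetical timing variables; the ~40,000 SpMVs above is the measured figure for this configuration, not something computed here):

def breakeven_spmvs(t_part_hg, t_part_baseline, t_spmv_baseline, t_spmv_hg):
    # Hypergraph partitioning pays off once per-SpMV savings recover its extra setup cost:
    # n* = (T_part_hg - T_part_baseline) / (t_spmv_baseline - t_spmv_hg)
    return (t_part_hg - t_part_baseline) / (t_spmv_baseline - t_spmv_hg)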

Outline
Motivation: Anomaly Detection in Very Large Graphs
Parallel Eigenanalysis Performance Challenges
Partitioning: Dynamic Graphs and Sampling
Impact of Graph Structure on Parallel SpMV Performance
Summary

Sampling and Partitioning for Web/SN Graphs
(Figure: pipeline diagram. Input graph G = (V, E) → sample E to get G' = (V, E') with |E'| ≈ α|E|, where α is the sampling rate → partition G' into G1' = (V1, E1'), G2' = (V2, E2') → apply the partition to G, giving G1 = (V1, E1), G2 = (V2, E2).)
Sampling + partitioning: produce a smaller graph G' by sampling edges of G (uniform random sampling), keeping the vertex set the same; partition G' (2D Cartesian hypergraph, Zoltan PHG); apply the resulting partition to G.
Idea: partition the sampled graph to reduce partitioning time.
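
A minimal sketch of this sample-then-partition pipeline (uniform random edge sampling; partition_hypergraph is a hypothetical stand-in for the Zoltan PHG 2D Cartesian hypergraph partitioner used here):

import numpy as np

def sample_edges(edges, alpha, seed=0):
    # edges: numpy array of shape (m, 2); keep each edge with probability alpha, so |E'| ≈ alpha|E|
    rng = np.random.default_rng(seed)
    return edges[rng.random(len(edges)) < alpha]

def sample_and_partition(edges, num_vertices, P, alpha, partition_hypergraph):
    # 1. build smaller graph G' = (V, E') by sampling edges (vertex set unchanged)
    sampled = sample_edges(edges, alpha)
    # 2. partition G' (hypothetical stand-in for Zoltan PHG)
    part = partition_hypergraph(sampled, num_vertices, P)
    # 3. apply the resulting vertex-to-process assignment to the full graph G
    return part     # part[v] = process owning vertex v, reused for all of E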

Partitioning + Sampling: Partitioning Time
2D Cartesian hypergraph; hollywood-2009**: actor network, 1.1 M vertices, 110 M edges; NERSC Hopper*.
Edge sampling greatly reduces partitioning time (by up to 6x).
* This research used resources of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.
**The University of Florida Sparse Matrix Collection.

Partitioning + Sampling: SpMV Time
2D Cartesian hypergraph; hollywood-2009**: actor network, 1.1 M vertices, 110 M edges; NERSC Hopper*.
The resulting SpMV time does not increase for modest sampling.
* This research used resources of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.
**The University of Florida Sparse Matrix Collection.

Challenge with Hypergraph Partitioning Revisited
2DR = 2D Cartesian Random; 2DH = 2D Cartesian Hypergraph (Zoltan PHG). hollywood-2009, 1024 cores, NERSC Hopper*, sampling rate 0.1.
Sampling reduces the overhead of hypergraph partitioning: fewer SpMVs are needed to amortize the partitioning cost (~100,000 SpMVs vs. ~10,000 SpMVs).
* This research used resources of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.

Outline
Motivation: Anomaly Detection in Very Large Graphs
Parallel Eigenanalysis Performance Challenges
Partitioning: Dynamic Graphs and Sampling
Impact of Graph Structure on Parallel SpMV Performance
Summary

Experiments
We want to determine the impact of both data features and sampling-method features on SpMV performance.
Three graph models (2^20 vertices, average degree 32):
Erdős–Rényi: neither community structure nor skewed degree distribution.
Chung–Lu: skewed degree distribution, no community structure; average degree distribution based on a beta distribution.
Stochastic blockmodel: community structure, no skewed degree distribution; 32 communities, internal connections 15 times more likely.
Three sampling strategies (sketched below):
Uniformly sample edges.
Sample edges based on degree (edges connected to high-degree vertices are more likely to be sampled).
Sample edges based on inverse degree (edges connected to high-degree vertices are less likely to be sampled).
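
An illustrative sketch of the three edge-sampling strategies (the exact weighting used in the experiments is not specified here; this version assumes weights based on the sum of endpoint degrees):

import numpy as np

def sample_edges_weighted(edges, degrees, alpha, mode="uniform", seed=0):
    # edges: (m, 2) array; degrees: per-vertex degree array; alpha: target sampling rate
    rng = np.random.default_rng(seed)
    if mode == "uniform":
        w = np.ones(len(edges))
    elif mode == "degree":           # edges touching high-degree vertices more likely
        w = degrees[edges[:, 0]] + degrees[edges[:, 1]]
    elif mode == "inverse_degree":   # edges touching high-degree vertices less likely
        w = 1.0 / (degrees[edges[:, 0]] + degrees[edges[:, 1]])
    keep = rng.random(len(edges)) < alpha * w / w.mean()   # expected |E'| ≈ alpha|E|
    return edges[keep]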

Results
(Plots for the Erdős–Rényi, Chung–Lu, and stochastic blockmodel graphs; NERSC Hopper*.)
* This research used resources of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.

R-MAT Experiment
(Plots: partitioning time and SpMV time relative to unsampled partitioning; SpMV time relative to 2D random time. R-MAT quadrant labels a, b, c, d.)
R-MAT graphs: 1 M vertices, 100 M edges. R-MAT parameters: a; b = c = (0.75 - a)/2; d = 0.25. Number of processors: 64; NERSC Hopper*; 2D Cartesian hypergraph (Zoltan PHG).
Sampling greatly reduces partitioning time (by up to 100x). SpMV runtime is decreased (over 2D random) for graphs with more community-like structure.
* This research used resources of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.
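
For orientation, this is the standard R-MAT recursion with the parameter pattern above (a sketch, not necessarily the exact generator used in the experiments): each edge repeatedly picks one of four quadrants with probabilities a, b, c, d.

import numpy as np

def rmat_edge(scale, a, rng):
    # one R-MAT edge in a 2^scale x 2^scale adjacency matrix;
    # quadrant probabilities: a (top-left), b = c = (0.75 - a) / 2, d = 0.25
    b = c = (0.75 - a) / 2
    i = j = 0
    for _ in range(scale):
        r = rng.random()
        i, j = 2 * i, 2 * j
        if r < a:            pass                  # top-left quadrant
        elif r < a + b:      j += 1                # top-right quadrant
        elif r < a + b + c:  i += 1                # bottom-left quadrant
        else:                i, j = i + 1, j + 1   # bottom-right quadrant (d = 0.25)
    return i, j

rng = np.random.default_rng(0)
edges = [rmat_edge(20, 0.5, rng) for _ in range(100)]   # e.g. scale 20 gives 2^20 vertices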

Discussion
As community structure increases, the benefit of hypergraph partitioning also increases.
Sampling affects SpMV performance in a non-monotonic way; this will require further investigation.
Chung–Lu graphs benefit more from "degree" sampling than "inverse degree" sampling, suggesting that high-degree vertices are important for partitioning.
SpMV of R-MAT graphs improves significantly at some sampling rates; this may be an artifact of the R-MAT generation process.

Summary
Partitioning for extreme-scale problems and architectures is challenging: we need high-quality partitions and fast partitioners.
Possible approach: sampling and partitioning, which gave a significant improvement in hypergraph partitioning performance for web/SN graphs.
Question: what is the impact of graph structure on partitioning? Sampling consistently reduces partitioning time, but it may increase or decrease SpMV time, and this appears highly dependent on aspects of the graph structure.
Future work: larger problem set and larger problems; more intelligent sampling techniques; experiments using real data with varying characteristics.