CS 140: Sparse Matrix-Vector Multiplication and Graph Partitioning


CS 140: Sparse Matrix-Vector Multiplication and Graph Partitioning

Parallel sparse matrix-vector product
- Lay out the matrix and vectors by rows: y(i) = sum over j of A(i,j)*x(j), computing only the terms with A(i,j) ≠ 0.
- Algorithm: each processor i broadcasts x(i), then computes y(i) = A(i,:)*x.
- Communication volume v = p*n (way too much!).
- Reducing communication volume: only send each processor the parts of x it needs, and reorder the matrix for better locality by graph partitioning. (A sketch follows.)
[Figure: x, y, and the matrix laid out by block rows across processors P0-P3]
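Below is a minimal sketch (in Python with scipy, not the course's code) of the row-layout algorithm with the communication-reducing variant: each simulated "processor" gathers only the entries of x that its block row actually references. The function name and the blocking scheme are illustrative assumptions.

    # Sketch of SpMV by block rows; "processors" are simulated by a loop.
    import numpy as np
    from scipy.sparse import csr_matrix, random as sprandom

    def spmv_by_block_rows(A, x, p):
        """y = A*x with A split into p block rows; each block gathers only
        the x entries it needs (the communication-volume optimization)."""
        n = A.shape[0]
        bounds = np.linspace(0, n, p + 1, dtype=int)
        y = np.zeros(n)
        for i in range(p):
            Ai = A[bounds[i]:bounds[i+1], :]   # processor i's block row
            needed = np.unique(Ai.indices)     # columns with nonzeros: the
                                               # only x entries to "receive"
            y[bounds[i]:bounds[i+1]] = Ai[:, needed] @ x[needed]
        return y

    A = csr_matrix(sprandom(8, 8, density=0.3, random_state=0))
    x = np.ones(8)
    assert np.allclose(spmv_by_block_rows(A, x, 4), A @ x)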

2D block decomposition for a 5-point stencil
- n stencil cells, p processors; each processor owns a patch of n/p cells.
- Block-row (or block-column) layout: communication volume v = 2 * p * sqrt(n).
- 2D block layout: v = 4 * sqrt(p) * sqrt(n).
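For concreteness, here is the arithmetic for a sample mesh; the numbers (a 1024 x 1024 mesh on 64 processors) are chosen only for illustration.

    # Comparing the two communication-volume formulas above.
    import math

    n, p = 1024**2, 64                       # 1024 x 1024 mesh, 64 processors
    v_1d = 2 * p * math.sqrt(n)              # block-row layout: 131072 words
    v_2d = 4 * math.sqrt(p) * math.sqrt(n)   # 2D block layout:   32768 words
    print(v_1d / v_2d)                       # the 2D layout sends 4x less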

Graphs and Sparse Matrices
- A sparse matrix is a representation of a (sparse) graph.
- Matrix entries are edge weights; the number of nonzeros per row is the vertex degree.
- Edges represent the data dependencies in matrix-vector multiplication.
[Figure: a 6-node example graph and its 6 x 6 matrix of nonzeros]
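As a small illustration of the correspondence (the graph here is a 6-node ring chosen for the example, not necessarily the slide's figure), assuming Python with scipy:

    # Build the structure matrix of a graph; vertex degree = nonzeros per row.
    import numpy as np
    from scipy.sparse import csr_matrix

    edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]   # a 6-cycle
    rows = [u for u, v in edges] + [v for u, v in edges]       # symmetric:
    cols = [v for u, v in edges] + [u for u, v in edges]       # both directions
    A = csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(6, 6))

    print(np.diff(A.indptr))   # nonzeros per row = vertex degrees: [2 2 2 2 2 2]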

Sparse Matrix Vector Multiplication

Graph partitioning
- Assigns subgraphs to processors: determines parallelism and locality.
- Tries to make all subgraphs the same size (load balance).
- Tries to minimize edge crossings (communication). Nodes and edges in the graph may be weighted.
- Exact minimization is NP-complete, but good graph partitioning techniques exist. More on this in a later lecture.
[Figure: two partitions of the same graph, with edge crossings = 6 and edge crossings = 10]
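The quantity being minimized is easy to state in code; a minimal sketch (the names here are illustrative):

    # Count "edge crossings": edges whose endpoints are on different processors.
    def cut_size(edges, part):
        """part[v] gives the processor of vertex v."""
        return sum(1 for u, v in edges if part[u] != part[v])

    edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
    print(cut_size(edges, {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}))   # 2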

Applications of graph partitioning
- Telephone network design: the original application! The 1970 algorithm is due to Kernighan & Lin.
- Sparse matrix times vector multiplication: to solve Ax = b, partial differential equations, eigenvalue problems, ... Here N = {1, ..., n}, (j,k) in E if A(j,k) is nonzero, WN(j) = #nonzeros in row j, WE(j,k) = 1.
- Data mining and clustering.
- Physical mapping of DNA.
- VLSI layout.
- Sparse Gaussian elimination: reorder matrix rows and columns to decrease "fill" in the factors.
- Load balancing while minimizing communication.
- ...

Graph partitioning demo
> load meshes
> gplotg(Airfoil,Axy)
> specdice(Airfoil,Axy,5)
> meshdemo
[Figures: spectral bisection of the Airfoil mesh, and a 32-way partition]

Partitioning by Repeated Bisection
To partition into 2^k parts, bisect the graph recursively k times.

Recursive Bisection
- Recursive bisection approach: partition the data into two sets, then recursively subdivide each set into two sets. (A sketch follows.)
- Only minor modifications are needed to allow a number of parts that is not a power of two (P ≠ 2^k).
(slide: Umit V. Catalyurek)
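A minimal sketch of the recursive scheme, where bisect() is a stand-in for any two-way partitioner (spectral bisection or Fiduccia-Mattheyses below, say); all names are illustrative:

    # Split `vertices` into 2**k parts by recursive two-way splits.
    def recursive_bisection(vertices, k, bisect):
        if k == 0:
            return [vertices]
        left, right = bisect(vertices)
        return (recursive_bisection(left, k - 1, bisect) +
                recursive_bisection(right, k - 1, bisect))

    halve = lambda vs: (vs[:len(vs) // 2], vs[len(vs) // 2:])   # toy bisector
    print(recursive_bisection(list(range(8)), 2, halve))
    # [[0, 1], [2, 3], [4, 5], [6, 7]]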

CS 240A: Graph and hypergraph partitioning (excerpts) Thanks to Aydin Buluc, Umit Catalyurek, Alan Edelman, and Kathy Yelick for some of these slides. The whole CS240A partitioning lecture is at http://cs.ucsb.edu/~gilbert/cs240a/slides/cs240a-partitioning.pdf

Graph partitioning in practice
Graph partitioning heuristics have been an active research area for many years, often motivated by partitioning for parallel computation. Some techniques:
- Iterative swapping (Kernighan-Lin, Fiduccia-Mattheyses)
- Spectral partitioning (uses eigenvectors of the Laplacian matrix of the graph)
- Geometric partitioning (for meshes with specified vertex coordinates)
- Breadth-first search (fast but dated)
Many popular modern codes (e.g. Metis, Chaco, Zoltan) use multilevel iterative swapping.

Iterative swapping: Kernighan/Lin, Fiduccia/Mattheyses
- Take an initial partition and iteratively improve it:
  - Kernighan/Lin (1970): cost O(|N|^3), but simple.
  - Fiduccia/Mattheyses (1982): cost O(|E|), but more complicated.
- Start with a weighted graph and a partition A ∪ B, where |A| = |B|.
  T = cost(A,B) = Σ { weight(e) : e connects nodes in A and B }
- Find subsets X of A and Y of B with |X| = |Y| such that swapping X and Y decreases the cost:
  newA = (A - X) ∪ Y and newB = (B - Y) ∪ X
  newT = cost(newA, newB) < cost(A,B)
- Compute newT efficiently for many possible X and Y (there is not time to try all possible), then choose the smallest.
(Aside: what else did Kernighan invent?)

Simplified Fiduccia-Mattheyses: Example (1)
Red nodes are in Part1; black nodes are in Part2. The initial partition into two parts is arbitrary. In this case it cuts 8 edges. The initial node gains are shown in red.
[Figure: nodes a-h with their initial gains]
Nodes tentatively moved (and cut size after each pair): none (8);

Simplified Fiduccia-Mattheyses: Example (2)
The node in Part1 with the largest gain is g. We tentatively move it to Part2 and recompute the gains of its neighbors. Tentatively moved nodes are hollow circles. After a node is tentatively moved, its gain doesn't matter any more.
[Figure: updated gains after the tentative move of g]
Nodes tentatively moved (and cut size after each pair): none (8); g,

Simplified Fiduccia-Mattheyses: Example (3)
The node in Part2 with the largest gain is d. We tentatively move it to Part1 and recompute the gains of its neighbors. After this first tentative swap, the cut size is 4.
[Figure: updated gains after the tentative move of d]
Nodes tentatively moved (and cut size after each pair): none (8); g, d (4);

Simplified Fiduccia-Mattheyses: Example (4)
The unmoved node in Part1 with the largest gain is f. We tentatively move it to Part2 and recompute the gains of its neighbors.
[Figure: updated gains after the tentative move of f]
Nodes tentatively moved (and cut size after each pair): none (8); g, d (4); f

Simplified Fiduccia-Mattheyses: Example (5)
The unmoved node in Part2 with the largest gain is c. We tentatively move it to Part1 and recompute the gains of its neighbors. After this tentative swap, the cut size is 5.
[Figure: updated gains after the tentative move of c]
Nodes tentatively moved (and cut size after each pair): none (8); g, d (4); f, c (5);

Simplified Fiduccia-Mattheyses: Example (6)
The unmoved node in Part1 with the largest gain is b. We tentatively move it to Part2 and recompute the gains of its neighbors.
[Figure: updated gains after the tentative move of b]
Nodes tentatively moved (and cut size after each pair): none (8); g, d (4); f, c (5); b

Simplified Fiduccia-Mattheyses: Example (7)
There is a tie for the largest gain between the two unmoved nodes in Part2. We choose one (say e) and tentatively move it to Part1. It has no unmoved neighbors, so no gains are recomputed. After this tentative swap, the cut size is 7.
[Figure: the partition after the tentative move of e]
Nodes tentatively moved (and cut size after each pair): none (8); g, d (4); f, c (5); b, e (7);

Simplified Fiduccia-Mattheyses: Example (8)
The unmoved node in Part1 with the largest gain (the only one) is a. We tentatively move it to Part2. It has no unmoved neighbors, so no gains are recomputed.
[Figure: the partition after the tentative move of a]
Nodes tentatively moved (and cut size after each pair): none (8); g, d (4); f, c (5); b, e (7); a

Simplified Fiduccia-Mattheyses: Example (9)
The unmoved node in Part2 with the largest gain (the only one) is h. We tentatively move it to Part1. The cut size after this final tentative swap is 8, the same as it was before any tentative moves.
[Figure: the partition after the tentative move of h]
Nodes tentatively moved (and cut size after each pair): none (8); g, d (4); f, c (5); b, e (7); a, h (8)

Simplified Fiduccia-Mattheyses: Example (10)
After every node has been tentatively moved, we look back at the sequence and see that the smallest cut was 4, after swapping g and d. We make that swap permanent and undo all the later tentative swaps. This is the end of the first improvement step.
[Figure: the size-4 cut after making the g, d swap permanent]
Nodes tentatively moved (and cut size after each pair): none (8); g, d (4); f, c (5); b, e (7); a, h (8)

Simplified Fiduccia-Mattheyses: Example (11)
Now we recompute the gains and do another improvement step starting from the new size-4 cut. The details are not shown. The second improvement step doesn't change the cut size, so the algorithm ends with a cut of size 4. In general, we keep doing improvement steps as long as the cut size keeps getting smaller. A compact sketch of one improvement step follows.
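The sketch below mirrors the walk-through (alternate sides, tentatively move the highest-gain unmoved node, then keep the best prefix of moves). It illustrates the idea only; it is not the O(|E|) bucket-based Fiduccia-Mattheyses implementation, and the data layout is an assumption.

    # One simplified FM improvement step over an adjacency-set graph.
    def fm_pass(adj, part):
        """adj: dict node -> set of neighbors; part: dict node -> 0 or 1."""
        part = dict(part)
        cut = sum(1 for u in adj for v in adj[u] if u < v and part[u] != part[v])
        best_cut, best_prefix, moves = cut, 0, []
        unmoved, side = set(adj), 0
        # gain of moving u = (cut edges at u) - (uncut edges at u)
        gain = lambda u: sum(+1 if part[v] != part[u] else -1 for v in adj[u])
        while unmoved:
            cands = [u for u in unmoved if part[u] == side]
            if not cands:                  # nothing left on this side
                side ^= 1
                continue
            u = max(cands, key=gain)       # highest-gain unmoved node
            cut -= gain(u)                 # moving u changes the cut by -gain
            part[u] ^= 1                   # tentative move
            unmoved.discard(u)
            moves.append(u)
            side ^= 1                      # alternate sides, as in the example
            if cut < best_cut:
                best_cut, best_prefix = cut, len(moves)
        for u in moves[best_prefix:]:      # undo everything past the best point
            part[u] ^= 1
        return part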

Spectral Bisection
- Based on the theory of Fiedler (1970s); rediscovered several times in different communities.
- Motivation I: analogy to a vibrating string.
- Motivation II: continuous relaxation of a discrete optimization problem.
- Implementation: eigenvectors via the Lanczos algorithm.
- To optimize sparse matrix-vector multiply, we graph partition; to graph partition, we find an eigenvector of a matrix; to find an eigenvector, we do sparse matrix-vector multiply. No free lunch ...

Laplacian Matrix
Definition: the Laplacian matrix L(G) of a graph G is a symmetric matrix, with one row and column for each node. It is defined by
- L(G)(i,i) = degree of node i (number of incident edges)
- L(G)(i,j) = -1 if i ≠ j and there is an edge (i,j)
- L(G)(i,j) = 0 otherwise

Example: G is the 5-node graph with edges {1,2}, {1,3}, {2,3}, {3,4}, {3,5}, {4,5}; then

           [  2 -1 -1  0  0 ]
           [ -1  2 -1  0  0 ]
    L(G) = [ -1 -1  4 -1 -1 ]
           [  0  0 -1  2 -1 ]
           [  0  0 -1 -1  2 ]
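Following the definition, L(G) is straightforward to build from an edge list; a minimal sketch in Python (nodes renumbered 0-4):

    # Build the Laplacian of a graph from its edges, per the definition above.
    import numpy as np

    def laplacian(n, edges):
        L = np.zeros((n, n))
        for u, v in edges:
            L[u, u] += 1; L[v, v] += 1     # degrees on the diagonal
            L[u, v] -= 1; L[v, u] -= 1     # -1 for each edge
        return L

    # the slide's 5-node example
    print(laplacian(5, [(0, 1), (0, 2), (1, 2), (2, 3), (2, 4), (3, 4)]))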

Properties of the Laplacian Matrix
Theorem: L(G) has the following properties.
- L(G) is symmetric. This implies that the eigenvalues of L(G) are real, and its eigenvectors are real and orthogonal.
- Rows of L(G) sum to zero: let e = [1, ..., 1]^T, the column vector of all ones; then L(G)*e = 0.
- The eigenvalues of L(G) are nonnegative: 0 = λ1 ≤ λ2 ≤ ... ≤ λn.
- The number of connected components of G is equal to the number of λi that are 0.
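These properties are easy to confirm numerically on the example above (a quick check with numpy):

    # Check: rows sum to zero; eigenvalues are nonnegative with exactly one
    # zero (the example graph is connected).
    import numpy as np

    L = np.array([[ 2, -1, -1,  0,  0],
                  [-1,  2, -1,  0,  0],
                  [-1, -1,  4, -1, -1],
                  [ 0,  0, -1,  2, -1],
                  [ 0,  0, -1, -1,  2]], dtype=float)
    print(L @ np.ones(5))           # [0. 0. 0. 0. 0.]
    print(np.linalg.eigvalsh(L))    # ascending: 0 = lambda_1 <= ... <= lambda_n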

Spectral Bisection Algorithm
- Compute the eigenvector v2 corresponding to λ2 of L(G).
- Partition the nodes around the median of the entries of v2.
Why in the world should this work?
- Intuition: a vibrating string or membrane.
- Heuristic: continuous relaxation of a discrete optimization problem.
(A sketch follows.)
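A minimal sketch of the recipe, assuming Python with scipy (eigsh is scipy's Lanczos-based solver, matching the implementation note above; real partitioners use more careful eigensolvers and balancing):

    # Spectral bisection: split the nodes at the median of the Fiedler vector.
    import numpy as np
    from scipy.sparse import csgraph
    from scipy.sparse.linalg import eigsh

    def spectral_bisect(A):
        """A: symmetric sparse adjacency matrix (float). Returns 0/1 parts."""
        L = csgraph.laplacian(A)
        # two smallest eigenpairs; 'SM' is fine for small graphs
        vals, vecs = eigsh(L, k=2, which='SM')
        fiedler = vecs[:, np.argsort(vals)[1]]   # eigenvector for lambda_2
        return (fiedler > np.median(fiedler)).astype(int)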

Motivation for Spectral Bisection: a Vibrating String
- Think of G = a 1D mesh as masses (nodes) connected by springs (edges), i.e. a string that can vibrate.
- A vibrating string has modes of vibration, or harmonics.
- Label each node by the sign (- or +) of the second mode to partition the nodes into N- and N+.
- The same idea works for other graphs (e.g. a planar graph is like a trampoline).

2nd eigenvector of L(planar mesh)

Multilevel Partitioning
If G is too big for our algorithms, what can we do?
(1) Replace G by a coarse approximation Gc, and partition Gc instead.
(2) Use the partition of Gc to get a rough partitioning of G, and then iteratively improve it.
What if Gc is still too big? Apply the same idea recursively.

Multilevel Partitioning - High Level Algorithm
(N+, N-) = Multilevel_Partition(N, E)
    ... recursive partitioning routine returns N+ and N- where N = N+ ∪ N-
    if |N| is small
        (1) Partition G = (N, E) directly to get N = N+ ∪ N-
        Return (N+, N-)
    else
        (2) Coarsen G to get an approximation Gc = (Nc, Ec)
        (3) (Nc+, Nc-) = Multilevel_Partition(Nc, Ec)
        (4) Expand (Nc+, Nc-) to a partition (N+, N-) of N
        (5) Improve the partition (N+, N-)
        Return (N+, N-)
    endif
[Figure: the "V-cycle" of coarsening (2,3) down the levels, a direct partition (1) at the bottom, then expansion (4) and improvement (5) back up]
How do we Coarsen? Expand? Improve?
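The pseudocode transliterates directly; in this sketch the graph is an adjacency dict and coarsen, expand, and improve are placeholders to be filled in with the choices discussed next (all names here are illustrative):

    # Skeleton of multilevel partitioning; coarsen/expand/improve are supplied
    # by the caller (e.g. maximal matching, projection, and an FM pass).
    def multilevel_partition(G, base_partition, coarsen, expand, improve,
                             small=100):
        if len(G) <= small:
            return base_partition(G)        # (1) partition directly
        Gc, mapping = coarsen(G)            # (2) coarse approximation of G
        pc = multilevel_partition(Gc, base_partition, coarsen, expand,
                                  improve, small)   # (3) recurse
        p = expand(G, Gc, mapping, pc)      # (4) project the partition back
        return improve(G, p)                # (5) refine, e.g. one FM pass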

Coarsening by Maximal Matching
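A minimal sketch of this coarsening step (a greedy matching, which is maximal though not maximum; the data layout and names are assumptions): matched pairs collapse into single coarse nodes, and an edge survives if its endpoints land in different coarse nodes.

    # Coarsen a graph by collapsing the pairs of a maximal matching.
    def coarsen_by_matching(adj):
        """adj: dict node -> set of neighbors. Returns (coarse_adj, coarse_of)."""
        mate = {}
        for u in adj:                          # greedy maximal matching
            if u not in mate:
                for v in adj[u]:
                    if v not in mate:
                        mate[u], mate[v] = v, u
                        break
        coarse_of, next_id = {}, 0             # matched pairs share one id
        for u in adj:
            if u not in coarse_of:
                coarse_of[u] = next_id
                if u in mate:
                    coarse_of[mate[u]] = next_id
                next_id += 1
        coarse_adj = {c: set() for c in range(next_id)}
        for u in adj:                          # keep only edges that connect
            for v in adj[u]:                   # different coarse nodes
                cu, cv = coarse_of[u], coarse_of[v]
                if cu != cv:
                    coarse_adj[cu].add(cv); coarse_adj[cv].add(cu)
        return coarse_adj, coarse_of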

Example of Coarsening

At the bottom of the recursion, partition the coarsest graph ... using spectral bisection, say.
[Figure: the same 5-node example graph G and its Laplacian L(G) shown earlier]

Expand a partition of Gc to a partition of G

After each expansion, improve the partition ... by iterative swapping.
[Figure: an expanded partition of nodes a-h being refined with node gains, as in the Fiduccia-Mattheyses example]