1 An Interactive Environment for Combinatorial Supercomputing John R. Gilbert University of California, Santa Barbara Viral Shah (UCSB) Steve Reinhardt (Silicon Graphics) with thanks to Alan Edelman (MIT & ISC) and Jeremy Kepner (MIT-LL) Support: DOE Office of Science, DARPA, SGI, ISC
2 Parallel Computing Today Lab Beowulf cluster Columbia, NASA Ames Research Center
3 Combinatorial Scientific Computing
Emerging large-scale, high-performance applications:
- Web search and information retrieval
- Knowledge discovery
- Machine learning
- Computational biology
- Bioinformatics
- Sparse matrix methods
- Geometric modeling
- ...
How will combinatorial methods be used by nonexperts?
4 Analogy: Matrix Division in Matlab
x = A \ b;
Works for either full or sparse A.
- Is A square? no => use QR to solve the least squares problem
- Is A triangular or permuted triangular? yes => sparse triangular solve
- Is A symmetric with positive diagonal elements? yes => attempt Cholesky after symmetric minimum degree
- Otherwise => use LU on A(:, colamd(A))
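The decision tree above can be sketched as a small dispatcher. This is an illustration only, not Matlab's implementation: the function name `choose_solver` and the returned labels are hypothetical, the matrix is a dense list of rows, and the "permuted triangular" test is omitted for brevity.

```python
def choose_solver(A):
    """Return a label for the solver the slide's decision tree would pick.

    A is a dense list of rows. Labels are hypothetical names, not Matlab APIs.
    The permuted-triangular check from the slide is not implemented here.
    """
    nrows, ncols = len(A), len(A[0])
    if nrows != ncols:
        return "qr-least-squares"
    n = nrows
    # Lower triangular: nothing above the diagonal; upper: nothing below it.
    lower = all(A[i][j] == 0 for i in range(n) for j in range(i + 1, n))
    upper = all(A[i][j] == 0 for i in range(n) for j in range(i))
    if lower or upper:
        return "triangular-solve"
    symmetric = all(A[i][j] == A[j][i] for i in range(n) for j in range(i))
    if symmetric and all(A[i][i] > 0 for i in range(n)):
        # Cholesky is only attempted; it can still fail and fall back to LU.
        return "attempt-cholesky"
    return "lu"
```

The point of the analogy is that the user writes one operator and the system picks the method; the checks themselves are cheap compared with the solve.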
5 Interactive Data Exploration
6 RMAT Approximate Power-Law Graph
7 Strongly Connected Components
8 Matlab*P
A = rand(4000*p, 4000*p);
x = randn(4000*p, 1);
y = zeros(size(x));
while norm(x-y) / norm(x) > 1e-11
    y = x;
    x = A*x;
    x = x / norm(x);
end;
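For readers without Matlab*P, the same power iteration can be sketched serially in plain Python. The helper names (`norm`, `matvec`, `power_iteration`) and the `max_iter` safety cap are assumptions added for the sketch; the loop condition and update order follow the slide.

```python
import math

def norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(t * t for t in v))

def matvec(A, x):
    """Dense matrix-vector product; A is a list of rows."""
    return [sum(a * t for a, t in zip(row, x)) for row in A]

def power_iteration(A, x, tol=1e-11, max_iter=10000):
    """Repeat x = A*x / norm(A*x) until successive iterates agree to tol."""
    y = [0.0] * len(x)
    it = 0
    while norm([a - b for a, b in zip(x, y)]) / norm(x) > tol and it < max_iter:
        y = x
        x = matvec(A, x)
        nx = norm(x)
        x = [t / nx for t in x]
        it += 1
    return x
```

In Matlab*P the only change from serial Matlab is the `*p` in the dimensions, which makes `A` and `x` distributed; the loop body is untouched.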
9 Matlab*P History
Matlab*P 1.0 (1998): Edelman, Husbands, Isbell (MIT)
Matlab*P 2.0 (2002- ): MIT / UCSB / LBNL
Star-P (2004- ): Interactive Supercomputing / SGI
10 Data-Parallel Operations
>> A = randn(500*p, 500*p)
A = ddense object: 500-by-500
>> E = eig(A);
>> E(1)
ans = i
>> e = pp2matlab(E);
>> ppwhos
Name  Size       Bytes  Class
A     500px500p  688    ddense object
E     500px1     652    ddense object
e     500x              double array (complex)
11 Task-Parallel Operations
>> quad('4./(1+x.^2)', 0, 1);
ans =
>> a = (0:3*p) / 4
a = ddense object: 1-by-4
>> pp2matlab(a)
ans =
>> b = a + .25;
>> c = ppeval('quad', '4./(1+x.^2)', a, b);
c = ddense object: 1-by-4
>> sum(c)
ans =
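The task-parallel pattern on this slide (split [0, 1] into four subintervals, integrate each independently, sum the pieces) can be sketched in Python. This is an analogy, not Star-P: `ppeval` is replaced by a thread pool, and Matlab's adaptive `quad` is replaced by a fixed composite Simpson rule, both assumptions made for the sketch.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def f(x):
    """Integrand from the slide; its integral over [0, 1] is pi."""
    return 4.0 / (1.0 + x * x)

def simpson(func, lo, hi, panels=100):
    """Composite Simpson rule (panels must be even); stand-in for quad."""
    h = (hi - lo) / panels
    total = func(lo) + func(hi)
    for i in range(1, panels):
        total += (4 if i % 2 else 2) * func(lo + i * h)
    return total * h / 3.0

# Left and right endpoints of the four subintervals, as on the slide.
a = [k / 4 for k in range(4)]
b = [lo + 0.25 for lo in a]

# One independent task per subinterval, mirroring ppeval('quad', ..., a, b).
with ThreadPoolExecutor(max_workers=4) as pool:
    c = list(pool.map(lambda lo, hi: simpson(f, lo, hi), a, b))

total = sum(c)
error = abs(total - math.pi)
```

Each task needs only its own pair of endpoints, which is why this kind of computation distributes with no communication until the final sum.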
12 Star-P Architecture
[Diagram components: MATLAB® client with ordinary Matlab variables; Star-P client manager; server manager; package manager; matrix manager with distributed matrices; processors #0, #1, #2, #3, ..., #n-1; packages: ScaLAPACK, FFTW, FPGA interface, sort, dense/sparse, UPC user code, MPI user code]
13 Distributed Sparse Array Structure
[Figure: matrix distributed across processors P0, P1, P2, ..., Pn]
Each processor stores:
- # of local nonzeros (# local edges)
- range of local rows (local vertices)
- nonzeros in a compressed row data structure (local edges)
14 The sparse( ) Constructor
A = sparse(I, J, V, nr, nc);
Input: ddense vectors I, J, V, dimensions nr, nc
Output: A(I(k), J(k)) = V(k)
Sum values with duplicate indices
Sorts triples by
Inverse: [I, J, V] = find(A);
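The constructor's semantics (triples in, duplicates summed, `find` as the inverse) can be illustrated with a dictionary-of-keys sketch. The Python functions below are hypothetical stand-ins that only mimic the behavior described on the slide; they say nothing about how Star-P stores or sorts the data internally. Indices are 1-based to match Matlab.

```python
def sparse(I, J, V, nr, nc):
    """Build a {(i, j): value} map from 1-based triples, summing duplicates."""
    A = {}
    for i, j, v in zip(I, J, V):
        if not (1 <= i <= nr and 1 <= j <= nc):
            raise IndexError("index out of range")
        A[(i, j)] = A.get((i, j), 0) + v
    # Entries that sum to zero are not stored.
    return {key: v for key, v in A.items() if v != 0}

def find(A):
    """Return triples (I, J, V) in column-major order, as Matlab's find does."""
    keys = sorted(A, key=lambda ij: (ij[1], ij[0]))
    return [i for i, _ in keys], [j for _, j in keys], [A[k] for k in keys]
```

For example, `sparse([1, 2, 1], [1, 2, 1], [5, 7, 1], 2, 2)` stores 6 at position (1, 1) because the two triples with that index are summed.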
15 Sparse Array and Matrix Operations dsparse layout, same semantics as ddense Matrix arithmetic: +, max, sum, etc. matrix * matrix and matrix * vector Matrix indexing and concatenation A (1:3, [4 5 2]) = [ B(:, J) C ] ; Linear solvers: x = A \ b; using SuperLU (MPI) Eigensolvers: [V, D] = eigs(A); using PARPACK (MPI)
16 Large-Scale Graph Algorithms Graph theory, algorithms, and data structures are ubiquitous in sparse matrix computation. Time to turn the relationship around! Represent a graph as a sparse adjacency matrix. A sparse matrix language is a good start on primitives for computing with graphs. Leverage the mature techniques and tools of high-performance numerical computation.
17 Sparse Adjacency Matrix and Graph
Adjacency matrix: sparse array w/ nonzeros for graph edges
Storage-efficient implementation from sparse data structures
[Figure: a graph with its transposed adjacency matrix A^T, a vector x, and the product A^T x]
18 Breadth-First Search: Sparse mat * vec
Multiply by adjacency matrix => step to neighbor vertices
Work-efficient implementation from sparse data structures
[Figure: a graph with A^T, a vector x, and the product A^T x]
20 Breadth-First Search: Sparse mat * vec
Multiply by adjacency matrix => step to neighbor vertices
Work-efficient implementation from sparse data structures
[Figure: a graph with A^T, a vector x, and the products A^T x and (A^T)^2 x]
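A minimal sketch of the idea, assuming the graph is given as adjacency lists: the frontier plays the role of the sparse vector x, and one pass over the frontier's adjacency lists plays the role of one product with A^T. The function name `bfs_levels` and the 0-based vertex numbering are choices made for this sketch.

```python
def bfs_levels(adj, source):
    """Breadth-first search where each step mimics y = A^T * x.

    adj[v] lists the out-neighbors of v, i.e. the nonzero pattern of the
    column of A^T that is selected when x(v) is nonzero.
    Returns the BFS level of every vertex, or -1 if unreachable.
    """
    level = [-1] * len(adj)
    level[source] = 0
    frontier = {source}              # nonzero pattern of x
    depth = 0
    while frontier:
        depth += 1
        nxt = set()
        for v in frontier:           # touch only columns picked by nonzeros of x
            for w in adj[v]:
                if level[w] == -1:   # mask out vertices already visited
                    level[w] = depth
                    nxt.add(w)
        frontier = nxt
    return level
```

The work is proportional to the edges leaving the frontier, which is what "work-efficient" means here: a dense matrix-vector product would cost n^2 per step regardless of frontier size.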
21 Connected Components of a Graph
Sequential Matlab uses depth-first search (dmperm), which doesn’t parallelize well
Pointer-jumping algorithms (Shiloach/Vishkin & descendants)
- repeat
    Link every (super)vertex to a neighbor
    Shrink each tree to a supervertex by pointer jumping
- until no further change
Other graph kernels in progress:
- Shortest-path search (after Husbands, LBNL)
- Bipartite matching (after Riedy, UCB)
- Strongly connected components (after Pinar, LBNL)
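The link-and-shrink loop can be sketched sequentially as follows. This is a simplified simulation in the spirit of pointer-jumping algorithms, not the Shiloach/Vishkin algorithm itself and not the Star-P code: the rule "hook the larger root onto the smaller label" is one assumed way to avoid cycles, and the function name is hypothetical.

```python
def connected_components(n, edges):
    """Label each vertex with the smallest vertex id in its component.

    n is the number of vertices (0-based); edges is a list of (u, v) pairs.
    """
    parent = list(range(n))          # every vertex starts as its own supervertex
    changed = True
    while changed:
        changed = False
        # Link: hook a root onto a smaller label seen across an edge.
        for u, v in edges:
            ru, rv = parent[u], parent[v]
            if ru != rv:
                hi, lo = max(ru, rv), min(ru, rv)
                if parent[hi] == hi:     # only roots may hook
                    parent[hi] = lo
                    changed = True
        # Shrink: pointer jumping until every vertex points at its root.
        for v in range(n):
            while parent[v] != parent[parent[v]]:
                parent[v] = parent[parent[v]]
    return parent
```

Because labels only ever decrease, no cycles can form, and every round that still has an edge between two different trees merges at least one pair of them.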
22 Maximal Independent Set
Starting guess: Select some vertices randomly
degree = sum(G, 2);
prob = 1 ./ (2 * degree);
select = rand(n, 1) < prob;
if any(select & (G * select))
    % keep higher degree vertices
end
IndepSet = [IndepSet select];
neighbor = neighbor | (G * select);
remain = neighbor == 0;
G = G(remain, remain);
23 Maximal Independent Set (continued)
If neighbors are selected, keep only a higher-degree one.
Add selected vertices to the independent set.
24 Maximal Independent Set (continued)
Discard neighbors of the independent set.
Iterate on the rest of the graph.
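The three steps (random starting guess, conflict resolution by degree, discard neighbors and iterate) can be put together in a runnable sketch. It follows the slide's outline but is not the Star-P code: the graph is an adjacency list assumed to be symmetric, ties between equal-degree neighbors are broken by vertex id, and isolated vertices are always selected, all of which are assumptions of this sketch.

```python
import random

def maximal_independent_set(adj, seed=0):
    """Randomized maximal independent set on a symmetric adjacency list."""
    rng = random.Random(seed)
    remaining = set(range(len(adj)))
    indep = set()
    while remaining:
        degree = {v: sum(1 for w in adj[v] if w in remaining) for v in remaining}
        # Starting guess: select v with probability 1 / (2 * degree(v)).
        select = {v for v in sorted(remaining)
                  if degree[v] == 0 or rng.random() < 1.0 / (2 * degree[v])}
        # If two neighbors are both selected, keep only the higher-degree one.
        drop = {v for v in select
                if any(w in select and (degree[w], w) > (degree[v], v)
                       for w in adj[v])}
        select -= drop
        # Add the survivors, then discard them and their neighbors.
        indep |= select
        covered = {w for v in select for w in adj[v]}
        remaining -= select | covered
    return indep
```

A round may select nothing, in which case the loop simply draws again; the result is independent and maximal for any seed, though which set is returned depends on the random draws.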
25 SSCA#2: “Graph Analysis” Benchmark (spec version 1)
Fine-grained, irregular data access
Searching and clustering
Many tight clusters, loosely interconnected
Input data is edge triples
Vertices and edges permuted randomly
26 SSCA#2: Scalable Data Generator
Given “scale” = log2(#vertices)
Creates edge triples
Randomly permutes triples and vertex numbers
Statistics for SSCA2 spec v1.1 (? marks digits lost from the source; #Vertices is 2^scale):
Scale  #Vertices      #Cliques   #Edges Directed    #Edges Undirected
10     1,024          ?          13,?               ?,670
15     32,768         2,020      1,238,815          ?,116
20     1,048,576      20,643     126,188,649        ?,052,403
25     33,554,432     207,082    12,951,350,000     ?,597,598,000
30     1,073,741,824  2,096,264  1,317,613,000,000  366,003,600,000
27 Concise SSCA#2 in Star-P Kernel 1: Construct graph data structures Graphs are dsparse matrices, created by sparse( )
28 Kernels 2 and 3 Kernel 2: Search by edge labels About 12 lines of executable Matlab or Star-P Kernel 3: Extract subgraphs Returns subgraphs consisting of vertices and edges within fixed distance of given starting vertices Sparse matrix-matrix product for multiple breadth-first search About 25 lines of executable Matlab or Star-P
29 Kernel 4: Clustering by BFS
Grow local clusters from many seeds in parallel
Breadth-first search by sparse matrix * matrix
Cluster vertices connected by many short paths
% Grow each seed to vertices
% reached by at least k
% paths of length 1 or 2
C = sparse(seeds, 1:ns, 1, n, ns);
C = A * C;
C = C + A * C;
C = C >= k;
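The four Matlab lines translate directly into a small dense sketch, shown here only to make the path counting concrete. The names `matmul` and `grow_clusters` are hypothetical, seeds are 0-based, and dense lists stand in for the dsparse matrices used in Star-P.

```python
def matmul(A, B):
    """Dense product of two list-of-rows matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def grow_clusters(A, seeds, k):
    """Column j marks the vertices reached from seeds[j] by >= k short paths."""
    n, ns = len(A), len(seeds)
    # C = sparse(seeds, 1:ns, 1, n, ns): one indicator column per seed.
    C = [[1 if seeds[j] == i else 0 for j in range(ns)] for i in range(n)]
    C = matmul(A, C)                         # number of paths of length 1
    AC = matmul(A, C)                        # number of paths of length 2
    C = [[C[i][j] + AC[i][j] for j in range(ns)] for i in range(n)]
    return [[C[i][j] >= k for j in range(ns)] for i in range(n)]
```

On the path graph 0-1-2-3 with seed 0 and k = 1, vertices 0, 1, and 2 are marked: vertex 1 by a path of length 1, vertices 0 and 2 by paths of length 2.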
30 Kernel 4: Clustering by Peer Pressure Steps in a peer pressure algorithm: 1. Vote for a cluster leader 2. Collect neighbor votes 3. Vote for a new leader (based on neighbor votes) Clustering qualities depend on details of each step. Want relatively few potential leaders, e.g. a maximal indep set. Other choices possible – for SSCA2 graph, simpler rules work too. Neighbor votes can be combined using various weightings. Each version of kernel4 is about 25 lines of code.
31 Kernel 4: Clustering by Peer Pressure
[ignore, leader] = max(G);
S = sparse(leader, 1:n, 1, n, n) * G;
[ignore, leader] = max(S);
Each vertex votes for highest numbered neighbor as its leader
Number of leaders is approximately number of clusters (small relative to the number of nodes)
32 Kernel 4: Clustering by Peer Pressure
[ignore, leader] = max(G);
S = sparse(leader, 1:n, 1, n, n) * G;
[ignore, leader] = max(S);
Matrix multiplication gathers neighbor votes
S(i,j) is # of votes for i from j’s neighbors
In SSCA2 (spec1.0), most of graph structure is recovered right away; iteration needed for harder graphs
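One round of peer pressure can be sketched with adjacency lists instead of a matrix product. This is a hedged illustration rather than the kernel itself: the function name is hypothetical, each vertex is allowed to vote for itself in the first step (an assumption, so that isolated vertices have a leader), and ties in the vote count go to the lowest-numbered candidate, mimicking how Matlab's `max` returns the first maximum.

```python
def peer_pressure(adj, rounds=1):
    """Cluster by voting: adj[v] lists the neighbors of vertex v (0-based)."""
    n = len(adj)
    # Step 1: each vertex votes for its highest-numbered neighbor (or itself).
    leader = [max(adj[v] + [v]) for v in range(n)]
    for _ in range(rounds):
        new_leader = list(leader)
        for j in range(n):
            # Step 2: collect votes; votes[i] is S(i, j), the number of
            # j's neighbors whose current leader is i.
            votes = {}
            for k in adj[j]:
                votes[leader[k]] = votes.get(leader[k], 0) + 1
            # Step 3: adopt the most popular leader among the neighbors.
            if votes:
                new_leader[j] = max(votes, key=lambda c: (votes[c], -c))
        leader = new_leader
    return leader
```

On two triangles joined by a single edge, one round already assigns one leader per triangle, which matches the remark that most of the structure is recovered right away on easy graphs.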
33 Scaling Up
Recent results on SGI Altix (up to 128 processors):
- Have run SSCA2 (spec v1.0) on graphs with 2^27 = 134 million vertices and about one billion (10^9) edges
- Have manipulated graphs with 400 million vertices and 4 billion edges
- Timings scale well for large graphs:
    2x problem size => 2x time
    2x problem size & 2x processors => same time
Tracking new SSCA2 draft spec v2.0, in progress
Using this benchmark to tune Star-P sparse array infrastructure
34 Toolbox for Graph Analysis and Pattern Discovery Layer 1: Graph Theoretic Tools Graph operations Global structure of graphs Graph partitioning and clustering Graph generators Visualization and graphics Scan and combining operations Utilities
35 Sparse Matrix times Sparse Matrix Shows up often as a primitive. Graphs are mostly not mesh-like, i.e. they lack geometric locality and good separators. On a 2D processor grid, the parallel sparse algorithm looks much like the parallel dense algorithm. Redistribute to round-robin cyclic or random distribution for load balance.
36 Load Balance Without Redistribution
37 Load Balance With Redistribution
38 Compressed Sparse Matrix Storage
Full storage: 2-dimensional array. (nrows*ncols) memory.
Sparse storage: Compressed storage by columns (CSC). Three 1-dimensional arrays: value, row, colstart. (2*nzs + ncols + 1) memory. Similarly, CSR.
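To make the three arrays concrete, here is a small sketch that converts a dense matrix to CSC. The function name `to_csc` is hypothetical and indices are 0-based, unlike Matlab; the array names follow the slide.

```python
def to_csc(dense):
    """Convert a dense list-of-rows matrix to (value, row, colstart).

    value    : the nonzeros, stored column by column
    row      : the row index of each stored nonzero
    colstart : colstart[j] .. colstart[j+1]-1 index the nonzeros of column j
    """
    nrows, ncols = len(dense), len(dense[0])
    value, row, colstart = [], [], [0]
    for j in range(ncols):
        for i in range(nrows):
            if dense[i][j] != 0:
                value.append(dense[i][j])
                row.append(i)
        colstart.append(len(value))
    return value, row, colstart
```

For a 3-by-3 matrix with 5 nonzeros the three arrays hold 5 + 5 + 4 = 14 entries, which is the slide's 2*nzs + ncols + 1.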
39 Single Processor MatMul: C = A * B
C(:, :) = 0;
for i = 1:n
    for j = 1:n
        for k = 1:n
            C(i, j) = C(i, j) + A(i, k) * B(k, j);
        end
    end
end
The n^3 scalar updates can be done in any order.
Six possible algorithms: ijk, ikj, jik, jki, kij, kji (lots more if you consider blocking for cache)
Goal: Sparse algorithm with time = O(nonzero flops)
For sparse, even time = O(n^2) is too slow!
40 Possible Organizations of Sparse MatMul
Outer product:
for k = 1:n
    C = C + A(:, k) * B(k, :)
Inner product:
for i = 1:n
    for j = 1:n
        C(i, j) = A(i, :) * B(:, j)
Column by column:
for j = 1:n
    for k where B(k, j) ~= 0
        C(:, j) = C(:, j) + A(:, k) * B(k, j)
Barriers to O(flops) work
- Inserting updates into C is too slow
- n^2 loop iterations cost too much if C is sparse
- Loop k only over nonzeros in column j of B
- Use sparse accumulator (SPA) for column updates
41 Sparse Accumulator (SPA)
Abstract data type for a single matrix column
Operations on SPA:
- initialize spa: O(n) time, O(n) storage
- spa = spa + scalar * (CSC vector): O(nnz(vector)) time
- (CSC vector) = spa: O(nnz(spa)) time
- spa = 0: O(nnz(spa)) time
- ... possibly other ops
Standard implementation of SPA (many variations):
- n-element array “value”
- n-element flag array “is-nonzero”
- linked structure (or other) to gather nonzeros from spa
42 CSC Sparse Matrix Multiplication with SPA
for j = 1:n
    C(:, j) = A * B(:, j)
[Figure: C = A x B, with the SPA and its gather and scatter/accumulate steps labeled]
All matrix columns and vectors are stored compressed except the SPA.
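A minimal sketch of the column-by-column product with a sparse accumulator, assuming both inputs are stored as (value, row, colstart) triples with 0-based indices. The function name `spgemm_csc` is hypothetical, a plain Python list stands in for the "linked structure" that tracks the SPA's nonzeros, and the row indices of each output column are sorted for readability, which adds a logarithmic factor that a tuned implementation might avoid.

```python
def spgemm_csc(A, B, nrows):
    """Compute C = A * B with all three matrices in CSC form.

    A and B are (value, row, colstart) triples; nrows is the row count of A.
    """
    a_val, a_row, a_start = A
    b_val, b_row, b_start = B
    spa_value = [0] * nrows          # dense "value" array of the SPA
    spa_flag = [False] * nrows       # dense "is-nonzero" flag array
    c_val, c_row, c_start = [], [], [0]
    for j in range(len(b_start) - 1):
        nonzeros = []                # rows touched while building column j
        # Loop k only over the nonzeros in column j of B.
        for p in range(b_start[j], b_start[j + 1]):
            k, bkj = b_row[p], b_val[p]
            # Scatter/accumulate: spa = spa + B(k, j) * A(:, k).
            for q in range(a_start[k], a_start[k + 1]):
                i = a_row[q]
                if not spa_flag[i]:
                    spa_flag[i] = True
                    nonzeros.append(i)
                spa_value[i] += a_val[q] * bkj
        # Gather the SPA into column j of C and reset only what was touched.
        for i in sorted(nonzeros):
            c_row.append(i)
            c_val.append(spa_value[i])
            spa_value[i] = 0
            spa_flag[i] = False
        c_start.append(len(c_val))
    return c_val, c_row, c_start
```

The SPA is allocated once at O(n) cost and then reset in time proportional to the nonzeros it held, so the total work tracks the number of nonzero multiplications rather than n^2.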
43 Matrices over Semirings
Matrix multiplication C = AB (or matrix/vector):
$C_{i,j} = A_{i,1} B_{1,j} + A_{i,2} B_{2,j} + \cdots + A_{i,n} B_{n,j}$
Replace scalar operations $\times$ and $+$ by
$\otimes$: associative, distributes over $\oplus$, identity 1
$\oplus$: associative, commutative, identity 0; 0 annihilates under $\otimes$
Then $C_{i,j} = A_{i,1} \otimes B_{1,j} \oplus A_{i,2} \otimes B_{2,j} \oplus \cdots \oplus A_{i,n} \otimes B_{n,j}$
Examples: $(\times, +)$; (and, or); (+, min); ...
Same data reference pattern and control flow
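The claim "same data reference pattern and control flow" can be seen in a sketch where the two scalar operations are parameters. The function name `semiring_matmul` and its argument names are hypothetical; the matrices are dense lists purely to keep the example short.

```python
import operator

def semiring_matmul(A, B, add, mul, zero):
    """Matrix product where + is replaced by add and * by mul.

    zero is the identity of add (and is assumed to annihilate under mul).
    """
    n, m, p = len(A), len(B), len(B[0])
    C = [[zero] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            acc = zero
            for k in range(m):
                acc = add(acc, mul(A[i][k], B[k][j]))
            C[i][j] = acc
    return C

INF = float("inf")

# Ordinary arithmetic: multiply for "times", add for "plus".
ordinary = semiring_matmul([[1, 2], [3, 4]], [[1, 2], [3, 4]],
                           operator.add, operator.mul, 0)

# Shortest paths: add for "times", min for "plus"; INF is the zero element.
D = [[0, 3, INF], [INF, 0, 1], [INF, INF, 0]]
two_hop = semiring_matmul(D, D, min, operator.add, INF)
```

With min as the addition and + as the multiplication, squaring a matrix of edge lengths gives the shortest distances over paths of at most two edges; only the two operators changed, not the loops.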
44 Remarks
Tools for combinatorial methods built on parallel sparse matrix infrastructure
Easy-to-use interactive programming environment
- Rapid prototyping tool for algorithm development
- Interactive exploration and visualization of data
- Alan Edelman: “parallel computing is fun again”
Sparse matrix * sparse matrix is a key primitive
Matrices over semirings like (min,+) as well as (+,*)