1
Star-P and the Knowledge Discovery Suite Steve Reinhardt, spr@InteractiveSupercomputing.com Viral Shah, vshah@InteractiveSupercomputing.com John Gilbert, gilbert@cs.ucsb.edu Stefan Karpinski, sgk@cs.ucsb.edu
2
Context for Knowledge Discovery
3
Goal for Large-scale Knowledge Discovery via Star-P/KDS
Enable the domain expert to explore big unstructured data interactively:
– Domain expert: a scientist or analyst, not a math or graph expert
– Explore: human-guided characterization, from simple statistics to complex clustering or factoring, even when the best algorithm is not known
– Big: 20 GB+ of data commonplace, >1 TB largest
– Unstructured data: e.g., arising from metabolic networks, climate change, social interactions, and Internet traffic
– Interactively: common queries take O(10 seconds) on a 30-128P Altix
Allow the analyst to discern structure in the data via exploration
KDS implements many key algorithms; extensible for other algorithms
– Depends on algorithms being general-purpose, reusable, and usable by non-experts
4
Star-P Basics
MATLAB® is a registered trademark of The MathWorks, Inc. ISC's products are not sponsored or endorsed by The MathWorks, Inc. or by any other trademark owner referred to in this document.
5
A Comprehensive (?) Picture of Knowledge Discovery
[Block diagram of the knowledge-discovery stack; components shown:]
– Input: HDF5, OPeNDAP, Hadoop, live data feeds
– Data structures (sparse matrices, ...) and parallel constructs
– Utility (sort, indexing)
– Graph primitives (BFS, MIS, conncomp, ...)
– Linear algebra (spmat*vec, ...)
– Solvers (MUMPS, SuperLU, ...)
– Dimensionality reduction / factorization: SVD, eigen, NMF, PCA
– Clustering: K-means, ...
– Classification: SVM, Bayesian, HMMs, ...
– Visualization: Renoir, In-Spire, ...
– Etc.
Legend: items such as SVD are implemented by ISC; items such as Hadoop are implemented by others
6
Knowledge Discovery Suite
Simple data analysis operations at very large scale
– Sorting, set operations, statistical operations
Graph operations on very large graphs
– Simple queries, breadth-first search, connected components, independent sets
Visualization with desktop tools
– Distributed image generation for large graphs and datasets
Clustering and decomposition
– K-means clustering, non-negative matrix factorization, principal component analysis
Bayesian network modeling
– Expectation maximization (Baum-Welch), hidden Markov models
What would you like to see?
7
Ways to Work Together
– You use Star-P infrastructure for parallel algorithm development/deployment
– You develop serial algorithms, we develop parallel versions
– We jointly develop parallel algorithms
– We develop or integrate missing functionality you need
– We target a joint client with an integrated package
8
A Brief Demo
9
Sample Kernels, Algorithms, and Workflows
10
Simplest KDS operation: Parallel Sorting
Simple, widely used combinatorial primitive
[W, perm] = sort (V)
Used in many sparse matrix and array algorithms: sparse (), indexing, concatenation, transpose, reshape, repmat, etc.
Communication efficient
Example (from the slide figure): input 3 6 8 1 5 4 7 2 9 → sorted 1 2 3 4 5 6 7 8 9
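From the MATLAB client the call looks exactly like serial sort; the minimal sketch below is illustrative only, with the *p suffix for creating a server-side distributed vector following Star-P's usual convention and the vector length made up:

    % Assumes a Star-P session is already attached to a parallel server.
    n = 1e8;                      % illustrative vector length
    V = rand(n*p, 1);             % *p distributes the vector across the server
    [W, perm] = sort(V);          % parallel sort runs server-side
    U = rand(n*p, 1);
    Usorted = U(perm);            % the permutation can reorder other distributed data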
11
Sorting performance
12
A complex workflow (SSCA#2, kernel 3)

function subgraphs = kernel3 (G, pathlen, starts)
% KERNEL3 : SSCA#2 Kernel 3 -- Graph Extraction
starts = starts(:,2);
nstarts = length(starts);
A = grsparse (G);
nv = nverts (G);

% Use sparse matrix multiplication to do several BFS searches at once.
s = sparse (starts, 1:nstarts, 1, nv, nstarts);
for k = 1:pathlen
    s = A * s;
    % Ideally reach should support this. Not yet.
    s = (s ~= 0);
end

for i = 1:nstarts
    x = s(:,i);
    vtxmap = find(x);
    S.graph = subgraph (G, vtxmap);
    S.vtxmap = vtxmap;
    subgraphs{i} = S;
end
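The heart of kernel 3 is running many breadth-first searches at once as sparse matrix products, one frontier per column. A tiny standalone illustration in ordinary MATLAB (the graph and start vertices below are made up; the accumulation of earlier levels is a slight simplification of the kernel above):

    % 6-vertex undirected graph as a sparse adjacency matrix.
    A = sparse([1 2 2 3 4 5], [2 3 4 5 5 6], 1, 6, 6);
    A = A + A';                         % symmetrize
    starts = [1 4];                     % two BFS sources
    s = sparse(starts, 1:2, 1, 6, 2);   % one frontier column per source
    for k = 1:3                         % three BFS levels
        s = spones(s + A * s);          % expand each frontier one hop, keep 0/1 reachability
    end
    disp(full(s))                       % column i marks vertices within 3 hops of starts(i)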
13
A complex workflow (SSCA#2, kernel 4)

function leader = kernel4f (G)
% KERNEL4F : SSCA#2 Kernel 4 -- Graph Clustering
n = nverts (G);    % number of vertices, as in kernel 3

% Find a Maximal Independent Set in G
[IS, misrounds] = mis (G);
fprintf ('MIS rounds: %d. MIS nodes: %d\n', misrounds, length(IS));

% Find neighbours of each node from the IS
neighFromIS = G * sparse(IS, IS, 1, n, n);

% Pick one of the neighbouring IS nodes as a leader
[ign, leader] = max (neighFromIS, [], 2);

% Collect votes from neighbours
[I, J] = find (G);
S = sparse (I, leader(J), 1, n, n);

% Pick the most popular leader among neighbours and join that cluster
[ign, leader] = max (S, [], 2);
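The last two statements are the voting step: each vertex tallies its neighbours' current leaders by scattering ones into a sparse matrix (sparse sums duplicate entries) and then joins the most popular cluster. A standalone toy run in plain MATLAB, with made-up graph and leader values:

    n = 5;
    G = sparse([1 2 2 3 4], [2 3 4 5 5], 1, n, n);
    G = G + G';                         % small undirected graph
    leader = [1 1 3 3 3]';              % current leader of each vertex
    [I, J] = find(G);                   % each edge lets vertex I hear J's vote
    S = sparse(I, leader(J), 1, n, n);  % S(v,c) = number of v's neighbours led by c
    [ign, leader] = max(S, [], 2);      % every vertex joins its most popular leader
    disp(leader')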
14
Scaling Performance: cSSCA#2 on 128P

scale   #vertices (== 2^scale)   #edges (~= 10 * #vertices)   graph size (bytes)   time (seconds)
22      4.194E+06                4.194E+07                    7.550E+08            122.51
24      1.678E+07                1.678E+08                    3.020E+09            402.31
26      6.711E+07                6.711E+08                    1.208E+10            1237.1

Timings scale well – for large graphs,
– 2x problem size → 2x time
– 2x problem size & 2x processors → same time
(Each row above quadruples the problem size, and the times grow by factors of roughly 3.3 and 3.1 – close to linear.)
15
Distributed visualization
16
App#1: Computational Ecology
– Modeling dispersal of species within a habitat (to maximize range)
– Large geographic areas, linked with GIS data
– Blend of numerical and combinatorial algorithms
Brad McRae and Paul Beier, "Circuit theory predicts gene flow in plant and animal populations", PNAS, Vol. 104, no. 50, December 11, 2007
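McRae and Beier's circuit-theory model treats the landscape as a resistor network, so the numerical core is solving sparse graph-Laplacian systems for node voltages and effective resistances. A minimal serial MATLAB sketch with a made-up 4-node conductance graph gives the flavour (on distributed data, the same backslash solve would go to the parallel sparse solvers, e.g. MUMPS or SuperLU, mentioned earlier):

    C = sparse([1 1 2 3], [2 3 3 4], [1.0 0.5 2.0 1.0], 4, 4);
    C = C + C';                        % symmetric conductance matrix
    L = diag(sum(C, 2)) - C;           % weighted graph Laplacian
    src = 1;  gnd = 4;                 % inject 1 A at node 1, ground node 4
    b = zeros(4, 1);  b(src) = 1;
    keep = setdiff(1:4, gnd);          % grounding one node makes the system nonsingular
    v = zeros(4, 1);
    v(keep) = L(keep, keep) \ b(keep); % sparse direct solve for node voltages
    Reff = v(src) - v(gnd)             % effective resistance between source and ground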
17
Results
– Solution time reduced from 3 days (desktop) to 5 minutes (14p) for typical problems
– Aiming for much larger problems: Yellowstone-to-Yukon (Y2Y)
18
App#2: Factoring network flow behavior [Karpinski, Almeroth, Belding]
19
Algorithmic exploration
Many NMF variants exist in the literature
– Not clear how useful on large data
– Not clear how to calibrate (i.e., number of iterations to converge)
NMF algorithms combine linear algebra and optimization methods
Basic and "improved" NMF factorization algorithms implemented:
– euclidean (Lee & Seung 2000)
– K-L divergence (Lee & Seung 2000)
– semi-nonnegative (Ding et al. 2006)
– left/right-orthogonal (Ding et al. 2006)
– bi-orthogonal tri-factorization (Ding et al. 2006)
– sparse euclidean (Hoyer et al. 2002)
– sparse divergence (Liu et al. 2003)
– non-smooth (Pascual-Montano et al. 2006)
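For reference, the basic euclidean variant (Lee & Seung 2000) comes down to two multiplicative updates per iteration. The minimal dense MATLAB sketch below is illustrative only: the random data, fixed iteration count, and epsilon guard are assumptions, not the tuned implementations listed above:

    m = 200;  n = 100;  r = 5;         % made-up sizes; A stands in for a traffic matrix
    A = rand(m, n);
    W = rand(m, r);  H = rand(r, n);   % random nonnegative initialization
    guard = 1e-9;                      % avoid division by zero
    for it = 1:100                     % fixed iteration count; no convergence test
        H = H .* (W' * A) ./ (W' * W * H + guard);
        W = W .* (A * H') ./ (W * (H * H') + guard);
    end
    err = norm(A - W * H, 'fro')       % reconstruction error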
20
NMF traffic analysis results
– NMF identifies essential components of the traffic
– Analyst labels different types of external behavior
21
Future Directions
What should KDS contain:
– More algorithms?
– Other classes of algorithms?
– Visualization?
– Easy use of hardware accelerators (GPU, Cell)?