Scaling Multivariate Statistics to Massive Data
Algorithmic problems and approaches
Alexander Gray
Georgia Institute of Technology
Core methods of statistics / machine learning / mining
1. Querying: spherical range-search O(N), orthogonal range-search O(N), spatial join O(N^2), nearest-neighbor O(N), all-nearest-neighbors O(N^2)
2. Density estimation: mixture of Gaussians, kernel density estimation O(N^2), kernel conditional density estimation O(N^3)
3. Regression: linear regression, kernel regression O(N^2), Gaussian process regression O(N^3)
4. Classification: decision tree, nearest-neighbor classifier O(N^2), nonparametric Bayes classifier O(N^2), support vector machine O(N^3)
5. Dimension reduction: principal component analysis, non-negative matrix factorization, kernel PCA O(N^3), maximum variance unfolding O(N^3)
6. Outlier detection: by density estimation or dimension reduction
7. Clustering: by density estimation or dimension reduction, k-means, mean-shift segmentation O(N^2), hierarchical clustering O(N^3)
8. Time series analysis: Kalman filter, hidden Markov model, trajectory tracking O(N^n)
9. Feature selection and causality: LASSO, L1 SVM, Gaussian graphical models, discrete graphical models
10. Fusion and matching: sequence alignment, bipartite matching O(N^3), n-point correlation 2-sample testing O(N^n)
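To make the O(N^2) entries concrete, here is a minimal sketch of kernel density estimation by direct summation. The function name `naive_kde`, the 1-D Gaussian kernel, and the toy data are illustrative assumptions, not taken from the slides. Evaluating the density at every data point touches every pair of points, which is where the quadratic cost comes from.

```python
import math

def naive_kde(queries, data, bandwidth):
    """1-D Gaussian KDE by direct summation: O(len(queries) * len(data))."""
    norm = 1.0 / (len(data) * bandwidth * math.sqrt(2.0 * math.pi))
    densities = []
    for q in queries:
        total = 0.0
        for x in data:  # one kernel evaluation per (query, data) pair
            u = (q - x) / bandwidth
            total += math.exp(-0.5 * u * u)
        densities.append(norm * total)
    return densities

# Toy data (hypothetical): four clustered points and one isolated point.
data = [0.0, 0.5, 1.0, 1.5, 4.0]
densities = naive_kde(data, data, bandwidth=0.5)
```

With queries equal to the data, the double loop performs N^2 kernel evaluations; the tree-based and transform-based methods discussed later aim to avoid most of them.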
Now pretty fast (2011)…
1. Querying: spherical range-search O(log N)*, orthogonal range-search O(log N)*, spatial join O(N)*, nearest-neighbor O(log N), all-nearest-neighbors O(N)
2. Density estimation: mixture of Gaussians, kernel density estimation O(N), kernel conditional density estimation O(N^(log 3))*
3. Regression: linear regression, kernel regression O(N), Gaussian process regression O(N)*
4. Classification: decision tree, nearest-neighbor classifier O(N), nonparametric Bayes classifier O(N)*, support vector machine
5. Dimension reduction: principal component analysis, non-negative matrix factorization, kernel PCA O(N)*, maximum variance unfolding O(N)*
6. Outlier detection: by density estimation or dimension reduction
7. Clustering: by density estimation or dimension reduction, k-means, mean-shift segmentation O(N), hierarchical clustering O(N log N)
8. Time series analysis: Kalman filter, hidden Markov model, trajectory tracking O(N^(log n))*
9. Feature selection and causality: LASSO, L1 SVM, Gaussian graphical models, discrete graphical models
10. Fusion and matching: sequence alignment, bipartite matching O(N)**, n-point correlation 2-sample testing O(N^(log n))*
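As an illustration of how an indexing structure turns an O(N) nearest-neighbor scan into a search that is typically logarithmic in low dimension, here is a minimal kd-tree sketch. It is a textbook construction, not the specific algorithms behind the slide's figures; the names `build_kdtree` and `nearest` and the sample points are assumptions made for this example. In high dimension the pruning step fires less often and the query cost drifts back toward a linear scan.

```python
import math

def build_kdtree(points, depth=0):
    """Recursively split points on the median of one coordinate per level."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def nearest(node, query, best=None):
    """Return (distance, point) for the stored point closest to query."""
    if node is None:
        return best
    d = math.dist(node["point"], query)
    if best is None or d < best[0]:
        best = (d, node["point"])
    diff = query[node["axis"]] - node["point"][node["axis"]]
    if diff < 0:
        near, far = node["left"], node["right"]
    else:
        near, far = node["right"], node["left"]
    best = nearest(near, query, best)
    # Visit the far side only if the splitting plane is closer than the best match.
    if abs(diff) < best[0]:
        best = nearest(far, query, best)
    return best

# Hypothetical 2-D points for illustration.
points = [(2.0, 3.0), (5.0, 4.0), (9.0, 6.0), (4.0, 7.0), (8.0, 1.0), (7.0, 2.0)]
tree = build_kdtree(points)
dist, point = nearest(tree, (9.0, 2.0))
```

The pruning test is the whole point of the structure: any subtree whose bounding half-space is farther away than the current best match is skipped without inspecting its points.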
Things we made fast (legend: fastest; fastest in some settings)
1. Querying: spherical range-search O(log N)*, orthogonal range-search O(log N)*, spatial join O(N)*, nearest-neighbor O(log N), all-nearest-neighbors O(N)
2. Density estimation: mixture of Gaussians, kernel density estimation O(N), kernel conditional density estimation O(N^(log 3))*
3. Regression: linear regression, kernel regression O(N), Gaussian process regression O(N)*
4. Classification: decision tree, nearest-neighbor classifier O(N), nonparametric Bayes classifier O(N)*, support vector machine O(N)/O(N^2)
5. Dimension reduction: principal component analysis, non-negative matrix factorization, kernel PCA O(N)*, maximum variance unfolding O(N)*
6. Outlier detection: by density estimation or dimension reduction
7. Clustering: by density estimation or dimension reduction, k-means, mean-shift segmentation O(N), hierarchical (FoF) clustering O(N log N)
8. Time series analysis: Kalman filter, hidden Markov model, trajectory tracking O(N^(log n))*
9. Feature selection and causality: LASSO, L1 SVM, Gaussian graphical models, discrete graphical models
10. Fusion and matching: sequence alignment, bipartite matching O(N)**, n-point correlation 2-sample testing O(N^(log n))*
Core computational problems
What are the basic mathematical operations that make things hard?
Alternative to speeding up each of the 1000s of statistical methods: treat the common computational bottlenecks.
Divide up the space of problems (and associated algorithmic strategies), so we can examine the unique challenges and possible ways forward within each.
The 7 Giants of data
1. Basic statistics
2. Generalized N-body problems
3. Graph-theoretic problems
4. Linear-algebraic problems
5. Optimizations
6. Integrations
7. Alignment problems
The 7 Giants of data
1. Basic statistics, e.g. counts, contingency tables, means, medians, variances, range queries (SQL queries)
2. Generalized N-body problems, e.g. nearest-nbrs (in NLDR, etc.), kernel summations (in KDE, GP, SVM, etc.), clustering, MST, spatial correlations
The 7 Giants of data
3. Graph-theoretic problems, e.g. betweenness centrality, commute distance, graphical model inference
4. Linear-algebraic problems, e.g. linear algebra, PCA, Gaussian process regression, manifold learning
5. Optimizations, e.g. LP/QP/SDP/SOCP/MINLPs in regularized methods, compressed sensing
The 7 Giants of data
6. Integrations, e.g. Bayesian inference
7. Alignment problems, e.g. BLAST in genomics, string matching, phylogenies, SLAM, cross-match
Back to our list (basic, N-body, graphs, linear algebra, optimization, integration, alignment)
1. Querying: spherical range-search O(N), orthogonal range-search O(N), spatial join O(N^2), nearest-neighbor O(N), all-nearest-neighbors O(N^2)
2. Density estimation: mixture of Gaussians, kernel density estimation O(N^2), kernel conditional density estimation O(N^3)
3. Regression: linear regression, kernel regression O(N^2), Gaussian process regression O(N^3)
4. Classification: decision tree, nearest-neighbor classifier O(N^2), nonparametric Bayes classifier O(N^2), support vector machine O(N^3)
5. Dimension reduction: principal component analysis, non-negative matrix factorization, kernel PCA O(N^3), maximum variance unfolding O(N^3)
6. Outlier detection: by density estimation or dimension reduction
7. Clustering: by density estimation or dimension reduction, k-means, mean-shift segmentation O(N^2), hierarchical clustering O(N^3)
8. Time series analysis: Kalman filter, hidden Markov model, trajectory tracking O(N^n)
9. Feature selection and causality: LASSO, L1 SVM, Gaussian graphical models, discrete graphical models
10. Fusion and matching: sequence alignment, bipartite matching O(N^3), n-point correlation 2-sample testing O(N^n)
5 settings
1. Regular: batch, in-RAM/core, one CPU
2. Streaming (non-batch)
3. Disk (out-of-core)
4. Distributed: threads/multi-core (shared memory)
5. Distributed: clusters/cloud (distributed memory)
4 common data types
1. Vector data, iid
2. Time series
3. Images
4. Graphs
3 desiderata
1. Fast experimental runtime/performance*
2. Fast theoretic (provable) runtime/performance*
3. Accuracy guarantees
*Performance: runtime, memory, communication, disk accesses; time-constrained, anytime, etc.
7 general solution strategies
1. Divide and conquer (indexing structures)
2. Dynamic programming
3. Function transforms
4. Random sampling (Monte Carlo)
5. Non-random sampling (active learning)
6. Parallelism
7. Problem reduction
1. Summary statistics
Examples: counts, contingency tables, means, medians, variances, range queries (SQL queries)
What's unique/challenges: streaming, new guarantees
Promising/interesting:
– Sketching approaches
– AD-trees
– MapReduce/Hadoop (Aster, Greenplum, Netezza)
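The streaming challenge can be illustrated with a one-pass mean and variance. The sketch below uses Welford's update, a standard technique that the slides do not name; the class `RunningStats` and the sample values are assumptions for illustration. Each value is seen once and only three numbers are kept in memory.

```python
class RunningStats:
    """One-pass mean and variance (Welford's update), O(1) memory."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        """Sample variance (n - 1 denominator); needs at least two values."""
        return self.m2 / (self.n - 1)

stats = RunningStats()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:  # hypothetical stream
    stats.update(value)
```

Means and variances admit exact streaming updates like this one; medians and distinct counts generally do not, which is where sketching approaches with approximation guarantees come in.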
2. Generalized N-body problems
Examples: nearest-nbrs (in NLDR, etc.), kernel summations (in KDE, GP, SVM, etc.), clustering, MST, spatial correlations
What's unique/challenges: general dimension, non-Euclidean, new guarantees (e.g. in rank)
Promising/interesting:
– Generalized/higher-order FMM: O(N^2) → O(N)
– Random projections
– GPUs
3. Graph-theoretic problems
Examples: betweenness centrality, commute distance, graphical model inference
What's unique/challenges: high interconnectivity (cliques), out-of-core
Promising/interesting:
– Variational methods
– Stochastic composite likelihood methods
– MapReduce/Hadoop (Facebook, etc.)
4. Linear-algebraic problems
Examples: linear algebra, PCA, Gaussian process regression, manifold learning
What's unique/challenges: probabilistic guarantees, kernel matrices
Promising/interesting:
– Sampling-based methods
– Online methods
– Approximate matrix-vector multiply via N-body
5. Optimizations
Examples: LP/QP/SDP/SOCP/MINLPs in regularized methods, compressed sensing
What's unique/challenges: stochastic programming, streaming
Promising/interesting:
– Reformulations/relaxations of various ML forms
– Online, mini-batch methods
– Parallel online methods
– Submodular functions
– Global optimization (non-convex)
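As one example of the online/mini-batch idea, the sketch below fits a line by mini-batch stochastic gradient descent on squared error. The function `minibatch_sgd`, the step size, batch size, and the noiseless synthetic data are all assumptions chosen for illustration. Each update looks at only a few points, so the cost per step does not grow with N.

```python
import random

def minibatch_sgd(xs, ys, batch_size=4, step=0.2, epochs=200, seed=0):
    """Fit y ~ w * x + b by mini-batch SGD on the mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # new random batches each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of 0.5 * mean squared residual over the batch only.
            grad_w = sum((w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum((w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= step * grad_w
            b -= step * grad_b
    return w, b

# Hypothetical noiseless data on the line y = 3x + 1.
xs = [i / 10.0 for i in range(20)]
ys = [3.0 * x + 1.0 for x in xs]
w, b = minibatch_sgd(xs, ys)
```

On noisy data a fixed step size leaves the iterates fluctuating around the optimum, so a decaying step or iterate averaging is usually added; that is omitted here to keep the sketch short.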
6. Integrations
Examples: Bayesian inference
What's unique/challenges: general dimension
Promising/interesting:
– MCMC
– ABC
– Particle filtering
– Adaptive importance sampling, active learning
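For the MCMC entry, here is a minimal random-walk Metropolis sketch that draws samples from an unnormalized density, so that expectations (integrals) can be estimated by sample averages. The function `metropolis`, the standard-normal target, and the proposal width are illustrative assumptions, not the methods used in the talk.

```python
import math
import random

def metropolis(log_density, start, n_samples, step=1.0, seed=0):
    """Random-walk Metropolis sampler for a 1-D unnormalized log-density."""
    rng = random.Random(seed)
    x = start
    log_p = log_density(x)
    samples = []
    for _ in range(n_samples):
        proposal = x + rng.gauss(0.0, step)
        log_p_new = log_density(proposal)
        # Accept with probability min(1, p(proposal) / p(x)).
        # 1.0 - random() lies in (0, 1], which keeps log() defined.
        if math.log(1.0 - rng.random()) < log_p_new - log_p:
            x, log_p = proposal, log_p_new
        samples.append(x)
    return samples

# Target: standard normal, known only up to its normalizing constant.
samples = metropolis(lambda x: -0.5 * x * x, start=0.0, n_samples=20000, step=2.0)
mean = sum(samples) / len(samples)
second_moment = sum(s * s for s in samples) / len(samples)
```

The sampler never needs the normalizing constant, which is the usual situation in Bayesian inference. The "general dimension" challenge shows up as slower mixing: in higher dimension the proposal width has to shrink and many more samples are needed for the same accuracy.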
7. Alignments
Examples: BLAST in genomics, string matching, phylogenies, SLAM, cross-match
What's unique/challenges: greater heterogeneity, measurement errors
Promising/interesting:
– Probabilistic representations
– Reductions to generalized N-body problems
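The simplest alignment problem, edit distance between two strings, shows the dynamic-programming strategy at work. This is the classical quadratic-time recurrence, not BLAST (which is a heuristic search); the function name `edit_distance` and the example strings are assumptions for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance by dynamic programming, O(len(a) * len(b)) time."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]

distance = edit_distance("kitten", "sitting")
```

Only two rows of the table are kept, so memory is linear in the shorter dimension even though time stays quadratic. Recovering the alignment itself, rather than just its cost, requires keeping the full table or a divide-and-conquer variant.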
Reductions/transformations between problems
– Gaussian graphical models → linear algebra
– Bayesian integration → MAP optimization
– Euclidean graphs → N-body problems
– Linear algebra on kernel matrices → N-body inside conjugate gradient
– Can featurize a graph or any other structure → matrix-based ML problem
– Create new ML methods with different computational properties
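The reduction from kernel linear algebra to N-body computation can be sketched as follows: conjugate gradient only ever needs matrix-vector products, and a product with a kernel matrix is a kernel summation. In this sketch the product is computed by direct O(N^2) summation; that callback is where a fast N-body method would be substituted. The function names, the Gaussian kernel, the ridge term, and the data are assumptions for illustration.

```python
import math

def kernel_matvec(points, v, bandwidth=1.0, ridge=0.1):
    """Compute (K + ridge * I) v by direct kernel summation, O(N^2)."""
    out = []
    for i, xi in enumerate(points):
        total = ridge * v[i]
        for xj, vj in zip(points, v):
            total += math.exp(-0.5 * ((xi - xj) / bandwidth) ** 2) * vj
        out.append(total)
    return out

def conjugate_gradient(matvec, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A, given only x -> A x."""
    x = [0.0] * len(b)
    r = list(b)  # residual b - A x for the initial guess x = 0
    p = list(r)
    rs = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        if math.sqrt(rs) < tol:
            break
        Ap = matvec(p)  # the only place the matrix is touched
        alpha = rs / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = sum(ri * ri for ri in r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x

# Hypothetical 1-D inputs and targets for a small kernel ridge system.
points = [0.0, 1.0, 2.5, 4.0]
targets = [1.0, 2.0, 0.5, -1.0]
weights = conjugate_gradient(lambda v: kernel_matvec(points, v), targets)
```

Because the solver never forms the N x N kernel matrix, replacing `kernel_matvec` with an approximate fast summation changes the cost per iteration without touching the solver, at the price of an approximation error that the solver has to tolerate.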
General conclusions
– Algorithms can dramatically change the runtime order, e.g. O(N^2) to O(N)
– High dimensionality is a persistent challenge
– The non-default (e.g. streaming, disk…) settings need more research work
– Systems issues need more work, e.g. connection to data storage/management
– Hadoop does not solve everything
General conclusions (continued)
– No general theory for the tradeoff between statistical quality and computational cost (lower/upper bounds, etc.)
– More aspects of hardness (statistical and computational) are needed