1 Scaling Multivariate Statistics to Massive Data: Algorithmic Problems and Approaches. Alexander Gray, Georgia Institute of Technology, www.fast-lab.org

2 Core methods of statistics / machine learning / mining
1. Querying: spherical range-search O(N), orthogonal range-search O(N), spatial join O(N^2), nearest-neighbor O(N), all-nearest-neighbors O(N^2)
2. Density estimation: mixture of Gaussians, kernel density estimation O(N^2), kernel conditional density estimation O(N^3)
3. Regression: linear regression, kernel regression O(N^2), Gaussian process regression O(N^3)
4. Classification: decision tree, nearest-neighbor classifier O(N^2), nonparametric Bayes classifier O(N^2), support vector machine O(N^3)
5. Dimension reduction: principal component analysis, non-negative matrix factorization, kernel PCA O(N^3), maximum variance unfolding O(N^3)
6. Outlier detection: by density estimation or dimension reduction
7. Clustering: by density estimation or dimension reduction, k-means, mean-shift segmentation O(N^2), hierarchical clustering O(N^3)
8. Time series analysis: Kalman filter, hidden Markov model, trajectory tracking O(N^n)
9. Feature selection and causality: LASSO, L1 SVM, Gaussian graphical models, discrete graphical models
10. Fusion and matching: sequence alignment, bipartite matching O(N^3), n-point correlation 2-sample testing O(N^n)

3 Now pretty fast (2011)…
1. Querying: spherical range-search O(log N)*, orthogonal range-search O(log N)*, spatial join O(N)*, nearest-neighbor O(log N), all-nearest-neighbors O(N)
2. Density estimation: mixture of Gaussians, kernel density estimation O(N), kernel conditional density estimation O(N^(log 3))*
3. Regression: linear regression, kernel regression O(N), Gaussian process regression O(N)*
4. Classification: decision tree, nearest-neighbor classifier O(N), nonparametric Bayes classifier O(N)*, support vector machine
5. Dimension reduction: principal component analysis, non-negative matrix factorization, kernel PCA O(N)*, maximum variance unfolding O(N)*
6. Outlier detection: by density estimation or dimension reduction
7. Clustering: by density estimation or dimension reduction, k-means, mean-shift segmentation O(N), hierarchical clustering O(N log N)
8. Time series analysis: Kalman filter, hidden Markov model, trajectory tracking O(N^(log n))*
9. Feature selection and causality: LASSO, L1 SVM, Gaussian graphical models, discrete graphical models
10. Fusion and matching: sequence alignment, bipartite matching O(N)**, n-point correlation 2-sample testing O(N^(log n))*

4 Things we made fast (*: fastest, **: fastest in some settings)
1. Querying: spherical range-search O(log N)*, orthogonal range-search O(log N)*, spatial join O(N)*, nearest-neighbor O(log N), all-nearest-neighbors O(N)
2. Density estimation: mixture of Gaussians, kernel density estimation O(N), kernel conditional density estimation O(N^(log 3))*
3. Regression: linear regression, kernel regression O(N), Gaussian process regression O(N)*
4. Classification: decision tree, nearest-neighbor classifier O(N), nonparametric Bayes classifier O(N)*, support vector machine O(N)/O(N^2)
5. Dimension reduction: principal component analysis, non-negative matrix factorization, kernel PCA O(N)*, maximum variance unfolding O(N)*
6. Outlier detection: by density estimation or dimension reduction
7. Clustering: by density estimation or dimension reduction, k-means, mean-shift segmentation O(N), hierarchical (FoF) clustering O(N log N)
8. Time series analysis: Kalman filter, hidden Markov model, trajectory tracking O(N^(log n))*
9. Feature selection and causality: LASSO, L1 SVM, Gaussian graphical models, discrete graphical models
10. Fusion and matching: sequence alignment, bipartite matching O(N)**, n-point correlation 2-sample testing O(N^(log n))*

5 Core computational problems
What are the basic mathematical operations making things hard?
Alternative to speeding up each of the 1000s of statistical methods: treat the common computational bottlenecks.
Divide up the space of problems (and associated algorithmic strategies), so we can examine the unique challenges and possible ways forward within each.

6 The 7 Giants of data
1. Basic statistics
2. Generalized N-body problems
3. Graph-theoretic problems
4. Linear-algebraic problems
5. Optimizations
6. Integrations
7. Alignment problems

7 The 7 Giants of data
1. Basic statistics: e.g. counts, contingency tables, means, medians, variances, range queries (SQL queries)
2. Generalized N-body problems: e.g. nearest-neighbors (in NLDR, etc.), kernel summations (in KDE, GP, SVM, etc.), clustering, MST, spatial correlations

8 The 7 Giants of data
3. Graph-theoretic problems: e.g. betweenness centrality, commute distance, graphical model inference
4. Linear-algebraic problems: e.g. linear algebra, PCA, Gaussian process regression, manifold learning
5. Optimizations: e.g. LP/QP/SDP/SOCP/MINLPs in regularized methods, compressed sensing

9 The 7 Giants of data
6. Integrations: e.g. Bayesian inference
7. Alignment problems: e.g. BLAST in genomics, string matching, phylogenies, SLAM, cross-match

10 Back to our list (basic, N-body, graphs, linear algebra, optimization, integration, alignment)
1. Querying: spherical range-search O(N), orthogonal range-search O(N), spatial join O(N^2), nearest-neighbor O(N), all-nearest-neighbors O(N^2)
2. Density estimation: mixture of Gaussians, kernel density estimation O(N^2), kernel conditional density estimation O(N^3)
3. Regression: linear regression, kernel regression O(N^2), Gaussian process regression O(N^3)
4. Classification: decision tree, nearest-neighbor classifier O(N^2), nonparametric Bayes classifier O(N^2), support vector machine O(N^3)
5. Dimension reduction: principal component analysis, non-negative matrix factorization, kernel PCA O(N^3), maximum variance unfolding O(N^3)
6. Outlier detection: by density estimation or dimension reduction
7. Clustering: by density estimation or dimension reduction, k-means, mean-shift segmentation O(N^2), hierarchical clustering O(N^3)
8. Time series analysis: Kalman filter, hidden Markov model, trajectory tracking O(N^n)
9. Feature selection and causality: LASSO, L1 SVM, Gaussian graphical models, discrete graphical models
10. Fusion and matching: sequence alignment, bipartite matching O(N^3), n-point correlation 2-sample testing O(N^n)

11 5 settings
1. Regular: batch, in-RAM/in-core, one CPU
2. Streaming (non-batch)
3. Disk (out-of-core)
4. Distributed: threads/multi-core (shared memory)
5. Distributed: clusters/cloud (distributed memory)

12 4 common data types
1. Vector data, iid
2. Time series
3. Images
4. Graphs

13 3 desiderata
1. Fast experimental runtime/performance*
2. Fast theoretical (provable) runtime/performance*
3. Accuracy guarantees
*Performance: runtime, memory, communication, disk accesses; time-constrained, anytime, etc.

14 7 general solution strategies
1. Divide and conquer (indexing structures)
2. Dynamic programming
3. Function transforms
4. Random sampling (Monte Carlo)
5. Non-random sampling (active learning)
6. Parallelism
7. Problem reduction
(A minimal code sketch of strategy 1 follows below.)
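To make strategy 1 concrete, here is a minimal sketch (our illustration, not code from the talk) of divide and conquer via an indexing structure: a kd-tree with branch-and-bound nearest-neighbor search, the pattern behind the O(log N) query results on the earlier slides. Names and leaf size are illustrative assumptions.

import numpy as np

class KDNode:
    def __init__(self, points, depth=0, leaf_size=16):
        self.points = points
        self.left = self.right = None
        if len(points) > leaf_size:
            d = depth % points.shape[1]              # cycle through split dimensions
            points = points[points[:, d].argsort()]
            mid = len(points) // 2
            self.left = KDNode(points[:mid], depth + 1, leaf_size)
            self.right = KDNode(points[mid:], depth + 1, leaf_size)
        self.lo, self.hi = points.min(axis=0), points.max(axis=0)  # bounding box

    def min_dist(self, q):
        # lower bound on the distance from q to any point in this node's box
        return np.linalg.norm(np.maximum(0, np.maximum(self.lo - q, q - self.hi)))

def nn_search(node, q, best=(np.inf, None)):
    if node.min_dist(q) >= best[0]:
        return best                                  # prune: node cannot beat current best
    if node.left is None:                            # leaf: scan its few points exactly
        d = np.linalg.norm(node.points - q, axis=1)
        i = d.argmin()
        return (d[i], node.points[i]) if d[i] < best[0] else best
    # visit the child whose box is closer to q first, to tighten the bound early
    for child in sorted((node.left, node.right), key=lambda c: c.min_dist(q)):
        best = nn_search(child, q, best)
    return best

rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 3))
dist, point = nn_search(KDNode(X), np.zeros(3))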

15 1. Summary statistics
Examples: counts, contingency tables, means, medians, variances, range queries (SQL queries)
What's unique/challenges: streaming, new guarantees
Promising/interesting:
- Sketching approaches
- AD-trees
- MapReduce/Hadoop (Aster, Greenplum, Netezza)
(A one-pass streaming sketch follows below.)
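As a concrete example of the streaming setting (our sketch, using only the standard Welford recurrence, not code from the talk): mean and variance computed in one pass with O(1) memory per update.

def welford(stream):
    # one-pass, numerically stable running mean and variance
    n, mean, m2 = 0, 0.0, 0.0
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)       # uses the already-updated mean
    variance = m2 / (n - 1) if n > 1 else 0.0
    return n, mean, variance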

16 2. Generalized N-body problems
Examples: nearest-neighbors (in NLDR, etc.), kernel summations (in KDE, GP, SVM, etc.), clustering, MST, spatial correlations
What's unique/challenges: general dimension, non-Euclidean metrics, new guarantees (e.g. in rank)
Promising/interesting:
- Generalized/higher-order FMM: O(N^2) -> O(N)
- Random projections
- GPUs
(A simplified pruning sketch follows below.)
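A simplified single-tree sketch of the pruning idea behind FMM-style fast kernel summation: approximate the sum over a whole tree node whenever distance bounds prove its kernel values are nearly constant. The node layout, Gaussian kernel, and tolerance rule are our illustrative assumptions, not the talk's algorithm.

import numpy as np

def build(points, leaf_size=32):
    node = {"pts": points, "lo": points.min(0), "hi": points.max(0), "kids": None}
    if len(points) > leaf_size:
        d = np.argmax(node["hi"] - node["lo"])       # split the widest dimension
        order = points[:, d].argsort()
        mid = len(points) // 2
        node["kids"] = (build(points[order[:mid]], leaf_size),
                        build(points[order[mid:]], leaf_size))
    return node

def kernel_sum(node, q, h, tol):
    # bounds on the distance from q to any point in the node's bounding box
    dmin = np.linalg.norm(np.maximum(0, np.maximum(node["lo"] - q, q - node["hi"])))
    dmax = np.linalg.norm(np.maximum(np.abs(node["lo"] - q), np.abs(node["hi"] - q)))
    kmax = np.exp(-dmin**2 / (2 * h**2))
    kmin = np.exp(-dmax**2 / (2 * h**2))
    if kmax - kmin < tol:                            # prune: per-point error < tol/2
        return len(node["pts"]) * 0.5 * (kmax + kmin)
    if node["kids"] is None:                         # leaf: sum exactly
        d2 = ((node["pts"] - q) ** 2).sum(1)
        return np.exp(-d2 / (2 * h**2)).sum()
    return sum(kernel_sum(c, q, h, tol) for c in node["kids"])

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 2))
approx = kernel_sum(build(X), np.zeros(2), h=0.5, tol=1e-4)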

17 3. Graph-theoretic problems
Examples: betweenness centrality, commute distance, graphical model inference
What's unique/challenges: high interconnectivity (cliques), out-of-core
Promising/interesting:
- Variational methods
- Stochastic composite likelihood methods
- MapReduce/Hadoop (Facebook, etc.)
(A centrality sketch follows below.)
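For the first example, Brandes' algorithm is the standard fast approach to betweenness centrality: O(VE) on unweighted graphs instead of naive all-pairs accumulation. A textbook-style sketch for an undirected adjacency-list graph (our code, not the talk's):

from collections import deque

def betweenness(graph):
    bc = {v: 0.0 for v in graph}
    for s in graph:
        # BFS from s, counting shortest paths (sigma) and predecessors
        sigma = {v: 0 for v in graph}; sigma[s] = 1
        dist = {v: -1 for v in graph}; dist[s] = 0
        preds = {v: [] for v in graph}
        order, queue = [], deque([s])
        while queue:
            v = queue.popleft(); order.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1; queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]; preds[w].append(v)
        # accumulate path dependencies in reverse BFS order
        delta = {v: 0.0 for v in graph}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # undirected graph: each pair was counted from both endpoints
    return {v: b / 2 for v, b in bc.items()}

g = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
print(betweenness(g))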

18 4. Linear-algebraic problems
Examples: linear algebra, PCA, Gaussian process regression, manifold learning
What's unique/challenges: probabilistic guarantees, kernel matrices
Promising/interesting:
- Sampling-based methods
- Online methods
- Approximate matrix-vector multiply via N-body
(A randomized sketch follows below.)
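A hedged sketch of the sampling-based idea: a randomized range finder plus a small exact SVD, in the style of Halko, Martinsson, and Tropp, giving an approximate truncated SVD (center the columns first to get PCA). The oversampling amount and the omission of power iterations are simplifying assumptions.

import numpy as np

def randomized_svd(A, k, oversample=10, seed=0):
    rng = np.random.default_rng(seed)
    # sketch the column space of A with a random Gaussian test matrix
    Y = A @ rng.normal(size=(A.shape[1], k + oversample))
    Q, _ = np.linalg.qr(Y)                   # orthonormal basis for range(Y)
    B = Q.T @ A                              # small (k + oversample) x n problem
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)
    return Q @ U_small[:, :k], s[:k], Vt[:k]

rng = np.random.default_rng(1)
A = rng.normal(size=(2000, 50)) @ rng.normal(size=(50, 300))  # low-rank-ish
U, s, Vt = randomized_svd(A, k=10)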

19 5. Optimizations
Examples: LP/QP/SDP/SOCP/MINLPs in regularized methods, compressed sensing
What's unique/challenges: stochastic programming, streaming
Promising/interesting:
- Reformulations/relaxations of various ML forms
- Online, mini-batch methods
- Parallel online methods
- Submodular functions
- Global optimization (non-convex)
(A mini-batch sketch follows below.)
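A minimal sketch of the online/mini-batch idea: stochastic gradient descent on a least-squares objective. The learning rate, batch size, and epoch count are illustrative assumptions; practical solvers add step-size schedules, momentum, regularization, and convergence checks.

import numpy as np

def sgd_least_squares(X, y, lr=0.01, batch=32, epochs=20, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        # shuffle, then sweep the data in small batches
        for idx in np.array_split(rng.permutation(n), max(1, n // batch)):
            Xb, yb = X[idx], y[idx]
            grad = 2.0 / len(idx) * Xb.T @ (Xb @ w - yb)   # mini-batch gradient
            w -= lr * grad
    return w

rng = np.random.default_rng(2)
X = rng.normal(size=(5000, 10)); w_true = rng.normal(size=10)
w_hat = sgd_least_squares(X, X @ w_true + 0.1 * rng.normal(size=5000))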

20 6. Integrations
Examples: Bayesian inference
What's unique/challenges: general dimension
Promising/interesting:
- MCMC
- ABC (approximate Bayesian computation)
- Particle filtering
- Adaptive importance sampling, active learning
(A minimal MCMC sketch follows below.)
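A minimal MCMC sketch: random-walk Metropolis estimating an expectation under an unnormalized density. The proposal scale and chain length are illustrative assumptions, and burn-in is omitted for brevity.

import numpy as np

def metropolis_mean(log_p, f, x0, steps=50000, scale=0.5, seed=0):
    rng = np.random.default_rng(seed)
    x, lp = np.asarray(x0, float), log_p(x0)
    total = 0.0
    for _ in range(steps):
        prop = x + scale * rng.normal(size=x.shape)   # symmetric random-walk proposal
        lp_prop = log_p(prop)
        if np.log(rng.uniform()) < lp_prop - lp:      # accept with prob min(1, ratio)
            x, lp = prop, lp_prop
        total += f(x)
    return total / steps

# E[x^2] under a standard normal is 1
est = metropolis_mean(lambda x: -0.5 * (x**2).sum(),
                      lambda x: (x**2).sum(), np.zeros(1))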

21 7. Alignments
Examples: BLAST in genomics, string matching, phylogenies, SLAM, cross-match
What's unique/challenges: greater heterogeneity, measurement errors
Promising/interesting:
- Probabilistic representations
- Reductions to generalized N-body problems
(A dynamic-programming sketch follows below.)
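A sketch of the dynamic-programming core of global sequence alignment (Needleman-Wunsch). The match/mismatch/gap scores are illustrative assumptions; production tools like BLAST layer heuristics on top of this kind of scoring rather than running the full O(nm) table.

def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    n, m = len(a), len(b)
    # score[i][j] = best score aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1): score[i][0] = i * gap
    for j in range(1, m + 1): score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i-1] == b[j-1] else mismatch
            score[i][j] = max(score[i-1][j-1] + sub,   # match/substitution
                              score[i-1][j] + gap,     # gap in b
                              score[i][j-1] + gap)     # gap in a
    return score[n][m]

print(needleman_wunsch("GATTACA", "GCATGCU"))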

22 Reductions/transformations between problems
- Gaussian graphical models -> linear algebra
- Bayesian integration -> MAP optimization
- Euclidean graphs -> N-body problems
- Linear algebra on kernel matrices -> N-body inside conjugate gradient
- Can featurize a graph or any other structure -> matrix-based ML problem
Create new ML methods with different computational properties. (A sketch of the conjugate-gradient reduction follows below.)
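To illustrate one reduction, a sketch of "N-body inside conjugate gradient": solve (K + lambda*I) x = y for a kernel matrix K without ever forming K, using only matrix-vector products. The exact O(N^2) matvec below is the placeholder that a fast N-body kernel summation would replace; all names are our assumptions.

import numpy as np

def kernel_matvec(X, v, h=1.0):
    # (K v)_i = sum_j exp(-||x_i - x_j||^2 / 2h^2) v_j  -- the N-body bottleneck
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * h**2)) @ v

def cg_solve(matvec, y, lam=1e-2, iters=100, tol=1e-8):
    # conjugate gradient for (K + lam*I) x = y, touching K only through matvec
    x = np.zeros_like(y)
    r = y - (matvec(x) + lam * x)
    p = r.copy(); rs = r @ r
    for _ in range(iters):
        Ap = matvec(p) + lam * p
        alpha = rs / (p @ Ap)
        x += alpha * p; r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p; rs = rs_new
    return x

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2)); y = rng.normal(size=200)
alpha = cg_solve(lambda v: kernel_matvec(X, v), y)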

23 General conclusions
- Algorithms can dramatically change the runtime order, e.g. O(N^2) to O(N)
- High dimensionality is a persistent challenge
- The non-default settings (e.g. streaming, disk, ...) need more research work
- Systems issues need more work, e.g. the connection to data storage/management
- Hadoop does not solve everything

24 General conclusions
- There is no general theory yet for the tradeoff between statistical quality and computational cost (lower/upper bounds, etc.)
- More aspects of hardness (statistical and computational) need to be characterized

