Fast Algorithms for Analyzing Massive Data
Alexander Gray
Georgia Institute of Technology
www.fast-lab.org
The FASTlab: Fundamental Algorithmic and Statistical Tools Laboratory
www.fast-lab.org
1. Alexander Gray: Assoc Prof, Applied Math + CS; PhD CS
2. Arkadas Ozakin: Research Scientist, Math + Physics; PhD Physics
3. Dongryeol Lee: PhD student, CS + Math
4. Ryan Riegel: PhD student, CS + Math
5. Sooraj Bhat: PhD student, CS
6. Nishant Mehta: PhD student, CS
7. Parikshit Ram: PhD student, CS + Math
8. William March: PhD student, Math + CS
9. Hua Ouyang: PhD student, CS
10. Ravi Sastry: PhD student, CS
11. Long Tran: PhD student, CS
12. Ryan Curtin: PhD student, EE
13. Ailar Javadi: PhD student, EE
14. Anita Zakrzewska: PhD student, CS
plus 5-10 MS students and undergraduates
7 tasks of machine learning / data mining
1. Querying: spherical range-search O(N), orthogonal range-search O(N), nearest-neighbor O(N), all-nearest-neighbors O(N^2)
2. Density estimation: mixture of Gaussians, kernel density estimation O(N^2), kernel conditional density estimation O(N^3)
3. Classification: decision tree, nearest-neighbor classifier O(N^2), kernel discriminant analysis O(N^2), support vector machine O(N^3), L_p SVM
4. Regression: linear regression, LASSO, kernel regression O(N^2), Gaussian process regression O(N^3)
5. Dimension reduction: PCA, non-negative matrix factorization, kernel PCA O(N^3), maximum variance unfolding O(N^3); Gaussian graphical models, discrete graphical models
6. Clustering: k-means, mean-shift O(N^2), hierarchical (FoF) clustering O(N^3)
7. Testing and matching: MST O(N^3), bipartite cross-matching O(N^3), n-point correlation 2-sample testing O(N^n), kernel embedding
7 tasks of machine learning / data mining (with the FASTlab's contributions added)
1. Querying: spherical range-search O(N), orthogonal range-search O(N), nearest-neighbor O(N), all-nearest-neighbors O(N^2)
2. Density estimation: mixture of Gaussians, kernel density estimation O(N^2), kernel conditional density estimation O(N^3), submanifold density estimation [Ozakin & Gray, NIPS 2010] O(N^3), convex adaptive kernel estimation [Sastry & Gray, AISTATS 2011] O(N^4)
3. Classification: decision tree, nearest-neighbor classifier O(N^2), kernel discriminant analysis O(N^2), support vector machine O(N^3), L_p SVM, non-negative SVM [Guan et al., 2011]
4. Regression: linear regression, LASSO, kernel regression O(N^2), Gaussian process regression O(N^3)
5. Dimension reduction: PCA, non-negative matrix factorization, kernel PCA O(N^3), maximum variance unfolding O(N^3); Gaussian graphical models, discrete graphical models; rank-preserving maps [Ouyang and Gray, ICML 2008] O(N^3); isometric separation maps [Vasiloglou, Gray, and Anderson, MLSP 2009] O(N^3); isometric NMF [Vasiloglou, Gray, and Anderson, MLSP 2009] O(N^3); functional ICA [Mehta and Gray, 2009]; density preserving maps [Ozakin and Gray, in prep] O(N^3)
6. Clustering: k-means, mean-shift O(N^2), hierarchical (FoF) clustering O(N^3)
7. Testing and matching: MST O(N^3), bipartite cross-matching O(N^3), n-point correlation 2-sample testing O(N^n), kernel embedding
The same 7 tasks, with their runtimes highlighted: the O(N^2), O(N^3), and O(N^n) costs are the Computational Problem! (A concrete example of where the quadratic cost comes from follows.)
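To make the quadratic cost concrete, here is a minimal sketch (not from the talk) of naive kernel density estimation: each query point requires a sum over all N reference points, so evaluating at N queries costs N x N kernel evaluations. The Gaussian kernel and the bandwidth value are arbitrary choices for illustration.

```python
# Where the O(N^2) comes from: naive kernel density estimation sums over all N
# reference points for every query point. Bandwidth h is arbitrary here.
import numpy as np

def naive_kde(queries, references, h):
    densities = np.empty(len(queries))
    norm = (np.sqrt(2 * np.pi) * h) ** references.shape[1] * len(references)
    for i, q in enumerate(queries):                  # outer loop: N queries
        d2 = np.sum((references - q) ** 2, axis=1)   # inner sum over N references (vectorized)
        densities[i] = np.exp(-d2 / (2 * h ** 2)).sum() / norm
    return densities

X = np.random.rand(2000, 2)
print(naive_kde(X, X, h=0.1)[:5])                    # 2000 x 2000 kernel evaluations
```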
The "7 Giants" of Data (computational problem types) [Gray, Indyk, Mahoney, Szalay, in National Academy of Sciences report on Analysis of Massive Data, in prep]
1. Basic statistics: means, covariances, etc.
2. Generalized N-body problems: distances, geometry
3. Graph-theoretic problems: discrete graphs
4. Linear-algebraic problems: matrix operations
5. Optimizations: unconstrained, convex
6. Integrations: general dimension
7. Alignment problems: dynamic programming, matching
7 general strategies
1. Divide and conquer / indexing (trees)
2. Function transforms (series)
3. Sampling (Monte Carlo, active learning)
4. Locality (caching)
5. Streaming (online)
6. Parallelism (clusters, GPUs)
7. Problem transformation (reformulations)
1. Divide and conquer
Fastest approach for:
– nearest neighbor, range search (exact) ~O(log N) [Bentley 1970]; all-nearest-neighbors (exact) O(N) [Gray & Moore, NIPS 2000], [Ram, Lee, March, Gray, NIPS 2010]; anytime nearest neighbor (exact) [Ram & Gray, SDM 2012]; max inner product [Ram & Gray, under review] (a pruning sketch follows this slide)
– mixture of Gaussians [Moore, NIPS 1999]; k-means [Pelleg and Moore, KDD 1999]; mean-shift clustering O(N) [Lee & Gray, AISTATS 2009]; hierarchical clustering (single linkage, friends-of-friends) O(N log N) [March & Gray, KDD 2010]
– nearest neighbor classification [Liu, Moore, Gray, NIPS 2004]; kernel discriminant analysis O(N) [Riegel & Gray, SDM 2008]
– n-point correlation functions ~O(N^(log n)) [Gray & Moore, NIPS 2000], [Moore et al., Mining the Sky 2000]; multi-matcher jackknifed npcf [March & Gray, under review]
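A minimal sketch of the tree-based divide-and-conquer idea: a kd-tree-style partition with bounding boxes, and an exact nearest-neighbor search that prunes any subtree whose box cannot contain a point closer than the current best. This is a single-tree illustration only; the cited results use more sophisticated dual-tree and cover-tree algorithms, and the leaf size and data below are arbitrary.

```python
# kd-tree-style exact nearest-neighbor search with bounding-box pruning.
# Illustrative only; not the cited dual-tree algorithms.
import numpy as np

class Node:
    def __init__(self, points, depth=0, leaf_size=16):
        self.points = points
        self.lo, self.hi = points.min(axis=0), points.max(axis=0)   # bounding box
        self.left = self.right = None
        if len(points) > leaf_size:
            d = depth % points.shape[1]                  # split dimension (cycled)
            order = np.argsort(points[:, d])
            mid = len(points) // 2
            self.left = Node(points[order[:mid]], depth + 1, leaf_size)
            self.right = Node(points[order[mid:]], depth + 1, leaf_size)

def min_dist(node, q):
    # Distance from query q to the node's bounding box (0 if q is inside).
    return np.linalg.norm(np.maximum(np.maximum(node.lo - q, q - node.hi), 0.0))

def nearest(node, q, best=np.inf):
    if min_dist(node, q) >= best:            # prune: this box cannot beat the current best
        return best
    if node.left is None:                    # leaf: brute force over its points
        return min(best, np.linalg.norm(node.points - q, axis=1).min())
    for child in sorted((node.left, node.right), key=lambda c: min_dist(c, q)):
        best = nearest(child, q, best)       # visit the nearer child first
    return best

X = np.random.rand(10000, 3)
tree = Node(X)
print(nearest(tree, np.random.rand(3)))      # exact 1-NN distance
```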
3-point correlation (biggest previous computation: 20K points)
VIRGO simulation data, N = 75,000,000
naive: 5 x 10^9 sec (~150 years); multi-tree: 55 sec (exact)
n=2: O(N); n=3: O(N^(log 3)); n=4: O(N^2)
n-point correlation runtimes, 10^6 points (galaxy simulation data)
Methods: naive O(N^n) (estimated); single bandwidth [Gray & Moore 2000, Moore et al. 2000]; multi-bandwidth (new) [March & Gray, in prep 2010]. Speedups in parentheses: single-bandwidth vs. naive, then multi-bandwidth vs. single-bandwidth.
– 2-point correlation, 100 matchers: naive 2.0 x 10^7 s; single 352.8 s (56,000x); multi 4.96 s (71.1x)
– 3-point correlation, 243 matchers: naive 1.1 x 10^11 s; single 891.6 s (1.23 x 10^8 x); multi 13.58 s (65.6x)
– 4-point correlation, 216 matchers: naive 2.3 x 10^14 s; single 14,530 s (1.58 x 10^10 x); multi 503.6 s (28.8x)
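For reference, the naive column corresponds to brute-force enumeration like the sketch below: count every triple of points whose three pairwise distances fall inside a matcher bin. The bin edges and the (deliberately tiny) point count are made up; this is the O(N^3) computation that the single- and multi-bandwidth tree algorithms avoid.

```python
# Brute-force 3-point correlation counting: count triples (i, j, k) whose pairwise
# distances all fall within one distance bin ("matcher"). Matcher values are
# arbitrary; even 150 points already give ~550,000 triples.
import numpy as np
from itertools import combinations

def naive_3pcf_count(points, r_lo, r_hi):
    count = 0
    for i, j, k in combinations(range(len(points)), 3):
        d_ij = np.linalg.norm(points[i] - points[j])
        d_jk = np.linalg.norm(points[j] - points[k])
        d_ik = np.linalg.norm(points[i] - points[k])
        if all(r_lo <= d <= r_hi for d in (d_ij, d_jk, d_ik)):
            count += 1
    return count

pts = np.random.rand(150, 3)
print(naive_3pcf_count(pts, 0.1, 0.2))
```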
2. Function transforms
Fastest approach for:
– Kernel estimation (low-ish dimension): dual-tree fast Gauss transforms (multipole/Hermite expansions) [Lee, Gray, Moore, NIPS 2005], [Lee and Gray, UAI 2006]
– KDE and GP (kernel density estimation, Gaussian process regression) in high dimension: random Fourier functions [Lee and Gray, in prep] (sketch below)
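A generic sketch of the function-transform idea in high dimension, using random Fourier features (Rahimi & Recht-style) to approximate Gaussian-kernel values with an explicit O(ND) feature map instead of O(N^2) pairwise evaluations. This is a textbook construction, not the cited dual-tree fast Gauss transform or the [Lee and Gray] method; the bandwidth and feature count are arbitrary.

```python
# Random Fourier features: approximate a Gaussian kernel with an explicit
# low-dimensional feature map, so kernel sums become linear-time in N.
import numpy as np

def random_fourier_features(X, n_features=500, bandwidth=0.5, seed=0):
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=1.0 / bandwidth, size=(d, n_features))   # spectral samples of the Gaussian kernel
    b = rng.uniform(0, 2 * np.pi, size=n_features)                # random phases
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

X = np.random.rand(2000, 5)
Z = random_fourier_features(X)

# Exact vs. approximate kernel value for one pair of points:
i, j, h = 0, 1, 0.5
exact = np.exp(-np.sum((X[i] - X[j]) ** 2) / (2 * h ** 2))
approx = Z[i] @ Z[j]
print(exact, approx)
```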
3. Sampling
Fastest approach for (approximate):
– PCA: cosine trees [Holmes, Gray, Isbell, NIPS 2008]
– Kernel estimation: bandwidth learning [Holmes, Gray, Isbell, NIPS 2006], [Holmes, Gray, Isbell, UAI 2007]; Monte Carlo multipole method (with SVD trees) [Lee & Gray, NIPS 2009] (a generic Monte Carlo sketch follows)
– Nearest-neighbor: distance-approximate: spill trees with random projections [Liu, Moore, Gray, Yang, NIPS 2004]; rank-approximate: [Ram, Ouyang, Gray, NIPS 2009]
Rank-approximate NN:
– best meaning-retaining approximation criterion in the face of high-dimensional distances
– more accurate than LSH
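A generic Monte Carlo sketch of the sampling strategy: estimate a kernel density sum from a random subsample of the reference set, with a standard error, instead of touching all N points. This is plain uniform sampling for illustration, not the cited cosine-tree or Monte Carlo multipole methods; sample size and bandwidth are arbitrary.

```python
# Monte Carlo estimate of the average kernel contribution at a query point,
# using a random subsample of the references rather than all N of them.
import numpy as np

def mc_kernel_density(query, references, h, n_samples=2000, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(references), size=n_samples)    # sample with replacement
    d2 = np.sum((references[idx] - query) ** 2, axis=1)
    contrib = np.exp(-d2 / (2 * h ** 2))                      # per-point (unnormalized) kernel values
    est = contrib.mean()                                      # unbiased estimate of the mean contribution
    stderr = contrib.std(ddof=1) / np.sqrt(n_samples)
    return est, stderr

refs = np.random.rand(1_000_000, 3)
q = np.random.rand(3)
est, se = mc_kernel_density(q, refs, h=0.1)
exact = np.exp(-np.sum((refs - q) ** 2, axis=1) / (2 * 0.1 ** 2)).mean()
print(est, "+/-", se, "vs exact", exact)
```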
3. Sampling (continued)
Active learning: the sampling can depend on previous samples
– Linear classifiers: rigorous framework for pool-based active learning [Sastry and Gray, AISTATS 2012] (a generic pool-based loop is sketched below)
– Empirically allows a reduction in the number of objects that require labeling
– Theoretical rigor: unbiasedness
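A minimal pool-based active learning loop for a linear classifier, using simple uncertainty sampling: label the pool point the current model is least sure about. This only illustrates the setting; the cited AISTATS 2012 framework uses an unbiased, importance-weighted sampling scheme that is not reproduced here, and the synthetic data and query budget are arbitrary.

```python
# Generic pool-based active learning with uncertainty sampling (not the cited method).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_pool = rng.normal(size=(5000, 10))
y_pool = (X_pool @ rng.normal(size=10) > 0).astype(int)        # hidden labels (the "oracle")

# Seed the labeled set with a few points from each class.
labeled = list(np.where(y_pool == 0)[0][:5]) + list(np.where(y_pool == 1)[0][:5])

for _ in range(50):                                            # 50 label queries
    clf = LogisticRegression(max_iter=1000).fit(X_pool[labeled], y_pool[labeled])
    probs = clf.predict_proba(X_pool)[:, 1]
    uncertainty = -np.abs(probs - 0.5)                         # closest to 0.5 = most uncertain
    uncertainty[labeled] = -np.inf                             # never re-query a labeled point
    labeled.append(int(np.argmax(uncertainty)))                # ask the oracle for that label

clf = LogisticRegression(max_iter=1000).fit(X_pool[labeled], y_pool[labeled])
print("accuracy with", len(labeled), "labels:", clf.score(X_pool, y_pool))
```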
4. Caching
Fastest approach for (using disk):
– Nearest-neighbor, 2-point: disk-based tree algorithms in Microsoft SQL Server [Riegel, Aditya, Budavari, Gray, in prep]
– Builds a kd-tree on top of the built-in B-trees
– Fixed-pass algorithm to build the kd-tree (chunked-pass sketch below)

No. of points    MLDB (dual-tree)    Naive
40,000           8 seconds           159 seconds
200,000          43 seconds          3,480 seconds
2,000,000        297 seconds         80 hours
10,000,000       29 min 27 sec       74 days
20,000,000       58 min 48 sec       280 days
40,000,000       112 min 32 sec      2 years
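A rough sketch of the fixed-pass, disk-resident idea: keep the points in a memory-mapped file and decide a kd-tree split from chunked passes over the data rather than loading everything into RAM. The file name, chunk size, and per-chunk median trick are illustrative assumptions; this is not the cited SQL Server implementation.

```python
# Chunked passes over a memory-mapped (disk-resident) point set to pick the
# root split of a kd-tree-like partition. "points.bin" is a made-up file name.
import numpy as np

N, D, CHUNK = 2_000_000, 3, 250_000
data = np.memmap("points.bin", dtype=np.float32, mode="w+", shape=(N, D))
for start in range(0, N, CHUNK):                               # create some data, chunk by chunk
    data[start:start + CHUNK] = np.random.rand(min(CHUNK, N - start), D)
data.flush()

def chunked_split_value(col):
    # One fixed pass: approximate the median of column `col` from per-chunk medians.
    medians = [np.median(data[s:s + CHUNK, col]) for s in range(0, N, CHUNK)]
    return float(np.median(medians))

split = chunked_split_value(0)                                 # root split on dimension 0
left = sum(int((data[s:s + CHUNK, 0] <= split).sum()) for s in range(0, N, CHUNK))
print("root split at", round(split, 4), "->", left, "points in the left child")
```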
5. Streaming / online
Fastest approach for (approximate, or streaming):
– Online learning / stochastic optimization: use just the current sample to update the gradient (basic SGD sketch below)
– SVM (squared hinge loss): stochastic Frank-Wolfe [Ouyang and Gray, SDM 2010]
– SVM, LASSO, et al.: noise-adaptive stochastic approximation [Ouyang and Gray, in prep, on arXiv]; accelerated non-smooth SGD [Ouyang and Gray, under review]
  – faster than SGD
  – solves the step-size problem
  – beats all existing convergence rates
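A baseline streaming sketch: plain stochastic gradient descent on a squared-hinge-loss linear SVM, processing one sample at a time with O(d) memory. The step-size schedule and regularization constant are arbitrary choices; the cited stochastic Frank-Wolfe and noise-adaptive methods, which address exactly these step-size issues, are not reproduced here.

```python
# Plain SGD on a squared-hinge-loss linear SVM over a sample stream.
import numpy as np

rng = np.random.default_rng(0)
d, lam = 20, 1e-4
w_true = rng.normal(size=d)          # hidden separator generating the stream
w = np.zeros(d)

for t in range(1, 100_001):                                  # stream of 100k samples
    x = rng.normal(size=d)                                   # one sample arrives ...
    y = 1.0 if x @ w_true > 0 else -1.0
    margin = y * (x @ w)
    grad = lam * w - (2.0 * max(0.0, 1.0 - margin)) * y * x  # squared hinge + L2 gradient
    w -= (0.1 / np.sqrt(t)) * grad                           # decaying step size (arbitrary)
    # ... and is then discarded: memory stays O(d), independent of stream length

X_test = rng.normal(size=(5000, d))
print("sign agreement with the true separator:",
      np.mean(np.sign(X_test @ w_true) == np.sign(X_test @ w)))
```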
6. Parallelism
Fastest approach for (using many machines):
– KDE, GP, n-point: distributed trees [Lee and Gray, SDM 2012], 6000+ cores; [March et al., in prep for Gordon Bell Prize 2012], 100K cores?
  – Each process owns the global tree and its local tree
  – First log p levels built in parallel; each process determines where to send data
  – Asynchronous averaging; provable convergence (a simple averaging sketch follows)
– SVM, LASSO, et al.: distributed online optimization [Ouyang and Gray, in prep, on arXiv]
  – Provable theoretical speedup for the first time
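A toy illustration of the distributed-optimization flavor: each worker runs SGD on its own shard of the data, and the coordinator averages the resulting weight vectors (one-shot, synchronous averaging). This is far simpler than the cited asynchronous algorithms and carries no speedup guarantee; the shard count, step sizes, and data are arbitrary.

```python
# One-shot parameter averaging: 4 workers each fit a squared-hinge SVM by SGD
# on their shard; the coordinator averages the 4 weight vectors.
import numpy as np
from multiprocessing import Pool

def fit_shard(args):
    X, y, lam, epochs, seed = args
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for t, i in enumerate(rng.permutation(np.tile(np.arange(len(X)), epochs)), start=1):
        margin = y[i] * (X[i] @ w)
        grad = lam * w - (2.0 * max(0.0, 1.0 - margin)) * y[i] * X[i]
        w -= (0.1 / np.sqrt(t)) * grad
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(40_000, 20))
    y = np.where(X @ rng.normal(size=20) > 0, 1.0, -1.0)
    shards = np.array_split(np.arange(len(X)), 4)            # 4 workers, one shard each
    with Pool(4) as pool:
        ws = pool.map(fit_shard, [(X[s], y[s], 1e-4, 2, k) for k, s in enumerate(shards)])
    w_avg = np.mean(ws, axis=0)                              # average the 4 local models
    print("accuracy of the averaged model:", np.mean(np.sign(X @ w_avg) == y))
```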
7. Transformations between problems
Change the problem type:
– Linear algebra on kernel matrices → N-body inside conjugate gradient [Gray, TR 2004]
– Euclidean graphs → N-body problems [March & Gray, KDD 2010]
– HMM as graph → matrix factorization [Tran & Gray, in prep]
Optimizations: reformulate the objective and constraints:
– Maximum variance unfolding: SDP via Burer-Monteiro convex relaxation [Vasiloglou, Gray, Anderson, MLSP 2009]
– L_q SVM, 0<q<1: DC programming [Guan & Gray, CSDA 2011]
– L_0 SVM: mixed integer nonlinear program via perspective cuts [Guan & Gray, under review]
– Do reformulations automatically [Agarwal et al., PADL 2010], [Bhat et al., POPL 2012]
Create new ML methods with desired computational properties:
– Density estimation trees: nonparametric density estimation, O(N log N) [Ram & Gray, KDD 2011] (toy sketch below)
– Local linear SVMs: nonlinear classification, O(N log N) [Sastry & Gray, under review]
– Discriminative local coding: nonlinear classification, O(N log N) [Mehta & Gray, under review]
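A toy version of one of the listed new methods, a density estimation tree: partition space with a tree and set each leaf's density to (fraction of points) / (leaf volume), giving a piecewise-constant estimate that is cheap to build and query. Split selection here is a plain median split on the widest dimension, not the penalized procedure of [Ram & Gray, KDD 2011]; the leaf size and data are arbitrary.

```python
# Toy density estimation tree: leaf density = (fraction of points) / (leaf volume).
import numpy as np

def build_det(points, lo, hi, n_total, min_leaf=50):
    if len(points) <= min_leaf:
        vol = np.prod(hi - lo)
        return {"leaf": True, "density": len(points) / (n_total * vol)}
    d = int(np.argmax(hi - lo))                         # split the widest dimension
    split = float(np.median(points[:, d]))
    left_mask = points[:, d] <= split
    hi_l, lo_r = hi.copy(), lo.copy()
    hi_l[d] = lo_r[d] = split
    return {"leaf": False, "dim": d, "split": split,
            "left": build_det(points[left_mask], lo, hi_l, n_total, min_leaf),
            "right": build_det(points[~left_mask], lo_r, hi, n_total, min_leaf)}

def det_density(node, x):
    while not node["leaf"]:
        node = node["left"] if x[node["dim"]] <= node["split"] else node["right"]
    return node["density"]

X = np.random.rand(20000, 2) ** 2                       # skewed 2-D sample on [0,1]^2
tree = build_det(X, np.zeros(2), np.ones(2), len(X))
print(det_density(tree, np.array([0.05, 0.05])),        # dense corner
      det_density(tree, np.array([0.9, 0.9])))          # sparse corner
```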
Software
For academic use only: MLPACK
– Open source, C++, written by students
– Data must fit in RAM; distributed version in progress
For institutions: Skytree Server
– First commercial-grade high-performance machine learning server
– Fastest, biggest ML available: up to 10,000x faster than existing solutions (on one machine)
– V.12, April 2012-ish: distributed, streaming
– Connects to stats packages, Matlab, DBMSs, Python, etc.
– www.skytreecorp.com
– Colleagues: email me to try it out: agray@cc.gatech.edu