1
Fast Statistical Algorithms in Databases
Alexander Gray
Georgia Institute of Technology, College of Computing
FASTlab: Fundamental Algorithmic and Statistical Tools Laboratory
2
The FASTlab
Fundamental Algorithmic and Statistical Tools Laboratory
1. Arkadas Ozakin: Research scientist, PhD Theoretical Physics
2. Nikolaos Vasiloglou: Research scientist, PhD Elec Comp Engineering
3. Abhimanyu Aditya: MS, CS
4. Leif Poorman: MS, Physics
5. Dong Ryeol Lee: PhD student, CS + Math
6. Ryan Riegel: PhD student, CS + Math
7. Parikshit Ram: PhD student, CS + Math
8. William March: PhD student, Math + CS
9. James Waters: PhD student, Physics + CS
10. Hua Ouyang: PhD student, CS
11. Sooraj Bhat: PhD student, CS
12. Ravi Sastry: PhD student, CS
13. Long Tran: PhD student, CS
14. Wei Guan: PhD student, CS (co-supervised)
15. Nishant Mehta: PhD student, CS (co-supervised)
16. Wee Chin Wong: PhD student, ChemE (co-supervised)
17. Jaegul Choo: PhD student, CS (co-supervised)
18. Ailar Javadi: PhD student, ECE (co-supervised)
19. Shakir Sharfraz: MS student, CS
20. Subbanarasimhiah Harish: MS student, CS
3
Our challenge
- Allow statistical analyses (both simple and state-of-the-art) on terascale or petascale datasets
- Return answers with error guarantees
- Rigorous theoretical runtimes
- Ensure small constants (small actual runtimes, not just good asymptotic runtimes)
4
10 data analysis problems, and scalable tools we'd like for them
1. Querying (e.g. characterizing a region of space):
   – spherical range-search O(N)
   – orthogonal range-search O(N)
   – k-nearest-neighbors O(N)
   – all-k-nearest-neighbors O(N^2)
2. Density estimation (e.g. comparing galaxy types):
   – mixture of Gaussians
   – kernel density estimation O(N^2)
   – manifold kernel density estimation O(N^3) [Ozakin and Gray 2009, in submission]
   – hyper-kernel density estimation O(N^4) [Sastry and Gray 2009, in submission]
   – L2 density tree [Ram and Gray, in prep]
5
10 data analysis problems, and scalable tools we'd like for them
3. Regression (e.g. photometric redshifts):
   – linear regression O(ND^2)
   – kernel regression O(N^2)
   – Gaussian process regression/kriging O(N^3)
4. Classification (e.g. quasar detection, star-galaxy separation):
   – k-nearest-neighbor classifier O(N^2)
   – nonparametric Bayes classifier O(N^2)
   – support vector machine (SVM) O(N^3)
   – isometric separation map O(N^3) [Vasiloglou, Gray, and Anderson 2008]
   – non-negative SVM O(N^3) [Guan and Gray, in prep]
   – false-positive-limiting SVM O(N^3) [Sastry and Gray, in prep]
6
10 data analysis problems, and scalable tools we'd like for them
5. Dimension reduction (e.g. galaxy or spectra characterization):
   – principal component analysis O(D^3)
   – non-negative matrix factorization
   – kernel PCA (including LLE, IsoMap, et al.) O(N^3)
   – maximum variance unfolding O(N^3)
   – rank-based manifolds O(N^3) [Ouyang and Gray 2008]
   – isometric non-negative matrix factorization O(N^3) [Vasiloglou, Gray, and Anderson 2008]
   – density-preserving maps O(N^3) [Ozakin and Gray 2009, in submission]
6. Outlier detection (e.g. new object types, data cleaning):
   – by density estimation, by dimension reduction
   – by robust Lp estimation [Ram, Riegel, and Gray, in prep]
7
10 data analysis problems, and scalable tools we'd like for them
7. Clustering (e.g. automatic Hubble sequence):
   – by dimension reduction, by density estimation
   – k-means
   – mean-shift segmentation O(N^2)
   – hierarchical clustering ("friends-of-friends") O(N^3)
8. Time series analysis (e.g. asteroid tracking, variable objects):
   – Kalman filter O(D^2)
   – hidden Markov model O(D^2)
   – trajectory tracking O(N^n)
   – functional independent component analysis [Mehta and Gray 2008]
   – Markov matrix factorization [Tran, Wong, and Gray 2009, in submission]
8
10 data analysis problems, and scalable tools we'd like for them
9. Feature selection and causality (e.g. which features predict star/galaxy):
   – LASSO regression
   – L1 SVM
   – Gaussian graphical model inference and structure search
   – discrete graphical model inference and structure search
   – mixed-integer SVM [Guan and Gray 2009, in submission]
   – L1 Gaussian graphical model inference and structure search [Tran, Lee, Holmes, and Gray, in prep]
10. 2-sample testing and matching (e.g. cosmological validation, multiple surveys):
   – minimum spanning tree O(N^3)
   – n-point correlation O(N^n)
   – bipartite matching/Gaussian graphical model inference O(N^3) [Waters and Gray, in prep]
9
Core methods of statistics / machine learning / mining
- Querying: nearest-neighbor O(N), spherical range-search O(N), orthogonal range-search O(N), contingency table
- Density estimation: kernel density estimation O(N^2), mixture of Gaussians O(N)
- Regression: linear regression O(ND^2), kernel regression O(N^2), Gaussian process regression O(N^3)
- Classification: nearest-neighbor classifier O(N^2), nonparametric Bayes classifier O(N^2), support vector machine
- Dimension reduction: principal component analysis O(D^3), non-negative matrix factorization, kernel PCA O(N^3), maximum variance unfolding O(N^3)
- Outlier detection: by robust L2 estimation, by density estimation, by dimension reduction
- Clustering: k-means O(N), hierarchical clustering O(N^3), by dimension reduction
- Time series analysis: Kalman filter O(D^3), hidden Markov model, trajectory tracking
- Feature selection and causality: LASSO, L1 SVM, Gaussian graphical models, discrete graphical models
- 2-sample testing and matching: n-point correlation O(N^n), bipartite matching O(N^3)
10
4 main computational bottlenecks: generalized N-body, graphical models, linear algebra, optimization
- Querying: nearest-neighbor O(N), spherical range-search O(N), orthogonal range-search O(N), contingency table
- Density estimation: kernel density estimation O(N^2), mixture of Gaussians O(N)
- Regression: linear regression O(ND^2), kernel regression O(N^2), Gaussian process regression O(N^3)
- Classification: nearest-neighbor classifier O(N^2), nonparametric Bayes classifier O(N^2), support vector machine
- Dimension reduction: principal component analysis O(D^3), non-negative matrix factorization, kernel PCA O(N^3), maximum variance unfolding O(N^3)
- Outlier detection: by robust L2 estimation, by density estimation, by dimension reduction
- Clustering: k-means O(N), hierarchical clustering O(N^3), by dimension reduction
- Time series analysis: Kalman filter O(D^3), hidden Markov model, trajectory tracking
- Feature selection and causality: LASSO, L1 SVM, Gaussian graphical models, discrete graphical models
- 2-sample testing: n-point correlation O(N^n), bipartite matching O(N^3)
11
GNPs: Data structure needed
kd-trees: the most widely-used space-partitioning tree for general dimension
[Bentley 1975], [Friedman, Bentley & Finkel 1977], [Moore & Lee 1995]
(others: ball-trees, cover-trees, etc.)
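For readers unfamiliar with the data structure, here is a minimal kd-tree construction sketch in C++ (median split on the widest dimension). It is an illustration only, not MLPACK's implementation; the names KdNode and Build are introduced here.

```cpp
// Minimal kd-tree build sketch: split on the widest dimension at the median,
// stop when a node owns at most leaf_size points. Points live in one array
// and each node owns a contiguous [begin, end) range of it.
#include <algorithm>
#include <cstddef>
#include <memory>
#include <vector>

struct KdNode {
  std::size_t begin, end;          // range of points owned by this node
  int split_dim = -1;              // splitting dimension (-1 at leaves)
  double split_val = 0.0;          // splitting threshold
  std::unique_ptr<KdNode> left, right;
};

std::unique_ptr<KdNode> Build(std::vector<std::vector<double>>& pts,
                              std::size_t begin, std::size_t end,
                              std::size_t leaf_size) {
  auto node = std::make_unique<KdNode>();
  node->begin = begin;
  node->end = end;
  if (end - begin <= leaf_size) return node;          // small enough: leaf

  // Pick the dimension with the largest spread over this range.
  const std::size_t dim = pts[begin].size();
  int best_d = 0;
  double best_spread = -1.0;
  for (std::size_t d = 0; d < dim; ++d) {
    double lo = pts[begin][d], hi = pts[begin][d];
    for (std::size_t i = begin; i < end; ++i) {
      lo = std::min(lo, pts[i][d]);
      hi = std::max(hi, pts[i][d]);
    }
    if (hi - lo > best_spread) { best_spread = hi - lo; best_d = (int)d; }
  }

  // Partition the range around the median along that dimension.
  const std::size_t mid = begin + (end - begin) / 2;
  std::nth_element(pts.begin() + begin, pts.begin() + mid, pts.begin() + end,
                   [best_d](const std::vector<double>& a,
                            const std::vector<double>& b) {
                     return a[best_d] < b[best_d];
                   });
  node->split_dim = best_d;
  node->split_val = pts[mid][best_d];
  node->left = Build(pts, begin, mid, leaf_size);
  node->right = Build(pts, mid, end, leaf_size);
  return node;
}
```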
12
What algorithmic methods do we use?
- Generalized N-body algorithms (multiple trees) for distance/similarity-based computations [2000, 2003, 2009]
- Hierarchical series expansions for kernel summations [2004, 2006, 2008]
- Multi-scale Monte Carlo for linear algebra and summations [2007, 2008]
- Parallel computing [1998, 2006, 2009]
13
Computational complexity using fast algorithms
- Querying: nearest-neighbor O(log N), spherical range-search O(log N), orthogonal range-search O(log N), contingency table
- Density estimation: kernel density estimation O(N) or O(1), mixture of Gaussians O(log N)
- Regression: linear regression O(D) or O(1), kernel regression O(N) or O(1), Gaussian process regression O(N) or O(1)
- Classification: nearest-neighbor classifier O(N), nonparametric Bayes classifier O(N), support vector machine O(N)
- Dimension reduction: principal component analysis O(D) or O(1), non-negative matrix factorization, kernel PCA O(N) or O(1), maximum variance unfolding O(N)
- Outlier detection: by robust L2 estimation, by density estimation, by dimension reduction
- Clustering: k-means O(log N), hierarchical clustering O(N log N), by dimension reduction
- Time series analysis: Kalman filter O(D) or O(1), hidden Markov model, trajectory tracking
- Feature selection and causality: LASSO, L1 SVM, Gaussian graphical models, discrete graphical models
- 2-sample testing and matching: n-point correlation O(N^(log n)), bipartite matching O(N) or O(1)
15
All-k-nearest neighbors
- Naïve cost: O(N^2)
- Fastest algorithm: generalized fast multipole method, aka multi-tree methods
  – Technique: use two trees simultaneously
  – [Gray and Moore 2001, NIPS; Ram, Lee, March, and Gray 2009, submitted; Riegel, Boyer, and Gray, to be submitted]
  – O(N), exact
- Applications: querying, adaptive density estimation first step [Budavari et al., in prep]
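The pruning logic behind the dual-tree approach can be sketched roughly as follows. This is a simplified illustration assuming bounding-box tree nodes; the names (Node, MinNodeDist, DualTreeNN) are made up here, and a real implementation also orders the recursion by node distance rather than visiting children blindly.

```cpp
// Dual-tree all-nearest-neighbor sketch: descend a query tree and a reference
// tree together, and prune a (query node, reference node) pair whenever the
// reference node cannot possibly improve any query point's current neighbor.
#include <algorithm>
#include <cmath>
#include <limits>
#include <vector>

struct Node {
  std::vector<int> points;                  // point indices (leaves only here)
  std::vector<double> lo, hi;               // bounding box
  Node *left = nullptr, *right = nullptr;
  double bound = std::numeric_limits<double>::infinity();  // max NN distance below
  bool IsLeaf() const { return left == nullptr; }
};

// Smallest possible distance between the two bounding boxes.
double MinNodeDist(const Node& a, const Node& b) {
  double s = 0.0;
  for (std::size_t d = 0; d < a.lo.size(); ++d) {
    const double gap = std::max({0.0, a.lo[d] - b.hi[d], b.lo[d] - a.hi[d]});
    s += gap * gap;
  }
  return std::sqrt(s);
}

void DualTreeNN(Node& q, Node& r, const std::vector<std::vector<double>>& data,
                std::vector<double>& nn_dist, std::vector<int>& nn_idx) {
  if (MinNodeDist(q, r) > q.bound) return;  // prune: r cannot help anyone in q

  if (q.IsLeaf() && r.IsLeaf()) {
    for (int i : q.points) {
      for (int j : r.points) {
        if (i == j) continue;
        double d2 = 0.0;
        for (std::size_t k = 0; k < data[i].size(); ++k) {
          const double diff = data[i][k] - data[j][k];
          d2 += diff * diff;
        }
        const double d = std::sqrt(d2);
        if (d < nn_dist[i]) { nn_dist[i] = d; nn_idx[i] = j; }
      }
    }
    q.bound = 0.0;                          // tighten the query node's bound
    for (int i : q.points) q.bound = std::max(q.bound, nn_dist[i]);
    return;
  }

  // Recurse on children (a leaf node stands in for both of its "children").
  Node* qs[2] = { q.IsLeaf() ? &q : q.left, q.IsLeaf() ? nullptr : q.right };
  Node* rs[2] = { r.IsLeaf() ? &r : r.left, r.IsLeaf() ? nullptr : r.right };
  for (Node* qc : qs)
    for (Node* rc : rs)
      if (qc && rc) DualTreeNN(*qc, *rc, data, nn_dist, nn_idx);
  if (!q.IsLeaf()) q.bound = std::max(q.left->bound, q.right->bound);
}
```

Before the first call, nn_dist would be initialized to infinity and nn_idx to -1; calling DualTreeNN(root, root, ...) then fills in each point's nearest neighbor.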
16
Kernel density estimation
- Naïve cost: O(N^2)
- Fastest algorithm: dual-tree fast Gauss transform, Monte Carlo multipole method
  – Technique: Hermite operators, Monte Carlo
  – [Lee, Gray, and Moore 2005, NIPS; Lee and Gray 2006, UAI; Holmes, Gray, and Isbell 2007, NIPS; Lee and Gray 2008, NIPS]
  – O(N), relative error guarantee
- Applications: accurate density estimates, e.g. galaxy types [Balogh et al. 2004]
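A hedged sketch of the core pruning idea only (not the dual-tree fast Gauss transform itself, which also uses Hermite expansions): when the Gaussian kernel varies little across a reference node, the whole node's contribution is absorbed in one step instead of point by point. The function and parameter names below are illustrative, not taken from the papers.

```cpp
// Decide whether a (query node, reference node) pair can be approximated.
// dmin/dmax are the min/max possible distances between the two nodes' bounding
// boxes; density_lower_bound is a running lower bound on the query's density.
#include <cmath>
#include <cstddef>

bool CanApproximate(double dmin, double dmax, double bandwidth,
                    std::size_t ref_count, std::size_t total_ref_count,
                    double rel_tol, double density_lower_bound,
                    double* contribution) {
  const double k_hi = std::exp(-dmin * dmin / (2.0 * bandwidth * bandwidth));
  const double k_lo = std::exp(-dmax * dmax / (2.0 * bandwidth * bandwidth));
  // Worst-case error of crediting every reference point with the midpoint value.
  const double err = 0.5 * (k_hi - k_lo) * ref_count;
  // Allocate the global relative-error budget in proportion to this node's
  // share of the reference points; summed over all approximated nodes, the
  // total error then stays below rel_tol times the true (unnormalized) sum.
  if (err <= rel_tol * density_lower_bound * ref_count / total_ref_count) {
    *contribution = 0.5 * (k_hi + k_lo) * ref_count;   // one update for all of R
    return true;
  }
  return false;   // too coarse: the caller recurses on the node pair's children
}
```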
17
Nonparametric Bayes classification
- Naïve cost: O(N^2)
- Fastest algorithm: dual-tree bounding with hybrid tree expansion
  – Technique: bound each posterior
  – [Liu, Moore, and Gray 2004; Gray and Riegel 2004, CompStat; Riegel and Gray 2007, SDM]
  – O(N), exact
- Applications: accurate classification, e.g. quasar detection [Richards et al. 2004, 2009], [Scranton et al. 2005]
18
Principal component analysis
- Naïve cost: O(D^3)
- Fastest algorithm: QUIC-SVD
  – Technique: tree-based Monte Carlo using a cosine tree
  – [Holmes, Gray, and Isbell 2009]
  – O(1), relative error guarantee
- Applications: spectral decomposition
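QUIC-SVD samples adaptively using a cosine tree; as a rough illustration of the Monte Carlo idea only, here is a much simpler fixed-size row-sampling sketch written with Armadillo (the linear-algebra library MLPACK builds on). This is not the QUIC-SVD algorithm, and SampledBasis is a name invented here.

```cpp
// Monte Carlo low-rank sketch: sample rows with probability proportional to
// their squared norm, rescale, and take the SVD of the small sample. The right
// singular vectors approximate the leading principal directions of X.
#include <armadillo>
#include <cmath>
#include <cstddef>
#include <random>

arma::mat SampledBasis(const arma::mat& X, std::size_t num_samples) {
  // Row-sampling probabilities proportional to squared row norms.
  arma::vec probs(X.n_rows);
  for (arma::uword i = 0; i < X.n_rows; ++i)
    probs(i) = arma::dot(X.row(i), X.row(i));
  probs /= arma::accu(probs);

  std::mt19937 rng(42);
  std::discrete_distribution<arma::uword> pick(probs.begin(), probs.end());

  // Build the rescaled sample matrix S (num_samples x D).
  arma::mat S(num_samples, X.n_cols);
  for (std::size_t t = 0; t < num_samples; ++t) {
    const arma::uword i = pick(rng);
    S.row(t) = X.row(i) / std::sqrt(num_samples * probs(i));
  }

  // SVD of the small sample instead of the full matrix.
  arma::mat U, V;
  arma::vec s;
  arma::svd_econ(U, s, V, S);
  return V;   // columns: approximate principal directions of X
}
```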
19
Friends-of-friends
- Naïve cost: O(N^3)
- Fastest algorithm: dual-tree Boruvka algorithm
  – Technique: maintain implicit friend sets
  – [March and Gray, to be submitted]
  – O(N log N), exact
- Applications: cluster-finding, 2-sample
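For orientation, here is the classical Boruvka algorithm over an explicit edge list; the dual-tree version referenced on this slide instead finds each component's nearest outside point directly by tree search, so the quadratic edge list never has to be materialized. Illustration only; the names here are not from the paper.

```cpp
// Classical Boruvka: each round, every component picks its cheapest outgoing
// edge; all such edges are added and the components are merged (union-find).
#include <numeric>
#include <vector>

struct Edge { int u, v; double w; };

int Find(std::vector<int>& parent, int x) {
  while (parent[x] != x) { parent[x] = parent[parent[x]]; x = parent[x]; }
  return x;
}

std::vector<Edge> BoruvkaMST(int n, const std::vector<Edge>& edges) {
  std::vector<int> parent(n);
  std::iota(parent.begin(), parent.end(), 0);
  std::vector<Edge> mst;

  bool merged = true;
  while (merged && (int)mst.size() + 1 < n) {
    merged = false;
    std::vector<int> best(n, -1);            // cheapest outgoing edge per component
    for (int e = 0; e < (int)edges.size(); ++e) {
      const int cu = Find(parent, edges[e].u), cv = Find(parent, edges[e].v);
      if (cu == cv) continue;                // internal edge: ignore
      if (best[cu] == -1 || edges[e].w < edges[best[cu]].w) best[cu] = e;
      if (best[cv] == -1 || edges[e].w < edges[best[cv]].w) best[cv] = e;
    }
    for (int c = 0; c < n; ++c) {
      if (best[c] == -1) continue;
      const Edge& e = edges[best[c]];
      const int cu = Find(parent, e.u), cv = Find(parent, e.v);
      if (cu == cv) continue;                // both ends already merged this round
      parent[cu] = cv;
      mst.push_back(e);
      merged = true;
    }
  }
  return mst;
}
```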
20
n-point correlations
- Naïve cost: O(N^n)
- Fastest algorithm: n-tree algorithms
  – Technique: use n trees simultaneously
  – [Gray and Moore 2001, NIPS; Moore et al. 2001, Mining the Sky; Waters, Riegel, and Gray 2009, to be submitted]
  – O(N^(log n)), exact
- Applications: 2-sample, e.g. AGN [Wake et al. 2004], ISW [Giannantonio et al. 2006], 3pt validation [Kulkarni et al. 2007]
21
3-point runtime (biggest previous: 20K)
- VIRGO simulation data, N = 75,000,000
- Naïve: 5x10^9 sec (~150 years)
- Multi-tree: 55 sec (exact)
- n=2: O(N); n=3: O(N^(log 3)); n=4: O(N^2)
22
Software
- MLPACK (C++)
  – First scalable comprehensive ML library
- MLPACK Pro
  – Very-large-scale data (parallel)
23
Software
- MLPACK (C++)
  – First scalable comprehensive ML library
- MLPACK Pro
  – Very-large-scale data (parallel)
- MLPACK-db
  – Fast data analytics in relational databases (SQL Server; on-disk)
  – Thanks to Tamas Budavari
24
Fast algorithms in SQL Server: Challenges
- Don't copy the data: in-RAM kd-trees must become disk-based data structures
- Commercial database: no access to the guts (e.g. caching, memory)
- Already optimized for different operations (SQL queries) and data structures (B+ trees)
- Big learning curve
25
Building a kd-tree in SQL Server
- Piggyback onto the B+ tree and disk-layout layers: use SQL queries
- Hybrid approach: upper tree built using a Morton z-curve (in SQL), lower tree built in RAM; 3 passes total
- Clustered indices for fast sequential access of nodes and data
- Algorithms abstracted in managed C#
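A rough C++ sketch of the Morton (z-order) key idea used for the upper tree levels; the actual system computes such keys inside SQL Server / managed C#, so the function below is purely illustrative. Sorting rows by a key like this groups spatially nearby points together, so cutting the sorted order into runs yields upper-level node assignments whose subtrees can then be built in RAM.

```cpp
// Morton (z-order) key: interleave the bits of the quantized coordinates,
// most significant bit first. Assumes bits * coords.size() <= 64.
#include <cstdint>
#include <vector>

uint64_t MortonKey(const std::vector<uint32_t>& coords, int bits) {
  uint64_t key = 0;
  for (int b = bits - 1; b >= 0; --b)      // from the most significant bit down
    for (uint32_t c : coords)
      key = (key << 1) | ((c >> b) & 1u);
  return key;
}
```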
26
Current state
- So far: All-NN, KDE*, NBC*, 2pt*
- Performance for all-nearest-neighbors on 40M, 4-D SDSS color data:
  – Naïve SQL Server: 50 yrs
  – Batched SQL Server: 3 yrs
  – Single-tree MLPACK-db: 14.5 hrs
  – Dual-tree MLPACK-db: 227 min
  – TPIE: 23 min; MLPACK-Pro: 10 min
- We believe there is a 10x factor of fat
  – e.g. this uses only 100 MB of 2 GB RAM
27
The end
- You can now use many of the state-of-the-art data analysis methods on large survey datasets
- We need more funding
- Talk to me… I have a lot of problems (ahem) but always want more!
Alexander Gray
agray@cc.gatech.edu
www.fast-lab.org
28
Ex: support vector machine
- Data: IJCNN1 [DP01a]
  – 2 classes, 49,990 training points, 91,701 testing points, 22 features
- SMO: 12,831 SVs, 84,360 iterations, 98.3% accuracy, 765 sec; O(N^2)-O(N^3)
- SFW: 4,145 SVs, 4,500 iterations, 98.1% accuracy, 21 sec; O(N/e + 1/e^2)
- [Ouyang and Gray 2009, in submission]
29
A few new ones…
- Non-vector data (like time series, images, graphs, trees):
  – Functional ICA (Mehta and Gray 2008)
  – Generalized mean map kernel (Mehta and Gray, in submission)
- Visualizing high-dimensional data:
  – Rank-based manifolds (Ouyang and Gray 2008)
  – Density-preserving maps (Ozakin and Gray, in submission)
  – Intrinsic dimension estimation (Ozakin and Gray, in prep)
- Accounting for measurement errors
- Probabilistic cross-match
30
2-point correlation
- Characterization of an entire distribution?
- “How many pairs have distance < r?”
- 2-point correlation function
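For concreteness, the naive O(N^2) pair count that the fast algorithms accelerate looks like this (illustrative C++):

```cpp
// Count the pairs of points whose Euclidean distance is less than r.
#include <cstddef>
#include <vector>

std::size_t TwoPointCount(const std::vector<std::vector<double>>& pts, double r) {
  std::size_t count = 0;
  for (std::size_t i = 0; i < pts.size(); ++i) {
    for (std::size_t j = i + 1; j < pts.size(); ++j) {
      double d2 = 0.0;
      for (std::size_t k = 0; k < pts[i].size(); ++k) {
        const double diff = pts[i][k] - pts[j][k];
        d2 += diff * diff;
      }
      if (d2 < r * r) ++count;               // compare squared distances
    }
  }
  return count;
}
```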
31
The n-point correlation functions
- Spatial inferences: filaments, clusters, voids, homogeneity, isotropy, 2-sample testing, …
- Foundation for the theory of point processes [Daley, Vere-Jones 1972]; unifies spatial statistics [Ripley 1976]
- Used heavily in biostatistics, cosmology, particle physics, statistical physics
- 2pcf definition and 3pcf definition (see below)
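For reference, the standard (Peebles-style) definitions of the 2-point and 3-point correlation functions, which may differ in notation from the forms shown on the original slide:

```latex
% Joint probability of finding points in volume elements dV_1, dV_2 (, dV_3),
% with \bar{n} the mean number density, \xi the 2pcf and \zeta the 3pcf.
\[
  dP_{12} = \bar{n}^{2}\,\bigl[1 + \xi(r_{12})\bigr]\, dV_1\, dV_2
\]
\[
  dP_{123} = \bar{n}^{3}\,\bigl[1 + \xi(r_{12}) + \xi(r_{23}) + \xi(r_{13})
             + \zeta(r_{12}, r_{23}, r_{13})\bigr]\, dV_1\, dV_2\, dV_3
\]
```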
32
3-point correlation
- “How many triples have pairwise distances < r?”
- (Triangle with side lengths r1, r2, r3)
- Standard model: n>0 terms should be zero!
33
How can we count n-tuples efficiently?
“How many triples have pairwise distances < r?”
34
Use n trees! [Gray & Moore, NIPS 2000]
35
“How many valid triangles a-b-c (where all pairwise distances are < r) could there be?”
count{A,B,C} = ?
36
“How many valid triangles a-b-c (where all pairwise distances are < r) could there be?”
count{A,B,C} = count{A,B,C.left} + count{A,B,C.right}
37
“How many valid triangles a-b-c (where all pairwise distances are < r) could there be?”
count{A,B,C} = count{A,B,C.left} + count{A,B,C.right}
38
“How many valid triangles a-b-c (where all pairwise distances are < r) could there be?”
count{A,B,C} = ?
39
“How many valid triangles a-b-c (where all pairwise distances are < r) could there be?”
Exclusion: count{A,B,C} = 0!
40
“How many valid triangles a-b-c (where all pairwise distances are < r) could there be?”
count{A,B,C} = ?
41
“How many valid triangles a-b-c (where all pairwise distances are < r) could there be?”
Inclusion: count{A,B,C} = |A| x |B| x |C|
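Putting the exclusion and inclusion rules together, here is a minimal C++ sketch of the 3-tree count. It counts ordered triples drawn from three point sets; the production code also handles the symmetric case where A, B, and C are the same set, and all names here are illustrative rather than taken from the papers.

```cpp
// 3-tree counting recursion: exclusion when any node pair is already farther
// apart than r, inclusion when every node pair is entirely within r, otherwise
// split the largest non-leaf node and recurse on its two halves.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

struct Node {
  std::vector<double> lo, hi;                  // bounding box
  std::vector<std::vector<double>> points;     // stored at leaves only
  std::uint64_t count = 0;                     // number of points in the subtree
  Node *left = nullptr, *right = nullptr;
  bool IsLeaf() const { return left == nullptr; }
};

double PointDist(const std::vector<double>& a, const std::vector<double>& b) {
  double s = 0.0;
  for (std::size_t d = 0; d < a.size(); ++d) { const double g = a[d] - b[d]; s += g * g; }
  return std::sqrt(s);
}

// Smallest / largest possible distances between two bounding boxes.
double MinDist(const Node& a, const Node& b) {
  double s = 0.0;
  for (std::size_t d = 0; d < a.lo.size(); ++d) {
    const double gap = std::max({0.0, a.lo[d] - b.hi[d], b.lo[d] - a.hi[d]});
    s += gap * gap;
  }
  return std::sqrt(s);
}
double MaxDist(const Node& a, const Node& b) {
  double s = 0.0;
  for (std::size_t d = 0; d < a.lo.size(); ++d) {
    const double gap = std::max(a.hi[d] - b.lo[d], b.hi[d] - a.lo[d]);
    s += gap * gap;
  }
  return std::sqrt(s);
}

std::uint64_t BruteForce(const Node& A, const Node& B, const Node& C, double r) {
  std::uint64_t n = 0;
  for (const auto& a : A.points)
    for (const auto& b : B.points)
      for (const auto& c : C.points)
        if (PointDist(a, b) < r && PointDist(b, c) < r && PointDist(a, c) < r) ++n;
  return n;
}

// Count triples (a in A, b in B, c in C) with all pairwise distances < r.
std::uint64_t CountTriples(const Node& A, const Node& B, const Node& C, double r) {
  const Node* nodes[3] = { &A, &B, &C };
  bool all_inside = true;
  for (int i = 0; i < 3; ++i)
    for (int j = i + 1; j < 3; ++j) {
      if (MinDist(*nodes[i], *nodes[j]) >= r) return 0;               // exclusion
      if (MaxDist(*nodes[i], *nodes[j]) >= r) all_inside = false;
    }
  if (all_inside) return A.count * B.count * C.count;                 // inclusion

  if (A.IsLeaf() && B.IsLeaf() && C.IsLeaf())
    return BruteForce(A, B, C, r);                                    // test each triple

  // Split the largest non-leaf node and recurse on both of its halves.
  const Node* split = nullptr;
  for (const Node* n : nodes)
    if (!n->IsLeaf() && (split == nullptr || n->count > split->count)) split = n;
  if (split == &A)
    return CountTriples(*A.left, B, C, r) + CountTriples(*A.right, B, C, r);
  if (split == &B)
    return CountTriples(A, *B.left, C, r) + CountTriples(A, *B.right, C, r);
  return CountTriples(A, B, *C.left, r) + CountTriples(A, B, *C.right, r);
}
```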