
1 Fast Statistical Algorithms in Databases
Alexander Gray, Georgia Institute of Technology, College of Computing
FASTlab: Fundamental Algorithmic and Statistical Tools Laboratory

2 The FASTlab: Fundamental Algorithmic and Statistical Tools Laboratory
1. Arkadas Ozakin: Research scientist, PhD Theoretical Physics
2. Nikolaos Vasiloglou: Research scientist, PhD Elec Comp Engineering
3. Abhimanyu Aditya: MS, CS
4. Leif Poorman: MS, Physics
5. Dong Ryeol Lee: PhD student, CS + Math
6. Ryan Riegel: PhD student, CS + Math
7. Parikshit Ram: PhD student, CS + Math
8. William March: PhD student, Math + CS
9. James Waters: PhD student, Physics + CS
10. Hua Ouyang: PhD student, CS
11. Sooraj Bhat: PhD student, CS
12. Ravi Sastry: PhD student, CS
13. Long Tran: PhD student, CS
14. Wei Guan: PhD student, CS (co-supervised)
15. Nishant Mehta: PhD student, CS (co-supervised)
16. Wee Chin Wong: PhD student, ChemE (co-supervised)
17. Jaegul Choo: PhD student, CS (co-supervised)
18. Ailar Javadi: PhD student, ECE (co-supervised)
19. Shakir Sharfraz: MS student, CS
20. Subbanarasimhiah Harish: MS student, CS

3 Our challenge
Allow statistical analyses (both simple and state-of-the-art) on terascale or petascale datasets
Return answers with error guarantees
Rigorous theoretical runtimes
Ensure small constants (small actual runtimes, not just good asymptotic runtimes)

4 10 data analysis problems, and scalable tools we’d like for them
1. Querying (e.g. characterizing a region of space):
 – spherical range-search O(N)
 – orthogonal range-search O(N)
 – k-nearest-neighbors O(N)
 – all-k-nearest-neighbors O(N^2)
2. Density estimation (e.g. comparing galaxy types):
 – mixture of Gaussians
 – kernel density estimation O(N^2)
 – manifold kernel density estimation O(N^3) [Ozakin and Gray 2009, in submission]
 – hyper-kernel density estimation O(N^4) [Sastry and Gray 2009, in submission]
 – L2 density tree [Ram and Gray, in prep]

5 10 data analysis problems, and scalable tools we’d like for them
3. Regression (e.g. photometric redshifts):
 – linear regression O(ND^2)
 – kernel regression O(N^2)
 – Gaussian process regression/kriging O(N^3)
4. Classification (e.g. quasar detection, star-galaxy separation):
 – k-nearest-neighbor classifier O(N^2)
 – nonparametric Bayes classifier O(N^2)
 – support vector machine (SVM) O(N^3)
 – isometric separation map O(N^3) [Vasiloglou, Gray, and Anderson 2008]
 – non-negative SVM O(N^3) [Guan and Gray, in prep]
 – false-positive-limiting SVM O(N^3) [Sastry and Gray, in prep]

6 10 data analysis problems, and scalable tools we’d like for them
5. Dimension reduction (e.g. galaxy or spectra characterization):
 – principal component analysis O(D^3)
 – non-negative matrix factorization
 – kernel PCA (including LLE, IsoMap, et al.) O(N^3)
 – maximum variance unfolding O(N^3)
 – rank-based manifolds O(N^3) [Ouyang and Gray 2008]
 – isometric non-negative matrix factorization O(N^3) [Vasiloglou, Gray, and Anderson 2008]
 – density-preserving maps O(N^3) [Ozakin and Gray 2009, in submission]
6. Outlier detection (e.g. new object types, data cleaning):
 – by density estimation, by dimension reduction
 – by robust Lp estimation [Ram, Riegel and Gray, in prep]

7 10 data analysis problems, and scalable tools we’d like for them
7. Clustering (e.g. automatic Hubble sequence):
 – by dimension reduction, by density estimation
 – k-means
 – mean-shift segmentation O(N^2)
 – hierarchical clustering (“friends-of-friends”) O(N^3)
8. Time series analysis (e.g. asteroid tracking, variable objects):
 – Kalman filter O(D^2)
 – hidden Markov model O(D^2)
 – trajectory tracking O(N^n)
 – functional independent component analysis [Mehta and Gray 2008]
 – Markov matrix factorization [Tran, Wong, and Gray 2009, in submission]

8 10 data analysis problems, and scalable tools we’d like for them
9. Feature selection and causality (e.g. which features predict star/galaxy):
 – LASSO regression
 – L1 SVM
 – Gaussian graphical model inference and structure search
 – discrete graphical model inference and structure search
 – mixed-integer SVM [Guan and Gray 2009, in submission]
 – L1 Gaussian graphical model inference and structure search [Tran, Lee, Holmes, and Gray, in prep]
10. 2-sample testing and matching (e.g. cosmological validation, multiple surveys):
 – minimum spanning tree O(N^3)
 – n-point correlation O(N^n)
 – bipartite matching/Gaussian graphical model inference O(N^3) [Waters and Gray, in prep]

9 Core methods of statistics / machine learning / mining
Querying: nearest-neighbor O(N), spherical range-search O(N), orthogonal range-search O(N), contingency table
Density estimation: kernel density estimation O(N^2), mixture of Gaussians O(N)
Regression: linear regression O(ND^2), kernel regression O(N^2), Gaussian process regression O(N^3)
Classification: nearest-neighbor classifier O(N^2), nonparametric Bayes classifier O(N^2), support vector machine
Dimension reduction: principal component analysis O(D^3), non-negative matrix factorization, kernel PCA O(N^3), maximum variance unfolding O(N^3)
Outlier detection: by robust L2 estimation, by density estimation, by dimension reduction
Clustering: k-means O(N), hierarchical clustering O(N^3), by dimension reduction
Time series analysis: Kalman filter O(D^3), hidden Markov model, trajectory tracking
Feature selection and causality: LASSO, L1 SVM, Gaussian graphical models, discrete graphical models
2-sample testing and matching: n-point correlation O(N^n), bipartite matching O(N^3)

10 4 main computational bottlenecks: generalized N-body, graphical models, linear algebra, optimization
Querying: nearest-neighbor O(N), spherical range-search O(N), orthogonal range-search O(N), contingency table
Density estimation: kernel density estimation O(N^2), mixture of Gaussians O(N)
Regression: linear regression O(ND^2), kernel regression O(N^2), Gaussian process regression O(N^3)
Classification: nearest-neighbor classifier O(N^2), nonparametric Bayes classifier O(N^2), support vector machine
Dimension reduction: principal component analysis O(D^3), non-negative matrix factorization, kernel PCA O(N^3), maximum variance unfolding O(N^3)
Outlier detection: by robust L2 estimation, by density estimation, by dimension reduction
Clustering: k-means O(N), hierarchical clustering O(N^3), by dimension reduction
Time series analysis: Kalman filter O(D^3), hidden Markov model, trajectory tracking
Feature selection and causality: LASSO, L1 SVM, Gaussian graphical models, discrete graphical models
2-sample testing: n-point correlation O(N^n), bipartite matching O(N^3)

11 The data structure needed for GNPs (generalized N-body problems): kd-trees, the most widely used space-partitioning tree for general dimension [Bentley 1975], [Friedman, Bentley & Finkel 1977], [Moore & Lee 1995] (others: ball-trees, cover-trees, etc.)
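
As a reference point for the slides that follow, here is a minimal C++ sketch of a median-split kd-tree build in the style of [Friedman, Bentley & Finkel 1977]. The KdNode/build names and the std::vector point representation are illustrative assumptions, not MLPACK's actual API.

```cpp
#include <algorithm>
#include <cstddef>
#include <memory>
#include <vector>

// One kd-tree node: a bounding hyper-rectangle plus either two children or a
// leaf range of point indices.
struct KdNode {
    std::vector<double> lo, hi;            // bounding box, per dimension
    size_t begin = 0, end = 0;             // points[begin, end) owned by this node
    std::unique_ptr<KdNode> left, right;
    bool is_leaf() const { return !left; }
};

// Recursively build a kd-tree over pts[begin, end): compute the bounding box,
// then split the widest dimension at the median point (via nth_element).
std::unique_ptr<KdNode> build(std::vector<std::vector<double>>& pts,
                              size_t begin, size_t end, size_t leaf_size = 20) {
    auto node = std::make_unique<KdNode>();
    node->begin = begin;
    node->end = end;
    const size_t D = pts[begin].size();
    node->lo.assign(D, 1e300);
    node->hi.assign(D, -1e300);
    for (size_t i = begin; i < end; ++i)
        for (size_t d = 0; d < D; ++d) {
            node->lo[d] = std::min(node->lo[d], pts[i][d]);
            node->hi[d] = std::max(node->hi[d], pts[i][d]);
        }
    if (end - begin <= leaf_size) return node;       // small enough: stop as a leaf
    size_t dim = 0;                                   // widest dimension
    for (size_t d = 1; d < D; ++d)
        if (node->hi[d] - node->lo[d] > node->hi[dim] - node->lo[dim]) dim = d;
    size_t mid = begin + (end - begin) / 2;           // median split
    std::nth_element(pts.begin() + begin, pts.begin() + mid, pts.begin() + end,
                     [dim](const std::vector<double>& a, const std::vector<double>& b) {
                         return a[dim] < b[dim];
                     });
    node->left = build(pts, begin, mid, leaf_size);
    node->right = build(pts, mid, end, leaf_size);
    return node;
}
```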

12 What algorithmic methods do we use?
Generalized N-body algorithms (multiple trees) for distance/similarity-based computations [2000, 2003, 2009]
Hierarchical series expansions for kernel summations [2004, 2006, 2008]
Multi-scale Monte Carlo for linear algebra and summations [2007, 2008]
Parallel computing [1998, 2006, 2009]

13 Computational complexity using fast algorithms
Querying: nearest-neighbor O(log N), spherical range-search O(log N), orthogonal range-search O(log N), contingency table
Density estimation: kernel density estimation O(N) or O(1), mixture of Gaussians O(log N)
Regression: linear regression O(D) or O(1), kernel regression O(N) or O(1), Gaussian process regression O(N) or O(1)
Classification: nearest-neighbor classifier O(N), nonparametric Bayes classifier O(N), support vector machine O(N)
Dimension reduction: principal component analysis O(D) or O(1), non-negative matrix factorization, kernel PCA O(N) or O(1), maximum variance unfolding O(N)
Outlier detection: by robust L2 estimation, by density estimation, by dimension reduction
Clustering: k-means O(log N), hierarchical clustering O(N log N), by dimension reduction
Time series analysis: Kalman filter O(D) or O(1), hidden Markov model, trajectory tracking
Feature selection and causality: LASSO, L1 SVM, Gaussian graphical models, discrete graphical models
2-sample testing and matching: n-point correlation O(N^log n), bipartite matching O(N) or O(1)

14 Computational complexity using fast algorithms
Querying: nearest-neighbor O(log N), spherical range-search O(log N), orthogonal range-search O(log N), contingency table
Density estimation: kernel density estimation O(N) or O(1), mixture of Gaussians O(log N)
Regression: linear regression O(D) or O(1), kernel regression O(N) or O(1), Gaussian process regression O(N) or O(1)
Classification: nearest-neighbor classifier O(N), nonparametric Bayes classifier O(N), support vector machine O(N)
Dimension reduction: principal component analysis O(D) or O(1), non-negative matrix factorization, kernel PCA O(N) or O(1), maximum variance unfolding O(N)
Outlier detection: by robust L2 estimation, by density estimation, by dimension reduction
Clustering: k-means O(log N), hierarchical clustering O(N log N), by dimension reduction
Time series analysis: Kalman filter O(D) or O(1), hidden Markov model, trajectory tracking
Feature selection and causality: LASSO, L1 SVM, Gaussian graphical models, discrete graphical models
2-sample testing and matching: n-point correlation O(N^log n), bipartite matching O(N) or O(1)

15 All-k-nearest-neighbors
Naïve cost: O(N^2)
Fastest algorithm: generalized fast multipole method, aka multi-tree methods
 – Technique: use two trees simultaneously
 – [Gray and Moore 2001, NIPS; Ram, Lee, March, and Gray 2009, submitted; Riegel, Boyer and Gray, to be submitted]
 – O(N), exact
Applications: querying, first step of adaptive density estimation [Budavari et al., in prep]
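
A hedged sketch of the dual-tree idea for all-nearest-neighbors, reusing the KdNode/build sketch above; the pruning rule and helper names (euclidean, min_box_dist, all_nn) are illustrative, not the algorithm exactly as published or as implemented in MLPACK. The key step: if the closest possible distance between a query node and a reference node already exceeds every query point's current best distance, the whole node pair is skipped.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

double euclidean(const std::vector<double>& a, const std::vector<double>& b) {
    double s = 0.0;
    for (size_t d = 0; d < a.size(); ++d) s += (a[d] - b[d]) * (a[d] - b[d]);
    return std::sqrt(s);
}

// Smallest possible distance between the bounding boxes of two kd-tree nodes.
double min_box_dist(const KdNode& a, const KdNode& b) {
    double s = 0.0;
    for (size_t d = 0; d < a.lo.size(); ++d) {
        double gap = std::max({0.0, a.lo[d] - b.hi[d], b.lo[d] - a.hi[d]});
        s += gap * gap;
    }
    return std::sqrt(s);
}

// Dual-tree all-nearest-neighbors: best[i] / nn[i] hold the current best
// distance and neighbor index for query point i (initialize best to infinity).
void all_nn(const KdNode& q, const KdNode& r,
            const std::vector<std::vector<double>>& qpts,
            const std::vector<std::vector<double>>& rpts,
            std::vector<double>& best, std::vector<size_t>& nn) {
    // Prune: if even the closest corner of r is farther away than the worst
    // current best distance in q, nothing in q can be improved by r.
    // (Scanning best[] here is a simplification; real code caches this bound.)
    double worst = 0.0;
    for (size_t i = q.begin; i < q.end; ++i) worst = std::max(worst, best[i]);
    if (min_box_dist(q, r) > worst) return;

    if (q.is_leaf() && r.is_leaf()) {                 // base case: compare all pairs
        for (size_t i = q.begin; i < q.end; ++i)
            for (size_t j = r.begin; j < r.end; ++j) {
                // (For all-NN on a single point set, also skip j == i here.)
                double d = euclidean(qpts[i], rpts[j]);
                if (d < best[i]) { best[i] = d; nn[i] = j; }
            }
    } else if (q.is_leaf()) {                         // split the reference node
        all_nn(q, *r.left, qpts, rpts, best, nn);
        all_nn(q, *r.right, qpts, rpts, best, nn);
    } else if (r.is_leaf()) {                         // split the query node
        all_nn(*q.left, r, qpts, rpts, best, nn);
        all_nn(*q.right, r, qpts, rpts, best, nn);
    } else {                                          // split both
        all_nn(*q.left, *r.left, qpts, rpts, best, nn);
        all_nn(*q.left, *r.right, qpts, rpts, best, nn);
        all_nn(*q.right, *r.left, qpts, rpts, best, nn);
        all_nn(*q.right, *r.right, qpts, rpts, best, nn);
    }
}
```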

16 Kernel density estimation
Naïve cost: O(N^2)
Fastest algorithm: dual-tree fast Gauss transform, Monte Carlo multipole method
 – Technique: Hermite operators, Monte Carlo
 – [Lee, Gray and Moore 2005, NIPS; Lee and Gray 2006, UAI; Holmes, Gray and Isbell 2007, NIPS; Lee and Gray 2008, NIPS]
 – O(N), relative error guarantee
Applications: accurate density estimates, e.g. galaxy types [Balogh et al. 2004]
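
To make the pruning idea concrete for kernel sums, here is a hedged single-tree sketch of approximate Gaussian kernel summation for one query point, reusing the KdNode and euclidean helpers from the sketches above. It controls a simple per-reference-point absolute error rather than the relative-error guarantee of the cited dual-tree/Hermite methods; all names and the eps_per_point parameter are illustrative assumptions.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

double gauss_kernel(double d, double h) { return std::exp(-0.5 * d * d / (h * h)); }

// Closest and farthest possible distances from a query point to a node's box.
double min_point_box_dist(const std::vector<double>& x, const KdNode& r) {
    double s = 0.0;
    for (size_t d = 0; d < x.size(); ++d) {
        double gap = std::max({0.0, r.lo[d] - x[d], x[d] - r.hi[d]});
        s += gap * gap;
    }
    return std::sqrt(s);
}
double max_point_box_dist(const std::vector<double>& x, const KdNode& r) {
    double s = 0.0;
    for (size_t d = 0; d < x.size(); ++d) {
        double far = std::max(std::fabs(x[d] - r.lo[d]), std::fabs(x[d] - r.hi[d]));
        s += far * far;
    }
    return std::sqrt(s);
}

// Unnormalized Gaussian kernel sum over the reference points under node r,
// with bandwidth h. Every point in r contributes a kernel value between kmin
// and kmax, so if that interval is narrow the whole node is replaced by its
// midpoint contribution (error <= (kmax - kmin)/2 per reference point).
double kde_sum(const std::vector<double>& x, const KdNode& r,
               const std::vector<std::vector<double>>& rpts,
               double h, double eps_per_point) {
    double kmax = gauss_kernel(min_point_box_dist(x, r), h);
    double kmin = gauss_kernel(max_point_box_dist(x, r), h);
    size_t n = r.end - r.begin;
    if (kmax - kmin < 2.0 * eps_per_point)
        return 0.5 * (kmax + kmin) * n;               // prune: approximate whole node
    if (r.is_leaf()) {                                // base case: exact sum
        double s = 0.0;
        for (size_t j = r.begin; j < r.end; ++j)
            s += gauss_kernel(euclidean(x, rpts[j]), h);
        return s;
    }
    return kde_sum(x, *r.left, rpts, h, eps_per_point) +
           kde_sum(x, *r.right, rpts, h, eps_per_point);
}
```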

17 Nonparametric Bayes classification
Naïve cost: O(N^2)
Fastest algorithm: dual-tree bounding with hybrid tree expansion
 – Technique: bound each posterior
 – [Liu, Moore, and Gray 2004; Gray and Riegel 2004, CompStat; Riegel and Gray 2007, SDM]
 – O(N), exact
Applications: accurate classification, e.g. quasar detection [Richards et al. 2004, 2009], [Scranton et al. 2005]

18 Principal component analysis
Naïve cost: O(D^3)
Fastest algorithm: QUIC-SVD
 – Technique: tree-based Monte Carlo using cosine tree
 – [Holmes, Gray, and Isbell 2009]
 – O(1), relative error guarantee
Applications: spectral decomposition
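
For orientation only: the sketch below is not QUIC-SVD (which samples adaptively via a cosine tree and carries a relative-error guarantee). It just shows the underlying point that the dominant principal direction can be obtained without forming or decomposing the D x D covariance matrix, here by power iteration on the implicit covariance A^T A at O(ND) cost per iteration; the function name and the assumption of mean-centered rows are illustrative.

```cpp
#include <cmath>
#include <cstdlib>
#include <vector>

// Power iteration for the top principal direction of a (mean-centered) N x D
// data matrix A, applying v <- A^T (A v) repeatedly without ever materializing
// the D x D covariance. Illustrative sketch only; not QUIC-SVD.
std::vector<double> top_principal_direction(
        const std::vector<std::vector<double>>& A, int iters = 100) {
    size_t N = A.size(), D = A[0].size();
    std::vector<double> v(D);
    for (size_t d = 0; d < D; ++d)                    // random start direction
        v[d] = std::rand() / (double)RAND_MAX - 0.5;
    for (int it = 0; it < iters; ++it) {
        std::vector<double> Av(N, 0.0), w(D, 0.0);
        for (size_t i = 0; i < N; ++i)                // Av = A v
            for (size_t d = 0; d < D; ++d) Av[i] += A[i][d] * v[d];
        for (size_t i = 0; i < N; ++i)                // w = A^T (A v)
            for (size_t d = 0; d < D; ++d) w[d] += A[i][d] * Av[i];
        double norm = 0.0;
        for (double x : w) norm += x * x;
        norm = std::sqrt(norm);
        for (size_t d = 0; d < D; ++d) v[d] = w[d] / norm;   // renormalize
    }
    return v;   // approximate first principal component (unit vector)
}
```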

19 Friends-of-friends
Naïve cost: O(N^3)
Fastest algorithm: dual-tree Borůvka algorithm
 – Technique: maintain implicit friend sets
 – [March and Gray, to be submitted]
 – O(N log N), exact
Applications: cluster-finding, 2-sample
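
For reference, a hedged sketch of classical Borůvka over a Euclidean point set, with each round's search for a component's shortest outgoing edge done by brute force; the dual-tree Borůvka algorithm cited above replaces exactly that scan with kd-tree pruning. Names (Edge, boruvka_mst) are illustrative, not the implementation from the paper.

```cpp
#include <cmath>
#include <limits>
#include <numeric>
#include <vector>

struct Edge { size_t u, v; double w; };

double edge_dist(const std::vector<double>& a, const std::vector<double>& b) {
    double s = 0.0;
    for (size_t d = 0; d < a.size(); ++d) s += (a[d] - b[d]) * (a[d] - b[d]);
    return std::sqrt(s);
}

// Classical Boruvka: each round, every component finds its shortest edge to a
// different component; those edges are added and the components merged.
// O(N^2) work per round and O(log N) rounds in this naive form.
std::vector<Edge> boruvka_mst(const std::vector<std::vector<double>>& pts) {
    size_t N = pts.size();
    std::vector<size_t> comp(N);                       // component label per point
    std::iota(comp.begin(), comp.end(), 0);
    std::vector<Edge> mst;
    while (N > 0 && mst.size() + 1 < N) {
        std::vector<Edge> best(N, {0, 0, std::numeric_limits<double>::infinity()});
        for (size_t i = 0; i < N; ++i)                 // shortest outgoing edge per component
            for (size_t j = i + 1; j < N; ++j) {
                if (comp[i] == comp[j]) continue;
                double w = edge_dist(pts[i], pts[j]);
                if (w < best[comp[i]].w) best[comp[i]] = {i, j, w};
                if (w < best[comp[j]].w) best[comp[j]] = {i, j, w};
            }
        for (size_t c = 0; c < N; ++c) {               // add edges, merging components
            if (!std::isfinite(best[c].w)) continue;
            size_t a = comp[best[c].u], b = comp[best[c].v];
            if (a == b) continue;                      // endpoints already merged this round
            mst.push_back(best[c]);
            for (size_t i = 0; i < N; ++i)             // relabel component b as a
                if (comp[i] == b) comp[i] = a;
        }
    }
    return mst;
}
```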

20 n-point correlations
Naïve cost: O(N^n)
Fastest algorithm: n-tree algorithms
 – Technique: use n trees simultaneously
 – [Gray and Moore 2001, NIPS; Moore et al. 2001, Mining the Sky; Waters, Riegel and Gray 2009, to be submitted]
 – O(N^log n), exact
Applications: 2-sample, e.g. AGN [Wake et al. 2004], ISW [Giannantonio et al. 2006], 3pt validation [Kulkarni et al. 2007]

21 3-point runtime (biggest previous: 20K)
VIRGO simulation data, N = 75,000,000
 – naïve: 5x10^9 sec. (~150 years)
 – multi-tree: 55 sec. (exact)
n=2: O(N); n=3: O(N^log 3); n=4: O(N^2)

22 Software
MLPACK (C++)
 – First scalable comprehensive ML library
MLPACK Pro
 – Very-large-scale data (parallel)

23 Software
MLPACK (C++)
 – First scalable comprehensive ML library
MLPACK Pro
 – Very-large-scale data (parallel)
MLPACK-db
 – Fast data analytics in relational databases (SQL Server; on-disk)
 – Thanks to Tamas Budavari

24 Fast algorithms in SQL Server: Challenges
Don’t copy the data: in-RAM kd-trees → disk-based data structures
Commercial database: no access to guts (e.g. caching, memory)
Already optimized for different operations (SQL queries) and data structures (B+ trees)
Big learning curve

25 Building a kd-tree in SQL Server
Piggyback onto B+ trees and disk layout layers: use SQL queries
Hybrid approach: upper tree built using the Morton z-curve (SQL), lower tree using in-RAM building; 3 passes total
Clustered indices for fast sequential access of nodes and data
Abstracted algorithms in managed C#
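
A hedged sketch of the Morton (z-order) key computation that a SQL-level upper-tree build like the one described above could rest on: quantize each coordinate within the data's bounding box, then interleave the bits so that sorting by the key groups spatially nearby points. Illustrative only (the names, the bits parameter, and the clamping are assumptions), not the actual MLPACK-db code.

```cpp
#include <cstdint>
#include <vector>

// Morton (z-order) key: quantize each coordinate to 'bits' bits within the
// bounding box [lo, hi], then interleave the bits, most significant first.
// Requires bits * dimension <= 64.
uint64_t morton_key(const std::vector<double>& x,
                    const std::vector<double>& lo,
                    const std::vector<double>& hi,
                    int bits = 16) {
    const size_t D = x.size();
    std::vector<uint64_t> q(D);
    for (size_t d = 0; d < D; ++d) {
        double t = (x[d] - lo[d]) / (hi[d] - lo[d]);   // normalize to [0, 1]
        if (t < 0.0) t = 0.0;
        if (t > 1.0) t = 1.0;
        q[d] = (uint64_t)(t * ((1ull << bits) - 1));   // quantize
    }
    uint64_t key = 0;
    for (int b = bits - 1; b >= 0; --b)                // interleave bit planes
        for (size_t d = 0; d < D; ++d)
            key = (key << 1) | ((q[d] >> b) & 1ull);
    return key;
}
```

Sorting rows by such a key (the same arithmetic is expressible in plain SQL) is one plausible way to obtain the sequential, locality-preserving layout that a clustered index can then scan efficiently.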

26 Current state
So far: All-NN, KDE*, NBC*, 2pt*
Performance for all-nearest-neighbors on 40M-point, 4-D SDSS color data:
 – Naïve SQL Server: 50 yrs
 – Batched SQL Server: 3 yrs
 – Single-tree MLPACK-db: 14.5 hrs
 – Dual-tree MLPACK-db: 227 min
 – TPIE: 23 min; MLPACK-Pro: 10 min
We believe there is a 10x factor of fat
 – e.g. this uses only 100 MB of 2 GB of RAM

27 The end
You can now use many of the state-of-the-art data analysis methods on large survey datasets
We need more funding
Talk to me… I have a lot of problems (ahem) but always want more!
Alexander Gray, agray@cc.gatech.edu, www.fast-lab.org

28 Example: support vector machine
Data: IJCNN1 [DP01a] (2 classes, 49,990 training points, 91,701 testing points, 22 features)
 – SMO: 12,831 SVs, 84,360 iterations, 98.3% accuracy, 765 sec; complexity O(N^2)–O(N^3)
 – SFW: 4,145 SVs, 4,500 iterations, 98.1% accuracy, 21 sec; complexity O(N/ε + 1/ε^2)
[Ouyang and Gray 2009, in submission]

29 A few new ones…
Non-vector data (like time series, images, graphs, trees):
 – Functional ICA (Mehta and Gray 2008)
 – Generalized mean map kernel (Mehta and Gray, in submission)
Visualizing high-dimensional data:
 – Rank-based manifolds (Ouyang and Gray 2008)
 – Density-preserving maps (Ozakin and Gray, in submission)
 – Intrinsic dimension estimation (Ozakin and Gray, in prep)
Accounting for measurement errors
Probabilistic cross-match

30 2-point correlation
Characterization of an entire distribution? “How many pairs have distance < r?” This is the 2-point correlation function.
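
The question on this slide corresponds to the O(N^2) brute-force pair count sketched below (a baseline; the tree-based algorithms on the following slides count whole node pairs at once). The function name is illustrative.

```cpp
#include <vector>

// Brute-force 2-point counting: how many pairs have separation < r?
long long count_pairs(const std::vector<std::vector<double>>& pts, double r) {
    long long count = 0;
    double r2 = r * r;                                 // compare squared distances
    for (size_t i = 0; i < pts.size(); ++i)
        for (size_t j = i + 1; j < pts.size(); ++j) {
            double s = 0.0;
            for (size_t d = 0; d < pts[i].size(); ++d)
                s += (pts[i][d] - pts[j][d]) * (pts[i][d] - pts[j][d]);
            if (s < r2) ++count;
        }
    return count;
}
```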

31 The n-point correlation functions
Spatial inferences: filaments, clusters, voids, homogeneity, isotropy, 2-sample testing, …
Foundation for the theory of point processes [Daley, Vere-Jones 1972]; unifies spatial statistics [Ripley 1976]
Used heavily in biostatistics, cosmology, particle physics, statistical physics
2pcf and 3pcf definitions:
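
The slide's own formulas are not reproduced in this transcript; for reference, the conventional definitions from the point-process literature (which may differ in notation from the original slide) are:

```latex
% Joint probability of finding points of a process with mean density n in the
% volume elements dV_1, dV_2 (and dV_3), defining the 2pcf xi and the
% (connected) 3pcf zeta:
dP_{12}  = n^2 \,\bigl[\, 1 + \xi(r_{12}) \,\bigr]\, dV_1\, dV_2
dP_{123} = n^3 \,\bigl[\, 1 + \xi(r_{12}) + \xi(r_{23}) + \xi(r_{31})
                          + \zeta(r_{12}, r_{23}, r_{31}) \,\bigr]\, dV_1\, dV_2\, dV_3
```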

32 3-point correlation
“How many triples have pairwise distances < r1, r2, r3?” (a triangle with side constraints r1, r2, r3)
Standard model: n>0 terms should be zero!

33 How can we count n-tuples efficiently?
“How many triples have pairwise distances < r?”

34 Use n trees! [Gray & Moore, NIPS 2000]

35 “How many valid triangles a-b-c (where all pairwise distances are < r) could there be?”
count{A,B,C} = ?

36 “How many valid triangles a-b-c (where all pairwise distances are < r) could there be?”
count{A,B,C} = count{A,B,C.left} + count{A,B,C.right}

37 “How many valid triangles a-b-c (where all pairwise distances are < r) could there be?”
count{A,B,C} = count{A,B,C.left} + count{A,B,C.right}

38 “How many valid triangles a-b-c (where all pairwise distances are < r) could there be?”
count{A,B,C} = ?

39 “How many valid triangles a-b-c (where all pairwise distances are < r) could there be?”
Exclusion: count{A,B,C} = 0!

40 “How many valid triangles a-b-c (where all pairwise distances are < r) could there be?”
count{A,B,C} = ?

41 “How many valid triangles a-b-c (where all pairwise distances are < r) could there be?”
Inclusion: count{A,B,C} = |A| x |B| x |C|
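
Putting the walkthrough above into code: a hedged C++ sketch of the 3-tree counting recursion with the exclusion and inclusion prunes, reusing KdNode, min_box_dist and euclidean from the earlier sketches and adding a box-to-box maximum distance. It counts ordered triples against a single threshold r; the real statistic over one catalog must also avoid self-pairs and repeated counting, and the published algorithms handle the general (r1, r2, r3) case.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Largest possible distance between the bounding boxes of two nodes.
double max_box_dist(const KdNode& a, const KdNode& b) {
    double s = 0.0;
    for (size_t d = 0; d < a.lo.size(); ++d) {
        double far = std::max(a.hi[d] - b.lo[d], b.hi[d] - a.lo[d]);
        s += far * far;
    }
    return std::sqrt(s);
}

// Count triples (a, b, c) with a in A, b in B, c in C and all pairwise
// distances < r, using exclusion/inclusion pruning as in the slides.
long long count_triples(const KdNode& A, const KdNode& B, const KdNode& C,
                        const std::vector<std::vector<double>>& pts, double r) {
    // Exclusion: some pair of nodes is entirely farther apart than r.
    if (min_box_dist(A, B) >= r || min_box_dist(B, C) >= r || min_box_dist(A, C) >= r)
        return 0;
    // Inclusion: every pair of nodes is entirely closer than r, so every
    // triple qualifies: count |A| x |B| x |C| without touching the points.
    if (max_box_dist(A, B) < r && max_box_dist(B, C) < r && max_box_dist(A, C) < r)
        return (long long)(A.end - A.begin) * (long long)(B.end - B.begin) *
               (long long)(C.end - C.begin);
    // Base case: three leaves, check the remaining triples directly.
    if (A.is_leaf() && B.is_leaf() && C.is_leaf()) {
        long long cnt = 0;
        for (size_t i = A.begin; i < A.end; ++i)
            for (size_t j = B.begin; j < B.end; ++j)
                for (size_t k = C.begin; k < C.end; ++k)
                    if (euclidean(pts[i], pts[j]) < r &&
                        euclidean(pts[j], pts[k]) < r &&
                        euclidean(pts[i], pts[k]) < r)
                        ++cnt;
        return cnt;
    }
    // Otherwise split one non-leaf node (the slides split C) and recurse.
    if (!C.is_leaf())
        return count_triples(A, B, *C.left, pts, r) + count_triples(A, B, *C.right, pts, r);
    if (!B.is_leaf())
        return count_triples(A, *B.left, C, pts, r) + count_triples(A, *B.right, C, pts, r);
    return count_triples(*A.left, B, C, pts, r) + count_triples(*A.right, B, C, pts, r);
}
```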

