
1 Fast Statistical Algorithms in Databases
Alexander Gray, Georgia Institute of Technology, College of Computing
FASTlab: Fundamental Algorithmic and Statistical Tools Laboratory

2 The FASTlab: Fundamental Algorithmic and Statistical Tools Laboratory
1. Arkadas Ozakin: Research scientist, PhD Theoretical Physics
2. Nikolaos Vasiloglou: Research scientist, PhD Elec Comp Engineering
3. Abhimanyu Aditya: MS, CS
4. Leif Poorman: MS, Physics
5. Dong Ryeol Lee: PhD student, CS + Math
6. Ryan Riegel: PhD student, CS + Math
7. Parikshit Ram: PhD student, CS + Math
8. William March: PhD student, Math + CS
9. James Waters: PhD student, Physics + CS
10. Hua Ouyang: PhD student, CS
11. Sooraj Bhat: PhD student, CS
12. Ravi Sastry: PhD student, CS
13. Long Tran: PhD student, CS
14. Wei Guan: PhD student, CS (co-supervised)
15. Nishant Mehta: PhD student, CS (co-supervised)
16. Wee Chin Wong: PhD student, ChemE (co-supervised)
17. Jaegul Choo: PhD student, CS (co-supervised)
18. Ailar Javadi: PhD student, ECE (co-supervised)
19. Shakir Sharfraz: MS student, CS
20. Subbanarasimhiah Harish: MS student, CS

3 Our challenge
Allow statistical analyses (both simple and state-of-the-art) on terascale or petascale datasets
Return answers with error guarantees
Rigorous theoretical runtimes
Ensure small constants (small actual runtimes, not just good asymptotic runtimes)

4 10 data analysis problems, and scalable tools we’d like for them
1. Querying (e.g. characterizing a region of space):
 – spherical range-search O(N)
 – orthogonal range-search O(N)
 – k-nearest-neighbors O(N)
 – all-k-nearest-neighbors O(N^2)
2. Density estimation (e.g. comparing galaxy types):
 – mixture of Gaussians
 – kernel density estimation O(N^2)
 – manifold kernel density estimation O(N^3) [Ozakin and Gray 2009, in submission]
 – hyper-kernel density estimation O(N^4) [Sastry and Gray 2009, in submission]
 – L2 density tree [Ram and Gray, in prep]

5 10 data analysis problems, and scalable tools we’d like for them
3. Regression (e.g. photometric redshifts):
 – linear regression O(ND^2)
 – kernel regression O(N^2)
 – Gaussian process regression/kriging O(N^3)
4. Classification (e.g. quasar detection, star-galaxy separation):
 – k-nearest-neighbor classifier O(N^2)
 – nonparametric Bayes classifier O(N^2)
 – support vector machine (SVM) O(N^3)
 – isometric separation map O(N^3) [Vasiloglou, Gray, and Anderson 2008]
 – non-negative SVM O(N^3) [Guan and Gray, in prep]
 – false-positive-limiting SVM O(N^3) [Sastry and Gray, in prep]

6 10 data analysis problems, and scalable tools we’d like for them
5. Dimension reduction (e.g. galaxy or spectra characterization):
 – principal component analysis O(D^3)
 – non-negative matrix factorization
 – kernel PCA (including LLE, IsoMap, et al.) O(N^3)
 – maximum variance unfolding O(N^3)
 – rank-based manifolds O(N^3) [Ouyang and Gray 2008]
 – isometric non-negative matrix factorization O(N^3) [Vasiloglou, Gray, and Anderson 2008]
 – density-preserving maps O(N^3) [Ozakin and Gray 2009, in submission]
6. Outlier detection (e.g. new object types, data cleaning):
 – by density estimation, by dimension reduction
 – by robust Lp estimation [Ram, Riegel and Gray, in prep]

7 10 data analysis problems, and scalable tools we’d like for them
7. Clustering (e.g. automatic Hubble sequence):
 – by dimension reduction, by density estimation
 – k-means
 – mean-shift segmentation O(N^2)
 – hierarchical clustering (“friends-of-friends”) O(N^3)
8. Time series analysis (e.g. asteroid tracking, variable objects):
 – Kalman filter O(D^2)
 – hidden Markov model O(D^2)
 – trajectory tracking O(N^n)
 – functional independent component analysis [Mehta and Gray 2008]
 – Markov matrix factorization [Tran, Wong, and Gray 2009, in submission]

8 10 data analysis problems, and scalable tools we’d like for them
9. Feature selection and causality (e.g. which features predict star/galaxy):
 – LASSO regression
 – L1 SVM
 – Gaussian graphical model inference and structure search
 – discrete graphical model inference and structure search
 – mixed-integer SVM [Guan and Gray 2009, in submission]
 – L1 Gaussian graphical model inference and structure search [Tran, Lee, Holmes, and Gray, in prep]
10. 2-sample testing and matching (e.g. cosmological validation, multiple surveys):
 – minimum spanning tree O(N^3)
 – n-point correlation O(N^n)
 – bipartite matching/Gaussian graphical model inference O(N^3) [Waters and Gray, in prep]

9 Core methods of statistics / machine learning / mining
Querying: nearest-neighbor O(N), spherical range-search O(N), orthogonal range-search O(N), contingency table
Density estimation: kernel density estimation O(N^2), mixture of Gaussians O(N)
Regression: linear regression O(ND^2), kernel regression O(N^2), Gaussian process regression O(N^3)
Classification: nearest-neighbor classifier O(N^2), nonparametric Bayes classifier O(N^2), support vector machine
Dimension reduction: principal component analysis O(D^3), non-negative matrix factorization, kernel PCA O(N^3), maximum variance unfolding O(N^3)
Outlier detection: by robust L2 estimation, by density estimation, by dimension reduction
Clustering: k-means O(N), hierarchical clustering O(N^3), by dimension reduction
Time series analysis: Kalman filter O(D^3), hidden Markov model, trajectory tracking
Feature selection and causality: LASSO, L1 SVM, Gaussian graphical models, discrete graphical models
2-sample testing and matching: n-point correlation O(N^n), bipartite matching O(N^3)

10 4 main computational bottlenecks: generalized N-body, graphical models, linear algebra, optimization
Querying: nearest-neighbor O(N), spherical range-search O(N), orthogonal range-search O(N), contingency table
Density estimation: kernel density estimation O(N^2), mixture of Gaussians O(N)
Regression: linear regression O(ND^2), kernel regression O(N^2), Gaussian process regression O(N^3)
Classification: nearest-neighbor classifier O(N^2), nonparametric Bayes classifier O(N^2), support vector machine
Dimension reduction: principal component analysis O(D^3), non-negative matrix factorization, kernel PCA O(N^3), maximum variance unfolding O(N^3)
Outlier detection: by robust L2 estimation, by density estimation, by dimension reduction
Clustering: k-means O(N), hierarchical clustering O(N^3), by dimension reduction
Time series analysis: Kalman filter O(D^3), hidden Markov model, trajectory tracking
Feature selection and causality: LASSO, L1 SVM, Gaussian graphical models, discrete graphical models
2-sample testing: n-point correlation O(N^n), bipartite matching O(N^3)

11 The data structure needed for GNPs (generalized N-body problems): kd-trees, the most widely used space-partitioning tree for general dimension [Bentley 1975], [Friedman, Bentley & Finkel 1977], [Moore & Lee 1995] (others: ball-trees, cover-trees, etc.)
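
As a reference point for the slides that follow, here is a minimal C++ sketch of a median-split kd-tree build in the style of [Friedman, Bentley & Finkel 1977]. The KdNode/build names and the std::vector point representation are illustrative assumptions, not MLPACK's actual API.

```cpp
#include <algorithm>
#include <cstddef>
#include <memory>
#include <vector>

// One kd-tree node: a bounding hyper-rectangle plus either two children or a
// leaf range of point indices.
struct KdNode {
    std::vector<double> lo, hi;            // bounding box, per dimension
    size_t begin = 0, end = 0;             // points[begin, end) owned by this node
    std::unique_ptr<KdNode> left, right;
    bool is_leaf() const { return !left; }
};

// Recursively build a kd-tree over pts[begin, end): compute the bounding box,
// then split the widest dimension at the median point (via nth_element).
std::unique_ptr<KdNode> build(std::vector<std::vector<double>>& pts,
                              size_t begin, size_t end, size_t leaf_size = 20) {
    auto node = std::make_unique<KdNode>();
    node->begin = begin;
    node->end = end;
    const size_t D = pts[begin].size();
    node->lo.assign(D, 1e300);
    node->hi.assign(D, -1e300);
    for (size_t i = begin; i < end; ++i)
        for (size_t d = 0; d < D; ++d) {
            node->lo[d] = std::min(node->lo[d], pts[i][d]);
            node->hi[d] = std::max(node->hi[d], pts[i][d]);
        }
    if (end - begin <= leaf_size) return node;       // small enough: stop as a leaf
    size_t dim = 0;                                   // widest dimension
    for (size_t d = 1; d < D; ++d)
        if (node->hi[d] - node->lo[d] > node->hi[dim] - node->lo[dim]) dim = d;
    size_t mid = begin + (end - begin) / 2;           // median split
    std::nth_element(pts.begin() + begin, pts.begin() + mid, pts.begin() + end,
                     [dim](const std::vector<double>& a, const std::vector<double>& b) {
                         return a[dim] < b[dim];
                     });
    node->left = build(pts, begin, mid, leaf_size);
    node->right = build(pts, mid, end, leaf_size);
    return node;
}
```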

12 What algorithmic methods do we use?
Generalized N-body algorithms (multiple trees) for distance/similarity-based computations [2000, 2003, 2009]
Hierarchical series expansions for kernel summations [2004, 2006, 2008]
Multi-scale Monte Carlo for linear algebra and summations [2007, 2008]
Parallel computing [1998, 2006, 2009]

13 Computational complexity using fast algorithms
Querying: nearest-neighbor O(log N), spherical range-search O(log N), orthogonal range-search O(log N), contingency table
Density estimation: kernel density estimation O(N) or O(1), mixture of Gaussians O(log N)
Regression: linear regression O(D) or O(1), kernel regression O(N) or O(1), Gaussian process regression O(N) or O(1)
Classification: nearest-neighbor classifier O(N), nonparametric Bayes classifier O(N), support vector machine O(N)
Dimension reduction: principal component analysis O(D) or O(1), non-negative matrix factorization, kernel PCA O(N) or O(1), maximum variance unfolding O(N)
Outlier detection: by robust L2 estimation, by density estimation, by dimension reduction
Clustering: k-means O(log N), hierarchical clustering O(N log N), by dimension reduction
Time series analysis: Kalman filter O(D) or O(1), hidden Markov model, trajectory tracking
Feature selection and causality: LASSO, L1 SVM, Gaussian graphical models, discrete graphical models
2-sample testing and matching: n-point correlation O(N^log n), bipartite matching O(N) or O(1)

14 Computational complexity using fast algorithms
Querying: nearest-neighbor O(log N), spherical range-search O(log N), orthogonal range-search O(log N), contingency table
Density estimation: kernel density estimation O(N) or O(1), mixture of Gaussians O(log N)
Regression: linear regression O(D) or O(1), kernel regression O(N) or O(1), Gaussian process regression O(N) or O(1)
Classification: nearest-neighbor classifier O(N), nonparametric Bayes classifier O(N), support vector machine O(N)
Dimension reduction: principal component analysis O(D) or O(1), non-negative matrix factorization, kernel PCA O(N) or O(1), maximum variance unfolding O(N)
Outlier detection: by robust L2 estimation, by density estimation, by dimension reduction
Clustering: k-means O(log N), hierarchical clustering O(N log N), by dimension reduction
Time series analysis: Kalman filter O(D) or O(1), hidden Markov model, trajectory tracking
Feature selection and causality: LASSO, L1 SVM, Gaussian graphical models, discrete graphical models
2-sample testing and matching: n-point correlation O(N^log n), bipartite matching O(N) or O(1)

15 All-k-nearest-neighbors
Naïve cost: O(N^2)
Fastest algorithm: generalized fast multipole method, aka multi-tree methods
 – Technique: use two trees simultaneously
 – [Gray and Moore 2001, NIPS; Ram, Lee, March, and Gray 2009, submitted; Riegel, Boyer and Gray, to be submitted]
 – O(N), exact
Applications: querying, first step of adaptive density estimation [Budavari et al., in prep]
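
A hedged sketch of the dual-tree idea for all-nearest-neighbors, reusing the KdNode/build sketch above; the pruning rule and helper names (euclidean, min_box_dist, all_nn) are illustrative, not the algorithm exactly as published or as implemented in MLPACK. The key step: if the closest possible distance between a query node and a reference node already exceeds every query point's current best distance, the whole node pair is skipped.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

double euclidean(const std::vector<double>& a, const std::vector<double>& b) {
    double s = 0.0;
    for (size_t d = 0; d < a.size(); ++d) s += (a[d] - b[d]) * (a[d] - b[d]);
    return std::sqrt(s);
}

// Smallest possible distance between the bounding boxes of two kd-tree nodes.
double min_box_dist(const KdNode& a, const KdNode& b) {
    double s = 0.0;
    for (size_t d = 0; d < a.lo.size(); ++d) {
        double gap = std::max({0.0, a.lo[d] - b.hi[d], b.lo[d] - a.hi[d]});
        s += gap * gap;
    }
    return std::sqrt(s);
}

// Dual-tree all-nearest-neighbors: best[i] / nn[i] hold the current best
// distance and neighbor index for query point i (initialize best to infinity).
void all_nn(const KdNode& q, const KdNode& r,
            const std::vector<std::vector<double>>& qpts,
            const std::vector<std::vector<double>>& rpts,
            std::vector<double>& best, std::vector<size_t>& nn) {
    // Prune: if even the closest corner of r is farther away than the worst
    // current best distance in q, nothing in q can be improved by r.
    // (Scanning best[] here is a simplification; real code caches this bound.)
    double worst = 0.0;
    for (size_t i = q.begin; i < q.end; ++i) worst = std::max(worst, best[i]);
    if (min_box_dist(q, r) > worst) return;

    if (q.is_leaf() && r.is_leaf()) {                 // base case: compare all pairs
        for (size_t i = q.begin; i < q.end; ++i)
            for (size_t j = r.begin; j < r.end; ++j) {
                // (For all-NN on a single point set, also skip j == i here.)
                double d = euclidean(qpts[i], rpts[j]);
                if (d < best[i]) { best[i] = d; nn[i] = j; }
            }
    } else if (q.is_leaf()) {                         // split the reference node
        all_nn(q, *r.left, qpts, rpts, best, nn);
        all_nn(q, *r.right, qpts, rpts, best, nn);
    } else if (r.is_leaf()) {                         // split the query node
        all_nn(*q.left, r, qpts, rpts, best, nn);
        all_nn(*q.right, r, qpts, rpts, best, nn);
    } else {                                          // split both
        all_nn(*q.left, *r.left, qpts, rpts, best, nn);
        all_nn(*q.left, *r.right, qpts, rpts, best, nn);
        all_nn(*q.right, *r.left, qpts, rpts, best, nn);
        all_nn(*q.right, *r.right, qpts, rpts, best, nn);
    }
}
```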

16 Kernel density estimation
Naïve cost: O(N^2)
Fastest algorithm: dual-tree fast Gauss transform, Monte Carlo multipole method
 – Technique: Hermite operators, Monte Carlo
 – [Lee, Gray and Moore 2005, NIPS; Lee and Gray 2006, UAI; Holmes, Gray and Isbell 2007, NIPS; Lee and Gray 2008, NIPS]
 – O(N), relative error guarantee
Applications: accurate density estimates, e.g. galaxy types [Balogh et al. 2004]
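
To make the pruning idea concrete for kernel sums, here is a hedged single-tree sketch of approximate Gaussian kernel summation for one query point, reusing the KdNode and euclidean helpers from the sketches above. It controls a simple per-reference-point absolute error rather than the relative-error guarantee of the cited dual-tree/Hermite methods; all names and the eps_per_point parameter are illustrative assumptions.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

double gauss_kernel(double d, double h) { return std::exp(-0.5 * d * d / (h * h)); }

// Closest and farthest possible distances from a query point to a node's box.
double min_point_box_dist(const std::vector<double>& x, const KdNode& r) {
    double s = 0.0;
    for (size_t d = 0; d < x.size(); ++d) {
        double gap = std::max({0.0, r.lo[d] - x[d], x[d] - r.hi[d]});
        s += gap * gap;
    }
    return std::sqrt(s);
}
double max_point_box_dist(const std::vector<double>& x, const KdNode& r) {
    double s = 0.0;
    for (size_t d = 0; d < x.size(); ++d) {
        double far = std::max(std::fabs(x[d] - r.lo[d]), std::fabs(x[d] - r.hi[d]));
        s += far * far;
    }
    return std::sqrt(s);
}

// Unnormalized Gaussian kernel sum over the reference points under node r,
// with bandwidth h. Every point in r contributes a kernel value between kmin
// and kmax, so if that interval is narrow the whole node is replaced by its
// midpoint contribution (error <= (kmax - kmin)/2 per reference point).
double kde_sum(const std::vector<double>& x, const KdNode& r,
               const std::vector<std::vector<double>>& rpts,
               double h, double eps_per_point) {
    double kmax = gauss_kernel(min_point_box_dist(x, r), h);
    double kmin = gauss_kernel(max_point_box_dist(x, r), h);
    size_t n = r.end - r.begin;
    if (kmax - kmin < 2.0 * eps_per_point)
        return 0.5 * (kmax + kmin) * n;               // prune: approximate whole node
    if (r.is_leaf()) {                                // base case: exact sum
        double s = 0.0;
        for (size_t j = r.begin; j < r.end; ++j)
            s += gauss_kernel(euclidean(x, rpts[j]), h);
        return s;
    }
    return kde_sum(x, *r.left, rpts, h, eps_per_point) +
           kde_sum(x, *r.right, rpts, h, eps_per_point);
}
```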

17 Nonparametric Bayes classification
Naïve cost: O(N^2)
Fastest algorithm: dual-tree bounding with hybrid tree expansion
 – Technique: bound each posterior
 – [Liu, Moore, and Gray 2004; Gray and Riegel 2004, CompStat; Riegel and Gray 2007, SDM]
 – O(N), exact
Applications: accurate classification, e.g. quasar detection [Richards et al. 2004, 2009], [Scranton et al. 2005]

18 Principal component analysis
Naïve cost: O(D^3)
Fastest algorithm: QUIC-SVD
 – Technique: tree-based Monte Carlo using cosine tree
 – [Holmes, Gray, and Isbell 2009]
 – O(1), relative error guarantee
Applications: spectral decomposition
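
For orientation only: the sketch below is not QUIC-SVD (which samples adaptively via a cosine tree and carries a relative-error guarantee). It just shows the underlying point that the dominant principal direction can be obtained without forming or decomposing the D x D covariance matrix, here by power iteration on the implicit covariance A^T A at O(ND) cost per iteration; the function name and the assumption of mean-centered rows are illustrative.

```cpp
#include <cmath>
#include <cstdlib>
#include <vector>

// Power iteration for the top principal direction of a (mean-centered) N x D
// data matrix A, applying v <- A^T (A v) repeatedly without ever materializing
// the D x D covariance. Illustrative sketch only; not QUIC-SVD.
std::vector<double> top_principal_direction(
        const std::vector<std::vector<double>>& A, int iters = 100) {
    size_t N = A.size(), D = A[0].size();
    std::vector<double> v(D);
    for (size_t d = 0; d < D; ++d)                    // random start direction
        v[d] = std::rand() / (double)RAND_MAX - 0.5;
    for (int it = 0; it < iters; ++it) {
        std::vector<double> Av(N, 0.0), w(D, 0.0);
        for (size_t i = 0; i < N; ++i)                // Av = A v
            for (size_t d = 0; d < D; ++d) Av[i] += A[i][d] * v[d];
        for (size_t i = 0; i < N; ++i)                // w = A^T (A v)
            for (size_t d = 0; d < D; ++d) w[d] += A[i][d] * Av[i];
        double norm = 0.0;
        for (double x : w) norm += x * x;
        norm = std::sqrt(norm);
        for (size_t d = 0; d < D; ++d) v[d] = w[d] / norm;   // renormalize
    }
    return v;   // approximate first principal component (unit vector)
}
```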

19 Friends-of-friends
Naïve cost: O(N^3)
Fastest algorithm: dual-tree Borůvka algorithm
 – Technique: maintain implicit friend sets
 – [March and Gray, to be submitted]
 – O(N log N), exact
Applications: cluster-finding, 2-sample
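
For reference, a hedged sketch of classical Borůvka over a Euclidean point set, with each round's search for a component's shortest outgoing edge done by brute force; the dual-tree Borůvka algorithm cited above replaces exactly that scan with kd-tree pruning. Names (Edge, boruvka_mst) are illustrative, not the implementation from the paper.

```cpp
#include <cmath>
#include <limits>
#include <numeric>
#include <vector>

struct Edge { size_t u, v; double w; };

double edge_dist(const std::vector<double>& a, const std::vector<double>& b) {
    double s = 0.0;
    for (size_t d = 0; d < a.size(); ++d) s += (a[d] - b[d]) * (a[d] - b[d]);
    return std::sqrt(s);
}

// Classical Boruvka: each round, every component finds its shortest edge to a
// different component; those edges are added and the components merged.
// O(N^2) work per round and O(log N) rounds in this naive form.
std::vector<Edge> boruvka_mst(const std::vector<std::vector<double>>& pts) {
    size_t N = pts.size();
    std::vector<size_t> comp(N);                       // component label per point
    std::iota(comp.begin(), comp.end(), 0);
    std::vector<Edge> mst;
    while (N > 0 && mst.size() + 1 < N) {
        std::vector<Edge> best(N, {0, 0, std::numeric_limits<double>::infinity()});
        for (size_t i = 0; i < N; ++i)                 // shortest outgoing edge per component
            for (size_t j = i + 1; j < N; ++j) {
                if (comp[i] == comp[j]) continue;
                double w = edge_dist(pts[i], pts[j]);
                if (w < best[comp[i]].w) best[comp[i]] = {i, j, w};
                if (w < best[comp[j]].w) best[comp[j]] = {i, j, w};
            }
        for (size_t c = 0; c < N; ++c) {               // add edges, merging components
            if (!std::isfinite(best[c].w)) continue;
            size_t a = comp[best[c].u], b = comp[best[c].v];
            if (a == b) continue;                      // endpoints already merged this round
            mst.push_back(best[c]);
            for (size_t i = 0; i < N; ++i)             // relabel component b as a
                if (comp[i] == b) comp[i] = a;
        }
    }
    return mst;
}
```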

20 n-point correlations
Naïve cost: O(N^n)
Fastest algorithm: n-tree algorithms
 – Technique: use n trees simultaneously
 – [Gray and Moore 2001, NIPS; Moore et al. 2001, Mining the Sky; Waters, Riegel and Gray 2009, to be submitted]
 – O(N^log n), exact
Applications: 2-sample, e.g. AGN [Wake et al. 2004], ISW [Giannantonio et al. 2006], 3pt validation [Kulkarni et al. 2007]

21 3-point runtime (biggest previous: 20K)
VIRGO simulation data, N = 75,000,000
 – naïve: 5x10^9 sec. (~150 years)
 – multi-tree: 55 sec. (exact)
n=2: O(N); n=3: O(N^log 3); n=4: O(N^2)

22 Software
MLPACK (C++)
 – First scalable comprehensive ML library
MLPACK Pro
 – Very-large-scale data (parallel)

23 Software
MLPACK (C++)
 – First scalable comprehensive ML library
MLPACK Pro
 – Very-large-scale data (parallel)
MLPACK-db
 – Fast data analytics in relational databases (SQL Server; on-disk)
 – Thanks to Tamas Budavari

24 Fast algorithms in SQL Server: Challenges
Don’t copy the data: in-RAM kd-trees → disk-based data structures
Commercial database: no access to guts (e.g. caching, memory)
Already optimized for different operations (SQL queries) and data structures (B+ trees)
Big learning curve

25 Building a kd-tree in SQL Server
Piggyback onto B+ trees and disk layout layers: use SQL queries
Hybrid approach: upper tree built using the Morton z-curve (SQL), lower tree using in-RAM building; 3 passes total
Clustered indices for fast sequential access of nodes and data
Abstracted algorithms in managed C#
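
A hedged sketch of the Morton (z-order) key computation that a SQL-level upper-tree build like the one described above could rest on: quantize each coordinate within the data's bounding box, then interleave the bits so that sorting by the key groups spatially nearby points. Illustrative only (the names, the bits parameter, and the clamping are assumptions), not the actual MLPACK-db code.

```cpp
#include <cstdint>
#include <vector>

// Morton (z-order) key: quantize each coordinate to 'bits' bits within the
// bounding box [lo, hi], then interleave the bits, most significant first.
// Requires bits * dimension <= 64.
uint64_t morton_key(const std::vector<double>& x,
                    const std::vector<double>& lo,
                    const std::vector<double>& hi,
                    int bits = 16) {
    const size_t D = x.size();
    std::vector<uint64_t> q(D);
    for (size_t d = 0; d < D; ++d) {
        double t = (x[d] - lo[d]) / (hi[d] - lo[d]);   // normalize to [0, 1]
        if (t < 0.0) t = 0.0;
        if (t > 1.0) t = 1.0;
        q[d] = (uint64_t)(t * ((1ull << bits) - 1));   // quantize
    }
    uint64_t key = 0;
    for (int b = bits - 1; b >= 0; --b)                // interleave bit planes
        for (size_t d = 0; d < D; ++d)
            key = (key << 1) | ((q[d] >> b) & 1ull);
    return key;
}
```

Sorting rows by such a key (the same arithmetic is expressible in plain SQL) is one plausible way to obtain the sequential, locality-preserving layout that a clustered index can then scan efficiently.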

26 Current state
So far: All-NN, KDE*, NBC*, 2pt*
Performance for all-nearest-neighbors on 40M-point, 4-D SDSS color data:
 – Naïve SQL Server: 50 yrs
 – Batched SQL Server: 3 yrs
 – Single-tree MLPACK-db: 14.5 hrs
 – Dual-tree MLPACK-db: 227 min
 – TPIE: 23 min; MLPACK-Pro: 10 min
We believe there is a 10x factor of fat
 – e.g. this uses only 100 MB of 2 GB of RAM

27 The end
You can now use many of the state-of-the-art data analysis methods on large survey datasets
We need more funding
Talk to me… I have a lot of problems (ahem) but always want more!
Alexander Gray, agray@cc.gatech.edu, www.fast-lab.org

28 Example: support vector machine
Data: IJCNN1 [DP01a] (2 classes, 49,990 training points, 91,701 testing points, 22 features)
 – SMO: 12,831 SVs, 84,360 iterations, 98.3% accuracy, 765 sec; complexity O(N^2)–O(N^3)
 – SFW: 4,145 SVs, 4,500 iterations, 98.1% accuracy, 21 sec; complexity O(N/ε + 1/ε^2)
[Ouyang and Gray 2009, in submission]

29 A few new ones…
Non-vector data (like time series, images, graphs, trees):
 – Functional ICA (Mehta and Gray 2008)
 – Generalized mean map kernel (Mehta and Gray, in submission)
Visualizing high-dimensional data:
 – Rank-based manifolds (Ouyang and Gray 2008)
 – Density-preserving maps (Ozakin and Gray, in submission)
 – Intrinsic dimension estimation (Ozakin and Gray, in prep)
Accounting for measurement errors
Probabilistic cross-match

30 2-point correlation
Characterization of an entire distribution? “How many pairs have distance < r?” This is the 2-point correlation function.
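
The question on this slide corresponds to the O(N^2) brute-force pair count sketched below (a baseline; the tree-based algorithms on the following slides count whole node pairs at once). The function name is illustrative.

```cpp
#include <vector>

// Brute-force 2-point counting: how many pairs have separation < r?
long long count_pairs(const std::vector<std::vector<double>>& pts, double r) {
    long long count = 0;
    double r2 = r * r;                                 // compare squared distances
    for (size_t i = 0; i < pts.size(); ++i)
        for (size_t j = i + 1; j < pts.size(); ++j) {
            double s = 0.0;
            for (size_t d = 0; d < pts[i].size(); ++d)
                s += (pts[i][d] - pts[j][d]) * (pts[i][d] - pts[j][d]);
            if (s < r2) ++count;
        }
    return count;
}
```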

31 The n-point correlation functions
Spatial inferences: filaments, clusters, voids, homogeneity, isotropy, 2-sample testing, …
Foundation for the theory of point processes [Daley, Vere-Jones 1972]; unifies spatial statistics [Ripley 1976]
Used heavily in biostatistics, cosmology, particle physics, statistical physics
2pcf and 3pcf definitions:
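
The slide's own formulas are not reproduced in this transcript; for reference, the conventional definitions from the point-process literature (which may differ in notation from the original slide) are:

```latex
% Joint probability of finding points of a process with mean density n in the
% volume elements dV_1, dV_2 (and dV_3), defining the 2pcf xi and the
% (connected) 3pcf zeta:
dP_{12}  = n^2 \,\bigl[\, 1 + \xi(r_{12}) \,\bigr]\, dV_1\, dV_2
dP_{123} = n^3 \,\bigl[\, 1 + \xi(r_{12}) + \xi(r_{23}) + \xi(r_{31})
                          + \zeta(r_{12}, r_{23}, r_{31}) \,\bigr]\, dV_1\, dV_2\, dV_3
```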

32 3-point correlation
“How many triples have pairwise distances < r1, r2, r3?” (a triangle with side constraints r1, r2, r3)
Standard model: n>0 terms should be zero!

33 How can we count n-tuples efficiently?
“How many triples have pairwise distances < r?”

34 Use n trees! [Gray & Moore, NIPS 2000]

35 “How many valid triangles a-b-c (where all pairwise distances are < r) could there be?”
count{A,B,C} = ?

36 “How many valid triangles a-b-c (where all pairwise distances are < r) could there be?”
count{A,B,C} = count{A,B,C.left} + count{A,B,C.right}

37 “How many valid triangles a-b-c (where all pairwise distances are < r) could there be?”
count{A,B,C} = count{A,B,C.left} + count{A,B,C.right}

38 “How many valid triangles a-b-c (where all pairwise distances are < r) could there be?”
count{A,B,C} = ?

39 “How many valid triangles a-b-c (where all pairwise distances are < r) could there be?”
Exclusion: count{A,B,C} = 0!

40 “How many valid triangles a-b-c (where all pairwise distances are < r) could there be?”
count{A,B,C} = ?

41 “How many valid triangles a-b-c (where all pairwise distances are < r) could there be?”
Inclusion: count{A,B,C} = |A| x |B| x |C|
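
Putting the walkthrough above into code: a hedged C++ sketch of the 3-tree counting recursion with the exclusion and inclusion prunes, reusing KdNode, min_box_dist and euclidean from the earlier sketches and adding a box-to-box maximum distance. It counts ordered triples against a single threshold r; the real statistic over one catalog must also avoid self-pairs and repeated counting, and the published algorithms handle the general (r1, r2, r3) case.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Largest possible distance between the bounding boxes of two nodes.
double max_box_dist(const KdNode& a, const KdNode& b) {
    double s = 0.0;
    for (size_t d = 0; d < a.lo.size(); ++d) {
        double far = std::max(a.hi[d] - b.lo[d], b.hi[d] - a.lo[d]);
        s += far * far;
    }
    return std::sqrt(s);
}

// Count triples (a, b, c) with a in A, b in B, c in C and all pairwise
// distances < r, using exclusion/inclusion pruning as in the slides.
long long count_triples(const KdNode& A, const KdNode& B, const KdNode& C,
                        const std::vector<std::vector<double>>& pts, double r) {
    // Exclusion: some pair of nodes is entirely farther apart than r.
    if (min_box_dist(A, B) >= r || min_box_dist(B, C) >= r || min_box_dist(A, C) >= r)
        return 0;
    // Inclusion: every pair of nodes is entirely closer than r, so every
    // triple qualifies: count |A| x |B| x |C| without touching the points.
    if (max_box_dist(A, B) < r && max_box_dist(B, C) < r && max_box_dist(A, C) < r)
        return (long long)(A.end - A.begin) * (long long)(B.end - B.begin) *
               (long long)(C.end - C.begin);
    // Base case: three leaves, check the remaining triples directly.
    if (A.is_leaf() && B.is_leaf() && C.is_leaf()) {
        long long cnt = 0;
        for (size_t i = A.begin; i < A.end; ++i)
            for (size_t j = B.begin; j < B.end; ++j)
                for (size_t k = C.begin; k < C.end; ++k)
                    if (euclidean(pts[i], pts[j]) < r &&
                        euclidean(pts[j], pts[k]) < r &&
                        euclidean(pts[i], pts[k]) < r)
                        ++cnt;
        return cnt;
    }
    // Otherwise split one non-leaf node (the slides split C) and recurse.
    if (!C.is_leaf())
        return count_triples(A, B, *C.left, pts, r) + count_triples(A, B, *C.right, pts, r);
    if (!B.is_leaf())
        return count_triples(A, *B.left, C, pts, r) + count_triples(A, *B.right, C, pts, r);
    return count_triples(*A.left, B, C, pts, r) + count_triples(*A.right, B, C, pts, r);
}
```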

