Computational Mathematics for Large-scale Data Analysis Alexander Gray Georgia Institute of Technology College of Computing FASTlab: Fundamental Algorithmic and Statistical Tools Laboratory

The FASTlab: Fundamental Algorithmic and Statistical Tools Laboratory
1. Arkadas Ozakin: Research scientist, PhD Theoretical Physics
2. Dong Ryeol Lee: PhD student, CS + Math
3. Ryan Riegel: PhD student, CS + Math
4. Parikshit Ram: PhD student, CS + Math
5. William March: PhD student, Math + CS
6. James Waters: PhD student, Physics + CS
7. Hua Ouyang: PhD student, CS
8. Sooraj Bhat: PhD student, CS
9. Ravi Sastry: PhD student, CS
10. Long Tran: PhD student, CS
11. Michael Holmes: PhD student, CS + Physics (co-supervised)
12. Nikolaos Vasiloglou: PhD student, EE (co-supervised)
13. Wei Guan: PhD student, CS (co-supervised)
14. Nishant Mehta: PhD student, CS (co-supervised)
15. Wee Chin Wong: PhD student, ChemE (co-supervised)
16. Abhimanyu Aditya: MS student, CS
17. Yatin Kanetkar: MS student, CS
18. Praveen Krishnaiah: MS student, CS
19. Devika Karnik: MS student, CS
20. Prasad Jakka: MS student, CS

Exponential growth in dataset sizes [Science, Szalay & J. Gray, 2001]
Data: CMB Maps: 1990 COBE 1,000; Boomerang 10,000; CBI 50,000; WMAP 1 Million; 2008 Planck 10 Million
Data: Local Redshift Surveys: 1986 CfA 3,500; LCRS 23,000; 2dF 250,000; SDSS 800,000
Data: Angular Surveys: 1970 Lick 1M; 1990 APM 2M; 2005 SDSS 200M; 2010 LSST 2B

The astrophysicist asks:
1. How did galaxies evolve?
2. What was the early universe like?
3. Does dark energy exist?
4. Is our model (GR+inflation) right?

R. Nichol (Inst. Cosmol. Grav.), A. Connolly (U. Pitt Physics), C. Miller (NOAO), R. Brunner (NCSA), G. Djorgovsky (Caltech), G. Kulkarni (Inst. Cosmol. Grav.), D. Wake (Inst. Cosmol. Grav.), R. Scranton (U. Pitt Physics), M. Balogh (U. Waterloo Physics), I. Szapudi (U. Hawaii Inst. Astro.), G. Richards (Princeton Physics), A. Szalay (Johns Hopkins Physics)

The machine learning / statistics guy answers with methods, whose naïve costs are the problem:
- kernel density estimator: O(N^2)
- n-point spatial statistics: O(N^n)
- nonparametric Bayes classifier: O(N^2)
- support vector machine: O(N^3)
- nearest-neighbor statistics: O(N^2)
- Gaussian process regression: O(N^3)
- hierarchical clustering: O(N^3)
(and the curse of dimensionality adds a factor like O(c^D T(N)))

"But I have 1 million points!"

Uncertainty quantification
Many types, and all generally lead to more computational expense:
- Uncertainty of statistical model: cross-validation, bootstrap
- Uncertainty of simulation: run many times, estimate a model each time
- Uncertainty of simulation vs. real data: expensive spatial statistics
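
To make the first item concrete, here is a minimal percentile-bootstrap sketch in Python (numpy only; the statistic, the lognormal sample, and 1,000 resamples are illustrative choices, not anything prescribed by the slide). The point is the cost: every resample repeats the full fit.

```python
import numpy as np

def bootstrap_ci(data, statistic=np.median, n_boot=1000, alpha=0.05, seed=None):
    """Percentile bootstrap CI; each resample repeats the whole fit,
    so uncertainty costs n_boot times the original computation."""
    rng = np.random.default_rng(seed)
    n = len(data)
    stats = np.array([statistic(rng.choice(data, size=n, replace=True))
                      for _ in range(n_boot)])
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return statistic(data), (lo, hi)

x = np.random.default_rng(0).lognormal(size=500)
est, (lo, hi) = bootstrap_ci(x)
print(f"median = {est:.3f}, 95% CI = ({lo:.3f}, {hi:.3f})")
```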

How to do Stats/Machine Learning on Very Large Survey Datasets?
1. Choose the appropriate statistical task and method for the scientific question
2. Use the fastest algorithm and data structure for the statistical method
3. Apply method to the data!

10 data analysis problems, and scalable tools we'd like for them
1. Querying (e.g. characterizing a region of space):
   - spherical range-search O(N)
   - orthogonal range-search O(N)
   - k-nearest-neighbors O(N)
   - all-k-nearest-neighbors O(N^2)
2. Density estimation (e.g. comparing galaxy types):
   - mixture of Gaussians
   - kernel density estimation O(N^2)
   - L2 density tree [Ram and Gray, in prep]
   - manifold kernel density estimation O(N^3) [Ozakin and Gray 2008, to be submitted]
   - hyper-kernel density estimation O(N^4) [Sastry and Gray 2008, submitted]

10 data analysis problems, and scalable tools we'd like for them
3. Regression (e.g. photometric redshifts):
   - linear regression O(D^2)
   - kernel regression O(N^2)
   - Gaussian process regression/kriging O(N^3)
4. Classification (e.g. quasar detection, star-galaxy separation):
   - k-nearest-neighbor classifier O(N^2)
   - nonparametric Bayes classifier O(N^2)
   - support vector machine (SVM) O(N^3)
   - non-negative SVM O(N^3) [Guan and Gray, in prep]
   - false-positive-limiting SVM O(N^3) [Sastry and Gray, in prep]
   - separation map O(N^3) [Vasiloglou, Gray, and Anderson 2008, submitted]

10 data analysis problems, and scalable tools we'd like for them
5. Dimension reduction (e.g. galaxy or spectra characterization):
   - principal component analysis O(D^2)
   - non-negative matrix factorization
   - kernel PCA O(N^3)
   - maximum variance unfolding O(N^3)
   - co-occurrence embedding O(N^3) [Ozakin and Gray, in prep]
   - rank-based manifolds O(N^3) [Ouyang and Gray 2008, ICML]
   - isometric non-negative matrix factorization O(N^3) [Vasiloglou, Gray, and Anderson 2008, submitted]
6. Outlier detection (e.g. new object types, data cleaning):
   - by density estimation, by dimension reduction
   - by robust Lp estimation [Ram, Riegel and Gray, in prep]

10 data analysis problems, and scalable tools we'd like for them
7. Clustering (e.g. automatic Hubble sequence):
   - by dimension reduction, by density estimation
   - k-means
   - mean-shift segmentation O(N^2)
   - hierarchical clustering ("friends-of-friends") O(N^3)
8. Time series analysis (e.g. asteroid tracking, variable objects):
   - Kalman filter O(D^2)
   - hidden Markov model O(D^2)
   - trajectory tracking O(N^n)
   - Markov matrix factorization [Tran, Wong, and Gray 2008, submitted]
   - functional independent component analysis [Mehta and Gray 2008, submitted]

10 data analysis problems, and scalable tools we'd like for them
9. Feature selection and causality (e.g. which features predict star/galaxy):
   - LASSO regression
   - L1 SVM
   - Gaussian graphical model inference and structure search
   - discrete graphical model inference and structure search
   - 0-1 feature-selecting SVM [Guan and Gray, in prep]
   - L1 Gaussian graphical model inference and structure search [Tran, Lee, Holmes, and Gray, in prep]
10. 2-sample testing and matching (e.g. cosmological validation, multiple surveys):
   - minimum spanning tree O(N^3)
   - n-point correlation O(N^n)
   - bipartite matching/Gaussian graphical model inference O(N^3) [Waters and Gray, in prep]

How to do Stats/Machine Learning on Very Large Survey Datasets?
1. Choose the appropriate statistical task and method for the scientific question
2. Use the fastest algorithm and data structure for the statistical method
3. Apply method to the data!

Core computational problems
What are the basic mathematical operations, or bottleneck subroutines, that we can focus on developing fast algorithms for?

Core computational problems
Aggregations, GNPs (generalized N-body problems), graphs, linear algebra, optimization, integration
- Querying: nearest-neighbor, spherical range-search, orthogonal range-search, all-nearest-neighbors
- Density estimation: kernel density estimation, mixture of Gaussians
- Regression: linear regression, kernel regression, Gaussian process regression
- Classification: nearest-neighbor classifier, nonparametric Bayes classifier, support vector machine
- Dimension reduction: principal component analysis, non-negative matrix factorization, kernel PCA, maximum variance unfolding
- Outlier detection: by robust L2 estimation, by density estimation, by dimension reduction
- Clustering: k-means, mean-shift, hierarchical clustering ("friends-of-friends"), by dimension reduction, by density estimation
- Time series analysis: Kalman filter, hidden Markov model, trajectory tracking
- Feature selection and causality: LASSO regression, L1 support vector machine, Gaussian graphical models, discrete graphical models
- 2-sample testing and matching: n-point correlation, bipartite matching

kd-trees: the most widely-used space-partitioning tree [Bentley 1975], [Friedman, Bentley & Finkel 1977], [Moore & Lee 1995]
Simple example: spherical range search. How can we compute this efficiently?

A kd-tree: level 1

A kd-tree: level 2

A kd-tree: level 3

A kd-tree: level 4

A kd-tree: level 5

A kd-tree: level 6
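
The level-by-level figures above correspond to recursive splits along the widest dimension. A toy build sketch in Python (numpy; the Node class, field names, and leaf_size are illustrative, not any particular library's API):

```python
import numpy as np

class Node:
    """kd-tree node: a bounding box plus either two children or a leaf's points."""
    def __init__(self, points):
        self.points = points
        self.lo = points.min(axis=0)   # bounding-box corners, used for pruning
        self.hi = points.max(axis=0)
        self.left = self.right = None

def build(points, leaf_size=32):
    node = Node(points)
    if len(points) > leaf_size:
        d = np.argmax(node.hi - node.lo)      # split the widest dimension
        order = np.argsort(points[:, d])
        mid = len(points) // 2                # median split, as in the figures
        node.left = build(points[order[:mid]], leaf_size)
        node.right = build(points[order[mid:]], leaf_size)
    return node
```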

Range-count recursive algorithm

Range-count recursive algorithm: Pruned! (inclusion)

Range-count recursive algorithm: Pruned! (exclusion)

Range-count recursive algorithm: the fastest practical algorithm [Bentley 1975]; our algorithms can use any tree.
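
A sketch of the recursion just illustrated, continuing the toy kd-tree above. Minimum and maximum distances from the query to a node's bounding box give the exclusion and inclusion prunes; only nodes straddling the radius are opened.

```python
def range_count(node, q, r):
    """Count points within distance r of query q, pruning whole nodes."""
    nearest = np.maximum(node.lo - q, np.maximum(q - node.hi, 0.0))
    d_min = np.linalg.norm(nearest)          # closest any point in the node can be
    farthest = np.maximum(node.hi - q, q - node.lo)
    d_max = np.linalg.norm(farthest)         # farthest any point in the node can be
    if d_min > r:
        return 0                             # exclusion prune
    if d_max <= r:
        return len(node.points)              # inclusion prune
    if node.left is None:                    # leaf: check its points exhaustively
        return int(np.sum(np.linalg.norm(node.points - q, axis=1) <= r))
    return range_count(node.left, q, r) + range_count(node.right, q, r)

# usage: range_count(build(X), query_point, radius)
```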

How to do Stats/Machine Learning on Very Large Survey Datasets?
1. Choose the appropriate statistical task and method for the scientific question
2. Use the fastest algorithm and data structure for the statistical method
3. Apply method to the data!

All-k-nearest-neighbors
Applications: querying, first step of adaptive density estimation [Budavari et al., in prep]
Naïve cost: O(N^2)
Fastest algorithm: generalized fast multipole method, aka multi-tree methods
- Technique: use two trees simultaneously
- [Gray and Moore 2001, NIPS; Riegel, Boyer and Gray, to be submitted]
- O(N), exact
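
The dual-tree algorithm itself is beyond a slide, but the task is easy to state in code. A baseline sketch with scipy's single-tree queries (one traversal per query point, rather than two trees at once; data and k are placeholders):

```python
import numpy as np
from scipy.spatial import cKDTree

X = np.random.default_rng(0).random((100_000, 3))
tree = cKDTree(X)
# k+1 because each point's nearest neighbor in its own set is itself
dist, idx = tree.query(X, k=4)
all_3nn = idx[:, 1:]     # each point's 3 nearest neighbors, excluding itself
```

The dual-tree method shares work across all N queries at once; the loop implicit in the call above redoes the tree descent independently for each query.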

Kernel density estimation
Applications: accurate density estimates, e.g. galaxy types [Balogh et al. 2004]
Naïve cost: O(N^2)
Fastest algorithms: dual-tree fast Gauss transform, Monte Carlo multipole method
- Technique: Hermite operators, Monte Carlo
- [Lee, Gray and Moore 2005, NIPS; Lee and Gray 2006, UAI; Holmes, Gray and Isbell 2007, NIPS; Lee and Gray 2008, NIPS]
- O(N), relative error guarantee
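
For a sense of what this looks like in practice, scikit-learn's KernelDensity already runs on a kd-tree or ball tree with a relative-error tolerance, in the spirit of the guarantee above (the bandwidth and rtol here are illustrative values, not recommendations):

```python
import numpy as np
from sklearn.neighbors import KernelDensity

X = np.random.default_rng(0).normal(size=(50_000, 2))
# rtol trades accuracy for speed: answers within 0.1% relative error
kde = KernelDensity(kernel='gaussian', bandwidth=0.1, rtol=1e-3).fit(X)
log_density = kde.score_samples(X[:1000])   # log of the estimated density
```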

Nonparametric Bayes classification (or KDA)
Applications: accurate classification, e.g. quasar detection [Richards et al. 2004, 2009], [Scranton et al. 2005]
Naïve cost: O(N^2)
Fastest algorithm: dual-tree bounding with hybrid tree expansion
- Technique: bound each posterior
- [Liu, Moore, and Gray 2004; Gray and Riegel 2004, CompStat; Riegel and Gray 2007, SDM]
- O(N), exact
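
A minimal sketch of the statistical method itself (not the fast dual-tree algorithm): fit one kernel density estimate per class, then classify by Bayes' rule. The function names and the bandwidth are illustrative.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

def kda_fit(X, y, bandwidth=0.2):
    """One kernel density estimate per class, plus empirical class priors."""
    classes = np.unique(y)
    kdes = {c: KernelDensity(bandwidth=bandwidth).fit(X[y == c]) for c in classes}
    priors = {c: np.mean(y == c) for c in classes}
    return classes, kdes, priors

def kda_predict(Xq, classes, kdes, priors):
    """Pick the class maximizing log p(x|c) + log p(c) at each query point."""
    scores = np.stack([kdes[c].score_samples(Xq) + np.log(priors[c])
                       for c in classes], axis=1)
    return classes[np.argmax(scores, axis=1)]
```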

Friends-of-friends
Applications: cluster-finding, 2-sample testing
Naïve cost: O(N^3)
Fastest algorithm: dual-tree Borůvka algorithm
- Technique: similar to all-k-nn
- [March and Gray, to be submitted]
- O(N log N), exact
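
Friends-of-friends at a fixed linking length is just connected components of the "distance < linking length" graph. A small sketch (tree-based pair search plus union-find; this is a workable baseline, not the dual-tree Borůvka algorithm cited above):

```python
import numpy as np
from scipy.spatial import cKDTree

def friends_of_friends(X, linking_length):
    """Label each point with its group: connected components of close pairs."""
    parent = np.arange(len(X))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]    # path halving
            i = parent[i]
        return i
    for i, j in cKDTree(X).query_pairs(r=linking_length):
        parent[find(i)] = find(j)            # merge the two groups
    return np.array([find(i) for i in range(len(X))])
```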

n-point correlations
Applications: 2-sample testing, e.g. AGN [Wake et al. 2004], ISW [Giannantonio et al. 2006], 3pt validation [Kulkarni et al. 2007]
Naïve cost: O(N^n)
Fastest algorithm: n-tree algorithms
- Technique: use n trees simultaneously
- [Gray and Moore 2001, NIPS; Moore et al. 2001, Mining the Sky; Waters, Riegel and Gray, to be submitted]
- O(N^(log n)), exact

3-point runtime (biggest previous: 20K)
VIRGO simulation data, N = 75,000,000
naïve: 5×10^9 sec. (~150 years); multi-tree: 55 sec. (exact)
n=2: O(N); n=3: O(N^(log 3)); n=4: O(N^2)

Now fast! (each of these is now very fast, or as fast as possible (conjecture))
- Querying: nearest-neighbor, spherical range-search, orthogonal range-search, all-nearest-neighbors
- Density estimation: kernel density estimation, mixture of Gaussians
- Regression: linear regression, kernel regression, Gaussian process regression
- Classification: nearest-neighbor classifier, nonparametric Bayes classifier, support vector machine
- Dimension reduction: principal component analysis, non-negative matrix factorization, kernel PCA, maximum variance unfolding
- Outlier detection: by robust L2 estimation
- Clustering: k-means, mean-shift, hierarchical clustering ("friends-of-friends")
- Time series analysis: Kalman filter, hidden Markov model, trajectory tracking
- Feature selection and causality: LASSO regression, L1 support vector machine, Gaussian graphical models, discrete graphical models
- 2-sample testing and matching: n-point correlation, bipartite matching

The end
You can now use many of the state-of-the-art data analysis methods on large datasets:
- MLPACK software
- CAS Analytics in SQL Server
Computational astronomy workshop and large-scale data analysis workshop coming soon.
Alexander Gray (email is best; webpage sorely out of date)

2-point correlation
Characterization of an entire distribution? "How many pairs have distance < r?"
This count, as a function of r, gives the 2-point correlation function.
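
Counting "pairs with distance < r" at a grid of radii is a library call in scipy, whose count_neighbors is itself a dual-tree pair counter (the data and radii below are placeholders):

```python
import numpy as np
from scipy.spatial import cKDTree

data = np.random.default_rng(0).random((200_000, 3))
tree = cKDTree(data)
radii = np.linspace(0.01, 0.10, 10)
# ordered pairs within each radius, including each point paired with itself
counts = tree.count_neighbors(tree, radii)
dd = (counts - len(data)) // 2    # unordered distinct pairs DD(<r)
```

Estimators of the correlation function ξ(r), such as Landy-Szalay, combine these DD counts with pair counts against a random catalog.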

The n-point correlation functions
Spatial inferences: filaments, clusters, voids, homogeneity, isotropy, 2-sample testing, ...
Foundation for the theory of point processes [Daley, Vere-Jones 1972]; unifies spatial statistics [Ripley 1976]
Used heavily in biostatistics, cosmology, particle physics, statistical physics
2pcf definition: dP₁₂ = n̄² [1 + ξ(r)] dV₁ dV₂
3pcf definition: dP₁₂₃ = n̄³ [1 + ξ(r₁₂) + ξ(r₂₃) + ξ(r₁₃) + ζ(r₁₂, r₂₃, r₁₃)] dV₁ dV₂ dV₃
(n̄ is the mean density; ξ and ζ are the reduced 2- and 3-point correlation functions)

3-point correlation
"How many triples have pairwise distances < r?" (triangle sides r₁, r₂, r₃)
Standard model: n>0 terms should be zero!

How can we count n-tuples efficiently? "How many triples have pairwise distances < r?"

Use n trees! [Gray & Moore, NIPS 2000]

How many valid triangles a-b-c (where all pairwise distances < r) could there be between nodes A, B, and C? count{A,B,C} = ?

Split a node and recurse: count{A,B,C} = count{A,B,C.left} + count{A,B,C.right}

First the C.left half, then the C.right half.

For another configuration of A, B, C: count{A,B,C} = ?

Exclusion: some pair of nodes is farther apart than r, so count{A,B,C} = 0!

For a third configuration: count{A,B,C} = ?

Inclusion: all pairs of nodes lie entirely within r, so count{A,B,C} = |A| × |B| × |C|
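
A sketch of that three-tree recursion, reusing the toy kd-tree Node from the range-count example (three distinct point sets are assumed; counting triples within a single set needs extra symmetry handling). Bounding-box distance bounds between node pairs drive the exclusion and inclusion cases shown above.

```python
def min_max_dist(a, b):
    """Bounds on the distance between any point in node a and any point in b."""
    gap = np.maximum(np.maximum(a.lo - b.hi, b.lo - a.hi), 0.0)
    span = np.maximum(a.hi - b.lo, b.hi - a.lo)
    return np.linalg.norm(gap), np.linalg.norm(span)

def count_triples(A, B, C, r):
    """Count triples (a, b, c) from nodes A, B, C with all pairwise distances < r."""
    bounds = [min_max_dist(A, B), min_max_dist(B, C), min_max_dist(A, C)]
    if any(lo >= r for lo, _ in bounds):
        return 0                                              # exclusion prune
    if all(hi < r for _, hi in bounds):
        return len(A.points) * len(B.points) * len(C.points)  # inclusion prune
    node = max((A, B, C), key=lambda n: len(n.points))        # open the largest node
    if node.left is None:                                     # all leaves: brute force
        return sum(1 for a in A.points for b in B.points for c in C.points
                   if np.linalg.norm(a - b) < r
                   and np.linalg.norm(b - c) < r
                   and np.linalg.norm(a - c) < r)
    kids = (node.left, node.right)
    if node is A:
        return sum(count_triples(k, B, C, r) for k in kids)
    if node is B:
        return sum(count_triples(A, k, C, r) for k in kids)
    return sum(count_triples(A, B, k, r) for k in kids)
```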