Scaling Multivariate Statistics to Massive Data: Algorithmic Problems and Approaches
Alexander Gray, Georgia Institute of Technology (www.fast-lab.org)

Core methods of statistics / machine learning / mining
1. Querying: spherical range-search O(N), orthogonal range-search O(N), spatial join O(N^2), nearest-neighbor O(N), all-nearest-neighbors O(N^2)
2. Density estimation: mixture of Gaussians, kernel density estimation O(N^2), kernel conditional density estimation O(N^3)
3. Regression: linear regression, kernel regression O(N^2), Gaussian process regression O(N^3)
4. Classification: decision tree, nearest-neighbor classifier O(N^2), nonparametric Bayes classifier O(N^2), support vector machine O(N^3)
5. Dimension reduction: principal component analysis, non-negative matrix factorization, kernel PCA O(N^3), maximum variance unfolding O(N^3)
6. Outlier detection: by density estimation or dimension reduction
7. Clustering: by density estimation or dimension reduction, k-means, mean-shift segmentation O(N^2), hierarchical clustering O(N^3)
8. Time series analysis: Kalman filter, hidden Markov model, trajectory tracking O(N^n)
9. Feature selection and causality: LASSO, L1 SVM, Gaussian graphical models, discrete graphical models
10. Fusion and matching: sequence alignment, bipartite matching O(N^3), n-point correlation 2-sample testing O(N^n)
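To make the quadratic costs above concrete, here is a minimal sketch (not the speaker's code) of naive kernel density estimation: every estimate sums a Gaussian kernel over all N reference points, which is the O(N^2) entry in item 2.

```python
# Illustrative sketch only: the naive O(N^2) kernel summation.
import numpy as np

def kde_naive(queries, refs, h):
    """Gaussian KDE estimates at `queries` given `refs`; O(Nq * Nr)."""
    d = queries.shape[1]
    # Pairwise squared distances, shape (Nq, Nr): the quadratic bottleneck.
    d2 = ((queries[:, None, :] - refs[None, :, :]) ** 2).sum(-1)
    k = np.exp(-0.5 * d2 / h**2)                  # unnormalized Gaussian kernel
    const = (2 * np.pi * h**2) ** (d / 2)         # Gaussian normalizer in d dims
    return k.sum(axis=1) / (len(refs) * const)

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 3))                    # N = 1000 points in 3-D
densities = kde_naive(x, x, h=0.5)
```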

Now pretty fast (2011)…
1. Querying: spherical range-search O(log N)*, orthogonal range-search O(log N)*, spatial join O(N)*, nearest-neighbor O(log N), all-nearest-neighbors O(N)
2. Density estimation: mixture of Gaussians, kernel density estimation O(N), kernel conditional density estimation O(N^log 3)*
3. Regression: linear regression, kernel regression O(N), Gaussian process regression O(N)*
4. Classification: decision tree, nearest-neighbor classifier O(N), nonparametric Bayes classifier O(N)*, support vector machine
5. Dimension reduction: principal component analysis, non-negative matrix factorization, kernel PCA O(N)*, maximum variance unfolding O(N)*
6. Outlier detection: by density estimation or dimension reduction
7. Clustering: by density estimation or dimension reduction, k-means, mean-shift segmentation O(N), hierarchical clustering O(N log N)
8. Time series analysis: Kalman filter, hidden Markov model, trajectory tracking O(N^log n)*
9. Feature selection and causality: LASSO, L1 SVM, Gaussian graphical models, discrete graphical models
10. Fusion and matching: sequence alignment, bipartite matching O(N)**, n-point correlation 2-sample testing O(N^log n)*

Things we made fast (*: fastest known; **: fastest in some settings)
1. Querying: spherical range-search O(log N)*, orthogonal range-search O(log N)*, spatial join O(N)*, nearest-neighbor O(log N), all-nearest-neighbors O(N)
2. Density estimation: mixture of Gaussians, kernel density estimation O(N), kernel conditional density estimation O(N^log 3)*
3. Regression: linear regression, kernel regression O(N), Gaussian process regression O(N)*
4. Classification: decision tree, nearest-neighbor classifier O(N), nonparametric Bayes classifier O(N)*, support vector machine O(N)/O(N^2)
5. Dimension reduction: principal component analysis, non-negative matrix factorization, kernel PCA O(N)*, maximum variance unfolding O(N)*
6. Outlier detection: by density estimation or dimension reduction
7. Clustering: by density estimation or dimension reduction, k-means, mean-shift segmentation O(N), hierarchical (FoF) clustering O(N log N)
8. Time series analysis: Kalman filter, hidden Markov model, trajectory tracking O(N^log n)*
9. Feature selection and causality: LASSO, L1 SVM, Gaussian graphical models, discrete graphical models
10. Fusion and matching: sequence alignment, bipartite matching O(N)**, n-point correlation 2-sample testing O(N^log n)*
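As an illustration of where the logarithmic query times above come from, the sketch below uses a kd-tree (scipy's cKDTree, an assumed stand-in; the talk's own tree-based algorithms differ) to answer nearest-neighbor queries in roughly O(log N) per query in low dimensions, versus O(N) for a linear scan.

```python
# Illustrative sketch: space-partitioning trees turn O(N) scans into
# ~O(log N) queries in low dimensions. cKDTree is a stand-in structure.
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
refs = rng.uniform(size=(100_000, 3))
tree = cKDTree(refs)                              # build: O(N log N)
dist, idx = tree.query(rng.uniform(size=(1000, 3)), k=1)  # ~O(log N) each
```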

Core computational problems
What are the basic mathematical operations that make these methods hard? Rather than speeding up each of the thousands of statistical methods individually, an alternative is to treat their common computational bottlenecks. Dividing up the space of problems (and the associated algorithmic strategies) lets us examine the unique challenges, and the possible ways forward, within each.

The 7 Giants of data
1. Basic statistics
2. Generalized N-body problems
3. Graph-theoretic problems
4. Linear-algebraic problems
5. Optimizations
6. Integrations
7. Alignment problems

The 7 Giants of data
1. Basic statistics: e.g. counts, contingency tables, means, medians, variances, range queries (SQL queries)
2. Generalized N-body problems: e.g. nearest-neighbors (in NLDR, etc.), kernel summations (in KDE, GP, SVM, etc.), clustering, MST, spatial correlations

The 7 Giants of data
3. Graph-theoretic problems: e.g. betweenness centrality, commute distance, graphical model inference
4. Linear-algebraic problems: e.g. linear algebra, PCA, Gaussian process regression, manifold learning
5. Optimizations: e.g. LP/QP/SDP/SOCP/MINLPs in regularized methods, compressed sensing

The 7 Giants of data
6. Integrations: e.g. Bayesian inference
7. Alignment problems: e.g. BLAST in genomics, string matching, phylogenies, SLAM, cross-match

Back to our list (the giants: basic, N-body, graphs, linear algebra, optimization, integration, alignment)
1. Querying: spherical range-search O(N), orthogonal range-search O(N), spatial join O(N^2), nearest-neighbor O(N), all-nearest-neighbors O(N^2)
2. Density estimation: mixture of Gaussians, kernel density estimation O(N^2), kernel conditional density estimation O(N^3)
3. Regression: linear regression, kernel regression O(N^2), Gaussian process regression O(N^3)
4. Classification: decision tree, nearest-neighbor classifier O(N^2), nonparametric Bayes classifier O(N^2), support vector machine O(N^3)
5. Dimension reduction: principal component analysis, non-negative matrix factorization, kernel PCA O(N^3), maximum variance unfolding O(N^3)
6. Outlier detection: by density estimation or dimension reduction
7. Clustering: by density estimation or dimension reduction, k-means, mean-shift segmentation O(N^2), hierarchical clustering O(N^3)
8. Time series analysis: Kalman filter, hidden Markov model, trajectory tracking O(N^n)
9. Feature selection and causality: LASSO, L1 SVM, Gaussian graphical models, discrete graphical models
10. Fusion and matching: sequence alignment, bipartite matching O(N^3), n-point correlation 2-sample testing O(N^n)

5 settings
1. Regular: batch, in-RAM/in-core, one CPU
2. Streaming (non-batch)
3. Disk (out-of-core)
4. Distributed: threads/multi-core (shared memory)
5. Distributed: clusters/cloud (distributed memory)
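For setting 3 (out-of-core), a minimal sketch of the standard pattern: memory-map a file too large for RAM and stream it through fixed-size chunks. The file name and sizes here are illustrative, not from the talk.

```python
# Illustrative out-of-core pattern: only one chunk is in RAM at a time.
import numpy as np

N, D, CHUNK = 1_000_000, 10, 100_000
data = np.memmap("big.dat", dtype=np.float64, mode="w+", shape=(N, D))
data[:] = 1.0                                     # stand-in for real contents

total = np.zeros(D)
for start in range(0, N, CHUNK):                  # stream chunk by chunk
    total += data[start:start + CHUNK].sum(axis=0)
mean = total / N
```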

4 common data types
1. Vector data, i.i.d.
2. Time series
3. Images
4. Graphs

3 desiderata
1. Fast experimental runtime/performance*
2. Fast theoretical (provable) runtime/performance*
3. Accuracy guarantees
*Performance: runtime, memory, communication, disk accesses; time-constrained, anytime, etc.

7 general solution strategies
1. Divide and conquer (indexing structures)
2. Dynamic programming
3. Function transforms
4. Random sampling (Monte Carlo)
5. Non-random sampling (active learning)
6. Parallelism
7. Problem reduction
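As a toy instance of strategy 4 (random sampling), the sketch below estimates the quadratic kernel sum from the earlier KDE example by averaging over a random subset of m reference points; the estimate is unbiased and its error shrinks as 1/sqrt(m), independent of N.

```python
# Illustrative Monte Carlo estimate of a kernel sum; sizes are toy values.
import numpy as np

rng = np.random.default_rng(0)
refs = rng.normal(size=(50_000, 3))
q = np.zeros(3)                                   # a single query point
h, m = 0.5, 2_000                                 # bandwidth, sample size

sample = refs[rng.choice(len(refs), size=m, replace=False)]
d2 = ((q - sample) ** 2).sum(axis=1)
estimate = np.exp(-0.5 * d2 / h**2).mean()        # ~ (1/N) * full kernel sum
```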

1. Summary statistics
Examples: counts, contingency tables, means, medians, variances, range queries (SQL queries)
What's unique/challenging: streaming, new guarantees
Promising/interesting:
- Sketching approaches
- AD-trees
- MapReduce/Hadoop (Aster, Greenplum, Netezza)
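One concrete answer to the streaming challenge named above is Welford's one-pass update, which maintains an exact running mean and variance in O(1) memory; a minimal sketch:

```python
# Welford's streaming mean/variance: one pass, constant memory.
def welford(stream):
    n, mean, m2 = 0, 0.0, 0.0
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)                  # uses the updated mean
    return mean, m2 / (n - 1)                     # mean, sample variance

mean, var = welford(iter([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))
```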

2. Generalized N-body problems
Examples: nearest-neighbors (in NLDR, etc.), kernel summations (in KDE, GP, SVM, etc.), clustering, MST, spatial correlations
What's unique/challenging: general dimension, non-Euclidean metrics, new guarantees (e.g. in rank)
Promising/interesting:
- Generalized/higher-order FMM: O(N^2) → O(N)
- Random projections
- GPUs
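The random-projection idea above can be sketched in a few lines: by the Johnson-Lindenstrauss lemma, a scaled Gaussian random matrix preserves pairwise distances to within a small factor with high probability, so an N-body method can run in the reduced dimension. A minimal sketch:

```python
# Illustrative Johnson-Lindenstrauss projection; dimensions are toy values.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5_000, 500))                 # N points in D = 500
k = 40                                            # target dimension
R = rng.normal(size=(500, k)) / np.sqrt(k)        # scaled Gaussian matrix
Y = X @ R                                         # pairwise distances roughly preserved
```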

3. Graph-theoretic problems
Examples: betweenness centrality, commute distance, graphical model inference
What's unique/challenging: high interconnectivity (cliques), out-of-core
Promising/interesting:
- Variational methods
- Stochastic composite likelihood methods
- MapReduce/Hadoop (Facebook, etc.)
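One cheap way to tame a graph giant is sampling. The snippet below (assuming networkx, not a library from the talk) approximates betweenness centrality from a subset of source vertices instead of running Brandes' exact algorithm from all N sources.

```python
# Illustrative sketch: approximate betweenness via sampled sources.
import networkx as nx

G = nx.erdos_renyi_graph(2_000, 0.005, seed=0)
approx = nx.betweenness_centrality(G, k=100, seed=0)  # 100 sampled sources
```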

4. Linear-algebraic problems
Examples: linear algebra, PCA, Gaussian process regression, manifold learning
What's unique/challenging: probabilistic guarantees, kernel matrices
Promising/interesting:
- Sampling-based methods
- Online methods
- Approximate matrix-vector multiply via N-body
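The sampling-based methods mentioned above include the randomized range finder of Halko, Martinsson, and Tropp; a minimal sketch of a rank-k randomized SVD, which avoids a full O(N^3) decomposition:

```python
# Illustrative randomized SVD via a random range basis.
import numpy as np

def randomized_svd(A, k, oversample=10, seed=0):
    rng = np.random.default_rng(seed)
    Omega = rng.normal(size=(A.shape[1], k + oversample))
    Q, _ = np.linalg.qr(A @ Omega)                # orthonormal basis for range(A)
    U_small, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
    return (Q @ U_small)[:, :k], s[:k], Vt[:k]

A = np.random.default_rng(1).normal(size=(3_000, 400))
U, s, Vt = randomized_svd(A, k=20)
```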

5. Optimizations
Examples: LP/QP/SDP/SOCP/MINLPs in regularized methods, compressed sensing
What's unique/challenging: stochastic programming, streaming
Promising/interesting:
- Reformulations/relaxations of various ML forms
- Online, mini-batch methods
- Parallel online methods
- Submodular functions
- Global (non-convex) optimization
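A minimal sketch of the online/mini-batch idea: stochastic gradient descent on least squares touches only a small batch per step, so the per-update cost is independent of N. The learning rate and batch size below are illustrative.

```python
# Illustrative mini-batch SGD for least squares; hyperparameters are toy values.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 10))
y = X @ np.arange(1.0, 11.0) + 0.1 * rng.normal(size=100_000)

w, lr, batch = np.zeros(10), 0.01, 64
for step in range(2_000):
    i = rng.choice(len(X), size=batch, replace=False)
    grad = 2 * X[i].T @ (X[i] @ w - y[i]) / batch # gradient on the batch only
    w -= lr * grad                                # w approaches the true weights
```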

6. Integrations
Examples: Bayesian inference
What's unique/challenging: general dimension
Promising/interesting:
- MCMC
- ABC (approximate Bayesian computation)
- Particle filtering
- Adaptive importance sampling, active learning
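For giant 6, a minimal random-walk Metropolis sketch: samples are drawn from an unnormalized posterior using only log-density evaluations, so no integral is ever computed directly. The toy target below is a standard normal.

```python
# Illustrative random-walk Metropolis sampler on a toy 1-D target.
import numpy as np

def log_post(theta):                              # toy unnormalized log-posterior
    return -0.5 * theta**2

rng = np.random.default_rng(0)
theta, samples = 0.0, []
for _ in range(10_000):
    prop = theta + 0.5 * rng.normal()             # random-walk proposal
    if np.log(rng.uniform()) < log_post(prop) - log_post(theta):
        theta = prop                              # accept; otherwise keep theta
    samples.append(theta)
```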

7. Alignments
Examples: BLAST in genomics, string matching, phylogenies, SLAM, cross-match
What's unique/challenging: greater heterogeneity, measurement errors
Promising/interesting:
- Probabilistic representations
- Reductions to generalized N-body problems
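Sequence alignment is the textbook dynamic-programming giant; below is a compact sketch of Needleman-Wunsch global alignment scoring (BLAST layers heuristics on top of this kind of recurrence). The scoring parameters are illustrative.

```python
# Illustrative Needleman-Wunsch global alignment score; O(len(a) * len(b)).
import numpy as np

def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    F = np.zeros((len(a) + 1, len(b) + 1))
    F[:, 0] = gap * np.arange(len(a) + 1)         # all-gap borders
    F[0, :] = gap * np.arange(len(b) + 1)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            s = match if a[i-1] == b[j-1] else mismatch
            F[i, j] = max(F[i-1, j-1] + s,        # align a[i-1] with b[j-1]
                          F[i-1, j] + gap,        # gap in b
                          F[i, j-1] + gap)        # gap in a
    return F[-1, -1]

print(nw_score("GATTACA", "GCATGCU"))
```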

Reductions/transformations between problems
- Gaussian graphical models → linear algebra
- Bayesian integration → MAP optimization
- Euclidean graphs → N-body problems
- Linear algebra on kernel matrices → N-body inside conjugate gradient
- Featurize a graph or any other structure → matrix-based ML problem
Such reductions create new ML methods with different computational properties.
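The "N-body inside conjugate gradient" reduction above can be sketched directly: solve (K + λI)x = y by handing CG only a matrix-vector product, which is exactly the kernel summation a fast N-body method would accelerate. The kernel, sizes, and λ below are illustrative.

```python
# Illustrative sketch: kernel solve via CG with only a matvec.
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
d2 = ((X[:, None] - X[None, :]) ** 2).sum(-1)
K = np.exp(-0.5 * d2)                             # Gaussian kernel matrix

def matvec(v):                                    # the N-body bottleneck:
    return K @ v + 1e-3 * v                       # here dense, O(N^2) per call

A = LinearOperator((500, 500), matvec=matvec)
x, info = cg(A, rng.normal(size=500))             # never factorizes K
```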

General conclusions
- Algorithms can dramatically change the runtime order, e.g. O(N^2) to O(N)
- High dimensionality is a persistent challenge
- The non-default settings (e.g. streaming, disk) need more research work
- Systems issues need more work, e.g. the connection to data storage/management
- Hadoop does not solve everything

General conclusions
- There is no general theory for the tradeoff between statistical quality and computational cost (lower/upper bounds, etc.)
- A richer account of hardness, both statistical and computational, is needed