Fast Algorithms for Analyzing Massive Data. Alexander Gray, Georgia Institute of Technology, www.fast-lab.org


Fast Algorithms for Analyzing Massive Data Alexander Gray Georgia Institute of Technology

The FASTlab: Fundamental Algorithmic and Statistical Tools Laboratory
1. Alexander Gray: Assoc. Prof., Applied Math + CS; PhD CS
2. Arkadas Ozakin: Research Scientist, Math + Physics; PhD Physics
3. Dongryeol Lee: PhD student, CS + Math
4. Ryan Riegel: PhD student, CS + Math
5. Sooraj Bhat: PhD student, CS
6. Nishant Mehta: PhD student, CS
7. Parikshit Ram: PhD student, CS + Math
8. William March: PhD student, Math + CS
9. Hua Ouyang: PhD student, CS
10. Ravi Sastry: PhD student, CS
11. Long Tran: PhD student, CS
12. Ryan Curtin: PhD student, EE
13. Ailar Javadi: PhD student, EE
14. Anita Zakrzewska: PhD student, CS
Plus MS students and undergraduates.

7 tasks of machine learning / data mining:
1. Querying: spherical range-search O(N), orthogonal range-search O(N), nearest-neighbor O(N), all-nearest-neighbors O(N^2)
2. Density estimation: mixture of Gaussians, kernel density estimation O(N^2), kernel conditional density estimation O(N^3) (a brute-force KDE sketch follows this list)
3. Classification: decision tree, nearest-neighbor classifier O(N^2), kernel discriminant analysis O(N^2), support vector machine O(N^3), Lp SVM
4. Regression: linear regression, LASSO, kernel regression O(N^2), Gaussian process regression O(N^3)
5. Dimension reduction: PCA, non-negative matrix factorization, kernel PCA O(N^3), maximum variance unfolding O(N^3); Gaussian graphical models, discrete graphical models
6. Clustering: k-means, mean-shift O(N^2), hierarchical (FoF) clustering O(N^3)
7. Testing and matching: MST O(N^3), bipartite cross-matching O(N^3), n-point correlation 2-sample testing O(N^n), kernel embedding
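To make the quadratic costs above concrete, here is a minimal sketch of brute-force Gaussian kernel density estimation: every query point is compared against every reference point, so evaluating the estimate at N points over N references costs O(N^2) kernel evaluations. The bandwidth h and the toy data are illustrative choices, not from the talk.

import numpy as np

def naive_kde(queries, references, h):
    # Brute-force Gaussian KDE: every (query, reference) pair is visited,
    # so the cost is O(N_queries * N_references) kernel evaluations.
    d = references.shape[1]
    norm = len(references) * (np.sqrt(2.0 * np.pi) * h) ** d
    densities = np.empty(len(queries))
    for i, q in enumerate(queries):
        sq_dists = np.sum((references - q) ** 2, axis=1)
        densities[i] = np.sum(np.exp(-sq_dists / (2.0 * h * h))) / norm
    return densities

# toy usage: estimate the density of 1,000 2-D points at the points themselves
rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 2))
print(naive_kde(x, x, h=0.5)[:5])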

7 tasks of machine learning / data mining (with our recent methods added):
1. Querying: spherical range-search O(N), orthogonal range-search O(N), nearest-neighbor O(N), all-nearest-neighbors O(N^2)
2. Density estimation: mixture of Gaussians, kernel density estimation O(N^2), kernel conditional density estimation O(N^3), submanifold density estimation [Ozakin & Gray, NIPS 2010] O(N^3), convex adaptive kernel estimation [Sastry & Gray, AISTATS 2011] O(N^4)
3. Classification: decision tree, nearest-neighbor classifier O(N^2), kernel discriminant analysis O(N^2), support vector machine O(N^3), Lp SVM, non-negative SVM [Guan et al., 2011]
4. Regression: linear regression, LASSO, kernel regression O(N^2), Gaussian process regression O(N^3)
5. Dimension reduction: PCA, non-negative matrix factorization, kernel PCA O(N^3), maximum variance unfolding O(N^3); Gaussian graphical models, discrete graphical models; rank-preserving maps [Ouyang and Gray, ICML 2008] O(N^3); isometric separation maps [Vasiloglou, Gray, and Anderson, MLSP 2009] O(N^3); isometric NMF [Vasiloglou, Gray, and Anderson, MLSP 2009] O(N^3); functional ICA [Mehta and Gray, 2009]; density preserving maps [Ozakin and Gray, in prep] O(N^3)
6. Clustering: k-means, mean-shift O(N^2), hierarchical (FoF) clustering O(N^3)
7. Testing and matching: MST O(N^3), bipartite cross-matching O(N^3), n-point correlation 2-sample testing O(N^n), kernel embedding

The same 7 tasks, viewed another way: nearly every method above carries a naive cost of O(N^2), O(N^3), or worse. That scaling is the computational problem.

The "7 Giants" of Data (computational problem types) [Gray, Indyk, Mahoney, Szalay, in National Academy of Sciences report on Analysis of Massive Data, in prep]:
1. Basic statistics: means, covariances, etc.
2. Generalized N-body problems: distances, geometry
3. Graph-theoretic problems: discrete graphs
4. Linear-algebraic problems: matrix operations
5. Optimizations: unconstrained, convex
6. Integrations: general dimension
7. Alignment problems: dynamic programming, matching

7 general strategies:
1. Divide and conquer / indexing (trees)
2. Function transforms (series)
3. Sampling (Monte Carlo, active learning)
4. Locality (caching)
5. Streaming (online)
6. Parallelism (clusters, GPUs)
7. Problem transformation (reformulations)

1. Divide and conquer
Fastest approach for:
– nearest neighbor, range search (exact) ~O(log N) [Bentley 1970]; all-nearest-neighbors (exact) O(N) [Gray & Moore, NIPS 2000], [Ram, Lee, March, Gray, NIPS 2010]; anytime nearest neighbor (exact) [Ram & Gray, SDM 2012]; max inner product [Ram & Gray, under review]
– mixture of Gaussians [Moore, NIPS 1999]; k-means [Pelleg and Moore, KDD 1999]; mean-shift clustering O(N) [Lee & Gray, AISTATS 2009]; hierarchical clustering (single linkage, friends-of-friends) O(N log N) [March & Gray, KDD 2010]
– nearest neighbor classification [Liu, Moore, Gray, NIPS 2004]; kernel discriminant analysis O(N) [Riegel & Gray, SDM 2008]
– n-point correlation functions ~O(N^log n) [Gray & Moore, NIPS 2000], [Moore et al., Mining the Sky 2000]; multi-matcher jackknifed npcf [March & Gray, under review]
(A generic kd-tree nearest-neighbor sketch follows.)
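The tree-based methods above share the divide-and-conquer pattern illustrated by this generic single-tree kd-tree nearest-neighbor sketch, a simplification of the cited dual-tree algorithms: partition the data recursively, then prune any subtree whose bounding region provably cannot contain a closer point. The leaf size and toy data are illustrative.

import numpy as np

class Node:
    def __init__(self, points):
        self.points = points          # only used at leaves
        self.left = self.right = None
        self.split_dim = self.split_val = None

def build(points, leaf_size=16):
    # Divide: split on the widest dimension at the median until leaves are small.
    node = Node(points)
    if len(points) > leaf_size:
        d = int(np.argmax(points.max(axis=0) - points.min(axis=0)))
        order = np.argsort(points[:, d])
        mid = len(points) // 2
        node.split_dim, node.split_val = d, float(points[order[mid], d])
        node.left = build(points[order[:mid]], leaf_size)
        node.right = build(points[order[mid:]], leaf_size)
    return node

def nearest(node, q, best=(np.inf, None)):
    # Conquer: brute force at leaves; prune a subtree whenever the splitting
    # plane is already farther away than the best distance found so far.
    if node.left is None:
        d2 = np.sum((node.points - q) ** 2, axis=1)
        i = int(np.argmin(d2))
        return (float(d2[i]), node.points[i]) if d2[i] < best[0] else best
    near, far = ((node.left, node.right) if q[node.split_dim] <= node.split_val
                 else (node.right, node.left))
    best = nearest(near, q, best)
    if (q[node.split_dim] - node.split_val) ** 2 < best[0]:
        best = nearest(far, q, best)
    return best

rng = np.random.default_rng(1)
data = rng.random((10000, 3))
tree = build(data)
dist_sq, neighbor = nearest(tree, rng.random(3))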

3-point correlation (biggest previous: 20K points)
VIRGO simulation data, N = 75,000,000
– naive: 5 x 10^9 sec. (~150 years)
– multi-tree: 55 sec. (exact)
Scaling with n: n=2: O(N); n=3: O(N^log 3); n=4: O(N^2)

n-point correlation runtimes on galaxy simulation data, comparing naive O(N^n) (estimated), the single-bandwidth algorithm [Gray & Moore 2000, Moore et al. 2000], and the new multi-bandwidth algorithm [March & Gray, in prep 2010]:
– 2-point correlation, 100 matchers: naive 2.0 x 10^7 s; 352.8 s; 56,... s
– 3-point correlation, 243 matchers: naive 1.1 x 10^... s; ... s; 1.23 x 10^... s
– 4-point correlation, 216 matchers: naive 2.3 x 10^... s; ... s; 1.58 x 10^... s

2. Function transforms
Fastest approach for:
– Kernel estimation (low-ish dimension): dual-tree fast Gauss transforms (multipole/Hermite expansions) [Lee, Gray, Moore, NIPS 2005], [Lee and Gray, UAI 2006]
– KDE and GP (kernel density estimation, Gaussian process regression) in high dimension: random Fourier functions [Lee and Gray, in prep] (sketched below)
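For intuition about the second bullet, here is a sketch of the standard random Fourier feature construction (Rahimi-Recht style) for approximating a Gaussian kernel; the cited [Lee and Gray, in prep] work may differ in its details, and the bandwidth and feature count below are illustrative.

import numpy as np

def random_fourier_features(X, n_features, bandwidth, rng):
    # Map X so that Z @ Z.T approximates the Gaussian kernel matrix
    # K_ij = exp(-||x_i - x_j||^2 / (2 * bandwidth^2)).
    W = rng.normal(scale=1.0 / bandwidth, size=(X.shape[1], n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
Z = random_fourier_features(X, n_features=2000, bandwidth=2.0, rng=rng)
approx = Z @ Z.T                                        # 500 x 500 approximate kernel matrix
exact = np.exp(-np.sum((X[:, None] - X[None]) ** 2, axis=-1) / (2.0 * 2.0 ** 2))
print(np.abs(approx - exact).max())                     # error shrinks as n_features grows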

3. Sampling
Fastest approach for (approximate):
– PCA: cosine trees [Holmes, Gray, Isbell, NIPS 2008]
– Kernel estimation: bandwidth learning [Holmes, Gray, Isbell, NIPS 2006], [Holmes, Gray, Isbell, UAI 2007]; Monte Carlo multipole method (with SVD trees) [Lee & Gray, NIPS 2009]
– Nearest-neighbor: distance-approximate: spill trees with random projections [Liu, Moore, Gray, Yang, NIPS 2004]; rank-approximate: [Ram, Ouyang, Gray, NIPS 2009]
Rank-approximate NN: the best meaning-retaining approximation criterion in the face of high-dimensional distances; more accurate than LSH.
(A plain Monte Carlo kernel-sum sketch follows.)
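The flavor of Monte Carlo kernel summation can be shown in a few lines: estimate the full kernel sum from a random subset and scale up. This is only the plain sampling idea, not the Monte Carlo multipole method with SVD trees cited above; the sample size and bandwidth are illustrative.

import numpy as np

def mc_kernel_sum(q, references, h, sample_size, rng):
    # Estimate sum_j exp(-||q - x_j||^2 / (2 h^2)) from a random subset:
    # average the kernel over the sample, then scale by the full count N.
    idx = rng.choice(len(references), size=sample_size, replace=False)
    sq = np.sum((references[idx] - q) ** 2, axis=1)
    return len(references) * np.mean(np.exp(-sq / (2.0 * h * h)))

rng = np.random.default_rng(0)
X = rng.normal(size=(100000, 5))
q = np.zeros(5)
estimate = mc_kernel_sum(q, X, h=1.0, sample_size=2000, rng=rng)
exact = np.sum(np.exp(-np.sum((X - q) ** 2, axis=1) / 2.0))
print(estimate, exact)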

3. Sampling (continued)
Active learning: the sampling can depend on previous samples.
– Linear classifiers: rigorous framework for pool-based active learning [Sastry and Gray, AISTATS 2012]
– Empirically allows a reduction in the number of objects that require labeling
– Theoretical rigor: unbiasedness
(An uncertainty-sampling sketch follows.)
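As a rough illustration of pool-based active learning, the sketch below uses simple uncertainty sampling with a logistic-regression classifier (scikit-learn assumed available); the cited [Sastry and Gray, AISTATS 2012] framework uses a different, unbiased querying scheme, and the query budget and toy data here are illustrative.

import numpy as np
from sklearn.linear_model import LogisticRegression

def uncertainty_sampling(X_pool, y_pool, n_init=10, n_queries=40, seed=0):
    # Pool-based active learning via plain uncertainty sampling: repeatedly
    # label the pool point the current classifier is least certain about.
    rng = np.random.default_rng(seed)
    # seed the labeled set with one example of each class plus a few random ones
    labeled = [int(np.argmin(y_pool)), int(np.argmax(y_pool))]
    labeled += list(rng.choice(len(X_pool), size=n_init, replace=False))
    for _ in range(n_queries):
        clf = LogisticRegression().fit(X_pool[labeled], y_pool[labeled])
        p = clf.predict_proba(X_pool)[:, 1]
        uncertainty = -np.abs(p - 0.5)          # closest to the decision boundary
        uncertainty[labeled] = -np.inf          # never re-query a labeled point
        labeled.append(int(np.argmax(uncertainty)))
    return clf, labeled

# toy pool: two Gaussian blobs with labels 0 and 1
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1.0, 1.0, (500, 2)), rng.normal(1.0, 1.0, (500, 2))])
y = np.repeat([0, 1], 500)
clf, queried = uncertainty_sampling(X, y)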

4. Caching
Fastest approach for (using disk):
– Nearest-neighbor, 2-point: disk-based tree algorithms in Microsoft SQL Server [Riegel, Aditya, Budavari, Gray, in prep]
– Builds a kd-tree on top of the built-in B-trees
– Fixed-pass algorithm to build the kd-tree

No. of points    MLDB (dual-tree)    Naive
40,000           8 seconds           159 seconds
200,000          43 seconds          3480 seconds
2,000,000        ... seconds         80 hours
10,000,000       29 min 27 sec       74 days
20,000,000       58 min 48 sec       280 days
40,000,000       112 min 32 sec      2 years

5. Streaming / online
Fastest approach for (approximate, or streaming):
– Online learning / stochastic optimization: use only the current sample to update the gradient
– SVM (squared hinge loss): stochastic Frank-Wolfe [Ouyang and Gray, SDM 2010]
– SVM, LASSO, et al.: noise-adaptive stochastic approximation [Ouyang and Gray, in prep, on arXiv]; accelerated non-smooth SGD [Ouyang and Gray, under review]: faster than SGD, solves the step-size problem, beats all existing convergence rates
(A baseline SGD sketch follows.)
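For reference, here is a plain stochastic (sub)gradient baseline for the squared-hinge-loss linear SVM: each update touches only the current sample. The cited stochastic Frank-Wolfe and noise-adaptive methods improve on this baseline; the step-size schedule and toy data below are illustrative.

import numpy as np

def sgd_squared_hinge(X, y, lam=1e-3, epochs=5, seed=0):
    # Stochastic (sub)gradient descent for
    #   min_w  lam/2 * ||w||^2 + (1/n) * sum_i max(0, 1 - y_i * w.x_i)^2
    # using one sample per update and a 1/(lam * t) step size.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, t = np.zeros(d), 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            t += 1
            margin = 1.0 - y[i] * (X[i] @ w)
            grad = lam * w
            if margin > 0:
                grad = grad - 2.0 * margin * y[i] * X[i]
            w = w - grad / (lam * t)
    return w

# toy usage: labels in {-1, +1} from a random linear rule
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))
y = np.sign(X @ rng.normal(size=10))
w = sgd_squared_hinge(X, y)
print(np.mean(np.sign(X @ w) == y))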

6. Parallelism
Fastest approach for (using many machines):
– KDE, GP, n-point: distributed trees [Lee and Gray, SDM 2012] (... cores); [March et al., in prep for Gordon Bell Prize 2012] (100K cores?)
– Each process owns the global tree and its local tree
– The first log p levels are built in parallel; each process determines where to send data
– Asynchronous averaging; provable convergence
– SVM, LASSO, et al.: distributed online optimization [Ouyang and Gray, in prep, on arXiv]: provable theoretical speedup for the first time
(A data-parallel kernel-summation sketch follows.)
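A minimal data-parallel sketch of the first bullet's idea: shard the reference data, compute partial kernel sums per shard in parallel, and combine. Real distributed-tree algorithms build and prune trees on each machine rather than brute-forcing each shard; here threads stand in for machines, and the shard count, bandwidth, and toy data are illustrative.

import numpy as np
from concurrent.futures import ThreadPoolExecutor

def partial_kernel_sums(queries, shard, h):
    # Kernel sums of one data shard against every query point.  In a real
    # distributed setting each shard (plus a tree built over it) lives on its
    # own machine and only these partial sums travel over the network.
    out = np.empty(len(queries))
    for i, q in enumerate(queries):
        out[i] = np.sum(np.exp(-np.sum((shard - q) ** 2, axis=1) / (2.0 * h * h)))
    return out

rng = np.random.default_rng(0)
references = rng.normal(size=(40000, 3))
queries = rng.normal(size=(100, 3))
shards = np.array_split(references, 4)                  # 4 "machines"
with ThreadPoolExecutor(max_workers=4) as pool:
    parts = list(pool.map(lambda s: partial_kernel_sums(queries, s, 0.5), shards))
kernel_sums = np.sum(parts, axis=0)                     # combine the partial results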

7. Transformations between problems
Change the problem type:
– Linear algebra on kernel matrices → N-body inside conjugate gradient [Gray, TR 2004] (sketched below)
– Euclidean graphs → N-body problems [March & Gray, KDD 2010]
– HMM as graph → matrix factorization [Tran & Gray, in prep]
Optimizations: reformulate the objective and constraints:
– Maximum variance unfolding: SDP via Burer-Monteiro convex relaxation [Vasiloglou, Gray, Anderson, MLSP 2009]
– Lq SVM, 0 < q < 1: DC programming [Guan & Gray, CSDA 2011]
– L0 SVM: mixed integer nonlinear program via perspective cuts [Guan & Gray, under review]
– Do reformulations automatically [Agarwal et al., PADL 2010], [Bhat et al., POPL 2012]
Create new ML methods with desired computational properties:
– Density estimation trees: nonparametric density estimation, O(N log N) [Ram & Gray, KDD 2011]
– Local linear SVMs: nonlinear classification, O(N log N) [Sastry & Gray, under review]
– Discriminative local coding: nonlinear classification, O(N log N) [Mehta & Gray, under review]
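The first transformation (kernel-matrix linear algebra as N-body inside conjugate gradient) can be sketched directly: solve (K + lambda*I) alpha = y with CG, where the only access to K is a matrix-vector product that is itself a kernel summation, i.e., exactly the piece a fast N-body method would accelerate. The Gaussian kernel, regularization, and toy data are illustrative.

import numpy as np

def kernel_matvec(X, v, h):
    # (K v)_i = sum_j exp(-||x_i - x_j||^2 / (2 h^2)) * v_j, computed naively
    # in O(N^2); this summation is what a fast N-body method would replace.
    out = np.empty(len(X))
    for i, xi in enumerate(X):
        k = np.exp(-np.sum((X - xi) ** 2, axis=1) / (2.0 * h * h))
        out[i] = k @ v
    return out

def conjugate_gradient(matvec, b, tol=1e-8, max_iter=200):
    # Standard CG for a symmetric positive-definite system A x = b,
    # touching A only through matrix-vector products.
    x = np.zeros_like(b)
    r = b - matvec(x)
    p, rs = r.copy(), r @ r
    for _ in range(max_iter):
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# toy kernel ridge regression: solve (K + lam * I) alpha = y
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=500)
lam, h = 1e-2, 1.0
alpha = conjugate_gradient(lambda v: kernel_matvec(X, v, h) + lam * v, y)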

Software
For academic use only: MLPACK
– Open source, C++, written by students
– Data must fit in RAM; a distributed version is in progress
For institutions: Skytree Server
– The first commercial-grade high-performance machine learning server
– Fastest, biggest ML available: up to 10,000x faster than existing solutions (on one machine)
– V.12, April 2012-ish: distributed, streaming
– Connects to stats packages, Matlab, DBMS, Python, etc.
– Colleagues: contact me to try it out.