Fast Algorithms & Data Structures for Visualization and Machine Learning on Massive Data Sets
Alexander Gray
Fundamental Algorithmic and Statistical Tools Laboratory (FASTlab)
Computational Science and Engineering Division, College of Computing, Georgia Institute of Technology

The FASTlab
Arkadas Ozakin: Research scientist, PhD theoretical physics
Dong Ryeol Lee: PhD student, CS + Math
Ryan Riegel: PhD student, CS + Math
Parikshit Ram: PhD student, CS + Math
William March: PhD student, Math + CS
James Waters: PhD student, Physics + CS
Nadeem Syed: PhD student, CS
Hua Ouyang: PhD student, CS
Sooraj Bhat: PhD student, CS
Ravi Sastry: PhD student, CS
Long Tran: PhD student, CS
Michael Holmes: PhD student, CS + Physics (co-supervised)
Nikolaos Vasiloglou: PhD student, EE (co-supervised)
Wei Guan: PhD student, CS (co-supervised)
Nishant Mehta: PhD student, CS (co-supervised)
Wee Chin Wong: PhD student, ChemE (co-supervised)
Abhimanyu Aditya: MS student, CS
Yatin Kanetkar: MS student, CS

Goal
New displays for high-dimensional data
– Isometric non-negative matrix factorization
– Rank-based embedding
– Density-preserving maps
– Co-occurrence embedding
New algorithms for scaling them to big datasets
– Distances: Generalized Fast Multipole Method
– Dot products: Cosine Trees and QUIC-SVD
– MLPACK

Plotting high-D data in 2-D
Dimension reduction beyond PCA: manifolds, embeddings, etc.
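As a concrete point of reference (a minimal sketch, not any of the methods in this talk), the snippet below plots a high-dimensional dataset in 2-D first with plain PCA and then with a standard manifold method, using scikit-learn; the digits dataset and the neighborhood size are illustrative assumptions.

```python
# Minimal sketch: linear 2-D projection (PCA) vs. a manifold embedding (Isomap).
# Dataset and parameters are illustrative assumptions, not from the talk.
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap

X, y = load_digits(return_X_y=True)                     # 64-D points

pca_2d = PCA(n_components=2).fit_transform(X)           # linear projection
iso_2d = Isomap(n_neighbors=10, n_components=2).fit_transform(X)  # manifold embedding

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, emb, title in zip(axes, (pca_2d, iso_2d), ("PCA", "Isomap")):
    ax.scatter(emb[:, 0], emb[:, 1], c=y, s=5, cmap="tab10")
    ax.set_title(title)
plt.show()
```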

Goal
New displays for high-dimensional data
– Isometric non-negative matrix factorization
– Rank-based embedding
– Density-preserving maps
– Co-occurrence embedding
New algorithms for scaling them to big datasets
– Distances: Generalized Fast Multipole Method
– Dot products: Cosine Trees and QUIC-SVD
– MLPACK

Isometric Non-negative Matrix Factorization
NMF maintains the interpretability of the components of data such as images, text, or spectra (SDSS).
However, as a low-D display it is not, in general, faithful to the original distances.
Isometric NMF [Vasiloglou, Gray, Anderson, to be submitted, SIAM DM 2008] preserves both distances and non-negativity; geometric programming formulation.
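For contrast, here is a minimal sketch of standard NMF with scikit-learn (not the Isometric NMF above): it shows the plain non-negative factorization X ≈ WH that Isometric NMF additionally constrains to preserve pairwise distances. The toy data and parameters are assumptions for illustration.

```python
# Standard NMF baseline (not Isometric NMF): X ≈ W H with W, H >= 0.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
X = rng.random((100, 50))                 # toy non-negative data matrix

model = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
W = model.fit_transform(X)                # 100 x 2 non-negative coordinates (the low-D display)
H = model.components_                     # 2 x 50 non-negative, interpretable components
print("reconstruction error:", model.reconstruction_err_)
```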

Rank-based Embedding
Suppose you don't really have meaningful or reliable distances, but you can say "A and B are farther apart than A and C", e.g. in document relevance.
It is still possible to make an embedding!
In fact, there is some indication that using ranks is more stable than using distances.
Can be formulated using hyperkernels; becomes either an SDP or a QP [Ouyang and Gray, ICML 2008].
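To make the input and output of the problem concrete, here is a toy sketch of rank-based (ordinal) embedding that fits 2-D points to triplet comparisons with a hinge loss and plain gradient descent. It is not the hyperkernel SDP/QP formulation of the paper; the margin, learning rate, and triplet format are assumptions for illustration.

```python
# Toy ordinal embedding: a triplet (a, b, c) means "a and b are farther apart
# than a and c". We minimize a hinge loss on squared distances by gradient descent.
import numpy as np

def rank_embed(n_points, triplets, dim=2, margin=0.1, lr=0.05, iters=2000, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.normal(scale=0.1, size=(n_points, dim))
    for _ in range(iters):
        grad = np.zeros_like(X)
        for a, b, c in triplets:          # want ||X[a]-X[b]||^2 > ||X[a]-X[c]||^2 + margin
            d_ab, d_ac = X[a] - X[b], X[a] - X[c]
            if (d_ac @ d_ac) + margin - (d_ab @ d_ab) > 0:   # constraint violated
                grad[a] += 2 * (d_ac - d_ab)
                grad[b] += 2 * d_ab
                grad[c] -= 2 * d_ac
        X -= lr * grad / max(len(triplets), 1)
    return X

# e.g. 4 points where 0 is closer to 2 than to 3, and closer to 1 than to 2
emb = rank_embed(4, [(0, 3, 2), (0, 2, 1)])
```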

Density-preserving Maps
Preserving densities is statistically more meaningful than preserving distances.
Might allow more reliable conclusions about clustering and outliers from the low-D display.
DC formulation (Ozakin and Gray, to be submitted, AISTATS 2008).

Co-occurrence Embedding
Consider InBio data: 3M occurrences of species in Costa Rica.
Densities are not reliable, because the sampling strategy is unknown.
But the overlap of two species' densities (co-occurrence) may be more reliable.
How can distribution distances be embedded? (Syed, Ozakin, Gray, to be submitted, ICML 2009)

Goal
New displays for high-dimensional data
– Isometric non-negative matrix factorization
– Rank-based embedding
– Density-preserving maps
– Co-occurrence embedding
New algorithms for scaling them to big datasets
– Distances: Generalized Fast Multipole Method
– Dot products: Cosine Trees and QUIC-SVD
– MLPACK

Computational problem
Such manifold methods are expensive: typically O(N³).
But it is big datasets that are often the most important to visually summarize.
What are the underlying computations?
– All-k-nearest-neighbors (distances)
– Kernel summations (distances)
– Eigendecomposition (dot products)
– Convex optimization (dot products)
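To ground the first of these, here is a minimal sketch of all-k-nearest-neighbors done the naive way: all pairwise distances, O(N²) time and memory. This brute-force baseline (the toy data and k are assumptions) is the kind of computation the tree-based methods below accelerate.

```python
# Naive all-k-nearest-neighbors: O(N^2) pairwise distances.
import numpy as np

def all_knn_bruteforce(X, k):
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # N x N distance matrix
    np.fill_diagonal(D, np.inf)                                  # exclude self-matches
    return np.argsort(D, axis=1)[:, :k]                          # indices of k nearest neighbors

X = np.random.rand(1000, 3)
neighbors = all_knn_bruteforce(X, k=5)   # shape (1000, 5)
```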

Distances: Generalized Fast Multipole Method
Generalized N-body Problems (Gray and Moore, NIPS 2000; Riegel, Boyer, and Gray, TR 2008) include:
– All-k-nearest-neighbors
– Kernel summations
– Force summations in physics
– A very large number of bottleneck statistics and machine learning computations
Defined using category theory.

Distances: Generalized Fast Multipole Method
There exists a generalization (Gray and Moore, NIPS 2000; Riegel, Boyer, and Gray, TR 2008) of the Fast Multipole Method (Greengard and Rokhlin 1987) which:
– specializes to each of these problems
– is the fastest practical algorithm for these problems
– elucidates general principles for such problems
Parallel: THOR (Tree-based Higher-Order Reduce)

Distances: Generalized Fast Multipole Method
Elements of the GFMM:
– A spatial tree data structure, e.g. kd-trees, metric trees, cover trees, SVD trees, disk trees
– A tree expansion pattern
– Tree-stored cached statistics
– An error criterion and pruning criterion
– A local approximation/pruning scheme with error bounds, e.g. Hermite expansions, Monte Carlo, exact pruning
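A generic skeleton of this pattern is sketched below (an illustrative sketch, not the actual GFMM/THOR interface): visit pairs of tree nodes, try to prune or approximate a whole pair using node-level bounds and cached statistics, and otherwise expand the recursion. The Node fields and the try_prune/base_case hooks are assumptions.

```python
# Generic dual-tree recursion skeleton for generalized N-body problems (sketch).
from dataclasses import dataclass, field

@dataclass
class Node:
    points: list                          # indices owned by this node
    bound: tuple                          # e.g. a bounding box, used for pruning
    children: list = field(default_factory=list)
    stat: float = 0.0                     # tree-stored cached statistic

def dual_tree(q: Node, r: Node, try_prune, base_case):
    if try_prune(q, r):                   # pruning/error criterion satisfied:
        return                            # the whole (q, r) pair is approximated at once
    if not q.children and not r.children:
        base_case(q, r)                   # exhaustive work on the two leaves
        return
    for qc in (q.children or [q]):        # tree expansion pattern
        for rc in (r.children or [r]):
            dual_tree(qc, rc, try_prune, base_case)
```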

kd-trees: the most widely used space-partitioning tree [Bentley 1975], [Friedman, Bentley & Finkel 1977], [Moore & Lee 1995]
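The slides that follow show a kd-tree being built level by level. A minimal construction sketch (illustrative, not the FASTlab code; the leaf size is an assumption) is: split on the widest dimension at the median, recurse, and cache per-node statistics such as the bounding box and point count.

```python
# Minimal kd-tree construction: median split on the widest dimension.
import numpy as np

class KDNode:
    def __init__(self, idx, lo, hi, left=None, right=None):
        self.idx = idx                    # indices of the points in this node
        self.lo, self.hi = lo, hi         # bounding box (cached statistic)
        self.left, self.right = left, right
        self.count = len(idx)             # cached count

def build_kdtree(X, idx=None, leaf_size=20):
    if idx is None:
        idx = np.arange(len(X))
    lo, hi = X[idx].min(axis=0), X[idx].max(axis=0)
    if len(idx) <= leaf_size:
        return KDNode(idx, lo, hi)
    d = int(np.argmax(hi - lo))                    # widest dimension
    order = idx[np.argsort(X[idx, d])]
    mid = len(order) // 2                          # median split
    return KDNode(idx, lo, hi,
                  left=build_kdtree(X, order[:mid], leaf_size),
                  right=build_kdtree(X, order[mid:], leaf_size))

root = build_kdtree(np.random.rand(10000, 2))
```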

A kd-tree: levels 1 through 6 (sequence of figure slides showing successive splits of the data)

Example: Generalized histogram (figure: query point q, bandwidth h)

Range-count recursive algorithm (sequence of figure slides): the recursion prunes whole tree nodes by inclusion (the node lies entirely within the query radius) and by exclusion (the node lies entirely outside it). Fastest practical algorithm [Bentley 1975]; our algorithms can use any tree.
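A self-contained sketch of this range-count recursion on a kd-tree follows (illustrative, not the exact FASTlab implementation; the node structure mirrors the kd-tree sketch above and the leaf size is an assumption). Node bounds give minimum and maximum possible distances to the query, so a node entirely within the radius is counted at once (inclusion) and a node entirely outside it is discarded (exclusion).

```python
# Range counting with inclusion/exclusion pruning on a kd-tree (sketch).
import numpy as np

class Node:
    def __init__(self, idx, lo, hi, left=None, right=None):
        self.idx, self.lo, self.hi = idx, lo, hi      # point indices and bounding box
        self.left, self.right = left, right
        self.count = len(idx)                         # cached count

def build(X, idx, leaf_size=20):
    lo, hi = X[idx].min(axis=0), X[idx].max(axis=0)
    if len(idx) <= leaf_size:
        return Node(idx, lo, hi)
    d = int(np.argmax(hi - lo))                       # median split on the widest dimension
    order = idx[np.argsort(X[idx, d])]
    m = len(order) // 2
    return Node(idx, lo, hi, build(X, order[:m], leaf_size), build(X, order[m:], leaf_size))

def range_count(node, X, q, h):
    """Count the points within distance h of the query q."""
    nearest = np.clip(q, node.lo, node.hi)            # closest point of the box to q
    farthest = np.where(np.abs(q - node.lo) > np.abs(q - node.hi), node.lo, node.hi)
    if np.linalg.norm(q - nearest) > h:               # exclusion prune: box entirely outside
        return 0
    if np.linalg.norm(q - farthest) <= h:             # inclusion prune: box entirely inside
        return node.count
    if node.left is None:                             # leaf: check points exhaustively
        return int(np.sum(np.linalg.norm(X[node.idx] - q, axis=1) <= h))
    return range_count(node.left, X, q, h) + range_count(node.right, X, q, h)

X = np.random.rand(10000, 2)
root = build(X, np.arange(len(X)))
print(range_count(root, X, np.array([0.5, 0.5]), h=0.1))
```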

Dot products: Cosine Trees and QUIC-SVD
QUIC-SVD (Holmes, Gray, Isbell, NIPS 2008)
Cosine trees: trees for dot products
Uses Monte Carlo within cosine trees to achieve a best-rank approximation with user-specified relative error
Very fast, but with probabilistic bounds
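For flavor, here is a generic randomized low-rank SVD sketch (a random range finder followed by a small exact SVD). It is not QUIC-SVD or cosine trees; it only illustrates the general idea of sampling-based, probabilistically controlled approximation of A ≈ U S V^T. The rank and oversampling values are assumptions.

```python
# Generic randomized low-rank SVD (sketch), not QUIC-SVD.
import numpy as np

def randomized_svd(A, rank, oversample=10, seed=0):
    rng = np.random.default_rng(seed)
    m, n = A.shape
    Omega = rng.normal(size=(n, rank + oversample))   # random test matrix
    Q, _ = np.linalg.qr(A @ Omega)                    # orthonormal basis for an approximate range of A
    B = Q.T @ A                                       # small (rank + oversample) x n matrix
    Ub, S, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ Ub
    return U[:, :rank], S[:rank], Vt[:rank]

A = np.random.rand(2000, 500)
U, S, Vt = randomized_svd(A, rank=20)
print("relative error:", np.linalg.norm(A - (U * S) @ Vt) / np.linalg.norm(A))
```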

Dot products: Cosine Trees and QUIC-SVD
Uses of QUIC-SVD: PCA, KPCA, eigendecomposition
Working on: fast interior-point convex optimization

Bigger goal: make all the best statistical/learning methods efficient!
Ground rules:
– Asymptotic speedup as well as practical speedup
– Arbitrarily high accuracy with error guarantees
– No manual tweaking
– Really works (validated in a big real-world problem)
Treating entire classes of methods:
– Methods based on distances (generalized N-body problems)
– Methods based on dot products (linear algebra)
– Soon: methods based on discrete structures (combinatorial/graph problems)
Watch for MLPACK, coming Dec
– Meant to be the equivalent of linear algebra's LAPACK

So far: fastest algs for…
2000 all-nearest-neighbors (1970)
2000 n-point correlation functions (1950)
2003, 05, 06 kernel density estimation (1953)
2004 nearest-neighbor classification (1965)
2005, 06, 08 nonparametric Bayes classifier (1951)
2006 mean-shift clustering/tracking (1972)
2006 k-means clustering (1960s)
2007 hierarchical clustering/EMST (1960s)
2007 affinity propagation/clustering (2007)

So far: fastest algs for…
2008 principal component analysis* (1930s)
2008 local linear kernel regression (1960s)
2008 hidden Markov models* (1970s)
Working on:
– linear regression, Kalman filters (1960s)
– Gaussian process regression (1960s)
– Gaussian graphical models (1970s)
– manifolds, spectral clustering (2000s)
– convex kernel machines (2000s)

Some application highlights so far…
First large-scale dark energy confirmation (top Science breakthrough of 2003)
First large-scale cosmic magnification confirmation (Nature, 2005)
Integration into Google image search (we think), 2005
Integration into Microsoft SQL Server, 2008
Working on:
– Integration into Large Hadron Collider pipeline, 2008
– Fast IP-based spam filtering (Secure Comp, 2008)
– Fast recommendation (Netflix)

To find out more:
Best way:
Mostly outdated: