How to do Fast Analytics on Massive Datasets
Alexander Gray, Georgia Institute of Technology
Computational Science and Engineering, College of Computing
FASTlab: Fundamental Algorithmic and Statistical Tools

The FASTlab: Fundamental Algorithmic and Statistical Tools Laboratory
1. Arkadas Ozakin: Research scientist, PhD Theoretical Physics
2. Dong Ryeol Lee: PhD student, CS + Math
3. Ryan Riegel: PhD student, CS + Math
4. Parikshit Ram: PhD student, CS + Math
5. William March: PhD student, Math + CS
6. James Waters: PhD student, Physics + CS
7. Hua Ouyang: PhD student, CS
8. Sooraj Bhat: PhD student, CS
9. Ravi Sastry: PhD student, CS
10. Long Tran: PhD student, CS
11. Michael Holmes: PhD student, CS + Physics (co-supervised)
12. Nikolaos Vasiloglou: PhD student, EE (co-supervised)
13. Wei Guan: PhD student, CS (co-supervised)
14. Nishant Mehta: PhD student, CS (co-supervised)
15. Wee Chin Wong: PhD student, ChemE (co-supervised)
16. Abhimanyu Aditya: MS student, CS
17. Yatin Kanetkar: MS student, CS
18. Praveen Krishnaiah: MS student, CS
19. Devika Karnik: MS student, CS
20. Prasad Jakka: MS student, CS

Our mission: allow users to apply all the state-of-the-art statistical methods with orders-of-magnitude more computational efficiency, via fast algorithms + distributed computing.

The problem: big datasets. N (#data), D (#features), and M (#models) could all be large.

Core methods of statistics / machine learning / mining:
- Querying: nearest-neighbor O(N), spherical range-search O(N), orthogonal range-search O(N), contingency table
- Density estimation: kernel density estimation O(N^2), mixture of Gaussians O(N)
- Regression: linear regression O(D^3), kernel regression O(N^2), Gaussian process regression O(N^3)
- Classification: nearest-neighbor classifier O(N^2), nonparametric Bayes classifier O(N^2), support vector machine
- Dimension reduction: principal component analysis O(D^3), non-negative matrix factorization, kernel PCA O(N^3), maximum variance unfolding O(N^3)
- Outlier detection: by robust L2 estimation, by density estimation, by dimension reduction
- Clustering: k-means O(N), hierarchical clustering O(N^3), by dimension reduction
- Time series analysis: Kalman filter O(D^3), hidden Markov model, trajectory tracking
- 2-sample testing: n-point correlation O(N^n)
- Cross-match: bipartite matching O(N^3)

5 main computational bottlenecks: aggregations, GNPs (generalized N-body problems), graphical models, linear algebra, optimization. Each of the core methods above reduces to one or more of these bottlenecks.
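Many of those costs are aggregation-style, generalized N-body computations: sums over all pairs of points. As a hypothetical illustration (a sketch, not FASTlab code), a naive Gaussian kernel density estimate shows where the O(N^2) comes from; the `naive_kde` function and its parameters are invented for this example.

```python
# Naive Gaussian kernel density estimation: a sketch of the aggregation
# bottleneck. Cost is O(N*M), i.e. O(N^2) when the queries are the data.
import numpy as np

def naive_kde(queries, references, bandwidth):
    """Return the Gaussian-kernel density estimate at each query point."""
    n, d = references.shape
    norm = (2 * np.pi * bandwidth**2) ** (d / 2) * n
    densities = np.empty(len(queries))
    for i, q in enumerate(queries):                        # loop over M queries
        sq_dists = np.sum((references - q) ** 2, axis=1)   # N distances per query
        densities[i] = np.sum(np.exp(-sq_dists / (2 * bandwidth**2))) / norm
    return densities

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 3))
print(naive_kde(X[:5], X, bandwidth=0.5))   # every query touches all N reference points
```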

How can we compute this efficiently? Multi-resolution data structures, e.g. kd-trees [Bentley 1975], [Friedman, Bentley & Finkel 1977], [Moore & Lee 1995].

A kd-tree: levels 1 through 6 (a sequence of figures showing the bounding boxes at successive levels of the tree).
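As a rough sketch of the idea (a hypothetical construction, not the FASTlab/MLPACK implementation), a kd-tree can be built by recursively splitting the data at the median of the widest dimension; each node caches its bounding box and point count, which is what the fast algorithms later prune against. The `Node` and `build_kdtree` names are invented for this sketch.

```python
# A minimal kd-tree sketch: each node stores its bounding box and point count,
# and children are formed by splitting at the median of the widest dimension.
import numpy as np

class Node:
    def __init__(self, points):
        self.points = points
        self.lo = points.min(axis=0)      # bounding-box lower corner
        self.hi = points.max(axis=0)      # bounding-box upper corner
        self.count = len(points)
        self.left = self.right = None

def build_kdtree(points, leaf_size=32):
    node = Node(points)
    if node.count > leaf_size:
        dim = int(np.argmax(node.hi - node.lo))       # widest dimension
        order = np.argsort(points[:, dim])
        mid = node.count // 2                         # median split
        node.left = build_kdtree(points[order[:mid]], leaf_size)
        node.right = build_kdtree(points[order[mid:]], leaf_size)
    return node

rng = np.random.default_rng(0)
root = build_kdtree(rng.random((100000, 2)))
print(root.count, root.left.count, root.right.count)
```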

Computational complexity using fast algorithms:
- Querying: nearest-neighbor O(log N), spherical range-search O(log N), orthogonal range-search O(log N), contingency table
- Density estimation: kernel density estimation O(N) or O(1), mixture of Gaussians O(log N)
- Regression: linear regression O(D) or O(1), kernel regression O(N) or O(1), Gaussian process regression O(N) or O(1)
- Classification: nearest-neighbor classifier O(N), nonparametric Bayes classifier O(N), support vector machine
- Dimension reduction: principal component analysis O(D) or O(1), non-negative matrix factorization, kernel PCA O(N) or O(1), maximum variance unfolding O(N)
- Outlier detection: by robust L2 estimation, by density estimation, by dimension reduction
- Clustering: k-means O(log N), hierarchical clustering O(N log N), by dimension reduction
- Time series analysis: Kalman filter O(D) or O(1), hidden Markov model, trajectory tracking
- 2-sample testing: n-point correlation O(N^(log n))
- Cross-match: bipartite matching O(N) or O(1)
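Tree-based querying of this kind is also available off the shelf. A small example using SciPy's kd-tree (an illustration of roughly logarithmic per-query search in low dimensions, not the FASTlab code):

```python
# Tree-based nearest-neighbor and spherical range search with SciPy's kd-tree.
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
data = rng.random((1_000_000, 3))
tree = cKDTree(data)                            # build the tree once

queries = rng.random((10, 3))
dists, idx = tree.query(queries, k=1)           # nearest neighbor of each query
neighbors = tree.query_ball_point(queries, r=0.05)   # spherical range search
print(dists)
print([len(nb) for nb in neighbors])            # range counts per query
```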

Example: 3-point correlation runtime on VIRGO simulation data, N = 75,000,000 (biggest previous computation: 20K points). Naïve: 5x10^9 sec. (~150 years); multi-tree: 55 sec. (exact). Scaling: n=2: O(N), n=3: O(N^(log 3)), n=4: O(N^2).
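For context, the n-point correlation statistic counts n-tuples of points whose pairwise separations fall in a given range. A hypothetical brute-force 3-point counter (not the multi-tree algorithm; the function name and radius bounds are invented here) makes the O(N^3) cost of the naïve approach concrete:

```python
# Brute-force 3-point correlation counting: count triples of points whose three
# pairwise distances all lie in [r_lo, r_hi].  O(N^3): feasible only for tiny N,
# which is why multi-tree algorithms are needed at N = 75,000,000.
import numpy as np
from itertools import combinations

def naive_3point_count(points, r_lo, r_hi):
    count = 0
    for i, j, k in combinations(range(len(points)), 3):   # all N-choose-3 triples
        d_ij = np.linalg.norm(points[i] - points[j])
        d_jk = np.linalg.norm(points[j] - points[k])
        d_ik = np.linalg.norm(points[i] - points[k])
        if all(r_lo <= d <= r_hi for d in (d_ij, d_jk, d_ik)):
            count += 1
    return count

rng = np.random.default_rng(0)
X = rng.random((100, 3))                  # even 100 points means ~160K triples
print(naive_3point_count(X, 0.05, 0.15))
```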

Our upcoming products:
- MLPACK (C++), Dec.: first scalable comprehensive ML library
- THOR (distributed), Apr.: faster tree-based "MapReduce" for analytics
- MLPACK-db, Apr.: fast data analytics in relational databases (SQL Server)

The end.
- It's possible to scale up all statistical methods... with smart algorithms, not just brute force.
- Look for MLPACK (C++), THOR (distributed), MLPACK-db (RDBMS).
- Alexander Gray ( is best; webpage sorely out of date)

Range-count recursive algorithm (figures). The recursion descends the tree and prunes whole nodes: a node whose bounding box lies entirely inside the query ball is counted in full without visiting its points (inclusion pruning), and a node whose box lies entirely outside the ball contributes nothing (exclusion pruning). The kd-tree version is the fastest practical algorithm [Bentley 1975], but our algorithms can use any tree.
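A hypothetical sketch of that recursion, reusing the `Node` and `build_kdtree` definitions from the kd-tree sketch above (again, an illustration rather than the actual FASTlab implementation):

```python
# Range counting with inclusion/exclusion pruning on the kd-tree sketched
# earlier (assumes the Node class and build_kdtree function defined above).
import numpy as np

def min_dist(node, q):
    # smallest possible distance from q to any point in the node's bounding box
    return np.linalg.norm(np.maximum(node.lo - q, 0) + np.maximum(q - node.hi, 0))

def max_dist(node, q):
    # largest possible distance from q to any point in the node's bounding box
    return np.linalg.norm(np.maximum(np.abs(q - node.lo), np.abs(q - node.hi)))

def range_count(node, q, r):
    if min_dist(node, q) > r:        # exclusion prune: box entirely outside the ball
        return 0
    if max_dist(node, q) <= r:       # inclusion prune: box entirely inside the ball
        return node.count            # use the cached count, skip the points
    if node.left is None:            # leaf: check the few remaining points directly
        return int(np.sum(np.linalg.norm(node.points - q, axis=1) <= r))
    return range_count(node.left, q, r) + range_count(node.right, q, r)

# usage, with `root` from the build_kdtree example above:
# print(range_count(root, np.array([0.5, 0.5]), 0.1))
```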