
Three New Ideas in SDP-based Manifold Learning Alexander Gray Georgia Institute of Technology College of Computing FASTlab: Fundamental Algorithmic and Statistical Tools

The FASTlab: Fundamental Algorithmic and Statistical Tools Laboratory
1. Arkadas Ozakin: Research scientist, PhD Theoretical Physics
2. Dong Ryeol Lee: PhD student, CS + Math
3. Ryan Riegel: PhD student, CS + Math
4. Parikshit Ram: PhD student, CS + Math
5. William March: PhD student, Math + CS
6. James Waters: PhD student, Physics + CS
7. Hua Ouyang: PhD student, CS
8. Sooraj Bhat: PhD student, CS
9. Ravi Sastry: PhD student, CS
10. Long Tran: PhD student, CS
11. Michael Holmes: PhD student, CS + Physics (co-supervised)
12. Nikolaos Vasiloglou: PhD student, EE (co-supervised)
13. Wei Guan: PhD student, CS (co-supervised)
14. Nishant Mehta: PhD student, CS (co-supervised)
15. Wee Chin Wong: PhD student, ChemE (co-supervised)
16. Abhimanyu Aditya: MS student, CS
17. Yatin Kanetkar: MS student, CS
18. Praveen Krishnaiah: MS student, CS
19. Devika Karnik: MS student, CS
20. Prasad Jakka: MS student, CS

10 sample tasks
– “Find engines like this one” (querying)
– “Plot the distribution of engine sizes and emissions” (density estimation)
– “Predict the lifetime maintenance cost” (regression)
– “Predict existence of fault or not” (classification)
– “Predict the number of failures next year” (time series analysis)
– “Show all engines on a 2-d plot” (dimension reduction)
– “Show or remove the unusual engines” (outlier detection)
– “Show the different types of engines” (clustering)
– “Is this group equivalent to this group?” (two-sample testing)
– “What’s the best action to take based on this behavior?” (reinforcement learning/control)
Types of data: sensor measurements, documents, database records, etc.

Rankmap: can do manifold learning using only ordinal data

Isometric Separation Maps: preserve class proximity

Density-Preserving Maps: preserve densities, not distances

The problem: big datasets
Could be large: N (#data), D (#features), M (#models)

Dual-tree All-nearest-neighbors: O(N^2) → O(N)
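The gap between the quadratic baseline and a tree-based method can be illustrated with a small sketch. This is not the dual-tree algorithm from the slide: it is a simpler single-tree variant that queries a kd-tree once per point, shown only to make the pruning idea concrete. All function names and the `leaf_size` parameter are assumptions made for this illustration.

```python
import math
import random


def all_nn_brute(points):
    """Index of each point's nearest other point, using all N*(N-1) pairs."""
    result = []
    for i, p in enumerate(points):
        best_j, best_d = -1, math.inf
        for j, q in enumerate(points):
            if j != i:
                d = math.dist(p, q)
                if d < best_d:
                    best_j, best_d = j, d
        result.append(best_j)
    return result


def build(points, idx, depth=0, leaf_size=8):
    """Median-split kd-tree over the point indices in idx."""
    if len(idx) <= leaf_size:
        return ("leaf", idx)
    axis = depth % len(points[0])
    idx = sorted(idx, key=lambda i: points[i][axis])
    mid = len(idx) // 2
    split = points[idx[mid]][axis]
    return ("node", axis, split,
            build(points, idx[:mid], depth + 1, leaf_size),
            build(points, idx[mid:], depth + 1, leaf_size))


def search(node, points, qi, best):
    """Update best = [distance, index] with the nearest neighbor of point qi."""
    if node[0] == "leaf":
        for j in node[1]:
            if j != qi:
                d = math.dist(points[qi], points[j])
                if d < best[0]:
                    best[0], best[1] = d, j
        return
    _, axis, split, left, right = node
    diff = points[qi][axis] - split
    near, far = (left, right) if diff < 0 else (right, left)
    search(near, points, qi, best)
    # The far side lies beyond the splitting plane, so it can only hold a
    # closer point if the plane is nearer than the best distance so far.
    if abs(diff) < best[0]:
        search(far, points, qi, best)


def all_nn_kdtree(points, leaf_size=8):
    """All-nearest-neighbors via one pruned kd-tree query per point."""
    tree = build(points, list(range(len(points))), leaf_size=leaf_size)
    result = []
    for qi in range(len(points)):
        best = [math.inf, -1]
        search(tree, points, qi, best)
        result.append(best[1])
    return result
```

Both functions return the same neighbors; the tree version skips whole subtrees whenever the splitting plane is farther away than the best candidate found so far. A dual-tree method goes further by traversing a query tree and a reference tree together, which this sketch does not attempt.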

Rank-approximate Nearest-neighbor Search: distance approximation → rank approximation
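One way to see what a rank guarantee means is plain random sampling. The sketch below is an illustration under stated assumptions, not the algorithm behind the slide: it draws n points with replacement and returns the closest one, where n is chosen so that the result falls within the nearest fraction tau of the dataset with probability at least 1 - delta. The names `sample_size`, `rank_approx_nn`, `tau`, and `delta` are hypothetical.

```python
import math
import random


def sample_size(tau, delta):
    """Smallest n with (1 - tau)**n <= delta.

    Each draw misses the nearest tau-fraction of points with probability
    at most 1 - tau, so n independent draws all miss with probability
    at most (1 - tau)**n.
    """
    return math.ceil(math.log(delta) / math.log(1.0 - tau))


def rank_approx_nn(query, points, tau=0.05, delta=0.05, rng=random):
    """Return a point that is, with probability >= 1 - delta, among the
    nearest tau-fraction of points to the query (sampling with replacement)."""
    n = sample_size(tau, delta)
    sample = [rng.choice(points) for _ in range(n)]
    return min(sample, key=lambda p: math.dist(query, p))
```

The sample size depends only on tau and delta, not on the number of points N, which is what makes a rank-based error target attractive on large datasets. A distance-based error target gives no such control over how many points are closer than the one returned.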

Multi-scale decompositions, e.g. kd-trees [Bentley 1975], [Friedman, Bentley & Finkel 1977], [Moore & Lee 1995]
How can we compute these efficiently?

A kd-tree: level 1

A kd-tree: level 2

A kd-tree: level 3

A kd-tree: level 4

A kd-tree: level 5

A kd-tree: level 6
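The level-by-level pictures above can be mimicked with a minimal median-split construction. This is a sketch under assumptions, not the lab's implementation: the splitting rule (cycle through axes, split at the median), the `leaf_size` parameter, and the dictionary layout are all choices made for illustration.

```python
import random


def build_kdtree(points, depth=0, leaf_size=2):
    """Recursively halve the points at the median of one coordinate axis."""
    if len(points) <= leaf_size:
        return {"depth": depth, "points": points}
    axis = depth % len(points[0])  # cycle through the coordinate axes
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return {
        "depth": depth,
        "axis": axis,
        "split": ordered[mid][axis],
        "left": build_kdtree(ordered[:mid], depth + 1, leaf_size),
        "right": build_kdtree(ordered[mid:], depth + 1, leaf_size),
    }


def nodes_per_level(node, counts=None):
    """Count nodes at each level; index 0 is the root (level 1 on the slides)."""
    if counts is None:
        counts = []
    if node["depth"] == len(counts):
        counts.append(0)
    counts[node["depth"]] += 1
    for child in ("left", "right"):
        if child in node:
            nodes_per_level(node[child], counts)
    return counts


if __name__ == "__main__":
    random.seed(0)
    pts = [(random.random(), random.random()) for _ in range(64)]
    print(nodes_per_level(build_kdtree(pts)))
```

With 64 points and a leaf size of 2, each level halves the point count per node, so the tree has six levels and the number of nodes doubles from one level to the next.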

Some application highlights
Our software is being put into the pipelines of the world’s massive-scale science projects
– Astronomy sky surveys (LSST, Pan-STARRS, DES): 1B objects/month
– Large Hadron Collider: 1M events/sec

Some application highlights
Others
– McAfee spam blacklisting: 300M s/day
– Supermarket demand forecasting
– Algorithmic trading
– Audio fingerprint matching
– Legal document browsing and search

Software
MLPACK (C++)
– First scalable comprehensive ML library
MLPACK-db
– Fast data analytics in relational databases (SQL Server)
MLPACK Pro
– Very-large-scale data