How to do Bayes-Optimal Classification with Massive Datasets: Large-scale Quasar Discovery Alexander Gray Georgia Institute of Technology College of Computing Joint work with Gordon Richards (Princeton), Robert Nichol (Portsmouth ICG), Robert Brunner (UIUC/NCSA), Andrew Moore (CMU)

What I do Often the most general and powerful statistical (or “machine learning”) methods are computationally infeasible. I design machine learning methods and fast algorithms to make such statistical methods possible on massive datasets (without sacrificing accuracy).

Quasar detection •Science motivation: use quasars to trace the distant/old mass in the universe •Thus we want lots of sky  SDSS DR1, 2099 square degrees, to g = 21 •Biggest quasar catalog to date: tens of thousands •Should be ~1.6M z<3 quasars to g=21

Classification •Traditional approach: look at 2-d color-color plot (UVX method) –doesn’t use all available information –not particularly accurate (~60% for relatively bright magnitudes) •Statistical approach: Pose as classification. 1.Training: Train a classifier on large set of known stars and quasars (‘training set’) 2.Prediction: The classifier will label an unknown set of objects (‘test set’)

Which classifier? 1.Statistical question: Must handle arbitrary nonlinear decision boundaries, noise/overlap 2.Computational question: We have 16,713 quasars from [Schneider et al. 2003] (0.08<z<5.4), 478,144 stars (semi-cleaned sky sample) – way too big for many classifiers 3.Scientific question: We must be able to understand what it’s doing and why, and inject scientific knowledge

Which classifier? •Popular answers: –logistic regression: fast but linear only –naïve Bayes classifier: fast but quadratic only –decision tree: fast but not the most accurate –support vector machine: accurate but O(N³) –boosting: accurate but requires thousands of classifiers –neural net: reasonable compromise but awkward/human-intensive to train •The good nonparametric methods are also black boxes – hard/impossible to interpret

Main points of this talk 1.nonparametric Bayes classifier 2.can be made fast (algorithm design) 3.accurate and tractable  science

Optimal decision theory [Figure: star density and quasar density f(x) over a feature x, with the optimal decision boundary marked]

Bayes’ rule, for classification:
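(The equation on this slide did not survive transcription; in standard form, for a class C and feature vector x:)

```latex
P(C \mid x) = \frac{P(C)\, f(x \mid C)}{\sum_{C'} P(C')\, f(x \mid C')},
\qquad
\hat{C}(x) = \arg\max_{C}\; P(C)\, f(x \mid C)
```

The denominator is the same for every class, so classification needs only the numerators P(C)f(x|C), exactly the quantities the fast algorithm later in the talk bounds.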

So how do you estimate an arbitrary density?

Kernel Density Estimation (KDE), for example with a Gaussian kernel:
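(The formula itself was an image on the original slide; the standard Gaussian-kernel estimate, matching the quantities used later in the talk, is:)

```latex
\hat{f}(x) = \frac{1}{N}\sum_{i=1}^{N} K_h(x - x_i),
\qquad
K_h(u) = \frac{1}{(2\pi h^2)^{d/2}} \exp\!\left(-\frac{\lVert u \rVert^2}{2h^2}\right)
```

Here h is the smoothing parameter (bandwidth) discussed on the next slide and d is the dimension.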

Kernel Density Estimation (KDE) • There is a principled way to choose the optimal smoothing parameter h • Guaranteed to converge to the true underlying density (consistency) • Nonparametric – distribution need not be known

Nonparametric Bayes Classifier (NBC) [1951] • Nonparametric – distribution can be arbitrary • This is Bayes-optimal, given the right densities • Very clear interpretation • Parameter choices are easy to understand, automatable • There’s a way to enter prior information. Main obstacle: computational cost.
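
In code, the classifier is just per-class KDE weighted by priors. A minimal naive-sum sketch (not from the talk; the function names and example values are hypothetical):

```python
import numpy as np

def kde(x, train, h):
    """Gaussian-kernel density estimate at x: a naive O(N) sum."""
    d = train.shape[1]
    sq = np.sum((train - x) ** 2, axis=1)
    return np.mean(np.exp(-sq / (2 * h * h))) / (2 * np.pi * h * h) ** (d / 2)

def nbc_predict(x, class_data, priors, h):
    """Nonparametric Bayes classifier: argmax_C P(C) * fhat(x | C)."""
    scores = {c: priors[c] * kde(x, pts, h) for c, pts in class_data.items()}
    return max(scores, key=scores.get)

# e.g. nbc_predict(x, {"star": star_colors, "quasar": quasar_colors},
#                  {"star": 0.97, "quasar": 0.03}, h=0.1)  # hypothetical values
```

The naive sum makes every prediction cost O(N) kernel evaluations per class, which is exactly the obstacle the rest of the talk removes.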

Main points of this talk 1.nonparametric Bayes classifier 2.can be made fast (algorithm design) 3.accurate and tractable  science

kd-trees: most widely-used space-partitioning tree [Bentley 1975], [Friedman, Bentley & Finkel 1977] • Univariate axis-aligned splits • Split on widest dimension • O(N log N) to build, O(N) space
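
A minimal Python sketch of such a build (dict-based nodes and the leaf_size value are assumptions for illustration):

```python
import numpy as np

def build_kdtree(data, idx, leaf_size=32):
    """Recursively split data[idx] on its widest dimension at the median.

    Each node keeps its bounding box and the indices of the points under
    it (convenient for the later sketches, though this costs O(N log N)
    space instead of the O(N) of a pointer-free layout)."""
    pts = data[idx]
    node = {"lo": pts.min(axis=0), "hi": pts.max(axis=0),
            "idx": idx, "left": None, "right": None}
    if len(idx) > leaf_size:
        dim = int(np.argmax(node["hi"] - node["lo"]))  # widest dimension
        order = idx[np.argsort(data[idx, dim])]        # sort along that axis
        mid = len(order) // 2                          # median split
        node["left"] = build_kdtree(data, order[:mid], leaf_size)
        node["right"] = build_kdtree(data, order[mid:], leaf_size)
    return node

# usage: root = build_kdtree(X, np.arange(len(X)))   # X is an (N, d) array
# note: sorting per node is O(N log^2 N); presorting each dimension once
# gives the O(N log N) build quoted above.
```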

A kd-tree: levels 1 through 6 [Figure sequence: each level splits the widest dimension of every node in two]

For higher dimensions: ball-trees (computational geometry)

We have a fast algorithm for Kernel Density Estimation (KDE) •Generalization of N-body algorithms (multipole expansions optional) •Dual kd-tree traversal: O(N) •Works in arbitrary dimension •The fastest method to date [Gray & Moore 2003]
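
To make the traversal concrete, here is a much-simplified dual-tree Gaussian-kernel sum reusing build_kdtree from the sketch above. The absolute-tolerance pruning rule (eps) is an illustrative assumption; the published algorithm uses tighter, error-controlled bounds:

```python
import numpy as np

def box_sq_dists(a, b):
    """Min and max squared distances between two axis-aligned boxes."""
    gap = np.maximum(0.0, np.maximum(a["lo"] - b["hi"], b["lo"] - a["hi"]))
    span = np.maximum(a["hi"] - b["lo"], b["hi"] - a["lo"])
    return float(np.sum(gap ** 2)), float(np.sum(span ** 2))

def dualtree_kde(q, r, qdata, rdata, h, eps, out):
    """Add sum_j exp(-|x_i - y_j|^2 / 2h^2) to out[i] for every query i
    under node q, pruning whenever the kernel is nearly constant over
    the whole (q, r) node pair."""
    d2_min, d2_max = box_sq_dists(q, r)
    k_hi = np.exp(-d2_min / (2 * h * h))
    k_lo = np.exp(-d2_max / (2 * h * h))
    if k_hi - k_lo <= eps:                       # prune: one bulk update
        out[q["idx"]] += len(r["idx"]) * 0.5 * (k_hi + k_lo)
        return
    if q["left"] is None and r["left"] is None:  # leaf-leaf: exact sums
        for i in q["idx"]:
            d2 = np.sum((rdata[r["idx"]] - qdata[i]) ** 2, axis=1)
            out[i] += np.sum(np.exp(-d2 / (2 * h * h)))
        return
    q_kids = (q,) if q["left"] is None else (q["left"], q["right"])
    r_kids = (r,) if r["left"] is None else (r["left"], r["right"])
    for qc in q_kids:                            # recurse on node pairs
        for rc in r_kids:
            dualtree_kde(qc, rc, qdata, rdata, h, eps, out)

# usage (hypothetical): out = np.zeros(len(Q))
# dualtree_kde(build_kdtree(Q, np.arange(len(Q))),
#              build_kdtree(X, np.arange(len(X))), Q, X, h=0.1, eps=1e-3, out=out)
# then divide out by N * (2*pi*h^2)^(d/2) to turn kernel sums into densities.
```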

We could just use the KDE algorithm for each class. But: •for the Gaussian kernel this is approximate •choosing the smoothing parameter to minimize (cross-validated) classification error is more accurate. So we need a fast algorithm for the Nonparametric Bayes Classifier (NBC) itself.

Leave-one-out cross-validation Observations: 1.Doing bandwidth selection requires only prediction. 2.To predict class label, we don’t need to compute the full densities. Just which one is higher.  We can make a fast exact algorithm for prediction
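
A naive sketch of that selection criterion (names hypothetical; the kernel's shared normalizing constant is dropped since only the winning label matters):

```python
import numpy as np

def loo_error(h, class_data, priors):
    """Leave-one-out classification error of the NBC at bandwidth h.

    Only the predicted label matters, so the Gaussian kernel's shared
    normalizing constant is omitted."""
    wrong, total = 0, 0
    for label, pts in class_data.items():
        for i in range(len(pts)):
            x, scores = pts[i], {}
            for c, other in class_data.items():
                # leave x out of its own class's training set
                train = np.delete(other, i, axis=0) if c == label else other
                sq = np.sum((train - x) ** 2, axis=1)
                scores[c] = priors[c] * np.mean(np.exp(-sq / (2 * h * h)))
            wrong += int(max(scores, key=scores.get) != label)
            total += 1
    return wrong / total

# pick h on a grid (hypothetical range):
# best_h = min(np.logspace(-2, 0, 20), key=lambda h: loo_error(h, data, priors))
```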

Fast NBC prediction algorithm 1. Build a tree for each class

Fast NBC prediction algorithm 2. Obtain bounds on P(C)f(x_q|C) for each class [Figure: intervals bounding P(C₁)f(x_q|C₁) and P(C₂)f(x_q|C₂) for a query point x_q]

Fast NBC prediction algorithm 3. Choose the next node-pair with priority = bound difference [Figure: expanding the highest-priority node tightens the bounds on P(C₁)f(x_q|C₁) and P(C₂)f(x_q|C₂)]

Fast NBC prediction algorithm 3. Choose the next node-pair with priority = bound difference, stopping as soon as the class bounds separate [Figure: resulting speedup; the answer remains exact]
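
Putting the three steps together: a minimal Python sketch, assuming the build_kdtree nodes from the earlier sketch (lo/hi bounding boxes plus an idx index set). The priority queue and stopping rule follow the slides, but the details are illustrative rather than the published implementation; the kernel's shared normalizing constant is dropped because it cancels in the argmax.

```python
import heapq
import numpy as np

def kernel_bounds(xq, node, h):
    """Bounds on exp(-|xq - y|^2 / 2h^2) over all points y in node's box."""
    near = np.maximum(0.0, np.maximum(node["lo"] - xq, xq - node["hi"]))
    far = np.maximum(np.abs(xq - node["lo"]), np.abs(xq - node["hi"]))
    return (np.exp(-np.sum(far ** 2) / (2 * h * h)),    # lower bound
            np.exp(-np.sum(near ** 2) / (2 * h * h)))   # upper bound

def fast_nbc_predict(xq, roots, data, priors, h):
    """Predict argmax_C P(C) f(xq|C): expand the node with the largest
    bound gap (priority = bound difference) and stop as soon as one
    class's lower bound beats every other class's upper bound."""
    w = {c: priors[c] / len(r["idx"]) for c, r in roots.items()}  # P(C)/N_C
    lo, hi, heap, tick = {}, {}, [], 0
    for c, root in roots.items():
        k_lo, k_hi = kernel_bounds(xq, root, h)
        n = len(root["idx"])
        lo[c], hi[c] = w[c] * n * k_lo, w[c] * n * k_hi
        heapq.heappush(heap, ((k_lo - k_hi) * n, tick, c, root, k_lo, k_hi))
        tick += 1
    while heap:
        best = max(lo, key=lo.get)
        if all(lo[best] >= hi[c] for c in lo if c != best):
            return best                          # bounds separate: exact early exit
        _, _, c, node, k_lo, k_hi = heapq.heappop(heap)
        n = len(node["idx"])
        lo[c] -= w[c] * n * k_lo                 # retract this node's contribution
        hi[c] -= w[c] * n * k_hi
        if node["left"] is None:                 # leaf: add its exact kernel sum
            d2 = np.sum((data[c][node["idx"]] - xq) ** 2, axis=1)
            exact = w[c] * np.sum(np.exp(-d2 / (2 * h * h)))
            lo[c] += exact
            hi[c] += exact
            continue
        for kid in (node["left"], node["right"]):  # replace with child bounds
            kl, kh = kernel_bounds(xq, kid, h)
            m = len(kid["idx"])
            lo[c] += w[c] * m * kl
            hi[c] += w[c] * m * kh
            heapq.heappush(heap, ((kl - kh) * m, tick, c, kid, kl, kh))
            tick += 1
    return max(lo, key=lo.get)                   # all leaves expanded: exact

# usage (hypothetical):
# roots = {c: build_kdtree(data[c], np.arange(len(data[c]))) for c in data}
```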

Main points of this talk 1.nonparametric Bayes classifier 2.can be made fast (algorithm design) 3.accurate and tractable  science

Resulting quasar catalog •100,563 UVX quasar candidates •Of 22,737 objects w/ spectra, 97.6% are quasars. We estimate 95.0% efficiency overall. (aka “purity”: good/all) •94.7% completeness w.r.t. g<19.5 UVX quasars from DR1 (good/all true) •Largest mag. range ever: 14.2<g<21.0 •[Richards et al. 2004, ApJ] •More recently, 195k quasars

Cosmic magnification [Scranton et al. 2005] 13.5M galaxies, 195,000 quasars Most accurate measurement of cosmic magnification to date [Nature, April 2005] [Figure: lensing magnification means more flux but also more area]

Next steps (in progress) •better accuracy via coordinate-dependent priors •5 magnitudes •use simulated quasars to push to higher redshift •use DR4 higher-quality data •faster bandwidth search •500k quasars easily, then 1M

Bigger picture •nearest neighbor (1-, k-, all-, approximate, classification) [Gray & Moore 2000], [Miller et al. 2003], etc. •n-point correlation functions [Gray & Moore 2000], [Moore et al. 2000], [Scranton et al. 2003], [Gray & Moore 2004], [Nichol et al., in prep.] •density estimation (nonparametric) [Gray & Moore 2000], [Gray & Moore 2003], [Balogh et al. 2003] •Bayes classification (nonparametric) [Richards et al. 2004], [Gray et al., PhyStat] •nonparametric regression •clustering: k-means and mixture models, others •support vector machines, maybe. For each of the first four we have the fastest algorithm to date; for the rest, we’ll see…

Take-home messages •Estimating a density? Use kernel density estimation (KDE). •Classification problem? Consider the nonparametric Bayes classifier (NBC). •Want to do these on huge datasets? Talk to us, use our software. •Different computational/statistical problem? Grab me after the talk!