Paper: A. Kapoor, H. Ahn, and R. Picard, “Mixture of Gaussian Processes for Combining Multiple Modalities,” MIT Media Lab Technical Report, 2005.
Recommended by Ya Xue.
Duke University Machine Learning Discussion Group
Discussion Leader: David Williams, 29 April 2005

Overview: The paper presents an approach to multi-sensor classification using Gaussian Processes (GPs) that can handle missing sensors (“channels”) and noisy labels. The framework uses a mixture of GPs; Expectation Propagation (EP) is used to learn a classifier for each sensor. For the final classification of an unlabeled data point, the individual sensor outputs are combined probabilistically.

Framework: Each x is data from one of P sensors. Each y = f(x) is a “soft label” from one of the P sensors (e.g., the input to a logistic or probit function). There is a GP prior on each of the P functions f. λ is a “switch” that determines which sensor to use to classify. t is the hard label (+1 or -1).
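The equations on this slide did not transfer; purely as a sketch, here is one way to write the model that the text describes, with all symbols and indices being my own choices rather than the paper’s:

```latex
% Hypothetical notation (chosen here, not copied from the slides):
% one latent function per sensor, a soft label per (point, sensor) pair,
% and a per-point switch that selects which sensor explains the hard label.
\[
  f^{(j)} \sim \mathcal{GP}\bigl(0,\, k^{(j)}\bigr), \qquad j = 1, \dots, P
\]
\[
  y_i^{(j)} = f^{(j)}\bigl(x_i^{(j)}\bigr), \qquad
  \lambda_i \in \{1, \dots, P\}, \qquad t_i \in \{-1, +1\}
\]
\[
  p\bigl(t_i \mid \mathbf{y}_i, \lambda_i = j\bigr) \;=\; p\bigl(t_i \mid y_i^{(j)}\bigr)
\]
```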

Accounting for Noisy Labels
–ε: labeling error rate
–Φ: probit function (CDF of a Gaussian)
–t: hard label (+1 or -1)
–y: “soft label”
In the experiments, ε was chosen via evidence maximization.
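The likelihood itself is an image on the slide; from the symbols listed above it is presumably the standard label-flipping construction, reconstructed here as an assumption rather than a quote from the paper:

```latex
% Label-flipping likelihood: with probability epsilon the observed label is wrong.
\[
  p(t \mid y) \;=\; \epsilon \;+\; (1 - 2\epsilon)\,\Phi(t \cdot y)
\]
% Sanity check: epsilon = 0 recovers the usual probit likelihood Phi(t*y),
% and epsilon = 0.5 makes the observed label carry no information.
```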

GP Classification for a Single Sensor: The posterior is proportional to the product of the (GP) prior and the likelihood. EP is used to approximate the likelihood, so the approximate posterior is also Gaussian. The GP prior is of course Gaussian (with covariance given by the kernel matrix K) and enforces a smoothness constraint. The resulting posterior can then be marginalized to classify a test point.
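The equations referred to above are images that did not transfer; the following is a sketch of standard EP-based GP classification matching the slide’s description, in generic notation that may differ from the paper’s:

```latex
% GP prior over the soft labels (K = kernel matrix) times per-point likelihoods:
\[
  p(\mathbf{y} \mid \mathbf{t}) \;\propto\;
  \mathcal{N}(\mathbf{y} \mid \mathbf{0},\, K)\; \prod_{i=1}^{n} p(t_i \mid y_i)
\]
% EP replaces each likelihood term with a Gaussian, so the approximation is Gaussian:
\[
  p(\mathbf{y} \mid \mathbf{t}) \;\approx\; \mathcal{N}(\mathbf{y} \mid \bar{\mathbf{y}},\, \Sigma)
\]
% Classifying a test point: marginalize to get the Gaussian p(y_* | t), then
\[
  p(t_* \mid \mathbf{t}) \;=\; \int p(t_* \mid y_*)\, p(y_* \mid \mathbf{t})\, \mathrm{d}y_*
\]
```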

GP Classification for Multiple Sensors: A variational bound on the posterior of the soft labels (Y) and the switches (Λ) is used. In the multi-sensor setting, the likelihood (when the switch selects the j-th sensor) involves only that sensor’s soft label. The final classification of a test point averages the per-sensor predictions under the posterior over the switch.
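Again the equations are missing; a hedged reconstruction of the three expressions the slide refers to, assuming a fully factorized variational posterior (that factorization and the notation are my assumptions):

```latex
% Factorized variational approximation over soft labels Y and switches Lambda:
\[
  p(Y, \Lambda \mid \mathbf{t}) \;\approx\; Q(Y)\, Q(\Lambda)
\]
% Likelihood when the switch of point i selects the j-th sensor
% (using the noisy-probit likelihood from the earlier slide):
\[
  p\bigl(t_i \mid \mathbf{y}_i, \lambda_i = j\bigr)
  \;=\; \epsilon + (1 - 2\epsilon)\,\Phi\bigl(t_i\, y_i^{(j)}\bigr)
\]
% Final classification: average the per-sensor predictions under the switch posterior:
\[
  p(t_* \mid \mathbf{t}) \;=\; \sum_{j=1}^{P} Q(\lambda_* = j)
  \int p\bigl(t_* \mid y_*^{(j)}\bigr)\, Q\bigl(y_*^{(j)}\bigr)\, \mathrm{d}y_*^{(j)}
\]
```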

GP Classification for Multiple Sensors: Three Steps
1. Initialization
2. Variational Updates
3. Classifying Test Data
Begin with n labeled (training) data points, with data from P sensors, and one test data point.

Step 1: Initialization. Recall that λ is a “switch” that determines which sensor to use to classify. Q(λ), a multinomial distribution, is initialized uniformly. For each of the P sensors, EP is used as in the single-sensor case to obtain a Gaussian posterior, and the corresponding Q distribution is initialized to this posterior. {Apparently, in obtaining these posteriors, one simply uses whatever data is available from each sensor.}

Step 2: Variational Updates. The update for the switches uses equation (6) of the paper, which is intractable; the authors suggest computing it with importance sampling. The update for the soft labels is a product of Gaussians.
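The update equations themselves (including the paper’s eq. (6)) are not reproduced on the slide. Purely to illustrate the two devices it names, here is a self-contained Python sketch of (a) a Monte Carlo estimate of an intractable expectation of the noisy-probit log-likelihood under a Gaussian, and (b) a precision-weighted product of two Gaussians; the formulas are generic, not the paper’s actual updates, and the epsilon value is arbitrary:

```python
import numpy as np
from scipy.stats import norm

EPS_LABEL = 0.05  # label-noise rate epsilon (value picked arbitrarily for the demo)

def log_lik(t, y, eps=EPS_LABEL):
    """Noisy-probit log-likelihood: log[eps + (1 - 2*eps) * Phi(t * y)]."""
    return np.log(eps + (1.0 - 2.0 * eps) * norm.cdf(t * y))

def expected_log_lik(t, mean, var, n_samples=20000, rng=None):
    """Monte Carlo estimate of E_{y ~ N(mean, var)}[log p(t | y)].
    The slide says the exact expectation (eq. (6)) is intractable and is
    computed with importance sampling; sampling directly from Q(y) is the
    simplest special case, with all importance weights equal to one."""
    rng = np.random.default_rng(0) if rng is None else rng
    y = rng.normal(mean, np.sqrt(var), size=n_samples)
    return float(log_lik(t, y).mean())

def product_of_gaussians(m1, v1, m2, v2):
    """Multiply two 1-D Gaussians N(m1, v1) * N(m2, v2) and renormalize:
    precisions add, and the new mean is precision-weighted."""
    prec = 1.0 / v1 + 1.0 / v2
    var = 1.0 / prec
    mean = var * (m1 / v1 + m2 / v2)
    return mean, var

# Demo: a point with label t = +1, under two sensors whose current Gaussian
# marginals for its soft label are N(1.2, 0.5) and N(-0.3, 0.8).
w = np.array([expected_log_lik(+1, 1.2, 0.5), expected_log_lik(+1, -0.3, 0.8)])
# Turn expected log-likelihoods into normalized switch responsibilities
# (a generic mean-field-style softmax, not necessarily the paper's rule).
q_lambda = np.exp(w - w.max())
q_lambda /= q_lambda.sum()
print("switch responsibilities:", q_lambda)
print("fused Gaussian (mean, var):", product_of_gaussians(1.2, 0.5, -0.3, 0.8))
```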

Step 3: Classifying Test Data. To classify a test point, the posterior over the test point’s switch is needed. The authors’ approach seems ad hoc, and their explanation is unclear: for an unlabeled test point, perform P classifications using the single-sensor classifiers; these P probabilities are then set to be the posterior probabilities of the switches (i.e., of using each sensor), after normalizing so they sum to unity. The final classification of the test point is then given by the switch-weighted combination of the sensor predictions.
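A small Python sketch of that combination rule as I read the description; which class’s probability gets normalized into the switch posterior is not stated on the slide, so using the positive-class probability is an assumption:

```python
import numpy as np

def combine_sensor_predictions(p_pos_per_sensor):
    """Combine P single-sensor predictive probabilities for one test point:
      1. treat the normalized single-sensor probabilities as the posterior
         over the test point's switch, and
      2. classify with the switch-weighted average of those probabilities."""
    p = np.asarray(p_pos_per_sensor, dtype=float)

    # Step 1: posterior over the switch (which sensor to trust), by normalization
    q_switch = p / p.sum()

    # Step 2: final predictive probability, a mixture over sensors
    p_final = float(np.dot(q_switch, p))
    return q_switch, p_final

# Example: three sensors predict P(t_* = +1) of 0.9, 0.6, and 0.2
q, p = combine_sensor_predictions([0.9, 0.6, 0.2])
print("switch posterior:", q)   # approx. [0.53, 0.35, 0.12]
print("P(t_* = +1):", p)        # switch-weighted average, approx. 0.71
```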

Classifying Test Data when Sensors are Missing: What is done when the test data is missing data from some sensors? The explanation (quoted from the paper on the slide) is unclear. Whatever the authors did, it must surely be quite ad hoc.

Results: On this data set, there were 136 data points. The proposed mixture-of-GP approach (83.55%) is barely better than just using the “Posture Modality” (82.02% or 82.99%); the difference is about 1 or 2 additional data points correctly classified. Moreover, the error bars overlap. To me, this is not “significantly outperforming”.
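A quick check of that back-of-the-envelope arithmetic (the accuracies and the count of 136 points are from the slide; the rest is plain multiplication):

```python
# Convert the reported accuracies on 136 points into counts of correctly
# classified points, to see how small the gap really is.
n = 136
for name, acc in [("Mixture of GPs", 0.8355),
                  ("Posture modality (a)", 0.8202),
                  ("Posture modality (b)", 0.8299)]:
    print(f"{name:22s} {acc:.2%}  ~ {acc * n:5.1f} points")
# 0.8355 * 136 ~ 113.6 vs. 0.8299 * 136 ~ 112.9 and 0.8202 * 136 ~ 111.5,
# i.e. roughly one or two additional points correctly classified.
```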

Conclusions
The idea of combining the sensors probabilistically seems good, but the method has a lot of inelegant attributes:
–the EP approximation of the likelihood
–the posterior of the switches required importance sampling
–the posterior of the switches for a test point seems completely ad hoc
–the handling of missing test data is ad hoc
The noisy-label formulation is not new for us. There is no mention of semi-supervised extensions or active feature acquisition. All sensors seem to be treated equally, regardless of how much training data is possessed for each one. The method avoids the missing-data problem by essentially treating each sensor individually and then combining all of the individual outputs at the end. The method was applied to only a single data set, and the results were falsely claimed to be significantly better.