Paper: A. Kapoor, H. Ahn, and R. Picard, “Mixture of Gaussian Processes for Combining Multiple Modalities,” MIT Media Lab Technical Report
Paper recommended by: Ya Xue
Duke University Machine Learning Discussion Group
Discussion leader: David Williams
29 April 2005
Overview
The paper presents an approach to multi-sensor classification using Gaussian processes (GPs) that can handle missing sensors (“channels”) and noisy labels. The framework uses a mixture of GPs; Expectation Propagation (EP) is used to learn a classifier for each sensor. For the final classification of an unlabeled data point, the individual sensor outputs are combined probabilistically.
Framework
Each x is data from one of P sensors.
Each y = f(x) is a “soft label” from one of the P sensors (e.g., the input to a logistic or probit function).
There is a GP prior on each of the P functions f.
λ is a “switch” that determines which sensor to use to classify.
t is the hard label (+1 or -1).
Accounting for Noisy Labels
ε = labeling error rate
Φ = probit function (cdf of a standard Gaussian)
t = label (+1 or -1)
y = “soft label”
In experiments, ε was chosen via evidence maximization.
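The likelihood that ties these four symbols together is not reproduced in the slide text. A minimal sketch, assuming the common "flipped label" form p(t | y) = ε + (1 − 2ε)Φ(t·y), is given below; the exact expression and the function names are assumptions, not taken from the slide.

```python
import math


def probit(z):
    """Standard Gaussian cdf, written with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def noisy_label_likelihood(y, t, eps):
    """Assumed noisy-label likelihood: eps + (1 - 2*eps) * Phi(t * y).

    y   : soft label (real number)
    t   : hard label, +1 or -1
    eps : labeling error rate, between 0 and 0.5
    """
    return eps + (1.0 - 2.0 * eps) * probit(t * y)


# Even a very confident soft label cannot push the likelihood above 1 - eps.
p_confident = noisy_label_likelihood(10.0, +1, 0.1)
```

Under this assumed form, the two label probabilities sum to one for any y, and ε caps how strongly a single (possibly mislabeled) point can pull on the classifier.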
GP Classification for a Single Sensor
Posterior is proportional to the product of the (GP) prior and the likelihood:
EP is used to approximate the likelihood (so the posterior is also a Gaussian):
The GP prior is of course Gaussian, and enforces a smoothness constraint.
Resulting posterior can then be marginalized to classify a test point:
where K is the kernel matrix.
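The equations for this slide are not reproduced in the text. The relations it describes can be written in the standard textbook form for GP classification with EP, as sketched below. This is not copied from the paper: the site parameters s_i, m_i, v_i and the test-point notation x_*, y_*, t_* are notational assumptions.

```latex
\begin{align}
p(\mathbf{y} \mid X, \mathbf{t}) &\propto p(\mathbf{y} \mid X)\, \prod_{i=1}^{n} p(t_i \mid y_i),
  \qquad p(\mathbf{y} \mid X) = \mathcal{N}(\mathbf{y};\, \mathbf{0},\, K), \\
p(t_i \mid y_i) &\approx s_i\, \mathcal{N}(y_i;\, m_i,\, v_i)
  \quad \text{(EP site approximation)}, \\
p(t_* \mid x_*, X, \mathbf{t}) &= \int p(t_* \mid y_*)\, p(y_* \mid x_*, X, \mathbf{t})\, dy_* .
\end{align}
```

Because the prior and every approximate likelihood term are Gaussian in y, their product is Gaussian as well, which is what makes the final marginalization tractable.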
GP Classification for Multiple Sensors
A variational bound on the posterior of the soft labels (Y) and the switches (Λ) is used:
Final classification of a test point will be given by:
In the multi-sensor setting, the likelihood (with the j-th sensor) will be:
GP Classification for Multiple Sensors
Three steps:
1. Initialization
2. Variational Updates
3. Classifying Test Data
Begin with n labeled (training) data points, with data from P sensors, and one testing data point.
Recall that λ is a “switch” that determines which sensor to use to classify.
Q(λ), a multinomial distribution, is initialized uniformly.
For each of the P sensors, use EP as in the single-sensor case to obtain a Gaussian posterior, and then initialize that sensor's soft-label posterior to the obtained distribution.
{Apparently, in obtaining these posteriors, one simply uses whatever data is available from each sensor.}
Variational update rules:
The update for the switches uses (6) below, which is intractable; the authors suggest using importance sampling to compute it.
The update for the soft labels is a product of Gaussians:
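Equation (6) is not reproduced in the slide text, so the sketch below only illustrates the kind of computation importance sampling handles. It assumes the intractable quantity is an expectation of a log-likelihood under a Gaussian over one soft label, and it reuses an assumed noisy-label likelihood ε + (1 − 2ε)Φ(t·y). The function names, the wider-Gaussian proposal, and the sample count are all hypothetical choices.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian N(mean, var) at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def probit(z):
    """Standard Gaussian cdf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_log_lik(mean, var, t, eps, n_samples=20000, seed=0):
    """Self-normalized importance-sampling estimate of
    E_{y ~ N(mean, var)}[ log(eps + (1 - 2*eps) * Phi(t * y)) ].

    Requires eps > 0 so that the logarithm stays finite.
    """
    rng = random.Random(seed)
    prop_var = 4.0 * var  # proposal: same mean, wider than the target
    num = 0.0
    den = 0.0
    for _ in range(n_samples):
        y = rng.gauss(mean, math.sqrt(prop_var))
        # Importance weight: target density over proposal density.
        w = normal_pdf(y, mean, var) / normal_pdf(y, mean, prop_var)
        num += w * math.log(eps + (1.0 - 2.0 * eps) * probit(t * y))
        den += w
    return num / den


# A soft label centred at +2 should explain t = +1 better than t = -1.
agree = expected_log_lik(2.0, 1.0, +1, 0.05)
disagree = expected_log_lik(2.0, 1.0, -1, 0.05)
```

The wider proposal keeps the importance weights bounded, so the estimate is stable; how such expectations enter the actual switch update depends on the paper's equation (6).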
To classify a test point, the posterior over the test point’s switch is needed:
The authors’ approach seems ad hoc, and their explanation is unclear:
1. For an unlabeled test point, perform P classifications using the single-sensor classifiers.
2. Normalize these P probabilities so they sum to unity, and take them as the posterior probabilities of the switches (i.e., of using each sensor).
Final classification of the test point is then given by:
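The procedure described above can be sketched as follows. The slide does not say which single-sensor probabilities are normalized, so this sketch assumes each sensor is weighted by its confidence in its own predicted label, max(p, 1 − p), and that the final output is the switch-weighted average of the single-sensor probabilities. Both choices, and the name `combine_sensors`, are assumptions rather than the authors' stated method.

```python
def combine_sensors(p_plus):
    """Combine P single-sensor classifier outputs for one test point.

    p_plus : list where p_plus[j] is sensor j's probability that t = +1.
    Returns (switch_posterior, p_final), where p_final is the combined
    probability that t = +1.
    """
    # Assumed weighting: each sensor's confidence in its own prediction.
    confidence = [max(p, 1.0 - p) for p in p_plus]
    total = sum(confidence)
    # Normalize so the switch probabilities sum to unity.
    switch_posterior = [c / total for c in confidence]
    # Mixture prediction: sum over sensors of Q(switch = j) * p_j(t = +1).
    p_final = sum(q * p for q, p in zip(switch_posterior, p_plus))
    return switch_posterior, p_final


# Two sensors: one fairly sure of +1, one leaning weakly toward -1.
switches, p = combine_sensors([0.9, 0.4])
# switches is approximately [0.6, 0.4]; p is approximately 0.70
```

The example shows why the scheme reads as ad hoc: the switch weights come from the classifiers' own outputs rather than from any separate model of sensor reliability.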
Classifying Test Data when Sensors are Missing
What is done when the testing data is missing data from some sensors?
Explanation (quoted from paper) is unclear:
Whatever the authors did, it must surely be quite ad hoc.
Results
On this data set, there were 136 data points.
The proposed mixture-of-GPs approach (83.55%) is barely better than just using the “Posture Modality” (82.02% or 82.99%); the difference amounts to about two or one additional data points correctly classified, respectively.
Moreover, the error bars overlap.
To me, this is not “significantly outperforming.”
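The arithmetic behind the "2 or 1 additional data points" remark can be checked directly. The sketch below simply multiplies each accuracy gap by the 136 data points; it treats the reported percentages as fractions of the full data set, which is a simplifying assumption if they are averages over splits.

```python
n_points = 136
acc_mixture = 0.8355            # proposed mixture of GPs
acc_posture = (0.8202, 0.8299)  # the two reported posture-only accuracies

# Accuracy gap converted into a number of data points.
extra_correct = [(acc_mixture - a) * n_points for a in acc_posture]
# extra_correct is roughly [2.08, 0.76], i.e. about 2 and about 1 points
```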
Conclusions
The idea of combining the sensors probabilistically seems good, but the method seems to have a lot of inelegant attributes.
Inelegance:
– EP approximation of the likelihood
– posterior of switches required importance sampling
– posterior of switches for test point seems completely ad hoc
– handling of missing test data is ad hoc
Noisy label formulation is not new for us.
No mention of semi-supervised extensions or active feature acquisition.
All sensors seem to be treated equally, regardless of how much training data is possessed for each one.
Avoids the missing data problem by essentially treating each sensor individually, and then combining all of the individual outputs at the end.
Method was applied to only a single data set, and results were falsely claimed to be significantly better.