SSL Chapter 4 Risk of Semi-supervised Learning: How Unlabeled Data Can Degrade Performance of Generative Classifiers

Amount of Data: Is it better to have more unlabeled data? The literature generally presents unlabeled data as valuable; for example, unlabeled data should certainly not be discarded (O'Neill, 1978).

Model Selection: Correct Model. Assume samples (X, Y) are drawn from the joint distribution P(X, Y). Suppose we know there exists a parameter θ such that P(X, Y | θ) = P(X, Y); the model is then said to be "correct." In that case, extra labeled or unlabeled data will reduce the classification error, with labeled data being more effective.

Detailed Analysis: Shahshahani & Landgrebe. Unlabeled data can degrade the performance of a naive Bayes classifier with Gaussian variables; the degradation is attributed to deviations from the modeling assumptions. Their suggestion: unlabeled data should be used only when the labeled data alone produce poor performance.

Detailed Analysis: Nigam et al. (2000). Reasons for poor performance include numerical problems in the learning method and a mismatch between the natural clusters in the data and the actual labels. Several other studies also report that adding unlabeled data degrades classification accuracy.

Empirical Study: Notation and Assumptions. Binary classification; X denotes an instance (a vector of attributes) and X_i the i-th attribute of X. All classifiers are trained with EM to maximize the likelihood of the labeled and unlabeled data.

Empirical Study: a naive Bayes classifier trained with an increasing number of unlabeled samples on randomly generated data. Attributes X_i and X_j are independent given the class label, so the model is correct.
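As a concrete illustration of the EM training used in these experiments, here is a minimal sketch of semi-supervised EM for a Bernoulli naive Bayes classifier on binary attributes. The function name, smoothing constant, and data layout are my own assumptions, not code from the chapter.

```python
import numpy as np

def ssl_naive_bayes_em(X_lab, y_lab, X_unl, n_iter=50, eps=1e-6):
    """EM for a Bernoulli naive Bayes classifier using labeled and
    unlabeled binary data (illustrative sketch, classes 0 and 1)."""
    n_lab = len(y_lab)
    X = np.vstack([X_lab, X_unl]).astype(float)

    # Responsibilities: labeled rows are clamped to their observed class,
    # unlabeled rows start uniform and are re-estimated in the E-step.
    resp = np.full((X.shape[0], 2), 0.5)
    resp[:n_lab] = 0.0
    resp[np.arange(n_lab), y_lab] = 1.0

    for _ in range(n_iter):
        # M-step: class priors and per-attribute Bernoulli parameters.
        prior = resp.sum(axis=0) / resp.sum()
        theta = (resp.T @ X + eps) / (resp.sum(axis=0)[:, None] + 2 * eps)

        # E-step: posterior class probabilities; only unlabeled rows are updated.
        log_joint = (np.log(prior)
                     + X @ np.log(theta).T
                     + (1.0 - X) @ np.log(1.0 - theta).T)
        post = np.exp(log_joint - log_joint.max(axis=1, keepdims=True))
        post /= post.sum(axis=1, keepdims=True)
        resp[n_lab:] = post[n_lab:]

    return prior, theta
```

A new binary vector x would then be classified by the class that maximizes log prior[c] plus the sum of x_i log theta[c, i] + (1 - x_i) log(1 - theta[c, i]).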

Empirical Study: a tree-augmented naive Bayes (TAN) structure is used, in which each attribute depends directly on the class and on at most one other attribute. The model is incorrect.

Empirical Study: a more complex model generated under TAN assumptions. With few labeled samples, performance improves when unlabeled data are added, although the model is still incorrect.

Empirical Study: a naive Bayes classifier on real data sets with binary classes (UCI repository). Unlabeled data help when the labeled set is small, similar to the previous case.

Summary of First Part: a correct model guarantees benefits from unlabeled data; an incorrect model may degrade performance, because the assumed model does not match the characteristics of the distribution of data and classes. But how do we know a priori that the model is the "correct" one?

Asymptotic Bias. A_L: asymptotic bias with labeled data; A_u: asymptotic bias with unlabeled data. A_L and A_u can be different. Scenario: train with labeled data so that the estimate is close to A_L, then add a huge amount of unlabeled data; the estimate may now tend toward A_u.

Toy Problem: Gender Prediction. G: baby's gender (Girl or Boy); C: whether the mother craved chocolate (Yes or No); W: mother's weight gain (More or Less). W and G are conditionally independent given C, i.e., G -> C -> W, so P(G,C,W) = P(G) P(C|G) P(W|C).

Toy Problem: Gender Prediction. P(G = Boy) = 0.5, P(C = No | G = Boy) = 0.1, P(C = No | G = Girl) = 0.8, P(W = Less | C = No) = 0.7, P(W = Less | C = Yes) = 0.2. From these we can compute P(W = Less | G = Boy) = 0.25 and P(W = Less | G = Girl) = 0.6.
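These two conditional probabilities follow by marginalizing over C; written out (a worked step, not on the original slide):

```latex
\begin{align*}
P(W{=}\mathrm{Less}\mid G{=}\mathrm{Boy})
  &= \textstyle\sum_{c} P(W{=}\mathrm{Less}\mid C{=}c)\,P(C{=}c\mid G{=}\mathrm{Boy})
   = 0.7\cdot 0.1 + 0.2\cdot 0.9 = 0.25,\\
P(W{=}\mathrm{Less}\mid G{=}\mathrm{Girl})
  &= 0.7\cdot 0.8 + 0.2\cdot 0.2 = 0.60.
\end{align*}
```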

Toy Problem: Gender Prediction. From the independence assumption and Bayes' rule: P(G = Girl | C = No) = 0.89, P(G = Boy | C = No) = 0.11, P(G = Girl | C = Yes) = 0.18, P(G = Boy | C = Yes) = 0.82. So the optimal rule is: if C = No choose G = Girl, otherwise choose G = Boy.
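As a quick check, these posteriors can be reproduced with Bayes' rule from the numbers on the previous slide; the dictionaries below are my own illustrative encoding, not code from the source.

```python
# True posteriors P(G | C) in the correct model G -> C -> W.
p_g = {"Boy": 0.5, "Girl": 0.5}
p_c_given_g = {("No", "Boy"): 0.1, ("No", "Girl"): 0.8,
               ("Yes", "Boy"): 0.9, ("Yes", "Girl"): 0.2}

for c in ("No", "Yes"):
    p_c = sum(p_c_given_g[(c, g)] * p_g[g] for g in p_g)  # P(C = c)
    for g in p_g:
        post = p_c_given_g[(c, g)] * p_g[g] / p_c          # Bayes' rule
        print(f"P(G={g} | C={c}) = {post:.2f}")
# Output: 0.11 / 0.89 for C=No and 0.82 / 0.18 for C=Yes (Boy / Girl).
```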

Toy Problem: Gender Prediction, Incorrect Model. Now assume the structure C <- G -> W, so C and W are conditionally independent given G and P(G,C,W) = P(G) P(C|G) P(W|G). Suppose an "oracle" gives us the true P(C|G); we still need to estimate P(G) and P(W|G).

Toy Problem: Gender Prediction, Incorrect Model, Labeled Data Only. The estimates are unbiased, with variance inversely proportional to the size of the labeled set D_L, so even a small D_L produces good estimates.

Toy Problem: Gender Prediction, Incorrect Model. The labeled-data estimates are P(G) ≈ 0.5, P(W = Less | G = Girl) ≈ 0.6, P(W = Less | G = Boy) ≈ 0.25, giving the posteriors:

                 P(G=Girl|C,W)   P(G=Boy|C,W)
C=No,  W=Less        0.95            0.05
C=No,  W=More        0.81            0.19
C=Yes, W=Less        0.35            0.65
C=Yes, W=More        0.11            0.89
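The table can be reproduced by normalizing the incorrect-model joint P(G) P(C|G) P(W|G) over G; a short check, with my own variable names:

```python
# Posteriors P(G | C, W) under the incorrect model P(G) P(C|G) P(W|G),
# using the labeled-data estimates quoted above.
p_g = {"Boy": 0.5, "Girl": 0.5}
p_c_given_g = {("No", "Boy"): 0.1, ("No", "Girl"): 0.8,
               ("Yes", "Boy"): 0.9, ("Yes", "Girl"): 0.2}
p_w_given_g = {("Less", "Boy"): 0.25, ("Less", "Girl"): 0.60,
               ("More", "Boy"): 0.75, ("More", "Girl"): 0.40}

for c in ("No", "Yes"):
    for w in ("Less", "More"):
        joint = {g: p_g[g] * p_c_given_g[(c, g)] * p_w_given_g[(w, g)]
                 for g in p_g}
        z = sum(joint.values())
        row = {g: round(p / z, 2) for g, p in joint.items()}
        print(f"C={c}, W={w}:", row)
# Reproduces the table, e.g. C=No, W=Less -> Girl 0.95 / Boy 0.05.
```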

Toy Problem: Gender Prediction, Incorrect Model. We classify with the maximum a posteriori value of G. The "bias" of these posteriors with respect to the true a posteriori probabilities is not zero, but they produce the same decisions as the optimal Bayes rule of the previous case, so the classifier is still likely to achieve close to the minimum classification error.

Toy Problem: Gender Prediction, Incorrect Model + Unlabeled Data. As the ratio of labeled to unlabeled data |D_L| / |D_U| -> 0, the estimates converge to P(G = Boy) = 0.5, P(W = Less | G = Girl) = 0.78, P(W = Less | G = Boy) = 0.07.

Toy Problem: Gender Prediction, Incorrect Model + Unlabeled Data. The a posteriori probabilities for G become:

                 P(G=Girl|C,W)   P(G=Boy|C,W)
C=No,  W=Less        0.99            0.01
C=No,  W=More        0.55            0.45
C=Yes, W=Less        0.71            0.29
C=Yes, W=More        0.05            0.95

Toy Problem: Gender Prediction, Incorrect Model + Unlabeled Data. The classifier now chooses Girl over Boy in 3 of the 4 (C, W) configurations; the predictions have changed from the optimal rule, and the expected error rate increases. What happened? The unlabeled data changed the asymptotic limit of the estimates: when the model is incorrect, the effect of unlabeled data matters.

Asymptotic Analysis. (X, Y): instance vector and class label; binary classes with values -1 and +1; 0-1 loss is assumed, and applying the Bayes rule gives the Bayes error. We have n independent samples, l of them labeled and u unlabeled, with n = l + u.
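For reference, the decision rule and classification error implied by these assumptions can be written as follows (standard 0-1 loss definitions, not copied from a slide):

```latex
\hat{y}(x) = \arg\max_{y \in \{-1,+1\}} P(Y = y \mid X = x),
\qquad
e(\theta) = P\!\left( \arg\max_{y} P(Y = y \mid X,\, \theta) \neq Y \right)
```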

Asymptotic Analysis. Each sample is labeled with probability h and unlabeled with probability (1 - h). P(X, Y | θ) is the parametric model, and EM is used to maximize the likelihood.

Asymptotic Analysis. The likelihood combines the labeled and the unlabeled data, as written below.
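A plausible reconstruction of the likelihood on this slide, with t_i = 1 if sample i is labeled and t_i = 0 otherwise (the indicator t_i is my notation):

```latex
L(\theta) = \prod_{i=1}^{n}
  \bigl[\, h\, P(x_i, y_i \mid \theta) \,\bigr]^{t_i}
  \bigl[\, (1-h)\, P(x_i \mid \theta) \,\bigr]^{1 - t_i},
\qquad
P(x_i \mid \theta) = \sum_{y} P(x_i, y \mid \theta)
```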

Asymptotic Analysis. The parameter estimate is obtained by maximizing this likelihood; as n -> infinity, the estimate converges to the maximizer of the corresponding expected log-likelihood, stated in the theorem below.

Theorem on Asymptotic Analysis. The limiting value θ* of the maximum-likelihood estimates is the maximizer of the h-weighted expected log-likelihood shown below.
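Reconstructed from the definitions above (and consistent with Cozman and Cohen's analysis), the limiting value maximizes the h-weighted combination of expected log-likelihoods:

```latex
\theta_h^{*} = \arg\max_{\theta}\;
  h\, \mathbb{E}_{P(X,Y)}\!\left[ \log P(X, Y \mid \theta) \right]
  + (1 - h)\, \mathbb{E}_{P(X)}\!\left[ \log P(X \mid \theta) \right]
```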

Theorem on Asymptotic Analysis. θ_h* is the value of θ that maximizes the objective above; θ_l* is the optimum obtained from labeled data only (h = 1), and θ_u* is the optimum obtained from unlabeled data only (h = 0).

Theorem on Asymptotic Analysis: Correct Model. If P(X, Y | θ_T) = P(X, Y) for some θ_T, then θ_T = θ_l* = θ_u* = θ_h*, and the asymptotic bias is zero.

Theorem on Asymptotic Analysis: Incorrect Model. Assume P(X, Y) does not belong to the family P(X, Y | θ). Let e(θ) be the classification error obtained with parameter θ, and assume e(θ_l*) < e(θ_u*).

Theorem on Asymptotic Analysis. Training with labeled data alone drives the error toward e(θ_l*); as more and more unlabeled data are added, the error moves toward e(θ_u*). Under the assumption above, using only labeled data therefore results in a smaller classification error.