SSL Chapter 4
Risks of Semi-Supervised Learning: How Unlabeled Data Can Degrade the Performance of Generative Classifiers

Amount of Data
- Is it better to have more unlabeled data?
- The literature emphasizes the positive value of unlabeled data: "unlabeled data should certainly not be discarded" (O'Neill, 1978).

Model Selection: Correct Model
- Assume samples (X, Y) are drawn from the joint distribution P(X, Y).
- Suppose there exists a parameter set Q such that P(X, Y | Q) = P(X, Y): the "correct model".
- With a correct model, extra labeled or unlabeled data both reduce the classification error; labeled data are more effective.

Detailed Analysis: Shahshahani & Landgrebe
- Observed that unlabeled data can degrade the performance of naive Bayes with Gaussian variables.
- Attributed the degradation to deviations from the modeling assumptions.
- Suggestion: unlabeled data should be used only when the labeled data alone produce poor performance.

Detailed Analysis: Nigam et al. (2000)
- Reasons for poor performance:
  - Numerical problems in the learning method.
  - Mismatch between the natural clusters in the data and the actual labels.
- Various studies report that adding unlabeled data can degrade classification accuracy.

Empirical Study: Notation and Assumptions
- Binary classification.
- X is an instance (feature vector); X_i is the i-th attribute of X.
- All classifiers use EM to maximize the likelihood of the combined labeled and unlabeled data.

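A minimal sketch of the EM loop these experiments rely on, for a naive Bayes classifier over binary attributes; the function name, data layout, and smoothing constant are illustrative choices, not taken from the chapter:

```python
import numpy as np

def semisup_nb_em(X_lab, y_lab, X_unl, n_iter=50, eps=1e-9):
    """EM for a semi-supervised naive Bayes classifier with binary attributes.

    X_lab: (l, d) 0/1 attribute matrix of labeled instances
    y_lab: (l,) labels in {0, 1} (assumed to contain both classes)
    X_unl: (u, d) 0/1 attribute matrix of unlabeled instances
    Returns the class prior P(Y) and attribute probabilities P(X_i = 1 | Y).
    """
    R_lab = np.eye(2)[y_lab]                    # fixed one-hot responsibilities
    R_unl = np.full((X_unl.shape[0], 2), 0.5)   # unlabeled start uniform
    X = np.vstack([X_lab, X_unl])
    for _ in range(n_iter):
        R = np.vstack([R_lab, R_unl])
        # M-step: maximum-likelihood estimates from expected counts.
        prior = R.sum(axis=0) / R.sum()
        theta = (R.T @ X + eps) / (R.sum(axis=0)[:, None] + 2 * eps)
        # E-step (unlabeled rows only): posterior P(Y | X) under the NB model.
        log_p = (np.log(prior)
                 + X_unl @ np.log(theta).T
                 + (1 - X_unl) @ np.log(1 - theta).T)
        log_p -= log_p.max(axis=1, keepdims=True)
        p = np.exp(log_p)
        R_unl = p / p.sum(axis=1, keepdims=True)
    return prior, theta
```

The labeled responsibilities stay fixed while the unlabeled ones are re-estimated, so with far more unlabeled than labeled data the M-step is dominated by the model's own posteriors; this is the mechanism behind the degradation studied below.
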
Empirical Study: Correct Model (Artificial Data)
- Naive Bayes classifier with an increasing amount of randomly generated unlabeled data.
- Attributes X_i and X_j are independent given the class label, so the model is correct.

Empirical Study: Incorrect Model (TAN Data)
- Data follow a tree-augmented naive Bayes (TAN) structure: each attribute depends directly on the class and on at most one other attribute.
- The naive Bayes model is therefore incorrect.

Empirical Study: Incorrect Model (More Complex Data)
- The classifier uses TAN assumptions, but the generating model is more complex, so the model is still incorrect.
- With few labeled samples, adding unlabeled data improves performance.

Empirical Study: Real Data
- Naive Bayes classifier on real binary-class data sets (UCI repository).
- Unlabeled data help when the labeled set is small; the behavior is similar to the previous case.

Summary of the First Part
- Correct model: benefits from unlabeled data are guaranteed.
- Incorrect model: unlabeled data may degrade performance when the assumed model does not match the characteristics of the data and class distributions.
- But how do we know a priori that the model is the "correct" one?

Asymptotic Bias
- A_L: asymptotic bias when training on labeled data.
- A_U: asymptotic bias when training on unlabeled data.
- A_L and A_U can be different.
- Scenario: train with labeled data so that the result is close to A_L; then add a huge amount of unlabeled data, and the result may tend to A_U.

Toy Problem: Gender Prediction
- G: gender of the baby (Girl or Boy).
- C: whether the mother craved chocolate (Yes or No).
- W: mother's weight gain (More or Less).
- W and G are conditionally independent given C; the true structure is G -> C -> W.
- P(G, C, W) = P(G) P(C | G) P(W | C).

Toy Problem: Gender Prediction (True Parameters)
- P(G = Boy) = 0.5
- P(C = No | G = Boy) = 0.1
- P(C = No | G = Girl) = 0.8
- P(W = Less | C = No) = 0.7
- P(W = Less | C = Yes) = 0.2
- From these we can compute P(W = Less | G = Boy) = 0.25 and P(W = Less | G = Girl) = 0.6.

Toy Problem: Gender Prediction (Optimal Rule)
- Applying Bayes' rule:
  - P(G = Girl | C = No) = 0.89, P(G = Boy | C = No) = 0.11
  - P(G = Girl | C = Yes) = 0.18, P(G = Boy | C = Yes) = 0.82
- Since W is independent of G given C, the optimal decision depends only on C: if C = No choose G = Girl, else choose G = Boy.

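These numbers can be checked directly from the true factorization P(G, C, W) = P(G) P(C | G) P(W | C); a small sketch (variable names are just illustrative):

```python
# True parameters of the toy problem (G -> C -> W).
p_boy = 0.5
p_c_no = {'Boy': 0.1, 'Girl': 0.8}          # P(C = No | G)
p_w_less = {'No': 0.7, 'Yes': 0.2}          # P(W = Less | C)

# Marginalize C out of P(W | G): P(W=Less|G) = sum_c P(W=Less|C=c) P(C=c|G).
for g in ('Boy', 'Girl'):
    p = p_w_less['No'] * p_c_no[g] + p_w_less['Yes'] * (1 - p_c_no[g])
    print(f"P(W=Less | G={g}) = {p:.2f}")    # 0.25 and 0.60

# Bayes' rule for P(G | C): P(G=Girl|C=No) = P(C=No|Girl) P(Girl) / P(C=No).
p_no = p_c_no['Boy'] * p_boy + p_c_no['Girl'] * (1 - p_boy)
print(f"P(G=Girl | C=No)  = {p_c_no['Girl'] * 0.5 / p_no:.2f}")              # 0.89
print(f"P(G=Girl | C=Yes) = {(1 - p_c_no['Girl']) * 0.5 / (1 - p_no):.2f}")  # 0.18
```
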
Toy Problem: Gender Prediction (Incorrect Model)
- Assumed structure: C <- G -> W, i.e. C and W are independent given G.
- P(G, C, W) = P(G) P(C | G) P(W | G).
- Suppose an "oracle" gives us P(C | G); we still need to estimate P(G) and P(W | G).

Toy Problem: Gender Prediction (Incorrect Model, Labeled Data Only)
- The estimates are unbiased, with variance inversely proportional to the size of the labeled set D_L.
- Even a small D_L produces good estimates.

Toy Problem: Gender Prediction (Incorrect Model, Labeled Data Only)
- Estimates: P(G = Boy) ~ 0.5, P(W = Less | G = Girl) ~ 0.6, P(W = Less | G = Boy) ~ 0.25.
- Resulting a posteriori probabilities:

  C    W      P(G = Girl | C, W)   P(G = Boy | C, W)
  No   Less   0.95                 0.05
  No   More   0.81                 0.19
  Yes  Less   0.35                 0.65
  Yes  More   0.11                 0.89

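A small sketch that reproduces this table from the estimates above, under the incorrect factorization P(G) P(C | G) P(W | G) (the helper name posteriors is hypothetical):

```python
def posteriors(p_girl, p_w_less_g):
    """Print P(G | C, W) under the incorrect model P(G) P(C|G) P(W|G).

    p_w_less_g maps gender -> P(W = Less | G); P(C | G) is the
    oracle-given table from the toy problem.
    """
    p_c_no = {'Boy': 0.1, 'Girl': 0.8}
    p_g = {'Girl': p_girl, 'Boy': 1 - p_girl}
    for c in ('No', 'Yes'):
        for w in ('Less', 'More'):
            joint = {}
            for g in ('Girl', 'Boy'):
                pc = p_c_no[g] if c == 'No' else 1 - p_c_no[g]
                pw = p_w_less_g[g] if w == 'Less' else 1 - p_w_less_g[g]
                joint[g] = p_g[g] * pc * pw
            z = joint['Girl'] + joint['Boy']
            print(f"C={c:<3} W={w:<4}  P(Girl)={joint['Girl']/z:.2f}  "
                  f"P(Boy)={joint['Boy']/z:.2f}")

# Labeled-data-only estimates: reproduces the table above.
posteriors(0.5, {'Girl': 0.6, 'Boy': 0.25})
```
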
Toy Problem: Gender Prediction (Incorrect Model, Labeled Data Only)
- Classify with the maximum a posteriori value of G.
- The "bias" from the "true" a posteriori probabilities is not zero, but the induced decision rule is the same as the optimal Bayes rule from the correct model.
- The classifier therefore still attains the minimum classification error.

Toy Problem: Gender Prediction (Incorrect Model + Unlabeled Data)
- In the limit |D_L| / |D_U| -> 0, the maximum-likelihood estimates tend to:
  - P(G = Boy) = 0.5
  - P(W = Less | G = Girl) = 0.78
  - P(W = Less | G = Boy) = 0.07

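These limits can be reproduced numerically. By the asymptotic theorem presented later in this section, in this limit the maximum-likelihood estimate maximizes the expected log-likelihood of the unlabeled (C, W) data; a grid-search sketch (the grid resolution and names are illustrative):

```python
import itertools, math

# True parameters (G -> C -> W) from the toy problem.
p_c_no = {'Boy': 0.1, 'Girl': 0.8}          # P(C = No | G)
p_w_less_c = {'No': 0.7, 'Yes': 0.2}        # P(W = Less | C)

def p_cw(c, w):
    """True marginal P(C, W), with G summed out (P(G) = 0.5)."""
    total = 0.0
    for g in ('Girl', 'Boy'):
        pc = p_c_no[g] if c == 'No' else 1 - p_c_no[g]
        pw = p_w_less_c[c] if w == 'Less' else 1 - p_w_less_c[c]
        total += 0.5 * pc * pw
    return total

def expected_loglik(a, b):
    """E[log P(C, W | a, b)] under the incorrect model C <- G -> W,
    with P(G) fixed at 0.5 and P(C | G) given by the oracle.
    a = P(W = Less | Girl), b = P(W = Less | Boy)."""
    total = 0.0
    for c, w in itertools.product(('No', 'Yes'), ('Less', 'More')):
        m = 0.0
        for g, pw_less in (('Girl', a), ('Boy', b)):
            pc = p_c_no[g] if c == 'No' else 1 - p_c_no[g]
            m += 0.5 * pc * (pw_less if w == 'Less' else 1 - pw_less)
        total += p_cw(c, w) * math.log(m)
    return total

grid = [i / 100 for i in range(1, 100)]
a, b = max(itertools.product(grid, grid), key=lambda ab: expected_loglik(*ab))
print(a, b)   # approximately 0.78 and 0.07
```
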
Toy Problem: Gender Prediction (Incorrect Model + Unlabeled Data)
- The a posteriori probabilities for G become:

  C    W      P(G = Girl | C, W)   P(G = Boy | C, W)
  No   Less   0.99                 0.01
  No   More   0.65                 0.35
  Yes  Less   0.71                 0.29
  Yes  More   0.05                 0.95

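Re-running the earlier posteriors sketch with these asymptotic estimates reproduces the table:

```python
# Asymptotic estimates under |D_L|/|D_U| -> 0: the decision for
# (C = Yes, W = Less) flips from Boy to Girl.
posteriors(0.5, {'Girl': 0.78, 'Boy': 0.07})
```
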
Toy Problem: Gender Prediction (Incorrect Model + Unlabeled Data)
- The classifier now chooses Girl over Boy in 3 of the 4 configurations; the prediction for (C = Yes, W = Less) has changed from the optimal rule.
- The expected error rate increases.
- What happened? The unlabeled data changed the asymptotic limit of the estimates: when the model is incorrect, the effect of unlabeled data matters.

Asymptotic Analysis: Setup
- (X, Y): instance vector and class label; binary classes with values -1 and +1.
- Assume 0-1 loss; applying the Bayes rule gives the Bayes error.
- n independent samples: l labeled and u unlabeled, with n = l + u.

Asymptotic Analysis: Sampling Model
- Each sample is labeled with probability h and unlabeled with probability (1 - h).
- P(X, Y | Q) is the parametric form of the model; parameters are fit with EM.

Asymptotic Analysis: Likelihood
- The likelihood combines labeled and unlabeled data:
  log L(Q) = sum over labeled (X, Y) of log P(X, Y | Q) + sum over unlabeled X of log P(X | Q),
  where P(X | Q) = sum over y of P(X, y | Q).

Asymptotic Analysis: Limit
- The parameter estimate is obtained by maximizing this likelihood; as n -> infinity, it maximizes the expected log-likelihood
  E[ h log P(X, Y | Q) + (1 - h) log P(X | Q) ],
  with the expectation taken under the true distribution.

Theorem on Asymptotic Analysis
- The limiting value Q* of the maximum-likelihood estimates is
  Q* = argmax_Q ( h E[log P(X, Y | Q)] + (1 - h) E[log P(X | Q)] ).

Theorem on Asymptotic Analysis: Notation
- Qh* is the value of Q that maximizes the objective above for labeling probability h.
- Ql* is the optimum from labeled data only (h = 1).
- Qu* is the optimum from unlabeled data only (h = 0).

Theorem on Asymptotic Analysis: Correct Model
- If the model is correct, P(X, Y | QT) = P(X, Y) for some QT.
- Then QT = Ql* = Qu* = Qh*, and the asymptotic bias is zero.

Theorem on Asymptotic Analysis: Incorrect Model
- Now assume the model is incorrect: P(X, Y) does not belong to the family P(X, Y | Q).
- Let e(Q) be the classification error with parameter Q, and assume e(Ql*) < e(Qu*).

Theorem on Asymptotic Analysis: Incorrect Model
- Training on labeled data alone drives the error toward e(Ql*).
- As unlabeled data are added, the error moves toward e(Qu*).
- So, under these assumptions, using only labeled data yields a smaller classification error.

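A sketch that makes this trajectory concrete on the toy problem: for several values of h it grid-searches the two free parameters of the incorrect model for Qh*, then evaluates e(Qh*) under the true distribution (all names and the grid resolution are illustrative):

```python
import itertools, math

GENDERS, CS, WS = ('Girl', 'Boy'), ('No', 'Yes'), ('Less', 'More')
p_c_no = {'Boy': 0.1, 'Girl': 0.8}          # P(C = No | G), shared by both models
p_w_less_c = {'No': 0.7, 'Yes': 0.2}        # true P(W = Less | C)

def p_true(g, c, w):
    """True joint P(G, C, W) = P(G) P(C | G) P(W | C)."""
    pc = p_c_no[g] if c == 'No' else 1 - p_c_no[g]
    pw = p_w_less_c[c] if w == 'Less' else 1 - p_w_less_c[c]
    return 0.5 * pc * pw

def p_model(g, c, w, a, b):
    """Incorrect model P(G) P(C | G) P(W | G), with P(G) = 0.5,
    a = P(W = Less | Girl), b = P(W = Less | Boy)."""
    pc = p_c_no[g] if c == 'No' else 1 - p_c_no[g]
    pw_less = a if g == 'Girl' else b
    return 0.5 * pc * (pw_less if w == 'Less' else 1 - pw_less)

def objective(a, b, h):
    """h E[log P(G,C,W | Q)] + (1 - h) E[log P(C,W | Q)]: the limit objective."""
    total = 0.0
    for g, c, w in itertools.product(GENDERS, CS, WS):
        marg = sum(p_model(g2, c, w, a, b) for g2 in GENDERS)
        total += p_true(g, c, w) * (h * math.log(p_model(g, c, w, a, b))
                                    + (1 - h) * math.log(marg))
    return total

def error(a, b):
    """Classification error of the MAP decision rule induced by (a, b)."""
    err = 0.0
    for c, w in itertools.product(CS, WS):
        guess = max(GENDERS, key=lambda g: p_model(g, c, w, a, b))
        err += sum(p_true(g, c, w) for g in GENDERS if g != guess)
    return err

grid = [i / 100 for i in range(1, 100)]
for h in (1.0, 0.5, 0.1, 0.0):
    a, b = max(itertools.product(grid, grid), key=lambda ab: objective(*ab, h))
    print(f"h={h:.1f}  a={a:.2f}  b={b:.2f}  error={error(a, b):.3f}")
# The error climbs from e(Ql*) = 0.15 toward e(Qu*) = 0.22 as h shrinks.
```
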