1
Co-training LING 572 Fei Xia 02/21/06
2
Overview Proposed by Blum and Mitchell (1998) Important work: –(Nigam and Ghani, 2000) –(Goldman and Zhou, 2000) –(Abney, 2002) –(Sarkar, 2002) –… Used in document classification, parsing, etc.
3
Outline Basic concept: (Blum and Mitchell, 1998) Relation with other SSL algorithms: (Nigam and Ghani, 2000)
4
An example Web-page classification: e.g., find homepages of faculty members. –Page text: words occurring on that page e.g., “research interest”, “teaching” –Hyperlink text: words occurring in hyperlinks that point to that page: e.g., “my advisor”
5
Two views Features can be split into two sets: –The instance space: X = X1 × X2 –Each example: x = (x1, x2) D: the distribution over X. C1: the set of target functions over X1. C2: the set of target functions over X2.
6
Assumption #1: compatibility The instance distribution D is compatible with the target function f = (f1, f2) if for any x = (x1, x2) with non-zero probability, f(x) = f1(x1) = f2(x2). The compatibility of f with D: Each set of features is sufficient for classification
7
Assumption #2: conditional independence
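The formula for this slide did not survive in the transcript. As a hedged reconstruction, using the notation x = (x1, x2) and f from the previous slides, the usual statement of the assumption is that the two views are independent of each other once the label is known:

```latex
P\bigl(x_1 \mid f(x),\, x_2\bigr) = P\bigl(x_1 \mid f(x)\bigr),
\qquad
P\bigl(x_2 \mid f(x),\, x_1\bigr) = P\bigl(x_2 \mid f(x)\bigr)
```

In the web-page example, this says that once we know whether a page is a faculty homepage, the words on the page tell us nothing further about the words in the hyperlinks pointing to it, and vice versa.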
8
Co-training algorithm
9
Co-training algorithm (cont) Why use U’ in addition to U? –Using U’ yields better results. –Possible explanation: this forces h1 and h2 to select examples that are more representative of the underlying distribution D that generates U. Choosing p and n: the ratio p/n should match the ratio of positive to negative examples in D. Choosing the number of iterations and the size of U’.
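The algorithm box itself is missing from the transcript, so the following is only a sketch of the loop as it is usually described for Blum and Mitchell (1998): train h1 and h2 on the labeled set L, let each move its p most confident positives and n most confident negatives from the pool U’ into L, then refill U’ from U. The toy Naive Bayes learner `train_nb`, the function name `co_train`, the sequential handling of the two classifiers' picks, and the example data are all assumptions made for illustration.

```python
import math
import random
from collections import Counter

def train_nb(docs, labels):
    """Toy binary Naive Bayes (Laplace smoothing); returns words -> P(y=1 | words)."""
    vocab = {w for d in docs for w in d}
    counts = {0: Counter(), 1: Counter()}
    for d, y in zip(docs, labels):
        counts[y].update(d)
    totals = {y: sum(counts[y].values()) + len(vocab) for y in (0, 1)}
    priors = {y: (labels.count(y) + 1) / (len(labels) + 2) for y in (0, 1)}

    def prob_pos(words):
        # Words never seen in training are ignored.
        s = {y: math.log(priors[y])
                + sum(math.log((counts[y][w] + 1) / totals[y]) for w in words if w in vocab)
             for y in (0, 1)}
        return 1.0 / (1.0 + math.exp(s[0] - s[1]))
    return prob_pos

def co_train(labeled, unlabeled, p=1, n=3, iterations=30, pool_size=75, seed=0):
    """labeled: [((view1_words, view2_words), y)]; unlabeled: [(view1_words, view2_words)]."""
    rng = random.Random(seed)
    labeled, unlabeled = list(labeled), list(unlabeled)
    rng.shuffle(unlabeled)
    pool = [unlabeled.pop() for _ in range(min(pool_size, len(unlabeled)))]  # U'

    def fit(view):
        return train_nb([x[view] for x, _ in labeled], [y for _, y in labeled])

    for _ in range(iterations):
        if not pool:
            break
        h = [fit(0), fit(1)]  # train h1 and h2 on the current L
        for view in (0, 1):
            # Rank the pool by this view's P(y=1): lowest first, highest last.
            ranked = sorted(range(len(pool)), key=lambda i: h[view](pool[i][view]))
            chosen = {i: 0 for i in ranked[:n]}            # n most confident negatives
            positives = ranked[n:][-p:] if p > 0 else []   # p most confident positives
            chosen.update({i: 1 for i in positives})
            labeled.extend((pool[i], y) for i, y in chosen.items())
            pool = [x for i, x in enumerate(pool) if i not in chosen]
        while len(pool) < pool_size and unlabeled:         # replenish U' from U
            pool.append(unlabeled.pop())
    return fit(0), fit(1), labeled

# Hypothetical toy data: view 1 = page words, view 2 = hyperlink words.
labeled = [((["research"], ["advisor"]), 1), ((["homework"], ["course"]), 0)]
unlabeled = [
    (["research", "teaching"], ["professor"]),
    (["teaching"], ["professor", "advisor"]),
    (["research"], ["advisor", "professor"]),
    (["homework", "syllabus"], ["class"]),
    (["syllabus"], ["class", "course"]),
    (["homework"], ["course", "class"]),
]
h1, h2, grown = co_train(labeled, unlabeled, p=1, n=1, iterations=5, pool_size=6)
print(len(grown), round(h1(["teaching"]), 2), round(h2(["professor"]), 2))
```

In this toy run neither "teaching" nor "professor" appears in the two hand-labeled examples; each classifier learns the word only from examples that the other view labeled.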
10
Intuition behind the co-training algorithm h1 adds examples to the labeled set that h2 will be able to use for learning, and vice versa. If the conditional independence assumption holds, then on average each added document will be as informative as a random document, and the learning will progress.
11
Experiments: setting 1051 web pages from 4 CS depts, manually labeled into a number of categories (e.g., “course home page”) –263 pages (25%) as test data –The remaining 75% of pages: labeled data: 3 positive and 9 negative examples; unlabeled data: the rest (776 pages) Two views: –View #1 (page-based): words in the page –View #2 (hyperlink-based): words in the hyperlinks Learner: Naïve Bayes
12
Naïve Bayes classifier (Nigam and Ghani, 2000)
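The formula on this slide is missing from the transcript. As a rough sketch of the kind of learner meant here, a multinomial Naive Bayes text classifier scores a document d for class c as P(c) times the product of P(w | c) over the words w in d. The function names, the Laplace smoothing, and the two-document training set below are assumptions for illustration, not the exact formulation used in the paper.

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """Estimate log P(c) and Laplace-smoothed log P(w | c) from word lists."""
    classes = sorted(set(labels))
    vocab = {w for d in docs for w in d}
    log_prior = {c: math.log(labels.count(c) / len(labels)) for c in classes}
    counts = {c: Counter() for c in classes}
    for words, c in zip(docs, labels):
        counts[c].update(words)
    log_lik = {}
    for c in classes:
        total = sum(counts[c].values()) + len(vocab)
        log_lik[c] = {w: math.log((counts[c][w] + 1) / total) for w in vocab}
    return log_prior, log_lik

def posterior(model, words):
    """Return P(c | words) for every class; words outside the vocabulary are ignored."""
    log_prior, log_lik = model
    scores = {c: log_prior[c] + sum(log_lik[c][w] for w in words if w in log_lik[c])
              for c in log_prior}
    top = max(scores.values())  # subtract the max before exponentiating, for stability
    unnorm = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}

# Hypothetical two-document training set.
model = train_nb([["research", "teaching"], ["course", "homework"]],
                 ["faculty", "course"])
post = posterior(model, ["research"])
print(post)
```

The posterior, rather than just the top class, is what co-training uses to rank unlabeled examples by confidence and what EM uses as a soft label.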
13
Experiment: results (error rates, %)

                      Page-based   Hyperlink-based   Combined
                      classifier   classifier        classifier
Supervised training   12.9         12.4              11.1
Co-training            6.2         11.6               5.0

p=1, n=3; # of iterations: 30; |U’| = 75
14
Questions Can co-training algorithms be applied to datasets without natural feature divisions? How sensitive are the co-training algorithms to the correctness of the assumptions? What is the relation between co-training and other SSL methods (e.g., self-training)?
15
(Nigam and Ghani, 2000)
16
EM Pool the features together. Use the initial labeled data to get initial parameter estimates. In each iteration, use all the data (labeled and unlabeled) to re-estimate the parameters. Repeat until convergence.
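The four steps on this slide can be sketched with a Naive Bayes model that accepts fractional (soft) labels. This is a minimal illustration, not the authors' implementation: `fit_nb`, `em`, the fixed iteration count used in place of a convergence test, and the toy documents are all assumptions.

```python
import math
from collections import defaultdict

def fit_nb(docs, weights, vocab):
    """M-step: weights[i] = P(y=1 | doc i). Returns words -> P(y=1 | words)."""
    mass = [1.0, 1.0]  # smoothed class mass; the shared normaliser cancels in the odds
    counts = [defaultdict(float), defaultdict(float)]
    for words, w1 in zip(docs, weights):
        for y, w in ((0, 1.0 - w1), (1, w1)):
            mass[y] += w
            for t in words:
                counts[y][t] += w
    totals = [sum(counts[y].values()) + len(vocab) for y in (0, 1)]

    def prob_pos(words):
        s = [math.log(mass[y])
             + sum(math.log((counts[y][t] + 1.0) / totals[y]) for t in words if t in vocab)
             for y in (0, 1)]
        return 1.0 / (1.0 + math.exp(s[0] - s[1]))
    return prob_pos

def em(labeled_docs, labels, unlabeled_docs, iterations=10):
    docs = labeled_docs + unlabeled_docs           # features are pooled, no view split
    vocab = {t for d in docs for t in d}
    hard = [float(y) for y in labels]
    model = fit_nb(labeled_docs, hard, vocab)      # initial estimates from labeled data
    for _ in range(iterations):                    # fixed count stands in for convergence
        soft = [model(d) for d in unlabeled_docs]  # E-step: probabilistic labels
        model = fit_nb(docs, hard + soft, vocab)   # M-step: re-estimate on all data
    return model

# Hypothetical toy data.
L_docs, L_y = [["research"], ["homework"]], [1, 0]
U_docs = [["research", "teaching"], ["teaching", "advisor"],
          ["homework", "syllabus"], ["syllabus", "course"]]
vocab = {t for d in L_docs + U_docs for t in d}
supervised = fit_nb(L_docs, [1.0, 0.0], vocab)
model = em(L_docs, L_y, U_docs)
print(supervised(["teaching"]), model(["teaching"]))
```

The supervised model has no evidence about "teaching", while the EM model picks it up from the unlabeled document in which it co-occurs with "research".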
17
Experimental results: WebKB course database EM performs better than co-training. Both are close to the supervised method when trained on more labeled data.
18
Another experiment: The News 2*2 dataset A semi-artificial dataset Conditional independence assumption holds. Co-training outperforms EM and the “oracle” result.
19
Co-training vs. EM Co-training splits features, EM does not. Co-training incrementally uses the unlabeled data. EM probabilistically labels all the data at each round; EM iteratively uses the unlabeled data.
20
Co-EM: EM with feature split Repeat until convergence: –Train the A-feature-set classifier using the labeled data and the unlabeled data with B’s labels –Use classifier A to probabilistically label all the unlabeled data –Train the B-feature-set classifier using the labeled data and the unlabeled data with A’s labels –B re-labels the data for use by A.
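A minimal sketch of this loop, under stated assumptions: classifier A is initialised from the labeled data alone (no labels from B exist in the first round), a fixed number of rounds replaces the convergence test, and `fit_nb`, `co_em`, and the toy data are hypothetical names and values chosen for illustration.

```python
import math
from collections import defaultdict

def fit_nb(docs, weights, vocab):
    """Soft-label Naive Bayes: weights[i] = P(y=1 | doc i). Returns words -> P(y=1 | words)."""
    mass = [1.0, 1.0]
    counts = [defaultdict(float), defaultdict(float)]
    for words, w1 in zip(docs, weights):
        for y, w in ((0, 1.0 - w1), (1, w1)):
            mass[y] += w
            for t in words:
                counts[y][t] += w
    totals = [sum(counts[y].values()) + len(vocab) for y in (0, 1)]

    def prob_pos(words):
        s = [math.log(mass[y])
             + sum(math.log((counts[y][t] + 1.0) / totals[y]) for t in words if t in vocab)
             for y in (0, 1)]
        return 1.0 / (1.0 + math.exp(s[0] - s[1]))
    return prob_pos

def co_em(labeled, labels, unlabeled, iterations=10):
    """labeled / unlabeled: lists of (view_a_words, view_b_words) pairs."""
    hard = [float(y) for y in labels]
    n = len(labeled)
    views = [[x[v] for x in labeled + unlabeled] for v in (0, 1)]
    vocab = [{t for d in views[v] for t in d} for v in (0, 1)]
    clf_a = fit_nb(views[0][:n], hard, vocab[0])         # first round: labeled data only
    clf_b = None
    for _ in range(iterations):
        soft_a = [clf_a(d) for d in views[0][n:]]        # A labels all unlabeled data
        clf_b = fit_nb(views[1], hard + soft_a, vocab[1])  # B trains on L + U with A's labels
        soft_b = [clf_b(d) for d in views[1][n:]]        # B re-labels the data for A
        clf_a = fit_nb(views[0], hard + soft_b, vocab[0])  # A trains on L + U with B's labels
    return clf_a, clf_b

# Hypothetical toy data: view A = page words, view B = hyperlink words.
labeled = [(["research"], ["advisor"]), (["homework"], ["course"])]
labels = [1, 0]
unlabeled = [(["research"], ["professor"]), (["teaching"], ["professor"]),
             (["homework"], ["class"]), (["syllabus"], ["class"])]
clf_a, clf_b = co_em(labeled, labels, unlabeled)
print(clf_a(["teaching"]), clf_b(["professor"]))
```

Unlike co-training, every unlabeled example receives a soft label in every round, and unlike plain EM, each classifier only ever sees labels produced by the other view.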
21
Four SSL methods Results on the News 2*2 dataset
22
Random feature split Co-training: 3.7% → 5.5% Co-EM: 3.3% → 5.1% When the conditional independence assumption does not hold, but there is sufficient redundancy among the features, co-training still works well.
23
Assumptions Assumptions made by the underlying classifier (supervised learner): –Naïve Bayes: words occur independently of each other, given the class of the document. –Co-training uses the classifier to rank the unlabeled examples by confidence. –EM uses the classifier to assign probabilities to each unlabeled example. Assumptions made by SSL method: –Co-training: conditional independence assumption. –EM: maximizing likelihood correlates with reducing classification errors.
24
Summary of (Nigam and Ghani, 2000) Comparison of four SSL methods: self-training, co-training, EM, co-EM. The performance of the SSL methods depends on how well the underlying assumptions are met. Randomly splitting the features is not as good as a natural split, but it still works if there is sufficient redundancy among the features.
25
Variations of co-training Goldman and Zhou (2000) use two learners of different types, but both take the whole feature set. Zhou and Li (2005) use three learners. If two agree, the data is used to teach the third learner. Balcan et al. (2005) relax the conditional independence assumption to a much weaker expansion condition.
26
An alternative? L → L1, L → L2 U → U1, U → U2 Repeat –Train h1 using L1 on Feat Set1 –Train h2 using L2 on Feat Set2 –Classify U2 with h1 and let U2’ be the subset with the most confident scores; L2 + U2’ → L2, U2 − U2’ → U2 –Classify U1 with h2 and let U1’ be the subset with the most confident scores; L1 + U1’ → L1, U1 − U1’ → U1
27
Yarowsky’s algorithm –one-sense-per-discourse → View #1: the ID of the document that a word is in –one-sense-per-collocation → View #2: the local context of the word in the document Yarowsky’s algorithm is a special case of co-training (Blum & Mitchell, 1998). Is this correct? No, according to (Abney, 2002).
28
Summary of co-training The original paper: (Blum and Mitchell, 1998) –Two “independent” views: split the features into two sets. –Train a classifier on each view. –Each classifier labels data that can be used to train the other classifier. Extension: –Relax the conditional independence assumptions –Instead of using two views, use two or more classifiers trained on the whole feature set.
29
Summary of SSL Goal: use both labeled and unlabeled data. Many algorithms: EM, co-EM, self-training, co-training, … Each algorithm is based on some assumptions. SSL works well when the assumptions are satisfied.
30
Additional slides
31
Rule independence H1 (H2) consists of rules that are functions of X1 (X2, resp) only.
32
EM: the data is generated according to some simple known parametric model. –Ex: the positive examples are generated according to an n-dimensional Gaussian D+ centered around the point