
1 Co-Training and Expansion: Towards Bridging Theory and Practice
Maria-Florina Balcan, Avrim Blum, Ke Yang
Carnegie Mellon University, Computer Science Department

2 Combining Labeled and Unlabeled Data (a.k.a. Semi-Supervised Learning)
Many applications have lots of unlabeled data, but labeled data is rare or expensive:
– Web page and document classification
– OCR, image classification
Several methods have been developed to try to use unlabeled data to improve performance, e.g.:
– Transductive SVM
– Co-training
– Graph-based methods

3 Co-training: method for combining labeled & unlabeled data
Works in scenarios where examples have distinct, yet sufficient feature sets:
– An example has two views: x = (x1, x2).
– The belief is that the two parts of the example are consistent, i.e. ∃ c1, c2 such that c1(x1) = c2(x2) on all examples.
– Each view is sufficient for correct classification.
Works by using unlabeled data to propagate learned information.
(Figure: positive examples shown in the two views X1 and X2.)

4 Co-Training: method for combining labeled & unlabeled data
For example, if we want to classify web pages:
(Figure: a web page "My Advisor, Prof. Avrim Blum" split into its two views.)
– x1: link info
– x2: text info
– x: link info & text info

5 Iterative Co-Training
Have learning algorithms A1, A2 on each of the two views.
Use labeled data to learn two initial hypotheses h1, h2.
Look through unlabeled data to find examples where one of the hi is confident but the other is not.
Have the confident hi label the example for algorithm A(3-i).
Repeat.
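A minimal Python sketch of this loop is below, assuming hypothetical learner objects A1 and A2 that expose fit, predict, and confidence methods; these interfaces and the confidence threshold are illustrative choices, not part of the original slides.

    # Sketch of iterative co-training with hypothetical learner interfaces.
    def co_train(A1, A2, labeled, unlabeled, rounds=10, threshold=0.95):
        """labeled: list of ((x1, x2), y); unlabeled: list of (x1, x2)."""
        L1 = [(x1, y) for (x1, _), y in labeled]     # view-1 training data
        L2 = [(x2, y) for (_, x2), y in labeled]     # view-2 training data
        h1, h2 = A1.fit(L1), A2.fit(L2)              # initial hypotheses from labeled data
        pool = list(unlabeled)
        for _ in range(rounds):
            remaining = []
            for x1, x2 in pool:
                p1, p2 = h1.confidence(x1), h2.confidence(x2)
                if p1 >= threshold and p2 < threshold:
                    L2.append((x2, h1.predict(x1)))  # confident h1 labels it for A2
                elif p2 >= threshold and p1 < threshold:
                    L1.append((x1, h2.predict(x2)))  # confident h2 labels it for A1
                else:
                    remaining.append((x1, x2))
            pool = remaining
            h1, h2 = A1.fit(L1), A2.fit(L2)          # retrain both sides and repeat
        return h1, h2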

6 Iterative Co-Training. A Simple Example: Learning Intervals
Use labeled data to learn initial hypotheses h1^1 and h2^1; use unlabeled data to bootstrap.
(Figure: target intervals c1 and c2 on the two views, labeled and unlabeled examples, and successive hypotheses h1^1, h2^1, h1^2, h2^2 growing over the rounds.)
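To make the picture concrete, here is a small Python sketch of the interval case, under my illustrative assumption that each view-specific learner outputs the tightest interval around the positives it currently trusts and is confident exactly on points inside that interval.

    # Learning intervals from positive data only: a hypothesis is the tightest
    # interval around the confident positives seen so far (illustrative assumption).
    def tightest_interval(points):
        return (min(points), max(points))

    def covers(interval, x):
        lo, hi = interval
        return lo <= x <= hi

    def co_train_intervals(labeled_pos, unlabeled, rounds=5):
        """labeled_pos: [(x1, x2)] known positives; unlabeled: [(x1, x2)]."""
        S1 = [x1 for x1, _ in labeled_pos]           # confident positives, view 1
        S2 = [x2 for _, x2 in labeled_pos]           # confident positives, view 2
        for _ in range(rounds):
            h1 = tightest_interval(S1)               # h1^t
            h2 = tightest_interval(S2)               # h2^t
            for x1, x2 in unlabeled:
                if covers(h1, x1):                   # h1 is confident, so it informs view 2
                    S2.append(x2)
                if covers(h2, x2):                   # h2 is confident, so it informs view 1
                    S1.append(x1)
        return tightest_interval(S1), tightest_interval(S2)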

7 Theoretical/Conceptual Question
What properties do we need for co-training to work well?
Need assumptions about:
– the underlying data distribution
– the learning algorithms on the two sides

8 Theoretical/Conceptual Question
What property of the data do we need for co-training to work well?
Previous work:
1) Independence given the label
2) Weak rule dependence
Our work: a much weaker assumption about how the data should behave, an expansion property of the underlying distribution.
Though we will need a stronger assumption on the learning algorithm compared to (1).

9 Co-Training, Formal Setting
Assume that examples are drawn from a distribution D over the instance space X.
Let c be the target function; assume that each view is sufficient for correct classification:
– c can be decomposed into c1, c2 over each view such that D has no probability mass on examples x with c1(x1) ≠ c2(x2).
Let X+ and X− denote the positive and negative regions of X.
Let D+ and D− be the marginal distributions of D over X+ and X− respectively.
(Figure: the instance space split into the regions D+ and D−.)

10 (Formalization)
We assume that D+ is expanding.
Expansion: for any S1 ⊆ X1+ and S2 ⊆ X2+,
Pr(S1 ⊕ S2) ≥ ε · min[ Pr(S1 ∧ S2), Pr(¬S1 ∧ ¬S2) ],
where probabilities are taken over D+ and ⊕ denotes exclusive or.
This is a natural analog of the graph-theoretic notions of conductance and expansion.
(Figure: confident sets S1 and S2 inside D+.)
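For a small, finite positive distribution the condition can be checked by brute force over all pairs of candidate confident sets; the Python sketch below does exactly that, using the expansion inequality as written above (the dictionary-based representation and function name are illustrative assumptions).

    from itertools import chain, combinations

    # Brute-force check of epsilon-expansion on a finite positive distribution.
    # D_plus: dict mapping (x1, x2) -> probability (assumed to sum to 1).
    # Exponential in the number of distinct view values, so only for tiny examples.
    def is_expanding(D_plus, eps):
        V1 = {x1 for x1, _ in D_plus}
        V2 = {x2 for _, x2 in D_plus}

        def subsets(vals):
            vals = list(vals)
            return chain.from_iterable(combinations(vals, r) for r in range(len(vals) + 1))

        def pr(event):                               # probability of a predicate over (x1, x2)
            return sum(p for (x1, x2), p in D_plus.items() if event(x1, x2))

        for S1 in map(set, subsets(V1)):
            for S2 in map(set, subsets(V2)):
                xor = pr(lambda a, b: (a in S1) != (b in S2))
                both = pr(lambda a, b: a in S1 and b in S2)
                neither = pr(lambda a, b: a not in S1 and b not in S2)
                if xor < eps * min(both, neither):
                    return False                     # found a non-expanding pair (S1, S2)
        return True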

11 Property of the underlying distribution
Necessary condition for co-training to work well:
– If S1 and S2 (our confident sets) do not expand, then we might never see examples for which one hypothesis could help the other.
We show that expansion is also sufficient for co-training to generalize well in a relatively small number of iterations, under some assumptions:
– the data is perfectly separable
– we have strong learning algorithms on the two sides

12 Expansion, Examples: Learning Intervals
(Figure: two distributions D+ for target intervals c1 and c2. Left: a non-expanding distribution, with zero probability mass in some regions. Right: an expanding distribution, with confident sets S1 and S2 marked.)

13 Weaker than independence given the label & than weak rule dependence
E.g., w.h.p. a random degree-3 bipartite graph is expanding, but would NOT have independence given the label or weak rule dependence.
(Figure: confident sets S1 and S2 inside D+, with D− alongside.)

14 Main Result
Assume D+ is ε-expanding.
Assume that on each of the two views we have algorithms A1 and A2 for learning from positive data only.
Assume that we have initial confident sets S1^0 and S2^0 that carry non-negligible probability mass under D+.
Then iterative co-training generalizes well in a relatively small number of iterations (the quantitative statement is in the paper).

15 Main Result, Interpretation
The assumption on A1, A2 implies that they never generalize incorrectly.
The question is: what needs to be true for them to actually generalize to the whole of D+?
(Figure: the positive regions X1+ and X2+ in the two views.)

16 Main Result, Proof Idea
Expansion implies that at each iteration there is reasonable probability mass on "new, useful" data.
The algorithms generalize to most of this new region.
See the paper for the real proof.

17 What if assumptions are violated?
What if our algorithms can make incorrect generalizations and/or there is no perfect separability?

18 What if assumptions are violated?
Expect "leakage" into the negative region.
If the negative region is expanding too, then incorrect generalizations will grow at an exponential rate.
Correct generalizations grow at an exponential rate too, but will slow down first.
So we expect overall accuracy to go up and then come down.

19 Synthetic Experiments
Create a 2n-by-2n bipartite graph:
– nodes 1 to n on each side represent positive clusters
– nodes n+1 to 2n on each side represent negative clusters
Connect each node on the left to 3 nodes on the right:
– each neighbor is chosen with probability 1 − η to be a random node of the same class, and with probability η (a small noise rate) to be a random node of the opposite class
Begin with an initial confident set and then propagate confidence through rounds of co-training:
– monitor the percentage of the positive class covered, the percentage of the negative class mistakenly covered, and the overall accuracy
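A compact Python sketch of this experiment is below; the random seed, the one-node initial confident set, and the exact confidence-propagation rule are my assumptions rather than details taken from the slides.

    import random

    # Synthetic co-training experiment on a random 2n-by-2n bipartite graph
    # (a sketch of the construction described above; parameter names are mine).
    def run_experiment(n=5000, d=3, noise=0.01, rounds=20, seed=0):
        rng = random.Random(seed)
        pos = range(n)                        # clusters 0..n-1 are positive (both sides)
        neg = range(n, 2 * n)                 # clusters n..2n-1 are negative
        nbrs = {}                             # d right-side neighbors for each left node
        for u in range(2 * n):
            same = pos if u < n else neg
            diff = neg if u < n else pos
            nbrs[u] = [rng.choice(same if rng.random() > noise else diff) for _ in range(d)]

        conf_left, conf_right = {0}, set()    # start with one confident positive left node
        for t in range(rounds):
            conf_right |= {v for u in conf_left for v in nbrs[u]}                 # left informs right
            conf_left |= {u for u in range(2 * n) if set(nbrs[u]) & conf_right}   # right informs left
            covered_pos = sum(u < n for u in conf_left) / n                       # positives reached
            covered_neg = sum(u >= n for u in conf_left) / n                      # negatives mistakenly reached
            accuracy = (covered_pos + (1 - covered_neg)) / 2
            print(f"round {t + 1}: pos={covered_pos:.3f} neg={covered_neg:.3f} acc={accuracy:.3f}")

    # Example usage: run_experiment(n=5000, d=3, noise=0.01)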

20 Synthetic Experiments
(Plots for two noise settings, η = 0.01 and η = 0.001, with n = 5000 and d = 3. In each plot the solid line indicates overall accuracy, the green curve is accuracy on positives, and the red curve is accuracy on negatives.)

21 Conclusions
We propose a much weaker expansion assumption on the underlying data distribution.
It seems to be the "right" condition on the distribution for co-training to work well.
It directly motivates the iterative nature of many practical co-training based algorithms.

