Co-Training and Expansion: Towards Bridging Theory and Practice
Maria-Florina Balcan, Avrim Blum, Ke Yang
Carnegie Mellon University, Computer Science Department

2 Combining Labeled and Unlabeled Data (a.k.a. Semi-supervised Learning)
Many applications have lots of unlabeled data, but labeled data is rare or expensive:
– Web page and document classification
– OCR, image classification
Several methods have been developed to try to use unlabeled data to improve performance, e.g.:
– Transductive SVM
– Co-training
– Graph-based methods

3 Co-training: a method for combining labeled & unlabeled data
Works in scenarios where examples have distinct, yet individually sufficient, feature sets (views):
– An example is a pair x = ⟨x1, x2⟩ with x1 ∈ X1 and x2 ∈ X2.
– Belief is that the two parts of the example are consistent, i.e. ∃ c1, c2 such that c1(x1) = c2(x2) for every example; each view is sufficient for correct classification.
Works by using unlabeled data to propagate learned information.

4 Co-Training: method for combining labeled & unlabeled data
For example, if we want to classify web pages:
– x1 – link info (e.g., the anchor text "My Advisor" on pages linking to the page)
– x2 – text info (e.g., the words "Prof. Avrim Blum" on the page itself)
– x – link info & text info
[Figure: a page whose link "My Advisor" points to Prof. Avrim Blum's home page]

5 Iterative Co-Training
– Have learning algorithms A1, A2 on each of the two views.
– Use labeled data to learn two initial hypotheses h1, h2.
– Look through unlabeled data to find examples where one of the hi is confident but the other is not.
– Have the confident hi label the example for the other algorithm A(3-i).
– Repeat.
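To make the loop above concrete, here is a minimal sketch in Python of a generic iterative co-training procedure. The classifier interface (scikit-learn-style fit/predict/predict_proba), the confidence threshold, and the helper names are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of iterative co-training (illustrative; not the authors' code).
# Assumes scikit-learn-style classifiers with fit(), predict(), predict_proba().
import numpy as np

def co_train(A1, A2, X1_lab, X2_lab, y_lab, X1_unlab, X2_unlab,
             confidence=0.95, rounds=10):
    """A1, A2: learners for view 1 and view 2.
    X1_lab/X2_lab: the labeled examples split into the two views."""
    X1, X2, y = list(X1_lab), list(X2_lab), list(y_lab)
    U1, U2 = list(X1_unlab), list(X2_unlab)
    for _ in range(rounds):
        A1.fit(np.array(X1), np.array(y))
        A2.fit(np.array(X2), np.array(y))
        if not U1:
            break
        p1 = A1.predict_proba(np.array(U1)).max(axis=1)
        p2 = A2.predict_proba(np.array(U2)).max(axis=1)
        # An example is moved when exactly one view is confident about it;
        # the confident view supplies the label for the other view's learner.
        move = [i for i in range(len(U1))
                if (p1[i] >= confidence) != (p2[i] >= confidence)]
        if not move:
            break
        for i in move:
            if p1[i] >= confidence:
                label = A1.predict(np.array([U1[i]]))[0]
            else:
                label = A2.predict(np.array([U2[i]]))[0]
            X1.append(U1[i]); X2.append(U2[i]); y.append(label)
        keep = [i for i in range(len(U1)) if i not in set(move)]
        U1 = [U1[i] for i in keep]
        U2 = [U2[i] for i in keep]
    # Final refit on all self-labeled data before returning.
    A1.fit(np.array(X1), np.array(y))
    A2.fit(np.array(X2), np.array(y))
    return A1, A2
```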

6 Iterative Co-Training – A Simple Example: Learning Intervals
– The target concepts c1, c2 are intervals, one in each view.
– Use labeled data to learn initial hypotheses h1^1 and h2^1.
– Use unlabeled data to bootstrap: examples that one view's current interval already covers let the other view widen its interval, yielding h1^2, h2^2, and so on.
[Figure: intervals c1, c2 with labeled and unlabeled examples; the hypotheses h1^1, h2^1 grow toward c1, c2 over successive rounds.]
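The intervals example can be simulated directly. The sketch below is a toy illustration under assumed conditions (the specific target intervals, the number of examples, and independent views given a positive label are choices made here, not taken from the slides): each view learns the smallest interval containing its confidently positive points, and a point that one view's interval already covers is handed to the other view as a new positive.

```python
# Toy co-training on intervals (illustrative sketch, not from the slides).
# Each view of an example is a real number; the target in each view is an interval.
import random

def learn_interval(points):
    """Smallest interval containing the given (confidently positive) points."""
    return (min(points), max(points))

def inside(interval, x):
    lo, hi = interval
    return lo <= x <= hi

random.seed(0)
c1 = (0.4, 0.8)                       # assumed true interval in view 1
c2 = (0.3, 0.7)                       # assumed true interval in view 2
# Positive examples: pairs (x1, x2), each view drawn inside its target interval.
positives = [(random.uniform(*c1), random.uniform(*c2)) for _ in range(500)]

labeled = positives[:3]               # a few labeled positives
unlabeled = positives[3:]

S1 = [x1 for x1, _ in labeled]        # confident positives seen in view 1
S2 = [x2 for _, x2 in labeled]        # confident positives seen in view 2
for _ in range(20):                   # rounds of co-training
    h1, h2 = learn_interval(S1), learn_interval(S2)
    grew = False
    for x1, x2 in unlabeled:
        # If one view is already confident (point inside its interval),
        # it hands the example to the other view as a new positive.
        if inside(h1, x1) and not inside(h2, x2):
            S2.append(x2); grew = True
        elif inside(h2, x2) and not inside(h1, x1):
            S1.append(x1); grew = True
    if not grew:
        break

print("learned view-1 interval:", learn_interval(S1), "target:", c1)
print("learned view-2 interval:", learn_interval(S2), "target:", c2)
```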

7 Theoretical/Conceptual Question
What properties do we need for co-training to work well? Need assumptions about:
– the underlying data distribution
– the learning algorithms on the two sides

8 Theoretical/Conceptual Question
What property of the data do we need for co-training to work well?
Previous work:
1) Independence given the label
2) Weak rule dependence
Our work – a much weaker assumption about how the data should behave: an expansion property of the underlying distribution.
Though we will need a stronger assumption on the learning algorithms compared to (1).

9 Co-Training, Formal Setting
– Assume that examples are drawn from a distribution D over the instance space X = X1 × X2.
– Let c be the target function; assume that each view is sufficient for correct classification: c can be decomposed into c1, c2 over the two views s.t. D has no probability mass on examples x with c1(x1) ≠ c2(x2).
– Let X+ and X- denote the positive and negative regions of X, and let D+ and D- be the marginal distributions of D over X+ and X- respectively.
– Let S1 ⊆ X1 and S2 ⊆ X2 – think of S1, S2 as the confident sets of the two learning algorithms in their respective views.
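The setting can be restated compactly in notation. The block below is a paraphrase of the bullets above (the {±1} label convention and the conditioning notation are choices made here), not an exact formula from the slides.

```latex
% Compact restatement of the formal setting (paraphrase; labels taken in {+1,-1}).
% Instance space and distribution:
x = \langle x_1, x_2 \rangle \in X = X_1 \times X_2, \qquad x \sim D .
% View-consistency of the target c = (c_1, c_2):
\Pr_{x \sim D}\big[\, c_1(x_1) \neq c_2(x_2) \,\big] = 0 .
% Positive/negative regions and the induced distributions:
X^{+} = \{ x : c(x) = +1 \}, \quad X^{-} = \{ x : c(x) = -1 \}, \qquad
D^{+} = D\big|_{X^{+}}, \quad D^{-} = D\big|_{X^{-}} .
```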

10 (Formalization)
We assume that D+ is expanding.
Expansion: for any S1 ⊆ X1+ and S2 ⊆ X2+ (think of these as the confident sets in the two views),
Pr(S1 ⊕ S2) ≥ ε · min[ Pr(S1 ∧ S2), Pr(S1ᶜ ∧ S2ᶜ) ],
where probabilities are over D+, S1 ∧ S2 is the event that x1 ∈ S1 and x2 ∈ S2, S1 ⊕ S2 is the event that exactly one of x1 ∈ S1, x2 ∈ S2 holds, and Sᶜ denotes the complement within the positive region of that view.
This is a natural analog of the graph-theoretic notions of conductance and expansion.
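As a sanity check on the definition, the sketch below computes the expansion quantity for a tiny, hand-built discrete positive distribution; the toy distribution and the function names are illustrative, not from the paper.

```python
# Numeric check of the expansion definition on a toy, finite positive distribution D+.
# (Illustrative sketch only; the toy distribution and names are not from the paper.)
from itertools import chain, combinations

# Toy D+ over pairs (x1, x2): keys are view values, values are probabilities (sum to 1).
D_plus = {
    ('a', 'p'): 0.25, ('a', 'q'): 0.25,
    ('b', 'q'): 0.25, ('b', 'r'): 0.25,
}

def expansion_ratio(D, S1, S2):
    """Pr(exactly one of x1 in S1, x2 in S2) / min(Pr(both), Pr(neither))."""
    both = sum(p for (x1, x2), p in D.items() if x1 in S1 and x2 in S2)
    neither = sum(p for (x1, x2), p in D.items() if x1 not in S1 and x2 not in S2)
    one = sum(p for (x1, x2), p in D.items() if (x1 in S1) != (x2 in S2))
    denom = min(both, neither)
    return float('inf') if denom == 0 else one / denom

def empirical_epsilon(D):
    """Largest eps such that the expansion inequality holds for every S1, S2."""
    X1 = {x1 for x1, _ in D}
    X2 = {x2 for _, x2 in D}
    powerset = lambda s: chain.from_iterable(
        combinations(list(s), r) for r in range(len(s) + 1))
    return min(expansion_ratio(D, set(S1), set(S2))
               for S1 in powerset(X1) for S2 in powerset(X2))

print(empirical_epsilon(D_plus))   # prints 1.0 for this toy distribution
```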

11 Property of the underlying distribution
Expansion is a necessary condition for co-training to work well:
– If S1 and S2 (our confident sets) do not expand, then we might never see examples for which one hypothesis could help the other.
We show it is also sufficient for co-training to generalize well in a relatively small number of iterations, under some assumptions:
– the data is perfectly separable
– we have strong learning algorithms on the two sides

12 Expansion, Examples: Learning Intervals
[Figure: interval targets c1, c2 and the positive distribution D+, shown for a non-expanding distribution and an expanding distribution. In the non-expanding case there is zero probability mass in parts of the region, so the confident sets S1, S2 can get stuck; in the expanding case they can keep growing.]

13 Expansion is weaker than independence given the label & than weak rule dependence.
E.g., w.h.p. a random degree-3 bipartite graph is expanding, but would NOT have independence given the label, or weak rule dependence.
[Figure: confident sets S1, S2 inside the positive region D+, alongside D-]

14 Main Result
– Assume D+ is ε-expanding.
– Assume that on each of the two views we have algorithms A1 and A2 for learning from positive data only.
– Assume that we have initial confident sets S1^0 and S2^0 with non-negligible probability mass under D+.
Then co-training generalizes well over D+ in a relatively small number of iterations.

15 Main Result, Interpretation
The assumption on A1, A2 implies that they never generalize incorrectly: the confident sets always remain inside the positive regions X1+ and X2+.
The question is: what needs to be true for them to actually generalize to the whole of D+?
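One hedged way to write this interpretation down (notation follows the slides; this is a paraphrase, not a verbatim theorem statement):

```latex
% "Never generalize incorrectly": at every round t the confident sets stay inside
% the positive regions of the two views,
S_1^{t} \subseteq X_1^{+} \quad\text{and}\quad S_2^{t} \subseteq X_2^{+} \qquad \text{for all } t .
% The open question is then whether they eventually cover (almost) all of D^+:
\Pr_{D^{+}}\!\big[\, (x_1 \in S_1^{t}) \wedge (x_2 \in S_2^{t}) \,\big] \;\longrightarrow\; 1
\quad \text{as } t \text{ grows, under } \varepsilon\text{-expansion.}
```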

16 Main Result, Proof Idea
– Expansion implies that at each iteration, there is reasonable probability mass on "new, useful" data.
– The algorithms generalize to most of this new region.
See the paper for the full proof.

17 What if assumptions are violated? What if our algorithms can make incorrect generalizations and/or there is no perfect separability?

18 What if assumptions are violated?
– Expect "leakage" into the negative region.
– If the negative region is expanding too, then incorrect generalizations will grow at an exponential rate.
– Correct generalizations grow at an exponential rate too, but will slow down first (there is only so much of the positive region left to cover).
– Expect overall accuracy to go up and then come down.

19 Synthetic Experiments
Create a 2n-by-2n bipartite graph:
– nodes 1 to n on each side represent positive clusters
– nodes n+1 to 2n on each side represent negative clusters
Connect each node on the left to 3 nodes on the right:
– each neighbor is chosen with probability 1-α to be a random node of the same class, and with probability α to be a random node of the opposite class
Begin with an initial confident set and then propagate confidence through rounds of co-training:
– monitor the percentage of the positive class covered, the percentage of the negative class mistakenly covered, and the overall accuracy
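A minimal simulation sketch of this setup is given below. The confidence-propagation rule (a node becomes confident once it has a confident neighbor), the single-node initial confident set, and the accuracy convention (balanced classes; uncovered nodes predicted negative) are simplifying assumptions made here, not details taken from the paper.

```python
# Sketch of the synthetic bipartite-graph experiment (illustrative assumptions:
# propagation rule, seed size, and accuracy convention are ours, not the paper's).
import random

def run(n=1000, d=3, alpha=0.01, rounds=30, seed=0):
    rng = random.Random(seed)
    # Left/right nodes 0..n-1 are positive clusters, n..2n-1 are negative clusters.
    def neighbors(left_node):
        positive = left_node < n
        nbrs = []
        for _ in range(d):
            same_class = rng.random() > alpha
            if positive == same_class:          # neighbor lands in the positive block
                nbrs.append(rng.randrange(0, n))
            else:                               # neighbor lands in the negative block
                nbrs.append(rng.randrange(n, 2 * n))
        return nbrs
    graph = {u: neighbors(u) for u in range(2 * n)}   # edges from each left node

    # Confident sets: start from a single confidently-positive left node.
    conf_left, conf_right = {0}, set()
    for t in range(rounds):
        # Propagate confidence across the bipartite graph in both directions.
        conf_right |= {v for u in conf_left for v in graph[u]}
        conf_left |= {u for u in range(2 * n)
                      if any(v in conf_right for v in graph[u])}
        pos_covered = len([u for u in conf_left if u < n]) / n
        neg_covered = len([u for u in conf_left if u >= n]) / n
        # Overall accuracy assuming balanced classes; uncovered = predicted negative.
        accuracy = (pos_covered + (1 - neg_covered)) / 2
        print(f"round {t:2d}: positives covered {pos_covered:.3f}, "
              f"negatives mistakenly covered {neg_covered:.3f}, "
              f"accuracy {accuracy:.3f}")

run(n=1000, d=3, alpha=0.01)
```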

20 Synthetic Experiments
[Plots for two runs, α=0.01 and α=0.001, both with n=5000, d=3. The solid line indicates overall accuracy, the green curve is accuracy on positives, and the red curve is accuracy on negatives.]

21 Conclusions
– We propose a much weaker expansion assumption on the underlying data distribution.
– It seems to be the "right" condition on the distribution for co-training to work well.
– It directly motivates the iterative nature of many practical co-training-based algorithms.