Learning from Labeled and Unlabeled Data Tom Mitchell Statistical Approaches to Learning and Discovery, 10-702 and 15-802 March 31, 2003.



When can Unlabeled Data help supervised learning?
Consider this setting:
–Set X of instances drawn from unknown P(X)
–f: X → Y target function (or, P(Y|X))
–Set H of possible hypotheses for f
Given:
–iid labeled examples
–iid unlabeled examples
Determine: a hypothesis in H approximating f

When can Unlabeled Data help supervised learning?
Important question! In many cases, unlabeled data is plentiful and labeled data expensive:
–Medical outcomes (x = …, y = outcome)
–Text classification (x = document, y = relevance)
–User modeling (x = user actions, y = user intent)
–…

Four Ways to Use Unlabeled Data for Supervised Learning
1. Use to reweight labeled examples
2. Use to help EM learn class-specific generative models
3. If problem has redundantly sufficient features, use CoTraining
4. Use to detect/preempt overfitting

1. Use U to reweight labeled examples
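This slide survives only as a heading, so here is a hedged sketch of one common instantiation of the idea (the Gaussian-kernel density estimate, function names, and 1-D data are illustrative assumptions, not the slide's content): estimate P(X) from the plentiful unlabeled sample, then weight each labeled example by its estimated density so that the small labeled set better reflects the true input distribution.

```python
import numpy as np

def kde_weights(x_labeled, x_unlabeled, bandwidth=0.5):
    """Weight each labeled example by a kernel density estimate of P(X)
    fit on the unlabeled sample (1-D Gaussian kernel for simplicity)."""
    diffs = x_labeled[:, None] - x_unlabeled[None, :]        # shape (n_l, n_u)
    density = np.exp(-0.5 * (diffs / bandwidth) ** 2).mean(axis=1)
    return density / density.sum()                           # normalize to sum to 1

rng = np.random.default_rng(0)
x_u = rng.normal(0.0, 1.0, size=1000)     # plentiful unlabeled sample ~ P(X)
x_l = np.array([-2.0, 0.0, 0.1, 3.0])     # small labeled sample
w = kde_weights(x_l, x_u)
# Labeled points near the bulk of P(X) receive larger weight than outliers;
# w can then be passed to any learner that accepts per-example weights.
```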

2. Use U with EM and Assumed Generative Model
[Figure: Naïve Bayes model with class Y generating features X1, X2, X3, X4, and a data table in which labeled rows have Y given and unlabeled rows have Y = ?]
Learn P(Y|X)

From [Nigam et al., 2000]

E Step: P(y_j | d_i) ∝ P(y_j) ∏_t P(w_t | y_j)^N(w_t, d_i)
M Step: P(w_t | y_j) = (1 + Σ_i N(w_t, d_i) P(y_j | d_i)) / (|V| + Σ_s Σ_i N(w_s, d_i) P(y_j | d_i))
where w_t is the t-th word in the vocabulary and N(w_t, d_i) is the count of w_t in document d_i

Elaboration 1: Downweight the influence of unlabeled examples by a factor λ (0 ≤ λ ≤ 1)
New M step: counts contributed by unlabeled documents are weighted by λ
λ chosen by cross validation
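A compact sketch of the EM procedure for a multinomial Naïve Bayes text model in the spirit of [Nigam et al., 2000], including the λ down-weighting of unlabeled counts; the toy vocabulary, documents, and function name are illustrative assumptions:

```python
import numpy as np

def em_naive_bayes(Xl, yl, Xu, n_classes=2, lam=1.0, n_iter=20):
    """EM for a multinomial Naive Bayes text model: the E step infers class
    posteriors for unlabeled documents, the M step re-estimates parameters
    from labeled counts plus lam-weighted unlabeled counts (Laplace-smoothed)."""
    V = Xl.shape[1]
    resp_l = np.eye(n_classes)[yl]                      # fixed posteriors for labeled docs
    resp_u = np.full((len(Xu), n_classes), 1.0 / n_classes)
    for _ in range(n_iter):
        # M step
        R = np.vstack([resp_l, lam * resp_u])
        X = np.vstack([Xl, Xu])
        prior = R.sum(axis=0) / R.sum()
        counts = R.T @ X                                # expected word counts per class
        word = (1.0 + counts) / (V + counts.sum(axis=1, keepdims=True))
        # E step: posterior P(y | d) for each unlabeled document
        logp = np.log(prior) + Xu @ np.log(word).T
        logp -= logp.max(axis=1, keepdims=True)
        resp_u = np.exp(logp)
        resp_u /= resp_u.sum(axis=1, keepdims=True)
    return prior, word, resp_u

# Toy corpus: 4-word vocabulary, one labeled document per class, three unlabeled.
Xl = np.array([[5, 1, 0, 0], [0, 0, 4, 2]], dtype=float)
yl = np.array([0, 1])
Xu = np.array([[3, 2, 0, 0], [0, 1, 5, 1], [4, 0, 1, 0]], dtype=float)
prior, word, resp = em_naive_bayes(Xl, yl, Xu, lam=0.3)
# resp rows are P(y | d) for the unlabeled documents.
```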

Experimental Evaluation
–Newsgroup postings: 20 newsgroups, 1000/group
–Web page classification: student, faculty, course, project; 4199 web pages
–Reuters newswire articles: 12,902 articles; 90 topic categories

20 Newsgroups

Using one labeled example per class

2. Use U with EM and Assumed Generative Model
Can't really get something for nothing…
–But unlabeled data is useful to the degree that the assumed form for P(X,Y) is correct
–E.g., in text classification, useful despite obvious error in assumed form of P(X,Y)

3. If Problem Setting Provides Redundantly Sufficient Features, use CoTraining

Redundantly Sufficient Features
[Figure: a web page about Professor Faloutsos, reached by a hyperlink whose anchor text reads "my advisor"; either the words on the page or the words on the hyperlink suffice to classify the page]

CoTraining Algorithm #1 [Blum & Mitchell, 1998]
Given: labeled data L, unlabeled data U
Loop:
–Train g1 (hyperlink classifier) using L
–Train g2 (page classifier) using L
–Allow g1 to label p positive, n negative examples from U
–Allow g2 to label p positive, n negative examples from U
–Add these self-labeled examples to L
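A minimal runnable sketch of this loop; the nearest-centroid stand-in classifier, synthetic two-view data, and confidence rule are my assumptions (the original used Naïve Bayes text classifiers over page and hyperlink words):

```python
import numpy as np

class Centroid:
    """Tiny stand-in classifier: nearest class centroid, with a score per class."""
    def fit(self, X, y):
        self.classes = np.unique(y)
        self.mu = np.array([X[y == c].mean(axis=0) for c in self.classes])
        return self
    def scores(self, X):
        return -((X[:, None, :] - self.mu[None, :, :]) ** 2).sum(axis=2)
    def predict(self, X):
        return self.classes[self.scores(X).argmax(axis=1)]

def cotrain(X1, X2, y, U1, U2, rounds=5, k=1):
    """Each round, train g1 on view 1 and g2 on view 2 of the labeled pool,
    then let each classifier move its k most confident unlabeled examples
    per class into the pool with self-assigned labels."""
    X1, X2, y = X1.copy(), X2.copy(), y.copy()
    avail = np.ones(len(U1), dtype=bool)
    for _ in range(rounds):
        g1, g2 = Centroid().fit(X1, y), Centroid().fit(X2, y)
        for g, U in ((g1, U1), (g2, U2)):
            idx = np.where(avail)[0]
            if len(idx) == 0:
                break
            s = g.scores(U[idx])
            pred, conf = g.classes[s.argmax(axis=1)], s.max(axis=1)
            for c in g.classes:
                m = pred == c
                if m.any():
                    pick = idx[m][np.argsort(conf[m])[-k:]]   # most confident for class c
                    X1 = np.vstack([X1, U1[pick]])
                    X2 = np.vstack([X2, U2[pick]])
                    y = np.concatenate([y, np.full(len(pick), c)])
                    avail[pick] = False
    return Centroid().fit(X1, y), Centroid().fit(X2, y)

# Toy data: two redundant 2-D views, either one sufficient to separate the classes.
rng = np.random.default_rng(1)
true_y = np.repeat([0, 1], 40)
shift = np.where(true_y[:, None] == 0, -2.0, 2.0)
U1 = rng.normal(size=(80, 2)) + shift
U2 = rng.normal(size=(80, 2)) + shift
X1 = np.array([[-2.0, -2.0], [2.0, 2.0]])   # one labeled example per class
X2 = np.array([[-2.0, -2.0], [2.0, 2.0]])
y0 = np.array([0, 1])
g1, g2 = cotrain(X1, X2, y0, U1, U2)
accuracy = (g1.predict(U1) == true_y).mean()
```

Each classifier's self-labeled picks enlarge the pool the other classifier trains on in the next round, which is what lets two redundant views bootstrap each other from a tiny labeled set.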

CoTraining: Experimental Results
–Begin with 12 labeled web pages (academic course)
–Provide 1,000 additional unlabeled web pages
–Average error learning from labeled data: 11.1%; average error cotraining: 5.0%
[Figure: a typical run]

Co-Training Rote Learner
[Figure: bipartite graph linking hyperlink phrases such as "my advisor" to pages; a rote learner propagates labels within connected components]


Expected Rote CoTraining error given m examples:
E[error] = Σ_j P(g_j) (1 − P(g_j))^m
where g_j is the jth connected component of the graph
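For a rote learner, a test point is misclassified exactly when its connected component was never observed among the m training examples, giving expected error Σ_j P(g_j) (1 − P(g_j))^m. A tiny numeric check of that expression (the component masses below are made up):

```python
import numpy as np

def expected_rote_error(component_mass, m):
    """Expected rote co-training error after m iid examples: a test point in
    component g_j (mass P(g_j)) is misclassified iff none of the m training
    examples landed in g_j, which happens with probability (1 - P(g_j))^m."""
    p = np.asarray(component_mass, dtype=float)
    return float((p * (1.0 - p) ** m).sum())

coarse = expected_rote_error([0.5, 0.5], m=10)   # two big components
fine = expected_rote_error([0.1] * 10, m=10)     # ten small components
# The same mass split into finer components needs more examples.
```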

How many unlabeled examples suffice?
Want to assure that connected components in the underlying distribution, G_D, are connected components in the observed sample, G_S.
O(log(N)/α) examples assure that with high probability, G_S has the same connected components as G_D [Karger, 94], where N is the size of G_D and α is the min cut over all connected components of G_D.

CoTraining Setting
If:
–x1, x2 conditionally independent given y
–f is PAC learnable from noisy labeled data
Then:
–f is PAC learnable from a weak initial classifier plus unlabeled data

PAC Generalization Bounds on CoTraining [Dasgupta et al., NIPS 2001]

What if CoTraining Assumption Not Perfectly Satisfied?
–Idea: want classifiers that produce a maximally consistent labeling of the data
–If learning is an optimization problem, what function should we optimize?

What Objective Function?
–E1, E2: error on labeled examples (one term for each classifier)
–E3: disagreement over unlabeled examples
–E4: misfit to estimated class priors

What Function Approximators?
–Same fn form as Naïve Bayes, Max Entropy
–Use gradient descent to simultaneously learn g1 and g2, directly minimizing E = E1 + E2 + E3 + E4
–No word independence assumption; use both labeled and unlabeled data
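A hedged sketch of minimizing E = E1 + E2 + E3 + E4 for two logistic (maximum-entropy form) classifiers: the exact forms of E3 and E4 on the slide are not given, so the squared disagreement and squared prior-misfit terms below, along with the synthetic two-view data, are plausible stand-ins rather than the original formulation.

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def make_objective(X1l, X2l, y, X1u, X2u, prior=0.5):
    """E = E1 + E2 + E3 + E4 over the joint parameters of g1 and g2."""
    d1 = X1l.shape[1]
    def E(theta):
        w1, w2 = theta[:d1], theta[d1:]
        p1l, p2l = sigmoid(X1l @ w1), sigmoid(X2l @ w2)
        p1u, p2u = sigmoid(X1u @ w1), sigmoid(X2u @ w2)
        eps = 1e-9
        E1 = -np.mean(y * np.log(p1l + eps) + (1 - y) * np.log(1 - p1l + eps))  # labeled error, g1
        E2 = -np.mean(y * np.log(p2l + eps) + (1 - y) * np.log(1 - p2l + eps))  # labeled error, g2
        E3 = np.mean((p1u - p2u) ** 2)                                # disagreement over unlabeled
        E4 = (p1u.mean() - prior) ** 2 + (p2u.mean() - prior) ** 2    # misfit to class priors
        return E1 + E2 + E3 + E4
    return E

# Synthetic data: one informative feature plus a bias column in each view.
rng = np.random.default_rng(2)
yl = np.repeat([0, 1], 10)
yu = np.repeat([0, 1], 50)
view = lambda labels: np.column_stack([rng.normal(2.0 * labels - 1.0, 0.5), np.ones(len(labels))])
X1l, X2l, X1u, X2u = view(yl), view(yl), view(yu), view(yu)
E = make_objective(X1l, X2l, yl, X1u, X2u)
theta = minimize(E, np.zeros(4), options={"maxiter": 200}).x
preds = (sigmoid(X1u @ theta[:2]) > 0.5).astype(int)
accuracy = (preds == yu).mean()
```

The generic optimizer stands in for the slide's gradient descent; the point is that all four terms are differentiable in the joint parameters of g1 and g2, so they can be minimized simultaneously.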

Classifying Jobs for FlipDog
–X1: job title
–X2: job description

Gradient CoTraining
Classifying FlipDog job descriptions: SysAdmin vs. WebProgrammer
Final accuracy:
–Labeled data alone: 86%
–CoTraining: 96%

Gradient CoTraining
Classifying Upper Case sequences as Person Names
[Table: error rates for two conditions (25 labeled + 5000 unlabeled; 2300 labeled + 5000 unlabeled) under three methods: using labeled data only, cotraining, and cotraining without fitting class priors (E4); reported errors include .11* and .15*]
* sensitive to weights of error terms E3 and E4

CoTraining Summary
–Unlabeled data improves supervised learning when example features are redundantly sufficient
  –Family of algorithms that train multiple classifiers
–Theoretical results
  –Expected error for rote learning
  –If X1, X2 conditionally indep given Y: PAC learnable from a weak initial classifier plus unlabeled data; error bounds in terms of disagreement between g1(x1) and g2(x2)
–Many real-world problems of this type
  –Semantic lexicon generation [Riloff, Jones 99], [Collins, Singer 99]
  –Web page classification [Blum, Mitchell 98]
  –Word sense disambiguation [Yarowsky 95]
  –Speech recognition [de Sa, Ballard 98]

4. Use U to Detect/Preempt Overfitting

Definition of distance metric:
–Non-negative: d(f,g) ≥ 0
–Symmetric: d(f,g) = d(g,f)
–Triangle inequality: d(f,g) ≤ d(f,h) + d(h,g)
Classification with zero-one loss: d(f,g) = E_x[ 1{f(x) ≠ g(x)} ]
Regression with squared loss: d(f,g) = sqrt( E_x[ (f(x) − g(x))^2 ] )
Both expectations are over P(X), so they can be estimated from unlabeled data.
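Both loss-specific distances are straightforward to estimate from a sample of unlabeled inputs; a small sketch (the functions f, g, h and classifiers c1, c2 are arbitrary illustrations):

```python
import numpy as np

def d01(f, g, X):
    """Zero-one-loss distance: fraction of (unlabeled) inputs where f, g disagree."""
    return float(np.mean(f(X) != g(X)))

def dsq(f, g, X):
    """Squared-loss distance: root mean squared difference of predictions."""
    return float(np.sqrt(np.mean((f(X) - g(X)) ** 2)))

X = np.linspace(-1.0, 1.0, 1001)          # stand-in for unlabeled inputs
c1 = lambda x: x > 0.0                    # two classifiers
c2 = lambda x: x > 0.2
f, g, h = (lambda x: x), (lambda x: x ** 2), (lambda x: 0.5 * x)

disagreement = d01(c1, c2, X)                          # ~ mass of the interval (0, 0.2]
lhs, rhs = dsq(f, g, X), dsq(f, h, X) + dsq(h, g, X)   # triangle inequality check
```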

Experimental Evaluation of TRI [Schuurmans & Southey, MLJ 2002] Use it to select degree of polynomial for regression Compare to alternatives such as cross validation, structural risk minimization, …
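A simplified sketch of the idea on synthetic polynomial regression (the exact TRI selection rule in the paper differs in details; the target function, noise level, and degree range here are made up): true errors satisfy d(f_k, f_l) ≤ err(f_k) + err(f_l) by the triangle inequality, so when the training-error estimates violate this against distances computed on unlabeled data, those estimates are over-optimistic and the more complex fit is flagged as overfit.

```python
import numpy as np

rng = np.random.default_rng(3)
f_true = lambda x: np.sin(2.0 * np.pi * x)
x_l = rng.uniform(0.0, 1.0, 15)
y_l = f_true(x_l) + rng.normal(0.0, 0.2, 15)   # small noisy labeled sample
x_u = rng.uniform(0.0, 1.0, 200)               # unlabeled inputs

degrees = range(1, 11)
fits = [np.polyfit(x_l, y_l, d) for d in degrees]
train_err = [np.sqrt(np.mean((np.polyval(c, x_l) - y_l) ** 2)) for c in fits]

def dist(c1, c2):
    """Squared-loss distance between two fits, estimated on unlabeled inputs."""
    return np.sqrt(np.mean((np.polyval(c1, x_u) - np.polyval(c2, x_u)) ** 2))

# True errors obey dist(f_k, f_l) <= err(f_k) + err(f_l).  Keep the most
# complex fit whose training-error estimate is consistent with every
# simpler fit under this test.
selected = 0
for k in range(1, len(fits)):
    if all(dist(fits[k], fits[l]) <= train_err[k] + train_err[l] for l in range(k)):
        selected = k
chosen_degree = list(degrees)[selected]
```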

Generated y values contain zero-mean Gaussian noise: y = f(x) + ε

Results using 200 unlabeled, t labeled examples
–Methods compared: cross validation (ten-fold), structural risk minimization, TRI
–Approximation ratio: (true error of selected hypothesis) / (true error of best hypothesis considered)
–Reported: worst performance over the top .50 of trials

Bound on Error of TRI Relative to Best Hypothesis Considered

Extension to TRI: Adjust for expected bias of training data estimates [Schuurmans & Southey, MLJ 2002] Experimental results: averaged over multiple target functions, outperforms TRI

Summary
Several ways to use unlabeled data in supervised learning; an ongoing research area.
1. Use to reweight labeled examples
2. Use to help EM learn class-specific generative models
3. If problem has redundantly sufficient features, use CoTraining
4. Use to detect/preempt overfitting

Further Reading
EM approach: K. Nigam, A. McCallum, S. Thrun, and T. Mitchell, "Text Classification from Labeled and Unlabeled Documents using EM," Machine Learning, 39, pp. 103–134, 2000.
CoTraining: A. Blum and T. Mitchell, "Combining Labeled and Unlabeled Data with Co-Training," Proceedings of the 11th Annual Conference on Computational Learning Theory (COLT-98).
S. Dasgupta, M. Littman, and D. McAllester, "PAC Generalization Bounds for Co-training," NIPS 2001.
Model selection: D. Schuurmans and F. Southey, "Metric-Based Methods for Adaptive Model Selection and Regularization," Machine Learning, 48, pp. 51–84, 2002.