Semi-Supervised Learning over Text Tom M. Mitchell Machine Learning Department Carnegie Mellon University September 2006 Modified by Charles Ling.

Statistical learning methods require LOTS of training data. Can we use all that unlabeled text?

Outline Maximizing likelihood in probabilistic models –EM for text classification Co-Training and redundantly predictive features –Document classification –Named entity recognition –Theoretical analysis Sample of additional tasks –Word sense disambiguation –Learning HTML-based extractors –Large-scale bootstrapping: extracting from the web

Many text learning tasks Document classification. –f: Doc → Class –Spam filtering, relevance rating, web page classification,... –and unsupervised document clustering Information extraction. –f: Sentence → Fact, f: Doc → Facts Parsing –f: Sentence → ParseTree –Related: part-of-speech tagging, co-reference res., prep phrase attachment Translation –f: EnglishDoc → FrenchDoc

1. Semi-supervised document classification (probabilistic model and EM)

Document Classification: Bag of Words Approach. A document is represented by its vector of word counts, e.g.: aardvark 0, about 2, all 2, Africa 1, apple 0, anxious 0, ..., gas 1, ..., oil 1, ..., Zaire 0.

Supervised: Naïve Bayes Learner. Train: for each class c_j of documents, (1) estimate P(c_j); (2) for each word w_i, estimate P(w_i | c_j). Classify(doc): assign doc to the most probable class, assuming words are conditionally independent given the class.
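A minimal sketch of these two steps, assuming documents are token lists and using Laplace smoothing; the function names and data layout are illustrative, not the course's released code:

```python
import math
from collections import Counter

def train_nb(docs, labels, vocab):
    """Estimate P(c_j) and P(w_i | c_j) from labeled documents (token lists)."""
    classes = sorted(set(labels))
    prior = {c: labels.count(c) / len(labels) for c in classes}          # P(c_j)
    counts = {c: Counter() for c in classes}
    for doc, c in zip(docs, labels):
        counts[c].update(doc)
    cond = {}
    for c in classes:
        total = sum(counts[c][w] for w in vocab)
        # P(w_i | c_j) with Laplace (add-one) smoothing
        cond[c] = {w: (counts[c][w] + 1) / (total + len(vocab)) for w in vocab}
    return prior, cond

def classify_nb(doc, prior, cond):
    """Assign doc to the most probable class, assuming words are conditionally independent given the class."""
    def log_score(c):
        return math.log(prior[c]) + sum(math.log(cond[c][w]) for w in doc if w in cond[c])
    return max(prior, key=log_score)
```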

For code and data, see the linked page and click on "Software and Data". [Figure: accuracy vs. number of training examples.]

What if we have labels for only some documents? [Figure: Naïve Bayes network Y → X1, X2, X3, X4, and a table of examples in which some rows have Y = ? (unlabeled).] Learn P(Y|X). EM: repeat until convergence: 1. use probabilistic labels to train classifier h; 2. apply h to assign probabilistic labels to the unlabeled data.
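A compact sketch of this EM loop for a multinomial Naïve Bayes over word counts; the names (Xl, yl, Xu) and details are illustrative assumptions, not the original implementation:

```python
import numpy as np

def semi_supervised_nb_em(Xl, yl, Xu, n_classes, n_iter=20):
    """EM with labeled count matrix Xl (n_l x |V|), integer labels yl,
    and unlabeled count matrix Xu (n_u x |V|).  Labeled documents keep
    their labels; only the unlabeled documents get probabilistic labels."""
    n_u = Xu.shape[0]
    Yl = np.eye(n_classes)[yl]                          # hard labels as one-hot "probabilities"
    Yu = np.full((n_u, n_classes), 1.0 / n_classes)     # start unlabeled docs at uniform
    for _ in range(n_iter):
        # M step: re-estimate parameters from all probabilistically labeled documents
        R = np.vstack([Yl, Yu])
        X = np.vstack([Xl, Xu])
        prior = R.sum(axis=0) / R.shape[0]              # P(c_j)
        word_counts = R.T @ X + 1.0                     # Laplace smoothing
        theta = word_counts / word_counts.sum(axis=1, keepdims=True)   # P(w | c_j)
        # E step: apply the model to assign probabilistic labels to the unlabeled docs
        log_post = np.log(prior) + Xu @ np.log(theta).T
        log_post -= log_post.max(axis=1, keepdims=True)
        Yu = np.exp(log_post)
        Yu /= Yu.sum(axis=1, keepdims=True)
    return prior, theta
```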

From [Nigam et al., 2000]

E step and M step: [update equations shown on slide], where w_t is the t-th word in the vocabulary.
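The update equations themselves are not reproduced in this transcript; in the standard formulation of Nigam et al. (2000), with Laplace smoothing and N(w_t, d_i) the count of word w_t in document d_i, they take roughly this form:

```latex
% E step: probabilistic class labels for each document d_i
P(c_j \mid d_i) \;\propto\; P(c_j)\prod_{t} P(w_t \mid c_j)^{N(w_t,\, d_i)}

% M step: re-estimate parameters from the probabilistically labeled documents
P(w_t \mid c_j) \;=\; \frac{1 + \sum_{i} N(w_t, d_i)\, P(c_j \mid d_i)}
                           {|V| + \sum_{s}\sum_{i} N(w_s, d_i)\, P(c_j \mid d_i)}
\qquad
P(c_j) \;=\; \frac{1 + \sum_{i} P(c_j \mid d_i)}{|C| + |D|}
```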

Using one labeled example per class: words sorted by P(w | course) / P(w | ¬course).

20 Newsgroups

Elaboration 1: downweight the influence of unlabeled examples by a factor λ. New M step: [counts from unlabeled documents are weighted by λ]; λ is chosen by cross-validation.
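Assuming the factor is the λ ∈ [0, 1] of Nigam et al.'s EM-λ (the symbol is dropped in this transcript), the weighted M step simply discounts the counts contributed by unlabeled documents, roughly:

```latex
% M step with unlabeled counts discounted by \lambda (L = labeled docs, U = unlabeled docs)
P(w_t \mid c_j) \;=\;
\frac{1 + \sum_{i \in L} N(w_t, d_i)\, P(c_j \mid d_i) + \lambda \sum_{i \in U} N(w_t, d_i)\, P(c_j \mid d_i)}
     {|V| + \sum_{s}\Bigl(\sum_{i \in L} N(w_s, d_i)\, P(c_j \mid d_i) + \lambda \sum_{i \in U} N(w_s, d_i)\, P(c_j \mid d_i)\Bigr)}
```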

Why/When will this work? What’s best case? Worst case? How can we test which we have?

EM for Semi-Supervised Doc Classification. If all data is labeled, this corresponds to supervised training of a Naïve Bayes classifier. If all data is unlabeled, it corresponds to mixture-of-multinomial clustering. If both labeled and unlabeled data are available, it helps if and only if the mixture-of-multinomial modeling assumption is correct. Of course we could extend this to Bayes net models other than Naïve Bayes (e.g., a TAN tree). Other extensions: model the negative class as a mixture of N multinomials.

2. Using Redundantly Predictive Features (Co-Training)

Redundantly Predictive Features. [Example figure: a web page about "Professor Faloutsos" and a hyperlink with anchor text "my advisor" pointing to it; either view alone predicts the page's class.]

Co-Training. [Figure: Classifier 1 produces Answer 1 from one view; Classifier 2 produces Answer 2 from the other view.] Key idea: Classifier 1 and Classifier 2 must: 1. correctly classify the labeled examples; 2. agree on the classification of the unlabeled examples.

CoTraining Algorithm #1 [Blum & Mitchell, 1998]. Given: labeled data L, unlabeled data U. Loop: train g1 (hyperlink classifier) using L; train g2 (page classifier) using L; allow g1 to label p positive, n negative examples from U; allow g2 to label p positive, n negative examples from U; add these self-labeled examples to L.
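A minimal sketch of this loop in Python, assuming scikit-learn-style classifiers with .fit / .predict_proba; make_g1, make_g2 and the data layout are hypothetical, not the original implementation:

```python
def co_training(make_g1, make_g2, L, U, p=1, n=3, rounds=30):
    """Sketch of the Blum & Mitchell (1998) loop.  L holds ((x1, x2), y) pairs,
    U holds unlabeled (x1, x2) pairs; make_g1 / make_g2 build fresh classifiers
    for the hyperlink view and the page view."""
    L, U = list(L), list(U)
    for _ in range(rounds):
        h1 = make_g1().fit([x1 for (x1, _), _ in L], [y for _, y in L])   # hyperlink classifier
        h2 = make_g2().fit([x2 for (_, x2), _ in L], [y for _, y in L])   # page classifier
        if not U:
            break
        for h, view in ((h1, 0), (h2, 1)):
            # let this classifier self-label its p most confident positives and n most confident negatives
            scored = sorted(U, key=lambda x: h.predict_proba([x[view]])[0][1])
            picked = [(x, 1) for x in scored[-p:]] + [(x, 0) for x in scored[:n]]
            for x, y in picked:
                if x in U:                     # the other view may already have claimed this example
                    U.remove(x)
                    L.append((x, y))
    return h1, h2
```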

CoTraining: Experimental Results. Begin with 12 labeled web pages (academic course); provide 1,000 additional unlabeled web pages. Average error learning from the labeled data: 11.1%; average error with co-training: 5.0%. [Figure: a typical run.]

Co-Training for Named Entity Extraction (i.e., classifying which strings refer to people, places, dates, etc.). [Figure: "I flew to New York today." Classifier 1 sees the string "New York"; Classifier 2 sees the context "I flew to ____ today".] [Riloff & Jones 98; Collins et al. 98; Jones 05]

CoTraining setting: wish to learn f: X → Y, given L and U drawn from P(X), where the features describing X can be partitioned (X = X1 × X2) such that f can be computed from either X1 or X2. One result [Blum & Mitchell 1998]: if X1 and X2 are conditionally independent given Y, and f is PAC learnable from noisy labeled data, then f is PAC learnable from a weak initial classifier plus unlabeled data.

Co-Training Rote Learner. [Figure: a bipartite graph connecting hyperlink texts such as "my advisor" to the pages they point to.]

Co-Training: What's the best-case graph (most benefit from unlabeled data)? What's the worst case? What does conditional independence imply about the graph?

Expected rote co-training error given m labeled examples: E[error] = Σ_j P(g_j) (1 - P(g_j))^m, where g_j is the j-th connected component of the graph over L + U and m is the number of labeled examples.

How many unlabeled examples suffice? Want to assure that connected components in the underlying distribution, G_D, are connected components in the observed sample, G_S. O(log(N)/α) examples assure that, with high probability, G_S has the same connected components as G_D [Karger, 94], where N is the size of G_D and α is the min cut over all connected components of G_D.

PAC Generalization Bounds on Co-Training [Dasgupta et al., NIPS 2001]. [Theorem statement shown on slide.] The theorem assumes X1 and X2 are conditionally independent given Y.

Co-Training Theory. [Diagram: final accuracy depends on the number of labeled examples, the number of unlabeled examples, dependencies among input features, the number of redundantly predictive inputs, and the correctness of confidence assessments.] How can we tune the learning environment to enhance the effectiveness of co-training? Best case: inputs conditionally independent given the class, an increased number of redundant inputs, ...

What if the CoTraining assumption is not perfectly satisfied? Idea: want classifiers that produce a maximally consistent labeling of the data. If learning is an optimization problem, what function should we optimize?

Example 2: Learning to extract named entities. "I arrived in Beijing on Saturday." Is "Beijing" a location? Learned rule: if "I arrived in X on Saturday." then Location(X).

Co-Training for Named Entity Extraction (i.e., classifying which strings refer to people, places, dates, etc.). [Figure: "I arrived in Beijing saturday." Classifier 1 sees the string "Beijing"; Classifier 2 sees the context "I arrived in __ saturday".] [Riloff & Jones 98; Collins et al. 98; Jones 05]

Bootstrap learning to extract named entities [Riloff and Jones, 1999], [Collins and Singer, 1999], ... [Figure: starting from a seed list of locations (Australia, Canada, China, England, France, Germany, Japan, Mexico, Switzerland, United_states), iterations alternately extract context patterns ("locations in ?x", "operations in ?x", "republic of ?x", ...) and new candidate locations (South Africa, United Kingdom, Warrenton, Far_East, Oregon, Lexington, Europe, U.S._A., Eastern Canada, Blair, Southwestern_states, Texas, States, Singapore, ..., Thailand, Maine, production_control, northern_Los, New_Zealand, eastern_Europe, Americas, Michigan, New_Hampshire, Hungary, south_america, district, Latin_America, Florida, ...).]

Co-EM [Nigam & Ghani, 2000; Jones 2005] Idea: Like co-training, use one set of features to label the other Like EM, iterate –Assigning probabilistic values to unobserved class labels –Updating model parameters (= labels of other feature set)
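A toy sketch of this alternation for a binary class, assuming only a co-occurrence matrix between the two feature sets (e.g. noun phrases and the contexts they appear in); the variable names and update form are illustrative assumptions, not Jones's actual rules:

```python
import numpy as np

def co_em(p1_init, cooccur, n_iter=10):
    """p1_init[i] is an initial P(class | view-1 feature i) from the seed labels;
    cooccur[i, j] counts how often view-1 feature i occurs with view-2 feature j.
    Each view's soft labels are re-estimated as the co-occurrence-weighted
    average of the other view's soft labels.  (A fuller version would keep the
    seed features clamped to their seed labels.)"""
    p1 = p1_init.copy()
    for _ in range(n_iter):
        p2 = (cooccur.T @ p1) / (cooccur.sum(axis=0) + 1e-12)   # label view 2 from view 1
        p1 = (cooccur @ p2) / (cooccur.sum(axis=1) + 1e-12)     # label view 1 from view 2
    return p1, p2
```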

CoEM applied to Named Entity Recognition [Rosie Jones, 2005], [Ghani & Nigam, 2000] Update rules:

[Jones, 2005] Can use this for active learning...

[Jones, 2005]

What if the CoTraining assumption is not perfectly satisfied? Idea: want classifiers that produce a maximally consistent labeling of the data. If learning is an optimization problem, what function should we optimize?

What Objective Function? Error on the labeled examples, disagreement over the unlabeled examples, and misfit to the estimated class priors (these are the terms E1 through E4 in E = E1 + E2 + E3 + E4, minimized below).
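The exact definitions are not reproduced in this transcript; one plausible set of squared-error forms consistent with these descriptions (an assumption, not necessarily the forms on the original slide) is:

```latex
% E1, E2: error of each view's classifier on the labeled examples
E_1 = \sum_{(x,y)\in L}\bigl(y - g_1(x_1)\bigr)^2,\qquad
E_2 = \sum_{(x,y)\in L}\bigl(y - g_2(x_2)\bigr)^2
% E3: disagreement between the two classifiers on the unlabeled examples
E_3 = \sum_{x\in U}\bigl(g_1(x_1) - g_2(x_2)\bigr)^2
% E4: misfit to the class prior \hat{p} estimated from the labeled data
E_4 = \Bigl(\hat{p} - \tfrac{1}{|U|}\textstyle\sum_{x\in U} g_1(x_1)\Bigr)^2
    + \Bigl(\hat{p} - \tfrac{1}{|U|}\textstyle\sum_{x\in U} g_2(x_2)\Bigr)^2
```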

What Function Approximators? g1 and g2 have the same functional form as logistic regression. Use gradient descent to simultaneously learn g1 and g2, directly minimizing E = E1 + E2 + E3 + E4. No word-independence assumption; uses both labeled and unlabeled data.
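A short numpy sketch of this joint gradient descent, under the squared-error assumptions above; E4 is omitted for brevity, constant factors are folded into the learning rate, and all names are hypothetical:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_cotrain(X1l, X2l, y, X1u, X2u, lr=0.1, steps=500):
    """Jointly fit two logistic-form classifiers g1, g2 by gradient descent on
    E1 + E2 + E3: squared error on the labeled data for each view, plus squared
    disagreement on the unlabeled data."""
    rng = np.random.default_rng(0)
    w1 = rng.normal(scale=0.01, size=X1l.shape[1])
    w2 = rng.normal(scale=0.01, size=X2l.shape[1])
    for _ in range(steps):
        g1l, g2l = sigmoid(X1l @ w1), sigmoid(X2l @ w2)     # labeled predictions, each view
        g1u, g2u = sigmoid(X1u @ w1), sigmoid(X2u @ w2)     # unlabeled predictions
        d = g1u - g2u                                       # disagreement on unlabeled data
        grad_w1 = X1l.T @ ((g1l - y) * g1l * (1 - g1l)) + X1u.T @ (d * g1u * (1 - g1u))
        grad_w2 = X2l.T @ ((g2l - y) * g2l * (1 - g2l)) - X2u.T @ (d * g2u * (1 - g2u))
        w1 -= lr * grad_w1
        w2 -= lr * grad_w2
    return w1, w2
```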

Gradient CoTraining: classifying capitalized sequences as person names, e.g., "Company president Mary Smith said today…" (x1 = the capitalized sequence, x2 = its surrounding context). [Results table: error rates with 25 labeled + 5,000 unlabeled and with 2,300 labeled + 5,000 unlabeled examples, for (a) using labeled data only, (b) co-training, and (c) co-training without fitting class priors (E4); the co-training error rates of .15 and .11 are marked (*) as sensitive to the weights of the error terms E3 and E4.]

Example 3: Word sense disambiguation [Yarowsky]. "bank" = river bank, or financial bank? Assumes a single word sense per document. X1: the document containing the word; X2: the immediate context of the word ("swim near the __"). Successfully learns "context → word sense" rules when the word occurs multiple times in documents.
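A toy sketch of Yarowsky-style bootstrapping with a decision list, assuming each occurrence is represented by its set of context features and a few seed rules (e.g. "river" → RIVER, "money" → FINANCE); the threshold and smoothing constants are illustrative, and the one-sense-per-document constraint of the full algorithm is only noted in a comment:

```python
import math
from collections import Counter, defaultdict

def yarowsky_bootstrap(instances, seed_rules, n_iter=10, threshold=2.0):
    """instances: list of feature sets, one per occurrence of the ambiguous word.
    seed_rules: dict mapping a few strongly indicative features to a sense label."""
    rules = dict(seed_rules)
    for _ in range(n_iter):
        # label every occurrence covered by the current decision list
        labeled = [(feats, next(rules[f] for f in feats if f in rules))
                   for feats in instances if any(f in rules for f in feats)]
        # count feature/sense co-occurrences over the self-labeled data
        counts = defaultdict(Counter)
        for feats, sense in labeled:
            for f in feats:
                counts[f][sense] += 1
        # rebuild the decision list: keep the seeds, add confident new feature -> sense rules
        rules = dict(seed_rules)
        for f, c in counts.items():
            best_sense, best = c.most_common(1)[0]
            rest = sum(c.values()) - best
            if math.log((best + 0.1) / (rest + 0.1)) > threshold:   # smoothed log-likelihood ratio
                rules.setdefault(f, best_sense)
        # (the full algorithm also enforces one sense per document/discourse)
    return rules
```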

Example 4: Bootstrap learning for IE from HTML structure [Muslea et al., 2001]. X1: HTML preceding the target; X2: HTML following the target.

Example bootstrap learning algorithms: classifying web pages [Blum & Mitchell 98; Slattery 99]; classifying email [Kiritchenko & Matwin 01; Chan et al. 04]; named entity extraction [Collins & Singer 99; Jones & Riloff 99]; wrapper induction [Muslea et al. 01; Mohapatra et al. 04]; word sense disambiguation [Yarowsky 96]; discovering new word senses [Pantel & Lin 02]; synonym discovery [Lin et al. 03]; relation extraction [Brin et al.; Yangarber et al. 00]; statistical parsing [Sarkar 01].

What to Know. Several approaches to semi-supervised learning: EM with a probabilistic model, Co-Training, graph similarity methods, ... (see the reading list below). Redundancy is important. Much more to be done: better theoretical models of when/how unlabeled data can help; bootstrap learning from the web (e.g., Etzioni, 2005, 2006); active learning (use the limited labeling time of humans wisely); never-ending bootstrap learning; ...

Further Reading
Semi-Supervised Learning, Olivier Chapelle, Bernhard Schölkopf, and Alexander Zien (eds.), MIT Press, 2006.
Semi-Supervised Learning Literature Survey, Xiaojin Zhu, 2006.
"Unsupervised Word Sense Disambiguation Rivaling Supervised Methods," D. Yarowsky, 1995.
"Semi-Supervised Text Classification Using EM," K. Nigam, A. McCallum, and T. Mitchell, in Semi-Supervised Learning, Chapelle, Schölkopf, and Zien (eds.), MIT Press, 2006.
"Text Classification from Labeled and Unlabeled Documents using EM," K. Nigam, A. McCallum, S. Thrun, and T. Mitchell, Machine Learning, Kluwer Academic Press, 2000.
"Combining Labeled and Unlabeled Data with Co-Training," A. Blum and T. Mitchell, Proceedings of the 1998 Conference on Computational Learning Theory, July 1998.
"Discovering Word Senses from Text," Pantel & Lin, 2002.
"Creating Subjective and Objective Sentence Classifiers from Unannotated Texts," Janyce Wiebe and Ellen Riloff, 2005.
"Graph Based Semi-Supervised Approach for Information Extraction," Hany Hassan, Ahmed Hassan, and Sara Noeman, 2006.
"The Use of Unlabeled Data to Improve Supervised Learning for Text Summarization," M.-R. Amini and P. Gallinari, 2002.

Further Reading
Yusuke Shinyama and Satoshi Sekine, "Preemptive Information Extraction using Unrestricted Relation Discovery."
Alexandre Klementiev and Dan Roth, "Named Entity Transliteration and Discovery from Multilingual Comparable Corpora."
Rion L. Snow, Daniel Jurafsky, and Andrew Y. Ng, "Learning Syntactic Patterns for Automatic Hypernym Discovery."
Sarkar (1999), "Applying Co-training Methods to Statistical Parsing."
S. Brin, "Extracting Patterns and Relations from the World Wide Web," EDBT '98.
O. Etzioni et al., "Unsupervised Named-Entity Extraction from the Web: An Experimental Study," AI Journal, 2005.