Text Classification from Labeled and Unlabeled Documents using EM Kamal Nigam Andrew K. McCallum Sebastian Thrun Tom Mitchell Machine Learning (2000) Presented by Andrew Smith, May 12, 2003

Presentation Outline Motivation and Background The Naive Bayes classifier Incorporating unlabeled data with EM (basic algorithm) Enhancement 1 – Modulating the influence of the unlabeled data Enhancement 2 – A different probabilistic model Conclusions

Motivation The task: - Given a set of news articles, automatically find documents on the same topic. - We would like to require as few labeled documents as possible, since labeling documents by hand is expensive.

Previous work The problem: - Existing statistical text learning algorithms require many training examples. - (Lang 1995) A classifier trained with 1000 labeled documents was used to rank unlabeled documents; of the top 10%, only about 50% were correctly classified.

Motivation Can we somehow use unlabeled documents? - Yes! Unlabeled data provide information about the joint probability distribution.

Algorithm Outline 1. Train a classifier with only the labeled documents. 2. Use it to probabilistically classify the unlabeled documents. 3. Use ALL the documents to train a new classifier. 4. Iterate steps 2 and 3 to convergence. This is reminiscent of K-Means and EM.

Presentation Outline Motivation and Background The Naive Bayes classifier Incorporating unlabeled data with EM (basic algorithm) Enhancement 1 – Modulating the influence of the unlabeled data Enhancement 2 – A different probabilistic model Conclusions

Probabilistic Framework Assumptions: - The data are produced by a mixture model. - There is a one-to-one correspondence between mixture components and document classes. Notation: mixture components (and class labels) are c_j, documents are d_i, and indicator variables are z_ij; z_ij = 1 means the i-th document belongs to class j.

Probabilistic Framework (2) Mixture weights P(c_j|θ); the probability of class j generating document i, P(d_i|c_j;θ); the vocabulary V (indexed over t). Documents are ordered word lists: w_{d_i,k} indicates the word at position k of document i, and w_t indicates a word in the vocabulary.

Probabilistic Framework (3) The probability of document d_i is a weighted sum, over mixture components, of the probability of each mixture component c_j generating d_i:
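The displayed equation did not survive the transcript; the following reconstruction follows the paper's standard mixture formulation (the document-length term is omitted here for simplicity):

```latex
P(d_i \mid \theta) \;=\; \sum_{j=1}^{|C|} P(c_j \mid \theta)\, P(d_i \mid c_j; \theta)
```

The second quantity, P(d_i|c_j;θ), is expanded on the next slides once the naive Bayes assumption is introduced.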

Probabilistic Framework (4) The Naive Bayes assumption: The words of a document are generated independently of their order in the document, given the class.

Probabilistic Framework (5) Now the probability of a document given its class becomes a product over its word positions. We can use Bayes Rule to classify documents: find the class with the highest probability given a novel document.
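The missing equation, reconstructed under the naive Bayes assumption with the notation above (w_{d_i,k} is the word at position k of document d_i):

```latex
P(d_i \mid c_j; \theta) \;\propto\; \prod_{k=1}^{|d_i|} P(w_{d_i,k} \mid c_j; \theta)
```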

Probabilistic Framework (6) To learn the parameters θ of the classifier, use maximum likelihood (ML): find the most likely set of parameters given the data set. The two sets of parameters we need to find are the word probability estimates θ_{w_t|c_j} = P(w_t|c_j;θ) and the mixture weights θ_{c_j} = P(c_j|θ).

Probabilistic Framework (6) The maximization yields parameters that are word frequency counts: θ_{w_t|c_j} = (1 + no. of occurrences of w_t in class j) / (|V| + no. of words in class j), and θ_{c_j} = (1 + no. of documents in class j) / (|C| + |D|). Laplace smoothing gives each word a prior probability.
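Not part of the slides: a small NumPy sketch of these smoothed counts, assuming a dense document-term count matrix; the function and variable names are mine.

```python
import numpy as np

def train_nb(word_counts, labels, n_classes):
    """Laplace-smoothed naive Bayes estimates from labeled documents only.

    word_counts: (n_docs, vocab_size) array of N(w_t, d_i)
    labels:      length-n_docs array of class indices
    Returns theta_wc (per-class word probabilities) and theta_c (class priors).
    """
    word_counts = np.asarray(word_counts, dtype=float)
    labels = np.asarray(labels)
    n_docs, vocab_size = word_counts.shape
    theta_wc = np.zeros((n_classes, vocab_size))
    theta_c = np.zeros(n_classes)
    for j in range(n_classes):
        docs_j = word_counts[labels == j]
        # (1 + occurrences of w_t in class j) / (|V| + total words in class j)
        theta_wc[j] = (1.0 + docs_j.sum(axis=0)) / (vocab_size + docs_j.sum())
        # (1 + documents in class j) / (|C| + |D|)
        theta_c[j] = (1.0 + docs_j.shape[0]) / (n_classes + n_docs)
    return theta_wc, theta_c
```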

Probabilistic Framework (7) Formally, let N(w_t,d_i) be the number of occurrences of word w_t in document d_i, and let z_ij be 1 if document i is in class j and 0 otherwise; then θ_{w_t|c_j} = (1 + Σ_i N(w_t,d_i) z_ij) / (|V| + Σ_s Σ_i N(w_s,d_i) z_ij) and θ_{c_j} = (1 + Σ_i z_ij) / (|C| + |D|).

Probabilistic Framework (8) Using Bayes Rule:
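The Bayes-rule formula was an image in the original slide; a reconstruction consistent with the estimates above (hats denote estimated parameters):

```latex
P(c_j \mid d_i; \hat\theta) \;=\;
\frac{P(c_j \mid \hat\theta)\,\prod_{k} P(w_{d_i,k} \mid c_j; \hat\theta)}
     {\sum_{r=1}^{|C|} P(c_r \mid \hat\theta)\,\prod_{k} P(w_{d_i,k} \mid c_r; \hat\theta)}
```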

Presentation Outline Motivation and Background The Naive Bayes classifier Incorporating unlabeled data with EM (basic algorithm) Enhancement 1 – Modulating the influence of the unlabeled data Enhancement 2 – A different probabilistic model Conclusions

Application of EM to NB 1) Estimate θ with only the labeled data. 2) Assign probabilistically weighted class labels to the unlabeled data. 3) Use all class labels (given and estimated) to find new parameters. 4) Repeat 2 and 3 until θ does not change.
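The slides give this loop in words only; below is a minimal NumPy sketch of steps 1-4, not the authors' code. The function names, the dense document-term matrices, and the fixed iteration count are my assumptions; in practice the loop would stop when θ stops changing.

```python
import numpy as np

def posteriors(counts, theta_wc, theta_c):
    """E-step: P(c_j | d_i) for each document, computed in log space for stability."""
    log_p = counts @ np.log(theta_wc).T + np.log(theta_c)   # (n_docs, n_classes)
    log_p -= log_p.max(axis=1, keepdims=True)
    p = np.exp(log_p)
    return p / p.sum(axis=1, keepdims=True)

def m_step(counts, z, n_classes, vocab_size):
    """M-step: Laplace-smoothed estimates from (possibly fractional) class weights z."""
    weighted = z.T @ counts                                  # (n_classes, vocab_size)
    theta_wc = (1.0 + weighted) / (vocab_size + weighted.sum(axis=1, keepdims=True))
    theta_c = (1.0 + z.sum(axis=0)) / (n_classes + counts.shape[0])
    return theta_wc, theta_c

def em_nb(counts_l, labels_l, counts_u, n_classes, n_iter=20):
    """Steps 1-4: train on labeled data, then iterate E- and M-steps on all data."""
    vocab_size = counts_l.shape[1]
    z_l = np.eye(n_classes)[np.asarray(labels_l)]            # given labels as one-hot
    theta_wc, theta_c = m_step(counts_l, z_l, n_classes, vocab_size)       # step 1
    for _ in range(n_iter):
        z_u = posteriors(counts_u, theta_wc, theta_c)                      # step 2
        counts = np.vstack([counts_l, counts_u])
        z = np.vstack([z_l, z_u])
        theta_wc, theta_c = m_step(counts, z, n_classes, vocab_size)       # step 3
    return theta_wc, theta_c
```

For example, em_nb(counts_l, labels_l, counts_u, n_classes=20) would mimic the 20 Newsgroups setting described later.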

More Notation Set of unlabeled documents Set of labeled documents

Deriving the basic Algorithm (1) The probability of all the data is a product over documents; for unlabeled data, each document's contribution is a sum across all mixture components.

Deriving the basic Algorithm (2) It is easier to maximize the log-likelihood, but this contains a log of sums (from the unlabeled documents), which makes maximization intractable.
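The missing expression, reconstructed from the surrounding definitions; y_i here denotes the given label of labeled document d_i (my notation). The unlabeled term contains the log of a sum over components, which is what blocks direct maximization:

```latex
l(\theta \mid D) \;=\; \log P(\theta)
 \;+\; \sum_{d_i \in D^u} \log \sum_{j=1}^{|C|} P(c_j \mid \theta)\, P(d_i \mid c_j; \theta)
 \;+\; \sum_{d_i \in D^l} \log \bigl( P(c_{y_i} \mid \theta)\, P(d_i \mid c_{y_i}; \theta) \bigr)
```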

Deriving the basic Algorithm (3) Suppose we have access to the labels for the unlabeled documents, expressed as a matrix of indicator variables z, where z_ij = 1 if document i is in class j, and 0 otherwise (so rows are documents and columns are classes). Then the terms of the likelihood are nonzero only when z_ij = 1; we treat the labeled and unlabeled documents the same.

Deriving the basic Algorithm (4) The complete log-likelihood becomes a sum over documents and components weighted by z (see below). If we replace z with its expected value according to the current classifier, then this quantity bounds the exact log-likelihood from below, so iteratively increasing it will increase the log-likelihood.
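A reconstruction of the complete log-likelihood, using the indicator matrix z from the previous slide:

```latex
l_c(\theta \mid D, z) \;=\; \log P(\theta)
 \;+\; \sum_{d_i \in D} \sum_{j=1}^{|C|} z_{ij}\,
       \log \bigl( P(c_j \mid \theta)\, P(d_i \mid c_j; \theta) \bigr)
```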

Deriving the basic Algorithm (5) This leads to the basic algorithm, alternating an E-step and an M-step:
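The E- and M-step formulas were images in the original slide; a reconstruction consistent with the earlier naive Bayes estimates (hatted symbols are the current estimates, and the expected indicators play the role of z_ij):

```latex
\text{E-step:}\quad
\hat z_{ij} \;=\; P(c_j \mid d_i; \hat\theta)
 \;=\; \frac{P(c_j \mid \hat\theta)\, P(d_i \mid c_j; \hat\theta)}
            {\sum_{r} P(c_r \mid \hat\theta)\, P(d_i \mid c_r; \hat\theta)}
```

```latex
\text{M-step:}\quad
\hat\theta_{w_t \mid c_j} \;=\; \frac{1 + \sum_i N(w_t, d_i)\, \hat z_{ij}}
                                     {|V| + \sum_{s=1}^{|V|} \sum_i N(w_s, d_i)\, \hat z_{ij}},
\qquad
\hat\theta_{c_j} \;=\; \frac{1 + \sum_i \hat z_{ij}}{|C| + |D|}
```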

Data sets 20 Newsgroups data set: articles drawn evenly from 20 UseNet newsgroups; many categories fall into confusable clusters. Words on a stoplist of common short words are removed, and the vocabulary is restricted to unique words occurring more than once. Word counts of documents are scaled so each document has the same length.

Data sets WebKB data set: 4199 web pages from university CS departments, divided into four categories (student, faculty, course, project). No stoplist or stemming is used; only the 300 most informative words are used (selected by mutual information with the class variable). Validation uses a leave-one-university-out approach to prevent idiosyncrasies of particular universities from inflating success measures.

Classification accuracy of 20 Newsgroups

Classification accuracy of WebKB

Predictive words found with EM

Iteration 0      Iteration 1    Iteration 2
intelligence     DD             D
DD               D              DD
artificial       lecture        lecture
understanding    cc             cc
DDw              D*             DD:DD
dist             DD:DD          due
identical        handout        D*
rus              due            homework
arrange          problem        assignment
games            set            handout
dartmouth        tay            set
natural          DDam           hw
cognitive        yurttas        exam
logic            homework       problem
proving          kfoury         DDam
prolog           sec            postscript
knowledge        postscript     solution
human            exam           quiz
representation   solution       chapter
field            assaf          ascii

Presentation Outline Motivation and Background The Naive Bayes classifier Incorporating unlabeled data with EM (basic algorithm) Enhancement 1 – Modulating the influence of the unlabeled data Enhancement 2 – A different probabilistic model Conclusions

The problem Suppose you have a few labeled documents and many more unlabeled documents. Then the algorithm almost becomes unsupervised clustering! The only function of the labeled data is to assign class labels to the mixture components. When the mixture-model assumptions are not true, the basic algorithm will find components that don’t correspond to different class labels.

The solution: EM-λ Modulate the influence of the unlabeled data with a parameter λ, and maximize the weighted log-likelihood in which the unlabeled documents' contribution is multiplied by λ while the labeled documents' contribution enters with full weight.

EM-λ The E-step is exactly as before: assign probabilistic class labels. The M-step is modified to reflect λ. Define Λ(i) = λ for unlabeled documents and Λ(i) = 1 for labeled documents, as a weighting factor that modifies the frequency counts.

EM-λ The new NB parameter estimates become weighted counts: θ_{w_t|c_j} = (1 + Σ_i Λ(i) N(w_t,d_i) P(c_j|d_i)) / (|V| + Σ_s Σ_i Λ(i) N(w_s,d_i) P(c_j|d_i)), where N(w_t,d_i) is the word count, P(c_j|d_i) is the probabilistic class assignment, Λ(i) is the weight, and the sums run over all documents (and, in the denominator, over all words in the vocabulary).
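Not from the slides: a sketch of how the weight Λ(i) could enter the M-step counts in code, reusing the dense-matrix layout assumed in the earlier snippets; the function name and arguments are mine.

```python
import numpy as np

def m_step_lambda(counts, post, is_unlabeled, lam, vocab_size):
    """EM-lambda M-step sketch: unlabeled documents' counts are down-weighted by lam.

    counts:       (n_docs, vocab_size) word counts N(w_t, d_i)
    post:         (n_docs, n_classes) class assignments (one-hot for labeled
                  documents, probabilistic for unlabeled ones)
    is_unlabeled: boolean array marking the unlabeled documents
    """
    weights = np.where(np.asarray(is_unlabeled), lam, 1.0)[:, None]   # Lambda(i)
    weighted = (post * weights).T @ counts                            # (n_classes, vocab_size)
    theta_wc = (1.0 + weighted) / (vocab_size + weighted.sum(axis=1, keepdims=True))
    return theta_wc
```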

Classification accuracy of WebKB

Presentation Outline Motivation and Background The Naive Bayes classifier Incorporating unlabeled data with EM (basic algorithm) Enhancement 1 – Modulating the influence of the unlabeled data Enhancement 2 – A different probabilistic model Conclusions

The idea EM-λ reduced the effects of violated assumptions with the parameter λ. Alternatively, we can change our assumptions. Specifically, change the requirement of a one-to-one correspondence between classes and mixture components to a many-to-one correspondence. For textual data, this corresponds to saying that a class may consist of several different sub-topics, each best characterized by a different word distribution.

More Notation c_j now represents only a mixture component, not a class. t_a represents the a-th class (“topic”). An assignment maps each mixture component to a class. This assignment is pre-determined, deterministic, and permanent; once assigned to a particular class, mixture components do not change assignment.

The Algorithm M-step: same as before, find MAP estimates for the mixture component parameters using Laplace priors. E-step: - For unlabeled documents, calculate the probabilistic mixture component memberships exactly as before. - For labeled documents, we previously considered z_ij to be a fixed indicator (0 or 1) of class membership. Now we allow it to vary between 0 and 1 for mixture components in the same class as d_i, and set it to zero for mixture components belonging to classes other than the one containing d_i.

Algorithm details Initialize the mixture components for each class by randomly setting z_ij for the components in the document's correct class. Documents are classified by summing the mixture component probabilities within a class to form a class probability:
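The class-probability formula was an image in the original slide; a reconstruction, where the sum ranges over the mixture components assigned to class (topic) t_a (the membership notation is mine):

```latex
P(t_a \mid d_i; \hat\theta) \;=\; \sum_{j:\, c_j \in t_a} P(c_j \mid d_i; \hat\theta)
```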

Another data set Reuters-21578 (Distribution 1.0) data set: news articles from the Reuters newswire assigned to 90 topics; only the ten most populous classes are used. No stemming is used. Documents are split into early and late sets by date; the task is to predict the topics of the later articles with classifiers trained on the earlier ones. For all Reuters experiments, 10 binary classifiers are trained, one per topic.

Performance Metrics To evaluate the performance, define the two quantities

Recall = True Pos. / (True Pos. + False Neg.)
Precision = True Pos. / (True Pos. + False Pos.)

where the counts come from the confusion matrix:

                   Actual Pos.   Actual Neg.
Prediction Pos.    True Pos.     False Pos.
Prediction Neg.    False Neg.    True Neg.

The recall-precision breakeven point is the value at which the two quantities are equal. The breakeven point is used instead of accuracy (the fraction correctly classified) because the data sets have a much higher frequency of negative examples, so the classifier could achieve high accuracy by always predicting negative.
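Not from the slides: one simple way to compute a recall-precision breakeven point from classifier scores, by sweeping the decision threshold over the ranked documents; returning the average of the two closest values is my choice.

```python
import numpy as np

def breakeven_point(scores, y_true):
    """Recall-precision breakeven: rank documents by score, sweep the cutoff,
    and return the point where precision and recall are (approximately) equal."""
    order = np.argsort(-np.asarray(scores))
    y = np.asarray(y_true)[order]
    tp = np.cumsum(y)                          # true positives if we cut after rank k
    fp = np.cumsum(1 - y)                      # false positives at the same cutoff
    precision = tp / (tp + fp)
    recall = tp / y.sum()                      # assumes at least one positive example
    k = np.argmin(np.abs(precision - recall))  # cutoff closest to equality
    return (precision[k] + recall[k]) / 2.0
```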

Classification of Reuters (breakeven points)

Classification accuracy of Reuters

Classification of Reuters Using different numbers of mixture components (breakeven points)

Classification of Reuters Naive Bayes with different numbers of mixture components (breakeven points)

Classification of Reuters Using cross-validation or best-EM to select the number of mixture components (breakeven points)

Conclusions Cross-validation tends to underestimate the best number of mixture components. Incorporating unlabeled data into a classifier is important because of the high cost of hand-labeling documents. Classifiers based on generative models that make incorrect assumptions can still achieve high accuracy. The multiple-mixture-component algorithm does not produce binary classifiers that are much better than plain naive Bayes.