Slide 1: Employing EM and Pool-Based Active Learning for Text Classification
Andrew McCallum and Kamal Nigam
Just Research and Carnegie Mellon University
Slide 2: Text Active Learning
Many applications. Scenario: ask for labels of a few documents.
While learning:
–Learner carefully selects an unlabeled document
–Trainer provides a label
–Learner rebuilds the classifier
Slide 3: Query-By-Committee (QBC)
Label documents with high classification variance.
Iterate:
–Create a committee of classifiers
–Measure committee disagreement about the class of unlabeled documents
–Select a document for labeling
Theoretical results are promising [Seung et al. 92] [Freund et al. 97]
Slide 4: Text Framework
"Bag of words" document representation.
Naïve Bayes classification: for each class, estimate P(word|class).
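The bag-of-words naive Bayes framework on this slide can be sketched as a small trainer and classifier. This is a minimal illustration, not the authors' code; the Laplace smoothing constant `alpha` and the helper names are assumptions.

```python
import math
from collections import Counter

def train_naive_bayes(docs, labels, alpha=1.0):
    """Estimate P(class) and P(word|class) from bags of words.

    docs: list of token lists; labels: parallel list of class labels.
    alpha is a smoothing constant (assumed, not from the slides).
    """
    classes = set(labels)
    vocab = {w for d in docs for w in d}
    priors, cond = {}, {}
    for c in classes:
        c_docs = [d for d, y in zip(docs, labels) if y == c]
        priors[c] = len(c_docs) / len(docs)
        counts = Counter(w for d in c_docs for w in d)
        total = sum(counts.values())
        # Smoothed multinomial estimate of P(word | class).
        cond[c] = {w: (counts[w] + alpha) / (total + alpha * len(vocab))
                   for w in vocab}
    return priors, cond

def classify(doc, priors, cond):
    """Pick the class maximizing log P(class) + sum of log P(word|class)."""
    return max(priors, key=lambda c: math.log(priors[c]) +
               sum(math.log(cond[c][w]) for w in doc if w in cond[c]))
```

A usage sketch: train on a few labeled bags of words, then call `classify` on a new token list.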
Slide 5: Outline: Our Approach
–Create a committee by sampling from a distribution over classifiers
–Measure committee disagreement with the KL-divergence of the committee members
–Select documents from a large pool using both disagreement and density-weighting
–Add EM to use the documents not selected for labeling
Slide 6: Creating Committees
Each class is a distribution over word frequencies.
For each member, construct each class by drawing from the Dirichlet distribution defined by the labeled data.
[Figure: labeled data defines a classifier distribution (with the MAP classifier at its mode); members 1, 2, and 3 are drawn from it to form the committee]
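The committee construction on this slide can be sketched by sampling each member's per-class word distribution from a Dirichlet posterior over the labeled-data counts, using the standard trick that a Dirichlet draw is a vector of independent Gamma draws, normalized. A sketch under assumptions: the prior pseudo-count of 1 per word and the function names are mine, not the authors'.

```python
import random

def sample_committee(word_counts, k=3):
    """Draw k committee members from the Dirichlet posterior defined by
    the labeled data.

    word_counts: {class: {word: count}} from the labeled documents.
    Each member maps class -> {word: sampled P(word|class)}.
    """
    committee = []
    for _ in range(k):
        member = {}
        for c, counts in word_counts.items():
            # Dirichlet(a_1..a_V) via normalized Gamma(a_i, 1) draws,
            # with a_i = count_i + 1 (a uniform prior, assumed here).
            draws = {w: random.gammavariate(n + 1, 1.0)
                     for w, n in counts.items()}
            z = sum(draws.values())
            member[c] = {w: g / z for w, g in draws.items()}
        committee.append(member)
    return committee
```

Each member is a perturbed version of the MAP classifier; with more labeled data the Dirichlet tightens and the members agree more.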
Slide 7: Measuring Committee Disagreement
Kullback-Leibler divergence to the mean:
–Compares differences in how members "vote" for classes
–Considers the entire class distribution of each member
–Considers the "confidence" of the top-ranked class
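The KL-divergence-to-the-mean measure can be sketched directly: average each member's class posterior into a mean distribution, then average each member's KL divergence to that mean. A minimal sketch; the function name is assumed.

```python
import math

def kl_to_mean(distributions):
    """Mean KL divergence from each member's class distribution to the
    committee mean. Higher values mean more disagreement about the document.

    distributions: list of dicts {class: P(class|doc)}, one per member.
    """
    classes = distributions[0].keys()
    mean = {c: sum(d[c] for d in distributions) / len(distributions)
            for c in classes}
    def kl(p, q):
        # KL(p || q), skipping zero-probability terms.
        return sum(p[c] * math.log(p[c] / q[c]) for c in classes if p[c] > 0)
    return sum(kl(d, mean) for d in distributions) / len(distributions)
```

Unlike vote entropy, this keeps the whole posterior of each member, so confident disagreement scores higher than uncertain disagreement.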
Slide 8: Selecting Documents
Stream-based sampling:
–Disagreement => probability of selection
–Implicit (but crude) instance-distribution information
Pool-based sampling:
–Select the highest-disagreement document of all
–Lose distribution information
Slide 9: Disagreement (figure)
Slide 10: Density-Weighted Pool-Based Sampling
A balance of disagreement and distributional information.
Select documents by: [selection formula shown on slide]
Calculate density by: the (geometric) average distance to all documents.
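One way to sketch density-weighted selection: score each pool document by its density times its committee disagreement, with density taken as the geometric mean of its similarity to the rest of the pool. The product form of the score and the use of similarities (rather than raw distances) are assumptions for illustration; the slide's exact formula is not reproduced here.

```python
import math

def density(doc_idx, similarities):
    """Geometric mean of one document's similarity to every other pool
    document (a stand-in for the slide's '(geometric) average distance')."""
    sims = [s for j, s in enumerate(similarities[doc_idx]) if j != doc_idx]
    return math.exp(sum(math.log(s) for s in sims) / len(sims))

def select_document(disagreements, similarities):
    """Pick the index of the unlabeled document with the highest
    density-weighted disagreement (score = density * disagreement,
    an assumed combination)."""
    scores = [density(i, similarities) * d
              for i, d in enumerate(disagreements)]
    return max(range(len(scores)), key=lambda i: scores[i])
```

The effect is the one the slide motivates: among equally contested documents, the one sitting in a dense region of the pool is preferred over an outlier.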
Slide 11: Disagreement (figure)
Slide 12: Density (figure)
Slide 13: Datasets and Protocol
Reuters-21578 and a subset of Newsgroups.
One initial labeled document per class; 200 iterations of active learning.
[Figure: Newsgroups classes (mac, ibm, graphics, windows, X under computers); Reuters classes (acq, corn, trade, ...)]
Slide 14: QBC on Reuters
acq: P(+) = 0.25; trade: P(+) = 0.038; corn: P(+) = 0.018.
Slide 15: Selection Comparison on News5 (figure)
Slide 16: EM after Active Learning
After active learning, only a few documents have been labeled.
Use EM to predict the labels of the remaining unlabeled documents.
Use all documents to build a new classification model, which is often more accurate.
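The EM step described here can be sketched as semi-supervised naive Bayes: train on the labeled documents, then alternate an E-step (estimate class probabilities for the unlabeled documents) and an M-step (re-estimate the parameters from all documents, weighting each unlabeled one by those probabilities). This is a sketch of the idea, not the authors' implementation; the smoothing constant, iteration count, and function names are assumptions.

```python
import math
from collections import defaultdict

def m_step(docs, weights, classes, vocab, alpha=1.0):
    """Weighted naive Bayes estimates: doc i counts toward class c
    in proportion to weights[i][c]."""
    prior, cond = {}, {}
    for c in classes:
        wsum = sum(w[c] for w in weights)
        prior[c] = (wsum + alpha) / (len(docs) + alpha * len(classes))
        counts = defaultdict(float)
        for d, w in zip(docs, weights):
            for tok in d:
                counts[tok] += w[c]
        total = sum(counts.values())
        cond[c] = {v: (counts[v] + alpha) / (total + alpha * len(vocab))
                   for v in vocab}
    return prior, cond

def e_step(doc, prior, cond):
    """P(class | doc) under the current model, via log-sum-exp."""
    logp = {c: math.log(prior[c]) + sum(math.log(cond[c][t]) for t in doc)
            for c in prior}
    m = max(logp.values())
    exp = {c: math.exp(l - m) for c, l in logp.items()}
    z = sum(exp.values())
    return {c: v / z for c, v in exp.items()}

def em_naive_bayes(labeled, unlabeled, classes, iters=5):
    """Semi-supervised naive Bayes via EM.

    labeled: list of (tokens, class); unlabeled: list of token lists.
    Labeled documents keep their labels fixed; unlabeled ones get soft
    labels that are re-estimated each iteration.
    """
    vocab = ({t for d, _ in labeled for t in d} |
             {t for d in unlabeled for t in d})
    docs = [d for d, _ in labeled] + list(unlabeled)
    fixed = [{c: float(c == y) for c in classes} for _, y in labeled]
    soft = [{c: 1.0 / len(classes) for c in classes} for _ in unlabeled]
    for _ in range(iters):
        prior, cond = m_step(docs, fixed + soft, classes, vocab)
        soft = [e_step(d, prior, cond) for d in unlabeled]
    return prior, cond
```

With only one or two labeled documents per class, the unlabeled pool sharpens the word statistics that the labeled data alone estimates poorly.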
Slide 17: QBC and EM on News5 (figure)
Slide 18: Related Work
Active learning with text:
–[Dagan & Engelson 95]: QBC for part-of-speech tagging
–[Lewis & Gale 94]: Pool-based, non-QBC
–[Liere & Tadepalli 97, 98]: QBC with Winnow and perceptrons
EM with text:
–[Nigam et al. 98]: EM with unlabeled data
Slide 19: Conclusions & Future Work
Small P(+) => better active learning.
Leverage the unlabeled pool by:
–pool-based sampling
–density-weighting
–Expectation-Maximization
Future work:
–Different active learning approaches a la [Cohn et al. 96]
–Interleaved EM and active learning
Slide 20: Document Classification: the Potential
3 × 10^8 unlabeled web pages.
Classification is important for the Web:
–Knowledge extraction
–User-interest modeling
Slide 21: Document Classification: the Status
Good techniques exist, but:
–Many parameters to estimate
–Data is very sparse
–Lots of training examples are needed
Slide 22: Document Classification: the Challenge
Labeling data is expensive:
–Requires human interaction
–Domains may constrain the labeling effort
Use active learning!
–Pick carefully which documents to label
–Reach the knee of the learning curve sooner
Slide 23: Disagreement Example (figure)
Slide 24: Reuters-21578
Skewed priors => better active learning?
Reuters: binary classification and skewed priors.
Better active-learning results with more infrequent classes.
Slide 25: comp.* Newsgroups Dataset
5 categories, 1000 documents each; 20% held out for testing.
One initial labeled document per class; 200 iterations of active learning; 10 runs per curve.
[Figure: class hierarchy: computers with mac, ibm, graphics, windows, X]
Slide 26: Text Classification
Many applications. Good techniques exist, but they require lots of data, and labeling is expensive. Use active learning.
[Example document: "Corn prices rose today while corn futures dropped in surprising trading activity. Corn..."]
Slide 27: Old QBC
For each unlabeled document:
–Pick two consistent hypotheses
–If they disagree about the label, request it