Slide 1: Employing EM and Pool-Based Active Learning for Text Classification
Andrew McCallum and Kamal Nigam
Just Research and Carnegie Mellon University
Slide 2: Text Active Learning
Many applications. Scenario: ask for labels of a few documents.
While learning:
–Learner carefully selects an unlabeled document
–Trainer provides a label
–Learner rebuilds the classifier
Slide 3: Query-By-Committee (QBC)
Label documents with high classification variance.
Iterate:
–Create a committee of classifiers
–Measure committee disagreement about the class of unlabeled documents
–Select a document for labeling
Theoretical results are promising [Seung et al. 92] [Freund et al. 97]
Slide 4: Text Framework
"Bag of words" document representation.
Naïve Bayes classification: for each class, estimate P(word|class).
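The bag-of-words naive Bayes framework on this slide can be sketched as a small trainer and classifier. This is a minimal illustration, not the authors' code; the Laplace smoothing constant `alpha` and the helper names are assumptions.

```python
import math
from collections import Counter

def train_naive_bayes(docs, labels, alpha=1.0):
    """Estimate P(class) and P(word|class) from bags of words.

    docs: list of token lists; labels: parallel list of class labels.
    alpha is a smoothing constant (assumed, not from the slides).
    """
    classes = set(labels)
    vocab = {w for d in docs for w in d}
    priors, cond = {}, {}
    for c in classes:
        c_docs = [d for d, y in zip(docs, labels) if y == c]
        priors[c] = len(c_docs) / len(docs)
        counts = Counter(w for d in c_docs for w in d)
        total = sum(counts.values())
        # Smoothed multinomial estimate of P(word | class).
        cond[c] = {w: (counts[w] + alpha) / (total + alpha * len(vocab))
                   for w in vocab}
    return priors, cond

def classify(doc, priors, cond):
    """Pick the class maximizing log P(class) + sum of log P(word|class)."""
    return max(priors, key=lambda c: math.log(priors[c]) +
               sum(math.log(cond[c][w]) for w in doc if w in cond[c]))
```

A usage sketch: train on a few labeled bags of words, then call `classify` on a new token list.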
Slide 5: Outline: Our Approach
–Create a committee by sampling from a distribution over classifiers
–Measure committee disagreement with the KL-divergence of the committee members
–Select documents from a large pool using both disagreement and density-weighting
–Add EM to use the documents not selected for labeling
Slide 6: Creating Committees
Each class is a distribution over word frequencies.
For each member, construct each class by drawing from the Dirichlet distribution defined by the labeled data.
[Figure: labeled data defines a classifier distribution (with the MAP classifier at its mode); members 1, 2, and 3 are drawn from it to form the committee]
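The committee construction on this slide can be sketched by sampling each member's per-class word distribution from a Dirichlet posterior over the labeled-data counts, using the standard trick that a Dirichlet draw is a vector of independent Gamma draws, normalized. A sketch under assumptions: the prior pseudo-count of 1 per word and the function names are mine, not the authors'.

```python
import random

def sample_committee(word_counts, k=3):
    """Draw k committee members from the Dirichlet posterior defined by
    the labeled data.

    word_counts: {class: {word: count}} from the labeled documents.
    Each member maps class -> {word: sampled P(word|class)}.
    """
    committee = []
    for _ in range(k):
        member = {}
        for c, counts in word_counts.items():
            # Dirichlet(a_1..a_V) via normalized Gamma(a_i, 1) draws,
            # with a_i = count_i + 1 (a uniform prior, assumed here).
            draws = {w: random.gammavariate(n + 1, 1.0)
                     for w, n in counts.items()}
            z = sum(draws.values())
            member[c] = {w: g / z for w, g in draws.items()}
        committee.append(member)
    return committee
```

Each member is a perturbed version of the MAP classifier; with more labeled data the Dirichlet tightens and the members agree more.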
Slide 7: Measuring Committee Disagreement
Kullback-Leibler divergence to the mean:
–Compares differences in how members "vote" for classes
–Considers the entire class distribution of each member
–Considers the "confidence" of the top-ranked class
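The KL-divergence-to-the-mean measure can be sketched directly: average each member's class posterior into a mean distribution, then average each member's KL divergence to that mean. A minimal sketch; the function name is assumed.

```python
import math

def kl_to_mean(distributions):
    """Mean KL divergence from each member's class distribution to the
    committee mean. Higher values mean more disagreement about the document.

    distributions: list of dicts {class: P(class|doc)}, one per member.
    """
    classes = distributions[0].keys()
    mean = {c: sum(d[c] for d in distributions) / len(distributions)
            for c in classes}
    def kl(p, q):
        # KL(p || q), skipping zero-probability terms.
        return sum(p[c] * math.log(p[c] / q[c]) for c in classes if p[c] > 0)
    return sum(kl(d, mean) for d in distributions) / len(distributions)
```

Unlike vote entropy, this keeps the whole posterior of each member, so confident disagreement scores higher than uncertain disagreement.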
Slide 8: Selecting Documents
Stream-based sampling:
–Disagreement => probability of selection
–Implicit (but crude) instance-distribution information
Pool-based sampling:
–Select the highest-disagreement document of all
–Lose distribution information
Slide 9: Disagreement (figure)
Slide 10: Density-Weighted Pool-Based Sampling
A balance of disagreement and distributional information.
Select documents by: [selection formula shown on slide]
Calculate density by: the (geometric) average distance to all documents.
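One way to sketch density-weighted selection: score each pool document by its density times its committee disagreement, with density taken as the geometric mean of its similarity to the rest of the pool. The product form of the score and the use of similarities (rather than raw distances) are assumptions for illustration; the slide's exact formula is not reproduced here.

```python
import math

def density(doc_idx, similarities):
    """Geometric mean of one document's similarity to every other pool
    document (a stand-in for the slide's '(geometric) average distance')."""
    sims = [s for j, s in enumerate(similarities[doc_idx]) if j != doc_idx]
    return math.exp(sum(math.log(s) for s in sims) / len(sims))

def select_document(disagreements, similarities):
    """Pick the index of the unlabeled document with the highest
    density-weighted disagreement (score = density * disagreement,
    an assumed combination)."""
    scores = [density(i, similarities) * d
              for i, d in enumerate(disagreements)]
    return max(range(len(scores)), key=lambda i: scores[i])
```

The effect is the one the slide motivates: among equally contested documents, the one sitting in a dense region of the pool is preferred over an outlier.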
Slide 11: Disagreement (figure)
Slide 12: Density (figure)
Slide 13: Datasets and Protocol
Reuters-21578 and a subset of Newsgroups.
One initial labeled document per class; 200 iterations of active learning.
[Figure: Newsgroups classes (mac, ibm, graphics, windows, X under computers); Reuters classes (acq, corn, trade, ...)]
Slide 14: QBC on Reuters
acq: P(+) = 0.25; trade: P(+) = 0.038; corn: P(+) = 0.018.
Slide 15: Selection Comparison on News5 (figure)
Slide 16: EM after Active Learning
After active learning, only a few documents have been labeled.
Use EM to predict the labels of the remaining unlabeled documents.
Use all documents to build a new classification model, which is often more accurate.
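The EM step described here can be sketched as semi-supervised naive Bayes: train on the labeled documents, then alternate an E-step (estimate class probabilities for the unlabeled documents) and an M-step (re-estimate the parameters from all documents, weighting each unlabeled one by those probabilities). This is a sketch of the idea, not the authors' implementation; the smoothing constant, iteration count, and function names are assumptions.

```python
import math
from collections import defaultdict

def m_step(docs, weights, classes, vocab, alpha=1.0):
    """Weighted naive Bayes estimates: doc i counts toward class c
    in proportion to weights[i][c]."""
    prior, cond = {}, {}
    for c in classes:
        wsum = sum(w[c] for w in weights)
        prior[c] = (wsum + alpha) / (len(docs) + alpha * len(classes))
        counts = defaultdict(float)
        for d, w in zip(docs, weights):
            for tok in d:
                counts[tok] += w[c]
        total = sum(counts.values())
        cond[c] = {v: (counts[v] + alpha) / (total + alpha * len(vocab))
                   for v in vocab}
    return prior, cond

def e_step(doc, prior, cond):
    """P(class | doc) under the current model, via log-sum-exp."""
    logp = {c: math.log(prior[c]) + sum(math.log(cond[c][t]) for t in doc)
            for c in prior}
    m = max(logp.values())
    exp = {c: math.exp(l - m) for c, l in logp.items()}
    z = sum(exp.values())
    return {c: v / z for c, v in exp.items()}

def em_naive_bayes(labeled, unlabeled, classes, iters=5):
    """Semi-supervised naive Bayes via EM.

    labeled: list of (tokens, class); unlabeled: list of token lists.
    Labeled documents keep their labels fixed; unlabeled ones get soft
    labels that are re-estimated each iteration.
    """
    vocab = ({t for d, _ in labeled for t in d} |
             {t for d in unlabeled for t in d})
    docs = [d for d, _ in labeled] + list(unlabeled)
    fixed = [{c: float(c == y) for c in classes} for _, y in labeled]
    soft = [{c: 1.0 / len(classes) for c in classes} for _ in unlabeled]
    for _ in range(iters):
        prior, cond = m_step(docs, fixed + soft, classes, vocab)
        soft = [e_step(d, prior, cond) for d in unlabeled]
    return prior, cond
```

With only one or two labeled documents per class, the unlabeled pool sharpens the word statistics that the labeled data alone estimates poorly.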
Slide 17: QBC and EM on News5 (figure)
Slide 18: Related Work
Active learning with text:
–[Dagan & Engelson 95]: QBC for part-of-speech tagging
–[Lewis & Gale 94]: Pool-based, non-QBC
–[Liere & Tadepalli 97, 98]: QBC with Winnow and perceptrons
EM with text:
–[Nigam et al. 98]: EM with unlabeled data
Slide 19: Conclusions & Future Work
Small P(+) => better active learning.
Leverage the unlabeled pool by:
–pool-based sampling
–density-weighting
–Expectation-Maximization
Future work:
–Different active learning approaches a la [Cohn et al. 96]
–Interleaved EM and active learning
Slide 20: Document Classification: the Potential
3 × 10^8 unlabeled web pages.
Classification is important for the Web:
–Knowledge extraction
–User-interest modeling
Slide 21: Document Classification: the Status
Good techniques exist, but:
–Many parameters to estimate
–Data is very sparse
–Lots of training examples are needed
Slide 22: Document Classification: the Challenge
Labeling data is expensive:
–Requires human interaction
–Domains may constrain the labeling effort
Use active learning!
–Pick carefully which documents to label
–Reach the knee of the learning curve sooner
Slide 23: Disagreement Example (figure)
Slide 24: Reuters-21578
Skewed priors => better active learning?
Reuters: binary classification and skewed priors.
Better active-learning results with more infrequent classes.
Slide 25: comp.* Newsgroups Dataset
5 categories, 1000 documents each; 20% held out for testing.
One initial labeled document per class; 200 iterations of active learning; 10 runs per curve.
[Figure: class hierarchy: computers with mac, ibm, graphics, windows, X]
Slide 26: Text Classification
Many applications. Good techniques exist, but they require lots of data, and labeling is expensive. Use active learning.
[Example document: "Corn prices rose today while corn futures dropped in surprising trading activity. Corn..."]
Slide 27: Old QBC
For each unlabeled document:
–Pick two consistent hypotheses
–If they disagree about the label, request it