1
Generative and Discriminative Models in Text Classification
David D. Lewis
Independent Consultant
Chicago, IL, USA
Dave@DavidDLewis.com
www.DavidDLewis.com
Workshop on Challenges in Information Retrieval and Language Modeling, U Mass, CIIR, Amherst, MA, 11 Sept 2002
2
Text Classification
Given a document, decide which of several classes it belongs to:
–TREC filtering
–TDT tracking task
–Text categorization! Automated indexing, content filtering, alerting,...
More LM papers here than any other IR problem
–Others: parts of IE, author identification,...
3
Lang. Models are Generative
Model predicts probability document d will be generated by a source c
e.g. Unigram language model:
Parameters, i.e. P(w|c)’s, are fit to optimally predict generation of d
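A minimal sketch of a unigram source model, under the standard assumption that log P(d|c) is the sum of log P(w|c) over the word occurrences in d. The function names and the add-alpha smoothing constant are illustrative assumptions, not taken from the slides.

```python
import math
from collections import Counter


def fit_unigram(docs, vocab, alpha=1.0):
    """Estimate P(w|c) from the tokenized documents of one source c.

    alpha is an add-alpha smoothing constant (an assumption of this sketch).
    """
    counts = Counter(w for d in docs for w in d if w in vocab)
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def log_p_doc(doc, p_w):
    """log P(d|c) as a sum of log P(w|c); out-of-vocabulary tokens are skipped."""
    return sum(math.log(p_w[w]) for w in doc if w in p_w)
```

Fitting maximizes how well the source predicts the words of its own documents; nothing in the fit refers to the other classes.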
4
Classify Text w/ Gen. Model
One source model for each class c
Choose class c with largest value of:
For 2 classes, unigram P(d|c), we have:
aka Naive Bayes (NB), Robertson/KSJ
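As a rough sketch of classification with one unigram source model per class: score each class by log P(c) + log P(d|c) and take the largest. This assumes the usual Bayes-rule decision; `train_nb`, `classify`, and the add-alpha smoothing are hypothetical names and choices made for illustration.

```python
import math
from collections import Counter


def train_nb(labeled_docs, alpha=1.0):
    """labeled_docs: list of (tokens, class_label) pairs.

    Returns (log_prior, log_p_w, vocab), where log_p_w[c][w] = log P(w|c).
    """
    vocab = {w for tokens, _ in labeled_docs for w in tokens}
    class_docs = Counter(c for _, c in labeled_docs)
    word_counts = {c: Counter() for c in class_docs}
    for tokens, c in labeled_docs:
        word_counts[c].update(tokens)
    n = len(labeled_docs)
    log_prior = {c: math.log(k / n) for c, k in class_docs.items()}
    log_p_w = {}
    for c, counts in word_counts.items():
        total = sum(counts.values()) + alpha * len(vocab)
        log_p_w[c] = {w: math.log((counts[w] + alpha) / total) for w in vocab}
    return log_prior, log_p_w, vocab


def classify(tokens, log_prior, log_p_w, vocab):
    """Pick the class c maximizing log P(c) + sum of log P(w|c) over tokens."""
    def score(c):
        return log_prior[c] + sum(log_p_w[c][w] for w in tokens if w in vocab)
    return max(log_prior, key=score)
```

Working in log space avoids underflow on long documents; tokens never seen in training are ignored rather than assigned a probability.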
5
The Discriminative Alternative
Directly model probability of generating class conditional on words: P(c|w)
Logistic regression:
Tune parameters to optimize conditional likelihood (class probability predictions)
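A small sketch of 2-class logistic regression over word features, assuming the standard form P(c|d) = 1 / (1 + exp(-(bias + sum of weight times feature value))). Training here is plain stochastic gradient ascent on the conditional log-likelihood with an L2 penalty; the learning rate, epoch count, and penalty strength are arbitrary assumptions.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(x, weights, bias):
    """P(c=1 | x) for a feature dict x mapping word -> term weight."""
    z = bias + sum(weights.get(w, 0.0) * v for w, v in x.items())
    return sigmoid(z)


def train_lr(data, epochs=200, lr=0.1, l2=0.01):
    """data: list of (feature_dict, y) with y in {0, 1}.

    Each update moves the parameters along the gradient of the conditional
    log-likelihood, (y - p) * x. The L2 penalty is applied only to features
    active in the current example, a simplification for sparse data.
    """
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for x, y in data:
            err = y - predict_proba(x, weights, bias)
            bias += lr * err
            for w, v in x.items():
                old = weights.get(w, 0.0)
                weights[w] = old + lr * (err * v - l2 * old)
    return weights, bias
```

The objective scores only the class probability predictions; how well the parameters would predict the words themselves plays no role.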
6
LR & NB: Same Parameters!
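One way to see the shared parameterization, sketched under the assumption of two classes, raw TF weighting, and a document-length distribution that does not depend on the class: the NB posterior log-odds is a bias log P(c)/P(not c) plus, for each word, its count times log P(w|c)/P(w|not c). That is the same linear-in-features form logistic regression uses. All names below are illustrative.

```python
import math
from collections import Counter


def nb_as_linear(p_w_pos, p_w_neg, prior_pos):
    """Turn two unigram models and a class prior into (weights, bias)."""
    weights = {w: math.log(p_w_pos[w] / p_w_neg[w]) for w in p_w_pos}
    bias = math.log(prior_pos / (1.0 - prior_pos))
    return weights, bias


def nb_posterior(tokens, p_w_pos, p_w_neg, prior_pos):
    """P(c|d) by Bayes' rule from the two generative models."""
    joint_pos = prior_pos * math.prod(p_w_pos[w] for w in tokens)
    joint_neg = (1.0 - prior_pos) * math.prod(p_w_neg[w] for w in tokens)
    return joint_pos / (joint_pos + joint_neg)


def linear_posterior(tokens, weights, bias):
    """P(c|d) from the logistic form with term-frequency features."""
    tf = Counter(tokens)
    z = bias + sum(weights[w] * n for w, n in tf.items())
    return 1.0 / (1.0 + math.exp(-z))
```

Both functions return the same probability for any document over the shared vocabulary, so the two models differ only in how the weights and bias are fit, not in what they can express.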
7
Observations
LR & NB have same parameterization for 2- or k-class, binary or raw TF weighting
LR outperforms NB in text categorization and batch filtering studies
–NB optimizes parameters to predict words, LR optimizes to predict class
8
False Hopes for LM?
Leveraging unlabeled data (e.g. EM)?
–Initial results show only small impact (same story as syntactic class tagging)
Non-unigram models
–More accurately predict the wrong thing?
Cross-lingual TC
–Any more than MT followed by TC?
9
True LM Hopes 1: Small Data?
Number of training examples to reach maximum effectiveness (Ng & Jordan ’01):
–NB: O(log # features)
–LR: O(# features)
LR and NB not compared yet (?) in low data (TREC adaptive, TDT tracking) case
–Priors/smoothing likely to prove critical
10
True LM Hopes 2: Facets?
MeSH category assignments:
Anti-Inflammatory Agents, Non-Steroidal/*therapeutic use
Tumor Necrosis Factor/antagonists & inhibitors/immunology
Most combinations have zero training data
Berger & Lafferty MT approach?
11
Non-LM TC Challenges?
Integration of prior knowledge
Choosing documents to label (TREC adaptive, active learning, sampling)
Combining text and nontext predictors
Knowing how well a classifier will/can do
Evolving category systems, switching vocabularies