Generative and Discriminative Models in Text Classification
David D. Lewis, Independent Consultant, Chicago, IL, USA


1 Generative and Discriminative Models in Text Classification
David D. Lewis, Independent Consultant, Chicago, IL, USA
Dave@DavidDLewis.com / www.DavidDLewis.com
Workshop on Challenges in Information Retrieval and Language Modeling, UMass CIIR, Amherst, MA, 11 Sept 2002

2 Text Classification
Given a document, decide which of several classes it belongs to:
– TREC filtering
– TDT tracking task
– Text categorization! (automated indexing, content filtering, alerting, ...)
More LM papers address text classification than any other IR problem
– Others: parts of IE, author identification, ...

3 Language Models are Generative
The model predicts the probability that document d will be generated by a source c, e.g. the unigram language model:
P(d|c) = ∏_w P(w|c)^x_w, where x_w is the number of occurrences of word w in d
The parameters, i.e. the P(w|c)'s, are fit to optimally predict the generation of d.
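As a concrete illustration (not from the talk), here is a minimal sketch of fitting those parameters by maximum likelihood, with Laplace smoothing added to avoid zero probabilities; the function names and data layout are invented:

```python
# Minimal sketch: maximum-likelihood unigram P(w|c) with Laplace smoothing.
# Function names and toy data layout are invented for illustration.
import math
from collections import Counter

def fit_unigram(docs, alpha=1.0):
    """docs: list of token lists drawn from one source c. Returns P(w|c)."""
    counts = Counter(w for d in docs for w in d)
    total = sum(counts.values()) + alpha * len(counts)
    return {w: (n + alpha) / total for w, n in counts.items()}

def log_p_doc(doc, p_w_c):
    """log P(d|c) = sum_w x_w log P(w|c), computed token by token.
    Words unseen in training are skipped here for brevity; real code
    would smooth over the full vocabulary instead."""
    return sum(math.log(p_w_c[w]) for w in doc if w in p_w_c)
```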

4 Classifying Text with a Generative Model
One source model for each class c. Choose the class c with the largest value of:
P(c) P(d|c)
For 2 classes and unigram P(d|c), the log odds are:
log [P(c1) P(d|c1) / (P(c0) P(d|c0))] = log [P(c1)/P(c0)] + Σ_w x_w log [P(w|c1)/P(w|c0)]
aka Naive Bayes (NB); cf. Robertson/Sparck Jones relevance weighting.
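A hedged sketch of this two-class decision rule, reusing smoothed unigram estimates like those above; out-of-vocabulary handling is deliberately simplified:

```python
import math

def nb_log_odds(doc, p_w_c1, p_w_c0, prior1=0.5):
    """Two-class Naive Bayes score for a tokenized doc:
    log P(c1)/P(c0) + sum_w x_w log [P(w|c1)/P(w|c0)].
    p_w_c1, p_w_c0: smoothed unigram estimates, e.g. from fit_unigram above."""
    score = math.log(prior1 / (1.0 - prior1))
    for w in doc:
        if w in p_w_c1 and w in p_w_c0:  # skip out-of-vocabulary words
            score += math.log(p_w_c1[w] / p_w_c0[w])
    return score  # assign class c1 iff score > 0
```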

5 The Discriminative Alternative
Directly model the probability of the class conditional on the words: P(c|x), where x is the vector of word counts in d.
Logistic regression:
P(c=1|x) = exp(β_0 + Σ_w β_w x_w) / (1 + exp(β_0 + Σ_w β_w x_w))
Tune the parameters to optimize the conditional likelihood, i.e. the quality of the class probability predictions.
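For illustration (not from the talk), a minimal batch gradient-ascent fit of the conditional log-likelihood; the learning rate and iteration count are arbitrary placeholders:

```python
# Sketch: logistic regression fit by batch gradient ascent on the
# conditional log-likelihood. X is an (n_docs, n_words) count matrix,
# y is a {0,1} label vector; hyperparameters are illustrative only.
import numpy as np

def fit_lr(X, y, lr=0.1, n_iter=500):
    n, d = X.shape
    beta0, beta = 0.0, np.zeros(d)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta + beta0)))   # P(c=1 | x)
        # Gradient of sum_i [y_i log p_i + (1 - y_i) log(1 - p_i)]
        beta0 += lr * float(np.mean(y - p))
        beta  += lr * (X.T @ (y - p)) / n
    return beta0, beta
```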

6 LR & NB: Same Parameters!

7 Observations
– LR and NB have the same parameterization for 2-class or k-class problems, with binary or raw TF weighting.
– LR outperforms NB in text categorization and batch filtering studies: NB optimizes its parameters to predict words, while LR optimizes them to predict the class.

8 False Hopes for LM?
– Leveraging unlabeled data (e.g. via EM)? Initial results show only a small impact (the same story as in syntactic class tagging).
– Non-unigram models? They may only more accurately predict the wrong thing.
– Cross-lingual TC? Does it offer any more than MT followed by monolingual TC?
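For reference, a hedged sketch of the EM idea behind the first bullet, in the spirit of semi-supervised Naive Bayes (e.g. Nigam et al.); all names and array shapes here are invented, not from the talk:

```python
# Sketch: EM for semi-supervised Naive Bayes over term-frequency matrices.
# X_*: (n_docs, n_words) count matrices; y_lab: integer class labels.
import numpy as np

def nb_fit(X, post, alpha=1.0):
    """M-step: class priors and P(w|c) from (possibly soft) posteriors."""
    priors = post.sum(axis=0) / post.shape[0]             # P(c)
    counts = post.T @ X + alpha                           # soft counts + smoothing
    p_w_c = counts / counts.sum(axis=1, keepdims=True)    # P(w|c)
    return np.log(priors), np.log(p_w_c)

def nb_posterior(X, log_prior, log_p_w_c):
    """E-step: P(c|d) proportional to P(c) * prod_w P(w|c)^x_w, in log space."""
    joint = X @ log_p_w_c.T + log_prior
    joint -= joint.max(axis=1, keepdims=True)             # numerical stability
    post = np.exp(joint)
    return post / post.sum(axis=1, keepdims=True)

def em_nb(X_lab, y_lab, X_unl, n_classes, n_iter=10):
    hard = np.eye(n_classes)[y_lab]                       # one-hot labeled posteriors
    log_prior, log_p_w_c = nb_fit(X_lab, hard)            # init from labeled data
    for _ in range(n_iter):
        post_unl = nb_posterior(X_unl, log_prior, log_p_w_c)   # E-step
        X_all = np.vstack([X_lab, X_unl])
        post_all = np.vstack([hard, post_unl])
        log_prior, log_p_w_c = nb_fit(X_all, post_all)         # M-step
    return log_prior, log_p_w_c
```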

9 True LM Hopes 1: Small Data?
Number of training examples needed to reach maximum effectiveness (Ng & Jordan '01):
– NB: O(log # features)
– LR: O(# features)
LR and NB have apparently not yet been compared in the low-data case (TREC adaptive filtering, TDT tracking).
– Priors/smoothing are likely to prove critical there.

10 True LM Hopes 2: Facets?
Faceted MeSH category assignments, e.g.:
– Anti-Inflammatory Agents, Non-Steroidal/*therapeutic use
– Tumor Necrosis Factor/antagonists & inhibitors/immunology
Most facet combinations have zero training data.
A Berger & Lafferty-style MT approach?

11 Non-LM TC Challenges?
– Integration of prior knowledge
– Choosing documents to label (TREC adaptive, active learning, sampling)
– Combining text and non-text predictors
– Knowing how well a classifier will/can do
– Evolving category systems, switching vocabularies

