1  LM Approaches to Filtering
Richard Schwartz, BBN
LM/IR ARDA 2002, September 11-12, 2002, UMass
2  Topics
The LM approach
–What is it?
–Why is it preferred?
Controlling the filtering decision
3  What is the LM Approach?
We distinguish all ‘statistical’ approaches from ‘probabilistic’ approaches. The tf-idf metric computes various statistics of words and documents. By ‘probabilistic’ approaches, we mean methods that compute the probability of a document being relevant to a user’s need, given the query, the document, and the rest of the world, using a formula that arguably computes
P(Doc is Relevant | Query, Document, Collection, etc.)
Applying Bayes’ rule, this factors into the prior for each document,
P(Doc is Relevant | Everything except Query)
and the likelihood of the query,
P(Q | Doc is Relevant)
The LM approach is a solution to the second part. The prior probability component is also important.
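The factorization described above can be written out explicitly. Writing Rel for the event that the document is relevant, Bayes’ rule gives:

```latex
P(\mathrm{Rel} \mid Q, D)
  = \frac{P(Q \mid \mathrm{Rel}, D)\, P(\mathrm{Rel} \mid D)}
         {P(Q \mid D)}
  \;\propto\; \underbrace{P(Q \mid \mathrm{Rel}, D)}_{\text{query likelihood (the LM)}}
              \;\cdot\;
              \underbrace{P(\mathrm{Rel} \mid D)}_{\text{document prior}}
```

The denominator does not depend on the relevance of the document, so ranking by the product of the query likelihood and the prior is equivalent to ranking by the posterior.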
4  What It Is Not
If we compute an LM for the query and an LM for a document and ask the probability that the two underlying LMs are the same, I would NOT call this a posterior probability model. The two LMs would not be expected to be the same even with long queries.
5  Issues in LM Approaches for Filtering
We (ideally) have three sets of documents:
–Positive documents
–Negative documents
–A large corpus of unknown (mostly negative) documents
We can estimate a model for both the positive and negative documents:
–We can find more positive documents in the large corpus
–We use the large corpus to smooth the models estimated from the positive and negative documents
We compute the probability of each new document given each of the models. The log of the ratio of these two likelihoods is a score that indicates whether the document is positive or negative.
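A minimal sketch of the log-likelihood-ratio score described above, using unigram models smoothed with the large-corpus distribution. The mixture weight of 0.5 and the probability floor are illustrative choices, not values from the talk.

```python
import math
from collections import Counter

def unigram(docs):
    """Maximum-likelihood unigram distribution over a list of token lists."""
    counts = Counter(w for d in docs for w in d)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def smoothed_prob(word, topic_model, corpus_model, lam=0.5):
    """Mixture of the topic model and the background corpus model."""
    return lam * topic_model.get(word, 0.0) + (1 - lam) * corpus_model.get(word, 1e-9)

def llr_score(doc, pos_model, neg_model, corpus_model):
    """log P(doc | positive) - log P(doc | negative); > 0 favors 'positive'."""
    return sum(
        math.log(smoothed_prob(w, pos_model, corpus_model))
        - math.log(smoothed_prob(w, neg_model, corpus_model))
        for w in doc
    )
```

Because both likelihoods are computed over the same tokens, document-specific factors such as length partially cancel in the ratio, which is part of the motivation for the relative score discussed later.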
6  Language Modeling Choices
We can model the probability of the document given the topic in many ways. A simple unigram mixture works surprisingly well:
–A weighted mixture of distributions from the topic training data and the full corpus
We improve significantly over the ‘naïve Bayes’ model by using the Expectation-Maximization (EM) technique.
We can extend the model in many ways:
–N-gram models of words
–Phrases: proper names, collocations
Because we use a formal generative model, we know how to incorporate any effect we want.
–E.g., the probability of features of the top-5 documents given that some document is relevant
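A sketch of how EM can estimate the topic/corpus mixture weight lambda in P(w) = lambda * P_topic(w) + (1 - lambda) * P_corpus(w). The models are plain dicts of word probabilities; the function name and the probability floor are illustrative assumptions.

```python
def em_mixture_weight(doc, topic_model, corpus_model, iters=20, lam=0.5):
    """Estimate the mixture weight that maximizes the doc's likelihood."""
    floor = 1e-9
    for _ in range(iters):
        # E-step: posterior that each token was generated by the topic component
        posts = []
        for w in doc:
            pt = lam * topic_model.get(w, floor)
            pc = (1 - lam) * corpus_model.get(w, floor)
            posts.append(pt / (pt + pc))
        # M-step: the new weight is the average posterior over the tokens
        lam = sum(posts) / len(posts)
    return lam
```

Tokens that the topic model explains well pull the weight up; common background words pull it down, which is how the EM-estimated mixture improves on fixed-weight naïve Bayes.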
7  How to Set the Threshold
For filtering, we are required to make a hard decision about whether to accept each document, rather than just rank the documents.
Problems:
–The score for a particular document depends on many factors that are not important for the decision:
  Length of the document
  Percentage of low-likelihood words
–The range of scores depends on the particular topic.
We would like to map the score for any document and topic into a real posterior probability.
8  Score Normalization Techniques
By using the relative score under the two models, we remove some of the variance due to the particular document.
We can normalize for the peculiarities of the topic by computing the distribution of scores for off-topic documents.
Advantages of using off-topic documents:
–We have a very large number of them
–We can fix the probability of false alarms
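A sketch of the off-topic normalization idea under simple assumptions: z-normalize a raw score against the topic's score distribution on off-topic documents, and pick the raw-score threshold at a quantile of that distribution so the false-alarm probability is fixed. The 1% target rate is an illustrative choice.

```python
import statistics

def normalize(score, off_topic_scores):
    """Map a raw score to a z-score relative to off-topic documents."""
    mu = statistics.mean(off_topic_scores)
    sigma = statistics.stdev(off_topic_scores)
    return (score - mu) / sigma

def threshold_for_false_alarm_rate(off_topic_scores, rate=0.01):
    """Raw-score threshold that accepts about `rate` of off-topic docs."""
    ranked = sorted(off_topic_scores, reverse=True)
    k = max(0, min(len(ranked) - 1, int(rate * len(ranked))))
    return ranked[k]
```

Because the same normalization is applied per topic, scores from different topics become comparable, which is what a single global filtering threshold requires.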
9  The Bottom Line
For TDT tracking, the probabilistic approach to modeling the document and to score normalization results in better performance, whether for mono-language, cross-language, or speech-recognition output.
Large improvements will come after multiple sites start using similar techniques.
10  Grand Challenges
Tested in TDT:
–Operating with small amounts of training data for each category (1 to 4 documents per event)
–Robustness to changes over time (adaptation)
–Multi-lingual domains
–How to set the threshold for filtering
–Using a model of ‘eventness’
Large hierarchical category sets:
–How to use the structure
Effective use of prior knowledge
Predicting performance and characterizing classes
We need a task on which both the discriminative and the LM approaches will be tested.
11  What Do You Really Want?
If a user provides a document about the 9/11 World Trade Center crash and says they want “more like this”, do they want documents about:
–Airplane crashes
–Terrorism
–Building fires
–Injuries and death
–Some combination of the above?
In general, we need a way to clarify which combination of topics the user wants.
In TDT, we predefine the task to mean that we want more about this specific event (and not about some other terrorist airplane crash into a building).