SIGIR 2005
Relevance Information: A Loss of Entropy but a Gain for IDF?
Arjen P. de Vries, Thomas Roelleke
Motivation
How should relevance information be incorporated in systems using TF*IDF term weighting?
–TF*IDF combines frequent occurrence with term discriminativeness
–Adding relevance information to a retrieval system corresponds to a loss of entropy; how does this affect IDF (the measure of term discriminativeness)?
Overview
PART I
–IDF, relevance, and BIR
PART II
–Alternative estimation for IDF
IDF
A robust summary statistic of term occurrence that helps identify ‘good’ terms
Follows naturally from the Binary Independence Retrieval (BIR) model
–The ranking that results from the situation without relevance information
–Related to the occurrence probability P(t|C)
BIR
Binary term presence/absence
Rank documents by their probability of relevance
Using the odds of relevance avoids estimating document-independent factors, without affecting the ranking
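To make the odds step concrete, here is the standard BIR rank-equivalence derivation (my reconstruction; not spelled out on the slide). Dividing by the document-independent product over all terms leaves only factors for terms present in d:

```latex
\frac{P(R \mid d)}{P(\bar{R} \mid d)}
  \;\propto\; \prod_{t \in d} \frac{P(t \mid R)}{P(t \mid \bar{R})}
              \prod_{t \notin d} \frac{P(\bar{t} \mid R)}{P(\bar{t} \mid \bar{R})}
  \;\stackrel{\mathrm{rank}}{=}\;
  \prod_{t \in d} \frac{P(t \mid R)\, P(\bar{t} \mid \bar{R})}{P(t \mid \bar{R})\, P(\bar{t} \mid R)}
```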
BIR
Without relevance information, h(t,C) = –log( n / (N – n) ), where n is the number of documents containing t and N the collection size
(Almost) the discriminativeness of term t in collection C!
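A minimal sketch (Python, not from the slides) comparing this weight with plain IDF; for rare terms n ≪ N, so −log(n/(N−n)) ≈ −log(n/N), which is the "almost" on the slide:

```python
import math

def bir_weight(n: int, N: int) -> float:
    """BIR weight without relevance information: h(t,C) = -log(n / (N - n))."""
    return -math.log(n / (N - n))

def idf(n: int, N: int) -> float:
    """Classic IDF: -log P(t|C), with P(t|C) = n/N."""
    return -math.log(n / N)

N = 100_000
for n in (10, 1_000, 50_000):
    # The two weights nearly coincide for rare terms and diverge as n grows.
    print(n, round(bir_weight(n, N), 4), round(idf(n, N), 4))
```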
BIR and IDF
View IDF as a term statistic over a set of documents, R or ¬R
Then the BIR probability estimation known as F4 corresponds to IDF(t,¬R) – IDF(t,R) + IDF(¬t,R) – IDF(¬t,¬R)
IDF(t,R) can be interpreted as the discriminativeness of term presence among the relevant documents, etc.
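A sketch of that correspondence (assuming the set-based reading IDF(t,S) = −log(count/|S|), taking ¬R as C\R, and omitting F4's usual +0.5 smoothing):

```python
import math

def idf(k: int, size: int) -> float:
    """Set-based IDF: -log P(t|S), with P(t|S) = k/size (no smoothing)."""
    return -math.log(k / size)

def f4_weight(r: int, R: int, n: int, N: int) -> float:
    """F4 term weight written as a combination of four IDFs.
    r: relevant docs containing t, R: size of the relevant set,
    n: docs in C containing t, N: collection size."""
    nR = N - R        # size of the non-relevant set ¬R = C \ R
    n_nR = n - r      # docs in ¬R that contain t
    return (idf(n_nR, nR)              # IDF(t, ¬R)
            - idf(r, R)                # - IDF(t, R)
            + idf(R - r, R)            # + IDF(¬t, R)
            - idf(nR - n_nR, nR))      # - IDF(¬t, ¬R)

# Made-up counts: t occurs in 8 of 10 relevant docs and in 100 of 100,000 docs.
print(f4_weight(r=8, R=10, n=100, N=100_000))  # ≈ 8.38
```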
BIR and IDF
In practice, the ‘complement method’ gives ¬R = C\R ≈ C; so, usually, updating IDF under relevance information corresponds to subtracting IDF(t,R)!
The BIR modifies h(t,C) most for those terms that are rare in the relevant set, since such terms do not help identify good documents
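A quick numeric illustration of this update rule, with made-up counts:

```python
import math

N, n = 100_000, 100        # collection size; documents containing t
R = 10                     # number of known relevant documents
idf_tC = -math.log(n / N)  # ≈ 6.91: t looks highly discriminative in C

# t present in 8 of 10 relevant docs: IDF(t,R) is small, the weight barely drops.
print(idf_tC - (-math.log(8 / R)))  # ≈ 6.68

# t present in only 1 of 10 relevant docs: IDF(t,R) is large, the weight drops sharply.
print(idf_tC - (-math.log(1 / R)))  # ≈ 4.61
```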
Implication for TF*IDF systems
A system using IDF(t,C) uses presence weighting only, implicitly assuming that term t occurs in all relevant documents (such that IDF(t,R) = –log R/R = 0)
Systems using TF*IDF term weighting can thus incorporate relevance feedback (RFB) in accordance with the binary independence retrieval model
Estimating IDF
Recall that IDF(t,C) = –log P(t|C), with P(t|C) the occurrence probability of t in C
–Assuming the events d are disjoint and exhaustive, we obtain P(t|C) = n/N
Q: Is this the best estimation method?
–Notice that, in the BIR formulation, the sets R and ¬R have very different cardinality…
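Spelling that step out (my reconstruction, assuming a uniform document prior P(d|C) = 1/N and binary occurrence P(t|d) ∈ {0,1}, so that exactly n summands are non-zero):

```latex
P(t \mid C) \;=\; \sum_{d \in C} P(t \mid d)\, P(d \mid C)
            \;=\; \frac{1}{N} \sum_{d \in C} P(t \mid d)
            \;=\; \frac{n}{N}
```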
Estimating TF
For TF weights, we know that estimating P(t|d) by a Poisson approximation (e.g., as applied in BM25) or by lifting (e.g., as applied in Inquery) leads to superior retrieval results
The motivation for this different estimate is to better handle the influence of varying document lengths
The Poisson Estimate
The ‘Poisson estimate’ approximates the (Poisson-based) probability that term t occurs at least once
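A sketch of this estimate (my reading of the slides, in Python): with Poisson rate λ(t) = n/K, the probability of at least one occurrence is 1 − e^(−λ(t)). The exact role of K is an assumption here; the later slides treat it as a free constant and report K = N/10 as the best setting.

```python
import math

def poisson_idf(n: int, K: float) -> float:
    """Poisson-based IDF_p = -log P_p(t|C), with P_p(t|C) = 1 - exp(-n/K)
    approximating the probability that t occurs at least once.
    K is a free parameter (assumed role; slides report K = N/10 works best)."""
    return -math.log(1.0 - math.exp(-n / K))

def idf(n: int, N: int) -> float:
    return -math.log(n / N)

# IDF_p keeps the same term ordering as IDF, but P_p saturates at 1 for
# frequent terms, compressing their weight differences while preserving
# the differences between rare terms.
N, K = 100_000, 10_000   # K = N/10
for n in (10, 100, 10_000, 50_000):
    print(n, round(idf(n, N), 3), round(poisson_idf(n, K), 3))
```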
[Figures: Poisson estimate vs. n/N; again, for small n; |Poisson – Estimate|]
Experimental Setup
Ad-hoc retrieval
–TREC-7 and TREC-8 topics
–No stemming
Routing
–LA Times articles for training (1989/1990)
–Remainder of the collection for testing
–BM25 constants
Results: IDF vs. IDF_p
[Table: retrieval results for IDF and IDF_p on TREC-7 and TREC-8, for T, TD, and TDN queries]
IDF vs. IDF_p
For the short T queries, the user carefully selects the most discriminative terms with respect to relevance
The longer TD and TDN queries, however, also contain noisy, non-discriminative terms
IDF vs. IDF_p
IDF_p orders terms by their discriminativeness in the same order as IDF, but reduces the influence of the non-discriminative terms on the ranking
–It differentiates more between rare terms, and less between frequent terms
As a result, the effect of the Poisson-based estimation is much stronger for the longer queries
TF*IDF vs. TF*IDF_p
Estimation with IDF_p results in better mean average precision than the ‘traditional’ estimate
A strong emphasis on discriminativeness (Poisson approximation IDF_p using large values of K) improves effectiveness
Best overall performance for K = N/10
Routing experiment
The TF*IDF_p results without feedback are better than all TF*IDF results
But the TF*IDF_p results without feedback are also better than all TF*IDF_p results with feedback
Finally, the TF*IDF results improve only marginally with feedback
Is the LA Times training data not representative?
Conclusions
PART I
–IDF and the Binary Independence Retrieval model are very closely related
–Relevance information can be incorporated in TF*IDF by revising IDF
PART II
–A different estimation of the occurrence probability in IDF leads to improved retrieval effectiveness
Open Questions
Can we derive the choice K = N/10 analytically?
Is the observed improvement in effectiveness really due to a better (frequentist) model of the occurrence probability, or is it a qualitative argument for informativeness?
More questions in the audience?!