SIGIR 2005. Relevance Information: A Loss of Entropy but a Gain for IDF? Arjen P. de Vries and Thomas Roelleke

Motivation
How should relevance information be incorporated in systems using TF*IDF term weighting?
– TF*IDF combines frequent occurrence with term discriminativeness
– Adding relevance information to a retrieval system corresponds to a loss of entropy; how does this affect IDF, the measure of term discriminativeness?

Overview
PART I – IDF, relevance, and BIR
PART II – Alternative estimation for IDF

IDF
A robust summary statistic of term occurrence that helps identify ‘good’ terms
Follows naturally from the Binary Independence Retrieval (BIR) model
– The ranking that results from the situation without relevance information
– Related to the occurrence probability P(t|C)

BIR
– Binary term presence/absence
– Rank documents by their probability of relevance
– Using the odds of relevance avoids the estimation of some terms without affecting the ranking
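
The slide's formulas did not survive the transcript; for reference, the standard BIR derivation (Robertson and Spärck Jones) that these bullets describe reads, in LaTeX:

O(R \mid d) \;\propto\; \prod_{t \in q \cap d} \frac{p_t \,(1 - q_t)}{q_t \,(1 - p_t)}
\qquad\Longrightarrow\qquad
\mathrm{RSV}(d) \;=\; \sum_{t \in q \cap d} \log \frac{p_t \,(1 - q_t)}{q_t \,(1 - p_t)},
\qquad p_t = P(t \mid R),\; q_t = P(t \mid \bar{R}).

Without relevance information, the usual choices p_t = 1/2 and q_t = n/N give a per-term weight of \log \frac{N-n}{n}, which is the h(t,C) of the next slide.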

BIR
Without relevance information, h(t,C) = −log( n / (N − n) )
(Almost) the discriminativeness of term t in collection C!
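
A minimal Python sketch of these two statistics (the natural log is used here; any base gives the same ranking), showing that for rare terms h(t,C) is nearly the plain IDF:

import math

def idf(n: int, N: int) -> float:
    """Classic IDF: -log P(t|C) with P(t|C) = n/N, where n is the
    document frequency of term t and N the collection size."""
    return -math.log(n / N)

def h(n: int, N: int) -> float:
    """BIR presence weight without relevance information:
    h(t,C) = -log(n / (N - n))."""
    return -math.log(n / (N - n))

# For rare terms (n << N), N - n is close to N, so h ~ IDF:
print(idf(10, 100_000))  # ~ 9.21
print(h(10, 100_000))    # ~ 9.21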

BIR and IDF
View IDF as a term statistic in a set of documents, R or ¬R
Then the BIR probability estimation known as F4 corresponds to
IDF(t,¬R) − IDF(t,R) + IDF(¬t,R) − IDF(¬t,¬R)
IDF(t,R) can be interpreted as the discriminativeness of term presence among the relevant documents, etc.
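
A quick numeric check of this decomposition, using the unsmoothed form of the F4 weight (the F4 of the literature adds 0.5 to each count); r denotes the number of relevant documents containing t:

import math

def set_idf(k: int, M: int) -> float:
    """IDF of an event occurring in k out of M documents."""
    return -math.log(k / M)

N, R = 100_000, 40   # collection size, number of relevant documents
n, r = 500, 25       # docs containing t, relevant docs containing t

# Unsmoothed F4 / Robertson-Sparck Jones weight:
f4 = math.log(r * (N - n - R + r) / ((n - r) * (R - r)))

# The slide's IDF decomposition:
decomposed = (set_idf(n - r, N - R) - set_idf(r, R)
              + set_idf(R - r, R) - set_idf(N - R - (n - r), N - R))

assert abs(f4 - decomposed) < 1e-9   # identical up to rounding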

BIR and IDF
In practice, the ‘complement method’ gives ¬R = C\R ≈ C; so, usually, updating IDF under relevance information corresponds to subtracting IDF(t,R)!
The BIR modifies h(t,C) most for terms that are rare in the relevant set, since they do not help identify good documents

Implication for TF*IDF systems
A system using IDF(t,C) uses presence weighting only, implicitly assuming that term t occurs in all relevant documents (such that IDF(t,R) = −log R/R = 0)
Systems using TF*IDF term weighting can incorporate relevance feedback in accordance with the Binary Independence Retrieval model (see the sketch below)
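
Putting the last two slides together, a hypothetical feedback step for a TF*IDF system might look as follows. This is a sketch of the slide's approximation (subtract IDF(t,R), via the complement method), not the authors' implementation; the 0.5 smoothing is the usual RSJ convention, assumed here to avoid log(0):

import math

def updated_idf(n: int, N: int, r: int, R: int) -> float:
    """Relevance-updated IDF via the complement method (not-R ~ C):
    IDF(t,C) - IDF(t,R), with 0.5-smoothed counts. n and N as before;
    r of the R judged-relevant documents contain t."""
    idf_C = -math.log((n + 0.5) / (N + 1))
    idf_R = -math.log((r + 0.5) / (R + 1))
    return idf_C - idf_R

# A term present in every relevant document keeps almost its full IDF:
print(updated_idf(n=100, N=100_000, r=10, R=10))  # ~ 6.86
# A term rare among the relevant documents is discounted heavily:
print(updated_idf(n=100, N=100_000, r=1, R=10))   # ~ 4.91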

Estimating IDF
Recall that IDF(t,C) = −log P(t|C), with P(t|C) the occurrence probability of t in C
– Assuming the events d are disjoint and exhaustive, we obtain P(t|C) = n/N
Q: Is this the best estimation method?
– Notice that, in the BIR formulation, the sets R and ¬R have very different cardinalities…

Estimating TF
For TF weights, we know that estimating P(t|d) by a Poisson approximation (as applied in, e.g., BM25) or by lifting (as applied in, e.g., INQUERY) leads to superior retrieval results (sketched below)
The motivation for these different estimates is to better handle the influence of varying document lengths
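
For concreteness, hedged sketches of the two estimates just mentioned; the k1/b defaults and the INQUERY belief constants (0.4/0.6) are the commonly cited values, not parameters taken from this paper:

def bm25_tf(tf: int, dl: int, avdl: float,
            k1: float = 1.2, b: float = 0.75) -> float:
    """BM25's Poisson-motivated tf component: saturates with tf and
    normalizes for document length dl against the average avdl."""
    return tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avdl))

def inquery_tf(tf: int, dl: int, avdl: float) -> float:
    """INQUERY-style 'lifted' tf belief: a default belief of 0.4 plus
    a saturating, length-normalized tf part (the idf factor that
    INQUERY multiplies in is omitted here)."""
    ntf = tf / (tf + 0.5 + 1.5 * dl / avdl)
    return 0.4 + 0.6 * ntf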

Poisson Estimate
The ‘Poisson estimate’ approximates the (Poisson-based) probability that term t occurs at least once
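
The slide's formula did not survive the transcript. Below is a sketch consistent with the description: with Poisson rate lambda, the probability that t occurs at least once is 1 - exp(-lambda). The parameterization lambda = K*n/N is an assumption made here to expose the paper's constant K (so that K = N/10 gives lambda = n/10); the exact definition is in the paper.

import math

def idf_p(n: int, N: int, K: float) -> float:
    """Poisson-based IDF: -log P(t occurs at least once) with
    P = 1 - exp(-lam). ASSUMPTION: lam = K * n / N; this mirrors,
    but may not exactly match, the paper's parameterization."""
    lam = K * n / N
    return -math.log(1.0 - math.exp(-lam))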

Poisson Estimate vs. n/N (figure comparing the two estimates; plot not preserved)

Again, for small n (figure; plot not preserved)

|Poisson − Estimate| (figure showing the absolute difference; plot not preserved)

Experimental Setup
Ad hoc retrieval
– TREC-7 and TREC-8 (topics 351–400 and 401–450)
– No stemming
Routing
– LA Times articles for training (1989/1990)
– Remainder for testing
BM25 constants: (values not preserved)

Results: IDF vs. IDF_p
(table comparing IDF and IDF_p runs for T, TD, and TDN queries on TREC-7 and TREC-8; the values were not preserved)

IDF vs. IDF_p
For the short T queries, the user carefully selects the most discriminative terms with respect to relevance
The longer TD and TDN queries, however, also contain noisy, non-discriminative terms

IDF vs. IDF_p
IDF_p orders terms by their discriminativeness in the same order as IDF, but reduces the influence of the non-discriminative terms on the ranking
– It differentiates more between rare terms and less between frequent terms (illustrated below)
As a result, the effect of the Poisson-based estimation is much stronger for the longer queries
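
Using the idf and idf_p sketches from above (the same assumption about K applies), the compression of the frequent end of the scale is easy to see:

N = 100_000
K = N / 10
for n in (1, 2, 10, 1_000, 10_000):
    print(n, round(idf(n, N), 2), round(idf_p(n, N, K), 2))
# n       idf    idf_p
# 1       11.51  2.35
# 2       10.82  1.71
# 10       9.21  0.46
# 1000     4.61  0.0
# 10000    2.3   0.0
# Same ordering, but the frequent terms' weights collapse toward 0,
# so noisy query terms contribute little to the ranking.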

TF*IDF vs. TF*IDF_p
Estimation with IDF_p results in better mean average precision than the ‘traditional’ estimate
A strong emphasis on discriminativeness (the Poisson approximation IDF_p with large values of K) improves effectiveness
Best overall performance for K = N/10
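
How the pieces could fit together in a TF*IDF_p ranker, reusing the bm25_tf and idf_p sketches above; the scoring shape (a sum over query terms of a tf part times an idf part) is the generic TF*IDF pattern, not necessarily the exact run configuration behind these results:

def tfidf_p_score(query_terms, doc_tf, dl, avdl, df, N, K=None):
    """Score one document. doc_tf maps term -> frequency in the
    document; df maps term -> document frequency in the collection."""
    if K is None:
        K = N / 10          # the paper's best-performing setting
    score = 0.0
    for t in query_terms:
        tf = doc_tf.get(t, 0)
        if tf == 0 or t not in df:
            continue
        score += bm25_tf(tf, dl, avdl) * idf_p(df[t], N, K)
    return score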

Routing experiment
The TF*IDF_p results without feedback are better than all TF*IDF results
But the TF*IDF_p results without feedback are also better than all TF*IDF_p results with feedback
Finally, the TF*IDF results improve only marginally with feedback
Is the LA Times training data not representative?

Conclusions
PART I
– IDF and the Binary Independence Retrieval model are very closely related
– Relevance information can be incorporated into TF*IDF by revising IDF
PART II
– A different estimation of the occurrence probability in IDF leads to improved retrieval effectiveness

Open Questions
Can we derive the choice K = N/10 analytically?
Is the observed improvement in effectiveness really due to a better (frequentist) model of the occurrence probability, or is it a qualitative argument for informativeness?
More questions from the audience?!