A Formal Study of Information Retrieval Heuristics




Similar presentations
Keynote at ICTIR 2011, Sept. 13, 2011, Bertinoro, Italy Axiomatic Analysis and Optimization of Information Retrieval Models ChengXiang (Cheng) Zhai Department.

Information Retrieval and Organisation Chapter 11 Probabilistic Information Retrieval Dell Zhang Birkbeck, University of London.
A Formal Study of Information Retrieval Heuristics Hui Fang, Tao Tao, and ChengXiang Zhai University of Illinois at Urbana Champaign SIGIR 2004 (Best paper.
Diversified Retrieval as Structured Prediction Redundancy, Diversity, and Interdependent Document Relevance (IDR ’09) SIGIR 2009 Workshop Yisong Yue Cornell.
1 Language Models for TR (Lecture for CS410-CXZ Text Info Systems) Feb. 25, 2011 ChengXiang Zhai Department of Computer Science University of Illinois,
Improvements to BM25 and Language Models Examined ANDREW TROTMAN, ANTTI PUURULA, BLAKE BURGESS AUSTRALASIAN DOCUMENT COMPUTING SYMPOSIUM 2014 MELBOURNE,
Lecture 11 Search, Corpora Characteristics, & Lucene Introduction.
A Novel TF-IDF Weighting Scheme for Effective Ranking Jiaul H. Paik Indian Statistical Institute, Kolkata, India SIGIR’2013 Presenter:
CpSc 881: Information Retrieval
A Maximum Coherence Model for Dictionary-based Cross-language Information Retrieval Yi Liu, Rong Jin, Joyce Y. Chai Dept. of Computer Science and Engineering.
Information Retrieval Models: Probabilistic Models
DIMENSIONALITY REDUCTION BY RANDOM PROJECTION AND LATENT SEMANTIC INDEXING Jessica Lin and Dimitrios Gunopulos Ângelo Cardoso IST/UTL December
Query Operations: Automatic Local Analysis. Introduction Difficulty of formulating user queries –Insufficient knowledge of the collection –Insufficient.
Chapter 5: Query Operations Baeza-Yates, 1999 Modern Information Retrieval.
T.Sharon - A.Frank 1 Internet Resources Discovery (IRD) IR Queries.
Modeling Modern Information Retrieval
Language Models for TR Rong Jin Department of Computer Science and Engineering Michigan State University.
The Vector Space Model …and applications in Information Retrieval.
Retrieval Models II Vector Space, Probabilistic.  Allan, Ballesteros, Croft, and/or Turtle Properties of Inner Product The inner product is unbounded.
INEX 2003, Germany Searching in an XML Corpus Using Content and Structure INEX 2003, Germany Yiftah Ben-Aharon, Sara Cohen, Yael Grumbach, Yaron Kanza,
Online Learning for Web Query Generation: Finding Documents Matching a Minority Concept on the Web Rayid Ghani Accenture Technology Labs, USA Rosie Jones.
IR Models: Review Vector Model and Probabilistic.
Chapter 5: Information Retrieval and Web Search
Modeling (Chap. 2) Modern Information Retrieval Spring 2000.
1 Vector Space Model Rong Jin. 2 Basic Issues in A Retrieval Model How to represent text objects What similarity function should be used? How to refine.
1 Information Filtering & Recommender Systems (Lecture for CS410 Text Info Systems) ChengXiang Zhai Department of Computer Science University of Illinois,
A Comparative Study of Search Result Diversification Methods Wei Zheng and Hui Fang University of Delaware, Newark DE 19716, USA
CSE 6331 © Leonidas Fegaras Information Retrieval 1 Information Retrieval and Web Search Engines Leonidas Fegaras.
Modern Information Retrieval: A Brief Overview By Amit Singhal Ranjan Dash.
Bayesian Extension to the Language Model for Ad Hoc Information Retrieval Hugo Zaragoza, Djoerd Hiemstra, Michael Tipping Presented by Chen Yi-Ting.
A General Optimization Framework for Smoothing Language Models on Graph Structures Qiaozhu Mei, Duo Zhang, ChengXiang Zhai University of Illinois at Urbana-Champaign.
Term Frequency. Term frequency Two factors: – A term that appears just once in a document is probably not as significant as a term that appears a number.
Context-Sensitive Information Retrieval Using Implicit Feedback Xuehua Shen : department of Computer Science University of Illinois at Urbana-Champaign.
Chapter 6: Information Retrieval and Web Search
Ranking in Information Retrieval Systems Prepared by: Mariam John CSE /23/2006.
Less is More Probabilistic Models for Retrieving Fewer Relevant Documents Harr Chen, David R. Karger MIT CSAIL ACM SIGIR 2006 August 9, 2006.
Boolean Model Hongning Wang Abstraction of search engine architecture User Ranker Indexer Doc Analyzer Index results Crawler Doc Representation.
A Language Modeling Approach to Information Retrieval 한 경 수  Introduction  Previous Work  Model Description  Empirical Results  Conclusions.
Gravitation-Based Model for Information Retrieval Shuming Shi, Ji-Rong Wen, Qing Yu, Ruihua Song, Wei-Ying Ma Microsoft Research Asia SIGIR 2005.
Vector Space Models.
Lower-Bounding Term Frequency Normalization Yuanhua Lv and ChengXiang Zhai University of Illinois at Urbana-Champaign CIKM 2011 Best Student Award Paper.
1 A Formal Study of Information Retrieval Heuristics Hui Fang, Tao Tao and ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign.
Dependence Language Model for Information Retrieval Jianfeng Gao, Jian-Yun Nie, Guangyuan Wu, Guihong Cao, Dependence Language Model for Information Retrieval,
Active Feedback in Ad Hoc IR Xuehua Shen, ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign.
Information Retrieval Models: Vector Space Models
Term Weighting approaches in automatic text retrieval. Presented by Ehsan.
Axiomatic Analysis and Optimization of Information Retrieval Models ChengXiang (“Cheng”) Zhai Department of Computer Science University of Illinois at.
Xiaoying Gao Computer Science Victoria University of Wellington COMP307 NLP 4 Information Retrieval.
Toward Entity Retrieval over Structured and Text Data Mayssam Sayyadian, Azadeh Shakery, AnHai Doan, ChengXiang Zhai Department of Computer Science University.
A Study of Poisson Query Generation Model for Information Retrieval
Hui Fang (ACL 2008) presentation 2009/02/04 Rick Liu.
SIGIR 2005 Relevance Information: A Loss of Entropy but a Gain for IDF? Arjen P. de Vries Thomas Roelleke,
Information Retrieval and Web Search IR models: Vector Space Model Term Weighting Approaches Instructor: Rada Mihalcea.
A Study of Smoothing Methods for Language Models Applied to Ad Hoc Information Retrieval Chengxiang Zhai, John Lafferty School of Computer Science Carnegie.
A Formal Study of Information Retrieval Heuristics Hui Fang, Tao Tao, ChengXiang Zhai University of Illinois at Urbana Champaign Urbana SIGIR 2004 Presented.
LEARNING IN A PAIRWISE TERM-TERM PROXIMITY FRAMEWORK FOR INFORMATION RETRIEVAL Ronan Cummins, Colm O’Riordan (SIGIR’09) Speaker : Yi-Ling Tai Date : 2010/03/15.
Bayesian Extension to the Language Model for Ad Hoc Information Retrieval Hugo Zaragoza, Djoerd Hiemstra, Michael Tipping Microsoft Research Cambridge,
Queensland University of Technology
Information Retrieval Models: Probabilistic Models
Information Retrieval and Web Search
A Markov Random Field Model for Term Dependencies
Representation of documents and queries
Chapter 5: Information Retrieval and Web Search
CS 430: Information Discovery
Retrieval Utilities Relevance feedback Clustering
INF 141: Information Retrieval
Information Retrieval and Web Design
Language Models for TR Rong Jin
ADVANCED TOPICS IN INFORMATION RETRIEVAL AND WEB SEARCH
VECTOR SPACE MODEL Its Applications and implementations
Presentation transcript:

A Formal Study of Information Retrieval Heuristics
Hui Fang, Tao Tao, ChengXiang Zhai
University of Illinois at Urbana-Champaign
SIGIR 2004 Best Paper
Presented by Lingjie Zhang

Motivation
Good retrieval performance is closely related to the use of various retrieval heuristics, e.g., TF-IDF weighting.
A retrieval function F is optimal if it satisfies all the constraints; if a function Fa satisfies more constraints than a function Fb, Fa would be expected to perform better than Fb empirically.
What exactly are these "necessary" heuristics that seem to cause good retrieval performance?
Key idea: relevance can be modeled by a set of formally defined constraints on a retrieval function.

Formal Definitions of Heuristic Retrieval Constraints
Six intuitive and desirable constraints that any reasonable retrieval formula should satisfy:
Term Frequency Constraints (TFC1, TFC2)
Term Discrimination Constraint (TDC)
Length Normalization Constraints (LNC1, LNC2)
TF-Length Constraint (TF-LNC)
Empirical studies show that good retrieval performance is closely related to the use of various retrieval heuristics, especially TF-IDF weighting and document length normalization. We formally define a set of basic desirable constraints that any reasonable retrieval formula should satisfy.

Formal Definitions of Heuristic Retrieval Constraints (TFCs, TDC, LNCs, TF-LNC)
Term Frequency Constraints (TFCs)
TFC1: give a higher score to a document with more occurrences of a query term.
Let q = {w}. Assume |d1| = |d2|. If c(w,d1) > c(w,d2), then f(d1,q) > f(d2,q).
In addition, the score increase caused by raising TF from 1 to 2 should be larger than that caused by raising TF from 100 to 101; this is captured by TFC2 on the next slide.
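As a concrete illustration (not from the paper), here is a minimal Python sketch that empirically checks TFC1 against a candidate scoring function; the score function and its signature are hypothetical.

    import math

    # Toy candidate for a one-term query: logarithmic TF weighting.
    # Document length is accepted but ignored in this toy.
    def score(tf: int, doc_len: int) -> float:
        return (1.0 + math.log(tf)) if tf > 0 else 0.0

    # TFC1: holding |d| fixed, more occurrences of the query term
    # must yield a strictly higher score.
    def satisfies_tfc1(f, doc_len: int = 100, max_tf: int = 20) -> bool:
        scores = [f(tf, doc_len) for tf in range(0, max_tf + 1)]
        return all(a < b for a, b in zip(scores, scores[1:]))

    print(satisfies_tfc1(score))  # True: log TF is strictly increasing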

Formal Definitions of Heuristic Retrieval Constraints (TFCs, TDC, LNCs, TF-LNC)
TFC2 has two properties:
(1) The increase in the score due to an increase in TF is smaller for larger TFs (e.g., TF 1→2 matters more than TF 100→101).
Let q = {w}. Assume |d1| = |d2| = |d3| and c(w,d1) > 0. If c(w,d2) − c(w,d1) = 1 and c(w,d3) − c(w,d2) = 1, then f(d2,q) − f(d1,q) > f(d3,q) − f(d2,q).
(2) Favor a document covering more distinct query terms: a higher score is given to the document that covers more distinct query terms.
Let q = {w1, w2}. Assume |d1| = |d2| and idf(w1) = idf(w2). If c(w1,d2) = c(w1,d1) + c(w2,d1), c(w2,d2) = 0, c(w1,d1) ≠ 0, and c(w2,d1) ≠ 0, then f(d1,q) > f(d2,q).
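Property (1) can be seen in a BM25-style saturating TF component; the sketch below (illustrative, not the paper's code) verifies that successive gains shrink:

    # BM25-style saturating TF component: tf / (tf + k1); the gain from
    # each additional occurrence shrinks as tf grows.
    def tf_component(tf: int, k1: float = 1.2) -> float:
        return tf / (tf + k1)

    # TFC2 property (1): successive score increments must be decreasing.
    def has_decreasing_gains(f, max_tf: int = 200) -> bool:
        gains = [f(t + 1) - f(t) for t in range(1, max_tf)]
        return all(g1 > g2 for g1, g2 in zip(gains, gains[1:]))

    print(tf_component(2) - tf_component(1))      # ~0.17: gain from TF 1 -> 2
    print(tf_component(101) - tf_component(100))  # ~0.0001: gain from TF 100 -> 101
    print(has_decreasing_gains(tf_component))     # True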

Formal Definitions of Heuristic Retrieval Constraints (TFCs, TDC, LNCs, TF-LNC)
Term Discrimination Constraint (TDC): favor a document that has more occurrences of discriminative terms (i.e., high-IDF terms).
Let q = {w1, w2}. Assume |d1| = |d2| and c(w1,d1) + c(w2,d1) = c(w1,d2) + c(w2,d2). If idf(w1) ≥ idf(w2) and c(w1,d1) ≥ c(w1,d2), then f(d1,q) ≥ f(d2,q).
Intuition: w1 is the rare word with the higher IDF; d1, which allocates more of the (equal) total term count to w1, gets the higher score according to this constraint.
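A small worked example (illustrative values, not from the paper) with a plain TF-IDF scorer shows the constraint in action:

    # Plain TF-IDF scorer over term-count dictionaries (illustrative only).
    def tfidf_score(counts: dict, idf: dict) -> float:
        return sum(c * idf[w] for w, c in counts.items())

    idf = {"w1": 3.0, "w2": 1.0}   # w1 is the rarer, more discriminative term
    d1 = {"w1": 3, "w2": 1}        # more of the shared mass on the rare term
    d2 = {"w1": 1, "w2": 3}        # same total count, shifted to the common term

    assert sum(d1.values()) == sum(d2.values())         # TDC's equal-total premise
    print(tfidf_score(d1, idf) > tfidf_score(d2, idf))  # True: TDC-consistent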

Formal Definitions of Heuristic Retrieval Constraints (TFCs, TDC, LNCs, TF-LNC)
Length Normalization Constraints (LNCs)
LNC1: penalize long documents.
Let q be a query and d1, d2 two documents. If for some word w' ∉ q, c(w',d2) = c(w',d1) + 1, but for every query term w, c(w,d2) = c(w,d1), then f(d1,q) ≥ f(d2,q).
LNC2: avoid over-penalizing long documents.
Let q be a query and k > 1. If |d1| = k·|d2| and for all terms w, c(w,d1) = k·c(w,d2), then f(d1,q) ≥ f(d2,q).
In other words, if we concatenate a document with itself k times to form a new document, the score of the new document should not be lower than that of the original document.
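The concatenation reading of LNC2 is easy to test numerically; this sketch (toy scorers with illustrative parameters) contrasts a scorer that respects LNC2 with one that over-penalizes length:

    # Two toy length-normalization schemes for a one-term query.
    def scale_invariant(tf: int, doc_len: int) -> float:
        return tf / doc_len            # penalty proportional to length

    def over_penalizing(tf: int, doc_len: int) -> float:
        return tf / (doc_len ** 2)     # penalty grows faster than length

    # LNC2: concatenating d2 with itself k times (all counts and the
    # length scale by k) must not lower the score.
    def satisfies_lnc2(f, tf: int = 5, doc_len: int = 100, k: int = 3) -> bool:
        return f(k * tf, k * doc_len) >= f(tf, doc_len)

    print(satisfies_lnc2(scale_invariant))  # True  (equality: no extra penalty)
    print(satisfies_lnc2(over_penalizing))  # False (the long copy is over-penalized)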

Formal Definitions of Heuristic Retrieval Constraints (TFCs, TDC, LNCs, TF-LNC)
TF-Length Constraint (TF-LNC): regularize the interaction of TF and document length.
Let q = {w}. If c(w,d1) > c(w,d2) and |d1| = |d2| + c(w,d1) − c(w,d2), then f(d1,q) > f(d2,q).
That is, if d1 is generated by adding more occurrences of the query term to d2, the score of d1 should be higher than that of d2.
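To make the interaction concrete, this sketch (a pivoted-normalization-style scorer with illustrative parameter values) builds d1 from d2 by adding query-term occurrences and confirms the score rises:

    import math

    # Pivoted-style scorer for a one-term query; s = 0.2 and avdl = 100
    # are illustrative values, not the paper's settings.
    def pivoted_like(tf: int, doc_len: int, s: float = 0.2, avdl: float = 100.0) -> float:
        if tf == 0:
            return 0.0
        norm = (1.0 - s) + s * doc_len / avdl
        return (1.0 + math.log(1.0 + math.log(tf))) / norm

    tf2, len2 = 3, 100
    extra = 4                              # add 4 more occurrences of the query term
    tf1, len1 = tf2 + extra, len2 + extra  # so |d1| = |d2| + (c(w,d1) - c(w,d2))

    print(pivoted_like(tf1, len1) > pivoted_like(tf2, len2))  # True for small s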

Analysis of Three Representative Retrieval Formulas
Different models, but similar heuristics:
Pivoted Normalization Method
Okapi Method
Dirichlet Prior Method
Are they performing well because they implement similar retrieval heuristics?

Okapi Method
Retrieval function: a highly effective retrieval formula that represents the classical probabilistic retrieval model.
Parameters: k1 (between 1.0 and 2.0), b (usually 0.75), k3 (between 0 and 1000).
When df(w) > N/2, the IDF part becomes negative, so Okapi violates many constraints (e.g., the TFCs).
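The formula itself was an image on the slide and did not survive extraction; as given in the paper, the Okapi retrieval function is:

    f(d,q) = \sum_{w \in q \cap d} \ln\frac{N - df(w) + 0.5}{df(w) + 0.5}
             \cdot \frac{(k_1 + 1)\, c(w,d)}{k_1\left((1-b) + b\,\frac{|d|}{avdl}\right) + c(w,d)}
             \cdot \frac{(k_3 + 1)\, c(w,q)}{k_3 + c(w,q)}

The first factor is the IDF part, which turns negative once df(w) > N/2.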

Okapi Method (modified)
Modified Okapi: replace the original IDF in Okapi with the regular IDF from the pivoted normalization formula, ln((N+1)/df(w)).
This solves the problem of negative IDF, and the modified Okapi satisfies all the constraints except TDC.
It is therefore expected to help verbose queries: the modified Okapi should perform better than the original Okapi for verbose queries.
Moreover, the constraint conditions do not provide any bound for the parameter b, so the performance of Okapi can be expected to be less sensitive to the length normalization parameter than the pivoted normalization method.
(Slide figure comparing Modified Okapi, Original Okapi, and Pivoted.)
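A quick numeric illustration (collection statistics chosen for this example, not from the paper's experiments) of why the original IDF goes negative while the replacement does not:

    import math

    N = 1000     # number of documents in the collection
    df = 600     # document frequency of w, here > N/2

    okapi_idf = math.log((N - df + 0.5) / (df + 0.5))  # original Okapi IDF part
    regular_idf = math.log((N + 1) / df)               # pivoted-style replacement

    print(okapi_idf)    # ~ -0.41: negative, so matching w can lower the score
    print(regular_idf)  # ~  0.51: positive for any df <= N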

Pivoted Normalization Method
Retrieval function: one of the best performing vector space retrieval formulas.
Constraint analysis:
TFC: yes
TDC: when c(w1,d2) ≤ c(w2,d1)
LNC1: yes
LNC2: when s ≤ (tf1 − tf2) / ((k·|d2|/avdl − 1)·tf2 − (|d2|/avdl − 1)·tf1), where tfi denotes the normalized TF of w in di
TF-LNC: satisfied only when s is below a certain upper bound.
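The retrieval function itself (a formula image on the slide) is, following the paper:

    f(d,q) = \sum_{w \in q \cap d} \frac{1 + \ln\left(1 + \ln c(w,d)\right)}{(1-s) + s\,\frac{|d|}{avdl}}
             \cdot c(w,q) \cdot \ln\frac{N+1}{df(w)}

Here s is the length normalization (slope) parameter, which the constraint analysis bounds from above.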

Dirichlet Prior Method
Retrieval function constraint analysis:
TFC: yes
TDC: when μ > (c(w1,d1) − c(w2,d2)) / p(w2|C) = avdl × (c(w1,d1) − c(w2,d2)), a lower bound on μ
LNC1: yes
LNC2: when c(w,d2) ≥ |d2|·p(w|C)
TF-LNC: yes
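The Dirichlet prior retrieval function (the slide's formula image is missing) is, following the paper:

    f(d,q) = \sum_{w \in q \cap d} c(w,q)\,\ln\left(1 + \frac{c(w,d)}{\mu\, p(w|C)}\right)
             + |q|\,\ln\frac{\mu}{|d| + \mu}

where p(w|C) is the collection language model and μ is the Dirichlet smoothing parameter, which the TDC condition bounds from below.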

Experiments - Setup
Document sets: AP (news articles), DOE (technical reports), FR (government documents), ADF (combination of AP, DOE, and FR), Web (web data used in TREC8), Trec7 (ad hoc data used in TREC7), Trec8 (ad hoc data used in TREC8).
Query combinations: Short-keyword (SK, keyword title), Short-verbose (SV, one-sentence description), Long-keyword (LK, keyword list), Long-verbose (LV, multiple sentences).
As is well known, retrieval performance can vary significantly from one test collection to another, so several quite different and representative test collections were constructed from the existing TREC test collections.
Preprocessing: only stemming with Porter's stemmer; no stop words were removed.

Experiments - Parameter Sensitivity
The pivoted normalization method is sensitive to s; performance is good for s ≤ 0.4.
Okapi is more stable with respect to changes in b.
The Dirichlet prior method is sensitive to µ.

Experiments - Performance Comparison
For any query type, the performance of the Dirichlet prior method is comparable to that of the pivoted normalization method.
For keyword queries, the performance of Okapi is comparable to the other two retrieval formulas; for verbose queries, however, Okapi may perform worse than the others, due to the possibly negative IDF part in its formula.
Satisfying more constraints appears to be correlated with better performance.
The modified Okapi performs better than the original Okapi for verbose queries.

Conclusion
Defined six basic constraints that any reasonable retrieval function should satisfy.
When a constraint is not satisfied, it often indicates non-optimality of the method.
For the Okapi formula, we successfully predicted its non-optimality on verbose queries.

Future Work
Repeat all the experiments with stop words removed using a standard list.
Explore additional necessary heuristics for a reasonable retrieval formula.
Apply these constraints to other retrieval models and different smoothing methods.
Find retrieval methods that satisfy all the constraints.

Thank you!