A Formal Study of Information Retrieval Heuristics Hui Fang, Tao Tao, and ChengXiang Zhai University of Illinois at Urbana-Champaign SIGIR 2004 (Best paper award)

Presentation transcript:

A Formal Study of Information Retrieval Heuristics Hui Fang, Tao Tao, and ChengXiang Zhai University of Illinois at Urbana-Champaign SIGIR 2004 (Best paper award)

INTRODUCTION Empirical studies show that good retrieval performance is closely related to the use of various retrieval heuristics, especially TF-IDF weighting and document length normalization. We formally define a set of basic desirable constraints that any reasonable retrieval formula should satisfy.

HEURISTIC RETRIEVAL CONSTRAINTS There are some “basic requirements” that all reasonable retrieval formulas should follow. Example: if a retrieval formula does not penalize common words, then it violates the “IDF requirement” and thus can be regarded as “unreasonable.” Note that these constraints are necessary, but not necessarily sufficient, and should not be regarded as the only constraints that we want a retrieval function to satisfy.

HEURISTIC RETRIEVAL CONSTRAINTS We focus on these six basic constraints in this paper because they capture the major well-known IR heuristics, particularly TF-IDF weighting and length normalization. Suppose Ci represents the set of all retrieval functions satisfying the i-th constraint; then we can show that ∀ i, j with i ≠ j, ∃ e ∈ Ci such that e ∉ Cj, i.e., Ci − Cj ≠ ∅ (no constraint is implied by another).

Term Frequency Constraints (TFCs) TFC1: Let q = {w} be a query with only one term w. Assume |d1| = |d2|. If c(w, d1) > c(w, d2), then f(d1, q) > f(d2, q). TFC2: Let q = {w} be a query with only one term w. Assume |d1| = |d2| = |d3| and c(w, d1) > 0. If c(w, d2) - c(w, d1) = 1 and c(w, d3) - c(w, d2) = 1, then f(d2, q) - f(d1, q) > f(d3, q) - f(d2, q).
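The TFCs are straightforward to probe numerically. Below is a minimal sketch, assuming a hypothetical single-term weighting score(tf) (a logarithmic TF weight chosen purely for illustration, with document length held fixed as both constraints require):

```python
import math

def score(tf: int) -> float:
    # Hypothetical stand-in for f(d, q) on a one-term query, |d| held fixed.
    return math.log(1 + tf)

def tfc1_holds(tf1: int, tf2: int) -> bool:
    # TFC1: more occurrences of the query term => strictly higher score.
    return score(tf1) > score(tf2)

def tfc2_holds(tf: int) -> bool:
    # TFC2: the gain from each additional occurrence strictly decreases.
    return score(tf + 1) - score(tf) > score(tf + 2) - score(tf + 1)

assert all(tfc1_holds(a, b) for a in range(2, 20) for b in range(1, a))
assert all(tfc2_holds(tf) for tf in range(1, 20))
```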

Term Frequency Constraints (TFCs) The TFC2 constraint also implies another desirable property: Let q be a query and w1, w2 ∈ q be two query terms. Assume |d1| = |d2| and idf(w1) = idf(w2). If c(w1, d1) = c(w1, d2) + c(w2, d2), c(w2, d1) = 0, c(w1, d2) ≠ 0, and c(w2, d2) ≠ 0, then f(d1, q) < f(d2, q).

Term Discrimination Constraint (TDC) TDC: Let q be a query and w1, w2 ∈ q be two query terms. Assume |d1| = |d2| and c(w1, d1) + c(w2, d1) = c(w1, d2) + c(w2, d2). If idf(w1) ≥ idf(w2) and c(w1, d1) ≥ c(w1, d2), then f(d1, q) ≥ f(d2, q).

Term Discrimination Constraint (TDC) Simply weighting each term with an IDF factor does not ensure that this constraint is satisfied. Here, IDF can be any reasonable measure of term discrimination value (usually based on term popularity in a collection).

Length Normalization Constraints (LNCs) LNC1: Let q be a query and d1, d2 be two documents. If for some word w’ ∉ q, c(w’, d2) = c(w’, d1) + 1, but for any query term w, c(w, d2) = c(w, d1), then f(d1, q) ≥ f(d2, q). LNC2: Let q be a query. ∀ k > 1, if d1 and d2 are two documents such that |d1| = k · |d2| and for all terms w, c(w, d1) = k · c(w, d2), then f(d1, q) ≥ f(d2, q).
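LNC2 is the constraint that naive length normalization tends to break. A minimal sketch under two assumed toy scoring functions of (term count, document length): a log TF weight with no normalization satisfies LNC2, while dividing by the raw document length over-penalizes the k-times self-concatenation and violates it.

```python
import math

def satisfies_lnc2(f, tf: int, dl: int, k: int) -> bool:
    # LNC2: scaling all counts and the length of a document by k > 1
    # (concatenating the document with itself) must not lower its score.
    return f(k * tf, k * dl) >= f(tf, dl)

log_tf = lambda tf, dl: math.log(1 + tf)          # no length normalization
over_norm = lambda tf, dl: math.log(1 + tf) / dl  # naive per-length normalization

print(satisfies_lnc2(log_tf, 2, 100, 10))     # True
print(satisfies_lnc2(over_norm, 2, 100, 10))  # False: long copy over-penalized
```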

TF-LENGTH Constraint (TF-LNC) TF-LNC: Let q = {w} be a query with only one term w. If c(w, d1) > c(w, d2) and |d1| = |d2| + c(w, d1) - c(w, d2), then f(d1, q) > f(d2, q).

TF-LENGTH Constraint (TF-LNC) Based on TF-LNC and LNC1, we can derive the following constraint: Let q = {w} be a query with only one term w. If d3 is a document such that c(w, d3) > c(w, d2) and |d3| < |d2| + c(w, d3) − c(w, d2), then f(d3, q) > f(d2, q).

TF-LENGTH Constraint (TF-LNC) The above constraint can be derived in the following way. Assume we have a document d1 such that |d1| = |d2| + c(w, d1) − c(w, d2) and c(w, d3) = c(w, d1). Then the only difference between d1 and d3 is that d1 has more occurrences of non-query terms, so |d3| < |d1|. According to LNC1, f(d3, q) ≥ f(d1, q). Since f(d1, q) > f(d2, q) follows from TF-LNC, it is clear that f(d3, q) > f(d2, q).

Summary of intuitions for each formalized constraint

Pivoted Normalization Method The pivoted normalization retrieval formula is one of the best-performing vector space retrieval formulas.
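For reference, a minimal sketch of the pivoted normalization scoring function analyzed in the paper; df (document frequencies), N (collection size), and avdl (average document length) are assumed collection statistics supplied by the caller:

```python
import math
from collections import Counter

def pivoted_score(query, doc, df, N, avdl, s=0.2):
    # score(d, q) = sum over matched terms w of
    #   (1 + ln(1 + ln c(w,d))) / ((1 - s) + s * |d| / avdl)
    #   * c(w,q) * ln((N + 1) / df(w))
    d, q = Counter(doc), Counter(query)
    norm = (1 - s) + s * len(doc) / avdl
    total = 0.0
    for w, qtf in q.items():
        tf = d.get(w, 0)
        if tf > 0 and df.get(w, 0) > 0:
            total += (1 + math.log(1 + math.log(tf))) / norm \
                     * qtf * math.log((N + 1) / df[w])
    return total
```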

Pivoted Normalization Method TFC: yes. TDC: conditional. LNC1: yes. LNC2: conditional. TF-LNC: conditional.

Pivoted Normalization Method In order to avoid over-penalizing a long document, a reasonable value for s should generally be small: it should be below 0.4 even in the case of a small k, and for a larger k the bound would be even tighter.
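Using the pivoted_score sketch above, the effect of s on LNC2 is easy to see in one concrete (toy) case: a single-term query, a document at the average length, and its k = 2 self-concatenation. The exact threshold depends on the term frequency, but s = 0.4 is already too large here:

```python
q = ["w"]
d = ["w", "w"] + ["x"] * 98                     # |d| = 100 = avdl
df, N, avdl = {"w": 10, "x": 50}, 1000, 100.0   # assumed toy statistics
for s in (0.05, 0.2, 0.4):
    ok = pivoted_score(q, d * 2, df, N, avdl, s) >= pivoted_score(q, d, df, N, avdl, s)
    print(f"s={s}: LNC2 {'holds' if ok else 'is violated'} here")
# s=0.05 and s=0.2 hold; s=0.4 is violated for this document.
```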

Okapi Method A highly effective retrieval formula that represents the classical probabilistic retrieval model.
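For reference, a minimal sketch of the Okapi (BM25) formula with conventional default parameters; df, N, and avdl are assumed collection statistics as before. The classical IDF factor ln((N − df + 0.5)/(df + 0.5)) turns negative once df(w) > N/2, which is what makes the constraint satisfaction below conditional:

```python
import math
from collections import Counter

def okapi_score(query, doc, df, N, avdl, k1=1.2, b=0.75, k3=1000.0):
    d, q = Counter(doc), Counter(query)
    K = k1 * ((1 - b) + b * len(doc) / avdl)   # length-normalized k1
    total = 0.0
    for w, qtf in q.items():
        tf = d.get(w, 0)
        if tf > 0 and df.get(w, 0) > 0:
            idf = math.log((N - df[w] + 0.5) / (df[w] + 0.5))  # can be < 0!
            total += idf * ((k1 + 1) * tf) / (K + tf) \
                         * ((k3 + 1) * qtf) / (k3 + qtf)
    return total
```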

Okapi Method TFC: conditional. TDC: conditional. LNC1: conditional. LNC2: conditional. TF-LNC: conditional.

Okapi Method (Modified) Replacing the IDF part in Okapi with the IDF part of the pivoted normalization formula gives: TFC: yes. TDC: conditional. LNC1: yes. LNC2: yes. TF-LNC: yes.
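In terms of the okapi_score sketch above, the modification is a one-line change to the IDF factor:

```python
# Modified Okapi: the pivoted-normalization IDF is always positive, so the
# violations caused by a negative IDF for very common words disappear.
idf = math.log((N + 1) / df[w])  # was: math.log((N - df[w] + 0.5) / (df[w] + 0.5))
```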

Okapi Method the conditions do not provide any bound for the parameter b. Therefore, the performance of Okapi can be expected to be less sensitive to the length normalization parameter than the pivoted normalization method. our analysis suggests that the performance of Okapi may be relatively worse for verbose queries than for keyword queries. Hypothesize that the performance of the modified okapi would perform better than the original okapi for verbose queries.

Dirichlet Prior Method The Dirichlet prior retrieval method is one of the best-performing language modeling approaches.
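For reference, a minimal sketch of Dirichlet prior scoring in its rank-equivalent sum form, where p_coll is the assumed collection language model p(w|C) and µ is the smoothing parameter:

```python
import math
from collections import Counter

def dirichlet_score(query, doc, p_coll, mu=2000.0):
    # score(d, q) = sum_w c(w,q) * ln(1 + c(w,d) / (mu * p(w|C)))
    #               + |q| * ln(mu / (mu + |d|))
    d, q = Counter(doc), Counter(query)
    total = len(query) * math.log(mu / (mu + len(doc)))
    for w, qtf in q.items():
        tf = d.get(w, 0)
        if tf > 0 and p_coll.get(w, 0) > 0:
            total += qtf * math.log(1 + tf / (mu * p_coll[w]))
    return total
```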

Dirichlet Prior Method TFC: yes. TDC: conditional. LNC1: yes. LNC2: conditional. TF-LNC: yes.

Dirichlet Prior Method For discriminative words with a high term frequency in a document, µ needs to be sufficiently large (at least as large as the average document length) in order to balance TF and IDF appropriately. Thus µ has a lower bound, and a very small µ might cause poor retrieval performance.

EXPERIMENTS As is well known, retrieval performance can vary significantly from one test collection to another. We thus construct several quite different and representative test collections from existing TREC test collections.

Parameter Sensitivity The pivoted normalization method is sensitive to the value of the parameter s, whereas Okapi is more stable as the parameter b changes.

Parameter Sensitivity The Dirichlet prior method is sensitive to the value of parameter µ.

Performance Comparison For any query type, the performance of the Dirichlet prior method is comparable to that of the pivoted normalization method when the retrieval parameters are set to an optimal value. For keyword queries, the performance of Okapi is comparable to that of the other two retrieval formulas.

Performance Comparison For ver-bose queries, the performance of okapi may be worse than others.

CONCLUSIONS AND FUTURE WORK We formally define six basic constraints that any reasonable retrieval function should satisfy. For the Okapi formula, we successfully predict its non-optimality for verbose queries. All experiments can be repeated after removing stop words with a standard list.

CONCLUSIONS AND FUTUREWORK to explore additional necessary heuristics for a reasonable retrieval formula. Apply these constraints to many other retrieval models and different smoothing methods To find retrieval methods so that they would satisfy all the constraints.