Lower-Bounding Term Frequency Normalization
Yuanhua Lv and ChengXiang Zhai, University of Illinois at Urbana-Champaign
CIKM 2011 Best Student Paper Award
Speaker: Tom, Nov 8th, 2011

It is very difficult to improve retrieval models:
BM25 [Robertson et al. 1994] – 17 years
Pivoted length normalization (PIV) [Singhal et al. 1996] – 15 years
Query likelihood with Dirichlet prior (DIR) [Ponte & Croft 1998; Zhai & Lafferty 2001] – 10 years
PL2 [Amati & Rijsbergen 2002] – 9 years
All these models remain strong baselines today, after so many years!

1. Why does it seem to be so hard to beat these state-of-the-art retrieval models {BM25, PIV, DIR, PL2, …}?
2. Are they hitting the ceiling?

Key heuristic in all effective retrieval models: term frequency (TF) normalization by document length [Singhal et al. 96; Fang et al. 04]
BM25 combines a length-normalized TF component, $\frac{(k_1+1)\,c(t,D)}{k_1(1-b+b\frac{|D|}{avdl})+c(t,D)}$, with a term-discrimination (IDF) component for each matched query term t.
DIR (query likelihood with Dirichlet prior): $\log p(Q|D)=\sum_{t\in Q\cap D} c(t,Q)\log\big(1+\frac{c(t,D)}{\mu\,p(t|C)}\big)+|Q|\log\frac{\mu}{|D|+\mu}$, where document length enters through the Dirichlet smoothing.
PIV and PL2 implement similar retrieval heuristics, each combining term frequency, document length, and term discrimination.
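To make these components concrete, here is a minimal Python sketch (not the authors' code) of the per-term scores under the standard BM25 and Dirichlet-prior query-likelihood formulations shown above; the parameter defaults (k1 = 1.2, b = 0.75, µ = 2000) and the toy statistics are common settings and made-up numbers, not values from the paper.

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, df, n_docs, k1=1.2, b=0.75):
    """BM25 per-term score: length-normalized TF times an IDF-style factor."""
    tf_component = (k1 + 1) * tf / (k1 * (1 - b + b * doc_len / avg_doc_len) + tf)
    idf = math.log((n_docs + 1) / df)            # term discrimination
    return tf_component * idf

def dir_query_likelihood(query_tfs, doc_tfs, doc_len, p_collection, mu=2000):
    """Dirichlet-prior query likelihood: matched-term part plus length normalization."""
    matched = sum(q_tf * math.log(1 + doc_tfs.get(t, 0) / (mu * p_collection[t]))
                  for t, q_tf in query_tfs.items())
    return matched + sum(query_tfs.values()) * math.log(mu / (doc_len + mu))

# Toy example with made-up statistics.
print(bm25_term_score(tf=3, doc_len=800, avg_doc_len=500, df=120, n_docs=100000))
print(dir_query_likelihood({"debt": 1}, {"debt": 3}, doc_len=800,
                           p_collection={"debt": 1e-4}))
```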

However, the component of TF normalization by document length is NOT lower-bounded properly, in BM25 and DIR alike: when a document is very long, its score from matching a query term could be too small!

As a result, long documents could be overly penalized. Example: a very long document D2 matches the query term while a short document D1 does not, yet Score_PL2(D2) < Score_PL2(D1) and Score_DIR(D2) < Score_DIR(D1).
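As a quick numeric illustration of the DIR case (not from the paper; the document lengths and the collection probability below are made up), a very long document that contains the query term can score below a short document that does not contain it at all:

```python
import math

def dir_score_one_term_query(tf, doc_len, p_w_collection, mu=2000):
    """log p(q | D) with Dirichlet smoothing, for a single-term query q."""
    return math.log((tf + mu * p_w_collection) / (doc_len + mu))

p_w = 1e-4  # assumed collection language-model probability of the query term
d1 = dir_score_one_term_query(tf=0, doc_len=100,    p_w_collection=p_w)  # short, no match
d2 = dir_score_one_term_query(tf=1, doc_len=50_000, p_w_collection=p_w)  # very long, matches

print(round(d1, 2), round(d2, 2))   # ~ -9.26 vs ~ -10.68
assert d2 < d1   # the matching but very long document is ranked below the non-matching one
```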

Empirical evidence: long documents are indeed overly penalized.
Prob. of relevance/retrieval: the probability of a randomly selected relevant/retrieved document having a certain document length [Singhal et al. 96].
[Figure: probability of relevance vs. probability of retrieval as a function of document length.]

White-box testing: functionality analysis of retrieval models
Bug: TF normalization is not lower-bounded properly, and long documents are overly penalized.
Are these retrieval models sharing this similar bug because they all violate some necessary retrieval heuristics? Can we formally capture these necessary heuristics?

Two novel heuristics for regulating the interactions between TF and document length:
LB1: There should be a sufficiently large gap between the presence and absence of a query term – document length normalization should not cause a very long document with a non-zero TF to receive a score too close to, or even lower than, a short document with a zero TF.
LB2: A short document that only covers a very small subset of the query terms should not easily dominate over a very long document that contains many distinct query terms.

Lower-bounding constraint 1 (LB1): Occurrence > Non-Occurrence
Let Q = {w} and Q' = {w, q}. Document D1 contains w but not q; document D2 contains both w and q. If Score(Q, D1) = Score(Q, D2), then LB1 requires Score(Q', D1) < Score(Q', D2).

Lower-bounding constraint 2 (LB2): First Occurrence > Repeated Occurrence
Let Q = {q1, q2}. Documents D1 and D2 each contain q1 but not q2, and Score(Q, D1) = Score(Q, D2). Form D1' by adding one more occurrence of q1 to D1, and D2' by adding one occurrence of q2 to D2. Then LB2 requires Score(Q, D1') < Score(Q, D2').

BM25 satisfies LB1 but violates LB2
LB1 is satisfied unconditionally (for parameters k1 > 0 and 0 < b < 1).
LB2 only holds under a condition on the parameters and document length: long documents tend to violate LB2, and large b or k1 violates LB2 easily. A concrete violating instance is sketched below.
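The following sketch (my own construction, not the authors' code) builds one concrete LB2 instance and checks it against BM25's TF normalization component with the common defaults k1 = 1.2 and b = 0.75; all document statistics are made up. The two query terms are assumed to have equal IDF, so the IDF factor cancels out.

```python
K1, B, AVDL = 1.2, 0.75, 500.0   # assumed BM25 parameters and average document length

def tf_norm(tf, doc_len):
    """BM25's document-length-normalized TF component."""
    K = K1 * (1 - B + B * doc_len / AVDL)
    return (K1 + 1) * tf / (K + tf)

def equal_score_length(tf1, tf2, len2):
    """Length len1 such that tf_norm(tf1, len1) == tf_norm(tf2, len2)."""
    K2 = K1 * (1 - B + B * len2 / AVDL)
    return AVDL * ((tf1 / tf2) * K2 / K1 - (1 - B)) / B

# Q = {q1, q2}. D2 is very long and contains q1 twice; D1 contains q1 once,
# with its length chosen so that Score(Q, D1) == Score(Q, D2).
len2, tf2 = 50_000, 2
tf1 = 1
len1 = equal_score_length(tf1, tf2, len2)
assert abs(tf_norm(tf1, len1) - tf_norm(tf2, len2)) < 1e-9

# D1' repeats q1 once more; D2' adds the first occurrence of q2
# (a non-query word is swapped out, so document lengths do not change).
repeated = tf_norm(tf1 + 1, len1)                  # Score(Q, D1') / IDF
first    = tf_norm(tf2, len2) + tf_norm(1, len2)   # Score(Q, D2') / IDF

print(repeated, first)        # ~ 0.093 vs ~ 0.072
assert repeated > first       # LB2 demands the opposite, so BM25 violates LB2 here
```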

DIR satisfies LB2 but violates LB1
LB2 is satisfied unconditionally.
LB1 only holds under a condition on µ and the term's collection probability: long documents tend to violate LB1, and large µ or non-discriminative terms violate LB1 easily.

No retrieval model satisfies both constraints:
Model | LB1 | LB2 | Parameter and/or query restrictions
BM25  | Yes | No  | b and k1 should not be too large
PIV   | Yes | No  | s should not be too large
PL2   | No  | No  | c should not be too small
DIR   | No  | Yes | µ should not be too large; query terms should be discriminative
Can we "fix" this problem for all the models in a general way?

Solution: a general approach to lower-bounding TF normalization
The score of a document D from matching a query term t decomposes into a TF and document length normalization component multiplied by a term-discrimination component (e.g., IDF). This holds for BM25 and DIR, and PIV and PL2 also have their corresponding components.

Solution: a general approach to lower-bounding TF normalization (cont.)
Objective: an improved version of the TF normalization component that does not hurt other retrieval heuristics, but is properly lower-bounded.
A heuristic solution: whenever a query term is present in the document (c(t,D) > 0), add a positive constant δ to the TF normalization component; the constant l in the formulation can be absorbed into δ. The resulting component satisfies all retrieval heuristics that are satisfied by the original one.
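A minimal sketch of this general recipe (my own wrapper and naming, not the paper's notation): given any TF and length normalization component F, build a lower-bounded version that adds δ only when the term actually occurs, so absent terms still contribute nothing.

```python
def lower_bounded(F, delta):
    """Wrap a TF-length normalization component F(tf, doc_len) with a lower bound."""
    def F_plus(tf, doc_len):
        if tf == 0:
            return 0.0                   # absence of the term is left unchanged
        return F(tf, doc_len) + delta    # every matched term contributes at least delta
    return F_plus
```

Because δ is a constant added only for matched terms, properties that F already has (such as TF monotonicity and length normalization) carry over to the wrapped component.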

Example: BM25+, a lower-bounded version of BM25
BM25: $\sum_{t\in Q\cap D}\frac{(k_3+1)\,c(t,Q)}{k_3+c(t,Q)}\cdot\frac{(k_1+1)\,c(t,D)}{k_1(1-b+b\frac{|D|}{avdl})+c(t,D)}\cdot\log\frac{N+1}{df(t)}$
BM25+: $\sum_{t\in Q\cap D}\frac{(k_3+1)\,c(t,Q)}{k_3+c(t,Q)}\cdot\Big(\frac{(k_1+1)\,c(t,D)}{k_1(1-b+b\frac{|D|}{avdl})+c(t,D)}+\delta\Big)\cdot\log\frac{N+1}{df(t)}$
BM25+ incurs almost no additional computational cost. Similarly, we can also improve PIV, DIR, and PL2, leading to PIV+, DIR+, and PL2+, respectively.
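Below is a sketch of the BM25+ per-term score implied by the formula above (again not the authors' code; k1, b, and the toy statistics are assumptions), together with a comparison showing that a matched term's contribution no longer vanishes for an extremely long document.

```python
import math

def bm25_term(tf, doc_len, avdl, df, n_docs, k1=1.2, b=0.75, delta=0.0):
    """BM25 per-term score; delta > 0 turns it into BM25+."""
    if tf == 0:
        return 0.0
    tf_component = (k1 + 1) * tf / (k1 * (1 - b + b * doc_len / avdl) + tf)
    idf = math.log((n_docs + 1) / df)
    return (tf_component + delta) * idf

stats = dict(avdl=500, df=1000, n_docs=1_000_000)
very_long = 1_000_000   # made-up, extremely long document containing the term once

print(bm25_term(1, very_long, **stats))              # BM25:  ~ 0.008, nearly zero
print(bm25_term(1, very_long, **stats, delta=1.0))   # BM25+: at least delta * idf, ~ 6.9
```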

BM25+ can satisfy both LB1 and LB2
Similarly to BM25, BM25+ satisfies LB1. LB2 can also be satisfied unconditionally if δ is set large enough; experiments show later that setting δ = 1.0 works very well.

The proposed approach can fix or alleviate the problem of all these retrieval models.
Current retrieval models:
BM25 | LB1: Yes | LB2: No
PIV  | LB1: Yes | LB2: No
PL2  | LB1: No  | LB2: No
DIR  | LB1: No  | LB2: Yes
Improved retrieval models:
BM25+ | LB1: Yes | LB2: Yes
PIV+  | LB1: Yes | LB2: Yes
PL2+  | LB1: Yes | LB2: Yes
DIR+  | LB1: Alleviated | LB2: Yes

Experiment Setup
Standard TREC document collections – Web: WT2G, WT10G, and Terabyte; News: Robust04
Standard TREC query sets – Short (the title field), e.g., “Iraq foreign debt reduction”; Verbose (the description field), e.g., “Identify any efforts, proposed or undertaken, by world governments to seek reduction of Iraq's foreign debt”
2-fold cross validation for parameter tuning
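A minimal sketch of the tuning protocol described above, assuming a hypothetical evaluate_map(param, query_ids) helper that runs retrieval with a given parameter value (e.g., δ, b, or µ) and returns MAP on the given queries; nothing here is from the paper's code.

```python
def two_fold_cv(query_ids, param_grid, evaluate_map):
    """Tune on one half of the queries, test on the other half, then swap."""
    half = len(query_ids) // 2
    folds = [query_ids[:half], query_ids[half:]]
    test_scores = []
    for train, test in [(folds[0], folds[1]), (folds[1], folds[0])]:
        best = max(param_grid, key=lambda p: evaluate_map(p, train))  # tune on the train fold
        test_scores.append(evaluate_map(best, test))                  # report on the held-out fold
    return sum(test_scores) / len(test_scores)
```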

BM25+ improves over BM25 significantly (superscripts 1/2/3/4 in the results tables indicate significance at the 0.05/0.02/0.01/0.001 level).
δ = 1.0 works well, confirming the constraint analysis.
BM25+ performs better on Web data than on News data, and better on verbose queries than on short queries. Why?

BM25 overly penalizes long documents more seriously for verbose queries.
The “condition” under which BM25 violates LB2 is monotonically decreasing with b and k1, and the optimal settings of b and k1 are larger for verbose queries.

The improvement indeed comes from alleviating the problem of overly penalizing long documents.
[Figure: probability of relevance/retrieval vs. document length for BM25 (short), BM25 (verbose), BM25+ (short), and BM25+ (verbose).]

DIR+ improves over DIR significantly (superscripts 1/2/3/4 indicate significance at the 0.05/0.02/0.01/0.001 level).
Fixing δ = 0.05 works very well.
DIR+ performs better on verbose than on short queries. Why? DIR can only satisfy LB1 if µ is not too large, and the optimal µ settings are larger for verbose queries.

PL2+ improves over PL2 significantly (superscripts 1/2/3/4 indicate significance at the 0.05/0.02/0.01/0.001 level).
Fixing δ = 0.8 works very well.
PL2+ performs better on verbose than on short queries. For the optimal settings of c: the smaller, the more dangerous.

PIV+ works as we expected: PIV+ does not consistently outperform PIV (superscript 1 indicates significance at the 0.05 level).
PIV can satisfy LB2 if s is not too large, and that is fine here, as the optimal settings of s are very small.

1. Why does it seem to be so hard to beat these state-of-the-art retrieval models {BM25, PIV, DIR, PL2, …}? Because we weren't able to figure out their deficiency analytically.
2. Are they hitting the ceiling? No, they haven't hit the ceiling yet!

Conclusions
– Reveal a common deficiency of current retrieval models
– Propose two novel formal constraints
– Show that current retrieval models do not satisfy both constraints, and that retrieval performance tends to be poor if either constraint is violated
– Develop a general and efficient solution, which has been shown analytically to fix/alleviate the problem of current retrieval models
– Demonstrate the effectiveness of the proposed algorithms across different collections for different types of queries

Our models {BM25+, DIR+, PL2+} can potentially replace the current state-of-the-art retrieval models {BM25, DIR, PL2}.
(BM25 vs. BM25+: as shown earlier, BM25+ simply adds the lower bound δ to BM25's TF normalization component.)

Future work
This work has demonstrated the power of axiomatic analysis for fixing deficiencies of retrieval models.
– Are there any other deficiencies of current retrieval models? If so, can we solve them with axiomatic analysis?
– Can we go beyond bag of words with constraint analysis?
– Can we find a comprehensive set of constraints that are sufficient for deriving a unique (optimal) retrieval function?

Thanks!

Sensitivity of δ in BM25+