1 A Formal Study of Information Retrieval Heuristics Hui Fang, Tao Tao and ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign.


1 A Formal Study of Information Retrieval Heuristics Hui Fang, Tao Tao and ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign USA

2 Empirical Observations in IR
– Retrieval heuristics are necessary for good retrieval performance, e.g. TF-IDF weighting and document length normalization.
– Similar formulas may perform very differently.
– Performance is sensitive to parameter settings.

3 Empirical Observations in IR (Cont.)
[Formulas of the Pivoted Normalization, Dirichlet Prior, and Okapi methods, annotated with their shared components: term frequency (the 1+ln(c(w,d)) transformation and alternative TF transformations), inverse document frequency, document length normalization, and parameter sensitivity.]
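As a reference point for the constraints below, here is a rough Python sketch of the per-term weights of the three methods, in commonly cited forms. The parameter names (s, mu, k1, b, k3) and their defaults are conventions, not values taken from the slides.

```python
import math

def pivoted(c_wd, c_wq, doclen, avdl, N, df, s=0.2):
    """Pivoted normalization term weight: sublinear TF transform,
    pivoted length normalization, and IDF."""
    tf = 1 + math.log(1 + math.log(c_wd)) if c_wd > 0 else 0.0
    norm = 1 - s + s * doclen / avdl
    return (tf / norm) * c_wq * math.log((N + 1) / df)

def dirichlet(c_wd, c_wq, p_wC, mu=2000):
    """Per-term part of the Dirichlet-prior language model score;
    a length term |q| * ln(mu / (mu + |d|)) is added once per document."""
    return c_wq * math.log(1 + c_wd / (mu * p_wC))

def okapi(c_wd, c_wq, doclen, avdl, N, df, k1=1.2, b=0.75, k3=1000):
    """Original Okapi term weight; note the IDF factor goes negative
    when df(w) > N/2, the issue raised on slide 16."""
    idf = math.log((N - df + 0.5) / (df + 0.5))
    tf = ((k1 + 1) * c_wd) / (k1 * ((1 - b) + b * doclen / avdl) + c_wd)
    qtf = ((k3 + 1) * c_wq) / (k3 + c_wq)
    return idf * tf * qtf
```

Each constraint on the following slides can be read as a property that these functions must satisfy for any admissible inputs.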

4 Research Questions How can we formally characterize these necessary retrieval heuristics? Can we predict the empirical behavior of a method without experimentation?

5 Outline
– Formalized heuristic retrieval constraints
– Analytical evaluation of current retrieval formulas
– Benefits of constraint analysis: better understanding of parameter optimization; explanation of performance differences; improvement of existing retrieval formulas

6 Term Frequency Constraints (TFC1)
TF weighting heuristic I: Give a higher score to a document with more occurrences of a query term.
TFC1: Let q be a query with only one term w. If |d1| = |d2| and c(w, d1) > c(w, d2), then f(d1, q) > f(d2, q).
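TFC1 can be spot-checked numerically for any candidate TF weighting: hold document length fixed and require the score to grow strictly with the term count. A small sketch (the helper name is hypothetical):

```python
import math

def satisfies_tfc1(weight, doclen=100, max_tf=20):
    """Empirically spot-check TFC1 for a single-term query: with |d1| = |d2|,
    more occurrences of the query term must yield a strictly higher score."""
    vals = [weight(c, doclen) for c in range(1, max_tf + 1)]
    return all(a < b for a, b in zip(vals, vals[1:]))

# A sublinear TF transform satisfies TFC1; a hard cap on TF does not.
assert satisfies_tfc1(lambda c, n: math.log(1 + c))
assert not satisfies_tfc1(lambda c, n: float(min(c, 5)))
```

A finite scan like this cannot prove the constraint, but it quickly exposes a weighting that violates it.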

7 Term Frequency Constraints (TFC2)
TF weighting heuristic II: Favor a document with more distinct query terms.
TFC2: Let q be a query and w1, w2 be two query terms. Assume |d1| = |d2|. If c(w1, d2) = c(w1, d1) + c(w2, d1), c(w2, d2) = 0, and c(w1, d1) > 0, c(w2, d1) > 0, then f(d1, q) > f(d2, q).
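A sublinear TF transform is exactly what makes TFC2 hold: spreading the same total mass over two distinct terms beats concentrating it in one. A sketch assuming the 1+ln TF transformation from slide 3 and equal document lengths (helper names are hypothetical):

```python
import math

def term_weight(c):
    # The 1+ln(c(w,d)) TF transformation from slide 3.
    return 1 + math.log(c) if c > 0 else 0.0

def score(counts):
    # Sum of per-term TF weights; IDF is omitted since both terms
    # are query terms and lengths are assumed equal.
    return sum(term_weight(c) for c in counts)

# TFC2 setup: d1 covers both query terms, d2 packs the same total
# occurrence mass into w1 alone.
d1 = (3, 2)   # c(w1,d1)=3, c(w2,d1)=2
d2 = (5, 0)   # c(w1,d2)=c(w1,d1)+c(w2,d1), c(w2,d2)=0
assert score(d1) > score(d2)
```

With a linear TF (term_weight(c) = c) the two scores would tie, which is why raw TF fails this heuristic.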

8 Term Discrimination Constraint (TDC)
IDF weighting heuristic: Penalize words that are popular in the collection; give higher weights to discriminative terms.
Example — Query: "SVM Tutorial". Assume IDF(SVM) > IDF(Tutorial). [Doc 1 and Doc 2 contain different mixes of the two terms; the document matching the more discriminative term SVM should score at least as high.]

9 Term Discrimination Constraint (Cont.)
TDC: Let q be a query and w1, w2 be two query terms. Assume |d1| = |d2|, c(w1, d1) + c(w2, d1) = c(w1, d2) + c(w2, d2), and c(w, d1) = c(w, d2) for all other words w. If IDF(w1) ≥ IDF(w2) and c(w1, d1) ≥ c(w1, d2), then f(d1, q) ≥ f(d2, q).

10 Length Normalization Constraints (LNCs)
Document length normalization heuristic: Penalize long documents (LNC1); avoid over-penalizing long documents (LNC2).
LNC1: Let q be a query. If c(w', d2) = c(w', d1) + 1 for some word w' not in q, but c(w, d2) = c(w, d1) for all other words w, then f(d1, q) ≥ f(d2, q).
LNC2: Let q be a query. For all k > 1, if |d1| = k · |d2| and c(w, d1) = k · c(w, d2) for all words w, then f(d1, q) ≥ f(d2, q).
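Both length constraints can be spot-checked numerically against the pivoted-normalization weight. This is a sketch with made-up doclen/avdl values and a small s; pivoted_weight is a hypothetical helper covering only the TF and length-normalization parts, which is all that matters when comparing d1 and d2 on the same query term.

```python
import math

def pivoted_weight(c_wd, doclen, avdl=100.0, s=0.05):
    # TF/length part of pivoted normalization for a one-term query;
    # IDF cancels when comparing two documents on the same term.
    tf = 1 + math.log(1 + math.log(c_wd))
    return tf / (1 - s + s * doclen / avdl)

# LNC1: appending one non-query word (same term counts, length + 1)
# must not raise the score.
assert pivoted_weight(3, 100) >= pivoted_weight(3, 101)

# LNC2: concatenating a document with itself (counts and length both
# doubled) must not lower the score -- true here for this small s.
assert pivoted_weight(2 * 3, 2 * 100) >= pivoted_weight(3, 100)
```

The LNC2 check is the sensitive one: it only holds for sufficiently small s, which is the source of the parameter bound on slide 15.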

11 TF-LENGTH Constraint (TF-LNC)
TF-LN heuristic: Regularize the interaction of TF and document length.
TF-LNC: Let q be a query with only one term w. If c(w, d1) > c(w, d2) and |d1| = |d2| + c(w, d1) − c(w, d2), then f(d1, q) > f(d2, q).
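Intuitively, TF-LNC says that adding more copies of the query term to a document (with the length growing by exactly that amount) must raise the score: the TF gain has to outweigh the length penalty. A sketch against the pivoted-normalization weight, with hypothetical helper names and made-up doclen/avdl values:

```python
import math

def pivoted_weight(c_wd, doclen, avdl=100.0, s=0.2):
    # TF/length part of pivoted normalization for a one-term query.
    tf = 1 + math.log(1 + math.log(c_wd))
    return tf / (1 - s + s * doclen / avdl)

def tf_lnc_holds(c2=2, doclen=100, extra=5):
    # d1 is d2 plus `extra` more copies of w, so c(w,d1) = c2 + extra
    # and |d1| = |d2| + extra -- exactly the TF-LNC setup.
    return pivoted_weight(c2 + extra, doclen + extra) > pivoted_weight(c2, doclen)

assert tf_lnc_holds()
```

As with the earlier checks, a passing instance does not prove the constraint in general; it only confirms it on one configuration.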

12 Analytical Evaluation

Retrieval Formula | TFCs        | TDC         | LNC1        | LNC2        | TF-LNC
Pivoted Norm.     | Yes         | Conditional | Yes         | Conditional | Conditional
Dirichlet Prior   | Yes         | Conditional | Yes         | Conditional | Yes
Okapi (original)  | Conditional | Conditional | Conditional | Conditional | Conditional
Okapi (modified)  | Yes         | Conditional | Yes         | Yes         | Yes

13 Term Discrimination Constraint (TDC)
IDF weighting heuristic: Penalize words that are popular in the collection; give higher weights to discriminative terms.
Example (revisited) — Query: "SVM Tutorial", assuming IDF(SVM) > IDF(Tutorial). [Doc 2 matches the common term Tutorial more often than Doc 1; TDC requires the document matching the discriminative term SVM to score at least as high.]

14 Benefits of Constraint Analysis
– Provide an approximate bound for the parameters: a constraint may be satisfied only if the parameter is within a particular interval.
– Compare different formulas analytically without experimentation: when a formula does not satisfy a constraint, it often indicates non-optimality of the formula.
– Suggest how to improve current retrieval models: violation of constraints may pinpoint where a formula needs to be improved.

15 Benefits 1: Bounding Parameters
Pivoted Normalization Method: LNC2 implies an upper bound on s.
[Plot: average precision vs. s; the optimal s for average precision falls within the interval allowed by LNC2.]
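The exact bound LNC2 places on s depends on the TF transform and the collection statistics, so the value on the slide is specific to the paper's setup. As a rough illustration of the idea only (hypothetical helper names, made-up doclen/avdl values, not the paper's collections), one can scan s numerically for the largest value at which LNC2 still holds on a test case:

```python
import math

def pivoted_weight(c_wd, doclen, avdl, s):
    # TF/length part of pivoted normalization for a one-term query.
    tf = 1 + math.log(1 + math.log(c_wd))
    return tf / (1 - s + s * doclen / avdl)

def lnc2_holds(s, avdl=100.0, doclen=100, c=2, kmax=10):
    # LNC2 test case: d1 is d2 concatenated with itself k times, so
    # |d1| = k|d2| and c(w,d1) = k*c(w,d2); the score must not drop.
    base = pivoted_weight(c, doclen, avdl, s)
    return all(pivoted_weight(k * c, k * doclen, avdl, s) >= base
               for k in range(2, kmax + 1))

# Scan s in (0, 1) for the largest value where LNC2 still holds on
# this test case -- an approximate upper bound on s.
bound = max(s / 100 for s in range(1, 100) if lnc2_holds(s / 100))
assert 0 < bound < 1
```

The point of the slide is that the empirically optimal s (for average precision) lands inside the interval this kind of analysis predicts.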

16 Benefits 2: Analytical Comparison
Okapi Method: the IDF factor becomes negative when df(w) is large, which violates many constraints.
[Plots: average precision vs. s or b for the Pivoted and Okapi methods, on a keyword query and a verbose query.]

17 Benefits 3: Improving Retrieval Formulas
Modified Okapi Method: make Okapi satisfy more constraints; expected to help verbose queries.
[Plots: average precision vs. s or b comparing Pivoted, Okapi, and Modified Okapi on a keyword query and a verbose query.]

18 Conclusions and Future Work
Conclusions:
– Retrieval heuristics can be captured through formally defined constraints.
– It is possible to evaluate a retrieval formula analytically through constraint analysis.
Future Work:
– Explore additional necessary heuristics.
– Apply these constraints to many other retrieval methods.
– Develop new retrieval formulas through constraint analysis.

19 The End Thank you!