1. A Formal Study of Information Retrieval Heuristics
Hui Fang, Tao Tao and ChengXiang Zhai
Department of Computer Science, University of Illinois, Urbana-Champaign, USA

2. Empirical Observations in IR
Retrieval heuristics are necessary for good retrieval performance.
– E.g., TF-IDF weighting, document length normalization
Similar formulas may perform differently.
Performance is sensitive to parameter settings.

3. Empirical Observations in IR (Cont.)
Retrieval formulas compared: Pivoted Normalization Method, Dirichlet Prior Method, Okapi Method [formulas not preserved]
Components highlighted in the formulas: Term Frequency, Inverse Document Frequency, Document Length Normalization
Alternative TF transformation: 1+ln(c(w,d))
Parameter sensitivity
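The three methods named on this slide can be sketched in code to make the TF, IDF, and length-normalization components concrete. The formulas below are the commonly cited forms of these methods, which is an assumption here since the slide's own formula images are not preserved. The function names, argument names, and default parameter values (`s`, `mu`, `k1`, `b`, `k3`) are illustrative choices, not taken from the slides.

```python
import math
from collections import Counter


def pivoted(query, doc, df, n_docs, avdl, s=0.2):
    """Pivoted normalization score in its commonly cited form (assumed)."""
    d, dl = Counter(doc), len(doc)
    score = 0.0
    for w, qtf in Counter(query).items():
        tf = d[w]
        if tf > 0:
            tf_part = 1 + math.log(1 + math.log(tf))  # double-log TF transformation
            norm = (1 - s) + s * dl / avdl            # pivoted length normalization
            score += tf_part / norm * qtf * math.log((n_docs + 1) / df[w])
    return score


def dirichlet(query, doc, coll_prob, mu=2000.0):
    """Dirichlet prior (query likelihood) score, rank-equivalent form (assumed)."""
    d, dl = Counter(doc), len(doc)
    score = len(query) * math.log(mu / (dl + mu))     # length-dependent term
    for w, qtf in Counter(query).items():
        if d[w] > 0:
            score += qtf * math.log(1 + d[w] / (mu * coll_prob[w]))
    return score


def okapi(query, doc, df, n_docs, avdl, k1=1.2, b=0.75, k3=1000.0):
    """Okapi BM25 score in its commonly cited form (assumed)."""
    d, dl = Counter(doc), len(doc)
    score = 0.0
    for w, qtf in Counter(query).items():
        tf = d[w]
        if tf > 0:
            idf = math.log((n_docs - df[w] + 0.5) / (df[w] + 0.5))
            tf_part = (k1 + 1) * tf / (k1 * ((1 - b) + b * dl / avdl) + tf)
            score += idf * tf_part * (k3 + 1) * qtf / (k3 + qtf)
    return score


# Toy data (hypothetical): two equal-length documents, one-term query.
toy_query = ["svm"]
toy_df = {"svm": 2}
doc_a = ["svm", "svm", "a", "b"]
doc_b = ["svm", "a", "b", "c"]
print(pivoted(toy_query, doc_a, toy_df, 100, 4.0) > pivoted(toy_query, doc_b, toy_df, 100, 4.0))
```

Documents and queries are plain token lists here; `df` maps a term to its document frequency and `coll_prob` maps a term to its collection language-model probability.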

4. Research Questions
How can we formally characterize these necessary retrieval heuristics?
Can we predict the empirical behavior of a method without experimentation?

5. Outline
Formalized heuristic retrieval constraints
Analytical evaluation of the current retrieval formulas
Benefits of constraint analysis
– Better understanding of parameter optimization
– Explanation of performance difference
– Improvement of existing retrieval formulas

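6. Term Frequency Constraints (TFC1)
TF weighting heuristic I: Give a higher score to a document with more occurrences of a query term.
TFC1: Let q be a query with only one term w. If [condition] and [condition], then [conclusion]. [formulas not preserved]
[Diagram: query q containing the term w; documents d1 and d2]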
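The conditions of TFC1 were shown as formula images and do not survive in the text. A sketch of how the constraint can be written, assuming f(d, q) denotes the retrieval score, c(w, d) the count of term w in document d, and |d| the document length:

```latex
\[
q = \{w\},\quad |d_1| = |d_2|,\quad c(w, d_1) > c(w, d_2)
\;\Longrightarrow\; f(d_1, q) > f(d_2, q)
\]
```

Holding the length fixed isolates the effect of term frequency: the only difference between the two documents is how often w occurs.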

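7. Term Frequency Constraints (TFC2)
TF weighting heuristic II: Favor a document with more distinct query terms.
TFC2: Let q be a query and w1, w2 be two query terms. Assume [condition]. If [condition] and [condition], then [conclusion]. [formulas not preserved]
[Diagram: query q containing w1 and w2; documents d1 and d2]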
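One way to formalize "favor a document with more distinct query terms" is sketched below. The exact conditions are an assumption, chosen to be consistent with the stated heuristic rather than transcribed from the slide: d1 contains both query terms, while d2 has the same number of query-term occurrences concentrated on w1 alone. The notation f(d, q), c(w, d), |d|, and idf(w) is likewise assumed; the `aligned` environment needs `amsmath`.

```latex
\[
\begin{aligned}
&\text{Assume } |d_1| = |d_2| \text{ and } \mathrm{idf}(w_1) = \mathrm{idf}(w_2). \\
&\text{If } c(w_1, d_2) = c(w_1, d_1) + c(w_2, d_1) \text{ and }
  c(w_2, d_2) = 0,\; c(w_1, d_1) \neq 0,\; c(w_2, d_1) \neq 0, \\
&\text{then } f(d_1, q) > f(d_2, q).
\end{aligned}
\]
```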

8. Term Discrimination Constraint (TDC)
IDF weighting heuristic: Penalize the words popular in the collection; give higher weights to discriminative terms.
Example query: SVM Tutorial. Assume IDF(SVM) > IDF(Tutorial).
Doc 1: … SVM Tutorial …
Doc 2: … SVM Tutorial …

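9. Term Discrimination Constraint (Cont.)
TDC: Let q be a query and w1, w2 be two query terms. Assume [condition] and [condition], and [condition] for all other words w. If [condition] and [condition], then [conclusion]. [formulas not preserved]
[Diagram: query q containing w1 and w2; documents d1 and d2]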
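A hedged sketch of TDC in symbols follows. The sentence structure (assumptions, then an if/then) follows the surviving slide text, but the specific conditions are an assumption consistent with the IDF heuristic, not a transcription. As before, f(d, q), c(w, d), |d|, and idf(w) are assumed notation, and `aligned` needs `amsmath`.

```latex
\[
\begin{aligned}
&\text{Assume } |d_1| = |d_2|,\quad
  c(w_1, d_1) + c(w_2, d_1) = c(w_1, d_2) + c(w_2, d_2), \\
&\text{and } c(w, d_1) = c(w, d_2) \text{ for all other words } w. \\
&\text{If } \mathrm{idf}(w_1) \ge \mathrm{idf}(w_2) \text{ and } c(w_1, d_1) \ge c(w_1, d_2),
  \text{ then } f(d_1, q) \ge f(d_2, q).
\end{aligned}
\]
```

Read this way, two documents with the same total number of query-term occurrences are ordered by how many of those occurrences belong to the more discriminative term.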

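10. Length Normalization Constraints (LNCs)
Document length normalization heuristic: Penalize long documents (LNC1); avoid over-penalizing long documents (LNC2).
LNC1: Let q be a query. If [condition] for some word but [condition] for other words, then [conclusion]. [formulas not preserved]
LNC2: Let q be a query. If [condition] and [condition], then [conclusion]. [formulas not preserved]
[Diagrams: query q and documents d1 and d2 for each constraint]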
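The two length constraints can be sketched in symbols as follows, assuming f(d, q) is the retrieval score, c(w, d) the count of w in d, and |d| the document length. The formulas on the slide are not preserved, so the conditions below are a reconstruction from the stated heuristics; `aligned` needs `amsmath`.

```latex
\[
\begin{aligned}
\textbf{LNC1: } &\text{If } c(w', d_2) = c(w', d_1) + 1 \text{ for some word } w' \notin q,
  \text{ but } c(w, d_2) = c(w, d_1) \text{ for all other words } w, \\
&\text{then } f(d_1, q) \ge f(d_2, q). \\[4pt]
\textbf{LNC2: } &\text{For any } k > 1, \text{ if } |d_1| = k \cdot |d_2|
  \text{ and } c(w, d_1) = k \cdot c(w, d_2) \text{ for all words } w, \\
&\text{then } f(d_1, q) \ge f(d_2, q).
\end{aligned}
\]
```

LNC1 says that adding a non-query word should not raise the score; LNC2 says that concatenating a document with itself k times should not lower it.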

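11. TF-LENGTH Constraint (TF-LNC)
TF-LN heuristic: Regularize the interaction of TF and document length.
TF-LNC: Let q be a query with only one term w. If [condition] and [condition], then [conclusion]. [formulas not preserved]
[Diagram: query q containing the term w; documents d1 and d2]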
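A sketch of TF-LNC in symbols, with the same assumed notation (f(d, q) for the score, c(w, d) for term counts, |d| for length); the conditions are reconstructed from the heuristic because the slide's formulas are not preserved:

```latex
\[
q = \{w\},\quad c(w, d_1) > c(w, d_2),\quad
|d_1| = |d_2| + c(w, d_1) - c(w, d_2)
\;\Longrightarrow\; f(d_1, q) > f(d_2, q)
\]
```

Here d1 is longer than d2 only because it contains extra occurrences of the query term, so the length penalty should not outweigh the term-frequency gain.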

12. Analytical Evaluation
Retrieval Formula   TFCs         TDC          LNC1         LNC2         TF-LNC
Pivoted Norm.       Yes          Conditional  Yes          Conditional  Conditional
Dirichlet Prior     Yes          Conditional  Yes          Conditional  Yes
Okapi (original)    Conditional  Conditional  Conditional  Conditional  Conditional
Okapi (modified)    Yes          Conditional  Yes          Yes          Yes

13. Term Discrimination Constraint (TDC)
IDF weighting heuristic: Penalize the words popular in the collection; give higher weights to discriminative terms.
Example query: SVM Tutorial. Assume IDF(SVM) > IDF(Tutorial).
Doc 1: … SVM Tutorial …
Doc 2: … Tutorial SVM Tutorial …

14. Benefits of Constraint Analysis
Provide an approximate bound for the parameters
– A constraint may be satisfied only if the parameter is within a particular interval.
Compare different formulas analytically without experimentation
– When a formula does not satisfy a constraint, it often indicates non-optimality of the formula.
Suggest how to improve the current retrieval models
– Violation of constraints may pinpoint where a formula needs to be improved.

15. Benefits 1: Bounding Parameters
Pivoted Normalization Method
LNC2 implies s < 0.4
[Figure: parameter sensitivity of s. x-axis: s; y-axis: Avg. Prec.; annotations: 0.4, optimal s (for average precision)]
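To illustrate how a constraint can bound a parameter, the sketch below checks an LNC2-style condition numerically for the TF and length-normalization part of pivoted normalization. It assumes the commonly cited form of that formula and a toy single-term setting; the helper names and the toy values (`tf=1`, `dl=avdl=100`, `k=2`) are hypothetical. The threshold it exposes is specific to this toy case and is not a derivation of the 0.4 bound on the slide.

```python
import math


def pivoted_tf_norm(tf, dl, avdl, s):
    """TF and length-normalization part of pivoted normalization for one term.

    The IDF factor is omitted: it is the same positive constant for both
    documents being compared, so it does not change the comparison.
    """
    if tf == 0:
        return 0.0
    return (1 + math.log(1 + math.log(tf))) / ((1 - s) + s * dl / avdl)


def satisfies_lnc2(s, tf=1, dl=100, avdl=100, k=2):
    """Compare a document with its k-fold concatenation (toy single-term case)."""
    base = pivoted_tf_norm(tf, dl, avdl, s)
    concatenated = pivoted_tf_norm(k * tf, k * dl, avdl, s)
    return concatenated >= base


for s in (0.1, 0.2, 0.4, 0.6, 0.9):
    print(s, satisfies_lnc2(s))
```

In this setting small values of s pass the check and large values fail it, which is the qualitative behavior behind bounding s from above.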

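16. Benefits 2: Analytical Comparison
Okapi Method [formula not preserved]: the IDF term is negative when df(w) is large, so the formula violates many constraints.
[Figure: Pivoted vs. Okapi, for keyword queries and verbose queries. x-axis: s or b; y-axis: Avg. Prec.]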
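The sign problem can be seen directly from the IDF component of Okapi in its commonly cited form, ln((N - df + 0.5) / (df + 0.5)), which is assumed here since the slide's formula is not preserved. The function name and the collection size are illustrative.

```python
import math


def okapi_idf(df, n_docs):
    """Okapi IDF in its commonly cited form (assumed)."""
    return math.log((n_docs - df + 0.5) / (df + 0.5))


n_docs = 1000  # hypothetical collection size
for df in (10, 400, 500, 600, 900):
    print(df, round(okapi_idf(df, n_docs), 3))
```

The value crosses zero when df reaches half the collection and is negative beyond that, so a matched common term lowers a document's score instead of raising it. Verbose queries are more likely to contain such common terms.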

17. Benefits 3: Improving Retrieval Formulas
Modified Okapi Method [formula not preserved]: make Okapi satisfy more constraints; expected to help verbose queries.
[Figure: Pivoted, Okapi, and Modified Okapi, for keyword queries and verbose queries. x-axis: s or b; y-axis: Avg. Prec.]
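The slide does not preserve the modified formula. One plausible modification, assumed here purely for illustration, is to replace Okapi's IDF with an always-positive IDF such as ln((N + 1) / df). The sketch compares the per-term score under both choices for a common term; the function names, parameter defaults, and toy numbers are hypothetical, and the query-term-frequency factor is left out for brevity.

```python
import math


def original_idf(df, n_docs):
    """Okapi IDF in its commonly cited form (assumed); negative for df > N/2."""
    return math.log((n_docs - df + 0.5) / (df + 0.5))


def positive_idf(df, n_docs):
    """Always-positive IDF for df <= N (assumed replacement)."""
    return math.log((n_docs + 1) / df)


def okapi_term(tf, dl, avdl, idf, k1=1.2, b=0.75):
    """Per-term Okapi score: IDF times the saturating TF component."""
    return idf * (k1 + 1) * tf / (k1 * ((1 - b) + b * dl / avdl) + tf)


n_docs, df = 1000, 600  # a term that occurs in most documents
for name, idf_fn in (("original", original_idf), ("modified", positive_idf)):
    idf = idf_fn(df, n_docs)
    low = okapi_term(1, 100, 100, idf)
    high = okapi_term(3, 100, 100, idf)
    print(name, "more occurrences score higher:", high > low)
```

With the original IDF, more occurrences of the common term give a lower score, contrary to the TF heuristic; with the positive IDF the ordering is restored.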

18. Conclusions and Future Work
Conclusions
– Retrieval heuristics can be captured through formally defined constraints.
– It is possible to evaluate a retrieval formula analytically through constraint analysis.
Future Work
– Explore additional necessary heuristics
– Apply these constraints to many other retrieval methods
– Develop new retrieval formulas through constraint analysis

