
1 A Formal Study of Information Retrieval Heuristics. Hui Fang, Tao Tao and ChengXiang Zhai, Department of Computer Science, University of Illinois at Urbana-Champaign, USA

2 Empirical Observations in IR. Retrieval heuristics are necessary for good retrieval performance, e.g. TF-IDF weighting and document length normalization. Similar formulas may perform differently. Performance is sensitive to parameter settings.

3 Empirical Observations in IR (Cont.). [Slide annotates three retrieval formulas: the pivoted normalization method, the Dirichlet prior method, and the Okapi method. Callouts highlight their shared components: an alternative TF transformation (e.g. 1+ln(c(w,d))), inverse document frequency, document length normalization, and parameter sensitivity.]
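As a concrete reference for the three methods named on this slide, here is a minimal Python sketch of each scoring formula as commonly presented in the literature; the parameter names (s, mu, k1, b, k3) and their default values are conventional choices, not taken from the slide:

```python
import math

def pivoted_term_score(tf, qtf, dl, avdl, df, N, s=0.2):
    """Pivoted normalization: sublinear TF over pivoted length norm, times IDF."""
    if tf == 0:
        return 0.0
    tf_part = 1 + math.log(1 + math.log(tf))   # alternative TF transformation
    norm = (1 - s) + s * dl / avdl             # document length normalization
    idf = math.log((N + 1) / df)               # inverse document frequency
    return qtf * tf_part / norm * idf

def dirichlet_score(tfs, qtfs, dl, p_wC, mu=2000):
    """Dirichlet prior smoothing: query-likelihood score for a whole query.
    tfs, qtfs, p_wC map each query term to its document count, query count,
    and collection language-model probability."""
    score = sum(qtf * math.log(1 + tfs.get(w, 0) / (mu * p_wC[w]))
                for w, qtf in qtfs.items())
    return score + sum(qtfs.values()) * math.log(mu / (dl + mu))

def okapi_term_score(tf, qtf, dl, avdl, df, N, k1=1.2, b=0.75, k3=1000):
    """Original Okapi (BM25): note the IDF factor goes negative when df > N/2."""
    idf = math.log((N - df + 0.5) / (df + 0.5))
    tf_part = (k1 + 1) * tf / (k1 * ((1 - b) + b * dl / avdl) + tf)
    qtf_part = (k3 + 1) * qtf / (k3 + qtf)
    return idf * tf_part * qtf_part
```

These sketches score one matched term at a time (except the Dirichlet one, which scores a whole query), which is enough to exercise the constraints discussed on the following slides.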

4 Research Questions. How can we formally characterize these necessary retrieval heuristics? Can we predict the empirical behavior of a method without experimentation?

5 Outline. Formalized heuristic retrieval constraints. Analytical evaluation of current retrieval formulas. Benefits of constraint analysis: better understanding of parameter optimization; explanation of performance differences; improvement of existing retrieval formulas.

6 Term Frequency Constraints (TFC1). TF weighting heuristic I: give a higher score to a document with more occurrences of a query term. TFC1: Let q be a query with only one term w. If |d1| = |d2| and c(w,d1) > c(w,d2), then f(d1,q) > f(d2,q).
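TFC1 can be spot-checked numerically for any candidate scoring function by fixing the document length and sweeping the term count. The sketch below is illustrative, not from the paper; the corpus statistics are invented:

```python
import math

def pivoted_one_term(tf, dl, avdl=100, df=10, N=1000, s=0.2):
    # illustrative pivoted-style single-term score
    if tf == 0:
        return 0.0
    return ((1 + math.log(1 + math.log(tf)))
            / ((1 - s) + s * dl / avdl) * math.log((N + 1) / df))

def satisfies_tfc1(score, dl=100, max_tf=50):
    """TFC1: with |d1| = |d2|, more occurrences of the single query term
    must yield a strictly higher score."""
    scores = [score(tf, dl) for tf in range(1, max_tf + 1)]
    return all(a < b for a, b in zip(scores, scores[1:]))

print(satisfies_tfc1(pivoted_one_term))   # True: the TF transform is increasing
```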

7 Term Frequency Constraints (TFC2). TF weighting heuristic II: favor a document with more distinct query terms. TFC2: Let q be a query and w1, w2 be two query terms. Assume |d1| = |d2| and idf(w1) = idf(w2). If c(w1,d2) = c(w1,d1) + c(w2,d1), c(w2,d2) = 0, and c(w1,d1) > 0, c(w2,d1) > 0, then f(d1,q) > f(d2,q).
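With equal-IDF terms and equal document lengths, TFC2 reduces to checking that a sublinear TF transform rewards spreading the same total count over two terms. A toy check (counts invented for illustration):

```python
import math

def g(tf):
    # sublinear TF transformation used by pivoted normalization
    return 1 + math.log(1 + math.log(tf)) if tf > 0 else 0.0

# Equal-length docs, equal-IDF terms w1 and w2, same total occurrences:
# d1 matches both query terms, d2 piles everything onto w1.
d1 = g(2) + g(2)   # c(w1,d1) = c(w2,d1) = 2
d2 = g(4) + g(0)   # c(w1,d2) = 4, c(w2,d2) = 0
print(d1 > d2)     # True: the doc covering more distinct terms wins
```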

8 Term Discrimination Constraint (TDC). IDF weighting heuristic: penalize the words popular in the collection; give higher weights to discriminative terms. Example: for the query "SVM Tutorial", assume IDF(SVM) > IDF(Tutorial); Doc 1 and Doc 2 both contain "… SVM Tutorial …", and the document that matches the more discriminative term SVM more often should be preferred.

9 Term Discrimination Constraint (Cont.). TDC: Let q be a query and w1, w2 be two query terms. Assume |d1| = |d2|, c(w1,d1) + c(w2,d1) = c(w1,d2) + c(w2,d2), and c(w,d1) = c(w,d2) for all other words w. If idf(w1) ≥ idf(w2) and c(w1,d1) ≥ c(w1,d2), then f(d1,q) ≥ f(d2,q).
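A numeric spot-check of TDC against a simple TF-IDF scorer; the document frequencies below are invented for illustration, with SVM the rarer (higher-IDF) term:

```python
import math

def g(tf):
    # sublinear TF transformation
    return 1 + math.log(1 + math.log(tf)) if tf > 0 else 0.0

def score(counts, idf):
    # equal-length documents, so length normalization cancels and is omitted
    return sum(g(c) * idf[w] for w, c in counts.items())

idf = {"SVM": math.log(1001 / 10), "Tutorial": math.log(1001 / 200)}
d1 = {"SVM": 3, "Tutorial": 1}   # more occurrences of the rarer term
d2 = {"SVM": 1, "Tutorial": 3}   # same total count, skewed to the common term
print(score(d1, idf) > score(d2, idf))   # True: TDC satisfied here
```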

10 Length Normalization Constraints (LNCs). Document length normalization heuristic: penalize long documents (LNC1); avoid over-penalizing long documents (LNC2). LNC1: Let q be a query. If c(w',d2) = c(w',d1) + 1 for some word w' not in q, and c(w,d2) = c(w,d1) for all other words w, then f(d1,q) ≥ f(d2,q). LNC2: Let q be a query and k > 1. If |d1| = k·|d2| and c(w,d1) = k·c(w,d2) for all words w, then f(d1,q) ≥ f(d2,q).
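Both length constraints can be exercised on the same illustrative pivoted-style scorer (statistics invented; a small slope s = 0.1 is used, since larger slopes can break LNC2, which is exactly the bounding argument of slide 15):

```python
import math

def pivoted(tf, dl, avdl=100, s=0.1, idf=2.0):
    # illustrative pivoted-style single-term score
    return (1 + math.log(1 + math.log(tf))) / ((1 - s) + s * dl / avdl) * idf

# LNC1: d2 adds one occurrence of a NON-query word to d1 (tf unchanged,
# length + 1), so the longer document must not score higher.
print(pivoted(tf=5, dl=100) >= pivoted(tf=5, dl=101))   # True

# LNC2: d1 is d2 concatenated twice (tf and length both double),
# so the long document must not be over-penalized.
print(pivoted(tf=10, dl=200) >= pivoted(tf=5, dl=100))  # True at s = 0.1
```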

11 TF-LENGTH Constraint (TF-LNC). TF-LN heuristic: regularize the interaction of TF and document length. TF-LNC: Let q be a query with only one term w. If c(w,d1) > c(w,d2) and |d1| = |d2| + c(w,d1) - c(w,d2), then f(d1,q) > f(d2,q).
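TF-LNC says that if the only difference between two documents is extra occurrences of the query term itself, the extra matches must outweigh the extra length. A spot-check with invented statistics:

```python
import math

def pivoted(tf, dl, avdl=100, s=0.2, idf=2.0):
    # illustrative pivoted-style single-term score
    return (1 + math.log(1 + math.log(tf))) / ((1 - s) + s * dl / avdl) * idf

# d1 is d2 plus 5 extra occurrences of the query term w:
# c(w,d1) = 10 vs c(w,d2) = 5, and |d1| = |d2| + 5.
print(pivoted(tf=10, dl=105) > pivoted(tf=5, dl=100))   # True
```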

12 Analytical Evaluation

Retrieval Formula   TFCs         TDC          LNC1         LNC2         TF-LNC
Pivoted Norm.       Yes          Conditional  Yes          Conditional
Dirichlet Prior     Yes          Conditional  Yes          Conditional  Yes
Okapi (original)    Conditional  Conditional  Conditional  Conditional  Conditional
Okapi (modified)    Yes          Conditional  Yes          Yes          Yes

13 Term Discrimination Constraint (TDC), revisited. IDF weighting heuristic: penalize the words popular in the collection; give higher weights to discriminative terms. Query: SVM Tutorial; assume IDF(SVM) > IDF(Tutorial). Doc 1: "… SVM Tutorial …"; Doc 2: "… Tutorial SVM Tutorial …".

14 Benefits of Constraint Analysis. Provide an approximate bound for the parameters: a constraint may be satisfied only if the parameter is within a particular interval. Compare different formulas analytically without experimentation: when a formula does not satisfy a constraint, it often indicates non-optimality of the formula. Suggest how to improve current retrieval models: violation of constraints may pinpoint where a formula needs to be improved.

15 Benefit 1: Bounding Parameters. Pivoted normalization method: LNC2 implies s < 0.4. [Figure: average precision vs. the slope parameter s, showing parameter sensitivity; the optimal s for average precision lies below 0.4, matching the bound.]
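The bounding idea can be illustrated numerically. With the toy statistics below (|d2| = avdl, tf = 5, k = 2) the cut-off comes out lower than the 0.4 the slide derives from real collection statistics; the point is only that LNC2 yields an upper bound on the slope s:

```python
import math

def g(tf):
    # sublinear TF transformation of pivoted normalization
    return 1 + math.log(1 + math.log(tf))

def lnc2_holds(s, tf=5, k=2):
    # d1 = d2 concatenated k times with |d2| = avdl, so LNC2 requires
    # g(k*tf) / ((1 - s) + s*k) >= g(tf)
    return g(k * tf) / ((1 - s) + s * k) >= g(tf)

# smallest slope (in steps of 0.01) at which LNC2 breaks for this case
s_break = next(i / 100 for i in range(1, 100) if not lnc2_holds(i / 100))
print(s_break)   # 0.13 for these toy statistics
```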

16 Benefit 2: Analytical Comparison. Okapi method: the IDF part becomes negative when df(w) is large, which violates many constraints. [Figure: average precision vs. s or b for the pivoted and Okapi methods, on a keyword query and a verbose query.]

17 Benefit 3: Improving Retrieval Formulas. Modified Okapi method: making Okapi satisfy more constraints is expected to help verbose queries. [Figure: average precision vs. s or b for the pivoted, Okapi, and modified Okapi methods, on keyword and verbose queries; modified Okapi improves on verbose queries.]
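One way the modification is usually described is replacing Okapi's original IDF with the pivoted-style IDF, which stays positive for any df ≤ N; a sketch of that substitution, not the authors' exact code:

```python
import math

def original_idf(df, N):
    return math.log((N - df + 0.5) / (df + 0.5))   # can be negative

def modified_idf(df, N):
    return math.log((N + 1) / df)                  # positive whenever df <= N

# where the original weight flips sign, the modified one stays positive
print(original_idf(600, 1000) < 0 < modified_idf(600, 1000))   # True
```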

18 Conclusions and Future Work. Conclusions: retrieval heuristics can be captured through formally defined constraints; it is possible to evaluate a retrieval formula analytically through constraint analysis. Future work: explore additional necessary heuristics; apply these constraints to many other retrieval methods; develop new retrieval formulas through constraint analysis.

19 The End. Thank you!

