A Formal Study of Information Retrieval Heuristics
Hui Fang, Tao Tao, ChengXiang Zhai (University of Illinois at Urbana-Champaign)
SIGIR 2004 Best Paper. Presented by Lingjie Zhang
Motivation
Good retrieval performance is closely related to the use of various retrieval heuristics, e.g. TF-IDF weighting.
What exactly are the "necessary" heuristics that seem to cause good retrieval performance?
Idea: relevance can be modeled by a set of formally defined constraints on a retrieval function.
A function F is optimal if it satisfies all the constraints; if a function Fa satisfies more constraints than a function Fb, Fa is expected to perform better than Fb empirically.
Formal Definitions of Heuristic Retrieval Constraints
Six intuitive and desirable constraints that any reasonable retrieval formula should satisfy:
Term Frequency Constraints (TFC1, TFC2)
Term Discrimination Constraint (TDC)
Length Normalization Constraints (LNC1, LNC2)
TF-Length Constraint (TF-LNC)
Empirical studies show that good retrieval performance is closely related to the use of various retrieval heuristics, especially TF-IDF weighting and document length normalization. These heuristics are formalized here as a set of basic desirable constraints.
Term Frequency Constraints (TFCs)
TFC1: give a higher score to a document with more occurrences of a query term.
Let q = {w} and assume |d1| = |d2|. If c(w,d1) > c(w,d2), then f(d1,q) > f(d2,q).
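TFC1 can be checked mechanically against any candidate scoring function. A minimal sketch in Python, assuming a hypothetical sublinear TF weight (the constraint itself does not fix the formula):

```python
import math

def tf_weight(tf):
    """Hypothetical sublinear TF weight, 1 + ln(tf); any reasonable
    TF component should be strictly increasing in tf."""
    return 1 + math.log(tf) if tf > 0 else 0.0

# TFC1: with |d1| = |d2| and c(w,d1) > c(w,d2), d1 must score higher.
assert tf_weight(3) > tf_weight(2)
assert tf_weight(2) > tf_weight(1)
```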
TFC2 has two properties:
(1) The increase in the score due to an increase in TF is smaller for larger TFs. Let q = {w}, assume |d1| = |d2| = |d3| and c(w,d1) > 0. If c(w,d2) - c(w,d1) = 1 and c(w,d3) - c(w,d2) = 1, then f(d2,q) - f(d1,q) > f(d3,q) - f(d2,q). For example, the score gain from increasing TF from 1 to 2 should be larger than that from increasing TF from 100 to 101.
(2) Favor a document covering more distinct query terms. Let q = {w1, w2}, assume |d1| = |d2| and idf(w1) = idf(w2). If c(w1,d2) = c(w1,d1) + c(w2,d1) and c(w2,d2) = 0, c(w1,d1) ≠ 0, c(w2,d1) ≠ 0, then f(d1,q) > f(d2,q).
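The diminishing-returns property in (1) is exactly what a sublinear TF transform provides. A small illustration, assuming the hypothetical 1 + ln(tf) weighting:

```python
import math

def tf_weight(tf):
    # 1 + ln(tf): an illustrative sublinear TF transform
    return 1 + math.log(tf) if tf > 0 else 0.0

# TFC2 (1): the gain from TF 1 -> 2 must exceed the gain from TF 100 -> 101.
gain_low = tf_weight(2) - tf_weight(1)
gain_high = tf_weight(101) - tf_weight(100)
assert gain_low > gain_high
```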
Term Discrimination Constraint (TDC)
Favor a document that has more occurrences of discriminative terms (i.e., high-IDF terms).
Let q = {w1, w2}. Assume |d1| = |d2| and c(w1,d1) + c(w2,d1) = c(w1,d2) + c(w2,d2). If idf(w1) ≥ idf(w2) and c(w1,d1) ≥ c(w1,d2), then f(d1,q) ≥ f(d2,q).
Intuition: w1 is the rarer word, with the higher IDF; d1 allocates more of the shared occurrence mass to w1, so d1 gets the higher score according to this constraint.
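TDC can be illustrated with a plain TF-IDF sum. The IDF values and counts below are made up for illustration; d1 and d2 have equal length and the same total number of query-term occurrences, but d1 puts more of them on the higher-IDF term:

```python
def tfidf_score(counts, idf):
    """Illustrative TF-IDF score: sum of tf * idf over the query terms."""
    return sum(c * idf[w] for w, c in counts.items())

idf = {"w1": 3.0, "w2": 1.0}   # w1 is rarer, hence higher IDF (assumed values)
d1 = {"w1": 4, "w2": 1}        # more mass on the discriminative term w1
d2 = {"w1": 1, "w2": 4}        # same total occurrences (5), mostly on w2
assert tfidf_score(d1, idf) >= tfidf_score(d2, idf)   # TDC: f(d1,q) >= f(d2,q)
```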
Length Normalization Constraints (LNCs)
LNC1: penalize long documents. Let q be a query and d1, d2 two documents. If for some word w' ∉ q, c(w',d2) = c(w',d1) + 1, but for any query term w, c(w,d2) = c(w,d1), then f(d1,q) ≥ f(d2,q).
LNC2: avoid over-penalizing long documents. Let q be a query and k > 1. If |d1| = k · |d2| and for all terms w, c(w,d1) = k · c(w,d2), then f(d1,q) ≥ f(d2,q).
Intuition: if we concatenate a document with itself k times to form a new document, the score of the new document should not be lower than that of the original document.
TF-Length Constraint (TF-LNC)
Regularize the interaction of TF and document length.
Let q = {w}. If c(w,d1) > c(w,d2) and |d1| = |d2| + c(w,d1) - c(w,d2), then f(d1,q) > f(d2,q).
Intuition: d1 is generated by adding more occurrences of the query term to d2, so the score of d1 should be higher than that of d2.
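TF-LNC can be exercised numerically: extend d2 with extra occurrences of the query term and check that the score rises. A sketch using a pivoted-style term scorer; avdl = 100 and s = 0.1 are assumed values:

```python
import math

def score(tf, doclen, avdl=100, s=0.1):
    """Pivoted-style score for q = {w} (document-side part only; illustrative)."""
    if tf == 0:
        return 0.0
    return (1 + math.log(1 + math.log(tf))) / (1 - s + s * doclen / avdl)

# TF-LNC: d1 is d2 plus `extra` occurrences of w, so
# |d1| = |d2| + c(w,d1) - c(w,d2); d1 must score strictly higher.
tf2, len2, extra = 3, 100, 2
assert score(tf2 + extra, len2 + extra) > score(tf2, len2)
```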
Analysis of Three Representative Retrieval Formulas
Different models, but similar heuristics:
Pivoted Normalization Method
Okapi Method
Dirichlet Prior Method
Are they performing well because they implement similar retrieval heuristics?
Okapi Method Retrieval function
Okapi is a highly effective retrieval formula that represents the classical probabilistic retrieval model. Typical parameter ranges: k1 between 1.0 and 2.0, b usually 0.75, k3 between 0 and 1000.
However, when df(w) > N/2, the IDF part becomes negative, so the formula violates many constraints, e.g. the TFCs.
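The negative-IDF problem can be seen directly from the IDF component of the standard Okapi formula; the document counts below are assumed numbers:

```python
import math

def okapi_idf(N, df):
    """IDF component of the original Okapi formula:
    ln((N - df + 0.5) / (df + 0.5))."""
    return math.log((N - df + 0.5) / (df + 0.5))

N = 1000
assert okapi_idf(N, 100) > 0   # rare term: positive weight
assert okapi_idf(N, 600) < 0   # df > N/2: the weight turns negative
```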
Modified Okapi Method
Modified Okapi: replace the original IDF in Okapi with the regular IDF from the pivoted normalization formula, ln((N + 1) / df). This solves the problem of negative IDF; the modified Okapi satisfies all the constraints but TDC.
The modified Okapi is expected to help verbose queries and to perform better than the original Okapi on them.
Also, the constraint conditions do not provide any bound for the parameter b. Therefore, the performance of Okapi can be expected to be less sensitive to the length normalization parameter than the pivoted normalization method.
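A sketch of the swap: the pivoted-normalization IDF ln((N + 1)/df) stays positive for any df ≤ N, so the modified Okapi cannot produce negative term weights (document counts here are made up):

```python
import math

def pivoted_idf(N, df):
    """Regular IDF from the pivoted normalization formula: ln((N + 1) / df)."""
    return math.log((N + 1) / df)

def original_okapi_idf(N, df):
    """Original Okapi IDF, negative once df > N/2."""
    return math.log((N - df + 0.5) / (df + 0.5))

N = 1000
assert original_okapi_idf(N, 600) < 0                            # the problem
assert all(pivoted_idf(N, df) > 0 for df in (1, 500, 600, 999))  # the fix
```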
Pivoted Normalization Method
Retrieval function: pivoted normalization is one of the best-performing vector-space retrieval formulas.
Constraint analysis:
TFC: yes
TDC: conditional, e.g. when c(w1,d2) ≤ c(w2,d1)
LNC1: yes
LNC2: when s ≤ (tf1 - tf2) / ((k·|d2|/avdl - 1)·tf2 - (|d2|/avdl - 1)·tf1), where tf1 and tf2 denote the transformed term frequencies in d1 and d2
TF-LNC: satisfied only if s is below a certain upper bound
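The upper bound on s in the LNC2 condition can be checked numerically. In this sketch (IDF and query-TF factors dropped; tf = 5, |d2| = 100, avdl = 100, k = 2 are assumed values) the bound works out to roughly 0.12, so s = 0.1 satisfies LNC2 while s = 0.2 violates it:

```python
import math

def pivoted_term_score(tf, doclen, avdl, s):
    """Document-side part of the pivoted-normalization score for one
    query term (IDF and query-TF factors omitted; illustrative sketch)."""
    if tf == 0:
        return 0.0
    return (1 + math.log(1 + math.log(tf))) / (1 - s + s * doclen / avdl)

# LNC2 demo: d1 is d2 concatenated with itself (k = 2, tf 5 -> 10,
# length 100 -> 200, avdl = 100).
avdl = 100
assert pivoted_term_score(10, 200, avdl, 0.1) >= pivoted_term_score(5, 100, avdl, 0.1)
assert pivoted_term_score(10, 200, avdl, 0.2) < pivoted_term_score(5, 100, avdl, 0.2)
```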
Dirichlet Prior Method
Retrieval function analysis:
TFC: yes
TDC: conditional, with a lower bound on µ: µ ≥ (c(w1,d1) - c(w2,d2)) / p(w2|C)
LNC1: yes
LNC2: when c(w,d2) ≥ |d2| · p(w|C)
TF-LNC: yes
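A sketch of the LNC2 condition for the Dirichlet prior method with a one-term query; µ = 2000 and p(w|C) = 0.001 are assumed values chosen so that c(w,d2) ≥ |d2|·p(w|C) holds:

```python
import math

def dirichlet_score(tf, doclen, p_wC, mu=2000):
    """Dirichlet-prior score for a one-term query q = {w} (sketch):
    ln(1 + tf / (mu * p(w|C))) + ln(mu / (mu + |d|))."""
    return math.log(1 + tf / (mu * p_wC)) + math.log(mu / (mu + doclen))

# LNC2: concatenating d with itself (tf and length doubled) should not
# lower the score; this holds since c(w,d2) = 5 >= |d2| * p(w|C) = 0.1.
p_wC = 0.001          # collection language-model probability of w (assumed)
tf, doclen = 5, 100
assert dirichlet_score(2 * tf, 2 * doclen, p_wC) >= dirichlet_score(tf, doclen, p_wC)
```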
Experiments - Setup
Query types:
Short-keyword (SK, keyword title)
Short-verbose (SV, one-sentence description)
Long-keyword (LK, keyword list)
Long-verbose (LV, multiple sentences)
As is well known, retrieval performance can vary significantly from one test collection to another, so several quite different and representative test collections are constructed from the existing TREC test collections:
AP: news articles; DOE: technical reports; FR: government documents; ADF: combination of AP, DOE, and FR; Web: web data used in TREC8; Trec7: ad hoc data used in TREC7; Trec8: ad hoc data used in TREC8
Preprocessing: only stemming with the Porter stemmer; no stop words removed.
Experiments - Parameter Sensitivity
The pivoted normalization (PN) method is sensitive to s and performs well when s ≤ 0.4. Okapi is more stable as b changes. The Dirichlet prior (DP) method is sensitive to µ.
Experiments - Performance Comparison
For any query type, the performance of the Dirichlet prior method is comparable to the pivoted normalization method. For keyword queries, the performance of Okapi is comparable to the other two retrieval formulas. However, for verbose queries, the performance of Okapi may be worse than the others due to the possibly negative IDF part in the formula; the modified Okapi performs better than the original Okapi for verbose queries. Overall, satisfying more constraints appears to be correlated with better performance.
Conclusion
Six basic constraints are defined that any reasonable retrieval function should satisfy.
When a constraint is not satisfied, it often indicates non-optimality of the method.
For the Okapi formula, the non-optimality for verbose queries was successfully predicted.
Future Work
Repeat all the experiments with stop words removed using a standard list.
Explore additional necessary heuristics for a reasonable retrieval formula.
Apply these constraints to other retrieval models and different smoothing methods.
Find retrieval methods that satisfy all the constraints.
Thank you!