A Formal Study of Information Retrieval Heuristics
Hui Fang, Tao Tao, ChengXiang Zhai (University of Illinois at Urbana-Champaign)
SIGIR 2004 Best Paper. Presented by Lingjie Zhang
Motivation
Good retrieval performance is closely related to the use of various retrieval heuristics, e.g. TF-IDF weighting.
What exactly are the "necessary" heuristics that seem to cause good retrieval performance?
Idea: relevance can be modeled by a set of formally defined constraints on a retrieval function.
A function F is optimal if it satisfies all the constraints; if a function Fa satisfies more constraints than a function Fb, Fa is expected to perform better than Fb empirically.
Formal Definitions of Heuristic Retrieval Constraints
Six intuitive and desirable constraints that any reasonable retrieval formula should satisfy:
Term Frequency Constraints (TFC1, TFC2)
Term Discrimination Constraint (TDC)
Length Normalization Constraints (LNC1, LNC2)
TF-Length Constraint (TF-LNC)
Empirical studies show that good retrieval performance is closely related to the use of various retrieval heuristics, especially TF-IDF weighting and document length normalization. These heuristics are formalized here as a set of basic desirable constraints.
Term Frequency Constraints (TFCs)
TFC1: give a higher score to a document with more occurrences of a query term.
Let q = {w} and assume |d1| = |d2|. If c(w,d1) > c(w,d2), then f(d1,q) > f(d2,q).
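TFC1 can be checked mechanically against any candidate scoring function. A minimal sketch in Python, assuming a hypothetical sublinear TF weight (the constraint itself does not fix the formula):

```python
import math

def tf_weight(tf):
    """Hypothetical sublinear TF weight, 1 + ln(tf); any reasonable
    TF component should be strictly increasing in tf."""
    return 1 + math.log(tf) if tf > 0 else 0.0

# TFC1: with |d1| = |d2| and c(w,d1) > c(w,d2), d1 must score higher.
assert tf_weight(3) > tf_weight(2)
assert tf_weight(2) > tf_weight(1)
```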
TFC2 has two properties:
(1) The increase in the score due to an increase in TF is smaller for larger TFs. Let q = {w}, assume |d1| = |d2| = |d3| and c(w,d1) > 0. If c(w,d2) - c(w,d1) = 1 and c(w,d3) - c(w,d2) = 1, then f(d2,q) - f(d1,q) > f(d3,q) - f(d2,q). For example, the score gain from increasing TF from 1 to 2 should be larger than that from increasing TF from 100 to 101.
(2) Favor a document covering more distinct query terms. Let q = {w1, w2}, assume |d1| = |d2| and idf(w1) = idf(w2). If c(w1,d2) = c(w1,d1) + c(w2,d1) and c(w2,d2) = 0, c(w1,d1) ≠ 0, c(w2,d1) ≠ 0, then f(d1,q) > f(d2,q).
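The diminishing-returns property in (1) is exactly what a sublinear TF transform provides. A small illustration, assuming the hypothetical 1 + ln(tf) weighting:

```python
import math

def tf_weight(tf):
    # 1 + ln(tf): an illustrative sublinear TF transform
    return 1 + math.log(tf) if tf > 0 else 0.0

# TFC2 (1): the gain from TF 1 -> 2 must exceed the gain from TF 100 -> 101.
gain_low = tf_weight(2) - tf_weight(1)
gain_high = tf_weight(101) - tf_weight(100)
assert gain_low > gain_high
```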
Term Discrimination Constraint (TDC)
Favor a document that has more occurrences of discriminative terms (i.e., high-IDF terms).
Let q = {w1, w2}. Assume |d1| = |d2| and c(w1,d1) + c(w2,d1) = c(w1,d2) + c(w2,d2). If idf(w1) ≥ idf(w2) and c(w1,d1) ≥ c(w1,d2), then f(d1,q) ≥ f(d2,q).
Intuition: w1 is the rarer word, with the higher IDF; d1 allocates more of the shared occurrence mass to w1, so d1 gets the higher score according to this constraint.
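TDC can be illustrated with a plain TF-IDF sum. The IDF values and counts below are made up for illustration; d1 and d2 have equal length and the same total number of query-term occurrences, but d1 puts more of them on the higher-IDF term:

```python
def tfidf_score(counts, idf):
    """Illustrative TF-IDF score: sum of tf * idf over the query terms."""
    return sum(c * idf[w] for w, c in counts.items())

idf = {"w1": 3.0, "w2": 1.0}   # w1 is rarer, hence higher IDF (assumed values)
d1 = {"w1": 4, "w2": 1}        # more mass on the discriminative term w1
d2 = {"w1": 1, "w2": 4}        # same total occurrences (5), mostly on w2
assert tfidf_score(d1, idf) >= tfidf_score(d2, idf)   # TDC: f(d1,q) >= f(d2,q)
```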
Length Normalization Constraints (LNCs)
LNC1: penalize long documents. Let q be a query and d1, d2 two documents. If for some word w' ∉ q, c(w',d2) = c(w',d1) + 1, but for any query term w, c(w,d2) = c(w,d1), then f(d1,q) ≥ f(d2,q).
LNC2: avoid over-penalizing long documents. Let q be a query and k > 1. If |d1| = k · |d2| and for all terms w, c(w,d1) = k · c(w,d2), then f(d1,q) ≥ f(d2,q).
Intuition: if we concatenate a document with itself k times to form a new document, the score of the new document should not be lower than that of the original document.
TF-Length Constraint (TF-LNC)
Regularize the interaction of TF and document length.
Let q = {w}. If c(w,d1) > c(w,d2) and |d1| = |d2| + c(w,d1) - c(w,d2), then f(d1,q) > f(d2,q).
Intuition: d1 is generated by adding more occurrences of the query term to d2, so the score of d1 should be higher than that of d2.
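TF-LNC can be exercised numerically: extend d2 with extra occurrences of the query term and check that the score rises. A sketch using a pivoted-style term scorer; avdl = 100 and s = 0.1 are assumed values:

```python
import math

def score(tf, doclen, avdl=100, s=0.1):
    """Pivoted-style score for q = {w} (document-side part only; illustrative)."""
    if tf == 0:
        return 0.0
    return (1 + math.log(1 + math.log(tf))) / (1 - s + s * doclen / avdl)

# TF-LNC: d1 is d2 plus `extra` occurrences of w, so
# |d1| = |d2| + c(w,d1) - c(w,d2); d1 must score strictly higher.
tf2, len2, extra = 3, 100, 2
assert score(tf2 + extra, len2 + extra) > score(tf2, len2)
```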
Analysis of Three Representative Retrieval Formulas
Different models, but similar heuristics:
Pivoted Normalization Method
Okapi Method
Dirichlet Prior Method
Are they performing well because they implement similar retrieval heuristics?
Okapi Method Retrieval function
Okapi is a highly effective retrieval formula that represents the classical probabilistic retrieval model. Typical parameter ranges: k1 between 1.0 and 2.0, b usually 0.75, k3 between 0 and 1000.
However, when df(w) > N/2, the IDF part becomes negative, so the formula violates many constraints, e.g. the TFCs.
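The negative-IDF problem can be seen directly from the IDF component of the standard Okapi formula; the document counts below are assumed numbers:

```python
import math

def okapi_idf(N, df):
    """IDF component of the original Okapi formula:
    ln((N - df + 0.5) / (df + 0.5))."""
    return math.log((N - df + 0.5) / (df + 0.5))

N = 1000
assert okapi_idf(N, 100) > 0   # rare term: positive weight
assert okapi_idf(N, 600) < 0   # df > N/2: the weight turns negative
```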
Modified Okapi Method
Modified Okapi: replace the original IDF in Okapi with the regular IDF from the pivoted normalization formula, ln((N + 1) / df). This solves the problem of negative IDF; the modified Okapi satisfies all the constraints but TDC.
The modified Okapi is expected to help verbose queries and to perform better than the original Okapi on them.
Also, the constraint conditions do not provide any bound for the parameter b. Therefore, the performance of Okapi can be expected to be less sensitive to the length normalization parameter than the pivoted normalization method.
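A sketch of the swap: the pivoted-normalization IDF ln((N + 1)/df) stays positive for any df ≤ N, so the modified Okapi cannot produce negative term weights (document counts here are made up):

```python
import math

def pivoted_idf(N, df):
    """Regular IDF from the pivoted normalization formula: ln((N + 1) / df)."""
    return math.log((N + 1) / df)

def original_okapi_idf(N, df):
    """Original Okapi IDF, negative once df > N/2."""
    return math.log((N - df + 0.5) / (df + 0.5))

N = 1000
assert original_okapi_idf(N, 600) < 0                            # the problem
assert all(pivoted_idf(N, df) > 0 for df in (1, 500, 600, 999))  # the fix
```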
Pivoted Normalization Method
Retrieval function: pivoted normalization is one of the best-performing vector-space retrieval formulas.
Constraint analysis:
TFC: yes
TDC: conditional, e.g. when c(w1,d2) ≤ c(w2,d1)
LNC1: yes
LNC2: when s ≤ (tf1 - tf2) / ((k·|d2|/avdl - 1)·tf2 - (|d2|/avdl - 1)·tf1), where tf1 and tf2 denote the transformed term frequencies in d1 and d2
TF-LNC: satisfied only if s is below a certain upper bound
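The upper bound on s in the LNC2 condition can be checked numerically. In this sketch (IDF and query-TF factors dropped; tf = 5, |d2| = 100, avdl = 100, k = 2 are assumed values) the bound works out to roughly 0.12, so s = 0.1 satisfies LNC2 while s = 0.2 violates it:

```python
import math

def pivoted_term_score(tf, doclen, avdl, s):
    """Document-side part of the pivoted-normalization score for one
    query term (IDF and query-TF factors omitted; illustrative sketch)."""
    if tf == 0:
        return 0.0
    return (1 + math.log(1 + math.log(tf))) / (1 - s + s * doclen / avdl)

# LNC2 demo: d1 is d2 concatenated with itself (k = 2, tf 5 -> 10,
# length 100 -> 200, avdl = 100).
avdl = 100
assert pivoted_term_score(10, 200, avdl, 0.1) >= pivoted_term_score(5, 100, avdl, 0.1)
assert pivoted_term_score(10, 200, avdl, 0.2) < pivoted_term_score(5, 100, avdl, 0.2)
```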
Dirichlet Prior Method
Retrieval function analysis:
TFC: yes
TDC: conditional, with a lower bound on µ: µ ≥ (c(w1,d1) - c(w2,d2)) / p(w2|C)
LNC1: yes
LNC2: when c(w,d2) ≥ |d2| · p(w|C)
TF-LNC: yes
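A sketch of the LNC2 condition for the Dirichlet prior method with a one-term query; µ = 2000 and p(w|C) = 0.001 are assumed values chosen so that c(w,d2) ≥ |d2|·p(w|C) holds:

```python
import math

def dirichlet_score(tf, doclen, p_wC, mu=2000):
    """Dirichlet-prior score for a one-term query q = {w} (sketch):
    ln(1 + tf / (mu * p(w|C))) + ln(mu / (mu + |d|))."""
    return math.log(1 + tf / (mu * p_wC)) + math.log(mu / (mu + doclen))

# LNC2: concatenating d with itself (tf and length doubled) should not
# lower the score; this holds since c(w,d2) = 5 >= |d2| * p(w|C) = 0.1.
p_wC = 0.001          # collection language-model probability of w (assumed)
tf, doclen = 5, 100
assert dirichlet_score(2 * tf, 2 * doclen, p_wC) >= dirichlet_score(tf, doclen, p_wC)
```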
Experiments - Setup
Query types:
Short-keyword (SK, keyword title)
Short-verbose (SV, one-sentence description)
Long-keyword (LK, keyword list)
Long-verbose (LV, multiple sentences)
As is well known, retrieval performance can vary significantly from one test collection to another, so several quite different and representative test collections are constructed from the existing TREC test collections:
AP: news articles; DOE: technical reports; FR: government documents; ADF: combination of AP, DOE, and FR; Web: web data used in TREC8; Trec7: ad hoc data used in TREC7; Trec8: ad hoc data used in TREC8
Preprocessing: only stemming with the Porter stemmer; no stop words removed.
Experiments - Parameter Sensitivity
The pivoted normalization (PN) method is sensitive to s and performs well when s ≤ 0.4. Okapi is more stable as b changes. The Dirichlet prior (DP) method is sensitive to µ.
Experiments - Performance Comparison
For any query type, the performance of the Dirichlet prior method is comparable to the pivoted normalization method. For keyword queries, the performance of Okapi is comparable to the other two retrieval formulas. However, for verbose queries, the performance of Okapi may be worse than the others due to the possibly negative IDF part in the formula; the modified Okapi performs better than the original Okapi for verbose queries. Overall, satisfying more constraints appears to be correlated with better performance.
Conclusion
Six basic constraints are defined that any reasonable retrieval function should satisfy.
When a constraint is not satisfied, it often indicates non-optimality of the method.
For the Okapi formula, the non-optimality for verbose queries was successfully predicted.
Future Work
Repeat all the experiments with stop words removed using a standard list.
Explore additional necessary heuristics for a reasonable retrieval formula.
Apply these constraints to other retrieval models and different smoothing methods.
Find retrieval methods that satisfy all the constraints.
Thank you!