A Formal Study of Information Retrieval Heuristics Hui Fang, Tao Tao, ChengXiang Zhai University of Illinois at Urbana Champaign Urbana SIGIR 2004 Presented.

A Formal Study of Information Retrieval Heuristics
Hui Fang, Tao Tao, ChengXiang Zhai
University of Illinois at Urbana-Champaign
SIGIR 2004
Presented by CHU Huei-Ming, 2004/01/17

2 Outline
Formal Definitions of Heuristic Retrieval Constraints
Analysis of Three Representative Retrieval Formulas
– Pivoted Normalization Method
– Okapi Method
– Dirichlet Prior Method
Experiments
Conclusion and Future Work

3 Formal Definitions of Heuristic Retrieval Constraints
Six intuitive and desirable constraints that any reasonable retrieval formula should satisfy:
– Term Frequency Constraints (TFCs)
– Term Discrimination Constraint (TDC)
– Length Normalization Constraints (LNCs)
– TF-Length Constraint (TF-LNC)

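4 Formal Definitions of Heuristic Retrieval Constraints
Term Frequency Constraints (TFCs)
Notation: $c(w,d)$ is the count of term $w$ in document $d$, $|d|$ is the length of $d$, and $f(d,q)$ is the retrieval score of $d$ for query $q$.
– TFC1: Let $q = \{w\}$ be a single-term query and assume $|d_1| = |d_2|$. If $c(w,d_1) > c(w,d_2)$, then $f(d_1,q) > f(d_2,q)$.
– TFC2: Let $q = \{w\}$ and assume $|d_1| = |d_2| = |d_3|$ and $c(w,d_1) > 0$. If $c(w,d_2) - c(w,d_1) = 1$ and $c(w,d_3) - c(w,d_2) = 1$, then $f(d_2,q) - f(d_1,q) > f(d_3,q) - f(d_2,q)$.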
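As a rough illustration of the two constraints, the sketch below checks them numerically for a single-term query over documents of equal length, so the score depends only on the term count. The helper names (`satisfies_tfc1`, `satisfies_tfc2`) and the two toy scoring functions are assumptions made for this example and do not come from the slides.

```python
import math

def raw_tf(c):
    """Toy score: the raw term count."""
    return float(c)

def log_tf(c):
    """Toy score: a sublinear transform, 1 + ln(c) for c > 0, else 0."""
    return 1.0 + math.log(c) if c > 0 else 0.0

def satisfies_tfc1(score, max_count=50):
    """TFC1: one more occurrence must give a strictly higher score."""
    return all(score(c + 1) > score(c) for c in range(0, max_count))

def satisfies_tfc2(score, max_count=50):
    """TFC2: each extra occurrence must add less than the previous one."""
    return all(
        score(c + 1) - score(c) > score(c + 2) - score(c + 1)
        for c in range(1, max_count)
    )

for name, fn in [("raw_tf", raw_tf), ("log_tf", log_tf)]:
    print(name, "TFC1:", satisfies_tfc1(fn), "TFC2:", satisfies_tfc2(fn))
```

The raw count rewards every extra occurrence equally, so it meets TFC1 but not the diminishing-returns requirement of TFC2; the logarithmic transform meets both over the tested range.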

5 Formal Definitions of Heuristic Retrieval Constraints
Term Discrimination Constraint (TDC)
– Let $q$ be a query, and let $w_1, w_2 \in q$ be two query terms.
– Assume $|d_1| = |d_2|$ and $c(w_1,d_1) + c(w_2,d_1) = c(w_1,d_2) + c(w_2,d_2)$.
– If $idf(w_1) \ge idf(w_2)$ and $c(w_1,d_1) \ge c(w_1,d_2)$, then $f(d_1,q) \ge f(d_2,q)$.

6 Formal Definitions of Heuristic Retrieval Constraints
Length Normalization Constraints (LNCs)
– LNC1: Let $q$ be a query and $d_1$, $d_2$ be two documents. If for some word $w' \notin q$, $c(w',d_2) = c(w',d_1) + 1$, but for any query term $w$, $c(w,d_2) = c(w,d_1)$, then $f(d_1,q) \ge f(d_2,q)$.
– LNC2: Let $q$ be a query and $d_1$, $d_2$ be two documents. For all $k > 1$, if $|d_1| = k \cdot |d_2|$ and for all terms $w$, $c(w,d_1) = k \cdot c(w,d_2)$, then $f(d_1,q) \ge f(d_2,q)$.

7 Formal Definitions of Heuristic Retrieval Constraints
TF-Length Constraint (TF-LNC)
– Let $q = \{w\}$, and let $d_1$ and $d_2$ be two documents.
– If $c(w,d_1) > c(w,d_2)$ and $|d_1| = |d_2| + c(w,d_1) - c(w,d_2)$,
– then $f(d_1,q) > f(d_2,q)$.

8 Formal Definitions of Heuristic Retrieval Constraints

9 Analysis of Three Representative Retrieval Formulas
– Pivoted Normalization Method
– Okapi Method
– Dirichlet Prior Method

10 Analysis of Three Representative Retrieval Formulas
Pivoted Normalization Method
Retrieval function
Analysis
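The formula on this slide did not survive in the transcript. A minimal sketch, assuming the pivoted normalization function in the form it is usually stated (and as given in the Fang, Tao and Zhai paper): for each term $w$ shared by $q$ and $d$, the score adds $\frac{1 + \ln(1 + \ln c(w,d))}{(1 - s) + s \cdot |d| / avdl} \cdot c(w,q) \cdot \ln\frac{N + 1}{df(w)}$. The function name, argument layout, and the token-list representation of documents are assumptions for illustration.

```python
import math

def pivoted_score(query, doc, df, n_docs, avdl, s=0.2):
    """Pivoted normalization score of one document for one query.

    query, doc: lists of tokens; df: dict mapping term -> document frequency;
    n_docs: collection size N; avdl: average document length; s: slope.
    """
    doc_len = len(doc)
    norm = (1.0 - s) + s * doc_len / avdl      # pivoted length normalizer
    score = 0.0
    for w in set(query):
        c_wd = doc.count(w)
        if c_wd == 0:
            continue                            # sum runs over terms in both q and d
        tf = 1.0 + math.log(1.0 + math.log(c_wd))
        idf = math.log((n_docs + 1) / df[w])
        score += tf / norm * query.count(w) * idf
    return score

doc = ["a", "b", "a"]
print(pivoted_score(["a"], doc, df={"a": 2}, n_docs=10, avdl=3))
```

Setting `s=0` switches length normalization off entirely, which is a convenient baseline when checking the length constraints.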

11 Analysis of Three Representative Retrieval Formulas
Pivoted Normalization Method
Check the TF-LNC constraint
– When $|d_1| = avdl$, the constraint reduces to an upper bound on $s$.
– TF-LNC is satisfied only if $s$ is below that upper bound.
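The bound itself is missing from the transcript. As a sketch under stated assumptions, take a single-term query, $|d_1| = avdl$, and the usual pivoted TF component $tf(c) = 1 + \ln(1 + \ln c)$. Then $f(d_1,q) > f(d_2,q)$ rearranges to $s < \left(1 - \frac{tf(c_2)}{tf(c_1)}\right) \cdot \frac{avdl}{c_1 - c_2}$, where $c_i = c(w,d_i)$. This expression is derived here for illustration and is not copied from the slide; the function names are hypothetical.

```python
import math

def pivoted_tf(c):
    """Pivoted TF component: 1 + ln(1 + ln(c)) for c >= 1."""
    return 1.0 + math.log(1.0 + math.log(c))

def tf_lnc_upper_bound(c1, c2, avdl):
    """Upper bound on s for TF-LNC when |d1| = avdl and c1 > c2 >= 1."""
    return (1.0 - pivoted_tf(c2) / pivoted_tf(c1)) * avdl / (c1 - c2)

def tf_lnc_holds(c1, c2, avdl, s):
    """Direct check of f(d1, q) > f(d2, q); the positive IDF factor cancels."""
    len_d1 = avdl
    len_d2 = avdl - (c1 - c2)                  # TF-LNC: |d1| = |d2| + c1 - c2
    f1 = pivoted_tf(c1) / ((1.0 - s) + s * len_d1 / avdl)
    f2 = pivoted_tf(c2) / ((1.0 - s) + s * len_d2 / avdl)
    return f1 > f2

print(tf_lnc_upper_bound(c1=2, c2=1, avdl=100))
print(tf_lnc_holds(c1=2, c2=1, avdl=100, s=0.2))
```

For realistic average document lengths the bound comes out well above 1, so any slope in $[0, 1]$ passes; it only becomes restrictive for very short documents.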

12 Analysis of Three Representative Retrieval Formulas Pivoted Normalization Method Check the LNC2 constraint

13 Analysis of Three Representative Retrieval Formulas
Pivoted Normalization Method
– Consider the common case where $|d_2| = avdl$.
– LNC2 can be violated when $s$ is large, so performance can be bad for a large $s$.
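To make the effect of a large $s$ concrete, the sketch below checks LNC2 for a single-term query in the case $|d_2| = avdl$, with $d_1$ formed from $k$ concatenated copies of $d_2$. Under the usual pivoted formula this reduces to $s \le \frac{tf(k c) / tf(c) - 1}{k - 1}$ with $tf(c) = 1 + \ln(1 + \ln c)$ and $c = c(w,d_2)$. The single-term simplification and the function names are assumptions for this example.

```python
import math

def pivoted_tf(c):
    """Pivoted TF component: 1 + ln(1 + ln(c)) for c >= 1."""
    return 1.0 + math.log(1.0 + math.log(c))

def lnc2_upper_bound(c, k):
    """Largest s for which LNC2 holds when |d2| = avdl and c(w, d2) = c."""
    return (pivoted_tf(k * c) / pivoted_tf(c) - 1.0) / (k - 1.0)

def lnc2_holds(c, k, s):
    """Direct check of f(d1, q) >= f(d2, q) with d1 = k copies of d2."""
    f2 = pivoted_tf(c)                               # |d2| / avdl = 1, normalizer is 1
    f1 = pivoted_tf(k * c) / ((1.0 - s) + s * k)     # |d1| / avdl = k
    return f1 >= f2

for c in (1, 5, 20):
    print(c, round(lnc2_upper_bound(c, k=2), 3))
```

For these inputs the admissible range of $s$ shrinks as the term count in $d_2$ grows, which is one way to see why small slopes are safer.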

14 Analysis of Three Representative Retrieval Formulas
Pivoted Normalization Method
Check the TDC constraint
– It is equivalent to $c(w_2,d_1) \ge c(w_1,d_2)$, so TDC is only conditionally satisfied.

15 Analysis of Three Representative Retrieval Formulas
Okapi Method
Retrieval function
– Parameters: $k_1$ (between 1.0 and 2.0), $b$ (usually 0.75), and $k_3$ (between 0 and 1000)
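The Okapi formula is not legible in the transcript. A minimal sketch, assuming the standard Okapi BM25 form used in the paper: each matched term contributes $\ln\frac{N - df(w) + 0.5}{df(w) + 0.5} \cdot \frac{(k_1 + 1)\, c(w,d)}{k_1\left((1 - b) + b \cdot |d| / avdl\right) + c(w,d)} \cdot \frac{(k_3 + 1)\, c(w,q)}{k_3 + c(w,q)}$. The function name, defaults, and token-list inputs are assumptions for illustration.

```python
import math

def okapi_score(query, doc, df, n_docs, avdl, k1=1.2, b=0.75, k3=1000.0):
    """Okapi BM25 score of one document for one query.

    query, doc: lists of tokens; df: dict mapping term -> document frequency.
    """
    doc_len = len(doc)
    length_norm = (1.0 - b) + b * doc_len / avdl
    score = 0.0
    for w in set(query):
        c_wd = doc.count(w)
        if c_wd == 0:
            continue                                   # only matched terms contribute
        c_wq = query.count(w)
        idf = math.log((n_docs - df[w] + 0.5) / (df[w] + 0.5))
        tf_part = ((k1 + 1.0) * c_wd) / (k1 * length_norm + c_wd)
        qtf_part = ((k3 + 1.0) * c_wq) / (k3 + c_wq)
        score += idf * tf_part * qtf_part
    return score

doc = ["a", "b", "a"]
print(okapi_score(["a"], doc, df={"a": 2}, n_docs=10, avdl=3))
```

The TF part saturates at $k_1 + 1$ as the count grows, and `b` controls how strongly longer-than-average documents are penalized.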

16 Analysis of Three Representative Retrieval Formulas
Okapi Method
Analysis
– When $df(w) > N/2$, the IDF part of the formula is negative.
– When the IDF part is positive (mostly true for keyword queries):
  – TFCs and LNCs are satisfied.
  – TF-LNC: in the common case $|d_2| = avdl$, the constraint is equivalent to $b \le avdl / c(w,d_2)$.
  – TDC is equivalent to $c(w_2,d_1) \ge c(w_1,d_2)$, the same condition as for the pivoted normalization method.

17 Analysis of Three Representative Retrieval Formulas
Okapi Method
Modified Okapi method
– Solves the problem of negative IDF.
– Replaces the original IDF in Okapi with the regular IDF used in the pivoted normalization formula.
– Performs better on verbose queries.
Analysis result
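The difference between the two IDF variants can be seen with a few numbers. The sketch below assumes the Okapi IDF $\ln\frac{N - df + 0.5}{df + 0.5}$ and the pivoted-normalization IDF $\ln\frac{N + 1}{df}$; the function names and the collection size of 1000 are made up for the example.

```python
import math

def okapi_idf(n_docs, df):
    """Original Okapi IDF; negative once df exceeds roughly N / 2."""
    return math.log((n_docs - df + 0.5) / (df + 0.5))

def pivoted_idf(n_docs, df):
    """IDF used in pivoted normalization; positive for any df <= N."""
    return math.log((n_docs + 1.0) / df)

n_docs = 1000
for df in (10, 400, 600, 900):
    print(df, round(okapi_idf(n_docs, df), 3), round(pivoted_idf(n_docs, df), 3))
```

Common words are more likely to show up in verbose queries, which is where a negative weight can do the most harm: a document is then penalized for matching a query term.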

18 Analysis of Three Representative Retrieval Formulas
Dirichlet Prior Method
Retrieval function
– Use the Dirichlet prior smoothing method to smooth a document language model.
– Rank the documents by the likelihood of the query under the estimated language model of each document.
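The retrieval function is not shown in the transcript. A minimal sketch, assuming the rank-equivalent form of the Dirichlet-smoothed query likelihood used in the paper: $f(d,q) = \sum_{w \in q \cap d} c(w,q) \cdot \ln\left(1 + \frac{c(w,d)}{\mu \cdot p(w|C)}\right) + |q| \cdot \ln\frac{\mu}{|d| + \mu}$. The function name, the default $\mu = 2000$, and the dictionary of collection probabilities are assumptions for illustration.

```python
import math

def dirichlet_score(query, doc, p_coll, mu=2000.0):
    """Dirichlet prior score of one document for one query.

    query, doc: lists of tokens; p_coll: dict mapping term -> p(w|C).
    Equals the smoothed query log-likelihood up to a document-independent
    constant, so it produces the same ranking.
    """
    doc_len = len(doc)
    score = len(query) * math.log(mu / (doc_len + mu))   # length-dependent part
    for w in set(query):
        c_wd = doc.count(w)
        if c_wd == 0:
            continue                                       # ln(1 + 0) contributes nothing
        score += query.count(w) * math.log(1.0 + c_wd / (mu * p_coll[w]))
    return score

doc = ["a", "b", "a"]
print(dirichlet_score(["a", "c"], doc, p_coll={"a": 0.1, "c": 0.05}, mu=10.0))
```

A larger `mu` pulls every document model closer to the collection model, which weakens both the TF reward and the length penalty.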

19 Analysis of Three Representative Retrieval Formulas
Dirichlet Prior Method
Analysis
– The LNC2 constraint is equivalent to $c(w,d_2) \ge |d_2| \cdot p(w|C)$, which is usually satisfied for content-carrying words.
– The TDC constraint leads to a lower bound on the parameter $\mu$.
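The LNC2 condition can be checked numerically for a single-term query. The sketch below assumes the rank-equivalent Dirichlet prior score $\ln\left(1 + \frac{c}{\mu \cdot p(w|C)}\right) + \ln\frac{\mu}{|d| + \mu}$ and compares a document with $k$ concatenated copies of itself; the function names and the example numbers are hypothetical.

```python
import math

def dirichlet_single_term(c, doc_len, p_wc, mu=2000.0):
    """Dirichlet prior score for a one-term query with count c in the document."""
    return math.log(1.0 + c / (mu * p_wc)) + math.log(mu / (doc_len + mu))

def lnc2_holds(c, doc_len, p_wc, k, mu=2000.0):
    """Check f(d1, q) >= f(d2, q) where d1 is k concatenated copies of d2."""
    f2 = dirichlet_single_term(c, doc_len, p_wc, mu)
    f1 = dirichlet_single_term(k * c, k * doc_len, p_wc, mu)
    return f1 >= f2

# Term occurs more often than the collection model predicts: c >= |d2| * p(w|C)
print(lnc2_holds(c=5, doc_len=100, p_wc=0.01, k=2))
# Term occurs less often than the collection model predicts: c < |d2| * p(w|C)
print(lnc2_holds(c=1, doc_len=100, p_wc=0.05, k=2))
```

In this single-term setting the outcome depends only on whether the term is over- or under-represented in the document relative to the collection, not on the value of $\mu$.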

20 Analysis of Three Representative Retrieval Formulas
Dirichlet Prior Method
Analysis
– TDC: consider the common case where $p(w_2|C) = 1/avdl$.
– This means that for discriminative words with a high term frequency in a document, $\mu$ needs to be sufficiently large in order to balance TF and IDF appropriately.

21 Experiments
Setup
Document sets
– AP: news articles
– DOE: technical reports
– FR: government documents
– ADF: combination of AP, DOE, and FR
– Web: web data used in TREC8
– Trec7: ad hoc data used in TREC7
– Trec8: ad hoc data used in TREC8

22 Experiments
Setup
Query combination
– Short-keyword (SK, keyword title)
– Short-verbose (SV, one-sentence description)
– Long-keyword (LK, keyword list)
– Long-verbose (LV, multiple sentences)
Preprocessing
– Stemming only, with Porter's stemmer
– No stop words were removed

23 Experiments
Parameter Sensitivity
Pivoted normalization method
– The analysis of the LNC2 constraint for the pivoted normalization method suggests that $s$ should be smaller than 0.4.

24 Experiments
Parameter Sensitivity
Okapi method
– $k_1 = 1.2$, $k_3 = 1000$; $b$ varies from 0.1 to 1.0

25 Experiments Parameter Sensitivity Dirichlet prior method

26 Experiments Parameter Sensitivity Dirichlet prior method

27 Experiments Performance Comparison

28 Experiments
Performance Comparison
– For any query type, the performance of the Dirichlet prior method is comparable to that of the pivoted normalization method.
– For keyword queries, the performance of Okapi is comparable to that of the other two retrieval formulas.
– For verbose queries, the performance of Okapi may be worse than the others because of the possibly negative IDF part in the formula.

29 Experiments Performance Comparison Average precision comparison

30 Conclusion and Future Work
– Defined six basic constraints that any reasonable retrieval function should satisfy.
– When a constraint is not satisfied, it often indicates non-optimality of the method.
