Published by Georgina James. Modified over 8 years ago.
1 A Formal Study of Information Retrieval Heuristics
Hui Fang, Tao Tao, ChengXiang Zhai
University of Illinois at Urbana-Champaign, Urbana
SIGIR 2004
Presented by CHU Huei-Ming, 2004/01/17
2 Outline
Formal Definitions of Heuristic Retrieval Constraints
Analysis of Three Representative Retrieval Formulas
–Pivoted Normalization Method
–Okapi Method
–Dirichlet Prior Method
Experiments
Conclusion and Future Work
3 Formal Definitions of Heuristic Retrieval Constraints
Six intuitive and desirable constraints that any reasonable retrieval formula should satisfy
–Term Frequency Constraints (TFCs)
–Term Discrimination Constraint (TDC)
–Length Normalization Constraints (LNCs)
–TF-Length Constraint (TF-LNC)
4 Formal Definitions of Heuristic Retrieval Constraints
Term Frequency Constraints (TFCs)
–TFC1: Let $q = \{w\}$ and assume $|d_1| = |d_2|$. If $c(w, d_1) > c(w, d_2)$, then $f(d_1, q) > f(d_2, q)$.
–TFC2: Let $q = \{w\}$ and assume $|d_1| = |d_2| = |d_3|$ and $c(w, d_1) > 0$. If $c(w, d_2) - c(w, d_1) = 1$ and $c(w, d_3) - c(w, d_2) = 1$, then $f(d_2, q) - f(d_1, q) > f(d_3, q) - f(d_2, q)$.
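The two TF constraints can be checked numerically for any single-term scoring function. The sketch below is illustrative only: the helper names (`tf_part`, `satisfies_tfc1`, `satisfies_tfc2`) are hypothetical, and the transform 1 + ln(1 + ln c) is an assumption borrowed from the pivoted normalization formula, with document length held fixed so that it drops out of the comparison.

```python
import math

def tf_part(count: int) -> float:
    """Sublinear TF transform 1 + ln(1 + ln c); defined for c >= 1 (assumed form)."""
    return 1.0 + math.log(1.0 + math.log(count))

def satisfies_tfc1(score, c1: int, c2: int) -> bool:
    """TFC1 (one-term query, equal lengths): more occurrences give a higher score."""
    assert c1 > c2
    return score(c1) > score(c2)

def satisfies_tfc2(score, c1: int) -> bool:
    """TFC2: the gain from c1 to c1+1 exceeds the gain from c1+1 to c1+2."""
    assert c1 > 0
    return score(c1 + 1) - score(c1) > score(c1 + 2) - score(c1 + 1)

# Check both constraints over a small range of term counts.
tfc1_ok = all(satisfies_tfc1(tf_part, c + 1, c) for c in range(1, 50))
tfc2_ok = all(satisfies_tfc2(tf_part, c) for c in range(1, 50))

# A raw (linear) term count satisfies TFC1 but not TFC2: its gains never diminish.
linear_tfc2_ok = satisfies_tfc2(lambda c: float(c), 3)
```

The linear counter-example shows why TFC2 is a separate requirement: it asks for diminishing returns, not just monotonicity.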
5 Formal Definitions of Heuristic Retrieval Constraints
Term Discrimination Constraint (TDC)
–Let $q$ be a query, and let $w_1, w_2 \in q$ be two query terms
–Assume $|d_1| = |d_2|$ and $c(w_1, d_1) + c(w_2, d_1) = c(w_1, d_2) + c(w_2, d_2)$
–If $idf(w_1) \ge idf(w_2)$ and $c(w_1, d_1) \ge c(w_1, d_2)$, then $f(d_1, q) \ge f(d_2, q)$
6 Formal Definitions of Heuristic Retrieval Constraints
Length Normalization Constraints (LNCs)
–LNC1: Let $q$ be a query and let $d_1$ and $d_2$ be two documents. If for some word $w' \notin q$, $c(w', d_2) = c(w', d_1) + 1$, but for any query term $w$, $c(w, d_2) = c(w, d_1)$, then $f(d_1, q) \ge f(d_2, q)$.
–LNC2: Let $q$ be a query and let $d_1$ and $d_2$ be two documents. For all $k > 1$, if $|d_1| = k \cdot |d_2|$ and for all terms $w$, $c(w, d_1) = k \cdot c(w, d_2)$, then $f(d_1, q) \ge f(d_2, q)$.
7 Formal Definitions of Heuristic Retrieval Constraints
TF-Length Constraint (TF-LNC)
–Let $q = \{w\}$, and let $d_1$ and $d_2$ be two documents
–If $c(w, d_1) > c(w, d_2)$ and $|d_1| = |d_2| + c(w, d_1) - c(w, d_2)$,
–then $f(d_1, q) > f(d_2, q)$
8 Formal Definitions of Heuristic Retrieval Constraints
9 Analysis of Three Representative Retrieval Formulas
–Pivoted Normalization Method
–Okapi Method
–Dirichlet Prior Method
10 Analysis of Three Representative Retrieval Formulas
Pivoted Normalization Method
–Retrieval function
–Analysis
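The retrieval function itself is not preserved in this transcript. A minimal sketch, assuming the usual form of the pivoted normalization formula, $f(d,q) = \sum_{w \in q \cap d} \frac{1 + \ln(1 + \ln c(w,d))}{(1-s) + s\,|d|/avdl} \cdot c(w,q) \cdot \ln\frac{N+1}{df(w)}$. The function name, argument names, default value of `s`, and toy data are illustrative assumptions, not taken from the slides.

```python
import math
from collections import Counter

def pivoted_score(query, doc, df, n_docs, avdl, s=0.2):
    """Pivoted normalization score of a tokenized doc for a tokenized query.

    df maps term -> document frequency; s is the slope parameter
    (the default 0.2 is an assumed, illustrative value).
    """
    q_counts = Counter(query)
    d_counts = Counter(doc)
    norm = (1.0 - s) + s * len(doc) / avdl  # pivoted length normalization
    score = 0.0
    for w, qtf in q_counts.items():
        c = d_counts[w]
        if c == 0:
            continue  # only terms in both query and document contribute
        tf = 1.0 + math.log(1.0 + math.log(c))
        idf = math.log((n_docs + 1) / df[w])
        score += (tf / norm) * qtf * idf
    return score

# Toy data (made up): two equal-length documents and a one-term query.
df = {"retrieval": 10}
d1 = ["retrieval", "retrieval", "x", "y"]
d2 = ["retrieval", "x", "y", "z"]
s1 = pivoted_score(["retrieval"], d1, df, n_docs=100, avdl=4)
s2 = pivoted_score(["retrieval"], d2, df, n_docs=100, avdl=4)
```

With equal document lengths the normalizer is identical for both documents, so the document with more occurrences of the query term scores higher, as TFC1 requires.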
11 Analysis of Three Representative Retrieval Formulas
Pivoted Normalization Method
–Check the TF-LNC constraint
–When $|d_1| = avdl$, the constraint reduces to an inequality on $s$: TF-LNC is satisfied only if $s$ is below a certain upper bound
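The inequality shown on this slide is not preserved in the transcript. The derivation below is a reconstruction under stated assumptions, not the slide's own formula: it assumes the pivoted normalization TF part $1 + \ln(1 + \ln c)$ with normalizer $(1-s) + s\,|d|/avdl$, a one-term query $q = \{w\}$, $c(w, d_2) \ge 1$, and $0 \le s \le 1$. The symbols $\mathrm{tf}_i$ and $\Delta$ are introduced here for brevity.

```latex
% Assumes the pivoted normalization formula and a one-term query q = {w}.
\begin{align*}
\mathrm{tf}_i &= 1 + \ln\bigl(1 + \ln c(w, d_i)\bigr), \qquad
\Delta = c(w, d_1) - c(w, d_2) > 0, \\
|d_1| &= avdl, \qquad |d_2| = |d_1| - \Delta = avdl - \Delta, \\
f(d_1, q) > f(d_2, q)
  &\iff \frac{\mathrm{tf}_1}{(1 - s) + s}
        > \frac{\mathrm{tf}_2}{(1 - s) + s\,\frac{avdl - \Delta}{avdl}} \\
  &\iff \mathrm{tf}_1 \left(1 - \frac{s\,\Delta}{avdl}\right) > \mathrm{tf}_2 \\
  &\iff s < \frac{(\mathrm{tf}_1 - \mathrm{tf}_2)\, avdl}{\Delta \cdot \mathrm{tf}_1}.
\end{align*}
```

The common positive factor $c(w,q) \cdot idf(w)$ cancels on both sides, and the second step is valid because $1 - s\Delta/avdl > 0$ whenever $s \le 1$ and $\Delta < avdl$.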
12 Analysis of Three Representative Retrieval Formulas Pivoted Normalization Method Check the LNC2 constraint
13 Analysis of Three Representative Retrieval Formulas
Pivoted Normalization Method
–Consider the common case when $|d_2| = avdl$
–Performance can be bad for a large $s$
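To see why a large $s$ can hurt, the LNC2 condition can be turned into a bound on $s$ for a one-term query. The sketch below is a reconstruction under assumptions, not the slide's lost formula: with the pivoted TF part $1 + \ln(1 + \ln c)$, $|d_2| = avdl$, and $d_1$ equal to $k$ concatenated copies of $d_2$, requiring $f(d_1,q) \ge f(d_2,q)$ gives $s \le (\mathrm{tf}_1 - \mathrm{tf}_2) / ((k-1)\,\mathrm{tf}_2)$. All function names are hypothetical.

```python
import math

def tf_part(c: float) -> float:
    """Pivoted TF transform 1 + ln(1 + ln c), for c >= 1."""
    return 1.0 + math.log(1.0 + math.log(c))

def lnc2_s_upper_bound(c: int, k: int) -> float:
    """Largest s satisfying LNC2 when |d2| = avdl, c(w, d2) = c, and d1 is k copies of d2."""
    tf1, tf2 = tf_part(k * c), tf_part(c)
    return (tf1 - tf2) / ((k - 1) * tf2)

def pivoted_single_term(c: int, length: float, avdl: float, s: float) -> float:
    """One-term score, up to the positive document-independent IDF factor."""
    return tf_part(c) / ((1.0 - s) + s * length / avdl)

# Bounds for a few (term count, scale factor) pairs; inspect to see how they shrink.
bounds = {(c, k): lnc2_s_upper_bound(c, k) for c in (1, 2, 5) for k in (2, 5, 10)}

# Direct check around the bound for c = 2, k = 3, avdl = 100.
b = lnc2_s_upper_bound(2, 3)
holds_below = pivoted_single_term(6, 300, 100, 0.9 * b) >= pivoted_single_term(2, 100, 100, 0.9 * b)
holds_above = pivoted_single_term(6, 300, 100, 1.1 * b) >= pivoted_single_term(2, 100, 100, 1.1 * b)
```

Just below the bound the longer, scaled document still scores at least as high as the original; just above it the length penalty wins and LNC2 is violated.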
14 Analysis of Three Representative Retrieval Formulas
Pivoted Normalization Method
–Check the TDC constraint
–It is equivalent to $c(w_2, d_1) \ge c(w_1, d_2)$, so TDC is only conditionally satisfied
15 Analysis of Three Representative Retrieval Formulas
Okapi Method
–Retrieval function
–Parameters: $k_1$ (between 1.0 and 2.0), $b$ (usually 0.75), and $k_3$ (between 0 and 1000)
16 Analysis of Three Representative Retrieval Formulas
Okapi Method
Analysis
–When $df(w) > N/2$, the IDF part of the formula is negative
–When the IDF part is positive (mostly true for keyword queries):
–The TFCs and LNCs are satisfied
–TF-LNC constraint: in the common case when $|d_2| = avdl$, the constraint is equivalent to $b \le avdl / c(w, d_2)$
–TDC is equivalent to $c(w_2, d_1) \ge c(w_1, d_2)$, the same condition as for the pivoted normalization method
17 Analysis of Three Representative Retrieval Formulas
Okapi Method
Modified Okapi Method
–Solves the problem of negative IDF
–Replaces the original IDF in Okapi with the regular IDF from the pivoted normalization formula
–Performance is better on verbose queries
Analysis result
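The negative-IDF problem and its fix can be illustrated with a small sketch. It assumes the common Okapi BM25 form, with IDF $\ln\frac{N - df + 0.5}{df + 0.5}$, and takes $\ln\frac{N+1}{df}$ as the "regular" IDF of the pivoted formula. The function names and toy statistics are hypothetical; the defaults $k_1 = 1.2$, $b = 0.75$, $k_3 = 1000$ follow the values mentioned in the slides.

```python
import math
from collections import Counter

def okapi_idf(n_docs: int, df: int) -> float:
    """Original Okapi IDF; negative whenever df > n_docs / 2."""
    return math.log((n_docs - df + 0.5) / (df + 0.5))

def regular_idf(n_docs: int, df: int) -> float:
    """IDF from the pivoted normalization formula; positive since df <= n_docs."""
    return math.log((n_docs + 1) / df)

def okapi_score(query, doc, df, n_docs, avdl,
                k1=1.2, b=0.75, k3=1000.0, idf=okapi_idf):
    """Okapi score; pass idf=regular_idf for the modified variant."""
    q_counts, d_counts = Counter(query), Counter(doc)
    length_norm = k1 * ((1.0 - b) + b * len(doc) / avdl)
    score = 0.0
    for w, qtf in q_counts.items():
        c = d_counts[w]
        if c == 0:
            continue  # only matched terms contribute
        tf_part = (k1 + 1.0) * c / (length_norm + c)
        qtf_part = (k3 + 1.0) * qtf / (k3 + qtf)
        score += idf(n_docs, df[w]) * tf_part * qtf_part
    return score

# Toy statistics (made up): "the" occurs in 80 of 100 documents.
df = {"the": 80, "okapi": 5}
doc = ["the", "okapi", "method", "the"]
common_original = okapi_score(["the"], doc, df, 100, avdl=4)
common_modified = okapi_score(["the"], doc, df, 100, avdl=4, idf=regular_idf)
```

With the original IDF, matching a very common term lowers the score, which is why verbose queries (which contain many common words) are affected most; with the regular IDF the same match contributes a small positive amount.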
18 Analysis of Three Representative Retrieval Formulas
Dirichlet Prior Method
–Retrieval function
–Use the Dirichlet prior smoothing method to smooth a document language model
–Rank the documents by the likelihood of the query under the estimated language model of each document
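The two steps described here (smooth, then rank by query likelihood) can be sketched as follows. This is a minimal illustration assuming the standard Dirichlet-smoothed estimate $p(w|d) = \frac{c(w,d) + \mu\,p(w|C)}{|d| + \mu}$; the slide's exact formula is not preserved, and it may differ from this log-likelihood by a document-independent constant, which does not change the ranking. The function name, the default `mu`, and the toy probabilities are assumptions.

```python
import math
from collections import Counter

def dirichlet_log_likelihood(query, doc, p_coll, mu=2000.0):
    """log p(q | d) under a Dirichlet-smoothed unigram document model.

    p_coll maps term -> collection probability p(w|C); mu is the prior weight
    (the default 2000 is an assumed, commonly used value).
    """
    d_counts = Counter(doc)
    doc_len = len(doc)
    score = 0.0
    for w in query:  # a repeated query term contributes once per occurrence
        p_w_d = (d_counts[w] + mu * p_coll[w]) / (doc_len + mu)
        score += math.log(p_w_d)
    return score

# Toy collection statistics and documents (made up for illustration).
p_coll = {"language": 0.001, "model": 0.002}
d1 = ["language", "model", "language", "smoothing"]
d2 = ["model", "smoothing", "prior", "method"]
s1 = dirichlet_log_likelihood(["language", "model"], d1, p_coll, mu=10.0)
s2 = dirichlet_log_likelihood(["language", "model"], d2, p_coll, mu=10.0)
```

Because of smoothing, `d2` still receives a finite score even though it does not contain "language"; it simply ranks below `d1`.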
19 Analysis of Three Representative Retrieval Formulas
Dirichlet Prior Method
Analysis
–The LNC2 constraint is equivalent to $c(w, d_2) \ge |d_2| \cdot p(w|C)$, which is usually satisfied for content-carrying words
–The TDC constraint leads to a lower bound for the parameter $\mu$
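The LNC2 equivalence can be verified in a few lines for a one-term query. This derivation is a sketch under the assumption that the score is the log of the Dirichlet-smoothed probability $\frac{c(w,d) + \mu\,p(w|C)}{|d| + \mu}$; the shorthand $L$, $c$, and $p$ is introduced here and is not from the slides.

```latex
% One-term query; L = |d_2|, c = c(w, d_2), p = p(w|C); d_1 is d_2 scaled by k > 1.
\begin{align*}
f(d_1, q) \ge f(d_2, q)
  &\iff \frac{k c + \mu p}{k L + \mu} \ge \frac{c + \mu p}{L + \mu} \\
  &\iff (k c + \mu p)(L + \mu) \ge (c + \mu p)(k L + \mu) \\
  &\iff \mu (k - 1)\, c \ge \mu (k - 1)\, p L \\
  &\iff c(w, d_2) \ge |d_2|\, p(w \mid C).
\end{align*}
```

In words, the constraint holds when the term occurs in the document at least as often as the collection model alone would predict for a document of that length.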
20 Analysis of Three Representative Retrieval Formulas
Dirichlet Prior Method
Analysis
–TDC: consider a common case for $w_2$, $p(w_2|C) = 1/avdl$
–This means that for discriminative words with a high term frequency in a document, $\mu$ needs to be sufficiently large in order to balance TF and IDF appropriately
21 Experiments
Setup
Document sets
–AP: news articles; DOE: technical reports; FR: government documents
–ADF: combination of AP, DOE, and FR
–Web: web data used in TREC8
–Trec7: ad hoc data used in TREC7
–Trec8: ad hoc data used in TREC8
22 Experiments
Setup
Query types
–Short-keyword (SK, keyword title)
–Short-verbose (SV, one-sentence description)
–Long-keyword (LK, keyword list)
–Long-verbose (LV, multiple sentences)
Preprocessing
–Stemming only, with Porter's stemmer
–No stop words were removed
23 Experiments
Parameter Sensitivity
Pivoted normalization method
–The analysis of the LNC2 constraint for the pivoted normalization method suggests that $s$ should be smaller than 0.4
24 Experiments
Parameter Sensitivity
Okapi method
–$k_1 = 1.2$, $k_3 = 1000$, $b$ varies from 0.1 to 1.0
25 Experiments Parameter Sensitivity Dirichlet prior method
26 Experiments Parameter Sensitivity Dirichlet prior method
27 Experiments Performance Comparison
28 Experiments
Performance Comparison
–For any query type, the performance of the Dirichlet prior method is comparable to that of the pivoted normalization method
–For keyword queries, the performance of Okapi is comparable to that of the other two retrieval formulas
–For verbose queries, the performance of Okapi may be worse than the others because of the possibly negative IDF part in the formula
29 Experiments Performance Comparison Average precision comparison
30 Conclusion and Future Work
–Define six basic constraints that any reasonable retrieval function should satisfy
–When a constraint is not satisfied, it often indicates non-optimality of the method