1 LANGUAGE MODELS FOR RELEVANCE FEEDBACK
2003. 07. 14, Lee Won Hee
2 Abstract
The language modeling approach to IR
- Query: treated as a random event
- Documents: ranked according to the likelihood that users, having a prototypical document in mind, would choose the query terms accordingly
- Inferences about the semantic content of documents do not need to be made, resulting in a conceptually simple model
3 1. Introduction
The language modeling approach to IR
- Developed by Ponte and Croft, 1998
- Query: a random event generated according to a probability distribution
- Document similarity: estimate a model of the term generation probabilities of the query terms for each document, then rank the documents according to the probability of generating the query
The main advantages of the language modeling approach
- Document boundaries are not predefined: the document-level statistics of tf and idf are used directly
- Uncertainty is modeled by probabilities: suited to noisy data such as OCR text and automatically recognized speech transcripts, and to relevance feedback or document routing
4 2. The Language Modeling Approach to IR
The query generation probability
- The probability is estimated starting with the maximum likelihood estimate of the probability of term t in document d:
  P_{ml}(t \mid M_d) = \frac{tf(t,d)}{dl_d}
- tf(t,d): the raw term frequency of term t in document d
- dl_d: the total number of tokens in document d
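A minimal sketch of this estimate (the function name and the token-list representation of a document are illustrative, not from the paper):

```python
from collections import Counter

def p_ml(term: str, doc_tokens: list[str]) -> float:
    """Maximum likelihood estimate P_ml(t | M_d) = tf(t, d) / dl_d."""
    tf = Counter(doc_tokens)[term]  # raw term frequency of t in d
    dl = len(doc_tokens)            # total number of tokens in d
    return tf / dl if dl > 0 else 0.0
```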
5 2.1 Insufficient Data
Two problems with the maximum likelihood estimator
- We do not wish to assign a probability of zero to a document that is missing one or more of the query terms
  - If a user included several synonyms in the query, a document missing even one of them would not be retrieved
  - A more reasonable fallback is the collection-wide distribution: P(t) = \frac{cf_t}{cs}
- We only have a document-sized sample from the underlying distribution, so the variation in the raw counts may partially be accounted for by randomness
- cf_t: the raw count of term t in the collection
- cs: the raw collection size, the total number of tokens in the collection
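A sketch of the collection-wide fallback, assuming the collection statistics are available as a simple term-count mapping (names are illustrative):

```python
def p_fallback(term: str, collection_tf: dict[str, int], cs: int) -> float:
    """Collection-wide estimate cf_t / cs: gives a small but nonzero
    probability to a query term that is missing from a document, so a
    document missing one synonym is discounted rather than ruled out."""
    return collection_tf.get(term, 0) / cs
```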
6 2.2 Averaging
The mean probability estimate of t in documents containing it
- Circumvents the problem of insufficient data:
  P_{avg}(t) = \frac{1}{df_t} \sum_{d : t \in d} P_{ml}(t \mid M_d)
- Some risk remains: if the mean were used by itself, there would be no distinction between documents with different term frequencies
Combining the two estimates using the geometric distribution (Ghosh et al., 1983)
- Provides robustness of estimation and minimizes the risk:
  \hat{R}_{t,d} = \frac{1}{1 + \bar{f}_t} \left( \frac{\bar{f}_t}{1 + \bar{f}_t} \right)^{tf(t,d)}
- df_t: the document frequency of t
- \bar{f}_t: the mean term frequency of term t in documents containing it
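A sketch of the risk factor as reconstructed above (assuming the standard geometric form from Ponte and Croft, 1998):

```python
def geometric_risk(tf: int, mean_tf: float) -> float:
    """Geometric risk factor R(t, d): decays geometrically in tf(t, d), so
    high-frequency terms lean on the maximum likelihood estimate and
    low-frequency terms lean on the averaged estimate."""
    return (1.0 / (1.0 + mean_tf)) * (mean_tf / (1.0 + mean_tf)) ** tf
```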
7 2.3 Combining the Two Estimates
The estimate of the probability of producing the query for a given document model:
  \hat{P}(Q \mid M_d) = \prod_{t \in Q} \hat{P}(t \mid M_d) \times \prod_{t \notin Q} (1 - \hat{P}(t \mid M_d))
- First term: the probability of producing the terms in the query
- Second term: the probability of not producing other terms, which favors terms that are better discriminators of the document
The per-term estimate is piecewise:
  \hat{P}(t \mid M_d) = P_{ml}(t \mid M_d)^{1 - \hat{R}_{t,d}} \times P_{avg}(t)^{\hat{R}_{t,d}}   if tf(t,d) > 0
  \hat{P}(t \mid M_d) = cf_t / cs   otherwise
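A sketch combining the pieces above into the query-generation score; the term_stats accessor and data layout are assumptions for illustration:

```python
def p_term(tf: int, dl: int, p_avg: float, mean_tf: float,
           cf: int, cs: int) -> float:
    """Per-term estimate: a geometric mixture of the ML and averaged
    estimates when t occurs in d, the collection fallback otherwise."""
    if tf > 0:
        p_ml = tf / dl
        # geometric risk factor from section 2.2
        risk = (1.0 / (1.0 + mean_tf)) * (mean_tf / (1.0 + mean_tf)) ** tf
        return (p_ml ** (1.0 - risk)) * (p_avg ** risk)
    return cf / cs

def p_query(query: set[str], vocab: set[str], term_stats) -> float:
    """P(Q | M_d): product of P(t | M_d) over query terms times the product
    of (1 - P(t | M_d)) over the rest of the vocabulary. term_stats(t) is a
    hypothetical accessor returning (tf, dl, p_avg, mean_tf, cf, cs) for
    term t in the current document."""
    score = 1.0
    for t in vocab:
        p = p_term(*term_stats(t))
        # a real implementation would accumulate log probabilities
        # to avoid floating-point underflow over a large vocabulary
        score *= p if t in query else (1.0 - p)
    return score
```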
8 3. Related Work
1. The Harper and van Rijsbergen model
2. The Rocchio method
3. The INQUERY model
4. Exponential models
9 3.1 The Harper and van Rijsbergen Model (1978)
Goal: obtain better estimates of the probability of relevance of a document given the query
The authors approximated the dependence between query terms by means of a maximal spanning tree
- Each node of the tree: a single query term
- The edges between nodes: weighted by a measure of term dependency
- The tree spans all of the nodes and maximizes the expected mutual information:
  EMIM(x_i, x_j) = \sum P(x_i, x_j) \log \frac{P(x_i, x_j)}{P(x_i) P(x_j)}
- P(x_i, x_j): the probability of terms x_i and x_j occurring together in a relevant document
- P(x_i), P(x_j): the probabilities of terms x_i and x_j, respectively, occurring in a relevant document
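A sketch of the tree construction under a binary-occurrence reading of these probabilities; pair_prob and term_prob are hypothetical estimators, and networkx supplies the maximal spanning tree:

```python
import math
import networkx as nx

def emim(p_i: float, p_j: float, p_ij: float) -> float:
    """Expected mutual information between the binary occurrence variables
    for terms x_i and x_j, given the marginals and the joint probability."""
    total = 0.0
    for pa, pb, p_ab in [
        (p_i, p_j, p_ij),                                # both occur
        (p_i, 1.0 - p_j, p_i - p_ij),                    # only x_i
        (1.0 - p_i, p_j, p_j - p_ij),                    # only x_j
        (1.0 - p_i, 1.0 - p_j, 1.0 - p_i - p_j + p_ij),  # neither
    ]:
        if p_ab > 0.0 and pa > 0.0 and pb > 0.0:
            total += p_ab * math.log(p_ab / (pa * pb))
    return total

def dependence_tree(terms: list[str], pair_prob, term_prob) -> nx.Graph:
    """Weight every query-term pair by EMIM, keep the maximal spanning tree."""
    g = nx.Graph()
    for i, xi in enumerate(terms):
        for xj in terms[i + 1:]:
            w = emim(term_prob(xi), term_prob(xj), pair_prob(xi, xj))
            g.add_edge(xi, xj, weight=w)
    return nx.maximum_spanning_tree(g)
```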
10 3.2 The Rocchio Method (1971)
The Rocchio method provides a mechanism for the selection and weighting of expansion terms
- It can be used to rank the terms in the judged documents; the top N can then be added to the query and weighted
- A reasonable solution to the problem of relevance feedback that works very well in practice
- The optimal values of \alpha, \beta, \gamma are determined empirically:
  Q' = \alpha Q + \frac{\beta}{|R|} \sum_{d \in R} d - \frac{\gamma}{|NR|} \sum_{d \in NR} d
- \beta: the weight assigned to occurrence in relevant documents
- \gamma: the weight assigned to occurrence in non-relevant documents
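A sketch of the update over term-weight vectors; the default alpha, beta, and gamma values are common choices, not from the original paper:

```python
import numpy as np

def rocchio(q: np.ndarray, rel: np.ndarray, nonrel: np.ndarray,
            alpha: float = 1.0, beta: float = 0.75,
            gamma: float = 0.15) -> np.ndarray:
    """Q' = alpha*Q + beta*mean(relevant) - gamma*mean(non-relevant).
    q is the query term-weight vector; rel and nonrel are matrices whose
    rows are judged document vectors."""
    q_new = alpha * q
    if len(rel):
        q_new += beta * rel.mean(axis=0)
    if len(nonrel):
        q_new -= gamma * nonrel.mean(axis=0)
    return np.maximum(q_new, 0.0)  # negative weights are usually clipped
```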
11 3.3 The INQUERY Model (1/2)
The INQUERY inference network (Turtle, 1991)
- Document portion: computed in advance
- Query portion: computed at retrieval time
Document network
- Document nodes: d_1 ... d_i
- Text nodes: t_1 ... t_j
- Concept representation nodes: r_1 ... r_k
Query network
- Query concept nodes: c_1 ... c_m
- Query nodes: q_1, q_2
- Information need node: I
- Uncertainty arises from differences in word sense
[Figure 3.1: Example inference network]
12 3.3 The INQUERY Model (2/2)
Relevance feedback
- The theoretical relevance feedback model was implemented by Haines (1996)
- Annotated query network
  - Proposition nodes: k_1, k_2
  - Observed relevance judgment nodes: j_1, j_2
  - AND nodes: required for an annotation to have an effect on the score
The drawback of this technique
- It requires inferences of considerable complexity
- Each relevance judgment requires two additional layers of inference and several new propositions
[Figure 3.3: Annotated query network]
13 3.4 Exponential Models
An approach to predicting topic shifts in text using exponential models (Beeferman et al., 1997)
- The model utilizes the ratio of a long-range language model to a short-range language model to predict useful terms:
  \log \frac{P_l(x)}{P_s(x)}
Topic shift
- Occurs when the long-range language model is no longer able to predict the next word better than the short-range language model
- P_l(x): the probability of seeing word x given the context of the last 500 words
- P_s(x): the probability of seeing word x given the two previous words
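A sketch of boundary detection with this ratio; the two model callables and the threshold are assumptions for illustration:

```python
import math

def topicality(p_long: float, p_short: float) -> float:
    """log(P_l(x) / P_s(x)): positive while long-range context still helps
    predict the next word, near or below zero around a topic shift."""
    return math.log(p_long / p_short)

def shift_points(words: list[str], p_l, p_s,
                 threshold: float = 0.0) -> list[int]:
    """Positions where the log ratio drops below the threshold; p_l and p_s
    are hypothetical callables returning each model's probability."""
    return [i for i, w in enumerate(words)
            if topicality(p_l(w), p_s(w)) < threshold]
```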
14 4. Query Expansion in the Language Modeling Approach
Assumption of this approach
- Users can choose query terms that are likely to occur in documents in which they would be interested
This assumption has been developed into a ranking formula by means of probabilistic language models
15 4.1 Interactive Retrieval with Relevance Feedback
Relevance feedback
- A small number of documents are judged relevant by the user
- The relevance of all the remaining documents is unknown to the system
16 4.2 Document Routing
Document routing
- The task is to choose terms associated with documents of interest and to avoid terms associated with other documents
- A training collection is available with a large number of relevance judgments, both positive and negative, for a particular query
Ratio method
- Can utilize this additional information by estimating probabilities for both sets
17 4.3 The Ratio Method
The ratio method predicts useful terms
- Terms are ranked according to their probability of occurrence under the relevant document models relative to the collection as a whole:
  \sum_{d \in R} \log \frac{P(t \mid M_d)}{cf_t / cs}
- Terms are ranked according to this ratio and the top N are added to the initial query
- R: the set of relevant documents
- P(t | M_d): the probability of term t given the document model for d
- cf_t: the raw count of term t in the collection
- cs: the raw collection size
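A sketch of the expansion-term selection; p_term_given_doc stands in for the document models of section 2, and the names are illustrative:

```python
import math

def expansion_terms(rel_docs: list[list[str]], p_term_given_doc,
                    collection_tf: dict[str, int], cs: int,
                    n: int) -> list[str]:
    """Rank candidate terms by the sum over relevant documents of
    log( P(t | M_d) / (cf_t / cs) ) and return the top N."""
    candidates = {t for d in rel_docs for t in d}
    def score(t: str) -> float:
        background = collection_tf[t] / cs
        return sum(math.log(p_term_given_doc(t, d) / background)
                   for d in rel_docs)
    return sorted(candidates, key=score, reverse=True)[:n]
```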
18 4.4 Evaluation
Results are measured using recall and precision
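For reference, a minimal sketch of the two set-based measures (TREC-style evaluation additionally interpolates precision over recall levels, which is omitted here):

```python
def precision_recall(retrieved: list[str],
                     relevant: set[str]) -> tuple[float, float]:
    """Precision = |retrieved and relevant| / |retrieved|;
    recall = |retrieved and relevant| / |relevant|."""
    hits = sum(1 for doc in retrieved if doc in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```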
19 4.5 Experiments (1/2)
Comparison of the Rocchio method vs. the language model approach
- Language model: expansion terms ranked by the log ratio of the probability in the judged relevant set to the collection probability
- Rocchio: the weighting function was tf.idf, with no negative feedback (\gamma = 0)
- The language modeling approach works well
20 4.5 Experiments (2/2)
21 4.6 Information Routing
Ratio methods with more data
- Ratio 1: the ratio method of Section 4.3 (relevant document models vs. the collection)
- Ratio 2: the log ratio of the average probability in judged relevant documents to the average probability in judged non-relevant documents
Result
- The language modeling approach is a good model for retrieval
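A sketch of Ratio 2 as described above; p is a hypothetical callable returning P(t | M_d):

```python
import math

def ratio2(term: str, rel_docs: list, nonrel_docs: list, p) -> float:
    """Log ratio of the term's average probability in judged relevant
    documents to its average probability in judged non-relevant documents."""
    avg_rel = sum(p(term, d) for d in rel_docs) / len(rel_docs)
    avg_non = sum(p(term, d) for d in nonrel_docs) / len(nonrel_docs)
    return math.log(avg_rel / avg_non)
```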
22 5. Query Term Weighting
Probability estimation
- Maximum likelihood probability
- The average probability, combined using a geometric risk function
Risk function
- The current risk function treats all terms equally
- The change will be to mix the estimation: a useless term or stop word is assigned an equal probability estimate for every document, so that it has no effect on the ranking; the mixture can be user specified (see the sketch below)
Language models for queries
- Queries: a specific type of text produced by the user
- The term weights: equivalent to the generation probabilities of the query model
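One way to realize the mixing idea is a user-weighted interpolation with a term-independent constant; this form is an assumption for illustration, not the paper's exact formulation:

```python
def weighted_p_term(p_doc: float, uniform_p: float, weight: float) -> float:
    """Interpolate the document-model estimate with a constant that is the
    same for every document. weight = 0 makes the term act like a stop word
    (identical estimate everywhere, hence no effect on the ranking);
    weight = 1 recovers the original document-model estimate."""
    return weight * p_doc + (1.0 - weight) * uniform_p
```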