Probabilistic Retrieval Model
Classification Problem
For each query, assume:
– R = the set of relevant docs
– NR = the set of nonrelevant docs
For each document, ask: what is the probability that it belongs to one set or the other?
Retrieve d_j if P(d_j is relevant) > P(d_j is not relevant).
Bayes Theorem
Probability based on related occurrences.
So P(R|d_i) is the probability that a doc is relevant given that it has been retrieved.
Example: P(H|E) is the probability that it is July (hypothesis) given that it is hot (event):
P(H|E) = P(E|H) · P(H) / Σ_i P(E|H_i) · P(H_i)
The numerator uses the probability that it is hot given that it is July; the denominator sums over all hypotheses (hot given July, hot given January, etc.).
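A tiny numeric sketch of this computation in Python; the months and every probability value below are invented purely for illustration.

```python
# Bayes' theorem: P(H|E) = P(E|H) * P(H) / sum_i P(E|H_i) * P(H_i)
# H = "it is July", E = "it is hot". All numbers are hypothetical.

priors = {"July": 1 / 12, "Jan": 1 / 12, "other": 10 / 12}   # P(H_i)
p_hot_given = {"July": 0.8, "Jan": 0.05, "other": 0.4}       # P(E|H_i)

# Denominator: total probability of the evidence.
p_hot = sum(p_hot_given[h] * priors[h] for h in priors)

# Posterior: probability it is July given that it is hot.
p_july_given_hot = p_hot_given["July"] * priors["July"] / p_hot
print(f"P(July | hot) = {p_july_given_hot:.3f}")
```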
Assumption
The distribution of the keywords of interest differs between the relevant docs and the nonrelevant docs.
Also known as the cluster hypothesis.
Example
"Getting visas for immigration to Australia and migration within the borders requires a two-week entry permit…"
"The long-range migration pattern of geese, interestingly enough, does not include the southern Pacific…"
Both passages contain "migration", but they come from very different populations of documents.
How to estimate these probabilities?
Assume relevance depends only on the query and the document representation (keywords).
We are computing the odds of a given doc being relevant to a given query:
P(d_j rel to q) / P(d_j not rel to q)
Use this ratio to rank documents.
Similarity as Odds
Sim(d_j, q) = P(d_j is rel) / P(d_j is not rel)
Using Bayes' theorem:
Sim(d_j, q) = [P(d_j|R) · P(R)] / [P(d_j|NR) · P(NR)]
Move from Docs to Terms
Assuming independence of terms, P(k_i|R) is the probability that a relevant doc contains the term k_i.
Remember that a term either occurs in a doc or it does not (and may also occur in NR docs), so P(k_i|R) + P(k̄_i|R) = 1.
Sim(d_j, q) ~ Σ_i w_{i,q} · w_{i,j} · ( log[ P(k_i|R) / (1 − P(k_i|R)) ] + log[ (1 − P(k_i|NR)) / P(k_i|NR) ] )
This GIVES us a RANK.
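A minimal Python sketch of this ranking formula, assuming binary weights (w_{i,q} = w_{i,j} = 1 for terms that are present) and hypothetical probability estimates; all names are illustrative.

```python
import math

def bir_term_weight(p_rel: float, p_nonrel: float) -> float:
    """Log-odds weight of one term:
    log(P(k|R)/(1-P(k|R))) + log((1-P(k|NR))/P(k|NR))."""
    return math.log(p_rel / (1 - p_rel)) + math.log((1 - p_nonrel) / p_nonrel)

def bir_score(doc_terms: set, query_terms: set,
              p_rel: dict, p_nonrel: dict) -> float:
    """Sum the term weights over terms shared by query and doc (binary weights)."""
    return sum(bir_term_weight(p_rel[t], p_nonrel[t])
               for t in query_terms & doc_terms)

# Hypothetical estimates for two query terms:
p_rel = {"migration": 0.5, "visa": 0.5}       # initial even-odds guess
p_nonrel = {"migration": 0.3, "visa": 0.05}   # collection-based guess, n_i/N
doc = {"migration", "visa", "australia"}
print(bir_score(doc, {"migration", "visa"}, p_rel, p_nonrel))  # ~3.79
```

Note that with the even-odds starting guess P(k_i|R) = 0.5 (next slide), the first log term vanishes and the weight reduces to an idf-like quantity.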
OK, now what?
Work with keywords with binary weights (0 and 1): the query is a set of keywords, and a doc is a set of keywords.
We need P(k_i|R), the probability that a keyword occurs in one of the relevant docs.
Getting Started
1. Assume P(k_i|R) is constant over all k_i, equal to 0.5 (even odds) for any given doc. We are looking for terms that do not fit this!
2. Assume P(k_i|NR) = n_i / N, i.e., based on the distribution of terms over the whole collection (n_i = number of docs containing k_i, N = total number of docs).
Finding P(k_i)
1. First, retrieve a set of docs and take the retrieved set V as an approximation of R; V_i is the subset of V containing keyword k_i. We need to improve our guesses for P(k_i|R) and P(k_i|NR).
2. Use the distribution of k_i over the docs in V: P(k_i|R) = V_i / V
3. Assume that docs not retrieved are not relevant: P(k_i|NR) = (n_i − V_i) / (N − V)
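A sketch of both estimation steps (the starting guesses and this refinement) in Python, assuming each doc is represented as a set of terms; the data structures and function names are illustrative.

```python
# docs: dict mapping doc_id -> set of terms; V: set of retrieved doc ids.

def initial_estimates(term, docs):
    """Getting started: even odds for P(k_i|R), collection frequency for P(k_i|NR)."""
    N = len(docs)
    n_i = sum(1 for terms in docs.values() if term in terms)
    return 0.5, n_i / N                       # P(k_i|R), P(k_i|NR)

def updated_estimates(term, docs, V):
    """Refinement: treat the retrieved set V as relevant, the rest as nonrelevant."""
    N = len(docs)
    n_i = sum(1 for terms in docs.values() if term in terms)
    V_i = sum(1 for d in V if term in docs[d])
    p_rel = V_i / len(V)                      # P(k_i|R)  = V_i / V
    p_nonrel = (n_i - V_i) / (N - len(V))     # P(k_i|NR) = (n_i - V_i) / (N - V)
    return p_rel, p_nonrel
```

In practice these ratios are usually smoothed (e.g., (V_i + 0.5) / (V + 1)) so that no probability is exactly 0 or 1, which would make the log-odds weights blow up; the slides omit this adjustment.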
Now
Use the new probabilities to rerank the docs, and try again.
This can be done without human judgement, BUT it helps to get real relevance feedback at step 1.
Good and Bad News
Advantages:
– A principled ranking scheme (docs ordered by estimated probability of relevance)
Disadvantages:
– Making the initial guess to get V_i
– Binary weights
– Assumed independence of terms
– Computational cost
Relevance Feedback
Relevance Feedback
The problem: queries average only about 2.2 terms and carry no (explicit) structure.
Relevance feedback, done manually, can:
– Add terms
– Remove terms
– Adjust the weights, if possible
– Add/remove operators
What can we do automatically?
– Change the query based on the documents retrieved
– Change the query based on user preferences
– Change the query based on user history
– Change the query based on a community of users
Hypothesis
A better query can be discovered by analyzing the features in relevant and in nonrelevant items.
Feedback and VSM
Q_0 = (q_1, q_2, … q_t), where q_i is the weight of query term i; running it generates the hit list H_0.
Q' = (q_1', q_2', … q_t'), where q_i' is the altered weight of query term i.
Add a term to the query by increasing its weight above 0; drop a term by decreasing its weight to 0.
VSM View
Move the query vector in the t-dimensional term space from an area of lower density to an area of higher density of close documents.
Optimal Query and VSM
Given Sim(D_j, Q) = Σ_i d_ij · q_i, the optimal query (with D_j a term vector and |D_j| its Euclidean length) is:
Q_opt = (1/|R|) Σ_{D_j ∈ R} D_j / |D_j| − (1/(N − |R|)) Σ_{D_j ∉ R} D_j / |D_j|
Feedback from Relevant Docs Retrieved
Keep the original query; replace the sums over all of R and NR with sums over the known relevant docs R' and known nonrelevant docs NR' from the retrieved set:
Q_1 = α·Q_0 + β·(1/|R'|) Σ_{D_j ∈ R'} D_j − γ·(1/|NR'|) Σ_{D_j ∈ NR'} D_j
and in general, iterating:
Q_{i+1} = α·Q_i + β·(1/|R'|) Σ_{D_j ∈ R'} D_j − γ·(1/|NR'|) Σ_{D_j ∈ NR'} D_j
Example
Q' = α·Q + β·ΣR − γ·ΣNR
Q_0 = (5, 0, 3, 0, 1)
Relevant: D_1 = (2, 1, 2, 0, 0)
Nonrelevant: D_2 = (1, 0, 0, 0, 2)
α = 1, β = .5, γ = .25
Q_1 = (5, 0, 3, 0, 1) + .5·(2, 1, 2, 0, 0) − .25·(1, 0, 0, 0, 2) = (5.75, .5, 4, 0, .5)
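A short Python sketch of this update, without normalization by the number of judged docs, matching the example above; clipping negative weights to zero is a common convention rather than something the slide specifies.

```python
def rocchio(q, rel_docs, nonrel_docs, alpha=1.0, beta=0.5, gamma=0.25):
    """Q' = alpha*Q + beta*sum(relevant) - gamma*sum(nonrelevant)."""
    new_q = [alpha * q[i]
             + beta * sum(d[i] for d in rel_docs)
             - gamma * sum(d[i] for d in nonrel_docs)
             for i in range(len(q))]
    return [max(w, 0.0) for w in new_q]   # clip negative weights to 0

q0 = [5, 0, 3, 0, 1]
print(rocchio(q0, rel_docs=[[2, 1, 2, 0, 0]], nonrel_docs=[[1, 0, 0, 0, 2]]))
# -> [5.75, 0.5, 4.0, 0.0, 0.5], matching the slide
```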
Variations
– Don't normalize by the number of judged docs
– Use only the highest-ranked nonrelevant docs; effective with few judged docs
– Rocchio: choose β and γ = 1 (for many judged docs)
– Expanding by all terms is effective; expanding by only the most highly weighted terms is not!
Relevance Feedback for Boolean
Examine the terms in relevant docs.
Discover conjuncts (t1 AND t2):
– Phrase detection
– Persistent co-occurrences ("box car")
Discover disjuncts (t1 OR t3):
– Thesaurus
– Occasional co-occurrences ("auto", "car")
– Terms that co-occur with common friends ("auto & car", "car & sedan")
A simple co-occurrence counter is sketched below.
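One simple way to surface such pairs, sketched in Python under the assumption that each judged-relevant doc is a set of terms; the threshold and names are illustrative.

```python
from collections import Counter
from itertools import combinations

def cooccurring_pairs(relevant_docs, min_count=2):
    """Count term pairs appearing together in judged-relevant docs; pairs that
    recur are candidates for AND-conjuncts or thesaurus (OR) entries."""
    pair_counts = Counter()
    for terms in relevant_docs:                 # each doc is a set of terms
        for pair in combinations(sorted(terms), 2):
            pair_counts[pair] += 1
    return [p for p, c in pair_counts.items() if c >= min_count]

docs = [{"box", "car", "train"}, {"box", "car", "freight"}, {"auto", "sedan"}]
print(cooccurring_pairs(docs))   # -> [('box', 'car')]
```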
Relevance Feedback Summary
– Can be very effective
– Needs a reasonable number of judged docs; results are unpredictable with fewer than 5
– Can be used with both VSM and Boolean models
– Requires either direct input from users or monitoring (time spent, printing, saving, etc.)