Probabilistic Retrieval Model
Classification Problem
For each query, assume:
– R = the set of relevant docs
– NR = the set of nonrelevant docs
For each document, ask: what is the probability that it belongs to one set or the other?
Retrieve d_j if P(d_j is relevant) > P(d_j is not relevant).
Bayes Theorem
Probability based on related occurrences.
So P(R|d_i) is the probability that a doc is relevant given that it has been retrieved.
Example: P(H|E) is the probability that it is July (hypothesis) given that it is hot (event):
P(H|E) = P(E|H) · P(H) / Σ_i P(E|H_i) · P(H_i)
The numerator uses the probability that it is hot given that it is July; the denominator sums over all hypotheses (hot given July, hot given January, etc.).
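A tiny numeric sketch of this computation in Python; the months and every probability value below are invented purely for illustration.

```python
# Bayes' theorem: P(H|E) = P(E|H) * P(H) / sum_i P(E|H_i) * P(H_i)
# H = "it is July", E = "it is hot". All numbers are hypothetical.

priors = {"July": 1 / 12, "Jan": 1 / 12, "other": 10 / 12}   # P(H_i)
p_hot_given = {"July": 0.8, "Jan": 0.05, "other": 0.4}       # P(E|H_i)

# Denominator: total probability of the evidence.
p_hot = sum(p_hot_given[h] * priors[h] for h in priors)

# Posterior: probability it is July given that it is hot.
p_july_given_hot = p_hot_given["July"] * priors["July"] / p_hot
print(f"P(July | hot) = {p_july_given_hot:.3f}")
```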
Assumption
The distribution of the keywords of interest differs between the relevant docs and the nonrelevant docs.
Also known as the cluster hypothesis.
Example
"Getting visas for immigration to Australia and migration within the borders requires a two-week entry permit…"
"The long-range migration pattern of geese, interestingly enough, does not include the southern Pacific…"
Both passages contain "migration", but they come from very different populations of documents.
How to estimate these probabilities?
Assume relevance depends only on the query and the document representation (keywords).
We are computing the odds of a given doc being relevant to a given query:
P(d_j rel to q) / P(d_j not rel to q)
Use this ratio to rank documents.
Similarity as Odds
Sim(d_j, q) = P(d_j is rel) / P(d_j is not rel)
Using Bayes' theorem:
Sim(d_j, q) = [P(d_j|R) · P(R)] / [P(d_j|NR) · P(NR)]
Move from Docs to Terms
Assuming independence of terms, P(k_i|R) is the probability that a relevant doc contains the term k_i.
Remember that a term either occurs in a doc or it does not (and may also occur in NR docs), so P(k_i|R) + P(k̄_i|R) = 1.
Sim(d_j, q) ~ Σ_i w_{i,q} · w_{i,j} · ( log[ P(k_i|R) / (1 − P(k_i|R)) ] + log[ (1 − P(k_i|NR)) / P(k_i|NR) ] )
This GIVES us a RANK.
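A minimal Python sketch of this ranking formula, assuming binary weights (w_{i,q} = w_{i,j} = 1 for terms that are present) and hypothetical probability estimates; all names are illustrative.

```python
import math

def bir_term_weight(p_rel: float, p_nonrel: float) -> float:
    """Log-odds weight of one term:
    log(P(k|R)/(1-P(k|R))) + log((1-P(k|NR))/P(k|NR))."""
    return math.log(p_rel / (1 - p_rel)) + math.log((1 - p_nonrel) / p_nonrel)

def bir_score(doc_terms: set, query_terms: set,
              p_rel: dict, p_nonrel: dict) -> float:
    """Sum the term weights over terms shared by query and doc (binary weights)."""
    return sum(bir_term_weight(p_rel[t], p_nonrel[t])
               for t in query_terms & doc_terms)

# Hypothetical estimates for two query terms:
p_rel = {"migration": 0.5, "visa": 0.5}       # initial even-odds guess
p_nonrel = {"migration": 0.3, "visa": 0.05}   # collection-based guess, n_i/N
doc = {"migration", "visa", "australia"}
print(bir_score(doc, {"migration", "visa"}, p_rel, p_nonrel))  # ~3.79
```

Note that with the even-odds starting guess P(k_i|R) = 0.5 (next slide), the first log term vanishes and the weight reduces to an idf-like quantity.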
OK, now what?
Work with keywords with binary weights (0 and 1): the query is a set of keywords, and a doc is a set of keywords.
We need P(k_i|R), the probability that a keyword occurs in one of the relevant docs.
Getting Started
1. Assume P(k_i|R) is constant over all k_i, equal to 0.5 (even odds) for any given doc. We are looking for terms that do not fit this!
2. Assume P(k_i|NR) = n_i / N, i.e., based on the distribution of terms over the whole collection (n_i = number of docs containing k_i, N = total number of docs).
Finding P(k_i)
1. First, retrieve a set of docs and take the retrieved set V as an approximation of R; V_i is the subset of V containing keyword k_i. We need to improve our guesses for P(k_i|R) and P(k_i|NR).
2. Use the distribution of k_i over the docs in V: P(k_i|R) = V_i / V
3. Assume that docs not retrieved are not relevant: P(k_i|NR) = (n_i − V_i) / (N − V)
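A sketch of both estimation steps (the starting guesses and this refinement) in Python, assuming each doc is represented as a set of terms; the data structures and function names are illustrative.

```python
# docs: dict mapping doc_id -> set of terms; V: set of retrieved doc ids.

def initial_estimates(term, docs):
    """Getting started: even odds for P(k_i|R), collection frequency for P(k_i|NR)."""
    N = len(docs)
    n_i = sum(1 for terms in docs.values() if term in terms)
    return 0.5, n_i / N                       # P(k_i|R), P(k_i|NR)

def updated_estimates(term, docs, V):
    """Refinement: treat the retrieved set V as relevant, the rest as nonrelevant."""
    N = len(docs)
    n_i = sum(1 for terms in docs.values() if term in terms)
    V_i = sum(1 for d in V if term in docs[d])
    p_rel = V_i / len(V)                      # P(k_i|R)  = V_i / V
    p_nonrel = (n_i - V_i) / (N - len(V))     # P(k_i|NR) = (n_i - V_i) / (N - V)
    return p_rel, p_nonrel
```

In practice these ratios are usually smoothed (e.g., (V_i + 0.5) / (V + 1)) so that no probability is exactly 0 or 1, which would make the log-odds weights blow up; the slides omit this adjustment.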
Now
Use the new probabilities to rerank the docs, and try again.
This can be done without human judgement, BUT it helps to get real relevance feedback at step 1.
Good and Bad News
Advantages:
– A principled ranking scheme (docs ordered by estimated probability of relevance)
Disadvantages:
– Making the initial guess to get V_i
– Binary weights
– Assumed independence of terms
– Computational cost
Relevance Feedback
Relevance Feedback
The problem: queries average only about 2.2 terms and carry no (explicit) structure.
Relevance feedback, done manually, can:
– Add terms
– Remove terms
– Adjust the weights, if possible
– Add/remove operators
What can we do automatically?
– Change the query based on the documents retrieved
– Change the query based on user preferences
– Change the query based on user history
– Change the query based on a community of users
Hypothesis
A better query can be discovered by analyzing the features in relevant and in nonrelevant items.
Feedback and VSM
Q_0 = (q_1, q_2, … q_t), where q_i is the weight of query term i; running it generates the hit list H_0.
Q' = (q_1', q_2', … q_t'), where q_i' is the altered weight of query term i.
Add a term to the query by increasing its weight above 0; drop a term by decreasing its weight to 0.
VSM View
Move the query vector in the t-dimensional term space from an area of lower density to an area of higher density of close documents.
Optimal Query and VSM
Given Sim(D_j, Q) = Σ_i d_ij · q_i, the optimal query (with D_j a term vector and |D_j| its Euclidean length) is:
Q_opt = (1/|R|) Σ_{D_j ∈ R} D_j / |D_j| − (1/(N − |R|)) Σ_{D_j ∉ R} D_j / |D_j|
Feedback from Relevant Docs Retrieved
Keep the original query; replace the sums over all of R and NR with sums over the known relevant docs R' and known nonrelevant docs NR' from the retrieved set:
Q_1 = α·Q_0 + β·(1/|R'|) Σ_{D_j ∈ R'} D_j − γ·(1/|NR'|) Σ_{D_j ∈ NR'} D_j
and in general, iterating:
Q_{i+1} = α·Q_i + β·(1/|R'|) Σ_{D_j ∈ R'} D_j − γ·(1/|NR'|) Σ_{D_j ∈ NR'} D_j
Example
Q' = α·Q + β·ΣR − γ·ΣNR
Q_0 = (5, 0, 3, 0, 1)
Relevant: D_1 = (2, 1, 2, 0, 0)
Nonrelevant: D_2 = (1, 0, 0, 0, 2)
α = 1, β = .5, γ = .25
Q_1 = (5, 0, 3, 0, 1) + .5·(2, 1, 2, 0, 0) − .25·(1, 0, 0, 0, 2) = (5.75, .5, 4, 0, .5)
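A short Python sketch of this update, without normalization by the number of judged docs, matching the example above; clipping negative weights to zero is a common convention rather than something the slide specifies.

```python
def rocchio(q, rel_docs, nonrel_docs, alpha=1.0, beta=0.5, gamma=0.25):
    """Q' = alpha*Q + beta*sum(relevant) - gamma*sum(nonrelevant)."""
    new_q = [alpha * q[i]
             + beta * sum(d[i] for d in rel_docs)
             - gamma * sum(d[i] for d in nonrel_docs)
             for i in range(len(q))]
    return [max(w, 0.0) for w in new_q]   # clip negative weights to 0

q0 = [5, 0, 3, 0, 1]
print(rocchio(q0, rel_docs=[[2, 1, 2, 0, 0]], nonrel_docs=[[1, 0, 0, 0, 2]]))
# -> [5.75, 0.5, 4.0, 0.0, 0.5], matching the slide
```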
Variations
– Don't normalize by the number of judged docs
– Use only the highest-ranked nonrelevant docs; effective with few judged docs
– Rocchio: choose β and γ = 1 (for many judged docs)
– Expanding by all terms is effective; expanding by only the most highly weighted terms is not!
Relevance Feedback for Boolean
Examine the terms in relevant docs.
Discover conjuncts (t1 AND t2):
– Phrase detection
– Persistent co-occurrences ("box car")
Discover disjuncts (t1 OR t3):
– Thesaurus
– Occasional co-occurrences ("auto", "car")
– Terms that co-occur with common friends ("auto & car", "car & sedan")
A simple co-occurrence counter is sketched below.
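One simple way to surface such pairs, sketched in Python under the assumption that each judged-relevant doc is a set of terms; the threshold and names are illustrative.

```python
from collections import Counter
from itertools import combinations

def cooccurring_pairs(relevant_docs, min_count=2):
    """Count term pairs appearing together in judged-relevant docs; pairs that
    recur are candidates for AND-conjuncts or thesaurus (OR) entries."""
    pair_counts = Counter()
    for terms in relevant_docs:                 # each doc is a set of terms
        for pair in combinations(sorted(terms), 2):
            pair_counts[pair] += 1
    return [p for p, c in pair_counts.items() if c >= min_count]

docs = [{"box", "car", "train"}, {"box", "car", "freight"}, {"auto", "sedan"}]
print(cooccurring_pairs(docs))   # -> [('box', 'car')]
```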
Relevance Feedback Summary
– Can be very effective
– Needs a reasonable number of judged docs; results are unpredictable with fewer than 5
– Can be used with both VSM and Boolean models
– Requires either direct input from users or monitoring (time spent, printing, saving, etc.)