
1 Retrieval Models II: Vector Space, Probabilistic

2 © Allan, Ballesteros, Croft, and/or Turtle
Properties of Inner Product
The inner product is unbounded.
It favors long documents with a large number of unique terms.
It measures how many terms matched, but not how many terms went unmatched.

3 © Allan, Ballesteros, Croft, and/or Turtle
Inner Product: Examples
Binary, with vocabulary {retrieval, database, architecture, computer, text, management, information}; size of vector = size of vocabulary = 7, and a 0 means the corresponding term is not found in the document or query:
– D = (1, 1, 1, 0, 1, 1, 0)
– Q = (1, 0, 1, 0, 0, 1, 1)
– sim(D, Q) = 3
Weighted:
– D1 = 2T1 + 3T2 + 5T3, D2 = 3T1 + 7T2 + 1T3, Q = 0T1 + 0T2 + 2T3
– sim(D1, Q) = 2·0 + 3·0 + 5·2 = 10
– sim(D2, Q) = 3·0 + 7·0 + 1·2 = 2
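
A minimal sketch of the inner-product score in Python (the vectors are the toy examples above; the function name is mine):

```python
def inner_product(d, q):
    """Inner-product similarity: sum of pairwise products of term weights."""
    return sum(dw * qw for dw, qw in zip(d, q))

# Binary example over the 7-term vocabulary
D = [1, 1, 1, 0, 1, 1, 0]
Q = [1, 0, 1, 0, 0, 1, 1]
print(inner_product(D, Q))  # 3

# Weighted example
D1, D2, Qw = [2, 3, 5], [3, 7, 1], [0, 0, 2]
print(inner_product(D1, Qw), inner_product(D2, Qw))  # 10 2
```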

4 © Allan, Ballesteros, Croft, and/or Turtle
Cosine Similarity Measure
Cosine similarity measures the cosine of the angle between two vectors: the inner product normalized by the vector lengths.
CosSim(d_j, q) = (d_j · q) / (|d_j| |q|)
D1 = 2T1 + 3T2 + 5T3; CosSim(D1, Q) = 10 / sqrt((4+9+25)·(0+0+4)) = 0.81
D2 = 3T1 + 7T2 + 1T3; CosSim(D2, Q) = 2 / sqrt((9+49+1)·(0+0+4)) = 0.13
Q = 0T1 + 0T2 + 2T3
[Figure: D1, D2, and Q plotted as vectors in the three-dimensional term space (t1, t2, t3).]
D1 is 6 times better than D2 using cosine similarity, but only 5 times better using the inner product.
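
The normalization is easy to add to the same sketch (again a toy illustration, not from the slides):

```python
import math

def cos_sim(d, q):
    """Cosine similarity: inner product divided by the product of the vector lengths."""
    dot = sum(dw * qw for dw, qw in zip(d, q))
    return dot / (math.sqrt(sum(w * w for w in d)) * math.sqrt(sum(w * w for w in q)))

D1, D2, Q = [2, 3, 5], [3, 7, 1], [0, 0, 2]
print(round(cos_sim(D1, Q), 2))  # 0.81
print(round(cos_sim(D2, Q), 2))  # 0.13
```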

5 © Allan, Ballesteros, Croft, and/or Turtle
Simple Implementation
1. Convert all documents in collection D to tf-idf weighted vectors, d_j, over the keyword vocabulary V.
2. Convert the query to a tf-idf weighted vector q.
3. For each d_j in D, compute the score s_j = CosSim(d_j, q).
4. Sort the documents by decreasing score.
5. Present the top-ranked documents to the user.
Time complexity: O(|V| · |D|). Bad for large V and D!
– |V| = 10,000; |D| = 100,000; |V| · |D| = 1,000,000,000
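
A minimal sketch of steps 3-5, assuming the tf-idf vectors have already been built (the weighting step itself is omitted):

```python
import math

def cos_sim(d, q):
    dot = sum(dw * qw for dw, qw in zip(d, q))
    return dot / (math.sqrt(sum(w * w for w in d)) * math.sqrt(sum(w * w for w in q)))

def rank(docs, q):
    """Steps 3-5: score every document vector against q, sort by decreasing score."""
    return sorted(((doc_id, cos_sim(d, q)) for doc_id, d in docs.items()),
                  key=lambda pair: pair[1], reverse=True)

docs = {"D1": [2, 3, 5], "D2": [3, 7, 1]}  # toy vectors from the earlier slide
for doc_id, score in rank(docs, [0, 0, 2]):
    print(doc_id, round(score, 2))  # D1 0.81, then D2 0.13
```

Note that the loop touches every document, which is exactly the O(|V| · |D|) cost above; real systems avoid it with an inverted index over V.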

6 © Allan, Ballesteros, Croft, and/or Turtle
Comments on Vector Space Models
Simple, mathematically based approach.
Considers both local (tf) and global (idf) word occurrence frequencies.
Provides partial matching and ranked results.
Tends to work quite well in practice despite obvious weaknesses.
Allows efficient implementation for large document collections.

7 © Allan, Ballesteros, Croft, and/or Turtle
Problems with the Vector Space Model
Assumption of term independence.
Missing semantic information (e.g., word sense).
Missing syntactic information (e.g., phrase structure, word order, proximity).
Lacks the control of a Boolean model (e.g., requiring a term to appear in a document).
– Given a two-term query "A B", it may prefer a document containing A frequently but not B over a document that contains both A and B, each less frequently.

8 © Allan, Ballesteros, Croft, and/or Turtle
Statistical Models
A document is typically represented by a bag of words (unordered words with frequencies).
Bag = a set that allows multiple occurrences of the same element.
The user specifies a set of desired terms with optional weights:
– Weighted query terms: Q =
– Unweighted query terms: Q =
– No Boolean conditions specified in the query.

9 © Allan, Ballesteros, Croft, and/or Turtle
Statistical Retrieval
Retrieval via similarity based on the probability of relevance to Q.
Given Q, the set of all documents is partitioned into the sets rel and nonrel.
– The sets rel and nonrel change from query to query.
Output documents are ranked according to probability of relevance to the query.
– Pr(relevance) of each document to the query is not available in practice.

10 © Allan, Ballesteros, Croft, and/or Turtle
Basic Probabilistic Retrieval Model
We need a similarity function s so that:
– P(rel|D_i) > P(rel|D_j) iff s(Q, D_i) > s(Q, D_j)
Retrieve if P(relevant|D) > P(non-relevant|D).
– In practice, calculate P(D|R) / P(D|NR).
Different ways of estimating these probabilities lead to different probabilistic models.

11 © Allan, Ballesteros, Croft, and/or Turtle
Probability
Experiment: a specific set of actions whose results cannot be predicted with certainty.
– e.g., rolling two dice and recording their values
Simple outcome: each possible set of recorded data.
– For the example, each pair is a simple outcome:
(1,1) (2,1) (3,1) ... (6,1)
(1,2) (2,2) (3,2) ... (6,2)
(1,3) (2,3) (3,3) ... (6,3)
(1,4) (2,4) (3,4) ... (6,4)
(1,5) (2,5) (3,5) ... (6,5)
(1,6) (2,6) (3,6) ... (6,6)

12 © Allan, Ballesteros, Croft, and/or Turtle
Probability
Sample space: a non-empty set containing all possible simple outcomes of the experiment.
– For the two-dice experiment, the 36 pairs above.
– Each element is known as a sample point.

13 © Allan, Ballesteros, Croft, and/or Turtle
Probability
Event space: subsets of the sample space defined by a specific event or outcome.
– e.g., the event that the sum of the two dice is 4

14 © Allan, Ballesteros, Croft, and/or Turtle
Event Space
The probability of an event is the sum of the probabilities of the sample points associated with the event.
– What is the probability that the sum is 4?
Recall that sample points represent the possible outcomes of a statistical "experiment".
– 36 possible outcomes when rolling 2 dice
– 3 ways to get a sum of 4: (1,3), (2,2), (3,1)
– Pr(sum is 4) = 3/36 = 1/12
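
The count is easy to verify by enumerating the sample space (a quick check in Python, not part of the slides):

```python
from itertools import product

# All 36 equally likely outcomes of rolling two dice
outcomes = list(product(range(1, 7), repeat=2))
favorable = [o for o in outcomes if sum(o) == 4]
print(favorable)                            # [(1, 3), (2, 2), (3, 1)]
print(len(favorable), "/", len(outcomes))   # 3 / 36, i.e. 1/12
```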

15 © Allan, Ballesteros, Croft, and/or Turtle
Event Space
For a retrieval model, the event space is Q × D:
– each sample point is a query-document pair
– each pair has an associated relevance judgment
For a particular query, a probabilistic model tries to estimate P(R|D).

16 © Allan, Ballesteros, Croft, and/or Turtle
Probability Ranking Principle
Ranking documents in decreasing order of probability of relevance to the query, where the probabilities are estimated using all available evidence, produces the best possible effectiveness.
– Assumes the relevance of a document is independent of the other documents in the collection.
Bayes Decision Rule: retrieve if P(R|D) > P(NR|D).
– This minimizes the average probability of error.
– Equivalent to optimizing the recall/fallout tradeoff.

17 © Allan, Ballesteros, Croft, and/or Turtle
Basic Probabilistic Model
Doc d = (t1, t2, ..., tn), where ti = 0 means index term ti is absent and ti = 1 means it is present.
– pi = P(ti = 1 | R) and 1 - pi = P(ti = 0 | R)
– qi = P(ti = 1 | NR) and 1 - qi = P(ti = 0 | NR)
Assume conditional independence:
– P(d|R) is the product of the probabilities for the components of d (i.e., the product of the probabilities of getting a particular vector of 1s and 0s).
The appearance of a term in a doc is interpreted either as
– evidence that the document is relevant, or
– evidence that the document is non-relevant.
The key is finding a means of estimating pi and qi:
– pi is the probability the term is present given Relevant
– qi is the probability the term is present given Not Relevant
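
Under the conditional-independence assumption, the ratio P(d|R) / P(d|NR) factors over terms. A minimal sketch (the p and q values below are hypothetical, chosen only for illustration):

```python
def likelihood_ratio(d, p, q):
    """P(d|R) / P(d|NR) for a binary term vector d under conditional independence.
    p[i] = P(t_i = 1 | R); q[i] = P(t_i = 1 | NR)."""
    ratio = 1.0
    for ti, pi, qi in zip(d, p, q):
        if ti == 1:
            ratio *= pi / qi              # term present in d
        else:
            ratio *= (1 - pi) / (1 - qi)  # term absent from d
    return ratio

p = [0.8, 0.4, 0.3]   # hypothetical P(term present | relevant)
q = [0.3, 0.35, 0.3]  # hypothetical P(term present | non-relevant)
print(likelihood_ratio([1, 0, 1], p, q))  # > 1: evidence for relevance
```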

18 © Allan, Ballesteros, Croft, and/or Turtle
Basic Probabilistic Model
Need to calculate
– "relevant to query given term appears", and
– "irrelevant to query given term appears".
These values can be based upon some known relevance judgments.

19 © Allan, Ballesteros, Croft, and/or Turtle
Estimation with Relevance Information
– N = total # docs; R = # relevant docs; N - R = # non-relevant docs
– f_t = # docs with term t
– R_t = # relevant docs with term t
– f_t - R_t = # non-relevant docs with term t
We can estimate the conditional probabilities from these counts:
– P(rel | t present) = R_t / f_t
– P(nonrel | t present) = (f_t - R_t) / f_t
– P(t present | rel) = R_t / R
– P(t present | nonrel) = (f_t - R_t) / (N - R)
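
The four estimates are direct ratios of the table counts. A small sketch (the function name is mine; the counts are the ones used on the next slide):

```python
def estimates(N, R, f_t, R_t):
    """Conditional probability estimates from the relevance contingency counts."""
    return {
        "P(rel | t present)":    R_t / f_t,
        "P(nonrel | t present)": (f_t - R_t) / f_t,
        "P(t present | rel)":    R_t / R,
        "P(t present | nonrel)": (f_t - R_t) / (N - R),
    }

print(estimates(N=20, R=13, f_t=12, R_t=11))
```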

20 © Allan, Ballesteros, Croft, and/or Turtle
Estimation with Relevance Information
w_t = (R_t / (R - R_t)) / ((f_t - R_t) / (N - f_t - (R - R_t)))
– numerator: ratio of relevant docs with the term to relevant docs without it
– denominator: ratio of non-relevant docs with the term to non-relevant docs without it
Suppose N = 20 and R = 13 relevant docs; term t appears in 11 relevant docs and in 12 docs overall:
– w_t = (11/(13-11)) / ((12-11)/(20-12-(13-11))) = 5.5 / 0.167 ≈ 33
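
The same computation in Python (a direct transcription of the formula; the helper name is mine):

```python
def term_weight(N, R, f_t, R_t):
    """w_t: odds of the term appearing in relevant docs divided by its odds
    in non-relevant docs."""
    rel_odds = R_t / (R - R_t)
    nonrel_odds = (f_t - R_t) / (N - f_t - (R - R_t))
    return rel_odds / nonrel_odds

print(term_weight(N=20, R=13, f_t=12, R_t=11))  # 33.0
print(term_weight(N=20, R=13, f_t=7, R_t=4))    # ~0.59, the next slide's example
```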

21 © Allan, Ballesteros, Croft, and/or Turtle
Estimation with Relevance Information
Think of w_t as the extent to which the term can discriminate between relevant and non-relevant docs:
– w_t = (R_t / (R - R_t)) / ((f_t - R_t) / (N - f_t - (R - R_t)))
– w_t = 33: t is strongly indicative of relevance, since it appears frequently in relevant documents and rarely in non-relevant ones.
What if N = 20, R = 13, R_t = 4, f_t = 7?
– w_t = (4/9) / (3/4) ≈ 0.59: t counts slightly against the doc being relevant.
– w_t = 1 indicates a neutral term, since it appears randomly across relevant and non-relevant docs.

22 © Allan, Ballesteros, Croft, and/or Turtle
Estimation with Relevance Information
w_t = (R_t / (R - R_t)) / ((f_t - R_t) / (N - f_t - (R - R_t)))
w_t = 1 indicates a neutral term, since it appears randomly across relevant and non-relevant docs.
Assuming that the occurrences of terms in documents are independent:
– a document's weight is the product of its term weights: w(d) = ∏ w_t
– it is conventional to express this as a sum of logs: ∑ log w_t
– negative values indicate non-relevance; 0 indicates there is as much evidence for relevance as for non-relevance.
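
A minimal sketch of the log-space scoring (the w_t values are the two computed above, standing in for a toy document's terms):

```python
import math

def doc_score(term_weights):
    """Document score as a sum of log term weights, i.e. the log of the product.
    Negative => evidence against relevance; 0 => neutral."""
    return sum(math.log(w) for w in term_weights)

print(doc_score([33.0, 0.59]))  # log(33) + log(0.59) ≈ 2.97: net evidence for relevance
```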

23 © Allan, Ballesteros, Croft, and/or Turtle
Estimation
Relevance information is usually not available.
Estimate probabilities based on information in the query and the collection.
– Previous queries can also be used with some learning approaches.
If q_i (the probability of occurrence in non-relevant documents) is estimated as f_t/N, the second part of the weight is log((1 - q_i)/q_i) = log((N - f_t)/f_t), which for large N is the IDF weight log(N/f_t).
– The non-relevant documents are approximated by the whole collection.
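
A quick check that the two quantities agree for large collections (toy numbers, not from the slides):

```python
import math

def second_part(N, f_t):
    """Second part of the term weight when q_i is estimated as f_t / N."""
    exact = math.log((N - f_t) / f_t)  # log((1 - q_i) / q_i)
    approx = math.log(N / f_t)         # the classic IDF weight
    return exact, approx

print(second_part(N=1_000_000, f_t=100))  # the two values nearly coincide
```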

24 © Allan, Ballesteros, Croft, and/or Turtle
Estimation
p_i (the probability of occurrence in relevant documents) can be estimated in various ways:
– as a constant (the Croft and Harper combination match)
– proportional to the probability of occurrence in the collection
– more accurately, proportional to log(probability of occurrence) (Greiff, 1998)
Maximum likelihood estimates have problems with small samples or zero values.
Estimating probabilities is the same problem as determining weighting formulae in less formal models.

25 © Allan, Ballesteros, Croft, and/or Turtle
An Independence Assumption
Typically, terms aren't independent (e.g., phrases).
– Modeling dependence can be very complex.
The model assumes the set of all terms is distributed independently in both rel and nonrel documents.
– A very strong assumption! E.g., Q: "What is happening with the impeachment trial?"
– The occurrence of "impeachment" in relevant documents is treated as independent of the occurrence of "trial".

