VECTOR SPACE INFORMATION RETRIEVAL
Adrienn Skrop
VIR
The vector (or vector space) model of IR (VIR) is a classical model of IR. It is so called because both the documents and the queries are conceived, for retrieval purposes, as strings of numbers, as though they were (mathematical) vectors. Retrieval is based on whether the 'query vector' and the 'document vector' are 'close enough'.
Notations
Given a finite set D of elements called documents:

D = {D_1, ..., D_j, ..., D_m}

and a finite set T of elements called index terms:

T = {t_1, ..., t_i, ..., t_n}

Any document D_j is assigned a vector v_j of finite real numbers, called weights, of length n, as follows:

v_j = (w_ij)_{i=1,...,n} = (w_1j, ..., w_ij, ..., w_nj), where 0 ≤ w_ij ≤ 1

(i.e. w_ij is normalised, e.g. by division by the largest weight). The weight w_ij is interpreted as the extent to which the term t_i 'characterises' document D_j.
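As a minimal illustration of the normalisation mentioned above (the raw weights here are made-up values, not from the example later in the deck):

```python
# Hypothetical raw term weights for one document (illustrative values only).
raw = [4.0, 1.0, 0.0, 2.0]

# Normalise by the largest weight so that 0 <= w_ij <= 1.
largest = max(raw)
v_j = [w / largest for w in raw]
print(v_j)  # [1.0, 0.25, 0.0, 0.5]
```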
Identifiers
The choice of terms and weights is a difficult theoretical (e.g. linguistic, semantic) and practical problem, and several techniques can be used to cope with it. The VIR model assumes that the most obvious place where appropriate content identifiers might be found is the documents themselves. It is assumed that the frequencies of words in documents give a meaningful indication of their content, hence they can be taken as identifiers.
Word significance factors
Steps of a simple automatic method to identify word significance factors:
1. Identify lexical units. Write a computer program to recognise words (word = any sequence of characters preceded and followed by a space, dot, or comma).
2. Exclude stopwords from documents, i.e. those words that are unlikely to bear any significance, for example: a, about, and, etc.
Word significance factors
3. Apply stemming to the remaining words, i.e. reduce or transform them to their linguistic roots. A widely used stemming algorithm is the well-known Porter's algorithm. In practice, the index terms are the stems.
4. Compute for each document D_j the number of occurrences f_ij of each term t_i in that document.
5. Calculate the total tf_i for each term t_i as follows (steps 1-5 are sketched in the code below):

tf_i = Σ_{j=1}^{m} f_ij
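A minimal Python sketch of steps 1-5, assuming a toy stoplist and a crude suffix-stripper standing in for Porter's algorithm (a real system would use a full stopword list and a proper Porter implementation):

```python
import re
from collections import Counter

# Toy stoplist; a real one contains hundreds of words.
STOPWORDS = {"a", "about", "and", "the", "that", "in", "one",
             "should", "each", "has", "is", "of", "which", "to"}

def stem(word):
    # Crude stand-in for Porter's algorithm: strip a few common suffixes.
    for suffix in ("ation", "ing", "ity", "es", "ed", "s", "e"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def index_units(text):
    # Step 1: lexical units = maximal runs of letters, lowercased.
    words = re.findall(r"[a-z]+", text.lower())
    # Step 2: drop stopwords; Step 3: stem what remains.
    return [stem(w) for w in words if w not in STOPWORDS]

documents = [
    "Bayes' Principle: The principle that, in estimating a parameter, ...",
    "Bayesian Decision Theory: A mathematical theory of decision-making ...",
]

# Step 4: per-document occurrence counts f_ij.
freqs = [Counter(index_units(d)) for d in documents]

# Step 5: total tf_i = sum of f_ij over all documents.
tf = sum(freqs, Counter())
```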
Word significance factors
6. Rank the terms in decreasing order of tf_i, and exclude the very high frequency terms, e.g. those over some threshold (on the grounds that they are almost always insignificant), as well as the very low frequency terms, e.g. those below some threshold (on the grounds that they are not much on the writer's mind).
Word significance factors
7. The remaining terms can be taken as document identifiers; they will be called index terms (or simply terms).
8. Compute for each index term t_i a significance factor (or weight) w_ij with respect to each document D_j. Several methods are known; one is sketched below.
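Continuing the sketch, steps 6-8 with the running example's frequencies; the thresholds and the maximum-frequency normalisation are just one of the possible choices mentioned above:

```python
from collections import Counter

# Per-document frequencies f_ij from step 4 (the example's values).
freqs = [Counter({"bayes": 1, "probabl": 1}),
         Counter({"bayes": 2, "probabl": 1}),
         Counter({"bayes": 3, "probabl": 3, "epistemolog": 3})]
tf = sum(freqs, Counter())  # tf_i

# Steps 6-7: keep only mid-frequency terms (thresholds chosen arbitrarily).
LOW, HIGH = 1, 100
terms = sorted(t for t, total in tf.items() if LOW <= total <= HIGH)

# Step 8: one known scheme -- divide f_ij by the largest frequency in D_j.
def weights(f):
    largest = max((f[t] for t in terms), default=0) or 1
    return [f[t] / largest for t in terms]

doc_vectors = [weights(f) for f in freqs]
```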
Query object
An object Q_k, called a query, coming from a user, is conceived as being a (much smaller) document, and a vector v_k can be computed for it, too, in a similar way.
VIR
VIR is defined as follows: a document D_j is retrieved in response to a query Q_k if the document and the query are "similar enough", i.e. if a similarity measure s_jk between the document (identified by v_j) and the query (identified by v_k) is over some threshold K:

s_jk = s(v_j, v_k) > K
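In code, this definition translates directly into a threshold filter; `sim` is whichever similarity measure is chosen (several are listed below):

```python
def retrieve(doc_vectors, query_vector, sim, K):
    """Return (s_jk, j) for every document with s_jk > K, best first."""
    scored = [(sim(v_j, query_vector), j) for j, v_j in enumerate(doc_vectors)]
    return sorted((s for s in scored if s[0] > K), reverse=True)
```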
Steps of VIR
1. Define the query Q_k.
2. Compute for the index terms t_i a significance factor (or weight) w_ik with respect to the query Q_k, in the same way as for the documents D_j.
3. Compute the similarity values for the documents D_j.
4. Give the hit list of the retrieved documents, in decreasing order of the similarity values.
Similarity measures
Dot product (simple matching coefficient; inner product):

s_jk = (v_j, v_k) = Σ_{i=1}^{n} w_ij w_ik

If D_j and Q_k are conceived as sets of terms, the set-theoretic counterpart of the simple matching coefficient is

s_jk = |D_j ∩ Q_k|
Similarity measures
Cosine measure c_jk:

s_jk = c_jk = (v_j, v_k) / (||v_j|| ||v_k||)

If D_j and Q_k are conceived as sets of terms, the set-theoretic counterpart of the cosine measure is

c_jk = |D_j ∩ Q_k| / (|D_j| |Q_k|)^{1/2}
Similarity measures
Dice's coefficient d_jk:

s_jk = d_jk = 2 (v_j, v_k) / Σ_{i=1}^{n} (w_ij + w_ik)

If D_j and Q_k are conceived as sets of terms, the set-theoretic counterpart of Dice's coefficient is

d_jk = 2 |D_j ∩ Q_k| / (|D_j| + |Q_k|)
Similarity measures
Jaccard's coefficient J_jk:

s_jk = J_jk = (v_j, v_k) / (Σ_{i=1}^{n} (w_ij + w_ik) − Σ_{i=1}^{n} w_ij w_ik)

If D_j and Q_k are conceived as sets of terms, the set-theoretic counterpart of Jaccard's coefficient is

J_jk = |D_j ∩ Q_k| / |D_j ∪ Q_k|
Similarity measures
Overlap coefficient O_jk:

s_jk = O_jk = (v_j, v_k) / min(Σ_{i=1}^{n} w_ij, Σ_{i=1}^{n} w_ik)

If D_j and Q_k are conceived as sets of terms, the set-theoretic counterpart of the overlap coefficient is

O_jk = |D_j ∩ Q_k| / min(|D_j|, |Q_k|)
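The five vector forms above, written out as plain Python functions (a sketch; `v` and `q` are equal-length weight vectors):

```python
import math

def dot(v, q):            # simple matching coefficient (inner product)
    return sum(wv * wq for wv, wq in zip(v, q))

def cosine(v, q):         # dot product scaled by the vector lengths
    return dot(v, q) / (math.sqrt(dot(v, v)) * math.sqrt(dot(q, q)))

def dice(v, q):
    return 2 * dot(v, q) / (sum(v) + sum(q))

def jaccard(v, q):
    return dot(v, q) / (sum(v) + sum(q) - dot(v, q))

def overlap(v, q):
    return dot(v, q) / min(sum(v), sum(q))
```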
Example
Let the set of documents be O = {O_1, O_2, O_3}, where
O_1 = Bayes' Principle: The principle that, in estimating a parameter, one should initially assume that each possible value has equal probability (a uniform prior distribution).
O_2 = Bayesian Decision Theory: A mathematical theory of decision-making which presumes utility and probability functions, and according to which the act to be chosen is the Bayes act, i.e. the one with the highest Subjective Expected Utility. If one had unlimited time and calculating power with which to make every decision, this procedure would be the best way to make any decision.
Example
O_3 = Bayesian Epistemology: A philosophical theory which holds that the epistemic status of a proposition (i.e. how well proven or well established it is) is best measured by a probability, and that the proper way to revise this probability is given by Bayesian conditionalisation or similar procedures. A Bayesian epistemologist would use probability to define, and explore the relationship between, concepts such as epistemic status, support or explanatory power.
Example
Step 1. Identify lexical units.
Original text: Bayes' Principle: The principle that, in estimating a parameter, one should initially assume that each possible value has equal probability (a uniform prior distribution).
Output: Bayes principle the principle that in estimating a parameter one should initially assume that each possible value has equal probability a uniform prior distribution
Example
Step 2. Exclude stopwords from the document.
Stoplist: a aboard about above accordingly across actually add added after afterwards again against ago all allows almost alone along alongside …
Example
Step 3. Apply stemming.
Text before stemming: Bayes principle estimating parameter one initially assume possible value
Text after stemming: Bayes principl estim paramet on initi assum possibl valu …
Example
Step 4. Compute for each document D_j the number of occurrences f_ij of each term t_i in that document. E.g., for t_1 = Bayes:

f_11 = 1, f_12 = 2, f_13 = 3
Example
Step 5. Calculate the total tf_i for each term t_i. E.g., the number of occurrences of term t_1 = Bayes in all the documents:

tf_1 = 1 + 2 + 3 = 6
Example
Step 6. Rank the terms in decreasing order of tf_i, and exclude the very high and very low frequency terms.
Example
Step 7. The remaining terms can be taken as index terms. Let the set T of index terms be (not stemmed here):

T = {t_1, t_2, t_3} = {t_1 = Bayes, t_2 = probability, t_3 = epistemology}

Conceive the documents as sets of terms: D = {D_1, D_2, D_3}, where
D_1 = {(t_1 = Bayes); (t_2 = probability)}
D_2 = {(t_1 = Bayes); (t_2 = probability)}
D_3 = {(t_1 = Bayes); (t_2 = probability); (t_3 = epistemology)}
Example
Conceive the documents as sets of terms, together with their frequencies: D = {D_1, D_2, D_3}, where
D_1 = {(Bayes, 1); (probability, 1); (epistemology, 0)}
D_2 = {(Bayes, 2); (probability, 1); (epistemology, 0)}
D_3 = {(Bayes, 3); (probability, 3); (epistemology, 3)}
Documents in vector space
[Figure: the three document vectors plotted in the space spanned by the index terms]
Example
Step 8. Compute for each index term t_i a significance factor (or weight) w_ij with respect to each document D_j, creating the term-document matrix TD:

TD_{3×3} = (w_ij), where w_ij denotes the weight of term t_i in document D_j
Example
Using the Frequency Weighting Method (w_ij = f_ij):

TD =
| 1  2  3 |
| 1  1  3 |
| 0  0  3 |

Using the Term Frequency Normalised Weighting Method (each document column divided by its Euclidean length):

TD =
| 0.707  0.894  0.577 |
| 0.707  0.447  0.577 |
| 0      0      0.577 |
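A short sketch of how the second matrix follows from the first: each document column of the raw frequency matrix is divided by its Euclidean length:

```python
import math

# Raw frequency matrix: rows = terms (Bayes, probability, epistemology),
# columns = documents (D1, D2, D3).
TD = [[1, 2, 3],
      [1, 1, 3],
      [0, 0, 3]]

def tfn(td):
    # Normalise each document column to unit Euclidean length.
    norms = [math.sqrt(sum(w * w for w in col)) for col in zip(*td)]
    return [[w / norms[j] for j, w in enumerate(row)] for row in td]

for row in tfn(TD):
    print([round(w, 3) for w in row])
# [0.707, 0.894, 0.577]
# [0.707, 0.447, 0.577]
# [0.0, 0.0, 0.577]
```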
Example
Step 9. Define a query. Let the query Q be:
Q = {What is Bayesian epistemology?}
As a set of terms:
Q = {(t_1 = Bayes); (t_3 = epistemology)}
Notice that, in VIR, the query is no longer a Boolean expression (as it is in the Boolean model, BIR).
Example
Step 10. Compute for the index terms t_i a significance factor (or weight) w_ik with respect to the query Q_k. Using the Frequency or Binary Weighting Method:

Q_k = (1, 0, 1)
Query in vector space
[Figure: the query vector plotted in the same term space as the documents]
Example
Step 11. Compute the similarity values, and give the ranking order. Using the Term Frequency Normalised (tfn) Weighting Method, the query vector is also normalised:

Q = (0.707, 0, 0.707)

TD =
| 0.707  0.894  0.577 |
| 0.707  0.447  0.577 |
| 0      0      0.577 |
Example
Dot product: Dot_1 = 0.5, Dot_2 = 0.632, Dot_3 = 0.816
Cosine measure: Cosine_1 = 0.5, Cosine_2 = 0.632, Cosine_3 = 0.816
Dice measure: Dice_1 = 0.354, Dice_2 = 0.459, Dice_3 = 0.519
Jaccard measure: Jaccard_1 = 0.21, Jaccard_2 = 0.29, Jaccard_3 = 0.32
(With unit-length vectors the dot product and the cosine measure coincide, which is why the first two rows agree.)
The ranking order: D_3, D_2, D_1.
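A quick check of these figures, applying the similarity functions sketched earlier to the normalised vectors (note that with the Jaccard formula given above, the value for D_3 comes out nearer 0.35):

```python
import math

def dot(v, q):
    return sum(a * b for a, b in zip(v, q))

def normalise(v):
    n = math.sqrt(dot(v, v))
    return tuple(w / n for w in v)

docs = [(1, 1, 0), (2, 1, 0), (3, 3, 3)]   # raw frequency vectors
q = normalise((1, 0, 1))                   # normalised query

for j, d in enumerate(docs, start=1):
    v = normalise(d)
    s = dot(v, q)                          # = cosine for unit vectors
    dice = 2 * s / (sum(v) + sum(q))
    jac = s / (sum(v) + sum(q) - s)
    print(f"D{j}: dot/cosine={s:.3f}  dice={dice:.3f}  jaccard={jac:.3f}")
```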
Conclusion
Both the query and the documents are represented as vectors of real numbers. A similarity measure is meant to express a likeness between a query and a document. Those documents are said to be retrieved in response to a query for which the similarity measure exceeds some threshold.