VECTOR SPACE INFORMATION RETRIEVAL
Adrienn Skrop
VIR
The vector (or vector space) model of IR (VIR) is a classical model of IR. It is traditionally called so because both the documents and the queries are conceived, for retrieval purposes, as strings of numbers, as though they were (mathematical) vectors. Retrieval is based on whether the 'query vector' and the 'document vector' are 'close enough'.
Notations
Given a finite set D of elements called documents, D = {D_1, ..., D_j, ..., D_m}, and a finite set T of elements called index terms, T = {t_1, ..., t_i, ..., t_n}, any document D_j is assigned a vector v_j of finite real numbers, called weights, of length n:

$$v_j = (w_{ij})_{i=1,\dots,n} = (w_{1j}, \dots, w_{ij}, \dots, w_{nj}), \qquad 0 \le w_{ij} \le 1$$

(i.e. w_ij is normalised, e.g. by division by the largest value). The weight w_ij is interpreted as the extent to which the term t_i 'characterises' document D_j.
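As a small illustration of such a normalised weight vector, here is a minimal Python sketch; the counts are made up for the illustration:

```python
# Raw term counts f_ij for terms t_1..t_4 in one document D_j (made-up numbers).
raw_counts = [3, 1, 0, 2]

# Normalise into [0, 1] by dividing by the largest count, as suggested above.
largest = max(raw_counts)
v_j = [f / largest for f in raw_counts]
print(v_j)  # [1.0, 0.3333333333333333, 0.0, 0.6666666666666666]
```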
Identifiers
The choice of terms and weights is a difficult theoretical (e.g. linguistic, semantic) and practical problem, and several techniques can be used to cope with it. The VIR model assumes that the most obvious place where appropriate content identifiers might be found is the documents themselves: the frequencies of words in a document can give a meaningful indication of its content, so they can be taken as identifiers.
Word significance factors
Steps of a simple automatic method to identify word significance factors (a sketch of the whole pipeline follows after step 8):
1. Identify lexical units. Write a computer program to recognise words (word = any sequence of characters preceded and followed by 'space', 'dot' or 'comma').
2. Exclude stopwords from the documents, i.e. those words that are unlikely to bear any significance (for example: a, about, and, etc.).
Word significance factors
3. Apply stemming to the remaining words, i.e. reduce or transform them to their linguistic roots. A widely used stemming algorithm is Porter's algorithm. Practically, the index terms are the stems.
4. Compute for each document D_j the number of occurrences f_ij of each term t_i in that document.
5. Calculate the total tf_i for each term t_i as follows:

$$tf_i = \sum_{j=1}^{m} f_{ij}$$
Word significance factors
6. Rank the terms in decreasing order according to tf_i. Exclude the very high-frequency terms, e.g. those above some threshold (on the grounds that they are almost always insignificant), as well as the very low-frequency terms, e.g. those below some threshold (on the grounds that they are not much on the writer's mind).
Word significance factors
7. The remaining terms can be taken as document identifiers; they will be called index terms (or simply terms).
8. Compute for the index terms t_i a significance factor (or weight) w_ij with respect to each document D_j. Several methods are known.
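The following Python sketch ties steps 1-8 together. It is a minimal illustration, not the original method: the stopword list is truncated, the suffix-stripping stem() is only a crude stand-in for Porter's algorithm, and the frequency thresholds low and high are arbitrary placeholders.

```python
import re
from collections import Counter

# Placeholder stopword list (a real stoplist would be much longer).
STOPWORDS = {"a", "about", "and", "the", "that", "in", "is", "of", "which", "to"}

def tokenize(text):
    # Step 1: lexical units, i.e. lowercase runs of letters, so spaces
    # and punctuation act as delimiters.
    return re.findall(r"[a-z]+", text.lower())

def stem(word):
    # Step 3: crude suffix stripping, a stand-in for Porter's algorithm.
    for suffix in ("ing", "ian", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def select_index_terms(documents, low=2, high=10):
    # Steps 2 and 4: drop stopwords, then count occurrences f_ij per document.
    per_doc = [Counter(stem(w) for w in tokenize(doc) if w not in STOPWORDS)
               for doc in documents]
    # Step 5: total tf_i over all documents.
    tf = Counter()
    for counts in per_doc:
        tf.update(counts)
    # Steps 6-7: keep only the mid-frequency terms as index terms.
    terms = sorted(t for t, total in tf.items() if low <= total <= high)
    # Step 8: weights, here simply the raw frequencies f_ij.
    TD = [[counts.get(t, 0) for counts in per_doc] for t in terms]
    return terms, TD
```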
Query object
An object Q_k, called a query, coming from a user, is also conceived as being a (much smaller) document; a vector v_k can be computed for it, too, in a similar way.
VIR
VIR is defined as follows: a document D_j is retrieved in response to a query Q_k if the document and the query are "similar enough", i.e. if a similarity measure s_jk between the document (identified by v_j) and the query (identified by v_k) exceeds some threshold K:

$$s_{jk} = s(v_j, v_k) > K$$
Steps of VIR
1. Define the query Q_k.
2. Compute for the index terms t_i a significance factor (or weight) w_ik with respect to the query Q_k, in the same way as for the documents D_j.
3. Compute the similarity values for the documents D_j.
4. Give the hit list of the retrieved documents, in decreasing order of similarity value.
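A minimal sketch of these steps in Python; the function and parameter names are illustrative, not from the original, and similarity can be any of the measures defined next:

```python
def retrieve(doc_vectors, query_vector, similarity, K=0.0):
    # Score every document vector against the query vector.
    scores = [(name, similarity(v, query_vector)) for name, v in doc_vectors.items()]
    # Keep the documents whose score exceeds the threshold K,
    # and return the hit list with the best match first.
    hits = [(name, s) for name, s in scores if s > K]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)
```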
Similarity measures
Dot product (simple matching coefficient; inner product):

$$s_{jk} = (v_j, v_k) = \sum_{i=1}^{n} w_{ij} w_{ik}$$

If D_j and Q_k are conceived as sets of terms, the set-theoretic counterpart of the simple matching coefficient is

$$s_{jk} = |D_j \cap Q_k|$$
Similarity measures
Cosine measure c_jk:

$$s_{jk} = c_{jk} = \frac{(v_j, v_k)}{\lVert v_j \rVert \, \lVert v_k \rVert}$$

If D_j and Q_k are conceived as sets of terms, the set-theoretic counterpart of the cosine measure is

$$c_{jk} = \frac{|D_j \cap Q_k|}{(|D_j| \cdot |Q_k|)^{1/2}}$$
Similarity measures
Dice's coefficient d_jk:

$$s_{jk} = d_{jk} = \frac{2\,(v_j, v_k)}{\sum_{i=1}^{n} (w_{ij} + w_{ik})}$$

If D_j and Q_k are conceived as sets of terms, the set-theoretic counterpart of Dice's coefficient is

$$d_{jk} = \frac{2\,|D_j \cap Q_k|}{|D_j| + |Q_k|}$$
Similarity measures
Jaccard's coefficient J_jk:

$$s_{jk} = J_{jk} = \frac{(v_j, v_k)}{\sum_{i=1}^{n} w_{ij} + \sum_{i=1}^{n} w_{ik} - \sum_{i=1}^{n} w_{ij} w_{ik}}$$

If D_j and Q_k are conceived as sets of terms, the set-theoretic counterpart of Jaccard's coefficient is

$$J_{jk} = \frac{|D_j \cap Q_k|}{|D_j \cup Q_k|}$$
Similarity measures
Overlap coefficient O_jk:

$$s_{jk} = O_{jk} = \frac{(v_j, v_k)}{\min\left(\sum_{i=1}^{n} w_{ij}, \sum_{i=1}^{n} w_{ik}\right)}$$

If D_j and Q_k are conceived as sets of terms, the set-theoretic counterpart of the overlap coefficient is

$$O_{jk} = \frac{|D_j \cap Q_k|}{\min(|D_j|, |Q_k|)}$$
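The five vector-form measures transcribe directly into Python. A minimal sketch, assuming vectors are plain lists of non-negative weights (as they are here, with weights in [0, 1]):

```python
from math import sqrt

def dot(v, w):
    # Dot product (simple matching coefficient; inner product).
    return sum(a * b for a, b in zip(v, w))

def cosine(v, w):
    return dot(v, w) / (sqrt(dot(v, v)) * sqrt(dot(w, w)))

def dice(v, w):
    return 2 * dot(v, w) / (sum(v) + sum(w))

def jaccard(v, w):
    return dot(v, w) / (sum(v) + sum(w) - dot(v, w))

def overlap(v, w):
    return dot(v, w) / min(sum(v), sum(w))
```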
Example
Let the set of documents be D = {D_1, D_2, D_3}, where
D_1 = Bayes' Principle: The principle that, in estimating a parameter, one should initially assume that each possible value has equal probability (a uniform prior distribution).
D_2 = Bayesian Decision Theory: A mathematical theory of decision-making which presumes utility and probability functions, and according to which the act to be chosen is the Bayes act, i.e. the one with the highest Subjective Expected Utility. If one had unlimited time and calculating power with which to make every decision, this procedure would be the best way to make any decision.
Example
D_3 = Bayesian Epistemology: A philosophical theory which holds that the epistemic status of a proposition (i.e. how well proven or well established it is) is best measured by a probability, and that the proper way to revise this probability is given by Bayesian conditionalisation or similar procedures. A Bayesian epistemologist would use probability to define, and explore the relationship between, concepts such as epistemic status, support or explanatory power.
Example
Step 1. Identify lexical units.
Original text: Bayes' Principle: The principle that, in estimating a parameter, one should initially assume that each possible value has equal probability (a uniform prior distribution).
Output: Bayes principle the principle that in estimating a parameter one should initially assume that each possible value has equal probability a uniform prior distribution
Example
Step 2. Exclude stopwords from the document.
Stoplist: a aboard about above accordingly across actually add added after afterwards again against ago all allows almost alone along alongside …
Example
Step 3. Apply stemming.
Text before stemming: Bayes principle estimating parameter one initially assume possible value
Text after stemming: Bayes principl estim paramet on initi assum possibl valu …
Example
Step 4. Compute for each document D_j the number of occurrences f_ij of each term t_i in that document. E.g., for t_1 = Bayes:
f_11 = 1, f_12 = 2, f_13 = 3
Example
Step 5. Calculate the total tf_i for each term t_i. E.g., the number of occurrences of term t_1 = Bayes in all the documents:
tf_1 = 1 + 2 + 3 = 6
Example
Step 6. Rank the terms in decreasing order according to tf_i, and exclude the very high- and very low-frequency terms.
Example
Step 7. The remaining terms can be taken as index terms. Let the set T of index terms be (not stemmed here):
T = {t_1, t_2, t_3} = {t_1 = Bayes, t_2 = probability, t_3 = epistemology}.
Conceive the documents as sets of terms: D = {D_1, D_2, D_3}, where
D_1 = {(t_1 = Bayes); (t_2 = probability)}
D_2 = {(t_1 = Bayes); (t_2 = probability)}
D_3 = {(t_1 = Bayes); (t_2 = probability); (t_3 = epistemology)}
Example
Conceive the documents as sets of terms, together with their frequencies: D = {D_1, D_2, D_3}, where
D_1 = {(Bayes, 1); (probability, 1); (epistemology, 0)}
D_2 = {(Bayes, 2); (probability, 1); (epistemology, 0)}
D_3 = {(Bayes, 3); (probability, 3); (epistemology, 3)}
Documents in vector space
[Figure: the example documents D_1, D_2, D_3 drawn as vectors in the three-dimensional term space spanned by t_1, t_2, t_3.]
Example
Step 8. Compute for the index terms t_i a significance factor (or weight) w_ij with respect to each document D_j, creating the term-document matrix TD:

$$TD_{3 \times 3} = (w_{ij})$$

where w_ij denotes the weight of term t_i in document D_j.
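A minimal sketch of assembling TD from the frequencies above; the dictionary layout and names are illustrative:

```python
terms = ["Bayes", "probability", "epistemology"]
frequencies = {
    "D1": {"Bayes": 1, "probability": 1},
    "D2": {"Bayes": 2, "probability": 1},
    "D3": {"Bayes": 3, "probability": 3, "epistemology": 3},
}
# Rows are terms t_1..t_3, columns are documents D_1..D_3.
TD = [[frequencies[d].get(t, 0) for d in ("D1", "D2", "D3")] for t in terms]
# TD == [[1, 2, 3], [1, 1, 3], [0, 0, 3]]
```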
Example
Using the Frequency Weighting Method:

TD =
               D_1   D_2   D_3
Bayes            1     2     3
probability      1     1     3
epistemology     0     0     3

Using the Term Frequency Normalized Weighting Method (each document vector divided by its Euclidean length):

TD =
               D_1   D_2   D_3
Bayes         0.71  0.89  0.58
probability   0.71  0.45  0.58
epistemology  0.00  0.00  0.58
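The matrix above assumes that "Term Frequency Normalized" means dividing each document (column) vector by its Euclidean length; that choice makes every document vector unit length and reproduces the dot and cosine values quoted in the final step. A sketch:

```python
from math import sqrt

TD = [[1, 2, 3], [1, 1, 3], [0, 0, 3]]  # frequency-weighted matrix from above

# Divide each column (document vector) by its Euclidean length.
norms = [sqrt(sum(row[j] ** 2 for row in TD)) for j in range(3)]
TD_tfn = [[row[j] / norms[j] for j in range(3)] for row in TD]

for row in TD_tfn:
    print([round(w, 2) for w in row])
# [0.71, 0.89, 0.58]
# [0.71, 0.45, 0.58]
# [0.0, 0.0, 0.58]
```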
Example
Step 9. Define a query. Let the query Q be:
Q = What is Bayesian epistemology?
As a set of terms: Q = {(t_1 = Bayes); (t_3 = epistemology)}.
Notice that in VIR the query is no longer a Boolean expression (as it was in BIR).
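With the query as a set of terms, the set-theoretic counterparts defined earlier can be evaluated directly. A minimal sketch for D_3 and Q:

```python
from math import sqrt

D3 = {"Bayes", "probability", "epistemology"}
Q = {"Bayes", "epistemology"}

print(len(D3 & Q))                           # simple matching coefficient: 2
print(len(D3 & Q) / sqrt(len(D3) * len(Q)))  # cosine counterpart: 0.816...
print(2 * len(D3 & Q) / (len(D3) + len(Q)))  # Dice counterpart: 0.8
print(len(D3 & Q) / len(D3 | Q))             # Jaccard counterpart: 0.666...
print(len(D3 & Q) / min(len(D3), len(Q)))    # overlap counterpart: 1.0
```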
Example
Step 10. Compute for the index terms t_i a significance factor (or weight) w_ik with respect to the query Q_k. Using the Frequency or Binary Weighting Method:
Q_k = (1, 0, 1)
Query in vector space
[Figure: the query vector Q drawn in the same term space as the document vectors.]
Example
Step 11. Compute the similarity values and give the ranking order. Using the Term Frequency Normalized Weighting Method:

Q = (0.71, 0, 0.71)

TD =
               D_1   D_2   D_3
Bayes         0.71  0.89  0.58
probability   0.71  0.45  0.58
epistemology  0.00  0.00  0.58
Example
Dot product: Dot_1 = 0.50, Dot_2 = 0.63, Dot_3 = 0.82
Cosine measure: Cosine_1 = 0.50, Cosine_2 = 0.63, Cosine_3 = 0.82 (the tfn vectors have unit length, so the cosine values coincide with the dot products)
Dice measure: Dice_1 = 0.35, Dice_2 = 0.46, Dice_3 = 0.52
Jaccard measure: Jaccard_1 = 0.21, Jaccard_2 = 0.29, Jaccard_3 = 0.32
The ranking order: D_3, D_2, D_1.
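The dot-product values (and hence the ranking) can be reproduced end to end; a minimal sketch under the same Euclidean-length normalization assumption:

```python
from math import sqrt

def dot(v, w):
    return sum(a * b for a, b in zip(v, w))

def tfn(v):
    # Normalize a vector to unit Euclidean length.
    length = sqrt(dot(v, v))
    return [x / length for x in v]

docs = {"D1": [1, 1, 0], "D2": [2, 1, 0], "D3": [3, 3, 3]}
q = tfn([1, 0, 1])

for name, freqs in sorted(docs.items(),
                          key=lambda kv: dot(tfn(kv[1]), q), reverse=True):
    print(name, round(dot(tfn(freqs), q), 2))
# D3 0.82
# D2 0.63
# D1 0.5
```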
Conclusion
Both the query and the documents are represented as vectors of real numbers. A similarity measure is meant to express the likeness between a query and a document. Those documents are said to be retrieved in response to a query for which the similarity measure exceeds some threshold.