Latent Semantic Indexing and Probabilistic (Bayesian) Information Retrieval
Dimensionality Reduction
Imagine we’ve collected data on the HEIGHT and WEIGHT of everyone in a classroom of N students. These might be plotted as in figure 5.3. Notice the correlation around an axis we might call SIZE. Students vary most along this dimension; it captures most of the information about their distribution. Because the two quantities are correlated, a single axis can capture the major source of variation across the HEIGHT / WEIGHT sample.
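A minimal sketch of the HEIGHT / WEIGHT idea, assuming numpy is available. The six measurements are made up for illustration, and using the SVD of the centred data to find the SIZE axis is one common way to do it, not necessarily the method behind figure 5.3.

```python
import numpy as np

# Hypothetical HEIGHT (cm) and WEIGHT (kg) for N = 6 students.
data = np.array([
    [150.0, 45.0],
    [160.0, 55.0],
    [165.0, 58.0],
    [170.0, 66.0],
    [180.0, 75.0],
    [190.0, 88.0],
])

# Centre each column so the axes pass through the "average student".
centred = data - data.mean(axis=0)

# Rows of vt are the axes of variation, ordered from most to least variation.
# Squared singular values are proportional to the variance along each axis.
_, s, vt = np.linalg.svd(centred, full_matrices=False)
explained = s**2 / np.sum(s**2)

size_axis = vt[0]                   # direction of maximal variation ("SIZE")
size_scores = centred @ size_axis   # one number per student instead of two

print("share of variance per axis:", explained)
print("SIZE score per student:", size_scores)
```

With data this strongly correlated, almost all of the variance falls on the first axis, so keeping only the SIZE score loses very little.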
Dimensionality Reduction in the Vector Space Model
In the Vector Space Model (VSM), the Index is a D × V matrix, where D is the number of documents and V is the size of the vocabulary (see next slide). Attempts to reduce this high-dimensional space to something smaller are called dimensionality reduction. There are two reasons we might be interested in reducing dimensions:
a) to reduce the size of the representation of the documents. Initially there are many zeros, representing all the words not in the document; i.e. the matrix is sparse.
b) to exploit what are known as the latent semantic relationships among the keywords.
The VSM assumes the vocabulary dimensions to be orthogonal to one another; i.e. the keywords are independent. But index terms are highly dependent, highly correlated with one another. We exploit this by capturing only those axes of maximal variation and throwing away the rest.
Singular Value Decomposition (SVD)
Just as students’ HEIGHT and WEIGHT are correlated about the dimension SIZE, we can guess that (at least some small sets of) keywords are correlated, and so we can reduce the dimensionality of the Index in the same way we reduced that of our students’ sizes. Using SVD, the Index is decomposed into three matrices, U, L and A (which can be multiplied together to get back the original Index). L is a diagonal matrix, where the value in cell 1 reflects the importance of the most dominant correlation, the value in cell 2 reflects the second most dominant correlation, and so on, until the value in the Vth cell reflects the least dominant correlation. We reduce the number of dimensions from V to k by keeping only the first k cells in L, together with the matching parts of U and A. Finally, multiply the three truncated matrices back together again to produce an Index with fewer dimensions, an approximation of the original.
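The truncation step can be sketched with numpy as follows. The 4 × 5 Index and the choice k = 2 are hypothetical, and the variable names only mirror the slide's U, L and A; note that `np.linalg.svd` returns the diagonal of L as a 1-D array and the third factor already oriented for multiplication.

```python
import numpy as np

# Hypothetical D x V Index: 4 documents (rows) by 5 keywords (columns).
index = np.array([
    [2.0, 1.0, 0.0, 0.0, 1.0],
    [1.0, 2.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 3.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 2.0, 1.0],
])

# l_diag holds the diagonal of L, largest value first.
u, l_diag, a_t = np.linalg.svd(index, full_matrices=False)

k = 2  # number of dimensions to keep (arbitrary for this toy example)

# Keep the first k cells of L and the matching columns of U / rows of a_t,
# then multiply back together to get the approximation of the Index.
index_k = u[:, :k] @ np.diag(l_diag[:k]) @ a_t[:k, :]

# Each document can also be stored as just k numbers instead of V.
doc_coords = u[:, :k] * l_diag[:k]

print("diagonal of L:", l_diag)
print("approximate Index:\n", np.round(index_k, 2))
print("approximation error:", np.linalg.norm(index - index_k))
```

The approximate Index has the same shape as the original but only k independent dimensions; the error printed at the end comes entirely from the discarded cells of L.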
How many dimensions to reduce to?
To date, the only answers are empirical. Using too few dimensions dramatically degrades performance. A few hundred dimensions might suffice for a very topically-focussed vocabulary (e.g. medicine), but more might be needed when describing a broader domain of discourse. See next slide: 500 dimensions gives the highest proportion correct on the synonym test.
“Latent Semantic” Claims
In IR, SVD was first applied to the Index matrix by Deerwester et al. (1990), and was called Latent Semantic Indexing (LSI). The “Latent Semantic” claim derives from the authors’ belief that the reduced dimension representation of documents in fact reveals semantic correlations among index terms. While one author might use CAR and another the synonym AUTO, the correlation of both of those with other terms like HIGHWAY, GASOLINE and DRIVING will result in an abstracted document feature / dimension on which queries using either keyword, CAR or AUTO, will work equivalently. Retrieval based on synonyms has been achieved.
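The CAR / AUTO effect can be illustrated on a toy collection. Everything here is an assumption made for the sketch: the five documents, the eight-keyword vocabulary, k = 2, and the simple choice of projecting the query onto the same k axes as the documents (other ways of folding a query in exist).

```python
import numpy as np

vocab = ["car", "auto", "highway", "gasoline", "driving", "oven", "recipe", "baking"]
docs = np.array([
    [1, 0, 1, 1, 1, 0, 0, 0],  # d0 uses CAR
    [0, 1, 1, 1, 1, 0, 0, 0],  # d1 uses AUTO, never CAR
    [0, 0, 0, 0, 0, 1, 1, 0],  # d2 is about cooking
    [1, 0, 1, 0, 1, 0, 0, 0],  # d3 uses CAR
    [0, 0, 0, 0, 0, 1, 1, 1],  # d4 is about cooking
], dtype=float)

def cosine(a, b):
    """Cosine of the angle between two non-zero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# A one-keyword query: CAR.
query = np.zeros(len(vocab))
query[vocab.index("car")] = 1.0

# Plain VSM: d1 shares no keyword with the query, so its score is zero.
raw_scores = [cosine(query, d) for d in docs]

# LSI: project documents and query onto the k dominant axes.
k = 2
_, _, a_t = np.linalg.svd(docs, full_matrices=False)
axes = a_t[:k].T          # V x k matrix of keyword loadings
doc_k = docs @ axes
query_k = query @ axes
lsi_scores = [cosine(query_k, d) for d in doc_k]

for i, (raw, lsi) in enumerate(zip(raw_scores, lsi_scores)):
    print(f"d{i}: raw={raw:.2f}  lsi={lsi:.2f}")
```

In the raw vector space the AUTO document d1 is invisible to the CAR query. In the reduced space it scores as highly as the CAR documents, because all three load on the same motoring axis, while the cooking documents stay near zero.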
Probabilistic (Bayesian) Retrieval
There are no absolute logical grounds on which to prove that any document is relevant to any query. Our best hope is to retrieve documents which are probably relevant. The probability ranking principle (van Rijsbergen) is the assumption that an optimal IR system orders (ranks) documents in decreasing probability of relevance.
Pr(Rel)
There are at least two possible interpretations of what a probability of relevance Pr(Rel) might mean:
a) imagine (considering a particular query) an “experiment” showing the document to multiple users;
b) the same document/query relevance question is repeatedly put to the same user, who sometimes replies that it’s relevant and sometimes that it isn’t.
Either way, we focus on one query and compute Pr(Rel) conditionalised by the features we might associate with the document d.
Bayesian Inversion
Let x be a vector of features xi describing a document. The matching function should be proportional to the probability of relevance given these features: match(q,d) ∝ Pr(Rel|x). If we had worked hard on a corpus of documents to identify (always with respect to some particular query) which were Rel and which were not, it would be possible to carefully study which features xi were reliably found in relevant documents and which were not. Collecting such statistics for each feature would then allow us to estimate Pr(x|Rel), the probability of any set of features x given that we know the document is Rel. The retrieval question requires that we ask the converse: the probability that a document with features x should be considered relevant. This inversion is achieved via the familiar Bayes rule:
Pr(Rel|x) = Pr(x|Rel) · Pr(Rel) / Pr(x)
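A small worked example of the inversion, with made-up probabilities. The function name and its arguments are hypothetical; Pr(x) is expanded over the two cases Rel and NRel using the law of total probability.

```python
def posterior_relevance(p_x_given_rel, p_x_given_nrel, p_rel):
    """Pr(Rel | x) via Bayes' rule, expanding Pr(x) over Rel and NRel."""
    p_nrel = 1.0 - p_rel
    p_x = p_x_given_rel * p_rel + p_x_given_nrel * p_nrel
    return p_x_given_rel * p_rel / p_x

# Hypothetical statistics for one feature pattern x and one query:
# x shows up in 60% of relevant documents, 5% of irrelevant ones,
# and only 1% of the corpus is relevant.
p_rel_given_x = posterior_relevance(
    p_x_given_rel=0.60, p_x_given_nrel=0.05, p_rel=0.01
)
print(round(p_rel_given_x, 4))
```

Even though x is twelve times more likely in relevant documents, the posterior is only about 0.11 because relevant documents are so rare to begin with.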
Odds Calculation (2)
The first term Odds(Rel) will be small; the odds of picking a relevant versus irrelevant document independent of any features of the document are not good. Odds(Rel) is a characteristic of the entire corpus or the generality of the query but insensitive to any analysis we might perform on a particular document. In order to calculate the second term Pr(x | Rel) / Pr(x | NRel) we need a more refined model of how documents are “constructed” from their features.
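The two terms can be combined numerically as in the sketch below. It assumes the usual odds form, Odds(Rel|x) = Odds(Rel) · Pr(x|Rel) / Pr(x|NRel), in which Pr(x) cancels; the probabilities and function names are invented for illustration.

```python
def odds(p):
    """Convert a probability into odds, p / (1 - p)."""
    return p / (1.0 - p)

def posterior_odds(p_rel, p_x_given_rel, p_x_given_nrel):
    """Odds(Rel | x) = Odds(Rel) * Pr(x | Rel) / Pr(x | NRel)."""
    prior_odds = odds(p_rel)                            # first term: corpus-wide
    likelihood_ratio = p_x_given_rel / p_x_given_nrel   # second term: document-specific
    return prior_odds * likelihood_ratio

# Hypothetical numbers: 1% of the corpus is relevant, and the feature
# pattern x is 12 times more likely in relevant documents.
post_odds = posterior_odds(p_rel=0.01, p_x_given_rel=0.60, p_x_given_nrel=0.05)
post_prob = post_odds / (1.0 + post_odds)   # back from odds to a probability

print("prior odds:", odds(0.01))
print("posterior odds:", post_odds)
print("posterior probability:", post_prob)
```

Because Odds(Rel) is the same for every document, ranking documents for a given query depends only on the second term.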
Binary Independence Model
The binary assumption is that all the features xi are binary (either present or absent in a document). The much bigger assumption is that the document’s features occur independently of each other. But think of the example of “click here”: whenever we see “click” on a web page it is usually followed by “here”, so these two features are clearly not independent.
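Under the two assumptions, the likelihood ratio Pr(x | Rel) / Pr(x | NRel) factors into one term per feature. The sketch below assumes the usual notation, with pi the probability that feature xi is present in a relevant document and qi the same for an irrelevant one; the feature vector and probabilities are made up.

```python
import math

def bim_log_likelihood_ratio(x, p, q):
    """log [ Pr(x | Rel) / Pr(x | NRel) ] assuming binary, independent features.

    x: list of 0/1 feature indicators for one document
    p: p[i] = Pr(x_i = 1 | Rel),  q: q[i] = Pr(x_i = 1 | NRel)
    """
    total = 0.0
    for x_i, p_i, q_i in zip(x, p, q):
        if x_i:
            total += math.log(p_i / q_i)                   # feature present
        else:
            total += math.log((1.0 - p_i) / (1.0 - q_i))   # feature absent
    return total

# Hypothetical document with three features.
x = [1, 0, 1]
p = [0.8, 0.5, 0.6]
q = [0.2, 0.5, 0.1]

llr = bim_log_likelihood_ratio(x, p, q)
print("likelihood ratio:", math.exp(llr))
```

Summing logs rather than multiplying raw probabilities avoids underflow when there are many features. The middle feature, with pi equal to qi, contributes nothing to the ratio.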
Comparing with the query
Recall that both queries and documents live in the same vector space defined over the features xi. The two products of equation 5.46 (defined in terms of presence or absence of a feature in a document) can be broken into four subcases, depending on whether the features occur in the query. We don’t care about terms that are not in the query because they don’t affect the query-document comparison; we assume that the probability of these features being present in relevant and irrelevant documents is equal. See figure 5.6 (previous slide): sets D and Q are defined in terms of those features xi present and absent in the document and query respectively. Equation 5.46 can be rewritten (don’t worry about the exact derivation) as follows (5.48):
Estimating the pi and qi with RelFbk
Consider the retrospective case where we have RelFbk from a user who has evaluated each of the top N documents in an initial retrieval and has found R of these to be relevant (and the remaining N-R to be irrelevant). If a particular feature xi is present in n of the retrieved documents, with r of these relevant, then this bit of RelFbk provides reasonable estimates for pi and qi. See equations 5.49 and 5.50 on the previous slide.
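Equations 5.49 and 5.50 are not reproduced here, so the sketch below assumes the standard relevance-feedback estimates: pi ≈ r / R (the fraction of relevant documents containing the feature) and qi ≈ (n - r) / (N - R) (the fraction of irrelevant documents containing it). The function name and the counts are hypothetical.

```python
def estimate_p_q(N, R, n, r):
    """Estimate p_i and q_i for one feature from relevance feedback counts.

    N: documents judged, R: of those, judged relevant
    n: judged documents containing the feature, r: of those, judged relevant
    """
    if not 0 < R < N:
        raise ValueError("need at least one relevant and one irrelevant document")
    if not (0 <= r <= min(n, R) and 0 <= n - r <= N - R):
        raise ValueError("counts are inconsistent")
    p_i = r / R              # share of relevant documents with the feature
    q_i = (n - r) / (N - R)  # share of irrelevant documents with the feature
    return p_i, q_i

# Hypothetical feedback: 20 documents judged, 8 relevant;
# the feature occurs in 10 of the 20, and 6 of those 10 are relevant.
p_i, q_i = estimate_p_q(N=20, R=8, n=10, r=6)
print("p_i =", p_i, " q_i =", q_i)
```

With small samples these raw ratios can come out as exactly 0 or 1, which breaks any later logarithm or division, so in practice some smoothing of the counts is usually applied.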