1
5. Vector Space and Probabilistic Retrieval Models
Many slides in this section are adapted from Prof. Joydeep Ghosh (UT ECE) who in turn adapted them from Prof. Dik Lee (Univ. of Science and Tech, Hong Kong) These notes are based, in part, on notes by Dr. Raymond J. Mooney at the University of Texas at Austin.
2
Keyword Discrimination Model
The vector representation of documents can be used as the source of another approach to term weighting. Question: what happens if we remove one of the words used as dimensions in the vector space? If the average similarity among documents changes significantly, the word was a good discriminator; if there is little change, the word is not as helpful and should be weighted less. Note that the goal is a representation that makes it easier for queries to discriminate among documents. Average similarity can be measured after removing each word from the matrix; any of the similarity measures can be used (we will look at a variety of other similarity measures later).
3
Keyword Discrimination
Measuring average similarity (assume there are N documents): let sim(D1,D2) be the similarity score for the pair of documents D1 and D2. Direct computation averages over all distinct document pairs:
    AVG-SIM = (1 / (N(N-1))) * sum over all pairs (Di, Dj), i != j, of sim(Di, Dj)
This is computationally expensive: O(N^2) similarity computations. A better way to calculate AVG-SIM: calculate the centroid D* (the average document vector = sum of all document vectors / N); then
    AVG-SIM = (1/N) * sum over documents Di of sim(D*, Di)
which requires only O(N) similarity computations.
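The two ways of computing average similarity described above can be sketched in Python. This is a minimal illustration, assuming cosine similarity as the pairwise measure; the function names are illustrative, not from the original slides.

```python
import math

def cosine(u, v):
    """Cosine similarity between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def avg_sim_pairwise(docs):
    """O(N^2): average similarity over all distinct document pairs."""
    n = len(docs)
    total = sum(cosine(docs[i], docs[j])
                for i in range(n) for j in range(i + 1, n))
    return total / (n * (n - 1) / 2)

def avg_sim_centroid(docs):
    """O(N): average similarity of each document to the centroid D*."""
    n = len(docs)
    centroid = [sum(col) / n for col in zip(*docs)]
    return sum(cosine(centroid, d) for d in docs) / n
```

Note that the centroid form is only an efficient approximation of the pairwise average (the two do not return identical values), which is exactly why the slide presents it as the cheap alternative.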
4
Keyword Discrimination
Discrimination value (discriminant) and term weights: disc_k = AVG-SIM_k - AVG-SIM, where AVG-SIM_k is the average similarity computed with term k removed from the document vectors. Computing term weights: the new weight for a term k in a document i is the original term frequency of k in i times the discriminant value:
    w_ik = tf_ik * disc_k
disc_k > 0 ==> term_k is a good discriminant
disc_k < 0 ==> term_k is a poor discriminant
disc_k = 0 ==> term_k is indifferent
5
Keyword Discrimination - Example
Using Normalized Cosine Note: D* for each of the SIMk is now computed with only two terms
6
Keyword Discrimination - Example
This shows that t1 tends to be a poor discriminator, while t3 is a good discriminator. The new term weight will now reflect the discrimination value for these terms. Note that further normalization can be done to make all term weights positive.
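The discrimination values and reweighting described on the preceding slides can be sketched end to end. This is a hypothetical Python illustration, assuming cosine similarity and the centroid form of AVG-SIM; all names are illustrative.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def avg_sim(docs):
    # Average similarity to the centroid D* (the cheap O(N) form).
    n = len(docs)
    c = [sum(col) / n for col in zip(*docs)]
    return sum(cosine(c, d) for d in docs) / n

def disc_values(docs):
    # disc_k = AVG-SIM_k - AVG-SIM, where AVG-SIM_k is computed with
    # term k's column removed. disc_k > 0 means removing k makes the
    # documents MORE similar, so k was a good discriminator.
    base = avg_sim(docs)
    n_terms = len(docs[0])
    discs = []
    for k in range(n_terms):
        reduced = [[w for j, w in enumerate(d) if j != k] for d in docs]
        discs.append(avg_sim(reduced) - base)
    return discs

def reweight(docs):
    # New weight: w_ik = tf_ik * disc_k
    discs = disc_values(docs)
    return [[w * discs[k] for k, w in enumerate(d)] for d in docs]
```

A term that appears uniformly in every document gets a negative disc value (a poor discriminator), while a term concentrated in few documents gets a positive one, matching the example's conclusion about t1 and t3.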
7
Signal-To-Noise Ratio
Based on the work of Shannon in the 1940s on information theory. Shannon developed a model of communication of messages across a noisy channel; the goal is to devise an encoding of messages that is most robust in the face of channel noise. In IR, messages describe the content of documents. The amount of information a word carries about a document is inversely proportional to its probability of occurrence: the least informative words are those that occur approximately uniformly across the corpus of documents. A word that occurs with similar frequency across many documents (e.g., "the", "and", etc.) is less informative than one that occurs with high frequency in one or two documents. Shannon used entropy (a logarithmic measure) to measure average information, with noise defined as its inverse.
8
Signal-To-Noise Ratio
p_ik = Prob(term k occurs in document i) = tf_ik / tf_k, where tf_k is the total frequency of term k in the collection.
    Info_ik = - p_ik log2 p_ik        AVG-INFO_k = sum over documents i of Info_ik
    Noise_ik = - p_ik log2 (1/p_ik)
Note: here we always take logs to be base 2. Note: NOISE is the negation of AVG-INFO, so only one of these needs to be computed in practice. The weight of term k in document i is then defined from tf_ik and AVG-INFO_k.
9
Signal-To-Noise Ratio - Example
p_ik = tf_ik / tf_k. Note: by definition, if term k does not appear in document i, we take Info_ik = 0 for that document. The sum AVG-INFO_k = sum over i of -p_ik log2 p_ik is the "entropy" of term k in the collection.
10
Signal-To-Noise Ratio - Example
The weight of term k in document i: w_ik = tf_ik * AVG-INFO_k. Additional normalization can be performed to bring the values into the range [0,1].
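The signal-to-noise computation above can be sketched in Python. This is a minimal illustration assuming the deck's definitions (base-2 logs, p_ik = tf_ik / tf_k, absent terms contributing zero, and the weight w_ik = tf_ik * AVG-INFO_k); the function names are illustrative.

```python
import math

def avg_info(tf_column):
    """AVG-INFO_k, the 'entropy' of term k over the collection:
    sum over documents of -p_ik * log2(p_ik), with p_ik = tf_ik / tf_k.
    Documents where the term is absent contribute 0 by definition."""
    total = sum(tf_column)
    info = 0.0
    for tf in tf_column:
        if tf > 0:
            p = tf / total
            info -= p * math.log2(p)
    return info

def snr_weights(tf_matrix):
    """w_ik = tf_ik * AVG-INFO_k for a documents-by-terms tf matrix."""
    n_terms = len(tf_matrix[0])
    infos = [avg_info([doc[k] for doc in tf_matrix]) for k in range(n_terms)]
    return [[doc[k] * infos[k] for k in range(n_terms)] for doc in tf_matrix]
```

As a sanity check, a term spread evenly over four documents has entropy log2(4) = 2, while a term occurring in a single document has entropy 0 and so gets weight 0 everywhere.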
11
Probabilistic Term Weights
The probabilistic model makes explicit distinctions between occurrences of terms in relevant and non-relevant documents. If we know
    p_i: probability that term x_i appears in a relevant document
    q_i: probability that term x_i appears in a non-relevant document
then, with the binary and independence assumptions, the weight of term x_i in document D_k is:
    w_i = log ( p_i (1 - q_i) / ( q_i (1 - p_i) ) )
Estimating p_i and q_i requires relevance information: using test queries and test collections to "train" the values of p_i and q_i, or other AI/learning techniques.
12
Other Vector Space Similarity Measures
Simple matching: sim(X,Y) = sum_i x_i y_i (the inner product)
Cosine coefficient: sim(X,Y) = sum_i x_i y_i / ( sqrt(sum_i x_i^2) * sqrt(sum_i y_i^2) )
Dice's coefficient: sim(X,Y) = 2 * sum_i x_i y_i / ( sum_i x_i^2 + sum_i y_i^2 )
Jaccard's coefficient: sim(X,Y) = sum_i x_i y_i / ( sum_i x_i^2 + sum_i y_i^2 - sum_i x_i y_i )
13
Vector Space Similarity Measures
Consider the following two document vectors and the query vector: D1 = (0.8, 0.3), D2 = (0.2, 0.7), Q = (0.4, 0.8). Computing similarity using Jaccard's coefficient: sim(Q, D1) = 0.58, sim(Q, D2) = 0.93. Computing similarity using Dice's coefficient: sim(Q, D1) = 0.73, sim(Q, D2) = 0.96.
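The coefficients for this example can be reproduced with a few lines of Python, using the weighted-vector forms of Dice's and Jaccard's coefficients (inner products in place of set sizes); the helper names are illustrative.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def dice(q, d):
    # 2 * sum(q_i d_i) / (sum(q_i^2) + sum(d_i^2))
    return 2 * dot(q, d) / (dot(q, q) + dot(d, d))

def jaccard(q, d):
    # sum(q_i d_i) / (sum(q_i^2) + sum(d_i^2) - sum(q_i d_i))
    return dot(q, d) / (dot(q, q) + dot(d, d) - dot(q, d))

def cosine(q, d):
    return dot(q, d) / math.sqrt(dot(q, q) * dot(d, d))

D1, D2, Q = (0.8, 0.3), (0.2, 0.7), (0.4, 0.8)
```

Running this confirms the Dice values quoted on the slide (0.73 and 0.96); both measures agree that D2 is the better match for Q.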
14
Vector Space Similarity Measures -- Example
15
Naïve Implementation
1. Convert all documents in collection D to tf-idf weighted vectors, dj, over the keyword vocabulary V.
2. Convert the query to a tf-idf-weighted vector q.
3. For each dj in D, compute the score sj = cosSim(dj, q).
4. Sort documents by decreasing score.
5. Present the top-ranked documents to the user.
Time complexity: O(|V|·|D|), which is bad for large V and D! For example, |V| = 10,000 and |D| = 100,000 give |V|·|D| = 1,000,000,000.
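The naïve algorithm above can be sketched directly. This is a minimal Python illustration, assuming one common tf-idf variant (raw term frequency times log2(N/df)); it deliberately keeps the O(|V|·|D|) scoring loop the slide warns about, with no inverted index.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build sparse tf-idf vectors over vocabulary V for token lists.
    (Raw tf and log idf: one common weighting, not the only choice.)"""
    N = len(docs)
    df = Counter()
    for d in docs:
        df.update(set(d))                      # document frequency per term
    idf = {t: math.log2(N / df[t]) for t in df}
    return [{t: tf * idf[t] for t, tf in Counter(d).items()} for d in docs], idf

def cos_sim(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve(query_tokens, docs, top=10):
    """Score EVERY document against the query: the O(|V|*|D|) loop."""
    vecs, idf = tfidf_vectors(docs)
    q = {t: tf * idf.get(t, 0.0) for t, tf in Counter(query_tokens).items()}
    scores = [(cos_sim(d, q), i) for i, d in enumerate(vecs)]
    return sorted(scores, reverse=True)[:top]  # (score, doc index), best first
```

An inverted index avoids this full scan by touching only documents that share at least one term with the query, which is the standard fix for the complexity problem noted above.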
16
Comments on Vector Space Models
Simple, mathematically based approach. Considers both local (tf) and global (idf) word occurrence frequencies. Provides partial matching and ranked results. Tends to work quite well in practice despite obvious weaknesses. Allows efficient implementation for large document collections.
17
Problems with Vector Space Model
Missing semantic information (e.g., word sense). Missing syntactic information (e.g., phrase structure, word order, proximity information). Assumption of term independence (e.g., ignores synonymy). Lacks the control of a Boolean model (e.g., requiring a term to appear in a document): given a two-term query "A B", it may prefer a document containing A frequently but not B over a document that contains both A and B, but both less frequently.
18
3. Probabilistic Model Attempts to be theoretically sound
It tries to predict the probability of a document being relevant, given the query. There are many variations; they are usually more complicated to compute than vector space models, and usually many approximations are required. Relevance information is required from a random sample of documents and queries (training examples). Works about the same as (and sometimes better than) vector space approaches.
19
Basic Probabilistic Retrieval
Retrieval is modeled as a classification process with two classes for each query: the relevant and the non-relevant documents (with respect to the given query); this could easily be extended to three classes (e.g., by adding a "don't care" class). Given a particular document D, calculate the probability of its belonging to the relevant class, and retrieve it if this is greater than the probability of its belonging to the non-relevant class, i.e., retrieve if P(R|D) > P(NR|D). Equivalently, rank by the discriminant value (also called the likelihood ratio) P(R|D) / P(NR|D). Different ways of estimating these probabilities lead to different models.
20
Basic Probabilistic Retrieval
A given query divides the document collection into two sets: relevant and non-relevant. If a document D has been selected in response to a query, retrieve it if dis(D) > 1, where dis(D) = P(R|D) / P(NR|D) is the discriminant of D. This criterion can be modified by weighting the two probabilities. [Diagram: a document is assigned to the relevant or non-relevant set according to P(R|D) and P(NR|D).]
21
Estimating Probabilities
Bayes' rule can be used to "invert" the conditional probabilities:
    P(R|D) = P(D|R) P(R) / P(D)
Applying this to the discriminant function (P(D) cancels):
    dis(D) = P(R|D) / P(NR|D) = ( P(D|R) P(R) ) / ( P(D|NR) P(NR) )
Note that P(R) is the probability that a random document is relevant to the query, and P(NR) = 1 - P(R). These can be estimated as P(R) = n / N and P(NR) = 1 - P(R) = (N - n) / N, where n = number of relevant documents and N = total number of documents in the collection.
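The Bayes inversion above is a one-liner once the likelihoods and collection counts are known. A minimal sketch, with illustrative names, assuming P(D|R) and P(D|NR) are supplied (the next slide shows how to estimate them):

```python
def discriminant(p_d_given_r, p_d_given_nr, n_relevant, n_total):
    """dis(D) = P(R|D) / P(NR|D), rewritten via Bayes' rule as
    P(D|R)P(R) / (P(D|NR)P(NR)); the common factor P(D) cancels.
    Priors: P(R) = n / N, P(NR) = (N - n) / N."""
    p_r = n_relevant / n_total
    p_nr = 1.0 - p_r
    return (p_d_given_r * p_r) / (p_d_given_nr * p_nr)
```

With equal likelihoods and a 50/50 prior the discriminant is exactly 1, the retrieve/don't-retrieve boundary.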
22
Estimating Probabilities
Now we need to estimate P(D|R) and P(D|NR). If we assume that a document is represented by terms t1, ..., tn, and that these terms are statistically independent, then
    P(D|R) = P(t1|R) * P(t2|R) * ... * P(tn|R)
and similarly for P(D|NR). Note that P(ti|R) is the probability that term ti occurs in a relevant document; it can be estimated from a previously available sample (e.g., through relevance feedback). So, based on the distribution of terms in relevant and non-relevant documents, we can estimate whether the document should be retrieved (i.e., whether dis(D) > 1). Documents that are retrieved can then be ranked by the value of the discriminant.
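Putting the independence assumption and the discriminant together gives a small, runnable sketch. This is a hypothetical illustration assuming the binary independence form, where an absent term contributes (1 - P(t|class)); all names and probabilities are made up for the example.

```python
from math import prod

def doc_likelihood(doc_terms, vocab, p_t):
    """P(D|class) under binary independence: the product over vocabulary
    terms of p_t[t] if t occurs in D, else (1 - p_t[t]).
    p_t maps each term to P(t|class)."""
    present = set(doc_terms)
    return prod(p_t[t] if t in present else 1.0 - p_t[t] for t in vocab)

def dis(doc_terms, vocab, p_rel, p_nonrel, prior_rel):
    """dis(D) = P(D|R)P(R) / (P(D|NR)P(NR)); retrieve D iff dis(D) > 1."""
    num = doc_likelihood(doc_terms, vocab, p_rel) * prior_rel
    den = doc_likelihood(doc_terms, vocab, p_nonrel) * (1.0 - prior_rel)
    return num / den
```

A document containing a term that is much more likely in relevant documents than in non-relevant ones pushes the discriminant above 1 and is therefore retrieved.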
23
Probabilistic Retrieval - Example
Since the discriminant is less than one, document D should not be retrieved
24
Probabilistic Retrieval (cont.)
In practice, we can't build a model for each query. Instead, a general model is built from query-document pairs in the historical (training) data. Then, for a given query Q, the discriminant is computed based only on the conditional probabilities of the query terms: if query term t occurs in D, take P(t|R) and P(t|NR); if query term t does not appear in D, take 1 - P(t|R) and 1 - P(t|NR). Example: Q = t1, t3, t D = t1, t4, t5
25
Probabilistic Models: Advantages and Disadvantages
Advantages: strong theoretical basis; in principle, should supply the best predictions of relevance given the available information; can be implemented similarly to the vector model.
Disadvantages: relevance information is required, or must be "guestimated"; important indicators of relevance may not be terms, though usually only terms are used; optimally requires on-going collection of relevance information.
26
Vector and Probabilistic Models
Both support "natural language" queries, treat documents and queries the same, support relevance feedback searching, and support ranked retrieval. They differ primarily in theoretical basis and in how the ranking is calculated: the vector model assumes relevance (as correlated with similarity), while the probabilistic model relies on relevance judgments or estimates.