5. Vector Space and Probabilistic Retrieval Models


5. Vector Space and Probabilistic Retrieval Models
Many slides in this section are adapted from Prof. Joydeep Ghosh (UT ECE), who in turn adapted them from Prof. Dik Lee (Univ. of Science and Technology, Hong Kong). These notes are based, in part, on notes by Dr. Raymond J. Mooney at the University of Texas at Austin.

Keyword Discrimination Model
- The vector representation of documents can be used as the source of another approach to term weighting.
- Question: what happens if we remove one of the words used as dimensions in the vector space?
- If the average similarity among documents changes significantly, the word was a good discriminator.
- If there is little change, the word is not as helpful and should be weighted less.
- Note that the goal is a representation that makes it easier for queries to discriminate among documents.
- Average similarity can be measured after removing each word from the term-document matrix.
- Any of the similarity measures can be used (we will look at a variety of other similarity measures later).

Keyword Discrimination
- Measuring average similarity (assume there are N documents): sim(D1, D2) is the similarity score for a pair of documents D1 and D2. Averaging over all document pairs directly is computationally expensive.
- A better way to calculate AVG-SIM: compute the centroid D* (the average document vector, i.e., the sum of the document vectors divided by N), then average the similarity of each document to D* (see the formulas below).
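The equations on this slide were images and are not in the transcript; a plausible reconstruction in LaTeX, consistent with the definitions above, is:

```latex
% Pairwise definition of average similarity (expensive: O(N^2) document pairs)
\mathrm{AVG\text{-}SIM} = \frac{1}{N(N-1)} \sum_{i=1}^{N} \sum_{\substack{j=1 \\ j \neq i}}^{N} \mathrm{sim}(D_i, D_j)

% Centroid-based computation (cheaper: O(N) comparisons)
D^{*} = \frac{1}{N} \sum_{i=1}^{N} D_i , \qquad
\mathrm{AVG\text{-}SIM} \approx \frac{1}{N} \sum_{i=1}^{N} \mathrm{sim}(D^{*}, D_i)
```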

Keyword Discrimination
- Discrimination value (discriminant) and term weights: the discrimination value disc_k of term k is the change in average similarity when term k is removed, disc_k = AVG-SIM_k - AVG-SIM, where AVG-SIM_k is the average similarity computed without term k.
- Computing term weights: the new weight for a term k in a document i is the original term frequency of k in i times the discriminant value.
- disc_k > 0 ==> term_k is a good discriminator; disc_k < 0 ==> term_k is a poor discriminator; disc_k = 0 ==> term_k is indifferent.
- A small implementation sketch follows.
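A minimal Python sketch of the whole computation, assuming cosine similarity and a small dense term-document matrix; the variable names and toy counts are illustrative, not taken from the slides.

```python
import numpy as np

def avg_sim(doc_vectors):
    """Centroid-based AVG-SIM: mean cosine similarity of each document to the centroid D*."""
    centroid = doc_vectors.mean(axis=0)
    norms = np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(centroid)
    norms[norms == 0] = 1.0          # guard against all-zero vectors
    return float(np.mean(doc_vectors @ centroid / norms))

def discrimination_values(doc_vectors):
    """disc_k = AVG-SIM_k - AVG-SIM, where AVG-SIM_k is computed with term k removed."""
    base = avg_sim(doc_vectors)
    discs = np.zeros(doc_vectors.shape[1])
    for k in range(doc_vectors.shape[1]):
        reduced = np.delete(doc_vectors, k, axis=1)   # drop dimension k
        discs[k] = avg_sim(reduced) - base
    return discs

# Toy term-frequency matrix: 3 documents x 3 terms (illustrative numbers only).
tf = np.array([[2.0, 0.0, 1.0],
               [1.0, 1.0, 0.0],
               [2.0, 1.0, 3.0]])

disc = discrimination_values(tf)
weights = tf * disc                  # new weight: w_ik = tf_ik * disc_k
print(disc)
print(weights)
```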

Keyword Discrimination - Example
- Using normalized cosine similarity.
- Note: D* for each of the SIM_k values is now computed with only two terms (the term being evaluated has been removed).

Keyword Discrimination - Example
This shows that t1 tends to be a poor discriminator, while t3 is a good discriminator. The new term weights will now reflect the discrimination values for these terms. Note that further normalization can be done to make all term weights positive.

Signal-To-Noise Ratio
- Based on the work of Shannon in the 1940s on information theory.
- Shannon developed a model of communication of messages across a noisy channel; the goal is to devise an encoding of messages that is most robust in the face of channel noise.
- In IR, messages describe the content of documents.
- The amount of information a word carries about a document is inversely proportional to its probability of occurrence.
- The least informative words are those that occur approximately uniformly across the corpus of documents: a word that occurs with similar frequency across many documents (e.g., "the", "and") is less informative than one that occurs with high frequency in one or two documents.
- Shannon used entropy (a logarithmic measure) to measure average information, with noise defined as its negation.

Signal-To-Noise Ratio
- $p_{ik}$ = Prob(term k occurs in document i) = $tf_{ik} / tf_k$, where $tf_k$ is the total frequency of term k in the collection.
- $\mathit{Info}_{ik} = -p_{ik} \log_2 p_{ik}$
- $\mathit{Noise}_{ik} = -p_{ik} \log_2 (1/p_{ik}) = p_{ik} \log_2 p_{ik}$
- Note: here we always take logs to be base 2.
- Note: NOISE is the negation of AVG-INFO (the sum of the Info values over the documents), so only one of these needs to be computed in practice.
- The weight of term k in document i is computed from these quantities (see the example slides below).

Signal-To-Noise Ratio - Example
- $p_{ik} = tf_{ik} / tf_k$
- Note: by definition, if term k does not appear in a document, we take $\mathit{Info}_{ik} = 0$ for that document.
- AVG-INFO_k, the sum of the Info values over the documents, is the "entropy" of term k in the collection.

Signal-To-Noise Ratio - Example
- The weight of term k in document i (a sketch of one plausible computation follows).
- Additional normalization can be performed to obtain values in the range [0, 1].
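The slide's weight formula is not reproduced in the transcript, so the sketch below assumes the weight scales the term frequency tf_ik by the term's average information value AVG-INFO_k; that assumption, and the toy counts, are illustrative only.

```python
import numpy as np

def avg_info(tf):
    """AVG-INFO_k: entropy of term k across the collection (NOISE_k is its negation).

    tf is a docs x terms matrix of raw term frequencies;
    p_ik = tf_ik / tf_k, and Info_ik is taken to be 0 when tf_ik = 0.
    """
    tf_k = tf.sum(axis=0)                              # total frequency of each term
    p = tf / np.where(tf_k > 0, tf_k, 1.0)             # p_ik
    info = -p * np.log2(np.where(p > 0, p, 1.0))       # Info_ik (0 when p_ik = 0)
    return info.sum(axis=0)

# Toy counts: 3 documents x 3 terms (illustrative only; not the slide's table).
tf = np.array([[2.0, 0.0, 1.0],
               [1.0, 3.0, 0.0],
               [2.0, 3.0, 4.0]])

avg_info_k = avg_info(tf)
weights = tf * avg_info_k   # assumed weighting: scale tf_ik by term k's average information
print(avg_info_k)
print(weights)
```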

Probabilistic Term Weights
- The probabilistic model makes explicit distinctions between occurrences of terms in relevant and non-relevant documents.
- Suppose we know: p_i, the probability that term x_i appears in a relevant document, and q_i, the probability that term x_i appears in a non-relevant document.
- With the binary and independence assumptions, the weight of term x_i in document D_k can be computed from p_i and q_i (see the formula below).
- Estimating p_i and q_i requires relevance information: using test queries and test collections to "train" the values of p_i and q_i, or other AI/learning techniques.
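The slide's weight formula is not in the transcript; the standard binary-independence (Robertson/Sparck Jones) term weight it most likely refers to is:

```latex
w_i \;=\; \log \frac{p_i \,(1 - q_i)}{q_i \,(1 - p_i)}
```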

Other Vector Space Similarity Measures (formulas below):
- Simple Matching
- Cosine Coefficient
- Dice's Coefficient
- Jaccard's Coefficient
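The formula images are not in the transcript; these are the commonly used weighted-vector forms of the four coefficients for a query Q and document D with components q_i and d_i (the Dice form reproduces the example values on the next slide):

```latex
\mathrm{sim}_{\text{match}}(Q,D) = \sum_i q_i d_i
\qquad
\mathrm{sim}_{\cos}(Q,D) = \frac{\sum_i q_i d_i}{\sqrt{\sum_i q_i^2}\,\sqrt{\sum_i d_i^2}}

\mathrm{sim}_{\text{Dice}}(Q,D) = \frac{2\sum_i q_i d_i}{\sum_i q_i^2 + \sum_i d_i^2}
\qquad
\mathrm{sim}_{\text{Jaccard}}(Q,D) = \frac{\sum_i q_i d_i}{\sum_i q_i^2 + \sum_i d_i^2 - \sum_i q_i d_i}
```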

Vector Space Similarity Measures
Consider the following two document vectors and the query vector:
- D1 = (0.8, 0.3)
- D2 = (0.2, 0.7)
- Q = (0.4, 0.8)
Computing similarity using Dice's Coefficient gives sim(Q, D1) = 0.73 and sim(Q, D2) = 0.96; Jaccard's Coefficient is computed the same way (a worked check follows).
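A quick numeric check of this example (a sketch; the coefficient formulas assumed are the standard weighted-vector forms shown above):

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def cosine(q, d):
    return dot(q, d) / (math.sqrt(dot(q, q)) * math.sqrt(dot(d, d)))

def dice(q, d):
    return 2 * dot(q, d) / (dot(q, q) + dot(d, d))

def jaccard(q, d):
    return dot(q, d) / (dot(q, q) + dot(d, d) - dot(q, d))

Q, D1, D2 = (0.4, 0.8), (0.8, 0.3), (0.2, 0.7)

print(round(dice(Q, D1), 2), round(dice(Q, D2), 2))        # 0.73 0.96 -- matches the slide
print(round(jaccard(Q, D1), 2), round(jaccard(Q, D2), 2))  # 0.58 0.93
print(round(cosine(Q, D1), 2), round(cosine(Q, D2), 2))    # 0.73 0.98
```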

Vector Space Similarity Measures -- Example

Naïve Implementation (a sketch follows)
1. Convert all documents in collection D to tf-idf weighted vectors d_j over the keyword vocabulary V.
2. Convert the query to a tf-idf weighted vector q.
3. For each d_j in D, compute the score s_j = cosSim(d_j, q).
4. Sort documents by decreasing score.
5. Present the top-ranked documents to the user.
Time complexity: O(|V|·|D|), which is bad for large V and D! For example, |V| = 10,000 and |D| = 100,000 give |V|·|D| = 1,000,000,000.
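A minimal Python sketch of this naive procedure on a toy corpus; the tf-idf weighting and the tokenized documents below are illustrative assumptions, not the lecture's data.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build simple tf-idf vectors for a list of tokenized documents."""
    N = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log(N / df[t], 2) for t in df}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs], idf

def cos_sim(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [["vector", "space", "model"],
        ["probabilistic", "retrieval", "model"],
        ["boolean", "retrieval"]]
doc_vecs, idf = tfidf_vectors(docs)

query = ["probabilistic", "model"]
q_vec = {t: tf * idf.get(t, 0.0) for t, tf in Counter(query).items()}

# Score every document against the query, then sort by decreasing score.
scores = sorted(((cos_sim(q_vec, d), i) for i, d in enumerate(doc_vecs)), reverse=True)
print(scores)   # highest-scoring documents first
```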

Comments on Vector Space Models
- Simple, mathematically based approach.
- Considers both local (tf) and global (idf) word occurrence frequencies.
- Provides partial matching and ranked results.
- Tends to work quite well in practice despite obvious weaknesses.
- Allows efficient implementation for large document collections.

Problems with the Vector Space Model
- Missing semantic information (e.g., word sense).
- Missing syntactic information (e.g., phrase structure, word order, proximity information).
- Assumption of term independence (e.g., ignores synonymy).
- Lacks the control of a Boolean model (e.g., requiring a term to appear in a document). Given a two-term query "A B", it may prefer a document containing A frequently but not B over a document that contains both A and B, but both less frequently.

3. Probabilistic Model
- Attempts to be theoretically sound: tries to predict the probability of a document being relevant, given the query.
- There are many variations; they are usually more complicated to compute than vector space models, and many approximations are usually required.
- Relevance information is required from a random sample of documents and queries (training examples).
- Works about as well as (sometimes better than) vector space approaches.

Basic Probabilistic Retrieval
- Retrieval is modeled as a classification process.
- There are two classes for each query: the relevant and the non-relevant documents (with respect to that query). This could easily be extended to three classes (i.e., add a "don't care" class).
- Given a particular document D, calculate the probability of it belonging to the relevant class; retrieve it if this is greater than the probability of it belonging to the non-relevant class, i.e., retrieve if P(R|D) > P(NR|D).
- Equivalently, rank by a discriminant value (also called a likelihood ratio), P(R|D) / P(NR|D).
- Different ways of estimating these probabilities lead to different models.

Basic Probabilistic Retrieval
- A given query divides the document collection into two sets: relevant and non-relevant.
- If a document D has been selected in response to a query, retrieve the document if dis(D) > 1, where dis(D) = P(R|D) / P(NR|D) is the discriminant of D.
- This criterion can be modified by weighting the two probabilities.
(Diagram: a document D is assigned to the relevant or the non-relevant set by comparing P(R|D) with P(NR|D).)

Estimating Probabilities
- Bayes' Rule can be used to "invert" the conditional probabilities (see the equations below).
- Applying it to the discriminant function expresses dis(D) in terms of P(D|R), P(D|NR), P(R), and P(NR).
- Note that P(R) is the probability that a random document is relevant to the query, and P(NR) = 1 - P(R).
- P(R) = n / N and P(NR) = 1 - P(R) = (N - n) / N, where n is the number of relevant documents and N is the total number of documents in the collection.
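The equation images are not in the transcript; the standard Bayes inversion implied by the definitions above is:

```latex
P(R \mid D) = \frac{P(D \mid R)\,P(R)}{P(D)}, \qquad
P(NR \mid D) = \frac{P(D \mid NR)\,P(NR)}{P(D)}

\mathrm{dis}(D) = \frac{P(R \mid D)}{P(NR \mid D)}
               = \frac{P(D \mid R)\,P(R)}{P(D \mid NR)\,P(NR)}
```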

Estimating Probabilities
- Now we need to estimate P(D|R) and P(D|NR).
- If we assume that a document is represented by terms t1, ..., tn, and that these terms are statistically independent, then P(D|R) factors into a product over the terms (see below), and P(D|NR) can be computed similarly.
- Note that P(ti|R) is the probability that term ti occurs in a relevant document; it can be estimated from a previously available sample (e.g., through relevance feedback).
- So, based on the distribution of terms in relevant and non-relevant documents, we can estimate whether the document should be retrieved (i.e., if dis(D) > 1).
- Note that the retrieved documents can be ranked by the value of the discriminant.
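The factorization implied by the independence assumption, written out:

```latex
P(D \mid R) = \prod_{i=1}^{n} P(t_i \mid R), \qquad
P(D \mid NR) = \prod_{i=1}^{n} P(t_i \mid NR)
```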

Probabilistic Retrieval - Example
Since the discriminant is less than one, document D should not be retrieved.

Probabilistic Retrieval (cont.)
- In practice, we can't build a model for each query.
- Instead, a general model is built from query-document pairs in the historical (training) data.
- Then, for a given query Q, the discriminant is computed based only on the conditional probabilities of the query terms:
- If query term t occurs in D, use P(t|R) and P(t|NR).
- If query term t does not appear in D, use 1 - P(t|R) and 1 - P(t|NR).
- Example: Q = {t1, t3, t4}, D = {t1, t4, t5} (a worked sketch follows).
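A minimal sketch of this query-term discriminant for the example above; the probability estimates below are hypothetical, since the slide's actual numbers are not reproduced in the transcript.

```python
# Hypothetical probability estimates for the query terms (assumed, for illustration only).
p_rel = {"t1": 0.8, "t3": 0.4, "t4": 0.6}      # P(t | R)
p_nonrel = {"t1": 0.3, "t3": 0.35, "t4": 0.5}  # P(t | NR)

query = ["t1", "t3", "t4"]
document = {"t1", "t4", "t5"}

dis = 1.0
for t in query:
    if t in document:
        # query term present in the document
        dis *= p_rel[t] / p_nonrel[t]
    else:
        # query term absent from the document
        dis *= (1 - p_rel[t]) / (1 - p_nonrel[t])

print(dis)  # retrieve D if dis > 1; retrieved documents are ranked by dis
```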

Probabilistic Models
Advantages:
- Strong theoretical basis.
- In principle, should supply the best predictions of relevance given the available information.
- Can be implemented similarly to the vector space model.
Disadvantages:
- Relevance information is required, or must be "guestimated".
- Important indicators of relevance may not be terms, though usually only terms are used.
- Optimally requires ongoing collection of relevance information.

Vector and Probabilistic Models
Both models:
- Support "natural language" queries.
- Treat documents and queries the same way.
- Support relevance feedback searching.
- Support ranked retrieval.
They differ primarily in theoretical basis and in how the ranking is calculated:
- Vector: assumes relevance.
- Probabilistic: relies on relevance judgments or estimates.