Indexing and Representation: The Vector Space Model


Indexing and Representation: The Vector Space Model
- Document represented by a vector of terms
  - Words (or word stems)
  - Phrases (e.g. "computer science")
- Removes words on a "stop list" (documents aren't about "the")
- Terms are often assumed to be uncorrelated
- Correlations between term vectors imply a similarity between documents
- For efficiency, an inverted index of terms is often stored

Document Representation: What Values to Use for Terms?
- Boolean (term present / absent)
- tf (term frequency): count of times the term occurs in the document
  - The more times a term t occurs in document d, the more likely it is that t is relevant to the document
  - Used alone, favors common words and long documents
- df (document frequency): count of documents the term occurs in
  - The more a term t occurs throughout all documents, the more poorly t discriminates between documents
- tf-idf (term frequency x inverse document frequency)
  - A high value indicates that the word occurs more often in this document than average
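As a concrete illustration of these weighting options, the minimal Python sketch below computes tf, df, idf, and tf-idf for a toy three-document corpus. The corpus and the log-based idf variant are illustrative assumptions; the slides do not prescribe a specific formula.

```python
import math
from collections import Counter

# Toy corpus (illustrative only; not from the slides).
docs = {
    "d1": "nova galaxy heat nova".split(),
    "d2": "film role diet fur".split(),
    "d3": "galaxy film role galaxy".split(),
}

# tf: how often each term occurs in each document.
tf = {doc_id: Counter(terms) for doc_id, terms in docs.items()}

# df: how many documents each term occurs in.
df = Counter()
for counts in tf.values():
    df.update(counts.keys())

# idf: one common variant, log(N / df); terms in every document get idf 0.
N = len(docs)
idf = {term: math.log(N / count) for term, count in df.items()}

# tf-idf weight for each (document, term) pair.
tf_idf = {
    doc_id: {term: freq * idf[term] for term, freq in counts.items()}
    for doc_id, counts in tf.items()
}

print(tf_idf["d1"])  # 'nova' gets a high weight: frequent in d1, rare elsewhere
```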

Document Vectors
- Documents are represented as "bags of words"
- Represented as vectors when used computationally
  - A vector is like an array of floating point numbers
  - Has direction and magnitude
  - Each vector holds a place for every term in the collection
  - Therefore, most vectors are sparse

Vector Representation
- Documents and queries are represented as vectors
- Position 1 corresponds to term 1, position 2 to term 2, ..., position t to term t

Document Vectors (example)
[Figure: a term-by-document matrix with terms such as "nova", "galaxy", "heat", "h'wood", "film", "role", "diet", and "fur" across the columns and document ids A-I down the rows; each cell gives that term's weight in that document.]

Assigning Weights
- Want to weight terms highly if they are:
  - frequent in relevant documents ... BUT
  - infrequent in the collection as a whole

Assigning Weights
- tf x idf measure:
  - term frequency (tf)
  - inverse document frequency (idf)

tf x idf
- Normalize the term weights (so longer documents are not unfairly given more weight)

tf x idf normalization
- Normalize the term weights (so longer documents are not unfairly given more weight)
- To normalize usually means to force all values to fall within a certain range, usually between 0 and 1, inclusive
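A minimal sketch of one common normalization: dividing each document's tf-idf weights by the vector's Euclidean length so every document vector has unit length. The exact formula on the original slide is not preserved in the transcript, so this is an illustrative choice rather than the slide's own equation.

```python
import math

def normalize(weights):
    """Scale a document's term weights so the vector has unit length.

    This keeps long documents from dominating simply because they
    contain more (and more frequent) terms.
    """
    length = math.sqrt(sum(w * w for w in weights.values()))
    if length == 0:
        return dict(weights)
    return {term: w / length for term, w in weights.items()}

# Raw tf-idf weights for one document (illustrative values).
doc_weights = {"nova": 2.4, "galaxy": 1.1, "heat": 0.7}
print(normalize(doc_weights))  # all weights now fall between 0 and 1
```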

Vector Space Similarity Measure
- Combine tf x idf into a similarity measure

Computing Similarity Scores

Documents in Vector Space
[Figure: documents D1-D11 plotted as vectors in a term space with axes t1, t2, and t3.]

Computing a similarity score
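The worked example from this slide is not preserved in the transcript, but the similarity score described here is typically the cosine of the angle between the query and document vectors. A minimal sketch, with sparse vectors represented as Python dicts and illustrative weights:

```python
import math

def cosine_similarity(query_vec, doc_vec):
    """Cosine of the angle between a query vector and a document vector.

    Vectors are sparse dicts mapping terms to weights; missing terms
    are treated as zero.
    """
    shared = set(query_vec) & set(doc_vec)
    dot = sum(query_vec[t] * doc_vec[t] for t in shared)
    q_len = math.sqrt(sum(w * w for w in query_vec.values()))
    d_len = math.sqrt(sum(w * w for w in doc_vec.values()))
    if q_len == 0 or d_len == 0:
        return 0.0
    return dot / (q_len * d_len)

# Illustrative weights (not the values from the slide's worked example).
query = {"nova": 1.0, "galaxy": 0.5}
doc = {"nova": 2.4, "galaxy": 1.1, "heat": 0.7}
print(round(cosine_similarity(query, doc), 3))
```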

Similarity Measures (set-based versions of these coefficients are sketched below)
- Simple matching (coordination level match)
- Dice's coefficient
- Jaccard's coefficient
- Cosine coefficient
- Overlap coefficient
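The formulas for these coefficients are not preserved in the transcript. The sketch below uses the standard set-based (binary) formulations over query and document term sets; weighted variants differ in detail but follow the same pattern.

```python
def simple_matching(x, y):
    """Coordination level match: number of terms the two sets share."""
    return len(x & y)

def dice(x, y):
    return 2 * len(x & y) / (len(x) + len(y)) if (x or y) else 0.0

def jaccard(x, y):
    return len(x & y) / len(x | y) if (x | y) else 0.0

def cosine(x, y):
    return len(x & y) / ((len(x) * len(y)) ** 0.5) if (x and y) else 0.0

def overlap(x, y):
    return len(x & y) / min(len(x), len(y)) if (x and y) else 0.0

# Illustrative query and document term sets.
query_terms = {"nova", "galaxy"}
doc_terms = {"nova", "galaxy", "heat", "film"}
for fn in (simple_matching, dice, jaccard, cosine, overlap):
    print(fn.__name__, round(fn(query_terms, doc_terms), 3))
```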

Problems with Vector Space
- There is no real theoretical basis for the assumption of a term space
  - It is more for visualization than having any real basis
  - Most similarity measures work about the same regardless of model
- Terms are not really orthogonal dimensions
  - Terms are not independent of all other terms

Probabilistic Models
- Rigorous formal model attempts to predict the probability that a given document will be relevant to a given query
- Ranks retrieved documents according to this probability of relevance (Probability Ranking Principle)
- Relies on accurate estimates of probabilities for accurate results

Probabilistic Retrieval
- Goes back to the 1960s (Maron and Kuhns)
- Robertson's "Probability Ranking Principle"
  - Retrieved documents should be ranked in decreasing probability that they are relevant to the user's query
- How to estimate these probabilities?
  - Several methods (Model 1, Model 2, Model 3) with different emphases on how estimates are done

Probabilistic Models: Some Notation
- D = all present and future documents
- Q = all present and future queries
- (Di, Qj) = a document-query pair
- x = class of similar documents
- y = class of similar queries
- Relevance is a relation over document-query pairs: the subset of D x Q for which document Di is relevant to query Qj

Probabilistic Models: Logistic Regression
- The probability of relevance is based on logistic regression over a sample set of documents, which determines the values of the coefficients
- At retrieval time, the probability estimate is obtained by applying the fitted model to the six X attribute measures shown on the next slide

Probabilistic Models: Logistic Regression Attributes
- Average absolute query frequency
- Query length
- Average absolute document frequency
- Document length
- Average inverse document frequency
- Inverse document frequency
- Number of terms in common between query and document (logged)

Probabilistic Models: Logistic Regression
- Estimates for relevance are based on a log-linear model with various statistical measures of document content as independent variables
- Log odds of relevance is a linear function of the attributes
- Term contributions are summed
- Probability of relevance is recovered by inverting the log odds (the logistic function)
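A minimal sketch of this scoring scheme: sum the weighted attribute contributions to get the log odds of relevance, then apply the logistic function to turn the log odds into a probability. The attribute values and coefficients below are placeholders; the real coefficients come from fitting the regression on relevance-judged samples, which the slides do not provide.

```python
import math

def relevance_probability(attributes, coefficients, intercept):
    """Logistic-regression relevance estimate.

    Log odds of relevance is a linear function of the attribute values;
    the logistic function converts the log odds back to a probability.
    """
    log_odds = intercept + sum(c * x for c, x in zip(coefficients, attributes))
    return 1.0 / (1.0 + math.exp(-log_odds))

# Six X attribute measures for one (query, document) pair (illustrative values).
x = [0.4, 3.0, 0.2, 250.0, 2.1, 1.1]      # e.g. query/document frequencies, lengths, idf
b = [0.5, -0.1, 0.3, -0.001, 0.4, 0.6]    # placeholder coefficients, not fitted values
print(round(relevance_probability(x, b, intercept=-2.0), 3))
```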

Probabilistic Models
- Advantages:
  - Strong theoretical basis
  - In principle should supply the best predictions of relevance given available information
  - Can be implemented similarly to Vector
- Disadvantages:
  - Relevance information is required -- or is "guestimated"
  - Important indicators of relevance may not be terms -- though terms only are usually used
  - Optimally requires ongoing collection of relevance information

Vector and Probabilistic Models
- Support "natural language" queries
- Treat documents and queries the same
- Support relevance feedback searching
- Support ranked retrieval
- Differ primarily in theoretical basis and in how the ranking is calculated
  - Vector assumes relevance
  - Probabilistic relies on relevance judgments or estimates

Simple Presentation of Results
- Order by similarity
  - Decreasing order of presumed relevance
  - Items retrieved early in the search may help generate feedback via relevance feedback
- Select top k documents
- Select documents within a given similarity threshold of the query
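A minimal sketch of these presentation options, assuming similarity scores have already been computed for each document (the scores and the 0.5 threshold below are illustrative):

```python
def rank_by_similarity(scores):
    """Order documents by decreasing similarity to the query."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

def top_k(scores, k):
    """Keep only the k highest-scoring documents."""
    return rank_by_similarity(scores)[:k]

def within_threshold(scores, threshold):
    """Keep documents whose similarity to the query meets a cutoff."""
    return [(d, s) for d, s in rank_by_similarity(scores) if s >= threshold]

scores = {"A": 0.91, "B": 0.40, "C": 0.77, "D": 0.12}   # illustrative similarities
print(top_k(scores, 2))              # [('A', 0.91), ('C', 0.77)]
print(within_threshold(scores, 0.5)) # documents A and C again
```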

Evaluation
- Relevance
- Evaluation of IR systems
- Precision vs. recall
- Cutoff points
- Test collections / TREC
- Blair & Maron study

What to Evaluate?
- How much is learned about the collection?
- How much is learned about a topic?
- How much of the information need is satisfied?
- How inviting is the system?

What to Evaluate?
- What can be measured that reflects users' ability to use the system? (Cleverdon 66)
  - Coverage of information
  - Form of presentation
  - Effort required / ease of use
  - Time and space efficiency
  - Recall: proportion of relevant material actually retrieved
  - Precision: proportion of retrieved material actually relevant
- Recall and precision together measure effectiveness

Relevance
- In what ways can a document be relevant to a query?
  - Answer a precise question precisely
  - Partially answer the question
  - Suggest a source for more information
  - Give background information
  - Remind the user of other knowledge
  - Others ...

Standard IR Evaluation
- Precision = (# relevant retrieved) / (# retrieved)
- Recall = (# relevant retrieved) / (# relevant in collection)
[Figure: the collection, the set of retrieved documents, and the set of relevant documents, with their overlap giving the # relevant retrieved.]
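A minimal sketch of these two measures for a single query, using the counts from the diagram (retrieved, relevant, and relevant retrieved); the document ids are illustrative:

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall for one query.

    retrieved: ids returned by the system; relevant: ids judged relevant.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    relevant_retrieved = retrieved & relevant
    precision = len(relevant_retrieved) / len(retrieved) if retrieved else 0.0
    recall = len(relevant_retrieved) / len(relevant) if relevant else 0.0
    return precision, recall

# Illustrative run: 4 documents retrieved, 5 relevant in the collection.
print(precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d7", "d8", "d9"]))
# (0.5, 0.4): half of what was retrieved is relevant; 2 of 5 relevant docs found.
```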

Precision/Recall Curves
- There is a tradeoff between precision and recall
- So measure precision at different levels of recall
[Figure: precision plotted against recall, with measured points along the curve.]
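One common way to obtain such points, sketched below under the assumption that the system returns a ranked list and precision is recorded each time another relevant document is found (i.e. each time recall increases):

```python
def precision_at_recall_points(ranked_ids, relevant):
    """Precision measured each time recall increases, scanning a ranked list.

    Returns (recall, precision) pairs, one per relevant document retrieved.
    """
    relevant = set(relevant)
    points, found = [], 0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            found += 1
            points.append((found / len(relevant), found / rank))
    return points

ranking = ["d3", "d9", "d1", "d8", "d2", "d7"]          # system output, best first
print(precision_at_recall_points(ranking, ["d1", "d3", "d7"]))
# [(0.33, 1.0), (0.67, 0.67), (1.0, 0.5)] -- precision falls as recall rises
```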

Precision/Recall Curves
- Difficult to determine which of these two hypothetical results is better:
[Figure: two hypothetical precision/recall curves plotted on the same axes.]

Precision/Recall Curves

Document Cutoff Levels
- Another way to evaluate:
  - Fix the number of documents retrieved at several levels: top 5, top 10, top 20, top 50, top 100, top 500
  - Measure precision at each of these levels
  - Take the (weighted) average over results
- This is a way to focus on high precision
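A minimal sketch of precision at fixed cutoff levels; the ranking and relevance judgments are illustrative:

```python
def precision_at_k(ranked_ids, relevant, k):
    """Precision over the top k documents of a ranked list."""
    top = ranked_ids[:k]
    hits = sum(1 for doc_id in top if doc_id in set(relevant))
    return hits / k

ranking = ["d3", "d9", "d1", "d8", "d2", "d7"]   # illustrative ranking, best first
relevant = ["d1", "d3", "d7"]
for k in (5, 10, 20):                            # a few of the cutoff levels above
    print(f"P@{k} = {precision_at_k(ranking, relevant, k):.2f}")
```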

The E-Measure
- Combines precision and recall into one number (van Rijsbergen 79)
  - P = precision
  - R = recall
  - b = measure of the relative importance of P or R
- For example, b = 0.5 means the user is twice as interested in precision as in recall
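The formula itself is not preserved in the transcript; the sketch below uses the standard formulation E = 1 - (1 + b^2)PR / (b^2 P + R), which matches the description on the slide (lower E is better, 0 is perfect):

```python
def e_measure(precision, recall, b=1.0):
    """van Rijsbergen's E-measure: lower is better, 0 is perfect.

    b < 1 weights precision more heavily; b = 0.5 treats precision as
    twice as important as recall.
    """
    if precision == 0 or recall == 0:
        return 1.0
    f_b = (1 + b * b) * precision * recall / (b * b * precision + recall)
    return 1.0 - f_b

print(round(e_measure(0.5, 0.4, b=0.5), 3))   # emphasizes the higher precision
print(round(e_measure(0.5, 0.4, b=2.0), 3))   # emphasizes the lower recall, so E is worse
```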

TREC
- Text REtrieval Conference/Competition
  - Run by NIST (National Institute of Standards & Technology)
  - 1997 was the 6th year
- Collection: 3 gigabytes, >1 million documents
  - Newswire & full-text news (AP, WSJ, Ziff)
  - Government documents (Federal Register)
- Queries + relevance judgments
  - Queries devised and judged by "Information Specialists"
  - Relevance judgments done only for those documents retrieved -- not the entire collection!
- Competition
  - Various research and commercial groups compete
  - Results judged on precision and recall, going up to a recall level of 1000 documents

Sample TREC queries (topics)
Number: 168
Topic: Financing AMTRAK
Description: A document will address the role of the Federal Government in financing the operation of the National Railroad Transportation Corporation (AMTRAK).
Narrative: A relevant document must provide information on the government's responsibility to make AMTRAK an economically viable entity. It could also discuss the privatization of AMTRAK as an alternative to continuing government subsidies. Documents comparing government subsidies given to air and bus transportation with those provided to AMTRAK would also be relevant.

TREC
- Benefits:
  - Made research systems scale to large collections (pre-WWW)
  - Allows for somewhat controlled comparisons
- Drawbacks:
  - Emphasis on high recall, which may be unrealistic for what most users want
  - Very long queries, also unrealistic
  - Comparisons still difficult to make, because systems are quite different on many dimensions
  - Focus on batch ranking rather than interaction
  - No focus on the WWW

TREC Results
- Differ each year
- For the main track:
  - Best systems not statistically significantly different
  - Small differences sometimes have big effects
    - how good the hyphenation model was
    - how document length was taken into account
  - Systems were optimized for longer queries and all performed worse for shorter, more realistic queries
- Excitement is in the new tracks
  - Interactive
  - Multilingual
  - NLP

Blair and Maron 1985
- A highly influential paper; a classic study of retrieval effectiveness
  - Earlier studies were on unrealistically small collections
- Studied an archive of documents for a legal suit
  - ~350,000 pages of text
  - 40 queries
  - Focus on high recall
  - Used IBM's STAIRS full-text system
- Main result: the system retrieved less than 20% of the relevant documents for a particular information need when the lawyers thought they had 75%
  - But many queries had very high precision

Blair and Maron, cont.
- Why recall was low:
  - Users can't foresee the exact words and phrases that will indicate relevant documents
    - "accident" was referred to by those responsible as "event," "incident," "situation," "problem," ...
    - differing technical terminology
    - slang, misspellings
  - Perhaps the value of higher recall decreases as the number of relevant documents grows, so more detailed queries were not attempted once the users were satisfied
