Information Retrieval and Web Search
IR models: Vector Space Model
Instructor: Rada Mihalcea
[Note: Some slides in this set were adapted from an IR course taught by Ray Mooney at UT Austin (who in turn adapted them from Joydeep Ghosh), and from an IR course taught by Chris Manning at Stanford.]

Ranked Retrieval
Thus far, our queries have all been Boolean: documents either match or don't.
Good for expert users with a precise understanding of their needs and the collection.
Also good for applications: applications can easily consume 1000s of results.
Not good for the majority of users:
– Most users are incapable of writing Boolean queries (or they are, but they think it's too much work).
– Most users don't want to wade through 1000s of results. This is particularly true of web search.

Problem with Boolean Search
Boolean queries often result in either too few (= 0) or too many (1000s) results.
– Query 1: "standard user dlink 650" → 200,000 hits
– Query 2: "standard user dlink 650 no card found" → 0 hits
It takes a lot of skill to come up with a query that produces a manageable number of hits: AND gives too few; OR gives too many.

Ranked Retrieval Models
Rather than a set of documents satisfying a query expression, in ranked retrieval the system returns an ordering over the (top) documents in the collection for a query.
Free text queries: rather than a query language of operators and expressions, the user's query is just one or more words in a human language.
In principle, these are two separate choices, but in practice ranked retrieval has normally been associated with free text queries, and vice versa.

Not a Problem in Ranked Retrieval
When a system produces a ranked result set, large result sets are not an issue.
– Indeed, the size of the result set is not an issue.
– We just show the top k (≈ 10) results.
– We don't overwhelm the user.
– Premise: the ranking algorithm works.

Scoring as the Basis of Ranked Retrieval
We wish to return, in order, the documents most likely to be useful to the searcher.
How can we rank-order the documents in the collection with respect to a query?
Assign a score – say in [0, 1] – to each document. This score measures how well the document and query "match".

Query-Document Matching Scores
We need a way of assigning a score to a query/document pair.
Let's start with a one-term query:
– If the query term does not occur in the document, the score should be 0.
– The more frequent the query term in the document, the higher the score should be.

Take 1: Jaccard Coefficient
Jaccard: a commonly used measure of the overlap of two sets A and B:
Jaccard(A, B) = |A ∩ B| / |A ∪ B|
Jaccard(A, A) = 1
Jaccard(A, B) = 0 if A ∩ B = ∅
A and B don't have to be the same size. It always assigns a number between 0 and 1.

Exercise: Jaccard Coefficient
What is the query-document match score that the Jaccard coefficient computes for each of the two documents below?
Query: march of dimes
Document 1: caesar died in march
Document 2: the long march
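A minimal Python sketch of Jaccard scoring, applied to the exercise above; the whitespace tokenization is an illustrative simplification:

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard coefficient |A intersect B| / |A union B|; 0.0 for two empty sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

query = set("march of dimes".split())
doc1 = set("caesar died in march".split())
doc2 = set("the long march".split())

print(jaccard(query, doc1))  # 1/6 ~ 0.167 ("march" is the only shared term)
print(jaccard(query, doc2))  # 1/5 = 0.2
```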

Issues with Jaccard for Scoring
It does not consider term frequency (how many times a term occurs in a document).
Rare terms in a collection are more informative than frequent terms; Jaccard does not consider this information.
We also need a more sophisticated way of normalizing for length.

Recall (from Boolean Retrieval): Binary Term-Document Incidence Matrix
Each document is represented by a binary vector ∈ {0,1}^|V|.

Term-Document Count Matrices
Consider the number of occurrences of a term in a document: each document is a count vector, i.e., a column of the term-document count matrix.

Vector-Space Model
t distinct terms remain after preprocessing: the unique terms that form the VOCABULARY.
These "orthogonal" terms form a vector space. Dimension = t = |vocabulary|.
– 2 terms → bi-dimensional; …; t terms → t-dimensional.
Each term i in a document or query j is given a real-valued weight w_ij.
Both documents and queries are expressed as t-dimensional vectors: d_j = (w_1j, w_2j, …, w_tj).

Vector-Space Model
Query as vector: we regard the query as a short document.
We return the documents ranked by the closeness of their vectors to the query, also represented as a vector.
The vector-space model was developed in the SMART system (Salton, c. 1970) and is standardly used by TREC participants and web IR systems.

Graphic Representation
Example (a three-dimensional vector space with axes T1, T2, T3):
D1 = 2T1 + 3T2 + 5T3
D2 = 3T1 + 7T2 + T3
Q = 0T1 + 0T2 + 2T3
Is D1 or D2 more similar to Q? How do we measure the degree of similarity? Distance? Angle? Projection?

Document Collection Representation
A collection of n documents can be represented in the vector space model by a term-document matrix. An entry in the matrix corresponds to the "weight" of a term in the document; zero means the term has no significance in the document, or simply doesn't occur in it.

       T_1    T_2   …   T_t
D_1    w_11   w_21  …   w_t1
D_2    w_12   w_22  …   w_t2
 :      :      :         :
D_n    w_1n   w_2n  …   w_tn
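A small sketch of building such a term-document matrix; the toy corpus and the raw-count weighting are illustrative assumptions, not from the slides:

```python
# Build an n x t term-document matrix from a tiny corpus.
corpus = {
    "D1": "new york times",
    "D2": "new york post",
    "D3": "los angeles times",
}

# The vocabulary: all unique terms, one axis per term.
vocabulary = sorted({term for text in corpus.values() for term in text.split()})

# Raw term-frequency weights here; any weighting scheme (e.g. tf-idf) fits.
matrix = {
    doc: [text.split().count(term) for term in vocabulary]
    for doc, text in corpus.items()
}

print(vocabulary)
for doc, row in matrix.items():
    print(doc, row)
```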

Term Frequency tf
The term frequency tf_ij of term i in document j is defined as the number of times that term i occurs in document j.
More frequent terms in a document are more important, i.e., more indicative of the topic.
We may want to normalize term frequency: tf_ij = freq_ij / max_k{freq_kj}, i.e., divide by the frequency of the most frequent term in the document.
We want to use tf when computing query-document match scores. But how?

Term Frequency tf
Raw term frequency is not what we want:
– A document with 10 occurrences of the term is more relevant than a document with 1 occurrence of the term.
– But not 10 times more relevant.
Relevance does not increase proportionally with term frequency.

Document Frequency
Rare terms are more informative than frequent terms (recall stop words).
Consider a term in the query that is rare in the collection (e.g., arachnocentric). A document containing this term is very likely to be relevant to the query arachnocentric.
→ We want a high weight for rare terms like arachnocentric.

Document Frequency (continued)
Frequent terms are less informative than rare terms.
Consider a query term that is frequent in the collection (e.g., high, increase, line). A document containing such a term is more likely to be relevant than a document that doesn't, but it's not a sure indicator of relevance.
→ For frequent terms like high, increase, and line we still want positive weights, but lower weights than for rare terms.
We will use document frequency (df) to capture this.

Idf Weight
df_i is the document frequency of term i: the number of documents that contain term i.
– df_i is an inverse measure of the informativeness of term i.
– df_i ≤ N, where N is the number of documents in the collection.
We define the idf (inverse document frequency) of term i by:
idf_i = log(N / df_i)
We use log(N/df_i) instead of N/df_i to "dampen" the effect of a high df (as compared to tf).

idf Example, suppose N = 1 million

term        df_t         idf_t
calpurnia   1            6
animal      100          4
sunday      1,000        3
fly         10,000       2
under       100,000      1
the         1,000,000    0

(Here idf_t = log10(N/df_t).) There is one idf value for each term i in a collection.
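The table can be reproduced with a few lines of Python, assuming base-10 logarithms, which is what the values above imply:

```python
import math

N = 1_000_000
df = {"calpurnia": 1, "animal": 100, "sunday": 1_000,
      "fly": 10_000, "under": 100_000, "the": 1_000_000}

for term, df_t in df.items():
    # idf_t = log10(N / df_t): rare terms get high idf, ubiquitous terms get 0.
    print(f"{term:10s} df={df_t:>9,} idf={math.log10(N / df_t):.0f}")
```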

Collection vs. Document Frequency
The collection frequency of i is the number of occurrences of i in the collection, counting multiple occurrences.
Example: which word is a better search term (and should get a higher weight)?

Word        Collection frequency   Document frequency
insurance   10,440                 3,997
try         10,422                 8,760

tf-idf Weighting
The tf-idf weight of a term is the product of its tf weight and its idf weight: w_ij = tf_ij × idf_i.
Best known weighting scheme in information retrieval.
– Theoretically proven to work well (Papineni, NAACL 2001).
– Note: the "-" in tf-idf is a hyphen, not a minus sign! Alternative names: tf.idf, tf x idf.
Increases with the number of occurrences within a document.
Increases with the rarity of the term in the collection.

Computing tf-idf: An Example
Given a document containing terms with the following frequencies: A(3), B(2), C(1).
Assume the collection contains 10,000 documents, and the document frequencies of these terms are: A(50), B(1300), C(250).
Then (using max-normalized tf and natural-log idf):
A: tf = 3/3; idf = log(10000/50) = 5.3; tf-idf = 5.3
B: tf = 2/3; idf = log(10000/1300) = 2.0; tf-idf = 1.3
C: tf = 1/3; idf = log(10000/250) = 3.7; tf-idf = 1.2
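A short sketch reproducing this computation; it assumes max-normalized tf and natural-log idf, which is what yields the numbers above (up to rounding):

```python
import math

N = 10_000
freq = {"A": 3, "B": 2, "C": 1}      # term counts in the document
df = {"A": 50, "B": 1300, "C": 250}  # document frequencies in the collection

max_freq = max(freq.values())
for term in freq:
    tf = freq[term] / max_freq       # max-normalized term frequency
    idf = math.log(N / df[term])     # natural log reproduces idf = 5.3 for A
    print(f"{term}: tf={tf:.2f} idf={idf:.1f} tf-idf={tf * idf:.1f}")
# A: tf=1.00 idf=5.3 tf-idf=5.3
# B: tf=0.67 idf=2.0 tf-idf=1.4  (the slide's 1.3 rounds idf to 2.0 first)
# C: tf=0.33 idf=3.7 tf-idf=1.2
```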

Binary → Count → Weight Matrix
Each document is now represented by a real-valued vector of tf-idf weights ∈ R^|V|.

Documents as Vectors
So we have a |V|-dimensional vector space. Terms are axes of the space; documents are points or vectors in this space.
Very high-dimensional: tens of millions of dimensions when you apply this to a web search engine.
These are very sparse vectors: most entries are zero.

Query as Vector
The query vector is typically treated as a document and also tf-idf weighted.
An alternative is for the user to supply weights for the given query terms.

Similarity Measure
We now have vectors for all documents in the collection and a vector for the query; how do we compute similarity?
A similarity measure is a function that computes the degree of similarity between two vectors.
Using a similarity measure between the query and each document:
– It is possible to rank the retrieved documents in the order of presumed relevance.
– It is possible to enforce a threshold so that the size of the retrieved set can be controlled.

First Cut: Euclidean Distance
The distance between vectors d_1 and d_2 is the length of the difference vector |d_1 − d_2|: the Euclidean distance.
Exercise: determine the Euclidean distance between the vectors (0, 3, 2, 1, 10) and (2, 7, 1, 0, 0). (See the sketch after the next slide.)
Why is this not a great idea? We still haven't dealt with the issue of length normalization: long documents would be more similar to each other by virtue of length, not topic.

Second Cut: Manhattan Distance
Or the "city block" measure, based on the idea that in American cities you generally cannot follow a direct line between two points.
Uses the formula: d(x, y) = Σ_i |x_i − y_i|
Exercise: determine the Manhattan distance between the vectors (0, 3, 2, 1, 10) and (2, 7, 1, 0, 0).
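A small sketch answering both distance exercises (this one and the Euclidean one on the previous slide):

```python
import math

x = (0, 3, 2, 1, 10)
y = (2, 7, 1, 0, 0)

# Euclidean: square root of the sum of squared componentwise differences.
euclidean = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
# Manhattan: sum of absolute componentwise differences.
manhattan = sum(abs(a - b) for a, b in zip(x, y))

print(euclidean)  # sqrt(4 + 16 + 1 + 1 + 100) = sqrt(122) ~ 11.05
print(manhattan)  # 2 + 4 + 1 + 1 + 10 = 18
```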

Third Cut: Inner Product
The similarity between the vectors for documents d_1 and d_2 can be computed as the vector inner product:
sim(d_1, d_2) = d_1 · d_2 = Σ_{i=1..t} w_i1 · w_i2
where w_ij is the weight of term i in document j.
For binary vectors, the inner product is the number of matched query terms in the document (the size of the intersection).
For weighted term vectors, it is the sum of the products of the weights of the matched terms.

Properties of Inner Product
Favors long documents with a large number of unique terms (again, the issue of normalization).
Measures how many terms matched, but not how many terms are not matched.

Cosine Similarity
The distance between vectors d_1 and d_2 is captured by the cosine of the angle θ between them.
Note: this is a similarity measure, not a distance.

Cosine Similarity
cos(d_j, d_k) = (d_j · d_k) / (|d_j| |d_k|) = Σ_i w_ij w_ik / ( sqrt(Σ_i w_ij²) · sqrt(Σ_i w_ik²) )
where w_ij is the weight of term i in document j.
This is the cosine of the angle between the two vectors. The denominator involves the lengths of the vectors, so the cosine measure is also known as the normalized inner product.

Exercise
Documents: Sense and Sensibility (SaS), Pride and Prejudice (PaP), Wuthering Heights (WH). Vocabulary of four terms. Measure cos(SaS, PaP) and cos(SaS, WH).

Term frequencies tf:
term        SaS   PaP   WH
affection   115   58    20
jealous     10    7     11
gossip      2     0     6
wuthering   0     0     38

Log-frequency weights (1 + log10 tf, or 0 if tf = 0):
term        SaS    PaP    WH
affection   3.06   2.76   2.30
jealous     2.00   1.85   2.04
gossip      1.30   0      1.78
wuthering   0      0      2.58

Exercise Rank the following by decreasing cosine similarity: –Two documents that have only frequent words (the, a, an, of) in common. –Two documents that have no words in common. –Two documents that have many rare words in common (wingspan, tailfin).

Cosine Similarity vs. Inner Product
Cosine similarity measures the cosine of the angle between two vectors: the inner product normalized by the vector lengths.
CosSim(d_j, q) = (d_j · q) / (|d_j| |q|)
InnerProduct(d_j, q) = d_j · q
D1 = 2T1 + 3T2 + 5T3:  CosSim(D1, Q) = 10 / √((4+9+25)(0+0+4)) = 0.81
D2 = 3T1 + 7T2 + 1T3:  CosSim(D2, Q) = 2 / √((9+49+1)(0+0+4)) = 0.13
Q = 0T1 + 0T2 + 2T3
D1 is 6 times better than D2 using cosine similarity, but only 5 times better using inner product.
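A short sketch reproducing this comparison in Python:

```python
import math

def inner(u, v):
    """Inner product: sum of componentwise products."""
    return sum(a * b for a, b in zip(u, v))

def cos_sim(u, v):
    """Inner product normalized by the two vector lengths."""
    return inner(u, v) / (math.sqrt(inner(u, u)) * math.sqrt(inner(v, v)))

D1 = (2, 3, 5)   # 2*T1 + 3*T2 + 5*T3
D2 = (3, 7, 1)
Q  = (0, 0, 2)

print(inner(D1, Q), inner(D2, Q))      # 10 vs 2   -> 5x better
print(cos_sim(D1, Q), cos_sim(D2, Q))  # ~0.81 vs ~0.13 -> ~6x better
```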

Comments on Vector Space Models
Simple, mathematically based approach.
Considers both local (tf) and global (idf) word occurrence frequencies.
Provides partial matching and ranked results.
Tends to work quite well in practice despite obvious weaknesses.
Allows efficient implementation for large document collections.

Problems with Vector Space Model
Missing semantic information (e.g., word sense).
Missing syntactic information (e.g., phrase structure, word order, proximity information).
Assumption of term independence.
Lacks the control of a Boolean model (e.g., requiring a term to appear in a document).
– Given a two-term query "A B", it may prefer a document containing A frequently but not B over a document that contains both A and B, but both less frequently.