IR Theory: IR Basics & Vector Space Model
IR Approach
  - Information seeker: information need -> query string
  - Authors: concepts -> document text
  - Is the document relevant to the query?
  - Why is IR hard? Because language is hard!
Search Engine
IR System Architecture
  [Diagram: Documents and Query each pass through a Representation Module, producing a Document Representation and a Query Representation; the Matching/Ranking Module compares the two and returns Results.]
Step 1: Representation
  [Diagram: IR system architecture (Documents, Query, Representation Module, Matching/Ranking Module, Results)]
How to represent text?
  - How do we represent the complexities of language?
  - Computers don’t “understand” documents or queries.
  - Simple, yet effective approach: “bag of words”
      - Treat all the words in a document as index terms for that document.
      - Disregard order, structure, meaning, etc. of the words.

Bag of Words
  Example document:

    McDonald's slims down spuds
    Fast-food chain to reduce certain types of fat in its french fries with new cooking oil.

    NEW YORK (CNN/Money) - McDonald's Corp. is cutting the amount of "bad" fat in its french fries nearly in half, the fast-food chain said Tuesday as it moves to make all its fried menu items healthier.

    But does that mean the popular shoestring fries won't taste the same? The company says no. "It's a win-win for our customers because they are getting the same great french-fry taste along with an even healthier nutrition profile," said Mike Roberts, president of McDonald's USA.

    But others are not so sure. McDonald's will not specifically discuss the kind of oil it plans to use, but at least one nutrition expert says playing with the formula could mean a different taste.

    Shares of Oak Brook, Ill.-based McDonald's (MCD: down $0.54 to $23.22, Research, Estimates) were lower Tuesday afternoon. It was unclear Tuesday whether competitors Burger King and Wendy's International (WEN: down $0.80 to $34.91, Research, Estimates) would follow suit. Neither company could immediately be reached for comment.
    …

  Resulting bag of words (term counts):
    16 × said
    14 × McDonalds
    12 × fat
    11 × fries
     8 × new
     6 × company, french, nutrition
     5 × food, oil, percent, reduce, taste, Tuesday
    …
Bag-of-Words Representation
  Document 1: The quick brown fox jumped over the lazy dog’s back.
  Document 2: Now is the time for all good men to come to the aid of their party.
  Stopword list: for, is, of, the, to

  Term     Document 1   Document 2
  quick        1            0
  brown        1            0
  fox          1            0
  over         1            0
  lazy         1            0
  dog          1            0
  back         1            0
  now          0            1
  time         0            1
  all          0            1
  good         0            1
  men          0            1
  come         0            1
  jump         1            0
  aid          0            1
  their        0            1
  party        0            1
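The representation step above can be sketched in a few lines of Python. This is a minimal illustration, not the course's implementation: the function name `bag_of_words` and the tokenizing regular expression are assumptions, and no stemming is applied, so "jumped" is kept as-is rather than reduced to the slide's index term "jump".

```python
import re
from collections import Counter

# Stopword list taken from the slide.
STOPWORDS = {"for", "is", "of", "the", "to"}

def bag_of_words(text):
    """Lowercase, tokenize, strip possessive 's, drop stopwords, count terms.

    Hypothetical helper for illustration; no stemming is performed.
    """
    normalized = text.lower().replace("\u2019", "'")  # curly to straight apostrophe
    tokens = re.findall(r"[a-z]+(?:'s)?", normalized)
    terms = [t[:-2] if t.endswith("'s") else t for t in tokens]
    return Counter(t for t in terms if t not in STOPWORDS)

doc1 = "The quick brown fox jumped over the lazy dog's back."
doc2 = "Now is the time for all good men to come to the aid of their party."

bag1 = bag_of_words(doc1)
bag2 = bag_of_words(doc2)
print(sorted(bag1))
print(sorted(bag2))
```

Word order is discarded by construction: a `Counter` records only which terms occur and how often.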
Step 2: Term Weighting
  [Diagram: IR system architecture (Documents, Query, Representation Module, Matching/Ranking Module, Results)]
Term Weight: What & How?
  - What is a term weight?
      - A numerical estimate of term importance
  - How should we estimate term importance?
      - Terms that appear often in a document should get high weights.
          - The more often a document contains the term “dog”, the more likely it is that the document is “about” dogs.
      - Terms that appear in many documents should get low weights.
          - Words like “the”, “a”, “of” appear in (nearly) all documents.
      - Term frequencies in long documents should count less than those in short documents.
  - How do we compute it?
      - Term frequency (tf)
      - Inverse document frequency (idf)
      - Document length (dl)
Step 3: Matching/Ranking
  [Diagram: IR system architecture (Documents, Query, Representation Module, Document Representation, Query Representation, Matching/Ranking Module, Results)]
Boolean vs. Vector Space Model
  - Boolean model
      - Based on the notion of sets
      - Does not impose a ranking on retrieved documents
      - Documents are retrieved only if they satisfy the Boolean conditions specified in the query
      - Exact match
  - Vector space model
      - Based on geometry: the notion of vectors in high-dimensional space
      - Documents are ranked by their similarity to the query
      - Best/partial match
Boolean Model: Overview
  - Weights assigned to terms are either “0” or “1”
      - “0” represents “absence”: the term isn’t in the document
      - “1” represents “presence”: the term is in the document
  - Build queries by combining terms with the Boolean operators AND, OR, NOT
  - The system returns all documents that satisfy the query
  [Venn diagrams of sets A and B: A OR B, A AND B, A AND NOT(B)]
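Because the Boolean model is set-based, the three operators map directly onto set union, intersection, and difference. The sketch below assumes a hypothetical three-document collection and a hypothetical helper `postings`; it is meant only to illustrate the idea.

```python
# Hypothetical three-document collection; index terms are already extracted.
docs = {
    1: {"good", "party"},
    2: {"democratic", "party"},
    3: {"good", "time"},
}

# Inverted index: term -> set of ids of the documents that contain it
# (membership in the set plays the role of weight "1").
index = {}
for doc_id, terms in docs.items():
    for term in terms:
        index.setdefault(term, set()).add(doc_id)

def postings(term):
    """Return the ids of documents containing the term (empty set if none)."""
    return index.get(term, set())

and_result = postings("party") & postings("good")        # party AND good
or_result = postings("party") | postings("good")         # party OR good
not_result = postings("party") - postings("democratic")  # party AND NOT(democratic)

print(and_result, or_result, not_result)
```

Every document in a result set satisfies the query exactly, and the sets carry no ordering, which is why the Boolean model returns unranked results.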
Boolean Model: Strength
  - Boolean operators define the relationship between query terms.
      - AND → terms/concepts that are not equivalent/similar
          - party AND good: good party
          - Retrieves records that include all AND terms → narrows the search
      - OR → related terms, synonyms
          - party AND (good OR excellent OR wild): good party, excellent party, wild party
          - Retrieves records that include any of the OR terms → broadens the search
      - NOT → antonyms, alternate terms for polysemes
          - party NOT democratic: excludes “Democratic party”
          - Eliminates records that include the NOT term → narrows the search
  - Precise, if you know the right strategies: which concepts to combine or exclude, and when to narrow or broaden
  - Efficient for the computer
Boolean Model: Weakness
  - Natural language is far more complex
      - Boolean logic is insufficient to capture the richness of language
      - AND “discovers” nonexistent relationships
          - Terms may sit in different sentences, paragraphs, …
          - “Money is good, but I won’t be party to stealing.”
      - Guessing terminology for OR is hard
          - good, nice, excellent, outstanding, awesome, …
      - Guessing terms to exclude is even harder!
          - Democratic party, party to a lawsuit, …
  - No control over the size of the result set
      - Too many documents or none
      - All documents in the result set are considered “equally good”
  - No partial matching
      - Documents that “don’t quite match” the query may also be useful.
Vector Space Model: Pros & Cons
  - Pros
      - Non-binary term weights
      - Partial matching
      - Ranked results
      - Easy query formulation
      - Query expansion
  - Cons
      - Term relationships ignored
      - Term order ignored
      - No wildcard
      - Problematic with long documents
      - Similarity ≠ Relevance
Vector Space Model: Representation
  - “Bags of words” can be represented as vectors
      - Computational efficiency
      - Ease of manipulation
      - Geometric metaphor: “arrows”
  - A vector is a set of values recorded in any consistent order
  - Example: “The quick brown fox jumped over the lazy dog’s back”

      Position   Term    Count
      1st        back    1
      2nd        brown   1
      3rd        dog     1
      4th        fox     1
      5th        jump    1
      6th        lazy    1
      7th        over    1
      8th        quick   1
      9th        the     2

      Vector: (1, 1, 1, 1, 1, 1, 1, 1, 2)
Vector Space Model: Ranked Retrieval
  - Order documents by “relevance”
      - Relevance = how likely they are to be relevant to the information need
      - Some documents are “better” than others
      - Users can decide when to stop reading
  - Best (partial) match
      - Documents need not have all query terms
      - Documents with more query terms should be “better”
  - Estimate relevance with query-document similarity
      - Treat the query as if it were a document
          - Create a query bag-of-words
          - Compute term weights
      - Find its similarity to each document
      - Rank order the documents by similarity
      - Works surprisingly well
Vector Space Model: 3-D Example
  - A vector A in a 3-dimensional space
      - Represented with its initial point at the origin of a rectangular coordinate system
      - Projections of A on the x, y, and z axes: Ax, Ay, and Az
          - the (rectangular) components of A in the x, y, and z directions
          - each axis represents a term (e.g., x = all, y = brown, z = cat)
  [Figure: vector A drawn from the origin, with components Ax, Ay, Az along the x, y, and z axes]
Vector Space Model: Postulate
  - Documents that are “close together” in vector space “talk about” the same things
  - Therefore, retrieve documents based on how close the document is to the query (i.e., similarity ~ “closeness”)
  [Figure: document vectors d4 and d5 plotted on term axes t1 and t2, separated by angle θ]
Vector Space Model: Example
  Query: What is information retrieval?
    Q: information 1, retrieval 1
  D1: Information retrieval seminars
  D2: Retrieval seminars and Information Retrieval
  D3: Information seminar

  Index term          q    d1   d2   d3
  t1 (information)    1    1    1    1
  t2 (retrieval)      1    1    2    0
  t3 (seminar)        0    1    1    1

  [Figure: q, d1, d2, and d3 drawn as vectors over the term axes t1 and t2, with lengths |q| = √2, |d1| = √3, |d2| = √6, |d3| = √2; an angle of 60 is marked between d3 and q, and the other angle labels read 36 and 29]
Similarity Measures: Set-based
  - Measures: simple matching function, Dice’s coefficient, Jaccard’s coefficient
  - Example sets
      A = (wd1, wd2, wd3, wd4, wd5)
      B = (wd2, wd4, wd6)
  - A ∩ B: intersection of A and B
      - the set of elements that belong to both A and B
      - A ∩ B = (wd2, wd4)
  - A ∪ B: union of A and B
      - the set of elements that belong to either A or B
      - A ∪ B = (wd1, wd2, wd3, wd4, wd5, wd6)
  - |A|: cardinality of A
      - the number of elements in A
      - |A| = 5, |B| = 3, |A ∩ B| = 2, |A ∪ B| = 6
  - Similarity scores
      Simple:  |A ∩ B| = 2
      Dice:    2 * |A ∩ B| / (|A| + |B|) = 2*2/8 = 1/2
      Jaccard: |A ∩ B| / |A ∪ B| = 2/6 = 1/3
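The three set-based measures translate almost literally into Python set operations. The function names below are illustrative choices, and the sketch assumes non-empty sets (Dice and Jaccard divide by zero when both sets are empty).

```python
def simple_match(a, b):
    """Simple matching: size of the intersection."""
    return len(a & b)

def dice(a, b):
    """Dice's coefficient: 2 * |A ∩ B| / (|A| + |B|)."""
    return 2 * len(a & b) / (len(a) + len(b))

def jaccard(a, b):
    """Jaccard's coefficient: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)

# The example sets from the slide.
A = {"wd1", "wd2", "wd3", "wd4", "wd5"}
B = {"wd2", "wd4", "wd6"}

print(simple_match(A, B), dice(A, B), jaccard(A, B))
```

Simple matching is unnormalized, so larger sets tend to score higher; Dice and Jaccard both normalize by set size and always fall between 0 and 1.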
Similarity Measures: Set-based Example
  Object - Attribute (feature) array
    O1 = (1, 0, 1, 1, 0, 0, 0, 1)    |O1| = 4
    O2 = (1, 0, 0, 0, 1, 1, 0, 0)    |O2| = 3
    O3 = (1, 0, 0, 1, 1, 1, 0, 0)    |O3| = 4
    O4 = (1, 1, 0, 1, 0, 1, 1, 0)    |O4| = 5
    O5 = (1, 1, 1, 1, 0, 0, 1, 1)    |O5| = 6

  Intersections with O1
    |O1 ∩ O2| = |(A1)| = 1
    |O1 ∩ O3| = |(A1, A4)| = 2
    |O1 ∩ O4| = |(A1, A4)| = 2
    |O1 ∩ O5| = |(A1, A3, A4, A8)| = 4

  Unions with O1
    |O1 ∪ O2| = |(A1, A3, A4, A5, A6, A8)| = 6
    |O1 ∪ O3| = |(A1, A3, A4, A5, A6, A8)| = 6
    |O1 ∪ O4| = |(A1, A2, A3, A4, A6, A7, A8)| = 7
    |O1 ∪ O5| = |(A1, A2, A3, A4, A7, A8)| = 6

  Simple matching function
            O2     O3        O4        O5
    SIM     1      2         2         4
    Rank    4      2 (tie)   2 (tie)   1

  By Dice’s coefficient
            O2              O3              O4              O5
    SIM     2*1/(4+3)=2/7   2*2/(4+4)=4/8   2*2/(4+5)=4/9   2*4/(4+6)=8/10
    Rank    4               2               3               1

  By Jaccard’s coefficient
            O2     O3     O4     O5
    SIM     1/6    2/6    2/7    4/6
    Rank    4      2      3      1
Similarity Measures: Vector-based
  Cosine similarity (n-dimensional space)
    - Dot (scalar) product of the vectors divided by the product of the vector lengths:
        $\cos(A, B) = \dfrac{A \cdot B}{|A|\,|B|}$
    - Dot product = sum of the products of corresponding axis components
        A = (A1, A2, A3, A4), B = (B1, B2, B3, B4)
        $A \cdot B = A_1 B_1 + A_2 B_2 + A_3 B_3 + A_4 B_4$
    - Vector length = square root of the sum of the squares of the axis components
        $|A| = \sqrt{A_1^2 + A_2^2 + A_3^2 + A_4^2}$
        $|B| = \sqrt{B_1^2 + B_2^2 + B_3^2 + B_4^2}$
  Cosine similarity (3-dimensional space)
    A = (Ax, Ay, Az), B = (Bx, By, Bz)
        $\cos(A, B) = \dfrac{A_x B_x + A_y B_y + A_z B_z}{\sqrt{A_x^2 + A_y^2 + A_z^2}\,\sqrt{B_x^2 + B_y^2 + B_z^2}}$
Similarity Measures: Vector-based Example
  Object - Attribute (feature) array
    O1 = (1, 0, 1, 1, 0, 0, 0, 1)
    O2 = (1, 0, 0, 0, 1, 1, 0, 0)
    O3 = (1, 0, 0, 1, 1, 1, 0, 0)
    O4 = (1, 1, 0, 1, 0, 1, 1, 0)
    O5 = (1, 1, 1, 1, 0, 0, 1, 1)

  Vector lengths
    |O1| = sqrt(1^2+0^2+1^2+1^2+0^2+0^2+0^2+1^2) = sqrt(4)
    |O2| = sqrt(1^2+0^2+0^2+0^2+1^2+1^2+0^2+0^2) = sqrt(3)
    |O3| = sqrt(1^2+0^2+0^2+1^2+1^2+1^2+0^2+0^2) = sqrt(4)
    |O4| = sqrt(1^2+1^2+0^2+1^2+0^2+1^2+1^2+0^2) = sqrt(5)
    |O5| = sqrt(1^2+1^2+1^2+1^2+0^2+0^2+1^2+1^2) = sqrt(6)

  Dot products with O1
    O1·O2 = (1*1+0*0+1*0+1*0+0*1+0*1+0*0+1*0) = 1
    O1·O3 = (1*1+0*0+1*0+1*1+0*1+0*1+0*0+1*0) = 2
    O1·O4 = (1*1+0*1+1*0+1*1+0*0+0*1+0*1+1*0) = 2
    O1·O5 = (1*1+0*1+1*1+1*1+0*0+0*0+0*1+1*1) = 4

  Exercise
    - Compute the cosine similarities
    - Rank objects O2 through O5 in descending order of similarity to O1
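The exercise above can be checked with a short Python sketch. The function name `cosine` and the dictionary layout are illustrative assumptions; the vectors are the ones given on the slide.

```python
import math

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    length_a = math.sqrt(sum(x * x for x in a))
    length_b = math.sqrt(sum(y * y for y in b))
    return dot / (length_a * length_b)

objects = {
    "O1": (1, 0, 1, 1, 0, 0, 0, 1),
    "O2": (1, 0, 0, 0, 1, 1, 0, 0),
    "O3": (1, 0, 0, 1, 1, 1, 0, 0),
    "O4": (1, 1, 0, 1, 0, 1, 1, 0),
    "O5": (1, 1, 1, 1, 0, 0, 1, 1),
}

o1 = objects["O1"]
scores = {name: cosine(o1, vec) for name, vec in objects.items() if name != "O1"}

# Rank O2..O5 by descending similarity to O1.
ranking = sorted(scores, key=scores.get, reverse=True)
for name in ranking:
    print(name, round(scores[name], 3))
```

With these vectors the scores work out to 4/(2√6) ≈ 0.816 for O5, 2/(2·2) = 0.5 for O3, 2/(2√5) ≈ 0.447 for O4, and 1/(2√3) ≈ 0.289 for O2, so the ranking is O5, O3, O4, O2, the same order that Dice and Jaccard produce.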
Text Analysis: Word Frequency
  - B. Croft (UMass): TREC Volume 3 Corpus
      - Number of documents: 336,310
      - Total word occurrences: 125,720,891
      - Unique words: 508,209
  - Zipf distribution
      - Rank * Frequency = constant
      - Also seen in population, wealth, popularity
      - A few words are very common
      - Most words are very rare
  - Term weights
      - Represent the ability of terms to identify relevant items and to distinguish them from non-relevant material
      - Very common and very rare words are not very useful for indexing (Luhn, 1958)
          - Good: smaller index, faster retrieval
          - Bad: lost gems & broken phrases
Text Analysis: Term Weighting
  Term weighting factors
    - Term frequency (tf)
        - Number of times that a term occurs in a given document
        - tf(dog, d1) = 2, tf(dog, d2) = 1
        - tf(fox, d1) = 3, tf(fox, d2) = 0
        - tf(party, d1) = 0, tf(party, d2) = 1
    - Inverse document frequency (idf)
        - (Simple) 1 / number of documents in which a term occurs
            - idf(dog) = 1/2, idf(fox) = 1/1, idf(party) = 1/1
        - (Default) log(Nd / number of documents in which a term occurs)
            - Nd = number of documents in the collection
            - idf(dog) = log(2/2) = 0, idf(fox) = log(2/1) = 0.3, idf(party) = log(2/1) = 0.3 (base-10 logarithm)
    - Document length (dlen)
        - Number of tokens in a document
        - Token = an instance/occurrence of a word (not a unique word)
        - dlen(d1) = 11, dlen(d2) = 10

  tf-idf formula (basic form, using the symbols defined below)
      $w_{ki} = f_{ki} \cdot \log\left(N_d / d_k\right)$
    - w_ki = weight of term k in document i
    - f_ki = frequency of term k in document i (tf)
    - Nd = number of documents in the collection
    - d_k = number of documents in which term k appears (postings)

  [Table: term frequencies in d1 and d2 for the terms quick, brown, fox, over, lazy, dog, back, now, time, all, good, men, come, jump, aid, their, party; cell values not recoverable]
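The tf and idf figures on this slide can be reproduced with a minimal Python sketch. It assumes a base-10 logarithm (which matches log(2/1) = 0.3 on the slide) and the unnormalized weight tf × idf; document-length normalization is left out, and the names `tf`, `idf`, and `tfidf` are illustrative.

```python
import math

# Term frequencies from the slide: tf[term][document].
tf = {
    "dog":   {"d1": 2, "d2": 1},
    "fox":   {"d1": 3, "d2": 0},
    "party": {"d1": 0, "d2": 1},
}
n_docs = 2  # Nd: number of documents in the collection

def idf(term):
    """Default idf: log10(Nd / number of documents containing the term)."""
    doc_freq = sum(1 for count in tf[term].values() if count > 0)
    return math.log10(n_docs / doc_freq)

def tfidf(term, doc):
    """Unnormalized weight of a term in a document: tf * idf."""
    return tf[term][doc] * idf(term)

for term in tf:
    print(term, round(idf(term), 3), round(tfidf(term, "d1"), 3), round(tfidf(term, "d2"), 3))
```

Note that "dog" occurs in every document, so its idf is 0 and its weight vanishes in both documents regardless of how often it occurs; "fox" in d1 gets the highest weight because it is both frequent there and rare in the collection.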
Similarity Measures: using Term Weights
  - Document - Term array
  - Compute term weights (e.g., tf*idf)
  - Nd = 5
  - Document frequencies d_k: d1=5, d2=4, d3=1, d4=3, d5=2, d6=3, d7=4, d8=2
Similarity Measures: using Term Weights
  - Compute query-document cosine similarity with tf*idf weights
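A rough sketch of this final step follows. It reads the values d1 through d8 on the previous slide as the document frequencies of eight terms in a five-document collection, and assumes base-10 idf. The raw term-frequency vectors `doc_tf` and `query_tf` are hypothetical, since the original document-term array is not available, and the function names are illustrative.

```python
import math

n_docs = 5                             # Nd from the slide
doc_freq = [5, 4, 1, 3, 2, 3, 4, 2]    # d_k values from the slide
idf = [math.log10(n_docs / dk) for dk in doc_freq]

def weight(tf_vector):
    """Turn raw term frequencies into tf*idf weights."""
    return [tf * w for tf, w in zip(tf_vector, idf)]

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Hypothetical raw term frequencies for one document and one query.
doc_tf = [1, 0, 2, 1, 0, 0, 0, 1]
query_tf = [0, 0, 1, 1, 0, 0, 0, 0]

score = cosine(weight(query_tf), weight(doc_tf))
print(round(score, 3))
```

The first term appears in all five documents, so its idf is 0 and it cannot influence the score however often it occurs; the third term appears in only one document and therefore dominates the match.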