Download presentation
Presentation is loading. Please wait.
1
IR Theory: IR Basics & Vector Space Model
2
IR Approach Is the document relevant to the query? Information Seeker
Authors Information Need Concepts Why is IR hard? Because language is hard! Query String Document Text Is the document relevant to the query? Search Engine
3
IR System Architecture
Documents Query Representation Module Representation Module Document Representation Query Representation Matching/Ranking Module Results Search Engine
4
Step 1: Representation Documents Query Representation Module
Matching/Ranking Module Results Search Engine
5
How to represent text? How do we represent the complexities of language? Computers don’t “understand” documents or queries Simple, yet effective approach: “bag of words” Treat all the words in a document as index terms for that document Disregard order, structure, meaning, etc. of the words Bag of Words McDonald's slims down spuds Fast-food chain to reduce certain types of fat in its french fries with new cooking oil. NEW YORK (CNN/Money) - McDonald's Corp. is cutting the amount of "bad" fat in its french fries nearly in half, the fast-food chain said Tuesday as it moves to make all its fried menu items healthier. But does that mean the popular shoestring fries won't taste the same? The company says no. "It's a win-win for our customers because they are getting the same great french-fry taste along with an even healthier nutrition profile," said Mike Roberts, president of McDonald's USA. But others are not so sure. McDonald's will not specifically discuss the kind of oil it plans to use, but at least one nutrition expert says playing with the formula could mean a different taste. Shares of Oak Brook, Ill.-based McDonald's (MCD: down $0.54 to $23.22, Research, Estimates) were lower Tuesday afternoon. It was unclear Tuesday whether competitors Burger King and Wendy's International (WEN: down $0.80 to $34.91, Research, Estimates) would follow suit. Neither company could immediately be reached for comment. … 16 × said 14 × McDonalds 12 × fat 11 × fries 8 × new 6 × company french nutrition 5 × food oil percent reduce taste Tuesday … Search Engine
6
Bag-of-Word Representation
Document 1 Term Document 1 Document 2 The quick brown fox jumped over the lazy dog’s back. quick brown fox over lazy dog back now time all good men come jump aid their party 1 Stopword List for is of Document 2 the to Now is the time for all good men to come to the aid of their party. Search Engine 9
7
Step 2: Term Weighting Documents Query Representation Module
Matching/Ranking Module Results Search Engine
8
Term Weight: What & How? What is term weight?
Numerical estimate of term importance How should we estimate the term importance? Terms that appear often in a document should get high weights The more often a document contains the term “dog”, the more likely that the document is “about” dogs. Terms that appear in many documents should get low weights Words like “the”, “a”, “of” appear in (nearly) all documents. Term frequency in long documents should count less than those in short ones How do we compute it? Term frequency Inverse document frequency Document length Search Engine
9
Step 3: Matching/Ranking
Documents Query Representation Module Representation Module Document Representation Query Representation Matching/Ranking Module Results Search Engine
10
Boolean vs. Vector Space Model
Boolean model Based on the notion of sets Does not impose a ranking on retrieved documents Documents are retrieved only if they satisfy Boolean conditions specified in the query Exact match Vector space model Based on geometry, the notion of vectors in high dimensional space Documents are ranked based on their similarity to the query Best/partial match Search Engine
11
Boolean Model: Overview
Weights assigned to terms are either “0” or “1” “0” represents “absence”: term isn’t in the document “1” represents “presence”: term is in the document Build queries by combining terms with Boolean operators AND, OR, NOT The system returns all documents that satisfy the query A OR B A B A AND B A AND NOT(B) Search Engine 10
12
Boolean Model: Strength
Boolean operators define the relationship between query terms. AND → terms/concepts that are not equivalent/similar party AND good: good party Retrieves records that include all AND terms → Narrows the search OR → related terms, synonyms party AND (good OR excellent OR wild): good party, excellent party, wild party Retrieves records that include any OR terms → Broadens the search NOT → antonyms, alternate terms for polysemes party NOT democratic: Democratic party Eliminates records that include NOT term → Narrows the search Precise, if you know the right strategies knows what concepts to combine/exclude, narrow/broaden Efficient for the computer Search Engine 13
13
Boolean Model: Weakness
Natural language is way more complex Boolean logic insufficient to capture the richness of language AND “discovers” nonexistent relationships Terms in different sentences, paragraphs, … Money is good, but I won’t be party to stealing. Guessing terminology for OR is hard good, nice, excellent, outstanding, awesome, … Guessing terms to exclude is even harder! Democratic party, party to a lawsuit, … No control over size of result set Too many documents or none All documents in the result set are considered “equally good” No Partial Matching Documents that “don’t quite match” the query may also be useful. Search Engine 15
14
Vector Space Model: Pros & Cons
Non-binary term weights Partial matching Ranked results Easy query formulation Query expansion Cons Term relationships ignored Term order ignored No wildcard Problematic w/ long documents Similarity Relevance Search Engine 15
15
Vector Space Model: Representation
“Bags of words” can be represented as vectors Computational efficiency Ease of manipulation Geometric metaphor: “arrows” A vector is a set of values recorded in any consistent order “The quick brown fox jumped over the lazy dog’s back” Vector Bag of words (1, 1, 1, 1, 1, 1, 1, 1, 2) back 1 brown dog fox jump lazy over quick the 2 1st position corresponds to “back” 2nd position corresponds to “brown” 3rd position corresponds to “dog” 4th position corresponds to “fox” 5th position corresponds to “jump” 6th position corresponds to “lazy” 7th position corresponds to “over” 8th position corresponds to “quick” 9th position corresponds to “the” Search Engine 8
16
Vector Space Model: Ranked Retrieval
Order documents by “relevance” Relevance = how likely they are to be relevant to the information need Some documents are “better” than others Users can decide when to stop reading Best (partial) match Documents need not have all query terms Documents with more query terms should be “better” Estimate relevance with query-document similarity Treat the query as if it were a document Create a query bag-of-words Compute term weights Find its similarity to each document Rank order the documents by similarity Works surprisingly well Search Engine
17
Vector Space Model: 3-D Example
A vector A in a 3-dimensional space Represented with initial point at the origin of a rectangular coordinate system. Projections of A on the x, y, and z axes: Ax, Ay, and Az the (rectangular) components of A in the x, y, and z directions each axis represents a term (e.g., x = all, y = brown, z = cat) z Az A y Ay Ax x Search Engine
18
Vector Space Model: Postulate
θ t1 d5 t2 d4 Documents that are “close together” in vector space “talk about” the same things Therefore, retrieve documents based on how close the document is to the query (i.e., similarity ~ “closeness”) Search Engine
19
Similarity Measures: Set-based
Simple matching function Dice’s coefficient Jaccard’s coefficient A = (wd1, wd2, wd3, wd4, wd5) B = (wd2, wd4, wd6) A B: intersection of A and B the set of elements that belongs to both A and B A B = (wd2, wd4) A B: union of A and B the set of elements that belongs to either A or B. A B = (wd1, wd2, wd3, wd4, wd5, wd6) |A| : cardinality of A the number of elements in A |A| = 5 |B| = 3 |A B| = 2 |A B| = 6 Similarity Scores Simple: |A B| = 2 Dice: 2* |A B| / (|A|+ |B|) = 2*2 /8 = 1/2 Jaccard: |A B| / |A B| = 2/6 = 1/3 Search Engine
20
Similarity Measures: Set-based Example
Object – Attribute (feature) array O1 = (1, 0, 1, 1, 0, 0, 0, 1) |O1| = 4 O2 = (1, 0, 0, 0, 1, 1, 0, 0) |O2| = 3 O3 = (1, 0, 0, 1, 1, 1, 0, 0) |O3| = 4 O4 = (1, 1, 0, 1, 0, 1, 1, 0) |O4| = 5 O5 = (1, 1, 1, 1, 0, 0, 1, 1) |O5| = 6 |O1 O2| = |(A1)| = 1 |O1 O3| = |(A1, A4)| = 2 |O1 O4| = |(A1, A4)| = 2 |O1 O5| = |(A1, A3, A4 , A8)| = 4 |O1 O2| = |(A1, A3, A4, A5, A6, A8)| = 6 |O1 O3| = |(A1, A3, A4, A5, A6, A8)| = 6 |O1 O4| = |(A1, A2, A3, A4, A6, A7, A8)| = 7 |O1 O5| = |(A1, A2, A3, A4, A7, A8)| = 6 Simple matching function By Dice’s coefficient By Jaccard’s coefficient O2 O3 O4 O5 SIM 1 2 4 Rank O2 O3 O4 O5 SIM 2*1/(4+3)=2/7 2*2/(4+4)=4/8 2*2/(4+5)=4/9 2*4/(4+6)=8/10 Rank 4 2 3 1 O2 O3 O4 O5 SIM 1/6 2/6 2/7 4/6 Rank 4 2 3 1 Search Engine
21
Similarity Measures: Vector-based
Cosine Similarity (n-dimensional space) Dot/Scalar product of vectors / product of vector lengths Dot product = sum (product of each axis component) A = (A1, A2, A3, A4) B = (B1, B2, B3, B4) AB = (A1B1+A2B2+A3B3+A4B4) Vector length = square root of sum (square of each axis component) |A| = sqrt [(A1)2+ (A2)2+ (A3)2+ (A4)2] |B| = sqrt [(B1)2+ (B2)2+ (B3)2+ (B4)2] Cosine Similarity (3-dimensional space) A = (Ax, Ay, Az) B = (Bx, By, Bz) Search Engine
22
Similarity Measures: Vector-based Example
Object – Attribute (feature) array O1 = (1, 0, 1, 1, 0, 0, 0, 1) O2 = (1, 0, 0, 0, 1, 1, 0, 0) O3 = (1, 0, 0, 1, 1, 1, 0, 0) O4 = (1, 1, 0, 1, 0, 1, 1, 0) O5 = (1, 1, 1, 1, 0, 0, 1, 1) |O1| = sqrt( ) = sqrt(4) |O2| = sqrt( ) = sqrt(3) |O3| = sqrt( ) = sqrt(4) |O4| = sqrt( ) = sqrt(5) |O5| = sqrt( ) = sqrt(6) O1O2 = (1*1+0*0+1*0+1*0+0*1+0*1+0*0+1*0) = 1 O1O3 = (1*1+0*0+1*0+1*1+0*1+0*1+0*0+1*0) = 2 O1O4 = (1*1+0*1+1*0+1*1+0*0+0*1+0*1+1*0) = 2 O1O5 = (1*1+0*1+1*1+1*1+0*0+0*0+0*1+1*1) = 4 Compute cosine similarities Rank objects O2 through O5 by descending order of similarity to O1 Search Engine
23
Text Analysis: Word Frequency
B. Croft (Umass) TREC Volume 3 Corpus Number of documents: ,310 Total word occurrences: 125,720,891 Unique words: ,209 Zipf Distribution Rank*Frequency = constant Population, Wealth, Popularity A few words are very common Most words are very rare Term Weights Represents the ability of terms to identify relevant items & to distinguish them from non-relevant material Very common & very rare words are not very useful for indexing (Luhn, 1958) Good Smaller index Faster retrieval Bad Lost gems & broken phrases Search Engine
24
Text Analysis: Term Weighting
Term Weighting Factors Term frequency (tf) Number of times that a term occurs in a given document tf(dogd1) = 2, tf(dogd2) = 1 tf(foxd1) = 3, tf(foxd2) = 0 tf(partyd1) = 0, tf(partyd2) = 1 Inverse document frequency (idf) (Simple) 1/number of document in which a term occurs idf(dog) = 1/2, idf(fox) = 1/1, idf(party) = 1/1 (Default) log(Nd/ number of document in which a term occurs) Nd = number of document in a collection idf(dog)=log(2/2)=0, idf(fox)=log(2/1)=0.3, idf(party) = log(2/1)=0.3 Document length (dlen) Number of tokens in a document Token = an instance/occurrence of a word (not unique word) dlen(d1) = 11, dlen(d2) = 10 tfidf formula Term d1 d2 quick brown fox over lazy dog back now time all good men come jump aid their party 1 2 3 wki = weight of term k in document i fki = frequency of term k in document i (tf) Nd = number of documents in collection dk = number of documents in which term k appears (postings) Search Engine
25
Similarity Measures: using Term Weights
Document – Term array Compute term weights (e.g., tf*idf) Nd = 5 d1=5, d2=4, d3=1, d4=3, d5=2, d6=3, d7=4, d8=2 Search Engine
26
Similarity Measures: using Term Weights
Compute query-document cosine similarity with tf*idf weights Search Engine
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.