IR Theory: IR Basics & Vector Space Model

IR Approach
On one side is an information seeker with an information need; on the other are authors with concepts to convey. The need is expressed as a query string, the concepts as document text, and the search engine must answer the central question: is the document relevant to the query? Why is IR hard? Because language is hard!

IR System Architecture
Documents and the query each pass through a representation module, producing a document representation and a query representation. A matching/ranking module then compares the two and returns the results.

Step 1: Representation
The first stage of the pipeline: the representation module converts the documents and the query into internal representations, which then feed the matching/ranking module that produces the results.

How to represent text?
How do we represent the complexities of language? Computers don't "understand" documents or queries. A simple yet effective approach is the "bag of words": treat all the words in a document as index terms for that document, and disregard the order, structure, meaning, etc. of the words. A sketch of building such a bag follows the example.

Example article:
"McDonald's slims down spuds. Fast-food chain to reduce certain types of fat in its french fries with new cooking oil. NEW YORK (CNN/Money) - McDonald's Corp. is cutting the amount of "bad" fat in its french fries nearly in half, the fast-food chain said Tuesday as it moves to make all its fried menu items healthier. But does that mean the popular shoestring fries won't taste the same? The company says no. "It's a win-win for our customers because they are getting the same great french-fry taste along with an even healthier nutrition profile," said Mike Roberts, president of McDonald's USA. But others are not so sure. McDonald's will not specifically discuss the kind of oil it plans to use, but at least one nutrition expert says playing with the formula could mean a different taste. Shares of Oak Brook, Ill.-based McDonald's (MCD: down $0.54 to $23.22, Research, Estimates) were lower Tuesday afternoon. It was unclear Tuesday whether competitors Burger King and Wendy's International (WEN: down $0.80 to $34.91, Research, Estimates) would follow suit. Neither company could immediately be reached for comment. …"

Bag-of-words counts for the article:
16 × said
14 × McDonalds
12 × fat
11 × fries
8 × new
6 × company, french, nutrition
5 × food, oil, percent, reduce, taste, Tuesday
…
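A minimal sketch of how such a bag of words can be built. The lowercase, punctuation-stripping tokenizer is an assumption; the slide does not specify one:

```python
# Minimal sketch: building a bag of words from raw text.
# The tokenizer (lowercase + keep only letters/apostrophes) is an assumption.
import re
from collections import Counter

def bag_of_words(text):
    # Split on non-letter characters and lowercase everything;
    # order, structure, and meaning are deliberately discarded.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

doc = "The quick brown fox jumped over the lazy dog's back."
print(bag_of_words(doc).most_common())
# [('the', 2), ('quick', 1), ('brown', 1), ...]
```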

Bag-of-Word Representation
Document 1: "The quick brown fox jumped over the lazy dog's back."
Document 2: "Now is the time for all good men to come to the aid of their party."
Stopword list: for, is, of, the, to

Term counts after stopword removal:

Term     Doc 1   Doc 2
quick      1       0
brown      1       0
fox        1       0
over       1       0
lazy       1       0
dog        1       0
back       1       0
now        0       1
time       0       1
all        0       1
good       0       1
men        0       1
come       0       1
jump       1       0
aid        0       1
their      0       1
party      0       1

A sketch of producing this table follows.
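A minimal sketch of producing the table above: tokenize both documents, drop stopwords, and count. The tokenizer and the crude reduction of "jumped" to "jump" and "dog's" to "dog" are assumptions made to match the slide's index terms:

```python
import re
from collections import Counter

STOPWORDS = {"for", "is", "of", "the", "to"}

def index_terms(text):
    # Lowercase, split into words, keep the part before any apostrophe
    # ("dog's" -> "dog"), and apply a crude "-ed" stemmer ("jumped" -> "jump").
    tokens = [w.split("'")[0] for w in re.findall(r"[a-z']+", text.lower())]
    tokens = [t[:-2] if t.endswith("ed") else t for t in tokens]
    return [t for t in tokens if t not in STOPWORDS]

d1 = "The quick brown fox jumped over the lazy dog's back."
d2 = "Now is the time for all good men to come to the aid of their party."
tf1, tf2 = Counter(index_terms(d1)), Counter(index_terms(d2))
for term in sorted(set(tf1) | set(tf2)):
    print(f"{term:6} {tf1[term]} {tf2[term]}")   # Counter returns 0 for absent terms
```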

Step 2: Term Weighting
The second stage of the pipeline: each term in the document and query representations is assigned a weight before the matching/ranking module produces the results.

Term Weight: What & How?
What is a term weight? A numerical estimate of a term's importance. How should we estimate term importance?
- Terms that appear often in a document should get high weights: the more often a document contains the term "dog", the more likely it is that the document is "about" dogs.
- Terms that appear in many documents should get low weights: words like "the", "a", and "of" appear in (nearly) all documents.
- Term occurrences in long documents should count for less than those in short ones.
How do we compute it? From term frequency (tf), inverse document frequency (idf), and document length (dl).

Step 3: Matching/Ranking
The final stage of the pipeline: the matching/ranking module compares the document representations against the query representation and returns the results.

Boolean vs. Vector Space Model
Boolean model:
- Based on the notion of sets
- Documents are retrieved only if they satisfy the Boolean conditions specified in the query
- Does not impose a ranking on the retrieved documents
- Exact match
Vector space model:
- Based on geometry: the notion of vectors in a high-dimensional space
- Documents are ranked by their similarity to the query
- Best/partial match

Boolean Model: Overview
Weights assigned to terms are either "0" or "1": "0" represents "absence" (the term isn't in the document), "1" represents "presence" (the term is in the document). Queries are built by combining terms with the Boolean operators AND, OR, and NOT, and the system returns exactly the documents that satisfy the query, e.g., A OR B, A AND B, A AND NOT(B).
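A minimal sketch of Boolean retrieval as set operations over an inverted index; the two example documents and the index layout are illustrative assumptions:

```python
# Minimal sketch: Boolean retrieval with set operations over an inverted index.
# The two example documents are illustrative assumptions.
docs = {
    1: "the quick brown fox jumped over the lazy dog's back",
    2: "now is the time for all good men to come to the aid of their party",
}

# Inverted index: term -> set of IDs of documents containing that term.
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

def postings(term):
    return index.get(term, set())

all_ids = set(docs)

print(postings("party") & postings("good"))               # AND: {2}
print(postings("fox") | postings("party"))                # OR: {1, 2}
print(postings("party") & (all_ids - postings("good")))   # AND NOT: set()
```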

Boolean Model: Strengths
Boolean operators define the relationships between query terms.
- AND joins terms/concepts that are not equivalent or similar: party AND good → "good party". It retrieves records that include all of the AND'ed terms, narrowing the search.
- OR joins related terms and synonyms: party AND (good OR excellent OR wild) → "good party", "excellent party", "wild party". It retrieves records that include any of the OR'ed terms, broadening the search.
- NOT excludes antonyms and alternate senses of polysemous terms: party NOT democratic excludes the "Democratic Party" sense. It eliminates records that include the NOT'ed term, narrowing the search.
The model is precise if you know the right strategies: which concepts to combine or exclude, and when to narrow or broaden. It is also efficient for the computer.

Boolean Model: Weaknesses
Natural language is far more complex than Boolean logic, which is insufficient to capture the richness of language.
- AND "discovers" nonexistent relationships when terms occur in different sentences or paragraphs: "Money is good, but I won't be party to stealing" matches good AND party.
- Guessing terminology for OR is hard: good, nice, excellent, outstanding, awesome, ...
- Guessing terms to exclude is even harder: Democratic Party, party to a lawsuit, ...
- There is no control over the size of the result set: too many documents, or none at all.
- All documents in the result set are considered "equally good".
- There is no partial matching: documents that "don't quite match" the query may also be useful.

Vector Space Model: Pros & Cons
Pros:
- Non-binary term weights
- Partial matching
- Ranked results
- Easy query formulation
- Query expansion
Cons:
- Term relationships ignored
- Term order ignored
- No wildcards
- Problematic with long documents
- Similarity ≠ relevance

Vector Space Model: Representation
"Bags of words" can be represented as vectors, which gives computational efficiency, ease of manipulation, and a geometric metaphor: "arrows" in term space. A vector is a set of values recorded in any consistent order.

"The quick brown fox jumped over the lazy dog's back" → bag of words → vector (1, 1, 1, 1, 1, 1, 1, 1, 2), where positions 1 through 9 correspond, in order, to the terms back, brown, dog, fox, jump, lazy, over, quick, the; "the" occurs twice and every other term once.
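A minimal sketch of turning the bag of words into the vector above; the alphabetical ordering reproduces the slide's position assignments, while the pre-stemmed token list ("jumped" already reduced to "jump", "dog's" to "dog") is an assumption:

```python
from collections import Counter

def to_vector(bag, vocabulary):
    # One component per vocabulary term, in a fixed (here: sorted) order.
    return [bag.get(term, 0) for term in vocabulary]

# Terms as on the slide ("jumped" stemmed to "jump", "dog's" to "dog").
bag = Counter(["the", "quick", "brown", "fox", "jump",
               "over", "the", "lazy", "dog", "back"])
vocab = sorted(bag)  # ['back', 'brown', 'dog', 'fox', 'jump', 'lazy', 'over', 'quick', 'the']
print(to_vector(bag, vocab))  # [1, 1, 1, 1, 1, 1, 1, 1, 2]
```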

Vector Space Model: Ranked Retrieval
Order documents by "relevance", i.e., how likely they are to be relevant to the information need. Some documents are "better" than others, and users can decide when to stop reading. This is best (partial) match retrieval: documents need not contain all query terms, but documents with more query terms should be "better". Relevance is estimated with query-document similarity: treat the query as if it were a document (create a query bag of words and compute term weights), compute its similarity to each document, and rank-order the documents by similarity. This works surprisingly well.

Vector Space Model: 3-D Example
Consider a vector A in 3-dimensional space, represented with its initial point at the origin of a rectangular coordinate system. The projections of A on the x, y, and z axes, Ax, Ay, and Az, are the (rectangular) components of A in the x, y, and z directions. In the vector space model, each axis represents a term (e.g., x = "all", y = "brown", z = "cat").

Vector Space Model: Postulate
Documents that are "close together" in vector space "talk about" the same things. (The slide's figure plots document vectors d4 and d5 against term axes t1 and t2, with angle θ between them.) Therefore, retrieve documents based on how close each document is to the query, i.e., similarity ~ "closeness".

Vector Space Model: Example
Query: "What is information retrieval?" → q: information 1, retrieval 1
Index terms: t1 = information, t2 = retrieval, t3 = seminar
D1: "Information retrieval seminars" → d1 = (1, 1, 1)
D2: "Retrieval seminars and Information Retrieval" → d2 = (1, 2, 1)
D3: "Information seminar" → d3 = (1, 0, 1)
With q = (1, 1, 0): |q| = √2, |d1| = √3, |d2| = √6, |d3| = √2. The angles between q and the documents are θ(q, d1) ≈ 35°, θ(q, d2) = 30°, and θ(q, d3) = 60°, so the documents rank d2, d1, d3.
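A minimal sketch reproducing the example's similarity and angle computations from the term-count vectors above:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

q  = (1, 1, 0)   # "What is information retrieval?"
d1 = (1, 1, 1)   # "Information retrieval seminars"
d2 = (1, 2, 1)   # "Retrieval seminars and Information Retrieval"
d3 = (1, 0, 1)   # "Information seminar"

for name, d in [("d1", d1), ("d2", d2), ("d3", d3)]:
    c = cosine(q, d)
    print(name, round(c, 3), round(math.degrees(math.acos(c)), 1))
# d1 0.816 35.3
# d2 0.866 30.0
# d3 0.5 60.0
```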

Similarity Measures: Set-based
Three set-based measures: the simple matching function, Dice's coefficient, and Jaccard's coefficient. Let A = (wd1, wd2, wd3, wd4, wd5) and B = (wd2, wd4, wd6).
- A ∩ B: the intersection of A and B, the set of elements that belong to both A and B. A ∩ B = (wd2, wd4).
- A ∪ B: the union of A and B, the set of elements that belong to either A or B. A ∪ B = (wd1, wd2, wd3, wd4, wd5, wd6).
- |A|: the cardinality of A, the number of elements in A. |A| = 5, |B| = 3, |A ∩ B| = 2, |A ∪ B| = 6.
Similarity scores:
- Simple: |A ∩ B| = 2
- Dice: 2|A ∩ B| / (|A| + |B|) = 2·2/8 = 1/2
- Jaccard: |A ∩ B| / |A ∪ B| = 2/6 = 1/3
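A minimal sketch of the three set-based measures applied to the A and B of this slide:

```python
# Minimal sketch of the three set-based similarity measures.
def simple_match(a, b):
    return len(a & b)

def dice(a, b):
    return 2 * len(a & b) / (len(a) + len(b))

def jaccard(a, b):
    return len(a & b) / len(a | b)

A = {"wd1", "wd2", "wd3", "wd4", "wd5"}
B = {"wd2", "wd4", "wd6"}
print(simple_match(A, B), dice(A, B), jaccard(A, B))
# 2 0.5 0.3333333333333333
```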

Similarity Measures: Set-based Example
Object-attribute (feature) array:
O1 = (1, 0, 1, 1, 0, 0, 0, 1), |O1| = 4
O2 = (1, 0, 0, 0, 1, 1, 0, 0), |O2| = 3
O3 = (1, 0, 0, 1, 1, 1, 0, 0), |O3| = 4
O4 = (1, 1, 0, 1, 0, 1, 1, 0), |O4| = 5
O5 = (1, 1, 1, 1, 0, 0, 1, 1), |O5| = 6

Intersections with O1:
|O1 ∩ O2| = |(A1)| = 1
|O1 ∩ O3| = |(A1, A4)| = 2
|O1 ∩ O4| = |(A1, A4)| = 2
|O1 ∩ O5| = |(A1, A3, A4, A8)| = 4

Unions with O1:
|O1 ∪ O2| = |(A1, A3, A4, A5, A6, A8)| = 6
|O1 ∪ O3| = |(A1, A3, A4, A5, A6, A8)| = 6
|O1 ∪ O4| = |(A1, A2, A3, A4, A6, A7, A8)| = 7
|O1 ∪ O5| = |(A1, A2, A3, A4, A7, A8)| = 6

Similarity to O1:

                 O2              O3              O4              O5
Simple SIM       1               2               2               4
  rank           4               2 (tie)         2 (tie)         1
Dice SIM         2·1/(4+3)=2/7   2·2/(4+4)=4/8   2·2/(4+5)=4/9   2·4/(4+6)=8/10
  rank           4               2               3               1
Jaccard SIM      1/6             2/6             2/7             4/6
  rank           4               2               3               1

Similarity Measures: Vector-based
Cosine similarity (n-dimensional space) is the dot (scalar) product of two vectors divided by the product of their lengths.
Dot product: the sum over components of their products. For A = (A1, A2, A3, A4) and B = (B1, B2, B3, B4):
A·B = A1B1 + A2B2 + A3B3 + A4B4
Vector length: the square root of the sum of the squared components:
|A| = sqrt(A1² + A2² + A3² + A4²), |B| = sqrt(B1² + B2² + B3² + B4²)
Cosine similarity:
cos(A, B) = A·B / (|A| |B|)
In 3-dimensional space, for A = (Ax, Ay, Az) and B = (Bx, By, Bz):
cos(A, B) = (AxBx + AyBy + AzBz) / (|A| |B|)

Similarity Measures: Vector-based Example
Object-attribute (feature) array:
O1 = (1, 0, 1, 1, 0, 0, 0, 1)
O2 = (1, 0, 0, 0, 1, 1, 0, 0)
O3 = (1, 0, 0, 1, 1, 1, 0, 0)
O4 = (1, 1, 0, 1, 0, 1, 1, 0)
O5 = (1, 1, 1, 1, 0, 0, 1, 1)

Vector lengths:
|O1| = sqrt(1²+0²+1²+1²+0²+0²+0²+1²) = sqrt(4)
|O2| = sqrt(1²+0²+0²+0²+1²+1²+0²+0²) = sqrt(3)
|O3| = sqrt(1²+0²+0²+1²+1²+1²+0²+0²) = sqrt(4)
|O4| = sqrt(1²+1²+0²+1²+0²+1²+1²+0²) = sqrt(5)
|O5| = sqrt(1²+1²+1²+1²+0²+0²+1²+1²) = sqrt(6)

Dot products with O1:
O1·O2 = 1·1+0·0+1·0+1·0+0·1+0·1+0·0+1·0 = 1
O1·O3 = 1·1+0·0+1·0+1·1+0·1+0·1+0·0+1·0 = 2
O1·O4 = 1·1+0·1+1·0+1·1+0·0+0·1+0·1+1·0 = 2
O1·O5 = 1·1+0·1+1·1+1·1+0·0+0·0+0·1+1·1 = 4

Exercise: compute the cosine similarities and rank objects O2 through O5 in descending order of similarity to O1 (a sketch follows).
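A minimal sketch answering the exercise; the similarity values follow directly from the vectors above:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

objects = {
    "O1": (1, 0, 1, 1, 0, 0, 0, 1),
    "O2": (1, 0, 0, 0, 1, 1, 0, 0),
    "O3": (1, 0, 0, 1, 1, 1, 0, 0),
    "O4": (1, 1, 0, 1, 0, 1, 1, 0),
    "O5": (1, 1, 1, 1, 0, 0, 1, 1),
}

sims = {name: cosine(objects["O1"], v) for name, v in objects.items() if name != "O1"}
for name, s in sorted(sims.items(), key=lambda kv: -kv[1]):
    print(name, round(s, 3))
# O5 0.816, O3 0.5, O4 0.447, O2 0.289 -> ranking: O5, O3, O4, O2
```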

Text Analysis: Word Frequency
Corpus statistics from B. Croft (UMass) for the TREC Volume 3 corpus: 336,310 documents; 125,720,891 total word occurrences; 508,209 unique words.
Zipf distribution: rank × frequency ≈ constant. The same distribution shows up for population, wealth, and popularity. A few words are very common; most words are very rare.
Term weights represent the ability of terms to identify relevant items and to distinguish them from non-relevant material. Very common and very rare words are not very useful for indexing (Luhn, 1958). Dropping them is good (a smaller index means faster retrieval) but also bad (lost gems and broken phrases).
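A minimal sketch of checking Zipf's law on a corpus; the file path is a placeholder assumption, and the TREC figures above come from the slide, not from this code:

```python
import re
from collections import Counter

text = open("corpus.txt").read()   # any large text file (assumed path)
counts = Counter(re.findall(r"[a-z']+", text.lower()))

# Under Zipf's law, rank * frequency should be roughly constant.
for rank, (word, freq) in enumerate(counts.most_common(10), start=1):
    print(rank, word, freq, rank * freq)
```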

Text Analysis: Term Weighting
Term weighting factors:
- Term frequency (tf): the number of times a term occurs in a given document. E.g., tf(dog, d1) = 2, tf(dog, d2) = 1; tf(fox, d1) = 3, tf(fox, d2) = 0; tf(party, d1) = 0, tf(party, d2) = 1.
- Inverse document frequency (idf). Simple form: 1 / (number of documents in which the term occurs), e.g., idf(dog) = 1/2, idf(fox) = 1/1, idf(party) = 1/1. Default form: log(Nd / number of documents in which the term occurs), where Nd is the number of documents in the collection, e.g., idf(dog) = log(2/2) = 0, idf(fox) = log(2/1) = 0.3, idf(party) = log(2/1) = 0.3.
- Document length (dlen): the number of tokens in a document, where a token is an instance/occurrence of a word, not a unique word. E.g., dlen(d1) = 11, dlen(d2) = 10.
tf·idf formula: wki = fki × log(Nd / dk), where wki is the weight of term k in document i, fki is the frequency of term k in document i (tf), Nd is the number of documents in the collection, and dk is the number of documents in which term k appears (its postings). A sketch follows.
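A minimal sketch of the slide's tf·idf formula with a base-10 logarithm (which matches idf(fox) = log(2/1) ≈ 0.3); the toy two-document collection mirrors the slide's examples but omits the other terms:

```python
import math
from collections import Counter

# Two-document toy collection (counts chosen to match the slide's examples).
tfs = [
    Counter({"dog": 2, "fox": 3}),     # d1 (other terms omitted)
    Counter({"dog": 1, "party": 1}),   # d2 (other terms omitted)
]
Nd = len(tfs)

def idf(term):
    dk = sum(1 for tf in tfs if term in tf)   # document frequency of the term
    return math.log10(Nd / dk)

def tfidf(term, doc_index):
    return tfs[doc_index][term] * idf(term)

print(round(idf("dog"), 3), round(idf("fox"), 3))   # 0.0 0.301
print(round(tfidf("fox", 0), 3))                    # 3 * log10(2/1) = 0.903
```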

Similarity Measures: Using Term Weights
Given a document-term array, compute the term weights (e.g., tf·idf). In the slide's example, Nd = 5 documents, and the document frequencies of the eight terms are d1 = 5, d2 = 4, d3 = 1, d4 = 3, d5 = 2, d6 = 3, d7 = 4, d8 = 2. (The document-term array itself appeared as a figure.)

Similarity Measures: Using Term Weights
Then compute the query-document cosine similarity using the tf·idf weights and rank the documents by the result, as sketched below.
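A minimal end-to-end sketch: build tf·idf vectors for the documents and the query (treated as a short document), then rank by cosine similarity. The three documents reuse the earlier example; the tokenizer and base-10 log are assumptions consistent with the sketches above:

```python
import math
import re
from collections import Counter

docs = [
    "Information retrieval seminars",
    "Retrieval seminars and Information Retrieval",
    "Information seminar",
]
query = "What is information retrieval?"

def terms(text):
    return re.findall(r"[a-z]+", text.lower())

tfs = [Counter(terms(d)) for d in docs]
Nd = len(docs)

def weight(term, tf):
    # tf * idf, with a base-10 log as on the term-weighting slide.
    dk = sum(1 for t in tfs if term in t)
    return tf[term] * math.log10(Nd / dk) if dk else 0.0

vocab = sorted({t for tf in tfs for t in tf})

def vector(tf):
    return [weight(t, tf) for t in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

qv = vector(Counter(terms(query)))
ranked = sorted(
    ((round(cosine(qv, vector(tf)), 3), i + 1) for i, tf in enumerate(tfs)),
    reverse=True,
)
print(ranked)   # [(0.707, 1), (0.569, 2), (0.0, 3)]
```

Note that "information" occurs in every document, so its idf (and hence its weight) is zero; that is why this tf·idf ranking puts d1 first, unlike the raw-count ranking in the earlier example.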