Computational Linguistics Course Instructor: Professor Cercone Presenter: Morteza Zihayat Information Retrieval and Vector Space Model
Outline Introduction to IR IR System Architecture Vector Space Model (VSM) How to Assign Weights? TF-IDF Weighting Example Advantages and Disadvantages of VS Model Improving the VS Model 2 Information Retrieval and Vector Space Model
Introduction to IR The world's total yearly production of unique information stored in the form of print, film, optical, and magnetic content would require roughly 1.5 billion gigabytes of storage. This is the equivalent of 250 megabytes per person for each man, woman, and child on earth. (Lyman & Varian, 2000) Information Retrieval and Vector Space Model 4
Growth of textual information How can we help manage and exploit all the information? Literature WWW Desktop News Intranet Blog Information Retrieval and Vector Space Model 5
Information overflow Information Retrieval and Vector Space Model 6
What is Information Retrieval (IR)? Narrow sense: IR = search engine technologies (IR = Google, library info systems); IR = text matching/classification. Broad sense: IR = text information management. General problem: how to manage text information? How to find useful information? (retrieval) Example: Google. How to organize information? (text classification) Example: automatically assign emails to different folders. How to discover knowledge from text? (text mining) Example: discover correlations between events. Information Retrieval and Vector Space Model 7
Outline Introduction to IR IR System Architecture Vector Space Model (VSM) How to Assign Weights? TF-IDF Weighting Example Advantages and Disadvantages of VS Model Improving the VS Model 8 Information Retrieval and Vector Space Model
Formalizing IR Tasks Vocabulary: V = {w_1, w_2, …, w_T} of a language. Query: q = q_1 q_2 … q_m, where q_i ∈ V. Document: d_i = d_i1 d_i2 … d_im_i, where d_ij ∈ V. Collection: C = {d_1, d_2, …, d_N}. Relevant document set: R(q) ⊆ C; generally unknown and user-dependent. The query provides a "hint" on which documents should be in R(q). IR: find the approximate relevant document set R'(q). Source: This slide is borrowed from [1] Information Retrieval and Vector Space Model 9
Evaluation measures The quality of many retrieval systems depends on how well they manage to rank relevant documents. How can we evaluate rankings in IR? IR researchers have developed evaluation measures specifically designed to evaluate rankings. Most of these measures combine precision and recall in a way that takes account of the ranking. Information Retrieval and Vector Space Model 10
Precision & Recall Source: This slide is borrowed from [1] Information Retrieval and Vector Space Model 11
In other words: Precision is the percentage of items in the returned set that are relevant. Recall is the percentage of all relevant documents in the collection that are in the returned set. Information Retrieval and Vector Space Model 12
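As a concrete illustration (not from the original slides), precision and recall for a single query can be computed from the sets of retrieved and relevant document ids; the ids below are made up:

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall for one query.

    retrieved: set of document ids returned by the system
    relevant:  set of document ids judged relevant (ground truth)
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant                       # relevant documents that were returned
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Toy example: 3 of the 4 returned docs are relevant, out of 6 relevant docs overall.
p, r = precision_recall({"d1", "d2", "d3", "d4"}, {"d1", "d3", "d4", "d7", "d8", "d9"})
print(f"precision={p:.2f}, recall={r:.2f}")   # precision=0.75, recall=0.50
```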
Evaluating Retrieval Performance Source: This slide is borrowed from [1] Information Retrieval and Vector Space Model 13
IR System Architecture 14 (Diagram: the user interacts through the INTERFACE, issuing a query and providing relevance judgments; docs pass through INDEXING into a Doc Rep, the query into a Query Rep; SEARCHING produces a Ranking and returns results; Feedback drives QUERY MODIFICATION.) Information Retrieval and Vector Space Model
Indexing Document processing: break documents into words; apply a stop list; stemming; construct the index (see the sketch below). 15 Information Retrieval and Vector Space Model
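A minimal sketch of such an indexing pipeline, assuming an illustrative stop list and a deliberately crude suffix-stripping stand-in for a real stemmer (none of these details come from the slides):

```python
from collections import defaultdict

STOP_WORDS = {"a", "of", "in", "the", "to", "and"}        # illustrative stop list

def stem(word):
    """Very crude suffix stripping, standing in for a real stemmer (e.g. Porter)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def tokenize(text):
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def build_index(docs):
    """Build an inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(lambda: defaultdict(int))
    for doc_id, text in docs.items():
        for term in map(stem, tokenize(text)):
            index[term][doc_id] += 1
    return index

docs = {
    "d1": "Shipment of gold damaged in a fire",
    "d2": "Delivery of silver arrived in a silver truck",
}
index = build_index(docs)
print(dict(index["silver"]))   # {'d2': 2}
```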
Searching Given a query, score documents efficiently. The basic question: given a query, how do we know whether document A is more relevant than document B? Possible intuitions: document A uses more query words than document B; word usage in document A is more similar to that in the query; … We need a way to compute the relevance between a query and the documents. 16 Information Retrieval and Vector Space Model
The Notion of Relevance 17 Relevance(Rep(q), Rep(d)) can be modeled in several ways. Similarity, with different representations and similarity measures: vector space model (Salton et al., 75), prob. distr. model (Wong & Yao, 89), … Probability of relevance P(r=1|q,d), r ∈ {0,1}: classical prob. model (Robertson & Sparck Jones, 76), regression model (Fox, 83), and generative models P(d|q) or P(q|d) (doc generation vs. query generation), including the LM approach (Ponte & Croft, 98; Lafferty & Zhai, 01a). Probabilistic inference, with different inference systems: prob. concept space model (Wong & Yao, 95), inference network model (Turtle & Croft, 91). Today's lecture: the vector space model. Information Retrieval and Vector Space Model
Relevance = Similarity Assumptions: query and document are represented similarly; a query can be regarded as a "document"; Relevance(d,q) ≈ similarity(d,q). R(q) = {d ∈ C | f(d,q) > θ}, where f(q,d) = sim(Rep(q), Rep(d)) and θ is a cutoff threshold. Key issues: how to represent the query/document? Vector Space Model (VSM). How to define the similarity measure? 18 Information Retrieval and Vector Space Model
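Concretely, the model scores each document against the query and keeps those above a cutoff; a minimal sketch, with a placeholder word-overlap similarity standing in for the VSM similarity defined later:

```python
def retrieve(query, collection, sim, theta=0.0):
    """Approximate R'(q): documents whose similarity to the query exceeds theta,
    returned in descending order of score."""
    scored = [(doc_id, sim(query, doc)) for doc_id, doc in collection.items()]
    return sorted([(d, s) for d, s in scored if s > theta],
                  key=lambda pair: pair[1], reverse=True)

# Placeholder similarity: count of shared words (the VSM refines this).
overlap = lambda q, d: len(set(q.lower().split()) & set(d.lower().split()))

collection = {"d1": "gold shipment", "d2": "silver truck delivery"}
print(retrieve("gold silver truck", collection, overlap))
# [('d2', 2), ('d1', 1)]
```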
Outline Introduction to IR IR System Architecture Vector Space Model (VSM) How to Assign Weights? TF-IDF Weighting Example Advantages and Disadvantages of VS Model Improving the VS Model 19 Information Retrieval and Vector Space Model
Vector Space Model (VSM) The vector space model is one of the most widely used models for ad-hoc retrieval Used in information filtering, information retrieval, indexing and relevancy rankings. 20 Information Retrieval and Vector Space Model
VSM Represent a doc/query by a term vector. Term: basic concept, e.g., a word or phrase. Each term defines one dimension; N terms define a high-dimensional space. E.g., d = (x_1, …, x_N), where x_i is the "importance" of term i. Measure relevance by the distance between the query vector and the document vector in the vector space. 21 Information Retrieval and Vector Space Model
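A tiny sketch of the term-vector representation over a fixed, illustrative three-term vocabulary (matching the axes of the figure on the next slide):

```python
VOCAB = ["java", "microsoft", "starbucks"]   # illustrative 3-term vocabulary

def to_vector(text, vocab=VOCAB):
    """Represent a document/query as a vector of raw term counts (one dimension per term)."""
    words = text.lower().split()
    return [words.count(term) for term in vocab]

print(to_vector("java starbucks java"))   # [2, 0, 1]
print(to_vector("microsoft java"))        # [1, 1, 0]
```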
VS Model: illustration 22 (Figure: documents D1–D11 and a query plotted in a three-dimensional term space with axes Java, Microsoft, and Starbucks; documents near the query vector are the candidate answers, marked "?".) Information Retrieval and Vector Space Model
Vector Space Documents and Queries (Figure: documents D1–D11 plotted in a term space with axes t1, t2, t3, grouped by Boolean term combinations.) Q is a query, also represented as a vector. Information Retrieval and Vector Space Model 23
Some Issues about the VS Model There is no consistent definition of the basic concept "term". The model does not specify how weights should be assigned to words. A weight in the query indicates the importance of the corresponding term. 24 Information Retrieval and Vector Space Model
Outline Introduction to IR IR System Architecture Vector Space Model (VSM) How to Assign Weights? TF-IDF Weighting Example Advantages and Disadvantages of VS Model Improving the VS Model 25 Information Retrieval and Vector Space Model
How to Assign Weights? Different terms have different importance in a text. A term weighting scheme plays an important role in the similarity measure. Higher weight = greater impact. We now turn to the question of how to weight words in the vector space model. 26 Information Retrieval and Vector Space Model
There are three components in a weighting scheme: g_i, the global weight of the i-th term; t_ij, the local weight of the i-th term in the j-th document; d_j, the normalization factor for the j-th document. (These are typically combined multiplicatively, e.g. w_ij = g_i * t_ij * d_j.) 27 Information Retrieval and Vector Space Model
Weighting Two basic heuristics TF (Term Frequency) = Within-doc-frequency IDF (Inverse Document Frequency) TF normalization 28 Information Retrieval and Vector Space Model
Outline Introduction to IR IR System Architecture Vector Space Model (VSM) How to Assign Weights? TF-IDF Weighting Example Advantages and Disadvantages of VS Model Improving the VS Model 29 Information Retrieval and Vector Space Model
TF Weighting Idea: a term is more important if it occurs more frequently in a document. Formulas: let f(t,d) be the frequency count of term t in doc d. Raw TF: TF(t,d) = f(t,d). Log TF: TF(t,d) = log f(t,d). Maximum frequency normalization: TF(t,d) = 0.5 + 0.5 * f(t,d) / MaxFreq(d). Normalization of TF is very important! 30 Information Retrieval and Vector Space Model
TF Methods 31 Information Retrieval and Vector Space Model
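The table of TF methods on this slide is not recoverable here; as a stand-in, the three TF variants from the previous slide are sketched below (the 0.5 constants in the maximum-frequency variant follow the common augmented-TF convention and are an assumption):

```python
import math

def raw_tf(f):
    return f

def log_tf(f):
    return math.log(f) if f > 0 else 0.0

def max_norm_tf(f, max_f, alpha=0.5):
    # Augmented TF: alpha + (1 - alpha) * f / max_f, commonly alpha = 0.5 (assumed here).
    return alpha + (1 - alpha) * f / max_f if f > 0 else 0.0

counts = {"information": 4, "retrieval": 1}          # toy within-document counts
max_f = max(counts.values())
for term, f in counts.items():
    print(term, raw_tf(f), round(log_tf(f), 3), max_norm_tf(f, max_f))
# information 4 1.386 1.0
# retrieval   1 0.0   0.625
```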
IDF Weighting Idea: a term is more discriminative if it occurs in fewer documents. Formula: IDF(t) = 1 + log(n/k), where n is the total number of docs and k is the number of docs containing term t (doc freq). 32 Information Retrieval and Vector Space Model
IDF weighting Methods 33 Information Retrieval and Vector Space Model
TF Normalization Why? Document length variation; "repeated occurrences" are less informative than the "first occurrence". Two views of document length: a doc is long because it uses more words; a doc is long because it has more content. Generally penalize long docs, but avoid over-penalizing. 34 Information Retrieval and Vector Space Model
TF-IDF Weighting TF-IDF weighting: weight(t,d) = TF(t,d) * IDF(t). Common in doc → high TF → high weight. Rare in collection → high IDF → high weight. Imagine a word-count profile: what kind of terms would have high weights? 35 Information Retrieval and Vector Space Model
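A minimal sketch of this TF-IDF weighting, using raw TF and IDF(t) = 1 + log(n/k) from the earlier slide, over a toy three-document collection (the same documents reappear in the dot-product example a few slides ahead):

```python
import math

docs = {
    "d1": "information retrieval search engine information".split(),
    "d2": "travel information map travel".split(),
    "d3": "government president congress".split(),
}
n = len(docs)

def idf(term):
    k = sum(1 for words in docs.values() if term in words)   # document frequency
    return 1 + math.log(n / k) if k else 0.0

def tf_idf(term, words):
    return words.count(term) * idf(term)                     # weight(t,d) = TF(t,d) * IDF(t)

print(round(tf_idf("information", docs["d1"]), 3))   # 2 * (1 + ln(3/2)) ≈ 2.811
print(round(tf_idf("retrieval", docs["d1"]), 3))     # 1 * (1 + ln(3/1)) ≈ 2.099
```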
How to Measure Similarity? 36 Information Retrieval and Vector Space Model
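The similarity formulas on this slide are not recoverable here; the two measures most commonly paired with the VSM are the dot product (used in the examples that follow) and the cosine of the angle between query and document vectors. A minimal sketch:

```python
import math

def dot(q, d):
    """Dot-product similarity over equal-length weight vectors."""
    return sum(qi * di for qi, di in zip(q, d))

def cosine(q, d):
    """Cosine similarity: dot product normalized by the vector lengths."""
    norm = math.sqrt(sum(x * x for x in q)) * math.sqrt(sum(x * x for x in d))
    return dot(q, d) / norm if norm else 0.0

q = [2.4, 4.5, 0.0]        # query weights (toy values)
d = [4.8, 4.5, 2.1]        # document weights (toy values)
print(dot(q, d))               # 31.77
print(round(cosine(q, d), 3))  # 0.902
```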
Outline Introduction to IR IR System Architecture Vector Space Model (VSM) How to Assign Weights? TF-IDF Weighting Example Advantages of VS Model Improving the VSM Model 37 Information Retrieval and Vector Space Model
VS Example: Raw TF & Dot Product 38
query = "information retrieval"
doc1 = "information retrieval search engine information"
doc2 = "travel information map travel"
doc3 = "government president congress"
Term:        Info.   Retrieval  Travel  Map     Search  Engine  Govern.  President  Congress
IDF (fake):  2.4     4.5        2.8     3.3     2.1     5.4     2.2      3.2        4.3
Doc1:        2(4.8)  1(4.5)                     1(2.1)  1(5.4)
Doc2:        1(2.4)             2(5.6)  1(3.3)
Doc3:                                                           1(2.2)   1(3.2)     1(4.3)
Query:       1(2.4)  1(4.5)
(each cell shows raw TF with the TF*IDF weight in parentheses)
Sim(q,doc1) = 4.8*2.4 + 4.5*4.5
Sim(q,doc2) = 2.4*2.4
Sim(q,doc3) = 0
Information Retrieval and Vector Space Model
Example Q: "gold silver truck" D1: "Shipment of gold damaged in a fire" D2: "Delivery of silver arrived in a silver truck" D3: "Shipment of gold arrived in a truck" Document frequency of the j-th term: df_j. Inverse document frequency: idf_j = log10(n / df_j). tf*idf is used as the term weight here. 39 Information Retrieval and Vector Space Model
Example (Cont'd)
Id  Term      df  idf
1   a         3   0
2   arrived   2   0.176
3   damaged   1   0.477
4   delivery  1   0.477
5   fire      1   0.477
6   gold      2   0.176
7   in        3   0
8   of        3   0
9   silver    1   0.477
10  shipment  2   0.176
11  truck     2   0.176
(idf = log10(3/df))
40 Information Retrieval and Vector Space Model
Example (Cont'd) tf*idf is used here. SC(Q, D1) = (0)(0) + (0)(0) + (0)(0.477) + (0)(0) + (0)(0.477) + (0.176)(0.176) + (0)(0) + (0)(0) ≈ 0.031 SC(Q, D2) ≈ 0.486 SC(Q, D3) ≈ 0.062 The ranking would be D2, D3, D1. This SC uses the dot product. 41 Information Retrieval and Vector Space Model
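For reference, a small sketch that recomputes these scores with log10-based idf and a dot product, as in the example above (variable names are illustrative):

```python
import math

query = "gold silver truck".split()
docs = {
    "D1": "shipment of gold damaged in a fire".split(),
    "D2": "delivery of silver arrived in a silver truck".split(),
    "D3": "shipment of gold arrived in a truck".split(),
}

vocab = sorted(set(w for words in docs.values() for w in words))
n = len(docs)
idf = {t: math.log10(n / sum(1 for words in docs.values() if t in words)) for t in vocab}

def weights(words):
    return [words.count(t) * idf[t] for t in vocab]                 # tf * idf per vocabulary term

q_vec = weights(query)
for doc_id, words in docs.items():
    score = sum(qi * di for qi, di in zip(q_vec, weights(words)))   # dot product
    print(doc_id, round(score, 3))
# D1 0.031, D2 0.486, D3 0.062  ->  ranking: D2, D3, D1
```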
Outline Introduction to IR IR System Architecture Vector Space Model (VSM) How to Assign Weights? TF-IDF Weighting Example Advantages and Disadvantages of VS Model Improving the VS Model 42 Information Retrieval and Vector Space Model
Advantages of VS Model Empirically effective! (top TREC performance) Intuitive. Easy to implement. Well studied / most evaluated. The SMART system, developed at Cornell, is still widely used. Warning: many variants of TF-IDF! 43 Information Retrieval and Vector Space Model
Disadvantages of VS Model Assumes term independence. Assumes the query is represented the same way as a document. Lots of parameter tuning! 44 Information Retrieval and Vector Space Model
Outline Introduction to IR IR System Architecture Vector Space Model (VSM) How to Assign Weights? TF-IDF Weighting Example Advantages and Disadvantages of VS Model Improving the VS Model 45 Information Retrieval and Vector Space Model
Improving the VS Model We can improve the model by: reducing the number of dimensions (eliminating all stop words and very common terms, stemming terms to their roots, Latent Semantic Analysis); not retrieving documents below a defined cosine threshold; normalizing term frequencies. The normalized frequency of a term i in document j is given by [1]; the commonly used forms are tf_ij = freq_ij / max_k freq_kj for documents and tf_iq = 0.5 + 0.5 * freq_iq / max_k freq_kq for queries. Information Retrieval and Vector Space Model 46
Stop List Function words do not bear useful information for IR: of, not, to, or, in, about, with, I, be, … A stop list contains stop words that are not to be used as index terms: prepositions, articles, pronouns, some adverbs and adjectives, some frequent words (e.g. "document"). The removal of stop words usually improves IR effectiveness. A few "standard" stop lists are commonly used. 47 Information Retrieval and Vector Space Model
Stemming 48 Reason: different word forms may bear similar meaning (e.g. search, searching); create a "standard" representation for them. Stemming: removing some endings of words, e.g. dancer, dancers, dance, danced, dancing → dance. Information Retrieval and Vector Space Model
Stemming (Cont'd) Two main methods: Linguistic/dictionary-based stemming: high stemming accuracy, high implementation and processing costs, higher coverage. Porter-style stemming: lower stemming accuracy, lower implementation and processing costs, lower coverage; usually sufficient for IR. 49 Information Retrieval and Vector Space Model
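To illustrate the Porter-style (suffix-stripping) idea only, here is a deliberately naive sketch; the suffix list is an assumption, and the output shows the lower accuracy compared with a real stemmer:

```python
def naive_stem(word):
    """Toy suffix-stripping stemmer (Porter-style in spirit, far less accurate)."""
    word = word.lower()
    for suffix in ("ations", "ation", "ings", "ing", "ers", "er", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([naive_stem(w) for w in ["dancer", "dancers", "dance", "danced", "dancing"]])
# ['danc', 'danc', 'dance', 'danc', 'danc']  -- not the clean 'dance' a real stemmer would give
```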
Latent Semantic Indexing (LSI) [3] Reduces the dimensions of the term-document space. Attempts to address the problems of synonymy and polysemy. Uses Singular Value Decomposition (SVD): identifies patterns in the relationships between the terms and concepts contained in an unstructured collection of text. Based on the principle that words that are used in the same contexts tend to have similar meanings. Information Retrieval and Vector Space Model 50
LSI Process In general, the process involves: constructing a weighted term-document matrix performing a Singular Value Decomposition on the matrix using the matrix to identify the concepts contained in the text LSI statistically analyses the patterns of word usage across the entire document collection Information Retrieval and Vector Space Model 51
It computes the term and document vector spaces by transforming the single term-frequency matrix A into three other matrices: a term-concept vector matrix T, a singular-values matrix S, and a concept-document vector matrix D, which satisfy the relation A = T S D^T. The reason SVD is useful: it finds a reduced-dimensional representation of the matrix that emphasizes the strongest relationships and throws away the noise. Information Retrieval and Vector Space Model 52
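A minimal numpy sketch of this decomposition and of the rank-k truncation LSI relies on (the tiny term-document matrix and k = 2 are illustrative assumptions):

```python
import numpy as np

# Toy term-document count matrix A: rows = terms, columns = documents.
A = np.array([
    [2., 0., 1.],   # "information"
    [1., 0., 1.],   # "retrieval"
    [0., 2., 0.],   # "travel"
    [0., 1., 0.],   # "map"
])

# Full SVD: A = T @ diag(S) @ D^T
T, S, Dt = np.linalg.svd(A, full_matrices=False)

# Keep only the k strongest "concepts" (rank-k approximation).
k = 2
A_k = T[:, :k] @ np.diag(S[:k]) @ Dt[:k, :]

print(np.round(S, 3))    # singular values, strongest concepts first
print(np.round(A_k, 3))  # reduced-rank reconstruction of A
```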
What is noise in a document? Authors have a wide choice of words available when they write, so the same concepts can be obscured by different word choices from different authors. This essentially random choice of words introduces noise into the word-concept relationship. Latent Semantic Analysis filters out some of this noise and also attempts to find the smallest set of concepts that spans all the documents. Information Retrieval and Vector Space Model 53
References
Introduction to Information Retrieval, by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schuetze
updated.ppt
12/F/6339/_media/lecture_13_ir_and_vsm_.ppt
Document Classification based on Wikipedia Content
Information Retrieval and Vector Space Model
Thanks For Your Attention… 55 Information Retrieval and Vector Space Model