Information Retrieval and Vector Space Model
Computational Linguistics Course
Instructor: Professor Cercone
Presenter: Morteza Zihayat
Outline
Introduction to IR
IR System Architecture
Vector Space Model (VSM)
How to Assign Weights?
TF-IDF Weighting
Example
Advantages and Disadvantages of the VS Model
Improving the VS Model
Introduction to IR
The world's total yearly production of unique information stored in print, film, optical, and magnetic media would require roughly 1.5 billion gigabytes of storage. This is the equivalent of 250 megabytes per person for every man, woman, and child on earth. (Lyman & Varian, 2000)
Growth of textual information
Literature, Email, WWW, Desktop, News, Intranet, Blog
How can we help manage and exploit all this information?
Information overflow
What is Information Retrieval (IR)?
Narrow sense:
IR = search engine technologies (e.g., Google, library information systems)
IR = text matching/classification
Broad sense: IR = text information management. The general problem: how do we manage text information?
How to find useful information? (retrieval) Example: Google
How to organize information? (text classification) Example: automatically assign emails to different folders
How to discover knowledge from text? (text mining) Example: discover correlations between events
Formalizing IR Tasks
Vocabulary: V = {w1, w2, …, wT} of a language
Query: q = q1, q2, …, qm, where qi ∈ V
Document: di = di1, di2, …, dimi, where dij ∈ V
Collection: C = {d1, d2, …, dN}
Relevant document set: R(q) ⊆ C; generally unknown and user-dependent
The query provides a "hint" as to which documents should be in R(q)
IR: find an approximation R'(q) of the relevant document set
Source: This slide is borrowed from [1]
Evaluation measures
The quality of a retrieval system depends on how well it ranks relevant documents. How can we evaluate rankings in IR? IR researchers have developed evaluation measures specifically designed for rankings; most of these measures combine precision and recall in a way that takes the ranking into account.
Precision & Recall
Source: This slide is borrowed from [1]
In other words:
Precision is the percentage of items in the returned set that are relevant.
Recall is the percentage of all relevant documents in the collection that are in the returned set.
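The two definitions above can be sketched directly in Python. This is a toy illustration (not from the original slides); the document IDs are invented:

```python
def precision_recall(retrieved, relevant):
    """Precision: fraction of the returned set that is relevant.
    Recall: fraction of all relevant documents that were returned."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical run: 3 of the 4 returned docs are relevant,
# out of 6 relevant docs in the whole collection.
p, r = precision_recall(["d1", "d2", "d3", "d4"],
                        ["d1", "d2", "d3", "d5", "d6", "d7"])
print(p, r)  # 0.75 0.5
```

Note the trade-off the slides allude to: returning more documents can only keep or raise recall, while it typically lowers precision.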
Evaluating Retrieval Performance
Source: This slide is borrowed from [1]
IR System Architecture
[Diagram: the user issues a query and provides relevance judgments; documents are turned into a document representation (INDEXING); the query is turned into a query representation and possibly revised (QUERY MODIFICATION); a ranking module matches the two to produce results (SEARCHING); feedback flows back through the INTERFACE]
Indexing
Break documents into words
Apply a stop list
Apply stemming
Construct the index
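The last step, index construction, usually means building an inverted index mapping each term to the documents that contain it. A minimal sketch (the documents and IDs are invented for illustration; stop-listing and stemming are shown on later slides):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: {doc_id: text}. Maps each term to the set of doc IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

index = build_inverted_index({1: "gold shipment",
                              2: "silver delivery",
                              3: "gold truck"})
print(sorted(index["gold"]))  # [1, 3]
```

At search time the index lets us touch only documents that share at least one term with the query, instead of scanning the whole collection.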
Searching
Given a query, score documents efficiently. The basic question: given a query, how do we know whether document A is more relevant than document B?
Document A uses more query words than document B
Word usage in document A is more similar to that in the query
…
We need a way to compute the relevance of a document to a query.
The Notion of Relevance
Relevance(Rep(q), Rep(d)) has been modeled in several ways, using different representations and similarity measures:
Similarity: vector space model (Salton et al., 75); probabilistic distribution model (Wong & Yao, 89); …
Probability of relevance P(r=1|q,d), r ∈ {0,1}: classical probabilistic model (Robertson & Sparck Jones, 76); regression model (Fox, 83); generative models, via document generation or query generation, e.g., the LM approach (Ponte & Croft, 98; Lafferty & Zhai, 01a)
Probabilistic inference P(d→q) or P(q→d), under different inference systems: probabilistic concept space model (Wong & Yao, 95); inference network model (Turtle & Croft, 91)
Today's lecture: the vector space model
Relevance ≈ Similarity
Assumptions:
Query and document are represented similarly; a query can be regarded as a "document"
Relevance(d,q) is measured by similarity(d,q)
R(q) = {d ∈ C | f(d,q) > θ}, where f(q,d) = similarity(Rep(q), Rep(d)) and θ is a threshold
Key issues:
How to represent a query/document? The Vector Space Model (VSM)
How to define the similarity measure?
Vector Space Model (VSM)
The vector space model is one of the most widely used models for ad hoc retrieval. It is used in information filtering, information retrieval, indexing, and relevancy ranking.
VSM
Represent a document/query by a term vector
Term: a basic concept, e.g., a word or phrase
Each term defines one dimension, so N terms define a high-dimensional space
E.g., d = (x1, …, xN), where xi is the "importance" of term i
Measure relevance by the distance between the query vector and the document vector in this space
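The representation step can be sketched as follows. This toy example (vocabulary and document text invented for illustration) uses raw term counts as the "importance" values:

```python
def to_vector(text, vocabulary):
    """Represent a document as raw term counts over a fixed vocabulary.
    Each vocabulary term is one dimension of the vector."""
    tokens = text.lower().split()
    return [tokens.count(term) for term in vocabulary]

vocab = ["java", "microsoft", "starbucks"]
doc = "java at starbucks with more java"
print(to_vector(doc, vocab))  # [2, 0, 1]
```

Tokens outside the vocabulary are simply ignored; later slides replace the raw counts with TF-IDF weights.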
VS Model: illustration
[Figure: documents D1–D11 and a query plotted in a 3-D space with axes Java, Microsoft, Starbucks; documents whose vectors lie near the query vector are the candidate answers]
Vector Space Documents and Queries
[Figure: documents D1–D11 plotted over terms t1, t2, t3, with Boolean term combinations; the query Q is also represented as a vector]
Some Issues with the VS Model
There is no consistent definition of what a basic concept (term) is
How to assign weights to words is not fully determined; a weight in the query indicates the importance of a term
How to Assign Weights?
Different terms have different importance in a text, so the term weighting scheme plays an important role in the similarity measure: a higher weight means a greater impact. We now turn to the question of how to weight words in the vector space model.
There are three components in a weighting scheme:
gi: the global weight of the i-th term
tij: the local weight of the i-th term in the j-th document
dj: the normalization factor for the j-th document
Weighting
Two basic heuristics:
TF (Term Frequency) = within-document frequency
IDF (Inverse Document Frequency)
Plus TF normalization
TF Weighting
Idea: a term is more important if it occurs more frequently in a document.
Formulas (let f(t,d) be the frequency count of term t in document d):
Raw TF: TF(t,d) = f(t,d)
Log TF: TF(t,d) = log f(t,d)
Maximum frequency normalization: TF(t,d) = 0.5 + 0.5 * f(t,d) / MaxFreq(d)
Normalization of TF is very important!
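The three TF formulas above can be written out directly. A small sketch (the sample frequencies are invented; the slide does not fix the log base, so the natural log is assumed here):

```python
import math

def tf_raw(f):
    return f                       # Raw TF: TF(t,d) = f(t,d)

def tf_log(f):
    return math.log(f) if f > 0 else 0.0   # Log TF: TF(t,d) = log f(t,d)

def tf_max_norm(f, max_f):
    # Maximum frequency normalization: 0.5 + 0.5 * f(t,d) / MaxFreq(d)
    return 0.5 + 0.5 * f / max_f

f, max_f = 4, 8   # term occurs 4 times; most frequent term occurs 8 times
print(tf_raw(f), round(tf_log(f), 3), tf_max_norm(f, max_f))
```

Note how the log and max-normalized variants dampen the effect of very frequent terms, which is exactly why TF normalization matters.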
TF Methods
IDF Weighting
Idea: a term is more discriminative if it occurs in fewer documents.
Formula: IDF(t) = 1 + log(n/k)
n: total number of documents
k: number of documents containing term t (document frequency)
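The IDF formula above is a one-liner; the sketch below (with invented collection sizes) shows how a rare term is rewarded over a common one:

```python
import math

def idf(n_docs, doc_freq):
    """IDF(t) = 1 + log(n/k): the fewer documents a term occurs in,
    the more discriminative (and the higher weighted) it is."""
    return 1 + math.log(n_docs / doc_freq)

# A term appearing in 1 of 1000 docs vs. one appearing in 900 of 1000 docs.
print(round(idf(1000, 1), 2), round(idf(1000, 900), 2))
```

A term occurring in every document gets IDF(t) = 1 + log(1) = 1, the minimum under this formula.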
IDF Weighting Methods
TF Normalization
Why? Document length varies, and "repeated occurrences" of a term are less informative than the "first occurrence".
Two views of document length:
A document is long because it uses more words
A document is long because it has more content
Generally we penalize long documents, but we must avoid over-penalizing them.
TF-IDF Weighting
TF-IDF weighting: weight(t,d) = TF(t,d) * IDF(t)
Common in the document → high TF → high weight
Rare in the collection → high IDF → high weight
Imagine a word-count profile: what kinds of terms would have high weights?
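Combining the two heuristics, the product weight(t,d) = TF(t,d) * IDF(t) is highest for terms that are frequent in the document but rare in the collection. A sketch with invented numbers (raw TF and the IDF = 1 + log(n/k) variant from the previous slide):

```python
import math

def tf_idf(f_td, n_docs, doc_freq):
    """weight(t,d) = TF(t,d) * IDF(t), using raw TF and IDF(t) = 1 + log(n/k)."""
    return f_td * (1 + math.log(n_docs / doc_freq))

# Frequent in the doc (tf = 5) and rare in the collection (1 of 1000 docs):
print(round(tf_idf(5, 1000, 1), 2))    # high weight
# Equally frequent in the doc but common in the collection (900 of 1000 docs):
print(round(tf_idf(5, 1000, 900), 2))  # much lower weight
```

This answers the slide's question: terms with a high weight are content words that characterize this document against the rest of the collection; function words (high df) and incidental words (low tf) both score low.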
How to Measure Similarity?
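The content of this slide is not preserved in the transcript; the two measures used elsewhere in the deck are the dot product (slide "VS Example") and cosine similarity (slide "Improving the VS Model"). A sketch of cosine similarity, with the vector values invented for illustration:

```python
import math

def cosine_sim(u, v):
    """Cosine of the angle between two term-weight vectors:
    dot(u, v) / (|u| * |v|). Length-normalized, unlike the raw dot product."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

q = [2.4, 4.5, 0.0]   # query weights over a 3-term vocabulary
d = [2.4, 4.5, 3.3]   # document weights over the same vocabulary
print(round(cosine_sim(q, d), 3))
```

Because cosine divides by the vector lengths, a long document does not win just by repeating terms, which connects back to the TF normalization discussion.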
VS Example: Raw TF & Dot Product
query = "information retrieval"
doc1 = "information retrieval search engine information"
doc2 = "travel information map travel"
doc3 = "government president congress"
IDF (fake): Info. 2.4, Retrieval 4.5, Travel 2.8, Map 3.3, Search 2.1, Engine 5.4, Govern. 2.2, President 3.2, Congress 4.3
Vectors, entries written tf(tf*idf):
Doc1: Info. 2(4.8), Retrieval 1(4.5), Search 1(2.1), Engine 1(5.4)
Doc2: Info. 1(2.4), Travel 2(5.6), Map 1(3.3)
Doc3: Govern. 1(2.2), President 1(3.2), Congress 1(4.3)
Query: Info. 1(2.4), Retrieval 1(4.5)
Sim(q,doc1) = 4.8*2.4 + 4.5*4.5
Sim(q,doc2) = 2.4*2.4
Sim(q,doc3) = 0
Example
Q: "gold silver truck"
D1: "Shipment of gold damaged in a fire"
D2: "Delivery of silver arrived in a silver truck"
D3: "Shipment of gold arrived in a truck"
dfj: document frequency of the j-th term
Inverse document frequency: idfj = log10(n / dfj)
tf * idf is used as the term weight here.
Example (Cont'd)
Id  Term      df  idf
1   a         3   0
2   arrived   2   0.176
3   damaged   1   0.477
4   delivery  1   0.477
5   fire      1   0.477
6   gold      2   0.176
7   in        3   0
8   of        3   0
9   silver    1   0.477
10  shipment  2   0.176
11  truck     2   0.176
Example (Cont'd)
Using tf * idf weights and the dot product as the similarity coefficient (SC):
SC(Q, D1) = (0)(0) + (0)(0) + (0)(0.477) + (0)(0) + (0)(0.477) + (0.176)(0.176) + (0)(0) + (0)(0) = 0.031
SC(Q, D2) = 0.486
SC(Q, D3) = 0.062
The ranking would be D2, D3, D1.
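The whole worked example can be reproduced end to end. The following sketch (not part of the original slides) rebuilds the df table, the tf*idf weights with idf = log10(n/df), and the dot-product scores:

```python
import math

def idf(n, df):
    return math.log10(n / df) if df else 0.0

docs = {
    "D1": "shipment of gold damaged in a fire",
    "D2": "delivery of silver arrived in a silver truck",
    "D3": "shipment of gold arrived in a truck",
}
query = "gold silver truck"

# Document frequencies over the 3-document collection.
vocab = sorted({t for text in docs.values() for t in text.split()})
df = {t: sum(t in text.split() for text in docs.values()) for t in vocab}
n = len(docs)

def weights(text):
    """tf * idf weight for every vocabulary term."""
    tokens = text.split()
    return {t: tokens.count(t) * idf(n, df[t]) for t in vocab}

qw = weights(query)
scores = {}
for name, text in docs.items():
    dw = weights(text)
    scores[name] = sum(qw[t] * dw[t] for t in vocab)   # dot product

print({name: round(s, 3) for name, s in scores.items()})
# {'D1': 0.031, 'D2': 0.486, 'D3': 0.062} -> ranking D2, D3, D1
```

D2 wins mainly through "silver": it is the rarest query term (df = 1, idf = 0.477) and occurs twice in D2, so both the query weight and the document weight are high.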
Advantages of VS Model
Empirically effective (top TREC performance)
Intuitive
Easy to implement
Well studied and widely evaluated: the SMART system, developed at Cornell (1960–1999), is still widely used
Warning: there are many variants of TF-IDF!
Disadvantages of VS Model
Assumes term independence
Assumes the query and the document can be represented the same way
Requires lots of parameter tuning!
Improving the VS Model
We can improve the model by:
Reducing the number of dimensions: eliminate all stop words and very common terms; stem terms to their roots; apply Latent Semantic Analysis
Not retrieving documents below a defined cosine threshold
Normalizing frequencies [1]: the normalized frequency of term i in document j is tfij = freqij / maxk(freqkj), and query frequencies are normalized analogously
Stop List
Function words do not carry useful information for IR: of, not, to, or, in, about, with, I, be, …
A stop list contains stop words, which are not to be used as index terms: prepositions, articles, pronouns, some adverbs and adjectives, and some very frequent words (e.g., "document")
Removing stop words usually improves IR effectiveness; a few "standard" stop lists are commonly used.
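Stop-word removal is a simple set-membership filter. A sketch with a toy stop list built from the words listed above (real systems use one of the "standard" lists the slide mentions):

```python
# Toy stop list; real lists (e.g., the SMART list) contain hundreds of words.
STOP_WORDS = {"of", "not", "to", "or", "in", "about", "with", "i", "be", "a", "the"}

def remove_stop_words(text):
    """Drop function words before indexing; content words remain."""
    return [t for t in text.lower().split() if t not in STOP_WORDS]

print(remove_stop_words("Shipment of gold in a truck"))
# ['shipment', 'gold', 'truck']
```

Besides improving effectiveness, this shrinks the index: the discarded words are precisely the ones with near-zero IDF.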
Stemming
Reason: different word forms may bear similar meaning (e.g., search, searching), so we create a "standard" representation for them.
Stemming: removing some endings of words, e.g., dancer, dancers, dance, danced, dancing → dance
Stemming (Cont'd)
Two main methods:
Linguistic/dictionary-based stemming: high stemming accuracy and higher coverage, but high implementation and processing costs
Porter-style stemming: lower stemming accuracy and lower coverage, but lower implementation and processing costs; usually sufficient for IR
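A crude suffix-stripping sketch in the spirit of Porter-style stemming (this is NOT the real Porter algorithm, which applies ordered rule phases with a syllable-measure condition; the suffix list here is invented for illustration):

```python
def toy_stem(word):
    """Strip one common suffix, leaving a stem of at least 3 characters.
    Stems need not be dictionary words: all 'dance' forms map to 'danc'."""
    for suffix in ("ing", "ers", "ed", "er", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ("dancer", "dancers", "dance", "danced", "dancing"):
    print(w, "->", toy_stem(w))   # all map to "danc"
```

As with the real Porter stemmer, the stem ("danc") is not itself a word; what matters for IR is only that all variants conflate to the same index term.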
Latent Semantic Indexing (LSI) [3]
Reduces the dimensionality of the term-document space
Attempts to solve the synonymy and polysemy problems
Uses Singular Value Decomposition (SVD) to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text
Based on the principle that words used in the same contexts tend to have similar meanings
LSI Process
In general, the process involves:
constructing a weighted term-document matrix
performing a Singular Value Decomposition on the matrix
using the resulting matrices to identify the concepts contained in the text
LSI statistically analyzes the patterns of word usage across the entire document collection.
LSI computes the term and document vector spaces by decomposing the single term-frequency matrix A into three other matrices:
a term-concept vector matrix T,
a singular values matrix S,
a concept-document vector matrix D,
which satisfy A = T S D^T.
SVD is useful because it finds a reduced-dimensional representation of the matrix that emphasizes the strongest relationships and throws away the noise.
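A minimal numerical sketch of this decomposition using NumPy's SVD (the 4-term x 3-document count matrix is invented for illustration):

```python
import numpy as np

# Toy term-document matrix A (rows = terms, columns = documents).
A = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])

# A = T S D^T: term-concept matrix T, singular values s, concept-document D^T.
T, s, Dt = np.linalg.svd(A, full_matrices=False)

# Keep only the k strongest concepts; the discarded small singular
# values carry the "noise" the text describes.
k = 2
A_k = T[:, :k] @ np.diag(s[:k]) @ Dt[:k, :]

# A_k is the best rank-k least-squares approximation of A.
print(np.round(A_k, 2))
```

Queries and documents are then compared in the k-dimensional concept space rather than the original term space, which is how LSI conflates synonymous terms.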
What is noise in a document?
Authors have a wide choice of words available when they write, so concepts can be obscured by different word choices from different authors. This essentially random choice of words introduces noise into the word-concept relationship. Latent Semantic Analysis filters out some of this noise and also attempts to find the smallest set of concepts that spans all the documents.
References
Introduction to Information Retrieval, by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schuetze
https://wiki.cse.yorku.ca/course_archive/2010-11/F/6390/_media/2.pdf
https://wiki.cse.yorku.ca/course_archive/2010-11/F/6390/_media/ir4up.pdf
https://wiki.cse.yorku.ca/course_archive/2010-11/F/6390/_media/e09-3009.pdf
https://wiki.cse.yorku.ca/course_archive/2010-11/F/6390/_media/07models-vsm.pdf
https://wiki.cse.yorku.ca/course_archive/2010-11/F/6390/_media/03vectorspaceimplementation-6per.pdf
https://wiki.cse.yorku.ca/course_archive/2011-12/F/6339/_media/lecture02.ppt
https://wiki.cse.yorku.ca/course_archive/2011-12/F/6339/_media/vector_space_model-updated.ppt
https://wiki.cse.yorku.ca/course_archive/2011-12/F/6339/_media/lecture_13_ir_and_vsm_.ppt
Document Classification based on Wikipedia Content, http://www.iicm.tugraz.at/cguetl/courses/isr/opt/classification/Vector_Space_Model.html?timestamp=1318275702299
Thanks For Your Attention!