Information Retrieval and Vector Space Model
Computational Linguistics Course
Instructor: Professor Cercone
Presenter: Morteza Zihayat
Outline
Introduction to IR
IR System Architecture
Vector Space Model (VSM)
How to Assign Weights?
TF-IDF Weighting
Example
Advantages and Disadvantages of the VS Model
Improving the VS Model
Introduction to IR
The world's total yearly production of unique information stored in print, film, optical, and magnetic media would require roughly 1.5 billion gigabytes of storage. This is the equivalent of 250 megabytes per person for every man, woman, and child on earth. (Lyman & Varian, 2000)
Growth of textual information
Literature, Email, WWW, Desktop, News, Intranet, Blog
How can we help manage and exploit all this information?
Information overflow
What is Information Retrieval (IR)?
Narrow sense:
IR = search engine technologies (e.g., Google, library information systems)
IR = text matching/classification
Broad sense: IR = text information management. The general problem: how do we manage text information?
How to find useful information? (retrieval) Example: Google
How to organize information? (text classification) Example: automatically assign emails to different folders
How to discover knowledge from text? (text mining) Example: discover correlations between events
Formalizing IR Tasks
Vocabulary: V = {w1, w2, …, wT} of a language
Query: q = q1, q2, …, qm, where qi ∈ V
Document: di = di1, di2, …, dimi, where dij ∈ V
Collection: C = {d1, d2, …, dN}
Relevant document set: R(q) ⊆ C; generally unknown and user-dependent
The query provides a "hint" as to which documents should be in R(q)
IR: find an approximation R'(q) of the relevant document set
Source: This slide is borrowed from [1]
Evaluation measures
The quality of a retrieval system depends on how well it ranks relevant documents. How can we evaluate rankings in IR? IR researchers have developed evaluation measures specifically designed for rankings; most of these measures combine precision and recall in a way that takes the ranking into account.
Precision & Recall
Source: This slide is borrowed from [1]
In other words:
Precision is the percentage of items in the returned set that are relevant.
Recall is the percentage of all relevant documents in the collection that are in the returned set.
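The two definitions above can be sketched directly in Python. This is a toy illustration (not from the original slides); the document IDs are invented:

```python
def precision_recall(retrieved, relevant):
    """Precision: fraction of the returned set that is relevant.
    Recall: fraction of all relevant documents that were returned."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical run: 3 of the 4 returned docs are relevant,
# out of 6 relevant docs in the whole collection.
p, r = precision_recall(["d1", "d2", "d3", "d4"],
                        ["d1", "d2", "d3", "d5", "d6", "d7"])
print(p, r)  # 0.75 0.5
```

Note the trade-off the slides allude to: returning more documents can only keep or raise recall, while it typically lowers precision.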
Evaluating Retrieval Performance
Source: This slide is borrowed from [1]
IR System Architecture
[Diagram: the user issues a query and provides relevance judgments; documents are turned into a document representation (INDEXING); the query is turned into a query representation and possibly revised (QUERY MODIFICATION); a ranking module matches the two to produce results (SEARCHING); feedback flows back through the INTERFACE]
Indexing
Break documents into words
Apply a stop list
Apply stemming
Construct the index
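The last step, index construction, usually means building an inverted index mapping each term to the documents that contain it. A minimal sketch (the documents and IDs are invented for illustration; stop-listing and stemming are shown on later slides):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: {doc_id: text}. Maps each term to the set of doc IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

index = build_inverted_index({1: "gold shipment",
                              2: "silver delivery",
                              3: "gold truck"})
print(sorted(index["gold"]))  # [1, 3]
```

At search time the index lets us touch only documents that share at least one term with the query, instead of scanning the whole collection.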
Searching
Given a query, score documents efficiently. The basic question: given a query, how do we know whether document A is more relevant than document B?
Document A uses more query words than document B
Word usage in document A is more similar to that in the query
…
We need a way to compute the relevance of a document to a query.
The Notion of Relevance
Relevance(Rep(q), Rep(d)) has been modeled in several ways, using different representations and similarity measures:
Similarity: vector space model (Salton et al., 75); probabilistic distribution model (Wong & Yao, 89); …
Probability of relevance P(r=1|q,d), r ∈ {0,1}: classical probabilistic model (Robertson & Sparck Jones, 76); regression model (Fox, 83); generative models, via document generation or query generation, e.g., the LM approach (Ponte & Croft, 98; Lafferty & Zhai, 01a)
Probabilistic inference P(d→q) or P(q→d), under different inference systems: probabilistic concept space model (Wong & Yao, 95); inference network model (Turtle & Croft, 91)
Today's lecture: the vector space model
Relevance ≈ Similarity
Assumptions:
Query and document are represented similarly; a query can be regarded as a "document"
Relevance(d,q) is measured by similarity(d,q)
R(q) = {d ∈ C | f(d,q) > θ}, where f(q,d) = similarity(Rep(q), Rep(d)) and θ is a threshold
Key issues:
How to represent a query/document? The Vector Space Model (VSM)
How to define the similarity measure?
Vector Space Model (VSM)
The vector space model is one of the most widely used models for ad hoc retrieval. It is used in information filtering, information retrieval, indexing, and relevancy ranking.
VSM
Represent a document/query by a term vector
Term: a basic concept, e.g., a word or phrase
Each term defines one dimension, so N terms define a high-dimensional space
E.g., d = (x1, …, xN), where xi is the "importance" of term i
Measure relevance by the distance between the query vector and the document vector in this space
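The representation step can be sketched as follows. This toy example (vocabulary and document text invented for illustration) uses raw term counts as the "importance" values:

```python
def to_vector(text, vocabulary):
    """Represent a document as raw term counts over a fixed vocabulary.
    Each vocabulary term is one dimension of the vector."""
    tokens = text.lower().split()
    return [tokens.count(term) for term in vocabulary]

vocab = ["java", "microsoft", "starbucks"]
doc = "java at starbucks with more java"
print(to_vector(doc, vocab))  # [2, 0, 1]
```

Tokens outside the vocabulary are simply ignored; later slides replace the raw counts with TF-IDF weights.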
VS Model: illustration
[Figure: documents D1–D11 and a query plotted in a 3-D space with axes Java, Microsoft, Starbucks; documents whose vectors lie near the query vector are the candidate answers]
Vector Space Documents and Queries
[Figure: documents D1–D11 plotted over terms t1, t2, t3, with Boolean term combinations; the query Q is also represented as a vector]
Some Issues with the VS Model
There is no consistent definition of what a basic concept (term) is
How to assign weights to words is not fully determined; a weight in the query indicates the importance of a term
How to Assign Weights?
Different terms have different importance in a text, so the term weighting scheme plays an important role in the similarity measure: a higher weight means a greater impact. We now turn to the question of how to weight words in the vector space model.
There are three components in a weighting scheme:
gi: the global weight of the i-th term
tij: the local weight of the i-th term in the j-th document
dj: the normalization factor for the j-th document
Weighting
Two basic heuristics:
TF (Term Frequency) = within-document frequency
IDF (Inverse Document Frequency)
Plus TF normalization
TF Weighting
Idea: a term is more important if it occurs more frequently in a document.
Formulas (let f(t,d) be the frequency count of term t in document d):
Raw TF: TF(t,d) = f(t,d)
Log TF: TF(t,d) = log f(t,d)
Maximum frequency normalization: TF(t,d) = 0.5 + 0.5 * f(t,d) / MaxFreq(d)
Normalization of TF is very important!
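The three TF formulas above can be written out directly. A small sketch (the sample frequencies are invented; the slide does not fix the log base, so the natural log is assumed here):

```python
import math

def tf_raw(f):
    return f                       # Raw TF: TF(t,d) = f(t,d)

def tf_log(f):
    return math.log(f) if f > 0 else 0.0   # Log TF: TF(t,d) = log f(t,d)

def tf_max_norm(f, max_f):
    # Maximum frequency normalization: 0.5 + 0.5 * f(t,d) / MaxFreq(d)
    return 0.5 + 0.5 * f / max_f

f, max_f = 4, 8   # term occurs 4 times; most frequent term occurs 8 times
print(tf_raw(f), round(tf_log(f), 3), tf_max_norm(f, max_f))
```

Note how the log and max-normalized variants dampen the effect of very frequent terms, which is exactly why TF normalization matters.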
TF Methods
IDF Weighting
Idea: a term is more discriminative if it occurs in fewer documents.
Formula: IDF(t) = 1 + log(n/k)
n: total number of documents
k: number of documents containing term t (document frequency)
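The IDF formula above is a one-liner; the sketch below (with invented collection sizes) shows how a rare term is rewarded over a common one:

```python
import math

def idf(n_docs, doc_freq):
    """IDF(t) = 1 + log(n/k): the fewer documents a term occurs in,
    the more discriminative (and the higher weighted) it is."""
    return 1 + math.log(n_docs / doc_freq)

# A term appearing in 1 of 1000 docs vs. one appearing in 900 of 1000 docs.
print(round(idf(1000, 1), 2), round(idf(1000, 900), 2))
```

A term occurring in every document gets IDF(t) = 1 + log(1) = 1, the minimum under this formula.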
IDF Weighting Methods
TF Normalization
Why? Document length varies, and "repeated occurrences" of a term are less informative than the "first occurrence".
Two views of document length:
A document is long because it uses more words
A document is long because it has more content
Generally we penalize long documents, but we must avoid over-penalizing them.
TF-IDF Weighting
TF-IDF weighting: weight(t,d) = TF(t,d) * IDF(t)
Common in the document → high TF → high weight
Rare in the collection → high IDF → high weight
Imagine a word-count profile: what kinds of terms would have high weights?
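Combining the two heuristics, the product weight(t,d) = TF(t,d) * IDF(t) is highest for terms that are frequent in the document but rare in the collection. A sketch with invented numbers (raw TF and the IDF = 1 + log(n/k) variant from the previous slide):

```python
import math

def tf_idf(f_td, n_docs, doc_freq):
    """weight(t,d) = TF(t,d) * IDF(t), using raw TF and IDF(t) = 1 + log(n/k)."""
    return f_td * (1 + math.log(n_docs / doc_freq))

# Frequent in the doc (tf = 5) and rare in the collection (1 of 1000 docs):
print(round(tf_idf(5, 1000, 1), 2))    # high weight
# Equally frequent in the doc but common in the collection (900 of 1000 docs):
print(round(tf_idf(5, 1000, 900), 2))  # much lower weight
```

This answers the slide's question: terms with a high weight are content words that characterize this document against the rest of the collection; function words (high df) and incidental words (low tf) both score low.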
How to Measure Similarity?
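The content of this slide is not preserved in the transcript; the two measures used elsewhere in the deck are the dot product (slide "VS Example") and cosine similarity (slide "Improving the VS Model"). A sketch of cosine similarity, with the vector values invented for illustration:

```python
import math

def cosine_sim(u, v):
    """Cosine of the angle between two term-weight vectors:
    dot(u, v) / (|u| * |v|). Length-normalized, unlike the raw dot product."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

q = [2.4, 4.5, 0.0]   # query weights over a 3-term vocabulary
d = [2.4, 4.5, 3.3]   # document weights over the same vocabulary
print(round(cosine_sim(q, d), 3))
```

Because cosine divides by the vector lengths, a long document does not win just by repeating terms, which connects back to the TF normalization discussion.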
VS Example: Raw TF & Dot Product
query = "information retrieval"
doc1 = "information retrieval search engine information"
doc2 = "travel information map travel"
doc3 = "government president congress"
IDF (fake): Info. 2.4, Retrieval 4.5, Travel 2.8, Map 3.3, Search 2.1, Engine 5.4, Govern. 2.2, President 3.2, Congress 4.3
Vectors, entries written tf(tf*idf):
Doc1: Info. 2(4.8), Retrieval 1(4.5), Search 1(2.1), Engine 1(5.4)
Doc2: Info. 1(2.4), Travel 2(5.6), Map 1(3.3)
Doc3: Govern. 1(2.2), President 1(3.2), Congress 1(4.3)
Query: Info. 1(2.4), Retrieval 1(4.5)
Sim(q,doc1) = 4.8*2.4 + 4.5*4.5
Sim(q,doc2) = 2.4*2.4
Sim(q,doc3) = 0
Example
Q: "gold silver truck"
D1: "Shipment of gold damaged in a fire"
D2: "Delivery of silver arrived in a silver truck"
D3: "Shipment of gold arrived in a truck"
dfj: document frequency of the j-th term
Inverse document frequency: idfj = log10(n / dfj)
tf * idf is used as the term weight here.
Example (Cont'd)
Id  Term      df  idf
1   a         3   0
2   arrived   2   0.176
3   damaged   1   0.477
4   delivery  1   0.477
5   fire      1   0.477
6   gold      2   0.176
7   in        3   0
8   of        3   0
9   silver    1   0.477
10  shipment  2   0.176
11  truck     2   0.176
Example (Cont'd)
Using tf * idf weights and the dot product as the similarity coefficient (SC):
SC(Q, D1) = (0)(0) + (0)(0) + (0)(0.477) + (0)(0) + (0)(0.477) + (0.176)(0.176) + (0)(0) + (0)(0) = 0.031
SC(Q, D2) = 0.486
SC(Q, D3) = 0.062
The ranking would be D2, D3, D1.
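The whole worked example can be reproduced end to end. The following sketch (not part of the original slides) rebuilds the df table, the tf*idf weights with idf = log10(n/df), and the dot-product scores:

```python
import math

def idf(n, df):
    return math.log10(n / df) if df else 0.0

docs = {
    "D1": "shipment of gold damaged in a fire",
    "D2": "delivery of silver arrived in a silver truck",
    "D3": "shipment of gold arrived in a truck",
}
query = "gold silver truck"

# Document frequencies over the 3-document collection.
vocab = sorted({t for text in docs.values() for t in text.split()})
df = {t: sum(t in text.split() for text in docs.values()) for t in vocab}
n = len(docs)

def weights(text):
    """tf * idf weight for every vocabulary term."""
    tokens = text.split()
    return {t: tokens.count(t) * idf(n, df[t]) for t in vocab}

qw = weights(query)
scores = {}
for name, text in docs.items():
    dw = weights(text)
    scores[name] = sum(qw[t] * dw[t] for t in vocab)   # dot product

print({name: round(s, 3) for name, s in scores.items()})
# {'D1': 0.031, 'D2': 0.486, 'D3': 0.062} -> ranking D2, D3, D1
```

D2 wins mainly through "silver": it is the rarest query term (df = 1, idf = 0.477) and occurs twice in D2, so both the query weight and the document weight are high.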
Advantages of VS Model
Empirically effective (top TREC performance)
Intuitive
Easy to implement
Well studied and widely evaluated: the SMART system, developed at Cornell (1960–1999), is still widely used
Warning: there are many variants of TF-IDF!
Disadvantages of VS Model
Assumes term independence
Assumes the query and the document can be represented the same way
Requires lots of parameter tuning!
Improving the VS Model
We can improve the model by:
Reducing the number of dimensions: eliminate all stop words and very common terms; stem terms to their roots; apply Latent Semantic Analysis
Not retrieving documents below a defined cosine threshold
Normalizing frequencies [1]: the normalized frequency of term i in document j is tfij = freqij / maxk(freqkj), and query frequencies are normalized analogously
Stop List
Function words do not carry useful information for IR: of, not, to, or, in, about, with, I, be, …
A stop list contains stop words, which are not to be used as index terms: prepositions, articles, pronouns, some adverbs and adjectives, and some very frequent words (e.g., "document")
Removing stop words usually improves IR effectiveness; a few "standard" stop lists are commonly used.
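Stop-word removal is a simple set-membership filter. A sketch with a toy stop list built from the words listed above (real systems use one of the "standard" lists the slide mentions):

```python
# Toy stop list; real lists (e.g., the SMART list) contain hundreds of words.
STOP_WORDS = {"of", "not", "to", "or", "in", "about", "with", "i", "be", "a", "the"}

def remove_stop_words(text):
    """Drop function words before indexing; content words remain."""
    return [t for t in text.lower().split() if t not in STOP_WORDS]

print(remove_stop_words("Shipment of gold in a truck"))
# ['shipment', 'gold', 'truck']
```

Besides improving effectiveness, this shrinks the index: the discarded words are precisely the ones with near-zero IDF.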
Stemming
Reason: different word forms may bear similar meaning (e.g., search, searching), so we create a "standard" representation for them.
Stemming: removing some endings of words, e.g., dancer, dancers, dance, danced, dancing → dance
Stemming (Cont'd)
Two main methods:
Linguistic/dictionary-based stemming: high stemming accuracy and higher coverage, but high implementation and processing costs
Porter-style stemming: lower stemming accuracy and lower coverage, but lower implementation and processing costs; usually sufficient for IR
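A crude suffix-stripping sketch in the spirit of Porter-style stemming (this is NOT the real Porter algorithm, which applies ordered rule phases with a syllable-measure condition; the suffix list here is invented for illustration):

```python
def toy_stem(word):
    """Strip one common suffix, leaving a stem of at least 3 characters.
    Stems need not be dictionary words: all 'dance' forms map to 'danc'."""
    for suffix in ("ing", "ers", "ed", "er", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ("dancer", "dancers", "dance", "danced", "dancing"):
    print(w, "->", toy_stem(w))   # all map to "danc"
```

As with the real Porter stemmer, the stem ("danc") is not itself a word; what matters for IR is only that all variants conflate to the same index term.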
Latent Semantic Indexing (LSI) [3]
Reduces the dimensionality of the term-document space
Attempts to solve the synonymy and polysemy problems
Uses Singular Value Decomposition (SVD) to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text
Based on the principle that words used in the same contexts tend to have similar meanings
LSI Process
In general, the process involves:
constructing a weighted term-document matrix
performing a Singular Value Decomposition on the matrix
using the resulting matrices to identify the concepts contained in the text
LSI statistically analyzes the patterns of word usage across the entire document collection.
LSI computes the term and document vector spaces by decomposing the single term-frequency matrix A into three other matrices:
a term-concept vector matrix T,
a singular values matrix S,
a concept-document vector matrix D,
which satisfy A = T S D^T.
SVD is useful because it finds a reduced-dimensional representation of the matrix that emphasizes the strongest relationships and throws away the noise.
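A minimal numerical sketch of this decomposition using NumPy's SVD (the 4-term x 3-document count matrix is invented for illustration):

```python
import numpy as np

# Toy term-document matrix A (rows = terms, columns = documents).
A = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])

# A = T S D^T: term-concept matrix T, singular values s, concept-document D^T.
T, s, Dt = np.linalg.svd(A, full_matrices=False)

# Keep only the k strongest concepts; the discarded small singular
# values carry the "noise" the text describes.
k = 2
A_k = T[:, :k] @ np.diag(s[:k]) @ Dt[:k, :]

# A_k is the best rank-k least-squares approximation of A.
print(np.round(A_k, 2))
```

Queries and documents are then compared in the k-dimensional concept space rather than the original term space, which is how LSI conflates synonymous terms.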
What is noise in a document?
Authors have a wide choice of words available when they write, so concepts can be obscured by different word choices from different authors. This essentially random choice of words introduces noise into the word-concept relationship. Latent Semantic Analysis filters out some of this noise and also attempts to find the smallest set of concepts that spans all the documents.
References
Introduction to Information Retrieval, by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schuetze
https://wiki.cse.yorku.ca/course_archive/2010-11/F/6390/_media/2.pdf
https://wiki.cse.yorku.ca/course_archive/2010-11/F/6390/_media/ir4up.pdf
https://wiki.cse.yorku.ca/course_archive/2010-11/F/6390/_media/e09-3009.pdf
https://wiki.cse.yorku.ca/course_archive/2010-11/F/6390/_media/07models-vsm.pdf
https://wiki.cse.yorku.ca/course_archive/2010-11/F/6390/_media/03vectorspaceimplementation-6per.pdf
https://wiki.cse.yorku.ca/course_archive/2011-12/F/6339/_media/lecture02.ppt
https://wiki.cse.yorku.ca/course_archive/2011-12/F/6339/_media/vector_space_model-updated.ppt
https://wiki.cse.yorku.ca/course_archive/2011-12/F/6339/_media/lecture_13_ir_and_vsm_.ppt
Document Classification based on Wikipedia Content, http://www.iicm.tugraz.at/cguetl/courses/isr/opt/classification/Vector_Space_Model.html?timestamp=1318275702299
Thanks For Your Attention!