Computational Linguistics Course. Instructor: Professor Cercone. Presenter: Morteza Zihayat. Information Retrieval and Vector Space Model.



Outline  Introduction to IR  IR System Architecture  Vector Space Model (VSM)  How to Assign Weights?  TF-IDF Weighting  Example  Advantages and Disadvantages of the VS Model  Improving the VS Model


Introduction to IR
The world's total yearly production of unique information stored in the form of print, film, optical, and magnetic content would require roughly 1.5 billion gigabytes of storage. This is the equivalent of 250 megabytes for each man, woman, and child on earth. (Lyman & Varian, 2000)

Growth of textual information
How can we help manage and exploit all the information? Sources include literature, the WWW, desktop files, news, intranets, and blogs.

Information overflow

What is Information Retrieval (IR)?
 Narrow sense: IR = search-engine technology (IR = Google, library information systems); IR = text matching/classification
 Broad sense: IR = text information management. The general problem: how do we manage text information?
How to find useful information? (retrieval) Example: Google
How to organize information? (text classification) Example: automatically assigning emails to different folders
How to discover knowledge from text? (text mining) Example: discovering correlations between events


Formalizing IR Tasks
 Vocabulary: V = {w1, w2, …, wT} of a language
 Query: q = q1 q2 … qm, where each qi ∈ V
 Document: di = di1 di2 … dimi, where each dij ∈ V
 Collection: C = {d1, d2, …, dN}
 Relevant document set: R(q) ⊆ C, generally unknown and user-dependent
 The query provides a "hint" about which documents should be in R(q)
 IR: find an approximation R'(q) of the relevant document set
Source: This slide is borrowed from [1]

Evaluation measures
 The quality of a retrieval system depends on how well it ranks relevant documents.
 How can we evaluate rankings in IR? IR researchers have developed evaluation measures specifically designed to evaluate rankings. Most of these measures combine precision and recall in a way that takes the ranking into account.

Precision & Recall
Source: This slide is borrowed from [1]

 In other words:
 Precision is the fraction of the returned set that is relevant
 Recall is the fraction of all relevant documents in the collection that appears in the returned set
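These two definitions can be computed directly from the returned set and the relevant set; the sketch below uses made-up document IDs for illustration:

```python
# Precision and recall for one retrieval run (document IDs are illustrative).

def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set vs. the true relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# 2 of the 4 returned docs are relevant; 2 of the 3 relevant docs were returned.
p, r = precision_recall(retrieved=["d1", "d2", "d3", "d4"], relevant=["d2", "d4", "d7"])
print(p, r)  # 0.5 0.666...
```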

Evaluating Retrieval Performance
Source: This slide is borrowed from [1]

IR System Architecture
[Diagram: the user submits a query through the INTERFACE; INDEXING builds a document representation (Doc Rep) from the docs; SEARCHING matches the query representation (Query Rep) against the index and produces ranked results; QUERY MODIFICATION uses the user's relevance judgments as feedback to refine the query.]

Indexing
Documents are indexed in four steps: break documents into words, remove stop-list words, stem the remaining words, and construct the index.
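The four steps above can be sketched as a toy pipeline; the stop list and the suffix rules here are illustrative stand-ins for a real stop list and a real (e.g., Porter) stemmer:

```python
# Minimal indexing pipeline: tokenize, drop stop words, crude stemming,
# then build an inverted index mapping each term to the documents containing it.
import re
from collections import defaultdict

STOP_WORDS = {"a", "of", "in", "the", "to"}  # illustrative stop list

def stem(word):
    # Toy stemmer: strip a few common endings (a real system would use Porter).
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def index_documents(docs):
    inverted = defaultdict(set)  # term -> set of doc ids
    for doc_id, text in docs.items():
        for token in re.findall(r"[a-z]+", text.lower()):
            if token not in STOP_WORDS:
                inverted[stem(token)].add(doc_id)
    return inverted

index = index_documents({
    "d1": "Shipment of gold damaged in a fire",
    "d2": "Delivery of silver arrived in a silver truck",
})
print(sorted(index["silver"]))  # ['d2']
```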

Searching
 Given a query, score documents efficiently
 The basic question: given a query, how do we know that document A is more relevant than document B? Perhaps document A uses more query words than document B, or word usage in document A is more similar to that in the query, …
 We need a way to compute the relevance of a document to a query

The Notion of Relevance
[Diagram: a taxonomy of retrieval models. Relevance can be modeled (1) as similarity between representations, Δ(Rep(q), Rep(d)), giving the vector space model (Salton et al., 75) and the probabilistic distribution model (Wong & Yao, 89); (2) as a probability of relevance P(r=1|q,d), r ∈ {0,1}, giving, via document generation, the classical probabilistic model (Robertson & Sparck Jones, 76) and regression models (Fox, 83), and, via query generation, the language-modeling approach (Ponte & Croft, 98; Lafferty & Zhai, 01a); or (3) by probabilistic inference P(d → q) or P(q → d), giving the probabilistic concept space model (Wong & Yao, 95) and the inference network model (Turtle & Croft, 91). Today's lecture takes the similarity view.]

Relevance = Similarity
 Assumptions: queries and documents are represented similarly, so a query can be regarded as a "document", and Relevance(d,q) ≈ similarity(d,q)
 R(q) = {d ∈ C | f(d,q) > θ}, where f(q,d) = Δ(Rep(q), Rep(d))
 Key issues: how to represent a query/document? (the Vector Space Model) How to define the similarity measure Δ?


Vector Space Model (VSM)
 The vector space model is one of the most widely used models for ad hoc retrieval
 Used in information filtering, information retrieval, indexing, and relevance ranking.

VSM
 Represent a document/query by a term vector. A term is a basic concept, e.g., a word or phrase; each term defines one dimension, so N terms define an N-dimensional space. E.g., d = (x1, …, xN), where xi is the "importance" of term i
 Measure relevance by the distance between the query vector and the document vector in this space
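A minimal sketch of this representation, taking raw counts as the "importance" values (weighting schemes come later) over a hand-picked toy vocabulary:

```python
# Turning a document and a query into term vectors over a fixed vocabulary.
VOCAB = ["information", "retrieval", "travel", "map"]  # toy vocabulary

def to_vector(text):
    tokens = text.lower().split()
    return [tokens.count(term) for term in VOCAB]

doc = to_vector("information travel information map travel")
query = to_vector("information retrieval")
print(doc)    # [2, 0, 2, 1]
print(query)  # [1, 1, 0, 0]
```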

VS Model: Illustration
[Figure: documents D1–D11 and a query plotted in a three-dimensional term space with axes Java, Microsoft, and Starbucks; documents close to the query vector are the candidate answers.]

Vector Space Documents and Queries
[Figure: documents D1–D11 plotted in the space of terms t1, t2, t3, with Boolean term combinations marked; the query Q is also represented as a vector in the same space.]

Some Issues with the VS Model
 There is no unique definition of the basic "term" unit
 There is no single prescribed way to assign weights to words; the weight of a term in a query indicates its importance


How to Assign Weights?
 Different terms have different importance in a text
 The term-weighting scheme plays an important role in the similarity measure: a higher weight means a greater impact
 We now turn to the question of how to weight words in the vector space model.

 A weighting scheme has three components: gi, the global weight of the ith term; tij, the local weight of the ith term in the jth document; and dj, the normalization factor for the jth document, so that the overall weight combines all three (typically gi · tij / dj).

Weighting
Two basic heuristics: TF (term frequency, the within-document frequency) and IDF (inverse document frequency), plus TF normalization.


TF Weighting
 Idea: a term is more important if it occurs more frequently in a document
 Formulas, where f(t,d) is the frequency count of term t in document d:
Raw TF: TF(t,d) = f(t,d)
Log TF: TF(t,d) = log f(t,d)
Maximum-frequency normalization: TF(t,d) = α + (1 − α) · f(t,d)/MaxFreq(d)
 Normalization of TF is very important!
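The three variants can be sketched as follows; the normalization constant did not survive the transcript, so α = 0.5 (a common choice) is assumed here:

```python
import math

# Three TF variants, with f = f(t,d) the raw count of term t in document d.

def raw_tf(f):                 # TF(t,d) = f(t,d)
    return f

def log_tf(f):                 # TF(t,d) = log f(t,d); 0 when the term is absent
    return math.log(f) if f > 0 else 0.0

def max_norm_tf(f, max_f, alpha=0.5):
    # TF(t,d) = alpha + (1 - alpha) * f(t,d) / MaxFreq(d); alpha is an assumption
    return alpha + (1 - alpha) * f / max_f if f > 0 else 0.0

print(max_norm_tf(3, max_f=6))  # 0.75
```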

TF Methods

IDF Weighting
 Idea: a term is more discriminative if it occurs in fewer documents
 Formula: IDF(t) = 1 + log(n/k), where n is the total number of documents and k is the number of documents containing term t (the document frequency)

IDF Weighting Methods

TF Normalization
 Why? Document length varies, and "repeated occurrences" are less informative than the "first occurrence"
 Two views of document length: a document may be long because it uses more words, or because it has more content
 Generally we penalize long documents, but we must avoid over-penalizing them

TF-IDF Weighting
 TF-IDF weighting: weight(t,d) = TF(t,d) × IDF(t). A term common in the document has high TF and hence high weight; a term rare in the collection has high IDF and hence high weight
 Imagine a word-count profile: what kinds of terms would have high weights?
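A sketch of the combined weight, using IDF(t) = 1 + log(n/k) from the earlier slide; all counts here are illustrative:

```python
import math

# TF-IDF as defined above: weight(t,d) = TF(t,d) * IDF(t), IDF(t) = 1 + log(n/k).

def idf(n_docs, doc_freq):
    return 1 + math.log(n_docs / doc_freq)

def tf_idf(tf, n_docs, doc_freq):
    return tf * idf(n_docs, doc_freq)

# A term occurring 3 times in a doc but in only 10 of 1000 docs outweighs
# a term occurring 5 times but present in 900 of 1000 docs.
rare = tf_idf(3, 1000, 10)     # 3 * (1 + ln 100)  ~ 16.8
common = tf_idf(5, 1000, 900)  # 5 * (1 + ln 1.11) ~ 5.5
print(rare > common)  # True
```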

How to Measure Similarity?
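The formula on this slide did not survive the transcript; the two standard choices in the vector space model are the dot product and the cosine, sketched here:

```python
import math

# Dot-product and cosine similarity between two term vectors.

def dot(q, d):
    return sum(qi * di for qi, di in zip(q, d))

def cosine(q, d):
    norm_q = math.sqrt(dot(q, q))
    norm_d = math.sqrt(dot(d, d))
    return dot(q, d) / (norm_q * norm_d) if norm_q and norm_d else 0.0

print(cosine([1, 1, 0], [2, 0, 2]))  # 2 / (sqrt(2) * sqrt(8)) = 0.5
```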


VS Example: Raw TF & Dot Product
query = "information retrieval"
Terms with fake IDF values: Info. (2.4), Retrieval (4.5), Travel (2.1), Map (5.4), Search (2.8), Engine (3.3), Govern. (2.2), President (3.2), Congress (4.3). Entries below are tf (tf × IDF):
Query: Info. 1 (2.4), Retrieval 1 (4.5)
Doc1: Info. 2 (4.8), Retrieval 1 (4.5), Travel 1 (2.1), Map 1 (5.4)
Doc2: Info. 1 (2.4), Search 2 (5.6), Engine 1 (3.3)
Doc3: Govern. 1 (2.2), President 1 (3.2), Congress 1 (4.3)
Sim(q,doc1) = 4.8×2.4 + 4.5×4.5
Sim(q,doc2) = 2.4×2.4
Sim(q,doc3) = 0
(Document snippets from the original figure: "information retrieval search engine", "information travel information map travel", "government president congress".)

Example
Q: "gold silver truck"
D1: "Shipment of gold damaged in a fire"
D2: "Delivery of silver arrived in a silver truck"
D3: "Shipment of gold arrived in a truck"
dfj is the document frequency of the jth term; the inverse document frequency is idf = log10(n / dfj); tf × idf is used as the term weight here

Example (Cont'd)
Id  Term      df  idf
1   a         3   0
2   arrived   2   0.176
3   damaged   1   0.477
4   delivery  1   0.477
5   fire      1   0.477
6   gold      2   0.176
7   in        3   0
8   of        3   0
9   silver    1   0.477
10  shipment  2   0.176
11  truck     2   0.176

Example (Cont'd)
tf × idf is used here:
SC(Q, D1) = (0)(0) + (0)(0) + (0)(0.477) + (0)(0) + (0)(0.477) + (0.176)(0.176) + (0)(0) + (0)(0) ≈ 0.031
SC(Q, D2) ≈ 0.486
SC(Q, D3) ≈ 0.062
The ranking is D2, D3, D1. This SC uses the dot product.
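The whole example can be reproduced in a few lines, using "damaged" in D1 (matching the term table) and idf = log10(n/df) as above:

```python
import math

# Gold/silver/truck example end to end: tf*idf weights and the dot-product
# score, confirming the ranking D2 > D3 > D1.

docs = {
    "D1": "shipment of gold damaged in a fire",
    "D2": "delivery of silver arrived in a silver truck",
    "D3": "shipment of gold arrived in a truck",
}
query = "gold silver truck"

vocab = sorted({t for text in docs.values() for t in text.split()})
n = len(docs)
df = {t: sum(t in d.split() for d in docs.values()) for t in vocab}
idf = {t: math.log10(n / df[t]) for t in vocab}

def weights(text):
    tokens = text.split()
    return {t: tokens.count(t) * idf[t] for t in vocab}

q = weights(query)  # all query terms occur in the collection here
scores = {name: sum(q[t] * weights(text)[t] for t in vocab)
          for name, text in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['D2', 'D3', 'D1']
```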


Advantages of the VS Model
 Empirically effective (top TREC performance)
 Intuitive
 Easy to implement
 Well studied and extensively evaluated
 The SMART system, developed at Cornell, is still widely used
 Warning: many variants of TF-IDF exist!

Disadvantages of the VS Model
 Assumes terms are independent
 Assumes queries and documents are represented in the same way
 Requires a lot of parameter tuning!


Improving the VS Model
 We can improve the model by: reducing the number of dimensions (eliminating all stop words and very common terms, stemming terms to their roots, Latent Semantic Analysis) and not retrieving documents below a defined cosine threshold
 The normalized frequency of a term i in document j is given by [1] as the raw frequency divided by the maximum term frequency in that document, fij = freqij / maxl freqlj; query frequencies are normalized analogously.

Stop List
 Function words do not bear useful information for IR: of, not, to, or, in, about, with, I, be, …
 A stop list contains stop words, which are not used as index terms: prepositions, articles, pronouns, some adverbs and adjectives, and some overly frequent words (e.g., "document")
 The removal of stop words usually improves IR effectiveness
 A few "standard" stop lists are commonly used.

Stemming
◦ Reason: different word forms may bear similar meaning (e.g., search, searching), so we create a "standard" representation for them
◦ Stemming: removing some endings of words, e.g., dancer, dancers, dance, danced, dancing → dance

Stemming (Cont'd)
 Two main methods:
Linguistic/dictionary-based stemming: high stemming accuracy and higher coverage, but high implementation and processing costs
Porter-style stemming: lower stemming accuracy and lower coverage, but lower implementation and processing costs; usually sufficient for IR
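The accuracy trade-off can be seen with a Porter-style stemmer truncated to a handful of suffix rules (the rule list below is illustrative, not Porter's actual algorithm):

```python
# Cheap mechanical suffix stripping conflates related forms well, but also
# makes mistakes a dictionary-based stemmer would avoid.

def crude_stem(word):
    for suffix in ("ational", "ing", "ers", "er", "ed", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# All the 'dance' forms collapse to a single stem...
print({crude_stem(w) for w in ["dancer", "dancers", "dance", "danced", "dancing"]})
# ...but purely mechanical stripping also produces classic errors:
print(crude_stem("news"))  # 'new' -- wrongly conflated with 'new'
```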

Latent Semantic Indexing (LSI) [3]
 Reduces the dimensionality of the term-document space
 Attempts to solve the synonymy and polysemy problems
 Uses Singular Value Decomposition (SVD), which identifies patterns in the relationships between the terms and concepts contained in an unstructured collection of text
 Based on the principle that words used in the same contexts tend to have similar meanings.

LSI Process
 In general, the process involves: constructing a weighted term-document matrix, performing a Singular Value Decomposition on the matrix, and using the result to identify the concepts contained in the text
 LSI statistically analyzes the patterns of word usage across the entire document collection

LSI computes the term and document vector spaces by decomposing the term-frequency matrix A into three other matrices: a term-concept vector matrix T, a singular-values matrix S, and a concept-document vector matrix D, which satisfy A = TSD^T. SVD is useful because it finds a reduced-dimensional representation of the matrix that emphasizes the strongest relationships and throws away the noise.

 What is noise in a document? Authors have a wide choice of words available when they write, so the concepts can be obscured by different word choices from different authors. This essentially random choice of words introduces noise into the word-concept relationship. Latent Semantic Analysis filters out some of this noise and also attempts to find the smallest set of concepts that spans all the documents.
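The decomposition and rank-k truncation can be sketched with NumPy's SVD (assuming NumPy is available; the matrix values are made up, with two cleanly separated topics):

```python
import numpy as np

# LSI sketch: SVD of a small term-document count matrix A, then a rank-k
# "concept" reconstruction A_k = T_k S_k D_k^T.

A = np.array([
    [2, 1, 0, 0],   # "ship"
    [4, 2, 0, 0],   # "boat"
    [0, 0, 3, 1],   # "tree"
    [0, 0, 6, 2],   # "forest"
], dtype=float)

T, s, Dt = np.linalg.svd(A, full_matrices=False)   # A = T S D^T
k = 2
A_k = T[:, :k] @ np.diag(s[:k]) @ Dt[:k, :]        # keep the 2 strongest concepts

# Each topic here is exactly rank one, so rank 2 reconstructs A; with noisier
# data, the discarded small singular values carry much of the noise.
print(np.allclose(A, A_k))  # True
```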

References
 Introduction to Information Retrieval, by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schuetze
 updated.ppt
 12/F/6339/_media/lecture_13_ir_and_vsm_.ppt
 Document Classification based on Wikipedia Content, estamp=

Thanks For Your Attention…