VECTOR SPACE INFORMATION RETRIEVAL
Adrienn Skrop
VIR  The Vector (or vector space) model of IR (VIR) is a classical model of IR.  It is  traditionally  called so because both the documents and queries are conceived, for retrieval purposes, as strings of numbers as though they were (mathematical) vectors.  Retrieval is based on whether the 'query vector' and the 'document vector' are 'close enough'. 2Adrienn Skrop

Notations

- Given a finite set D of elements called documents: D = {D_1, ..., D_j, ..., D_m},
- and a finite set T of elements called index terms: T = {t_1, ..., t_i, ..., t_n}.
- Any document D_j is assigned a vector v_j of finite real numbers, called weights, of length n, as follows:
  v_j = (w_ij), i = 1, ..., n, i.e. v_j = (w_1j, ..., w_ij, ..., w_nj),
  where 0 ≤ w_ij ≤ 1 (i.e. w_ij is normalised, e.g. by division by the largest weight).
- The weight w_ij is interpreted as the extent to which the term t_i 'characterises' document D_j.

Identifiers

- The choice of terms and weights is a difficult theoretical (e.g. linguistic, semantic) and practical problem, and several techniques can be used to cope with it.
- The VIR model assumes that the most obvious place where appropriate content identifiers might be found is the documents themselves.
- It is assumed that the frequencies of words in documents give a meaningful indication of their content, hence they can be taken as identifiers.

Word significance factors

Steps of a simple automatic method to identify word significance factors:

1. Identify lexical units. Write a computer program to recognise words (word = any sequence of characters preceded and followed by 'space', 'dot' or 'comma').
2. Exclude stopwords from the documents, i.e. those words that are unlikely to bear any significance (for example: a, about, and, etc.).

Word significance factors

3. Apply stemming to the remaining words, i.e. reduce or transform them to their linguistic roots. A widely used stemming algorithm is the well-known Porter's algorithm. In practice, the index terms are the stems.
4. Compute for each document D_j the number of occurrences f_ij of each term t_i in that document.
5. Calculate the total tf_i for each term t_i as follows:
   tf_i = Σ_{j=1..m} f_ij

Word significance factors

6. Rank the terms in decreasing order according to tf_i, and exclude the very high frequency terms, e.g. over some threshold (on the ground that they are almost always insignificant), and the very low frequency terms as well, e.g. below some threshold (on the ground that they are not much on the writer's mind).

Word significance factors

7. The remaining terms can be taken as document identifiers; they will be called index terms (or simply terms).
8. Compute for each index term t_i a significance factor (or weight) w_ij with respect to each document D_j. Several methods are known.
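A simple version of steps 1-6 can be sketched as follows (the tiny stoplist, the toy stemmer standing in for Porter's algorithm, and the frequency thresholds are all illustrative):

```python
import re
from collections import Counter

# Tiny illustrative stoplist; a real one has hundreds of entries.
STOPWORDS = {"a", "about", "and", "the", "that", "in", "one",
             "should", "each", "has", "is", "of", "to", "which"}

def stem(word):
    # Crude suffix stripping; merely a stand-in for Porter's algorithm.
    for suffix in ("ing", "ian", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def index_terms(documents, low=1, high=10):
    """Steps 1-6: tokenise, drop stopwords, stem, count f_ij per
    document, total the tf_i, and keep terms with low <= tf_i <= high."""
    per_doc = []
    for text in documents:
        words = re.findall(r"[a-z']+", text.lower())           # step 1: lexical units
        stems = [stem(w.strip("'")) for w in words
                 if w not in STOPWORDS]                        # steps 2-3
        per_doc.append(Counter(stems))                         # step 4: f_ij
    totals = Counter()
    for counts in per_doc:
        totals.update(counts)                                  # step 5: tf_i
    terms = {t for t, tf in totals.items() if low <= tf <= high}  # step 6
    return terms, per_doc
```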

Query object

- An object Q_k, called a query, coming from a user, is also conceived as being a (much smaller) document, and a vector v_k can be computed for it, too, in a similar way.

VIR

VIR is defined as follows: a document D_j is retrieved in response to a query Q_k if the document and the query are "similar enough", i.e. a similarity measure s_jk between the document (identified by v_j) and the query (identified by v_k) is over some threshold K:

s_jk = s(v_j, v_k) > K

Steps of VIR

- Define the query Q_k.
- Compute for the index terms t_i a significance factor (or weight) w_ik with respect to the query Q_k, in the same way as for the documents D_j.
- Compute the similarity values for the documents D_j.
- Give the hit list of the retrieved documents, in decreasing order of the similarity values.
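The steps above amount to a small ranking routine. A sketch, using the cosine measure as the similarity s and an illustrative threshold K:

```python
import math

def cosine(v, w):
    # (v, w) / (||v|| * ||w||)
    dot = sum(a * b for a, b in zip(v, w))
    norm = math.sqrt(sum(a * a for a in v)) * math.sqrt(sum(b * b for b in w))
    return dot / norm if norm else 0.0

def retrieve(doc_vectors, query_vector, K=0.0):
    """Score every document vector against the query vector and return
    the hit list of (document index, similarity) pairs whose similarity
    exceeds K, in decreasing order of similarity."""
    hits = [(j, cosine(v, query_vector)) for j, v in enumerate(doc_vectors)]
    return sorted([h for h in hits if h[1] > K],
                  key=lambda h: h[1], reverse=True)
```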

Similarity measures

- Dot product (simple matching coefficient; inner product):
  s_jk = (v_j, v_k) = Σ_{i=1..n} w_ij · w_ik
- If D_j and Q_k are conceived as sets of terms, the set-theoretic counterpart of the simple matching coefficient is
  s_jk = |D_j ∩ Q_k|

Similarity measures

- Cosine measure c_jk:
  s_jk = c_jk = (v_j, v_k) / (||v_j|| · ||v_k||)
- If D_j and Q_k are conceived as sets of terms, the set-theoretic counterpart of the Cosine measure is
  c_jk = |D_j ∩ Q_k| / (|D_j| · |Q_k|)^(1/2)

Similarity measures

- Dice's coefficient d_jk:
  s_jk = d_jk = 2 · (v_j, v_k) / Σ_{i=1..n} (w_ij + w_ik)
- If D_j and Q_k are conceived as sets of terms, the set-theoretic counterpart of Dice's coefficient is
  d_jk = 2 · |D_j ∩ Q_k| / (|D_j| + |Q_k|)

Similarity measures

- Jaccard's coefficient J_jk:
  s_jk = J_jk = (v_j, v_k) / ( Σ_{i=1..n} (w_ij + w_ik) − Σ_{i=1..n} w_ij · w_ik )
- If D_j and Q_k are conceived as sets of terms, the set-theoretic counterpart of Jaccard's coefficient is
  J_jk = |D_j ∩ Q_k| / |D_j ∪ Q_k|

Similarity measures

- Overlap coefficient O_jk:
  s_jk = O_jk = (v_j, v_k) / min( Σ_{i=1..n} w_ij , Σ_{i=1..n} w_ik )
- If D_j and Q_k are conceived as sets of terms, the set-theoretic counterpart of the overlap coefficient is
  O_jk = |D_j ∩ Q_k| / min(|D_j|, |Q_k|)
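The five measures can be written down directly from the vector formulas (a sketch; the Dice and Jaccard denominators follow the unsquared sums used here, while some texts use squared weights instead):

```python
import math

def dot(v, w):
    # simple matching coefficient (v, w)
    return sum(a * b for a, b in zip(v, w))

def cosine(v, w):
    return dot(v, w) / (math.sqrt(dot(v, v)) * math.sqrt(dot(w, w)))

def dice(v, w):
    return 2 * dot(v, w) / sum(a + b for a, b in zip(v, w))

def jaccard(v, w):
    return dot(v, w) / (sum(a + b for a, b in zip(v, w)) - dot(v, w))

def overlap(v, w):
    return dot(v, w) / min(sum(v), sum(w))
```

For 0/1 (binary) vectors these reduce exactly to the set-theoretic counterparts, since then (v_j, v_k) = |D_j ∩ Q_k| and the weight sums become set sizes.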

Example

- Let the set of documents be D = {D_1, D_2, D_3}, where
- D_1 = Bayes' Principle: The principle that, in estimating a parameter, one should initially assume that each possible value has equal probability (a uniform prior distribution).
- D_2 = Bayesian Decision Theory: A mathematical theory of decision-making which presumes utility and probability functions, and according to which the act to be chosen is the Bayes act, i.e. the one with highest Subjective Expected Utility. If one had unlimited time and calculating power with which to make every decision, this procedure would be the best way to make any decision.

Example

- D_3 = Bayesian Epistemology: A philosophical theory which holds that the epistemic status of a proposition (i.e. how well proven or well established it is) is best measured by a probability and that the proper way to revise this probability is given by Bayesian conditionalisation or similar procedures. A Bayesian epistemologist would use probability to define, and explore the relationship between, concepts such as epistemic status, support or explanatory power.

Example

Step 1. Identify lexical units.

Original text:
Bayes' Principle: The principle that, in estimating a parameter, one should initially assume that each possible value has equal probability (a uniform prior distribution).

Output:
Bayes principle the principle that in estimating a parameter one should initially assume that each possible value has equal probability a uniform prior distribution

Example

Step 2. Exclude stopwords from the document.

Stoplist: a aboard about above accordingly across actually add added after afterwards again against ago all allows almost alone along alongside …

Example

Step 3. Apply stemming.

Text before and after stemming:

Before: Bayes principle estimating parameter one initially assume possible value
After: Bayes principl estim paramet on initi assum possibl valu …

Example

Step 4. Compute for each document D_j the number of occurrences f_ij of each term t_i in that document. E.g., for t_1 = Bayes:

f_11 = 1, f_12 = 2, f_13 = 3

Example

Step 5. Calculate the total tf_i for each term t_i. E.g., the number of occurrences of term t_1 = Bayes in all the documents:

tf_1 = 6

Example

Step 6. Rank the terms in decreasing order according to tf_i, and exclude the very high and very low frequency terms.

Example

Step 7. The remaining terms can be taken as index terms. Let the set T of index terms be (not stemmed here):

T = {t_1, t_2, t_3} = {t_1 = Bayes, t_2 = probability, t_3 = epistemology}

Conceive the documents as sets of terms: D = {D_1, D_2, D_3}, where

D_1 = {(t_1 = Bayes); (t_2 = probability)}
D_2 = {(t_1 = Bayes); (t_2 = probability)}
D_3 = {(t_1 = Bayes); (t_2 = probability); (t_3 = epistemology)}

Example

Conceive the documents as sets of terms, together with their frequencies: D = {D_1, D_2, D_3}, where

D_1 = {(Bayes, 1); (probability, 1); (epistemology, 0)}
D_2 = {(Bayes, 2); (probability, 1); (epistemology, 0)}
D_3 = {(Bayes, 3); (probability, 3); (epistemology, 3)}

Documents in vector space

[Figure: the three document vectors plotted in the term space spanned by t_1, t_2, t_3.]

Example

Step 8. Compute for each index term t_i a significance factor (or weight) w_ij with respect to each document D_j. Create the Term-Document matrix TD:

TD_3×3 = (w_ij), where w_ij denotes the weight of term t_i in document D_j

Example

Using the Frequency Weighting Method (w_ij = f_ij):

TD = ( 1  2  3
       1  1  3
       0  0  3 )

Using the Term Frequency Normalized Weighting Method (each f_ij divided by the total term count of D_j):

TD = ( 1/2  2/3  1/3
       1/2  1/3  1/3
       0    0    1/3 )
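The two weighting schemes can be sketched as one small function (assuming, as a plausible reading of "term frequency normalized", that each f_ij is divided by the total number of term occurrences in D_j; the function name is illustrative):

```python
def term_document_matrix(freqs, normalise=False):
    """freqs[i][j] = f_ij, occurrences of term t_i in document D_j.
    With normalise=True each column (document) is divided by that
    document's total term count, so the weights per document sum to 1."""
    n, m = len(freqs), len(freqs[0])
    if not normalise:
        return [row[:] for row in freqs]          # frequency weighting: w_ij = f_ij
    col_totals = [sum(freqs[i][j] for i in range(n)) for j in range(m)]
    return [[freqs[i][j] / col_totals[j] for j in range(m)]
            for i in range(n)]
```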

Example

Step 9. Define a query. Let the query Q be:

Q = {What is Bayesian epistemology?}

As a set of terms:

Q = {(t_1 = Bayes); (t_3 = epistemology)}

Notice that, in VIR, the query is no longer a Boolean expression (as it was in BIR).

Example

Step 10. Compute for the index terms t_i a significance factor (or weight) w_ik with respect to the query Q_k. Using the Frequency or the Binary Weighting Method:

Q_k = (1, 0, 1)

Query in vector space

[Figure: the query vector Q_k = (1, 0, 1) plotted in the same term space as the document vectors.]

Example

Step 11. Compute the similarity values, and give the ranking order. Using the Term Frequency Normalized (tfn) Weighting Method:

Q = (1, 0, 1)

TD = ( 1/2  2/3  1/3
       1/2  1/3  1/3
       0    0    1/3 )


Example

- Dot product: Dot_1 = 0.5, Dot_2 = …, Dot_3 = …
- Cosine measure: Cosine_1 = 0.5, Cosine_2 = …, Cosine_3 = …
- Dice measure: Dice_1 = …, Dice_2 = …, Dice_3 = …
- Jaccard measure: Jaccard_1 = 0.21, Jaccard_2 = 0.29, Jaccard_3 = 0.32
- The ranking order: D_3, D_2, D_1
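The recoverable figures can be checked in a few lines (a sketch assuming the tfn weights divide each f_ij by the document's total term count, which reproduces Dot_1 = 0.5 and Cosine_1 = 0.5; the slides do not show the exact normalisation):

```python
import math

# tfn-weighted document vectors, term order (Bayes, probability, epistemology),
# and the query vector from Step 10.
D = [(1/2, 1/2, 0.0), (2/3, 1/3, 0.0), (1/3, 1/3, 1/3)]
Q = (1.0, 0.0, 1.0)

def dot(v, w):
    return sum(a * b for a, b in zip(v, w))

def cosine(v, w):
    return dot(v, w) / (math.sqrt(dot(v, v)) * math.sqrt(dot(w, w)))

scores = {f"D{j + 1}": cosine(v, Q) for j, v in enumerate(D)}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['D3', 'D2', 'D1']
```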

Conclusion

- Both the query and the documents are represented as vectors of real numbers.
- A similarity measure is meant to express a likeness between a query and a document.
- Those documents are said to be retrieved in response to a query for which the similarity measure exceeds some threshold.