
Weighting and Matching against Indices

Zipf’s Law. In any corpus, such as the AIT, we can count how often each word occurs in the corpus as a whole: its word frequency, F(w). Now imagine that we’ve sorted the vocabulary by frequency, so that the most frequently occurring word has rank = 1, the next most frequent word has rank = 2, and so on. Zipf (1949) found the following empirical relation: F(w) = C / rank(w)^α, where α ≈ 1 and C is a corpus-dependent constant. If α = 1, then rank × frequency is approximately constant.
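The rank × frequency relation above can be checked directly on any text. A minimal sketch (the tiny `corpus` string and the function name are illustrative, not from the slides):

```python
from collections import Counter

def zipf_table(text, top=5):
    """Rank words by frequency and report rank * frequency,
    which Zipf's law (with alpha ~ 1) predicts is roughly constant."""
    counts = Counter(text.lower().split())
    ranked = counts.most_common()
    return [(rank, word, freq, rank * freq)
            for rank, (word, freq) in enumerate(ranked, start=1)][:top]

corpus = "the cat sat on the mat and the dog sat on the log"
for rank, word, freq, product in zipf_table(corpus):
    print(rank, word, freq, product)
```

On a corpus this small the product fluctuates; on a real corpus such as the AIT the rank × frequency column flattens out noticeably.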

Consequences of lexical decisions on word frequencies. Noise words occur frequently. “External” keywords are also frequent: they tell you what the corpus is about, but do not help index individual documents. Zipf’s Law is seen with and without stemming:

Token      Frequency (stemmed)   Frequency (unstemmed)
the        78,428
of         50,026
and        33,834
a          31,347
to         28,666
in         21,512
SYSTEM     21,488                8,632
is         18,781
MODEL      14,772                4,796
for        14,640
NETWORK    10,306                3,965
this       10,095
BASE       9,838
that       9,820

Other applications of Zipf’s Law: number of unique visitors vs. rank of website; number of speakers of each language; prize money won by golfers; frequency of DNA codons; size of avalanches of grains of sand; frequency of English surnames.

Resolving Power (1). Luhn (1957): “It is hereby proposed that the frequency of word occurrence in an article furnishes a useful measurement of word significance.” If a word occurs in a document more frequently than we would expect, it reflects emphasis on the part of the author. But the raw frequency of occurrence in a document is only one of two critical statistics recommending good keywords. For example, almost every article in AIT contains the words ARTIFICIAL INTELLIGENCE.

Resolving Power (2). Thus we prefer keywords which discriminate between documents, i.e. keywords found only in some documents. Resolving power is this ability to discriminate content, and it is greatest for mid-frequency terms. Luhn did not provide a method of establishing the maximal and minimal occurrence thresholds. Simple methods: the frequency of stop-list words gives an upper limit, while words which appear only once can only index one document, giving a lower limit.
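Luhn's upper and lower thresholds can be approximated with simple document-frequency cut-offs. A sketch under assumed, illustrative thresholds (`min_df` and `max_df_ratio` are my placeholders, not values from the slides):

```python
def mid_frequency_terms(docs, min_df=2, max_df_ratio=0.5):
    """Keep terms that appear in at least min_df documents but in no
    more than max_df_ratio of the corpus -- a crude stand-in for
    Luhn's upper and lower significance thresholds."""
    n = len(docs)
    df = {}  # document frequency of each term
    for doc in docs:
        for term in set(doc.lower().split()):
            df[term] = df.get(term, 0) + 1
    return {t for t, d in df.items() if d >= min_df and d / n <= max_df_ratio}

docs = ["the cat sat", "the dog ran", "a cat ran", "the sun set"]
print(mid_frequency_terms(docs))
```

Here "the" is rejected as too common (an upper-limit noise word) and singletons like "sun" as too rare, leaving the mid-frequency discriminators.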

Exhaustivity and Specificity. An index is exhaustive if it includes many topics. An index is specific if users can precisely identify their information needs. Trade-off: high recall is easiest when an index is exhaustive but not very specific; high precision is best accomplished when the index is highly specific but not very exhaustive; the best index will strive for a balance. If a document is indexed with many keywords, it will be retrieved more often (“representation bias”): we can expect higher recall, but precision will suffer. We can also analyse the problem from a query-oriented perspective: how well do the query terms discriminate one document from another?

Weighting the Index Relation The simplest notion of an index is binary – either a keyword is associated with a document or it is not – but it is natural to imagine degrees of aboutness. We will use a single real number, a weight, capturing the strength of association between keyword and document. The retrieval method can exploit these weights directly.

Weighting (2). One way to describe what this weight means is probabilistic. We seek a measure of a document’s relevance, conditioned on the belief that a keyword is relevant: W_kd ∝ Pr(d relevant | k relevant). This is a directed relation: we may or may not believe that the symmetric relation W_dk ∝ Pr(k relevant | d relevant) is the same. Unless otherwise specified, when we speak of a weight W we mean W_kd.

Weighting (3). In order to compute statistical estimates for such probabilities we define several important quantities:
F_kd = number of occurrences of keyword k in document d
F_k = total number of occurrences of keyword k across the entire corpus
D_k = number of documents containing keyword k

Weighting (4). We will make two demands on the weight reflecting the degree to which a document is about a particular keyword or topic. 1. Repetition is an indicator of emphasis: if an author uses a word frequently, it is because he or she thinks it is important (F_kd). 2. A keyword must be a useful discriminator within the context of the corpus; capturing this notion statistically is more difficult, so for now we just give it the name discrim_k. Because we care about both, our weight depends on the two factors: W_kd ∝ F_kd × discrim_k. Various index weighting schemes exist: they all use F_kd, but differ in how they quantify discrim_k.

Inverse document frequency (IDF). Karen Sparck Jones argued that, from a discrimination point of view, we need to know the number of documents which contain a particular word. The value of a keyword varies inversely with the log of the number of documents in which it occurs: W_kd = F_kd × [log(NDoc / D_k) + 1], where NDoc is the total number of documents in the corpus. Variations on this formula exist.
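The weighting formula above is a one-liner in code. A sketch of exactly the slide's variant (other TF-IDF variants differ in smoothing and log base):

```python
import math

def tfidf_weight(F_kd, NDoc, D_k):
    """Sparck Jones weighting from the slide:
    W_kd = F_kd * (log(NDoc / D_k) + 1)."""
    return F_kd * (math.log(NDoc / D_k) + 1)

# A term occurring 3 times in a document, in a corpus of 100 documents
# where 10 documents contain the term:
w = tfidf_weight(3, 100, 10)
```

Note the "+ 1": a term that occurs in every document (D_k = NDoc) still receives its raw frequency as weight rather than dropping to zero.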

Vector Space Model (1). In a library, closely related books are physically close together in three-dimensional space. Search engines consider the abstract notion of semantic space, where documents about the same topic remain close together. We will consider abstract spaces of thousands of dimensions. We start with the index matrix relating each document in the corpus to all of its keywords. Each and every keyword of the vocabulary is a separate dimension of a vector space, so the dimensionality of the vector space is the size of our vocabulary.

Vector Space Model (2). In addition to the vectors representing the documents, another vector corresponds to a query. Because documents and queries exist within a common vector space, we seek those documents that are close to our query vector. A simple (unnormalised) measure of proximity is the inner (or “dot”) product of query and document vectors: Sim(q, d) = q · d = Σ_k q_k × d_k.
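The inner product is just a componentwise multiply-and-sum over the shared keyword dimensions. A minimal sketch (the example vectors are illustrative; the slide's original numeric example did not survive transcription):

```python
def dot_similarity(q, d):
    """Unnormalised inner-product similarity between a query vector
    and a document vector expressed over the same keyword dimensions."""
    return sum(qk * dk for qk, dk in zip(q, d))

# e.g. a 3-keyword vocabulary:
score = dot_similarity([1, 0, 2], [3, 4, 5])  # 1*3 + 0*4 + 2*5 = 13
```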

Vector Length Normalisation. We want to make weights sensitive to document length. Using the dot product alone, longer documents, containing more words (more verbose), are more likely to match the query than shorter ones, even if the “scope” (amount of actual information covered) is the same. One solution is to use the cosine measure of similarity.
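The cosine measure divides the dot product by the product of the two vector lengths, so a document cannot raise its score simply by being longer. A minimal sketch:

```python
import math

def cosine_similarity(q, d):
    """Cosine of the angle between query and document vectors:
    dot(q, d) / (|q| * |d|). Scaling either vector leaves it unchanged."""
    dot = sum(qk * dk for qk, dk in zip(q, d))
    norm = math.sqrt(sum(x * x for x in q)) * math.sqrt(sum(x * x for x in d))
    return dot / norm if norm else 0.0
```

Doubling every component of a document vector (a twice-as-verbose document with the same term proportions) leaves its cosine similarity to any query unchanged, which is exactly the length insensitivity the slide asks for.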

Summary. Zipf’s law: frequency × rank ≈ constant. Resolving power of keywords: TF × IDF. Exhaustivity vs. specificity. The vector space model. The cosine similarity measure.