Weighting and Matching against Indices
Zipf’s Law
In any corpus, such as the AIT collection, we can count how often each word occurs in the corpus as a whole: this is its word frequency, F(w). Now imagine that we have sorted the vocabulary by frequency, so that the most frequent word has rank 1, the next most frequent has rank 2, and so on. Zipf (1949) found the following empirical relation: F(w) = C / rank(w)^α, where α ≈ 1 and C is a corpus-dependent constant. If α = 1, then rank × frequency is approximately constant.
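The counting-and-ranking procedure above can be sketched in a few lines. This is a toy illustration, not the AIT corpus: the text below is invented, so the rank × frequency products are only suggestive of the Zipf pattern.

```python
# Minimal sketch of Zipf's law: count word frequencies in a toy corpus,
# rank them by descending frequency, and inspect rank * frequency.
# The corpus text here is hypothetical, not from the AIT collection.
from collections import Counter

text = """the cat sat on the mat the dog chased the cat
the mat was on the floor and the dog sat on the floor"""

freq = Counter(text.split())
ranked = freq.most_common()          # sorted by descending frequency

for rank, (word, f) in enumerate(ranked, start=1):
    print(f"{rank:>2}  {word:<7} F={f:<3} rank*F={rank * f}")
```

With a real corpus of millions of tokens, the rank*F column stays roughly constant over a wide middle range of ranks.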
Consequences of lexical decisions on word frequencies
Noise words occur frequently. “External” keywords are also frequent: they tell you what the corpus is about, but do not help index individual documents. Zipf’s Law is seen with and without stemming:

Token      Frequency (stemmed)   Frequency (unstemmed)
The        78,428
Of         50,026
And        33,834
A          31,347
To         28,666
In         21,512
SYSTEM     21,488                8,632
Is         18,781
MODEL      14,772                4,796
For        14,640
NETWORK    10,306                3,965
This       10,095
BASE        9,838
that        9,820
Other applications of Zipf’s Law:
- Number of unique visitors vs. rank of website
- Number of speakers of each language
- Prize money won by golfers
- Frequency of DNA codons
- Size of avalanches of grains of sand
- Frequency of English surnames
Resolving Power (1)
Luhn (1957): “It is hereby proposed that the frequency of word occurrence in an article furnishes a useful measurement of word significance.” If a word is found in a document frequently, more frequently than we would expect, it reflects emphasis on the part of the author: the word is significant for that document. But the raw frequency of occurrence within a document is only one of two critical statistics recommending good keywords. For example, almost every article in the AIT corpus contains the words ARTIFICIAL and INTELLIGENCE.
Resolving Power (2)
Thus we prefer keywords which discriminate between documents, i.e. which are found only in some documents. Resolving power is this ability to discriminate content, and it is concentrated in mid-frequency terms. Luhn did not provide a method for establishing the maximal and minimal occurrence thresholds. Simple methods exist: the frequency of stop-list words gives an upper limit, and words which appear only once can index only one document, suggesting a lower limit.
Exhaustivity and Specificity
An index is exhaustive if it covers many topics. An index is specific if it lets users precisely identify their information needs. There is a trade-off: high recall is easiest when an index is exhaustive but not very specific; high precision is best achieved when the index is highly specific but not very exhaustive; the best index strikes a balance. If a document is indexed with many keywords, it will be retrieved more often (“representation bias”): we can expect higher recall, but precision will suffer. We can also analyse the problem from a query-oriented perspective: how well do the query terms discriminate one document from another?
Weighting the Index Relation The simplest notion of an index is binary – either a keyword is associated with a document or it is not – but it is natural to imagine degrees of aboutness. We will use a single real number, a weight, capturing the strength of association between keyword and document. The retrieval method can exploit these weights directly.
Weighting (2)
One way to interpret this weight is probabilistic. We seek a measure of a document’s relevance, conditioned on the belief that a keyword is relevant: Wkd ∝ Pr(d relevant | k relevant). This is a directed relation: we may or may not believe that the symmetric relation, Wdk ∝ Pr(k relevant | d relevant), is the same. Unless otherwise specified, when we speak of a weight W we mean Wkd.
Weighting (3)
In order to compute statistical estimates for such probabilities we define several important quantities:
- Fkd = number of occurrences of keyword k in document d
- Fk = total number of occurrences of keyword k across the entire corpus
- Dk = number of documents containing keyword k
Weighting (4)
We will make two demands on the weight reflecting the degree to which a document is about a particular keyword or topic:
1. Repetition is an indicator of emphasis. If an author uses a word frequently, it is because she or he thinks it is important. (Fkd)
2. A keyword must be a useful discriminator within the context of the corpus. Capturing this notion statistically is more difficult; for now we just give it the name discrim_k.
Because we care about both, our weight will depend on the two factors: Wkd ∝ Fkd × discrim_k. Various index weighting schemes exist: they all use Fkd, but differ in how they quantify discrim_k.
Inverse document frequency (IDF)
Karen Spärck Jones observed that, from a discrimination point of view, what matters is the number of documents which contain a particular word. The value of a keyword varies inversely with the logarithm of the number of documents in which it occurs: Wkd = Fkd × [ log( NDoc / Dk ) + 1 ], where NDoc is the total number of documents in the corpus. Variations on this formula exist.
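The TF-IDF formula above can be computed directly once D_k is known. A minimal sketch over an invented toy corpus; the formula leaves the log base unspecified, so the natural log used here is one assumption among several possible (a different base only rescales the weights):

```python
# TF-IDF weighting sketch: W_kd = F_kd * (log(NDoc / D_k) + 1).
# The corpus is invented toy data; log base is natural log by assumption.
import math
from collections import Counter

docs = {
    "d1": "neural network model of a neural system".split(),
    "d2": "database system design".split(),
    "d3": "neural models for database search".split(),
}
NDoc = len(docs)
D_k = Counter(t for toks in docs.values() for t in set(toks))

def weight(k, d):
    F_kd = docs[d].count(k)                      # repetition factor
    return F_kd * (math.log(NDoc / D_k[k]) + 1)  # discrimination factor

# 'neural' occurs twice in d1 and appears in 2 of the 3 documents:
print(round(weight("neural", "d1"), 3))          # 2 * (ln(3/2) + 1)
```

A word appearing in every document gets log(NDoc/Dk) = 0, so its weight collapses to its raw in-document frequency; rarer words are boosted.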
Vector Space Model (1)
In a library, closely related books are physically close together in three-dimensional space. Search engines consider the abstract notion of semantic space, in which documents about the same topic remain close together. We will consider abstract spaces of thousands of dimensions. We start with the index matrix relating each document in the corpus to all of its keywords. Each keyword of the vocabulary is a separate dimension of a vector space, so the dimensionality of the vector space is the size of our vocabulary.
Vector Space Model (2)
In addition to the vectors representing the documents, another vector corresponds to the query. Because documents and queries exist within a common vector space, we can seek those documents that are close to the query vector. A simple (unnormalised) measure of proximity is the inner (or “dot”) product of the query and document vectors: Sim( q, d ) = q · d.
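A small worked instance of the inner product (the weight vectors below are hypothetical, chosen so the arithmetic is easy to check by hand):

```python
# Inner-product similarity Sim(q, d) = q . d over a 5-term vocabulary.
# The weight vectors are invented for illustration.
q = [0, 2, 0, 1, 3]   # query term weights
d = [1, 4, 0, 2, 5]   # document term weights, same vocabulary order

sim = sum(qi * di for qi, di in zip(q, d))
print(sim)   # 0*1 + 2*4 + 0*0 + 1*2 + 3*5 = 25
```

Only terms with nonzero weight in both vectors contribute, so in practice the sum runs over the (usually short) overlap between query and document vocabularies.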
Vector Length Normalisation
We must make weights sensitive to document length. Using the dot product alone, longer documents, which contain more words (are more verbose), are more likely to match the query than shorter ones, even when their scope (the amount of actual information covered) is the same. One solution is to use the cosine measure of similarity, which divides the dot product by the lengths of the two vectors.
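The cosine measure can be sketched as follows; the two document vectors are hypothetical, with the longer one built as a scaled-up copy of the shorter to show that normalisation removes the length advantage:

```python
# Cosine similarity: dot product divided by the product of vector lengths.
# This removes the advantage verbose documents get under the raw dot product.
import math

def cosine(q, d):
    dot = sum(qi * di for qi, di in zip(q, d))
    norm = math.sqrt(sum(x * x for x in q)) * math.sqrt(sum(x * x for x in d))
    return dot / norm if norm else 0.0

q       = [1, 1, 0]
short_d = [2, 2, 0]      # same direction as q, few term occurrences
long_d  = [20, 20, 0]    # same content, ten times the length

print(math.isclose(cosine(q, short_d), cosine(q, long_d)))   # True
```

Under the raw dot product the longer document would score ten times higher; under cosine the two documents, having identical direction, score identically.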
Summary Zipf’s law: frequency * rank ~ constant Resolving power of keywords: TF * IDF Exhaustivity vs. specificity Vector space model Cosine Similarity measure