1 CS 430: Information Discovery
Lecture 8: Automatic Term Extraction and Weighting
2 Course Administration
Laptop computers:
- Everybody who has (a) signed up and (b) submitted an assignment should receive an email about collecting a laptop
- Laptop surveys will be handed out in class
3 Building an Index
[Pipeline diagram, from Frakes, page 7:]
documents (text) -> break into words -> words -> stoplist -> non-stoplist words -> stemming* -> stemmed words -> term weighting* -> terms with weights -> index database
Document IDs are assigned in parallel (document numbers and *field numbers).
*Indicates optional operation
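As a concrete illustration of this pipeline, the sketch below runs the same steps in Python: assign document IDs, break text into words, drop stop-list words, apply a crude stand-in for a stemmer, and weight terms by raw frequency. The stoplist, the suffix-stripping rule, and the weighting are toy choices for illustration, not the ones from the reading.

```python
import re
from collections import Counter, defaultdict

STOPLIST = {"a", "an", "and", "in", "into", "to", "the", "of"}   # toy stop list

def stem(word):
    # Crude suffix stripping, standing in for a real stemmer (optional step).
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def build_index(documents):
    index = defaultdict(dict)                              # term -> {doc_id: weight}
    for doc_id, text in enumerate(documents):              # assign document IDs
        words = re.findall(r"[a-z0-9]+", text.lower())     # break into words
        terms = [stem(w) for w in words if w not in STOPLIST]
        for term, freq in Counter(terms).items():          # term weighting (raw tf)
            index[term][doc_id] = freq
    return index

index = build_index(["Information discovery and retrieval",
                     "Weighting terms in an index"])
print(index["term"])   # {1: 1}
```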
4 What is a token?
A token is a group of characters that are treated as a single term for indexing and in queries.
The exact definition of a token affects performance:
- digits (parts of names, data in tables)
- hyphens (line breaks, compound words and phrases)
- punctuation characters (parts of names)
- upper or lower case (proper nouns, acronyms)
Impact on retrieval:
- a broad definition of a token enhances recall
- a narrow definition of a token enhances precision
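The difference is easy to see in code. The sketch below contrasts two hypothetical token definitions: a broad one that keeps digits and hyphenated compounds together, and a narrow one that keeps only runs of letters.

```python
import re

def tokens_broad(text):
    # Broad definition: digits allowed, hyphenated compounds kept as one token.
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())

def tokens_narrow(text):
    # Narrow definition: letters only; digits and hyphens split or drop tokens.
    return re.findall(r"[a-z]+", text.lower())

text = "F-16 pilots trained in 1999"
print(tokens_broad(text))    # ['f-16', 'pilots', 'trained', 'in', '1999']
print(tokens_narrow(text))   # ['f', 'pilots', 'trained', 'in']
```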
5 Lexical analysis
Performance: must look at every character.
Conveniently implemented as a finite state machine.
The definition of a token in queries should be consistent with the definition used in indexing.
Lexical analysis may be combined with:
- removal of stop words
- stemming
6 Transition Diagram
[Diagram: transition diagram for the lexical analyzer, with states 0 through 8 and transitions on letter, digit, (, ), &, |, ^, space, end-of-string, and other characters.]
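A minimal sketch of such a machine is shown below, assuming the character classes in the diagram (letters/digits, parentheses, &, |, ^, space, end of string). The token names and the decision to skip "other" characters are assumptions for illustration, not the exact design from the reading.

```python
def lex(query):
    tokens, i, n = [], 0, len(query)
    while i < n:
        c = query[i]
        if c.isspace():                          # space: stay in the start state
            i += 1
        elif c.isalnum():                        # letter/digit: accumulate a term
            j = i
            while j < n and query[j].isalnum():
                j += 1
            tokens.append(("TERM", query[i:j].lower()))
            i = j
        elif c in "()&|^":                       # single-character operator states
            names = {"(": "LPAREN", ")": "RPAREN", "&": "AND", "|": "OR", "^": "NOT"}
            tokens.append((names[c], c))
            i += 1
        else:                                    # "other": ignored here
            i += 1
    tokens.append(("EOS", ""))                   # end-of-string state
    return tokens

print(lex("(cat & dog) | ^fish"))
```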
7 Stop list
Stop list: a list of common words that are ignored in building an index and when they occur in a query.
- Reduces the length of inverted lists
- Saves storage space (the 10 most common words are 20% to 30% of tokens)
- The aim is to have no impact on retrieval effectiveness
Typical stop lists have between 10 and 500 words.
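The 20% to 30% figure is easy to check on a collection of one's own. The sketch below is a toy helper (not from the reading) that counts what fraction of all tokens the k most frequent words account for.

```python
import re
from collections import Counter

def top_word_coverage(texts, k=10):
    # Fraction of all tokens accounted for by the k most frequent words.
    counts = Counter(w for t in texts for w in re.findall(r"[a-z]+", t.lower()))
    total = sum(counts.values())
    return sum(c for _, c in counts.most_common(k)) / total if total else 0.0

# On a realistically large collection this typically comes out around 0.2-0.3 for k=10.
print(top_word_coverage(["the cat sat on the mat", "the dog and the cat"], k=2))
```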
8 Example: the WAIS stop list (first 84 of 363 multi-letter words)
about, above, according, across, actually, adj, after, afterwards, again, against, all, almost, alone, along, already, also, although, always, among, amongst, an, another, any, anyhow, anyone, anything, anywhere, are, aren't, around, at, be, became, because, become, becomes, becoming, been, before, beforehand, begin, beginning, behind, being, below, beside, besides, between, beyond, billion, both, but, by, can, can't, cannot, caption, co, could, couldn't, did, didn't, do, does, doesn't, don't, down, during, each, eg, eight, eighty, either, else, elsewhere, end, ending, enough, etc, even, ever, every, everyone, everything
9 Stop list policies
How many words should be in the stop list?
- A long list lowers recall.
Which words should be in the list?
- Some common words may have retrieval importance: war, home, life, water, world
- In certain domains, some words are very common: computer, program, source, machine, language
There is very little systematic evidence to use in selecting a stop list.
10 Choice of stop words
11 Implementation of stop lists
Filter stop words from the output of the lexical analyzer:
- Create a perfect hash table of stop words (no collisions)
- Calculate the hash value of each token from the lexical analyzer
- Reject a token if its value is found in the hash table
Alternatively, remove stop words as part of the lexical analysis. [See Frakes, Section 7.3.]
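A rough Python equivalent of the first approach is a set lookup: the built-in set is an ordinary (not perfect) hash table, but the logic is the same, hash each token and reject it if the value is found. This is an illustrative sketch, not the perfect-hash construction described in the reading.

```python
STOP_WORDS = frozenset({"a", "an", "and", "in", "into", "to", "the", "of", "is"})

def remove_stop_words(tokens):
    # Reject any token whose hash lookup finds it in the stop-word table.
    return [t for t in tokens if t not in STOP_WORDS]

print(remove_stop_words(["the", "index", "of", "a", "collection"]))
# ['index', 'collection']
```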
12 A generated finite state machine
[Diagram: a machine with states q0 through q6, generated from the stop words a, an, and, in, into, to; each transition consumes one character, and accepting states mark complete stop words.]
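The sketch below builds the same kind of machine as a transition table keyed by (state, character). The construction is generic, so the exact state numbering differs from the diagram.

```python
STOP_WORDS = ["a", "an", "and", "in", "into", "to"]

def build_machine(words):
    # One state per distinct prefix; accepting states mark complete stop words.
    transitions, accepting, next_state = {}, set(), 1
    for word in words:
        state = 0
        for ch in word:
            if (state, ch) not in transitions:
                transitions[(state, ch)] = next_state
                next_state += 1
            state = transitions[(state, ch)]
        accepting.add(state)
    return transitions, accepting

def is_stop_word(word, machine):
    transitions, accepting = machine
    state = 0
    for ch in word:
        if (state, ch) not in transitions:
            return False
        state = transitions[(state, ch)]
    return state in accepting

machine = build_machine(STOP_WORDS)
print([w for w in ["an", "ant", "into", "index"] if is_stop_word(w, machine)])
# ['an', 'into']
```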
13 Term weighting
Zipf's Law: if the words w in a collection are ranked by frequency, with rank r(w) and frequency f(w), they roughly fit the relation:
r(w) * f(w) = c
This suggests that some terms are more effective than others in retrieval. In particular, relative frequency is a useful measure: it identifies terms that occur with substantial frequency in some documents, but with relatively low overall collection frequency.
Term weights are functions used to quantify these concepts.
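Zipf's relation is easy to inspect on a collection of one's own: rank the words by frequency and look at rank * frequency. The sketch below assumes a plain-text file; the name collection.txt is a placeholder.

```python
import re
from collections import Counter

def rank_times_frequency(text, top=10):
    # For each of the `top` most frequent words, report rank, frequency,
    # and their product, which Zipf's law predicts is roughly constant.
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [(rank, word, freq, rank * freq)
            for rank, (word, freq) in enumerate(counts.most_common(top), start=1)]

with open("collection.txt") as f:    # placeholder path for a real collection
    for row in rank_times_frequency(f.read()):
        print(row)
```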
14 Inverse document frequency weight
Notation, for term k:
- number of documents = n
- document frequency (number of documents in which term k occurs) = d_k
One possible measure is n / d_k.
Inverse document frequency: i_k = log2(n / d_k) + 1
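A quick sketch of the formula in code; with n = 1,000 documents the first and last values match the example on the next slide.

```python
import math

def inverse_document_frequency(n, d_k):
    # i_k = log2(n / d_k) + 1, where n is the number of documents in the
    # collection and d_k the number of documents containing term k.
    return math.log2(n / d_k) + 1

print(round(inverse_document_frequency(1000, 100), 2))    # 4.32
print(round(inverse_document_frequency(1000, 1000), 2))   # 1.0
```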
15 Inverse document frequency weight
Example: n = 1,000 documents

term k    d_k      i_k
A         100      4.32
B         500      2.00
C         900      1.13
D         1,000    1.00

From: Salton and McGill
16 Information in a term
The search for more precise methods of weighting.
The higher the probability of occurrence of a word, the less information it provides for retrieval.
- probability of occurrence of word k = p_k
- information: i_k = -log2(p_k)
The average information that each word provides is -Σ p_k log2(p_k), summed over all words k.
17 Average information
Average information is maximized when each term occurs with the same probability.

term    p_k     -p_k log2(p_k)
A       0.5     0.50
B       0.2     0.46
C       0.2     0.46
D       0.1     0.33
average information: 1.75

term    p_k     -p_k log2(p_k)
A       0.25    0.50
B       0.25    0.50
C       0.25    0.50
D       0.25    0.50
average information: 2.00
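A small sketch of the calculation behind both tables (note that the slide sums the rounded column entries, so the first value differs slightly):

```python
import math

def average_information(probabilities):
    # -sum of p_k * log2(p_k); maximized when all p_k are equal.
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(round(average_information([0.5, 0.2, 0.2, 0.1]), 2))      # 1.76 (table: 1.75)
print(round(average_information([0.25, 0.25, 0.25, 0.25]), 2))  # 2.0
```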
18 Noise
Notation:
- total frequency of term k in the collection = t_k
- frequency of term k in document i = f_ik
The noise of term k is defined as
n_k = Σ_i (f_ik / t_k) log2(t_k / f_ik), summed over documents i for which f_ik ≠ 0
Noise measures the concentration of a term in a collection:
(a) if term k occurs once in each document (i.e., all f_ik = 1):
n_k = n * (1/n) log2(n/1) = log2(n)
(b) if term k occurs in only one document, with frequency t_k:
n_k = (t_k/t_k) log2(t_k/t_k) = 0
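A direct transcription of the definition, taking the per-document frequencies f_ik of one term as input, reproduces both limiting cases:

```python
import math

def noise(doc_frequencies):
    # n_k = sum over documents i with f_ik > 0 of (f_ik / t_k) * log2(t_k / f_ik),
    # where t_k is the term's total frequency in the collection.
    t_k = sum(doc_frequencies)
    return sum((f / t_k) * math.log2(t_k / f) for f in doc_frequencies if f > 0)

print(noise([1] * 8))   # once in each of 8 documents: log2(8) = 3.0
print(noise([16]))      # all occurrences in one document: 0.0
```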
19 Signal
Signal is the inverse of noise:
s_k = log2(t_k) - n_k
Considerable research has been carried out using weights based on signal, ranking words in decreasing order of signal value. This favors terms that distinguish a few specific documents from the remainder of the collection. However, practical experience has been mixed.
After: Shannon
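A sketch of ranking terms by signal, using made-up per-document frequencies, looks like this; terms concentrated in few documents come out on top.

```python
import math

def signal(doc_frequencies):
    # s_k = log2(t_k) - n_k: high when the term is concentrated in few documents.
    t_k = sum(doc_frequencies)
    n_k = sum((f / t_k) * math.log2(t_k / f) for f in doc_frequencies if f > 0)
    return math.log2(t_k) - n_k

# Hypothetical per-document frequencies for three terms:
terms = {"evenly": [4, 4, 4, 4], "concentrated": [0, 0, 16, 0], "mixed": [1, 2, 1, 12]}
for term in sorted(terms, key=lambda t: signal(terms[t]), reverse=True):
    print(term, round(signal(terms[term]), 2))
# concentrated 4.0, mixed 2.81, evenly 2.0
```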