
1 Term Frequency and IR (C.Watters, csci6403)

2 What is a good index term?
Occurs only in some documents???
  – Too often: get the whole doc set
  – Too seldom: get too few matches
Need to know the distribution of terms!
Goal: get rid of “poor” index terms
Why?
  – Faster to process
  – Smaller indices
  – Better results

3 Look at
Index term extraction
Term distribution
Growth of vocabulary
Collocation of terms

4 Important Words?
Enron Ruling Leaves Corporate Advisers Open to Lawsuits
By KURT EICHENWALD
A ruling last week by a federal judge in Houston may well have accomplished what a year's worth of reform by lawmakers and regulators has failed to achieve: preventing the circumstances that led to Enron's stunning collapse from happening again.
To casual observers, Friday's decision by the judge, Melinda F. Harmon, may seem innocuous and not surprising. In it, she held that banks, law firms and investment houses (many of them criticized on Capitol Hill for helping Enron construct off-the-books partnerships that led to its implosion) could be sued by investors who are seeking to regain billions of dollars they lost in the debacle.

5 Index Term Preprocessing
Lexical normalization (get terms)
Stop lists (get rid of terms)
Stemming (collapse terms)
Thesaurus or categorization construction (replace terms)

6 Lexical Normalization
Stream of characters → index terms
Problems??
  – Numbers: good index terms?
  – Hyphens: online, on line, on-line
  – Punctuation: remove?
  – Letter case?
Treat the query terms the same way
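The questions on this slide can be made concrete with a small tokenizer. This is a minimal sketch, not the course's method: the function name `normalize` and every policy in it (lowercasing, joining hyphenated forms, discarding punctuation, dropping pure numbers by default) are assumptions chosen only to illustrate the decisions involved.

```python
import re


def normalize(text, keep_numbers=False):
    """Turn a character stream into a list of index terms.

    All policies here are illustrative assumptions, not a standard.
    """
    # Letter case: fold everything to lowercase.
    text = text.lower()
    # Hyphens: join hyphenated forms so "on-line" becomes "online".
    text = re.sub(r"(?<=\w)-(?=\w)", "", text)
    # Punctuation: keep only runs of letters and digits as tokens.
    tokens = re.findall(r"[a-z0-9]+", text)
    # Numbers: drop pure-digit tokens unless the caller wants them.
    if not keep_numbers:
        tokens = [t for t in tokens if not t.isdigit()]
    return tokens


print(normalize("On-line, online; ON LINE 2003!"))
```

Note that "on line" still yields two terms, so the three spellings are only partly unified. Whatever policy is chosen, the same function should be applied to query text so that queries and documents produce comparable terms.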

7 Stop Lists
10 most frequent words => 20% of occurrences
Standard list of 28 words => 30%
Look at the 10 most frequent words in applet 1
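One way to check figures like these on your own collection is to count how much of the token stream the top-k words cover. The sketch below assumes the text has already been tokenized; `top_k_share` and `remove_stopwords` are hypothetical helper names, not part of any library.

```python
from collections import Counter


def top_k_share(tokens, k=10):
    """Return the k most frequent words and the fraction of all
    occurrences they account for."""
    if not tokens:
        return [], 0.0
    top = Counter(tokens).most_common(k)
    covered = sum(count for _, count in top)
    return [word for word, _ in top], covered / len(tokens)


def remove_stopwords(tokens, stop_list):
    """Drop every token that appears in the stop list."""
    stop = set(stop_list)
    return [t for t in tokens if t not in stop]


tokens = "the cat and the dog and the bird".split()
words, share = top_k_share(tokens, k=2)
print(words, share)                     # two most frequent words and their share
print(remove_stopwords(tokens, words))  # what is left to index
```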

8 Stemming
Plurals: car/cars
Variants: react/reaction/reacted/reacting
Category based: adheres/adhesion/adhesive
Errors
  – Understemming: division/divide
  – Overstemming: experiment/experience, divine/divide
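Both kinds of error show up even in a toy stemmer. The sketch below is deliberately crude and is not the Porter algorithm or any other published stemmer: the suffix list and the minimum stem length are assumptions picked so that the slide's examples are reproduced.

```python
# Illustrative suffix list (an assumption), checked in this order.
SUFFIXES = ("ment", "ence", "ing", "ion", "ed", "s")


def crude_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


for group in (["car", "cars"],
              ["react", "reaction", "reacted", "reacting"],
              ["division", "divide"],         # understemming: stems differ
              ["experiment", "experience"]):  # overstemming: stems collide
    print(group, "->", [crude_stem(w) for w in group])
```

Here "division" and "divide" end up with different stems although they are related, while "experiment" and "experience" collapse to the same stem although they are not.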

9 Thesaurus
Control the vocabulary
Automobile (car, suv, sedan, convertible, van, roadster, …)
Problems?

10 What terms make good index terms?
Resolving power or selection power?
Most frequent?
Least frequent?
In between?
Why not use all of them?

11 Resolving Power

12 2. Distribution of Terms in Text
What terms occur very frequently?
What terms occur only once or twice?
What is the general distribution of terms in a document set?

13 Time magazine sample: 243,836 word occurrences
word         freq     r    p_r (%)   A = r x p_r
the          15,659   1    6.422     0.064
of            7,179   2    2.944     0.058
to            6,287   3    2.578     0.077
a             5,830   4    2.391     0.095
and           5,580   5    2.288     0.114
week            626   38   0.257     0.097
government      582   39   0.239     0.093
when            577   40   0.237     0.095
will            488   50   0.200     0.100

14 Zipf’s Relationship
The frequency of the $i$-th most frequent term falls off as a power of its rank $i$, relative to the frequency $f_1$ of the most frequent word:
$f_i = f_1 / i^{\theta}$
where $\theta$ depends on the text (~1-2)
Rank x Frequency = constant
constant ~ 0.1 (with frequency taken as a fraction of all word occurrences)

15 Principle of Least Effort
Describe the weather today
Easier to use the same words!


17 Word frequency & vocab growth
[Figures: frequency F vs. rank; vocabulary size D vs. corpus size]

18 Zipf’s Law
A few words occur a lot
  – Top 10 words account for about 20% of occurrences
A lot of words occur rarely
  – Almost half of the terms occur only once

19 Actual Zipf’s Law
Rank x frequency = constant
Frequency, $p_r$, is the probability that a word taken at random from $N$ occurrences will have rank $r$
Given $D$ unique words: $\sum_{r=1}^{D} p_r = 1$
$r \times p_r = A$
$A \approx 0.1$

20 Time magazine sample: 243,836 word occurrences
word         freq     r    p_r (%)   A = r x p_r
the          15,659   1    6.422     0.064
of            7,179   2    2.944     0.058
to            6,287   3    2.578     0.077
a             5,830   4    2.391     0.095
and           5,580   5    2.288     0.114
week            626   38   0.257     0.097
government      582   39   0.239     0.093
when            577   40   0.237     0.095
will            488   50   0.200     0.100
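The last column of the table can be recomputed from the raw counts. The sketch below uses only the frequencies and ranks listed on the slide; the helper name `zipf_constant` is an assumption introduced for illustration.

```python
N = 243_836  # total word occurrences in the Time magazine sample

# (word, frequency, rank) rows as given on the slide.
ROWS = [
    ("the", 15_659, 1), ("of", 7_179, 2), ("to", 6_287, 3),
    ("a", 5_830, 4), ("and", 5_580, 5), ("week", 626, 38),
    ("government", 582, 39), ("when", 577, 40), ("will", 488, 50),
]


def zipf_constant(freq, rank, total):
    """A = r * p_r, where p_r = freq / total is the relative frequency."""
    return rank * freq / total


for word, freq, rank in ROWS:
    p_r = freq / N
    print(f"{word:<12}{rank:>4}{100 * p_r:>8.3f}%{zipf_constant(freq, rank, N):>8.3f}")
```

The recomputed values agree with the slide to within 0.001, which suggests the slide's last column mixes rounding and truncation. They range from about 0.06 to 0.11, so "constant" is an approximation that holds better in the middle ranks than at the very top.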


22 Using Zipf to predict frequencies
$r \times p_r = A$
A word occurring $n$ times has relative frequency $p_r = n/N$, so its rank $r_n$ is
$r_n = AN/n$
But several words may occur $n$ times
  – We say $r_n$ refers to the last word that occurs $n$ times
  – So $r_n$ words occur $n$ or more times
Number of unique terms, $D$, is the highest rank, reached at $n = 1$:
$D = AN/1 = AN$
Number of words occurring exactly $n$ times, $I_n$:
$I_n = r_n - r_{n+1} = AN/n - AN/(n+1) = AN/(n(n+1))$
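These formulas are easy to evaluate numerically. The sketch below plugs in $A = 0.1$ and the $N$ of the Time magazine sample; the function names are assumptions, and the outputs are predictions of the model, not measured counts.

```python
def rank_of_last_word(n, total, A=0.1):
    """r_n = A*N/n: rank of the last word that occurs n times."""
    return A * total / n


def vocabulary_size(total, A=0.1):
    """D = A*N: the highest rank, reached at n = 1."""
    return rank_of_last_word(1, total, A)


def words_occurring_exactly(n, total, A=0.1):
    """I_n = r_n - r_(n+1) = A*N / (n*(n+1))."""
    return A * total / (n * (n + 1))


def fraction_occurring_exactly(n):
    """I_n / D = 1 / (n*(n+1)); independent of A and N."""
    return 1 / (n * (n + 1))


N = 243_836
print(vocabulary_size(N))             # predicted number of unique terms
print(words_occurring_exactly(1, N))  # predicted terms seen only once
print(fraction_occurring_exactly(1))  # share of the vocabulary seen once
```

The last value is exactly one half whatever $A$ and $N$ are, which is the model's version of the observation that almost half of the terms occur only once.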

23 Zipf and Power Law
Power law uses $y = k x^{c}$
Zipf is a power law with $c = -1$: $r = (AN)\, n^{-1}$
On a log-log plot, expect a straight line with slope $= c$
So how does our Reuters data do?
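A simple way to test the straight-line claim on real counts is to fit a line to the log-log points and read off the slope. The sketch below uses ordinary least squares in pure Python; `loglog_fit` is a hypothetical helper, and the data is synthetic, generated to follow an exact power law so that the expected slope is known.

```python
import math


def loglog_fit(xs, ys):
    """Least-squares fit of log(y) = log(k) + c*log(x); returns (c, k)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((x - mean_x) ** 2 for x in lx)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    c = sxy / sxx
    k = math.exp(mean_y - c * mean_x)
    return c, k


# Synthetic data that follows f = 1000 / rank exactly (an assumption
# made so the expected slope of -1 is known in advance).
ranks = list(range(1, 101))
freqs = [1000 / r for r in ranks]
slope, k = loglog_fit(ranks, freqs)
print(slope, k)
```

On a real collection the fitted slope will deviate from -1, and the highest and lowest ranks usually fit worst.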

24 Zipf log-log curve
[Figure: log frequency vs. log rank; slope = c]

25 2. Vocabulary Growth
How quickly does the vocabulary grow as the size of the data corpus grows?
Upper bound?
Rate of increase of new words?
Can be derived from Zipf’s law

26 Calculation
Given $n$ term occurrences in the corpus:
$D = k n^{b}$
where $0 < b < 1$, typically between 0.4 and 0.6
$k$ is usually between 10 and 100
($n$ is the size of the corpus in words)
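This relationship, commonly known as Heaps' law, can be compared against a measured growth curve. The sketch below is illustrative only: the defaults `k=30.0` and `b=0.5` are arbitrary picks from inside the ranges quoted on the slide, and both function names are assumptions.

```python
def predicted_vocabulary(n, k=30.0, b=0.5):
    """D = k * n**b for a corpus of n word occurrences."""
    return k * n ** b


def measure_growth(tokens, step=1000):
    """Return (occurrences seen, unique terms seen) every `step` tokens."""
    seen = set()
    points = []
    for i, token in enumerate(tokens, start=1):
        seen.add(token)
        if i % step == 0:
            points.append((i, len(seen)))
    return points


print(predicted_vocabulary(1_000_000))                 # model prediction
print(measure_growth(["a", "b", "a", "c"], step=2))    # tiny measured curve
```

Fitting $k$ and $b$ to the measured points for a specific collection gives a better estimate than the defaults used here.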


28 3. Collocation of Terms
Bag-of-words indexing is based on term independence
Why do we do this?
Should we do this?
What could we do if we kept collocation?

29 What is collocation?
Next to
  – Tape deck
  – Day pass
Ordered
  – Pass day
  – Deck tape
Adjacency

30 What data do you need to use collocation?
Word position
Relative to?
What about updates?

31 Queries and Collocation
“Information retrieval”
Information (±2) retrieval
??
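Both query styles need word positions stored in the index. The sketch below is one possible design under stated assumptions: positions are token offsets from the start of each document, and the names `build_positional_index` and `within` are hypothetical. A phrase query becomes an ordered match with a window of 1, and "Information (±2) retrieval" becomes an unordered match with a window of 2.

```python
from collections import defaultdict


def build_positional_index(docs):
    """docs maps doc_id -> list of tokens.
    Returns term -> doc_id -> list of token positions."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, tokens in docs.items():
        for position, term in enumerate(tokens):
            index[term][doc_id].append(position)
    return index


def within(index, first, second, window=1, ordered=True):
    """Doc ids where `second` occurs within `window` positions after
    `first` (or on either side of it when ordered=False)."""
    hits = set()
    first_docs = index.get(first, {})
    second_docs = index.get(second, {})
    for doc_id in first_docs.keys() & second_docs.keys():
        for p in first_docs[doc_id]:
            for q in second_docs[doc_id]:
                gap = q - p
                if 0 < gap <= window or (not ordered and 0 < -gap <= window):
                    hits.add(doc_id)
    return hits


docs = {
    1: ["tape", "deck"],
    2: ["deck", "tape"],
    3: ["information", "and", "retrieval"],
    4: ["information", "retrieval"],
    5: ["retrieval", "of", "information"],
}
index = build_positional_index(docs)
print(within(index, "tape", "deck"))                 # ordered adjacency
print(within(index, "tape", "deck", ordered=False))  # either order
print(within(index, "information", "retrieval", window=2, ordered=False))
```

Because positions are offsets within a document, inserting text into an existing document shifts every later position, which is one reason updates are harder for positional indexes than for bag-of-words indexes.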

32 Summary
We can use general knowledge about term distribution to
  – Design more efficient systems
  – Choose effective indexing terms
  – Map queries to document indexes
Now what??
  – Using keywords in IR systems
  – Most common IR models

