1
The College of Saint Rose
CSC 460 / CIS 560 – Search and Information Retrieval
David Goldschmidt, Ph.D.
From Search Engines: Information Retrieval in Practice, 1st edition, by Croft, Metzler, and Strohman, Pearson, 2010, ISBN 0-13-607224-0
2
How do we best convert documents to their index terms?
How do we make acquired documents searchable?
3
The simplest approach is find, which requires no text transformation.
Useful in user applications, but not in search (why?)
An optional transformation can be handled during the find operation: case sensitivity.
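To make the case-sensitivity point concrete, here is a minimal sketch (not from the slides) of a case-insensitive find: the stored text is left untouched, and both the text and the query are lowercased only at matching time. Function and variable names are illustrative.

```python
def find_all(text: str, query: str, ignore_case: bool = True) -> list[int]:
    """Return the starting offset of every occurrence of query in text."""
    if ignore_case:
        # Optional transformation handled during the find operation itself
        text, query = text.lower(), query.lower()
    positions, start = [], text.find(query)
    while start != -1:
        positions.append(start)
        start = text.find(query, start + 1)
    return positions

print(find_all("Tropical Fish and aquarium fish", "fish"))  # [9, 27]
```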
4
English documents are predictable:
The top two most frequently occurring words ("the" and "of") account for about 10% of word occurrences.
The top six most frequently occurring words account for about 20% of word occurrences.
The top fifty most frequently occurring words account for about 50% of word occurrences.
Of all the unique words in a (large) document, approximately 50% occur only once.
5
Zipf’s law: rank words in order of decreasing frequency.
The rank (r) of a word times its frequency (f) is approximately equal to a constant (k): r × f ≈ k
In other words, the frequency of the r-th most common word is inversely proportional to r.
George Kingsley Zipf (1902–1950)
6
The probability of occurrence (P_r) of a word is the word's frequency divided by the total number of words in the document.
Revise Zipf’s law as: r × P_r = c, where for English c ≈ 0.1
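As a rough illustration of the last two slides, the sketch below ranks words by frequency and prints r × P_r for the top ranks; for a large English corpus the product should hover around c ≈ 0.1. The corpus file name ap89.txt is a placeholder, not a file provided with the course.

```python
from collections import Counter
import re

def zipf_check(text: str, top: int = 10) -> None:
    """Rank words by decreasing frequency and report r * P_r for the top ranks."""
    words = re.findall(r"[a-z]+", text.lower())
    total = len(words)
    for r, (word, freq) in enumerate(Counter(words).most_common(top), start=1):
        p_r = freq / total                      # probability of occurrence
        print(f"{r:>3}  {word:<12} f={freq:>8}  r*Pr={r * p_r:.3f}")

# "ap89.txt" is a hypothetical plain-text dump of the corpus.
zipf_check(open("ap89.txt", encoding="utf-8").read())
```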
7
Verify Zipf’s law using the AP89 dataset, a collection of Associated Press (AP) news stories from 1989 (available at http://trec.nist.gov):
Total documents: 84,678
Total word occurrences: 39,749,179
Vocabulary size: 198,763
Words occurring > 1000 times: 4,169
Words occurring once: 70,064
8
Top 50 words of AP89
9
As the corpus grows, so does the vocabulary size, but there are fewer new words when the corpus is already large.
The relationship between corpus size (n) and vocabulary size (v) was defined empirically by Heaps (1978) and is called Heaps’ law: v = k × n^β
The constants k and β vary; typically 10 ≤ k ≤ 100 and β ≈ 0.5.
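A small sketch of Heaps’ law: it records observed vocabulary growth while scanning a corpus and compares it with v = k × n^β. The values k = 60 and β = 0.5 are placeholders within the typical ranges quoted above and would need to be fit to the actual collection; ap89.txt is again a hypothetical file name.

```python
import re

def heaps_curve(text: str, step: int = 100_000) -> list[tuple[int, int]]:
    """Record (corpus size n, vocabulary size v) pairs while scanning the text."""
    words = re.findall(r"[a-z]+", text.lower())
    seen, points = set(), []
    for n, word in enumerate(words, start=1):
        seen.add(word)
        if n % step == 0:
            points.append((n, len(seen)))
    return points

def heaps_prediction(n: int, k: float = 60.0, beta: float = 0.5) -> float:
    """Heaps' law estimate v = k * n**beta (k and beta are placeholder values)."""
    return k * n ** beta

for n, v in heaps_curve(open("ap89.txt", encoding="utf-8").read()):
    print(f"n={n:>10,}  observed v={v:>8,}  predicted v={heaps_prediction(n):>10,.0f}")
```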
10
(Plot of vocabulary growth versus corpus size; note the fitted values of k and β.)
11
Web pages crawled from .gov in early 2004 (the GOV2 collection)
12
Word occurrence statistics can be used to estimate the result set size of a user query.
Aside from stop words, how many pages contain all of the query terms?
To figure this out, first assume that words occur independently of one another.
Also assume that the search engine knows N, the number of documents it indexes.
13
Given three query terms a, b, and c, the probability of a document containing all three is the product of the individual probabilities for each query term: P(a ∩ b ∩ c) = P(a) × P(b) × P(c)
P(a ∩ b ∩ c) is the joint probability of events a, b, and c occurring.
14
We assume the search engine knows the number of documents that each word occurs in; call these n_a, n_b, and n_c.
(Note that the book uses f_a, f_b, and f_c.)
Estimate the individual query term probabilities: P(a) = n_a / N, P(b) = n_b / N, P(c) = n_c / N
15
Given P(a), P(b), and P(c), we estimate the result set size as:
n_abc = N × (n_a / N) × (n_b / N) × (n_c / N) = (n_a × n_b × n_c) / N²
This estimate sounds good, but it falls short because of our query term independence assumption.
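The independence-based estimate translates directly into code. In this sketch the document frequencies in the usage line are made-up placeholders for illustration; only the collection size N = 25,205,179 (GOV2, from the next slide) comes from the course material.

```python
def independence_estimate(N: int, doc_freqs: list[int]) -> float:
    """Estimate result set size assuming query terms occur independently:
    n_abc = N * (n_a / N) * (n_b / N) * (n_c / N)."""
    estimate = float(N)
    for n_t in doc_freqs:
        estimate *= n_t / N
    return estimate

# Illustrative document frequencies only (not taken from the book's tables).
N = 25_205_179
print(round(independence_estimate(N, [120_000, 900_000, 26_480])))
```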
16
Using the GOV2 dataset with N = 25,205,179, the estimates are poor because of the query term independence assumption.
Word co-occurrence data could be used instead...
17
Alternatively, extrapolate based on the size of the current result set:
The current result set is the subset of documents that have been ranked thus far.
Let C be the number of documents found thus far that contain all the query words.
Let s be the proportion of the total documents ranked (using the least frequently occurring term).
Estimate the result set size via n_abc = C / s
18
Given the example query tropical fish aquarium, the least frequently occurring term is aquarium, which occurs in 26,480 documents.
After ranking 3,000 documents, 258 documents contain all three query terms.
Thus, n_abc = C / s = 258 / (3,000 ÷ 26,480) ≈ 2,277
After processing 20% of the documents, the estimate is 1,778, which overshoots the actual value of 1,529.
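The same arithmetic, as a small sketch using the numbers from this slide (the function name is illustrative):

```python
def extrapolated_estimate(C: int, ranked: int, least_freq_df: int) -> float:
    """Estimate result set size as n_abc = C / s, where s is the proportion of
    documents ranked so far for the least frequently occurring query term."""
    s = ranked / least_freq_df
    return C / s

# tropical fish aquarium: 258 matching documents after ranking 3,000 of the
# 26,480 documents that contain "aquarium"
print(round(extrapolated_estimate(258, 3_000, 26_480)))   # 2277
```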
19
Read and study Chapter 4.
Do Exercises 4.1, 4.2, and 4.3.
Start thinking about how to write code to implement the stopping and stemming techniques of Chapter 4.