Exploring Text: Zipf’s Law and Heaps’ Law
Figure: Zipf’s and Heaps’ distributions. (a) Distribution of sorted word frequencies (Zipf’s law). (b) Distribution of the size of the vocabulary.
Statistical Properties of Text How is the frequency of different words distributed? How fast does vocabulary size grow with the size of a corpus? Such factors affect the performance of information retrieval and can be used to select appropriate term weights and other aspects of an IR system.
Word Frequency A few words are very common. ◦The 2 most frequent words (e.g. “the”, “of”) can account for about 10% of word occurrences. Most words are very rare. ◦About half the distinct words in a corpus appear only once; these are called hapax legomena (Greek for “said only once”). This is called a “heavy-tailed” distribution, since most of the probability mass is in the “tail”.
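Predicting Occurrence Frequencies By Zipf, a word appearing $n$ times has rank $r_n = AN/n$, where $N$ is the total number of word occurrences and $A$ is a constant. If several words occur $n$ times, assume rank $r_n$ applies to the last of these. Therefore, $r_n$ words occur $n$ or more times and $r_{n+1}$ words occur $n+1$ or more times. So, the number of words appearing exactly $n$ times is: $I_n = r_n - r_{n+1} = \frac{AN}{n} - \frac{AN}{n+1} = \frac{AN}{n(n+1)}$. The highest rank, reached by the words that appear only once, is $D = r_1 = AN$, which is also the number of distinct words. Fraction of words with frequency $n$ is: $\frac{I_n}{D} = \frac{1}{n(n+1)}$. Fraction of words appearing only once is therefore ½.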
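The derivation above can be checked numerically. The sketch below assumes the standard result that, under Zipf's law, the fraction of distinct words occurring exactly n times is 1/(n(n+1)); the function name is illustrative only. Exact rational arithmetic makes the telescoping sum visible.

```python
from fractions import Fraction

def fraction_with_frequency(n: int) -> Fraction:
    """Fraction of distinct words occurring exactly n times, assuming Zipf's law: 1 / (n (n + 1))."""
    return Fraction(1, n * (n + 1))

# Print the first few fractions: 1/2, 1/6, 1/12, ...
for n in range(1, 6):
    print(n, fraction_with_frequency(n))

# The fractions telescope: the sum over n = 1..m equals m / (m + 1), approaching 1.
m = 99
print(sum(fraction_with_frequency(n) for n in range(1, m + 1)))
```

Under this model, words seen once or twice together make up 1/2 + 1/6 = 2/3 of the vocabulary, which is why the tail dominates.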
Occurrence Frequency Data (from B. Croft, UMass)
Text properties (formalized) Sample word frequency data
Zipf’s Law We use a few words very frequently and rarely use most other words. The product of the frequency of a word and its rank is approximately the same as the product of the frequency and rank of any other word. Deviations usually occur at the beginning and at the end of the table/graph.
Zipf’s Law Zipf's law states that given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc. For example, in the Brown Corpus of American English text, the word "the" is the most frequently occurring word, and by itself accounts for nearly 7% of all word occurrences (69,971 out of slightly over 1 million). The second-place word "of" accounts for slightly over 3.5% of words (36,411 occurrences), followed by "and" (28,852). Only 135 vocabulary items are needed to account for half the Brown Corpus. [3]
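As a quick sanity check on the Brown Corpus figures quoted above, the following sketch computes the rank × frequency product for the three words given. If Zipf's law held exactly, the products would be equal. Only the three counts come from the text; the helper name is an assumption made for illustration.

```python
# Brown Corpus counts quoted above for the three most frequent words.
brown_top = [("the", 69_971), ("of", 36_411), ("and", 28_852)]

def rank_frequency_products(counts):
    """Return rank * frequency for each (word, frequency) pair, with ranks starting at 1."""
    return [rank * freq for rank, (_, freq) in enumerate(counts, start=1)]

products = rank_frequency_products(brown_top)
top_freq = brown_top[0][1]
for (word, freq), product in zip(brown_top, products):
    # f1/f is the ratio of the top frequency to this word's frequency (ideal: 1, 2, 3).
    print(f"{word:>4}  f={freq:>6}  r*f={product:>6}  f1/f={top_freq / freq:.2f}")
```

The ratios come out near 1.9 and 2.4 rather than the ideal 2 and 3, consistent with the remark that deviations are largest at the very top of the ranking.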
Zipf’s Law The most common 20 words in English are listed in the following table. The table is based on the Brown Corpus, a careful study of a million words from a wide variety of sources including newspapers, books, magazines, fiction, government documents, comedy and academic publications.
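Table of Top 20 frequently occurring words in English. Columns: Rank | Word | Frequency | % Frequency | Theoretical Zipf Distribution. Words in rank order: 1 the, 2 of, 3 and, 4 to, 5 a, 6 in, 7 that, 8 is, 9 was, 10 he, 11 for, 12 it, 13 with, 14 as, 15 his, 16 on, 17 be, 18 at, 19 by, 20 I.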
Plot of Top 20 frequently occurring words in English
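Zipf’s Law Rank ($r$): The numerical position of a word in a list sorted by decreasing frequency ($f$). Zipf (1949) “discovered” that: $f \cdot r = k$ for a constant $k$, i.e. $f \propto 1/r$. If probability of word of rank $r$ is $p_r$ and $N$ is the total number of word occurrences: $p_r = \frac{f}{N} = \frac{A}{r}$, for a corpus-independent constant $A$ (commonly reported as about 0.1 for English).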
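To make the definitions of rank, frequency, and probability concrete, here is a minimal sketch that builds a rank-frequency table from raw text. The regular-expression tokenizer and the function name are assumptions made for illustration; a real IR system would tokenize more carefully.

```python
import re
from collections import Counter

def rank_frequency_table(text: str):
    """Return (rank, word, frequency, probability) rows sorted by decreasing frequency."""
    # Crude tokenizer (an assumption): lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)  # N, the total number of word occurrences
    rows = []
    for rank, (word, freq) in enumerate(counts.most_common(), start=1):
        rows.append((rank, word, freq, freq / total))
    return rows

sample = "the cat and the dog and the bird"
for row in rank_frequency_table(sample):
    print(row)
```

Words with equal counts receive consecutive ranks in first-seen order here; how ties are ranked is a design choice that matters mostly in the long tail.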
Does Real Data Fit Zipf’s Law? A law of the form $y = kx^c$ is called a power law. Zipf’s law is a power law with $c = -1$. On a log-log plot, power laws give a straight line with slope $c$. Zipf is quite accurate except for very high and low rank.
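One way to test the fit is to estimate the slope of the line on log-log axes. The sketch below uses an ordinary least-squares fit of log f against log r in pure Python; it is one possible estimator, not necessarily the one used for the plots in these slides, and the synthetic data are an assumption chosen so the expected answer is known.

```python
import math

def fit_power_law(ranks, freqs):
    """Least-squares fit of log f = log k + c * log r; returns (k, c)."""
    xs = [math.log(r) for r in ranks]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    c = sxy / sxx                          # slope on the log-log plot
    k = math.exp(mean_y - c * mean_x)      # intercept, mapped back from log space
    return k, c

# Synthetic, perfectly Zipfian data (f = k / r), so the fit should recover c = -1.
ranks = list(range(1, 101))
freqs = [100_000 / r for r in ranks]
k, c = fit_power_law(ranks, freqs)
print(f"k = {k:.1f}, c = {c:.3f}")
```

On real corpora the estimated slope depends on which ranks are included, since the highest and lowest ranks deviate from the straight line.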
Top 2000 English words using a log-log scale
Fit to Zipf for Brown Corpus k = 100,000
Plot of word frequency in a Wikipedia dump The plot is made in log-log coordinates. x is the rank of a word in the frequency table; y is the total number of the word’s occurrences. The most popular words are “the”, “of” and “and”, as expected.
Zipf’s Law The same relationship occurs in many other rankings unrelated to language, such as: ◦Corporation sizes ◦Calls to computer operating systems ◦Colors in images (the basis of most approaches to image compression) ◦City populations (a small number of large cities, a larger number of smaller cities) ◦Wealth distribution (a small number of people have large amounts of money, large numbers of people have small amounts of money) ◦Popularity of web pages in websites
Zipf’s Law Authorship tests Textual analysis can be used to assess the authenticity of disputed works. Each author has their own preference for using certain words, so one technique compares the occurrence of different words in the disputed text with that in an author's known works. The counted words are ranked (the most common is number one and the rarest is last) and then plotted on a graph, with frequency of occurrence on the vertical axis. Comparing the Zipf graphs of two pieces of writing, paying attention to the position of selected words, can indicate whether both were composed by the same author.
Vocabulary Growth How does the size of the overall vocabulary (number of unique words) grow with the size of the corpus? This determines how the size of the inverted index will scale with the size of the corpus. The vocabulary is not really upper-bounded, due to proper names, typos, etc.
Heaps’ Law If $V$ is the size of the vocabulary and $n$ is the length of the corpus in words: $V = K n^{\beta}$, with constants $K$ and $0 < \beta < 1$. Typical constants: ◦$K \approx 10$ to $100$ ◦$\beta \approx 0.4$ to $0.6$ (approx. square-root)
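The usual statement of Heaps' law, V = K·n^β, is easy to evaluate directly. In the sketch below, the values K = 50 and β = 0.5 are hypothetical choices from within the typical ranges quoted above, not fitted parameters for any particular corpus.

```python
def heaps_vocabulary(n: int, k: float, beta: float) -> float:
    """Predicted vocabulary size V = K * n**beta for a corpus of n word occurrences."""
    return k * n ** beta

# Hypothetical constants inside the typical ranges (assumptions, not fitted values).
K, BETA = 50.0, 0.5
for n in (10_000, 1_000_000, 100_000_000):
    print(f"n = {n:>11,}  predicted V = {heaps_vocabulary(n, K, BETA):,.0f}")
```

With β near 0.5, growing the corpus 100-fold only grows the predicted vocabulary about 10-fold, which is the "approximately square-root" behavior.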
Heaps’ Law Data
Heaps’ Law
A typical Heaps-law plot The x-axis represents the text size The y-axis represents the number of distinct vocabulary elements present in the text. Compare the values of the two axes
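The points on such a plot can be produced by scanning a token stream and recording the number of distinct words seen so far. This is a minimal sketch; the function name, the `step` parameter, and the toy sentence are assumptions for illustration.

```python
def vocabulary_growth(tokens, step=1):
    """Return (tokens_seen, distinct_words_seen) pairs sampled every `step` tokens."""
    seen = set()
    points = []
    for i, token in enumerate(tokens, start=1):
        seen.add(token)
        if i % step == 0:
            points.append((i, len(seen)))  # x = text size, y = vocabulary size
    return points

tokens = "the cat saw the dog and the dog saw the cat".split()
for n, v in vocabulary_growth(tokens, step=2):
    print(n, v)
```

Even on this toy input the curve flattens as repeated words stop adding to the vocabulary; on a real corpus it keeps rising, only ever more slowly.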
AP89 Example
Heaps’ Law Predictions Predictions for TREC collections are accurate for large numbers of words ◦e.g., first 10,879,522 words of the AP89 collection scanned ◦prediction is 100,151 unique words ◦actual number is 100,024 Predictions for small numbers of words (i.e. < 1000) are much worse
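The accuracy of the AP89 prediction can be quantified as a relative error. The short sketch below uses only the two figures quoted above; the function name is illustrative.

```python
def relative_error(predicted: int, actual: int) -> float:
    """Absolute prediction error as a fraction of the actual value."""
    return abs(predicted - actual) / actual

# AP89 figures quoted above: predicted vs. actual number of unique words.
predicted, actual = 100_151, 100_024
error = relative_error(predicted, actual)
print(f"absolute error = {predicted - actual} words, relative error = {error:.3%}")
```

The prediction is off by 127 words, or roughly 0.13% of the actual vocabulary.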
GOV2 (Web) Example
Web Example Heaps’ Law works with very large corpora ◦new words keep occurring even after seeing 30 million! ◦parameter values differ from those of the typical newswire corpora used in competitions New words come from a variety of sources ◦spelling errors, invented words (e.g. product, company names), code, other languages, addresses, etc. Search engines must deal with these large and growing vocabularies