Download presentation
Presentation is loading. Please wait.
1
IN350: Text properties, Zipf’s Law,and Heap’s Law. Judith A. Molka-Danielsen September 12, 2002 Notes are based on Chapter 6 of the Article Collection More notes to follow in class.
2
Review: Michael Spring’s comments The Document Processing Revolution by MBS. How do you define a document? Revolutions: reprographics, communications Transition: create,compose,render; WWW/XML New processing model for e-docs, forms. XML- a first look. Namespace rules allow for a modular document, reuse. DTDs are historical, Schema is the future. Xpath and pointers: used for accessing the document as a tree of nodes. XSL style – rendering, XSLT uses XSL and FO. XSLT transformations of docs to multiple forms. E-business: B2B applications will use concepts, specifications, and tools based on the XML family.
3
Modeling Natural Languages and Compression Information theory - the amount of information in a text is quantified by its entropy, information uncertainty. If one symbol appears all the time it does not convey much information. Higher entropy text cannot be compressed as much lower. Symbols - There are 26 in the English alphabet and 29 in Norwegian. Frequency of occurrence - of the symbols in text in different in different languages. In english ’e’ has the highest occurance. Run length encoding schemes such as Huffman encoding can be used to represent the symbols based on frequency of occurance. Compression for transfering data can be based on the frequency of symbols. More on this in another lecture.
4
Modeling Natural Languages and Compression Creation of Indicies are often based on the frequency of occurance of words within a text. Zipf's Law- named after the Harvard linguistic professor George Kingsley Zipf (1902-1950), is the observation that frequency of occurrence of some event ( P ), as a function of the rank ( i) when the rank is determined by the above frequency of occurrence, is a power-law function P i ~ 1/i a with the exponent a close to unity. Zipf’s distribution- frequency of words in a document approximatly follow this distribution. In the English language, the probability of encountering the ith most common word is given roughly by Zipf law for i up to 1000 or so. The law breaks down for less frequent words, since the harmonic series diverges. Stopwords - a few hundred words take up 50% of the text. These can be disregarded to reduce the space of indices.
5
(a) (b) (a) Distribution of sorted word frequencies (Zipf’s law) (b) Distribution of size of the vocabulary Zipf’s and Heap’s distributions
6
Sample Word Frequency Data (from B. Croft, UMass)
7
Zipf’s Law Rank (r): The numerical position of a word in a list sorted by decreasing frequency (f ). Zipf (1949) “discovered” that: If probability of word of rank r is p r and N is the total number of word occurrences:
8
Occurrence Frequency Data (from B. Croft, UMass)
9
Does Real Data Fit Zipf’s Law? A law of the form y = kx c is called a power law. Zipf’s law is a power law with c = –1 On a log-log plot, power laws give a straight line with slope c. Zipf is quite accurate except for very high and low rank.
10
Fit to Zipf for Brown Corpus k = 100,000
11
Mandelbrot distribution Mandelbrot distribution: Experimental data suggests that a better model is this where c is an additional parameter and k is such that all frequencies add to n. Since the distribution of words is Very skewed (that is, there are a few hundred words which take up 50% of the text), words that are too frequent, such as stopwords, can be disregarded. A stopword is a word which does not carry meaning in natural language and therefore can be ignored (made not searchable), such as 'a,' 'the,' and 'by,'. The most frequent words are stopwords. This allows a significant reduction in the space overhead of indices for natural language texts.
12
Mandelbrot (1954) Correction The following more general form gives a bit better fit:
13
Mandelbrot Fit P = 10 5.4, B = 1.15, = 100
14
Explanations for Zipf’s Law Zipf’s explanation was his “principle of least effort.” Balance between speaker’s desire for a small vocabulary and hearer’s desire for a large one. Debate (1955-61) between Mandelbrot and H. Simon over explanation. Li (1992) shows that just random typing of letters including a space will generate “words” with a Zipfian distribution. http://linkage.rockefeller.edu/wli/zipf/
15
Zipf’s Law Impact on IR Good News: Stopwords will account for a large fraction of text so eliminating them greatly reduces inverted-index storage costs. Bad News: For most words, gathering sufficient data for meaningful statistical analysis (e.g. for correlation analysis for query expansion) is difficult since they are extremely rare.
16
Vocabulary Growth How does the size of the overall vocabulary (number of unique words) grow with the size of the corpus? This determines how the size of the inverted index will scale with the size of the corpus. Vocabulary not really upper-bounded due to proper names, typos, etc.
17
Heaps’ Law Heap’s law is used to predict the growth of the vocabulary size n words of size V. Typical constants: K 10 100 0.4 0.6 (approx. square-root)
18
Heaps’ Law Data
19
Modeling Natural Languages and Compression Document vocaburlary - is the number of distinct (different) words within a document. Heaps’ Law - is show in the right graph in the readings Figure 6.2. In general, the number of new words found in a document does increases logarithmically with the increasing text size. So, if you have an encylopedia, probably most of the words are found in the first volume. Heaps’ Law is also implies the length of words increase logarithmically with the text size, but the average word length is constant. That is because there is a greater occurance of the shorter words.
20
Relate text size to Indexing Review: What is the purpose of an index and how is it made? Based on: ”Appendix A: File Organization and Storage Structures” Text Processing: large text collections are often indexed using inverted files. A vector of distinct words forms a vocabulary, with pointers to a list of all the documents that contain the word(s). Indexes can be compressed to allow quicker access. (Discuss this with Ch.7.)
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.