Exploring Text: Zipf’s Law and Heaps’ Law

Zipf’s and Heaps’ distributions: (a) distribution of sorted word frequencies (Zipf’s law); (b) distribution of the size of the vocabulary (Heaps’ law).

Sample Word Frequency Data (from B. Croft, UMass)

Predicting Occurrence Frequencies By Zipf’s law, a word appearing n times has rank r_n = AN/n, where N is the total number of word occurrences and A is a constant. If several words occur n times, assume the rank r_n applies to the last of these. Therefore, r_n words occur n or more times and r_{n+1} words occur n+1 or more times. So the number of words appearing exactly n times is r_n − r_{n+1} = AN/n − AN/(n+1) = AN/(n(n+1)). Since the total number of distinct words is r_1 = AN, the fraction of words with frequency n is 1/(n(n+1)). The fraction of words appearing only once is therefore 1/(1·2) = ½.
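The arithmetic above can be checked against a real word count, as in this minimal sketch; the corpus file name is hypothetical, and the 1/(n(n+1)) prediction follows from the idealized Zipf model rather than from any particular collection.

from collections import Counter

def predicted_fraction(n):
    # Fraction of the vocabulary predicted to occur exactly n times under Zipf's law
    return 1.0 / (n * (n + 1))

def observed_fraction(tokens, n):
    # Fraction of distinct words that occur exactly n times in the given token list
    counts = Counter(tokens)
    return sum(1 for c in counts.values() if c == n) / len(counts)

if __name__ == "__main__":
    # "corpus.txt" is a hypothetical plain-text file
    tokens = open("corpus.txt", encoding="utf-8").read().lower().split()
    for n in (1, 2, 3, 4):
        print(f"n={n}: predicted {predicted_fraction(n):.3f}, observed {observed_fraction(tokens, n):.3f}")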

Zipf’s Law Impact on Language Analysis Good news: stopwords account for a large fraction of the text, so eliminating them greatly reduces the amount of text that must be indexed. Bad news: for most words, gathering sufficient data for meaningful statistical analysis (e.g., correlation analysis for query expansion) is difficult, since they are extremely rare.

Vocabulary Growth How does the size of the overall vocabulary (number of unique words) grow with the size of the corpus? This determines how the size of the inverted index will scale with the size of the corpus. Vocabulary not really upper-bounded due to proper names, typos, etc.

Heaps’ Law If V is the size of the vocabulary and n is the length of the corpus in words, then V = K·n^β. Typical constants: K ≈ 10 to 100, β ≈ 0.4 to 0.6 (approximately square-root growth).
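As a quick illustration, here is a minimal sketch of the formula with assumed constants; K = 30 and β = 0.5 are illustrative values inside the typical ranges above, not measurements from any particular corpus.

def heaps_vocabulary_size(n_tokens, K=30.0, beta=0.5):
    # V = K * n^beta: predicted number of distinct words after n_tokens words of text.
    # K and beta are illustrative; real values must be fitted to the corpus at hand.
    return K * (n_tokens ** beta)

for n in (10_000, 1_000_000, 100_000_000):
    print(f"{n:>12,} tokens -> about {heaps_vocabulary_size(n):,.0f} distinct words")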

Heaps’ Law Data

Occurrence Frequency Data (from B. Croft, UMass)

Text properties (formalized) Sample word frequency data

Zipf’s Law We use a few words very frequently and rarely use most other words. The product of the frequency of a word and its rank is approximately the same as the product of the frequency and rank of another word. Deviations usually occur at the beginning and at the end of the table/graph.

Zipf’s Law Zipf’s law states that, given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc. For example, in the Brown Corpus of American English text, the word “the” is the most frequently occurring word and by itself accounts for nearly 7% of all word occurrences (69,971 out of slightly over 1 million). The second-place word “of” accounts for slightly over 3.5% of words (36,411 occurrences), followed by “and” (28,852 occurrences). Only 135 vocabulary items are needed to account for half the Brown Corpus.

Zipf’s Law The most common 20 words in English are listed in the following table. The table is based on the Brown Corpus, a careful study of a million words from a wide variety of sources including newspapers, books, magazines, fiction, government documents, comedy and academic publications.

Table of Top 20 frequently occurring words in English (Brown Corpus), ranked by frequency: 1. the, 2. of, 3. and, 4. to, 5. a, 6. in, 7. that, 8. is, 9. was, 10. he, 11. for, 12. it, 13. with, 14. as, 15. his, 16. on, 17. be, 18. at, 19. by, 20. I. (The original table also listed each word’s frequency, percentage frequency, and theoretical Zipf distribution value.)

Plot of Top 20 frequently occurring words in English

Zipf’s Law Rank (r): the numerical position of a word in a list sorted by decreasing frequency (f). Zipf (1949) “discovered” that the product f × r is approximately constant. If the probability of the word of rank r is p_r and N is the total number of word occurrences, then p_r = f/N ≈ A/r, where A is a constant (about 0.1 for English text).

Does Real Data Fit Zipf’s Law? A law of the form y = k·x^c is called a power law. Zipf’s law is a power law with c = −1. On a log-log plot, power laws give a straight line with slope c. Zipf’s law is quite accurate except at very high and very low ranks.
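One way to check this on a corpus of your own is to fit a straight line to the ranked frequencies in log-log space and see how close the slope is to −1. The sketch below does this with an ordinary least-squares fit; the corpus file name is hypothetical.

import math
from collections import Counter

def fit_power_law(frequencies):
    # frequencies: word counts sorted in decreasing order (rank 1 first).
    # Fits log(f) = log(k) + c*log(rank) by least squares and returns (c, k).
    xs = [math.log(r) for r in range(1, len(frequencies) + 1)]
    ys = [math.log(f) for f in frequencies]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, math.exp(mean_y - slope * mean_x)

tokens = open("corpus.txt", encoding="utf-8").read().lower().split()  # hypothetical file
freqs = sorted(Counter(tokens).values(), reverse=True)
c, k = fit_power_law(freqs)
print(f"fitted slope c = {c:.2f} (Zipf predicts about -1), k = {k:,.0f}")

As the slide notes, the fit is usually better if the very highest and lowest ranks are excluded before fitting.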

Top 2000 English words using a log-log scale

Fit to Zipf for Brown Corpus (k = 100,000)

Plot of word frequency in a Wikipedia dump The plot is made in log-log coordinates: x is the rank of a word in the frequency table; y is the total number of the word’s occurrences. The most popular words are “the”, “of”, and “and”, as expected.

Zipf’s Law The same relationship occurs in many other rankings unrelated to language, such as:
– corporation sizes
– calls to computer operating systems
– colors in images (the basis of most approaches to image compression)
– city populations (a small number of large cities, a larger number of smaller cities)
– wealth distribution (a small number of people have large amounts of money, large numbers of people have small amounts of money)
– popularity of web pages on websites

Zipf’s Law: Authorship tests Textual analysis can be used to demonstrate the authenticity of disputed works. Each author has their own preference for using certain words, so one technique compares the occurrence of different words in the uncertain text with their occurrence in an author’s known works. The counted words are ranked (the most common is number one and the rarest is last) and then plotted on a graph with their frequency of occurrence on the vertical axis. Comparing the Zipf graphs of two different pieces of writing, paying attention to the position of selected words, reveals whether they were both composed by the same author.
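For illustration only, a comparison along these lines might look like the following sketch; the file names and the set of marker words are hypothetical, and a real authorship study would use far more careful tokenization and statistics.

from collections import Counter

MARKER_WORDS = ["the", "of", "and", "to", "a", "in", "that", "is"]  # illustrative choice

def relative_frequencies(path):
    # Relative frequency of each marker word in the given text file
    tokens = open(path, encoding="utf-8").read().lower().split()
    counts = Counter(tokens)
    total = len(tokens)
    return {w: counts[w] / total for w in MARKER_WORDS}

known = relative_frequencies("known_author.txt")      # hypothetical file names
disputed = relative_frequencies("disputed_text.txt")
for w in MARKER_WORDS:
    print(f"{w:>5}: known {known[w]:.4f}   disputed {disputed[w]:.4f}")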

Heaps’ Law

A typical Heaps-law plot The x-axis represents the text size; the y-axis represents the number of distinct vocabulary elements present in the text. Compare the values of the two axes.
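A plot like this can be produced by recording the vocabulary size at increasing prefixes of the corpus, roughly as in this sketch; the corpus file and the sampling step are hypothetical.

def vocabulary_growth(tokens, step=10_000):
    # Returns (text size, distinct words seen so far) pairs, sampled every `step` tokens
    seen, points = set(), []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if i % step == 0:
            points.append((i, len(seen)))
    return points

tokens = open("corpus.txt", encoding="utf-8").read().lower().split()  # hypothetical file
for n, v in vocabulary_growth(tokens)[:5]:
    print(f"{n:>8,} tokens  {v:>7,} distinct words")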

AP89 Example

Heaps’ Law Predictions Predictions for TREC collections are accurate for large numbers of words – e.g., when the first 10,879,522 words of the AP89 collection are scanned, the prediction is 100,151 unique words and the actual number is 100,024. Predictions for small numbers of words (i.e., < 1000) are much worse.
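As a rough check of the AP89 numbers above, the prediction can be reproduced with Heaps-law parameters of roughly K ≈ 63 and β ≈ 0.455; these values are often quoted for AP89 but are an assumption here, not something stated on the slide.

K, beta = 63.0, 0.455          # assumed AP89 parameters, treat as illustrative
n_tokens = 10_879_522          # words scanned, from the slide
predicted = K * n_tokens ** beta
# Prints a value on the order of 100,000, in the same ballpark as the
# 100,151 predicted / 100,024 observed figures quoted above
print(f"predicted distinct words: {predicted:,.0f}")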

GOV2 (Web) Example

Web Example Heaps’ Law works with very large corpora – new words occur even after seeing 30 million! – and the parameter values are different from those of the typical newswire corpora used in competitions. New words come from a variety of sources: spelling errors, invented words (e.g., product and company names), code, other languages, addresses, etc. Search engines must deal with these large and growing vocabularies.