1 Statistical Properties for Text Rong Jin

2 Statistical Properties of Text
- How is the frequency of different words distributed?
- How fast does vocabulary size grow with the size of a corpus?
- Such factors affect the performance of information retrieval and can be used to select appropriate term weights and other aspects of an IR system.

3 Sample Word Frequency Data (from B. Croft, UMass)

4 Zipf's Law
- A few words occur very often: the 2 most frequent words can account for 10% of occurrences, the top 6 for 20%, and the top 50 for 50%.
- Many words are infrequent.
- "Principle of Least Effort": it is easier to repeat words than to coin new ones.

5 Zipf's Law
Data collected from the WSJ 1987 collection.

6 Zipf's Law
- Rank × Frequency ≈ Constant
- Let p_r = (number of occurrences of the word of rank r) / N, where N is the total number of word occurrences; p_r is the probability that a word chosen randomly from the text is the word of rank r.
- Then r · p_r = A, with A ≈ 0.1.
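The relation r · p_r ≈ A is easy to check empirically. Below is a minimal Python sketch (the corpus.txt file name and whitespace tokenization are placeholder assumptions); on newswire text such as WSJ 1987, the product stays near 0.1 for mid-range ranks.

```python
from collections import Counter

def zipf_check(tokens, top=20):
    """Print r * p_r for the top-ranked words; Zipf predicts ~0.1."""
    counts = Counter(tokens)
    n_total = sum(counts.values())        # N: total word occurrences
    for r, (word, freq) in enumerate(counts.most_common(top), start=1):
        p_r = freq / n_total              # probability of the rank-r word
        print(f"rank {r:3d}  {word:15s}  r*p_r = {r * p_r:.3f}")

# Assumes a plain-text corpus; whitespace tokenization is a simplification.
tokens = open("corpus.txt").read().lower().split()
zipf_check(tokens)
```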

7 Zipf and Term Weighting
- Both extremely common and extremely uncommon words are not very useful for indexing.
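As a rough illustration of that idea, the sketch below (function name and thresholds are invented for this example) keeps only terms whose document frequency falls in a middle band, discarding both near-ubiquitous and near-singleton terms.

```python
from collections import Counter

def mid_frequency_terms(docs, min_df=2, max_df_frac=0.9):
    """Keep terms that are neither too rare (df < min_df) nor too
    common (df > max_df_frac of all documents)."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))               # document frequency, not term frequency
    n_docs = len(docs)
    return {t for t, d in df.items()
            if d >= min_df and d / n_docs <= max_df_frac}

docs = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "ran"]]
print(mid_frequency_terms(docs))   # {'cat', 'ran'}: "the" and singletons dropped
```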

8 Predicting Occurrence Frequencies
- By Zipf's law, a word appearing n times has rank r_n = AN/n.
- Several words may occur n times; assume rank r_n applies to the last of these.
- Therefore, r_n words occur n or more times and r_{n+1} words occur n+1 or more times.
- So the number of words appearing exactly n times is:
  I_n = r_n − r_{n+1} = AN/n − AN/(n+1) = AN / (n(n+1))

9 Predicting Word Frequencies (cont.)
- Assume the highest-ranked term occurs once and therefore has rank D = AN/1 = AN (so D is the vocabulary size).
- The fraction of words with frequency n is then:
  I_n / D = 1 / (n(n+1))
- The fraction of words appearing only once is therefore 1/(1·2) = 1/2.
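A quick empirical check of the 1/(n(n+1)) prediction, sketched in Python (corpus file and tokenization are placeholder assumptions):

```python
from collections import Counter

def frequency_of_frequencies(tokens, max_n=5):
    """Compare the observed fraction of words occurring exactly n times
    against the Zipf-derived prediction 1 / (n * (n + 1))."""
    counts = Counter(tokens)
    freq_of_freq = Counter(counts.values())   # n -> number of words with count n
    vocab_size = len(counts)
    for n in range(1, max_n + 1):
        observed = freq_of_freq[n] / vocab_size
        predicted = 1 / (n * (n + 1))
        print(f"n={n}: observed {observed:.3f}  predicted {predicted:.3f}")

tokens = open("corpus.txt").read().lower().split()
frequency_of_frequencies(tokens)
```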

10 Occurrence Frequency Data (from B. Croft, UMass)

11 Does Real Data Fit Zipf's Law?
- A law of the form y = k·x^c is called a power law.
- Zipf's law is a power law with c = −1.
- On a log-log plot, a power law gives a straight line with slope c.
- Zipf's law is quite accurate except at very high and very low ranks.

12 Fit to Zipf for Brown Corpus
- How do we fit a straight line in a log-log plot?
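One standard answer: take logs of both rank and frequency and run ordinary least squares; the slope is the power-law exponent. A minimal sketch with NumPy (the corpus file name is a placeholder):

```python
import numpy as np
from collections import Counter

def fit_power_law(tokens):
    """Fit log f = log k + c * log r by least squares.
    Zipf's law predicts a slope c close to -1."""
    counts = Counter(tokens)
    freqs = np.array(sorted(counts.values(), reverse=True), dtype=float)
    ranks = np.arange(1, len(freqs) + 1, dtype=float)
    c, log_k = np.polyfit(np.log(ranks), np.log(freqs), 1)
    return c, np.exp(log_k)

tokens = open("brown.txt").read().lower().split()
c, k = fit_power_law(tokens)
print(f"slope c = {c:.2f} (Zipf predicts -1), k = {k:.0f}")
```

In practice the extreme high and low ranks are often excluded from the fit, since (as slide 11 notes) that is exactly where Zipf's law deviates.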

13 Fit to Zipf for Brown Corpus

14 Mandelbrot Fit
- Mandelbrot's generalization, f = P / (r + ρ)^B, adds parameters that improve the fit at very high and very low ranks.

15 Explanations for Zipf's Law
- Zipf's own explanation was his "principle of least effort": a balance between the speaker's desire for a small vocabulary and the hearer's desire for a large one.
- Simon's explanation is "the rich get richer."
- Li (1992) showed that even randomly typing letters (including a space character) generates "words" with a Zipfian distribution.
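Li's random-typing result is easy to reproduce. Below is a small simulation (alphabet size, space probability, and corpus length are arbitrary choices for illustration): letters are typed uniformly at random, a space ends the current word, and the resulting rank-frequency curve comes out roughly Zipfian.

```python
import random
from collections import Counter

def random_typing(n_chars=200_000, alphabet="abcdefghij",
                  p_space=0.2, seed=0):
    """Type characters uniformly at random; a space ends the current
    'word'. Returns (word, count) pairs sorted by frequency."""
    rng = random.Random(seed)
    chars = [" " if rng.random() < p_space else rng.choice(alphabet)
             for _ in range(n_chars)]
    return Counter("".join(chars).split()).most_common()

ranked = random_typing()
for r in (1, 10, 100, 1000):
    if r <= len(ranked):
        word, freq = ranked[r - 1]
        print(f"rank {r:4d}: {word!r} occurs {freq} times")
```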

16 Zipf's Law Impact on IR
- Good news: stopwords account for a large fraction of the text, so eliminating them greatly reduces inverted-index storage costs.
- Bad news: for most words, gathering sufficient data for meaningful statistical analysis (e.g., correlation analysis for query expansion) is difficult, since they are extremely rare.

17 Prevalence of Zipfian Laws
- Many phenomena exhibit a Zipfian distribution:
  - Population of cities
  - Wealth of individuals (discovered by the sociologist/economist Pareto in 1909)
  - Popularity of books, movies, music, web pages, etc.
  - Popularity of consumer products (Chris Anderson's "long tail")

18 Vocabulary Growth
- How does the size of the overall vocabulary (the number of unique words) grow with the size of the corpus?
- This determines how the index size will scale with the size of the corpus.
- The vocabulary is not really upper-bounded, due to proper names, typos, etc.

19 Heaps’ Law

20 Heaps' Law
- If V is the size of the vocabulary and n is the length of the corpus in words, then:
  V = K · n^β
- Typical constants: K ≈ 10–100, β ≈ 0.4–0.6 (approximately square-root growth)
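The constants can be estimated the same way as the Zipf slope: record the vocabulary size at growing prefixes of the corpus and regress log V on log n. A sketch under the same placeholder assumptions (corpus file, checkpoint count, and function name are invented for illustration):

```python
import numpy as np

def fit_heaps(tokens, n_points=20):
    """Estimate K and beta in Heaps' law V = K * n^beta via linear
    regression in log-log space over growing corpus prefixes."""
    checkpoints = set(np.linspace(1, len(tokens), n_points, dtype=int).tolist())
    seen, vocab_at = set(), {}
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if i in checkpoints:
            vocab_at[i] = len(seen)       # vocabulary size after i tokens
    ns = np.array(sorted(vocab_at), dtype=float)
    vs = np.array([vocab_at[int(n)] for n in ns], dtype=float)
    beta, log_k = np.polyfit(np.log(ns), np.log(vs), 1)
    return np.exp(log_k), beta

tokens = open("corpus.txt").read().lower().split()
K, beta = fit_heaps(tokens)
print(f"K = {K:.1f}, beta = {beta:.2f}")  # expect K ~ 10-100, beta ~ 0.4-0.6
```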