IN350: Text Properties, Zipf’s Law, and Heaps’ Law. Judith A. Molka-Danielsen. September 12, 2002. Notes are based on Chapter 6 of the Article Collection.


IN350: Text Properties, Zipf’s Law, and Heaps’ Law. Judith A. Molka-Danielsen. September 12, 2002. Notes are based on Chapter 6 of the Article Collection. More notes to follow in class.

Review: Michael Spring’s comments on "The Document Processing Revolution" (MBS). How do you define a document? Revolutions: reprographics, communications. Transition: create, compose, render; WWW/XML. New processing model for e-docs and forms. XML, a first look: namespace rules allow for modular documents and reuse. DTDs are historical; Schema is the future. XPath and pointers are used for accessing the document as a tree of nodes. XSL handles style and rendering; XSLT uses XSL and formatting objects (FO) to transform documents into multiple forms. E-business: B2B applications will use concepts, specifications, and tools based on the XML family.

Modeling Natural Languages and Compression. Information theory: the amount of information in a text is quantified by its entropy, a measure of information uncertainty. If one symbol appears all the time it does not convey much information. Higher-entropy text cannot be compressed as much as lower-entropy text. Symbols: there are 26 letters in the English alphabet and 29 in the Norwegian alphabet. The frequency of occurrence of symbols differs between languages; in English, ’e’ has the highest occurrence. Variable-length encoding schemes such as Huffman coding represent symbols with codes whose lengths depend on their frequency of occurrence. Compression for transferring data can therefore be based on the frequency of symbols. More on this in another lecture.
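As a rough illustration (not part of the original notes), the following Python sketch counts symbol frequencies in a string and estimates its Shannon entropy in bits per symbol; the sample text is only a placeholder.

```python
from collections import Counter
import math

def shannon_entropy(text: str) -> float:
    """Entropy in bits per symbol, estimated from symbol frequencies."""
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

sample = "the quick brown fox jumps over the lazy dog"
print(f"entropy: {shannon_entropy(sample):.3f} bits/symbol")
# Low-entropy text (e.g. 'aaaa...') compresses far better than high-entropy text.
```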

Modeling Natural Languages and Compression. Creation of indices is often based on the frequency of occurrence of words within a text. Zipf’s law, named after the Harvard linguistics professor George Kingsley Zipf, is the observation that the frequency of occurrence of some event (P), as a function of its rank (i) when the rank is determined by that frequency of occurrence, is a power-law function P_i ~ 1/i^a with the exponent a close to unity. Zipf’s distribution: the frequencies of words in a document approximately follow this distribution. In the English language, the probability of encountering the i-th most common word is given roughly by Zipf’s law for i up to 1000 or so. The law breaks down for less frequent words, since the harmonic series diverges. Stopwords: a few hundred words take up 50% of the text. These can be disregarded to reduce the space of indices.

Figure: Zipf’s and Heaps’ distributions. (a) Distribution of sorted word frequencies (Zipf’s law); (b) growth of the size of the vocabulary (Heaps’ law).

Sample Word Frequency Data (from B. Croft, UMass)

Zipf’s Law. Rank (r): the numerical position of a word in a list sorted by decreasing frequency (f). Zipf (1949) “discovered” that: if the probability of the word of rank r is p_r and N is the total number of word occurrences, then p_r = f/N ≅ A/r, i.e. r · p_r ≅ A, for a roughly corpus-independent constant A ≈ 0.1 in English text.
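A quick worked example using A ≈ 0.1 (illustrative numbers, not from the original notes): the most frequent word would cover roughly 10% of all tokens, the 2nd roughly 5%, the 10th roughly 1%, and the 100th roughly 0.1%; summing A/r over the first few thousand ranks (≈ 0.1 · ln 10000 ≈ 0.9) then accounts for most of the text.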

Occurrence Frequency Data (from B. Croft, UMass)

Does Real Data Fit Zipf’s Law? A law of the form y = k·x^c is called a power law. Zipf’s law is a power law with c = –1. On a log-log plot, power laws give a straight line with slope c. Zipf’s law is quite accurate except for very high and very low ranks.
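The straight-line behavior on a log-log scale suggests a simple empirical check. Below is a minimal sketch (assuming numpy is available; the inline text is only a stand-in, a real corpus is needed for a clean fit) that ranks word frequencies and fits the slope c:

```python
import re
from collections import Counter
import numpy as np

# In practice read a large corpus from disk; this inline text is only a stand-in.
text = "the cat sat on the mat and the dog sat on the log and the cat saw the dog"
words = re.findall(r"[a-z']+", text.lower())

freqs = np.array(sorted(Counter(words).values(), reverse=True), dtype=float)
ranks = np.arange(1, len(freqs) + 1)

# Fit log f = c*log r + log k; the slope c should be near -1 for Zipfian data.
c, log_k = np.polyfit(np.log(ranks), np.log(freqs), 1)
print(f"estimated exponent c = {c:.2f}, k = {np.exp(log_k):.1f}")
```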

Fit to Zipf for the Brown Corpus (k = 100,000).

Mandelbrot distribution: experimental data suggests that a better model is f_i = k / (c + i)^a, where c is an additional parameter and k is such that all frequencies add up to n. Since the distribution of words is very skewed (that is, there are a few hundred words which take up 50% of the text), words that are too frequent, such as stopwords, can be disregarded. A stopword is a word which does not carry meaning in natural language and therefore can be ignored (made not searchable), such as ’a’, ’the’, and ’by’. The most frequent words are stopwords. This allows a significant reduction in the space overhead of indices for natural language texts.

Mandelbrot (1954) Correction. The following more general form gives a bit better fit: p_r ≅ P / (r + ρ)^B.
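A minimal sketch of fitting this three-parameter form with scipy’s curve_fit. The "observed" frequencies here are synthetic, generated from known parameters plus noise just to demonstrate the fitting step; with real data you would pass in your corpus’s rank-frequency counts instead:

```python
import numpy as np
from scipy.optimize import curve_fit

def mandelbrot(r, P, rho, B):
    """Generalized Zipf (Mandelbrot) rank-frequency curve."""
    return P / (r + rho) ** B

# Synthetic 'observed' data from known parameters plus multiplicative noise.
rng = np.random.default_rng(0)
ranks = np.arange(1, 5001)
observed = mandelbrot(ranks, P=1e5, rho=100, B=1.15) * rng.lognormal(0.0, 0.05, ranks.size)

# Recover the parameters; p0 is a rough initial guess.
(P, rho, B), _ = curve_fit(mandelbrot, ranks, observed, p0=[observed[0], 10, 1.0])
print(f"P = {P:.3g}, rho = {rho:.1f}, B = {B:.3f}")
```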

Mandelbrot Fit: P = , B = 1.15, ρ = 100.

Explanations for Zipf’s Law. Zipf’s explanation was his “principle of least effort”: a balance between the speaker’s desire for a small vocabulary and the hearer’s desire for a large one. There was a debate between Mandelbrot and H. Simon over the explanation. Li (1992) shows that just randomly typing letters, including a space, will generate “words” with a Zipfian distribution.

Zipf’s Law Impact on IR. Good news: stopwords will account for a large fraction of the text, so eliminating them greatly reduces inverted-index storage costs. Bad news: for most words, gathering sufficient data for meaningful statistical analysis (e.g., correlation analysis for query expansion) is difficult, since those words are extremely rare.
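To see the “good news” concretely, a small sketch (the stopword set below is a tiny illustrative subset, not a standard list) that measures what fraction of tokens a handful of stopwords covers:

```python
import re
from collections import Counter

# Tiny illustrative stopword set; real systems use lists of a few hundred words.
STOPWORDS = {"the", "of", "and", "a", "to", "in", "is", "it", "on", "that"}

text = "the cat sat on the mat and the dog sat on the log"  # stand-in corpus
tokens = re.findall(r"[a-z']+", text.lower())
counts = Counter(tokens)

stop_tokens = sum(c for w, c in counts.items() if w in STOPWORDS)
print(f"stopwords cover {stop_tokens / len(tokens):.0%} of all tokens")
```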

Vocabulary Growth How does the size of the overall vocabulary (number of unique words) grow with the size of the corpus? This determines how the size of the inverted index will scale with the size of the corpus. Vocabulary not really upper-bounded due to proper names, typos, etc.

Heaps’ Law. Heaps’ law is used to predict the growth of the vocabulary size V as a function of the text size n (in words): V = K · n^β. Typical constants: 10 ≤ K ≤ 100 and 0.4 ≤ β ≤ 0.6 (approximately square-root growth).
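A rough worked example (values purely illustrative): with K = 50 and β = 0.5, a text of n = 1,000,000 words would have a vocabulary of about V = 50 · 1,000,000^0.5 = 50 · 1,000 = 50,000 distinct words; quadrupling the text to 4,000,000 words only doubles the vocabulary to about 100,000.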

Heaps’ Law Data

Modeling Natural Languages and Compression. Document vocabulary is the number of distinct (different) words within a document. Heaps’ law is shown in the right-hand graph of Figure 6.2 in the readings. In general, the number of new words found in a document grows sub-linearly with increasing text size (as the power function above, with β < 1). So, if you have an encyclopedia, probably most of the words are found in the first volume. Heaps’ law also implies that the length of the words in the vocabulary increases logarithmically with the text size, while the average word length in the text stays constant, because there is a greater occurrence of the shorter words.
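A minimal sketch of checking Heaps’ law empirically (again, the inline text is only a stand-in for a large corpus): count the vocabulary size at growing prefixes of the token stream and fit K and β on a log-log scale.

```python
import re
import numpy as np

# In practice read a large corpus; a short text gives only a crude fit.
text = "the cat sat on the mat and the dog sat on the log while the cat slept"
tokens = re.findall(r"[a-z']+", text.lower())

# Vocabulary size V(n) after the first n tokens.
seen, vocab_sizes = set(), []
for tok in tokens:
    seen.add(tok)
    vocab_sizes.append(len(seen))

n = np.arange(1, len(tokens) + 1)
V = np.array(vocab_sizes, dtype=float)

# Fit log V = beta * log n + log K.
beta, log_K = np.polyfit(np.log(n), np.log(V), 1)
print(f"K = {np.exp(log_K):.1f}, beta = {beta:.2f}")
```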

Relate text size to indexing. Review: what is the purpose of an index and how is it made? Based on "Appendix A: File Organization and Storage Structures". Text processing: large text collections are often indexed using inverted files. A vector of distinct words forms a vocabulary, with pointers to a list of all the documents that contain each word. Indexes can be compressed to allow quicker access. (Discuss this with Ch. 7.)
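To make the inverted-file idea concrete, a minimal sketch (document IDs and texts are made up for illustration) that builds a vocabulary mapping each term to the sorted list of documents containing it:

```python
import re
from collections import defaultdict

# Hypothetical toy collection: doc id -> text.
docs = {
    1: "Zipf's law relates word rank and frequency",
    2: "Heaps' law relates text size and vocabulary size",
    3: "inverted files index the vocabulary of a text collection",
}

# Inverted file: term -> sorted list of doc ids containing the term (postings list).
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in re.findall(r"[a-z']+", text.lower()):
        index[term].add(doc_id)

postings = {term: sorted(ids) for term, ids in index.items()}
print(postings["vocabulary"])   # -> [2, 3]
print(postings["law"])          # -> [1, 2]
```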