TEXT ANALYSIS BY MEANS OF ZIPF’S LAW

Presentation transcript:

TEXT ANALYSIS BY MEANS OF ZIPF'S LAW
Alina Andersen, Giedre Budienė, Alytis Gruodis
Vilnius Business College, Lithuania

The project is aimed at the analysis of human language. The process involves artificial intelligence, language recognition, and recognition of the processes by which texts are created.

Formulation of the project task: formalized analysis and recognition of human language.
The task in the middle of the 20th century: conversation between human and robot, i.e. developing language-programming applications.
The task in the 21st century: analysing language using artificial intelligence.

Specific questions to be answered: does a graphical 'fingerprint' exist for any text? Such a graphical 'fingerprint' would be related to the corpus (the set of items) and could be analyzed formally by a computer program.

G. K. Zipf, Human Behavior and the Principle of Least Effort (1949). Human nature wants the greatest outcome for the least amount of work: the law states that people will naturally choose the path of least resistance or "effort".

Zipf's law (1): the frequency of any word in a large corpus is inversely proportional to its rank in the frequency table. Only a few words are used very often; many or most are used rarely.

Zipf's law (2): count the number of words in an English book and order the words by their frequency of appearance. Zipf found that the most frequent word appears twice as often as the next most popular word, three times as often as the third most popular, and so on.
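This counting procedure is straightforward to sketch in Python. A minimal version follows; book.txt is a placeholder path for any plain-text English book, and the tokenization rule is an assumption, not something specified on the slides:

```python
import re
from collections import Counter

# Read any plain-text English book (placeholder path) and fold case.
with open("book.txt", encoding="utf-8") as f:
    text = f.read().lower()

# Assumed tokenization: runs of letters, optionally with an internal apostrophe.
words = re.findall(r"[a-z]+(?:'[a-z]+)?", text)

# Count the words and order them by frequency of appearance.
ranked = Counter(words).most_common()

# Zipf's prediction: the k-th ranked word appears about f(1)/k times.
f1 = ranked[0][1]
for k, (word, f) in enumerate(ranked[:10], start=1):
    print(f"rank {k:2d}  {word:<10}  observed {f:6d}  predicted {f1 // k:6d}")
```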

Zipf's law (3): the Brown Corpus. "The" is the most frequently occurring word and accounts for nearly 7% of all word occurrences (69,971 out of slightly over 1 million words). "Of" is the second-ranked word and accounts for slightly over 3.5% of words (36,411 occurrences); "and" has 28,852 occurrences. Only 135 vocabulary items are needed to account for half the Brown Corpus.
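These figures can be checked with NLTK, assuming the Brown Corpus has been downloaded; note that brown.words() also yields punctuation tokens, so the percentages come out slightly different from the slide's case-folded counts:

```python
import nltk
from collections import Counter
from nltk.corpus import brown

nltk.download("brown", quiet=True)  # one-time corpus download

words = [w.lower() for w in brown.words()]
counts = Counter(words)
total = len(words)
for word, c in counts.most_common(3):
    print(f"{word}: {c} occurrences ({100 * c / total:.2f}% of {total} tokens)")
```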

Zipf's law (4): [Figure: rank-frequency plot for the Brown corpus]

Zipf's law (5): ranking of items, k → f(k):
1 → f(1)
2 → f(2)
3 → f(3)
…

George Kingsley Zipf: frequency is inversely proportional to rank, i.e. the relation is linear on log-log axes.

Power law: hyperbolic and inverse functions.
Zipf: f(k) ≈ C / k^α, a hyperbolic function of rank k.
Heaps: V(n) ≈ K · n^β, vocabulary size as a power of text length n.
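A least-squares fit of log f(k) against log k gives an estimate of the Zipf exponent α. A minimal NumPy sketch (the function name and the ideal test data are illustrative, not from the slides):

```python
import numpy as np

def zipf_exponent(frequencies):
    """Estimate alpha in f(k) ~ C / k**alpha by a linear fit on log-log axes.

    `frequencies` must be sorted in descending order.
    """
    ranks = np.arange(1, len(frequencies) + 1)
    slope, _intercept = np.polyfit(np.log(ranks), np.log(frequencies), 1)
    return -slope  # Zipf's law predicts a slope near -1

# Sanity check with ideal Zipfian counts f(k) = 1000 / k:
ideal = [1000 / k for k in range(1, 101)]
print(zipf_exponent(ideal))  # ~1.0
```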

Wikipedia content: [Figure: rank-frequency distribution of Wikipedia content]

Text source: www.gutenberg.org

The Picture of Dorian Gray, by Oscar Wilde. Chapter 1:
"The studio was filled with the rich odour of roses, and when the light summer wind stirred amidst the trees of the garden, there came through the open door the heavy scent of the lilac, or the more delicate perfume of the pink-flowering thorn. From the corner of the divan of Persian saddle-bags on which he was lying, smoking, as was his custom, innumerable cigarettes, Lord Henry Wotton could just catch the gleam of the honey-sweet and honey-coloured blossoms of a laburnum, whose tremulous branches seemed hardly able to bear the burden of a beauty so flamelike as theirs; …"

Analysis (1), as sketched below:
1. All words identified (a word is a combination of letters followed by a space).
2. Grammatical forms (function words) not excluded.
3. Syntactical and punctuation marks removed.
4. 's not excluded.
5. Decapitalization performed.
6. Calculation of word frequency and ranking performed.
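A sketch of these six steps in Python, assuming the novel's plain text has been saved to a placeholder file dorian_gray.txt; this illustrates the procedure, not the authors' actual program:

```python
import re
from collections import Counter

with open("dorian_gray.txt", encoding="utf-8") as f:
    text = f.read()

# Steps 3 and 5: strip syntactical/punctuation marks (apostrophes are kept,
# so 's survives, per step 4) and decapitalize.
text = re.sub(r"[^\w'\s]", " ", text).lower()

# Step 1: a word is a run of characters delimited by spaces.
words = text.split()

# Steps 2 and 4: nothing is excluded; function words and 's stay in.

# Step 6: word-frequency calculation and ranking.
ranking = Counter(words).most_common()
print(f"{len(words)} tokens, {len(ranking)} distinct items")
print(ranking[:10])
```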

Analysis (2): frequency dependence on item rank. In total about 60,000 items (word tokens); roughly 200 words are used frequently, the other words are used rarely; the text is rather difficult for an average reader.

Properties of power-law distributions. Findings: the distribution is asymmetric; most words are very rare; half the words in the corpus appear only once, and such words are called hapax legomena (Greek for "read only once"). The diagram shows a "heavy-tailed" distribution, since most of the probability mass is in the "tail".
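Counting the hapax legomena is a one-liner given a frequency table; a sketch with a placeholder token list:

```python
from collections import Counter

tokens = "a b a c d e f g c h".split()  # placeholder token list
counts = Counter(tokens)
hapaxes = [w for w, c in counts.items() if c == 1]
print(f"{len(hapaxes)} of {len(counts)} distinct words are hapax legomena")
```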

Findings from the analysis of the novel by Zipf's law: no single straight-line distribution fits the whole corpus; the vocabulary is small; a significant part of the items in the corpus appear only once. How can the ratio of order to disorder be estimated?

Shannon function: Shannon entropy is the average unpredictability in a random variable, equivalent to its information content: H = −Σᵢ pᵢ log₂ pᵢ.
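A direct implementation of this formula over word frequencies (a sketch; the sample sentence is a placeholder for the novel's token list):

```python
import math
from collections import Counter

def shannon_entropy(tokens):
    """H = -sum(p_i * log2(p_i)) over the relative word frequencies p_i."""
    counts = Counter(tokens)
    n = len(tokens)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Placeholder token list; the novel's tokens would go here.
print(shannon_entropy("the picture of dorian gray the picture the".split()))
```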

Shannon entropy as a function of the number of items N. [Figure: entropy growth curve]

Shannon entropy: the presented dependence is similar to the Heaps function (vocabulary growth); several exceptions are present. How can these two dependencies, the Shannon function and the Zipf function, be combined?
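One way to see the comparison concretely is to compute the running vocabulary size V(N) and the running entropy H(N) over growing prefixes of the token stream. A sketch (the toy token list is a placeholder; a real novel's tokens would go in its place):

```python
import math
from collections import Counter

def running_curves(tokens, step=1000):
    """Yield (N, V(N), H(N)) over growing prefixes of the token stream."""
    counts = Counter()
    for n, token in enumerate(tokens, start=1):
        counts[token] += 1
        if n % step == 0:
            h = -sum((c / n) * math.log2(c / n) for c in counts.values())
            yield n, len(counts), h

# Toy placeholder stream; a real novel's token list would go here.
tokens = ("the quick brown fox jumps over the lazy dog " * 500).split()
for n, v, h in running_curves(tokens):
    print(f"N={n:5d}  V(N)={v:3d}  H(N)={h:.3f}")
```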

ZS-dependency (1): the dependency of the Zipf constant on the corresponding Shannon value for the current item. Entropy increases with an increasing number of items; while entropy increases, the Zipf function decreases. Intuitively perceived stylistic beauty is proportional to the number of rarely used words.

ZS-dependency (2): several sophisticated figures; reswitching of content and context; use of new vocabulary.

ZS-dependency (3): reswitching of content; a new paragraph; new morphological content; PORTRAIT vs. LANDSCAPE.

Conclusions:
For a very large corpus, the Zipf curve for English has two slopes: α ≈ 1 for rank < 5,000 and α ≈ 2 for rank > 5,000.
Several genres (lexical and morphological forms) are recognizable using mathematical tools.
Zipf and Shannon functions: the significance of adjustable parameters for a limited corpus.
Academic English versus Modern English: entropy-related problems.

Thank you!