Quantitative aspects of literary texts Adam J. Callahan & Gary E. Davis Department of Mathematics University of Massachusetts.

Quantitative aspects of literary texts
Adam J. Callahan & Gary E. Davis
Department of Mathematics, University of Massachusetts Dartmouth
Sigma Xi Research Exhibition, April 29th & 30th, 2008

Type-token ratio

The distribution of word frequencies in text has been studied extensively, from at least the time of Zipf in 1936 to the present. For a text, the type-token ratio is $\mathrm{types}(n)/n$, where $\mathrm{types}(n)$ is the number of distinct word types in the first $n$ words of the text. The type-token ratio is just the running average of the number of new words in an initial text segment of length $n$.

[Figure: typical decay of the type-token ratio with the number of words, n. Data for the text With the Turks in Palestine, by A. Aaronsohn.]

What sort of curve is this?
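The running type-token ratio described above can be sketched in a few lines of Python. This is only an illustration: the function name `type_token_ratio` and the tokenization (splitting an already lower-cased string on whitespace) are assumptions, since the poster does not say how words were extracted.

```python
def type_token_ratio(words):
    """Return the list [types(n) / n for n = 1 .. len(words)]."""
    seen = set()      # word types met so far
    ratios = []
    for n, word in enumerate(words, start=1):
        seen.add(word)
        ratios.append(len(seen) / n)
    return ratios


# Toy input (hypothetical); a real run would use the full text of a book.
sample = "the cat sat on the mat and the dog sat too".split()
ttr = type_token_ratio(sample)
```

The ratio starts at 1 and can only drop when a word is repeated, which is why the curve for a long text decays.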

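Power laws

A log-log plot of $\log(\mathrm{types}(n)/n)$ versus $\log n$ yields a good straight-line fit ($r^2 = 0.964$). This gives an analytical expression for the type-token ratio:

$$\frac{\mathrm{types}(n)}{n} \approx A\,n^{-d}$$

In the case of the Aaronsohn text, $A \approx$ [value missing] and $d \approx$ [value missing]. This is an approximate power-law decay of the type-token ratio with the number of words.

[Figure: log-log plot of the type-token ratio against n, with fitted line.] The line might not look geometrically quite straight, but the correlation coefficient is quite high: $r = 0.982$.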
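A least-squares fit on log-log axes can be sketched as follows. The sketch assumes the power law is written as types(n)/n ≈ A·n^(-d) with d > 0; the function name `fit_power_law` and the synthetic data are illustrative assumptions, not the authors' code or data.

```python
import math


def fit_power_law(ns, ratios):
    """Fit log(ratio) = log(A) - d * log(n) by ordinary least squares.

    Returns (A, d, r_squared).
    """
    xs = [math.log(n) for n in ns]
    ys = [math.log(r) for r in ratios]
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy * sxy / (sxx * syy)
    return math.exp(intercept), -slope, r_squared


# Synthetic, exactly power-law data (assumed values A = 2, d = 0.3).
ns = list(range(1, 2001))
ratios = [2.0 * n ** -0.3 for n in ns]
A, d, r2 = fit_power_law(ns, ratios)
```

On real text the fit is only approximate, so the recovered `r2` falls below 1, as with the 0.964 reported for the Aaronsohn text.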

Very slowly varying tails

A power law for the type-token ratio, $\mathrm{types}(n)/n \approx A\,n^{-d}$, says that the product $n^{d}\cdot\mathrm{types}(n)/n$ should be approximately constant, equal to $A$. A plot of this product versus $n$ shows that, typically, this is only true from some point on.

[Figure: the product $n^{d}\cdot\mathrm{types}(n)/n$ against n.] The apparent downward slope from about 5000 words on is something of an illusion due to scale: the slope of the line is only about 1 in 24,000.

How can we determine a "turnover" point $n^*$, beyond which the type-token ratio is a genuine power law?
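The constancy check can be illustrated with a short sketch. It assumes the power law has the form types(n)/n ≈ A·n^(-d); the helper name `power_law_product` and the synthetic ratios are hypothetical and stand in for values measured from a real text.

```python
def power_law_product(ratios, d):
    """Return n**d * ratio(n) for n = 1 .. len(ratios)."""
    return [(n ** d) * r for n, r in enumerate(ratios, start=1)]


# Synthetic ratios following an exact power law (assumed A = 3, d = 0.25).
A_true, d_true = 3.0, 0.25
ratios = [A_true * n ** -d_true for n in range(1, 1001)]
products = power_law_product(ratios, d_true)

# For an exact power law the product is flat, so its spread is ~0.
spread = max(products) - min(products)
```

For a real text the product would be expected to drift at small n and flatten only beyond the turnover point.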

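Regression coefficient analysis

We plot the $r^2$ for a straight-line fit to $\log(\mathrm{types}(n)/n)$ versus $\log n$ for $n \ge n_0$, against $n_0$.

[Figure: $r^2$ against $n_0$, with the turnover point $n^*$ marked.]

For the Aaronsohn text we see a local maximum for $r^2$ of [value missing] ($r =$ [value missing]) at $n^* =$ [value missing]. The corresponding least-squares value for the index $d$ is [value missing]. For $n \ge n^*$, $\mathrm{types}(n)/n \approx A\,n^{-d}$ with $r^2 \approx 1$: an almost perfect fit to a power law. For $n \le n^*$, the type-token ratio is better described as a decreasing logarithmic function of $n$.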
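One way to sketch this cutoff scan: for each candidate n_0, fit a line to the log-log data with n ≥ n_0 and record r². The function names, the short list of cutoffs, and the synthetic data (logarithmic decay up to a turnover at n = 500, exact power law afterwards) are all assumptions made for illustration. The sketch simply takes the largest r² among the cutoffs tried, whereas the poster looks for a local maximum of the full curve.

```python
import math


def r_squared(xs, ys):
    """Squared Pearson correlation between xs and ys."""
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy * sxy / (sxx * syy)


def r2_by_cutoff(ratios, cutoffs):
    """Map each cutoff n0 to the r^2 of the log-log fit over n >= n0."""
    log_n = [math.log(n) for n in range(1, len(ratios) + 1)]
    log_ratio = [math.log(r) for r in ratios]
    return {n0: r_squared(log_n[n0 - 1:], log_ratio[n0 - 1:])
            for n0 in cutoffs}


# Synthetic ratios: logarithmic decay, then a power law past the turnover.
turnover = 500
level = 1.0 - 0.1 * math.log(turnover)
ratios = [1.0 - 0.1 * math.log(n) if n <= turnover
          else level * (n / turnover) ** -0.3
          for n in range(1, 2001)]

scores = r2_by_cutoff(ratios, [1, 100, 250, 500])
best_cutoff = max(scores, key=scores.get)
```

On this synthetic input the scan picks out the cutoff at the turnover, because only the tail beyond it is an exact straight line on log-log axes.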

Entropy

The $i$th word has a relative frequency of occurrence $\rho(i,n)$ in the first $n$ words of a text. We regard $\rho(i,n)$ as the probability of occurrence of the $i$th word in the first $n$ words of a text. For this probability distribution the Shannon entropy of the initial segment of text of length $n$ is

$$H(n) = -\sum_i \rho(i,n)\,\log \rho(i,n)$$

This amounts to treating each initial text segment as a self-contained text, for statistical purposes, with the aim of examining how the entropy changes as the text is enlarged by the addition of a new word or a previously used word. We examined the variation of $H(n)$ with $n$ for a variety of literary texts.

When a new word is added to an existing text segment, the entropy necessarily increases. When a previously used word is added to an initial segment of text, the entropy will generally rise if the word is used rarely, but fall if it is used often.

How does the entropy $H(n)$ vary with $n$? Empirically we find that $H(n)$ increases approximately logarithmically with $n$ ($r^2 =$ [value missing] for the Aaronsohn text).

[Figure: entropy H(n) against n.]

The average statistical "surprise" of the text, with the addition of a new or previously used word, rises approximately logarithmically with the length of the text.
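The entropy of each initial segment can be sketched directly from the definition, using the relative frequency of each word in the first n words as its probability. The function name `running_entropy` and the choice of base-2 logarithms are assumptions; the poster does not state the base.

```python
import math
from collections import Counter


def running_entropy(words):
    """Return [H(1), ..., H(len(words))], in bits.

    H(n) is the Shannon entropy of the word-frequency distribution
    of the first n words.
    """
    counts = Counter()
    entropies = []
    for n, word in enumerate(words, start=1):
        counts[word] += 1
        h = -sum((c / n) * math.log2(c / n) for c in counts.values())
        entropies.append(h)
    return entropies


# Toy input (hypothetical): a new word, then repeats of a frequent word.
entropies = running_entropy("a b a a".split())
```

In the toy input the second word is new and raises the entropy from 0 to 1 bit, while the later repeats of the already common word lower it. Recomputing the sum at every step costs time proportional to the vocabulary size, which is acceptable for a sketch but slow for long texts.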

The Voynich manuscript

The Voynich manuscript is MS 408 of the Beinecke Library at Yale University. It is a still-mysterious, undeciphered manuscript written using unusual symbolic forms, but apparently representing a text with linguistic structure [G. Landini, Evidence of linguistic structure in the Voynich manuscript using spectral analysis, Cryptologia (2001)]. Using the Takahashi transcription of these symbolic forms, we plotted the entropy $H(n)$ of the first $n$ words of the Voynich text as a function of $n$. As for all the other texts we examined, $H(n)$ varies approximately logarithmically with $n$.

However, there is a significantly large block of the Voynich text, about 5000 words starting from the first [value missing] words of the text and making up approximately 16% of the total text, for which the entropy decreases. This necessarily indicates a large degree of repetition of words that have been used significantly often in the text before this point.

[Figure: H(n) against n for the Voynich text.] The Voynich text is becoming significantly statistically less surprising between [value missing] and [value missing] words.
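A block of decreasing entropy could be located with a simple window comparison, sketched below. The function name `decreasing_windows`, the window width, and the toy entropy values are hypothetical; the poster does not describe how the decreasing block was identified, so this is one possible approach rather than the authors' method.

```python
def decreasing_windows(entropies, width):
    """Return every start position n (1-based) with H(n + width) < H(n).

    entropies[k] is assumed to hold H(k + 1).
    """
    return [n for n in range(1, len(entropies) - width + 1)
            if entropies[n + width - 1] < entropies[n - 1]]


# Toy entropy curve (made-up values, not Voynich data).
toy_entropies = [0.0, 1.0, 0.9, 0.8, 1.2]
starts = decreasing_windows(toy_entropies, width=2)
```

On a real entropy curve a width of a few thousand words would be a natural starting point, since the block reported for the Voynich text spans about 5000 words.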

The Voynich manuscript also shows unusual behavior when we plot the $r^2$ for a straight-line fit to $\log(\mathrm{types}(n)/n)$ versus $\log n$ for $n \ge n_0$, against $n_0$.

[Figure: $r^2$ against $n_0$ for the Voynich manuscript.] The shaded area corresponds approximately to the region of decreasing entropy.

These successive local maxima and local minima in the plot suggest a variety of different stages of usage of new word types throughout the manuscript.

A similar, but less variable, situation holds for Darwin's Origin of Species.

[Figure: $r^2$ against $n_0$ for Origin of Species.] The dip around 53,000 words is approximately where Darwin starts Chapter 6: "Difficulties on Theory".

Distribution of log returns

The distribution of word frequencies is highly skewed, a fact well known even before Zipf quantified it in [year missing]. Borrowing an idea from finance, we look at the distribution of the log returns

$$\log\frac{\rho(i+1)}{\rho(i)}$$

where $\rho(i)$ is the frequency of the $i$th word in the entire text. This distribution is typically highly symmetric, with mean close to 0, but with low kurtosis (broad shoulders), reminiscent of a modified raised cosine distribution rather than a normal distribution.

[Figure panels: distribution of frequencies; distribution of -log frequencies; distribution of log returns.]

References

G. K. Zipf, Human Behaviour and the Principle of Least Effort, Addison-Wesley.
G. Landini, Evidence of linguistic structure in the Voynich manuscript using spectral analysis, Cryptologia, 4 (2001).
L. L. Goncalves, L. B. Goncalves, Fractal power laws in literary English, Physica A, 360 (2), (2006).
S. I. Resnick, Heavy-Tail Phenomena, Springer.
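The log returns of word frequencies discussed in the "Distribution of log returns" section can be sketched as below. Two assumptions are made here because the transcript does not preserve the formula: the log return is taken in the usual financial sense, as the logarithm of the ratio of successive values, and the "i-th word" is taken to be the i-th distinct word in order of first appearance. The function names are hypothetical.

```python
import math
from collections import Counter


def log_returns(words):
    """Log of the ratio of successive word frequencies.

    Words are ordered by first appearance (an assumption); Counter keeps
    insertion order on Python 3.7+. Total counts are used, since the
    common factor 1/len(words) cancels in each ratio.
    """
    freqs = list(Counter(words).values())
    return [math.log(b / a) for a, b in zip(freqs, freqs[1:])]


def mean_and_excess_kurtosis(values):
    """Return (mean, excess kurtosis) using population moments."""
    m = len(values)
    mean = sum(values) / m
    var = sum((v - mean) ** 2 for v in values) / m
    fourth = sum((v - mean) ** 4 for v in values) / m
    return mean, fourth / var ** 2 - 3.0


# Toy input (hypothetical): counts are a=4, b=2, c=4, d=1.
returns = log_returns("a a a a b b c c c c d".split())
stats = mean_and_excess_kurtosis([-1.0, 1.0, -1.0, 1.0])
```

A negative excess kurtosis, as in the second toy example, corresponds to the broad-shouldered, flatter-than-normal shape the poster describes.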