Peter Grzybek Austrian Research Fund Project #15485  Von der Ökonomie der Sprache zur Selbst-Regulation kultureller.

Presentation transcript:

Peter Grzybek Austrian Research Fund Project #15485  Von der Ökonomie der Sprache zur Selbst-Regulation kultureller Systeme  Korpuslinguistik vs. Textanalyse  Exakte Literaturwissenschaft: Zur Prosa Karel Čapeks  Was tun die Wörter im Vers miteinander? Zur Poesie A.S. Puškins

Corpus Linguistics vs. Text Analysis

Analysis of Letter Frequencies: Methodological Problems in Former Studies
1. Insufficient data distinction (graphemic and phonematic/phonetic data)
2. Insufficient control of data homogeneity (text / text segments / text mixtures (corpora))
3. Frequency models: continuous vs. discrete (a) theoretical entropy, repeat rate (b) Σ p_i = 1
4. Goodness of fit: graphics vs. tests, R² vs. χ²

Analysis of Letter Frequencies: Methodological Decisions
1. Data distinction: graphemic data
2. Control of data homogeneity: text vs. text segments vs. text cumulations vs. text mixtures (corpus)
3. Discrete frequency models: test of relevant models
4. Goodness of fit: χ² test, C = χ² / N (C < 0.02 = *; C < 0.01 = **)
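The goodness-of-fit criterion named on this slide can be sketched in a few lines. This is a minimal illustration, assuming observed counts per class and model probabilities that sum to 1; the function names `discrepancy_coefficient` and `rating` and the sample numbers are hypothetical and not taken from the talk.

```python
def discrepancy_coefficient(observed, model_probs):
    """Return (chi2, C) with C = chi2 / N for observed counts vs. model probabilities."""
    if len(observed) != len(model_probs):
        raise ValueError("observed and model_probs must have the same length")
    n_total = sum(observed)  # N: total number of observations
    chi2 = 0.0
    for obs, p in zip(observed, model_probs):
        expected = n_total * p  # expected count under the model
        chi2 += (obs - expected) ** 2 / expected
    return chi2, chi2 / n_total


def rating(c):
    """Map C to the star notation on the slide: C < 0.01 -> '**', C < 0.02 -> '*'."""
    if c < 0.01:
        return "**"
    if c < 0.02:
        return "*"
    return ""


# Made-up counts for three classes, purely for illustration.
chi2, c = discrepancy_coefficient([60, 25, 15], [0.5, 0.3, 0.2])
print(round(chi2, 4), round(c, 4), repr(rating(c)))
```

Dividing χ² by N makes the criterion usable for very large samples, where the χ² statistic itself grows with sample size even for a well-fitting model.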

Analysis of Letter Frequencies: Slavic Alphabets
Inventory size
minimal   25      Slovene
maximal   46      Slovak
medium    32/33   Russian (е / ё)

Analysis of Letter Frequencies Russian

Zipf (Zeta) distribution
Basic assumption: r × f_r = c ⇒ f_r = c / r

Zipf-Mandelbrot distribution
Basic assumption: f_r = c / (r + b)^a
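The two rank-frequency assumptions can be compared numerically. The sketch below assumes a finite inventory of n ranks and chooses the constant c so that the probabilities sum to 1; the function names and the parameter values are illustrative only and do not come from the talk.

```python
def zipf_mandelbrot(n, a=1.0, b=0.0):
    """Probabilities p_r proportional to 1 / (r + b)**a for ranks r = 1..n."""
    weights = [1.0 / (r + b) ** a for r in range(1, n + 1)]
    total = sum(weights)  # 1 / total plays the role of the constant c
    return [w / total for w in weights]


def zipf(n):
    """Zipf (zeta-type) special case: a = 1, b = 0, so that r * p_r is constant."""
    return zipf_mandelbrot(n, a=1.0, b=0.0)


# Illustrative values: 32 ranks, as for a 32-letter alphabet.
p_zipf = zipf(32)
p_zm = zipf_mandelbrot(32, a=1.2, b=2.5)
print(round(p_zipf[0], 4), round(p_zm[0], 4))
```

With b > 0 the curve is flattened at the top ranks, which is the extra freedom Zipf-Mandelbrot has over the plain Zipf form.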

Zipf and Zipf-Mandelbrot Distributions: Goodness of Fit (38 Russian samples)

Geometric Distribution and Good Distribution

Negative Hypergeometric Distribution
n = inventory size, x = class; 2 parameters: K and M
Analysis of Russian Letter Frequencies
Corpus: 37 texts (ca. 8.5 million letters)
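The slide does not reproduce the formula, so the following is only a sketch under a stated assumption: it uses the common textbook form of the negative hypergeometric distribution, P_x = C(M+x-1, x) · C(K-M+n-x-1, n-x) / C(K+n-1, n) for x = 0, ..., n, with real-valued K > M > 0. The talk may have used a displaced variant (classes starting at 1). The function names and the parameter values in the example are hypothetical.

```python
import math


def log_gen_binom(a, k):
    """Log of the generalized binomial coefficient C(a + k - 1, k), real a > 0, integer k >= 0."""
    return math.lgamma(a + k) - math.lgamma(a) - math.lgamma(k + 1)


def neg_hypergeometric_pmf(n, K, M):
    """Probabilities P_0..P_n of the (assumed, non-displaced) negative hypergeometric form."""
    if not (K > M > 0):
        raise ValueError("parameters must satisfy K > M > 0")
    log_norm = log_gen_binom(K, n)  # log C(K + n - 1, n)
    return [
        math.exp(log_gen_binom(M, x) + log_gen_binom(K - M, n - x) - log_norm)
        for x in range(n + 1)
    ]


# Arbitrary illustrative parameters, not fitted values from the talk.
probs = neg_hypergeometric_pmf(n=32, K=3.2, M=0.8)
print(round(sum(probs), 6))
```

Working with log-gamma instead of factorials keeps the computation stable and allows non-integer K and M, which is what a fitted two-parameter model needs.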

Analysis of Russian Letter Frequencies
Comparison of texts, text segments, text cumulations, text mixtures, and complete corpus
Constancy of goodness of fit (C); constancy of parameters (K, M)
Negative Hypergeometric Distribution

Analysis of Slovene Letter Frequencies Corpus: ca letters Goodness of fit (C= ) Negative Hypergeometric Distribution

Analysis of Slovene Letter Frequencies
Comparison of texts, text segments, text cumulations, text mixtures, and complete corpus
Constancy of goodness of fit (C); constancy of parameters (K, M)
Negative Hypergeometric Distribution

Analysis of Slovene Letter and Phoneme Frequencies
Corpus: ca
Slovene Letters / Slovene Phonemes

First Tentative Results for Slovak Letter Frequencies
Tasks:
1. Interpretation of parameters: the “foreign letters” Q, W, X influence inventory size
2. Exploration of the data basis: texts, text segments, text cumulations, text mixtures

The Question of Data Homogeneity

“[…] the magnitude of words tends, on the whole, to stand in an inverse (not necessarily proportionate) relationship to the number of occurrences” (Zipf 1935: 25)
Four major problems in research

1. What is the direction of dependence: Does frequency depend on length or vice versa?
2. What is the unit of measurement: Is word length measured in letters, phonemes, syllables, morphemes, ...?
3. What is frequency: Absolute occurrence or the rank of words, or of word forms?
4. What is the text basis: Corpus data, frequency dictionaries, ..., individual texts?

Assuming that word length is a function of frequency,
measuring word length in the number of syllables per word, and
analyzing the absolute occurrence of words,
the influence of the text basis shall be tested: individual texts vs. text cumulations vs. corpus data.
DATA HOMOGENEITY

Intertextual Inhomogeneity vs. Intratextual Inhomogeneity
Intertextual: combination (“mixture”) of different texts
  different languages
  different authors
  different text types
Intratextual: a ‘text’ in itself does not consist of homogeneous elements
  complete novel, composed of chapters
  complete book of a novel, consisting of several chapters
  individual chapters
  dialogical vs. narrative sequences within a text

Russian: Anna Karenina (ch. 1)
x = frequency, y = length
a = , b = , R² = 0.88, N = 397

Text                     Language     N    R²    a      b
Anna Karenina (I,1)      Russian                 5,03   0,97
Evgenij Onegin (I)       Russian                 4,70   0,79
Na badnjak               Croatian                3,95   0,51
Zářivé hlubiny           Czech                   4,76   0,59
Hiša M.P. (I)            Slovenian               4,80   0,40
Zakliata panna           Slovak                  3,48   0,69
Hänsel und Gretel        German                  4,16   0,51
Fairy Tale by Móra       Hungarian               3,57   0,84
Di lembung kuring        Sundanese               3,86   0,51
Burung api               Indonesian              2,44   0,26
Portrait of a Lady (I)   English                 ,23    0,
R² 0.96

The course of the theoretical curves

The relationship between parameters a and b

The relationship between text length (N) and parameter a

Obvious data inhomogeneity
1. Texts from different languages, authors, and various text types
2. Violation of the ceteris paribus condition
Ergo: The data in this mixture are not adequate for testing the hypothesis at stake

Lev N. Tolstoj: Anna Karenina, Chap. I,1 vs. I (34 chapters)
Columns: N (Types), C, a, b
Rows: AK (I, 1), AK (I)
,60 0.27

Henry James: Portrait of a Lady, Chap. 1 vs. novel (52 chapters)
Columns: N (Types), C, a, b
Rows: I I
,84 0.27

Ks.Š. Gjalski: Na badnjak
Narrative vs. dialogical sequences
Columns: N (Types), C, a, b
Rows: narrative, dialogues

Evgenij Onegin: Text cumulation (I – VIII)
Chapter         N (Types)   M (Tokens)   a      b      R²
I                                        1,70   0,79
I+II                                     1,84   0,69
I-III                                    1,92   0,57
I-IV                                     1,97   0,53
I-V                                      1,95   0,48
I-VI                                     1,97   0,52
I-VII                                    2,03   0,43
text (I-VIII)                            2,05   0,
Results of fitting y = ax^-b + 1 to the cumulative text of Evgenij Onegin

Evgenij Onegin – text cumulation (chap. I – VIII)
Dependence of parameter b on parameter a
Fitting y = ax^-b: R² = 0.92
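The slides do not state which fitting procedure was used for the power model. As one possible sketch, y = a·x^(-b) can be fitted by ordinary least squares on the log-log scale; the optional `offset` argument covers the variant y = a·x^(-b) + 1 by subtracting the offset first. The function name `fit_power` and the synthetic data are hypothetical, and a nonlinear least-squares fit on the original scale would generally give slightly different estimates.

```python
import math


def fit_power(xs, ys, offset=0.0):
    """Fit y = a * x**(-b) + offset by least squares on log(y - offset) vs. log(x).

    Requires all x > 0 and all y > offset. Returns (a, b, r2), where r2 is the
    coefficient of determination computed on the original (untransformed) scale.
    """
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y - offset) for y in ys]
    n = len(xs)
    mean_lx = sum(log_x) / n
    mean_ly = sum(log_y) / n
    sxx = sum((u - mean_lx) ** 2 for u in log_x)
    sxy = sum((u - mean_lx) * (v - mean_ly) for u, v in zip(log_x, log_y))
    slope = sxy / sxx
    a = math.exp(mean_ly - slope * mean_lx)
    b = -slope  # the model uses x**(-b), so b is the negated slope
    predicted = [a * x ** (-b) + offset for x in xs]
    mean_y = sum(ys) / n
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return a, b, 1.0 - ss_res / ss_tot


# Synthetic data generated from a = 3, b = 0.5 with offset 1 (not data from the talk).
xs = [1, 2, 4, 8, 16]
ys = [3 * x ** -0.5 + 1 for x in xs]
a, b, r2 = fit_power(xs, ys, offset=1.0)
print(round(a, 3), round(b, 3), round(r2, 3))
```

Here x stands for the frequency of a word and y for its length in syllables, following the axis assignment used in the talk.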

Evgenij Onegin: Text cumulation (I – VIII)
Dependence of a on text length (N): a = N (R² = 0.96)

Summary & Results (I)
Data corroborate the hypothesis. There is a specific interrelation of parameters:
a = f(N), b = g(a), b = h(N)
f, g, h: functions of the same type

Summary & Results (II)
1. Homogeneous texts do not interfere with linguistic laws; inhomogeneous texts can distort the textual reality.
2. Text mixtures can evoke phenomena which do not exist as such in individual texts.
3. Short texts do not allow a property to take appropriate shape; long texts (and corpora) contain mixed generating regimes superposing different layers, which may lead to “artificial” phenomena.
4. With an increase of text size, the resulting curve of the frequency-length relationship is shifted upwards; this is caused by the fact that the number of words occurring only once increases up to a certain text length. If this assumption is correct, then b converges to zero, yielding the limit y = a.

F I N I S