Change over time: Working with diachronic data

Change over time: Working with diachronic data
Brezina, V. (2018). Statistics in Corpus Linguistics: A Practical Guide. Cambridge: Cambridge University Press.

Think about and discuss
Which colour terms are most popular?
Does this change over time?
How would you investigate this?

Where to start?

Visualising language change
[Figure: a candlestick plot and a line graph. Labelled values: maximum value, minimum value, first value, last value.]

Measuring time
Time is a continuous (scale) variable: we can measure it on a continuum of centuries, decades, years, months, weeks, days, hours, minutes, seconds, milliseconds etc.
Studies involving time as a variable: diachronic/longitudinal studies.
Change over time vs. stability over time.
Diachronic corpora: diachronic representativeness.
Diachronic polysemy, e.g. pre-2000s: web, tweet, cloud.

Measuring time (cont.)

Percentage change and bootstrap test

Linguistic feature   Corpus 1: Commonwealth &      Corpus 2: Restoration   Percentage
                     Protectorate (1650-1659)      (1660-1669)             increase/decrease
its                  515.86                        652.86                  +27%
must                 1,173.02                      1,135.67                -3%
time(s)              1,445.57                      1,355.84                -6%
pestilence           9.88                          13.71                   +39%

\[
\text{\% increase/decrease} = \frac{\text{relative frequency in corpus 2} - \text{relative frequency in corpus 1}}{\text{relative frequency in corpus 1}} \times 100
\]

\[
\text{percentage increase/decrease of \textit{its}} = \frac{652.86 - 515.86}{515.86} \times 100 = 26.6\%
\]
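The percentage change calculation can be sketched in a few lines of Python. The function name `percentage_change` and the dictionary layout are illustrative choices, not part of the slides; the relative frequencies are the ones given in the table.

```python
def percentage_change(rel_freq_1: float, rel_freq_2: float) -> float:
    """Percentage increase (+) or decrease (-) from corpus 1 to corpus 2."""
    if rel_freq_1 == 0:
        # The formula divides by the corpus 1 frequency, so it must be non-zero.
        raise ValueError("relative frequency in corpus 1 must be non-zero")
    return (rel_freq_2 - rel_freq_1) / rel_freq_1 * 100


# Relative frequencies from the table: (corpus 1, corpus 2)
frequencies = {
    "its": (515.86, 652.86),
    "must": (1173.02, 1135.67),
    "time(s)": (1445.57, 1355.84),
    "pestilence": (9.88, 13.71),
}

for feature, (corpus_1, corpus_2) in frequencies.items():
    print(f"{feature}: {percentage_change(corpus_1, corpus_2):+.1f}%")
```

Rounded to whole percentages, the results match the last column of the table (+27%, -3%, -6%, +39%).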

Percentage change and bootstrap test (cont.)
Bootstrapping is a process of multiple resampling of the data with replacement, often repeated thousands of times. We take a random sample of texts from a corpus in such a way that each text can occur multiple times in the sample, because we 'replace' it (i.e. place it back in the pool) once it has been taken.
In each resampling cycle, we note down the value of the statistic we are interested in (e.g. the mean frequency of a linguistic variable). This gives an insight into the amount of variation in the data and gives us the confidence to generalise from this sample.
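The resampling procedure can be sketched as follows, using only the Python standard library. The per-text frequencies are hypothetical, and the names `bootstrap_means` and `percentile_interval` are illustrative. Summarising the recorded means with a percentile interval is an assumption of this sketch, not something the slide specifies.

```python
import random
import statistics


def bootstrap_means(text_freqs, cycles=1000, seed=0):
    """Resample texts with replacement and record the mean in each cycle."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(text_freqs)
    means = []
    for _ in range(cycles):
        # choices() samples WITH replacement: a text can be drawn repeatedly
        sample = rng.choices(text_freqs, k=n)
        means.append(statistics.mean(sample))
    return means


def percentile_interval(values, level=0.95):
    """Simple percentile interval over the bootstrapped values (assumption)."""
    ordered = sorted(values)
    tail = (1 - level) / 2
    lower = ordered[int(tail * (len(ordered) - 1))]
    upper = ordered[int((1 - tail) * (len(ordered) - 1))]
    return lower, upper


# Hypothetical relative frequencies of one feature in five texts
text_freqs = [480.2, 530.9, 612.4, 455.0, 501.3]
means = bootstrap_means(text_freqs)
print(percentile_interval(means))
```

The spread of the recorded means is what indicates how much the statistic varies from one resample to the next.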

Bootstrap test
Corpus texts: A, B, C, D and E

Bootstrap test (cont.)
Across a large number of bootstrapping cycles, we compare the resampled corpus 1 with the resampled corpus 2 and look for a consistent difference between the resampled corpora, which would produce a low p-value (statistical significance).
A low p-value is returned if, in all or most cycles, the value for resampled corpus 1 is either larger than that for resampled corpus 2 (we add 1 in the equation) or smaller (we add 0).
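A minimal sketch of such a test is given below. The equation the slide refers to is not reproduced in this transcript, so the exact scoring is an assumption: 1 is added when resampled corpus 1 is larger, 0 when it is smaller, 0.5 for ties, and the result is converted into a two-sided p-value with add-one smoothing. The function name `bootstrap_test` and the per-text frequencies are hypothetical.

```python
import random
import statistics


def bootstrap_test(corpus_1, corpus_2, cycles=1000, seed=0):
    """Two-sided bootstrap comparison of mean per-text frequencies (sketch)."""
    rng = random.Random(seed)
    score = 0.0
    for _ in range(cycles):
        # Resample the texts of each corpus with replacement
        mean_1 = statistics.mean(rng.choices(corpus_1, k=len(corpus_1)))
        mean_2 = statistics.mean(rng.choices(corpus_2, k=len(corpus_2)))
        if mean_1 > mean_2:
            score += 1.0  # resampled corpus 1 larger: add 1
        elif mean_1 == mean_2:
            score += 0.5  # tie handling is an assumption of this sketch
        # resampled corpus 1 smaller: add 0
    share_larger = score / cycles
    # A consistent difference in either direction gives a small p-value
    return (1 + cycles * 2 * min(share_larger, 1 - share_larger)) / (1 + cycles)


# Hypothetical per-text relative frequencies for two periods
period_1 = [10.0, 11.0, 12.0, 13.0, 14.0]
period_2 = [100.0, 101.0, 102.0, 103.0, 104.0]
print(bootstrap_test(period_1, period_2))
```

With this formulation, the smallest p-value that can be returned is 1 / (cycles + 1), so more cycles are needed to report very small p-values.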

Neighbouring cluster analysis
[Figure panels: hierarchical agglomerative clustering; variability-based neighbour clustering.]

Neighbouring cluster analysis

Peaks and troughs and UFA
Obligatory: Obtaining the statistic of interest for each of the periods (e.g. years, decades etc.) covered by the analysis.
Optional: Transformation of the values using the binary logarithm (log2) to reduce extremes. This step is possible only if all the values to be transformed are positive numbers, because the logarithm is not defined for zero or negative numbers. Since step 2 typically also produces negative values, logarithmic transformation is possible with data from step 1.
Obligatory: Fitting a non-linear regression model (displayed as a curve in the graph), computing 95% and 99% confidence intervals (displayed as shaded areas around the curve) and identifying significant outliers, i.e. data points outside the confidence interval area.
[Figure: Results of UFA for red 1600-1699, 3a-MI(3), L5-R5, C10relative-NC10relative; AC1. Labels: data points across time; a non-linear regression model (GAM); 95% and 99% CI; significant outliers.]
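The optional log2 transformation, and the reason it requires positive values, can be illustrated with a short sketch. The function name `log2_transform` and the input values are hypothetical, and the sketch covers only the transformation, not the fitting of the regression model.

```python
import math


def log2_transform(values):
    """Apply the binary logarithm to every value; all values must be positive."""
    if any(v <= 0 for v in values):
        raise ValueError("log2 is only defined for positive values")
    return [math.log2(v) for v in values]


# Hypothetical per-period values of a statistic, including one extreme value
per_period = [4.0, 8.0, 16.0, 512.0]
print(log2_transform(per_period))  # [2.0, 3.0, 4.0, 9.0]

# Data containing negative values cannot be transformed
try:
    log2_transform([3.5, -1.2])
except ValueError as error:
    print(error)
```

The extreme value 512 ends up only a few units away from the others after the transformation, which is the sense in which log2 reduces extremes.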

Things to remember
Historical analyses, because they use available and imperfect data, require critical consideration of i) diachronic representativeness of corpora, ii) alternative interpretations of linguistic development and iii) fluctuation of the meaning of linguistic forms.
Visualisation options include line graphs, boxplots and error bars, sparklines and candlestick plots.
The bootstrapping test is used to compare two corpora (representing different points in time); it makes use of a technique of multiple resampling of corpus data.
Peaks and troughs is a technique which fits a non-linear regression to historical data, producing a graph which highlights significant outliers in the process of historical development of language and discourse.
UFA (Usage Fluctuation Analysis) is a complex procedure combining automatic collocation comparison in a given historical period and the peaks and troughs technique.