Corpus Size and the Robustness of Measures of Corpus Distance


1 Corpus Size and the Robustness of Measures of Corpus Distance
Alexander Piperski (RSUH / HSE) June 01, 2018

2 Comparing corpora in SketchEngine

3 Comparing corpora in SketchEngine

4 Comparing corpora in SketchEngine
Law Report Corpus is closer to Academic Spoken than to Academic Written
Cambridge Academic English corpus is also closer to Academic Spoken than to Academic Written

5 Measures of corpus distance
NB: "distance" and "similarity" are used interchangeably in this talk
Gomaa & Fahmy (2013), “A Survey of Text Similarity Approaches”:
character-based measures
term-based measures

6 Term-based measures
Texts are represented as vectors of frequencies (absolute frequencies, relative frequencies, tf-idf scores, etc.) of words or other entities
Many ways of measuring distances between vectors can be used:
Euclidean distance
Spearman’s ρ (cf. also Kilgarriff 2001)
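The vector representation above can be sketched in a few lines. This is a minimal illustration, not the speaker's implementation: the function names, the toy sentences, and the choice of relative word frequencies are assumptions made here for the example.

```python
from collections import Counter
from math import sqrt

def relative_frequencies(tokens, vocabulary):
    """Relative frequency of each vocabulary item in a token list."""
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[w] / total for w in vocabulary]

def euclidean(u, v):
    """Euclidean distance between two equally long vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(u, v):
    """Spearman's rho: Pearson correlation of the rank-transformed vectors."""
    ru, rv = average_ranks(u), average_ranks(v)
    n = len(ru)
    mu, mv = sum(ru) / n, sum(rv) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(ru, rv))
    var_u = sum((a - mu) ** 2 for a in ru)
    var_v = sum((b - mv) ** 2 for b in rv)
    return cov / sqrt(var_u * var_v)

# Toy texts (hypothetical); the shared vocabulary fixes the vector dimensions.
text_a = "the cat sat on the mat".split()
text_b = "the dog sat on the log".split()
vocabulary = sorted(set(text_a) | set(text_b))
vec_a = relative_frequencies(text_a, vocabulary)
vec_b = relative_frequencies(text_b, vocabulary)
print(euclidean(vec_a, vec_b), spearman_rho(vec_a, vec_b))
```

Note that Spearman’s ρ is a similarity (1 for identical rankings), whereas Euclidean distance is 0 for identical vectors, so the two run in opposite directions.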

7 Text similarity vs. corpus similarity
Measures of corpus similarity are conceptually similar to measures of text similarity
However, some problems are not the same

8 Text similarity vs. corpus similarity
Problem 1: No human has an intuition as to which corpora are similar ⇒ intuitive notions of corpus distance (or similarity) are based on intuitions about metadata and depend on the focus of attention

9 Corpus similarity and metadata
A: 937,080 tokens
B: 249,051 tokens
C: 113,493 tokens
(source: RNC)

10 Corpus similarity and metadata
A: 937,080 tokens, around 1800
B: 249,051 tokens, around 1800
C: 113,493 tokens, around 1900
(source: RNC)

11 Corpus similarity and metadata
A: 937,080 tokens, around 1800, prose
B: 249,051 tokens, around 1800, poetry
C: 113,493 tokens, around 1900, poetry
(source: RNC)

12 Corpus similarity and metadata
A: 937,080 tokens, around 1800, prose, Karamzin
B: 249,051 tokens, around 1800, poetry, Zhukovsky
C: 113,493 tokens, around 1900, poetry, Blok
(source: RNC)

13 Text similarity vs. corpus similarity
Problem 1: No human has an intuition as to which corpora are similar ⇒ intuitive notions of corpus distance (or similarity) are based on intuitions about metadata and depend on the focus of attention
For a formal approach to corpus similarity, see Kilgarriff (2001)

14 Text similarity vs. corpus similarity
Problem 2: Corpus similarity cannot be simply reduced to similarity of texts from individual corpora because the corpora are not necessarily homogeneous

15 Text similarity vs. corpus similarity
Problem 3: Texts represent themselves, whereas corpora are mostly conceived as samples from larger populations
Examples:
Russian National Corpus
British Academic Spoken English corpus

16 Text similarity vs. corpus similarity
Problem 3: Texts represent themselves, whereas corpora are mostly conceived as samples from larger populations
⇒ can a distance estimated using two corpora (i.e., samples) serve as a distance estimate for two populations?

17 Robustness of measures of text similarity
A measure of distance between texts can be used as a measure of distance between corpora if it is robust with respect to corpus size
A good measure of corpus distance should return similar results for differently-sized samples from the same population
Source of inspiration: Tweedie & Baayen (1998) on lexical richness

18 (Tweedie, Baayen 1998: 333)

19 Experiment design
200,000-token corpora from 11 sources from the British National Corpus

20 Sources

21 Experiment design
200,000-token corpora from 11 sources from the British National Corpus
9 sample sizes: 20,000; 40,000; 60,000; … ; 180,000
For each pair of sources and for each sample size, 50 samples from the two sources are taken and the distance between them is measured using 6 distance measures
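The sampling loop of this design might look roughly as follows. The slides do not say how samples are drawn, so random sampling of tokens without replacement is an assumption here, and `sampled_distances` and `manhattan_relfreq` are hypothetical names used only for illustration.

```python
import random
from collections import Counter

SAMPLE_SIZES = range(20_000, 200_000, 20_000)  # 20,000; 40,000; ...; 180,000
N_SAMPLES = 50                                 # samples per pair and size

def manhattan_relfreq(tokens_a, tokens_b):
    """Placeholder distance: Manhattan distance between relative frequencies."""
    ca, cb = Counter(tokens_a), Counter(tokens_b)
    na, nb = len(tokens_a), len(tokens_b)
    return sum(abs(ca[w] / na - cb[w] / nb) for w in set(ca) | set(cb))

def sampled_distances(tokens_a, tokens_b, size, distance,
                      n_samples=N_SAMPLES, rng=None):
    """Distances between n_samples pairs of equally sized random samples.

    Assumption: tokens are sampled without replacement from each source.
    """
    rng = rng or random.Random()
    results = []
    for _ in range(n_samples):
        sample_a = rng.sample(tokens_a, size)
        sample_b = rng.sample(tokens_b, size)
        results.append(distance(sample_a, sample_b))
    return results

# Toy usage with tiny made-up "corpora" instead of 200,000-token ones.
source_a = ["a"] * 100 + ["b"] * 100
source_b = ["a"] * 50 + ["c"] * 150
toy = sampled_distances(source_a, source_b, size=50,
                        distance=manhattan_relfreq, n_samples=5,
                        rng=random.Random(0))
print(toy)
```

In the full design this function would be called once per pair of sources, per sample size, and per distance measure.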

22 Distance measures
Geometrical measures: Euclidean distance, Manhattan distance, Cosine distance
Statistical measures: χ², Spearman’s ρ
Keyword-based measure: Simple-Maths Keyword distance (Kilgarriff 2009, implemented in SketchEngine)
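Several of these measures are short enough to write out. The sketch below is illustrative only: the χ² function is the ordinary Pearson statistic over a 2 × V table of absolute counts, which may differ from the exact variant used in the talk, and `simple_maths_score` shows only the per-word ratio with a smoothing constant `k` (assumed to be 1 here). How SketchEngine aggregates word scores into a corpus distance is not stated on the slide and is not reproduced. Euclidean distance and Spearman’s ρ are omitted for brevity.

```python
from math import sqrt

def manhattan(u, v):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))

def cosine_distance(u, v):
    """One minus the cosine of the angle between the two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def chi_square(counts_a, counts_b):
    """Pearson chi-square statistic for a 2 x V table of absolute counts."""
    total_a, total_b = sum(counts_a), sum(counts_b)
    grand = total_a + total_b
    stat = 0.0
    for a, b in zip(counts_a, counts_b):
        row = a + b
        if row == 0:
            continue  # word absent from both corpora contributes nothing
        exp_a = row * total_a / grand  # expected count in corpus A
        exp_b = row * total_b / grand  # expected count in corpus B
        stat += (a - exp_a) ** 2 / exp_a + (b - exp_b) ** 2 / exp_b
    return stat

def simple_maths_score(fpm_focus, fpm_reference, k=1.0):
    """Per-word keyness ratio on frequencies per million; k is an assumed constant."""
    return (fpm_focus + k) / (fpm_reference + k)

print(manhattan([1, 2], [3, 5]), cosine_distance([1, 0], [0, 1]))
```

The geometrical measures can be applied to relative frequencies, while the χ² statistic as written expects absolute counts.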

23 Experiment design
For each pair of sources and for each sample size, a 95% confidence interval (CI) for the mean value of the distance is constructed using 10,000-sample bootstrapping from the 50 computed values
CI is expected to include the best estimate computed on 200,000-token corpora
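A bootstrap CI of this kind could be computed as sketched below. The slide does not name the bootstrap variant, so the percentile method is an assumption, as are the function names `bootstrap_ci_mean` and `contains`.

```python
import random

def bootstrap_ci_mean(values, n_resamples=10_000, level=0.95, rng=None):
    """Percentile bootstrap CI for the mean (assumed variant).

    Each resample draws len(values) items with replacement.
    """
    rng = rng or random.Random()
    n = len(values)
    means = sorted(sum(rng.choices(values, k=n)) / n
                   for _ in range(n_resamples))
    alpha = (1.0 - level) / 2.0
    lower = means[int(alpha * n_resamples)]
    upper = means[int((1.0 - alpha) * n_resamples) - 1]
    return lower, upper

def contains(ci, best_estimate):
    """True if the best estimate lies inside the interval."""
    lower, upper = ci
    return lower <= best_estimate <= upper

# Toy usage: 50 made-up distance values standing in for the 50 samples.
toy_values = [float(i) for i in range(50)]
toy_ci = bootstrap_ci_mean(toy_values, n_resamples=2_000, rng=random.Random(0))
print(toy_ci, contains(toy_ci, 24.5))
```

In the experiment, `best_estimate` would be the distance computed on the full 200,000-token corpora.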

24 The Daily Mirror vs. Unigram X

25 The Daily Mirror vs. Unigram X

26 The Daily Mirror vs. Unigram X

27 The Daily Mirror vs. Unigram X

28 The Daily Mirror vs. Unigram X

29 The Daily Mirror vs. Unigram X

30 Evaluating robustness
All six measures have different ranges of values and are not directly comparable to each other
A good distance measure must satisfy the following requirements:
the CI for smaller samples is expected to include the best estimate computed on 200,000-token corpora
the CI must not be too wide
the CI must not systematically shift upwards or downwards with decreasing corpus size

31 Best estimate within the confidence interval

32 The Daily Mirror vs. Unigram X

33 Increase in width of the CI
How many times wider is the CI for 20,000-token samples than for 180,000-token samples?
Averaged across 55 pairs of sources
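The width comparison reduces to a ratio per pair of sources. The sketch below assumes that "averaged" means the arithmetic mean of the per-pair ratios (rather than a ratio of mean widths); the function names are hypothetical. The 55 pairs follow from choosing 2 of the 11 sources.

```python
from itertools import combinations

def ci_width(ci):
    """Width of a (lower, upper) interval."""
    lower, upper = ci
    return upper - lower

def mean_width_inflation(cis_small, cis_large):
    """Mean over pairs of width(CI at 20,000) / width(CI at 180,000)."""
    ratios = [ci_width(s) / ci_width(l) for s, l in zip(cis_small, cis_large)]
    return sum(ratios) / len(ratios)

# 11 sources give 55 unordered pairs.
n_pairs = len(list(combinations(range(11), 2)))

# Toy usage with two made-up pairs of sources.
toy_inflation = mean_width_inflation([(0, 4), (1, 7)], [(0, 2), (0, 2)])
print(n_pairs, toy_inflation)
```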

34 The Daily Mirror vs. Unigram X

35 Instability score
For 1 ≤ n ≤ 8:
if a distance between a pair of corpora containing n × 20,000 tokens is larger than the upper bound of the CI for (n + 1) × 20,000 tokens, the measure is given n / (n + 1) points
if, on the contrary, the distance is smaller than the lower bound of this CI, the measure loses n / (n + 1) points
A good measure will gain approximately as many points as it loses (i.e., it is symmetric)
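The scoring rule can be written as a short function. One point is left open by the slide: which distance at n × 20,000 tokens is compared with the CI. The sketch assumes a single value per sample size (for instance the mean of the 50 samples) and computes the net score for one pair of sources; the dictionary layout keyed by n is also an assumption.

```python
def instability_score(distances, cis):
    """Net instability score for one pair of sources.

    distances[n] is the (assumed: mean) distance at n * 20,000 tokens,
    cis[n] the (lower, upper) CI at n * 20,000 tokens, for n = 1..9.
    """
    score = 0.0
    for n in range(1, 9):                 # 1 <= n <= 8
        lower, upper = cis[n + 1]
        weight = n / (n + 1)
        if distances[n] > upper:          # smaller sample overshoots
            score += weight
        elif distances[n] < lower:        # smaller sample undershoots
            score -= weight
    return score

# Toy usage: a measure whose values drift downwards as samples grow.
drifting = {n: 10.0 - n for n in range(1, 10)}
drifting_cis = {n: (10.0 - n - 0.1, 10.0 - n + 0.1) for n in range(1, 10)}
print(instability_score(drifting, drifting_cis))
```

A net score near zero corresponds to the symmetric behaviour described above; a large positive or negative score indicates a systematic shift with corpus size.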

36 Instability score

37 The Daily Mirror vs. Unigram X

38 Capturing the best estimate
Summary
Columns: Measure; Capturing the best estimate; No CI inflation; No instability
Rows, with the values as they survive in the transcript: Euclidean 1 2; Manhattan 3; Cosine 4; χ² 6 5; Spearman’s ρ; Simple-Maths Keywords

39 Conclusion
Geometrical measures outperform other types of measures for the purpose of corpus comparison
Euclidean distance performs best among geometrical measures

40 Caveats
One language
One variety of language
Based on words rather than on characters, lemmata, PoS-tags, etc.
Based on 1-grams rather than 2-, 3-, etc. grams
No adaptation of measures to different corpus sizes

41 Thank you for your attention!

