Workshop: Corpus (1)
What might a corpus of spoken data tell us about language?
OLINCO 2014, Olomouc, Czech Republic, June 7
Sean Wallis, Survey of English Usage, University College London
s.wallis@ucl.ac.uk
Outline
–What can a corpus tell us? The 3A cycle
–What can a parsed corpus tell us? ICE-GB and DCPSE
–Diachronic changes: modal shall/will over time
–Intra-structural priming: NP premodification
–The value of interaction evidence
What can a corpus tell us?
Three kinds of evidence may be obtained from a corpus:
–Frequency (distribution) evidence of a particular known linguistic event
–Coverage (discovery) evidence of new events
–Interaction evidence of the relationship between events
But if these ‘events’ are lexical, this evidence can only really tell us about lexis
–so corpus linguistics has always involved annotation
The 3A cycle
Plain text corpora provide evidence of lexical phenomena.
Need to annotate:
–add knowledge of frameworks
–classify and relate phenomena
–a general annotation scheme, not focused on particular research goals
Corpus research = the ‘3A’ cycle: Annotation, Abstraction, Analysis
–Annotation: Text → Corpus
–Abstraction: Corpus → Dataset, a data transformation (“operationalisation”)
–Analysis: Dataset → Hypotheses
Annotation and abstraction
Abstraction:
–selects data from the annotated corpus
–maps it to a regular dataset for statistical analysis
–is bi-directional (“concretisation”): this allows us to interpret statistically significant results
Even ‘lexical’ questions need annotation:
–e.g. first-person declarative modal shall/will: the abstraction relies on annotation
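To make the abstraction step concrete, here is a minimal sketch in Python. The annotation scheme (token, part of speech, subject person, clause mood) and the toy data are invented for illustration; ICE-GB's actual annotation is far richer.

```python
# Sketch of the abstraction step: mapping annotated corpus instances to a
# regular dataset for statistical analysis. The annotation scheme here
# (token, pos, subject_person, clause_mood) is hypothetical, not ICE-GB's.
annotated = [
    ("shall", "MD", 1, "declarative"),
    ("will",  "MD", 1, "declarative"),
    ("will",  "MD", 3, "declarative"),
    ("shall", "MD", 1, "interrogative"),
    ("will",  "MD", 1, "declarative"),
]

# Abstraction: select first-person declarative modal shall/will cases
# and reduce each to a single binary outcome (is the modal 'shall'?).
dataset = [tok == "shall"
           for tok, pos, person, mood in annotated
           if pos == "MD" and person == 1 and mood == "declarative"]

p_shall = sum(dataset) / len(dataset)  # p(shall | {shall, will})
print(len(dataset), round(p_shall, 3))
```

The point of the exercise is that even this ‘lexical’ query depends on the annotation: without POS tags and clause-level mood, the first-person declarative cases could not be selected reliably.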
What can a parsed corpus tell us?
Three kinds of evidence may be obtained from a parsed corpus:
–Frequency evidence of a particular known rule, structure or linguistic event
–Coverage evidence of new rules, etc.
–Interaction evidence of the relationship between rules, structures and events
BUT the evidence is necessarily framed within a particular grammatical scheme
–so (an obvious question) how might we evaluate this grammar?
What can a parsed corpus tell us?
Parsed corpora contain (lots of) trees.
–Use Fuzzy Tree Fragment (FTF) queries to get data
–An FTF matches cases in trees, retrieved using ICECUP (Nelson et al., 2002)
What can a parsed corpus tell us?
Trees as a handle on data:
–make useful distinctions
–retrieve cases reliably
–it is not necessary to “agree” with the framework used, provided its distinctions are meaningful
Trees as a trace of the language production process:
–interactions between decisions leave a probabilistic effect on overall performance
–it is not simple to distinguish between sources
–this depends on the framework, but may also validate it
Why spoken corpora?
Speech predates writing:
–historically: literacy growth and spread
–in child development: internal speech during writing
Scale:
–professional authors recommend writing 1,000 words/day
–1 hour of speech ≈ 8,000 words (DCPSE)
Spontaneity:
–the production process is lost in many edited written sources
Dialogue:
–interaction between speakers
ICE-GB and DCPSE
British Component of the International Corpus of English (ICE-GB, 1990-92):
–1 million words (nominal)
–60% spoken, 40% written
–speech component is orthographically transcribed
–fully parsed: marked up, POS-tagged, parsed and hand-corrected
Diachronic Corpus of Present-day Spoken English (DCPSE):
–800,000 words (nominal)
–orthographically transcribed and fully parsed
–created from subsamples of LLC (1958-1977) and ICE-GB (1990-1992)
–matching numbers of texts in text categories
–not sampled over equal durations
Modal shall vs. will over time
Plotting modal shall/will over time in DCPSE: p(shall | {shall, will}) by year, 1955-1995.
–Small amounts of data per year
–Confidence intervals identify the degree of certainty in our results
–Highly skewed p in some cases: p = 0 or 1 (circled)
–We can now estimate an approximate downwards curve (Aarts et al., 2013)
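The confidence intervals in this kind of plot can be computed with the Wilson score interval, which behaves sensibly even for the skewed cases where p = 0 or 1 and the data per year is small. A minimal sketch (the 3-out-of-10 figure is invented for illustration):

```python
import math

def wilson_interval(p, n, z=1.96):
    """Wilson score interval for an observed proportion p out of n cases.

    Unlike the naive normal ('Wald') interval, it never escapes [0, 1]
    and gives sensible bounds even when p = 0 or 1 and n is small.
    """
    denom = 1 + z**2 / n
    centre = p + z**2 / (2 * n)
    spread = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return ((centre - spread) / denom, (centre + spread) / denom)

# e.g. 3 'shall' out of 10 first-person declarative shall/will cases in a year
lo, hi = wilson_interval(3 / 10, 10)
print(round(lo, 3), round(hi, 3))  # 0.108 0.603
```

For a skewed year with 0 ‘shall’ out of 5 cases, the interval is roughly (0, 0.43): the lower bound stays at zero, but the upper bound still expresses the uncertainty in such a tiny sample.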
Intra-structural priming
Priming effects within a structure:
–study repeating an additive step in structures
Consider:
–a phrase or clause that may (in principle) be extended ad infinitum, e.g. an NP with a noun head (ship)
–a single additive step applied to this structure, e.g. adding an attributive AJP before the head (tall ship, tall green ship, ...)
Q. What is the effect of repeatedly applying this operation to the structure?
NP premodification
Sequential probability analysis (plot: probability, 0.00-0.20, against number of AJPs, 0-5):
–calculate the probability of adding each AJP
–error bars: Wilson intervals
–probability falls with each step: second < first, third < second
–decisions interact: every AJP added makes it harder to add another
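The sequential probabilities can be computed as conditional probabilities over NP counts. A sketch in Python; the counts are invented for illustration, not the ICE-GB/DCPSE figures:

```python
# Hypothetical counts of NPs by number of attributive AJPs before the head.
counts = {0: 9000, 1: 900, 2: 80, 3: 5, 4: 0}

def p_add(counts, n):
    """Probability of adding an (n+1)th AJP given n AJPs are already present:
    p(add | n) = (NPs with more than n AJPs) / (NPs with at least n AJPs)."""
    at_least_n = sum(c for k, c in counts.items() if k >= n)
    more_than_n = sum(c for k, c in counts.items() if k > n)
    return more_than_n / at_least_n

for n in range(3):
    print(n, round(p_add(counts, n), 4))
```

With these toy counts the sequence falls at every step (about 0.099, 0.086, 0.059), which is the shape of the interaction effect described above: each added AJP makes a further AJP less likely.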
NP premodification: explanations?
Feedback loop: for each successive AJP, it is more difficult to add a further AJP. Possible explanations include:
–logical and semantic constraints: we tend to say the tall green ship, but not tall short ship or green tall ship
–communicative economy: once a speaker has said tall green ship, they tend to say only ship thereafter
–memory/processing constraints: unlikely, as this is a small structure, as are AJPs
NP premod’n: speech vs. writing
Spoken vs. written subcorpora:
–same overall pattern
–spoken data tends to have fewer attributive AJPs
Support for the communicative economy or memory/processing hypotheses?
Significance tests:
–paired 2 × 1 Wilson tests (Wallis 2011)
–the first and second observed spoken probabilities are significantly smaller than the written probabilities
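One standard way to compare two independent proportions using Wilson intervals is Newcombe's method; the sketch below illustrates the idea, though it is not necessarily the exact 2 × 1 test of Wallis (2011), and the spoken/written figures are invented:

```python
import math

def wilson(p, n, z=1.96):
    # Wilson score interval for proportion p out of n cases.
    denom = 1 + z**2 / n
    centre = p + z**2 / (2 * n)
    spread = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return ((centre - spread) / denom, (centre + spread) / denom)

def newcombe_diff(p1, n1, p2, n2, z=1.96):
    """Newcombe-Wilson interval for the difference p1 - p2.

    The difference is significant at ~0.05 if the interval excludes 0."""
    l1, u1 = wilson(p1, n1, z)
    l2, u2 = wilson(p2, n2, z)
    d = p1 - p2
    return (d - math.sqrt((p1 - l1)**2 + (u2 - p2)**2),
            d + math.sqrt((u1 - p1)**2 + (p2 - l2)**2))

# Illustrative (invented) figures: in 2,000 NPs each, a first AJP is added
# with probability 0.08 in speech vs. 0.12 in writing.
lo, hi = newcombe_diff(0.08, 2000, 0.12, 2000)
print(hi < 0)  # True: the whole interval is below zero, so spoken < written
```

Here the entire interval for (spoken − written) lies below zero, i.e. the spoken probability is significantly smaller, mirroring the finding reported on the slide.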
Potential sources of interaction
–shared context: topic or ‘content words’ (Noriega)
–idiomatic conventions: semantic ordering of attributive adjectives (tall green ship)
–logical-semantic constraints: exclusion of incompatible adjectives (?tall short ship)
–communicative constraints: brevity on repetition (just say ship next time)
–psycholinguistic processing constraints: attention and memory of speakers
What use is interaction evidence?
Corpus linguistics:
–optimising the existing grammar, e.g. co-ordination, compound nouns
Theoretical linguistics:
–comparing different grammars of the same language
–comparing different languages or periods
Psycholinguistics:
–searching for evidence of language production constraints in spontaneous speech corpora
–speech and language therapy; language acquisition and development
What can a parsed corpus tell us?
Trees as a handle on data:
–make useful distinctions
–retrieve cases reliably
–it is not necessary to “agree” with the framework used, provided its distinctions are meaningful
Trees as a trace of the language production process:
–interactions between decisions leave a probabilistic effect on overall performance
–it is not simple to distinguish between sources
–results are enabled by the framework, but may also validate it
The importance of annotation
Key element of a ‘3A cycle’: Annotation, Abstraction, Analysis.
Richer annotation:
–more effective abstraction
–deeper research questions?
Multiple layers of annotation:
–new research questions
–studying interaction between layers
Algorithmic vs. human annotation
More information
References
–Aarts, B., Close, J. and Wallis, S.A. (2013) Choices over time: methodological issues in current change. In Aarts, Close, Leech and Wallis (eds.) The Verb Phrase in English. Cambridge: Cambridge University Press.
–Nelson, G., Wallis, S.A. and Aarts, B. (2002) Exploring Natural Language. Amsterdam: John Benjamins.
–Wallis, S.A. (2011) Comparing χ² tests for separability. London: Survey of English Usage.
Useful links
–Survey of English Usage: www.ucl.ac.uk/english-usage
–Fuzzy Tree Fragments: www.ucl.ac.uk/english-usage/resources/ftfs
–Statistics and methodology research blog: http://corplingstats.wordpress.com