Workshop: Corpus (1) What might a corpus of spoken data tell us about language? OLINCO 2014 Olomouc, Czech Republic, June 7 Sean Wallis Survey of English.

Presentation transcript:

Workshop: Corpus (1) What might a corpus of spoken data tell us about language? OLINCO 2014 Olomouc, Czech Republic, June 7 Sean Wallis Survey of English Usage University College London

Outline
What can a corpus tell us?
The 3A cycle
What can a parsed corpus tell us?
ICE-GB and DCPSE
Diachronic changes
–Modal shall/will over time
Intra-structural priming
–NP premodification
The value of interaction evidence

What can a corpus tell us?
Three kinds of evidence may be obtained from a corpus
–Frequency (distribution) evidence of a particular known linguistic event
–Coverage (discovery) evidence of new events
–Interaction evidence of the relationship between events
But if these ‘events’ are lexical, this evidence can only really tell us about lexis
–So corpus linguistics has always involved annotation

The 3A cycle
Plain text corpora
–evidence of lexical phenomena
Need to annotate
–add knowledge of frameworks
–classify and relate phenomena
–general annotation scheme not focused on particular research goals
Corpus research = the ‘3A’ cycle
–Annotation → Abstraction → Analysis
[Diagram: Text → (Annotation) → Corpus → (Abstraction) → Dataset → (Analysis) → Hypotheses; abstraction is a data transformation (“operationalisation”)]

Annotation → Abstraction
Abstraction
–selects data from annotated corpus
–maps it to a regular dataset for statistical analysis
–bi-directional (“concretisation”): allows us to interpret statistically significant results
Even ‘lexical’ questions need annotation:
–1st person declarative modal verb shall/will
–abstraction relies on annotation
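The abstraction step can be sketched as a small filter-and-count routine. This is only an illustration: the record fields (`lemma`, `person`, `clause`, `year`) and the example cases are invented here, and a real parsed corpus such as DCPSE would be queried through its own tools rather than a list of dictionaries.

```python
from collections import Counter

# Hypothetical annotated cases: each dict stands in for one modal verb
# found in a parsed corpus, with the annotation needed to classify it.
cases = [
    {"lemma": "shall", "person": 1, "clause": "declarative", "year": 1960},
    {"lemma": "will", "person": 1, "clause": "declarative", "year": 1960},
    {"lemma": "will", "person": 3, "clause": "declarative", "year": 1960},
    {"lemma": "shall", "person": 1, "clause": "interrogative", "year": 1991},
    {"lemma": "will", "person": 1, "clause": "declarative", "year": 1991},
]


def abstract(cases):
    """Select 1st person declarative shall/will and count by (year, lemma)."""
    table = Counter()
    for c in cases:
        if (c["person"] == 1
                and c["clause"] == "declarative"
                and c["lemma"] in ("shall", "will")):
            table[(c["year"], c["lemma"])] += 1
    return table


# The resulting table is the "regular dataset" handed on to analysis.
dataset = abstract(cases)
for key in sorted(dataset):
    print(key, dataset[key])
```

The selection conditions use only the annotation (person, clause type), which is the point of the slide: even a question about two words cannot be abstracted from plain text alone.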

What can a parsed corpus tell us?
Three kinds of evidence may be obtained from a parsed corpus
–Frequency evidence of a particular known rule, structure or linguistic event
–Coverage evidence of new rules, etc.
–Interaction evidence of the relationship between rules, structures and events
BUT evidence is necessarily framed within a particular grammatical scheme
–So… (an obvious question) how might we evaluate this grammar?

What can a parsed corpus tell us?
Parsed corpora contain (lots of) trees
–Use Fuzzy Tree Fragment queries to get data
–An FTF
–A matching case in a tree
–Using ICECUP (Nelson et al, 2002)

What can a parsed corpus tell us?
Trees as handle on data
–make useful distinctions
–retrieve cases reliably
–not necessary to “agree” to framework used, provided distinctions are meaningful
Trees as trace of language production process
–interaction between decisions leaves a probabilistic effect on overall performance
  not simple to distinguish between sources
–depends on the framework but may also validate it

Why spoken corpora?
Speech predates writing
–historically: literacy growth and spread
–child development: internal speech during writing
Scale
–professional authors recommend 1,000 words/day
–1 hour of speech ≈ 8,000 words (DCPSE)
Spontaneity
–production process lost: many written sources edited
Dialogue
–interaction between speakers

ICE-GB and DCPSE
British Component of the International Corpus of English ( )
–1 million words (nominal)
–60% spoken, 40% written
–speech component is orthographically transcribed
–fully parsed: marked up, POS-tagged, parsed, hand-corrected
Diachronic Corpus of Present-day Spoken English
–800,000 words (nominal)
–orthographically transcribed and fully parsed
–created from subsamples of LLC and ICE-GB
  matching numbers of texts in text categories
  not sampled over equal duration: LLC ( ), ICE-GB ( )

Modal shall vs. will over time
Plotting modal shall/will over time (DCPSE)
Small amounts of data / year
Confidence intervals identify the degree of certainty in our results
Highly skewed p in some cases
–p = 0 or 1 (circled)
We can now estimate an approximate downwards curve (Aarts et al., 2013)
[Figure: p(shall | {shall, will}) plotted over time, with confidence intervals]
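One way to obtain a confidence interval that still behaves sensibly with small samples and with p = 0 or 1 is the Wilson score interval, which the later NP premodification slide names for its error bars. A minimal sketch follows; the function name and the example counts are made up for illustration and are not taken from DCPSE.

```python
import math


def wilson(p, n, z=1.959964):
    """Wilson score interval (lower, upper) for proportion p from n cases.

    z is the two-tailed critical value of the Normal distribution
    (about 1.96 for a 95% interval).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Hypothetical year with 10 cases of shall/will, none of them shall.
lower, upper = wilson(0 / 10, 10)
print(round(upper, 3))
```

Unlike the simple Normal approximation, the interval never extends below 0 or above 1, so a skewed observation such as p = 0 still gets a usable upper bound.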

Intra-structural priming
Priming effects within a structure
–Study repeating an additive step in structures
Consider
–a phrase or clause that may (in principle) be extended ad infinitum
  e.g. an NP with a noun head
–a single additive step applied to this structure
  e.g. add an attributive AJP before the head
–Q. What is the effect of repeatedly applying this operation to the structure?
[Figure: NP tree with head noun ship (N), with attributive AJPs added one at a time: tall, very green, old]

NP premodification
Sequential probability analysis
–calculate probability of adding each AJP
–error bars: Wilson intervals
–probability falls: second < first, third < second
–decisions interact
–Every AJP added makes it harder to add another
[Figure: probability of adding each successive AJP, with Wilson intervals]
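The sequential probability calculation can be sketched as follows, assuming we have counted how many NPs contain at least k attributive AJPs for k = 0, 1, 2, and so on. The counts below are invented purely to show the arithmetic; they are not the corpus results behind the slide.

```python
def sequential_probabilities(at_least):
    """Probability of adding the k-th AJP, given k-1 are already present.

    at_least[k] is the number of NPs with at least k attributive AJPs,
    so at_least[0] is the number of all NPs considered.
    """
    return [at_least[k] / at_least[k - 1] for k in range(1, len(at_least))]


# Hypothetical counts of NPs with at least 0, 1, 2 and 3 attributive AJPs.
counts = [10000, 2000, 200, 10]
probs = sequential_probabilities(counts)

# If the decisions were independent, these probabilities would be constant.
falling = all(later < earlier for earlier, later in zip(probs, probs[1:]))
print(probs, falling)
```

A constant sequence would mean each decision to add an AJP is unaffected by the previous ones; a falling sequence is the interaction pattern the slide describes.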

NP premodification: explanations?
Feedback loop: for each successive AJP, it is more difficult to add a further AJP
Possible explanations include:
logical and semantic constraints
–tend to say the tall green ship
–do not tend to say tall short ship or green tall ship
communicative economy
–once speaker said tall green ship, tends to only say ship
memory/processing constraints
–unlikely: this is a small structure, as are AJPs

NP premod’n: speech vs. writing
Spoken vs. written subcorpora
–Same overall pattern
–Spoken data tends to have fewer attributive AJPs
  Support for communicative economy or memory/processing hypotheses?
–Significance tests
  Paired 2x1 Wilson tests (Wallis 2011)
  first and second observed spoken probabilities are significantly smaller than written probability
[Figure: probability of adding each successive AJP, written vs. spoken]
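A comparison of a spoken and a written probability can be sketched with Newcombe's interval for the difference between two independent proportions, which is built from two Wilson intervals. This is offered as a related, commonly used technique; the slide does not spell out the procedure from Wallis (2011), so it should not be read as the exact test used. All names and numbers are illustrative assumptions.

```python
import math


def wilson(p, n, z=1.959964):
    """Wilson score interval (lower, upper) for proportion p from n cases."""
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def newcombe_wilson(p1, n1, p2, n2, z=1.959964):
    """Interval for the difference p1 - p2 of two independent proportions."""
    l1, u1 = wilson(p1, n1, z)
    l2, u2 = wilson(p2, n2, z)
    d = p1 - p2
    lower = d - math.sqrt((p1 - l1) ** 2 + (u2 - p2) ** 2)
    upper = d + math.sqrt((u1 - p1) ** 2 + (p2 - l2) ** 2)
    return lower, upper


def significantly_different(p1, n1, p2, n2, z=1.959964):
    """True if the difference interval excludes zero."""
    lower, upper = newcombe_wilson(p1, n1, p2, n2, z)
    return lower > 0 or upper < 0


# Hypothetical written vs. spoken probabilities of adding a first AJP.
print(significantly_different(0.20, 1000, 0.10, 1000))
print(significantly_different(0.50, 10, 0.40, 10))
```

With large samples a difference of 0.10 is clearly separable from zero, whereas the same difference observed on ten cases each is not, which is why the sample size for each point matters as much as the probabilities themselves.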

Potential sources of interaction
shared context
–topic or ‘content words’ (Noriega)
idiomatic conventions
–semantic ordering of attributive adjectives (tall green ship)
logical-semantic constraints
–exclusion of incompatible adjectives (?tall short ship)
communicative constraints
–brevity on repetition (just say ship next time)
psycholinguistic processing constraints
–attention and memory of speakers

What use is interaction evidence?
Corpus linguistics
–Optimising existing grammar
  e.g. co-ordination, compound nouns
Theoretical linguistics
–Comparing different grammars, same language
–Comparing different languages or periods
Psycholinguistics
–Search for evidence of language production constraints in spontaneous speech corpora
  speech and language therapy
  language acquisition and development

What can a parsed corpus tell us?
Trees as handle on data
–make useful distinctions
–retrieve cases reliably
–not necessary to “agree” to framework used, provided distinctions are meaningful
Trees as trace of language production process
–interaction between decisions leaves a probabilistic effect on overall performance
  not simple to distinguish between sources
–results enabled by the framework but may also validate it

The importance of annotation
Key element of a ‘3A cycle’
–Annotation → Abstraction → Analysis
Richer annotation
–more effective abstraction
–deeper research questions?
Multiple layers of annotation
–new research questions
–studying interaction between layers
Algorithmic vs. human annotation

More information
References
Aarts, B., Close, J. and Wallis, S.A. (2013) Choices over time: methodological issues in current change. In Aarts, Close, Leech and Wallis (eds) The Verb Phrase in English. Cambridge University Press.
Nelson, G., Wallis, S.A. and Aarts, B. (2002) Exploring Natural Language. Amsterdam: John Benjamins.
Wallis, S.A. (2011) Comparing χ² tests for separability. London: Survey of English Usage.
Useful links
–Survey of English Usage
–Fuzzy Tree Fragments
–Statistics and methodology research blog