Capturing linguistic interaction in a grammar
A method for empirically evaluating the grammar of a parsed corpus
Sean Wallis, Survey of English Usage, University College London

Capturing linguistic interaction...
Parsed corpus linguistics
Empirical evaluation of grammar
Experiments
–Attributive AJPs
–Preverbal AVPs
–Embedded postmodifying clauses
Conclusions
–Comparing grammars or corpora
–Potential applications

Parsed corpus linguistics
Several million-word parsed corpora exist
Each sentence analysed in the form of a tree
–different languages have been analysed
–limited amount of spontaneous speech data
Commitment to a particular grammar required
–different schemes have been applied
–problems: computational completeness + manual consistency
Tools support linguistic research in corpora

Parsed corpus linguistics
An example tree from ICE-GB (spoken), text S1A-006 #23

Parsed corpus linguistics
Three kinds of evidence may be obtained from a parsed corpus:
1. Frequency evidence of a particular known rule, structure or linguistic event
2. Coverage evidence of new rules, etc.
3. Interaction evidence of the relationship between rules, structures and events
This evidence is necessarily framed within a particular grammatical scheme
–So… how might we evaluate this grammar?

Empirical evaluation of grammar
Many theories, frameworks and grammars
–no agreed evaluation method exists
–linguistics is divided into competing camps
–status of parsed corpora ‘suspect’
Possible method: retrievability of events
–circularity: you get out what you put in
–redundancy: ‘improvement’ by mere addition
–atomic: based on single events, not patterns
–specificity: based on particular phenomena
New method: retrievability of event sequences

Experiment 1: attributive AJPs
Adjectives before a noun in English
Simple idea: plot the frequency of NPs with at least n = 0, 1, 2, 3… attributive AJPs
[Charts: raw frequency and log frequency; NB the log-frequency plot is not a straight line]
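To make the "at least n" counting concrete, here is a minimal Python sketch. The frequencies are invented placeholders standing in for counts exported from corpus queries (they are not ICE-GB figures); the point is the cumulative "at least n" transformation and the log-frequency check for a straight line.

import math

# Hypothetical frequencies of NPs containing exactly n attributive AJPs,
# standing in for counts exported from a parsed corpus (not ICE-GB figures).
freq_exact = {0: 100000, 1: 20000, 2: 2500, 3: 250, 4: 20}

# "At least n" frequencies: cumulative sums from the top down.
max_n = max(freq_exact)
freq_at_least = {}
running = 0
for n in range(max_n, -1, -1):
    running += freq_exact.get(n, 0)
    freq_at_least[n] = running

# If the log frequencies fall on a straight line, the probability of
# adding each further AJP is constant (no interaction between decisions).
for n in range(max_n + 1):
    print(n, freq_at_least[n], round(math.log(freq_at_least[n]), 3))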

Experiment 1: analysis of results
If the log-frequency line is straight:
–exponential fall in frequency (constant probability)
–no interaction between decisions (cf. coin tossing)
Sequential probability analysis:
–calculate the probability of adding each AJP
–error bars (binomial)
–probability falls: second < first, third < second, fourth < second
–decisions interact
–fit to a power law y = m·x^k, and find m and k
[Chart: probability of adding each successive AJP, with error bars and fitted curve]
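A minimal sketch of the sequential probability analysis described on this slide. The "at least n" frequencies are again invented placeholders, the Wilson score interval stands in for the binomial error bars, and the power law y = m·x^k is fitted by ordinary least squares in log-log space; none of this is the author's own spreadsheet implementation.

import math

# Hypothetical "at least n AJPs" frequencies (illustrative, not ICE-GB data).
F = {0: 122770, 1: 22770, 2: 2770, 3: 270, 4: 20}

def wilson(p, n, z=1.96):
    """Wilson score interval for a proportion p observed in n trials."""
    centre = (p + z * z / (2 * n)) / (1 + z * z / n)
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / (1 + z * z / n)
    return centre - spread, centre + spread

# Probability of adding the (n+1)-th AJP given that n are already present.
points = []
for n in range(max(F)):
    p = F[n + 1] / F[n]
    lo, hi = wilson(p, F[n])
    points.append((n + 1, p))
    print(f"p({n + 1}) = {p:.4f}  Wilson 95% CI [{lo:.4f}, {hi:.4f}]")

# Least-squares fit of a power law p = m * x^k in log-log space (x >= 1).
xs = [math.log(x) for x, p in points]
ys = [math.log(p) for x, p in points]
n_pts = len(xs)
mean_x, mean_y = sum(xs) / n_pts, sum(ys) / n_pts
k = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
m = math.exp(mean_y - k * mean_x)
print(f"power law fit: p = {m:.3f} * x^{k:.3f}")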

Experiment 1: explanations?
Feedback loop: with each successive AJP, it becomes more difficult to add a further AJP
–Explanation 1: semantic constraints
  we tend to say tall green ship
  we do not tend to say tall short ship or green tall ship
–Explanation 2: communicative economy
  once the speaker has said tall green ship, they tend to say only ship
–Further investigation required
General principle:
–a significant change (usually a fall) in probability is evidence of an interaction along a grammatical axis

Experiments 2, 3: variations
1. Restrict the head: common and proper nouns
–Common nouns: similar results
–Proper nouns and adjectives are often treated as compounds (Northern England vs. lower Loire)
2. Ignore the grammar: adjective + noun strings
–Some misclassifications / miscounting (‘noise’):
  she was [beautiful, people] said; tall very [green ship]
–Similar results, slightly weaker (third < second not significant at p = 0.01)
–Insufficient evidence for grammar (null hypothesis: simple lexical adjacency)
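The "ignore grammar" variant reduces to counting runs of adjectives immediately before a noun in tagged text. A minimal sketch under that assumption follows; the (word, tag) representation and the tag labels ADJ/NOUN/DET/VERB are illustrative conventions, not the ICE-GB annotation scheme.

from collections import Counter

# Count runs of exactly k adjectives immediately preceding a noun in a
# POS-tagged sentence: the string-based null hypothesis of lexical adjacency.
def adjective_run_lengths(tagged_sentence):
    lengths = Counter()
    run = 0
    for word, tag in tagged_sentence:
        if tag == "ADJ":
            run += 1
        elif tag == "NOUN":
            lengths[run] += 1
            run = 0
        else:
            run = 0
    return lengths

sentence = [("the", "DET"), ("tall", "ADJ"), ("green", "ADJ"),
            ("ship", "NOUN"), ("sailed", "VERB")]
print(adjective_run_lengths(sentence))   # Counter({2: 1})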

Experiment 4: preverbal AVPs
Consider adverb phrases before a verb
–Results are very different:
  the probability does not fall significantly between the first and second AVP
  the probability does fall between the second and third AVP
–Possible constraints: (weak) communicative rather than (strong) semantic
–Further investigation needed
–Not a power law: R² < 0.24
[Chart: probability of adding each successive preverbal AVP]
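Judging whether the probability "does not fall significantly" between successive positions requires a test for a difference between two proportions. A simple two-proportion z-test is sketched below with invented counts; it is one conventional option, not necessarily the test used in the original study.

import math

def two_proportion_z(success1, total1, success2, total2):
    """z statistic for H0: the two underlying proportions are equal."""
    p1, p2 = success1 / total1, success2 / total2
    pooled = (success1 + success2) / (total1 + total2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total1 + 1 / total2))
    return (p1 - p2) / se

# Hypothetical counts: cases with >= 1 modifier out of all opportunities,
# versus cases with >= 2 modifiers out of those with >= 1.
z = two_proportion_z(2770, 22770, 270, 2770)
print(f"z = {z:.2f}  (|z| > 1.96 -> significant at p < 0.05)")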

Experiment 5: embedded clauses
Another way to specify nouns in English:
–add a clause after the noun to explicate it
  the ship [that was tall and green]
  the ship [in the port]
–may be embedded
  the ship [in the port [with the ancient lighthouse]]
–or successively postmodified
  the ship [in the port][with a very old mast]
Compare successive embedding and sequential postmodifying clauses
–Axis = embedding depth / sequence length

Experiment 5: method
Extract examples with FTFs (Fuzzy Tree Fragments)
–at least n levels of embedded postmodification
  [FTF diagrams for n = 1, 2, … levels]
–problems:
  multiple matching cases (use ICECUP IV to classify)
  overlapping cases (subtract the extra case)
  co-ordination of clauses or NPs (use alternative patterns)
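For readers without ICECUP to hand, the quantity being extracted, the depth of successively embedded postmodifying clauses, can be illustrated with a toy nested data structure. The dictionary representation below is purely hypothetical; real extraction from ICE-GB uses FTFs over the corpus trees.

# A toy representation of NP postmodification: each NP is a dict whose
# "postmodifiers" are clauses, and each clause may itself contain an NP.
def embedding_depth(np):
    """Depth of successively embedded postmodifying clauses under an NP."""
    depths = [0]
    for clause in np.get("postmodifiers", []):
        inner = clause.get("np")
        if inner is not None:
            depths.append(1 + embedding_depth(inner))
        else:
            depths.append(1)
    return max(depths)

# "the ship [in the port [with the ancient lighthouse]]" -> depth 2
ship = {"head": "ship",
        "postmodifiers": [{"np": {"head": "port",
                                  "postmodifiers": [{"np": {"head": "lighthouse"}}]}}]}
print(embedding_depth(ship))  # 2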

Experiment 5: analysis of results
The probability of adding a further embedded clause falls with each level
–second < first
–sequential < embedding
Embedding only:
–third < first
–insufficient data for third < second
Fitting to f = m·x^k:
–k < 0 means a fall (f = m/x^|k|)
–high |k| means a steep fall
Conclusion:
–Interaction along both the embedding and sequential axes
–Both match a power law: R² > 0.99
[Chart: sequential vs. embedded probabilities with fitted power-law curves]

Experiment 5: explanations?
Lexical adjacency?
–No: 87% of 2-level cases have at least one VP, NP or clause between the upper and lower heads
Misclassified cases of embedding?
–No: very few (5%) semantically ambiguous cases
Language production constraints?
–Possibly, but it could also be communicative economy
–contrast spontaneous speech with other modes
Positive ‘proof’ of recursive tree grammar
–Established from a parsed corpus
–cf. negative ‘proof’ (NLP parsing problems)

Conclusions
A new method for evaluating interactions along grammatical axes
–General-purpose, robust, structural
–More abstract than ‘linguistic choice’ experiments
–Depends on a concept of grammatical distance along an axis, based on the chosen grammar
The method has philosophical implications
–Grammar viewed as a structure of linguistic choices
–Linguistics as an evaluable observational science
A signature (trace) of language production decisions
–A unification of theoretical and corpus linguistics?

Comparing grammars or corpora
Can we reliably retrieve known interaction patterns with different grammars?
–Do these patterns differ across corpora?
Benefits over individual event retrieval:
–non-circular: generalisation across local syntax
–not subject to redundancy: arbitrary terms make trends more difficult to retrieve
–not atomic: based on patterns of interaction
–general: patterns may have multiple explanations
Supplements retrieval of events

Potential applications
Corpus linguistics
–Optimising an existing grammar, e.g. co-ordination, compound nouns
Theoretical linguistics
–Comparing different grammars of the same language
–Comparing different languages or periods
Psycholinguistics
–Searching for evidence of language production constraints in spontaneous speech corpora
  speech and language therapy
  language acquisition and development

Links and further reading
Survey of English Usage –
Corpora and grammar
–.../projects/ice-gb
Full paper
–.../staff/sean/resources/analysing-grammatical-interaction.pdf
Sequential analysis spreadsheet (Excel)
–.../staff/sean/resources/interaction-trends.xls