Using PaQu for language acquisition research
Jan Odijk
CLARIN 2015 Conference, Wroclaw, 2015-10-16


Overview
– Introduction
– CHILDES Corpora
– PaQu
– Evaluation & Analysis
– Conclusions
– Future Work

Introduction
(See [Odijk 2011, 2014] for more data and qualifications)

Cat | init | modifier | predicate | rest
A | Hij is daar | heel / erg / zeer | blij | mee
gloss | He is there | very | happy | with
P | Hij is daar | *heel / erg / zeer | in zijn sas | mee
gloss | He is there | very | happy | with
V | …omdat dat mij | *heel / erg / zeer | verbaast |
gloss | …because that me | very | surprises |

Introduction
– Distinction is purely syntactic
– Cannot be derived from semantic differences
– Correlation with other known facts unlikely
– Cannot be derived from general (universal) principles
– → must be acquired by L1 learners of Dutch

Introduction
– Minimal pair in acquisition
– Requires acquisition of negative property
  – No evidence in the input
  – No ‘correction’ or correction ignored
– May provide evidence for/against relevant hypotheses
  – E.g. Indirect Negative Evidence hypothesis
  – Absence of evidence → evidence for absence

Corpus Analysis
– Problem: Ambiguity
  – Heel: 7-fold ambiguous
  – Erg: 4-fold ambiguous
  – Zeer: 3-fold ambiguous
  (as any decent natural language word)
– For our purposes:
  – Morpho-syntactic and syntactic properties resolve the ambiguities

Corpus Analysis
– [Odijk 2014]
  – Automatic corpus analysis: GrETEL, OpenSONAR, COAVA, LWRS, CMD
  – These apply to specific corpora only
  – Manual corpus analysis of the CHILDES Van Kampen Corpus
– How can I apply these applications to my own corpus?
  – → request for PaQu (extends LWRS), AutoSearch (extends CMD), …

PaQu
– PaQu = Parse and Query
– Web application made by Groningen University
– Upload corpus
  – Plain text or in Alpino format
  – Plain text is automatically parsed by Alpino
– Resulting treebank can be searched and analyzed
  – Search: word relations interface and XPath queries
  – Analysis: user-definable statistics on search results (and metadata)
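To give a feel for what a treebank query looks like, here is a minimal Python sketch that searches a hand-written, simplified Alpino-style parse for modifier uses of a given lemma. The toy XML, the function name `find_modifiers` and the chosen attribute values are assumptions for illustration only; PaQu runs full XPath in its web interface, whereas this sketch uses the limited XPath subset of Python's standard library.

```python
import xml.etree.ElementTree as ET

# Hand-written toy parse in a simplified Alpino-style layout (an assumption,
# not real Alpino output): nested <node> elements with rel/cat/pos/lemma.
TOY_PARSE = """
<alpino_ds>
  <node cat="top" rel="top">
    <node cat="smain" rel="--">
      <node rel="su" pos="pron" word="hij" lemma="hij"/>
      <node rel="hd" pos="verb" word="is" lemma="zijn"/>
      <node cat="ap" rel="predc">
        <node rel="mod" pos="adj" word="heel" lemma="heel"/>
        <node rel="hd" pos="adj" word="blij" lemma="blij"/>
      </node>
    </node>
  </node>
  <sentence>hij is heel blij</sentence>
</alpino_ds>
"""


def find_modifiers(xml_text, lemma):
    """Return (word, parent category, head part of speech) per modifier hit."""
    root = ET.fromstring(xml_text)
    # ElementTree has no parent axis, so build a child -> parent lookup.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    hits = []
    # In full XPath this could be written as:
    #   //node[@rel="mod" and @lemma="heel"]
    # ElementTree needs chained predicates instead of "and".
    for node in root.findall(f".//node[@rel='mod'][@lemma='{lemma}']"):
        parent = parent_of[node]
        head = parent.find("node[@rel='hd']")  # sister node acting as head
        head_pos = head.get("pos") if head is not None else None
        hits.append((node.get("word"), parent.get("cat"), head_pos))
    return hits


print(find_modifiers(TOY_PARSE, "heel"))
```

Reading off the head's part of speech is what separates a modifier of an adjective (mod A) from a modifier of a verb or preposition, which is the distinction the experiments rely on.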

Experiments
– Take the Dutch CHILDES corpora
– Select all utterances containing heel, erg or zeer
– Clean the utterances, e.g.
  – ja, maar [//] we bewaren (he)t ook → ja, maar we bewaren het ook
– Upload them into PaQu
– Gather statistics and draw conclusions
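The selection and cleaning steps can be sketched in Python as follows. This is a minimal illustration that handles only the two transcription conventions visible in the example (a bracketed code such as [//] and omitted material in parentheses such as (he)t); the function names are hypothetical and this is not the script actually used for the experiments.

```python
import re

TARGET_WORDS = {"heel", "erg", "zeer"}


def clean_utterance(utterance: str) -> str:
    """Strip a few transcription codes from one utterance (simplified)."""
    # Drop bracketed codes such as the retracing marker [//].
    text = re.sub(r"\[[^\]]*\]", " ", utterance)
    # Restore omitted material written in parentheses: (he)t -> het.
    text = re.sub(r"\((\w+)\)", r"\1", text)
    # Collapse the whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()


def contains_target(utterance: str) -> bool:
    """True if the utterance contains heel, erg or zeer as a whole word."""
    tokens = re.findall(r"\w+", utterance.lower())
    return any(token in TARGET_WORDS for token in tokens)


print(clean_utterance("ja, maar [//] we bewaren (he)t ook"))
# prints: ja, maar we bewaren het ook
```

Real transcripts use many more codes than these two, so a production cleaner would need a fuller treatment of the transcription format.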

Experiment 1
– Adult utterances of the Van Kampen Corpus
– Manual annotation used as gold standard (Acc)
– Alpino makes finer distinctions: I mapped these to the gold-standard categories
– Annotation errors in the gold standard: revised gold standard (Rev Acc)

Experiment 1: Results
Accuracy
Columns: word, Acc, Rev Acc
heel
erg
zeer 0.21
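As an illustration of how per-word accuracy against a gold standard can be computed after mapping the parser's finer labels onto coarser ones, consider the sketch below. The label names in `LABEL_MAP` and the records in `toy_records` are invented for the example and are not the experiment's data or results.

```python
from collections import defaultdict

# Hypothetical mapping from fine-grained parser labels to gold-standard labels.
LABEL_MAP = {"mod_adj": "mod A", "mod_adv": "mod A", "mod_verb": "mod V"}


def accuracy_per_word(records, label_map=LABEL_MAP):
    """records: iterable of (word, parser_label, gold_label) triples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for word, parser_label, gold_label in records:
        # Collapse the parser's finer distinction before comparing.
        mapped = label_map.get(parser_label, parser_label)
        total[word] += 1
        correct[word] += int(mapped == gold_label)
    return {word: correct[word] / total[word] for word in total}


# Invented toy records, purely to show the calculation.
toy_records = [
    ("heel", "mod_adj", "mod A"),
    ("heel", "mod_adv", "mod A"),
    ("erg", "mod_verb", "mod V"),
    ("erg", "mod_adj", "mod V"),
]

print(accuracy_per_word(toy_records))
```

Running the same function against a corrected set of gold labels would give the revised accuracy.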

Experiment 1: Interpretation
– Good for heel, erg
– Bad for zeer, but:
  – Completely due to zeer doen (lit. pain(ful) do, ‘to hurt’)
  – Can be identified very easily in PaQu
– Generalisability: limited
  – It concerns (cleaned) adult speech
  – It concerns relatively short sentences, explicitly separated
  – It mostly concerns a very local grammatical relation

Experiment 2: Results
All adults’ utterances

Results | mod A | mod N | mod V | mod P | predc | other | unclear | Total
heel
erg
zeer

Experiment 2: Interpretation
– Heel most frequent (almost 54%)
– Heel as mod A overwhelming: > 93%
– Heel as mod V or mod P: wrong analysis
– Mod A and mod V more balanced for erg
– Evidence for zeer mostly lacking
  – Cases of mod V are mostly wrong analyses
– Evidence for mod P mostly lacking
  – Some evidence for erg, zeer (4 occurrences)
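Percentages like those in the interpretation above come from a cross-tabulation of word by grammatical relation. A minimal Python sketch of that calculation follows; the hit list is invented toy data, not the counts from the experiment, and the function names are hypothetical.

```python
from collections import Counter


def crosstab(hits):
    """Count occurrences of each (word, relation) pair."""
    return Counter(hits)


def relation_share(counts, word, relation):
    """Share of one relation among all hits for the given word."""
    row_total = sum(n for (w, _), n in counts.items() if w == word)
    return counts[(word, relation)] / row_total if row_total else 0.0


def word_share(counts, word):
    """Share of one word among all hits."""
    grand_total = sum(counts.values())
    word_total = sum(n for (w, _), n in counts.items() if w == word)
    return word_total / grand_total if grand_total else 0.0


# Invented toy hits, purely to show the calculation.
toy_hits = [
    ("heel", "mod A"), ("heel", "mod A"), ("heel", "mod A"), ("heel", "mod N"),
    ("erg", "mod A"), ("erg", "mod V"),
]

counts = crosstab(toy_hits)
print(relation_share(counts, "heel", "mod A"))
print(word_share(counts, "heel"))
```

In PaQu itself such tables are produced by the user-definable statistics on search results; the sketch only shows the arithmetic behind the shares.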

Experiment 3
– Van Kampen children’s speech: accuracy
– Similar to the adults’ speech but slightly lower

Word | Acc
heel | 0.90
erg | 0.73
zeer | 0.17

Conclusions
– Linguistics:
  – No examples for mod P: how to explain heel v. erg, zeer?
  – The overwhelming dominance of mod A for heel might be a relevant factor
  – Current Dutch CHILDES corpora probably too small to draw reliable conclusions

Conclusions
– PaQu:
  – PaQu is very useful for doing better and more efficient manual verification of hypotheses
  – In some cases its parses and their statistics can reliably be used directly (though care is required!)
  – Several small details were improved, and small additions to functionality were made, through these experiments

Future Work
– More experiments for the children’s speech (cf. [Odijk 2014:34])
– Similar experiments for other examples
  – te ‘too’ v. overmatig ‘excessively’; worden ‘become’ v. raken ‘get’ and others
– Extend PaQu to include all relevant ‘metadata’
– Extend PaQu to natively support common formats such as CHAT, FoLiA, TEI, …
– Make similar system for GrETEL, OpenSONAR
– Manually verify (parts of) parses for CHILDES corpora (most is being done in CLARIAH-NL or UU AnnCor)

Thanks for your attention!
Visit the Demo at 16:30!
Visit the Bazaar at 14:30 for a completely different use of PaQu!

Correlation with other Differences?
→ NO!

Phenomenon | Opposes | Versus
Mod V, P | heel | erg, zeer
Meaning | erg | heel, zeer
Inflection | heel, erg | zeer
Comparative, Superlative | erg | heel, zeer
Modifiee | erg | heel, zeer
Pragmatics | zeer | heel, erg

Ambiguity: HEEL

word | Morpho-syntax | Syntax | Meaning
heel | A | Mod N | (1) ‘whole’ (2) ‘in one piece’ (3) ‘large’
 | | Predc | ‘in one piece’
 | | Mod A | ‘very’
 | Vf | | (1) ‘heal’ (2) ‘receive’

Ambiguity: ERG

word | Morpho-syntax | Syntax | Meaning
erg | N utrum | | ‘erg’
 | N neutrum | | ‘evil’
 | A | Mod N, predc | ‘bad’, ‘awful’
 | | Mod A V P | ‘very’

Ambiguity: ZEER

word | Morpho-syntax | Syntax | Meaning
zeer | N | | ‘pain’
 | A | Mod N, predc | ‘painful’
 | | Mod A V P | ‘very’