Linguistic Research with PaQu Jan Odijk, Utrecht University Small Experiment (was intended as a user test) Take all Dutch CHILDES corpora Select all adult.

Slides:



Advertisements
Similar presentations
Corpus Linguistics Richard Xiao
Advertisements

CLL Session 3: L2 Research Methodology LAEL, Lancaster University Florencia Franceschina.
Mitglied der Leibniz-Gemeinschaft Querying Spoken Language Corpora Thomas Schmidt IDS Mannheim.
Controlled Vocabularies in TELPlus Antoine ISAAC Vrije Universiteit Amsterdam EDLProject Workshop November 2007.
The CLARIN INFRASTRUCTURE Jan Odijk MA Rotation Utrecht,
1 Linguistic Research And CLARIN Jan Odijk NIAS 29 Mar 2011.
SEEING THE WOOD FOR THE TREES Liesbeth Augustinus Vincent Vandeghinste Frank Van Eynde.
Example-Based Treebank Querying Liesbeth Augustinus Vincent Vandeghinste Frank Van Eynde CLARIN Sofia,
Finding your way through the woods with GrETEL Liesbeth Augustinus Vincent Vandeghinste Ineke Schuurman Frank Van Eynde TABU-dag - June 14, 2013.
Example queries for Federated search Jan Odijk CLARIN Federated Search Workshop Copenhagen, 24 Apr
CLARIN for Linguists Search Illustration 1 Jan Odijk LOT Summerschool Nijmegen,
Psycholinguistic what is psycholinguistic? 1-pyscholinguistic is the study of the cognitive process of language acquisition and use. 2-The scope of psycholinguistic.
Linguistics with CLARIN OpenSONAR Jan Odijk LOT Winterschool Amsterdam,
Statistical NLP: Lecture 3
WG3: Innovative e-dictionaries Simon Krek „Jožef Stefan“ Institute, Ljubljana, Slovenia Carole Tiberius Institute of Dutch Lexicology, Leiden, the Netherlands.
January 12, Statistical NLP: Lecture 2 Introduction to Statistical NLP.
Overall Information Extraction vs. Annotating the Data Conference proceedings by O. Etzioni, Washington U, Seattle; S. Handschuh, Uni Krlsruhe.
1 Noun Homograph Disambiguation Using Local Context in Large Text Corpora Marti A. Hearst Presented by: Heng Ji Mar. 29, 2004.
Input-Output Relations in Syntactic Development Reflected in Large Corpora Anat Ninio The Hebrew University, Jerusalem The 2009 Biennial Meeting of SRCD,
Information Retrieval and Extraction 資訊檢索與擷取 Chia-Hui Chang National Central University
Part of speech (POS) tagging
1 Human simulations of vocabulary learning Présentation Interface Syntaxe-Psycholinguistique Y-Lan BOUREAU Gillette, Gleitman, Gleitman, Lederer.
Psycholinguistics 12 Language Acquisition. Three variables of language acquisition Environmental Cognitive Innate.
Introduction to Machine Learning Approach Lecture 5.
WG3: Innovative e-dictionaries Simon Krek „Jožef Stefan“ Institute, Ljubljana, Slovenia Carole Tiberius Institute of Dutch Lexicology, Leiden, the Netherlands.
L ANGUAGE - LEARNING AND TEACHING PROCESSES AND YOUNG CHILDREN (Chapter 6) Miss. Mona AL-Kahtani.
Sentiment Analysis with a Multilingual Pipeline 12th International Conference on Web Information System Engineering (WISE 2011) October 13, 2011 Daniëlla.
Aiding WSD by exploiting hypo/hypernymy relations in a restricted framework MEANING project Experiment 6.H(d) Luis Villarejo and Lluís M à rquez.
Computational Methods to Vocalize Arabic Texts H. Safadi*, O. Al Dakkak** & N. Ghneim**
BTANT 129 w5 Introduction to corpus linguistics. BTANT 129 w5 Corpus The old school concept – A collection of texts especially if complete and self-contained:
Empirical Methods in Information Extraction Claire Cardie Appeared in AI Magazine, 18:4, Summarized by Seong-Bae Park.
Corpus-based computational linguistics or computational corpus linguistics? Joakim Nivre Uppsala University Department of Linguistics and Philology.
Learner corpus analysis and error annotation Xiaofei Lu CALPER 2010 Summer Workshop July 13, 2010.
Linguistics with CLARIN Concluding Overview Jan Odijk LOT Winterschool Amsterdam,
Information Structure in DPs The Syntax/Information Structure Interface in DPs: Internal Syntactic Properties and their Informational Correlates Anja Kleemann,
Hands-on tutorial: Using Praat for analysing a speech corpus Mietta Lennes Palmse, Estonia Department of Speech Sciences University of Helsinki.
CODA – CATCHPlus Open Document Annotation Hennie Brugman OAC II Project Review meeting Chicago – July 26-27, 2012.
DigiTAAL Some exciting examples Ineke Schuurman coordinator CLARIN-Vlaanderen.
Searching for Common Sense: Populating Cyc from the Web Presented by Yu-Chung Shen 2007/05/03.
Tracking Language Development with Learner Corpora Xiaofei Lu CALPER 2010 Summer Workshop July 12, 2010.
CLARIN for Linguists Portal & Searching for Resources Jan Odijk LOT Summerschool Nijmegen,
Seminar in Applied Corpus Linguistics: Introduction APLNG 597A Xiaofei Lu August 26, 2009.
Spanish FrameNet Project Autonomous University of Barcelona Marc Ortega.
*Erasmus University Rotterdam P.O. Box 1738, NL-3000 DR Rotterdam, the Netherlands † Teezir BV Wilhelminapark 46, NL-3581 NL, Utrecht, the Netherlands.
Capturing patterns of linguistic interaction in a parsed corpus A methodological case study Sean Wallis Survey of English Usage University College London.
Indirect Supervision Protocols for Learning in Natural Language Processing II. Learning by Inventing Binary Labels This work is supported by DARPA funding.
The eXtensible Markup Language (XML). Presentation Outline Part 1: The basics of creating an XML document Part 2: Developing constraints for a well formed.
Computational linguistics A brief overview. Computational Linguistics might be considered as a synonym of automatic processing of natural language, since.
E BERHARD- K ARLS- U NIVERSITÄT T ÜBINGEN SFB 441 Coordinate Structures: On the Relationship between Parsing Preferences and Corpus Frequencies Ilona Steiner.
Introduction Chapter 1 Foundations of statistical natural language processing.
Linguistic Research with CLARIN Jan Odijk MA Rotation Utrecht,
Formal Specification: a Roadmap Axel van Lamsweerde published on ICSE (International Conference on Software Engineering) Jing Ai 10/28/2003.
PARSEME Alpino MWE Encoding Jan Odijk PARSEME Meeting Iasi,
Discovering Relations among Named Entities from Large Corpora Takaaki Hasegawa *, Satoshi Sekine 1, Ralph Grishman 1 ACL 2004 * Cyberspace Laboratories.
Parsing & Language Acquisition: Parsing Child Language Data CSMC Natural Language Processing February 7, 2006.
Matakuliah: G0922/Introduction to Linguistics Tahun: 2008 Session 13 Second Language Acquisition.
For Monday Read chapter 26 Homework: –Chapter 23, exercises 8 and 9.
Using PaQu for language acquisition research Jan Odijk CLARIN 2015 Conference Wroclaw,
Glottodidactics Lesson 7.
Relations between Data Categories
Verbal inflection: why is it vulnerable in SLI?
Lecture 12: Relevance Feedback & Query Expansion - II
Statistical NLP: Lecture 3
Computational and Statistical Methods for Corpus Analysis: Overview
CLARIN Language Resources Switchboard in CLARIAH
Multimedia Information Retrieval
Psycholinguistic aspects of interlanguage
Jan Odijk LREC Miyazaki
Corpus-based tools: a “how to” presentation
Search in Token-annotated Corpora Search in Treebanks
Presentation transcript:

Linguistic Research with PaQu Jan Odijk, Utrecht University Small Experiment (was intended as a user test) Take all Dutch CHILDES corpora Select all adult utterances containing heel,erg or zeer Clean the utterances, e.g. ja, maar [//] we bewaren (he)t ook  ja, maar we bewaren het ook Gather statistics and draw conclusions CHILDES corpora Use Dutch CHILDES corpora to investigate this Problem: ambiguity of the relevant words Dutch CHILDES corpora do NOT have (reliable) pos-tags and no syntactic parses at all Done manually for Van Kampen Corpus [Odijk 2014:91] PaQu (Parse and Query) automates this Caveats It concerns (cleaned) adult speech It concerns relatively short sentences, explicitly separated It mostly concerns a very local grammatical relation Most problematic for zeer: zeer doen PaQu Search for morpho-syntactic information and syntactic dependency relations Distinction relevant ones v. irrelevant ones can now be made mostly automatically Accuracy Manual annotation of Van Kampen corpus used as gold standard (Acc) Alpino makes finer distinctions: I mapped these Annotation errors in the gold standard: revised gold standard (Rev Acc) Conclusions Linguistics No examples for mod P: how to explain heel v. erg, zeer? Overwhelmingness of mod A for heel? Are the current Dutch CHILDES corpora representative enough to draw reliable conclusions? PaQu PaQu is very useful for doing better and more efficient manual verification of hypotheses In some cases its fully automatically generated parses and their statistics can reliably be used directly (though care is required!) Basic Facts Heel erg zeer are (near-)synonyms meaning `very’ Heel can modify adjectival (A) predicates only Erg en zeer can modify A, verbal (V) and prepositional (P) predicates 1.(A) Hij is daar heel /erg / zeer blij over 2.(P) Hij is daar *heel / erg /zeer in zijn sas mee 3.(V) Dat verbaast mij *heel / erg / zeer (very in English is like Dutch heel (v. very much) See [Odijk 2011, 2014] for more data and qualifications Research Questions How can children acquire the fact that erg and zeer can modify A, V and P predicates (in L1 acquisition)? How can children acquire the fact that heel can modify A but canNOT modify V and P predicates (in L1 acquisition)? What kind of evidence do children have access to for acquiring such properties? Is there a relation with the time of acquisition? Is there a role for indirect negative evidence (absence of evidence interpreted as evidence for absence)? Interpretation Overwhelming # examples for mod A for heel Large # examples for mod A and mod V for erg Very few examples for zeer (mod V mostly wrong parses ) No examples of mod P / mod V for heel at all (the 4 are wrong parses) PP predicates with zeer, erg: op prijs stellen, in de smaak vallen only (mod V) – 3 occurrences wordMorphosyntaxSyntaxMeaning heel A Mod N(1)`whole’ (2)`large’ Mod A`very’ Vf(1)`heal’ (2) `receive’ erg N utrum`erg’ N neutrum`evil’ A Mod N, predc‘bad’, ‘awful’ Mod A V Pvery zeer N`pain’ A Mod N, predc‘painful’ Mod A V P‘very’ wordAccRev Acc heel erg zeer 0.21 Background AutoSearch: (enrich,) upload & search (expected March 2015, INL) PoS-tags, Corpus Modern Dutch interface PaQu: upload (, enrich) and search (July 2015, V1 available, RUG) syntactic structures, Groningen Word Relations Search Application User tests on-going Assessment of the facts Distinction is purely syntactic Cannot be derived from semantic differences No correlation found with other known facts Cannot be derived from general (universal) principles  must be acquired by L1 learners of Dutch Resultsmod Amod NMod Vmod PpredcotherunclearTotal heel erg zeer Future Work Similar experiments for the children’s speech (cf. [Odijk 2014:34]) Similar experiments for te v. overmatig; worden v. raken and others Extend PaQu to include all relevant `metadata’ Extend PaQu to natively support common formats such as CHAT, Folia, TEI, … Make similar system for GrETEL Manually verify (parts of) parses for CHILDES corpora (UU AnnCor project) References [Odijk 2011] Odijk, J., "User Scenario Search", internal CLARIN-NL document, April 13, [docx]docx [Odijk 2014] Odijk, J.,'CLARIN: What's in it for Linguists?', Uilendag lecture, Utrecht, Mar 27, [pptx]pptx