SEEING THE WOOD FOR THE TREES Liesbeth Augustinus Vincent Vandeghinste Frank Van Eynde.

Slides:



Advertisements
Similar presentations
Special Topics in Computer Science Advanced Topics in Information Retrieval Lecture 10: Natural Language Processing and IR. Syntax and structural disambiguation.
Advertisements

The CLARIN INFRASTRUCTURE Jan Odijk MA Rotation Utrecht,
The Sketch Engine for Dutch with the ANW corpus Carole Tiberius.
NederBooms Hands on session GrETEL - Greedy Extraction of Trees for Empirical Linguistics Vincent Vandeghinste.
Example-Based Treebank Querying Liesbeth Augustinus Vincent Vandeghinste Frank Van Eynde CLARIN Sofia,
Finding your way through the woods with GrETEL Liesbeth Augustinus Vincent Vandeghinste Ineke Schuurman Frank Van Eynde TABU-dag - June 14, 2013.
TAAL EN COMPUTER INTRO Paola Monachesi. Why Taal en Computer?
Example queries for Federated search Jan Odijk CLARIN Federated Search Workshop Copenhagen, 24 Apr
Computational language: week 10 Lexical Knowledge Representation concluded Syntax-based computational language Sentence structure: syntax Context free.
Linguistic Research with PaQu Jan Odijk, Utrecht University Small Experiment (was intended as a user test) Take all Dutch CHILDES corpora Select all adult.
 Christel Kemke 2007/08 COMP 4060 Natural Language Processing Feature Structures and Unification.
Proceedings of the Conference on Intelligent Text Processing and Computational Linguistics (CICLing-2007) Learning for Semantic Parsing Advisor: Hsin-His.
Jing-Shin Chang National Chi Nan University, IJCNLP-2013, Nagoya 2013/10/15 ACLCLP – Activities ( ) & Text Corpora.
Linguistics with CLARIN OpenSONAR Jan Odijk LOT Winterschool Amsterdam,
A Syntactic Translation Memory Vincent Vandeghinste Centre for Computational Linguistics K.U.Leuven
Chapter Chapter Summary Languages and Grammars Finite-State Machines with Output Finite-State Machines with No Output Language Recognition Turing.
Universität des Saarlandes Seminar: Recent Advances in Parsing Technology Winter Semester Jesús Calvillo.
Annotating language data Tomaž Erjavec Institut für Informationsverarbeitung Geisteswissenschaftliche Fakultät Karl-Franzens-Universität Graz Tomaž Erjavec.
Foundations of Language Science and Technology - Corpus Linguistics - Silvia Hansen-Schirra.
Are Linguists Dinosaurs? 1.Statistical language processors seem to be doing away with the need for linguists. –Why do we need linguists when a machine.
The Use of Corpora for Automatic Evaluation of Grammar Inference Systems Andrew Roberts & Eric Atwell Corpus Linguistics ’03 – 29 th March Computer Vision.
C SC 620 Advanced Topics in Natural Language Processing 3/9 Lecture 14.
ELN – Natural Language Processing Giuseppe Attardi
UAM CorpusTool: An Overview Debopam Das Discourse Research Group Department of Linguistics Simon Fraser University Feb 5, 2014.
Empirical Methods in Information Extraction Claire Cardie Appeared in AI Magazine, 18:4, Summarized by Seong-Bae Park.
Corpus-based computational linguistics or computational corpus linguistics? Joakim Nivre Uppsala University Department of Linguistics and Philology.
A Survey of NLP Toolkits Jing Jiang Mar 8, /08/20072 Outline WordNet Statistics-based phrases POS taggers Parsers Chunkers (syntax-based phrases)
Incremental sentence production and ellipsis Incremental sentence production reduces the working memory capacity needed for advance planning: The planning.
Syntactically annotated corpora of Estonian Heli Uibo Institute of Computer Science University of Tartu
Experiments on Building Language Resources for Multi-Modal Dialogue Systems Goals identification of a methodology for adapting linguistic resources for.
Tree-adjoining grammar (TAG) is a grammar formalism defined by Aravind Joshi and introduced in Tree-adjoining grammars are somewhat similar to context-free.
A Web Application for Customized Corpus Delivery Nancy Ide, Keith Suderman, Brian Simms Department of Computer Science Vassar College USA.
Linguistics & AI1 Linguistics and Artificial Intelligence Linguistics and Artificial Intelligence Frank Van Eynde Center for Computational Linguistics.
PAUL ALEXANDRU CHIRITA STEFANIA COSTACHE SIEGFRIED HANDSCHUH WOLFGANG NEJDL 1* L3S RESEARCH CENTER 2* NATIONAL UNIVERSITY OF IRELAND PROCEEDINGS OF THE.
Methods for the Automatic Construction of Topic Maps Eric Freese, Senior Consultant ISOGEN International.
Introduction to CL & NLP CMSC April 1, 2003.
Tracking Language Development with Learner Corpora Xiaofei Lu CALPER 2010 Summer Workshop July 12, 2010.
Comparing syntactic semantic patterns and passages in Interactive Cross Language Information Access (iCLEF at the University of Alicante) Borja Navarro,
October 2005CSA3180 NLP1 CSA3180 Natural Language Processing Introduction and Course Overview.
Natural Language Programming David Vadas The University of Sydney Supervisor: James Curran.
Grammars Grammars can get quite complex, but are essential. Syntax: the form of the text that is valid Semantics: the meaning of the form – Sometimes semantics.
CSKGOI'08 Commonsense Knowledge and Goal Oriented Interfaces.
CSA2050 Introduction to Computational Linguistics Parsing I.
An Iterative Approach to Extract Dictionaries from Wikipedia for Under-resourced Languages G. Rohit Bharadwaj Niket Tandon Vasudeva Varma Search and Information.
CPSC 503 Computational Linguistics
Auckland 2012Kilgarriff: NLP and Corpus Processing1 The contribution of NLP: corpus processing.
Supertagging CMSC Natural Language Processing January 31, 2006.
Linguistic Research with CLARIN Jan Odijk MA Rotation Utrecht,
Syntactic Annotation of Slovene Corpora (SDT, JOS) Nina Ledinek ISJ ZRC SAZU
Using the WWW to resolve PP attachment ambiguities in Dutch Vincent Vandeghinste Centrum voor Computerlinguïstiek K.U.Leuven, Belgium.
Natural Language Processing Lecture 15—10/15/2015 Jim Martin.
CPSC 422, Lecture 27Slide 1 Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 27 Nov, 16, 2015.
1 Introduction to Computational Linguistics Eleni Miltsakaki AUTH Spring 2006-Lecture 2.
Natural Language Processing Lecture 14—10/13/2015 Jim Martin.
Building Sub-Corpora Suitable for Extraction of Lexico-Syntactic Information Ondřej Bojar, Institute of Formal and Applied Linguistics, ÚFAL.
Handling Unlike Coordinated Phrases in TAG by Mixing Syntactic Category and Grammatical Function Carlos A. Prolo Faculdade de Informática – PUCRS CELSUL,
Overview of Statistical NLP IR Group Meeting March 7, 2006.
Using PaQu for language acquisition research Jan Odijk CLARIN 2015 Conference Wroclaw,
LING/C SC 581: Advanced Computational Linguistics Lecture Notes Feb 3 rd.
LING/C SC 581: Advanced Computational Linguistics Lecture Notes Feb 17 th.
CLARIN - Flanders Activities and Achievements Frank Van Eynde Center for Computational Linguistics (KU Leuven) Digital Humanities Spring Event, April.
Relations between Data Categories
Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 27
LING/C SC/PSYC 438/538 Lecture 21 Sandiway Fong.
LING/C SC 581: Advanced Computational Linguistics
Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 27
Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 26
Jan Odijk LREC Miyazaki
Search in Token-annotated Corpora Search in Treebanks
Artificial Intelligence 2004 Speech & Natural Language Processing
Presentation transcript:

SEEING THE WOOD FOR THE TREES Liesbeth Augustinus Vincent Vandeghinste Frank Van Eynde

DEDUCTION and INDUCTION Computational corpus linguistics Focus on data Inductive generalisation Data > theory Provide tools for linguistic research Descriptive and theoretical linguistics Focus on knowledge Introspection Theory > data Verify theory through data with corpus or treebank examples How to make interaction between both approaches possible?

Corpus with syntactic annotations Dependency structures and/or constituent structures Examples:  English (Penn Treebank)  German (VerbMobil, Tiger)  French (French Treebank)  Czech (Prague Dependency Treebank)  Swedish (Swedish Treebank)  Dutch (CGN, LASSY, SoNaR) Annual conference on Treebanks and Linguistic Theories since 2002 TREEBANKS

NEDERBOOMS

Exploitation of Dutch treebanks for research in linguistics September 2010 – February 2012 Frank Van Eynde (CCL) and Hans Smessaert (NGTG) Liesbeth Augustinus and Vincent Vandeghinste (CCL) Goals  User-friendly  Access to large data files  Fast and accurate NEDERBOOMS

How can we combine the data-oriented approach of treebank mining with the knowledge-oriented method of theoretical and descriptive linguistics? NEDERBOOMS

The Nederbooms project LASSY Treebank Querying LASSY Conclusion and future research OUTLINE

The Nederbooms project LASSY Treebank Querying LASSY Conclusion and future research OUTLINE

Written texts Wikipedia, books, newspapers, reports, websites, law texts... ALPINO parser (van Noord) Automatic syntactic annotation LASSY small 1 million words, sentences Manually corrected LASSY large 1.5 billion words Not corrected LASSY

Validation report Jongejan et al Center for Sprogteknologi LASSY small Lemmatisationerror rate 0,365% POS-taggingerror rate 1,37% Dependency relations99,8% Sentence level accuracy97,8% LASSY

“Dit is een zin” >> ALPINO >>

LASSY “Dit is een zin”>> ALPINO parser >>

LASSY U wordt aangeraden demonstraties en andere politieke activiteiten te mijden. ( WR-P-E-H p.33.s.2) “You are adviced to avoid demonstrations and other political activities.”

The Nederbooms project LASSY Treebank Querying LASSY Conclusion and future research OUTLINE

Existing search tool: dtsearch (Kloosterman 2007) Query language XPath = standard query language for xml trees QUERYING LASSY

Existing search tool: dtsearch (Kloosterman 2007) Query language XPath = standard query language for xml trees Some examples “look for all occurrences of 'politiek' ” QUERYING LASSY

Existing search tool: dtsearch (Kloosterman 2007) Query language XPath = standard query language for xml trees Some examples “look for all occurrences of 'politiek' ” QUERYING LASSY

Existing search tool: dtsearch (Kloosterman 2007) Query language XPath = standard query language for xml trees Some examples “look for all occurrences of 'politiek' ” “look for all nodes in which a noun is modified by the adjective 'politiek'” QUERYING LASSY

“look for all verbs that occur as a head of a te-infinitive.” and XPath Not user-friendly Knowledge of Alpino grammar necessary QUERYING LASSY

Greedy Extraction of Trees for Empirical Linguistics Search tool based on example sentences Input = natural language No explicit knowledge of formal query language nor Alpino grammar required Bridge gap between descriptive and computational linguistics GrETEL

Zowel... als... + singular or plural? bv. Zowel de politie als de brandweer is/zijn ter plaatse. (Taaladvies.net) “Both the police and the fire brigade is/are on site.” GrETEL ZOEKEN IN LASSY

GrETEL

Zowel president Bush als zijn voorganger Clinton komen er redelijk vanaf. “Both president Bush and his predecessor Clinton do reasonably well.” 22 results Singular and plural form of the finite verb

GrETEL Zowel algemeen directeur Beijer van Sparta als technisch manager Roks weigerde gisteren commentaar te geven. “Both general director Beijer of Sparta and technical manager Roks refused to comment yesterday.”

XPath generated automatically and and and and and and and and = long, specific query > manually adaptable in advanced search mode  to generalise (include names, pron)  to specify (look for singular finite verbs only) GrETEL

Treebank mining without knowledge of the internal structure of the treebank XPath automatically generated Basic and advanced search mode Ordering filter GrETEL

The Nederbooms project LASSY Treebank Querying LASSY Conclusion and future research OUTLINE

NEDERBOOMS Treebank mining for empirical linguistics Search tool for descriptive linguistics: GrETEL  LASSY Treebank  User-friendly: input = natural language knowledge of formal query language not required  Sample of similar sentences CONCLUSION

Speed up search > query LASSY large (1.5 billion words) Add more features > e.g. query subcorpora Automatically look for close-related examples (more or less specific) Add more quantitative information Webservice FUTURE RESEARCH

Questions? Suggestions?

REFERENCES Augustinus, L., Vandeghinste, V. and Van Eynde, F. Example-Based Treebank Querying. Submitted to LREC Jongejan B., Olsen, S. and Fersøe, H. Validation Report Lassy Corpora Linguistic Validation. Center for Sprogteknologi, University of Copenhagen, 2011 Kloosterman, G. An overview of the Alpino Treebank Tools. Alfa Informatica, University of Groningen Van Belle, W. and Van Langendonck (eds.), The Dative vol. I, Amsterdam/Philadelphia: John Benjamins, van der Beek, L., Topics in Corpus-Based Syntax. Groningen Dissertations in Linguistics, Van Eynde, F. Part of Speech Tagging en Lemmatisering van het D-Coi Corpus. Centrum voor Computerlinguïstiek, KU Leuven, van Noord, G. At Last Parsing Is Now Operational. In TALN 2006, pp , van Noord Gertjan, Gosse Bouma, Frank Van Eynde, Daniel de Kok, Jelmer van der Linde, Ineke Schuurman, Erik Tjong Kim Sang and Vincent Vandeghinste. Large Scale Syntactic Annotation of Written Dutch: Lassy. In Peter Spyns and Jan Odijk (eds.): Essential Speech and Language Technology for Dutch: resources, tools and applications. Springer, submitted. van Noord, G., Schuurman, I. and Bouma, G. Lassy syntactische annotatie, revision