Finding your way through the woods with GrETEL Liesbeth Augustinus Vincent Vandeghinste Ineke Schuurman Frank Van Eynde TABU-dag - June 14, 2013.

GrETEL Greedy Extraction of Trees for Empirical Linguistics Query engine for treebanks Nederbooms project Exploitation of Dutch treebanks for research in linguistics Goals o User-friendly tools o Access to large data files o Fast and accurate

TREEBANKS

GrETEL Greedy Extraction of Trees for Empirical Linguistics Query engine for treebanks Treebank = syntactically annotated corpus e.g. Penn Treebank (English), TüBa (German), LASSY, CGN (Dutch) Parser e.g. Alpino (Van Noord 2006)

ALPINO PARSER Dit is een zin. (“This is a sentence.”) >> ALPINO parser >> XML trees Query language: XPath
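As a rough illustration of what querying such XML trees with XPath looks like, here is a minimal Python sketch. The tree is a hand-written, heavily simplified stand-in for parser output: the attribute names (rel, cat, pt, word, begin, end) are modeled on the Alpino/LASSY style, but the analysis and values are assumptions made for this example. It uses the standard library's ElementTree, which supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified tree for "Dit is een zin." (not real parser output).
TREE = """
<alpino_ds>
  <node cat="top" rel="top" begin="0" end="5">
    <node cat="smain" rel="--" begin="0" end="4">
      <node rel="su" pt="vnw" word="Dit" begin="0" end="1"/>
      <node rel="hd" pt="ww" word="is" begin="1" end="2"/>
      <node cat="np" rel="predc" begin="2" end="4">
        <node rel="det" pt="lid" word="een" begin="2" end="3"/>
        <node rel="hd" pt="n" word="zin" begin="3" end="4"/>
      </node>
    </node>
    <node rel="--" pt="let" word="." begin="4" end="5"/>
  </node>
</alpino_ds>
"""

root = ET.fromstring(TREE.strip())

# Find every noun phrase anywhere in the tree, then read off its head word.
noun_phrases = root.findall(".//node[@cat='np']")
heads = [np.find("node[@rel='hd']").get("word") for np in noun_phrases]
print(heads)
```

Full XPath engines accept richer expressions (nested predicates, `and`, numeric comparisons) than ElementTree does, so a real treebank query would normally be run with such an engine instead.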

XPATH

GrETEL Greedy Extraction of Trees for Empirical Linguistics Query treebanks by example First version => only for LASSY treebank New release => GrETEL for CGN treebank => update based on user reviews

GrETEL User: o Example sentence o Indicate relevant items of the sentence o (Adapt XPath) o Select treebank o Inspect results System: o Parser (Alpino) o Automatically generate XPath expression o Present results to the user
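The step "automatically generate XPath expression" can be pictured as turning the selected part of a parse tree into nested node conditions joined with `and`. The sketch below is not GrETEL's actual generator. The `Node` class and `to_xpath` function are hypothetical names introduced here, and the attribute names are assumed to follow the Alpino/LASSY style.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A selected tree node: the attributes to match plus its selected children."""
    attrs: dict
    children: list = field(default_factory=list)


def to_xpath(node, top=True):
    """Build an XPath expression that matches the given subtree pattern."""
    # One condition per attribute, one nested condition per child node.
    conditions = [f'@{name}="{value}"' for name, value in node.attrs.items()]
    conditions += [to_xpath(child, top=False) for child in node.children]
    step = "//node" if top else "node"
    if not conditions:
        return step
    return f'{step}[{" and ".join(conditions)}]'


# Pattern for a PP with a prepositional head and an NP object (illustrative).
pattern = Node({"cat": "pp"}, [
    Node({"rel": "hd", "pt": "vz"}),
    Node({"rel": "obj1", "cat": "np"}),
])
query = to_xpath(pattern)
print(query)
```

Because child conditions do not constrain word order, a pattern built this way matches the same structure regardless of how the words are arranged in the sentence.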

OUTLINE GrETEL in a nutshell GrETEL demo o Case study o Search options Conclusions and future work

CASE STUDY Verbs with fixed preposition o E.g. Hij keek met een bang hartje naar de heks. ‘He was looking at the witch with a heavy heart.’ o VERB + (…+) PREP o LASSY: XPath query

CASE STUDY Verbs with fixed preposition o E.g. Hij keek naar de heks. ‘He was looking at the witch.’ Discontinuous constructions! o E.g. Hij keek met een bang hartje naar de heks. ‘He was looking at the witch with a heavy heart.’ o VERB + (…+) PREP
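Discontinuity is easier to handle on a tree than on a token string, because the verb and its prepositional complement are sisters no matter what stands between them. The following Python sketch illustrates this on a hand-written clause. The tree, the use of rel="pc" for the prepositional complement, and the function name `verb_prep_pairs` are assumptions for illustration, not the query shown on the slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified clause for
# "Hij keek met een bang hartje naar de heks" (NP internals omitted).
CLAUSE = """
<node cat="smain">
  <node rel="su" pt="vnw" word="Hij" begin="0" end="1"/>
  <node rel="hd" pt="ww" word="keek" begin="1" end="2"/>
  <node rel="mod" cat="pp" begin="2" end="6">
    <node rel="hd" pt="vz" word="met" begin="2" end="3"/>
    <node rel="obj1" cat="np" begin="3" end="6"/>
  </node>
  <node rel="pc" cat="pp" begin="6" end="9">
    <node rel="hd" pt="vz" word="naar" begin="6" end="7"/>
    <node rel="obj1" cat="np" begin="7" end="9"/>
  </node>
</node>
"""


def verb_prep_pairs(root):
    """Return (verb, preposition) pairs for verbs with a prepositional complement."""
    pairs = []
    for clause in root.iter("node"):
        verb = clause.find("node[@rel='hd'][@pt='ww']")
        if verb is None:
            continue
        # Only PPs labeled as complements count; modifier PPs are skipped.
        for pp in clause.findall("node[@rel='pc'][@cat='pp']"):
            prep = pp.find("node[@rel='hd'][@pt='vz']")
            if prep is not None:
                pairs.append((verb.get("word"), prep.get("word")))
    return pairs


pairs = verb_prep_pairs(ET.fromstring(CLAUSE.strip()))
print(pairs)
```

The intervening modifier "met een bang hartje" does not affect the match, since the search never looks at the linear distance between verb and preposition.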

GrETEL ONLINE

INPUT

ANNOTATION MATRIX

ANNOTATION GUIDELINES

XPATH GENERATOR

Other treebank, other format … Hij keek met een bang hartje naar de heks o CGN: XPath query o LASSY: XPath query

TREEBANK SELECTION

RESULTS Verb plus fixed preposition o E.g. Hij keek naar de heks. ‘He was looking at the witch.’ o VERB + (…+) PREP => 4004 matches in 3881 sentences

RESULTS: table

RESULTS: data

RESULTS: trees

OUTLINE GrETEL in a nutshell GrETEL demo o Case study o Search options Conclusions and future work

SEARCH OPTIONS => Below annotation matrix

SEARCH OPTIONS Green versus red word order in Dutch o green: past participle – auxiliary De NAVO stelt dat ze er alles aan gedaan heeft o red: auxiliary – past participle De NAVO stelt dat ze er alles aan heeft gedaan “NATO claims that it has done everything in its power” (deredactie.be)
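The two orders have the same dependency structure and differ only in the linear position of the two verbs, so telling them apart requires looking at token positions. A minimal Python sketch of that idea follows. The clause fragments are hand-written and reduced to the two verbs, and the `begin` attribute, the `vc`/`ppart` labels, and the function name `word_order` are assumptions modeled on the Alpino/LASSY style.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragments: only the auxiliary and the participle are kept.
GREEN = """
<node cat="ssub">
  <node rel="vc" cat="ppart">
    <node rel="hd" pt="ww" word="gedaan" begin="4" end="5"/>
  </node>
  <node rel="hd" pt="ww" word="heeft" begin="5" end="6"/>
</node>
"""
RED = """
<node cat="ssub">
  <node rel="hd" pt="ww" word="heeft" begin="4" end="5"/>
  <node rel="vc" cat="ppart">
    <node rel="hd" pt="ww" word="gedaan" begin="5" end="6"/>
  </node>
</node>
"""


def word_order(clause):
    """Classify a clause as 'green' (participle first) or 'red' (auxiliary first)."""
    auxiliary = clause.find("node[@rel='hd'][@pt='ww']")
    participle = clause.find("node[@rel='vc'][@cat='ppart']/node[@rel='hd'][@pt='ww']")
    if auxiliary is None or participle is None:
        return None
    # Compare start positions numerically, not as strings.
    if int(participle.get("begin")) < int(auxiliary.get("begin")):
        return "green"
    return "red"


print(word_order(ET.fromstring(GREEN.strip())))
print(word_order(ET.fromstring(RED.strip())))
```

A query that ignores word order would return both variants together; adding a position comparison like this one is what separates them.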

SEARCH OPTIONS

OUTLINE GrETEL in a nutshell GrETEL demo o Case study o Search options Conclusions and future work

CONCLUSIONS GrETEL: search engine for Dutch treebanks Input = natural language example Output = sample of similar sentences Syntactic concordancer Available online (via Mozilla Firefox) No installation required

FUTURE WORK GrETEL 2.0 o Include SoNaR corpus (ca 500M tokens) o More generic AfriBooms o GrETEL for Afrikaans o Include other treebank formats

CASE STUDY Collective noun constructions o E.g. Een aantal bomen zijn omgevallen. ‘A number of trees fell down.’ o DET + NOUN + PLURAL NOUN Discontinuous constructions! o E.g. Een groot aantal oude bomen zijn omgevallen. ‘A large number of old trees fell down.’

Thanks for your attention! Try it yourself at

Waaraan vs Waar … aan o Waar denk je aan? (4 results) o Waar bemoei je je mee? o Wanneer gaat een koortsstuip over in epilepsie?

Waaraan denk je? (38 results) o Waarom werken we? o Waartoe verbind ik mij als ouder door dit formulier in te vullen? o Vanwaar die gulle hand van een Turkse overheid die in de schulden zwemt?

Hij klom de boom in (37 results) o Door haar winst komt Clijsters de top-20 binnen. o In feite ging minder dan de helft van Dorsets de rivier over. o Nederland gaat de bezettingstijd in.