LLOD Use Case: Syntactic Search Jan Odijk CLARIAH-CORE LD4LR Workshop Utrecht, 2017-02-06/07

Overview
  Search in Token-annotated Corpora
  Search in Treebanks

Token-Annotated: Corpus Search
Corpus: SONAR (535 million tokens, Dutch)
  POS-tagged
  Encoded in FoLiA
Search: OpenSONAR
  4 search interfaces, of increasing complexity
  Expert interface: CQP queries
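To make the idea of a CQP query over token annotations concrete, here is a minimal sketch in Python. It is not how OpenSONAR or CQP is implemented; it only mimics a query such as `[pos="ADJ"] [pos="N.*"]`, where each bracketed item is a set of regular expressions that one token must satisfy. The sample tokens, the tag names, and the helper names `token_matches` and `search` are assumptions made for illustration.

```python
import re

# Toy corpus: each token is a dict of annotations (values are illustrative).
tokens = [
    {"word": "het", "lemma": "het", "pos": "LID"},
    {"word": "oude", "lemma": "oud", "pos": "ADJ"},
    {"word": "huis", "lemma": "huis", "pos": "N"},
    {"word": "staat", "lemma": "staan", "pos": "WW"},
    {"word": "leeg", "lemma": "leeg", "pos": "ADJ"},
]

# Rough equivalent of the CQP query: [pos="ADJ"] [pos="N.*"]
pattern = [{"pos": "ADJ"}, {"pos": "N.*"}]


def token_matches(token, constraint):
    """True if every attribute regex matches the whole attribute value."""
    return all(
        re.fullmatch(regex, token.get(attr, "")) is not None
        for attr, regex in constraint.items()
    )


def search(tokens, pattern):
    """Return the word sequences that match the pattern at consecutive positions."""
    hits = []
    for start in range(len(tokens) - len(pattern) + 1):
        window = tokens[start:start + len(pattern)]
        if all(token_matches(t, c) for t, c in zip(window, pattern)):
            hits.append(" ".join(t["word"] for t in window))
    return hits


hits = search(tokens, pattern)
print(hits)
```

The query is a regular expression over token descriptions: a sequence of positions, each constrained by attribute patterns. A real engine adds indexes, quantifiers over token positions, and structural attributes, none of which are modelled here.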

Token-Annotated: Other Corpora
Corpora:
  Own corpora
  BNC
  Contemporary Dutch Corpus
Search (all use CQP):
  AutoSearch
  BNC Lancaster
  Contemporary Dutch Corpus

Token-Annotated: LOD?
Will it bring advantages? If so, which ones?
Does it retain the power and simple notation of CQP?
  SPARQL queries?
  Regular expressions over token descriptions?

Overview
  Search in Token-annotated Corpora
  Search in Treebanks

Treebanks
Treebank = text corpus in which each sentence has been assigned a syntactic structure
I use:
  CGN, LASSY, CHILDES for Dutch
  LINDAT/CLARIN for many different languages
  TüNDRA for (mainly) German
  INESS treebanks for multiple languages
Query languages:
  CGN, LASSY, CHILDES: XPath/XQuery
  LINDAT/CLARIN: PML-TQ
  TüNDRA: Tiger
  INESS: Tiger

Treebanks
Dedicated search applications:
  GrETEL: example-based search & XPath
  PaQu: dedicated search for dependencies & XPath
Performance:
  OK for corpora of 65k sentences / 1 M tokens
  Too slow for corpora of 7 M sentences (and getting slower every 18 months)
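As an illustration of what an XPath query over a treebank looks like, here is a small Python sketch. The XML is a simplified, hand-written stand-in for a parsed sentence with nested `node` elements; the attribute names (`cat`, `rel`, `pos`, `word`) and their values are assumptions for this example and do not reproduce any treebank's full annotation scheme. Python's standard library supports only a subset of XPath, so part of the condition is checked in plain Python.

```python
import xml.etree.ElementTree as ET

# Simplified, hand-written parse of "het oude huis staat leeg" (illustrative only).
SENTENCE_XML = """
<alpino_ds>
  <node cat="top" rel="top">
    <node cat="smain" rel="--">
      <node cat="np" rel="su">
        <node rel="det" pos="det" word="het"/>
        <node rel="mod" pos="adj" word="oude"/>
        <node rel="hd" pos="noun" word="huis"/>
      </node>
      <node rel="hd" pos="verb" word="staat"/>
      <node rel="predc" pos="adj" word="leeg"/>
    </node>
  </node>
</alpino_ds>
""".strip()

root = ET.fromstring(SENTENCE_XML)

# Full XPath would be roughly:
#   //node[@cat="np" and @rel="su" and node[@rel="mod"]]
# ElementTree has no "and", so the query is split into XPath plus Python checks.
subject_nps = [
    np
    for np in root.findall(".//node[@cat='np']")
    if np.get("rel") == "su" and np.find("node[@rel='mod']") is not None
]

# Collect the words under each matching phrase, in document order.
phrases = [
    " ".join(leaf.get("word") for leaf in np.iter("node") if leaf.get("word"))
    for np in subject_nps
]
print(phrases)
```

Even this small query requires knowing which attribute encodes the category, which one encodes the grammatical relation, and how modifiers attach inside the phrase.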

Treebanks: LOD
Could it be used to overcome the many different query languages in use?
Query language?
  Same potential, transparent notation?
Query language syntax is NOT the problem:
  Queries get very complex very quickly
  One must know the syntactic structures in every fine detail

Treebanks: LOD
Linking to other resources
Combined syntactic/morphological/semantic search:
  Wordnet for checking semantic properties (mass/count, human/nonhuman)
  CELEX for morphological/phonological properties
Performance?
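A combined syntactic/semantic search can be sketched as a join between treebank hits and a lexical resource. The Python sketch below is only an illustration under strong assumptions: `LEXICON` is a hypothetical two-entry stand-in for a resource such as a wordnet, the sentence XML is hand-written and simplified, and the function name `direct_objects_with` is made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for an external lexical resource, keyed by lemma.
LEXICON = {
    "water": {"countability": "mass"},
    "boek": {"countability": "count"},
}

# Simplified, hand-written parses (illustrative attribute names and values).
SENTENCES = {
    "s1": '<node cat="smain">'
          '<node rel="su" pos="pron" lemma="hij" word="hij"/>'
          '<node rel="hd" pos="verb" lemma="drinken" word="drinkt"/>'
          '<node rel="obj1" pos="noun" lemma="water" word="water"/>'
          '</node>',
    "s2": '<node cat="smain">'
          '<node rel="su" pos="pron" lemma="zij" word="zij"/>'
          '<node rel="hd" pos="verb" lemma="lezen" word="leest"/>'
          '<node rel="obj1" pos="noun" lemma="boek" word="boeken"/>'
          '</node>',
}


def direct_objects_with(prop, value):
    """Find direct objects whose lemma has the given lexicon property."""
    hits = []
    for sent_id, xml_text in SENTENCES.items():
        root = ET.fromstring(xml_text)
        # Syntactic part of the query: nodes with the direct-object relation.
        for node in root.findall(".//node[@rel='obj1']"):
            # Semantic part: look the lemma up in the linked resource.
            entry = LEXICON.get(node.get("lemma"), {})
            if entry.get(prop) == value:
                hits.append((sent_id, node.get("word")))
    return hits


mass_objects = direct_objects_with("countability", "mass")
print(mass_objects)
```

The lookup here is an in-memory dictionary; with linked data the same join would go through shared identifiers across resources, which is where the performance question on the slide comes in.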

Thanks for your attention