Example queries for Federated search. Jan Odijk. CLARIN Federated Search Workshop, Copenhagen, 24 Apr 2013.

Presentation transcript:


Overview
– Linguistic Problem
– Federated Search
– Structural Differences
– Search in Lexicons
– Search in Corpora
– Corpus+Lexicon search
– Iterative Corpus Search
– Search in micro-comparative databases?

Linguistic Problem
Inflection of attributively used adjectives.
Influence of number, gender, case, definiteness (strong/weak inflection), other factors?
Exceptions to the main rules, such as:
– het bijvoeglijk(?e) naamwoord, lit. 'the adjectival noun', 'the adjective'
– het medisch(*e) onderzoek 'the medical research'
– de medisch(*e) onderzoeker 'the medical researcher'
– een competent(e) linguïst
where the -e suffix is predicted as the only option by the main rules.
The exceptions are not (all) arbitrary; there are further subregularities. I want to find out what these are.
There are similar phenomena in many languages (Germanic, Romance, e.g. Brazilian Portuguese, Menuzzi 1994, ...).

Linguistic Problem
– Odijk, J. (1992). Uninflected Adjectives in Dutch. In R. Bok-Bennema & R. van Hout (eds.), Linguistics in the Netherlands 1992. Amsterdam: Benjamins.
– Odijk, J. (2012). De structuur van Phrasal Names. Nederlandse Taalkunde, 17(2).
– Odijk, J. (2013). Comparative Linguistic Research in the CLARIN Infrastructure. Presentation to be held at Patterns of Macro- and Micro-Diversity in the Languages of Europe and the Middle East. Computational Issues in Studying Language Diversity: Storage, Analysis and Inference, Groningen, July 2013.

Broaden Empirical Base
I want to search:
1. For relevant examples in large annotated text corpora
2. For related examples in large text corpora, selected on the basis of the results of (1)
3. For relevant examples in microcomparative databases (e.g. MIMORE)
4. For properties of relevant words in dictionaries
5. For synonyms/hyperonyms/hyponyms in semantic lexical databases (e.g. CORNETTO)

Federated Search
– A set of resources R has been selected by the researcher on the basis of metadata. The resources in R can be in different locations.
– A Federated Search Engine (FSE) enables search in the resources in R.
– For each resource r in R there is a local search engine LSE_r.
– A query q_fs can be formulated in an agreed-upon query language (SRU/CQL), e.g. via a Federated Search web application.
– q_fs uses ISOCAT DCs (data categories).
– q_fs is sent to the LSE_r for each r in R, and translated there into the query language needed for LSE_r and into the DCs used in r.
– Each LSE_r yields a result set for q_fs, translates the DCs used in it into ISOCAT DCs, and sends it to the FSE, which combines/aggregates the result sets and prepares them for presentation to, or saving by, the user, possibly via the Federated Search web application.
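The round trip described above can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the class and function names are invented, the query is a plain dictionary rather than real SRU/CQL, and the shared category names `partOfSpeech` and `lemma` are stand-ins, not actual ISOCAT identifiers.

```python
# Hypothetical sketch of the FSE/LSE round trip; not the CLARIN implementation.

class LocalSearchEngine:
    """Stands in for LSE_r: holds records that use resource-local DC names."""

    def __init__(self, name, dc_map, records):
        self.name = name
        self.dc_map = dc_map      # shared DC name -> local DC name
        self.records = records    # records keyed by local DC names

    def search(self, q_fs):
        # Translate the shared query into the local DCs of this resource.
        local_query = {self.dc_map[dc]: value for dc, value in q_fs.items()}
        hits = [r for r in self.records
                if all(r.get(k) == v for k, v in local_query.items())]
        # Translate local DCs in the result set back to the shared DCs.
        back = {local: shared for shared, local in self.dc_map.items()}
        return [{back.get(k, k): v for k, v in r.items()} for r in hits]


def federated_search(q_fs, engines):
    """Stands in for the FSE: send q_fs to every LSE_r and aggregate."""
    results = []
    for engine in engines:
        for hit in engine.search(q_fs):
            results.append({**hit, "resource": engine.name})
    return results


# Two toy resources with different local names for the same categories.
ENGINES = [
    LocalSearchEngine("A", {"partOfSpeech": "pos", "lemma": "lem"},
                      [{"pos": "noun", "lem": "huis"},
                       {"pos": "verb", "lem": "zijn"}]),
    LocalSearchEngine("B", {"partOfSpeech": "tag", "lemma": "lemma"},
                      [{"tag": "noun", "lemma": "boom"}]),
]

RESULTS = federated_search({"partOfSpeech": "noun"}, ENGINES)
```

Each hit comes back in the shared vocabulary plus the name of the resource it was found in, which is what lets the FSE merge result sets from differently encoded resources.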

Federated Search
For many resource formats used in the Netherlands there is NOT yet a systematic mapping of their DCs to ISOCAT DCs, e.g. TEI, CGN format, FoLiA, EAF, GTB, WordNet, CELEX, Praat, ...
A special project should be started for this:
– Nationally for national formats (CGN, FoLiA, ...)
– Internationally for generic formats (TEI, CELEX, WordNet, ...)

Structural Differences
There are (sometimes trivial) structural differences between resources.
Description of an occurrence of Dutch 'is':
[Table: two columns, CGN and VU-DNC/FoLiA, each showing the markup for the token 'is']

Structural Differences
– Do these two descriptions contain the same or overlapping information?
– ISOCAT alone will not help, because there are differences in structure.
– Will the LSEs deal with such structural differences?
– Or is something more general needed for this (and is this possible)?

Structural Differences
– The CGN pw element means the same as the FoLiA w element.
– The CGN w attribute of pw means the same as the FoLiA t element in w.
– The CGN pos attribute of pw means the same as the FoLiA class attribute of the element pos (part-of-speech property name).
– The CGN lem attribute of pw means the same as the FoLiA class attribute of the element lemma (lemma property name).
– Values inside the CGN pos attribute of pw are identical to, and mean the same as, values inside the FoLiA class attribute of the element pos in element w (values from the CGN tagset).
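To make these correspondences concrete, here is a small Python sketch that reads both encodings into one common record. The two XML fragments are hypothetical: they are built only from the element and attribute names listed above, the tag and lemma values are illustrative choices, and real FoLiA documents carry namespaces and further attributes that are left out here.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragments derived from the correspondences above.
CGN_XML = '<pw w="is" pos="WW(pv,tgw,ev)" lem="zijn"/>'
FOLIA_XML = ('<w><t>is</t><pos class="WW(pv,tgw,ev)"/>'
             '<lemma class="zijn"/></w>')


def from_cgn(xml_text):
    """Read a CGN-style pw element: everything is an attribute."""
    pw = ET.fromstring(xml_text)
    return {"word": pw.get("w"), "pos": pw.get("pos"), "lemma": pw.get("lem")}


def from_folia(xml_text):
    """Read a FoLiA-style w element: the same data sits in child elements."""
    w = ET.fromstring(xml_text)
    return {"word": w.findtext("t"),
            "pos": w.find("pos").get("class"),
            "lemma": w.find("lemma").get("class")}


SAME = from_cgn(CGN_XML) == from_folia(FOLIA_XML)
```

The two readers differ only in where they look (attributes versus child elements), which is the kind of structural difference a shared data category registry does not resolve on its own.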

Search in Lexicons
– WFT-GTB: give me entries with PoS=noun of which the headword ends in "tsje".
– GTB, CELEX, CGN-lexicon: give me entries with PoS=noun and with the headword ending in "tsje", together with the source (GTB, CELEX, or CGN-lexicon) in which each was found.
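A minimal Python sketch of the second query, assuming each lexicon can be viewed as a list of entries with a headword and a PoS. The entries are toy data invented for illustration, and `search_lexicons` is a hypothetical helper, not part of any CLARIN service.

```python
# Toy stand-ins for the three lexicons; the entries are invented.
LEXICONS = {
    "GTB": [{"headword": "hûntsje", "pos": "noun"},
            {"headword": "rinne", "pos": "verb"}],
    "CELEX": [{"headword": "huis", "pos": "noun"}],
    "CGN-lexicon": [{"headword": "mantsje", "pos": "noun"}],
}


def search_lexicons(lexicons, pos, suffix):
    """Return (headword, pos, source) for every matching entry."""
    return [(entry["headword"], entry["pos"], source)
            for source, entries in lexicons.items()
            for entry in entries
            if entry["pos"] == pos and entry["headword"].endswith(suffix)]


HITS = search_lexicons(LEXICONS, "noun", "tsje")
```

Carrying the source name along with each hit is what distinguishes the federated variant of the query from the single-lexicon one.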

Search in Lexicons
Search in all resources where language=nld:
For each resource with language=nld
– For each word in ['zeer', 'heel', 'erg'] with PoS=adj
  – For each sense of the word
    – For each synonym of the sense
      – For each lemma of the synonym
        return word, PoS, sense, synonym, lemma, 'synonym', resource.name
And analogously with 'synonym' replaced by 'immediate hyperonym'.
And analogously with 'synonym' replaced by 'hyponym' (incl. hyponyms of hyponyms).
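The nested loops above translate almost directly into a Python generator. The data model (resources holding entries keyed by word and PoS, senses holding related items with lemmas) is an assumption made for this sketch, and the single toy entry is invented.

```python
# Hypothetical data model; real lexical databases are structured differently.
RESOURCES = [
    {"name": "toy-lexicon", "language": "nld",
     "entries": {("zeer", "adj"): [
         {"sense": "zeer-1",
          "synonym": [{"id": "syn-1", "lemmas": ["pijnlijk"]}]}]}},
    {"name": "toy-english", "language": "eng", "entries": {}},
]


def related(resources, words, pos, relation, language="nld"):
    """Yield one tuple per lemma reached through the given relation."""
    for resource in resources:
        if resource["language"] != language:
            continue
        for word in words:
            for sense in resource["entries"].get((word, pos), []):
                for item in sense.get(relation, []):
                    for lemma in item["lemmas"]:
                        yield (word, pos, sense["sense"], item["id"],
                               lemma, relation, resource["name"])


ROWS = list(related(RESOURCES, ["zeer", "heel", "erg"], "adj", "synonym"))
```

Passing a different relation name covers the immediate-hyperonym variant; hyponyms of hyponyms would additionally need a recursive walk, which this sketch does not attempt.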

Question
– Will federated search somehow smartly 'know' (e.g. from the metadata) that it has to search in lexicons only, and in fact only in lexicons that contain synonym information? Or will it waste time and effort by searching in all text corpora and in lexicons that do not have synonym information? Or is a smart choice of resources to search in left to the user?
– Similarly, search in CGN: give me all utterances that contain the word 'zeer' with PoS=ADJ spoken by a speaker with age<=7. There are no speakers with age<=7 in CGN; will federated search be smart enough to see this from the metadata, or will it waste time searching?
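One way such pruning could work, sketched in Python under the assumption that resource metadata exposes the resource type, the lexical relations it encodes, and the speaker age range. These metadata fields and the age value are hypothetical; they are not taken from actual CLARIN metadata.

```python
# Hypothetical metadata records; field names and values are invented.
METADATA = [
    {"name": "CGN", "type": "corpus", "min_speaker_age": 18},
    {"name": "toy-synonym-lexicon", "type": "lexicon",
     "relations": {"synonym"}},
    {"name": "toy-plain-lexicon", "type": "lexicon", "relations": set()},
]


def lexicons_with(metadata, relation):
    """Keep only lexicons whose metadata declares the wanted relation."""
    return [m["name"] for m in metadata
            if m["type"] == "lexicon" and relation in m.get("relations", set())]


def can_match_age(meta, max_age):
    """False only if the metadata proves no speaker is young enough."""
    min_age = meta.get("min_speaker_age")
    # Unknown age range: the resource still has to be searched.
    return min_age is None or min_age <= max_age
```

With metadata like this, the synonym query would go to one lexicon only, and the age<=7 query would skip the corpus without running a search. Whether real metadata is rich enough for this is exactly the open question raised above.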

Search in Corpora
Search in CGN-corpus, VU-DNC, SONAR: give me utterances that contain a subsequence of the form:
– a word token with PoS='definite_determiner', immediately followed by
– a word token with PoS=adjective and vorm=zonder-e (form without -e), immediately followed by
– a word token with PoS=noun
(examples are 'het bijvoeglijk naamwoord', 'de gulden snede', 'het ingewikkelder probleem').
Alternative: just return the subsequence.
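As an illustration of what this query asks for, the following Python sketch matches a sequence of attribute constraints against adjacent tokens. The token representation (a dictionary per token) and the example utterance are assumptions; actual corpus engines would express this in their own query language.

```python
# One constraint dictionary per position in the wanted subsequence.
PATTERN = [
    {"pos": "definite_determiner"},
    {"pos": "adjective", "vorm": "zonder-e"},
    {"pos": "noun"},
]


def matches(token, constraint):
    """A token matches if it has every required attribute value."""
    return all(token.get(key) == value for key, value in constraint.items())


def find_subsequences(utterance, pattern):
    """Return every run of adjacent tokens that satisfies the pattern."""
    size = len(pattern)
    return [utterance[i:i + size]
            for i in range(len(utterance) - size + 1)
            if all(matches(tok, con)
                   for tok, con in zip(utterance[i:i + size], pattern))]


# Toy utterance with invented annotations.
UTTERANCE = [
    {"word": "dat", "pos": "pronoun"},
    {"word": "is", "pos": "verb"},
    {"word": "het", "pos": "definite_determiner"},
    {"word": "bijvoeglijk", "pos": "adjective", "vorm": "zonder-e"},
    {"word": "naamwoord", "pos": "noun"},
]

FOUND = [[tok["word"] for tok in hit]
         for hit in find_subsequences(UTTERANCE, PATTERN)]
```

Returning the matched slices corresponds to the "just return the subsequence" alternative; returning the whole utterance whenever the list is non-empty corresponds to the main variant.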

Corpus+Lexicon search
The same as in the preceding example, but now the adjective should not end in two syllables that both contain a schwa, as determined by a regular expression over its phonetic_transcription as found in the CGN-lexicon.
This excludes an example such as 'het ingewikkelder probleem'.
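A possible form of such a regular expression, sketched in Python. The transcription format is an assumption: syllables separated by "-" and schwa written as "@". The actual CGN-lexicon transcriptions may be encoded differently, and the example transcriptions are hand-made approximations.

```python
import re

# Assumed format: syllables separated by "-", schwa written as "@".
# The pattern requires a schwa in each of the last two syllables.
TWO_FINAL_SCHWA_SYLLABLES = re.compile(r"(?:^|-)[^-]*@[^-]*-[^-]*@[^-]*$")


def ends_in_two_schwa_syllables(transcription):
    """True if both of the last two syllables contain a schwa."""
    return TWO_FINAL_SCHWA_SYLLABLES.search(transcription) is not None


# Hand-made approximate transcriptions, for illustration only.
EXCLUDED = ends_in_two_schwa_syllables("In-G@-wI-k@l-d@r")   # ingewikkelder
KEPT = not ends_in_two_schwa_syllables("bEi-vux-l@k")        # bijvoeglijk
```

The corpus hits from the previous query would be filtered by looking up each adjective in the lexicon and dropping those for which the function returns true.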

Corpus+Lexicon search
Each result additionally gets a value for an extra attribute, with possible values eFormExists, eFormDoesNotExist, eFormExistenceUnknown. The value specifies whether, for the word with PoS=adjective, a form with property vorm=met-e (form with -e) exists, does not exist, or whether it is unknown whether such a form exists.

Corpus+Lexicon search
Let wv be the value of the attribute word of the word token with properties PoS=adjective, vorm=zonder-e. Look up the entry/entries for wv for which PoS=adjective in the CGN-lexicon and/or CELEX lexicon, and determine its lemma (=wl).
– If not found: result=eFormExistenceUnknown
– If found: look up in CGN/CELEX an entry with PoS=adjective-code, lemma=wl and vorm=met-e
  – if found: result=eFormExists (e.g. '(het) bijvoeglijk (naamwoord)')
  – if not found: result=eFormDoesNotExist (e.g. '(de) gulden (snede)')
This can be done in one very complicated query, or the queries can be put in a series where the results of the first query are filtered by the second query, etc.
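The decision procedure above can be written as one small Python function. The lexicon is modelled as a flat list of entries with word, pos, lemma and vorm fields, which is an assumption for this sketch; the three entries are toy data chosen to mirror the two examples on the slide.

```python
# Toy lexicon; field names and entries are assumptions for this sketch.
LEXICON = [
    {"word": "bijvoeglijk", "pos": "adjective",
     "lemma": "bijvoeglijk", "vorm": "zonder-e"},
    {"word": "bijvoeglijke", "pos": "adjective",
     "lemma": "bijvoeglijk", "vorm": "met-e"},
    {"word": "gulden", "pos": "adjective",
     "lemma": "gulden", "vorm": "zonder-e"},
]


def e_form_status(wv, lexicon):
    """Classify the adjective wv by whether an inflected -e form is listed."""
    # Step 1: find the lemma(s) wl of the adjective entries for wv.
    lemmas = {entry["lemma"] for entry in lexicon
              if entry["word"] == wv and entry["pos"] == "adjective"}
    if not lemmas:
        return "eFormExistenceUnknown"
    # Step 2: look for an adjective entry with that lemma and vorm=met-e.
    has_e_form = any(entry["pos"] == "adjective"
                     and entry["lemma"] in lemmas
                     and entry["vorm"] == "met-e"
                     for entry in lexicon)
    return "eFormExists" if has_e_form else "eFormDoesNotExist"
```

Written this way the lookup is the second stage of a pipeline: the corpus query supplies the adjectives, and this function annotates each result.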

Iterative Corpus Search
Each result of the previous query is (or contains) a sequence Det ADJ NOUN.
For each result found in the previous query, give me utterances that contain a subsequence of the form:
– a word token with PoS='definite_determiner', immediately followed by
– a word token with PoS=adjective, with lemma=ADJ.lemma and with vorm=met-e, immediately followed by
– a word token with PoS=noun with number=NOUN.number
Alternative: just return the subsequences.
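The iteration can be sketched in Python as deriving a new pattern from each earlier hit and running it against the corpus. Token dictionaries, field names and the example data are assumptions made for illustration.

```python
def follow_up_pattern(hit):
    """Build the second-round pattern from one Det ADJ NOUN result."""
    _det, adj, noun = hit
    return [
        {"pos": "definite_determiner"},
        {"pos": "adjective", "lemma": adj["lemma"], "vorm": "met-e"},
        {"pos": "noun", "number": noun["number"]},
    ]


def contains(utterance, pattern):
    """True if some run of adjacent tokens satisfies the pattern."""
    size = len(pattern)
    return any(
        all(tok.get(k) == v
            for tok, con in zip(utterance[i:i + size], pattern)
            for k, v in con.items())
        for i in range(len(utterance) - size + 1))


# A hit from the first query and a toy corpus of one utterance (invented).
FIRST_HIT = [
    {"word": "het", "pos": "definite_determiner"},
    {"word": "bijvoeglijk", "pos": "adjective",
     "lemma": "bijvoeglijk", "vorm": "zonder-e"},
    {"word": "naamwoord", "pos": "noun", "number": "sg"},
]
CORPUS = [[
    {"word": "de", "pos": "definite_determiner"},
    {"word": "bijvoeglijke", "pos": "adjective",
     "lemma": "bijvoeglijk", "vorm": "met-e"},
    {"word": "bepaling", "pos": "noun", "number": "sg"},
]]

SECOND_ROUND = [u for u in CORPUS if contains(u, follow_up_pattern(FIRST_HIT))]
```

Because the second pattern is computed from the first result, the two queries cannot be merged into one static query without some way to bind values such as ADJ.lemma.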

Search in MIMORE
Using the MIMORE search engine (MIMORE web app): give me utterances that contain a subsequence of the form:
– a word token with PoS='definite_determiner', immediately followed by
– a word token with PoS=adjective and vorm=zonder-e, immediately followed by
– a word token with PoS=noun
Alternative: just return the subsequences.

More Examples
– Odijk, J. (2011), "User Scenario Search", internal CLARIN-NL document, April 13 [docx].
– Odijk, J. (2011), "Linguistic Research in the CLARIN Infrastructure", presentation for the KNAW eHumanities Workshop, NIAS, Wassenaar, Mar 29, 2011 [ppt]. Abstract contained in the eHumanities Brainstorm Booklet.
– Odijk, J.E.J.M. (2012, October 23). Linguistic Research and the CLARIN Infrastructure. Utrecht, Digital Humanities Lecture [ppt].

Thanks for your attention!
