Data models and the (blind?) query of lexical resources
Laurent Romary, Inria & HUB
Work in progress



Argument
- Querying language resources requires precise knowledge of the underlying representation model
- Given the over-expressiveness of the TEI, we need complementary (possibly local, aka crystals) models to enforce reference constructs
- The TEI dictionary is a good use case to start with, given the existence of LMF as an underlying model

Querying semi-structured data
- Characteristics
  - Order (sequences or sets)
  - Recursivity (depth-free?)
  - Typing (local or structural)
  - Schema-driven or not
- Select-from-where and DB models
  - Traditionally: work on the link between database models and the corresponding query structures (ERM)
  - Recent proposals in the semi-structured document community (the Abiteboul school; cf. the work around XQuery; D. Florescu)
- Importance of paths (set semantics)
- Role of patterns (e.g. pairs, sequences, car.cdr etc.) as typing mechanisms
- Very little work on data with low conformance to a reference model
  - Low predictive data
- Objective: limiting unpredictability in the lexical domain

Lexical data is a messy field
- From full-form lexica for NLP to encyclopaedic dictionaries
- Legacy unstructured/unpredictable data
  - Available software (e.g. Shoebox) or scholarly traditions (the Multext format for full-form lexica)
- Two core reference traditions/models/serialisations
  - Onomasiological
    - Concept to term, as is the case for most terminological databases
    - ISO standard: TMF (Terminological Markup Framework)
    - Natural serialisation in an ISO standard (TBX)
  - Semasiological
    - Word to sense, as implemented in traditional dictionaries
    - ISO standard: LMF (Lexical Markup Framework)
    - Unclear serialisation landscape. We argue in (Romary, 2013) that the TEI offers just the background we need

TEI: a wealth of possibilities, reflecting messiness
- Orphan grammatical descriptors [corrected!]
- Orphan sense descriptors
  - […], […], etc. can occur outside a sense
- Multiple elements to provide the “same” information
  - E.g. […] vs. […]
- General issues
  - Free text can occur everywhere
  - Existence and usage of large-coverage TEI classes
  - E.g. (text | model.gLike | sense | model.entryPart.top | model.phrase | model.global)*

model.global
- model.global.edit [addSpan damageSpan delSpan gap space]
- model.global.meta [alt altGrp certainty fLib fs fvLib index interp interpGrp join joinGrp link linkGrp listTranspose precision respons span spanGrp substJoin timeline]
- model.global.spoken [incident kinesic pause shift vocal writing]
- model.milestoneLike [anchor cb fw gb lb milestone pb]
- model.noteLike [note witDetail]
- figure metamark notatedMusic

Model: principles
- General modeling strategy from ISO/TC 37 (cf. Object Management Group)
  - Meta-model
    - General, underlying model that informs current practice
  - Data categories
    - Provide the elementary descriptors to instantiate models
    - Possibly registered/standardised/reused from ISOcat
- Any serialization isomorphic with a given model is acceptable
  - “Blind” (no schema, no documentation) interoperability requires sharing vocabularies within communities of services

LMF-TEI meta-model components (simplified)
[Diagram: components Lexicon, Lexical entry, Morphology, Form and Sense, linked with cardinalities 0..n and 1..1]

Main data categories
[Diagram: the meta-model components (Lexicon, Lexical entry, Morphology, Form, Sense) annotated with data categories, each followed by a parenthesised TEI element whose name is not preserved here]
- /orthography/, /pronunciation/, /hyphenization/, /syllabification/, /stress pattern/
- /part of speech/, /inflexional class/, /gender/, /number/, /case/, /person/, /tense/, /mood/
- /definition/, /example/, /usage/, /etymology/

Examples of constraints
- Forbid the usage of […] (status of […] to be determined)
- Systematic use of a grammatical container (gramGrp) for all grammatical features
- Limit the usage of […] to […]
- Only allow semantic descriptors in […] (usage constraints) and […] (for contextualizing an example)
- …
- cf. Budin et al., 2012; Romary & Wegstein, 2012; Romary, 2013
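The gramGrp constraint above lends itself to an automatic check. The following is a minimal sketch, not the author's tooling: it assumes entries are plain TEI-like XML without namespaces, and the set `GRAM_FEATURES` is an illustrative, non-exhaustive choice of TEI grammatical elements. The sample data and the function name `orphan_gram_features` are hypothetical.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Assumption: an illustrative, non-exhaustive set of TEI grammatical feature elements.
GRAM_FEATURES = {"pos", "gen", "number", "case", "per", "tns", "mood", "iType"}

# Hypothetical sample: the second entry carries a <pos> outside any <gramGrp>.
SAMPLE = """
<body>
  <entry xml:id="chat">
    <form><orth>chat</orth></form>
    <gramGrp><pos>noun</pos><gen>m</gen></gramGrp>
  </entry>
  <entry xml:id="manger">
    <form><orth>manger</orth></form>
    <pos>verb</pos>
  </entry>
</body>
"""

def orphan_gram_features(root):
    """Return (entry id, tag) for each grammatical feature whose parent is not gramGrp."""
    # ElementTree has no parent pointers, so build a child -> parent map once.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    orphans = []
    for elem in root.iter():
        parent = parent_of.get(elem)
        if elem.tag in GRAM_FEATURES and (parent is None or parent.tag != "gramGrp"):
            # Walk up to the enclosing entry to report where the orphan lives.
            node = elem
            while node is not None and node.tag != "entry":
                node = parent_of.get(node)
            orphans.append((node.get(XML_ID) if node is not None else None, elem.tag))
    return orphans

if __name__ == "__main__":
    print(orphan_gram_features(ET.fromstring(SAMPLE)))
```

A real TEI document would declare the TEI namespace, in which case the tag comparisons would need the namespace-qualified names.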

Paths
- Baseline when using (XML) TEI documents: XPath
  - Issues: model-agnostic, serialization-specific
- Model-based query language (component-data-category (CDC) path)
  - Pointing to explicit components and data categories
  - $lexicalEntry.$sense.geographicalUsage
- A CDC path can be checked for compatibility with the model
- We can consider the compiled set of all paths compatible with the model: the CDC graph
- Natural interface with DB/faceting environments such as ElasticSearch
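One way to picture the compatibility check and the compiled CDC graph is sketched below. Everything in it is an assumption made for illustration: the `MODEL` dictionary is a toy, non-recursive rendering of the meta-model (the slides do not spell out which data category attaches to which component), and the names `parse_cdc_path`, `is_compatible` and `compile_paths` are hypothetical.

```python
# Hypothetical toy model: component -> allowed child components and data categories.
# Assumption: no recursive components, so the set of paths is finite.
MODEL = {
    "lexicalEntry": {"components": {"form", "morphology", "sense"}, "categories": set()},
    "form": {"components": set(), "categories": {"orthography", "pronunciation"}},
    "morphology": {"components": set(), "categories": {"partOfSpeech", "gender", "number"}},
    "sense": {"components": set(), "categories": {"definition", "geographicalUsage"}},
}

def parse_cdc_path(path):
    """Split '$a.$b.category' into (['a', 'b'], 'category'); the category is optional."""
    steps = path.split(".")
    category = None
    if not steps[-1].startswith("$"):
        category = steps.pop()
    if not steps or not all(s.startswith("$") and len(s) > 1 for s in steps):
        raise ValueError(f"malformed CDC path: {path!r}")
    return [s[1:] for s in steps], category

def is_compatible(path, model, root="lexicalEntry"):
    """Check that every step of the path is licensed by the model."""
    components, category = parse_cdc_path(path)
    if components[0] != root or root not in model:
        return False
    for parent, child in zip(components, components[1:]):
        if child not in model[parent]["components"] or child not in model:
            return False
    return category is None or category in model[components[-1]]["categories"]

def compile_paths(model, root="lexicalEntry"):
    """Enumerate every CDC path the model allows (the compiled CDC graph as a set)."""
    paths = set()

    def walk(component, prefix):
        here = f"{prefix}.${component}" if prefix else f"${component}"
        paths.add(here)
        paths.update(f"{here}.{cat}" for cat in model[component]["categories"])
        for child in model[component]["components"]:
            walk(child, here)

    walk(root, "")
    return paths

if __name__ == "__main__":
    print(is_compatible("$lexicalEntry.$sense.geographicalUsage", MODEL))
    print(sorted(compile_paths(MODEL)))
```

Because the compiled result is a flat set of strings, it could be indexed directly as facet keys in a search backend, which is presumably the kind of interface the last bullet has in mind.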

Queries
- Retrieval of a specific entry given a constraint on the form
  - Token to word-form mapping
  - $lexiconEntry.$form[orthography='chats']
- Retrieval of a sense from an entry given additional constraints
  - $lexiconEntry.$sense*[subjectField='nautical']
- Search for all entries having some specific form, grammatical or semantic properties, for instance the retrieval of all transitive verbs
  - …
- Extraction of all (or part of all) occurrences of a certain descriptor in a group of lexical entries, for instance all translated examples
  - …
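To make the first query concrete, here is a hedged sketch of how a model-level query could be translated into a serialization-specific path and run over TEI-like data. The mapping `TEI_NAME`, the accepted query shape, the sample data and the function names are all assumptions for illustration; the slides do not define a translation procedure. The sketch uses the limited XPath subset of Python's `xml.etree.ElementTree` and ignores the TEI namespace.

```python
import re
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical mapping from model names to TEI element names (an assumption).
TEI_NAME = {"lexiconEntry": "entry", "form": "form", "orthography": "orth"}

# Accepts only the shape $component.$component[category='value'].
QUERY = re.compile(r"^\$(\w+)\.\$(\w+)\[(\w+)='([^']*)'\]$")

# Hypothetical sample data, without the TEI namespace for brevity.
SAMPLE = """
<body>
  <entry xml:id="chat">
    <form type="lemma"><orth>chat</orth></form>
    <form type="inflected"><orth>chats</orth></form>
  </entry>
  <entry xml:id="chien">
    <form type="lemma"><orth>chien</orth></form>
  </entry>
</body>
"""

def to_elementtree_path(query):
    """Translate a model-level query into an ElementTree path expression."""
    match = QUERY.match(query)
    if match is None:
        raise ValueError(f"unsupported query: {query!r}")
    entry, component, category, value = match.groups()
    # Select the constrained component, then step back up to its entry.
    return (f".//{TEI_NAME[entry]}/{TEI_NAME[component]}"
            f"[{TEI_NAME[category]}='{value}']/..")

def find_entries(root, query):
    """Return the xml:id of every entry matching the query."""
    return [e.get(XML_ID) for e in root.findall(to_elementtree_path(query))]

if __name__ == "__main__":
    query = "$lexiconEntry.$form[orthography='chats']"
    print(to_elementtree_path(query))
    print(find_entries(ET.fromstring(SAMPLE), query))
```

The design point is that the query author only sees model names; the serialization knowledge lives in one mapping table that could be swapped for another TEI profile.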

Signatures
- Objective: characterizing the data as compliant with a given model (M)
  - Identification of queryable data (D)
- Principle
  - S_M: construction of a compiled graph of components and data categories allowed by a model (component-DC graph)
  - S_D: construction of the compiled graph of CDC paths from the data
  - S_D must be a subset of S_M
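The subset condition can be sketched with plain sets of path strings. The nested-dictionary data format, the contents of `S_M` and the function names below are assumptions made for the example, not something the slides prescribe.

```python
# Hypothetical model signature S_M, written out by hand as a set of CDC paths.
S_M = {
    "$lexicalEntry",
    "$lexicalEntry.$form",
    "$lexicalEntry.$form.orthography",
    "$lexicalEntry.$sense",
    "$lexicalEntry.$sense.definition",
}

# Hypothetical in-memory entry: each node names its component, its data
# categories and its child components.
ENTRY = {
    "component": "lexicalEntry",
    "children": [
        {"component": "form", "categories": {"orthography": "chat"}},
        {"component": "sense", "categories": {"definition": "small domestic feline"}},
    ],
}

def data_signature(node, prefix=""):
    """Collect S_D: every CDC path actually realized under this node."""
    here = f"{prefix}.${node['component']}" if prefix else f"${node['component']}"
    paths = {here}
    paths.update(f"{here}.{category}" for category in node.get("categories", {}))
    for child in node.get("children", []):
        paths |= data_signature(child, here)
    return paths

def is_compliant(node, model_signature):
    """The data is compliant when S_D is a subset of S_M."""
    return data_signature(node) <= model_signature

if __name__ == "__main__":
    print(sorted(data_signature(ENTRY)))
    print(is_compliant(ENTRY, S_M))
```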

Silent data
- Scenario: querying multiple dictionaries of various types
  - e.g. presence of full-form lexica for which queries about […] do not apply
- Identifying all paths from the model which are not realized in the data
  - S_M - S_D
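As a rough illustration of the multi-dictionary scenario, the sketch below computes the silent paths of each resource as a set difference and uses the data signatures to decide which resources a query can meaningfully be sent to. The two resources and their signatures are invented for the example, and the function names are hypothetical.

```python
# Hypothetical model signature and per-resource data signatures.
S_M = {
    "$lexicalEntry.$form.orthography",
    "$lexicalEntry.$morphology.partOfSpeech",
    "$lexicalEntry.$sense.definition",
}

SIGNATURES = {
    # A full-form lexicon: forms and grammar, but no senses.
    "fullFormLexicon": {
        "$lexicalEntry.$form.orthography",
        "$lexicalEntry.$morphology.partOfSpeech",
    },
    # A general dictionary covering everything in the model.
    "generalDictionary": set(S_M),
}

def silent_paths(model_signature, data_signature):
    """Paths allowed by the model but never realized in the data (S_M - S_D)."""
    return model_signature - data_signature

def resources_for(query_path, signatures):
    """Names of the resources in which the queried path is actually realized."""
    return sorted(name for name, sig in signatures.items() if query_path in sig)

if __name__ == "__main__":
    for name, sig in SIGNATURES.items():
        print(name, sorted(silent_paths(S_M, sig)))
    print(resources_for("$lexicalEntry.$sense.definition", SIGNATURES))
```

An empty answer from a silent resource then reads as "this dictionary does not record that information" rather than "no entry matches".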

Noisy data
- TEI-encoded data which do not fulfill LMF compliance
- Checking process
  - Compiling all possible paths as a CDC graph
  - Comparison with the CDC paths allowed by the model
- Note that data can still be queried
  - Depending on semantics, lower recall and precision
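The checking process can be sketched as a count of observed paths that fall outside the model. The "compliant share" computed below is an illustrative measure introduced for this example, not one defined in the slides, and the observed paths are invented.

```python
from collections import Counter

# Hypothetical model signature.
S_M = {
    "$lexicalEntry.$form.orthography",
    "$lexicalEntry.$morphology.partOfSpeech",
}

# Hypothetical paths observed in the data, one item per occurrence.
OBSERVED = [
    "$lexicalEntry.$form.orthography",
    "$lexicalEntry.$form.orthography",
    "$lexicalEntry.$morphology.partOfSpeech",
    "$lexicalEntry.partOfSpeech",  # orphan descriptor, not licensed by the model
]

def noise_report(observed_paths, model_signature):
    """Return the non-compliant paths with their counts and the compliant share."""
    counts = Counter(observed_paths)
    noisy = {path: n for path, n in counts.items() if path not in model_signature}
    total = sum(counts.values())
    # Assumption: empty data is treated as fully compliant.
    compliant_share = (total - sum(noisy.values())) / total if total else 1.0
    return noisy, compliant_share

if __name__ == "__main__":
    noisy, share = noise_report(OBSERVED, S_M)
    print(noisy, share)
```

A query on a model path still runs over such data; the occurrences stored under non-compliant paths are simply invisible to it, which is one way recall can drop.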

What’s next
- The issue of querying language resources should be accompanied by an enforcement of models
  - Integration within a language resource query language agenda (bringing in semi-structured database specialists)
- Going blind?
  - Procedures for identifying compatibilities between queries and data
- Data quality check
  - Recommendations for DARIAH & CLARIN
- LMF additional part?
  - Not just a technical issue…

Trend: TEI reaching out to new communities
- Bringing back existing communities of practice