Data models and the (blind?) query of lexical resources
Laurent Romary, Inria & HUB
Work in progress


Argument
Querying language resources requires precise knowledge of the underlying representation model
Given the over-expressiveness of the TEI, we need complementary (possibly local, aka crystals) models to enforce reference constructs
The TEI dictionary is a good use case to start with, given the existence of LMF as an underlying model

Querying semi-structured data
Characteristics
– Order (sequences or sets)
– Recursivity (depth-free?)
– Typing (local or structural)
– Schema-driven or not
Select-from-where and DB models
– Traditionally: work on the link between database models and the corresponding query structures (ERM)
– Recent proposals in the semi-structured document community (the Abiteboul school; cf. the work around XQuery; D. Florescu)
Importance of paths (set semantics)
Role of patterns (e.g. pairs, sequences, car.cdr, etc.) as typing mechanisms
Very little work on data with low conformance to a reference model
– Data with low predictability
Objective: limiting unpredictability in the lexical domain

Lexical data is a messy field
From full-form lexica for NLP to encyclopaedic dictionaries
Legacy unstructured/unpredictable data
– Available software (e.g. Shoebox) or scholarly traditions (the Multext format for full-form lexica)
Two core reference traditions/models/serialisations
– Onomasiological
  Concept to term, as is the case for most terminological databases
  ISO 16642 (TMF, Terminological Markup Framework)
  Natural serialisation in ISO 30042 (TBX)
– Semasiological
  Word to sense, as implemented in traditional dictionaries
  ISO 24613 (LMF, Lexical Markup Framework)
  Unclear serialisation landscape. We argue in (Romary, 2013) that the TEI offers just the background we need

TEI: a wealth of possibilities, reflecting messiness
Orphan grammatical descriptors [corrected!]
Orphan sense descriptors
– ,, etc. can occur outside a sense
Multiple elements to provide the “same” information
– E.g. vs.
General issues
– Free text can occur everywhere
– Existence and usage of large-coverage TEI classes
– E.g. (text | model.gLike | sense | model.entryPart.top | model.phrase | model.global)*

model.global
– model.global.edit [addSpan damageSpan delSpan gap space]
– model.global.meta [alt altGrp certainty fLib fs fvLib index interp interpGrp join joinGrp link linkGrp listTranspose precision respons span spanGrp substJoin timeline]
– model.global.spoken [incident kinesic pause shift vocal writing]
– model.milestoneLike [anchor cb fw gb lb milestone pb]
– model.noteLike [note witDetail]
– figure metamark notatedMusic

Model - principles
General modeling strategy from ISO/TC 37 (cf. Object Management Group)
– Meta-model
  General, underlying model that informs current practice
– Data categories
  Provide the elementary descriptors to instantiate models
  Possibly registered/standardised/re-used from ISOcat
Any serialization isomorphic with a given model is acceptable
– “Blind” (no schema, no documentation) interoperability requires sharing vocabularies within communities of services

LMF-TEI meta-model components (simplified)
[Diagram: Lexicon, Lexical entry, Morphology, Form and Sense components, linked with 0..n and 1..1 cardinalities]

Main data categories
[Diagram: the meta-model components (Lexicon, Lexical entry, Morphology, Form, Sense) with their 0..n and 1..1 cardinalities, annotated with the data categories below]
/orthography/ ( ) /pronunciation/ ( ) /hyphenization/ ( ) /syllabification/ ( ) /stress pattern/ ( ) /part of speech/ ( ) /inflexional class/ ( ) /gender/ ( ) /number/ ( ) /case/ ( ) /person/ ( ) /tense/ ( ) /mood/ ( ) /definition/ ( ) /example/ ( ) /usage/ ( ) /etymology/ ( )

Examples of constraints
Forbid the usage of, (status of, to be determined)
Systematic use of a grammatical container (gramGrp) for all grammatical features
Limit the usage of to,,,
Only allow semantic descriptors in (usage constraints), and (for contextualizing an example)
…
cf. Budin et al., 2012; Romary & Wegstein, 2012; Romary, 2013
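The gramGrp constraint can be checked mechanically. The sketch below is only an illustration: the list of elements treated as grammatical features (`GRAM_FEATURES`), the function name `orphan_gram_features` and the sample entry are assumptions made here, not part of the slides.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Assumption: the TEI elements counted as grammatical features for this check.
GRAM_FEATURES = {"pos", "gen", "number", "case", "per", "tns", "mood", "iType"}


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.split("}", 1)[1] if tag.startswith("{") else tag


def orphan_gram_features(root):
    """Return (feature, parent) pairs for features not directly inside gramGrp."""
    orphans = []
    for parent in root.iter():
        for child in parent:
            in_tei = child.tag.startswith("{" + TEI_NS + "}")
            if in_tei and local_name(child.tag) in GRAM_FEATURES:
                if parent.tag != "{" + TEI_NS + "}gramGrp":
                    orphans.append((local_name(child.tag), local_name(parent.tag)))
    return orphans


# Hypothetical sample entry: <gen> sits directly in <form>, outside any gramGrp.
SAMPLE = """
<entry xmlns="http://www.tei-c.org/ns/1.0">
  <form type="lemma"><orth>chat</orth><gen>m</gen></form>
  <gramGrp><pos>noun</pos></gramGrp>
</entry>
"""

print(orphan_gram_features(ET.fromstring(SAMPLE)))  # [('gen', 'form')]
```

A schema customisation would be the usual way to enforce such a constraint; a script of this kind is mainly useful for auditing legacy data.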

Paths
Baseline when using (XML) TEI documents:
– XPath: entry/sense/usg[@type='geo']
Issues
– Model agnostic
– Serialization specific
Model-based query language (component-data-category (CDC) path)
– Pointing to explicit components and data categories
  $lexicalEntry.$sense.geographicalUsage
A CDC path can be checked for compatibility with the model
We can consider the compiled set of all paths compatible with the model: the CDC graph
Natural interface with DB/faceting environments such as ElasticSearch
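Checking a path against a model can be sketched as follows. The `MODEL` dictionary is a hypothetical, much simplified stand-in for the component and data-category model (not the normative LMF definition), and `parse_cdc_path` and `is_compatible` are names invented for this illustration.

```python
# Hypothetical simplified model: for each component, the sub-components and
# data categories it may carry. Names follow the path notation of the slide.
MODEL = {
    "lexicalEntry": {"components": {"form", "morphology", "sense"}, "categories": set()},
    "form": {"components": set(), "categories": {"orthography", "pronunciation"}},
    "morphology": {"components": set(), "categories": {"partOfSpeech", "gender"}},
    "sense": {
        "components": {"sense"},  # senses may nest
        "categories": {"definition", "example", "geographicalUsage", "subjectField"},
    },
}


def parse_cdc_path(path):
    """Split '$a.$b.cat' into (['a', 'b'], 'cat'); the category is optional."""
    steps = path.split(".")
    category = None
    if not steps[-1].startswith("$"):
        category = steps.pop()
    if not steps or not all(s.startswith("$") and len(s) > 1 for s in steps):
        raise ValueError(f"malformed CDC path: {path!r}")
    return [s[1:] for s in steps], category


def is_compatible(path, model=MODEL, root="lexicalEntry"):
    """True if every step of the path is allowed by the model."""
    components, category = parse_cdc_path(path)
    if components[0] != root:
        return False
    for parent, child in zip(components, components[1:]):
        if child not in model[parent]["components"]:
            return False
    return category is None or category in model[components[-1]]["categories"]


print(is_compatible("$lexicalEntry.$sense.geographicalUsage"))  # True
print(is_compatible("$lexicalEntry.$form.geographicalUsage"))   # False
```

Unlike the XPath expression, the check never mentions element names such as usg, so the same path stays valid for any serialization mapped to the model.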

Queries
Retrieval of a specific entry considering constraints on the form
– Token to word-form mapping
– $lexiconEntry.$form[orthography='chats']
Retrieval of a sense from an entry given additional constraints
– $lexiconEntry.$sense*[subjectField='nautical']
Search for all entries having some specific form, grammatical or semantic properties, for instance the retrieval of all transitive verbs
– …
Extraction of all (or part of all) occurrences of a certain descriptor in a group of lexical entries, for instance all translated examples
– …
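To make the first two queries concrete, here is a minimal sketch over an in-memory lexicon. The toy entries, the dictionary layout and the function names are assumptions for illustration, and the `*` on `$sense` is read here as "senses at any nesting depth", which the slides do not state explicitly.

```python
# Toy lexicon (invented data): each entry holds lists of sub-components,
# each sub-component a dict of data categories.
LEXICON = [
    {
        "form": [{"orthography": "chat"}, {"orthography": "chats"}],
        "sense": [{"definition": "domestic feline"}],
    },
    {
        "form": [{"orthography": "pont"}],
        "sense": [
            {
                "definition": "bridge",
                "sense": [{"subjectField": "nautical", "definition": "deck of a ship"}],
            }
        ],
    },
]


def entries_with_form(lexicon, category, value):
    """Entries having at least one form whose data category equals value."""
    return [
        entry
        for entry in lexicon
        if any(form.get(category) == value for form in entry.get("form", []))
    ]


def iter_senses(node):
    """Yield the senses of an entry (or sense) at any nesting depth."""
    for sense in node.get("sense", []):
        yield sense
        yield from iter_senses(sense)


def senses_matching(lexicon, category, value):
    """Senses, at any depth, whose data category equals value."""
    return [
        sense
        for entry in lexicon
        for sense in iter_senses(entry)
        if sense.get(category) == value
    ]


# Counterpart of $lexiconEntry.$form[orthography='chats']
print(len(entries_with_form(LEXICON, "orthography", "chats")))  # 1
# Counterpart of $lexiconEntry.$sense*[subjectField='nautical']
print(senses_matching(LEXICON, "subjectField", "nautical")[0]["definition"])  # deck of a ship
```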

Signatures
Objective: characterizing the data as compliant with a given model (M)
– Identification of queryable data (D)
Principle
– S_M: construction of a compiled graph of components and data categories allowed by a model (component-DC graph)
– S_D: construction of the compiled graph of CDC paths from the data
– S_D must be a subset of S_M
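A minimal sketch of the two signatures and of the subset test, under stated assumptions: the `MODEL` dictionary and the toy `DATA` are invented for illustration, signatures are represented as plain sets of path strings rather than graphs, and recursion through nested senses is cut at an arbitrary `max_depth` so that the model signature stays finite.

```python
# Hypothetical simplified model (not the normative LMF definition).
MODEL = {
    "lexicalEntry": {"components": ["form", "morphology", "sense"], "categories": []},
    "form": {"components": [], "categories": ["orthography", "pronunciation"]},
    "morphology": {"components": [], "categories": ["partOfSpeech", "gender"]},
    "sense": {"components": ["sense"], "categories": ["definition", "example"]},
}


def model_signature(model, root, max_depth=3):
    """Compile every CDC path the model allows, down to max_depth components."""
    paths = set()

    def walk(component, prefix, depth):
        for category in model[component]["categories"]:
            paths.add(prefix + "." + category)
        if depth < max_depth:
            for child in model[component]["components"]:
                walk(child, prefix + ".$" + child, depth + 1)

    walk(root, "$" + root, 1)
    return paths


def data_signature(entries, root):
    """Compile the CDC paths actually realized in the data."""
    paths = set()

    def walk(node, prefix):
        for key, value in node.items():
            if isinstance(value, list):  # a list holds sub-components
                for child in value:
                    walk(child, prefix + ".$" + key)
            else:  # anything else is a data-category value
                paths.add(prefix + "." + key)

    for entry in entries:
        walk(entry, "$" + root)
    return paths


# Toy data (invented): one entry with a form and a nested sense.
DATA = [
    {
        "form": [{"orthography": "chat"}],
        "sense": [{"definition": "domestic feline", "sense": [{"example": "le chat dort"}]}],
    }
]

S_M = model_signature(MODEL, "lexicalEntry")
S_D = data_signature(DATA, "lexicalEntry")
print(S_D <= S_M)  # True: every path found in the data is allowed by the model
```

Working with sets makes the compliance test a single subset comparison, and the same two sets can be reused for set differences in either direction.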

Silent data
Scenario: querying multiple dictionaries of various types
– e.g. presence of full-form lexica for which queries about do not apply
Identifying all paths from the model which are not realized in the data
– S_M - S_D
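With signatures held as sets of path strings, the silent part of a resource is a plain set difference, and a query can be routed to the resources whose signature contains its path. The resource names and paths below are invented for illustration.

```python
# Hypothetical model signature, as a set of CDC path strings.
S_M = {
    "$lexicalEntry.$form.orthography",
    "$lexicalEntry.$morphology.partOfSpeech",
    "$lexicalEntry.$sense.definition",
    "$lexicalEntry.$sense.example",
}

# Hypothetical data signatures for two resources of different types.
SIGNATURES = {
    "full_form_lexicon": {
        "$lexicalEntry.$form.orthography",
        "$lexicalEntry.$morphology.partOfSpeech",
    },
    "dictionary": {
        "$lexicalEntry.$form.orthography",
        "$lexicalEntry.$sense.definition",
    },
}


def silent_paths(s_m, s_d):
    """Paths allowed by the model but never realized in the data."""
    return s_m - s_d


def answerable(query_path, signatures):
    """Names of the resources in which the queried path is realized."""
    return sorted(name for name, s_d in signatures.items() if query_path in s_d)


print(sorted(silent_paths(S_M, SIGNATURES["full_form_lexicon"])))
# ['$lexicalEntry.$sense.definition', '$lexicalEntry.$sense.example']
print(answerable("$lexicalEntry.$sense.definition", SIGNATURES))
# ['dictionary']
```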

Noisy data
TEI-encoded data which do not fulfill LMF compliance
Checking process
– Compiling all possible paths as a CDC graph
– Comparison with possible CDC paths allowed by the model
Note that data can still be queried
– Depending on the semantics, with lower recall and precision
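The comparison step can be sketched as the set difference taken in the other direction: paths found in the data that the model does not allow. The paths below are invented; the orphan part of speech stands for the kind of descriptor that appears outside its expected container.

```python
# Hypothetical model signature, as a set of CDC path strings.
S_M = {
    "$lexicalEntry.$form.orthography",
    "$lexicalEntry.$morphology.partOfSpeech",
    "$lexicalEntry.$sense.definition",
}

# Hypothetical data signature: partOfSpeech attached directly to the entry.
S_D = {
    "$lexicalEntry.$form.orthography",
    "$lexicalEntry.partOfSpeech",
    "$lexicalEntry.$sense.definition",
}


def compliance_report(s_d, s_m):
    """Split the data paths into those the model allows and those it does not."""
    return {"compliant": sorted(s_d & s_m), "noisy": sorted(s_d - s_m)}


report = compliance_report(S_D, S_M)
print(report["noisy"])           # ['$lexicalEntry.partOfSpeech']
print(len(report["compliant"]))  # 2
```

Queries restricted to the compliant paths still work on such data; a query on `$lexicalEntry.$morphology.partOfSpeech` would miss the orphan values, which is one way the loss of recall shows up.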

What’s next
The issue of querying language resources should be accompanied by an enforcement of models
– Integration within a language resource query language agenda (bringing in semi-structured database specialists)
Going blind?
– Procedures for identifying compatibilities between queries and data
Data quality check
– Recommendations for DARIAH & CLARIN
LMF additional part?
– Not just a technical issue…

Trend: TEI reaching out to new communities
– Bringing back existing communities of practice
