Data models and the (blind ?) query of lexical resources Laurent Romary Inria & HUB Work in progress.


1 Data models and the (blind ?) query of lexical resources Laurent Romary Inria & HUB Work in progress

2 Argument Querying language resources requires precise knowledge of the underlying representation model. Given the over-expressiveness of the TEI, we need complementary (possibly local – aka crystals) models to enforce reference constructs. The TEI dictionary is a good use case to start with, given the existence of LMF as an underlying model.

3 Querying semi-structured data Characteristics – Order (sequences or sets) – Recursion (depth-free?) – Typing (local or structural) – Schema-driven or not Select-from-where and DB models – Traditionally: work on the link between database models and the corresponding query structures (ERM) – Recent proposals in the semi-structured document community (the Abiteboul school => cf. work around XQuery; D. Florescu) Importance of paths (set semantics) Role of patterns (e.g. pairs, sequences, car.cdr etc.) as typing mechanisms Very little work on data with low conformance to a reference model – Low-predictability data Objective: limiting unpredictability in the lexical domain

4 Lexical data is a messy field From full-form lexica for NLP to encyclopaedic dictionaries Legacy unstructured/unpredictable data – Available software (e.g. Shoebox) or scholarly traditions (the Multext format for full-form lexica) Two core reference traditions/models/serialisations – Onomasiological Concept to term; as is the case for most terminological databases ISO 16642 (TMF – Terminological Markup Framework) Natural serialisation in ISO 30042 (TBX) – Semasiological Word to sense; as implemented in traditional dictionaries ISO 24613 (LMF – Lexical Markup Framework) Unclear serialisation landscape. We argue in (Romary, 2013) that the TEI just offers the background we need

5 TEI – a wealth of possibilities, reflecting messiness Orphan grammatical descriptors [corrected!] Orphan sense descriptors –,, etc. can occur outside a sense Multiple elements to provide the “same” information – E.g. vs. General issues – Free text can occur everywhere – Existence and usage of large-coverage TEI classes – E.g. (text | model.gLike | sense |model.entryPart.top | model.phrase | model.global)*

6 model.global model.global.edit [addSpan damageSpan delSpan gap space] model.global.meta [alt altGrp certainty fLib fs fvLib index interp interpGrp join joinGrp link linkGrp listTranspose precision respons span spanGrp substJoin timeline] model.global.spoken [incident kinesic pause shift vocal writing] model.milestoneLike [anchor cb fw gb lb milestone pb] model.noteLike – [note witDetail] figure metamark notatedMusic

7 Model - principles General modeling strategy from ISO/TC 37 (cf. Object Management Group) – Meta-model General, underlying model that informs current practice – Data categories Provide the elementary descriptors to instantiate models Possibly registered/standardised/reused from ISOcat Any serialization isomorphic with a given model is acceptable – “blind” (no schema, no documentation) interoperability requires sharing vocabularies within communities of services

8 LMF-TEI meta-model components (simplified) [Diagram: Lexicon, Lexical entry, Morphology, Form and Sense components, linked with 0..n and 1..1 cardinalities]

9 Main data categories [Diagram: Lexicon, Lexical entry, Morphology, Form and Sense components, linked with 0..n and 1..1 cardinalities, annotated with the following data categories] /orthography/ ( ) /pronunciation/ ( ) /hyphenization/ ( ) /syllabification/ ( ) /stress pattern/ ( ) /part of speech/ ( ) /inflexional class/ ( ) /gender/ ( ) /number/ ( ) /case/ ( ) /person/ ( ) /tense/ ( ) /mood/ ( ) /definition/ ( ) /example/ ( ) /usage/ ( ) /etymology/ ( )

10 Examples of constraints Forbid the usage of, (status of, to be determined) Systematic use of a grammatical container (gramGrp) for all grammatical features Limit the usage of to,,, Only allow semantic descriptors in (usage constraints), and (for contextualizing an example) … cf. Budin et al., 2012; Romary & Wegstein, 2012; Romary, 2013
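The gramGrp constraint is one that can be checked mechanically. The following is a minimal Python sketch, not the tooling used by the author. It assumes non-namespaced TEI, a hypothetical sample entry, and a hand-picked set of TEI grammatical elements (pos, gen, number, case, per, tns, mood, iType); the function name ungrouped_gram_features is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: TEI elements treated as grammatical features in this sketch.
GRAM_FEATURES = {"pos", "gen", "number", "case", "per", "tns", "mood", "iType"}

# Hypothetical entry: <pos> is wrapped in <gramGrp>, <gen> is not.
SAMPLE = """<entry>
  <form><orth>chat</orth><gen>m</gen></form>
  <gramGrp><pos>noun</pos></gramGrp>
</entry>"""


def ungrouped_gram_features(root):
    """Return (parent tag, feature tag) pairs for features outside a gramGrp."""
    violations = []
    for parent in root.iter():
        if parent.tag == "gramGrp":
            continue  # features directly inside gramGrp satisfy the constraint
        for child in parent:
            if child.tag in GRAM_FEATURES:
                violations.append((parent.tag, child.tag))
    return violations


violations = ungrouped_gram_features(ET.fromstring(SAMPLE))
print(violations)
```

An empty list would mean the entry satisfies this one constraint; the other constraints on the slide would need their own checks.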

11 Paths Baseline when using (XML) TEI documents: – XPath: entry/sense/usg[@type='geo'] Issues – Model-agnostic – Serialization-specific Model-based query language (component-data-category (CDC) path) – Pointing to explicit components and data categories: $lexicalEntry.$sense.geographicalUsage A CDC path can be checked for compatibility with the model We can consider the compiled set of all paths compatible with the model: the CDC graph Natural interface with DB/faceting environments such as Elasticsearch
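Checking a CDC path against a model can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the author's implementation: the MODEL dictionary is a hypothetical, heavily simplified rendering of the meta-model, and the path syntax ($component steps followed by an optional final data category) is inferred from the single example on the slide.

```python
# Hypothetical model: for each component, the sub-components and the
# data categories it may carry (simplified for illustration).
MODEL = {
    "lexicalEntry": {"components": {"form", "sense"}, "categories": set()},
    "form": {"components": set(), "categories": {"orthography", "pronunciation"}},
    "sense": {
        "components": {"sense"},  # senses may nest
        "categories": {"definition", "geographicalUsage", "subjectField"},
    },
}


def parse_cdc_path(path):
    """Split '$a.$b.category' into (['a', 'b'], 'category' or None)."""
    *head, last = path.split(".")
    if not all(step.startswith("$") for step in head):
        raise ValueError("only the last step may be a data category")
    components = [step[1:] for step in head]
    if last.startswith("$"):
        components.append(last[1:])
        return components, None
    return components, last


def is_compatible(path, model=MODEL, root="lexicalEntry"):
    """True if every step of the path is licensed by the model."""
    components, category = parse_cdc_path(path)
    if not components or components[0] != root:
        return False
    for parent, child in zip(components, components[1:]):
        if child not in model[parent]["components"]:
            return False
    if category is None:
        return True
    return category in model[components[-1]]["categories"]


print(is_compatible("$lexicalEntry.$sense.geographicalUsage"))
print(is_compatible("$lexicalEntry.$form.geographicalUsage"))
```

Because the check only consults the model, it is independent of how the data is serialized, which is the property the slide contrasts with plain XPath.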

12 Queries Retrieval of a specific entry given a constraint on the form – token to word-form mapping – $lexicalEntry.$form[orthography='chats'] Retrieval of a sense from an entry given additional constraints – $lexicalEntry.$sense*[subjectField='nautical'] Search for all entries having some specific form, grammatical or semantic properties, for instance the retrieval of all transitive verbs – … Extraction of all (or some) occurrences of a certain descriptor in a group of lexical entries, for instance all translated examples – …
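The first two queries can be approximated over a TEI serialization as follows. This is a hedged Python sketch: the sample data is invented, the TEI is non-namespaced for brevity, and the mappings orthography to orth and subjectField to usg with type="dom" are assumptions of this example, not mappings stated on the slides. The function names are likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical dictionary fragment (non-namespaced TEI for brevity).
SAMPLE = """<body>
  <entry n="chat">
    <form type="lemma"><orth>chat</orth></form>
    <form type="inflected"><orth>chats</orth></form>
    <sense><def>small domestic feline</def></sense>
  </entry>
  <entry n="pont">
    <form type="lemma"><orth>pont</orth></form>
    <sense><def>bridge</def></sense>
    <sense><usg type="dom">nautical</usg><def>deck of a ship</def></sense>
  </entry>
</body>"""


def entries_with_form(root, orthography):
    """Entries that have at least one form with the given spelling."""
    return [
        entry
        for entry in root.iter("entry")
        if any(orth.text == orthography for orth in entry.findall("form/orth"))
    ]


def senses_in_domain(root, domain):
    """Senses, at any nesting depth, marked with the given subject field."""
    return [
        sense
        for sense in root.iter("sense")
        if any(usg.text == domain for usg in sense.findall("usg[@type='dom']"))
    ]


root = ET.fromstring(SAMPLE)
matching_entries = [entry.get("n") for entry in entries_with_form(root, "chats")]
nautical_defs = [sense.findtext("def") for sense in senses_in_domain(root, "nautical")]
print(matching_entries, nautical_defs)
```

Each function hard-codes one serialization choice, so a different TEI encoding of the same model would need different code; a model-level path language is meant to hide that.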

13 Signatures Objective: characterizing the data as compliant with a given model (M) – Identification of queryable data (D) Principle – S_M: construction of a compiled graph of components and data categories allowed by a model (component-DC graph) – S_D: construction of the compiled graph of CDC paths from the data – S_D must be a subset of S_M
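A data signature can be approximated by walking each entry and recording the CDC path of every data category found. The Python sketch below rests on several assumptions: the element-to-component and element-to-category tables are hypothetical, the TEI is non-namespaced, unknown elements are treated as transparent containers, and S_M is written out by hand rather than compiled from a model.

```python
import xml.etree.ElementTree as ET

# Assumed mappings from TEI element names to model components and data categories.
COMPONENTS = {"form": "form", "sense": "sense"}
CATEGORIES = {"orth": "orthography", "pron": "pronunciation", "def": "definition"}

# Hypothetical model signature, written out by hand for this example.
S_M = {
    "$lexicalEntry.$form.orthography",
    "$lexicalEntry.$form.pronunciation",
    "$lexicalEntry.$sense.definition",
}

SAMPLE = """<body>
  <entry>
    <form><orth>chat</orth></form>
    <sense><def>small domestic feline</def></sense>
  </entry>
</body>"""


def data_signature(root):
    """Collect the set of CDC paths realized in the data (S_D)."""
    paths = set()

    def walk(element, prefix):
        for child in element:
            if child.tag in COMPONENTS:
                walk(child, prefix + ["$" + COMPONENTS[child.tag]])
            elif child.tag in CATEGORIES:
                paths.add(".".join(prefix + [CATEGORIES[child.tag]]))
            else:
                walk(child, prefix)  # unknown element: treated as transparent

    for entry in root.iter("entry"):
        walk(entry, ["$lexicalEntry"])
    return paths


S_D = data_signature(ET.fromstring(SAMPLE))
is_compliant = S_D <= S_M  # S_D must be a subset of S_M
print(sorted(S_D), is_compliant)
```

Treating unknown elements as transparent is a design shortcut; a stricter check would report them, since they are exactly where non-conforming data tends to hide.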

14 Silent data Scenario: querying multiple dictionaries of various types – e.g. presence of full-form lexica for which queries about do not apply Identifying all paths from the model which are not realized in the data – S_M - S_D

15 Noisy data TEI-encoded data that do not comply with LMF Checking process – Compiling all possible paths as a CDC graph – Comparison with the CDC paths allowed by the model Note that data can still be queried – Depending on the semantics, with lower recall and precision
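Once both signatures are available as sets of CDC paths, the silent and noisy cases reduce to set differences. The Python sketch below uses invented signatures; reading the noisy part as the paths found in the data but not licensed by the model is an interpretation of the comparison step, and the classify_query helper and its labels are hypothetical.

```python
# Hypothetical signatures: S_M from the model, S_D observed in one dictionary.
S_M = {
    "$lexicalEntry.$form.orthography",
    "$lexicalEntry.$form.pronunciation",
    "$lexicalEntry.$sense.definition",
    "$lexicalEntry.$sense.subjectField",
}
S_D = {
    "$lexicalEntry.$form.orthography",
    "$lexicalEntry.$sense.definition",
    "$lexicalEntry.definition",  # definition outside any sense
}

silent_paths = S_M - S_D  # allowed by the model, never realized in the data
noisy_paths = S_D - S_M   # realized in the data, not allowed by the model


def classify_query(path, s_m=S_M, s_d=S_D):
    """Say whether a query path can return results on this data set."""
    if path not in s_m:
        return "outside the model"
    if path not in s_d:
        return "silent"
    return "answerable"


print(sorted(silent_paths))
print(sorted(noisy_paths))
print(classify_query("$lexicalEntry.$form.pronunciation"))
```

A query classified as silent can be skipped for that dictionary without touching the data, which is useful when many heterogeneous dictionaries are queried together.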

16 What’s next The issue of querying language resources should be accompanied by an enforcement of models – Integration within a language resource query language agenda (bringing in semi-structured database specialists) Going blind? – Procedures for identifying compatibilities between queries and data Data quality check – Recommendations for DARIAH & CLARIN LMF additional part? – Not just a technical issue…

17 Trend: TEI reaching out to new communities – Bringing back existing communities of practice

