Published by Alijah Burtt. Modified over 9 years ago.
1
Data models and the (blind?) query of lexical resources
Laurent Romary, Inria & HUB
Work in progress
2
Argument
Querying language resources requires precise knowledge of the underlying representation model
Given the over-expressiveness of the TEI, we need complementary (possibly local, aka crystals) models to enforce reference constructs
The TEI dictionary is a good use case to start, given the existence of LMF as an underlying model
3
Querying semi-structured data
Characteristics
– Order (sequences or sets)
– Recursivity (depth-free?)
– Typing (local or structural)
– Schema-driven or not
Select-from-where and DB models
– Traditionally: work on the link between database models and the corresponding query structures (ERM)
– Recent proposals in the semi-structured document community (the Abiteboul school; cf. the work around XQuery; D. Florescu)
Importance of paths (set semantics)
Role of patterns (e.g. pairs, sequences, car/cdr, etc.) as typing mechanisms
Very few works about data with low conformance to a reference model
– Low-predictability data
Objective: limiting unpredictability in the lexical domain
4
Lexical data is a messy field
From full-form lexica for NLP to encyclopaedic dictionaries
Legacy unstructured/unpredictable data
– Available software (e.g. Shoebox) or scholarly traditions (the Multext format for full-form lexica)
Two core reference traditions/models/serialisations
– Onomasiological
  Concept to term, as is the case for most terminological databases
  ISO 16642 (TMF, Terminological Markup Framework)
  Natural serialisation in ISO 30042 (TBX)
– Semasiological
  Word to sense, as implemented in traditional dictionaries
  ISO 24613 (LMF, Lexical Markup Framework)
  Unclear serialisation landscape. We argue in (Romary, 2013) that the TEI offers just the background we need
5
TEI – a wealth of possibilities, reflecting messiness
Orphan grammatical descriptors [corrected!]
Orphan sense descriptors
– [element names missing], etc. can occur outside a sense
Multiple elements to provide the “same” information
– E.g. [element name missing] vs. [element name missing]
General issues
– Free text can occur everywhere
– Existence and usage of large-coverage TEI classes
– E.g. (text | model.gLike | sense | model.entryPart.top | model.phrase | model.global)*
6
model.global
model.global.edit [addSpan damageSpan delSpan gap space]
model.global.meta [alt altGrp certainty fLib fs fvLib index interp interpGrp join joinGrp link linkGrp listTranspose precision respons span spanGrp substJoin timeline]
model.global.spoken [incident kinesic pause shift vocal writing]
model.milestoneLike [anchor cb fw gb lb milestone pb]
model.noteLike [note witDetail]
figure metamark notatedMusic
7
Model - principles
General modeling strategy from ISO/TC 37 (cf. Object Management Group)
– Meta-model
  General, underlying model that informs current practice
– Data categories
  Provide the elementary descriptors to instantiate models
  Possibly registered/standardised/re-used from ISOcat
Any serialization isomorphic with a given model is acceptable
– “Blind” (no schema, no documentation) interoperability requires sharing vocabularies within communities of services
8
LMF-TEI meta-model components (simplified)
[Diagram: Lexicon, Lexical entry, Morphology, Form, Sense, linked with cardinalities 0..n and 1..1]
9
Main data categories
[Diagram: Lexicon, Lexical entry, Morphology, Form, Sense, linked with cardinalities 0..n and 1..1]
Data categories (corresponding TEI elements missing):
/orthography/, /pronunciation/, /hyphenization/, /syllabification/, /stress pattern/
/part of speech/, /inflexional class/, /gender/, /number/, /case/, /person/, /tense/, /mood/
/definition/, /example/, /usage/, /etymology/
10
Examples of constraints
Forbid the usage of [element names missing] (status of [element name missing] to be determined)
Systematic use of a grammatical container (gramGrp) for all grammatical features
Limit the usage of [element name missing] to [element names missing]
Only allow semantic descriptors in [element names missing] (usage constraints) and [element name missing] (for contextualizing an example)
…
cf. Budin et al., 2012; Romary & Wegstein, 2012; Romary, 2013
11
Paths
Baseline when using (XML) TEI documents:
– XPath: entry/sense/usg[@type='geo']
Issues
– Model-agnostic
– Serialization-specific
Model-based query language (component-data-category (CDC) path)
– Pointing to explicit components and data categories
  $lexicalEntry.$sense.geographicalUsage
A CDC path can be checked as being compatible with the model
We can consider the compiled set of all paths compatible with the model: the CDC graph
Natural interface with DB/faceting environments such as Elasticsearch
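A minimal sketch of the contrast drawn on this slide, in Python. It assumes a toy, namespace-free TEI-like fragment (real TEI documents use the TEI namespace) and a hypothetical lookup table, CDC_TO_TEI, that maps model-level names to one particular serialization. Neither the table nor the function cdc_to_xpath comes from the slides; they only illustrate how a model-level path could be compiled into a serialization-specific XPath.

```python
import xml.etree.ElementTree as ET

# Toy TEI-like fragment (assumption: no namespace, invented content).
TEI_SNIPPET = """
<body>
  <entry>
    <form><orth>biscuit</orth></form>
    <sense>
      <usg type="geo">UK</usg>
      <def>a small flat baked cake</def>
    </sense>
  </entry>
</body>
"""

# Hypothetical mapping from model-level steps to one TEI serialization.
CDC_TO_TEI = {
    "$lexicalEntry": "entry",
    "$sense": "sense",
    "geographicalUsage": "usg[@type='geo']",
}


def cdc_to_xpath(cdc_path):
    """Compile a dot-separated CDC path into a serialization-specific XPath."""
    return "/".join(CDC_TO_TEI[step] for step in cdc_path.split("."))


root = ET.fromstring(TEI_SNIPPET)
xpath = cdc_to_xpath("$lexicalEntry.$sense.geographicalUsage")
hits = [element.text for element in root.findall(xpath)]
print(xpath, hits)
```

With a different serialization of the same model, only the mapping table would change; the CDC path itself would stay the same.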
12
Queries
Retrieval of a specific entry considering constraints on the form
– Token to word-form mapping
– $lexicalEntry.$form[orthography='chats']
Retrieval of a sense from an entry given additional constraints
– $lexicalEntry.$sense*[subjectField='nautical']
Search for all entries having some specific form, grammatical or semantic properties, for instance the retrieval of all transitive verbs
– …
Extraction of all (or part of all) occurrences of a certain descriptor in a group of lexical entries, for instance all translated examples
– …
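The first two query types can be sketched in Python over a toy in-memory lexicon. Everything here is an assumption made for illustration: the dictionary layout, the sample entries, and the reading of the * after $sense as "senses at any nesting depth". The helper names entries_with_form and senses_matching are hypothetical.

```python
# Toy lexicon: each entry has a list of forms and a list of (possibly nested) senses.
lexicon = [
    {
        "form": [{"orthography": "chat"}, {"orthography": "chats"}],
        "sense": [{"definition": "domestic feline", "sense": []}],
    },
    {
        "form": [{"orthography": "bord"}],
        "sense": [
            {
                "definition": "edge",
                "sense": [
                    {
                        "definition": "side of a ship",
                        "subjectField": "nautical",
                        "sense": [],
                    }
                ],
            }
        ],
    },
]


def entries_with_form(entries, orthography):
    """Entries having at least one form with the given orthography."""
    return [
        entry
        for entry in entries
        if any(form.get("orthography") == orthography for form in entry["form"])
    ]


def senses_matching(senses, category, value):
    """Yield senses, at any nesting depth, whose data category equals value."""
    for sense in senses:
        if sense.get(category) == value:
            yield sense
        yield from senses_matching(sense.get("sense", []), category, value)


chats_entries = entries_with_form(lexicon, "chats")
nautical = [
    sense["definition"]
    for entry in lexicon
    for sense in senses_matching(entry["sense"], "subjectField", "nautical")
]
print(len(chats_entries), nautical)
```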
13
Signatures
Objective: characterizing the data as compliant with a given model (M)
– Identification of queryable data (D)
Principle
– S_M: construction of a compiled graph of components and data categories allowed by a model (component-DC graph)
– S_D: construction of the compiled graph of CDC paths from the data
– S_D must be a subset of S_M
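One possible way to make the two signatures concrete, sketched in Python. The model below is a deliberately small, non-recursive assumption (which data categories attach to which component is not stated on the slides), and model_signature and data_signature are hypothetical helper names. Both signatures are represented as plain sets of dot-separated CDC paths, so the compliance test reduces to a subset check.

```python
# Assumed toy model: component -> sub-components and data categories.
MODEL = {
    "lexicalEntry": {"components": ["form", "sense"], "categories": []},
    "form": {"components": [], "categories": ["orthography", "pronunciation"]},
    "sense": {"components": [], "categories": ["definition", "example", "usage"]},
}


def model_signature(model, root):
    """S_M: every CDC path the (non-recursive) model allows."""
    paths = set()

    def walk(component, prefix):
        here = prefix + ("$" + component,)
        paths.add(".".join(here))
        for category in model[component]["categories"]:
            paths.add(".".join(here + (category,)))
        for child in model[component]["components"]:
            walk(child, here)

    walk(root, ())
    return paths


def data_signature(node, component, prefix=()):
    """S_D: every CDC path realized in one data node.

    Lists are read as sub-components, other values as data categories.
    """
    here = prefix + ("$" + component,)
    paths = {".".join(here)}
    for key, value in node.items():
        if isinstance(value, list):
            for child in value:
                paths |= data_signature(child, key, here)
        else:
            paths.add(".".join(here + (key,)))
    return paths


S_M = model_signature(MODEL, "lexicalEntry")
entry = {
    "form": [{"orthography": "chats"}],
    "sense": [{"definition": "plural of chat"}],
}
S_D = data_signature(entry, "lexicalEntry")
print(S_D <= S_M)
```

An entry carrying a definition directly under the entry, outside any sense, would produce the path $lexicalEntry.definition, which is not in S_M, so the subset check would fail for it.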
14
Silent data
Scenario: querying multiple dictionaries of various types
– E.g. presence of full-form lexica for which queries about [element name missing] do not apply
Identifying all paths from the model which are not realized in the data
– S_M - S_D
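A short Python sketch of the silent-data scenario, with signatures written directly as sets of path strings. The paths, the two resource names, and the helpers silent_paths and resources_for are all assumptions chosen for illustration; in particular, attaching part of speech to the form component is a simplification.

```python
# Assumed model signature (S_M) as a set of CDC paths.
S_M = {
    "$lexicalEntry.$form.orthography",
    "$lexicalEntry.$form.partOfSpeech",
    "$lexicalEntry.$sense.definition",
    "$lexicalEntry.$sense.example",
}

# Assumed data signatures (S_D) for two resources of different types.
SIGNATURES = {
    "full_form_lexicon": {
        "$lexicalEntry.$form.orthography",
        "$lexicalEntry.$form.partOfSpeech",
    },
    "dictionary": {
        "$lexicalEntry.$form.orthography",
        "$lexicalEntry.$sense.definition",
    },
}


def silent_paths(s_m, s_d):
    """Paths allowed by the model but never realized in the data (S_M - S_D)."""
    return s_m - s_d


def resources_for(query_path, signatures):
    """Names of the resources in which a query path is actually realized."""
    return sorted(name for name, s_d in signatures.items() if query_path in s_d)


silent = silent_paths(S_M, SIGNATURES["full_form_lexicon"])
targets = resources_for("$lexicalEntry.$sense.definition", SIGNATURES)
print(sorted(silent), targets)
```

Knowing the silent paths in advance lets a query front end skip resources that cannot answer, instead of returning empty results that look like negative evidence.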
15
Noisy data
TEI-encoded data which do not fulfill LMF compliance
Checking process
– Compiling all possible paths as a CDC graph
– Comparison with possible CDC paths allowed by the model
Note that data can still be queried
– Depending on the semantics, with lower recall and precision
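The checking process can be approximated in Python directly on the XML. In this sketch, element paths stand in for CDC paths, the set ALLOWED is an assumed whitelist rather than anything derived from LMF, and the sample entry (with a part of speech placed outside a gramGrp container) is invented to show one orphan grammatical descriptor.

```python
import xml.etree.ElementTree as ET

# Assumed whitelist of element paths, standing in for the model's CDC paths.
ALLOWED = {
    "entry",
    "entry/form",
    "entry/form/orth",
    "entry/gramGrp",
    "entry/gramGrp/pos",
    "entry/sense",
    "entry/sense/def",
    "entry/sense/usg",
}

# Invented entry: <pos> appears directly under <entry>, outside <gramGrp>.
DATA = """<entry>
  <form><orth>chat</orth></form>
  <pos>n</pos>
  <sense><def>domestic feline</def></sense>
</entry>"""


def element_paths(element, prefix=""):
    """Compile the set of all element paths found under one element."""
    path = f"{prefix}/{element.tag}" if prefix else element.tag
    paths = {path}
    for child in element:
        paths |= element_paths(child, path)
    return paths


root = ET.fromstring(DATA)
S_D = element_paths(root)
noisy = S_D - ALLOWED

# The data can still be queried, but a model-driven query misses the orphan.
model_query_hits = root.findall("gramGrp/pos")
loose_query_hits = root.findall(".//pos")
print(sorted(noisy), len(model_query_hits), len(loose_query_hits))
```

The model-driven path finds nothing while the looser search finds the part of speech, which is one concrete way noisy data lowers recall.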
16
What’s next
The issue of querying language resources should be accompanied by an enforcement of models
– Integration within a language resource query language agenda (bringing in semi-structured database specialists)
Going blind?
– Procedures for identifying compatibilities between queries and data
Data quality check
– Recommendations for DARIAH & CLARIN
LMF additional part?
– Not just a technical issue…
17
Trend: TEI reaching out to new communities
– Bringing back existing communities of practice