Models and standards for onomasiological and semasiological lexical data Laurent Romary Inria & BBAW & DARIAH COST ENEL meeting 31 March 2016.



Lexical databases – a variety of objects
Lexical data as part of a wider landscape of language resources
First level of abstraction in linguistic analysis
- Psycholinguistic, field linguistics, computational linguistics
Input to language technology processes
Wider interest from language learners and general public
A huge amount of legacy information
Proprietary formats
- E.g. dictionary publishers
Proprietary tools
- E.g. Shoebox
From highly narrative to deeply structured content
… open an MS Word document and start typing in your dictionary entry (sorry, just kidding)
Are there coherent principles in the representation of lexical data?
Can we treat electronic dictionaries and lexical databases in a uniform manner?

Various types of applications
Linguistic
- Precise description of linguistic information
Natural language processing
- Optical Character Recognition, spell checkers, information extraction
“Traditional” dictionary projects
- Publishing industry, large scale dictionary projects (e.g. DWDS)
Translation domain, technical writing
- Terminological databases

Why standardize all this?
Defining methods or models to facilitate:
Exchange of lexical data
Pooling heterogeneous lexical data
Interoperability between software components
- Search engines, layout, extraction of linguistic properties
Comparability of results
- E.g. linguistic coverage of lexical databases

Standardization initiatives for lexical/terminological resources
TEI
Initiated in 1987, driving force behind XML creation
P5 edition of the guidelines
- Cf. specification platform (ODD)
- Dictionary chapter
- Former terminology chapter
ISO
ISO/TC 37: Terminology and language resources
- ISO/TC 37/SC 2: ISO 639 series (language codes)
- ISO/TC 37/SC 3: ISO … (Terminology), ISO 12620 (Data categories), ISO 30042 (TBX)
- ISO/TC 37/SC 4: Language resource management (2002), ISO … (LMF)
W3C
SKOS, Ontolex

Standards, standards, everywhere!
ISO TC 37 SC4: 17 published standards since 2002
TEI – Text Encoding Initiative
MAF, SynAF, ISO-TimeML, LMF, Feature structures, Transcription of speech, Stand-off annotation, Dictionary chapter, TMF, TBX, Terminology chapter, ODD, MLIF, …

Lexical structures at a glance
Observing the data: Various forms of lexical structures
Basic distinctions: Onoma- and semasiological structures
Onomasiological forms: TMF and TBX
Semasiological forms: LMF and TEI
Common concept: Data categories

Comparing approaches
Semasiological approach
- Large coverage
- All parts of speech
- Built-in polysemy
- Multiple senses for the same entry
- Referential synonymy
Onomasiological approach
- Domain oriented
- Essentially nouns
- Extension to verbs, adjectives
- No polysemy (needs to be reconstructed)
- Built-in synonymy
- Multiple terms for the same concept
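The contrast between the two approaches can be illustrated with a small Python sketch. It is only an illustration, not part of the talk: the words, the concept identifiers (c1, c2) and the function name to_onomasiological are hypothetical. A word-to-concepts index has polysemy built in, and inverting it yields a concept-to-terms index where synonymy is built in and polysemy has to be reconstructed.

```python
from collections import defaultdict

# Semasiological view: each word lists the concepts (senses) it can express.
# Words and concept identifiers are hypothetical, chosen only for illustration.
word_to_concepts = {
    "V-belt": ["c1"],
    "wedge belt": ["c1"],
    "belt": ["c1", "c2"],  # polysemous: drive belt (c1) and clothing belt (c2)
}


def to_onomasiological(word_index):
    """Invert a word -> concepts index into a concept -> terms index."""
    concept_index = defaultdict(list)
    for word, concepts in word_index.items():
        for concept in concepts:
            concept_index[concept].append(word)
    return dict(concept_index)


# Onomasiological view: each concept lists all terms that designate it.
concept_to_terms = to_onomasiological(word_to_concepts)
```

In the inverted index, the three terms for c1 sit side by side as synonyms, while the fact that "belt" is polysemous is only visible by noticing that it occurs under two concepts.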

Onomasiological data (concept to term)

Standards for the digital representation of terminologies
ISO 6156:1987 (Mater): format for representing terminological information on magnetic tapes; followed by an adaptation for microcomputers (MicroMater; see Melby, 1991)
Chapter in the TEI guidelines: SGML-based representation; remained there until the P4 edition
ISO … (Martif), published in 1999; improves the TEI proposal
- Strongly inspired by the TEI (e.g. the header-text organisation; entries embedded within a … and … hierarchy)
- Reaching out to the translation and localisation industry
ISO 12620:1999: set of reference descriptors (or data categories)
ISO 16642:2003 (TMF): Terminological Markup Framework
TBX (TermBase eXchange): published in 2007 by LISA (Localisation Industry Standards Association) as a successor to Martif
TBX: ISO standard in 2008

Building up a terminological model (TMF)
Terminological entry
  Language section
    Term section
  Language section
    Term section

Building up a terminological model
Terminological entry
  Language section
    Term section
  Language section
    Term section
Data categories: subjectField, definition + source, note, term, source

TBX serialisation (ISO 30042)
Industrie mécanique
endloser Riemen mit trapezförmigem Querschnitt, der auf zwei Riemenscheiben mit Eindrehungen läuft
De Coster, Wörterbuch, Kraftfahrzeugtechnik, SAUR, München, 1982
wird zum Antrieb der Lichtmaschine, des Ventilators und der Wasserpumpe benutzt
Keilriemen
De Coster, …
courroie trapézoïdale
…
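The markup of this example did not survive in the transcript, so the following Python sketch shows one plausible TBX-style encoding of the entry, not the one on the slide. The element names (termEntry, langSet, tig, term, descrip, descripGrp, admin, note) are TBX elements; the entry id "c1", the placement of the note at language level and the helper terms_by_language are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Attribute key that ElementTree uses for xml:lang.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Assumed reconstruction: terminological entry > language section > term section.
TBX_ENTRY = """
<termEntry id="c1">
  <descrip type="subjectField">Industrie mécanique</descrip>
  <langSet xml:lang="de">
    <descripGrp>
      <descrip type="definition">endloser Riemen mit trapezförmigem Querschnitt, der auf zwei Riemenscheiben mit Eindrehungen läuft</descrip>
      <admin type="source">De Coster, Wörterbuch, Kraftfahrzeugtechnik, SAUR, München, 1982</admin>
    </descripGrp>
    <note>wird zum Antrieb der Lichtmaschine, des Ventilators und der Wasserpumpe benutzt</note>
    <tig>
      <term>Keilriemen</term>
      <admin type="source">De Coster, …</admin>
    </tig>
  </langSet>
  <langSet xml:lang="fr">
    <tig>
      <term>courroie trapézoïdale</term>
    </tig>
  </langSet>
</termEntry>
"""


def terms_by_language(term_entry_xml):
    """Return {language code: [terms]} for one terminological entry."""
    entry = ET.fromstring(term_entry_xml)
    result = {}
    for lang_set in entry.findall("langSet"):
        lang = lang_set.get(XML_LANG)
        result[lang] = [term.text for term in lang_set.findall("tig/term")]
    return result


terms = terms_by_language(TBX_ENTRY)
subject_field = ET.fromstring(TBX_ENTRY).findtext("descrip[@type='subjectField']")
```

The three nesting levels of the XML mirror the three levels of the TMF model: concept-level information sits directly under termEntry, language-level information under langSet, and term-level information under tig.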

Semasiological data (word to sense)

LMF as an ISO project
Summer 2003: new work item proposal (US delegation)
Fall 2003: technical proposal (FR) for a data model dedicated to NLP lexica
ISO Convenor:
- Nicoletta Calzolari (IT)
Editors:
- Gil Francopoulo (FR), Monte George (US)
13 versions written, dispatched (to the experts nominated by the national delegations), commented and discussed in various ISO technical meetings
IS (= published standard) in Oct.

LMF core package

SERIALIZING LMF USING THE TEI

TEI and “dictionaries”
The TEI Print Dictionary (PD) chapter
– Initially designed by N. Ide and J. Veronis
– Accounts for both presentational and editorial (“content”) issues (cf. …, …, … and …)
– Based on a hierarchical abstract model (crystals)
…: for characterising the orthographic or phonetic form of the word
…, …, etc.: grammatical features
– May characterize an entry, a specific form or a specific sense
…, …, generic feature …: iterative and recursive
– May contain definitions, examples, etymological information, translations, etc.
Main characteristic (drawback?): +very+ flexible

Prototypical entry in TEI
table
n. f.
Pièce de mobilier…
Une table de cuisine
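Since the tags of this entry were lost in the transcript, here is a hedged sketch of how such an entry is commonly encoded with elements from the TEI dictionary chapter (entry, form, orth, gramGrp, pos, gen, sense, def, cit, quote). The exact markup of the slide is not known; the TEI namespace is omitted for brevity, and the helper summarise is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed encoding of the "table" entry; the TEI namespace is left out.
TEI_ENTRY = """
<entry>
  <form><orth>table</orth></form>
  <gramGrp><pos>n.</pos><gen>f.</gen></gramGrp>
  <sense>
    <def>Pièce de mobilier…</def>
    <cit type="example"><quote>Une table de cuisine</quote></cit>
  </sense>
</entry>
"""


def summarise(entry_xml):
    """Extract the form, grammatical features, definitions and examples."""
    entry = ET.fromstring(entry_xml)
    return {
        "orth": entry.findtext("form/orth"),
        "pos": entry.findtext("gramGrp/pos"),
        "gen": entry.findtext("gramGrp/gen"),
        "definitions": [d.text for d in entry.findall("sense/def")],
        "examples": [
            q.text for q in entry.findall("sense/cit[@type='example']/quote")
        ],
    }


summary = summarise(TEI_ENTRY)
```

The form, the grammatical group and the sense are siblings under the entry, which is the hierarchical pattern the dictionary chapter relies on.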

Why is the TEI a good idea for serialising LMF?
Basic structure already defined
Provision of additional tags
– Surface annotation (e.g. names, dates, abbreviations, alternatives)
– Cf. equivalences to ISOcat when needed
Integration of lexical data in a textual macro-structure
– Creating an edited version of a lexicon
– Grammar books, teaching material, scientific papers
Interoperability with other lexical sources
– Community of users: sharing a common culture of TEI tags rather than constantly worrying about mappings
– Sharing tools: e.g. stylesheets, editors, etc. (cf. Roma)
– Note: continuity between dictionary and lexical sources

Identifying the meta-model components
Diagram components: Lexicon, Lexical entry, Morphology, Form, Sense (linked with cardinalities 0..n and 1..1)

Mapping data categories
Diagram components: Lexicon, Lexical entry, Morphology, Form, Sense (linked with cardinalities 0..n and 1..1)
/orthography/, /pronunciation/, /hyphenization/, /syllabification/, /stress pattern/
/part of speech/, /inflexional class/, /gender/, /number/, /case/, /person/, /tense/, /mood/
/definition/, /example/, /usage/, /etymology/
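The element names paired with each data category were lost in the transcript. As a sketch under that caveat, the table below pairs the data categories with elements that exist in the TEI dictionary chapter; both the pairing and the choice of container (form, gramGrp, sense) are assumptions, and build_entry is a hypothetical helper. /example/ is left out because it would need a cit element with a nested quote rather than a single element.

```python
import xml.etree.ElementTree as ET

# Assumed mapping: data category -> (TEI container, TEI element).
TEI_ELEMENT = {
    "/orthography/": ("form", "orth"),
    "/pronunciation/": ("form", "pron"),
    "/hyphenization/": ("form", "hyph"),
    "/syllabification/": ("form", "syll"),
    "/stress pattern/": ("form", "stress"),
    "/part of speech/": ("gramGrp", "pos"),
    "/inflexional class/": ("gramGrp", "iType"),
    "/gender/": ("gramGrp", "gen"),
    "/number/": ("gramGrp", "number"),
    "/case/": ("gramGrp", "case"),
    "/person/": ("gramGrp", "per"),
    "/tense/": ("gramGrp", "tns"),
    "/mood/": ("gramGrp", "mood"),
    "/definition/": ("sense", "def"),
    "/usage/": ("sense", "usg"),
    "/etymology/": ("sense", "etym"),
}


def build_entry(entry_features, senses):
    """Build an <entry> from entry-level features and a list of sense features."""
    entry = ET.Element("entry")
    containers = {}
    for category, value in entry_features.items():
        container, element = TEI_ELEMENT[category]
        parent = containers.get(container)
        if parent is None:
            # Create each container (form, gramGrp) once, on first use.
            parent = containers[container] = ET.SubElement(entry, container)
        ET.SubElement(parent, element).text = value
    for sense_features in senses:
        sense = ET.SubElement(entry, "sense")
        for category, value in sense_features.items():
            _, element = TEI_ELEMENT[category]
            ET.SubElement(sense, element).text = value
    return entry


demigod = build_entry(
    {"/orthography/": "demigod", "/part of speech/": "n"},
    [
        {"/definition/": "a being who is part mortal, part god"},
        {"/definition/": "a lesser deity"},
        {"/definition/": "a godlike person"},
    ],
)
xml_text = ET.tostring(demigod, encoding="unicode")
```

Keeping the mapping in a single table means that a change of serialisation touches only the table, while the data categories stay the stable reference point.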

Construing a TEI dictionary entry
demigod ... n
- a being who is part mortal, part god
- a lesser deity
- a godlike person

Issues and prospects
Stabilizing good practices in TEI encoding
- Identifying a catalogue of reference constructs for various lexicographic phenomena
- E.g. Etymology in TEI: Bowers & Romary, in progress
More convergence between TEI and ISO
- Revision of … as a multi-part standard
- More room for enlarging the coverage of LMF serialization by means of the TEI
- Towards a stable TBX extension to the TEI
Data delivery in flat models for the semantic web
- No manual production
- TBX2SKOS
- TEI2Ontolex
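To give an idea of what an automatic TBX2SKOS delivery might look like, here is a minimal Python sketch that flattens one concept entry into SKOS statements in N-Triples syntax. It is not the conversion referred to in the talk: the base URI, the function concept_to_skos and the rule "first term per language becomes skos:prefLabel, the others skos:altLabel" are assumptions. Only the SKOS and RDF namespace URIs are standard.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
BASE = "http://example.org/concept/"  # hypothetical base URI


def concept_to_skos(concept_id, terms_by_lang):
    """Flatten one concept entry into a list of N-Triples lines.

    Assumption: the first term of each language is the preferred label.
    Terms are assumed to contain no double quotes or backslashes.
    """
    subject = f"<{BASE}{concept_id}>"
    lines = [f"{subject} <{RDF_TYPE}> <{SKOS}Concept> ."]
    for lang, terms in terms_by_lang.items():
        for position, term in enumerate(terms):
            prop = "prefLabel" if position == 0 else "altLabel"
            lines.append(f'{subject} <{SKOS}{prop}> "{term}"@{lang} .')
    return lines


triples = concept_to_skos(
    "c1",
    {"de": ["Keilriemen"], "fr": ["courroie trapézoïdale"]},
)
```

The nested entry, language and term levels collapse into statements about a single concept, which is what makes the model flat.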

Thank you for your attention
Merci de votre attention
