Using the TEI framework as a possible serialization for LMF Laurent Romary INRIA & HUB-IDSL

Executive summary
– Issue: identifying an “appropriate” serialization for LMF
  – Serialization: mapping a lexical model onto a concrete (computer) representation (field based, XML, etc.)
  – Wide consensus, maintenance, flexibility, cohesion with other standardization activities
– The TEI is an ideal basis for defining standardized XML formats for lexical data
  – TEI as an infrastructure
    – Customization facilities: ODD, classes, pointing mechanisms, etc.
  – TEI as a reference vocabulary
    – Print dictionary chapter (PD)
  – TEI as an application of LMF
– Workplan proposal
  – Defining the ideal LMF compliant subset of the TEI PD chapter
  – Suggesting extensions to the PD chapter
– Convergence…
  – Contribution to making ISO and TEI work closer together

The TEI at a glance
– Started in 1987
– Organized as a consortium: 5 hosts, board, council
– Edition P5 of the guidelines: more than 500 elements covering various text genres and structures
  – Genericity: header, text structure, pointing mechanisms, paragraph level elements, surface entities
  – Precise and flexible documentation
  – Maintenance: 2 releases per year
– Wide community of users: default format for most text-based projects worldwide
  – Cf. papers from Przepiorkowski, or Erjavec at LREC
– And yes, it is XML based… e.g. preconfigured in Oxygen

Intermezzo: an XML tutorial
– XML is about awful angle brackets (serialization)
– XML is about beautiful trees (model)
– Issues
  – Specifying structures
  – Providing semantics

Basic concepts of the TEI technical platform
– A specification language: ODD (One Document Does it all)
  – Literate programming (Knuth)
  – Generation of both schemas and documentation
    – DTD, RelaxNG, W3C schemas
    – HTML, pdf, ePub, docx
  – Provides extended customization facilities
  – Natural link with ISOCat
– Modules
  – Each schema specification is a combination of internal or external modules
    – E.g. ISO-TEI Feature-Structure module
– Classes (shared behaviours or semantics)
  – Model classes
  – Attribute classes
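
As a rough illustration of how an ODD specification combines modules, a customization could look like the sketch below. The identifier `lmf-tei-sketch` is a hypothetical name, not something taken from the slides; the module keys are the usual TEI P5 module names.

```xml
<!-- Sketch only: a minimal ODD schema specification that combines TEI modules. -->
<!-- The ident "lmf-tei-sketch" is a made-up name used for illustration. -->
<schemaSpec xmlns="http://www.tei-c.org/ns/1.0" ident="lmf-tei-sketch" start="TEI">
  <!-- Infrastructure modules that a TEI schema normally needs -->
  <moduleRef key="tei"/>
  <moduleRef key="header"/>
  <moduleRef key="core"/>
  <moduleRef key="textstructure"/>
  <!-- The dictionaries module supplies entry, form, gramGrp, sense, etc. -->
  <moduleRef key="dictionaries"/>
</schemaSpec>
```

An ODD processor such as Roma would take a specification of this kind and generate both a schema and the matching documentation.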

From ODD to documentation

TEI and “dictionaries”
– The TEI Print Dictionary (PD) chapter
  – Initially designed by N. Ide and J. Veronis
  – Accounts for both presentational and editorial (“content”) issues
    – Cf. [element names missing from the transcript]
  – Based on a hierarchical abstract model (crystals)
    – <form>: for characterising the orthographic or phonetic form of the word
      – [sub-element names missing from the transcript], etc.
    – <gramGrp>: grammatical features
      – May characterize an entry, a specific form or a specific sense
      – [sub-element names missing from the transcript], generic feature
    – <sense>: iterative and recursive
      – May contain definitions, examples, etymological information, translations, etc.
– Main characteristic (drawback?): very flexible

Examples
– 一乘顯性教: One of the five divisions made by 圭峰 Guifeng of the Huayan 華嚴 or Avataṃsaka School; v. 五教.
– 眾生不可思議 術語 四事不可思議之一。見不可思議條。 不可思議
Source: thanks to Marcus Bingenheimer, http://buddhistinformatics.ddbc.edu.tw/glossaries/

Examples – cont.
– chat: noun, masculine
  – chat: singular
  – chats: plural
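
The markup of this example did not survive in the transcript. The following is a minimal sketch of one way the "chat" data could be encoded with the TEI dictionary elements. It is not the slide's original encoding, and the `type` values `lemma` and `inflected` are assumptions chosen for illustration.

```xml
<!-- Sketch only: one possible TEI encoding of the "chat" example. -->
<entry xmlns="http://www.tei-c.org/ns/1.0">
  <!-- Lemma form; type="lemma" is an assumed convention -->
  <form type="lemma">
    <orth>chat</orth>
  </form>
  <!-- Grammatical information that holds for the whole entry -->
  <gramGrp>
    <pos>noun</pos>
    <gen>masculine</gen>
  </gramGrp>
  <!-- Inflected forms, each carrying its own grammatical information -->
  <form type="inflected">
    <orth>chat</orth>
    <gramGrp>
      <number>singular</number>
    </gramGrp>
  </form>
  <form type="inflected">
    <orth>chats</orth>
    <gramGrp>
      <number>plural</number>
    </gramGrp>
  </form>
</entry>
```

Here the grammatical information appears both at entry level and inside a specific form.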

Customizing an entry
table n. f. Pièce de mobilier… Une table de cuisine
– Selecting content [element names missing from the transcript]
– Constraining content: f., f, fem, feminin, feminine,…
– Adding content, e.g.: [example missing from the transcript]
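
To make the "selecting" and "constraining" steps more concrete, here is a hedged ODD sketch. The list of retained elements and the set of gender values are hypothetical, since the slide's own lists are missing from the transcript. The sketch also assumes an ODD processor that accepts a `valList` inside `content`; older toolchains may need a RELAX NG choice or a Schematron rule instead.

```xml
<!-- Sketch only: selecting and constraining dictionary content in ODD. -->
<!-- The ident, the element subset and the value list are all assumptions. -->
<schemaSpec xmlns="http://www.tei-c.org/ns/1.0" ident="dict-subset-sketch" start="TEI">
  <moduleRef key="tei"/>
  <moduleRef key="header"/>
  <moduleRef key="core"/>
  <moduleRef key="textstructure"/>
  <!-- Selecting content: keep only a handful of dictionary elements -->
  <moduleRef key="dictionaries"
             include="entry form orth gramGrp pos gen sense def"/>
  <!-- Constraining content: one spelling per gender value instead of
       f., f, fem, feminin, feminine, ... -->
  <elementSpec ident="gen" module="dictionaries" mode="change">
    <content>
      <valList type="closed">
        <valItem ident="masculine"/>
        <valItem ident="feminine"/>
        <valItem ident="neuter"/>
      </valList>
    </content>
  </elementSpec>
</schemaSpec>
```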

Illustrating classes: tei.gramInfo
– Grammatical information in a dictionary entry
  – E.g.: luire verb intransitive
– Rather homogeneous set of elements [element names missing from the transcript], etc.
– May also appear in [element name missing from the transcript]

Overall picture tei.gramInfo

Declaring the class: tei.gramInfo
– grammatical information: groups those elements allowed within a gramGrp element in a dictionary.

<pos> belongs to tei.gramInfo
– part of speech: indicates the part of speech assigned to a dictionary headword (noun, verb, adjective, etc.) …
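
The ODD source behind these two slides was lost in the transcript. The sketch below shows roughly how a class declaration and an element's membership in it are written, reusing the slide's class name `tei.gramInfo` and its glosses. The wrapper `specGrp` and its `xml:id` are assumptions added so that the fragment is a single well-formed unit. Current TEI P5 releases use different class names (to my knowledge `model.gramPart` plays this role).

```xml
<!-- Sketch only: class declaration and class membership in ODD. -->
<specGrp xmlns="http://www.tei-c.org/ns/1.0" xml:id="gramInfo-sketch">
  <!-- Declaring the class -->
  <classSpec ident="tei.gramInfo" type="model">
    <gloss>grammatical information</gloss>
    <desc>groups those elements allowed within a gramGrp element
      in a dictionary.</desc>
  </classSpec>
  <!-- Declaring an element as a member of the class -->
  <elementSpec ident="pos" module="dictionaries">
    <gloss>part of speech</gloss>
    <desc>indicates the part of speech assigned to a dictionary
      headword (noun, verb, adjective, etc.)</desc>
    <classes>
      <memberOf key="tei.gramInfo"/>
    </classes>
  </elementSpec>
</specGrp>
```

Any element that declares the same `memberOf` key becomes available wherever the class is referenced, which is what makes classes a convenient extension point.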

Content model for grammatical information group <rng:zeroOrMore xmlns:rng=" …
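
The content model on this slide is truncated, so its exact form cannot be recovered. As a hedged guess at its general shape, an ODD content model that defers to the class could be written with embedded RELAX NG as follows. The choice between text and class members is an assumption, not the slide's content.

```xml
<!-- Sketch only: a gramGrp content model that refers to the class. -->
<elementSpec xmlns="http://www.tei-c.org/ns/1.0"
             ident="gramGrp" module="dictionaries">
  <content>
    <rng:zeroOrMore xmlns:rng="http://relaxng.org/ns/structure/1.0">
      <rng:choice>
        <!-- free text between grammatical elements -->
        <rng:text/>
        <!-- any member of the grammatical information class -->
        <rng:ref name="tei.gramInfo"/>
      </rng:choice>
    </rng:zeroOrMore>
  </content>
</elementSpec>
```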

LMF at a glance
– LMF: Lexical Markup Framework
  – ISO standard (published Oct. 2008)
  – Edited within ISO committee TC 37/SC 4
– Technical content
  – Focus on providing a core meta-model with extensions
  – Potentially agnostic with regard to serialisation
    – Isomorphism => interoperability
  – Default syntax to exemplify its possible use, room for improvement…
– Can the TEI be seen as a conformant implementation of LMF?

LMF architecture: playing Lego
[Diagram: the LMF core components Lexical DB, Global Info, Lexical Entry, Form and Sense, linked with 1..1 and 0..n cardinalities, plus lexical extensions. The lexical extension for morphology attaches Morphology to Lexical Entry, with Paradigm (0..1) and Flexion (0..n).]

Example: designing a full-form lexicon
[Diagram: Lexical DB with Global Info (1..1) and Entry (0..n); each Entry has a Morphology component (1..1) with Paradigm and Inflexion (0..n).]

Decorating the model
[Diagram: the full-form lexicon model (Lexical DB, Global Info, Entry, Morphology, Paradigm, Inflexion) decorated with data categories: /lemma/, /part of speech/, /word form/, /gender/, /number/, /tense/, … and /paradigmId/ …]

Why is the TEI a good idea for serialising LMF?
– Basic structure already defined
– Provision of additional tags
  – Surface annotation (e.g. names, dates, abbreviations, alternatives)
  – Cf. equivalences to ISOCat when needed
– Integration of lexical data in a textual macro-structure
  – Creating an edited version of a lexicon
  – Grammar books, teaching material, scientific papers
– Interoperability with other lexical sources
  – Community of users: sharing a common culture of TEI tags rather than constantly worrying about mappings
  – Sharing tools: e.g. stylesheets, editors, etc. (cf. Roma)
  – Note: continuity between dictionary and lexical sources

A typical entry
demigod ... (gramGrp: n)
– a being who is part mortal, part god
– a lesser deity
– a godlike person
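
Only fragments of this entry's markup survive in the transcript. A minimal sketch of how it might be encoded in TEI is given below. The sense numbering through `n` is an assumption, and whatever the slide elides after the headword ("...") is deliberately left out.

```xml
<!-- Sketch only: a possible TEI encoding of the "demigod" entry. -->
<entry xmlns="http://www.tei-c.org/ns/1.0">
  <form>
    <orth>demigod</orth>
    <!-- the material elided on the slide ("...") is not reconstructed here -->
  </form>
  <gramGrp>
    <pos>n</pos>
  </gramGrp>
  <!-- one sense element per definition; the n values are assumed -->
  <sense n="1">
    <def>a being who is part mortal, part god</def>
  </sense>
  <sense n="2">
    <def>a lesser deity</def>
  </sense>
  <sense n="3">
    <def>a godlike person</def>
  </sense>
</entry>
```

Read against the LMF meta-model, `entry` plays the role of Lexical Entry, `form` that of Form, and each `sense` that of Sense.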

Identifying the meta-model components
[Diagram: Lexical DB with Global Info (1..1) and Lexical entry (0..n); Lexical entry with Morphology, Form and Sense components, linked with 1..1 and 0..n cardinalities.]

Data categories
[Diagram: the same meta-model (Lexical DB, Global Info, Lexical entry, Morphology, Form, Sense) annotated with data categories. Each data category was paired with a TEI element whose name is missing from the transcript.]
– /orthography/, /pronunciation/, /hyphenization/, /syllabification/, /stress pattern/
– /part of speech/, /inflexional class/, /gender/, /number/, /case/, /person/, /tense/, /mood/
– /definition/, /example/, /usage/, /etymology/

Customizing the TEI lexical model
– Constraining the TEI model
  – Sub-setting the TEI default dictionary module
  – Providing additional rules (XSLT, Schematron)
  – Constraining possible values
– Complementing the TEI model
  – Defining additional data categories
  – Defining missing LMF extensions as TEI components
    – Make use of the class mechanisms
    – A natural implementation of the LMF extension mechanisms
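
The "additional rules" mentioned above could, for instance, be expressed as a small Schematron schema layered on top of the TEI schema. The sketch below is only an illustration: the pattern id, the required-form rule and the list of part-of-speech values are hypothetical, and the value test assumes an XPath 2.0 query binding.

```xml
<!-- Sketch only: extra constraints on TEI dictionary entries in Schematron. -->
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
  <sch:ns prefix="tei" uri="http://www.tei-c.org/ns/1.0"/>
  <sch:pattern id="lmf-core-sketch">
    <!-- Structural rule: every entry carries a form component -->
    <sch:rule context="tei:entry">
      <sch:assert test="tei:form">
        An entry should contain at least one form element.
      </sch:assert>
    </sch:rule>
    <!-- Value rule: part of speech limited to an assumed value list -->
    <sch:rule context="tei:gramGrp/tei:pos">
      <sch:assert test="normalize-space(.) = ('noun', 'verb', 'adjective')">
        Unexpected part of speech value.
      </sch:assert>
    </sch:rule>
  </sch:pattern>
</sch:schema>
```

Rules of this kind cover constraints that are awkward to state in the grammar itself, such as co-occurrence conditions between components.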

Towards a joint ISO-TEI activity
– Contributing to convergence, with a pragmatic perspective
– Benefiting from advantages of both sides
  – TEI reactivity and community support
  – ISO stability and international validation
– LMF serialization seen as
  – A subset of the TEI when equivalent constructs exist
  – An extension of the TEI for missing constructs (e.g. syntax)
– Some concrete work on the table…
  – Aside activities: specifying LL-LIF in ODD
– Reference
  – L. Romary, “Standardization of the formal representation of lexical information for NLP”