Standards for language resources the ISO/TC 37(/SC 4) perspective



Standards for language resources the ISO/TC 37(/SC 4) perspective Laurent Romary Directeur de Recherche INRIA ISO/TC 37/SC 4 chair

Context
ISO TC37 - Terminology and other language resources
  SC3 - Computer applications in terminology
    ISO 12200 - Martif
    ISO 12620 - Data categories (under revision)
    ISO 16642 - TMF (Terminological Markup Framework)
  SC4 - Language Resource Management

An example scenario: information extraction
[Diagram labels: Semantic content; Content analysis; Syntactic structures; Chunk parsing; Part-of-speech tagging; POS tagging; Primary Data]

Horizontal view (W3C perspective)
[Diagram labels: Semantic content; Content analysis; Syntactic structures; Chunk parsing; Part-of-speech tagging; POS tagging; Primary Data. W3C technologies shown: OWL, XML, RDF, SOAP]

Vertical view (ISO/TC 37/SC 4 perspective)
[Diagram labels: Semantic content; Content analysis; Syntactic structures; Chunk parsing; Part-of-speech tagging; POS tagging; Primary Data. SC 4 components shown: Evaluation; Linguistic models and descriptors (Data Categories); Lexica]

Linguistic information sources …and initiatives
Access protocols [Corba, SOAP]
Primary resources (text, dialogues): structural mark-up, basic annotations [TEI, MPEG7, TMX, XLIFF, XHTML, etc.]
Knowledge structures: hierarchies of types, relations between concepts (subjects/topics etc.), links to primary resources [Topic Maps, OIL, RDF]
Links
NLP structures (annotations): POS tagging, chunks (cf. Named Entities), deep syntactic structures, co-references etc. [Eagles/ISLE, CES, MATE,…]
Lexical structures (language models): terminologies, transfer lexica, LTAG/HPSG/LFG lexica [TBX, OLIF, Eagles/ISLE (Genelex)]
Meta-data [Dublin core, OLAC, ISLE, MPEG7, RDF]

SC4 Approach
Efforts geared towards defining abstract models and general frameworks for the creation and representation of language resources
In principle, abstract enough to accommodate diverse linguistic, theoretical or practical approaches
No provision of new formats
Situate development squarely in the framework of XML and related standards
Ensure compatibility with established and widely accepted web-based technologies
Ensure feasibility of transduction from legacy formats into newly defined formats

SC4 and other standardizing bodies
Contributing organizations:
TEI - text representation; reference for primary sources, e.g.: text archives; Oscar Text
W3C - basic protocols and formats: XML (Schemas), XPath, XPointer + RDF, SVG, SMIL, SOAP
ISO TC37/SC4 - language resources, NLP perspective, e.g. linguistic annotations, lexical formats
Technical background
MPEG - Multimedia, XML based, e.g. MPEG7-4; word and phone lattices; audio/speech

ISO/TC 37/SC 4 structure
WG1 - Basic descriptors and mechanisms for language resources
WG2 - Representation schemes
WG3 - Multilingual text representation
WG4 - Lexical databases
WG5 - Workflow of language resource management
Data categories

On-going activities
Feature structure representation (in collaboration with the TEI - Text Encoding Initiative): ISO DIS 24610
Morpho-syntactic annotation: ISO NP 24611
Lexical markup framework: ISO NP 24612 (+ ISO NP 12620-3)
Task force on Meta-data for language resources (OLAC+IMDI)
ACL/Sigsem working group on multimodal content representation
Data category registry for ISO/TC 37: ISO CD 12620-1 on ballot (deadline Jan. 2004)

Modeling linguistic annotation structures

General framework - 1
Model for linguistic annotation that can be instantiated in a standard representational format
GMT (Generic Mapping Tool): serves as a pivot format into and out of which proprietary formats may be transduced, to enable comparison, merging and manipulation via common tools
Reference: ISO 16642 - Terminological Markup Framework

General framework - 2
A meta-model
  A general, underlying model that informs current practice
A set of data categories
  Provides the precise semantics of the format
  Obtained:
    By sub-setting a Data Category Registry
    By providing application-specific categories
Vs. terminology - fixed named levels

ISO 16642: A family of formats TMF … TML1 TML2 TML3 TMLi (Geneter) (TBX) GMT

Meta-model (* marks a component that may occur any number of times)
Terminological Data Collection (TDC)
  Global Information (GI)
  Complementary Information (CI)
  * Terminological Entry (TE)
    * Language Section (LS)
      * Term Section (TS)
        * Term Component Section (TCS)

TMF: example
TE: id='ID67', subjectField='manufacturing', definition='A value…'
  LS: lang='en'
    TS: term='alpha smoothing factor', termType='fullForm'
  LS: lang='hu'
    TS: term='…'
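The nesting in this example (one terminological entry, a language section per language, term sections inside each) can be sketched as plain data structures. The Python below is illustrative only: the class names and the `terms_by_language` helper are hypothetical and not part of ISO 16642, and the definition and Hungarian term that the slide elides are left out rather than guessed.

```python
from dataclasses import dataclass, field


@dataclass
class TermSection:
    # Data categories attached at term level, e.g. term, termType.
    features: dict


@dataclass
class LanguageSection:
    lang: str
    terms: list = field(default_factory=list)


@dataclass
class TerminologicalEntry:
    # Data categories attached at entry level, e.g. id, subjectField.
    features: dict
    languages: list = field(default_factory=list)


# Values copied from the slide; elided values ("A value…", "…") are omitted.
entry = TerminologicalEntry(
    features={"id": "ID67", "subjectField": "manufacturing"},
    languages=[
        LanguageSection(
            "en",
            [TermSection({"term": "alpha smoothing factor",
                          "termType": "fullForm"})],
        ),
        LanguageSection("hu"),  # Hungarian term is elided in the source
    ],
)


def terms_by_language(te):
    """Collect the term strings of each language section of an entry."""
    return {ls.lang: [ts.features["term"] for ts in ls.terms]
            for ls in te.languages}
```

Keeping the data categories in dictionaries rather than fixed attributes mirrors the idea that the meta-model fixes the levels while the set of data categories varies by application.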

Implementation in TBX (cf.
<termEntry id='ID67'>
  <descrip type='subjectField'>manufacturing</descrip>
  <descrip type='definition'>A value between 0 and 1 used in ...</descrip>
  <langSet lang='en'>
    <tig>
      <term>alpha smoothing factor</term>
      <termNote type='termType'>fullForm</termNote>
    </tig>
  </langSet>
  <langSet lang='hu'>
    <tig>
      <term>Alfa ...</term>
    </tig>
  </langSet>
</termEntry>
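As a rough illustration of how such a TBX fragment maps back onto the TE/LS/TS levels of the meta-model, the sketch below reads the English part of the entry with Python's standard `xml.etree.ElementTree`. The `read_entry` function and the dictionary layout are assumptions made for this example, not part of TBX or ISO 16642, and the fragment is reduced to the values that are not elided on the slide.

```python
import xml.etree.ElementTree as ET

# Reduced copy of the slide's fragment (elided values left out).
TBX_FRAGMENT = """
<termEntry id='ID67'>
  <descrip type='subjectField'>manufacturing</descrip>
  <langSet lang='en'>
    <tig>
      <term>alpha smoothing factor</term>
      <termNote type='termType'>fullForm</termNote>
    </tig>
  </langSet>
</termEntry>
"""


def read_entry(xml_text):
    """Map a termEntry element onto nested dicts (hypothetical layout)."""
    root = ET.fromstring(xml_text.strip())
    entry = {"id": root.get("id"), "languages": {}}
    # Entry-level data categories (TE level).
    for descrip in root.findall("descrip"):
        entry[descrip.get("type")] = descrip.text
    # One language section (LS) per langSet, one term section (TS) per tig.
    for lang_set in root.findall("langSet"):
        term_sections = []
        for tig in lang_set.findall("tig"):
            ts = {"term": tig.findtext("term")}
            for note in tig.findall("termNote"):
                ts[note.get("type")] = note.text
            term_sections.append(ts)
        entry["languages"][lang_set.get("lang")] = term_sections
    return entry


entry = read_entry(TBX_FRAGMENT)
```

The `type` attribute values (`subjectField`, `termType`) become keys directly, which is one simple way of treating them as data category names.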

Implementing a Data Category Registry for ISO TC37

Data Category
Definition: Elementary descriptor used in a linguistic description or annotation scheme
Example: /Part of speech/, /Grammatical gender/, /Grammatical number/, /Feminine/, /Plural/, /Ablative/
Background:
  Experience gained from ISO 16642 in linguistic format specification
  Wider notion of data categories as meta-data for tagged language resources

Multiple uses of data categories Documentation Meta-data XML schemas Data category selection Meta model XSL filters

Application domains
Terminological data collection (TC 37/SC 3)
  Cf. “old” ISO 12620 set of data categories for terminology
Language codes (TC 37/SC 2)
  Cf. evolution from ISO 639-1 and ISO 639-2 to ISO 639-4
On-going and future SC4 activities (TC 37/SC 4)
  Meta-data for language resources
  Morpho-syntax/Syntax, Discourse level annotation
  NLP lexica, MT lexica
  Multilingual data representations (e.g. translation memories) and access (query languages)

Technical background
ISO 11179 (ISO JTC 1/SC 32): meta-data registry view
  Provides mechanisms for the management of data categories
ISO 16642 (ISO TC 37/SC 3): terminology view
  Provides ways of dealing with multilingual issues
OWL (W3C Sem. Web activity): ontology view
  Provides a framework for dealing with hierarchies and expressing constraints on data categories
  E.g. a /noun/ can be described by means of /gender/ and /number/ in French

Relation to ISO 11179
Complex datcat /gender/: data element concept
Set of simple datcats /masculine/, /feminine/, /neuter/: conceptual domain
XML object: data element, implemented as an XML attribute named 'gen'
List of values m, f, n: value domain
XML schema declaration; instance example: <w lemme="vert" gen="f">verte</w>
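The link between the value domain (m, f, n) and the conceptual domain (/masculine/, /feminine/, /neuter/) can be sketched as a simple lookup applied to the `gen` attribute. This is a minimal Python sketch under assumptions: the pairing of each code with a simple data category is inferred from the slide, and the `gender_of` function is hypothetical.

```python
import xml.etree.ElementTree as ET

# Value domain (XML attribute values) mapped to the conceptual domain
# (simple data categories). The pairing is inferred from the slide.
GENDER_VALUES = {
    "m": "/masculine/",
    "f": "/feminine/",
    "n": "/neuter/",
}


def gender_of(word_xml):
    """Return the simple data category named by a <w> element's 'gen' value."""
    w = ET.fromstring(word_xml)
    code = w.get("gen")
    if code not in GENDER_VALUES:
        raise ValueError(f"'gen' value {code!r} is outside the value domain")
    return GENDER_VALUES[code]


category = gender_of('<w lemme="vert" gen="f">verte</w>')
```

A real application would normally leave this check to the XML schema itself; the lookup only makes the two-level distinction from ISO 11179 visible.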

The ISO 12620-1 proposal
Entry Identifier: gender
Profile: morpho-syntax
Definition (fr): Catégorie grammaticale reposant, selon les langues et les systèmes, sur la distinction naturelle entre les sexes ou sur des critères formels (Source: TLFi)
Definition (en): Grammatical category… (Source: TLFi (Trad.))
Conceptual Domain: {/feminine/, /masculine/, /neuter/}
Object Language: fr
  Name: genre
  Conceptual Domain: {/feminine/, /masculine/}
Object Language: en
  Name: gender
Object Language: de
  Name: Geschlecht
  Conceptual Domain: {/feminine/, /masculine/, /neuter/}
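The entry above combines a general conceptual domain with narrower, language-specific ones. One way to picture this is the Python sketch below. The class names and the `domain_for` method are hypothetical, and the fallback to the general domain when a language section declares none (as for English here) is an assumption of this sketch, not something the slide states.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ObjectLanguageSection:
    name: str
    # None means the section declares no domain of its own.
    conceptual_domain: Optional[frozenset] = None


@dataclass
class DataCategory:
    identifier: str
    profile: str
    conceptual_domain: frozenset
    languages: dict = field(default_factory=dict)

    def domain_for(self, lang):
        """Language-specific domain, else the general one (assumed fallback)."""
        section = self.languages.get(lang)
        if section is None or section.conceptual_domain is None:
            return self.conceptual_domain
        return section.conceptual_domain


ALL_GENDERS = frozenset({"/feminine/", "/masculine/", "/neuter/"})

gender = DataCategory(
    identifier="gender",
    profile="morpho-syntax",
    conceptual_domain=ALL_GENDERS,
    languages={
        "fr": ObjectLanguageSection(
            "genre", frozenset({"/feminine/", "/masculine/"})),
        "en": ObjectLanguageSection("gender"),
        "de": ObjectLanguageSection("Geschlecht", ALL_GENDERS),
    },
)
```

For instance, `gender.domain_for("fr")` yields only /feminine/ and /masculine/, matching the French section of the entry.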

Perspectives
ISO/TC 37/SC 4 in a wider picture
Basic building blocks to bring coherence to the representation of linguistic information in a variety of application domains
  E.g. e-documentation, e-learning, e-business (e-catalogues), multimedia, localisation…
Provide vertical solutions to linguistically based applications
  E.g. information extraction, indexing