Www.isocat.org Principles of ISOcat, a Data Category Registry Marc Kemps-Snijders a, Menzo Windhouwer a, Sue Ellen Wright b a Max Planck Institute for.

Slides:



Advertisements
Similar presentations
Dr. Leo Obrst MITRE Information Semantics Information Discovery & Understanding Command & Control Center February 6, 2014February 6, 2014February 6, 2014.
Advertisements

Resource description and access for the digital world Gordon Dunsire Centre for Digital Library Research University of Strathclyde Scotland.
ISOcat Data Model: Workflow & Guidelines Marc Kemps-Snijders a, Sue Ellen Wright b, Menzo Windhouwer a a Max Planck Institute for Psycholinguistics, b.
ISOcat Data Category Registry Defining widely accepted linguistic concepts Menzo Windhouwer 1CLARIN-NL MD tutorial, September 2009.
Advanced Metadata Usage Daan Broeder TLA - MPI for Psycholinguistics / CLARIN Metadata in Context, APA/CLARIN Workshop, September 2010 Nijmegen.
ISOcat introduction 19 June 20121CLARIN-NL ISOcat workshop.
Data Category specifications 19 June 20121CLARIN-NL 2012 ISOcat tutorial.
ICT Monica Monachini – 1° KYOTO Workshop – Amsterdam 2/ KYOTO (ICT ) Yielding Ontologies for Transition-Based Organization Intelligent.
ANSI TAG 37 Committee F43 Language Services and Products Interagency Language Roundtable September 30, 2011 Sue Ellen Wright ISO TC 37, Terminology and.
The Language Archive – Max Planck Institute for Psycholinguistics Nijmegen, The Netherlands Metadata Component Framework Possible Standardization Work.
MLIF: A Metamodel to Represent and Exchange Multilingual Textual Information ISO TC37 SC4 WG Samuel Cruz-Lara, Gil Francopoulo, Laurent Romary,
The current state of Metadata - as far as we understand it - Peter Wittenburg The Language Archive - Max Planck Institute CLARIN Research Infrastructure.
What Linguists Want (we think) Helen Aristar Dry & Anthony Aristar LINGUIST List & E-MELD.
10 December, 2013 Katrin Heinze, Bundesbank CEN/WS XBRL CWA1: DPM Meta model CWA1Page 1.
TMF - a tutorial TMF - Terminological Markup Framework Laurent Romary - Laboratoire Loria.
Future of MDR - ISO/IEC Metadata Registries (MDR) Larry Fitzwater, SC 32 WG 2 Convener Computer Scientist U.S. Environmental Protection Agency May.
Data Category specifications 20 March 20121CLARIN-NL ISOcat workshop.
9 th Open Forum on Metadata Registries Harmonization of Terminology, Ontology and Metadata 20th – 22nd March, 2006, Kobe Japan. Commonalities and Differences.
Principles of the GOLD Ontology & Conversion of GOLD to DCIF Presenters: Anthony Aristar, Evelyn Richter.
Provo, 16 Aug 2007 LMF meeting 1 Lexical Markup Framework: ISO Provo meeting Gil Francopoulo.
CLARIN web services and workflow Marc Kemps-Snijders.
The ISO-DCR 17 January /20111CMDI tutorial Marc Kemps-Snijders a, Menzo Windhouwer b, Sue Ellen Wright c a Meertens Institute, b MPI for.
Standards for language resources the ISO/TC 37(/SC 4) perspective
ISOcat demo and providing RELcat input Menzo Windhouwer The Language Archive tla.mpi.nl Data Archiving and Networked Solutions
Trends in Concept Modelling Turning Issues into Solutions How to Discipline a Cat Sue Ellen Wright, Kent State University.
INF 384 C, Spring 2009 Ontologies Knowledge representation to support computer reasoning.
CLARIN-NL Call 3 ISOcat follow-up 10/10/20121CLARIN-NL ISOcat Call 3 follow-up.
Content of the Data Category Registry 10 May /20111CLARIN-NL ISOcat workshop.
Classification and the Metadata Registry Judith Newton NIST IRS XML Stakeholders/ XML Working Group May 18, 2004.
LIRICS Mid-term Review 1 LIRICS WP2 – NLP Lexica Monica Monachini CNR-ILC - Pisa 23rd May 2006.
Nancy Lawler U.S. Department of Defense ISO/IEC Part 2: Classification Schemes Metadata Registries — Part 2: Classification Schemes The revision.
Report on the ISOcat project Marc Kemps-Snijders Menzo Windhouwer Peter Wittenburg Sue Ellen Wright January 8,
MPEG-7 Interoperability Use Case. Motivation MPEG-7: set of standardized tools for describing multimedia content at different abstraction levels Implemented.
Tommie Curtis SAIC January 17, 2000 Open Forum on Metadata Registries Santa Fe, NM SDC JE-2023.
CLARIN-NL Call 4 ISOcat follow-up 2/10/20131CLARIN-NL Call 4 ISOcat follow-up.
ISO a tutorial Part 2: Representing data categories TMF - Terminological Markup Framework Laurent Romary - Laboratoire Loria.
Metadata. Generally speaking, metadata are data and information that describe and model data and information For example, a database schema is the metadata.
ISOcat introduction 20 June 20131CLARIN-NL ISOcat workshop.
ISOcat introduction 20 March 20121CLARIN-NL ISOcat workshop.
Jan 9, 2004 Symposium on Best Practice LSA, Boston, MA 1 Comparability of language data and analysis Using an ontology for linguistics Scott Farrar, U.
11 CMDI/ISOcat And Semantic Operability Ineke Schuurman ISOcat content coördinator CLARIN-NL Menzo Windhouwer ISOcat system administrator Utrecht
The Language Archive – Max Planck Institute for Psycholinguistics Nijmegen, The Netherlands NP CMDI-1 Metadata Component Framework New Standardization.
Technology – Broad View Aspects that play a role when integrating archives leave the details of some core topics to the 2. day Bernhard Neumair:Base Technologies.
A Data Category Registry- and Component- based Metadata Framework Daan Broeder et al. Max-Planck Institute for Psycholinguistics LREC 2010.
Beyond ISOcat 20 June 2013CLARIN-NL ISOcat tutorial1.
LexEVS Value Domains. LexGrid Definitions Value Domain Definition – the description of the contents Value Domain Resolution – the actual contents –The.
ISO TC 37/CLARIN SEMANTIC DATA REGISTRY WORKSHOP UTRECHT, DECEMBER ISOcat: Metadata Registry SUE ELLEN WRIGHT DECEMBER 2013.
Metadata “Data about data” Describes various aspects of a digital file or group of files Identifies the parts of a digital object and documents their content,
CLARIN Concept Registry: the new semantic registry Ineke Schuurman, Menzo Windhouwer, Oddrun Ohren, Daniel Zeman
Towards a roadmap for standardization in language technology Laurent Romary & Nancy Ide Loria-INRIA — Vassar College.
The ISO Data Category Registry ISO 12620:2009 introduces – A web-based electronic Data Category Registry (DCR) for simple, complex and (in the future)
ISOcat status
CLARIN Requirements for a Semantic Registry Daan Broeder The Language Archive – MPI Ineke Schuurman CLARIN-NL/VL – KU Leuven & Utrecht.
Menzo Windhouwer.  The Typological Database System (TDS) provides integrated access to multiple, independently created typological databases.  Users.
1 ISOCAT Proposed solutions for Problems encountered in DUELME-LMF Jan Odijk Nijmegen 21 Sep 2010.
Working with XML. Markup Languages Text-based languages based on SGML Text-based languages based on SGML SGML = Standard Generalized Markup Language SGML.
SemAF – Basics: Semantic annotation framework Harry Bunt Tilburg University isa -6 Joint ISO - ACL/SIGSEM workshop Oxford, January 2011 TC 37/SC.
Annotation by category – ELAN and ISO DCR Han Slöetjes, Peter Wittenburg Max-Planck-Institute for Psycholinguistics LREC,
Formats, interoperability and standards Marc Kemps-Snijders.
Be.wi-ol.de User-friendly ontology design Nikolai Dahlem Universität Oldenburg.
ISO TC 37/CLARIN DISCUSSION UTRECHT, DECEMBER 9/ Thinning Down a Bloated Cat SUE ELLEN WRIGHT DECEMBER 2013.
ISOcat tutorial DCR data model and guidelines. Simple and complex DCs Simple Data CategoryComplex Data CategoryConceptual Domain Data CategoryDescription.
A Data Category Registry- and Component- based Metadata Framework Daan Broeder et al. Max-Planck Institute for Psycholinguistics LREC 2010.
TDS-Curator DANS MPI for Psycholinguistics Utrecht Institute of Linguistics OTS languagelink.let.uu.nl/tds/ 9/21/20101CLARIN-NL - Call 1 - ISOcat status.
Group work and standardization features in ISOcat Menzo Windhouwer 8/14/20101Standardizing Data Categories in ISOcat - Implementing Group.
Linking to Linguistic Data Categories in ISOcat Menzo Windhouwer a, Sue Ellen Wright b a The Language Archive - MPI for Psycholinguistics,
ISOcat introduction 10 May /20111CLARIN-NL ISOcat workshop.
Marc Kemps-Snijders Menzo Windhouwer Sue Ellen Wright
Metadata Issues in Long-term Management of Data and Metadata
Working meeting of WP4 Task WP4.1
Presentation transcript:

Principles of ISOcat, a Data Category Registry Marc Kemps-Snijders a, Menzo Windhouwer a, Sue Ellen Wright b a Max Planck Institute for Psycholinguistics, b Kent State University 4/8/20101Lexicon Tools and Lexicon Standards

Outline What are Data Categories? How can you use Data Categories? What is a Data Category Registry? How can you use a Data Category Registry? Future work 4/8/2010Lexicon Tools and Lexicon Standards2

ISO 12620:2009 Terminology and other content and language resources — Specification of data categories and management of a Data Category Registry for language resources – An ISO TC 37/SC 3 standard (see [1]) – Successor to ISO 12620:1999 which contained a hardcoded list of Data Categories 4/8/2010Lexicon Tools and Lexicon Standards3

What is a Data Category? The result of the specification of a given data field – A data category is an elementary descriptor in a linguistic structure or an annotation scheme. Specification consists of 3 main parts: – Administrative part Administration and identification – Descriptive part Documentation in various working languages – Linguistic part Conceptual domain(s for various object languages) 4/8/2010Lexicon Tools and Lexicon Standards4

Data category example Data category: /Grammatical gender/ – Administrative part: Identifier: grammaticalGender PID: – Descriptive part: English definition: Category based on (depending on languages) the natural distinction between sex and formal criteria. French definition: Catégorie fondée (selon la langue) sur la distinction naturelle entre les sexes ou d'autres critères formels. – Linguistic part: Morposyntax conceptual domain: /male/, /feminine/, /neuter/ French conceptual domain: /male/, /feminine/ 4/8/2010Lexicon Tools and Lexicon Standards5

Mandatory parts of the specification For each data category: – a mnemonic identifier – an English definition – an English name For complex data categories: – a conceptual domain For standardization candidates: – a profile – a justification 4/8/2010Lexicon Tools and Lexicon Standards6

Data Category types 4/8/2010Lexicon Tools and Lexicon Standards7 writtenForm string open grammaticalGender string neuter masculine feminine closed simple: string constrained complex:

Data Category relationships 4/8/2010Lexicon Tools and Lexicon Standards8 Value domain membership Subsumption relationships between simple data categories (legacy) Relationships between complex data categories are not stored in the DCR partOfSpeech string pronoun personal pronoun

No ontological relationships? Rationale: – Relation types and modeling strategies for a given data category may differ from application to application; – Motivation to agree on relation and modeling strategies will be stronger at individual application level; – Integration of multiple relation structures in DCR itself could lead to endless ontological clutter. 4/8/2010Lexicon Tools and Lexicon Standards9

Usage of is-a relationships between simple DCs 4/8/2010Lexicon Tools and Lexicon Standards10 Data categoryMorposyntaxTerminology /partOfSpeech/XX /adjective/XX /ordinalAdjective/X /participleAdjective/X /qualifierAdjective/X /adposition/XX /circumposition/X /preposition/X /postposition/X

How can you use Data Categories? 4/8/2010Lexicon Tools and Lexicon Standards11 Lexicon Lexical Entry FormSense 0..* 1..* partOfSpeech writtenForm grammaticalGender lexicalType Word Form Lemma LanguageBWOgenders grammaticalGenderwordOrder A LMF (ISO 24613:2008) compliant (schema for a) lexicon A (schema for a) typological database Shared semantics!

How? A (TC 37) meta model which is instantiated with a domain/application specific data category selection into a data model An proprietary data model with a related data category selection 4/8/2010Lexicon Tools and Lexicon Standards12

How? 4/8/2010Lexicon Tools and Lexicon Standards13 nihongo … … …

Referencing Data Categories Each Data Category should be uniquely identifiable – Ambiguity: different domains use the same term but mean different ‘things’ – Semantic rot: even in the same domain the meaning of a term changes over time – Persistence: for archived resources Data Category references should still be resolvable and point to the specification as it was at/close to time of creation Persistent IDentifiers – ISO/DIS Language resource management -- Persistent identification and access in language technology applications – ISOcat uses ‘cool URIs’ (see [6]) (/grammaticalGender/) 4/8/2010Lexicon Tools and Lexicon Standards14

Where do you put these references? In a schema: ipa … 4/8/2010Lexicon Tools and Lexicon Standards15

ISO TC 37 standards using Data Categories Terminological Markup Framework (TMF; ISO 16642) Lexical Markup Framework (LMF; ISO 24613) TermBase eXchange (TBX; ISO 30042) Morpho-syntactic Annotation Framework (MAF; ISO 24611) Linguistic Annotation Framework (LAF; ISO 24612) Meta models which can be instantiated into a specific data model with data categories However, some still refer to ISO 12620:1999 Data Categories and some don’t support all types (see [3]) 4/8/2010Lexicon Tools and Lexicon Standards16

Other uses of Data Categories CLARIN Component Metadata Infrastructure (CMDI) ISO 12620:2009 provides a small XML vocabulary, DC Reference (see [4]), which provides elements and attributes to embed Data Category references in arbitrary XML documents – Including: XML Schema, Relax NG, TEI/ISO feature structures, … The references can be used in URI based ‘mappings’: – Including: ODD, RDF-based vocabularies (OWL, SKOS), … 4/8/2010Lexicon Tools and Lexicon Standards17

What is a Data Category Registry? A (coherent) set of Data Categories, in our case for linguistic resources A system to manage this set: – Create and edit Data Categories – Share Data Categories, e.g., resolve PID references – Standardize Data Categories Grass roots approach 4/8/2010Lexicon Tools and Lexicon Standards18

Standardize Data Categories 4/8/2010Lexicon Tools and Lexicon Standards19 Submission group Data Category Registry Board Validation Thematic Domain Group Evaluation Stewardship group Decision Group rejected Publication

Thematic Domain Groups 4/8/2010Lexicon Tools and Lexicon Standards20 TDG 1: Metadata TDG 2: Morphosyntax TDG 3: Semantic Content Representation TDG 4: Syntax TDG 5: Machine Readable Dictionary TDG 6: Language Resource Ontology TDG 7: Lexicography TDG 8: Language Codes TDG 9: Terminology TDG 11: Multilingual Information Management TDG 12: Lexical Resources TDG 13: Lexical Semantics TDG 14: Source Identification TDGs are the owner and guardians of a coherent subset of the DCR TDGs own one or more profiles Each TDG has a chair A number of judges (assigned by SC P members) A number of expert members (up to 50%) TDGs are constituted at the TC37/SC plenary New TDGs need to be proposed by a SC 1.Translation 2.Sign language 3.Audio

How can you use a Data Category Registry? You can: – Find Data Categories relevant for your resources and embed references to them so the semantics of (parts of) your resources are made explicit This can be supported by tools you use, e.g., ELAN, LEXUS and the CMDI Component Editor directly interact with ISOcat – Interact with Data Category owners to improve (the coverage of) their Data Categories – Create (together with others) new Data Categories and/pr selections needed for your resources and share those – Submit (your) Data Categories for standardization – Free of charge – Grass roots approach 4/8/2010Lexicon Tools and Lexicon Standards21

ISOcat Reference implementation of ISO 12620:2009 The TC 37 Data Category Registry 4/8/2010Lexicon Tools and Lexicon Standards22

Future work Finish first complete version of ISOcat: – Standardization process Cleanup of the current set of Data Categories – TDGs cleanup their profiles – Standardize first sets of Data Categories Interaction with other TC 37 standards: – Migration from ISO 12620:1999 – Full support for all types of Data Categories 4/8/2010Lexicon Tools and Lexicon Standards23

More future work Additional Data Categories types – Container Data Categories Complex and Simple only cover ‘leafs’ and their values – Data Category Concepts Basic building blocks for knowledge bases Relation Registries (RR) – Stores (your) (semantic) relationships between Data Categories 4/8/2010Lexicon Tools and Lexicon Standards24

Container Data Categories Use the administrative and descriptive parts to manage standardization and describe the containers (components/tables/classes/objects/inner nodes…) of a meta/data model in the DCR But the relationships between components and complex data categories wouldn’t be stored in the DCR (maybe in the RR) 4/8/2010Lexicon Tools and Lexicon Standards25

Data Category Concepts ISO 12620:2009: – data category (DC): result of the specification of a given data field EXAMPLE /partOfSpeech/, /grammaticalGender/, /grammaticalNumber/; the values associated with these items (for example, /noun/, /verb/, /feminine/, /plural/, etc.) are also data categories according to this International Standard, but values of this type are not viewed as data element concepts (3.1.4) in the ISO/IEC family of standards. NOTE 2 A data category corresponds closely, but not identically, to a data element concept in ISO/IEC DCR Guidelines for the definition – Definitions shall follow the rules outlined for intentional definitions in ISO 704 They should begin with the superordinate concept, either immediately above or at a higher level of the data category concept being defined; They should list critical and delimiting characteristic(s) that distinguish the concept from other related concepts. So DCs and concepts are related, maybe this relationship should become clear in the DCR? Maybe the concept descriptions should be in the DCR, and the ontological relationships in the RR 4/8/2010Lexicon Tools and Lexicon Standards26

Possible full model 4/8/2010Lexicon Tools and Lexicon Standards27 lexicon language alphabet entry lemma writtenForm japanese ipa Data modelKnowledge base

GOLD in ISOcat Map GOLD concepts to DC types: – Some to closed DCs with simple DC hierarchies For example: /formUnit/ with simple DCs /Foot/, /Grapheme/, … and /Segment/ which is the parent of /Consonant/, /Vowel/, … – Some to simple DCs (as they can’t have values) Could be candidates for the proposed new data category concept type? – Ontological relations could be stored in the (to be build) RR The combination of the DCR + RR should result in the (part of the) ontological structure 4/8/2010Lexicon Tools and Lexicon Standards28

Registry network 4/8/2010Lexicon Tools and Lexicon Standards29 Linguistic resources Data category registries Relation registries MPI DCR ISO DCR Typological Database System RRMPI RR MPI archive TDS databaseresource

4/8/2010Lexicon Tools and Lexicon Standards30 Thank you for your attention! Visit Questions?

References [1] ISO 12620, Terminology and other language and content resources -- Specification of data categories and management of a Data Category Registry for language resources.ISO [2] [3] M.A. Windhouwer, S.E. Wright, M. Kemps-Snijders. Referencing ISOcat data categories. In proceedings of the LREC 2010 LRT standards workshop. Malta, May 18, 2010.Referencing ISOcat data categoriesLRT standards workshop [4] [5] [6] Tim Berners-Lee, Cool URIs don't change, 1998.Cool URIs don't change 4/8/2010Lexicon Tools and Lexicon Standards31