ISOcat Data Category Registry
Defining widely accepted linguistic concepts
Marc Kemps-Snijders, Menzo Windhouwer, Sue Ellen Wright
CLARIN-NL Info dag, 1 July 2009

ISOcat: a reference implementation
- ISO 12620:2009, Terminology and other language and content resources: Specification of data categories and management of a Data Category Registry for language resources
- ISO 12620:1999 was a fixed list of data categories; this revision provides a data model and management procedures
- ISO Technical Committee 37: Terminology and other language and content resources

ISO 24613:2008 Lexical Markup Framework
[Figure: class diagram. Classes: Lexicon, Lexical Entry, Form, Lemma, Word Form, Sense. Multiplicities shown: 1..*, 1..*, 0..*, 0..*. Data categories shown: partOfSpeech, writtenForm, grammaticalNumber, lexicalType.]

Data categories
- data category: “result of the specification of a given data field” (ISO 12620:2009)
- data element concept (ISO 11179): “concept for which the definition, identification and conceptual domain are specified independently of any particular representation”
- complex data categories are data element concepts

Data category types
complex:
- open: writtenForm (string)
- closed: grammaticalGender (string), with the values neuter, masculine, feminine
- constrained: email (string), Constraint: .+@.+
simple:
- neuter, masculine, feminine (the values of a closed data category)

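The three complex types differ only in which values they accept, which can be sketched in a few lines of Python. The class name, field names and `accepts` method below are illustrative assumptions and not part of ISOcat or DCIF. The sketch also assumes that the constraint is a regular expression that must match the whole value.

```python
import re
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class ComplexDataCategory:
    """Hypothetical model of a complex data category (not an ISOcat API)."""
    identifier: str
    data_type: str
    kind: str                              # "open", "closed" or "constrained"
    values: FrozenSet[str] = frozenset()   # simple data categories (closed only)
    constraint: Optional[str] = None       # regular expression (constrained only)

    def accepts(self, value: str) -> bool:
        if self.kind == "open":
            return True                    # any value of the data type
        if self.kind == "closed":
            return value in self.values    # only the enumerated simple values
        if self.kind == "constrained":
            # Assumption: the constraint has to match the complete value.
            return re.fullmatch(self.constraint, value) is not None
        raise ValueError(f"unknown kind: {self.kind}")


written_form = ComplexDataCategory("writtenForm", "string", "open")
gender = ComplexDataCategory(
    "grammaticalGender", "string", "closed",
    values=frozenset({"neuter", "masculine", "feminine"}),
)
email = ComplexDataCategory("email", "string", "constrained", constraint=r".+@.+")

print(written_form.accepts("huis"))   # True
print(gender.accepts("feminine"))     # True
print(gender.accepts("common"))       # False
print(email.accepts("a@b"))           # True
print(email.accepts("no-at-sign"))    # False
```
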
Data category specification
- Administration Information Section
- Description Section
- Data Element Name
- Language Section
- Name Section
- Conceptual Domain
- Linguistic Section
Mandatory:
- A mnemonic identifier
- An English definition
- An English name
- A conceptual domain

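As a rough illustration of the mandatory parts, the check below reports which of the four are missing from a specification held as a plain dictionary. The key names and the example values are hypothetical. They are not DCIF element names.

```python
# Hypothetical keys for the four mandatory parts listed on the slide.
MANDATORY = (
    "identifier",          # a mnemonic identifier
    "english_definition",  # an English definition
    "english_name",        # an English name
    "conceptual_domain",   # a conceptual domain
)


def missing_mandatory(spec: dict) -> list:
    """Return the mandatory keys that are absent or empty in spec."""
    return [key for key in MANDATORY if not spec.get(key)]


draft = {
    "identifier": "grammaticalGender",
    "english_name": "grammatical gender",
    "conceptual_domain": ["neuter", "masculine", "feminine"],
}

print(missing_mandatory(draft))  # ['english_definition']
```
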
Data Category Selections
Anyone can:
- register with ISOcat
- create data categories
- create data category selections (DCSs)
- share DCSs
- make DCSs public
- submit DCSs for standardization

ISO standardization process
[Figure: workflow diagram. Labels, in order of appearance: Submission group; Thematic Domain Group; Evaluation; Data Category Registry Board; Validation; Stewardship group; ISO Publication.]

Using data categories
- Each data category has a Persistent Identifier (PID): http://www.isocat.org/datcat/DC-1297
- This PID can be embedded in the schemata of linguistic resources: <rng:element name="gender" dcr:datcat="…/DC-1297">
- The full data category specification can be downloaded from ISOcat in the Data Category Interchange Format (DCIF)

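To show how an embedded PID might be read back out of a schema, here is a minimal Python sketch using the standard library's `xml.etree.ElementTree`. The Relax NG namespace URI is the standard one. The URI bound to the `dcr` prefix is an assumption, as is the function name. The example schema writes the PID out in full, on the assumption that the elided prefix on the slide is the PID base shown above it.

```python
import xml.etree.ElementTree as ET

RNG_NS = "http://relaxng.org/ns/structure/1.0"
DCR_NS = "http://www.isocat.org/ns/dcr"  # assumed namespace for the dcr prefix

SCHEMA = f"""
<rng:element xmlns:rng="{RNG_NS}" xmlns:dcr="{DCR_NS}"
             name="gender"
             dcr:datcat="http://www.isocat.org/datcat/DC-1297">
  <rng:text/>
</rng:element>
"""


def datcat_references(schema_xml: str) -> dict:
    """Map each annotated Relax NG element name to its data category PID."""
    root = ET.fromstring(schema_xml)
    refs = {}
    # iter() also visits the root element itself.
    for element in root.iter(f"{{{RNG_NS}}}element"):
        pid = element.get(f"{{{DCR_NS}}}datcat")
        if pid is not None:
            refs[element.get("name")] = pid
    return refs


print(datcat_references(SCHEMA))
# {'gender': 'http://www.isocat.org/datcat/DC-1297'}
```

Two resources whose schemata point at the same PID can then be treated as using the same concept, even when their element names differ.
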
ISOcat demonstration
http://www.isocat.org/

Status of ISOcat
ISOcat is under active development:
Now:
- You can access public data categories and selections
- You can create your own data categories and selections
Future:
- Group features
- Cleanup by Thematic Domain Groups (TDGs)
- Standardization workflow

Relation Registry
- ISOcat contains a flat list of concepts
- The Relation Registry will support storing (user-specific) relations between these concepts: is-a, part-of, equivalent-to, related-to, …
Will support:
- Ontologies and taxonomies on top of data categories
- Searches across related data categories
- …

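The idea of searching across related data categories can be sketched as a small graph traversal. This is not the Relation Registry's design, which the slides do not describe. The class, its methods, the identifiers DC-X1 to DC-X3, and the choice to treat equivalent-to and related-to as symmetric are all assumptions made for illustration.

```python
from collections import defaultdict, deque

# Assumption: these two relation types hold in both directions.
SYMMETRIC = {"equivalent-to", "related-to"}


class RelationRegistry:
    """Hypothetical store of typed relations between data category identifiers."""

    def __init__(self):
        self._edges = defaultdict(set)  # source -> {(relation, target), ...}

    def add(self, source, relation, target):
        self._edges[source].add((relation, target))
        if relation in SYMMETRIC:
            self._edges[target].add((relation, source))

    def expand(self, start, relations):
        """Return start plus everything reachable via the given relation types."""
        seen = {start}
        queue = deque([start])
        while queue:
            current = queue.popleft()
            for relation, target in self._edges.get(current, ()):
                if relation in relations and target not in seen:
                    seen.add(target)
                    queue.append(target)
        return seen


registry = RelationRegistry()
registry.add("DC-1297", "equivalent-to", "DC-X1")  # DC-X1..3 are made-up identifiers
registry.add("DC-X2", "is-a", "DC-1297")
registry.add("DC-X1", "related-to", "DC-X3")

print(sorted(registry.expand("DC-1297", {"equivalent-to"})))
# ['DC-1297', 'DC-X1']
```

A search for resources annotated with DC-1297 could then be widened to every identifier returned by `expand`, with the set of relation types controlling how far the search reaches.
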
Thanks for your attention!
http://www.isocat.org/
Menzo.Windhouwer@mpi.nl