ISOcat Data Category Registry: Defining widely accepted linguistic concepts
Menzo Windhouwer
CLARIN-NL MD tutorial, September 2009

ISOcat: a reference implementation
ISO 12620:2009
– Terminology and other content and language resources — Specification of data categories and management of a Data Category Registry for language resources
– ISO 12620:1999 was a fixed list of data categories; this revision provides a data model and management procedures
ISO Technical Committee 37
– Terminology and other language and content resources

ISO 24613:2008 Lexical Markup Framework
[Figure: LMF class diagram with the classes Lexicon, Lexical Entry, Form, Sense, Word Form and Lemma (multiplicities 0..* and 1..*), annotated with the data categories partOfSpeech, writtenForm, grammaticalNumber and lexicalType.]

Data categories
“result of the specification of a given data field” (ISO 12620:2009)
data element concept (ISO 11179)
– “concept for which the definition, identification and conceptual domain are specified independently of any particular representation”
complex data categories are data element concepts

Data category types
[Figure: data category types. Complex: open (writtenForm, string), closed (grammaticalGender, string, with the values neuter, masculine, feminine), constrained (string). Simple: the values of a closed data category, such as neuter, masculine, feminine.]

Data category relationships
Value domain membership
Subsumption relationships between simple data categories
Relationships between complex data categories are not stored in the DCR
[Figure: partOfSpeech (string) with the value pronoun; personal pronoun shown under pronoun.]

Data category specification
Administration Information Section
Description Section
– Data Element Name
– Language Section
  – Name Section
Conceptual Domain
Linguistic Section
– Conceptual Domain
Mandatory:
1. A mnemonic identifier
2. An English definition
3. An English name
4. A conceptual domain
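
The four mandatory parts listed above can be checked mechanically. The following is a minimal sketch, not ISOcat's actual data model: the class `DataCategorySpec`, its field names, and the choice to keep the definition inside the English language section are assumptions made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class DataCategorySpec:
    """Hypothetical, simplified stand-in for a data category specification."""
    identifier: str = ""
    conceptual_domain: str = ""
    # Language sections keyed by language code; each one is assumed to hold
    # a "name" and a "definition" entry.
    language_sections: dict = field(default_factory=dict)


def missing_mandatory_parts(spec):
    """Return the labels of the mandatory parts that are absent or blank."""
    english = spec.language_sections.get("en", {})
    checks = [
        ("mnemonic identifier", spec.identifier),
        ("English definition", english.get("definition", "")),
        ("English name", english.get("name", "")),
        ("conceptual domain", spec.conceptual_domain),
    ]
    return [label for label, value in checks if not value.strip()]


# Illustrative specification that lacks an English definition.
spec = DataCategorySpec(
    identifier="partOfSpeech",
    conceptual_domain="string",
    language_sections={"en": {"name": "part of speech"}},
)
print(missing_mandatory_parts(spec))
```

A check like this could run before a data category is saved, so that incomplete specifications are reported early.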

Guidelines for data categories (I)
Identifier:
– camel case and XML-valid element name (without a namespace)
– valid: partOfSpeech; not valid: my:POS, 123POS
Data Element Name:
– language independent name for the data category used in a specific application domain (specified in the source)
– example: PoS in TBX
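
The identifier guideline can be approximated with two regular expressions. This is a sketch under stated assumptions: the XML name check is restricted to ASCII (real XML names allow many more characters), and "camel case" is read as lower camel case because the slide's example is partOfSpeech. The function name `is_valid_identifier` is hypothetical and not part of ISOcat.

```python
import re

# ASCII-only approximation of an XML name without a namespace prefix:
# starts with a letter or underscore and contains no colon.
NCNAME_ASCII = re.compile(r"[A-Za-z_][A-Za-z0-9_.-]*")

# Assumed reading of "camel case": a lowercase first letter followed by
# letters and digits only.
LOWER_CAMEL = re.compile(r"[a-z][A-Za-z0-9]*")


def is_valid_identifier(identifier):
    """Check an identifier against both approximated guideline rules."""
    return bool(NCNAME_ASCII.fullmatch(identifier)) and bool(
        LOWER_CAMEL.fullmatch(identifier)
    )


for candidate in ["partOfSpeech", "my:POS", "123POS"]:
    print(candidate, is_valid_identifier(candidate))
```

With these rules, partOfSpeech passes, my:POS fails because of the namespace prefix, and 123POS fails because an XML name cannot start with a digit.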

Guidelines for data categories (II)
Name Section in a Language Section
– legible name
– ‘part of speech’ in the English language section
– ‘partie du discours’ in the French language section
Definition:
– intensional definitions (ISO 704)
– should consist of a single sentence fragment
Source:
– add a source for any quoted material

Guidelines for data categories (III)
Justification:
– a simple statement justifying the relevance of the data category to the field of language resources
– especially needed for standardization

Private versus standard
The standard subset of data categories in the registry should be coherent
Coherence is guarded by the Thematic Domain Groups and the DCR Board
Standard data categories need to meet more constraints than private ones:
– mandatory justification
– DC relations demand profile overlap
– …

Data Category Selections
Anyone:
1. can register with ISOcat
2. can create data categories
3. can create data category selections (DCSs)
4. can share DCSs
5. can make DCSs public
6. may submit DCSs for standardization

Profiles versus DCSs
Profile membership is part of the DC specification
– the profile indicates the thematic domain of the DC
– the profile view in the UI is created by a query
– there is a limited number of profiles
A DCS is a collection of DCs
– hand-picked by a user for a specific purpose
– can contain DCs from various profiles
– there can be an unlimited number of DCSs
There isn’t (yet) a profile-specific view on a DCS

ISO standardization process
[Figure: workflow diagram with the elements Submission group, Thematic Domain Group (Evaluation), Data Category Registry Board (Validation), Stewardship group and ISO (Publication).]

Submission group
The owner, possibly together with a group of users, who submits a DCS for standardization
The data categories in the selection should already meet the stricter constraints for standardized data categories (as far as possible)
– justification
– profile(s)
– …

Thematic Domain Groups
TDG 1: Metadata
TDG 2: Morphosyntax
TDG 3: Semantic Content Representation
TDG 4: Syntax
TDG 5: Machine Readable Dictionary
TDG 6: Language Resource Ontology
TDG 7: Lexicography
TDG 8: Language Codes
TDG 9: Terminology
TDG 11: Multilingual Information Management
TDG 12: Lexical Resources
TDG 13: Lexical Semantics
TDG 14: Source Identification
TDGs are the owners and guardians of a coherent subset of the DCR
TDGs own one or more profiles
Each TDG has:
– a chair
– a number of judges (assigned by SC P-members)
– a number of expert members (up to 50%)
TDGs are constituted at the TC37/SC plenary
New TDGs need to be proposed by an SC

Harmonization
When a DC belongs to multiple profiles owned by different TDGs, harmonization may be needed
– one TDG becomes the owner of the DC
– judges from the other TDG(s) are involved in the evaluation process

Stewardship group
Members of the TDG who will maintain the data category
The TDG becomes the owner of a standardized data category
Changes to the data category need to go through the standardization procedure (evaluation by the TDG, validation by the DCR Board)

Using data categories (I)
Each data category has a Persistent Identifier (PID):
– once a data category has been created it can never be deleted, only deprecated or superseded
– the registration authority is obliged to keep these URLs working

Using data categories (II)
This PID can be embedded in the schemata of linguistic resources:
– CMD
– Relax NG
– XML Schema, TEI ODD, TBX, RDF, XML, …
DC Reference vocabulary:

Using data categories (III)
The full data category specification can be downloaded from ISOcat in the Data Category Interchange Format (DCIF)
– DCIF is based on a simplified version of the DCR data model and leaves out some administrative information
– DCIF vocabulary:

Usage scenarios
DC references only:
– find semantic overlap between two or more resources by comparing their DC references
DC references and a schema/component registry:
– find interesting resources (or resource types) by comparing the DC references of schemas/components in the registry
DC references and a network of registries:
– find directly or indirectly related resources via related DCs
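
The first scenario, comparing the DC references of two resources, amounts to a set intersection. The sketch below assumes the references have already been extracted from each resource's schema as PID strings; the `example.org` PIDs and the function names are hypothetical placeholders, not real ISOcat identifiers or API calls.

```python
def shared_data_categories(refs_a, refs_b):
    """Return the DC references that occur in both resources."""
    return set(refs_a) & set(refs_b)


def overlap_score(refs_a, refs_b):
    """Jaccard similarity of two reference sets (0.0 when both are empty)."""
    a, b = set(refs_a), set(refs_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical PIDs standing in for the references found in two schemas.
lexicon_refs = [
    "http://example.org/datcat/DC-1",  # e.g. partOfSpeech
    "http://example.org/datcat/DC-2",  # e.g. writtenForm
    "http://example.org/datcat/DC-3",  # e.g. grammaticalGender
]
corpus_refs = [
    "http://example.org/datcat/DC-1",
    "http://example.org/datcat/DC-4",
]

print(shared_data_categories(lexicon_refs, corpus_refs))
print(overlap_score(lexicon_refs, corpus_refs))
```

Because the comparison uses only PIDs, it works across resources whose schemas use different element names for the same concept.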

Relation Registry
ISOcat contains a ‘flat’ list of concepts
The Relation Registry will support storing (user-specific) relations between these concepts
– is-a
– part-of
– equivalent-to
– related-to
– …
Will support:
1. Ontologies and taxonomies on top of data categories
2. Searches across related data categories
3. …
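
A search across related data categories can be pictured as a graph traversal over stored relations. The following is only an illustration of the idea, not the Relation Registry's design: the class `RelationRegistry`, its methods, the identifiers used, and the decision to treat equivalent-to and related-to as symmetric are all assumptions.

```python
from collections import defaultdict, deque

# Assumption: these relation types hold in both directions.
SYMMETRIC = {"equivalent-to", "related-to"}


class RelationRegistry:
    """Hypothetical in-memory store of relations between data categories."""

    def __init__(self):
        self._edges = defaultdict(set)

    def add(self, subject, relation, obj):
        """Store a relation; symmetric relations are stored both ways."""
        self._edges[subject].add((relation, obj))
        if relation in SYMMETRIC:
            self._edges[obj].add((relation, subject))

    def reachable(self, start, relations):
        """Return all data categories reachable from start via the given
        relation types, following chains of any length (breadth-first)."""
        seen = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for relation, target in self._edges.get(node, ()):
                if relation in relations and target not in seen:
                    seen.add(target)
                    queue.append(target)
        seen.discard(start)
        return seen


registry = RelationRegistry()
registry.add("personalPronoun", "is-a", "pronoun")
registry.add("pronoun", "equivalent-to", "pron")  # "pron" is a made-up private DC

print(registry.reachable("personalPronoun", {"is-a", "equivalent-to"}))
```

Keeping relations outside the flat concept list lets different users maintain different relation sets over the same data categories.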

Registry network
[Figure: network of three layers. Linguistic resources: MPI archive resource, TDS database. Data category registries: MPI DCR, ISO DCR. Relation registries: Typological Database System RR, MPI RR.]

Status of ISOcat
ISOcat is under active development:
– Now:
  – You can access public data categories and selections
  – You can create your own data categories and selections
  – You can share your data categories and selections with others (everyone, or a specified group)
– Future:
  – Some social features (forum to discuss specific data categories)
  – Cleanup of profiles by TDGs
  – Import of external ‘data category’ sets, such as:
    – parts of the ISO Concept Database
    – Dublin Core
    – TEI
  – Standardization workflow
  – High availability (mirrors)
  – Relation registry

Thanks for your attention!