ISOcat Data Category Registry
Defining widely accepted linguistic concepts
Menzo Windhouwer
CLARIN-NL MD tutorial, 24-25 September 2009

ISOcat: a reference implementation
ISO 12620:2009
– Terminology and other language and content resources: Specification of data categories and management of a Data Category Registry for language resources
– ISO 12620:1999 was a fixed list of data categories; this revision provides a data model and management procedures
ISO Technical Committee 37
– Terminology and other language and content resources

ISO 24613:2008 Lexical Markup Framework
[Diagram: LMF class model with Lexicon, Lexical Entry, Form, Sense, Word Form and Lemma, multiplicities 0..* and 1..*, annotated with the data categories partOfSpeech, writtenForm, grammaticalNumber and lexicalType]

Data categories
“result of the specification of a given data field” (ISO 12620:2009)
data element concept (ISO 11179)
– “concept for which the definition, identification and conceptual domain are specified independently of any particular representation”
complex data categories are data element concepts

Data category types
[Diagram: examples of data category types]
complex:
– open: writtenForm (string)
– closed: grammaticalGender (string), with the values neuter, masculine, feminine
– constrained: string
simple:
– neuter, masculine, feminine
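
The distinction between simple data categories and the three kinds of complex data categories can be sketched as a small data structure. This is only an illustrative Python model: the class names (`SimpleDC`, `ComplexDC`), the `accepts` method and the regular-expression constraint are assumptions made for the example and are not part of ISOcat or ISO 12620.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class SimpleDC:
    """A simple data category: an atomic value such as 'neuter'."""
    identifier: str


@dataclass
class ComplexDC:
    """A complex data category with a data type and a conceptual domain."""
    identifier: str
    data_type: str = "string"
    kind: str = "open"  # "open", "closed" or "constrained"
    values: List[SimpleDC] = field(default_factory=list)  # closed only
    pattern: Optional[str] = None  # hypothetical constraint, constrained only

    def accepts(self, value: str) -> bool:
        """Check a value against the conceptual domain of this category."""
        if self.kind == "open":
            return True
        if self.kind == "closed":
            return any(v.identifier == value for v in self.values)
        if self.kind == "constrained":
            return (self.pattern is not None
                    and re.fullmatch(self.pattern, value) is not None)
        raise ValueError(f"unknown kind: {self.kind}")


written_form = ComplexDC("writtenForm")
grammatical_gender = ComplexDC(
    "grammaticalGender",
    kind="closed",
    values=[SimpleDC("neuter"), SimpleDC("masculine"), SimpleDC("feminine")],
)
# Hypothetical constrained category, invented for this example only.
two_letter_code = ComplexDC("twoLetterCode", kind="constrained",
                            pattern=r"[a-z]{2}")
```

In this sketch a closed category simply lists the simple data categories in its value domain, which mirrors the grammaticalGender example on the slide.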

Data category relationships
Value domain membership
Subsumption relationships between simple data categories
Relationships between complex data categories are not stored in the DCR
[Diagram: partOfSpeech (string) with the value pronoun; personal pronoun is subsumed by pronoun]

Data category specification
Administration Information Section
Description Section
– Data Element Name
– Language Section
  Name Section
Conceptual Domain
Linguistic Section
– Conceptual Domain
Mandatory:
1. A mnemonic identifier
2. An English definition
3. An English name
4. A conceptual domain
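
The four mandatory parts listed above lend themselves to a simple completeness check. The sketch below is a minimal Python illustration; the dictionary layout and key names (`identifier`, `english_definition`, `english_name`, `conceptual_domain`) are assumptions for the example and do not reflect the actual DCR data model or any ISOcat API.

```python
# Hypothetical key names for the four mandatory parts of a specification.
MANDATORY_PARTS = (
    "identifier",
    "english_definition",
    "english_name",
    "conceptual_domain",
)


def missing_mandatory(spec: dict) -> list:
    """Return the mandatory parts that are absent or empty in a specification."""
    return [part for part in MANDATORY_PARTS if not spec.get(part)]


draft = {
    "identifier": "partOfSpeech",
    "english_name": "part of speech",
    "english_definition": "",          # still to be written
    "conceptual_domain": "string",
}

print(missing_mandatory(draft))
```

An empty result would mean that the draft carries all four mandatory parts; it says nothing about their quality.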

Guidelines for data categories (I)
Identifier:
– camel case and XML-valid element name (without a namespace)
– valid: partOfSpeech; invalid: my:POS, 123POS
Data Element Name:
– language independent name for the data category used in a specific application domain (specified in the source)
– example: PoS in TBX
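
The identifier guideline can be approximated with a regular expression. The sketch below is a deliberate simplification and an assumption of this example: it accepts only ASCII lower camel case (a lowercase letter followed by letters and digits), which is stricter than the full XML name grammar but is enough to separate the slide's examples.

```python
import re

# Simplified rule: starts with a lowercase ASCII letter, then letters/digits.
# This excludes namespace prefixes (':') and leading digits.
_IDENTIFIER = re.compile(r"[a-z][A-Za-z0-9]*")


def is_valid_identifier(identifier: str) -> bool:
    """Return True if the identifier fits the simplified camel-case rule."""
    return _IDENTIFIER.fullmatch(identifier) is not None


for candidate in ("partOfSpeech", "my:POS", "123POS"):
    print(candidate, is_valid_identifier(candidate))
```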

Guidelines for data categories (II)
Name Section in a Language Section
– legible name
– ‘part of speech’ in the English language section
– ‘partie du discours’ in the French language section
Definition:
– intensional definitions (ISO 704)
– should consist of a single sentence fragment
Source:
– add a source for any quoted material

Guidelines for data categories (III)
Justification:
– a simple statement justifying the relevance of the data category to the field of language resources
– especially needed for standardization

Private versus standard
The standard subset of data categories in the registry should be coherent
The coherence is guarded by Thematic Domain Groups and the DCR Board
Standard data categories need to meet more constraints than private ones:
– mandatory justification
– DC relations demand profile overlap
– …

Data Category Selections
Anyone
1. can register with ISOcat
2. can create data categories
3. can create data category selections (DCSs)
4. can share DCSs
5. can make DCSs public
6. may submit DCSs for standardization

Profiles versus DCSs
Profile membership is part of the DC specification
– the profile indicates the thematic domain of the DC
– the profile view in the UI is created by a query
– there is a limited number of profiles
A DCS is a collection of DCs
– hand-picked by a user for a specific purpose
– can contain DCs from various profiles
– there can be an unlimited number of DCSs
There isn’t (yet) a profile-specific view on a DCS

ISO standardization process
[Diagram: standardization workflow involving the Submission group, the Thematic Domain Group (Evaluation), the Data Category Registry Board (Validation), the Stewardship group, and ISO (Publication)]

Submission group
The owner, possibly together with a group of users, who submits a DCS for standardization
The data categories in the selection should already meet the stricter constraints for standardized data categories (as far as possible)
– justification
– profile(s)
– …

Thematic Domain Groups
TDG 1: Metadata
TDG 2: Morphosyntax
TDG 3: Semantic Content Representation
TDG 4: Syntax
TDG 5: Machine Readable Dictionary
TDG 6: Language Resource Ontology
TDG 7: Lexicography
TDG 8: Language Codes
TDG 9: Terminology
TDG 11: Multilingual Information Management
TDG 12: Lexical Resources
TDG 13: Lexical Semantics
TDG 14: Source Identification
TDGs are the owners and guardians of a coherent subset of the DCR
TDGs own one or more profiles
Each TDG has:
– a chair
– a number of judges (assigned by SC P members)
– a number of expert members (up to 50%)
TDGs are constituted at the TC37/SC plenary
New TDGs need to be proposed by an SC

Harmonization
When a DC belongs to multiple profiles that are owned by different TDGs, harmonization may be needed
– one TDG becomes the owner of the DC
– judges from the other TDG(s) are involved in the evaluation process

Stewardship group
Members of the TDG who will maintain the data category
The TDG becomes the owner of a standardized data category
Changes to the data category need to go through the standardization procedure (evaluation by the TDG, validation by the DCR Board)

Using data categories (I)
Each data category has a Persistent Identifier (PID):
– once a data category has been created it can never be deleted, only deprecated or superseded
– the registration authority is obliged to keep these URLs working
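
Because a data category is never deleted, a consumer holding an old PID can always look it up and, if it has been superseded, follow the chain to its successor. The sketch below illustrates that idea in Python under stated assumptions: the record layout (`status`, `superseded_by`) and the PIDs are invented for the example and are not ISOcat's actual data model or identifiers.

```python
def current_pid(pid: str, registry: dict) -> str:
    """Follow 'superseded_by' links until a record without a successor."""
    visited = set()
    while True:
        if pid in visited:
            raise ValueError(f"supersession cycle at {pid}")
        visited.add(pid)
        successor = registry[pid].get("superseded_by")
        if successor is None:
            return pid
        pid = successor


# Hypothetical registry content; records are never removed, only re-labelled.
registry = {
    "pid:example/1": {"status": "superseded", "superseded_by": "pid:example/2"},
    "pid:example/2": {"status": "active"},
    "pid:example/3": {"status": "deprecated"},
}

print(current_pid("pid:example/1", registry))
```

A deprecated record without a successor resolves to itself in this sketch, since the slide only states that it remains available.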

Using data categories (II)
This PID can be embedded in the schemata of linguistic resources:
– CMD
– Relax NG
– XML Schema, TEI ODD, TBX, RDF, XML, …
DC Reference vocabulary:

Using data categories (III)
The full data category specification can be downloaded from ISOcat in the Data Category Interchange Format (DCIF)
– DCIF is based on a simplified version of the DCR data model, and leaves out some administrative information
– DCIF vocabulary:

Usage scenarios
DC references only:
– find semantic overlap between two or more resources by comparing their DC references
DC references and a schema/component registry:
– find interesting resources (or resource types) by comparing the DC references of schemas/components in the registry
DC references and a network of registries:
– find (in)directly related resources via related DCs
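
The first scenario, finding semantic overlap from DC references alone, amounts to matching elements of two resources that point at the same data category. The following Python sketch shows one way to do this; the element names and the `DC-…` reference strings are placeholders invented for the example, standing in for real PIDs.

```python
def semantic_overlap(resource_a: dict, resource_b: dict) -> dict:
    """Map each shared DC reference to (element in a, element in b) pairs.

    Both arguments map a resource's local element name to the DC reference
    embedded in its schema.
    """
    elements_by_ref = {}
    for element, ref in resource_b.items():
        elements_by_ref.setdefault(ref, []).append(element)

    overlap = {}
    for element, ref in resource_a.items():
        for other in elements_by_ref.get(ref, []):
            overlap.setdefault(ref, []).append((element, other))
    return overlap


# Two hypothetical lexicons with different local names for the same concept.
lexicon_a = {"pos": "DC-partOfSpeech", "lemma": "DC-lemma"}
lexicon_b = {"PoS": "DC-partOfSpeech", "gender": "DC-grammaticalGender"}

print(semantic_overlap(lexicon_a, lexicon_b))
```

The local names differ (`pos` versus `PoS`), but the shared reference is what establishes that both elements carry the same concept.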

Relation Registry
ISOcat contains a ‘flat’ list of concepts
The Relation Registry will support storing (user-specific) relations between these concepts
– is-a
– part-of
– equivalent-to
– related-to
– …
Will support:
1. Ontologies and taxonomies on top of data categories
2. Searches across related data categories
3. …
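
A search across related data categories can be pictured as a graph traversal over stored relations. The sketch below is a hypothetical Python illustration, not the Relation Registry's design: relations are held as (subject, relation, object) triples, the example triples are invented, and links are followed in both directions, which is an assumption of this example.

```python
from collections import deque


def related(start: str, triples: list, relation_types: set) -> set:
    """Return all categories reachable from start via the given relation types."""
    neighbours = {}
    for subject, relation, obj in triples:
        if relation in relation_types:
            # Assumption: links are traversed in both directions.
            neighbours.setdefault(subject, set()).add(obj)
            neighbours.setdefault(obj, set()).add(subject)

    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    seen.discard(start)
    return seen


# Hypothetical user-specific relations between data categories.
triples = [
    ("personalPronoun", "is-a", "pronoun"),
    ("PoS", "equivalent-to", "partOfSpeech"),
    ("pronoun", "related-to", "partOfSpeech"),
]

print(related("personalPronoun", triples, {"is-a", "related-to"}))
```

Restricting `relation_types` keeps a search from drifting along loose `related-to` links when only strict equivalence is wanted.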

Registry network
[Diagram: network of three layers: linguistic resources (MPI archive, TDS; labelled resource and database), data category registries (MPI DCR, ISO DCR) and relation registries (MPI RR, Typological Database System RR)]

Status of ISOcat
ISOcat is under active development:
– Now:
  You can access public data categories and selections
  You can create your own data categories and selections
  You can share your data categories and selections with others (everyone, or a specified group)
– Future:
  Some social features (forum to discuss specific data categories)
  Cleanup of profiles by TDGs
  Import external ‘data category’ sets, such as:
  – parts of the ISO Concept Database
  – Dublin Core
  – TEI
  Standardization workflow
  High availability (mirrors)
  Relation registry

Thanks for your attention!