ISOcat Data Model: Workflow & Guidelines Marc Kemps-Snijders a, Sue Ellen Wright b, Menzo Windhouwer a a Max Planck Institute for Psycholinguistics, b.

1 ISOcat Data Model: Workflow & Guidelines
Marc Kemps-Snijders (a), Sue Ellen Wright (b), Menzo Windhouwer (a)
(a) Max Planck Institute for Psycholinguistics, (b) Kent State University
Marc.kemps-snijders@mpi.nl, sellenwright@gmail.com, menzo.windhouwer@mpi.nl
NEERI Helsinki Standards Workshop, 2009-09-30

2 Data category
- The result of the specification of a given data field
- A data category is an elementary descriptor in a linguistic structure or an annotation scheme (ISO 1087-2)
- Linguistic data categories: /part of speech/, /noun/, /verb/, /definition/, /context/, etc.
© DCR Group 2009, www.isocat.org

3 Data Category Applications
- DCs are used as:
  - Field names in databases
  - Permissible values for closed and constrained data categories
  - Tag names and attribute values in annotation frameworks
- DCs are used by:
  - Different broad thematic domains (e.g., terminology, morphosyntax, lexicography)
  - Different communities of practice within a given domain
- Data category selections exist as:
  - Resource tag sets (e.g., tagsets used in major corpora)
  - Standardized sets of field names and values (e.g., TBX Basic, TBX [ISO 29042])

4 Data category
TC 37 practice treats both data fields and enumerated domain values as data categories:
- Open data categories: e.g., term, which can take any value designated as a term
- Closed data categories: e.g., grammatical gender, which takes a set of enumerated values as its content
- Constrained data categories: e.g., Olympic years, which takes as its content values defined by a formal constraint (i.e., every fourth year starting from a certain date)
- Simple data categories: e.g., masculine, a member of an enumerated value domain

5 Data category types
- Complex:
  - open: writtenForm (any string)
  - closed: grammaticalGender (enumerated string: neuter, masculine, feminine)
  - constrained: emailAddress (constrained string; constraint: .+@.+)
- Simple:
  - neuter, masculine, feminine (the members of the enumerated value domain)
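
The four kinds of data category on slides 4 and 5 can be sketched as a small data structure. This is only an illustration and not the ISOcat implementation: the class names `SimpleDC` and `ComplexDC`, the `kind` strings, and the `accepts` method are all assumptions made for the example. Only the identifiers, the enumerated values, and the constraint `.+@.+` come from the slide.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SimpleDC:
    """Hypothetical model of a simple DC: one member of an enumerated value domain."""
    identifier: str


@dataclass
class ComplexDC:
    """Hypothetical model of a complex DC: open, closed, or constrained."""
    identifier: str
    kind: str                              # "open", "closed", or "constrained"
    value_domain: frozenset = frozenset()  # simple DCs, used when kind == "closed"
    constraint: Optional[str] = None       # regular expression, used when kind == "constrained"

    def accepts(self, value: str) -> bool:
        """Return True if the value is permissible for this data category."""
        if self.kind == "open":
            return True
        if self.kind == "closed":
            return SimpleDC(value) in self.value_domain
        if self.kind == "constrained":
            return re.fullmatch(self.constraint, value) is not None
        raise ValueError(f"unknown kind: {self.kind}")


# The three examples from the slide.
written_form = ComplexDC("writtenForm", "open")
grammatical_gender = ComplexDC(
    "grammaticalGender",
    "closed",
    value_domain=frozenset(SimpleDC(v) for v in ("neuter", "masculine", "feminine")),
)
email_address = ComplexDC("emailAddress", "constrained", constraint=r".+@.+")
```

With these definitions, `grammatical_gender.accepts("masculine")` holds while `grammatical_gender.accepts("dual")` does not, and `email_address` accepts any string with at least one character on each side of an `@`.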

6 Data category relationships
- Value domain membership
- Subsumption relationships between simple data categories
- Relationships between complex data categories are not stored in the DCR
- Example (diagram): partOfSpeech (enumerated string) has pronoun in its value domain; pronoun subsumes personal pronoun
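
A subsumption relationship between simple data categories can be illustrated with a parent map. This is a minimal sketch and does not reflect how the DCR stores the relationship: the `IS_A` table, the camel-case identifier `personalPronoun`, and the `subsumes` function are hypothetical names chosen for the example.

```python
from typing import Dict, Optional

# Hypothetical parent map: each simple DC points to the DC that directly subsumes it.
IS_A: Dict[str, Optional[str]] = {
    "pronoun": None,               # no broader simple DC recorded in this sketch
    "personalPronoun": "pronoun",  # pronoun subsumes personal pronoun
}


def subsumes(general: str, specific: str, is_a: Dict[str, Optional[str]] = IS_A) -> bool:
    """Return True if `general` equals `specific` or is one of its ancestors."""
    node: Optional[str] = specific
    while node is not None:
        if node == general:
            return True
        node = is_a.get(node)
    return False
```

Here `subsumes("pronoun", "personalPronoun")` is true and the reverse is false. The check is reflexive by design, so every data category subsumes itself.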

7 Data Category Registry (DCR)
- A set of data categories to be used as a reference for the definition of linguistic annotation schemes or any other formats used in the area of language resources
- Implemented as the TC 37 ISOcat registry
- Registration Authority: Max Planck Institute for Psycholinguistics, Nijmegen
- Open and accessible at: http://www.isocat.org
- Come play with the cat! But he's a bit fussy and likes to have people follow some simple rules!
- The simple rules are spelled out in the DCR Guidelines.

8 ISOcat model and mission & metaphor
- Not a layered onion …
- A segmented aggregate, like a knob of garlic, instead:
  - Cloves are sets of private data categories
  - The center stem represents the standardization core
- Many DCs and DCSs may never be intended for standardization
- Only the standardized core is described in ISO 12620:2009
- Need to define non- and pre-standardization procedures

9 ISOcat Data model
The ISO 12620 data model consists of 3 main parts:
- Administrative part
  - Administration and identification
- Descriptive part
  - Documentation and information for working language or languages
  - Data element names and identifiers
  - Data element concept definitions
- Linguistic part
  - Conceptual domain of object language
  - Data element type declarations
  - Special object language constraints

10 Data Model and DC life cycle
- Part 1 of the ISOcat data model reflects the DC standardization cycle
- Major steps in the workflow = classes in the DC model
- But the creation cycle precedes standardization
- A DC must be created, and ideally discussed in a group, before the standardization process even begins.
- Not all DCs will be standardized.
- Callout on the diagram: The process starts out here, and we need to define this process.

11 Non- & Pre-Standardization Workflow
- DC created in private work space
- Option: DC remains private
- Option: assign DC to a group
- Option: DC discussed & revised in the group to achieve consensus
- Option: DC used in group
- Option: DC used widely by public group
- Option: DC submitted for standardization
- Standards process starts with submission
- Diagram labels: Standardized Core, DCS, DCR
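
The options on slide 11 can be read as stages a data category moves through before standardization. The sketch below is one possible reading and not a description of ISOcat itself: the `Stage` enum, the `ALLOWED_NEXT` table, and the `advance` function are hypothetical, and the assumption that a DC passes through a group before submission is an interpretation of the slide's ordering rather than a stated rule.

```python
from enum import Enum


class Stage(Enum):
    PRIVATE = "created in private work space"
    GROUP = "assigned to a group"
    SUBMITTED = "submitted for standardization"


# Assumed transition table. A DC may also stay in any stage indefinitely,
# since every step on the slide is an option rather than an obligation.
ALLOWED_NEXT = {
    Stage.PRIVATE: {Stage.GROUP},
    Stage.GROUP: {Stage.SUBMITTED},
    Stage.SUBMITTED: set(),  # from here the standards process takes over
}


def advance(current: Stage, target: Stage) -> Stage:
    """Move a DC to the target stage, or raise ValueError if the step is not allowed."""
    if target not in ALLOWED_NEXT[current]:
        raise ValueError(f"cannot move from {current.name} to {target.name}")
    return target
```

Discussion, revision, and use within a group are left out of the table because they do not change which stage the DC is in under this reading.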

12 Cascade of Responsibility
- ISOcat Model
  - Designed by the ISOcat development group
  - Approved by TC 37
  - Standardized in ISO 12620:2009
- ISOcat input template, interface presentation
  - Implementation by the ISOcat programmer/system administrator
  - Approved by development group
  - Scrutiny of beta testers, user community
- ISOcat Guidelines for data category specifications: http://www.isocat.org/manual/DCRGuidelines.pdf
  - Instantiation by the individual expert user
  - Scrutiny by other users, eventually by DCR TDGs/DCRB

13 ISO 12620 Overview
- Three parts
- Linchpin: Data Category

14 Part 1 Global Information & Administration Information Section

15 Identifiers: Responsibilities
- Global Information
  - Non-mnemonic identifier (Key): system-assigned internal identifier
  - Persistent identifier (PID): system-generated external identifier
- Administration Record
  - User-assigned Identifier (required): camel case mnemonic ID
  - Must be an XML-valid element name (without a namespace)
  - Valid: partOfSpeech
  - Not valid: my:POS (namespace prefix), 123POS (starts with a digit)
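
The identifier rule on slide 15 can be approximated with a regular expression. This is a simplified sketch that assumes ASCII-only identifiers: XML itself permits many non-ASCII name characters that this pattern rejects. It also does not enforce camel case. The function name `is_valid_identifier` is hypothetical and is not part of ISOcat.

```python
import re

# ASCII-only approximation of an XML element name without a namespace prefix:
# a letter or underscore first, then letters, digits, underscores, periods, or hyphens.
# Colons are excluded because they would introduce a namespace prefix.
_NAME_ASCII = re.compile(r"[A-Za-z_][A-Za-z0-9_.\-]*")


def is_valid_identifier(identifier: str) -> bool:
    """Return True if the identifier fits the simplified element-name pattern."""
    return _NAME_ASCII.fullmatch(identifier) is not None
```

Applied to the slide's examples, `partOfSpeech` passes, while `my:POS` fails on the colon and `123POS` fails on the leading digit.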

16 Justification: Creator Responsibility
- The justification for /part of speech/ is obvious, but that is not true of every DC for every potential user.
- Required for standardization
- Highly desirable for any DC that will be shared outside a private scope

17 Administration Information Section
- Implementation of the standardization workflow
  - Embodied in the information workflow associated with the standardization process
  - Standardized in ISO 12620:2009 in compliance with ISO Directives Annex ST for Standards as Databases
  - Represented by the flowcharts on slides 19 and 20
- Responsibility:
  - Thematic Domain Groups (TDGs), which act as stewards in maintaining data category specifications (DCs) and data category selections (DCSs)
  - Data Category Registry Board (DCRB), which validates DCs and DCSs and endeavors to harmonize among TDGs

18 Data category: The standardization option
- Data categories can be kept private or submitted to the standardization process, in which case they are assigned to a Thematic Domain Group, which judges them.
- Diagram: DCR Board, with TDG metadata, TDG morphosyntax, TDG terminology, TDG …
- At regular intervals, snapshots of the standardized subset of the DCR will be submitted to ISO to form a standard as database according to Annex ST of the ISO/IEC Directives.

19 TDG Role: Maintenance Team

20 DCRB Role: Validation Role

21 Part 2 Descriptive Part
- Describes equivalents in working languages; English data element name, definition, and justification comment required
- Database-, format-, or application-specific data element names
- Rigorous terminological definition consisting of a single sentence fragment linked to a logical concept system

22 Part 2: Guideline Responsibilities
- Data Element Name: language-independent name for the data category used in a specific application domain (specified in the Source)
  - PoS / POS / pos are all common short forms used for /part of speech/ in various application environments.
- Name Section in a Language Section (at least one required in the English Language Section; multiple Names in multiple Language Sections permitted)
  - Human-legible (mnemonic) name
  - "part of speech" in the English language section
  - "partie du discours" in the French language section

23 One English (en) Name required. Multiple English Names optional. Multiple Names in other languages optional.
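
The naming rule on slides 22 and 23 amounts to a simple completeness check. The sketch below assumes, purely for illustration, that language sections are held in a dictionary keyed by language code with a list of names per language. Neither that layout nor the function name `has_required_english_name` comes from ISOcat.

```python
from typing import Dict, List


def has_required_english_name(language_sections: Dict[str, List[str]]) -> bool:
    """Return True if the English ("en") section holds at least one non-blank name."""
    return any(name.strip() for name in language_sections.get("en", []))


# Example based on the slide: one English name plus an optional French name.
part_of_speech_names = {
    "en": ["part of speech"],
    "fr": ["partie du discours"],
}
```

A specification that only carried the French name would fail the check, since names in other languages are optional but do not replace the English one.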

24 Part 2: Guideline Responsibilities
- Definition:
  - Rigorous intensional definitions (ISO 704)
  - Single sentence fragment
  - Additional information goes in comments fields, justification, etc.
  - Example (translated from German): "The class of words of a language (broader concept) … on the basis of their assignment (characteristic) according to shared grammatical features (characteristic)."
- Source: the source for any quoted material; here: Wikipedia

25 Part 3 Linguistic Part

26 Data category Linguistic part
- Complex, constrained and simple data categories are explicitly modeled here
- Constraints for a given object language
- Enumeration of permissible values in closed value domains

27 Data category Linguistic part (example)
- Data category: /grammatical gender/
  - Conceptual domain: /masculine/, /feminine/, /neuter/ (lists all admissible values for all languages)
  - Linguistic Section, language fr: value domain /masculine/, /feminine/ (lists all admissible values for French)
  - Linguistic section values must be a subset of the defined conceptual domain.
- Data category: /part of speech/, value: /partitive/
  - Limited in the Linguistic Section to French
  - Issue with the partitive case in Finnish: some values are very language dependent
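
The subset rule from slide 27 is a plain set comparison. A minimal sketch follows. The function name `values_outside_conceptual_domain` and the constant names are hypothetical, and the value sets are the ones listed on the slide.

```python
from typing import Iterable, Set


def values_outside_conceptual_domain(
    conceptual_domain: Iterable[str], value_domain: Iterable[str]
) -> Set[str]:
    """Return the language-specific values that are missing from the conceptual domain.

    An empty result means the linguistic section satisfies the subset rule.
    """
    return set(value_domain) - set(conceptual_domain)


# /grammatical gender/ as given on the slide.
GENDER_CONCEPTUAL_DOMAIN = {"masculine", "feminine", "neuter"}
FRENCH_VALUE_DOMAIN = {"masculine", "feminine"}
```

For the French linguistic section the result is empty, so the section is valid. A value domain containing a value not listed in the conceptual domain would be reported back by name.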

28 QA Components
- Option for ad hoc group validation
- TDG approval during standardization
- DCRB harmonization & validation
- ISOcat Checker

29 Thank you for your attention
- Come play with the cat! http://www.isocat.org
- http://blogs.warwick.ac.uk/jmiles/tag/shadow_of_the_colossus/

