Relish Rendering Endangered Languages Lexicons Interoperable through Standards Harmonization Marc Kemps-Snijders Max Planck.


Relish: Rendering Endangered Languages Lexicons Interoperable through Standards Harmonization
Marc Kemps-Snijders, Max Planck Institute for Psycholinguistics
SaLTMIL Workshop: Speech and Language Technology for Minority Languages, May 23rd 2010, LREC, Malta

Increase interoperability between endangered language lexica created on both sides of the Atlantic

Background
Lexica constitute an important record of endangered languages.
Diverging European and American standards for data formatting and markup: LIFT/LLIFT vs. LMF, GOLD vs. ISOcat.
Significant effort in tool support by all parties.
Structural differences.
Differences in terms and abbreviations.
Differences in interchange formats.

European and American Projects and Standards
[Diagram. Projects and organizations: MPI, ILIT, Dobes, Intera, DAM-LR, ECHO, CLARIN, LEGO, EMELD, GOLD Community (data-driven ontology), SIL, UF. Centre: lexicons of endangered languages. Standards for terminology: DCR (ISO IS 12620:2009), GOLD. Standards for lexicons: LMF (ISO FDIS 24613:2008), LIFT.]

Methodology: bottom-up approach
Analyze existing lexica to identify commonalities and differences in lexical structure and content.
Languages: Tofa, Udi, Archi, Iwaidja, Mocovi, Salar, Kayardild

LLIFT example

<!-- this is where we'll keep the original id, since we may need it and we have to put an underscore in front of the entry id, so that it conforms to the datatype id format. We considered using as it seemed more semantically appropriate, but would require inside it, which would in turn require a language attribute, and we don't want that. has no appropriate attributes we could use, either. -->
<!-- regarding the lang attribute: The format (based on RFC 4646bis or superseding document): ISO language code-script type-ISO country code. Only the ISO language code is really necessary, though.
Q: What to do if we need more than one language code to cover a given form, though? For instance, in Tamashek, where what Heath calls 'dialects' have separate ISO codes?
A: Use the private use 'x-' format, ie: taq-x-ttq-thz. NB everything following the x- is considered private use, so put anything conforming to the standard first.
OR: x-qta (use a temp code, and map it in a URI to all three required codes) Not sure if this would work if we're trying to map individuals to their different possible combinations of dialects, though. -->
cow
<!-- alternate spellings or forms - these can't have any different meaning or grammatical info, as variant can't have under it. -->
dabere
dabbere

Shoebox example

\_sh v Iwaidja
\_DateStampHasFourDigitYear

\lx a
\lc Lexical citation ((R) => root)
\ps Part of speech
\de Definition
\ge Gloss-English
\re Reversal
\xv Example vernacular
\xe Example English
\rf Reference for example
\dt 11/Jul/2007

\lx a-
\lc a-
\a a-
\ps v. prefix
\de third person plural intransitive subject prefix
\ge 3pl
\re they
\ng This is the neutral form; the 'towards' form is |fv{ayuwu-}, 'away' form is |fv{ijb-} ~ |fv{ijuwu-}
\sd verb prefix
\sd inflectional prefix
\rf PL93
\xv Amalkban.
\xe They move outside.
\dt 15/Jul/2007

\lx a-
\lc a-
\a a-
\ps n. pref.
\de their (with possessed body parts)
\ge 3pl
\re their (with possessed body parts)
\sd noun prefix
\sd inflectional prefix
\dt 29/Nov/2006

Lexus example

<lexicalEntry><headword_x0020_group> 11/Jul/ /Jul/2007 <headword>a</headword> Lexical citation ((R) => root) <part_x0020_of_x0020_speech_x0020_group><part_x0020_of_x0020_speech/><sense_x0020_number_x0020_group><contextualized_x0020_example_x0020_group><example_x0020__x0028_free_x0020_translation_x0029_/><contextualized_x0020_example/></contextualized_x0020_example_x0020_group><definition_x0020_group><English_x0020_reversal/><English_x0020_gloss/><definition/></definition_x0020_group><reference_x0020_group><reference/></reference_x0020_group></sense_x0020_number_x0020_group></part_x0020_of_x0020_speech_x0020_group></headword_x0020_group></lexicalEntry>
<lexicalEntry><headword_x0020_group> 12/Jul/ /Jul/2007 <headword>^(d)angkarranaka</headword><citation_x0020_form>angkarranaka</citation_x0020_form><part_x0020_of_x0020_speech_x0020_group><part_x0020_of_x0020_speech>?</part_x0020_of_x0020_speech><sense_x0020_number_x0020_group><reference_x0020_group><reference>IwNo05:19Ap</reference></reference_x0020_group><contextualized_x0020_example_x0020_group>ce></reference_x0020_group><_x0032_D_x0020_group> The d-initial form is found after prefixes ending in K- ; elsewhere the root begins with |fv{a}. The citation form is |fv{dangkarranaka}. </_x0032_D_x0020_group>
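To make the structure of the Shoebox example concrete, here is a minimal Python sketch that splits backslash-coded Standard Format text into records. It is not part of the Relish tooling. The function name `parse_standard_format` is hypothetical, and the sketch assumes that each field starts on its own line, that `\lx` opens a new record, and that header lines such as `\_sh` can be skipped.

```python
import re
from typing import List, Tuple

# A record is an ordered list of (marker, value) pairs; markers such as \sd may repeat.
Record = List[Tuple[str, str]]

# A field line: backslash, marker name, optional value.
MARKER = re.compile(r"\\(\S+)[ \t]*(.*)")


def parse_standard_format(text: str, record_marker: str = "lx") -> List[Record]:
    """Split Standard Format text into records (hypothetical helper)."""
    records: List[Record] = []
    current: Record = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = MARKER.match(line)
        if match is None:
            # Continuation line: append it to the previous field's value.
            if current:
                marker, value = current[-1]
                current[-1] = (marker, (value + " " + line).strip())
            continue
        marker, value = match.group(1), match.group(2).strip()
        if marker.startswith("_"):
            continue  # assumption: header lines like \_sh carry no lexical data
        if marker == record_marker and current:
            records.append(current)
            current = []
        current.append((marker, value))
    if current:
        records.append(current)
    return records


# Abridged from the Iwaidja records on the slide.
SAMPLE = r"""
\_sh v Iwaidja
\lx a-
\ps v. prefix
\ge 3pl
\sd verb prefix
\sd inflectional prefix
\xv Amalkban.
\xe They move outside.
\dt 15/Jul/2007

\lx a-
\ps n. pref.
\ge 3pl
\dt 29/Nov/2006
"""

records = parse_standard_format(SAMPLE)
for record in records:
    print(dict(record).get("lx"), "|", dict(record).get("ps"))
```

Keeping each record as an ordered list of pairs rather than a dictionary preserves repeated markers (two `\sd` fields in the first record) and field order, both of which matter when comparing lexical structure across formats.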

Methodology: top-down approach
Analyze existing standards for lexical resources (GOLD/LIFT and LMF/DCR) to identify commonalities and differences at the conceptual level.
Harmonize concepts using the ISO Data Category Registry.
Harmonize model approaches.
Harmonize interchange formats.

Harmonizing data categories
All linguistic concepts will be registered in the ISO Data Category Registry (ISOcat).
Analysis of existing ISOcat data categories vs. GOLD vs. MDF.
[Diagram: ISOcat Data Category Registry, GOLD Community, MDF type file]

MDF type file

\+DatabaseType MDF 4.0
\ver 5.0
\desc Standard Format markers defined in _Making Dictionaries: A guide to lexicography and the Multi-Dictionary Formatter_. David F. Coward, Charles E. Grimes, and Mark R. Pedrotti. Waxhaw, NC: SIL, (2nd edition)
\+mkrset
\lngDefault English
\mkrRecord lx

\+mkr an
\nam Antonym
\desc Used to reference an antonym of the lexeme, but using the \lf (lexical function) field for this is better practice.
\lng vernacular
\mkrOverThis sn
\CharStyle
\-mkr

\+mkr bw
\nam Borrowed word (loan)
\desc Used for denoting the source language of a borrowed word.
\lng English
\mkrOverThis se
\CharStyle
\-mkr

\+mkr ce
\nam Cross-ref. gloss (E)
\desc Gives the English gloss(es) for the vernacular lexeme referenced by the preceding \cf field.
\lng English
\mkrOverThis cf
\CharStyle
\-mkr
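Comparing MDF markers against ISOcat and GOLD starts with an inventory of the markers a type file declares. The following Python sketch collects the `\+mkr ... \-mkr` blocks into a dictionary. The function name `parse_marker_definitions` is hypothetical, and the sketch assumes one field per line, which is how the type file would normally be laid out before slide extraction flattened it.

```python
from typing import Dict


def parse_marker_definitions(text: str) -> Dict[str, Dict[str, str]]:
    """Collect the fields of each \\+mkr ... \\-mkr block (hypothetical helper)."""
    definitions: Dict[str, Dict[str, str]] = {}
    current = None  # name of the marker block currently open, if any
    for raw in text.splitlines():
        line = raw.strip()
        if not line.startswith("\\"):
            continue
        key, _, value = line[1:].partition(" ")
        value = value.strip()
        if key == "+mkr":
            current = value
            definitions[current] = {}
        elif key == "-mkr":
            current = None
        elif current is not None:
            definitions[current][key] = value
        # Fields outside any marker block (e.g. \ver, \mkrRecord) are ignored here.
    return definitions


# Abridged from the MDF type file on the slide; \desc fields are omitted for brevity.
TYPE_FILE = r"""
\+DatabaseType MDF 4.0
\ver 5.0
\+mkrset
\lngDefault English
\mkrRecord lx
\+mkr an
\nam Antonym
\lng vernacular
\mkrOverThis sn
\CharStyle
\-mkr
\+mkr bw
\nam Borrowed word (loan)
\lng English
\mkrOverThis se
\CharStyle
\-mkr
"""

marker_defs = parse_marker_definitions(TYPE_FILE)
for name, fields in marker_defs.items():
    print(name, "->", fields["nam"], "(under \\" + fields["mkrOverThis"] + ")")
```

The resulting marker names and labels (`an` for Antonym, `bw` for Borrowed word) are the items that would then be lined up against existing data categories.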

Harmonizing data categories
Example: part of speech
[Diagram panels: ISOcat MorphoSyntax profile; GOLD ontology. Labels: Determiner, Definite article, PartOfSpeech, article, Indefinite article, "Is a ...", Complex Closed, Simple]

MDF (Multi-Dictionary Formatter)

\+mkr ps
\nam Part of speech
\desc Classifies the part of speech. This must reflect the part of speech of the vernacular lexeme (not the national or English gloss). Consistent labeling is important; use the Range Set feature. Sense numbers are beneath \ps in this hierarchy; don't mark different \ps fields with sense numbers.
\lng English
\rngset adj adv …… n num pn post prtcl v
\mkrOverThis se
\mkrFollowingThis va
\CharStyle
\-mkr
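The `\ps` definition asks for consistent labels via a range set. As a rough illustration, the Python sketch below flags part-of-speech values that fall outside a given range set. The function name `out_of_range` is hypothetical, and the set contains only the values visible on the slide; the slide elides the middle of the list, so the set is incomplete by construction.

```python
from typing import Iterable, List, Set, Tuple

# Values visible in the \rngset line on the slide. The slide elides part of the
# list ("……"), so this set is deliberately incomplete.
VISIBLE_RANGE_SET: Set[str] = {"adj", "adv", "n", "num", "pn", "post", "prtcl", "v"}


def out_of_range(values: Iterable[str], range_set: Set[str]) -> List[Tuple[int, str]]:
    """Return (position, value) for every value not in the range set."""
    return [(i, v) for i, v in enumerate(values) if v.strip() not in range_set]


# \ps values, two of them taken from the Iwaidja Shoebox records.
ps_values = ["v", "n", "v. prefix", "n. pref."]
flagged = out_of_range(ps_values, VISIBLE_RANGE_SET)
print(flagged)
```

Labels such as "v. prefix" and "n. pref." are flagged here, which is the kind of lexicon-specific abbreviation that has to be mapped to a shared data category before lexica can be compared.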

Harmonizing data categories
GOLD example 2
In some cases GOLD contains additional information:
additional extensions to the conceptual domain
isA relations between GOLD concepts
[Diagram: GOLD ontology]

Harmonizing data categories: Relation Registries
Relation Registries describe relations not handled through the ISO model.
Simple relations, e.g. MDF /PartOfSpeech/ 'equals' MorphoSyntax /PartOfSpeech/
GOLD relations (the GOLD ontology is a Relation Registry)
Compositional relations (a DC is composed of multiple more granular DCs), e.g. Udi MDF \1d (First dual) → person:firstPerson, grammaticalNumber:dual, value:…
Model-specific relations, e.g. the TBX model
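One way to picture a Relation Registry is as a small lookup structure kept outside the data category definitions themselves. The Python sketch below models the two simplest relation types from this slide: 'equals' links and compositional relations. The class and method names are hypothetical and do not reflect any ISOcat or RELISH API; the data comes from the slide's own examples, and the elided `value:…` part is left out.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class DataCategory:
    """A data category identified by its source and name (illustrative only)."""
    source: str  # e.g. "MDF", "MorphoSyntax"
    name: str    # e.g. "PartOfSpeech"


class RelationRegistry:
    """Stores relations between data categories (hypothetical sketch)."""

    def __init__(self) -> None:
        self._equals: Dict[DataCategory, List[DataCategory]] = {}
        self._composed: Dict[DataCategory, Dict[str, str]] = {}

    def add_equals(self, a: DataCategory, b: DataCategory) -> None:
        # 'equals' is symmetric, so record it in both directions.
        self._equals.setdefault(a, []).append(b)
        self._equals.setdefault(b, []).append(a)

    def add_composition(self, dc: DataCategory, parts: Dict[str, str]) -> None:
        # A coarse category is composed of several more granular ones.
        self._composed[dc] = dict(parts)

    def equivalents(self, dc: DataCategory) -> List[DataCategory]:
        return list(self._equals.get(dc, []))

    def decompose(self, dc: DataCategory) -> Dict[str, str]:
        return dict(self._composed.get(dc, {}))


registry = RelationRegistry()
mdf_pos = DataCategory("MDF", "PartOfSpeech")
iso_pos = DataCategory("MorphoSyntax", "PartOfSpeech")
registry.add_equals(mdf_pos, iso_pos)

first_dual = DataCategory("Udi MDF", "1d")
registry.add_composition(
    first_dual, {"person": "firstPerson", "grammaticalNumber": "dual"}
)

print(registry.equivalents(mdf_pos))
print(registry.decompose(first_dual))
```

Keeping relations in a separate structure means the same data categories can be related differently by different communities without changing the registry entries.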

Harmonizing data categories: Relation Registries
[Diagram: relation registries, data category registries, resource registries]

Harmonizing interchange formats
Possibility to use TEI?
Can TEI serve as an interchange format for LMF and be accepted by the CLARIN community?
A decision needs to be made before the end of 2010 to be useful for RELISH.
ODD (One Document Does it all): documentation and schema information; schema documents validate the XML data structure.
In August a workshop will be held to discuss the possibility of using TEI as an interchange format, with representatives from ISO, CLARIN, TEI and the endangered languages community.

Adapting the tools
The Relish project will result in tool adaptations to support the interoperability aspects and interchange formats.

Conclusions and remarks
Minority and less-resourced languages and tools:
are starting to actively participate in the standards discussions
are becoming part of the e-infrastructure landscape
have the opportunity to play a mature role in the area of language resources
We need organizations and individuals who are actively involved and represent the position of less-resourced languages in these discussions.
Results from the Relish project may be useful for other less-resourced language resources as well.

Thank you for your attention.
Relish was made possible through the DFG/NEH Bilateral Digital Humanities Program.