www.isocat.org ISOcat: How to create a DC (including “do’s and don’ts”), 20 June 2013, CLARIN-NL ISOcat tutorial.

ISOcat: How to create a DC (including “do’s and don’ts”)

Your work with respect to ISOcat
Adopt an existing entry
Create an entry
Link with an existing entry
In all cases: the entries should be GOOD ones
But: what makes an entry a good one, one that you can use?

A good DC
What defines a good DC?
– It should ‘match’ the way you use a specific notion in the annotation scheme, application, … at hand
– It should come with the same profile
– It should handle the same phenomenon: SpeakerID ≠ SingerID

Speaker vs Singer
SingerID and SpeakerID: siblings
SingerID is a subclass of both Singer and ID (RELcat!)
String → Name → Person → Singer → Opera singer → Tenor → Tenor in La Bohème
First: too generic; last: too specific
The others are, in themselves, candidates for DCs

Standards
Hardly any available (cf. morning session)
We really should try to arrive at a series of sound DCs, useful for YOU and for as many other people as possible
=> not too specific, not too general

What defines a good DC?
Meaningful definition
Indefinite pronoun
– Not: pronoun that is indefinite
Unless both ‘pronoun’ and ‘indefinite’ are defined elsewhere AND it is stated explicitly which definitions are involved AND these definitions are correct (for you)

Correct definition
Personal pronoun
– Not: pronoun referring to persons
As in:
That cat has five kittens. SHE …
This table was very expensive but I like IT very much
And John shook HIS head …
[Note: in a particular tagset the definition may be correct! In general it is not.]

Reusable definition
Personal pronoun
Not: In CGN a personal pronoun …
Not: In Dutch a personal pronoun …
Not: A personal pronoun (ik, ikke and ikzelf) is characterized by …
A definition should be as neutral (project, language) as possible, while still valid for your purposes!

Good DC => good name
Sometimes confused:
1. Identifier (≠ PID)
2. Data Element Name
3. Name
Re 1: should be in camelCase format, start with an alphabetical character (not 1stPerson, but firstPerson), be in English, and be meaningful (not EVON, but singularNeuterForm), …

Re 2: the Data Element Name (DEN) field is the proper place to mention abbreviations/tags used for a particular notion, and not just English ones (N, NPlur, EVON)
Re 3: in all Language Sections, the correct full name(s) in the working language at hand are provided
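
The formal part of the identifier rules (Re 1) can be checked mechanically. The sketch below is only an illustration: the function name `identifier_problems` and the exact regular expression are assumptions, not part of ISOcat, and the rules "in English" and "meaningful" still need a human reader.

```python
import re

# Assumed reading of the slide: ASCII letters and digits only, no separators,
# first character a lowercase letter (as in firstPerson, singularNeuterForm).
CAMEL_CASE = re.compile(r"[a-z][a-zA-Z0-9]*")


def identifier_problems(identifier: str) -> list[str]:
    """Return the formal rule violations found; an empty list means none."""
    if not identifier:
        return ["identifier is empty"]
    problems = []
    if not identifier[0].isalpha():
        problems.append("must start with an alphabetical character")
    if not CAMEL_CASE.fullmatch(identifier):
        problems.append("not in camelCase format")
    if identifier.isupper():
        # All-caps strings such as EVON look like tags, which belong in a DEN.
        problems.append("looks like a tag; put it in a Data Element Name")
    return problems


for candidate in ["firstPerson", "1stPerson", "EVON", "singularNeuterForm"]:
    print(candidate, identifier_problems(candidate))
```

A tag such as EVON is not lost by this rule: it moves to the Data Element Name field, while the identifier carries the readable form.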

Decision process: [diagram]

Flagged DCs: why?

Flagged DCs
Try to avoid linking with ‘deprecated’ or ‘superseded’ DCs!
– Do not use DCs with 2 definitions!!
In other cases the flags show whether the DC specification is correct from a more technical point of view
Note that only DCs with a green marking qualify for standardization (or CLARIN-NL/VL recommendation)
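
As a rough illustration of these rules, the sketch below filters a local list of DC records. Everything in it is hypothetical: the `DataCategory` fields, the status strings, and the idea of keeping such a list locally are assumptions made for the example, not an ISOcat interface.

```python
from dataclasses import dataclass


@dataclass
class DataCategory:
    identifier: str
    status: str            # assumed values: "ok", "deprecated", "superseded"
    definition_count: int  # assumed: number of competing definitions in the DC
    green: bool            # True if the DC carries the green marking


def is_linkable(dc: DataCategory) -> bool:
    """Avoid deprecated/superseded DCs and DCs with more than one definition."""
    return dc.status not in {"deprecated", "superseded"} and dc.definition_count == 1


def standardization_candidates(dcs: list[DataCategory]) -> list[DataCategory]:
    """Only linkable DCs with a green marking qualify for standardization."""
    return [dc for dc in dcs if is_linkable(dc) and dc.green]
```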

DC/DCS and profile
Profiles are not added automatically; a DCS may contain elements with various profiles
Profile ‘not available’: only to be used when the correct profile is not contained in the list!
In such a case, use ‘Not available’ for the time being, AND Contact

Which elements to include?
Cf. slide on SingerID/SpeakerID
In general: all linguistically meaningful notions mentioned in your schema, manual, definition PLUS the metadata (CMDI!)
Abbreviations (PST for /past tense/) are to be mentioned as Data Element Name

“Do’s & don’ts”
Do’s:
Create a DCS for your scheme (name project, annotation scheme, …)
Provide a clear definition (short, to the point) for your scheme, application, …
Take care not to leave concepts used in your definition undefined or vague (‘note’ section!)
Use the appropriate profile (NOT: ‘undecided’)
Use appropriate vocabulary (per profile)
Check ‘adopted’ DCs regularly until standardization!

Do’s
When creating a DC, fill out
Justification: used in XYZ, part of tagset N
– Why existing DCs could not be reused!!!!!
Language section
– Always an English language section (+ Dutch!)
– Strong recommendation: sections for the object language(s) and for the working language (e.g. the language in which the manual is written)
– Sections in the various languages should match (i.e. be more or less translations of each other)
Profile
– ‘Undecided’ is NOT correct!

When creating a DC, fill out
Example section
– Note that *negative* examples may be very helpful!
Identifier “foreignWord”
Dutch language section
– example section: the, house, NOT: poster
– explanation section: een woord als ‘poster’ heeft Nederlandse diminutief: postertje, itt house (*housje, *houseje) [in English: a word like ‘poster’ has a Dutch diminutive, postertje, unlike house (*housje, *houseje)]
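
The “fill out” items from these two slides can be turned into a small checklist. This is a minimal sketch under stated assumptions: the dictionary layout, the key names and the function `missing_parts` are invented for illustration and do not reflect how ISOcat stores a DC.

```python
def missing_parts(draft: dict) -> list[str]:
    """List the items from the slides that are still missing in a DC draft."""
    problems = []
    if not draft.get("justification"):
        problems.append("justification missing")
    # 'Undecided' is NOT a correct profile, so an absent profile counts as undecided.
    if (draft.get("profile") or "undecided").strip().lower() == "undecided":
        problems.append("profile is 'undecided'")
    sections = draft.get("language_sections", {})
    if "en" not in sections:
        problems.append("English language section missing")
    for language, section in sections.items():
        if not section.get("definition"):
            problems.append(f"{language}: definition missing")
        if not section.get("examples"):
            problems.append(f"{language}: example section empty")
    return problems


# Hypothetical draft based on the "foreignWord" slide; the definition text is a placeholder.
draft = {
    "identifier": "foreignWord",
    "justification": "",
    "profile": "undecided",
    "language_sections": {
        "nl": {
            "definition": "(definition text goes here)",
            "examples": ["the", "house"],
            "negative_examples": ["poster"],
        },
    },
}
print(missing_parts(draft))
```

Negative examples are kept in their own list here so that they cannot be mistaken for positive ones.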

Example sections
Suppose you want to illustrate a real Dutch phenomenon (‘neuter’ vs ‘non-neuter’):
Example section in the EN language section
– Dutch example with translation in English
Example section in the DE language section
– Dutch example with translation in German
Example section in the EN linguistic section
– EN example
Example section in the DE linguistic section
– DE example with translation in English

Don’ts
Confuse Language and Linguistic section
– The latter contains language-specific values for closed domains
Be (too) language-specific in the definition
Mention the scheme in the definition
Use several definitions in one DC
Circular definitions
Rely on authority
Rely on standardized status
– The definition should fit YOUR scheme, etc.

Questions?

RELcat
“Linking DCs” is not just a ‘nice’ feature
– Proper noun
– Common noun
– Mass noun
– Count noun
are all instances of ‘noun’ (i.e. have an IsA relation with it)
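
The IsA relations named here can be pictured as a small graph. The sketch below is not RELcat's data model or API; the dictionary `IS_A`, the identifiers and the function names are assumptions chosen to mirror the examples on the slides (including SingerID as a subclass of both Singer and ID).

```python
# Hypothetical relation table: each DC maps to the set of its direct IsA parents.
IS_A = {
    "properNoun": {"noun"},
    "commonNoun": {"noun"},
    "massNoun": {"noun"},
    "countNoun": {"noun"},
    "singerID": {"singer", "id"},
}


def ancestors(category: str, relations: dict = IS_A) -> set:
    """Collect every category reachable from `category` via IsA links."""
    found = set()
    stack = [category]
    while stack:
        current = stack.pop()
        for parent in relations.get(current, ()):
            if parent not in found:  # also guards against accidental cycles
                found.add(parent)
                stack.append(parent)
    return found


def is_a(category: str, target: str) -> bool:
    return target in ancestors(category)


print(is_a("massNoun", "noun"))
print(sorted(ancestors("singerID")))
```

With such links in place, a search for ‘noun’ could also return the four more specific DCs, which is why linking is more than a convenience.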

RELcat
Essential for several Dutch tag sets
N(soort, …) comes with 2 DCs:
1. Noun
2. Common
How to relate this to one of the DCs for ‘common noun’, even if we found its definition perfect?
Good news: in progress!

Some considerations
DC N(common) as a unit
DC Noun and DC Common
We must take care that a definition for ‘Common’ is not seen as a definition of ‘common noun’ (i.e. the whole)
We must take care that, when the notion ‘noun’ is used in the definition of ‘common’, it gets the intended reading

More complex
N(soort,mv,dim)
noun(common,plural,diminutive)
More problematic to define as a whole; it is not enough to state: a diminutive common noun used as plural
This doesn’t mean anything!
Possible solution: linking it with the intended readings of the features involved
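
The proposed solution, linking a complex tag to the readings of its features, can be sketched as a decomposition step. The tag syntax `head(feature,feature,…)` and the value pairs come from the slide; the lookup table `FEATURE_TO_DC` and the function `decompose` are hypothetical names for illustration.

```python
import re

# Shape of a tag as shown on the slide: head(feature,feature,...)
TAG = re.compile(r"(?P<head>\w+)\((?P<features>[^()]*)\)")

# Hypothetical lookup from tag-set values to the DC each one should link to.
FEATURE_TO_DC = {
    "N": "noun",
    "soort": "common",
    "mv": "plural",
    "dim": "diminutive",
}


def decompose(tag: str) -> list[str]:
    """Split a complex tag into the DCs of its head and features."""
    match = TAG.fullmatch(tag.strip())
    if match is None:
        raise ValueError(f"not a head(features) tag: {tag!r}")
    parts = [match["head"]]
    parts += [f.strip() for f in match["features"].split(",") if f.strip()]
    unknown = [p for p in parts if p not in FEATURE_TO_DC]
    if unknown:
        raise ValueError(f"no DC known for: {unknown}")
    return [FEATURE_TO_DC[p] for p in parts]


print(decompose("N(soort,mv,dim)"))
```

Each returned name stands for a separately defined DC, so the complex tag itself needs no definition of its own.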

Searching
How to detect which DCs are standardized? Or have a German language section?
How to search using the keys? And what about the language of keywords?
How to detect which DCs ‘belong together’ (unless one mentions the tag set in the definition, e.g.)

Searching
How to search for alternative names (Data Element Names): Konjunktion, Bindewort; Präposition, Verhältniswort
And the results: when not using ‘exact’ match and a specific field, MANY results come up, apparently unordered, while using ‘exact’ + a specific ‘field’ or ‘profile’ may make you miss relevant entries.
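
The trade-off between the two search modes can be reproduced on a toy list. This sketch does not model the ISOcat search itself; the `ENTRIES` list, its field names and the `search` function are assumptions, and only the German names are taken from the slide.

```python
# Toy entries; identifiers are hypothetical, the names come from the slide.
ENTRIES = [
    {"identifier": "conjunction", "names": ["Konjunktion", "Bindewort"]},
    {"identifier": "preposition", "names": ["Präposition", "Verhältniswort"]},
]


def search(entries: list, query: str, exact: bool = False) -> list:
    """Return identifiers whose names match the query, exactly or as a substring."""
    needle = query.casefold()
    hits = []
    for entry in entries:
        names = [name.casefold() for name in entry["names"]]
        if exact:
            matched = needle in names
        else:
            matched = any(needle in name for name in names)
        if matched:
            hits.append(entry["identifier"])
    return hits


print(search(ENTRIES, "wort"))                     # substring match: broad
print(search(ENTRIES, "wort", exact=True))         # exact match: may miss entries
print(search(ENTRIES, "Bindewort", exact=True))
```

On a registry with thousands of entries the substring mode returns far more than two hits, which is the "MANY results" problem, while the exact mode only helps if you already know the name that was entered.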

Consequences of mapping
Suppose you map to a specific DC, and some essential changes are made to that DC
– You may no longer want to map, but how do you know?
Suppose there are several relevant DCs, you select one, and just that one doesn’t get standardized
– You have to redo your work (but first you have to be aware that …)
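
One way to answer “how do you know?” on your own side is to record what a DC looked like when you mapped to it and compare later. The sketch below is local bookkeeping only and assumes you can obtain the current definition and profile of a DC by some means; the function names and data layout are hypothetical.

```python
import hashlib


def fingerprint(definition: str, profile: str) -> str:
    """Hash the parts of a DC the mapping depends on (whitespace-normalized)."""
    text = f"{profile}\n{' '.join(definition.split())}"
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def changed_mappings(stored: dict, current: dict) -> list:
    """Return identifiers whose DC changed or disappeared since mapping.

    stored:  identifier -> fingerprint recorded when the mapping was made
    current: identifier -> (definition, profile) as the DC reads today
    """
    changed = []
    for identifier, old_fingerprint in stored.items():
        if identifier not in current:
            changed.append(identifier)
        elif fingerprint(*current[identifier]) != old_fingerprint:
            changed.append(identifier)
    return sorted(changed)
```

A check like this only tells you that something changed; whether the change is essential for your mapping remains a human decision.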

Ill-defined DCs
Profile: morphosyntax
– Definition: semantic
– Definition: too narrow/broad
– Definition unclear (and no examples available)
‘concept’ in definition not defined in ISOcat, or
that concept comes with several DCs (which one was meant?)

Too many DCs
There are too many ‘almost the same’ DCs, even within the same profile
Too vague DCs
There are many DCs with rather ‘empty’ definitions
– Proper noun: a noun or adjective denoting a single object
– Common noun: a noun or adjective denoting a class of objects

Too language-specific DCs
Quite a number of DCs are too specific, mostly Polish ones; this makes it difficult to map to them
In these cases, material that belongs in the Polish language section is in the general, English one
***
ISOcat: not yet perfect

Therefore, while solutions to some technical issues will come up or are already coming up, YOU should also be very careful yourself, especially with respect to the ‘soundness’ of the DCs, in particular where definitions, profile, and translation are concerned! Only then can ISOcat become a success story!

Thanks!