ISO TC37/SC4 – Tilburg 2007 Data categories for (lexical semantics and) reference annotation Susanne Alt ATILF-CNRS, Nancy, France & BBAW, Berlin, Germany.


1 ISO TC37/SC4 – Tilburg 2007
Data categories for (lexical semantics and) reference annotation
Susanne Alt, ATILF-CNRS, Nancy, France & BBAW, Berlin, Germany
Laurent Romary, INRIA, France & MPDL, Berlin, Germany

2 Reference annotation
Links between markables
Various views
  – Coreference: identity of the referent
    – {une poire, la, l’une}
    – {une pomme, le fruit, l’autre}
    – {(une poire, une pomme), les}
  – Anaphora: interpretational dependency
    – une poire <= la peau
    – l’une <= l’autre
Example: Prendre une poire et la couper. Enlever la peau. Laver une pomme. Éplucher le fruit. Les faire cuire. Servir l’une et l’autre avec de la glace.
(English: Take a pear and cut it. Remove the skin. Wash an apple. Peel the fruit. Cook them. Serve the one and the other with ice cream.)
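
The two views on this slide can be sketched as plain data. This is only an illustration: the slides do not prescribe an encoding, the names `coreference_sets`, `anaphoric_links`, `corefer` and `antecedents_of` are hypothetical, and markables are identified by their surface string for brevity.

```python
# Coreference view: each set groups markables that share one referent.
# A tuple stands for a group referent built from several markables.
coreference_sets = [
    {"une poire", "la", "l'une"},
    {"une pomme", "le fruit", "l'autre"},
    {("une poire", "une pomme"), "les"},
]

# Anaphora view: directed (antecedent, anaphor) pairs expressing an
# interpretational dependency, not necessarily identity of the referent.
anaphoric_links = [
    ("une poire", "la peau"),
    ("l'une", "l'autre"),
]


def corefer(a, b):
    """Return True if both markables occur in the same coreference set."""
    return any(a in s and b in s for s in coreference_sets)


def antecedents_of(anaphor):
    """Return all antecedents on which the given anaphor depends."""
    return [ant for ant, ana in anaphoric_links if ana == anaphor]


print(corefer("une pomme", "le fruit"))  # True: same referent
print(corefer("une poire", "la peau"))   # False: dependency without identity
print(antecedents_of("la peau"))         # ['une poire']
```

Keeping the two views in separate structures reflects the point of the slide: "la peau" depends on "une poire" for its interpretation without being coreferent with it.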

3 RAF: Reference Annotation Framework
[Diagram] Referential Data Collection
  – Global Meta-data (1..1)
  – Markable (0..n)
  – Referential Link (0..n)
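
One possible reading of this diagram as a data model is sketched below. The class and field names are hypothetical, the binary `source`/`target` shape of a link is an assumption (the slide only names the component), and the metadata is kept as a plain dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class Markable:
    id: str
    text: str
    features: dict = field(default_factory=dict)


@dataclass
class ReferentialLink:
    id: str
    source: str  # id of the source markable (assumed binary link)
    target: str  # id of the target markable
    categories: list = field(default_factory=list)


@dataclass
class ReferentialDataCollection:
    metadata: dict                                 # exactly one record (1..1)
    markables: dict = field(default_factory=dict)  # zero or more (0..n)
    links: list = field(default_factory=list)      # zero or more (0..n)

    def add_markable(self, markable):
        self.markables[markable.id] = markable

    def add_link(self, link):
        # A link may only connect markables already in the collection.
        for end in (link.source, link.target):
            if end not in self.markables:
                raise KeyError(f"unknown markable: {end}")
        self.links.append(link)


rdc = ReferentialDataCollection(metadata={"annotators": ["A1"]})
rdc.add_markable(Markable("m1", "une pomme", {"syntacticCategory": "nounPhrase"}))
rdc.add_markable(Markable("m2", "le fruit", {"syntacticCategory": "nounPhrase"}))
rdc.add_link(ReferentialLink("l1", source="m2", target="m1",
                             categories=["coreference"]))
print(len(rdc.markables), len(rdc.links))
```

Links refer to markables by identifier rather than embedding them, so the same markable can take part in several links.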

4 Important issues
Markables as autonomous units
  – No isomorphism to source data
    – Complex markables, zero pronouns, discourse deixis, disfluencies
  – Necessity to identify non-referring units in a homogeneous way
    – Cf. Byron & Gegg-Harrison (2004)
  – Possible overwriting of inherited features
    – Gender, POS refinement
  – Markable-specific data categories
Links as autonomous units
  – Specific annotation mechanisms
    – E.g. ambiguity, same source markable involved in different links
  – Link-specific data categories
=> Markables and links may be annotated in different phases and by different annotators (cf. alignment...)
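
The "overwriting of inherited features" point can be illustrated with a layered lookup. This is a sketch under assumptions: the feature names and values below are invented for illustration and are not taken from the slides, and Python's `collections.ChainMap` is just one convenient way to express the layering.

```python
from collections import ChainMap

# Features percolated from lower annotation levels (hypothetical values).
inherited = {"partOfSpeech": "noun", "grammaticalGender": "feminine"}

# Markable-level annotation, possibly added in a later phase by another
# annotator; it refines the part of speech and adds a new feature.
markable_level = {"partOfSpeech": "commonNoun", "animacy": "inanimate"}

# ChainMap looks keys up from left to right, so markable-level values
# take precedence while the inherited layer stays untouched.
effective = ChainMap(markable_level, inherited)

print(effective["partOfSpeech"])       # commonNoun (overridden)
print(effective["grammaticalGender"])  # feminine (inherited)
print(inherited["partOfSpeech"])       # noun (source layer unchanged)
```

Because the inherited layer is never modified, the markable annotation remains an autonomous unit that can be revised or discarded independently.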

5 [Diagram] Prendre une pomme. Éplucher le fruit. (English: Take an apple. Peel the fruit.)
  – Markable: une pomme (nounPhrase)
  – Markable: le fruit (nounPhrase)
  – Link between the two markables: coreference, hypernymy

6 The current jungle of "attributes"
direct anaphor, identity, coreference, identity of reference, bridging, part-whole, associative, reference to part of landmark, indirect anaphor, larger situation, unfamiliar, designation, conceptual bridging, set-subset, miscellaneous, cause, inferable-of-complement, propositional, possessive, implicit argument, ellipsis, plural NP, numerical pronoun, substitution form, identity of reference with two landmarks, NP predication, member, general relation, event relation, argument, proper name, bound anaphor, function-value, instantiation, agent, patient, attribute, partitive, strict possession, cause, other-anaphor…
Classification and description of data categories

7 Data categories for RAF
Markables
  – Lexical Semantics Data Categories
    “... are related to properties of semantic entities. Dependent on the underlying theory, semantic entities might be instantiated as concepts or referents. The following features are primarily considered as being lexicalized features. A strong indicator in favour of lexicalization is specific grammatical mark-up in some languages, as for example for animacy, alienability or collectiveness. However, in many cases, the value of a lexicalized or default semantic feature might be overwritten in discourse.”
  – Miscellaneous Semantic Data Categories
    “... groups other properties of semantic entities, useful for reference annotation. They might not be considered as lexicalized, but as discourse dependent features.”
  – Definiteness Data Categories
    “... are properties of linguistic units, mainly noun phrases, concerned with the identifiability and non-identifiability of their referents on the part of a speaker or addressee.”

8 Overview
Lexical Semantics Data Categories
  – /abstractness/
  – /animacy/
  – /alienability/
  – /collectiveness/
  – /countability/
Miscellaneous Semantic Data Categories
  – /entityCategorization/
  – /naturalGender/
  – /cardinality/
Definiteness Data Categories
  – /definiteIdentifiableTerm/
  – /genericTerm/
  – /indefiniteTerm/
  – /nonSpecificTerm/
  – /specificTerm/

9 Referential, lexical or syntactic property?
Not always syntactically marked.
  Die Möbel waren zu verkaufen. (The furniture was for sale.)
  Das Gefieder war schwarz. (The plumage was black.)
Not predictable from the referent.
  Die Federn waren schwarz. (The feathers were black.)
  Das Gefieder war schwarz. (The plumage was black.)
Therefore
  – considered as lexicalized
  – sources, notes, explanations
  – possible overriding in discourse
    Le vin est bon. Les vins sont bons. (The wine is good. The wines are good.)

10 Data categories from MAF, SynAF
Relevant information percolated from lower levels
  – /part of speech/
  – /grammatical gender (number, person, etc.)/
  – /syntactic category/
    – { /noun phrase/, … }
    – Consensus hardly achievable on the possible values…
  – /syntactic function/
    – { /subject/, /object/, … }
    – Consensus…

11 Data categories for RAF
Links
  – Lexical Relation Data Categories
    “... are relations between lexical items. For reference annotation, they might be extended to larger linguistic units, such as noun phrases.”
  – Coreference Relation Data Category
    “...an equivalence relation between linguistic expressions referring to the same extra-linguistic entity.”
  – Objectal Relation Data Categories
    “... are a generalisation of van Deemter and Kibble’s (2000) extensional approach to the definition of coreference in terms of relations holding between referents of linguistic expressions: an objectal relation holds between extra-linguistic entities, defines relations from a referential viewpoint.”
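
Since coreference is defined here as an equivalence relation, pairwise coreference links can be closed into equivalence classes. The sketch below uses a union-find structure, which is one standard way to do this and not something the slides prescribe; the function name `coreference_classes` and the choice of pairwise links for the recipe example are assumptions.

```python
def coreference_classes(markables, pairs):
    """Group markables into equivalence classes from pairwise links."""
    parent = {m: m for m in markables}

    def find(x):
        # Follow parent pointers to the class representative,
        # shortening the path along the way (path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)  # merge the two classes

    classes = {}
    for m in markables:
        classes.setdefault(find(m), set()).add(m)
    return list(classes.values())


markables = ["une poire", "la", "l'une",
             "une pomme", "le fruit", "l'autre", "la peau"]
pairs = [("la", "une poire"), ("l'une", "la"),
         ("le fruit", "une pomme"), ("l'autre", "le fruit")]

classes = coreference_classes(markables, pairs)
for c in classes:
    print(sorted(c))
```

Transitivity comes for free: "l'une" is only linked to "la", yet ends up in the same class as "une poire". "la peau" has no coreference link and stays in a class of its own.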

12 Overview
Lexical Relation Data Categories
  – /synonymy/
  – /hyponymy/
  – /hypernymy/
  – /compatibility/
  – /incompatibility/
  – /meronymy/
  – /lexicalIdentity/
Coreference Relation Data Category
  – /coreference/
Objectal Relation Data Categories
  – /objectalIdentity/
  – /partOf/
  – /subset/
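
The overview above can be used as a small lookup table for checking the categories attached to a link. This is a sketch only: the category names are those on the slide, while the group keys (`lexicalRelation`, `coreferenceRelation`, `objectalRelation`) and the helper names are hypothetical.

```python
# Link data categories, grouped as on the overview slide.
LINK_CATEGORIES = {
    "lexicalRelation": {"synonymy", "hyponymy", "hypernymy", "compatibility",
                        "incompatibility", "meronymy", "lexicalIdentity"},
    "coreferenceRelation": {"coreference"},
    "objectalRelation": {"objectalIdentity", "partOf", "subset"},
}


def group_of(category):
    """Return the group a link category belongs to, or None if unknown."""
    for group, members in LINK_CATEGORIES.items():
        if category in members:
            return group
    return None


def unknown_categories(link_categories):
    """Return the categories on a link that are not in the selection."""
    return [c for c in link_categories if group_of(c) is None]


# The link between "une pomme" and "le fruit" carries two categories
# from two different groups.
link = ["objectalIdentity", "hypernymy"]
print([group_of(c) for c in link])

# "bridging" is one of the older attribute names and is not in the selection.
print(unknown_categories(["hypernymy", "bridging"]))
```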


14 [Diagram] Prendre une pomme. Éplucher le fruit. (English: Take an apple. Peel the fruit.)
  – Markable: une pomme (nounPhrase)
  – Markable: le fruit (nounPhrase)

15 [Diagram] Prendre une pomme. Éplucher le fruit. (English: Take an apple. Peel the fruit.)
  – Markable: une pomme (nounPhrase)
  – Markable: le fruit (nounPhrase)
  – Link between the two markables: objectalIdentity, hypernymy

16 Metadata for annotation schemes
A general issue in annotation schema design
  – Global information
    – Annotator(s), tool, date
    – Pointer to scheme specification = DCS (Data Category Selection)
    – Inter-annotator agreement
    – Revision information
  – Local information: markables, links
    – Annotator (markable ≠ links)
    – Confidence level (cf. tools)
    – Update, correction
Sources:
  – OLAC (Open Language Archives Community), IMDI (ISLE Metadata Initiative), TEI (Text Encoding Initiative)
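
The split between global and local information could be represented as two kinds of records. The sketch below is an assumption-laden illustration: the class names `GlobalInfo` and `LocalInfo`, all field names, and every value (tool name, date, DCS path, confidence score) are placeholders and do not come from the slides or from OLAC, IMDI or TEI.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class GlobalInfo:
    annotators: list
    tool: str
    date: str
    dcs: str  # pointer to the scheme specification
    inter_annotator_agreement: Optional[float] = None
    revisions: list = field(default_factory=list)


@dataclass
class LocalInfo:
    annotator: str
    confidence: Optional[float] = None  # e.g. reported by a tool
    corrections: list = field(default_factory=list)


global_info = GlobalInfo(
    annotators=["A1", "A2"],
    tool="hypothetical-annotation-tool",  # placeholder
    date="2007-01-01",                    # placeholder
    dcs="dcs/reference-annotation.xml",   # placeholder
)

# Markables and links carry their own local records, so the annotator
# of a link may differ from the annotator of the markables it connects.
markable_info = {"m1": LocalInfo("A1"), "m2": LocalInfo("A1")}
link_info = {"l1": LocalInfo("A2", confidence=0.8)}
link_info["l1"].corrections.append("category changed to objectalIdentity")

print(markable_info["m1"].annotator, link_info["l1"].annotator)
```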

