Annotation of Grammatemes in the Prague Dependency Treebank 2.0 Magda Razímová Zdeněk Žabokrtský Institute of Formal and Applied Linguistics Charles University Prague, Czech Republic
LREC 2006, Annotation
Outline of the talk
- Introduction
- Prague Dependency Treebank 2.0
- Annotation of grammatemes
  - Motivation
  - Grammateme attributes
  - Two-level node hierarchy
  - Examples of grammateme value assignment
- Final remarks
Introduction
- grammatemes in the PDT 2.0
  - one type of attribute of the nodes of a deep-syntactic tree
  - capture morphological meanings that are semantically indispensable
  - number for nouns, degree of comparison for adjectives, tense for verbs, etc.
- annotation of grammatemes
  - the last task in the PDT 2.0 annotation procedure
  - could be assigned automatically, profiting from the annotation already available:
    - annotation of the same sentence at the lower layers
    - already available components of the t-tree (tree structure, types of dependency relations, co-reference, etc.)
Historical background and development of the PDT project
- mid-1960s: Praguian Functional Generative Description (Petr Sgall et al.)
- 1994: Czech National Corpus
- 1995: PDT started
- 1998: PDT 0.5 pre-release
- 2001: PDT 1.0 released by LDC
  - manual annotation of morphology and surface syntax
- 2006: PDT 2.0 to be released by LDC
  - interlinked morphological, surface-syntactic and complex deep-syntactic annotation, including annotation of grammatemes
Layers of annotation
- tectogrammatical layer (t-layer): deep-syntactic dependency tree
- analytical layer (a-layer): surface-syntactic dependency tree
- morphological layer (m-layer): m-lemma and m-tag associated with each token
- word layer: original text, segmented on word boundaries
Example: lit. "He-was would went to-forest." ("He would have gone to the forest.")
Interlinking the layers
- any unit at any layer has an ID that is unique within the PDT
- neighboring layers are connected by top-down pointers
Example: lit. "He-was would went to-forest." ("He would have gone to the forest.")
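The interlinking described on this slide can be pictured with a small sketch. It is only an illustration: the `Unit` class, the ID format (`t1`, `a2`, ...) and the `descend` helper are hypothetical and do not reproduce the actual PDT 2.0 file format. The sample data assume the Czech verb form byl by šel behind the gloss "He-was would went".

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Unit:
    uid: str      # unique ID (this ID format is invented for the sketch)
    layer: str    # "w", "m", "a" or "t"
    label: str    # word form, m-lemma, a-lemma or t-lemma
    lower: List[str] = field(default_factory=list)  # pointers one layer down


def descend(index: Dict[str, Unit], uid: str, target_layer: str) -> List[Unit]:
    """Follow top-down pointers from `uid` until `target_layer` is reached."""
    unit = index[uid]
    if unit.layer == target_layer:
        return [unit]
    found: List[Unit] = []
    for lower_id in unit.lower:
        found.extend(descend(index, lower_id, target_layer))
    return found


# One t-node (the verb) covering three tokens of the lower layers.
units = [
    Unit("w1", "w", "Byl"), Unit("w2", "w", "by"), Unit("w3", "w", "šel"),
    Unit("m1", "m", "být", ["w1"]),
    Unit("m2", "m", "být", ["w2"]),
    Unit("m3", "m", "jít", ["w3"]),
    Unit("a1", "a", "Byl", ["m1"]),
    Unit("a2", "a", "by", ["m2"]),
    Unit("a3", "a", "šel", ["m3"]),
    Unit("t1", "t", "jít", ["a1", "a2", "a3"]),
]
index = {u.uid: u for u in units}

word_forms = [u.label for u in descend(index, "t1", "w")]
```

Because every pointer leads only one layer down, any information stored at a lower layer (for instance an m-tag) stays reachable from a t-node without duplicating it.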
Size of the PDT 2.0 data (i)
- 7,129 manually annotated textual documents
- all documents annotated at the m-layer
  - 116,065 sentences with 1,960,657 tokens
- 75 % of the m-layer data annotated at the a-layer
  - 5,338 documents, 87,980 sentences, 1,504,847 tokens
- 44 % of the m-layer data annotated also at the t-layer
  - 3,168 documents, 49,442 sentences, 833,357 tokens
Size of the PDT 2.0 data (ii)
- training data (80 %)
- development test data (10 %)
- evaluation test data (10 %)
M-layer
- sentence represented as a sequence of tokens
- each token lemmatized and tagged (attributes m-lemma and m-tag)
- positional m-tag: 15 characters
  1. (main) POS
  2. detailed POS
  3. gender
  4. number
  5. case
  ...
Example: lit. "Some contours problem(gen) reflexive_pronoun though after resurgence(instr) Havel's speech(instr) they-seem to-be clearer." ("Some contours of the problem seem to be clearer after the resurgence by Havel's speech.")
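A positional tag of this kind can be decoded by position alone. The sketch below names only the first five positions, the ones listed on the slide; the function name `decode_m_tag` and the key names are hypothetical, and the sample tag is the one shown later in the talk for les (forest).

```python
# Names for the first five of the 15 tag positions, as listed on the slide.
# The key names themselves are chosen for this sketch only.
FIRST_POSITIONS = ("pos", "detailed_pos", "gender", "number", "case")


def decode_m_tag(tag: str) -> dict:
    """Map the first five characters of a 15-character m-tag to named fields."""
    if len(tag) != 15:
        raise ValueError(f"expected a 15-character positional tag, got {len(tag)}")
    # zip stops after the five named positions; the rest of the tag is ignored.
    return dict(zip(FIRST_POSITIONS, tag))


decoded = decode_m_tag("NNIS2-----A----")
```

Each position holds exactly one character, so no parsing beyond indexing is needed; a hyphen marks a category that does not apply to the given word form.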
A-layer
- rooted ordered tree with labeled nodes and edges
- a-nodes
  - each token of the m-layer is represented by exactly one a-node
  - labeled with a-lemmas (identical to the word forms)
- a-edges
  - represent dependency relations (Sb, Obj, Adv, Atr)
  - represent non-dependency relations (Coord)
  - the analytical function is stored as an a-node attribute
Example: "Some contours of the problem seem to be clearer after the resurgence by Havel's speech."
T-layer
- rooted ordered tree with labeled nodes and edges
- t-nodes
  - complex typed feature structures
  - represent autosemantic words
  - function words do not have nodes of their own
  - artificially added nodes
- t-edges
  - dependency relations (functor)
  - non-dependency relations (coordination constructions)
  - the functor is stored as a t-node attribute
Example: "Some contours of the problem seem to be clearer after the resurgence by Havel's speech."
Areas of annotation at the t-layer
- tree structure
- t-lemma attribute
- dependency relation (functor and subfunctor)
- topic-focus attributes
- co-reference attributes
- node typing attributes (nodetype and sempos)
- grammateme attributes
Example: Všem bylo předáno osvědčení o úspěšném absolvování kurzu. lit. "[To] all was handed over a certificate of successful graduation from the course." ("They all received a certificate of successful graduation from this course.")
Grammatemes: Motivation
- grammatemes: t-node attributes representing inflectional information that is semantically indispensable (morphological meanings such as number for nouns, tense for verbs, degree of comparison for adjectives, etc.)
- semantically irrelevant morphological meanings are not part of the t-layer (e.g. case for nouns)
Grammateme attributes
- 15 grammatemes: indeftype, numertype, negation, degcmp, tense, aspect, verbmod, deontmod, dispmod, resultative, iterativeness, number, gender, person, politeness
Conditioned presence/absence of grammatemes
- obviously, not all grammatemes are relevant for all nodes
  - no tense for "dog", no degree of comparison for "(he) waits", etc.
- how to formally declare the presence/absence of a given grammateme attribute in a given node?
  - the need for node typing
- chosen solution: two-level typing
  - 1st level: 8 more general types of nodes; grammatemes are relevant for only one of them
  - 2nd level: 19 more specific subtypes, corresponding to detailed semantic parts of speech
Presence/absence of grammateme values: Two-level t-node hierarchy
- 1st level: attribute nodetype
- 2nd level: attribute sempos
First level of the hierarchy: attribute nodetype
- 8 attribute values: root | qcomplex | list | atom | coap | dphr | fphr | complex
- fully automatic annotation, using the tree structure and t-attributes:
  - tree structure: root
  - t-lemma: qcomplex | list
  - functor: atom | coap | dphr | fphr
  - else: complex
Example: Levnější benzín na Východě, dražší na Západě ("Cheaper gasoline in the East, more expensive one in the West")
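The decision order on this slide (tree position, then t-lemma, then functor, otherwise complex) can be sketched as a cascade of checks. This is a simplified illustration, not the project's Perl implementation: the function name is hypothetical, and the lemma and functor sets are small illustrative samples assumed for the sketch rather than the complete inventories used in the PDT 2.0.

```python
# Illustrative samples only (assumptions, not the full PDT 2.0 inventories).
LIST_LEMMAS = {"#Idph", "#Forn"}
QCOMPLEX_LEMMAS = {"#Gen", "#Cor"}
ATOM_FUNCTORS = {"ATT", "MOD", "RHEM", "PREC"}
COAP_FUNCTORS = {"CONJ", "DISJ", "ADVS", "APPS"}


def assign_nodetype(is_root: bool, t_lemma: str, functor: str) -> str:
    """Return one of the 8 nodetype values, checking in the slide's order."""
    if is_root:                      # position in the tree structure
        return "root"
    if t_lemma in LIST_LEMMAS:       # decided by the t-lemma
        return "list"
    if t_lemma in QCOMPLEX_LEMMAS:
        return "qcomplex"
    if functor in ATOM_FUNCTORS:     # decided by the functor
        return "atom"
    if functor in COAP_FUNCTORS:
        return "coap"
    if functor == "DPHR":
        return "dphr"
    if functor == "FPHR":
        return "fphr"
    return "complex"                 # everything else


# The comma coordinating the two halves of the example sentence vs. a noun.
comma_type = assign_nodetype(False, "#Comma", "CONJ")
noun_type = assign_nodetype(False, "benzín", "PAT")
```

Since complex is the fallback, only nodes that pass none of the earlier checks go on to receive a sempos value and grammatemes.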
Second level of the hierarchy: attribute sempos
- only complex nodes are grouped into semantic parts of speech
- 19 values of the attribute sempos: n.... | adj.... | adv.... | v....
- fully automatic annotation, using
  - m-tag
  - t-lemma
  - other t-attributes
- the sempos value delimits the set of relevant grammatemes
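How a sempos value can delimit the relevant grammatemes is sketched below as a lookup table plus a consistency check. The table is an assumption made for illustration: it covers only three sempos values, and the grammateme sets listed for them are not taken from the slides. The function names are hypothetical as well.

```python
# Assumed, partial mapping for illustration; the real table has 19 sempos values.
RELEVANT = {
    "n.denot": {"number", "gender"},
    "adj.denot": {"degcmp", "negation"},
    "v": {"tense", "aspect", "verbmod", "deontmod", "dispmod",
          "resultative", "iterativeness"},
}


def relevant_grammatemes(nodetype: str, sempos: str) -> frozenset:
    """Grammatemes exist only on complex nodes; sempos selects which ones."""
    if nodetype != "complex":
        return frozenset()
    return frozenset(RELEVANT.get(sempos, ()))


def irrelevant_attributes(nodetype: str, sempos: str, grammatemes: dict) -> list:
    """List grammateme attributes that should not be present on this node."""
    allowed = relevant_grammatemes(nodetype, sempos)
    return sorted(set(grammatemes) - allowed)


# "No tense for dog": a noun node wrongly carrying a tense value is flagged.
flagged = irrelevant_attributes("complex", "n.denot",
                                {"number": "sg", "tense": "sim"})
```

Keeping the mapping in one table means that the presence or absence of each attribute is declared once per node type instead of being decided node by node.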
Values of nodetype and sempos in the PDT 2.0: an overview
[Tables: nodetype values; sempos values]
Grammateme value assignment
- n-tred environment for processing the PDT data
- automatic annotation
  - 2000 lines of Perl code
  - crucial importance of inter-layer links: use of t-attributes, a-attributes, m-attributes
  - rules using a special economical notation: 2000 lines written in a text file
  - lexical resources: special-purpose lists of adverbs / verbs
- manual annotation of special problems
  - two annotators working in parallel
  - simplified annotation environment: treebank positions extracted into simple HTML forms
Simple HTML-based environment for manual annotation
Example: lit. "The difference [you] would have to pay yourself."
Automatic vs. manual assignment
- at the t-layer of the PDT 2.0: 1,594,333 grammateme values assigned at 550,947 complex nodes
- manually assigned: 17,520 grammateme values
- inter-annotator agreement: %
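The proportions behind these counts can be worked out directly. The short calculation below assumes that the 17,520 manually assigned values are included in the total of 1,594,333; under that assumption the manual share comes to roughly 1.1 % and a complex node carries about 2.9 grammateme values on average.

```python
total_values = 1_594_333    # grammateme values at the t-layer
complex_nodes = 550_947     # nodes that carry them
manual_values = 17_520      # values assigned by annotators

# Assumption: the manual values are part of the total.
automatic_values = total_values - manual_values
manual_share = manual_values / total_values
values_per_node = total_values / complex_nodes

print(f"automatic values: {automatic_values:,}")
print(f"manual share:     {manual_share:.2%}")
print(f"values per node:  {values_per_node:.2f}")
```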
Grammateme assignment and m-tag
- number grammateme: values sg | pl
- assigned automatically using the m-tag
  - e.g. les (forest): m-layer: tag NNIS2-----A----; t-layer: number=sg
- manual assignment
  - nouns with only plural forms (identified by a list extracted from the machine-readable dictionary of standard Czech)
  - e.g. dveře (door/doors): m-layer: always plural; t-layer: annotator decision sg | pl
Example: lit. "He-was would went to-forest." ("He would have gone to the forest."); t-node of the noun: n.denot, number=sg
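The split between automatic and manual assignment of number can be sketched as follows. The function name, the one-item pluralia tantum list and the tag used for dveře are assumptions made for the sketch; the real list was extracted from a dictionary, and the mapping of the tag characters S and P to sg and pl is inferred from the les example above.

```python
# Stand-in for the dictionary-derived list of nouns with only plural forms.
PLURALIA_TANTUM = {"dveře"}

# Fourth tag position holds the morphological number (assumed S/P coding).
NUMBER_FROM_TAG = {"S": "sg", "P": "pl"}


def assign_number(m_lemma: str, m_tag: str):
    """Return sg/pl from the m-tag, or None when an annotator must decide."""
    if m_lemma in PLURALIA_TANTUM:
        return None                      # plural form says nothing about meaning
    return NUMBER_FROM_TAG.get(m_tag[3])


forest = assign_number("les", "NNIS2-----A----")
# The tag below is a plausible plural noun tag, assumed for illustration.
door = assign_number("dveře", "NNFP1-----A----")
```

Returning `None` rather than copying the plural from the tag keeps the ambiguous cases visibly separate, so they can be routed to the manual annotation step.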
Grammateme assignment and tree structure
- mood grammateme verbmod: values ind | imp | cdn
- assigned automatically
  - one-word verbal forms, e.g. jde (goes): m-tag information
  - verbal forms consisting of several word forms (represented by a single node at the t-layer), e.g. byl by šel (would have gone):
    - the corresponding a-layer subtree involves the node by
    - m-tag of the node by
Example: lit. "He-was would went to-forest." ("He would have gone to the forest."); t-node of the verb: v, verbmod=cdn
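A minimal sketch of this rule is given below. It assumes that the m-tags of all a-nodes merged into one t-node have already been collected through the inter-layer links, and that the second tag position marks the conditional auxiliary with `c` and imperative forms with `i`. Both the function name and these tag codes are assumptions of the sketch, as are the sample tags.

```python
def assign_verbmod(m_tags: list) -> str:
    """Derive verbmod from the m-tags of the a-nodes behind one verbal t-node."""
    # Collect the detailed-POS character of every verbal word form.
    detailed = {tag[1] for tag in m_tags if tag[0] == "V"}
    if "c" in detailed:      # the subtree contains the conditional "by"
        return "cdn"
    if "i" in detailed:      # imperative form
        return "imp"
    return "ind"             # default: indicative


# byl by šel: past participle + conditional auxiliary + past participle
complex_form = assign_verbmod(
    ["VpYS---XR-AA---", "Vc-X---3-------", "VpYS---XR-AA---"]
)
# jde: a single present-tense form
simple_form = assign_verbmod(["VB-S---3P-AA---"])
```

The one-word and multi-word cases go through the same function; the only difference is how many tags the a-layer subtree contributes.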
Grammateme assignment and co-reference
- grammatemes gender, number and person in relative pronouns are left underspecified (value inher), since they are imposed only by grammatical agreement (and thus can be "inherited from the antecedents")
Example: Ze zbytku suroviny mlékárna vyrábí sušené mléko, které vyváží do Asie a Jižní Ameriky. lit. "From remainder of raw material the dairy produces dried milk, which [it] exports to Asia and South America." ("From the rest of the material, the dairy produces dried milk, which is exported [by it] to Asia and South America.")
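How a consumer of the data might resolve the value inher through the co-reference link can be sketched as below. Nodes are represented as plain dictionaries, and the function name and the value `neut` for the antecedent's gender are assumptions of the sketch; the treebank itself stores only inher on the pronoun.

```python
INHERITABLE = ("gender", "number", "person")


def resolve_inherited(pronoun: dict, antecedent: dict) -> dict:
    """Copy antecedent values into grammatemes marked 'inher' on the pronoun."""
    resolved = dict(pronoun)             # leave the original node untouched
    for name in INHERITABLE:
        if resolved.get(name) == "inher":
            # Stay underspecified if the antecedent has no such grammateme.
            resolved[name] = antecedent.get(name, "inher")
    return resolved


# které (which) refers back to mléko (milk) in the example sentence.
ktere = {"gender": "inher", "number": "inher", "person": "inher"}
mleko = {"gender": "neut", "number": "sg"}
resolved = resolve_inherited(ktere, mleko)
```

Storing inher instead of a copied value avoids recording the same information twice and keeps the pronoun consistent with its antecedent if the antecedent's annotation is later corrected.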
Final remarks
- achievements:
  - two-level typing of t-layer nodes, which makes it possible to formally capture the presence/absence of individual grammatemes in a given node
  - automatic procedure for capturing the node classification and the grammateme attributes
  - verification of the procedure on large-scale data
- experience:
  - it was the existence of the lower annotation layers and of the inter-layer links that made it possible to make the procedure of grammateme assignment more or less automatic