Annotation of Grammatemes in the Prague Dependency Treebank 2.0 Magda Razímová Zdeněk Žabokrtský Institute of Formal and Applied Linguistics Charles University.

Presentation transcript:

Annotation of Grammatemes in the Prague Dependency Treebank 2.0 Magda Razímová Zdeněk Žabokrtský Institute of Formal and Applied Linguistics Charles University Prague, Czech Republic

LREC 2006, Annotation
Outline of the talk
  Introduction
  Prague Dependency Treebank 2.0
  Annotation of grammatemes
    Motivation
    Grammateme attributes
    Two-level node hierarchy
    Examples of grammateme value assignment
  Final remarks

Introduction
  Grammatemes in the PDT 2.0
    one type of attribute of the nodes of a deep-syntactic tree
    capture morphological meanings that are semantically indispensable: number for nouns, degree of comparison for adjectives, tense for verbs, etc.
  Annotation of grammatemes
    the last task in the PDT 2.0 annotation procedure
    possible to assign automatically, drawing on annotation that is already available:
      annotation of the same sentence at the lower layers
      already available components of the t-tree (tree structure, types of dependency relations, co-reference, etc.)

Historical background and development of the PDT project
  mid-1960s: Praguian Functional Generative Description (Petr Sgall et al.)
  1994: Czech National Corpus
  1995: PDT started
  1998: PDT 0.5 pre-release
  2001: PDT 1.0 released by LDC
    manual annotation of morphology and surface syntax
  2006: PDT 2.0 to be released by LDC
    interlinked morphological, surface-syntactic and complex deep-syntactic annotation, including annotation of grammatemes


Layers of annotation
  tectogrammatical layer: deep-syntactic dependency tree
  analytical layer: surface-syntactic dependency tree
  morphological layer: m-lemma and m-tag associated with each token
  word layer: original text, segmented on word boundaries
  Example: lit.: He-was would went to forest. 'He would have gone to the forest.'

Interlinking the layers
  every unit at every layer has an ID that is unique within the PDT
  neighboring layers are connected by top-down pointers
  Example: lit.: He-was would went to forest. 'He would have gone to the forest.'
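The interlinking described above can be pictured with a small sketch. It is only an illustration under stated assumptions: the `Unit` class, the ID strings and the `descend` helper are hypothetical and do not reflect the actual PDT file format; only the idea of unique IDs and top-down pointers comes from the slide.

```python
from dataclasses import dataclass, field

@dataclass
class Unit:
    uid: str      # unique ID (the format used here is invented)
    layer: str    # "t", "a" or "m"
    label: str    # lemma or word form, for display only
    lower: list = field(default_factory=list)  # IDs of linked units one layer down

def descend(index, uid):
    """Follow top-down pointers from a unit to every unit reachable below it."""
    unit = index[uid]
    found = [unit]
    for low_id in unit.lower:
        found.extend(descend(index, low_id))
    return found

# One t-node for the complex verb form "byl by šel", linked to three a-nodes,
# each of which is linked to one m-layer token (labels are illustrative).
units = [
    Unit("t-1", "t", "jít", ["a-1", "a-2", "a-3"]),
    Unit("a-1", "a", "byl", ["m-1"]),
    Unit("a-2", "a", "by", ["m-2"]),
    Unit("a-3", "a", "šel", ["m-3"]),
    Unit("m-1", "m", "být"),
    Unit("m-2", "m", "být"),
    Unit("m-3", "m", "jít"),
]
index = {u.uid: u for u in units}
a_forms = [u.label for u in descend(index, "t-1") if u.layer == "a"]
print(a_forms)
```

Because the pointers run only from higher to lower layers, a procedure working at the t-layer can reach all a-layer and m-layer information for a node with one traversal of this kind.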

Size of the PDT 2.0 data (i)
  7,129 manually annotated textual documents
  all documents annotated at the m-layer
    116,065 sentences with 1,960,657 tokens
  75 % of the m-layer data annotated at the a-layer
    5,338 documents, 87,980 sentences, 1,504,847 tokens
  44 % of the m-layer data annotated also at the t-layer
    3,168 documents, 49,442 sentences, 833,357 tokens

Size of the PDT 2.0 data (ii)
  training data (80 %)
  development test data (10 %)
  evaluation test data (10 %)

M-layer
  sentence represented as a sequence of tokens
  each token lemmatized and tagged (attributes m-lemma and m-tag)
  positional m-tag: 15 characters
    1. (main) POS
    2. detailed POS
    3. gender
    4. number
    5. case
    ...
  Example: lit.: Some contours problem(gen) reflexive_pronoun though after resurgence(instr) Havel's speech(instr) they-seem to-be clearer. 'Some contours of the problem seem to be clearer after the resurgence by Havel's speech.'
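As a minimal sketch of how a positional m-tag can be read, the snippet below splits off the first five positions named on the slide. The function name `parse_m_tag` and the field names are assumptions made for illustration; the remaining ten positions are not listed in the talk and are left uninterpreted.

```python
# Names for the first five positions, as listed on the slide.
POSITIONS = ("pos", "detailed_pos", "gender", "number", "case")

def parse_m_tag(tag):
    """Return the first five positional values of a 15-character m-tag."""
    if len(tag) != 15:
        raise ValueError(f"expected 15 characters, got {len(tag)}")
    # zip stops after the five named positions; the rest is ignored here.
    return dict(zip(POSITIONS, tag))

fields = parse_m_tag("NNIS2-----A----")
print(fields)
```

A fixed-width tag of this kind makes each category addressable by index, which is what the later automatic rules rely on.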

A-layer
  rooted ordered tree with labeled nodes and edges
  a-nodes
    each token of the m-layer is represented by exactly one a-node
    labeled with a-lemmas (identical with word forms)
  a-edges
    represent dependency relations (Sb, Obj, Adv, Atr)
    represent non-dependency relations (Coord)
    the analytical function is stored as an a-node attribute
  Example: Some contours of the problem seem to be clearer after the resurgence by Havel's speech.

T-layer
  rooted ordered tree with labeled nodes and edges
  t-nodes
    complex typed feature structures
    represent auto-semantic words
    function words do not have nodes of their own
    artificially added nodes
  t-edges
    dependency relations (functor)
    non-dependency relations (coordination constructions)
    the functor is stored as a t-node attribute
  Example: Some contours of the problem seem to be clearer after the resurgence by Havel's speech.

Areas of annotation at the t-layer
  tree structure
  t-lemma attribute
  dependency relation (functor and subfunctor)
  topic-focus attributes
  co-reference attributes
  node typing attributes (nodetype and sempos)
  grammateme attributes
  Example: Všem bylo předáno osvědčení o úspěšném absolvování kurzu.
    lit.: [To] all was handed over a certificate of successful graduation from the course.
    'They all received a certificate of successful graduation from this course.'


Grammatemes: Motivation
  grammatemes are t-node attributes representing inflectional information that is semantically indispensable (morphological meanings such as number for nouns, tense for verbs, degree of comparison for adjectives, etc.)
  semantically irrelevant morphological meanings (e.g. case for nouns) are not part of the t-layer

Grammateme attributes
  15 grammatemes:
    indeftype, numertype, negation, degcmp, tense, aspect, verbmod, deontmod, dispmod, resultative, iterativeness, number, gender, person, politeness

Conditioned presence/absence of grammatemes
  obviously, not all grammatemes are relevant for all nodes
    no tense for dog, no degree of comparison for (he) waits, etc.
  how to formally declare the presence/absence of a given grammateme attribute in a given node?
    → the need for node typing
  chosen solution: two-level typing
    1st level: 8 more general types of nodes; grammatemes are relevant for only one of them
    2nd level: 19 more specific subtypes, corresponding to detailed semantic parts of speech

Presence/absence of grammateme values: two-level t-node hierarchy
  1st level: attribute nodetype
  2nd level: attribute sempos

First level of the hierarchy: attribute nodetype
  8 attribute values: root | qcomplex | list | atom | coap | dphr | fphr | complex
  fully automatic annotation, using
    the tree structure → root
    t-attributes
      t-lemma → qcomplex | list
      functor → atom | coap | dphr | fphr
    else → complex
  Example: Levnější benzín na Východě, dražší na Západě
    'Cheaper gasoline in the East, more expensive one in the West'
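The decision order for nodetype given above (tree structure, then t-lemma, then functor, otherwise complex) can be sketched as a small cascade. This is a sketch under assumptions, not the treebank's actual procedure: the function name and the lemma and functor tables are small illustrative stand-ins, and the real lists used for the PDT 2.0 are not given in the talk.

```python
# Illustrative stand-ins only; the real lemma and functor lists are not in the slides.
LIST_LEMMAS = {"#Idph", "#Forn"}
QCOMPLEX_LEMMAS = {"#Gen", "#Cor"}
FUNCTOR_TO_TYPE = {
    "RHEM": "atom",
    "PREC": "atom",
    "CONJ": "coap",
    "APPS": "coap",
    "DPHR": "dphr",
    "FPHR": "fphr",
}

def nodetype(is_root, t_lemma, functor):
    """Assign one of the 8 nodetype values in the order named on the slide."""
    if is_root:                       # decided by the tree structure
        return "root"
    if t_lemma in LIST_LEMMAS:        # decided by the t-lemma
        return "list"
    if t_lemma in QCOMPLEX_LEMMAS:
        return "qcomplex"
    if functor in FUNCTOR_TO_TYPE:    # decided by the functor
        return FUNCTOR_TO_TYPE[functor]
    return "complex"                  # everything else

print(nodetype(False, "benzín", "ACT"))
```

Putting `complex` last mirrors the "else" branch on the slide: a node is complex only when none of the more specific tests applies.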

Second level of the hierarchy: attribute sempos
  only complex nodes are grouped into semantic parts of speech
  19 values of the attribute sempos: n.... | adj.... | adv.... | v....
  fully automatic annotation, using
    m-tag
    t-lemma
    other t-attributes
  the sempos value delimits the set of relevant grammatemes
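To show what "the sempos value delimits the set of relevant grammatemes" could look like in practice, here is a minimal consistency check. The mapping `RELEVANT` is a hypothetical simplification keyed on the coarse sempos prefix; the actual assignment of the 15 grammatemes to the 19 sempos values is not given in the slides, and the node dictionaries are invented for the example.

```python
# Hypothetical mapping from the coarse sempos prefix to grammateme names.
RELEVANT = {
    "n": {"number", "gender"},
    "adj": {"degcmp", "negation"},
    "adv": {"degcmp", "negation"},
    "v": {"tense", "aspect", "verbmod"},
}

def relevant_grammatemes(sempos):
    """Look up the allowed grammatemes by the part before the first dot."""
    prefix = sempos.split(".", 1)[0]
    return RELEVANT.get(prefix, set())

def unexpected_grammatemes(node):
    """Return grammatemes present on a node although its type does not license them."""
    present = set(node.get("gram", {}))
    if node["nodetype"] != "complex":
        return present                # only complex nodes carry grammatemes
    return present - relevant_grammatemes(node["sempos"])

dog = {"nodetype": "complex", "sempos": "n.denot",
       "gram": {"number": "sg", "tense": "sim"}}
print(unexpected_grammatemes(dog))
```

With typing declared this way, "no tense for dog" becomes a checkable constraint rather than an informal convention.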

Values of nodetype and sempos in the PDT 2.0: an overview
  [figure: nodetype values and sempos values]

Grammateme value assignment
  n-tred environment for processing the PDT data
  automatic annotation
    2000 lines of Perl code
      crucial importance of inter-layer links: use of t-attributes, a-attributes, m-attributes
    rules using a special economical notation
      2000 lines written in a text file
    lexical resources
      special-purpose lists of adverbs / verbs
  manual annotation of special problems
    two annotators working in parallel
    simplified annotation environment: treebank positions extracted into simple HTML forms

Simple HTML-based environment for manual annotation
  Example: lit.: The difference [you] would have to pay yourself.

Automatic vs. manual assignment
  at the t-layer of the PDT 2.0: 1,594,333 grammateme values assigned at 550,947 complex nodes
  manually assigned: 17,520 grammateme values
  inter-annotator agreement: %
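The counts above imply two derived figures that the slide does not state: the average number of grammateme values per complex node and the share of values assigned manually. The short calculation below derives them from the slide's numbers only; the variable names are chosen for this example.

```python
total_values = 1_594_333   # grammateme values at the t-layer
complex_nodes = 550_947    # complex nodes carrying those values
manual_values = 17_520     # values assigned by annotators

per_node = total_values / complex_nodes
manual_share = manual_values / total_values

print(f"{per_node:.2f} grammateme values per complex node")
print(f"{manual_share:.2%} of all values assigned manually")
```

Under these figures a complex node carries a little under three grammateme values on average, and roughly one value in ninety required a manual decision.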

Grammateme assignment and m-tag
  number grammateme: values sg | pl
  assigned automatically using the m-tag
    e.g. les (forest)
      m-layer: tag NNIS2-----A---- → t-layer: number=sg
  manual assignment
    nouns with only plural forms (identified by a list extracted from the machine-readable dictionary of standard Czech)
    e.g. dveře (door/doors)
      m-layer: always plural
      t-layer: annotator decision sg | pl
  Example: n.denot, number=sg; lit.: He-was would went to forest. 'He would have gone to the forest.'
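The split between automatic and manual assignment of the number grammateme can be sketched as follows. This is an illustration under assumptions: the function name is hypothetical, the one-word set `PLURAL_ONLY` stands in for the dictionary-derived list mentioned on the slide, and the tag given for dveře is an invented example of a plural noun tag.

```python
# Stand-in for the list of plural-only nouns extracted from the dictionary.
PLURAL_ONLY = {"dveře"}

# Fourth tag position (index 3) holds number; S and P are read as sg and pl.
NUMBER_FROM_TAG = {"S": "sg", "P": "pl"}

def number_grammateme(m_lemma, m_tag):
    """Return 'sg' or 'pl', or None when the decision is left to an annotator."""
    if m_lemma in PLURAL_ONLY:
        return None                   # morphology is always plural, meaning is not
    return NUMBER_FROM_TAG.get(m_tag[3])

print(number_grammateme("les", "NNIS2-----A----"))
print(number_grammateme("dveře", "NNFP1-----A----"))
```

Returning `None` instead of guessing keeps the cases that need human judgment explicit, which matches the two-annotator workflow described earlier in the talk.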

Grammateme assignment and tree structure
  mood grammateme verbmod: values ind | imp | cdn
  assigned automatically
    one-word verbal forms, e.g. jde (goes)
      m-tag information
    verbal forms consisting of several word forms (represented by a single node at the t-layer), e.g. byl by šel (would have gone)
      the corresponding a-layer subtree contains the node by
      m-tag of the node by
  Example: v, verbmod=cdn; lit.: He-was would went to forest. 'He would have gone to the forest.'
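A rough sketch of the verbmod rule: collect the m-tags of all a-nodes linked to one t-node and look for a conditional or imperative form among them. The assumption that the second tag position is `c` for the conditional auxiliary by and `i` for imperatives, as well as the concrete tags below, are illustrative and should be checked against the tagset documentation; the function name is hypothetical.

```python
def verbmod(m_tags):
    """Derive ind | imp | cdn from the m-tags of the verb forms under one t-node."""
    # Second position (index 1) of each verbal tag; assumed: c = conditional, i = imperative.
    detailed = {tag[1] for tag in m_tags if tag[0] == "V"}
    if "c" in detailed:
        return "cdn"
    if "i" in detailed:
        return "imp"
    return "ind"

# jde (goes): a one-word form, decided by its own tag.
print(verbmod(["VB-S---3P-AA---"]))
# byl by šel (would have gone): three a-nodes, one of them the node by.
print(verbmod(["VpYS---XR-AA---", "Vc-X---3-------", "VpYS---XR-AA---"]))
```

The multi-word case only works because the t-node is linked to every a-node it absorbs, so the auxiliary's tag remains reachable even though the auxiliary has no t-node of its own.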

Grammateme assignment and co-reference
  the grammatemes gender, number and person of relative pronouns are left underspecified (value inher), since they are imposed only by grammatical agreement (and thus can be "inherited from the antecedents")
  Example: Ze zbytku suroviny mlékárna vyrábí sušené mléko, které vyváží do Asie a Jižní Ameriky.
    lit.: From remainder of raw material the dairy produces dried milk, which [it] exports to Asia and South America.
    'From the rest of the material, the dairy produces dried milk, which is exported [by it] to Asia and South America.'
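How a consumer of the data might fill in the value inher is sketched below, assuming the co-reference links are available as a simple mapping from a node to its antecedent. The dictionaries, node names and the `resolve` helper are hypothetical, and the grammateme values for mléko are given only for illustration.

```python
def resolve(node, antecedent_of, grammatemes):
    """Replace every 'inher' value by the antecedent's (recursively resolved) value."""
    resolved = {}
    for name, value in grammatemes[node].items():
        if value == "inher":
            source = resolve(antecedent_of[node], antecedent_of, grammatemes)
            value = source[name]
        resolved[name] = value
    return resolved

# The relative pronoun které refers back to mléko (milk).
grammatemes = {
    "které": {"gender": "inher", "number": "inher"},
    "mléko": {"gender": "neut", "number": "sg"},
}
antecedent_of = {"které": "mléko"}

print(resolve("které", antecedent_of, grammatemes))
```

Storing inher rather than a copied value avoids duplicating information that is already determined by the co-reference annotation.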


Final remarks
  achievements:
    two-level typing of t-layer nodes, which makes it possible to formally capture the presence/absence of individual grammatemes in a given node
    automatic procedure for assigning the node classification and the grammateme attributes
    verification of the procedure on large-scale data
  experience:
    it was the existence of the lower annotation layers and of the inter-layer links that allowed grammateme assignment to be made more or less automatic
