Systematic Parameterized Description of Pro-forms in the Prague Dependency Treebank 2.0 Magda Ševčíková Zdeněk Žabokrtský Institute of Formal and Applied.


Systematic Parameterized Description of Pro-forms in the Prague Dependency Treebank 2.0 Magda Ševčíková Zdeněk Žabokrtský Institute of Formal and Applied Linguistics Charles University Prague, Czech Republic

TLT
Outline of the talk
- Introduction
- Description of pro-forms in the PDT 2.0
  - Type 1: Personal pronouns
  - Type 2: Indefinite, negative, interrogative, and relative pronouns; pro-adverbs and pro-numerals
- Pro-forms in other languages
- Final remarks

Introduction
Pro-forms
- pronouns, pro-adverbs, and pro-numerals
- closed classes
- used to replace or substitute other words, phrases, or sentences
- anaphoric and deictic functions
- semantically relevant regularities within the sub-classes: nobody-never-nowhere, everybody-always-everywhere
Pro-forms in the PDT 2.0
- formal linguistic system for annotation of pro-forms
- makes the present regularities explicit
- part of the deep-syntactic layer (tectogrammatical layer, t-layer)
- representation by a reduced set of (underlying) lemmas in combination with relevant attributes

PDT project: Historical background
- mid-1960s: Functional Generative Description (Petr Sgall et al.)
- 1994: Czech National Corpus
- 1995: PDT started
- 1998: PDT 0.5 pre-release
- 2001: PDT 1.0 released by LDC (LDC2001T10); manual annotation of morphology and surface syntax
- 2006: PDT 2.0 released by LDC (LDC2006T01); interlinked morphological, surface-syntactic and complex deep-syntactic annotation

PDT 2.0: Layers of annotation
Example sentence: Lit.: [He] was would went to forest. Translation: [He] would have gone to the forest.
- Tectogrammatical layer: deep-syntactic dependency tree; 59 % of the a-layer data; 3,165 doc., 49,431 sent., 833,195 tokens
- Analytical layer: surface-syntactic dependency tree; 75 % of the m-layer data; 5,330 doc., 87,913 sent., 1,503,739 tokens
- Morphological layer: m-lemma and m-tag associated with each token; 7,110 textual documents, 115,844 sent., 1,957,247 tokens
- Word layer: original text, segmented on word boundaries
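The percentages on this slide can be checked against the quoted counts. The short sketch below is only an arithmetic illustration; the dictionary layout and the helper name `share` are assumptions introduced here, not part of the PDT tooling. Counted over documents, the ratios reproduce the 75 % and 59 % figures; counted over sentences or tokens they come out slightly different, which suggests the slide's percentages refer to documents.

```python
# Layer sizes exactly as quoted on the slide.
LAYERS = {
    "m-layer": {"documents": 7110, "sentences": 115844, "tokens": 1957247},
    "a-layer": {"documents": 5330, "sentences": 87913, "tokens": 1503739},
    "t-layer": {"documents": 3165, "sentences": 49431, "tokens": 833195},
}


def share(part: str, whole: str, unit: str) -> int:
    """Size of layer `part` as a rounded percentage of layer `whole`."""
    return round(100 * LAYERS[part][unit] / LAYERS[whole][unit])


if __name__ == "__main__":
    for unit in ("documents", "sentences", "tokens"):
        print(
            unit,
            "a/m:", share("a-layer", "m-layer", unit),
            "t/a:", share("t-layer", "a-layer", unit),
        )
```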


Description of pro-forms in the PDT 2.0
M-layer
- pronouns, pro-adverbs, and pro-numerals treated separately
- m-lemma, m-tag
T-layer
- 2 basic types of description
  - type 1: personal pronouns
  - type 2: indefinite, negative, interrogative, and relative pronouns together with pro-adverbs and pro-numerals
- semantic features originally present in the word form are extracted and stored as values of inner attributes of the t-node that corresponds to the given word form


Type 1: Personal pronouns in the PDT 2.0
- all personal pronouns (no matter whether they are pro-dropped or present in the sentence) are represented by nodes labeled with a single, artificial lemma #PersPron
- grammatical information expressed by a personal pronoun in the sentence is stored in the node attributes person, number, and gender
- the attribute politeness discerns between honorific and non-honorific usage
- example: vy jste přišel (you came, said politely to a single person) is represented as #PersPron + 2nd person + singular + masc.anim. + polite
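The representation of the polite "vy jste přišel" example can be sketched as a small record type. This is only an illustration: the class name `PersPronNode` and the spelling of the attribute values are assumptions taken from the wording of the slide, not the actual PDT 2.0 data format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PersPronNode:
    """Hypothetical t-node for a personal pronoun (type 1 description)."""

    person: str
    number: str
    gender: str
    politeness: str
    # Every personal pronoun, overt or pro-dropped, shares this artificial lemma.
    t_lemma: str = "#PersPron"


# "vy jste přišel": 'you came', said politely to a single person.
# Value spellings follow the slide and are illustrative only.
vy_polite = PersPronNode(
    person="2nd person",
    number="singular",
    gender="masc.anim.",
    politeness="polite",
)

if __name__ == "__main__":
    print(vy_polite)
```

The point of the design is that the surface form (vy) does not appear in the node at all; everything it expressed is carried by the attributes.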

Type 1: Personal pronouns and co-reference
- at the t-layer, the representation of personal pronouns was completed with the annotation of co-reference (i.e. relations between nodes referring to the same entity)
Example: Tím, že Evropská unie nechala ve rwandské operaci Francii na holičkách, podle Léotarda ukázala, že její politika nemá žádný africký rozměr.
Translation: According to Léotard, by the fact that the European Union left France in the lurch concerning the Rwanda operation, [it] has shown that its politics has no African dimension.

Type 2: Indefinite, negative, interrogative, and relative pronouns in the PDT 2.0
- in Czech, single meanings are expressed regularly by means of a relatively small group of prefixes that join together with a small set of bases
- transparent correspondence between the semantic features and the formal composition of pronouns:
  - indefinite prefix ně-: někdo (somebody) – něco (something) – nějaký (some)
  - negative prefix ni-: nikdo (nobody) – nic (nothing)…
- at the t-layer, pronouns with the same base element are grouped together, and each pronoun group is represented by the lemma corresponding to the respective relative pronoun, e.g. někdo (somebody) and nikdo (nobody) are represented by the lemma kdo (who)
- corresponding possessive pronouns are represented in the same way as the non-possessive ones
- the semantic feature completing the reduced lemma is stored in the indeftype attribute
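The lemma reduction described on this slide can be sketched as a lookup from a surface pronoun to a pair of reduced lemma and indeftype value. This is a minimal illustration, not the PDT annotation procedure: the names `TNode`, `REDUCED`, and `group_by_lemma` are hypothetical, the value name `negat` is an assumption (the slides name only `indef1` explicitly), and the entries for něco, nic, and nějaký are inferred from the grouping principle stated above.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class TNode(NamedTuple):
    """Hypothetical type 2 description: reduced lemma plus indeftype value."""

    t_lemma: str
    indeftype: str


# Surface pronoun -> t-layer description.
# "negat" is an assumed value name; only "indef1" appears in the slides.
REDUCED: Dict[str, TNode] = {
    "někdo": TNode("kdo", "indef1"),    # somebody
    "nikdo": TNode("kdo", "negat"),     # nobody
    "něco": TNode("co", "indef1"),      # something (inferred)
    "nic": TNode("co", "negat"),        # nothing (inferred)
    "nějaký": TNode("jaký", "indef1"),  # some (inferred)
}


def group_by_lemma(table: Dict[str, TNode]) -> Dict[str, List[str]]:
    """Group surface forms that share the same reduced lemma."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for form, node in table.items():
        groups[node.t_lemma].append(form)
    return dict(groups)


if __name__ == "__main__":
    for lemma, forms in group_by_lemma(REDUCED).items():
        print(lemma, forms)
```

Grouping by the reduced lemma recovers the pronoun paradigms, while the indeftype value keeps the distinction that the prefix expressed on the surface.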

Type 2: Indefinite, negative, interrogative, and relative pronouns and the indeftype attribute
- all indefinite, negative, interrogative, and relative pronouns are represented by only four lemmas at the t-layer
- the reduced lemmas are completed by a value of the indeftype attribute
- the indeftype attribute has 11 values

Type 2: Pro-adverbs and pro-numerals in the PDT 2.0
- in Czech, pro-adverbs (e.g. nikde (nowhere), nějak (somehow)) and pro-numerals (e.g. několik (a few)) share certain semantic features with pronouns
- they are represented at the t-layer in the same way as indefinite, negative, interrogative, and relative pronouns
- another derivational relation can be seen between pro-adverbs with directional meaning and those of location
- for example, the adverb odněkud (from somewhere) is represented as: lemma kde (where) + indef1 value (of the indeftype attribute) + functor DIR1 capturing the directional meaning
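The odněkud example adds a third component, the functor, to the lemma and indeftype pair. The sketch below shows one way to hold the three together; the class name `ProFormNode` and the helper `same_base` are hypothetical, the value name `negat` for nikde is an assumption, and no functor is given for nikde because the slides do not state one.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ProFormNode:
    """Hypothetical t-node for a pro-adverb (type 2 description)."""

    t_lemma: str
    indeftype: str
    functor: Optional[str] = None  # left empty when the slides give none


# odněkud (from somewhere): values as given on the slide.
odnekud = ProFormNode(t_lemma="kde", indeftype="indef1", functor="DIR1")

# nikde (nowhere): "negat" is an assumed value name, functor not stated.
nikde = ProFormNode(t_lemma="kde", indeftype="negat")


def same_base(a: ProFormNode, b: ProFormNode) -> bool:
    """True when two pro-forms reduce to the same underlying lemma."""
    return a.t_lemma == b.t_lemma


if __name__ == "__main__":
    print(odnekud)
    print(same_base(odnekud, nikde))
```

Keeping the directional meaning in the functor rather than in the lemma is what lets odněkud and nikde fall under the same base kde.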

Example: Zakládá-li si někdo na tom, že se vyhýbá cizím slovům, pak udělá nejlíp, když se nikdy nepodívá do Etymologického slovníku jazyka českého.
Translation: If someone finds it important that [he] eliminates foreign words, then the best thing [he] can do is if [he] never looks in the Etymology Dictionary of Czech.


Pro-forms in other languages: PDT-like description
- in other languages, too, indefinite, negative, interrogative, and relative pronouns and other pro-forms are unproductive classes with (at least to a certain extent) transparent derivational relations
- preliminary sketch of several English and German pronouns
- still not solved: English anybody, German niemand and nirgendjemand …

Pro-forms in other languages: Helbig’s MultiNet
Negative pro-adverbs
- with local meaning: Der Lehrer findet nirgends einen Fehler. Lit.: The teacher finds nowhere a mistake.
- with directional meaning: Peter fährt in den Ferien nirgendwo hin. Lit.: Peter goes on holiday nowhere.
In Helbig, H. (2001), Die semantische Struktur natürlicher Sprache, Springer, p. 174


Final remarks
Achievements:
- all pro-forms in Czech are divided into two groups:
  - personal (and corresponding possessive) pronouns
  - indefinite, negative, interrogative, and relative pronouns (and corresponding possessive pronouns), pro-adverbs, and pro-numerals
- several pro-form analogies crossing the part-of-speech boundaries are explicitly marked in the annotation
- verification of the formal system on large-scale data
Future work:
- to elaborate the system for other languages in more detail, taking into consideration specific phenomena of the respective language
- to describe the relations among pro-form systems in more languages (for example, for the purposes of machine translation)
