6th Intex Workshop & 10 years of (Silberztein, 1993), Sofia, 28-30 May 2003


Conversion between Intex and MULTEXT-East Morphosyntactic Descriptions

Cvetana Krstev, Duško Vitas (University of Belgrade)
Tomaž Erjavec (Jožef Stefan Institute, Ljubljana)

Motivation

General:
- use of different tools
- use of multilingual resources
- comparison of results in NLP

Specific:
- inclusion of the Serbian language in the MULTEXT-East specification
- production of Slovenian Intex resources
- production of a tagged Serbian translation of Orwell's 1984

MULTEXT-East morphosyntactic specification

Aim: an exhaustive description of the morphological and morphosyntactic features of different languages, and the establishment of unique codes for common features.

Scope: English, Romanian, Slovene, Czech, Bulgarian, Estonian, Hungarian, Croatian (Concede), and Serbian.

MULTEXT-East types or PoS (new types cannot be introduced)

Nouns (N)
Verbs (V)
Adjectives (A)
Pronouns (P)
Determiners (D)
Adpositions (S)
Conjunctions (C)
Numerals (M)
Interjections (I)
Abbreviations (Y)
Particles (Q)
Adverbs (R)
Articles (T)
Residuals (X)

Type attributes

- Each type has a set of attributes that are appropriate to it.
- Each type attribute has its own position in the MSD.
- Adding new attributes to a type is not recommended.

Attribute values

- A set of values is assigned to each attribute.
- Each value is coded by one alphanumeric character.
- New values can be added to an attribute if necessary.

Adjective attribute values/1

Adjective (A), 13 positions. Languages: EN RO SL CS BG ET HU HR SR (one x per language that uses the value).

P  ATT     VAL            C  x x x x x x x x x
1  Type    qualificative  f  x x x x x x x
           indefinite     i
           possessive     s  x x x x
           ordinal        o  x x
2  Degree  positive       p  x x x x x x x x
           comparative    c  x x x x x x x x
           superlative    s  x x x x x x x x
           elative        e  x x

Adjective attribute values/2

Languages: EN RO SL CS BG ET HU HR SR (one x per language that uses the value).

P  ATT     VAL         C  x x x x x x x x x
3  Gender  masculine   m  x x x x x x
           feminine    f  x x x x x x
           neuter      n  x x x x x x
4  Number  singular    s  x x x x x x x x
           plural      p  x x x x x x x x
           dual        d  x x
           paucal      c  x
5  Case    nominative  n  x x x x x x
           genitive    g  x x x x x x
           dative      d  x x x x x
           accusative  a  x x x x x
           ... (various more values)

Adjective attribute values/3

Languages: EN RO SL CS BG ET HU HR SR (one x per language that uses the value).

P  ATT           VAL        C
6  Definiteness  no         n  x x x x x
                 yes        y  x x x x x
                 short_art  s  x
                 full_art   f  x
7  Clitic        no         n  x
                 yes        y  x
8  Animate       no         n  x x x x x
                 yes        y  x x x x x
9  Formation     nominal    n  x
                 compound   c  x
   various Hungarian-specific attributes ...

An example from the Slovenian MULTEXT-East dictionary

čistejši  čist  Afcfda
The lemma čist (Engl. clean) corresponds to the simple word form čistejši; it is a qualificative (f) adjective (A) in the comparative degree (c), feminine gender (f), dual number (d), and accusative case (a).

čistejši  čist  Afcmsa--n
The lemma čist (Engl. clean) corresponds to the simple word form čistejši; it is a qualificative (f) adjective (A) in the comparative degree (c), masculine gender (m), singular number (s), accusative case (a), and not animate (n).
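The positional reading of the two dictionary entries can be sketched in a few lines of Python. This is only an illustration: the names `ADJ_ATTRIBUTES`, `ADJ_VALUES`, and `decode_adjective_msd` are hypothetical, the tables contain only the values shown on the adjective slides (the Case table is therefore incomplete), and the Hungarian-specific positions are left out.

```python
# Attribute order for adjectives, as listed on the adjective slides.
ADJ_ATTRIBUTES = ["Type", "Degree", "Gender", "Number", "Case",
                  "Definiteness", "Clitic", "Animate", "Formation"]

# Value codes copied from the adjective tables; Case lists only the four
# values that the slides show.
ADJ_VALUES = {
    "Type": {"f": "qualificative", "i": "indefinite",
             "s": "possessive", "o": "ordinal"},
    "Degree": {"p": "positive", "c": "comparative",
               "s": "superlative", "e": "elative"},
    "Gender": {"m": "masculine", "f": "feminine", "n": "neuter"},
    "Number": {"s": "singular", "p": "plural", "d": "dual", "c": "paucal"},
    "Case": {"n": "nominative", "g": "genitive",
             "d": "dative", "a": "accusative"},
    "Definiteness": {"n": "no", "y": "yes",
                     "s": "short_art", "f": "full_art"},
    "Clitic": {"n": "no", "y": "yes"},
    "Animate": {"n": "no", "y": "yes"},
    "Formation": {"n": "nominal", "c": "compound"},
}


def decode_adjective_msd(msd):
    """Return {attribute: value} for an adjective MSD such as 'Afcmsa--n'."""
    if not msd.startswith("A"):
        raise ValueError("not an adjective MSD: " + msd)
    features = {}
    # Character i of the MSD (after the type letter) fills attribute i.
    for attribute, code in zip(ADJ_ATTRIBUTES, msd[1:]):
        if code == "-":  # '-' marks an attribute that does not apply
            continue
        features[attribute] = ADJ_VALUES[attribute].get(code, "unknown:" + code)
    return features


print(decode_adjective_msd("Afcfda"))
print(decode_adjective_msd("Afcmsa--n"))
```

Because every attribute has a fixed position, the decoder needs no separators; a hyphen simply holds the place of an attribute that is not used, as in the second entry.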

The first sentence of the Slovene translation of Orwell's 1984, tagged

Bil je jasen, mrzel aprilski dan in ure so bile trinajst

Intex MSD for Serbian

One DELAS entry:
cyist,A17

One of its corresponding DELAF entries:
cyistiji,cyist.A17:bems1g:bems4q:bems5g:bemp1g:bemp5g

It is produced by the regular expression A17.exp:
ijemu/:bems3g:bems7g:bens3g:bens7g + iji/:bems1g:bems4q:bems5g:bemp1g:bemp5g + o/:aens1g:aens4g:aens5g

Attributes and their values for Serbian adjectives in DELAS/DELAF

Attribute     Value                           Code
degree        positive                        a
              comparative                     b
              superlative                     c
definiteness  no                              k
              yes                             d
              not applicable                  e
gender        masculine                       m
              feminine                        f
              neuter                          n
number        singular                        s
              plural                          p
case          nominative                      1
              genitive                        2
              dative                          3
              accusative                      4
              vocative                        5
              instrumental                    6
              locative                        7
animate       yes                             v
              no                              q
              not applicable (not important)  g
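As an illustration of how these codes are read, the following Python sketch splits a DELAF adjective entry into its form, lemma, inflectional class, and decoded readings. The names `POSITIONS` and `parse_delaf_adjective` are hypothetical, and the position order (degree, definiteness, gender, number, case, animate) is inferred from the example codes such as bems1g rather than stated on the slides.

```python
# Code tables copied from the slide. The position order is an assumption
# inferred from example codes such as 'bems1g'.
POSITIONS = [
    ("degree", {"a": "positive", "b": "comparative", "c": "superlative"}),
    ("definiteness", {"k": "no", "d": "yes", "e": "not applicable"}),
    ("gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("number", {"s": "singular", "p": "plural"}),
    ("case", {"1": "nominative", "2": "genitive", "3": "dative",
              "4": "accusative", "5": "vocative", "6": "instrumental",
              "7": "locative"}),
    ("animate", {"v": "yes", "q": "no", "g": "not applicable"}),
]


def parse_delaf_adjective(entry):
    """Split 'form,lemma.CLASS:code:code...' and decode every code."""
    form, rest = entry.split(",", 1)
    head, *codes = rest.split(":")
    lemma, inflection_class = head.split(".", 1)
    readings = [
        {name: table[char] for (name, table), char in zip(POSITIONS, code)}
        for code in codes
    ]
    return {"form": form, "lemma": lemma,
            "class": inflection_class, "readings": readings}


entry = "cyistiji,cyist.A17:bems1g:bems4q:bems5g:bemp1g:bemp5g"
parsed = parse_delaf_adjective(entry)
print(parsed["lemma"], parsed["class"])
for reading in parsed["readings"]:
    print(reading)
```

Each colon-separated code is one reading of the same word form, so a single DELAF entry expands into several feature sets.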

Syntactic and semantic marks in Serbian DELAS

category      tag         applied to                explanation               example
syntactic     +p2         prepositions              noun is in genitive       bez,PREP+p2
              +Ref        verbs                     reflexive                 dicyiti,V551+Imperf+It+Ref
              +MG         nouns                     masculine natural gender  budala,N601+Hum+MG
              +FG
derivational  +VN         nouns                     verbal noun               kiselxenxe,N300+VN
              +Adj        adverbs                   derived from adjectives   fanaticyno,ADV+Adj
              +DerOvaIra  verbs, nouns, adjectives  derivational variety      dezinfikovati,V18+Imperf+...+DerOvaIra
semantic      +Col        adjectives                colors                    zelenkastosiv,A6+Col
              +Hum        nouns                     human                     lxubavnica,N601+Hum
              +Mat        adjectives                material                  kozxnat,A6+Mat
dialectal     +Ek         all                       Ekavian                   nedelxa,N600+Ek
              +Cr         all                       Croatism                  izopcxen,A1+PP+Cr

Problems of correspondence between MULTEXT-East MSD and Intex/1

- The existing coding schema has to be imposed on a particular language.

Example: how should the present and past active gerunds be encoded? In Serbian, for the verb ići (Engl. to go), these gerunds are idući and išavši. The verb tables of the MULTEXT-East specification contain attributes that describe them; however, no Slavic language except Bulgarian uses them.

Problems/2

- A common encoding schema does not guarantee that true standardization is achieved.

Example: only in Bulgarian do we find the attribute value 'adjectival' for adverbs (with the examples 'umno, veselo, studeno'), although the other Slavic languages, at least, could also make use of that value of the attribute Type.

Problems/3: Encoding of verb tenses

Languages: EN RO SL CS BG ET HU HR SR (one x per language that uses the value).

P  ATT    VAL            C  x x x x x x x x x
2  VForm  indicative     i  x x x x x x x x x
          subjunctive    s  x
          imperative     m  x x x x x x x x
          conditional    c  x x x x x x x
          infinitive     n  x x x x x x x x
          participle     p  x x x x x x x x
          gerund         g  x x x
          supine         u  x x
          transgressive  t  x
          quotative      q  x
3  Tense  present        p  x x x x x x x x x
          imperfect      i  x x x x x
          future         f  x x x x
          past           s  x x x x x x x x x
          pluperfect     l  x x x
          aorist         a  x x x

Problems/3

The second attribute specifies the verb form, and the third the tense. However, because of compound tenses, some verb forms are used to construct several different tenses. In Slovenian, the verb form imel is the past participle of the verb imeti (Engl. to have). It forms the perfect tense when used with the present indicative of the copula biti (Engl. to be), and the conditional when used with the conditional form of the same copula.

Problems/3

Winston Smith je imel
da bi ga imel

Problems/4

- The different interpretation of various grammatical categories across languages and the lack of a clear cross-linguistic correspondence are discussed in Przepiórkowski (EACL 2003), for example the dual number in Slovene and the paucal in Serbian.
- Certain morphosyntactic phenomena have not been taken into consideration, such as various problems of agreement (Vitas, Krstev, to appear).

Application of the MSD/Intex mapping to the Serbian 1984

{S}{Bio,biti.V77:Gsm}
({je,jesam.V575+Imperf+It+Iref+Aux:Pzsi} + {je,on.PRO+Prs:sz2fi:sz4fi})
{vedar,.A18:akms1g:akms4q}
({i,.CONJ} + {i,.PAR})
{hladan,.A18:akms1g:akms4q}
{aprilski,.A2+PosQ:adms1g:aems4q:aems5g:aemp1g:aemp5g}
({dan,.A1+PP:akms1g:aems4q} + {dan,dati.V103+Perf+Tr+Iref+Ref:Tms}) ;
{S} ({na,.PREP+p4} + {na,.PREP+p7})
{cyasovnicima,.?}
({je,jesam.V575+Imperf+It+Iref+Aux:Pzsi} + {je,on.PRO+Prs:sz2fi:sz4fi})
{izbijalo,izbijati.V101+Perf+Tr+It+Iref:Gsn}
{trinaest,.?}.

Tool that facilitates the lemmatization and disambiguation

Tagged Serbian translation of 1984 after hand disambiguation and resolution of unknown words

{S}{Bio,biti.V77:Gsm}
{je,jesam.V575+Imperf+It+Iref+Aux:Pzsi}
{vedar,.A18:akms1g}
{i,.CONJ}
{hladan,.A18:akms1g}
{aprilski,.A2+PosQ:adms1g}
{dan,.N1:ms1q} ;
{S} {na,.PREP+p7}
{cyasovnicima,cyasovnik.N5:mp7q}
{je,jesam.V575+Imperf+It+Iref+Aux:Pzsi}
{izbijalo,izbijati.V101+Perf+Tr+It+Iref:Gsn}
{trinaest,.Num+Car}.
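A disambiguated text in this format is straightforward to read back into structured tokens. The Python sketch below is illustrative only: `TOKEN` and `parse_tagged` are hypothetical names, and it assumes that an empty lemma field (as in `{vedar,.A18:akms1g}`) means the lemma is identical to the word form.

```python
import re

# One tagged token: {form,lemma.CATEGORY+mark+mark:code:code}
# The sentence delimiter {S} contains no comma, so it is skipped.
TOKEN = re.compile(r"\{([^,{}]+),([^.{}]*)\.([^{}]*)\}")


def parse_tagged(text):
    """Return one dict per tagged token found in an Intex-tagged text."""
    tokens = []
    for form, lemma, info in TOKEN.findall(text):
        head, *codes = info.split(":")        # inflectional codes follow ':'
        category, *marks = head.split("+")    # '+' introduces each mark
        tokens.append({
            "form": form,
            "lemma": lemma or form,  # assumption: empty lemma = same as form
            "category": category,
            "marks": marks,
            "codes": codes,
        })
    return tokens


sample = ("{S}{Bio,biti.V77:Gsm} {je,jesam.V575+Imperf+It+Iref+Aux:Pzsi} "
          "{vedar,.A18:akms1g} {na,.PREP+p7} {trinaest,.Num+Car}.")
for token in parse_tagged(sample):
    print(token)
```

The category field still carries the inflectional class number (for example V77 or A18), which is the part of the Intex tag that has no counterpart in a MULTEXT-East MSD.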

A simple Perl script maps Serbian Intex codes to MULTEXT-East MSD

    if (($POS eq "V") && ($kategorije !~ /[XS]/)) {    # the word is a verb
        $glagol = "V" . (" " x 14);                    # one placeholder per verb attribute

        if ($semkat =~ /Aux/) {                        # type, attribute 1
            substr($glagol, 1, 1) = "a";
        } else {
            substr($glagol, 1, 1) = "m";
        }

        if ($kategorije =~ /([WYGTIFA])/) {            # verb form, attribute 2
            substr($glagol, 2, 1) = $1;
        }
        $glagol =~ tr/WYGTIFA/nmppiii/;
        if (($lema eq "biti") && ($kategorije =~ /A/)) {
            substr($glagol, 2, 1) = "c";               # aorist of biti forms the conditional
        }

        if ($kategorije =~ /([PIFAGY])/) {             # tense, attribute 3
            substr($glagol, 3, 1) = $1;
        }
        $glagol =~ tr/PIFAGY/pofasp/;

        if ($kategorije =~ /([xyz])/) {                # person, attribute 4
            substr($glagol, 4, 1) = $1;
        }
        $glagol =~ tr/xyz/123/;

Tagged Serbian 1984 using MULTEXT-East MSD

Bio je vedar i hladan aprilski dan na cyasovnicima je izbijalo trinaest

Conclusion

- It is possible to convert from Intex to MULTEXT-East.
- It is possible to convert from MULTEXT-East to Intex only to a certain extent: some information cannot be recovered, such as the inflectional class code.
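The asymmetry between the two directions can be illustrated with adjectives. The Python sketch below maps a six-character Intex adjective code to a MULTEXT-East MSD using the code tables from the slides. It is not the authors' script: the function name `intex_adjective_to_msd` is hypothetical, the default type "f" (qualificative) is an assumption, and the MSD case letters v, i, and l for vocative, instrumental, and locative are assumed because the slides list only n, g, d, and a.

```python
# Intex code -> MULTEXT-East code, per attribute.
DEGREE = {"a": "p", "b": "c", "c": "s"}
DEFINITENESS = {"k": "n", "d": "y", "e": "-"}   # 'e' = not applicable
CASE = {"1": "n", "2": "g", "3": "d", "4": "a",
        "5": "v", "6": "i", "7": "l"}           # v, i, l are assumptions
ANIMATE = {"v": "y", "q": "n", "g": "-"}        # 'g' = not applicable


def intex_adjective_to_msd(code, adj_type="f"):
    """Map an Intex adjective code such as 'bems4q' to an MSD string."""
    degree, definite, gender, number, case, animate = code
    msd = ("A" + adj_type + DEGREE[degree]
           + gender + number                    # same letters in both schemes
           + CASE[case] + DEFINITENESS[definite]
           + "-"                                # Clitic: not used here
           + ANIMATE[animate])
    return msd.rstrip("-")                      # drop trailing placeholders


print(intex_adjective_to_msd("bems4q"))
print(intex_adjective_to_msd("akms1g"))
```

The function consumes only the inflectional code. The class code of the DELAF entry (A17 in cyistiji,cyist.A17:...) has no position in the MSD, so it is dropped on the way to MULTEXT-East and cannot be restored by the reverse mapping.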

Noun attributes

1. Type
2. Gender
3. Number
4. Case
5. Definiteness
6. Clitic
7. Animate
8. Owner_Number
9. Owner_Person
10. Owned_Number

Verb attributes

1. Type
2. VForm
3. Tense
4. Person
5. Number
6. Gender
7. Voice
8. Negative
9. Definiteness
10. Clitic
11. Case
12. Animate
13. Clitic_s
14. Aspect

Adjective attributes

1. Type
2. Degree
3. Gender
4. Number
5. Case
6. Definiteness
7. Clitic
8. Animate
9. Formation
10. Owner_Number
11. Owner_Person
12. Owned_Number

Adverb attributes

1. Type
2. Degree
3. Clitic
4. Number
5. Person
6. Wh_Type

Values of the attribute VForm of the type Verb

indicative (i)
subjunctive (s)
imperative (m)
conditional (c)
infinitive (n)
participle (p)
gerund (g)
supine (u)
transgressive (t)
quotative (q)

Values of the attribute Tense of the type Verb

present (p)
imperfect (i)
future (f)
past (s)
pluperfect (l)
aorist (a)