6th Intex Workshop & 10 years of Intex (Silberztein, 1993)
Sofia, May 2003

Conversion between Intex and MULTEXT-East Morphosyntactic Descriptions
Cvetana Krstev, Duško Vitas, University of Belgrade
Tomaž Erjavec, Jožef Stefan Institute, Ljubljana

Motivation
general
    use of different tools
    use of multilingual resources
    comparison of results in NLP
specific
    inclusion of the Serbian language in the MULTEXT-East specification
    production of Slovenian Intex resources
    production of a tagged Serbian translation of Orwell's 1984

MULTEXT-East morphosyntactic specification
aim
    exhaustive description of the morphological and morphosyntactic features of different languages, and establishment of unique codes for common features
scope
    English, Romanian, Slovene, Czech, Bulgarian, Estonian, Hungarian, Croatian (Concede), and Serbian

MULTEXT-East types or PoS (new types cannot be introduced)
    Nouns (N)
    Verbs (V)
    Adjectives (A)
    Pronouns (P)
    Determiners (D)
    Adpositions (S)
    Conjunctions (C)
    Numerals (M)
    Interjections (I)
    Abbreviations (Y)
    Particles (Q)
    Adverbs (R)
    Articles (T)
    Residuals (X)

Type attributes
    Each type has a set of attributes that are appropriate to it.
    Each type attribute has its own position in the MSD.
    It is not recommended to add new attributes to a type.

Attribute values
    A set of values is assigned to each attribute.
    Each value is coded by one alphanumeric character.
    New values can be added to an attribute if necessary.

Adjective attribute values/1
Adjective (A), 13 positions
Languages: EN RO SL CS BG ET HU HR SR (one x per language that uses the value)

P  Attribute  Value          Code  Languages
1  Type       qualificative  f     x x x x x x x
              indefinite     i
              possessive     s     x x x x
              ordinal        o     x x
2  Degree     positive       p     x x x x x x x x
              comparative    c     x x x x x x x x
              superlative    s     x x x x x x x x
              elative        e     x x

Adjective attribute values/2
Languages: EN RO SL CS BG ET HU HR SR (one x per language that uses the value)

P  Attribute  Value       Code  Languages
3  Gender     masculine   m     x x x x x x
              feminine    f     x x x x x x
              neuter      n     x x x x x x
4  Number     singular    s     x x x x x x x x
              plural      p     x x x x x x x x
              dual        d     x x
              paucal      c     x
5  Case       nominative  n     x x x x x x
              genitive    g     x x x x x x
              dative      d     x x x x x
              accusative  a     x x x x x
              ... (various more values)

Adjective attribute values/3
Languages: EN RO SL CS BG ET HU HR SR (one x per language that uses the value)

P  Attribute     Value      Code  Languages
6  Definiteness  no         n     x x x x x
                 yes        y     x x x x x
                 short_art  s     x
                 full_art   f     x
7  Clitic        no         n     x
                 yes        y     x
8  Animate       no         n     x x x x x
                 yes        y     x x x x x
9  Formation     nominal    n     x
                 compound   c     x
   various Hungarian-specific attributes ...

An example from the Slovenian MULTEXT-East dictionary

čistejši    čist    Afcfda
    The lemma čist (Engl. clean) corresponds to the simple word form čistejši; it is described as a qualificative (f) adjective (A) in the comparative degree (c), feminine gender (f), dual number (d), and accusative case (a).

čistejši    čist    Afcmsa--n
    The lemma čist (Engl. clean) corresponds to the simple word form čistejši; it is described as a qualificative (f) adjective (A) in the comparative degree (c), masculine gender (m), singular number (s), accusative case (a), and not animate (n).

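The positional reading of the two MSDs above can be sketched in a few lines of Python. This is only an illustration: the attribute order and value codes are copied from the adjective tables in this talk (cases beyond the four listed there are left out), the name decode_adjective_msd is hypothetical, and a hyphen is treated as "attribute not applicable", as in Afcmsa--n.

```python
# Adjective attributes 1-8 with the value codes shown in the adjective tables
# of this talk. Codes that are not listed are returned unchanged.
ADJECTIVE_ATTRIBUTES = [
    ("Type", {"f": "qualificative", "i": "indefinite", "s": "possessive", "o": "ordinal"}),
    ("Degree", {"p": "positive", "c": "comparative", "s": "superlative", "e": "elative"}),
    ("Gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("Number", {"s": "singular", "p": "plural", "d": "dual", "c": "paucal"}),
    ("Case", {"n": "nominative", "g": "genitive", "d": "dative", "a": "accusative"}),
    ("Definiteness", {"n": "no", "y": "yes", "s": "short_art", "f": "full_art"}),
    ("Clitic", {"n": "no", "y": "yes"}),
    ("Animate", {"n": "no", "y": "yes"}),
]


def decode_adjective_msd(msd):
    """Return {attribute: value} for an adjective MSD such as 'Afcfda'."""
    if not msd.startswith("A"):
        raise ValueError("not an adjective MSD: " + msd)
    decoded = {}
    # Position 0 is the type letter 'A'; attribute i sits at position i.
    for (name, values), code in zip(ADJECTIVE_ATTRIBUTES, msd[1:]):
        if code == "-":  # hyphen: attribute not applicable to this form
            continue
        decoded[name] = values.get(code, code)
    return decoded


if __name__ == "__main__":
    print(decode_adjective_msd("Afcfda"))
    print(decode_adjective_msd("Afcmsa--n"))
```

Because every attribute has a fixed position, the decoder needs no separators; trailing attributes that do not apply are simply omitted from the MSD.
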
The first sentence of the Slovene translation of Orwell's 1984, tagged

Bil je jasen, mrzel aprilski dan in ure so bile trinajst

Intex MSD for Serbian

one DELAS entry
    cyist,A17

one of its corresponding DELAF entries
    cyistiji,cyist.A17:bems1g:bems4q:bems5g:bemp1g:bemp5g

produced by the regular expression A17.exp
    ijemu/:bems3g:bems7g:bens3g:bens7g
    + iji/:bems1g:bems4q:bems5g:bemp1g:bemp5g
    + o/:aens1g:aens4g:aens5g

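How the DELAF entry follows from the DELAS entry and the A17 expression can be illustrated with a short Python sketch. It assumes that each alternative of the expression is a plain "suffix/codes" pair that is appended to the lemma, which is all the fragment shown here requires; it does not model any other operators an inflection description might use, and the function name inflect is hypothetical.

```python
# The three alternatives of A17.exp as shown on the slide.
A17_EXP = (
    "ijemu/:bems3g:bems7g:bens3g:bens7g"
    " + iji/:bems1g:bems4q:bems5g:bemp1g:bemp5g"
    " + o/:aens1g:aens4g:aens5g"
)


def inflect(delas_entry, expression):
    """Expand a DELAS entry 'lemma,class' into DELAF lines.

    Assumption: every alternative is 'suffix/codes' and the suffix is simply
    concatenated to the lemma.
    """
    lemma, inflection_class = delas_entry.split(",", 1)
    delaf = []
    for alternative in expression.split("+"):
        suffix, codes = alternative.strip().split("/", 1)
        delaf.append(f"{lemma}{suffix},{lemma}.{inflection_class}{codes}")
    return delaf


if __name__ == "__main__":
    for line in inflect("cyist,A17", A17_EXP):
        print(line)
```

The second generated line is the DELAF entry for cyistiji quoted above; the inflectional class code A17 is carried along in every generated entry.
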
Attributes and their values for Serbian adjectives in DELAS/DELAF

Attribute     Value                           Code
degree        positive                        a
              comparative                     b
              superlative                     c
definiteness  no                              k
              yes                             d
              not applicable                  e
gender        masculine                       m
              feminine                        f
              neuter                          n
number        singular                        s
              plural                          p
case          nominative                      1
              genitive                        2
              dative                          3
              accusative                      4
              vocative                        5
              instrumental                    6
              locative                        7
animate       yes                             v
              no                              q
              not applicable (not important)  g

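As an illustration, the table can be applied to an Intex code such as bems1g with a small Python lookup. The order of the attributes inside a code (degree, definiteness, gender, number, case, animate) is not stated in the table; it is inferred here from the DELAF examples in this talk and should be read as an assumption, as should the name decode_intex_adjective.

```python
# Value codes for Serbian adjectives, copied from the DELAS/DELAF table.
# The position of each attribute within a code is an assumption inferred
# from examples such as 'bems1g' and 'akms4q'.
INTEX_ADJECTIVE_CODES = [
    ("degree", {"a": "positive", "b": "comparative", "c": "superlative"}),
    ("definiteness", {"k": "no", "d": "yes", "e": "not applicable"}),
    ("gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("number", {"s": "singular", "p": "plural"}),
    ("case", {
        "1": "nominative", "2": "genitive", "3": "dative", "4": "accusative",
        "5": "vocative", "6": "instrumental", "7": "locative",
    }),
    ("animate", {"v": "yes", "q": "no", "g": "not applicable"}),
]


def decode_intex_adjective(code):
    """Decode one six-character adjective code, e.g. 'bems1g'."""
    if len(code) != len(INTEX_ADJECTIVE_CODES):
        raise ValueError("expected 6 characters, got: " + code)
    return {
        name: values[char]  # KeyError signals a code outside the table
        for (name, values), char in zip(INTEX_ADJECTIVE_CODES, code)
    }


if __name__ == "__main__":
    print(decode_intex_adjective("bems1g"))
    print(decode_intex_adjective("akms4q"))
```

Unlike a MULTEXT-East MSD, every code here has the same length, and a value that does not apply is spelled out with its own letter (e or g) instead of being skipped.
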
Syntactic and semantic marks in Serbian DELAS

category      tag         applied to                 explanation              example
syntactic     +p2         prepositions               noun is in genitive      bez,PREP+p2
              +Ref        verbs                      reflexive                dicyiti,V551+Imperf+It+Ref
              +MG         nouns                      masculine natural gender budala,N601+Hum+MG
              +FG
derivational  +VN         nouns                      verbal noun              kiselxenxe,N300+VN
              +Adj        adverbs                    derived from adjectives  fanaticyno,ADV+Adj
              +DerOvaIra  verbs, nouns, adjectives   derivational variety     dezinfikovati,V18+Imperf+...+DerOvaIra
semantic      +Col        adjectives                 colors                   zelenkastosiv,A6+Col
              +Hum        nouns                      human                    lxubavnica,N601+Hum
              +Mat        adjectives                 material                 kozxnat,A6+Mat
dialectal     +Ek         all                        ekavian                  nedelxa,N600+Ek
              +Cr         all                        croatism                 izopcxen,A1+PP+Cr

Problems of correspondence between MULTEXT-East MSD and Intex/1
The existing coding scheme has to be imposed on each particular language.
Example: how should the present and past active gerunds be encoded? In Serbian, for the verb ići (Engl. to go), these gerunds are idući and išavši. The verb tables of the MULTEXT-East specification contain attributes that describe them; however, no Slavic language except Bulgarian uses them.

Problems/2
A common encoding scheme does not guarantee that true standardization is achieved.
Example: only in Bulgarian do we find the attribute value 'adjectival' for adverbs (with the examples 'umno, veselo, studeno'), although other Slavic languages, at least, could make use of that value of the attribute Type.

Problems/3
Encoding of verb tenses
Languages: EN RO SL CS BG ET HU HR SR (one x per language that uses the value)

P  Attribute  Value          Code  Languages
2  VForm      indicative     i     x x x x x x x x x
              subjunctive    s     x
              imperative     m     x x x x x x x x
              conditional    c     x x x x x x x
              infinitive     n     x x x x x x x x
              participle     p     x x x x x x x x
              gerund         g     x x x
              supine         u     x x
              transgressive  t     x
              quotative      q     x
3  Tense      present        p     x x x x x x x x x
              imperfect      i     x x x x x
              future         f     x x x x
              past           s     x x x x x x x x x
              pluperfect     l     x x x
              aorist         a     x x x

Problems/3
The second attribute specifies the verb form, and the third the tense. However, because of composite tenses, some verb forms are used in the construction of several different tenses.
In Slovenian, the verb form imel is the past participle of the verb imeti (Engl. to have). It yields the perfect tense when used with the present indicative of the copula verb biti (Engl. to be), and the conditional when used with the conditional form of the same copula.

Problems/3
    Winston Smith je imel
    da bi ga imel

Problems/4
    Different interpretations of various grammatical categories across languages, and the lack of a clear cross-linguistic correspondence, are discussed in Przepiórkowski (EACL 2003); for example, the dual number in Slovene and the paucal in Serbian.
    Certain morphosyntactic phenomena have not been taken into consideration, such as various problems of agreement (Vitas, Krstev, to appear).

Application of MSD Intex mapping to Serbian 1984

{S}{Bio,biti.V77:Gsm}
({je,jesam.V575+Imperf+It+Iref+Aux:Pzsi} + {je,on.PRO+Prs:sz2fi:sz4fi})
{vedar,.A18:akms1g:akms4q}
({i,.CONJ} + {i,.PAR})
{hladan,.A18:akms1g:akms4q}
{aprilski,.A2+PosQ:adms1g:aems4q:aems5g:aemp1g:aemp5g}
({dan,.A1+PP:akms1g:aems4q} + {dan,dati.V103+Perf+Tr+Iref+Ref:Tms}) ;
{S} ({na,.PREP+p4} + {na,.PREP+p7})
{cyasovnicima,.?}
({je,jesam.V575+Imperf+It+Iref+Aux:Pzsi} + {je,on.PRO+Prs:sz2fi:sz4fi})
{izbijalo,izbijati.V101+Perf+Tr+It+Iref:Gsn}
{trinaest,.?}.

Tool that facilitates the lemmatization and disambiguation

Tagged Serbian translation of 1984 after hand disambiguation and resolution of unknown words

{S}{Bio,biti.V77:Gsm}
{je,jesam.V575+Imperf+It+Iref+Aux:Pzsi}
{vedar,.A18:akms1g}
{i,.CONJ}
{hladan,.A18:akms1g}
{aprilski,.A2+PosQ:adms1g}
{dan,.N1:ms1q} ;
{S} {na,.PREP+p7}
{cyasovnicima,cyasovnik.N5:mp7q}
{je,jesam.V575+Imperf+It+Iref+Aux:Pzsi}
{izbijalo,izbijati.V101+Perf+Tr+It+Iref:Gsn}
{trinaest,.Num+Car}.

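Before any code conversion, the disambiguated tokens have to be split into their parts. The following Python sketch shows one way to do this. The layout it assumes, {form,lemma.class+mark+mark:code:code} with an empty lemma meaning "same as the form", is read off the examples in this talk; the function names are hypothetical.

```python
import re


def parse_token(body):
    """Split the inside of one {...} token into its parts."""
    form, _, rest = body.partition(",")
    lemma, _, tag = rest.partition(".")
    tag, *codes = tag.split(":")          # inflectional codes follow each ':'
    inflection_class, *marks = tag.split("+")  # DELAS marks follow each '+'
    return {
        "form": form,
        "lemma": lemma or form,            # empty lemma: same as the form
        "class": inflection_class,
        "marks": marks,
        "codes": codes,
    }


def parse_sentence(text):
    """Parse every {form,...} token; delimiters such as {S} are skipped."""
    bodies = re.findall(r"\{([^{}]*)\}", text)
    return [parse_token(body) for body in bodies if "," in body]


if __name__ == "__main__":
    sample = ("{S}{Bio,biti.V77:Gsm} "
              "{je,jesam.V575+Imperf+It+Iref+Aux:Pzsi} "
              "{vedar,.A18:akms1g}")
    for token in parse_sentence(sample):
        print(token)
```

The same function also handles tokens that are still ambiguous, since each alternative is a separate {...} group; choosing among the alternatives is left to the disambiguation step.
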
Simple Perl script maps Serbian Intex codes to MULTEXT-East MSD

    if (($POS eq "V") && ($kategorije !~ /[XS]/)) {   # it is a verb
        $glagol = "V" . (" " x 14);                    # one blank slot per verb attribute
        if ($semkat =~ /Aux/) {                        # type, attribute 1
            substr($glagol,1,1) = "a";
        } else {
            substr($glagol,1,1) = "m";
        }
        if ($kategorije =~ /([WYGTIFA])/) {            # form, attribute 2
            substr($glagol,2,1) = $1;
        }
        $glagol =~ tr/WYGTIFA/nmppiii/;
        if (($lema eq "biti") && ($kategorije =~ /A/)) {
            substr($glagol,2,1) = "c";
        }
        if ($kategorije =~ /([PIFAGY])/) {             # tense, attribute 3
            substr($glagol,3,1) = $1;
        }
        $glagol =~ tr/PIFAGY/pofasp/;
        if ($kategorije =~ /([xyz])/) {                # person, attribute 4
            substr($glagol,4,1) = $1;
        }
        $glagol =~ tr/xyz/123/;
        # ...
    }

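For readers who want to experiment with the mapping, the part of the script shown on this slide can be restated in Python roughly as follows. This is a sketch of the visible fragment only: it covers the first four verb attributes, uses a hyphen for a slot the fragment leaves unset (an assumption), and the function names are hypothetical.

```python
import re

# Letter translations taken from the tr/// calls of the Perl fragment.
FORM = dict(zip("WYGTIFA", "nmppiii"))   # attribute 2, VForm
TENSE = dict(zip("PIFAGY", "pofasp"))    # attribute 3, Tense
PERSON = dict(zip("xyz", "123"))         # attribute 4, Person


def first_code(categories, table):
    """Translate the first character of categories found in table."""
    for char in categories:
        if char in table:
            return table[char]
    return "-"  # assumption: unset slots are written as '-'


def verb_msd(lemma, categories, marks):
    """Build the first five characters of a verb MSD from Intex codes."""
    if re.search("[XS]", categories):
        return None  # these forms are excluded by the fragment's condition
    verb_type = "a" if "Aux" in marks else "m"
    form = first_code(categories, FORM)
    if lemma == "biti" and "A" in categories:
        form = "c"  # this form of biti is encoded as conditional
    tense = first_code(categories, TENSE)
    person = first_code(categories, PERSON)
    return "V" + verb_type + form + tense + person


if __name__ == "__main__":
    print(verb_msd("jesam", "Pzsi", "Imperf+It+Iref+Aux"))  # Va-p3
    print(verb_msd("biti", "Gsm", ""))                      # Vmps-
```

Note that a present-tense code such as Pzsi leaves the form slot unset here, because P is not among the letters the fragment translates for attribute 2; the remaining attributes are presumably filled in by the part of the script that is not shown.
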
Tagged Serbian 1984 using MULTEXT-East MSD

Bio je vedar i hladan aprilski dan na cyasovnicima je izbijalo trinaest

Conclusion
    It is possible to convert from Intex to MULTEXT-East.
    It is possible to convert from MULTEXT-East to Intex only to a certain extent: some information cannot be recovered, such as the inflectional class code.

Noun attributes
    1. Type
    2. Gender
    3. Number
    4. Case
    5. Definiteness
    6. Clitic
    7. Animate
    8. Owner_Number
    9. Owner_Person
    10. Owned_Number

Verb attributes
    1. Type
    2. VForm
    3. Tense
    4. Person
    5. Number
    6. Gender
    7. Voice
    8. Negative
    9. Definiteness
    10. Clitic
    11. Case
    12. Animate
    13. Clitic_s
    14. Aspect

Adjective attributes
    1. Type
    2. Degree
    3. Gender
    4. Number
    5. Case
    6. Definiteness
    7. Clitic
    8. Animate
    9. Formation
    10. Owner_Number
    11. Owner_Person
    12. Owned_Number

Adverb attributes
    1. Type
    2. Degree
    3. Clitic
    4. Number
    5. Person
    6. Wh_Type

Values of the attribute VForm of the type Verb
    indicative (i)
    subjunctive (s)
    imperative (m)
    conditional (c)
    infinitive (n)
    participle (p)
    gerund (g)
    supine (u)
    transgressive (t)
    quotative (q)

Values of the attribute Tense of the type Verb
    present (p)
    imperfect (i)
    future (f)
    past (s)
    pluperfect (l)
    aorist (a)