6th Intex Workshop & 10 years of Intex (Silberztein, 1993), Sofia, 28-30 May 2003


1 6th Intex Workshop & 10 years of Intex (Silberztein, 1993)
Sofia, 28-30 May 2003

2 Conversion between Intex and MULTEXT-East Morphosyntactic Descriptions
Cvetana Krstev, Duško Vitas, University of Belgrade
Tomaž Erjavec, Jožef Stefan Institute, Ljubljana

3 Motivation
general:
  use of different tools
  use of multilingual resources
  comparison of results in NLP
specific:
  inclusion of the Serbian language in the MULTEXT-East specification and production of Slovenian Intex resources
  production of a tagged Serbian translation of Orwell's 1984

4 MULTEXT-East morphosyntactic specification
aim: an exhaustive description of the morphological and morphosyntactic features of different languages, and the establishment of unique codes for common features
scope: English, Romanian, Slovene, Czech, Bulgarian, Estonian, Hungarian, Croatian (Concede), and Serbian

5 14 MULTEXT-East types or PoS (new types cannot be introduced)
Nouns (N)
Verbs (V)
Adjectives (A)
Pronouns (P)
Determiners (D)
Adpositions (S)
Conjunctions (C)
Numerals (M)
Interjections (I)
Abbreviations (Y)
Particles (Q)
Adverbs (R)
Articles (T)
Residuals (X)

6 Type attributes
Each type has a set of attributes that are appropriate to it.
Each attribute of a type has its own fixed position in the MSD.
Adding new attributes to a type is not recommended.

7 Attribute values
A set of values is assigned to each attribute.
Each value is coded by one alphanumeric character.
New values can be added to an attribute if necessary.

8 Adjective attribute values/1
Adjective (A), 13 positions
Languages: EN RO SL CS BG ET HU HR SR; each x marks one language that uses the value (the type itself is marked for all nine)
P  ATT     VAL            C  Marks
=  ======  =============  =  ===============
1  Type    qualificative  f  x x x x x x x
           indefinite     i
           possessive     s  x x x x
           ordinal        o  x x
-  ------  -------------  -  ---------------
2  Degree  positive       p  x x x x x x x x
           comparative    c  x x x x x x x x
           superlative    s  x x x x x x x x
           elative        e  x x
-  ------  -------------  -  ---------------

9 Adjective attribute values/2
Languages: EN RO SL CS BG ET HU HR SR; each x marks one language that uses the value
P  ATT     VAL         C  Marks
=  ======  ==========  =  ===============
3  Gender  masculine   m  x x x x x x
           feminine    f  x x x x x x
           neuter      n  x x x x x x
-  ------  ----------  -  ---------------
4  Number  singular    s  x x x x x x x x
           plural      p  x x x x x x x x
           dual        d  x x
           paucal      c  x
-  ------  ----------  -  ---------------
5  Case    nominative  n  x x x x x x
           genitive    g  x x x x x x
           dative      d  x x x x x
           accusative  a  x x x x x
           ... (various more values) ...

10 Adjective attribute values/3
Languages: EN RO SL CS BG ET HU HR SR; each x marks one language that uses the value
P  ATT           VAL        C  Marks
=  ============  =========  =  =========
6  Definiteness  no         n  x x x x x
                 yes        y  x x x x x
                 short_art  s  x
                 full_art   f  x
-  ------------  ---------  -  ---------
7  Clitic        no         n  x
                 yes        y  x
-  ------------  ---------  -  ---------
8  Animate       no         n  x x x x x
                 yes        y  x x x x x
-  ------------  ---------  -  ---------
9  Formation     nominal    n  x
                 compound   c  x
-  ------------  ---------  -  ---------
... various Hungarian specific attributes ...

11 An example from the Slovenian MULTEXT-East dictionary
čistejši čist Afcfda
  The lemma čist (Engl. clean) corresponds to the simple word form čistejši; it is a qualificative (f) adjective (A) in the comparative degree (c), feminine gender (f), dual number (d), and accusative case (a).
čistejši čist Afcmsa--n
  The lemma čist (Engl. clean) corresponds to the simple word form čistejši; it is a qualificative (f) adjective (A) in the comparative degree (c), masculine gender (m), singular number (s), accusative case (a), and not animate (n).
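The positional reading of an MSD used in the two examples above can be sketched in Python. This is only an illustration: the table holds just the adjective positions and values that appear on the slides (the case list is truncated there and the Hungarian-specific positions are elided), and the function name `decode_adjective_msd` is hypothetical. It assumes that a hyphen marks a position that does not apply and that trailing hyphens may be omitted, as the Slovenian examples suggest.

```python
# Positions 1-9 of the adjective (A) MSD as listed on the slides.
# Assumption: values not shown in the transcript are simply missing here.
ADJECTIVE_ATTRIBUTES = [
    ("Type", {"f": "qualificative", "i": "indefinite",
              "s": "possessive", "o": "ordinal"}),
    ("Degree", {"p": "positive", "c": "comparative",
                "s": "superlative", "e": "elative"}),
    ("Gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("Number", {"s": "singular", "p": "plural", "d": "dual", "c": "paucal"}),
    ("Case", {"n": "nominative", "g": "genitive",
              "d": "dative", "a": "accusative"}),
    ("Definiteness", {"n": "no", "y": "yes",
                      "s": "short_art", "f": "full_art"}),
    ("Clitic", {"n": "no", "y": "yes"}),
    ("Animate", {"n": "no", "y": "yes"}),
    ("Formation", {"n": "nominal", "c": "compound"}),
]


def decode_adjective_msd(msd):
    """Map an adjective MSD such as 'Afcmsa--n' to attribute/value names."""
    if not msd.startswith("A"):
        raise ValueError("not an adjective MSD: " + msd)
    decoded = {}
    # zip() stops at the shorter sequence, so omitted trailing positions
    # are simply left out of the result.
    for (attribute, values), code in zip(ADJECTIVE_ATTRIBUTES, msd[1:]):
        if code == "-":  # position not applicable
            continue
        decoded[attribute] = values.get(code, "unknown (" + code + ")")
    return decoded


print(decode_adjective_msd("Afcfda"))
print(decode_adjective_msd("Afcmsa--n"))
```

Codes outside the table are reported as unknown rather than rejected, because the transcript does not list every value.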

12 The first sentence of the Slovene translation of Orwell's 1984, tagged
Bil je jasen, mrzel aprilski dan in ure so bile trinajst

13 Intex MSD for Serbian
one DELAS entry:
  cyist,A17
one of its corresponding DELAF entries:
  cyistiji,cyist.A17:bems1g:bems4q:bems5g:bemp1g:bemp5g
produced by the regular expression A17.exp:
  ..............
  ijemu/:bems3g:bems7g:bens3g:bens7g +
  iji/:bems1g:bems4q:bems5g:bemp1g:bemp5g +
  o/:aens1g:aens4g:aens5g +
  ..............
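The structure of the DELAF entry above (word form, lemma, inflectional class with optional marks, then one or more inflectional codes) can be taken apart with a few string splits. The following Python sketch assumes that the form and lemma contain no comma, dot, or colon, and that an empty lemma stands for a lemma identical to the form; the function name `parse_delaf_entry` is hypothetical and not part of Intex.

```python
def parse_delaf_entry(entry):
    """Split 'form,lemma.Class+Mark:code:code' into its components."""
    form, rest = entry.split(",", 1)
    # Everything after the first ':' is a list of inflectional codes.
    lexical, *inflections = rest.split(":")
    lemma, _, class_and_marks = lexical.partition(".")
    inflection_class, *marks = class_and_marks.split("+")
    return {
        "form": form,
        "lemma": lemma or form,  # assumption: empty lemma means same as form
        "class": inflection_class,
        "marks": marks,
        "inflections": inflections,
    }


entry = "cyistiji,cyist.A17:bems1g:bems4q:bems5g:bemp1g:bemp5g"
print(parse_delaf_entry(entry))
```

Each inflectional code describes one reading of the same word form, so the single entry above carries five readings.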

14 Attributes and their values for Serbian adjectives in DELAS/DELAF
Attribute     Value           Code    Attribute  Value                           Code
degree        positive        a       case       nominative                      1
              comparative     b                  genitive                        2
              superlative     c                  dative                          3
definiteness  no              k                  accusative                      4
              yes             d                  vocative                        5
              not applicable  e                  instrumental                    6
gender        masculine       m                  locative                        7
              feminine        f       animate    yes                             v
              neuter          n                  no                              q
number        singular        s                  not-applicable (not important)  g
              plural          p
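A small Python sketch shows how a Serbian adjective code such as bems1g reads against this table. The order of the positions (degree, definiteness, gender, number, case, animate) is not stated on the slide; it is inferred here from the DELAF codes shown earlier, so treat it as an assumption. The function name `decode_serbian_adjective` is hypothetical.

```python
# Assumed position order, inferred from codes such as 'bems1g'.
SERBIAN_ADJECTIVE_POSITIONS = [
    ("degree", {"a": "positive", "b": "comparative", "c": "superlative"}),
    ("definiteness", {"k": "no", "d": "yes", "e": "not applicable"}),
    ("gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("number", {"s": "singular", "p": "plural"}),
    ("case", {"1": "nominative", "2": "genitive", "3": "dative",
              "4": "accusative", "5": "vocative", "6": "instrumental",
              "7": "locative"}),
    ("animate", {"v": "yes", "q": "no", "g": "not applicable"}),
]


def decode_serbian_adjective(code):
    """Decode a six-character adjective code from the Serbian DELAF."""
    if len(code) != len(SERBIAN_ADJECTIVE_POSITIONS):
        raise ValueError("expected 6 characters, got: " + code)
    decoded = {}
    for (attribute, values), symbol in zip(SERBIAN_ADJECTIVE_POSITIONS, code):
        if symbol not in values:
            raise ValueError("unknown " + attribute + " code: " + symbol)
        decoded[attribute] = values[symbol]
    return decoded


print(decode_serbian_adjective("bems1g"))
```

Every attribute has its own set of symbols, which is why the codes stay unambiguous even though each symbol is a single character.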

15 Syntactic and semantic marks in Serbian DELAS
category      tag         applied to                explanation               example
syntactic     +p2         prepositions              noun is in genitive       bez,PREP+p2
              +Ref        verbs                     reflexive                 dicyiti,V551+Imperf+It+Ref
              +MG         nouns                     masculine natural gender  budala,N601+Hum+MG
              +FG
derivational  +VN         nouns                     verbal noun               kiselxenxe,N300+VN
              +Adj        adverbs                   derived from adjectives   fanaticyno,ADV+Adj
              +DerOvaIra  verbs, nouns, adjectives  derivational variety      dezinfikovati,V18+Imperf+...+DerOvaIra
semantic      +Col        adjectives                colors                    zelenkastosiv,A6+Col
              +Hum        nouns                     human                     lxubavnica,N601+Hum
              +Mat        adjectives                material                  kozxnat,A6+Mat
dialectal     +Ek         all                       Ekavian                   nedelxa,N600+Ek
              +Cr         all                       Croatism                  izopcxen,A1+PP+Cr

16 Problems of correspondence between MULTEXT-East MSD and Intex/1
The necessity of imposing the existing coding scheme on a particular language.
Example: how should the present and past active gerunds be encoded? In Serbian, for the verb ići (Engl. to go), these gerunds are idući and išavši. The verb tables of the MULTEXT-East specification contain attributes that describe them; however, no Slavic language except Bulgarian uses them.

17 Problems/2
A common encoding scheme does not guarantee that true standardization will be achieved.
Example: only Bulgarian uses the attribute value 'adjectival' for adverbs (with the examples 'umno, veselo, studeno'), although other Slavic languages, at least, could also make use of that value of the attribute Type.

18 Problems/3
Encoding of verb tenses
Languages: EN RO SL CS BG ET HU HR SR; each x marks one language that uses the value (the type itself is marked for all nine)
P  ATT    VAL            C  Marks
=  =====  =============  =  =================
2  VForm  indicative     i  x x x x x x x x x
          subjunctive    s  x
          imperative     m  x x x x x x x x
          conditional    c  x x x x x x x
          infinitive     n  x x x x x x x x
          participle     p  x x x x x x x x
          gerund         g  x x x
          supine         u  x x
          transgressive  t  x
          quotative      q  x
-  -----  -------------  -  -----------------
3  Tense  present        p  x x x x x x x x x
          imperfect      i  x x x x x
          future         f  x x x x
          past           s  x x x x x x x x x
          pluperfect     l  x x x
          aorist         a  x x x

19 Problems/3
The second attribute specifies the verb form, and the third the tense. However, because of composite tenses, some verb forms are used in the construction of different tenses.
In Slovenian, the verb form imel is the past participle of the verb imeti (Engl. to have). It produces the perfect tense when used with the present indicative of the copula verb biti (Engl. to be), and the conditional when used with the conditional form of the same copula.

20 Problems/3
Winston Smith je imel .......... da bi ga imel

21 Problems/4
Different interpretations of various grammatical categories across languages, and the lack of a clear cross-linguistic correspondence, are discussed in Przepiórkowski (EACL 2003); examples are the dual number in Slovene and the paucal in Serbian.
Certain morphosyntactic phenomena have not been taken into consideration, such as various problems of agreement (Vitas, Krstev, to appear).

22 Application of MSD ↔ Intex mapping to Serbian 1984
{S}{Bio,biti.V77:Gsm}
({je,jesam.V575+Imperf+It+Iref+Aux:Pzsi} + {je,on.PRO+Prs:sz2fi:sz4fi})
{vedar,.A18:akms1g:akms4q}
({i,.CONJ} + {i,.PAR})
{hladan,.A18:akms1g:akms4q}
{aprilski,.A2+PosQ:adms1g:aems4q:aems5g:aemp1g:aemp5g}
({dan,.A1+PP:akms1g:aems4q} + {dan,dati.V103+Perf+Tr+Iref+Ref:Tms}) ;
{S} ({na,.PREP+p4} + {na,.PREP+p7})
{cyasovnicima,.?}
({je,jesam.V575+Imperf+It+Iref+Aux:Pzsi} + {je,on.PRO+Prs:sz2fi:sz4fi})
{izbijalo,izbijati.V101+Perf+Tr+It+Iref:Gsn}
{trinaest,.?}.

23 Tool that facilitates lemmatization and disambiguation

24 Tagged Serbian translation of 1984 after hand disambiguation and resolution of unknown words
{S}{Bio,biti.V77:Gsm}
{je,jesam.V575+Imperf+It+Iref+Aux:Pzsi}
{vedar,.A18:akms1g}
{i,.CONJ}
{hladan,.A18:akms1g}
{aprilski,.A2+PosQ:adms1g}
{dan,.N1:ms1q} ;
{S} {na,.PREP+p7}
{cyasovnicima,cyasovnik.N5:mp7q}
{je,jesam.V575+Imperf+It+Iref+Aux:Pzsi}
{izbijalo,izbijati.V101+Perf+Tr+It+Iref:Gsn}
{trinaest,.Num+Car}.
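Tagged text in this notation is easy to read back into structured records. The Python sketch below is an assumption-laden illustration, not an Intex tool: it treats every {form,lemma.Category+marks:codes} group as one lexical unit, skips the sentence delimiter {S}, and takes an empty lemma to mean that the lemma equals the form. On text that has not been disambiguated it simply returns all alternatives in order. The names `TOKEN` and `parse_tagged_text` are hypothetical.

```python
import re

# One lexical unit: {form,lemma.Category+Mark+Mark:code:code}
TOKEN = re.compile(r"\{([^,{}]+),([^.{}]*)\.([^:{}]+)((?::[^:{}]+)*)\}")


def parse_tagged_text(text):
    """Return one record per tagged lexical unit; '{S}' is not matched."""
    tokens = []
    for form, lemma, lexical, inflections in TOKEN.findall(text):
        category, *marks = lexical.split("+")
        tokens.append({
            "form": form,
            "lemma": lemma or form,  # assumption: empty lemma = form
            "category": category,
            "marks": marks,
            # ':a:b' splits into ['', 'a', 'b']; drop the empty first item.
            "inflections": inflections.split(":")[1:] if inflections else [],
        })
    return tokens


sample = ("{S}{Bio,biti.V77:Gsm} {je,jesam.V575+Imperf+It+Iref+Aux:Pzsi} "
          "{dan,.N1:ms1q} {trinaest,.Num+Car}.")
for token in parse_tagged_text(sample):
    print(token)
```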

25 Simple perl script maps Serbian Intex codes to MULTEXT-East MSD
    if (($POS eq "V") && ($kategorije !~ /[XS]/)) {   # it is a verb
        $glagol = "V" . "---------------";
        if ($semkat =~ /Aux/) {                        # type, attribute 1
            substr($glagol,1,1) = "a";
        } else {
            substr($glagol,1,1) = "m";
        }
        if ($kategorije =~ /([WYGTIFA])/ ) {           # form, attribute 2
            substr($glagol,2,1) = $1;
        }
        $glagol =~ tr/WYGTIFA/nmppiii/;
        if ( ($lema eq "biti") && ($kategorije =~ /A/) ) {
            substr($glagol,2,1) = "c";
        }
        if ($kategorije =~ /([PIFAGY])/) {             # tense, attribute 3
            substr($glagol,3,1) = $1;
        }
        $glagol =~ tr/PIFAGY/pofasp/;
        if ($kategorije =~ /([xyz])/) {                # person, attribute 4
            substr($glagol,4,1) = $1;
        }
        $glagol =~ tr/xyz/123/;
        ........
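For readers who do not use Perl, the visible part of the fragment can be restated in Python. This is a sketch of the lines shown on the slide only: the elided remainder of the script is not reconstructed, so only positions 1 to 4 of the verb MSD are filled. The function name `verb_msd_prefix` and the lookup-table names are hypothetical. Instead of running `tr` over the whole string, each matched character is translated before it is stored, which gives the same result for the lines shown.

```python
import re

# Translation tables taken from the tr/// operators in the Perl fragment.
FORM_CODES = dict(zip("WYGTIFA", "nmppiii"))
TENSE_CODES = dict(zip("PIFAGY", "pofasp"))
PERSON_CODES = dict(zip("xyz", "123"))


def verb_msd_prefix(pos, lema, semkat, kategorije):
    """Fill MSD positions 1-4 for a verb, or return None if not applicable."""
    if pos != "V" or re.search(r"[XS]", kategorije):
        return None
    msd = ["V"] + ["-"] * 15
    msd[1] = "a" if "Aux" in semkat else "m"          # type, attribute 1
    match = re.search(r"[WYGTIFA]", kategorije)       # form, attribute 2
    if match:
        msd[2] = FORM_CODES[match.group()]
    if lema == "biti" and "A" in kategorije:
        msd[2] = "c"
    match = re.search(r"[PIFAGY]", kategorije)        # tense, attribute 3
    if match:
        msd[3] = TENSE_CODES[match.group()]
    match = re.search(r"[xyz]", kategorije)           # person, attribute 4
    if match:
        msd[4] = PERSON_CODES[match.group()]
    return "".join(msd)


print(verb_msd_prefix("V", "jesam", "Imperf+It+Iref+Aux", "Pzsi"))
print(verb_msd_prefix("V", "biti", "", "Gsm"))
```

With only the visible lines, a present-tense code such as Pzsi leaves the form position unset; the elided part of the script presumably handles such cases.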

26 Tagged Serbian 1984 using MULTEXT-East MSD
Bio je vedar i hladan aprilski dan na cyasovnicima je izbijalo trinaest

27 Conclusion
It is possible to convert from Intex to MULTEXT-East.
It is possible to convert from MULTEXT-East to Intex only to a certain extent: some information, such as the inflectional class code, cannot be recovered.
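The asymmetry can be illustrated for adjectives with a short Python sketch. It is not the authors' conversion. It assumes the position order degree, definiteness, gender, number, case, animate for the Serbian code, takes the MULTEXT-East codes v, i and l for vocative, instrumental and locative (the slides list only n, g, d and a), leaves the Clitic position empty, strips trailing hyphens as in the Slovenian examples, and supplies the adjective type as a parameter because the inflectional code does not carry it. All names are hypothetical.

```python
DEGREE = {"a": "p", "b": "c", "c": "s"}
DEFINITENESS = {"k": "n", "d": "y", "e": "-"}
# Assumption: v, i, l are the MULTEXT-East codes for the last three cases.
CASE = {"1": "n", "2": "g", "3": "d", "4": "a", "5": "v", "6": "i", "7": "l"}
ANIMATE = {"v": "y", "q": "n", "g": "-"}


def adjective_to_msd(inflection, adjective_type="f"):
    """Convert a Serbian Intex adjective code to a MULTEXT-East style MSD."""
    degree, definiteness, gender, number, case, animate = inflection
    msd = ("A" + adjective_type + DEGREE[degree] + gender + number
           + CASE[case] + DEFINITENESS[definiteness]
           + "-"                      # Clitic: assumed unused for Serbian
           + ANIMATE[animate])
    return msd.rstrip("-")


# The DELAF entry cyistiji,cyist.A17:bems4q belongs to class A17,
# but nothing in the resulting MSD records that class.
print(adjective_to_msd("bems4q"))
print(adjective_to_msd("bems1g"))
```

Going back from the MSD to Intex would require looking the lemma up in DELAS again, since the class code is not part of the MSD.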

28 Noun attributes
1. Type 2. Gender 3. Number 4. Case 5. Definiteness 6. Clitic 7. Animate 8. Owner_Number 9. Owner_Person 10. Owned_Number

29 Verb attributes
1. Type 2. VForm 3. Tense 4. Person 5. Number 6. Gender 7. Voice 8. Negative 9. Definiteness 10. Clitic 11. Case 12. Animate 13. Clitic_s 14. Aspect

30 Adjective attributes
1. Type 2. Degree 3. Gender 4. Number 5. Case 6. Definiteness 7. Clitic 8. Animate 9. Formation 10. Owner_Number 11. Owner_Person 12. Owned_Number

31 Adverb attributes
1. Type 2. Degree 3. Clitic 4. Number 5. Person 6. Wh_Type

32 Values of the attribute VForm of the type Verb
indicative (i) subjunctive (s) imperative (m) conditional (c) infinitive (n) participle (p) gerund (g) supine (u) transgressive (t) quotative (q)

33 Values of the attribute Tense of the type Verb
present (p) imperfect (i) future (f) past (s) pluperfect (l) aorist (a)

