
1 Principles of organizing a common morphological tagset and a search engine for PolUKR (Polish-Ukrainian Parallel Corpus)
Польсько-Український паралельний корпус / Polsko-Ukraiński Korpus Równoległy
Natalia Kotsyba, Institute of Slavic Studies, Polish Academy of Sciences
Olga Shypnivska, ULIF, Ukrainian Academy of Sciences
Magdalena Turska, Warsaw University

2 Main objectives and expected applications
- at least 3 million tokens; representative
- sentence-level alignment
- morphological annotation with a common tagset
- public access; user-friendly
- linguistic material for
  - (independent) language learning
  - bilingual dictionaries
  - research on grammar and lexis
- translation memory for humans and machines




6 Statistics (prototype version)
              total        Polish part   Ukrainian part
texts         70           35            35
tokens        359 926      179 087       180 120
characters    3 863 564    1 449 376     2 407 034
Kb            3941         1492          2439


8 Search (present)
- based on Perl regular expressions
- any search string has to be enclosed in slashes, e.g. /Холодна війна/ ("Cold War")
- special characters:
  |    alternative
  ( )  beginning and end of a subpattern (group)
  [ ]  beginning and end of a defined character class
  ?    0 or 1 occurrences
  *    0 or more occurrences
  +    1 or more occurrences
  \s   any whitespace character
  \w   any letter, digit or underscore
  \b   word boundary
  \    escape character
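The slash-enclosed formulas can be tried out with a minimal sketch like the one below. It uses Python's `re` module rather than Perl, on the assumption that the basic constructs listed on this slide behave the same way in both; the helper names `compile_formula` and `find_all` and the sample sentence are hypothetical and are not part of the PolUKR search engine.

```python
import re


def compile_formula(formula):
    """Turn a slash-enclosed search formula into a compiled Python regex."""
    if len(formula) < 2 or not (formula.startswith("/") and formula.endswith("/")):
        raise ValueError("a search formula must be enclosed in slashes")
    return re.compile(formula[1:-1])


def find_all(formula, text):
    """Return every matched substring, in order of appearance."""
    return [m.group(0) for m in compile_formula(formula).finditer(text)]


# Hypothetical sample sentence, made up for this illustration.
text = "Холодна війна і холодні війни"

print(find_all("/Холодна війна/", text))    # literal string
print(find_all("/війн(а|и)/", text))        # alternative inside a subpattern
print(find_all("/війн[аи]/", text))         # character class
print(find_all(r"/[Хх]олодн\w+/", text))    # \w+ : one or more word characters
print(find_all(r"/Холодна\sвійна/", text))  # \s : whitespace character
```

Checking for the enclosing slashes before compiling mirrors the rule that every search string has to be delimited; anything between the slashes is passed to the regex engine unchanged.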

9 Examples of search formulae
- /jako/ → „jako” anywhere, also inside a word
- /jako\s/ → „jako, niejako, dwojako” (words ending in „jako”)
- /\bjako/ → „jakość” (words beginning with „jako”)
- /norma\./ → „norma” before a full stop
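The four formulae can be checked against a short word list. This is a sketch in Python's `re` module (an assumption; the corpus itself uses Perl regular expressions), and both the sample string and the helper `words_hit` are invented for the illustration.

```python
import re

# Hypothetical sample text: single spaces between words, no line breaks.
SAMPLE = "niejako dwojako jako jakość norma. normalny"


def words_hit(formula, text):
    """Return the space-delimited word in which each match of the formula starts."""
    pattern = re.compile(formula[1:-1])  # drop the enclosing slashes
    words = []
    for m in pattern.finditer(text):
        start = text.rfind(" ", 0, m.start()) + 1  # beginning of the surrounding word
        end = text.find(" ", m.start())            # first space after the match start
        words.append(text[start:end if end != -1 else len(text)])
    return words


for formula in ("/jako/", r"/jako\s/", r"/\bjako/", r"/norma\./"):
    print(formula, words_hit(formula, SAMPLE))
```

On this sample, `/jako/` hits all four words containing the string, `/jako\s/` only the three that end in it, `/\bjako/` only the two that begin with it, and `/norma\./` only the occurrence followed by a full stop.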

14 Multext-East tagset
- for En Ro Sl Cz Bg Et Hu Hr Sr Re
- chain-like; criticised
- 14 PoS: N 10, V 15, A 12, P(ronoun) 17, D(eterminer) 10, T (article) 6, adveRb 6, S (adposition) 4, C(onjunction) 7, nuMeral 12, I(nterjection) 2, X (residual), Y (abbreviation) 5, Q (particle) 3
- only Bg and Hu do not have modal verbs and copulas
- En Ro have determiners; Ro Hu Re have articles; Bg has neither (analytism, segmentation)
- Is a Bg noun formally indefinite if the article is attached to the adjective? (cf. agglutinativity of Pl być)
- negation as a morphological category
- Cz transgressive (adverbial participle)
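To make "chain-like" concrete: a Multext-East tag is a string in which the first character gives the part of speech and each following position encodes one attribute, with a hyphen standing for a non-applicable attribute. The sketch below decodes noun tags only; the attribute table is a deliberately simplified subset assumed for illustration, not the full specification, and `decode_noun_tag` is a hypothetical helper.

```python
# Illustrative subset of noun attributes, one character per position.
# Assumption: only the first four positions are covered here.
NOUN_ATTRIBUTES = [
    ("Type", {"c": "common", "p": "proper"}),
    ("Gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("Number", {"s": "singular", "d": "dual", "p": "plural"}),
    ("Case", {"n": "nominative", "g": "genitive", "d": "dative",
              "a": "accusative", "l": "locative", "i": "instrumental"}),
]


def decode_noun_tag(tag):
    """Decode a positional noun tag such as 'Ncmsn' into attribute-value pairs."""
    if not tag.startswith("N"):
        raise ValueError("this sketch only decodes noun tags")
    decoded = {"PoS": "Noun"}
    # zip stops at the shorter sequence, so positions beyond the table are ignored.
    for (name, values), code in zip(NOUN_ATTRIBUTES, tag[1:]):
        if code == "-":  # hyphen: attribute not applicable
            continue
        decoded[name] = values.get(code, "unknown code " + code)
    return decoded


print(decode_noun_tag("Ncmsn"))
print(decode_noun_tag("Nc-pg"))
```

Because the meaning of a character depends entirely on its position, a tag cannot be read without the attribute table for its part of speech, which is one reason such chains are hard to query directly.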

15 Treatment of participles
- Polish: no aspectual characteristics (here and below cited from: Adam Przepiórkowski and Marcin Woliński, A Flexemic Tagset for Polish)
- Ukrainian: aspect and tense (here and below cited from: Широков В.А et al., Корпусна лінгвістика)
  - verb, adverbial participle, perfective aspect, past tense, active voice: VW, e.g. прочитавши ("having read")
  - verb, adverbial participle, imperfective aspect, present tense, active voice: UQ, e.g. читаючи ("reading")
- PolUKR: participle I (doing / having done) characterised by aspect

16 Treatment of pronouns
- notorious Slavonic pronoun problem: 296 unique tags for 309 pronouns
- Polish: division into 1st-2nd person, 3rd person and siebie (ów, jak?)
- Ukrainian: pro-noun, pro-adjective
- Russian: also pro-predicative and pro-adverb
- Czech: many subcategories on the level of SubPoS
- PolUKR: Ua approach and Pl division into 1st-2nd and 3rd person

22 Search engine for PolUKR
- choose the direction of the search (Ua>Pl or Pl>Ua)
- search conditions for both languages (RvonW)
- 3 levels of search:
  - exact form
  - (lemma) with the morphological choice
  - using Poliqarp-like tag formulas (for advanced users)
- idea of subcategories: either a POS or a SUBPOS can be selected, but not both; similarly, one cannot select all subcategories of a POS (cf. aliases in the IPI PAN corpus)
- alternatives are provided through tick boxes, so that one can choose EITHER „VERB finite past” OR „NOUN dative neuter” OR something else, etc.
- restrictions on choice within 1 of 10 POS
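The first two search levels and the tick-box alternatives can be sketched roughly as follows. Everything here is an assumption made for illustration: the `Token` structure, the feature names, the `search` function and the four sample tokens are hypothetical and do not describe the actual PolUKR implementation. The third level (Poliqarp-like tag formulas) is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    form: str
    lemma: str
    pos: str
    feats: frozenset  # morphological features of this token


def matches_choice(token, choice):
    """A tick-box choice is a (POS, required features) pair."""
    pos, feats = choice
    return token.pos == pos and feats <= token.feats


def search(tokens, form=None, lemma=None, choices=()):
    """Filter tokens by exact form, by lemma, and by any one of the ticked choices."""
    hits = []
    for t in tokens:
        if form is not None and t.form != form:
            continue
        if lemma is not None and t.lemma != lemma:
            continue
        # Alternatives are combined with OR: one matching choice is enough.
        if choices and not any(matches_choice(t, c) for c in choices):
            continue
        hits.append(t)
    return hits


# Hypothetical annotated tokens (Polish).
TOKENS = [
    Token("czytał", "czytać", "VERB", frozenset({"finite", "past"})),
    Token("czyta", "czytać", "VERB", frozenset({"finite", "present"})),
    Token("oknu", "okno", "NOUN", frozenset({"dative", "neuter", "singular"})),
    Token("okno", "okno", "NOUN", frozenset({"nominative", "neuter", "singular"})),
]

verb_past = ("VERB", frozenset({"finite", "past"}))
noun_dat_n = ("NOUN", frozenset({"dative", "neuter"}))

print([t.form for t in search(TOKENS, form="czyta")])
print([t.form for t in search(TOKENS, lemma="czytać", choices=[verb_past])])
print([t.form for t in search(TOKENS, choices=[verb_past, noun_dat_n])])
```

Treating each ticked box as a separate (POS, features) pair keeps the alternatives independent, so „VERB finite past” OR „NOUN dative neuter” never mixes features across parts of speech.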


24 Built-in restrictions on search

25 Literature
- INTERA unified tagset project.
- Tomaž Erjavec et al. Multext-East specifications for Slavic languages. Budapest, 2003.
- Jan Hajič. Positional Tags: Quick Reference (Czech „HM” Morphology), 2000.
- Adam Przepiórkowski and Marcin Woliński. A Flexemic Tagset for Polish. In: Proceedings of the Workshop on Morphological Processing of Slavic Languages, EACL 2003.
- Elena Paskaleva. Balkan South-East Corpora Aligned to English. In: Proceedings of the Workshop on Common Natural Language Processing Paradigm for Balkan Languages, EACL 2003.
- Широков В.А et al. Корпусна лінгвістика [Corpus Linguistics]. Київ: Довіра, 2005.

