1 Possibilities of identification of translation equivalents in a parallel corpus
Krešimir Šojat, Marko Tadić
Institute of Linguistics, Faculty of Philosophy, Zagreb, Croatia

2 Croatian-English parallel corpus
- compiled at the Institute of Linguistics, Faculty of Philosophy, Zagreb
- Languages
  - source language: Croatian
  - target language: English
- consists of 113 issues of “Croatia Weekly”

3 Corpus parameters
- Articles: 4,343
- Sentences:
  - HR: 67,694 (15.59 sentences/article avg.)
  - EN: 75,390 (17.36 sentences/article avg.)
- Tokens:
  - HR: 1,490,964 (22.03 words/sentence avg.)
  - EN: 1,796,744 (23.83 words/sentence avg.)
  - Total: 3,287,708

4 Corpus parameters 2
- “Vanilla” aligner, sentence level
- alignments
  - 0:1     310 in   235 articles    0.45%
  - 1:0      25 in    12 articles    0.04%
  - 1:1  56,783 in 4,143 articles   84.12%
  - 1:2   8,611 in 3,288 articles   12.76%
  - 2:1   1,391 in 1,012 articles    2.06%
  - 2:2     379 in   345 articles    0.56%
- Total alignments: 67,499 in 4,143 articles
- analysis only of 1:1 alignments

5 The aim
- detection of possible translation equivalents (TE) of multiword units (pairs) that occur in aligned sentences
- basic assumption:
  - in a pair of aligned sentences, pairs of words could be translated with pairs of words
- search for translation equivalent candidates among pairs of pairs of words

6 Test sample
- research done on the test sample (issues 14-32)
  - 10,202 1:1 alignments
  - tokens: HR 188,851; EN 223,112
  - types: HR 35,777; EN 16,521

7 Methods of TE detection
- First attempt
  - pure frequency counts of all possible pairs
    hr: Ove su jabuke velike. Skupe su jabuke.
    en: These apples are big. Apples are expensive.
  - possible pairs:
    hr               en
    ove su           these apples
    su jabuke        apples are
    jabuke velike    are big
    skupe su         apples are
    su jabuke        are expensive
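
A minimal sketch of how the pairs on this slide could be produced and counted, assuming that "pairs" means adjacent tokens within a sentence (which matches the example). The tokenizer (lowercasing, word-character matching) and the name `adjacent_pairs` are illustrative assumptions, not the authors' implementation.

```python
import re
from collections import Counter

def adjacent_pairs(sentence):
    # Assumed tokenization: lowercase, keep runs of word characters only.
    tokens = re.findall(r"\w+", sentence.lower())
    # Pair every token with its right-hand neighbour.
    return list(zip(tokens, tokens[1:]))

hr_sentences = ["Ove su jabuke velike.", "Skupe su jabuke."]
en_sentences = ["These apples are big.", "Apples are expensive."]

# Pure frequency counts of pairs, computed separately for each language.
hr_counts = Counter(p for s in hr_sentences for p in adjacent_pairs(s))
en_counts = Counter(p for s in en_sentences for p in adjacent_pairs(s))

for pair, freq in hr_counts.most_common():
    print(" ".join(pair), freq)
```

In this toy input the only repeated pairs are "su jabuke" and "apples are", which is the kind of frequency signal the first attempt relied on.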

8 Methods of TE detection 2
- all possible pairs of pairs of tokens within each aligned sentence pair were generated
  - 10,202 sentence pairs aligned 1:1
  - 5,259,631 pairs of pairs
- results were poor:
  - frequency gain in expected TE was very low
  - real TE pairs were severely outnumbered by irrelevant pairs
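
The generation step can be sketched as a cross product of source-side and target-side pairs inside each aligned sentence pair. This is an illustration under assumptions: adjacent-token pairs, a simple tokenizer, and the hypothetical helper names `adjacent_pairs` and `pairs_of_pairs`.

```python
import re
from collections import Counter
from itertools import product

def adjacent_pairs(sentence):
    # Assumed tokenization: lowercase, keep runs of word characters only.
    tokens = re.findall(r"\w+", sentence.lower())
    return list(zip(tokens, tokens[1:]))

def pairs_of_pairs(aligned):
    # For every 1:1 aligned sentence pair, combine each source pair
    # with each target pair and count how often the combination recurs.
    counts = Counter()
    for hr, en in aligned:
        counts.update(product(adjacent_pairs(hr), adjacent_pairs(en)))
    return counts

aligned = [
    ("Ove su jabuke velike.", "These apples are big."),
    ("Skupe su jabuke.", "Apples are expensive."),
]
counts = pairs_of_pairs(aligned)
print(sum(counts.values()), "pairs of pairs,", len(counts), "distinct")
```

Because each sentence pair contributes roughly (source length) × (target length) combinations, the candidate list grows very quickly, and most combinations are accidental co-occurrences rather than translation equivalents.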

9 Methods of TE detection 3
- Second attempt: use MI for pair selection
  - MI calculated for pairs from each language

        tokens     types
    HR  188,851    35,777
    EN  223,112    16,521

        pairs      different pairs
    HR  179,353    124,321
    EN  213,614    100,558

        pairs with MI ≥ 10    pairs with MI ≥ 15
    HR  47,115                10,991
    EN  16,448                 2,744
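
The slides do not state which MI formula was used. The sketch below assumes pointwise mutual information in bits, log2(P(x,y) / (P(x)·P(y))), estimated from relative frequencies of tokens and adjacent pairs; the function name `pair_mi` and the toy input are illustrative only.

```python
import math
from collections import Counter

def pair_mi(sentences):
    # sentences: list of token lists for ONE language.
    unigrams = Counter(t for s in sentences for t in s)
    bigrams = Counter(p for s in sentences for p in zip(s, s[1:]))
    n_tokens = sum(unigrams.values())
    n_pairs = sum(bigrams.values())
    mi = {}
    for (x, y), c_xy in bigrams.items():
        p_xy = c_xy / n_pairs          # relative frequency of the pair
        p_x = unigrams[x] / n_tokens   # relative frequency of each token
        p_y = unigrams[y] / n_tokens
        # Assumed definition: pointwise MI in bits.
        mi[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return mi

toy = [["a", "b"], ["a", "b"], ["c", "d"]]
mi = pair_mi(toy)
for pair, value in sorted(mi.items()):
    print(pair, round(value, 3))
```

Note that under this definition a pair seen only once between two rare tokens scores higher than a frequent pair, which is one reason MI thresholds are usually combined with a minimum frequency.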

10 Methods of TE detection 4
- expectation
  - a pair of tokens from the source language with high MI could have a TE which is also a pair of tokens with high MI in the target language
  - list of pairs of pairs of tokens filtered by MI
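
One way to read "filtered by MI" is to keep a pair of pairs only when both sides reach a threshold. The sketch below assumes exactly that; the function name `filter_by_mi`, the threshold handling, and all MI values in the example are made up for illustration and are not taken from the corpus.

```python
def filter_by_mi(candidates, mi_hr, mi_en, threshold=10.0):
    # Keep a (source pair, target pair) candidate only if BOTH pairs
    # reach the MI threshold in their own language.
    missing = float("-inf")
    return [
        (hr, en)
        for hr, en in candidates
        if mi_hr.get(hr, missing) >= threshold
        and mi_en.get(en, missing) >= threshold
    ]

# Made-up MI values, for illustration only.
mi_hr = {("su", "jabuke"): 4.2, ("hrvatski", "sabor"): 12.7}
mi_en = {("apples", "are"): 3.9, ("croatian", "parliament"): 11.5}

candidates = [
    (("su", "jabuke"), ("apples", "are")),
    (("hrvatski", "sabor"), ("croatian", "parliament")),
    (("hrvatski", "sabor"), ("apples", "are")),
]
kept = filter_by_mi(candidates, mi_hr, mi_en)
print(kept)
```

The filter shrinks the candidate list, but it only tests each side independently; it says nothing about whether the two high-MI pairs actually translate each other.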

11 List of pairs of pairs

12 Results
- disappointing
- high MI of pairs of tokens in both languages does not necessarily mean that translation equivalence can be detected
- the fact that pairs of tokens with high MI appear on both sides of aligned sentences is not enough to establish TE

13 Possible directions
- find a way of connecting MI values from both languages
- problem of morphology
  - Croatian is an inflectional language
  - affects pairs of tokens
  - affects frequency of pairs
  - affects MI value
  - calculating MI for types or lemmas?
- is the MI value useful for morphologically rich languages?
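
The morphology problem can be illustrated with a small count: inflected forms of one Croatian noun are separate types, so their frequency is split unless tokens are mapped to lemmas first. The lemma table below is a hypothetical hand-made lookup for this example, not a real lemmatizer.

```python
from collections import Counter

# Hypothetical lemma lookup: case forms of "jabuka" (apple).
LEMMAS = {
    "jabuka": "jabuka",
    "jabuke": "jabuka",
    "jabuku": "jabuka",
    "jabuci": "jabuka",
}

def counts_by_type_and_lemma(tokens):
    # Count surface forms (types) and, separately, their lemmas.
    types = Counter(tokens)
    lemmas = Counter(LEMMAS.get(t, t) for t in tokens)
    return types, lemmas

tokens = ["jabuka", "jabuke", "jabuku", "jabuke", "jabuci"]
types, lemmas = counts_by_type_and_lemma(tokens)
print(dict(types))
print(dict(lemmas))
```

Counted as types, the noun appears as four low-frequency items; counted as a lemma, it is one item with frequency five. Any pair frequency, and therefore any MI estimate, inherits the same fragmentation.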
