ZRINKA DUJMOVIĆ University of Zagreb/ETF JRC Workshop: Exploiting parallel corpora in up to 20 Languages Arona, 25-27 September 2005 STATISTICAL ANALYSIS.


1 ZRINKA DUJMOVIĆ, University of Zagreb/ETF
JRC Workshop: Exploiting parallel corpora in up to 20 Languages, Arona, 25-27 September 2005
STATISTICAL ANALYSIS OF NOUN LEMMAS IN THE ITALIAN AND SWISS CONSTITUTION AND THEIR TRANSLATIONS INTO CROATIAN

2 What?
- Constitution of the Republic of Italy (original in Italian + translation in Croatian): 139 art. + transitory provisions; effective since 1948.
- Federal Constitution of the Swiss Confederation (original in Italian + translation in Croatian < It/Germ/Eng.): 196 art. (+ tr. provisions); in force since 2000.

3 Why?
objective: test terminological consistency between SL & TL
prerequisites:
- parallel corpora as rich resources of translation equivalents
- small corpora

4 How? Data processing:
- Conversion into the HTML format
- Sentence alignment
- Lemmatisation (inflectionally rich language!!)
- Corpus annotation (POS tagging)
- Word alignment
- Word frequency lists
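The last step in the list, building word frequency lists, can be sketched as follows. This is only an illustration: the slides do not name the tools used, so the `(lemma, POS)` input format, the `"NOUN"` tag and the toy tokens are assumptions.

```python
from collections import Counter


def noun_lemma_frequencies(tagged_tokens):
    """Count noun lemmas in an iterable of (lemma, pos) pairs."""
    return Counter(lemma for lemma, pos in tagged_tokens if pos == "NOUN")


# Hypothetical toy input, standing in for a lemmatised, POS-tagged text.
tokens = [
    ("legge", "NOUN"), ("essere", "VERB"), ("legge", "NOUN"),
    ("diritto", "NOUN"), ("cittadino", "NOUN"), ("legge", "NOUN"),
]

freq = noun_lemma_frequencies(tokens)
top = freq.most_common(1)  # the single most frequent noun lemma
print(top)
```

Counting lemmas rather than surface forms matters for an inflectionally rich language, since all inflected forms of a noun then contribute to one entry.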

5 Testing terminological consistency of translation
1. HYPOTHESIS
1 Italian noun lemma = 1 translation equivalent in Croatian ⇔ Constitution
2. STATISTICAL TESTING
- the least squares method
- Y = a + bX
- Correlation coefficient (R)
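A minimal sketch of the statistical test, assuming X is the frequency of an Italian noun lemma and Y the frequency of its Croatian equivalent (the slides do not define the axes). The standard errors of a and b are the usual ordinary least squares estimates; the function name and the toy frequencies are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares fit of Y = a + bX.

    Returns (a, b, r, se_a, se_b): intercept, slope, Pearson correlation
    coefficient, and the standard errors of intercept and slope.
    """
    n = len(xs)
    if n < 3 or n != len(ys):
        raise ValueError("need at least three (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    r = sxy / math.sqrt(sxx * syy)
    # Residual variance with n - 2 degrees of freedom.
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    s2 = sse / (n - 2)
    se_b = math.sqrt(s2 / sxx)
    se_a = math.sqrt(s2 * (1.0 / n + mean_x ** 2 / sxx))
    return a, b, r, se_a, se_b


# Hypothetical frequencies of five aligned lemma pairs (not from the slides).
it_freq = [120.0, 80.0, 45.0, 30.0, 12.0]
hr_freq = [118.0, 82.0, 44.0, 30.0, 11.0]
a, b, r, se_a, se_b = fit_line(it_freq, hr_freq)
print(f"a = {a:.3f} +/- {se_a:.3f}, b = {b:.3f} +/- {se_b:.3f}, R = {r:.3f}")
```

Under the hypothesis of one equivalent per lemma, the fit should give a close to 0 and b and R close to 1.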

6 Correlation of the most frequent Italian and Croatian noun lemmas in the Federal Constitution of the Swiss Confederation (51)
a = 0.009 ± 0.039
b = 0.999 ± 0.030
R = 0.978

7 Correlation of the most frequent Italian and Croatian noun lemmas in the Constitution of the Republic of Italy (31)
a = 0.075 ± 0.07305
b = 0.938 ± 0.03970
R = 0.975

8 Deviation from linearity
(a) Accidental (translators' mistakes)
(b) Justified (still not expected!)
- stylistic differences, e.g. use of a relative pronoun instead of a noun (1:0)
- polysemy (1:2), e.g. It. titolo 11 x
  = Cr. naslov 6 x (Eng. title)
  = Cr. vrijednosni papiri 1 x (Eng. securities)
  - as idiom:
    1) a titolo transitorio = privremeno / Eng. temporarily;
    2) a titolo oneroso = za plaću / Eng. against payment
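Deviations of this kind can be located by listing the source lemmas that are aligned with more than one target lemma. The sketch below assumes word alignment yields `(source lemma, target lemma)` pairs. The titolo counts (6 x naslov, 1 x vrijednosni papiri) come from the slide, while the legge/zakon pair and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def equivalents_by_lemma(aligned_pairs):
    """Map each source lemma to a Counter of its target-language equivalents."""
    table = defaultdict(Counter)
    for source, target in aligned_pairs:
        table[source][target] += 1
    return table


def inconsistent_lemmas(table):
    """Return the source lemmas that have more than one equivalent."""
    return {src: dict(tgts) for src, tgts in table.items() if len(tgts) > 1}


pairs = (
    [("titolo", "naslov")] * 6
    + [("titolo", "vrijednosni papiri")]
    + [("legge", "zakon")] * 3  # hypothetical consistent lemma
)

flagged = inconsistent_lemmas(equivalents_by_lemma(pairs))
print(flagged)

# Occurrences of titolo not covered by a noun equivalent (idioms, omissions).
unaccounted = 11 - sum(flagged["titolo"].values())
print(unaccounted)
```

A flagged lemma is not necessarily an error: as the slide notes, polysemy and idioms are justified reasons for several equivalents, so the list is a starting point for manual review.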

9 Italian noun lemmas present in Italian and Swiss constitutions = candidates for glossary
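Selecting the lemmas shared by both constitutions amounts to intersecting the two frequency lists. In this sketch the ranking by the lower of the two frequencies is an assumption, and the lemma counts are hypothetical toy values.

```python
def glossary_candidates(freq_a, freq_b):
    """Lemmas present in both frequency lists, most frequent in both first."""
    shared = set(freq_a) & set(freq_b)
    # Rank by the smaller of the two counts, then alphabetically for ties.
    return sorted(shared, key=lambda l: (-min(freq_a[l], freq_b[l]), l))


# Hypothetical noun lemma frequencies for the two Italian texts.
italy = {"legge": 90, "diritto": 40, "regione": 60, "titolo": 11}
swiss = {"legge": 70, "diritto": 55, "cantone": 120, "titolo": 5}

candidates = glossary_candidates(italy, swiss)
print(candidates)
```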

10 Conclusions
- the least squares method appeared to be adequate for verification of translation
- the verification does not have to be carried out on the entire sample, but only on the lemmas with the highest frequency covering at least one order of magnitude
- the best candidates for glossary are those lemmas which are repeated with high frequency in both constitutions
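One possible reading of "covering at least one order of magnitude" is to take lemmas in descending frequency until the highest count is at least ten times the lowest selected count. The sketch below implements that reading only. The stopping rule, the function name and the toy counts are assumptions, not the author's procedure.

```python
def top_lemmas_spanning(freq, factor=10.0):
    """Take lemmas by descending frequency until max/min count >= factor.

    Assumes all counts are positive. Returns a list of (lemma, count).
    """
    ranked = sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))
    if not ranked:
        return []
    highest = ranked[0][1]
    selected = []
    for lemma, count in ranked:
        selected.append((lemma, count))
        if highest / count >= factor:
            break
    return selected


# Hypothetical frequency list.
freq = {"legge": 100, "diritto": 50, "cittadino": 10, "titolo": 5}
sample = top_lemmas_spanning(freq)
print(sample)
```

If the frequency list never spans the requested factor, the function returns the whole list, which corresponds to testing the entire sample.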

