RANLP 2009 – September 12-18, 2009, Borovets, Bulgaria

A Knowledge-Rich Approach to Measuring the Similarity between Bulgarian and Russian Words
Workshop “Multilingual Resources, Technologies and Evaluation for Central and Eastern European Languages”, RANLP 2009
Preslav Nakov, Sofia University "St. Kliment Ohridski"
Elena Paskaleva, Bulgarian Academy of Sciences
Svetlin Nakov, Sofia University "St. Kliment Ohridski"

Introduction
- Objective: measure the extent to which a Bulgarian and a Russian word are perceived as similar by a person who is fluent in both languages
- Orthographic similarity, modified to account for typical cross-lingual correspondences between Bulgarian and Russian, e.g. transformations of inflections
- Example: Bulgarian афектирахме and Russian аффектировались are orthographically different but perceived as similar

Orthographic Similarity
- Minimum Edit Distance Ratio (MEDR)
  - MED(s1, s2) = the minimum number of INSERT / REPLACE / DELETE operations for transforming s1 to s2 (Levenshtein distance)
  - MEDR is also known as normalized edit distance (NED)
- Longest Common Subsequence Ratio (LCSR)
  - Length of the longest subsequence common to both words, normalized by the length of the longer word
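The two baseline measures can be sketched in a few lines of Python. The function names are illustrative. The normalization for MEDR (1 minus the distance divided by the length of the longer word) is inferred from the worked example later in the deck, MEDR = 1 – (7/15), so treat it as an assumption rather than a quoted definition.

```python
def med(s1: str, s2: str) -> int:
    """Levenshtein distance with unit-cost insert, delete and replace."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(
                prev[j] + 1,               # delete c1
                curr[j - 1] + 1,           # insert c2
                prev[j - 1] + (c1 != c2),  # replace, or free match
            ))
        prev = curr
    return prev[-1]


def medr(s1: str, s2: str) -> float:
    """1 - MED / length of the longer word (assumed normalization)."""
    longest = max(len(s1), len(s2))
    return 1.0 if longest == 0 else 1 - med(s1, s2) / longest


def lcs_length(s1: str, s2: str) -> int:
    """Length of the longest common subsequence of s1 and s2."""
    prev = [0] * (len(s2) + 1)
    for c1 in s1:
        curr = [0]
        for j, c2 in enumerate(s2, start=1):
            if c1 == c2:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcsr(s1: str, s2: str) -> float:
    """LCS length divided by the length of the longer word."""
    longest = max(len(s1), len(s2))
    return 1.0 if longest == 0 else lcs_length(s1, s2) / longest


print(med("афектирахме", "аффектировались"))   # 7, as on the example slide
print(medr("избягам", "отбегать"))             # 0.375
```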

Modified Minimum Edit Distance Ratio (MMEDR)
Our MMEDR similarity algorithm:
- Reduces the Russian word to an intermediate Bulgarian-sounding form
  - Applies a set of linguistically motivated transformation rules
- Compares orthographically the modified Russian word with the Bulgarian word
  - Calculates weighted Levenshtein distance

Linguistic Motivation behind the MMEDR Algorithm

Linguistic Motivation
- Transliteration from Cyrillic to Cyrillic
  - Full coincidence (equality) of letters
  - Regular letter transitions
- Transformations of n-grams
- Lemmatization
- Transformation weights

Transliteration
- What is transliteration?
  - Transition of sounds and their letter correspondences in one language to letters in another language
- Russian → Bulgarian transliteration
  - Full coincidence (equality) of letters, e.g. а → а (азбука – азбука)
  - Russian letters missing in Bulgarian, e.g. ы → и, э → е (рыба – риба, поэт – поет)
  - Removing a Russian letter, e.g. пальто → палто
  - Regular letter transitions, e.g. муж → мъж, хлеб → хляб, сон → сън
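A minimal sketch of the letter-level step, covering only the three deterministic cases named on the slide (ы → и, э → е, dropping ь). The function name `transliterate` is hypothetical. The regular transitions such as у → ъ are deliberately left out: they do not apply to every word, and this sketch assumes they are better captured by the substitution weights than by a forced rewrite.

```python
# Russian letters that have no Bulgarian counterpart, per the slide:
# ы and э are mapped to the closest Bulgarian letter, ь is removed.
RU_TO_BG = str.maketrans({"ы": "и", "э": "е", "ь": None})


def transliterate(russian_word: str) -> str:
    """Map a Russian word to an intermediate Bulgarian-looking spelling."""
    return russian_word.translate(RU_TO_BG)


for word in ("рыба", "поэт", "пальто", "азбука"):
    print(word, "→", transliterate(word))
```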

Transformation of n-grams
- Regular sound-letter transitions from Russian to Bulgarian
- Transformations originating from spelling
  - Double consonants, e.g. процесс → процес
  - Voiceless to voiced consonants, e.g. бессмертный → безсмъртен
- Transformations of morphological origin
  - Removing agglutinative morphemes (ся and сь), e.g. веселиться → веселить
  - Transforming endings, e.g. стенной → стенен

Transformation of Russian Adjectives
Russian ending   Bulgarian ending   Example
-нный            -нен               военный → военен
-ный             -ен                вечный → вечен
-нний            -нен               ранний → ранен
-ний             -ен                вечерний → вечерен
-ский            -ски               вражеский → вражески
-ый              -и                 стрелковый → стрелкови
-нной            -нен               стенной → стенен
-ной             -ен                родной → роден
-ой              -и                 деловой → делови

Transformation of Russian Verbs
Russian ending   Bulgarian ending   Example
-овать           -ам                декорировать → декорирам
-ить             -я                 бродить → бродя
-ять             -я                 блеять → блея
-ать             -ам                давать → давам
-уть             -а                 гаснуть → гасна
-еть             -ея                белеть → белея

Lemmatization
- Bulgarian and Russian are highly inflectional languages
  - A variety of endings express the different forms of the same word
- What is lemmatization?
  - Replacement of inflected wordforms with their lemmata
  - E.g. късният → късен (Bulgarian), равняющимся → равнять (Russian)
- Lemmatization can handle inflections

Transformation Weights
- We use weights for letter substitutions when measuring Levenshtein distance
  - We account for regular phonetic and spelling letter correspondences
  - Some substitutions are unlikely, e.g. о → у is more likely than о → щ
- Replacing a letter with itself has cost 0
- A regular letter substitution has cost 1
- Consonants and vowels with similar sequences of distinctive phonetic features have a lower substitution cost (e.g. б → в)

Transformation Weights
а: w(а, е)=0.7; w(а, и)=0.8; w(а, о)=0.7; w(а, у)=0.6; w(а, ъ)=0.5; w(а, ю)=0.8; w(а, я)=0.5
б: w(б, в)=0.8; w(б, п)=0.6
в: w(в, ф)=0.6
г: w(г, х)=0.5
д: w(д, т)=0.6
е: w(е, и)=0.6; w(е, о)=0.7; w(е, у)=0.8; w(е, ъ)=0.5; w(е, ю)=0.8; w(е, я)=0.5
ж: w(ж, з)=0.8; w(ж, ш)=0.6
з: w(з, с)=0.5
и: w(и, й)=0.6; w(и, о)=0.8; w(и, у)=0.8; w(и, ъ)=0.8; w(и, ю)=0.7; w(и, я)=0.7
й: w(й, ю)=0.7; w(й, я)=0.7
к: w(к, т)=0.8; w(к, х)=0.6
м: w(м, н)=0.7
о: w(о, у)=0.6; w(о, ъ)=0.8; w(о, ю)=0.7; w(о, я)=0.8
п: w(п, ф)=0.8; w(п, х)=0.9
с: w(с, ц)=0.6; w(с, ш)=0.9
т: w(т, ф)=0.8; w(т, х)=0.9; w(т, ц)=0.9
у: w(у, ъ)=0.5; w(у, ю)=0.6; w(у, я)=0.8
ф: w(ф, ц)=0.8
х: w(х, ш)=0.9
ц: w(ц, ч)=0.8
ч: w(ч, ш)=0.9
ъ: w(ъ, ю)=0.8; w(ъ, я)=0.8
ю: w(ю, я)=0.8
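One way to hold this table in code is a lookup keyed by unordered letter pairs. The sketch below assumes the weights are symmetric, which the slide does not state outright (each pair is listed only once), and falls back to cost 1 for any pair not in the table and cost 0 for identical letters, as described on the previous slide. `WEIGHT_SPEC`, `parse_weights` and `sub_cost` are illustrative names.

```python
# The weights from the slide, one source letter per line.
WEIGHT_SPEC = """
а: е=0.7 и=0.8 о=0.7 у=0.6 ъ=0.5 ю=0.8 я=0.5
б: в=0.8 п=0.6
в: ф=0.6
г: х=0.5
д: т=0.6
е: и=0.6 о=0.7 у=0.8 ъ=0.5 ю=0.8 я=0.5
ж: з=0.8 ш=0.6
з: с=0.5
и: й=0.6 о=0.8 у=0.8 ъ=0.8 ю=0.7 я=0.7
й: ю=0.7 я=0.7
к: т=0.8 х=0.6
м: н=0.7
о: у=0.6 ъ=0.8 ю=0.7 я=0.8
п: ф=0.8 х=0.9
с: ц=0.6 ш=0.9
т: ф=0.8 х=0.9 ц=0.9
у: ъ=0.5 ю=0.6 я=0.8
ф: ц=0.8
х: ш=0.9
ц: ч=0.8
ч: ш=0.9
ъ: ю=0.8 я=0.8
ю: я=0.8
"""


def parse_weights(spec: str) -> dict:
    """Build {frozenset({a, b}): weight} from the text table above."""
    table = {}
    for line in spec.strip().splitlines():
        head, rest = line.split(":")
        first = head.strip()
        for item in rest.split():
            second, weight = item.split("=")
            table[frozenset((first, second))] = float(weight)
    return table


WEIGHTS = parse_weights(WEIGHT_SPEC)


def sub_cost(a: str, b: str) -> float:
    """Substitution cost: 0 for identical letters, table weight, else 1."""
    if a == b:
        return 0.0
    return WEIGHTS.get(frozenset((a, b)), 1.0)  # symmetry is an assumption


print(sub_cost("о", "у"), sub_cost("о", "щ"))  # 0.6 1.0
```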

The MMEDR Algorithm in Detail

The MMEDR Algorithm
MMEDR algorithm steps (order is important):
1. Lemmatize the Bulgarian word
2. Lemmatize the Russian word
3. Transform the Russian word’s ending
4. Transliterate the Russian word
5. Remove some double consonants in the Russian word
6. Calculate weighted Levenshtein distance
7. Normalize and calculate the MMEDR value

Lemmatizing Bulgarian and Russian Words
- How to perform lemmatization?
  - Use large morphological dictionaries
  - Wordforms are replaced with their corresponding lemmata
- Lemmatization is an optional step in MMEDR
  - For each word it is either performed or not
  - When multiple lemmata are found, all of them are considered
  - The highest MMEDR value is taken

Transforming the Russian Endings
The following endings are replaced in the Russian words:
нный → нен; ный → ен; нний → нен; ний → ен; ий → и; ый → и; нной → нен; ной → ен; ой → и; ский → ски; ься → ь; овать → ам; ить → я; ять → я; ать → ам; уть → а; еть → ея
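The ending rules can be sketched as an ordered suffix table. The slides do not say how conflicts between overlapping endings are resolved, so this sketch assumes the rules are tried in the order listed and that only the first matching rule is applied. `ENDING_RULES` and `transform_ending` are illustrative names.

```python
# (Russian ending, Bulgarian ending), in the order given on the slide.
ENDING_RULES = [
    ("нный", "нен"), ("ный", "ен"), ("нний", "нен"), ("ний", "ен"),
    ("ий", "и"), ("ый", "и"), ("нной", "нен"), ("ной", "ен"), ("ой", "и"),
    ("ский", "ски"), ("ься", "ь"),
    ("овать", "ам"), ("ить", "я"), ("ять", "я"), ("ать", "ам"),
    ("уть", "а"), ("еть", "ея"),
]


def transform_ending(russian_word: str) -> str:
    """Apply the first matching ending rule (assumed: first match wins)."""
    for ru_ending, bg_ending in ENDING_RULES:
        if russian_word.endswith(ru_ending):
            stem = russian_word[: len(russian_word) - len(ru_ending)]
            return stem + bg_ending
    return russian_word  # no rule applies, leave the word unchanged


for word in ("военный", "стенной", "декорировать", "гаснуть"):
    print(word, "→", transform_ending(word))
```

With this ordering the sketch reproduces every example in the adjective and verb tables above, for instance вечерний → вечерен and давать → давам.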

Removing Double Consonants
The following substitutions are performed in the Russian words:
бб → б; жж → ж; кк → к; лл → л; мм → м; пп → п; рр → р; сс → с; тт → т; фф → ф
Note that not all double consonants are replaced, e.g. дд is left as дд (наддавать → наддавам)
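A short sketch of this step, assuming a plain left-to-right string replacement for each of the ten listed consonants. The name `remove_double_consonants` is illustrative.

```python
# Only these consonants are collapsed; others (e.g. дд) are kept.
COLLAPSIBLE = "бжклмпрстф"


def remove_double_consonants(russian_word: str) -> str:
    """Replace each listed double consonant with a single one."""
    for letter in COLLAPSIBLE:
        russian_word = russian_word.replace(letter + letter, letter)
    return russian_word


print(remove_double_consonants("процесс"))        # процес
print(remove_double_consonants("аффектировать"))  # афектировать
print(remove_double_consonants("наддавать"))      # наддавать (unchanged)
```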

Calculating Weighted Levenshtein Distance
- Starting from the classical Levenshtein distance (MED), we modify it to use weights for letter substitutions (MMED)
- We use the previously discussed linguistically motivated weights
- We calculate MMEDR as follows: MMEDR(s1, s2) = 1 – MMED(s1, s2) / max(|s1|, |s2|)
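The weighted distance can be sketched as the usual dynamic program with a weighted substitution term. Two assumptions are made here: insertions and deletions keep cost 1 (the slides only mention weights for substitutions), and the normalization divides by the length of the longer word, as in the worked example MMEDR = 1 – (2.3/7). The names `mmed`, `mmedr` and `DEMO_WEIGHTS` are illustrative, and `DEMO_WEIGHTS` holds only the two table entries needed for the demonstration.

```python
def mmed(s1: str, s2: str, weights: dict) -> float:
    """Levenshtein distance with weighted substitutions.

    `weights` maps frozenset({a, b}) to a substitution cost; missing
    pairs cost 1. Insert and delete cost 1 (an assumption).
    """
    def sub(a: str, b: str) -> float:
        return 0.0 if a == b else weights.get(frozenset((a, b)), 1.0)

    prev = [float(j) for j in range(len(s2) + 1)]
    for i, c1 in enumerate(s1, start=1):
        curr = [float(i)]
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(
                prev[j] + 1.0,              # delete c1
                curr[j - 1] + 1.0,          # insert c2
                prev[j - 1] + sub(c1, c2),  # weighted substitution
            ))
        prev = curr
    return prev[-1]


def mmedr(s1: str, s2: str, weights: dict) -> float:
    """1 - MMED / length of the longer word."""
    longest = max(len(s1), len(s2))
    return 1.0 if longest == 0 else 1 - mmed(s1, s2, weights) / longest


# Two entries from the weight table: w(и, о)=0.8 and w(е, я)=0.5.
DEMO_WEIGHTS = {frozenset("ио"): 0.8, frozenset("ея"): 0.5}

print(round(mmed("избягам", "отбегам", DEMO_WEIGHTS), 2))   # 2.3
print(round(mmedr("избягам", "отбегам", DEMO_WEIGHTS), 2))  # 0.67
```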

Calculating the Final Result
The final MMEDR value is the maximum of the MMEDR values over all combinations:
- with / without lemmatization of the Bulgarian word
- with / without lemmatization of the Russian word
- with / without transformation of the Russian word ending
Lemmatization sometimes produces multiple lemmata, so all of them are considered
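Taking the maximum over all variants can be sketched as below. Everything here is a stand-in: `final_mmedr` is a hypothetical name, the lemma lists would come from the morphological dictionaries, and the three `toy_*` helpers are deliberately tiny placeholders (one ending rule, one double consonant, an exact-match score) so that the block runs on its own. A real run would plug in the full ending rules, transliteration, double-consonant removal and the weighted MMEDR score.

```python
from itertools import product


def final_mmedr(bg_word, ru_word, bg_lemmas, ru_lemmas,
                transform_ending, normalize, score):
    """Best score over all with / without variants of both words."""
    bg_forms = {bg_word, *bg_lemmas}   # without and with lemmatization
    ru_forms = {ru_word, *ru_lemmas}
    ru_candidates = set()
    for ru in ru_forms:
        ru_candidates.add(normalize(ru))                    # ending kept
        ru_candidates.add(normalize(transform_ending(ru)))  # ending changed
    return max(score(bg, ru) for bg, ru in product(bg_forms, ru_candidates))


# Toy placeholders, NOT the rules from the slides in full.
def toy_ending(word):
    return word[:-5] + "ам" if word.endswith("овать") else word


def toy_normalize(word):
    return word.replace("фф", "ф")


def toy_score(a, b):
    return 1.0 if a == b else 0.0


best = final_mmedr("афектирахме", "аффектировались",
                   ["афектирам"], ["аффектировать"],
                   toy_ending, toy_normalize, toy_score)
print(best)  # 1.0: the lemma pair becomes identical after the rewrites
```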

MMEDR Algorithm: Example
- Bulgarian word: афектирахме
- Russian word: аффектировались
- Traditional MEDR similarity
  - MED(афектирахме, аффектировались) = 7
  - Apply normalization: MEDR = 1 – (7/15) = 8/15 ≈ 53%
- The score is low even though these words "sound similar" to fluent speakers of Bulgarian and Russian

MMEDR Algorithm: Example (2)
Our improved MMEDR similarity:
- Lemmatization produces афектирам and аффектировать
- We replace the double Russian consonant -фф- by -ф-
  - We obtain афектирам and афектировать
- We replace the Russian ending -овать by the Bulgarian ending -ам
  - We obtain identical words: афектирам and афектирам
- Thus our MMEDR similarity is 100%

Another MMEDR Example
- Bulgarian word избягам and Russian word отбегать (both meaning ‘to run out’)
- MED(избягам, отбегать) = 5
- MEDR = 1 – (5/8) = 3/8 = 37.5%
- MMEDR first transforms отбегать to отбегам
- MMED(избягам, отбегам) = 0.8 + 1 + 0.5 = 2.3 (substitutions и → о, з → т, я → е)
- MMEDR = 1 – (2.3/7) = 47/70 ≈ 67%

Experiments and Evaluation

Experimental Setup
- Model the problem as an information retrieval (IR) task: retrieve all similar pairs of words from Bulgarian and Russian lists of words
- Measure similarity between 200 x 200 = 40,000 Bulgarian-Russian pairs of words
  - 163 pairs annotated as similar by a linguist
  - 39,837 considered unrelated
- Rank the 40,000 pairs by the MMEDR algorithm
- Evaluate the quality of the ranking with 11-pt interpolated average precision
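For reference, 11-pt interpolated average precision can be sketched as follows. This follows the standard IR definition (interpolated precision at recall levels 0.0, 0.1, ..., 1.0, averaged); the slides do not spell out the exact variant used, so treat the details as an assumption. The input is the ranked list of gold labels, best-scored pair first, and the function name is illustrative.

```python
def eleven_point_average_precision(ranked_labels) -> float:
    """11-pt interpolated average precision for a ranked list.

    ranked_labels: truthy for a pair annotated as similar, falsy
    otherwise, ordered from highest to lowest similarity score.
    """
    total_relevant = sum(bool(label) for label in ranked_labels)
    if total_relevant == 0:
        return 0.0

    # (recall, precision) after each position in the ranking.
    points = []
    hits = 0
    for rank, label in enumerate(ranked_labels, start=1):
        hits += bool(label)
        points.append((hits / total_relevant, hits / rank))

    # Interpolated precision at level r = best precision at recall >= r.
    interpolated = []
    for step in range(11):
        level = step / 10
        interpolated.append(
            max((p for r, p in points if r >= level), default=0.0)
        )
    return sum(interpolated) / 11


print(eleven_point_average_precision([1, 1, 0, 0]))  # 1.0 for a perfect ranking
```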

Resources
- Textual resources
  - The first 200 words from the Russian novel The Lord of the World (Властелин мира) by Alexander Belyayev
  - The first 200 words from the Bulgarian translation of the novel
- Grammatical resources (for lemmatization)
  - Grammatical dictionary of Bulgarian: 1M wordforms and 70,000 lemmata
  - Grammatical dictionary of Russian: 1.5M wordforms and 100,000 lemmata

Results
MMEDR significantly outperforms traditional orthographic similarity measures:

Algorithm   11-pt interpolated average precision
LCSR        69.06%
MEDR        72.30%
MMEDR       90.58%

Results – Produced Ranking
Columns: # | Bulgarian word | Russian word | MMEDR | Similar? | Precision | Recall (some cells did not survive extraction)
1      беляев  1.0000  Yes  100.00%  0.68%
2      на  1.37%
3      глава  2.05%
4      кандидат  2.74%
5      за  3.42%
6      наполеон  наполеоны  4.11%
7      не  4.79%
8      ми  нас  No  87.50%
9      мой  88.89%  5.48%
10     мы  90.00%  6.16%
...
93     четвъртият  четвертым  0.9375  94.57%  59.59%
94     оставят  остается  0.9286  94.62%  60.27%
...
39998  са  в  0.0000  0.37%  100%
39999  к
40000  боядисвали

Conclusion
- We proposed an orthographic similarity measure for Bulgarian / Russian word pairs
  - It outperforms traditional orthographic similarity measures
  - Accuracy is still far from 100%
  - The evaluation was performed with stop words included
- No other publications on orthographic similarity for Bulgarian / Russian
  - We cannot compare the results with others

Future Work
- Combine the ideas of MMEDR with machine learning techniques
  - Automatically learn transformation rules for n-gram correspondences
- Perform the evaluation with stop words excluded
- Evaluate on different pairs of languages

A Knowledge-Rich Approach to Measuring the Similarity between Bulgarian and Russian Words
Questions?
