RANLP 2009 – September 12-18, 2009, Borovets, Bulgaria
A Knowledge-Rich Approach to Measuring the Similarity between Bulgarian and Russian Words


A Knowledge-Rich Approach to Measuring the Similarity between Bulgarian and Russian Words
Preslav Nakov, Sofia University "St. Kliment Ohridski"
Elena Paskaleva, Bulgarian Academy of Sciences
Svetlin Nakov, Sofia University "St. Kliment Ohridski"
Workshop "Multilingual Resources, Technologies and Evaluation for Central and Eastern European Languages", RANLP 2009

Introduction

Objective: measure the extent to which a Bulgarian and a Russian word are perceived as similar by a person who is fluent in both languages.
We use orthographic similarity, modified to account for typical cross-lingual correspondences between Bulgarian and Russian, e.g. transformations of inflections.
Example: the Bulgarian афектирахме and the Russian аффектировались are orthographically quite different but are perceived as similar.

Orthographic Similarity

Minimum Edit Distance Ratio (MEDR):
MED(s1, s2) is the minimum number of INSERT / REPLACE / DELETE operations needed to transform s1 into s2 (Levenshtein distance).
MEDR(s1, s2) = 1 - MED(s1, s2) / max(|s1|, |s2|); MEDR is also known as normalized edit distance (NED).
Longest Common Subsequence Ratio (LCSR):
The length of the longest subsequence common to both words, normalized by the length of the longer word.
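The two baseline measures can be sketched in Python as follows; this is a minimal illustration with function names of our choosing, not the authors' implementation:

```python
def med(s1, s2):
    """Levenshtein distance: minimum number of INSERT / REPLACE / DELETE
    operations needed to transform s1 into s2."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        cur = [i]
        for j, c2 in enumerate(s2, 1):
            cur.append(min(prev[j] + 1,                 # delete c1
                           cur[j - 1] + 1,              # insert c2
                           prev[j - 1] + (c1 != c2)))   # replace (free if equal)
        prev = cur
    return prev[-1]

def medr(s1, s2):
    """Minimum Edit Distance Ratio: MED normalized by the longer word."""
    return 1 - med(s1, s2) / max(len(s1), len(s2))

def lcs_len(s1, s2):
    """Length of the longest common subsequence of s1 and s2."""
    prev = [0] * (len(s2) + 1)
    for c1 in s1:
        cur = [0]
        for j, c2 in enumerate(s2, 1):
            cur.append(prev[j - 1] + 1 if c1 == c2 else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def lcsr(s1, s2):
    """Longest Common Subsequence Ratio: LCS normalized by the longer word."""
    return lcs_len(s1, s2) / max(len(s1), len(s2))
```

On the deck's running example, med("афектирахме", "аффектировались") gives 7, so MEDR = 8/15.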

Modified Minimum Edit Distance Ratio (MMEDR)

Our MMEDR similarity algorithm:
Reduces the Russian word to an intermediate Bulgarian-sounding form by applying a set of linguistically motivated transformation rules.
Compares the modified Russian word orthographically with the Bulgarian word, calculating a weighted Levenshtein distance.

Linguistic Motivation behind the MMEDR Algorithm

Linguistic Motivation

Transliteration from Cyrillic to Cyrillic
Full coincidence (equality) of letters
Regular letter transitions
Transformations of n-grams
Lemmatization
Transformation weights

Transliteration

What is transliteration? Rendering the sounds of one language, and their letter correspondences, with the letters of another language.
Russian → Bulgarian transliteration:
Full coincidence (equality) of letters, e.g. а → а (азбука – азбука)
Russian letters missing in Bulgarian, e.g. ы → и, э → е (рыба – риба, поэт – поет)
Removing a Russian letter, e.g. пальто → палто
Regular letter transitions, e.g. муж → мъж, хлеб → хляб, сон → сън
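The purely mechanical part of this step can be sketched as below. The helper name and its scope are our assumption: it covers only the deterministic rules listed above (Russian-only letters ы and э, plus dropping the soft sign ь), while context-dependent regular transitions such as у → ъ or е → я are left to the weighted distance described later.

```python
def transliterate(ru_word):
    """Hypothetical sketch of the deterministic Russian -> Bulgarian
    rewriting: map Russian-only letters and drop the soft sign."""
    return (ru_word.replace("ы", "и")   # рыба -> риба
                   .replace("э", "е")   # поэт -> поет
                   .replace("ь", ""))   # пальто -> палто
```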

Transformation of n-grams

Regular sound-letter transitions from Russian to Bulgarian.
Transformations originating from spelling:
Double consonants, e.g. процесс → процес
Voiceless to voiced consonants, e.g. бессмертный → безсмъртен
Transformations of morphological origin:
Removing agglutinative morphemes (ся and сь), e.g. веселиться → веселить
Transforming endings, e.g. стенной → стенен

Transformation of Russian Adjectives

Russian ending → Bulgarian ending (example):
-нный → -нен (военный → военен)
-ный → -ен (вечный → вечен)
-нний → -нен (ранний → ранен)
-ний → -ен (вечерний → вечерен)
-ский → -ски (вражеский → вражески)
-ый → -и (стрелковый → стрелкови)
-нной → -нен (стенной → стенен)
-ной → -ен (родной → роден)
-ой → -и (деловой → делови)

Transformation of Russian Verbs

Russian ending → Bulgarian ending (example):
-овать → -ам (декорировать → декорирам)
-ить → -я (бродить → бродя)
-ять → -я (блеять → блея)
-ать → -ам (давать → давам)
-уть → -а (гаснуть → гасна)
-еть → -ея (белеть → белея)

Lemmatization

Bulgarian and Russian are highly inflectional languages: a variety of endings expresses the different forms of the same word.
What is lemmatization? Replacement of inflected wordforms with their lemmata, e.g. късният → късен (Bulgarian), равняющимся → равнять (Russian).
Lemmatization can thus handle inflections.

Transformation Weights

We use weights for letter substitutions when measuring the Levenshtein distance, accounting for regular phonetic and spelling letter correspondences.
Some substitutions are unlikely: e.g. о → у is more likely than о → щ.
Replacing a letter with itself costs 0; a regular letter substitution costs 1.
Consonants and vowels with similar sequences of distinctive phonetic features have a lower substitution cost (e.g. б → в).

Transformation Weights

а: w(а, е)=0.7; w(а, и)=0.8; w(а, о)=0.7; w(а, у)=0.6; w(а, ъ)=0.5; w(а, ю)=0.8; w(а, я)=0.5
б: w(б, в)=0.8; w(б, п)=0.6
в: w(в, ф)=0.6
г: w(г, х)=0.5
д: w(д, т)=0.6
е: w(е, и)=0.6; w(е, о)=0.7; w(е, у)=0.8; w(е, ъ)=0.5; w(е, ю)=0.8; w(е, я)=0.5
ж: w(ж, з)=0.8; w(ж, ш)=0.6
з: w(з, с)=0.5
и: w(и, й)=0.6; w(и, о)=0.8; w(и, у)=0.8; w(и, ъ)=0.8; w(и, ю)=0.7; w(и, я)=0.7
й: w(й, ю)=0.7; w(й, я)=0.7
к: w(к, т)=0.8; w(к, х)=0.6
м: w(м, н)=0.7
о: w(о, у)=0.6; w(о, ъ)=0.8; w(о, ю)=0.7; w(о, я)=0.8
п: w(п, ф)=0.8; w(п, х)=0.9
с: w(с, ц)=0.6; w(с, ш)=0.9
т: w(т, ф)=0.8; w(т, х)=0.9; w(т, ц)=0.9
у: w(у, ъ)=0.5; w(у, ю)=0.6; w(у, я)=0.8
ф: w(ф, ц)=0.8
х: w(х, ш)=0.9
ц: w(ц, ч)=0.8
ч: w(ч, ш)=0.9
ъ: w(ъ, ю)=0.8; w(ъ, я)=0.8
ю: w(ю, я)=0.8

The MMEDR Algorithm in Detail

The MMEDR Algorithm

MMEDR algorithm steps (order is important):
1. Lemmatize the Bulgarian word
2. Lemmatize the Russian word
3. Transform the Russian word's ending
4. Transliterate the Russian word
5. Remove some double consonants in the Russian word
6. Calculate the weighted Levenshtein distance
7. Normalize and calculate the MMEDR value

Lemmatizing Bulgarian and Russian Words

How to perform lemmatization? Using large morphological dictionaries: wordforms are replaced with the corresponding lemmata.
Lemmatization is an optional step in MMEDR: for each word it is either performed or not.
When multiple lemmata are found, all of them are considered and the highest MMEDR value is taken.

Transforming the Russian Endings

The following endings are replaced in the Russian words:
нный → нен; ный → ен; нний → нен; ний → ен; ий → и; ый → и; нной → нен; ной → ен; ой → и; ский → ски; ься → ь; овать → ам; ить → я; ять → я; ать → ам; уть → а; еть → ея
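A minimal sketch of this step, assuming the rules are tried longest-ending-first so that e.g. -овать wins over -ать and -нный over -ный (the deck lists the rules but not the matching order; the names RULES and transform_ending are ours):

```python
# Ending-rewrite rules from the slide, ordered longest-first.
RULES = [("овать", "ам"),
         ("нный", "нен"), ("нний", "нен"), ("нной", "нен"), ("ский", "ски"),
         ("ный", "ен"), ("ний", "ен"), ("ной", "ен"),
         ("ься", "ь"), ("ить", "я"), ("ять", "я"), ("ать", "ам"),
         ("уть", "а"), ("еть", "ея"),
         ("ий", "и"), ("ый", "и"), ("ой", "и")]

def transform_ending(ru_word):
    """Apply the first (longest) matching ending-rewrite rule, if any."""
    for old, new in RULES:
        if ru_word.endswith(old):
            return ru_word[:-len(old)] + new
    return ru_word
```

For instance, transform_ending maps декорировать to декорирам and стенной to стенен, as in the examples above.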

Removing Double Consonants

The following substitutions are performed in the Russian words:
бб → б; жж → ж; кк → к; лл → л; мм → м; пп → п; рр → р; сс → с; тт → т; фф → ф
Note that not all double consonants are collapsed: дд, for example, is left unchanged, e.g. наддавать → наддавам.

Calculating the Weighted Levenshtein Distance

Starting from the classical Levenshtein distance (MED), we modify it to use weights for letter substitutions (MMED), using the previously discussed linguistically motivated weights.
We then calculate MMEDR as:
MMEDR(s1, s2) = 1 - MMED(s1, s2) / max(|s1|, |s2|)
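The weighted distance can be sketched by swapping the unit substitution cost for a table lookup. This is an illustration, not the authors' code: W below holds only the few symmetric weights needed for the deck's examples (the full table is on the weights slide), and unlisted substitutions default to cost 1.

```python
# A few symmetric substitution weights from the weights slide.
W = {("е", "я"): 0.5, ("и", "о"): 0.8, ("о", "у"): 0.6, ("б", "в"): 0.8}

def sub_cost(a, b):
    """0 for identical letters, table weight if listed, else 1."""
    if a == b:
        return 0.0
    return W.get((a, b), W.get((b, a), 1.0))

def mmed(s1, s2):
    """Levenshtein distance with weighted substitutions (MMED)."""
    prev = [float(j) for j in range(len(s2) + 1)]
    for i, c1 in enumerate(s1, 1):
        cur = [float(i)]
        for j, c2 in enumerate(s2, 1):
            cur.append(min(prev[j] + 1,                      # delete
                           cur[j - 1] + 1,                   # insert
                           prev[j - 1] + sub_cost(c1, c2)))  # weighted replace
        prev = cur
    return prev[-1]

def mmedr(s1, s2):
    """MMED normalized by the length of the longer word."""
    return 1 - mmed(s1, s2) / max(len(s1), len(s2))
```

With these weights, mmed("избягам", "отбегам") comes out to 2.3 (0.8 for и → о, 1.0 for з → т, 0.5 for я → е), matching the worked example later in the deck.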

Calculating the Final Result

The final MMEDR value is the maximum over all MMEDR values computed:
with / without lemmatization of the Bulgarian word
with / without lemmatization of the Russian word
with / without transformation of the Russian word ending
Lemmatization sometimes produces multiple lemmata, so all of them are considered.

MMEDR Algorithm: Example

Bulgarian word: афектирахме; Russian word: аффектировались.
Traditional MEDR similarity: MED(афектирахме, аффектировались) = 7.
Applying normalization: MEDR = 1 - (7/15) = 8/15 ≈ 53%, even though these words "sound similar" to speakers fluent in Bulgarian and Russian.

MMEDR Algorithm: Example (2)

Our improved MMEDR similarity:
Lemmatization produces афектирам and аффектировать.
We replace the double Russian consonant -фф- by -ф-, obtaining афектирам and афектировать.
We replace the Russian ending -овать by the Bulgarian ending -ам, obtaining identical words: афектирам and афектирам.
Thus our MMEDR similarity is 100%.

Another MMEDR Example

Bulgarian word избягам and Russian word отбегать (both meaning 'to run out'):
MED(избягам, отбегать) = 5; MEDR = 1 - (5/8) = 3/8 = 37.5%
MMEDR first transforms отбегать to отбегам:
MMED(избягам, отбегам) = 2.3
MMEDR = 1 - (2.3/7) = 47/70 ≈ 67%

Experiments and Evaluation

Experimental Setup

We model the problem as an information retrieval (IR) task: retrieve all similar pairs of words from Bulgarian and Russian word lists.
We measure the similarity between 200 x 200 = 40,000 Bulgarian-Russian word pairs: 163 pairs were annotated as similar by a linguist; the remaining 39,837 are considered unrelated.
We rank the 40,000 pairs with the MMEDR algorithm and evaluate the quality of the ranking with 11-point interpolated average precision.
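The evaluation metric can be sketched generically as follows (a standard implementation of 11-point interpolated average precision with names of our choosing; the deck does not show its evaluation code):

```python
def eleven_point_ap(ranked_relevance, total_relevant):
    """11-point interpolated average precision: at each recall level
    0.0, 0.1, ..., 1.0 take the best precision achieved at any rank
    whose recall reaches that level, then average the 11 values."""
    precisions, recalls = [], []
    hits = 0
    for rank, is_relevant in enumerate(ranked_relevance, 1):
        hits += is_relevant
        precisions.append(hits / rank)
        recalls.append(hits / total_relevant)
    interpolated = []
    for level in (i / 10 for i in range(11)):
        candidates = [p for p, r in zip(precisions, recalls) if r >= level]
        interpolated.append(max(candidates) if candidates else 0.0)
    return sum(interpolated) / 11
```

For example, a ranking whose relevance flags are [1, 0, 1] with 2 relevant pairs in total scores (6 * 1 + 5 * 2/3) / 11 = 28/33.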

Resources

Textual resources:
The first 200 words from the Russian novel The Lord of the World (Властелин мира) by Alexander Belyayev
The first 200 words from the Bulgarian translation of the novel
Grammatical resources (for lemmatization):
A grammatical dictionary of Bulgarian: 1M wordforms and 70,000 lemmata
A grammatical dictionary of Russian: 1.5M wordforms and 100,000 lemmata

Results

MMEDR significantly outperforms the traditional orthographic similarity measures (11-pt interpolated average precision):
LCSR: 69.06%
MEDR: 72.30%
MMEDR: 90.58%

Results – Produced Ranking

#      Bulgarian word   Russian word   MMEDR    Similar?   Precision   Recall
1      беляев                                   Yes        100.00%     0.68%
2      на                                       Yes        100.00%     1.37%
3      глава                                    Yes        100.00%     2.05%
4      кандидат                                 Yes        100.00%     2.74%
5      за                                       Yes        100.00%     3.42%
6      наполеон         наполеоны      1.0000   Yes        100.00%     4.11%
7      не                                       Yes        100.00%     4.79%
8      ми               нас            1.0000   No         87.50%      4.79%
9      ми               мой            1.0000   Yes        88.89%      5.48%
10     ми               мы             1.0000   Yes        90.00%      6.16%
...
93     четвъртият       четвертым      0.9375   Yes        94.57%      59.59%
94     оставят          остается       0.9286   Yes        94.62%      60.27%
...
       са               в              0.0000   No         0.37%       100%
39999  са               к              0.0000   No         0.37%       100%
40000  боядисвали       к              0.0000   No         0.37%       100%

Conclusion

We proposed an orthographic similarity measure for Bulgarian / Russian that outperforms the traditional orthographic similarity measures.
Accuracy is still far from 100%, and the evaluation was performed with stop words included.
There are no prior publications on orthographic similarity for Bulgarian / Russian, so we cannot compare our results with others.

Future Work

Combine the ideas of MMEDR with machine learning techniques: automatically learn transformation rules for n-gram correspondences.
Perform the evaluation with stop words excluded.
Evaluate on different language pairs.

Questions?

A Knowledge-Rich Approach to Measuring the Similarity between Bulgarian and Russian Words