Automatic Identification of Cognates and False Friends in French and English Diana Inkpen and Oana Frunza University of Ottawa and Greg Kondrak University of Alberta

Second-language learning
When learning a second language, a student can benefit from knowledge of his/her first language (Gass, 1987; Ringbom, 1987).
The use of cognates in second-language teaching was shown to accelerate vocabulary acquisition and to facilitate reading comprehension (LeBlanc et al., 1989).
Morphological rules for conversion from English to French also proved helpful (Treville, 1990).
Manual work on cognate detection: LeBlanc and Seguin (1996) collected 23,160 French-English cognate pairs from two general-purpose dictionaries (70,000 entries). Cognates make up over 30% of the vocabulary.

French and English
Although French and English belong to different branches of the Indo-European family of languages, they share a very high number of cognates.
The majority are words of Latin and Greek origin: éducation - education and théorie - theory.
A small number (120) of very old, “genetic” cognates go back all the way to Proto-Indo-European: mère - mother and pied - foot.
Other cognates can be traced to the conquest of Gaul by Germanic tribes after the collapse of the Roman Empire, and to the period of French domination of England after the Norman conquest.

Definitions
Cognates or True Friends (Vrais Amis): words that are similar and are mutual translations. Examples: nature - nature, reconnaissance - recognition.
False Friends (Faux Amis): words that are similar but have different meanings. Examples: main “hand” - main, blesser “to injure” - bless (bénir).
Partial Cognates: words that have the same meaning in both languages in some but not all contexts. Example: facteur - factor or “mailman”.
Genetic Cognates: words derived directly from the same word in the ancestor language (may differ in form or meaning). Examples: père - father, chef - head.

Related work
Bilingual corpora and translation lexicons
Simard et al. (1992) use cognates to align sentences in bitexts (cognates = words whose first 4 characters are identical).
Brew and McKelvie (1996) extract French-English cognates and false friends from bitexts using a variety of orthographic similarity measures.
Mann and Yarowsky (2001) automatically induce translation lexicons on the basis of cognate pairs.
Kondrak (2004) identifies genetic cognates in vocabularies of related languages by combining the phonetic similarity of lexemes with the semantic similarity of glosses.

Our task
Automatic identification of cognates and false friends => lists, for inclusion in learning aids.
For a pair of words, the two classes for the automatic classification are Cognates/False-Friends and Unrelated.
Additional “translation” feature:
– Cognates if the two words are translations of each other in a bilingual dictionary;
– False Friends otherwise.
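The two-step labelling described above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function name `label_pair`, the boolean `orthographically_similar` (standing in for the output of the Cognates/False-Friends vs. Unrelated classifier), and the toy dictionary are all hypothetical.

```python
def label_pair(french, english, orthographically_similar, dictionary):
    """Label a French-English word pair.

    orthographically_similar: result of the first-stage classifier
        (Cognates/False-Friends vs. Unrelated); assumed to be given.
    dictionary: set of (french, english) translation pairs, a toy
        stand-in for a real bilingual dictionary (assumption).
    """
    if not orthographically_similar:
        return "Unrelated"
    # The "translation" feature separates cognates from false friends.
    if (french, english) in dictionary:
        return "Cognate"
    return "False-Friend"


# Toy dictionary built from the examples on the Definitions slide.
toy_dictionary = {("nature", "nature"), ("main", "hand")}

print(label_pair("nature", "nature", True, toy_dictionary))
print(label_pair("main", "main", True, toy_dictionary))
```

Keeping the dictionary lookup separate from the similarity decision means the first stage can be swapped (single measure, average, or a learned classifier) without touching the second.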

Our method
Machine learning (Weka).
Features: 13 orthographic similarity measures.
– Each measure separately.
– Average.
– Combine them through ML algorithms.

Orthographic similarity measures
IDENT returns 1 for identical words, 0 otherwise.
PREFIX returns the length of the common prefix divided by the length of the longer string.
DICE divides twice the number of shared letter bigrams by the total number of bigrams in both words. DICE(colour, couleur) = 6/11 = 0.55 (the shared bigrams are co, ou, ur).
TRIGRAM is the same as DICE but uses trigrams.
XDICE uses trigrams without the middle letter.
XXDICE takes into account the positions of bigrams.
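A minimal Python sketch of the first five measures, written directly from the one-line descriptions above. Several details are assumptions: repeated n-grams are counted as a multiset, XDICE is taken literally as DICE over trigrams with the middle letter dropped, and all function names are illustrative. XXDICE is omitted because its positional weighting is not specified here.

```python
from collections import Counter


def ngrams(word, n):
    """All contiguous letter n-grams of a word, in order."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def skip_bigrams(word):
    """Trigrams with the middle letter removed (first and third letters)."""
    return [word[i] + word[i + 2] for i in range(len(word) - 2)]


def dice_over(grams_a, grams_b):
    """Twice the shared grams divided by the total number of grams."""
    total = len(grams_a) + len(grams_b)
    if total == 0:
        return 0.0
    shared = sum((Counter(grams_a) & Counter(grams_b)).values())
    return 2 * shared / total


def ident(a, b):
    return 1.0 if a == b else 0.0


def prefix(a, b):
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    longest = max(len(a), len(b))
    return common / longest if longest else 0.0


def dice(a, b):
    return dice_over(ngrams(a, 2), ngrams(b, 2))


def trigram(a, b):
    return dice_over(ngrams(a, 3), ngrams(b, 3))


def xdice(a, b):
    return dice_over(skip_bigrams(a), skip_bigrams(b))


print(round(dice("colour", "couleur"), 2))  # 0.55, matching the slide
```

For colour/couleur the bigram sets have 5 and 6 members with 3 in common, which gives the 6/11 quoted above.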

Orthographic similarity measures (cont.)
Longest Common Subsequence Ratio (LCSR): LCSR(colour, couleur) = 5/7 = 0.71.
Normalized Edit Distance (NED).
SOUNDEX: phonetic name matching, numeric codes.
BI-SIM, TRI-SIM, BI-DIST, and TRI-DIST generalize the LCSR and NED measures, using letter bigrams or trigrams instead of single letters.
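LCSR and NED can be sketched with two small dynamic programs. The slide does not say how NED is normalized, so dividing the edit distance by the length of the longer word is an assumption here, as are the function names and the unit costs for insertion, deletion, and substitution.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two strings."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcsr(a, b):
    """LCS length divided by the length of the longer string."""
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 1.0


def edit_distance(a, b):
    """Levenshtein distance with unit costs (assumption)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def ned(a, b):
    """Edit distance normalized by the longer length (assumed normalization)."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


print(round(lcsr("colour", "couleur"), 2))  # 0.71, matching the slide
```

Since NED is a distance, a similarity score comparable to the other measures would be `1 - ned(a, b)`; that conversion is likewise an assumption.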

Training data
A bilingual list of 1047 basic words and expressions. We excluded multi-word expressions. We manually classified 203 pairs as Cognates and 527 pairs as Unrelated.
A manually word-aligned bitext (Melamed, 1998). We manually identified 258 Cognate pairs.
A set of exercises for Anglophone learners of French (Treville, 1990) (152 Cognate pairs).
An on-line list (blfauxam.htm) of “French-English False Cognates” (314 False-Friends).

Test data
A separate test set extracted from the following sources:
A random sample of 1000 word pairs from an automatically generated translation lexicon. We manually classified 603 pairs as Cognates and 343 pairs as Unrelated.
The on-line list of “French-English False Cognates” (94 additional False-Friends).

Data (summary)

| | Training set | Test set |
|---|---|---|
| Cognates | 613 (73) | 603 (178) |
| False-Friends | 314 (135) | 94 (46) |
| Unrelated | 527 (0) | 343 (0) |
| Total | 1454 (208) | 1040 (224) |

Results of classification

| Orthographic similarity measure | Threshold | Accuracy on training set | Accuracy on test set |
|---|---|---|---|
| IDENT | | | 55.00 % |
| PREFIX | | | 90.97 % |
| DICE | | | 93.37 % |
| LCSR | | | 94.24 % |
| NED | | | 93.57 % |
| SOUNDEX | | | 84.54 % |
| TRI | | | 92.13 % |
| XDICE | | | 94.52 % |
| XXDICE | | | 95.39 % |
| TRI-SIM | | | 93.28 % |
| TRI-DIST | | | 93.85 % |
| Average measure | | | 94.14 % |
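Using a single measure as a classifier amounts to comparing its score against a threshold. The sketch below shows one plausible way to pick that threshold from labelled training pairs by maximizing training accuracy; this selection procedure is an assumption, not a description of how the thresholds in the table were obtained, and the scores and labels are made-up toy values.

```python
def classify(score, threshold):
    """Pairs scoring above the threshold are Cognates/False-Friends."""
    return "CG_FF" if score > threshold else "UNREL"


def best_threshold(scores, labels):
    """Return (threshold, training accuracy) for the best cut point.

    Candidate thresholds are the midpoints between consecutive distinct
    scores, plus one value below the minimum (everything is CG_FF).
    """
    values = sorted(set(scores))
    candidates = [values[0] - 1.0]
    candidates += [(lo + hi) / 2 for lo, hi in zip(values, values[1:])]
    best_t, best_acc = None, -1.0
    for t in candidates:
        correct = sum(classify(s, t) == y for s, y in zip(scores, labels))
        acc = correct / len(scores)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc


# Toy, made-up similarity scores and labels for illustration only.
toy_scores = [0.10, 0.20, 0.35, 0.60, 0.80, 0.90]
toy_labels = ["UNREL", "UNREL", "UNREL", "CG_FF", "CG_FF", "CG_FF"]
threshold, accuracy = best_threshold(toy_scores, toy_labels)
print(threshold, accuracy)
```

The same routine applies to the average measure: average the individual scores for each pair first, then search for a threshold on the averaged values.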

Results of classification

| Classifier | Accuracy, cross-validation on training set | Accuracy on test set |
|---|---|---|
| Baseline | 63.75 % | 66.98 % |
| OneRule | 95.66 % | 92.89 % |
| Naive Bayes | 94.84 % | 94.62 % |
| Decision Tree | 95.66 % | 92.08 % |
| Decision Tree (pruned) | 95.66 % | 93.18 % |
| IBK | 93.81 % | 92.80 % |
| Ada Boost | 95.66 % | 93.47 % |
| Perceptron | 95.11 % | 91.55 % |
| SVM (SMO) | 95.46 % | 93.76 % |

Pruned Decision Tree
TRI-SIM <=
|   TRI-SIM <= : UNREL (447.0/17.0)
|   TRI-SIM >
|   |   XDICE <= 0.2: UNREL (97.0/20.0)
|   |   XDICE > 0.2
|   |   |   BI-SIM <= 0.3: UNREL (3.0)
|   |   |   BI-SIM > 0.3: CG_FF (9.0)
TRI-SIM > : CG_FF (898.0/17.0)
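Read as code, the pruned tree is a short chain of comparisons. In the sketch below the two TRI-SIM cut points are not given in the tree as shown, so they are left as the hypothetical parameters `t_high` (root split) and `t_low` (second split); only the XDICE (0.2) and BI-SIM (0.3) values come from the tree itself. The function name is illustrative.

```python
def pruned_tree(tri_sim, xdice, bi_sim, t_low, t_high):
    """Classify a word pair with the pruned decision tree.

    t_low, t_high: the two TRI-SIM thresholds, supplied by the caller
    because their values are not shown in the tree (hypothetical
    parameters, with t_low < t_high).
    """
    if tri_sim > t_high:
        return "CG_FF"   # root, right branch
    if tri_sim <= t_low:
        return "UNREL"   # clearly dissimilar pairs
    # Borderline TRI-SIM: fall back on XDICE, then BI-SIM.
    if xdice <= 0.2:
        return "UNREL"
    return "CG_FF" if bi_sim > 0.3 else "UNREL"


# Placeholder thresholds chosen only to exercise each branch.
print(pruned_tree(0.9, 0.0, 0.0, t_low=0.3, t_high=0.5))
print(pruned_tree(0.4, 0.25, 0.35, t_low=0.3, t_high=0.5))
```

Going by the leaf counts, the two outer TRI-SIM leaves (447 and 898 pairs) cover most of the training pairs, and XDICE and BI-SIM decide only the remaining borderline cases.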

Results on genetic cognates set

| Classifier | Accuracy |
|---|---|
| Baseline | –– |
| OneRule | 35.39 % |
| Naive Bayes | 29.20 % |
| Decision Tree | 35.39 % |
| Decision Tree (pruned) | 38.05 % |
| IBK | 43.36 % |
| Ada Boost | 35.39 % |
| Perceptron | 42.47 % |
| SVM (SMO) | 35.39 % |

120 pairs of genetic cognates, available at: ~kondrak/cognatesEF.html

Conclusion
We presented several methods to automatically identify cognates and false friends.
We tested a number of orthographic similarity measures individually.
We combined them using several different machine learning classifiers.
We evaluated the methods on a training set, on a test set, and on a list of genetic cognates.
The results show that, for French and English, it is possible to achieve very good accuracy even without training data by employing orthographic measures of word similarity.

Future work
Automatically identify partial cognates (WSD). We plan to use translation probabilities from a word-aligned parallel corpus.
Produce complete lists of cognates and false friends, given two vocabulary lists for the two languages.
Apply our method to other pairs of languages (since the orthographic similarity measures are not language-dependent).
Include our lists of cognates and false friends in language learning tools.