Automatic Identification of Cognates, False Friends, and Partial Cognates. University of Ottawa, Canada.


Outline
Overview of the Thesis
Research Contribution
Cognate and False Friend Identification
Partial Cognate Disambiguation
CLPA - Cognate and False Friend Annotator
Conclusions and Future Work

Overview of the Thesis
Tasks
  – Automatic Identification of Cognates and False Friends
  – Automatic Disambiguation of Partial Cognates
Areas of Applications
  – CALL, MT, Word Alignment, Cross-Language Information Retrieval
CALL Tool - CLPA

Definitions
Cognates, or True Friends (Vrais Amis), are pairs of words that are perceived as similar and are mutual translations.
  nature - nature, reconnaissance - recognition
False Friends (Faux Amis) are pairs of words in two languages that are perceived as similar but have different meanings.
  main (= hand) - main (principal, essential), blesser (= to injure) - bless (bénir in French)
Partial Cognates are words that share the same meaning in two languages in some but not all contexts.
  note - note, facteur - factor or mailman, maker

Research Contribution
Novel method based on ML algorithms to identify Cognates and False Friends
A method to create complete lists of Cognates and False Friends
Define a novel task: Partial Cognate Disambiguation, and solve it using a supervised and a semi-supervised method
  – Combine and use corpora from different domains
Implement a CALL Tool, CLPA, to annotate Cognates and False Friends

Cognates and False Friends Identification
Our method
  – Machine Learning techniques with different algorithms
  – Instances: French-English pairs of words
  – Feature Space: 13 orthographic similarity measures
  – Classes: Cog_FF and Unrelated
Experiments done for:
  Each measure separately
  Average of all measures
  All 13 measures
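To make the feature space concrete, here is a minimal sketch of five of the measures named on the results slide (IDENT, PREFIX, DICE, LCSR, NED), using their usual textbook definitions. The exact variants used in the thesis are not given in the transcript, so the normalisation by the longer word and the function names below are assumptions, and the remaining measures are left out.

```python
def ident(x, y):
    """IDENT: 1 if the two words are identical, 0 otherwise."""
    return 1.0 if x == y else 0.0

def prefix(x, y):
    """PREFIX: length of the common prefix over the length of the longer word."""
    if not x or not y:
        return 0.0
    common = 0
    for a, b in zip(x, y):
        if a != b:
            break
        common += 1
    return common / max(len(x), len(y))

def bigrams(word):
    return [word[i:i + 2] for i in range(len(word) - 1)]

def dice(x, y):
    """DICE: twice the number of shared letter bigrams over the total number of bigrams."""
    bx, by = bigrams(x), bigrams(y)
    if not bx or not by:
        return 0.0
    remaining = list(by)
    shared = 0
    for gram in bx:
        if gram in remaining:      # count each bigram of y at most once
            remaining.remove(gram)
            shared += 1
    return 2 * shared / (len(bx) + len(by))

def lcs_length(x, y):
    """Length of the longest common subsequence (dynamic programming)."""
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0]
        for j, b in enumerate(y, 1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def lcsr(x, y):
    """LCSR: longest common subsequence length over the length of the longer word."""
    if not x or not y:
        return 0.0
    return lcs_length(x, y) / max(len(x), len(y))

def edit_distance(x, y):
    """Levenshtein distance with unit costs."""
    prev = list(range(len(y) + 1))
    for i, a in enumerate(x, 1):
        cur = [i]
        for j, b in enumerate(y, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = cur
    return prev[-1]

def ned(x, y):
    """NED: 1 minus the edit distance normalised by the length of the longer word."""
    if not x or not y:
        return 0.0
    return 1.0 - edit_distance(x, y) / max(len(x), len(y))

def features(fr, en):
    """Feature vector for one French-English word pair (subset of the 13 measures)."""
    return {"IDENT": ident(fr, en), "PREFIX": prefix(fr, en), "DICE": dice(fr, en),
            "LCSR": lcsr(fr, en), "NED": ned(fr, en)}
```

Each word pair then becomes one instance whose attributes are these scores, which is the form a standard classifier expects.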

Cognates and False Friends Identification
Data
                 Training set   Test set
Cognates         613 (73)       603 (178)
False-Friends    314 (135)      94 (46)
Unrelated        527 (0)        343 (0)
Total

Results for classification (COG_FF/UNREL)
Orthographic similarity measure   Threshold   Accuracy on training set   Accuracy on test set
IDENT                                         %                          %
PREFIX                                        %                          %
DICE                                          %                          %
LCSR                                          %                          %
NED                                           %                          %
SOUNDEX                                       %                          %
TRI                                           %                          %
XDICE                                         %                          %
XXDICE                                        %                          %
TRI-SIM                                       %                          %
TRI-DIST                                      %                          %
Average measure                               %                          %

Results for classification (COG_FF/UNREL)
Classifier               Accuracy cross-val. on training set   Accuracy on test set
Baseline                 %                                     %
OneRule                  %                                     %
Naive Bayes              %                                     %
Decision Tree            %                                     %
Decision Tree (pruned)   95.66%                                %
IBK                      %                                     %
Ada Boost                %                                     %
Perceptron               %                                     %
SVM (SMO)                %                                     %

Complete Lists of Cognates and False Friends
Method
  – Use the XXDICE orthographic similarity measure
  – Use a list of pairs of words in two languages (the words may or may not be translations of each other, or may come from monolingual lists of words)
  – Use a bilingual dictionary to determine whether the words in a pair are translations of each other
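One reading of the method above can be sketched as follows: pairs that pass an orthographic similarity threshold are labeled cognates if the dictionary lists them as translations, and false friends otherwise. The sketch is illustrative only. It swaps in `difflib.SequenceMatcher` as a stand-in for XXDICE, and the threshold value, the dictionary layout, and the names `label_pairs` and `similarity` are assumptions, not part of the original slides.

```python
from difflib import SequenceMatcher

def similarity(fr, en):
    # Stand-in for XXDICE (assumption): a generic string similarity ratio in [0, 1].
    return SequenceMatcher(None, fr, en).ratio()

def label_pairs(pairs, dictionary, threshold=0.6):
    """Label orthographically similar pairs as cognates or false friends.

    pairs:      iterable of (french_word, english_word)
    dictionary: maps a French word to the set of its English translations
    threshold:  hypothetical cut-off; pairs below it are treated as unrelated and skipped
    """
    labels = {}
    for fr, en in pairs:
        if similarity(fr, en) < threshold:
            continue
        is_translation = en in dictionary.get(fr, set())
        labels[(fr, en)] = "cognate" if is_translation else "false friend"
    return labels

# Toy data built from the examples on the Definitions slide.
toy_dictionary = {"nature": {"nature"}, "main": {"hand"}, "blesser": {"injure"}}
toy_pairs = [("nature", "nature"), ("main", "main"), ("blesser", "bless"), ("chien", "dog")]
toy_labels = label_pairs(toy_pairs, toy_dictionary)
```

With monolingual word lists as input, the candidate pairs would be the cross product of the two lists, which is why the pair counts on the evaluation slide run into the millions.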

Complete Lists of Cognates and False Friends
Evaluation
  – On the entry list of a French-English bilingual dictionary
      55% - Cognates
      2% - False Friends (5,619,270 pairs)
  – We created pairs of words from two large monolingual lists of words in French and English
      11,469,662 - orthographically similar (0.8%)
        – 3,496 Cognates (0.03%)
        – 3,767,435 False Friends (32%)

Cognates and False Friends Identification
Conclusion
We tested a number of orthographic similarity measures individually, and also combined them using different Machine Learning algorithms
We evaluated the methods on a training set using 10-fold cross validation, and on a test set
We proposed an extension of the method to create complete lists of Cognates and False Friends
The results show that, for French and English, it is possible to achieve very good accuracy based on the orthographic measures of word similarity

Partial Cognate Disambiguation
Task
  – To determine the sense/meaning (Cognate or False Friend with the equivalent English word) of a Partial Cognate in a French context
Note
  Cog
    Le comité prend note de cette information.
    The Committee takes note of this reply.
  FF
    Mais qui a dû payer la note?
    So who got left holding the bill?

Data
Use a set of 10 Partial Cognates
  – Parallel sentences that have the French Partial Cognate on the French side and the English Cognate (English False Friend) on the English side, labeled as COG (FF)
Collected from Europarl and Hansard
  – ~115 sentences per class for training
  – ~60 sentences per class for testing

Supervised Method
Traditional ML algorithms
Features
  - used the bag-of-words (BOW) approach of modeling context, with binary feature values
  - context words from the training corpus that appeared at least 3 times in the training sentences
Classes
  COG and FF
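The feature construction described above can be sketched in a few lines. This is a minimal illustration that assumes lowercased whitespace tokenisation and counts total occurrences across the training sentences; the slides do not say how the text was tokenised, and the function names are hypothetical.

```python
from collections import Counter

def build_vocabulary(training_sentences, min_count=3):
    """Keep context words that occur at least min_count times in the training sentences."""
    counts = Counter(word for s in training_sentences for word in s.lower().split())
    return sorted(word for word, c in counts.items() if c >= min_count)

def binary_bow(sentence, vocabulary):
    """Binary bag-of-words vector: 1 if the vocabulary word occurs in the sentence, else 0."""
    present = set(sentence.lower().split())
    return [1 if word in present else 0 for word in vocabulary]

# Toy sentences, invented for illustration.
toy_training = ["payer la note", "la note de frais", "prend note de la question"]
toy_vocabulary = build_vocabulary(toy_training)
toy_vector = binary_bow("qui a payé la facture", toy_vocabulary)
```

In this toy run only "la" and "note" reach the frequency cut-off, so every sentence is represented by a two-element binary vector.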

Monolingual Bootstrapping
For each pair of partial cognates (PC)
  1. Train a classifier on the training seeds, using the BOW approach and a NB-K classifier with attribute selection on the features
  2. Apply the classifier to unlabeled data: sentences that contain the PC word, extracted from LeMonde (MB-F) or from BNC (MB-E)
  3. Take the first k newly classified sentences from both the COG and FF classes and add them to the training seeds (the most confident ones, with prediction accuracy greater than or equal to a threshold of 0.85)
  4. Rerun the experiments, training on the new training set
  5. Repeat steps 2 and 3 t times
endFor
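The loop above is a form of self-training, and a rough sketch may help fix the control flow. It is not the original setup: a small multinomial Naive Bayes with add-one smoothing stands in for the NB-K classifier, attribute selection is omitted, and the names `train_nb`, `predict_nb`, and `bootstrap` as well as the default values of `k` and `iterations` are assumptions. Only the 0.85 confidence threshold comes from the slide.

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled):
    """Train a multinomial Naive Bayes model from (tokens, label) pairs."""
    class_counts = Counter(label for _, label in labeled)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labeled:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict_nb(model, tokens):
    """Return (best_label, posterior_probability) for a token list."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    log_scores = {}
    for label, n in class_counts.items():
        denom = sum(word_counts[label].values()) + len(vocab)
        score = math.log(n / total)
        for t in tokens:
            if t in vocab:  # words unseen in training are ignored
                score += math.log((word_counts[label][t] + 1) / denom)
        log_scores[label] = score
    top = max(log_scores.values())
    exp_scores = {l: math.exp(s - top) for l, s in log_scores.items()}
    best = max(exp_scores, key=exp_scores.get)
    return best, exp_scores[best] / sum(exp_scores.values())

def bootstrap(seeds, unlabeled, k=2, threshold=0.85, iterations=3):
    """Self-training: repeatedly add the k most confident predictions per class."""
    labeled = list(seeds)
    pool = list(unlabeled)
    for _ in range(iterations):
        model = train_nb(labeled)                       # step 1 / step 4
        scored = [(predict_nb(model, toks), toks) for toks in pool]  # step 2
        added = []
        for label in sorted(model[0]):                  # step 3, per class
            confident = sorted(
                ((conf, toks) for (pred, conf), toks in scored
                 if pred == label and conf >= threshold),
                key=lambda item: item[0], reverse=True)[:k]
            added.extend((toks, label) for _, toks in confident)
        if not added:
            break                                       # nothing confident enough
        labeled.extend(added)
        for toks, _ in added:
            pool.remove(toks)
    return labeled
```

A call such as `bootstrap(seeds, unlabeled)` returns the enlarged training set, on which a final classifier can be trained and evaluated against the held-out test sentences.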

Bilingual Bootstrapping
1. Translate the English sentences that were collected in the MB-E step into French using an online MT tool and add them to the French seed training data.
2. Repeat the MB-F and MB-E steps T times.

Additional Data
LeMonde
  – An average of 250 sentences for each class
BNC
  – An average of 200 sentences for each class
Multi-Domain corpus
  – An average of 80 sentences for each class

Results

Partial Cognate Disambiguation
Conclusions
  – Simple methods and available tools are used successfully for a task that is hard even for humans
  – Additional use of unlabeled data improves the learning process for the Partial Cognate Disambiguation task
  – Semi-Supervised Learning proves to be “as good as” Supervised Learning

CLPA - Cross Language Pair Annotator

Future Work
Apply the Cognate and False Friend Identification method and create complete lists for other pairs of languages
Increase the accuracy results for the Partial Cognate Disambiguation task
Use lemmatization for French texts and human evaluation for CLPA

Thank you!