Published by Lee Powers. Modified over 9 years ago.
1
Automatic Identification of Cognates, False Friends, and Partial Cognates
University of Ottawa, Canada
2
Outline
– Overview of the Thesis
– Research Contribution
– Cognate and False Friend Identification
– Partial Cognate Disambiguation
– CLPA - Cognate and False Friend Annotator
– Conclusions and Future Work
3
Overview of the Thesis
Tasks
– Automatic Identification of Cognates and False Friends
– Automatic Disambiguation of Partial Cognates
Areas of Applications
– CALL, MT, Word Alignment, Cross-Language Information Retrieval
CALL Tool - CLPA
4
Definitions
– Cognates, or True Friends (Vrais Amis), are pairs of words that are perceived as similar and are mutual translations.
  – nature - nature, reconnaissance - recognition
– False Friends (Faux Amis) are pairs of words in two languages that are perceived as similar but have different meanings.
  – main (= hand) - main (principal, essential), blesser (= to injure) - bless (bénir in French)
– Partial Cognates are words that share the same meaning in two languages in some but not all contexts.
  – note - note, facteur - factor or mailman, maker
5
Research Contribution
– Novel method based on ML algorithms to identify Cognates and False Friends
– A method to create complete lists of Cognates and False Friends
– Define a novel task, Partial Cognate Disambiguation, and solve it using a supervised and a semi-supervised method
  – Combine and use corpora from different domains
– Implement a CALL Tool, CLPA, to annotate Cognates and False Friends
6
Cognates and False Friends Identification
Our method
– Machine Learning techniques with different algorithms
– Instances: French-English pairs of words
– Feature Space: 13 orthographic similarity measures
– Classes: Cog_FF and Unrelated
Experiments done for:
– Each measure separately
– Average of all measures
– All 13 measures
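To make the feature space concrete, here is a minimal sketch of three of the measures named in the results (PREFIX, DICE, LCSR), using their commonly cited definitions. The slides do not define the measures, so the exact normalizations below (dividing by the longer word, counting bigrams as a multiset) and all function names are assumptions for illustration.

```python
from collections import Counter


def prefix(a, b):
    """Length of the common prefix divided by the length of the longer word."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n / longest


def bigrams(word):
    """Multiset of adjacent letter pairs in a word."""
    return Counter(word[i:i + 2] for i in range(len(word) - 1))


def dice(a, b):
    """Twice the number of shared bigrams over the total number of bigrams."""
    ba, bb = bigrams(a), bigrams(b)
    total = sum(ba.values()) + sum(bb.values())
    if total == 0:
        return 0.0
    shared = sum((ba & bb).values())  # multiset intersection
    return 2 * shared / total


def lcs_length(a, b):
    """Length of the longest common subsequence (row-by-row dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def lcsr(a, b):
    """Longest common subsequence ratio: LCS length over the longer word's length."""
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 0.0


def features(fr, en):
    """Feature vector for one French-English pair (3 of the 13 measures)."""
    return {"PREFIX": prefix(fr, en), "DICE": dice(fr, en), "LCSR": lcsr(fr, en)}
```

Calling `features("blesser", "bless")` yields one instance for a classifier; the full method would add the remaining measures as further entries.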
7
Cognates and False Friends Identification
Data
                 Training set   Test set
Cognates         613 (73)       603 (178)
False-Friends    314 (135)      94 (46)
Unrelated        527 (0)        343 (0)
Total            1454           1040
8
Results for classification (COG_FF/UNREL)
Orthographic similarity measure   Threshold   Accuracy on training set   Accuracy on test set
IDENT                             1           43.90%                     55.00%
PREFIX                            0.03845     92.70%                     90.97%
DICE                              0.29669     89.40%                     93.37%
LCSR                              0.45800     92.91%                     94.24%
NED                               0.34845     93.39%                     93.57%
SOUNDEX                           0.62500     85.28%                     84.54%
TRI                               0.0476      88.30%                     92.13%
XDICE                             0.21825     92.84%                     94.52%
XXDICE                            0.12915     91.74%                     95.39%
TRI-SIM                           0.34845     95.66%                     93.28%
TRI-DIST                          0.34845     95.11%                     93.85%
Average measure                   0.14770     93.83%                     94.14%
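The slide does not say how the per-measure thresholds were obtained. One plausible way, sketched below as an assumption rather than the authors' procedure, is to scan candidate cut points on the training scores and keep the one with the highest training accuracy. The function name `best_threshold` and the rule "predict Cog_FF when the score is at or above the threshold" are illustrative choices.

```python
def best_threshold(scores, labels):
    """Pick the cut point that maximizes training accuracy for one measure.

    scores: similarity values for the training pairs.
    labels: True for Cog_FF, False for Unrelated.
    A pair is predicted Cog_FF when its score is >= the threshold.
    """
    if not scores or len(scores) != len(labels):
        raise ValueError("scores and labels must be non-empty and equally long")
    values = sorted(set(scores))
    # Candidates: below all scores, midpoints between neighbours, above all scores.
    candidates = [values[0] - 1.0]
    candidates += [(lo + hi) / 2 for lo, hi in zip(values, values[1:])]
    candidates.append(values[-1] + 1.0)
    best_t, best_acc = None, -1.0
    for t in candidates:
        correct = sum((s >= t) == y for s, y in zip(scores, labels))
        acc = correct / len(scores)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc
```

The chosen threshold would then be applied unchanged to the test pairs to obtain the test-set accuracy column.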
9
Results for classification (COG_FF/UNREL)
Classifier               Accuracy, cross-val. on training set   Accuracy on test set
Baseline                 63.75%                                 66.98%
OneRule                  95.66%                                 92.89%
Naive Bayes              94.84%                                 94.62%
Decision Tree            95.66%                                 92.08%
Decision Tree (pruned)   95.66%                                 93.18%
IBK                      93.81%                                 92.80%
Ada Boost                95.66%                                 93.47%
Perceptron               95.11%                                 91.55%
SVM (SMO)                95.46%                                 93.76%
10
Complete Lists of Cognates and False Friends
Method
– Use the XXDICE orthographic similarity measure
– Use a list of pairs of words in two languages (the words may or may not be translations of each other), or monolingual lists of words
– Use a bilingual dictionary to determine whether the words in a pair are translations of each other
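The three steps above can be sketched as a small labeling function. This is an illustration under stated assumptions: a plain bigram Dice coefficient stands in for XXDICE (which the slides name but do not define), the threshold of 0.5 is arbitrary, and the dictionary is a hypothetical mapping from a French word to its set of English translations.

```python
def bigram_dice(a, b):
    """Stand-in similarity: Dice coefficient over sets of letter bigrams (not XXDICE)."""
    A = {a[i:i + 2] for i in range(len(a) - 1)}
    B = {b[i:i + 2] for i in range(len(b) - 1)}
    if not A and not B:
        return 0.0
    return 2 * len(A & B) / (len(A) + len(B))


def label_pair(fr, en, dictionary, similarity=bigram_dice, threshold=0.5):
    """Label a French-English pair as Cognate, False Friend, or Unrelated.

    Orthographically dissimilar pairs are Unrelated. Similar pairs are
    Cognates when the dictionary lists them as translations, and
    False Friends otherwise.
    """
    if similarity(fr, en) < threshold:
        return "Unrelated"
    if en in dictionary.get(fr, set()):
        return "Cognate"
    return "False Friend"


# Toy dictionary (hypothetical entries for illustration only).
toy_dictionary = {"nature": {"nature"}, "main": {"hand"}, "chien": {"dog"}}
```

With the toy dictionary, `label_pair("main", "main", toy_dictionary)` falls in the False Friend branch because the spellings match but the dictionary does not list the words as translations.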
11
Complete Lists of Cognates and False Friends
Evaluation
– On the entry list of a French-English bilingual dictionary
  – 55%: Cognates
  – 2%: False Friends (5,619,270 pairs)
– We created pairs of words from two large monolingual lists of words in French and English
  – 11,469,662 orthographically similar pairs (0.8%)
    – 3,496 Cognates (0.03%)
    – 3,767,435 False Friends (32%)
12
Cognates and False Friends Identification
Conclusion
– We tested a number of orthographic similarity measures individually, and also combined them using different Machine Learning algorithms
– We evaluated the methods on a training set using 10-fold cross-validation, and on a test set
– We proposed an extension of the method to create complete lists of Cognates and False Friends
– The results show that, for French and English, it is possible to achieve very good accuracy based on orthographic measures of word similarity
13
Partial Cognate Disambiguation
Task
– To determine the sense/meaning (Cognate or False Friend with the equivalent English word) of a Partial Cognate in a French context
Example: note
– Cog
  Le comité prend note de cette information.
  The Committee takes note of this reply.
– FF
  Mais qui a dû payer la note?
  So who got left holding the bill?
14
Data
– Use a set of 10 Partial Cognates
  – Parallel sentences with the French Partial Cognate on the French side and the English Cognate (or English False Friend) on the English side, labeled COG (or FF)
– Collected from Europarl and Hansard
  – ~115 sentences per class for training
  – ~60 sentences per class for testing
15
Supervised Method
Traditional ML algorithms
Features
– Used the bag-of-words (BOW) approach to model context, with binary feature values
– Context words from the training corpus that appeared at least 3 times in the training sentences
Classes
– COG and FF
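A minimal sketch of the feature construction described above: keep the context words that occur at least 3 times in the training sentences, then encode each sentence as a binary presence vector. Lowercasing and whitespace tokenization are simplifying assumptions, as are the function names; the slides do not specify the preprocessing.

```python
from collections import Counter


def build_vocabulary(sentences, min_count=3):
    """Context words that appear at least min_count times in the training sentences."""
    counts = Counter(tok for s in sentences for tok in s.lower().split())
    return sorted(w for w, c in counts.items() if c >= min_count)


def binary_bow(sentence, vocabulary):
    """Binary bag-of-words vector: 1 if the vocabulary word occurs in the sentence."""
    tokens = set(sentence.lower().split())
    return [1 if w in tokens else 0 for w in vocabulary]
```

Each training sentence, paired with its COG or FF label, then becomes one instance for any standard classifier.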
16
Monolingual Bootstrapping
For each pair of partial cognates (PC):
1. Train a classifier on the training seeds, using the BOW approach and a NB-K classifier with attribute selection on the features
2. Apply the classifier to unlabeled data: sentences that contain the PC word, extracted from LeMonde (MB-F) or from BNC (MB-E)
3. Take the first k newly classified sentences from both the COG and the FF class and add them to the training seeds (the most confident ones, with a prediction confidence greater than or equal to a threshold of 0.85)
4. Rerun the experiments, training on the new training set
5. Repeat steps 2 and 3 t times
end for
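The loop above can be sketched as follows. This is an illustration under stated assumptions, not the thesis implementation: a small multinomial naive Bayes with add-one smoothing stands in for the NB-K classifier, attribute selection is omitted, and the class posterior is used as the confidence score compared against 0.85. All function names are hypothetical.

```python
import math
from collections import Counter


def train_nb(labeled):
    """Train a tiny multinomial naive Bayes on (sentence, label) pairs."""
    word_counts, class_counts, vocab = {}, Counter(), set()
    for sentence, label in labeled:
        tokens = sentence.lower().split()
        class_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokens)
        vocab.update(tokens)
    return word_counts, class_counts, vocab


def predict_proba(model, sentence):
    """Posterior probability of each class for one sentence."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    log_scores = {}
    for label, n in class_counts.items():
        counts = word_counts[label]
        denom = sum(counts.values()) + len(vocab)  # add-one smoothing
        score = math.log(n / total)
        for tok in sentence.lower().split():
            if tok in vocab:  # words never seen in training are ignored
                score += math.log((counts[tok] + 1) / denom)
        log_scores[label] = score
    top = max(log_scores.values())
    exp = {label: math.exp(s - top) for label, s in log_scores.items()}
    z = sum(exp.values())
    return {label: v / z for label, v in exp.items()}


def bootstrap(seeds, unlabeled, k=2, t=3, threshold=0.85):
    """Grow the labeled set with the k most confident predictions per class, t times."""
    labeled, pool = list(seeds), list(unlabeled)
    for _ in range(t):
        model = train_nb(labeled)  # steps 1 and 4: (re)train on the current seeds
        scored = []
        for s in pool:  # step 2: classify the unlabeled sentences
            probs = predict_proba(model, s)
            label = max(probs, key=probs.get)
            scored.append((probs[label], s, label))
        scored.sort(key=lambda item: item[0], reverse=True)
        added, per_class = [], Counter()
        for conf, s, label in scored:  # step 3: keep the confident ones
            if conf >= threshold and per_class[label] < k:
                added.append((s, label))
                per_class[label] += 1
        if not added:
            break  # nothing confident enough is left
        labeled.extend(added)
        chosen = {s for s, _ in added}
        pool = [s for s in pool if s not in chosen]
    return labeled
```

The early exit when no sentence clears the threshold is a design choice of this sketch; the slide only states that the steps are repeated t times.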
17
Bilingual Bootstrapping
1. Translate the English sentences that were collected in the MB-E step into French using an online MT tool and add them to the French seed training data.
2. Repeat the MB-F and MB-E steps T times.
18
Additional Data
– LeMonde
  – An average of 250 sentences for each class
– BNC
  – An average of 200 sentences for each class
– Multi-Domain corpus
  – An average of 80 sentences for each class
19
Results
20
Partial Cognate Disambiguation
Conclusions
– Simple methods and available tools are used successfully for a task that is hard to solve even for humans
– Additional use of unlabeled data improves the learning process for the Partial Cognate Disambiguation task
– Semi-Supervised Learning proves to be “as good as” Supervised Learning
21
CLPA - Cross Language Pair Annotator
22
Future Work
– Apply the Cognate and False Friend Identification method and create complete lists for other pairs of languages
– Increase the accuracy results for the Partial Cognate Disambiguation task
– Use lemmatization for French texts and human evaluation for CLPA
23
Thank you!