Download presentation
Presentation is loading. Please wait.
Published bySherman Pitts Modified over 6 years ago
1
Translation of Unknown Words in Low Resource Languages
Biman Gujral, Huda Khayrallah, and Philipp Koehn Johns Hopkins University 31 October 2016 This talk was presented at AMTA 2016 It is based on this paper: bib: This talk was presented at AMTA 2016 It is based on this paper:
2
Translation of Unknown Words in Low Resource Languages
Biman Gujral, Huda Khayrallah, and Philipp Koehn Johns Hopkins University 31 October 2016 This talk was presented at AMTA 2016 It is based on this paper:
3
Out of Vocabulary Words (OOVs)
Hindi English: It वूवन पैंट्स, graphic टीज, Polo T शर्टे, शर्टे, शॉट्स, स्कर्टे and bright embroidered jackets etc are included. Uzbek English: Quvayt oʻyinga how koʻryapmiz with the preparation. Quvayt bilan bilan oʻyinga qanday qanday tayyorgarlik tayyorgarlik koʻryapmiz Quvayt oʻyinga how koʻryapmiz with the preparation How are we getting prepared to the match with Kuwayt ? Gujral, Khayrallah, Koehn
4
Gujral, Khayrallah, Koehn
Goals Generate candidates for each OOV Select the best one Gujral, Khayrallah, Koehn
5
Gujral, Khayrallah, Koehn
How big is this problem? Gujral, Khayrallah, Koehn
6
Gujral, Khayrallah, Koehn
Data Hindi English WMT14 News Training 274k sentences Test 2.5k sentences ~1 OOV/sentence ~5% OOVs Uzbek English LORELEI News, Wikipedia, social media Training 55k sentences Test 1k sentences ~4 OOVs/sentence ~20% OOVs ~3k OOVs in hindi ~5k in UZ Gujral, Khayrallah, Koehn
7
Gujral, Khayrallah, Koehn
OOV Examples Names I هدى Huda Misspellings grammer/grammar Inflections play/plays/playing Borrowed words हैलोवीन Halloween Reinflected Borrowings स्कर्टे skirts Googlear to Google Content words अटकलें speculation Gujral, Khayrallah, Koehn
8
Gujral, Khayrallah, Koehn
Distribution of OOVs From table 6 in the paper Gujral, Khayrallah, Koehn
9
Gujral, Khayrallah, Koehn
MT System Moses (Koehn et al. 2007) Phrase Based Large English language model WMT English ’07-’12 Gujral, Khayrallah, Koehn
10
Gujral, Khayrallah, Koehn
Methods Gujral, Khayrallah, Koehn
11
Gujral, Khayrallah, Koehn
Methods Transliteration Levenshtein distance Word Embeddings Gujral, Khayrallah, Koehn
12
Gujral, Khayrallah, Koehn
Transliteration v هدى Huda हैलोवीन Halloween Unsupervised Moses mode (Durrani et al. 2014) Character translation model Incorporate larger English language model Uzbek is already written in Latin script, keep original spelling Generate 1 candidate Gujral, Khayrallah, Koehn
13
Gujral, Khayrallah, Koehn
Levenshtein distance grammer/grammar play/plays Minimum number of: insertions deletions substitutions Gujral, Khayrallah, Koehn
14
Gujral, Khayrallah, Koehn
Levenshtein distance qilyapmiz qilyapsiz Find source words with distance ≤ 2 from OOV Use their English translation as translation candidate Generate 18 candidates on average doing Levenshtein distance = 1 doing Biz (we) qilyapmiz (are doing) Siz (you) qilyapsiz (are doing) Gujral, Khayrallah, Koehn
15
Gujral, Khayrallah, Koehn
Word Embeddings rumors अटकलें अफवाह doubts rumours Figure 3 speculation suspicions misgivings worry speculation worried करोङ Gujral, Khayrallah, Koehn
16
Gujral, Khayrallah, Koehn
Word Embeddings Word2vec (Mikolov et al. 2013) monolingual corpora Multilingual word vectors (Faruqui & Dyer 2014) monolingual vectors alignments Canonical Correlation Analysis (CCA) Generates 20 candidates Gujral, Khayrallah, Koehn
17
Gujral, Khayrallah, Koehn
Word Embeddings English Text English Projection Matrix Word2Vec English Vectors CCA Hindi Text Hindi Projection Matrix Word2Vec Hindi Vectors Gujral, Khayrallah, Koehn
18
Word Embeddings * * English Projection Matrix
Projected English Vectors KNN English Vectors English Candidates Hindi Candidates Hindi Projection Matrix * Projected Hindi Vectors Hindi Vectors Gujral, Khayrallah, Koehn
19
Gujral, Khayrallah, Koehn
Word Embeddings rumors अटकलें अफवाह doubts rumours Figure 3 speculation suspicions misgivings worry speculation worried करोङ Gujral, Khayrallah, Koehn
20
Gujral, Khayrallah, Koehn
Word Embeddings अटकलें doubts rumours suspicions misgivings worry worried speculation … rumors अटकलें अफवाह doubts rumours suspicions misgivings Figure 3 speculation worry worried speculation Gujral, Khayrallah, Koehn
21
Gujral, Khayrallah, Koehn
Word Embeddings अटकलें rumors अफवाह करोङ … rumor crore … अटकलें अफवाह doubts rumours suspicions misgivings Figure 3 speculation worry worried speculation Gujral, Khayrallah, Koehn
22
Gujral, Khayrallah, Koehn
Word Embeddings अटकलें doubts rumours suspicions misgivings worry worried speculation … rumor crore … Figure 3 speculation Gujral, Khayrallah, Koehn
23
Gujral, Khayrallah, Koehn
Integration Gujral, Khayrallah, Koehn
24
Gujral, Khayrallah, Koehn
Integration Transliteration Hindi OOVs English Candidates Select English Translation Levenshtein Embeddings Gujral, Khayrallah, Koehn
25
Gujral, Khayrallah, Koehn
Integration Language Model Phrase table Gujral, Khayrallah, Koehn
26
Gujral, Khayrallah, Koehn
Language Model Large English language model XML markup in Moses (Koehn & Haddow, 2009) Selection occurs during decoding Gujral, Khayrallah, Koehn
27
Gujral, Khayrallah, Koehn
Phrase Table Secondary Phrase Table only includes OOVs Features: Method Word Vector Distance Levenshtein distance Inverse frequency in Monolingual corpus Gujral, Khayrallah, Koehn
28
Gujral, Khayrallah, Koehn
Results Gujral, Khayrallah, Koehn
29
Gujral, Khayrallah, Koehn
Oracle Upper bound on how well a selection method can do given current generation methods Select word from list of candidates that is in the reference *remove stop words *add synonyms Gujral, Khayrallah, Koehn
30
Gujral, Khayrallah, Koehn
BLEU - Uzbek From table 8 in the paper Gujral, Khayrallah, Koehn
31
Gujral, Khayrallah, Koehn
BLEU - Hindi From table 7 in the paper Phrase Table is without multi word candidates Transliteration has an expanded LM Model Gujral, Khayrallah, Koehn
32
Gujral, Khayrallah, Koehn
Beyond BLEU Goals: generate candidates for each OOV How well can we generate translation candidates? select the best one How well can we select from the translation candidates? Gujral, Khayrallah, Koehn
33
Gujral, Khayrallah, Koehn
Coverage How well can we generate translation candidates? Was one of the candidates generated by this method in the reference? Gujral, Khayrallah, Koehn
34
Gujral, Khayrallah, Koehn
Coverage - Uzbek From table 5 in the paper All – 30% Note that the same candidate could be generated by multiple methods TOKENS (copy) Gujral, Khayrallah, Koehn
35
Gujral, Khayrallah, Koehn
Coverage - Hindi From table 4 in the paper All -> 50% TOKENS News data lits of names Gujral, Khayrallah, Koehn
36
Gujral, Khayrallah, Koehn
Accuracy How well can we select from the translation candidates? Is the word we selected in the reference? Gujral, Khayrallah, Koehn
37
Gujral, Khayrallah, Koehn
Accuracy - Uzbek From table 8 in the paper TYPES Gujral, Khayrallah, Koehn
38
Gujral, Khayrallah, Koehn
Accuracy - Hindi From table 7 in the paper - TYPES This is news data, so there are lots of names Phrase Table is without multi word candidates Transliteration has an expanded LM Model Gujral, Khayrallah, Koehn
39
Conclusion & Future Work
Generate Quality translations Selection does not perform as well Improved selection methods More sophisticated embedding projection Analysis of what methods work on which types of OOVs Gujral, Khayrallah, Koehn
40
Gujral, Khayrallah, Koehn
Acknowledgement This material is based upon work supported in part by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR C Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the Defense Advanced Research Projects Agency (DARPA). Gujral, Khayrallah, Koehn
41
Translation of Unknown Words in Low Resource Languages
Biman Gujral, Huda Khayrallah, and Philipp Koehn {bgujral1, huda, Johns Hopkins University
42
Gujral, Khayrallah, Koehn
References Durrani, Haddow, Koehn, and Heafield. (2014). Edinburgh’s Phrase-Based Machine Translation Systems for WMT-14. Workshop on Statistical Machine Translation Koehn, Hoang, Birch, Callison-Burch, Federico, Bertoldi, Cowan, Shen, Moran, Zens, Dyer, Bojar, Constantin, and Herbst. (2007). Moses: Open source toolkit for Statistical Machine Translation. ACL Interactive Poster and Demonstration Sessions Koehn and Haddow. Edinburgh’s submission to all tracks of the WMT2009 Shared Task with Reordering and Speed Improvements to Moses. Workshop on Statistical Machine Translation Faruqui and Dyer. (2014). Improving Vector Space Word Representations Using Multilingual Correlation. In Proceedings of EACL. Gujral, Khayrallah, Koehn
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.