Translation of Unknown Words in Low Resource Languages

Slides:

Advertisements

Similar presentations

Rationale for a multilingual corpus for machine translation evaluation Debbie Elliott Anthony Hartley Eric Atwell Corpus Linguistics 2003, Lancaster, England.

Advertisements

Word Sense Disambiguation for Machine Translation Han-Bin Chen

Improving Machine Translation Quality via Hybrid Systems and Refined Evaluation Methods Andreas Eisele DFKI GmbH and Saarland University Helsinki, November.

+ Improving Vector Space Word Representations Using Multilingual Correlation Manaal Faruqui and Chris Dyer Language Technologies Institute Carnegie Mellon.

“Applying Morphology Generation Models to Machine Translation” By Kristina Toutanova, Hisami Suzuki, Achim Ruopp (Microsoft Research). UW Machine Translation.

MT Summit VIII, Language Technologies Institute School of Computer Science Carnegie Mellon University Pre-processing of Bilingual Corpora for Mandarin-English.

Duyu Tang, Furu Wei, Nan Yang, Ming Zhou, Ting Liu, Bing Qin

Application of RNNs to Language Processing Andrey Malinin, Shixiang Gu CUED Division F Speech Group.

LEARNING WORD TRANSLATIONS Does syntactic context fare better than positional context? NCLT/CNGL Internal Workshop Ankit Kumar Srivastava 24 July 2008.

Does Syntactic Knowledge help English- Hindi SMT ? Avinesh. PVS. K. Taraka Rama, Karthik Gali.

Czech-to-English Translation: MT Marathon 2009 Session Preview Jonathan Clark Greg Hanneman Language Technologies Institute Carnegie Mellon University.

Multilinguality to the Rescue Manaal Faruqui & Chris Dyer Language Technologies Institute SCS, CMU.

Language Identification of Search Engine Queries Hakan Ceylan Yookyung Kim Department of Computer Science Yahoo! Inc. University of North Texas 2821 Mission.

An Introduction to SMT Andy Way, DCU. Statistical Machine Translation (SMT) Translation Model Language Model Bilingual and Monolingual Data* Decoder:

The use of machine translation tools for cross-lingual text-mining Blaz Fortuna Jozef Stefan Institute, Ljubljana John Shawe-Taylor Southampton University.

English-Persian SMT Reza Saeedi 1 WTLAB Wednesday, May 25, 2011.

Query Rewriting Using Monolingual Statistical Machine Translation Stefan Riezler Yi Liu Google 2010 Association for Computational Linguistics.

1 A study on automatically extracted keywords in text categorization Authors:Anette Hulth and Be´ata B. Megyesi From:ACL 2006 Reporter: 陳永祥 Date:2007/10/16.

Transliteration Transliteration CS 626 course seminar by Purva Joshi Mugdha Bapat Aditya Joshi Manasi Bapat

An Integrated Approach for Arabic-English Named Entity Translation Hany Hassan IBM Cairo Technology Development Center Jeffrey Sorensen IBM T.J. Watson.

2012: Monolingual and Crosslingual SMS-based FAQ Retrieval Johannes Leveling CNGL, School of Computing, Dublin City University, Ireland.

Active Learning for Statistical Phrase-based Machine Translation Gholamreza Haffari Joint work with: Maxim Roy, Anoop Sarkar Simon Fraser University NAACL.

2010 Failures in Czech-English Phrase-Based MT 2010 Failures in Czech-English Phrase-Based MT Full text, acknowledgement and the list of references in.

Enhanced Infrastructure for Creation & Collection of Translation Resources Zhiyi Song, Stephanie Strassel (speaker), Gary Krug, Kazuaki Maeda.

NUDT Machine Translation System for IWSLT2007 Presenter: Boxing Chen Authors: Wen-Han Chao & Zhou-Jun Li National University of Defense Technology, China.

Approximating a Deep-Syntactic Metric for MT Evaluation and Tuning Matouš Macháček, Ondřej Bojar; {machacek, Charles University.

Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology Advisor : Dr. Hsu Student : Sheng-Hsuan Wang Department.

NRC Report Conclusion Tu Zhaopeng NIST06  The Portage System  For Chinese large-track entry, used simple, but carefully- tuned, phrase-based.

Multi-level Bootstrapping for Extracting Parallel Sentence from a Quasi-Comparable Corpus Pascale Fung and Percy Cheung Human Language Technology Center,

MACHINE TRANSLATION PAPER 1 Daniel Montalvo, Chrysanthia Cheung-Lau, Jonny Wang CS159 Spring 2011.

1 Minimum Error Rate Training in Statistical Machine Translation Franz Josef Och Information Sciences Institute University of Southern California ACL 2003.

DISTRIBUTED INFORMATION RETRIEVAL Lee Won Hee.

Statistical Machine Translation of Texts with Misspelled Words Nicola Bertoldi, Mauro Cettolo, Marcello Federico FBK - Fondazione Bruno Kessler, Trento,

October 10, 2003BLTS Kickoff Meeting1 Transfer with Strong Decoding Learning Module Transfer Rules {PP,4894} ;;Score: PP::PP [NP POSTP] -> [PREP.

Review: Review: Translating without in-domain corpus: Machine translation post-editing with online learning techniques Antonio L. Lagarda, Daniel Ortiz-Martínez,

English-Hindi Neural machine translation and parallel corpus generation EKANSH GUPTA ROHIT GUPTA.

The University of Illinois System in the CoNLL-2013 Shared Task Alla RozovskayaKai-Wei ChangMark SammonsDan Roth Cognitive Computation Group University.

Mountain... पहाड mountain... पर्वत Mr... श्री Aligarh District... अलीगढ जिला Allahabad University... इलाहाबाद विश्वविद्यालय Amit Vilasrao Deshmukh... अमित.

This research is supported by NIH grant U54-GM114838, a grant from the Allen Institute for Artificial Intelligence (allenai.org), and Contract HR

A CASE STUDY OF GERMAN INTO ENGLISH BY MACHINE TRANSLATION: MOSES EVALUATED USING MOSES FOR MERE MORTALS. Roger Haycock

Language Identification and Part-of-Speech Tagging

Cross-lingual Models of Word Embeddings: An Empirical Comparison

Neural Machine Translation

Designing Cross-Language Information Retrieval System using various Techniques of Query Expansion and Indexing for Improved Performance Hello everyone,

Speaker : chia hua Authors : Long Qin, Ming Sun, Alexander Rudnicky

Neural Machine Translation by Jointly Learning to Align and Translate

Joint Training for Pivot-based Neural Machine Translation

Do-Gil Lee1*, Ilhwan Kim1 and Seok Kee Lee2

Vector-Space (Distributional) Lexical Semantics

Neural Lattice Search for Domain Adaptation in Machine Translation

Neural Lattice Search for Domain Adaptation in Machine Translation

--Mengxue Zhang, Qingyang Li

Deep Learning based Machine Translation

Word Embeddings with Limited Memory

Transformer result, convolutional encoder-decoder

January 12th, 2017 Brown – AC ELA.

On the Impact of Various Types of Noise on Neural Machine Translation

Introduction to Natural Language Processing

Statistical n-gram David ling.

Using GOLD to Tracking L2 Development

Introduction to Text Analysis

Improving IBM Word-Alignment Model 1(Robert C. MOORE)

The XMU SMT System for IWSLT 2007

Using Multilingual Neural Re-ranking Models for Low Resource Target Languages in Cross-lingual Document Detection Using Multilingual Neural Re-ranking.

Word embeddings (continued)

Statistical NLP Spring 2011

Johns Hopkins 2003 Summer Workshop on Syntax and Statistical Machine Translation Chapters 5-8 Ethan Phelps-Goodman.

This material is based upon work supported by the National Science Foundation under Grant #XXXXXX. Any opinions, findings, and conclusions or recommendations.

Sequence-to-Sequence Models

Neural Machine Translation by Jointly Learning to Align and Translate

Presentation transcript:

Translation of Unknown Words in Low Resource Languages Biman Gujral, Huda Khayrallah, and Philipp Koehn Johns Hopkins University 31 October 2016 This talk was presented at AMTA 2016 It is based on this paper: http://www.cs.jhu.edu/~huda/papers/gujral2016AMTA.pdf bib: http://www.cs.jhu.edu/~huda/papers/gujral2016AMTA.bib This talk was presented at AMTA 2016 It is based on this paper: http://www.cs.jhu.edu/~huda/papers/gujral2016AMTA.pdf

Translation of Unknown Words in Low Resource Languages Biman Gujral, Huda Khayrallah, and Philipp Koehn Johns Hopkins University 31 October 2016 This talk was presented at AMTA 2016 It is based on this paper: http://www.cs.jhu.edu/~huda/papers/gujral2016AMTA.pdf

Out of Vocabulary Words (OOVs) Hindi  English: It वूवन पैंट्स, graphic टीज, Polo T शर्टे, शर्टे, शॉट्स, स्कर्टे and bright embroidered jackets etc are included. Uzbek English: Quvayt oʻyinga how koʻryapmiz with the preparation. Quvayt bilan bilan oʻyinga qanday qanday tayyorgarlik tayyorgarlik koʻryapmiz Quvayt oʻyinga how koʻryapmiz with the preparation How are we getting prepared to the match with Kuwayt ? Gujral, Khayrallah, Koehn

Gujral, Khayrallah, Koehn Goals Generate candidates for each OOV Select the best one Gujral, Khayrallah, Koehn

Gujral, Khayrallah, Koehn How big is this problem? Gujral, Khayrallah, Koehn

Gujral, Khayrallah, Koehn Data Hindi  English WMT14 News Training 274k sentences Test 2.5k sentences ~1 OOV/sentence ~5% OOVs Uzbek  English LORELEI News, Wikipedia, social media Training 55k sentences Test 1k sentences ~4 OOVs/sentence ~20% OOVs ~3k OOVs in hindi ~5k in UZ Gujral, Khayrallah, Koehn

Gujral, Khayrallah, Koehn OOV Examples Names I هدى  Huda Misspellings grammer/grammar Inflections play/plays/playing Borrowed words हैलोवीन  Halloween Reinflected Borrowings स्कर्टे  skirts Googlear  to Google Content words अटकलें speculation Gujral, Khayrallah, Koehn

Gujral, Khayrallah, Koehn Distribution of OOVs From table 6 in the paper Gujral, Khayrallah, Koehn

Gujral, Khayrallah, Koehn MT System Moses (Koehn et al. 2007) Phrase Based Large English language model WMT English ’07-’12 Gujral, Khayrallah, Koehn

Gujral, Khayrallah, Koehn Methods Gujral, Khayrallah, Koehn

Gujral, Khayrallah, Koehn Methods Transliteration Levenshtein distance Word Embeddings Gujral, Khayrallah, Koehn

Gujral, Khayrallah, Koehn Transliteration v هدى  Huda हैलोवीन  Halloween Unsupervised Moses mode (Durrani et al. 2014) Character translation model Incorporate larger English language model Uzbek is already written in Latin script, keep original spelling Generate 1 candidate Gujral, Khayrallah, Koehn

Gujral, Khayrallah, Koehn Levenshtein distance grammer/grammar play/plays Minimum number of: insertions deletions substitutions Gujral, Khayrallah, Koehn

Gujral, Khayrallah, Koehn Levenshtein distance qilyapmiz qilyapsiz Find source words with distance ≤ 2 from OOV Use their English translation as translation candidate Generate 18 candidates on average  doing Levenshtein distance = 1  doing Biz (we) qilyapmiz (are doing) Siz (you) qilyapsiz (are doing) Gujral, Khayrallah, Koehn

Gujral, Khayrallah, Koehn Word Embeddings rumors अटकलें अफवाह doubts rumours Figure 3 speculation suspicions misgivings worry speculation worried करोङ Gujral, Khayrallah, Koehn

Gujral, Khayrallah, Koehn Word Embeddings Word2vec (Mikolov et al. 2013) monolingual corpora Multilingual word vectors (Faruqui & Dyer 2014) monolingual vectors alignments Canonical Correlation Analysis (CCA) Generates 20 candidates Gujral, Khayrallah, Koehn

Gujral, Khayrallah, Koehn Word Embeddings English Text English Projection Matrix Word2Vec English Vectors CCA Hindi Text Hindi Projection Matrix Word2Vec Hindi Vectors Gujral, Khayrallah, Koehn

Word Embeddings * * English Projection Matrix Projected English Vectors KNN English Vectors English Candidates Hindi Candidates Hindi Projection Matrix * Projected Hindi Vectors Hindi Vectors Gujral, Khayrallah, Koehn

Gujral, Khayrallah, Koehn Word Embeddings rumors अटकलें अफवाह doubts rumours Figure 3 speculation suspicions misgivings worry speculation worried करोङ Gujral, Khayrallah, Koehn

Gujral, Khayrallah, Koehn Word Embeddings अटकलें doubts rumours suspicions misgivings worry worried speculation … rumors अटकलें अफवाह doubts rumours suspicions misgivings Figure 3 speculation worry worried speculation Gujral, Khayrallah, Koehn

Gujral, Khayrallah, Koehn Word Embeddings अटकलें rumors अफवाह करोङ …  rumor  crore  … अटकलें अफवाह doubts rumours suspicions misgivings Figure 3 speculation worry worried speculation Gujral, Khayrallah, Koehn

Gujral, Khayrallah, Koehn Word Embeddings अटकलें doubts rumours suspicions misgivings worry worried speculation … rumor crore … Figure 3 speculation Gujral, Khayrallah, Koehn

Gujral, Khayrallah, Koehn Integration Gujral, Khayrallah, Koehn

Gujral, Khayrallah, Koehn Integration Transliteration Hindi OOVs English Candidates Select English Translation Levenshtein Embeddings Gujral, Khayrallah, Koehn

Gujral, Khayrallah, Koehn Integration Language Model Phrase table Gujral, Khayrallah, Koehn

Gujral, Khayrallah, Koehn Language Model Large English language model XML markup in Moses (Koehn & Haddow, 2009) Selection occurs during decoding Gujral, Khayrallah, Koehn

Gujral, Khayrallah, Koehn Phrase Table Secondary Phrase Table only includes OOVs Features: Method Word Vector Distance Levenshtein distance Inverse frequency in Monolingual corpus Gujral, Khayrallah, Koehn

Gujral, Khayrallah, Koehn Results Gujral, Khayrallah, Koehn

Gujral, Khayrallah, Koehn Oracle Upper bound on how well a selection method can do given current generation methods Select word from list of candidates that is in the reference *remove stop words *add synonyms Gujral, Khayrallah, Koehn

Gujral, Khayrallah, Koehn BLEU - Uzbek From table 8 in the paper Gujral, Khayrallah, Koehn

Gujral, Khayrallah, Koehn BLEU - Hindi From table 7 in the paper Phrase Table is without multi word candidates Transliteration has an expanded LM Model Gujral, Khayrallah, Koehn

Gujral, Khayrallah, Koehn Beyond BLEU Goals: generate candidates for each OOV How well can we generate translation candidates? select the best one How well can we select from the translation candidates? Gujral, Khayrallah, Koehn

Gujral, Khayrallah, Koehn Coverage How well can we generate translation candidates? Was one of the candidates generated by this method in the reference? Gujral, Khayrallah, Koehn

Gujral, Khayrallah, Koehn Coverage - Uzbek From table 5 in the paper All – 30% Note that the same candidate could be generated by multiple methods TOKENS (copy) Gujral, Khayrallah, Koehn

Gujral, Khayrallah, Koehn Coverage - Hindi From table 4 in the paper All -> 50% TOKENS News data lits of names Gujral, Khayrallah, Koehn

Gujral, Khayrallah, Koehn Accuracy How well can we select from the translation candidates? Is the word we selected in the reference? Gujral, Khayrallah, Koehn

Gujral, Khayrallah, Koehn Accuracy - Uzbek From table 8 in the paper TYPES Gujral, Khayrallah, Koehn

Gujral, Khayrallah, Koehn Accuracy - Hindi From table 7 in the paper - TYPES This is news data, so there are lots of names Phrase Table is without multi word candidates Transliteration has an expanded LM Model Gujral, Khayrallah, Koehn

Conclusion & Future Work Generate Quality translations Selection does not perform as well Improved selection methods More sophisticated embedding projection Analysis of what methods work on which types of OOVs Gujral, Khayrallah, Koehn

Gujral, Khayrallah, Koehn Acknowledgement This material is based upon work supported in part by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR0011-15-C-0113. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the Defense Advanced Research Projects Agency (DARPA). Gujral, Khayrallah, Koehn

Translation of Unknown Words in Low Resource Languages Biman Gujral, Huda Khayrallah, and Philipp Koehn {bgujral1, huda, phi}@jhu.edu Johns Hopkins University

Gujral, Khayrallah, Koehn References Durrani, Haddow, Koehn, and Heafield. (2014). Edinburgh’s Phrase-Based Machine Translation Systems for WMT-14. Workshop on Statistical Machine Translation Koehn, Hoang, Birch, Callison-Burch, Federico, Bertoldi, Cowan, Shen, Moran, Zens, Dyer, Bojar, Constantin, and Herbst. (2007). Moses: Open source toolkit for Statistical Machine Translation. ACL Interactive Poster and Demonstration Sessions Koehn and Haddow. Edinburgh’s submission to all tracks of the WMT2009 Shared Task with Reordering and Speed Improvements to Moses. Workshop on Statistical Machine Translation Faruqui and Dyer. (2014). Improving Vector Space Word Representations Using Multilingual Correlation. In Proceedings of EACL. Gujral, Khayrallah, Koehn