Statistically Motivated Example-based Machine Translation using Translation Memory Sandipan Dandapat, Sara Morrissey, Sudip Kumar Naskar, Harold Somers.


Statistically Motivated Example-based Machine Translation using Translation Memory Sandipan Dandapat, Sara Morrissey, Sudip Kumar Naskar, Harold Somers CNGL, School of Computing, DCU

Introduction
- Machine Translation (MT) is the automatic transfer of information (syntactic and semantic) from one language to another.
- RBMT is characterized by linguistic rules.
- SMT uses mathematical models based on probability distributions estimated from a parallel corpus.
- EBMT integrates both rule-based and data-driven techniques.
- EBMT is often linked to a related technique, Translation Memory (TM), which stores past translations in a database.
- Both EBMT and TM build on the idea of reusing existing translations, but EBMT is an automated translation technique whereas TM is an interactive tool for human translators.

Introduction: SMT vs. EBMT

SMT:
- Works well when a significant amount of training data is available
- Good for open-domain translation
- Has shown difficulties with free word order languages

EBMT:
- Can be developed with a limited example base
- Good for restricted domains: works well when the test and training sets are close
- Reuses segments of a test sentence that can be found in the source side of the example base

Our Attempt
- We use EBMT and TM to tackle the English-Bangla language pair, which has proved troublesome, with low BLEU scores for various SMT approaches (Islam et al., 2010).
- We attempt to translate medical-receptionist dialogues, primarily for appointment scheduling.

Our Goal
- Integrate EBMT and TM for better translation in a restricted domain.
- EBMT helps to find the closest match, and TM is good for translating segments of a sentence.

Creating an English-Bangla Parallel Corpus
- Task: create a manually translated English-Bangla parallel corpus for training.
- Points to consider: native speakers; translation challenges (literal vs. explicit).
- Methodology: manual translation by a native speaker; discussions on translation conventions.

Corpus example:
English: Hello, can I get an appointment sometime later this week?
Bangla: নমস্কার, এই সপ্তাহের শেষের দিকে, কোন সময় একটা অ্যপয়ন্টমেন্ট পাওয়া যাবে কি ?

Notes on Translation Challenges
- Non-alteration of the source text
- Literal translation of the source
Examples: "Which doctor would you prefer?" / "I don't mind" (Bangla translations shown on the slide)

Size and Type of the Corpora
- Because of the stages described above, it is time-consuming to collect a large amount of medical-receptionist dialogue.
- Our corpus therefore comprises 380 dialogue turns.
- In transcription, this works out at just under 3,000 words (~8 words per dialogue turn).
- A very small corpus by any standard.

Note on the Size of the Corpora
- How many examples are needed to adopt a data-driven MT approach?
- No SMT system has been developed with only 380 parallel sentences.
- However, many EBMT systems have been developed with corpora of this size:

  System    Language pair        Size
  TTL       English → Turkish    488
  TDMT      English → Japanese   350
  EDGAR     German → English     303
  ReVerb    English → German     214
  ReVerb    Irish → English      120
  METLA-1   English → French     29
  METLA     English → Urdu       7

Structure of the Corpus
- The medical-receptionist dialogue comprises very similarly structured sentences, for example:
  (1) a. I need a medical for my insurance company.
      b. I need a medical for my new job.
  (2) a. The doctor told me to come back for a follow-up appointment.
      b. The doctor told me to call back in a week.
- Thus, it might be helpful to reuse the translations of the common parts when translating new sentences.
- This leads us to use EBMT.

Main Idea
Input: Ok, I have booked you in for eleven fifteen on Friday with Dr. Thomas.
Fuzzy match in the example base: Ok, I have booked you in for three thirty on Thursday with Dr. Kelly.
Part of the translation comes from the example-base fuzzy match; the remaining parts are translated with the Translation Memory or SMT.

Building the Translation Memory (TM)
- We build the TM automatically from our small patient-dialogue corpus.
- We use Moses to build two TMs:
  - aligned phrase pairs from the Moses phrase table (phrase table, PT)
  - aligned word pairs based on GIZA++ (lexical table, LT)
- We keep all target equivalents of a source phrase in the TM, stored in sorted order by phrase translation probability.

PT examples:
  come in on friday instead ? → পরিবর্তে শুক্রবার আসতে পারবেন ?
  , but dr finn → , কিন্তু ডাঃ ফিন
LT examples:
  hello → হ্যালো # নমস্কার
  eleven → এগারটা # পনেরতে
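
To make the TM construction concrete, here is a minimal sketch (not the authors' code) that loads the two tables into dictionaries keyed by source phrase, keeping all target candidates sorted by translation probability as described above. The Moses phrase-table layout (src ||| tgt ||| scores ...) is standard, but the index of the direct phrase translation probability and the column order of the lexical table are assumptions to adjust for the actual files.

```python
from collections import defaultdict

def load_phrase_table(path, score_index=2):
    """Load a Moses phrase table ('src ||| tgt ||| scores ...') into a dict.

    score_index=2 assumes the direct phrase translation probability p(tgt|src)
    is the third score field; adjust this for your Moses version.
    """
    pt = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = [x.strip() for x in line.split("|||")]
            if len(fields) < 3:
                continue
            src, tgt, scores = fields[0], fields[1], fields[2].split()
            pt[src].append((tgt, float(scores[score_index])))
    for src in pt:                      # keep every candidate, best-scoring first
        pt[src].sort(key=lambda x: x[1], reverse=True)
    return dict(pt)

def load_lexical_table(path):
    """Load a GIZA++/Moses lexical table; each line is assumed to be 'src tgt prob'."""
    lt = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) != 3:
                continue
            lt[parts[0]].append((parts[1], float(parts[2])))
    for src in lt:
        lt[src].sort(key=lambda x: x[1], reverse=True)
    return dict(lt)
```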

Our Approach
Our EBMT system, like most, has three stages:
- Matching – find the closest match to the input sentence in the example base
- Adaptability – find the translations of the desired segments
- Recombination – combine the translations of the desired segments

Matching
- We find the closest sentence (S_c) in the example base for the input sentence (S) to be translated.
- We use a word-based edit distance metric to find this closest matching sentence in the example base.

S:   Ok, I have booked you in for eleven fifteen on Friday with Dr. Thomas.
S_c: Ok, I have booked you in for three thirty on Thursday with Dr. Kelly.

Matching (continued)
- We consider the associated translation (S_c^t) of S_c as the skeleton translation of the input sentence S.

S:     Ok, I have booked you in for eleven fifteen on Friday with Dr. Thomas.
S_c:   Ok, I have booked you in for three thirty on Thursday with Dr. Kelly.
S_c^t: আচ্ছা, আমি আপনার জন্য বৃহস্পতিবার তিনটে তিরিশে ডাঃ কেলির সাথে বুক্ করেছি ৷
       (AchchhA, Ami ApanAra janya bRRihaspatibAra tinaTe tirishe DAH kelira sAthe buk karechhi.)

- We will reuse segments of S_c^t to produce the new translation.
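
A minimal sketch of the matching step, assuming a plain word-level Levenshtein distance (the usual reading of "word-based edit distance"); this is an illustration, not the authors' implementation.

```python
def word_edit_distance(a, b):
    """Levenshtein distance over word tokens (unit cost for ins/del/sub)."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[m][n]

def closest_match(input_sentence, example_base):
    """Return (S_c, S_c^t, distance): the example-base pair whose source side
    is closest to the input sentence under word-based edit distance."""
    s = input_sentence.lower().split()
    best = None
    for src, tgt in example_base:       # example_base: list of (source, target) pairs
        dist = word_edit_distance(s, src.lower().split())
        if best is None or dist < best[2]:
            best = (src, tgt, dist)
    return best
```

On the running example, closest_match would return the "three thirty on Thursday with Dr. Kelly" sentence and its Bangla translation as (S_c, S_c^t).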

Adaptability
- We extract the translations of the inappropriate fragments of the input sentence (S).
- To do this, we align three sentences: the input (S), the closest source-side match (S_c) and its target equivalent (S_c^t).

Step 1: Mark the mismatched portions between the input sentence (S) and the closest source-side match (S_c) using edit distance.

S:   ok, i've booked you in for [eleven fifteen] on [friday] with dr. [thomas]
S_c: ok, i've booked you in for [three thirty] on [thursday] with dr. [kelly]

Adaptability (continued)
Step 2: Align the mismatched portions of S_c with its associated translation S_c^t using our TMs (PT and LT).

S:     ok, i've booked you in for <1: eleven fifteen> on <2: friday> with dr. <3: thomas>
S_c:   ok, i've booked you in for <1: three thirty> on <2: thursday> with dr. <3: kelly>
S_c^t: আচ্ছা, আমি আপনার জন্য <2> <1>ে ডাঃ <3>র সাথে বুক্ করেছি৷

The numbers in the angular brackets keep track of the order of the appropriate fragments.
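
The mismatch marking of step 1 can be sketched as follows; difflib.SequenceMatcher over word tokens is used here as a convenient stand-in for the word-level edit-distance alignment described above (an illustrative assumption, not the authors' code).

```python
import difflib

def mismatched_fragments(s, s_c):
    """Align the input S with its closest match S_c at the word level and
    return the fragments that differ, with their positions in each sentence."""
    s_tok, c_tok = s.lower().split(), s_c.lower().split()
    sm = difflib.SequenceMatcher(a=s_tok, b=c_tok)
    fragments = []
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op != "equal":
            fragments.append({
                "input_fragment": " ".join(s_tok[i1:i2]),   # needs a new translation
                "match_fragment": " ".join(c_tok[j1:j2]),   # located in S_c^t via PT/LT
                "input_span": (i1, i2),
                "match_span": (j1, j2),
            })
    return fragments
```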

Recombination
- Substitute, add or delete segments of the input sentence (S) in the skeleton translation (S_c^t).

S:     ok, i've booked you in for <1: eleven fifteen> on <2: friday> with dr. <3: thomas>
S_c:   ok, i've booked you in for <1: three thirty> on <2: thursday> with dr. <3: kelly>
S_c^t: আচ্ছা, আমি আপনার জন্য <2> <1>ে ডাঃ <3>র সাথে বুক্ করেছি৷

We need T_friday = ?, T_eleven_fifteen = ?, T_thomas = ?
Possible ways of obtaining T_x: T_x = SMT(x) or T_x = PT(x)

Output: আচ্ছা, আমি আপনার জন্য <T_friday> <T_eleven_fifteen>ে ডাঃ <T_thomas>র সাথে বুক্ করেছি ৷

Recombination Algorithm (pseudocode shown as a figure on the slide)
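
Since the algorithm figure does not survive in this transcript, here is a hedged sketch of how the recombination described above could look, reusing the pt/lt dictionaries and fragment records from the earlier sketches. translate_fragment, the per-word SMT fallback and the verbatim-substring substitution are simplifying assumptions, not the authors' algorithm.

```python
def translate_fragment(fragment, pt, lt, smt_fallback=None):
    """Translate an input fragment: phrase table first, then the lexical table
    word by word, then an optional SMT fallback (hypothetical callable)."""
    if fragment in pt and pt[fragment]:
        return pt[fragment][0][0]                     # best-scoring PT candidate
    words = []
    for w in fragment.split():
        if w in lt and lt[w]:
            words.append(lt[w][0][0])
        elif smt_fallback is not None:
            words.append(smt_fallback(w))
        else:
            words.append(w)                           # leave untranslated as a last resort
    return " ".join(words)

def recombine(skeleton, fragments, pt, lt, smt_fallback=None):
    """Substitute translations of mismatched fragments into the skeleton
    translation S_c^t. Assumes the translation of each S_c-side fragment can
    be located verbatim in the skeleton via the phrase table."""
    output = skeleton
    for frag in fragments:
        old = frag["match_fragment"]
        old_trans = pt.get(old, [(old, 0.0)])[0][0]   # translation of the S_c fragment
        new_trans = translate_fragment(frag["input_fragment"], pt, lt, smt_fallback)
        output = output.replace(old_trans, new_trans, 1)
    return output
```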

Experiments
We conduct five different experiments.

Baselines:
1. SMT – using OpenMaTrEx
2. EBMT – based on the matching step only; the skeleton translation is taken as the output

Our approach:
3. EBMT + TM(PT) – uses only the phrase table during recombination
4. EBMT + TM(PT,LT) – uses both the phrase table and the lexical table during recombination
5. EBMT + SMT – untranslated segments are translated using SMT

Results
Data used for the experiments:
- Training data: 381 parallel sentences
- Test data: 41 sentences, disjoint from the training set
We use BLEU and NIST scores for automatic evaluation.
Systems evaluated: SMT, EBMT, EBMT + TM(PT), EBMT + TM(PT,LT), EBMT + SMT (BLEU and NIST scores given on the slide).
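
As an aside on how such scores can be reproduced, here is a small sketch of corpus-level BLEU and NIST evaluation with one reference per test sentence; sacrebleu and NLTK are used here as stand-ins for whichever scoring scripts were actually run.

```python
import sacrebleu
from nltk.translate.nist_score import corpus_nist

def evaluate(hypotheses, references):
    """Corpus-level BLEU (sacrebleu) and NIST (NLTK) with a single reference
    per hypothesis, matching the single-reference test set described above."""
    bleu = sacrebleu.corpus_bleu(hypotheses, [references]).score
    nist = corpus_nist([[ref.split()] for ref in references],
                       [hyp.split() for hyp in hypotheses],
                       n=5)
    return bleu, nist
```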

Results: Manual Evaluation
Four different native speakers were asked to rate the translations on two five-point scales:

Fluency: 5 = Flawless Bangla, 4 = Good Bangla, 3 = Non-native Bangla, 2 = Disfluent Bangla, 1 = Incomprehensible
Adequacy: 5 = All, 4 = Most, 3 = Much, 2 = Little, 1 = None

Systems rated: SMT, EBMT + TM(PT), EBMT + TM(PT,LT), EBMT + SMT (fluency and adequacy scores given on the slide).

Example Translations

Assessment of Error Types
- Wrong source-target alignments in the phrase table and lexical table result in an incorrect alignment between S_c and S_c^t.

Assessment of Error Types (continued)
- Erroneous translations can also be generated during recombination, e.g. "in a few minutes" translated as "in" + "a few minutes":
  "in" → a. নিয়ে (niYe), b. নিয়ে আসতে (niYe Asate), c. আসুন (Asuna)
  "a few minutes" → কয়েক মিনিট দেরিতে (kaYeka miniTa derite)
  "in a few minutes" → নিয়ে কয়েক মিনিট দেরিতে (niYe kaYeka miniTa derite), which is erroneous

Observations
- The baseline EBMT system scores higher on all metrics than the baseline SMT system.
- The combination of EBMT and TM is more accurate than both the baseline SMT and EBMT systems.
- The combination of SMT with EBMT shows some improvement over baseline EBMT, but is less accurate than the combination of TM with EBMT.

Conclusion and Future Work
- We have presented initial investigations into combining TM within an EBMT framework.
- The integration of TM with EBMT has improved translation quality.
- The error analysis shows that syntax-based matching and adaptation might help to reduce false-positive adaptations.
- The use of morpho-syntactic information during recombination might further improve translation quality.

Thank you! Questions?