1
Statistically Motivated Example-based Machine Translation using Translation Memory
Sandipan Dandapat, Sara Morrissey, Sudip Kumar Naskar, Harold Somers
CNGL, School of Computing, DCU
2
Introduction
Machine Translation (MT) is the automatic translation of information (syntactic and semantic) from one language into another.
RBMT – characterized by hand-crafted linguistic rules
SMT – mathematical models based on probability distributions learned from a parallel corpus
EBMT – integrates both rule-based and data-driven techniques
EBMT is often linked to a related technique, Translation Memory (TM), which stores past translations in a database.
Both EBMT and TM are built on the idea of reusing existing translations, but EBMT is an automated translation technique whereas TM is an interactive tool for human translators.
3
Introduction: SMT vs. EBMT
SMT:
- Works well when a significant amount of training data is available
- Good for open-domain translation
- Has shown difficulties with free word order languages
EBMT:
- Can be developed with a limited example base
- Good for restricted domains: works well when the test and training sets are close
- Reuses segments of a test sentence that can be found in the source side of the example base
4
Our Attempt
We use EBMT and TM to tackle the English–Bangla language pair, which has proved troublesome, with low BLEU scores for various SMT approaches (Islam et al., 2010).
We attempt to translate medical-receptionist dialogues, primarily for appointment scheduling.
Our Goal
Integrate EBMT and TM for better translation in a restricted domain: EBMT helps to find the closest match, and TM is good for translating segments of a sentence.
5
Creating an English–Bangla Parallel Corpus
Task: create a manually translated English–Bangla parallel corpus for training
Points to consider: native speakers; translation challenges (literal vs. explicit translation)
Methodology: manual translation by a native speaker; discussions on translation conventions
Corpus example:
English: Hello, can I get an appointment sometime later this week?
Bangla: নমস্কার, এই সপ্তাহের শেষের দিকে, কোন সময় একটা অ্যপয়ন্টমেন্ট পাওয়া যাবে কি?
6
Notes on Translation Challenges
- Non-alteration of the source text
- Literal translation of the source
Example: "Which doctor would you prefer?" – "I don't mind" (Bangla translations shown on the slide)
7
Size and Type of the Corpora
Because of the stages described above, it is time-consuming to collect a large amount of medical-receptionist dialogue.
Thus, our corpus comprises 380 dialogue turns.
In transcription, this works out at just under 3,000 words (~8 words per dialogue turn).
A very small corpus by any standard.
8
Note on the Size of the Corpora
How many examples are needed to adopt a data-driven MT system?
No SMT system has been developed with only 380 parallel sentences, but many EBMT systems have been built with a corpus this small:

System    Language Pair        Size
TTL       English → Turkish    488
TDMT      English → Japanese   350
EDGAR     German → English     303
ReVerb    English → German     214
ReVerb    Irish → English      120
METLA-1   English → French     29
METLA     English → Urdu       7
9
Structure of the Corpus
Medical-receptionist dialogue comprises very similarly structured sentences.
Examples:
(1) a. I need a medical for my insurance company.
    b. I need a medical for my new job.
(2) a. The doctor told me to come back for a follow-up appointment.
    b. The doctor told me to call back in a week.
Thus, it might be helpful to reuse the translations of the common parts when translating new sentences.
This leads us to use EBMT.
10
Main Idea
Input: Ok, I have booked you in for eleven fifteen on Friday with Dr. Thomas.
Fuzzy match in the example base: Ok, I have booked you in for three thirty on Thursday with Dr. Kelly.
=> The translation of the matched part comes from the example-base fuzzy match; the mismatched parts (eleven fifteen, Friday, Dr. Thomas) are translated by the Translation Memory or SMT.
Output: Ok, I have booked you in for eleven fifteen on Friday with Dr. Thomas.
11
Building Translation Memory (TM)
We build the TM automatically from our small patient-dialogue corpus.
We use Moses to build two TMs:
- Aligned phrase pairs from the Moses phrase table (phrase table – PT)
- Aligned word pairs based on GIZA++ (lexical table – LT)
We keep all the target equivalents of a source phrase in the TM, stored in sorted order by phrase translation probability.
PT: come in on friday instead ? → পরিবর্তে শুক্রবার আসতে পারবেন ? | but dr finn → কিন্তু ডাঃ ফিন
LT: hello → হ্যালো # নমস্কার | eleven → এগারটা # পনেরতে
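The slides do not show how these tables are stored. As a rough illustration only, the sketch below loads a Moses-style phrase table into such a TM, assuming the standard `src ||| tgt ||| scores` line format; the position of the direct phrase translation probability among the score fields varies across setups and is an assumption here, as is the file path.

```python
from collections import defaultdict

def load_phrase_table(path, prob_index=2):
    """Load a Moses-style phrase table into a TM dictionary.

    Assumes lines of the form 'src ||| tgt ||| s1 s2 s3 s4 ...';
    prob_index selects the direct phrase translation probability
    (its position among the scores is an assumption).
    """
    tm = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = [x.strip() for x in line.split("|||")]
            if len(fields) < 3:
                continue
            src, tgt, scores = fields[0], fields[1], fields[2].split()
            tm[src].append((tgt, float(scores[prob_index])))
    # Keep all target equivalents of a source phrase, sorted by
    # translation probability, as described on the slide.
    for src in tm:
        tm[src].sort(key=lambda pair: pair[1], reverse=True)
    return dict(tm)

# pt = load_phrase_table("phrase-table")  # hypothetical path
# pt["come in on friday instead ?"][0]    # -> most probable Bangla equivalent
```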
13
Our Approach
Our EBMT system, like most, has three stages:
- Matching – find the closest match to the input sentence
- Adaptability – find the translations of the desired segments
- Recombination – combine the translations of the desired segments
14
Matching
We find the closest sentence (S_c) in the example base for the input sentence (S) to be translated.
We use a word-based edit-distance metric to find this closest-match sentence in the example base.
S:   Ok, I have booked you in for eleven fifteen on Friday with Dr. Thomas.
S_c: Ok, I have booked you in for three thirty on Thursday with Dr. Kelly.
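A minimal sketch of this matching step, assuming the example base is held as a simple list of (source, target) sentence pairs:

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance between token lists a and b."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def closest_match(s, example_base):
    """Return the (source, target) pair whose source side is nearest to s."""
    tokens = s.lower().split()
    return min(example_base,
               key=lambda ex: edit_distance(tokens, ex[0].lower().split()))
```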
15
Matching
We take the associated translation (S_c^t) of S_c as the skeleton translation of the input sentence S.
S:     Ok, I have booked you in for eleven fifteen on Friday with Dr. Thomas.
S_c:   Ok, I have booked you in for three thirty on Thursday with Dr. Kelly.
S_c^t: আচ্ছা, আমি আপনার জন্য বৃহস্পতিবার তিনটে তিরিশে ডাঃ কেলির সাথে বুক্ করেছি ৷ (AchchhA, Ami ApanAra janya bRRihaspatibAra tinaTe tirishe DAH kelira sAthe buk karechhi.)
We will reuse segments of S_c^t to produce a new translation.
16
Adaptability
We extract the translations of the inappropriate fragments of the input sentence (S).
To do this, we align three sentences: the input (S), the closest source-side match (S_c) and its target equivalent (S_c^t).
1. Mark the mismatched portions between the input sentence (S) and the closest source-side match (S_c) using edit distance:
S:   ok, i've booked you in for [eleven fifteen] on [friday] with dr. [thomas].
S_c: ok, i've booked you in for [three thirty] on [thursday] with dr. [kelly].
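A minimal sketch of this mismatch-marking step; Python's difflib stands in here for the edit-distance backtrace (whitespace tokenisation is an assumption, and both recover the same matched/replaced word spans):

```python
from difflib import SequenceMatcher

def mismatched_fragments(s, s_c):
    """Return the (input_fragment, match_fragment) pairs that differ
    between the input s and the closest match s_c, in sentence order."""
    a, b = s.lower().split(), s_c.lower().split()
    sm = SequenceMatcher(a=a, b=b)
    slots = []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag != "equal":  # replace / insert / delete = a mismatch
            slots.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return slots

# s   = "ok , i've booked you in for eleven fifteen on friday with dr. thomas ."
# s_c = "ok , i've booked you in for three thirty on thursday with dr. kelly ."
# mismatched_fragments(s, s_c)
# -> [('eleven fifteen', 'three thirty'), ('friday', 'thursday'), ('thomas', 'kelly')]
```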
17
Adaptability
We extract the translations of the inappropriate fragments of the input sentence (S).
To do this, we align three sentences: the input (S), the closest source-side match (S_c) and its target equivalent (S_c^t).
2. We further align the mismatched portions of S_c with its associated translation S_c^t using our TMs (PT and LT):
S:     ok, i've booked you in for [eleven fifteen]<1> on [friday]<2> with dr. [thomas]<3>.
S_c:   ok, i've booked you in for [three thirty]<1> on [thursday]<2> with dr. [kelly]<3>.
S_c^t: আচ্ছা, আমি আপনার জন্য <2> <1>ে ডাঃ <3>র সাথে বুক্ করেছি৷
The numbers in angular brackets keep track of the order of the appropriate fragments.
18
Recombination
Substitute, add or delete segments of the input sentence (S) in the skeleton translation equivalent (S_c^t):
S:     ok, i've booked you in for [eleven fifteen]<1> on [friday]<2> with dr. [thomas]<3>.
S_c:   ok, i've booked you in for [three thirty]<1> on [thursday]<2> with dr. [kelly]<3>.
S_c^t: আচ্ছা, আমি আপনার জন্য <2> <1>ে ডাঃ <3>র সাথে বুক্ করেছি৷
T_friday = ?  T_eleven_fifteen = ?  T_thomas = ?
Possible ways of obtaining T_x:
- T_x = SMT(x)
- T_x = PT(x)
>> আচ্ছা, আমি আপনার জন্য [T_friday] [T_eleven_fifteen]ে ডাঃ [T_thomas]র সাথে বুক্ করেছি ৷
19
Recombination Algorithm
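The algorithm itself appears as a figure on the original slide. The sketch below is one interpretation of the procedure described on the preceding slides, not the authors' exact pseudocode: translate each mismatched fragment with the phrase table, fall back to the lexical table and then to an SMT call (both fallbacks and the smt callable are assumptions), and substitute the results into the numbered slots of the skeleton.

```python
def translate_fragment(frag, pt, lt, smt=None):
    """Translate one mismatched fragment: phrase table first, then the
    lexical table word by word, then an SMT fallback (assumed order)."""
    if frag in pt:
        return pt[frag][0][0]          # most probable PT equivalent
    words = frag.split()
    if all(w in lt for w in words):
        return " ".join(lt[w][0][0] for w in words)
    if smt is not None:
        return smt(frag)               # e.g. a call out to an SMT decoder
    return frag                        # give up: copy the source through

def recombine(skeleton, slots, pt, lt, smt=None):
    """Fill the numbered gaps in the skeleton translation.

    skeleton -- target sentence with <1>, <2>, ... slot markers
    slots    -- input-side mismatch fragments, in slot-number order
    """
    out = skeleton
    for i, frag in enumerate(slots, start=1):
        out = out.replace(f"<{i}>", translate_fragment(frag, pt, lt, smt))
    return out
```

Because the slot markers carry numbers, the target-side reordering seen in the example (<2> before <1>) is handled automatically by the substitution.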
20
Experiments
We conduct 5 different experiments.
Baseline:
1. SMT – uses OpenMaTrEx (http://www.openmatrex.org)
2. EBMT – based on the matching step alone; we take the skeleton translation as the output
Our approach:
3. EBMT + TM(PT) – uses only the phrase table during recombination
4. EBMT + TM(PT,LT) – uses both the phrase and lexical tables during recombination
5. EBMT + SMT – untranslated segments are translated using SMT
21
Results
Data used for the experiments:
Training data – 381 parallel sentences
Test data – 41 sentences, disjoint from the training set
We use BLEU and NIST scores for automatic evaluation.

System            BLEU   NIST
SMT               39.32  4.84
EBMT              50.38  5.32
EBMT + TM(PT)     57.47  5.92
EBMT + TM(PT,LT)  57.56  6.00
EBMT + SMT        52.01  5.51
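The paper's exact scoring tool is not stated here; for illustration only, corpus-level BLEU and NIST can be computed with NLTK roughly as follows (NLTK, whitespace tokenisation and a single reference per sentence are assumptions):

```python
from nltk.translate.bleu_score import corpus_bleu
from nltk.translate.nist_score import corpus_nist

def evaluate(hypotheses, references):
    """Corpus-level BLEU and NIST over whitespace-tokenised sentences.

    hypotheses -- list of system output strings
    references -- list of reference translation strings (one each)
    """
    hyps = [h.split() for h in hypotheses]
    refs = [[r.split()] for r in references]  # one reference per sentence
    return corpus_bleu(refs, hyps), corpus_nist(refs, hyps, n=5)
```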
22
Results
Manual evaluation – 4 native speakers rated the translations on two scales:

Fluency                   Adequacy
5 = Flawless Bangla       5 = All
4 = Good Bangla           4 = Most
3 = Non-native Bangla     3 = Much
2 = Disfluent Bangla      2 = Little
1 = Incomprehensible      1 = None

System            Fluency  Adequacy
SMT               3.00     3.16
EBMT + TM(PT)     3.50     3.55
EBMT + TM(PT,LT)  3.50     3.70
EBMT + SMT        3.44     3.52
23
Example Translations
24
Assessment of Error Types
Wrong source–target alignments in the phrase table and lexical table result in incorrect fragment translations.
25
Assessment of Error Types
Erroneous translations can also be generated during recombination.
Example: "in a few minutes" is segmented as "in" + "a few minutes"
in → a. নিয়ে (niYe)  b. নিয়ে আসতে (niYe Asate)  c. আসুন (Asuna)
a few minutes → কয়েক মিনিট দেরিতে (kaYeka miniTa derite)
in a few minutes → নিয়ে কয়েক মিনিট দেরিতে (niYe kaYeka miniTa derite) [erroneous]
26
Observations
The baseline EBMT system has higher accuracy on all metrics than the baseline SMT system.
The combination of EBMT and TM has better accuracy than both the baseline SMT and EBMT systems.
The combination of SMT with EBMT shows some improvement over baseline EBMT, but lower accuracy than the combination of TM with EBMT.
27
Conclusion and Future Work
We have presented initial investigations into combining TM within an EBMT framework.
The integration of TM with EBMT has improved translation quality.
The error analysis suggests that syntax-based matching and adaptation might help to reduce false-positive adaptations.
Use of morpho-syntactic information during recombination might further improve translation quality.
28
Thank you! Questions?