1
Integrated Parallel Data Extraction from Comparable Corpora for Statistical Machine Translation Kurohashi & Kawahara Lab. Chenhui Chu
2
Example of Machine Translation Google Translate 2015/02/27 https://translate.google.com 2
3
Languages Google Translate Supports Amazing! How can Google Translate support so many languages? 3 Google Translate 2015/02/27 https://translate.google.com
4
Statistical Machine Translation (SMT) [Brown+ 1993] [Figure: an SMT pipeline trained on a Zh-Ja parallel corpus (seven example sentence pairs shown on the slide); the parallel corpus trains the translation model, and a language model and a decoder complete the system. Example — Input (Zh): 作为测量器械使用了秒表 → Output (Ja): 測定機器としてはストップウォッチを用いた] Rapid development of MT systems for different language pairs and domains is possible, once parallel corpora are available (a sketch of the underlying model follows below) 4
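For reference, the modeling behind this slide can be summarized with the standard noisy-channel formulation of [Brown+ 1993]; this is a textbook identity added here for clarity, not text taken from the slide:

```latex
% Noisy-channel SMT: the decoder searches for the target sentence e that
% maximizes the product of the language model P(e) and the translation
% model P(f | e), given the source sentence f.
\hat{e} \;=\; \operatorname*{arg\,max}_{e} P(e \mid f)
       \;=\; \operatorname*{arg\,max}_{e} P(e)\, P(f \mid e)
```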
5
Parallel Corpora Construction Collect from manually translated data such as EU Parliament proceedings [Koehn+ 2005] and patent families [Utiyama+ 2007] – Such data is limited Collect parallel sentences from the Web [Google Translate; Ling+ 2013; Tian+ 2014] – The Web is very noisy, leading to noisy sentence pairs Construct in a collaborative manner [Tatoeba project http://tatoeba.org/eng/] – Difficult to motivate people Crowdsourcing [Zaidan+ 2011; Post+ 2012] – Difficult to control the quality 5
6
Scarceness of Parallel Corpora Richness of languages – There are about 7,000 languages in the world, and it is difficult to construct parallel corpora for every language pair Domain diversity – Constructing parallel corpora for every domain is not an easy task, even for language pairs with rich resources Problem when pivoting via English – X-En and En-Y parallel corpora are not available for all languages and domains 6
7
List of Online Available Multilingual Parallel Corpora [Table: list of corpora, omitted] # Collected with the help of [Koehn+ 2010], http://opus.lingfil.uu.se and http://www.statmt.org/moses/?n=Moses.LinksToCorpora 7
8
Problems Caused by Scarceness of Parallel Corpora The coverage problem – Scarceness makes the coverage of the translation model low, leading to high out of vocabulary (OOV) word rates [Callison-Burch+ 2006] – It also occurs when the domain shifts [Irvine+ 2013] The accuracy problem – The quality of the translation model correlates with the quality and quantity of parallel corpora, and scarceness makes it inaccurate [Irvine+ 2013] 8
9
Outline 1.Background 2.Overview of Our Approach 3.Bilingual Lexicon Extraction 4.Parallel Sentence Extraction 5.Parallel Fragment Extraction 6.Conclusion and Future Work 9
10
Outline 1.Background 2.Overview of Our Approach 3.Bilingual Lexicon Extraction 4.Parallel Sentence Extraction 5.Parallel Fragment Extraction 6.Conclusion and Future Work 10
11
Comparable Corpora Zh:Ja: ... 塞特港(法语: Sète )是位于 法国南部的市镇,属朗格多克 - 鲁西永大区的埃罗省。塞特港是 一座临地中海的度假地。这里的 美食也颇为知名。塞特港也是许 多艺术家的故乡。直到 1681 年 米迪运河修好之前,塞特港都是 座小渔村。在 19 世纪修建了港 口。 ... セット ( Sète 、オック 語 :Seta )は、フランス、ラン グドック=ルシヨン地域圏、エ ロー県の都市。別名『ラング ドックのヴェネツィア』と呼ば れている。港湾と地中海に面し たリゾート地である。 1681 年 にミディ運河ができるまでは漁 村であった。 1927 年まで、 Cette の綴りを採用していた。 ... ※ Example of comparable texts describing a French city “Sète” from Wikipedia (green: bilingual lexicons, blue: parallel sentences, and orange: parallel fragments). 11
12
Reasons for Exploiting Comparable Corpora for SMT They only require a set of monolingual corpora They are readily available for various domains – Such as Wikipedia, academic papers, news articles and social media There is a large amount of parallel data in comparable corpora – Such as bilingual lexicons, parallel sentences and parallel fragments 12
13
Related Work Parallel data extraction – Bilingual lexicon extraction [Rapp+ 1999; Vulic+ 2011] – Parallel sentence extraction [Utiyama+ 2003; Munteanu+ 2005] – Parallel fragment extraction [Munteanu+ 2006; Quirk+ 2007] Translation model improvement using parallel data extraction – Improving the coverage [Daumé III+ 2011; Zhang+ 2013] – Improving the accuracy [Klementiev+ 2012] 13
14
Integrated Parallel Data Extraction Framework [Figure: overview of the framework. Comparable corpora and a seed parallel corpus feed C.3 bilingual lexicon extraction (BLE); cross-lingual IR produces parallel sentence candidates for C.4 parallel sentence extraction, whose comparable sentences feed C.5 parallel fragment extraction; the extracted data form an augmented parallel corpus that, together with the bilingual dictionary and C.2 common Chinese characters, builds the translation model, and C.6 improves SMT accuracy using BLE] 14
15
Contributions of This Study An integrated framework for parallel data extraction – All the tasks are closely connected and benefit each other Novel approaches for – bilingual lexicon, parallel sentence and parallel fragment extraction – addressing the accuracy problem of SMT using BLE The framework is language-independent, and common Chinese characters are helpful – It can be applied to other language pairs that share cognates 15
16
Outline 1.Background 2.Overview of Our Approach 3.Bilingual Lexicon Extraction 4.Parallel Sentence Extraction 5.Parallel Fragment Extraction 6.Conclusion and Future Work 16
17
Related Work Topic model based method [Vulic+ 2011] – Bilingual lexicons often appear in the same cross-lingual topics (document-level context) – Does not require any prior knowledge Context based method [Rapp+ 1999] – Bilingual lexicons appear in similar contexts across languages (usually window-based context) – Requires a seed dictionary 17
18
Bilingual Lexicon Extraction System [Figure: comparable corpora feed a topic model based method and a context based method; their outputs (topical bilingual lexicons, e.g. for 市場: 公司 0.0626, 市场 0.0600, 客户 0.0474, and contextual bilingual lexicons, e.g. 客户 0.0840, 市场 0.0680, 公司 0.0557) are combined into combined bilingual lexicons (市场 0.0616, 公司 0.0612, 客户 0.0547); an unsupervised seed dictionary is built and then iteratively enlarged] 18
19
Topic Model Based Method [Vulic+ 2011] [Figure: BiLDA plate diagram with topic distribution θ and word-topic distributions φ, ψ (hyperparameters α, β; K topics; D document pairs); topical similarities for 市場: 公司 Sim=0.0626, 市场 Sim=0.0600, 客户 Sim=0.0474] 19
20
Context Based Method [Rapp+ 1999] [Figure: the context vector of 市場, built from contexts such as 「… 大 規模 な 地方 卸売 市場 は 地域 拠点 に 分類 …」 and 「… 現在 小売 市場 は 小売 商業 調整 特別 措置 法 …」, is projected via a seed dictionary and compared with the context vectors of the Chinese candidates; resulting similarities: 客户 Sim=0.0840, 市场 Sim=0.0680, 公司 Sim=0.0557. A sketch of this method follows below.] 20
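As an illustration of the window-based context modeling above, here is a minimal sketch in Python, assuming toy tokenized corpora and a tiny seed dictionary; variable names and the raw-count weighting are illustrative, not taken from the slides:

```python
# Minimal sketch of the context based method [Rapp+ 1999]:
# window-based context vectors, seed-dictionary projection, cosine similarity.
from collections import Counter
import math

def context_vector(corpus, word, window=2):
    """Count co-occurring words within +/-window positions of `word`."""
    vec = Counter()
    for sent in corpus:                       # corpus: list of token lists
        for i, tok in enumerate(sent):
            if tok != word:
                continue
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vec[sent[j]] += 1
    return vec

def project(vec, seed_dict):
    """Project a source-language context vector into the target language
    via a seed dictionary (context words not in the dictionary are dropped)."""
    projected = Counter()
    for w, c in vec.items():
        if w in seed_dict:
            projected[seed_dict[w]] += c
    return projected

def cosine(v1, v2):
    dot = sum(v1[k] * v2[k] for k in v1.keys() & v2.keys())
    n1 = math.sqrt(sum(c * c for c in v1.values()))
    n2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def contextual_similarity(src_corpus, tgt_corpus, src_word, tgt_word, seed_dict):
    src_vec = project(context_vector(src_corpus, src_word), seed_dict)
    tgt_vec = context_vector(tgt_corpus, tgt_word)
    return cosine(src_vec, tgt_vec)
```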
21
Combination Combined similarity score: Sim_comb = 0.8 × Sim_topic + 0.2 × Sim_context. For 市場, the topical scores (公司 0.0626, 市场 0.0600, 客户 0.0474) and contextual scores (客户 0.0840, 市场 0.0680, 公司 0.0557) combine into 市场 0.0616, 公司 0.0612, 客户 0.0547 (a worked sketch follows below) 21
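A worked sketch of the combination step, using the similarity scores shown on this slide for the source word 市場 and an interpolation weight of 0.8 on the topical score (the value that reproduces the combined scores on the slide):

```python
# Combine topical and contextual similarities for the candidates of 市場
# and rerank them; the numbers are the ones shown on this slide.
GAMMA = 0.8

topical    = {"公司": 0.0626, "市场": 0.0600, "客户": 0.0474}
contextual = {"客户": 0.0840, "市场": 0.0680, "公司": 0.0557}

combined = {cand: GAMMA * topical[cand] + (1 - GAMMA) * contextual[cand]
            for cand in topical}

# Rank candidates by the combined score (highest first).
for cand, score in sorted(combined.items(), key=lambda kv: -kv[1]):
    print(f"{cand}\t{score:.4f}")
# Expected output (matching the slide): 市场 0.0616, 公司 0.0612, 客户 0.0547
```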
22
Dataset Training data – Wikipedia: 10k and all (162k) Ja-Zh article pairs aligned via the interlanguage links – Kept only lemmatized nouns 10k article pairs: 51k Japanese and 114k Chinese nouns All article pairs: 104k Japanese and 772k Chinese nouns Testing data – Manually created Ja-Zh test sets for the most frequent 1,000 source words in all the articles 22
23
Results (Ja-Zh Precision@1) [Figure: Precision@1 plotted against iteration for the compared settings] Even the accuracy of “Combination (K=200) [all]” is not high – Underestimated because of the incompleteness of our test set (32% → 0.6) – Precision@3: 0.5370 (→ 0.75) The results after the first few iterations are unstable We iteratively increase the size of the seed dictionary by choosing confident lexicons [Vulic+ 2013] 23
24
Improved Example

Candidate | Sim_Topic | Sim_Context | Sim_Comb
开发 | 0.0503 | 0.2691 | 0.0941
计划 | 0.0624 | 0.1492 | 0.0798
研发 | 0.0519 | 0.1773 | 0.0770
测试 | 0.0561 | 0.1577 | 0.0764
里程碑 | 0.0494 | 0.0925 | 0.0580

※ An improved example for the word “開発”, where the topical similarity scores are similar, while the contextual similarity scores are distinguishable 24
25
Not Improved Example

Candidate | Sim_Topic | Sim_Context | Sim_Comb
政治 | 0.0545 | 0.1186 | 0.0673
政权 | 0.0527 | 0.1110 | 0.0644
冲突 | 0.0500 | 0.1131 | 0.0626
内战 | 0.0502 | 0.0814 | 0.0564
政变 | 0.0515 | 0.0684 | 0.0549

※ A not improved example for the word “対立”, where the linear combination of the two scores is not discriminative enough 25
26
Summary of Bilingual Lexicon Extraction We proposed a bilingual lexicon extraction system exploiting both topical and contextual knowledge in an iterative process Future Work – Extract translations for polysemous, compound and rare words – Experiment on other comparable corpora 26
27
Outline 1.Background 2.Overview of Our Approach 3.Bilingual Lexicon Extraction 4.Parallel Sentence Extraction 5.Parallel Fragment Extraction 6.Conclusion and Future Work 27
28
Related Work Parallel sentence identification methods – Binary classification [Munteanu+ 2005] – Translation similarity measures [Utiyama+ 2003] – Both use similar features, such as word overlap and sentence length based features Extraction from various comparable corpora – Bilingual news articles [Do+ 2010] – Patent data [Lu+ 2010] – Wikipedia [Stefanescu+ 2013] 28
29
Parallel Sentence Extraction System [Figure: Zh-Ja Wikipedia article pairs obtained via the interlanguage links are filtered (common Chinese character filtering) into parallel sentence candidates; a classifier trained on a seed parallel corpus, using a bilingual dictionary, common Chinese characters, three novel feature sets and BLE, separates parallel sentences from comparable sentences] 29
30
Common Chinese Characters [Chu+ 2012] Chinese and Japanese share common Chinese characters – We previously created a mapping table for them (e.g. Zh 雪/爱/发 correspond to Ja 雪/愛/発) 45% of Zh hanzi and 75% of Ja kanji are common Chinese characters in a parallel corpus Zh: 而被指定为政令指定都市、中核市、特例市。 Ja: 別途政令指定都市、中核市、特例市に定められている。 Common Chinese characters can be helpful clues for both the filter and the classifier 30
31
Parallel Sentence Classifier [Figure: positive instances are taken from the seed parallel corpus; the Cartesian product of its sentences gives non-parallel sentence pairs, which are filtered (using the bilingual dictionary and common Chinese characters) into negative instances for training the classifier] 31
32
Proposed Novel Features
Chinese Character (CC) Features: number of Chinese characters; percentage of characters that are Chinese characters; ratio of Chinese characters on both sides; number of n-gram common Chinese characters; percentage of n-gram common Chinese characters
Non-Chinese Character (Non-CC) Word Features (for Non-CC words, we do not consider Japanese kana): number of Non-CC words; percentage of words that are Non-CC words; ratio of Non-CC words on both sides; number of identical Non-CC words; percentage of the Non-CC words that are identical
Content Word Features: percentage of words that are content words; content word overlap (according to the dictionary)
A sketch of a few of these features follows below. 32
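A hedged sketch of a few of the Chinese character features in Python; `common_map` is a hypothetical stand-in for the mapping table of [Chu+ 2012] (not reproduced here), and only the unigram case of the common-character features is shown:

```python
def is_cjk_ideograph(ch):
    """True for characters in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"

def cc_features(zh_sent, ja_sent, common_map):
    """Compute a few CC features for a Zh/Ja sentence pair (strings).
    common_map: hypothetical dict mapping Japanese kanji to their common
    Chinese character form."""
    zh_cc = [c for c in zh_sent if is_cjk_ideograph(c)]
    ja_cc = [c for c in ja_sent if is_cjk_ideograph(c)]
    # Normalize Japanese kanji to the Chinese form before comparing.
    ja_norm = [common_map.get(c, c) for c in ja_cc]
    common = set(zh_cc) & set(ja_norm)
    return {
        "num_cc_zh": len(zh_cc),
        "num_cc_ja": len(ja_cc),
        "pct_cc_zh": len(zh_cc) / max(len(zh_sent), 1),
        "pct_cc_ja": len(ja_cc) / max(len(ja_sent), 1),
        "cc_ratio": min(len(zh_cc), len(ja_cc)) / max(len(zh_cc), len(ja_cc), 1),
        "num_common_cc": len(common),
        "pct_common_cc": len(common) / max(len(set(zh_cc) | set(ja_norm)), 1),
    }
```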
33
Dataset Seed parallel corpus: Zh-Ja paper abstract corpus (680k sentences, scientific domain) – Used two distinct sets of 5k parallel sentences for training and testing the classifier Wikipedia: 162k Zh-Ja article pairs aligned via the interlanguage links (2.1M Zh and 3.5M Ja sentences) 33
34
Extraction Experimental settings Dictionary: top 5 translations with probability ≧ 0.1 generated from the seed corpus Parallel sentence candidate filtering – WF: dictionary-based word overlap (threshold: 0.25) – CCF: common Chinese character overlap (threshold: 0.1 for Zh and 0.3 for Ja) – WF and CCF: logical conjunction of WF and CCF – WF or CCF: logical disjunction of WF and CCF Classifier – LIBSVM (5-fold cross-validation, RBF kernel) – Classification probability threshold: 0.9 34
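The following is a minimal sketch of the WF and CCF filters with the thresholds listed above; the exact overlap definitions are assumptions (the dictionary is assumed to map Chinese words to sets of Japanese translations, and `to_common_cc` is a hypothetical helper that normalizes a Japanese kanji to its common Chinese character form):

```python
def is_cjk_ideograph(ch):
    return "\u4e00" <= ch <= "\u9fff"

def word_overlap_filter(zh_words, ja_words, dictionary, threshold=0.25):
    """WF: fraction of Zh words with a dictionary translation in the Ja sentence."""
    ja_set = set(ja_words)
    covered = sum(1 for w in zh_words if dictionary.get(w, set()) & ja_set)
    return covered / max(len(zh_words), 1) >= threshold

def cc_overlap_filter(zh_sent, ja_sent, to_common_cc,
                      zh_threshold=0.1, ja_threshold=0.3):
    """CCF: common Chinese character overlap, checked on each side."""
    zh_cc = {c for c in zh_sent if is_cjk_ideograph(c)}
    ja_cc = {to_common_cc(c) for c in ja_sent if is_cjk_ideograph(c)}
    common = zh_cc & ja_cc
    zh_ratio = len(common) / max(len(zh_cc), 1)
    ja_ratio = len(common) / max(len(ja_cc), 1)
    return zh_ratio >= zh_threshold and ja_ratio >= ja_threshold

def keep_candidate(zh_words, ja_words, dictionary, to_common_cc, mode="WF or CCF"):
    """Combine the two filters by conjunction or disjunction, as on the slide."""
    wf = word_overlap_filter(zh_words, ja_words, dictionary)
    ccf = cc_overlap_filter("".join(zh_words), "".join(ja_words), to_common_cc)
    return (wf and ccf) if mode == "WF and CCF" else (wf or ccf)
```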
35
Parallel Sentence Classification Results [Figure: classification results for each feature setting, omitted] ※ [Munteanu+ 2005] uses sentence length, word overlap and alignment features ※ +CC: additionally use Chinese character (CC) features ※ +Non-CC: additionally use Non-Chinese character (Non-CC) word features ※ +Content: additionally use content word features 35
36
Parallel Sentence Extraction Results [Figure: number of extracted sentences (unit: k) for each method] ※ (WF) and (CCF) denote the word overlap filter and the common Chinese character filter respectively 36
37
MT Experimental Settings Tuning and testing: two distinct sets of 198 parallel sentences selected from Wikipedia Decoder: Moses [Koehn+ 2007] (Zh to Ja translation) Language model: 5-gram LM trained on the Ja Wikipedia (10.7M sentences) using SRILM Compared MT performance using the extracted sentences by different methods as training data 37
38
MT Results [Figure: BLEU-4 and OOV% for each system] ※ Seed denotes the system trained on the seed parallel corpus ※ “†” and “‡” denote that the result is significantly better than “[Munteanu+ 2005]” and “+CC (WF)” respectively at p < 0.05 38
39
Examples of Extracted Sentence Pairs
Zh: 此外，牧伸二也在「フランク永井低音的魅力，牧伸二低能的魅力」漫谈中披露这些事。
Ja: また牧伸二も漫談で「フランク永井は低音の魅力、牧伸二は低能の魅力」というネタを披露した。
Zh: 本专辑与首张单曲「玻璃少年」同时发售。
Ja: デビューシングル「硝子の少年」との同時発売。
Zh: 故乡的风是日本的一个广播电台，由日本政府的绑架问题对策本部向朝鲜民主主义人民共和国进行短波广播。
Ja: ふるさとの風（ふるさとのかぜ）は、日本政府の拉致問題対策本部が朝鮮民主主義人民共和国（北朝鮮）向けに行っている短波放送である。
39
40
BLE Based Experimental Settings Used the bilingual lexicons extracted by our BLE system for parallel sentence extraction Based on the best performing sentence extraction method “+Content (CCF)” Compared different dictionary settings – Baseline (10k): use 10k parallel sentences used for training and testing the classifier to generate the dictionary – Baseline (10k) + lexicon: combine the Baseline (10k) dictionary with the extracted bilingual lexicons – Baseline (680k): use all the parallel sentences (680k) from the seed parallel corpus to generate the dictionary – Baseline (680k) + lexicon: combine the Baseline (680k) dictionary with the extracted bilingual lexicons 40
41
BLE Based Results

Method | # dictionary entries | # sentences | BLEU-4 | OOV%
Seed (10k) | – | – | 16.59 | 23.18
Baseline (10k) | 32,607 | 57,681 | 33.62† | 5.83
Baseline (10k) + lexicon | 87,523 | 94,931 | 35.30†‡ | 4.93
Seed (680k) | – | – | 25.42 | 9.11
Baseline (680k) | 204,254 | 126,811 | 37.82† | 3.71
Baseline (680k) + lexicon | 258,246 | 152,511 | 37.97† | 3.38

※ Seed: the systems trained on the seed parallel corpus
※ “†” and “‡” denote that the result is better than “Seed” and “Baseline” significantly at p < 0.01 41
42
Summary of Parallel Sentence Extraction We improved a parallel sentence extraction system by using common Chinese characters for filtering, three novel feature sets for classification, and BLE Future Work – Apply a similar idea to other language pairs by using cognates – Experiment on other comparable corpora 42
43
Outline 1.Background 2.Overview of Our Approach 3.Bilingual Lexicon Extraction 4.Parallel Sentence Extraction 5.Parallel Fragment Extraction 6.Conclusion and Future Work 43
44
Related Work Based on bilingual lexicon [Munteanu+ 2006] – A sub-sentential fragment extraction method – Locates the source and target fragments step by step, which makes it unreliable Based on alignment model [Quirk+ 2007] – Designs an alignment model for comparable sentences – Because comparable sentences are quite noisy, the extracted fragments are inaccurate 44
45
Parallel Fragment Extraction System [Figure: Zh-Ja Wikipedia article pairs (via the interlanguage links) are filtered into parallel sentence candidates; the classifier separates parallel sentences from comparable sentences; the comparable sentences are word-aligned, and a lexicon-based filter turns parallel fragment candidates into parallel fragments, using the seed parallel corpus, the bilingual dictionary and common Chinese characters] 45
46
Initial Score Assignment for Parallel Fragment Candidates Assign scores to aligned word pairs in the candidates according to the bilingual lexicon – Aligned word pairs that exist in the bilingual lexicon with a positive association receive a positive score – Aligned word pairs that do not exist in the bilingual lexicon, or that have a negative association, receive a negative score 46
47
Score Smoothing Only smooth a word with a negative score when both the left and right words around it have positive scores Smoothing can gain new knowledge that does not exist in the bilingual lexicon 47
48
Parallel Fragment Extraction Fragments of more than 3 tokens with contiguous positive scores are extracted (a sketch of the scoring, smoothing and extraction steps follows below) 48
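A hedged sketch of the three steps described on the last few slides (initial score assignment, smoothing, extraction); the lexicon format, the concrete score values and the alignment representation are simplifications, not the exact implementation:

```python
# Score assignment, smoothing and fragment extraction, simplified.
# `lexicon` is assumed to map (src_word, tgt_word) pairs to association
# scores (positive for known translations).
def assign_scores(aligned_pairs, lexicon, neg=-1.0):
    """aligned_pairs: list of (src_word, tgt_word) in source order."""
    scores = []
    for pair in aligned_pairs:
        assoc = lexicon.get(pair, neg)
        scores.append(assoc if assoc > 0 else neg)
    return scores

def smooth(scores, value=0.5):
    """Give a negative-scored word a small positive score only when both
    of its neighbours have positive scores."""
    smoothed = list(scores)
    for i in range(1, len(scores) - 1):
        if scores[i] <= 0 and scores[i - 1] > 0 and scores[i + 1] > 0:
            smoothed[i] = value
    return smoothed

def extract_fragments(tokens, scores, min_len=4):
    """Return token spans with contiguous positive scores, more than 3 tokens long."""
    fragments, start = [], None
    for i, s in enumerate(scores + [-1.0]):   # sentinel flushes the last run
        if s > 0 and start is None:
            start = i
        elif s <= 0 and start is not None:
            if i - start >= min_len:
                fragments.append(tokens[start:i])
            start = None
    return fragments
```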
49
Dataset The same dataset used in the parallel sentence extraction experiments – Parallel sentences: the best performing method “+Content (CCF)” with classification probability ≧ 0.9 (126k) – Comparable sentences: 0.1 ≦ classification probability < 0.9 (169k) 49
50
Parallel Fragment Extraction Results

Method | # fragments | # w/o CCC | Avg size (Zh/Ja) | Accuracy
[Munteanu+ 2006] | 153k | – | 16.76/17.70 | (6%)
IBM Model 1 | 140k | 137k | 4.20/4.66 | 72%
LLR | 131k | 129k | 4.18/4.63 | 82%
SampLEX | 100k | 95k | 3.85/4.12 | 82%

※ [Munteanu+ 2006] is for sub-sentential fragment extraction
※ IBM Model 1, LLR and SampLEX denote different bilingual lexicons
※ w/o CCC: without using common Chinese characters for the lexicon filter
※ Accuracy is manually evaluated on 100 fragments based on exact match 50
51
Examples of Extracted Fragment Pairs

ID | Zh Fragment | Ja Fragment
1 | 第73装甲掷弹兵团 | 第73装甲擲弾兵連隊
2 | 银幕投影系统 | スクリーン投影システム
3 | 为成人杂志 | は成人向け雑誌
4 | 1997年世界女子手球锦标赛为 | 1997年世界女子ハンドボール選手権は
5 | 氦开始聚变 | ヘリウムが核
6 | 日本福岛县岩濑 | 、福島県岩瀬
7 | 和学术参考书 | や参考書
8 | 上将军衔。 | 上将に就任。

※ Noise is written in red font on the original slide
Most noise is due to the noisy bilingual lexicon (Examples 5, 6)
Score smoothing also produces some noise (Example 7)
Some noise is produced by both of the two reasons (Example 8) 51
52
MT Experimental Settings Baseline training: the parallel sentences (126k) Tuning and testing: two distinct sets of 198 parallel sentences selected from Wikipedia Decoder: Moses [Koehn+ 2007] (Zh to Ja translation) Language model: 5-gram LM trained on the Ja Wikipedia (10.7M sentences) using SRILM Compared MT performance by appending the extracted fragments to the baseline training data 52
53
MT Results [Figure: BLEU-4 and OOV% for each system] ※ +Sentence: append the comparable sentences to “Baseline” ※ “†”, “‡” and “*” denote that the result is significantly better than “+[Munteanu+ 2006]”, “+Sentence” and “Baseline” respectively at p < 0.05 53
54
Summary of Parallel Fragment Extraction We proposed an accurate parallel fragment extraction system using an alignment model and a bilingual lexicon Future Work – Develop a method to deal with word ordering – Experiment on other language pairs and domains 54
55
Integrated Extraction and MT Results [Figure: BLEU-4 and OOV% for each system; training data sizes shown on the slide: 122k, 126k, 126k+131k, 152k and 680k sentences] ※ Seed: trained on the seed parallel corpus ※ Sentence: trained on the parallel sentences extracted by [Munteanu+ 2005] ※ +CC: trained on the parallel sentences extracted additionally using Chinese characters ※ +BLE: trained on the parallel sentences extracted additionally using the extracted lexicons ※ +Fragment: appended the parallel fragments to “+CC” ※ “†” and “‡” denote that the result is significantly better than “+Sentence” and “+CC” respectively at p < 0.05 55
56
Discussion of the Noise Problem The main problem of exploiting comparable corpora is that they are noisy – We proposed several approaches to addressing this problem and verified their effectiveness Noisy data is a common problem in many research fields – This study can benefit research in other fields 56
57
Conclusion We proposed novel approaches for bilingual lexicon, parallel sentence and parallel fragment extraction from comparable corpora for SMT in an integrated framework Our proposed framework and approaches significantly outperform previous studies, and are very effective in addressing the scarceness of parallel corpora that SMT suffers from 57
58
Future Work Large-scale parallel data extraction for various language pairs and domains – Bilingual lexicon extraction from monolingual corpora – Unsupervised parallel data extraction – Paraphrases based extraction 58
59
Publications (1/3) Invited book chapter 1.Chenhui Chu, Toshiaki Nakazawa and Sadao Kurohashi: Chinese-Japanese Parallel Sentence Extraction from Quasi-Comparable and Comparable Corpora, Book Chapter of Using Comparable Corpora for Under-Resourced Areas of Machine Translation, Springer, 17 pages, 2015. (to appear) Journal 1.Chenhui Chu, Toshiaki Nakazawa, Sadao Kurohashi: Integrated Parallel Sentence and Fragment Extraction from Comparable Corpora: A Case Study on Chinese– Japanese Wikipedia, TALIP, 22 pages, 2015. (conditionally accepted) [C. 4 & 5] 2.Chenhui Chu, Toshiaki Nakazawa, Sadao Kurohashi: Parallel Sentence Extraction Based on Unsupervised Bilingual Lexicon Extraction from Comparable Corpora, JNLP, 28 pages, 2015. (conditionally accepted) [C. 3 & 4] 3.Chenhui Chu, Toshiaki Nakazawa, Daisuke Kawahara, Sadao Kurohashi: Chinese- Japanese Machine Translation Exploiting Chinese Characters, TALIP, Vol.12, No.4, pp.16:1-16:25, 2013. [C. 2] 59
60
Publications (2/3) International Conference 1.Chenhui Chu, Toshiaki Nakazawa and Sadao Kurohashi: Improving Statistical Machine Translation Accuracy Using Bilingual Lexicon Extraction with Paraphrases, In Proceedings of PACLIC2014, 10 pages, 2014. [C. 6] 2.Chenhui Chu, Toshiaki Nakazawa, Sadao Kurohashi: Constructing a Chinese- Japanese Parallel Corpus from Wikipedia, In Proceedings of LREC2014, pp. 642– 647, 2014. [C. 4] 3.Chenhui Chu, Toshiaki Nakazawa, Sadao Kurohashi: Iterative Bilingual Lexicon Extraction from Comparable Corpora with Topical and Contextual Knowledge, In Proceedings of CICLing2014, LNCS 8404(II):296–309, 2014. (Best Student Paper Award) [C. 3] 4.Chenhui Chu, Toshiaki Nakazawa and Sadao Kurohashi: Accurate Parallel Fragment Extraction from Quasi-Comparable Corpora using Alignment Model and Translation Lexicon, In Proceedings of IJCNLP2013, pp.1144-1150, 2013. [C. 5] 5.Chenhui Chu, Toshiaki Nakazawa and Sadao Kurohashi: Chinese-Japanese Parallel Sentence Extraction from Quasi-Comparable Corpora, In Proceedings of BUCC2013, pp.34-42, 2013. [C. 4] 60
61
Publications (3/3) International Conference 6.Chenhui Chu, Toshiaki Nakazawa and Sadao Kurohashi: EBMT System of Kyoto University in OLYMPICS Task at IWSLT 2012, In Proceedings of IWSLT2012, pp.96- 101, 2012. 7.Chenhui Chu, Toshiaki Nakazawa, Daisuke Kawahara and Sadao Kurohashi: Exploiting Shared Chinese Characters in Chinese Word Segmentation Optimization for Chinese-Japanese Machine Translation, In Proceedings of EAMT2012, pp.35-42, 2012. [C. 2] 8.Chenhui Chu, Toshiaki Nakazawa and Sadao Kurohashi: Chinese Characters Mapping Table of Japanese, Traditional Chinese and Simplified Chinese, In Proceedings of LREC2012, pp.2149-2152, 2012. [C. 2] 9.Chenhui Chu, Toshiaki Nakazawa and Sadao Kurohashi: Japanese-Chinese Phrase Alignment Using Common Chinese Characters Information, In Proceedings of MT Summit XIII, pp.475-482, 2011. 61
62
Outline 1.Background 2.Overview of Our Approach 3.Bilingual Lexicon Extraction 4.Parallel Sentence Extraction 5.Parallel Fragment Extraction 6.Improving SMT Accuracy Using BLE 7.Conclusion and Future Work 62
63
Similarity Measure [Figure: definitions of the TI, Cue and TI+Cue similarity measures, rendered as images on the original slide; a hedged reconstruction follows below] 63
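The formulas themselves are not recoverable from the slide text; the following is a reconstruction of the standard Cue measure of [Vulic+ 2011] as generally defined (TI is a vector similarity, e.g. cosine, over the K-dimensional word-topic representations, and TI+Cue combines the two scores), so details may differ from the exact slide:

```latex
% Cue similarity: probability of target word w_t given source word w_s,
% marginalized over the K shared topics of the bilingual topic model.
\mathrm{Cue}(w_t, w_s) \;=\; P(w_t \mid w_s)
  \;=\; \sum_{k=1}^{K} P(w_t \mid z_k)\, P(z_k \mid w_s)
```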
64
Context Modeling and Similarity Window-based context (±2 words), compared with cosine similarity [Figure: example context vector for “market”, with dimensions such as mainstream, drink, factory, law, system, sellers, exchange, goods, services, information] 64
65
Experimental Settings BiLDA topic model training: PolyLDA++ [Richardson+ 2013] – α = 50/K, β = 0.01, Gibbs sampling with 1k iterations TI+Cue measure: BLETM [Vulic+ 2011] Proposed method – Linear interpolation parameter γ = 0.8, 20 iterations 65
66
Evaluation Criterion Manually created Zh-En and Ja-En test sets for the 1k most frequent source words Metrics – Precision@1 – Mean Reciprocal Rank (MRR) [Voorhees+, 1999] (defined below) 66
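For reference, the standard definition of MRR over the set of test queries Q (here, source words), where rank_i is the rank of the first correct translation for query i:

```latex
\mathrm{MRR} \;=\; \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i}
```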
67
Results (Zh-En Precision@1) [Figure: Precision@1 plotted against iteration] 67
68
Results (Ja-En Precision@1) [Figure: Precision@1 plotted against iteration] 68
69
Results (Zh-En MRR) [Figure: MRR plotted against iteration] 69
70
Results (Ja-En MRR) [Figure: MRR plotted against iteration] 70
71
Results (Ja-Zh MRR) [Figure: MRR plotted against iteration] 71
72
Improved Examples (2/2)

Candidate | Sim_Topic | Sim_Context | Sim_Comb
facility | 0.0561 | 0.1127 | 0.0674
center | 0.0525 | 0.1135 | 0.0647
building | 0.0568 | 0.0933 | 0.0641
landmark | 0.0571 | 0.0578 | 0.0572
plan | 0.0460 | 0.1007 | 0.0570

※ An improved example for the word 施設 (facility), where both topical and contextual similarity scores are not distinguishable 72
73
Outline 1.Background 2.Overview of Our Approach 3.Bilingual Lexicon Extraction 4.Parallel Sentence Extraction 5.Parallel Fragment Extraction 6.Improving SMT Accuracy Using BLE 7.Conclusion and Future Work 73
74
Related Work [Munteanu+ 2005] [Figure: comparable corpora article pairs are (1) matched via cross-lingual IR, (2) filtered with a bilingual dictionary into parallel sentence candidates, and (3) classified into parallel sentences by a classifier trained on a seed parallel corpus] 74
75
Features Used in [Munteanu+ 2005] General Features – Sentence length, length difference and length ratio – Word overlap (according to the dictionary) Word Alignment Features – Percentage and number of words that have no connection – The top three largest fertilities – Length of the longest contiguous connected span – Length of the longest unconnected substring 75
76
Parallel Sentence Extraction System [Figure: the same pipeline as [Munteanu+ 2005] — (1) cross-lingual IR over comparable corpora article pairs, (2) filtering into parallel sentence candidates, (3) classification into parallel sentences — with common Chinese character filtering added to the filter and three novel feature sets added to the classifier] 76
77
Parallel Sentence Classifier [Figure: positive instances come from the seed parallel corpus; the Cartesian product of its sentences gives non-parallel sentence pairs, which are filtered (bilingual dictionary and common Chinese characters) into negative instances] 77
78
Non-CC Word Features Chinese-Japanese parallel sentences often contain alignable Non-CC words Zh: 日本的一级行政区划单位为都道府县，全国划分为1都、1道、2府、43县。 Ja: 都道府県（1都1道2府43県）という広域行政区画から構成される。 Non-CC words can be helpful clues to identify parallel sentences! 78
79
Content Word Features Non-parallel sentences can contain alignable function words, which may lead to high word overlap (function words are in blue on the original slide) Zh: YY / 的 / 尸体 / , / 和 / 活着 / 的 / 黑 / 猩猩 / 相比 / , / 皮肤 / 的 / 颜色 / 看起来 / 稍微 / 明朗 / 一些 / 。 Ja: つぎに / , / 配線 / に / 使用 / する / パターン / 幅 / や / クリアランス / の / 設定 / の / 方法 / を / 説明 / した / 。 It is necessary to explicitly add content word features! 79
80
Bootstrapping Experiments

Method | # dictionary entries | # sentences | BLEU-4 | OOV
Seed | – | – | 25.42 | 9.11
+Content (CCF) | 204,254 | 126,811 | 37.82 | 3.71
Iteration 1 | 274,496 | 164,403 | 37.99 | 3.40
Iteration 2 | 292,186 | 167,310 | 38.71 | 3.38
80
81
BLE Based Experiments (1)

Method | # dictionary entries | # sentences | BLEU-4 | OOV
Seed | – | – | 25.42 | 9.11
Baseline (680k) | 204,254 | 126,811 | 37.82 | 3.71
Baseline (680k) + lexicon (Topic) | 259,035 | 151,681 | 37.19 | 3.54
Baseline (680k) + lexicon (Comb 1) | 258,437 | 149,556 | 37.45 | 3.38
Baseline (680k) + lexicon (Comb 2) | 258,246 | 152,511 | 37.97 | 3.38
81
82
BLE Based Experiments (3)

Method | # dictionary entries | # sentences | BLEU-4 | OOV
Seed (5k) | – | – | 15.53 | 26.89
Baseline (5k) | 23,446 | 22,849 | 28.10† | 9.99
Baseline (5k) + lexicon (Topic) | 78,961 | 46,332 | 29.80†‡ | 7.18
Baseline (5k) + lexicon (Comb) | 78,561 | 47,191 | 32.22†‡* | 6.78
Seed (10k) | – | – | 16.59 | 23.18
Baseline (10k) | 32,607 | 30,115 | 28.67† | 9.37
Baseline (10k) + lexicon (Comb) | 87,523 | 50,440 | 31.56†‡ | 7.44
Seed (680k) | – | – | 25.42 | 9.11
Baseline (680k) | 204,254 | 74,852 | 34.44† | 5.19
Baseline (680k) + lexicon (Comb) | 258,124 | 95,644 | 34.94† | 4.53

# Without Chinese character features
※ “†”, “‡” and “*” denote that the result is better than “Seed”, “Baseline” and “Baseline + lexicon (Topic)” significantly at p < 0.05 82
83
Outline 1.Background 2.Overview of Our Approach 3.Bilingual Lexicon Extraction 4.Parallel Sentence Extraction 5.Parallel Fragment Extraction 6.Improving SMT Accuracy Using BLE 7.Conclusion and Future Work 83
84
Parallel Fragments In comparable corpora, there could be parallel fragments in comparable sentences Parallel fragments are also helpful for SMT We aim to accurately extract parallel fragments from comparable sentences 应用 / 铅 / 离子 / 选择 / 电极 / 电位 / 滴定 / 法 / 测定 / 甘草 / 及 / 其 / 制品 / 中 / 的 / 甘草 / 酸 (Applying lead ion selective electrode potentiometric titration method to determine licorice and its products ‘s glycyrrhizic acid) < / 原 / 報 / > / 鉛 / イオン / 選択 / 性 / 電極を / 用いる / 混合 / 試料 / 中 / の /…/ と / 電位 / 差 / 滴定 / 法 / の / 比較 ( lead ion selective electrode used mixed sample ‘s … and potentiometric titration method ‘s comparison) Zh : Ja: 84
85
Related Work [Munteanu+ 2006] 1. Extract a translation lexicon from a parallel corpus 2. Apply a lexicon-based filter to comparable sentences in the two directions independently – Assign initial scores according to the lexicon – Smooth scores to gain new knowledge that does not exist in the lexicon 3. Extract sub-sentential (not exactly parallel) fragments 85
86
Lexicon Filter on the Ja-to-Zh Direction [Figure: the example Zh-Ja sentence pair from the “Parallel Fragments” slide, with the words matched by the lexicon filter in the Ja-to-Zh direction highlighted] 86
87
Lexicon Filter on the Zh-to-Ja Direction [Figure: the same example sentence pair, with the words matched by the lexicon filter in the Zh-to-Ja direction highlighted] 87
88
Parallel Fragment Extraction System [Figure: (1) the source corpora are translated by SMT trained on the seed parallel corpus, (2) cross-lingual IR retrieves the top N results from the target corpora, (3) a classifier selects comparable sentences, (4) an alignment model locates parallel fragment candidates, and (5) a lexicon-based filter extracts parallel fragments. Key points: an accurate filter based on bilingual lexicons, and an alignment model that locates the source and target fragment candidates simultaneously] 88
89
Parallel Fragment Extraction System [Figure: (1) cross-lingual IR over comparable corpora article pairs, (2) filtering with the bilingual dictionary, (3) classification into comparable sentences using the seed parallel corpus, and (4) alignment plus a lexicon-based filter turning parallel fragment candidates into parallel fragments. Key points: an accurate filter based on bilingual lexicons, and an alignment model to locate parallel fragment candidates] 89
90
Parallel Fragment Candidate Detection by Alignment Monotonic, non-NULL and longest aligned fragments of more than 3 tokens are taken as candidates (a sketch follows below) 90
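A hedged sketch of the candidate detection step; “monotonic” is interpreted here as strictly increasing target positions, which is one plausible reading of the slide rather than the exact criterion used:

```python
def candidate_spans(src_to_tgt, min_len=4):
    """src_to_tgt[i] = target index aligned to source word i, or None if
    NULL-aligned. Returns (start, end) source spans of monotonic, non-NULL
    runs longer than 3 tokens."""
    spans = []
    start = None
    for i in range(len(src_to_tgt) + 1):                  # +1: sentinel step to flush
        cur = src_to_tgt[i] if i < len(src_to_tgt) else None
        prev = src_to_tgt[i - 1] if i > 0 else None
        extends = (cur is not None and start is not None
                   and prev is not None and cur > prev)   # monotonically increasing
        if extends:
            continue
        # close the current run, if any
        if start is not None and i - start >= min_len:
            spans.append((start, i))
        # possibly start a new run at position i
        start = i if cur is not None else None
    return spans

# Example: positions 0-3 and 5-8 form monotonic non-NULL runs of length 4.
print(candidate_spans([0, 1, 2, 3, None, 5, 6, 7, 8]))    # -> [(0, 4), (5, 9)]
```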
91
Extraction Experimental settings Alignment: GIZA++ with symmetrization heuristics – Only: only use the extracted comparable sentences – External: together with 11k chemistry domain data in the parallel corpus Translation lexicon – IBM Model 1 [Brown+ 1993] – Log-Likelihood-Ratio (LLR) [Munteanu+ 2006] – Sub-corpora sampling lexicon (SampLEX) [Vulic+ 2012] Compare with [Munteanu+ 2006] 91
92
Dataset Parallel corpus: Zh-Ja abstract corpus (680k sentences, scientific domain) Quasi-Comparable Corpora – Chinese corpora: CNKI (90k articles, 420k sentences, chemistry domain) – Japanese corpora: CiNii (880k articles, 5M sentences, scientific domain) Comparable sentences: 30k chemistry domain sentences were extracted 92
93
Parallel Fragment Extraction Results on Quasi-comparable Corpora

Method | # fragments | Avg size (Zh/Ja) | Accuracy
[Munteanu+ 2006] | 28.4k | 20.36/21.39 | (1%)
Only (IBM Model 1) | 18.9k | 4.03/4.14 | 80%
Only (LLR) | 18.3k | 4.00/4.14 | 89%
Only (SampLEX) | 18.4k | 3.96/4.05 | 87%
External (IBM Model 1) | 28.7k | 4.18/4.33 | 81%
External (LLR) | 26.9k | 4.17/4.33 | 85%
External (SampLEX) | 28.0k | 4.11/4.23 | 82%

※ [Munteanu+ 2006] is for sub-sentential fragment extraction
※ Only and External: whether external parallel data is used for alignment
※ IBM Model 1, LLR and SampLEX denote different bilingual lexicons
※ Accuracy: manually evaluated on 100 fragments based on exact match 93
94
Examples of Extracted Fragment Pairs on Quasi-comparable Corpora

ID | Zh Fragment | Ja Fragment
1 | 直接甲醇燃料电池 | 直接メタノール燃料電池
2 | X射线光电子能谱(XPS) | X線光電子分光法(XPS)
3 | (OH)24(H2O)12] |
4 | 的原生质体融合 | のプロトプラスト融合
5 | 分子动力学(MD)模拟了 | 分子動力学(MD)シミュレーションを
6 | 扫描电子显微镜(SEM)、透射电子显微镜(TEM) | 型電子顕微鏡(SEM),透過型電子顕微鏡(TEM)
7 | X射线粉末衍射 | X線回折分析
8 | 证明了本算法的 | から本アルゴリズムの

※ Noise is written in red font on the original slide
Most noise is due to the noisy bilingual lexicon (Examples 5, 6)
Score smoothing also produces some noise (Example 7)
Some noise is produced by both of the two reasons (Example 8) 94
95
MT Experimental Settings on Quasi-comparable Corpora Baseline training: Zh-Ja paper abstract corpus (680k with 11k chemistry domain sentences) Tuning and Testing: 368 and 367 sentences of chemistry domain Decoder: Moses (Zh to Ja translation) Language model: 5–gram language model on the Ja side of the parallel corpus using SRILM Compare MT performance by appending the extracted fragments to the baseline training data 95
96
MT Results on Quasi-comparable Corpora [Figure: BLEU-4 for each system] ※ +Sentence: append the comparable sentences to “Baseline” ※ Settings other than “Baseline” and “+Sentence” append fragments to “Baseline” ※ “†” denotes that the result is better than “Baseline” significantly at p < 0.05 96
97
Parallel Fragment Extraction Results Using “Nakazawa+” for Candidate Detection on Quasi-comparable Corpora

Method | # fragments | Avg size (Zh/Ja)
[Munteanu+ 2006] | 28.4k | 20.36/21.39
Only (IBM Model 1) | 13.8k | 3.85/4.13
Only (LLR) | 13.3k | 3.87/4.12
Only (SampLEX) | 13.5k | 3.81/4.06
External (IBM Model 1) | 16.8k | 3.87/4.13
External (LLR) | 16.0k | 3.88/4.13
External (SampLEX) | 16.4k | 3.84/4.09

※ [Munteanu+ 2006] is for sub-sentential fragment extraction
※ Only and External: whether external parallel data is used for alignment
※ IBM Model 1, LLR and SampLEX denote different bilingual lexicons 97
98
MT Results for Different Alignment Models on Quasi-comparable Corpora

System | GIZA++ | Nakazawa+
Baseline | 38.64 |
+Sentence | 39.16 |
+[Munteanu+ 2006] | 38.87 |
+Only (IBM Model 1) | 38.86 | 38.96
+Only (LLR) | 39.27 | 39.17
+Only (SampLEX) | 39.28 |
+External (IBM Model 1) | 39.63 | 39.88*
+External (LLR) | 39.22 | 39.35
+External (SampLEX) | 39.40 | 39.42

※ [Munteanu+ 2006] is for sub-sentential fragment extraction
※ Only and External: whether external parallel data is used for alignment
※ IBM Model 1, LLR and SampLEX denote different bilingual lexicons
※ “*” denotes that the result is better than “+Sentence” significantly at p < 0.05 98
99
Bootstrapping Experiments on Quasi-comparable Corpora

Method | # fragments | Avg size (Zh/Ja) | BLEU-4
Baseline | – | – | 38.64
Only (LLR) | 18.3k | 4.00/4.14 | 39.27
Only (LLR) Ite_1 | 18.5k | 4.03/4.16 | 39.68
Only (LLR) Ite_2 | 18.5k | 4.03/4.16 | 39.13
External (IBM Model 1) | 28.7k | 4.18/4.33 | 39.63
External (IBM Model 1) Ite_1 | 29.3k | 4.21/4.38 | 39.58
External (IBM Model 1) Ite_2 | 29.5k | 4.22/4.38 | 39.39

※ Only and External: whether external parallel data is used for alignment
※ IBM Model 1 and LLR denote different bilingual lexicons 99
100
Outline 1.Background 2.Overview of Our Approach 3.Bilingual Lexicon Extraction 4.Parallel Sentence Extraction 5.Parallel Fragment Extraction 6.Improving SMT Accuracy Using BLE 7.Conclusion and Future Work 100
101
Phrase-based SMT [Koehn+ 2003] [Figure: a parallel corpus is word-aligned to build a phrase table of entries of the form f ||| e ||| φ(f|e) lex(f|e) …; the decoder uses the phrase table to translate source sentences] 101
102
Accuracy Problem of SMT Caused by the sparseness of parallel corpora – Word alignment errors – Translation probability overestimation for rare word and phrase pairs

f | e | φ(f|e) | lex(f|e) | φ(e|f) | lex(e|f) | Alignment
失业 人数 | unemployment figures | 0.3 | 0.0037 | 0.0769 | 0.0018 | 0-0 1-1
失业 人数 | number of unemployed | 0.1333 | 0.0188 | 0.1025 | 0.0041 | 1-0 1-1 0-2
失业 人数 . | unemployment was | 0.3333 | 0.0015 | 0.0256 | 6.8e-06 | 0-1 1-1 1-2
失业 人数 | unemployment and bringing | 1 | 0.0029 | 0.0256 | 5.4e-07 | 0-0 1-0

※ φ(f|e) and φ(e|f): inverse and direct phrase translation probabilities
※ lex(f|e) and lex(e|f): inverse and direct lexical weightings 102
103
Improving Accuracy Using BLE [Klementiev+ 2012] [Figure: the phrase table built from the parallel corpus is augmented with comparable features F1(f, e), F2(f, e), … estimated from comparable corpora via BLE] 103
104
BLE-based Comparable Feature Estimation (1/2) Contextual similarity: similarity between context vectors (projected via a seed dictionary) [Figure: context vectors of 失业人数 and “unemployment figures”; Sim = 1.4e-06] The similarity is unreliable because of the sparseness of the vectors! 104
105
BLE-based Comparable Feature Estimation (2/2) Topical similarity: similarity between topical occurrence vectors estimated from Wikipedia Temporal similarity: similarity between temporal occurrence vectors estimated from news articles with associated temporal information [Figure: vectors of 失业人数 and “unemployment figures”; topical Sim = 1e-07, temporal Sim = 0.1942] 105
106
Proposed Method [Figure: same pipeline as [Klementiev+ 2012], but the BLE-based comparable feature estimation additionally uses paraphrase tables (f1 ||| f2 ||| p(f1|f2) … and e1 ||| e2 ||| p(e1|e2) …) for vector smoothing before estimating the comparable features F1(f, e), F2(f, e), …] 106
107
Paraphrases A paraphrase is a restatement of the meaning of a text or passage using other words (Wikipedia) Paraphrases are extracted from parallel corpora through bilingual pivoting [Bannard+ 2005] (see the formula below) Example: 整体 就业 及 失业 人数 ↔ the overall employment and unemployment figures; 失业 人数 ( 不经 季节 性 调整 ) ↔ number of unemployed ( not seasonally adjusted ), giving the paraphrase pair “unemployment figures” / “number of unemployed” 107
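The bilingual pivoting of [Bannard+ 2005] derives paraphrase probabilities by marginalizing over shared translations; the standard formulation is:

```latex
% Paraphrase probability of phrase e2 given phrase e1, pivoting over all
% foreign phrases f that both phrases are aligned to in a parallel corpus.
p(e_2 \mid e_1) \;=\; \sum_{f} p(e_2 \mid f)\, p(f \mid e_1)
```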
108
Context Vector Smoothing with Paraphrases [Figure: the context vector of a phrase x (e.g. “unemployment figures”), weighted by its frequency, is combined with the context vectors of its paraphrases x_i ∈ n, weighted by their paraphrase probabilities, to form a smoothed (combined) vector; x and the x_i may overlap] Note that smoothing is done for both source and target vectors. A hedged reconstruction of this combination follows below. 108
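The exact smoothing formula is rendered as an image on the original slide and is not recoverable here; the following is only one plausible form consistent with the components listed above (the phrase's own vector weighted by its frequency, plus its paraphrases' vectors weighted by their paraphrase probabilities), and may differ from the formula actually used:

```latex
% Hypothetical reconstruction: smoothed context vector C'(x) of phrase x,
% where n is the set of paraphrases of x, f(x) is the frequency of x and
% p(x_i | x) is the paraphrase probability of paraphrase x_i.
C'(x) \;=\; \frac{ f(x)\, C(x) + \sum_{x_i \in n} p(x_i \mid x)\, C(x_i) }
            { f(x) + \sum_{x_i \in n} p(x_i \mid x) }
```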
109
Topical and Temporal Occurrence Vector Smoothing with Paraphrases [Figure: analogously, the topical/temporal occurrence vector of a phrase x (e.g. “unemployment figures”) is combined with the vectors of its paraphrases x_i ∈ n, weighted by their paraphrase probabilities; x and the x_i may overlap] 109
110
Dataset Parallel corpus: Zh-En NIST (991k sentences) (contextual similarity) Comparable corpora: – Zh & En Gigaword 5.0 (temporal similarity) – Zh-En Wikipedia (248k article pairs) (topical similarity) Seed dictionary: Zh-En NIST translation lexicon (82k entries) 110
111
MT Experimental Settings Training: Zh-En NIST parallel corpus (991k sentences) Tuning and testing: NIST MT02 (878 sentences with 4 references) and NIST MT03 (919 sentences with 4 references) Decoder: MOSES [Koehn+ 2007] (Zh to En translation) Language model: 5-gram LM trained with SRILM on the En side of the parallel corpus Compared MT performance by appending the comparable features to a baseline system 111
112
MT Results [Figure: BLEU-4 for each system] ※ Baseline: does not use comparable features ※ +Contextual, +Topical, +Temporal and +All: append contextual, topical and temporal features respectively, and all three types of features ※ “†” and “‡” denote that the result is significantly better than Baseline and Klementiev+ respectively at p < 0.05 112
113
Examples of Comparable Feature Scores

Estimated by [Klementiev+ 2012]:
f | e | Contextual | Topical | Temporal
失业人数 | unemployment figures | 1.4e-06 | 1e-07 | 0.1942
失业人数 | number of unemployed | 0.0144 | 1e-07 | 0.0236
失业人数 . | unemployment was | 0.0107 | 1e-07 | 0.0709
失业人数 | unemployment and bringing | 1e-07 | 1e-07 | 1e-07

Estimated by our proposed method:
f | e | Contextual | Topical | Temporal
失业人数 | unemployment figures | 0.0749 | 0.5434 | 0.4307
失业人数 | number of unemployed | 0.0522 | 0.1907 | 0.5983
失业人数 . | unemployment was | 0.0050 | 0.0117 | 0.0967
失业人数 | unemployment and bringing | 5.1e-05 | 1e-07 | 0.0073
113
114
Summary of Improving SMT Accuracy Using BLE We improved SMT accuracy by using BLE with paraphrases to estimate comparable features from comparable corpora Future Work – Generate paraphrases from external parallel corpora and monolingual corpora – Apply our method to other SMT models 114