Accurate Parallel Fragment Extraction from Quasi-Comparable Corpora using Alignment Model and Translation Lexicon
Chenhui Chu, Toshiaki Nakazawa, Sadao Kurohashi
Graduate School of Informatics, Kyoto University
IJCNLP 2013 (2013/10/17)
Outline
– Background
– Related Work
– Proposed Method
– Experiments
– Conclusion
Bilingual Corpora [Fung+ 2004]
Type | Definition | Example
Parallel | Sentence-aligned bilingual corpora | Europarl
Noisy Parallel | Bilingual translations of documents | Patent family
Comparable | Topic-aligned bilingual documents | Wikipedia
Quasi-Comparable | Very-non-parallel bilingual documents | this study
Parallel corpora are scarce
Parallel sentences can be extracted from noisy parallel and comparable corpora
Quasi-comparable corpora are more widely available, but they contain few parallel sentences
Parallel Fragments
In quasi-comparable corpora, comparable sentences may contain parallel fragments
Parallel fragments are also helpful for SMT
We aim to accurately extract parallel fragments from comparable sentences
Zh: 应用 / 铅 / 离子 / 选择 / 电极 / 电位 / 滴定 / 法 / 测定 / 甘草 / 及 / 其 / 制品 / 中 / 的 / 甘草 / 酸
(Applying lead ion selective electrode potentiometric titration method to determine licorice and its products 's glycyrrhizic acid)
Ja: < / 原 / 報 / > / 鉛 / イオン / 選択 / 性 / 電極を / 用いる / 混合 / 試料 / 中 / の /…/ と / 電位 / 差 / 滴定 / 法 / の / 比較
(lead ion selective electrode used mixed sample 's … and potentiometric titration method 's comparison)
Parallel Sub-sentential Fragment Extraction [Munteanu+ 2006]
1. Extract a translation lexicon from a parallel corpus
2. Apply a lexicon filter to comparable sentences in the two directions independently
– Assign initial scores according to the lexicon
– Smooth the scores to gain new knowledge that does not exist in the lexicon
3. Extract sub-sentential (not exactly parallel) fragments
Lexicon Filter on Ja-to-Zh Direction
[Figure: the example Zh and Ja sentences from the Parallel Fragments slide, shown token by token for the Ja-to-Zh lexicon filter]
Lexicon Filter on Zh-to-Ja Direction
[Figure: the example Zh and Ja sentences from the Parallel Fragments slide, shown token by token for the Zh-to-Ja lexicon filter]
System Overview
[Figure: pipeline diagram with steps (1) to (5). Components: source corpora, target corpora, parallel corpus, SMT, IR (top N results), classifier, alignment, lexicon filter. Intermediate data: translated sentences, comparable sentences, parallel fragment candidates, parallel fragments]
Use an alignment model to locate the source and target fragment candidates simultaneously
Use a more accurate lexicon filter
Parallel Fragment Candidate Detection by Alignment
Candidates are the longest aligned fragments that are monotonic, contain no NULL alignments, and have more than 3 tokens
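The candidate detection rule can be illustrated with a minimal sketch. It assumes the word alignment is given as `(source_index, target_index)` links and simplifies "monotonic, non-NULL" to mean that each source token in a run has exactly one link and that the linked target positions are consecutive. The function name `candidate_spans` and the parameter `min_tokens` are hypothetical, and the slide does not say how many-to-many links are handled.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int, int, int]  # (src_start, src_end, tgt_start, tgt_end), inclusive


def candidate_spans(links: List[Tuple[int, int]], src_len: int,
                    min_tokens: int = 3) -> List[Span]:
    """Return maximal monotonic, non-NULL aligned spans longer than min_tokens.

    Simplifying assumption (not from the slides): a run is extended only
    while each source token has exactly one link and its target position
    directly follows the previous one.
    """
    # Map each source position to the target positions it is linked to.
    tgt_of: Dict[int, List[int]] = {}
    for s, t in links:
        tgt_of.setdefault(s, []).append(t)

    spans: List[Span] = []
    start = None  # source index where the current run began
    for i in range(src_len + 1):  # one extra step to flush the last run
        one_to_one = i < src_len and len(tgt_of.get(i, [])) == 1
        extends = (one_to_one and start is not None
                   and tgt_of[i][0] == tgt_of[i - 1][0] + 1)
        if extends:
            continue
        # The run (if any) ended at i - 1: keep it if it is long enough.
        if start is not None and i - start > min_tokens:
            spans.append((start, i - 1, tgt_of[start][0], tgt_of[i - 1][0]))
        start = i if one_to_one else None
    return spans


# Source token 4 is NULL-aligned and the link (6, 6) skips a target position.
print(candidate_spans([(0, 0), (1, 1), (2, 2), (3, 3), (5, 4), (6, 6)], 7))
```

Only the run covering source tokens 0 to 3 survives here, because the other runs are a single token long.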
Lexicon Filter − Assign Initial Scores
Assign scores in both directions to the aligned word pairs in the candidates, according to the translation lexicon
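A minimal sketch of bidirectional scoring follows. The slide does not give the scoring formula, so this assumes that a word pair found in the lexicon receives its lexicon value and any other pair receives a fixed negative score. The dictionaries `lex_zh_ja` and `lex_ja_zh`, the function `initial_scores`, and the value `-1.0` are all hypothetical.

```python
from typing import Dict, List, Tuple

Lexicon = Dict[Tuple[str, str], float]


def initial_scores(pairs: List[Tuple[str, str]],
                   lex_zh_ja: Lexicon, lex_ja_zh: Lexicon,
                   miss: float = -1.0) -> Tuple[List[float], List[float]]:
    """Score each aligned (zh, ja) word pair in both directions.

    Assumption: pairs missing from a lexicon get the negative score `miss`.
    """
    zh_to_ja = [lex_zh_ja.get((zh, ja), miss) for zh, ja in pairs]
    ja_to_zh = [lex_ja_zh.get((ja, zh), miss) for zh, ja in pairs]
    return zh_to_ja, ja_to_zh


# Toy lexicon entries with made-up values, for illustration only.
pairs = [("铅", "鉛"), ("离子", "イオン"), ("测定", "比較")]
lex_zh_ja = {("铅", "鉛"): 0.9, ("离子", "イオン"): 0.8}
lex_ja_zh = {("鉛", "铅"): 0.7}
print(initial_scores(pairs, lex_zh_ja, lex_ja_zh))
```

Keeping the two directions as separate score lists lets the later steps require agreement between them.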
Lexicon Filter − Score Smoothing
Only smooth a word with a negative score when both the left and right words around it have positive scores
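The smoothing condition can be sketched as below. The condition itself (negative score, positive neighbours on both sides) comes from the slide. The replacement value is an assumption: this sketch uses the mean of the two neighbours, which the slide does not specify. The function name `smooth` is hypothetical.

```python
from typing import List


def smooth(scores: List[float]) -> List[float]:
    """Smooth a negative score only if both neighbours are positive.

    Assumption: the smoothed value is the mean of the two neighbours.
    Neighbours are read from the original scores, so one smoothed word
    cannot trigger smoothing of the next.
    """
    out = list(scores)
    for i in range(1, len(scores) - 1):
        if scores[i] < 0 and scores[i - 1] > 0 and scores[i + 1] > 0:
            out[i] = (scores[i - 1] + scores[i + 1]) / 2
    return out


# Index 1 is smoothed; indices 3 and 4 are not, since each has a negative neighbour.
print(smooth([0.5, -1.0, 0.7, -1.0, -1.0, 0.4]))
```

Restricting smoothing to isolated negative words keeps longer unknown stretches out of the extracted fragments.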
Fragment Extraction
Extract fragments of more than 3 tokens whose scores are continuously positive in both directions
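A minimal sketch of this extraction rule, assuming each candidate carries two score lists of equal length (one per direction), indexed by aligned word pair. The names `extract_fragments` and `min_tokens` are hypothetical.

```python
from typing import List, Tuple


def extract_fragments(zh_to_ja: List[float], ja_to_zh: List[float],
                      min_tokens: int = 3) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) runs that are positive in both directions
    and contain more than min_tokens positions."""
    if len(zh_to_ja) != len(ja_to_zh):
        raise ValueError("score lists must have the same length")
    fragments: List[Tuple[int, int]] = []
    start = None
    n = len(zh_to_ja)
    for i in range(n + 1):  # one extra step to flush the last run
        positive = i < n and zh_to_ja[i] > 0 and ja_to_zh[i] > 0
        if positive and start is None:
            start = i
        elif not positive and start is not None:
            if i - start > min_tokens:
                fragments.append((start, i - 1))
            start = None
    return fragments


# Positions 0-3 are positive in both directions; position 4 fails in one
# direction and position 6 in the other.
zh_to_ja = [0.5, 0.6, 0.7, 0.8, -1.0, 0.3, 0.3]
ja_to_zh = [0.4, 0.4, 0.4, 0.4, 0.4, 0.2, -1.0]
print(extract_fragments(zh_to_ja, ja_to_zh))
```

Requiring positive scores in both directions is stricter than filtering each direction independently.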
Outline
– Background
– Related Work
– Proposed Method
– Experiments: Parallel Fragment Extraction, Translation
– Conclusion
Experimental Settings (Parallel Fragment Extraction 1/2)
Parallel corpus: Zh-Ja abstract corpus (680k sentences, scientific domain)
Quasi-comparable corpora
– Chinese corpora: CNKI (90k articles, 420k sentences, chemistry domain)
– Japanese corpora: CiNii (880k articles, 5M sentences, scientific domain)
Comparable sentences: 30k chemistry domain sentences were extracted
Experimental Settings (Parallel Fragment Extraction 2/2)
Alignment: GIZA++ with symmetrization heuristics
– Only: align only the extracted comparable sentences
– External: align them together with the 11k chemistry domain data in the parallel corpus
Translation lexicon
– IBM Model 1 [Brown+ 1993]
– Log-Likelihood-Ratio (LLR) [Munteanu+ 2006]
– Sub-corpora sampling lexicon (SampLEX) [Vulic+ 2012]
Comparison system: [Munteanu+ 2006]
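As background for the LLR lexicon, the sketch below computes the standard log-likelihood-ratio statistic (G²) for a 2x2 contingency table of word co-occurrence counts. How the LLR lexicon in these experiments derives its counts and turns the statistic into lexicon scores is not stated on the slide, so the function `llr` and the count layout are assumptions for illustration only.

```python
import math


def llr(k11: int, k12: int, k21: int, k22: int) -> float:
    """G^2 = 2 * sum(O * ln(O / E)) over the cells of a 2x2 table.

    Assumed layout: k11 = both words occur, k12 = only the source word,
    k21 = only the target word, k22 = neither.
    """
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))
    g2 = 0.0
    for r in range(2):
        for c in range(2):
            o = observed[r][c]
            if o > 0:  # a zero cell contributes nothing to the sum
                expected = rows[r] * cols[c] / n
                g2 += o * math.log(o / expected)
    return 2.0 * g2


print(llr(10, 10, 10, 10))  # independent counts give 0.0
print(llr(30, 0, 0, 30))    # perfectly associated counts give a large value
```

A higher value indicates that the two words co-occur more (or less) often than independence would predict; the sign of the association has to be checked separately by comparing `k11` with its expected count.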
Results
Method | # fragments | Avg size (Zh/Ja) | Accuracy
[Munteanu+ 2006] | 28.4k | 20.36/21.39 | (1%)
Only (IBM Model 1) | 18.9k | 4.03/4.14 | 80%
Only (LLR) | 18.3k | 4.00/4.14 | 89%
Only (SampLEX) | 18.4k | 3.96/4.05 | 87%
External (IBM Model 1) | 28.7k | 4.18/4.33 | 81%
External (LLR) | 26.9k | 4.17/4.33 | 85%
External (SampLEX) | 28.0k | 4.11/4.23 | 82%
※ Accuracy: manual evaluation of 100 fragments, judged by exact match
Experimental Settings (Translation)
Baseline: Zh-Ja paper abstract corpus (680k sentences, including 11k chemistry domain sentences)
Tuning: 368 chemistry domain sentences
Testing: 367 chemistry domain sentences
Decoder: Moses
Language model: 5-gram language model trained on the Ja side of the parallel corpus using SRILM
MT performance is compared by appending the extracted fragments to the baseline training data
BLEU-4 for Different Systems
[Figure: BLEU-4 scores of the compared systems]
※ "*" denotes that the result is significantly better than "Baseline" at p < 0.05
Conclusion
We proposed an accurate parallel fragment extraction system using an alignment model and a translation lexicon
Future Work
– A method to deal with ordering
– A method that does not depend on a parallel corpus
– Try other language pairs and domains
Thank you for your attention!
Examples of Extracted Fragment Pairs
ID | Zh Fragment | Ja Fragment
1 | 直接甲醇燃料电池 | 直接メタノール燃料電池
2 | X射线光电子能谱(XPS) | X線光電子分光法(XPS)
3 | (OH)24(H2O)12]
4 | 的原生质体融合 | のプロトプラスト融合
5 | 分子动力学(MD)模拟了 | 分子動力学(MD)シミュレーションを
6 | 扫描电子显微镜(SEM)、透射电子显微镜(TEM) | 型電子顕微鏡(SEM),透過型電子顕微鏡(TEM)
7 | 证明了本算法的 | から本アルゴリズムの
8 | X射线粉末衍射 | X線回折分析
※ Noise is written in red font
Most noise is due to the noisy translation lexicon (Examples 5-7)
Score smoothing also produces some noise (Example 8)