Accurate Parallel Fragment Extraction from Quasi-Comparable Corpora using Alignment Model and Translation Lexicon Chenhui Chu, Toshiaki Nakazawa, Sadao Kurohashi


Presentation transcript:

Accurate Parallel Fragment Extraction from Quasi-Comparable Corpora using Alignment Model and Translation Lexicon
Chenhui Chu, Toshiaki Nakazawa, Sadao Kurohashi
Graduate School of Informatics, Kyoto University
IJCNLP2013 (2013/10/17)

Outline
Background
Related Work
Proposed Method
Experiments
Conclusion

Bilingual Corpora [Fung+ 2004]
Type | Definition | Example
Parallel | Sentence-aligned bilingual corpora | Europarl
Noisy Parallel | Bilingual translations of documents | Patent family
Comparable | Topic-aligned bilingual documents | Wikipedia
Quasi-Comparable | Very-non-parallel bilingual documents | this study
Parallel corpora are scarce
Parallel sentences can be extracted from noisy parallel and comparable corpora
Quasi-comparable corpora are more widely available, but they contain few parallel sentences

Parallel Fragments
In quasi-comparable corpora, comparable sentences can contain parallel fragments
Parallel fragments are also helpful for SMT
We aim to accurately extract parallel fragments from comparable sentences
Zh: 应用 / 铅 / 离子 / 选择 / 电极 / 电位 / 滴定 / 法 / 测定 / 甘草 / 及 / 其 / 制品 / 中 / 的 / 甘草 / 酸
(Applying lead ion selective electrode potentiometric titration method to determine licorice and its products 's glycyrrhizic acid)
Ja: < / 原 / 報 / > / 鉛 / イオン / 選択 / 性 / 電極を / 用いる / 混合 / 試料 / 中 / の /…/ と / 電位 / 差 / 滴定 / 法 / の / 比較
(lead ion selective electrode used mixed sample 's … and potentiometric titration method 's comparison)

Parallel Sub-sentential Fragment Extraction [Munteanu+ 2006]
1. Extract a translation lexicon from a parallel corpus
2. Apply a lexicon filter to comparable sentences in two directions independently
  – Assign initial scores according to the lexicon
  – Smooth the scores to gain new knowledge that does not exist in the lexicon
3. Extract sub-sentential (not exactly parallel) fragments

Lexicon Filter on Ja-to-Zh Direction
[Figure: lexicon filter applied in the Ja-to-Zh direction to the Zh/Ja example sentence pair]

Lexicon Filter on Zh-to-Ja Direction
[Figure: lexicon filter applied in the Zh-to-Ja direction to the Zh/Ja example sentence pair]

System Overview
[Figure: pipeline diagram with steps numbered (1) to (5). Labels: Source corpora, Target corpora, Parallel corpus, SMT, Translated sentences, IR: top N results, Classifier, Comparable sentences, Alignment, Parallel fragment candidates, Lexicon filter, Parallel fragments]
Use an alignment model to locate the source and target fragment candidates simultaneously
Use a more accurate lexicon filter

Parallel Fragment Candidate Detection by Alignment
Candidates are the longest monotonic, non-NULL aligned fragments of more than 3 tokens
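A minimal Python sketch of candidate detection, under assumptions the slide does not spell out: the aligner output is a list of (source index, target index) links, "non-NULL" means every source token in the run has a link, and "monotonic" is simplified here to one-to-one links whose target indices increase by exactly one. The function name `candidate_fragments` and the `min_len` parameter are hypothetical; `min_len=4` reads "more than 3 tokens" literally.

```python
def candidate_fragments(links, min_len=4):
    """Return (src_start, src_end, tgt_start, tgt_end) spans, inclusive.

    links: iterable of (src_idx, tgt_idx) pairs from a word aligner.
    A run is broken by an unaligned (NULL) source token, by a source token
    with several links, or by a non-consecutive target index.
    """
    by_src = {}
    for s, t in links:
        by_src.setdefault(s, []).append(t)
    # Simplifying assumption: keep only source tokens with exactly one link.
    one_to_one = {s: ts[0] for s, ts in by_src.items() if len(ts) == 1}

    runs, run = [], []
    for s in sorted(one_to_one):
        t = one_to_one[s]
        if run and s == run[-1][0] + 1 and t == run[-1][1] + 1:
            run.append((s, t))      # extends the current monotonic run
        else:
            if len(run) >= min_len:
                runs.append(run)    # close the previous run if long enough
            run = [(s, t)]
    if len(run) >= min_len:
        runs.append(run)
    return [(r[0][0], r[-1][0], r[0][1], r[-1][1]) for r in runs]


# Source tokens 0..3 align monotonically to target tokens 2..5;
# source token 4 is unaligned, and tokens 5..6 form a run that is too short.
example_links = [(0, 2), (1, 3), (2, 4), (3, 5), (5, 0), (6, 1)]
spans = candidate_fragments(example_links)
```

Because each run is grown until it breaks, every returned span is the longest one at its position, which matches the "longest aligned fragments" wording.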

Lexicon Filter: Assign Initial Scores
Assign scores in two directions to aligned word pairs in the candidates according to the translation lexicon
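The scoring step might look like the following Python sketch. It assumes each direction has its own lexicon stored as a dictionary from (source word, target word) to a positive score, and that a pair missing from a lexicon receives a fixed negative score. The function name `initial_scores`, the `missing` value of -1.0 and the toy lexicon entries are all hypothetical; the slide only states that scores come from the translation lexicon.

```python
def initial_scores(aligned_pairs, lex_zh2ja, lex_ja2zh, missing=-1.0):
    """Score each aligned (zh, ja) word pair in both directions.

    lex_zh2ja maps (zh, ja) to a score; lex_ja2zh maps (ja, zh) to a score.
    Pairs absent from a lexicon get `missing` (an assumed penalty value).
    """
    zh2ja = [lex_zh2ja.get((zh, ja), missing) for zh, ja in aligned_pairs]
    ja2zh = [lex_ja2zh.get((ja, zh), missing) for zh, ja in aligned_pairs]
    return zh2ja, ja2zh


# Word pairs taken from the example sentences; the scores are made up.
pairs = [("铅", "鉛"), ("离子", "イオン"), ("选择", "選択"), ("电极", "電極を")]
toy_zh2ja = {("铅", "鉛"): 0.75, ("离子", "イオン"): 0.5, ("选择", "選択"): 0.5}
toy_ja2zh = {("鉛", "铅"): 0.5, ("イオン", "离子"): 0.25, ("選択", "选择"): 0.5}
zh2ja, ja2zh = initial_scores(pairs, toy_zh2ja, toy_ja2zh)
```

In this toy input the last pair is in neither lexicon, so it gets the negative score in both directions.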

Lexicon Filter: Score Smoothing
Only smooth a word with a negative score when both the left and right words around it have positive scores
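The smoothing condition can be sketched in Python as follows. The slide gives the condition but not the value assigned to a smoothed word, so replacing the score with the mean of its two neighbours is an assumption, as is the function name `smooth_scores`.

```python
def smooth_scores(scores):
    """Smooth a negative score only if both neighbours are positive.

    The replacement value (mean of the two neighbours) is an assumption.
    Neighbours are read from the original list, so one smoothed word
    never triggers smoothing of the next one.
    """
    smoothed = list(scores)
    for i in range(1, len(scores) - 1):
        if scores[i] < 0 and scores[i - 1] > 0 and scores[i + 1] > 0:
            smoothed[i] = (scores[i - 1] + scores[i + 1]) / 2
    return smoothed


# Index 1 is smoothed; indices 3 and 4 are not, because each has a
# negative neighbour.
result = smooth_scores([0.5, -1.0, 0.25, -1.0, -1.0, 0.75])
```

The first and last words are never smoothed in this sketch because they lack one of the two neighbours.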

Fragment Extraction
Extract fragments of more than 3 tokens with continuous positive scores in both directions
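A short Python sketch of this final step, assuming the two score lists are indexed by the same aligned word pairs. The function name `extract_fragments` and the `min_len` parameter are hypothetical; `min_len=4` reads "more than 3 tokens" literally.

```python
def extract_fragments(scores_fwd, scores_bwd, min_len=4):
    """Return inclusive (start, end) index spans where both directions
    have positive scores for at least `min_len` consecutive positions."""
    if len(scores_fwd) != len(scores_bwd):
        raise ValueError("score lists must have the same length")
    fragments, start = [], None
    n = len(scores_fwd)
    for i in range(n + 1):  # one extra step to close a run at the end
        positive = i < n and scores_fwd[i] > 0 and scores_bwd[i] > 0
        if positive and start is None:
            start = i
        elif not positive and start is not None:
            if i - start >= min_len:
                fragments.append((start, i - 1))
            start = None
    return fragments


fwd = [0.5, 0.5, 0.5, 0.5, -1.0, 0.5, 0.5]
bwd = [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, -1.0]
spans = extract_fragments(fwd, bwd)
```

Requiring positive scores in both directions at once is what distinguishes this step from applying the filter to each direction independently.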

Outline
Background
Related Work
Proposed Method
Experiments
  – Parallel Fragment Extraction
  – Translation
Conclusion

Experimental Settings (Parallel Fragment Extraction 1/2)
Parallel corpus: Zh-Ja abstract corpus (680k sentences, scientific domain)
Quasi-comparable corpora
  – Chinese corpora: CNKI (90k articles, 420k sentences, chemistry domain)
  – Japanese corpora: CiNii (880k articles, 5M sentences, scientific domain)
Comparable sentences: 30k chemistry domain sentences were extracted

Experimental Settings (Parallel Fragment Extraction 2/2)
Alignment: GIZA++ with symmetrization heuristics
  – Only: use only the extracted comparable sentences
  – External: use them together with the 11k chemistry domain data in the parallel corpus
Translation lexicon
  – IBM Model 1 [Brown+ 1993]
  – Log-Likelihood-Ratio (LLR) [Munteanu+ 2006]
  – Sub-corpora sampling lexicon (SampLEX) [Vulic+ 2012]
Compared with [Munteanu+ 2006]
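For readers unfamiliar with the LLR lexicon option, the sketch below computes one standard form of the log-likelihood-ratio statistic (the G² statistic over a 2×2 co-occurrence table). The slides do not say how the cited lexicon computes or thresholds its scores, so this illustrates the statistic only. The function name `log_likelihood_ratio` and the count names `k11` to `k22` are hypothetical.

```python
import math


def log_likelihood_ratio(k11, k12, k21, k22):
    """G^2 statistic for a 2x2 contingency table of co-occurrence counts.

    k11: both words occur together      k12: only the first word occurs
    k21: only the second word occurs    k22: neither word occurs
    Computes 2 * sum(observed * ln(observed / expected)); empty cells
    contribute zero.
    """
    n = k11 + k12 + k21 + k22
    cells = (
        (k11, k11 + k12, k11 + k21),
        (k12, k11 + k12, k12 + k22),
        (k21, k21 + k22, k11 + k21),
        (k22, k21 + k22, k12 + k22),
    )
    total = 0.0
    for observed, row_sum, col_sum in cells:
        if observed > 0:
            total += observed * math.log(observed * n / (row_sum * col_sum))
    return 2.0 * total


independent = log_likelihood_ratio(10, 10, 10, 10)  # no association
associated = log_likelihood_ratio(10, 0, 0, 10)     # perfect association
```

The statistic is zero when the two words co-occur exactly as often as chance predicts and grows with the strength of the association.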

Results
Method | # fragments | Avg size (Zh/Ja) | Accuracy
[Munteanu+ 2006] | 28.4k | 20.36/21.39 | (1%)
Only (IBM Model 1) | 18.9k | 4.03/4.14 | 80%
Only (LLR) | 18.3k | 4.00/4.14 | 89%
Only (SampLEX) | 18.4k | 3.96/4.05 | 87%
External (IBM Model 1) | 28.7k | 4.18/4.33 | 81%
External (LLR) | 26.9k | 4.17/4.33 | 85%
External (SampLEX) | 28.0k | 4.11/4.23 | 82%
※ Accuracy: manually evaluated 100 fragments based on exact match

Experimental Settings (Translation)
Baseline: Zh-Ja paper abstract corpus (680k sentences, including 11k chemistry domain sentences)
Tuning: 368 sentences of chemistry domain
Testing: 367 sentences of chemistry domain
Decoder: Moses
Language model: 5-gram language model trained on the Ja side of the parallel corpus using SRILM
Compare MT performance by appending the extracted fragments to the baseline training data

BLEU-4 for Different Systems
[Figure: BLEU-4 scores of the compared systems]
※ "*" denotes that the result is significantly better than "Baseline" at p < 0.05

Conclusion
We proposed an accurate parallel fragment extraction system using an alignment model and a translation lexicon
Future Work
  – A method to deal with ordering
  – A method that does not depend on a parallel corpus
  – Try other language pairs and domains

Thank you for your attention!

Examples of Extracted Fragment Pairs
ID | Zh Fragment | Ja Fragment
1 | 直接甲醇燃料电池 | 直接メタノール燃料電池
2 | X射线光电子能谱(XPS) | X線光電子分光法(XPS)
3 | (OH)24(H2O)12] |
4 | 的原生质体融合 | のプロトプラスト融合
5 | 分子动力学(MD)模拟了 | 分子動力学(MD)シミュレーションを
6 | 扫描电子显微镜(SEM)、透射电子显微镜(TEM) | 型電子顕微鏡(SEM),透過型電子顕微鏡(TEM)
7 | 证明了本算法的 | から本アルゴリズムの
8 | X射线粉末衍射 | X線回折分析
※ Noise is written in red font
Most noise is due to the noisy translation lexicon (Examples 5-7)
Score smoothing also produces some noise (Example 8)