Kuang Ru; Jinan Xu; Yujie Zhang; Peihao Wu Beijing Jiaotong University

Slides:



Advertisements
Similar presentations
Rationale for a multilingual corpus for machine translation evaluation Debbie Elliott Anthony Hartley Eric Atwell Corpus Linguistics 2003, Lancaster, England.
Advertisements

PHONE MODELING AND COMBINING DISCRIMINATIVE TRAINING FOR MANDARIN-ENGLISH BILINGUAL SPEECH RECOGNITION Yanmin Qian, Jia Liu ICASSP2010 Pei-Ning Chen CSIE.
The Application of Machine Translation in CADAL Huang Chen, Chen Haiying Zhejiang University Libraries, Hangzhou, China
Context-based object-class recognition and retrieval by generalized correlograms by J. Amores, N. Sebe and P. Radeva Discussion led by Qi An Duke University.
Document Summarization using Conditional Random Fields Dou Shen, Jian-Tao Sun, Hua Li, Qiang Yang, Zheng Chen IJCAI 2007 Hao-Chin Chang Department of Computer.
Statistical Machine Translation Part II: Word Alignments and EM Alexander Fraser ICL, U. Heidelberg CIS, LMU München Statistical Machine Translation.
Statistical Machine Translation Part II – Word Alignments and EM Alex Fraser Institute for Natural Language Processing University of Stuttgart
Proceedings of the Conference on Intelligent Text Processing and Computational Linguistics (CICLing-2007) Learning for Semantic Parsing Advisor: Hsin-His.
Robust Extraction of Named Entity Including Unfamiliar Word Masatoshi Tsuchiya, Shinya Hida & Seiichi Nakagawa Reporter: Chia-Ying Lee Advisor: Prof. Hsin-Hsi.
Automatic Identification of Cognates, False Friends, and Partial Cognates University of Ottawa, Canada University of Ottawa, Canada.
Chinese Word Segmentation Method for Domain-Special Machine Translation Su Chen; Zhang Yujie; Guo Zhen; Xu Jin’an Beijing Jiaotong University.
47 th Annual Meeting of the Association for Computational Linguistics and 4 th International Joint Conference on Natural Language Processing Of the AFNLP.
Shallow Processing: Summary Shallow Processing Techniques for NLP Ling570 December 7, 2011.
1 Noun Homograph Disambiguation Using Local Context in Large Text Corpora Marti A. Hearst Presented by: Heng Ji Mar. 29, 2004.
Introduction to CL Session 1: 7/08/2011. What is computational linguistics? Processing natural language text by computers  for practical applications.
Comments on Guillaume Pitel: “Using bilingual LSA for FrameNet annotation of French text from generic resources” Gerd Fliedner Computational Linguistics.
Enhance legal retrieval applications with an automatically induced knowledge base Ka Kan Lo.
Text Mining: Finding Nuggets in Mountains of Textual Data Jochen Dijrre, Peter Gerstl, Roland Seiffert Presented by Drew DeHaas.
An Automatic Segmentation Method Combined with Length Descending and String Frequency Statistics for Chinese Shaohua Jiang, Yanzhong Dang Institute of.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology Advisor : Dr. Hsu Student : Sheng-Hsuan Wang Department.
Extracting Places and Activities from GPS Traces Using Hierarchical Conditional Random Fields Yong-Joong Kim Dept. of Computer Science Yonsei.
Machine translation Context-based approach Lucia Otoyo.
BILINGUAL CO-TRAINING FOR MONOLINGUAL HYPONYMY-RELATION ACQUISITION Jong-Hoon Oh, Kiyotaka Uchimoto, Kentaro Torisawa ACL 2009.
A Phonotactic-Semantic Paradigm for Automatic Spoken Document Classification Bin MA and Haizhou LI Institute for Infocomm Research Singapore.
Empirical Methods in Information Extraction Claire Cardie Appeared in AI Magazine, 18:4, Summarized by Seong-Bae Park.
Final Review 31 October WP2: Named Entity Recognition and Classification Claire Grover University of Edinburgh.
Graphical models for part of speech tagging
Exploiting Ontologies for Automatic Image Annotation M. Srikanth, J. Varner, M. Bowden, D. Moldovan Language Computer Corporation
Combining Lexical Semantic Resources with Question & Answer Archives for Translation-Based Answer Finding Delphine Bernhard and Iryna Gurevvch Ubiquitous.
An Integrated Approach for Arabic-English Named Entity Translation Hany Hassan IBM Cairo Technology Development Center Jeffrey Sorensen IBM T.J. Watson.
Comparative study of various Machine Learning methods For Telugu Part of Speech tagging -By Avinesh.PVS, Sudheer, Karthik IIIT - Hyderabad.
Processing of large document collections Part 7 (Text summarization: multi- document summarization, knowledge- rich approaches, current topics) Helena.
Named Entity Recognition based on Bilingual Co-training Li Yegang School of Computer, BIT.
TEMPLATE DESIGN © Zhiyao Duan 1,2, Lie Lu 1, and Changshui Zhang 2 1. Microsoft Research Asia (MSRA), Beijing, China.2.
NUDT Machine Translation System for IWSLT2007 Presenter: Boxing Chen Authors: Wen-Han Chao & Zhou-Jun Li National University of Defense Technology, China.
An Effective Word Sense Disambiguation Model Using Automatic Sense Tagging Based on Dictionary Information Yong-Gu Lee
Automatic Image Annotation by Using Concept-Sensitive Salient Objects for Image Content Representation Jianping Fan, Yuli Gao, Hangzai Luo, Guangyou Xu.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 Instance Filtering for Entity Recognition Advisor : Dr.
Contextual Ranking of Keywords Using Click Data Utku Irmak, Vadim von Brzeski, Reiner Kraft Yahoo! Inc ICDE 09’ Datamining session Summarized.
Translation Memory System (TMS)1 Translation Memory Systems Presentation by1 Melina Takanen & Julianna Ekert CAT Prof. Thorsten Trippel University.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology Advisor : Dr. Hsu Student : Sheng-Hsuan Wang Department.
Information Retrieval at NLC Jianfeng Gao NLC Group, Microsoft Research China.
For Monday Read chapter 26 Last Homework –Chapter 23, exercise 7.
Improving Named Entity Translation Combining Phonetic and Semantic Similarities Fei Huang, Stephan Vogel, Alex Waibel Language Technologies Institute School.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology Automatic Extraction of Translational Japanese- KATAKANA.
Mutual bilingual terminology extraction Le An Ha*, Gabriela Fernandez**, Ruslan Mitkov*, Gloria Corpas*** * University of Wolverhampton ** Universidad.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology Advisor : Dr. Hsu Graduate : Chun Kai Chen Author : Qing.
Intelligent Database Systems Lab N.Y.U.S.T. I. M. Psychiatric document retrieval using a discourse-aware model Presenter : Wu, Jia-Hao Authors : Liang-Chih.
A DYNAMIC APPROACH TO THE SELECTION OF HIGH ORDER N-GRAMS IN PHONOTACTIC LANGUAGE RECOGNITION Mikel Penagarikano, Amparo Varona, Luis Javier Rodriguez-
Multi-level Bootstrapping for Extracting Parallel Sentence from a Quasi-Comparable Corpus Pascale Fung and Percy Cheung Human Language Technology Center,
Mining Dependency Relations for Query Expansion in Passage Retrieval Renxu Sun, Chai-Huat Ong, Tat-Seng Chua National University of Singapore SIGIR2006.
A New Approach for English- Chinese Named Entity Alignment Donghui Feng Yayuan Lv Ming Zhou USC MSR Asia EMNLP-04.
Combining Text and Image Queries at ImageCLEF2005: A Corpus-Based Relevance-Feedback Approach Yih-Cheng Chang Department of Computer Science and Information.
Statistical Machine Translation Part II: Word Alignments and EM Alex Fraser Institute for Natural Language Processing University of Stuttgart
Multilingual Information Retrieval using GHSOM Hsin-Chang Yang Associate Professor Department of Information Management National University of Kaohsiung.
Discriminative Training and Machine Learning Approaches Machine Learning Lab, Dept. of CSIE, NCKU Chih-Pin Liao.
Computational Linguistics Courses Experiment Test.
A Multilingual Hierarchy Mapping Method Based on GHSOM Hsin-Chang Yang Associate Professor Department of Information Management National University of.
1 ICASSP Paper Survey Presenter: Chen Yi-Ting. 2 Improved Spoken Document Retrieval With Dynamic Key Term Lexicon and Probabilistic Latent Semantic Analysis.
Statistical techniques for video analysis and searching chapter Anton Korotygin.
For Monday Read chapter 26 Homework: –Chapter 23, exercises 8 and 9.
Overview of Statistical NLP IR Group Meeting March 7, 2006.
Enhanced hypertext categorization using hyperlinks Soumen Chakrabarti (IBM Almaden) Byron Dom (IBM Almaden) Piotr Indyk (Stanford)
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 Chinese Named Entity Recognition using Lexicalized HMMs.
Statistical Machine Translation Part II: Word Alignments and EM
Designing Cross-Language Information Retrieval System using various Techniques of Query Expansion and Indexing for Improved Performance  Hello everyone,
Guangbing Yang Presentation for Xerox Docushare Symposium in 2011
Social Knowledge Mining
Motivation It can effectively mine multi-modal knowledge with structured textural and visual relationships from web automatically. We propose BC-DNN method.
Extracting Why Text Segment from Web Based on Grammar-gram
Presentation transcript:

Kuang Ru; Jinan Xu; Yujie Zhang; Peihao Wu Beijing Jiaotong University A Method to Construct Chinese-Japanese Named Entity Translation Equivalents Using Monolingual Corpora Kuang Ru; Jinan Xu; Yujie Zhang; Peihao Wu Beijing Jiaotong University 各位老师,各位同学,大家好,我是来自北京交通大学的茹旷,今天我演讲的题目是《使用两侧单语语料库获取中日命名实体翻译等价对方法》

Introduction & Motivation Proposed Method Experiment & Result Analysis Outline Introduction & Motivation Proposed Method Experiment & Result Analysis Conclusion 演讲的主要内容有以下几个方面组成,首先,介绍研究的背景和提出该方法的动机。其次介绍我们提出的方法,然后是实验结果和分析,最后是总结。

Named Entity research is a basic task of natural language processing; Introduction (1) Named Entity research is a basic task of natural language processing; Multilingual information increase rapidly; NE translation equivalents has important significance for multilingual information processing. Such as automatic summarization, machine translation, cross-language information retrieval and etc.. 命名实体研究是自然语言处理的基本任务之一,有很多任务都和命名实体的研究相关。随着互联网的普及,多语言信息飞速增加。所以命名实体翻译等价对在多语言信息处理中也变得越来越重要,比如在如下领域:多语言自动文摘/机器翻译/跨语言信息检索还有很多其他应用。

From the parallel corpus Introduction (2) From the parallel corpus Relevant language resources are rare. The parallel corpus acquisition is not a simple task. From comparable corpus Avoid the huge cost to obtain parallel corpus. Lower accuracy. Feature Their phonetic feature. Entity context. Relationships between an NE pair. 现在多语言命名实体抽取已经有比较多的论文,大体上可以分成两种方法,一种是从平行语料中抽取,一种是从可比语料中抽取,随着研究的深入,越来越多的转入到第二种方式。是因为很多语言资源是比较稀缺的,构建某些语言对的难度很高。方法没有普遍性。相比于第一种方法第二种方法就避免了构建平行语料库的花费。但是,这种方法抽取出来的命名实体对的精确度较低。这些方法的使用了多种特征,包括他们自身的语音特征,像是bush和布什,还有上下文特征,还有命名实体对之间的关系等等。

The Chinese and Japanese bilingual resources are very few. Motivation The Chinese and Japanese bilingual resources are very few. Difficult to obtain bilingual corpus. Japanese Kanji comes from ancient Chinese. Japanese Kanji and Chinese Hanzi are Potentially important information. 首先,中日双语资源比较稀少,很难构建双语语料库。大部分提出的方法就没法使用了。但是,日语汉字潜在的包含了一些重要信息。

The Proposed Method NER 如图就是提出方法的流程,然后对每一个步骤做一下解释。

Statistical machine learning methods Monolingual NER Statistical machine learning methods Does not require extensive linguistics knowledge. Easy to migrate to new areas. Main methods: HMM, Maximum Entropy, SVM, Conditional Random Fields (CRF) and etc.. Conditional Random Fields (CRF) flexible feature and global optimal annotation framework slow convergence and long training time 在单语的命名实体识别上我们使用了基于CRF的方法,为什么要采用这种方法,是因为CRF的特征选择比较灵活,而且通过全局优化,能产生较好的结果,缺点是训练的时间会比较长。

The Proposed Method 下面讲解一下中日汉字对照表

Kanji Comparison Table Use open source resources to build Japanese kanji, Traditional Chinese, Simplified Chinese Hanzi table. Hanzi converter standard conversion table 它是一个中日汉字之间的映射表,这一部分内容可以从开放数据中获取,第一个表上可以看出从日语到繁体汉字到简体汉字的映射关系,需要注意的问题就是映射关系可能不是1对1的,如同下面的表所示。可能会有多个词变成一个简体汉字。

The Proposed Method 如图就是提出方法的流程,然后对每一个步骤做一下解释。

Inductive Learning (1) Extraction instance(common part and different part) from Unknown string pairs Segment 2  Rules 从这个表上可以看出段2是想两个输入中相同的部分,这部分直接可以认为是一个局部翻译的规则,段1和段3是否是翻译还有待于统计信息证明。或者他们之间的翻译关系可能是需要调序的。还可能有其他的情况,比如input1前面一部分由input2前后部分组合而成,由于命名实体的本身的特性,一个是长度相比句子比较短,二是组合格式比较规整。所以这种情况没有发现。

Extraction primitive from segments. Inductive Learning (2) Extraction primitive from segments. Segment 2  Rules 继续上面的步骤,如果之后出现了如表所示的段,则通过归纳学习的方式就能得到词元2同样是一个局部翻译的规则。

Algorithm: Inductive Learning INPUT:J_C_dict, Simlarity,Part_Trans_Rules INITIALIZATION: Diff_count For entity in Simlarity: If entity.value > Threshold.diff: #Get the same and different part in NEs. content = GetDiff(entity.ja, entity.ch) Diff_count[content]++ For content in Diff_count: #If rule frequency higher than threshold, add it into partial translation rules. If Diff_count[content] > Threshold.rules: Part_Trans_Rules += content 这是在本文中归纳学习的伪代码,通过这个阈值来控制规则抽取。

The Proposed Method 然后我们结合例子解释实例选择过程

Examples Selection (1) Calculate NE similarity between Chinese and Japanese instances. (cosine vector) 𝑆= 𝑨∙𝑩 𝑨 × 𝑩 = 𝑖=1 𝑛 𝐴 𝑖 × 𝐵 𝑖 𝑖=1 𝑛 𝐴 𝑖 2 × 𝑖=1 𝑛 𝐵 𝑖 2 Example selection就是要从双语的命名实体列表中抽取出相似度比较高的条目进行规则抽取,计算相似度的方法比较多,本文采用了cosine 向量的方法来计算相似度。

use the NE Partial Translation Rules to translation the NEs Examples Selection (2) 大原綜合病院附属大原医療センター 大原综合医院附属大原医疗中心 Word segmentation 大原 綜合 病院 附属 大原 医療 センター 大原 综合 医院 附属 大原 医疗 中心 use the NE Partial Translation Rules to translation the NEs 大原 综合 医院 附属 大原 医療 センター "病院"(hospital) -> "医院" "綜合"(general) -> "综合" ‘療’-> ‘疗’ 这里是一个真实例子,假设系统已经迭代过几次,规则库中已经有一部分规则。第一步先进行分词,得到这样的结果,再结合规则库中的规则,得到如下的例子。如果是初次迭代的话,由于规则库中还没有规则,这一步是不做任何改变的。然后使用中日汉字对照表进行翻译,这是汉字表的信息

Convert the Japanese Kanji Examples Selection (3) Convert the Japanese Kanji 大原 综合 医院 附属 大原 医疗 センター 大原 综合 医院 附属 大原 医疗 中心 Form a bilingual word vector [大原 综合 医院 附属 医疗 センター 中心] Get the word frequency vector [大原/2 综合/1 医院/1 附属/1 医疗/1 センター/1 中心/0] [大原/2 综合/1 医院/1 附属/1 医疗/1 センター/0 中心/1] 然后得到了最后的转化结果,之后再用之前的cos公式计算两个向量的相似度,如果高于阈值的,就把这个双语对当作归纳学习的输入。

Experiment & Result Analysis (1) Monolingual corpora From Wikipedia database 69,874 Japanese texts 84,055 Chinese texts The result of NER 实验的数据来源是维基百科,选取69874篇日语文章和84055篇中文文章,分别对两个单语语料库进行命名实体识别,识别结果如表所示。

Experiment & Result Analysis (2) Experimental data Randomly select 8000 entries. Manually aligned Evaluation criteria 𝑃= 𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑡𝑟𝑢𝑒 𝑝𝑜𝑠𝑖𝑡𝑣𝑒𝑠 𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑡𝑟𝑢𝑒 𝑝𝑜𝑠𝑖𝑡𝑖𝑣𝑒𝑠+𝑓𝑎𝑙𝑠𝑒 𝑝𝑜𝑠𝑖𝑡𝑖𝑣𝑒𝑠 𝑅= 𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑡𝑟𝑢𝑒 𝑝𝑜𝑠𝑖𝑡𝑖𝑣𝑒𝑠 𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑡𝑟𝑢𝑒 𝑝𝑜𝑠𝑖𝑡𝑖𝑣𝑒𝑠+𝑡𝑟𝑢𝑒 𝑛𝑒𝑔𝑎𝑡𝑖𝑣𝑒𝑠 𝐹 𝛽 = 𝛽 2 +1 ∙𝑃∙𝑅 𝛽 2 ∙𝑃+𝑅 实验数据是从上表中选取8000个实体,人工的对这些实体做对齐。采用PRF来评价方法的效果。

Experiment & Result Analysis (3) The experimental results 通过几次迭代之后,效果已经比较好了。

Experiment & Result Analysis (4) The different type of NE 这个是不同的命名实体类别的PRF,还有一点就是人名在2到3次迭代之后就基本不会再产生新的规则,可能是由于人名长度普遍较短的原因。

We propose a method extracting NE translation equivalents. Conclusion(1) We propose a method extracting NE translation equivalents. Reduce the cost to build the corpus Based on inductive learning Minimal additional knowledge 我们提出来一种抽取命名实体翻译等价对的方法,它的优点是不用构建双语的语料库,包括可比和平行的,使用归纳学习法和少量的语言信息。

Pure kana Named Entity may can not extract partial translation rules. Conclusion(2) Pure kana Named Entity may can not extract partial translation rules. コーネリアス(小山田圭吾) Time complexity rapidly increase with the increasing NE list. 但也有一些问题,在面对纯假名信息时可能不能得到翻译规则,原因是这些假名可能没有归纳的条件。还有就是复杂度比较高,随着NE表的增大迅速增加。

Thanks! Q&A 谢谢!