Alignment of Bilingual Named Entities in Parallel Corpora Using Statistical Model
Chun-Jen Lee, Jason S. Chang, Thomas C. Chuang
AMTA 2004

Introduction
– Focus: extracting entity names (PER, LOC, ORG) from a bilingual corpus.
– The feasibility of extracting interlingual NEs has seldom been addressed. Related work:
  – Al-Onaizan and Knight 2002
  – Huang and Vogel 2002
  – Chen et al.
  – Moore 2003
  – Kumano et al.
  – Lee et al. (baseline model) 2003
– This work integrates approximate matching and personal name recognition into the baseline model.

Framework
1. Preprocess:
   1) Perform sentence alignment.
   2) Label English named entities.
2. Main process:
   1) For each labeled English named entity $NE_E$, apply the Statistical Probability Translation Model and Approximate Matching to find Chinese named-entity candidates $\{NE_A\}$ in the Chinese sentence $S_C$.
   2) For any word $W_E$ in $NE_E$ that has no corresponding Chinese translation in $S_C$, apply the proposed Statistical Transliteration Model, enhanced with Chinese Personal Name Recognition, to extract the corresponding Chinese transliterations $\{NE_B\}$ in $S_C$ with scores above a predefined threshold.
   3) Merge $\{NE_A\}$ with $\{NE_B\}$ into the set of possible candidates $\{NE_C\}$.
   4) Rank $\{NE_C\}$ by the cost scores. The candidate with the maximum score is chosen as the answer.
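The merge-and-rank steps of the main process can be sketched as follows. This is only an illustration: the function name `merge_and_rank`, the rule of keeping the higher score when both models propose the same candidate, and the toy scores are assumptions, not details given on the slides.

```python
from typing import Dict, List, Tuple


def merge_and_rank(ne_a: Dict[str, float],
                   ne_b: Dict[str, float],
                   threshold: float = float("-inf")) -> List[Tuple[str, float]]:
    """Merge translation candidates (ne_a) with transliteration candidates (ne_b)
    and return them sorted by score, best first.

    Assumption: scores are comparable log-probabilities, and a candidate
    proposed by both models keeps its higher score.
    """
    merged = dict(ne_a)
    for cand, score in ne_b.items():
        if score <= threshold:
            continue  # transliteration candidates must exceed the threshold
        merged[cand] = max(score, merged.get(cand, float("-inf")))
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)


# Toy scores, invented for illustration only.
ranked = merge_and_rank({"a": -1.0, "b": -3.0}, {"b": -2.0, "c": -5.0}, threshold=-4.0)
best_candidate = ranked[0][0]  # the top-ranked candidate is taken as the answer
```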

SPTM
– A noisy channel approach.
– Translating an English phrase e with l words into a Mandarin Chinese phrase f with m words is modeled by decomposing the channel function into two independent probabilistic functions:
  – Lexical translation probability function $P(f_{a_i} \mid e_i)$, where $e_i$ is the i-th word in e and is aligned with $f_{a_i}$ in f under the alignment a.
  – Alignment probability function $P(a \mid l, m) = P(a_1, a_2, \ldots, a_l \mid l, m)$.

SPTM
– E = "Ichthyosis Concern Association"
– F = "關懷 魚鱗癬 協會"
– Correct alignment: $(a_1 = 2, a_2 = 1, a_3 = 3)$.
– The phrase translation probability, and the scoring function defined as a log probability, were given as equations on the slide; the equations are not preserved in this transcript.
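A minimal sketch of how such a log-probability score might be computed from the two-factor decomposition (lexical translation probability times alignment probability). Several things here are assumptions rather than the authors' definitions: alignments are restricted to one-to-one mappings, the alignment probability is taken as uniform, unseen word pairs get a small floor probability, and the lexical probabilities are invented toy values.

```python
import math
from itertools import permutations


def phrase_score(e_words, f_words, lex_prob, floor=1e-6):
    """Return (best log score, best alignment) for translating e_words into f_words.

    score(a) = log P(a | l, m) + sum_i log P(f[a_i] | e_i)
    Assumptions: one-to-one alignments only, uniform P(a | l, m),
    and `floor` as the probability of any word pair missing from lex_prob.
    Alignments are 0-based tuples: a[i] is the position in f aligned to e_i.
    """
    l, m = len(e_words), len(f_words)
    if l > m:
        return float("-inf"), None  # no one-to-one alignment exists
    log_align = -math.log(math.perm(m, l))  # uniform over all such alignments
    best_score, best_a = float("-inf"), None
    for a in permutations(range(m), l):
        score = log_align + sum(
            math.log(lex_prob.get((e_words[i], f_words[a[i]]), floor))
            for i in range(l)
        )
        if score > best_score:
            best_score, best_a = score, a
    return best_score, best_a


# Toy lexical translation probabilities, invented for illustration.
toy_lex = {
    ("Ichthyosis", "魚鱗癬"): 0.9,
    ("Concern", "關懷"): 0.6,
    ("Association", "協會"): 0.8,
}
score, alignment = phrase_score(
    ["Ichthyosis", "Concern", "Association"], ["關懷", "魚鱗癬", "協會"], toy_lex
)
```

With these toy values the best alignment is `(1, 0, 2)` in 0-based positions, which corresponds to the slide's 1-based alignment (2, 1, 3).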

Estimating Lexical Translation Probability Based on Parallel Corpus
A word alignment module is adopted to automatically extract lexical translation probabilities (Wu and Chang 2003):
1. Develop a list of preferred part-of-speech (POS) patterns of collocations in both languages.
2. Match collocation candidates against the preferred POS patterns and apply N-gram statistics for both languages.
3. Apply the log-likelihood ratio statistic to pairs of consecutive words in both languages.
4. Finally, align content words using the Competitive Linking Algorithm (Melamed 1997).
To avoid introducing too much noise, only bilingual phrases with high probabilities are considered.
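The competitive linking step can be sketched roughly as below: candidate word pairs are visited in descending order of association score, and a pair is linked only if neither word has already been linked. The function name, the `min_score` cutoff, and the toy association scores are assumptions for illustration; the slides do not specify how the scores were computed for this step.

```python
def competitive_linking(scores, min_score=0.0):
    """Greedy one-to-one word linking in the spirit of competitive linking.

    scores: dict mapping (english_word, chinese_word) -> association score.
    Pairs at or below min_score are ignored (an assumed noise cutoff).
    Returns a list of (english_word, chinese_word, score) links.
    """
    links = []
    used_e, used_f = set(), set()
    for (e, f), s in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
        if s <= min_score:
            break  # everything after this point scores too low
        if e in used_e or f in used_f:
            continue  # each word may take part in at most one link
        links.append((e, f, s))
        used_e.add(e)
        used_f.add(f)
    return links


# Toy association scores, invented for illustration.
toy_scores = {
    ("Ichthyosis", "魚鱗癬"): 6.0,
    ("Concern", "關懷"): 5.0,
    ("Association", "協會"): 4.0,
    ("Concern", "協會"): 2.0,
    ("Association", "關懷"): 1.0,
}
links = competitive_linking(toy_scores)
```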

Estimating Lexical Translation Probability Based on Transliteration Model
– A Romanization system is adopted to represent a Chinese word.
– E and F are assumed to be an English word and a Romanized Chinese character sequence, respectively.
– The transliteration probability $P(F \mid E)$ can be approximated by decomposing E and F into transliteration units (TUs).
– A word E with l characters and a Romanized word F with m characters are denoted by $e_1 e_2 \ldots e_l$ and $f_1 f_2 \ldots f_m$, respectively.
– The mapping of (E, F) can be represented as a sequence of n matched TUs: $\{(u_1, v_1), (u_2, v_2), \ldots, (u_n, v_n)\}$.
– The alignment a between E and F can be represented as a sequence of match types $(m_1 m_2 \ldots m_n)$, where $m_i$ denotes the pair of lengths of $u_i$ and $v_i$.
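One way to picture the TU decomposition is a small dynamic program that searches for the highest-probability segmentation of (E, F) into matched units and reports the resulting match types. This is a sketch under assumptions: the TU probability table, the length limits `max_u` and `max_v`, and the example pair are all hypothetical, and the recurrence is a generic Viterbi-style search rather than the authors' exact formulation.

```python
import math


def best_tu_alignment(E, F, tu_prob, max_u=2, max_v=3):
    """Find the most probable segmentation of (E, F) into transliteration units.

    tu_prob: dict mapping (u, v) -> P(v | u); pairs not in the table are disallowed.
    Returns (log probability, list of (u, v) TUs, list of match types (len(u), len(v))).
    """
    l, m = len(E), len(F)
    neg_inf = float("-inf")
    best = [[neg_inf] * (m + 1) for _ in range(l + 1)]
    back = [[None] * (m + 1) for _ in range(l + 1)]
    best[0][0] = 0.0  # log(1): empty prefixes are trivially aligned
    for i in range(1, l + 1):
        for j in range(1, m + 1):
            for du in range(1, min(max_u, i) + 1):
                for dv in range(1, min(max_v, j) + 1):
                    p = tu_prob.get((E[i - du:i], F[j - dv:j]))
                    prev = best[i - du][j - dv]
                    if p is None or prev == neg_inf:
                        continue
                    cand = prev + math.log(p)
                    if cand > best[i][j]:
                        best[i][j] = cand
                        back[i][j] = (du, dv)
    if best[l][m] == neg_inf:
        return neg_inf, [], []
    tus, i, j = [], l, m
    while (i, j) != (0, 0):  # follow back-pointers to recover the TU sequence
        du, dv = back[i][j]
        tus.append((E[i - du:i], F[j - dv:j]))
        i, j = i - du, j - dv
    tus.reverse()
    return best[l][m], tus, [(len(u), len(v)) for u, v in tus]


# Hypothetical TU probabilities for E = "tom", F = "tangmu" (Romanized 湯姆).
toy_tu = {("t", "t"): 0.9, ("o", "ang"): 0.3, ("m", "mu"): 0.5,
          ("o", "an"): 0.1, ("m", "gmu"): 0.1}
log_p, tus, match_types = best_tu_alignment("tom", "tangmu", toy_tu)
```

Here the search prefers the segmentation t/t, o/ang, m/mu, giving match types (1,1), (1,3), (1,2).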

Estimating Lexical Translation Probability Based on Transliteration Model

NE alignment
1. g(0,0) = (value not preserved in this transcript)
Suppose that there is an entry $(e_i, w_f)$ in the bilingual dictionary. $Score_{lex}(f_{a_i} \mid e_i)$ is formulated as: (formula not preserved in this transcript)

Approximate Matching

CPNR
– Chinese surnames are used as anchor points.
– The Chinese personal name recognizer is applied only when the given NE is a person name and $Score_{tm}(R(f_{a_i}) \mid e_i)$ is less than the threshold $Thr_1$.
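A rough sketch of the surname-anchor idea with the gating condition above. The surname list is a tiny toy sample, and the assumptions that surnames are a single character and given names are one or two characters are mine, not stated on the slide; the function names and threshold values are likewise hypothetical.

```python
TOY_SURNAMES = {"陳", "林", "李", "張", "王"}  # toy sample, not the list used by the authors


def person_name_candidates(sentence, surnames=TOY_SURNAMES, max_given=2):
    """Propose name candidates: a surname followed by 1..max_given characters.

    Assumes single-character surnames and short given names.
    """
    candidates = []
    for i, ch in enumerate(sentence):
        if ch in surnames:  # the surname acts as the anchor point
            for n in range(1, max_given + 1):
                if i + 1 + n <= len(sentence):
                    candidates.append(sentence[i:i + 1 + n])
    return candidates


def maybe_apply_cpnr(ne_type, tm_score, thr1, sentence):
    """Run the recognizer only for person NEs whose transliteration score is below thr1."""
    if ne_type == "PER" and tm_score < thr1:
        return person_name_candidates(sentence)
    return []


# Hypothetical scores and threshold, for illustration only.
names = maybe_apply_cpnr("PER", tm_score=-9.0, thr1=-5.0, sentence="導演李安說")
```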

Training Data
– Noun phrases of the BDC Electronic Chinese-English Dictionary were used to train PTM.
– To train the transliteration model, 2,430 pairs of English names together with their Chinese transliterations and Chinese Romanization were used.
– The LDC Central News Agency Corpus was used to extract keywords of entity names for identifying NE types. We collected 117 bilingual keyword pairs from the corpora.
– A list of Chinese surnames was also gathered to help identify and extract PER-type NEs.
– The parallel corpus collected from Sinorama Magazine was used to construct the corpus-based lexicon and to estimate the lexical translation probabilities (LTP).

Experiments
– 275 aligned sentences were randomly selected from Sinorama, and answer keys were prepared manually.
– Each chosen aligned sentence contains at least one NE pair.
– Currently, the lengths of English NEs are restricted to less than 6.
– In total, 830 pairs of NEs were labeled. The numbers of NE pairs for types PER, LOC, and ORG are 208, 362, and 260, respectively.