A Joint Source-Channel Model for Machine Transliteration Li Haizhou, Zhang Min, Su Jian Institute for Infocomm Research 21 Heng Mui Keng Terrace, Singapore.

Presentation transcript:

A Joint Source-Channel Model for Machine Transliteration Li Haizhou, Zhang Min, Su Jian Institute for Infocomm Research 21 Heng Mui Keng Terrace, Singapore ACL 2004

2 Abstract This paper presents a new framework that allows direct orthographical mapping (DOM) between two different languages through a joint source-channel model, also called the n-gram TM model. The orthographic alignment process is automated to derive the aligned transliteration units from a bilingual dictionary. The approach improves transliteration accuracy over state-of-the-art machine learning algorithms.

3 Research Question Out-of-vocabulary words: transliterating English names into Chinese is not straightforward. For example, /Lafontant/ is transliterated into 拉丰唐 (La-Feng-Tang) while /Constant/ becomes 康斯坦特 (Kang-Si-Tan-Te), where the syllable /-tant/ in the two names is transliterated differently depending on each name's language of origin.

4 Research Question For a given English name E as the observed channel output, one seeks a posteriori the most likely Chinese transliteration C that maximizes P(C|E). Under Bayes rule this gives the noisy channel formulation
C* = argmax_C P(C|E) = argmax_C P(E|C) P(C)
– P(E|C): the probability of transliterating C into E through a noisy channel, also called the transformation rules.
– P(C): an n-gram language model of Chinese transliterations.
Likewise, in C2E back-transliteration,
E* = argmax_E P(E|C) = argmax_E P(C|E) P(E)
– P(E): an n-gram language model of English names.
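To make the noisy-channel formulation concrete, here is a minimal Python sketch that ranks candidate Chinese transliterations for an observed English name; channel_logprob and lm_logprob are hypothetical scoring functions standing in for the transformation rules P(E|C) and the n-gram language model P(C).

```python
def transliterate(english, candidates, channel_logprob, lm_logprob):
    """Pick the Chinese candidate C maximizing P(E|C) * P(C), i.e. the sum of
    the channel (transformation-rule) log score and the language-model log score."""
    return max(candidates,
               key=lambda c: channel_logprob(english, c) + lm_logprob(c))
```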

5 Research Question Phoneme-based approaches map E to C through an intermediate phonemic representation P:
– P(C|E) = P(C|P) P(P|E) and P(E|C) = P(E|P) P(P|C), where P is the phonemic representation.
– For example: Chinese characters -> Pinyin -> IPA; English characters -> IPA.
Orthographic alignment
– For example: one-to-many mappings.
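A minimal sketch of the phoneme-based decomposition above, assuming hypothetical p_given_e and c_given_p scorers and a phoneme_candidates generator; it shows how the two-stage pipeline chains probabilities through the intermediate phonemic string, which is where errors can compound.

```python
def phoneme_based_score(english, chinese, p_given_e, c_given_p, phoneme_candidates):
    """Approximate P(C|E) through an intermediate phonemic string P:
    score = max over P of P(C|P) * P(P|E); errors made in the E->P stage
    propagate into the P->C stage."""
    return max(c_given_p(chinese, p) * p_given_e(p, english)
               for p in phoneme_candidates(english))
```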

6 Research Question (Figure: example alignment between an English name and its Chinese transliteration.)

7 Machine Transliteration - Review Phoneme-based techniques:
– Transformation-based learning algorithms (Meng et al., 2001; Jung et al., 2000; Virga & Khudanpur, 2003).
– Finite state transducers that implement transformation rules (Knight & Graehl, 1998).
Web-based techniques (SIGIR 2004):
– Using the Web for Automated Translation Extraction (Ying Zhang, Phil Vines).
– Learning Phonetic Similarity for Matching Named Entity Translations and Mining New Translations (Wai Lam, Ruizhang Huang, Pik-Shan Cheung).
– Translating Unknown Queries with Web Corpora for Cross-Language Information Retrieval (Pu-Jen Cheng, Jei-Wen Teng, Ruei-Cheng Chen, Jenq-Haur Wang, Wen-Hsiang Lu, and Lee-Feng Chien).

8 Joint Source-Channel Model For K aligned transliteration units, we propose to estimate P(E,C) by a joint source-channel model.
– Suppose there exists an alignment γ with E = e_1 e_2 ... e_K and C = c_1 c_2 ... c_K; then the joint probability is
P(E,C) = P(E,C,γ) = P(<e,c>_1, <e,c>_2, ..., <e,c>_K)

9 Joint Source-Channel Model A transliteration unit correspondence <e,c> is called a transliteration pair.
– E2C transliteration can be formulated as C* = argmax_C P(E,C).
– Similarly, the C2E back-transliteration as E* = argmax_E P(E,C).
An n-gram transliteration model is defined as the conditional probability, or transliteration probability, of a transliteration pair <e,c>_k depending on its immediate n predecessor pairs:
P(E,C) ≈ Π_{k=1..K} P(<e,c>_k | <e,c>_{k-n+1}, ..., <e,c>_{k-1})
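The n-gram TM can be scored directly over aligned transliteration pairs. Below is a hedged Python sketch of a bigram (n=2) version; unigram_prob and bigram_prob are hypothetical lookups assumed to be estimated from the aligned bilingual dictionary, and the example segmentation of /Lafontant/ follows the units from slide 3.

```python
import math

def ngram_tm_logprob(pairs, unigram_prob, bigram_prob):
    """Log P(E,C) under a bigram transliteration model over transliteration
    pairs, e.g. pairs = [("la", "拉"), ("fon", "丰"), ("tant", "唐")]."""
    logp = math.log(unigram_prob(pairs[0]))               # P(<e,c>_1)
    for prev, cur in zip(pairs, pairs[1:]):
        logp += math.log(bigram_prob(cur, prev))          # P(<e,c>_k | <e,c>_{k-1})
    return logp
```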

10 Transliteration alignment A bilingual dictionary contains entries mapping English names to their respective Chinese transliterations.
– To obtain the n-gram statistics, the bilingual dictionary needs to be aligned at the transliteration-unit level.
– The alignment is found by a maximum likelihood approach, through the EM algorithm.
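A hedged sketch of the alignment training loop: a hard-EM (Viterbi-EM) variant in Python under the simplifying assumption that each transliteration pair covers exactly one Chinese character and at most MAX_E English letters; pair_logprob is the current model's log probability for a pair. The paper's procedure is the full maximum likelihood EM mentioned above; this sketch only illustrates the idea.

```python
import math
from collections import defaultdict

MAX_E = 4  # assumed maximum number of English letters per transliteration unit

def viterbi_align(eng, chi, pair_logprob):
    """Best segmentation of one (English, Chinese) dictionary entry into
    transliteration pairs, assuming each pair maps 1..MAX_E English letters
    to exactly one Chinese character (a simplifying assumption)."""
    n, m = len(eng), len(chi)
    best = {(0, 0): (0.0, None)}          # (i, j) -> (score, backpointer)
    for i in range(n + 1):
        for j in range(m + 1):
            if (i, j) not in best or j == m:
                continue
            score, _ = best[(i, j)]
            for l in range(1, MAX_E + 1):
                if i + l > n:
                    break
                pair = (eng[i:i + l], chi[j])
                cand = score + pair_logprob(pair)
                key = (i + l, j + 1)
                if key not in best or cand > best[key][0]:
                    best[key] = (cand, (i, j, pair))
    if (n, m) not in best:
        return None                        # no alignment under the constraints
    pairs, key = [], (n, m)
    while best[key][1] is not None:
        i, j, pair = best[key][1]
        pairs.append(pair)
        key = (i, j)
    return list(reversed(pairs))

def em_iteration(dictionary, pair_logprob):
    """One hard-EM round: align every entry with the current model (E-step),
    then re-estimate pair probabilities from the alignment counts (M-step)."""
    counts = defaultdict(float)
    for eng, chi in dictionary:
        alignment = viterbi_align(eng, chi, pair_logprob)
        if alignment:
            for pair in alignment:
                counts[pair] += 1.0
    total = sum(counts.values())
    if not total:
        return {}
    return {pair: count / total for pair, count in counts.items()}
```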

11 Transliteration alignment (Figure: example of a dictionary entry segmented into transliteration pairs under alignment γ.)

12 The experiments Database: the bilingual dictionary "Chinese Transliteration of Foreign Personal Names" (Xinhua News Agency).
– 37,694 unique English entries and their official Chinese transliterations.
– Split into 13 subsets for 13-fold open tests.
– 13-fold open tests: error rates show < 1% deviation across folds.
– The EM iteration converges at the 8th round.
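A small sketch of the 13-fold open-test split described above; each subset is held out once as the test set while the other twelve are used for training.

```python
def k_fold_splits(entries, k=13):
    """Yield (train, test) splits: fold i is the open test set, the rest train."""
    folds = [entries[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [e for j, fold in enumerate(folds) if j != i for e in fold]
        yield train, test
```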

13 The experiments For a test set W composed of V names:
– Each name has been aligned into a sequence of transliteration pair tokens.
– The cross-entropy H_p(W) of a model on data W is defined as H_p(W) = -(1/W_T) log_2 p(W), where p(W) is the probability the model assigns to the test set and W_T is the total number of aligned transliteration pair tokens in the data W.
– The perplexity: PP_p(W) = 2^{H_p(W)}.
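A short sketch of the cross-entropy and perplexity computation defined above; name_logprobs are assumed to be the model's natural-log probabilities for each aligned test name, and total_pair_tokens is W_T.

```python
import math

def cross_entropy_and_perplexity(name_logprobs, total_pair_tokens):
    """H_p(W) = -(1/W_T) * log2 p(W) over the test set W; perplexity = 2**H."""
    log2_p_w = sum(name_logprobs) / math.log(2.0)   # convert natural log to log2
    h = -log2_p_w / total_pair_tokens
    return h, 2.0 ** h
```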

14 The experiments

15 The experiments In the word error report, a word is considered correct only if the transliteration exactly matches the reference. The character error rate is the sum of deletion, insertion and substitution errors, measured against the reference characters.
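The two error metrics can be computed as in the sketch below: a word error is any non-exact match, and character errors are counted with Levenshtein edit distance, normalized here by the reference length (a standard CER convention, assumed rather than taken from the slide).

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: minimum number of deletions, insertions and
    substitutions needed to turn hyp into ref."""
    prev_row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur_row = [i]
        for j, h in enumerate(hyp, 1):
            cur_row.append(min(prev_row[j] + 1,             # deletion
                               cur_row[j - 1] + 1,           # insertion
                               prev_row[j - 1] + (r != h)))  # substitution
        prev_row = cur_row
    return prev_row[-1]

def word_and_char_error_rates(references, hypotheses):
    """Word error rate: fraction of names that are not an exact match.
    Character error rate: total edit errors over total reference characters."""
    word_errors = sum(r != h for r, h in zip(references, hypotheses))
    char_errors = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total_chars = sum(len(r) for r in references)
    return word_errors / len(references), char_errors / total_chars
```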

16 The experiments 5,640 transliteration pairs are cross mappings between 3,683 English and 374 Chinese units.
– Each English unit has on average 1.53 (= 5,640/3,683) Chinese correspondences.
– Each Chinese unit has on average 15.1 (= 5,640/374) English back-transliteration units.
Monolingual language perplexity
– Measured for Chinese and English independently using n-gram language models.

17 The experiments

18 Discussions

19 Conclusions In this paper, we propose a new framework (DOM) for transliteration; the n-gram TM is a successful realization of the DOM paradigm. It generates probabilistic orthographic transformation rules using a data-driven approach. By skipping the intermediate phonemic interpretation, the transliteration error rate is reduced significantly. Place and company names:
– /Victoria-Fall/ becomes 维多利亚瀑布 (Pinyin: Wei Duo Li Ya Pu Bu).
– The framework can also be easily extended to handle such name translation.

20 Ca Va