Machine Transliteration. Bhargava Reddy, 110050078, B.Tech 4th year UG.


Contents
Fundamental definition of machine transliteration
History of machine transliteration
Modelling of the transliteration
Bridge transliteration system
Syllabification and the use of CRFs
Substring alignment and re-ranking methods
Using hybrid models

Definition of Machine Transliteration
Conversion of a given name in the source language to a name in the target language such that the target name:
1. Is phonemically equivalent to the source name
2. Conforms to the phonology of the target language
3. Matches the user intuition of the equivalent of the source language name in the target language, considering the culture and orthographic character usage in the target language
Note that each of these criteria captures equivalence in its own way.
Ref: Report of NEWS 2012 Machine Transliteration Shared Task

Brief History of the Work Carried Out
Early models for machine transliteration:
1. Grapheme-based transliteration model (ψ_G)
2. Phoneme-based transliteration model (ψ_P)
3. Hybrid transliteration model (ψ_H)
4. Correspondence-based transliteration model (ψ_C)
ψ_G is known as the direct method because it transforms source-language graphemes directly into target-language graphemes
ψ_P is called the pivot method because it uses source-language phonemes as a pivot when producing target-language graphemes
Ref: A Comparison of Different Machine Transliteration Models, 2006
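As a schematic formulation (my notation, not taken from the cited paper): writing the source name as aligned grapheme units S = s_1 ... s_n and the target name as T = t_1 ... t_n, the direct grapheme model scores roughly as Pr_G(T | S) ≈ ∏_i Pr(t_i | s_i, context), while the pivot phoneme model goes through a pronunciation P, e.g. Pr_P(T | S) ≈ max_P Pr(P | S) · Pr(T | P).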

Hybrid and Correspondence Models
Combining the grapheme-based and phoneme-based models resulted in ψ_H and ψ_C
ψ_H directly combines the phoneme-based transliteration probability Pr(ψ_P) and the grapheme-based transliteration probability Pr(ψ_G) using linear interpolation (the dependence between them is not considered)
ψ_C makes use of the correspondence between a source grapheme and a source phoneme when producing target-language graphemes
Ref: 1. Improving back-transliteration by combining information sources; 2. An English-Korean transliteration model using pronunciation and contextual rules
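As a worked form of the linear interpolation mentioned above (λ and the conditioning on the source name S are my notation, not from the slide): Pr_H(T | S) = λ · Pr_G(T | S) + (1 − λ) · Pr_P(T | S), with 0 ≤ λ ≤ 1 tuned on held-out data.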

Graphical Representation Ref: A Comparison of Different Machine Transliteration Models, 2006

Modelling the Components
Maximum entropy model (MEM) is a widely used probability model that can incorporate heterogeneous information effectively, and is therefore used in the hybrid model
Decision-tree learning is used for creating the training set for the models
Memory-based learning (MBL), also known as “instance-based learning” and “case-based learning”, is an example-based learning method; it is useful for computing φ_(SP)T
Ref: A Comparison of Different Machine Transliteration Models, 2006
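To illustrate how a maximum entropy model can combine heterogeneous features, here is a minimal Python sketch. It uses scikit-learn's LogisticRegression (multinomial logistic regression, equivalent to a MaxEnt classifier) with toy, invented context features and data, so it illustrates the idea rather than the setup used in the cited paper.

# Minimal MaxEnt-style sketch: predict a target grapheme from source-side context.
# Toy data and feature names are illustrative, not from the cited paper.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Each training example: features of the current source grapheme and its context.
train_feats = [
    {"cur": "r", "prev": "<s>", "next": "a"},
    {"cur": "a", "prev": "r", "next": "m"},
    {"cur": "m", "prev": "a", "next": "</s>"},
]
train_labels = ["R", "AA", "M"]                  # toy target units

vec = DictVectorizer()
X = vec.fit_transform(train_feats)
clf = LogisticRegression(max_iter=1000)          # multinomial logistic = MaxEnt
clf.fit(X, train_labels)

test = vec.transform([{"cur": "a", "prev": "r", "next": "m"}])
print(clf.predict(test))                         # -> ['AA']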

Study of MT through Bridging Languages
Data is available between a language pair for one of the following three reasons:
1. Politically related languages: due to the political dominance of English, it is easy to obtain parallel name data between English and most languages
2. Genealogically related languages: languages sharing the same origin, which might have significant overlap between their phonemes and graphemes
3. Demographically related languages: e.g. Hindi and Telugu, which might not have the same origin, but due to shared culture and demographics there will be similarities
Ref: Everybody loves a rich cousin: An empirical study of transliteration through bridge languages. NAACL 2010

Bridge Transliteration Methodology Ref: Everybody loves a rich cousin: An empirical study of transliteration through bridge languages. NAACL 2010

Results for the Bridge System
We must remember that machine transliteration is a lossy conversion
In the bridge system we can therefore expect some loss of information, and thus the accuracy scores will drop
The results show a drop in accuracy of about 8-9% (ACC1) and about 1-3% (mean F-score)
The NEWS 2009 dataset was used for training and evaluation
Ref: Everybody loves a rich cousin: An empirical study of transliteration through bridge languages. NAACL 2010

Stepping through an intermediate language Ref: Everybody loves a rich cousin: An empirical study of transliteration through bridge languages. NAACL 2010
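The bridge idea can be sketched as composing two trained transliteration systems: transliterate source to bridge, then bridge to target, and combine the candidate scores. The Python sketch below uses hypothetical stand-in functions and toy probabilities; it only illustrates the composition, not the actual system of Khapra et al.

# Compose source->English and English->target transliteration through the bridge language.
from heapq import nlargest

def bridge_transliterate(name, src_to_bridge, bridge_to_tgt, k=5):
    # src_to_bridge / bridge_to_tgt are hypothetical stand-ins for trained models;
    # each returns a list of (candidate, probability) pairs.
    scored = {}
    for bridge_cand, p1 in src_to_bridge(name, k):
        for tgt_cand, p2 in bridge_to_tgt(bridge_cand, k):
            # Independence assumption: multiply the probabilities of the two steps.
            scored[tgt_cand] = max(scored.get(tgt_cand, 0.0), p1 * p2)
    return nlargest(k, scored.items(), key=lambda kv: kv[1])

# Toy usage with made-up candidate lists and probabilities:
src_to_en = lambda name, k: [("sachin", 0.7), ("sacheen", 0.2)]
en_to_tgt = lambda name, k: [(name + "_tgt1", 0.6), (name + "_tgt2", 0.3)]
print(bridge_transliterate("source-name", src_to_en, en_to_tgt, k=3))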

Syllabification?
There is no gold-standard syllable segmentation
Yang et al. (2009) applied an N-gram joint source-channel model and the EM algorithm
Aramaki and Abekawa (2009) made use of the word alignment tool GIZA++ to obtain a syllable segmentation and alignment corpus from the given training data
Yang et al. (2010) proposed a joint optimization method to reduce the propagation of alignment errors
The paper performed syllabification of Chinese words

Forward-Backward Machine Transliteration between English and Chinese Based on Combined CRFs
The transliteration is implemented as a two-phase CRF
The first CRF splits the word into chunks (similar to syllabification)
The second CRF labels which target characters the chunks are transliterated into
The final transliteration is the sequence of all the target characters
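The first phase can be sketched with the sklearn-crfsuite library: a CRF tags each source letter with B/I chunk labels as a stand-in for syllable-like segmentation; a second CRF of the same shape would then map each chunk to target characters. The feature set and toy segmentations below are my assumptions, not the features of the cited paper.

# Phase-1 CRF sketch: chunk an English name into syllable-like units with B/I tags.
# Requires: pip install sklearn-crfsuite. Toy data only; features are illustrative.
import sklearn_crfsuite

def char_features(word, i):
    return {
        "char": word[i],
        "prev": word[i - 1] if i > 0 else "<s>",
        "next": word[i + 1] if i < len(word) - 1 else "</s>",
        "is_vowel": word[i] in "aeiou",
    }

def featurize(word):
    return [char_features(word, i) for i in range(len(word))]

X_train = [featurize("london"), featurize("paris")]
y_train = [["B", "I", "I", "B", "I", "I"],      # lon|don  (toy segmentation)
           ["B", "I", "B", "I", "I"]]           # pa|ris

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X_train, y_train)
print(crf.predict([featurize("madrid")]))       # e.g. [['B', 'I', ...]]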

Using CRF Models for MT
A Hindi-to-English machine transliteration model using CRFs has been proposed by Manikrao, Shantanu and Tushar in the paper “Hindi to English Machine Transliteration of Named Entities Using CRFs”, International Journal of Computer Applications, June 2012
The description is shown as follows:

Model Flow

Results of the CRF model proposed

English-Korean Transliteration Using Substring Alignment and Re-ranking Methods
Chun-Kai Wu, Yu-Chun Wang and Richard Tzong-Han Tsai described their approach in their paper. It consists of four parts (a re-ranking sketch follows below):
1. Pre-processing
2. Letter-to-phoneme alignment
3. DirecTL-p training
4. Re-ranking results
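As an illustration of the re-ranking step, here is a minimal Python sketch that rescores an n-best candidate list with a weighted combination of feature scores. The candidates, feature functions and weights are invented for illustration and are not the ones used by Wu et al.

# Re-ranking sketch: rescore n-best transliteration candidates with extra features.
def rerank(candidates, feature_fns, weights):
    """candidates: list of (string, base_model_score) pairs."""
    rescored = []
    for cand, base_score in candidates:
        score = weights["base"] * base_score
        for name, fn in feature_fns.items():
            score += weights[name] * fn(cand)
        rescored.append((cand, score))
    return sorted(rescored, key=lambda cs: cs[1], reverse=True)

# Toy usage: base scores stand in for DirecTL-p model scores.
nbest = [("kimcheolsu", -4.2), ("kimchulsoo", -4.0)]
features = {"length_penalty": lambda c: -abs(len(c) - 10)}
print(rerank(nbest, features, {"base": 1.0, "length_penalty": 0.5}))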

Hybrid Models
Dong Yang, Paul Dixon, Yi-Cheng Pan, Tasuku Oonishi, Masanobu Nakamura and Sadaoki Furui of the Department of Computer Science at the Tokyo Institute of Technology combined the two-step CRF model with a joint source-channel model for machine transliteration
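A minimal sketch of one way such a combination could work: linearly interpolate the normalized n-best scores from the two component models. The interpolation weight and helper names are assumptions for illustration, not the exact combination used in the cited paper.

# Combine n-best lists from two models by linear interpolation of normalized scores.
def combine_nbest(crf_nbest, jsc_nbest, lam=0.5):
    """Each n-best list: dict mapping candidate -> probability-like score."""
    def normalize(d):
        total = sum(d.values()) or 1.0
        return {cand: s / total for cand, s in d.items()}
    crf_n, jsc_n = normalize(crf_nbest), normalize(jsc_nbest)
    cands = set(crf_n) | set(jsc_n)
    combined = {c: lam * crf_n.get(c, 0.0) + (1 - lam) * jsc_n.get(c, 0.0)
                for c in cands}
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)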

References
Report of NEWS 2012 Machine Transliteration Shared Task (2012). Min Zhang, Haizhou Li, A Kumaran and Ming Liu. ACL 2012
A Comparison of Different Machine Transliteration Models (2006)
Improving back-transliteration by combining information sources (2004). Bilac, S., & Tanaka, H. In Proceedings of IJCNLP 2004, pp. 542–547
An English-Korean transliteration model using pronunciation and contextual rules (2002). Oh, J. H., & Choi, K. S. In Proceedings of COLING 2002, pp. 758–764
Everybody loves a rich cousin: An empirical study of transliteration through bridge languages (2010). Mitesh M. Khapra, A Kumaran, Pushpak Bhattacharyya. NAACL 2010

References
Forward-Backward Machine Transliteration between English and Chinese based on Combined CRFs (2011). Ying Qin, Guohua Chen
Hindi to English Machine Transliteration of Named Entities Using CRFs (2012). Manikrao, Shantanu and Tushar. International Journal of Computer Applications, June 2012
English-Korean named entity transliteration using substring alignment and re-ranking methods (2012). Chun-Kai Wu, Yu-Chun Wang, and Richard Tzong-Han Tsai. In Proc. Named Entities Workshop at ACL 2012
Combining a two-step CRF and a joint source channel model for machine transliteration (2009). D. Yang, P. Dixon, Y.-C. Pan, T. Oonishi, M. Nakamura and S. Furui. In NEWS ’09: Proceedings of the 2009 Named Entities Workshop: Shared Task on Transliteration