Automatic Transliteration of Proper Nouns from Arabic to English
Mehdi M. Kashani, Fred Popowich, Anoop Sarkar
Simon Fraser University, Vancouver, BC


1 Automatic Transliteration of Proper Nouns from Arabic to English
Mehdi M. Kashani, Fred Popowich, Anoop Sarkar
Simon Fraser University, Vancouver, BC
July 22, 2007
Second Workshop on Computational Approaches to Arabic Script-based Languages

2 Overview
Problem Definition and Challenges
Related Work
Our Approach
Evaluation
Discussion

3 Transliteration
Translation tools facilitate dialogue across cultures: source language → target language.
Transliteration is a subtask dealing with transcribing a word written in one writing system into another writing system.
Forward transliteration: محمد → Mohammed, Mohammad, Mohamed, Muhammad, …
Backward transliteration: روبرت → Robert
Our task: Arabic to English (for machine translation)

4 Challenges
Not a 1-to-1 relationship: کاترين can be the equivalent of both Catherine and Katharine. Context can disambiguate: Katharine Hepburn.
Lack of diacritics in Arabic writing:
Long vowels (ا و ی) are always explicitly written.
Short vowels are omitted in writing: مُحَمِد → محمد, as if Mohammed were written Mhmmd.
Lack of certain sounds in Arabic: Popowich → Bobowij
Pronunciation depends on the letter's position in the word: how is ي pronounced at the beginning of a word? How is it usually pronounced in the middle or at the end?
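The short-vowel problem above can be simulated directly: Arabic short vowels (harakat) are Unicode combining marks, so stripping them reproduces the undiacritized written form. This is a minimal sketch for illustration, not part of the presented system:

```python
import unicodedata

def strip_diacritics(word: str) -> str:
    """Remove Arabic short-vowel marks (harakat), which are Unicode
    combining marks (category 'Mn'), leaving only the base letters."""
    return "".join(ch for ch in word
                   if unicodedata.category(ch) != "Mn")

# The vocalized form of "Mohammed" loses its short vowels:
print(strip_diacritics("مُحَمِد"))  # محمد
```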

5 Convention
Arabic uses a cursive script, but not all letters connect: ابراهيم → ا ب ر ا ه ی م
Written right to left.
From now on, Arabic words are shown letter by letter and from left to right:
ابراهيم → ا ب ر ا ه ی م (e b r a h i m)

6 Overview
Problem Definition and Challenges
Related Work
Our Approach
Evaluation
Discussion

7 Related Work
Stalls and Knight (1998): Arabic to English using a noisy channel model for phonemes.
Al-Onaizan and Knight (2002): combining phonetic- and spelling-based methods; they show a spelling-based approach works better than a phonetic approach.
Using parallel corpora (Samy et al., 2005) or comparable corpora (Sproat et al., 2006; Klementiev and Roth, 2006) to discover transliterations: not very useful for the machine translation task.

8 Overview
Problem Definition and Challenges
Related Work
Our Approach
Evaluation
Discussion

9 Our Approach
Consists of three phases:
Phase 1 (generative): ignore diacritics, simply turn the Arabic letters into English letters. م ح م د → m h mm d
Phase 2 (generative): use the best candidates from phase 1 to guess the omitted short vowels. م ح م د & m h mm d → mo ha mm d
Phase 3 (comparative): compare the best candidates from phase 2 with entries in a monolingual dictionary. mo ha mm d → mohammd → mohammed, muhammed, …

10 Training Data Preparation
Extract name pairs from two different sources:
Named entities annotated in the LDC Arabic Treebank 3
An Arabic-English parallel news corpus tagged by an entity tagger
In total, 9660 pairs are prepared.

11 Tools
GIZA++ is used for alignment:
Implementation of IBM Model 4
Output files are used to rearrange letters
Alignment score is used to filter out noise
Cambridge Language Model Toolkit
For us to use these tools, our words are treated as "sentences" and our letters are treated as "words".

12 Preprocessing
Noise filtering:
GIZA++ is run on the character-level training data.
Bad pairs have low alignment scores and are filtered out; the 9660 pairs are reduced to 4255 pairs.
Normalizing the training data:
Convert names to lower case.
Put spaces between a word's letters.
Add a prefix (B) and suffix (E) to names (example shown as if we were actually dealing with English).
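The normalization step can be sketched in a few lines; the B/E boundary markers and per-letter spacing are as described on the slide, while the function name is my own:

```python
def normalize(name: str) -> str:
    """Lowercase a name, add B/E boundary markers, and separate letters
    with spaces so the alignment toolkit treats each letter as a 'word'
    and the whole name as a 'sentence'."""
    letters = ["B"] + list(name.lower()) + ["E"]
    return " ".join(letters)

print(normalize("Mohammed"))  # B m o h a m m e d E
```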

13 Preprocessing
Run GIZA++ with Arabic as the source and English as the target.
The most frequent sequences of English letters aligned to the same Arabic letter are added to the alphabet.
Apply the new alphabet to the training data.
Example alignment: م ح م د ↔ m o h a m m e d
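The alphabet-extension idea can be sketched by counting, over extracted alignments, which multi-letter English sequences link to a single Arabic letter. The data format and frequency threshold here are hypothetical, for illustration only:

```python
from collections import Counter

def extend_alphabet(alignments, min_count=2):
    """alignments: iterable of (arabic_letter, english_sequence) pairs
    extracted from alignment output. Multi-letter English sequences
    occurring at least min_count times become new 'letters'."""
    counts = Counter(seq for _, seq in alignments if len(seq) > 1)
    return {seq for seq, n in counts.items() if n >= min_count}

# Toy data: م aligned to "mm" twice, ش aligned to "sh" once.
pairs = [("م", "mm"), ("م", "mm"), ("ش", "sh"), ("م", "m")]
print(extend_alphabet(pairs))  # {'mm'}
```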

14 Phase 1
Run GIZA++ with Arabic as the source and English as the target.
Remove English letters aligned to null from the training set: م ح م د ↔ m o h a mm e d

15 Phase 1
Translation model: run GIZA++ with English as the source and Arabic as the target.
Language model: run the Cambridge LM toolkit on the English training set. Use unigram and bigram models for Viterbi training and a trigram model for rescoring.
E* = argmax_E P(A|E) P(E), where A = a_0 … a_I and E = e_0 … e_J, with factors P(e_j | e_j-1) P(a_i | e_j).
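A toy version of this decoding step can be written as a character-level Viterbi search combining a channel model P(a|e) with a bigram language model P(e|prev). This is a simplified sketch (one English letter per Arabic letter; the real system handles insertions via the extended alphabet and nulls), not the authors' implementation:

```python
import math

def viterbi(arabic, channel, bigram, alphabet):
    """Return the English letter sequence E maximizing P(A|E)P(E).
    channel[(a, e)] = P(a|e); bigram[(prev, e)] = P(e|prev);
    'B' is the start-of-name marker from the normalization step."""
    # best[e] = (log-prob of best path ending in letter e, that path)
    best = {"B": (0.0, [])}
    for a in arabic:
        new_best = {}
        for e in alphabet:
            p_ch = channel.get((a, e), 0.0)
            if p_ch == 0.0:
                continue
            for prev, (lp, path) in best.items():
                p_lm = bigram.get((prev, e), 0.0)
                if p_lm == 0.0:
                    continue
                score = lp + math.log(p_lm) + math.log(p_ch)
                if e not in new_best or score > new_best[e][0]:
                    new_best[e] = (score, path + [e])
        best = new_best
    return max(best.values())[1]

# Toy model transliterating م ح م د letter by letter:
alphabet = ["m", "h", "d"]
channel = {("م", "m"): 0.9, ("ح", "h"): 0.9, ("د", "d"): 0.9}
bigram = {("B", "m"): 0.5, ("m", "h"): 0.5, ("h", "m"): 0.5, ("m", "d"): 0.5}
print(viterbi("محمد", channel, bigram, alphabet))  # ['m', 'h', 'm', 'd']
```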

16 Phase 1
Beam search decoding is used, with relative threshold pruning.
The k best candidates are returned.
(Slide shows a search lattice of candidate English letters for محمد, e.g. Bp, Bm, Bs, Bsh at the first position through dhE, fE, kE, dE at the last.)

17 Phase 2
Instead of removing the letters aligned to null, they are concatenated to their first immediate neighbor.
New letters (phrases) are formed.
New translation and language models are created using the new training set.
Example: م ح م د ↔ m o h a mm e d

18 Phase 2
Use the phase 1 candidates.
Phase 1 candidates: e_0 | e_1 | … | e_n
Phase 2 phrases: p_0 | p_1 | … | p_n
All probabilities P(a_i | p_i) where p_i is not prefixed by the given e_i are set to zero.
The rest is similar to phase 1.

19 Phase 2
The same decoding technique is applied.
For each candidate of phase 1, l new names are generated → kl candidates overall.
New combined score: NewScore = log(S1) + log(S2)
(Slide shows the phase 2 lattice for م/m ح/h م/mm د/d, with candidate phrases such as Bmo, ha, mm, dE.)

20 Phase 3
A dictionary of first and last names: US Census Bureau, OAK System.
All the entries are stripped of their vowels: Francisco → frncsc
Stripped versions of the candidates are compared to the stripped versions of the dictionary entries.
If matched, the Levenshtein (edit) distance between the original names is computed.
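The vowel-stripped matching plus edit-distance ranking can be sketched as follows; the function names and toy dictionary are mine, the technique (match on consonant skeletons, rank by Levenshtein distance on full spellings) is from the slide:

```python
def strip_vowels(name: str) -> str:
    """Drop English vowels, keeping the consonant skeleton."""
    return "".join(c for c in name if c not in "aeiou")

def edit_distance(s: str, t: str) -> int:
    """Standard Levenshtein distance via two-row dynamic programming."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (cs != ct)))   # substitution
        prev = cur
    return prev[-1]

def dictionary_matches(candidate, dictionary):
    """Return dictionary entries whose vowel-stripped forms match the
    candidate's, ordered by edit distance on the full spellings."""
    key = strip_vowels(candidate)
    hits = [(edit_distance(candidate, w), w)
            for w in dictionary if strip_vowels(w) == key]
    return [w for _, w in sorted(hits)]

names = ["mohammed", "muhammad", "francisco", "frank"]
print(dictionary_matches("mohammd", names))  # ['mohammed', 'muhammad']
```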

21 Phase 3

22 Word Filtering
To avoid adding every output that the HMM generates, a word filtering step is necessary.
Web filtering: requires online queries for each execution; not suitable for most offline tasks.
Language model filtering: requires a rich, up-to-date language model.
The Google unigram model is used: over 13 million words, each with frequency over 200 on the internet.
A huge FSA is built, and HMM candidates accepted by the FSA remain in the system.
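Functionally, the unigram FSA is a membership test over the high-frequency vocabulary; a set stands in for the FSA in this sketch (the real automaton is far more memory-efficient at this scale). Names and the toy counts are illustrative:

```python
def build_filter(unigram_counts, min_freq=200):
    """Keep words seen at least min_freq times, mirroring the slide's
    frequency cutoff; the resulting set plays the role of the FSA."""
    return {w for w, c in unigram_counts.items() if c >= min_freq}

def filter_candidates(candidates, vocab):
    """Retain only HMM candidates accepted by the vocabulary filter."""
    return [c for c in candidates if c in vocab]

vocab = build_filter({"mohammed": 1_000_000, "mohammd": 50})
print(filter_candidates(["mohammed", "mohammd"], vocab))  # ['mohammed']
```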

23 Score
Final Score = αS + βD + γR
S is the combined Viterbi score from the last two phases.
D is the Levenshtein distance.
R is the number of repetitions.
All kl outputs from phase 2 are among the final outputs, to accommodate names not found in the dictionary (LD = 0).
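As a worked instance of the weighted combination, with illustrative weights (the paper's tuned α, β, γ values are not given on this slide; note a higher edit distance should hurt, so its weight is negative here):

```python
def final_score(s, d, r, alpha=1.0, beta=-1.0, gamma=0.5):
    """Combine Viterbi score S, Levenshtein distance D, and repetition
    count R into one ranking score: alpha*S + beta*D + gamma*R."""
    return alpha * s + beta * d + gamma * r

# A candidate with Viterbi score 10, edit distance 2, seen 3 times:
print(final_score(10, 2, 3))  # 9.5
```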

24 Overview
Problem Definition and Challenges
Related Work
Our Approach
Evaluation
Discussion

25 Test Data Preparation
Transliteration pairs extracted from Arabic Treebank 2:
First 300 pairs as the development test set
Second 300 pairs as the blind test set
Explicit translations and wrong pairs are filtered out manually:
273 pairs for the development test set
291 pairs for the blind test set

26 Distribution of Names
Two tables on this slide:
Distribution of seen and unseen names (columns: Seen, Unseen, Total; rows: Dev Set, Blind Set)
Number of alternatives for names (columns: One, Two, Three, Four; rows: Dev Set, Blind Set)

27 Performance on Dev

                   Top 1   Top 2   Top 5   Top 10   Top 20
Single-phase HMM    44%     59%     73%     81%      85%
Double-phase HMM    45%     60%     72%     84%      88%
HMM + Dict.         52%     64%     73%     84%      88%

28 Performance on Blind

                   Top 1   Top 2   Top 5   Top 10   Top 20
Single-phase HMM    38%     54%     72%     80%      83%
Double-phase HMM    41%     57%     75%     82%      85%
HMM + Dict.         46%     61%     76%     84%      86%

29 Overview
Problem Definition and Challenges
Related Work
Our Approach
Evaluation
Discussion

30 Discussion
Does the use of a dictionary help a lot?
You can never have enough training data; rare alignments occur: N i e t z s c h e → ن ی ت ش ه
Issues with names of different origins depend on the task.
Appropriate for incorporation into an MT system.
Addresses issues introduced in the introduction: absence of short vowels (3), ambiguity resolution (4).

31 Questions?