CS460/626 : Natural Language Processing/Speech, NLP and the Web Lecture 33: Transliteration Pushpak Bhattacharyya CSE Dept., IIT Bombay 8 th Nov, 2012.

Slides:

Advertisements

Similar presentations

CS460/626 : Natural Language Processing/Speech, NLP and the Web (Lecture 18– Alignment in SMT and Tutorial on Giza++ and Moses) Pushpak Bhattacharyya CSE.

Advertisements

Pushpak Bhattacharyya CSE Dept., IIT Bombay 31st March, 2011

Spelling Correction for Search Engine Queries Bruno Martins, Mario J. Silva In Proceedings of EsTAL-04, España for Natural Language Processing Presenter:

Building an ASR using HTK CS4706

Pronunciation Modeling Lecture 11 Spoken Language Processing Prof. Andrew Rosenberg.

Statistical Machine Translation Part II: Word Alignments and EM Alexander Fraser Institute for Natural Language Processing University of Stuttgart

Statistical Machine Translation Part II: Word Alignments and EM Alexander Fraser ICL, U. Heidelberg CIS, LMU München Statistical Machine Translation.

Statistical Machine Translation Part II – Word Alignments and EM Alex Fraser Institute for Natural Language Processing University of Stuttgart

Large-Scale Entity-Based Online Social Network Profile Linkage.

CS460/626 : Natural Language Processing/Speech, NLP and the Web (Lecture 35– Phonetics and phonology; syllabification) Pushpak Bhattacharyya CSE Dept.,

Speech Recognition Part 3 Back end processing. Speech recognition simplified block diagram Speech Capture Speech Capture Feature Extraction Feature Extraction.

SEARCHING QUESTION AND ANSWER ARCHIVES Dr. Jiwoon Jeon Presented by CHARANYA VENKATESH KUMAR.

Pushpak Bhattacharyya CSE Dept. IIT Bombay 1st Nov, 2012

FIRE 2013 By:- Hardik Joshi 1, Apurva Bhatt 1, Honey Patel 2 1 Department of Computer Science, Gujarat.

Unsupervised Turkish Morphological Segmentation for Statistical Machine Translation Coskun Mermer and Murat Saraclar Workshop on Machine Translation and.

Properties of Text CS336 Lecture 3:. 2 Information Retrieval Searching unstructured documents Typically text –Newspaper articles –Web pages Other documents.

1 Statistical NLP: Lecture 13 Statistical Alignment and Machine Translation.

Machine Transliteration Bhargava Reddy B.Tech 4 th year UG.

Evaluation of Hindi→English, Marathi→English and English→Hindi CLIR at FIRE 2008 Nilesh Padariya, Manoj Chinnakotla, Ajay Nagesh and Om P. Damani Center.

MACHINE TRANSLATION AND MT TOOLS: GIZA++ AND MOSES -Nirdesh Chauhan.

1 1 Automatic Transliteration of Proper Nouns from Arabic to English Mehdi M. Kashani, Fred Popowich, Anoop Sarkar Simon Fraser University Vancouver, BC.

SI485i : NLP Set 3 Language Models Fall 2012 : Chambers.

Machine Transliteration T BHARGAVA REDDY (Knowledge sharing)

CS460/626 : Natural Language Processing/Speech, NLP and the Web (Lecture 18– Training and Decoding in SMT System) Kushal Ladha M.Tech Student CSE Dept.,

Word-subword based keyword spotting with implications in OOV detection Jan “Honza” Černocký, Igor Szöke, Mirko Hannemann, Stefan Kombrink Brno University.

Machine translation Context-based approach Lucia Otoyo.

English-Persian SMT Reza Saeedi 1 WTLAB Wednesday, May 25, 2011.

CLEF – Cross Language Evaluation Forum Question Answering at CLEF 2003 ( Bridging Languages for Question Answering: DIOGENE at CLEF-2003.

Transliteration Linguistic Enrichment of Statistical

Transliteration Transliteration CS 626 course seminar by Purva Joshi Mugdha Bapat Aditya Joshi Manasi Bapat

6. N-GRAMs 부산대학교 인공지능연구실 최성자. 2 Word prediction “I’d like to make a collect …” Call, telephone, or person-to-person -Spelling error detection -Augmentative.

An Integrated Approach for Arabic-English Named Entity Translation Hany Hassan IBM Cairo Technology Development Center Jeffrey Sorensen IBM T.J. Watson.

Learning Phonetic Similarity for Matching Named Entity Translation and Mining New Translations Wai Lam, Ruizhang Huang, Pik-Shan Cheung ACM SIGIR 2004.

A brief overview of Speech Recognition and Spoken Language Processing Advanced NLP Guest Lecture August 31 Andrew Rosenberg.

Prof. Pushpak Bhattacharyya, IIT Bombay.1 Application of Noisy Channel, Channel Entropy CS 621 Artificial Intelligence Lecture /09/05.

2010 Failures in Czech-English Phrase-Based MT 2010 Failures in Czech-English Phrase-Based MT Full text, acknowledgement and the list of references in.

CS : Speech, NLP and the Web/Topics in AI Pushpak Bhattacharyya CSE Dept., IIT Bombay Lecture-27: Phonology (quiz took place on 12/10/09; Lect 26.

Machine Translation  Machine translation is of one of the earliest uses of AI  Two approaches:  Traditional approach using grammars, rewrite rules,

CS460/IT632 Natural Language Processing/Language Technology for the Web Lecture 3 (10/01/06) Prof. Pushpak Bhattacharyya IIT Bombay Statistical Formulation.

NUDT Machine Translation System for IWSLT2007 Presenter: Boxing Chen Authors: Wen-Han Chao & Zhou-Jun Li National University of Defense Technology, China.

IIIT Hyderabad’s CLIR experiments for FIRE-2008 Sethuramalingam S & Vasudeva Varma IIIT Hyderabad, India 1.

LML Speech Recognition Speech Recognition Introduction I E.M. Bakker.

Sequence Models With slides by me, Joshua Goodman, Fei Xia.

Cluster-specific Named Entity Transliteration Fei Huang HLT/EMNLP 2005.

Improving Named Entity Translation Combining Phonetic and Semantic Similarities Fei Huang, Stephan Vogel, Alex Waibel Language Technologies Institute School.

From Text to Image: Generating Visual Query for Image Retrieval Wen-Cheng Lin, Yih-Chen Chang and Hsin-Hsi Chen Department of Computer Science and Information.

Learning Phonetic Similarity for Matching Named Entity Translations and Mining New Translations Wai Lam Ruizhang Huang Pik-Shan Cheung Department of Systems.

CS : Speech, NLP and the Web/Topics in AI Pushpak Bhattacharyya CSE Dept., IIT Bombay Lecture-14: Probabilistic parsing; sequence labeling, PCFG.

Speech Recognition with CMU Sphinx Srikar Nadipally Hareesh Lingareddy.

Search and Decoding Final Project Identify Type of Articles Using Property of Perplexity By Chih-Ti Shih Advisor: Dr. V. Kepuska.

A Joint Source-Channel Model for Machine Transliteration Li Haizhou, Zhang Min, Su Jian Institute for Infocomm Research 21 Heng Mui Keng Terrace, Singapore.

CS460/626 : Natural Language Processing/Speech, NLP and the Web (Lecture 27– SMT Assignment; HMM recap; Probabilistic Parsing cntd) Pushpak Bhattacharyya.

1 Minimum Error Rate Training in Statistical Machine Translation Franz Josef Och Information Sciences Institute University of Southern California ACL 2003.

CS : Speech, NLP and the Web/Topics in AI Pushpak Bhattacharyya CSE Dept., IIT Bombay Lecture-25: Vowels cntd and a “grand” assignment.

Statistical Machine Translation Part II: Word Alignments and EM Alex Fraser Institute for Natural Language Processing University of Stuttgart

Keyword Translation Accuracy and Cross-Lingual Question Answering in Chinese and Japanese Teruko Mitamura Mengqiu Wang Hideki Shima Frank Lin In CMU EACL.

CS : NLP, Speech and Web-Topics-in-AI Pushpak Bhattacharyya CSE Dept., IIT Bombay Lecture 34: Precision, Recall, F- score, Map.

CS : Speech, NLP and the Web/Topics in AI Pushpak Bhattacharyya CSE Dept., IIT Bombay Lecture-15: Probabilistic parsing; PCFG (contd.)

ASSESSING SEARCH TERM STRENGTH IN SPOKEN TERM DETECTION Amir Harati and Joseph Picone Institute for Signal and Information Processing, Temple University.

Review: Review: Translating without in-domain corpus: Machine translation post-editing with online learning techniques Antonio L. Lagarda, Daniel Ortiz-Martínez,

Audio Books for Phonetics Research CatCod2008 Jiahong Yuan and Mark Liberman University of Pennsylvania Dec. 4, 2008.

CS460/626 : Natural Language Processing/Speech, NLP and the Web (Lecture 31–Inside and Outside probabilities; PCFG training; start of phonetics and phonology)

Designing Cross-Language Information Retrieval System using various Techniques of Query Expansion and Indexing for Improved Performance Hello everyone,

Automatic Transliteration for Japanese-to-English Text Retrieval

Speaker : chia hua Authors : Long Qin, Ming Sun, Alexander Rudnicky

Pushpak Bhattacharyya CSE Dept., IIT Bombay

Do-Gil Lee1*, Ilhwan Kim1 and Seok Kee Lee2

Statistical NLP: Lecture 13

Audio Books for Phonetics Research

Pushpak Bhattacharyya CSE Dept., IIT Bombay 31st Jan, 2011

Presentation transcript:

CS460/626 : Natural Language Processing/Speech, NLP and the Web Lecture 33: Transliteration Pushpak Bhattacharyya CSE Dept., IIT Bombay 8 th Nov, 2012

Transliteration Transliteration Credit: lot of material from seminar of Maoj (PhD student) Purva, Mugdha, Aditya, Manasi (M.Tech students)

Task of converting a word from one alphabetic script to another Used for: Named entities : Gandhiji Out of vocabulary words : Bank What is transliteration?

Transliteration for OOV words Name searching (people, places, organizations) constitutes a large proportion of search Words of foreign origin in a language - Loan Words  Example: Such words not found in the dictionary are called “Out Of Vocabulary (OOV) words” in CLIR/MT बस (bus), स्कूल (school)

Machine Transliteration – The Problem Graphemes – Basic units of written language (English – 26 letters, Devanagari – 92 including matraas) Definition “The process of automatically mapping an given grapheme sequence in source language to a valid grapheme sequence in the target language such that it preserves the pronunciation of the original source word”

Challenges in Machine Transliteration Lot of ambiguities at the grapheme level esp. while dealing with non-phonetic languages  Example: Devanagari letter क has multiple grapheme mappings in English {ca, ka, qa, c, k, q, ck} Presence of silent letters  Pneumonia – Difference of scripts causes spelling variations esp. for loan words  नूमोनिया रिलीस, रिलीज, जार्ज, जॉर्ज, बैंक, बॅंक

An Example from CLEF 2007 आस् ‍ ट्रेलियाई प्रधानमंत्री Transliteration Algorithm (Candidate Generation) aastreliyai (Invalid token in Index) Approximate String Matching Target Language Index australian, australia, estrella Top ‘3’ candidates

Candidate Generation Schemes Takes an input Devanagari word and generates most likely transliteration candidates in English Any standard transliteration scheme could be used for candidate generation In our current work, we have experimented with  Rule Based Schemes oSingle Mapping oMultiple Mapping Pre-Storing Hindi Transliterations in Index

Rule Based Transliteration Manually defined mapping for each Devanagari grapheme to English grapheme(s) Devanagari being a phonetic script, easy to come up with such rules Single Mapping  Each Devanagari grapheme has only a single mapping to English grapheme(s)  Example: न – {na} A given Devanagari word is transliterated from left-right

Rule Based Transliteration (Contd..) Multiple Mapping  Each Devanagari grapheme has multiple mappings to target English grapheme(s) Example: न – {na,kn,n}  May lead to very large number of possible candidates  Not possible to efficiently rank and perform approximate matching Pruning Candidates  At each stage rank and retain only top ‘n’ desirable candidates  Desirability based on probability of forming a valid spelling in English language  Bigram letter model trained on words of English language

Evaluation Metrics Transliteration engine outputs ranked list of English transliterations Following metrics used to evaluate various transliteration techniques  Accuracy – Percentage of words where right transliteration was retrieved as one of the candidates in list  Mean Reciprocal Rank (MRR) – Used for capturing efficiency of ranking

Example result ब्रूकलिन brooklin Brooklin, Brooklyn Brooklin, Brookin Levenshtein Jaro-Winkler

x Overview Source String Transliteration Units Target String Transliteration Units

Contents Source String Transliteration Units Target String Transliteration Units Phoneme- based

Phoneme-based approach Word in Source language Pronunciation in Source language Word in Target language Pronunciation In target language P( p s | w s ) P ( p t | p s ) P ( w t | p t ) Note: Phoneme is the smallest linguistically distinctive unit of sound. P(w t ) W t * = argmax (P (w t ). P (w t | p t ). P (p t | p s ). P (p s | w s ) )

Phoneme-based approach Step I : Consider each character of the word Transliterating ‘BAPAT’ BA PA T P/ə//a:/ /ə//a:/B T Source word to phonemes P/ə//a:/ /ə//a:/B T Source phonemes to target phonemes t t Step II : Converting to phoneme seq. Step III : Converting to target phoneme seq.

Phoneme-based approach Step IV : Phoneme sequence to target string B : /ə/ : /a:/ : P: /ə/ : /a:/ : T: t: Output :

Concerns Word in Source language Pronunciation in Source language Word in Target language Pronunciation In target language Check if the world is valid In target language Check if environment Is noise-free

Unknown pronunciations Back-transliteration can be a problem Johnson  Jonson Issues in phonetic model sanhita samhita

Contents Source String Transliteration Units Target String Transliteration Units Phoneme- based Spelling- based

Particularly developed for Chinese Chinese : Highly ideographic Example : Two main steps: LM based method Image courtesy: wikimedia-commons ModelingDecoding

Modeling Step A bilingual dictionary in the source and target language From this dictionary, the character mapping between the source and target language is learnt The word “Geo” has two possible mappings, the “context” in which it occurs is important John Georgia Geology Geo Modeling step

Modeling step … N-gram Mapping : This concludes the modeling step Modeling step

Decoding Step Consider the transliteration of the word “George”. Alignments of George: Geo rge G eo rge Decoding step

Decision to be made between…. The context mapping is present in the map-dictionary Using…… Decoding step …

Where do the n-gram statistics come from? Ans.: Automatic analysis of the bilingual dictionary How to align this dictionary? Ans. : Using EM-algorithm Transliteration Alignment

EM Algorithm Bootstrap Expectation Maximization Transliteration Units Bootstrap initial random alignment Update n-gram statistics to estimate probability distribution Apply the n-gram TM to obtain new alignment Derive a list of transliteration units from final alignment

“Parallel” Corpus Phoneme Example Translation AA odd AA D AE at AE T AH hut HH AH T AO ought AO T AW cow K AW AY hide HH AY D B be B IY

“Parallel” Corpus cntd Phoneme Example Translation CH cheese CH IY Z D dee D IY DH thee DH IY EH Ed EH D ER hurt HH ER T EY ate EY T F fee F IY G green G R IY N HH he HH IY IH it IH T IY eat IY T JH gee JH IY

A Statistical Machine Translation like task First obtain the Carnegie Mellon University's Pronouncing Dictionary Train and Test the following Statistical Machine Learning Algorithms HMM - For HMM we can use either Natural Language Toolkit or you can use GIZA++ with MOSES

Evaluation E2C Error rates for n-gram testsE2C v/s C2E for TM Tests

Read up/look up/ study Google transliterator (routinely used; supervised by Anupama Dutt, ex-MTP student of CFILT) For all Devnagari transliterations, H. Li,M. Zhang, and J. Su A joint source-channel model for machine transliteration. In ACL, pages 159– Y. Al-Onaizan and K. Knight Machine transliteration of names in Arabic text. In ACL Workshop on Comp. Approaches to Semitic Languages. K. Knight and J. Graehl Machine transliteration. Computational Linguistics, 24(4):599–612. N. AbdulJaleel and L. S. Larkey Statistical transliteration for English-Arabic cross language information retrieval. In CIKM, pages 139–146. Joint source-channel model Phoneme and spelling-based models