Transliteration: CS 626 course seminar by Purva Joshi, Mugdha Bapat, Aditya Joshi, Manasi Bapat
Humans transliterate frequently, for different reasons. Can a machine do this? (Why would a machine have to do this?) If yes, how? Picture courtesy: Snapshot of Yahoo! Messenger
Motivation: An important component of machine translation. When you cannot translate, transliterate. Generally used for named entities, technical terms and out-of-vocabulary (OOV) words. Issues specific to sounds, scripts and accents. Can a machine do this? If yes, how?
What is transliteration? The task of converting a word from one alphabetic script to another. Used for named entities (e.g. Gandhiji) and out-of-vocabulary words (e.g. Bank).
Linguistic issues: Accents: Thoda or thora? Mapping of sounds: Mahaan, Kahaan. Back-transliteration.
Linguistic issues: mapping of sounds. Arabic: Arabic b -> English p or b; the English word Paul transliterates to the Arabic word Baul (an issue in back-transliteration). Chinese: ideographic symbols; the origin of the proper noun determines the symbol. Hindi / Japanese: several English symbols do not map to any Japanese symbol, so they are often mapped to the closest-sounding symbol (ice cream -> aisukuriimu); symbols map to different symbols based on their position (America); difference in origin (Restaurant, constant).
Overview: Source String, Transliteration Units; Target String, Transliteration Units
Contents: Source String, Transliteration Units; Target String, Transliteration Units; Phoneme-based
Phoneme-based approach: Word in source language -> pronunciation in source language -> pronunciation in target language -> word in target language, with probabilities $P(p_s \mid w_s)$, $P(p_t \mid p_s)$, $P(w_t \mid p_t)$ and $P(w_t)$. Note: A phoneme is the smallest linguistically distinctive unit of sound. $W_t^* = \arg\max_{w_t} P(w_t)\, P(w_t \mid p_t)\, P(p_t \mid p_s)\, P(p_s \mid w_s)$
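The argmax above can be sketched in a few lines. This is a minimal illustration, not the seminar's implementation: every probability table, the phoneme strings and the candidate names `cand_1` to `cand_3` are hypothetical toy values. It also assumes the maximum is taken over whole paths (source pronunciation, target pronunciation, target word) rather than summing over pronunciations.

```python
# Toy tables (all values are made up for illustration).
# P(p_s | w_s): source word -> source pronunciation
P_PS_GIVEN_WS = {"BAPAT": {"b a: p ə t": 0.7, "b ə p a: t": 0.3}}
# P(p_t | p_s): source pronunciation -> target pronunciation
P_PT_GIVEN_PS = {
    "b a: p ə t": {"b a: p ə T1": 0.8, "b a: p ə T2": 0.2},
    "b ə p a: t": {"b ə p a: T1": 1.0},
}
# P(w_t | p_t): target pronunciation -> target word
P_WT_GIVEN_PT = {
    "b a: p ə T1": {"cand_1": 1.0},
    "b a: p ə T2": {"cand_2": 1.0},
    "b ə p a: T1": {"cand_3": 1.0},
}
# P(w_t): prior over target words
P_WT = {"cand_1": 0.5, "cand_2": 0.3, "cand_3": 0.2}


def best_target(ws, p_ps_ws, p_pt_ps, p_wt_pt, p_wt):
    """Return (w_t, score) maximizing P(w_t) P(w_t|p_t) P(p_t|p_s) P(p_s|w_s)."""
    best, best_score = None, 0.0
    for ps, a in p_ps_ws.get(ws, {}).items():
        for pt, b in p_pt_ps.get(ps, {}).items():
            for wt, c in p_wt_pt.get(pt, {}).items():
                score = p_wt.get(wt, 0.0) * c * b * a
                if score > best_score:
                    best, best_score = wt, score
    return best, best_score


best, score = best_target("BAPAT", P_PS_GIVEN_WS, P_PT_GIVEN_PS,
                          P_WT_GIVEN_PT, P_WT)
print(best, score)
```

With these toy numbers the path through the first pronunciation wins because its product (0.5 x 1.0 x 0.8 x 0.7) is the largest of the three.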
Phoneme-based approach: transliterating ‘BAPAT’. Step I: Consider each character of the word: BA PA T. Step II: Converting to phoneme seq. (source word to phonemes): B, /ə/ or /a:/, P, /ə/ or /a:/, T. Step III: Converting to target phoneme seq. (source phonemes to target phonemes): B, /ə/ or /a:/, P, /ə/ or /a:/, T: t, t.
Phoneme-based approach Step IV : Phoneme sequence to target string B : /ə/ : /a:/ : P: /ə/ : /a:/ : T: t: Output :
Concerns: Word in source language, pronunciation in source language, pronunciation in target language, word in target language. Check if the word is valid in the target language. Check if the environment is noise-free.
Issues in phonetic model: Unknown pronunciations. Back-transliteration can be a problem: Johnson, Jonson. Example: sanhita, samhita.
Contents: Source String, Transliteration Units; Target String, Transliteration Units; Phoneme-based; Spelling-based
Spelling-based model: Maps source word sequences to target word sequences (i.e. direct word to word). The transliteration score: P(w). A letter trigram model is included; thus, we can accommodate words not included in the dictionary. Diagram labels: word in source language, pronunciation in source language, pronunciation in target language, word in target language.
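To show why a letter trigram model can score words that are missing from the dictionary, here is a minimal sketch. The training words, the padding symbols `^` and `$`, and the add-one smoothing are assumptions chosen for illustration; the slide does not say how the trigram model was estimated.

```python
import math
from collections import Counter


def train_letter_trigrams(words):
    """Count letter trigrams and their two-letter contexts over a word list."""
    trigrams, contexts = Counter(), Counter()
    alphabet = set()
    for word in words:
        padded = "^^" + word.lower() + "$"  # assumed start/end padding
        alphabet.update(padded)
        for i in range(len(padded) - 2):
            trigrams[padded[i:i + 3]] += 1
            contexts[padded[i:i + 2]] += 1
    return trigrams, contexts, len(alphabet)


def log_prob(word, trigrams, contexts, alphabet_size):
    """Add-one smoothed log P(word) under the letter trigram model."""
    padded = "^^" + word.lower() + "$"
    total = 0.0
    for i in range(len(padded) - 2):
        gram, context = padded[i:i + 3], padded[i:i + 2]
        total += math.log((trigrams[gram] + 1) / (contexts[context] + alphabet_size))
    return total


# Toy "dictionary"; "ban" is not in it but still gets a finite score.
model = train_letter_trigrams(["bank", "band", "banana"])
print(log_prob("ban", *model), log_prob("xqz", *model))
```

An unseen but plausible spelling such as "ban" scores higher than an implausible one such as "xqz", which is the behaviour the slide relies on for out-of-dictionary words.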
Comparison of the two methods
Contents: Source String, Transliteration Units; Target String, Transliteration Units; Phoneme-based; Spelling-based; Joint Source Channel
The Third Method: Why? Particularly developed for Chinese. Chinese: highly ideographic. Example: (image courtesy: wikimedia-commons). Two main steps: Modeling, Decoding.
Modeling step: A bilingual dictionary in the source and target language is used. From this dictionary, the character mapping between the source and target language is learnt. The word “Geo” has two possible mappings, so the “context” in which it occurs is important. Examples: John, Georgia, Geology, Geo.
Modeling step …: N-gram mapping. This concludes the modeling step.
Decoding step: Consider the transliteration of the word “George”. Alignments of George: Geo | rge; G | eo | rge.
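The candidate alignments of a source word can be enumerated as all ways of cutting it into contiguous units. The sketch below is an illustration under one assumption that is not on the slide: units are capped at `max_unit_len` letters (3 here) to keep the list short.

```python
def segmentations(word, max_unit_len=3):
    """Return every split of `word` into contiguous units of bounded length."""
    if not word:
        return [[]]
    result = []
    for k in range(1, min(max_unit_len, len(word)) + 1):
        head = word[:k]
        for rest in segmentations(word[k:], max_unit_len):
            result.append([head] + rest)
    return result


candidates = segmentations("george")
print(len(candidates))
print(["geo", "rge"] in candidates, ["g", "eo", "rge"] in candidates)
```

Both alignments named on the slide appear among the candidates; the decoder's job is then to pick one of them.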
Decoding step …: Decision to be made between… The context mapping is present in the map-dictionary. Using…
Transliteration Alignment: Where do the n-gram statistics come from? Ans.: Automatic analysis of the bilingual dictionary. How to align this dictionary? Ans.: Using the EM algorithm.
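One way to picture the decision between candidate alignments is to score each candidate with context-dependent probabilities over (source unit, target unit) pairs and keep the best. The sketch below assumes a bigram context; the target unit names `X1`, `X2`, `Y1`, `Y2`, the probabilities and the floor value for unseen contexts are all hypothetical, and the slide's own tables are not reproduced here.

```python
import math

START = ("<s>", "<s>")  # assumed sentence-start marker


def score(pairs, bigram_logp, floor=math.log(1e-6)):
    """Sum log P(pair | previous pair); unseen contexts get a small floor."""
    prev, total = START, 0.0
    for pair in pairs:
        total += bigram_logp.get((prev, pair), floor)
        prev = pair
    return total


def decode(candidates, bigram_logp):
    """Pick the candidate sequence of unit pairs with the highest score."""
    return max(candidates, key=lambda pairs: score(pairs, bigram_logp))


# Toy "map-dictionary" of bigram probabilities over unit pairs.
BIGRAM_LOGP = {
    (START, ("geo", "X1")): math.log(0.4),
    (("geo", "X1"), ("rge", "X2")): math.log(0.5),
    (START, ("g", "Y1")): math.log(0.1),
    (("g", "Y1"), ("eo", "Y2")): math.log(0.2),
    (("eo", "Y2"), ("rge", "X2")): math.log(0.3),
}

CANDIDATES = [
    [("geo", "X1"), ("rge", "X2")],
    [("g", "Y1"), ("eo", "Y2"), ("rge", "X2")],
]

print(decode(CANDIDATES, BIGRAM_LOGP))
```

With these toy numbers the two-unit alignment wins (0.4 x 0.5 against 0.1 x 0.2 x 0.3).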
EM Algorithm: Bootstrap: initial random alignment. Expectation: update n-gram statistics to estimate the probability distribution. Maximization: apply the n-gram TM to obtain a new alignment. Transliteration units: derive a list of transliteration units from the final alignment.
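The loop above can be sketched as follows. This is a simplified illustration under several assumptions, not the method as evaluated in the seminar: it uses unigram statistics over (source chunk, target unit) pairs instead of a full n-gram model, a crude additive smoothing constant, hard (best-path) re-alignment, a fixed number of iterations, and a three-entry toy dictionary with romanized target units. Each source word is assumed to have at least as many letters as target units.

```python
import math
import random
from collections import Counter


def random_split(word, n, rng):
    """Bootstrap: cut `word` into n non-empty contiguous chunks at random."""
    cuts = sorted(rng.sample(range(1, len(word)), n - 1))
    bounds = [0] + cuts + [len(word)]
    return [word[a:b] for a, b in zip(bounds, bounds[1:])]


def estimate(alignments, targets):
    """Update statistics: count (source chunk, target unit) pairs."""
    counts = Counter()
    for chunks, tgt in zip(alignments, targets):
        counts.update(zip(chunks, tgt))
    return counts, sum(counts.values())


def realign(word, tgt, counts, total, smoothing=0.1):
    """Apply the current model: best split of `word` into len(tgt) chunks."""
    n, m = len(word), len(tgt)
    neg = float("-inf")
    best = [[neg] * (n + 1) for _ in range(m + 1)]  # best[j][i]: word[:i] vs tgt[:j]
    back = [[0] * (n + 1) for _ in range(m + 1)]
    best[0][0] = 0.0
    for j in range(1, m + 1):
        for i in range(j, n + 1):
            for k in range(j - 1, i):
                if best[j - 1][k] == neg:
                    continue
                pair = (word[k:i], tgt[j - 1])
                logp = math.log((counts[pair] + smoothing) / (total + smoothing))
                if best[j - 1][k] + logp > best[j][i]:
                    best[j][i], back[j][i] = best[j - 1][k] + logp, k
    chunks, i = [], n
    for j in range(m, 0, -1):  # follow back-pointers from the end
        k = back[j][i]
        chunks.append(word[k:i])
        i = k
    return chunks[::-1]


def em_align(pairs, iterations=5, seed=0):
    """Return final alignments and the derived transliteration-unit counts."""
    rng = random.Random(seed)
    targets = [t for _, t in pairs]
    alignments = [random_split(w, len(t), rng) for w, t in pairs]
    for _ in range(iterations):
        counts, total = estimate(alignments, targets)
        alignments = [realign(w, t, counts, total) for w, t in pairs]
    units, _ = estimate(alignments, targets)
    return alignments, units


# Toy bilingual dictionary (illustrative entries only).
PAIRS = [
    ("john", ["yue", "han"]),
    ("george", ["qiao", "zhi"]),
    ("georgia", ["qiao", "zhi", "ya"]),
]
alignments, units = em_align(PAIRS)
print(alignments)
```

The resulting alignment depends on the random bootstrap and on the tiny dictionary, so the printed chunks should be read as an example of the procedure rather than as a meaningful result.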
Evaluation: E2C error rates for n-gram tests; E2C v/s C2E for TM tests.
Conclusion: Transliteration can make use of phonemes as an intermediate layer to move from one script to another. The spelling-based approach connects the word sequences of the two languages. The joint source channel method integrates optimization of alignment and transliteration: no pre-alignment needed, reduction in development efforts.
( the end )
References
For all Devanagari transliterations,
Joint source-channel model:
H. Li, M. Zhang, and J. Su. A joint source-channel model for machine transliteration. In ACL, pages 159–
Phoneme and spelling-based models:
Y. Al-Onaizan and K. Knight. Machine transliteration of names in Arabic text. In ACL Workshop on Comp. Approaches to Semitic Languages.
K. Knight and J. Graehl. Machine transliteration. Computational Linguistics, 24(4):599–612.
N. AbdulJaleel and L. S. Larkey. Statistical transliteration for English-Arabic cross language information retrieval. In CIKM, pages 139–146.