Published by Christal Gibbs. Modified over 9 years ago.
BTP Stage 1 Machine Transliteration & Entropy Final Presentation Bhargava Reddy 110050078
Contents
What is Machine Transliteration?
Classical Methods of Transliteration
CRF and its use in Transliteration
Study of Transliteration through Bridging Languages
Study of Entropy in Information Theory
Entropy of a Language
Transliterability and Transliteration Performance
Conclusion and Future Works
Definition of Machine Transliteration
Conversion of a given name in the source language to a name in the target language such that the target-language name:
1. Is phonemically equivalent to the source name
2. Conforms to the phonology of the target language
3. Matches the user's intuition of the equivalent of the source-language name in the target language, considering the culture and orthographic character usage of the target language
Note that each of these criteria defines equivalence in its own way.
Ref: Report of NEWS 2012 Machine Transliteration Shared Task
Classical Machine Transliteration Models
Work on machine transliteration started between 1995 and 2000. Four classical machine transliteration models have been proposed:
1. Grapheme-based transliteration model (ψ_G)
2. Phoneme-based transliteration model (ψ_P)
3. Hybrid transliteration model (ψ_H)
4. Correspondence-based transliteration model (ψ_C)
Ref: A Comparison of Different Machine Transliteration Models (2006)
Grapheme and Phoneme
Phoneme: the smallest contrastive linguistic unit that can bring about a change of meaning. "Kiss" and "kill" are different words, and the phonemes /s/ and /l/ make the difference.
Grapheme: the smallest semantically distinguishing unit in a written language, analogous to the phoneme in spoken language. A grapheme may or may not carry meaning by itself and may or may not correspond to a single phoneme.
Phoneme-based Transliteration Model
This model is basically a source grapheme → source phoneme transformation followed by a source phoneme → target grapheme transformation. It was first proposed by Knight and Graehl in 1997, who modelled English – Japanese and Japanese – English transliteration. In these methods the main transliteration key is the pronunciation (the source phoneme) rather than the spelling (the source grapheme).
Katakana Words and Japanese Language
Katakana words are words imported from other languages (primarily English). Japanese raises several issues where pronunciation is concerned. In Japanese the sounds L and R are pronounced the same (as something in between). The same goes for H and F.
Katakana Words
Golf bag is pronounced go-ru-hu-ba-ggu: ゴルフバッグ
Johnson is pronounced jyo-n-so-n: ジョンソン
Ice cream is pronounced a-i-su-ku-ri-i-mu: アイスクリーム
What do we observe in the transliteration? A lot of information is lost in the conversion from English to Japanese, so back-transliteration can run into trouble.
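The information loss described above can be illustrated with a minimal sketch in Python. The function names and the letter-level replacement rule are hypothetical simplifications for illustration only: real katakana conversion works on sounds, not on spelling, and this is not the method of Knight and Graehl.

```python
from collections import defaultdict

def collapse_sounds(word):
    """Toy stand-in for Japanese phonology: L is merged with R, and F with H.
    (Assumption: operates on letters purely for illustration.)"""
    return word.lower().replace("l", "r").replace("f", "h")

def back_transliteration_candidates(vocabulary):
    """Group English words that become indistinguishable after the merge."""
    groups = defaultdict(list)
    for word in vocabulary:
        groups[collapse_sounds(word)].append(word)
    return dict(groups)

candidates = back_transliteration_candidates(["light", "right", "fold", "hold"])
print(candidates)
```

Each group with more than one member is a case where back-transliteration has to choose between several English words, which is why extra information such as a language model is needed.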
The steps to convert from English to Katakana
Fixing Back-Transliteration
Example for Back Transliteration
Study of MT through Bridging Languages
Data is available between a language pair for one of the following three reasons:
1. Politically related languages: due to the political dominance of English, it is easy to obtain parallel names data between English and most languages
2. Genealogically related languages: languages sharing the same origin, which might have significant overlap between their phonemes and graphemes
3. Demographically related languages: for example Hindi and Telugu, which might not have the same origin but show similarities due to shared culture and demographics
Ref: Everybody loves a rich cousin: An empirical study of transliteration through bridge languages. NAACL 2010
Bridge Transliteration Methodology Ref: Everybody loves a rich cousin: An empirical study of transliteration through bridge languages. NAACL 2010
Results for the Bridge System
We must remember that machine transliteration is a lossy conversion. In the bridge system we can expect information to be lost at each step, so the accuracy score will drop. The results show a drop of about 8-9% in accuracy (ACC1) and about 1-3% in mean F-score. The NEWS 2009 dataset was used for training and for evaluating the results.
Ref: Everybody loves a rich cousin: An empirical study of transliteration through bridge languages. NAACL 2010
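Why a bridge loses information can be sketched with two composed character-level transliterators. Everything in this Python sketch is a hypothetical toy: the mapping tables are invented, and the systems evaluated in the cited paper are trained statistical models, not lookup tables.

```python
def make_transliterator(table):
    """Build a character-by-character transliterator from a mapping table.
    Characters missing from the table are copied unchanged."""
    def transliterate(name):
        return "".join(table.get(ch, ch) for ch in name)
    return transliterate

def through_bridge(source_to_bridge, bridge_to_target):
    """Compose two transliterators: source -> bridge -> target."""
    def transliterate(name):
        return bridge_to_target(source_to_bridge(name))
    return transliterate

# Assumption: the bridge language cannot distinguish "w" from "v".
source_to_bridge = make_transliterator({"w": "v"})
bridge_to_target = make_transliterator({"v": "b"})
bridged = through_bridge(source_to_bridge, bridge_to_target)

print(bridged("wani"), bridged("vani"))
```

Both inputs end up as the same target string, so a distinction present in the source is unrecoverable once it has been merged in the bridge language.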
Stepping through an intermediate language Ref: Everybody loves a rich cousin: An empirical study of transliteration through bridge languages. NAACL 2010
What is Entropy
Entropy is the average amount of information obtained per message received. It characterizes our uncertainty about the source of information (its randomness). It is the expected value of the information content of a random variable.
Based on Shannon's: A Mathematical Theory of Communication
Properties and Mathematical Formulation Based on Shannon's: A Mathematical Theory of Communication
The Formula for Entropy Based on Shannon's: A Mathematical Theory of Communication
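Shannon's entropy of a discrete distribution with probabilities p_i is H = -Σ p_i log2 p_i, measured in bits. A minimal Python sketch follows; the function name `shannon_entropy` is chosen here for illustration.

```python
import math

def shannon_entropy(probabilities):
    """Entropy in bits of a discrete distribution.
    Zero-probability outcomes contribute nothing and are skipped."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# A fair coin carries exactly one bit per toss.
fair_coin = shannon_entropy([0.5, 0.5])

# 26 equally likely letters give log2(26) bits per letter, about 4.70.
uniform_letters = shannon_entropy([1 / 26] * 26)

print(fair_coin, round(uniform_letters, 2))
```

The uniform 26-letter case is the upper bound for an alphabet of that size; any statistical structure in a real language lowers the value.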
Motivation for Entropy in English Aoccdrnig to rseearchat at Elingsh uinervtisy, it deosn't mttaer in waht odrer the ltteers ina wrod are, the olny iprmoatnt tihng is taht the frist and lsat ltteer is at the rghit pclae. The rset canbe a toatl mses and youcansitll raed it wouthit a porbelm. Tihs is bcuseae we do not raed ervey lteter by it slef but the wrod as a wlohe.
Entropy of Language
If the language is translated into binary digits (0 or 1) in the most efficient way, the entropy H is the average number of binary digits required per letter of the original language. The redundancy measures the amount of constraint imposed on a text in the language due to its statistical structure. Examples in English: the high frequency of the letter E, the strong tendency of H to follow T, and of U to follow Q.
Based on Shannon's: Prediction and Entropy of Printed English
Entropy calculation from the statistics of English Based on Shannon's: Prediction and Entropy of Printed English
Entropy of English Based on Shannon's: Prediction and Entropy of Printed English
Interpretation of the equation Based on Shannon's: Prediction and Entropy of Printed English
Calculation of F_N Based on Shannon's: Prediction and Entropy of Printed English
Letter Frequencies in English Source: Wikipedia’s article on letter frequency in English
Calculation of higher F_N
Similar calculations for F_3 give a value of 3.3 bits. Tables of N-gram frequencies are not available for N > 3, so F_4, F_5 and F_6 cannot be calculated in the same way. Word frequencies are used to assist in such situations. Let us look at the log-log plot of word probability against frequency rank.
Based on Shannon's: Prediction and Entropy of Printed English
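In Shannon's paper, F_N is the conditional entropy of a letter given the N-1 letters before it, which equals the entropy of the N-gram distribution minus the entropy of the (N-1)-gram prefixes. The Python sketch below estimates it from a raw text sample instead of published frequency tables; the function names are illustrative, and estimates from short samples are unreliable.

```python
import math
from collections import Counter

def entropy_bits(counts):
    """Entropy in bits of the empirical distribution given by a Counter."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def f_n(text, n):
    """Estimate F_N: entropy of the next letter given the previous n-1 letters.
    Prefix counts are taken from the same n-grams so the two distributions
    are consistent with each other."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    ngram_counts = Counter(grams)
    prefix_counts = Counter(g[:-1] for g in grams)
    return entropy_bits(ngram_counts) - entropy_bits(prefix_counts)

# Two equally frequent letters: one bit per letter when context is ignored.
f1 = f_n("abab", 1)

# In a strictly alternating text the previous letter determines the next.
f2 = f_n("abababab", 2)

print(f1, f2)
```

The drop from `f1` to `f2` mirrors what the slides report for English, where each additional letter of context lowers the estimate.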
Word Frequencies Based on Shannon's: Prediction and Entropy of Printed English
Entropy for various world languages
From the data we can infer that English has the lowest entropy and Finnish the highest. However, all the languages have comparable entropy when Shannon's experiment is taken into consideration.
Languages: Finnish (fi), German (de), Swedish (sv), Dutch (nl), English (en), Italian (it), French (fr), Spanish (es), Portuguese (pt) and Greek (el)
Based on Word-length entropies and correlations of natural language written texts. 2014
Zipf-like plots for various world languages Based on Word-length entropies and correlations of natural language written texts. 2014
Entropy of Telugu Language
Indian languages are highly phonetic, which makes computing the entropy rather difficult. The entropy of Telugu has therefore been calculated by converting the text into English and applying Shannon's experiment. The entropy is calculated in two ways:
1. Converting into English and then treating the symbols as English letters
2. Converting into English and then treating the symbols as Telugu letters
Based on Entropy of Telugu. Venkata Ravinder Paruchuri. 2011
Telugu Language Entropy Based on Entropy of Telugu. Venkata Ravinder Paruchuri. 2011
Inferences
The entropy of Telugu is higher than that of English, which means that Telugu is more succinct than English and that each syllable in Telugu (as in other Indian languages) contains more information than in English.
Indus Script
Very little is known about this ancient script, and it has not been established whether it is a linguistic script or not. From the accompanying diagram, however, we can see that the Indus script lies close to most of the world's languages. We can thus infer that the Indus script may be a linguistic script, although there is no solid proof of it.
Based on Entropy, the Indus Script and Language: A Reply to R.Sproat
Transliterability and Transliteration performance Ref: Compositional Machine Transliteration (2010), A Kumaran, Mitesh M Khapra, Pushpak Bhattacharyya
Transliterability Measure
A measure of the ease of transliterability between languages should have the following desirable qualities:
1. Rely purely on orthographic features of the languages (easily calculated from parallel names corpora)
2. Capture and weigh the inherent ambiguity in transliteration at the character level (i.e., the average number of character mappings)
3. Weigh the ambiguous transitions for a given character according to the transition frequencies, since highly ambiguous mappings may occur only rarely
The transliterability measure Weighted AVerage Entropy (WAVE) does this job.
Ref: Compositional Machine Transliteration (2010), A Kumaran, Mitesh M Khapra, Pushpak Bhattacharyya
WAVE Ref: Compositional Machine Transliteration (2010), A Kumaran, Mitesh M Khapra, Pushpak Bhattacharyya
Motivation
From the adjacent table we can conclude that the unigram ‘a’ occurs nearly 150 times more often than the unigram ‘x’, which implies that capturing the ambiguities of ‘a’ is more beneficial than capturing those of ‘x’. The term ‘frequency(i)’ captures this effect.
Table IV shows the mappings from the source to the target language. We can observe that the unigram c maps to two characters, स and क, whereas p maps to only one, प. The term ‘Entropy(i)’ captures this information and ensures that c is weighted more than p.
Ref: Compositional Machine Transliteration (2010), A Kumaran, Mitesh M Khapra, Pushpak Bhattacharyya
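The two terms can be combined into a small Python sketch. It assumes that WAVE is the frequency-weighted average of the per-character mapping entropies, as the terms frequency(i) and Entropy(i) suggest; the exact formula and logarithm base used in the cited paper are not shown here, and all counts below are invented for illustration.

```python
import math

def mapping_entropy(mapping_counts):
    """Entropy (bits) of the target-character distribution for one source unit.
    mapping_counts maps each target character to how often it was observed."""
    total = sum(mapping_counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in mapping_counts.values() if c > 0)

def wave(frequency, mappings):
    """Assumed form of WAVE: sum over source units of
    relative frequency(i) * Entropy(i)."""
    total = sum(frequency.values())
    return sum((frequency[unit] / total) * mapping_entropy(mappings[unit])
               for unit in frequency)

# Hypothetical counts: 'c' maps to two targets equally often, 'p' to one.
frequency = {"c": 1, "p": 3}
mappings = {"c": {"स": 5, "क": 5}, "p": {"प": 10}}

score = wave(frequency, mappings)
print(score)
```

Here only ‘c’ contributes to the score, since ‘p’ has a single unambiguous mapping, and its contribution is scaled down by how rarely ‘c’ occurs.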
Plot Between WAVE and Transliteration Quality
The following plots are drawn between log(WAVE) and the accuracy measure (for a training corpus of approximately 15k names) for the language pairs En-Hi, En-Ka, En-Ma, Hi-En, Ka-En, Hi-Ka and Ma-Ka. We can see that as the value of WAVE increases, the accuracy decreases exponentially. The two top-left points in each plot are for Hindi and Marathi, which share the same orthography and have largely one-to-one character mappings between them. We can observe that different n-grams give almost the same results, which means we can choose the unigram model to generalize. Based on these observations we can term two languages with a small WAVE_1 measure as more easily transliterable.
Conclusions
In this presentation we introduced the concept of machine transliteration. We looked over the classical methods of transliteration, from phoneme-based transliteration models to combined CRF models. We introduced the concept of entropy and its usefulness in transliteration models. We studied how phonology and syllabification help in creating chunks that are useful for transliteration. We introduced the concept of WAVE, which determines the ease of transliteration between a pair of languages.
Future Work
As part of future work, I would estimate transliteration performance between languages from entropy measures between them, and look for a measure like WAVE that could be used to measure transliterability without performing the actual transliteration.
References
Report of NEWS 2012 Machine Transliteration Shared Task (2012). Min Zhang, Haizhou Li, A Kumaran and Ming Liu. ACL 2012
A Comparison of Different Machine Transliteration Models (2006).
Machine Transliteration (1997). Kevin Knight and Jonathan Graehl. Journal Computational Linguistics
Improving back-transliteration by combining information sources (2004). Bilac, S., & Tanaka, H. In Proceedings of IJCNLP2004, pp. 542–547
An English-Korean transliteration model using pronunciation and contextual rules (2002). Oh, J. H., & Choi, K. S. In Proceedings of COLING2002, pp. 758–764
Everybody loves a rich cousin: An empirical study of transliteration through bridge languages (2010). Mitesh M. Khapra, A Kumaran, Pushpak Bhattacharyya
Forward-Backward Machine Transliteration between English and Chinese based on Combined CRFs (2011). Ying Qin, Guohua Chen. Nov’ 12 FWBW
Hindi to English Machine transliteration model of named entities using CRFs (2012). Manikrao, Shantanu and Tushar. International Journal of Computer Applications (0975-8887), June 2012
References (2)
Linguistics, An Introduction to Language and Communication (5th Edition). Adrian Akmajian, Richard A Demers, Ann K Farmer, Robert M Harnish. MIT Press. 2001
A Mathematical Theory of Communication (1948). C.E.Shannon. The Bell System Technical Journal, July 1948
Prediction and Entropy of Printed English. C.E.Shannon. The Bell System Technical Journal, January 1951
Word-length entropies and correlations of natural language written texts. Maria Kalimeri, Vassilios Constantoudis, Constantinos Papadimitriou. arXiv, 2014
Entropy of Telugu. Venkata Ravinder Paruchuri. 2011
Entropy, the Indus Script and Language: A Reply to R.Sproat. Rajesh PN Rao, Nisha Yadav, Mayank Vahia, Hrishikesh. Computational Linguistics 36(4). 2010
Compositional Machine Transliteration. A.Kumaran, Mitesh M. Khapra and Pushpak Bhattacharyya. Transactions on Asian Language Information Processing (TALIP Journal), September 2010
Wiki articles
Extra Slides
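Explanation of property 3
[Figure: a choice among three outcomes with probabilities 1/2, 1/3, 1/6, decomposed into a first choice with probabilities 1/2, 1/2 followed by a second choice with probabilities 2/3, 1/3]
Based on Shannon's: A Mathematical Theory of Communication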
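Property 3 in Shannon's paper states that when a choice is broken into successive choices, the original entropy equals the weighted sum of the individual entropies. For the probabilities on this slide, the second choice is made only half the time, so it is weighted by 1/2. The worked equation below uses the standard definition of H with base-2 logarithms:

```latex
\begin{aligned}
H\left(\tfrac12,\tfrac13,\tfrac16\right)
  &= H\left(\tfrac12,\tfrac12\right) + \tfrac12\,H\left(\tfrac23,\tfrac13\right) \\
  &= 1 + \tfrac12\left(\log_2 3 - \tfrac23\right) \\
  &\approx 1.459 \text{ bits}
\end{aligned}
```

Evaluating the left-hand side directly, as (1/2) log2 2 + (1/3) log2 3 + (1/6) log2 6, gives the same value.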