Machine Transliteration
T BHARGAVA REDDY (Knowledge sharing)
What is Machine Transliteration?
Machine transliteration is the conversion of text from one script to another. Not every word in a language has an equivalent in another language; such words are called out-of-vocabulary (OOV) words. Machine transliteration is a useful tool in machine translation when dealing with OOV words.
Example: Tirupati → తిరుపతి
Machine Transliteration Models
Four machine transliteration models have been proposed so far:
1. Grapheme Based Transliteration Model (ψ_G)
2. Phoneme Based Transliteration Model (ψ_P)
3. Hybrid Transliteration Model (ψ_H)
4. Correspondence Based Transliteration Model (ψ_C)
Grapheme and Phoneme
Phoneme: The smallest contrastive linguistic unit that can bring about a change of meaning. "Kiss" and "kill" are two different words; the phonemes /s/ and /l/ make the difference.
Grapheme: The smallest semantically distinguishing unit in a written language, analogous to the phoneme in spoken language. A grapheme may or may not carry meaning by itself, and may or may not correspond to a single phoneme.
Grapheme Based Transliteration Model (ψ_G)
The machine converts source-language graphemes directly to target-language graphemes. This method needs no phonetic knowledge of the source or target language.
Four methods have been implemented for this model:
1. Source channel model
2. Decision tree model
3. Transliteration network
4. Joint source-channel model
Source Channel Model
1. The English word is segmented into chunks of English graphemes.
2. For each English chunk, all possible corresponding chunks in the target language are produced.
3. The most relevant sequence of target-language graphemes is identified.
Advantage: It considers chunks of graphemes that represent a phonetic property of the source-language word.
Disadvantage: Errors in the first step propagate to the subsequent steps, making it difficult to produce the correct transliteration. Time complexity is also a major issue, because all possible chunks have to be produced and compared.
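The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the chunk table, its probabilities, and the function names are hypothetical, and the score is a plain product of chunk probabilities rather than a trained source channel model.

```python
from itertools import product

# Hypothetical chunk table: source chunk -> {target chunk: probability}.
# The entries and numbers are invented for illustration only.
CHUNK_TABLE = {
    "ti": {"తి": 0.7, "టి": 0.3},
    "ru": {"రు": 0.9, "రూ": 0.1},
    "pa": {"ప": 0.8, "పా": 0.2},
}

def segment(word):
    """Step 1: greedy longest-match segmentation into known source chunks."""
    chunks, i = [], 0
    max_len = max(len(c) for c in CHUNK_TABLE)
    while i < len(word):
        for size in range(min(max_len, len(word) - i), 0, -1):
            piece = word[i:i + size]
            if piece in CHUNK_TABLE:
                chunks.append(piece)
                i += size
                break
        else:
            # No chunk matched: a step-1 failure that later steps cannot repair.
            raise ValueError(f"no chunk covers {word[i:]!r}")
    return chunks

def candidates(chunks):
    """Step 2: enumerate every combination of target chunks with its score."""
    options = [CHUNK_TABLE[c].items() for c in chunks]
    for combo in product(*options):
        score = 1.0
        for _, prob in combo:
            score *= prob
        yield "".join(target for target, _ in combo), score

def source_channel_transliterate(word):
    """Step 3: keep the highest-scoring target sequence."""
    return max(candidates(segment(word)), key=lambda pair: pair[1])[0]

print(source_channel_transliterate("tirupati"))
```

Step 2 enumerates every combination, so the number of candidates grows exponentially with the number of chunks, which is where the time cost comes from.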
Decision Tree Model
Decision trees that transform each source grapheme into target graphemes are learned and then applied directly to machine transliteration.
Advantage: Considers a wide range of contextual information, for example the three graphemes to the left and the three to the right.
Disadvantage: Unlike the source channel model, it does not consider phonetic aspects.
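The context window mentioned above can be sketched as a feature extractor. The function names, the padding symbol and the tiny hand-written rule below are assumptions made for illustration; a real system would learn the tree from aligned training data.

```python
def context_features(word, i, width=3):
    """Return the grapheme at position i plus `width` graphemes of context per side."""
    padded = "_" * width + word + "_" * width  # "_" marks positions outside the word
    j = i + width
    feats = {"g": padded[j]}
    for k in range(1, width + 1):
        feats[f"L{k}"] = padded[j - k]  # k-th grapheme to the left
        feats[f"R{k}"] = padded[j + k]  # k-th grapheme to the right
    return feats

def toy_tree(feats):
    """Hand-written stand-in for a learned tree: only decides the output for 'c'."""
    if feats["g"] != "c":
        return feats["g"]
    if feats["R1"] in "eiy":
        return "s"  # 'c' before e, i or y, as in "city"
    return "k"      # 'c' elsewhere, as in "cat"

print("".join(toy_tree(context_features("cycle", i)) for i in range(len("cycle"))))
```

Each internal node of a learned tree would test one of these features (for example R1), and each leaf would hold a target grapheme.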
Transliteration Network
The network consists of nodes and arcs. A node represents a chunk of source graphemes and its corresponding target graphemes. An arc represents a possible link between two nodes and carries a weight showing the strength of that link.
The method considers phonetic aspects when forming the grapheme chunks. Segmenting the word into chunks and identifying the most relevant sequence are done in one step, so errors are not propagated from one step to the next.
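Doing segmentation and selection in one step can be sketched as a best-path search over all chunkings of the word. This is a simplified illustration, not the network from the literature: the chunk pairs and weights are hypothetical, and each weight is attached to a chunk pair instead of to a separate arc between nodes.

```python
import math

# Hypothetical (source chunk, target chunk) pairs with invented weights.
PAIRS = {
    ("ti", "తి"): 0.6,
    ("t", "త్"): 0.3,
    ("i", "ఇ"): 0.2,
    ("ru", "రు"): 0.7,
    ("r", "ర్"): 0.3,
    ("u", "ఉ"): 0.2,
    ("pa", "ప"): 0.7,
    ("p", "ప్"): 0.3,
    ("a", "అ"): 0.2,
}

def best_path(word):
    """Choose the segmentation and the target chunks jointly by dynamic programming."""
    # best[i] holds (log-score, target string) of the best path covering word[:i].
    best = [None] * (len(word) + 1)
    best[0] = (0.0, "")
    for i in range(len(word)):
        if best[i] is None:
            continue  # no path reaches this position
        score, out = best[i]
        for (src, tgt), weight in PAIRS.items():
            if word.startswith(src, i):
                j = i + len(src)
                cand = (score + math.log(weight), out + tgt)
                if best[j] is None or cand[0] > best[j][0]:
                    best[j] = cand
    if best[-1] is None:
        raise ValueError(f"no path covers {word!r}")
    return best[-1][1]

print(best_path("tirupati"))
```

Because every segmentation competes in the same search, a poor early chunk choice can be outscored by a better complete path instead of being locked in.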
Phoneme-based Transliteration Model
This model is basically a source grapheme to source phoneme transformation followed by a source phoneme to target grapheme transformation.
It was first proposed by Knight and Graehl in 1997, using Weighted Finite State Transducers (WFSTs). They modelled it for English-Japanese and Japanese-English transliteration. Similar methods have since been developed for Arabic-English and English-Chinese transliteration.
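The two transformations can be sketched as a two-stage pipeline. Everything in this example is an assumption for illustration: the pronunciation entry, the phoneme symbols, the sound tables and the crude rule that adds "u" after a consonant not followed by a vowel. The output is romanized Japanese-style sound, not katakana, and no WFSTs are involved.

```python
# Stage 1 table (hypothetical): spelling -> source phonemes, ARPAbet-style symbols.
PRONUNCIATIONS = {"ice cream": ["AY", "S", "K", "R", "IY", "M"]}

# Stage 2 tables (hypothetical): source phoneme -> romanized target sound.
VOWELS = {"AY": "ai", "IY": "ii"}
CONSONANTS = {"S": "s", "K": "k", "R": "r", "L": "r", "M": "m"}

def to_phonemes(word):
    """Source grapheme -> source phoneme, here a plain dictionary lookup."""
    return PRONUNCIATIONS[word.lower()]

def to_target(phonemes):
    """Source phoneme -> target sound, adding 'u' after consonants with no vowel next."""
    out = []
    for idx, ph in enumerate(phonemes):
        if ph in VOWELS:
            out.append(VOWELS[ph])
            continue
        out.append(CONSONANTS[ph])
        nxt = phonemes[idx + 1] if idx + 1 < len(phonemes) else None
        if nxt not in VOWELS:
            out.append("u")  # crude stand-in for vowel insertion
    return "".join(out)

def phoneme_based_transliterate(word):
    """Compose the two stages."""
    return to_target(to_phonemes(word))

print(phoneme_based_transliterate("ice cream"))
```

Note that both "R" and "L" map to "r" in the stage 2 table, so the distinction between them is already lost at this point.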
Knight and Graehl's Work
In these methods the main transliteration key is pronunciation (the source phoneme) rather than spelling (the source grapheme).
Katakana is the script Japanese uses to write words imported from other languages (primarily English). Japanese raises several issues where pronunciation is concerned: the sounds L and R are pronounced the same, and so are H and F.
Katakana Words
Golf bag is pronounced go-ru-hu-ba-ggu (ゴルフバッグ).
Johnson is pronounced jyo-n-so-n (ジョンソン).
Ice cream is pronounced a-i-su-ku-ri-i-mu (アイスクリーム).
What have we observed in the transliteration? A lot of information is lost in the conversion from English to Japanese, so back-transliteration can run into trouble.
Trouble in Back-Transliteration
Japanese rules accept several ways of writing the word "switch" in katakana. When converting from Japanese back to English, however, we have to be strict: no word other than "switch" is acceptable.
Back-transliteration is harder than romanization. Romanizing the word Angela (アンジェラ) gives "anjera", which is not an acceptable English spelling.
Words are often compressed as well. "Word processing" is transliterated as "waapuro", which is not at all easy to back-transliterate.
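The ambiguity can be made concrete with a small sketch. It assumes a hypothetical reverse map from romanized sounds to possible English letters and a tiny hypothetical vocabulary; Knight and Graehl used weighted probabilistic models, not the plain set intersection shown here.

```python
from itertools import product

# Hypothetical reverse map: a romanized sound may come from several English letters.
REVERSE = {"r": ["r", "l"], "h": ["h", "f"], "j": ["j", "g"]}

# Hypothetical vocabulary standing in for a model of valid English words.
VOCABULARY = {"angela", "angel", "anger"}

def expand(romaji):
    """Generate every English spelling the romanized string could have come from."""
    options = [REVERSE.get(ch, [ch]) for ch in romaji]
    return {"".join(combo) for combo in product(*options)}

def back_transliterate(romaji):
    """Keep only the candidates that are known English words."""
    return sorted(expand(romaji) & VOCABULARY)

print(sorted(expand("anjera")))
print(back_transliterate("anjera"))
```

A compressed form such as "waapuro" would not be recovered this way, since the missing part of the word is not among the expanded candidates.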
The steps to convert from English to Katakana
Fixing Back-Transliteration
Algorithms for extracting the best transliteration
Example for Back Transliteration
BTP Work
I will be working on this project under PhD student Arjun Atre. We will try to develop machine transliteration tools for Indian languages. I will try to develop a bridging language that can be used to transliterate text from one Indian language to another. This would contribute to the NLP community and would be an important step toward handling OOV words, which are numerous in our native languages.
References
Jong-Hoon Oh, Key-Sun Choi, Hitoshi Isahara (2006). A Comparison of Different Machine Transliteration Models.
Kevin Knight and Jonathan Graehl (1997). Machine Transliteration. (Phoneme-based transliteration model.)