Machine Transliteration T BHARGAVA REDDY (Knowledge sharing)


1 Machine Transliteration T BHARGAVA REDDY (Knowledge sharing)

2 What is Machine Transliteration
• It is the conversion of text from one script to another
• Not every word in one language has an equivalent in another language
• Such words are called Out-of-Vocabulary (OOV) words
• Machine transliteration is a useful tool in machine translation when dealing with OOV words
• Example: Tirupati → తిరుపతి
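To make the Tirupati example concrete, here is a minimal sketch of table-driven transliteration using greedy longest-match lookup. The table `SYLLABLE_TABLE` and the function `transliterate` are hypothetical names invented for this illustration, and the table covers only the syllables needed for this one word; a real system would learn such mappings from data.

```python
# Hypothetical toy table: Latin-script syllables -> Telugu graphemes.
# It only covers the syllables needed for the slide's example word.
SYLLABLE_TABLE = {
    "ti": "తి",
    "ru": "రు",
    "pa": "ప",
}


def transliterate(word, table=SYLLABLE_TABLE):
    """Greedy longest-match transliteration; raises on unmapped input."""
    word = word.lower()
    max_len = max(len(key) for key in table)
    output = []
    i = 0
    while i < len(word):
        # Try the longest chunk first, then shorter ones.
        for size in range(min(max_len, len(word) - i), 0, -1):
            chunk = word[i:i + size]
            if chunk in table:
                output.append(table[chunk])
                i += size
                break
        else:
            raise ValueError(f"no mapping for {word[i:]!r}")
    return "".join(output)


print(transliterate("Tirupati"))
```

Any word containing a syllable outside the toy table raises `ValueError`, which is a reminder of how quickly a hand-written table runs out.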

3 Machine Transliteration Models
• Four machine transliteration models have been proposed so far:
1. Grapheme-Based Transliteration Model (ψG)
2. Phoneme-Based Transliteration Model (ψP)
3. Hybrid Transliteration Model (ψH)
4. Correspondence-Based Transliteration Model (ψC)

4 Grapheme and Phoneme
• Phoneme: the smallest contrastive linguistic unit that can bring about a change of meaning. "Kiss" and "kill" are different words, and the phonemes /s/ and /l/ are what make the difference.
• Grapheme: the smallest semantically distinguishing unit in a written language, analogous to the phoneme in spoken language. A grapheme may or may not carry meaning by itself and may or may not correspond to a single phoneme.

5 Grapheme-Based Transliteration Model (ψG)
• The machine directly converts source-language graphemes to target-language graphemes
• This method needs no phonetic knowledge of the source or target language
• Four methods have been implemented for this scenario:
1. Source-channel model
2. Decision tree model
3. Transliteration network
4. Joint source-channel model

6 Source Channel Model
• English words are segmented into chunks of English graphemes
• Next, all possible target-language chunks corresponding to each English chunk are produced
• The most relevant sequence of target-language graphemes is then identified
• Advantage: it considers a chunk of graphemes that represents a phonetic property of the source-language word
• Disadvantage: errors in the first step propagate to the subsequent steps, making it difficult to produce the correct transliteration
• Time complexity is also a major issue, since the process is time-consuming
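The three steps above can be sketched as a small noisy-channel search. This is an illustration under stated assumptions, not the method from the cited papers: the source word is assumed to be already chunked, and every probability in `CHANNEL` and `BIGRAM`, as well as the function name `best_transliteration`, is invented for the example. The score of a candidate is the product of a channel probability per chunk and a target-side bigram probability.

```python
from itertools import product

# Hypothetical channel probabilities P(source chunk | target chunk),
# stored by source chunk for easy lookup. All numbers are made up.
CHANNEL = {
    "go": {"ゴ": 0.9, "コ": 0.1},
    "l": {"ル": 0.7, "ラ": 0.3},
    "f": {"フ": 0.8, "ハ": 0.2},
}

# Hypothetical target-language bigram probabilities; unseen pairs
# fall back to DEFAULT.
BIGRAM = {("<s>", "ゴ"): 0.5, ("ル", "フ"): 0.5}
DEFAULT = 0.1


def best_transliteration(source_chunks):
    """Enumerate every candidate sequence and keep the best-scoring one."""
    candidate_lists = [list(CHANNEL[chunk].items()) for chunk in source_chunks]
    best_score, best_seq = 0.0, None
    for combo in product(*candidate_lists):
        score = 1.0
        prev = "<s>"
        for target, channel_prob in combo:
            score *= channel_prob * BIGRAM.get((prev, target), DEFAULT)
            prev = target
        if score > best_score:
            best_score = score
            best_seq = [target for target, _ in combo]
    return "".join(best_seq), best_score


word, score = best_transliteration(["go", "l", "f"])
print(word, score)
```

The exhaustive `product` loop grows exponentially with the number of chunks, which is one way to see the time-complexity concern raised on the slide. If the chunking passed in is wrong, no later step can repair it.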

7 Decision Tree Model
• Decision trees that transform each source grapheme into target graphemes are learned and then directly applied to machine transliteration
• Advantage: considers a wide range of contextual information, say the three graphemes to the left and the three to the right
• Disadvantage: unlike the source-channel model, it does not consider phonetic aspects
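The "left three and right three contexts" can be illustrated with a short feature-extraction sketch. It only builds the context windows that a decision tree learner would consume; the learning step itself is omitted. The function name `context_windows` and the padding symbol `_` are assumptions made for this example.

```python
def context_windows(word, size=3, pad="_"):
    """Return (left context, focus grapheme, right context) per position."""
    padded = pad * size + word + pad * size
    windows = []
    for i in range(len(word)):
        center = i + size
        left = tuple(padded[center - size:center])
        right = tuple(padded[center + 1:center + 1 + size])
        windows.append((left, padded[center], right))
    return windows


for left, focus, right in context_windows("board"):
    print(left, focus, right)
```

Each tuple would become one training instance, with the aligned target grapheme as its label.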

8 Transliteration Network
• The network consists of arcs and nodes
• A node represents a chunk of source graphemes and its corresponding target graphemes
• An arc represents a possible link between nodes and carries a weight showing the strength of that link
• The method considers phonetic aspects in the formation of grapheme chunks
• Segmenting into chunks and identifying the most relevant sequence are done in one step
• This means errors are not propagated from one step to the next
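A minimal sketch of searching such a network follows. It assumes a dynamic-programming search in which segmentation and target selection happen together; the chunk table `CHUNKS`, the arc weights `ARCS`, the uppercase placeholder target symbols, and the function `best_path` are all hypothetical and chosen only to show the mechanics.

```python
# Hypothetical source chunk -> candidate target chunks (placeholder symbols).
CHUNKS = {"s": ["S"], "h": ["H"], "sh": ["SH"], "i": ["I"], "p": ["P"]}

# Hypothetical arc weights between consecutive target chunks.
ARCS = {
    ("<s>", "SH"): 0.6,
    ("<s>", "S"): 0.4,
    ("S", "H"): 0.1,
    ("H", "I"): 0.5,
    ("SH", "I"): 0.9,
    ("I", "P"): 0.8,
}


def best_path(word, chunks=CHUNKS, arcs=ARCS, default=0.01):
    """Find the highest-weight path; segmentation is chosen along the way."""
    n = len(word)
    # best[(end position, last target chunk)] = (score, path so far)
    best = {(0, "<s>"): (1.0, [])}
    for start in range(n):
        # Snapshot the states ending here before adding new ones.
        states = [(k, v) for k, v in best.items() if k[0] == start]
        for (_, prev_target), (score, path) in states:
            for end in range(start + 1, n + 1):
                src = word[start:end]
                for tgt in chunks.get(src, []):
                    new_score = score * arcs.get((prev_target, tgt), default)
                    key = (end, tgt)
                    if key not in best or new_score > best[key][0]:
                        best[key] = (new_score, path + [(src, tgt)])
    finals = [v for k, v in best.items() if k[0] == n]
    if not finals:
        raise ValueError(f"no path covers {word!r}")
    return max(finals, key=lambda item: item[0])


score, path = best_path("ship")
print(score, path)
```

With these toy weights the segmentation sh-i-p beats s-h-i-p, and no separate chunking step exists whose errors could propagate.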

9 Phoneme-Based Transliteration Model
• This model is basically a source grapheme to source phoneme transformation followed by a source phoneme to target grapheme transformation
• It was first proposed by Knight and Graehl in 1997
• They used Weighted Finite State Transducers (WFSTs)
• They modelled it for English-Japanese and Japanese-English transliteration
• Similar methods have since appeared for Arabic-English and English-Chinese transliteration

10 Knight and Graehl’s Work
• In these methods the main transliteration key is pronunciation (the source phoneme) rather than spelling (the source grapheme)
• Katakana is the Japanese script used to write words imported from other languages (primarily English)
• Japanese raises many issues where the pronunciation of such words is concerned
• In Japanese the sounds L and R are pronounced the same
• The same goes for H and F

11 Katakana Words
• Golf bag is pronounced go-ru-hu-ba-ggu: ゴルフバッグ
• Johnson is pronounced jyo-n-so-n: ジョンソン
• Ice cream is pronounced a-i-su-ku-ri-i-mu: アイスクリーム
• What do we observe in these transliterations?
• A lot of information is lost in the conversion from English to Japanese
• So back-transliteration can run into trouble
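The information loss can be made concrete with a toy sketch. It assumes a tiny hand-written lexicon and a many-to-one sound mapping in which L and R collapse to r, and F and H collapse to h; the names `LEXICON`, `SOUND_MAP`, `to_japanese_sounds`, and `back_transliterate`, along with the rough phoneme labels and the vowel-insertion rule, are simplifications invented here and are not Knight and Graehl's model.

```python
# Hypothetical toy lexicon: English word -> rough phoneme sequence.
LEXICON = {
    "light": ["L", "AY", "T"],
    "right": ["R", "AY", "T"],
    "late": ["L", "EY", "T"],
    "golf": ["G", "AA", "L", "F"],
}

# Hypothetical many-to-one mapping from English phonemes to romanized
# Japanese sounds: L/R both become "r", F/H both become "h".
SOUND_MAP = {
    "L": "r", "R": "r", "F": "h", "H": "h", "G": "g", "T": "t",
    "AY": "ai", "EY": "ei", "AA": "o",
}
VOWELS = set("aiueo")


def to_japanese_sounds(phonemes):
    """Render phonemes as romanized Japanese, adding vowels after bare consonants."""
    out = []
    for i, phoneme in enumerate(phonemes):
        sound = SOUND_MAP[phoneme]
        out.append(sound)
        has_next = i + 1 < len(phonemes)
        next_is_vowel = has_next and SOUND_MAP[phonemes[i + 1]][0] in VOWELS
        if sound[-1] not in VOWELS and not next_is_vowel:
            # Simplified rule: "o" after t, "u" after other consonants.
            out.append("o" if sound == "t" else "u")
    return "".join(out)


def back_transliterate(sounds):
    """Return every lexicon word whose Japanese rendering matches."""
    return sorted(w for w, p in LEXICON.items() if to_japanese_sounds(p) == sounds)


print(to_japanese_sounds(LEXICON["golf"]))
print(back_transliterate("raito"))
```

Both "light" and "right" render to the same Japanese sound string, so going backwards requires choosing among several English candidates.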

12 Trouble in Back-Transliteration
• Several ways of writing the word "switch" are acceptable under the rules of Japanese
• But when converting from Japanese back to English we must be strict: no word other than "switch" is acceptable
• Back-transliteration is harder than romanization. Romanizing アンジェラ (Angela) gives "anjera", which is not acceptable in English
• Words are often compressed. "Word processing" is transliterated as "waapuro", which is not at all easy to back-transliterate

13 The steps to convert from English to Katakana

14 Fixing Back-Transliteration

15 Algorithms for extracting the best transliteration

16 Example for Back Transliteration

17 BTP Work
• I will be working under PhD student Arjun Atre on the project
• We will try to develop machine transliteration tools for Indian languages
• I will try to develop a bridging language that can be used to transliterate text from one Indian language to another
• This would contribute a lot to the NLP community and would be a leading step towards handling OOV words, which are numerous in our native languages

18 THANK YOU

19 References
• Jong-Hoon Oh, Key-Sun Choi, Hitoshi Isahara (2006). A Comparison of Different Machine Transliteration Models.
• Kevin Knight and Jonathan Graehl (1997). Machine Transliteration. (Phoneme-based transliteration model)
• www.Wikipedia.org/transliteration

