GH-MAP: Rule Based Token Mapping for Translation between Sibling Language Pair: Gujarati-Hindi
Kalyani Patel, K.S. School of Business Management, Gujarat University. patel_kalyani_05@yahoo.co.in
Dr. Jyoti Pareek, Department of Computer Science, Gujarat University. drjyotipareek@yahoo.com
Contents
Introduction
Hindi-Gujarati: A comparative study
The Rule Base
Translation
Token Mapping Engine
Algorithm
Example
Evaluation
Conclusion
4/20/2017 ICON 2009
Introduction
GH-MAP is designed for one particular language pair, to take advantage of the similarity between the sibling languages Gujarati and Hindi.
It uses rule-based token mapping for effective word-to-word translation.
GH-MAP can be utilized for MT, CLIR, GurjerNet, and a Multilingual Dictionary.
Hindi-Gujarati: A comparative study
Both languages belong to the Indo-Aryan family (Hindi, Bangla, Assamese, Punjabi, Marathi, Oriya and Gujarati).
Being in the same group, they have a high degree of structural similarity.
Hindi and Gujarati have bijectively mappable characters (Varna Maala), excluding ळ.
Both have relatively free word order: the noun groups can come in any order, generally followed by the verb group.
Hindi-Gujarati: A comparative study (continued)
Nouns in Hindi and Gujarati are inflected for case (direct or oblique), number (singular or plural), and gender (masculine or feminine). In addition, Gujarati has a common gender.
Verbs in both languages are inflected for gender, number, person, tense, aspect, modality, formality, and voice.
Hindi-Gujarati: A comparative study (continued)
Many words in the two languages share an origin (Sanskrit) and, because of the shared culture, usually share a meaning as well, e.g. 'book' is ‘પુસ્તક’/‘puswaka’ in Gujarati and ‘पुस्तक’/‘puswaka’ in Hindi.
A sentence in one language can be mapped to a sentence in the other by substituting each word group in the source language with the appropriate word group in the target language.
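The character-level correspondence noted in this comparison can be illustrated with a minimal Python sketch. It assumes (the slides do not say how the mapping is stored) that characters are handled as Unicode code points, where the Gujarati block (U+0A80 to U+0AFF) and the Devanagari block (U+0900 to U+097F) follow a parallel layout, so most characters differ by a fixed offset of 0x180. The function name `gujarati_to_devanagari` is hypothetical, and characters that need special treatment, such as ળ/ळ, are not handled here.

```python
# Assumption: plain code-point arithmetic between the two Unicode blocks.
GUJARATI_START, GUJARATI_END = 0x0A80, 0x0AFF
OFFSET = 0x0A80 - 0x0900  # Gujarati block start minus Devanagari block start (0x180)


def gujarati_to_devanagari(text: str) -> str:
    """Shift every Gujarati code point to its Devanagari counterpart.

    Characters outside the Gujarati block (spaces, Latin, digits) pass through.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if GUJARATI_START <= cp <= GUJARATI_END:
            out.append(chr(cp - OFFSET))
        else:
            out.append(ch)
    return "".join(out)


print(gujarati_to_devanagari("પુસ્તક"))  # पुस्तक
```

The shared word for 'book' from the slide comes out unchanged in meaning and spelling, which is why plain transliteration is a useful last resort for cognates.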
The Rule Base
Rule base for translation:
Domain-specific monolingual data: stores typologically different words and their relations.
Domain-independent bilingual data: stores cases, pronouns, adjectives, adverbs, etc.
Substring substitution rules: store the Hindi substring corresponding to each Gujarati substring, and the location of the substring.
Stem-suffix rules: store bilingual stem and suffix rules.
Phrases: store bilingual compound words.
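One way to picture the rule base is as a small set of lookup tables. The sketch below is illustrative only: the class names, field names, and the `location` encoding are assumptions, not the authors' schema, and the domain-specific data is simplified to a direct word mapping. The sample entries are taken from the worked example in this deck.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SubstringRule:
    source: str    # Gujarati substring
    target: str    # corresponding Hindi substring
    location: str  # hypothetical encoding: "start", "end", or "any"


@dataclass
class RuleBase:
    # Domain-specific words (simplified here to a direct mapping).
    domain_words: dict[str, str] = field(default_factory=dict)
    # Domain-independent bilingual data: cases, pronouns, adjectives, adverbs.
    closed_class: dict[str, str] = field(default_factory=dict)
    substring_rules: list[SubstringRule] = field(default_factory=list)
    # Stem-suffix rules, kept as two bilingual tables.
    stems: dict[str, str] = field(default_factory=dict)
    suffixes: dict[str, str] = field(default_factory=dict)
    # Bilingual compound words.
    phrases: dict[str, str] = field(default_factory=dict)


rule_base = RuleBase(
    closed_class={"નું": "का"},
    substring_rules=[SubstringRule("ળ", "ल", "any")],
    stems={"ખીલ": "खिल"},
    suffixes={"વું": "ना"},
)
```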
Translation
Flowchart (node labels): START; Sentences in language 1; Tokenize the sentence; Phrase (Yes/No); Token Mapping Engine; Language 2 tokens; Sentences in language 2; STOP.
GH-MAP translates a text in the source language into a text in the target language, retaining a flavor of the source language.
GH-MAP uses the Token Mapping Engine for translation.
Token Mapping Engine
The Token Mapping Engine uses the Rule Base to find the match for a given token in the target language.
TME Algorithm
1. For each SL (Source Language) word (token):
2. Search for the word in the table of pronouns, cases, and adjectives. If a match is found, get the TL (Target Language) word from the same table. Go to step 7.
3. Search for the word in the table of domain-specific words. If a match is found, get the corresponding TL word from the table of TL domain-specific words. Go to step 7.
4. Remove the suffix. Search for the stem in the table of stems. If a match is found, get the TL stem and the corresponding TL suffix. Generate the TL word. Go to step 7.
5. Search repeatedly for substrings (affixes) in the SL word. If a match is found, substitute the SL substring with the corresponding TL substring. Go to step 6.
6. Transliterate the remaining non-translated characters to TL characters.
7. Next
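The lookup cascade above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the tables hold only the entries from the deck's worked example, the function names are hypothetical, substring rules ignore the stored location, and the final transliteration step assumes the fixed 0x180 offset between the Unicode Gujarati and Devanagari blocks.

```python
# Toy tables containing only entries from the worked example (assumption).
CLOSED_CLASS = {"નું": "का"}        # pronouns, cases, adjectives (step 2)
DOMAIN_WORDS: dict[str, str] = {}    # domain-specific words (step 3), empty here
STEMS = {"ખીલ": "खिल"}              # bilingual stems (step 4)
SUFFIXES = {"વું": "ना"}            # bilingual suffixes (step 4)
SUBSTRINGS = {"ળ": "ल"}              # substring substitution rules (step 5)


def transliterate(text: str) -> str:
    """Step 6: shift any remaining Gujarati characters to Devanagari."""
    return "".join(
        chr(ord(c) - 0x180) if 0x0A80 <= ord(c) <= 0x0AFF else c for c in text
    )


def map_token(token: str) -> str:
    # Step 2: closed-class lookup.
    if token in CLOSED_CLASS:
        return CLOSED_CLASS[token]
    # Step 3: domain-specific lookup.
    if token in DOMAIN_WORDS:
        return DOMAIN_WORDS[token]
    # Step 4: strip a known suffix and look up the stem.
    for sl_suffix, tl_suffix in SUFFIXES.items():
        if token.endswith(sl_suffix):
            stem = token[: -len(sl_suffix)]
            if stem in STEMS:
                return STEMS[stem] + tl_suffix
    # Step 5: substitute every matching substring.
    for sl_sub, tl_sub in SUBSTRINGS.items():
        token = token.replace(sl_sub, tl_sub)
    # Step 6: already-substituted Devanagari characters pass through untouched.
    return transliterate(token)


def translate(sentence: str) -> str:
    # Steps 1 and 7: loop over whitespace-separated tokens.
    return " ".join(map_token(t) for t in sentence.split())


print(translate("કમળ નું ખીલવું"))  # कमल का खिलना
```

Ordering the lookups from most specific (whole-word tables) to least specific (transliteration) means a token is only transliterated character by character when no rule applies to it.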
Example (‘The lotus blossoms’ (E))
Input Gujarati sentence: ‘કમળ નું ખીલવું’ / kamalYa nuM KIlavuM
Tokenize the sentence: ‘કમળ’/kamalYa + ‘નું’/nuM + ‘ખીલવું’/KIlavuM.
The tokens are given to the Token Mapping Engine.
The first token ‘કમળ’/kamalYa is translated by substituting the substring ‘ળ’/lYa with ‘ल’/la; the remaining Gujarati characters ‘કમ’/kama are transliterated to ‘कम’/kama, generating ‘कमल’/kamala.
Example (continued)
The second token ‘નું’/nuM is translated to ‘का’/kA using the Case (Karaka) table.
The third token ‘ખીલવું’/KIlavuM is translated by first removing the suffix ‘વું’/vuM to obtain the stem ‘ખીલ’/KIla. The stem is searched for in the table of stems, giving the corresponding Hindi stem ‘खिल’/Kila, and the target-language suffix corresponding to ‘વું’/vuM, i.e. ‘ना’/nA, is obtained to generate ‘खिलना’/KilanA.
Output Hindi sentence: ‘कमल का खिलना’ / kamala kA KilanA
Figure: Contribution of various approaches in translation
Evaluation
TNW: total number of words
N1: number of words not matched by the evaluation software
N2: number of words incorrectly translated as per the language expert
P1: percentage of words matched as per the evaluation software
P2: percentage of words correctly translated as per the language expert

Bilingual document   TNW    N1    N2    P1    P2
Satya                543    160   59    70%   89%
Sambhav              332    117   50    65%   85%
Samanta              370    132   44    64%   88%
TOTAL                1245   409   153   67%

Thus we can conclude that, for the given test bed, GH-MAP could produce about 88% correct translation.
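The percentages can be recomputed from the word counts. The sketch below assumes, since the slide does not give the formulas, that P1 = (TNW - N1) / TNW and P2 = (TNW - N2) / TNW; under that assumption the results agree with the reported figures to within one percentage point, and the overall P2 comes to about 87.7%, consistent with the "about 88%" stated above. The function names are hypothetical.

```python
# (TNW, N1, N2) per document, copied from the evaluation table.
DOCUMENTS = {
    "Satya": (543, 160, 59),
    "Sambhav": (332, 117, 50),
    "Samanta": (370, 132, 44),
}


def percent_ok(total_words: int, missed_words: int) -> float:
    """Share of words that were not missed, as a percentage (assumed formula)."""
    return 100.0 * (total_words - missed_words) / total_words


def summarize(documents: dict[str, tuple[int, int, int]]) -> dict[str, tuple[float, float]]:
    """Return (P1, P2) per document plus a TOTAL row computed from summed counts."""
    results = {
        name: (percent_ok(tnw, n1), percent_ok(tnw, n2))
        for name, (tnw, n1, n2) in documents.items()
    }
    total_tnw = sum(tnw for tnw, _, _ in documents.values())
    total_n1 = sum(n1 for _, n1, _ in documents.values())
    total_n2 = sum(n2 for _, _, n2 in documents.values())
    results["TOTAL"] = (percent_ok(total_tnw, total_n1), percent_ok(total_tnw, total_n2))
    return results


for name, (p1, p2) in summarize(DOCUMENTS).items():
    print(f"{name:8s} P1={p1:5.1f}%  P2={p2:5.1f}%")
```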
Conclusion
To the best of our knowledge, this is the first attempt at rule-based token mapping for the sibling language pair Hindi-Gujarati.
In this model, only lexical analysis is carried out. It requires only limited linguistic effort and tools to achieve this goal.
The test results for a small set of data are encouraging.
GH-MAP has some limitations, which need to be addressed.
Limitations
One Hindi form can map to several Gujarati forms:
karaka: का/kA (H) can be mapped to નો/no, ની/nI, નું/nuM, ના/nA (G) [of (E)].
pronoun: उसे/use (H) can be mapped to તેનો/weno, તેની/wenI, તેને/wene (G) [He/She/It (E)].
adjective: नया/nayA (H) can be mapped to નવું/navuM, નવા/navA (G) [New (E)].
Work is in progress towards overcoming these limitations. With further enhancement of the rule base, GH-MAP is expected to yield better results.
Thank You