Kalyani Patel K.S.School of Business Management,Gujarat University.

Presentation transcript:

GH-MAP: Rule Based Token Mapping for Translation between Sibling Language Pair: Gujarati-Hindi
Kalyani Patel, K.S. School of Business Management, Gujarat University. patel_kalyani_05@yahoo.co.in
Dr. Jyoti Pareek, Department of Computer Science, Gujarat University. drjyotipareek@yahoo.com

Contents
Introduction
Hindi-Gujarati: A comparative study
The Rule Base
Translation
Token Mapping Engine
Algorithm
Example
Evaluation
Conclusion
4/20/2017 ICON 2009

Introduction
GH-MAP is designed for one particular language pair, to take advantage of the similarity between the sibling languages Gujarati and Hindi.
It uses rule-based token mapping for effective word-to-word translation.
GH-MAP can be utilized for MT, CLIR, GurjerNet, and multilingual dictionaries.

Hindi-Gujarati: A comparative study
Both languages belong to the Indo-Aryan family (Hindi, Bangla, Assamese, Punjabi, Marathi, Oriya and Gujarati).
Because they belong to the same group, there is a high degree of structural similarity.
Hindi and Gujarati have bijectively mappable characters (Varna Maala), excluding ळ.
Both have relatively free word order, where the noun groups can come in any order, generally followed by the verb group.
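The character-level correspondence can be illustrated with a short Python sketch. It assumes (this is not stated in the slides) that the mapping is realized through Unicode code points: the Gujarati block (U+0A80 to U+0AFF) is laid out in parallel with the Devanagari block (U+0900 to U+097F), so most letters and vowel signs differ by a fixed offset of 0x180. The function name `guj_to_dev` is hypothetical.

```python
# Sketch only: assumes transliteration via the parallel Unicode block layout,
# which is one possible realization and not necessarily what GH-MAP does.
GUJARATI_START = 0x0A80
GUJARATI_END = 0x0AFF
OFFSET = 0x0180  # Gujarati code point minus Devanagari code point


def guj_to_dev(text: str) -> str:
    """Shift every Gujarati character to its Devanagari counterpart."""
    out = []
    for ch in text:
        cp = ord(ch)
        if GUJARATI_START <= cp <= GUJARATI_END:
            out.append(chr(cp - OFFSET))
        else:
            out.append(ch)  # leave spaces, punctuation, Latin text unchanged
    return "".join(out)


print(guj_to_dev("પુસ્તક"))  # the 'book' example from the slides
```

Under this offset the Gujarati letter ળ lands on Devanagari ळ (U+0933), the character the slide singles out as the exception, which is why it needs a separate substitution rule rather than plain transliteration.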

Continued...
Nouns in Hindi and Gujarati are inflected for case (direct or oblique), number (singular or plural), and gender (masculine or feminine). In addition, Gujarati also has a common gender.
Verbs in both languages are inflected for gender, number, person, tense, aspect, modality, formality, and voice.

Continued...
Many words in the two languages have a shared origin (Sanskrit), and because of the shared culture they usually also share meaning, e.g. ‘પુસ્તક’/puswaka (book) in Gujarati is similar to ‘पुस्तक’/puswaka in Hindi.
A sentence in one language can be mapped to a sentence in the other by substituting each word group in the source language with the appropriate word group in the target language.

The Rule Base
Rule Base for Translation:
- Domain-specific monolingual data: stores typologically different words and their relations
- Domain-independent bilingual data: stores cases, pronouns, adjectives, adverbs, etc.
- Substring substitution rules: store the Hindi substring corresponding to each Gujarati substring, and the location of the substring
- Stem-suffix rules: store bilingual stem and suffix rules
- Phrases: store bilingual compound words

Translation
[Flowchart labels: START; Sentences in language 1; Tokenize the sentence; Phrase (Yes/No); Token Mapping Engine; Language 2 tokens; Sentences in language 2; STOP]
GH-MAP translates a text in the source language into a text in the target language, retaining a flavor of the source language.
GH-MAP utilizes the Token Mapping Engine for translation.

Token Mapping Engine
The Token Mapping Engine uses the Rule Base to find the match for a given token in the target language.

TME Algorithm
For each SL (Source Language) word (token):
1. Search for the word in the table of pronouns, cases, and adjectives. If a match is found, get the TL (Target Language) word from the same table. Go to step 7.
2. Search for the word in the table of domain-specific words. If a match is found, get the corresponding TL word from the table of TL domain-specific words. Go to step 7.
3. Remove the suffix.
4. Search for the stem in the table of stems. If a match is found, get the TL stem and the corresponding TL suffix. Generate the TL word. Go to step 7.
5. Search repeatedly for substrings (affixes) in the SL word. If a match is found, substitute the SL substring with the corresponding TL substring. Go to step 6.
6. Transliterate the remaining non-translated characters into TL characters.
7. Next token.
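The lookup cascade above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the table names, the function names `map_token` and `translate`, and the use of a fixed Unicode offset for transliteration are all hypothetical, and the tables hold only the entries that appear in the worked example from the slides.

```python
# Toy rule base; every entry is taken from the worked example in the slides.
# Table names and structure are assumptions made for illustration.
FUNCTION_WORDS = {"નું": "का"}   # pronouns, cases, adjectives
DOMAIN_WORDS = {}                 # domain-specific words (empty in this toy)
STEMS = {"ખીલ": "खिल"}            # bilingual stems
SUFFIXES = {"વું": "ना"}          # bilingual suffixes
SUBSTRINGS = {"ળ": "ल"}           # substring substitution rules

OFFSET = 0x0180  # assumed: Gujarati block minus Devanagari block


def transliterate(text: str) -> str:
    """Shift remaining Gujarati characters to Devanagari (assumed method)."""
    return "".join(
        chr(ord(c) - OFFSET) if 0x0A80 <= ord(c) <= 0x0AFF else c for c in text
    )


def map_token(token: str) -> str:
    # Direct lookup in the function-word table.
    if token in FUNCTION_WORDS:
        return FUNCTION_WORDS[token]
    # Direct lookup in the domain-specific table.
    if token in DOMAIN_WORDS:
        return DOMAIN_WORDS[token]
    # Strip a known suffix and look the stem up in the stem table.
    for sl_suffix, tl_suffix in SUFFIXES.items():
        if token.endswith(sl_suffix):
            stem = token[: -len(sl_suffix)]
            if stem in STEMS:
                return STEMS[stem] + tl_suffix
    # Substitute known substrings, then transliterate whatever is left.
    for sl_sub, tl_sub in SUBSTRINGS.items():
        token = token.replace(sl_sub, tl_sub)
    return transliterate(token)


def translate(sentence: str) -> str:
    """Whitespace tokenization followed by token-by-token mapping."""
    return " ".join(map_token(tok) for tok in sentence.split())


print(translate("કમળ નું ખીલવું"))
```

The sketch tokenizes on whitespace and omits the phrase check and the substring-location information that the rule base is said to store; a fuller version would need both.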

Example (‘The lotus blossoms’ (E))
Input Gujarati sentence: ‘કમળ નું ખીલવું’ / kamalYa nuM KIlavuM
Tokenize the sentence: ‘કમળ’/kamalYa + ‘નું’/nuM + ‘ખીલવું’/KIlavuM. The tokens are given to the Token Mapping Engine.
First token: ‘કમળ’/kamalYa is translated by substituting the substring ‘ળ’/lYa with ‘ल’/la; the remaining Gujarati characters ‘કમ’/kama are transliterated to ‘कम’/kama, generating ‘कमल’/kamala.

‘कमल का खिलना’ / kamala kA KilanA
Second token: ‘નું’/nuM is translated to ‘का’/kA using the Case (Karaka) table.
Third token: ‘ખીલવું’/KIlavuM is translated by first removing the suffix ‘વું’/vuM to obtain the stem ‘ખીલ’/KIla. The stem is searched for in the table of stems and the corresponding Hindi stem ‘खिल’/Kila is obtained. The target-language counterpart of the suffix ‘વું’/vuM, i.e. ‘ना’/nA, is then attached to generate ‘खिलना’/KilanA.
Output Hindi sentence: ‘कमल का खिलना’ / kamala kA KilanA

Contribution of various approaches in translation

Evaluation
TNW: total number of words
N1: number of words not matched by the evaluation software
N2: number of words incorrectly translated as per the language expert
P1: percentage of words matched as per the evaluation software
P2: percentage of words correctly translated as per the language expert

Bilingual document   TNW    N1    N2    P1    P2
Satya                543    160   59    70%   89%
Sambhav              332    117   50    65%   85%
Samanta              370    132   44    64%   88%
TOTAL                1245   409   153   67%

Thus we can conclude that, for the given test bed, GH-MAP could produce about 88% correct translation.
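The aggregate figures can be rechecked with a few lines of Python. The slide does not state how P1 and P2 were computed; the sketch assumes P1 = (TNW - N1) / TNW and P2 = (TNW - N2) / TNW, which reproduces the reported percentages to within about one percentage point. The helper names are hypothetical.

```python
# Counts copied from the evaluation table: (TNW, N1, N2) per document.
RESULTS = {
    "Satya": (543, 160, 59),
    "Sambhav": (332, 117, 50),
    "Samanta": (370, 132, 44),
}


def percent_ok(total: int, failures: int) -> float:
    """Share of words that were not counted as failures, in percent."""
    return 100.0 * (total - failures) / total


def summarize(results):
    """Sum the three count columns and derive the overall P1 and P2."""
    tnw = sum(r[0] for r in results.values())
    n1 = sum(r[1] for r in results.values())
    n2 = sum(r[2] for r in results.values())
    return tnw, n1, n2, percent_ok(tnw, n1), percent_ok(tnw, n2)


tnw, n1, n2, p1, p2 = summarize(RESULTS)
print(tnw, n1, n2, round(p1), round(p2))
```

With these assumed formulas the totals come out as 1245, 409 and 153 words, with an overall P1 of about 67% and an overall P2 of about 88%, matching the TOTAL row and the figure quoted in the conclusion of the slide.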

Conclusion
To the best of our knowledge, this is the first attempt at rule-based token mapping for the sibling language pair Hindi-Gujarati.
In this model, only lexical analysis is carried out. It requires only limited linguistic effort and tools to achieve the stated goal.
The test results for a small set of data are encouraging.
There are some limitations of GH-MAP, which need to be addressed.

Limitations
karaka: का/kA (H) can be mapped to નો/no, ની/nI, નું/nuM, or ના/nA (G) [of (E)].
pronoun: उसे/use (H) can be mapped to તેનો/weno, તેની/wenI, or તેને/wene (G) [He/She/It (E)].
adjective: नया/nayA (H) can be mapped to નવું/navuM or નવા/navA (G) [New (E)].
Work is in progress towards overcoming these limitations.
With further enhancement of the rule base, GH-MAP is expected to yield better results.
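The ambiguity described above can be made concrete with a small Python sketch. It assumes a Gujarati-to-Hindi lookup table holding only the karaka and adjective forms listed on this slide; the table and the helper `invert` are hypothetical, and the sketch only exposes the one-to-many problem without resolving it.

```python
from collections import defaultdict

# Assumed Gujarati -> Hindi entries, limited to forms listed in the slides.
GUJ_TO_HIN = {
    "નો": "का",
    "ની": "का",
    "નું": "का",
    "ના": "का",
    "નવું": "नया",
    "નવા": "नया",
}


def invert(table):
    """Group source words by target word to expose one-to-many mappings."""
    inverted = defaultdict(list)
    for source, target in table.items():
        inverted[target].append(source)
    return dict(inverted)


HIN_TO_GUJ = invert(GUJ_TO_HIN)
for hindi, candidates in HIN_TO_GUJ.items():
    print(hindi, "->", candidates)
```

A plain table lookup cannot choose among the candidates; picking one would need information beyond the token itself, which is consistent with the slide's remark that only lexical analysis is carried out in the current model.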

Thank You