Automatic Translation of Nominal Compound into Hindi Prashant Mathur IIIT Hyderabad Soma Paul IIIT Hyderabad.



OUTLINE
• What is a Nominal Compound (NC)?
• Translation variation of English NC into Hindi
• Motivation
• Approach
• Results
• Future Work
• Bibliography

Nominal Compound
A construct of two or more nouns in which the rightmost noun is the head and the preceding nouns are its modifiers.
• Oil pump: a device used to pump oil
• Customer satisfaction indices: indices that indicate the satisfaction rate of customers
Two-word nominal compounds are the object of study here.

Frequency of NC in English Corpus (Baldwin et al 2004)

Corpus    Words   NC Frequency
BNC       84M     2.6%
Reuters   108M    3.9%


Variation in translating English NC into Hindi
• As a nominal compound: ‘Hindu texts’ → hindU SastroM; ‘milk production’ → dugdha utpAdana
• As a genitive construction: ‘rice husk’ → cAval kI bhUsI; ‘room temperature’ → kamare ka tApamAna
• As one word: ‘cow dung’ → gobar
• As an adjective-noun construction: ‘nature cure’ → prAkratik cikitsA; ‘hill camel’ → pahARI UMTa
• As another syntactic phrase: ‘wax work’ → mom par kalAkArI (‘work on wax’); ‘body pain’ → SarIr meM dard (‘pain in body’)
• Others: ‘hand luggage’ → haat meM le jaaye jaane vaale saamaan


Motivation
Issues in translation:
• Choice of the appropriate target lexeme during lexical substitution
• Selection of the right target construct type
NCs as a class occur frequently in a corpus, but each individual compound occurs only a few times.
NCs are too varied to be precompiled into an exhaustive list of translated candidates.

Therefore …
NCs have to be handled on the fly. Translating NCs from English into Hindi is thus a challenging NLP task.

With Google translator
Tested on the same dataset that was used to evaluate our system.

Translation formation       Precision
Overall                     45%
Eng NC → Hindi NC           29%
Eng NC → Hindi Genitive     10%
Others                      6%


Approach
1. Translation template generation
2. Extraction of NCs from the English corpus
3. Sense disambiguation of components
4. Lexical substitution of the component nouns using a bilingual dictionary
5. Preparing translation candidates
6. Corpus search of translation candidates and their ranking

Translation Template Generation
A survey of 50,000 sentences of parallel corpora yielded the following construction types.

Construction Type                  No. of occurrences   Percentage
Nominal Compound                   …                    … %
Genitive                           …                    … %
Long Phrases                       …                    …
Adjective Noun Phrase              …                    … %
Single Word                        …                    … %
Transliterated Nominal Compound    …                    … %
None                               …                    … %

Some Templates
A total of 44 templates were formed; some of them are shown below.
• Nominal Compound: H1 H2
• Genitive: H1 kA H2, H1 ke H2, H1 kI H2
• Long Phrases: H1 pe H2, H1 meM H2, H1 par H2, H1 ke xvArA H2, H1 se prApwa H2
• Adjective: H1-ikA H2
• Single-Word: H1


Extraction
Corpus (7000 raw sentences) → Tree-Tagger [1] → extracted Noun-Noun [2] formations (1584 occurrences) → randomly selected 1000 NCs
[1] Tree-Tagger is a POS tagger that also outputs the lemma of each word. Output format: word_POS-tag_lemma, e.g. rods → rods_NNS_rod
[2] As stated earlier, only Noun-Noun formations are considered nominal compounds.
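The extraction step can be illustrated with a minimal Python sketch. It assumes the tagger output has already been flattened into whitespace-separated `word_POS_lemma` tokens as in the `rods_NNS_rod` example, that `NN` and `NNS` are the noun tags of interest, and that a two-word NC is a run of exactly two adjacent nouns. The function names are hypothetical and are not taken from the slides.

```python
NOUN_TAGS = {"NN", "NNS"}  # assumption: common-noun tags only


def parse_token(token):
    """Split a 'word_POS_lemma' token into its three fields."""
    word, pos, lemma = token.rsplit("_", 2)
    return word, pos, lemma


def extract_noun_noun(tagged_sentence):
    """Return (modifier, head) pairs for runs of exactly two adjacent nouns."""
    tokens = [parse_token(t) for t in tagged_sentence.split()]
    pairs = []
    run = []
    # A sentinel non-noun token flushes a noun run at the end of the sentence.
    for word, pos, _lemma in tokens + [("", "", "")]:
        if pos in NOUN_TAGS:
            run.append(word)
        else:
            if len(run) == 2:
                pairs.append((run[0], run[1]))
            run = []
    return pairs


sentence = ("Campaigns_NNS_campaign for_IN_for road_NN_road "
            "safety_NN_safety are_VBP_be organized_VVN_organize")
print(extract_noun_noun(sentence))
```

Runs of three or more nouns are skipped here because the study is restricted to two-word compounds.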


Lexical Substitution

Step 3: Sense Disambiguation of components
Purpose: to reduce the number of translation candidates.
Example: Campaigns for road safety are organized to keep everyone safer on the Indian roads

Noun Component   No. of WN senses   Sense selected   Synset
Road             2                  #1               …
Safety           6                  #2               …

WordNet::SenseRelate by Ted Pedersen.
80% accuracy in the case of NC disambiguation.


Lexical Substitution
How do we now translate the disambiguated components into Hindi?
We do not have a direct WordNet mapping from English to Hindi, so we use an alternative method to translate.

Step 4: Lexical Substitution
Acquire all possible translations for all the words within a synset.

English word   Hindi translations
Road           path, maarg, saDak, raastaa
Route          maarg, saDak, raastaa
Safety         ahAnikArakatA, suraksita sthAna, suraksA, salAmatI, suraksA sAdhana
Refuge         ASraya sthAna, ASraya, sahArA, SaraNa, CipanA

Contd…
Select the Hindi words that are common translations of all English words in a synset, if there are any.
• Road / Route: the selected words are maarg, saDak, raastaa.
• Safety / Refuge: the two words share no translation, so all words are selected (ahAnikArakatA, suraksita sthAna, suraksA, salAmatI, suraksA sAdhana, ASraya sthAna, ASraya, sahArA, SaraNa, CipanA).
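The selection rule can be written as a set intersection with a fall-back to the union. The sketch below is a minimal Python illustration; the function name and the dictionary layout (English word mapped to a list of Hindi translations) are assumptions made for the example, not part of the original system.

```python
def select_translations(synset_translations):
    """Pick Hindi translations for one synset.

    synset_translations maps each English word of the synset to its list of
    Hindi translations. Translations shared by every word are preferred;
    if none is shared, all translations are kept.
    """
    lists = list(synset_translations.values())
    if not lists:
        return []
    common = set(lists[0]).intersection(*lists[1:])
    if common:
        # Keep the order of the first word's translation list.
        return [w for w in lists[0] if w in common]
    # No shared translation: keep everything, without duplicates.
    seen, selected = set(), []
    for translations in lists:
        for w in translations:
            if w not in seen:
                seen.add(w)
                selected.append(w)
    return selected


road = {
    "road": ["path", "maarg", "saDak", "raastaa"],
    "route": ["maarg", "saDak", "raastaa"],
}
safety = {
    "safety": ["ahAnikArakatA", "suraksita sthAna", "suraksA", "salAmatI", "suraksA sAdhana"],
    "refuge": ["ASraya sthAna", "ASraya", "sahArA", "SaraNa", "CipanA"],
}
print(select_translations(road))
print(len(select_translations(safety)))
```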


Step 5: Preparing Translation Candidates
For “road safety”, the candidates generated from the templates include: mArga para surakRA, mArga surakRA, SaDak para surakRA, SaDak kI surakRA
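Candidate preparation amounts to filling every template with every pair of selected translations. The following Python sketch assumes templates are written with the placeholders `H1` and `H2` separated by spaces; the short template list is only a sample, and suffixing templates such as `H1-ikA H2` are deliberately left out because they need morphological handling that the slides do not describe.

```python
from itertools import product

# Sample of space-separated templates (the full system used 44).
SAMPLE_TEMPLATES = ["H1 H2", "H1 kA H2", "H1 ke H2", "H1 kI H2", "H1 par H2", "H1 meM H2"]


def fill_template(template, h1, h2):
    """Replace the H1 and H2 placeholders token by token."""
    slots = {"H1": h1, "H2": h2}
    return " ".join(slots.get(tok, tok) for tok in template.split())


def generate_candidates(modifier_translations, head_translations, templates=SAMPLE_TEMPLATES):
    """Return one candidate per (template, modifier, head) combination."""
    return [
        fill_template(t, h1, h2)
        for t, h1, h2 in product(templates, modifier_translations, head_translations)
    ]


candidates = generate_candidates(["mArga", "saDaka"], ["surakRA"])
print(len(candidates))
print(candidates[:3])
```

The number of candidates grows as templates × modifier translations × head translations, which is why pruning senses before this step matters.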


Step 6: Corpus Search
• Hindi corpus (raw): 28 million words
• Indexed search with pattern matching
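The slides do not say how the index is built, so the following Python sketch is only one plausible arrangement: it assumes a whitespace-tokenised corpus, indexes every n-gram up to four tokens in a counter, and orders the candidates that occur by raw frequency. The toy sentences and function names are invented for the illustration.

```python
from collections import Counter


def build_ngram_index(sentences, max_n=4):
    """Count every n-gram (1..max_n tokens) in a whitespace-tokenised corpus."""
    index = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                index[" ".join(tokens[i:i + n])] += 1
    return index


def rank_by_count(candidates, index):
    """Keep candidates that occur in the corpus, most frequent first."""
    hits = [(c, index[c]) for c in candidates if index[c] > 0]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Toy corpus, invented for the example.
toy_corpus = [
    "cunAva ke samaya sabhI vyaswa hEM",
    "cunAva ke samaya",
    "cunAva kA samaya",
]
index = build_ngram_index(toy_corpus)
print(rank_by_count(["cunAva kA samaya", "cunAva ke samaya", "mArga kI surakRA"], index))
```

A candidate that never occurs, such as the last one in the toy call, simply drops out of the ranked list.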

Example
• election time → cunAva ke samaya
• temple community → maMxira kA samAja
• marriage customs → vivAha kI praWA
• …
But we did not find any translation for road safety: road safety → Ф

CTQ (Corpus-based Translation Quality)
• Rates a given translation candidate for both
  - the fully specified translation, and
  - its parts in the context of the translation template in question.

$$\mathrm{CTQ}(w_1^H, w_2^H, t) = \alpha\,P(w_1^H, w_2^H, t) + \beta\,P(w_1^H, t)\,P(w_2^H, t)\,P(t)$$

• $t$ is the translation template used
• $w_1^H, w_2^H$ are the translations of the components of the NC
• $\alpha = 1$, $\beta = 0$ if $P(w_1^H, w_2^H, t) > 0$ (we did not vary the constants $\alpha$, $\beta$)

Contd..
Example: road safety
• $P(w_1^H, w_2^H, t) = 0$
• road → mArga, mArga ke, mArga meM, saDaka, saDaka par, …
• safety → surakRA, ke surakRA, meM surakRA, … and so on
• $P(\text{mArga}, \text{meM}) \cdot P(\text{meM}, \text{surakRA}) \cdot P(\text{meM}) = (2.28 \times 10^{-5}) \cdot (9.14 \times 10^{-6}) \cdot 0.286 \approx 6 \times 10^{-11}$
• P(mArga, kI) · P(kI, surakRA) · P(kI) = (1.35 × …) · (… × …) · 0.228 = 1.17 × …
• Higher probability for “mArga kI surakRA”
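The score can be reproduced with a few lines of Python. The slide only states that α = 1, β = 0 when the full candidate is attested; the sketch below additionally assumes α = 0, β = 1 otherwise, which is what the road safety example appears to use. The function and parameter names are hypothetical.

```python
def ctq(p_joint, p_w1_t, p_w2_t, p_t, alpha=None, beta=None):
    """Corpus-based Translation Quality for one candidate.

    p_joint : P(w1, w2, t), probability of the fully specified translation
    p_w1_t  : P(w1, t), first component in the context of template t
    p_w2_t  : P(w2, t), second component in the context of template t
    p_t     : P(t), probability of the template itself
    """
    if alpha is None or beta is None:
        # Assumption: fall back to the component term only when the
        # full candidate was never seen in the corpus.
        alpha, beta = (1.0, 0.0) if p_joint > 0 else (0.0, 1.0)
    return alpha * p_joint + beta * p_w1_t * p_w2_t * p_t


# "mArga meM surakRA": the full phrase is unattested, so the back-off term applies.
score = ctq(0.0, 2.28e-5, 9.14e-6, 0.286)
print(f"{score:.2e}")
```

Passing explicit `alpha` and `beta` values would allow the interpolation to be varied, which the slides list as future work.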

Ranking
• Baseline ranking: count-based ranking
• A stronger ranking measure: CTQ (borrowed from Baldwin and Tanaka (2004))

Results

Contd..
Measure taken to improve recall: use the genitive as the default construct when no translation is found for an NC.
Motivation:
• We conducted an experiment on the development data.
• We verified whether the NCs for which no translation was found during corpus search can legitimately be translated as a genitive construct.
• We found that the heuristic works in 59% of the cases.
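The default-genitive heuristic is a simple fall-back around the ranked list. In the Python sketch below, the choice of genitive marker (kA, ke or kI) is left as a parameter, since the slides do not say how the marker is picked; the default value and the function name are assumptions for illustration only.

```python
def translate_with_fallback(ranked_candidates, h1, h2, genitive_marker="kA"):
    """Return the best corpus-attested candidate, else a default genitive.

    ranked_candidates : candidates found in the corpus, best first
    h1, h2            : Hindi translations of the modifier and the head
    genitive_marker   : assumed default; in practice the marker has to
                        agree with the head noun
    """
    if ranked_candidates:
        return ranked_candidates[0]
    return f"{h1} {genitive_marker} {h2}"


print(translate_with_fallback(["cunAva ke samaya"], "cunAva", "samaya"))
print(translate_with_fallback([], "mArga", "surakRA", genitive_marker="kI"))
```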

Results
• Using the genitive as the default construct where the system fails to produce a translation

Related work
Similar approaches (searching for translation templates in a corpus) were adopted in:
• Bungum and Oepen (2009) for Norwegian-to-English nominal compound translation
• Tanaka and Baldwin (2004) for English-to-Japanese nominal compound translation and vice versa

Conclusion
Novelty of our approach: using a WSD tool on the source language to select the correct sense of the nominal components.
Result: the number of possible translation candidates to be searched in the target-language corpus is significantly reduced.

Future Work
• Multinary NC translation
• Using the semantic features provided in the UW-Dictionary
• Varying α and β in the ranking technique to produce more effective results

Bibliography
• Tanaka and Timothy Baldwin. Translation by Machine of Complex Nominals: Getting it Right.
• Tanaka, Takaaki and Timothy Baldwin. Translation Selection for Japanese-English Noun-Noun Compounds.
• Rackow, Ido Dagan, Ulrike Schwall. Automatic Translation of Noun Compounds.
• Bungum, Oepen. Norwegian to English nominal compound translation.