
Automatic Translation of Nominal Compound into Hindi Prashant Mathur IIIT Hyderabad Soma Paul IIIT Hyderabad.




1 Automatic Translation of Nominal Compound into Hindi Prashant Mathur IIIT Hyderabad Soma Paul IIIT Hyderabad

2 OUTLINE
- What is a Nominal Compound (NC)?
- Translation variation of English NC into Hindi
- Motivation
- Approach
- Results
- Future Work
- Bibliography

3 Nominal Compound
A construct of two or more nouns. The rightmost noun is the head; the preceding nouns are its modifiers.
- oil pump: a device used to pump oil
- customer satisfaction indices: indices that indicate the satisfaction rate of customers
Only two-word nominal compounds are studied here.

4 Frequency of NC in English Corpus (Baldwin et al. 2004)
Corpus    Words   NC Frequency
BNC       84M     2.6%
Reuters   108M    3.9%


6 Variation in translating English NC into Hindi
- As nominal compound: ‘Hindu texts’ → hindU SastroM, ‘milk production’ → dugdha utpAdana
- As genitive construction: ‘rice husk’ → cAval kI bhUsI, ‘room temperature’ → kamare ka tApamAna
- As one word: ‘cow dung’ → gobar
- As adjective-noun construction: ‘nature cure’ → prAkratik cikitsA, ‘hill camel’ → pahARI UMTa
- As other syntactic phrase: ‘wax work’ → mom par kalAkArI (‘work on wax’), ‘body pain’ → SarIr meM dard (‘pain in body’)
- Others: ‘hand luggage’ → haat meM le jaaye jaane vaale saamaan


8 Motivation
Issues in translation:
- choosing the appropriate target lexeme during lexical substitution; and
- selecting the right target construct type.
NCs as a class occur frequently in a corpus, but each individual compound occurs only a few times.
NCs are too varied to be precompiled into an exhaustive list of translated candidates.

9 Therefore …
NCs have to be handled on the fly. Translating NCs from English into Hindi is therefore a challenging NLP task.

10 With Google translator
Tested on the same dataset that was used to evaluate our system.
Translation formation       Precision
Overall                     45%
Eng NC → Hindi NC           29%
Eng NC → Hindi Genitive     10%
Others                      6%


12 Approach
1. Translation template generation
2. Extraction of NC from English corpus
3. Sense disambiguation of components
4. Lexical substitution of the component nouns using a bilingual dictionary
5. Preparing translation candidates
6. Corpus search of translation candidates and their ranking

13 Translation Template Generation
We surveyed 50,000 sentences of parallel corpora and found the following construction types.
Construction Type                  No. of occurrences   Percentage
Nominal Compound                   3959                 42.9%
Genitive                           1976                 21.4%
Long Phrases                       581                  6.284%
Adjective Noun Phrase              557                  6.024%
Single Word                        766                  8.285%
Transliterated Nominal Compound    1208                 13.065%
None                               199                  2.152%

14 Some Templates
A total of 44 templates were formed; some of them are shown below.
- Nominal Compound: H1 H2
- Genitive: H1 kA H2, H1 ke H2, H1 kI H2
- Long Phrases: H1 pe H2, H1 meM H2, H1 par H2, H1 ke xvArA H2, H1 se prApwa H2
- Adjective: H1-ikA H2
- Single Word: H1


16 Extraction
Corpus (7000 raw sentences) → Tree-Tagger [1] → extracted Noun-Noun formations [2] (1584 occurrences) → randomly selected 1000 NCs
[1] Tree-Tagger is a POS tagger that also outputs the lemma of each word: word → Tree-Tagger → word_POS-tag_lemma, e.g. rods → rods_NNS_rod
[2] As stated earlier, only Noun-Noun formations are considered nominal compounds.
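The extraction step can be sketched roughly as follows. This is a minimal illustration, not the authors' code: it assumes the tagger output has already been flattened into `word_POS_lemma` tokens as in the `rods_NNS_rod` example, that common-noun tags start with `NN`, and that a pair counts only when it is not part of a longer noun sequence (since only two-word compounds are studied). The function names are hypothetical.

```python
def parse_token(token):
    """Split a 'word_POS_lemma' token into its three fields."""
    word, pos, lemma = token.rsplit("_", 2)
    return word, pos, lemma


def is_noun(pos):
    # Assumption: common-noun tags start with "NN" (NN, NNS).
    return pos.startswith("NN")


def extract_noun_noun(tagged_tokens):
    """Return (modifier_lemma, head_lemma) pairs for isolated Noun-Noun sequences."""
    parsed = [parse_token(t) for t in tagged_tokens]
    compounds = []
    for i in range(len(parsed) - 1):
        _, pos1, lemma1 = parsed[i]
        _, pos2, lemma2 = parsed[i + 1]
        if not (is_noun(pos1) and is_noun(pos2)):
            continue
        # Skip pairs embedded in a longer run of nouns (three or more words).
        prev_is_noun = i > 0 and is_noun(parsed[i - 1][1])
        next_is_noun = i + 2 < len(parsed) and is_noun(parsed[i + 2][1])
        if not prev_is_noun and not next_is_noun:
            compounds.append((lemma1, lemma2))
    return compounds


# Hand-written toy input in the assumed token format.
sentence = ["steel_NN_steel", "rods_NNS_rod", "are_VBP_be", "strong_JJ_strong"]
pairs = extract_noun_noun(sentence)
```

Using lemmas rather than surface forms keeps plural and singular occurrences of the same compound together, which matters because individual compounds are rare.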


18 Lexical Substitution

19 Step 3: Sense Disambiguation of components
Purpose: to reduce the number of translation candidates.
Example: Campaigns for road safety are organized to keep everyone safer on the Indian roads
Noun Component   No. of WN senses   Sense selected   Synset
Road             2                  #1               {road, route}
Safety           6                  #2               {safety, refuge}

20 WordNet::SenseRelate by Ted Pedersen. 80% accuracy in the case of NC disambiguation.


22 Lexical Substitution
How do we translate the selected senses into Hindi? There is no direct WordNet mapping from English to Hindi, so we use an alternative method.

23 Step 4: Lexical Substitution
Acquire all possible translations for all the words within a synset.
Road     path, maarg, saDak, raastaa
Route    maarg, saDak, raastaa
Safety   ahAnikArakatA, suraksita sthAna, suraksA, salAmatI, suraksA sAdhana
Refuge   ASraya sthAna, ASraya, sahArA, SaraNa, CipanA

24 Contd.
Select the Hindi words that are common translations of all English words of a synset, if there are any.
Road     path, maarg, saDak, raastaa
Route    maarg, saDak, raastaa
Selected words: maarg, saDak, raastaa
Safety   ahAnikArakatA, suraksita sthAna, suraksA, salAmatI, suraksA sAdhana
Refuge   ASraya sthAna, ASraya, sahArA, SaraNa, CipanA
No common translation, so all words are selected.
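The selection rule on this slide amounts to a set intersection with a fall-back to the union. Below is a minimal sketch under that reading; the function name and the dictionary layout are hypothetical, and the translations are copied from the slide.

```python
def select_translations(synset_translations):
    """Keep translations shared by every word of the synset; else keep them all.

    synset_translations maps each English synset member to its list of
    Hindi dictionary translations.
    """
    lists = list(synset_translations.values())
    common = set(lists[0]).intersection(*lists[1:])
    if common:
        # Preserve the dictionary order of the first synset member.
        return [w for w in lists[0] if w in common]
    # No shared translation: fall back to every translation, without duplicates.
    selected = []
    for translations in lists:
        for w in translations:
            if w not in selected:
                selected.append(w)
    return selected


road = select_translations({
    "road": ["path", "maarg", "saDak", "raastaa"],
    "route": ["maarg", "saDak", "raastaa"],
})
safety = select_translations({
    "safety": ["ahAnikArakatA", "suraksita sthAna", "suraksA", "salAmatI", "suraksA sAdhana"],
    "refuge": ["ASraya sthAna", "ASraya", "sahArA", "SaraNa", "CipanA"],
})
```

For the road/route synset the intersection drops "path"; for safety/refuge nothing is shared, so all ten translations go forward to candidate generation.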


26 Step 5: Preparing Translation Candidates
For “road safety”, the candidates generated from the templates are: mArga para surakRA, mArga surakRA, SaDak para surakRA, SaDak kI surakRA, ...
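Candidate preparation can be pictured as crossing every selected translation of the modifier (H1) and head (H2) with every template. The sketch below is illustrative only: it uses six of the 44 templates, handles only whitespace-separated slots (so the suffixing adjective template `H1-ikA H2` is left out), and the function names are hypothetical.

```python
from itertools import product

# A few of the templates from the "Some Templates" slide (H1 = modifier, H2 = head).
TEMPLATES = ["H1 H2", "H1 kA H2", "H1 ke H2", "H1 kI H2", "H1 par H2", "H1 meM H2"]


def fill_template(template, h1, h2):
    """Replace the H1/H2 slots of a whitespace-separated template."""
    slots = {"H1": h1, "H2": h2}
    return " ".join(slots.get(tok, tok) for tok in template.split())


def generate_candidates(h1_words, h2_words, templates=TEMPLATES):
    """Cross every modifier translation, head translation and template."""
    return [
        fill_template(t, h1, h2)
        for h1, h2, t in product(h1_words, h2_words, templates)
    ]


# Two modifier translations and one head translation, for illustration.
candidates = generate_candidates(["mArga", "saDaka"], ["surakRA"])
```

The number of candidates is the product of the three list sizes, which is why reducing the translations per component through sense disambiguation shrinks the corpus search so much.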


28 Step 6: Corpus Search
Hindi corpus (raw): 28 million words
Indexed search with pattern matching

29 Example
election time → cunAva ke samaya
temple community → maMxira kA samAja
marriage customs → vivAha kI praWA
…
But we did not find any translation for road safety → ∅

30 CTQ (Corpus-based Translation Quality)
Rates a given translation candidate on both
- the fully specified translation, and
- its parts in the context of the translation template in question.
$$\mathrm{CTQ}(w_1^H, w_2^H, t) = \alpha\, P(w_1^H, w_2^H, t) + \beta\, P(w_1^H, t)\, P(w_2^H, t)\, P(t)$$
- $t$ is the translation template used
- $w_1^H, w_2^H$ are the translations of the components of the NC
- $\alpha = 1$, $\beta = 0$ if $P(w_1^H, w_2^H, t) > 0$ (no variation of the constants $\alpha$, $\beta$ was performed)

31 Contd.
Example: road safety
- $P(w_1^H, w_2^H, t) = 0$
- road → mArga, mArga ke, mArga meM, saDaka, saDaka par, …
- safety → surakRA, ke surakRA, meM surakRA, … and so on
- $P(\text{mArga}, \text{meM}) \times P(\text{meM}, \text{surakRA}) \times P(\text{meM}) = (2.28 \times 10^{-5}) \times (9.14 \times 10^{-6}) \times 0.286 \approx 6 \times 10^{-11}$
- $P(\text{mArga}, \text{kI}) \times P(\text{kI}, \text{surakRA}) \times P(\text{kI}) = (1.35 \times 10^{-5}) \times (3.82857143 \times 10^{-5}) \times 0.228 \approx 1.17 \times 10^{-10}$
- Higher probability for “mArga kI surakRA”
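The back-off computation in this example can be reproduced with a few lines. This is a sketch, not the authors' implementation: the slides only state α = 1, β = 0 when the full candidate is attested, so the choice α = 0, β = 1 for the unattested case is an assumption inferred from the worked example, and the function name is hypothetical. The probabilities are the ones quoted on the slide.

```python
def ctq(p_joint, p_w1_t, p_w2_t, p_t):
    """CTQ = alpha * P(w1, w2, t) + beta * P(w1, t) * P(w2, t) * P(t)."""
    if p_joint > 0:
        alpha, beta = 1.0, 0.0  # stated on the slide
    else:
        alpha, beta = 0.0, 1.0  # assumption: back off to the component term
    return alpha * p_joint + beta * p_w1_t * p_w2_t * p_t


# "road safety": neither full candidate was found in the corpus, so p_joint = 0.
scores = {
    "mArga meM surakRA": ctq(0.0, 2.28e-5, 9.14e-6, 0.286),
    "mArga kI surakRA": ctq(0.0, 1.35e-5, 3.82857143e-5, 0.228),
}
best = max(scores, key=scores.get)
```

Because the joint probability is zero for both candidates, the ranking is decided entirely by the component term, which favours the genitive template with kI.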

32 Ranking
- Baseline ranking: count-based ranking
- A stronger ranking measure: CTQ (borrowed from Baldwin and Tanaka (2004))
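The count-based baseline can be sketched as a whole-phrase pattern match over the target corpus followed by a sort on frequency. The slides do not describe the indexed search itself, so the plain regular-expression scan, the function name and the three-line toy corpus below are all assumptions made for illustration.

```python
import re


def count_rank(candidates, corpus_sentences):
    """Rank candidates by how often they occur as whole phrases in the corpus."""
    ranked = []
    for cand in candidates:
        # Match the candidate only when bounded by whitespace or line edges.
        pattern = re.compile(r"(?<!\S)" + re.escape(cand) + r"(?!\S)")
        count = sum(len(pattern.findall(s)) for s in corpus_sentences)
        if count > 0:
            ranked.append((cand, count))
    # Most frequent first; candidates never seen in the corpus are dropped.
    return sorted(ranked, key=lambda pair: -pair[1])


# Made-up toy corpus, only to exercise the function.
toy_corpus = ["cunAva ke samaya", "cunAva ke samaya", "cunAva kA samaya"]
ranking = count_rank(
    ["cunAva samaya", "cunAva kA samaya", "cunAva ke samaya"], toy_corpus
)
```

A candidate with zero hits simply disappears from the ranking, which is the situation CTQ is meant to handle by backing off to the components.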

33 Results
14 50 24 46.1 24.6 53.6 19 56.2 28 54.1 28.5 62.1

34 Contd.
Measure taken to improve recall: use the genitive as the default construct when no translation is found for an NC.
Motivation: in an experiment on the development data, we checked whether the NCs for which corpus search found no translation can legitimately be translated as a genitive construct. The heuristic works in 59% of cases.

35 Results
24.8 54 44.5 57
Using the genitive as the default construct where the system fails to produce a translation

36 Related work
Similar approaches (searching for translation templates in a corpus) were adopted in:
- Bungum and Oepen (2009), for Norwegian to English nominal compound translation
- Tanaka and Baldwin (2004), for English to Japanese nominal compound translation and vice versa

37 Conclusion
Novelty of our approach: a WSD tool is applied to the source language to select the correct sense of the nominal components.
Result: the number of translation candidates to be searched in the target-language corpus is significantly reduced.

38 Future Work
- Multinary NC translation
- Using the semantic features provided in the UW-Dictionary
- Varying α and β in the ranking technique to produce more effective results

39 Bibliography
- Tanaka and Timothy Baldwin. Translation by Machine of Complex Nominals: Getting it Right.
- Tanaka, Takaaki and Timothy Baldwin. Translation Selection for Japanese-English Noun-Noun Compounds.
- Rackow, Ido Dagan, Ulrike Schwall. Automatic Translation of Noun Compounds.
- Bungum, Oepen. Norwegian to English nominal compound translation.

