English-Indian Language Machine Translation, Anuvadaksh Phase-II. The SMT Team, CDAC Mumbai. 02/19/13


1 English-Indian Language Machine Translation, Anuvadaksh Phase-II. The SMT Team, CDAC Mumbai (02/19/13)

2 English-Indian Language Machine Translation (MT): Anuvadaksh (Phase-1 Background)
DIT Funded: Consortium Mode Project (10 institutes)
Objective
- Deploy Eng-Indian Lang MT system using 4 engines
- Develop language resources and tools for 2 domains
Phase-I: Achievements
- Deployed an MT system for 3 pairs (English to Hindi, Marathi, Bengali)
- Two of the four engines gave comparable translations
  - CDAC Mumbai: Statistical Machine Translation Engine (first SMT engine to be developed in India under TDIL purview)
  - CDAC Pune (Consortium Leader): Tree Adjoining Grammar Engine
- CDAC Mumbai: Language Resources
  - 15000 sentence corpora developed (Eng-Mar)
  - 3000 word synset creation
- Test Report evaluated by GIST, CDAC Pune

3 English-Indian Language Machine Translation (MT): Anuvadaksh (Phase-2) [Jul 2010 - Jul 2013]
CDACM Objective:
- Extend the MT system deployed in Phase-I (especially the SMT engine)
- Improve the SMT engine using reordering and factored models
- Introduce language knowledge with the help of language verticals (conceptualising the hybrid approach)
- Develop language resources in the form of a bilingual corpus for the health domain
Team Members: Rajnath Patel, Rohit Gupta, Ritesh Shah

4 Anuvadaksh-II Financial Status
- Total Budget Outlay: Rs. 14,99,20,000
- CDACM Budget: Rs. 98,79,000 for 3 years (2010-2013)
- Total funds released up to 31st Dec 2011: Rs. 31,68,000
- Total expenditure up to 31st Dec 2011: Rs. 52,46,767
- Deficit incurred: Rs. 20,78,767 (to be adjusted against grant-in-aid, 2012-13)
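The deficit figure on this slide can be cross-checked from the other two amounts. The snippet below is only a small arithmetic check; the variable names are illustrative and the amounts are copied from the slide.

```python
# Amounts in rupees, copied from the slide (Indian digit grouping kept via underscores).
funds_released = 31_68_000   # released up to 31st Dec 2011
expenditure = 52_46_767      # spent up to 31st Dec 2011

# Deficit is expenditure minus funds released.
deficit = expenditure - funds_released
print(deficit)  # 2078767, i.e. Rs. 20,78,767 as stated on the slide
```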

5 Anuvadaksh-II Tasks completed (1)
- Existing web-service mode changed
- Integration of the improved SMT subsystem with the Anuvadaksh system completed successfully
- Development of consistent APIs for easy integration with the EILMT system
- Reordered models for Marathi added
- Integration of all three language modules

6 Anuvadaksh-II Tasks completed (2)
- Multi-word expressions (MWE) annotation task
  - Classification of about 1000 words completed
  - Wordnet based dictionary extraction completed
- Report on a pipeline-like architecture for overall improvement of the system prepared
  - Consolidation of all the components represented as factors by various language verticals added
  - Roles and responsibilities for the respective institutes assigned
- TDIL task: Submitted a reference (equivalent to 25 books) to bilingually aligned corpora from the Sahitya Akademi website

7 Anuvadaksh-II: Tasks completed (3)
[Figure: SMT engine, Advanced TM module (components or factors could vary across languages)]
Inputs: bilingual corpus, corpus resources
Pre-processing components and their outputs:
- Morphological Analysis: Morph tagged data
- POS Tagger: POS tagged data
- Name Entity Recognition: NE tagged data
- Multi Word Expression Extraction: MWE data
- Word Sense Disambiguation: WSD processed data
- UNL Tagger: UNL tagged data
- TAG module: TAG annotated data
- Clause marker: Clause marked data
- Syntactic reordering component: Reordered data
Core: TM estimation module, TM probability phrase table, Decoder Core

8 Enhancement of Existing modules:
E6_H6: Enhancement of SMT engine (C-DAC, Mumbai & IIT-Bombay)
- The work was carried out by C-DAC Mumbai for English-Hindi/Marathi and English-Bangla (JU) as baseline systems using the SMT approach; these have been integrated into the system and will be released for testing.
- Consortium institutes have tried their algorithms in many horizontal approaches. A group of consortium members who showed interest has been formed to work on the SMT horizontal tasks.
- Source pre-processing will be carried out on a factored basis for MA, POS, NER, WSD, MWE, UNL semantic mapping, Semantic TAG features and Clause boundary marking.
- Pre-processing techniques like source re-ordering and transliteration will be used for translation model improvements.
- The Moses decoder and GIZA++ training tools will be used for the remaining five language pairs: English to Oriya, Urdu, Tamil, Gujarati & Bodo.
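Factored source pre-processing means each source token carries its annotations along with it. Moses reads factored input as space-separated tokens whose factors are joined by a vertical bar. The sketch below shows one way to build such a line; the helper name `to_factored`, the sample sentence and the tag values are illustrative assumptions, not the project's actual pipeline.

```python
def to_factored(tokens, *factor_layers, sep="|"):
    """Join each token with its aligned factors, e.g. word|POS|NE."""
    for layer in factor_layers:
        if len(layer) != len(tokens):
            raise ValueError("each factor layer must align with the tokens")
    for value in list(tokens) + [v for layer in factor_layers for v in layer]:
        if sep in value or " " in value:
            # The separator and spaces are structural, so they must be escaped upstream.
            raise ValueError(f"unescaped separator or space in {value!r}")
    rows = zip(tokens, *factor_layers)
    return " ".join(sep.join(row) for row in rows)


# Illustrative annotations for a made-up tourism-style sentence.
tokens = ["tourists", "visit", "Mumbai"]
pos_tags = ["NNS", "VBP", "NNP"]
ne_tags = ["O", "O", "LOC"]

line = to_factored(tokens, pos_tags, ne_tags)
print(line)  # tourists|NNS|O visit|VBP|O Mumbai|NNP|LOC
```

Further factors (for example WSD or MWE labels) would be passed as additional aligned lists, which matches the slide's note that the set of factors may differ by language.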

9 Anuvadaksh-II: Tasks completed (4)
Enhancement of SMT engine (C-DAC, Mumbai & IIT-Bombay)
Source Pre-processing and responsible institutes:

Source Pre-processing        Institutes responsible
MA                           JU
POS                          Stanford POS
NER                          IIT-B NER
MWE                          IIT-B, CDAC-M, JU
WSD                          IIT-B
UNL semantic mapping         IIT-B
TAG Parsed output            CDAC-P
Clause boundary marking      IIIT-H

10 Anuvadaksh-II: Tasks completed (5)
Enhancement of SMT engine (Contd...) (C-DAC Mumbai, IIT-Bombay): Target Pre-processing & Language model
Column headings, in slide order:
- Target Pre-processing: MA (segmentation & case marker); POS; NER (JU); MWE (IIT-B); WSD (IIT-B); Source re-ordering (CDAC-M); Transliteration (IIIT-A)
- Language Model: LM Development (CDAC-M)
Entries per language pair, in slide order:
English
E-Hindi       IIIT-H (ILMT) | IIIT-H
E-Marathi     IIT-B (ILMT) | IIT-B
E-Bangla      JU (ILMT) | JU
E-Tamil       AU (ILMT) | AU
E-Urdu        IIIT-A (ILMT) | IIIT-A
E-Oriya       UU, CDAC-P | UU (IIIT-BHU, CLIA) | UU, CDAC-P
E-Gujarati    DDU | DDU (DICT, CLIA) | DDU
E-Bodo        NEHU, CDAC-P | NEHU, (CLIA) | NEHU, CDAC-P

11 Anuvadaksh-II Tasks completed (6)
LMs created using various smoothing techniques:
- Hindi (15000 sentences + BBC monolingual corpus)
- Marathi (13000 sentences)
- Bengali (14000 sentences)
- Tamil (14000 sentences)
- Gujarati (2000 sentences)
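The slide does not name the smoothing techniques that were used. As a minimal illustration of what smoothing does in an n-gram language model, the sketch below implements add-k (Laplace) smoothing for a bigram model; the technique, the function names and the toy corpus are assumptions chosen for brevity, and production toolkits typically offer stronger methods such as Kneser-Ney.

```python
from collections import Counter


def train_bigram_counts(sentences):
    """Count bigram histories and bigrams, with sentence boundary markers."""
    history_counts, bigram_counts = Counter(), Counter()
    vocab = {"</s>"}  # every token that can be predicted
    for sent in sentences:
        words = sent.split()
        vocab.update(words)
        toks = ["<s>"] + words + ["</s>"]
        history_counts.update(toks[:-1])
        bigram_counts.update(zip(toks[:-1], toks[1:]))
    return history_counts, bigram_counts, vocab


def prob_add_k(prev, word, history_counts, bigram_counts, vocab, k=1.0):
    """P(word | prev) with add-k smoothing over the known vocabulary."""
    numerator = bigram_counts[(prev, word)] + k
    denominator = history_counts[prev] + k * len(vocab)
    return numerator / denominator


# Toy monolingual corpus, purely illustrative.
corpus = ["the fort is old", "the temple is old", "the fort is large"]
hist, bigrams, vocab = train_bigram_counts(corpus)

seen = prob_add_k("the", "fort", hist, bigrams, vocab)
unseen = prob_add_k("the", "old", hist, bigrams, vocab)
print(seen > unseen > 0)  # True: unseen bigrams keep a small non-zero probability
```

With a corpus as small as the 2000 Gujarati sentences listed above, many valid bigrams never occur in training, which is the situation smoothing is meant to handle.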

12 Anuvadaksh-II Achievements
- Reordered + factored (improvements for Hindi)
  - Source side factor (POS)
  - BLEU (Non-Factored): 32.45
  - BLEU (Factored): 32.93
- Reordered Baseline (good for Marathi)
- Standardized XML log format updated as per the requirements
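The slides do not describe the reordering rules themselves. English is a subject-verb-object language while Hindi and Marathi are largely subject-object-verb, so source-side reordering generally tries to bring the English word order closer to the target before translation. The toy sketch below only illustrates that idea with a single hypothetical rule (move verbs to the end of the clause); the function name, the Penn-style tags and the example are assumptions, not the project's reordering component.

```python
def reorder_svo_to_sov(tagged):
    """Toy rule: move verb tokens to the clause end, keeping final punctuation last."""
    body = list(tagged)
    trailing = []
    # Peel off sentence-final punctuation so it stays at the very end.
    while body and body[-1][1] == ".":
        trailing.insert(0, body.pop())
    verbs = [tok for tok in body if tok[1].startswith("VB")]
    others = [tok for tok in body if not tok[1].startswith("VB")]
    return others + verbs + trailing


# Hypothetical POS-tagged input (word, tag).
sentence = [("Ram", "NNP"), ("eats", "VBZ"), ("mangoes", "NNS"), (".", ".")]
reordered = reorder_svo_to_sov(sentence)
print(" ".join(word for word, _ in reordered))  # Ram mangoes eats .
```

A real component would work per clause and handle auxiliaries, prepositions and embedded clauses, which this single rule ignores.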

13 Anuvadaksh-II Achievements
Publication: Learning Improved Models for Urdu, Farsi and Italian using SMT. Rohit Gupta, Raj N. Patel and Ritesh Shah. Proceedings of the first workshop on Reordering for Statistical Machine Translation, COLING 2012, Mumbai, India, December 8-15, 2012
- Applying statistical MT techniques to learn improved reordering models
- Study of correlation between reordering and distortion parameters for English-Urdu pair among others

14 Anuvadaksh-II Achievements
SMT Improvements (Hindi)
Corpus: EILMT Tourism Corpus (approx 15000 sentences)

GRADE POINT (0-4)    Version 2.0 (Feb 2013)    Version 1.0 (July 2012)
4 (>=80%)            39%                       14%
3 (60%-79%)          26%                       18%
2 (40%-59%)          25%                       26%
1 (20%-39%)          10%                       37%
0 (<20%)             0                         5%

15 Anuvadaksh-II Achievements
SMT Improvements (Marathi)
Corpus: EILMT Tourism Corpus (approx 13000 sentences)

GRADE POINT (0-4)    Baseline (Eval 1)    Baseline (Eval 2)    Reordered
4 (>=80%)            24%                  20%                  10%
3 (60%-79%)          26%                  23%                  31%
2 (40%-59%)          15%                  25%                  43%
1 (20%-39%)          34%                  25%                  16%
0 (<20%)             1%                   7%                   0
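One way to compare the grade distributions on the Hindi and Marathi evaluation slides is to collapse each column into a weighted mean grade point. This summary statistic is not part of the original evaluation; it is an assumption introduced here for comparison, with percentages copied from the two tables and illustrative names.

```python
def mean_grade(distribution):
    """Weighted mean grade point; distribution maps grade (0-4) to percent of sentences."""
    total = sum(distribution.values())
    return sum(grade * pct for grade, pct in distribution.items()) / total


# Percentages copied from the slides (grade point -> % of sentences).
systems = {
    "Hindi v1.0 (July 2012)": {4: 14, 3: 18, 2: 26, 1: 37, 0: 5},
    "Hindi v2.0 (Feb 2013)": {4: 39, 3: 26, 2: 25, 1: 10, 0: 0},
    "Marathi baseline (Eval 1)": {4: 24, 3: 26, 2: 15, 1: 34, 0: 1},
    "Marathi baseline (Eval 2)": {4: 20, 3: 23, 2: 25, 1: 25, 0: 7},
    "Marathi reordered": {4: 10, 3: 31, 2: 43, 1: 16, 0: 0},
}

for name, dist in systems.items():
    assert sum(dist.values()) == 100  # each column should account for all sentences
    print(f"{name}: {mean_grade(dist):.2f}")
```

A mean hides the shape of the distribution, so it is best read alongside the tables: for Marathi, for example, the reordered system has fewer sentences in both the top and the bottom grades than the baselines.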

16 Anuvadaksh-II Future Plan
- Use factored model in the Statistical MT engine to enhance translations for all languages in the tourism domain
- For the health domain specifically, obtain translations using existing resources and evaluate basic coverage of grammar for this domain
- The entire system with its hybrid approach has to be deployed efficiently and the outputs have to be sent to the testing team at CDAC Pune.

17 Thank you

