English – Indian Language Machine Translation: Anuvadaksh Phase – II. The SMT Team, CDAC Mumbai. (02/19/13, English-Indian Language MT (Phase-II))

Presentation transcript:

Slide 1: English – Indian Language Machine Translation, Anuvadaksh Phase – II. The SMT Team, CDAC Mumbai

Slide 2: English-Indian Language Machine Translation (MT): Anuvadaksh (Phase-1 Background)
DIT funded: consortium-mode project (10 institutes)
Objective:
- Deploy an English-Indian language MT system using 4 engines
- Develop language resources and tools for 2 domains
Phase-I achievements:
- Deployed an MT system for 3 pairs (English to Hindi, Marathi, Bengali)
- Two of the four engines gave comparable translations
- CDAC Mumbai: Statistical Machine Translation engine, the first SMT engine to be developed in India under TDIL purview
- CDAC Pune (consortium leader): Tree Adjoining Grammar engine
- CDAC Mumbai: language resources
  - sentence corpora developed (Eng-Mar)
  - 3000 word synset creation
- Test report evaluated by GIST, CDAC Pune

Slide 3: English-Indian Language Machine Translation (MT): Anuvadaksh (Phase-2) [Jul Jul 2013]
CDACM objectives:
- Extend the MT system deployed in Phase-I (especially the SMT engine)
- Improve the SMT engine using reordering and factored models
- Introduce language knowledge with the help of language verticals (conceptualising the hybrid approach)
- Develop language resources in the form of a bilingual corpus for the health domain
Team members: Rajnath Patel, Rohit Gupta, Ritesh Shah

Slide 4: Anuvadaksh-II Financial Status
- Total budget outlay: Rs. 14,99,20,000
- CDACM budget: Rs. 98,79,000 for 3 years ( )
- Total funds released up to 31st Dec 2011: Rs. 31,68,000
- Total expenditure up to 31st Dec 2011: Rs. 52,46,767
- Deficit incurred: Rs. 20,78,767 (to be adjusted against grant-in-aid, )
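The deficit figure on this slide is simply expenditure minus funds released. A minimal sketch that checks the arithmetic, assuming the amounts use Indian digit grouping (lakh and crore); the helper names `parse_rupees` and `format_rupees` are made up for this illustration:

```python
def parse_rupees(text: str) -> int:
    """Parse an amount such as 'Rs. 52,46,767' into an integer number of rupees."""
    return int(text.replace("Rs.", "").replace(",", "").strip())


def format_rupees(amount: int) -> str:
    """Format with Indian grouping: the last three digits, then groups of two."""
    digits = str(amount)
    if len(digits) <= 3:
        return "Rs. " + digits
    head, tail = digits[:-3], digits[-3:]
    groups = []
    while len(head) > 2:
        groups.insert(0, head[-2:])
        head = head[:-2]
    groups.insert(0, head)
    return "Rs. " + ",".join(groups + [tail])


released = parse_rupees("Rs. 31,68,000")   # funds released up to 31st Dec 2011
spent = parse_rupees("Rs. 52,46,767")      # expenditure up to 31st Dec 2011
deficit = spent - released

print(format_rupees(deficit))
```

The result matches the deficit stated on the slide.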

Slide 5: Anuvadaksh-II Tasks completed (1)
- Existing web-service mode changed
- Integration of the improved SMT subsystem with the Anuvadaksh system completed successfully
- Development of consistent APIs for easy integration with the EILMT system
- Reordered models for Marathi added
- Integration of all three language modules

Slide 6: Anuvadaksh-II Tasks completed (2)
- Multi-word expressions (MWE) annotation task
- Classification of about 1000 words completed
- Wordnet-based dictionary extraction completed
- Report on a pipeline-like architecture for overall improvement of the system prepared
- Consolidation of all the components represented as factors by various language verticals added
- Roles and responsibilities for the respective institutes assigned
- TDIL task: submitted a reference (equivalent to 25 books) to bilingually aligned corpora from the Sahitya Akademi website

Slide 7: Anuvadaksh-II: Tasks completed (3)
[Diagram] SMT engine: Advanced TM module (components or factors could vary across languages)
Diagram labels:
- Input: bilingual corpus (corpus resources)
- Annotation components and the data each produces:
  - Morphological Analysis: morph tagged data
  - POS Tagger: POS tagged data
  - Named Entity Recognition: NE tagged data
  - Multi Word Expression Extraction: MWE data
  - Word Sense Disambiguation: WSD processed data
  - UNL Tagger: UNL tagged data
  - TAG module: TAG annotated data
  - Clause marker: clause marked data
  - Syntactic reordering component: reordered data
- Core TM estimation module
- TM probability phrase table
- Decoder

Slide 8: Enhancement of existing modules
E6_H6: Enhancement of SMT engine (C-DAC Mumbai & IIT-Bombay)
- C-DAC Mumbai built baseline systems using the SMT approach for English-Hindi/Marathi and English-Bangla (JU). They have been integrated into the system and will be released for testing.
- Consortium institutes have tried their own algorithms in many horizontal approaches. A group of consortium members who showed interest has been formed to work on the SMT horizontal tasks.
- Source pre-processing will be carried out on a factored basis for MA, POS, NER, WSD, MWE, UNL semantic mapping, semantic TAG features and clause boundary marking.
- Pre-processing techniques such as source re-ordering and transliteration will be used to improve the translation model.
- The Moses decoder and GIZA++ training tools will be used for the remaining five language pairs: English to Oriya, Urdu, Tamil, Gujarati and Bodo.
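Moses reads factored input as whitespace-separated tokens whose factors are joined by a vertical bar, for example `word|POS|NE`. The sketch below shows how per-token annotations could be merged into that form. The function name `to_factored`, the choice of factors and the tag values are assumptions for illustration, not the consortium's actual pipeline:

```python
from typing import Dict, Sequence

FACTOR_SEP = "|"  # delimiter between the factors of one token in Moses input


def to_factored(tokens: Sequence[str],
                factors: Dict[str, Sequence[str]],
                order: Sequence[str]) -> str:
    """Join each surface token with its annotation factors, in the given order."""
    for name in order:
        if len(factors[name]) != len(tokens):
            raise ValueError(f"factor '{name}' does not align with the tokens")
    out = []
    for i, token in enumerate(tokens):
        parts = [token] + [factors[name][i] for name in order]
        # A factor must be non-empty and must not contain the separator or spaces.
        if any(p == "" or FACTOR_SEP in p or " " in p for p in parts):
            raise ValueError(f"invalid factor value at token {i}")
        out.append(FACTOR_SEP.join(parts))
    return " ".join(out)


# Hypothetical annotations for one source sentence.
tokens = ["she", "visits", "Mumbai"]
annotations = {
    "pos": ["PRP", "VBZ", "NNP"],   # e.g. output of a POS tagger
    "ner": ["O", "O", "LOC"],       # e.g. output of an NER module
}
line = to_factored(tokens, annotations, order=["pos", "ner"])
print(line)  # she|PRP|O visits|VBZ|O Mumbai|NNP|LOC
```

Keeping the factor order explicit makes it easy to vary the set of factors per language pair, as the slides say the components may differ across languages.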

Slide 9: Anuvadaksh-II: Tasks completed (4)
Enhancement of SMT engine (C-DAC Mumbai & IIT-Bombay)
Source pre-processing and responsible institutes:

Source pre-processing        Institutes responsible
MA                           JU
POS                          Stanford POS
NER                          IIT-B NER
MWE                          IIT-B, CDAC-M, JU
WSD                          IIT-B
UNL semantic mapping         IIT-B
TAG parsed output            CDAC-P
Clause boundary marking      IIIT-H

Slide 10: Anuvadaksh-II: Tasks completed (5)
Enhancement of SMT engine (contd.) (C-DAC Mumbai, IIT-Bombay): target pre-processing and language model
Column headings:
- Target pre-processing: MA (segmentation & case marker); POS; NER (JU); MWE (IIT-B); WSD (IIT-B); Source re-ordering (CDAC-M); Transliteration (IIIT-A)
- Language model: LM development (CDAC-M)
Institute entries per row, in the order they appear on the slide:
- English
- E-Hindi: IIIT-H (ILMT); IIIT-H
- E-Marathi: IIT-B (ILMT); IIT-B
- E-Bangla: JU (ILMT); JU
- E-Tamil: AU (ILMT); AU
- E-Urdu: IIIT-A (ILMT); IIIT-A
- E-Oriya: UU, CDAC-P; UU (IIIT-BHU, CLIA); UU, CDAC-P
- E-Gujarati: DDU; DDU (DICT, CLIA); DDU
- E-Bodo: NEHU, CDAC-P; NEHU, (CLIA); NEHU, CDAC-P
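Source re-ordering rearranges the English sentence into an order closer to the target language before training and decoding. English is largely subject-verb-object, while Hindi and Marathi are largely verb-final. The toy rule below only illustrates the idea: it moves the first contiguous verb group to the end of the sentence. It is not the method used by CDAC Mumbai, and the Penn-Treebank-style tags are assumed to come from some POS tagger:

```python
from typing import List, Tuple

Tagged = Tuple[str, str]  # (word, POS tag)


def reorder_svo_to_sov(tagged: List[Tagged]) -> List[str]:
    """Move the first contiguous verb group to the end, before final punctuation."""
    verb_positions = [i for i, (_, tag) in enumerate(tagged) if tag.startswith("VB")]
    if not verb_positions:
        return [word for word, _ in tagged]

    start = verb_positions[0]
    end = start
    # Extend over adjacent verbs so that groups like "has visited" stay together.
    while end + 1 < len(tagged) and tagged[end + 1][1].startswith("VB"):
        end += 1

    verb_group = tagged[start:end + 1]
    rest = tagged[:start] + tagged[end + 1:]

    # Keep sentence-final punctuation in last position.
    if rest and rest[-1][1] == ".":
        reordered = rest[:-1] + verb_group + rest[-1:]
    else:
        reordered = rest + verb_group
    return [word for word, _ in reordered]


sentence = [("Ram", "NNP"), ("eats", "VBZ"), ("mangoes", "NNS"), (".", ".")]
print(reorder_svo_to_sov(sentence))  # ['Ram', 'mangoes', 'eats', '.']
```

Real systems typically apply such rules per clause over a parse tree rather than over the flat tag sequence, which is one reason clause boundary marking appears among the pre-processing steps.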

Slide 11: Anuvadaksh-II Tasks completed (6)
LMs created using various smoothing techniques:
- Hindi (15000 sentences + BBC monolingual corpus)
- Marathi (13000 sentences)
- Bengali (14000 sentences)
- Tamil (14000 sentences)
- Gujarati (2000 sentences)
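Smoothing matters for corpora of this size because most word sequences in new text never occurred in training. The slide does not name the smoothing techniques used, so the sketch below uses add-k smoothing on a bigram model purely as the simplest illustration. LM toolkits usually offer stronger methods such as Kneser-Ney. The class name and the toy corpus are made up:

```python
from collections import Counter
from typing import Iterable, Sequence


class AddKBigramLM:
    """Bigram language model with add-k smoothing (illustrative only)."""

    def __init__(self, sentences: Iterable[Sequence[str]], k: float = 1.0):
        self.k = k
        self.history_counts = Counter()  # how often a word occurs as a bigram history
        self.bigram_counts = Counter()
        self.vocab = set()
        for sentence in sentences:
            tokens = ["<s>"] + list(sentence) + ["</s>"]
            self.vocab.update(tokens)
            for prev, cur in zip(tokens, tokens[1:]):
                self.history_counts[prev] += 1
                self.bigram_counts[(prev, cur)] += 1

    def prob(self, prev: str, cur: str) -> float:
        """P(cur | prev) with k added to every bigram count."""
        vocab_size = len(self.vocab)
        numerator = self.bigram_counts[(prev, cur)] + self.k
        denominator = self.history_counts[prev] + self.k * vocab_size
        return numerator / denominator


corpus = [
    ["the", "temple", "is", "old"],
    ["the", "fort", "is", "old"],
]
lm = AddKBigramLM(corpus, k=1.0)
print(lm.prob("the", "temple"))  # seen bigram
print(lm.prob("the", "old"))     # unseen bigram still gets non-zero probability
```

With smoothing, an unseen bigram receives a small non-zero probability instead of zeroing out the score of the whole sentence.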

Slide 12: Anuvadaksh-II Achievements
- Reordered + factored (improvements for Hindi)
  - Source-side factor: POS
  - BLEU (Non-Factored) :
  - BLEU (Factored) :
- Reordered baseline (good for Marathi)
- Standardized XML log format updated as per the requirements
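BLEU, the metric used here to compare the factored and non-factored systems, combines clipped n-gram precisions with a brevity penalty. The sketch below is a simplified sentence-level version with a single reference and no smoothing, written only to show how the score is built. Reported BLEU scores are normally computed at corpus level with a standard tool, and the example sentences are invented:

```python
import math
from collections import Counter
from typing import Sequence


def ngram_counts(tokens: Sequence[str], n: int) -> Counter:
    """Count all n-grams of length n in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate: Sequence[str], reference: Sequence[str], max_n: int = 4) -> float:
    """Unsmoothed single-reference BLEU for one sentence, in the range 0 to 1."""
    if not candidate:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand = ngram_counts(candidate, n)
        ref = ngram_counts(reference, n)
        total = sum(cand.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
        if total == 0 or clipped == 0:
            return 0.0  # without smoothing, one zero precision zeroes the score
        log_precision_sum += math.log(clipped / total)
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_precision_sum / max_n)


reference = "the fort is old".split()
candidate = "the fort is very old".split()
print(bleu(candidate, reference, max_n=2))
print(bleu(reference, reference))
```

In the first call the unigram precision is 4/5 and the bigram precision is 2/4, so the score is the geometric mean of the two. The candidate is longer than the reference, so no brevity penalty applies.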

Slide 13: Anuvadaksh-II Achievements
Publication: "Learning Improved Models for Urdu, Farsi and Italian using SMT", Rohit Gupta, Raj N. Patel and Ritesh Shah. Proceedings of the First Workshop on Reordering for Statistical Machine Translation, COLING 2012, Mumbai, India, December 8-15, 2012.
- Applying statistical MT techniques to learn improved reordering models
- Study of correlation between reordering and distortion parameters for the English-Urdu pair, among others

Slide 14: Anuvadaksh-II Achievements
SMT improvements (Hindi)
Corpus: EILMT Tourism Corpus (approx sentences)

Grade point (0-4)    Version 2.0 (Feb 2013)    Version 1.0 (July 2012)
4 (>=80%)            39%                       14%
3 (60%-79%)          26%                       18%
2 (40%-59%)          25%                       26%
1 (20%-39%)          10%                       37%
0 (<20%)             0                         5%
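One way to compare the two Hindi versions is to collapse each grade distribution into a weighted mean grade and into the share of sentences graded 3 or higher. Neither summary appears on the slide. Both are derived here from the reported percentages only, and the function names are made up:

```python
GRADES = [4, 3, 2, 1, 0]  # grade points, in the row order of the Hindi table

# Percentage of sentences per grade, copied from the Hindi table.
hindi = {
    "Version 2.0 (Feb 2013)": [39, 26, 25, 10, 0],
    "Version 1.0 (July 2012)": [14, 18, 26, 37, 5],
}


def mean_grade(percentages):
    """Weighted mean grade point for one column of the table."""
    if sum(percentages) != 100:
        raise ValueError("percentages should sum to 100")
    return sum(g * p for g, p in zip(GRADES, percentages)) / 100


def share_at_least(percentages, min_grade):
    """Percentage of sentences graded min_grade or higher."""
    return sum(p for g, p in zip(GRADES, percentages) if g >= min_grade)


for version, column in hindi.items():
    print(version, mean_grade(column), share_at_least(column, 3))
```

On these figures the mean grade rises from 1.99 for Version 1.0 to 2.94 for Version 2.0, and the share of sentences graded 3 or higher rises from 32% to 65%.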

Slide 15: Anuvadaksh-II Achievements
SMT improvements (Marathi)
Corpus: EILMT Tourism Corpus (approx sentences)

Grade point (0-4)    Baseline (Eval 1)    Baseline (Eval 2)    Reordered
4 (>=80%)            24%                  20%                  10%
3 (60%-79%)          26%                  23%                  31%
2 (40%-59%)          15%                  25%                  43%
1 (20%-39%)          34%                  25%                  16%
0 (<20%)             1%                   7%                   0

Slide 16: Anuvadaksh-II Future Plan
- Use the factored model in the statistical MT engine to enhance translations for all languages in the tourism domain
- For the health domain specifically, obtain translations using existing resources and evaluate basic coverage of grammar for this domain
- The entire system with its hybrid approach has to be deployed efficiently and the outputs have to be sent to the testing team at CDAC Pune

Slide 17: Thank you