SYSTRAN
Challenges and Recent Advances in Hybrid Machine Translation
Jean Senellart, Jin Yang, Jens Stephan
2008 – copyright SYSTRAN


Overview
  SYSTRAN – 40 years of innovation
  The MT Challenges
  SYSTRANLab Projects
  Hybrid Engines
  From Research to Products
  CWMT08
  Conclusions

SYSTRAN
  40 years of history
  Located in Paris (La Défense) and San Diego
  More than 70 employees: ~20 linguists, ~30 engineers, including 10 PhDs

Core Technology
  Core technology is “Rule-Based”
    Based on language description
    Analysis – Transfer – Generation paradigm
  Builds a “syntax tree” based on hierarchical constituents with multi-level relationships
  Multi-pass analysis
    Morphology Analysis
    Homograph Resolution
    Clause Boundary
    Syntagm Identification
    Syntactic Role Identification
    …
  Relies heavily on linguistic resources
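The multi-pass analysis listed on this slide can be pictured as a chain of passes that each annotate the same sentence structure. The following is only a minimal sketch under stated assumptions: the `Sentence` class, the two toy passes, and the determiner rule are hypothetical illustrations, not SYSTRAN's actual data structures or rules.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Sentence:
    """Hypothetical working structure that every pass annotates in turn."""
    tokens: List[str]
    annotations: Dict[str, list] = field(default_factory=dict)
    trace: List[str] = field(default_factory=list)


def morphology_analysis(s: Sentence) -> None:
    # Toy stand-in for morphology: use the lowercased token as its lemma.
    s.annotations["lemma"] = [t.lower() for t in s.tokens]


DETERMINERS = {"the", "a", "an"}


def homograph_resolution(s: Sentence) -> None:
    # Toy rule (an assumption): a word right after a determiner is a noun.
    lemmas = s.annotations["lemma"]
    tags = []
    for i, lemma in enumerate(lemmas):
        if lemma in DETERMINERS:
            tags.append("DET")
        elif i > 0 and lemmas[i - 1] in DETERMINERS:
            tags.append("NOUN")
        else:
            tags.append("UNK")
    s.annotations["pos"] = tags


# Later passes (clause boundary, syntagm identification, ...) would be
# appended here; each one may read what the earlier passes wrote.
PASSES = [morphology_analysis, homograph_resolution]


def analyze(tokens: List[str]) -> Sentence:
    s = Sentence(list(tokens))
    for analysis_pass in PASSES:
        analysis_pass(s)
        # Recording each pass keeps the process traceable.
        s.trace.append(analysis_pass.__name__)
    return s
```

Calling `analyze(["The", "Record", "plays"])` tags "Record" as a noun because it follows a determiner, and the `trace` list shows which passes ran and in what order.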


Languages

  Chinese      882    Korean           78
  Arabic       422    Italian          62
  Spanish      358    Ukrainian        47
  English      350    Polish           42
  Hindi        325    Dutch            23
  Portuguese   250    Serbo-Croatian   21
  Russian      170    Greek            18
  French       130    Czech            12
  Japanese     125    Albanian          6
  Urdu         100    Slovak            6
  German       100
  Farsi         82
                                     3600

  22 source languages
  70 language pairs
  Dictionaries: 200K-1M entries per LP
  ~6M-entry reference multi-source / multi-target dictionary

SYSTRAN Activity
  Retail products
    Windows Desktop Product
    SYSTRAN Mobile on PDA
    Mac OS Dashboard Widget
  Online Services
    SYSTRAN Box, SYSTRAN Net, SYSTRAN Links
  Corporate customers
    Symantec, Cisco, Verizon, Ford, Daimler, Chemical Abstract…
  Institutional Customers
    EC and US agencies
  Portals - Online Translation
    “Babel Fish”, Google, Yahoo!, Microsoft Live, …

MT Challenges: RBMT/SMT Strengths and Weaknesses - I
  A rule-based system builds a translation with available linguistic resources (dictionaries, rules)
  Human-built resources
  Incremental
  Track the translation process
  Predictable output
  Some phenomena are hard to formalize
  Need semantic/pragmatic knowledge
  Not designed to deal with exceptions to the rules
    … which are very frequent

MT Challenges: RBMT/SMT Strengths and Weaknesses - II
  A statistical system finds a translation within a choice of many, many possible translations
  Very easy to build
  Automatic training process
  Knowledge acquisition is easy…
  Not limited to predefined linguistic patterns – “phrase”
  … but cannot “understand” or generalize information
  Not even elementary rules
  Output is “unpredictable”

MT Challenges: Corpus-Based or Rule-Based Approach?
  No conflict between “corpus” and “rule-based” approaches
  Possible to learn rules
    Already learns terminology – monolingual and multilingual
    Some approaches acquire complex rules
  Possible to find the best translation amongst several translations
    “Decoding” can be constrained by syntactic restrictions
  Linguistic rules, but corpus drives!
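The idea of picking the best translation among several while letting syntactic restrictions constrain the choice can be sketched as filtered selection over scored candidates. Everything here is an assumption for illustration: the function names, the scores, and the toy restriction that an English output should not end in a determiner. A real decoder would apply constraints during search rather than on a finished list.

```python
from typing import Callable, List, Tuple

Candidate = Tuple[str, float]  # (translation, model score); higher is better


def ends_with_determiner(text: str) -> bool:
    # Toy syntactic restriction (hypothetical): flag outputs that stop
    # on a determiner, which is rarely a complete English phrase.
    words = text.lower().split()
    return bool(words) and words[-1] in {"the", "a", "an"}


def pick_translation(
    candidates: List[Candidate],
    is_allowed: Callable[[str], bool],
) -> str:
    """Return the best-scoring candidate that satisfies the restriction.

    If no candidate satisfies it, fall back to the best-scoring one
    overall. Assumes `candidates` is non-empty.
    """
    allowed = [c for c in candidates if is_allowed(c[0])]
    pool = allowed if allowed else candidates
    return max(pool, key=lambda c: c[1])[0]
```

With candidates `[("he reads the", 0.9), ("he reads the book", 0.7)]` and the restriction `lambda t: not ends_with_determiner(t)`, the lower-scored but well-formed candidate is selected: the statistical score drives the choice, and the rule only prunes it.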

SYSTRANLab
  Research Projects Overview
  Toward Hybrid Engines
  Collaborations
  Statistical Post-Edition
  Lattice Decoding
  Source Analysis Adaptation
  From Research to Products

Research Projects
  Resources Acquisition
    Consolidating a 6M entry multilingual dictionary
    Acquiring more from corpus – lexicon and rules
  Linguistic Development
    Entity Recognition with local grammars
    Autonomous Generation modules
    Introduction of corpus-based technology
  Applications
    More interactive applications
    Professional Post-Edition Module (POEM)

SYSTRANLab Research Projects: The Phoenix Project
  Collaboration with P. Koehn (University of Edinburgh)
  Introduce corpus-based decision modules in SYSTRAN
  Specialized modules
    Word Sense Disambiguation
    Lattice Generation
    Preposition / Determiner Choice

SYSTRANLab Research Projects: The Sphinx Project
  Collaboration with CNRC
  Sequential use of SYSTRAN and statistical engines (Statistical Post-Edition)
  GALE (DARPA Project)
  Participated in WMT07, NIST08
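The sequential use of a rule-based engine followed by a statistical engine can be sketched as two chained functions. Statistical post-editing is commonly described as training a statistical model to "translate" rule-based output into reference translations, so the training pairs are (RBMT output, reference). The sketch below assumes that setup; the function names and the toy engines are hypothetical stand-ins, not SYSTRAN or CNRC components.

```python
from typing import Callable, List, Tuple

Translate = Callable[[str], str]


def build_spe_training_pairs(
    bitext: List[Tuple[str, str]],
    rbmt_translate: Translate,
) -> List[Tuple[str, str]]:
    """Turn (source, reference) pairs into (RBMT output, reference) pairs.

    A statistical engine trained on these pairs learns to correct the
    rule-based output instead of translating from the source language.
    """
    return [(rbmt_translate(src), ref) for src, ref in bitext]


def hybrid_translate(
    source: str,
    rbmt_translate: Translate,
    post_edit: Translate,
) -> str:
    # Sequential combination: rule-based first, statistical second.
    return post_edit(rbmt_translate(source))


# Toy stand-ins so the sketch runs end to end (purely illustrative).
def toy_rbmt(source: str) -> str:
    return {"le chat noir": "the cat black"}.get(source, source)


def toy_post_edit(text: str) -> str:
    return text.replace("cat black", "black cat")
```

Here `hybrid_translate("le chat noir", toy_rbmt, toy_post_edit)` first produces the literal rule-based output and then lets the post-editing step fix the word order.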

SYSTRANLab Research Projects: The Pegasus Project
  Collaboration with H. Schwenk (Université du Maine)
  Introduce linguistic knowledge in statistical engines
  Participated in WMT08

SYSTRANLab: Hybrid Engines
  Introduce Self-Learning capability
    Learn “post-edition rules”
  Deep integration of statistical decision modules
  Insert linguistic knowledge in statistical engines
  HYBRID

CWMT08
  Chinese-English MT evaluation
    Primary: RBMT+SPE
    Contrast: RBMT
  Started in 1994, 1.2M terms, S&T-focus
  [Results table; scores missing from the transcript]
    Columns: BLEU4, BLEU4-SBP, NIST5, GTM, mWER, mPER, ICT
    Rows: Primary-a, Contrast-b
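Among the metric columns named on this slide, mWER is usually read as multi-reference word error rate: the word-level edit distance between a hypothesis and its closest reference, normalized by that reference's length. The sketch below follows that common reading and is not the CWMT08 scoring tool; whitespace tokenization and per-reference normalization are assumptions, and official scorers may differ.

```python
from typing import List, Sequence


def edit_distance(hyp: Sequence[str], ref: Sequence[str]) -> int:
    """Word-level Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        cur = [i]
        for j, r in enumerate(ref, start=1):
            cur.append(min(
                prev[j] + 1,             # hypothesis word with no counterpart
                cur[j - 1] + 1,          # reference word missing from hypothesis
                prev[j - 1] + (h != r),  # match or substitution
            ))
        prev = cur
    return prev[-1]


def mwer(hypothesis: str, references: List[str]) -> float:
    """Lowest word error rate of the hypothesis against any reference.

    Assumes whitespace tokenization and non-empty references.
    """
    hyp = hypothesis.split()
    return min(
        edit_distance(hyp, ref.split()) / len(ref.split())
        for ref in references
    )
```

For example, `mwer("the cat sat", ["the cat sat down", "a cat sat"])` compares against both references and keeps the lower rate. Lower is better for error rates, unlike BLEU or NIST, where higher is better.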

CWMT08: SPE Usage
  SPE module trained on 1.8M sentences
    CWMT08 training data not used
  Not only translation but also annotation by RBMT
    Dates, numerals, etc.
  Transfer model is filtered
    Exclusion of “bad rules” by rule-based filtering
    Examples are “random” quotes, entities appearing
  Some expressions are “protected”
    Constituents will be replaced with placeholders before SPE
    Translated with RBMT
    Re-injected in translation after SPE
  SPE model for CWMT08 is trained using GIZA++, with decoding using Moses (
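The protection step on this slide (replace constituents with placeholders before SPE, re-inject them afterwards) can be sketched as a mask / unmask pair around the post-editing call. This is a minimal sketch under stated assumptions: the regular expression covers only ISO-style dates and plain numerals, the `__PHn__` placeholder format is invented for the example, and `toy_spe` stands in for a real statistical decoder.

```python
import re
from typing import Callable, Dict, Tuple

# Assumption: protected constituents are ISO-style dates and plain numerals.
# The date alternative comes first so a date is not split into numbers.
PROTECTED = re.compile(r"\d{4}-\d{2}-\d{2}|\d+(?:\.\d+)?")


def protect(rbmt_output: str) -> Tuple[str, Dict[str, str]]:
    """Replace protected spans in the RBMT output with placeholders."""
    saved: Dict[str, str] = {}

    def mask(match: "re.Match[str]") -> str:
        key = f"__PH{len(saved)}__"
        saved[key] = match.group(0)  # keep the RBMT rendering untouched
        return key

    return PROTECTED.sub(mask, rbmt_output), saved


def reinject(text: str, saved: Dict[str, str]) -> str:
    """Put the protected spans back after post-editing."""
    for key, value in saved.items():
        text = text.replace(key, value)
    return text


def post_edit_protected(rbmt_output: str, spe: Callable[[str], str]) -> str:
    masked, saved = protect(rbmt_output)
    return reinject(spe(masked), saved)


def toy_spe(text: str) -> str:
    # Hypothetical stand-in for the statistical post-editing decoder.
    return text.replace("begin on", "begins on")
```

Because the statistical step only ever sees placeholders, it can rephrase the surrounding words but cannot alter a date or a number that the rule-based engine already handled.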

Statistical Post-Edition: A Case Study
  Case Study – SYMANTEC – English>Chinese
  [Results table; values missing from the transcript]
    Columns: BLEU, PERFECT, Improv / Degrad
    Rows: SYSTRAN Raw; SYSTRAN Cust ref; SYSTRAN Raw + Translation Model; SYSTRAN Cust + Translation Model

Conclusions
  Our approach is to start with a rule-based framework
    Developed techniques give very competitive results
    Major focus on “degradation” control
    Learn more advanced post-edition rules
  Generic Translation – still a long way to go
    Bigger still better?
  Domain Translation
    Quality is there – statistics provide adaptation and fluidity
    → Need dedicated applications, workflow
  Bootstrapping new language pair development