METIS-II: a hybrid MT system Peter Dirix Vincent Vandeghinste Ineke Schuurman Centre for Computational Linguistics Katholieke Universiteit Leuven TMI 2007,

Slides:



Advertisements
Similar presentations
European Patent Office Wolfgang Täger December 2006 European Patent Office European Machine Translation Programme.
Advertisements

Sentence Classification and Clause Detection for Croatian Kristina Vučković, Željko Agić, Marko Tadić Department of Information Sciences, Department of.
Development of a German- English Translator Felix Zhang.
Proceedings of the Conference on Intelligent Text Processing and Computational Linguistics (CICLing-2007) Learning for Semantic Parsing Advisor: Hsin-His.
Dr. Radhika Mamidi Corpus. What is a Corpus? a corpus (plural corpora) or text corpus is a large and structured set of texts (now usually electronically.
eWika: Digitalization of Philippine Languages
A Syntactic Translation Memory Vincent Vandeghinste Centre for Computational Linguistics K.U.Leuven
Machine Translation (Level 2) Anna Sågvall Hein GSLT Course, September 2004.
Inducing Information Extraction Systems for New Languages via Cross-Language Projection Ellen Riloff University of Utah Charles Schafer, David Yarowksy.
Center for Computational Learning Systems Independent research center within the Engineering School NLP people at CCLS: Mona Diab, Nizar Habash, Martin.
The current status of Chinese-English EBMT research -where are we now Joy, Ralf Brown, Robert Frederking, Erik Peterson Aug 2001.
1 SIMS 290-2: Applied Natural Language Processing Marti Hearst Sept 22, 2004.
Machine Translation Prof. Alexandros Potamianos Dept. of Electrical & Computer Engineering Technical University of Crete, Greece May 2003.
Bilingual Lexical Acquisition From Comparable Corpora Andrea Mulloni.
Resources Primary resources – Lexicons, structured vocabularies – Grammars (in widest sense) – Corpora – Treebanks Secondary resources – Designed for a.
Course Summary LING 575 Fei Xia 03/06/07. Outline Introduction to MT: 1 Major approaches –SMT: 3 –Transfer-based MT: 2 –Hybrid systems: 2 Other topics.
Semi-Automatic Learning of Transfer Rules for Machine Translation of Low-Density Languages Katharina Probst April 5, 2002.
MT Summit VIII, Language Technologies Institute School of Computer Science Carnegie Mellon University Pre-processing of Bilingual Corpora for Mandarin-English.
1 Lending a Hand: Sign Language Machine Translation Sara Morrissey NCLT Seminar Series 21 st June 2006.
Analyzing Sentiment in a Large Set of Web Data while Accounting for Negation AWIC 2011 Bas Heerschop Erasmus School of Economics Erasmus University Rotterdam.
1 The Web as a Parallel Corpus  Parallel corpora are useful  Training data for statistical MT  Lexical correspondences for cross-lingual IR  Early.
LEARNING WORD TRANSLATIONS Does syntactic context fare better than positional context? NCLT/CNGL Internal Workshop Ankit Kumar Srivastava 24 July 2008.
Does Syntactic Knowledge help English- Hindi SMT ? Avinesh. PVS. K. Taraka Rama, Karthik Gali.
The LC-STAR project (IST ) Objectives: Track I (duration 2 years) Specification and creation of large word lists and lexica suited for flexible.
Czech-to-English Translation: MT Marathon 2009 Session Preview Jonathan Clark Greg Hanneman Language Technologies Institute Carnegie Mellon University.
Automatic translation quality control using Eurovoc descriptors Marko Tadić, Božo Bekavac
Natural Language Processing Lab Northeastern University, China Feiliang Ren EBMT Based on Finite Automata State Transfer Generation Feiliang Ren.
Automated Compounding as a means for Maximizing Lexical Coverage Vincent Vandeghinste Centrum voor Computerlinguïstiek K.U. Leuven.
McEnery, T., Xiao, R. and Y.Tono Corpus-based language studies. Routledge. Unit A 2. Representativeness, balance and sampling (pp13-21)
Machine translation Context-based approach Lucia Otoyo.
Evaluating the Contribution of EuroWordNet and Word Sense Disambiguation to Cross-Language Information Retrieval Paul Clough 1 and Mark Stevenson 2 Department.
Lemmatization Tagging LELA /20 Lemmatization Basic form of annotation involving identification of underlying lemmas (lexemes) of the words in.
Part II. Statistical NLP Advanced Artificial Intelligence Applications of HMMs and PCFGs in NLP Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme.
English-Persian SMT Reza Saeedi 1 WTLAB Wednesday, May 25, 2011.
Bilingual term extraction revisited: Comparing statistical and linguistic methods for a new pair of languages Špela Vintar Faculty of Arts Dept. of Translation.
Profile The METIS Approach Future Work Evaluation METIS II Architecture METIS II, the continuation of the successful assessment project METIS I, is an.
Experiments on Building Language Resources for Multi-Modal Dialogue Systems Goals identification of a methodology for adapting linguistic resources for.
Advanced Signal Processing 05/06 Reinisch Bernhard Statistical Machine Translation Phrase Based Model.
Can Controlled Language Rules increase the value of MT? Fred Hollowood & Johann Rotourier Symantec Dublin.
Lecture 6 Hidden Markov Models Topics Smoothing again: Readings: Chapters January 16, 2013 CSCE 771 Natural Language Processing.
PETRA – the Personal Embedded Translation and Reading Assistant Werner Winiwarter University of Vienna InSTIL/ICALL Symposium 2004 June 17-19, 2004.
CSA2050: Introduction to Computational Linguistics Part of Speech (POS) Tagging II Transformation Based Tagging Brill (1995)
Using a Lemmatizer to Support the Development and Validation of the Greek WordNet Harry Kornilakis 1, Maria Grigoriadou 1, Eleni Galiotou 1,2, Evangelos.
Virach Sornlertlamvanich Information R&D Division (iTech) National Electronics and Computer Technology Center (NECTEC) THAILAND 19 January 2001 Symposium.
A Cascaded Finite-State Parser for German Michael Schiehlen Institut für Maschinelle Sprachverarbeitung Universität Stuttgart
Approaches to Machine Translation CSC 5930 Machine Translation Fall 2012 Dr. Tom Way.
Transfer-based MT with Strong Decoding for a Miserly Data Scenario Alon Lavie Language Technologies Institute Carnegie Mellon University Joint work with:
Spanish FrameNet Project Autonomous University of Barcelona Marc Ortega.
Carnegie Mellon Goal Recycle non-expert post-editing efforts to: - Refine translation rules automatically - Improve overall translation quality Proposed.
What you have learned and how you can use it : Grammars and Lexicons Parts I-III.
Auckland 2012Kilgarriff: NLP and Corpus Processing1 The contribution of NLP: corpus processing.
Improving Named Entity Translation Combining Phonetic and Semantic Similarities Fei Huang, Stephan Vogel, Alex Waibel Language Technologies Institute School.
Multi-level Bootstrapping for Extracting Parallel Sentence from a Quasi-Comparable Corpus Pascale Fung and Percy Cheung Human Language Technology Center,
METIS Project full title: STATISTICAL MACHINE TRANSLATION USING MONOLINGUAL CORPORA: from concept to implementation Project acronym: METIS-II Instrument:
Statistical Machine Translation Part II: Word Alignments and EM Alex Fraser Institute for Natural Language Processing University of Stuttgart
Error Analysis of Two Types of Grammar for the purpose of Automatic Rule Refinement Ariadna Font Llitjós, Katharina Probst, Jaime Carbonell Language Technologies.
Large Vocabulary Data Driven MT: New Developments in the CMU SMT System Stephan Vogel, Alex Waibel Work done in collaboration with: Ying Zhang, Alicia.
October 10, 2003BLTS Kickoff Meeting1 Transfer with Strong Decoding Learning Module Transfer Rules {PP,4894} ;;Score: PP::PP [NP POSTP] -> [PREP.
CSA2050: Introduction to Computational Linguistics Part of Speech (POS) Tagging II Transformation Based Tagging Brill (1995)
Selecting Relevant Documents Assume: –we already have a corpus of documents defined. –goal is to return a subset of those documents. –Individual documents.
A method to restrict the blow-up of hypotheses... A method to restrict the blow-up of hypotheses of a non-disambiguated shallow machine translation system.
A Simple English-to-Punjabi Translation System By : Shailendra Singh.
Review: Review: Translating without in-domain corpus: Machine translation post-editing with online learning techniques Antonio L. Lagarda, Daniel Ortiz-Martínez,
Learning to Generate Complex Morphology for Machine Translation Einat Minkov †, Kristina Toutanova* and Hisami Suzuki* *Microsoft Research † Carnegie Mellon.
LingWear Language Technology for the Information Warrior Alex Waibel, Lori Levin Alon Lavie, Robert Frederking Carnegie Mellon University.
Language Identification and Part-of-Speech Tagging
Approaches to Machine Translation
--Mengxue Zhang, Qingyang Li
Approaches to Machine Translation
Extracting Why Text Segment from Web Based on Grammar-gram
Presentation transcript:

METIS-II: a hybrid MT system Peter Dirix Vincent Vandeghinste Ineke Schuurman Centre for Computational Linguistics Katholieke Universiteit Leuven TMI 2007, Skövde

Overview Techniques and issues in MT The METIS-II project Intermediate evaluation and ongoing work

Overview of techniques in MT Since 50s: word-by-word systems Later: rule-based systems (RBMT) Since 80s: statistical MT (SMT) 90s: example-based MT (EBMT)

Issues SMT/EBMT need huge parallel corpora with aligned text (often not available) SMT/EBMT sparsity of data RBMT infinity of rules/vocabulary → manual work, nearly impossible RBMT advanced analytic resources needed

Resolve issues Use only large monolingual corpora (widely available) Use basic analytic resources and an electronic translation dictionary Enable construction of new language pairs more easily Combine EBMT/SMT and RBMT techniques to resolve disjoint issues Construct hybrid MT system

The METIS-II Project European project consisting of KULeuven, ILSP Athens, IAI Saarbrücken, and FUPF Barcelona Language pairs Dutch, Greek, German and Spanish to English Ongoing work ( ) Build further on an assessment project ( )

Three language models Source-language model (SLM): analyses the structure in SL – tokenizers, lemmatizers, PoS taggers, chunkers, … Translation model (TM): models mapping between languages: dictionary, tag mapping rules, … Target-language model (TLM): uses TL corpus to pick most likely translation

Source-language model (Dutch) Tokenizer Tagger Lemmatizer Chunker

SLM: Tokenizer Rule-based tokenizer for Dutch 99.4% precision and recall

SLM: PoS tagger External tool: TnT (Brants 2000) About 96-97% accuracy for Dutch Trained on CGN (Corpus of Spoken Dutch) Uses CGN/DCoi tag set

SLM: Lemmatizer In-house, rule-based Uses tags and CGN lexicon as input Deals with separable verbs Future plans: use memory-based DCoi tagger/lemmatizer

SLM: Chunker In-house robust chunker/shallow parser: ShaRPa 2.1 Steps can be defined as context-free grammars (non recursive) or perl subroutines Detects NPs, PPs and verb groups (F = 95%) Marks subclauses and relative clauses (F = 70%) Future plans: add subject detection

Translation model (Dutch to English) Bilingual dictionary Tag-mapping rules Expander (extra rules/statistics to deal with language-specific phenomena, e.g. reorganising word/chunk order, adding/deleting words,…)

TM: Dictionary Compiled from free internet resources and EuroWordNet About 38,000 entries and 115,000 translations XML format Contains relevant PoS and chunking information Contains complex and discontinuous entries

TM: Tag-mapping rules Mapping between Dutch (CGN/DCoi) and English (BNC) tag sets Uses mapping table

TM: Expander Generates extra translation candidates Deals with tense mapping Treats verb groups Inserts do when necessary Translates like to + infinitive Translates om te + infinitive

Target-language model (English) TL corpus preprocessing: same process as SL (tokenizing, lemmatizing, tagging, chunking,…) + draw statistics/put in DB TM has generated a list of possibilities Corpus look-up ranks possibilities according to TL corpus statistics Selects most likely translation or n-best Token generator for morphological generation

TLM: Corpus Corpus preprocessing: BNC (British National Corpus) BNC is already tokenized and tagged Lemmatized using IAI lemmatizer Chunked using ShaRPa 2.1 (NPs, PPs, VGs, subclauses, …) Put into SQL database

TLM: Corpus statistics Drawn statistics from corpus Co-occurrence of lemmas, chunks (heads), … Put into database

TLM: Corpus look-up (ranker) Dictionary look-up, tag-mapping rules, expander => result = bag of bags Lexical selection + word/chunk order is drawn from TL corpus Makes a ranking of candidate translations

Example (1) We want to translate: ‘De grote zwarte hond blaft naar de postbode’.

Example (2)

Example (3)

Example (4)

Example (5)

Translation process Wrapper for whole process Analyse SL sentence(s) Build TM Pick translations with highest rank(s) and do token generation Offer translations to translator for post- editing (not implemented yet)

Evaluation Evaluated with BLEU, NIST and Levenshtein distance algorithm

Ongoing work & ideas Reimplementing the system (code clean-up) Elaborate rules (e.g. continuous tenses), lexica, … Take SL chunk order into account Improve SL and TL toolsets Provide tools for post-editing PACO-MT

Related work Context-based Machine Translation (CBMT, Carbonell 2006) Generation-heavy Hybrid Machine Translation (GHMT, Habash, 2003)

Questions ?