Approaching a New Language in Machine Translation
Anna Sågvall Hein, Per Weijnitz



SALTMIL, LREC 2006, Sågvall Hein & Weijnitz

A Swedish example
Experiences of
- rule-based translation by means of translation software that was developed from scratch
- statistical translation by means of publicly available software

Developing a robust transfer-based system for Swedish
- collecting a small sv-en translation corpus from the automotive domain (Scania)
- building a prototype of a core translation engine, Multra
- extending the translation corpus to 50k words for each language and scaling up the dictionaries for the extended corpus
- building a translation system, Mats, for hosting Multra and processing real-world documents
- making the system robust, transparent and traceable
- building an extended, more flexible version of Mats, Convertus

Features of the Multra engine
- transfer-based
- modular
- analysis by chart parsing
- transfer based on unification
- generation based on unification and concatenation
- non-deterministic processing
- preference machinery
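
To make "based on unification" concrete, here is a minimal sketch of feature-structure unification over nested Python dicts. It is a generic textbook-style illustration, not the Multra implementation: the function name `unify` and the feature names (`cat`, `verb`, `mood`) are hypothetical, and shared (re-entrant) substructures are not modelled.

```python
def unify(a, b):
    """Unify two feature structures (nested dicts with atomic leaf values).

    Returns the merged structure, or None if the two structures clash.
    """
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)  # shallow copy so the inputs are left untouched
        for key, b_val in b.items():
            if key in result:
                merged = unify(result[key], b_val)
                if merged is None:
                    return None  # a clash anywhere makes the whole unification fail
                result[key] = merged
            else:
                result[key] = b_val  # feature only present in b: just add it
        return result
    # Atomic values (or dict vs. atom) unify only if they are equal.
    return a if a == b else None


# Hypothetical structures: an imperative clause and a rule constraint on its verb.
clause = {"cat": "CL", "verb": {"mood": "IMP"}}
constraint = {"verb": {"mood": "IMP", "cat": "VERB"}}
print(unify(clause, constraint))
print(unify({"mood": "IMP"}, {"mood": "IND"}))
```

Compatible structures are merged, while conflicting values for the same feature make the whole unification fail, which is how a rule can both test and build structure in one step.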

Features of the host system(s)
- robust
  - always produces a translation
- modular
  - a separate module for each translation step
- transparent
  - text-based communication between modules
- traceable
  - step-wise for each module
- evaluation of the linguistic coverage
  - counting and collecting missing units from each module
- process communication
  - Mats: unidirectional pipe
  - Convertus: blackboard
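
The modular, text-based, traceable design can be sketched as a unidirectional pipeline in which every module maps text to text and each intermediate result is recorded. This is only an illustration under stated assumptions: `run_pipeline`, the three stand-in modules and the toy dictionary are hypothetical and do not reproduce the actual Mats modules.

```python
from typing import Callable, List, Tuple

Module = Callable[[str], str]  # every module reads text and writes text


def run_pipeline(text: str, modules: List[Tuple[str, Module]]) -> Tuple[str, List[Tuple[str, str]]]:
    """Pass text through the modules in order, recording each module's output."""
    trace = []
    for name, module in modules:
        text = module(text)
        trace.append((name, text))  # step-wise trace, one entry per module
    return text, trace


# Hypothetical stand-ins for the real analysis, transfer and generation modules.
TOY_TRANSFER = {"byt": "replace", "filtret": "the filter"}


def analyse(text: str) -> str:
    return " ".join(text.lower().split())  # normalise case and whitespace


def transfer(text: str) -> str:
    return " ".join(TOY_TRANSFER.get(tok, tok) for tok in text.split())


def generate(text: str) -> str:
    return text[:1].upper() + text[1:]  # capitalise the sentence


MODULES = [("analysis", analyse), ("transfer", transfer), ("generation", generate)]

result, trace = run_pipeline("Byt  filtret", MODULES)
for name, output in trace:
    print(f"{name}: {output}")
```

Because the modules only exchange plain text, any stage can be inspected or replaced in isolation, which is the point of the transparency and traceability features listed above.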

Robustness
- dictionary
  - complementary access to external dictionaries
- analysis
  - exploiting partial analyses
  - concatenation of sub-strings in preserved order
- transfer
  - only differences covered by rules
- generation
  - token translations presented in source language order
  - fall-back generations cleaned up using a language model
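
The generation fall-back (token translations in source language order) can be sketched roughly as follows. This is an assumption-laden illustration, not the Convertus code: `fallback_translate` and the toy dictionary are hypothetical, unknown tokens are simply passed through and counted, and the language-model clean-up step is left out.

```python
from collections import Counter


def fallback_translate(tokens, dictionary, missing):
    """Translate token by token, keeping source order; always returns an output.

    Tokens absent from the dictionary are copied unchanged and counted in
    `missing`, so coverage gaps can be collected afterwards.
    """
    output = []
    for token in tokens:
        if token in dictionary:
            output.append(dictionary[token])
        else:
            missing[token] += 1
            output.append(token)
    return output


# Toy sv-en dictionary, purely illustrative.
toy_dictionary = {"byt": "replace", "oljefiltret": "the oil filter"}
missing_units = Counter()

translated = fallback_translate(["byt", "oljefiltret", "nu"], toy_dictionary, missing_units)
print(" ".join(translated))
print(missing_units.most_common())
```

The function never fails, which mirrors the "always produces a translation" requirement, and the counter doubles as a simple record of missing dictionary units.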

Language resources, full system
- analysis
  - dictionary
  - grammar
- transfer
  - dictionary
  - grammar
- generation
  - dictionary
  - grammar
- external translation dictionary
- target language model

Language resources, simplified, direct translation system
- analysis
  - dictionary
- transfer
  - dictionary
- generation
  - dictionary
- external translation dictionary
- target language model

Achievements
BLEU scores ~ for training materials
- automotive service literature
- EU agricultural texts
- security police communication
- academic curricula
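
For readers unfamiliar with the metric, the following is a minimal sketch of corpus-level BLEU with a single reference per sentence, uniform weights over 1- to 4-grams and the standard brevity penalty. It assumes pre-tokenised input and applies no smoothing; the function names are illustrative, and the slides do not say which BLEU implementation or settings were used.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of one tokenised sentence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU for parallel lists of tokenised sentences."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            ref_counts = ngrams(ref, n)
            # Counter intersection keeps the minimum count: clipped matches.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # no smoothing: any empty n-gram order gives zero
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity * math.exp(log_precision)


reference = [["replace", "the", "oil", "filter", "every", "year"]]
print(corpus_bleu(reference, reference))
print(corpus_bleu([["replace", "the", "oil", "filter", "every"]], reference))
```

The second call shows the brevity penalty at work: every n-gram of the shortened hypothesis matches, yet the score drops below 1 because the hypothesis is shorter than the reference.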

Current project
Translation of curricula of Uppsala University from Swedish to English

Current development
Initial studies of automatic extraction of grammar rules from text and treebanks for parsing and generation, inspired by
- Megyesi, B. (2002). Data-Driven Syntactic Analysis - Methods and Applications for Swedish. Ph.D. thesis. Department of Speech, Music and Hearing, KTH, Stockholm, Sweden.
- Nivre, J., Hall, J. and Nilsson, J. (2006). MaltParser: A Data-Driven Parser-Generator for Dependency Parsing. In Proceedings of LREC.

Statistical MT
Publicly available software:
- decoder
  - Pharaoh (Koehn 2004)
- translation models
  - UPlug (Tiedemann, J. 2003)
  - GIZA++ (Och, F. J. and Ney, H. 2000)
  - Thot (Ortiz-Martínez, D. et al. 2005)
- language models
  - SRILM (Stolcke, A. 2002)

Success factors
- language differences
- translation direction
- size of training corpus
- density of corpus
  - corpus density: lexical openness, degree of repetitiveness of n-grams, plus other significant factors
How can they be appropriately formalised? Measured? Combined?

Experiments
- limited amount of training data (assumed for minority languages): <=32k sentence pairs
- Swedish represents the minority language
- search for correlation between density of corpus and translation quality

Mats automotive corpus
[Table: BLEU and training size (k) for the language pairs sv-en, en-sv, sv-de and de-sv; values not preserved.]

Europarl
[Table: BLEU and training size (k) for the language pairs sv-en, en-sv, sv-de and de-sv; values not preserved.]

Mats & Europarl, density in terms of type/occurrence ratio

Corpus        | BLEU | 3-gram | 4-gram
Mats, 16k     |      |        | 78.2%
Europarl, 16k |      |        | 92.3%
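
One plausible reading of the type/occurrence ratio is the number of distinct n-grams divided by the total number of n-gram occurrences in the corpus. The sketch below computes it under that assumption; the exact definition, the tokenisation, and whether n-grams may cross sentence boundaries are not stated in the slides, and the function name is hypothetical.

```python
from collections import Counter


def type_occurrence_ratio(sentences, n):
    """Distinct n-grams divided by n-gram occurrences, within sentence boundaries."""
    counts = Counter()
    for tokens in sentences:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    total = sum(counts.values())
    return len(counts) / total if total else 0.0


# Toy corpus: a repetitive text yields a low ratio, a varied one a ratio near 1.
toy_corpus = [["a", "b", "a", "b"], ["a", "b"]]
print(type_occurrence_ratio(toy_corpus, 2))
```

Under this reading, a lower percentage means the same n-grams recur more often, that is, the corpus is more repetitive.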

BLEU for Europarl: 10 SL->sv

BLEU for Europarl: sv->10 TL

4-gram type/occurrence ratio, SL->sv

3-gram type/occurrence ratio, SL->sv

Detailed view, Europarl, sv->en
Examining the correlation between SL n-gram type/occurrence ratio (density) and BLEU.
[Table columns: Size (k), two n-gram % columns, BLEU; values not preserved.]

Detailed view, Europarl, sv-fi
Examining the correlation between SL n-gram type/occurrence ratio (density) and BLEU.
[Table columns: Size (k), two n-gram % columns, BLEU; values not preserved.]

Rule-based and statistical: moving slightly off domain
- MATS automotive corpus used for training, 16k
- test data from Mats (outside training data) and from a separate, similar corpus: Scania98
[Table: BLEU on the MATS test set and on the Scania98 test set for Convertus (sv->en) and Pharaoh (sv->en); values not preserved.]

Correlation between overlap and performance: Pharaoh
- MATS automotive corpus used for training, 16k
- test data from MATS and Scania98
- measured occurrences of test data units that also occur in the training data
- test and training source language data overlap: the precondition for successful data-driven MT

Test data | BLEU | sent | 4-gram | 3-gram | 2-gram | 1-gram
MATS      |      |      | 31%    | 46%    | 72%    | 97%
Scania98  |      |      | 7%     | 16%    | 44%    | 88%
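
The overlap measurement can be approximated as the share of test-side units (whole sentences or n-grams) that also occur somewhere in the training data. The sketch below is one way to compute it, assuming tokenised source-language sentences and counting occurrences rather than types; both function names are hypothetical and the authors' exact counting procedure is not given in the slides.

```python
def ngram_overlap(test_sentences, train_sentences, n):
    """Fraction of test n-gram occurrences that also appear in the training data."""
    train_ngrams = {
        tuple(sent[i:i + n])
        for sent in train_sentences
        for i in range(len(sent) - n + 1)
    }
    seen = total = 0
    for sent in test_sentences:
        for i in range(len(sent) - n + 1):
            total += 1
            if tuple(sent[i:i + n]) in train_ngrams:
                seen += 1
    return seen / total if total else 0.0


def sentence_overlap(test_sentences, train_sentences):
    """Fraction of test sentences that occur verbatim in the training data."""
    train_set = {tuple(sent) for sent in train_sentences}
    if not test_sentences:
        return 0.0
    return sum(tuple(sent) in train_set for sent in test_sentences) / len(test_sentences)


# Toy data, purely illustrative.
train = [["check", "the", "oil"]]
test = [["check", "the", "filter"]]
for n in (1, 2):
    print(n, ngram_overlap(test, train, n))
print("sent", sentence_overlap(test, train))
```

As in the table above, overlap falls as n grows: longer units are less likely to have been seen in training, and an off-domain test set lowers every column.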

Summary
- development of Convertus, a robust transfer-based system equipped with language resources for sv-en translation in several domains
- BLEU measures of SMT using publicly available software (Pharaoh) and Europarl
  - 10 languages, two translation directions, and training intervals of 5k sentence pairs up to 32k
  - data on density of Europarl in terms of overlaps
- comparing RBMT and SMT using Convertus and Pharaoh
- searching for a formal way of quantifying how well a corpus will work for SMT
  - starting with density of source language

Concluding remarks
- building a rule-based system from scratch is a major undertaking
  - customizing existing software is better
- SMT systems can be built fairly easily using publicly available software
  - restrictions on commercial use, though
- factors influencing quality in SMT
  - size of training corpus
  - density of source side of training corpus
  - language differences and translation direction
- other important factors (future work)
  - quality of training corpus, alignment quality, …

Concluding remarks (cont.)
- SMT versus RBMT
  - SMT seems more sensitive to density than RBMT
  - error analysis and correction can be linguistically controlled in RBMT as opposed to SMT