1
Approaching a New Language in Machine Translation
Anna Sågvall Hein, Per Weijnitz
2
SALTMIL, LREC 2006, Sågvall Hein & Weijnitz
A Swedish example
Experiences of
  – rule-based translation by means of translation software that was developed from scratch
  – statistical translation by means of publicly available software
3
Developing a robust transfer-based system for Swedish
collecting a small sv-en translation corpus from the automotive domain (Scania)
building a prototype of a core translation engine, Multra
extending the translation corpus to 50k words for each language and scaling up the dictionaries for the extended corpus
building a translation system, Mats, for hosting Multra and processing real-world documents
making the system robust, transparent and traceable
building an extended, more flexible version of Mats, Convertus
4
Features of the Multra engine
transfer-based
modular
analysis by chart parsing
transfer based on unification
generation based on unification and concatenation
non-deterministic processing
preference machinery
5
Features of the host system(s)
robust
  – always produces a translation
modular
  – a separate module for each translation step
transparent
  – text-based communication between modules
traceable
  – step-wise for each module
evaluation of the linguistic coverage
  – counting and collecting missing units from each module
process communication
  – MATS: unidirectional pipe
  – Convertus: blackboard
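The host-system properties listed above (a separate module per step, text passed between modules, a step-wise trace, collection of missing units, and always producing some output) can be illustrated with a minimal Python sketch of a unidirectional pipeline. This is not the MATS or Convertus code: the `Pipeline` class, the three toy steps and `TOY_LEXICON` are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical step signature: text in, (text out, units the step could not handle).
Step = Callable[[str], Tuple[str, List[str]]]

# Hypothetical toy Swedish-English lexicon, for illustration only.
TOY_LEXICON = {"motor": "engine", "olja": "oil"}


def analyse(text: str) -> Tuple[str, List[str]]:
    # Toy "analysis": normalise case.
    return text.lower(), []


def transfer(text: str) -> Tuple[str, List[str]]:
    # Toy "transfer": word-by-word lookup; unknown tokens pass through and are reported.
    out, unknown = [], []
    for token in text.split():
        if token in TOY_LEXICON:
            out.append(TOY_LEXICON[token])
        else:
            out.append(token)
            unknown.append(token)
    return " ".join(out), unknown


def generate(text: str) -> Tuple[str, List[str]]:
    # Toy "generation": capitalise the first letter.
    return text[:1].upper() + text[1:], []


@dataclass
class Pipeline:
    steps: List[Tuple[str, Step]]
    trace: List[Tuple[str, str]] = field(default_factory=list)
    missing: Dict[str, List[str]] = field(default_factory=dict)

    def run(self, text: str) -> str:
        self.trace.clear()
        self.missing.clear()
        for name, step in self.steps:
            try:
                text, unknown = step(text)
            except Exception:
                # Robustness: a failing module passes its input through unchanged.
                unknown = []
            self.trace.append((name, text))   # traceable: output of every step
            self.missing[name] = unknown      # coverage: missing units per module
        return text


pipeline = Pipeline([("analysis", analyse), ("transfer", transfer), ("generation", generate)])
print(pipeline.run("Motor olja xyz"))
print(pipeline.trace)
print(pipeline.missing)
```

A blackboard design, as named for Convertus, would instead let modules read from and write to a shared store rather than passing text strictly forward; the sketch covers only the pipe variant.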
6
Robustness
dictionary
  – complementary access to external dictionaries
analysis
  – exploiting partial analyses
  – concatenation of sub-strings in preserved order
transfer
  – only differences covered by rules
generation
  – token translations presented in source language order
  – fall-back generations cleaned up using a language model
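The slides do not say how the language model is applied to fall-back output. One plausible reading, sketched below in Python, is that token translations are kept in source order and a target-language bigram model picks among alternative translations. The lexicon, the toy target corpus and all function names here are assumptions made for illustration, not part of Multra, Mats or Convertus.

```python
from collections import Counter
from itertools import product


def bigram_counts(sentences):
    """Count bigrams (with sentence boundary markers) in a target-language sample."""
    counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        counts.update(zip(tokens, tokens[1:]))
    return counts


def fallback_translate(source_tokens, lexicon, counts):
    """Translate token by token in source order; let bigram counts pick alternatives."""
    # Unknown tokens pass through unchanged so that some output is always produced.
    options = [lexicon.get(token, [token]) for token in source_tokens]

    def score(candidate):
        tokens = ["<s>"] + list(candidate) + ["</s>"]
        return sum(counts[bigram] for bigram in zip(tokens, tokens[1:]))

    # Exhaustive search is fine for a toy example; real input would need pruning.
    return " ".join(max(product(*options), key=score))


# Hypothetical toy data.
TOY_LEXICON = {"byt": ["change", "replace"], "olja": ["oil"]}
TOY_COUNTS = bigram_counts(["change oil now", "replace the filter"])

print(fallback_translate(["byt", "olja"], TOY_LEXICON, TOY_COUNTS))
```

Raw bigram counts stand in for a real language model here; a smoothed n-gram model (for example one built with SRILM) would play that role in practice.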
7
Language resources, full system
analysis
  – dictionary
  – grammar
transfer
  – dictionary
  – grammar
generation
  – dictionary
  – grammar
external translation dictionary
target language model
8
Language resources, simplified, direct translation system
analysis
  – dictionary
transfer
  – dictionary
generation
  – dictionary
external translation dictionary
target language model
9
Achievements
BLEU scores ~0.4-0.5 for training materials
  – automotive service literature
  – EU agricultural texts
  – security police communication
  – academic curricula
10
Current project
Translation of curricula of Uppsala University from Swedish to English
11
Current development
initial studies of automatic extraction of grammar rules from text and treebanks for parsing and generation
inspired by
  – Megyesi, B. (2002). Data-Driven Syntactic Analysis - Methods and Applications for Swedish. Ph.D. Thesis. Department of Speech, Music and Hearing, KTH, Stockholm, Sweden.
  – Nivre, J., Hall, J. and Nilsson, J. (2006). MaltParser: A Data-Driven Parser-Generator for Dependency Parsing. In Proceedings of LREC.
12
Statistical MT
Publicly available software:
decoder
  – Pharaoh (Koehn 2004)
translation models
  – UPlug (Tiedemann, J. 2003)
  – GIZA++ (Och, F. J. and Ney, H. 2000)
  – Thot (Ortiz-Martínez, D. et al. 2005)
language models
  – SRILM (Stolcke, A. 2002)
13
Success factors
language differences
translation direction
size of training corpus
density of corpus
  – corpus density: lexical openness, degree of repetitiveness of n-grams, plus other significant factors
How can they be appropriately formalised? Measured? Combined?
14
Experiments
limited amount of training data (assumed for minority languages)
  – <=32k sentence pairs
Swedish represents the minority language
search for correlation between density of corpus and translation quality
15
Mats automotive corpus
languages   BLEU    training size
sv-en       0.627   16k
en-sv       0.646   16k
sv-de       0.491   16k
de-sv       0.506   16k
16
Europarl
languages   BLEU    training size
sv-en       0.225   20k
en-sv       0.247   20k
sv-de       0.201   20k
de-sv       0.231   20k
17
Mats & Europarl, density in terms of type/occurrence ratio
Corpus          BLEU   3-gram   4-gram
Mats, 16k       0.63   68.2%    78.2%
Europarl, 16k   0.23   76.3%    92.3%
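As a rough illustration of the density measure, the n-gram type/occurrence ratio can be computed as the number of distinct n-grams divided by the total number of n-gram occurrences. The Python sketch below makes several assumptions the slides do not confirm: whitespace tokenisation, no n-grams spanning sentence boundaries, and no case normalisation. The function name is hypothetical.

```python
def type_occurrence_ratio(sentences, n):
    """Distinct n-grams divided by n-gram occurrences, counted within sentences."""
    types = set()
    occurrences = 0
    for sentence in sentences:
        tokens = sentence.split()  # assumption: whitespace tokenisation
        for i in range(len(tokens) - n + 1):
            types.add(tuple(tokens[i:i + n]))
            occurrences += 1
    return len(types) / occurrences if occurrences else 0.0


# Toy corpus: the repeated sentence lowers the ratio.
corpus = ["check the oil level", "check the oil level", "replace the oil filter"]
for n in (3, 4):
    print(n, type_occurrence_ratio(corpus, n))
```

A lower ratio means more repeated n-grams. In the figures above, the corpus with the lower ratio (Mats) is also the one with the higher BLEU score.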
18
BLEU for Europarl: 10 SL->sv
19
BLEU for Europarl: sv->10 TL
20
4-gram type/occurrence ratio, SL->sv
21
3-gram type/occurrence ratio, SL->sv
22
Detailed view, Europarl, sv->en
Examining the correlation between SL n-gram type/occurrence (density) and BLEU.
Size (k)    1      2      4      8      16     32
3-gram %    81.6   81.0   80.0   77.8   74.3   69.9
4-gram %    94.0   93.9   93.6   92.8   91.3   89.2
BLEU        0.13   0.16   0.19   0.21   0.23   0.25
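The slides do not state how the correlation was quantified. One simple option is the Pearson coefficient, sketched below in Python on the six sv->en data points from the table above. The `pearson` helper and the variable names are illustrative assumptions.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Values copied from the sv->en table (training sizes 1k to 32k).
TRIGRAM_RATIO = [81.6, 81.0, 80.0, 77.8, 74.3, 69.9]
FOURGRAM_RATIO = [94.0, 93.9, 93.6, 92.8, 91.3, 89.2]
BLEU = [0.13, 0.16, 0.19, 0.21, 0.23, 0.25]

r3 = pearson(TRIGRAM_RATIO, BLEU)
r4 = pearson(FOURGRAM_RATIO, BLEU)
print(f"3-gram ratio vs BLEU: r = {r3:.2f}")
print(f"4-gram ratio vs BLEU: r = {r4:.2f}")
```

Both coefficients come out strongly negative: a lower type/occurrence ratio goes with higher BLEU. With only six points, and with both quantities changing as the training corpus grows, this indicates an association and does not separate density from corpus size.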
23
Detailed view, Europarl, sv->fi
Examining the correlation between SL n-gram type/occurrence (density) and BLEU.
Size (k)    1      2      4      8      16     32
3-gram %    82.8   82.3   81.0   78.8   75.4   70.9
4-gram %    94.4          94.0   93.3   92.0   90.0
BLEU        0.05   0.07   0.08   0.09   0.10   0.11
24
Rule-based and statistical - moving slightly off domain
MATS automotive corpus used for training, 16k
test data from MATS (outside training data) and from a separate, similar corpus: Scania98
system      language pair   MATS test, BLEU   Scania98 test, BLEU
Convertus   sv->en          0.345             0.377
Pharaoh     sv->en          0.627             0.324
25
Correlation between overlap and performance - Pharaoh
MATS automotive corpus used for training, 16k
test data from MATS and Scania98
measured occurrences of test data units that also occur in the training data
test and training source language data overlap: the precondition for successful data-driven MT
Test data   BLEU    sent.   4-gram   3-gram   2-gram   1-gram
MATS        0.627   32%     31%      46%      72%      97%
Scania98    0.324   6%      7%       16%      44%      88%
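The overlap figures above can be approximated as the share of n-gram occurrences in the test source text that also appear somewhere in the training source text. The Python sketch below is one way to compute this under assumptions the slides do not spell out: whitespace tokenisation, counting occurrences and not types, and exact string match for whole sentences. All names are hypothetical.

```python
def ngrams(tokens, n):
    """All n-grams of a token list, in order, with repeats."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_overlap(train_sentences, test_sentences, n):
    """Share of test n-gram occurrences that also occur in the training data."""
    seen_in_training = set()
    for sentence in train_sentences:
        seen_in_training.update(ngrams(sentence.split(), n))
    total = covered = 0
    for sentence in test_sentences:
        for gram in ngrams(sentence.split(), n):
            total += 1
            covered += gram in seen_in_training
    return covered / total if total else 0.0


def sentence_overlap(train_sentences, test_sentences):
    """Share of test sentences that occur verbatim in the training data."""
    training = set(train_sentences)
    if not test_sentences:
        return 0.0
    return sum(s in training for s in test_sentences) / len(test_sentences)


# Toy data, for illustration only.
train = ["change the oil", "check the filter"]
test = ["change the filter"]
for n in (1, 2, 3):
    print(n, ngram_overlap(train, test, n))
print("sentences", sentence_overlap(train, test))
```

In the toy data every test word and bigram is known from training, but the trigram and the full sentence are not. The Scania98 row shows the same pattern: overlap drops much faster for longer units.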
26
Summary
development of Convertus, a robust transfer-based system equipped with language resources for sv-en translation in several domains
BLEU measures of SMT using publicly available software (Pharaoh) and Europarl
  – 10 languages, two translation directions, and training intervals of 5k sentence pairs up to 32k
  – data on density of Europarl in terms of overlaps
comparing RBMT and SMT using Convertus and Pharaoh
searching for a formal way of quantifying how well a corpus will work for SMT
  – starting with density of source language
27
Concluding remarks
building a rule-based system from scratch is a major undertaking
  – customizing existing software is better
SMT systems can be built fairly easily using publicly available software
  – restrictions on commercial use, though
factors influencing quality in SMT
  – size of training corpus
  – density of source side of training corpus
  – language differences and translation direction
other important factors (future work)
  – quality of training corpus, alignment quality, …
28
Concluding remarks (cont.)
SMT versus RBMT
  – SMT seems more sensitive to density than RBMT
  – error analysis and correction can be linguistically controlled in RBMT as opposed to SMT