Data is core. Andrejs Vasiļjevs, Chairman of the Board. LOCALIZATION WORLD PARIS, JUNE 5, 2012.

Presentation transcript:


► Language technology developer
► Localization service provider
► Leadership in smaller languages
► Offices in Riga (Latvia), Tallinn (Estonia) and Vilnius (Lithuania)
► 135 employees
► Strong R&D team: 9 PhDs and candidates

MT machine translation

INNOVATION disruptive

MT paradigms

Rule-based MT:
► High-quality translation in specialized domains
► Requires highly qualified linguists, researchers and software developers
► Time- and resource-consuming
► Difficult to evolve

Statistical MT:
► Translation and linguistic knowledge is derived from data
► Relatively easy and quick to develop
► Requires huge amounts of parallel and monolingual data
► Translation quality is inconsistent and can differ dramatically from domain to domain

CHALLENGE

one size fits all ?

DATA

JRC-Acquis: the total body of European Union law applicable in the EU Member States

DGT-TM: the DGT Multilingual Translation Memory of the Acquis Communautaire

OPUS: parallel data collected from the Web by Uppsala University; 90 languages, 3800 language pairs, 2.7B parallel units

open European language resource infrastructure

Data for SMT training

PLATFORM

Moses toolkit

    [ttable-file]
    /.../unfactored/model/phrase-table.0-0.gz

    % ls steps/1/LM_toy_tokenize.1* | cat
    steps/1/LM_toy_tokenize.1
    steps/1/LM_toy_tokenize.1.DONE
    steps/1/LM_toy_tokenize.1.INFO
    steps/1/LM_toy_tokenize.1.STDERR
    steps/1/LM_toy_tokenize.1.STDERR.digest
    steps/1/LM_toy_tokenize.1.STDOUT

    % train-model.perl \
        --corpus factored-corpus/proj-syndicate \
        --root-dir unfactored \
        --f de --e en \
        --lm 0:3:factored-corpus/surface.lm:0

    % moses -f moses.ini -lmodel-file " /lm/europarl.srilm.gz"

    use-berkeley = true
    alignment-symmetrization-method = berkeley
    berkeley-train = $moses-script-dir/ems/support/berkeley-train.sh
    berkeley-process = $moses-script-dir/ems/support/berkeley-process.sh
    berkeley-jar = /your/path/to/berkeleyaligner-2.1/berkeleyaligner.jar
    berkeley-java-options = "-server -mx30000m -ea"
    berkeley-training-options = "-Main.iters EMWordAligner.numThreads 8"
    berkeley-process-options = "-EMWordAligner.numThreads 8"
    berkeley-posterior = 0.5

    tokenize
        in: raw-stem
        out: tokenized-stem
        default-name: corpus/tok
        pass-unless: input-tokenizer output-tokenizer
        template-if: input-tokenizer IN.$input-extension OUT.$input-extension
        template-if: output-tokenizer IN.$output-extension OUT.$output-extension
        parallelizable: yes

    working-dir = /home/pkoehn/experiment
    wmt10-data = $working-dir/data

build your own MT engine

► Tilde (coordinator), LATVIA
► University of Edinburgh, UK
► Uppsala University, SWEDEN
► Copenhagen University, DENMARK
► University of Zagreb, CROATIA
► Moravia, CZECH REPUBLIC
► SemLab, NETHERLANDS

Cloud-based self-service MT factory
► Repository of parallel and monolingual corpora for MT generation
► Automated training of SMT systems from specified collections of data
► Users can specify particular training data collections and build customised MT engines from these collections
► Users can also use the LetsMT! platform to tailor an MT system to their needs using their non-public data

Resource Repository
► Stores SMT training data
► Supports different formats: TMX, XLIFF, PDF, DOC, plain text
► Converts to unified format
► Performs format conversions and alignment
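The "converts to unified format" step can be illustrated with a minimal sketch that reads a TMX file into plain source/target sentence pairs. This is not the LetsMT! repository code: the function name `tmx_to_pairs`, the sample data, and the choice of a list of tuples as the "unified format" are assumptions made for illustration. Only the TMX element names (`tu`, `tuv`, `seg`) and the `xml:lang` attribute come from the TMX standard.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical two-unit translation memory, used only for this example.
SAMPLE_TMX = """<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header srclang="en" datatype="plaintext" segtype="sentence"
          adminlang="en" creationtool="example" creationtoolversion="1"
          o-tmf="none"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Open the file.</seg></tuv>
      <tuv xml:lang="lv"><seg>Atveriet failu.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Save changes?</seg></tuv>
      <tuv xml:lang="lv"><seg>Vai saglabāt izmaiņas?</seg></tuv>
    </tu>
  </body>
</tmx>
"""


def tmx_to_pairs(tmx_text, src_lang, tgt_lang):
    """Return (source, target) segment pairs for one language pair.

    Simplification: language codes are matched exactly after lowercasing,
    so "en-US" does not match "en".
    """
    root = ET.fromstring(tmx_text.encode("utf-8"))
    pairs = []
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.findall("tuv"):
            # Older TMX versions use a plain "lang" attribute.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text nested inside inline tags.
                segments[lang] = "".join(seg.itertext()).strip()
        if src_lang in segments and tgt_lang in segments:
            pairs.append((segments[src_lang], segments[tgt_lang]))
    return pairs


pairs = tmx_to_pairs(SAMPLE_TMX, "en", "lv")
for source, target in pairs:
    print(source, "|||", target)
```

Translation units that lack either language are skipped, which keeps the resulting corpus strictly parallel.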

User-driven machine translation
► Put users in control of their data
► Fully public or fully private should not be the only choice
► Data can be used for MT generation without exposing it
► Empower users to create custom MT engines from their data

Integration
► Integration with CAT tools
► Integration in web pages
► Integration in web browsers
► API-level integration

Integration of MT in SDL Trados

use case FORTERA

EVALUATION

Previous Work
► Keyboard monitoring of post-editing (O'Brien, 2005)
► Productivity of MS Office localization (Schmidtke, 2008): 5-10% productivity gain for SP, FR, DE
► Adobe (Flournoy and Duran, 2009): 22%-51% productivity increase for RU, SP, FR
► Autodesk Moses SMT system (Plitt and Masselot, 2010): 74% average productivity increase for FR, IT, DE, SP

Evaluation at Tilde
► Latvian:
  - About 1.6 M native speakers
  - Highly inflectional: ~22M possible word forms in total
  - Official EU language
► Tilde English-Latvian MT system
► IT software localization domain
► Evaluation of translators' productivity

English-Latvian data

Bilingual corpus                      Parallel units
Localization TM                       1 290 K
DGT-TM                                1 060 K
OPUS EMEA                               970 K
Fiction                                 660 K
Dictionary data                         510 K
Web corpus                              900 K
Total                                 5 370 K

Monolingual corpus                    Words
Latvian side of parallel corpus        60 M
News (web)                            250 M
Fiction                                 9 M
Total, Latvian                        319 M

MT Integration into Localization Workflow
► Evaluate original / assign translator and editor
► Analyze against TMs
► MT translate new sentences
► Translate using translation suggestions from TMs and MT
► Evaluate translation quality / edit
► Fix errors
► Ready translation

Evaluation of Productivity
► The key interest of the localization industry is to increase the productivity of the translation process while maintaining the required quality level
► Productivity was measured as the translation output of an average translator in words per hour
► 5 translators participated in the evaluation, including both experienced and new translators
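The productivity measure described here (words per hour, compared between the TM-only and the TM-plus-MT condition) reduces to simple arithmetic. The sketch below uses hypothetical word counts and hours, not figures from the evaluation, and the function names are illustrative.

```python
def words_per_hour(words, hours):
    """Translation output in words per hour."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    return words / hours


def productivity_increase(baseline_wph, assisted_wph):
    """Relative productivity change in percent, against the baseline."""
    if baseline_wph <= 0:
        raise ValueError("baseline must be positive")
    return (assisted_wph - baseline_wph) / baseline_wph * 100.0


# Hypothetical numbers for one translator (not taken from the slides).
tm_only = words_per_hour(words=4000, hours=8.0)    # 500.0 words/hour
tm_and_mt = words_per_hour(words=5000, hours=8.0)  # 625.0 words/hour

gain = productivity_increase(tm_only, tm_and_mt)
print(f"Productivity increase: {gain:.1f}%")  # prints "Productivity increase: 25.0%"
```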

Evaluation of Quality
► Performed by human editors as part of their regular QA process
► The result of the translation process was evaluated; editors did not know whether MT had been applied to assist the translator
► Comparison to a reference is not part of this evaluation
► The Tilde standard QA assessment form was used, covering the following text quality areas:
  - Accuracy
  - Spelling and grammar
  - Style
  - Terminology

QA Grades
Tilde Localization QA assessment applied in the evaluation

Error Score (sum of weighted errors)   Resulting Quality Evaluation
0…9                                    Superior
10…29                                  Good
30…49                                  Mediocre
50…69                                  Poor
>70                                    Very poor
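The grade table can be expressed as a small lookup function. This is an illustrative sketch, not Tilde's assessment tool. Two points are assumptions because the slide does not specify them: non-integer scores that fall between two bands (for example 9.5) are assigned to the lower band's upper neighbour by treating each band as a half-open interval, and a score of exactly 70 is treated as "Very poor" although the slide reads ">70".

```python
def qa_grade(error_score):
    """Map a weighted error score to the quality grade from the table.

    Assumption: bands are half-open intervals [0, 10), [10, 30), [30, 50),
    [50, 70), and everything from 70 upwards is "Very poor".
    """
    if error_score < 0:
        raise ValueError("error score cannot be negative")
    if error_score < 10:
        return "Superior"
    if error_score < 30:
        return "Good"
    if error_score < 50:
        return "Mediocre"
    if error_score < 70:
        return "Poor"
    return "Very poor"


for score in (0, 9, 10, 29, 30, 69, 71):
    print(score, qa_grade(score))
```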

Evaluation data
► 54 documents in IT domain
► adjusted words in each document
► Each document was split in half:
  - the first part was translated using suggestions from TM only
  - the second half was translated using suggestions from both TM and MT

Productivity increase, Latvian: 32.9%*

* Skadiņš R., Puriņš M., Skadiņa I., Vasiļjevs A., Evaluation of SMT in localization to under-resourced inflected language, in Proceedings of the 15th International Conference of the European Association for Machine Translation EAMT 2011, p , May 30-31, 2011, Leuven, Belgium

Evaluation at Moravia
► IT localization domain
► Systems trained on the LetsMT platform
► English-Czech translation
  - 25.1% productivity increase
  - Error score increase from 19 to 27, still at the GOOD grade (<30)
► English-Polish translation
  - 28.5% productivity increase
  - Error score increase from 16.8 to 23.6, still at the GOOD grade (<30)

Productivity increase
► Czech: 25.1%
► Polish: 28.5%
► Slovak*: 25%

* For Czech and Polish, formal evaluation was done by Moravia. For Slovak, the productivity increase was estimated by Fortera.

MORE DATA

ACCURAT TOOLKIT
► corpora collection tools
► comparability metrics
► named entity recognition tools
► terminology extraction tools

use case AUTOMOTIVE MANUFACTURER

► very small translation memories (just 3500 sentences)
► no in-domain corpora in target languages
► no money for expensive developments
?

Data collection workflow
► Terminology extraction
► Web crawling (parallel, monolingual)
► Parallel data extraction from comparable corpora

Resulting data
► TMs
► Terminology glossary
► Parallel phrases
► Parallel named entities
► Monolingual target language corpus

SMT Training
► General domain data as a basis
► Domain-specific language model
► Impose domain-specific terminology and named entity translations
► Add linguistic knowledge on top of statistical components

right data & right tools

tilde.com: technologies for smaller languages

The research within the LetsMT! project leading to these results has received funding from the ICT Policy Support Programme (ICT PSP), Theme 5 – Multilingual web, grant agreement no