1
SMT in various United Nations agencies
MT Summit 2015, Nov 2015
Bruno Pouliquen, World Intellectual Property Organization (WIPO)
2
Menu
Introduction
Various models
History
Quality
Interface
Intra-Organization work
Combining models
Future work
Conclusion
3
Introduction
Our software (Tapta) is based on Statistical Machine Translation:
  open source Moses
  the organization's data creates the model
Developed at WIPO (World Intellectual Property Organization), installed in various organizations under the auspices of Intra-Organization collaborations
4
From data to Moses models
Fully automatic (preparation/training/publishing etc.)
Fast translations (on the fly), scalable (4 billion words)
Free to use (open source + in-house development)
Runs on physical servers / virtual machines / cloud
Confidentiality
Various user interfaces
First goal: assimilation, online translation of patent applications on our search engine Patentscope
Additional goal: dissemination, integration in CAT tool, “translation accelerator”
Pipeline steps (diagram labels): sentence-align, train-model, prune, binarize, optimize, publish, re-clean, post-filter, clean
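The slide names a "clean" step but does not say what it does. As a minimal sketch, assuming it filters sentence pairs by length and by source/target length ratio (a common way to clean parallel corpora before SMT training), it could look like the following. The function name and the thresholds are hypothetical and are not taken from Tapta.

```python
def clean_parallel_corpus(pairs, min_len=1, max_len=80, max_ratio=3.0):
    """Keep sentence pairs whose token counts look like plausible translations.

    pairs: iterable of (source_sentence, target_sentence) strings.
    All thresholds are illustrative assumptions, not Tapta settings.
    """
    kept = []
    for src, tgt in pairs:
        n_src, n_tgt = len(src.split()), len(tgt.split())
        # Drop empty sides first so the ratio below never divides by zero.
        if n_src == 0 or n_tgt == 0:
            continue
        # Drop pairs where either side is too short or too long.
        if not (min_len <= n_src <= max_len and min_len <= n_tgt <= max_len):
            continue
        # Drop pairs whose lengths differ too much to be a likely translation.
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
            continue
        kept.append((src, tgt))
    return kept


sample = [
    ("a b c", "x y z"),                   # kept
    ("", "x"),                            # dropped: empty source
    ("a", "x y z w v"),                   # dropped: ratio 5 > 3
    ("hello world", "bonjour le monde"),  # kept
]
cleaned = clean_parallel_corpus(sample)
```

Filtering of this kind trades a little training data for fewer misaligned pairs, which matters most when alignment is fully automatic.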
5
History
Inst. date | Organization | Words/day (Oct 2015) | Languages | Corpus size (Mw)
Feb 2011 | | 210’000 | de,en,es,fr,ja,ko,ru,zh (ar, pt) | 2’000 en-zh, ~4’000 en-ja
Sep 2012 | | 400 | en,fr,zh (es, ru, ar) | 84 en-fr
Oct 2012 | | 210’000 | ar,en,es,fr,ru,zh (de) | 670 en-es
Feb 2013 | | 62’000 | en,fr,es | 22 en-fr
May 2013 | | 100’000 | | 790 en-fr
Dec 2014 | | 2’000 | en,fr,es,ar,ru (zh) | 54 en-fr
6
Prototypes & future installations
Date | Organization | Words/day | Languages | Corpus size (Mw)
Jul 2015 | | 3’000 | en,fr,es,ar,ru,zh (de) | 55 en-fr
 | | 7’000 | en,fr,es (ar,ru,zh) | 41 en-fr
Oct 2015 | | Running prototype | en,fr,pt | 0.7 en-fr
Sep 2014 | | prototype | de,en,es,fr (ar,ru,zh) |
Aug 2014 | | | en,es,fr (ar,de,ru,zh) | 34 en-fr
Nov 2014 | | | en,es,fr | 400 en-fr
International Patent Classification
7
MT Quality (BLEU)

PATENTSCOPE
Lang | WIPO Translate | Google
de-en | 46.11 | 37.94
es-en | 36.00 | 33.07
fr-en | 46.97 | 41.72
ru-en | 28.88 | 17.76
zh-en | 28.68 | 21.89
ko-en | 22.09 | 19.85
ja-en | 22.10 | 21.27
zh-en claims | 26.37 | 21.80
zh-en descriptions | 38.03 | 32.40

Tapta4UN
Lang | Tapta4UN | Google | Bing
ar-en | 55.25 | n/a | 51.17
en-ar | 47.00 | 33.74 | 28.94
en-es | 64.13 | 53.39 | 46.86
en-fr | 53.75 | 45.58 | 42.19
en-ru | 52.91 | 39.67 | 38.96
en-zh | 44.72 | 34.16 | 32.77
es-en | 64.72 | 52.54 | 49.18
fr-en | 57.77 | 46.46 | 43.39
ru-en | 60.85 | 47.71 | 47.09
zh-en | 44.73 | 36.55 | 30.60

Tapta4IMO
Lang | Tapta4IMO | Google
en-fr | 54.32 | 32.58
en-es | 52.74 | 35.18
en-ru | 35.99 | 20.56
en-ar | 39.99 | 16.58

Tapta4FAO
Lang | Tapta4FAO
en-es | 52.13
en-zh | 34.85
en-ru | 25.73
en-ar | 28.80

BLEU: score between 0 and 100 measuring the similarity of n-grams between the evaluated sentence and the reference sentence.
Papineni, K.; Roukos, S.; Ward, T.; Zhu, W. J. (2002). "BLEU: a method for automatic evaluation of machine translation". ACL-2002: 40th Annual Meeting of the Association for Computational Linguistics. pp. 311–318.
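To make the BLEU definition above concrete, here is a simplified, sentence-level sketch: the geometric mean of clipped 1- to 4-gram precisions, multiplied by a brevity penalty and scaled to 0-100. It is illustrative only. The scores in the tables were presumably computed at corpus level with a standard tool, and this sketch uses whitespace tokenization, a single reference and no smoothing, all of which are simplifying assumptions.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, reference, max_n=4):
    """Unsmoothed single-reference BLEU on whitespace tokens, scaled to 0-100."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # without smoothing, any zero precision gives BLEU 0
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return 100 * bp * math.exp(sum(log_precisions) / max_n)


reference = "the cat sat on the mat"
exact = sentence_bleu("the cat sat on the mat", reference)
partial = sentence_bleu("the cat sat on a mat", reference)
unrelated = sentence_bleu("completely different words here", reference)
```

An exact match scores 100, a sentence sharing no words scores 0, and a near match falls in between, which is why BLEU differences between systems are only meaningful on the same test set.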
8
Interfaces: Tapta close to users
Html widget Web F3 Ongoing… Word macro ?
9
Intra-Organization work
Tapta is a very small team (1.5 people)…
Exchange of knowledge
  Training / tools / ideas
Exchange of technology
  WIPO uses plugins developed in UN (SDL Studio / AutoHotkey)
  IMO uses WIPO technology, UN-developed plugin, UN corpus
Exchange of data
  IMO/FAO/WIPO use UN corpus…
  WIPO could benefit from Arabic ITU/UN data etc…
  WIPO has released its patent corpus COPPA, UN to follow
Ralf Steinberger, Mohamed Ebrahim, Alexandros Poulis, Manuel Carrasco-Benitez, Patrick Schlüter, Marek Przybyszewski, Signe Gilbro. An overview of the European Union’s highly multilingual parallel corpora. Language Resources and Evaluation.
10
Our tool in different situations
Adapted our code so that it can easily be installed and run
  Under version control, regression tests
  Installation/administration documentation (100 pages)
  With installation instructions: ½ day to configure a new Linux server
Runs on Linux:
  Hardware: Amazon cloud, virtual machine, server
  OS: Ubuntu/Suse/Centos/RedHat
Currently installed on ~20 servers, running 24/7
11
Hardware & OS
Virtual, 2 GB RAM, Ubuntu, 4 cores, 50 GB (toy model)
Virtual, 4 GB RAM, Suse SLES11, 4 cores, 250 GB
Virtual, 16 GB RAM, RedHat Ent. R6.2, 16 cores, 200 GB
PC, 8 GB RAM, Ubuntu 12.4, 8 cores, 350 GB
Server, 16 GB RAM, Centos R6.4, 16 cores, 400 GB
Server, 11 GB RAM, RedHat Ent. R6.5, 8 cores, 100 GB
…
Amazon cloud, 64 GB RAM, Suse Ent. 11, 8 cores, 400 GB
Serverloft cloud, 188 GB RAM, Ubuntu, 24 cores, 2.5 TB
Server, 500 GB RAM, RedHat Ent. R6.5, 48 cores, 4 TB
12
Training and scalability: UN data
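Previous United Nations texts: ~212 million words, 10 M segments
Model | Phrase table (M rows) | Phrase table (GB) | Reordering model (GB) | Language model (M rows) | Language model (GB)
Basic | 82 | 9.70 | 8.70 | 49 | 1.70
Pruned | 19 | 2.20 | 1.90 | 31 | 1.00
Binarized | | 0.27 | 0.15 | | 0.70
UN data: total model size 20 GB, 1.12 GB after binarization (6%)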
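The totals on this slide can be checked with a little arithmetic. The sketch below assumes the size figures are GB for the phrase table, reordering model and language model respectively. That column assignment is an inference from the flattened table, but it is consistent with the 20 GB and 1.12 GB totals quoted on the slide.

```python
# Model sizes in GB per pipeline stage (column assignment is an assumption).
sizes_gb = {
    "basic":     {"phrase_table": 9.70, "reordering": 8.70, "language_model": 1.70},
    "pruned":    {"phrase_table": 2.20, "reordering": 1.90, "language_model": 1.00},
    "binarized": {"phrase_table": 0.27, "reordering": 0.15, "language_model": 0.70},
}

# Total size of the three models at each stage, rounded to 2 decimals.
totals = {stage: round(sum(parts.values()), 2) for stage, parts in sizes_gb.items()}

# Share of the original (basic) size that remains after pruning and binarization.
remaining_percent = round(100 * totals["binarized"] / totals["basic"])

for stage, total in totals.items():
    print(f"{stage}: {total} GB")
print(f"binarized models are about {remaining_percent}% of the basic size")
```

Under this reading, pruning accounts for most of the reduction (about 20 GB to about 5 GB) and binarization brings the models down to roughly 1 GB, small enough for the modest servers listed on the hardware slide.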
13
Our tool in production
Install/publish/update/train/evaluate (robust) scripts
Monitor tool
Dashboard interface
Anti-robot policy (captcha)
Statistics
…
14
Combining models: IMO (+ UN data) experience
Language pair | IMO corpus (million words) | UN corpus (million words) | BLEU, IMO only | BLEU, combined
En-es | 54 | 316 | 52.68 | 52.99
En-ar | 4 | 304 | 41.20 | 44.18
Combination does not provide much improvement when the corpus is big enough (en-es)
However, translators prefer combined models for general texts and IMO-only for technical texts
Tapta4IMO offers both
Combination is useful for a “small” corpus (en-ar)
Bruno Pouliquen, Marcin Junczys-Dowmunt, Michal Ziemski, Blanca Pinero. SMT at the International Maritime Organization: experiences with combining in-house corpus with more general corpus. EAMT 2015, Antalya, Turkey, June 2015.
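As a quick worked example of the comparison above, the BLEU gain from adding UN data can be computed per language pair from the figures on this slide. The variable names are illustrative.

```python
# (IMO corpus in million words, BLEU with IMO-only model, BLEU with combined model)
results = {
    "en-es": (54, 52.68, 52.99),
    "en-ar": (4, 41.20, 44.18),
}

# Absolute BLEU gain from combining the IMO corpus with the UN corpus.
gains = {pair: round(combined - imo_only, 2)
         for pair, (_, imo_only, combined) in results.items()}

for pair, gain in gains.items():
    corpus_mw = results[pair][0]
    print(f"{pair}: IMO corpus {corpus_mw} Mw, BLEU gain {gain:+.2f}")
```

The gain is about 0.3 BLEU for en-es (54 Mw of in-house data) and about 3 BLEU for en-ar (4 Mw), which is the pattern the slide describes: the smaller the in-house corpus, the more the general corpus helps.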
15
United Nations: retraining with recent data helps…
16
User acceptance
How is Tapta perceived among translators?
When seen as a “translation accelerator”: useful
When seen as a “replacement for translator”: useless
When proposed as a copy-paste tool: not used
When integrated in the translator’s environment: used daily
At least 30% gain in time, and gain in quality!
17
Future work
Post-editions / incremental training
Better domain adaptation
(Continue) Integration
Translating through pivot (e.g. fr-zh)
Syntax-based models / Neural network models…
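The "translating through pivot" item can be illustrated with a minimal sketch: when no direct fr-zh model exists, chain a fr-to-pivot model with a pivot-to-zh model. English as the pivot is an assumption here, and the toy word-by-word "translators" below are hypothetical stand-ins for real SMT models.

```python
def make_pivot_translator(src_to_pivot, pivot_to_tgt):
    """Compose two translation functions into one source-to-target function."""
    def translate(text):
        return pivot_to_tgt(src_to_pivot(text))
    return translate


def word_by_word(table):
    """Toy translator: look up each token, leave unknown tokens unchanged."""
    return lambda text: " ".join(table.get(word, word) for word in text.split())


# Hypothetical toy models standing in for fr-en and en-zh engines.
fr_to_en = word_by_word({"bonjour": "hello", "monde": "world"})
en_to_zh = word_by_word({"hello": "你好", "world": "世界"})

fr_to_zh = make_pivot_translator(fr_to_en, en_to_zh)
result = fr_to_zh("bonjour monde")
```

The trade-off is that errors from the first step are passed on to the second, so pivoting mainly pays off when direct parallel data for the pair is scarce.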
18
Future work: user feedback
User feedback
Take into account new translations
Blacklist of phrases
Collect post-editions
Quality estimation
…
19
Conclusion
Free-to-use tool provides “accurate” machine translation
Always question machine output & BLEU scores
Tool flexibility: different organizations, different working methods
Integration/customization is essential
What next?
  Publish our corpora
  Invest time in MT
20
Why is each model so specific?
Source: Withdrawing "securing pins" from inside a sealed lifeboat
WIPO: retrait des "broches de fixation" à partir de l'intérieur d'un canot de sauvetage étanche
Tapta4UN: retirer "la sécurisation des numéros d'identification personnels (PIN)" embarcation de sauvetage à partir du territoire placé sous scellés
IMO: retirer les "goupilles d'assujettissement" depuis l'intérieur de l'embarcation de sauvetage scellé
Back-translation of the Tapta4UN output: withdraw “the security of personal identification numbers (PIN)” lifeboat from the territory under sealed
Reason: the Tapta4UN model was not trained on “maritime” data
21
Why is each model so specific?
Source: Withdrawing "securing pins" from inside a sealed lifeboat
TaptaWIPO: retrait des "broches de fixation" à partir de l'intérieur d'un canot de sauvetage étanche
Tapta4UN: retirer "la sécurisation des numéros d'identification personnels (PIN)" embarcation de sauvetage à partir du territoire placé sous scellés
Tapta4IMO: retirer les "goupilles de sécurité" depuis l'intérieur de l'embarcation de sauvetage scellé
22
Thank you for your attention
شكرا لكم على اهتمامكم Merci pour votre attention! 感谢您的关注 Grazie per la vostra attenzione! ¡ Gracias por su atención ! Obrigado pela vossa atenção! Dziękuję bardzo za Państwa uwagę! Děkujeme za Vaši pozornost! Ďakujem ti veľmi pekne za tvoju pozornosť Tänan tähelepanu eest! Благодарим за Вашето внимание! Tak for Jeres opmærksomhed!