Raivis SKADIŅŠ (a,b), Kārlis GOBA (a) and Valters ŠICS (a); (a) Tilde SIA, Latvia; (b) University of Latvia, Latvia. Machine Translation and Morphologically-Rich Languages.

Raivis SKADIŅŠ (a,b), Kārlis GOBA (a) and Valters ŠICS (a)
(a) Tilde SIA, Latvia
(b) University of Latvia, Latvia
Machine Translation and Morphologically-Rich Languages
University of Haifa, Israel, January 2011

The exotic Baltic languages
Problems we all know about
  What Philipp & others said
  Agreement, reordering
  Sparseness
  Alignment and evaluation
Morphology integration
  (Not) using factor models
  English-Latvian
  Lithuanian-English
Interlude: Human evaluation

Case     PIE       Latvian   Lithuanian
Nom.     pod-s     pēd-a     pėd-a
Voc.     pod       pēd-a     pėd-a
Acc.     pod-m̥     pēd-u     pėd-ą
Instr.   ped-eh    pēd-u     pėd-a
Dat.     ped-ey    pēd-ai    pėd-ai
Abl.     ped-es
Gen.     ped-es    pēd-as    pėd-os
Loc.     ped-(i)   pēd-ā     pėd-oje
>2000 morphosyntactic tags, >1000 observed
Derivation vs inflection
Similar problems as with Czech, German, etc.
  Case/gender/number/definiteness
  Quite elaborate participle system
  Feature redundancy

Latv(ij|i|j)a|Latfia|Let(land|[oó]nia)|Lett(land|ország|oni[ae])|Lotyšk[áo]|Łotwa|Läti|An Laitvia
Fancy letters: ā ē ī ū š ž č ķ ģ ņ ļ
Lietuva|Liettua|Lith([uw]ania|áen)|Lit(uani[ea]|[vw]a|vánia|vanija|(au|ouw)en)|An Liotuáin
Fancy letters: ė ę ą ų ū
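The two patterns on this slide are ordinary regular expressions over country names. As a minimal sketch, assuming the spaces inside the alternations are line-wrap artifacts from the slide and that whole-string matching is intended, they can be tried out as follows (the function name `country` is hypothetical):

```python
import re
from typing import Optional

# Patterns copied from the slide; spaces inside alternations are assumed
# to be line-wrap artifacts and have been removed.
LATVIA = re.compile(
    r"Latv(ij|i|j)a|Latfia|Let(land|[oó]nia)|Lett(land|ország|oni[ae])"
    r"|Lotyšk[áo]|Łotwa|Läti|An Laitvia"
)
LITHUANIA = re.compile(
    r"Lietuva|Liettua|Lith([uw]ania|áen)"
    r"|Lit(uani[ea]|[vw]a|vánia|vanija|(au|ouw)en)|An Liotuáin"
)


def country(name: str) -> Optional[str]:
    """Return the English country name if the whole string matches a pattern."""
    if LATVIA.fullmatch(name):
        return "Latvia"
    if LITHUANIA.fullmatch(name):
        return "Lithuania"
    return None


if __name__ == "__main__":
    for candidate in ["Latvija", "Lettland", "Litauen", "Litouwen", "Estonia"]:
        print(candidate, "->", country(candidate))
```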

Quality
  BLEU useful for development, but not for the end user
  Efficient human evaluation
Hardware requirements
  Decoding speed
  Memory
Not just bare translation
  Correct casing
  Domain-specific translation
  User feedback

English: dependency parser
Latvian: morphological analysis, disambiguated
Add lemmas and morphological tags as factors
  pēdas → pēda | Nfsn
  Tags also mark agreement for prepositions
Alternatively,
  Split words into stems and suffixes
  pēdas → pēd | as
  More ambiguous
  -u: Amsa, Ampg, Afsa, Afpg, ...
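The factor annotation described here can be sketched as below. The pipe-separated `surface|lemma|tag` token layout is the one commonly used for factored SMT input (for example in Moses); whether the authors used exactly this layout is an assumption. The lexicon `ANALYSES`, the function `to_factors` and the `UNK` fallback tag are hypothetical, and only the entry for pēdas comes from the slide.

```python
# Toy disambiguated analyses: surface form -> (lemma, morphological tag).
# Only "pēdas" is taken from the slide; a real system would query a
# morphological analyser and tagger here.
ANALYSES = {
    "pēdas": ("pēda", "Nfsn"),
}


def to_factors(tokens, analyses=ANALYSES, unknown_tag="UNK"):
    """Render tokens as 'surface|lemma|tag', one factored token per word."""
    factored = []
    for token in tokens:
        # Unknown words keep their surface form as lemma (an assumption).
        lemma, tag = analyses.get(token, (token, unknown_tag))
        factored.append("|".join((token, lemma, tag)))
    return " ".join(factored)


if __name__ == "__main__":
    print(to_factors(["pēdas"]))
```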

Tuning and evaluation data
  1000 sentence tuning set
  sentence eval set
  Domain/topic mixture
  Available in EN, LV, LT, ET, RO, SL, HR, DE, RU
ACCURAT project

What Philipp said...
  5-gram LM over surface forms
  7-gram LM over morphology tags / suffixes

Comparable corpora
Heavily filtered
  Alignment score
  Length
  Alphanumeric
  Language detection
22.5M parallel units, 0.9M left
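A filter over the four criteria listed on this slide might look like the sketch below. The slide names only the criteria, so every threshold, the `Pair` record, the word-count length ratio and the toy language detector are assumptions made for illustration; a real pipeline would plug in its own aligner scores and language identifier.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Pair:
    src: str            # source-language segment
    tgt: str            # target-language segment
    align_score: float  # score from the sentence aligner (scale assumed 0..1)


def alnum_ratio(text: str) -> float:
    """Share of non-space characters that are letters or digits."""
    chars = [c for c in text if not c.isspace()]
    return sum(c.isalnum() for c in chars) / len(chars) if chars else 0.0


def toy_detect_lang(text: str) -> str:
    """Stand-in detector: Latvian if any Latvian diacritic occurs, else English."""
    return "lv" if any(c in "āēīūšžčķģņļ" for c in text.lower()) else "en"


def keep(pair: Pair,
         detect_lang: Callable[[str], str] = toy_detect_lang,
         src_lang: str = "en", tgt_lang: str = "lv",
         min_score: float = 0.5,       # hypothetical threshold
         max_len_ratio: float = 2.0,   # hypothetical threshold
         min_alnum: float = 0.6) -> bool:  # hypothetical threshold
    """Apply the four filters in order; any failure rejects the pair."""
    if pair.align_score < min_score:
        return False
    n_src, n_tgt = len(pair.src.split()), len(pair.tgt.split())
    if n_src == 0 or n_tgt == 0:
        return False
    if max(n_src, n_tgt) / min(n_src, n_tgt) > max_len_ratio:
        return False
    if min(alnum_ratio(pair.src), alnum_ratio(pair.tgt)) < min_alnum:
        return False
    return detect_lang(pair.src) == src_lang and detect_lang(pair.tgt) == tgt_lang
```

Applying `keep` to every candidate pair and retaining only those that pass mirrors the heavy reduction reported on the slide, although the actual rejection rate depends entirely on the thresholds chosen.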

Domain                                          Our SMT   Google   Delta
General information about European Union        41.6%              0.0%
Fiction                                         32.4%     22.5%    9.9%
Letters                                         11.1%     12.5%    -1.4%
News and magazines                              10.0%     8.9%     1.1%
Popular science and education                   23.0%     40.0%    -16.9%
Official and legal documents                    52.0%     46.8%    5.3%
Information technology                          63.9%     47.5%    16.5%
Specifications, instructions and manuals        31.6%     29.8%    1.8%

We got slightly better BLEU scores, but is it really getting better?
Looking for practical methods
  simple, reliable, relatively cheap

Ranking of translated sentences relative to each other
  Only 2 systems
  The same 500 sentences as for BLEU
Web based evaluation system
  Upload source text and 2 target outputs
  Send URL to many evaluators
  See results

We calculate
  Probability that system A is better than B:
  Confidence interval:
We calculate it
  Based on all evaluations
  Based on a sentence level
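The formulas on this slide did not survive in the transcript. One common way to compute these two quantities, offered here only as a sketch and not confirmed as the authors' exact method, is to take the share of A-preferred judgements, p = A / (A + B), with a normal-approximation confidence interval of half-width z * sqrt(p (1 - p) / n). The function names, the vote encoding and the choice to drop tied sentences are all assumptions.

```python
import math


def preference(a_wins: int, b_wins: int, z: float = 1.96):
    """Return (p, half_width): estimated P(A better than B) and CI half-width.

    z = 1.96 corresponds to a 95% interval under the normal approximation.
    """
    n = a_wins + b_wins
    if n == 0:
        raise ValueError("no judgements to evaluate")
    p = a_wins / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p, half_width


def all_evaluations(votes):
    """Pool every individual judgement; votes maps sentence id -> list of 'A'/'B'."""
    flat = [v for judgements in votes.values() for v in judgements]
    return preference(flat.count("A"), flat.count("B"))


def sentence_level(votes):
    """Decide each sentence by majority first, then count sentences.

    Tied sentences are dropped, which is an assumption of this sketch.
    """
    a = b = 0
    for judgements in votes.values():
        n_a, n_b = judgements.count("A"), judgements.count("B")
        if n_a > n_b:
            a += 1
        elif n_b > n_a:
            b += 1
    return preference(a, b)
```

If the interval around p lies entirely above 0.5, system A can be called better at the chosen confidence level; if it straddles 0.5, more judgements are needed.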

No morphology tagger for Lithuanian
Split
  Stem and an optional suffix
  Mark the suffix
  vienas → vien #as
Suffixes correspond to endings, but are ambiguous
  One would expect #ai to correspond to "to"/"for"
Prefixes – verb negation
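The stem and suffix split with a marked suffix can be sketched as follows. The suffix list, the longest-match rule and the minimum stem length are hypothetical choices for illustration; only the example vienas → vien #as and the `#` marker come from the slide.

```python
# Hypothetical, deliberately tiny suffix inventory; a real list would be
# much longer and language-specific.
SUFFIXES = ["as", "ai"]


def split_word(word, suffixes=SUFFIXES, min_stem=2):
    """Split a word into [stem, '#suffix'], or return [word] if no suffix fits."""
    # Try longer suffixes first so the longest matching ending wins.
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return [word[: -len(suffix)], "#" + suffix]
    return [word]


def segment(sentence):
    """Apply split_word to every whitespace-separated token."""
    return " ".join(piece for word in sentence.split() for piece in split_word(word))


if __name__ == "__main__":
    print(segment("vienas"))
```

Marking the suffix with `#` keeps it distinct from any free-standing word with the same spelling, so the translation model can learn separate alignments for endings.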

Automatic evaluation Human evaluation

Translating to highly inflected language
  Some success in predicting the right inflections by a LM
  Things to try:
    Two-step approach
    Marking the relevant source features
Translating from highly inflected language
  Slight decrease in BLEU
  Decrease in OOV rate
  Human evaluation suggests users prefer lower OOV rate
  Things to try:
    Removing the irrelevant features