Patent documentation - comparison of two MT strategies Lene Offersgaard, Claus Povlsen Center for Sprogteknologi, University of Copenhagen


MT-Summit, Sep

A comparison of two different MT strategies

- RBMT and SMT: similarities and differences in a patent documentation context
- What requirements should be met in order to develop an SMT production system within the area of patent documentation?

The two strategies:
- PaTrans: a transfer- and rule-based translation system, used for the last 15 years at Lingtech A/S (Ørsnes, 1996)
- SpaTrans: an SMT system based on the Pharaoh framework (Koehn, 2004)

Investigations supported by the Danish Research Council.
Subdomain: chemical patents

A comparison of two different MT strategies - 2

PaTrans: transfer- and rule-based
- En-Da, linguistic development
- Grammatical coverage tailored to the text type of patents
- Tools for terminology selection and coding
- Handling of formulas and references

SpaTrans: an SMT system based on the Pharaoh framework
- En-Da, research version
- Word and grammatical coverage determined by the training corpus
- No terminology handling yet
- Simple handling of formulas and references

Translation workflow

English patent -> Preprocessing -> Translation engine -> Postprocessing -> Danish patent -> Proofreading

PaTrans: linguistic resources
- PaTrans engine
- Lexicon
- Grammar
- Term bases

SpaTrans: statistical resources
- Language model (srilm 3)
- Phrase table
- Pharaoh decoder

Corpus             English words   Danish words
Training           4.2 mill        4.5 mill
Language model     -               4.5 mill
Development test
Test

BLEU evaluation

- Reference translations are two post-edited PaTrans translations
- The PaTrans system is favoured: term bases, wording and sentence structure
- Some SpaTrans errors are caused by incomplete treatment of formulas and references
- BLEU differs for the two patents
- Very promising results for the SpaTrans system

BLEU % table (values not preserved in the transcript)
Columns: Test patent A, Test patent B
Rows: PaTrans; SpaTrans with reordering; SpaTrans monotonic; Diff (PaTrans - SpaTrans mono.)
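As a rough illustration of the metric used on this slide, the following is a minimal sketch of sentence-level BLEU with clipped n-gram precision, a brevity penalty and support for several reference translations. It is not the evaluation script used for the reported scores: BLEU is normally computed over a whole test corpus, and the function names (`ngrams`, `bleu`) and the example sentences are assumptions made for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, references, max_n=4):
    """Unsmoothed sentence-level BLEU against one or more references."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        if not cand_counts:
            return 0.0  # candidate is shorter than n tokens
        # For each n-gram, the highest count seen in any single reference.
        max_ref = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        # Clip candidate counts so repeated n-grams are not over-rewarded.
        clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        if clipped == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(clipped / sum(cand_counts.values())))
    # Brevity penalty uses the reference length closest to the candidate.
    c = len(candidate)
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(log_precisions) / max_n)


reference = "kontrol af det fulde spektrum".split()
hypothesis = "kontrol af den fulde spektrum".split()
print(bleu(reference, [reference]))            # identical output
print(bleu(hypothesis, [reference], max_n=2))  # one wrong determiner
```

A single wrong determiner breaks every n-gram that spans it, which is why short agreement errors can lower BLEU more than their size suggests.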

Human evaluation of the SMT system

Limited resources for manual evaluation. Proofreaders have post-edited SMT output and focussed on:
- Post-editing time
- Quality of output
  - Intelligibility (understandable?)
  - Fidelity (same meaning?)
  - Fluency (fluent Danish?)

Conclusions:
- Usable translation quality
- Both intelligibility and fidelity scores are best without reordering
- Annoying agreement errors
- It must be easy to include new terms in the SMT system

SpaTrans translation results

A dominant error pattern is the frequent occurrence of agreement errors in nominal phrases.

Examples

Gender disagreement (lit: … control of the full spectrum):
  … kontrol af den fulde spektrum
  … kontrol af den[DET_common_sing] fulde spektrum[N_neuter_sing]
Corrected output:
  … kontrol af det[DET_neuter_sing] fulde spektrum[N_neuter_sing]

SpaTrans translation results - 2

Number disagreement (lit: … the active ingredients):
  … den aktive bestanddele
  … den[DET_common_sing] aktive bestanddele[N_common_plur]
Corrected output:
  … de[DET_common_plur] aktive bestanddele[N_common_plur]

Definiteness disagreement (lit: ... this constant erosion):
  ... denne konstant erosion
  ... denne[DET_definite] konstant[ADJ_indefinite] erosion
Corrected output:
  ... denne[DET_definite] konstante[ADJ_definite] erosion

Let's give linguistic information a try!
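The bracketed annotations in these examples can be checked mechanically. The sketch below parses tokens written in the slides' `word[CATEGORY_feature_feature]` notation and reports features that clash between the first annotated token and the later ones. It is only an illustration of the error pattern, not part of SpaTrans or PaTrans; the helper names (`tagged_tokens`, `disagreements`) are hypothetical, and it assumes the features of the compared tokens are listed in the same order.

```python
import re

# Matches annotated tokens such as den[DET_common_sing].
TAGGED = re.compile(r"(\w+)\[([A-Z]+)_([a-z_]+)\]")


def tagged_tokens(phrase):
    """Return (word, category, features) for every annotated token."""
    return [(word, cat, tuple(feats.split("_")))
            for word, cat, feats in TAGGED.findall(phrase)]


def disagreements(phrase):
    """List feature clashes between the first annotated token and the rest."""
    tokens = tagged_tokens(phrase)
    clashes = []
    if not tokens:
        return clashes
    head_word, _, head_feats = tokens[0]
    for word, _, feats in tokens[1:]:
        # Compare features position by position (gender, number, ...).
        for head_value, value in zip(head_feats, feats):
            if head_value != value:
                clashes.append((head_word, head_value, word, value))
    return clashes


examples = [
    "kontrol af den[DET_common_sing] fulde spektrum[N_neuter_sing]",
    "den[DET_common_sing] aktive bestanddele[N_common_plur]",
    "denne[DET_definite] konstant[ADJ_indefinite] erosion",
    "kontrol af det[DET_neuter_sing] fulde spektrum[N_neuter_sing]",
]
for phrase in examples:
    print(phrase, "->", disagreements(phrase))
```

The first three phrases each yield one clash (gender, number and definiteness respectively), while the corrected output yields none.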

Adding linguistic information to SMT: MOSES

MOSES
- Open source system replacing Pharaoh (Koehn et al. 2007)
- State-of-the-art phrase-based approach
- Using factored translation models

Comparison of the SpaTrans-Pharaoh and Moses decoders
- Reuse of statistical resources
- Pharaoh parameters for the monotonic setup optimised on development tests

BLEU % table (values not preserved in the transcript)
Columns: Test patent A, Test patent B
Rows: SpaTrans with reordering; Moses (SpaTrans models) reord; SpaTrans monotonic; Moses (SpaTrans models) mono

Adding linguistic information using MOSES

Using factored translation models
- Makes it possible to build translation models based on surface forms, part-of-speech, morphology etc.

We use:
- Translation model: word->word, pos->pos
- Generation model: determines the output

Input factors: word, pos+morf
Output factors: word, pos+morf
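Factored training data for Moses is written with the factors of each token joined by a vertical bar, for example `word|pos`. The sketch below shows one way such lines could be produced from parallel lists of words and tags and split again. The helper names (`to_factored`, `from_factored`) and the example tags are assumptions for illustration; the actual tagsets and preprocessing used for SpaTrans are not given in the slides.

```python
def to_factored(words, tags, sep="|"):
    """Join each word with its tag into one factored token per word."""
    if len(words) != len(tags):
        raise ValueError("words and tags must have the same length")
    for word in words:
        if sep in word:
            # The separator is reserved, so raw text must be escaped first.
            raise ValueError(f"token contains the factor separator: {word!r}")
    return " ".join(f"{word}{sep}{tag}" for word, tag in zip(words, tags))


def from_factored(line, sep="|"):
    """Split a factored line back into (word, tag) pairs."""
    return [tuple(token.split(sep)) for token in line.split()]


# Hypothetical tags in the style of the agreement examples.
words = ["det", "fulde", "spektrum"]
tags = ["DET_neuter_sing", "ADJ", "N_neuter_sing"]
line = to_factored(words, tags)
print(line)
print(from_factored(line))
```

With both factors present on the source and target side, separate mappings such as word->word and pos->pos can be trained on the same aligned corpus.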

Adding POS-tags and morphology

- POS-tagging of the training material: Brill tagger used
- Different tagsets for Danish and English text
- Experiments with language model (lm) order 3 or 5
  - Results not significant: Test patent A: +0.1% BLEU, Test patent B: -0.1% BLEU
  - Perhaps the training material is too small for lm order experiments
- Training parameters kept: phrase-length 3, lm order 3
- No tuning of parameters, just training

Results adding pos-tags - by inspection

With inclusion of morpho-syntactic information:
- (lit: … control of the full spectrum) ... kontrol af det fulde spektrum (gender agreement)
- (lit: … the active ingredients) ... de aktive bestanddele (number agreement)
- (lit: ... this constant erosion) ... denne konstante erosion (definiteness agreement)

Results using pos-tags - BLEU

BLEU is not designed to test linguistic improvement. Anyway: significant improvement!

BLEU % table (values not preserved in the transcript)
Columns: Test patent A, Test patent B
Rows: Moses word, lm3, with reordering; Moses word+pos, lm3, with reordering; Moses word, lm3, monotonic; Moses word+pos, lm3, monotonic

Conclusions MOSES

- En-Da patents: best results when no reordering
- Agreement errors can be reduced by applying factored training using pos+morphology
- Experiments using a "language" model of order > 3 for POS-tags might give even better results
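The last point can be made concrete with a toy example. The sketch below is not a probabilistic language model; it only records which tag n-grams occur in a small set of training sequences and lists the n-grams of a candidate sequence that were never seen. The tag names and training sequences are hypothetical. It shows that a determiner-noun clash separated by an adjective is invisible at order 2 but is caught at order 3, which is the intuition behind trying higher orders on tag sequences.

```python
def ngram_set(sequences, n):
    """Collect every tag n-gram of length n seen in the training sequences."""
    return {tuple(seq[i:i + n])
            for seq in sequences
            for i in range(len(seq) - n + 1)}


def unseen_ngrams(candidate, training, n):
    """Return the candidate's n-grams that never occur in the training data."""
    seen = ngram_set(training, n)
    grams = [tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1)]
    return [gram for gram in grams if gram not in seen]


# Hypothetical training tag sequences with correct agreement.
training = [
    ["DET_neuter_sing", "ADJ", "N_neuter_sing"],
    ["DET_common_sing", "ADJ", "N_common_sing"],
]
# Tag sequence of an output with gender disagreement.
candidate = ["DET_common_sing", "ADJ", "N_neuter_sing"]

print(unseen_ngrams(candidate, training, 2))  # no unseen bigrams
print(unseen_ngrams(candidate, training, 3))  # the full trigram is unseen
```

A real setup would estimate smoothed probabilities over the tag factor instead of a seen/unseen check, and longer nominal phrases would need a correspondingly higher order.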

Conclusions

SMT test results for patent text:
- Usable translation quality, comparable with RBMT systems in production
- Low-cost development for a new domain
- Possible to have SMT systems tailored to different domains of patents, if training data are available
- Patent texts always contain new terms/concepts, so new terms have to be handled in SMT production systems
- Agreement errors can be reduced by applying factored training with pos-information. BLEU score improved!

Acknowledgements

Thanks! The work was partly financed by the Danish Research Council. Special thanks to Lingtech A/S and Ploughmann & Vingtoft for providing us with training material and proofread patents.