Patent documentation - comparison of two MT strategies
Lene Offersgaard, Claus Povlsen
Center for Sprogteknologi, University of Copenhagen
MT-Summit, Sep
A comparison of two different MT strategies
RBMT and SMT, similarities and differences, in a patent documentation context
What requirements should be met in order to develop an SMT production system within the area of patent documentation?
The two strategies:
  PaTrans: a transfer- and rule-based translation system, used for the last 15 years at Lingtech A/S (Ørsnes, 1996).
  SpaTrans: an SMT system based on the Pharaoh framework (Koehn, 2004).
Investigations supported by the Danish Research Council.
Subdomain: chemical patents
A comparison of two different MT strategies - 2
PaTrans: transfer- and rule-based
  En-Da, linguistic development
  Grammatical coverage tailored to the text type of patents
  Tools for terminology selection and coding
  Handling of formulas and references
SpaTrans: an SMT system based on the Pharaoh framework
  En-Da, research version
  Word and grammatical coverage determined by the training corpus
  No terminology handling yet
  Simple handling of formulas and references
Translation workflow
Workflow: English patent -> preprocessing -> translation engine -> postprocessing -> Danish patent -> proofreading
PaTrans: linguistic resources (PaTrans engine): lexicon, grammar, term bases
SpaTrans: statistical resources (Pharaoh decoder): language model (srilm 3), phrase table
Corpus            English words   Danish words
Training          4.2 mill.       4.5 mill.
Language model    -               4.5 mill.
Development test
Test
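The slides do not say how the pre- and postprocessing steps treat formulas and references. One common approach, sketched here purely as an assumption, is to replace such spans with placeholders before translation and restore them afterwards. The regular expression, the placeholder format and the function names are all hypothetical and not taken from PaTrans or SpaTrans.

```python
import re

# Hypothetical patterns (assumptions, not from the slides):
#   - bracketed references such as "(2a)" or "(12)"
#   - simple chemical formulas containing at least one digit, e.g. "H2O"
PROTECTED_RE = re.compile(
    r"\(\d+[a-z]?\)"
    r"|\b(?=[A-Za-z0-9]*\d)(?:[A-Z][a-z]?\d*){2,}\b"
)
PLACEHOLDER_RE = re.compile(r"__PH(\d+)__")


def mask(text):
    """Replace protected spans with numbered placeholders before translation."""
    saved = []

    def repl(match):
        saved.append(match.group(0))
        return f"__PH{len(saved) - 1}__"

    return PROTECTED_RE.sub(repl, text), saved


def unmask(text, saved):
    """Put the original spans back into the translated text."""
    return PLACEHOLDER_RE.sub(lambda m: saved[int(m.group(1))], text)


source = "Compound (2a) reacts with H2O and CH3COOH."
masked, saved = mask(source)
# The translation engine would run on `masked`; here the text is passed through unchanged.
restored = unmask(masked, saved)
```

A decoder that reorders phrases can still move a placeholder, which is harmless as long as each placeholder survives as a single token.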
BLEU evaluation
Reference translations are two post-edited PaTrans translations
The PaTrans system is favoured: term bases, wording and sentence structure
Some SpaTrans errors are caused by incomplete treatment of formulas and references
BLEU differs for the two patents
Very promising results for the SpaTrans system
BLEU % (columns: Test patent A, Test patent B)
  PaTrans
  SpaTrans with reordering
  SpaTrans monotonic
  Diff (PaTrans - SpaTrans mono.)
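As a reminder of what the BLEU scores measure, here is a minimal sentence-level sketch: clipped n-gram precision combined with a brevity penalty. It is a simplification and not the scoring script behind the reported figures; real evaluations accumulate counts over the whole test set, and the function names are illustrative.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, references, max_n=4):
    """Simplified sentence-level BLEU for tokenised candidate and references."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        # For each n-gram, the maximum count seen in any single reference.
        max_ref = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        if clipped == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(clipped / sum(cand_counts.values())))
    # Brevity penalty against the reference closest in length.
    c = len(candidate)
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(log_precisions) / max_n)


reference = "kontrol af det fulde spektrum".split()
output = "kontrol af den fulde spektrum".split()
score = bleu(output, [reference], max_n=2)
```

The example shows why a single wrong determiner costs more than one word: it breaks every n-gram that spans it. It also shows why output that reuses the wording of the references scores higher.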
Human evaluation of the SMT system
Limited resources for manual evaluation
Proofreaders have post-edited SMT output and focused on:
  Post-editing time
  Quality of output
    Intelligibility (understandable?)
    Fidelity (same meaning?)
    Fluency (fluent Danish?)
Conclusions:
  Usable translation quality
  Both intelligibility and fidelity scores are best without reordering
  Annoying agreement errors
  It must be easy to include new terms in the SMT system
SpaTrans translation results
A dominant error pattern is the frequent occurrence of agreement errors in nominal phrases
Examples
Gender disagreement: (lit: … control of the full spectrum)
  … kontrol af den fulde spektrum
  … kontrol af den[DET_common_sing] fulde spektrum[N_neuter_sing]
Corrected output:
  … kontrol af det[DET_neuter_sing] fulde spektrum[N_neuter_sing]
SpaTrans translation results - 2
Number disagreement: (lit: … the active ingredients)
  … den aktive bestanddele
  … den[DET_common_sing] aktive bestanddele[N_common_plur]
Corrected output:
  … de[DET_common_plur] aktive bestanddele[N_common_plur]
Definiteness disagreement: (lit: ... this constant erosion)
  ... denne konstant erosion
  ... denne[DET_definite] konstant[ADJ_indefinite] erosion
Corrected output:
  ... denne[DET_definite] konstante[ADJ_definite] erosion
Let's give linguistic information a try!
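The gender and number examples can be checked mechanically once determiner and noun carry tags of the form CATEGORY_gender_number, as in the annotated output above. The sketch below is an illustration only: the function name is hypothetical, it assumes three-part tags, and it does not cover the two-part definiteness tags.

```python
def agreement_errors(phrase):
    """Return agreement mismatches between determiner and noun.

    `phrase` is a list of (word, tag) pairs; untagged words use None.
    Tags are assumed to look like "DET_common_sing" or "N_neuter_sing".
    """
    features = {}
    for _word, tag in phrase:
        if tag is None:
            continue
        category, gender, number = tag.split("_")
        features[category] = (gender, number)
    if "DET" not in features or "N" not in features:
        return []
    det_gender, det_number = features["DET"]
    noun_gender, noun_number = features["N"]
    if det_number != noun_number:
        return ["number"]
    # Gender is compared only when number already agrees.
    if det_gender != noun_gender:
        return ["gender"]
    return []


gender_case = [("den", "DET_common_sing"), ("fulde", None), ("spektrum", "N_neuter_sing")]
number_case = [("den", "DET_common_sing"), ("aktive", None), ("bestanddele", "N_common_plur")]
corrected = [("det", "DET_neuter_sing"), ("fulde", None), ("spektrum", "N_neuter_sing")]
```

A check like this only detects errors after translation. The factored models discussed later aim to avoid producing them in the first place.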
Adding linguistic information to SMT: MOSES
MOSES
  Open source system replacing Pharaoh (Koehn et al. 2007)
  State-of-the-art phrase-based approach
  Using factored translation models
Comparison of the SpaTrans-Pharaoh and Moses decoders
  Reuse of statistical resources
  Pharaoh parameters for the monotonic setup optimised based on development tests
BLEU % (columns: Test patent A, Test patent B)
  SpaTrans with reordering
  Moses (SpaTrans models) reord
  SpaTrans monotonic
  Moses (SpaTrans models) mono
Adding linguistic information using MOSES
Using factored translation models
  Makes it possible to build translation models based on surface forms, part-of-speech, morphology etc.
We use:
  Translation model: word->word, pos->pos
  Generation model: determines the output
Input: word, pos+morf
Output: word, pos+morf
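Factored training data for Moses is written with the factors of each token joined by a vertical bar, for example word|pos. The helper below is a minimal sketch of producing such lines from parallel lists of words and tags. The function name and the short tag names in the example are illustrative and not the tagsets used in the experiments.

```python
def to_factored(words, tags, separator="|"):
    """Join each word with its tag into one factored token per word."""
    if len(words) != len(tags):
        raise ValueError("words and tags must have the same length")
    tokens = []
    for word, tag in zip(words, tags):
        if separator in word or separator in tag:
            raise ValueError(f"factor separator {separator!r} found in input")
        tokens.append(f"{word}{separator}{tag}")
    return " ".join(tokens)


line = to_factored(["det", "fulde", "spektrum"], ["DET", "ADJ", "N"])
```

Rejecting tokens that already contain the separator keeps the factor structure unambiguous. A production pipeline would more likely escape such tokens than reject them.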
Adding POS-tags and morphology
POS-tagging of training material:
  Brill tagger used
  Different tagsets for Danish and English text
Experiments with language model (lm) order: order 3 or 5
  Results not significant:
    Test patent A: +0.1% BLEU
    Test patent B: -0.1% BLEU
  Perhaps training material too small to do lm order experiments
Training parameters kept: phrase length 3, lm order 3
No tuning of parameters, just training.
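One way to probe the suspicion that the training material is too small for higher-order models is to count how many n-grams occur only once at each order. The sketch below uses plain Python rather than SRILM and is not the procedure from the experiments. The function names are illustrative, and the padding symbols follow the usual convention.

```python
from collections import Counter


def ngram_counts(sentences, order):
    """Count n-grams of the given order over tokenised sentences."""
    counts = Counter()
    for tokens in sentences:
        padded = ["<s>"] * (order - 1) + list(tokens) + ["</s>"]
        for i in range(len(padded) - order + 1):
            counts[tuple(padded[i:i + order])] += 1
    return counts


def singleton_ratio(counts):
    """Share of distinct n-grams that were seen exactly once."""
    if not counts:
        return 0.0
    return sum(1 for c in counts.values() if c == 1) / len(counts)


toy = [["a", "b", "c"], ["a", "b", "d"]]
ratio_order1 = singleton_ratio(ngram_counts(toy, 1))
ratio_order3 = singleton_ratio(ngram_counts(toy, 3))
```

A ratio that climbs steeply between order 3 and order 5 suggests that most higher-order estimates rest on single observations. The same function works on sequences of POS tags, whose smaller vocabulary makes them less sparse than word sequences.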
Results adding pos-tags - by inspection
With inclusion of morpho-syntactic information:
(lit: … control of the full spectrum)
  ... kontrol af det fulde spektrum (gender agreement)
(lit: … the active ingredients)
  ... de aktive bestanddele (number agreement)
(lit: ... this constant erosion)
  ... denne konstante erosion (definiteness agreement)
Results using pos-tags - BLEU
BLEU % (columns: Test patent A, Test patent B)
  Moses word, lm3, with reordering
  Moses word+pos, lm3, with reordering
  Moses word, lm3, monotonic
  Moses word+pos, lm3, monotonic
BLEU is not designed to test linguistic improvement; even so: significant improvement!
Conclusions
MOSES En-Da patents: best results when no reordering
Agreement errors can be reduced by applying factored training using pos+morphology
Experiments using a ”language” model order > 3 for POS-tags might give even better results
Conclusions
SMT test results for patent text
  Usable translation quality comparable with RBMT systems in production
  Low-cost development for new domain
  Possible to have SMT systems tailored to different domains of patents - if training data are available
Patent texts always contain new terms/concepts
  Therefore new terms have to be handled in SMT production systems
Agreement errors can be reduced by applying factored training with pos-information - BLEU score improved!
Acknowledgements
Thanks!
The work was partly financed by the Danish Research Council.
Special thanks to Lingtech A/S and Ploughmann & Vingtoft for providing us with training material and proofread patents.