Evaluation of Free Online Machine Translations for Croatian-English and English-Croatian Language Pairs Sanja Seljan,

Presentation transcript:

Evaluation of Free Online Machine Translations for Croatian-English and English-Croatian Language Pairs
Sanja Seljan, University of Zagreb - Faculty of Humanities and Social Sciences, Department of Information Sciences, Croatia
Marija Brkić, University of Rijeka, Department of Informatics, Croatia
Vlasta Kučiš, University of Maribor, Department of Translation Studies, Slovenia
FF Zagreb – Information Sciences

Aim
 Evaluation of texts from four domains (city description, law, football, monitors)
 Cro-En: by four free online translation services (Google Translate, Stars21, InterTran and Translation Guide)
 En-Cro: by Google Translate
 Measurement of inter-rater agreement (Fleiss' kappa)
 Influence of error types on the criteria of fluency and adequacy (Pearson's correlation)

I. Introduction
II. MT evaluation
III. Experimental study
 Translation tools
 Test set description
 Evaluation
 Error analysis
 Correlations
IV. Conclusion

I INTRODUCTION
 Increased use of online translation services in recent years, even among less widely spoken languages
 Desirable: moderate to good quality translations
 Evaluation from the user's perspective
 Tools and evaluation developed mainly for widely spoken languages
 Possible uses: gisting translations, information retrieval, e.g. question-answering systems
 1976: Systran – the first MT system for the Commission of the European Communities, later an online tool in different versions
 The first online translation tool, Babel Fish, used Systran technology
 Important: realistic expectations

I  Studies for popular languages  Considerable difference in the quality of translation dependent on the language pair  German-French (GT, ProMT, WorldLingo)  three popular online tools  Spanish-English (introductory textbook)  2008 – 13 languages into English (6 tools: BabelFish, Google Translate, ProMT, SDL free translator, Systran, World Lingo)

I  MT evaluation – important in research and product design  measure system performance  identify weak points and adjust parameter settings  language independent algorithms (BLEU, NIST)  Better metric – closer to human evaluation  need for qualitative evaluation of different linguistic phenomena

III EXPERIMENTAL STUDY
 Evaluation of free online translation services (FTS) from the user's perspective
 Evaluators: undergraduate and graduate students of languages, linguistics and information sciences attending courses on language technologies at the University of Zagreb, Faculty of Humanities and Social Sciences

Test set description
 Texts from 4 domains (city description, law, football, monitors)
 Approx. 7-9 sentences per domain (17.8 words per sentence)
 Cro-En, En-Cro

Evaluators
 Cro-En: 48 students in the final year of undergraduate and graduate levels
 En-Cro: 50 students, native speakers
 75% of the students had attended language technology course(s)

Evaluation before the pilot study: average grades for free language resources on the Internet

[Chart: average grades – Croatian tools/resources vs. tools/resources in general]

[Chart: desirable tools/resources of appropriate quality]

Evaluation
Manual evaluation
 Fluency (how fluent the translation is in the target language)
 Adequacy (how much of the information is adequately transmitted)
 Evaluation enriched by translation error analysis:
−morphological errors
−untranslated words
−lexical errors and word omissions
−syntactic errors
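A minimal sketch of how such per-sentence evaluations could be recorded and aggregated; the field names and numbers below are hypothetical, not data from the study:

```python
# Each sentence gets 1-5 grades for fluency and adequacy plus
# per-category error counts; aggregation is then straightforward.
from dataclasses import dataclass, field
from statistics import mean

@dataclass
class SentenceEval:
    fluency: int                                 # 1-5 scale
    adequacy: int                                # 1-5 scale
    errors: dict = field(default_factory=dict)   # category -> count

evals = [
    SentenceEval(4, 5, {"morphological": 1}),
    SentenceEval(2, 3, {"untranslated": 2, "lexical": 1}),
]
avg_fluency = mean(e.fluency for e in evals)
untranslated_total = sum(e.errors.get("untranslated", 0) for e in evals)
print(avg_fluency, untranslated_total)
```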

Tools
Cro-En translations
 Google Translate (GT)
 Stars21 (S21)
 InterTran (IT)
 Translation Guide (TG)
En-Cro translations
 Obtained from Google Translate

Google Translate
 Translation service provided by Google Inc.
 Statistical MT trained on huge amounts of corpora
 Supports 57 languages, Croatian since 2008
S21 service
 Powered by GT
 Translations are not always the same as GT's
InterTran
 Powered by NeuroTran and WordTran
 Translates sentence-by-sentence and word-by-word
Translation Guide
 Powered by IT
 Produces different translations

Results – Cro-En
 Either low grades (TG and IT) or high grades (S21 and GT) compared to the average value (3.04)
 S21 (4.66) vs. GT (4.62) – city description, legal domain
 GT – football, monitors
 Best average result in the legal domain, followed by monitors and football
 Lowest in city description (the freest in style)

Results – En-Cro
 Lower average results than in the reverse direction: football (3.75 vs. 4.84), law, monitors
 Higher average grade in city description (shorter sentences, mostly nominative constructions, frequent terms)
 Football domain – specific terms, non-nominative constructions

Error analysis
En-Cro
 Translations offered by GT and S21 are very similar, although not identical
 TG and IT – difference in the number of untranslated words
 TG does not recognize words with diacritics
Cro-En
 The highest number of lexical errors, including errors in style
 Untranslated words (1.83), morphological errors (1.75), syntactic errors (1.38)
 Lowest score and highest number of errors in the football domain (mostly lexical errors and untranslated words)
 Best score in the city description domain (lexical errors)
 Lowest number of errors in the legal domain (evenly distributed)

I  Morphological errors – mostly in domain of monitors, the smallest no. in city desription (dominant value 1)  Untranslated words - by far mostly in the football  translation grades - mostly influenced by untranslated words Dominant values  Morphological errors: 1 in city description and monitors, 3 in the legal and football  Lexical errors: 1 in city description, others higher  untranslated words - 1 in all domains  syntactic errors - 1 in all domains but football (2-3)

Pearson's correlation
 A smaller number of errors raises the average grade
 Correlation between error types and the criteria of fluency and adequacy
 Fluency is more affected by the increase of lexical and syntactic errors
 Adequacy is more affected by untranslated words
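Pearson's r measures the linear relationship between two variables, here error counts and grades. A minimal pure-Python sketch with made-up numbers (not the study's data), where more untranslated words correlate with lower adequacy grades:

```python
# Pearson's r: covariance of the two series divided by the product
# of their standard deviations; ranges from -1 to 1.
import math

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

untranslated = [0, 1, 3, 5, 2]   # untranslated words per sentence (hypothetical)
adequacy = [5, 4, 2, 1, 3]       # average adequacy grade, 1-5 (hypothetical)
print(round(pearson_r(untranslated, adequacy), 3))  # → -0.986 (strong negative correlation)
```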

Fleiss' kappa
 Assesses the reliability of agreement among raters when rating the sentences
 Indicates the extent to which the observed agreement among raters exceeds what would be expected if all raters rated completely at random
 Score between 0 and 1 (perfect agreement)
Agreement scale: slight – fair – moderate – substantial – almost perfect agreement
Notation:
 N – total number of subjects
 n – number of raters per subject
 i – extent to which raters agree on the i-th subject
 j – categories
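The computation behind Fleiss' kappa can be sketched in a few lines of Python; the ratings matrix below is hypothetical, not data from the study:

```python
# Fleiss' kappa: (P_bar - P_e) / (1 - P_e), where P_bar is the mean
# per-subject agreement and P_e the agreement expected by chance.
def fleiss_kappa(ratings):
    """ratings[i][j] = number of raters assigning category j to subject i."""
    n_subjects = len(ratings)
    n_raters = sum(ratings[0])          # raters per subject (must be constant)
    n_categories = len(ratings[0])
    # p_j: proportion of all assignments falling into category j.
    p_j = [sum(row[j] for row in ratings) / (n_subjects * n_raters)
           for j in range(n_categories)]
    # P_i: extent to which raters agree on the i-th subject.
    P_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in ratings]
    P_bar = sum(P_i) / n_subjects       # observed agreement
    P_e = sum(p * p for p in p_j)       # chance agreement
    return (P_bar - P_e) / (1 - P_e)

# 4 sentences, 5 raters each, categories = grades 1-5.
ratings = [
    [0, 0, 0, 2, 3],   # sentence 1: two raters chose grade 4, three chose 5
    [0, 1, 2, 2, 0],
    [0, 0, 0, 1, 4],
    [1, 2, 2, 0, 0],
]
print(round(fleiss_kappa(ratings), 3))  # → 0.133 (slight agreement)
```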

I  relatively high level of the agreement among raters per domain and per system in Cro-En translations  moderate (for IT translation service),  substantial agreement (S21 and GT)  perfect agreement (TG – the worst tool)  En-Cro translations - inter-rater agreement per domain  lowest level of agreement has been detected in the domains of football and law (from fair & moderate) – larger and more complex sentences  substantial agreement ( ) – in city description  level of inter-rater agreement is lower for En-Cro translations in all domains

Conclusion
 Evaluation study of MT in 4 domains
 Cro-En – 4 free online translation services
 En-Cro translations – by Google Translate
Evaluators' profile
 High interest in the use of translation resources and tools
 Critical evaluation
System evaluation
 Perfect agreement in ranking TG as the worst translation service
 Substantial agreement achieved for the S21 and GT services
 Moderate agreement shown for IT, which performed slightly better than TG

Cro-En translations
 S21 and GT (4.63 to 4.84) – football, law and monitors
 City description – Cro-En grades lower than En-Cro
En-Cro direction – by GT
 Lower grades than in the opposite direction (specific terms, non-nominative constructions, multi-word units)
 Except the city description domain – containing mostly nominative constructions, frequent words, no specific terms
Error analysis
 Translation grades are mostly influenced by untranslated words (especially the criterion of adequacy)
 Morphological and syntactic errors affect grades in smaller proportion (fluency)

Google Translate service
 Used in both translation directions
 Harvesting data from the Web; seems to be well trained and suitable for the translation of frequent expressions
 Does not perform well where linguistic information is needed, e.g. gender agreement in multi-word expressions
Further research
 Deeper quantitative analysis per domain
 More detailed analysis of specific language phenomena
