Automatic translation quality control using Eurovoc descriptors Marko Tadić, Božo Bekavac


Automatic translation quality control using Eurovoc descriptors
Marko Tadić, Božo Bekavac
Department of Linguistics / Institute of Linguistics, Faculty of Philosophy, University of Zagreb
JRC Ispra / Arona

Talk plan
motivation
automatic translation quality control
resources: glossary and test corpus
results
further directions

Motivation
Acquis communautaire (AC) is still being translated into Croatian
  – former Ministry of European Integrations (MEI), today Ministry of External Affairs and European Integrations
AC
  – ca 200,000 pages of the EU OJ
  – AC corpus: from 8 Mw (Estonian) to 82 Mw (Spanish)
  – not precisely delimited (lawyers are working on that!)
  – constantly growing
  – legal texts
      – many repetitive and formulaic expressions
      – low polysemy expected in terms

Motivation 2
different EU accession candidates → different organization of the translation process
  – several years of work
  – large number of translators
  – in-house/out-house (tenders)
  – large-scale document translation and revision
MEI
  – outsourcing to ca 100 translators or translation companies
  – use of a glossary with pre-established translation equivalents (TEs)
  – glossaries being translated in advance
      – Eurovoc
      – EU Law Glossary / Čtyřjazyčný slovník práva Evropské unie, Prague 1999
maintain the consistency of translation
  – by use of the same glossary only?

Preparing AC originals for translation
project proposed by our Institute to MEI in 2002
entries from the glossary marked in the original text before translation
signal to the translator that a pre-established TE exists
obligatory usage of existing TEs in legal texts, e.g.:
  – Council of Europe = Vijeće Europe
  – European Council = Europsko vijeće
  – Council of the European Union = Vijeće Europske unije
  ...
AC had to be converted to XML
MEI dropped the project in 2003 for lack of finances
now: AC corpus in XML

Revision of Translation
largest effort was put on translation in all candidate countries
revision of translation always in the last place
  – quality: consistency
task undermined by all candidate countries
  – large portions of the official translation of AC poorly revised
usually done
  – manually
  – simple search & replace commands
  – no terms/entries marked in texts
automatic approach?
  – lexical level and idiomatic level

Automatic Translation Quality Control
use a system to check whether all pre-established TEs are used
  – sentence-aligned parallel corpus
  – glossary entries marked in
      – original text
      – translated text
if the TE of a glossary entry found in the original has not been found in the aligned translated sentence → the translation is departing from the pre-established TE
e.g.:
  – Eurovoc: (en) President of the Commission = (hr) Predsjednik Komisije
  – Corpus:
      (en) … if the President of the Commission declares …
      (hr 1) … ako Predsjednik Komisije objavi …
      (hr 2) … ako Predsjednik objavi …
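The check described on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' system: it uses plain case-insensitive substring matching, the function name `departures` and the one-entry `GLOSSARY` dictionary are hypothetical, and the only data is the example pair from the slide.

```python
# Hypothetical one-entry glossary: English entry -> pre-established Croatian TE.
# The pair itself is the example given on the slide.
GLOSSARY = {"President of the Commission": "Predsjednik Komisije"}


def departures(aligned_pairs, glossary=GLOSSARY):
    """Return (sentence_index, en_entry, expected_te) for every glossary entry
    found in the English sentence whose TE is missing from the aligned
    Croatian sentence. Matching is naive substring search (an assumption)."""
    found = []
    for i, (en, hr) in enumerate(aligned_pairs):
        for entry, te in glossary.items():
            if entry.lower() in en.lower() and te.lower() not in hr.lower():
                found.append((i, entry, te))
    return found


pairs = [
    ("... if the President of the Commission declares ...",
     "... ako Predsjednik Komisije objavi ..."),   # hr 1: TE used
    ("... if the President of the Commission declares ...",
     "... ako Predsjednik objavi ..."),            # hr 2: TE shortened
]
print(departures(pairs))
```

Only the second pair is reported, since its translation shortens the TE to "Predsjednik". Exact string matching of this kind breaks down as soon as the terms are inflected.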

Resources
our lexicon: Eurovoc 4.1
  – documentational indexing glossary
  – ca 6000 entries (descriptors) covering topics found in EU legal texts
  – accompanied by non-descriptors (synonyms)
  – translated to Croatian in 2000 – Croatian specific descriptors
  – translation always 1:1
  – combination of nouns, adjectives, prepositions, conjunctions
our corpus: 9 documents from AC corpus and their translations from MEI
  – size: tokens (en), tokens (hr)
  – Croatian translations converted to AC corpus XML format

Method
simple glossary look-up?
problem of inflection
  – English at least: sg, pl, ’s
  – Croatian
      – 7 cases × 2 numbers for nouns
      – 7 cases × 2 numbers × 3 genders × 2 definiteness × 3 comparison for adjectives
lemmatization of corpus or glossary?
Eurovoc lemmatized and converted to FSA: Intex
  – English lemmatizer from Intex for English Eurovoc
  – Croatian Lemmatization Server (hml.ffzg.hr) for Croatian Eurovoc
  – FSA with states
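To make the inflection problem concrete, the sketch below multiplies out the category counts listed on the slide and then matches a glossary entry against an inflected sentence by comparing lemma sequences. It is a toy stand-in, not the Intex FSA or the Croatian Lemmatization Server: the `LEMMAS` table, `lemmatize` and `contains_entry` are hypothetical names, and the table covers only the forms used in the example.

```python
from math import prod

# Form counts implied by the slide's category lists. These are upper bounds,
# since many inflected forms coincide in practice.
NOUN_FORMS = prod([7, 2])            # cases x numbers
ADJ_FORMS = prod([7, 2, 3, 2, 3])    # cases x numbers x genders x definiteness x comparison

# Hypothetical toy lemma table standing in for a real lemmatizer.
LEMMAS = {
    "predsjednika": "predsjednik",   # genitive/accusative sg
    "predsjedniku": "predsjednik",   # dative/locative sg
    "komisije": "komisija",          # genitive sg
}


def lemmatize(text):
    """Lowercase, split on whitespace, map each known form to its lemma."""
    return [LEMMAS.get(tok, tok) for tok in text.lower().split()]


def contains_entry(sentence, entry):
    """True if the entry's lemma sequence occurs contiguously in the sentence."""
    s, e = lemmatize(sentence), lemmatize(entry)
    return any(s[i:i + len(e)] == e for i in range(len(s) - len(e) + 1))


print(NOUN_FORMS, ADJ_FORMS)
print(contains_entry("odluka Predsjednika Komisije", "Predsjednik Komisije"))
```

Here both the glossary entry and the corpus sentence are reduced to lemmas before comparison, which is only one of the two options the slide raises. Punctuation handling and ambiguous forms are left out of the sketch.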

Eurovoc as FSA

Method 2
glossary entries marked in corpus together with IDs
  – checking whether the same ID appears on both sides of the alignment (Perl script)
statistics (en, hr): with matched IDs 803 (60.47%)
matched entries are also word/phrase-aligned parts
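The ID comparison can be illustrated with a short Python sketch. The slides mention a Perl script, which is not reproduced here. The `<term id="...">` markup, the ID value `e1` and the function names are assumptions made for illustration and do not reflect the actual AC corpus XML format.

```python
import re

# Hypothetical markup for marked glossary entries; the real corpus format
# is not shown in the slides.
ID_RE = re.compile(r'<term id="([^"]+)">')


def compare_ids(en_sentence, hr_sentence):
    """Return (IDs present on both sides, IDs present only on the English side)."""
    en_ids = set(ID_RE.findall(en_sentence))
    hr_ids = set(ID_RE.findall(hr_sentence))
    return en_ids & hr_ids, en_ids - hr_ids


def match_rate(aligned_pairs):
    """Share of English-side IDs that reappear in the aligned Croatian sentence."""
    matched = total = 0
    for en, hr in aligned_pairs:
        both, missing = compare_ids(en, hr)
        matched += len(both)
        total += len(both) + len(missing)
    return matched / total if total else 0.0


en = '... if the <term id="e1">President of the Commission</term> declares ...'
hr_1 = '... ako <term id="e1">Predsjednik Komisije</term> objavi ...'
hr_2 = '... ako Predsjednik objavi ...'

print(compare_ids(en, hr_2))
print(match_rate([(en, hr_1), (en, hr_2)]))
```

An ID that appears only on the English side marks a sentence where the translation may depart from the pre-established TE. The overall rate is the same kind of figure as the matched-ID percentage reported on the slide.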

Drawbacks
syntactic merging
  – abbreviations not matched / marked (e.g. EP delegation vs. European Parliament delegation)
  – merged terms not matched / marked (e.g. head of State, head of government vs. heads of State or Government)
EUROVOC = glossary intended for indexing
  – a lot of real terms (MWU) not matched / marked (e.g. country candidate to EU accession, Stabilisation and Association Agreement) → they don’t exist as entries
  – no semantic processing → polysemous terms wrongly matched / marked (e.g. ... which might lead (olovo) to a common defence ..., where the verb "lead" is matched to the descriptor for the metal, hr olovo)
Intex English lemmatizer didn’t cover all Eurovoc entries

Further directions
evaluation of matched pairs regarding
  – single-word units
  – multi-word units
improving the Intex English lemmatizer / lexicon
use Eurovoc non-descriptors as synonyms
  – to capture a wider departure from the expected TE in translation more precisely
use / include other glossaries
  – EU Law Glossary
test the whole system on a larger corpus
use it with other languages
