Enhancing Translation Systems with Bilingual Concordancing Functionalities
V. ANTONOPOULOS, C. MALAVAZOS, I. TRIANTAFYLLOU, S. PIPERIDIS
Presentation: V. Antonopoulos

Presentation transcript:

Enhancing Translation Systems with Bilingual Concordancing Functionalities
V. ANTONOPOULOS, C. MALAVAZOS, I. TRIANTAFYLLOU, S. PIPERIDIS
Presentation: V. Antonopoulos
Institute for Language and Speech Processing
Workshop on Balkan Language Resources & Tools

Current Framework
- Increasing demand for multilinguality and for translation
- Current translation systems still fail to completely meet translation needs
- Language transfer is still the prevailing problem
- Need for further development of existing systems:
  1. Integration of technologies (TM & MT)
  2. Intelligent tools

Proposed Method
- Expands the transfer selection capabilities
- Utilizes sub-sentential information
- Performs well when dealing with a limited amount of parallel data (Translation Memories)
- Feasible for use in run-time applications
- Statistically overcomes the translation unit (TU) identification barrier

Method Basics
- Extracts sub-sentential bilingual correspondences
- Statistical approach
- The only prerequisite is a parallel corpus
- Automatic translation unit identification
- Two-level iterative method:
  - Incrementally constructed translation
  - Continuously extended source segments
- Employs target language correspondence information

Core Engine Description
[Figure: a Parallel Text Database of source sentences (SSent-1 ... SSent-N) and target sentences (TSent-1 ... TSent-N); FW Filtering removes irrelevant words and yields target words TW-1 ... TW-k. Component labels in the figure: CDE, CE, CTW, SS, TS.]

1st-Level Iterations
Incremental translation construction:
- Employs the DICE coefficient as similarity measure
- Adds one word from the CTW set in every new iteration
- Stores translations above threshold during an iteration
- Terminates when no new translation is added
- Selects the best translation based on similarity score and length
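The incremental construction listed above could be sketched roughly as follows. This is only a minimal illustration under several assumptions that the slides do not confirm: co-occurrence is counted per sentence pair, word order is ignored, the threshold is a fixed value, and "best" means longest first, then highest score. The names `dice`, `build_translation`, and the `threshold` parameter are hypothetical.

```python
from typing import Dict, FrozenSet, List, Optional, Tuple

Pair = Tuple[List[str], List[str]]  # (source tokens, target tokens)


def dice(source: FrozenSet[str], target: FrozenSet[str], corpus: List[Pair]) -> float:
    """Dice coefficient 2*f(S,T) / (f(S) + f(T)), counted over sentence pairs."""
    f_s = f_t = f_st = 0
    for src, tgt in corpus:
        has_s = source <= set(src)   # all source words occur in the source sentence
        has_t = target <= set(tgt)   # all target words occur in the target sentence
        f_s += has_s
        f_t += has_t
        f_st += has_s and has_t
    return 2.0 * f_st / (f_s + f_t) if (f_s + f_t) else 0.0


def build_translation(
    source_words: List[str],
    candidates: List[str],           # stands in for the CTW set
    corpus: List[Pair],
    threshold: float = 0.8,          # assumed fixed cut-off
) -> Optional[Tuple[FrozenSet[str], float]]:
    source = frozenset(source_words)
    kept: Dict[FrozenSet[str], float] = {}
    # Iteration 1: single candidate words above the threshold.
    frontier = {frozenset([w]) for w in candidates}
    frontier = {t for t in frontier if dice(source, t, corpus) >= threshold}
    while frontier:                  # stop when an iteration adds nothing new
        for t in frontier:
            kept[t] = dice(source, t, corpus)
        next_frontier = set()
        for t in frontier:
            for w in candidates:     # extend each stored translation by one word
                ext = t | {w}
                if w in t or ext in kept:
                    continue
                if dice(source, ext, corpus) >= threshold:
                    next_frontier.add(ext)
        frontier = next_frontier
    if not kept:
        return None
    # Assumed selection rule: prefer longer translations, then higher score.
    best = max(kept, key=lambda t: (len(t), kept[t]))
    return best, kept[best]
```

On a toy corpus in which "electronic", "automatic" and "transmission" always co-occur with the Greek phrase while "EAT" and "refer" appear only occasionally, the sketch keeps the three-word translation and drops the rest.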

1st-Level Iterations Example
[Figure: incremental construction of the translation of "ηλεκτρονική αυτόματη μετάδοση". Iteration 1: electronic, automatic, transmission. Iteration 2: electronic automatic, automatic transmission. Iteration 3: electronic automatic transmission. Other candidate words shown: ECU, refer, EAT.]

Translation Synthesis Example
[Figure: synthesis of the translation of "ηλεκτρονική αυτόματη μετάδοση". Iteration 1: electronic, automatic, transmission. Iteration 2: electronic automatic, automatic transmission. Iteration 3: electronic automatic transmission. The final translation "electronic automatic transmission" is selected by a) length and b) score. Other candidate words shown: ECU, refer, EAT.]

2nd-Level Iterations
Aims of this 2nd-level process:
- Improve accuracy of translation outcome
- Automatic translation unit identification
- Efficient integration in a Translation Memory framework

2nd-Level Iterations
Employ the "Sequence Window Variety" technique:
- Try to determine the best "cover" of an input text by examining the translation outcome of length-varying source segments
- Initiate the procedure from the smallest segments (1-word segments)
- Continuously extend the input source segments
- Shift the observation window from left to right for source segments
- Store acceptable translations along with their score during every iteration
- Combinatorial process for computing the optimal set of candidate source units that provides the best "cover"
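The final combinatorial step, choosing the set of source units that gives the best "cover", could be approximated with a small dynamic program. The slides do not state the objective that is optimized, so this sketch assumes each stored segment contributes its score multiplied by its length, that segments may not overlap, and that uncovered words contribute nothing. The function name `best_cover` and the span representation are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Span = Tuple[int, int]  # half-open token range [start, end)


def best_cover(n: int, scored: Dict[Span, float]) -> Tuple[float, List[Span]]:
    """Pick non-overlapping scored spans over n tokens with maximal total value."""
    best: List[float] = [0.0] * (n + 1)          # best[i]: best value for tokens [0, i)
    back: List[Optional[Span]] = [None] * (n + 1)
    for end in range(1, n + 1):
        # Default: leave token end-1 uncovered.
        best[end] = best[end - 1]
        back[end] = None
        for start in range(end):
            score = scored.get((start, end))
            if score is None:
                continue
            value = best[start] + score * (end - start)  # assumed length weighting
            if value > best[end]:
                best[end] = value
                back[end] = (start, end)
    # Walk the back-pointers to recover the chosen spans, left to right.
    spans: List[Span] = []
    pos = n
    while pos > 0:
        span = back[pos]
        if span is None:
            pos -= 1
        else:
            spans.append(span)
            pos = span[0]
    spans.reverse()
    return best[n], spans
```

With made-up scores for a three-word phrase in which the first word scores well alone and the last two words score well as a unit, the sketch returns the split [(0, 1), (1, 3)].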

2nd-Level Iterations Example
Source sentence: A B C D E F G H
Input phrase: D E

Iteration        Source segment
Iteration 0      D E
Iteration 1-a    C D E
Iteration 1-b    D E F
Iteration 2-a    B C D E
Iteration 2-b    C D E F
Iteration 2-c    D E F G
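The way the segment around the input phrase D E grows in this example, one extra word per iteration with the window shifted from left to right, could be enumerated as in the sketch below. The function name `context_windows` and its parameters are hypothetical, and the grouping into iterations is an assumption read off the example.

```python
from typing import List


def context_windows(
    sentence: List[str], start: int, end: int, max_extra: int
) -> List[List[List[str]]]:
    """For each iteration k, list every window of length (end - start + k)
    that lies inside the sentence and still contains sentence[start:end]."""
    n = len(sentence)
    base = end - start
    iterations: List[List[List[str]]] = []
    for extra in range(max_extra + 1):
        length = base + extra
        row: List[List[str]] = []
        # Slide the window from the leftmost valid position to the right.
        for left in range(start - extra, start + 1):
            right = left + length
            if left >= 0 and right <= n:
                row.append(sentence[left:right])
        iterations.append(row)
    return iterations


if __name__ == "__main__":
    words = "A B C D E F G H".split()
    for k, row in enumerate(context_windows(words, 3, 5, 2)):
        print(k, [" ".join(w) for w in row])
```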

Translation Synthesis Example (1)
[Figure: the source phrase "ηλεκτρονική αυτόματη μετάδοση" is split into "ηλεκτρονική" and "αυτόματη μετάδοση". Candidate words shown include electronic, permission, traction, automatic, transmission, ETC, force, EAT, ECU. Result, selected by a) length and b) score: electronic & automatic transmission.]

Translation Synthesis Example (2)
[Figure: the source phrase "ασφαλειοθήκη χώρου επιβατών" is split into "ασφαλειοθήκη" and "χώρου επιβατών". Candidate words shown include fuse, box, switch, passenger, compartment, relay, ignition. Result, selected by a) length and b) score: fuse box & passenger compartment.]

Significant Technical Aspects
- N-gram based conflation method for enhancing the existing statistical evidence (overcomes the limitations that morphologically rich languages introduce)
- Variable cut-off threshold (avoids rejecting translation parts at an early stage of the algorithm)
- Specific word order not taken into account (enhances statistical evidence in small bilingual corpora)
- Contiguity requirement (ensures translation accuracy)

Evaluation
Evaluation set: 350 input text fragments (80% noun phrases, 20% verb phrases) manually extracted from an automotive bilingual parallel corpus (3.100 EN words, EL words)

                Static Window   Flexible Window
Correct         75%             83%
Second Match    8%              6%
Errors          17%             11%

Future Work
- Apply in comparable bilingual corpora
- Exploit linguistic information when available
- Explore ways of integrating in a Machine Translation & Translation Memory framework

Integration in MT & TM Framework
[Figure: a source sentence A B C D E F G H passes through the TM, Statistical Processing and Machine Translation components; the segment D E F is highlighted, and Part 1, Part 2 and Part 3 combine into the Target Sentence.]

Why DICE
- Although the constituent words may have multiple senses, the identified TUs appear to have a unique translation:
  - "current": a) present, existing b) electricity (alternating ~)
  - "current flows across": a) ρεύμα περνά (1 meaning)
- Better measure of similarity than MI and specific MI (log-likelihood ratio): 1-1 and 1-0 matches are significant, 0-0 matches are not
- Good measures of independence are not necessarily good measures of similarity
- In practice, DICE works better
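The remark about 1-1, 1-0 and 0-0 matches can be made concrete with the standard textbook definitions, which are assumed here since the slides give no formulas. Let $a$ be the number of sentence pairs containing both the source unit $x$ and the target unit $y$ (1-1), $b$ and $c$ the pairs containing only one of them (1-0 and 0-1), $d$ the pairs containing neither (0-0), and $N = a + b + c + d$.

```latex
\[
\mathrm{Dice}(x, y) = \frac{2\, f(x, y)}{f(x) + f(y)} = \frac{2a}{2a + b + c}
\]
\[
\mathrm{MI}(x, y) = \log_2 \frac{P(x, y)}{P(x)\, P(y)}
                  = \log_2 \frac{a\, N}{(a + b)(a + c)}
\]
```

Under these definitions the 0-0 count $d$ never enters the Dice coefficient, whereas it enters MI through $N$, so adding sentence pairs that contain neither unit raises MI but leaves Dice unchanged.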

Corpus Size
- Automotive industry bilingual corpus (EN-EL)
- sentences in each language
- EN words – EL words

Champollion Approach
- Tested on 2 different parts of the Hansard corpus (Canadian Parliament): 3.5 million & 8.5 million words
- 65% - 75% accuracy was reported for the 3 evaluation sets
- Proposed increasing the database corpus for better results

Conflation Method
- N-gram method
- Soft clustering of words
- >98% accuracy (evaluated using the first 1000 entries of the ILSP morphological lexicon)
- Works well even with small words
- Most significant factor was the performance, so emphasis was placed on recall
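One common way to realize n-gram based conflation is to compare words by the character n-grams they share. The sketch below assumes character bigrams, a Dice-style overlap score and a fixed threshold; none of these choices is confirmed by the slides, and the names `char_ngrams`, `ngram_similarity` and `conflate` are hypothetical.

```python
from typing import List, Set


def char_ngrams(word: str, n: int = 2) -> Set[str]:
    """Set of character n-grams; a word shorter than n is its own single gram."""
    word = word.lower()
    grams = {word[i:i + n] for i in range(len(word) - n + 1)}
    return grams or {word}


def ngram_similarity(a: str, b: str, n: int = 2) -> float:
    """Dice-style overlap of the two n-gram sets, in the range [0, 1]."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    return 2.0 * len(ga & gb) / (len(ga) + len(gb))


def conflate(word: str, vocabulary: List[str], threshold: float = 0.6, n: int = 2) -> List[str]:
    """Vocabulary entries similar enough to `word` to be treated as variants."""
    return [v for v in vocabulary if ngram_similarity(word, v, n) >= threshold]


if __name__ == "__main__":
    print(conflate("transmission", ["transmissions", "transmit", "traction", "fuse"]))
    print(ngram_similarity("μετάδοση", "μετάδοσης"))
```

Because membership is decided per query word against a threshold, one word can fall into the variant lists of several other words, which is one possible reading of "soft clustering".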

Conflation Methods
[Figure: taxonomy of conflation methods]
- Interactive
- Automatic
  - Suffix removal
    - Longest Match
    - Simple Removal
  - Statistical
  - Table-based
  - N-grams