Enhancing Translation Systems with Bilingual Concordancing Functionalities
V. Antonopoulos, C. Malavazos, I. Triantafyllou, S. Piperidis
Presentation: V. Antonopoulos
Institute for Language and Speech Processing
Workshop on Balkan Language Resources & Tools
Current Framework
- Increasing demand for multilinguality and for translation
- Current translation systems still fail to fully meet translation needs
- Language transfer is still the prevailing problem
- Need for further development of existing systems:
  1. Integration of technologies (TM & MT)
  2. Intelligent tools
Proposed Method
- Expands the transfer selection capabilities
- Utilizes sub-sentential information
- Performs well with a limited amount of parallel data (Translation Memories)
- Feasible for run-time applications
- Statistically overcomes the translation unit (TU) identification barrier
Method Basics
- Extracts sub-sentential bilingual correspondences
- Statistical approach
- Only prerequisite: a parallel corpus
- Automatic translation unit identification
- Two-level iterative method:
  - Incrementally constructed translation
  - Continuously extended source segments
- Employs target language correspondence information
Core Engine Description
[Diagram: Parallel Text Database holding source sentences (SSent-1, SSent-2, ..., SSent-N) and target sentences (TSent-1, TSent-2, ..., TSent-N); FW Filtering; target words TW-1, TW-2, ..., TW-k with irrelevant words marked; component labels CDE, CE, CTW, SS, TS]
1st-Level Iterations
Incremental translation construction:
- Employs the DICE coefficient as similarity measure
- Adds one word from the CTW set in every new iteration
- Stores translations scoring above the threshold during an iteration
- Terminates when no new translation is added
- Selects the best translation based on similarity score and length
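The incremental construction described on this slide can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the presenters' implementation: the function names (`dice`, `first_level`, `select_best`), the sentence-level co-occurrence counting, the fixed threshold of 0.7, the toy corpus, and the "longer first, then higher score" ranking are all hypothetical choices made for the example.

```python
from typing import Dict, FrozenSet, List, Optional, Tuple

SentencePair = Tuple[List[str], List[str]]  # (source tokens, target tokens)


def contains_phrase(tokens: List[str], phrase: List[str]) -> bool:
    """True if `phrase` occurs as a contiguous run inside `tokens`."""
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))


def dice(corpus: List[SentencePair], source: List[str],
         target_words: FrozenSet[str]) -> float:
    """Dice coefficient 2*f(S,T) / (f(S) + f(T)) over sentence pairs.

    Assumption: a target candidate "occurs" in a sentence when all of its
    words are present, regardless of their order.
    """
    src = tgt = both = 0
    for s_tokens, t_tokens in corpus:
        has_s = contains_phrase(s_tokens, source)
        has_t = target_words <= set(t_tokens)
        src += has_s
        tgt += has_t
        both += has_s and has_t
    return 2 * both / (src + tgt) if src + tgt else 0.0


def first_level(corpus: List[SentencePair], source: List[str],
                ctw: List[str], threshold: float = 0.7
                ) -> Dict[FrozenSet[str], float]:
    """Grow candidate translations by one CTW word per iteration."""
    accepted: Dict[FrozenSet[str], float] = {}
    frontier: List[FrozenSet[str]] = [frozenset()]
    while frontier:  # stops when an iteration adds no new translation
        next_frontier: List[FrozenSet[str]] = []
        for candidate in frontier:
            for word in ctw:
                extended = candidate | {word}
                if word in candidate or extended in accepted:
                    continue
                score = dice(corpus, source, extended)
                if score >= threshold:
                    accepted[extended] = score
                    next_frontier.append(extended)
        frontier = next_frontier
    return accepted


def select_best(accepted: Dict[FrozenSet[str], float]
                ) -> Optional[Tuple[FrozenSet[str], float]]:
    """Assumed ranking: longer candidates first, then higher score."""
    if not accepted:
        return None
    return max(accepted.items(), key=lambda kv: (len(kv[0]), kv[1]))


# Toy data invented for illustration only.
corpus = [
    (["η", "ηλεκτρονική", "αυτόματη", "μετάδοση"],
     ["the", "electronic", "automatic", "transmission"]),
    (["ηλεκτρονική", "αυτόματη", "μετάδοση", "EAT"],
     ["electronic", "automatic", "transmission", "EAT"]),
    (["η", "μετάδοση"], ["the", "transmission"]),
    (["αυτόματη", "λειτουργία"], ["automatic", "mode"]),
]
source = ["ηλεκτρονική", "αυτόματη", "μετάδοση"]
ctw = ["electronic", "automatic", "transmission", "EAT", "the"]

accepted = first_level(corpus, source, ctw)
best_words, best_score = select_best(accepted)
print(sorted(best_words), best_score)
```

A fixed threshold is used here only to keep the sketch short; the deck itself mentions a variable cut-off threshold, whose exact form is not given.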
1st-Level Iterations Example
Source phrase: ηλεκτρονική αυτόματη μετάδοση
- Iteration 1: electronic; automatic; transmission
- Iteration 2: electronic automatic; automatic transmission
- Iteration 3: electronic automatic transmission
Other candidate words shown on the slide: ECU, refer, EAT
Translation Synthesis Example
Source phrase: ηλεκτρονική αυτόματη μετάδοση
- Iteration 1: electronic; automatic; transmission
- Iteration 2: electronic automatic; automatic transmission
- Iteration 3: electronic automatic transmission
Other candidate words shown on the slide: ECU, refer, EAT
Selection criteria: a) length, b) score
Selected translation: electronic automatic transmission
2nd-Level Iterations
Aims of this 2nd-level process:
- Improve the accuracy of the translation outcome
- Automatic translation unit identification
- Efficient integration in a Translation Memory framework
2nd-Level Iterations
Employ the “Sequence Window Variety” technique:
- Try to determine the best “cover” of an input text by examining the translation outcome of length-varying source segments
- Initiate the procedure from the smallest segments (1-word segments)
- Continuously extend the input source segments
- Shift the observation window from left to right over the source segments
- Store acceptable translations along with their score during every iteration
- Use a combinatorial process to compute the optimal set of candidate source units that provides the best “cover”
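The window procedure and the final "cover" computation can be illustrated with the Python sketch below. It is an assumption-laden illustration rather than the method as implemented: `translate` is a hypothetical stand-in for the 1st-level engine, the lookup table is invented, and the objective (maximize the sum of score times segment length over non-overlapping segments, leaving untranslatable words uncovered) is one plausible reading of "best cover", solved here with dynamic programming.

```python
from typing import Callable, Dict, List, Optional, Tuple

Span = Tuple[int, int]            # [start, end) token offsets
Result = Tuple[str, float]        # (translation, score)


def collect_candidates(tokens: List[str],
                       translate: Callable[[List[str]], Optional[Result]]
                       ) -> Dict[Span, Result]:
    """Try every source segment, shortest first, window moving left to right."""
    found: Dict[Span, Result] = {}
    n = len(tokens)
    for length in range(1, n + 1):            # start from 1-word segments
        for start in range(0, n - length + 1):  # shift window left to right
            result = translate(tokens[start:start + length])
            if result is not None:            # keep acceptable translations
                found[(start, start + length)] = result
    return found


def best_cover(n: int, found: Dict[Span, Result]) -> List[Span]:
    """Pick non-overlapping spans maximizing sum(score * span length).

    best[i] holds the best (total, spans) for the first i tokens; a token
    with no usable span is simply left uncovered at no gain.
    """
    best: List[Tuple[float, List[Span]]] = [(0.0, [])]
    for end in range(1, n + 1):
        choice = best[end - 1]                # leave token end-1 uncovered
        for (start, stop), (_, score) in found.items():
            if stop != end:
                continue
            total = best[start][0] + score * (stop - start)
            if total > choice[0]:
                choice = (total, best[start][1] + [(start, stop)])
        best.append(choice)
    return best[n][1]


# Invented lookup table standing in for the 1st-level engine.
table = {
    ("A", "B"): ("ab", 0.7),
    ("D", "E"): ("de", 0.8),
    ("F",): ("f", 0.6),
    ("D", "E", "F"): ("def", 0.9),
}
tokens = "A B C D E F G H".split()
found = collect_candidates(tokens, lambda seg: table.get(tuple(seg)))
cover = best_cover(len(tokens), found)
print(cover)
```

Weighting the score by segment length is a design choice of this sketch so that one long, reliable unit is preferred over several short ones; other objectives would fit the slide's description equally well.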
2nd-Level Iterations Example
Source sentence: A B C D E F G H
Input phrase: D E
Iteration labels on the slide: 0, 0-a, 0-b, 0-c, 1-a, 1-b, 2-a, 2-b, 2-c
Source segments examined, in the order shown: D E; C D E; D E F; B C D E; C D E F; D E F G
Translation Synthesis Example (1)
Source phrase: ηλεκτρονική αυτόματη μετάδοση
Source segments: ηλεκτρονική | αυτόματη μετάδοση
Candidate translations shown on the slide: electronic; permission; traction; electronic automatic; automatic transmission; ETC; force; EAT; ECU
Synthesis: electronic & automatic transmission
Selection criteria: a) length, b) score
Translation Synthesis Example (2)
Source phrase: ασφαλειοθήκη χώρου επιβατών
Source segments: ασφαλειοθήκη | χώρου επιβατών
Candidate translations shown on the slide: fuse; passenger; passenger compartment; fuse box; switch; relay; ignition; compartment fuse box
Synthesis: fuse box & passenger compartment
Selection criteria: a) length, b) score
Significant Technical Aspects
- N-gram based conflation method to enhance the existing statistical evidence (overcomes the limitations that morphologically rich languages introduce)
- Variable cut-off threshold (avoids rejecting translation parts at an early stage of the algorithm)
- Specific word order is not taken into account (enhances statistical evidence in small bilingual corpora)
- Contiguity requirement (ensures translation accuracy)
Evaluation
Evaluation set: 350 input text fragments (80% noun phrases, 20% verb phrases), manually extracted from an automotive bilingual parallel corpus (3.100 EN words, EL words)

                Static Window   Flexible Window
Correct         75%             83%
Second Match    8%              6%
Errors          17%             11%
Future Work
- Apply to comparable bilingual corpora
- Exploit linguistic information when available
- Explore ways of integrating in a Machine Translation & Translation Memory framework
Integration in MT & TM Framework
[Diagram: components TM, Statistical Processing and Machine Translation; source sentence A B C D E F G H with segment D E F; Part 1, Part 2 and Part 3; Target Sentence]
Why DICE
- Although the constituent words may have multiple senses, the identified TUs appear to have a unique translation
  - “current”: a) present, existing; b) electricity (alternating ~)
  - “current flows across”: a) ρεύμα περνά (1 meaning)
- Better measure of similarity than MI and specific MI (log-likelihood ratio): 1-1 and 1-0 matches are significant, 0-0 matches are not
- Good measures of independence are not necessarily good measures of similarity
- In practice, DICE works better
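The point about 0-0 matches can be made concrete with a small Python sketch over a 2x2 contingency table. The cell names and the counts are invented for illustration: `a` is the number of sentence pairs where both units occur (1-1), `b` and `c` are the pairs where only one of them occurs (1-0 and 0-1), and `d` is the number of pairs where neither occurs (0-0). Specific mutual information is taken here to be the usual pointwise form, log2 of P(x,y) / (P(x) P(y)), which is an assumption about what the slide means.

```python
import math


def dice_coefficient(a: int, b: int, c: int) -> float:
    """Dice = 2a / (2a + b + c); the 0-0 cell d does not appear at all."""
    return 2 * a / (2 * a + b + c)


def specific_mi(a: int, b: int, c: int, d: int) -> float:
    """Pointwise MI = log2( P(x,y) / (P(x) P(y)) ); depends on d via N."""
    n = a + b + c + d
    return math.log2(a * n / ((a + b) * (a + c)))


# Same co-occurrence evidence, embedded in a small and a large corpus.
a, b, c = 8, 2, 2
for d in (88, 988):
    print(d, dice_coefficient(a, b, c), round(specific_mi(a, b, c, d), 3))
```

With identical 1-1 and 1-0 evidence, the Dice score stays the same while the mutual information value grows with the number of 0-0 pairs, which is the behaviour the slide argues against for a similarity measure.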
Corpus Size
- Automotive industry bilingual corpus (EN-EL)
- sentences in each language
- EN words – EL words
Champollion Approach
- Tested on 2 different parts of the Hansard corpus (Canadian Parliament): 3.5 million & 8.5 million words
- 65%-75% accuracy was reported for the 3 evaluation sets
- Proposed increasing the database corpus for better results
Conflation Method
- N-gram method
- Soft clustering of words
- >98% accuracy (evaluated using the first 1000 entries of the ILSP morphological lexicon)
- Works well even with small words
- Most significant factor was the performance, so emphasis was placed on recall
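One common way to realize n-gram conflation is to compare the character n-gram sets of two word forms; the Python sketch below does this with bigrams and a Dice-style overlap. The slide does not give the exact formulation, so the bigram size, the `#` boundary padding, the 0.6 threshold and the function names are assumptions made for illustration. "Soft clustering" is read here as: each word gets its own set of similar forms, and these sets may overlap instead of forming a strict partition.

```python
from typing import Dict, List, Set


def char_ngrams(word: str, n: int = 2) -> Set[str]:
    """Set of character n-grams, with '#' marking the word boundaries."""
    padded = f"#{word}#"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(w1: str, w2: str, n: int = 2) -> float:
    """Dice-style overlap of the two words' n-gram sets (0.0 to 1.0)."""
    g1, g2 = char_ngrams(w1, n), char_ngrams(w2, n)
    return 2 * len(g1 & g2) / (len(g1) + len(g2))


def conflate(vocabulary: List[str],
             threshold: float = 0.6) -> Dict[str, Set[str]]:
    """Map each word to the other forms whose similarity passes the threshold."""
    return {
        w: {v for v in vocabulary
            if v != w and ngram_similarity(w, v) >= threshold}
        for w in vocabulary
    }


# Invented vocabulary for illustration.
vocabulary = ["transmission", "transmissions", "fuse", "fuses"]
clusters = conflate(vocabulary)
for word, similar in clusters.items():
    print(word, sorted(similar))
```

Lowering the threshold favours recall (more forms are grouped together) at the cost of precision, which matches the emphasis on recall stated on the slide.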
Conflation Methods
- Interactive
- Automatic
  - Suffix removal
    - Longest Match
    - Simple Removal
  - Statistical
  - Table-based
  - N-grams