Published by Prudence Dickerson. Modified over 9 years ago.
1
Intelligent Database Systems Lab
國立雲林科技大學 National Yunlin University of Science and Technology
Department of Information Management
Advisor: Dr. Hsu
Student: Sheng-Hsuan Wang
A Technical Word and Term Translation Aid using Noisy Parallel Corpora across Language Groups
Pascale Fung, Kathleen McKeown
Machine Translation, 1997
2
Outline
─ Motivation
─ Objective
─ Introduction
─ Related work
─ Noisy parallel corpora across language groups
─ Algorithm overview
─ Experiments
─ Conclusion
3
Motivation
Technical term translation is a difficult task
─ It depends on translator quality and on domain-specific terminology.
─ Technical terms are not adequately covered by printed dictionaries.
─ Terms from noisy parallel corpora are especially difficult.
Examples:
─ Hong Kong Governor / 香港總督
─ Basic Law / 基本法
─ Green Paper / 綠皮書
4
Objective
This paper describes an algorithm for translating technical words and terms from noisy parallel corpora across language groups.
5
1. Introduction
Technical terms
─ Often cannot be translated word by word.
─ The individual words of a term may each have many possible translations.
─ Example: Governor can be 總督, 主管 (top manager), 總裁 (chief), or 州長 (of a State), but Hong Kong Governor is 香港總督.
Domain-specific terms
─ Basic Law / 基本法
─ Green Paper / 綠皮書
6
1. Introduction
An algorithm for translating technical terms, given a noisy parallel corpus as input
─ Notion: corresponding words will not occur at exactly the same position in each half of the corpus, but the distances between instances of the same word will be similar across languages.
─ Method: find word correlations, then build technical term translations.
─ Dynamic time warping (DTW) algorithm.
─ Reliable anchor points.
7
2. Related work
─ Sentence alignment
─ Segment alignment
─ Word and term translation
─ Word alignment
─ Phrase translation
8
2.1. Sentence alignment
Two main approaches
─ Text-based: uses lexical information (a dictionary). Paired lexical indicators across the languages are used to find matching sentences.
─ Length-based: uses the total number of characters (or words). It assumes that translated sentences in the parallel corpus are of approximately the same, or constantly related, length.
9
2.2. Segment alignment
Church (1993) shows that a text can be aligned by using delimiters.
Segment alignment is more appropriate for aligning noisy corpora.
The problem is finding reliable anchor points that can be used for Asian/Romance language pairs.
10
2.3. Word and term translation
Some algorithms used for alignment also produce a small bilingual lexicon.
Others use sentence-aligned parallel text.
Most of the following algorithms require clean, sentence-aligned parallel text as input.
11
2.4. Word alignment
─ [Brown et al. 1990, Brown et al. 1993]
─ [Gale & Church 1991]
─ [Dagan et al. 1993]
─ [Wu & Xia 1994]
Various filtering techniques are used to improve the matching.
12
2.5. Phrase translation
─ [Kupiec 1993]
─ [Smadja & McKeown 1993]
─ [Dagan & Church 1994]
All the work described in this section assumes a clean, parallel corpus as input.
13
3. Noisy parallel corpora across language groups
Previous approaches lack robustness
─ Against structural noise in parallel corpora.
─ Against language pairs that do not share etymological roots.
Remaining problems
─ Bilingual texts that are translations of each other but are not translated sentence by sentence.
─ Language robustness.
14
3. Noisy parallel corpora across language groups
Two noisy parallel corpora
─ English version of the AWK manual and its Japanese translation.
─ Parts of the HKUST English-Chinese Bilingual Corpora.
15
4. Algorithm overview
Treat the domain word translation problem as a pattern matching problem
─ Each word shares some common features with its counterpart in the translated text.
─ The goal is to find the best representations of these features and the best ways to match them.
16
Algorithm overview (flowchart)
Input: corpus (English and Chinese halves)
─ Steps 1 to 4: Tag English word list. Tokenize Japanese and Chinese texts, and form a word list.
─ Step 5: Compile non-linear segment boundaries with high frequency word pairs.
─ Step 6: Compile bilingual word lexicon.
─ Step 7: Suggest a word list for each technical term to the translator.
Stages
─ 1. Primary lexicon
─ 2. Anchor points for alignment
─ 3. Align the text
─ 4. Secondary lexicon
17
5. Extracting technical terms from English text
To find domain-specific terms, we tagged the English part of the corpus with a modified POS tagger
─ Noun phrases that are most likely to be technical terms are extracted.
─ Translations are sought only for words that are part of these terms.
18
6. Tokenization of Chinese and Japanese texts
The Chinese text is tokenized with a statistically augmented, dictionary-based tokenizer that is able to recognize frequent domain words.
─ Example: 基本法 / Basic Law
The Japanese text is tokenized by JUMAN without domain word augmentation.
19
7. A rough word pair based alignment
Treat translation as a pattern matching task.
The task is to find a representation and a similarity measure that can identify word pairs to serve as anchor points.
20
7.1. Dynamic Recency Vectors
Governor
─ Word position vector of length 212.
─ Recency vector.
總督 (Governor)
─ Word position vector of length 254.
─ Recency vector.
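The two vectors named on this slide can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that a position vector lists the places where a word occurs and that a recency vector holds the gaps between successive occurrences. Counting positions in tokens, and the helper names `position_vector` and `recency_vector`, are assumptions made for the example.

```python
from typing import List


def position_vector(tokens: List[str], word: str) -> List[int]:
    """Return the 0-based token positions at which `word` occurs.

    Assumption: positions are counted in tokens; a real system might
    count bytes or characters instead.
    """
    return [i for i, tok in enumerate(tokens) if tok == word]


def recency_vector(positions: List[int]) -> List[int]:
    """Return the gaps between successive occurrences.

    The result is one element shorter than `positions`.
    """
    return [later - earlier for earlier, later in zip(positions, positions[1:])]


# Toy sentence invented for illustration only.
tokens = "the governor met the council and the governor spoke".split()
pos = position_vector(tokens, "governor")
rec = recency_vector(pos)
print(pos, rec)
```

Because the recency vector stores distances rather than absolute positions, two translations of the same word can have similar recency vectors even when the two halves of the corpus drift apart.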
21
Recency vector signals (figure panels: Governor.ch, Governor.en, Bill.ch, President.en)
22
7.2. Matching Recency Vectors
Dynamic time warping (DTW)
─ Takes two vectors of lengths N and M and finds an optimal path through the N by M trellis, from (1,1) to (N,M).
(Figure: trellis for Governor / 總督)
23
DTW algorithm
Initialization
─ Costs are initialized according to the recency vector values.
(Figure: trellis for Governor / 總督)
24
DTW algorithm
Recursion
─ Accumulates the cost of the DTW path.
(Figure: trellis for Governor / 總督)
25
DTW algorithm
Termination
─ The final cost of the DTW path is normalized by the length of the path.
(Figure: trellis for Governor / 總督)
26
DTW algorithm
Path reconstruction
─ Reconstruct the DTW path and obtain the points on the path.
─ The path points are used later for finding anchor points and eliminating noise.
(Figure: trellis for Governor / 總督)
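The four DTW stages (initialization, recursion, termination, path reconstruction) can be sketched in one function. This is a generic textbook DTW under stated assumptions, not necessarily the exact variant used in the paper: the local cost is taken to be the absolute difference of two recency values, and the allowed moves are one step down, one step right, or one diagonal step. The function name `dtw` is hypothetical, and indices are 0-based, so the slide's (1,1) and (N,M) become (0,0) and (N-1,M-1).

```python
from typing import List, Optional, Tuple


def dtw(v1: List[float], v2: List[float]) -> Tuple[float, List[Tuple[int, int]]]:
    """Align two non-empty vectors; return (normalized cost, path)."""
    n, m = len(v1), len(v2)
    inf = float("inf")
    cost = [[inf] * m for _ in range(n)]
    back: List[List[Optional[Tuple[int, int]]]] = [[None] * m for _ in range(n)]

    # Initialization: the first cell costs the local distance only.
    # Assumption: local distance is the absolute difference of values.
    cost[0][0] = abs(v1[0] - v2[0])

    # Recursion: accumulate the cheapest way to reach each cell.
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            candidates = []
            if i > 0:
                candidates.append((cost[i - 1][j], (i - 1, j)))
            if j > 0:
                candidates.append((cost[i][j - 1], (i, j - 1)))
            if i > 0 and j > 0:
                candidates.append((cost[i - 1][j - 1], (i - 1, j - 1)))
            best, prev = min(candidates, key=lambda c: c[0])
            cost[i][j] = best + abs(v1[i] - v2[j])
            back[i][j] = prev

    # Path reconstruction: follow backpointers from the last cell.
    path = [(n - 1, m - 1)]
    while path[-1] != (0, 0):
        i, j = path[-1]
        path.append(back[i][j])
    path.reverse()

    # Termination: normalize the final cost by the path length.
    return cost[n - 1][m - 1] / len(path), path


# Toy vectors invented for illustration only.
score, path = dtw([1, 2, 3], [1, 2, 2, 3])
print(score, path)
```

The returned path is the list of trellis points that later stages can inspect when looking for anchor points. The full trellis costs O(N·M) time and memory per word pair, which is why filtering candidate pairs before running DTW matters.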
27
DTW algorithm
For each word vector in language A, the word vector in language B with the lowest DTW score is taken to be its translation.
We thresholded the bilingual word pairs obtained from the above stages of the algorithm and stored the more reliable pairs as our primary bilingual lexicon.
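The selection and thresholding step can be sketched as below, assuming DTW scores have already been computed for every candidate pair. The function name `primary_lexicon`, the threshold value, and the scores in the example are all invented for illustration; the slides do not give the actual threshold.

```python
from typing import Dict, Tuple


def primary_lexicon(
    scores: Dict[str, Dict[str, float]], threshold: float
) -> Dict[str, Tuple[str, float]]:
    """Keep, for each source word, its lowest-scoring target word,
    but only when that score does not exceed `threshold`.

    `scores[src][tgt]` is a precomputed DTW score (lower is better).
    """
    lexicon = {}
    for src, row in scores.items():
        if not row:
            continue
        tgt = min(row, key=row.get)
        if row[tgt] <= threshold:
            lexicon[src] = (tgt, row[tgt])
    return lexicon


# Hypothetical scores, not results from the paper.
toy_scores = {
    "Governor": {"總督": 0.8, "基本法": 5.1},
    "Bill": {"總督": 4.0, "基本法": 3.9},
}
lexicon = primary_lexicon(toy_scores, threshold=1.0)
print(lexicon)
```

In this toy run only the Governor pair survives, because the best candidate for Bill still scores above the threshold.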
28
7.3. Statistical filters
To reduce the computational complexity, we incorporated constraints to filter the set of possible pairs
─ Starting point constraint, i.e., position constraint.
─ Length constraint, i.e., frequency constraint.
─ Mean/standard deviation constraint.
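A pre-filter built from these three constraints might look like the sketch below. The slides name the constraints but not their exact definitions or thresholds, so everything concrete here is an assumption: the position constraint compares first occurrences, the length constraint compares occurrence counts by ratio, and the mean/standard deviation constraint uses the Euclidean distance between the (mean, standard deviation) pairs of the two recency vectors. The function name and the three threshold parameters are hypothetical.

```python
import math
from statistics import mean, pstdev
from typing import List


def passes_filters(
    pos_a: List[int],
    pos_b: List[int],
    max_start_gap: float,
    max_length_ratio: float,
    max_stat_distance: float,
) -> bool:
    """Cheap test applied before running DTW on a candidate word pair."""
    if len(pos_a) < 2 or len(pos_b) < 2:
        return False  # no recency vector can be formed

    # Starting point (position) constraint: first occurrences are close.
    if abs(pos_a[0] - pos_b[0]) > max_start_gap:
        return False

    # Length (frequency) constraint: similar numbers of occurrences.
    longer, shorter = max(len(pos_a), len(pos_b)), min(len(pos_a), len(pos_b))
    if longer / shorter > max_length_ratio:
        return False

    # Mean/standard deviation constraint on the recency vectors.
    rec_a = [y - x for x, y in zip(pos_a, pos_a[1:])]
    rec_b = [y - x for x, y in zip(pos_b, pos_b[1:])]
    distance = math.hypot(mean(rec_a) - mean(rec_b), pstdev(rec_a) - pstdev(rec_b))
    return distance <= max_stat_distance


# Positions and thresholds invented for illustration only.
ok = passes_filters([10, 50, 90, 130], [12, 55, 95, 140], 100, 2.0, 10.0)
far = passes_filters([10, 50, 90, 130], [500, 510, 520, 530], 100, 2.0, 10.0)
print(ok, far)
```

Each check is linear in the vector length, so pairs that fail can be discarded without paying the quadratic cost of DTW.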
29
8. Finding anchor points and eliminating noise
The primary lexicon is used for aligning the segments in the corpus
─ Anchor points found on the DTW paths divide the texts into multiple aligned segments for the secondary lexicon.
We only keep an anchor point (i,j) if it satisfies the following
─ Slope constraint
─ Continuity constraint
─ Window size constraint
─ Offset constraint
30
8. Finding anchor points and eliminating noise
(Figure: text alignment paths for the AWK and HKUST corpora, showing all word pairs and the pairs remaining after filtering.)
31
9. Finding bilingual word pair matches
To obtain the secondary and final bilingual word lexicon
─ A non-linear K segment binary vector representation for each word.
─ A similarity measure to compute word pair correlations.
32
9.1. Non-Linear K segments
The anchor points (i,j), where i is a position in text1 and j is a position in text2, divide the bilingual corpus into k+1 non-linear segments.
The algorithm then proceeds to obtain a secondary bilingual lexicon, considering words of both high and low frequency.
33
9.2. Non-Linear segment binary vectors
To compute the correlation between two words, we look at the occurrences of a pair of translated words in the bilingual corpus.
─ Pr(w_s, w_t): the probability of w_s and w_t occurring in the same place in the corpus.
─ Each word gets a binary vector whose i-th bit is set to 1 if the word is found in the i-th segment.
─ Example for governor over K segments: 1 0 … 1
34
9.2. Non-Linear segment binary vectors
─ If the source and target words are good translations of one another, then a, the number of segments in which both words are found, should be large.
(Figure: contingency table of source word found, T or F, against target word found, T or F.)
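The segment binary vectors and the contingency counts can be sketched as follows. This is an illustration under assumptions: anchor points are given as (i, j) position pairs, a position equal to a boundary is placed in the following segment, and the names `segment_vector`, `contingency`, `b`, `c`, and `d` are chosen for the example (only `a` appears on the slide).

```python
from bisect import bisect_right
from typing import List, Tuple


def segment_vector(positions: List[int], boundaries: List[int]) -> List[int]:
    """Binary vector over len(boundaries) + 1 segments.

    Bit s is 1 if the word occurs anywhere in segment s.
    `boundaries` must be sorted in ascending order.
    """
    vec = [0] * (len(boundaries) + 1)
    for p in positions:
        vec[bisect_right(boundaries, p)] = 1
    return vec


def contingency(vs: List[int], vt: List[int]) -> Tuple[int, int, int, int]:
    """Count segments by (source found, target found).

    a: both found, b: source only, c: target only, d: neither.
    """
    a = sum(1 for s, t in zip(vs, vt) if s and t)
    b = sum(1 for s, t in zip(vs, vt) if s and not t)
    c = sum(1 for s, t in zip(vs, vt) if not s and t)
    d = sum(1 for s, t in zip(vs, vt) if not s and not t)
    return a, b, c, d


# Anchor points and word positions invented for illustration only.
anchors = [(100, 120), (200, 260)]
src_bounds = [i for i, _ in anchors]
tgt_bounds = [j for _, j in anchors]
v_src = segment_vector([15, 230], src_bounds)   # e.g. "governor"
v_tgt = segment_vector([20, 300], tgt_bounds)   # e.g. "總督"
counts = contingency(v_src, v_tgt)
print(v_src, v_tgt, counts)
```

Because each text is cut at its own side of the anchor points, the two vectors have the same length even though the segments differ in size between the two texts.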
35
9.3. Binary vector correlation measure
Similarity measure: weighted mutual information
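The formula on this slide did not survive in the transcript, so the sketch below assumes one common form of weighted mutual information: W(w_s, w_t) = Pr(w_s=1, w_t=1) · log2( Pr(w_s=1, w_t=1) / (Pr(w_s=1) · Pr(w_t=1)) ). The probabilities are estimated from segment counts, where a is the number of segments containing both words, b the source word only, c the target word only, and d neither. The names `b`, `c`, `d` and the function name are assumptions for this example.

```python
import math


def weighted_mutual_information(a: int, b: int, c: int, d: int) -> float:
    """Weighted mutual information from contingency counts.

    Assumed form: Pr(both) * log2(Pr(both) / (Pr(source) * Pr(target))).
    Returns 0.0 when the words never co-occur, to avoid log(0).
    """
    n = a + b + c + d
    if n == 0 or a == 0:
        return 0.0
    p_joint = a / n
    p_source = (a + b) / n
    p_target = (a + c) / n
    return p_joint * math.log2(p_joint / (p_source * p_target))


# Counts invented for illustration only.
strong = weighted_mutual_information(2, 0, 0, 2)
independent = weighted_mutual_information(1, 1, 1, 1)
print(strong, independent)
```

Weighting the log ratio by the joint probability favors pairs that co-occur in many segments over pairs that co-occur perfectly but only once or twice.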
36
10. Word translation results
37
11. Term translations from word groups
38
Term translation aid result
39
Conclusion
A technique to align noisy parallel corpora by segments and to extract a bilingual word lexicon from them.
─ Replaces the sentence alignment step with a rough segment alignment.
─ Works without sentence boundary information and in the presence of noise.
─ Uses DTW to find highly reliable anchor points that serve as segment delimiters.
40
Personal opinion
Valuable idea
─ Treats the domain word translation problem as a pattern matching problem.
Contribution
─ Robustness across languages and to noisy parallel corpora.
Drawback
─ Too long and too complex.