Malay-English Bitext Mapping and Alignment Using SIMR/GSA Algorithms
Mosleh Al-Adhaileh and Tang Enya Kong, Computer Aided Translation Unit, School of Computer Sciences, University Science Malaysia
I. Dan Melamed, Department of Computer Science, Courant Institute, New York University
Presentation Outline
- Introduction
- SIMR and GSA algorithms
- Bitext Mapping and Alignment
- Porting SIMR/GSA to Malay-English Bitext
- Data Collection
- Steps to adopt SIMR for the Malay-English language pair
- Matching Predicate
- Axis Generator
- Parameter Optimization
- Results and Evaluation
- Conclusion
Bitext Mapping and Alignment
- Bitext (parallel text): a text in one language and its translation in another language.
- Bitext mapping and alignment: describe the correspondence between the two halves of the bitext.
- Bitext mapping: finding the corresponding points (words, text units, or segment boundaries) between the two halves.
- Bitext alignment: a segmentation of the two texts such that the n-th segment of one text corresponds to the n-th segment of the other.
Bitext mapping and alignment are needed in order to compile bitext data into a useful source of knowledge for:
- Word sense disambiguation
- Bilingual lexicography
- Machine translation
- Multilingual information retrieval
They also serve as a practical tool for assisting translators.
Bitext space
- A bitext can form the axes of a rectangular bitext space.
- True Points of Correspondence (TPCs) can be plotted as points in the bitext space.
- The point (X, Y) is a TPC if the token at position X and the token at position Y are translations of each other.
[Figure: bitext space. X axis: character position in text 1; Y axis: character position in text 2; the main diagonal runs from the origin to the terminus; a TPC is plotted at (X, Y).]
Real bitexts are noisy:
- Fertility: a single segment in one half may correspond to zero, one, two, or more segments in the other half.
- Crossed dependencies (distortion): human translators change and rearrange material, so the target text does not follow the order of the source text.
SIMR and GSA algorithms
Mapping
- SIMR stands for Smooth Injective Map Recognizer.
[Figure: a Malay-English bitext is passed through SIMR to produce a mapped bitext; labels: TPCs, TBM.]
Alignment
- SIMR output: the correspondence points.
- GSA stands for Geometric Segment Alignment.
- Segment boundaries form a grid over the bitext space.
- Each cell represents the intersection of two segments, one from each half of the bitext.
- A point inside cell (X, y) indicates that some token in segment X corresponds with some token in segment y, so segments X and y correspond.
- GSA reduces the sets of correspondence points in SIMR’s output to segment alignments.
[Figure: grid over the bitext space; segments of one half are labeled A to L and segments of the other half a to j.]
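The grid idea can be illustrated with a small sketch. This is not the GSA implementation: it only shows how correspondence points fall into grid cells and how cells sharing a row or column can be merged into aligned blocks. The function names, the boundary lists, and the sample points are all hypothetical, and the real algorithm also has to cope with noisy points, which this sketch ignores.

```python
from bisect import bisect_left

def segment_index(boundaries, position):
    """Index of the segment containing `position`.
    `boundaries` holds the sorted end position of each segment (assumed input format)."""
    return bisect_left(boundaries, position)

def cells_from_points(points, x_bounds, y_bounds):
    """Map each correspondence point to the grid cell (segment pair) it falls in."""
    return sorted({(segment_index(x_bounds, x), segment_index(y_bounds, y))
                   for x, y in points})

def align_segments(cells):
    """Merge cells that share a row or a column into aligned blocks.
    Returns (segments of text 1, segments of text 2) pairs."""
    blocks = []
    for i, j in cells:
        xs, ys = {i}, {j}
        untouched = []
        for bx, by in blocks:
            if i in bx or j in by:      # this block shares a segment with the cell
                xs |= bx
                ys |= by
            else:
                untouched.append((bx, by))
        untouched.append((xs, ys))
        blocks = untouched
    blocks.sort(key=lambda block: min(block[0]))
    return [(sorted(bx), sorted(by)) for bx, by in blocks]

# Hypothetical data: three segments per text and six correspondence points.
x_bounds = [10, 20, 30]
y_bounds = [12, 25, 33]
points = [(3, 4), (8, 9), (14, 15), (18, 22), (24, 23), (28, 30)]

cells = cells_from_points(points, x_bounds, y_bounds)
print(cells)
print(align_segments(cells))
```

With this toy input, the first segments of the two texts form a one-to-one alignment, while the remaining two segments on each side are linked through shared cells and end up in a single two-to-two block.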
Data Collection
Malay-English bitexts from Unit Terjemahan Melalui Komputer (UTMK), USM:
- “The 7 Habits of Highly Effective People” (13 chapters): English version 101,790 words; Malay version 107,161 words.
- “Semantics” (8 chapters): English version 50,170 words; Malay version 51,802 words.
- “User’s Guide: Microsoft Word for Windows” (first 20 pages): English version 6,974 words; Malay version 8,281 words.
Steps to adopt SIMR for the Malay-English language pair
[Figure: workflow diagram. Labeled components: Malay-English training data and test data; manual alignment; ADOMIT (validates the manual alignment); SIMR with its axis generator and lexicon (KIMD bilingual dictionary); parameter re-optimization; bitext mapping; GSA; segment alignment.]
Matching Predicate
- Purpose: find the TPCs between the two halves of the bitext.
- It is a heuristic used to decide whether two given tokens might be mutual translations.
- Cognate words: Computer / Komputer, System / Sistem.
- Punctuation marks.
- Lexicon, e.g. Bury: mengebumikan, menanam, kematian, kereta.
- The matching predicates were fine-tuned with stop-lists for both Malay and English.
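A matching predicate of this kind might look like the sketch below. Several things here are assumptions rather than details from the slides: cognates are scored with the longest common subsequence ratio (LCSR), the 0.80 threshold is borrowed from the "Min. cognate length ratio" parameter, stop-listed words are simply excluded from matching, and the stop-list contents are invented for illustration. Only the lexicon entry for "bury" is taken from the slide.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def lcsr(a, b):
    """Longest common subsequence ratio: LCS length over the longer token's length."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))

# Hypothetical stop-list; the actual lists used in the experiment are not given.
STOP_WORDS = {"the", "of", "yang", "dan"}

# Lexicon entry as listed on the slide.
LEXICON = {"bury": {"mengebumikan", "menanam", "kematian", "kereta"}}

def might_match(english, malay, min_ratio=0.80):
    """Heuristic: could these two tokens be mutual translations?"""
    e, m = english.lower(), malay.lower()
    if e in STOP_WORDS or m in STOP_WORDS:
        return False                      # assumption: stop words never match
    if m in LEXICON.get(e, set()):
        return True                       # lexicon hit
    return lcsr(e, m) >= min_ratio        # cognate test (also matches identical punctuation)

print(might_match("Computer", "Komputer"))
print(might_match("System", "Sistem"))
print(might_match("bury", "menanam"))
```

Under these assumptions, "Computer"/"Komputer" scores 7/8 and "System"/"Sistem" scores 5/6, so both cognate pairs from the slide pass the 0.80 threshold.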
Axis Generator
- For each language, an axis generator maps tokens (the smallest semantic units) to axis positions.
- The position of a token (in characters) is the position of its median character.
- Example: in “tujuh tabiat gambaran seluruh.”, measured from the origin 0, the positions are tujuh = 3, tabiat = 9.5, gambaran = 17.5, seluruh = 26.
Data Lemmatization
- English: word stemming, POS tagging (Brill’s), and the XTAG lexicon (contains roots and inflected forms).
- Malay: root construction rules and a lexicon (contains popular words).
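The median-character rule can be sketched in a few lines. The function name and the word-only tokenization (a simple regular expression that skips punctuation) are assumptions made for illustration; the axis generators used in the experiment also apply lemmatization, which is left out here.

```python
import re

def axis_positions(text):
    """Return (token, position) pairs, where position is the 1-based
    character index of the token's median character."""
    positions = []
    for match in re.finditer(r"\w+", text):
        first = match.start() + 1   # 1-based index of the token's first character
        last = match.end()          # 1-based index of the token's last character
        positions.append((match.group(), (first + last) / 2))
    return positions

print(axis_positions("tujuh tabiat gambaran seluruh."))
# [('tujuh', 3.0), ('tabiat', 9.5), ('gambaran', 17.5), ('seluruh', 26.0)]
```

The output reproduces the positions given in the slide's example: tokens with an even number of characters get a half-integer position.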
Parameter Optimization
- We use Chapters 3, 7, and 11 from the 7 Habits book: 1,245 segments altogether, manually aligned at the sentence level.
Alignment Validation
- ADOMIT (Automatic Detection of OMIssions in Translation).
- Simply put: any segment whose slope is unusually low is a likely omission.
[Figure: illustration of an omission; labels: A, O, B, a, b.]
Parameter values:
- Chain size: 7
- Max. point ambiguity: [value not recoverable]
- Max. linear regression error: [value not recoverable]
- Min. cognate length ratio: 0.80
- Max. angle deviation: 5
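The slope heuristic can be illustrated with a toy check over a bitext map. This is only a sketch of the idea, not ADOMIT itself: the function name, the `max_slope_ratio` threshold, and the sample points are hypothetical, and "unusually low" is approximated here as "well below the overall slope of the map".

```python
def likely_omissions(points, max_slope_ratio=0.2):
    """Flag map segments whose slope is far below the overall slope of the map.

    points: bitext map as (x, y) pairs with strictly increasing x.
    max_slope_ratio: hypothetical cutoff, as a fraction of the overall slope.
    """
    (x0, y0), (xn, yn) = points[0], points[-1]
    overall = (yn - y0) / (xn - x0)
    flagged = []
    for (xa, ya), (xb, yb) in zip(points, points[1:]):
        slope = (yb - ya) / (xb - xa)
        if slope < max_slope_ratio * overall:
            flagged.append(((xa, ya), (xb, yb)))
    return flagged

# Hypothetical map: 40 characters of text 1 correspond to only 2 characters of text 2.
bitext_map = [(0, 0), (10, 11), (20, 21), (60, 23), (70, 34), (80, 44)]
print(likely_omissions(bitext_map))
```

In this example only the nearly flat stretch between (20, 21) and (60, 23) is flagged, which corresponds to material in text 1 with almost no counterpart in text 2.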
Results and Evaluation
Conclusion
- This experiment shows that the SIMR/GSA algorithms can map and align Malay-English bitexts with high accuracy, as they have on a variety of other language pairs and text genres.
- As future work, these results encourage us to extend text alignment to word alignment, aiming to identify correspondences between linguistic units below the sentence level within a bitext.
- Bitexts are becoming plentiful and available, both in private data warehouses and on publicly accessible sites on the WWW. They form a very useful source of knowledge if they are processed efficiently.
Visit the URL for Unit Terjemahan Melalui Komputer (UTMK) – USM. Visit the URL for important references
Thank you.