Malay-English Bitext Mapping and Alignment Using SIMR/GSA Algorithms. Mosleh Al-Adhaileh and Tang Enya Kong.

Presentation transcript:

Malay-English Bitext Mapping and Alignment Using SIMR/GSA Algorithms
Mosleh Al-Adhaileh and Tang Enya Kong, Computer Aided Translation Unit, School of Computer Sciences, University Science Malaysia
I. Dan Melamed, Department of Computer Science, Courant Institute, New York University

Presentation Outline
 - Introduction
 - SIMR and GSA algorithms
 - Bitext Mapping and Alignment
 - Porting SIMR/GSA to Malay-English Bitext
 - Data Collection
 - Steps to adopt SIMR into Malay-English language pair
 - Matching Predicate
 - Axis Generator
 - Parameter Optimization
 - Results and Evaluation
 - Conclusion

Bitext Mapping and Alignment
 - Bitext (parallel text): a text in one language and its translation in another language.
 - Bitext mapping and alignment: describing the correspondence between the two halves of a bitext.
 - Bitext mapping: finding the corresponding points (words, text units, or segment boundaries) between the two halves.
 - Bitext alignment: segmenting the two texts so that the nth segment of one text corresponds to the nth segment of the other.

Bitext mapping and alignment are needed in order to compile bitext data into a useful source of knowledge for:
 - Word sense disambiguation
 - Bilingual lexicography
 - Machine translation
 - Multilingual information retrieval
 - Practical tools for assisting translators

Bitext space
A bitext can form the axes of a rectangular bitext space: the X axis gives character positions in text 1 and the Y axis gives character positions in text 2. The main diagonal runs from the origin to the terminus. True Points of Correspondence (TPCs) can be plotted as points in this space: the point (X, Y) is a TPC if the token at position X and the token at position Y are translations of each other.
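The bitext space described above can be sketched in a few lines of Python. This is only an illustration: the class name `BitextSpace`, the text lengths, and the sample points are assumptions made for the example, not values from the presentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BitextSpace:
    """Rectangular bitext space; the class name and fields are hypothetical."""
    len_x: float  # length of text 1 in characters (X axis)
    len_y: float  # length of text 2 in characters (Y axis)

    def diagonal_slope(self) -> float:
        # The main diagonal runs from the origin (0, 0) to the terminus (len_x, len_y).
        return self.len_y / self.len_x

    def displacement(self, x: float, y: float) -> float:
        # Vertical offset of a candidate point (x, y) from the main diagonal.
        return y - self.diagonal_slope() * x


# Made-up lengths and candidate points of correspondence.
space = BitextSpace(len_x=1000, len_y=1100)
points = [(100, 112), (500, 548), (900, 985)]
offsets = [space.displacement(x, y) for x, y in points]
```

Points that are true correspondences tend to lie close to the main diagonal, so a small displacement is one cheap signal that a candidate point is plausible.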

Real bitexts are noisy:
 - Fertility: a single segment in one half may correspond to zero, one, two, or more segments in the other half.
 - Crossed dependencies (distortion): human translators change and rearrange material, so the target text does not follow the order of the source text.

SIMR and GSA algorithms: Mapping
SIMR stands for Smooth Injective Map Recognizer.
[Diagram: a Malay-English bitext is passed to SIMR, which produces a mapped bitext. Labels: TPCs, TBM.]

SIMR and GSA algorithms: Alignment
[Figure: a grid over the bitext space, with segments A to L of one text along one axis and segments a to j of the other text along the other axis.]
 - SIMR output: the correspondence points.
 - GSA stands for Geometric Segment Alignment.
 - Segment boundaries form a grid over the bitext space.
 - Each cell represents the intersection of two segments, one from each half of the bitext.
 - GSA reduces the sets of correspondence points in SIMR's output to segment alignments.
 - A point inside cell (X, y) indicates that some token in segment X corresponds with some token in segment y, so segments X and y correspond.
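The grid idea can be illustrated with a simplified Python sketch. It is not the GSA implementation: it only locates the cell of each correspondence point and merges cells that share a row or a column into one aligned block, without any of the noise handling a real aligner needs. The function names and the sample boundaries are hypothetical.

```python
from bisect import bisect_right


def cells_from_points(points, x_starts, y_starts):
    """Map each (x, y) point to the grid cell (segment index pair) containing it.

    x_starts and y_starts hold the start position of every segment, ascending.
    """
    cells = set()
    for x, y in points:
        cells.add((bisect_right(x_starts, x) - 1, bisect_right(y_starts, y) - 1))
    return sorted(cells)


def align_segments(cells):
    """Merge cells that share a row or a column into aligned segment blocks."""
    blocks = []
    for i, j in cells:
        xs, ys = {i}, {j}
        rest = []
        for bx, by in blocks:
            if i in bx or j in by:  # same segment on either axis: same block
                xs |= bx
                ys |= by
            else:
                rest.append((bx, by))
        blocks = rest + [(xs, ys)]
    return sorted((sorted(bx), sorted(by)) for bx, by in blocks)


# Made-up segment start positions (in characters) and correspondence points.
x_starts = [0, 40, 90]
y_starts = [0, 35, 60, 100]
points = [(10, 12), (50, 40), (70, 75), (95, 110)]

cells = cells_from_points(points, x_starts, y_starts)
alignment = align_segments(cells)
```

In this toy input the second segment of text 1 ends up aligned with two segments of text 2, which is the kind of one-to-many correspondence that fertility produces.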

Data Collection
Malay-English bitexts from Unit Terjemahan Melalui Komputer (UTMK), USM:
 - "The 7 Habits of Highly Effective People" (13 chapters): English version 101,790 words; Malay version 107,161 words.
 - "Semantics" (8 chapters): English version 50,170 words; Malay version 51,802 words.
 - "User's Guide: Microsoft Word for Windows" (first 20 pages): English version 6,974 words; Malay version 8,281 words.

Steps to adapt SIMR to the Malay-English language pair
[Diagram components: Malay-English training data, manual alignment, validation of the manual alignment with ADOMIT, parameter re-optimization; SIMR with the axis generator and a lexicon (KIMD bilingual dictionary); Malay-English test data, bitext mapping, GSA, segment alignment.]

Matching Predicate
Goal: find the TPCs between the two halves of the bitext. The matching predicate is a heuristic used to decide whether two given tokens might be mutual translations. It relies on:
 - Cognate words, e.g. Computer / Komputer, System / Sistem.
 - Punctuation marks.
 - Lexicon, e.g. Bury: mengebumikan, menanam, kematian, kereta.
The matching predicates were fine-tuned with stop-word lists for both Malay and English.
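A matching predicate of this kind might look like the following Python sketch. Several things here are assumptions rather than details from the slides: cognates are scored with the longest common subsequence ratio (LCSR), the threshold `min_lcsr` of 0.7 is arbitrary, and the stop-word list is a tiny placeholder. Only the cognate pairs and the lexicon entry for "bury" come from the presentation.

```python
import string


def lcs_length(a, b):
    """Length of the longest common subsequence of two strings."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def lcsr(a, b):
    """Longest common subsequence ratio, an assumed cognate score."""
    return lcs_length(a, b) / max(len(a), len(b))


def is_punct(token):
    return len(token) == 1 and token in string.punctuation


STOP_WORDS = {"the", "of", "yang", "dan"}  # hypothetical stop-word list
LEXICON = {"bury": {"mengebumikan", "menanam", "kematian", "kereta"}}  # entry from the slide


def may_match(en, ms, min_lcsr=0.7):
    """Heuristic: might the English token and the Malay token be mutual translations?"""
    en, ms = en.lower(), ms.lower()
    if is_punct(en) or is_punct(ms):
        return en == ms  # punctuation only matches identical punctuation
    if en in STOP_WORDS or ms in STOP_WORDS:
        return False  # stop words are too frequent to be reliable anchors
    if ms in LEXICON.get(en, ()):
        return True
    return lcsr(en, ms) >= min_lcsr
```

With these settings `may_match("Computer", "Komputer")` and `may_match("System", "Sistem")` succeed through the cognate test, while `may_match("bury", "menanam")` succeeds through the lexicon.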

Axis Generator
For each language, an axis generator maps tokens (the smallest semantic units) to axis positions. The position of a token, in characters, is the position of its median character. For example, in "tujuh tabiat gambaran seluruh." the positions are: tujuh 3, tabiat 9.5, gambaran 17.5, seluruh 26.
Data Lemmatization
 - English: word stemming using POS tags (Brill's tagger) and the XTAG lexicon (contains roots and inflected forms).
 - Malay: root construction using rules and a lexicon (contains popular words).
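The median-character rule can be checked with a short Python sketch. The function name `axis_positions` is hypothetical, tokens are approximated as runs of word characters, and character positions are counted from 1, which is the convention that reproduces the numbers on the slide.

```python
import re


def axis_positions(text):
    """Return (token, position) pairs, where position is the token's median character.

    Characters are numbered from 1. Tokens are runs of word characters, a
    simplification that ignores punctuation tokens.
    """
    result = []
    for m in re.finditer(r"\w+", text):
        first = m.start() + 1  # 1-based position of the first character
        last = m.end()         # 1-based position of the last character
        result.append((m.group(), (first + last) / 2))
    return result


positions = axis_positions("tujuh tabiat gambaran seluruh.")
# positions == [("tujuh", 3.0), ("tabiat", 9.5), ("gambaran", 17.5), ("seluruh", 26.0)]
```

A token with an even number of characters has no single middle character, which is why "tabiat" and "gambaran" land on half positions.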

Parameter Optimization
We use Chapters 3, 7 and 11 of the "7 Habits" book, 1245 segments altogether, manually aligned at the sentence level.
Alignment Validation
ADOMIT (Automatic Detection of OMIssions in Translation) is used to validate the alignment. Put simply: any segment whose slope is unusually low is a likely omission.
Parameter values:
Parameter                      Value
Chain size                     7
Max. point ambiguity           [value missing]
Max. linear regression error   [value missing]
Min. cognate length ratio      0.80
Max. angle deviation           5
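The slope heuristic can be illustrated as follows. This is a rough sketch of the idea only, not ADOMIT itself: the function name `flag_omissions`, the rule of comparing each span's slope with the overall slope of the map, and the cut-off `max_slope_ratio` of 0.5 are all assumptions made for the example.

```python
def flag_omissions(points, max_slope_ratio=0.5):
    """Flag spans between consecutive map points whose slope is unusually low.

    points must be sorted by x. A span is flagged when its slope falls below
    max_slope_ratio times the overall slope of the map (a hypothetical rule).
    """
    (x0, y0), (xn, yn) = points[0], points[-1]
    overall = (yn - y0) / (xn - x0)
    flagged = []
    for (xa, ya), (xb, yb) in zip(points, points[1:]):
        if xb == xa:
            continue  # vertical span: slope undefined, not a candidate omission
        slope = (yb - ya) / (xb - xa)
        if slope < max_slope_ratio * overall:
            flagged.append(((xa, ya), (xb, yb)))
    return flagged


# Made-up bitext map: text 1 advances 100 characters between x=200 and x=300
# while text 2 advances only 5, suggesting untranslated material.
bitext_map = [(0, 0), (100, 105), (200, 210), (300, 215), (400, 320)]
suspects = flag_omissions(bitext_map)
```

A nearly flat span means a stretch of text 1 has almost no counterpart in text 2, so such spans are worth checking by hand when validating a manual alignment.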

 Results and Evaluation

Conclusion
 - This experiment shows that the SIMR/GSA algorithms can map and align Malay-English bitexts with high accuracy, as they have done on a variety of other language pairs and text genres.
 - As future work, these results encourage us to extend text alignment to word alignment, aiming to identify correspondences between linguistic units below the sentence level within a bitext.
 - Bitexts are becoming plentiful, both in private data warehouses and on publicly accessible sites on the WWW. They form a very useful source of knowledge if treated efficiently.
Visit the URL for Unit Terjemahan Melalui Komputer (UTMK), USM. Visit the URL for important references.

Thank you.