

Ibrahim Badr, Rabih Zbib, James Glass

Introduction
- Experiments on English-to-Arabic SMT in two domains: newswire text and spoken travel conversation.
- Explore the effect of Arabic segmentation on translation quality.
- Propose various schemes for recombining the segmented Arabic (not trivial!).
- Apply (basic) factored translation models.

Arabic Morphology
- Arabic is a morphologically rich language.
- Nouns and adjectives inflect for gender (m, f), number (sg, du, pl) and case (Nom, Acc, Gen); all combinations are possible: لاعب (a player, M), لاعبة (a player, F), لاعبان (two players, M), لاعبتان (two players, F), لاعبون (players, M, Pl, Nom), لاعبين (players, M, Pl, Acc or Gen).
- In addition to gender and number, verbs inflect for tense, voice, and person: لعبوا (played, past, 3MPl), يلعبون (play, present, 3MPl), ستلعبون (will play, future).
- Additional prefixes: conjunction و, determiner أل, prepositions ب (with, in), ل (to, for), لل: وبالاخبار
- Additional suffixes:
  - possessive pronouns (attach to nouns): هم (their), كُم (your, Pl, M), كُن (your, Pl, F), …
  - object and subject pronouns (attach to verbs): ني (me), هُن (them), و (they): وبسياراتهم
- Many surface forms share the same lemma!

Arabic Segmentation
- Use MADA for morphological decomposition of the Arabic text.
- (Typical) normalization: ى → ي, and أ/آ/إ → ا
- Two proposed segmentation schemes:
  - S1: split off all the clitics mentioned on the previous slide, except plural and subject-pronoun morphemes.
  - S2: same as S1, but the split clitics are glued into one prefix and one suffix: word = prefix + stem + suffix.
- Example: ولأولاده (and for his kids)
  - S1: ه + أولاد + ل + و
  - S2: ول + أولاد + ه
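As a minimal sketch of the difference between the two schemes (Python, with words in Buckwalter transliteration; the clitic analysis is assumed to come from MADA, and the function names and "+" marker convention are illustrative, not the paper's implementation):

```python
def segment_s1(prefixes, stem, suffixes):
    """S1: every split clitic becomes its own token, marked with '+'."""
    return [p + "+" for p in prefixes] + [stem] + ["+" + s for s in suffixes]

def segment_s2(prefixes, stem, suffixes):
    """S2: split clitics are glued into one prefix token and one suffix token."""
    tokens = []
    if prefixes:
        tokens.append("".join(prefixes) + "+")
    tokens.append(stem)
    if suffixes:
        tokens.append("+" + "".join(suffixes))
    return tokens

# wlAwlAdh "and for his kids" decomposes as w + l + AwlAd + h
print(segment_s1(["w", "l"], "AwlAd", ["h"]))  # ['w+', 'l+', 'AwlAd', '+h']
print(segment_s2(["w", "l"], "AwlAd", ["h"]))  # ['wl+', 'AwlAd', '+h']
```

S2 keeps the token count closer to the unsegmented text, which matters for phrase extraction and language-model context.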

Arabic Recombination
- The segmented output needs recombination! Why is it not trivial?
  a) Letter ambiguity: we normalized ى → ي, so ه + مدى → مداه but ه + في → فيه
  b) Word ambiguity: some words can be grammatically recombined in more than one way: لكن + ي → #1 لكني or #2 لكنني
- Two proposed recombination schemes:
  1. R: recombination rules, defined manually. Resolve (a) by picking the most frequent stem form in the non-normalized data; resolve (b) by picking the most frequent grammatical form.
  2. T: a table of (surface form, decomposed word) pairs derived from the training set. If a decomposed word has more than one surface form, choose randomly. The table can also help recombine words that were segmented incorrectly.
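The combined T+R scheme used later can be sketched as a table lookup with a rule-based fallback (Buckwalter transliteration; the table contents, rule set, and names below are illustrative assumptions, not the paper's exact implementation):

```python
import random

# Table T, built from the training set: decomposed form -> surface forms
# observed in training. For lkn+y both lkny and lknny occur, so the
# scheme picks randomly among the alternatives.
TABLE = {"lkn +y": ["lkny", "lknny"]}

def rule_recombine(decomposed):
    """R: manually defined rules, simplified here to gluing at '+' markers."""
    return decomposed.replace("+ ", "").replace(" +", "").replace("+", "")

def recombine_tr(decomposed):
    """T+R: use the table if the word was seen in training, else the rules."""
    if decomposed in TABLE:
        return random.choice(TABLE[decomposed])
    return rule_recombine(decomposed)

print(recombine_tr("w+ l+ AwlAd +h"))  # wlAwlAdh (unseen word: rule fallback)
```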

Factored Model & Data
Factors:
- English side: surface form + POS.
- Arabic side: surface form + POS&clitics.
- Build a 3-gram LM on surface forms and a 7-gram LM on POS&clitics.
- Generation model: surface + POS&clitics → surface.
Data: newswire & spoken dialogue (travel)
- Training data:
  - Newswire: LDC; ~3M, ~1.6M, and ~600K words (avg. sentence length: 33 En, 25 Ar, 36 segmented Ar).
  - Spoken dialogue: IWSLT (2007), 200K words (avg. sentence length: 9 En, 8 Ar, 10 segmented Ar).
- LM:
  - Newswire: ~3M-word Arabic side + 30M words from Arabic Gigaword.
  - Spoken dialogue: 200K-word Arabic side.
- Tuning and test sets (single reference):
  - Newswire: 2000-sentence tune, 2000-sentence test (chosen randomly, same source as the training data).
  - Spoken dialogue: 500-sentence tune, 500-sentence test.
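Factored input can be pictured in the Moses convention, where the extra factors are appended to each surface token with a "|" separator (the tokens and POS tags below are illustrative):

```python
def add_factors(tokens, pos_tags):
    """Attach one factor per token in Moses 'surface|factor' notation."""
    return [f"{w}|{p}" for w, p in zip(tokens, pos_tags)]

print(add_factors(["the", "boys", "play"], ["DT", "NNS", "VBP"]))
# ['the|DT', 'boys|NNS', 'play|VBP']
```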

Setup & Recombination Results
Setup:
- Use GIZA++ for alignment (both unsegmented and segmented Arabic); use a maximum phrase length of 15 for segmented Arabic!
- Decode using Moses.
- SRILM language models:
  - Newswire: 4-gram (unsegmented Ar), 6-gram (segmented Ar).
  - Spoken: 3-gram (unsegmented Ar), 4-gram (segmented Ar).
- MERT for tuning, optimizing for BLEU. Two tuning schemes for segmented Arabic:
  - T1: use segmented Arabic references.
  - T2: use unsegmented Arabic references; recombine the n-best list before scoring.
Recombination results:
- Tested on the newswire training and test sets (sentence error rate!).
- T was trained on the training set.
- Baseline: glue prefixes and suffixes.
- T+R: if the word was seen in training, use T; otherwise use R.
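The T2 tuning scheme amounts to undoing the segmentation of every n-best hypothesis before it is scored against the unsegmented reference; a sketch, with `recombine` and `bleu` as caller-supplied stand-ins rather than the actual MERT integration:

```python
def score_nbest_t2(nbest_segmented, unseg_reference, recombine, bleu):
    """T2: recombine each segmented hypothesis to surface form, then score
    it against the unsegmented reference."""
    scores = []
    for hyp in nbest_segmented:
        surface = recombine(hyp)  # undo the segmentation before scoring
        scores.append(bleu(surface, unseg_reference))
    return scores
```

Under T1, by contrast, `recombine` would be the identity and the reference itself would stay segmented.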

Translation Results: Newswire
Results for newswire (BLEU):
- Segmentation helps, but the gain diminishes as the training-data size increases (less sparse models).
- Segmentation scheme S2 is slightly better than S1.
- Tuning scheme T2 performs better than T1.
- Factored models perform best for the largest system (at a higher cost!).

Translation Results: Spoken Dialogue
Results for spoken dialogue (BLEU):
- S2 performs slightly better than S1.
- T1 is better than T2.
Conclusions:
- Recombination based on both the training data and rules performs best.
- Segmentation helps, but the gain diminishes as the training-data size increases.
- Recombining the segmented output during tuning helps.
- Factored models perform best for the "Large" system.
- What next: explore the effect of syntactic reordering on En→Ar MT: Syntactic Phrase Reordering for English-to-Arabic Statistical Machine Translation, Badr et al., EACL 2009.