Learning to Generate Complex Morphology for Machine Translation Einat Minkov †, Kristina Toutanova* and Hisami Suzuki* *Microsoft Research † Carnegie Mellon University

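Motivation
Source: I would like to meet this nice woman.
System output: اود ان مواجهه هذا جيد امراه.
Gloss of the last three words (in display order): woman, nice, this
Gender labels shown on the slide: fem, masc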

Motivation
Figure: correct translation compared with the system guess (Quirk et al, 05)

SMT challenges for English → morphology-rich languages
– Information ‘missing’ on source side
– Data sparsity
– Morphological agreement in the target language

Related work
Translation from morphology-rich languages to English
– Preprocessing of the inputs, to improve alignments
  Arabic (Lee, 04), German (Koehn and Knight, 03; Nießen and Ney, 04; Popović and Ney, 04; Collins et al., 05), Czech (Goldwater and McClosky, 05)
Translation from English to morphology-rich languages
– Preprocessing and postprocessing
  Turkish (El-Kahlout and Oflazer, 06), Spanish and Catalan (Ueffing and Ney, 03)
Our approach
– Extension of Japanese case marker prediction (Suzuki and Toutanova, 06)

Morphology Prediction
Morphology generation as classification: classify each stem into an inflected form
Source: Eliminate a primary key constraint
Possible inflections:
– eliminare: eliminare, elimino, elimini, eliminiamo, …
– un: un, una
– vincolo: vincolo, vincoli
– di: di, del, dei, della, …
– chiave: chiave, chiavi
– primario: primario, primaria, primari, primarie
System guess: eliminare un vincolo di chiave primario
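The classification view on this slide can be illustrated with a minimal Python sketch. The candidate sets are copied from the Italian example (only the forms actually listed; the slide's "…" is left unfilled). The names `CANDIDATES`, `predict` and `identity_score` are hypothetical, and the scorer is a placeholder that simply prefers the uninflected stem; it is not the model described in the talk.

```python
from typing import Callable, Dict, List

# Candidate inflections per stem, limited to the forms shown on the slide.
CANDIDATES: Dict[str, List[str]] = {
    "eliminare": ["eliminare", "elimino", "elimini", "eliminiamo"],
    "un": ["un", "una"],
    "vincolo": ["vincolo", "vincoli"],
    "di": ["di", "del", "dei", "della"],
    "chiave": ["chiave", "chiavi"],
    "primario": ["primario", "primaria", "primari", "primarie"],
}

Scorer = Callable[[List[str], int, List[str], str], float]


def predict(stems: List[str], score: Scorer) -> List[str]:
    """Choose, for each stem in order, the candidate form with the highest score."""
    output: List[str] = []
    for i, stem in enumerate(stems):
        # Stems missing from the table are passed through unchanged.
        options = CANDIDATES.get(stem, [stem])
        best = max(options, key=lambda form: score(stems, i, output, form))
        output.append(best)
    return output


def identity_score(stems: List[str], i: int, history: List[str], form: str) -> float:
    """Placeholder scorer (an assumption): prefer the form identical to the stem."""
    return 1.0 if form == stems[i] else 0.0


stems = ["eliminare", "un", "vincolo", "di", "chiave", "primario"]
guess = predict(stems, identity_score)
```

With the placeholder scorer every stem is left uninflected, so `guess` is the word sequence the slide shows as the system guess; a trained scorer would replace `identity_score`.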

Outline
Morphology
– Russian, Arabic
– Lexicon operations
The task of inflection prediction
A log-linear model
Features
– Lexical, Syntactic and Morphological
Experiments

Russian Morphology
– 3 genders, 2 numbers, 6 cases (nom, acc, locative, …)
– Nouns have gender, and inflect for number and case
– Adjectives agree with nouns in number, gender, and case
– Verbs agree with the subject in person and number (past tense agrees in gender and number)
Example: У меня есть синий карандаш
– У: at
– меня: me (Pers1, Gen)
– есть: is (Pres)
– синий: blue (Nom, Masc, Sing)
– карандаш: pencil (Nom, Masc, Sing)

Arabic morphology
Arabic: inflection + clitics
– Prefixes: Conj/Prep/Det (in strict order)
– Suffixes: Object pronouns/Possessive pronouns
Agreement:
– In person, number, gender and definiteness
Examples:
– وللمكتبات /walilmaktabāt/ = و + ل + ال + مكتبة + ات = wa+li+al+maktabāt = and+for+the+libraries, "and for the libraries"
– فقلناها /faqulnāhā/ = ف + قال + نا + ها = fa+qul+na+hā = so+said+we+it, "so we said it"
(from Bar-Haim et al) (from Nizar Habash)

Lexicon Operations
Surface word: то
– Stemming: set of possible lemmas
  то, тот
– Inflection: set of possible morphological variants
  того, тому, тем, том, те, тех, теми, то
– Analysis: set of possible morphological analyses
  то: тот+PronAdj+DemPron+Neut+Sg+NomAcc (that)
  то: то+Pron+Neut+Inanim+Sg+NomAcc (it)
  то: то+Conj (then)
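A small Python sketch of the three lexicon operations, using a toy in-memory table. The three analyses of "то" are taken from the slide; the rows for "того" and "тому", their tag strings, and the function names `stem`, `inflect` and `analyze` are assumptions made for illustration, not the interface of the Russian dictionary or the Buckwalter analyzer.

```python
from typing import List, Set, Tuple

# Each row is (surface form, lemma, tags). The "то" rows follow the slide;
# the "того" and "тому" rows and their tags are illustrative assumptions.
ENTRIES: List[Tuple[str, str, str]] = [
    ("то", "тот", "PronAdj+DemPron+Neut+Sg+NomAcc"),
    ("то", "то", "Pron+Neut+Inanim+Sg+NomAcc"),
    ("то", "то", "Conj"),
    ("того", "тот", "PronAdj+DemPron+Neut+Sg+Gen"),
    ("тому", "тот", "PronAdj+DemPron+Neut+Sg+Dat"),
]


def stem(word: str) -> Set[str]:
    """Stemming: all lemmas the surface word can belong to."""
    return {lemma for surface, lemma, _ in ENTRIES if surface == word}


def inflect(word: str) -> Set[str]:
    """Inflection: all surface forms that share a lemma with the word."""
    lemmas = stem(word)
    return {surface for surface, lemma, _ in ENTRIES if lemma in lemmas}


def analyze(word: str) -> List[str]:
    """Analysis: all lemma+tags readings of the surface word."""
    return [f"{lemma}+{tags}" for surface, lemma, tags in ENTRIES if surface == word]
```

In this sketch the inflection set is what an inflection predictor would choose from, while the analysis set supplies the morphological attributes that features can refer to.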

Inflection Prediction Model
Given a sentence, predict the inflection of each word.
Conditional Markov Model over the inflection sequence y_1, y_2, y_3, y_4
– Sentence processed left-to-right (can be applied top-down)
– Features: pairs of target and context predicates
– Can model agreement: POS(y_{i-2})=DT & Number(y_{i-1})=sg & Number(y_i)=sg
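A minimal Python sketch of a left-to-right conditional Markov model with log-linear local distributions. Labels are simplified to (POS, Number) pairs, the two feature templates mirror the agreement example on the slide, and the weights are made up for the demonstration. Greedy decoding is an assumption of this sketch; the slides do not say which search procedure the system uses.

```python
import math
from typing import Dict, List, Tuple

Label = Tuple[str, str]  # (POS, Number); a simplification for this sketch
START: Label = ("<s>", "<s>")


def features(history: List[Label], label: Label) -> List[str]:
    """Pair context predicates on y_{i-1}, y_{i-2} with a target predicate on y_i."""
    prev = history[-1] if history else START
    prev2 = history[-2] if len(history) > 1 else START
    return [
        f"Number(y_i-1)={prev[1]}&Number(y_i)={label[1]}",
        f"POS(y_i-2)={prev2[0]}&Number(y_i-1)={prev[1]}&Number(y_i)={label[1]}",
    ]


def local_probs(history: List[Label], candidates: List[Label],
                weights: Dict[str, float]) -> List[float]:
    """Log-linear (softmax) distribution over the candidates for one position."""
    scores = [sum(weights.get(f, 0.0) for f in features(history, c)) for c in candidates]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def decode(candidate_sets: List[List[Label]], weights: Dict[str, float]) -> List[Label]:
    """Greedy left-to-right decoding (an assumption, not the talk's stated search)."""
    history: List[Label] = []
    for candidates in candidate_sets:
        probs = local_probs(history, candidates, weights)
        best = max(range(len(candidates)), key=probs.__getitem__)
        history.append(candidates[best])
    return history


# Hypothetical weights: reward singular following singular.
weights = {"Number(y_i-1)=sg&Number(y_i)=sg": 2.0}
candidate_sets = [
    [("DT", "sg")],
    [("ADJ", "sg"), ("ADJ", "pl")],
    [("NN", "pl"), ("NN", "sg")],
]
prediction = decode(candidate_sets, weights)
```

Because each decision conditions on the previous two predictions, a single agreement weight is enough here to propagate singular number from the determiner through the adjective to the noun.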

Linguistic annotations
– Source dependency tree
– Surface features
– Projected dependency tree
– Annotations used in Quirk et al (05) system
– POS & morphological features

Features
Feature types (table rows): Lexical, Morph., Syntax
Monolingual:
– stem, left stem, right stem
– y_{i-1}, y_{i-2}
– parent stem, …
– POS(y_{i-1}), number(y_{i-1}), person(y_{i-1}), tense(y_{i-1}), …
Bilingual:
– aligned words a_i
– parent(a_i), left sister(a_i), right sister(a_i)
– POS(a_i), number(a_i), person(a_i), tense(a_i)
– det*(a_i), prep*(a_i), pron*(a_i), …
Inflection (target predicates):
– inflection(y_i), POS(y_i), tense(y_i), number(y_i), …

Arabic:
– [Prev.Stem=qam~-u_qam~, Prep_Inflection=bi]
– [Aligned_Number=Plur, Number_Inflection=pl]
– [AlignedWords=and, Conj_Inflection=true]
– [PrevStem=fiy_y, Prep_Inflection=none]
– [AlignedWords=applications, Gender_Inflection=fem]
Russian:
– [PrevStem=X, Case_Inflection=y]
– [AlignedWords=will,Tense_Inflection=future]
– [AlignedWords=been,Tense_Inflection=past]
– [AlignedWords=click,Tense_Inflection=imperative]
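Features of this shape can be produced by crossing context predicates with predicates on the candidate inflection. The Python sketch below is an illustration under assumptions: the function name `pair_features`, the dictionary inputs, and the exact string format (including the `_Inflection` suffix on the target side) are chosen to resemble the examples above and are not taken from the system itself.

```python
from typing import Dict, List


def pair_features(context: Dict[str, str], target: Dict[str, str]) -> List[str]:
    """Cross every context predicate with every target (inflection) predicate.

    Keys are sorted so the feature list is deterministic.
    """
    return [
        f"[{c_name}={c_value}, {t_name}_Inflection={t_value}]"
        for c_name, c_value in sorted(context.items())
        for t_name, t_value in sorted(target.items())
    ]


# Context predicates come from the source side and the target history;
# target predicates describe the candidate inflection being scored.
context = {"AlignedWords": "will", "PrevStem": "X"}
target = {"Tense": "future"}
feats = pair_features(context, target)
```

Each resulting string would receive one weight in the log-linear model, so the weight on a pair such as aligned word "will" with future tense can be learned directly from aligned data.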

Experiments
Baselines
– Random baseline (pick a label at random)
– Word-trigram language model baseline
  Trained using the CMU toolkit on the same training dataset
Models
– Monolingual word / all, Bilingual word / all
Lexicons:
– Russian dictionary, Arabic: Buckwalter analyzer
– Evaluated only on words in the lexicon
Data:
– Training: 1M (Eng-Russian), 470K (Eng-Arabic)
– Dev: 1K
– Test: 1K
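The random baseline and the evaluation restricted to in-lexicon words can be sketched in a few lines of Python. The function names, the boolean `in_lexicon` mask, and the choice to return 0.0 when no word is scored are assumptions of this sketch; the slides give no implementation details.

```python
import random
from typing import List, Sequence


def random_baseline(candidate_sets: Sequence[Sequence[str]],
                    rng: random.Random) -> List[str]:
    """Pick one candidate inflection uniformly at random for each word."""
    return [rng.choice(list(candidates)) for candidates in candidate_sets]


def accuracy(predicted: Sequence[str], reference: Sequence[str],
             in_lexicon: Sequence[bool]) -> float:
    """Fraction of correct predictions, counting only words found in the lexicon."""
    scored = [(p, r) for p, r, keep in zip(predicted, reference, in_lexicon) if keep]
    if not scored:
        return 0.0  # convention chosen for this sketch
    return sum(p == r for p, r in scored) / len(scored)


candidate_sets = [["un", "una"], ["vincolo", "vincoli"], ["chiave", "chiavi"]]
reference = ["un", "vincolo", "chiave"]
guess = random_baseline(candidate_sets, random.Random(0))
score = accuracy(guess, reference, [True, True, True])
```

Passing a seeded `random.Random` keeps the baseline reproducible across runs.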

Russian inflection prediction: accuracy
– The suggested model performs better than a language model
– Syntactic and morphological features are informative

Arabic inflection prediction: accuracy

Accuracy vs. training data size

Error Analysis
Russian
– Gender of pronoun (it ~ he/she/it)
– Case/Gender in coordinate construction
– Morphological analysis ambiguity
Arabic
– Gender/Number of pronoun
– Definiteness in noun phrases

Summary
– Proposed a general framework for improving SMT into morphology-rich languages
– Showed that morpho-syntactic features and source sentence information, derived from the aligned sentence pair and a lexicon, are effective
– Achieved good results even with little training data

Future Directions
Integration with the MT system
– Initial results for Russian: 1.7 BLEU improvement
Improvements to the model and features
– Morphological disambiguation
– Semantic role labeling
– Longer distance agreements (e.g. pronoun coreference)
More languages

Thanks! Questions?
