Published by Ira Buck Harrison. Modified over 8 years ago.
1
Learning to Generate Complex Morphology for Machine Translation
Einat Minkov†, Kristina Toutanova* and Hisami Suzuki*
*Microsoft Research
†Carnegie Mellon University
2
Motivation
I would like to meet this nice woman.
اود ان مواجهه هذا جيد امراه.
Annotations on the Arabic words: gender labels "fem", "masc"; glosses "woman", "nice", "this"
3
Motivation
4
System guess (Quirk et al, 05)
5
Motivation Correct System guess (Quirk et al, 05)
6
Motivation Correct System guess (Quirk et al, 05)
7
SMT challenges for English to morphology-rich language
Information ‘missing’ on source side
Data sparsity
Morphological agreement in the target language
8
Related work
Translation from morphology-rich languages to English
– Preprocessing of the inputs, to improve alignments: Arabic (Lee, 04), German (Koehn and Knight, 03; Nießen and Ney, 04; Popović and Ney, 04; Collins et al., 05), Czech (Goldwater and McClosky, 05)
Translation from English to morphology-rich languages
– Preprocessing and postprocessing: Turkish (El-Kahlout and Oflazer, 06), Spanish and Catalan (Ueffing and Ney, 03)
Our approach
– Extension of Japanese case marker prediction (Suzuki and Toutanova, 06)
9
Morphology Prediction
Morphology generation as classification: classify each stem into an inflected form
Source: Eliminate a primary key constraint
Possible inflections:
– eliminare, elimino, elimini, eliminiamo, …
– un, una
– vincolo, vincoli
– di, del, dei, della, …
– chiave, chiavi
– primario, primaria, primari, primarie
System guess: eliminare un vincolo di chiave primario
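The classification view on this slide can be sketched in a few lines of Python. The candidate sets are the ones listed on the slide (the elided "…" entries are left out). Using the first form of each group as the stem key is an assumption, and `toy_score` is a hypothetical stand-in for a trained model, not the scorer used in the talk.

```python
# Candidate inflected forms per stem, copied from the slide.
# Assumption: the first form of each group serves as the stem key.
CANDIDATES = {
    "eliminare": ["eliminare", "elimino", "elimini", "eliminiamo"],
    "un": ["un", "una"],
    "vincolo": ["vincolo", "vincoli"],
    "di": ["di", "del", "dei", "della"],
    "chiave": ["chiave", "chiavi"],
    "primario": ["primario", "primaria", "primari", "primarie"],
}


def predict(stems, score):
    """Classify each stem into one of its candidate inflected forms."""
    return [max(CANDIDATES[s], key=lambda form: score(s, form)) for s in stems]


def toy_score(stem, form):
    # Hypothetical scorer: prefers the form identical to the stem.
    return 1.0 if form == stem else 0.0


stems = ["eliminare", "un", "vincolo", "di", "chiave", "primario"]
guess = predict(stems, toy_score)
print(" ".join(guess))  # eliminare un vincolo di chiave primario
```

A real scorer would look at the surrounding words and the source sentence; the point here is only that prediction is a choice among a closed set of forms per stem.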
10
Outline
Morphology
– Russian, Arabic
– Lexicon operations
The task of inflection prediction
A log-linear model
Features
– Lexical, Syntactic and Morphological
Experiments
11
Russian Morphology
3 genders, 2 numbers, 6 cases (nom, acc, location …)
Nouns have gender, and inflect for number and case
Adjectives agree with nouns in number, gender, and case
Verbs agree with Subject person and number (past tense agrees with gender and number)
Example: У меня есть синий карандаш
– У (at)
– меня (me): Pers1, Gen
– есть (is): Pres
– синий (blue): Nom, Masc, Sing
– карандаш (pencil): Nom, Masc, Sing
12
Arabic morphology
Arabic: inflection + clitics
– Prefixes: Conj/Prep/Det (in strict order)
– Suffixes: Object pronouns/Possessive pronouns
Agreement:
– In person, number, gender and definiteness
Examples:
– وللمكتبات /walilmaktabāt/ = و + ل + ال + مكتبة + ات = wa+li+al+maktabāt = and+for+the+libraries, "and for the libraries"
– فقلناها /faqulnāhā/ = ف + قال + نا + ها = fa+qul+na+hā = so+said+we+it, "so we said it"
(from Bar-Haim et al) (from Nizar Habash)
13
Lexicon Operations
Surface word: то
Stemming: set of possible lemmas
– то, тот
Inflection: set of possible morphological variants
– того, тому, тем, том, те, тех, теми, то
Analysis: set of possible morphological analyses
– тот+PronAdj+DemPron+Neut+Sg+NomAcc (that)
– то+Pron+Neut+Inanim+Sg+NomAcc (it)
– то+Conj (then)
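The three lexicon operations can be pictured as lookups keyed by the surface word. The sketch below is a toy lexicon that holds only the entry for "то" shown on the slide. The function names and the dictionary layout are assumptions made for illustration, not the interface of the lexicons used in the talk.

```python
# Toy lexicon containing only the slide's entry for the surface word "то".
LEMMAS = {"то": ["то", "тот"]}
VARIANTS = {"то": ["того", "тому", "тем", "том", "те", "тех", "теми", "то"]}
ANALYSES = {
    "то": [
        "тот+PronAdj+DemPron+Neut+Sg+NomAcc",  # that
        "то+Pron+Neut+Inanim+Sg+NomAcc",       # it
        "то+Conj",                             # then
    ]
}


def stemming(word):
    """Return the set of possible lemmas for a surface word."""
    return LEMMAS.get(word, [])


def inflection(word):
    """Return the set of possible morphological variants for a surface word."""
    return VARIANTS.get(word, [])


def analysis(word):
    """Return the set of possible morphological analyses for a surface word."""
    return ANALYSES.get(word, [])


print(stemming("то"))       # two candidate lemmas
print(len(analysis("то")))  # three candidate analyses
```

Words missing from the lexicon return an empty list, which gives a caller a simple way to skip them.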
14
Inflection Prediction Model
Given a sentence, predict the inflection of each word.
Conditional Markov Model: y_1, y_2, y_3, y_4
Sentence processed left-to-right (can be applied top-down)
Features: pairs of target and context predicates
Can model agreement: POS(y_{i-2})=DT & Number(y_{i-1})=sg & Number(y_i)=sg
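A minimal sketch of left-to-right prediction with a locally normalized log-linear model follows. The first feature is the agreement example from the slide. Everything else is an assumption for illustration: the second feature, the weights, the toy candidate sets, and the greedy choice at each position (the slide states only that the sentence is processed left-to-right).

```python
import math

# Hypothetical candidate inflections, each with its POS and Number.
CANDIDATES = {
    "un": [{"form": "un", "POS": "DT", "Number": "sg"}],
    "vincolo": [{"form": "vincolo", "POS": "NN", "Number": "sg"},
                {"form": "vincoli", "POS": "NN", "Number": "pl"}],
    "primario": [{"form": "primario", "POS": "JJ", "Number": "sg"},
                 {"form": "primari", "POS": "JJ", "Number": "pl"}],
}


def agree_after_dt(prev2, prev1, cur):
    # Slide example: POS(y_{i-2})=DT & Number(y_{i-1})=sg & Number(y_i)=sg
    return (prev2 is not None and prev2["POS"] == "DT"
            and prev1 is not None and prev1["Number"] == "sg"
            and cur["Number"] == "sg")


def sg_after_sg_dt(prev2, prev1, cur):
    # Hypothetical extra feature: singular word right after a singular determiner.
    return (prev1 is not None and prev1["POS"] == "DT"
            and prev1["Number"] == "sg" and cur["Number"] == "sg")


# (feature function, weight) pairs; the weights are made up.
FEATURES = [(agree_after_dt, 2.0), (sg_after_sg_dt, 1.0)]


def local_probs(prev2, prev1, candidates):
    """Log-linear distribution over candidates given the two previous labels."""
    scores = [sum(w for f, w in FEATURES if f(prev2, prev1, c)) for c in candidates]
    z = sum(math.exp(s) for s in scores)
    return [math.exp(s) / z for s in scores]


def decode(stems):
    """Greedy left-to-right decoding (an assumption; other searches are possible)."""
    history = [None, None]
    output = []
    for stem in stems:
        cands = CANDIDATES[stem]
        probs = local_probs(history[-2], history[-1], cands)
        best = max(range(len(cands)), key=probs.__getitem__)
        history.append(cands[best])
        output.append(cands[best]["form"])
    return output


print(decode(["un", "vincolo", "primario"]))  # ['un', 'vincolo', 'primario']
```

Because each decision conditions on the two previous predictions, the singular choice for "vincolo" is what lets the agreement feature fire for "primario".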
15
Linguistic annotations
Source dependency tree
Surface features
Projected dependency tree
Annotations used in Quirk et al. (05) system
POS & morphological features
16
Features
Feature groups (row labels on the slide): Lexical, Morph., Syntax
Monolingual: stem; left stem; right stem; y_{i-1}, y_{i-2}; parent stem; …; POS(y_{i-1}); number(y_{i-1}); person(y_{i-1}); tense(y_{i-1}); …
Bilingual: aligned words a_i; parent(a_i); left sister(a_i); right sister(a_i); POS(a_i); number(a_i); person(a_i); tense(a_i); det*(a_i); prep*(a_i); pron*(a_i); …
Inflection: inflection(y_i); POS(y_i); tense(y_i); number(y_i); …
17
Arabic
[Prev.Stem=qam~-u_qam~, Prep_Inflection=bi]
[Aligned_Number=Plur, Number_Inflection=pl]
[AlignedWords=and, Conj_Inflection=true]
[PrevStem=fiy_y, Prep_Inflection=none]
[AlignedWords=applications, Gender_Inflection=fem]
Russian
[PrevStem=X, Case_Inflection=y]
[AlignedWords=will,Tense_Inflection=future]
[AlignedWords=been,Tense_Inflection=past]
[AlignedWords=click,Tense_Inflection=imperative]
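Each bracketed example above pairs a context predicate with a target (inflection) predicate. The sketch below parses that notation and treats each pair as a binary indicator that fires when both predicates hold. Reading the pairs as indicators is an assumption based on the bracket notation, and `parse_feature` and `fires` are hypothetical helper names.

```python
def parse_feature(text):
    """Split '[Context=value, Target=value]' into two (name, value) pairs."""
    context, target = (part.strip() for part in text.strip("[]").split(","))
    return tuple(context.split("=", 1)), tuple(target.split("=", 1))


def fires(feature, context, target):
    """Return True when both the context and the target predicate hold."""
    (c_name, c_val), (t_name, t_val) = feature
    return context.get(c_name) == c_val and target.get(t_name) == t_val


feature = parse_feature("[AlignedWords=will,Tense_Inflection=future]")
print(feature)

# The aligned English word is "will" and the candidate inflection is future tense.
print(fires(feature, {"AlignedWords": "will"}, {"Tense_Inflection": "future"}))
# Same context, but a past-tense candidate: the indicator stays off.
print(fires(feature, {"AlignedWords": "will"}, {"Tense_Inflection": "past"}))
```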
18
Experiments
Baselines
– Random baseline (pick a label at random)
– Word-trigram language model baseline, trained using the CMU toolkit on the same training dataset
Models
– Monolingual word / all, Bilingual word / all
Lexicons
– Russian dictionary; Arabic: Buckwalter analyzer
– Evaluated only on words in the lexicon
Data        Eng-Russian   Eng-Arabic
Training    1M            470K
Dev         1K
Test        1K
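The random baseline and the "evaluated only on words in the lexicon" rule can be sketched as below. The toy lexicon, the token lists, and the function names are hypothetical; the talk does not give this level of detail about its evaluation script.

```python
import random


def random_baseline(stems, lexicon, rng):
    """Pick a candidate form at random for each stem found in the lexicon."""
    return [rng.choice(lexicon[s]) if s in lexicon else None for s in stems]


def accuracy(stems, predicted, reference, lexicon):
    """Fraction of correct forms, counting only stems covered by the lexicon."""
    scored = [(p, r) for s, p, r in zip(stems, predicted, reference) if s in lexicon]
    if not scored:
        return 0.0
    return sum(p == r for p, r in scored) / len(scored)


# Hypothetical toy data: "xyz" is not in the lexicon, so it is not scored.
lexicon = {"vincolo": ["vincolo", "vincoli"], "chiave": ["chiave", "chiavi"]}
stems = ["vincolo", "xyz", "chiave"]
reference = ["vincolo", "xyz", "chiave"]
predicted = ["vincolo", "abc", "chiavi"]

print(accuracy(stems, predicted, reference, lexicon))  # 0.5

rng = random.Random(0)  # seeded so the run is repeatable
baseline = random_baseline(stems, lexicon, rng)
print(accuracy(stems, baseline, reference, lexicon))
```

Skipping out-of-lexicon words keeps the score focused on cases where the model actually had a choice to make.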
19
Russian inflection prediction: accuracy
The suggested model performs better than the language model baseline
Syntactic and morphological features are informative
20
Arabic inflection prediction: accuracy
21
Accuracy vs. training data size
22
Error Analysis
Russian
– Gender of pronoun (it ~ he/she/it)
– Case/Gender in coordinate construction
– Morphological analysis ambiguity
Arabic
– Gender/Number of pronoun
– Definiteness in noun phrases
23
Summary
Proposed a general framework for improving SMT into morphology-rich languages
Showed that morpho-syntactic features and source sentence information, derived from aligned sentence pairs and a lexicon, are effective
Achieved good results even with little training data
24
Future Directions
Integration with the MT system
– Initial results for Russian: 1.7 BLEU improvement
Improvements to the model and features
– Morphological disambiguation
– Semantic role labeling
– Longer distance agreements (e.g. pronoun coreference)
More languages
25
Thanks! Questions?