Published by Bonnie Hunter. Modified over 9 years ago.
Two-step translation with grammatical post-processing

David Mareček, Rudolf Rosa, Petra Galuščáková, Ondřej Bojar; {marecek, rosa, galuscakova, bojar}@ufal.mff.cuni.cz
Charles University in Prague, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (ÚFAL)
Full text, acknowledgement and the list of references are in the proceedings of WMT 2011.

Goal: Improve Czech grammar
Phrase-based MT often breaks morphological agreement. Some of the dependencies are still recovered in parses of the MT output, so we can check and fix the agreement.

Depfix
Rule-based grammar correction of MT output in Czech. Applicable on top of any MT system.

Our WMT11 submissions

cu-bojar
Simple non-factored Moses setup. Truecasing based on lemmatizer output. Additional parallel data include the Official Journal of the EU. Large monolingual data, including the Czech National Corpus, our web collection... Two LMs (5-gram and 6-gram) weighted by MERT. Each LM is already interpolated from domain-specific sources.

cu-marecek
MT in three steps:
1) Moses: English -> Lemmatized Czech
2) Moses: Lemmatized Czech -> Czech
3) Depfix: Rule-based grammar correction
Only the 1-best output is passed between the steps. The second Moses is trained on a large monolingual corpus. Lemmatized Czech includes morphological features that are overt in English.

Rules

Noun number: In Czech, the plural sometimes has the same form as the singular. Nouns tagged as singular are retagged as plural if the English counterpart is plural.
Subject case: A Czech subject must be in the nominative case.
Subject-Predicate agreement: Subject and predicate must agree in morphological number.
Subject-PastParticiple agreement: A Czech subject and past participle must agree in number and gender.
Preposition-Noun agreement: Nouns must agree with their prepositions in morphological case.
Noun-Adjective agreement: Adjectives must agree with their nouns in number, gender, and case.
Reflexive particle deletion: Czech reflexive verbs are accompanied by reflexive particles ('se', 'si'). Particles not belonging to any verb are deleted.
Prepositions without children: Prepositions cannot be leaves. Nouns are attached to them according to the English source.

Example

English source (en source): A number of people from the Czech Republic did not arrive.

Word        Function  Tag
A           Det       DT
number      Subj      NN
of          AuxP      IN
people      Atr       NN
from        AuxP      IN
the         Det       DT
Czech       Atr       NN
Republic    Atr       NN
did         AuxV      VB
not         AuxV      ??
arrive.     Pred      VB

Czech translated (cs target): Řada lidí z Česká republika nedorazilo. (wrong)

Word        Function  Tag
Řada        Subj      N,fem,sg,nom
lidí        Atr       N,pl,gen
z           AuxP      P,gen
Česká       Atr       A,fem,sg,nom
republika   Adv       N,fem,sg,nom
nedorazilo. Pred      V,neut,sg,past

Czech after depfix (cs target): Řada lidí z České republiky nedorazila. (correct)

Changed word  Function  Tag             Rule
České         Atr       A,fem,sg,gen    NounAdj
republiky     Adv       N,fem,sg,gen    PrepNoun
nedorazila.   Pred      V,fem,sg,past   SubjPP

Impact of rules

Rule      Fired  Improved  Worsened  % Improved
SubjCase     51        46         5        90.2
SubjPP      193       165        28        85.5
NounAdj     434       354        80        81.6
NounNum     156       122        34        78.2
PrepNoun    135        99        36        73.3
SubjPred     68        48        20        70.6
ReflTant     15        10         5        66.7
PrepNoCh     45        29        16        64.4

Depfix improvements in BLEU score

WMT10

System       Before  After  Improvement
cmu-heaf.     16.95  17.04         0.09
cu-bojar      15.85  16.09         0.24
cu-zeman      12.33  12.55         0.22
dcu           13.36  13.59         0.23
dcu-combo     18.79  18.90         0.11
eurotrans     10.10  10.11         0.01
koc           11.74  11.91         0.17
koc-combo     16.60  16.86         0.26
onlineA       11.81  12.08         0.27
onlineB       16.57  16.79         0.22
potsdam       12.34  12.57         0.23
rwth-combo    17.54  17.79         0.25
sfu           11.43  11.83         0.40
uedin         15.91  16.19         0.28
upv-combo     17.51  17.73         0.22

WMT11

System       Before  After  Improvement
cu-twostep    16.57  16.60         0.03
cmu-heaf.     20.24  20.32         0.08
commerc2       9.32   9.32         0.00
cu-bojar      16.88  16.85        -0.03
cu-popel      14.12  14.11        -0.01
cu-tamch.     16.32  16.28        -0.04
cu-zeman      14.61  14.80         0.19
jhu           17.36  17.42         0.06
online-B      20.26  20.31         0.05
uedin         17.80  17.88         0.08
upv-prhlt.    20.68  20.69         0.01

Manual evaluation

System            Annotator  Changed  Improved      Worsened     Indefinite
cu-bojar-twostep  A              269  152 (56.5%)   39 (14.5%)   78 (29.0%)
cu-bojar-twostep  B              269  173 (64.3%)   50 (18.6%)   46 (17.1%)
online-B          A              247  156 (63.1%)   39 (15.9%)   52 (21.1%)
online-B          B              247  165 (66.8%)   64 (25.9%)   18 (7.3%)

Manual and automatic evaluation across different datasets

Test set             Changed  Improved     Worsened     Indefinite   BLEU before  BLEU after  Diff
newssyscombtest2010      104  52 (50.0%)   20 (19.2%)   32 (30.8%)         16.99       17.38   0.39
newssyscombtest2011      101  66 (65.3%)   19 (18.8%)   16 (15.8%)         13.99       13.87  -0.12

Interannotator agreement (rows: annotator A, columns: annotator B)

A/B     Impr.  Wors.  Indef.
Impr.     273     20      15
Wors.      12     59       7
Indef.     53     35      42
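The Subject-PastParticiple (SubjPP) correction on the example sentence can be sketched roughly as follows. This is a minimal illustration, not the Depfix implementation: the `Node` class, the `fix_subj_pp` function, the lemma `nedorazit`, and the tiny `TOY_GENERATOR` lookup (a stand-in for a real morphological generator) are all assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    form: str
    lemma: str
    afun: str                      # analytical function, e.g. "Subj", "Pred"
    pos: str                       # coarse part of speech: "N", "V", ...
    gender: str
    number: str
    tense: str = ""
    parent: Optional[int] = None   # index of the governing node, if any


# Hypothetical stand-in for a morphological generator:
# (lemma, gender, number, tense) -> word form.
TOY_GENERATOR = {
    ("nedorazit", "fem", "sg", "past"): "nedorazila",
    ("nedorazit", "neut", "sg", "past"): "nedorazilo",
}


def fix_subj_pp(nodes):
    """Make past-tense verbs agree with their subject in gender and number.

    Returns the number of verbs that were changed.
    """
    fixes = 0
    for node in nodes:
        if node.afun != "Subj" or node.parent is None:
            continue
        verb = nodes[node.parent]
        if verb.pos != "V" or verb.tense != "past":
            continue
        if (verb.gender, verb.number) == (node.gender, node.number):
            continue  # already in agreement
        key = (verb.lemma, node.gender, node.number, verb.tense)
        new_form = TOY_GENERATOR.get(key)
        if new_form is None:
            continue  # no form available, leave the verb untouched
        verb.gender, verb.number, verb.form = node.gender, node.number, new_form
        fixes += 1
    return fixes


# Only the two nodes involved in the rule are modelled here.
sentence = [
    Node("Řada", "řada", "Subj", "N", "fem", "sg", parent=1),
    Node("nedorazilo", "nedorazit", "Pred", "V", "neut", "sg", tense="past"),
]
fix_subj_pp(sentence)
print(sentence[1].form)
```

The rule only copies gender and number from the subject to the verb and regenerates the form; it leaves the verb alone when the generator has no matching entry, which is one plausible way to avoid making the output worse.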
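The interannotator agreement table can be summarized numerically. The sketch below computes the observed agreement (the share of the 516 judged sentences on the diagonal, 374 of 516) and Cohen's kappa. Kappa is an addition made here for illustration and is not a figure reported on the poster; the variable names are likewise assumptions.

```python
# Rows: annotator A, columns: annotator B.
# Category order: improved, worsened, indefinite.
matrix = [
    [273, 20, 15],
    [12, 59, 7],
    [53, 35, 42],
]

total = sum(sum(row) for row in matrix)
row_sums = [sum(row) for row in matrix]
col_sums = [sum(col) for col in zip(*matrix)]

# Observed agreement: both annotators chose the same category.
observed = sum(matrix[i][i] for i in range(3)) / total

# Chance agreement expected from the two annotators' marginal distributions.
expected = sum(r * c for r, c in zip(row_sums, col_sums)) / total ** 2

# Cohen's kappa: agreement beyond chance, normalized.
kappa = (observed - expected) / (1 - expected)

print(f"observed agreement: {observed:.3f}")
print(f"Cohen's kappa:      {kappa:.3f}")
```

The row and column sums also serve as a consistency check against the manual evaluation table: annotator A marked 152 + 156 = 308 sentences as improved, and annotator B marked 173 + 165 = 338.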