1
The CMU Arabic-to-English Statistical MT System
Alicia Tribble, Stephan Vogel
Language Technologies Institute, Carnegie Mellon University
2
The Data
- For the translation model:
  - UN corpus: 80 million words
  - Ummah
  - Some smaller news corpora
- For the LM:
  - English side of the bilingual corpus: the language model should have seen the words generated by the translation model
  - Additional data from Xinhua news
- General preprocessing and cleaning:
  - Separate punctuation marks
  - Remove sentence pairs with a large length mismatch
  - Remove sentences with too many non-words (numbers, special characters)
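The cleaning steps above can be sketched as a single filter over sentence pairs. This is a minimal illustration, not the actual CMU pipeline; the length-ratio and non-word thresholds are assumptions chosen for the example.

```python
import re

def clean_pair(src, tgt, max_ratio=2.0, max_nonword_frac=0.5):
    """Apply the cleaning steps to one sentence pair.
    Returns the cleaned (src, tgt) pair, or None if it should be dropped.
    Thresholds are illustrative, not the values used in the actual system."""
    # Separate punctuation marks from adjoining words.
    sep = lambda s: re.sub(r'([.,;:!?"()])', r' \1 ', s)
    src, tgt = sep(src).split(), sep(tgt).split()
    if not src or not tgt:
        return None
    # Drop pairs with a large length mismatch.
    if max(len(src), len(tgt)) / min(len(src), len(tgt)) > max_ratio:
        return None
    # Drop pairs dominated by non-words (numbers, special characters).
    nonword = sum(1 for t in src + tgt if not any(c.isalpha() for c in t))
    if nonword / (len(src) + len(tgt)) > max_nonword_frac:
        return None
    return ' '.join(src), ' '.join(tgt)
```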
3
The System
- Alignment models: IBM1 and HMM, trained in both directions
- Phrase extraction:
  - From the Viterbi path of the HMM alignment
  - Integrated Segmentation and Alignment
- Decoder:
  - Works essentially left to right over the source sentence
  - Builds a translation lattice with partial translations
  - Finds the best path, allowing for local reordering
  - Sentence length model
  - Pruning: remove low-scoring hypotheses
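Extracting phrases from a word alignment typically means collecting all source/target span pairs that are consistent with the alignment: no word inside the phrase pair may be aligned to a word outside it. The sketch below shows this standard consistency check; it is not necessarily the exact Integrated Segmentation and Alignment method used in the system.

```python
def extract_phrases(alignment, src_len, tgt_len, max_len=4):
    """Extract phrase pairs (as span index pairs) consistent with a
    word alignment, given as a set of (src_idx, tgt_idx) links.
    A standard consistency-based sketch, not the exact ISA algorithm."""
    pairs = set()
    for i1 in range(src_len):
        for i2 in range(i1, min(i1 + max_len, src_len)):
            # Target positions covered by the source span [i1, i2].
            tgts = [j for (i, j) in alignment if i1 <= i <= i2]
            if not tgts:
                continue
            j1, j2 = min(tgts), max(tgts)
            if j2 - j1 >= max_len:
                continue
            # Consistency: nothing in [j1, j2] may align outside [i1, i2].
            if all(i1 <= i <= i2 for (i, j) in alignment if j1 <= j <= j2):
                pairs.add(((i1, i2), (j1, j2)))
    return pairs
```

Applied to the Viterbi path of the HMM alignment, this yields the phrase pairs that populate the translation lattice.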
4
Some Results
- Two test sets: DevTest (203 sentences) and May 2003
- Baseline: monotone decoding
- RO: word reordering
- SL: sentence length model

               DevTest            May 2003
               NIST     Bleu4     NIST
  Baseline     8.59     0.385     8.95
  RO           9.02     0.441     9.26
  RO + SL      9.24     0.455     ?
5
Questions
- What's specific to Arabic?
  - Encoding
  - Named entities
  - Syntax and morphology
- What's needed to get further improvements?
6
What's Specific to Arabic
- Right-to-left is not really an issue, as this concerns only display; text in the file runs left to right
- Problem in the UN corpus: numbers (Latin characters) are sometimes in the wrong direction, e.g. 1997 -> 7991
- Data is not in vocalized form
  - Vocalization not really studied
  - The ambiguity can be handled by statistical systems
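The reversed-number problem can be repaired by reversing runs of ASCII digits. A minimal sketch: deciding *which* spans are actually stored in reversed (visual) order requires corpus-specific heuristics, so this assumes the affected spans have already been identified.

```python
import re

def reverse_digit_runs(text):
    """Reverse each run of ASCII digits, e.g. '7991' -> '1997'.
    Apply only to spans known to be stored in reversed visual order."""
    return re.sub(r'\d+', lambda m: m.group(0)[::-1], text)
```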
7
Encoding and Vocalization
- Encoding:
  - Different encodings: Unicode, UTF-8, CP-1256, romanized forms; not too bad, definitely not as bad as Hindi ;-)
  - Needed to convert, e.g. training and testing data were in different encodings
  - Not all conversions are lossless
- Used the romanized form for processing:
  - Converted all data using the 'Darwish' transliteration
  - Several characters (ya, alef, hamza) are collapsed into two classes
  - Conversion is not completely reversible
- Effect of normalization:
  - Reduction in vocabulary: ~5%
  - Reduction of singletons: >10%
  - Reduction of 3-gram perplexity: ~5%
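A character-class normalization of this kind can be expressed as a simple translation table. The mapping below is one plausible reading of the two classes (alef variants to bare alef, alef maqsura to ya); the exact character set collapsed by the Darwish transliteration may differ.

```python
# One plausible mapping for the two collapsed classes; the exact set of
# characters merged by the Darwish transliteration may differ.
NORMALIZE = str.maketrans({
    '\u0622': '\u0627',  # alef with madda above -> bare alef
    '\u0623': '\u0627',  # alef with hamza above -> bare alef
    '\u0625': '\u0627',  # alef with hamza below -> bare alef
    '\u0649': '\u064A',  # alef maqsura          -> ya
})

def normalize(text):
    """Collapse alef and ya variants into one representative each."""
    return text.translate(NORMALIZE)
```

Such a many-to-one mapping is exactly why the conversion is not completely reversible.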
8
Named Entities
- NEs gave a small but significant improvement in translation quality in the Chinese-English system
- In Chinese, unknown words are split into single characters, which are then translated as individual words
- Arabic has no such segmentation issues -> the damage is less severe
- NEs not used so far for Arabic, but work on them has started
9
Language-Specific Issues for Arabic MT
- Syntactic issues: error analysis revealed two common syntactic errors
  - Verb-noun reordering
  - Subject-verb reordering
- Morphology issues: problems specific to Arabic morphology
  - Based on the Darwish transliteration
  - Based on the Buckwalter transliteration
  - Poor man's morphology
10
Syntax Issues: Adjective-Noun Reordering
- Adjectives and nouns are frequently reordered between Arabic and English
- Example: EN 'big green chair' vs AR 'chair green big'
- Experiment: identify noun-adjective sequences in Arabic and reorder them in a preprocessing step
- Problem: often long sequences, e.g. N N Adj Adj N Adj N N
- Result: no improvement
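The preprocessing step above can be sketched for the simple case of a noun followed by a run of adjectives. The POS tags ('N', 'ADJ') are assumed inputs from a tagger; as the slide notes, long mixed N/Adj sequences are precisely where a rule this simple breaks down.

```python
def reorder_noun_adj(tokens, tags):
    """Move a run of adjectives that follows a noun to before it,
    reversing the run, so AR-order 'chair green big' becomes
    EN-order 'big green chair'. Tags are assumed tagger output."""
    out, i = [], 0
    while i < len(tokens):
        if tags[i] == 'N':
            j = i + 1
            while j < len(tokens) and tags[j] == 'ADJ':
                j += 1
            out.extend(tokens[i + 1:j][::-1])  # the adjectives, reversed
            out.append(tokens[i])              # then the noun
            i = j
        else:
            out.append(tokens[i])
            i += 1
    return out
```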
11
Syntax Issues: Subject-Verb Reordering
- Arabic: the main verb comes at the beginning of the sentence, followed by its subject
- English: order prefers the subject to precede the verb
- Example: EN 'the President visited Egypt' vs AR 'visited Egypt the President'
- Experiment: identify verbs at the beginning of the Arabic sentence and move them to a position following the first noun
  - No full parsing
  - Done as a preprocessing step on the Arabic side
- Result: no effect
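This preprocessing experiment can be sketched as follows; again the POS tags are assumed tagger output, and there is no parsing, just a move of the initial verb to after the first following noun.

```python
def move_initial_verb(tokens, tags):
    """If the sentence starts with a verb, move it to just after the
    first following noun. No parsing: a pure preprocessing heuristic."""
    if not tokens or tags[0] != 'V':
        return list(tokens)
    for k in range(1, len(tokens)):
        if tags[k] == 'N':
            return tokens[1:k + 1] + [tokens[0]] + tokens[k + 1:]
    return list(tokens)  # no noun found: leave the sentence unchanged
```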
12
Morphology Issues
- Structural mismatch between English and Arabic: Arabic has richer morphology
  - Types Ar:En ~ 2.2 : 1
  - Tokens Ar:En ~ 0.9 : 1
- Tried two different tools for morphological analysis:
  - Buckwalter analyzer (http://www.xrce.xerox.com/competencies/content-analysis/arabic/info/buckwalter-about.html)
    - 1-1 transliteration scheme for Arabic characters
  - Darwish analyzer (www.cs.umd.edu/Library/TRs/CS-TR-4326/CS-TR-4326.pdf)
    - Several characters (ya, alef, hamza) are collapsed into two classes with one character representative each
13
Morphology with the Darwish Transliteration
- Addressed the compositional part of Arabic morphology, since this contributes to the structural mismatch between Arabic and English
- Goal: better word-level alignment
- The toolkit comes with a stemmer; created a modified version that separates affixes instead of removing them
- Experiment 1: trained on stemmed data
  - Arabic types reduced by ~60%, nearly matching the number of English types
  - But this loses discriminative power
- Experiment 2: trained on affix-separated data
  - Number of tokens increased
  - Mismatch in tokens much larger
- Result: doing morphology monolingually can even increase the structural mismatch
14
Morphology with the Buckwalter Transliteration
- Focused on DET and CONJ prefixes:
  - Arabic: 'the' and 'and' are frequently attached to nouns and adjectives
  - English: always separate
- Different splitting strategies:
  - Loosest: use all prefixes and split even if the remaining word is not a stem
  - More conservative: use only prefixes classified as DET or CONJ
  - Most conservative: full analysis; split only if the word can be analyzed as a DET or CONJ prefix plus a legitimate stem
- Experiments: train on each kind of split data
- Result: all set-ups gave lower scores
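The most conservative strategy can be sketched as: split a DET/CONJ prefix only when the remainder is a known stem. The prefix strings below are the Buckwalter romanizations of 'and' (w), 'the' (Al), and their combination; the stem lexicon and the exact prefix inventory are assumptions for illustration.

```python
def split_prefixes(word, lexicon, prefixes=('w', 'Al', 'wAl')):
    """Most conservative splitting: detach a DET/CONJ prefix (given here
    in Buckwalter romanization) only if the remainder is a known stem.
    Prefix inventory and lexicon are illustrative assumptions."""
    # Try the longest prefix first so 'wAl' wins over 'w'.
    for p in sorted(prefixes, key=len, reverse=True):
        if word.startswith(p) and word[len(p):] in lexicon:
            return [p, word[len(p):]]
    return [word]
```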
15
Poor Man's Morphology
- List of pre- and suffixes compiled by a native speaker
- Applied only to unknown words
- Remove more and more pre- and suffixes; stop when the stripped word is in the trained lexicon
- Typically 1/2 to 2/3 of the unknown words can be mapped to known words
- The translation is not always correct, so the overall improvement is limited
- Result: so far this has been (for us) the only morphological processing that gave a small improvement
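The strip-until-known procedure can be sketched as a breadth-first search over affix removals, so the fewest strips win. The affix lists below are a small illustrative sample in Buckwalter romanization, standing in for the native speaker's compiled list.

```python
def map_unknown(word, lexicon, prefixes=('w', 'f', 'Al', 'b'),
                suffixes=('At', 'wn', 'p', 'hA')):
    """Strip more and more pre- and suffixes from an unknown word
    (fewest strips first) until the remainder is in the trained lexicon.
    Returns the known form, or None. Affix lists are illustrative."""
    seen, queue = {word}, [word]
    while queue:
        cand = queue.pop(0)
        if cand in lexicon:
            return cand
        for p in prefixes:
            if cand.startswith(p) and len(cand) > len(p):
                nxt = cand[len(p):]
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        for s in suffixes:
            if cand.endswith(s) and len(cand) > len(s):
                nxt = cand[:-len(s)]
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return None
```

The lexicon entry found this way is then translated in place of the unknown word, which is why the mapping helps only when the stripped form really is the right stem.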
16
Experience with Morphology and Syntax
- Initial experiments with full morphological analysis did not give an improvement
- Most words are seen in the large corpus
  - Unknown words: < 5% of tokens, < 10% of types
  - Simple prefix splitting cuts this roughly in half
- Phrase translation captures some of the agreement information
- Local word reordering in the decoder reduces word-order problems
- We still believe that morphology could give an additional improvement
17
Requirements for Improvements
- Data:
  - More specific data: we have a large corpus (UN) but only small news corpora
  - A manual dictionary could help, as it does for Chinese
- Better use of existing resources:
  - Lexicon not trained on all data
  - Treebanks not used
- Continued improvement of models and decoder:
  - Recent decoder improvements (word reordering, overlapping phrases, sentence length model) helped for Arabic
  - Expect improvement from named entities
  - Integrate morphology and alignment