Rapid Prototyping of a Transfer-based Hebrew-to-English Machine Translation System Alon Lavie Language Technologies Institute Carnegie Mellon University.

Joint work with: Shuly Wintner, Yaniv Eytani (University of Haifa); Erik Peterson, Katharina Probst (Carnegie Mellon)

October 4, 2004, TMI-2004

Outline
Hebrew and its Challenges for MT
CMU Transfer-based MT Framework
Hebrew-to-English System
Input pre-processing and morphological analysis
MT Resources: lexicon and grammar
Performance Evaluation
Conclusions, Current and Future Work

Modern Hebrew
Native language of about 3-4 million people in Israel
Semitic language, closely related to Arabic and with similar linguistic properties
  – Root+Pattern word formation system
  – Rich verb and noun morphology
  – Particles attach as prefixes to the following word: definite article (H), prepositions (B, K, L, M), coordinating conjunction (W), relativizers ($, K$)…
Unique alphabet and writing system
  – 22 letters represent (mostly) consonants
  – Vowels represented (mostly) by diacritics
  – Modern texts omit the diacritic vowels, which adds a further level of ambiguity: “bare” word → word
  – Example: MHGR → mehager, m+hagar, m+h+ger
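The MHGR ambiguity can be illustrated with a small sketch that enumerates the ways prefix particles can be peeled off a bare romanized word. The particle list comes from the slide; the function name `split_prefixes` and the tiny stem lexicon are hypothetical and chosen only to reproduce the three readings of the example, not to reflect the actual analyzer.

```python
# Prefix particles listed on the slide (romanized); K$ is a two-symbol relativizer.
PREFIXES = ["H", "B", "K", "L", "M", "W", "$", "K$"]

def split_prefixes(word, stems):
    """Return every reading of `word` as zero or more prefixes plus a known stem."""
    results = []
    if word in stems:
        results.append([word])
    for p in PREFIXES:
        if word.startswith(p) and len(word) > len(p):
            for rest in split_prefixes(word[len(p):], stems):
                results.append([p] + rest)
    return results

# Hypothetical stem lexicon (an assumption for illustration only).
stems = {"MHGR", "HGR", "GR"}
analyses = split_prefixes("MHGR", stems)
for analysis in analyses:
    print("+".join(analysis))
# prints MHGR, M+HGR, M+H+GR (one per line)
```

Each extra particle that happens to match the start of a word multiplies the number of candidate readings, which is why disambiguation is deferred to later stages.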

Modern Hebrew Spelling
Two main spelling variants
  – “KTIV XASER” (deficient): spelling with the vowel diacritics, leaving consonant-only words when the diacritics are removed
  – “KTIV MALEH” (full): words with I/O/U vowels are written with long vowels that include a letter
KTIV MALEH is predominant, but not strictly adhered to even in newspapers and official publications → inconsistent spelling
Example:
  – niqud (spelling): NIQWD, NQWD, NQD
  – Written as NQD, could also be niqed, naqed, nuqad

Challenges for Hebrew MT
Paucity of existing language resources for Hebrew
  – No publicly available broad-coverage morphological analyzer
  – No publicly available bilingual lexicons or dictionaries
  – No POS-tagged corpus or parse tree-bank corpus for Hebrew
  – No large Hebrew/English parallel corpus
Scenario well suited for the CMU transfer-based MT framework for languages with limited resources

[Figure: system architecture]
Hebrew Input: בשורה הבאה
Flow: Preprocessing → Morphology → Transfer Engine → Translation Output Lattice → Decoder → English Output
English Output: in the next line

Transfer Rules (Transfer Engine):
{NP1,3}
NP1::NP1 [NP1 "H" ADJ] -> [ADJ NP1]
(
(X3::Y1) (X1::Y2)
((X1 def) = +)
((X1 status) =c absolute)
((X1 num) = (X3 num))
((X1 gen) = (X3 gen))
(X0 = X1)
)

Translation Lexicon (Transfer Engine):
N::N |: ["$WR"] -> ["BULL"]
((X1::Y1) ((X0 NUM) = s) ((Y0 lex) = "BULL"))
N::N |: ["$WRH"] -> ["LINE"]
((X1::Y1) ((X0 NUM) = s) ((Y0 lex) = "LINE"))

Translation Output Lattice:
(0 1 "IN" @PREP)
(1 1 "THE" @DET)
(2 2 "LINE" @N)
(1 2 "THE LINE" @NP)
(0 2 "IN LINE" @PP)
(0 4 "IN THE NEXT LINE" @PP)

English Language Model (Decoder)

Transfer Rule Formalism
Rule components:
  – Type information
  – Part-of-speech/constituent information
  – Alignments
  – x-side constraints
  – y-side constraints
  – xy-constraints, e.g. ((Y1 AGR) = (X1 AGR))

; SL: the old man, TL: ha-ish ha-zaqen
NP::NP [DET ADJ N] -> [DET N DET ADJ]
(
(X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2)
((X1 AGR) = *3-SING)
((X1 DEF) = *DEF)
((X3 AGR) = *3-SING)
((X3 COUNT) = +)
((Y1 DEF) = *DEF)
((Y3 DEF) = *DEF)
((Y2 AGR) = *3-SING)
((Y2 GENDER) = (Y4 GENDER))
)

Transfer Rule Formalism (II)
Constraint types:
  – Value constraints
  – Agreement constraints

; SL: the old man, TL: ha-ish ha-zaqen
NP::NP [DET ADJ N] -> [DET N DET ADJ]
(
(X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2)
((X1 AGR) = *3-SING)
((X1 DEF) = *DEF)
((X3 AGR) = *3-SING)
((X3 COUNT) = +)
((Y1 DEF) = *DEF)
((Y3 DEF) = *DEF)
((Y2 AGR) = *3-SING)
((Y2 GENDER) = (Y4 GENDER))
)

The Transfer Engine
Analysis
Source text is parsed into its grammatical structure. Determines transfer application ordering.
Example: 他看书。 (he read book)
(S (NP (N 他)) (VP (V 看) (NP 书)))
Transfer
A target language tree is created by reordering, insertion, and deletion.
(S (NP (N he)) (VP (V read) (NP (DET a) (N book))))
Article “a” is inserted into object NP. Source words translated with transfer lexicon.
Generation
Target language constraints are checked and final translation produced. E.g. “reads” is chosen over “read” to agree with “he”.
Final translation: “He reads a book”

XFER + Decoder
XFER engine produces a lattice of all possible transferred fragments
Decoder searches for and selects the best-scoring sequence of fragments as the final translation output
Main advantages:
  – Very high robustness: always some translation output; no transfer grammar → word-to-word translation
  – Scoring can take into account word-to-word translation probabilities, transfer rule scores, and the target statistical language model
  – Effective framework for late-stage disambiguation
Main difficulty: lattice size too big → pruning
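The search for the best-scoring sequence of fragments can be sketched as a dynamic program over lattice positions. This is a minimal illustration, not the actual decoder: the function name `best_path`, the half-open span convention, and all scores are assumptions, and a real decoder would combine translation probabilities, rule scores, and a language model, with pruning.

```python
def best_path(edges, n):
    """Pick the highest-scoring sequence of fragments covering positions 0..n.

    edges: list of (start, end, text, score) with start < end; scores are
    additive (for example log-probabilities). Returns (score, [texts]),
    or None when no sequence of fragments covers the whole input.
    """
    best = {0: (0.0, [])}          # best[i] = best (score, fragments) ending at i
    for pos in range(n):           # left to right, so best[pos] is final here
        if pos not in best:
            continue
        score, frags = best[pos]
        for start, end, text, s in edges:
            if start != pos:
                continue
            cand = (score + s, frags + [text])
            if end not in best or cand[0] > best[end][0]:
                best[end] = cand
    return best.get(n)

# Hypothetical lattice for a three-token input; scores are made up.
edges = [
    (0, 1, "IN", -1.0), (1, 2, "THE", -0.5), (2, 3, "LINE", -1.25),
    (1, 3, "THE LINE", -1.5), (0, 3, "IN THE LINE", -2.0),
]
print(best_path(edges, 3))   # (-2.0, ['IN THE LINE'])
```

If only the single-word edges are present, the same search still returns a word-to-word translation, which mirrors the robustness point above.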

Hebrew Text Encoding Issues
Input texts are (most commonly) in standard Windows encoding for Hebrew, but also Unicode (UTF-8) and others…
Morphology analyzer and other resources already set to work in a romanized “ascii-like” representation
→ Converter script converts the input into the romanized representation: 1-to-1 mapping!
All further processing is done in the romanized representation
Lexicon and grammar rules are also converted into romanized representation
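A converter of this kind can be sketched as a character-for-character table lookup. The table below is deliberately partial: it contains only the five letters that can be read off the deck's own example (בשורה ↔ B$WRH), since the full table is not given here. The function names are hypothetical, and treating "standard Windows encoding" as Windows-1255 (Python codec `cp1255`) is an assumption.

```python
# Partial table inferred from the example בשורה <-> B$WRH only.
# The complete 22-letter mapping of the real converter is not shown in the source.
HEB_TO_ROM = {"ב": "B", "ש": "$", "ו": "W", "ר": "R", "ה": "H"}
ROM_TO_HEB = {rom: heb for heb, rom in HEB_TO_ROM.items()}

# A 1-to-1 mapping means no two letters share a symbol, so it can be inverted.
assert len(ROM_TO_HEB) == len(HEB_TO_ROM)

def romanize(text):
    """Replace each known Hebrew letter by its ASCII symbol; keep other characters."""
    return "".join(HEB_TO_ROM.get(ch, ch) for ch in text)

def deromanize(text):
    """Inverse of romanize for the symbols in the table."""
    return "".join(ROM_TO_HEB.get(ch, ch) for ch in text)

def decode_input(raw, encoding="cp1255"):
    """Decode raw input bytes to text (assumed Windows-1255 by default)."""
    return raw.decode(encoding)

raw = "בשורה".encode("cp1255")          # simulate Windows-encoded input
print(romanize(decode_input(raw)))      # B$WRH
```

Because the mapping is invertible, lexicon and grammar resources can be converted once and all later stages can work purely on ASCII strings.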

Morphological Analyzer
Analyzer program developed at the Technion was available; works on Windows and, with minimal adaptation, on Linux
Coverage is reasonable (for nouns, verbs, and adjectives)
Produces all analyses or a disambiguated analysis for each word
Output format includes lexeme (base form), POS, morphological features
Output was adapted to our representation needs (POS and feature mappings)

Morphological Processing
Split attached prefixes and suffixes into separate words for translation
Produce f-structures as output
Convert feature-value codes to our conventions
“All analyses mode”: all possible analyses for each input word are returned, represented in the form of an input lattice
Analyzer installed as a server integrated with the input pre-processor

Morphology Example
Input word: B$WRH

0     1     2     3     4
|---------B$WRH---------|
|-----B-----|-$WR-|--H--|
|--B--|--H--|---$WRH----|
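The lattice above can be read as a set of arcs between positions 0 and 4, and the candidate segmentations are the paths from the first to the last position. The sketch below encodes the arcs exactly as drawn in the diagram; the function name `paths` is hypothetical, and the real system passes the lattice on to the transfer engine rather than enumerating paths explicitly.

```python
# Arcs read off the lattice diagram: (start, end, label).
ARCS = [
    (0, 4, "B$WRH"),
    (0, 2, "B"), (2, 3, "$WR"), (3, 4, "H"),
    (0, 1, "B"), (1, 2, "H"), (2, 4, "$WRH"),
]

def paths(arcs, start, goal):
    """Enumerate every label sequence that leads from `start` to `goal`."""
    if start == goal:
        return [[]]
    found = []
    for s, e, label in arcs:
        if s == start:
            for tail in paths(arcs, e, goal):
                found.append([label] + tail)
    return found

for p in paths(ARCS, 0, 4):
    print(" ".join(p))
```

Because arcs share positions, the lattice licenses five paths here, including combinations that mix rows of the diagram (for example B followed by $WRH), not only the three rows as drawn.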

Morphology Example
Y0: ((SPANSTART 0) (SPANEND 4) (LEX B$WRH) (POS N) (GEN F) (NUM S) (STATUS ABSOLUTE))
Y1: ((SPANSTART 0) (SPANEND 2) (LEX B) (POS PREP))
Y2: ((SPANSTART 1) (SPANEND 3) (LEX $WR) (POS N) (GEN M) (NUM S) (STATUS ABSOLUTE))
Y3: ((SPANSTART 3) (SPANEND 4) (LEX $LH) (POS POSS))
Y4: ((SPANSTART 0) (SPANEND 1) (LEX B) (POS PREP))
Y5: ((SPANSTART 1) (SPANEND 2) (LEX H) (POS DET))
Y6: ((SPANSTART 2) (SPANEND 4) (LEX $WRH) (POS N) (GEN F) (NUM S) (STATUS ABSOLUTE))
Y7: ((SPANSTART 0) (SPANEND 4) (LEX B$WRH) (POS LEX))

Translation Lexicon
Constructed our own Hebrew-to-English lexicon, based primarily on the existing “Dahan” H-to-E and E-to-H dictionary made available to us
Coverage is not great but not bad
  – Dahan H-to-E is about 15K translation pairs
  – Dahan E-to-H is about 7K translation pairs
POS information on both sides
No proper names or named entities
Converted Dahan into our representation, added entries for missing closed-class items (pronouns, prepositions, etc.)
Issue with spelling conventions
  – Dahan dictionary uses deficient KTIV XASER
  – Developed conversion scripts for most common patterns of verbs
  – Add/merge these into resulting lexicon
Target side (English) morphological variants added into lexicon

Translation Lexicon: Examples
PRO::PRO |: ["ANI"] -> ["I"]
(
(X1::Y1)
((X0 per) = 1)
((X0 num) = s)
((X0 case) = nom)
)

PRO::PRO |: ["ATH"] -> ["you"]
(
(X1::Y1)
((X0 per) = 2)
((X0 num) = s)
((X0 gen) = m)
((X0 case) = nom)
)

N::N |: ["$&H"] -> ["HOUR"]
(
(X1::Y1)
((X0 NUM) = s)
((Y0 NUM) = s)
((Y0 lex) = "HOUR")
)

N::N |: ["$&H"] -> ["hours"]
(
(X1::Y1)
((Y0 NUM) = p)
((X0 NUM) = p)
((Y0 lex) = "HOUR")
)

Transfer Grammar (human-developed)
Written by Alon in a few days…
Current grammar has 36 rules:
  – 21 NP rules
  – one PP rule
  – 6 verb complexes and VP rules
  – 8 higher-phrase and sentence-level rules
Captures the most common (mostly local) structural differences between Hebrew and English

Transfer Grammar Example Rules
{NP1,2}
;;SL: $MLH ADWMH
;;TL: A RED DRESS
NP1::NP1 [NP1 ADJ] -> [ADJ NP1]
(
(X2::Y1) (X1::Y2)
((X1 def) = -)
((X1 status) =c absolute)
((X1 num) = (X2 num))
((X1 gen) = (X2 gen))
(X0 = X1)
)

{NP1,3}
;;SL: H $MLWT H ADWMWT
;;TL: THE RED DRESSES
NP1::NP1 [NP1 "H" ADJ] -> [ADJ NP1]
(
(X3::Y1) (X1::Y2)
((X1 def) = +)
((X1 status) =c absolute)
((X1 num) = (X3 num))
((X1 gen) = (X3 gen))
(X0 = X1)
)
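How rule {NP1,2} behaves can be sketched with plain dictionaries standing in for feature structures. This is only an illustration under stated assumptions: `=` is treated as unification (an unspecified value is compatible with anything) and `=c` as a check that the value is actually present, which is the usual reading in unification grammars but is not spelled out in these slides. The function name and the feature values of the example words are hypothetical.

```python
def apply_np1_adj(np1, adj):
    """Sketch of {NP1,2}: [NP1 ADJ] -> [ADJ NP1] for indefinite noun phrases.

    np1 and adj are dicts of features plus an English 'text'. Returns the
    reordered phrase carrying NP1's features (X0 = X1), or None if a
    constraint fails.
    """
    if np1.get("def") not in (None, "-"):      # ((X1 def) = -): unify with '-'
        return None
    if np1.get("status") != "absolute":        # ((X1 status) =c absolute): must be set
        return None
    for feat in ("num", "gen"):                # agreement between X1 and X2
        a, b = np1.get(feat), adj.get(feat)
        if a is not None and b is not None and a != b:
            return None
    result = dict(np1)                         # (X0 = X1)
    result["def"] = "-"
    result["text"] = adj["text"] + " " + np1["text"]   # (X2::Y1) (X1::Y2)
    return result

# Hypothetical analyses of $MLH ADWMH after lexical transfer.
dress = {"text": "DRESS", "num": "s", "gen": "f", "status": "absolute", "def": "-"}
red = {"text": "RED", "num": "s", "gen": "f"}
print(apply_np1_adj(dress, red)["text"])       # RED DRESS
```

The article in "A RED DRESS" is not produced by this rule; in the sketch, as in the rule, only the adjective and noun phrase are reordered.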

Sample Output (dev-data)
maxwell anurpung comes from ghana for israel four years ago and since worked in cleaning in hotels in eilat a few weeks ago announced if management club hotel that for him to leave israel according to the government instructions and immigration police in a letter in broken english which spread among the foreign workers thanks to them hotel for their hard work and announced that will purchase for hm flight tickets for their countries from their money

Evaluation Results
Test set of 62 sentences from Haaretz newspaper, 2 reference translations

System            BLEU    NIST    Precision  Recall  METEOR
No grammar        0.0616  3.4109  0.4090     0.4427  0.3298
Learned grammar   0.0774  3.5451  0.4189     0.4488  0.3478
Manual grammar    0.1026  3.7789  0.4334     0.4474  0.3617

Current and Future Work
Issues specific to the Hebrew-to-English system:
  – Further improvements in the translation lexicon and morphological analyzer
  – Manual grammar development
  – Acquiring/training of word-to-word translation probabilities
  – Acquiring/training of a Hebrew language model at a post-morphology level that can help with disambiguation
General issues related to the XFER framework:
  – Effective pruning during full lattice construction
  – Effective model for assigning scores to transfer rules
  – Extending decoder to incorporate rule scores
  – Improved grammar learning

Conclusions
Test case for the CMU XFER framework for rapid MT prototyping
Two-month, three-person effort; we were quite happy with the outcome
Core concept of XFER + Decoder is very powerful and promising
We experienced the main bottlenecks of knowledge acquisition for MT: morphology, translation lexicons, grammar...

Questions?

Learning Transfer-Rules for Languages with Limited Resources
Rationale:
  – Large bilingual corpora not available
  – Bilingual native informant(s) can translate and align a small pre-designed elicitation corpus, using an elicitation tool
  – Elicitation corpus designed to be typologically comprehensive and compositional
  – Transfer-rule engine and new learning approach support acquisition of generalized transfer-rules from the data

English-Hindi Example

Rule Learning - Overview
Goal: Acquire Syntactic Transfer Rules
Use available knowledge from the source side (grammatical structure)
Three steps:
1. Flat Seed Generation: first guesses at transfer rules; flat syntactic structure
2. Compositionality: use previously learned rules to add hierarchical structure
3. Seeded Version Space Learning: refine rules by learning appropriate feature constraints

Flat Seed Rule Generation
Learning Example: NP
Eng: the big apple
Heb: ha-tapuax ha-gadol
Generated Seed Rule:
NP::NP [ART ADJ N] -> [ART N ART ADJ]
((X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2))
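The seed rule for this example can be produced mechanically from the POS sequences of the two sides and the informant's word alignment. The sketch below shows only that formatting step; the function name `seed_rule` and the input representation (1-based alignment pairs) are assumptions, and the actual learner also records the example set behind each rule.

```python
def seed_rule(cat, sl_pos, tl_pos, alignments):
    """Build a flat seed rule from two POS sequences and 1-based word alignments."""
    lhs = " ".join(sl_pos)
    rhs = " ".join(tl_pos)
    aligns = " ".join(f"(X{i}::Y{j})" for i, j in sorted(alignments))
    return f"{cat}::{cat} [{lhs}] -> [{rhs}] ({aligns})"

# the(1) big(2) apple(3)  <->  ha(1) tapuax(2) ha(3) gadol(4)
rule = seed_rule(
    "NP",
    ["ART", "ADJ", "N"],
    ["ART", "N", "ART", "ADJ"],
    [(1, 1), (1, 3), (2, 4), (3, 2)],
)
print(rule)
# NP::NP [ART ADJ N] -> [ART N ART ADJ] ((X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2))
```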

Compositionality
Initial Flat Rules:
S::S [ART ADJ N V ART N] -> [ART N ART ADJ V P ART N]
((X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2) (X4::Y5) (X5::Y7) (X6::Y8))
NP::NP [ART ADJ N] -> [ART N ART ADJ]
((X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2))
NP::NP [ART N] -> [ART N]
((X1::Y1) (X2::Y2))
Generated Compositional Rule:
S::S [NP V NP] -> [NP V P NP]
((X1::Y1) (X2::Y2) (X3::Y4))
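The step from the flat S rule to the compositional one can be sketched as collapsing every span that an already-learned rule covers on both sides. This is a simplified illustration, not the published algorithm: the function name `compose`, the greedy left-to-right matching, and the requirement that the aligned target words form one contiguous block are all assumptions.

```python
def compose(sl, tl, align, subrules):
    """Collapse spans covered by learned rules into single constituents.

    sl, tl: POS lists; align: set of 0-based (source, target) index pairs;
    subrules: list of (label, sl_seq, tl_seq), tried in the given order.
    Returns (new_sl, new_tl, new_align) with 1-based alignment pairs.
    """
    units, sl_unit_of, tl_owner = [], {}, {}
    i = 0
    while i < len(sl):
        for label, s_seq, t_seq in subrules:
            span = range(i, i + len(s_seq))
            targets = sorted({j for (a, j) in align if a in span})
            if (sl[i:i + len(s_seq)] == s_seq and targets
                    and targets == list(range(targets[0], targets[0] + len(t_seq)))
                    and tl[targets[0]:targets[-1] + 1] == t_seq):
                for j in targets:
                    tl_owner[j] = len(units)
                for a in span:
                    sl_unit_of[a] = len(units)
                units.append(label)
                i += len(s_seq)
                break
        else:                                  # no learned rule matches here
            sl_unit_of[i] = len(units)
            units.append(sl[i])
            i += 1

    new_tl, new_align, prev_owner = [], set(), None
    for j, pos in enumerate(tl):
        owner = tl_owner.get(j)
        if owner is not None:
            if owner != prev_owner:            # first target word of a collapsed span
                new_tl.append(units[owner])
                new_align.add((owner + 1, len(new_tl)))
        else:
            new_tl.append(pos)
            for a, b in align:
                if b == j:
                    new_align.add((sl_unit_of[a] + 1, len(new_tl)))
        prev_owner = owner
    return units, new_tl, sorted(new_align)

sl = ["ART", "ADJ", "N", "V", "ART", "N"]
tl = ["ART", "N", "ART", "ADJ", "V", "P", "ART", "N"]
align = {(0, 0), (0, 2), (1, 3), (2, 1), (3, 4), (4, 6), (5, 7)}
np_rules = [("NP", ["ART", "ADJ", "N"], ["ART", "N", "ART", "ADJ"]),
            ("NP", ["ART", "N"], ["ART", "N"])]
print(compose(sl, tl, align, np_rules))
# (['NP', 'V', 'NP'], ['NP', 'V', 'P', 'NP'], [(1, 1), (2, 2), (3, 4)])
```

The unaligned target word P stays in place without an alignment, which matches the generated rule's (X3::Y4) skipping Y3.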

Seeded Version Space Learning
Input: Rules and their Example Sets
S::S [NP V NP] -> [NP V P NP] {ex1,ex12,ex17,ex26}
((X1::Y1) (X2::Y2) (X3::Y4))
NP::NP [ART ADJ N] -> [ART N ART ADJ] {ex2,ex3,ex13}
((X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2))
NP::NP [ART N] -> [ART N] {ex4,ex5,ex6,ex8,ex10,ex11}
((X1::Y1) (X2::Y2))
Output: Rules with Feature Constraints:
S::S [NP V NP] -> [NP V P NP]
((X1::Y1) (X2::Y2) (X3::Y4)
(X1 NUM = X2 NUM)
(Y1 NUM = Y2 NUM)
(X1 NUM = Y1 NUM))
