Rapid Prototyping of a Transfer-based Hebrew-to-English Machine Translation System
Alon Lavie, Language Technologies Institute, Carnegie Mellon University

Rapid Prototyping of a Transfer-based Hebrew-to-English Machine Translation System
Alon Lavie, Language Technologies Institute, Carnegie Mellon University
Joint work with: Shuly Wintner, Yaniv Eytani – University of Haifa; Erik Peterson, Katharina Probst – Carnegie Mellon University
TMI, October 4, 2004

Outline
– Hebrew and its Challenges for MT
– CMU Transfer-based MT Framework
– Hebrew-to-English System
  – Input Pre-processing and Morphological Analysis
  – MT Resources: Lexicon and Grammar
– Performance Evaluation
– Conclusions, Current and Future Work

Modern Hebrew
– Native language of about 3-4 million people in Israel
– Semitic language, closely related to Arabic and with similar linguistic properties:
  – Root+pattern word-formation system
  – Rich verb and noun morphology
  – Particles attach as prefixes to the following word: definite article (H), prepositions (B, K, L, M), coordinating conjunction (W), relativizers ($, K$), ...
– Unique alphabet and writing system:
  – 22 letters represent (mostly) consonants
  – Vowels are represented (mostly) by diacritics
  – Modern texts omit the diacritic vowels, adding a further level of ambiguity: "bare" word → word
  – Example: MHGR → mehager, m+hagar, m+h+ger

Modern Hebrew Spelling
– Two main spelling variants:
  – "KTIV XASER" (deficient): spelling meant to be used with the vowel diacritics; words reduce to their bare consonants when the diacritics are removed
  – "KTIV MALEH" (full): words with I/O/U vowels are written with full vowel letters
– KTIV MALEH is predominant, but not strictly adhered to, even in newspapers and official publications → inconsistent spelling
– Example:
  – The word niqud can be spelled NIQWD, NQWD, or NQD
  – Written as NQD, it could also be read as niqed, naqed, or nuqad

Challenges for Hebrew MT
– Paucity of existing language resources for Hebrew:
  – No publicly available broad-coverage morphological analyzer
  – No publicly available bilingual lexicons or dictionaries
  – No POS-tagged corpus or parse tree-bank corpus for Hebrew
  – No large Hebrew/English parallel corpus
– Scenario well suited for the CMU transfer-based MT framework for languages with limited resources

System Architecture (diagram): Hebrew input (e.g. בשורה הבאה) passes through Preprocessing and Morphology into the Transfer Engine, which applies Transfer Rules (e.g. rule {NP1,3}: NP1::NP1 [NP1 "H" ADJ] -> [ADJ NP1]) and a Translation Lexicon (e.g. N::N entries mapping "$WR" to "BULL" and "$WRH" to "LINE") to produce a Translation Output Lattice of partial translations; the Decoder, using an English Language Model, selects the final English output: "in the next line".

Transfer Rule Formalism
– Type information
– Part-of-speech / constituent information
– Alignments
– x-side constraints
– y-side constraints
– xy-constraints, e.g. ((Y1 AGR) = (X1 AGR))

; SL: the old man, TL: ha-ish ha-zaqen
NP::NP [DET ADJ N] -> [DET N DET ADJ]
( (X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2)
  ((X1 AGR) = *3-SING)
  ((X1 DEF) = *DEF)
  ((X3 AGR) = *3-SING)
  ((X3 COUNT) = +)
  ((Y1 DEF) = *DEF)
  ((Y3 DEF) = *DEF)
  ((Y2 AGR) = *3-SING)
  ((Y2 GENDER) = (Y4 GENDER)) )
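
For a concrete picture of what such a rule amounts to in software, here is a minimal Python sketch of the rule above as a data structure. The names (TransferRule, alignments, constraints) are illustrative assumptions, not the actual Stat-XFER implementation; constraints are kept verbatim as strings in the formalism's own notation.

from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class TransferRule:
    """One transfer rule: constituent types, the two right-hand sides, constituent
    alignments, and feature constraints kept in the formalism's notation."""
    sl_type: str
    tl_type: str
    sl_rhs: List[str]                       # source right-hand side, e.g. ["DET", "ADJ", "N"]
    tl_rhs: List[str]                       # target right-hand side
    alignments: List[Tuple[int, int]]       # 1-based (x, y) pairs, as in (X1::Y1)
    constraints: List[str] = field(default_factory=list)

# The NP rule from the slide (SL: "the old man", TL: "ha-ish ha-zaqen"), encoded directly:
old_man_np = TransferRule(
    sl_type="NP", tl_type="NP",
    sl_rhs=["DET", "ADJ", "N"],
    tl_rhs=["DET", "N", "DET", "ADJ"],
    alignments=[(1, 1), (1, 3), (2, 4), (3, 2)],
    constraints=[
        "((X1 AGR) = *3-SING)", "((X1 DEF) = *DEF)",
        "((X3 AGR) = *3-SING)", "((X3 COUNT) = +)",
        "((Y1 DEF) = *DEF)", "((Y3 DEF) = *DEF)",
        "((Y2 AGR) = *3-SING)", "((Y2 GENDER) = (Y4 GENDER))",
    ],
)
print(old_man_np.sl_rhs, "->", old_man_np.tl_rhs)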

Transfer Rule Formalism (II)
– Value constraints
– Agreement constraints

; SL: the old man, TL: ha-ish ha-zaqen
NP::NP [DET ADJ N] -> [DET N DET ADJ]
( (X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2)
  ((X1 AGR) = *3-SING)
  ((X1 DEF) = *DEF)
  ((X3 AGR) = *3-SING)
  ((X3 COUNT) = +)
  ((Y1 DEF) = *DEF)
  ((Y3 DEF) = *DEF)
  ((Y2 AGR) = *3-SING)
  ((Y2 GENDER) = (Y4 GENDER)) )

The Transfer Engine
– Analysis: source text is parsed into its grammatical structure, which determines the order in which transfer rules are applied.
  Example: 他 看 书 (he read book) parses as S → NP(N 他) VP(V 看, NP(N 书))
– Transfer: a target-language tree is created by reordering, insertion, and deletion.
  S → NP(N he) VP(V read, NP(DET a, N book)); the article "a" is inserted into the object NP, and source words are translated with the transfer lexicon.
– Generation: target-language constraints are checked and the final translation is produced; e.g., "reads" is chosen over "read" to agree with "he".
  Final translation: "He reads a book"
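
As a rough illustration of the transfer step only (not the actual engine, and with agreement/generation left out), the sketch below reorders source constituents according to a rule's alignments, leaves unaligned target slots open for inserted material, and translates leaves with a toy lexicon; the rule and lexicon are hypothetical.

# Minimal, self-contained sketch of the transfer step: reorder source constituents
# according to a rule's alignments, keep unaligned target slots for inserted material,
# and translate leaves with a toy lexicon.

def transfer(src_children, tl_rhs, alignments, lexicon):
    """src_children: list of (category, word) for the source RHS (1-indexed positions).
    tl_rhs: target RHS categories.  alignments: 1-based (x, y) index pairs.
    Returns the target-side sequence of (category, word)."""
    y_to_x = {y: x for x, y in alignments}          # which source child fills each target slot
    target = []
    for y_pos, y_cat in enumerate(tl_rhs, start=1):
        if y_pos in y_to_x:
            _, src_word = src_children[y_to_x[y_pos] - 1]
            target.append((y_cat, lexicon.get(src_word, src_word)))
        else:
            target.append((y_cat, None))            # unaligned slot: material to be inserted
    return target

# Toy rule mirroring the slide's example, with an unaligned DET slot for the inserted article:
lexicon = {"他": "he", "看": "read", "书": "book"}
rule_alignments = [(1, 1), (2, 2), (3, 4)]          # S::S [NP V NP] -> [NP V DET NP]
src = [("NP", "他"), ("V", "看"), ("NP", "书")]
print(transfer(src, ["NP", "V", "DET", "NP"], rule_alignments, lexicon))
# [('NP', 'he'), ('V', 'read'), ('DET', None), ('NP', 'book')]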

XFER + Decoder
– The XFER engine produces a lattice of all possible transferred fragments
– The decoder searches for and selects the best-scoring sequence of fragments as the final translation output
– Main advantages:
  – Very high robustness: there is always some translation output; with no transfer grammar, the system falls back to word-to-word translation
  – Scoring can take into account word-to-word translation probabilities, transfer-rule scores, and a target statistical language model
  – Effective framework for late-stage disambiguation
– Main difficulty: the lattice size becomes too big → pruning
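
The sketch below shows the core idea of the decoder's search under strong simplifying assumptions: fragment scores are given and simply added, and the real decoder's language-model and rule scores are ignored. It is not the CMU decoder; names and scores are illustrative.

# Pick the best-scoring sequence of non-overlapping fragments that covers the input.
def decode(fragments, n_words):
    """fragments: list of (start, end, score, text) with 0 <= start < end <= n_words.
    Returns (best_score, list_of_texts) for a left-to-right cover of all n_words."""
    best = {0: (0.0, [])}                       # best partial cover ending at each position
    for pos in range(1, n_words + 1):
        candidates = []
        for start, end, score, text in fragments:
            if end == pos and start in best:
                prev_score, prev_out = best[start]
                candidates.append((prev_score + score, prev_out + [text]))
        if candidates:
            best[pos] = max(candidates, key=lambda c: c[0])
    return best.get(n_words)

# Toy lattice for a 2-segment input (scores are made up for illustration):
frags = [
    (0, 1, -1.2, "in"), (0, 1, -2.0, "with"),
    (1, 2, -0.8, "the next line"),
    (0, 2, -1.5, "in the next line"),           # one fragment covering the whole input
]
print(decode(frags, 2))   # -> (-1.5, ['in the next line'])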

Hebrew Text Encoding Issues
– Input texts are (most commonly) in the standard Windows encoding for Hebrew, but also Unicode (UTF-8) and others...
– The morphological analyzer and other resources are already set up to work in a romanized, "ASCII-like" representation
– A converter script therefore converts the input into the romanized representation – a 1-to-1 mapping!
– All further processing is done in the romanized representation
– The lexicon and grammar rules are also converted into the romanized representation
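
To illustrate what a 1-to-1 romanization step amounts to, here is a small sketch; the partial character mapping below is a hypothetical fragment that follows the transliteration style seen on the slides (e.g. ש → $, ע → &), not the project's actual conversion table.

# Hypothetical sketch of a 1-to-1 romanization step.
ROMANIZATION = {
    "ב": "B", "ש": "$", "ו": "W", "ר": "R", "ה": "H",
    "א": "A", "נ": "N", "י": "I", "ע": "&",
}
TABLE = str.maketrans(ROMANIZATION)

def romanize(text: str) -> str:
    """Map each Hebrew letter to its ASCII stand-in; other characters pass through."""
    return text.translate(TABLE)

print(romanize("בשורה"))   # -> "B$WRH", the running example from the morphology slides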

Morphological Analyzer
– An analyzer program developed at the Technion was available; it works on Windows and, with minimal adaptation, on Linux
– Coverage is reasonable (for nouns, verbs, and adjectives)
– Produces all analyses, or a disambiguated analysis, for each word
– Output format includes lexeme (base form), POS, and morphological features
– Output was adapted to our representation needs (POS and feature mappings)

Morphological Processing
– Split attached prefixes and suffixes into separate words for translation
– Produce f-structures as output
– Convert feature-value codes to our conventions
– "All analyses mode": all possible analyses for each input word are returned, represented in the form of an input lattice
– The analyzer is installed as a server, integrated with the input pre-processor

Morphology Example
Input word: B$WRH
Possible segmentations (arcs in the analysis lattice):
  |--------B$WRH--------|
  |-----B-----|-$WR-|-H-|
  |--B--|--H--|--$WRH---|

Morphology Example
Y0: ((SPANSTART 0) (SPANEND 4) (LEX B$WRH) (POS N) (GEN F) (NUM S) (STATUS ABSOLUTE))
Y1: ((SPANSTART 0) (SPANEND 2) (LEX B) (POS PREP))
Y2: ((SPANSTART 1) (SPANEND 3) (LEX $WR) (POS N) (GEN M) (NUM S) (STATUS ABSOLUTE))
Y3: ((SPANSTART 3) (SPANEND 4) (LEX $LH) (POS POSS))
Y4: ((SPANSTART 0) (SPANEND 1) (LEX B) (POS PREP))
Y5: ((SPANSTART 1) (SPANEND 2) (LEX H) (POS DET))
Y6: ((SPANSTART 2) (SPANEND 4) (LEX $WRH) (POS N) (GEN F) (NUM S) (STATUS ABSOLUTE))
Y7: ((SPANSTART 0) (SPANEND 4) (LEX B$WRH) (POS LEX))
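
As an illustrative representation only (not the engine's actual data structures), the analyses above can be held as a word lattice of f-structures indexed by the input span they cover, which is the form in which the transfer engine consumes them.

from collections import defaultdict

analyses = [  # copied from the morphology example above (Y0..Y7)
    {"SPANSTART": 0, "SPANEND": 4, "LEX": "B$WRH", "POS": "N", "GEN": "F", "NUM": "S", "STATUS": "ABSOLUTE"},
    {"SPANSTART": 0, "SPANEND": 2, "LEX": "B", "POS": "PREP"},
    {"SPANSTART": 1, "SPANEND": 3, "LEX": "$WR", "POS": "N", "GEN": "M", "NUM": "S", "STATUS": "ABSOLUTE"},
    {"SPANSTART": 3, "SPANEND": 4, "LEX": "$LH", "POS": "POSS"},
    {"SPANSTART": 0, "SPANEND": 1, "LEX": "B", "POS": "PREP"},
    {"SPANSTART": 1, "SPANEND": 2, "LEX": "H", "POS": "DET"},
    {"SPANSTART": 2, "SPANEND": 4, "LEX": "$WRH", "POS": "N", "GEN": "F", "NUM": "S", "STATUS": "ABSOLUTE"},
    {"SPANSTART": 0, "SPANEND": 4, "LEX": "B$WRH", "POS": "LEX"},
]

lattice = defaultdict(list)
for fs in analyses:
    lattice[(fs["SPANSTART"], fs["SPANEND"])].append(fs)

# The transfer engine can then ask: which analyses cover a given span of the input?
for fs in lattice[(0, 4)]:
    print(fs["LEX"], fs["POS"])   # B$WRH N / B$WRH LEX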

Translation Lexicon
– Constructed our own Hebrew-to-English lexicon, based primarily on the existing "Dahan" H-to-E and E-to-H dictionary made available to us
– Coverage is not great, but not bad:
  – Dahan H-to-E is about 15K translation pairs
  – Dahan E-to-H is about 7K translation pairs
  – POS information on both sides
  – No proper names or named entities
– Converted Dahan into our representation; added entries for missing closed-class items (pronouns, prepositions, etc.)
– Issue with spelling conventions:
  – The Dahan dictionary uses the deficient KTIV XASER spelling
  – Developed conversion scripts for the most common patterns of verbs
  – Added/merged these into the resulting lexicon
– Target-side (English) morphological variants added into the lexicon

Translation Lexicon: Examples

PRO::PRO |: ["ANI"] -> ["I"]
( (X1::Y1)
  ((X0 per) = 1)
  ((X0 num) = s)
  ((X0 case) = nom) )

PRO::PRO |: ["ATH"] -> ["you"]
( (X1::Y1)
  ((X0 per) = 2)
  ((X0 num) = s)
  ((X0 gen) = m)
  ((X0 case) = nom) )

N::N |: ["$&H"] -> ["HOUR"]
( (X1::Y1)
  ((X0 NUM) = s)
  ((Y0 NUM) = s)
  ((Y0 lex) = "HOUR") )

N::N |: ["$&H"] -> ["hours"]
( (X1::Y1)
  ((X0 NUM) = p)
  ((Y0 NUM) = p)
  ((Y0 lex) = "HOUR") )

Transfer Grammar (human-developed)
– Written by Alon in a few days
– The current grammar has 36 rules:
  – 21 NP rules
  – 1 PP rule
  – 6 verb-complex and VP rules
  – 8 higher-phrase and sentence-level rules
– Captures the most common (mostly local) structural differences between Hebrew and English

Transfer Grammar: Example Rules

{NP1,2}
;; SL: $MLH ADWMH
;; TL: A RED DRESS
NP1::NP1 [NP1 ADJ] -> [ADJ NP1]
( (X2::Y1) (X1::Y2)
  ((X1 def) = -)
  ((X1 status) =c absolute)
  ((X1 num) = (X2 num))
  ((X1 gen) = (X2 gen))
  (X0 = X1) )

{NP1,3}
;; SL: H $MLWT H ADWMWT
;; TL: THE RED DRESSES
NP1::NP1 [NP1 "H" ADJ] -> [ADJ NP1]
( (X3::Y1) (X1::Y2)
  ((X1 def) = +)
  ((X1 status) =c absolute)
  ((X1 num) = (X3 num))
  ((X1 gen) = (X3 gen))
  (X0 = X1) )
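
The sketch below shows, in a highly simplified form, how the {NP1,3} rule's constraints could be checked against source-side f-structures before the rule fires. It is not the Stat-XFER unifier: the constraint encoding is an assumption, "=c" is treated as a plain equality test, and the feature values for the example phrase are illustrative.

def check(constraints, fstructs):
    """constraints: (node, feature, expected) for value constraints, or
    (node1, feature1, node2, feature2) for agreement constraints.
    fstructs: dict mapping 'X1', 'X3', ... to feature dictionaries."""
    for c in constraints:
        if len(c) == 3:                       # value constraint, e.g. ((X1 def) = +)
            node, feat, expected = c
            if fstructs[node].get(feat) != expected:
                return False
        else:                                 # agreement constraint, e.g. ((X1 num) = (X3 num))
            n1, f1, n2, f2 = c
            if fstructs[n1].get(f1) != fstructs[n2].get(f2):
                return False
    return True

# {NP1,3}: NP1::NP1 [NP1 "H" ADJ] -> [ADJ NP1]  --  "H $MLWT H ADWMWT" -> "THE RED DRESSES"
np1_3 = [("X1", "def", "+"), ("X1", "status", "absolute"),
         ("X1", "num", "X3", "num"), ("X1", "gen", "X3", "gen")]
source = {"X1": {"def": "+", "status": "absolute", "num": "p", "gen": "f"},   # H $MLWT
          "X3": {"num": "p", "gen": "f"}}                                     # ADWMWT
print(check(np1_3, source))   # True: the rule applies and the adjective is fronted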

Sample Output (dev-data)
"maxwell anurpung comes from ghana for israel four years ago and since worked in cleaning in hotels in eilat a few weeks ago announced if management club hotel that for him to leave israel according to the government instructions and immigration police in a letter in broken english which spread among the foreign workers thanks to them hotel for their hard work and announced that will purchase for hm flight tickets for their countries from their money"

Evaluation Results
– Test set of 62 sentences from the Haaretz newspaper, 2 reference translations
– Three system variants compared: No Grammar, Learned Grammar, Manual Grammar
– Metrics: BLEU, NIST, Precision, Recall, METEOR

Current and Future Work
– Issues specific to the Hebrew-to-English system:
  – Further improvements in the translation lexicon and morphological analyzer
  – Manual grammar development
  – Acquiring/training word-to-word translation probabilities
  – Acquiring/training a Hebrew language model at a post-morphology level that can help with disambiguation
– General issues related to the XFER framework:
  – Effective pruning during full lattice construction
  – An effective model for assigning scores to transfer rules
  – Extending the decoder to incorporate rule scores
  – Improved grammar learning

Conclusions
– A test case for the CMU XFER framework for rapid MT prototyping
– A two-month, three-person effort – we were quite happy with the outcome
– The core concept of XFER + Decoder is very powerful and promising
– We experienced the main bottlenecks of knowledge acquisition for MT: morphology, translation lexicons, grammar...

Questions?

Learning Transfer Rules for Languages with Limited Resources
Rationale:
– Large bilingual corpora are not available
– Bilingual native informant(s) can translate and align a small pre-designed elicitation corpus, using the elicitation tool
– The elicitation corpus is designed to be typologically comprehensive and compositional
– The transfer-rule engine and a new learning approach support acquisition of generalized transfer rules from the data

English-Hindi Example

Rule Learning – Overview
– Goal: acquire syntactic transfer rules
– Use available knowledge from the source side (grammatical structure)
– Three steps, illustrated on the following slides:
  1. Flat Seed Generation: first guesses at transfer rules; flat syntactic structure
  2. Compositionality: use previously learned rules to add hierarchical structure
  3. Seeded Version Space Learning: refine rules by learning appropriate feature constraints

Flat Seed Rule Generation
Learning example:
  NP   Eng: the big apple   Heb: ha-tapuax ha-gadol
Generated seed rule:
  NP::NP [ART ADJ N] -> [ART N ART ADJ]
  ((X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2))
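
A minimal sketch of this step, under the assumption that POS tags and word alignments are already given (as they are for the elicitation corpus): abstract the aligned pair to its POS sequences and carry the word alignments over as constituent alignments. The function name and input format are illustrative.

def flat_seed_rule(label, src, tgt, word_alignments):
    """src, tgt: lists of (word, POS); word_alignments: 0-based (src_i, tgt_j) pairs.
    Returns the seed rule as a formatted string in the formalism used on the slides."""
    src_pos = [pos for _, pos in src]
    tgt_pos = [pos for _, pos in tgt]
    aligns = " ".join(f"(X{i + 1}::Y{j + 1})" for i, j in sorted(word_alignments))
    return f"{label}::{label} [{' '.join(src_pos)}] -> [{' '.join(tgt_pos)}]\n({aligns})"

eng = [("the", "ART"), ("big", "ADJ"), ("apple", "N")]
heb = [("ha", "ART"), ("tapuax", "N"), ("ha", "ART"), ("gadol", "ADJ")]
# "the" aligns to both occurrences of "ha"; "big" -> "gadol"; "apple" -> "tapuax"
print(flat_seed_rule("NP", eng, heb, [(0, 0), (0, 2), (1, 3), (2, 1)]))
# NP::NP [ART ADJ N] -> [ART N ART ADJ]
# ((X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2))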

Compositionality
Initial flat rules:
  S::S [ART ADJ N V ART N] -> [ART N ART ADJ V P ART N]
  ((X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2) (X4::Y5) (X5::Y7) (X6::Y8))

  NP::NP [ART ADJ N] -> [ART N ART ADJ]
  ((X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2))

  NP::NP [ART N] -> [ART N]
  ((X1::Y1) (X2::Y2))

Generated compositional rule:
  S::S [NP V NP] -> [NP V P NP]
  ((X1::Y1) (X2::Y2) (X3::Y4))
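
A simplified sketch of the composition step, under the assumption that the spans covered by previously learned lower-level rules are already known (in practice they come from parsing the source side with those rules): each covered span collapses into a single aligned constituent and the remaining alignments are remapped. Function and parameter names are illustrative.

def compose(src, tgt, alignments, covered):
    """src, tgt: POS sequences (1-based positions, as in the formalism).
    alignments: (x, y) pairs.  covered: list of (label, (xs, xe), (ys, ye)) giving
    inclusive source/target spans covered by an already-learned rule."""
    def collapse(seq, spans):
        out, pos_map, i = [], {}, 1
        while i <= len(seq):
            span = next((s for s in spans if s[0] == i), None)
            if span:
                out.append(span[2])              # the constituent label, e.g. "NP"
                for p in range(span[0], span[1] + 1):
                    pos_map[p] = len(out)        # every covered position maps to the new slot
                i = span[1] + 1
            else:
                out.append(seq[i - 1])
                pos_map[i] = len(out)
                i += 1
        return out, pos_map

    src_spans = [(xs, xe, lab) for lab, (xs, xe), _ in covered]
    tgt_spans = [(ys, ye, lab) for lab, _, (ys, ye) in covered]
    new_src, xmap = collapse(src, src_spans)
    new_tgt, ymap = collapse(tgt, tgt_spans)
    new_aligns = sorted({(xmap[x], ymap[y]) for x, y in alignments})
    return new_src, new_tgt, new_aligns

flat_src = ["ART", "ADJ", "N", "V", "ART", "N"]
flat_tgt = ["ART", "N", "ART", "ADJ", "V", "P", "ART", "N"]
flat_aligns = [(1, 1), (1, 3), (2, 4), (3, 2), (4, 5), (5, 7), (6, 8)]
covered = [("NP", (1, 3), (1, 4)), ("NP", (5, 6), (7, 8))]
print(compose(flat_src, flat_tgt, flat_aligns, covered))
# (['NP', 'V', 'NP'], ['NP', 'V', 'P', 'NP'], [(1, 1), (2, 2), (3, 4)])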

Seeded Version Space Learning
Input: rules and their example sets
  S::S [NP V NP] -> [NP V P NP]   {ex1, ex12, ex17, ex26}
  ((X1::Y1) (X2::Y2) (X3::Y4))

  NP::NP [ART ADJ N] -> [ART N ART ADJ]   {ex2, ex3, ex13}
  ((X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2))

  NP::NP [ART N] -> [ART N]   {ex4, ex5, ex6, ex8, ex10, ex11}
  ((X1::Y1) (X2::Y2))

Output: rules with feature constraints
  S::S [NP V NP] -> [NP V P NP]
  ((X1::Y1) (X2::Y2) (X3::Y4)
   (X1 NUM = X2 NUM) (Y1 NUM = Y2 NUM) (X1 NUM = Y1 NUM))
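
The sketch below captures, in a form much simpler than actual seeded version-space learning, the essence of the constraint-learning step: propose candidate agreement constraints and keep only those that hold in every example associated with a rule. The example feature values are hypothetical; note the slide's output keeps only a non-redundant subset of the constraints such a procedure would find.

from itertools import combinations

def consistent_agreements(examples, feature="NUM"):
    """examples: list of dicts mapping node names ('X1', 'Y2', ...) to feature dicts.
    Returns (node, node) agreement constraints whose values match in every example."""
    nodes = sorted(examples[0].keys())
    kept = []
    for a, b in combinations(nodes, 2):
        if all(ex[a].get(feature) is not None and
               ex[a].get(feature) == ex[b].get(feature) for ex in examples):
            kept.append(f"({a} {feature} = {b} {feature})")
    return kept

# Two toy examples for the S rule (feature values invented for illustration only):
ex1  = {"X1": {"NUM": "s"}, "X2": {"NUM": "s"}, "X3": {"NUM": "p"},
        "Y1": {"NUM": "s"}, "Y2": {"NUM": "s"}}
ex12 = {"X1": {"NUM": "p"}, "X2": {"NUM": "p"}, "X3": {"NUM": "p"},
        "Y1": {"NUM": "p"}, "Y2": {"NUM": "p"}}
print(consistent_agreements([ex1, ex12]))
# ['(X1 NUM = X2 NUM)', '(X1 NUM = Y1 NUM)', '(X1 NUM = Y2 NUM)',
#  '(X2 NUM = Y1 NUM)', '(X2 NUM = Y2 NUM)', '(Y1 NUM = Y2 NUM)']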