Vamshi Ambati 14 Sept 2007 Student Research Symposium


Noun Phrase driven Bootstrapping of Word Alignment for Syntax Projection
Vamshi Ambati, 14 Sept 2007, Student Research Symposium

Agenda
- Rule Learning for MT
- Syntax Projection Task
- Word Alignment Task
- Bootstrapping Word Alignment
- Experiments and Results

Machine Translation for Resource-poor Languages
A major portion of human languages are 'resource-poor':
- Little parallel corpus with major languages
- Little monolingual corpus
- Few annotation tools
- Few grammarians
- Few bilingual speakers
Machine Translation in such a scenario is extremely difficult.

Machine Translation for Resource-poor Languages
The AVENUE project [Lavie et al. '03]

Rule Learning for MT

NP::NP [PP NP] -> [NP PP]
(
  (X1::Y2)
  (X2::Y1)
  (X0 = X2)
  ((Y1 NUM) = (X2 NUM))
  ((Y1 NUM) = (Y2 NUM))
  ((Y1 PERS) = (Y2 PERS))
  (Y0 = Y1)
)

PP::PP [ADVP NP POSTP] -> [ADVP PREP NP]
(
  (X1::Y1)
  (X2::Y3)
  (X3::Y2)
  (X0 = X3)
  (Y0 = Y2)
)
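To make the constituent-alignment part of such a rule concrete, here is one possible Python encoding of the PP::PP rule above and the reordering it induces. This is an illustrative sketch only: the dict layout, function names, and the Hindi example words are assumptions, not the AVENUE rule format, and feature unification (the NUM/PERS equations) is not modeled.

```python
# Hypothetical encoding of the PP::PP transfer rule shown above.
# "alignments" holds the Xi::Yj constituent correspondences (1-based).
pp_rule = {
    "label": ("PP", "PP"),                  # X0::Y0 category pair
    "source": ["ADVP", "NP", "POSTP"],      # Hindi side: postposition follows NP
    "target": ["ADVP", "PREP", "NP"],       # English side: preposition precedes NP
    "alignments": [(1, 1), (2, 3), (3, 2)], # X1::Y1, X2::Y3, X3::Y2
}

def reorder(rule, source_constituents):
    """Reorder parsed source constituents into the rule's target order.

    Only reordering is shown; lexical translation of each constituent
    (e.g. the POSTP becoming a PREP) is left out.
    """
    # Map each target position Yj to its aligned source position Xi.
    pos = {y: x for x, y in rule["alignments"]}
    return [source_constituents[pos[y] - 1]
            for y in range(1, len(rule["target"]) + 1)]

# Invented example constituents for [ADVP NP POSTP]:
print(reorder(pp_rule, ["jaldI se", "ghara", "meM"]))
# -> ['jaldI se', 'meM', 'ghara']  (ADVP, POSTP-as-PREP, NP)
```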

How can such rules be learnt?
Given annotated data for the TL, we have creative ways to do this; nothing is more valuable than annotated data. But these are "resource-poor" languages. Can we look from the source side and transfer the annotation?

Syntax Projection Named Entity Projection [Rama] ate an apple rAma ne ek apple khaya

Syntax Projection Named Entity Projection [Rama] ate an apple [rAma] ne ek apple khaya

Syntax Projection Base NP Projection [Rama] ate [an apple] rAma ne ek apple khaya

Syntax Projection Base NP Projection [Rama] ate [an apple] [rAma] ne [ek apple] khaya

Syntax Projection Constituent Phrase Projection rAma ne ek apple khaya

Rule Learning Goal
English: Rama ate an apple
Hindi: rAma ne apple khaya
S::S [NP NP VP] -> [NP VP NP]
S::S [NP NP 'khaya'] -> [NP 'ate' NP]
S::S ['rAma' 'ne' NP 'khaya'] -> ['Rama' 'ate' NP]

Word Alignment Task
Training data: parallel sentences in the source and target languages.
f : source sentence (Hindi), word positions j = 1, 2, ..., J
e : target sentence (English), word positions i = 1, 2, ..., I

Word Alignment Models
- IBM1: lexical probabilities only
- IBM2: lexicon plus absolute position
- IBM3: plus fertilities
- IBM4: inverted relative position alignment
- IBM5: non-deficient version of Model 4
- HMM: lexicon plus relative position
[Brown et al. 1993, Vogel et al. 1996, Och et al. 1999]
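As a reference point for the models listed above, the simplest one, IBM Model 1, can be trained with a few lines of EM. This is a toy sketch on the running example, not the GIZA++ implementation used in practice, and the two-sentence corpus is far too small for meaningful probabilities.

```python
from collections import defaultdict

def ibm1(corpus, iterations=10):
    """Toy IBM Model 1 EM training: lexical probabilities t(f|e) only."""
    src_vocab = {w for f_sent, _ in corpus for w in f_sent}
    t = defaultdict(lambda: 1.0 / len(src_vocab))  # uniform initialization
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for f_sent, e_sent in corpus:
            for f in f_sent:
                # E-step: fractional counts from the current posteriors
                z = sum(t[(f, e)] for e in e_sent)
                for e in e_sent:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        # M-step: renormalize per target word
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t

corpus = [("rAma ne ek apple khaya".split(), "Rama ate an apple".split()),
          ("rAma ne khaya".split(), "Rama ate".split())]
t = ibm1(corpus)
# e.g. t[('rAma', 'Rama')] ends up larger than t[('ek', 'Rama')],
# since 'rAma' co-occurs with 'Rama' in both sentence pairs.
```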

Our Approach
Better syntax projection requires better word alignment.
Our hypothesis: word alignment can, in turn, be improved using syntax projection.
- Project base NPs to the TL and obtain a clean NP table
- Perform a constrained alignment on the parallel corpus using the NP table

Why Base NPs?
- NPs are semantically and syntactically cohesive across languages
- NPs show minimal categorial divergence compared to other constituent types
- NPs are the building blocks of a sentence, and their translation improves MT quality [Philipp Koehn, PhD thesis 2003]

Constrained Alignment [PESA: Phrase Pair Extraction as Sentence Splitting, Vogel '05]

Constrained Alignment Ex: Rama ate [an apple] rAma ne [ek apple] khaya

Constrained Alignment Ex: Rama ate [an apple] rAma ne khaya [ek apple]

NP based Bootstrapping: Algorithm
1. Word-align 'S' and 'T' using IBM-4
2. Extract NPs on the source side using the parse
3. Extract NP translations by harvesting the Viterbi alignment
4. Calculate features for the NP pairs
5. Prune based on thresholds
6. Perform constrained word alignment and lexicon extraction using the NP table

Corpus
Source (English):
There are quite a large number of Malayalees living here .
is found in the west coast of Great Nicobar called the Magapod Island .
Plotemy calls them ' Nagadip ' , a Hindu name for naked island
Target (Hindi):
malayAlama logoM kI bahuwa badZI saMKyA hE .
xvIpa ke paScimI wata para sWiwa mEgApOda xvIpa meM pAyA jAwA hE .
plotemI inheM nAgAxvIpa kahawA hE jo ki hiMxI kA Sabxa hE Ora nagna logoM ke lie prayukwa howA hE .

Source side Parsed
There are quite a large number of Malayalees living here .
(S1 (S (NP (EX There)) (VP (AUX are) (NP (NP (PDT quite) (DT a) (JJ large) (NN number)) (PP (IN of) (NP (NP (NNS Malayalees)) (VP (VBG living) (ADVP (RB here))))))) (. .)))
is found in the west coast of Great Nicobar called the Magapod Island .
(S1 (S (VP (AUX is) (VP (VBN found) (PP (IN in) (NP (NP (DT the) (JJ west) (NN coast)) (PP (IN of) (NP (NNP Great) (NNP Nicobar))) (VP (VBN called) (S (NP (DT the) (NNP Magapod) (NNP Island)))))))) (. .)))
Plotemy calls them ' Nagadip ' , a Hindu name for naked island
(S1 (S (NP (NNP Plotemy)) (VP (VBZ calls) (SBAR (S (NP (PRP them)) (VP (POS ') (NP (NP (NNP Nagadip) (POS ')) (, ,) (NP (NP (DT a) (NNP Hindu) (NN name)) (PP (IN for) (NP (JJ naked) (NN island))))))))) (. .)))

Aligned Corpus
;;Sentence id = 1
SL: There are quite a large number of Malayalees living here .
TL: malayAlama logoM kI bahuwa badZI saMKyA hE .
Alignment: ((1,8),(2,7),(11,8),(3,1),(4,8),(5,5),(6,6),(7,3),(8,1),(9,1),(10,1))
;;Sentence id = 2
SL: is found in the west coast of Great Nicobar called the Magapod Island .
TL: xvIpa ke paScimI wata para sWiwa mEgApOda xvIpa meM pAyA jAwA hE .
Alignment: ((1,12),(2,10),(11,2),(12,7),(13,1),(14,13),(3,9),(4,2),(5,4),(6,4),(7,2),(8,7),(9,1),(10,11))

Extract Source NPs
NP:1:There :
NP:1:quite a large number :
NP:1:Malayalees :
NP:2:the west coast :
NP:2:Great Nicobar :
NP:2:the Magapod Island :
NP:6:Plotemy :
NP:6:them :
NP:6:Nagadip ' :
NP:6:a Hindu name :
NP:6:naked island :
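The extraction step above can be sketched as follows, assuming "base NP" means an NP that dominates no other NP in the parse. The bracket parser and helper names here are invented for illustration; in practice one would read the parser's own tree objects.

```python
import re

def parse_tree(s):
    """Parse a Penn Treebank-style bracketing into (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", s)
    def walk(i):
        assert tokens[i] == "("
        label = tokens[i + 1]
        i += 2
        children = []
        while tokens[i] != ")":
            if tokens[i] == "(":
                child, i = walk(i)
                children.append(child)
            else:
                children.append(tokens[i])  # a leaf word
                i += 1
        return (label, children), i + 1
    tree, _ = walk(0)
    return tree

def leaves(node):
    if isinstance(node, str):
        return [node]
    return [w for c in node[1] for w in leaves(c)]

def contains_np(node):
    if isinstance(node, str):
        return False
    label, children = node
    return label == "NP" or any(contains_np(c) for c in children)

def base_nps(node, out=None):
    """Collect NPs that contain no nested NP, in left-to-right order."""
    if out is None:
        out = []
    if isinstance(node, str):
        return out
    label, children = node
    if label == "NP" and not any(contains_np(c) for c in children):
        out.append(" ".join(leaves(node)))
    for c in children:
        base_nps(c, out)
    return out

tree = parse_tree(
    "(S1 (S (NP (EX There)) (VP (AUX are) (NP (NP (PDT quite) (DT a) "
    "(JJ large) (NN number)) (PP (IN of) (NP (NP (NNS Malayalees)) "
    "(VP (VBG living) (ADVP (RB here))))))) (. .)))")
print(base_nps(tree))
# -> ['There', 'quite a large number', 'Malayalees'], matching the slide.
```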

Extract NP translation Pairs
NP:1:There :.
NP:1:quite a large number :malayAlama logoM kI bahuwa badZI saMKyA hE .
NP:1:Malayalees :malayAlama
NP:2:the west coast :ke paScimI wata
NP:2:Great Nicobar :xvIpa ke paScimI wata para sWiwa mEgApOda
NP:2:the Magapod Island :xvIpa ke paScimI wata para sWiwa mEgApOda
NP:6:Plotemy :plotemI
NP:6:them :inheM
NP:6:Nagadip ' :plotemI inheM nAgAxvIpa kahawA hE jo ki hiMxI kA Sabxa hE Ora nagna logoM ke
NP:6:a Hindu name :plotemI inheM nAgAxvIpa kahawA hE jo ki hiMxI kA Sabxa hE Ora nagna logoM ke lie prayukwa howA hE .
NP:6:naked island :plotemI inheM nAgAxvIpa kahawA hE jo ki hiMxI kA Sabxa hE Ora nagna
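The harvesting behavior visible in these pairs (e.g. "quite a large number" pulling in the whole target sentence because its words align all over it) can be sketched as projecting the source NP span through the Viterbi alignment and closing the linked target positions into one contiguous span. `project_np` is a hypothetical helper; the min-to-max closure is an assumption consistent with the examples above, not a quoted implementation.

```python
def project_np(alignment, src_span, tgt_words):
    """Harvest the target phrase for a source NP span (1-based indices).

    Collect every target position linked to the span, then close them
    into the contiguous span [min, max].
    """
    i, j = src_span
    linked = [t for s, t in alignment if i <= s <= j]
    if not linked:
        return ""
    lo, hi = min(linked), max(linked)
    return " ".join(tgt_words[lo - 1:hi])

# Sentence 1 from the "Aligned Corpus" slide:
alignment = [(1, 8), (2, 7), (11, 8), (3, 1), (4, 8), (5, 5), (6, 6),
             (7, 3), (8, 1), (9, 1), (10, 1)]
tgt = "malayAlama logoM kI bahuwa badZI saMKyA hE .".split()

print(project_np(alignment, (8, 8), tgt))  # 'Malayalees' -> 'malayAlama'
print(project_np(alignment, (3, 6), tgt))  # 'quite a large number' -> whole sentence
```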

Feature Extraction for NP Pairs
Features:
- Source length in words
- Target length in words
- Absolute length difference
- Frequency of the source base NP
- Frequency of the target base NP
- Frequency of the S-T pair
- Source-to-target probability
- Target-to-source probability

Calculate Features of NP pairs
NP:1:There:.:1:1:0:2449:2318:258:0.00129355:0.000321318
NP:1:quite a large number:malayAlama logoM kI bahuwa badZI saMKyA hE .:4:8:4:3:1:1:1.95591979786667e-13:0
NP:1:Malayalees:malayAlama:1:1:0:3:2:1:0.614945:0.0935706
NP:2:the west coast:ke paScimI wata:3:3:0:15:2:1:2.40946933496697e-06:5.34403517215648e-11
NP:2:Great Nicobar:xvIpa ke paScimI wata para sWiwa mEgApOda:2:7:5:6:2:1:1.28793968923196e-05:0
NP:2:the Magapod Island:xvIpa ke paScimI wata para sWiwa mEgApOda:3:7:4:1:2:1:2.19930690076524e-06:0
NP:6:Plotemy:plotemI:1:1:0:1:1:1:1:1
NP:6:them:inheM:1:1:0:2153:27:16:0.0168737:0
NP:6:Nagadip ':plotemI inheM nAgAxvIpa kahawA hE jo ki hiMxI kA Sabxa hE Ora nagna logoM ke:2:15:13:1:1:1:3.06461991111111e-05:0
NP:6:a Hindu name:plotemI inheM nAgAxvIpa kahawA hE jo ki hiMxI kA Sabxa hE Ora nagna logoM ke lie prayukwa howA hE .:3:20:17:1:1:1:1.31075474321488e-12:0
NP:6:naked island:plotemI inheM nAgAxvIpa kahawA hE jo ki hiMxI kA Sabxa hE Ora nagna:2:13:11:1:1:1:1.16829204884615e-06:0

Prune based on manual thresholds
NP:1:There:.
NP:1:Malayalees:malayAlama
NP:2:the west coast:ke paScimI wata
NP:6:Plotemy:plotemI
NP:6:them:inheM
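A minimal sketch of this pruning step, assuming each NP pair carries a feature dictionary as in the previous slide. The feature names and threshold values here are invented for illustration; the slides call the real thresholds "manual" without giving them.

```python
def prune(np_pairs, max_len_diff=2, min_s2t_prob=1e-3):
    """Keep only NP pairs whose features clear the hand-set thresholds.

    Hypothetical thresholds: small absolute length difference and a
    non-negligible source-to-target lexical probability.
    """
    kept = []
    for pair in np_pairs:
        feats = pair["features"]
        if (feats["abs_len_diff"] <= max_len_diff
                and feats["s2t_prob"] >= min_s2t_prob):
            kept.append(pair)
    return kept

# Two pairs from the feature slide: a clean one and a noisy over-long one.
pairs = [
    {"np": ("Malayalees", "malayAlama"),
     "features": {"abs_len_diff": 0, "s2t_prob": 0.614945}},
    {"np": ("Great Nicobar", "xvIpa ke paScimI wata para sWiwa mEgApOda"),
     "features": {"abs_len_diff": 5, "s2t_prob": 1.29e-05}},
]
print([p["np"][0] for p in prune(pairs)])  # -> ['Malayalees']
```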

Constrained Alignment: NP Folding
(the words before '|' are the folded NP contents, shown as floating labels on the slide)
Source, folded:
There Malayalees | NP are quite a large number of NP living here .
the west coast | is found in NP of Great Nicobar called the Magapod Island .
them Plotemy | NP calls NP ' Nagadip ' , a Hindu name for naked island .
Target, folded:
. malayAlama | NP logoM kI bahuwa badZI saMKyA NP
ke paScimI wata | xvIpa NP para sWiwa mEgApOda xvIpa meM pAyA jAwA hE .
them plotemI | NP NP nAgAxvIpa kahawA hE jo ki hiMxI kA Sabxa hE Ora nagna logoM ke lie prayukwa howA hE .
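The folding step itself, replacing each surviving NP-table phrase with a single NP token on both sides so the aligner treats it as one unit, can be sketched as plain string replacement. This is a simplification (a real implementation would fold token spans, not substrings) and the helper name is an assumption.

```python
def fold(sentence, phrases, token="NP"):
    """Replace each NP-table phrase in a tokenized sentence with one token.

    Longest phrases are folded first so sub-phrases cannot shadow them.
    Padding with spaces makes the match whole-token only.
    """
    padded = " " + sentence + " "
    for p in sorted(phrases, key=len, reverse=True):
        padded = padded.replace(" " + p + " ", " " + token + " ")
    return padded.strip()

# Source side of sentence 2 with the pruned NP table from the previous slide:
np_table = {"the west coast": "ke paScimI wata", "Malayalees": "malayAlama"}
src = fold("is found in the west coast of Great Nicobar", np_table.keys())
print(src)  # -> 'is found in NP of Great Nicobar'
```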

Experiments
- English-Hindi, resource constrained (5k corpus)
- English-Hindi (55k corpus)

Word Alignment Experiments
Training: 5000 sentences
Testing: 200 sentences
Human-extracted NP table: 21,736 entries

Word Alignment Results (5k corpus)
Experiments with 5k training corpus and 200 test sentences

Experiment        | Prec  | Rec   | F     | AER
Model 4           | 40.20 | 30.96 | 34.98 | 65.12
Model 4 + NPs (1) | 41.33 | 32.12 | 36.15 | 63.85
Model 4 + NPs (2) | 41.23 | 32.26 | 36.02 | 63.88
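The Prec/Rec/AER columns follow the standard alignment evaluation of Och and Ney, computed against gold alignments with sure links S and possible links P (where S is a subset of P). When S equals P, AER reduces to 1 - F; the slide's AER differs slightly from 1 - F, consistent with a sure/possible gold standard. A sketch, with an invented toy gold standard:

```python
def alignment_metrics(A, S, P):
    """Och & Ney alignment metrics for hypothesis A against gold (S, P).

    precision = |A & P| / |A|
    recall    = |A & S| / |S|
    AER       = 1 - (|A & S| + |A & P|) / (|A| + |S|)
    """
    A, S, P = set(A), set(S), set(P)  # P is expected to contain S
    prec = len(A & P) / len(A)
    rec = len(A & S) / len(S)
    aer = 1 - (len(A & S) + len(A & P)) / (len(A) + len(S))
    return prec, rec, aer

# Toy example: hypothesis hits both sure links plus one possible link.
A = {(1, 1), (2, 2), (3, 4)}
S = {(1, 1), (2, 2)}
P = S | {(3, 4)}
print(alignment_metrics(A, S, P))  # -> (1.0, 1.0, 0.0)
```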

NP Projection Results (5k)
Evaluation: 21,736-entry NP table harvested from the 5k test-bed corpus

Iteration | Identified | Accuracy (on gold std) | After Pruning | Accuracy (after pruning)
NPs (1)   | 21693      | 33% (7200)             | 4708          | 61% (2619)
NPs (2)   | 21690      | 33.12% (7200)          | 4819          | 60.8% (2601)

Word Alignment Experiments
Training: 50K Eng-Hin corpus
Testing: 200 Eng-Hin aligned sentences
Human-extracted NP table: 21,736 entries

Word Alignment Results (55k corpus)
Experiments with 55k training corpus and 200 test sentences

Experiment        | Prec  | Rec   | F     | AER
Model 4           | 48.16 | 45.91 | 46.75 | 53.76
Model 4 + NPs (1) | 48.19 | 46.50 | 46.83 | 53.17
Model 4 + NPs (2) | 48.14 | 46.49 | 46.82 | 53.18

NP Projection Results (55k)
Evaluation: 21,736-entry NP table created by human alignment

Iteration | Identified | Precision (on gold std) | After Pruning | Precision (after pruning)
NPs (1)   | 306366     | 38% (8906)              | 88352         | 58.2% (4236)
NPs (2)   | 306294     | 37% (8124)              | 93165         | 59.16%

From here..
Improvements:
- Reliable NP projection
- Hierarchical word alignment
Machine Translation:
- Rule learning
- Refined probabilistic translation lexicon
- Clean, linguistically motivated phrase table with probabilities

Questions?

Thanks !