Symmetric Probabilistic Alignment
Jae Dong Kim
Committee: Jaime G. Carbonell, Ralf D. Brown, Peter J. Jansen

Motivation
- In the CMU EBMT system, alignment has been studied less than the other components.
- We want to investigate a new sub-sentential aligner that uses translation probabilities in a symmetric fashion.

Outline
- Introduction
- Symmetric Probabilistic Alignment
- Experiments and Results
- Conclusions
- Future Work

Aligner in the EBMT

Sub-sentential Alignment
- The CMU EBMT system refers to translation examples to translate an unknown source sentence
- Since it is hard to find an exactly matching example sentence, the system finds the longest match
  - Encapsulated local context
  - Local reordering
- The aligner should work on fragments (sub-sentences)

Need for a new aligner
- Relatively less studied than the other components
- The old aligner
  - Heuristic-based
  - Builds a correspondence table
  - Finds the longest target fragment and the shortest target fragment
  - Checks every substring of the longest fragment that includes the shortest one
  - Fast, but doesn't use probabilities

Related Work
- IBM models (Brown et al., 93)
- HMM (Vogel et al., 96)
- Competitive link (Melamed, 97)
- Explicit syntactic information (Yamada et al., 02)
- ISA (Zhang, 03)
- The SPA differs from the above in that it aligns sub-sentences using translation probabilities and some heuristics when the boundary of the source fragment is given.

Basic Algorithm (1)
- Assumptions:
  - A bilingual probabilistic dictionary is available
  - Contiguous source fragments are translated into contiguous target fragments
  - Fragments are translated independently of surrounding context
- Given and

Basic Algorithm (2)
- Assume that we are considering a candidate target fragment 't2 t3 t4' given a source fragment 's7 s8 s9'
- Source -> Target Translation Score
  S_tmp = max( p(t2|s7), p(t3|s7), p(t4|s7), ε )
        × max( p(t2|s8), p(t3|s8), p(t4|s8), ε )
        × max( p(t2|s9), p(t3|s9), p(t4|s9), ε )
  S_st = S_tmp^{1/3}

Basic Algorithm (3)
- Source <- Target Translation Score
  S_tmp = max( p(s7|t2), p(s8|t2), p(s9|t2), ε )
        × max( p(s7|t3), p(s8|t3), p(s9|t3), ε )
        × max( p(s7|t4), p(s8|t4), p(s9|t4), ε )
  S_ts = S_tmp^{1/3}
- Source <-> Target Translation Score
  Score = S_st × S_ts
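The two directional scores and their product can be sketched in Python. This is a minimal illustration under stated assumptions, not the SPA implementation: the function names, the dictionary layout keyed by (predicted word, given word), and the default ε value are all hypothetical, and the exponent is generalized from the slide's 1/3 to 1/n, where n is the length of the fragment the product runs over.

```python
def directional_score(from_words, to_words, prob, eps=1e-6):
    """Geometric mean, over from_words, of the best p(to | from), floored at eps.

    prob maps (to_word, from_word) -> p(to_word | from_word); this layout is an
    assumption made for the sketch. from_words must be non-empty.
    """
    product = 1.0
    for f in from_words:
        # Best translation probability for this word, never lower than eps.
        product *= max([prob.get((t, f), 0.0) for t in to_words] + [eps])
    return product ** (1.0 / len(from_words))


def symmetric_score(source, target, p_t_given_s, p_s_given_t, eps=1e-6):
    """Score = S_st * S_ts, the product of the two directional scores."""
    s_st = directional_score(source, target, p_t_given_s, eps)  # source -> target
    s_ts = directional_score(target, source, p_s_given_t, eps)  # source <- target
    return s_st * s_ts


# Toy dictionaries (made-up probabilities) for the fragments on the slides.
source = ["s7", "s8", "s9"]
target = ["t2", "t3", "t4"]
p_t_given_s = {("t2", "s7"): 0.5, ("t3", "s8"): 0.5, ("t4", "s9"): 0.5}
p_s_given_t = {("s7", "t2"): 0.5, ("s8", "t3"): 0.5, ("s9", "t4"): 0.5}
score = symmetric_score(source, target, p_t_given_s, p_s_given_t)
```

With these toy values each directional score is the cube root of 0.5 × 0.5 × 0.5, so the combined score is close to 0.25. The ε floor keeps one unseen word pair from driving the whole product to zero.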

Restrictions (1)
- Untranslated word penalty (figure: source fragment s7 s8 s9, target fragment t2 t3 t4)
- Anchor context (figure: source s6 s7 s8 s9 s10, target t1 t2 t3 t4 t5)

Restrictions (2)
- Length penalty
  - "t2 ... t30" for "s7 s8 s9". Realistic?
  - We expect the target fragment length to be proportional to the source fragment length.
- Distance penalty
  - "t45 t46 t47" for "s7 s8 s9". Realistic? Maybe.
  - Between languages with similar word order, we might expect a proportional position.
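The slides do not give formulas for these penalties, so the following Python sketch only illustrates the idea. The exponential form, the function names, and the `ratio` parameter are assumptions made for this example, not the SPA's actual definitions. Each function returns a multiplier in (0, 1] that shrinks as the candidate departs from the proportional expectation.

```python
import math


def length_penalty(src_len, tgt_len, ratio=1.0):
    """Hypothetical penalty: 1.0 when tgt_len equals ratio * src_len, smaller otherwise."""
    expected = ratio * src_len
    return math.exp(-abs(tgt_len - expected) / expected)


def distance_penalty(src_start, src_sent_len, tgt_start, tgt_sent_len):
    """Hypothetical penalty: 1.0 when both fragments start at the same relative position."""
    src_rel = src_start / src_sent_len
    tgt_rel = tgt_start / tgt_sent_len
    return math.exp(-abs(src_rel - tgt_rel))


# "t2 ... t30" (29 words) for a 3-word source fragment is penalized far more
# heavily than a 4-word candidate.
long_candidate = length_penalty(3, 29)
short_candidate = length_penalty(3, 4)

# A target fragment starting at position 45 is penalized more than one starting
# at position 8 when the source fragment starts at position 7 (50-word sentences).
far_candidate = distance_penalty(7, 50, 45, 50)
near_candidate = distance_penalty(7, 50, 8, 50)
```

Multiplying the basic score by such factors would demote implausible candidates without ruling them out, which fits the "Realistic? Maybe." remark about distance.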

The SPA CFD

Combined Aligner
- Set a threshold for the SPA
- The SPA returns only results whose score is higher than the threshold
- For each source fragment
  - If there is a result from the SPA -> use the SPA result
  - Otherwise, use the IBM result
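The fallback rule can be sketched as follows. The data layout is an assumption for illustration: `spa_results` maps a source fragment to a hypothetical (alignment, score) pair, and `ibm_results` maps a source fragment to an alignment.

```python
def combine(source_fragments, spa_results, ibm_results, threshold):
    """Use the SPA alignment when its score exceeds the threshold, else the IBM one."""
    combined = {}
    for fragment in source_fragments:
        spa = spa_results.get(fragment)  # (alignment, score), or None if missing
        if spa is not None and spa[1] > threshold:
            combined[fragment] = spa[0]
        else:
            combined[fragment] = ibm_results.get(fragment)
    return combined


# Toy inputs: the second fragment's SPA score falls below the threshold.
fragments = ["s1 s2", "s7 s8 s9"]
spa = {"s1 s2": ("t1 t2", 0.40), "s7 s8 s9": ("t2 t3 t4", 0.05)}
ibm = {"s1 s2": "t1 t3", "s7 s8 s9": "t2 t4"}
result = combine(fragments, spa, ibm, threshold=0.10)
```

Here the first fragment keeps its SPA alignment and the second falls back to the IBM alignment.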

Alignment Accuracy (1)
- Evaluation metrics
  - F1 (Precision, Recall), based on positions
- Data
  - English-Chinese
  - Xinhua newswire
  - Training data: 1M sentence pairs
    - Trained GIZA++ with default parameters
    - For the SPA, used the dictionary produced by GIZA++
  - Test data:
    - 366 sentence pairs - 3 copies by 3 people
    - 20 more sentence pairs - 1 copy by another
    - words long source fragments

Alignment Accuracy (2)
- Data
  - French-English
  - Canadian Hansard
  - Training data: 1M sentence pairs
    - Trained GIZA++ with default parameters
    - For the SPA, used the dictionary produced by GIZA++
  - Test data
    - 91 sentence pairs
    - words long source fragments

Alignment Accuracy (3)
- Alignments to be compared
  - Random: random alignment to a reasonably long target fragment
  - Positional: alignment to a proportionally positioned target fragment
  - Oracle: the best possible contiguous human alignment
  - SPA-uni: unidirectional basic alignment
  - SPA-basic: bidirectional basic alignment
  - SPA: the best SPA alignment with restrictions
  - IBM4: non-contiguous alignment by IBM Model 4
  - COMB: the combination of SPA and IBM4 alignments
  - SPA-top10: the best of the top 10 alignment results of SPA

Alignment Accuracy: En-Cn
- SPA-basic outperformed SPA-uni
- SPA was the best when we applied the untranslated word penalty and the length penalty
- Our significance test showed that the difference between IBM4 and COMB is significant

Alignment Accuracy: Fr-En
- SPA-basic outperformed SPA-uni
- SPA was the best when we applied all the restrictions
- Our significance test showed that the difference between IBM4 and COMB is not significant

Human Alignment Evaluation Rough idea about how much humans agree on alignment

EBMT Performance (1)
- Data
  - French-English (Canadian Hansard)
  - 20k training sentence pairs
  - Test
    - Development set: 100 sentence pairs
    - 2 reference set: 2 references for 100 source sentences
    - Evaluation set: 10 x 100 sentence pairs
- Evaluation metric
  - BLEU

EBMT Performance (2)
- SPA, IBM4 and COMB perform significantly better than EBMT (the old aligner)
- For 'Test', SPA outperformed EBMT by 28.5%
- Among SPA, IBM4 and COMB, none is significantly better than the others

Conclusions
- Improvement in EBMT performance
- The combined aligner worked best on the English-Chinese set
- Bidirectional alignment worked better than unidirectional alignment

Future Work
- Incorporating human dictionaries to cover more general domains
- Non-contiguous alignment
- Co-training of the SPA and a dictionary
- Experiments on different data sets and different language pairs
- Experiments with different metrics
- Speed-up

References
- Ying Zhang, Stephan Vogel, and Alex Waibel. Integrated Phrase Segmentation and Alignment Model for Statistical Machine Translation. Submitted to Proc. of International Conference on Natural Language Processing and Knowledge Engineering (NLP-KE), 2003, Beijing, China.
- Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2).
- Stephan Vogel, Hermann Ney, and Christoph Tillmann. HMM-based word alignment in statistical translation. In COLING '96: The 16th Int. Conf. on Computational Linguistics, Copenhagen, August.
- I. Dan Melamed. A Word-to-Word Model of Translational Equivalence. In Procs. of the ACL97, Madrid, Spain.
- K. Yamada and K. Knight. A decoder for syntax-based statistical MT. In ACL '02, 2002.

Thank You !! Questions?

Backup Slides
- Alignment Accuracy Calculation
- Non-contiguous Alignment

Alignment Accuracy Calculation
- Human Answer: ... under the unemployment insurance plan of the other country ...
- Machine Answer: ... under the unemployment insurance plan of the other country ...
- Precision: 4/5 = 0.8
- Recall: 4/8 = 0.5
- F1 =
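A position-based precision, recall, and F1 computation can be sketched in Python. The function name and the concrete position sets are hypothetical. They are chosen only to reproduce the slide's counts: 4 machine positions agree with the human answer, out of 5 machine positions and 8 human positions. F1 is taken to be the standard harmonic mean of precision and recall.

```python
def position_f1(human_positions, machine_positions):
    """Precision, recall and F1 over aligned target positions."""
    human, machine = set(human_positions), set(machine_positions)
    correct = len(human & machine)  # positions both answers agree on
    precision = correct / len(machine) if machine else 0.0
    recall = correct / len(human) if human else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative positions: 8 human positions, 5 machine positions, 4 shared.
human = range(2, 10)    # positions 2..9
machine = range(1, 6)   # positions 1..5
precision, recall, f1 = position_f1(human, machine)
```

With these counts the precision is 4/5, the recall is 4/8, and the harmonic mean follows directly from the two.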

Non-contiguous Alignment