Recent Major MT Developments at CMU Briefing for Joe Olive February 5, 2008 Alon Lavie and Stephan Vogel Language Technologies Institute Carnegie Mellon.


Recent Major MT Developments at CMU Briefing for Joe Olive February 5, 2008 Alon Lavie and Stephan Vogel Language Technologies Institute Carnegie Mellon University

February 5, 2008 CMU MT Update for Joe Olive

What’s the Big News?

A novel process for extracting syntax-based translation rules and syntax-based word and phrase translation lexicons from parallel corpora, completely automatically.
When “plugged” into an enhanced version of our Statistical Transfer engine, this creates a complete process for building a syntax-based statistical MT system that can produce translations that are much more fluent and grammatical than current approaches.
This amounts to a complete solution to syntax-driven SMT, akin to phrase-based MT, that can handle highly divergent syntax such as is evident in Chinese-to-English.

Where are we with this?

We conducted a highly intensive effort over the last four months to put this together from scratch in time for the GALE retest on Chinese-to-English.
We ran out of time: only a preliminary version of the system, with partial resources and without proper tuning, was ready for the retest.
We ran the system for the retest and provided the output to IBM, but performance on dev data was disappointing and it did not contribute much to the system combination for the retest.
Significant performance issues remain to be solved.
There have been some improvements over the last couple of weeks, plus a novel way of integrating our syntax-based translations with “standard” SMT phrases via joint decoding.
Stand-alone, we are still significantly behind SMT, but we expect things to improve dramatically within the next few months and, with joint decoding, to surpass the state of the art very soon.

Main Ideas

Phrase-based SMT
  –Start with a sentence-parallel corpus
  –Perform word alignment to find likely word correspondences
  –Then perform phrase extraction to find likely phrase-to-phrase correspondences
Incorrect word alignments lead to many incorrect phrase translations, and there is no real information to drive how the phrase pairs should be put together during translation.
Our syntax-based extraction process is based on parsing the two sides of the parallel data
  –Find and extract sub-sentential syntactic constituents that correspond to each other
  –The parsing information helps us get the phrase “boundaries” right
  –The phrases come out with syntactic constituent “labels” (e.g. NP::NP) that indicate their syntactic function
  –We also extract synchronous syntax rules that indicate how to correctly put together the constituents during translation (syntactic reordering)
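One common way to decide whether a source constituent and a target constituent "correspond to each other" is to check them against the word alignment: the pair is accepted only if no alignment link crosses the boundary of either span. The sketch below illustrates that criterion only. It is not the Constituent Aligner described in these slides, whose algorithm is not given here; the function name `spans_consistent` and the toy alignment are assumptions made for illustration.

```python
from typing import Set, Tuple

Span = Tuple[int, int]  # inclusive (start, end) word indices of a parse-tree node


def spans_consistent(src: Span, tgt: Span, links: Set[Tuple[int, int]]) -> bool:
    """Return True if every alignment link is either fully inside or fully
    outside the (src, tgt) span pair, and at least one link is inside."""
    inside = 0
    for s, t in links:
        s_in = src[0] <= s <= src[1]
        t_in = tgt[0] <= t <= tgt[1]
        if s_in != t_in:  # a link crosses the boundary of one of the spans
            return False
        if s_in:
            inside += 1
    return inside > 0


# Hypothetical word alignment for: 对 台 贸易  ->  trade to taiwan
# (source index, target index): 对-to, 台-taiwan, 贸易-trade
links = {(0, 1), (1, 2), (2, 0)}

print(spans_consistent((0, 1), (1, 2), links))  # 对 台 vs. "to taiwan"
print(spans_consistent((0, 1), (0, 1), links))  # 对 台 vs. "trade to"
```

The first pair is accepted because both words of 对 台 align inside "to taiwan"; the second is rejected because 台 aligns to "taiwan", which lies outside "trade to". Restricting the candidate spans to parse-tree nodes is what gives the extracted phrases their syntactic labels.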

Main Ideas

Automatic Process for Extracting Syntax-driven Rules and Lexicons:
1. Start with a word-aligned parallel corpus
2. Parse the sentences independently for both languages
3. Run our new Constituent Aligner over the parsed sentence pairs
4. Extract all aligned constituents from the parallel trees
5. Extract all derived synchronous transfer rules from the constituent-aligned parallel trees
6. Construct a “data-base” of all extracted parallel constituents and synchronous rules with their frequencies and model them statistically (assign them relative-likelihood probabilities)
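Step 6 can be pictured as plain relative-frequency estimation over the extracted entries. The sketch below is a minimal illustration under that assumption: it counts (label, source, target) entries and returns, for each one, the target-given-source and source-given-target estimates. The function name, the entry layout, and the toy counts are all hypothetical, and the actual statistical model used in the system may differ.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

Entry = Tuple[str, str, str]  # (constituent label, source phrase, target phrase)


def relative_frequencies(entries: Iterable[Entry]) -> Dict[Entry, Tuple[float, float]]:
    """Map each distinct entry to (P(target | source), P(source | target)),
    both estimated as relative frequencies within the same label."""
    counts = Counter(entries)
    src_totals: Dict[Tuple[str, str], int] = defaultdict(int)
    tgt_totals: Dict[Tuple[str, str], int] = defaultdict(int)
    for (label, src, tgt), n in counts.items():
        src_totals[(label, src)] += n
        tgt_totals[(label, tgt)] += n
    return {
        (label, src, tgt): (n / src_totals[(label, src)], n / tgt_totals[(label, tgt)])
        for (label, src, tgt), n in counts.items()
    }


# Hypothetical extraction output (invented counts, for illustration only)
extracted = (
    [("NP", "邦交", "diplomatic relations")] * 3
    + [("NP", "邦交", "diplomatic ties")]
    + [("NP", "外交 关系", "diplomatic relations")]
)
probs = relative_frequencies(extracted)
print(probs[("NP", "邦交", "diplomatic relations")])
```

The same counting applies to synchronous rules if each rule is treated as an entry keyed by its source and target right-hand sides.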

Example

Translation Example

SrcSent 3 澳洲是与北韩有邦交的少数国家之一。
Gloss: Australia is with north korea have diplomatic relations DE few country world
Reference: Australia is one of the few countries that have diplomatic relations with North Korea.
Translation: Australia is one of the few countries that has diplomatic relations with north korea.
Overall: , Prob: , Rules: , TransSGT: , TransTGS: , Frag: , Length: , Words: 11,15
( 0 10 "Australia is one of the few countries that has diplomatic relations with north korea" " 澳洲 是 与 北韩 有 邦交 的 少数 国家 之一 " "(S1, (S, (NP,2 (NB,1 (LDC_N,1267 'Australia') ) ) (VP, (MISC_V,1 'is') (NP, (LITERAL 'one') (LITERAL 'of') (NP, (NP, (NP,1 (LITERAL 'the') (NUMNB,2 (LDC_NUM,420 'few') (NB,1 (WIKI_N,62230 'countries') ) ) ) (LITERAL 'that') (VP, (LITERAL 'has') (FBIS_NP,11916 'diplomatic relations') ) ) (FBIS_PP,84791 'with north korea') ) ) ) ) ) ")
( "." " 。 " "(MISC_PUNC,20 '.')")

Example: Syntactic Phrases

(LDC_N,1267 'Australia')
(WIKI_N,62230 'countries')
(FBIS_NP,11916 'diplomatic relations')
(FBIS_PP,84791 'with north korea')

Example: XFER Rules

;;SL::(2,4) 对 台 贸易
;;TL::(3,5) trade to taiwan
;;Score::22
{NP, }
NP::NP [PP NP ] -> [NP PP ]
((*score* ) (X2::Y1) (X1::Y2))

;;SL::(2,7) 直接 提到 伟 哥 的 广告
;;TL::(1,7) commercials that directly mention the name viagra
;;Score::5
{NP, }
NP::NP [VP " 的 " NP ] -> [NP "that" VP ]
((*score* ) (X3::Y1) (X1::Y3))

;;SL::(4,14) 有 一 至 多 个 高 新 技术 项目 或 产品
;;TL::(3,14) has one or more new, high level technology projects or products
;;Score::4
{VP, }
VP::VP [" 有 " NP ] -> ["has" NP ]
((*score* 0.1) (X2::Y2))
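The constituent alignments in these rules (for example X2::Y1, X1::Y2) say which source-side constituent fills each target-side slot, which is how reordering is expressed. The following sketch shows that mechanism in isolation, assuming the child constituents have already been translated. It is a simplified illustration, not the Stat-XFER engine: the function `apply_rule`, its argument layout, and the way the example phrases are split into children are assumptions.

```python
from typing import Dict, List, Tuple


def apply_rule(
    target_rhs: List[str],
    alignments: List[Tuple[int, int]],
    translated_children: Dict[int, str],
) -> str:
    """Build the target string for one rule application.

    target_rhs: target-side symbols; quoted strings are literals.
    alignments: (i, j) pairs meaning Xi::Yj, with 1-based positions.
    translated_children: source position i -> translation of constituent Xi.
    """
    y_to_x = {y: x for x, y in alignments}
    out = []
    for j, symbol in enumerate(target_rhs, start=1):
        if j in y_to_x:
            out.append(translated_children[y_to_x[j]])  # aligned slot
        else:
            out.append(symbol.strip('"'))  # unaligned literal, emitted as-is
    return " ".join(out)


# NP::NP [PP NP] -> [NP PP] with (X2::Y1) (X1::Y2)
print(apply_rule(["NP", "PP"], [(2, 1), (1, 2)],
                 {1: "to taiwan", 2: "trade"}))

# NP::NP [VP "的" NP] -> [NP "that" VP] with (X3::Y1) (X1::Y3)
print(apply_rule(["NP", '"that"', "VP"], [(3, 1), (1, 3)],
                 {1: "directly mention the name viagra", 3: "commercials"}))
```

With these assumed child translations, the two calls reproduce the target sides listed in the rule headers above: "trade to taiwan" and "commercials that directly mention the name viagra".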

New Idea: Joint Decoding

Our Stat-XFER runtime system operates in two “passes”:
  –Lattice Construction: bottom-up search creates hierarchical syntax-driven constituent translations driven by the transfer rules
  –Monotonic Decoder puts these pieces together into complete translations
Currently we do system combination with other systems at the level of 1-best or n-best lists.
New Idea: Use the lattice output of the first pass, combine it with standard phrases from SMT, and then jointly decode the lattice.
We tried this for the first time last week for NIST MT Eval on Urdu, and this looks promising.
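The joint-decoding idea can be sketched as pooling two sets of lattice edges over the same source sentence and running one monotonic search across the union. The code below is a toy illustration under strong assumptions: each edge carries a single made-up log score, there is no language model or feature weighting, and the names `Edge` and `decode_monotonic` are hypothetical rather than taken from the Stat-XFER system.

```python
from typing import List, Optional, Tuple

# (start, end, target text, log score); the edge covers source words start..end-1
Edge = Tuple[int, int, str, float]


def decode_monotonic(n: int, edges: List[Edge]) -> Optional[Tuple[float, str]]:
    """Best-scoring left-to-right concatenation of edges covering words 0..n-1."""
    best: List[Optional[Tuple[float, List[str]]]] = [None] * (n + 1)
    best[0] = (0.0, [])
    for end in range(1, n + 1):
        for start, e_end, text, score in edges:
            if e_end != end or not (0 <= start < end) or best[start] is None:
                continue
            cand = best[start][0] + score
            if best[end] is None or cand > best[end][0]:
                best[end] = (cand, best[start][1] + [text])
    if best[n] is None:
        return None  # no sequence of edges covers the whole sentence
    return best[n][0], " ".join(best[n][1])


# Source: 对 台 贸易 (3 words). All scores are invented for illustration.
syntax_edges: List[Edge] = [(0, 3, "trade to taiwan", -0.8)]
phrase_edges: List[Edge] = [
    (0, 1, "to", -0.3),
    (1, 2, "taiwan", -0.4),
    (0, 2, "to taiwan", -0.5),
    (2, 3, "trade", -0.4),
]

print(decode_monotonic(3, phrase_edges))                 # phrases only
print(decode_monotonic(3, syntax_edges + phrase_edges))  # joint lattice
```

In this toy case the phrase edges alone can only be concatenated in source order, giving "to taiwan trade", while the joint lattice lets the search pick the reordered syntax-driven edge. It also shows how phrase edges could fill spans for which no syntactic constituent translation exists.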

Major Open Research Issues

Our stand-alone system is still significantly behind phrase-based SMT. Why?
  –Weaker decoder?
  –Feature set is not sufficiently discriminative?
  –Problems with the parsers for the two sides?
  –Syntactic constituents don’t provide sufficient coverage? [joint decoding should help]
  –Bugs and deficiencies in the underlying algorithms?
Significant engineering issues remain: improving speed, making runtime processing more efficient, and improving search.