1
Recent Major MT Developments at CMU
Briefing for Joe Olive
February 5, 2008
Alon Lavie and Stephan Vogel
Language Technologies Institute
Carnegie Mellon University
2
What’s the Big News?
A new process for extracting syntax-based translation rules and syntax-based word and phrase translation lexicons from parallel corpora, completely automatically.
When “plugged” into an enhanced version of our Statistical Transfer engine, this yields a complete process for building a syntax-based statistical MT system that can produce translations that are much more fluent and grammatical than those of current approaches.
This amounts to a complete solution to syntax-driven SMT, akin to phrase-based MT, that can handle highly divergent syntax such as is evident in Chinese-to-English.
3
Where are we with this?
We conducted a highly intensive effort over the last four months to put this together from scratch in time for the GALE retest on Chinese-to-English.
We ran out of time: only a preliminary version of the system, with partial resources and without proper tuning, was ready for the retest.
We ran the system for the retest and provided the output to IBM, but performance on dev data was disappointing and it did not contribute much to the system combination for the retest.
Significant performance issues remain to be solved.
There have been some improvements over the last couple of weeks, plus a new way of integrating our syntax-based translations with “standard” SMT phrases via joint decoding.
Stand-alone, we are still significantly behind SMT, but we expect things to improve dramatically within the next few months and, with joint decoding, to surpass the state of the art very soon.
4
Main Ideas
Phrase-based SMT:
– Start with a sentence-parallel corpus
– Perform word alignment to find likely word correspondences
– Then perform phrase extraction to find likely phrase-to-phrase correspondences
Incorrect word alignments lead to many incorrect phrase translations, and there is no real information to guide how the phrase pairs should be put together during translation.
Our syntax-based extraction process is based on parsing the two sides of the parallel data:
– Find and extract sub-sentential syntactic constituents that correspond to each other
– The parsing information helps us get the phrase “boundaries” right
– The phrases come out with syntactic constituent “labels” (e.g. NP::NP) that indicate their syntactic function
– We also extract synchronous syntax rules that indicate how to correctly put together the constituents during translation (syntax reordering)
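The idea of finding constituents that correspond to each other can be illustrated with a simple consistency check: a source constituent and a target constituent are paired when at least one word-alignment link joins them and no link crosses the boundary of either span. This is a minimal sketch under that assumption, not the authors' aligner. The function names, the span encoding, and the toy word alignment for 对 台 贸易 / "trade to taiwan" are hypothetical.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, constituent label)
Link = Tuple[int, int]       # (source word index, target word index)


def consistent(src: Span, tgt: Span, links: Set[Link]) -> bool:
    """True if some link joins the two spans and no link leaves only one of them."""
    s0, s1, _ = src
    t0, t1, _ = tgt
    joined = False
    for s, t in links:
        in_src = s0 <= s < s1
        in_tgt = t0 <= t < t1
        if in_src != in_tgt:      # link crosses a span boundary
            return False
        if in_src and in_tgt:
            joined = True
    return joined


def align_constituents(src_spans: List[Span], tgt_spans: List[Span],
                       links: Set[Link]) -> List[Tuple[Span, Span]]:
    """Pair every source constituent with every consistent target constituent."""
    return [(s, t) for s in src_spans for t in tgt_spans if consistent(s, t, links)]


# Toy data (assumed): 对=0 台=1 贸易=2 ; trade=0 to=1 taiwan=2
links = {(0, 1), (1, 2), (2, 0)}
src_spans = [(0, 2, "PP"), (2, 3, "NP"), (0, 3, "NP")]
tgt_spans = [(0, 1, "NP"), (1, 3, "PP"), (0, 3, "NP")]

pairs = align_constituents(src_spans, tgt_spans, links)
for s, t in pairs:
    print(f"{s[2]}::{t[2]}  src{s[:2]} <-> tgt{t[:2]}")
```

On this toy input the check pairs the source PP with the target PP, the inner NPs with each other, and the two full NPs, which is the kind of labeled, boundary-respecting phrase pair the slide describes.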
5
Main Ideas
Automatic Process for Extracting Syntax-driven Rules and Lexicons:
1. Start with word-aligned parallel corpus
2. Parse the sentences independently for both languages
3. Run our new Constituent Aligner over the parsed sentence pairs
4. Extract all aligned constituents from the parallel trees
5. Extract all derived synchronous transfer rules from the constituent-aligned parallel trees
6. Construct a “data-base” of all extracted parallel constituents and synchronous rules with their frequencies and model them statistically (assign them relative-likelihood probabilities)
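Step 6 can be sketched as plain relative-frequency estimation over the extracted items. The sketch below assumes the probability of a rule is its count divided by the total count of all rules sharing its source side; the slides do not say what the counts are normalized over, so that conditioning is an assumption, and the function name and toy counts are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

Rule = Tuple[str, str]  # (source side, target side)


def relative_frequencies(rules: Iterable[Rule]) -> Dict[Rule, float]:
    """Estimate P(target side | source side) by relative frequency."""
    counts = Counter(rules)
    source_totals: Dict[str, int] = defaultdict(int)
    for (src, _tgt), c in counts.items():
        source_totals[src] += c
    return {rule: c / source_totals[rule[0]] for rule, c in counts.items()}


# Toy counts (invented for illustration): one reordering rule seen 22 times,
# a competing monotone rule with the same source side seen twice.
observed = [("[PP NP]", "[NP PP]")] * 22 + [("[PP NP]", "[PP NP]")] * 2
probs = relative_frequencies(observed)
for rule, p in probs.items():
    print(rule, round(p, 4))
```

With these invented counts the reordering rule gets 22/24, which happens to match the form of the score shown for the Score::22 rule in the XFER rule example, though the slides do not state the denominator actually used.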
6
Example
7
Translation Example
SrcSent 3 澳洲是与北韩有邦交的少数国家之一。
Gloss: Australia is with north korea have diplomatic relations DE few country world
Reference: Australia is one of the few countries that have diplomatic relations with North Korea.
Translation: Australia is one of the few countries that has diplomatic relations with north korea.
Overall: -5.77439, Prob: -2.58631, Rules: -0.66874, TransSGT: -2.58646, TransTGS: -1.52858, Frag: -0.0413927, Length: -0.127525, Words: 11,15
( 0 10 "Australia is one of the few countries that has diplomatic relations with north korea" -5.66505 " 澳洲 是 与 北韩 有 邦交 的 少数 国家 之一 " "(S1,1124731 (S,1157857 (NP,2 (NB,1 (LDC_N,1267 'Australia') ) ) (VP,1046077 (MISC_V,1 'is') (NP,1077875 (LITERAL 'one') (LITERAL 'of') (NP,1045537 (NP,1017929 (NP,1 (LITERAL 'the') (NUMNB,2 (LDC_NUM,420 'few') (NB,1 (WIKI_N,62230 'countries') ) ) ) (LITERAL 'that') (VP,1021811 (LITERAL 'has') (FBIS_NP,11916 'diplomatic relations') ) ) (FBIS_PP,84791 'with north korea') ) ) ) ) ) ")
( 10 11 "." -11.9549 " 。 " "(MISC_PUNC,20 '.')")
8
Example: Syntactic Phrases
(LDC_N,1267 'Australia')
(WIKI_N,62230 'countries')
(FBIS_NP,11916 'diplomatic relations')
(FBIS_PP,84791 'with north korea')
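The phrase entries listed here are leaves of the bracketed derivation tree in the translation example. As a rough illustration of how that output can be read, the sketch below pulls the quoted leaves out of the tree string to rebuild the translation, and lists the leaves that carry a label and a numeric id. It relies only on the surface format visible on the slide; the regular expressions and variable names are assumptions, not part of the Stat-XFER tooling.

```python
import re

# Derivation tree copied from the translation example slide.
TREE = (
    "(S1,1124731 (S,1157857 (NP,2 (NB,1 (LDC_N,1267 'Australia') ) ) "
    "(VP,1046077 (MISC_V,1 'is') (NP,1077875 (LITERAL 'one') (LITERAL 'of') "
    "(NP,1045537 (NP,1017929 (NP,1 (LITERAL 'the') (NUMNB,2 (LDC_NUM,420 'few') "
    "(NB,1 (WIKI_N,62230 'countries') ) ) ) (LITERAL 'that') "
    "(VP,1021811 (LITERAL 'has') (FBIS_NP,11916 'diplomatic relations') ) ) "
    "(FBIS_PP,84791 'with north korea') ) ) ) ) )"
)

# Every quoted string is a leaf; reading them left to right gives the output.
translation = " ".join(re.findall(r"'([^']*)'", TREE))

# Leaves of the form (LABEL,id 'text') are lexicon entries; LITERAL leaves
# have no id and are not matched by this pattern.
entries = [
    (label, int(ident), text)
    for label, ident, text in re.findall(r"\((\w+),(\d+) '([^']*)'\)", TREE)
]

print(translation)
for entry in entries:
    print(entry)
```

The label prefixes (LDC, WIKI, FBIS, MISC) appear to name the resource each entry came from, but the slides do not spell this out.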
9
Example: XFER Rules
;;SL::(2,4) 对 台 贸易
;;TL::(3,5) trade to taiwan
;;Score::22
{NP,1045537}
NP::NP [PP NP ] -> [NP PP ]
((*score* 0.916666666666667)
(X2::Y1)
(X1::Y2))

;;SL::(2,7) 直接 提到 伟 哥 的 广告
;;TL::(1,7) commercials that directly mention the name viagra
;;Score::5
{NP,1017929}
NP::NP [VP " 的 " NP ] -> [NP "that" VP ]
((*score* 0.111111111111111)
(X3::Y1)
(X1::Y3))

;;SL::(4,14) 有 一 至 多 个 高 新 技术 项目 或 产品
;;TL::(3,14) has one or more new, high level technology projects or products
;;Score::4
{VP,1021811}
VP::VP [" 有 " NP ] -> ["has" NP ]
((*score* 0.1)
(X2::Y2))
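The constituent alignments in these rules (for example X2::Y1) say which source-side child supplies each target-side slot. A minimal sketch of applying one rule, once its children have been translated, might look like the following. The function name, the data layout, and the split of each example into child translations are assumptions made for illustration, not the engine's internal representation.

```python
from typing import Dict, List


def apply_rule(target_side: List[str],
               alignments: Dict[int, int],
               child_translations: Dict[int, str]) -> str:
    """Build the target string for one rule application.

    target_side        -- target items in order: a label such as 'NP', or a
                          double-quoted literal such as '"that"'
    alignments         -- target position j -> source position i (from Xi::Yj),
                          both 1-based as in the rule notation
    child_translations -- source position i -> translation of that child
    """
    out = []
    for j, item in enumerate(target_side, start=1):
        if item.startswith('"'):
            out.append(item.strip('"'))            # literal target word
        else:
            out.append(child_translations[alignments[j]])
    return " ".join(out)


# NP::NP [PP NP] -> [NP PP] with (X2::Y1) (X1::Y2)
np_pp = apply_rule(["NP", "PP"], {1: 2, 2: 1},
                   {1: "to taiwan", 2: "trade"})

# NP::NP [VP "的" NP] -> [NP "that" VP] with (X3::Y1) (X1::Y3)
relative = apply_rule(["NP", '"that"', "VP"], {1: 3, 3: 1},
                      {1: "directly mention the name viagra", 3: "commercials"})

print(np_pp)
print(relative)
```

The first call reproduces the reordering in the 对 台 贸易 example, where the source PP precedes the NP but the English PP follows it.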
10
New Idea: Joint Decoding
Our Stat-XFER runtime system operates in two “passes”:
– Lattice Construction: a bottom-up search creates hierarchical syntax-driven constituent translations driven by the transfer rules
– Monotonic Decoder: puts these pieces together into complete translations
Currently we do system combination with other systems at the level of 1-best or n-best lists.
New Idea: use the lattice output of the first pass, combine it with standard phrases from SMT, then jointly decode the lattice.
We tried this for the first time last week for the NIST MT Eval on Urdu, and it looks promising.
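As a rough sketch of the joint decoding idea, the lattice entries from the first pass and ordinary SMT phrase pairs can be pooled as scored edges over source positions, with a monotone dynamic program choosing the best-scoring sequence that covers the sentence. The two syntax edges below are copied from the translation example; the SMT edges and their scores are invented for illustration. Treating scores from the two sources as directly comparable is also an assumption, since a real system would need tuned feature weights.

```python
from typing import List, Optional, Tuple

Edge = Tuple[int, int, str, float]  # (start, end, translation, log score)


def monotone_decode(edges: List[Edge], n: int) -> Tuple[float, List[str]]:
    """Best-scoring left-to-right cover of source positions 0..n."""
    best: List[Optional[Tuple[float, List[str]]]] = [None] * (n + 1)
    best[0] = (0.0, [])
    for end in range(1, n + 1):
        for start, edge_end, text, score in edges:
            if edge_end != end or start >= end or best[start] is None:
                continue
            candidate = best[start][0] + score
            if best[end] is None or candidate > best[end][0]:
                best[end] = (candidate, best[start][1] + [text])
    if best[n] is None:
        raise ValueError("no sequence of edges covers the whole sentence")
    return best[n]


# Edges from the syntax-based lattice (scores as shown on the slide).
syntax_edges: List[Edge] = [
    (0, 10, "Australia is one of the few countries that has "
            "diplomatic relations with north korea", -5.66505),
    (10, 11, ".", -11.9549),
]
# Hypothetical SMT phrase edges; texts and scores are invented.
smt_edges: List[Edge] = [
    (0, 1, "Australia", -1.0),
    (1, 10, "is with north korea has diplomatic relations few countries", -6.0),
    (10, 11, ".", -0.5),
]

score, pieces = monotone_decode(syntax_edges + smt_edges, 11)
print(score, " ".join(pieces))
```

In this toy case the decoder keeps the syntax-based translation for the main clause and takes the punctuation from the SMT phrase table, which is the kind of mixing that combining systems only at the 1-best or n-best level cannot do.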
11
Major Open Research Issues
Our stand-alone system is still significantly behind phrase-based SMT. Why?
– Weaker decoder?
– Feature set is not sufficiently discriminative?
– Problems with the parsers for the two sides?
– Syntactic constituents don’t provide sufficient coverage? [joint decoding should help]
– Bugs and deficiencies in the underlying algorithms?
Significant engineering issues remain: improving speed, runtime efficiency, and search.