CMU Statistical-XFER System
Hybrid “rule-based”/statistical system
Scaled up version of our XFER approach developed for low-resource languages
Large-coverage “clean” bilingual lexicon + syntactic transfer rules (human written + extracted from data)
XFER formalism is a Synchronous CFG + feature unification constraints
Supports morphological analysis and generation as “plug in” components
Two-stage translation process:
–Build lattice of translation fragments at all levels “bottom-up”
–Monotonic decoder selects best combination of lattice edges
–Beam-search with multiple features at both stages
–Features include: LM, fragmentation, length, …
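The second stage (a monotonic decoder choosing the best combination of lattice edges) can be illustrated with a minimal sketch. It is not the actual XFER decoder: exact dynamic programming stands in for beam search, and the `Edge` type, the single log-score per edge, and the `frag_penalty` value are assumptions made for illustration. Each edge covers a contiguous source span, and every edge used costs a fixed penalty, so outputs built from fewer, larger fragments are preferred.

```python
from typing import List, NamedTuple, Optional, Tuple


class Edge(NamedTuple):
    """One lattice edge (hypothetical layout, not the XFER data structure)."""
    start: int    # index of the first source word covered
    end: int      # index one past the last source word covered
    target: str   # translation fragment for that span
    score: float  # log-score of the fragment (e.g. rule + LM scores combined)


def decode_monotonic(edges: List[Edge], n: int,
                     frag_penalty: float = 1.0) -> Tuple[float, List[Edge]]:
    """Pick the best left-to-right sequence of edges covering source words 0..n.

    Every edge used costs `frag_penalty`, which plays the role of a
    fragmentation feature.
    """
    neg_inf = float("-inf")
    best = [neg_inf] * (n + 1)           # best[i]: best score covering words 0..i
    back: List[Optional[Edge]] = [None] * (n + 1)
    best[0] = 0.0
    for pos in range(1, n + 1):
        for e in edges:
            # Edges are assumed to satisfy start < end, so best[e.start] is final.
            if e.end == pos and best[e.start] > neg_inf:
                cand = best[e.start] + e.score - frag_penalty
                if cand > best[pos]:
                    best[pos] = cand
                    back[pos] = e
    if best[n] == neg_inf:
        raise ValueError("lattice has no path covering the whole sentence")
    # Follow back-pointers from the end of the sentence to recover the path.
    path: List[Edge] = []
    pos = n
    while pos > 0:
        e = back[pos]
        path.append(e)
        pos = e.start
    path.reverse()
    return best[n], path


# Toy lattice over a three-word source sentence (all scores invented).
toy_edges = [
    Edge(0, 1, "THE SCIENTISTS", -1.0),
    Edge(1, 2, "COMPLETED", -1.0),
    Edge(2, 3, "SEQUENCING", -1.0),
    Edge(0, 3, "THE SCIENTISTS COMPLETED SEQUENCING", -3.5),
]
score, path = decode_monotonic(toy_edges, n=3, frag_penalty=1.0)
print(score, " ".join(e.target for e in path))
```

With the penalty set to 1.0 the single full-sentence edge wins; with the penalty set to 0 the three word-level edges win, which shows how a fragmentation feature shifts the decoder toward larger fragments.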
Chinese-English S-XFER System
Bilingual lexicon: over 1.1 million entries (multiple resources, incl. ADSO)
Manual syntactic xfer grammar: 65 rules! (mostly NPs and reordering of NPs/PPs)
Multiple overlapping Chinese word segmentations
English morphology generation
Uses CMU’s Suffix-Array LM toolkit for LM
Current Performance (GALE dev-test):
–NW: 14.04(B)/0.4825(M), UMD: 30.29(B)
–NG: 7.92(B), UMD: 9.82(B)
–WL: 5.40(B)/0.3022(M), UMD: 6.30(B)
Integration: provides n-best lists (combination/rescoring)
In Progress:
–Additional features for decoding + MERT
–Automatic extraction of “clean” NPs from parallel data
–Automatic extraction of xfer-rules from parallel data
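One way to read "multiple overlapping Chinese word segmentations" is that every lexicon match over the character string is kept as a candidate span, rather than committing to a single segmentation up front. The sketch below shows that idea only; the function name, the `max_len` cutoff, and the four-entry toy lexicon are assumptions, not the system's actual segmenter or lexicon.

```python
from typing import List, Set, Tuple


def overlapping_segments(sentence: str, lexicon: Set[str],
                         max_len: int = 4) -> List[Tuple[int, int, str]]:
    """Return every (start, end, word) span whose characters form a lexicon entry.

    Spans may overlap; each one can become a separate lattice edge.
    """
    spans = []
    for i in range(len(sentence)):
        # Try every substring starting at i, up to max_len characters long.
        for j in range(i + 1, min(len(sentence), i + max_len) + 1):
            word = sentence[i:j]
            if word in lexicon:
                spans.append((i, j, word))
    return spans


# Toy lexicon (hypothetical entries chosen to force overlaps).
toy_lexicon = {"科学", "科学家", "学家", "家"}
for span in overlapping_segments("科学家", toy_lexicon):
    print(span)
```

Keeping all spans defers the segmentation decision to the decoder, which can score competing segmentations with the same features it uses for everything else.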
Chinese-English Example - Before
THE SCIENTISTS IN ORDER TO 攸 TO CLOSE IN THE EARLY PERIOD TO GO THE THE KNOWLEDGE THE THE DISEASE IN THE CHROMOSOME HAS BEEN COMPLETED IS SCHEDULED TO ORDER
Overall: , Prob: , Rules: , Frag: 0.4, Length: , Words: 13,
Chinese-English Example - After
SrcSent 0 科学家为攸关初期失智症的染色体完成定序
0 0 THE SCIENTISTS COMPLETED SEQUENCING FOR THE CHROMOSOMES WHICH RELATED TO THE INITIAL STAGE DEMENTIA
Overall: , Prob: , Rules: , Frag: 0, Length: , Words: 8,
< : 科学家 为 攸关 初期 失智症 的 染色体 完成 定序
(S,1
  (NP,1 (LITERAL 'THE') (NB,1 (N,21601 'SCIENTISTS')))
  (VP,4
    (VP,1 (V,7513 'COMPLETED') (NP,2 (NB,1 (N, 'SEQUENCING'))))
    (PP,1 (PREP,5 'FOR')
      (NPRC,1
        (NP,1 (LITERAL 'THE') (NB,1 (N, 'CHROMOSOMES')))
        (LITERAL 'WHICH')
        (VP,1 (V,18 'RELATED TO')
          (NPASSOC,5
            (NP,1 (LITERAL 'THE') (NB,1 (N,7637 'INITIAL STAGE')))
            (NP,2 (NB,1 (N,445 'DEMENTIA')))))))))>
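The bracketed structure in the "After" output is a derivation tree whose quoted leaves, read left to right, spell out the translation. A small sketch of reading the leaves off such a tree follows; the function name and the regular-expression approach are assumptions for illustration, not part of the XFER tools, and the missing rule numbers in `(N, '…')` are left exactly as they appear in the output.

```python
import re
from typing import List

# Derivation tree copied from the example output above.
TREE = (
    "(S,1 (NP,1 (LITERAL 'THE') (NB,1 (N,21601 'SCIENTISTS'))) "
    "(VP,4 (VP,1 (V,7513 'COMPLETED')(NP,2 (NB,1 (N, 'SEQUENCING')))) "
    "(PP,1 (PREP,5 'FOR')(NPRC,1 (NP,1 (LITERAL 'THE') "
    "(NB,1 (N, 'CHROMOSOMES'))) (LITERAL 'WHICH') "
    "(VP,1 (V,18 'RELATED TO') (NPASSOC,5 (NP,1 (LITERAL 'THE') "
    "(NB,1 (N,7637 'INITIAL STAGE'))) (NP,2 (NB,1 (N,445 'DEMENTIA')))))))))"
)


def tree_leaves(tree: str) -> List[str]:
    """Return the single-quoted leaf strings of a derivation tree, in order."""
    return re.findall(r"'([^']*)'", tree)


def tree_yield(tree: str) -> str:
    """Join the leaves with spaces to recover the target-side string."""
    return " ".join(tree_leaves(tree))


print(tree_yield(TREE))
```

The joined leaves match the translation line shown in the example, which is a quick consistency check between the tree and the reported output.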
MEMT – Main Activities
Preserving Source Alignments: target phrases that originate from same source word can be marked as unbreakable units (performance effects under testing…)
LM experiments using CMU’s Suffix-Array LM toolkit and new features (work still in progress…)
Case Restoration: scheme for selecting the case of words in final MEMT output
Improved tokenization and handling of punctuation
Handling of varying number of MEMT input engines
Upgrades to MEMT software infrastructure to support IOD-2 requirements, GTS 1.0 and UIMA v1.4
MEMT server is up 24/7 for ongoing IOD runs
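The "unbreakable units" idea under Preserving Source Alignments can be sketched as a grouping step: adjacent target words aligned to the same source word are bundled so that later processing moves them together. The input representation (one source index per target token) and the function name are assumptions for illustration; the slide does not describe how MEMT stores alignments.

```python
from itertools import groupby
from typing import List, Sequence


def unbreakable_units(tokens: Sequence[str],
                      source_index: Sequence[int]) -> List[List[str]]:
    """Group adjacent target tokens that share the same source-word index.

    `source_index[i]` is the (hypothetical) index of the source word that
    produced `tokens[i]`.
    """
    if len(tokens) != len(source_index):
        raise ValueError("tokens and source_index must have the same length")
    pairs = zip(tokens, source_index)
    # groupby only merges neighbours, so non-adjacent tokens from the same
    # source word stay in separate units.
    return [[tok for tok, _ in group]
            for _, group in groupby(pairs, key=lambda p: p[1])]


# Toy alignment: "INITIAL STAGE" comes from one source word (indices invented).
units = unbreakable_units(["THE", "INITIAL", "STAGE", "DEMENTIA"], [0, 1, 1, 2])
print(units)
```

Treating each inner list as a single item keeps multi-word translations of one source word intact when engine outputs are combined.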