Czech-to-English Translation: MT Marathon 2009 Session Preview Jonathan Clark Greg Hanneman Language Technologies Institute Carnegie Mellon University.

Slides:

Advertisements

Similar presentations

A Word-Class Approach to Labeling PSCFG Rules for Machine Translation (ACL 2011) Andreas Zollmann and Stephan Vogel Presented by Yun Huang 01/07/2011.

Advertisements

Dependency tree projection across parallel texts David Mareček Charles University in Prague Institute of Formal and Applied Linguistics.

The Application of Machine Translation in CADAL Huang Chen, Chen Haiying Zhejiang University Libraries, Hangzhou, China

Cluster Computing for Statistical Machine Translation Qin Gao, Kevin Gimpel, Alok Palikar, Andreas Zollmann Stephan Vogel, Noah Smith.

CS460/IT632 Natural Language Processing/Language Technology for the Web Lecture 2 (06/01/06) Prof. Pushpak Bhattacharyya IIT Bombay Part of Speech (PoS)

May 2006CLINT-LN Parsing1 Computational Linguistics Introduction Approaches to Parsing.

Statistical NLP: Lecture 3

March 1, 2009 Dr. Muhammed Al-Mulhem 1 ICS 482 Natural Language Processing Probabilistic Context Free Grammars (Chapter 14) Muhammed Al-Mulhem March 1,

TURKALATOR A Suite of Tools for English to Turkish MT Siddharth Jonathan Gorkem Ozbek CS224n Final Project June 14, 2006.

1 Quasi-Synchronous Grammars  Based on key observations in MT: translated sentences often have some isomorphic syntactic structure, but not usually in.

1 A Tree Sequence Alignment- based Tree-to-Tree Translation Model Authors: Min Zhang, Hongfei Jiang, Aiti Aw, et al. Reporter: 江欣倩 Professor: 陳嘉平.

A Tree-to-Tree Alignment- based Model for Statistical Machine Translation Authors: Min ZHANG, Hongfei JIANG, Ai Ti AW, Jun SUN, Sheng LI, Chew Lim TAN.

Machine Translation via Dependency Transfer Philip Resnik University of Maryland DoD MURI award in collaboration with JHU: Bootstrapping Out of the Multilingual.

Inducing Information Extraction Systems for New Languages via Cross-Language Projection Ellen Riloff University of Utah Charles Schafer, David Yarowksy.

Statistical Phrase-Based Translation Authors: Koehn, Och, Marcu Presented by Albert Bertram Titles, charts, graphs, figures and tables were extracted from.

“Applying Morphology Generation Models to Machine Translation” By Kristina Toutanova, Hisami Suzuki, Achim Ruopp (Microsoft Research). UW Machine Translation.

Introduction to Syntax, with Part-of-Speech Tagging Owen Rambow September 17 & 19.

ACL 2005 WORKSHOP ON BUILDING AND USING PARALLEL TEXTS (WPT-05), Ann Arbor, MI. June Competitive Grouping in Integrated Segmentation and Alignment.

Semi-Automatic Learning of Transfer Rules for Machine Translation of Low-Density Languages Katharina Probst April 5, 2002.

Artificial Intelligence 2004 Natural Language Processing - Syntax and Parsing - Language Syntax Parsing.

Does Syntactic Knowledge help English- Hindi SMT ? Avinesh. PVS. K. Taraka Rama, Karthik Gali.

SI485i : NLP Set 9 Advanced PCFGs Some slides from Chris Manning.

1 A Chart Parser for Analyzing Modern Standard Arabic Sentence Eman Othman Computer Science Dept., Institute of Statistical Studies and Research (ISSR),

TectoMT two goals of TectoMT –to allow experimenting with MT based on deep- syntactic (tectogrammatical) transfer –to create a software framework into.

11 CS 388: Natural Language Processing: Syntactic Parsing Raymond J. Mooney University of Texas at Austin.

PFA Node Alignment Algorithm Consider the parse trees of a Chinese-English parallel pair of sentences.

Parsing Long and Complex Natural Language Sentences

1/21 Introduction to TectoMT Zdeněk Žabokrtský, Martin Popel Institute of Formal and Applied Linguistics Charles University in Prague CLARA Course on Treebank.

English-Persian SMT Reza Saeedi 1 WTLAB Wednesday, May 25, 2011.

Tree Kernels for Parsing: (Collins & Duffy, 2001) Advanced Statistical Methods in NLP Ling 572 February 28, 2012.

Intro to Lexing & Parsing CS 153. Two pieces conceptually: – Recognizing syntactically valid phrases. – Extracting semantic content from the syntax. E.g.,

April 17, 2007MT Marathon: Tree-based Translation1 Tree-based Translation with Tectogrammatical Representation Jan Hajič Institute of Formal and Applied.

2010 Failures in Czech-English Phrase-Based MT 2010 Failures in Czech-English Phrase-Based MT Full text, acknowledgement and the list of references in.

Scalable Inference and Training of Context- Rich Syntactic Translation Models Michel Galley, Jonathan Graehl, Keven Knight, Daniel Marcu, Steve DeNeefe.

Czech-English Word Alignment Ondřej Bojar Magdalena Prokopová

Recent Major MT Developments at CMU Briefing for Joe Olive February 5, 2008 Alon Lavie and Stephan Vogel Language Technologies Institute Carnegie Mellon.

Reordering Model Using Syntactic Information of a Source Tree for Statistical Machine Translation Kei Hashimoto, Hirohumi Yamamoto, Hideo Okuma, Eiichiro.

Advanced MT Seminar Spring 2008 Instructors: Alon Lavie and Stephan Vogel.

Cs target cs target en source Subject-PastParticiple agreement Czech subject and past participle must agree in number and gender. Two-step translation.

Approximating a Deep-Syntactic Metric for MT Evaluation and Tuning Matouš Macháček, Ondřej Bojar; {machacek, Charles University.

A Cascaded Finite-State Parser for German Michael Schiehlen Institut für Maschinelle Sprachverarbeitung Universität Stuttgart

Transfer-based MT with Strong Decoding for a Miserly Data Scenario Alon Lavie Language Technologies Institute Carnegie Mellon University Joint work with:

What’s in a translation rule? Paper by Galley, Hopkins, Knight & Marcu Presentation By: Behrang Mohit.

CPE 480 Natural Language Processing Lecture 4: Syntax Adapted from Owen Rambow’s slides for CSc Fall 2006.

INSTITUTE OF COMPUTING TECHNOLOGY Forest-to-String Statistical Translation Rules Yang Liu, Qun Liu, and Shouxun Lin Institute of Computing Technology Chinese.

CSA2050 Introduction to Computational Linguistics Parsing I.

A non-contiguous Tree Sequence Alignment-based Model for Statistical Machine Translation Jun Sun ┼, Min Zhang ╪, Chew Lim Tan ┼ ┼╪

CS : Speech, NLP and the Web/Topics in AI Pushpak Bhattacharyya CSE Dept., IIT Bombay Lecture-14: Probabilistic parsing; sequence labeling, PCFG.

2003 (c) University of Pennsylvania1 Better MT Using Parallel Dependency Trees Yuan Ding University of Pennsylvania.

CPSC 422, Lecture 27Slide 1 Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 27 Nov, 16, 2015.

Building Sub-Corpora Suitable for Extraction of Lexico-Syntactic Information Ondřej Bojar, Institute of Formal and Applied Linguistics, ÚFAL.

Error Analysis of Two Types of Grammar for the purpose of Automatic Rule Refinement Ariadna Font Llitjós, Katharina Probst, Jaime Carbonell Language Technologies.

Discriminative Modeling extraction Sets for Machine Translation Author John DeNero and Dan KleinUC Berkeley Presenter Justin Chiu.

October 10, 2003BLTS Kickoff Meeting1 Transfer with Strong Decoding Learning Module Transfer Rules {PP,4894} ;;Score: PP::PP [NP POSTP] -> [PREP.

CMU Statistical-XFER System Hybrid “rule-based”/statistical system Scaled up version of our XFER approach developed for low-resource languages Large-coverage.

NSF PARTNERSHIP FOR RESEARCH AND EDUCATION : M EANING R EPRESENTATION FOR S TATISTICAL L ANGUAGE P ROCESSING 1 TectoMT TectoMT = highly modular software.

A Syntax-Driven Bracketing Model for Phrase-Based Translation Deyi Xiong, et al. ACL 2009.

A Simple English-to-Punjabi Translation System By : Shailendra Singh.

1/16 TectoMT Zdeněk Žabokrtský ÚFAL MFF UK Software framework for developing MT systems (and other NLP applications)

Learning to Generate Complex Morphology for Machine Translation Einat Minkov †, Kristina Toutanova* and Hisami Suzuki* *Microsoft Research † Carnegie Mellon.

Natural Language Processing Vasile Rus

Approaches to Machine Translation

Statistical NLP: Lecture 3

Urdu-to-English Stat-XFER system for NIST MT Eval 2008

Suggestions for Class Projects

--Mengxue Zhang, Qingyang Li

CS 388: Natural Language Processing: Syntactic Parsing

LING/C SC 581: Advanced Computational Linguistics

Natural Language - General

Approaches to Machine Translation

Presentation transcript:

Czech-to-English Translation: MT Marathon 2009 Session Preview Jonathan Clark Greg Hanneman Language Technologies Institute Carnegie Mellon University 26 January 2009

Stat-XFER processing pipeline Processed Czech–English resources Possible workshop tasks –Syntactic phrase table combination methods –Synchronous grammar development Selection of grammar rules Exploration of label granularity Development of manual grammars –Integration of morphological analysis Outline

Parsing Stat-XFER Data Processing Word Alignment Moses Phrase Extraction Syntactic Phrase Extraction Czech Corpus English Corpus Moses Phrase Table Syntactic Phrase Table Grammar Extraction SCFG Grammar Rules

Stat-XFER Data Processing Corpus: –Project Syndicate news data: portion of CzEng corpus (84,141 sentences) Czech Corpus English Corpus

Parsing Stat-XFER Data Processing Parsing: –Czech dependency parses by TectoMT; converted to projective c- structure –English c-structure parses by Stanford parser Czech Corpus English Corpus

Parsing Stat-XFER Data Processing Word alignment: –GIZA++ grow-diag-final alignment done in advance on tokenized corpus –Alignments computed on full CzEng corpus of 8 million sentences Word Alignment Czech Corpus English Corpus

Parsing Stat-XFER Data Processing Phrase extraction: –Syntactic extraction by PFA node alignment algorithm, t2ts mode –Non-syntactic extraction with Moses package Word Alignment Czech Corpus English Corpus Moses Phrase Extraction Syntactic Phrase Extraction Moses Phrase Table Syntactic Phrase Table

Parsing Stat-XFER Data Processing Grammar extraction: –Using syntactic node alignments as tree decomposition points Word Alignment Czech Corpus English Corpus Moses Phrase Extraction Syntactic Phrase Extraction Moses Phrase Table Syntactic Phrase Table Grammar Extraction SCFG Grammar Rules

Two phrase tables, with counts: Final Result 1 NNS NNS rozumem brains3 NN NN rozumem reason4 NN NN rozumem sense1 NP NP rozumem reason1 NN NN rozumností wisdom1 JJ JJ rozumnou sensible1 ADJP ADJP rozumnou m ě rou jisté reasonably certain 1 NP NP rozumnou politiku sensible policy 1 PHR PHR rozumem brains3 PHR PHR rozumem reason4 PHR PHR rozumem sense2 PHR PHR r ozumem. sense. 1 PHR PHR rozumem, a ž e brains ; and that 1 PHR PHR rozumem, pokud sense if1 PHR PHR rozumem, pokud ne sense if not

Three suffix-array language models –Target side of Project Syndicate corpus –… + more monolingual English data –… + target side of public CzEng corpus WMT tuning, development, and test sets = Baseline Stat-XFER system ready to analyse and expand Final Result

Stat-XFER processing pipeline Processed Czech–English resources Possible workshop tasks –Syntactic phrase table combination methods –Synchronous grammar development Selection of grammar rules Exploration of label granularity Development of manual grammars –Integration of morphological analysis Outline

Phrase Table Combination Combination of non-syntactic and syntactic phrase pairs –Direct combination and syntax prioritization

Synchronous Grammars: Rule Selection Rule learning yields huge grammars Decoding with millions of abstract rules is intractable Open Question: How do we select the best grammar rules with regard to translation quality and decoding speed?

Synchronous Grammars: Label Granularity Rule learning assigns non-terminal and POS labels from input parse trees Input labels are believed appropriate… –For a given single language –According to a particular theory of grammar Open Question: How do we expand or collapse these labels so that they are appropriate for translating a particular language pair?

Synchronous Grammars: Czech Example Subject moves in English translation Verbs in past tense cannot be associated with modifiers in present tense Proti odmítnutí se zitra Petr against dismissal AUX-REFL tomorrow Peter v práci rozhodl protestovat of work decided to protest “Peter decided to protest against the dismissal of work tomorrow.” * Example from Bojar and Lopez, “Tree-based Translation,” MT Marathon Presentation 2008

Synchronous Grammars: Manual Grammar Writing VP::VP : [ADV VP] -> [VP ADV] ( (X1::Y2) (X2::Y1) (*tgsrule* 0.2) (*sgtrule* 0.6) ((X0 tense) = (X1 tense)) ((X0 tense) = (X2 tense)) ) VP::VP → [ADV VP]::[VP ADV] VP …decided… tomorrow ADV tomorrow VP …decided… zitra…rozhodl… Stat-XFER supports LFG-style unification Feature structures for unification can also be provided by the morphology server

Czech Morphology: Example Czech words include clitics and inflectional morphology, marking meanings such as gender and number nerozumím ne+rozum+ím NEG+understand+1SG “I do not understand”

Czech Morphology in Stat-XFER Stat-XFER allows external morphology server to segment and annotate words at runtime Ambiguous word segmentations can be encoded as a lattice Must segment all training data, then rebuild phrase table & language model

(Your Idea Here) Any ideas about applying the statistical transfer framework to Czech–English translation are welcome!