April 17, 2007MT Marathon: Tree-based Translation1 Tree-based Translation with Tectogrammatical Representation Jan Hajič Institute of Formal and Applied.

Slides:



Advertisements
Similar presentations
School of something FACULTY OF OTHER School of Computing FACULTY OF ENGINEERING Chunking: Shallow Parsing Eric Atwell, Language Research Group.
Advertisements

Dependency tree projection across parallel texts David Mareček Charles University in Prague Institute of Formal and Applied Linguistics.
Deep Linguistic Information in Hybrid Machine Translation
CS460/IT632 Natural Language Processing/Language Technology for the Web Lecture 2 (06/01/06) Prof. Pushpak Bhattacharyya IIT Bombay Part of Speech (PoS)
En->Cz MT system based on tectogrammatics Zdeněk Žabokrtský IFAL, Charles University in Prague.
June 6, 20073rd PIRE Meeting1 Tectogrammatical Representation of English in Prague Czech-English Dependency Treebank Lucie Mladová Silvie Cinková, Kristýna.
Language Data Resources Treebanks. A treebank is a … database of syntactic trees corpus annotated with morphological and syntactic information segmented,
Using Percolated Dependencies in PBSMT Ankit K. Srivastava and Andy Way Dublin City University CLUKI XII: April 24, 2009.
Probabilistic Parsing Chapter 14, Part 2 This slide set was adapted from J. Martin, R. Mihalcea, Rebecca Hwa, and Ray Mooney.
Annotating language data Tomaž Erjavec Institut für Informationsverarbeitung Geisteswissenschaftliche Fakultät Karl-Franzens-Universität Graz Tomaž Erjavec.
LING 581: Advanced Computational Linguistics Lecture Notes January 19th.
April 26, 2007Workshop on Treebanking, NAACL-HTL 2007 Rochester1 Treebanks: Layering the Annotation Jan Hajič Institute of Formal and Applied Linguistics.
Inducing Information Extraction Systems for New Languages via Cross-Language Projection Ellen Riloff University of Utah Charles Schafer, David Yarowksy.
April 26, 2007Workshop on Treebanking, NAACL-HTL 2007 Rochester1 Treebanks: Language-specific Issues Czech Jan Hajič Institute of Formal and Applied Linguistics.
Resources Primary resources – Lexicons, structured vocabularies – Grammars (in widest sense) – Corpora – Treebanks Secondary resources – Designed for a.
April 26, 2007Workshop on Treebanking, NAACL-HTL 2007 Rochester1 Treebanks and Parsing Jan Hajič Institute of Formal and Applied Linguistics School of.
Workshop on Treebanks, Rochester NY, April 26, 2007 The Penn Treebank: Lessons Learned and Current Methodology Ann Bies Linguistic Data Consortium, University.
Part-of-Speech Tagging & Sequence Labeling
1/36 TectoMT Zdeněk Žabokrtský Institute of Formal and Applied Linguistics MFF UK Software framework for developing MT systems (and other NLP applications)
Czech-to-English Translation: MT Marathon 2009 Session Preview Jonathan Clark Greg Hanneman Language Technologies Institute Carnegie Mellon University.
PDT 2.0 Prague Dependency Treebank 2.0 Zdeněk Žabokrtský Dept. of Formal and Applied Linguistics Charles University, Prague.
Computational Methods to Vocalize Arabic Texts H. Safadi*, O. Al Dakkak** & N. Ghneim**
LING/C SC/PSYC 438/538 Lecture 27 Sandiway Fong. Administrivia 2 nd Reminder – 538 Presentations – Send me your choices if you haven’t already.
1/21 Introduction to TectoMT Zdeněk Žabokrtský, Martin Popel Institute of Formal and Applied Linguistics Charles University in Prague CLARA Course on Treebank.
Prague Dependency Treebank(s) Workshop at LSA2011, Part II Jan Hajič, Zdeňka Urešová Institute of Formal and Applied Linguistics School of Computer Science.
March 5, 2008Companions Semantic Representation and Dialog Interfacing Workshop - Morphology and Surface Syntax 1 The PDT Morphology and Surface Syntax.
Morphological Meanings in the Prague Dependency Treebank Magda Razímová Zdeněk Žabokrtský Institute of Formal and Applied Linguistics Charles University,
Tree-based Machine Translation using syntax and semantics
The Prague (Czech-)English Dependency Treebank Jan Hajič Charles University in Prague Computer Science School Institute of Formal and Applied Linguistics.
2010 Failures in Czech-English Phrase-Based MT 2010 Failures in Czech-English Phrase-Based MT Full text, acknowledgement and the list of references in.
Czech-English Word Alignment Ondřej Bojar Magdalena Prokopová
CS : Language Technology for the Web/Natural Language Processing Pushpak Bhattacharyya CSE Dept., IIT Bombay Constituent Parsing and Algorithms (with.
Treebanks and MWEs (Part 1) Jan Hajič, Pavel Straňák, Jiří Mírovský Institute of Formal and Applied Linguistics & LINDAT/CLARIN School of Computer Science.
12/06/1999 JHU CS /Jan Hajic 1 Introduction to Natural Language Processing ( ) Statistical Parsing Dr. Jan Hajič CS Dept., Johns Hopkins Univ.
Cs target cs target en source Subject-PastParticiple agreement Czech subject and past participle must agree in number and gender. Two-step translation.
Jan Hajič Otakar Smrž Petr Zemánek Jan Šnaidauf Emanuel Beška Faculty of Mathematics and Physics Faculty of Philosophy and Arts Charles University in Prague.
Approaches to Machine Translation CSC 5930 Machine Translation Fall 2012 Dr. Tom Way.
Resemblances between Meaning-Text Theory and Functional Generative Description Zdeněk Žabokrtský Institute of Formal and Applied Linguistics Charles University,
Conversion of Penn Treebank Data to Text. Penn TreeBank Project “A Bank of Linguistic Trees” (as of 11/1992) University of Pennsylvania, LINC Laboratory.
1 / 5 Zdeněk Žabokrtský: Automatic Functor Assignment in the PDT Automatic Functor Assignment (AFA) in the Prague Dependency Treebank PDT : –a long term.
Prague Dependency Treebank(s) Workshop at LSA2011, Part I Jan Hajič, Zdeňka Urešová Institute of Formal and Applied Linguistics School of Computer Science.
CS : Speech, NLP and the Web/Topics in AI Pushpak Bhattacharyya CSE Dept., IIT Bombay Lecture-14: Probabilistic parsing; sequence labeling, PCFG.
nd PIRE project workshop1 Tectogrammatical Representation of English Silvie Cinková Lucie Mladová, Anja Nedoluzhko, Jiří Semecký, Jana Šindlerová,
March 5, 2008Companions Semantic Representation and Dialog Interfacing Workshop - Intro 1 The Prague Dependency Treebank (PDT) Introduction Jan Hajič Institute.
NLP. Introduction to NLP Background –From the early ‘90s –Developed at the University of Pennsylvania –(Marcus, Santorini, and Marcinkiewicz 1993) Size.
Annotation Procedure in Building the Prague Czech-English Dependency Treebank Marie Mikulová and Jan Štěpánek Institute of Formal and Applied Linguistics.
CPSC 422, Lecture 27Slide 1 Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 27 Nov, 16, 2015.
Machine Translation using Tectogrammatics Zdeněk Žabokrtský IFAL, Charles University in Prague.
Arabic Syntactic Trees Zdeněk Žabokrtský Otakar Smrž Center for Computational Linguistics Faculty of Mathematics and Physics Charles University in Prague.
CS : Speech, NLP and the Web/Topics in AI Pushpak Bhattacharyya CSE Dept., IIT Bombay Lecture-15: Probabilistic parsing; PCFG (contd.)
NLP. Parsing ( (S (NP-SBJ (NP (NNP Pierre) (NNP Vinken) ) (,,) (ADJP (NP (CD 61) (NNS years) ) (JJ old) ) (,,) ) (VP (MD will) (VP (VB join) (NP (DT.
Part-of-Speech Tagging & Sequence Labeling Hongning Wang
NSF PARTNERSHIP FOR RESEARCH AND EDUCATION : M EANING R EPRESENTATION FOR S TATISTICAL L ANGUAGE P ROCESSING 1 TectoMT TectoMT = highly modular software.
March 5, 2008Companions Semantic Representation and Dialog Interfacing Workshop - Tectogrammatics 1 PDT: Tectogrammatical Representation Jan Hajič Institute.
NLP. Introduction to NLP #include int main() { int n, reverse = 0; printf("Enter a number to reverse\n"); scanf("%d",&n); while (n != 0) { reverse =
Part-of-Speech Tagging CSCI-GA.2590 – Lecture 4 Ralph Grishman NYU.
LING/C SC 581: Advanced Computational Linguistics Lecture Notes Feb 3 rd.
Semantic annotation of a dialog corpus Silvie Cinková Institute of Formal and Applied Linguistics Charles University in Prague, Czech Republic COMPANIONS.
Pre-processing Tasks for Rule- Based English-Korean Machine Translation System Sung-Dong Kim, Dept. of Computer Engineering, Hansung University, Seoul,
LING/C SC 581: Advanced Computational Linguistics Lecture Notes Feb 17 th.
Prague Czech-English Dependency Treebank 2.0 ufal.mff.cuni.cz/pcedt2.0 Silvie Cinková, Marie Mikulová, Jan Štěpánek & professors, annotators and programmers.
1/16 TectoMT Zdeněk Žabokrtský ÚFAL MFF UK Software framework for developing MT systems (and other NLP applications)
Approaches to Machine Translation
Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 27
Prague Dependency Treebank 2. 0 Zdeněk Žabokrtský Dept
LING/C SC 581: Advanced Computational Linguistics
LING/C SC 581: Advanced Computational Linguistics
Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 27
Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 26
Approaches to Machine Translation
Presentation transcript:

April 17, 2007MT Marathon: Tree-based Translation1 Tree-based Translation with Tectogrammatical Representation Jan Hajič Institute of Formal and Applied Linguistics School of Computer Science Faculty of Mathematics and Physics Charles University, Prague Czech Republic

April 17, 2007MT Marathon: Tree-based Translation2 Standard Scheme of Machine Translation The Translation (“Vauquois”) triangle transfer source target “Deep” Syntax (Tectogrammatics) Surface Syntax Morphology Generation (Cz) (En)

April 17, 2007MT Marathon: Tree-based Translation3 Machine Translation Architecture Tectogrammatical layer-based system: source sentence target sentence morphological layer analytical layer tectogrammatical layer morphology (tagging) parsing linearization (tectogrammatical) parsing generation Transfer morph. synthesis word layer

April 17, 2007MT Marathon: Tree-based Translation4 Analytical Layer Correspondence

April 17, 2007MT Marathon: Tree-based Translation5 Tectogrammatical Correspondence The [Homestead’s] only remaining baker bakes the most famous rolls to the north of Long River. ‘al-xabaaz ‘al-’axiir ‘al-baaqii [fii Homestead] yaśmacu ‘ashhar ‘al-kruasaanaat ilaa shimaal min Long River.

April 17, 2007MT Marathon: Tree-based Translation6 The Additional Steps Analytical (surface)  Tectogrammatical additional parsing required Transfer minimal effort: only “true” transformations needed (like swimming ~ schwimmen gern) Generation back from Tectogrammatical representation to Analytical (surface syntax)

April 17, 2007MT Marathon: Tree-based Translation7 The Devil’s in... The additional three steps: source sentence target sentence morphological layer analytical layer tectogrammatical layer morphology (tagging) parsing linearization (trivial) (tectogrammatical) parsing Generation Transfer morph. synthesis (easy) word layer

April 17, 2007MT Marathon: Tree-based Translation8 Zooming In... The additional three steps: (Simple) transfer tectogrammatical layer Tectogrammatical parsing Generation: - Deletions - Insertions: prepositions, conjunctions,... - Word order - Morphology source analytical layer target

April 17, 2007MT Marathon: Tree-based Translation9 Tectogrammatical Parsing Newest results: 4 phases Transformation -based learning FnTBL Largely langu- age independent Coreference: >90% m- and a-layer: Attributemanualauto structure89,3 %76,4 % functor 85,5 %77,4 % val_frame.rf 92,3 %90,9 % t_lemma 93,5 %90,9 % nodetype 94,5 %92,6 % gram/sempos 93,8 %91,5 % a/lex.rf 96,5 %95,1 % a/aux.rf 94,3 %90,3 % is_member 94,3 %89,5 % is_generated 96,6 %95,2 % deepord 68,0 %66,7 %

April 17, 2007MT Marathon: Tree-based Translation10 Generation Components: Deletions of nodes [rare if going into English] Insertions of nodes prepositions, conjunctions, punctuation splitting phrases/idioms/named entities Tree reorganization (numeric expressions) Surface word order (analytical tree: defined w.o.) Morphology (agreement, cases based on subcat) English, Czech

April 17, 2007MT Marathon: Tree-based Translation11 Example Translation Insertion of Prepositions center of AuxP Venice.NNP Atr centrum Benátky APP center Venice APP centrum Benátky.NFS2 Atr analytical layer tectogrammatical layer

April 17, 2007MT Marathon: Tree-based Translation12 Example Translation Surface word order přijít.PAST včera Petr TWHEN ACT come.PAST yesterday Peter TWHEN ACT analytical layer tectogrammatical layer přijít.VB3SP včera Petr Adv Sb come.VBD Peter yesterday Sb Adv

April 17, 2007MT Marathon: Tree-based Translation13 The Data: Parallel, Annotated Treebank Parallel corpora Comparative/contrastive and translation studies Semantics Other “linguistic research goals” Machine Translation “Training” material Human-translated texts Testing material Evaluation – human, automatic

April 17, 2007MT Marathon: Tree-based Translation14 The Prague Czech-English Dependency Treebank “PCEDT” One of “family” of PDT-like treebanks Wall Street Portion of the Penn Treebank, ver. III Czech translation (manual) of the above Size 1.2 million words, ~50,000 sentences Annotation All 4 layers as in PDT: tokens, morphology, syntax, tectogrammatical representation

April 17, 2007MT Marathon: Tree-based Translation15 Penn Treebank University of Pennsylvania, 1993 Linguistic Data Consortium Wall Street Journal texts, ca. 50,000 sentences Financial (most), news, arts, sports 2499 (2312) documents in 25 sections Annotation POS (Part-of-speech tags) Syntactic “bracketing” + bracket (syntactic) labels (Syntactic) Function tags, traces, co-indexing

April 17, 2007MT Marathon: Tree-based Translation16 Penn Treebank Example ( (S (NP-SBJ (NP (NNP Pierre) (NNP Vinken) ) (,,) (ADJP (NP (CD 61) (NNS years) ) (JJ old) ) (,,) ) (VP (MD will) (VP (VB join) (NP (DT the) (NN board) ) (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director) )) (NP-TMP (NNP Nov.) (CD 29) ))) (..) )) Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29. POS tag (NNS) (noun, plural) Phrase label (NP) Noun Phrase “Preterminal”

April 17, 2007MT Marathon: Tree-based Translation17 Penn Treebank Example: Sentence Tree Phrase-based tree representation:

April 17, 2007MT Marathon: Tree-based Translation18 PDT Layers of Annotation Tectogrammatical structure Surface syntax Morphology Tokens (words)

April 17, 2007MT Marathon: Tree-based Translation19 Parallel Czech-English Annotation English text -> Czech text (human translation) Czech side (goal): all layers manual annotation English side (goal): Morphology and surface syntax: technical conversion Penn Treebank style -> PDT Analytic layer Tectogrammatical annotation: manual annotation (Slightly) different rules needed for English Alignment Natural, sentence level only (now)

April 17, 2007MT Marathon: Tree-based Translation20 Human Translation of WSJ Texts Hired translators / FCE level Specific rules for translation Sentence per sentence only …to get simple 1:1 alignment Fluent Czech at the target side If a choice, prefer “literal” translation The numbers: English tokens: 1,173,766 Translated to Czech: Revised/PCEDT 1.0:487,929 Now finished (all 2312 documents)

April 17, 2007MT Marathon: Tree-based Translation21 English Annotation POS and Syntax Automatic conversion from Penn Treebank PDT morphological layer From POS tags PDT analytic layer From:  Penn Treebank Syntactic Structure  Non-terminal labels  Function tags (non-terminal “suffixes”) 2-step process  Head determination rules  Conversion to dependency + analytic function

April 17, 2007MT Marathon: Tree-based Translation22 Head Determination Rules Exhaustive set of rules By J. Eisner + M. Cmejrek/J. Curin 4000 rules (non-terminal based) Ex.: (S (NP-SBJ VP.)) → VP Additional rules Coordination, Apposition Punctuation (end-of-sentence, internal) Original idea (possibility of conversion) J. Robinson (1960s)

April 17, 2007MT Marathon: Tree-based Translation23 Example: Head Determination Rules (board) (the) (join) (will) (join) (NP (DT NN)) → NN (VP (VB NP)) → VB (VP (MD VP)) → VP (S (… VP …)) → VP Rules:

April 17, 2007MT Marathon: Tree-based Translation24 Conversion: Analytic Structure, Functions Analytic Function assignment (conversion) Rules based on functional tags: -SBJ Sb -PRD Pnom -BNF Obj -DTV Obj -LGS Obj-ADV Adv -DIR Adv-EXT Adv -LOC Adv-MNR Adv -PRP Adv-PUT Adv -TMP Adv Ad-hoc rules (if functional tags missing) Lemmatization (years → year)

April 17, 2007MT Marathon: Tree-based Translation25 Example: Analytical Structure, Functions (board) (the) (join) (will) (join) → → Penn Treebank structure (with heads added) PDT-like Analytic Representation

April 17, 2007MT Marathon: Tree-based Translation26 English PDT-style Annotation Morphology and Syntax By conversion Tectogrammatical annotation Manual (English TR: by S. Cinková) Pre-annotation Transformation from Penn Treebank & Propbank (Palmer, Kingsbury) by Z. Žabokrtský et al. Valency From Propbank Frame Files (Cinková, Šindlerová, Nedolužko, Semecký) Volume annotation starting now

April 17, 2007MT Marathon: Tree-based Translation27 Czech PDT-style Annotation All layers (morphology, analytic, tectogrammatical) So far… Automatic (many tools by many authors) Manual annotation Started revised guidelines: M. Mikulová, J. Štěpánek Top-down Tectogrammatical first (lower layers automatically) … then analytic structure and morphology

April 17, 2007MT Marathon: Tree-based Translation28 Analytical Pair En – Cz Those current holders would also receive minority interests in the new company.

April 17, 2007MT Marathon: Tree-based Translation29 Analytical Pair En - Cz According to his opinion UAL's executives were misinformed about the financing of the original transaction.

April 17, 2007MT Marathon: Tree-based Translation30 Tectogrammatical Pair According to his opinion UAL's executives were misinformed about the financing of the original transaction. Podle jeho názoru bylo vedení UAL o financování původní transakce nesprávně informováno.

April 17, 2007MT Marathon: Tree-based Translation31 PCEDT 1.0 – The CD Published 2004 by the LDC (LDC2004T25) Texts, size of data: 480,000 words: parallel annotated WSJ treebank (Cz: auto) 21,600 sentences 2 mil. words (53,000 sent.): Reader’s Digest short stories Evaluation data (5 reference translations, 500 sent.) Tools GIZA++ (Statistical Machine Translation Toolkit) Scripts for easy training (“SMT Quick Run”) Probabilistic dictionary (46,150 words, lemmatized) Czech – English (WSJ and other sources) Euromatrix & other projects: PCEDT 2.0 (2008)

April 17, 2007MT Marathon: Tree-based Translation32 PCEDT – some pointers PCEDT catalog No. LDC2004T PDT 2.0 (Czech annotation - documentation) catalog No. LDC2006T Cinkova: English Tectogrammatics