Machine Translation using Tectogrammatics Zdeněk Žabokrtský IFAL, Charles University in Prague.


Overview
- Part I: theoretical background
- Part II: TectoMT system

MT pyramid (in terms of PDT)
- Key question in MT: what is the optimal level of abstraction?
- Our answer: somewhere around tectogrammatics
- high generalization over different language characteristics, but still computationally (and mentally!) tractable

Basic facts about "Tecto"
- introduced by Petr Sgall in the 1960s
- implemented in the Prague Dependency Treebank 2.0
- each sentence is represented as a deep-syntactic dependency tree
- functional words accompanying an autosemantic word "collapse" with it into a single t-node, labeled with the autosemantic t-lemma
- added t-nodes (e.g. because of pro-drop)
- semantically indispensable syntactic and morphological categories are rendered by a complex system of t-node attributes (functors + subfunctors, grammatemes for tense, number, degree of comparison, etc.)

SMT and limits of growth
- current state-of-the-art approaches to MT: n-grams + large parallel (and also monolingual) corpora + huuuuge computational power
- n-grams are very greedy!
- availability (or even existence!) of more data?
- example: Czech-English parallel data
  ~1 MW - easy (just download and align some tens of e-books)
  ~10 MW - doable (parallel corpus CzEng)
  ~100 MW - not now, but maybe in a couple of years...
  ~1 GW - ?
  ~10 GW (~ books) - Was it ever translated???

How could tecto help SMT?
- n-gram view: manifestations of lexemes are mixed with manifestations of the language means expressing the relations between the lexemes, and of other grammar rules
  - inflectional endings, agglutinative affixes, functional words, word order, punctuation, orthographic rules...
  - example: It will be delivered to Mr. Green's assistants at the nearest meeting.
  - -> training data sparsity
- how could tecto ideas help?
  - within each sentence, clear separation of meaningful "signs" from "signs" which are only imposed by grammar (e.g. imposed by agreement)
  - clear separation of lexical, syntactical and morphological meaning components
  - -> modularization of the translation task
  - -> potential for a better structuring of statistical models
  - -> more effective exploitation of the limited training data

"Semitecto"
- abstract sentence representation, tailored for MT purposes
- motivation:
  - not to make decisions which are not really necessary for the MT process (such as distinguishing between many types of temporal and directional semantic complementations)
  - given the target-language "semitecto" tree, we want the sentence generation to be deterministic
- slightly "below" tecto (w.r.t. the abstraction axis):
  - adopting the idea of separating lexical, syntactical and morphological meaning components
  - adopting the t-tree topology principles
  - adopting many t-node attributes (especially grammatemes, coreference, etc.)
  - but (almost) no functors, no subfunctors, no WSD, no pointers to valency dictionary, no tfa...
  - closer to the surface syntax
- main innovation: concept of formemes

Formemes
- formeme = morphosyntactic language means expressing the dependency relation
  - n:v+6 (in Czech) = semantic noun which is expressed on the surface as a prepositional group in the locative with the preposition "v"
  - v:that+fin/a (in English) = semantic verb expressed in active voice as the head of a subordinate clause introduced by the subordinating conjunction "that"
- obviously, the sets of formeme values are specific to each of the four semantic parts of speech
- in fact, formemes are edge labels partially substituting functors
- what is NOT captured by formemes:
  - morphological categories imposed by grammar rules (esp. by agreement), such as gender, number and case for adjectives in attributive positions
  - morphological categories already represented by grammatemes, such as degree of comparison for adjectives, tense for verbs, number for nouns
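
The formeme labels above follow a visible pattern: a semantic part of speech, a colon, an optional function word joined with "+", a form, and for verbs an optional voice after "/". A minimal Python sketch of splitting such a label into its parts follows. The class and function names are hypothetical and are not part of TectoMT, and reading "/p" as passive is an assumption based on the "/a" = active example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Formeme:
    sempos: str                   # semantic part of speech: "n", "v", "adj", "adv"
    function_word: Optional[str]  # preposition or conjunction, if any
    form: str                     # e.g. case "6", "fin", "attr", "X"; may be empty
    voice: Optional[str]          # "a" = active; "p" assumed to mean passive


def parse_formeme(label: str) -> Formeme:
    """Split a label such as 'n:v+6' or 'v:that+fin/a' into its parts."""
    sempos, _, rest = label.partition(":")
    rest, _, voice = rest.partition("/")
    # rpartition leaves function_word empty when the label has no "+".
    function_word, _, form = rest.rpartition("+")
    return Formeme(sempos, function_word or None, form, voice or None)


if __name__ == "__main__":
    for label in ("n:v+6", "v:that+fin/a", "adj:attr", "adv:"):
        print(label, "->", parse_formeme(label))
```

The sketch only covers the label shapes that appear on these slides; other shapes may need additional rules.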

Formemes in the tree
Example: It is extremely important that Iraq held elections to a constitutional assembly.

Some more examples of proposed formemes

Czech
  968  adj:attr
  604  n:1
  552  n:2
  497  v:fin/a
  308  n:4
  260  adv:
  169  n:v
       adj:compl
  117  v:inf
  104  n:poss
   86  n:7
   82  v:že+fin/a
   77  v:rc/a
   63  n:s+7
   53  n:k+3
   53  n:attr
   50  n:na+6
   47  n:na+4
   42  v:aby+fin/a

English
  661  adj:attr
  568  n:attr
  456  n:subj
  413  n:obj
  370  v:fin/a
  273  n:of+X
  238  adv:
  160  n:poss
  160  n:in+X
  146  v:to+inf/a
   92  adj:compl
   91  n:to+X
       v:rc/a
       v:that+fin/a
       v:ger/a

Three-way transfer
Translation process (I have been asked by him to come -> Požádal mě, abych přišel):
1. source-language sentence analysis up to the "semitecto" layer
2. transfer of
   - lexemes (ask -> požádat, come -> přijít)
   - formemes (v:fin/p -> v:fin/a, v:to+inf -> v:aby+fin/a)
   - grammatemes (tense=past1 -> past, 0 -> verbmod=cdn)
3. target-language sentence synthesis from the "semitecto" layer
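
The transfer step can be illustrated with a toy Python sketch that maps the lexeme, formeme and grammatemes of each node independently. The tables contain only the pairs from the example sentence. The node layout (a dict with "lemma", "formeme", "grammatemes") is hypothetical, and so is the choice to attach verbmod=cdn whenever the target formeme is v:aby+fin/a. This is not how TectoMT itself is implemented.

```python
# Toy transfer tables holding only the pairs from the slide's example.
LEXEMES = {"ask": "požádat", "come": "přijít"}
FORMEMES = {"v:fin/p": "v:fin/a", "v:to+inf": "v:aby+fin/a"}
GRAMMATEME_VALUES = {("tense", "past1"): ("tense", "past")}
# Assumption: the added grammateme is triggered by the target formeme.
ADDED_GRAMMATEMES = {"v:aby+fin/a": {"verbmod": "cdn"}}


def transfer_node(node: dict) -> dict:
    """Translate one node's lexeme, formeme and grammatemes separately."""
    target_formeme = FORMEMES.get(node["formeme"], node["formeme"])
    grammatemes = {}
    for name, value in node["grammatemes"].items():
        new_name, new_value = GRAMMATEME_VALUES.get((name, value), (name, value))
        grammatemes[new_name] = new_value
    grammatemes.update(ADDED_GRAMMATEMES.get(target_formeme, {}))
    return {
        "lemma": LEXEMES.get(node["lemma"], node["lemma"]),
        "formeme": target_formeme,
        "grammatemes": grammatemes,
    }


if __name__ == "__main__":
    source_nodes = [
        {"lemma": "ask", "formeme": "v:fin/p", "grammatemes": {"tense": "past1"}},
        {"lemma": "come", "formeme": "v:to+inf", "grammatemes": {}},
    ]
    for node in source_nodes:
        print(transfer_node(node))
```

Unknown lemmas and formemes are passed through unchanged here, which is a simplification chosen only to keep the sketch short.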

Adding statistics...
- translation model (e.g. from parallel corpus CzEng, 30 MW): $P(l_T \mid l_S)$ and $P(f_T \mid f_S)$, relating source-language (S) lexemes $l$ and formemes $f$ to target-language (T) ones
- "binode" language model (e.g. from partially parsed Czech National Corpus, 100 MW): $P(l_{gov}, l_{dep}, f)$
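
The slide names three distributions but does not say how they are combined. The Python sketch below assumes the simplest option, a plain product of the lexeme translation probability, the formeme translation probability and the binode language model probability, maximized over all candidate (lemma, formeme) pairs for one dependent node. All probabilities are invented toy numbers, and the function name and the fallback value for unseen binodes are hypothetical.

```python
from itertools import product


def best_translation(src_lemma, src_formeme, gov_lemma,
                     lemma_tm, formeme_tm, binode_lm, unseen=1e-6):
    """Pick the target (lemma, formeme) with the highest product score.

    Assumed score: P(l_T | l_S) * P(f_T | f_S) * P(l_gov, l_dep, f).
    """
    scores = {}
    candidates = product(lemma_tm[src_lemma].items(),
                         formeme_tm[src_formeme].items())
    for (lemma, p_lemma), (formeme, p_formeme) in candidates:
        p_binode = binode_lm.get((gov_lemma, lemma, formeme), unseen)
        scores[(lemma, formeme)] = p_lemma * p_formeme * p_binode
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # Invented toy numbers, for illustration only.
    lemma_tm = {"election": {"volba": 0.9, "hlasování": 0.1}}
    formeme_tm = {"n:obj": {"n:4": 0.5, "n:1": 0.3, "n:2": 0.2}}
    binode_lm = {
        ("proběhnout", "volba", "n:1"): 0.02,
        ("proběhnout", "volba", "n:4"): 0.0001,
    }
    print(best_translation("election", "n:obj", "proběhnout",
                           lemma_tm, formeme_tm, binode_lm))
```

With these toy numbers the binode model overrides the most frequent formeme translation (n:4) in favour of n:1, which matches "proběhly volby" in the example sentence pair.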

Part II TectoMT System

Goals
- primary goal: to build a high-quality, linguistically motivated MT system using the PDT layered framework, starting with the English -> Czech direction
- secondary goals:
  - to create a system for testing the true usefulness of various NLP tools within a real-life application
  - to exploit the abstraction power of tectogrammatics
  - to supply data and technology for other projects

Main design decisions
- Linux + Perl
- set of well-defined, linguistically relevant levels of language representation
- neutral w.r.t. chosen methodology (e.g. rules vs. statistics)
- in-house OO architecture as the backbone, but easy incorporation of external tools (parsers, taggers, lemmatizers, etc.)
- accent on modularity: translation scenario as a sequence of translation blocks (modules corresponding to individual NLP subtasks)
[Figure: MT triangle for source and target language, with levels raw text, morpho., surf. synt., tectogram., interlingua]
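
The idea of a translation scenario as a sequence of blocks can be sketched in a few lines. TectoMT itself is written in Perl; the Python below only illustrates the design, and the block names, the dict-based document and run_scenario are hypothetical and do not mirror the framework's actual API.

```python
from typing import Callable, Dict, List

Document = Dict[str, object]
Block = Callable[[Document], Document]


def tokenize(doc: Document) -> Document:
    """Toy block: split the raw text on whitespace."""
    return {**doc, "tokens": str(doc["text"]).split()}


def lowercase(doc: Document) -> Document:
    """Toy block: normalize the tokens produced by an earlier block."""
    return {**doc, "tokens": [t.lower() for t in doc["tokens"]]}


def run_scenario(blocks: List[Block], doc: Document) -> Document:
    """Apply each block in order; every block sees the previous block's output."""
    for block in blocks:
        doc = block(doc)
    return doc


if __name__ == "__main__":
    scenario = [tokenize, lowercase]
    print(run_scenario(scenario, {"text": "Iraq held elections"}))
```

Because each block has the same signature, a rule-based block can be swapped for a statistical one (or an external tool wrapped in a block) without touching the rest of the scenario.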

TectoMT - example of analysis (1)
Sample sentence: It is extremely important that Iraq held elections to a constitutional assembly.

TectoMT - example of analysis (2) phrase-structure tree:

TectoMT - example of analysis (3) analytical tree

TectoMT - example of analysis (4) tectogrammatical tree (with formemes)

Heuristic alignment
Sentence pair:
It is extremely important that Iraq held elections to a constitutional assembly.
Je nesmírně důležité, že v Iráku proběhly volby do ústavního shromáždění.

Formeme pairs extracted from parallel aligned trees

  Count  Czech        English
    593  adj:attr     adj:attr
    290  v:fin/a      v:fin/a
    282  n:1          n:subj
    214  adj:attr     n:attr
    165  n:2          n:of+X
    152  adv:         adv:
    149  n:4          n:obj
    102  n:2          n:attr
     86  n:v+6        n:in+X
     79  n:poss       n:poss
     73  n:1          n:obj
     61  n:2          n:obj
     51  v:inf        v:to+inf/a
     50  adj:compl    adj:compl
     39  n:2          n:
     34  n:4          n:subj
     34  n:attr       n:attr
     32  v:že+fin/a   v:that+fin/a
     32  n:2          n:poss
     27  n:4          n:attr
     27  n:2          n:subj
     26  adj:attr     n:poss
     25  v:rc/a       v:rc/a
     20  v:aby+fin/a  v:to+inf/a
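
Pair counts like these can be turned into a formeme translation model by relative frequency. The Python sketch below estimates P(Czech formeme | English formeme) from the n:subj and n:obj rows of the slide. Only the listed rows are used, so the estimates ignore any pairs not shown, and the function name is hypothetical.

```python
from collections import defaultdict

# (count, Czech formeme, English formeme), copied from the slide for
# the English formemes n:subj and n:obj only.
PAIR_COUNTS = [
    (282, "n:1", "n:subj"),
    (149, "n:4", "n:obj"),
    (73, "n:1", "n:obj"),
    (61, "n:2", "n:obj"),
    (34, "n:4", "n:subj"),
    (27, "n:2", "n:subj"),
]


def conditional_probs(pair_counts):
    """Relative-frequency estimate of P(cz | en) from aligned pair counts."""
    totals = defaultdict(int)
    for count, _cz, en in pair_counts:
        totals[en] += count
    probs = defaultdict(dict)
    for count, cz, en in pair_counts:
        probs[en][cz] = count / totals[en]
    return dict(probs)


if __name__ == "__main__":
    for en, dist in conditional_probs(PAIR_COUNTS).items():
        for cz, p in sorted(dist.items(), key=lambda kv: -kv[1]):
            print(f"P({cz} | {en}) = {p:.3f}")
```

On these rows an English object maps most often to the Czech accusative (n:4), but the nominative (n:1) and genitive (n:2) together take almost half of the probability mass, which is why a context model is useful on top of the translation model.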

Thank you!