1/16 TectoMT Zdeněk Žabokrtský ÚFAL MFF UK Software framework for developing MT systems (and other NLP applications)


2/16 What is TectoMT
TectoMT is:
- a highly modular, extensible NLP software system
- composed of numerous NLP tools integrated into a uniform infrastructure
- aimed at (but not limited to) developing machine translation systems
TectoMT is not:
- a specific method of MT (even if some approaches can profit from its existence more than others)
- an end-user application (even if releasing single-purpose stand-alone applications is possible and technically supported)

3/16 Motivation for creating TectoMT
First, technical reasons:
- Want to use more than two NLP tools in your experiment? Be ready for endless data conversions, tweaking other people's source code, incompatible versions of source code and models...
- A unified software infrastructure might help.
Second, our long-term MT plan:
- We believe that tectogrammar (deep syntax) as implemented in the Prague Dependency Treebank might help to (1) reduce data sparseness, and (2) find and exploit structural similarities revealed by tectogrammar even between typologically different languages.

4/16 Prague Dependency Treebank 2.0
Three layers of annotation:
- tectogrammatical layer: deep-syntactic dependency tree
- analytical layer: surface-syntactic dependency tree; 1 word (or punctuation mark) ~ 1 node
- morphological layer: sequence of tokens with their lemmas and morphological tags
[Ex: He would have gone into forest]

5/16 MT triangle in terms of PDT Key question of MT: what is the optimal level of abstraction? Obvious trade-off: ease of transfer vs. additional analysis and synthesis costs (system complexity, errors...)

6/16 Design decisions
- Linux + Perl
- a set of well-defined, linguistically relevant layers of language representation
- neutral w.r.t. the chosen methodology ("rules vs. statistics")
- emphasis on modularity: a translation scenario is a sequence of translation blocks (modules corresponding to individual NLP subtasks)
  - reusability
  - substitutability
[Figure: MT triangle between source language and target language; levels: interlingua, tectogram., surf.synt., morpho., raw text]

7/16 Design decisions (cont.)
- in-house object-oriented architecture as the backbone
  - all tools communicate via a standardized OO Perl interface
  - this avoids the former practice of tools communicating via files in specialized formats
- easy incorporation of external tools
  - previously existing parsers, taggers, lemmatizers etc.
  - just provide them with a Perl "wrapper" with the prescribed interface
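The wrapper idea on this slide can be illustrated with a minimal sketch. TectoMT itself is written in Perl and its actual interface is not shown in the slides, so the Python below is only an illustration of the pattern: `Tagger`, `ExternalTaggerWrapper`, and `toy_tool` are hypothetical names, and the "token/TAG" output format is an assumption made for the example.

```python
from abc import ABC, abstractmethod
from typing import Callable, List, Tuple


class Tagger(ABC):
    """Hypothetical common interface that every integrated tagger exposes."""

    @abstractmethod
    def tag(self, tokens: List[str]) -> List[Tuple[str, str]]:
        """Return (token, tag) pairs for one tokenized sentence."""


class ExternalTaggerWrapper(Tagger):
    """Adapts a pre-existing tool with its own text format to the interface."""

    def __init__(self, run_tool: Callable[[str], str]) -> None:
        # run_tool stands in for the external program (assumption): it takes
        # one space-separated sentence and returns "token/TAG token/TAG ...".
        self._run_tool = run_tool

    def tag(self, tokens: List[str]) -> List[Tuple[str, str]]:
        raw = self._run_tool(" ".join(tokens))
        pairs = []
        for item in raw.split():
            # Split on the last slash so tokens containing "/" survive.
            token, _, tag = item.rpartition("/")
            pairs.append((token, tag))
        return pairs


def toy_tool(sentence: str) -> str:
    """Stand-in for an external tagger; labels every token with the tag 'X'."""
    return " ".join(f"{word}/X" for word in sentence.split())


tagger = ExternalTaggerWrapper(toy_tool)
result = tagger.tag(["He", "would", "go"])
```

The rest of the system only ever calls `tag`, so swapping one tagger for another means swapping the wrapped tool, not the code that uses it.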

8/16 Hierarchy of processing units
- block
  - the smallest individually executable unit
  - with well-defined input and output
  - block parametrization possible (e.g. model size choice)
- scenario
  - a sequence of blocks, applied one after another to the given documents
- application
  - typically 3 steps:
    1. conversion from the input format
    2. applying the scenario to the data
    3. conversion into the output format
[Figure: MT triangle between source language and target language; levels: interlingua, tectogram., surf.synt., morpho., raw text]
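The block / scenario / application hierarchy can be sketched roughly as follows. This is an illustrative Python sketch, not TectoMT's Perl code: the class names, the dictionary-based document, and the three toy blocks are all assumptions made for the example.

```python
from typing import Dict, Iterable, List


class Block:
    """Smallest executable unit: reads a document and updates it in place."""

    def process(self, doc: Dict) -> None:
        raise NotImplementedError


class Tokenize(Block):
    def process(self, doc: Dict) -> None:
        doc["tokens"] = doc["text"].split()


class Lowercase(Block):
    def process(self, doc: Dict) -> None:
        doc["tokens"] = [token.lower() for token in doc["tokens"]]


class Truncate(Block):
    """Parametrized block (toy parameter): keeps at most max_tokens tokens."""

    def __init__(self, max_tokens: int) -> None:
        self.max_tokens = max_tokens

    def process(self, doc: Dict) -> None:
        doc["tokens"] = doc["tokens"][: self.max_tokens]


class Scenario:
    """A sequence of blocks applied one after another to a document."""

    def __init__(self, blocks: Iterable[Block]) -> None:
        self.blocks: List[Block] = list(blocks)

    def apply(self, doc: Dict) -> None:
        for block in self.blocks:
            block.process(doc)


def run_application(raw_text: str, scenario: Scenario) -> str:
    doc = {"text": raw_text}        # 1. conversion from the input format
    scenario.apply(doc)             # 2. applying the scenario to the data
    return " ".join(doc["tokens"])  # 3. conversion into the output format


scenario = Scenario([Tokenize(), Lowercase(), Truncate(max_tokens=4)])
output = run_application("He would have gone into forest", scenario)
```

Because every block shares the same `process` signature, a scenario can be rearranged or a single block replaced without touching the others, which is the reusability and substitutability the design aims for.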

9/16 Blocks
- technically, Perl classes derived from a common ancestor class
- around 300 blocks in TectoMT now, for various purposes:
  - blocks for analysis/transfer/synthesis, e.g.
      SEnglishW_to_SEnglishM::Lemmatize_mtree
      SEnglishP_to_SEnglishA::Mark_heads
      TCzechT_to_TCzechA::Vocalize_prepositions
  - blocks for alignment, evaluation, feature extraction, etc.
- English-Czech tecto-based translation currently consists of roughly 80 blocks
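The three example block names appear to follow a regular pattern. The sketch below parses that pattern in Python. The reading it encodes is an assumption inferred from the examples, not something the slide states: a leading S or T for the source or target side, then the language name, then a one-letter layer code, with `_to_` between the input and output layers and `::` before the action. All function and type names are hypothetical.

```python
import re
from typing import NamedTuple


class LayerId(NamedTuple):
    side: str      # "S" or "T" (assumed: source or target side)
    language: str  # e.g. "English", "Czech"
    layer: str     # one-letter layer code, e.g. "W", "M", "P", "A", "T"


class BlockName(NamedTuple):
    source: LayerId
    target: LayerId
    action: str


_LAYER = re.compile(r"^([ST])([A-Z][a-z]+)([A-Z])$")


def parse_layer(text: str) -> LayerId:
    match = _LAYER.match(text)
    if match is None:
        raise ValueError(f"not a layer identifier: {text!r}")
    return LayerId(*match.groups())


def parse_block_name(name: str) -> BlockName:
    layers, _, action = name.partition("::")
    source, separator, target = layers.partition("_to_")
    if not separator or not action:
        raise ValueError(f"not a block name: {name!r}")
    return BlockName(parse_layer(source), parse_layer(target), action)


parsed = parse_block_name("TCzechT_to_TCzechA::Vocalize_prepositions")
```

Under this reading, the parsed example describes a block that takes the target-side Czech tectogrammatical layer to the target-side Czech analytical layer, i.e. a synthesis step.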

10/16 Tools integrated as blocks
Already integrated tools:
- taggers: Hajič's tagger, Raab & Spoustová's Morče tagger, Ratnaparkhi's MXPOST tagger, Brants's TnT tagger, Schmid's TreeTagger, Coburn's Lingua::EN::Tagger
- parsers: Collins' phrase structure parser, McDonald's dependency parser, ZŽ's dependency parser
- named-entity recognizers: Stanford Named Entity Recognizer, Kravalová's SVM-based NE recognizer
- etc.

11/16 Other TectoMT components
- numerous file-format converters (e.g. from PDT, Penn Treebank, the CzEng corpus, WMT shared task data etc. to our XML format)
- TectoMT-customized version of Pajas' tree editor TrEd (visualization)
- tools for parallelized processing (Bojar)
- tools for testing (regular daily tests), documentation...

12/16 PDT-style layered analysis
- analyze a given Czech or English text up to the morphological, analytical and tectogrammatical layers
- currently used e.g. in experiments with intonation generation, information extraction, and a man-machine dialog system
[Figure: TectoMT triangle with source and target sides and layers M, A, T; labels: input text, analysis, files with PDT-style analysis]

13/16 Transl. dictionary extraction
- using the lemma pairs from aligned t-nodes in a huge parallel corpus, we build a probabilistic translation dictionary
[Figure: TectoMT triangle with source and target sides and layers M, A, T; labels: parallel corpus CzEng, analysis (both sides), alignment, lemma pairs, probabilistic translation dictionary]
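One simple way to turn aligned lemma pairs into a probabilistic dictionary is relative-frequency estimation. The slide does not say which estimator TectoMT uses, so the Python sketch below is an assumption, and the lemma pairs are made-up toy data rather than CzEng output.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def build_dictionary(
    pairs: Iterable[Tuple[str, str]]
) -> Dict[str, Dict[str, float]]:
    """Estimate P(target lemma | source lemma) by relative frequency."""
    pair_counts: Counter = Counter()
    source_counts: Counter = Counter()
    for source, target in pairs:
        pair_counts[(source, target)] += 1
        source_counts[source] += 1

    dictionary: Dict[str, Dict[str, float]] = defaultdict(dict)
    for (source, target), count in pair_counts.items():
        dictionary[source][target] = count / source_counts[source]
    return dict(dictionary)


# Toy (English lemma, Czech lemma) pairs, as if read off aligned t-nodes.
aligned = [
    ("forest", "les"),
    ("forest", "les"),
    ("forest", "prales"),
    ("go", "jít"),
]
dictionary = build_dictionary(aligned)
```

Here `dictionary["forest"]` assigns 2/3 to "les" and 1/3 to "prales", so the probabilities for each source lemma sum to one.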

14/16 Translation with tecto-transfer
- analysis-transfer-synthesis translation from English to Czech and vice versa
- employs the probabilistic dictionary from the previous slide
[Figure: TectoMT triangle with source and target sides and layers M, A, T; labels: input text, analysis, transfer, translated text]

15/16 Sentence re-synthesis
- analysis-clone-synthesis scenario
- postprocessing of another MT system's output (to make it more grammatical)
- speech reconstruction: postprocessing of STT output (to make it more grammatical)
[Figure: TectoMT triangle with source and target sides and layers M, A, T; labels: input text, analysis, clone, re-synthesized sentences]

16/16 Final remarks
- TectoMT developed since [year missing]
- TectoMT available under GPL since [year missing]
- At the moment around 15 programmers are using or contributing to TectoMT.
- Want to analyze Czech or English texts? Why don't you try doing it in your own TectoMT svn working copy?