1/21 Introduction to TectoMT Zdeněk Žabokrtský, Martin Popel Institute of Formal and Applied Linguistics Charles University in Prague CLARA Course on Treebank.

Slides:



Advertisements
Similar presentations
Dependency tree projection across parallel texts David Mareček Charles University in Prague Institute of Formal and Applied Linguistics.
Advertisements

An Introduction to GATE
En->Cz MT system based on tectogrammatics Zdeněk Žabokrtský IFAL, Charles University in Prague.
June 6, 20073rd PIRE Meeting1 Tectogrammatical Representation of English in Prague Czech-English Dependency Treebank Lucie Mladová Silvie Cinková, Kristýna.
Annotating language data Tomaž Erjavec Institut für Informationsverarbeitung Geisteswissenschaftliche Fakultät Karl-Franzens-Universität Graz Tomaž Erjavec.
Prague Arabic Dependency Treebank Center for Computational Linguistics Institute of Formal and Applied Linguistics Charles University in Prague MorphoTrees.
4.1 Blended approaches: Information Engineering IMS Information Systems Development Practices.
SSP Re-hosting System Development: CLBM Overview and Module Recognition SSP Team Department of ECE Stevens Institute of Technology Presented by Hongbing.
Are Linguists Dinosaurs? 1.Statistical language processors seem to be doing away with the need for linguists. –Why do we need linguists when a machine.
Inducing Information Extraction Systems for New Languages via Cross-Language Projection Ellen Riloff University of Utah Charles Schafer, David Yarowksy.
Resources Primary resources – Lexicons, structured vocabularies – Grammars (in widest sense) – Corpora – Treebanks Secondary resources – Designed for a.
Course Summary LING 575 Fei Xia 03/06/07. Outline Introduction to MT: 1 Major approaches –SMT: 3 –Transfer-based MT: 2 –Hybrid systems: 2 Other topics.
April 26, 2007Workshop on Treebanking, NAACL-HTL 2007 Rochester1 Treebanks and Parsing Jan Hajič Institute of Formal and Applied Linguistics School of.
Lesson-21Process Modeling Define systems modeling and differentiate between logical and physical system models. Define process modeling and explain its.
NATURAL LANGUAGE TOOLKIT(NLTK) April Corbet. Overview 1. What is NLTK? 2. NLTK Basic Functionalities 3. Part of Speech Tagging 4. Chunking and Trees 5.
(C) 2013 Logrus International Practical Visualization of ITS 2.0 Categories for Real World Localization Process Part of the Multilingual Web-LT Program.
TectoMT two goals of TectoMT –to allow experimenting with MT based on deep- syntactic (tectogrammatical) transfer –to create a software framework into.
1/36 TectoMT Zdeněk Žabokrtský Institute of Formal and Applied Linguistics MFF UK Software framework for developing MT systems (and other NLP applications)
Czech-to-English Translation: MT Marathon 2009 Session Preview Jonathan Clark Greg Hanneman Language Technologies Institute Carnegie Mellon University.
Basic Concepts The Unified Modeling Language (UML) SYSC System Analysis and Design.
1/36 TectoMT Zdeněk Žabokrtský ÚFAL MFF UK Software framework for developing MT systems (and other NLP applications)
PDT 2.0 Prague Dependency Treebank 2.0 Zdeněk Žabokrtský Dept. of Formal and Applied Linguistics Charles University, Prague.
April 2005CSA2050:NLTK1 CSA2050: Introduction to Computational Linguistics NLTK.
Slide 1 Wolfram Höpken RMSIG Reference Model Special Interest Group Second RMSIG Workshop Methodology and Process Wolfram Höpken.
Some Thoughts on HPC in Natural Language Engineering Steven Bird University of Melbourne & University of Pennsylvania.
Final Review 31 October WP2: Named Entity Recognition and Classification Claire Grover University of Edinburgh.
An Introduction to Software Architecture
A Survey of NLP Toolkits Jing Jiang Mar 8, /08/20072 Outline WordNet Statistics-based phrases POS taggers Parsers Chunkers (syntax-based phrases)
Machine Translation using Tectogrammatics Zdeněk Žabokrtský IFAL, Charles University in Prague.
Parser-Driven Games Tool programming © Allan C. Milne Abertay University v
©Ian Sommerville 2000 Software Engineering, 6th edition. Chapter 10Slide 1 Architectural Design l Establishing the overall structure of a software system.
SOFTWARE DESIGN (SWD) Instructor: Dr. Hany H. Ammar
Morphological Meanings in the Prague Dependency Treebank Magda Razímová Zdeněk Žabokrtský Institute of Formal and Applied Linguistics Charles University,
Tree-based Machine Translation using syntax and semantics
The Prague (Czech-)English Dependency Treebank Jan Hajič Charles University in Prague Computer Science School Institute of Formal and Applied Linguistics.
2010 Failures in Czech-English Phrase-Based MT 2010 Failures in Czech-English Phrase-Based MT Full text, acknowledgement and the list of references in.
Czech-English Word Alignment Ondřej Bojar Magdalena Prokopová
11 CORE Architecture Mauro Bruno, Monica Scannapieco, Carlo Vaccari, Giulia Vaste Antonino Virgillito, Diego Zardetto (Istat)
Systematic Parameterized Description of Pro-forms in the Prague Dependency Treebank 2.0 Magda Ševčíková Zdeněk Žabokrtský Institute of Formal and Applied.
Architectural Design Yonsei University 2 nd Semester, 2014 Sanghyun Park.
Semiautomatic domain model building from text-data Petr Šaloun Petr Klimánek Zdenek Velart Petr Šaloun Petr Klimánek Zdenek Velart SMAP 2011, Vigo, Spain,
Approximating a Deep-Syntactic Metric for MT Evaluation and Tuning Matouš Macháček, Ondřej Bojar; {machacek, Charles University.
Information Systems Engineering. Lecture Outline Information Systems Architecture Information System Architecture components Information Engineering Phases.
Resemblances between Meaning-Text Theory and Functional Generative Description Zdeněk Žabokrtský Institute of Formal and Applied Linguistics Charles University,
October 2005CSA3180 NLP1 CSA3180 Natural Language Processing Introduction and Course Overview.
©2003 Paula Matuszek Taken primarily from a presentation by Lin Lin. CSC 9010: Text Mining Applications.
1 / 5 Zdeněk Žabokrtský: Automatic Functor Assignment in the PDT Automatic Functor Assignment (AFA) in the Prague Dependency Treebank PDT : –a long term.
Introduction of Geoprocessing Lecture 9. Geoprocessing  Geoprocessing is any GIS operation used to manipulate data. A typical geoprocessing operation.
nd PIRE project workshop1 Tectogrammatical Representation of English Silvie Cinková Lucie Mladová, Anja Nedoluzhko, Jiří Semecký, Jana Šindlerová,
March 5, 2008Companions Semantic Representation and Dialog Interfacing Workshop - Intro 1 The Prague Dependency Treebank (PDT) Introduction Jan Hajič Institute.
Annotation Procedure in Building the Prague Czech-English Dependency Treebank Marie Mikulová and Jan Štěpánek Institute of Formal and Applied Linguistics.
Machine Translation using Tectogrammatics Zdeněk Žabokrtský IFAL, Charles University in Prague.
Rational Unified Process Fundamentals Module 4: Core Workflows II - Concepts Rational Unified Process Fundamentals Module 4: Core Workflows II - Concepts.
Building Sub-Corpora Suitable for Extraction of Lexico-Syntactic Information Ondřej Bojar, Institute of Formal and Applied Linguistics, ÚFAL.
Arabic Syntactic Trees Zdeněk Žabokrtský Otakar Smrž Center for Computational Linguistics Faculty of Mathematics and Physics Charles University in Prague.
Exploiting Named Entity Taggers in a Second Language Thamar Solorio Computer Science Department National Institute of Astrophysics, Optics and Electronics.
NSF PARTNERSHIP FOR RESEARCH AND EDUCATION : M EANING R EPRESENTATION FOR S TATISTICAL L ANGUAGE P ROCESSING 1 TectoMT TectoMT = highly modular software.
ICS312 Introduction to Compilers Set 23. What is a Compiler? A compiler is software (a program) that translates a high-level programming language to machine.
Named Entities in Czech Texts and Their Processing Magda Ševčíková Zdeněk Žabokrtský ÚFAL MFF UK.
1/16 TectoMT Zdeněk Žabokrtský ÚFAL MFF UK Software framework for developing MT systems (and other NLP applications)
LingWear Language Technology for the Information Warrior Alex Waibel, Lori Levin Alon Lavie, Robert Frederking Carnegie Mellon University.
Chapter 1 Introduction.
SOFTWARE DESIGN AND ARCHITECTURE
Chapter 1 Introduction.
课程名 编译原理 Compiling Techniques
Natural Language Processing (NLP)
Prague Arabic Dependency Treebank
Part of the Multilingual Web-LT Program
Natural Language Processing (NLP)
Natural Language Processing (NLP)
Presentation transcript:

1/21 Introduction to TectoMT Zdeněk Žabokrtský, Martin Popel Institute of Formal and Applied Linguistics Charles University in Prague CLARA Course on Treebank Annotation, December 2010, Prague

2/21 Outline PART 1 What is TectoMT? TectoMT’s architecture Overview of TectoMT’s tools and applications PART 2 - demo

3/21 What is TectoMT? multi-purpose NLP software framework created at UFAL since 2005 main linguistic features layered language representation linguistic data structures adopted from the Prague Dependency Treebank main technical features highly modular, open-source numerous NLP tools already integrated (both existing and new) all tools communicating via a uniform OO infrastructure Linux + Perl reuse of PDT technology (tree editor TrEd, XML…)

4/21 Why “TectoMT” ? Tecto.. refers the (Praguian) tectogrammar deep-syntactic dependency-oriented sentence representation developed by Petr Sgall and his colleagues since 1960s large scale application in the Prague Dependency Treebank.....MT the main application of TectoMT is Machine Translation however, not only “tecto” and not only “MT” !!! re-branding planned for 2011: TectoMT  Treex

5/21 What is not TectoMT? TectoMT (as a whole) is not an end-user application it is rather an experimental lab for NLP researchers however, releasing of single-purpose stand- alone applications is possible

6/21 Motivation for creating TectoMT First, technical reasons: Want to make use of more than two NLP tools in your experiment? Be ready for endless data conversions, need for other people's source code tweaking, incompatibility of source code and model versions…  unified software infrastructure might help us in many aspects. Second, our long-term MT plan: We believe that tectogrammar (deep syntax) as implemented in Prague Dependency Treebank might help to (1) reduce data sparseness, and (2) find and employ structural similarities revealed by tectogrammar even between typologically different languages.

7/21 Main Design Decisions Linux Perl as the core language set of well-defined, linguistically relevant layers of language representation neutral w.r.t. chosen methodology ("rules vs. statistics") emphasis on modularity each task implemented by a sequence of blocks each block corresponds to a well-defined NLP subtask reusability and substitutability of blocks support for distributed processing

8/21 Data Flow Diagram in a typical application in TectoMT INPUT DATA FILES OUTPUT DATA FILES MEMORY REPRESENTATION OF SENTENCE STRUCTURES input format converter output format converter block 1block 2 block n non-Perl tool X non-Perl tool Y block 3 … scenario:

9/21 Hierarchy of data-structure units document the smallest independently storable unit (~ xml file)‏ represents a text as a sequence of bundles, each representing one sentence (or sentence tuples in the case of parallel documents)‏ bundle set of tree representations of a given sentence tree representation of a sentence on a given layer of linguistic description node attribute document's, node's, or bundle's name-value pairs

10/21 Tree types adopted from PDT tectogrammatical layer deep-syntactic dependency tree analytical layer surface-syntactic dependency tree 1 word (or punct.) ~ 1 node morphological layer sequence of tokens with their lemmas and morphological tags

11/21 Trees in a bundle in each bundle, there can be at most one tree for each "layer" set of possible layers = {S,T} x {English,Czech,...} x {M,A,T,P, N} S - source, T-target (analysis vs. synthesis, MT perspective) M - morphological analysis P - phrase-structure tree A - analytical tree T - tectogrammatical tree N - instances of named entities Example: SEnglishA - tectogrammatical analysis of an English sentence on the source-language side

12/21 Hierarchy of processing units block the smallest individually executable unit with well-defined input and output block parametrization possible (e.g. model size choice)‏ scenario sequence of blocks, applied one after another on given documents application typically 3 steps: 1. conversion from the input format 2. applying the scenario on the data 3. conversion into the output format source language target language MT triangle: interlingua tectogram. surf.synt. morpho. raw text.

13/21 Blocks technically, Perl classes derived from TectoMT::Block either method process_bundle (if sentences are processed independently) or method process_document must be defined several hundreds blocks in TectoMT now, for various purposes: blocks for analysis/transfer/synthesis, e.g. SEnglishW_to_SEnglishM::Lemmatize_mtree SEnglishP_to_SEnglishA::Mark_heads TCzechT_to_TCzechA::Vocalize_prepositions blocks for alignment, evaluation, feature extraction, etc. some of them only implement simple rules, some of them call complex probabilistic tools English-Czech tecto-based translation currently composes of roughly 140 blocks

14/21 Tools available as TectoMT blocks to integrate a stand-alone NLP tool into TectoMT means to provide it with the standardized block interface already integrated tools: taggers Hajič's tagger, Raab&Spoustová Morče tagger, Rathnaparkhi MXPOST tagger, Brants's TnT tager, Schmid's Tree tagger, Coburn's Lingua::EN::Tagger parsers Collins' phrase structure parser, McDonalds dependency parser, Malt parser, ZŽ's dependency parser named-entity recognizer Stanford Named Entity Recognizer, Kravalová's SVM-based NE recognizer miscel. Klimeš's semantic role labeller, ZŽ's C5-based afun labeller, Ptáček's C5-based Czech preposition vocalizer,...

15/21 Other TectoMT components "core" - Perl libraries forming the core of TectoMT infrastructure, esp. for memory representation of (and interface to) to the data structures numerous file-format converters (e.g. from PDT, Penn treebank, Czeng corpus, WMT shared task data etc. to our xml format)‏ TectoMT-customized Pajas' tree editor TrEd tools for parallelized processing (Bojar)‏ data, esp. trained models for the individual tools, morphological dictionaries, probabilistic translation dictionaries... tools for testing (regular daily tests), documentation...

16/21 Languages in TectoMT full-fledged sentence PDT-style analysis/transfer/synthesis for English and Czech using state-of-the-art tools prototype implementations of PDT-style analyses for a number of other languages mostly created by students Polish, French, German, Tamil, Spanish, Esperanto…

17/21 English-Czech translation in TectoMT

18/21 English-Czech translation in TectoMT

19/21 Real Translation Scenario

20/21 Parallel analysis Czech M A T English analysis alignment data needed for training the transfer phase models Czech-English parallel corpus CzEng 8 mil. pairs of sentences with automatic PDT-style analyses and alignment

21/21 Summary of Part I TectoMT (  Treex) environment for NLP experiments multipurpose, multilingual PDT-style linguistic structures Linux+Perl, open-source modular architecture (several hundreds of modules) capable of processing massive data will be released at CPAN