TectoMT
two goals of TectoMT
–to allow experimenting with MT based on deep-syntactic (tectogrammatical) transfer
–to create a software framework into which various NLP software components could be integrated and tested within real-life applications (such as MT)
developed at UFAL since 2005
around 10 programmers using (and contributing to) TectoMT in 2008
Reminder 1: MT pyramid in terms of PDT layers
Key question in MT: optimal level of abstraction?
Our answer: somewhere around tectogrammatics
–high generalization over different language characteristics, but still computationally (and mentally!) tractable
Reminder 2: MT pyramid in TectoMT
modularity is emphasized in TectoMT
the MT task is implemented as a sequence of reusable NLP modules (called blocks)
around 80 blocks in the current version of English-Czech translation
[Figure: MT triangle between source language and target language; layers from top to bottom: interlingua, tectogrammatical, surface-syntactic, morphological, raw text]
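
The "sequence of reusable blocks" idea can be illustrated with a minimal sketch. This is not TectoMT's actual API: the `Document` container, the three toy blocks, and `run_scenario` are hypothetical names chosen for illustration, and the blocks do trivial string processing rather than real linguistic analysis.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Document:
    """Hypothetical container: raw text plus annotations added by blocks."""
    text: str
    layers: Dict[str, object] = field(default_factory=dict)


# A block is any callable that takes a document and returns it, enriched.
Block = Callable[[Document], Document]


def tokenize(doc: Document) -> Document:
    # Toy stand-in for a tokenization block: split on whitespace.
    doc.layers["tokens"] = doc.text.split()
    return doc


def lowercase(doc: Document) -> Document:
    # Toy stand-in for a normalization block; relies on the "tokens" layer.
    doc.layers["tokens"] = [t.lower() for t in doc.layers["tokens"]]
    return doc


def count_tokens(doc: Document) -> Document:
    # Toy stand-in for a later block that consumes earlier annotations.
    doc.layers["n_tokens"] = len(doc.layers["tokens"])
    return doc


def run_scenario(blocks: List[Block], doc: Document) -> Document:
    """Apply the blocks in order; each sees the output of the previous one."""
    for block in blocks:
        doc = block(doc)
    return doc


scenario = [tokenize, lowercase, count_tokens]
result = run_scenario(scenario, Document("TectoMT is modular"))
print(result.layers)
```

Because every block has the same interface, one block can be swapped for an alternative implementation (for example, a different tagger or parser) without touching the rest of the scenario.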
What is new in TectoMT in 2008?
new blocks added
new applications created
large data processed and used
New blocks in TectoMT in 2008
around 100 new blocks in 2008
two types of extensions:
–adding alternative (usually higher-performance) solutions to already implemented blocks, e.g.
  McDonald's parser (Collins' parser and constituency-to-dependency conversion integrated already in 2005)
  MORCE tagger (previously integrated taggers: TnT, MxPost, Jan Hajič's tagger, Lingua::EN::Tagger, Schmid's Tree Tagger)
–blocks for new tasks
  relatively isolated tasks such as Named Entity recognition in Czech and English
  sequence of blocks for English sentence synthesis
New applications of TectoMT in 2008
existing:
–real-time tecto-analysis of Czech sentences integrated in tree editor TrEd
–English sentence generator (within the Companions project)
–sentence analysis for various purposes (intonation in TTS, information extraction)
–segmentation of text into finite verb clauses
–preprocessing of English text for the purpose of English-to-Hindi translation
pilot version in the very near future:
–simple man-machine dialog manager
–Czech-to-English MT
Processing of large data in TectoMT
roughly 1GW of Czech texts
–analyzed up to simplified tecto
–for the purposes of modeling Czech sentences or their trees (functions as the target-side language model in our translation scenario)
roughly 60MW of parallel Czech-English texts from the CzEng corpus
–analyzed up to simplified tecto and aligned
–serves for generating several types of translation models
Plans for 2009
introduce TectoMT to a larger audience (MT Marathon 2009)
experiment with more sophisticated tools during the tecto-transfer phase (loglinear combinations of translation and target-language tree models, tree HMM)
facilitate addition of new languages to be processed in TectoMT
performance tuning (now: roughly 1 translated sentence per second)
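
A loglinear combination of a translation model and a target-language model can be sketched as a weighted sum of log-probabilities. The sketch below is illustrative only and is not taken from TectoMT: the function name `loglinear_score`, the feature names, the candidate words, and all probabilities and weights are made-up assumptions.

```python
import math
from typing import Dict


def loglinear_score(feature_probs: Dict[str, float],
                    weights: Dict[str, float]) -> float:
    """Weighted sum of log-probabilities, one term per model."""
    return sum(weights[name] * math.log(p)
               for name, p in feature_probs.items())


# Invented numbers: two candidate translations of one node, each scored by
# a translation model and a target-language tree model.
candidates = {
    "dům":   {"translation": 0.6, "tree_lm": 0.2},
    "domov": {"translation": 0.3, "tree_lm": 0.7},
}


def best_candidate(weights: Dict[str, float]) -> str:
    """Return the candidate with the highest loglinear score."""
    return max(candidates,
               key=lambda c: loglinear_score(candidates[c], weights))


# Equal weights: the tree model's preference outweighs the translation model.
equal = best_candidate({"translation": 1.0, "tree_lm": 1.0})

# Tree model switched off: the translation model decides alone.
translation_only = best_candidate({"translation": 1.0, "tree_lm": 0.0})

print(equal, translation_only)
```

In practice the weights would be tuned on held-out data rather than set by hand; the two settings above only show that changing the weights changes which candidate wins.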