En->Cz MT system based on tectogrammatics
Zdeněk Žabokrtský
IFAL, Charles University in Prague
Goals
- primary goal: to build a high-quality, linguistically motivated MT system using the PDT layered framework
- secondary goal: to create a system for testing the true usefulness of various NLP tools within a real-life application
MT triangle in terms of PDT
[Figure: MT triangle. Labels: source language, target language; analysis, transfer (?), synthesis; w-layer, m-layer, a-layer, t-layer.]
Building the first prototype...
- chosen direction: English -> Czech
- main design decisions:
  - several well-defined, linguistically relevant intermediate levels
  - modularity: decompose the task into many isolated subtasks
  - neutral w.r.t. chosen methodology (e.g. rules vs. statistics)
- available resources:
  - experience (and sw tools) from PDT and PCEDT
  - data (parallel corpora, translation dictionaries)
  - freely available NLP tools for analysis on the English side
  - an existing module for sentence synthesis on the Czech side
MT “triangle” in the prototype
[Figure: the prototype's path through the layers. Labels: input English text, English m-layer, English p-layer, English a-layer, English t-layer, Czech t-layer, output Czech text.]
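The path through the layers can be pictured as a chain of isolated blocks, each reading one layer and writing the next. The following is only a minimal sketch of that idea, not the prototype's code: the layer names come from the slides, while `Document`, `make_stub`, `build_pipeline` and `run_pipeline` are hypothetical names, and each stub block merely records that its layer was produced.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical document type: maps a layer name to that layer's representation.
Document = Dict[str, object]
Block = Callable[[Document], Document]

# Layer names as used in the slides, in processing order.
LAYER_PATH = ["EnglishW", "EnglishM", "EnglishP", "EnglishA",
              "EnglishT", "CzechT", "CzechW"]


def make_stub(src: str, dst: str) -> Block:
    """Create a placeholder block that 'builds' layer dst from layer src."""
    def block(doc: Document) -> Document:
        if src not in doc:
            raise KeyError(f"layer {src} must exist before {dst} can be built")
        out = dict(doc)          # blocks do not modify their input
        out[dst] = f"{dst} built from {src}"
        return out
    return block


def build_pipeline(path: List[str]) -> List[Tuple[str, Block]]:
    """Pair each adjacent couple of layers with a block, e.g. 'EnglishW->EnglishM'."""
    return [(f"{a}->{b}", make_stub(a, b)) for a, b in zip(path, path[1:])]


def run_pipeline(text: str) -> Document:
    """Run all blocks in order, starting from the raw English text."""
    doc: Document = {"EnglishW": text}
    for _name, block in build_pipeline(LAYER_PATH):
        doc = block(doc)
    return doc
```

Because every block only depends on the layer below it, a rule-based block can be swapped for a statistical one without touching the rest of the chain.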
Building blocks (1)
- EnglishW->EnglishM
  - segment the input text into sentences (Lingua::EN::Tagger from CPAN)
  - tokenize and tag the sentences (Lingua::EN::Tagger from CPAN)
  - lemmatize each token using the morpha tools and ispell
- EnglishM->EnglishP
  - phrase-structure parsing (Lingua::CollinsParser from CPAN)
- EnglishP->EnglishA
  - mark phrase heads (Collins’s heads + arrangements)
  - run the constituency-to-dependency transformation
  - assign (selected) analytical functions
  - mark subject nodes
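Once phrase heads are marked, the constituency-to-dependency step can be illustrated with a short sketch. This is a generic head-percolation conversion written for illustration, not the prototype's implementation: the `Node` class, its `head_index` field and both functions are hypothetical, and words are used directly as node identifiers, which only works when no word repeats in the sentence.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    """A phrase-structure node; leaves carry a word, phrases carry children."""
    label: str
    word: Optional[str] = None
    children: List["Node"] = field(default_factory=list)
    head_index: int = 0  # which child was marked as the phrase head


def lexical_head(node: Node) -> Optional[str]:
    """Follow head children down to the leaf that heads the whole phrase."""
    while node.children:
        node = node.children[node.head_index]
    return node.word


def to_dependencies(node: Node,
                    governor: Optional[str] = None) -> List[Tuple[str, Optional[str]]]:
    """Return (dependent, governor) pairs; the root word has governor None."""
    if not node.children:
        return [(node.word, governor)]
    edges = []
    head_word = lexical_head(node)
    for i, child in enumerate(node.children):
        if i == node.head_index:
            # The head child inherits the governor of the whole phrase.
            edges.extend(to_dependencies(child, governor))
        else:
            # Every other child depends on the phrase's lexical head.
            edges.extend(to_dependencies(child, head_word))
    return edges


# "the girl died": S -> NP VP, with VP as the head of S and NN as the head of NP.
tree = Node("S", head_index=1, children=[
    Node("NP", head_index=1, children=[Node("DT", "the"), Node("NN", "girl")]),
    Node("VP", head_index=0, children=[Node("VBD", "died")]),
])
print(to_dependencies(tree))
# [('the', 'girl'), ('girl', 'died'), ('died', None)]
```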
Building blocks (2)
- EnglishA->EnglishT
  - determine the t-tree topology (collapsing function-word subtrees)
  - label t-nodes with t-lemmas
  - assign coordination/apposition functors
  - mark t-nodes corresponding to finite clauses
  - assign (some of) the remaining functors
  - fill the nodetype attribute
  - detect grammatical co-reference in relative clauses
  - determine the semantic part of speech
  - fill grammateme attributes (number, tense, degree...)
  - detect sentence modality
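The first step, deriving the t-tree topology by collapsing function words, might look roughly like the sketch below. It rests on strong simplifying assumptions that are not from the slides: function words are recognized by a small hypothetical set of Penn-style tags (`FUNCTION_TAGS`), and the `ANode` class and `build_t_tree` function are illustrative names. Real auxiliary verbs, for instance, cannot be identified by tag alone.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Assumption for this sketch: these tags mark function words.
FUNCTION_TAGS = {"IN", "DT", "MD", "TO"}


@dataclass
class ANode:
    """An a-layer node: word form, tag, and index of its parent (None for root)."""
    form: str
    tag: str
    parent: Optional[int]


def build_t_tree(a_nodes: List[ANode]) -> Tuple[Dict[int, Optional[int]],
                                               Dict[int, List[str]]]:
    """Return (t_parent, aux): the parent of each content node in the collapsed
    tree, and the function words absorbed by each content node."""
    n = len(a_nodes)

    def is_function(i: int) -> bool:
        return a_nodes[i].tag in FUNCTION_TAGS

    def content_ancestor(i: int) -> Optional[int]:
        # Climb the a-tree, skipping over function-word nodes.
        p = a_nodes[i].parent
        while p is not None and is_function(p):
            p = a_nodes[p].parent
        return p

    t_parent = {i: content_ancestor(i) for i in range(n) if not is_function(i)}
    aux: Dict[int, List[str]] = {i: [] for i in t_parent}
    for f in range(n):
        if not is_function(f):
            continue
        # A function word governing a content word (e.g. a preposition) is
        # merged into that child; otherwise into its nearest content ancestor.
        kids = [c for c in range(n)
                if a_nodes[c].parent == f and not is_function(c)]
        owner = kids[0] if kids else content_ancestor(f)
        if owner is not None:
            aux[owner].append(a_nodes[f].form)
    return t_parent, aux


# "lived on a farm": 'on' governs 'farm', 'a' hangs under 'farm'.
nodes = [ANode("lived", "VBD", None), ANode("on", "IN", 0),
         ANode("a", "DT", 3), ANode("farm", "NN", 1)]
print(build_t_tree(nodes))
# ({0: None, 3: 0}, {0: [], 3: ['on', 'a']})
```

In the resulting structure only "lived" and "farm" survive as nodes, and the absorbed function words remain available for later steps such as formeme assignment.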
Building blocks (3)
- EnglishT->CzechT
  - transfer of formemes
  - transfer of lexemes
  - transfer of grammatemes
- CzechT->CzechW
  - finding agreement links
  - adding auxiliary verbs (in complex verb forms)
  - adding prepositions and conjunctions
  - deriving word forms (conjugation, declension)
  - computing word order
  - adding punctuation
  - vocalization of prepositions
  - concatenation of word forms and sentences
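Of the synthesis steps, vocalization of prepositions is the easiest to show in isolation: the Czech prepositions k, s, v and z take the forms ke, se, ve and ze before certain following words. The sketch below uses a deliberately crude heuristic that looks only at the first letter of the next word; the trigger sets in `TRIGGERS` are an assumption made for illustration, and real Czech vocalization also depends on consonant clusters and fixed phrases. The function names are hypothetical.

```python
from typing import List

# Vocalized variants of the four non-syllabic Czech prepositions.
VOCALIZED = {"k": "ke", "s": "se", "v": "ve", "z": "ze"}

# Assumption for this sketch: vocalize when the next word starts with one of
# these letters. This is a simplification of the real rules.
TRIGGERS = {"k": "kg", "s": "szšž", "v": "vf", "z": "szšž"}


def vocalize(preposition: str, next_word: str) -> str:
    """Return the vocalized form of the preposition if the heuristic fires."""
    prep = preposition.lower()
    if prep in VOCALIZED and next_word and next_word[0].lower() in TRIGGERS[prep]:
        result = VOCALIZED[prep]
        # Keep sentence-initial capitalization.
        return result.capitalize() if preposition[0].isupper() else result
    return preposition


def vocalize_sentence(tokens: List[str]) -> List[str]:
    """Apply vocalization to every token that is followed by another token."""
    out = list(tokens)
    for i in range(len(out) - 1):
        out[i] = vocalize(out[i], out[i + 1])
    return out


print(vocalize_sentence(["Nemocnice", "v", "Van"]))
# ['Nemocnice', 've', 'Van']
```

The step has to run after word order is computed, since the choice depends on whichever word ends up immediately after the preposition.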
Translation sample

English input:
A Turkish girl has died from bird flu, days after her brother and sister died from the disease. The girl, 11, who lived on a poultry farm in eastern Turkey's Van province, was being treated in hospital after her siblings became infected with bird flu. The cases are the first human deaths from bird flu outside Asia, where the virus has killed more than 70 people. The hospital in Van is treating 15 others, three of whom are in a critical condition, according to a doctor there. The latest victim, Hulya Kocyigit, died early on Friday at the hospital.

Czech output:
Turecká dívka zemřela z ptačí chřipky dny after, že její bratr a sestra zemřeli z nemoci. Ďívka 11, kdo žilo v drůbeží farmě ve van provincii východního Turecka, jsoucno zacházet v nemocnici, že její sourozenci slušeli nakažený s ptačí chřipkou. Případy jsou přední lidské smrti z ptačí chřipky mimo Asii, kde virus zabilo than 70 lid. Nemocnice ve Van zachází 15 zbývajících, whom three of v kritické podmínce souzvuk lékaře tam. Nejpozdnější oběť Kocyigit Hulya zemřela brzy v pátku v nemocnici.