Development of a German- English Translator Felix Zhang TJHSST Computer Systems Research Lab Period 5
Summary of previous quarters Developed a functional translator for simple German sentences to simple English sentences –All rule-based, no statistical methods, yet –Very specifically geared towards German – language dependent
Scope for 4 th quarter Statistical methods –Part-of-speech tagging –Morphological analysis –Information available in TIGER corpus Accuracy testing –Reliability of statistical methods –Vs. Rule-based
Finishing Rule-Based Translation Convert translation into user-readable format Punctuation and capitalization ~/research $ python proj.py Part of speech tags: [['den', 'art'], ['kurzen', 'adj'], ['Mann', 'nou'], ['machen', 'ver'], ['die', 'art'], ['kleinen', 'adj'], ['Kinder', 'nou']] Morphological analysis: [[['kurzen', 'adj'], [['akk', 'mas'], ['dat', 'pl']]], [['Mann', 'nou'], [['akk', 'mas'], ['dat', 'pl']]], [['machen', 'ver'], [['1', 'pl'], ['3', 'pl'], 'pres']], [['kleinen', 'adj'], [['nom', 'pl'], ['akk', 'pl']]], [['Kinder', 'nou'], [['nom', 'pl'], ['akk', 'pl']]]] Disambiguated after noun-verb agreement: [[['kurzen', 'adj'], [['akk', 'mas'], ['dat', 'pl']]], [['Mann', 'nou'], [['akk', 'mas'], ['dat', 'pl']]], [['machen', 'ver'], [['3', 'pl'], 'pres']], [['kleinen', 'adj'], [['nom', 'pl'], ['akk', 'pl']]], [['Kinder', 'nou'], [['nom', 'pl']]]] Lemmatized: [['kurzen', ['kurz']], ['Mann', ['Mann', 'Man']], ['machen', ['machen']], ['kleinen', ['klein']], ['Kinder', ['Kind']]] Root translated: [['den', 'the'], ['kurzen', 'short'], ['Mann', 'man'], ['machen', 'make'], ['die', 'the'], ['kleinen', 'small'], ['Kinder', 'child']] NP Chunked English: [[['the', 'art'], ['short', 'adj'], ['man', 'nou', [['akk', 'mas'], ['dat', 'pl']]]], ['make', 'ver', [['3', 'pl'], 'pres']], [['the', 'art'], ['small', 'adj'], ['child', 'nou', [['nom', 'pl']]]]] Assigned an element type: [[['the', 'art'], ['short', 'adj'], ['man', 'nou', [['akk', 'mas'], ['dat', 'pl']]], 'dobj'], ['make', 'ver', [['3', 'pl'], 'pres'], 'mverb'], [['the', 'art'], ['small', 'adj'], ['child', 'nou', [['nom', 'pl']]], 'sub']] Assigned priority: [['5', ['the', 'art'], ['short', 'adj'], ['man', 'nou', [['akk', 'mas'], ['dat', 'pl']]], 'dobj'], ['2', 'make', 'ver', [['3', 'pl'], 'pres'], 'mverb'], ['1', ['the', 'art'], ['small', 'adj'], ['child', 'nou', [['nom', 'pl']]], 'sub']] Rearranged to English structure: [['1', ['the', 'art'], ['small', 'adj'], ['child', 'nou', [['nom', 'pl']]], 'sub'], ['2', 'make', 'ver', [['3', 'pl'], 'pres'], 'mverb'], ['5', ['the', 'art'], ['short', 'adj'], ['man', 'nou', [['akk', 'mas'], ['dat', 'pl']]], 'dobj']] Inflected: The small childs make the short man.
Statistical Methods Trivial methods –Part of speech tagging, morphological analysis –Tags based on most frequently occurring tag with the word in corpus –Theoretically, should still achieve reasonable levels of accuracy
Testing Check all 746,660 words in TIGER corpus –Problem: Running time Match: Actual part of speech / linguistic properties of the word in the corpus == part of speech, etc. that program assigns based on maximum likelihood Predictions from research: 90% accuracy for part of speech tags
Accuracy – Part of speech Reaches about 90% percent accuracy, as predicted Match NE NE Nato Match ADV ADV allein Match VAFIN VAFIN sein Match PIAT PIAT kein Match NN NN Zukunftskonzept Match APPR APPR für Match ART ART der Match NN NN Sicherheit Match APPR APPR in Match NE NE Europa Total matches: Total words: Accuracy: %
Accuracy – Morphological Analysis Lower accuracy – Properties are more context- dependent Match allein Match 3.Sg.Pres.Ind 3.Sg.Pres.Ind ist Match Nom.Sg.Neut Nom.Sg.Neut Zukunftskonzept Match für Match Acc.Sg.Fem Acc.Sg.Fem die Match Acc.Sg.Fem Acc.Sg.Fem Sicherheit Match in Match Dat.Sg.Neut Dat.Sg.Neut Europa Total matches: Total words: Accuracy: %
Problems No good way to compare effectiveness of rule-based vs. statistical translations Accuracy not high enough –10% error too large in a 700,000+ word corpus –27% even worse Slow run times
Future Research Combine statistical and rule-based methods together in hybrid program Find way to compare accuracy of two techniques More advanced (and more accurate) statistical methods – Context-based