AUTOMATIC TRANSLATION UTILITY
Fostering language diversity and participation
Juan Dolio, DR, November 2008
Stéphane Bruno, AHTIC/CONSORTIUM CARISNET
LANGUAGE STATS
FACTS
English is the dominant language in CIVIC discussions
Members who are not fluent in English (or do not speak it at all) are reluctant to contribute
Manual (human) translation of all email and forum communications is impossible and far too costly
Systematic human translation would also delay interactions
CIVIC APPROACH TO LANGUAGE DIVERSITY
Three official languages: English, French, Spanish
All documents and “official” communications are translated into all three languages (the original-language document being the legally binding one?)
Simultaneous translation is provided in face-to-face meetings for plenary sessions when the size of the language group and its needs justify the cost
Automatic translation of emails is provided to facilitate comprehension and contribution by all language groups
OBJECTIVES OF THE AUTOMATIC TRANSLATION
Give all members the opportunity to get the essence of all communications in all three official CIVIC languages
Make the translation non-disruptive, as seamless and as user-friendly as possible
Allow the translation to improve over time
Build a contextual terminology and linguistic environment for CIVIC in its field of intervention
HOW IT WORKS
THE TRANSLATION MECHANISMS
When a mail arrives, the software breaks the message into paragraphs
The software tries to guess the language of each paragraph
If it cannot guess the language, it assumes the paragraph is in English
The software then preprocesses the paragraph through the knowledgebase
Each paragraph is then sent to the translation service (Babelfish), and the result is retrieved for each language pair
The resulting paragraph is post-processed
The message is then reconstructed and sent to the mailing list manager
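
The steps above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the utility's actual code: the slides do not say how the language is guessed, so a naive stopword count stands in for it, and the names `split_paragraphs`, `guess_language`, `translate_message` and the `translate` callable are all hypothetical. The call to the translation service is passed in as a function so that the sketch does not depend on any real service.

```python
import re

LANGUAGES = ("en", "fr", "es")

# Hypothetical stopword lists for a naive language guess (an assumption;
# the real detection method is not described in the slides).
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in", "that", "for"},
    "fr": {"le", "la", "les", "et", "est", "des", "une", "pour"},
    "es": {"el", "los", "las", "y", "es", "una", "para", "que"},
}


def split_paragraphs(body):
    """Split a plain-text message body into paragraphs on blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", body) if p.strip()]


def guess_language(paragraph):
    """Return the language with the most stopword hits, or 'en' if none match."""
    words = re.findall(r"\w+", paragraph.lower())
    scores = {lang: sum(w in stop for w in words) for lang, stop in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "en"


def _identity(paragraph, source, target):
    return paragraph


def translate_message(body, translate, preprocess=_identity, postprocess=_identity):
    """Return one reconstructed message body per official language.

    translate(text, source, target) stands in for the external service.
    """
    output = {lang: [] for lang in LANGUAGES}
    for paragraph in split_paragraphs(body):
        source = guess_language(paragraph)
        for target in LANGUAGES:
            if target == source:
                # No translation needed for the paragraph's own language.
                output[target].append(paragraph)
                continue
            prepared = preprocess(paragraph, source, target)
            translated = translate(prepared, source, target)
            output[target].append(postprocess(translated, source, target))
    return {lang: "\n\n".join(parts) for lang, parts in output.items()}
```

A short greeting with no recognizable stopwords falls through to English, which matches the fallback rule on the slide.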
INPUT REQUIREMENTS
Use simple language constructs
Use complete sentences and correct grammar and syntax
Avoid abbreviations, metaphors and idiomatic expressions
Avoid proverbs and sayings
Do not mix languages in the same paragraph (translation is done paragraph by paragraph, and the language is guessed)
OTHER FEATURES
If you want some words not to be translated, enclose them in “*”, like *CIVIC*
The knowledgebase is a database that records how certain words must be translated, overriding the output of the translation service; for example, ICT is translated as TIC in French and Spanish, and vice versa
This makes it possible to build a lexicon or linguistic construct in the context of CIVIC and ICT4D
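
One way these two features could work is to swap protected words and knowledgebase terms for placeholders before translation and restore them afterwards. The sketch below assumes that approach; the slides do not describe the actual mechanism. The `KNOWLEDGEBASE` dictionary, the `PLACEHOLDER` token and the function names are hypothetical, and the sketch assumes the placeholder passes through the translation service unchanged.

```python
import re

# Hypothetical knowledgebase: (source, target) -> {term: forced translation}.
KNOWLEDGEBASE = {
    ("en", "fr"): {"ICT": "TIC"},
    ("en", "es"): {"ICT": "TIC"},
    ("fr", "en"): {"TIC": "ICT"},
    ("es", "en"): {"TIC": "ICT"},
}

# Assumed to survive the translation service untouched.
PLACEHOLDER = "XQZ{}XQZ"


def preprocess(paragraph, source, target):
    """Replace protected words and knowledgebase terms with placeholders.

    Returns the prepared text and the list of values to restore later.
    """
    kept = []

    def stash(value):
        kept.append(value)
        return PLACEHOLDER.format(len(kept) - 1)

    # Text enclosed in asterisks, such as *CIVIC*, is kept untranslated.
    text = re.sub(r"\*([^*\n]+)\*", lambda m: stash(m.group(1)), paragraph)
    # Knowledgebase terms are replaced by their forced translation.
    for term, forced in KNOWLEDGEBASE.get((source, target), {}).items():
        pattern = r"\b" + re.escape(term) + r"\b"
        text = re.sub(pattern, lambda m, f=forced: stash(f), text)
    return text, kept


def postprocess(translated, kept):
    """Put the stashed values back in place of their placeholders."""
    for index, value in enumerate(kept):
        translated = translated.replace(PLACEHOLDER.format(index), value)
    return translated
```

For example, `preprocess("*CIVIC* promotes ICT policy.", "en", "fr")` hides both CIVIC and ICT from the translation service, and `postprocess` then restores CIVIC as written and ICT as TIC.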
LIMITATIONS
The shorter a paragraph is, the less accurate the language guess. As a result, short introductory paragraphs such as greetings or openings, and single-word texts, will usually be translated wrongly or not at all
The current version works only with plain-text messages. The final version will try to convert HTML-formatted emails to plain text before processing them
The utility relies on Babelfish without a formal agreement (since it is free) and uses it in a way for which Babelfish was not designed. It is therefore vulnerable to the slightest change on the Babelfish web site
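
The planned HTML-to-plain-text step could look roughly like the sketch below, which uses only Python's standard `html.parser` module. It is an assumption about how such a conversion might be done, not the utility's implementation; `html_to_text` and the `BLOCK_TAGS` set are hypothetical names. Block-level elements become paragraph breaks so that the result can still be processed paragraph by paragraph.

```python
import re
from html.parser import HTMLParser

# Closing one of these tags ends a paragraph in the plain-text output.
BLOCK_TAGS = {"p", "div", "li", "tr", "blockquote", "h1", "h2", "h3", "h4", "h5", "h6"}


class _TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip = 0  # greater than zero while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "br":
            # A line break becomes whitespace so adjacent words stay separated.
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = max(0, self._skip - 1)
        elif tag in BLOCK_TAGS:
            self.parts.append("\n\n")

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def html_to_text(html):
    """Convert an HTML message body to plain-text paragraphs separated by blank lines."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    # Collapse whitespace inside each paragraph and drop empty paragraphs.
    paragraphs = [" ".join(p.split()) for p in re.split(r"\n\s*\n", text)]
    return "\n\n".join(p for p in paragraphs if p)
```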
THINGS TO RESOLVE
Character encoding issues
Who will manage the knowledgebase?
How are words entered into the database?
How is that decided?