Can Controlled Language Rules increase the value of MT?
Fred Hollowood & Johann Roturier
Symantec Dublin
Localisation Challenge
- Databases filled with English content
  - Large volumes
  - Perishable
  - Technical
- Fast delivery
- Cost effective
Goals
- Reduce cost of translation to 30%
- Implement CL within the authoring community
- Foster the use of editor software to police the CL rule set
- Identify the most efficient MT system for each target language
- Develop Post-Editing guidelines
- Refine Symantec glossaries to assist in dictionary preparation
Controlled Language and MT
- Controlled Language: Rule Sets, Terminology, Style, Editors
- MT system: Language Pairs (Jp, De, Fr, It, Es)
- Post Editing
- Assessment
Sequence of Events
- Identify a corpus
- Develop a test suite
- Develop terminology
- Work with MT engines
- Assess results
Two Questions
- How effective are CL rules in terms of post-editing effort?
- Which CL rules provide the best results?
Corpus Selection
- Origin: stream of XML messages
- Volume: 30,000 words
- Process
  - Use TM technology to pre-process raw XML to provide strings for MT
  - Use macros to tidy up untranslatable text
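The slides do not describe the TM tooling or the macros in detail, so the following is only a minimal sketch of the same idea in Python: pull translatable strings out of XML messages and mask untranslatable tokens before MT. The `message` element, the `id` attribute, and the `%NAME%` placeholder form are assumptions made for illustration, not the actual Symantec schema.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample; the real message schema is not given in the slides.
SAMPLE = """<messages>
  <message id="m1">Click <b>OK</b> to restart %PRODUCT%.</message>
  <message id="m2">   </message>
  <message id="m3">Scan complete.</message>
</messages>"""

# Assumed form of untranslatable tokens (e.g. product variables).
PLACEHOLDER = re.compile(r"%[A-Z_]+%")


def extract_strings(xml_text, tag="message"):
    """Return (id, text) pairs with inline markup flattened and whitespace collapsed."""
    root = ET.fromstring(xml_text)
    pairs = []
    for node in root.iter(tag):
        # itertext() walks the element and its children, so inline tags vanish.
        text = " ".join("".join(node.itertext()).split())
        if text:  # skip empty or whitespace-only messages
            pairs.append((node.get("id"), text))
    return pairs


def protect_placeholders(text):
    """Swap untranslatable tokens for numbered markers; return (masked text, tokens)."""
    found = []

    def repl(match):
        found.append(match.group(0))
        return f"__PH{len(found) - 1}__"

    return PLACEHOLDER.sub(repl, text), found
```

Calling `extract_strings(SAMPLE)` yields one string per non-empty message, and `protect_placeholders` can then be applied to each string so that the MT engine sees a stable marker instead of the variable. The original tokens are returned alongside so they can be restored after translation.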
Terminology Extraction
- Extraction tools: Wordsmith Tools 4
- Removal of duplicates
  - Spelling variants
  - Hyphenation variants
  - Capitalisation variants
  - Symbol/Plain
  - Abbreviation/Plain
- Removal of synonyms
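As an illustration of the duplicate-removal step, the sketch below groups candidate terms that differ only in capitalisation, hyphenation, or spacing. It is not the procedure used with Wordsmith Tools; the function names are hypothetical, and spelling variants, symbol/plain pairs, and abbreviation/plain pairs would need a lookup table that is not attempted here.

```python
import re
from collections import defaultdict


def variant_key(term):
    """Collapse capitalisation, hyphenation, and spacing differences into one key."""
    key = term.lower()
    # "anti-virus", "anti virus", and "antivirus" all become "antivirus".
    return re.sub(r"[-\s]+", "", key)


def group_variants(terms):
    """Map each normalised key to the distinct surface forms seen for it."""
    groups = defaultdict(list)
    for term in terms:
        forms = groups[variant_key(term)]
        if term not in forms:  # drop exact duplicates
            forms.append(term)
    return dict(groups)
```

Any key with more than one surface form marks a set of variants from which a terminologist would pick the preferred entry. Removing spaces can over-merge unrelated terms, so the groups are candidates for review rather than automatic merges.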
Custom Dictionaries
- Current MT systems
  - Systran Premium 4.0
  - Logomedia Translate Pro
  - Differing capabilities
  - Differing function
- Per target language
  - Grammars
  - Styles
Test Suite
- 59 rules examined
  - 17 of which already encapsulated in Symantec’s writing guidelines
- Classification
  - 8 lexical
  - 40 syntactic
  - 11 textual
Controlled Language Sources
Testing the Rules
- Process
  - Find an example sentence that does not conform to the rule
  - Edit it to conform to all other rules under study
  - Minimize the linguistic complexity (single test)
  - Apply the CL rule
  - Repeat the procedure to obtain 3 test examples
- Test Suite
  - 59 rules expressed as 177 sentences
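One possible way to represent the resulting suite is sketched below, with a check that the reported figures agree: 8 + 40 + 11 = 59 rules, and 59 rules at 3 examples each give 177 sentences. The `RulePair` record, its field names, the rule number, and the example sentences are invented for illustration; the slides do not show the actual test data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RulePair:
    rule_id: int       # hypothetical numbering
    category: str      # "lexical", "syntactic", or "textual"
    uncontrolled: str  # sentence that violates only the rule under test
    controlled: str    # the same sentence after the rule is applied


# Category counts as reported in the Test Suite slide.
CATEGORY_COUNTS = {"lexical": 8, "syntactic": 40, "textual": 11}
EXAMPLES_PER_RULE = 3


def expected_suite_size(counts=CATEGORY_COUNTS, per_rule=EXAMPLES_PER_RULE):
    """Number of test sentences implied by the rule counts."""
    return sum(counts.values()) * per_rule


# Invented example for the 'could' rule; not taken from the real suite.
example = RulePair(
    rule_id=1,
    category="lexical",
    uncontrolled="You could restart the computer.",
    controlled="You can restart the computer.",
)
```

Keeping the uncontrolled and controlled sentences side by side makes it straightforward to send both through each MT engine and compare the two outputs for the same rule.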
Post Editing Guidelines
- Ensure information transfer
- Modify what is grammatically deviant from commercial quality
- Modify what is lexically essential for understanding in the target language
- Avoid the use of synonyms for the sake of originality
- Don’t forget that all the words are probably present in the output (possibly in the wrong order)
- Remember that style does not matter but information accuracy does
- Don’t dally: if an improvement is not obvious, move along
Metrics Generation
- Quality levels: Excellent (4), Good (3), Medium (2), Poor (1)
- Uncontrolled source generates output A
- Controlled source generates output B
- Focus is on usability
- Evaluation by native speakers
- Further study is being done to link into other systems of quality evaluation
  - Blackjack
  - SAE J2450
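A comparison of output A and output B on the four-level scale could be computed along the following lines. This is a hedged sketch only: the slides do not say how per-rule scores were aggregated, so the mean-difference measure, the function names, and the sample ratings here are assumptions and do not reproduce the study's scoring.

```python
from statistics import mean

# Quality levels as listed on the slide.
LEVELS = {"Excellent": 4, "Good": 3, "Medium": 2, "Poor": 1}


def rule_gain(ratings):
    """ratings: (label_A, label_B) pairs for one rule; return mean(B) - mean(A)."""
    score_a = mean(LEVELS[a] for a, _ in ratings)
    score_b = mean(LEVELS[b] for _, b in ratings)
    return score_b - score_a


def improved_count(ratings):
    """Number of sentence pairs where the controlled output was rated higher."""
    return sum(1 for a, b in ratings if LEVELS[b] > LEVELS[a])


# Invented ratings for the three test sentences of one rule.
sample = [("Poor", "Good"), ("Medium", "Medium"), ("Good", "Excellent")]
```

A positive `rule_gain` means the controlled source produced better-rated MT output on average for that rule, and `improved_count` shows how many of the three examples drove the change.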
Overall evaluation (French)
Overall evaluation (Japanese)
Overall evaluation (German)
Preliminary Results
- CL has a significant impact
- Benefit varies by language
- Lots of scope for further study
- Some rules are more effective than others (score range: 0-17)
- Symantec’s implied rules have mixed effectiveness
- Recommend 7 additional rules
Additional rules
Rules with an impact in all languages
- Do not omit words within lexical items, even when the term has already been used in the sentence. (12)
- Repeat the head noun with conjoined articles or prepositions. (15)
- Do not use slashes to list lexical items (except for product names). (14)
- Always write a verb next to its particle. (17)
- Only use the modal ‘could’ when the sentence contains ‘if’, otherwise use ‘can’. (10)
- Be very careful with the -ing words:
  - If it is a gerund, use an article in front of it. (7)
  - If it is introducing a new clause, use ‘by’ in front of it. (8)
  - If it is modifying a noun in a non-finite clause, replace it with a relative clause. (5)
- Make sure that every segment can stand syntactically alone. (11)
- Avoid footnotes in the middle of a segment. Turn footnotes into independent segments. (11)
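Two of these rules are simple enough to approximate with pattern matching, which hints at how editor software might police them. The checker below is a rough sketch under that assumption, not the behaviour of any particular language checker; the function name and the `product_names` parameter are hypothetical.

```python
import re

SLASH_LIST = re.compile(r"\b[A-Za-z]+/[A-Za-z]+\b")  # e.g. "enable/disable"
COULD = re.compile(r"\bcould\b", re.IGNORECASE)
IF = re.compile(r"\bif\b", re.IGNORECASE)


def check_sentence(sentence, product_names=()):
    """Return a list of rule violations found in one sentence."""
    issues = []
    # Rule: do not use slashes to list lexical items (except product names).
    for match in SLASH_LIST.finditer(sentence):
        if match.group(0) not in product_names:
            issues.append(f"slash list: {match.group(0)}")
    # Rule: only use 'could' when the sentence contains 'if'.
    if COULD.search(sentence) and not IF.search(sentence):
        issues.append("'could' without 'if': use 'can'")
    return issues
```

For example, `check_sentence("You could enable/disable the scanner.")` reports both violations, while a conditional sentence using "could" passes. The slash pattern would also flag file paths and URL fragments, so a production checker would need exclusions beyond the product-name list. The syntactic rules, such as the -ing cases, need part-of-speech information and are not covered by this kind of surface matching.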
Next Steps
- Apply subsets of rules to a larger corpus
  - Language checker: Acrolinx
- Increase the number of MT engines studied
  - Comprendium/PROMT (European languages)
  - Fujitsu/Nova’s PC Transer (Japanese)
- Further refine Post Editing guidelines
- Keep abreast of upgrades in current systems
  - Bugs fixed
  - New versions of software
- Move to a production pilot project