
1 Digital Information and Heritage INFuture Zagreb, 7-9.11.2007.
Sentence Alignment as the Basis For Translation Memory Database
Sanja Seljan, Faculty of Humanities and Social Sciences, University of Zagreb, Department of Information Sciences, sanja.seljan@ffzg.hr
Angelina Gašpar, SOA Centre Split, ginasplit@yahoo.com
Damir Pavuna, Integra d.o.o., damir.pavuna@integra.hr

2 Overview
I Introduction
II When to use TMs?
– Text preparation
III Corpus used
– Text characteristics
IV Research
– Tools used
– Automatic and manual alignment
– Comparison of TMs
– Results
V Conclusion

3 Sentence alignment (SA) is the basis for:
– computer-assisted translation (CAT)
– terminology management
– term extraction
– word alignment
– cross-linguistic information retrieval
Sentence alignment (SA) -> translation memory (TM): the basis for further research in translation equivalencies

4 Problems in automatic SA:
– robustness
– discrepancies in layout and omissions -> influence on accuracy and on the TM

5 Research:
– SA on Croatian-English (Cro-Eng) parallel texts (laws, regulations, acts, decisions)
– alignment tool: WinAlign 7.5.0 (SDL Trados 2006 Professional)

6 Aim:
– impact of the SA process on the creation of a TM
– comparison of 3 types of TMs
Differences:
– in the level of expert intervention in the set-up of the alignment program
– in the preparation of the source text for segmentation

7 II When to use TMs?
– Fast and consistent translation (e.g. EU, multinational agencies)
– Voluminous texts
– Highly repetitive types of texts
– Use of specialized and consistent terminology
– Several languages
– Sharing of common resources (cooperation)
– Time-saving (speeds up the translation process)
– Cost-saving
– Consistent translation

8 Creation of TM
– Directly through translation
– Use of already translated material (alignment process)

9 III Corpus used
– 9 parallel legislative Croatian-English texts (bitexts): acts, laws, regulations, decisions and ordinances
– For the sake of uniformity: standard presentation and standard formulas
– 33.15%: percentage ratio for word count in the English translations

10 Reasons:
– English: an analytic type of language, use of the passive voice
– Croatian: a highly inflected language, use of the active voice
Repetitive legal terms, phrases and sentences
Main components of a regulation: the title, preamble, enacting terms, addressee, place, date and signature.
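
The slides do not say how the 33.15% word-count ratio was computed. As a minimal sketch, assuming it means the relative difference between the target and source word counts, the calculation could look as follows. The function names and the sentence pair are hypothetical and are not taken from the corpus or from AnyCount.

```python
def word_count(text: str) -> int:
    """Count whitespace-separated tokens (a rough proxy for a word-count tool)."""
    return len(text.split())


def percent_difference(source_words: int, target_words: int) -> float:
    """Relative difference of the target word count against the source, in percent."""
    return (target_words - source_words) / source_words * 100.0


# Hypothetical sentence pair, used only to illustrate the arithmetic.
croatian = "Ovaj Zakon stupa na snagu osmoga dana od dana objave."
english = "This Act shall enter into force on the eighth day after the day of its publication."

src_words = word_count(croatian)
tgt_words = word_count(english)
print(src_words, tgt_words, percent_difference(src_words, tgt_words))
```

An analytic language such as English tends to need more separate words (articles, auxiliaries, prepositions) than an inflected language such as Croatian, which is consistent with the reasons listed on the slide.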

11 Enacting terms: strict rules of presentation
– subject matter and scope
– definitions
– provisions conferring implementing power
– penalties or legal remedies
– transitional and final provisions
Standard form prescribes the layout on the page: spacing, paragraphing, punctuation and even typographic characteristics (capitalisation, typeface, boldface and italics)

12 Use of verbs in enacting terms
– Binding Croatian legislation: declarative terms (definitions, amendments) and imperative terms (commands, prohibitions)
– English “shall” = Croatian present tense, modals (morati, trebati)
– English “may” for prohibition, permission and authorisation = Croatian present tense (“ne može se”, “može se”)

13 Bitext similarities:
– punctuation, numbers, dates, foreign words
Differences:
– capital letters, hyphens, compound words, synonyms (avoided in target language)
Common points:
– consistent terminology, a uniform manner, gender-neutral language

14 IV Alignment research
Texts:
– translations of Croatian legislative acts, Cr->En
Tools:
– AnyCount 4.0 (version 405) for document structure analysis
– SDL Trados 2006 Professional (WinAlign 7.5.0) for the alignment process

15 Alignment research
PREPARATORY ACTIVITIES:
– comparison of the source and target texts (whether all text is translated)
– defining the set-up of end and skip rules (delimiters, creating an abbreviation user list)
– preparation of the source text for better segmentation (spelling, automatic bullets and numbering, deleting of soft returns, hyphens, certain punctuation, tables created with tabs and revision marks)
– modification of set-up rules
– verification of the alignment (especially 1:2 and 2:1 pairs and commitment of pairs)
– creation of translation memory and verification
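
The effect of end rules, skip rules and an abbreviation list on segmentation can be illustrated with a small sketch. This is not WinAlign's rule engine. It is a simplified stand-in that assumes segments end at `.`, `!`, `?` or `:` followed by whitespace, that soft returns appear in the exported text as vertical tabs or line breaks, and that the abbreviation list below (hypothetical) has been built by the user.

```python
import re

# Hypothetical abbreviation user list; a real one would be built for the corpus.
ABBREVIATIONS = {"art.", "no.", "br.", "npr."}


def segment(text: str, abbreviations: set[str] = ABBREVIATIONS) -> list[str]:
    """Split text at end-of-segment punctuation, skipping known abbreviations."""
    # Preparation step: replace soft returns (assumed to be vertical tabs or
    # line breaks) with spaces so they do not cut a sentence in two.
    text = re.sub(r"[\x0b\r\n]+", " ", text)
    segments, start = [], 0
    # End rule: '.', '!', '?' or ':' followed by whitespace.
    for match in re.finditer(r"[.!?:](?=\s)", text):
        end = match.end()
        last_token = text[start:end].split()[-1].lower()
        if last_token in abbreviations:
            continue  # skip rule: the full stop belongs to an abbreviation
        segments.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()
    if tail:
        segments.append(tail)
    return segments


# Hypothetical sample text with a soft return (\x0b) inside the first sentence.
sample = ("Pursuant to Art. 5 of the Act,\x0bthe following is decided: "
          "The Decision enters into force today. See Official Gazette No. 12.")
for seg in segment(sample):
    print(seg)
```

Without the abbreviation list the sample would be cut after "Art." and "No.", which is the kind of wrong segmentation that the set-up rules are meant to prevent.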

16 Alignment research
Automatic alignment
WinAlign has language-independent algorithms that count:
– the quality of translation units, which can have three levels (low, medium, high)
– translation units aligned as 1:2 or 2:1 pairs
– unconnected target segments
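
WinAlign's algorithms are not described in the slides. To show what 1:2 and 2:1 pairs and unconnected segments mean in practice, here is a minimal sketch of a generic length-based aligner using dynamic programming. The cost function and penalty values are arbitrary assumptions chosen for illustration, and the sentences are hypothetical.

```python
def align(src: list[str], tgt: list[str]) -> list[tuple[list[int], list[int]]]:
    """Align segments by character length.

    Returns (source indices, target indices) pairs. Allowed pair types:
    1:1, 1:2, 2:1, 1:0 (unconnected source) and 0:1 (unconnected target).
    """
    # (source segments consumed, target segments consumed, assumed penalty)
    pair_types = [(1, 1, 0.0), (1, 2, 0.3), (2, 1, 0.3), (1, 0, 1.0), (0, 1, 1.0)]
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj, penalty in pair_types:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                len_src = sum(len(s) for s in src[i:ni])
                len_tgt = sum(len(t) for t in tgt[j:nj])
                # Relative length mismatch between 0 and 1.
                mismatch = abs(len_src - len_tgt) / max(len_src, len_tgt, 1)
                candidate = cost[i][j] + mismatch + penalty
                if candidate < cost[ni][nj]:
                    cost[ni][nj] = candidate
                    back[ni][nj] = (i, j)
    # Walk back from the end to recover the cheapest sequence of pairs.
    pairs, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    return pairs[::-1]


# Hypothetical bitext: one Croatian sentence rendered as two English sentences.
croatian = ["Članak 1.",
            "Ovim se Zakonom uređuju prava korisnika te se propisuju njihove obveze."]
english = ["Article 1",
           "This Act regulates the rights of users.",
           "It also regulates their obligations."]
print(align(croatian, english))
```

Pairs in which one side holds two indices are the 1:2 and 2:1 cases that the preparatory activities single out for manual verification.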

17 Alignment research
Manual alignment
– source text corresponds to translated target segment (Aligned TM)
– set-up of the alignment program (Aligned TM + set-up rules, e.g. segment and skip rules, abbreviation user list)
– segmentation of the source text (e.g. changes of soft returns, check of colon segmentation)

18 Alignment research

19 Alignment research
Match      Raw TM    Aligned TM    + Setup rules    ++ Segmented source
100%       121       106           112              120
95%-99%    0         0             0                0
85%-94%    2         5             0                0
75%-84%    2         2             1                0
50%-74%    1         1             2                0
No match   6         18            11               0
Total      132       132           126              120
Percent    91.67%    80.30%        88.89%           100%
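
The "Percent" row can be re-derived from the counts: it is the share of 100% matches in each column total. The short check below assumes the run-together figures in the table split column by column as Raw TM, Aligned TM, + Setup rules and ++ Segmented source. The variable and function names are illustrative only.

```python
# Segment counts per match band, in the order:
# 100%, 95%-99%, 85%-94%, 75%-84%, 50%-74%, No match
counts = {
    "Raw TM": [121, 0, 2, 2, 1, 6],
    "Aligned TM": [106, 0, 5, 2, 1, 18],
    "+ Setup rules": [112, 0, 0, 1, 2, 11],
    "++ Segmented source": [120, 0, 0, 0, 0, 0],
}


def exact_match_share(column: list[int]) -> float:
    """Percentage of segments that fall in the 100% band, rounded to two decimals."""
    return round(column[0] / sum(column) * 100, 2)


for name, column in counts.items():
    print(f"{name}: total={sum(column)}, 100% matches={exact_match_share(column)}%")
```

Under this reading the column totals come to 132, 132, 126 and 120 segments, and the computed shares agree with the percentages on the slide.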

20 Alignment research
Conclusion
– The translation memories created in this study from different types of alignment process give different results regarding the quality of the translated material.
– The results show that expert intervention is necessary when defining the set-up rules, in the preparatory activities for source text segmentation and in the verification of suggested translation units.

