LREC 2004, 26 May 2004, Lisbon
Multimodal Multilingual Resources in the Subtitling Process
S. Piperidis, I. Demiros, P. Prokopidis, P. Vanroose, A. Hoethker, W. Daelemans, E. Sklavounou, M. Konstantinou, Y. Karavidas
{spip, iason, {walter.daelemans,

Subtitling
Subtitling aims at enhancing a TV (or similar) programme with a written version of its narrative or dialogue
The written version can be in:
–the same language as the spoken one: monolingual subtitling
–a different language from the spoken one: multilingual subtitling

Challenges of Subtitling
the challenge (in automated generation) is that the subtitles must agree with the spoken source language and the corresponding image
generated subtitles must meet a set of constraints imposed by the visual context of the text and by spatio-temporal factors
subtitle text is no longer normal written text but rather oral text

Experiments in MUSA
experiments on monolingual and multilingual subtitle generation
Languages:
–English: source & target
–French & Greek: target
Technologies used:
–English ASR component for the transcription of audio streams into text
–Subtitling component producing English subtitles from English audio transcriptions
–Translation component integrating Machine Translation and Translation Memory, for EN-FR & EN-EL

Architecture
[Figure: system architecture diagram]

Resources for subtitling
in order to train and evaluate system components, an array of application-specific resources is necessary
primary audiovisual data from BBC World Service: documentaries and “newsy” current affairs programmes
for each programme, the following parallel data are sourced:
–the actual video of the programme
–its script or hand-made transcript
–English, Greek and French subtitles
–topically relevant newspaper and web-sourced extracts

Resources overview
Columns: Scripts | Transcripts | Scripts + Transcripts | EN subtitles | EL subtitles | FR subtitles
Rows: Horizon | Panorama | Misc | DVDs | Totals
[Table: cell values not available]

Speech recognition component
Use of a parallel corpus of BBC programmes (audio and hand-made transcripts), as well as topically relevant newspaper texts
Tuning of the acoustic and language models of the KUL/ESAT recogniser
Background noise and non-native speech hinder recognition
Aligning the audio with the hand-made transcripts proved to be a working solution, helping to overcome the problems caused by noise and non-native speakers

Speech recognition component (2)
[Figure: not reproduced]

Constraints & Requirements
subtitling conventions in various EU countries
the constraints entail that segments of the transcripts must be compressed
the compression rate is expressed as the number of words and the number of characters to delete
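
A minimal sketch of how such a compression target could be derived for one segment. The reading-speed limit of 15 characters per second and the function name `compression_target` are assumptions made for illustration; they are not values or names taken from the MUSA system. The word count is estimated from the average word length of the segment.

```python
import math
from typing import Tuple


def compression_target(segment: str, duration_s: float,
                       max_chars_per_s: float = 15.0) -> Tuple[int, int]:
    """Return (words_to_delete, chars_to_delete) for one transcript segment.

    max_chars_per_s is an assumed reading-speed limit, not a MUSA value.
    """
    words = segment.split()
    n_chars = len(segment)
    # Number of characters that fit on screen for the segment's duration.
    budget = int(max_chars_per_s * duration_s)
    chars_to_delete = max(0, n_chars - budget)
    if chars_to_delete == 0 or not words:
        return 0, 0
    # Estimate the word count from the average word length (spaces included).
    avg_word_len = n_chars / len(words)
    words_to_delete = min(len(words), math.ceil(chars_to_delete / avg_word_len))
    return words_to_delete, chars_to_delete


print(compression_target("It was clear that the experiment had failed", 2.0))
```

A segment that already fits its time slot yields a target of zero words and zero characters, so no compression step is triggered for it.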

Subtitling engine & resources
Use of a parallel corpus of BBC programmes featuring hand-validated transcripts and their hand-made subtitles
Align sentences and words in the parallel corpus
Extract a table of paraphrases used for compression
Examples:
–Within the next few years -> Soon
–During the years when -> While
–It was clear that -> Clearly
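
Applying a paraphrase table of this kind can be sketched as follows. The three entries are the examples from the slide; the function name `apply_paraphrases`, the longest-match-first order and the capitalisation handling are illustrative assumptions, not a description of the actual engine.

```python
import re
from typing import Dict

# Entries taken from the slide's examples; a real table would be
# extracted from the aligned transcript/subtitle corpus.
PARAPHRASES: Dict[str, str] = {
    "within the next few years": "soon",
    "during the years when": "while",
    "it was clear that": "clearly",
}


def apply_paraphrases(segment: str, table: Dict[str, str] = PARAPHRASES) -> str:
    """Replace each known long phrase with its shorter paraphrase."""
    # Try longer source phrases first so they are not pre-empted by shorter ones.
    for source in sorted(table, key=len, reverse=True):
        target = table[source]
        pattern = re.compile(r"\b" + re.escape(source) + r"\b", re.IGNORECASE)

        def replace(match, target=target):
            # Keep a leading capital if the replaced phrase had one.
            return target.capitalize() if match.group(0)[0].isupper() else target

        segment = pattern.sub(replace, segment)
    return segment


print(apply_paraphrases("It was clear that prices would rise within the next few years."))
```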

Subtitling engine & resources (2)
If the compression rate is not reached by paraphrasing, apply syntactic rules to delete low-importance units (e.g. adverbs, adjectives, etc.)
Hand-crafted deletion rules making use of:
–A shallow parse of the segments
–Surprise values for each word, computed on the basis of a large text corpus
If there are more deletable segments than necessary, delete the least important ones first
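
The deletion step can be illustrated with a simplified sketch. It assumes that a word's surprise value is its negative log2 unigram probability in a reference corpus, and it treats single POS-tagged words rather than shallow-parse chunks as deletable units; both are assumptions, as the slides do not give the exact definitions. The tag names `ADV` and `ADJ` and the function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def surprise_values(corpus_tokens: List[str]) -> Dict[str, float]:
    """Assumed definition: surprise(w) = -log2 p(w) under a unigram model."""
    counts = Counter(t.lower() for t in corpus_tokens)
    total = sum(counts.values())
    return {w: -math.log2(c / total) for w, c in counts.items()}


def compress(tagged: List[Tuple[str, str]], surprise: Dict[str, float],
             chars_to_delete: int,
             deletable_tags: Sequence[str] = ("ADV", "ADJ")) -> str:
    """Delete the least surprising deletable words until the target is met."""
    candidates = [i for i, (_, tag) in enumerate(tagged) if tag in deletable_tags]
    # Words unseen in the corpus count as the most surprising (most important).
    unseen = max(surprise.values(), default=0.0) + 1.0
    candidates.sort(key=lambda i: surprise.get(tagged[i][0].lower(), unseen))
    removed = set()
    saved = 0
    for i in candidates:
        if saved >= chars_to_delete:
            break
        removed.add(i)
        saved += len(tagged[i][0]) + 1  # the word plus one separating space
    return " ".join(w for i, (w, _) in enumerate(tagged) if i not in removed)


corpus = "the very big dog saw the very small cat and the very old bird".split()
values = surprise_values(corpus)
sentence = [("the", "DET"), ("very", "ADV"), ("enormous", "ADJ"),
            ("dog", "NOUN"), ("barked", "VERB")]
print(compress(sentence, values, chars_to_delete=3))
```

Sorting by ascending surprise means frequent, predictable words are removed first, which mirrors the rule of deleting the least important segments first.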

Translation component
integrate TM (TrAID) and MT (Systran)
align EN hand-made subtitles with FR and EL hand-made subtitles
build a translation memory database (a high percentage of unique translation units, which is not unexpected)
perform term extraction on the parallel corpus
hand-validate the automatically extracted terms and use them to customise translation
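
One common way to combine a translation memory with machine translation is to try a fuzzy TM match first and fall back to MT otherwise. The sketch below illustrates that general idea only: it does not use or reproduce the TrAID or Systran interfaces, the similarity measure (`difflib.SequenceMatcher`) and the 0.85 threshold are assumptions, and `mt_translate` is a hypothetical stand-in for any MT engine.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, Tuple


def translate(segment: str, memory: Dict[str, str],
              mt_translate: Callable[[str], str],
              threshold: float = 0.85) -> Tuple[str, str]:
    """Return (translation, origin) where origin is "TM" or "MT"."""
    best_source, best_score = None, 0.0
    for source in memory:
        # Character-level similarity between the new segment and a stored one.
        score = SequenceMatcher(None, segment.lower(), source.lower()).ratio()
        if score > best_score:
            best_source, best_score = source, score
    if best_source is not None and best_score >= threshold:
        return memory[best_source], "TM"
    # No sufficiently close match in the memory: fall back to MT.
    return mt_translate(segment), "MT"


memory = {"Good evening.": "Bonsoir."}


def fake_mt(text: str) -> str:
    # Placeholder that only marks the text; a real engine would translate it.
    return "<MT>" + text


print(translate("Good evening.", memory, fake_mt))
print(translate("The volcano erupted.", memory, fake_mt))
```

With a high proportion of unique translation units in the memory, most segments would be expected to take the MT branch in a setup like this.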

Subtitle editing
responsible for textual operations, tokenisation and subtitle text splitting, and calculation of cue-in/cue-out timecodes
Requirement: subtitle text should be segmented at the highest syntactic nodes possible
Hand-crafted rules of the type: “Cut after punctuation”, “Cut after personal pronouns following a verb phrase”, etc.
For EN, use of the available shallow-parse information
For FR and EL, use of POS tagging information did not produce worse results
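
A rough sketch of the splitting and timecode steps, covering only the “Cut after punctuation” rule. The 37-character line limit, the fallback of cutting at the last space, and the allocation of time in proportion to text length are assumptions made for illustration; the function names are hypothetical.

```python
from typing import List, Tuple


def split_subtitle(text: str, max_chars: int = 37) -> List[str]:
    """Greedy split: cut after the last fitting punctuation mark, else at a space."""
    chunks = []
    rest = text.strip()
    while len(rest) > max_chars:
        window = rest[:max_chars + 1]
        # Cut positions just after a punctuation mark that is followed by a space.
        cut = max((i + 1 for i, ch in enumerate(window[:-1])
                   if ch in ",.;:?!" and window[i + 1] == " "), default=0)
        if cut == 0:
            cut = window.rfind(" ")  # assumed fallback: last space that fits
        if cut <= 0:
            cut = max_chars  # no break point at all: hard cut
        chunks.append(rest[:cut].strip())
        rest = rest[cut:].strip()
    if rest:
        chunks.append(rest)
    return chunks


def cue_times(chunks: List[str], start_s: float,
              end_s: float) -> List[Tuple[float, float]]:
    """Give each chunk a (cue-in, cue-out) pair proportional to its length."""
    total = sum(len(c) for c in chunks)
    cues = []
    t = start_s
    for c in chunks:
        share = (end_s - start_s) * len(c) / total
        cues.append((round(t, 3), round(t + share, 3)))
        t += share
    return cues


text = "When the storm passed, the villagers returned to rebuild their homes."
parts = split_subtitle(text)
for part, cue in zip(parts, cue_times(parts, 12.0, 18.0)):
    print(cue, part)
```

Rules that depend on syntax, such as cutting after a personal pronoun that follows a verb phrase, would need the shallow-parse or POS information mentioned above and are not covered here.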

Evaluation (in progress)
so far, relatively poor ASR results for subtitling purposes
the alignment mode of ASR yielded >90% accuracy
grammaticality and acceptability of subtitles: >80%
the BLEU protocol was applied for subtitling evaluation
evaluation of the translation component and of the integrated prototype is ongoing
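
For readers unfamiliar with BLEU, the sketch below computes a simplified sentence-level variant: the geometric mean of clipped n-gram precisions multiplied by a brevity penalty. Using a single reference, n-grams up to length 2 and no smoothing are simplifying assumptions; the slides do not specify how the metric was configured in the project.

```python
import math
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate: List[str], reference: List[str], max_n: int = 2) -> float:
    """Simplified single-reference BLEU without smoothing."""
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        ref = ngrams(reference, n)
        # Clipped counts: a candidate n-gram is credited at most as often
        # as it occurs in the reference.
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        if overlap == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # The brevity penalty punishes candidates shorter than the reference.
    if len(candidate) >= len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(sum(log_precisions) / max_n)


print(bleu("the cat sat".split(), "the cat sat down".split()))
```

The brevity penalty matters for subtitles in particular: a compressed subtitle is shorter than a full transcript by design, so the choice of reference (hand-made subtitle versus transcript) strongly affects the score.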

Conclusions
human subtitling is an extremely complex process
a simplified computational model is feasible
an architecture for a multilingual subtitling system has been implemented
an interesting array of resources has been collected and processed at different levels, yielding useful derivative resources
evaluation is even more challenging!