Creation of Lexica for Speech Centered Translation
Ute Ziegenhain (Siemens AG), Asuncion Moreno (UPC), Nuria Castell (UPC)

The LC-STAR project (IST )

Objectives:
- Track I (duration 2 years): Specification and creation of large word lists and lexica suited for flexible vocabulary speech recognition and high quality speech synthesis covering a wide range of domains.
- Track II (duration 3 years):
  - Investigation of speech centered translation technologies, focusing on requirements concerning language resources (LR)
  - Specification and creation of corpora and lexica needed for speech centered translation
  - Building a demonstrator for speech-to-speech translation
  - Demonstration of language transfer in Catalan, Spanish and US-English

Project partners: 4 industrial partners, 2 partners from universities, 1 external partner.

Two approaches:
- Bilingual word-by-word translation lexica with enriched morphological information
  - Advantage: reduction of the word error rate (WER)
  - Disadvantage: for more highly inflected languages the lexicon size increases by a factor of at least 7, and the effort varies greatly between languages. These lexica are therefore only provided for Catalan and Spanish, for statistical experiments.
- 'Phrasal' lexica consisting of bilingual short phrases typically found in a tourist domain environment
  - Advantages: reduction of out-of-vocabulary (OOV) words, better alignment and lexicon model
  - Disadvantage: selection of adequate corpora

Source Corpora
- US-English corpus from Verbmobil (112,541 tokens): orthographic transcriptions of telephone conversations in US-English for an appointment scheduling domain
- US-English corpus from the TALP corpus (408,452 tokens): US-English sentences translated from orthographic transcriptions of telephone conversations in Spanish and Catalan for a tourist domain
- Web corpus (2,640,562 tokens): text corpora downloaded from tourist web pages in US-English
- Phrasal corpus: 1500 expressions in US-English selected from tourist phrase books

Creation of Reference Corpus

Procedure:
1. Create text corpora in a reference language (US-English) in a given domain.
2. Select the most frequent content words (i.e. nouns, verbs, adjectives, etc.) to create a representative word list of the domain.
3. For each word in the word list, provide the syntactic context in which the word is embedded.
4. Cut the sentence down to a segment that contains the word. Segments are usually shortened to nominal phrases (for nouns and adjectives) or to subject plus verb plus short complement (for verbs).
5. Manually correct the phrases (e.g. typing and orthographic errors, meaningless or offensive phrases, proper names, etc.).
6. Add a set of typical phrasal expressions commonly used in the semantic domain. The set is manually chosen from several tourist textbooks.

-> Result: 'phrasal' reference lexicon consisting of short phrases

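Steps 2 to 4 of the procedure can be illustrated with a minimal Python sketch. It is not the project's actual tooling: the small `FUNCTION_WORDS` set is a hypothetical stand-in for real content-word selection (which would normally rely on a PoS tagger), the fixed token window in `segment_around` only approximates cutting a sentence down to a nominal or verbal phrase, and the three sample sentences are invented for illustration.

```python
import re
from collections import Counter

# Hypothetical stand-in for content-word selection; a PoS tagger
# would normally decide which tokens are nouns, verbs, adjectives, etc.
FUNCTION_WORDS = {
    "the", "a", "an", "is", "to", "of", "in", "at", "and",
    "i", "do", "you", "have", "for", "on",
}


def tokenize(sentence):
    """Lowercase a sentence and split it into word tokens."""
    return re.findall(r"[a-z']+", sentence.lower())


def frequent_content_words(sentences, top_n):
    """Step 2: return the top_n most frequent non-function words."""
    counts = Counter(
        tok
        for sentence in sentences
        for tok in tokenize(sentence)
        if tok not in FUNCTION_WORDS
    )
    return [word for word, _ in counts.most_common(top_n)]


def segment_around(sentence, word, window=2):
    """Steps 3-4: return a short segment of the sentence containing the word.

    A fixed window of neighbouring tokens is an assumption made here;
    the poster describes shortening to syntactic phrases instead.
    """
    tokens = tokenize(sentence)
    if word not in tokens:
        return None
    i = tokens.index(word)
    return " ".join(tokens[max(0, i - window): i + window + 1])


# Invented sample corpus, for illustration only.
corpus = [
    "Do you have a double room for two nights",
    "The double room is on the second floor",
    "I need a room at the hotel",
]

word_list = frequent_content_words(corpus, top_n=2)
segments = {}
for word in word_list:
    found = [segment_around(sentence, word) for sentence in corpus]
    segments[word] = [seg for seg in found if seg is not None]
```

The resulting segments would still need the manual correction described in step 5 before entering the reference lexicon.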
Format

A textual format is used, with XML-based mark-up in accordance with a common and a language-specific Document Type Definition (DTD).

Advantages of using XML:
- Widely known technique
- Many tools supporting it are available
- Supports Unicode (useful for languages with non-Latin writing systems)
- Allows easy and concise representation of one-to-many relations (multiple translations, multiple PoS, etc.)
- Easily definable and flexible syntax
- Easy well-formedness tests are possible using publicly available tools

Content

Set of segments:
- Source language segment: orthography of the source phrase
- Target language segment: target language translation with orthography, one PoS (NOM, VER, ADJ, PRO, …) and lemma
- Additional information is possible (e.g. tags for foreign words, etc.)

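As a rough illustration of how such a segment pair could be represented in XML and checked for well-formedness, here is a small Python sketch using the standard library. The element names (`entry`, `source`, `target`, `orth`, `pos`, `lemma`) and the sample phrase are assumptions made for this sketch; the actual LC-STAR DTD is not reproduced here and may use different names and structure.

```python
import xml.etree.ElementTree as ET


def build_entry(source_phrase, translations):
    """Build one lexicon entry with a source segment and 1..n target segments.

    Element names are hypothetical, not taken from the LC-STAR DTD.
    """
    entry = ET.Element("entry")
    ET.SubElement(entry, "source").text = source_phrase
    for translation in translations:
        target = ET.SubElement(entry, "target")
        ET.SubElement(target, "orth").text = translation["orth"]
        ET.SubElement(target, "pos").text = translation["pos"]
        ET.SubElement(target, "lemma").text = translation["lemma"]
    return entry


def is_well_formed(xml_text):
    """Return True if the text parses as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Invented sample data, for illustration only.
entry = build_entry(
    "a double room",
    [{"orth": "una habitación doble", "pos": "NOM", "lemma": "habitación"}],
)
xml_text = ET.tostring(entry, encoding="unicode")
```

Because each `target` is a separate child element, several translations of the same source phrase fit naturally into one entry, which is the one-to-many property listed among the advantages of XML. Note that this check covers well-formedness only; validation against a DTD needs a validating parser, which the Python standard library does not provide.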
Partners and Languages

Translation Methodology

1. Translate as literally as possible with respect to the source text, while preserving syntactic correctness, semantic meaning and naturalness.
2. Idiomatic expressions are translated and marked as such.
3. Ambiguities: select the most plausible translation with respect to the semantic domain; otherwise provide more than one translation.
4. Proper nouns are marked, and translated only when a translated form is used in the target language (e.g. AIDS -> SIDA).
5. Punctuation marks are separated from words and should be kept.
6. Digits should be kept unless a transcription is required in the target language.
7. Abbreviations should be expanded or kept abbreviated, depending on usage in the target language.
8. Foreign words can optionally be labeled with a tag.
9. Partial words (e.g. due to false starts): if the reference phrase does not provide enough context to disambiguate, generate the partial target word followed by the + mark.

First Experiments and Preliminary Results

Approach:
- Phrases occurring in all three languages are added to the training corpus
- The training corpus consists of selected dialogues from the Verbmobil and TALP tourism corpora

Preliminary results:
- Reduced OOV rate (13% relative for Spanish and 23% for Catalan)
- Overall better translation of certain phrases from the tourist domain
- No significant change in translation error rates yet

References:
- Asuncion Moreno et al. (2004): Language Independent Specification of LR for Translation. D5.5 of the LC-STAR project, IST , to be published.
- Nicola Ueffing (2004): Results on Different Structured LR for Speech-to-Speech Translation. D4.5 of the LC-STAR project, IST , to be published.
- Maja Popović, Hermann Ney (2004): Towards the Use of Word Stems & Suffixes for Statistical Machine Translation. LREC 2004, Lisbon.

Contact: Ute Ziegenhain,
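
For readers unfamiliar with the "relative" OOV figures quoted in the preliminary results, the following Python sketch shows one common way to compute an OOV rate and its relative reduction. The function names, the token-level definition of OOV and all sample data are assumptions made for illustration; they are not the poster's evaluation setup and do not reproduce its numbers.

```python
def oov_rate(test_tokens, vocabulary):
    """Fraction of test tokens that do not occur in the vocabulary."""
    if not test_tokens:
        raise ValueError("test_tokens must not be empty")
    unknown = sum(1 for token in test_tokens if token not in vocabulary)
    return unknown / len(test_tokens)


def relative_reduction(baseline, improved):
    """Relative reduction of a rate, e.g. 0.4 -> 0.2 is a 50% reduction."""
    if baseline == 0:
        raise ValueError("baseline rate must be non-zero")
    return (baseline - improved) / baseline


# Invented sample data, for illustration only.
test_tokens = (
    "quiero una habitación doble para dos noches por favor gracias".split()
)
baseline_vocab = {"quiero", "una", "para", "dos", "por", "favor"}
# Vocabulary after adding words from (hypothetical) phrasal lexicon entries.
extended_vocab = baseline_vocab | {"habitación", "doble"}

baseline = oov_rate(test_tokens, baseline_vocab)
improved = oov_rate(test_tokens, extended_vocab)
reduction = relative_reduction(baseline, improved)
```

In this toy case the OOV rate drops from 4 of 10 tokens to 2 of 10, a relative reduction of 50%.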