Named Entities in Domain Unlimited Speech Translation
Alex Waibel, Stephan Vogel, Tanja Schultz
Carnegie Mellon University
Interactive Systems Labs

Nov 17, 2003, ITIC MT Integration Meeting

Objective
Extraction and Translation of Arabic Named Entities from Speech
Problem:
– How do we do Domain-Unlimited Speech Translation?
– What to do with Named Entities in Speech?
Named Entities are Typically OoVs
– Recognizer will Replace Them with a WRONG Word
– Named Entity is Unlikely to be Handled Right
Translation of Named Entities
– Named Entities are Frequently not in the Lexicon

Approach – Speech Translation
Piggy-Back on STR-DUST (NSF-ITR Project):
– Speech Translation on Domain Unlimited Speech Tasks
Approach:
– Recognition: Statistical Speech Recognition
– Consolidation: Statistical Reduction and Extraction
– Translation: Statistical MT
Opportunity:
– Cascade of Statistical Source-Channel Models
– Integration and Optimization
– Combine and Compute Joint Models
– Working with Errors: Lattices to Communicate between Modules

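The "Cascade of Statistical Source-Channel Models" bullet can be made concrete with the usual noisy-channel formulation of speech translation. The slides do not give a formula, so the following is a sketch under assumptions: x denotes the acoustic observations, f an Arabic word sequence, e an English word sequence, and \mathcal{L}(x) the recognizer's word lattice (all symbols are introduced here for illustration).

```latex
\begin{align*}
\hat{e} &= \arg\max_{e} P(e \mid x) \\
        &= \arg\max_{e} P(e) \sum_{f} P(f \mid e)\, P(x \mid f, e) \\
        &\approx \arg\max_{e} P(e) \sum_{f \in \mathcal{L}(x)} P(f \mid e)\, P(x \mid f)
\end{align*}
```

The last step assumes that the acoustics depend on the English sentence only through the Arabic word sequence, and restricts the sum to the hypotheses kept in the lattice. Under this reading, the lattice is what allows the translation module to work with recognition errors instead of committing to a single first-best transcript.
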
Approach – Named Entities
Two Pass Decoding Strategy
– OoVs in Speech: Recover Named Entity in Dictionary
– Identify Relevant Names from Very Large Name Lists
– Search for Relevant New Names on Internet
– Insert Named Entities in Dictionary, Iterate
New Word Model
– Model Unseen Words by New-Word-Model
– Assign Named Entity Tag to New-Word
– Bi-Lingual Named Entity Tagging
Recover Named Entity
– Identify Relevant Names from Translation Output
– IR of Relevant Texts in Target Language
– Use Transliteration Model to Update Lexicon

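The "Identify Relevant Names from Very Large Name Lists" and "Insert Named Entities in Dictionary, Iterate" steps might be sketched as follows. This is a minimal illustration, not the system described in the slides: it assumes the first pass yields a rough romanized spelling for each out-of-vocabulary region, and it swaps in plain string similarity from Python's standard `difflib` module for whatever matching the real system uses. The function names `select_candidate_names` and `update_dictionary` are hypothetical.

```python
from difflib import get_close_matches


def select_candidate_names(oov_guess, name_list, max_names=3, cutoff=0.6):
    """Return names from name_list that resemble a first-pass OoV guess.

    Assumption: oov_guess is a rough romanized spelling produced by a
    new-word model; matching is case-insensitive string similarity.
    """
    # Map lowercased names back to their original spelling.
    lowered = {name.lower(): name for name in name_list}
    matches = get_close_matches(
        oov_guess.lower(), list(lowered), n=max_names, cutoff=cutoff
    )
    return [lowered[m] for m in matches]


def update_dictionary(dictionary, oov_guesses, name_list):
    """Add matching names to the recognizer dictionary (a set of words).

    Returns the newly added names, so the caller can decide whether a
    second decoding pass is worthwhile.
    """
    added = []
    for guess in oov_guesses:
        for name in select_candidate_names(guess, name_list):
            if name not in dictionary:
                dictionary.add(name)
                added.append(name)
    return added
```

In a two-pass setup, `update_dictionary` would run between the passes, and decoding would be repeated only if it returned any new names.
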
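Input/Output
Input:
– Speech in source language (Arabic)
– Text in source language (Arabic)
Output:
– English translation of transcript
– English translation of extracted entities
Example input (Arabic; the excerpt starts mid-sentence):
القاعدة بزعامة أسامة بن لادن الهجومين اللذين استهدفا كنيسين يهوديين في إسطنبول واللذين أسفرا عن مقتل 23 شخصا وإصابة 300 آخرين. وهدد البيان بتوجيه مزيد من الضربات للولايات المتحدة وحلفائها في جميع أنحاء العالم.
(English gloss: "... al-Qaida, led by Osama bin Laden, the two attacks that targeted two Jewish synagogues in Istanbul and that left 23 people dead and 300 others injured. The statement threatened further strikes against the United States and its allies all over the world.")
Diagram labels: Reco; NE Search and Translation
Example output:
– Name: Abu Hafz
– Organization: al-Qaida
– Location: Baghdad
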
Evaluation
Correct Named Entity Detection
– Word Correct from Arabic Speech
– NE-Tag Correct from Arabic Transcript
Correct Translation
– Of Output Text (NIST, BLEU)
– Of Output Named Entity

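One way the "NE-Tag Correct" criterion could be scored is sketched below. The slides name only NIST and BLEU for the output text and do not say how named entity detection is measured, so precision, recall, and F1 over (entity, tag) pairs are an assumption here, and the function name `ne_detection_scores` is hypothetical.

```python
def ne_detection_scores(reference, hypothesis):
    """Precision, recall, and F1 over (entity string, tag) pairs.

    Assumption: an entity counts as correct only if both its string
    and its tag match a reference pair exactly.
    """
    ref = set(reference)
    hyp = set(hypothesis)
    correct = len(ref & hyp)
    precision = correct / len(hyp) if hyp else 0.0
    recall = correct / len(ref) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative data only, not results from the slides.
reference = [
    ("Abu Hafz", "Name"),
    ("al-Qaida", "Organization"),
    ("Baghdad", "Location"),
]
hypothesis = [
    ("al-Qaida", "Organization"),
    ("Baghdad", "Name"),  # right string, wrong tag: counted as an error
]
precision, recall, f1 = ne_detection_scores(reference, hypothesis)
```

Exact matching on both string and tag is the strictest choice; a looser variant could score detection and tagging separately.
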
First Results
NE Translation (Chinese)
Test data: 887 sentences
[Table: NIST scores on the small track and the large track for three systems (baseline, Offline NE, Online NE); the score values are not recoverable from the source.]
1. Online NE translation gives improvements for both tracks
2. Online NE translation works better on uncommon NE translation, and gives more improvement