The CMU TransTac 2007 Eyes-free and Hands-free Two-way Speech-to-Speech Translation System
Thilo Köhler and Stephan Vogel
Nguyen Bach, Matthias Eck, Paisarn Charoenpornsawat, Sebastian Stüker, ThuyLinh Nguyen, Roger Hsiao, Alex Waibel, Tanja Schultz, Alan W Black
Carnegie Mellon University, USA
IWSLT 2007 – Trento, Italy, Oct 2007

Outline
- Introduction & Challenges
- System Architecture & Design
- Automatic Speech Recognition
- Machine Translation
- Speech Synthesis
- Practical Issues
- Demo

Introduction & Challenges
- TransTac program & evaluation
- Two-way speech-to-speech translation system
- Hands-free and eyes-free
- Real-time and portable
- Indoor & outdoor use
- Force protection, civil affairs, medical
- Iraqi & Farsi: languages with rich inflectional morphology; Iraqi has no formal writing system
- 90 days for the development of the Farsi system (surprise language task)

Outline
- Introduction & Challenges
- System Architecture & Design
- Automatic Speech Recognition
- Machine Translation
- Speech Synthesis
- Practical Issues
- Demo

System Designs
- Eyes-free/hands-free use: no display or any other visual feedback; only speech is used for feedback
- Speech is used to control the system:
  - "transtac listen": turn translation on
  - "transtac say translation": say the back-translation of the last utterance
- Two user modes:
  - Automatic mode: automatically detect speech, build a segment, then recognize and translate it
  - Manual mode: a push-to-talk button for each speaker
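
A minimal sketch of how such spoken commands could be dispatched. Only the two command phrases are from the slide; the controller class and its wiring are hypothetical illustration.

```python
# Minimal sketch of spoken-command dispatch (hypothetical wiring; only the
# command phrases "transtac listen" / "transtac say translation" are from
# the slide). Transcripts are assumed to come from the English ASR.

class TransTacController:
    def __init__(self, speak):
        self.speak = speak                      # TTS callback
        self.translating = False
        self.last_back_translation = ""

    def handle(self, transcript):
        """Route one recognized utterance: command or ordinary speech."""
        cmd = transcript.strip().lower()
        if cmd == "transtac listen":
            self.translating = True             # turn translation on
        elif cmd == "transtac say translation":
            self.speak(self.last_back_translation)
        elif self.translating:
            return "translate"                  # hand off to the MT pipeline
        return None
```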

System Architecture
- Audio segmenter
- English ASR and Farsi/Iraqi ASR
- Bi-directional MT
- English TTS and Farsi/Iraqi TTS
- Framework connecting the components
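
A sketch of how these components might be wired per spoken turn; every object below is a hypothetical placeholder for the corresponding module named on the slide.

```python
from collections import namedtuple

# Hypothetical wiring of the components on this slide: each spoken turn flows
# segmenter -> speaker's ASR -> bi-directional MT -> listener's TTS.
Direction = namedtuple("Direction", ["src", "tgt"])     # e.g. ("en", "fa")

def process_turn(audio, components, direction):
    asr = components["asr"][direction.src]              # English or Farsi/Iraqi ASR
    tts = components["tts"][direction.tgt]
    for segment in components["segmenter"](audio):      # audio segmenter
        text = asr.recognize(segment)
        translation = components["mt"].translate(text, direction.src, direction.tgt)
        tts.speak(translation)
```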

Process Over Time (timeline diagram): English speech is first recognized by the English ASR (ASR delay); a confirmation output repeats the recognized sentence while the English-to-Farsi/Iraqi MT runs (MT delay); finally the Farsi/Iraqi translation output is played.

CMU Speech-to-Speech System (photo): laptop secured in a backpack, optional speech control, push-to-talk buttons, close-talking microphone, small powerful speakers.

Outline
- Introduction & Challenges
- System Architecture & Design
- Automatic Speech Recognition
- Machine Translation
- Speech Synthesis
- Practical Issues
- Demo

English ASR
- 3-state, subphonetically tied, fully continuous HMMs
- 4000 models, max. 64 Gaussians per model, 234K Gaussians in total
- 13 MFCCs, 15-frame stacking, LDA → 42 dimensions
- Trained on 138h of American Broadcast News data and 124h of meeting data
- Merge-and-split training, STC training, 2x Viterbi training
- MAP-adapted on 24h of DLI data
- Utterance-based CMS during training; incremental CMS and cMLLR during decoding
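
To make the last point concrete, here is a sketch of both CMS variants. The 13-dimensional MFCC input matches the slide; the running-mean decay constant is an assumption, not a value from the system description.

```python
import numpy as np

# Sketch of cepstral mean subtraction (CMS): utterance-based for training,
# incremental (running mean) for online decoding. decay=0.995 is an assumed
# constant.

def utterance_cms(frames):
    """frames: (T, 13) matrix of MFCC vectors; subtract the utterance mean."""
    return frames - frames.mean(axis=0, keepdims=True)

class IncrementalCMS:
    """Running cepstral mean, updated frame by frame during decoding."""
    def __init__(self, dim=13, decay=0.995):
        self.mean = np.zeros(dim)
        self.decay = decay

    def step(self, frame):
        self.mean = self.decay * self.mean + (1.0 - self.decay) * frame
        return frame - self.mean
```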

Iraqi ASR

                     2006 (PDA)    2007 (Laptop)
  Vocabulary         7k            62k
  #Gaussians/model   ≤ 32          ≤ 64
  Acoustic training  ML            MMIE
  Language model     3-gram        3-gram
  Data for AM        93 hours      320 hours
  Data for LM        1.2M words    2.2M words

- The ASR system uses the Janus Recognition Toolkit (JRTk) featuring the IBIS decoder.
- The acoustic model is trained with 320 hours of Iraqi Arabic speech data.
- The language model is a trigram model trained with 2.2M words.

Farsi ASR
- The Farsi acoustic model is trained with 110 hours of Farsi speech data.
- The first acoustic model is bootstrapped from the Iraqi model; two Farsi phones are not covered and are initialized with phones from the same phone category.
- A context-independent model was trained and used to align the data; regular model training is then applied to the aligned data.
- The language model is a trigram model trained with 900K words.

  Farsi ASR (2007)
  Vocabulary         33k
  #AM models         2K quintphones
  #Gaussians/model   max. 64
  Acoustic training  ML/MMIE
  Front-end          42 MFCC-LDA
  Data for AM        110 hours
  Data for LM        900K words

               ML build    MMIE build
  1.5-way      28.73%      25.95%
  2-way        51.62%      46.43%

Outline
- Introduction & Challenges
- System Architecture & Design
- Automatic Speech Recognition
- Machine Translation
- Speech Synthesis
- Practical Issues
- Demo

Typical Dialog Structure
- The English speaker gathers information from, or gives information to, the Iraqi/Farsi speaker
- English speaker: questions, instructions, commands
- Iraqi/Farsi speaker: yes/no and short answers

Example exchanges (English speaker / Farsi-Iraqi speaker):
  "Do you have electricity?" / "No, it went out five days ago."
  "How many people live in this house?" / "Five persons."
  "Are you a student at this university?" / "Yes, I study business."
  "Open the trunk of your car." / "Yes."

Training Data

  Iraqi→English       Source       Target
  Sentence pairs      502,380
  Unique pairs        341,149
  Words               2,578,920    3,707,592

  English→Iraqi
  Sentence pairs      168,812
  Unique pairs        145,319
  Words               1,581,281    1,133,230

  Farsi→English
  Sentence pairs      56,522
  Unique pairs        50,159
  Words               367,775      455,306

  English→Farsi
  Sentence pairs      75,339
  Unique pairs        47,287
  Words               504,109      454,599

Data Normalization
- Goal: minimize the vocabulary mismatch between the ASR, MT, and TTS components while maximizing the performance of the whole system
- Sources of vocabulary mismatch:
  - Different text preprocessing in different components
  - Different encodings of the same orthographic form
  - Lack of a standard writing system (Iraqi)
  - Words used with formal or informal/colloquial endings: raftin vs. raftid, "you went"
  - Word-internal modifications reflecting colloquial pronunciation: khune vs. khAne, "house"; midam vs. midaham, "I give"
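
A toy sketch of the normalization idea: map colloquial surface forms to one canonical form shared across components. The romanized example pairs are from the slide; the lookup-table approach itself is an illustrative assumption.

```python
# Toy sketch of vocabulary normalization: map colloquial (romanized) Farsi
# forms to a single canonical form shared by ASR, MT, and TTS. The table
# entries are the slide's examples; a real mapping would be far larger.

COLLOQUIAL_TO_CANONICAL = {
    "raftin": "raftid",     # "you went"
    "khune": "khAne",       # "house"
    "midam": "midaham",     # "I give"
}

def normalize_text(text):
    """Replace known colloquial tokens by their canonical forms."""
    return " ".join(COLLOQUIAL_TO_CANONICAL.get(tok, tok) for tok in text.split())

print(normalize_text("midam khune"))   # -> "midaham khAne"
```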

Phrase Extraction
- For Iraqi-English: PESA phrase extraction
- PESA phrase pairs are based on IBM Model 1 word alignment probabilities

PESA Phrase Extraction
- Online phrase extraction: phrases are extracted as needed from the bilingual corpus
- Advantage: long matching phrases become possible, which is especially valuable in the TransTac scenarios: "Open the trunk!", "I need to see your ID!", "What is your name?"
- Disadvantage: slow, up to 20 seconds per sentence
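
A rough sketch of the span-selection idea behind IBM1-based online phrase extraction. This is a simplified stand-in for PESA, not the exact CMU criterion; the lexicon table `t` is assumed to map target words to source-word probabilities.

```python
import math

# Simplified PESA-style span selection: for a source phrase in a training
# sentence pair, pick the target span whose words are best explained (under
# IBM Model 1 lexicon probabilities t[f][e]) by the phrase, while words
# outside the span are explained by the rest of the sentence.

def span_score(phrase, rest, tgt, lo, hi, t, floor=1e-7):
    score = 0.0
    for j, f in enumerate(tgt):
        support = phrase if lo <= j < hi else rest
        p = max((t.get(f, {}).get(e, 0.0) for e in support), default=0.0)
        score += math.log(max(p, floor))
    return score

def best_target_span(src, i1, i2, tgt, t):
    """Return the target span (lo, hi) for the source phrase src[i1:i2]."""
    phrase, rest = src[i1:i2], (src[:i1] + src[i2:]) or ["NULL"]
    candidates = [(lo, hi) for lo in range(len(tgt))
                  for hi in range(lo + 1, len(tgt) + 1)]
    return max(candidates, key=lambda s: span_score(phrase, rest, tgt, *s, t))
```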

Speed Constraints
- ...20 seconds per sentence is too long
- Solution: a combination of
  - pre-extracted common phrases (→ speedup)
  - online extraction for rare phrases (→ performance increase)
- Pruning of the phrase tables is also necessary
- Goal: no delay between the confirmation output and the translation output (see the process-over-time diagram above)
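
A small sketch of that hybrid lookup under the stated assumptions: common phrases are served from a pre-extracted table, rare ones fall back to the slow online extractor, with memoization so each rare phrase is only extracted once.

```python
# Sketch of the hybrid phrase lookup: fast path through a pre-extracted
# table, slow (memoized) path through online extraction for rare phrases.
# extract_online() stands in for an extractor like the PESA sketch above.

class HybridPhraseTable:
    def __init__(self, pretable, extract_online):
        self.pretable = pretable          # phrase -> list of (translation, score)
        self.extract_online = extract_online
        self.cache = {}

    def lookup(self, phrase):
        if phrase in self.pretable:       # fast path: pre-extracted common phrase
            return self.pretable[phrase]
        if phrase not in self.cache:      # slow path, run once per rare phrase
            self.cache[phrase] = self.extract_online(phrase)
        return self.cache[phrase]
```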

Pharaoh – Missing Vocabulary
- Some words in the training corpus are never translated because they occur only inside longer phrases of the Pharaoh phrase table
- E2F and F2E: 50% of the vocabulary is not covered; a similar phenomenon occurs in the Chinese and Japanese BTEC tasks
- PESA generates translations for all n-grams, including all individual words
- Trained two phrase tables and combined them; parameters were re-optimized through a minimum-error-rate training framework

  English→Farsi             BLEU
  Pharaoh + SA LM           15.42
  PESA + SA LM              14.67
  Pharaoh + PESA + SA LM    16.44
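
One plausible reading of "combined them", sketched below: keep both tables' scores as separate features per phrase pair and let minimum-error-rate training tune the feature weights. The feature layout is an assumption, not the documented setup.

```python
# Sketch of combining the Pharaoh and PESA phrase tables: union of entries,
# with each system's score kept as a separate feature whose weight is later
# tuned by minimum-error-rate training. Score layout is assumed.

def combine_tables(pharaoh, pesa):
    """Each table: src phrase -> {translation: score}. Returns merged features."""
    combined = {}
    for src in set(pharaoh) | set(pesa):
        combined[src] = {}
        for tgt in set(pharaoh.get(src, {})) | set(pesa.get(src, {})):
            combined[src][tgt] = (
                pharaoh.get(src, {}).get(tgt, 0.0),  # Pharaoh feature
                pesa.get(src, {}).get(tgt, 0.0),     # PESA feature
            )
    return combined
```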

Translation Performance (BLEU)

  Iraqi ↔ English: PESA phrase pairs (online + pre-extracted)
  English → Iraqi    42.12
  Iraqi → English    63.49

  Farsi ↔ English: Pharaoh + PESA (pre-extracted)
  English → Farsi    16.44
  Farsi → English

- LM options:
  - 3-gram SRI language model (Kneser-Ney discounting)
  - 6-gram suffix-array language model (Good-Turing discounting)
  - The 6-gram consistently gave better results

  English→Farsi       Dev Set    Test Set
  Pharaoh + SRI LM
  Pharaoh + SA LM
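
To illustrate the suffix-array LM option: a suffix array over the training corpus allows counting arbitrary n-grams (up to the 6-grams mentioned) by binary search. The sketch below (Python 3.10+, for bisect's key=) covers only the counting, not Good-Turing smoothing, and uses a toy O(n^2 log n) construction.

```python
import bisect

# Toy suffix-array n-gram counter: index the token corpus once, then count
# any n-gram via two binary searches over the sorted suffixes.

class SuffixArrayCounter:
    def __init__(self, tokens):
        self.tokens = tokens
        # sort suffix start positions lexicographically (toy construction)
        self.sa = sorted(range(len(tokens)), key=lambda i: tokens[i:])

    def count(self, ngram):
        ngram = tuple(ngram)
        key = lambda i: tuple(self.tokens[i:i + len(ngram)])
        lo = bisect.bisect_left(self.sa, ngram, key=key)
        hi = bisect.bisect_right(self.sa, ngram, key=key)
        return hi - lo

corpus = "open the trunk of the car please open the door".split()
counter = SuffixArrayCounter(corpus)
print(counter.count("open the".split()))   # -> 2
print(counter.count("the".split()))        # -> 3
```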

Outline
- Introduction & Challenges
- System Architecture & Design
- Automatic Speech Recognition
- Machine Translation
- Speech Synthesis
- Practical Issues
- Demo

Text-to-Speech
- TTS from Cepstral, LLC's SWIFT
- Small-footprint unit selection
- Iraqi voice: ~2000 domain-appropriate, phonetically balanced sentences
- Farsi voice, constructed in 90 days:
  - 1817 domain-appropriate, phonetically balanced sentences
  - Recorded the data from a native speaker
  - Constructed a pronunciation lexicon and built the synthetic voice itself
  - Used the CMUSPICE Rapid Language Adaptation toolkit to design the prompts

Pronunciation
- Iraqi/Farsi pronunciation from Arabic script
- Explicit lexicon: words (without vowels) to phonemes, shared between ASR and TTS
- OOV pronunciations from a statistical model: CART prediction from letter context
  - Iraqi: 68% of OOV words correct
  - Farsi: 77% of OOV words correct
- (Probably) the Farsi script is better defined than the Iraqi script, which is not normally written
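
A toy sketch of letter-context pronunciation prediction in the CART spirit, here with scikit-learn's decision tree and an invented two-word lexicon; the feature encoding and data are illustrative assumptions, not the project's lexicon or trees.

```python
from sklearn.tree import DecisionTreeClassifier

# Toy letter-to-sound model: predict each letter's phoneme from a window of
# surrounding letters, in the spirit of CART-based OOV pronunciation.

PAD = "_"

def letter_windows(word, width=3):
    padded = PAD * width + word + PAD * width
    return [padded[i:i + 2 * width + 1] for i in range(len(word))]

# invented aligned lexicon: one phoneme label per letter
lexicon = {"cat": ["K", "AE", "T"], "city": ["S", "IH", "T", "IY"]}

X, y = [], []
for word, phones in lexicon.items():
    for window, phone in zip(letter_windows(word), phones):
        X.append([ord(c) for c in window])   # crude ordinal letter encoding
        y.append(phone)

tree = DecisionTreeClassifier().fit(X, y)
oov = "cit"
print(list(tree.predict([[ord(c) for c in w] for w in letter_windows(oov)])))
```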

Outline
- Introduction & Challenges
- System Architecture & Design
- Automatic Speech Recognition
- Machine Translation
- Speech Synthesis
- Practical Issues
- Demo

Back Translation
- Play the "back translation" to the user, which allows the English speaker to judge the Iraqi output
- If the back translation is still correct → the translation was probably correct
- If the back translation is incorrect → the translation was potentially incorrect as well (repeat/rephrase)
- Very useful for developing the system

Example:
  Spoken sentence:    "Who is the owner of this car?"
  Translation:        تِسْمَحْ تِفتَحِ الشَّنْطة
  Back translation:   "Who does this vehicle belong to?"

But the Users…
- Were confused by the back translation: "is that the same meaning?"
- Interpreted it just as a repetition of their sentence
- Mimicked the non-grammatical output that results from translating twice
- Back translation also underestimates system performance: the translation may be correct and understandable while the back translation loses some information → the user repeats the sentence although it would not have been necessary

Automatic Mode
- An automatic translation mode was offered: completely hands-free translation; the system notices speech activity and translates everything
- But the users…
  - Did not like this loss of control
  - Not everything should be translated, e.g. discussions among the soldiers: "Do you think he is lying?"
  - Definitely preferred the push-to-talk manual mode
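
For reference, the speech-activity detection driving automatic mode could be as simple as the energy-threshold sketch below; frame size, threshold, and hangover are assumptions, not the system's actual segmenter parameters.

```python
import numpy as np

# Sketch of energy-based speech activity detection with a silence hangover.
# samples: 1-D numpy float array; all parameter values are assumptions.

def detect_speech(samples, rate=16000, frame_ms=25, threshold=0.02, hangover=10):
    """Return (start, end) sample ranges judged to contain speech."""
    frame = int(rate * frame_ms / 1000)
    active, segments, start, quiet = False, [], 0, 0
    for i in range(0, len(samples) - frame, frame):
        energy = float(np.sqrt(np.mean(samples[i:i + frame] ** 2)))
        if energy >= threshold:
            if not active:
                active, start = True, i
            quiet = 0
        elif active:
            quiet += 1
            if quiet >= hangover:          # close segment after sustained silence
                segments.append((start, i))
                active = False
    if active:
        segments.append((start, len(samples)))
    return segments
```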

Other Issues
- Some users: "TTS is too fast to understand"
  - Speech synthesizers are designed to speak fluent speech, but the output of an MT system may not be fully grammatical
  - Phrase breaks in the speech could help the listener understand it
- How can language expertise be used efficiently and effectively during rapid development of speech translation components?
  - We had no Iraqi speaker and only one part-time Farsi speaker
  - How do you best use the limited time of the Farsi speaker: check data, translate new data, fix errors, explain errors, use the system...?

Other Issues
- User interface
  - Needs to be as simple as possible
  - Only a short time to train the English speaker
  - No training of the Iraqi/Farsi speaker
- Overheating
  - Outside temperatures during the evaluation reached 95 °F (35 °C)
  - System cooling via added fans is necessary

DEMO: CMU Speech-to-Speech System (photo): laptop secured in a backpack, optional speech control, push-to-talk buttons, close-talking microphone, small powerful speakers.

DEMO