The Use of Speech in Speech-to-Speech Translation Andrew Rosenberg 8/31/06 Weekly Speech Lab Talk.

Candidacy Exam Organization
- Use and Meaning of Intonation
- Automatic Analysis of Intonation
- Applications
  - Speech-to-Speech Translation
  - L2 Learning Systems

The Use of Speech in Speech-to-Speech Translation
- The Use of Prosodic Event Information
  - On the Use of Prosody in a Speech-to-Speech Translator (Strom et al.)
  - A Japanese-to-English Speech Translation System: ATR-MATRIX (Takezawa et al.)
- Cascaded / Loosely Coupled Approaches
  - Janus-III: Speech-to-Speech Translation in Multiple Languages (Lavie et al.)
  - A Unified Approach in Speech Translation: Integrating Features of Speech Recognition and Machine Translation (Zhang et al.)
- Integrated / Tightly Coupled Approaches
  - Finite State Speech-to-Speech Translation (Vidal, 1997)
  - On the Integration of Speech Recognition and Statistical Machine Translation (Matusov, 2005)
  - Coupling vs. Unifying: Modeling Techniques for Speech-to-Speech Translation (Gao, 2003)
[Diagram: cascaded ASR → MT → TTS vs. integrated ASR+MT → TTS]

On the Use of Prosody in a Speech-to-Speech Translator (Strom et al.)
- INTARC: German-English translator produced for the VERBMOBIL project
  - Spontaneous, limited-domain speech (appointment scheduling)
  - 80 minutes of prosodically labeled speech
- Phrase Boundary (PB) Detector
  - Gaussian classifier based on F0, energy, and time features with a 4-syllable window (acc. %)
- Focus Detector
  - Rule-based approach: identifies the location of the steepest F0 decline (acc. 78.5%)
- Syntactic parsing search space is reduced by 65%
  - Baseline syntactic parsing uses:
    - Decoder factor: product of acoustic and bigram scores
    - Grammar factor: grammar-model probability of a parse using the hypothesized word
    - Prosody factor: 4-gram model of prosodic events (focus and PB)
- Semantic parsing search space is reduced by 24.7%
  - The semantic grammar was augmented, labeling rules as "segment-connecting" (SC) and "segment-internal" (SI)
  - SC rules are applied when there is a PB between segments; SI rules are applied when there is not
  - Ideal phrase boundaries reduced the number of hypotheses by 65.4% (analysis trees by 41.9%)
  - Automatically hypothesized PBs required a backoff mechanism to handle errors and PBs not aligned with grammatical phrase boundaries
- Prosodically driven translation is used when deep transfer (translation) fails
  - A focused word determines (probabilistically) a dialog act, which is translated based on the available information from the word chain
  - Correct: 50%, Incomplete: 45%, Incorrect: 5%
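The Gaussian phrase-boundary classifier described above can be sketched as a minimal two-class Gaussian model over prosodic features. This is an illustrative sketch, not the INTARC implementation: the feature set, the diagonal-covariance assumption, and all data values are hypothetical.

```python
import math

def fit_gaussian(samples):
    """Per-dimension (mean, variance) estimates for a list of feature vectors."""
    n = len(samples)
    dims = len(samples[0])
    means = [sum(s[d] for s in samples) / n for d in range(dims)]
    variances = [
        max(sum((s[d] - means[d]) ** 2 for s in samples) / n, 1e-6)
        for d in range(dims)
    ]
    return means, variances

def log_likelihood(x, params):
    """Log-density of x under a diagonal-covariance Gaussian."""
    means, variances = params
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, means, variances)
    )

def is_boundary(x, boundary_params, non_boundary_params):
    """True if x is more likely under the 'phrase boundary' class."""
    return log_likelihood(x, boundary_params) > log_likelihood(x, non_boundary_params)

# Toy training data: [F0 drop, pause duration] near a syllable (hypothetical).
boundary = fit_gaussian([[0.8, 0.30], [0.9, 0.25], [0.7, 0.35]])
non_boundary = fit_gaussian([[0.1, 0.02], [0.2, 0.01], [0.15, 0.03]])
print(is_boundary([0.85, 0.28], boundary, non_boundary))
```

In practice a syllable-window feature vector (as in the paper's 4-syllable window) would stack several such measurements per decision point.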

A Japanese-to-English Speech Translation System: ATR-MATRIX (Takezawa et al.)
- Limited-domain translation system (hotel reservations)
- Cascaded approach
  - ASR: sequential model, ~2k-word vocabulary
  - MT: syntactically driven, ~12k-word vocabulary
  - TTS: CHATR (now unit selection, then concatenative)
- Early example of "interactive" speech-to-speech translation
  - When the system has low confidence in either the recognition or the MT output, it prompts the user for corrections
- Speech information is used in three ways in ATR-MATRIX
  - Voice selection: based on the source voice, either a male or female voice is used for synthesis
  - Hypothesized phrase boundaries: using pause information along with POS n-gram information, the source utterance is divided into "meaningful chunks" for translation
  - Phrase-final behavior: if a phrase-final rise is detected, it is passed to the MT module as a "lexical" item potentially indicating a question
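The cascaded, confidence-driven control flow described above can be sketched as follows. All component names, thresholds, and prompts are hypothetical stand-ins, not ATR-MATRIX's actual modules.

```python
def cascade(audio, asr, mt, tts, threshold=0.5):
    """Run ASR -> MT -> TTS, asking the user to intervene on low confidence."""
    text, asr_conf = asr(audio)
    if asr_conf < threshold:
        return "Could you repeat that?"      # low recognition confidence
    translation, mt_conf = mt(text)
    if mt_conf < threshold:
        return "Could you rephrase that?"    # low translation confidence
    return tts(translation)

# Toy stubs standing in for real ASR/MT/TTS components.
asr = lambda audio: ("heya o yoyaku shitai", 0.9)
mt = lambda text: ("I would like to reserve a room", 0.8)
tts = lambda text: f"<audio: {text}>"
print(cascade("<waveform>", asr, mt, tts))
```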

Janus-III: Speech-to-Speech Translation in Multiple Languages (Lavie et al.)
- Interlingua and frame-slot based Spanish-English translation
  - Limited domain (conference registration), spontaneous speech
- Cascaded approach
- Two semantic parsing techniques
  - GLR* interlingua parsing (transcript 82.9%; ASR 54%)
    - Manually constructed grammar to parse the input into an interlingua
    - Robust: does not require "grammatically correct" input
    - Searches for the maximal subset of the input covered by the grammar
    - Generation is performed by an interlingua generator
  - Phoenix (transcript 76.3%; ASR 48.6%)
    - Identifies key concepts and their structure
    - The parsing grammar contains specific patterns which represent domain concepts
    - The patterns are compiled into a "recursive transition network"
    - Each concept has one or more fixed phrasings in the target language
  - Phoenix is used as a backoff when GLR* fails (transcript: 83.3%; ASR: 63.6%)
- Late-stage disambiguation
  - Multiple translations are processed through the whole system
  - Translation hypothesis selection occurs just before generation, using scores from recognition, parsing, and discourse processing
[Diagram: ASR → MT → TTS pipeline with late-stage disambiguation]
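The GLR*-then-Phoenix backoff can be sketched as a simple parser chain: try the primary parser first and fall back to a pattern matcher when it fails. The parser stubs and grammars below are toy illustrations, not the Janus-III components.

```python
def parse_with_backoff(utterance, parsers):
    """Try each (name, parser) in order; return the first successful parse."""
    for name, parser in parsers:
        result = parser(utterance)
        if result is not None:
            return name, result
    return "none", None

# Toy stand-ins: the "GLR*" stub only handles a fully in-grammar utterance;
# the "Phoenix" stub spots a known domain concept anywhere in the input.
def glr_star(utt):
    return {"act": "register", "who": "speaker"} if utt == "i want to register" else None

def phoenix(utt):
    return {"concept": "register"} if "register" in utt else None

parsers = [("GLR*", glr_star), ("Phoenix", phoenix)]
print(parse_with_backoff("uh i want to register please", parsers))
```

The design choice mirrors the slide: the precise parser yields richer structure when it succeeds, and the robust concept spotter guarantees some output on disfluent input.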

A Unified Approach in Speech Translation: Integrating Features of Speech Recognition and Machine Translation (Zhang et al.)
- Process many hypotheses, then select one
- In a cascaded architecture:
  - HMM-based ASR produces the N best recognition hypotheses
  - IBM Model 4 MT processes all N
  - MT hypotheses are rescored with a weighted log-linear combination of ASR and MT features
  - The feature-weight model is constructed by optimizing a translation distance metric (mWER, mPER, BLEU, NIST)
- Experimental results
  - Corpus: 162k/510/508 Japanese-English parallel sentences
  - Baseline: no optimization of MT feature weights
  - Substantial improvement was obtained by optimizing the feature weights on a distance metric
  - Additional improvement was achieved by including ASR features
  - Translating the N-best ASR hypotheses improved sentence translation accuracy on incorrectly recognized 1-best hypotheses by 7.5%
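The log-linear rescoring step can be sketched as follows. Feature names, values, and weights are illustrative; in the paper the weights would be tuned against a translation metric such as BLEU or mWER rather than set by hand.

```python
def rescore(hypotheses, weights):
    """Return the hypothesis maximizing a weighted sum of log-feature scores."""
    def score(h):
        return sum(weights[k] * v for k, v in h["features"].items())
    return max(hypotheses, key=score)

# Toy N-best list: each hypothesis carries ASR and MT log-scores (hypothetical).
hyps = [
    {"text": "hotel reservation please",
     "features": {"asr_lm": -2.0, "asr_ac": -5.0, "mt": -3.0}},
    {"text": "total reservation please",
     "features": {"asr_lm": -4.0, "asr_ac": -4.5, "mt": -6.0}},
]
weights = {"asr_lm": 1.0, "asr_ac": 0.5, "mt": 1.0}
best = rescore(hyps, weights)
print(best["text"])
```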

Finite-State Speech-to-Speech Translation (Vidal, 1997)
- FSTs can naturally be applied to translation
- FSTs for statistical MT can be learned from parallel corpora (OSTIA)
- Speech input is handled in two ways:
  - Baseline cascaded approach
  - Integrated approach: create an FST on text, then replace each edge with an acoustic model of the lexical item
    - A major drawback of this approach is its large training-data requirement
    - Align the source and target utterances, reducing their "asynchronicity"
    - Cluster lexical items, reducing the vocabulary size
- Proof-of-concept experiment
  - Text: ~30 lexical items used in 16k paired sentences (Spanish-English); greater than 99% translation accuracy
  - Speech: 50k/400 (training/testing) paired utterances, spoken by 4 speakers
    - Best performance: 97.2% translation accuracy, 97.4% recognition accuracy
    - Requires inclusion of source and target 4-gram LMs in FST training
- Travel-domain experiment
  - Text: ~600 lexical items in 169k/2k paired sentences; 0.7% translation WER with categorization, 13.3% WER without
  - Speech: 336 test utterances (~3k words) spoken by 4 speakers
    - The text transducer was used, with edges replaced by concatenations of "phonetic elements" modeled by a continuous HMM
    - 1.9% translation WER and 2.2% recognition WER were obtained
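A word-level translation FST of the kind described above can be sketched minimally: each transition consumes a source word and emits a target word (or nothing). The states and word pairs below are toy examples, not transducers learned by OSTIA.

```python
# Transition table: (state, input word) -> (next state, output word or None).
FST = {
    ("q0", "una"): ("q1", "a"),
    ("q1", "habitacion"): ("q2", "room"),
    ("q2", "doble"): ("q2", "double"),  # self-loop on a modifier (toy)
}

def transduce(words, fst, start="q0"):
    """Run the input through the transducer, collecting emitted words."""
    state, out = start, []
    for w in words:
        state, emission = fst[(state, w)]
        if emission is not None:
            out.append(emission)
    return out

print(transduce(["una", "habitacion"], FST))
```

In the integrated approach each such edge would be expanded with the acoustic model of its input word, so recognition and translation share one search.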

On the Integration of Speech Recognition and Statistical Machine Translation (Matusov et al.)
- Use word lattices weighted by HMM ASR scores as input to a weighted FST for translation
- Noisy-channel model using an alignment model A
  - Instead of modeling the alignment, search for the best alignment
- Evaluation
  - Material: 4 parallel corpora of spontaneous speech in the travel domain
  - 3k-66k paired sentences in Italian-English, Spanish-English, and Spanish-Catalan
  - Vocabulary sizes of 1.7k-15k words
- On all metrics (mWER, mPER, BLEU, NIST), the translation results rank as follows, from best to worst:
  - Correct text
  - Word lattice with acoustic scores
  - Fully integrated ASR and MT (FUB Italian-English only)
  - Word lattice without acoustic scores
  - Single best ASR hypothesis (lower mPER than the lattice without scores on FUB Italian-English)
- Denser ASR lattices yield reduced translation WER (on FUB Italian-English)
[Diagram: noisy-channel decomposition relating the source audio and source length, aligned target words, lexical and acoustic context, the translation model, and the target LM to the best target-language sentence]
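The noisy-channel decision rule sketched above can be written out explicitly. This is the standard formulation, not necessarily the paper's exact notation: x is the acoustic input, f a source-language sentence read off the lattice, e a target-language sentence, and A a word alignment.

```latex
\hat{e} = \arg\max_{e}\; p(e)\, p(x \mid e)
        = \arg\max_{e}\; p(e) \sum_{f} p(x \mid f)\, p(f \mid e)
  \;\approx\; \arg\max_{e}\; \max_{f,\,A}\; p(e)\, p(x \mid f)\, p(f, A \mid e)
```

The final approximation replaces the sums over source sentences and alignments with maximizations, which is what "search for the best alignment" refers to: the lattice supplies the candidate f's with their acoustic scores p(x | f).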

Coupling vs. Unifying: Modeling Techniques for Speech-to-Speech Translation (Gao, 2003)
- Application of direct modeling to ASR, with the goal of directly modeling interlingua text for MT
  - A direct model of target text from source acoustics could also be constructed using this approach
- Composing models (e.g., noisy-channel models) can lead to local or sub-optimal solutions
  - Direct modeling tries to avoid these by creating a single maximum-entropy model p(text | acoustics, ...)
  - Direct modeling can also include other non-independent observations (features)
- Major considerations:
  - To simplify computational complexity, acoustic features are quantized
  - Since the feature vector can get very large, reliable feature selection is necessary
  - In preliminary experiments, 150M features were reduced to 500K via feature selection
[Diagram: graphical models over states s_t, observations o_t, semantic labels L_i, words W_i, phonemes F_j, and subphone observations]
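A conditional maximum-entropy (log-linear) model of the form p(y | x) ∝ exp(Σ_k w_k f_k(x, y)) can be sketched as follows. The labels, indicator features, and weights are toy illustrations, not the paper's quantized acoustic features or its feature-selection machinery.

```python
import math

def maxent_prob(y, x, labels, features, weights):
    """p(y | x) under a log-linear model with indicator feature functions."""
    def score(label):
        return sum(w * f(x, label) for f, w in zip(features, weights))
    z = sum(math.exp(score(label)) for label in labels)  # partition function
    return math.exp(score(y)) / z

# Toy setup: classify an utterance as a question ("yes") or not ("no")
# from coarse, quantized prosodic observations (hypothetical features).
labels = ["yes", "no"]
features = [
    lambda x, y: 1.0 if ("rising_f0" in x and y == "yes") else 0.0,
    lambda x, y: 1.0 if ("falling_f0" in x and y == "no") else 0.0,
]
weights = [2.0, 2.0]
print(maxent_prob("yes", {"rising_f0"}, labels, features, weights))
```

Note how non-independent observations pose no problem here: any number of overlapping feature functions can be added, which is the advantage the slide attributes to direct modeling over composed generative models.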

Thank you.