Tight Coupling between ASR and MT in Speech-to-Speech Translation
Arthur Chan
Prepared for the Advanced Machine Translation Seminar

This Seminar
- Introduction (6 slides)
- Ringger's categorization of coupling between ASR and NLU (7 slides)
- Interfaces in loose coupling:
  - 1-best and N-best (5 slides)
  - Lattices / confusion networks / confidence estimation (12 slides)
  - Results from the literature
- Tight coupling: theory
- Some "as is" ideas on this topic

6 Papers on Tight Coupling in Speech-to-Speech Translation
- H. Ney, "Speech translation: Coupling of recognition and translation," in Proc. ICASSP, 1999.
- Casacuberta et al., "Architectures for speech-to-speech translation using finite-state models," in Proc. Workshop on Speech-to-Speech Translation, 2002.
- E. Matusov, S. Kanthak, and H. Ney, "On the integration of speech recognition and statistical machine translation," in Proc. Interspeech, 2005.
- S. Saleem, S. C. Jou, S. Vogel, and T. Schultz, "Using word lattice information for a tighter coupling in speech translation systems," in Proc. ICSLP, 2004.
- V. H. Quan et al., "Integrated N-best re-ranking for spoken language translation," in Proc. EuroSpeech, 2005.
- N. Bertoldi and M. Federico, "A new decoder for spoken language translation based on confusion networks," in Proc. IEEE ASRU Workshop, 2005.

A Conceptual Model of Speech-to-Speech Translation
[Pipeline diagram: input waveforms → Speech Recognizer → decoding result(s) → Machine Translator → translation → Speech Synthesizer → output waveforms]

Motivation for Tight Coupling between ASR and MT
- The 1-best output of ASR could be wrong.
- MT could benefit from the wide range of supplementary information ASR can provide:
  - N-best list
  - Lattice
  - Sentence- and word-based confidence scores, e.g. word posterior probability
  - Confusion network, or consensus decoding (Mangu 1999)
- MT quality may depend on the WER of ASR (?)

Scope of This Talk
[Same pipeline diagram, zooming in on the ASR→MT interface: should it carry the 1-best? N-best? Lattice? Confusion network?]

Topics Covered Today
- The concept of coupling: "tightness" of coupling between ASR and Technology X (Ringger 95)
- Interfaces between ASR and MT in loose coupling: what could ASR provide?
- Systems with semi-tight coupling
- Very tight coupling: Ney's formulae, Casacuberta's approach
- Some random thoughts on this topic: what is missing in the current research?

Topics Not Covered
- Direct modeling: using features from both ASR and MT jointly; sometimes referred to as "ASR and MT unification"
- Implications of the MT search algorithms for the coupling
- Generation of speech from text (the presenter doesn't know enough)

The Concept of Coupling

Classification of Coupling between ASR and Natural Language Understanding (NLU)
Proposed in Ringger 95, Harper 94. Three dimensions of ASR/NLU coupling:
- Complexity of the search algorithm: simple N-gram?
- Incrementality of the coupling: on-line? left-to-right?
- Tightness of the coupling: tight? loose? semi-tight?

Tightness of Coupling
[Diagram: a spectrum ranging from Loose through Semi-tight to Tight]

Notes
- Semi-tight coupling could appear as:
  - a feedback loop between ASR and Technology X for the whole utterance of speech, or
  - a feedback loop between ASR and Technology X for every frame.
- The Ringger system is a good way to understand how a speech-based system is developed.

Example 1: LM
Suppose someone asserts that ASR has to be used with a 13-gram LM.
- Tight coupling: a search is devised to find the word sequence with the best combined acoustic score and 13-gram likelihood.
- Loose coupling: a simple search first generates some outputs (N-best list, lattice, etc.); the 13-gram is then used to rescore those outputs.
- Semi-tight coupling: (1) a simple search generates results; (2) the 13-gram is applied at word ends only (but the exact history is not stored).
A sketch of the loose-coupling case follows.
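Below is a minimal sketch of the loose-coupling case: rescoring an N-best list with a separately supplied higher-order LM. The function names, the N-best format, and the interpolation weight are illustrative assumptions, not any particular toolkit's API.

```python
# Hypothetical loose coupling: the decoder has already produced an N-best
# list; a separately supplied higher-order LM (the asserted "13-gram")
# rescores it after the fact.

def rescore_nbest(nbest, lm_logprob, lm_weight=0.7):
    """nbest: list of (word_sequence, acoustic_log_score) pairs.
    lm_logprob: callable returning the LM log-probability of a word sequence."""
    rescored = []
    for words, am_score in nbest:
        combined = am_score + lm_weight * lm_logprob(words)
        rescored.append((combined, words))
    rescored.sort(key=lambda pair: pair[0], reverse=True)  # best combined score first
    return rescored[0][1]  # the 1-best after rescoring
```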

Example 2: Higher-Order AM
A segmental model assumes the observation probabilities are not conditionally independent. Suppose someone asserts that a segmental model is better than a plain HMM.
- Tight coupling: search directly for the best word sequence using the segmental model.
- Loose coupling: use the segmental model to rescore.
- Semi-tight coupling: a hybrid HMM/segmental-model algorithm?

Summary of Coupling between ASR and NLU

Implications for ASR/MT Coupling
This categorization generalizes over many systems:
- Loose coupling: any system that uses the 1-best, N-best, lattice, or other inputs for one-way module communication; (Bertoldi 2005), the CMU system (Saleem 2004), (Matusov 2005)
- Tight coupling: (Ney 1999), (Casacuberta 2002)
- Semi-tight coupling: (Quan 2005)

Interfaces in Loose Coupling: 1-best and N-best

Perspectives
- ASR outputs: 1-best results, N-best results, lattice, consensus network, confidence scores
- How does ASR generate these outputs?
- Why are they generated?
- What if there are multiple ASRs (and what if their results are combined)?

Origin of the 1-best
- Decoding in HMM-based ASR = searching for the best path in a huge lattice of HMM states.
- The 1-best ASR result is the best path one can find by backtracking through that state lattice (illustrated on the next slide); a minimal traceback sketch follows.
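As a concrete illustration, here is a schematic Viterbi pass with backpointer storage and final traceback. It is a toy over a generic state trellis, not a real LVCSR decoder; all names and the dense trellis layout are illustrative assumptions.

```python
# Schematic Viterbi decoding: store a backpointer per (frame, state) during
# the forward pass, then trace back from the best final state.

def viterbi_1best(obs_loglik, log_trans, n_states):
    """obs_loglik[t][s]: log-likelihood of frame t in state s.
    log_trans[p][s]: log transition probability from state p to state s."""
    T = len(obs_loglik)
    score = [obs_loglik[0][s] for s in range(n_states)]
    backptr = [[0] * n_states for _ in range(T)]
    for t in range(1, T):
        new_score = []
        for s in range(n_states):
            best_prev = max(range(n_states),
                            key=lambda p: score[p] + log_trans[p][s])
            backptr[t][s] = best_prev
            new_score.append(score[best_prev] + log_trans[best_prev][s]
                             + obs_loglik[t][s])
        score = new_score
    # Traceback: follow backpointers from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for t in range(T - 1, 0, -1):
        state = backptr[t][state]
        path.append(state)
    return list(reversed(path))
```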

Note on the 1-best
- Most of the time, the "1-best" is a word sequence, not a state sequence. Why? In LVCSR, storing a backtracking pointer table for the full state sequence takes a lot of memory, even nowadays (compare this with the number of frames of scores one needs to store).
- Usually a backtrack pointer stores the previous word before the current word.
- Clever structures dynamically allocate the backtracking pointer table.

What Is an N-best List?
Traceback not only from the 1st-best path, but also from the 2nd-best, 3rd-best, etc. Pathways:
- Directly from the search backtrack pointer table:
  - Exact N-best algorithm (Chow 90)
  - Word-pair N-best algorithm (Chow 91)
  - A* search using the Viterbi score as heuristic (Chow 92)
- Generate a lattice first, then generate the N-best from the lattice (see the sketch below).
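A minimal sketch of the second pathway: enumerating complete paths through a word lattice best-first. The adjacency-list lattice format is an assumption, and the sketch omits the A* heuristic (it is plain best-first enumeration, which is correct as long as link log scores are ≤ 0, i.e. log probabilities).

```python
import heapq
import itertools

def nbest_from_lattice(lattice, start, end, n):
    """lattice: {node: [(next_node, word, log_score), ...]} -- an assumed format.
    Returns up to n (log_score, word_sequence) pairs, best first."""
    tie = itertools.count()  # tie-breaker so the heap never compares word lists
    heap = [(0.0, next(tie), start, [])]
    results = []
    while heap and len(results) < n:
        neg_score, _, node, words = heapq.heappop(heap)
        if node == end:                        # a complete path popped in best-first
            results.append((-neg_score, words))  # order is the next-best hypothesis
            continue
        for nxt, word, log_score in lattice.get(node, []):
            heapq.heappush(heap, (neg_score - log_score, next(tie),
                                  nxt, words + [word]))
    return results
```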

Interfaces in Loose Coupling: Lattice, Consensus Network and Confidence Estimation

What Is a Lattice?
- A compact representation of the state lattice: only word nodes (or links) are involved.
- Difference between N-best and lattice: a lattice can be a compact representation of an N-best list.

How Is a Lattice Generated?
- From the decoding backtracking pointer table: record only the links between word nodes.
- From the N-best list: the lattice becomes a compact representation of the N-best (sometimes spurious links are introduced).

How Is a Lattice Generated When There Are Phone Contexts at Word Ends?
- Very complicated when phonetic context is involved: not only the word end but also the phone contexts need to be stored.
- The lattice then carries the word identity as well as the contexts, so it can become very large.

How Is This Resolved?
- Some use only approximate triphones to generate the lattice in the first stage (BBN).
- Some generate the lattice with full CD phones but convert it back to a context-free lattice (RWTH).
- Some use the lattice with full CD-phone contexts directly (RWTH).

What Do ASR Folks Do When the Lattice Is Still Too Large?
- Use some criterion to prune the lattice. Example criteria:
  - word posterior probability (see the sketch below)
  - application of another LM or AM, then filtering
  - a general confidence score
- Or generate an even more compact representation than the lattice, e.g. a consensus network.
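A minimal sketch of posterior-based pruning, assuming the lattice is given as topologically ordered links with log scores; the representation and the threshold value are illustrative assumptions, not a standard toolkit API.

```python
import math

def logadd(a, b):
    """log(exp(a) + exp(b)), computed stably."""
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def prune_by_posterior(links, start, end, threshold=1e-4):
    """links: list of (src, dst, log_score) in topological order (assumed format)."""
    fwd = {start: 0.0}                       # forward log scores into each node
    for src, dst, s in links:
        if src in fwd:
            x = fwd[src] + s
            fwd[dst] = x if dst not in fwd else logadd(fwd[dst], x)
    bwd = {end: 0.0}                         # backward log scores out of each node
    for src, dst, s in reversed(links):
        if dst in bwd:
            x = bwd[dst] + s
            bwd[src] = x if src not in bwd else logadd(bwd[src], x)
    total = fwd[end]                         # total log score over all paths
    kept = []
    for src, dst, s in links:
        if src in fwd and dst in bwd:
            link_posterior = math.exp(fwd[src] + s + bwd[dst] - total)
            if link_posterior >= threshold:  # keep only sufficiently likely links
                kept.append((src, dst, s))
    return kept
```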

Conclusions on Lattices
- Lattice generation itself can be a complicated issue.
- Sometimes what the post-processing stage (e.g. MT) gets is a pre-filtered, pre-processed result.

Confusion Network and Consensus Hypothesis
Confusion network: also called a "sausage network" or a "consensus network".

Special Properties (?)
- More "local" than a lattice: one can apply simple criteria to find the best results, e.g. "consensus decoding" applies word posterior probabilities to the confusion network.
- More tractable in terms of size.
- Found to be useful in ?

How to Generate a Consensus Network?
From the lattice. Summary of Mangu's algorithm:
- intra-word clustering
- inter-word clustering
Once the network is built, consensus decoding reduces to a per-slot choice (see the sketch below).
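A minimal sketch of the decoding step on an already-built confusion network: per slot, pick the highest-posterior word. The slot format and the epsilon marker are illustrative assumptions; Mangu's clustering itself is not shown.

```python
def consensus_decode(confusion_network):
    """confusion_network: list of slots, each a dict word -> posterior
    ('*DELETE*' marks the epsilon/skip hypothesis) -- an assumed format."""
    hyp = []
    for slot in confusion_network:
        best_word = max(slot, key=slot.get)  # highest-posterior word in this slot
        if best_word != '*DELETE*':          # if epsilon wins, output nothing here
            hyp.append(best_word)
    return hyp

# Example: two slots; the epsilon hypothesis wins the second slot.
print(consensus_decode([{'hello': 0.7, 'yellow': 0.3},
                        {'*DELETE*': 0.6, 'there': 0.4}]))  # -> ['hello']
```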

Notes on Consensus Networks
- Time information might not be preserved in a confusion network.
- The similarity function directly affects the final output of the consensus network.

Other Ways to Generate a Confusion Network
From the N-best list, using ROVER: a mixture of voting and adding word confidences.

Confidence Measures
Anything other than the likelihood that could tell whether the answer is useful. Examples:
- Word posterior probability P(W|A), usually computed using lattices (see below)
- Language model backoff mode
- Other posterior probabilities (frame-level, sentence-level)
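For concreteness, the standard lattice-based definition of the word posterior (the notation here is mine, not the slide's): the posterior of word w given the acoustics A sums the scores of all lattice paths containing w and normalizes by the total score of all paths; in practice both sums are computed by forward-backward over the lattice, as in the pruning sketch above.

```latex
% W ranges over the word sequences encoded as paths in the lattice.
P(w \mid A) \;=\;
  \frac{\displaystyle \sum_{W \,\ni\, w} P(A \mid W)\, P(W)}
       {\displaystyle \sum_{W} P(A \mid W)\, P(W)}
```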

Interfaces in Loose Coupling: Results from the Literature

Tight Coupling

Some “As Is” Ideas on This Topic

Literature
- Eric K. Ringger, "A Robust Loose Coupling for Speech Recognition and Natural Language Understanding," Technical Report 592, Computer Science Department, University of Rochester, 1995.
- [The AT&T paper]