Coupling between ASR and MT in Speech-to-Speech Translation


Coupling between ASR and MT in Speech-to-Speech Translation Arthur Chan Prepared for Advanced Machine Translation Seminar

This Seminar
- Introduction (6 slides)
- Ringger's categorization of coupling between ASR and NLU (7 slides)
- Interfaces in loose coupling
  - 1-best and N-best (5 slides)
  - Lattices, confusion networks, and confidence estimation (12 slides)
  - Results from the literature
- Tight coupling
  - Ney's theory and 2 methods of implementation (14 slides). Sorry, without FST approaches.
- Some "as is" ideas on this topic

History of this presentation
- V1: Draft finished on March 1.
- Tanja's comments: Direct modeling could be skipped. We could focus on more ASR-related issues. Issues in MT search could be ignored.

History of this presentation (cont.)
- V2 to V4: Followed Tanja's comments; finished on March 19.
- Reviewer's comments: Ney's search formulation is too difficult to follow. The FST-based tight coupling method is important, so we should cover it.
- V5: Reviewed another 5 papers solely on the issue of FST-based tight coupling (true coupling).

6 papers on Coupling of Speech-to-Speech Translation
- H. Ney, "Speech translation: Coupling of recognition and translation," in Proc. ICASSP, 1999.
- Casacuberta et al., "Architectures for speech-to-speech translation using finite-state models," in Proc. Workshop on Speech-to-Speech Translation, 2002.
- E. Matusov, S. Kanthak, and H. Ney, "On the integration of speech recognition and statistical machine translation," in Proc. InterSpeech, 2005.
- S. Saleem, S. C. Jou, S. Vogel, and T. Schultz, "Using word lattice information for a tighter coupling in speech translation systems," in Proc. ICSLP, 2004.
- V. H. Quan et al., "Integrated N-best re-ranking for spoken language translation," in EuroSpeech, 2005.
- N. Bertoldi and M. Federico, "A new decoder for spoken language translation based on confusion networks," in IEEE ASRU Workshop, 2005.

A Conceptual Model of Speech-to-Speech Translation
[Diagram: waveforms -> Speech Recognizer -> decoding result(s) -> Machine Translator -> translation -> Speech Synthesizer -> waveforms]

Motivation of Tight Coupling between ASR and MT
- The 1-best output of ASR could be wrong.
- MT could benefit from the wide range of supplementary information provided by ASR:
  - N-best list
  - Lattice
  - Sentence-based or word-based confidence scores, e.g. word posterior probability
  - Confusion network, or consensus decoding (Mangu 1999)
- MT quality may depend on the WER of ASR (?)

Scope of this talk
[Diagram: waveforms -> Speech Recognizer -> (1-best? N-best? Lattice? Confusion network?) -> Machine Translator -> translation -> Speech Synthesizer -> waveforms]

Topics Covered Today
- The concept of coupling: "tightness" of coupling between ASR and Technology X (Ringger 95).
- Two questions:
  - What could ASR provide in loose coupling? Discussion of the interfaces between ASR and MT in loose coupling.
  - What is the status of tight coupling? Ney's formulation.

Topics not covered
- Direct modeling: uses features from both ASR and MT; some refer to it as "ASR and MT unification".
- Implications of the MT search algorithms for the coupling.
- Generation of speech from text: the presenter doesn't know enough.

The Concept of Coupling

Classification of Coupling of ASR and Natural Language Understanding (NLU)
Proposed in Ringger 95 and Harper 94. Three dimensions of ASR/NLU coupling:
- Complexity of the search algorithm: a simple N-gram?
- Incrementality of the coupling: on-line? left-to-right?
- Tightness of the coupling: tight? loose? semi-tight?

Tightness of Coupling Tight Semi-Tight Loose

Notes
- Semi-tight coupling could appear as a feedback loop between ASR and Technology X for the whole utterance of speech, or as a feedback loop between ASR and Technology X for every frame.
- The Ringger system: a good way to understand how a speech-based system is developed.

Example 1: LM
Suppose someone asserts that ASR has to be used with 13-grams.
- In tight coupling: a search is devised to find the word sequence with the best combined acoustic score and 13-gram likelihood.
- In loose coupling: a simple search is used to generate some outputs (N-best list, lattice, etc.), and the 13-gram is then used to rescore those outputs.
- In semi-tight coupling: (1) a simple search is used to generate results; (2) the 13-gram is applied at word ends only (but the exact history is not stored).
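The loose-coupling case of this example can be sketched in a few lines. This is a minimal illustration, not any particular system's implementation: the function name `rescore_nbest`, the toy LM table, and all scores are assumptions made up for the sketch, with `toy_lm_logprob` standing in for the expensive 13-gram.

```python
import math

def rescore_nbest(nbest, lm_logprob, lm_weight=1.0):
    """Second-pass rescoring: combine first-pass acoustic scores with a new LM.

    nbest: list of (words, acoustic_logprob) pairs from a simple first-pass search.
    lm_logprob: callable mapping a word list to a log-probability.
    Returns (combined_score, words) pairs, best first.
    """
    scored = [(am + lm_weight * lm_logprob(words), words) for words, am in nbest]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored

# Hypothetical LM probabilities, for illustration only.
TOY_LM = {
    ("recognize", "speech"): math.log(0.6),
    ("wreck", "a", "nice", "beach"): math.log(0.1),
}

def toy_lm_logprob(words):
    # Unseen word sequences get a small floor probability.
    return TOY_LM.get(tuple(words), math.log(1e-6))

nbest = [
    (["wreck", "a", "nice", "beach"], -10.0),  # first-pass 1-best
    (["recognize", "speech"], -10.5),
]
best_score, best_words = rescore_nbest(nbest, toy_lm_logprob)[0]
print(best_words)
```

The first-pass search never sees the expensive model; it only has to keep the right answer somewhere in its output, which is the practical appeal of loose coupling.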

Example 2: Higher-order AM
A segmental model assumes that observation probabilities are not conditionally independent. Suppose someone asserts that a segmental model is better than a plain HMM.
- Tight coupling: direct search for the best word sequence using the segmental model.
- Loose coupling: use the segmental model to rescore.
- Semi-tight coupling: a hybrid HMM-segmental model algorithm?

Summary of Coupling between ASR and NLU

Implication on ASR/MT coupling
The classification generalizes to many systems:
- Loose coupling: any system which uses 1-best, N-best, lattice, or other inputs for one-way module communication: (Bertoldi 2005), the CMU system (Saleem 2004).
- Tight coupling: (Ney 1999), (Matusov 2005), (Casacuberta 2002).
- Semi-tight coupling: (Quan 2005).

Interfaces in Loose Coupling: 1-best and N-best

Perspectives
- ASR outputs: 1-best results, N-best results, lattice, consensus network, confidence scores.
- How does ASR generate these outputs?
- Why are they generated?
- What if there are multiple ASRs (and what if their results are combined)?
Note: we are talking about the state lattice now, not the word lattice.

Origin of the 1-best: decoding in HMM-based ASR
- The 1-best ASR result comes from searching for the best path in a huge HMM-state lattice.
- It is the best path one could find from backtracking.
- State lattice in ASR (next page).
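A minimal sketch of this search on a toy HMM may help. It is a textbook Viterbi pass with a backpointer table, assuming a tiny two-state model; the state names, observation symbols, and probabilities are invented for illustration, and a real LVCSR decoder adds pruning and word-level bookkeeping on top.

```python
import math

def viterbi(obs, states, log_init, log_trans, log_emit):
    """Best state path through the HMM state lattice, recovered by backtracking."""
    # score[t][s]: best log score of any path that ends in state s at frame t.
    score = [{s: log_init[s] + log_emit[s][obs[0]] for s in states}]
    backptr = [{}]
    for t in range(1, len(obs)):
        score.append({})
        backptr.append({})
        for s in states:
            prev = max(states, key=lambda p: score[t - 1][p] + log_trans[p][s])
            score[t][s] = score[t - 1][prev] + log_trans[prev][s] + log_emit[s][obs[t]]
            backptr[t][s] = prev
    # Backtrack from the best final state.
    last = max(states, key=lambda s: score[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(backptr[t][path[-1]])
    path.reverse()
    return path, score[-1][last]

# Toy two-state model (all numbers are made up).
states = ["a", "b"]
log_init = {"a": math.log(0.5), "b": math.log(0.5)}
log_trans = {
    "a": {"a": math.log(0.9), "b": math.log(0.1)},
    "b": {"a": math.log(0.1), "b": math.log(0.9)},
}
log_emit = {
    "a": {"x": math.log(0.9), "y": math.log(0.1)},
    "b": {"x": math.log(0.1), "y": math.log(0.9)},
}
path, log_score = viterbi(["x", "x", "y", "y"], states, log_init, log_trans, log_emit)
print(path)
```

Note that the backpointer table here has one entry per frame and state, which is exactly the memory cost that makes storing full state sequences unattractive at large vocabulary sizes.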

Note on 1-best in ASR
- Most of the time, 1-best means the 1-best word sequence. Why?
- In LVCSR, storing the backtracking pointer table for the state sequence takes a lot of memory, even nowadays. [Compare this with the number of frames of scores that need to be stored.]
- Usually a backtrack pointer stores the previous words before the current word.
- Clever structures dynamically allocate the backtracking pointer table.

What is an N-best list?
Trace back not only from the 1st best, but also from the 2nd best, 3rd best, etc.
Pathways:
- Directly from the search backtrack pointer table:
  - Exact N-best algorithm (Chow 90)
  - Word-pair N-best algorithm (Chow 91)
  - A* search using the Viterbi score as heuristic (Chow 92)
- Generate a lattice first, then generate the N-best list from the lattice.
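The last pathway, lattice first and N-best second, can be sketched with a best-first enumeration. This is not one of the cited algorithms; it is a plain uniform-cost search, assuming an acyclic word lattice whose arc costs are non-negative negative log-probabilities. The function name and the toy lattice are hypothetical.

```python
import heapq

def nbest_from_lattice(lattice, start, end, n):
    """Enumerate the n lowest-cost word sequences through a word lattice.

    lattice: dict mapping node -> list of (next_node, word, cost) arcs.
    Costs must be non-negative so that complete paths pop in cost order.
    """
    heap = [(0.0, start, ())]  # (cost so far, current node, words so far)
    results = []
    while heap and len(results) < n:
        cost, node, words = heapq.heappop(heap)
        if node == end:
            results.append((cost, list(words)))
            continue
        for nxt, word, arc_cost in lattice.get(node, []):
            heapq.heappush(heap, (cost + arc_cost, nxt, words + (word,)))
    return results

# Toy lattice with two word slots (costs are made up).
lattice = {
    0: [(1, "I", 0.1), (1, "eye", 1.2)],
    1: [(2, "see", 0.2), (2, "sea", 0.9)],
}
top3 = nbest_from_lattice(lattice, start=0, end=2, n=3)
for cost, words in top3:
    print(round(cost, 2), " ".join(words))
```

If the lattice holds several paths with the same word sequence (for example, different time alignments), this sketch would list them separately; a real N-best generator would merge them.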

Interfaces in Loose Coupling: Lattice, Consensus Network and Confidence Estimation

What is a Lattice?
- A word-based lattice is a compact representation of the state lattice; only word nodes (or links) are involved.
- Difference between N-best and lattice: a lattice could be a compact representation of an N-best list.

How is a lattice generated?
- From the decoding backtracking pointer table: only record all the links between word nodes.
- From an N-best list: the lattice becomes a compact representation of the N-best list. [Sometimes spurious links will be introduced.]

How is a lattice generated when there are phone contexts at the word end?
- Very complicated when phonetic context is involved.
- Not only the word end needs to be stored but also the phone contexts.
- The lattice carries the word identity as well as the contexts.
- The lattice can become very large.

How is this resolved?
- Some use only approximate triphones to generate the lattice in the first stage (BBN).
- Some generate the lattice with full CD phones but convert it back to a no-context lattice (RWTH).
- Some use the lattice with full CD-phone contexts (RWTH).

What do ASR folks do when the lattice is still too large?
- Use some criteria to prune the lattice. Example criteria:
  - Word posterior probability
  - Application of another LM or AM, then filtering
  - General confidence score
  - Maximum lattice density (number of words in the lattice / number of words in the utterance)
- Or generate an even more compact representation than a lattice, e.g. a consensus network.

Conclusions on lattices
- Lattice generation itself can be a complicated issue.
- Sometimes what the post-processing stage (e.g. MT) gets is a pre-filtered, pre-processed result.

Confusion Network and Consensus Hypothesis
Also called a "sausage network" or a "consensus network".

Special Properties (?)
- More "local" than a lattice.
- One can apply simple criteria to find the best results, e.g. "consensus decoding" applies word posterior probability on the confusion network.
- More tractable in terms of size.
- Found to be useful in ?
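The "simple criteria" point can be illustrated with a short sketch. It assumes the confusion network is already built and is represented as a list of slots, each a dict from word to posterior; the `<eps>` skip token and the posteriors below are assumptions for the example.

```python
def consensus_hypothesis(confusion_network, skip_token="<eps>"):
    """Pick the highest-posterior word in each slot of a confusion network.

    confusion_network: list of dicts mapping word -> posterior probability.
    Slots whose best entry is the skip token contribute no word.
    """
    best = []
    for slot in confusion_network:
        word = max(slot, key=slot.get)
        if word != skip_token:
            best.append(word)
    return best

# Toy confusion network with three slots (posteriors are made up).
network = [
    {"I": 0.9, "eye": 0.1},
    {"<eps>": 0.6, "do": 0.4},
    {"see": 0.55, "sea": 0.45},
]
print(consensus_hypothesis(network))
```

Because each slot is decided independently, decoding is linear in the number of slots, which is the locality that a general lattice does not offer.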

How to generate a consensus network?
- From the lattice.
- Summary of Mangu's algorithm: intra-word clustering, then inter-word clustering.

Notes on Consensus Network
- Time information might not be preserved in the confusion network.
- The similarity function directly affects the final output of the consensus network.

Other ways to generate a confusion network
- From the N-best list, using ROVER: a mixture of voting and adding word confidences.

Confidence Measure
Anything other than likelihood which could tell whether the answer is useful, e.g.:
- Word posterior probability P(W|A), usually computed using lattices
- Language model backoff mode
- Other posterior probabilities (frame, sentence)
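As a rough sketch of how a word posterior is computed on a lattice, the forward-backward pass below sums path scores through each arc and divides by the total. It assumes integer node ids in topological order and plain (not log) arc scores to keep the toy readable; real systems work in the log domain with scaled acoustic and LM scores. The function name and the numbers are hypothetical.

```python
from collections import defaultdict

def arc_posteriors(arcs, start, end):
    """Posterior of each lattice arc: (mass of paths through arc) / (total mass).

    arcs: list of (from_node, to_node, word, score) with from_node < to_node,
    i.e. integer node ids are assumed to be in topological order.
    """
    forward = defaultdict(float)
    backward = defaultdict(float)
    forward[start] = 1.0
    backward[end] = 1.0
    # Forward pass: arcs sorted by source node, so forward[u] is final when used.
    for u, v, _, p in sorted(arcs, key=lambda a: a[0]):
        forward[v] += forward[u] * p
    # Backward pass: arcs sorted by target node, descending.
    for u, v, _, p in sorted(arcs, key=lambda a: a[1], reverse=True):
        backward[u] += p * backward[v]
    total = forward[end]
    return [(word, forward[u] * p * backward[v] / total) for u, v, word, p in arcs]

# Toy lattice (scores are made up and need not sum to one).
arcs = [
    (0, 1, "I", 0.6), (0, 1, "eye", 0.2),
    (1, 2, "see", 0.3), (1, 2, "sea", 0.1),
]
posteriors = dict(arc_posteriors(arcs, start=0, end=2))
print(posteriors)
```

A threshold on these posteriors is one way to implement the lattice pruning criterion mentioned earlier in the talk.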

Interfaces in Loose Coupling: Results from the Literature

General remarks
- Coupling in SST is still pretty new.
- Papers are chosen according to whether certain ASR outputs have been used.
- Other techniques, such as direct modeling, might be mixed into the papers.

N-best list (Quan 2005)
- Uses the N-best list for reranking; the interpolation weights of the AM and TM are then optimized.
- Summary: reranking gives improvements.

Lattices: CMU results (Saleem 2004)
Summary of results:
- Lattice word error rate improves as lattice density increases.
- Lattice density and the weight on acoustic scores turn out to be important parameters to tune; values that are too large or too small could hurt.

LWER against Lattice Density

Modified Bleu scores against lattice density

Optimal density and score weight based on Utterance Length.

Consensus Network
- Bertoldi 2005 is probably the only work on a confusion-network-based method.
- Summary of results: when direct modeling is applied, the consensus network doesn't beat the N-best method.
- The authors argue for the speed and simplicity of the algorithm.

Confidence: Does it help?
- According to Zhang 2006, yes.
- Confidence measure (CM) filtering is used to filter out unnecessary results in the N-best list.
- Note: the approaches used are quite different.

Conclusion on Loose Coupling
- SR could give a rich set of outputs.
- It is still unknown what type of output should be used in the pipeline.
- Currently, comprehensive experimental studies on which method is best seem to be lacking.
- Usage of confusion networks and confidence estimation seems to be under-explored.

Tight Coupling : Theory and Practice

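Theory (Ney 1999)
Here x is the acoustic input, f the source-language word sequence, and e the target-language word sequence.
\begin{align*}
\hat{e} &= \arg\max_e \Pr(e \mid x) \\
&= \arg\max_e \Pr(e)\,\Pr(x \mid e) && \text{(Bayes' rule)} \\
&= \arg\max_e \Pr(e) \sum_f \Pr(f, x \mid e) && \text{(introduce $f$ as a hidden variable)} \\
&= \arg\max_e \Pr(e) \sum_f \Pr(f \mid e)\,\Pr(x \mid f, e) && \text{(Bayes' rule)} \\
&= \arg\max_e \Pr(e) \sum_f \Pr(f \mid e)\,\Pr(x \mid f) && \text{(assume $x$ doesn't depend on the target language given $f$)} \\
&\approx \arg\max_e \Pr(e) \max_f \Pr(f \mid e)\,\Pr(x \mid f) && \text{(sum to max)}
\end{align*}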

Layman point of view
Three factors:
- Pr(e): target language model
- Pr(f|e): translation model
- Pr(x|f): acoustic model
Note: the assumption has been made that only the best-matching f for e is used.
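To make the interplay of the three factors concrete, here is a brute-force sketch of the decision rule with the max approximation, choosing the e that maximizes Pr(e) times the best Pr(f|e) Pr(x|f). It enumerates tiny hand-written candidate sets, which is only feasible for a toy; the function name, the candidate sentences, and every probability are invented for illustration.

```python
def tight_coupling_decode(candidates_e, candidates_f, p_e, p_f_given_e, p_x_given_f):
    """Brute-force argmax_e Pr(e) * max_f Pr(f|e) * Pr(x|f) over toy tables.

    p_f_given_e is keyed by (f, e); missing pairs count as probability 0.
    Returns (score, e) for the best target sentence.
    """
    best = None
    for e in candidates_e:
        # Max approximation: keep only the best-matching source sentence f.
        inner = max(p_f_given_e.get((f, e), 0.0) * p_x_given_f[f] for f in candidates_f)
        score = p_e[e] * inner
        if best is None or score > best[0]:
            best = (score, e)
    return best

# Toy tables (all values are made up). The acoustics slightly prefer the wrong f.
p_x_given_f = {"I see": 0.4, "eye sea": 0.6}
p_f_given_e = {
    ("I see", "ich sehe"): 0.9, ("eye sea", "ich sehe"): 0.05,
    ("I see", "Auge Meer"): 0.05, ("eye sea", "Auge Meer"): 0.9,
}
p_e = {"ich sehe": 0.7, "Auge Meer": 0.01}

score, best_e = tight_coupling_decode(
    ["ich sehe", "Auge Meer"], ["I see", "eye sea"], p_e, p_f_given_e, p_x_given_f
)
print(best_e)
```

In this toy the target language model overrides the acoustic preference for the wrong source sentence, which is the behavior tight coupling hopes to obtain.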

Comparison with SR
- In SR: Pr(f) is the source language model.
- In tight coupling: Pr(f|e) and Pr(e), the translation model and the target language model, take its place.

Algorithmic Point of View
- Brute-force method: instead of incorporating the LM into the standard Viterbi algorithm, incorporate P(e) and P(f|e) => very complicated.
- The backup slides in this presentation have details about Ney's implementations.

Experimental Results in Matusov, Kanthak and Ney 2005
Summary of the results:
- Translation quality is improved by tight coupling only when the lattice density is not high.
- As in Saleem 2004, incorporating acoustic scores helps.

Conclusion: Possible Issues of Tight Coupling
Possibilities:
- In SR, the source n-gram LM is very close to the best configuration.
- The complexity of the algorithm is too high; approximation is still necessary to make it work.
- When the tight-coupling criterion is used, it is possible that the LM and the TM need to be jointly estimated.
- The current approaches still haven't really implemented tight coupling.
- There might be bugs in the programs.

Conclusion
Two major issues in coupling of SST were discussed:
- In loose coupling: consensus networks and confidence scoring are still not fully utilized.
- In tight coupling: the approach seems to be haunted by the very high complexity of search algorithm construction.

Discussion

The End. Thanks.

Literature
- R. Zhang and G. Kikui, "Integration of speech recognition and machine translation: Speech recognition word lattice translation," Speech Communication, vol. 48, issues 3-4, 2006.
- H. Ney, "Speech translation: Coupling of recognition and translation," in Proc. ICASSP, 1999.
- E. Matusov, S. Kanthak, and H. Ney, "On the integration of speech recognition and statistical machine translation," in Proc. InterSpeech, 2005.
- S. Saleem, S. C. Jou, S. Vogel, and T. Schultz, "Using word lattice information for a tighter coupling in speech translation systems," in Proc. ICSLP, 2004.
- V. H. Quan et al., "Integrated N-best re-ranking for spoken language translation," in EuroSpeech, 2005.
- N. Bertoldi and M. Federico, "A new decoder for spoken language translation based on confusion networks," in IEEE ASRU Workshop, 2005.
- L. Mangu, E. Brill, and A. Stolcke, "Finding consensus in speech recognition: word error minimization and other applications of confusion networks," Computer Speech and Language, 14(4), 373-400, 2000.
- E. Ringger, "A Robust Loose Coupling for Speech Recognition and Natural Language Understanding," 1995.

Backup Slides

Ney 99’s Formulation of SST’s Search.

Assumptions in Modeling
- Alignment models (HMM).
- Acoustic modeling: the speech recognizer will produce a word graph. Each link with a word hypothesis covers its portion of the acoustic scores. (The notation is confusing in the paper.)

Lexicon Modeling
- Further assumption beyond the standard IBM* models: the target word is assumed to be dependent on the previous word.
- So, in fact, a source LM is actually there.

First Implementation: Local Average Assumption
P(x|e) is used to capture the local characteristics of the acoustics.

Justification of Using the Local Average Assumption
- Rephrased from the author (p. 3, para. 2): lexicon modeling and language modeling will cause f_{j-1}, f_{j}, f_{j+1} to appear in the math. In other words, it is too complicated to carry out.
- Computational advantage: the local score could be obtained just from the word graph, before translation => the full translation strategy could still be carried out.

Computation of P(x|e)
- Makes use of the best source sequence.
- Also refer to Wessel 98, a commonly used word posterior probability algorithm for lattices; a forward-backward-like procedure is used.

Second Method: Monotone Alignment Assumption - Network

Monotone Alignment Assumption – Formula for Text Input
A closed-form solution exists from DP: O(JE^2)
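The O(JE^2) cost can be seen in a deliberately simplified sketch. This is not Ney's exact recursion: it assumes every source position j produces exactly one target word, a bigram target LM, and a word-level lexicon p(f|e), so the DP reduces to a Viterbi pass with one state per target vocabulary word. All names and probabilities are hypothetical.

```python
import math

def monotone_translate(source, vocab_e, log_lex, log_bigram, bos="<s>"):
    """Simplified monotone DP: Q(j, e) = log p(f_j|e) + max_e' [Q(j-1, e') + log p(e|e')].

    log_lex[(f, e)]: log p(f | e); log_bigram[(prev_e, e)]: log p(e | prev_e).
    Missing entries count as probability 0. Cost: J positions x E words x E predecessors.
    """
    neg_inf = float("-inf")
    q = {e: log_bigram.get((bos, e), neg_inf) + log_lex.get((source[0], e), neg_inf)
         for e in vocab_e}
    back = []
    for f in source[1:]:
        new_q, ptr = {}, {}
        for e in vocab_e:
            prev = max(vocab_e, key=lambda p: q[p] + log_bigram.get((p, e), neg_inf))
            new_q[e] = q[prev] + log_bigram.get((prev, e), neg_inf) + log_lex.get((f, e), neg_inf)
            ptr[e] = prev
        q = new_q
        back.append(ptr)
    # Backtrack from the best final target word.
    last = max(vocab_e, key=lambda e: q[e])
    out = [last]
    for ptr in reversed(back):
        out.append(ptr[out[-1]])
    out.reverse()
    return out

# Toy models (all values are made up).
log_lex = {("I", "ich"): math.log(0.9),
           ("see", "sehe"): math.log(0.5), ("see", "siehst"): math.log(0.5)}
log_bigram = {("<s>", "ich"): math.log(0.8),
              ("ich", "sehe"): math.log(0.6), ("ich", "siehst"): math.log(0.01)}
translation = monotone_translate(["I", "see"], ["ich", "sehe", "siehst"], log_lex, log_bigram)
print(translation)
```

For speech input the source words themselves become hypotheses from the recognizer, so the recursion must also range over source word pairs, which is consistent with the extra F^2 factor quoted on the next slide.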

Monotone Alignment Assumption – Formula for Speech Input DP: O(JE^2F^2)

How to make the monotone assumption work?
- Words need to be reordered as part of the search strategy.
- Is the acoustic model assumption used? I.e., are we talking about a word lattice or still a state lattice? Don't know; it seems we are actually talking about a word lattice. Supported by Matusov 2005.