
The Use of Virtual Hypothesis Copies in Decoding of Large-Vocabulary Continuous Speech. Frank Seide, IEEE Transactions on Speech and Audio Processing, 2005. Presented by Shih-Hung, 2005/09/29.

Outline
–Introduction
–Review of (M+1)-gram Viterbi decoding with a reentrant tree
–Virtual hypothesis copies on the word level
–Virtual hypothesis copies on the sub-word level
–Virtual hypothesis copies for long-range acoustic lookahead (optional)
–Experimental results
–Conclusion


Introduction
For decoding of LVCSR, the most widely used algorithm is a time-synchronous Viterbi decoder that uses a tree-organized pronunciation lexicon with word-conditioned tree copies. The search space is organized as a reentrant network, a composition of the state-level network (lexical tree) and the linguistic (M+1)-gram network:
–i.e., a distinct instance ("copy") of each HMM state in the lexical tree is needed for every linguistic state (M-word history).
In practice, this copying is done on demand, in conjunction with beam pruning.
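To make this organization concrete, here is a minimal Python sketch (illustrative only; the class, field names, and toy state representation are assumptions, not the paper's implementation) of keeping one lexical-tree copy per M-word history, creating copies on demand, and pruning all copies against a joint beam:

```python
from collections import defaultdict


class TreeCopyDecoder:
    """Toy bookkeeping for word-conditioned tree copies.

    Tree states are plain integers (0 = tree root); a linguistic state is the
    tuple of the last M words.  score[history][state] is the best log
    probability of any active path inside the tree copy for that history.
    """

    def __init__(self, beam=10.0):
        self.beam = beam
        self.score = defaultdict(dict)  # history -> {tree_state: log_prob}

    def enter_root(self, history, log_prob):
        # Copies exist only on demand: a history gets a tree copy the first
        # time a surviving path (re-)enters the root under that history.
        root = 0
        if log_prob > self.score[history].get(root, float("-inf")):
            self.score[history][root] = log_prob

    def prune(self):
        # Beam pruning over all copies jointly: drop state hypotheses that
        # fall more than `beam` below the globally best one, and discard
        # tree copies that become empty.
        scores = [p for states in self.score.values() for p in states.values()]
        if not scores:
            return
        threshold = max(scores) - self.beam
        for history in list(self.score):
            kept = {s: p for s, p in self.score[history].items() if p >= threshold}
            if kept:
                self.score[history] = kept
            else:
                del self.score[history]
```

A real decoder would additionally propagate the state scores with HMM transition and emission probabilities at every frame; the sketch only shows how the copies are indexed, created on demand, and pruned.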


One observes that hypotheses for the same word generated from different tree copies are often identical:
–i.e., there is redundant computation.
Can we exploit this redundancy and modify the algorithm such that word hypotheses are shared across multiple linguistic states?

A successful approach to this is the two-pass algorithm by Ney and Aubert. It first generates a word lattice using the "word-pair approximation", and then searches for the best path through this lattice using the full-range language model:
–computation is reduced by sharing word hypotheses between two-word histories that end with the same word.
An alternative approach is start-time-conditioned search, which uses non-reentrant tree copies conditioned on the start time of the tree. Here, word hypotheses are shared across all possible linguistic states during word-level recombination.


In this paper, we propose a single-pass reentrant-network (M+1)-gram decoder that uses three novel approaches aimed at eliminating redundant copies of the search space:
1. State copies are conditioned on the phonetic history rather than the linguistic history.
–Phone-history approximation (PHA), analogous to the word-pair approximation (WPA).
2. Path hypotheses at word boundaries are saved at every frame in a data structure similar to a word lattice. To apply the (M+1)-gram at a word end, the needed linguistic path-hypothesis copies are recovered on the fly, similarly to lattice rescoring. We call the recovered copies virtual hypothesis copies (VHC).

3. For further reduction of redundancy, multiple instances of the same context-dependent phone occurring in the same phonetic history are also dynamically replaced by a single instance. Incomplete path hypotheses at phoneme boundaries are temporarily saved in the lattice-like structure as well. To apply the tree lexicon, CD-phone instances associated with tree nodes are recovered on the fly (phone-level VHC).

Review of (M+1)-gram Viterbi decoding with a reentrant tree
Writing h for an M-word linguistic history, s for a state of the lexical tree, and t for the time frame, define:
–B_h(t, s) := time of the latest transition into the tree root on the best path up to time t that ends in state s of the lexical tree for history h (the "back-pointer").
–Q_h(t, s) := probability of the best path up to time t that ends in state s of the lexical tree for history h.
–H(h; t) := probability that the acoustic observation vectors o(1) … o(t) are generated by a word/state sequence that ends with the M words of h at time t.

The dynamic-programming equations for the word-history-conditioned (M+1)-gram search are as follows.
Within-word recombination (s > 0):
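A hedged reconstruction of the within-word recombination in the notation defined above, following the standard word-conditioned formulation (cf. Ortmanns and Ney); this is the textbook form and not necessarily the exact equation on the original slide:

```latex
% Within-word recombination for s > 0 (state-level Viterbi step,
% carried out independently inside each tree copy for history h):
Q_h(t, s) = \max_{s'} \left\{ p\bigl(o(t), s \mid s'\bigr) \cdot Q_h(t-1, s') \right\}
B_h(t, s) = B_h\bigl(t-1, s^{\mathrm{opt}}_h(t, s)\bigr),
\qquad
s^{\mathrm{opt}}_h(t, s) = \arg\max_{s'} \left\{ p\bigl(o(t), s \mid s'\bigr) \cdot Q_h(t-1, s') \right\}
```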

Word-boundary equation:
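Again a hedged reconstruction of the standard form: at a word end at time t the history is extended by the ending word w, the (M+1)-gram language model is applied, and the winning hypothesis re-enters the tree root of the corresponding copy.

```latex
% Word-boundary recombination for the new history h = (w_2, ..., w_M, w),
% where S_w denotes the terminal state of word w in the lexical tree:
H(h; t) = \max_{w_1} \left\{ p\bigl(w \mid w_1, \ldots, w_M\bigr) \cdot Q_{(w_1, \ldots, w_M)}(t, S_w) \right\}
% Re-entry into the tree root (s = 0) of the copy for the new history h:
Q_h(t, s{=}0) = H(h; t), \qquad B_h(t, s{=}0) = t
```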

Virtual hypothesis copies on the word level
A. How it works
B. Word hypotheses
C. Word-boundary assumption and phonetic-history approximation
D. Virtual hypothesis copies: redundancy of
E. Choosing
F. Collapsed hypothesis copies
G. Word-boundary equations
H. Collapsed (M+1)-gram search: summary
I. Beam pruning
J. Language model lookahead

How it works
The optimal start time of a word depends on its history. The same word in different histories may have different optimal start times; this is the reason for copying. However, we observed that start times are often identical, in particular if the histories are acoustically similar. If, for two linguistic histories, we obtain the same optimal start time, then we have computed too much.

It would only have been necessary to perform the state-level Viterbi recursion for one of the two histories. This is because:
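A hedged statement of the identity being appealed to, in the notation of the review section (my reconstruction, not a verbatim slide equation): if two histories share the same optimal start time for the current word, the within-word part of the score is identical, so one copy determines the other up to the ratio of the boundary scores.

```latex
% Suppose B_{h_1}(t, s) = B_{h_2}(t, s) = \tau for the states s of the current word.
% The acoustic score accumulated inside the word from \tau+1 to t does not depend
% on the history, hence
Q_{h_2}(t, s) = Q_{h_1}(t, s) \cdot \frac{H(h_2; \tau)}{H(h_1; \tau)}
% i.e., the state-level recursion for h_2 is redundant once it is known for h_1.
```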

We are now ready to introduce our method of virtual hypothesis copying (word level). The method consists of:
1. predicting the sets of histories for which the optimal start times are going to be identical; this information is needed already when a path enters a new word;
2. performing state-level Viterbi processing only for one copy per set;
3. for all other copies, recovering their accumulated path probabilities.
Thus, on the state level, all but one copy per set are neither stored nor computed; we call them "virtual".
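A rough Python sketch of the bookkeeping this implies (the names and data structures are mine, chosen for illustration): state-level Viterbi runs once per phone-history class, while per-history boundary scores H(h; τ) are stored so that the "virtual" copies can be recovered at word ends.

```python
from collections import defaultdict


def phone_history_class(history_phones, n=2):
    """Phone-history approximation (PHA): only the last n phones of a history
    are assumed to influence the optimal start time of the next word."""
    return tuple(history_phones[-n:])


# One "master" tree copy per phone-history class; every linguistic history
# mapping to the same class shares its state-level computation.
master_score = defaultdict(dict)     # pha_class -> {tree_state: log_prob}

# Boundary scores H(h; start_time), kept per class and start time, so that the
# virtual copies can be reconstructed later (akin to on-the-fly lattice rescoring).
boundary_score = defaultdict(dict)   # (pha_class, start_time) -> {history: log H(h; start_time)}


def recover_virtual_copies(pha_class, start_time, word_acoustic_log_prob):
    """At a word end: the acoustic score of the word (from start_time to now)
    was computed once in the master copy and is identical for all histories in
    the class; only the stored boundary scores differ per history."""
    return {history: h_log_prob + word_acoustic_log_prob
            for history, h_log_prob in boundary_score[(pha_class, start_time)].items()}
```

The (M+1)-gram probability of the ending word would then be applied to each recovered history, exactly as in the word-boundary equation of the review section.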

The art is to reliably predict these sets of histories that will lead to identical optimal start times. An exact prediction is impossible. We propose a heuristic, the phone-history approximation (PHA). The PHA assumes that a word's optimal boundary depends only on the last N phones of the history.

How it works (figure): regular bigram search vs. virtual hypothesis copies.

Word hypotheses (figure): word hypotheses with their acoustic scores p(O|w).


Word-boundary assumption and phonetic-history approximation
Intuitively, the optimal word boundaries should not depend on the linguistic state, but rather on the phonetic context at the boundary, and words that end similarly should lead to the same boundary. Thus, we propose a phonetically motivated history-class definition, the phone-history approximation (PHA):
–A word's optimal start time depends only on the word and its N-phone history.
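In the notation used earlier, the PHA can be written roughly as follows (my formalization; the symbol φ_N(h) for the last N phones of history h is a hypothetical shorthand, not from the slides):

```latex
% Phone-history approximation: the optimal word start time (back-pointer at the
% word-final state S_w) is assumed to depend on h only through its last N phones:
B_{h_1}(t, S_w) = B_{h_2}(t, S_w) \quad \text{whenever} \quad \varphi_N(h_1) = \varphi_N(h_2)
```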



Collapsed hypothesis copies
The most probable hypothesis is only known when the end of the word is reached - too late to reduce computation.


Word-boundary equations

Collapsed (M+1)-gram search: summary

Language model lookahead
M-gram lookahead aims at using language knowledge as early as possible in the lexical tree by pushing partial M-gram scores toward the tree root.
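A small sketch of how such lookahead scores can be computed on the lexical tree (the tree representation below is a hypothetical toy one, not the paper's): the lookahead score of a node is the best LM score of any word reachable below it, so part of the language-model knowledge is already available when a path enters that node.

```python
def lm_lookahead(node, children, words_at, lm_log_prob, cache=None):
    """Return the best LM log probability of any word reachable below `node`.

    children[node]  -> list of child node ids
    words_at[node]  -> list of word ids whose pronunciation ends at `node`
    lm_log_prob[w]  -> (unigram or history-conditioned M-gram) log probability of word w
    """
    if cache is None:
        cache = {}
    if node in cache:
        return cache[node]
    best = max((lm_log_prob[w] for w in words_at.get(node, ())), default=float("-inf"))
    for child in children.get(node, ()):
        best = max(best, lm_lookahead(child, children, words_at, lm_log_prob, cache))
    cache[node] = best
    return best
```

During search, the change in lookahead score along a tree arc is added to the path score, and the full M-gram probability replaces the anticipated score once the word identity is known at a leaf.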

Virtual hypothesis copies on the sub-word level
In the word-level method, the state-level search can be interpreted as a "word-lattice generator" with (M+1)-gram "lattice rescoring" applied on the fly; search-space reduction was achieved by sharing tree copies amongst multiple histories. We now want to apply the same idea at the subword level: the state-level search becomes a sort of "subword generator", subword hypotheses are incrementally matched against the lexical tree (frame-synchronously), and (M+1)-gram lattice rescoring is applied as before.
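A minimal sketch of the phone-level analogue, with hypothetical names: state-level hypotheses are keyed by the context-dependent phone and its N-phone history rather than by the tree node and linguistic history, and the pending (tree node, history) entries are re-attached when the shared phone instance ends.

```python
from collections import defaultdict

# One shared HMM instance per (CD phone, N-phone history); tree nodes that
# model the same CD phone in the same phonetic context map onto the same key.
shared_phone_score = defaultdict(dict)  # key -> {hmm_state: log_prob}

# Incomplete path hypotheses parked at the phone boundary: which tree node and
# linguistic history entered this shared instance, and at which score.
pending_entries = defaultdict(list)     # (key, start_time) -> [(tree_node, history, entry_log_prob)]


def phone_key(cd_phone, phone_history, n=2):
    """Phone-level sharing key: the CD phone plus the last n phones of context."""
    return (cd_phone, tuple(phone_history[-n:]))


def recover_tree_node_hypotheses(key, start_time, phone_acoustic_log_prob):
    """At the end of the shared CD-phone instance, expand the single phone
    hypothesis back into one hypothesis per pending (tree node, history) pair:
    the phone-internal acoustic score is shared, only the entry scores differ."""
    return [(tree_node, history, entry_log_prob + phone_acoustic_log_prob)
            for tree_node, history, entry_log_prob in pending_entries[(key, start_time)]]
```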


Experimental setup
The Philips LVCSR system is based on continuous-mixture HMMs, MFCC features, and unigram lookahead.
Corpora for Mandarin:
–MAT-2000, PCD, National Hi-Tech Project 863
Corpora for English:
–trained on WSJ0+1
–tested on the 1994 ARPA NAB task

Experimental results

Conclusion
We have presented a novel time-synchronous LVCSR Viterbi decoder for Mandarin based on the concept of virtual hypothesis copies (VHC). At no loss of accuracy, a reduction of active states of 60-80% has been achieved for Chinese, and of 40-50% for American English.