Learning Long-Term Temporal Features in LVCSR Using Neural Networks
Barry Chen, Qifeng Zhu, Nelson Morgan
International Computer Science Institute (ICSI), Berkeley, CA, USA, MLMI 2004
Presenter: 張志豪  Date: 2005/02/24

2 Reference
–Hynek Hermansky, "TRAPS - Classifiers of Temporal Patterns", ICSLP 1998
–Hynek Hermansky, "Data-Derived Nonlinear Mapping for Feature Extraction in HMM", ICASSP 1999
–Barry Chen, "Learning Discriminative Temporal Patterns in Speech: Development of Novel TRAPS-like Classifiers", Eurospeech 2003
–Hemant Misra, "New Entropy Based Combination Rules in HMM/ANN Multi-Stream ASR", ICASSP 2003

3 Outline
Introduction
–TRAPS
–HATS
MLP Architectures
–One-Stage Architectures
–Two-Stage Architectures
Experiments
–Frame accuracy
–Word error rates
–Combining long-term and short-term features

4 Introduction
Hynek Hermansky's group pioneered a method for capturing long-term (on the order of 500 ms) information for phonetic classification using multi-layer perceptrons (MLPs). They developed an MLP architecture called TRAPS, which stands for "TempoRAl PatternS".
TRAPS perform about as well as more conventional ASR systems that use short-term features, and improve word error rates when used in combination with those short-term features.
Earlier work improved the TRAPS architecture in the context of TIMIT phoneme recognition, producing Hidden Activation TRAPS (HATS); HATS differ from TRAPS in that they use the hidden activations of the critical-band MLPs, rather than their outputs, as inputs to the "merger" MLP. This paper evaluates these architectures on LVCSR.

5 Log-Critical Band Energies (LCBE) (1/2)
Conventional feature extraction

6 Log-Critical Band Energies (LCBE) (2/2)
TRAPS/HATS feature extraction

7 MLP Architectures
One-stage approach (unconstrained)
–15 Bands x 51 Frames
Two-stage approach (constrained)
–Linear approaches: PCA40, LDA40
–Non-linear approaches: TRAPS, HATS

8 One Stage Approach (1/3)
The paper uses LCBEs calculated every 10 ms on 8 kHz-sampled speech, giving a total of 15 Bark-scale-spaced LCBEs. These are mean- and variance-normalized per utterance.
All 15 bands over 51 frames of LCBEs serve as inputs to an MLP. These inputs are built by stacking the 25 frames before and the 25 frames after the current frame onto the current frame; the target phoneme comes from the current frame.
The network is trained with output targets of "1.0" for the class associated with the current frame and "0" for all others. The MLPs are trained on 46 phoneme targets and consist of a single hidden layer with a sigmoidal nonlinearity and an output layer with a softmax nonlinearity.
Baseline system: "15 Bands x 51 Frames", unconstrained.
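To make the input construction concrete, here is a minimal numpy sketch (ours, not the authors' code); the array name lcbe and the edge-padding at utterance boundaries are our assumptions.

import numpy as np

def stack_context(lcbe, left=25, right=25):
    # Each row: the current frame plus 25 frames of context on each side,
    # flattened to 51 x 15 = 765 values; utterance edges are edge-padded.
    padded = np.pad(lcbe, ((left, right), (0, 0)), mode="edge")
    return np.asarray([padded[t:t + left + 1 + right].ravel()
                       for t in range(lcbe.shape[0])])

lcbe = np.random.randn(300, 15)               # stand-in for one utterance
lcbe = (lcbe - lcbe.mean(0)) / lcbe.std(0)    # per-utterance mean/var norm
X = stack_context(lcbe)                       # shape (300, 765)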

9 One Stage Approach (2/3)
Softmax nonlinearity
–If you want the outputs of a network to be interpretable as posterior probabilities for a categorical target variable, it is highly desirable for those outputs to lie between zero and one and to sum to one. The purpose of the softmax activation function is to enforce these constraints on the outputs. Let the net input to each output unit be $q_i$, $i = 1, \ldots, c$, where $c$ is the number of categories. Then the softmax output $p_i$ is:
$p_i = \frac{e^{q_i}}{\sum_{j=1}^{c} e^{q_j}}$
Sigmoidal nonlinearity
–$\sigma(x) = \frac{1}{1 + e^{-x}}$
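Both nonlinearities are one-liners in numpy; this is our illustrative version, not code from the paper.

import numpy as np

def softmax(q):
    e = np.exp(q - q.max())   # subtract the max for numerical stability
    return e / e.sum()        # outputs lie in (0, 1) and sum to 1

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

softmax(np.array([2.0, 1.0, 0.1]))   # -> approx. [0.659, 0.242, 0.099]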

10 One Stage Approach (3/3)

11 Two Stage Approach
They developed an MLP architecture called TRAPS, which stands for "TempoRAl PatternS". The TRAPS system consists of two stages of MLPs:
1. Critical-band MLPs learn phone posterior probabilities given their input, a set of consecutive frames of LCBEs from one band (an LCBE trajectory).
2. A "merger" MLP merges the outputs of these individual critical-band MLPs, producing overall phone posterior probabilities.
Correlations among individual frames of LCBEs from different frequency bands are not directly modeled; instead, correlations among long-term LCBE trajectories from different frequency bands are modeled.
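The following sketch illustrates the two-stage data flow with our own naming and random stand-in weights; it is not the authors' implementation. It also shows where HATS and TRAPS diverge: the merger input is either the band MLPs' hidden activations or their softmax outputs.

import numpy as np

rng = np.random.default_rng(0)
n_bands, n_frames, n_hidden, n_phones = 15, 51, 40, 46

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(q):
    e = np.exp(q - q.max())
    return e / e.sum()

# Stage 1: one small MLP per critical band (random stand-in weights).
W1 = [rng.normal(size=(n_frames, n_hidden)) for _ in range(n_bands)]
W2 = [rng.normal(size=(n_hidden, n_phones)) for _ in range(n_bands)]

def first_stage(trajectories, hats=True):
    # trajectories: (15, 51) -- one 51-frame LCBE trajectory per band.
    feats = []
    for b in range(n_bands):
        hidden = sigmoid(trajectories[b] @ W1[b])
        # HATS feeds the hidden activations to the merger MLP;
        # TRAPS feeds the band MLP's softmax outputs instead.
        feats.append(hidden if hats else softmax(hidden @ W2[b]))
    return np.concatenate(feats)   # input to the stage-2 "merger" MLP

merger_input = first_stage(rng.normal(size=(n_bands, n_frames)))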

12 Linear Approaches (1/2)
Feature
–The paper calculates a PCA transform over successive 51-frame trajectories for each of the 15 LCBE bands, resulting in a 51 x 51 transform matrix per band.
–This transform is then used to orthogonalize the temporal trajectory in each band, retaining only the top 40 features per band.
–Finally, these transformed features are used as input to an MLP.
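A minimal reconstruction of the per-band PCA step, assuming trajectories have already been collected into a (num_examples, 51) array per band; the function name and the eigendecomposition route are ours.

import numpy as np

def per_band_pca(trajectories, keep=40):
    # trajectories: (num_examples, 51) windows from a single band.
    centered = trajectories - trajectories.mean(axis=0)
    cov = np.cov(centered, rowvar=False)                  # 51 x 51
    eigvals, eigvecs = np.linalg.eigh(cov)                # ascending order
    top = eigvecs[:, np.argsort(eigvals)[::-1][:keep]]    # 51 x 40 basis
    return centered @ top                                 # orthogonalized

# One such transform per band; 15 bands x 40 features feed the MLP.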

13 Linear Approaches (2/2)

14 Non-Linear Approaches (1/2)

15 Non-Linear Approaches (2/2)

16 Experimental Setup
Training: ~68 hours of conversational telephone speech from English CallHome, Switchboard I, and Switchboard Cellular.
Testing: 2001 Hub-5 Evaluation Set (Eval2001)
–A large-vocabulary conversational telephone speech test set
–2,255,609 frames and 62,890 words
Back-end recognizer: SRI's Decipher system, first-pass decoding with a bigram language model and within-word triphone acoustic models.

17 Frame Accuracy (1/2)
Classification is deemed correct when the highest output of the MLP corresponds to the correct phoneme label.
The point of comparison is a conventional intermediate-temporal-context MLP that uses 9 frames of per-side normalized (mean, variance, vocal tract length) PLP plus deltas and double-deltas as inputs (PLP 9 Frames).
Results
–With the exception of the TRAPS system, all of the two-stage systems do better than this.
–HATS Before Sigmoid and TRAPS Before Softmax perform comparably at 65.80% and 65.85% respectively, while the PCA and LDA approaches perform similarly at 65.50% and 65.52% respectively.
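In code, this frame-accuracy criterion is just an argmax comparison (our sketch):

import numpy as np

def frame_accuracy(posteriors, labels):
    # posteriors: (num_frames, 46) MLP outputs; labels: (num_frames,) ints.
    return float((posteriors.argmax(axis=1) == labels).mean())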

18 Frame Accuracy (2/2)

19 Word Error Rates (1/2)
System (46D)
–The experiment takes the log of the outputs from the MLPs and then decorrelates the features via PCA.
–Mean and variance normalization are applied to these transformed outputs.
Results
–HATS always ranks first when compared to all other long-temporal systems, achieving a 7.29% relative improvement over the baseline.
–TRAPS does not provide an improvement over the baseline, but all of the other approaches do. The final softmax nonlinearity in the critical-band MLPs is the only difference between TRAPS and TRAPS Before Softmax, so including this nonlinearity during recognition causes the performance degradation. It is likely that the softmax's output normalization obscures useful information that the second-stage MLP needs.
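A sketch of this post-processing chain, assuming the PCA matrix has been estimated beforehand (e.g., as in the per-band recipe above); the names and the log-flooring constant are our assumptions, not details from the paper.

import numpy as np

def tandem_features(posteriors, pca_matrix, floor=1e-10):
    # posteriors: (num_frames, 46) MLP outputs; pca_matrix: (46, k).
    logp = np.log(posteriors + floor)                 # log of the outputs
    feats = (logp - logp.mean(axis=0)) @ pca_matrix   # decorrelate via PCA
    return (feats - feats.mean(axis=0)) / feats.std(axis=0)  # mean/var norm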

20 Word Error Rates (2/2)

21 Combine Long-Term with Short-Term (1/3)
SRI's EARS Rich Transcription 2003 front-end features (short-term, 39D)
–Baseline HLDA(PLP+3d) feature:
1. 12th-order PLP plus the first three orders of deltas,
2. mean, variance, and vocal tract length normalized,
3. transformed by heteroskedastic linear discriminant analysis (HLDA), keeping the top 39 features.

22 Combine Long-Term with Short-Term (2/3)
Methods (64D)
–Append the top 25 dimensions after PCA on each of the temporal features to the baseline HLDA(PLP+3d) features.
–PLP 9 Frames
–Combining the HATS and PLP 9 Frames systems with an inverse entropy weighting method, taking the log followed by PCA to 25 dimensions, and appending the result to the HLDA(PLP+3d) features gives "Inv Entropy Combo HATS+PLP 9 Frames".
Results
–HATS yields a 3.23% WER improvement.
–PLP 9 Frames improves WER by about the same amount as HATS.
–Combining PLP 9 Frames with HATS yields an 8.60% WER improvement.
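Below is a minimal sketch of inverse entropy weighting in the spirit of Misra's 2003 entropy-based combination rules (see the reference slide): frames where a stream's posteriors have low entropy (i.e., the stream is confident) receive more weight. The exact variant used in the paper may differ; all names here are ours.

import numpy as np

def inverse_entropy_combine(streams, eps=1e-10):
    # streams: list of (num_frames, 46) posterior arrays, one per system.
    weights = []
    for p in streams:
        entropy = -(p * np.log(p + eps)).sum(axis=1)  # per-frame entropy
        weights.append(1.0 / (entropy + eps))         # confident -> heavy
    weights = np.stack(weights)                       # (num_streams, T)
    weights /= weights.sum(axis=0, keepdims=True)     # normalize per frame
    return sum(w[:, None] * p for w, p in zip(weights, streams))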

23 Combine Long-Term with Short-Term (3/3)

24 Conclusions
Including the critical-band softmax nonlinearity during recognition, as TRAPS does, causes performance degradation.
Inverse entropy weighting is a promising research direction.
Combining long-term with short-term information yields an improvement.