Qifeng Zhu, Barry Chen, Nelson Morgan, Andreas Stolcke ICSI & SRI

Slides:

Advertisements

Similar presentations

Pattern Recognition and Machine Learning

Advertisements

Zhijie Yan, Qiang Huo and Jian Xu Microsoft Research Asia

Current HOARSE related activities 6-7 Sept …include the following (+ more) Novel architectures 1.All-combinations HMM/ANN 2.Tandem HMM/ANN hybrid.

Speech Recognition with Hidden Markov Models Winter 2011

Biointelligence Laboratory, Seoul National University

1 Bayesian Adaptation in HMM Training and Decoding Using a Mixture of Feature Transforms Stavros Tsakalidis and Spyros Matsoukas.

SRI 2001 SPINE Evaluation System Venkata Ramana Rao Gadde Andreas Stolcke Dimitra Vergyri Jing Zheng Kemal Sonmez Anand Venkataraman.

Designing Facial Animation For Speaking Persian Language Hadi Rahimzadeh June 2005.

Confidence Estimation for Machine Translation J. Blatz et.al, Coling 04 SSLI MTRG 11/17/2004 Takahiro Shinozaki.

Speaker Adaptation in Sphinx 3.x and CALO David Huggins-Daines

Speaker Adaptation for Vowel Classification

Machine Learning CUNY Graduate Center Lecture 3: Linear Regression.

Incorporating Tone-related MLP Posteriors in the Feature Representation for Mandarin ASR Overview Motivation Tone has a crucial role in Mandarin speech.

Optimal Adaptation for Statistical Classifiers Xiao Li.

EE225D Final Project Text-Constrained Speaker Recognition Using Hidden Markov Models Kofi A. Boakye EE225D Final Project.

1 USING CLASS WEIGHTING IN INTER-CLASS MLLR Sam-Joo Doh and Richard M. Stern Department of Electrical and Computer Engineering and School of Computer Science.

May 30th, 2006Speech Group Lunch Talk Features for Improved Speech Activity Detection for Recognition of Multiparty Meetings Kofi A. Boakye International.

EE513 Audio Signals and Systems Statistical Pattern Classification Kevin D. Donohue Electrical and Computer Engineering University of Kentucky.

ECE 8443 – Pattern Recognition ECE 8423 – Adaptive Signal Processing Objectives: Example Clustered Transformations MAP Adaptation Resources: ECE 7000:

1 International Computer Science Institute Data Sampling for Acoustic Model Training Özgür Çetin International Computer Science Institute Andreas Stolcke.

This week: overview on pattern recognition (related to machine learning)

International Conference on Intelligent and Advanced Systems 2007 Chee-Ming Ting Sh-Hussain Salleh Tian-Swee Tan A. K. Ariff. Jain-De,Lee.

1M4 speech recognition University of Sheffield M4 speech recognition Vincent Wan, Martin Karafiát.

1 Improved Speaker Adaptation Using Speaker Dependent Feature Projections Spyros Matsoukas and Richard Schwartz Sep. 5, 2003 Martigny, Switzerland.

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 3: LINEAR MODELS FOR REGRESSION.

Jun-Won Suh Intelligent Electronic Systems Human and Systems Engineering Department of Electrical and Computer Engineering Speaker Verification System.

Automatic Speech Recognition: Conditional Random Fields for ASR Jeremy Morris Eric Fosler-Lussier Ray Slyh 9/19/2008.

ECE 8443 – Pattern Recognition LECTURE 10: HETEROSCEDASTIC LINEAR DISCRIMINANT ANALYSIS AND INDEPENDENT COMPONENT ANALYSIS Objectives: Generalization of.

A Baseline System for Speaker Recognition C. Mokbel, H. Greige, R. Zantout, H. Abi Akl A. Ghaoui, J. Chalhoub, R. Bayeh University Of Balamand - ELISA.

SCALING UP: LEARNING LARGE-SCALE RECOGNITION METHODS FROM SMALL-SCALE RECOGNITION TASKS Nelson Morgan, Barry Y Chen, Qifeng Zhu, Andreas Stolcke International.

Learning Long-Term Temporal Feature in LVCSR Using Neural Networks Barry Chen, Qifeng Zhu, Nelson Morgan International Computer Science Institute (ICSI),

ISL Meeting Recognition Hagen Soltau, Hua Yu, Florian Metze, Christian Fügen, Yue Pan, Sze-Chen Jou Interactive Systems Laboratories.

Speech Communication Lab, State University of New York at Binghamton Dimensionality Reduction Methods for HMM Phonetic Recognition Hongbing Hu, Stephen.

Voice Activity Detection based on OptimallyWeighted Combination of Multiple Features Yusuke Kida and Tatsuya Kawahara School of Informatics, Kyoto University,

ECE 8443 – Pattern Recognition ECE 8423 – Adaptive Signal Processing Objectives: Supervised Learning Resources: AG: Conditional Maximum Likelihood DP:

1 CRANDEM: Conditional Random Fields for ASR Jeremy Morris 11/21/2008.

Speech Lab, ECE, State University of New York at Binghamton  Classification accuracies of neural network (left) and MXL (right) classifiers with various.

ECE 8443 – Pattern Recognition ECE 8527 – Introduction to Machine Learning and Pattern Recognition LECTURE 12: Advanced Discriminant Analysis Objectives:

RCC-Mean Subtraction Robust Feature and Compare Various Feature based Methods for Robust Speech Recognition in presence of Telephone Noise Amin Fazel Sharif.

1 Voicing Features Horacio Franco, Martin Graciarena Andreas Stolcke, Dimitra Vergyri, Jing Zheng STAR Lab. SRI International.

Flexible Speaker Adaptation using Maximum Likelihood Linear Regression Authors: C. J. Leggetter P. C. Woodland Presenter: 陳亮宇 Proc. ARPA Spoken Language.

H ADVANCES IN MANDARIN BROADCAST SPEECH RECOGNITION Overview Goal Build a highly accurate Mandarin speech recognizer for broadcast news (BN) and broadcast.

Combining Models Foundations of Algorithms and Machine Learning (CS60020), IIT KGP, 2017: Indrajit Bhattacharya.

Olivier Siohan David Rybach

Machine Learning Supervised Learning Classification and Regression

LECTURE 11: Advanced Discriminant Analysis

Background on Classification

Hierarchical Multi-Stream Posterior Based Speech Recognition System

LECTURE 10: DISCRIMINANT ANALYSIS

Conditional Random Fields for ASR

RECURRENT NEURAL NETWORKS FOR VOICE ACTIVITY DETECTION

Random walk initialization for training very deep feedforward networks

Machine Learning Feature Creation and Selection

Speech Processing Speech Recognition

CRANDEM: Conditional Random Fields for ASR

Statistical Models for Automatic Speech Recognition

Neuro-Computing Lecture 4 Radial Basis Function Network

Automatic Speech Recognition: Conditional Random Fields for ASR

Feature space tansformation methods

Generally Discriminant Analysis

John H.L. Hansen & Taufiq Al Babba Hasan

LECTURE 09: DISCRIMINANT ANALYSIS

Parametric Methods Berlin Chen, 2005 References:

Multivariate Methods Berlin Chen

Ch 3. Linear Models for Regression (2/2) Pattern Recognition and Machine Learning, C. M. Bishop, Previously summarized by Yung-Kyun Noh Updated.

Multivariate Methods Berlin Chen, 2005 References:

Learning Long-Term Temporal Features

Test #1 Thursday September 20th

Lecture 16. Classification (II): Practical Considerations

Presenter: Shih-Hsiang(士翔)

Presentation transcript:

Qifeng Zhu, Barry Chen, Nelson Morgan, Andreas Stolcke ICSI & SRI Tandem Connectionist Feature Extraction for Conversational Speech Recognition Qifeng Zhu, Barry Chen, Nelson Morgan, Andreas Stolcke ICSI & SRI June 21, 2004 Tandem Connectionist Feature Extraction for Conversational Speech Recognition

Using Multi-Layer Perceptron (MLP) in Feature Extraction for Speech Recognition Acoustic modeling: a machine learning algorithm to learn phone posteriors (Hybrid system). Data driven feature extraction / data driven nonlinear feature transformation (Tandem system). This work extends the second approach. We present some properties of MLP based transform, the recognition system set-up and the recognition performance with this novel feature. It’s about the feature June 21, 2004 Tandem Connectionist Feature Extraction for Conversational Speech Recognition

MLP outputs as features to HMM MLP outputs: phone posterior approximation Regular within distribution in feature space with simple class boundary (easy to model) Reducing target irrelevant information (such as the speaker variation) Easy to combine different MLP features, effective in improving performance without increasing feature dimension (to avoid the ‘curse’) We will show these properties in more detail …… June 21, 2004 Tandem Connectionist Feature Extraction for Conversational Speech Recognition

*1 Simple and Regular Within-Class Distribution Class boundary approximates the optimal equal- posterior hyper-plane. Nearly-flat distribution for the ‘in-line’ feature component (the posterior of the underlying class) ‘Off-line’ components distribute close to zero. June 21, 2004 Tandem Connectionist Feature Extraction for Conversational Speech Recognition

Exp. 1: Posterior Feature Space Feature space of the three MLP components corresponding to /ah/(triangle), /ao/ (star), and /aw/ (circle). Each class is a “stick” Posterior feature space with value in [0,1] June 21, 2004 Tandem Connectionist Feature Extraction for Conversational Speech Recognition

Exp. 2: Log Posterior Feature Space Logarithm can further manipulate the distribution to avoid vary sharp distribution of the ‘off-line’ component. Each class is a “pie” after logarithm. Log-posterior feature space with value in (-, 0] June 21, 2004 Tandem Connectionist Feature Extraction for Conversational Speech Recognition

Exp. 3: Typical Distributions of Log Posteriors in Histogram ‘In-line’ component ‘off-line’ component -2 -18 -2 June 21, 2004 Tandem Connectionist Feature Extraction for Conversational Speech Recognition

*2 Reducing Speaker Variation Posteriors are by nature speaker independent, if trained with speaker balanced data. The MLP output, as the posterior approximation, carries this property. To show this, we compare the variances of the SAT transform matrices for different speakers with both PLP feature and MLP feature, both mean/variance normalized. MLP feature has smaller average variance. June 21, 2004 Tandem Connectionist Feature Extraction for Conversational Speech Recognition

Exp. 4: Variances of (Speaker Adaptive Training) SAT Transforms for Different Speakers Speaker variation can be viewed as the variations of the SAT matrices on normalized features. Ratio of the average variances in the PLP block (first 39 dim) and the MLP block (next 25 dim) =1.6 variances feature dim feature dim June 21, 2004 Tandem Connectionist Feature Extraction for Conversational Speech Recognition

*3 Feature Combination: Better Performance, No Dimensionality Increase Combine PLP-MLP (full band/short term) and TRAPS (sub-band/long term) outputs as posteriors. Use Inverse Entropy Weighting to combine two MLP outputs in the posterior level. Both frame accuracy and recognition word accuracy get improved with the combined feature. June 21, 2004 Tandem Connectionist Feature Extraction for Conversational Speech Recognition

Usually What to Expect for a Feature Transform Find the discriminative information (such as LDA). Make the feature fit the model better, especially for the Gaussian likelihood computation(such as MLLT) Reduce feature dimensionality to reduce computation and to avoid the ‘curse’. With the good properties of the MLP outputs, MLPs can be viewed as a nonlinear feature transform for these purposes. June 21, 2004 Tandem Connectionist Feature Extraction for Conversational Speech Recognition

The Feature Generation Diagram June 21, 2004 Tandem Connectionist Feature Extraction for Conversational Speech Recognition

Some Practical Details in Feature Generation and HMM Decoding Gaussian Weight Tuning for the augmented feature. Another per-speaker normalization after MLP transform. KLT based truncation can be applied without affecting recognition performance. (The first 25 dimensions keep 98% of the total variance.) MLP features are appended to regular PLP features to form the final features for the HMM. June 21, 2004 Tandem Connectionist Feature Extraction for Conversational Speech Recognition

Recognition Experiments Recognition task is the NIST 2001 Hub5 testset (6 hours conversational telephone speech). Training uses 68 hours mainly from the Switchboard Corpus for the initial evaluation. SRI Decipher system is used for these experiments. Gender dependent HMM system, bi-gram LM, Nbest decoding and re-score, using VTLN, HLDA in the PLP baseline feature with first three derivatives. June 21, 2004 Tandem Connectionist Feature Extraction for Conversational Speech Recognition

Recognition with a ‘Plain’ System with ML Training A 8.6% relative error reduction was achieved on this task using the combined MLP feature June 21, 2004 Tandem Connectionist Feature Extraction for Conversational Speech Recognition

Concerns for a Novel Feature: Scale and Carry Through Scale to larger training sets Improvements carry through with other advanced technologies: Adaptation MMIE discriminative training Better LM rescore System combination June 21, 2004 Tandem Connectionist Feature Extraction for Conversational Speech Recognition

Results with Adaptation A 8.9% relative error reduction. Block diagonal MLLR adaptation, no need to cross adapt the PLP feature with MLP feature MLP feature works well with adaptation! June 21, 2004 Tandem Connectionist Feature Extraction for Conversational Speech Recognition

Results in a Full-Fledged System Male only, 200 hours training, discriminative training and adaptation. 6.1%-8.2% error reduction with the advanced system. June 21, 2004 Tandem Connectionist Feature Extraction for Conversational Speech Recognition

Summary Feature extraction is usually a bottom-top process. Most class-driven top-bottom supervised transforms are linear transforms. MLP based data driven nonlinear feature transform works well in LVCSR task. The work presented here discusses some nice properties of the MLP feature, which might be responsible for the improvement. The End. Thanks. June 21, 2004 Tandem Connectionist Feature Extraction for Conversational Speech Recognition

MLP Training Different Inputs to MLP PLP-MLP MLPs with 46 phone targets can be trained with different inputs, taking different views of the time- frequency plane. PLPMLP focus on full band short term, while TRAPs (HATs) focus on sub-band long term. TRAPs Different Inputs to MLP June 21, 2004 Tandem Connectionist Feature Extraction for Conversational Speech Recognition