Learning Long-Term Temporal Features

Presentation transcript:

Learning Long-Term Temporal Features: A Comparative Study. Barry Chen. May 4, 2004, Speech Lunch Talk.

Log-Critical Band Energies

Log-Critical Band Energies: Conventional Feature Extraction

Log-Critical Band Energies: TRAPS/HATS Feature Extraction

What is a TRAP? (Background Tangent)
- TRAPs were originally developed by our colleagues at OGI: Sharma, Jain (now at SRI), Hermansky, and Sivadas (both now at IDIAP).
- TRAP stands for TempoRAl Pattern.
- A TRAP is a narrow-frequency speech energy pattern over a period of time (usually 0.5 to 1 second long).
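As a rough illustration of the definition above, the sketch below cuts a window of log critical band energies around every frame, one trajectory per band. The 15 bands and 51 frames match the "15 Bands x 51 Frames" system named later in the talk; the edge padding, the function name `band_trajectories`, and the random input are assumptions made only for this example.

```python
import numpy as np

def band_trajectories(log_energies, context=25):
    """Cut a (2*context + 1)-frame temporal pattern per band around each frame.

    log_energies: array of shape (n_frames, n_bands).
    Returns an array of shape (n_frames, n_bands, 2*context + 1).
    """
    n_frames, n_bands = log_energies.shape
    width = 2 * context + 1
    # Assumption: repeat the first/last frame so edge frames get a full window.
    padded = np.pad(log_energies, ((context, context), (0, 0)), mode="edge")
    # Entry [t, b, :] holds band b from frame t - context to frame t + context.
    return np.stack([padded[t:t + width].T for t in range(n_frames)])

rng = np.random.default_rng(0)
log_energies = rng.normal(size=(200, 15))  # synthetic stand-in for real energies

traps = band_trajectories(log_energies)          # per-band inputs (two-stage systems)
one_stage_input = traps.reshape(len(traps), -1)  # all bands flattened (one-stage system)
print(traps.shape, one_stage_input.shape)
```

The two-stage systems would consume `traps[:, b, :]` one band at a time, while a one-stage system would take each flattened row of `one_stage_input` directly.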

Example of TRAPS: Mean Temporal Patterns for 45 Phonemes at 500 Hz

TRAPS Motivation
- Psychoacoustic studies suggest that the human peripheral auditory system integrates information over a longer time scale.
- Information measurements (joint mutual information) show that information still exists more than 100 ms away within a single critical band.
- Potential robustness to speech degradations.

Let’s Explore
- TRAPS and HATS are examples of a specific two-stage approach to learning long-term temporal features.
- Is this constrained two-stage approach better than an unconstrained one-stage approach?
- Are the non-linear transformations of critical band trajectories, provided in different ways by TRAPS and HATS, actually necessary?

Learn Everything in One Step

Learn in Individual Bands

One-Stage Approach

2-Stage Linear Approaches

PCA/LDA Comments
- PCA on log critical band energy trajectories scales and rotates dimensions in the directions of highest variance.
- LDA projects onto directions that maximize class separability, measured by the ratio of between-class covariance to within-class covariance.
- Keep the top 40 dimensions for comparison with the MLP-based approaches.
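The two linear transforms described above can be sketched in NumPy as follows. This is a minimal illustration on synthetic single-band trajectories, not the setup used in the talk: the class count, sample count, noise level, the small ridge added to the within-class scatter, and the function names `pca_transform` and `lda_transform` are all assumptions. Note that LDA yields at most (number of classes - 1) useful directions, so 40 dimensions only makes sense with more than 40 classes.

```python
import numpy as np

def pca_transform(X, n_keep):
    """Whitening PCA: rotate onto top-variance directions, scale to unit variance."""
    mean = X.mean(axis=0)
    cov = np.cov(X - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)        # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_keep]    # indices of the largest ones
    W = eigvecs[:, order] / np.sqrt(eigvals[order])
    return W, mean

def lda_transform(X, y, n_keep, ridge=1e-6):
    """LDA: directions maximizing between-class over within-class scatter."""
    mean = X.mean(axis=0)
    dim = X.shape[1]
    Sw = ridge * np.eye(dim)                      # within-class scatter (plus ridge)
    Sb = np.zeros((dim, dim))                     # between-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        diff = (mc - mean)[:, None]
        Sb += len(Xc) * (diff @ diff.T)
    # Eigenvectors of Sw^{-1} Sb, sorted by decreasing eigenvalue.
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(eigvals.real)[::-1][:n_keep]
    return eigvecs[:, order].real

# Synthetic stand-in: 45 classes of 51-frame trajectories from one band.
rng = np.random.default_rng(0)
n_classes, per_class, dim = 45, 40, 51
labels = np.repeat(np.arange(n_classes), per_class)
class_means = rng.normal(size=(n_classes, dim))
X = class_means[labels] + 0.5 * rng.normal(size=(len(labels), dim))

pca_W, pca_mean = pca_transform(X, n_keep=40)
pca_feats = (X - pca_mean) @ pca_W
lda_W = lda_transform(X, labels, n_keep=40)
lda_feats = (X - X.mean(axis=0)) @ lda_W
```

In a two-stage linear system, a transform like this would be estimated separately for each critical band, and the per-band outputs would then feed the second-stage merger.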

2-Stage MLP-Based Approaches

MLP Comments
- As with the other 2-stage approaches, we first learn patterns independently in separate critical band trajectories, and then learn correlations among these discriminative trajectories.
- Interpretation of the various MLP layers:
  - Input-to-hidden weights: discriminant linear transformations
  - Hidden unit outputs: non-linear discriminant transforms
  - Before softmax: transforms the hidden activation space to an unnormalized phone probability space
  - Output activations: critical band phone probabilities
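To make the four layer interpretations concrete, here is a minimal forward pass through one per-band MLP that exposes each point where features could be tapped. The weights are random and untrained, and the layer sizes (51 inputs, 40 hidden units, 45 outputs) are assumptions chosen to echo numbers mentioned elsewhere in the talk. The dictionary keys reuse the system names from the later slides; mapping them onto these tap points is this example's reading, not something the slides state outright.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def band_mlp_taps(x, W1, b1, W2, b2):
    """Forward pass of a one-hidden-layer MLP, returning every intermediate output."""
    pre_sigmoid = x @ W1 + b1              # input-to-hidden weights: linear transform
    hidden = sigmoid(pre_sigmoid)          # hidden unit outputs: non-linear transform
    pre_softmax = hidden @ W2 + b2         # unnormalized phone scores
    posteriors = softmax(pre_softmax)      # critical band phone probabilities
    return {
        "hats_before_sigmoid": pre_sigmoid,
        "hats": hidden,
        "traps_before_softmax": pre_softmax,
        "traps": posteriors,
    }

# Hypothetical sizes and random (untrained) weights, for shape illustration only.
rng = np.random.default_rng(0)
n_in, n_hidden, n_phones = 51, 40, 45
W1, b1 = rng.normal(scale=0.1, size=(n_in, n_hidden)), np.zeros(n_hidden)
W2, b2 = rng.normal(scale=0.1, size=(n_hidden, n_phones)), np.zeros(n_phones)

trajectories = rng.normal(size=(8, n_in))  # 8 frames from a single critical band
taps = band_mlp_taps(trajectories, W1, b1, W2, b2)
for name, value in taps.items():
    print(name, value.shape)
```

Each two-stage MLP variant would pass one of these four outputs, collected over all bands, to a second-stage merger network.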

Experimental Setup
- Training: ~68 hours of conversational telephone speech from English CallHome, Switchboard I, and Switchboard Cellular. 1/10 is used as the cross-validation set for the MLPs.
- Testing: 2001 Hub-5 Evaluation Set (Eval2001), with 2,255,609 frames and 62,890 words.
- Back-end recognizer: SRI’s Decipher system, with first-pass decoding using a bigram language model and within-word triphone acoustic models (thanks to Andreas Stolcke for all his help).

Frame Accuracy Performance

Standalone Feature System
- Transform the MLP outputs by:
  - a log transform to make the features more Gaussian
  - PCA for decorrelation
- This is the same as the Tandem setup introduced by Hermansky, Ellis, and Sharma.
- Use the transformed MLP outputs as front-end features for the SRI recognizer.
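The log-then-PCA transform described above might look like the following sketch. The posteriors here are random Dirichlet draws rather than real MLP outputs, and the number of phone classes, the number of retained dimensions, the flooring constant `eps`, and the function name `tandem_features` are all assumptions for illustration. In a real system the PCA would be estimated on training data and then applied unchanged to test data; this example fits and applies it on the same array for brevity.

```python
import numpy as np

def tandem_features(posteriors, n_keep, eps=1e-10):
    """Log-transform MLP posteriors, then decorrelate and truncate with PCA."""
    log_post = np.log(posteriors + eps)           # eps guards against log(0)
    centered = log_post - log_post.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)        # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_keep]    # keep the top-variance directions
    return centered @ eigvecs[:, order]

# Synthetic stand-in for MLP phone posteriors: each row sums to 1.
rng = np.random.default_rng(0)
posteriors = rng.dirichlet(np.ones(46), size=500)

features = tandem_features(posteriors, n_keep=25)  # 25 is a hypothetical choice
print(features.shape)
```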

Standalone Features

Combination W/State-of-the-Art Front-End Feature
- SRI’s 2003 PLP front-end feature is 12th-order PLP with three deltas. Heteroscedastic linear discriminant analysis (HLDA) then transforms this 52-dimensional feature vector into the 39-dimensional HLDA(PLP+3d).
- Concatenate the PCA-truncated MLP features to HLDA(PLP+3d) and use the result as an augmented front-end feature.
- Similar to the Qualcomm-ICSI-OGI features in AURORA.
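The dimension bookkeeping for the augmented feature can be sketched as below. The 52 and 39 come from the slide; reading 52 as 13 static coefficients times four streams (statics plus three deltas) is an assumption, as is the number of retained MLP dimensions (`N_MLP_DIMS`), which the slide does not give. The HLDA matrix here is random, standing in for a trained projection.

```python
import numpy as np

N_PLP_DIMS = 52    # from the slide: PLP plus three deltas (assumed 13 x 4)
N_HLDA_DIMS = 39   # from the slide: size after the HLDA projection
N_MLP_DIMS = 25    # hypothetical: the slide does not state the PCA truncation size

rng = np.random.default_rng(0)
n_frames = 100

plp_with_deltas = rng.normal(size=(n_frames, N_PLP_DIMS))   # synthetic PLP features
hlda_matrix = rng.normal(size=(N_PLP_DIMS, N_HLDA_DIMS))    # stand-in for trained HLDA
hlda_plp = plp_with_deltas @ hlda_matrix                    # HLDA(PLP+3d)

mlp_feats = rng.normal(size=(n_frames, N_MLP_DIMS))         # PCA-truncated MLP features

# Frame-by-frame concatenation gives the augmented front-end feature.
augmented = np.hstack([hlda_plp, mlp_feats])
print(augmented.shape)
```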

Combo W/PLP Baseline Features

Ranking Table

Observations
- Across the three testing setups:
  - HATS is always #1.
  - The one-stage 15 Bands x 51 Frames system is always #6, i.e., second to last.
  - TRAPS is always last.
  - PCA, LDA, HATS before sigmoid, and TRAPS before softmax flip-flop in performance.

Interpretation
- The learning constraints introduced by the 2-stage approach are helpful if done right.
- The non-linear discriminant transform of HATS is better than the linear discriminant transforms from LDA and HATS before sigmoid.
- The further mapping from hidden activations to critical-band phone posteriors is not helpful.
- Perhaps mapping to critical-band phones is too difficult and inherently noisy.
- Finally, like TRAPS, HATS is complementary to the more conventional features and combines synergistically with PLP 9 Frames.


Frame Accuracy Performance

Standalone Features WER

Combo W/PLP Baseline Features