Speech Analysis and Cognition Using Category-Dependent Features in a Model of the Central Auditory System Woojay Jeon Research Advisor: Fred Juang School of Electrical and Computer Engineering Georgia Institute of Technology October 8, 2006
1 Synopsis of Project
– One of the very few attempts, if any, to address auditory modeling beyond the periphery (ear, cochlea, even auditory nerve fibers) for ASR.
– Implemented a model (periphery + 3D cortical model) to calculate the cortical response to stimuli.
– Investigated cortical representations in ASR; conducted a comprehensive comparative study to understand robustness in auditory representations.
– Developed a methodology for analyzing robustness based on matched filter theory.
– Spawned a new development based on category-dependent feature selection and hierarchical pattern recognition.
2 Matched Filtering
Cortical response: $r(x,s,\phi) = \int_{R(x,s,\phi)} p(y)\, w(y;x,s,\phi)\, dy$, where
– $p(y)$: power (auditory) spectrum
– $w(y;x,s,\phi)$: response area of the neuron at $(x,s,\phi)$
– $r(x,s,\phi)$: cortical response
– $R(x,s,\phi)$: non-zero frequency range of $w(y;x,s,\phi)$
The Cauchy-Schwarz inequality tells us that $|r(x,s,\phi)|^2$ will be maximum when $p(y) \propto w(y;x,s,\phi)$ over $R(x,s,\phi)$. If $R(x,s,\phi)$ includes enough spectral peaks, we can also use the spectral envelope $v(y)$ in place of $p(y)$.
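To make the matched-filter argument concrete, here is a minimal Python sketch (a toy illustration, not the actual periphery/cortical implementation; the spectrum shape and the candidate filters below are invented for demonstration): it verifies that among unit-energy response areas, the squared response $|r|^2$ is largest for the filter proportional to the spectrum, as the Cauchy-Schwarz bound predicts.

```python
# Illustrative sketch: each neuron's response is an inner product between an
# auditory spectrum p(y) and a response area w(y); the Cauchy-Schwarz bound
# |<p, w>|^2 <= ||p||^2 ||w||^2 is met with equality only when w is
# proportional to p (the matched filter).
import numpy as np

rng = np.random.default_rng(0)
n_bins = 128
y = np.linspace(0.0, 1.0, n_bins)

# Toy auditory spectrum: a few harmonic-like peaks on a smooth floor.
p = 0.2 + sum(np.exp(-0.5 * ((y - c) / 0.02) ** 2) for c in (0.2, 0.4, 0.6))

def cortical_response(p, w):
    """r = <p, w>: response of a neuron with response area w to spectrum p."""
    return float(np.dot(p, w))

# Candidate unit-energy response areas: random shapes vs. the matched one.
candidates = [v / np.linalg.norm(v) for v in rng.standard_normal((50, n_bins))]
matched = p / np.linalg.norm(p)  # w proportional to p over its range

best_random = max(cortical_response(p, w) ** 2 for w in candidates)
print("best |r|^2 over random filters:", best_random)
print("|r|^2 for the matched filter  :", cortical_response(p, matched) ** 2)
```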
3 Signal-Respondent Neurons
[Figure: cortical response examples, panels (a)-(d); all points differ in phase]
4 Noise-Respondent Neurons
[Figure: cortical response examples, panels (a)-(b); all points differ in phase]
5 Noise Robustness
Assuming a conventional Fourier power spectrum with stationary white noise as the distortion, it can be shown mathematically that $S_{r,s} \ge S_p \ge S_{r,n}$, where
– $S_{r,s}$: SNR of a signal-respondent neuron
– $S_p$: SNR of the auditory spectrum in $R(x,s,\phi)$
– $S_{r,n}$: SNR of a noise-respondent neuron over the same range $R(x,s,\phi)$
Modeling inhibition can increase $S_{r,s}$ even more.
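The claimed ordering can be checked numerically in a simplified setting. This is a hedged toy model, not the paper's derivation: zero-mean additive white noise stands in for the spectral distortion, and a filter orthogonal to the spectrum stands in for a "noise-respondent" unit.

```python
# Toy check of the ordering S_r,s >= S_p >= S_r,n under a simplified model:
# zero-mean additive white noise on the power spectrum. The paper's result is
# derived for the auditory model itself; this is only a matched-filter analogy.
import numpy as np

rng = np.random.default_rng(1)
n_bins, sigma, trials = 128, 0.1, 20000
y = np.linspace(0.0, 1.0, n_bins)
p = 0.2 + np.exp(-0.5 * ((y - 0.3) / 0.03) ** 2)   # clean auditory spectrum

w_sig = p / np.linalg.norm(p)                      # matched ("signal-respondent")
w_noi = rng.standard_normal(n_bins)
w_noi -= (w_noi @ w_sig) * w_sig                   # orthogonal to the spectrum
w_noi /= np.linalg.norm(w_noi)                     # mismatched ("noise-respondent")

noise = sigma * rng.standard_normal((trials, n_bins))

def out_snr(w):
    """Squared clean response over the variance of the noise response."""
    return (p @ w) ** 2 / np.var(noise @ w)

S_p = np.mean(p ** 2) / sigma ** 2                 # per-bin SNR of the spectrum
print(f"S_r,s = {out_snr(w_sig):.1f}  S_p = {S_p:.1f}  S_r,n = {out_snr(w_noi):.1f}")
```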
6 Noise Robustness Experiments
[Figure: overall SNR plots for the vowel /iy/, fricative /dh/, affricate /jh/, and plosive /p/]
– $S_r(A_i)$: overall SNR of the signal-respondent neurons of phoneme $w_i$
– $S_r(U)$: overall SNR of the entire cortical response
– $S_p$: overall SNR of the auditory spectrum
7 Category-Dependent Feature Selection
[Diagram: feature-selection pipeline]
LVF: Low Variance Filter; HAF: High Activation Filter; NR: Neuron Reduction (via clustering and remapping); PCA: Principal Component Analysis
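A hypothetical sketch of such a pipeline, applying the four stages in the order the acronyms suggest. The thresholds, cluster count, target dimensionality, and even the stage order are assumptions for illustration, not the paper's settings.

```python
# Hypothetical category-dependent feature-selection pipeline:
# LVF -> HAF -> NR (clustering + remapping) -> PCA.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def select_features(X, var_floor=1e-3, act_quantile=0.5, n_clusters=64, n_pc=40):
    """X: (n_frames, n_neurons) cortical responses for one phoneme category."""
    # LVF: drop neurons whose response variance is too low to be informative.
    X = X[:, X.var(axis=0) > var_floor]
    # HAF: keep neurons with high average activation for this category.
    act = np.abs(X).mean(axis=0)
    X = X[:, act >= np.quantile(act, act_quantile)]
    # NR: cluster similar neurons, remap each cluster to its mean response.
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X.T)
    X = np.stack([X[:, labels == c].mean(axis=1) for c in range(n_clusters)],
                 axis=1)
    # PCA: final decorrelating projection to a compact feature vector.
    return PCA(n_components=n_pc).fit_transform(X)

# Example with random stand-in data:
feats = select_features(np.random.randn(500, 2048))
print(feats.shape)  # (500, 40)
```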
8 Hierarchical Classification
Single-Layer Classifier: uses standard Bayesian decision theory to classify a test observation into one of $N$ classes, using class-wise discriminants that estimate the a posteriori probabilities.
Hierarchical Classifier (Two-Layer Classifier): a two-stage process that first classifies a test observation into one of $M$ categories, then into one of the $|C_n|$ classes within the chosen category $C_n$.
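The two-layer decision rule can be sketched as follows. This is a minimal illustration with hypothetical Gaussian class models and equal priors standing in for the actual acoustic models and estimated posteriors.

```python
# Minimal sketch of the two-layer rule: a hard category decision, then a
# within-category class decision (equal priors assumed, so the highest
# log-likelihood is the highest posterior).
import numpy as np
from scipy.stats import multivariate_normal

def classify_two_layer(x, category_models, class_models, categories):
    """
    category_models: {m: frozen distribution} for the M categories
    class_models:    {class_label: frozen distribution}
    categories:      {m: list of class labels C_m in category m}
    """
    # Layer 1: pick the category with the highest score.
    m_star = max(category_models, key=lambda m: category_models[m].logpdf(x))
    # Layer 2: pick the best class within the chosen category only.
    return max(categories[m_star], key=lambda c: class_models[c].logpdf(x))

# Toy usage with two categories of two classes each:
mus = {"aa": [0, 0], "ae": [1, 0], "s": [5, 5], "sh": [6, 5]}
class_models = {c: multivariate_normal(mu, np.eye(2)) for c, mu in mus.items()}
category_models = {"vowel": multivariate_normal([0.5, 0], 2 * np.eye(2)),
                   "fricative": multivariate_normal([5.5, 5], 2 * np.eye(2))}
categories = {"vowel": ["aa", "ae"], "fricative": ["s", "sh"]}
print(classify_two_layer(np.array([5.8, 4.9]),
                         category_models, class_models, categories))  # "sh"
```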
9 Searching for a Categorization
The phoneme-wise variances are arranged into $N$ orderings (each ordering starting from a different "seed" phoneme). For each ordering, a CART-style splitting routine is applied to create a "phoneme tree," from which a list of candidate categorizations is obtained. We then search for the categorization with the best hierarchical classification performance over the training data (using initial models); a sketch of this procedure follows.
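Since the exact splitting criterion is not given here, the sketch below is a hypothetical reconstruction: it recursively splits an ordering at the point minimizing within-group variance spread, collects the partition at each tree depth as a candidate categorization, and would hand each candidate to the hierarchical classifier for scoring.

```python
# Hypothetical CART-style generation of candidate phoneme categorizations.
import numpy as np

def best_split(values):
    """Index minimizing total within-group variance of a 1-D ordering."""
    costs = [np.var(values[:i]) * i + np.var(values[i:]) * (len(values) - i)
             for i in range(1, len(values))]
    return 1 + int(np.argmin(costs))

def candidate_categorizations(phonemes, variances, max_depth=3):
    """Recursive splits; yields one candidate partition per tree depth."""
    partitions, groups = [], [list(range(len(phonemes)))]
    for _ in range(max_depth):
        new_groups = []
        for g in groups:
            if len(g) < 2:
                new_groups.append(g)       # singletons cannot split further
                continue
            k = best_split(variances[g])
            new_groups += [g[:k], g[k:]]
        groups = new_groups
        partitions.append([[phonemes[i] for i in g] for g in groups])
    return partitions

phonemes = ["iy", "ih", "eh", "s", "sh", "p", "t"]
variances = np.array([0.9, 0.8, 0.7, 0.3, 0.25, 0.05, 0.04])
for cand in candidate_categorizations(phonemes, variances):
    print(cand)
# In the full system, each candidate would be scored by hierarchical
# classification accuracy on the training set and the best one retained.
```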
10 Model Training
CI (category-independent) features are used to construct the category models, which are then refined with MCE training.
11 Hierarchical Classification
[Diagram: the two-layer classification procedure]
12 Phoneme Categorization
[Figure: the resulting phoneme categorization]
13 Phoneme Classification Results
[Table: classification rates (%) for clean speech in the TIMIT database (48 phonemes)]
[Table: classification rates (%) for varying SNR, features, and classifier configurations] (*74.51 when the 48 phonemes are mapped down to 39 according to convention)
SL: Single-Layer Classifier; TL: Two-Layer (Hierarchical) Classifier; CI: Category-Independent Features; CD: Category-Dependent Features (results produced after MCE training)
Generalization of the MCE Method Qiang Fu Research Advisor: Fred Juang School of Electrical and Computer Engineering Georgia Institute of Technology October 8, 2006
15 Synopsis
– Excellent detector results (6-class, 14-class, 48-class) reported; detector results used as "independent" information for rescoring.
– Generalization of the minimum-error principle to large-vocabulary continuous speech recognition:
– definition of competing events
– selection of training units (state, phone, ...)
– use of word graphs
– unequal error weights
16 Rescoring Using MVE Detectors
– We investigate the effects of combining the conventional ASR paradigm with phonetic-class detectors trained by MVE.
– We keep the segmentation information from the Viterbi decoder, which may affect the final improvement.
– The rescoring algorithm is flexible and can be adapted to different tasks.
17 Minimum Verification Error
Assume there are $M$ classes and $K$ training tokens. A token labeled as the $i$-th class may generate one Type I (miss) error and $M-1$ Type II (false alarm) errors. The key scores related to these two types of error are the detector log likelihood ratios
$d_j(X) = \log p(X \mid \lambda_j) - \log p(X \mid \bar{\lambda}_j), \quad j = 1, \dots, M,$
where $\lambda_j$ and $\bar{\lambda}_j$ are the $j$-th target model and anti-model, and the overall performance objective becomes
$L = \frac{1}{K} \sum_{k=1}^{K} \sum_{j=1}^{M} \big[ \kappa_I\, \ell(-d_j(X_k))\, \mathbb{1}(X_k \in C_j) + \kappa_{II}\, \ell(d_j(X_k))\, \mathbb{1}(X_k \notin C_j) \big].$
In the above, $\mathbb{1}$ is the indicator function, $\ell$ is a sigmoid function, and $\kappa_I$ and $\kappa_{II}$ are penalty weights for miss and false-alarm errors. A descent algorithm is then applied to minimize the overall error objective.
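In code, the objective reads as follows. This is a sketch of the reconstruction above; the detector scores `d` would come from the target/anti-model log likelihoods, here supplied directly as an array.

```python
# Sketch of the MVE objective: sigmoid-smoothed miss and false-alarm counts.
import numpy as np

def sigmoid(z, alpha=1.0):
    return 1.0 / (1.0 + np.exp(-alpha * z))

def mve_loss(d, labels, kappa_I=1.0, kappa_II=1.0):
    """
    d:      (K, M) log likelihood ratios d_j(X_k) from the M detectors
    labels: (K,) true class index for each token
    """
    K, M = d.shape
    is_target = np.zeros((K, M), dtype=bool)
    is_target[np.arange(K), labels] = True
    miss = sigmoid(-d[is_target])           # Type I: target score too low
    false_alarm = sigmoid(d[~is_target])    # Type II: impostor score too high
    return (kappa_I * miss.sum() + kappa_II * false_alarm.sum()) / K

rng = np.random.default_rng(3)
d = rng.standard_normal((100, 6))
labels = rng.integers(0, 6, size=100)
print(mve_loss(d, labels))
# The smoothed counts are differentiable, so detector parameters can be
# updated by a descent algorithm on this objective.
```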
18 Rescoring Paradigm
[Diagram: speech signals feed a conventional decoder (producing decoding scores and rescoring candidates) and a bank of MVE detectors 1 through M (producing detector scores); a rescoring algorithm combines them under Neyman-Pearson decision criteria and thresholds]
19 Rescoring Methods (I)
Suppose there are $M$ classes of sub-word units, and hence $M$ sets of detectors, each consisting of a target model $\lambda_j$ and an anti-model $\bar{\lambda}_j$. For a segment $X$ decoded as the $i$-th class with log likelihood $g_i(X)$, its $j$-th ($j = 1, 2, \dots, M$) detector scores are $s_j(X) = \log p(X \mid \lambda_j)$ and $\bar{s}_j(X) = \log p(X \mid \bar{\lambda}_j)$, respectively; the log likelihood ratio for the $j$-th detector is $r_j(X) = s_j(X) - \bar{s}_j(X)$. We call $G_i(X)$ the score for the test segment belonging to class $i$ after combination.
Method 1: Naive Adding (NA). We simply add the decoder score and the detector score together:
$G_i(X) = \big(g_i(X) - \bar{s}_i(X)\big) + r_i(X).$
The reason for subtracting the anti-model score is to scale the decoding score into a dynamic range close to that of the likelihood ratio. This step is also taken in the following two methods.
20 Rescoring Methods (II)
Method 2: Competitive Rescoring (CR). We add the decoder score and a "competitive" score together, where the competitive score is a "distance measure" between the claimed class and its competitors.
Method 3: Remodeled Posterior Probability (RPP). We compute the "remodeled posterior probability" of the claimed class from the detector scores.
Plausible forms of all three combinations are sketched below.
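In the sketch below, NA follows the description on the previous slide; the CR "distance measure" and the RPP form are not specified here, so the versions used (margin over the best competitor; softmax over the likelihood ratios) are assumptions for illustration only.

```python
# Sketch of the three score-combination rules (CR and RPP forms assumed).
import numpy as np

def rescore(g, s, s_anti, i, method="NA"):
    """
    g:      (M,) decoder log likelihoods; g[i] is for the decoded class i
    s:      (M,) detector target-model log likelihoods
    s_anti: (M,) detector anti-model log likelihoods
    """
    r = s - s_anti                        # detector log likelihood ratios
    scaled = g[i] - s_anti[i]             # decoder score in the LR's range
    if method == "NA":                    # naive adding
        return scaled + r[i]
    if method == "CR":                    # assumed: margin over best competitor
        return scaled + (r[i] - np.delete(r, i).max())
    if method == "RPP":                   # assumed: softmax posterior over LRs
        z = r - r.max()                   # stabilize the softmax
        return np.exp(z[i]) / np.exp(z).sum()   # (role of g unspecified; omitted)
    raise ValueError(method)

rng = np.random.default_rng(4)
g, s, s_anti = rng.standard_normal((3, 6))
for m in ("NA", "CR", "RPP"):
    print(m, rescore(g, s, s_anti, i=2, method=m))
```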
21 Experimental Setup
– Experiments are conducted on the TIMIT database (3696 training utterances and 1344 test utterances; 119,580 training tokens for the MVE detectors) using three-state HMMs. Rescoring candidates are generated using HVite.
– The decoder models are trained by the Maximum Likelihood (ML) method; the detectors are trained by MVE.
– Performance is examined on 6-class (Rabiner and Juang, 1993), 14-class (Deller et al., 1999), and 48-class (Lee and Hon, ASSP-1989) broad phonetic categories, respectively.
– The models for both the decoder and the detectors are trained on 39-dimensional MFCC feature vectors (12 MFCC + 12 delta + 12 acceleration + 3 log energy).
22 Rescoring Performance
[Table: phoneme-class accuracy (%) for Method 1 (NA), Method 2 (CR), and Method 3 (RPP), reporting baseline, upper bound, rescored, and relative improvement for the 6-class, 14-class, and 48-class tasks]
Need to perform phone- or word-level rescoring.
23 Conclusions and Future Work
– Three different rescoring methods were introduced; the experimental results show that creating a pseudo-phone graph and re-computing the posterior probability achieves the best performance enhancement. MVE-trained detectors show promising results in helping conventional ASR techniques.
– The detectors can be optimized with respect to features or attributes (e.g., features representing articulatory knowledge) and used for re-ranking the decoded candidates.
– Bottom-up event detection and information fusion will be conducted on continuous speech signals in the future.
24 MCE Generalization
MCE criterion formulation:
1. Define the performance objective and the corresponding task evaluation measure;
2. Specify the target event (i.e., the correct label), the competing events (i.e., the incorrect hypotheses from the recognizer), and the corresponding models;
3. Construct the objective function and set its hyper-parameters;
4. Choose a suitable optimization method to update the parameters.
This work is the first part of an extensive generalization of the MCE training criterion. Due to limited space, only the first step, which is also the most fundamental one, is discussed here. A sketch of the classical formulation these steps generalize is given below.
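For reference, the classical MCE building blocks being generalized are a misclassification measure comparing the target discriminant against a soft maximum over competitors, passed through a sigmoid loss. This is the standard textbook form, not the paper's extended definition.

```python
# Classical MCE: misclassification measure + sigmoid loss.
import numpy as np

def misclassification(g, i, eta=5.0):
    """d_i = -g_i + soft maximum of the competing discriminants g_j, j != i."""
    comp = np.delete(g, i)
    return -g[i] + np.log(np.mean(np.exp(eta * comp))) / eta

def mce_loss(g, i, alpha=1.0):
    """Smoothed 0-1 error: sigmoid of the misclassification measure."""
    return 1.0 / (1.0 + np.exp(-alpha * misclassification(g, i)))

g = np.array([2.0, 0.5, -1.0, 1.8])   # discriminant scores for 4 classes
print(mce_loss(g, i=0))                # lower: correct class wins
print(mce_loss(g, i=2))                # near 1: class 2 is badly beaten
```

Generalizing step 1 amounts to redefining $g$, the competitor set, and the error weighting to match the task evaluation measure (e.g., word errors on a word graph rather than isolated classification errors).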
25 Strict Boundary and Relaxed Boundary
[Diagram: two word graphs for a labeled word "A" between "start" and "end," contrasting strict and relaxed boundary definitions of the target words ("A") and the competing words (e.g., "B")]
26 Experimental Setup
– Experiments are conducted on the WSJ0 database (7077 training utterances and 330 test utterances).
– All models are three-state HMMs with 8 Gaussian mixtures per state; in total there are 7385 physical models, logical models, and 2329 tied states.
– The models are built on 39-dimensional MFCC feature vectors (12 MFCC + 12 delta + 12 acceleration + 3 log energy).
– The baseline recognizer follows the standard large-vocabulary continuous speech recognition recipe using HTK.
– We investigated three cases of maximizing the generalized posterior probability (GPP) at different training levels (word, phone, state).
27 Results
Table 1: Word Error Rate (WER) and Sentence Error Rate (SER) for WSJ0-eval using different training levels
[Table rows: Baseline, Word-level, Phone-level, State-level; columns: Training level, WER (%), SER (%)]
28 Conclusion & Future Work
– We generalize the criterion for minimum classification error (MCE) training and investigate its impact on recognition performance. This paper is the first part of an extensive generalization of MCE training.
– The experiments are conducted within the framework of "maximizing the posterior probability." The impact of different training levels is investigated, and the phone level gives the best performance.
– Further investigation of various tasks based on this generalized framework is in progress.