Speech Analysis and Cognition Using Category-Dependent Features in a Model of the Central Auditory System Woojay Jeon Research Advisor: Fred Juang School of Electrical and Computer Engineering Georgia Institute of Technology October 8, 2006
1 Synopsis of Project
– One of the very few attempts, if any, to address auditory modeling beyond the periphery (ear, cochlea, even auditory nerve fibers) for ASR.
– Implemented a model (periphery + 3D cortical model) to calculate the cortical response to stimuli.
– Investigated cortical representations in ASR; conducted a comprehensive comparative study to understand robustness in auditory representations.
– Developed a methodology for analyzing robustness based on matched filter theory.
– Spawned a new development based on category-dependent feature selection and hierarchical pattern recognition.
2 Matched Filtering
Cortical response: $r(x,s,\phi) = \int_{R(x,s,\phi)} p(y)\, w(y;x,s,\phi)\, dy$, where
– $p(y)$: power (auditory) spectrum
– $w(y;x,s,\phi)$: response area of the neuron at $(x,s,\phi)$
– $r(x,s,\phi)$: cortical response
– $R(x,s,\phi)$: non-zero frequency range of $w(y;x,s,\phi)$
The Cauchy-Schwarz inequality tells us that $|r(x,s,\phi)|^2$ will be maximum when $p(y) \propto w(y;x,s,\phi)$ over $R(x,s,\phi)$. If $R(x,s,\phi)$ includes enough spectral peaks, we can also use the spectral envelope $v(y)$ in place of $p(y)$.
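To make the matched-filter argument concrete, here is a minimal Python sketch (a toy illustration, not the actual periphery/cortical implementation; the spectrum shape and the candidate filters below are invented for demonstration): it verifies that among unit-energy response areas, the squared response $|r|^2$ is largest for the filter proportional to the spectrum, as the Cauchy-Schwarz bound predicts.

```python
# Illustrative sketch: each neuron's response is an inner product between an
# auditory spectrum p(y) and a response area w(y); the Cauchy-Schwarz bound
# |<p, w>|^2 <= ||p||^2 ||w||^2 is met with equality only when w is
# proportional to p (the matched filter).
import numpy as np

rng = np.random.default_rng(0)
n_bins = 128
y = np.linspace(0.0, 1.0, n_bins)

# Toy auditory spectrum: a few harmonic-like peaks on a smooth floor.
p = 0.2 + sum(np.exp(-0.5 * ((y - c) / 0.02) ** 2) for c in (0.2, 0.4, 0.6))

def cortical_response(p, w):
    """r = <p, w>: response of a neuron with response area w to spectrum p."""
    return float(np.dot(p, w))

# Candidate unit-energy response areas: random shapes vs. the matched one.
candidates = [v / np.linalg.norm(v) for v in rng.standard_normal((50, n_bins))]
matched = p / np.linalg.norm(p)  # w proportional to p over its range

best_random = max(cortical_response(p, w) ** 2 for w in candidates)
print("best |r|^2 over random filters:", best_random)
print("|r|^2 for the matched filter  :", cortical_response(p, matched) ** 2)
```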
3 Signal-Respondent Neurons
[Figure: cortical response examples, panels (a)-(d); all points differ in phase]
4 Noise-Respondent Neurons
[Figure: cortical response examples, panels (a)-(b); all points differ in phase]
5 Noise Robustness
Assuming a conventional Fourier power spectrum with stationary white noise as the distortion, it can be shown mathematically that $S_{r,s} \ge S_p \ge S_{r,n}$, where
– $S_{r,s}$: SNR of a signal-respondent neuron
– $S_p$: SNR of the auditory spectrum in $R(x,s,\phi)$
– $S_{r,n}$: SNR of a noise-respondent neuron over the same range $R(x,s,\phi)$
Modeling inhibition can increase $S_{r,s}$ even more.
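The claimed ordering can be checked numerically in a simplified setting. This is a hedged toy model, not the paper's derivation: zero-mean additive white noise stands in for the spectral distortion, and a filter orthogonal to the spectrum stands in for a "noise-respondent" unit.

```python
# Toy check of the ordering S_r,s >= S_p >= S_r,n under a simplified model:
# zero-mean additive white noise on the power spectrum. The paper's result is
# derived for the auditory model itself; this is only a matched-filter analogy.
import numpy as np

rng = np.random.default_rng(1)
n_bins, sigma, trials = 128, 0.1, 20000
y = np.linspace(0.0, 1.0, n_bins)
p = 0.2 + np.exp(-0.5 * ((y - 0.3) / 0.03) ** 2)   # clean auditory spectrum

w_sig = p / np.linalg.norm(p)                      # matched ("signal-respondent")
w_noi = rng.standard_normal(n_bins)
w_noi -= (w_noi @ w_sig) * w_sig                   # orthogonal to the spectrum
w_noi /= np.linalg.norm(w_noi)                     # mismatched ("noise-respondent")

noise = sigma * rng.standard_normal((trials, n_bins))

def out_snr(w):
    """Squared clean response over the variance of the noise response."""
    return (p @ w) ** 2 / np.var(noise @ w)

S_p = np.mean(p ** 2) / sigma ** 2                 # per-bin SNR of the spectrum
print(f"S_r,s = {out_snr(w_sig):.1f}  S_p = {S_p:.1f}  S_r,n = {out_snr(w_noi):.1f}")
```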
6 Noise Robustness Experiments
[Figure: overall SNR plots for the vowel /iy/, fricative /dh/, affricate /jh/, and plosive /p/]
– $S_r(A_i)$: overall SNR of the signal-respondent neurons of phoneme $w_i$
– $S_r(U)$: overall SNR of the entire cortical response
– $S_p$: overall SNR of the auditory spectrum
7 Category-Dependent Feature Selection
[Diagram: feature-selection pipeline]
LVF: Low Variance Filter; HAF: High Activation Filter; NR: Neuron Reduction (via clustering and remapping); PCA: Principal Component Analysis
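A hypothetical sketch of such a pipeline, applying the four stages in the order the acronyms suggest. The thresholds, cluster count, target dimensionality, and even the stage order are assumptions for illustration, not the paper's settings.

```python
# Hypothetical category-dependent feature-selection pipeline:
# LVF -> HAF -> NR (clustering + remapping) -> PCA.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def select_features(X, var_floor=1e-3, act_quantile=0.5, n_clusters=64, n_pc=40):
    """X: (n_frames, n_neurons) cortical responses for one phoneme category."""
    # LVF: drop neurons whose response variance is too low to be informative.
    X = X[:, X.var(axis=0) > var_floor]
    # HAF: keep neurons with high average activation for this category.
    act = np.abs(X).mean(axis=0)
    X = X[:, act >= np.quantile(act, act_quantile)]
    # NR: cluster similar neurons, remap each cluster to its mean response.
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X.T)
    X = np.stack([X[:, labels == c].mean(axis=1) for c in range(n_clusters)],
                 axis=1)
    # PCA: final decorrelating projection to a compact feature vector.
    return PCA(n_components=n_pc).fit_transform(X)

# Example with random stand-in data:
feats = select_features(np.random.randn(500, 2048))
print(feats.shape)  # (500, 40)
```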
8 Hierarchical Classification
Single-Layer Classifier: uses standard Bayesian decision theory to classify a test observation into one of $N$ classes, using class-wise discriminants that estimate the a posteriori probabilities.
Hierarchical Classifier (Two-Layer Classifier): a two-stage process that first classifies a test observation into one of $M$ categories, then into one of the $|C_n|$ classes within the chosen category $C_n$.
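The two-layer decision rule can be sketched as follows. This is a minimal illustration with hypothetical Gaussian class models and equal priors standing in for the actual acoustic models and estimated posteriors.

```python
# Minimal sketch of the two-layer rule: a hard category decision, then a
# within-category class decision (equal priors assumed, so the highest
# log-likelihood is the highest posterior).
import numpy as np
from scipy.stats import multivariate_normal

def classify_two_layer(x, category_models, class_models, categories):
    """
    category_models: {m: frozen distribution} for the M categories
    class_models:    {class_label: frozen distribution}
    categories:      {m: list of class labels C_m in category m}
    """
    # Layer 1: pick the category with the highest score.
    m_star = max(category_models, key=lambda m: category_models[m].logpdf(x))
    # Layer 2: pick the best class within the chosen category only.
    return max(categories[m_star], key=lambda c: class_models[c].logpdf(x))

# Toy usage with two categories of two classes each:
mus = {"aa": [0, 0], "ae": [1, 0], "s": [5, 5], "sh": [6, 5]}
class_models = {c: multivariate_normal(mu, np.eye(2)) for c, mu in mus.items()}
category_models = {"vowel": multivariate_normal([0.5, 0], 2 * np.eye(2)),
                   "fricative": multivariate_normal([5.5, 5], 2 * np.eye(2))}
categories = {"vowel": ["aa", "ae"], "fricative": ["s", "sh"]}
print(classify_two_layer(np.array([5.8, 4.9]),
                         category_models, class_models, categories))  # "sh"
```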
9 Searching for a Categorization
The phoneme-wise variances are arranged into $N$ orderings (each ordering starting from a different "seed" phoneme). For each ordering, a CART-style splitting routine is applied to create a "phoneme tree," from which a list of candidate categorizations is obtained. We then search for the categorization with the best hierarchical classification performance over the training data (using initial models); a sketch of this procedure follows.
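Since the exact splitting criterion is not given here, the sketch below is a hypothetical reconstruction: it recursively splits an ordering at the point minimizing within-group variance spread, collects the partition at each tree depth as a candidate categorization, and would hand each candidate to the hierarchical classifier for scoring.

```python
# Hypothetical CART-style generation of candidate phoneme categorizations.
import numpy as np

def best_split(values):
    """Index minimizing total within-group variance of a 1-D ordering."""
    costs = [np.var(values[:i]) * i + np.var(values[i:]) * (len(values) - i)
             for i in range(1, len(values))]
    return 1 + int(np.argmin(costs))

def candidate_categorizations(phonemes, variances, max_depth=3):
    """Recursive splits; yields one candidate partition per tree depth."""
    partitions, groups = [], [list(range(len(phonemes)))]
    for _ in range(max_depth):
        new_groups = []
        for g in groups:
            if len(g) < 2:
                new_groups.append(g)       # singletons cannot split further
                continue
            k = best_split(variances[g])
            new_groups += [g[:k], g[k:]]
        groups = new_groups
        partitions.append([[phonemes[i] for i in g] for g in groups])
    return partitions

phonemes = ["iy", "ih", "eh", "s", "sh", "p", "t"]
variances = np.array([0.9, 0.8, 0.7, 0.3, 0.25, 0.05, 0.04])
for cand in candidate_categorizations(phonemes, variances):
    print(cand)
# In the full system, each candidate would be scored by hierarchical
# classification accuracy on the training set and the best one retained.
```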
10 Model Training
CI (category-independent) features are used to construct the category models, which are then refined with MCE training.
11 Hierarchical Classification
[Diagram: the two-layer classification procedure]
12 Phoneme Categorization
[Figure: the resulting phoneme categorization]
13 Phoneme Classification Results
[Table: classification rates (%) for clean speech in the TIMIT database (48 phonemes)]
[Table: classification rates (%) for varying SNR, features, and classifier configurations] (*74.51 when the 48 phonemes are mapped down to 39 according to convention)
SL: Single-Layer Classifier; TL: Two-Layer (Hierarchical) Classifier; CI: Category-Independent Features; CD: Category-Dependent Features (results produced after MCE training)
Generalization of the MCE Method Qiang Fu Research Advisor: Fred Juang School of Electrical and Computer Engineering Georgia Institute of Technology October 8, 2006
15 Synopsis
– Excellent detector results (6-class, 14-class, 48-class) reported; detector results used as "independent" information for rescoring.
– Generalization of the minimum-error principle to large-vocabulary continuous speech recognition:
– definition of competing events
– selection of training units (state, phone, ...)
– use of word graphs
– unequal error weights
16 Rescoring Using MVE Detectors
– We investigate the effects of combining the conventional ASR paradigm with phonetic-class detectors trained by MVE.
– We keep the segmentation information from the Viterbi decoder, which may affect the final improvement.
– The rescoring algorithm is flexible and can be adapted to different tasks.
17 Minimum Verification Error
Assume there are $M$ classes and $K$ training tokens. A token labeled as the $i$-th class may generate one Type I (miss) error and $M-1$ Type II (false alarm) errors. The key scores related to these two types of error are the detector log likelihood ratios
$d_j(X) = \log p(X \mid \lambda_j) - \log p(X \mid \bar{\lambda}_j), \quad j = 1, \dots, M,$
where $\lambda_j$ and $\bar{\lambda}_j$ are the $j$-th target model and anti-model, and the overall performance objective becomes
$L = \frac{1}{K} \sum_{k=1}^{K} \sum_{j=1}^{M} \big[ \kappa_I\, \ell(-d_j(X_k))\, \mathbb{1}(X_k \in C_j) + \kappa_{II}\, \ell(d_j(X_k))\, \mathbb{1}(X_k \notin C_j) \big].$
In the above, $\mathbb{1}$ is the indicator function, $\ell$ is a sigmoid function, and $\kappa_I$ and $\kappa_{II}$ are penalty weights for miss and false-alarm errors. A descent algorithm is then applied to minimize the overall error objective.
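In code, the objective reads as follows. This is a sketch of the reconstruction above; the detector scores `d` would come from the target/anti-model log likelihoods, here supplied directly as an array.

```python
# Sketch of the MVE objective: sigmoid-smoothed miss and false-alarm counts.
import numpy as np

def sigmoid(z, alpha=1.0):
    return 1.0 / (1.0 + np.exp(-alpha * z))

def mve_loss(d, labels, kappa_I=1.0, kappa_II=1.0):
    """
    d:      (K, M) log likelihood ratios d_j(X_k) from the M detectors
    labels: (K,) true class index for each token
    """
    K, M = d.shape
    is_target = np.zeros((K, M), dtype=bool)
    is_target[np.arange(K), labels] = True
    miss = sigmoid(-d[is_target])           # Type I: target score too low
    false_alarm = sigmoid(d[~is_target])    # Type II: impostor score too high
    return (kappa_I * miss.sum() + kappa_II * false_alarm.sum()) / K

rng = np.random.default_rng(3)
d = rng.standard_normal((100, 6))
labels = rng.integers(0, 6, size=100)
print(mve_loss(d, labels))
# The smoothed counts are differentiable, so detector parameters can be
# updated by a descent algorithm on this objective.
```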
18 Rescoring Paradigm
[Diagram: speech signals feed a conventional decoder (producing decoding scores and rescoring candidates) and a bank of MVE detectors 1 through M (producing detector scores); a rescoring algorithm combines them under Neyman-Pearson decision criteria and thresholds]
19 Rescoring Methods (I)
Suppose there are $M$ classes of sub-word units, and hence $M$ sets of detectors, each consisting of a target model $\lambda_j$ and an anti-model $\bar{\lambda}_j$. For a segment $X$ decoded as the $i$-th class with log likelihood $g_i(X)$, its $j$-th ($j = 1, 2, \dots, M$) detector scores are $s_j(X) = \log p(X \mid \lambda_j)$ and $\bar{s}_j(X) = \log p(X \mid \bar{\lambda}_j)$, respectively; the log likelihood ratio for the $j$-th detector is $r_j(X) = s_j(X) - \bar{s}_j(X)$. We call $G_i(X)$ the score for the test segment belonging to class $i$ after combination.
Method 1: Naive Adding (NA). We simply add the decoder score and the detector score together:
$G_i(X) = \big(g_i(X) - \bar{s}_i(X)\big) + r_i(X).$
The reason for subtracting the anti-model score is to scale the decoding score into a dynamic range close to that of the likelihood ratio. This step is also taken in the following two methods.
20 Rescoring Methods (II)
Method 2: Competitive Rescoring (CR). We add the decoder score and a "competitive" score together, where the competitive score is a "distance measure" between the claimed class and its competitors.
Method 3: Remodeled Posterior Probability (RPP). We compute the "remodeled posterior probability" of the claimed class from the detector scores.
Plausible forms of all three combinations are sketched below.
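In the sketch below, NA follows the description on the previous slide; the CR "distance measure" and the RPP form are not specified here, so the versions used (margin over the best competitor; softmax over the likelihood ratios) are assumptions for illustration only.

```python
# Sketch of the three score-combination rules (CR and RPP forms assumed).
import numpy as np

def rescore(g, s, s_anti, i, method="NA"):
    """
    g:      (M,) decoder log likelihoods; g[i] is for the decoded class i
    s:      (M,) detector target-model log likelihoods
    s_anti: (M,) detector anti-model log likelihoods
    """
    r = s - s_anti                        # detector log likelihood ratios
    scaled = g[i] - s_anti[i]             # decoder score in the LR's range
    if method == "NA":                    # naive adding
        return scaled + r[i]
    if method == "CR":                    # assumed: margin over best competitor
        return scaled + (r[i] - np.delete(r, i).max())
    if method == "RPP":                   # assumed: softmax posterior over LRs
        z = r - r.max()                   # stabilize the softmax
        return np.exp(z[i]) / np.exp(z).sum()   # (role of g unspecified; omitted)
    raise ValueError(method)

rng = np.random.default_rng(4)
g, s, s_anti = rng.standard_normal((3, 6))
for m in ("NA", "CR", "RPP"):
    print(m, rescore(g, s, s_anti, i=2, method=m))
```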
21 Experimental Setup
– Experiments are conducted on the TIMIT database (3696 training utterances and 1344 test utterances; 119,580 training tokens for the MVE detectors) using three-state HMMs. Rescoring candidates are generated using HVite.
– The decoder models are trained by the Maximum Likelihood (ML) method; the detectors are trained by MVE.
– Performance is examined on 6-class (Rabiner and Juang, 1993), 14-class (Deller et al., 1999), and 48-class (Lee and Hon, ASSP-1989) broad phonetic categories, respectively.
– The models for both the decoder and the detectors are trained on 39-dimensional MFCC feature vectors (12 MFCC + 12 delta + 12 acceleration + 3 log energy).
22 Rescoring Performance
[Table: phoneme-class accuracy (%) for Method 1 (NA), Method 2 (CR), and Method 3 (RPP), reporting baseline, upper bound, rescored, and relative improvement for the 6-class, 14-class, and 48-class tasks]
Need to perform phone- or word-level rescoring.
23 Conclusions and Future Work
– Three different rescoring methods were introduced; the experimental results show that creating a pseudo-phone graph and re-computing the posterior probability achieves the best performance enhancement. MVE-trained detectors show promising results in helping conventional ASR techniques.
– The detectors can be optimized with respect to features or attributes (e.g., features representing articulatory knowledge) and used for re-ranking the decoded candidates.
– Bottom-up event detection and information fusion will be conducted on continuous speech signals in the future.
24 MCE Generalization
MCE criterion formulation:
1. Define the performance objective and the corresponding task evaluation measure;
2. Specify the target event (i.e., the correct label), the competing events (i.e., the incorrect hypotheses from the recognizer), and the corresponding models;
3. Construct the objective function and set its hyper-parameters;
4. Choose a suitable optimization method to update the parameters.
This work is the first part of an extensive generalization of the MCE training criterion. Due to limited space, only the first step, which is also the most fundamental one, is discussed here. A sketch of the classical formulation these steps generalize is given below.
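For reference, the classical MCE building blocks being generalized are a misclassification measure comparing the target discriminant against a soft maximum over competitors, passed through a sigmoid loss. This is the standard textbook form, not the paper's extended definition.

```python
# Classical MCE: misclassification measure + sigmoid loss.
import numpy as np

def misclassification(g, i, eta=5.0):
    """d_i = -g_i + soft maximum of the competing discriminants g_j, j != i."""
    comp = np.delete(g, i)
    return -g[i] + np.log(np.mean(np.exp(eta * comp))) / eta

def mce_loss(g, i, alpha=1.0):
    """Smoothed 0-1 error: sigmoid of the misclassification measure."""
    return 1.0 / (1.0 + np.exp(-alpha * misclassification(g, i)))

g = np.array([2.0, 0.5, -1.0, 1.8])   # discriminant scores for 4 classes
print(mce_loss(g, i=0))                # lower: correct class wins
print(mce_loss(g, i=2))                # near 1: class 2 is badly beaten
```

Generalizing step 1 amounts to redefining $g$, the competitor set, and the error weighting to match the task evaluation measure (e.g., word errors on a word graph rather than isolated classification errors).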
25 Strict Boundary and Relaxed Boundary
[Diagram: two word graphs for a labeled word "A" between "start" and "end," contrasting strict and relaxed boundary definitions of the target words ("A") and the competing words (e.g., "B")]
26 Experimental Setup
– Experiments are conducted on the WSJ0 database (7077 training utterances and 330 test utterances).
– All models are three-state HMMs with 8 Gaussian mixtures per state; in total there are 7385 physical models, logical models, and 2329 tied states.
– The models are built on 39-dimensional MFCC feature vectors (12 MFCC + 12 delta + 12 acceleration + 3 log energy).
– The baseline recognizer follows the standard large-vocabulary continuous speech recognition recipe using HTK.
– We investigated three cases of maximizing the generalized posterior probability (GPP) at different training levels (word, phone, state).
27 Results
Table 1: Word Error Rate (WER) and Sentence Error Rate (SER) for WSJ0-eval using different training levels
[Table rows: Baseline, Word-level, Phone-level, State-level; columns: Training level, WER (%), SER (%)]
28 Conclusion & Future Work
– We generalize the criterion for minimum classification error (MCE) training and investigate its impact on recognition performance. This paper is the first part of an extensive generalization of MCE training.
– The experiments are conducted within the framework of "maximizing the posterior probability." The impact of different training levels is investigated, and the phone level gives the best performance.
– Further investigation of various tasks based on this generalized framework is in progress.