Discriminative Feature Optimization for Speech Recognition

Discriminative Feature Optimization for Speech Recognition
Bing Zhang
College of Computer & Information Science, Northeastern University

Outline
- Introduction
- Problem to attack
- Methodology
  - Region-dependent feature transform
  - Discriminative optimization of the feature transform
- Implementation
- System description & results
- Conclusions

Introduction
- Speech recognition
  - Goal: transcribe speech into text
  - Performance measurement: word error rate (WER)
  - Typical approach:
    - Training: statistically model the acoustic and linguistic knowledge
    - Recognition: search for the most probable word sequence using the models
- Speech feature extraction
  - Reason: raw signals cannot be modeled robustly due to their high dimensionality, so compact features must be extracted
  - Two stages of feature extraction: speech analysis → cepstral coefficients; speech feature transformation
- In this thesis: a better feature transformation approach is developed to reduce the WER of the speech recognition system

Introduction (cont.)
[Diagram: a typical speech recognition system. The speech signal goes through Feature Extraction to produce features; the Search Engine, driven by the Acoustic Model and the Language Model, turns the features into the output word sequence.]

Language Model
- N-grams: model the conditional probability of any word given the N-1 words in its history
- The product of N-gram probabilities can be used to approximate the probability of a sequence of words:
  P(w1, w2, …, wk) ≈ P(w1) P(w2 | w1) P(w3 | w1, w2) … P(wN | w1, …, w_{N-1}) … P(w_{k-1} | w_{k-N}, …, w_{k-2}) P(wk | w_{k-N+1}, …, w_{k-1})
- Special cases:
  - Unigram: P(wi)
  - Bigram: P(wi | w_{i-1})
  - Trigram: P(wi | w_{i-2}, w_{i-1})
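A minimal sketch of the chain-rule approximation above, scoring a sentence with a toy trigram table; the probability values and the floor used for unseen trigrams are hypothetical.

```python
import math

# Hypothetical trigram probabilities P(w | w_{i-2}, w_{i-1}).
trigram = {
    ("<s>", "<s>", "this"): 0.2,
    ("<s>", "this", "is"): 0.5,
    ("this", "is", "a"): 0.4,
    ("is", "a", "test"): 0.1,
}

def sentence_logprob(words, lm, floor=1e-8):
    history = ["<s>", "<s>"]          # pad the history for the first words
    logp = 0.0
    for w in words:
        p = lm.get((history[0], history[1], w), floor)  # crude floor for unseen n-grams
        logp += math.log(p)
        history = [history[1], w]     # shift the two-word history
    return logp

print(sentence_logprob(["this", "is", "a", "test"], trigram))
```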

HMM-based Acoustic Model
- A repository of unit HMMs (Hidden Markov Models)
  - Each HMM is a probabilistic finite state machine with an output at each hidden state:
    - Transition probabilities
    - Observation probabilities (modeled by a mixture of Gaussians for each state)
  - Each HMM represents a basic unit of speech, e.g., a phoneme or crossword/non-crossword multiphones
- HMM state-clusters: specify which HMM states can share which parameters
- Pronunciation dictionary: phonetic spelling of the words
- Diagonal-covariance Gaussian distributions are usually assumed

Example of an HMM
[Diagram: a left-to-right HMM with states Start, 1, 2, 3, 4, End, emitting the observation sequence o1 … o6.]

Example of an HMM (cont.)
[Diagram: two possible state alignments of the observations o1 … o6. One path visits states 1, 1, 2, 3, 3, 4 and scores b1(o1) b1(o2) b2(o3) b3(o4) b3(o5) b4(o6); the other takes transitions a12, a22, a24, a44 through states 1, 2, 2, 2, 4, 4 and scores b1(o1) b2(o2) b2(o3) b2(o4) b4(o5) b4(o6). Each path's probability is the product of its transition probabilities a_ij and observation likelihoods b_j(o_t).]
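A minimal sketch of the quantity the example illustrates: the joint probability of one state path and the observations, i.e., the product of transition probabilities a_ij and emission likelihoods b_j(o_t). The transition values and the stand-in Gaussian emissions below are hypothetical.

```python
import numpy as np

# Hypothetical transition probabilities a_ij, including the a24 skip.
A = {("S", 1): 1.0, (1, 1): 0.6, (1, 2): 0.4, (2, 2): 0.5, (2, 4): 0.5,
     (4, 4): 0.7, (4, "E"): 0.3}

def emission(state, obs):
    # stand-in for a Gaussian-mixture likelihood b_state(obs)
    means = {1: 0.0, 2: 1.0, 4: 2.0}
    return float(np.exp(-0.5 * (obs - means[state]) ** 2))

def path_likelihood(path, observations):
    p, prev = 1.0, "S"
    for state, obs in zip(path, observations):
        p *= A[(prev, state)] * emission(state, obs)   # a_{prev,state} * b_state(o_t)
        prev = state
    return p * A[(prev, "E")]

obs = [0.1, 0.9, 1.1, 1.0, 2.2, 1.9]
print(path_likelihood([1, 2, 2, 2, 4, 4], obs))  # one of many possible paths
```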

Acoustic Training
- Maximum likelihood (ML) training
  - Objective: maximize the likelihood of the observed features given the model
  - Algorithm: expectation-maximization (EM)
- Discriminative training
  - Objective: train the model to distinguish the correct word sequence from competing hypotheses
  - Criterion: minimum phoneme error (MPE)
  - Representation of hypotheses: lattices
  - Algorithm: extended EM
[Figure: a word lattice for the utterance "this is a test sentence", with competing arcs such as "sense", "the quest", "guest", and SIL (silence).]
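The ML/EM side of this slide can be made concrete on a toy problem. Below is a minimal sketch of EM for a 1-D, two-component Gaussian mixture; the same E-step/M-step pattern reestimates the HMM observation densities, where the occupancies come from forward-backward instead. All data and starting values are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 200)])

w = np.array([0.5, 0.5])         # mixture weights
mu = np.array([-1.0, 1.0])       # means
var = np.array([1.0, 1.0])       # variances

for _ in range(20):
    # E-step: posterior (responsibility) of each component for each sample
    lik = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    gamma = lik / lik.sum(axis=1, keepdims=True)
    # M-step: reestimate parameters from the soft counts
    n = gamma.sum(axis=0)
    w = n / len(x)
    mu = (gamma * x[:, None]).sum(axis=0) / n
    var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / n

print(mu, var)  # should approach the true means (-2, 3) and unit variances
```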

Feature Extraction
- Speech analysis
  - Extracts distinguishing characteristics of speech (e.g., formant locations) from the digital signal
  - Examples: MFCC (mel-frequency cepstral coefficients), PLP (perceptual linear prediction)
  - Resulting features: cepstral coefficients, i.e., the coefficients of the inverse-transformed log power spectrum
- Speech feature transformation
  - Applied on top of the cepstral coefficients
  - Transforms the cepstral features to better fit the model:
    - helps the HMM model the trajectory of the cepstral features
    - fits the diagonal-covariance assumption of the Gaussian components
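A minimal sketch of the speech-analysis stage for one frame: windowing → power spectrum → log filterbank energies → DCT → cepstral coefficients. The triangular "mel-like" filterbank here is a crude stand-in; real MFCC and PLP front ends differ in many details (pre-emphasis, liftering, PLP's auditory model).

```python
import numpy as np

def cepstra(frame, n_filters=24, n_ceps=14, sr=8000):
    # windowed power spectrum of one frame
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)))) ** 2
    # crude triangular filters, equally spaced on a mel-warped axis
    mel = 2595 * np.log10(1 + np.linspace(0, sr / 2, len(spec)) / 700)
    centers = np.linspace(mel.min(), mel.max(), n_filters + 2)
    fbank = np.maximum(0, 1 - np.abs(mel[None, :] - centers[1:-1, None])
                       / (centers[1] - centers[0]))
    logE = np.log(fbank @ spec + 1e-10)
    # DCT-II decorrelates the log energies -> cepstral coefficients
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), n + 0.5) / n_filters)
    return dct @ logE

frame = np.random.randn(200)      # one 25 ms frame at 8 kHz
print(cepstra(frame).shape)       # -> (14,)
```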

Commonly Used Feature Transforms
- LDA (linear discriminant analysis)
  - Transforms the features to maximize the distance between different classes while keeping each class as compact as possible
  - Assumes that all classes have equal covariance
- HLDA (heteroscedastic linear discriminant analysis)
  - Removes the equal-covariance assumption of LDA
  - Finds the feature transform that maximizes the likelihood of the data with respect to the acoustic model in the transformed space
- Others: HDA (heteroscedastic discriminant analysis), MLLT (maximum likelihood linear transform)
Note: what kind of classes?
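For reference, a minimal sketch of the LDA computation described above, posed as a generalized eigenproblem on the between-class and within-class scatter matrices; the toy data and class labels are synthetic.

```python
import numpy as np
from scipy.linalg import eigh

def lda(X, y, n_dims):
    mean = X.mean(axis=0)
    Sw = np.zeros((X.shape[1], X.shape[1]))   # within-class scatter
    Sb = np.zeros_like(Sw)                    # between-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        Sb += len(Xc) * np.outer(mc - mean, mc - mean)
    # generalized eigenvectors of (Sb, Sw); keep the top n_dims directions
    vals, vecs = eigh(Sb, Sw)
    return vecs[:, np.argsort(vals)[::-1][:n_dims]]

X = np.random.randn(300, 5)
y = np.repeat([0, 1, 2], 100)
X[y == 1] += 2.0                              # separate one class
W = lda(X, y, 2)
print((X @ W).shape)                          # -> (300, 2)
```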

Drawbacks of Traditional Feature Transforms
- Inaccurate assumptions about the acoustic model
  - LDA assumes equal class covariance
  - HDA & LDA ignore the diagonal-covariance assumption
- Linear transforms have limited power for feature extraction
  - Using more powerful transforms is risky when the criterion does not correlate with the WER
- The criteria do not correlate with the WER
  - Performance degrades on high-dimensional input features (experimental results in the thesis)
  - Performance degrades on highly correlated input features (example on the next slide)

Example
- The data has a linear dependency between two dimensions: Z = 2X
- If projected to 2-D, the result is essentially the original data (possibly rotated, if diagonal Gaussians are assumed)
- If projected to 1-D:
  - HLDA will map all samples to a single point
  - LDA will fail to find the answer at all, because the covariance matrix of each class is singular; even short of outright failure, the result is very sensitive to random errors, since the denominator is almost zero
[Figure: two classes plotted in the X, Y, Z space with the dependency Z = 2X.]
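A quick numerical check of the degenerate case above: when one dimension is an exact linear function of another (Z = 2X), every class covariance matrix is singular, which is exactly what breaks LDA/HLDA here. The data is synthetic.

```python
import numpy as np

x = np.random.randn(1000)
data = np.stack([x, 2 * x], axis=1)   # columns X and Z = 2X
cov = np.cov(data, rowvar=False)
print(np.linalg.det(cov))             # ~0: the covariance matrix is singular
print(np.linalg.matrix_rank(cov))     # 1 instead of 2
```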

A Better Approach
- Region-dependent transform
  - Nonlinear
  - Computationally inexpensive to train
- Discriminative training of the feature transform
  - Criterion correlates well with the WER
  - Detailed acoustic model in feature training
Note: this slide goes after the analysis of LDA, HLDA, etc., and is followed by the discussion of discriminative training criteria and the region-dependent transform.

Region Dependent Transform (RDT)
- Divides the acoustic space into multiple regions, e.g., r1, r2, …, rN
- Applies a different transform depending on which region the input feature vector belongs to, e.g., f1, f2, …, fN
- To avoid making hard decisions when choosing which transform to apply, the posterior probabilities of the regions are used to interpolate the transformed results:
  y_t = Σ_{i=1..N} p(r_i | o_t) · f_i(o_t)
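A minimal sketch of the interpolation just described, assuming diagonal-covariance Gaussian regions and per-region linear projections M_i; all dimensions and parameter values below are hypothetical.

```python
import numpy as np

def gmm_posteriors(x, means, variances, weights):
    # diagonal-covariance Gaussian region posteriors p(r_i | x)
    logl = (-0.5 * ((x - means) ** 2 / variances).sum(axis=1)
            - 0.5 * np.log(2 * np.pi * variances).sum(axis=1) + np.log(weights))
    logl -= logl.max()                # stabilize before exponentiating
    p = np.exp(logl)
    return p / p.sum()

def rdt(x, means, variances, weights, projections):
    post = gmm_posteriors(x, means, variances, weights)
    # posterior-weighted sum of the per-region linear projections
    return sum(p * (M @ x) for p, M in zip(post, projections))

D_in, D_out, N = 9, 4, 3              # e.g. long-span dim -> output dim, N regions
rng = np.random.default_rng(1)
means, variances = rng.normal(size=(N, D_in)), np.ones((N, D_in))
weights = np.full(N, 1 / N)
projections = [rng.normal(size=(D_out, D_in)) for _ in range(N)]
print(rdt(rng.normal(size=D_in), means, variances, weights, projections))
```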

More Details of RDT
- Input features: long-span features
  - A long-span feature vector is formed by concatenating the cepstral features of consecutive frames, centered at the current frame
  - Advantage: contains information about the acoustic context of the current frame
- Division of the regions: a global Gaussian mixture model (GMM)
  - Trained via unsupervised clustering
  - Each Gaussian component in the GMM corresponds to a region
- Region-specific transforms
  - In general, they can be any projections of long-span feature vectors
  - In this thesis, linear projections are studied
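A minimal sketch of the long-span construction, assuming a ±7-frame context (the 15-frame setup used later in the talk); the edge-padding choice at utterance boundaries is an assumption.

```python
import numpy as np

def long_span(cepstra, context=7):
    # stack the cepstra of 2*context+1 neighboring frames around each frame
    T, D = cepstra.shape
    padded = np.pad(cepstra, ((context, context), (0, 0)), mode="edge")
    return np.stack([padded[t:t + 2 * context + 1].ravel() for t in range(T)])

cepstra = np.random.randn(100, 15)    # 100 frames of 15-dim cepstra
print(long_span(cepstra).shape)       # -> (100, 225)
```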

Special Cases of RDT
- RDT (generic projection) ⊃ RDLT (linear projection)
- Special cases of RDLT:
  - MPE-HLDA: only one region
  - fMPE#: only offset
  - SPLICE: rotation matrix plus offset
  - Mean-offset fMPE#: P is not region-dependent
Note (#): fMPE also includes a context-expansion layer, which does not fit this categorization (see thesis for details).

Projections vs. Offsets in RDT
- Different regions can share the same projections and/or offsets, so the number of unique projections/offsets can be less than the number of regions.
- We use RDT & RDLT interchangeably from now on.

Transform    # Uniq. proj.   # Uniq. offset   WER (%)
LDA+MLLT     -               -                25.9
RDT          1               1                24.9
RDT          1               1000             24.6
RDT          1000            1                24.0
RDT          1000            1000             22.3

Optimization Criterion of RDT
- Minimum phoneme error (MPE) criterion
  - Gives significant gains when used to train the HMM
  - Correlates well with WER
- MPE can be rewritten as a function of the feature transform:
  F_MPE(λ, F_RDT) = Σ_r Σ_k P(W_rk | F_RDT(O_r); λ) · α(W_rk)
  where O_r are the original feature vectors, λ is the HMM, F_RDT is the feature transform, and α(W_rk) is the accuracy score of hypothesized word sequence W_rk.
- About the accuracy score: it is the total number of correct phonemes in the hypothesis, so its maximum is the total number of phonemes in the reference.
[Figure: scatter plot of WER against MPE score, illustrating the correlation.]
Note: transition to the next slide, on the updating rules of the HMM.
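To make the criterion concrete, here is a minimal sketch that evaluates the expected phone accuracy over a tiny N-best list rather than a lattice; the hypothesis scores, accuracy values, and scaling factor kappa are all hypothetical.

```python
import numpy as np

def mpe_score(log_scores, accuracies, kappa=1.0):
    # posterior P(W_rk | O_r) from scaled hypothesis scores
    s = kappa * np.asarray(log_scores)
    post = np.exp(s - s.max())
    post /= post.sum()
    # expected accuracy:  sum_k P(W_rk | O_r) * alpha(W_rk)
    return float(post @ np.asarray(accuracies))

log_scores = [-100.0, -101.5, -103.0]     # acoustic+LM log scores of 3 hypotheses
accuracies = [9.0, 10.0, 7.5]             # phone accuracy alpha(W_rk) per hypothesis
print(mpe_score(log_scores, accuracies))  # lies between the min and max accuracy
```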

HMM Updating Methods
- In MPE, the HMM depends on the transformed features, so we must define how it is updated
- When choosing the HMM updating method, the concern is to make the trained transform more generic, i.e., reusable across different training setups, including:
  - both ML and MPE training
  - different types of HMMs
- This goal can be achieved if the feature transform focuses on separating the data
- To ensure that, the HMM should describe the data itself rather than anything else

HMM Updating Methods (cont.)
- If the HMM is updated discriminatively, e.g., under MPE:
  - Some Gaussians in the HMM will model decision boundaries, sitting away from the mass of the data
  - The feature transform will be misled away from separating the real data
  - The resulting transform is less generic
  - This method is acceptable if there is only one HMM to train
- If the HMM is updated under ML:
  - The Gaussians stay on the data
  - The feature transform also focuses on the data
  - The resulting transform is more generic
  - This method is preferred when there are different HMMs to train
- We assume ML updating of the HMM in this thesis

Example
[Figure: Gaussians of a discriminatively trained model vs. an ML model, before and after the transform.]
Note: since the model is already discriminative, nothing needs to be done here.

Training the Feature Transform
- The transform is trained using a numerical optimization algorithm
- The derivative of MPE with respect to the transform has two terms:
  - MPE depends on the transformed features directly → direct derivative
  - MPE depends on the transform through the HMM, which in turn depends on the transformed features → indirect derivative
- Two passes of data processing:
  - The first pass computes the direct derivative using lattices
  - The second pass computes the indirect derivative using reference transcripts

Training Procedure
Iterative update of RDT using numerical optimization:
1. Apply the current RDT to the original features, producing projected features
2. Train/update the HMM on the projected features, using the reference transcripts
3. Compute the MPE derivative, using the lattices and the reference transcripts
4. Update the RDT and repeat

Implementation
- Feature transform network: a directed acyclic network of primitive components
- Design goals:
  - reuse primitive components (e.g., linear projection, frame concatenation)
  - reuse the algorithm that applies the transform or computes the derivative
  - easy to extend to other transforms
  - efficient usage of CPU time & memory
- Impact:
  - enables numerical optimization of any differentiable components, including but not limited to RDT
  - simplifies the BBN system by providing a unified representation of various transforms
  - adds flexibility to the front-end processing in the BBN system
[Diagram: an example network: cepstra → concatenation → projection, with a Gaussian mixture component feeding the RDT.]
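A minimal sketch of the component-network idea, assuming a simplified apply/backprop interface; the class names and the example chain are hypothetical illustrations, not the BBN implementation.

```python
import numpy as np

class Projection:
    """Linear projection component: y = M x."""
    def __init__(self, M):
        self.M = M
    def apply(self, x):
        return self.M @ x
    def backprop(self, grad_out):
        # d(loss)/dx given d(loss)/dy, for the numerical optimizer
        return self.M.T @ grad_out

class Concatenation:
    """Concatenates a list of input vectors into one long-span vector."""
    def apply(self, xs):
        return np.concatenate(xs)

# chain: concatenate two 15-dim frames, then project 30 -> 10
net = [Concatenation(), Projection(np.random.randn(10, 30))]
frames = [np.random.randn(15), np.random.randn(15)]
y = net[1].apply(net[0].apply(frames))
print(y.shape)                        # -> (10,)
```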

RDT and the State-of-the-art System
- The state-of-the-art system at BBN:
  - Two sub-systems: speaker-independent (SI) and speaker-adaptive (SA)
  - Two phases of training: ML (initializes MPE training) and MPE
  - Three-pass decoding
  - Three tied-mixture acoustic models
- How RDT interacts with the system:
  - Trained once, used in three types of acoustic models → choose the proper model to use in training (size, etc.)
  - Integrated with speaker adaptation → combined with speaker-dependent transforms via an alternative training procedure
Note: explain SCTM, STM, etc.

RDT in Speaker-independent (SI) Training
- Bootstrapping: LDA+MLLT provides the initial transform, and lattices are generated for RDT training
- Baseline SI training vs. SI training with RDT: RDT training produces the RDT & HMM → ML training yields the ML-SI HMM → MPE training yields the MPE-SI HMM
Note: only one of the HMMs is shown here; the other two are similar, except that the RDT is not reestimated.

Experimental Setup
- Data
  - Training: English Conversational Telephone Speech (CTS), 2300 hours (SWB + Fisher)
  - Testing: Eval03 + Dev04; 3 hours SWB-II, 6 hours Fisher
- Analysis
  - 14 perceptual linear prediction (PLP) cepstral coefficients and normalized energy
  - Vocal tract length normalization (VTLN)
- RDT
  - 15-frame long-span features projected to 60 dimensions, initialized from LDA+MLLT
  - 1000 regions, one linear projection per region
  - Crossword state-cluster tied model (SCTM), 7K clusters; the number of Gaussians per state-cluster in the HMM varies across experiments

SI Results (ML)

Transform      12-GPS   44-GPS   120-GPS   (WER %, ML model)
LDA+MLLT       25.9     23.7     22.5
12-GPS RDT     22.3     22.1     21.9
44-GPS RDT     -        21.6     20.8#

Description:
- Two RDTs were trained using HMMs with 12 and 44 Gaussians per state-cluster (GPS), respectively
- For decoding, several ML crossword SCTM models of different sizes were trained using either LDA+MLLT or RDT
- Only the lattice-rescoring pass was run in decoding, for simplicity
- (#): after the other two models (STM, SCTM-NX) were retrained, the WER was further reduced to 20.4%, i.e., 9.3% relatively better than the LDA+MLLT result
Note: 20.8 → 20.4 means the ML STM & SCTM-NX models also benefit from the RDT, i.e., the transform is reusable as expected.

SI Results (MPE)

Transform      12-GPS   44-GPS   120-GPS   (WER %, MPE model)
LDA+MLLT       22.1     21.1     20.4
12-GPS RDT     21.2     20.8     -
44-GPS RDT     -        20.3     19.6#

Description:
- Same as the ML experiments, except that the final models were trained under MPE
- (#): after the other two models (STM, SCTM-NX) were retrained, the WER was further reduced to 19.2%, i.e., 5.8% relatively better than the LDA+MLLT result

Speaker Adaptation
- Assumption: the speaker-dependent (SD) models are linear transforms of an SI model
- Variations:
  - MLLR: assumes only the Gaussian means are transformed
  - CMLLR: both means & covariances are transformed → equivalent to applying the inverse transform to the features while keeping the model fixed
- Speaker-adaptive training (SAT)
  - The SI model is not optimal for adaptation; for example, the SI model has a large variance while the SD models have smaller variances
  - SAT estimates a better model that, when transformed, gives the best likelihood of the data
  - Think of the SAT model as the model of a neutral speaker, rather than the model of all speakers
[Diagram: an SI model linked to speaker models S(1) … S(N) through linear transforms A(1) … A(N); the dark shapes represent the actual speaker models, the light shapes the estimated ones, which differ due to limited data and the linearity of the SD transforms.]
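A minimal sketch of the feature-space view of CMLLR noted above: one affine map x' = Ax + b per speaker, with the model kept fixed. The A and b below are random stand-ins; in practice they are estimated per speaker by maximizing the likelihood of that speaker's data.

```python
import numpy as np

def cmllr_features(X, A, b):
    # apply one speaker's affine feature transform to all frames
    return X @ A.T + b

X = np.random.randn(50, 60)                    # 50 frames of 60-dim features
A = np.eye(60) + 0.01 * np.random.randn(60, 60)  # hypothetical per-speaker transform
b = 0.1 * np.random.randn(60)
print(cmllr_features(X, A, b).shape)           # -> (50, 60)
```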

RDT in Speaker-adaptive Training (SAT)
- Straightforward approach: train the SI RDT & HMM, then use the SI RDT transparently
  - CMLLR estimation produces the SD transforms (the inverses of the A's on the previous slide)
  - ML SAT yields the ML-SAT HMM; MPE training yields the MPE-SAT HMM
- Simple, but the RDT is not optimized for SAT

RDT in Speaker-adaptive Training (SAT) (cont.)
- Iterative approach (SA-RDT): train the SI RDT & HMM, then alternately update the RDT and the speaker-dependent (SD) transforms
  - CMLLR estimation → SD transforms → ML SAT → ML-SAT HMM → update RDT → SA RDT & HMM → MPE training → MPE-SAT HMM
  - Back-propagation is used to compute the derivative, since the SD transforms are applied on top of the RDT
- The RDT is optimized for SAT

Adapted Results

Transform   SAT-ML WER (%)   SAT-MPE WER (%)
LDA+MLLT    20.2             18.5
SI-RDT      18.8             17.6
SA-RDT      18.0             17.2

Description:
- Same training & testing data, state-clusters and LM as the unadapted experiments
- 10.9% relative WER reduction for the ML system
- 7.0% relative WER reduction for the MPE system

Alternative Procedure for SA-RDT
- Simplified SA-RDT: similar to the original SA-RDT, but the speaker-dependent transforms are estimated using the baseline (SI LDA+MLLT) model & features
  - CMLLR estimation → SD transforms → ML SAT → ML-SAT HMM → update RDT → SA RDT & HMM → MPE training → MPE-SAT HMM

Adapted Results

Transform   SAT-ML WER (%)   SAT-MPE WER (%)
LDA+MLLT    21.5             20.6
SA-RDT1     20.8             19.7
SA-RDT2     20.5             19.2

Description:
- 500 hours of training data
- Another set of SD transforms was used before LDA/RDT
- SA-RDT1 used the simplified procedure; SA-RDT2 used the original procedure
- The simplified procedure gave 2/3 of the gain while training the RDT only once

Conclusions
- Original work:
  - Region-dependent transform
  - Improved discriminative feature training that leads to a more generic feature transform
  - Improved SAT procedure using RDT
- Impact:
  - RDT encompasses several other feature transforms, including MPE-HLDA, SPLICE, and the core of fMPE and mean-offset fMPE
  - The method gives a significant WER reduction: 7% relative on the SAT-MPE English CTS system
  - The method is potentially helpful for exploring novel acoustic features: adding new features to the input of the transform carries little risk, because the training decides whether and how to use them based on a criterion correlated with the WER

Publications
- B. Zhang, S. Matsoukas, J. Ma, and R. Schwartz. Long span features and minimum phoneme error heteroscedastic linear discriminant analysis. In Proceedings of the EARS RT-04 Workshop, 2004.
- B. Zhang and S. Matsoukas. Minimum phoneme error based heteroscedastic linear discriminant analysis for speech recognition. In Proceedings of ICASSP, 2005.
- B. Zhang, S. Matsoukas, and R. Schwartz. Discriminatively trained region-dependent transform for speech recognition. In Proceedings of ICASSP, 2006. (Nominated for the Student Paper Award; awarded the Spoken Language Processing Grant by the IEEE Signal Processing Society.)
- B. Zhang, S. Matsoukas, and R. Schwartz. Recent progress on the discriminative region-dependent transform for speech feature extraction. In Proceedings of ICSLP, 2006.