A Tutorial on Bayesian Speech Feature Enhancement
Friedrich Faubel
SCALE Workshop, January 2010
I Motivation
Speech Recognition System Overview
A speech recognition system converts speech to text. It basically consists of two components:
- Front end: extracts speech features from the audio signal
- Decoder: finds the sentence (sequence of acoustic states) that is the most likely explanation for the observed sequence of speech features
[Diagram: speech -> front end -> decoder -> text]
Speech Feature Extraction: Windowing
The audio signal is cut into short, overlapping frames.
Speech Feature Extraction: Time-Frequency Analysis
Performing spectral analysis separately for each frame yields a time-frequency representation.
Speech Feature Extraction: Perceptual Representation
Emulation of the logarithmic frequency and intensity perception of the human auditory system.
Background Noise
- Background noise distorts the speech features.
- Result: the features don't match the features used during training.
- Consequence: severely degraded recognition performance.
Overview of the Tutorial
I - Motivation
II - The effect of noise on speech features
III - Transforming probabilities
IV - The MMSE solution to speech feature enhancement
V - Model-based speech feature enhancement
VI - Experimental results
VII - Extensions
II The Effect of Noise: Interaction Function
Interaction Function
Principle of superposition: signals are additive (clean speech + noise = noisy speech).
Interaction Function
In the signal domain we have the following relationship between noisy speech y, clean speech s and noise n:
  y[t] = s[t] + n[t]
After Fourier transformation, this becomes:
  Y(f) = S(f) + N(f)
Taking the magnitude square on both sides, we get:
  |Y(f)|^2 = |S(f)|^2 + |N(f)|^2 + 2 |S(f)| |N(f)| cos θ(f)
Interaction Function
Hence, in the power spectral domain we have:
  |Y|^2 = |S|^2 + |N|^2 + 2 |S| |N| cos θ
where the last term is the phase term and θ is the relative phase between speech and noise.
Interaction Function
The relative phase between two waves describes their relative offset in time (delay).
Interaction Function
When two sound sources are present, the following can happen, depending on the relative phase: amplification, attenuation, or complete cancellation.
Interaction Function
The phase term is zero on average, so it is commonly neglected. In the power spectral domain this gives:
  |Y|^2 ≈ |S|^2 + |N|^2
In the log power spectral domain that becomes:
  y ≈ s + log(1 + exp(n - s))   (Acero, 1990)
But is that really right?
Interaction Function
The mean of a nonlinearly transformed random variable is not necessarily equal to the nonlinear transform of the random variable's mean:
  E[g(x)] ≠ g(E[x]) in general.
Interaction Function Phase-averaged relationship between clean and noisy speech:
III Transforming Probabilities
Transforming Probabilities: Motivation
In the signal domain we have the following relationship:
  y[t] = s[t] + n[t]
In the log Mel domain that translates to the nonlinear interaction function:
  y = s + log(1 + exp(n - s))
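The log-domain interaction function is simple enough to state as code. A minimal numpy sketch (not from the slides; variable names and values are my own), using logaddexp for numerical stability:

```python
import numpy as np

def interact(s, n):
    """Log-Mel (or log power spectral) interaction function:
    y = s + log(1 + exp(n - s)) = log(exp(s) + exp(n)),
    applied element-wise to log-domain speech s and noise n."""
    return np.logaddexp(s, n)

# example with assumed per-channel values
s = np.array([2.0, 0.5, -1.0])   # log clean-speech power per Mel channel
n = np.array([0.0, 0.0,  0.0])   # log noise power per Mel channel
print(interact(s, n))            # close to s where s >> n, close to n where n >> s
```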
Transforming Probabilities: Motivation
[Figures: probability distributions of the clean speech power, the noise power, and the resulting noisy speech power]
Transforming Probabilities: Motivation
The transformation results in a non-Gaussian probability distribution for the noisy speech features.
Transforming Probabilities: Introduction
Transformation of a random variable x with probability density function p_X by a transformation g:
- The transformation maps each x to a y: y = g(x).
- Conversely, each y can be identified with an x: x = g^-1(y).
- Idea: use g^-1 to map the distribution of y to the distribution of x (change of variables).
- This yields the fundamental transformation law of probability:
  p_Y(y) = p_X(g^-1(y)) |det J(y)|
  where J is the Jacobian of the inverse transformation (the Jacobian determinant accounts for the local change of volume).
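As a quick sanity check of the transformation law (a standard textbook example, not from the slides): if x ~ N(0, 1) and y = g(x) = exp(x), then g^-1(y) = log(y), |d g^-1 / dy| = 1/y, and hence p_Y(y) = p_X(log y) / y, which is exactly the log-normal density.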
Transforming Probabilities: Monte Carlo
Idea: approximate the probability distribution by samples drawn from the distribution (e.g. via the cumulative distribution function), so that the pdf is represented by a discrete probability mass.
Then: transform each sample. A histogram of the transformed samples approximates the transformed pdf.
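To illustrate the Monte Carlo approach on this particular nonlinearity, here is a small, self-contained numpy sketch (assumed parameter values, one log-Mel channel): draw samples of clean speech and noise, push each sample through the interaction function, and summarize the resulting non-Gaussian distribution of noisy speech.

```python
import numpy as np

rng = np.random.default_rng(0)

# assumed 1-D log-Mel distributions for clean speech and noise
mu_s, sigma_s = 2.0, 1.0
mu_n, sigma_n = 0.0, 0.5

# 1. draw samples from the clean-speech and noise distributions
s = rng.normal(mu_s, sigma_s, size=100_000)
n = rng.normal(mu_n, sigma_n, size=100_000)

# 2. transform each sample with the interaction function y = log(e^s + e^n)
y = np.logaddexp(s, n)

# 3. the transformed samples represent the noisy-speech distribution;
#    a histogram of y approximates its pdf
hist, edges = np.histogram(y, bins=50, density=True)
print("sample mean/std of y:", y.mean(), y.std())
print("naive g(mu_s, mu_n):  ", np.logaddexp(mu_s, mu_n))  # differs from E[y]
```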
Transforming Probabilities: Local Linearization
Idea: locally linearize the interaction function around the means of speech and noise, using a first-order Taylor series expansion (Moreno, 1996: Vector Taylor Series approach).
Note: a linear transformation of a Gaussian random variable results in a Gaussian random variable.
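A minimal sketch of first-order (VTS-style) propagation for the element-wise interaction function, assuming diagonal covariances; the partial derivatives dg/ds = sigmoid(s - n) and dg/dn = sigmoid(n - s) follow from g(s, n) = s + log(1 + exp(n - s)). Variable names are my own.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def vts_propagate(mu_s, var_s, mu_n, var_n):
    """First-order Taylor (VTS-style) approximation of the noisy-speech
    distribution, per log-Mel channel (diagonal covariances).
    Linearizes y = g(s, n) = log(e^s + e^n) around (mu_s, mu_n)."""
    a = sigmoid(mu_s - mu_n)          # dg/ds at the expansion point
    b = sigmoid(mu_n - mu_s)          # dg/dn at the expansion point (a + b = 1)
    mu_y = np.logaddexp(mu_s, mu_n)   # g evaluated at the means
    var_y = a**2 * var_s + b**2 * var_n
    return mu_y, var_y

mu_y, var_y = vts_propagate(np.array([2.0]), np.array([1.0]),
                            np.array([0.0]), np.array([0.25]))
print(mu_y, var_y)
```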
Transforming Probabilities: The Unscented Transform
Idea: similar to Monte Carlo, but the points are selected in a deterministic fashion, in such a way that they capture the mean and covariance of the distribution.
Procedure: select the points, transform each point through the nonlinearity, then re-estimate the parameters (mean and covariance) of the Gaussian distribution from the transformed points.
Comparison to local linearization: the unscented transform evaluates the true nonlinearity at several points rather than a first-order approximation of it around a single point.
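A small, self-contained sketch of the unscented transform for a d-dimensional Gaussian, using the standard 2d+1 sigma points with a scaling parameter kappa (the textbook Julier/Uhlmann construction, not code from the slides):

```python
import numpy as np

def unscented_transform(mu, cov, g, kappa=1.0):
    """Propagate N(mu, cov) through the nonlinearity g using 2d+1 sigma points.
    Returns mean and covariance of the transformed distribution, plus the
    transformed sigma points (ordered [center, +1..+d, -1..-d])."""
    d = mu.shape[0]
    # 1. select sigma points that capture the mean and covariance
    L = np.linalg.cholesky((d + kappa) * cov)       # columns = scaled directions
    points = np.vstack([mu, mu + L.T, mu - L.T])    # shape (2d+1, d)
    weights = np.full(2 * d + 1, 1.0 / (2.0 * (d + kappa)))
    weights[0] = kappa / (d + kappa)
    # 2. transform each sigma point
    ty = np.array([g(p) for p in points])
    # 3. re-estimate the parameters of the Gaussian distribution
    mean = weights @ ty
    diff = ty - mean
    covar = (weights[:, None] * diff).T @ diff
    return mean, covar, ty

# example: joint vector x = [s, n], nonlinearity y = log(e^s + e^n)
g = lambda x: np.array([np.logaddexp(x[0], x[1])])
mean, covar, _ = unscented_transform(np.array([2.0, 0.0]), np.diag([1.0, 0.25]), g)
print(mean, covar)
```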
Transforming Probabilities: The Unscented Transform
The points selected by the unscented transform lie on lines through the center point. After the nonlinear transformation, the points of such a line might no longer lie on a line. Hence we can measure the degree of nonlinearity (dnl) as the average distance of each triple of points from a linear fit of those three points. This can be shown to be closely related to the R^2 measure used in linear regression.
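A sketch of this degree-of-nonlinearity measure, following my own minimal reading of the slide: for each line of sigma points (center plus the +/- pair along one direction), fit a line through the three transformed points and average their distances from that fit. It assumes the transformed points are ordered [center, +1..+d, -1..-d], as in the UT sketch above, and an output dimension of at least two (in one dimension any three points are trivially collinear).

```python
import numpy as np

def degree_of_nonlinearity(ty, d):
    """ty: transformed sigma points, shape (2d+1, m), ordered
    [center, +1..+d, -1..-d]. For each of the d directions, fit a line
    through the triple (center, +i, -i) via PCA and average the distances
    of the three points from that line."""
    center = ty[0]
    dists = []
    for i in range(1, d + 1):
        triple = np.vstack([center, ty[i], ty[d + i]])   # 3 x m
        centered = triple - triple.mean(axis=0)
        # first principal direction = best-fitting line through the triple
        direction = np.linalg.svd(centered, full_matrices=False)[2][0]
        residual = centered - np.outer(centered @ direction, direction)
        dists.append(np.linalg.norm(residual, axis=1).mean())
    return float(np.mean(dists))

# tiny usage example: one sigma-point line, mapped with [s, n] -> [s, log(e^s + e^n)]
g = lambda x: np.array([x[0], np.logaddexp(x[0], x[1])])
center = np.array([2.0, 0.0])
delta = np.array([1.5, 0.0])
ty = np.array([g(center), g(center + delta), g(center - delta)])
print(degree_of_nonlinearity(ty, d=1))   # > 0, since the mapping is nonlinear
```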
Transforming Probabilities: The Unscented Transform
With a high degree of nonlinearity, the Gaussian fit does not represent the transformed distribution well.
[Figure: true distribution vs. Gaussian fit]
Transforming Probabilities: An Adaptive Level of Detail Approach
Idea: splitting a Gaussian into two Gaussian components decreases the covariance of each component and thereby the nonlinearity.
[Figure: one Gaussian vs. 2 Gaussians]
Transforming Probabilities: An Adaptive Level of Detail Approach
Algorithm: Adaptive Level of Detail Transform (ALoDT)
1. Start with one Gaussian g.
2. Transform that Gaussian with the UT.
3. Identify the Gaussian component with the highest dnl.
4. Split that component into 2 Gaussians g1, g2.
5. Transform g1 and g2 with the UT.
6. While #(Gaussians) < N: repeat from step 3.
A sketch of this loop is given below.
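A compact sketch of the ALoDT loop under simplifying assumptions of my own: Gaussians are split with a moment-preserving two-way split along their largest principal axis, the unscented transform and the dnl measure are restated compactly so the block runs on its own, and every component is re-transformed in each pass (equivalent to, but simpler than, transforming only the newly split pair). This illustrates the control flow, not the authors' exact implementation.

```python
import numpy as np

def sigma_points(mu, cov, kappa=1.0):
    d = mu.shape[0]
    L = np.linalg.cholesky((d + kappa) * cov)
    pts = np.vstack([mu, mu + L.T, mu - L.T])
    w = np.full(2 * d + 1, 1.0 / (2 * (d + kappa)))
    w[0] = kappa / (d + kappa)
    return pts, w

def ut(mu, cov, g):
    pts, w = sigma_points(mu, cov)
    ty = np.array([g(p) for p in pts])
    m = w @ ty
    diff = ty - m
    return m, (w[:, None] * diff).T @ diff, ty

def dnl(ty, d):
    # same degree-of-nonlinearity measure as in the previous sketch
    center, dists = ty[0], []
    for i in range(1, d + 1):
        t = np.vstack([center, ty[i], ty[d + i]])
        t0 = t - t.mean(axis=0)
        v = np.linalg.svd(t0, full_matrices=False)[2][0]
        r = t0 - np.outer(t0 @ v, v)
        dists.append(np.linalg.norm(r, axis=1).mean())
    return float(np.mean(dists))

def split(w, mu, cov, eps=0.5):
    # moment-preserving split along the largest principal axis
    lam, vecs = np.linalg.eigh(cov)
    v = vecs[:, -1] * np.sqrt(lam[-1])
    cov_s = cov - eps**2 * np.outer(v, v)
    return [(w / 2, mu + eps * v, cov_s), (w / 2, mu - eps * v, cov_s)]

def alodt(mu, cov, g, N=4):
    """Adaptive Level of Detail Transform: start with one Gaussian, repeatedly
    split the component with the highest dnl and re-transform, until N components."""
    d = mu.shape[0]
    comps = [(1.0, mu, cov)]                         # (weight, mean, cov) in input space
    while True:
        transformed = []
        for w, m, c in comps:
            tm, tc, ty = ut(m, c, g)
            transformed.append((w, tm, tc, dnl(ty, d)))
        if len(comps) >= N:
            return [(w, tm, tc) for w, tm, tc, _ in transformed]
        worst = int(np.argmax([t[3] for t in transformed]))
        w, m, c = comps.pop(worst)
        comps.extend(split(w, m, c))

# usage: joint mapping [s, n] -> [s, y]
g = lambda x: np.array([x[0], np.logaddexp(x[0], x[1])])
mix = alodt(np.array([2.0, 0.0]), np.diag([1.0, 0.25]), g, N=4)
for w, m, c in mix:
    print(w, m)
```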
Transforming Probabilities: An Adaptive Level of Detail Approach
[Figures: density approximation with the plain unscented transform and with ALoDT-2, ALoDT-4, ALoDT-8, ALoDT-16 and ALoDT-32]
Transforming Probabilities: An Adaptive Level of Detail Approach
Kullback-Leibler divergence between the approximated and the true distribution (the latter obtained by Monte Carlo with 10M samples), for the Adaptive Level of Detail Transform with N components:

N    |   1   |   2   |   4   |   8   |  16   |  32
KLD  | 0.190 | 0.078 | 0.025 | 0.017 | 0.007 | 0.004

From N = 1 to N = 32, the KLD decreases by a factor of about 48.
IV Speech Feature Enhancement: The MMSE Solution
Speech Feature Enhancement: The MMSE Solution
Idea: train the speech recognition system on clean speech and try to map the distorted features to clean speech features.
Systematic approach: derive an estimator for clean speech given noisy speech.
Speech Feature Enhancement: The MMSE Solution
Let s_hat(y) be an estimator for the clean speech s, given the noisy speech y. Then the expected mean square error introduced by using s_hat(y) instead of the true s is:
  MSE(s_hat) = E[ || s - s_hat(y) ||^2 ]
Minimizing the MSE with respect to s_hat yields the optimal estimator with respect to the MMSE criterion:
  s_hat_MMSE(y) = E[ s | y ] = ∫ s p(s|y) ds
But how to obtain this distribution p(s|y)?
Speech Feature Enhancement: The MMSE Solution
Idea: assume that the joint distribution of S and Y is Gaussian (Afify, 2007: Stereo-Based Stochastic Mapping):
  [S; Y] ~ N( [mu_s; mu_y], [[Sigma_ss, Sigma_sy], [Sigma_ys, Sigma_yy]] )
Then the conditional distribution of S | Y = y is again Gaussian,
  p(s | y) = N( s; mu_{s|y}, Sigma_{s|y} ),
with conditional mean and covariance matrix
  mu_{s|y}    = mu_s + Sigma_sy Sigma_yy^-1 (y - mu_y)
  Sigma_{s|y} = Sigma_ss - Sigma_sy Sigma_yy^-1 Sigma_ys
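The conditional-Gaussian formulas above translate directly into a few lines of numpy (a generic sketch, not code from the slides; the toy numbers are made up):

```python
import numpy as np

def conditional_gaussian(mu_s, mu_y, S_ss, S_sy, S_yy, y):
    """Mean and covariance of S | Y = y for jointly Gaussian (S, Y)."""
    K = S_sy @ np.linalg.inv(S_yy)       # gain mapping y-deviations to s
    mu_cond = mu_s + K @ (y - mu_y)
    cov_cond = S_ss - K @ S_sy.T
    return mu_cond, cov_cond

# toy 1-D example with assumed numbers
mu, cov = conditional_gaussian(np.array([2.0]), np.array([2.3]),
                               np.array([[1.0]]), np.array([[0.8]]),
                               np.array([[1.2]]), np.array([[2.9]]))
print(mu, cov)
```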
Speech Feature Enhancement: The MMSE Solution
Under the Gaussian assumption, the integral s_hat_MMSE(y) = ∫ s p(s|y) ds is easily obtained:
  s_hat_MMSE(y) = mu_{s|y} = mu_s + Sigma_sy Sigma_yy^-1 (y - mu_y)
This is exactly what you get with the vector Taylor series approach (Moreno, 1996).
Problem: speech is known to be multimodal.
Speech Feature Enhancement: The MMSE Solution
To account for the multimodality, model the clean speech with a Gaussian mixture and introduce the index k of the mixture component as a hidden variable. Then rewrite the MMSE estimator as
  s_hat_MMSE(y) = ∫ s p(s|y) ds = ∫ s Σ_k p(s, k | y) ds = Σ_k ∫ s p(s | k, y) p(k | y) ds,
where the sum has been pulled out of the integral. Since p(k | y) is independent of s, it can also be pulled out of the integral:
  s_hat_MMSE(y) = Σ_k p(k | y) ∫ s p(s | k, y) ds = Σ_k p(k | y) s_hat_k(y)
Here
  p(k | y) is the probability that the clean speech originated from the k-th Gaussian, given the noisy speech spectrum y; it follows from Bayes' theorem applied to the joint distribution: p(k | y) = p(y | k) p(k) / Σ_j p(y | j) p(j)
  s_hat_k(y) = E[ s | k, y ] is the clean speech estimate of the k-th Gaussian.
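Putting the pieces together, a self-contained sketch of the GMM-based MMSE estimator: component posteriors p(k|y) via Bayes' theorem from the component marginals of y, per-component conditional means, and their posterior-weighted sum. The per-component joint parameters of s and y are assumed to have been obtained beforehand (e.g. with one of the transformation techniques from Section III); the numbers below are made up for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mmse_estimate(y, weights, mu_s, mu_y, S_ss, S_sy, S_yy):
    """GMM-based MMSE estimate of clean speech s given noisy speech y.

    weights[k]                 prior p(k) of the k-th component
    mu_s[k], mu_y[k]           component means of clean and noisy speech
    S_ss[k], S_sy[k], S_yy[k]  blocks of the k-th joint covariance
    """
    K = len(weights)
    # posterior p(k | y) via Bayes' theorem
    lik = np.array([multivariate_normal.pdf(y, mu_y[k], S_yy[k]) for k in range(K)])
    post = weights * lik
    post /= post.sum()
    # per-component conditional means E[s | k, y], posterior-weighted
    s_hat = np.zeros_like(mu_s[0])
    for k in range(K):
        gain = S_sy[k] @ np.linalg.inv(S_yy[k])
        s_hat += post[k] * (mu_s[k] + gain @ (y - mu_y[k]))
    return s_hat

# toy 1-D, two-component example with assumed numbers
s_hat = mmse_estimate(
    np.array([1.5]),
    weights=np.array([0.6, 0.4]),
    mu_s=[np.array([2.0]), np.array([-1.0])],
    mu_y=[np.array([2.3]), np.array([0.4])],
    S_ss=[np.array([[1.0]]), np.array([[0.8]])],
    S_sy=[np.array([[0.8]]), np.array([[0.5]])],
    S_yy=[np.array([[1.2]]), np.array([[0.9]])],
)
print(s_hat)
```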
V Model-Based Speech Feature Enhancement
Model-Based Speech Feature Enhancement
- The distribution of clean speech is modeled as a Gaussian mixture:
  p(s) = Σ_k c_k N(s; mu_k, Sigma_k)
- Noise is modeled as a single Gaussian:
  p(n) = N(n; mu_n, Sigma_n)
- The presence of noise changes the clean speech distribution according to the interaction function.
- Construct the joint distribution of clean and noisy speech based on this model.
Model-Based Speech Feature Enhancement
Construct the joint distribution of clean and noisy speech based on this model, component by component, using one of the transformation techniques from Section III (local linearization, unscented transform, ALoDT).
Model-Based Speech Feature Enhancement
Noise Estimation: find the noise distribution (mean and covariance of the noise) that is the most likely explanation for the observed noisy speech features.
Problem: the observations also depend on the speech, which acts as a hidden variable.
Hence, the Expectation Maximization (EM) algorithm is used (Rose, 1994; Moreno, 1996).
Model-Based Speech Feature Enhancement
Expectation Step: construct the joint distribution of clean and noisy speech using the current noise parameter estimate (mu_n, Sigma_n). Then calculate the posterior p(k | y_t) for every frame t and every mixture component k.
Maximization Step: re-estimate (mu_n, Sigma_n) by accumulating statistics of the instantaneous noise estimates E[ n | k, y_t ] for each possible component k, weighted by the probability p(k | y_t) that the clean speech originated from this Gaussian.
But how to obtain the distribution p(n | k, y_t) needed for these instantaneous estimates? We have the joint distribution of n and y for component k, so p(n | k, y_t) is just the conditional Gaussian distribution, with conditional mean mu_n + Sigma_ny,k Sigma_yy,k^-1 (y_t - mu_y,k) and covariance Sigma_nn - Sigma_ny,k Sigma_yy,k^-1 Sigma_yn,k.
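To make the E/M recipe concrete, here is a simplified, self-contained sketch of one way to implement it, with simplifications of my own: one log-Mel channel, scalar variances, the per-component joint distribution built with the first-order VTS linearization from Section III, and only the noise mean re-estimated. The actual system behind the slides may differ in all of these details.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gauss(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def em_noise_mean(y, c, mu_s, var_s, mu_n0, var_n, n_iter=10):
    """y: observed noisy log-Mel values, shape (T,)
    c, mu_s, var_s: clean-speech GMM (weights, means, variances), shape (K,)
    mu_n0, var_n: initial noise mean and (fixed) noise variance."""
    mu_n = mu_n0
    for _ in range(n_iter):
        # E-step: per-component joint distribution of (n, y), obtained by
        # linearizing y = log(e^s + e^n) around (mu_s_k, mu_n)
        a = sigmoid(mu_s - mu_n)             # dg/ds per component
        b = 1.0 - a                          # dg/dn per component
        mu_y = np.logaddexp(mu_s, mu_n)
        var_y = a**2 * var_s + b**2 * var_n
        cov_ny = b * var_n
        # posterior p(k | y_t) for every frame t and component k
        lik = c[None, :] * gauss(y[:, None], mu_y[None, :], var_y[None, :])
        gamma = lik / lik.sum(axis=1, keepdims=True)          # shape (T, K)
        # instantaneous noise estimates E[n | k, y_t] (conditional Gaussian mean)
        n_inst = mu_n + (cov_ny / var_y)[None, :] * (y[:, None] - mu_y[None, :])
        # M-step: accumulate the posterior-weighted instantaneous estimates
        mu_n = float((gamma * n_inst).sum() / len(y))
    return mu_n

# toy data with assumed parameters: two-component clean-speech model
rng = np.random.default_rng(0)
c, mu_s, var_s = np.array([0.5, 0.5]), np.array([3.0, 0.5]), np.array([1.0, 1.0])
true_n = rng.normal(1.0, 0.3, size=2000)
s = np.where(rng.random(2000) < 0.5,
             rng.normal(3.0, 1.0, 2000), rng.normal(0.5, 1.0, 2000))
y = np.logaddexp(s, true_n)
print(em_noise_mean(y, c, mu_s, var_s, mu_n0=-2.0, var_n=0.09))
```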
VI Experimental Results
Experimental Results: Speech Recognition Experiments
- clean speech from the MC-WSJ-AV corpus
- noise from the NOISEX-92 database (artificially added)
- MFCC with 13 components, stacking of 15 frames, LDA
- cepstral mean and variance normalization
- 1743 acoustic states; 70308 Gaussians
Experimental Results
[Figure: word error rate (WER), destroyer engine noise]
Experimental Results
[Figure: word error rate (WER), factory noise]
VII Extensions
Extensions
Sequential noise estimation:
- Sequential expectation maximization (SEM), Kim, 1998
- Interacting Multiple Model (IMM) Kalman filter, Kim, 1999
- Particle filter, Yao, 2001
Improve speech recognition through:
- Combination with Joint Uncertainty Decoding, Shinohara, 2008
- Combination with bounded conditional mean imputation?