Noise Reduction: Two-Stage Mel-Warped Wiener Filter Approach
Intellectual Property: Advanced front-end feature extraction algorithm, ETSI ES V1.1.3 ( ). European Telecommunications Standards Institute, ETSI Technical Committee Speech Processing, Transmission and Quality Aspects (STQ).
Noise Reduction: Based on Wiener filter theory. Noise reduction is performed in two stages. The input signal is de-noised in the first stage. The second stage applies dynamic noise reduction based on the SNR of the processed signal.
First Stage (block diagram): Spectrum Estimation, PSD Mean, WF Design, Mel Filter-Bank, Mel IDCT, Apply Filter, then to Second Stage. The stage also contains a VADNest block.
Second Stage (block diagram): from First Stage, Spectrum Estimation, PSD Mean, WF Design, Mel Filter-Bank, Gain Factorization, Mel IDCT, Apply Filter, OFF (offset compensation), Output.
Buffering: 1 frame = 80 samples; 1 buffer = 4 frames. [Diagram: buffers holding frames A to H, showing the new incoming frame, the frames de-noised by the 1st stage, and the de-noised output frame.]
Spectrum Estimation: The input signal is divided into overlapping frames of N_in = 200 samples. A 25 ms frame length and a 10 ms frame shift (80 samples) are used. Each frame is windowed with a Hanning window of length N_in, giving S_w(n).
Spectrum Estimation: Each windowed frame is zero-padded from sample N_in up to N_FFT - 1, where N_FFT = 256.
Spectrum Estimation: Three quantities are computed for each frame: the frequency representation, the power spectrum, and the smoothed power spectrum.
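The framing, windowing, zero padding and power spectrum steps can be sketched as follows. This is a minimal illustration, not the reference implementation: the half-sample offset in the Hanning window and the function names are assumptions, a naive DFT stands in for a real FFT to keep the sketch dependency-free, and the smoothing step is left out because its formula is not given in the slides.

```python
import cmath
import math

N_IN = 200    # frame length in samples (from the slides)
N_FFT = 256   # transform length after zero padding (from the slides)


def hanning(length):
    # Assumption: Hanning window with a half-sample offset;
    # the slides do not give the exact window formula.
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * (n + 0.5) / length)
            for n in range(length)]


def power_spectrum(frame):
    """Window one frame, zero-pad it to N_FFT and return |X(bin)|^2
    for bins 0 .. N_FFT/2."""
    if len(frame) != N_IN:
        raise ValueError("expected a frame of N_IN samples")
    windowed = [s * w for s, w in zip(frame, hanning(N_IN))]
    spectrum = []
    for k in range(N_FFT // 2 + 1):
        # Naive DFT; samples N_IN .. N_FFT-1 are zero padding and add nothing.
        x_k = sum(windowed[n] * cmath.exp(-2j * math.pi * k * n / N_FFT)
                  for n in range(N_IN))
        spectrum.append(abs(x_k) ** 2)
    return spectrum
```

A sinusoid placed exactly on a bin frequency should produce its largest power value at that bin, which is a quick sanity check for the sketch.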
Power Spectral Density Mean: For each bin of P_in(bin), compute the mean over the last T_PSD = 2 frames.
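A small sketch of the per-bin mean over the last T_PSD frames. The class name is hypothetical, and averaging over only the frames available so far at start-up is an assumption, since the slides do not say how the first frame is handled.

```python
from collections import deque

T_PSD = 2  # number of frames averaged (from the slides)


class PsdMean:
    """Running per-bin mean of the last T_PSD power spectra."""

    def __init__(self):
        self.history = deque(maxlen=T_PSD)

    def update(self, p_in):
        # Keep a copy of the newest spectrum; the deque drops the oldest one.
        self.history.append(list(p_in))
        count = len(self.history)  # assumption: fewer than T_PSD frames at start-up
        return [sum(frame[b] for frame in self.history) / count
                for b in range(len(p_in))]
```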
Wiener Filter Design: A forgetting factor (weight) λ_NSE is computed for each frame t:
  if t < 100 frames: λ_NSE = 1 - 1/t
  else: λ_NSE = 0.99
Wiener Filter Design: The first-stage noise spectrum estimate is updated based on the VAD flag:
  if flag = 0: P_noise^(1/2)(bin, t_n) = max(λ_NSE × P_noise^(1/2)(bin, t_n - 1) + (1 - λ_NSE) × PSD_mean, exp(-10))
  if flag = 1: P_noise^(1/2)(bin, t) = P_noise^(1/2)(bin, t_n) (value from the last non-speech frame)
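The forgetting factor and the VAD-gated first-stage noise update can be sketched together. Function names are hypothetical, the frame index t is taken to start at 1, and exp(-10) is treated as a lower bound on the estimate (hence `max`), which is an assumption about the intent of the formula.

```python
import math

EPS = math.exp(-10)  # floor for the noise estimate (assumed to be a lower bound)


def forgetting_factor(t):
    """Return lambda_NSE for 1-based frame index t."""
    return 1.0 - 1.0 / t if t < 100 else 0.99


def update_noise_sqrt(prev_noise_sqrt, psd_mean, t, vad_flag):
    """Update the first-stage noise estimate for one frame, bin by bin."""
    if vad_flag == 1:
        # Speech frame: keep the estimate from the last non-speech frame.
        return list(prev_noise_sqrt)
    lam = forgetting_factor(t)
    return [max(lam * p + (1.0 - lam) * m, EPS)
            for p, m in zip(prev_noise_sqrt, psd_mean)]
```

With t = 1 the factor is 0, so the first non-speech frame initialises the estimate directly from the PSD mean.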
Wiener Filter Design: The second-stage noise spectrum estimate is updated permanently:
  if t < 11:
    P_noise(bin, t) = λ_NSE × P_noise(bin, t-1) + (1 - λ_NSE) × PSD_mean
  else:
    update = × P_inPSD(bin, t) / (P_inPSD(bin, t) + P_noise(bin, t-1)) × (1 + 1/(1 + 0.1 × P_inPSD(bin, t) / P_inPSD(bin, t-1)))
    P_noise(bin, t) = P_noise(bin, t-1) × update
Wiener Filter Design: The noiseless spectrum is estimated:
  P_den^(1/2)(bin, t) = 0.98 × P_den^(1/2)(bin, t-1) + (1 - 0.98) × T[PSD_mean - P_noise^(1/2)(bin, t)]
where T is the threshold function.
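The recursion for the noiseless spectrum can be sketched as below. The definition of T is not given in the slides; the sketch assumes it passes positive values and clamps everything else to zero. Function and parameter names are hypothetical.

```python
def threshold(z):
    # Assumption: T keeps positive values and maps the rest to zero.
    return z if z > 0.0 else 0.0


def update_denoised_sqrt(prev_den_sqrt, psd_mean, noise_sqrt, beta=0.98):
    """One smoothing step of the noiseless spectrum estimate, bin by bin."""
    return [beta * d + (1.0 - beta) * threshold(m - n)
            for d, m, n in zip(prev_den_sqrt, psd_mean, noise_sqrt)]
```

When the noise estimate exceeds the PSD mean in a bin, the new term is zero and the estimate simply decays by the factor 0.98.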
Wiener Filter Design: The a priori SNR is calculated, followed by the filter transfer function.
Wiener Filter Design: The filter transfer function is used to improve the noiseless signal estimate, which in turn gives an improved a priori SNR.
Voice Activity Detection: VAD is used to detect noise frames. A long-term energy factor (LTE) is set first:
  if t < 10 (frame threshold): LTE = 1 - 1/t
  else: LTE = 0.97
The frame energy (frameEn) is then calculated.
Voice Activity Detection: The frame energy is used to update the mean energy:
  if (frameEn - meanEn) < 20 (SNR threshold) or t < 10:
    if (frameEn < meanEn) or (t < 10):
      meanEn = meanEn + (1 - LTE) * (frameEn - meanEn)
    else:
      meanEn = meanEn + ( ) * (frameEn - meanEn)
  if meanEn < 80: meanEn = 80
Voice Activity Detection: Is the current frame speech?
  if t > 4:
    if (frameEn - meanEn) > 15:
      IT IS SPEECH, nbSpeechFrame++
    else:
      if nbSpeechFrame > 4: hangover = 15, nbSpeechFrame = 0
      if hangover != 0: IT IS SPEECH
      else: IT IS NOT SPEECH
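The speech decision with hangover can be sketched as a small state machine. Two points are assumptions rather than something the slide states: the hangover counter is decremented on every frame it keeps alive (otherwise it would never expire), and frames with t <= 4 are reported as non-speech. The names `VadState` and `is_speech` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class VadState:
    nb_speech_frame: int = 0  # speech frames counted since the last hangover
    hangover: int = 0         # remaining frames still forced to "speech"


def is_speech(t, frame_en, mean_en, state):
    """Return True if frame t is classified as speech; updates state in place."""
    if t <= 4:
        return False  # assumption: no decision is made for the first frames
    if frame_en - mean_en > 15:
        state.nb_speech_frame += 1
        return True
    if state.nb_speech_frame > 4:
        # Enough speech was seen just before: arm the hangover.
        state.hangover = 15
        state.nb_speech_frame = 0
    if state.hangover != 0:
        state.hangover -= 1  # assumption: the counter runs down one frame at a time
        return True
    return False
```

After five loud frames, the next 15 quiet frames are still reported as speech and the 16th is not, which is the intended effect of the hangover.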
Mel Filter Bank: The linear-frequency Wiener filter coefficients are smoothed and transformed to the Mel-frequency scale. The mel scale is a scale of pitches judged by listeners to be equal in distance from one another.
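For orientation, a commonly used analytic form of the mel scale is mel(f) = 2595 × log10(1 + f/700). The slides do not state which formula the filter bank uses, so the conversion below is an assumption and the function names are hypothetical.

```python
import math


def hz_to_mel(f_hz):
    # Assumption: the widely used 2595 * log10(1 + f/700) mapping.
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    # Inverse of hz_to_mel.
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)
```

With this mapping 1000 Hz lands very close to 1000 mel, and equal steps in mel correspond to progressively wider steps in Hz at higher frequencies.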
Mel IDCT: The time-domain impulse response of the Wiener filter is computed from the Mel-Wiener filter coefficients by using a Mel-warped inverse Discrete Cosine Transform.
Gain Factorization: Factorization of the Mel-warped Wiener filter coefficients is performed to control the aggression of noise reduction in the second stage. The de-noised frame signal energy is calculated first.
Gain Factorization: The noise energy of the current frame is then estimated.
Gain Factorization: The smoothed SNR is evaluated using 3 de-noised frame energies and the noise energy:
  if (Ratio > ): SNR_avg(t) = 6.67 × log10(Ratio)
  else: SNR_avg(t) = -33.3
Gain Factorization: To decide the degree of aggression, the SNR is tracked:
  if (SNR_avg(t) - SNR_low_track(t-1)) < 10 or t < 10:
    calculate λ_SNR(t)
    SNR_low_track(t) = λ_SNR(t) × SNR_low_track(t-1) + (1 - λ_SNR(t)) × SNR_avg(t)
  else:
    SNR_low_track(t) = SNR_low_track(t-1)
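The low-SNR tracking rule can be sketched as a single update function. The slides do not show how λ_SNR(t) is calculated, so the sketch takes it as a caller-supplied argument; the function and parameter names are hypothetical.

```python
def track_low_snr(snr_avg, prev_low_track, t, lam_snr):
    """Return SNR_low_track(t) given SNR_avg(t) and SNR_low_track(t-1).

    lam_snr is the forgetting factor lambda_SNR(t); how it is computed is
    not given in the slides, so the caller must supply it.
    """
    if (snr_avg - prev_low_track) < 10 or t < 10:
        # Follow the smoothed SNR while it stays close to the tracked level.
        return lam_snr * prev_low_track + (1.0 - lam_snr) * snr_avg
    # A large upward jump (likely speech) leaves the tracked level unchanged.
    return prev_low_track
```

The effect is that the tracked value follows the noise-only SNR level and ignores sudden rises, except during the first frames when it is always updated.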
Gain Factorization: Gain factorization applies more aggressive noise reduction to purely noisy frames and less to frames containing speech. The aggression coefficient takes on a value of 10% for speech + noise frames and 80% for noise frames.
Apply Filter: The causal impulse response is obtained, truncated and weighted by a Hanning window. The input signal is filtered with the filter impulse response to produce the noise-reduced signal.
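A rough sketch of the truncate, window and filter steps, assuming the causal impulse response is already available as a list of taps. The filter length is a free parameter here because the slides do not give it, the half-sample-offset Hanning window is an assumption, and the function names are hypothetical.

```python
import math


def windowed_filter(impulse_response, length):
    """Truncate a causal impulse response to `length` taps and apply a Hanning window."""
    taps = impulse_response[:length]
    if len(taps) < length:
        raise ValueError("impulse response is shorter than the requested length")
    # Assumption: Hanning window with a half-sample offset.
    return [h * (0.5 - 0.5 * math.cos(2.0 * math.pi * (n + 0.5) / length))
            for n, h in enumerate(taps)]


def apply_filter(signal, taps):
    """FIR-filter the signal by direct convolution; output has the input's length."""
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for k, h in enumerate(taps):
            if n - k >= 0:
                acc += h * signal[n - k]
        out.append(acc)
    return out
```

A response whose only non-zero tap sits at the centre of the window passes through the weighting unchanged, so filtering with it just delays the input.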
Offset Compensation: A filter is used to remove the DC offset over the frame length interval (80 samples), where S_nr is the noise-reduced signal.
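One simple way to remove a DC offset over an 80-sample interval is to subtract the mean of each frame from its samples. The slides do not show the filter equation, so this is an assumed form, and the function name is hypothetical.

```python
FRAME_LEN = 80  # frame length interval in samples (from the slides)


def remove_dc_offset(s_nr):
    """Subtract the per-frame mean from the noise-reduced signal s_nr."""
    out = []
    for start in range(0, len(s_nr), FRAME_LEN):
        frame = s_nr[start:start + FRAME_LEN]
        mean = sum(frame) / len(frame)  # DC estimate for this frame (assumed form)
        out.extend(x - mean for x in frame)
    return out
```

A constant frame comes out as all zeros, and any other frame comes out with zero mean.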
Results: Noisy test file: After de-noise:
Results: Footloose: Not Footloose:
Results: why didn’t this work? Hair dryer: Still there?!?!:
Results: Hair dryer: Gone: