Speech Enhancement Using a Minimum Mean Square Error Short-Time Spectral Amplitude Estimation Method
Process Flow
Segmenting of Signal
The signal is divided into 25 ms frames with a frame shift of 40% of the frame length, i.e. 10 ms.
The Window Length in samples is equal to 25 ms times the Sampling Frequency.
– Example: Sampling Frequency = 8000 samples/s
– Window Length = 0.025 s * 8000 samples/s = 200 samples
Each frame is then windowed using a Hamming window.
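The framing step can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the implementation behind these slides; the function name `frame_signal` and its parameter names are assumptions made for the example.

```python
import math

def frame_signal(signal, fs, frame_sec=0.025, shift_fraction=0.4):
    """Split `signal` into overlapping frames and apply a Hamming window to each."""
    window_length = int(round(frame_sec * fs))            # 0.025 s * 8000 = 200 samples
    hop = int(round(shift_fraction * window_length))      # 40% of 200 = 80 samples (10 ms)
    # Hamming window: w[n] = 0.54 - 0.46 * cos(2*pi*n / (N - 1))
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (window_length - 1))
              for n in range(window_length)]
    frames = []
    # Only full-length frames are kept; a trailing partial frame is dropped.
    for start in range(0, len(signal) - window_length + 1, hop):
        segment = signal[start:start + window_length]
        frames.append([s * w for s, w in zip(segment, window)])
    return frames, window_length, hop
```

For one second of audio at 8000 samples/s this yields 200-sample frames that advance by 80 samples.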
Initial Silence Segments
The initial silence (speech inactivity) period is assumed to be 250 ms.
– This provides enough data to estimate the Noise Spectrum before attempting Voice Activity Detection (VAD).
The Number of Initial Silence Segments (NISS) = (Initial Silence * Sampling Frequency - Window Length) / (Shift Percentage * Window Length).
– Example, using the previous values: NISS = (0.25 s * 8000 samples/s - 200 samples) / (0.4 * 200 samples) = 1800 / 80 = 22.5
– The value is rounded down to the nearest whole number, giving NISS = 22.
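The NISS arithmetic can be checked with a short Python sketch. The function name and argument names are illustrative assumptions; the formula itself is the one stated on this slide.

```python
import math

def initial_silence_segments(initial_silence_sec, fs, window_length, shift_fraction):
    """Number of whole frames that fit in the assumed initial silence period."""
    silence_samples = initial_silence_sec * fs      # 0.25 s * 8000 = 2000 samples
    hop = shift_fraction * window_length            # 0.4 * 200 = 80 samples
    # Round down to the nearest whole number of frames.
    return math.floor((silence_samples - window_length) / hop)

niss = initial_silence_segments(0.25, 8000, 200, 0.4)   # (2000 - 200) / 80 = 22.5 -> 22
```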
Phase Calculation using FFT
The Fast Fourier Transform (FFT) of each frame is calculated.
The phase of each FFT is retained for use in reconstructing the enhanced signal.
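As a rough illustration of splitting a frame's spectrum into magnitude and phase, the sketch below uses a direct DFT written with the standard library. That is a stand-in chosen to keep the example self-contained: an FFT routine computes the same values much faster and would be used in practice. The function names are assumptions.

```python
import cmath
import math

def dft(frame):
    """Direct O(N^2) discrete Fourier transform; an FFT gives the same result faster."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def magnitude_and_phase(frame):
    """Return the spectral magnitude and phase (in radians) of one windowed frame."""
    spectrum = dft(frame)
    magnitude = [abs(c) for c in spectrum]
    phase = [cmath.phase(c) for c in spectrum]   # saved for later reconstruction
    return magnitude, phase
```

Only the magnitude is modified by the enhancement; the phase is stored unchanged.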
Noise Power Spectrum
An initial Noise Power Spectrum and Noise Power Spectrum Variance ($\lambda_d$) are calculated from the mean values of the FFT over the NISS frames.
For each frame in the NISS, the Noise Power Spectrum and the Noise Power Spectrum Variance are updated.
The frames after the NISS are evaluated using a Voice Activity Detector (VAD), which utilizes the Noise Power Spectrum.
– If a frame is determined to contain only noise, the Noise Power Spectrum and the Noise Power Spectrum Variance are updated.
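One way the noise tracking could look is sketched below. The slides do not specify the VAD rule or the update formula, so both are assumptions here: the VAD is a hypothetical energy-ratio test with an arbitrary 3 dB threshold, and the update is a plain running mean. The per-bin mean magnitude stands in for the noise spectrum and the per-bin mean power for $\lambda_d$.

```python
import math

def initial_noise_estimate(magnitudes, niss):
    """Per-bin mean magnitude and mean power over the first `niss` frames."""
    bins = len(magnitudes[0])
    noise_mag = [sum(f[k] for f in magnitudes[:niss]) / niss for k in range(bins)]
    noise_var = [sum(f[k] ** 2 for f in magnitudes[:niss]) / niss for k in range(bins)]
    return noise_mag, noise_var

def is_noise_only(frame_mag, noise_mag, threshold_db=3.0):
    """Hypothetical VAD: treat the frame as noise if its energy is within
    `threshold_db` of the current noise estimate (threshold is an assumption)."""
    eps = 1e-12  # avoids division by zero and log of zero
    frame_energy = sum(m * m for m in frame_mag)
    noise_energy = sum(m * m for m in noise_mag)
    ratio_db = 10.0 * math.log10((frame_energy + eps) / (noise_energy + eps))
    return ratio_db < threshold_db

def update_noise(noise_mag, noise_var, frame_mag, count):
    """Running mean: fold one new noise frame into estimates built from `count` frames."""
    new_mag = [(count * n + m) / (count + 1) for n, m in zip(noise_mag, frame_mag)]
    new_var = [(count * v + m * m) / (count + 1) for v, m in zip(noise_var, frame_mag)]
    return new_mag, new_var
```

A caller would invoke `update_noise` only for frames where `is_noise_only` returns `True`, and leave the estimates untouched for speech frames.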
Signal to Noise Ratio
Using the Noise Power Spectrum, the a priori SNR ($\xi_k$) and the a posteriori SNR ($\gamma_k$) are calculated.
a posteriori SNR:
– $\gamma_k = R_k^2 / \lambda_d(k)$, where $R_k$ is the modulus of the signal-plus-noise spectral component
a priori SNR:
– $\xi_k(n) = \alpha G^2 \gamma_k(n-1) + (1-\alpha) P[\gamma_k(n) - 1]$, where $\alpha = 0.99$ is a smoothing factor, $G$ is the Gain Function from the MMSE estimator (evaluated for the previous frame), and $P[x]$ is defined as $x$ if $x > 0$ and 0 otherwise
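The two SNR estimates can be sketched per frequency bin as follows. The naming follows the Ephraim reference, where $\gamma_k$ is the a posteriori SNR and $\xi_k$ is the a priori SNR; the function and argument names are assumptions for illustration.

```python
def a_posteriori_snr(r, noise_var):
    """gamma_k = R_k^2 / lambda_d(k) for each frequency bin k."""
    return [rk ** 2 / ld for rk, ld in zip(r, noise_var)]

def a_priori_snr(gain_prev, gamma_prev, gamma_now, alpha=0.99):
    """xi_k(n) = alpha * G^2 * gamma_k(n-1) + (1 - alpha) * P[gamma_k(n) - 1],
    where G is the previous frame's gain and P[x] = max(x, 0)."""
    return [alpha * g ** 2 * gp + (1.0 - alpha) * max(gn - 1.0, 0.0)
            for g, gp, gn in zip(gain_prev, gamma_prev, gamma_now)]
```

For the first frame, a caller has no previous gain or SNR; initialising both to 1 for every bin is a common choice but is an assumption here.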
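Gain Calculation
The gain ($G$) applied to each spectral component is then updated using the Signal to Noise Ratios, following the estimator in the Ephraim reference:
– $G = \frac{\sqrt{\pi}}{2} \frac{\sqrt{\eta}}{\gamma_k} e^{-\eta/2} \left[ (1+\eta) I_0(\eta/2) + \eta I_1(\eta/2) \right]$
– where $\eta = \frac{\xi_k}{1+\xi_k} \gamma_k$, and $I_0$ and $I_1$ are the modified Bessel functions of order zero and one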
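A gain computation can be sketched as below, assuming the MMSE short-time spectral amplitude gain as given in the Ephraim reference, with $\eta = \frac{\xi_k}{1+\xi_k}\gamma_k$. The modified Bessel functions are evaluated with a plain power series to keep the example free of dependencies; the helper name `bessel_i` is an assumption, and a numerical library would normally be used instead.

```python
import math

def bessel_i(order, x, terms=200):
    """Modified Bessel function of the first kind I_order(x) via its power series.
    Adequate for moderate x; very large x would overflow and need a scaled variant."""
    term = (x / 2.0) ** order / math.factorial(order)
    total = term
    for k in range(1, terms):
        term *= (x * x / 4.0) / (k * (k + order))
        total += term
        if term <= 1e-17 * total:
            break
    return total

def mmse_stsa_gain(xi, gamma):
    """Gain for one bin from the a priori SNR `xi` and a posteriori SNR `gamma`."""
    eta = xi / (1.0 + xi) * gamma
    bessel_part = (1.0 + eta) * bessel_i(0, eta / 2.0) + eta * bessel_i(1, eta / 2.0)
    return (math.sqrt(math.pi) / 2.0) * (math.sqrt(eta) / gamma) * math.exp(-eta / 2.0) * bessel_part
```

At high SNR this gain approaches the Wiener-like value $\xi_k / (1 + \xi_k)$, which gives a quick sanity check on an implementation.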
Signal Enhancement and Reconstruction
The signal is cleaned by multiplying the FFT magnitude of each frame by the gain.
The enhanced signal is reconstructed from the modified magnitudes and the saved FFT phase using the overlap-add method.
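The reconstruction step might be sketched as follows, using only the standard library. A direct inverse DFT stands in for an inverse FFT to keep the example self-contained, and all function names are assumptions.

```python
import cmath
import math

def apply_gain(magnitude, phase, gain):
    """Scale each bin's magnitude by its gain and reattach the original phase."""
    return [g * m * cmath.exp(1j * p) for g, m, p in zip(gain, magnitude, phase)]

def idft(spectrum):
    """Direct inverse DFT; returns the real part of the time-domain frame."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]

def overlap_add(frames, hop):
    """Sum time-domain frames into one signal, each shifted by `hop` samples."""
    length = hop * (len(frames) - 1) + len(frames[0])
    output = [0.0] * length
    for i, frame in enumerate(frames):
        for t, sample in enumerate(frame):
            output[i * hop + t] += sample
    return output
```

With the earlier example values, `hop` would be 80 samples and each frame 200 samples long.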
Sample – Hair Dryer Background
Sample – Jack Hammer Background
Sample – Air Conditioner Background
Sample – Cafeteria Background
Sample – Automobile Background
Sample – Coffee Grinder Background
Sample – Fan Background
Sample – Feedback Background
Sample – White Noise Background
Sample – Static Background
References
Ephraim, Yariv, and David Malah. “Speech Enhancement Using a Minimum Mean-Square Error Short-Time Spectral Amplitude Estimator”. IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. 32, No. 6, pp. 1109-1121, December 1984.