Speech Enhancement Based on Nonparametric Factor Analysis
Lin Li (1), Jiawen Wu (1), Xinghao Ding (1), Qingyang Hong (1), Delu Zeng (2)
(1) School of Information Science and Technology, Xiamen University, China
(2) School of Mathematics, South China University of Technology, China
Reporter: Jiawen Wu
10/11/2016
Outline: Background of the Research; The Proposed Method; Experiment Setup; Experiment Results
Background: Speech Enhancement
Spectral Subtraction (SS) [Boll79]
Subspace [Moor93]
Wiener Filtering [Scalart96]
MMSE NPS [Cohen03]: minimum mean-square error algorithm using a non-causal a priori SNR
MMSE MAP [Paliwal12]: maximum a posteriori estimator of the magnitude-squared spectrum
Sparse Representation
  K-SVD: K-singular value decomposition [Zhao11]
  CLSMD: constrained low-rank and sparse matrix decomposition [Sun14]

References
[Boll79] S. F. Boll, “Suppression of acoustic noise in speech using spectral subtraction,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 27, no. 2, pp. 113–120, 1979.
[Moor93] B. De Moor, “The singular value decomposition and long and short spaces of noisy matrices,” IEEE Transactions on Signal Processing, vol. 41, no. 9, pp. 2826–2838, 1993.
[Cohen03] I. Cohen, “Noise spectrum estimation in adverse environments: Improved minima controlled recursive averaging,” IEEE Transactions on Speech and Audio Processing, vol. 11, no. 5, pp. 466–475, 2003.
[Paliwal12] K. Paliwal, B. Schwerin et al., “Speech enhancement using a minimum mean-square error short-time spectral modulation magnitude estimator,” Speech Communication, vol. 54, no. 2, pp. 282–305, 2012.
[Scalart96] P. Scalart et al., “Speech enhancement based on a priori signal to noise estimation,” in ICASSP 1996, pp. 629–632.
[Zhao11] N. Zhao, X. Xu, and Y. Yang, “Sparse representations for speech enhancement,” Chinese Journal of Electronics, vol. 19, no. 2, pp. 268–272, 2011.
[Sun14] C. Sun, Q. Zhu, and M. Wan, “A novel speech enhancement method based on constrained low-rank and sparse matrix decomposition,” Speech Communication, vol. 60, pp. 44–55, 2014.
The Proposed Method: a sparse representation framework with a nonparametric dictionary learning model based on beta process factor analysis
Contributions
1. Nonparametric: the average sparsity level of the representation and the dictionary size can be learned by using a beta process.
2. The noise variance is not required: it can be inferred automatically after analytical posterior calculation.
3. An in situ training process: speech is processed in situ, so the dictionary does not have to be trained beforehand.
Problem formulation
K-SVD [1]: two formulations, constrained either by a sparsity level L or by a threshold σ
[1] N. Zhao, X. Xu, and Y. Yang, “Sparse representations for speech enhancement,” Chinese Journal of Electronics, vol. 19, no. 2, pp. 268–272, 2011.
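To make the role of L and σ concrete, here is a minimal sketch of the sparse coding step that K-SVD-style methods typically rely on, using orthogonal matching pursuit. The function name `omp` and its arguments are illustrative assumptions, not taken from [1]; the point is only that one of the two stopping rules has to be chosen by hand.

```python
import numpy as np

def omp(D, x, L=None, sigma=None):
    """Greedy orthogonal matching pursuit for a single frame x.

    Stops after L atoms (fixed sparsity level) or as soon as the
    residual norm is at most sigma (error threshold).
    Hypothetical helper for illustration only.
    """
    if L is None and sigma is None:
        raise ValueError("give a sparsity level L or a threshold sigma")
    n_atoms = D.shape[1]
    max_atoms = n_atoms if L is None else min(L, n_atoms)
    x = np.asarray(x, dtype=float)
    support = []
    coef = np.zeros(0)
    residual = x.copy()
    while len(support) < max_atoms:
        if sigma is not None and np.linalg.norm(residual) <= sigma:
            break
        corr = np.abs(D.T @ residual)
        if support:
            corr[support] = -1.0  # never pick the same atom twice
        support.append(int(np.argmax(corr)))
        # Least-squares refit on the atoms selected so far
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coef
    alpha = np.zeros(n_atoms)
    if support:
        alpha[support] = coef
    return alpha
```

With a fixed L every frame uses the same number of atoms; with σ the count varies, but σ has to track the noise level, which is what the Match/Mismatch comparison in the results is about.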
Architecture
[Equations (1)–(8): prior and posterior of the model]
Via variational Bayesian inference [Paisley09] or Gibbs sampling, a full posterior density can be inferred to update D and α, along with all other model parameters.
[Paisley09] J. Paisley and L. Carin, “Nonparametric factor analysis with beta process priors,” in Proceedings of the 26th Annual International Conference on Machine Learning. ACM, 2009, pp. 777–784.
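As a rough illustration of the kind of prior involved, the sketch below draws data from a commonly used beta process factor analysis form: x_i = D(z_i ∘ s_i) + ε_i, with Gaussian dictionary atoms, z_ik ~ Bernoulli(π_k) and π_k ~ Beta(a0/K, b0(K−1)/K). This form is an assumption based on the general BPFA literature, not a transcription of the slide's equations. The precisions `gamma_s` and `gamma_eps` are fixed to hypothetical values here instead of being drawn from gamma priors.

```python
import numpy as np

def sample_bpfa(n_dim, n_frames, K, a0, b0,
                gamma_s=1.0, gamma_eps=100.0, seed=0):
    """Draw one data set from an assumed BPFA generative model."""
    rng = np.random.default_rng(seed)
    # Dictionary atoms: d_k ~ N(0, I / n_dim)
    D = rng.normal(0.0, np.sqrt(1.0 / n_dim), size=(n_dim, K))
    # Atom usage probabilities: pi_k ~ Beta(a0/K, b0(K-1)/K)
    pi = rng.beta(a0 / K, b0 * (K - 1) / K, size=K)
    # Binary usage indicators: z_ik ~ Bernoulli(pi_k)
    Z = rng.random((K, n_frames)) < pi[:, None]
    # Weights: s_i ~ N(0, I / gamma_s)
    S = rng.normal(0.0, np.sqrt(1.0 / gamma_s), size=(K, n_frames))
    alpha = Z * S  # sparse codes
    noise = rng.normal(0.0, np.sqrt(1.0 / gamma_eps), size=(n_dim, n_frames))
    X = D @ alpha + noise
    return X, D, alpha, pi
```

Because most π_k are drawn close to zero, only a few atoms end up active, which is how the effective dictionary size and sparsity level come out of the model rather than being fixed in advance.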
Setup (1): Parameters
Database: Standard NOIZEUS database [Loizou13]
Noise type: White, Train, Street
Noise level: 0 dB, 5 dB, 10 dB and 15 dB
Frame size: 128 points
Increase step: 1 point
Initial dictionary size: 512
Hyper-parameters: c0 = d0 = e0 = f0 = 106
Parameters of the beta distribution: a0 = 1; b0 = P/9
Quality evaluation: SNR and SegSNR [Hu07]; PESQ (Perceptual Evaluation of Speech Quality) [P01]

References
[Loizou13] P. C. Loizou, Speech Enhancement: Theory and Practice. CRC Press, 2013.
[Hu07] Y. Hu and P. C. Loizou, “Subjective comparison and evaluation of speech enhancement algorithms,” Speech Communication, vol. 49, no. 7, pp. 588–601, 2007.
[P01] ITU-T Recommendation P.862, “Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs,” Feb. 2001.
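The frame size and increase step above describe heavily overlapping frames. A minimal sketch of how a signal could be cut into such frames and put back together is given below; reconstruction by averaging the overlapping samples is an assumption made for illustration, and both function names are hypothetical.

```python
import numpy as np

def frame_signal(x, frame_size=128, step=1):
    """Stack overlapping frames as columns: shape (frame_size, n_frames)."""
    x = np.asarray(x, dtype=float)
    n_frames = (len(x) - frame_size) // step + 1
    idx = (np.arange(frame_size)[:, None]
           + step * np.arange(n_frames)[None, :])
    return x[idx]

def overlap_average(frames, step=1):
    """Rebuild a signal by averaging every sample over the frames covering it."""
    frame_size, n_frames = frames.shape
    length = (n_frames - 1) * step + frame_size
    total = np.zeros(length)
    count = np.zeros(length)
    for j in range(n_frames):
        start = j * step
        total[start:start + frame_size] += frames[:, j]
        count[start:start + frame_size] += 1
    return total / count
```

With a step of 1 point, a signal of T samples yields T − 127 frames, so the number of frames grows almost one-to-one with the utterance length.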
Setup (2)
[Figure: output SNR versus iteration for different values of b0; 100 iterations]
Input speech: the utterance “the birch canoe slid on the smooth planks”, corrupted with street noise at 0 dB
[Equation: the posterior], where P is the number of frames of the input speech
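One way to see why b0 matters relative to P: in the usual BPFA Gibbs sampler, the conditional for each usage probability is π_k ~ Beta(a0/K + n_k, b0(K−1)/K + P − n_k), where n_k is the number of frames using atom k. That conditional is assumed here from the general BPFA literature, not recovered from the slide. The hypothetical helper below returns its mean.

```python
def pi_posterior_mean(n_k, P, K, a0, b0):
    """Mean of the assumed Beta conditional for pi_k.

    n_k: number of frames that currently use atom k
    P:   total number of frames
    K:   dictionary size
    """
    a_post = a0 / K + n_k
    b_post = b0 * (K - 1) / K + P - n_k
    return a_post / (a_post + b_post)
```

For example, with P = 9000, K = 512, a0 = 1 and n_k = 900, setting b0 = P/9 gives a mean of about 0.09, while b0 = 1000×P pushes it to roughly 0.0001. A larger b0 therefore drives the model toward using fewer atoms.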
Setup (2): an extra handling step
Start with b0 = P/9.
Has the output SNR declined ten times in a row?
  Yes: change b0 to a larger number, e.g., 1000×P.
  No: leave b0 unchanged.
This remains a great challenge for further investigation, since the output SNR is unavailable in practical applications.
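The decision rule on this slide can be sketched as a small function. The name `update_b0` and its arguments are illustrative assumptions; as the slide notes, the output SNR it monitors needs the clean reference, so this only applies in an experimental setting.

```python
def update_b0(snr_history, b0, P, patience=10, factor=1000):
    """Return the b0 to use for the next iteration.

    snr_history: output SNR after each iteration so far
    patience:    number of consecutive declines that triggers the change
    factor:      multiplier for the larger b0 (1000 x P on the slide)
    """
    if len(snr_history) <= patience:
        return b0  # not enough iterations to see `patience` declines
    recent = snr_history[-(patience + 1):]
    declined = all(later < earlier
                   for earlier, later in zip(recent, recent[1:]))
    return factor * P if declined else b0
```

A caller would append the new output SNR after every iteration and pass the running list, so b0 stays at its initial value until ten declines in a row have been observed.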
Results: comparison with K-SVD
[Figure: (a) PESQ; (b) SegSNR]
Noise type: Gaussian white noise
SegSNR / PESQ: mean values calculated over the 30 utterances at each input SNR
Match: the noise variance estimate for K-SVD matches the ground truth.
Mismatch: the noise variance estimate for K-SVD does not match the ground truth.
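For reference, a minimal sketch of how SNR and segmental SNR are commonly computed from a clean signal and its enhanced version. The frame length of 256 samples and the clamp to [−10, 35] dB are widely used conventions assumed here, not settings stated on the slides.

```python
import numpy as np

def snr_db(clean, enhanced):
    """Overall SNR in dB, treating (clean - enhanced) as residual noise."""
    noise = clean - enhanced
    return 10.0 * np.log10(np.sum(clean ** 2) / np.sum(noise ** 2))

def seg_snr_db(clean, enhanced, frame_len=256, lo=-10.0, hi=35.0, eps=1e-12):
    """Mean of per-frame SNRs, each clamped to [lo, hi] dB."""
    n_frames = len(clean) // frame_len
    values = []
    for j in range(n_frames):
        s = clean[j * frame_len:(j + 1) * frame_len]
        e = s - enhanced[j * frame_len:(j + 1) * frame_len]
        ratio = np.sum(s ** 2) / (np.sum(e ** 2) + eps)
        values.append(np.clip(10.0 * np.log10(ratio + eps), lo, hi))
    return float(np.mean(values))
```

Averaging per-frame values gives quiet segments the same weight as loud ones, which is why SegSNR is reported alongside the overall SNR.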
Results: statistics of nonparametric dictionary learning
[Figure: (a) sorted final probabilities of dictionary elements (π_k); (b) distribution of the number of elements used per frame]
Input utterance: “we talked of the sideshow in the circus” (“sp19.wav”) with input SNR of 0 dB
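Both statistics can be read off the learned model in a few lines. The sketch below assumes a binary usage matrix `Z` (dictionary elements by frames) and a vector `pi` of element probabilities; both names are hypothetical.

```python
import numpy as np

def usage_statistics(Z, pi):
    """Return (a) probabilities sorted in descending order and
    (b) a histogram where entry m counts frames using exactly m elements."""
    sorted_pi = np.sort(np.asarray(pi, dtype=float))[::-1]
    per_frame = np.asarray(Z, dtype=bool).sum(axis=0)
    counts = np.bincount(per_frame)
    return sorted_pi, counts
```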
Results
Thanks!