Speech Enhancement Based on Deep Learning Jiawen Wu 2017/4/27
Background Task: model the mapping between noisy and clean speech signals. 1989, Tamura [1]: time domain. 1994, Xie and Van Compernolle [2]: frequency domain. 2006, Hinton [3]: RBM. Since 2006: classification tasks [4-5], auto-encoders [6]. Early networks were small in scale, simple in structure, and trained on few samples; with no reliable initialization scheme, they easily fell into local optima. [1] S. I. Tamura, "An analysis of a noise reduction neural network," in Proc. ICASSP, 1989, pp. 2001–2004. [2] F. Xie and D. V. Compernolle, "A family of MLP based nonlinear spectral estimators for noise reduction," in Proc. ICASSP, 1994, pp. 53–56. [3] G. E. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, pp. 504–507, 2006. [4] Y. X. Wang and D. L. Wang, "Towards scaling up classification-based speech separation," IEEE Trans. Audio, Speech, Lang. Process., vol. 21, no. 7, pp. 1381–1390, Jul. 2013. [5] E. W. Healy, S. E. Yoho, Y. X. Wang, and D. L. Wang, "An algorithm to improve speech recognition in noise for hearing-impaired listeners," J. Acoust. Soc. Amer., vol. 134, no. 4, pp. 3029–3038, 2013. [6] X.-G. Lu, Y. Tsao, S. Matsuda, and C. Hori, "Speech enhancement based on deep denoising auto-encoder," in Proc. Interspeech, 2013, pp. 436–440.
Yong Xu, University of Science and Technology of China. Personal homepage: http://home.ustc.edu.cn/~xuyong62/index.html Xu Y., Du J., Dai L. R., et al., "An experimental study on speech enhancement based on deep neural networks," IEEE Signal Processing Letters, 2014, 21(1): 65-68. Cited: 81. Xu Y., Du J., Dai L. R., et al., "A regression approach to speech enhancement based on deep neural networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2015, 23(1): 7-19. Cited: 20. (Citation counts as of 2016.01.09.)
Contributions: a nonlinear regression-based enhancement framework using DNNs; a large training set of 2500 hours covering more than 100 noise types; acoustic context information; generalization to unseen and non-stationary noise.
DNN-based SE system
Baseline System A nonlinear regression model that learns the mapping from noisy speech features to clean speech features.
Baseline System Features: log-power spectra, normalized to zero mean and unit variance; the noisy phase is kept for waveform reconstruction. J. Du and Q. Huo, "A speech enhancement approach using piecewise linear approximation of an explicit model of environmental distortions," in Proc. Interspeech, 2008, pp. 569–572.
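The front end above can be sketched as follows. This is a minimal illustration, not the authors' code; the frame length, hop size, and Hamming window are assumptions:

```python
import numpy as np

def log_power_spectrum(signal, frame_len=512, hop=256):
    """Frame a waveform, window it, and return the log-power spectrum
    (the DNN input/target feature) plus the noisy phase, which is
    reused when the enhanced waveform is resynthesized."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hamming(frame_len)
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, axis=1)
    log_power = np.log(np.abs(spec) ** 2 + 1e-12)  # small floor avoids log(0)
    phase = np.angle(spec)
    return log_power, phase

def normalize(feats):
    """Per-dimension normalization to zero mean and unit variance,
    as stated on the slide."""
    mean = feats.mean(axis=0)
    std = feats.std(axis=0) + 1e-12
    return (feats - mean) / std, mean, std
```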
Baseline System Fine-tuning A nonlinear regression function from noisy speech features to clean speech features, fine-tuned by minimizing the mean squared error:

E = (1/N) * sum_{n=1}^{N} sum_{d=1}^{D} (Xhat_n(d) - X_n(d))^2

E: mean squared error; W: weight parameters; b: bias parameters; Xhat_n(d): the d-th frequency bin of the enhanced log-spectral feature at sample index n; X_n(d): the corresponding target (clean) frequency bin. The weights and biases are updated by stochastic gradient descent on E.
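The fine-tuning objective can be sketched end to end with a tiny regression net. All sizes, the learning rate, and the synthetic data are illustrative, not the paper's configuration:

```python
import numpy as np

def mse(pred, target):
    """E = (1/N) * sum_n sum_d (Xhat_n(d) - X_n(d))^2, the fine-tuning objective."""
    return np.mean(np.sum((pred - target) ** 2, axis=1))

# One sigmoid hidden layer with a linear output, trained by plain
# gradient descent on toy "noisy -> clean" pairs.
rng = np.random.default_rng(0)
D_in, H, D_out, N = 8, 16, 8, 64
W1 = 0.1 * rng.standard_normal((D_in, H)); b1 = np.zeros(H)
W2 = 0.1 * rng.standard_normal((H, D_out)); b2 = np.zeros(D_out)

noisy = rng.standard_normal((N, D_in))
clean = 0.5 * noisy                               # toy clean targets

lr = 0.05
loss_start = None
for step in range(200):
    h = 1.0 / (1.0 + np.exp(-(noisy @ W1 + b1)))  # hidden activations
    pred = h @ W2 + b2                            # enhanced features
    if loss_start is None:
        loss_start = mse(pred, clean)
    g = 2.0 * (pred - clean) / N                  # dE/dpred
    gW2, gb2 = h.T @ g, g.sum(0)
    gh = (g @ W2.T) * h * (1.0 - h)               # backprop through sigmoid
    gW1, gb1 = noisy.T @ gh, gh.sum(0)
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1
loss_end = mse(pred, clean)
```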
Improved System Fine-tuning The input is the expanded noisy log-spectral feature vector Y_{n-τ}^{n+τ}, concatenating neighboring frames with a context window of size 2τ+1:

E = (1/N) * sum_{n=1}^{N} sum_{d=1}^{D} (Xhat_n(d; Y_{n-τ}^{n+τ}, W, b) - X_n(d))^2

The weights and biases are updated with weight decay and momentum:

ΔW_{t+1} = -λ * ∂E/∂W_t - κ*λ*W_t + ω*ΔW_t,  W_{t+1} = W_t + ΔW_{t+1}

E: mean squared error; W: weight parameters; b: bias parameters; Xhat_n(d): the d-th frequency bin of the enhanced log-spectral feature at sample index n; X_n(d): the corresponding target frequency bin; λ: the learning rate; κ: the weight decay coefficient; ω: the momentum.
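The update rule with weight decay and momentum can be sketched as below, with a sanity check on a one-dimensional quadratic; the hyper-parameter values are assumptions:

```python
import numpy as np

def sgd_step(W, grad, delta, lr=0.1, kappa=1e-4, omega=0.9):
    """One update following the slide's rule:
    Delta_{t+1} = -lr * dE/dW - kappa * lr * W + omega * Delta_t
    W_{t+1}     = W_t + Delta_{t+1}"""
    delta = -lr * grad - kappa * lr * W + omega * delta
    return W + delta, delta

# Sanity check on E(w) = 0.5 * w^2, whose gradient is w:
# the iterates should converge toward the minimum at 0.
w, d = np.array([5.0]), np.zeros(1)
for _ in range(300):
    w, d = sgd_step(w, grad=w, delta=d)
```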
Improved System Post-processing: Global Variance Equalization. Dropout training. Noise estimation: noise-aware training (NAT).
Global Variance Equalization A simple type of histogram equalization. The global variance of the d-th dimension of the estimated clean speech features is defined as:

GV_est(d) = (1/N) * sum_{n=1}^{N} (Xhat_n(d) - mean_n[Xhat_n(d)])^2

A dimension-independent global variance can be computed by also averaging over the D feature dimensions:

GV_est = (1/(N*D)) * sum_{n=1}^{N} sum_{d=1}^{D} (Xhat_n(d) - mean[Xhat])^2

A. D. L. Torre, A. M. Peinado, J. C. Segura, J. L. Perez-Cordoba, M. C. Benitez, and A. J. Rubio, "Histogram equalization of speech representation for robust speech recognition," IEEE Trans. Speech Audio Process., vol. 13, no. 3, pp. 355–366, May 2005.
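A per-dimension sketch of the equalization follows; the exact form of the scaling (and that the factor here is computed on the given data rather than over a whole training set) is an assumption:

```python
import numpy as np

def global_variance(feats):
    """GV(d) = (1/N) * sum_n (X_n(d) - mean(d))^2 over the N frames."""
    return feats.var(axis=0)

def gv_equalize(enhanced, reference):
    """Scale each dimension's deviation from its mean by
    alpha(d) = sqrt(GV_ref(d) / GV_est(d)), so that the enhanced
    features' global variance matches the reference (clean) statistics.
    Averaging alpha over d would give the dimension-independent variant."""
    mu = enhanced.mean(axis=0)
    alpha = np.sqrt(global_variance(reference)
                    / (global_variance(enhanced) + 1e-12))
    return mu + alpha * (enhanced - mu)
```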
Dropout Training [1] During DNN training, dropout randomly omits a certain percentage of the neurons in the input and each hidden layer for every presentation of a training sample; this can be viewed as model averaging and helps avoid over-fitting. Noise-Aware Training (NAT) [2] An estimate of the noise is appended to the input features so the network is informed about the noise condition. [1] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, "Improving neural networks by preventing co-adaptation of feature detectors," arXiv, 2012 [Online]. [2] Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, "Dynamic noise aware training for speech enhancement based on deep neural networks," in Proc. Interspeech, 2014.
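A minimal dropout sketch, using the inverted-dropout convention that rescales at training time (the original paper instead scales the weights at test time; the two are equivalent in expectation):

```python
import numpy as np

def dropout(activations, rate, rng, train=True):
    """Randomly zero a `rate` fraction of units during training and
    rescale the survivors by 1/(1-rate), so the expected activation is
    unchanged and no rescaling is needed at test time."""
    if not train or rate == 0.0:
        return activations
    mask = rng.random(activations.shape) >= rate
    return activations * mask / (1.0 - rate)
```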
Measures SSNR: segmental SNR. LSD: log-spectral distortion. PESQ: perceptual evaluation of speech quality. STOI: short-time objective intelligibility. [1] J. Du and Q. Huo, "A speech enhancement approach using piecewise linear approximation of an explicit model of environmental distortions," in Proc. Interspeech, 2008, pp. 569–572. [2] ITU-T Rec. P.862, "Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs," International Telecommunication Union, 2001. [3] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, "An algorithm for intelligibility prediction of time-frequency weighted noisy speech," IEEE Trans. Audio, Speech, Lang. Process., Sep. 2011.
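The first two measures can be sketched as follows; the SSNR clipping range and the dB conversion for LSD follow common convention and are assumptions here (PESQ and STOI are standardized algorithms and are not reimplemented):

```python
import numpy as np

def segmental_snr(clean, enhanced, frame_len=256, lo=-10.0, hi=35.0):
    """Frame-wise SNR in dB between clean and enhanced waveforms,
    clipped per frame to [lo, hi] and averaged over frames."""
    n = (min(len(clean), len(enhanced)) // frame_len) * frame_len
    c = np.asarray(clean[:n]).reshape(-1, frame_len)
    e = np.asarray(enhanced[:n]).reshape(-1, frame_len)
    noise = c - e
    snr = 10.0 * np.log10((np.sum(c**2, axis=1) + 1e-12)
                          / (np.sum(noise**2, axis=1) + 1e-12))
    return float(np.mean(np.clip(snr, lo, hi)))

def log_spectral_distortion(lps_a, lps_b):
    """Per-frame RMS distance between two log-power-spectrum matrices
    (frames x bins), converted from natural log to dB, averaged over frames."""
    diff_db = (10.0 / np.log(10.0)) * (np.asarray(lps_a) - np.asarray(lps_b))
    return float(np.mean(np.sqrt(np.mean(diff_db**2, axis=1))))
```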
Experiment Setup The Depth of DNN The Number of Noise Types
Experiment Setup The Size of Training Set The Length of Acoustic Context
Experiment Results Noise Aware Training Global Variance Equalization
Experiment Results Unseen Noise (PESQ): Noisy 1.42, LogMMSE 1.83, DNN-baseline 1.87, GVE 2.00, Dropout 2.06, Dropout&GVE 2.13, Dropout&GVE&NAT 2.25, Clean 4.50
Experiment Results Non-stationary and Unseen Noise (PESQ): Noisy 1.85, 4-noise DNN 2.14, 104-noise DNN 2.78; LogMMSE did not work
Experiment Results Changing Noise Environments (PESQ): Noisy 2.05, LogMMSE 1.46, DNN 2.99, Clean 4.50
Experiment Results Real-world
Experiment Results Overall Evaluation on 15 Unseen Noise Types
Experiment Results 32 real-world noisy utterances (22 spoken in English, the rest in other languages), evaluated by 10 listeners: five Chinese males and five Chinese females
Summary Experiment setup; input data processing; changing the deep model
THANK YOU