Speech Enhancement Based on Deep Learning Jiawen Wu 2017/4/27
Background Task: model the mapping between noisy and clean speech signals. 1989, Tamura [1]: time domain. 1994, Xie and Van Compernolle [2]: frequency domain. 2006, Hinton [3]: RBM. Since 2006: classification tasks [4-5], auto-encoders [6]. Early networks were small in scale, simple in structure, and trained on few samples; with no reliable initialization scheme, they easily fell into local optima. [1] S. I. Tamura, "An analysis of a noise reduction neural network," in Proc. ICASSP, 1989, pp. 2001–2004. [2] F. Xie and D. V. Compernolle, "A family of MLP based nonlinear spectral estimators for noise reduction," in Proc. ICASSP, 1994, pp. 53–56. [3] G. E. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, pp. 504–507, 2006. [4] Y. X. Wang and D. L. Wang, "Towards scaling up classification-based speech separation," IEEE Trans. Audio, Speech, Lang. Process., vol. 21, no. 7, pp. 1381–1390, Jul. 2013. [5] E. W. Healy, S. E. Yoho, Y. X. Wang, and D. L. Wang, "An algorithm to improve speech recognition in noise for hearing-impaired listeners," J. Acoust. Soc. Amer., vol. 134, no. 4, pp. 3029–3038, 2013. [6] X.-G. Lu, Y. Tsao, S. Matsuda, and C. Hori, "Speech enhancement based on deep denoising auto-encoder," in Proc. Interspeech, 2013, pp. 436–440.
Yong Xu, University of Science and Technology of China. Personal homepage: http://home.ustc.edu.cn/~xuyong62/index.html Xu Y., Du J., Dai L. R., et al., "An experimental study on speech enhancement based on deep neural networks," IEEE Signal Processing Letters, 2014, 21(1): 65-68. Cited: 81. Xu Y., Du J., Dai L. R., et al., "A regression approach to speech enhancement based on deep neural networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2015, 23(1): 7-19. Cited: 20. (Citation counts as of 2016.01.09.)
Contributions: a nonlinear regression-based enhancement framework using DNNs; a large training set of 2500 hours covering more than 100 noise types; acoustic context information; generalization to unseen and non-stationary noise.
DNN-based SE system
Baseline System A nonlinear regression model that learns the mapping from noisy speech features to clean speech features.
Baseline System Features: log-power spectra, normalized to zero mean and unit variance; the noisy phase is kept for waveform reconstruction. J. Du and Q. Huo, "A speech enhancement approach using piecewise linear approximation of an explicit model of environmental distortions," in Proc. Interspeech, 2008, pp. 569–572.
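The front end above can be sketched as follows. This is a minimal illustration, not the authors' code; the frame length, hop size, and Hamming window are assumptions:

```python
import numpy as np

def log_power_spectrum(signal, frame_len=512, hop=256):
    """Frame a waveform, window it, and return the log-power spectrum
    (the DNN input/target feature) plus the noisy phase, which is
    reused when the enhanced waveform is resynthesized."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hamming(frame_len)
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, axis=1)
    log_power = np.log(np.abs(spec) ** 2 + 1e-12)  # small floor avoids log(0)
    phase = np.angle(spec)
    return log_power, phase

def normalize(feats):
    """Per-dimension normalization to zero mean and unit variance,
    as stated on the slide."""
    mean = feats.mean(axis=0)
    std = feats.std(axis=0) + 1e-12
    return (feats - mean) / std, mean, std
```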
Baseline System Fine-tuning A nonlinear regression function from noisy speech features to clean speech features, fine-tuned by minimizing the mean squared error:

E = (1/N) * sum_{n=1}^{N} sum_{d=1}^{D} (Xhat_n(d) - X_n(d))^2

E: mean squared error; W: weight parameters; b: bias parameters; Xhat_n(d): the d-th frequency bin of the enhanced log-spectral feature at sample index n; X_n(d): the corresponding target (clean) frequency bin. The weights and biases are updated by stochastic gradient descent on E.
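The fine-tuning objective can be sketched end to end with a tiny regression net. All sizes, the learning rate, and the synthetic data are illustrative, not the paper's configuration:

```python
import numpy as np

def mse(pred, target):
    """E = (1/N) * sum_n sum_d (Xhat_n(d) - X_n(d))^2, the fine-tuning objective."""
    return np.mean(np.sum((pred - target) ** 2, axis=1))

# One sigmoid hidden layer with a linear output, trained by plain
# gradient descent on toy "noisy -> clean" pairs.
rng = np.random.default_rng(0)
D_in, H, D_out, N = 8, 16, 8, 64
W1 = 0.1 * rng.standard_normal((D_in, H)); b1 = np.zeros(H)
W2 = 0.1 * rng.standard_normal((H, D_out)); b2 = np.zeros(D_out)

noisy = rng.standard_normal((N, D_in))
clean = 0.5 * noisy                               # toy clean targets

lr = 0.05
loss_start = None
for step in range(200):
    h = 1.0 / (1.0 + np.exp(-(noisy @ W1 + b1)))  # hidden activations
    pred = h @ W2 + b2                            # enhanced features
    if loss_start is None:
        loss_start = mse(pred, clean)
    g = 2.0 * (pred - clean) / N                  # dE/dpred
    gW2, gb2 = h.T @ g, g.sum(0)
    gh = (g @ W2.T) * h * (1.0 - h)               # backprop through sigmoid
    gW1, gb1 = noisy.T @ gh, gh.sum(0)
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1
loss_end = mse(pred, clean)
```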
Improved System Fine-tuning The input is the expanded noisy log-spectral feature vector Y_{n-τ}^{n+τ}, concatenating neighboring frames with a context window of size 2τ+1:

E = (1/N) * sum_{n=1}^{N} sum_{d=1}^{D} (Xhat_n(d; Y_{n-τ}^{n+τ}, W, b) - X_n(d))^2

The weights and biases are updated with weight decay and momentum:

ΔW_{t+1} = -λ * ∂E/∂W_t - κ*λ*W_t + ω*ΔW_t,  W_{t+1} = W_t + ΔW_{t+1}

E: mean squared error; W: weight parameters; b: bias parameters; Xhat_n(d): the d-th frequency bin of the enhanced log-spectral feature at sample index n; X_n(d): the corresponding target frequency bin; λ: the learning rate; κ: the weight decay coefficient; ω: the momentum.
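The update rule with weight decay and momentum can be sketched as below, with a sanity check on a one-dimensional quadratic; the hyper-parameter values are assumptions:

```python
import numpy as np

def sgd_step(W, grad, delta, lr=0.1, kappa=1e-4, omega=0.9):
    """One update following the slide's rule:
    Delta_{t+1} = -lr * dE/dW - kappa * lr * W + omega * Delta_t
    W_{t+1}     = W_t + Delta_{t+1}"""
    delta = -lr * grad - kappa * lr * W + omega * delta
    return W + delta, delta

# Sanity check on E(w) = 0.5 * w^2, whose gradient is w:
# the iterates should converge toward the minimum at 0.
w, d = np.array([5.0]), np.zeros(1)
for _ in range(300):
    w, d = sgd_step(w, grad=w, delta=d)
```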
Improved System Post-processing: Global Variance Equalization. Dropout training. Noise estimation: noise-aware training (NAT).
Global Variance Equalization A simple type of histogram equalization. The global variance of the d-th dimension of the estimated clean speech features is defined as:

GV_est(d) = (1/N) * sum_{n=1}^{N} (Xhat_n(d) - mean_n[Xhat_n(d)])^2

A dimension-independent global variance can be computed by also averaging over the D feature dimensions:

GV_est = (1/(N*D)) * sum_{n=1}^{N} sum_{d=1}^{D} (Xhat_n(d) - mean[Xhat])^2

A. D. L. Torre, A. M. Peinado, J. C. Segura, J. L. Perez-Cordoba, M. C. Benitez, and A. J. Rubio, "Histogram equalization of speech representation for robust speech recognition," IEEE Trans. Speech Audio Process., vol. 13, no. 3, pp. 355–366, May 2005.
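A per-dimension sketch of the equalization follows; the exact form of the scaling (and that the factor here is computed on the given data rather than over a whole training set) is an assumption:

```python
import numpy as np

def global_variance(feats):
    """GV(d) = (1/N) * sum_n (X_n(d) - mean(d))^2 over the N frames."""
    return feats.var(axis=0)

def gv_equalize(enhanced, reference):
    """Scale each dimension's deviation from its mean by
    alpha(d) = sqrt(GV_ref(d) / GV_est(d)), so that the enhanced
    features' global variance matches the reference (clean) statistics.
    Averaging alpha over d would give the dimension-independent variant."""
    mu = enhanced.mean(axis=0)
    alpha = np.sqrt(global_variance(reference)
                    / (global_variance(enhanced) + 1e-12))
    return mu + alpha * (enhanced - mu)
```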
Dropout Training [1] During DNN training, dropout randomly omits a certain percentage of the neurons in the input and each hidden layer for every presentation of a training sample; this can be viewed as model averaging and helps avoid over-fitting. Noise-Aware Training (NAT) [2] An estimate of the noise is appended to the input features so the network is informed about the noise condition. [1] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, "Improving neural networks by preventing co-adaptation of feature detectors," arXiv, 2012 [Online]. [2] Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, "Dynamic noise aware training for speech enhancement based on deep neural networks," in Proc. Interspeech, 2014.
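A minimal dropout sketch, using the inverted-dropout convention that rescales at training time (the original paper instead scales the weights at test time; the two are equivalent in expectation):

```python
import numpy as np

def dropout(activations, rate, rng, train=True):
    """Randomly zero a `rate` fraction of units during training and
    rescale the survivors by 1/(1-rate), so the expected activation is
    unchanged and no rescaling is needed at test time."""
    if not train or rate == 0.0:
        return activations
    mask = rng.random(activations.shape) >= rate
    return activations * mask / (1.0 - rate)
```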
Measures SSNR: segmental SNR. LSD: log-spectral distortion. PESQ: perceptual evaluation of speech quality. STOI: short-time objective intelligibility. [1] J. Du and Q. Huo, "A speech enhancement approach using piecewise linear approximation of an explicit model of environmental distortions," in Proc. Interspeech, 2008, pp. 569–572. [2] ITU-T Rec. P.862, "Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs," International Telecommunication Union, 2001. [3] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, "An algorithm for intelligibility prediction of time-frequency weighted noisy speech," IEEE Trans. Audio, Speech, Lang. Process., Sep. 2011.
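The first two measures can be sketched as follows; the SSNR clipping range and the dB conversion for LSD follow common convention and are assumptions here (PESQ and STOI are standardized algorithms and are not reimplemented):

```python
import numpy as np

def segmental_snr(clean, enhanced, frame_len=256, lo=-10.0, hi=35.0):
    """Frame-wise SNR in dB between clean and enhanced waveforms,
    clipped per frame to [lo, hi] and averaged over frames."""
    n = (min(len(clean), len(enhanced)) // frame_len) * frame_len
    c = np.asarray(clean[:n]).reshape(-1, frame_len)
    e = np.asarray(enhanced[:n]).reshape(-1, frame_len)
    noise = c - e
    snr = 10.0 * np.log10((np.sum(c**2, axis=1) + 1e-12)
                          / (np.sum(noise**2, axis=1) + 1e-12))
    return float(np.mean(np.clip(snr, lo, hi)))

def log_spectral_distortion(lps_a, lps_b):
    """Per-frame RMS distance between two log-power-spectrum matrices
    (frames x bins), converted from natural log to dB, averaged over frames."""
    diff_db = (10.0 / np.log(10.0)) * (np.asarray(lps_a) - np.asarray(lps_b))
    return float(np.mean(np.sqrt(np.mean(diff_db**2, axis=1))))
```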
Experiment Setup The Depth of DNN The Number of Noise Types
Experiment Setup The Size of Training Set The Length of Acoustic Context
Experiment Results Noise Aware Training Global Variance Equalization
Experiment Results Unseen Noise (PESQ): Noisy 1.42, LogMMSE 1.83, DNN-baseline 1.87, GVE 2.00, Dropout 2.06, Dropout&GVE 2.13, Dropout&GVE&NAT 2.25, Clean 4.50
Experiment Results Non-stationary and Unseen Noise (PESQ): Noisy 1.85, 4-noise DNN 2.14, 104-noise DNN 2.78; LogMMSE did not work
Experiment Results Changing Noise Environments (PESQ): Noisy 2.05, LogMMSE 1.46, DNN 2.99, Clean 4.50
Experiment Results Real-world
Experiment Results Overall Evaluation on 15 Unseen Noise Types
Experiment Results 32 real-world noisy utterances (22 spoken in English, the rest in other languages), evaluated by 10 listeners: five Chinese males and five Chinese females
Summary Experiment setup; input data processing; changing the deep model
THANK YOU