Supervised Speech Separation

Supervised Speech Separation
DeLiang Wang
Perception & Neurodynamics Lab
Ohio State University & Northwestern Polytechnical University

Outline of tutorial
- Introduction
- Training targets
- Separation algorithms
- Concluding remarks

Sources of speech interference and distortion
- Additive noise from other sound sources
- Channel distortion
- Reverberation from surface reflections

Traditional approaches to speech separation
- Speech enhancement
  - Monaural methods that analyze general statistics of speech and noise
  - Require a noise estimate
- Spatial filtering with a microphone array
  - Beamforming: extracts the target sound from a specific spatial direction
  - Independent component analysis: finds a demixing matrix from multiple mixtures of sound sources
- Computational auditory scene analysis (CASA)
  - Based on auditory scene analysis principles
  - Feature-based (e.g. pitch) versus model-based (e.g. speaker model)

Supervised approach to speech separation
- Data driven, i.e. dependent on a training set
  - Features
  - Training targets
  - Learning machines: deep neural networks
- Born out of CASA: the time-frequency masking concept has led to the formulation of speech separation as a supervised learning problem
- A recent trend fueled by the success of deep learning

Ideal binary mask as a separation goal
- Motivated by the auditory masking phenomenon and auditory scene analysis, we suggested the ideal binary mask as a main goal of CASA (Hu & Wang, 2001; 2004)
- The idea is to retain the parts of a mixture where the target sound is stronger than the acoustic background, and discard the rest
- Definition of the ideal binary mask (IBM):
  IBM(t, f) = 1 if SNR(t, f) > θ, and 0 otherwise
  where θ is a local SNR criterion (LC) in dB, typically chosen to be 0 dB
- Optimal SNR: under certain conditions, the IBM with θ = 0 dB is the optimal binary mask in terms of SNR (Li & Wang'09)
- Maximal articulation index (AI) in a simplified version (Loizou & Kim'11)
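As a concrete illustration, here is a minimal NumPy sketch of IBM construction from premixed target and noise energies; the array names and the use of power spectrograms (rather than a gammatone front end) are assumptions for illustration:

```python
import numpy as np

def ideal_binary_mask(target_power, noise_power, lc_db=0.0):
    """Construct the IBM from premixed target and noise T-F power arrays.

    A T-F unit is retained (mask = 1) when its local SNR exceeds the
    local criterion (LC) in dB, and discarded (mask = 0) otherwise.
    """
    eps = np.finfo(float).eps
    local_snr_db = 10.0 * np.log10((target_power + eps) / (noise_power + eps))
    return (local_snr_db > lc_db).astype(np.float32)
```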

Subject tests of ideal binary masking
- Ideal binary masking leads to dramatic speech intelligibility improvements
  - Improvement for stationary noise is above 7 dB for normal-hearing (NH) listeners, and above 9 dB for hearing-impaired (HI) listeners
  - Improvement for modulated noise is significantly larger than for stationary noise
- With the IBM as the goal, the speech separation problem becomes a binary classification problem
  - This new formulation opens the problem to a variety of supervised classification methods
- Brungart and Simpson (2001)'s experiments show the coexistence of informational and energetic masking; they also isolated informational masking using an across-ear effect: listeners attending to two talkers in one ear experience informational masking from a third signal in the opposite ear (Brungart & Simpson, 2002)
- Arbogast et al. (2002) divided the speech signal into 15 log-spaced, envelope-modulated sine waves; assigning some to the target and some to the interference yields intelligible speech with no spectral overlap

Deep neural networks
- Deep neural networks (DNNs) usually refer to multilayer perceptrons with two or more hidden layers
- Why deep?
  - As the number of layers increases, more abstract features are learned, and they tend to be more invariant to superficial variations
  - A deeper model benefits increasingly from bigger training data
  - Superior performance in practice if properly trained (e.g., convolutional neural networks)
- Hinton et al. (2006) suggested unsupervised pretraining of a DNN using restricted Boltzmann machines (RBMs)
  - However, recent practice suggests that RBM pretraining is not needed when large training data exists

Part II: Training targets
- What supervised training aims to learn is important for speech separation/enhancement
  - Different training targets lead to different mapping functions from noisy features to separated speech
  - Different targets may offer different levels of generalization
- A recent study (Wang et al.'14) examines different training targets (objective functions)
  - Masking-based targets
  - Mapping-based targets
  - Other targets

Background
- While the IBM was the first target used in supervised separation (see Part III), the quality of the separated speech is a persistent issue
- What are the alternative targets?
  - Target binary mask (TBM) (Kjems et al.'09; Gonzalez & Brookes'14)
  - Ideal ratio mask (IRM) (Srinivasan et al.'06; Narayanan & Wang'13; Hummersone et al.'14)
  - STFT spectral magnitude (Xu et al.'14; Han et al.'14)

Different training targets
- TBM: similar to the IBM, except that the interference is fixed to speech-shaped noise (SSN)
- IRM:
  IRM(t, f) = (S²(t, f) / (S²(t, f) + N²(t, f)))^β
  where S² and N² denote the target and noise energies, and β is a tunable parameter; a good choice is 0.5
- With β = 0.5, the IRM becomes a square-root Wiener filter, which is the optimal estimator of the power spectrum of the target signal
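A matching sketch of the IRM under the same assumptions as the IBM sketch above (premixed target and noise energies; variable names are mine):

```python
import numpy as np

def ideal_ratio_mask(target_power, noise_power, beta=0.5):
    """Compute the IRM; beta = 0.5 yields the square-root Wiener filter."""
    eps = np.finfo(float).eps
    return ((target_power + eps) / (target_power + noise_power + eps)) ** beta
```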

Different training targets (cont.)
- Gammatone frequency power spectrum (GF-POW)
- FFT-MAG: the STFT magnitude of clean speech
- FFT-MASK: the ratio of clean to noisy STFT magnitudes
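A sketch of the two FFT-domain targets, assuming complex STFTs of the clean and noisy signals are available:

```python
import numpy as np

def fft_mag(clean_stft):
    """FFT-MAG target: the magnitude spectrogram of clean speech."""
    return np.abs(clean_stft)

def fft_mask(clean_stft, noisy_stft):
    """FFT-MASK target: the ratio of clean to noisy STFT magnitudes.

    Unlike a binary mask, this ratio can exceed 1 when the clean and
    noise components interfere destructively.
    """
    eps = np.finfo(float).eps
    return np.abs(clean_stft) / (np.abs(noisy_stft) + eps)
```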

Illustration of various training targets
[Figure: spectrogram illustrations of the training targets for factory noise at -5 dB]

Evaluation methodology
- Learning machine: DNN with 3 hidden layers, each with 1024 units
- Target speech: TIMIT corpus
- Noises: SSN + four nonstationary noises from NOISEX
  - Training and testing on different segments of each noise
- Trained at -5 and 0 dB, and tested at -5, 0, and 5 dB
- Evaluation metrics
  - STOI: standard metric for predicted speech intelligibility
  - PESQ: standard metric for perceptual speech quality
  - SNR
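For concreteness, a minimal PyTorch sketch of a mask-estimation DNN of this shape; the input feature dimension, ReLU hidden units, sigmoid output, and MSE loss are assumptions about details the slide does not specify:

```python
import torch
import torch.nn as nn

def make_mask_dnn(input_dim, n_channels, hidden=1024):
    """Feedforward DNN with 3 hidden layers of 1024 units; a sigmoid
    output suits mask targets bounded in [0, 1], such as the IRM."""
    return nn.Sequential(
        nn.Linear(input_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, n_channels), nn.Sigmoid(),
    )

# One hypothetical training step: noisy features in, IRM estimate out.
model = make_mask_dnn(input_dim=246, n_channels=64)  # dimensions illustrative
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
features, irm = torch.rand(32, 246), torch.rand(32, 64)  # placeholder batch
optimizer.zero_grad()
loss = nn.functional.mse_loss(model(features), irm)
loss.backward()
optimizer.step()
```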

Comparisons
- Comparisons among different training targets
- Comparisons with different approaches
  - Speech enhancement (Hendriks et al.'10)
  - Supervised NMF: ASNA-NMF (Virtanen et al.'13), trained and tested in the same way as supervised separation

STOI comparison for factory noise
[Figure: STOI bar chart comparing masking-based targets, spectral-mapping targets, and the NMF & speech enhancement (SPEH) baselines]

PESQ comparison for factory noise
[Figure: PESQ bar chart comparing masking-based targets, spectral-mapping targets, and the NMF & speech enhancement (SPEH) baselines]

Summary among different targets
- Of the two binary masks, IBM estimation performs better in PESQ than TBM estimation
- Ratio masking performs better than binary masking for speech quality
  - The IRM, FFT-MASK, and GF-POW produce comparable PESQ results
- FFT-MASK is a better estimation target than FFT-MAG
  - FFT-MAG involves a many-to-one mapping, whereas FFT-MASK is one-to-one, and the latter should be easier to learn
  - Estimating spectral magnitudes, or their compressed versions, tends to magnify estimation errors

Other targets: signal approximation
- In signal approximation (SA), training aims to estimate the IRM, but the error is measured against the spectral magnitude of clean speech (Weninger et al.'14):
  SA loss = Σ_{t,f} [RM(t, f) · |Y(t, f)| − |S(t, f)|]²
  where RM(t, f) denotes the estimated ratio mask
- This objective function maximizes SNR
- Jin & Wang (2009) proposed an earlier version in conjunction with IBM estimation
- There is some improvement over direct IRM estimation
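A sketch of the SA objective as a training loss, assuming the estimated mask and the magnitude spectrograms come in as tensors:

```python
import torch

def signal_approximation_loss(est_mask, noisy_mag, clean_mag):
    """SA objective: apply the estimated ratio mask to the noisy
    magnitude and measure the squared error against the clean magnitude."""
    return torch.mean((est_mask * noisy_mag - clean_mag) ** 2)
```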

Phase-sensitive target
- The phase-sensitive target is a ratio mask (FFT mask) that incorporates the phase difference θ between clean speech and noisy speech (Erdogan et al.'15):
  PSM(t, f) = (|S(t, f)| / |Y(t, f)|) · cos θ(t, f)
- Because of phase sensitivity, this target leads to a better estimate of clean speech than the FFT mask
- It does not directly estimate phase
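A sketch of the phase-sensitive mask computed from complex STFTs (variable names are mine):

```python
import numpy as np

def phase_sensitive_mask(clean_stft, noisy_stft):
    """PSM: the magnitude ratio scaled by the cosine of the phase
    difference between clean and noisy speech."""
    eps = np.finfo(float).eps
    theta = np.angle(clean_stft) - np.angle(noisy_stft)
    return np.abs(clean_stft) / (np.abs(noisy_stft) + eps) * np.cos(theta)
```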

Complex ideal ratio mask (cIRM)
- An ideal mask that can result in clean speech (Williamson et al.'16):
  S(t, f) = M(t, f) · Y(t, f)
  where M(t, f) is the complex mask and Y(t, f) is the noisy speech
- In the Cartesian complex domain:
  M = (Y_r S_r + Y_i S_i) / (Y_r² + Y_i²) + i · (Y_r S_i − Y_i S_r) / (Y_r² + Y_i²)
- Structure exists in both the real and imaginary components, but not in the phase spectrogram
- Compress the value range:
  cIRM_x = K · (1 − e^(−C·M_x)) / (1 + e^(−C·M_x)),  x ∈ {r, i}
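A sketch of the cIRM with the compression above; the particular K and C values are assumptions, not taken from the slide:

```python
import numpy as np

def cirm(clean_stft, noisy_stft, K=10.0, C=0.1):
    """Complex ideal ratio mask with tanh-style range compression
    applied separately to the real and imaginary components."""
    eps = np.finfo(float).eps
    Yr, Yi = noisy_stft.real, noisy_stft.imag
    Sr, Si = clean_stft.real, clean_stft.imag
    denom = Yr ** 2 + Yi ** 2 + eps
    Mr = (Yr * Sr + Yi * Si) / denom  # real component
    Mi = (Yr * Si - Yi * Sr) / denom  # imaginary component
    compress = lambda M: K * (1.0 - np.exp(-C * M)) / (1.0 + np.exp(-C * M))
    return compress(Mr), compress(Mi)
```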

Part III: DNN-based separation algorithms
- Separation methods
  - Time-frequency (T-F) masking
    - Binary mask: classification
    - Ratio mask: regression or function approximation
  - Spectral mapping
- Separation tasks
  - Speech-nonspeech separation
  - Two-talker separation

DNN as subband classifier
- Y. Wang & Wang (2013) first introduced DNNs to address the speech separation problem
- The DNN is used as a subband classifier, performing feature learning from raw acoustic features
- Classification aims to estimate the IBM

DNN as subband classifier (cont.)
[Figure: architecture diagram of the subband DNN classifier]

Extensive training with DNN
- Training on 200 randomly chosen utterances from both male and female IEEE speakers, mixed with 100 environmental noises at 0 dB (~17 hours long)
- Six million fully dense training samples in each channel, with 64 channels in total
- Evaluated on 20 unseen speakers mixed with 20 unseen noises at 0 dB

DNN-based separation results
- Comparisons with a representative speech enhancement algorithm (Hendriks et al.'10)
- Using clean speech as ground truth: on average, about 3 dB SNR improvement
- Using IBM-separated speech as ground truth: on average, about 5 dB SNR improvement

Speech intelligibility evaluation
- Healy et al. (2013) subsequently evaluated the classifier on the speech intelligibility of hearing-impaired listeners
- A very challenging problem: "The interfering effect of background noise is the single greatest problem reported by hearing aid wearers" (Dillon'12)
- Two-stage DNN training to incorporate T-F context in classification

Separation illustration
[Figure: a HINT sentence mixed with speech-shaped noise at -5 dB SNR]

Results and sound demos
- Both HI and NH listeners showed intelligibility improvements
- HI subjects with separation outperformed NH subjects without separation

Generalization to new noises
- While the previous speech intelligibility results are impressive, a major limitation is that training and test noise samples were drawn from the same noise segments
  - Speech utterances were different
  - Noise samples were randomized
- Chen et al. (2016) recently addressed this limitation through large-scale training for IRM estimation

Large-scale training
- The training set consisted of 560 IEEE sentences mixed with 10,000 (10K) non-speech noises (a total of 640,000 mixtures)
  - The total duration of the noises is about 125 h, and the total duration of the training mixtures is about 380 h
- The training SNR is fixed at -2 dB
- The only feature used is the simple T-F unit energy
- The DNN architecture consists of 5 hidden layers, each with 2048 units
- Test utterances and noises both differ from those used in training
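The slide does not give the exact mixing recipe, but training pairs of this kind are typically created by rescaling noise to hit a target SNR; a generic sketch:

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db=-2.0):
    """Mix a speech waveform with an equal-length noise segment at a
    target SNR by rescaling the noise."""
    eps = np.finfo(float).eps
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0) + eps))
    return speech + scale * noise
```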

STOI performance at -2 dB input SNR

                         Babble   Cafeteria   Factory   Babble2   Average
  Unprocessed            0.612    0.596       0.611     0.608     –
  100-noise model        0.683    0.704       0.750     0.688     0.706
  10K-noise model        0.792    0.783       0.807     0.786     –
  Noise-dependent model  0.833    0.770       0.802     0.762     –

- The DNN model with large-scale training provides results similar to the noise-dependent model

Benefit of large-scale training
[Figure: invariance to training SNR]

Learned speech filters
- Visualization of 100 units from the first hidden layer
- Abscissa: 23 time frames; ordinate: 64 frequency channels
- Some learned units appear to capture formant transitions, others harmonics

Results and demos
- Both NH and HI listeners received benefit from algorithm processing in all conditions, with larger benefits for HI listeners

DNN as spectral magnitude estimator
- Xu et al. (2014) proposed a DNN-based enhancement algorithm
- A DNN (with RBM pretraining) is trained to map from the log-power spectra of noisy speech to those of clean speech
- More input frames and more training data improve separation results
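A sketch of the input/target representation behind this kind of spectral mapping; the context width and naming are assumptions:

```python
import numpy as np

def log_power_spectrum(stft):
    """Log-power spectrum of a complex STFT (frames x frequency bins)."""
    eps = np.finfo(float).eps
    return np.log(np.abs(stft) ** 2 + eps)

def splice_frames(feats, context=5):
    """Concatenate each frame with its +/- `context` neighbors, giving
    the DNN input the temporal context used in spectral mapping."""
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + len(feats)] for i in range(2 * context + 1)])
```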

Xu et al. results in PESQ
[Figure: PESQ results with trained noises and with two untrained noises (A: car; B: exhibition hall); the subscript in DNN denotes the number of hidden layers]

Xu et al. demo
- Street noise at 10 dB (upper left: DNN; upper right: log-MMSE; lower left: clean; lower right: noisy)

DNN for two-talker separation
- Huang et al. (2014; 2015) proposed a two-talker separation method based on DNNs, as well as RNNs (recurrent neural networks)
- Mapping from an input mixture to two separated speech signals
- T-F masking (binary or ratio) is used to constrain the target signals

Network architecture and training objective
- A discriminative training objective maximizes the signal-to-interference ratio (SIR); see the sketch below
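A minimal sketch of a discriminative objective in this spirit: squared error to each talker's own reference, minus a cross-talker term that penalizes similarity to the competing source (the weight gamma is an assumption):

```python
import torch

def discriminative_loss(est1, est2, ref1, ref2, gamma=0.05):
    """Two-talker objective: fit each estimate to its own reference and
    push it away from the other talker, which raises the output SIR."""
    mse = torch.nn.functional.mse_loss
    same = mse(est1, ref1) + mse(est2, ref2)
    cross = mse(est1, ref2) + mse(est2, ref1)
    return same - gamma * cross
```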

Huang et al. results
- DNN and RNN perform at about the same level
- About 4-5 dB better in terms of SIR than NMF, while maintaining better SDRs and SARs (input SIR is 0 dB)
- Demo at https://sites.google.com/site/deeplearningsourceseparation
[Audio demos: mixture, separated talker 1, and separated talker 2, with binary and ratio masking]

Binaural separation of reverberant speech
- Jiang et al. (2014) use binaural (ITD and ILD) and monaural (GFCC) features to train a DNN classifier to estimate the IBM
- DNN-based classification produces excellent results under low SNR, reverberation, and different spatial configurations
- The inclusion of monaural features improves separation performance when the target and interference are close, potentially overcoming a major hurdle for beamforming and other spatial filtering methods

Part IV: Concluding remarks
- Formulating separation as classification or mask estimation enables the use of supervised learning
- Advances in DNN-based speech separation in the last few years are impressive
  - Large improvements over unprocessed noisy speech and related approaches
  - This approach has yielded the first demonstrations of speech intelligibility improvement in noise

Concluding remarks (cont.)
- Supervised speech processing represents a major current trend
  - Signal processing provides an important domain for supervised learning, and it in turn benefits from rapid advances in machine learning
- Use of supervised processing goes beyond speech separation and recognition
  - Multipitch tracking (Huang & Lee'13; Han & Wang'14)
  - Voice activity detection (Zhang et al.'13)
  - Dereverberation (Han et al.'15)
  - SNR estimation (Papadopoulos et al.'14)

Resources and acknowledgments
- DNN toolbox for speech separation: http://www.cse.ohio-state.edu/pnl/DNN_toolbox
- Thanks to Jitong Chen and Donald Williamson for their assistance in the tutorial preparation