Presentation transcript:

Voice Activity Detection Based on Sequential Gaussian Mixture Model Zhan Shen, Jianguo Wei, Wenhuan Lu, Jianwu Dang Tianjin Key Laboratory of Cognitive Computation & its Applications Tianjin University, China

Introduction

- A voice activity detector (VAD) plays an important role in many speech signal processing systems, partitioning each utterance into speech/nonspeech segments.
- Research branches of VAD:
  1. Acoustic features
     - Energy, pitch, zero-crossing rate, higher-order statistics, ...
     - Each acoustic feature reflects only some characteristics of the human voice.
     - Not very effective in extremely difficult scenarios.
  2. Statistical models
     - Make model assumptions on the distributions of speech and nonspeech respectively, then design statistical algorithms to dynamically estimate the model parameters.
     - Gaussian model, Laplacian model, Gamma model, GARCH model, ...
     - It is difficult to derive a closed-form parameter estimation algorithm.
  3. Deep neural networks
     - Train acoustic models from given noisy corpora.
     - Superior performance only if the training scenario matches the test scenario.
     - Heavy computational load.
     - Several succeeding and preceding frames are used as the input, which introduces a latency of several frames.

Unsupervised Learning Framework

- Acoustic feature: smoothed subband logarithmic energy.
- The input signal is grouped into several Mel subbands in the frequency domain.
- The logarithmic energy of each subband is obtained as the logarithm of the sum of the absolute spectral magnitudes in that subband, and is then smoothed to form an envelope for classification (Eq. 1). A sketch of this feature extraction follows below.
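
A minimal sketch of this feature extraction, assuming a Hann-windowed FFT front end, Mel-spaced subband edges, and a simple first-order recursive smoother; the number of subbands, the smoothing constant alpha, and the function names are illustrative assumptions, not the authors' exact settings.

```python
import numpy as np

def mel_band_edges(n_bands, sr, n_fft):
    """FFT-bin edges of n_bands subbands spaced uniformly on the Mel scale."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    m_edges = np.linspace(mel(0.0), mel(sr / 2.0), n_bands + 1)
    return np.floor(inv_mel(m_edges) / (sr / n_fft)).astype(int)

def subband_log_energy(x, sr=8000, frame_len=256, hop=128, n_bands=6, alpha=0.9):
    """Smoothed subband logarithmic energy envelopes, one per Mel subband.

    Returns an array of shape (n_frames, n_bands)."""
    edges = mel_band_edges(n_bands, sr, frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    env = np.zeros((n_frames, n_bands))
    smoothed = None
    for t in range(n_frames):
        frame = x[t * hop: t * hop + frame_len] * np.hanning(frame_len)
        mag = np.abs(np.fft.rfft(frame, frame_len))
        # log of the sum of absolute magnitudes inside each subband
        e = np.array([np.log(mag[edges[b]:edges[b + 1]].sum() + 1e-12)
                      for b in range(n_bands)])
        # first-order recursive smoothing to form the envelope (alpha is assumed)
        smoothed = e if smoothed is None else alpha * smoothed + (1 - alpha) * e
        env[t] = smoothed
    return env
```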

Unsupervised Learning Framework

- Two Gaussian models are employed as the classifier to describe the logarithmic energy distributions of speech and nonspeech.
- These two models are incorporated into a two-component GMM; the mean and variance of the nonspeech logarithmic energy are smaller than those of the speech logarithmic energy.
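
In symbols, writing E for the smoothed subband logarithmic energy and using subscripts n and s for nonspeech and speech (notation assumed here, not taken from the slides), the two-component mixture is:

```latex
p(E) = w_n\,\mathcal{N}(E;\mu_n,\sigma_n^2) + w_s\,\mathcal{N}(E;\mu_s,\sigma_s^2),
\qquad w_n + w_s = 1,\quad \mu_n < \mu_s,\quad \sigma_n^2 < \sigma_s^2 .
```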

Unsupervised Learning Framework

- An optimal threshold is chosen to minimize the classification error (Eqs. (2)-(4)).
- Samples with logarithmic energy below the threshold are classified as nonspeech, and the others as speech.
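
As a hedged sketch of what equations (2)-(4) likely express, the minimum-classification-error threshold T for two Gaussian components is the point between the two means where the weighted component densities are equal; this reduces to a quadratic in T:

```latex
w_n\,\mathcal{N}(T;\mu_n,\sigma_n^2) = w_s\,\mathcal{N}(T;\mu_s,\sigma_s^2)
\;\Longrightarrow\;
\Bigl(\tfrac{1}{\sigma_n^2}-\tfrac{1}{\sigma_s^2}\Bigr)T^2
-2\Bigl(\tfrac{\mu_n}{\sigma_n^2}-\tfrac{\mu_s}{\sigma_s^2}\Bigr)T
+\Bigl(\tfrac{\mu_n^2}{\sigma_n^2}-\tfrac{\mu_s^2}{\sigma_s^2}\Bigr)
-2\ln\frac{w_n\,\sigma_s}{w_s\,\sigma_n}=0 .
```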

Estimation of GMM Parameters

- The parameter set is updated frame by frame based on the maximum likelihood criterion.
- The sequential scheme is a first-order process.
- The GMM is initialized on the first M frames with the standard EM algorithm (Eqs. (5)-(8)). A hedged sketch of this initialization is given below.
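
A minimal sketch of the initialization step, assuming the batch EM implementation from scikit-learn and sorting the two components so that the lower-mean one plays the nonspeech role; the default M, the library choice, and the function name are assumptions, not the paper's specification.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def init_gmm(log_energy, M=100):
    """Fit a two-component GMM to the first M smoothed log-energy samples of
    one subband; return (weights, means, variances) with the nonspeech
    (lower-mean) component first."""
    e = np.asarray(log_energy[:M]).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, covariance_type="full",
                          max_iter=100, random_state=0).fit(e)
    order = np.argsort(gmm.means_.ravel())        # nonspeech = smaller mean
    w = gmm.weights_[order]
    mu = gmm.means_.ravel()[order]
    var = gmm.covariances_.reshape(2)[order]
    return w, mu, var
```

The component ordering mirrors the earlier slide's assumption that the nonspeech mean and variance are the smaller ones.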

Estimation of GMM Parameters

- After initialization, the GMM is sequentially updated based on the maximum likelihood criterion.
- The parameter set for the (k+1)-th frame is estimated by Eqs. (9)-(11).
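
The quantity that drives such a sequential update is the posterior probability (responsibility) of each component given the new observation. With E(k+1) the new smoothed log energy and the previous parameter set indexed by k, a standard expression (not copied from the slides) is:

```latex
p_i(k{+}1) =
\frac{w_i(k)\,\mathcal{N}\bigl(E(k{+}1);\,\mu_i(k),\,\sigma_i^2(k)\bigr)}
     {\sum_{j\in\{n,s\}} w_j(k)\,\mathcal{N}\bigl(E(k{+}1);\,\mu_j(k),\,\sigma_j^2(k)\bigr)},
\qquad i\in\{n,s\}.
```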

Estimation of GMM Parameters

- The iterative Newton-Raphson algorithm is utilized to maximize the Q-function (Eqs. (12)-(14)).
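
In its generic form (a sketch of the method named on the slide, not the slides' exact equations), one Newton-Raphson step for a scalar parameter of the Q-function is:

```latex
\theta^{(t+1)} = \theta^{(t)}
- \left[\frac{\partial^2 Q\bigl(\Theta \mid \Theta(k)\bigr)}{\partial\theta^2}\right]^{-1}
  \frac{\partial Q\bigl(\Theta \mid \Theta(k)\bigr)}{\partial\theta}
  \;\Bigg|_{\theta=\theta^{(t)}} .
```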

Estimation of GMM Parameters

- Substituting the parameter set into Eq. (12) yields the recursive update formulas (Eqs. (15)-(17)). A hedged sketch of recursions of this kind is given below.
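
The following is a minimal sketch written as a typical online (sequential) EM step, in which each parameter is moved toward the new observation in proportion to the component's responsibility; the paper's actual recursions (15)-(17) may differ, so treat the step size and update forms as assumptions.

```python
import numpy as np

def normal_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def sequential_update(e_new, w, mu, var, k, var_floor=1e-3):
    """One frame of a sequential GMM update (assumed online-EM form).

    e_new      : new smoothed log-energy sample
    w, mu, var : arrays of shape (2,) holding weights, means, variances
    k          : number of frames seen so far (controls the 1/(k+1) step size)
    """
    # posterior responsibility of each component for the new sample
    lik = w * normal_pdf(e_new, mu, var)
    p = lik / lik.sum()
    eta = 1.0 / (k + 1)                       # first-order (decaying) step size
    w_new = w + eta * (p - w)                 # weights move toward responsibilities
    mu_new = mu + eta * p * (e_new - mu) / np.maximum(w_new, 1e-6)
    var_new = var + eta * p * ((e_new - mu) ** 2 - var) / np.maximum(w_new, 1e-6)
    return w_new, mu_new, np.maximum(var_new, var_floor), p
```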

Estimation of GMM Parameters

- The average of the speech presence/absence probability can be defined as a sequential variable.
- Each parameter of the new parameter set can then be written as a function of the new observation, the previous parameter set, and the speech presence probability (Eqs. (18)-(22)).
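
One natural realization of such a sequential average (offered as an assumed form, since the slides' equations are not shown) is a running mean of the per-frame speech posterior:

```latex
\bar{p}_s(k{+}1) = \bar{p}_s(k) + \frac{1}{k+1}\bigl(p_s(k{+}1)-\bar{p}_s(k)\bigr),
\qquad \bar{p}_n(k{+}1) = 1-\bar{p}_s(k{+}1).
```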

Constraints on GMM

- A number of constraints are introduced to make sure that the proposed GMM fits the situation of speech absence as well as speech presence.
- In the situation of speech absence, a virtual speech component is constructed so that the two-component GMM still applies.
- All of these constraints are embedded into both the initialization and the updating process of the sequential GMM (an illustrative sketch follows below):
  - Constraint on the means
  - Constraint on the variances
  - Constraint on the weight coefficients
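
The slides list the constrained quantities but not the constraint formulas, so the following is only an illustrative guess at what such clamping might look like (ordering of the means, a variance floor, bounded weights); the paper's actual constraints may be different, and the numeric defaults are placeholders.

```python
import numpy as np

def apply_constraints(w, mu, var, min_gap=1.0, var_floor=1e-3, w_min=0.05):
    """Illustrative (assumed) constraints on a two-component GMM:
    component 0 = nonspeech, component 1 = (possibly virtual) speech."""
    # means: keep the speech mean at least min_gap above the nonspeech mean;
    # this also maintains a virtual speech component when no speech is present
    mu[1] = max(mu[1], mu[0] + min_gap)
    # variances: keep them bounded away from zero
    var = np.maximum(var, var_floor)
    # weights: keep both components alive and renormalize
    w = np.clip(w, w_min, 1.0 - w_min)
    w = w / w.sum()
    return w, mu, var
```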

Experimental Conditions

- Data set: TIMIT TEST corpus.
- Clean speech: 16 male speakers, 16 female speakers, 320 utterances in total.
- Noises: babble noise at SNRs of 0, 10 and 20 dB; F16 cockpit noise at SNRs of 0, 10 and 20 dB; white Gaussian noise at SNRs of 0, 10 and 20 dB.
- Sampling rate: 8000 Hz.
- Reference VADs: ITU G.729 Annex B VAD (G729B); ETSI AMR VAD option 1 (AMR1); ETSI AMR VAD option 2 (AMR2); SGMM VAD by Ying (SGMM).
- Parameters:

Experimental Results

Results under babble noise at SNRs of 0 dB, 10 dB, and 20 dB.

Experimental Results

Results under F16 cockpit noise at SNRs of 0 dB, 10 dB, and 20 dB.

Experimental Results

Results under white Gaussian noise at SNRs of 0 dB, 10 dB, and 20 dB.

Experimental Results

Table: F-measure at 0 dB SNR under babble, F16 cockpit, and white Gaussian noise, comparing G729B, AMR1, AMR2, SGMM, and the proposed ML-based VAD.
TP: true positive; FP: false positive; TN: true negative; FN: false negative.
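
For reference, the F-measure reported in the table is computed from the true/false positive and negative counts via precision and recall in the usual way:

```latex
P=\frac{TP}{TP+FP},\qquad
R=\frac{TP}{TP+FN},\qquad
F=\frac{2PR}{P+R}.
```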

Process of VAD Decision

Initialization
1:  FOR the first M frames
2:    FOR each Mel subband
3:      Extract a logarithmic energy envelope.
4:      Establish a GMM by EM with constraints.
5:      Determine the threshold from the GMM.
6:      Tune the threshold.
7:      Classify the M samples as speech/nonspeech.
8:    END
9:    Summarize all subbands' classifications by voting.
10:   Discriminate speech/nonspeech.
11: END

Updating
1:  FOR each newly arriving frame at time k+1
2:    Do the FFT and calculate the logarithmic energy of each Mel subband.
3:    FOR each subband
4:      Maximize the Q-function with Newton's method.
5:      Update the means.
6:      Constrain the means.
7:      Update the variances.
8:      Constrain the variances.
9:      Update the weight coefficients.
10:     Constrain the weight coefficients.
11:     Determine the threshold from the GMM.
12:     Tune the threshold.
13:     Classify the frame as speech/nonspeech.
14:   END
15:   Summarize all subbands' classifications by voting.
16:   Discriminate the (k+1)-th frame.
17: END
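
A compact, hedged sketch of the per-frame decision that the updating loop produces: each subband compares its smoothed log energy with the threshold derived from its own two-component GMM, and the frame-level decision is a majority vote over subbands. The component order (0 = nonspeech, 1 = speech), the helper names, and the omission of the "tune the threshold" step are assumptions of this sketch, not the authors' implementation.

```python
import numpy as np

def gmm_threshold(w, mu, var):
    """Decision threshold between the two Gaussians (arrays of shape (2,)):
    the root of the equal-weighted-density equation lying between the means."""
    a = 1.0 / var[0] - 1.0 / var[1]
    b = -2.0 * (mu[0] / var[0] - mu[1] / var[1])
    c = (mu[0] ** 2 / var[0] - mu[1] ** 2 / var[1]
         - 2.0 * np.log(w[0] * np.sqrt(var[1]) / (w[1] * np.sqrt(var[0]))))
    if abs(a) < 1e-12:                        # equal variances: linear case
        return -c / b
    roots = np.roots([a, b, c])
    roots = roots[np.isreal(roots)].real
    between = roots[(roots > mu[0]) & (roots < mu[1])]
    return between[0] if len(between) else roots[np.argmin(np.abs(roots - mu.mean()))]

def vad_decision(subband_energies, subband_gmms):
    """Majority vote over subbands: the frame is speech if more than half of
    the subband log energies exceed their subband thresholds."""
    votes = [e > gmm_threshold(w, mu, var)
             for e, (w, mu, var) in zip(subband_energies, subband_gmms)]
    return sum(votes) > len(votes) / 2
```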

Discussion

Conclusion
- This work presents a novel voice activity detector based on a Gaussian mixture model.
- The subband logarithmic energy is utilized as the acoustic feature.
- A sequential likelihood function is presented to estimate the parameter set of this GMM frame by frame.
- The likelihood function is sequentially maximized with the Newton-Raphson method.
- The major contribution of this paper is the optimal estimation of the GMM parameter set.

Future Work
- Compare the experimental results with other statistical-model-based VADs.

Thank you!
This work was supported by the National Natural Science Foundation of China (No. ).