Voice Activity Detection (VAD)


Voice Activity Detection (VAD)
Problem: determine whether voice is present in a given audio signal.
Issues: loud noise misclassified as speech; soft speech misclassified as noise.
Applications: speech recognition, speech transmission, speech enhancement.
VAD increases the performance of speech applications more than any other single component.
Goal: extract features from a signal that emphasize the differences between speech and background noise.

General Signal Characteristics
Energy compared to long-term noise estimates:
K. Srinivasan, A. Gersho, "Voice activity detection for cellular networks," Proc. of the IEEE Speech Coding Workshop, Oct. 1993, pp. 85-86.
Likelihood ratio based on statistical methods:
Y.D. Cho, K. Al-Naimi, A. Kondoz, "Improved voice activity detection based on a smoothed statistical likelihood ratio," Proc. ICASSP, 2001, IEEE Press.
Kurtosis of the signal:
E. Nemer, R. Goubran, S. Mahmoud, "SNR estimation of speech signals using subbands and fourth-order statistics," IEEE Signal Processing Letters, vol. 6, no. 7, pp. 171-174, 1999.

Extract Features in the Speech Model
Presence of pitch:
"Digital cellular telecommunication system (phase 2+); Voice Activity Detector (VAD) for Adaptive Multi-Rate (AMR) speech traffic channels," ETSI Report, DEN/SMG-110694Q7, 2000.
Formant shape:
J.D. Hoyt, H. Wechsler, "Detection of human speech in structured noise," Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, 1994, pp. 237-240.
Cepstrum:
J.A. Haigh, J.S. Mason, "Robust voice activity detection using cepstral features," Proc. IEEE TENCON, pp. 321-324, 1993.

Multi-channel Algorithms
Utilize the additional information provided by additional sensors:
P. Naylor, N. Doukas, T. Stathaki, "Voice activity detection using source separation techniques," Proc. Eurospeech, 1997, pp. 1099-1102.
J.F. Chen, W. Ser, "Speech detection using microphone array," Electronics Letters, vol. 36, no. 2, pp. 181-182, 2000.
Q. Zou, X. Zou, M. Zhang, Z. Lin, "A robust speech detection algorithm in a microphone array teleconferencing system," Proc. ICASSP, 2001, IEEE Press.

Statistics: Moments
First moment, the mean (average value): μ = (1/N)∑i=1,N si
Second moment, the variance (spread): σ² = (1/N)∑i=1,N (si − μ)²
Standard deviation σ: the square root of the variance; a measure of the expected distance from the mean
3rd standardized moment, skewness: γ1 = (1/N)∑i=1,N (si − μ)³/σ³
Negative tail: skew to the left; positive tail: skew to the right
4th standardized moment, excess kurtosis: γ2 = (1/N)∑i=1,N (si − μ)⁴/σ⁴ − 3
Positive: relatively peaked; negative: relatively flat (the −3 offset makes a Gaussian score zero)
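The four moments above can be computed directly from a frame of samples; the following plain-Python sketch uses the −3 offset so that a Gaussian yields zero excess kurtosis:

```python
import math

def moments(samples):
    """Mean, variance, skewness, and excess kurtosis of a sample list."""
    n = len(samples)
    mu = sum(samples) / n
    var = sum((s - mu) ** 2 for s in samples) / n
    sigma = math.sqrt(var)
    # 3rd and 4th standardized moments
    skew = sum((s - mu) ** 3 for s in samples) / (n * sigma ** 3)
    kurt = sum((s - mu) ** 4 for s in samples) / (n * sigma ** 4) - 3.0
    return mu, var, skew, kurt
```

For a symmetric sample such as [1, 2, 3, 4, 5] the skewness is zero and the excess kurtosis is negative (flatter than a Gaussian).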

VAD: General Approaches
Noise level: estimated during periods of low energy.
Adaptive estimate: the noise floor estimate drops quickly but rises slowly across non-speech frames.
Energy: speech energy significantly exceeds the noise level.
Cepstrum analysis: voiced speech contains F0 plus harmonics, which show up as a cepstral peak tied to that periodicity and to voicing. A flat cepstrum can result from a door slam or a clap.
Kurtosis: the linear-prediction residual of clean voiced speech has a large kurtosis.
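The adaptive noise-floor behavior described above (drop quickly, rise slowly) can be sketched as an asymmetric smoother; the function name and the smoothing coefficients below are illustrative assumptions, not taken from any specific standard:

```python
def update_noise_floor(noise, frame_energy, down=0.5, up=0.99):
    """Asymmetric noise-floor tracker (coefficients are illustrative).

    Track downward quickly when a frame is quieter than the estimate,
    and upward slowly so brief speech bursts don't inflate it."""
    if frame_energy < noise:
        # fall fast toward the lower energy
        return down * noise + (1.0 - down) * frame_energy
    # rise slowly toward the higher energy
    return up * noise + (1.0 - up) * frame_energy
```

Called once per frame, the estimate converges to the quiet-frame energy but barely moves during loud frames.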

Likelihood Ratio Test (LRT)
J. Sohn, N.S. Kim, W. Sung, "A statistical model-based voice activity detection," IEEE Signal Processing Letters, vol. 6, no. 1, pp. 1-3, Jan. 1999.
J. Ramirez, J.C. Segura, et al., "Statistical voice activity detection using a multiple observation likelihood ratio test," IEEE Signal Processing Letters, vol. 12, no. 10, pp. 689-692, Oct. 2005.
Utilizes the geometric mean: GM = (∏i=1,n ai)^(1/n) = e^((1/n)∑i=1,n ln(ai))
so that log(GM) = (1/n)∑i=1,n log(ai)

Geometric Mean
Arithmetic mean: applicable to additive quantities, e.g. annual growth of 2.5, 3, and 3.5 million dollars.
Geometric mean: applicable to multiplicative quantities (percentages), e.g. a company grows annually by 2.5, 3, and 3.5%.
Example: a company starts with $1,000,000 and its assets grow by 2.5, 3, and 3.5 percent over three years.
Arithmetic mean: (1/N)∑i=1,N gi = (1.025 + 1.03 + 1.035)/3 = 1.03
Geometric mean: (∏i=1,N gi)^(1/N) = (1.025 × 1.03 × 1.035)^(1/3) = 1.02999191
Actual increase: $1,000,000 × 1.025 × 1.03 × 1.035 = $1,092,701.25
Using the arithmetic mean: $1,000,000 × (1.03)³ = $1,092,727 (overestimates)
Using the geometric mean: $1,000,000 × (1.02999191)³ = $1,092,701.25 (exact)
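The log-sum identity makes the geometric mean cheap and numerically stable to compute; this sketch reproduces the worked example with the slide's numbers:

```python
import math

def geometric_mean(factors):
    # geometric mean via the log-sum identity:
    # (prod a_i)^(1/n) = exp((1/n) * sum(ln a_i))
    n = len(factors)
    return math.exp(sum(math.log(a) for a in factors) / n)

growth = [1.025, 1.03, 1.035]
g = geometric_mean(growth)

# compounding the geometric mean three times reproduces the actual growth,
# which the arithmetic mean (1.03) slightly overestimates
total = 1_000_000 * growth[0] * growth[1] * growth[2]
via_gm = 1_000_000 * g ** 3
```

Here g ≈ 1.02999191 and via_gm matches the actual $1,092,701.25, as in the example above.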

LRT Algorithm
Compute the DFT of the audio frame.
For each FFT bin k, compute the likelihood of the bin magnitude under the speech and non-speech models.
Compute the log of the geometric mean of the bin likelihood ratios: (1/K)∑k=0,K−1 log(p(k|speech)/p(k|non-speech))
If the result is above an upper threshold, mark the frame as speech.
If below a lower threshold, mark it as non-speech.
If in between, use an HMM or decide from the surrounding frames (multiple observations).
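A minimal sketch of the dual-threshold decision, assuming the per-bin likelihood ratios p(k|speech)/p(k|non-speech) have already been computed; the threshold values are illustrative:

```python
import math

def lrt_decision(bin_ratios, lower=0.5, upper=2.0):
    """Classify one frame from its per-bin likelihood ratios.

    Returns 'speech', 'non-speech', or 'uncertain' (the uncertain case
    would be resolved by an HMM or by neighboring frames).
    Thresholds are illustrative, not tuned values."""
    k = len(bin_ratios)
    # log of the geometric mean of the ratios
    log_gm = sum(math.log(r) for r in bin_ratios) / k
    if log_gm > math.log(upper):
        return "speech"
    if log_gm < math.log(lower):
        return "non-speech"
    return "uncertain"
```

Ratios well above 1 in most bins push the log geometric mean over the upper threshold; ratios well below 1 push it under the lower one.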

Statistical Modeling of Noise
Candidate distributions: Gamma, Laplacian, Gaussian.
For the Gamma distribution, the parameter k determines the shape and θ (equivalently the rate r = 1/θ) determines the spread.

Probability Distribution Formulas
Gamma: f(x; k, r) = r^k x^(k−1) e^(−rx) / Γ(k), where Γ(k) = (k−1)! for integer k
k: shape, r: rate; mean: k/r; variance: k/r²; skew: 2/k^½; excess kurtosis: 6/k
Laplacian: f(x; μ, b) = (1/(2b)) e^(−|x−μ|/b)
μ: location, b: scale; mean: μ; variance: 2b²; skew: 0; excess kurtosis: 3
Gaussian: f(x; μ, σ) = (1/(2πσ²)^½) e^(−(x−μ)²/(2σ²))
mean: μ; variance: σ²; skew: 0; excess kurtosis: 0
VAD: determine which distribution best matches the noise.
Example: if the measured excess kurtosis ≠ 0, the noise cannot be Gaussian.
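The closing example (non-zero excess kurtosis rules out a Gaussian) generalizes to a crude elimination rule over the three candidates; the helper name and tolerance below are illustrative assumptions:

```python
def candidate_noise_models(excess_kurtosis, tol=0.5):
    """Rule distributions in or out by measured excess kurtosis.

    Gaussian -> 0, Laplacian -> 3, Gamma -> 6/k (positive for any shape k).
    The tolerance is illustrative."""
    candidates = []
    if abs(excess_kurtosis) < tol:
        candidates.append("gaussian")
    if abs(excess_kurtosis - 3.0) < tol:
        candidates.append("laplacian")
    if excess_kurtosis > 0.0:
        candidates.append("gamma")  # 6/k > 0 for every k > 0
    return candidates
```

A measured value near 0 keeps only the Gaussian; a value near 3 is consistent with a Laplacian (or a Gamma with k ≈ 2).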

Harmonic Frequencies
Background: voiced speech energy clusters around the formants, so more of the spectral energy falls into the formant bins.
Algorithm:
If voice appears to be present (energy is high compared to the noise estimate):
Determine the fundamental frequency (F0) using cepstral analysis or some other method.
Determine the harmonics of F0.
Decide if the frame is speech from the geometric mean of the DFT bins in the vicinity of the harmonics.
Else:
Mark speech based on the geometric mean of all DFT bins.

Auto-correlation
Remove the DC offset and apply pre-emphasis:
xf[i] = (sf[i] − μf) − α(sf[i−1] − μf), where f is the frame, μf is the frame mean, and α is typically 0.96.
Apply the normalized auto-correlation formula to estimate pitch:
Rf[z] = ∑i=1,n−z xf[i]xf[i+z] / ∑i=1,n xf[i]²
Mf = maxz Rf[z]
Expectation: voiced speech should produce a higher Mf than unvoiced speech, silence, or noise frames.
Notes: the same can be done with cepstral coefficients, and the auto-correlation cost can be reduced by limiting which lags z are computed.
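The pre-emphasis and normalized auto-correlation steps can be sketched as follows; the lag search bounds are illustrative (and implement the note about only computing Rf[z] for a limited set of lags):

```python
import math

def autocorr_peak(frame, alpha=0.96, z_min=20, z_max=160):
    """Return (best_lag, peak) of the normalized auto-correlation.

    z_min/z_max bound the pitch search in samples; the values here are
    illustrative (roughly 50-400 Hz at an assumed 8 kHz sampling rate)."""
    n = len(frame)
    mu = sum(frame) / n
    # remove DC offset and apply pre-emphasis
    x = [(frame[i] - mu) - alpha * (frame[i - 1] - mu) for i in range(1, n)]
    energy = sum(v * v for v in x)
    best_lag, best_r = 0, 0.0
    for z in range(z_min, min(z_max, len(x) - 1)):
        r = sum(x[i] * x[i + z] for i in range(len(x) - z)) / energy
        if r > best_r:
            best_lag, best_r = z, r
    return best_lag, best_r

# example: a pure tone with a 40-sample period gives a peak near lag 40
frame = [math.sin(2 * math.pi * i / 40) for i in range(400)]
lag, peak = autocorr_peak(frame)
```

A voiced-like periodic frame yields a large peak at the pitch lag; noise frames yield no strong peak, which is exactly the Mf comparison above.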

Zero Crossing
The effectiveness of auto-correlation decreases as the SNR approaches zero; this enhancement to the auto-correlation method applies when SNR values are low.
Algorithm:
Eliminate the pre-emphasis step (preserve the original pitch).
Assume every two zero crossings delimit one pitch period.
Auto-correlate each period with its predecessor.

Use of Entropy as a VAD Metric
FOR each frame:
Decompose the signal into 24 Bark (or Mel) scale bands.
Compute the energy in each frequency band: energy[band] = ∑i=bstart,bend |x[i]|²
IF an initial or low-energy frame: noise[band] = energy[band]
ELSE: speech[band] = energy[band] − noise[band]
Sort speech[band] and select the subset of bands with the maximum speech[band] values.
Compute the probability of each band: P(band) = energy[band]/totalEnergy
Compute entropy = −∑useful bands P(band) log(P(band))
Note: we expect higher entropy in noise; a speech signal should be more organized, with its energy concentrated in fewer bands.
Adaptive noise adjustment, for frame f and 0 < α < 1:
noise[band]f = α·noise[band]f−1 + (1 − α)·energy[band]f
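The entropy computation at the heart of the method can be sketched as follows; the two example band-energy vectors are illustrative:

```python
import math

def band_entropy(band_energies):
    """Shannon entropy of the normalized band-energy distribution.

    Higher entropy suggests noise (energy spread evenly over bands);
    speech concentrates energy in a few bands, lowering entropy."""
    total = sum(band_energies)
    probs = [e / total for e in band_energies if e > 0]
    return -sum(p * math.log(p) for p in probs)

flat = [1.0] * 24             # noise-like: energy spread over all 24 bands
peaky = [10.0] + [0.1] * 23   # speech-like: energy concentrated in one band
```

The flat vector attains the maximum entropy log(24) ≈ 3.178; the peaky vector scores much lower, so a threshold on entropy separates the two cases.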

Unvoiced Speech Detector (Bark-Scale Decomposition)
EL,0 = sum of all level-5 energy bands
EL,1 = sum of the first four level-4 energy bands
EL,2 = sum of the last five level-4 energy bands + the first level-3 energy band
IF EL,2 > EL,1 > EL,0 and EL,0/EL,2 < 0.99, THEN the frame is unvoiced speech.

G.729 VAD Algorithm
Importance: an industry standard and a reference against which new algorithm proposals are compared.
Overview:
A VAD decision is made every 10 ms.
Features: full-band energy, low-band energy, zero-crossing rate, and a spectral measure.
A long-term calculation averages the frames judged not to contain voice.
VAD decision: compute the differences between a frame and the noise estimate, averaging values from predecessor frames to avoid clipping non-voiced speech.
IF the differences exceed a threshold, return true; ELSE return false.

Other Algorithms
Analyze a larger window and classify based on the percentage of time voice appears to be present or absent.
Focus on the change of signal peak energy at the onset and termination of speech:
Onset: drastic increase in peak energy.
Continuous speech: intermittent peak spikes.
Termination: absence of peak spikes and energy.

Evaluating ASR Performance
Importance of VAD: the VAD component impacts speech recognition accuracy more than any other single component; without VAD, ASR accuracy degrades to less than 30% in noisy environments.
Evaluation standards: without an objective standard, researchers cannot evaluate how various algorithms impact ASR accuracy.
H.G. Hirsch, D. Pearce, "The Aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions," Proc. ISCA ITRW ASR2000, Sep. 2000.