Ekapol Chuangsuwanich and James Glass, MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, Massachusetts 02139, USA. Presented 2012/07/02 by 汪逸婷.

 Introduction
 Harmonicity
 Modulation Frequency
 Experiments and Discussions
  ◦ Speech/Non-speech detection capabilities
  ◦ Effect on ASR performance
 Conclusion

 Voice Activity Detection (VAD) is the process of identifying segments of speech in a continuous audio stream; its output is a binary speech/non-speech decision.
 It is typically the first stage of a speech processing application.
 It is used both to reduce computation, by eliminating unnecessary transmission and processing of non-speech segments, and to reduce potential mis-recognition errors in those segments.
 In this paper, we consider the task of giving spoken commands to an autonomous forklift.

 In high-quality recording conditions, energy-based methods perform well.
 In noisy conditions, energy-based measures often produce a considerable number of false alarms (a minimal sketch of such a detector follows below).
 A large variety of other features has been investigated for use in noisy environments, but they typically require careful parameter tuning.
 Such features also have difficulty dealing with the non-stationary or instantaneous types of noise that are frequent in our task.
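
A minimal sketch (not the paper's method) of the energy-threshold VAD referenced above; the frame size, hop, margin, and percentile noise-floor estimate are illustrative assumptions. It shows why such detectors false-alarm in noise: any non-speech event with enough energy crosses the threshold.

import numpy as np

def energy_vad(signal, sample_rate, frame_ms=25, hop_ms=10, margin_db=6.0):
    """Label a frame as speech (True) when its log-energy exceeds an
    estimated noise floor by margin_db decibels."""
    signal = np.asarray(signal, dtype=float)
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    n_frames = max(0, 1 + (len(signal) - frame_len) // hop_len)
    log_e = np.empty(n_frames)
    for i in range(n_frames):
        frame = signal[i * hop_len : i * hop_len + frame_len]
        log_e[i] = 10.0 * np.log10(np.sum(frame ** 2) + 1e-12)
    # Crude noise-floor estimate: the 10th percentile of frame energies.
    noise_floor = np.percentile(log_e, 10)
    return log_e > noise_floor + margin_db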

 Harmonicity is a basic property of any periodic signal, so it is not a useful speech cue by itself.
 Nonetheless, some work reports good results with harmonicity-based features even at very low SNR conditions.
 Modulation frequencies (MF) measure the temporal rate of change of energy across different frequency bands.
 Some studies used a purely MF-inspired feature set for discriminating between speech and non-speech, and reported good generalization to noise types not included in training (hedged sketches of both feature families follow below).
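
For concreteness, here are hedged sketches of the two feature families: a normalized-autocorrelation harmonicity measure and a sub-band modulation spectrum. The pitch range (80-400 Hz), envelope definition, and frame-rate sampling are common choices assumed for illustration; the paper's exact formulations may differ.

import numpy as np

def harmonicity(frame, sample_rate, f0_min=80.0, f0_max=400.0):
    """Peak of the normalized autocorrelation within the plausible pitch
    lag range: near 1 for periodic (voiced) frames, near 0 for noise.
    The frame should be at least sample_rate / f0_min samples long."""
    frame = np.asarray(frame, dtype=float)
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo = int(sample_rate / f0_max)                    # shortest pitch period
    hi = min(int(sample_rate / f0_min), len(ac) - 1)  # longest pitch period
    return float(np.max(ac[lo:hi]) / (ac[0] + 1e-12))

def modulation_spectrum(band_energy, frame_rate):
    """Magnitude spectrum of one sub-band energy envelope sampled at the
    frame rate; speech concentrates modulation energy around ~4 Hz."""
    env = np.asarray(band_energy, dtype=float)
    env = env - env.mean()
    mags = np.abs(np.fft.rfft(env))
    freqs = np.fft.rfftfreq(len(env), d=1.0 / frame_rate)
    return freqs, mags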

 Figure 1: Example of distant speech from the forklift database.

 Database consisting of speech commands from 26 subjects, with noise added to simulate a variety of SNR values ranging from -5 to 15 dB.
 Recorded with an array microphone.
 Classification was performed without any additional post-processing.
 Transitions between speech and non-speech were excluded from scoring.
 Training data: 4 minutes of speech, 22 minutes of non-speech.

 For comparison, we include results based on Relative Spectral Entropy (RSE), Long-Term Signal Variability (LTSV), and statistical model-based VADs using MFCCs and MF as features to Gaussian mixture models (GMMs).
 EER: equal error rate
 FAR: false alarm rate (a computation sketch for both follows below)
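
To make the metrics concrete, here is a small sketch of computing miss rate, false alarm rate, and EER from per-frame detector scores and binary reference labels (frame-level scoring is an assumption here):

import numpy as np

def detection_rates(scores, labels):
    """Sweep every score as a threshold (declare speech when score > t)
    and return the miss rate and false alarm rate at each threshold."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)  # 1 = speech, 0 = non-speech
    order = np.argsort(scores)
    scores, labels = scores[order], labels[order]
    n_speech = labels.sum()
    n_nonspeech = len(labels) - n_speech
    miss = np.cumsum(labels) / n_speech               # speech at or below t
    fa = 1.0 - np.cumsum(1.0 - labels) / n_nonspeech  # non-speech above t
    return scores, miss, fa

def equal_error_rate(scores, labels):
    """EER is the operating point where miss rate equals false alarm rate."""
    _, miss, fa = detection_rates(scores, labels)
    i = int(np.argmin(np.abs(miss - fa)))
    return (miss[i] + fa[i]) / 2.0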

 Performance varied with the specific kind of noise.

 The EER and FAR for each noise type are usually obtained at different thresholds, so a single global operating point cannot be optimal for every noise type.

 The SNR values ranged from 5 to 25 dB.
 Corpus: four microphone channels, 10 hours long, 400 command words.
 Commands are sparse; the forklift is mostly idle, waiting for commands or taking time to execute them.
 ASR was performed using PocketSUMMIT with a vocabulary of 57 words; out-of-domain (OOD) commands were modeled by a single GMM.
 The ASR model was trained on over 3600 command utterances from 18 talkers.

 We examined the influence of different VAD systems on ASR results for the task of commanding an autonomous forklift in real-world environments.

 Described the task of VAD on distant speech in low-SNR environments for an autonomous robotic forklift.
 Designed a two-stage approach to speech/non-speech classification (a speculative sketch follows below).
 Parallel SVMs outperformed classification based on the whole MF spectrum; combining MF with a simple harmonicity measure reduced the false alarm rate by another 9% at low miss rates.
 In ASR experiments, the proposed VAD outperformed standard VADs and achieved a WER very close to that obtained with hand-labeled endpoints.
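
Since the slides only name the two-stage design, the following is a speculative sketch of one plausible realization: parallel SVMs, one per MF sub-band, whose averaged posteriors are gated by the harmonicity measure. The AND-style fusion, the per-band split, and all thresholds are assumptions, not the paper's exact architecture.

import numpy as np
from sklearn.svm import SVC

class TwoStageVAD:
    """Stage 1: parallel per-band SVMs on MF features; stage 2: gate the
    fused SVM score with a simple harmonicity threshold."""

    def __init__(self, n_bands, harmonicity_threshold=0.3):
        self.svms = [SVC(probability=True) for _ in range(n_bands)]
        self.h_thr = harmonicity_threshold

    def fit(self, band_features, labels):
        # band_features: list of (n_frames, dim) arrays, one per MF band;
        # labels: 0 = non-speech, 1 = speech.
        for svm, feats in zip(self.svms, band_features):
            svm.fit(feats, labels)
        return self

    def predict(self, band_features, harmonicity_scores):
        # Average the per-band posteriors for the speech class (index 1).
        post = np.mean([svm.predict_proba(f)[:, 1]
                        for svm, f in zip(self.svms, band_features)], axis=0)
        # Accept a frame only when MF evidence and harmonicity agree.
        return (post > 0.5) & (np.asarray(harmonicity_scores) > self.h_thr)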