1
Ekapol Chuangsuwanich and James Glass, MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, Massachusetts 02139, USA. Presented 2012/07/2 by 汪逸婷.
2
Introduction
Harmonicity
Modulation Frequency
Experiments and Discussions
◦ Speech/Non-speech detection capabilities
◦ Effect on ASR performance
Conclusion
3
Voice Activity Detection (VAD) is the process of identifying segments of speech in a continuous audio stream. It is typically the first stage of a speech processing application and produces a binary speech/non-speech decision. VAD is used both to reduce computation, by eliminating unnecessary transmission and processing of non-speech segments, and to reduce potential mis-recognition errors in such segments. In this paper, we consider the task of giving commands to an autonomous forklift.
4
In high-quality recording conditions, energy-based methods perform well. In noisy conditions, however, energy-based measures often produce a considerable number of false alarms. A large variety of other features has been investigated for use in noisy environments, but these typically require parameter tuning and have difficulty dealing with the non-stationary or instantaneous types of noise that are frequent in our task.
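For reference, the kind of energy-based baseline discussed above can be sketched as a fixed-threshold frame-energy detector. The frame size, hop, and threshold below are illustrative assumptions (16 kHz mono audio), not values from the paper.

```python
# Minimal sketch of an energy-based VAD baseline (not the paper's method).
# Assumes 16 kHz mono audio in a NumPy array; frame size, hop, and the
# threshold are illustrative values that would need tuning per environment.
import numpy as np

def energy_vad(signal, frame_len=400, hop=160, threshold_db=-35.0):
    """Return a boolean speech/non-speech decision per frame."""
    decisions = []
    peak = np.max(np.abs(signal)) + 1e-12           # reference for the dB scale
    for start in range(0, len(signal) - frame_len, hop):
        frame = signal[start:start + frame_len]
        rms = np.sqrt(np.mean(frame ** 2)) + 1e-12  # frame energy (RMS)
        level_db = 20.0 * np.log10(rms / peak)      # level relative to peak
        decisions.append(level_db > threshold_db)   # simple fixed threshold
    return np.array(decisions)
```

A detector like this is what produces the false alarms mentioned above: any noise burst above the fixed threshold is accepted as speech.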
5
Harmonicity is a basic property of any periodic signal, so it is not useful by itself for separating speech from other periodic sounds; nevertheless, some work has shown good results even at very low SNR conditions. Modulation frequencies (MF) measure the temporal rate of change of energy across different frequency bands. Some studies have used a purely MF-inspired set of features for discriminating between speech and non-speech, which gave good generalization to noise types not included in training.
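For concreteness, the sketch below shows one common way to compute a simple harmonicity measure (the peak of the normalized autocorrelation in the pitch lag range) and a modulation-frequency representation (the FFT of each band's energy trajectory across frames). The band layout, lag range, and frame settings are assumptions and differ from the exact features used in the paper.

```python
# Illustrative sketch of the two feature families named above; the exact
# feature definitions in the paper differ, and the band/lag choices here
# are assumptions for a 16 kHz signal.
import numpy as np

def harmonicity(frame, fs=16000, f0_min=80, f0_max=400):
    """Peak of the normalized autocorrelation within the plausible pitch range."""
    frame = frame - np.mean(frame)
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    ac = ac / (ac[0] + 1e-12)                        # normalize by lag-0 energy
    lo, hi = int(fs / f0_max), int(fs / f0_min)      # lag range for 80-400 Hz pitch
    return np.max(ac[lo:hi])

def modulation_spectrum(signal, fs=16000, frame_len=400, hop=160):
    """Temporal rate of change of energy per frequency band: magnitude FFT of
    each band's log-energy trajectory across frames."""
    frames = [signal[s:s + frame_len] * np.hanning(frame_len)
              for s in range(0, len(signal) - frame_len, hop)]
    spec = np.abs(np.fft.rfft(np.array(frames), axis=1))  # (frames, freq bands)
    env = np.log(spec + 1e-12)                             # band energy envelopes
    return np.abs(np.fft.rfft(env, axis=0))                # modulation freq x band
```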
6
Figure 1: Example of distant speech from the forklift database.
10
The database consists of speech commands from 26 subjects, recorded with an array microphone, with noise added to simulate a variety of SNR values ranging from -5 to 15 dB. Classification was performed without any additional post-processing, and transitions between speech and non-speech were excluded. Training data consisted of 4 minutes of speech and 22 minutes of non-speech.
11
For comparison, we include results based on Relative Spectral Entropy (RSE), Long-Term Signal Variability (LTSV), and statistical model-based VADs that use MFCCs and MF as features to Gaussian mixture models (GMMs). EER: equal error rate. FAR: false alarm rate.
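As a reminder of how these metrics relate, the sketch below computes the false alarm rate, miss rate, and an approximate EER from per-frame detector scores and reference labels; sweeping the unique score values as thresholds is an illustrative choice, not the paper's evaluation code.

```python
# Sketch of EER and FAR computed from per-frame VAD scores and reference
# labels; assumes a higher score means "more likely speech" and labels in {0, 1}.
import numpy as np

def far_and_miss(scores, labels, threshold):
    """False alarm rate on non-speech frames, miss rate on speech frames."""
    decisions = scores > threshold
    far = np.mean(decisions[labels == 0])    # non-speech frames accepted as speech
    miss = np.mean(~decisions[labels == 1])  # speech frames rejected
    return far, miss

def equal_error_rate(scores, labels):
    """Sweep thresholds and return the operating point where FAR ≈ miss rate."""
    best_gap, eer = np.inf, None
    for t in np.unique(scores):
        far, miss = far_and_miss(scores, labels, t)
        if abs(far - miss) < best_gap:
            best_gap, eer = abs(far - miss), (far + miss) / 2.0
    return eer
```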
12
Performance varied with the specific kind of noise.
13
EER and FAR for each noise type are usually obtained at different thresholds.
14
The SNR values ranged from 5 to 25 dB. The corpus has four microphone channels, is 10 hours long, and contains 400 command words. Commands are sparse; the forklift is mostly idle, waiting for commands or taking time to execute them. ASR was performed using PocketSUMMIT with a vocabulary size of 57 words; out-of-domain (OOD) commands were modeled by a single GMM. The ASR model was trained on over 3600 utterances of commands from 18 talkers.
15
We examined the influence of different VAD systems on ASR results for the task of commanding an autonomous forklift in real-world environments.
16
We described the task of VAD on distant speech in low-SNR environments for an autonomous robotic forklift and designed a two-stage approach for speech/non-speech classification. The parallel SVM outperformed classification based on the whole MF spectrum, and combining MF with a simple harmonicity measure helped reduce the false alarm rate by another 9% at low miss rates. In ASR experiments, our VAD outperformed standard VADs and achieved a WER very close to that of hand-labeled endpoints.
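To make the "parallel SVM" idea concrete, a minimal sketch is given below: one SVM is trained per frequency band of the MF representation, and the per-band scores are averaged into a single speech score. The feature shapes, the use of scikit-learn's SVC, and the score-averaging rule are assumptions for illustration, not the paper's exact configuration.

```python
# Minimal sketch of a parallel per-band SVM: train one SVM per frequency band
# of the MF representation and average the per-band posteriors.
# Feature shapes and the combination rule are assumptions, not the paper's
# exact setup.  Requires NumPy and scikit-learn.
import numpy as np
from sklearn.svm import SVC

class ParallelBandSVM:
    def __init__(self, n_bands):
        self.models = [SVC(probability=True) for _ in range(n_bands)]

    def fit(self, X, y):
        # X: (n_frames, n_bands, n_features_per_band); y: 0/1 speech labels
        for b, model in enumerate(self.models):
            model.fit(X[:, b, :], y)
        return self

    def score_frames(self, X):
        # Average the per-band posterior probabilities for the "speech" class
        probs = [m.predict_proba(X[:, b, :])[:, 1]
                 for b, m in enumerate(self.models)]
        return np.mean(probs, axis=0)
```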