Keyword Spotting
Dynamic Time Warping
Ali Akbar Jabini, Alexandre Mercier-Dalphond
Spring 2006
Introduction
  - Speech recognition: a computer interprets spoken input
  - Needs an input device to digitize sound: a microphone
  - People can speak faster than they can type
  - Commercial systems have been available since the 1990s
  - People still prefer physical interaction: keyboard/mouse, on/off switch
  - Low accuracy for large vocabularies with noise (50%)
Introduction
  - Speech recognition is increasingly used with small vocabularies
      - Credit card systems
      - Simple switching commands
      - Directory assistance
  - Cheap to implement
  - High accuracy
  - The system can verify its interpretation
  - Idea: speech recognition for household appliances
OUTLINE
  - Area of investigation
  - Concrete task/Goal
  - Schematic
  - Feature extraction
  - DTW
  - Training
  - Evaluation metrics
  - Conclusion
Area of Investigation
  - Keyword spotting: a subfield of speech recognition
  - Grammar constrained
  - Keyword spotting in isolated word recognition
      - Keyword utterances
      - Keywords separated by silence
  - Main technique is DTW
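The slide says keywords are separated by silence but does not say how the silence is found. A minimal sketch, assuming a simple short-time energy threshold (the function name `split_on_silence`, the frame length, and the threshold are all hypothetical choices, not the authors' method):

```python
def split_on_silence(samples, frame_len=4, threshold=0.01):
    """Return (start, end) sample indices of runs of non-silent frames.

    A frame counts as speech when its mean energy exceeds `threshold`.
    Trailing samples that do not fill a whole frame are ignored.
    """
    segments = []
    start = None
    n_frames = len(samples) // frame_len
    for k in range(n_frames):
        frame = samples[k * frame_len:(k + 1) * frame_len]
        energy = sum(s * s for s in frame) / frame_len
        if energy > threshold and start is None:
            start = k * frame_len                      # speech begins
        elif energy <= threshold and start is not None:
            segments.append((start, k * frame_len))    # speech ends
            start = None
    if start is not None:                              # speech runs to the end
        segments.append((start, n_frames * frame_len))
    return segments


# Toy signal: silence, a "word", silence, a shorter "word".
toy = [0.0] * 8 + [0.5] * 8 + [0.0] * 8 + [0.5] * 4
segments = split_on_silence(toy)
```

Each returned segment would then be treated as one isolated keyword candidate. A fixed threshold is unlikely to hold up in high noise, so it would probably need to adapt to the background level.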
Concrete task/Goal
  - Goal: develop a robust, speaker-independent keyword spotting scheme to operate household appliances
  - Concrete tasks
      - Digitize the sound input
      - Implement in MATLAB
      - Train the model with the grammar
      - Analyze the performance of our scheme
Schematic
  Microphone -> A/D -> Feature extraction -> DTW -> Output
  Grammar -> DTW
Feature extraction
  - Steps: pre-emphasis, blocking into frames, windowing
      - Pre-emphasis: flattens the spectrum of the signal
      - Blocking into frames: length of the Fourier transform
      - Windowing: sample window (maybe Hamming)
  - Mel-frequency cepstral coefficients (MFCCs)
      - More reliable than LPC coefficients
      - These will be the input to the DTW algorithm
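The first three steps can be sketched as follows. This is an illustrative Python sketch rather than the planned MATLAB code; the pre-emphasis coefficient 0.97, the frame length 256, and the hop of 128 samples are assumed values that the slides do not specify, and all function names are hypothetical. The MFCC computation itself is not shown.

```python
import math


def pre_emphasize(signal, alpha=0.97):
    """First-order filter y[n] = x[n] - alpha * x[n-1], which boosts high
    frequencies and so flattens the spectrum of the signal."""
    if not signal:
        return []
    return [signal[0]] + [signal[n] - alpha * signal[n - 1]
                          for n in range(1, len(signal))]


def block_into_frames(signal, frame_len, hop):
    """Split the signal into overlapping frames; a trailing partial frame
    is dropped."""
    return [signal[start:start + frame_len]
            for start in range(0, len(signal) - frame_len + 1, hop)]


def hamming(frame_len):
    """Hamming window w[n] = 0.54 - 0.46 * cos(2*pi*n / (N - 1))."""
    if frame_len == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
            for n in range(frame_len)]


def windowed_frames(signal, frame_len=256, hop=128, alpha=0.97):
    """Pre-emphasize, frame, and window a signal (assumed parameters)."""
    emphasized = pre_emphasize(signal, alpha)
    window = hamming(frame_len)
    return [[s * w for s, w in zip(frame, window)]
            for frame in block_into_frames(emphasized, frame_len, hop)]


frames = windowed_frames([1.0] * 512)
```

Each windowed frame would then go through a Fourier transform of the same length before the mel filter bank and cepstral steps.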
DTW
  - Idea: find the smallest distance between an input and the training bank
  - Uses cepstrum features
  - Dynamic programming: the time axis is not linear, to account for variation in how utterances are timed
      - t0 -> t0+5
      - t1 -> t1-2
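A minimal sketch of the dynamic programming idea, in Python for illustration. It assumes a Euclidean distance between feature frames and the basic step pattern (match, insertion, deletion) with no slope constraint or path normalization; the slides do not say which variant the authors intend, and the names `euclidean` and `dtw_distance` are hypothetical.

```python
import math


def euclidean(a, b):
    """Distance between two feature frames of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(seq_a, seq_b, dist=euclidean):
    """Cumulative cost of the cheapest warping path between two sequences
    of feature frames. cost[i][j] is the best cost of aligning the first
    i frames of seq_a with the first j frames of seq_b."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # seq_a frame repeated
                                 cost[i][j - 1],      # seq_b frame repeated
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]


# The same "word" spoken at two speeds warps onto itself at zero cost.
fast = [[0.0], [1.0], [2.0]]
slow = [[0.0], [0.0], [1.0], [1.0], [2.0]]
same_word = dtw_distance(fast, slow)
```

Because frames may be repeated along the path, a stretched utterance can still match its template exactly, which is the non-linear time axis the slide refers to.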
Training
  - Need to create our own grammar
      - On: Onnn, Honnn, open, opeeenn
      - Off: Hooofff, Hoff, offfff, close
  - As many potential utterances as possible
  - Use this data with DTW
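One way the recorded utterances could be used with DTW is as a bank of labeled templates, with the input assigned the label of the closest template. The Python sketch below assumes that nearest-template rule; the one-dimensional toy features stand in for real cepstral vectors, and `spot_keyword` and `templates` are hypothetical names.

```python
import math


def dtw_distance(seq_a, seq_b):
    """Basic DTW cost between two sequences of feature frames
    (Euclidean frame distance, no path constraints)."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.sqrt(sum((x - y) ** 2
                              for x, y in zip(seq_a[i - 1], seq_b[j - 1])))
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1],
                                 cost[i - 1][j - 1])
    return cost[n][m]


def spot_keyword(features, templates):
    """Return (label, distance) for the closest template in the bank."""
    best_label, best_dist = None, float("inf")
    for label, examples in templates.items():
        for example in examples:
            d = dtw_distance(features, example)
            if d < best_dist:
                best_label, best_dist = label, d
    return best_label, best_dist


# Toy training bank: several utterances per command (made-up features).
templates = {
    "on": [[[1.0], [2.0], [2.0]], [[1.0], [1.0], [2.0], [2.0], [2.0]]],
    "off": [[[5.0], [4.0], [4.0]], [[5.0], [5.0], [4.0]]],
}
label, dist = spot_keyword([[1.0], [2.0], [2.0], [2.0]], templates)
```

Storing many utterances per command, as the slide proposes, raises the chance that some template lies close to a new speaker's pronunciation, at the price of one DTW comparison per stored template.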
Evaluation metrics
  - Accuracy
      - High noise
      - Low noise
      - Independent speaker
      - Training data speaker
  - Would like to obtain 80% or more
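Accuracy under each test condition could be tallied as in the Python sketch below. The result records are made-up placeholders, not measurements, and `accuracy_by_condition` is a hypothetical helper; only the 80% target comes from the slide.

```python
from collections import defaultdict


def accuracy_by_condition(results):
    """results: iterable of (condition, true_label, predicted_label).
    Returns the fraction of correct predictions for each condition."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for condition, truth, predicted in results:
        total[condition] += 1
        if truth == predicted:
            correct[condition] += 1
    return {c: correct[c] / total[c] for c in total}


# Placeholder records for illustration only.
results = [
    ("low noise", "on", "on"),
    ("low noise", "off", "off"),
    ("high noise", "on", "off"),
    ("high noise", "off", "off"),
]
accuracy = accuracy_by_condition(results)
TARGET = 0.80  # goal stated on the slide
meets_target = {c: a >= TARGET for c, a in accuracy.items()}
```

The same function works for the speaker conditions by using "independent speaker" and "training data speaker" as the condition strings.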
Conclusion
  - Early stage
      - No code implemented yet
      - Many challenges ahead
      - Our methodology may change slightly
  - There is a big potential market for such a technique -> influence on everyday life