Deep Neural Network for Robust Speech Recognition With Auxiliary Features From Laser-Doppler Vibrometer Sensor Zhipeng Xie1,2,Jun Du1, Ian McLoughlin3.

Presentation transcript:

Deep Neural Network for Robust Speech Recognition With Auxiliary Features From Laser-Doppler Vibrometer Sensor Zhipeng Xie1,2, Jun Du1, Ian McLoughlin3, Yong Xu2, Feng Ma2, Haikun Wang2 1 National Engineering Laboratory for Speech and Language Processing 2 Research Department, iFLYTEK Co. LTD. 3 School of Computing, University of Kent, Medway, UK October 18, 2016 Hello everyone. I'm Xie Zhipeng from USTC. A long title, you know, can cause some inconvenience, like filling in the registration form for this conference. But it is important. The title of this talk is …

Outline Introduction Proposed Methods Experimental Results Conclusion I'll show you an interesting idea and experimental results in four steps. First, let me give you a brief introduction.

Pic from Y. Avargel, "Speech measurements etc.", 2011 Introduction Robust ASR using DNN Laser-Doppler Vibrometer (LDV) Sensor Directed at the speaker's larynx, non-contact Measures the vibration velocity Immune to acoustic interference The title indicates our… Target: robust ASR, which is closely related to our daily life, and of course this conference has many sessions about it. Model: DNN. Here we just want to try an interesting idea, and a DNN is enough, because DNNs have demonstrated a great capacity to extract discriminative internal representations that are robust to the many sources of variability in speech signals. What have we done? We found an interesting signal captured by a laser-Doppler vibrometer (LDV) sensor. Means: try to use LDV features to assist normal robust ASR. The LDV sensor is a non-contact measurement device capable of measuring the vibration frequencies of moving targets. It is directed at a speaker's larynx, and captures useful speech information in certain frequency bands. For this reason, the LDV signal remains almost the same regardless of the background acoustic interference at different SNRs. That's quite good, isn't it?

Outline Introduction Proposed Methods Experimental Results Conclusion So the LDV signal is interesting and probably good for our task; how can we use it? First of all, we run a simple test to validate the use of LDV speech and its features.

Proposed LDV Feature Combination (Limited dataset) Normal speech Normal + LDV speech makes sense! Stereo Data The availability of data is limited, and it contains parallel recordings of acoustic speech and LDV signals. In other words, each LDV speech signal has its corresponding normal speech; they were recorded simultaneously with different kinds and levels of background noise. The details will be shown later. 1. The normal-speech ASR system uses a DNN as the acoustic model: extract features and train with back-propagation, etc. 2. With the same approach as for normal speech, we ran a test combining these two features. Guess what, it makes sense! The combining system outperforms the acoustic-only system by exploiting the more robust information. But the ASR accuracy is restricted by the small size of the LDV database, such that the DNN-based ASR systems are not trained sufficiently well.
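The feature combination described above is a frame-by-frame concatenation of the acoustic and LDV feature streams recorded as stereo data. A minimal NumPy sketch (function name and toy data are my own, not from the talk):

```python
import numpy as np

def combine_features(acoustic_feats, ldv_feats):
    """Frame-by-frame concatenation of acoustic and LDV features.

    Both inputs are (num_frames, 72) LMFB feature matrices taken from
    the simultaneously recorded (stereo) data; the output is
    (num_frames, 144), matching the per-frame dimension of the
    combined-feature system.
    """
    assert acoustic_feats.shape == ldv_feats.shape
    return np.concatenate([acoustic_feats, ldv_feats], axis=1)

# Toy example: 100 frames of 72-dim features from each channel.
acoustic = np.random.randn(100, 72)
ldv = np.random.randn(100, 72)
combined = combine_features(acoustic, ldv)
print(combined.shape)  # (100, 144)
```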

Proposed Limited LDV dataset → validated Large dataset (Normal real-life speech) How to use these valuable data? Well-trained DNN for initialization How to get the corresponding LDV feature? - Mapping network: regression DNN We are therefore aiming to make use of a much larger dataset, and will test this with acoustic-only ASR first. Of course, iFLYTEK has many real-life recordings. Now we face two questions: Q a) The two datasets are not matched, so this imbalance might make the acoustic DNN model largely influenced by the larger dataset when simply combining the two. So we first train an acoustic-only DNN-based ASR system from the larger database alone. This provides a good initialization starting point. This DNN is then fine-tuned using the acoustic-only data from the LDV dataset (LDV-acoustic). Q b) When training the combining system, we need the corresponding LDV features, but the larger dataset only contains acoustic normal speech. So we obtain corresponding pseudo-LDV features by first training a mapping network to learn the relationship between the features from acoustic speech and the features from the LDV signal. Through this method, we obtain pseudo-parallel LDV features for the normal speech in the large dataset.
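The mapping network of Q b) is a regression DNN trained on the stereo LDV data to predict LDV features from acoustic features. A forward-pass sketch in NumPy, using the 792-2048-2048-792 layer sizes given later in the talk (the sigmoid hidden activation and the weight initialization here are my own assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def init_layer(n_in, n_out):
    # Small random weights; a real system would use a proper init scheme.
    return rng.normal(0.0, 0.01, (n_in, n_out)), np.zeros(n_out)

# Mapping-network layer sizes from the talk: 792-2048-2048-792.
sizes = [792, 2048, 2048, 792]
layers = [init_layer(a, b) for a, b in zip(sizes[:-1], sizes[1:])]

def map_to_pseudo_ldv(x):
    """Regression DNN forward pass: spliced acoustic features in,
    pseudo-LDV features out (sigmoid hidden layers, linear output)."""
    h = x
    for W, b in layers[:-1]:
        h = 1.0 / (1.0 + np.exp(-(h @ W + b)))  # sigmoid hidden layer
    W, b = layers[-1]
    return h @ W + b                            # linear output layer

def mse_loss(pred, target):
    # The network would be trained by minimizing this on the stereo data.
    return np.mean((pred - target) ** 2)

x = rng.normal(size=(4, 792))   # a toy batch of spliced acoustic features
pseudo_ldv = map_to_pseudo_ldv(x)
print(pseudo_ldv.shape)         # (4, 792)
```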

Proposed Large dataset to train DNN for initialization Regression DNN Following the previous routine, we first test this with the acoustic-only ASR system. The system below the dashed line is the DNN_N system. We use the larger dataset to train an acoustic DNN system and then use it to initialize the final acoustic ASR system. The results show an obvious improvement, which validates our hypothesis, so we want to create a similar network structure that uses the LDV feature combination method to further improve performance. If we add the LDV feature here, the system below the dashed line is the DNN_C system. We should also initialize this acoustic ASR system with the system well trained on the larger dataset, and here we need the corresponding LDV features to do the combination. The next step is to train a mapping network to map the normal speech features into LDV features; then the whole system can be organized like this. Now we have the large dataset with its pseudo-LDV features, everything is ready, and the system can be trained and tested.
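The two-stage recipe above (train on the large dataset, then fine-tune on the limited LDV-acoustic data) can be illustrated with a toy linear model standing in for the DNN; all data, sizes, and learning rates here are placeholders of my own, not the talk's settings:

```python
import numpy as np

rng = np.random.default_rng(1)

def sgd_epochs(W, X, Y, lr, epochs):
    """Plain least-squares gradient updates on a linear model, a
    stand-in for the DNN's back-propagation training."""
    for _ in range(epochs):
        grad = X.T @ (X @ W - Y) / len(X)
        W = W - lr * grad
    return W

dim, out = 16, 4

# Stage 1: train on the large acoustic-only dataset alone.
X_large = rng.normal(size=(1000, dim))
Y_large = rng.normal(size=(1000, out))
W = sgd_epochs(np.zeros((dim, out)), X_large, Y_large, lr=0.1, epochs=50)

# Stage 2: use that model as the initialization and fine-tune on the
# small LDV-acoustic dataset (typically with a smaller learning rate).
X_small = rng.normal(size=(50, dim))
Y_small = rng.normal(size=(50, out))
W_finetuned = sgd_epochs(W.copy(), X_small, Y_small, lr=0.01, epochs=20)
```

The point of the design is that the large dataset fixes a good starting point, so the small in-domain dataset only has to adapt the model rather than train it from scratch.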

Outline Introduction Proposed Methods Experimental Results Conclusion Next I will show you the experimental settings and results.

Experiments Corpus LDV dataset Total: 13k recordings in 16 kHz, 16 bits Speakers from: U.S., Hebrew Training: 54 speakers → 9.9 h Development set: 4 speakers → 0.62 h Test set: 4 speakers → 0.75 h CZ dataset Total: 66k recordings in 16 kHz, 16 bits → 620 h Speakers from: U.S., England, Canada Replayed in Toyota, Volkswagen and BMW cars, in 5 scenarios:

No. | Car Speed    | Window | Outside     | AC
1   | stationary   | closed | downtown    | middle
2   |              | open   | car park    | off
3   | ≤40 km/h     |        |             |
4   | 41-60 km/h   |        | countryside |
5   | 80-120 km/h  |        | highway     |

First of all, the corpus we use contains two independent parts. The limited LDV dataset is from the VocalZoom company, and includes speech recordings captured by LDV sensors along with corresponding acoustic recordings. Besides capturing in a clean environment, recordings were made with interfering acoustic noise present: an undesired speaker and background noise (from a moving vehicle) in addition to the desired speaker. Measurements by the LDV and acoustic sensors were recorded simultaneously. The SNRs are 0, -3, -6, -9 and -12 dB; detailed information can be found in the paper. The second database comes from the iFLYTEK research group and provides a large resource of normal-speech recordings of native English speakers. The larger dataset gives better coverage of pronunciations and speakers as a supplement to the LDV dataset. These were recorded first into high-quality audio files, then replayed in three different vehicles, namely Toyota, Volkswagen and BMW cars, in 5 different scenarios, shown in the table above. The 'Outside' column details the environment that the car is parked in or moving through, while the 'AC' column indicates whether the air conditioner is operating, either on a medium setting or turned off.

Experiments Experimental settings Features 72-dim LMFB (24 + Δ + ΔΔ) 10 neighboring frames (±5 frames) Acoustic-only feature: 72×11 = 792-dim Combining feature: 72×2×11 = 1584-dim Acoustic DNN 6 hidden layers with 2048 nodes State number: 9004 (senones of HMM) Regression DNN 2 hidden layers with 2048 nodes From normal speech features to LDV features Structure: 792-2048-2048-792 The features we use for both DNN regression and acoustic modeling are 72-dimensional log Mel filterbank (LMFB) features: 24 static coefficients plus Δ and ΔΔ.
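The ±5-frame context splicing above turns each 72-dim frame into a 792-dim input (11 frames × 72 dims). A NumPy sketch of that splicing (the edge-repetition padding at utterance boundaries is my own assumption):

```python
import numpy as np

def splice(feats, context=5):
    """Stack each frame with its +/- `context` neighbours.

    feats: (num_frames, 72) LMFB features.
    Returns (num_frames, 792) for context=5, i.e. 11 frames x 72 dims;
    utterance edges are padded by repeating the boundary frame.
    """
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    n = len(feats)
    return np.stack(
        [padded[i:i + 2 * context + 1].ravel() for i in range(n)]
    )

feats = np.random.randn(100, 72)
spliced = splice(feats)
print(spliced.shape)  # (100, 792)
```

Applying the same splicing to the 144-dim concatenated acoustic+LDV frames gives the 1584-dim combined input.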

Table 2: Results of LDV feature combination

System | Feature dim | SER    | WER
DNN_N  | 72          | 89.71% | 58.88%
DNN_C  | 144         | 84.27% | 52.42%

LDV data helps.

Fig 3: Results of LDV feature combination in different environment conditions

Table 3: Results of the systems with larger dataset for DNN initialization

System       | Feature dim | WER
DNN_LN       | 72          | 32.93%
DNN_LC       | 144         | 26.13%
joint-DNN_LC |             | 25.22%

More data for initialization, better performance.

Recognition performance is evaluated by word error rate (WER) and sentence error rate (SER); WER is of more concern when discussing the performance of acoustic models. Table 2 lists a performance comparison of the two systems with and without the combined auxiliary features from the LDV sensors. The only difference between the two systems is the input feature dimension. Both the WER and SER of the feature-combination DNN_C system are reduced by about 6% absolute over the DNN_N system, which verifies the effectiveness of the auxiliary LDV features. To further explore the effectiveness of using LDV information in different environments, we test the two systems on two subsets of utterances recorded in clean and noisy environments; the results are shown in Fig. 3. We can observe that the auxiliary LDV features improve the recognition performance in both clean and noisy environments, with relative WER reductions of 11.5% and 12.5%, respectively. The results of the systems initialized with the larger dataset are shown in Table 3. Comparing the first lines of Tables 2 and 3, we can see that with more training data, the system using acoustic-only features significantly outperforms its counterpart. The same conclusion can be drawn by comparing the second lines of Tables 2 and 3, where the normal speech features are combined with LDV features. Moreover, if we use both the larger dataset and the limited dataset to train the initialization system, we get a further remarkable performance gain.
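As a quick sanity check on the tables, the relative error-rate reductions can be computed directly (my own arithmetic, separate from the per-environment 11.5%/12.5% figures quoted for Fig. 3):

```python
def rel_reduction(before, after):
    """Relative reduction in an error rate, in percent."""
    return (before - after) / before * 100.0

# Table 2: DNN_N -> DNN_C on the limited LDV dataset.
print(round(rel_reduction(58.88, 52.42), 1))  # 11.0  (WER, relative)
print(round(rel_reduction(89.71, 84.27), 1))  # 6.1   (SER, relative)

# Table 3: DNN_LN -> DNN_LC / joint-DNN_LC with large-dataset init.
print(round(rel_reduction(32.93, 26.13), 1))  # 20.6  (WER, relative)
print(round(rel_reduction(32.93, 25.22), 1))  # 23.4  (WER, relative)
```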

Outline Introduction Proposed Methods Experimental Results Conclusion Last but not least, let me conclude.

Conclusion New & Interesting Idea: LDV + normal speech combination Well-trained DNN for initialization Regression network to get corresponding pseudo-LDV features Future work: Practical use in daily life More LDV data Other recognition methods In this talk, we have demonstrated a new and interesting idea: using auxiliary information derived from an LDV sensor to improve ASR performance. Because the LDV signal is immune to acoustic interference, we combine LDV features with normal acoustic speech features to train a DNN acoustic model. Experimental results show significant improvements in recognition accuracy under both clean and noisy conditions. Furthermore, after pre-training the DNN model with pseudo-LDV features combined with acoustic features extracted from a large dataset, the ASR system achieves much better performance than one trained with the smaller LDV dataset alone. In the future, we aim to put this to practical use and collect more training data; other recognition methods may also improve the performance.

Thanks & Questions