Slide 1: M4 speech recognition
University of Sheffield
Martin Karafiát*, Steve Renals, Vincent Wan

Slide 2: Status at last meeting
- No working speech recognition system
- Use HTK for speech recognition
  – No large vocabulary decoder with HTK
- Use DUcoder instead of HVite
  – Provided by TUM
- Use SRI LM toolkit for language modelling (see the sketch after this slide)
- Build initial models using SWITCHBOARD and ICSI meetings data
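
As a concrete illustration of the SRI LM toolkit step above, the sketch below estimates a trigram language model with ngram-count and checks it on held-out text with ngram. The file names (swb_icsi_train.txt, wordlist.txt, lm3.arpa, heldout.txt) are placeholders, and the Kneser-Ney smoothing options are a common default rather than the settings actually used for the M4 models.

```python
# Hypothetical sketch: build and evaluate a trigram LM with the SRI LM toolkit.
# File names are placeholders; the smoothing options are a common choice,
# not necessarily the settings used for the M4 system.
import subprocess

def train_trigram_lm(train_text, vocab, lm_out):
    """Estimate a trigram LM with modified Kneser-Ney discounting."""
    subprocess.run(
        ["ngram-count",
         "-order", "3",
         "-text", train_text,          # one training sentence per line
         "-vocab", vocab,              # word list defining the vocabulary
         "-kndiscount", "-interpolate",
         "-lm", lm_out],               # ARPA-format output model
        check=True)

def perplexity(lm, heldout_text):
    """Report the perplexity of an ARPA LM on held-out text."""
    subprocess.run(
        ["ngram", "-order", "3", "-lm", lm, "-ppl", heldout_text],
        check=True)

if __name__ == "__main__":
    train_trigram_lm("swb_icsi_train.txt", "wordlist.txt", "lm3.arpa")
    perplexity("lm3.arpa", "heldout.txt")
```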

Slide 3: Software limitations
- Acoustic modelling
  – HTK 3.2
    - No efficient large vocabulary decoder
    - Can use only bigram language models
    - HVite, a time-synchronous decoder, is slow
- Decoding
  – Ducoder
    - Based on HTK version 2
    - Only capable of word-internal context-dependent triphone decoding
- A trade-off between cross-word triphone models and a trigram language model appears necessary! (The sketch after this slide illustrates the difference in triphone context.)
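
To make the word-internal vs. cross-word trade-off concrete, the sketch below expands a short word sequence into triphone labels both ways, using HTK-style left-centre+right notation. The two-word pronunciation dictionary is invented for the example and is not part of the M4 setup.

```python
# Illustrative only: expand a word sequence into word-internal vs. cross-word
# triphone labels (HTK-style "left-centre+right" notation).
# The toy pronunciation dictionary below is invented for this example.
TOY_DICT = {
    "the": ["dh", "ah"],
    "meeting": ["m", "iy", "t", "ih", "ng"],
}

def format_triphone(left, centre, right):
    label = centre
    if left:
        label = f"{left}-{label}"
    if right:
        label = f"{label}+{right}"
    return label

def word_internal_triphones(words, lexicon):
    """Phone context is reset at every word boundary."""
    labels = []
    for w in words:
        phones = lexicon[w]
        for i, p in enumerate(phones):
            left = phones[i - 1] if i > 0 else None
            right = phones[i + 1] if i < len(phones) - 1 else None
            labels.append(format_triphone(left, p, right))
    return labels

def cross_word_triphones(words, lexicon):
    """Context crosses word boundaries, so boundary phones pick up contexts
    from neighbouring words (and many more distinct models are needed)."""
    phones = [p for w in words for p in lexicon[w]]
    labels = []
    for i, p in enumerate(phones):
        left = phones[i - 1] if i > 0 else None
        right = phones[i + 1] if i < len(phones) - 1 else None
        labels.append(format_triphone(left, p, right))
    return labels

if __name__ == "__main__":
    words = ["the", "meeting"]
    print("word-internal:", word_internal_triphones(words, TOY_DICT))
    print("cross-word:   ", cross_word_triphones(words, TOY_DICT))
```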

Slide 4: System architecture
- Front end
- N-best lattice generation
  – Best-first decoding (Ducoder)
  – Trigram language model (SRILM)
  – Word-internal triphone models
- MLLR adaptation (HTK)
- Lattice rescoring
  – Time-synchronous decoding (HTK)
  – Cross-word triphone models
- Recognition output (see the pipeline sketch after this slide)
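
The sketch below is a minimal, hypothetical rendering of the two-pass architecture on this slide. Every function (front_end, ducoder_nbest, mllr_adapt, htk_rescore) is a placeholder standing in for a call to the corresponding external tool rather than an existing API; the point is only how the stages feed into one another.

```python
# Hypothetical orchestration of the two-pass M4 decoding pipeline.
# Each function stands in for a call to an external tool (HTK front end,
# Ducoder, HTK adaptation/rescoring); the bodies are placeholders, not real APIs.

def front_end(audio_file):
    """Extract acoustic feature vectors (e.g. MFCCs) from the audio."""
    return f"features({audio_file})"

def ducoder_nbest(features, wi_triphones, trigram_lm, n=100):
    """First pass: best-first decoding with word-internal triphones and a
    trigram LM, producing an n-best list of hypotheses."""
    return [f"hyp_{i}" for i in range(n)]

def mllr_adapt(models, features, hypothesis):
    """Estimate MLLR transforms for the speaker, using the first-pass
    hypothesis as (unsupervised) supervision."""
    return f"{models}+mllr"

def htk_rescore(nbest, features, adapted_models):
    """Second pass: time-synchronous rescoring of the n-best hypotheses
    with adapted cross-word triphone models; return the best one."""
    return nbest[0]

def recognise(audio_file, wi_triphones, cw_triphones, trigram_lm):
    feats = front_end(audio_file)
    nbest = ducoder_nbest(feats, wi_triphones, trigram_lm)
    adapted = mllr_adapt(cw_triphones, feats, nbest[0])
    return htk_rescore(nbest, feats, adapted)

if __name__ == "__main__":
    # Placeholder file names, for illustration only.
    print(recognise("meeting.wav", "wi_triphones.mmf", "cw_triphones.mmf", "trigram.lm"))
```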

Slide 5: System limitations
- N-best list rescoring is not optimal
- Many more hyper-parameters to tune manually
- Adaptation must be performed on two sets of acoustic models

Slide 6: Method comparison
- Ducoder/HTK system compared to a pure HTK system (see the WER sketch after this slide):
  – Pure HTK: 61.51% WER
    - Cross-word triphones, bigram LM
    - Decoding time: over a month
  – Ducoder/HTK: 60.29% WER
    - Word-internal triphones & trigram LM, rescored using cross-word triphones
    - Decoding time: a few days
- Early comparison result obtained using Switchboard data
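
For reference, the word error rate figures above follow the standard definition: (substitutions + deletions + insertions) / number of reference words, computed from a Levenshtein alignment. The sketch below implements that generic definition; it is not the scoring tool actually used for these numbers (typically HTK's HResults or NIST sclite).

```python
# Generic word error rate (WER) computation by Levenshtein alignment.
# This is the standard definition, not the specific scoring tool used
# for the numbers on the slide.

def wer(reference, hypothesis):
    """Return (substitutions + deletions + insertions) / len(reference words)."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                            # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                            # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,   # match / substitution
                          d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1)         # insertion
    return d[len(ref)][len(hyp)] / len(ref)

if __name__ == "__main__":
    print(f"{100 * wer('the meeting starts at ten', 'a meeting starts ten'):.2f}% WER")
```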

Slide 7: Current recognisers
- SWITCHBOARD recogniser
  – Acoustic & language models trained on 200 hours of speech
- ICSI meetings recogniser
  – Acoustic models trained on 40 hours of speech
  – Language model is a combination of SWB and ICSI (see the interpolation sketch after this slide)
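
One common way to combine two language models with SRILM is static interpolation via ngram's -mix-lm option; the sketch below shows that approach. The file names and the 50/50 mixture weight are placeholders, and the actual SWB/ICSI combination may have been built differently (for example with a perplexity-tuned weight).

```python
# Hypothetical sketch: interpolate a Switchboard LM with an ICSI meetings LM
# using SRILM's ngram tool. File names and the mixture weight are placeholders;
# the actual M4 combination may have been done differently.
import subprocess

def interpolate_lms(lm_a, lm_b, weight_a, lm_out):
    """Write a static interpolation of two ARPA LMs:
    weight_a * lm_a + (1 - weight_a) * lm_b."""
    subprocess.run(
        ["ngram", "-order", "3",
         "-lm", lm_a,
         "-mix-lm", lm_b,
         "-lambda", str(weight_a),   # weight given to the first LM
         "-write-lm", lm_out],
        check=True)

if __name__ == "__main__":
    interpolate_lms("swb.lm", "icsi.lm", 0.5, "swb_icsi_mixed.lm")
```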

Slide 8: Results
Recogniser                                    | SWITCHBOARD | ICSI meetings
Test on same data without adaptation          | 55.05% WER  | 52.34% WER
Test on same data with speaker adaptation     | N/A         | 49.27% WER
Test on M4 data without adaptation            | 88.14% WER  | 74.00% WER
Test on M4 data with unsupervised adaptation  | 84.67% WER  | N/A

Slide 9: Areas for immediate improvement
- Increase the number of hypotheses in the n-best list
- Word-internal triphone models need adaptation
- Fully supervised speaker-dependent adaptation
- Vocal tract length normalisation (a warp-factor search sketch follows)
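
Vocal tract length normalisation is usually implemented as a per-speaker grid search over frequency warp factors, keeping the factor whose warped features score best against the current acoustic models. The sketch below shows only that search loop; the scoring function is a stand-in for a real forced-alignment likelihood computation, and the 0.88-1.12 range with 0.02 steps is a conventional choice, not a value taken from this system.

```python
# Outline of a standard VTLN warp-factor search: try a grid of frequency warp
# factors for each speaker and keep the one whose warped features give the
# highest likelihood under the current acoustic models.

def choose_warp_factor(score_fn, lo=0.88, hi=1.12, step=0.02):
    """Grid search over warp factors (0.88-1.12 is a conventional range).
    score_fn(warp) should return the log-likelihood of the speaker's data
    with the filterbank warped by `warp` (e.g. via forced alignment)."""
    best_warp, best_score = 1.0, float("-inf")
    n_steps = int(round((hi - lo) / step))
    for i in range(n_steps + 1):
        warp = lo + i * step
        score = score_fn(warp)
        if score > best_score:
            best_warp, best_score = warp, score
    return best_warp

if __name__ == "__main__":
    # Toy scorer peaking at warp 0.96, standing in for a real alignment likelihood.
    toy_score = lambda w: -((w - 0.96) ** 2)
    print(round(choose_warp_factor(toy_score), 2))
```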

Slide 10: Coming soon
- Release current models and scripts
- Merged Switchboard and ICSI acoustic models
- ASR transcriptions of the M4 meetings