Acoustical and Lexical Based Confidence Measures for a Very Large Vocabulary Telephone Speech Hypothesis-Verification System
Javier Macías-Guarasa, Javier Ferreiros, Rubén San-Segundo, Juan M. Montero and José M. Pardo
Grupo de Tecnología del Habla, Departamento de Ingeniería Electrónica, E.T.S.I. Telecomunicación, Universidad Politécnica de Madrid
Overview
• Introduction
• System architecture
• Motivation
• Databases & dictionaries
• Experimental results
• Conclusions and future work
Abstract
• In large-vocabulary speech recognition systems, classifying utterances as correctly or incorrectly recognized is of major interest
• Preliminary study on:
  – Word-level confidence estimation
  – Multiple features from the acoustical and lexical decoders
  – A neural-network-based combination scheme
Introduction (I)
• ASR systems rank the output hypotheses according to scores
• Confidence in the proposed decoding is not a direct by-product of the process
• Much work in recent years:
  – Acoustic and linguistic features
  – Single or multiple sets of parameters
  – Direct estimation, LDA, NNs, etc.
Introduction (and II)
• Traditionally:
  – Acoustic features alone show poor results (likelihoods are not comparable across utterances)
  – The literature centers on methods to convert the HMM-decoded probabilities into useful confidence measures:
    – likelihoods (normalized versions)
    – LM probabilities
    – n-best decoding lists
System Architecture
[Block diagram: hypothesis-verification architecture — a rough-analysis hypothesis stage (intermediate unit generation, lexical access) followed by a detailed-analysis verification module]
• Hypothesis-verification strategy
• We work in the hypothesis module
Detailed Architecture
[Block diagram of the hypothesis module: speech → preprocessing & VQ processes → phonetic string build-up (using HMMs, VQ codebooks, durations) → phonetic string → lexical access (using alignment costs, dictionaries, indexes) → list of candidate words]
Motivation
• Studies on variable preselection list length estimation systems
  – i.e., estimating the number of words to pass to the verification stage
• Direct correlation with confidence estimation:
  – If the proposed list length is small → high confidence
• Initial application to the hypothesis module only
Databases & Dictionaries
• Part of the VESTEL database
• Training:
  – 5820 utterances, speakers
• Testing:
  – 2536 utterances (vocabulary dependent), speakers
  – 1434 utterances (vocabulary independent), speakers
• Dictionaries: (VD & VI) words
Baseline experiment
• Directly using the features (normalized to the range 0..1)
• Baseline features:
  – Acoustic log-likelihood (and normalized versions)
  – Lexical access cost of the 1st candidate
  – Standard deviation of the lexical access costs
• Results are not very good
  – Best result obtained with the standard deviation
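The baseline scheme above can be sketched as follows. This is a minimal illustration, assuming min-max normalization to [0, 1] and a single decision threshold; the exact normalization and threshold used in the slides are not specified.

```python
# Sketch of the baseline confidence scheme (assumed details): each feature is
# min-max normalized to [0, 1] and compared against a single threshold to
# classify an utterance as correctly or incorrectly recognized.
def minmax_normalize(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def classify(confidence, threshold=0.5):
    # True -> accept (assumed correctly recognized), False -> reject
    return confidence >= threshold

# Illustrative raw scores, e.g. std. dev. of lexical access costs per utterance
scores = [3.2, 7.8, 5.1, 6.0]
decisions = [classify(c) for c in minmax_normalize(scores)]
```

Using a single feature this way needs no training, which is why it serves as the baseline against the NN combination introduced later.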
Baseline distributions
[Plots: confidence distributions for the standard deviation of the lexical access costs and for the (normalized) acoustic likelihood]
Neural Network estimator
• Used successfully in preselection list length estimation
• Able to combine parameters without effort
• 3-layer MLP
• Wide range of topology alternatives, coding schemes and features:
  – Direct parameters
  – Normalized parameters
  – Lexical access cost distribution
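A minimal sketch of such a 3-layer MLP (forward pass only) is shown below. The layer sizes, random weights and sigmoid activation are illustrative assumptions; the slides only state that a 3-layer MLP combines the input features into a confidence score.

```python
import math
import random

# Sketch of a 3-layer MLP confidence estimator: input features -> hidden
# layer -> single output in (0, 1). Weights here are random placeholders;
# a real system would train them on correctly/incorrectly recognized data.
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mlp_confidence(features, w_hidden, b_hidden, w_out, b_out):
    hidden = [sigmoid(sum(w * f for w, f in zip(ws, features)) + b)
              for ws, b in zip(w_hidden, b_hidden)]
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)

random.seed(0)
n_in, n_hid = 8, 16          # 8 input features, as in the final system; 16 hidden units assumed
w_hidden = [[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hid)]
b_hidden = [0.0] * n_hid
w_out = [random.uniform(-1, 1) for _ in range(n_hid)]
features = [random.random() for _ in range(n_in)]
conf = mlp_confidence(features, w_hidden, b_hidden, w_out, 0.0)
```

The sigmoid output conveniently maps any feature combination into a bounded confidence value, which simplifies thresholding.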
NN based experiments
• Maximum correct classification rates: 70–75% for the three datasets (reasonable, given the preselection rates achieved: 46.95%, 30.14% and 42.47%)
• Best single feature: standard deviation of the lexical access costs measured over the list of the first 10 candidates (0.1% of the dictionary size)
• Final system uses 8 parameters (lexical- and acoustical-based)
Final distributions
[Plots: confidence score distributions without and with the NN estimator]
Additional results
• EER:
  – 30% for PERFDV
  – 25% for PEIV1000 and PRNOK5TR
  – Optimum threshold is very close to the scale midpoint
• Correct rejection rates for given false rejection rates:
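The EER figures above come from sweeping the decision threshold; a hedged sketch of that computation is shown below (the scores are illustrative, not data from the slides).

```python
# Equal error rate (EER) estimation from confidence scores: sweep the
# threshold over [0, 1] and report the operating point where the false
# rejection rate (correct utterances rejected) and false acceptance rate
# (incorrect utterances accepted) are closest.
def error_rates(correct_scores, incorrect_scores, threshold):
    fr = sum(s < threshold for s in correct_scores) / len(correct_scores)
    fa = sum(s >= threshold for s in incorrect_scores) / len(incorrect_scores)
    return fr, fa

def equal_error_rate(correct_scores, incorrect_scores, steps=1000):
    best = min(
        (abs(fr - fa), (fr + fa) / 2)
        for t in (i / steps for i in range(steps + 1))
        for fr, fa in [error_rates(correct_scores, incorrect_scores, t)]
    )
    return best[1]   # rate at the crossing point
```

An optimum threshold near the scale midpoint, as reported, means the NN output is already well calibrated around 0.5.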
Conclusions
• Introduced a word-level confidence estimation system based on NNs and a combination of lexical and acoustical features
• The NN was shown to improve the results obtained by using the features directly
• The best parameter is lexical-based and consistent with the acoustical versions reported in the literature (the standard deviation plays a role similar to likelihood ratios and n-best related features)
Future work
• Extend the comparison of the NN vs. non-NN systems to the full feature set
• Extend the work to the verification module (experiments already carried out show good results)
• Extend the approach to CSR (phrase-level confidence)
ROC Curves (NN vs. non-NN)
[Plots: ROC curves comparing the NN-based and non-NN confidence estimators]
Any questions?