IMPROVED VARIABLE PRESELECTION LIST LENGTH ESTIMATION USING NNs IN A LARGE VOCABULARY TELEPHONE SPEECH RECOGNITION SYSTEM
J. Macías-Guarasa, J. Ferreiros, J. Colás, A. Gallardo-Antolín and J.M. Pardo
Grupo de Tecnología del Habla. Universidad Politécnica de Madrid. Spain

SUMMARY
Previous work on this topic (EUROSPEECH’99):
- Flexible
- Large vocabulary (up to 10000 words)
- Speaker independent
- Isolated word
- Telephone speech
- Two-stage, bottom-up strategy
- Using neural networks as a novel approach to estimate preselection list length
- Postprocessing methods of the neural network output for final estimation
In this paper:
- Parameter inventory increased
- New validation scheme
- Extensive testing with improved results

EXPERIMENTAL SETUP
VESTEL database, a realistic telephone speech corpus:
- PRNOK5TR: 5810 utterances in the training set
- PERFDV: 2502 utterances in testing set 1 (vocabulary dependent)
- PEIV1000: 1434 utterances in testing set 2 (vocabulary independent)
Vocabulary composed of 10000 words
Experimental alternatives:
- Different output distribution coding
- Different preselection list length estimation methods
- Different postprocessing and optimisation methods of this estimation

CONCLUSIONS AND FUTURE WORK
NNs are a suitable strategy for variable preselection list length estimation
Improvements (both in IER and average effort):
- Typical: 7-8% for PERFDV, 18-28% for PRNOK and PEIV1000
- Maximum: 10% for PERFDV, 32% for PRNOK and PEIV1000
Future work:
- Bigger databases :-)
- Further exploiting the parameter inventory
- Extensibility to other architectures
- What happens if single neuron output is used?
- Application to word confidence estimation tasks

SYSTEM ARCHITECTURE
[Block diagram. Labels: Speech; Hypothesis Generator (Preprocessing & VQ processes, Phonetic String Build-Up, Lexical Access); HMMs; VQ books; Durations; Alignment costs; Phonetic string; Dictionary; Indexes; Preselection list; List Length Estimator; List Length; Verification Module (Detailed Matching); Text]

MOTIVATION
Verification module is computationally expensive
- Idea: reduce preselection list length (PLL)
- Difficult, especially with low acoustic detail
- Estimate a different PLL for every word
- Methods: parametric, non-parametric,...
To think about:
- Computational demands do not depend linearly on PL
- Final savings must take into account both modules
- Only average estimations are possible
[Figure: List Length Estimator. Input parameters feed outputs listLength(0) to listLength(5), with list lengths 100, 200, 300, 400, 500 and 600 in an example with a vocabulary of 600 words]
Target: 2% inclusion error rate (IER) + 90% pruning

BASELINE SYSTEM
- SCHMM: 23+2 automatically clustered context independent phoneme-like units
- Fixed length preselection lists

NN BASED PLL
Traditional MLP with one hidden layer
- Final topology: 8 inputs - 5 hidden - 10 outputs
- Trained with backpropagation (BP): enough data is available
Input parameters:
- Direct parameters
- Derived parameters: direct parameters, normalized
- Lexical Access Statistical Parameters: calculated over the lexical access costs distribution
- Input coding: maxmin, normalization, w/o clipping, single and multiple neurons, linear/nonlinear mapping, etc.
Output coding:
- Each output corresponds to a different list length segment
- Problem: inhomogeneous number of activations per output
- Solution: train segment length distribution (Table 1 and Figure 2).

POST-PROCESSING OF THE NN ESTIMATION
The network output is postprocessed to increase robustness.
Two alternatives:
- The winner output neuron decides (WINNER).
- Linear combination of normalised activations (SUMMA), computed from:
  - Neuron length (i): upper limit of neuron i (Table 1)
  - normAct(i): normalised activation of this neuron
Additionally, a fixed (-FX) or proportional (-PP) threshold can be added (trained to achieve a certain IER).

NN DESIGN PARAMETER SELECTION
- Selected according to results in a discrimination task (1st position vs. the rest)
- Best absolute results: multiple input neurons, nonlinear mapping
- No significant differences with a single input neuron
- 8 final parameters selected

FINAL NN BASED SYSTEM
MLP:
- 8 inputs - 5 hidden - 10 outputs
- Standard input coding normalization
- Nonlinear output coding
Additional control parameters:
- Segment length assigned to last output neuron (0, 2500, 5000, 10000): G (during training), B (during test)
- Objective inclusion rate in threshold estimation process (98%, 98.5%, 99% and 99.5%): T

EVALUATION STRATEGY
How to compare the systems?
- The NN system gets a single point in the (averageEffort x inclusionRate) space
- The fixed length system generates a full inclusion rate curve
Alternatives:
- Use the single point
  - If improvement in both axes: OK
  - If not: ???
  - Sensitivity? Spuriousness?
- Extend the analysis:
  - Use the estimated thresholds to build an artificial inclusion rate histogram around the area of interest (96.5% - 99%)
  - Compare each point in this range with the fixed list length inclusion rate curve
  - Combine comparisons in both axes:
    - Inclusion rate improvement to get the same average effort
    - Average effort reduction to get the same inclusion rate

RESULTS (I)
BEST EXPERIMENT
WINNER METHOD:
- Lack of precision in discrimination
SUMMA METHOD:
- Good results!
- SUMMA plus FX improves on the fixed length system in almost all cases
- High values for last neuron length are needed (>5000)
[Figure: relative improvements for the 10 best experiments, selected by looking at training set results]

RESULTS (II)
Quantitative improvements:
- Typical: 7-8% for PERFDV, 18-28% for PRNOK and PEIV1000
- Maximum: 10% for PERFDV, 32% for PRNOK and PEIV1000
Statistical confidence:
- We do not have enough data to absolutely prove our results are statistically significant
- All bands overlap:
  - We have a small database
  - Our inclusion rates are very high
But: we show improvements over a wide range of values

ICSLP’2000 Beijing (China)
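
The two post-processing alternatives for the network output (WINNER and SUMMA), plus the optional -FX and -PP thresholds, can be illustrated with a minimal Python sketch. Several details are assumptions rather than the authors' implementation: activations are normalised by dividing by their sum, the fixed threshold is added as a constant number of words, the proportional threshold is applied as a fraction of the estimate, and the upper limits in HYPOTHETICAL_UPPER_LIMITS are invented placeholders, since Table 1 is not part of this transcript.

```python
from typing import Sequence

# Hypothetical upper limits for the 10 output neurons (Table 1 is not
# available here); only the last value, 5000, echoes a setting named above.
HYPOTHETICAL_UPPER_LIMITS = [50, 100, 200, 400, 800, 1200, 1800, 2500, 3500, 5000]


def _check(activations: Sequence[float], upper_limits: Sequence[float]) -> None:
    if not activations or len(activations) != len(upper_limits):
        raise ValueError("need one upper limit per output neuron")


def winner_length(activations: Sequence[float], upper_limits: Sequence[float]) -> float:
    """WINNER: the most active output neuron decides the list length."""
    _check(activations, upper_limits)
    best = max(range(len(activations)), key=lambda i: activations[i])
    return upper_limits[best]


def summa_length(activations: Sequence[float], upper_limits: Sequence[float]) -> float:
    """SUMMA: linear combination of normalised activations and upper limits.

    Assumption: normAct(i) = activation(i) / sum of all activations.
    """
    _check(activations, upper_limits)
    total = sum(activations)
    if total <= 0:
        raise ValueError("activations must have a positive sum")
    return sum((a / total) * u for a, u in zip(activations, upper_limits))


def apply_threshold(length: float, fixed: float = 0.0, proportional: float = 0.0,
                    vocabulary_size: int = 10000) -> int:
    """Add a fixed (-FX) and/or proportional (-PP) safety margin.

    Assumption: -FX adds a constant number of words and -PP adds a fraction
    of the estimate; the result is clamped to [1, vocabulary_size].
    """
    padded = length * (1.0 + proportional) + fixed
    return min(vocabulary_size, max(1, round(padded)))
```

With this sketch, a call such as apply_threshold(summa_length(acts, HYPOTHETICAL_UPPER_LIMITS), fixed=30) would correspond to a SUMMA-FX style estimate for one word. In the poster, the threshold value is trained to reach a target inclusion rate; here it is simply passed in by hand.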

