Acoustical and Lexical Based Confidence Measures for a Very Large Vocabulary Telephone Speech Hypothesis-Verification System Javier Macías-Guarasa, Javier.

Presentation transcript:

Acoustical and Lexical Based Confidence Measures for a Very Large Vocabulary Telephone Speech Hypothesis-Verification System Javier Macías-Guarasa, Javier Ferreiros, Rubén San-Segundo, Juan M. Montero and José M. Pardo Grupo de Tecnología del Habla Departamento de Ingeniería Electrónica E.T.S.I. Telecomunicación Universidad Politécnica de Madrid

Overview
• Introduction
• System architecture
• Motivation
• Databases & Dictionaries
• Experimental results
• Conclusions and future work

Abstract
• In LVSRS, classifying utterances as correctly or incorrectly recognized is of major interest
• Preliminary study on:
  – Word-level confidence estimation
  – Multiple features: acoustical and lexical decoders
  – Neural network based scheme

Introduction (I)
• ASR systems rank the output hypotheses according to scores
• Confidence in the proposed decoding is not a direct byproduct of the process
• A lot of work in recent years:
  – Acoustic and linguistic features
  – Single or multiple sets of parameters
  – Direct estimation, LDA, NNs, etc.

Introduction (and II)
• Traditionally:
  – Acoustic features alone show poor results (likelihoods are not comparable across utterances)
  – The literature centers on methods to convert the HMM decoded probabilities into useful confidence measures:
    · likelihoods (normalized versions)
    · LM probabilities
    · n-best decoding lists

System Architecture
[Diagram labels: Intermediate Unit Generation; Lexical Access; Verification Module; Hypothesis; Verification; Rough Analysis; Detailed]
• Hypothesis-verification strategy
• We work in the Hypothesis Module

Detailed Architecture
[Diagram labels: Speech; Preprocessing & VQ processes; Phonetic String Build-Up; Lexical Access; Hypothesis; HMMs; VQ books; Durats.; Align. costs; Dicts; Indexes; Phonetic string; List of Candidate Words]

Motivation
• Studies on variable preselection list length estimation systems
  – Number of words to pass to the verification stage
• Direct correlation with confidence estimation:
  – If the proposed list length is small ⇒ high confidence
• Initial application to the hypothesis module only

Databases & Dictionaries
• Part of the VESTEL database
• Training
  – 5820 utterances, […] speakers
• Testing
  – 2536 utterances (vocabulary dependent), […] speakers
  – 1434 utterances (vocabulary independent), […] speakers
• Dictionaries: […] words (VD&I)

Baseline experiment
• Directly using the features (normalized to the range 0..1)
• Baseline features:
  – Acoustic log-likelihood (and normalized versions)
  – Lexical access cost for the 1st candidate
  – Standard deviation of lexical access costs
• Not very good results
  – Best one with the standard deviation
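The baseline above can be illustrated with a minimal sketch: compute the standard deviation of the lexical access costs per utterance and rescale it to 0..1 over a set of utterances. The function names, the cost values and the choice of min-max scaling are assumptions made for illustration; the slides do not say how the normalization was done, nor in which direction the feature relates to correctness.

```python
from statistics import pstdev

def min_max_normalize(values):
    """Scale a list of raw feature values to the range 0..1 (assumed scheme)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def lexical_cost_std(costs, n_best=10):
    """Standard deviation of the lexical access costs of the first n_best candidates."""
    return pstdev(costs[:n_best])

# Hypothetical lexical access costs, best candidate first, one list per utterance.
utterance_costs = [
    [1.0, 5.0, 6.0, 7.0],   # first candidate clearly separated from the rest
    [3.0, 3.1, 3.2, 3.3],   # candidates nearly tied
    [2.0, 4.0, 4.5, 5.0],
]

raw = [lexical_cost_std(costs) for costs in utterance_costs]
normalized = min_max_normalize(raw)
```

A normalized value like this would then be compared against a threshold to accept or reject the top candidate, which is what "directly using the features" amounts to in this sketch.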

Baseline distributions
[Figure panels: standard deviation of lexical access costs; acoustic likelihood (normalized)]

Baseline distributions

Neural Network estimator
• Used successfully in preselection list length estimation
• Able to combine parameters without effort
• 3-layer MLP
• Wide range of topology alternatives, coding schemes and features:
  – Direct parameters
  – Normalized
  – Lexical access costs distribution
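As a rough illustration of the estimator described above, the following sketch runs the forward pass of a 3-layer MLP (input, one hidden layer, one output unit) that maps a feature vector to a confidence score. The hidden layer size, the sigmoid activations, the random untrained weights and the example feature values are all assumptions; the slides only state that a 3-layer MLP was used and that several topologies were tried.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def init_layer(n_in, n_out, rng):
    """One row per unit: n_in weights followed by a bias (small random values)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_out)]

def layer_forward(weights, inputs):
    """Weighted sum plus bias for each unit, passed through a sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(row[:-1], inputs)) + row[-1])
        for row in weights
    ]

def mlp_confidence(features, hidden_w, output_w):
    """Forward pass: features -> hidden layer -> single confidence output in (0, 1)."""
    return layer_forward(output_w, layer_forward(hidden_w, features))[0]

rng = random.Random(0)            # fixed seed so the sketch is reproducible
N_FEATURES, N_HIDDEN = 8, 5       # hidden size is a hypothetical choice
hidden_w = init_layer(N_FEATURES, N_HIDDEN, rng)
output_w = init_layer(N_HIDDEN, 1, rng)

# Hypothetical feature vector, already normalized to 0..1.
features = [0.2, 0.9, 0.4, 0.1, 0.7, 0.5, 0.3, 0.8]
score = mlp_confidence(features, hidden_w, output_w)
accepted = score >= 0.5           # threshold at the scale midpoint
```

The weights here are untrained, so the score carries no meaning; in practice they would be fitted on utterances labeled as correctly or incorrectly recognized.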

NN based experiments
• Maximum correct classification rates: 70-75% for the three datasets (reasonable, taking into account the preselection rates achieved: 46.95%, 30.14% and 42.47%)
• Best single feature: standard deviation of the lexical access cost measured over the list of the first 10 candidates (0.1% of the dictionary size)
• Final system uses 8 parameters (lexical and acoustical-based)

Final distributions
[Figure panels: not using NN; using NN]

Final distributions
[Figure panels: using NN; not using NN]

Additional results
• EER:
  – 30% for PERFDV
  – 25% for PEIV1000 and PRNOK5TR
  – Optimum threshold very close to the scale midpoint
• Correct rejection rates for given false rejection:
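The equal error rate quoted above is the operating point where false rejections and false acceptances occur at the same rate. A minimal sketch of how it can be located by sweeping a threshold over confidence scores follows; the scores, function names and the 0.01 threshold step are illustrative assumptions, not values from the experiments.

```python
def error_rates(correct_scores, incorrect_scores, threshold):
    """False rejection: correct word scored below threshold.
    False acceptance: incorrect word scored at or above threshold."""
    fr = sum(s < threshold for s in correct_scores) / len(correct_scores)
    fa = sum(s >= threshold for s in incorrect_scores) / len(incorrect_scores)
    return fr, fa

def equal_error_rate(correct_scores, incorrect_scores, steps=100):
    """Sweep thresholds over 0..1 and keep the first one where FR and FA are closest."""
    best = None
    for i in range(steps + 1):
        t = i / steps
        fr, fa = error_rates(correct_scores, incorrect_scores, t)
        gap = abs(fr - fa)
        if best is None or gap < best[0]:
            best = (gap, t, (fr + fa) / 2)
    return best[1], best[2]

# Hypothetical confidence scores for correctly and incorrectly recognized words.
correct = [0.9, 0.8, 0.7, 0.6, 0.4]
incorrect = [0.1, 0.2, 0.3, 0.5, 0.65]

threshold, eer = equal_error_rate(correct, incorrect)
fr, fa = error_rates(correct, incorrect, threshold)
correct_rejection = 1.0 - fa      # share of incorrect words that are rejected
```

With `error_rates`, the correct rejection rate for a given false rejection rate can be read off by choosing the threshold that yields the desired `fr` and taking `1 - fa` at that point.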

Conclusions
• Introduced a word-level confidence estimation system based on NNs and a combination of lexical and acoustical features
• The NN improved on the results obtained using the features directly
• The best parameter is lexical-based and consistent with the acoustical versions reported in the literature (standard deviation is similar to likelihood ratios and n-best related features)

Future work
• Extend the comparison of the NN vs. non-NN system to the whole feature set
• Extend the work to the verification module (experiments already carried out show good results)
• Extend the approach to CRS (phrase-level confidence)

ROC Curves (NN vs. non NN)

Any questions?