
1 Phoneme and Sub-phoneme T-Normalization for Text-Dependent Speaker Recognition
Doroteo T. Toledano 1, Cristina Esteve-Elizalde 1, Joaquin Gonzalez-Rodriguez 1, Ruben Fernandez-Pozo 2 and Luis Hernandez Gomez 2
1 ATVS, Universidad Autonoma de Madrid, Spain
2 GAPS-SSR, Universidad Politécnica de Madrid, Spain
IEEE Odyssey 2008, Cape Town, South Africa, Jan 08

2 Outline
1. Introduction
2. Text-Dependent SR Based on Phonetic HMMs
   2.1. Enrollment and Verification Phases
   2.2. Experimental Framework (YOHO)
   2.3. Results with Raw Scores
3. T-Norm in Text-Dependent SR
   3.1. Plain (Utterance-Level) T-Norm
   3.2. Phoneme-Level T-Norm
   3.3. Subphoneme-Level T-Norm
4. Results Summary
5. Discussion
6. Conclusions

3 1. Introduction
Text-Independent Speaker Recognition
- Unknown linguistic content
- Research driven by the yearly NIST SRE evaluations
Text-Dependent Speaker Recognition
- Linguistic content of the test utterance is known by the system
  - Password set by the user: security based on password + speaker recognition
  - Text prompted by the system: security based on speaker recognition only
- No competitive evaluations by NIST
- YOHO is one of the most widely used databases for experimentation
This work is on text-prompted systems, with YOHO as the test database

4 Text-Dependent SR Based on Phonetic HMMs: Enrollment Phase
Speech parameterization (common to enrollment and test; sketched below)
- 25 ms Hamming windows with a 10 ms window shift
- 13 MFCCs + deltas + double deltas → 39 coefficients
Speaker-independent, context-independent phonetic HMMs used as base models
- 39 phones trained on TIMIT, 3-state left-to-right, 1 to 80 Gaussians/state
Speaker-dependent phonetic HMMs built from transcribed enrollment audio
[Block diagram: parameterized enrollment utterances and their phonetic transcriptions (with optional silence) are combined with the speaker-independent phonetic HMMs to build speaker-independent models of the utterances, λI; Baum-Welch retraining or MLLR adaptation then yields the speaker-dependent phonetic HMMs (the speaker model).]
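A minimal sketch of this front end in Python, assuming the librosa library (the function name, the 16 kHz sampling rate, and the toolkit itself are illustrative assumptions; the slides do not name an implementation):

```python
import numpy as np
import librosa

def parameterize(wav_path, sr=16000):
    """25 ms Hamming windows, 10 ms shift; 13 MFCCs + deltas + double deltas = 39 dims."""
    y, sr = librosa.load(wav_path, sr=sr)
    win = int(0.025 * sr)                      # 25 ms analysis window
    hop = int(0.010 * sr)                      # 10 ms window shift
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=win,
                                win_length=win, hop_length=hop, window="hamming")
    d1 = librosa.feature.delta(mfcc)           # first derivatives (deltas)
    d2 = librosa.feature.delta(mfcc, order=2)  # second derivatives (double deltas)
    return np.vstack([mfcc, d1, d2]).T         # shape: (n_frames, 39)
```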

5 Text-Dependent SR Based on Phonetic HMMs: Verification Phase
Computation of acoustic scores for the speaker-dependent and speaker-independent models
Acoustic scores → verification score (removing silences)
[Block diagram: the parameterized audio to verify and its phonetic transcription (with optional silence) are Viterbi-aligned against both the speaker-independent model of the utterance, λI, and the speaker-dependent model of the utterance, λD, producing speaker-independent and speaker-dependent acoustic scores.]
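The slide does not spell out the score computation; a standard choice for HMM-based text-dependent verification, and a plausible reading of this pipeline, is the duration-normalized log-likelihood ratio over the non-silence frames:

$$ s(O) = \frac{1}{T}\left[\log p(O \mid \lambda_D) - \log p(O \mid \lambda_I)\right] $$

where $O$ denotes the $T$ non-silence frames of the test utterance and $\lambda_D$, $\lambda_I$ are the speaker-dependent and speaker-independent models of the utterance.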

6 Experimental Framework (YOHO)
YOHO database
- 138 speakers (106 male, 32 female)
- Enrollment data: 4 sessions x 24 utterances = 96 utterances
- Test data: 10 sessions x 4 utterances = 40 utterances
- Utterance = 3 digit pairs (e.g., "twelve thirty-four fifty-six")
Usage of YOHO in this work (trial construction sketched below)
- Enrollment: 3 different conditions
  - 6 utterances from the 1st enrollment session
  - 24 utterances from the 1st enrollment session
  - 96 utterances from the 4 enrollment sessions
- Test: always with a single utterance
  - Target trials: 40 test utterances per speaker (138 x 40 = 5,520)
  - Non-target trials: 137 test utterances per speaker (138 x 137 = 18,906), one random utterance from the test data of each of the other users
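A sketch of how these trial lists could be assembled (the speaker IDs and utterance indexing are hypothetical; only the counts are fixed by the slides):

```python
import random

random.seed(0)
speakers = [f"spk{i:03d}" for i in range(138)]  # hypothetical speaker IDs

# Target trials: every speaker scored against their own 40 test utterances.
target = [(s, s, u) for s in speakers for u in range(40)]    # 138 * 40 = 5,520

# Non-target trials: each claimed identity is tested against one random
# test utterance from every other speaker.
nontarget = [(claimed, other, random.randrange(40))
             for claimed in speakers
             for other in speakers if other != claimed]      # 138 * 137 = 18,906

print(len(target), len(nontarget))  # 5520 18906
```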

7 Results with Raw Scores
DET curves and %EERs with raw scores comparing
- Baum-Welch re-estimation vs. MLLR adaptation, each with the optimal configuration of its tuning parameters (Gaussians/state, regression classes, re-estimation passes)
- Different amounts of enrollment material: 6, 24 or 96 utterances
MLLR adaptation provides better performance in all conditions
Our baseline for this work is the curve for MLLR adaptation with 6 utterances

8 3. T-Norm in Text-Dependent SR
T-Norm in Text-Independent SR
- Regularly applied, with excellent results
- Normalizes each score w.r.t. the distribution of non-target scores obtained for the same test segment against a cohort of impostor speaker models
T-Norm in Text-Dependent SR
- Rarely applied, and then with only modest improvement
- A few notable exceptions are [M. Hébert and D. Boies, ICASSP'05], where T-Norm is the main focus, and [R. D. Zylca et al., Odyssey'04], where T-Norm is applied but is not the main focus
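For reference, the standard T-Norm definition (well established in the literature, not specific to this work): given a raw verification score $s$ for a test segment and the scores $s_1, \dots, s_M$ of the same segment against the $M$ cohort impostor models,

$$ \tilde{s} = \frac{s - \mu_{\mathrm{imp}}}{\sigma_{\mathrm{imp}}}, \qquad \mu_{\mathrm{imp}} = \frac{1}{M}\sum_{m=1}^{M} s_m, \qquad \sigma_{\mathrm{imp}}^2 = \frac{1}{M}\sum_{m=1}^{M} \left(s_m - \mu_{\mathrm{imp}}\right)^2 $$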

9 Plain (Utterance-Level) T-Norm: Procedure
The procedure in text-dependent SR is identical to T-Norm in text-independent SR
- We call this Plain T-Norm or Utterance-level T-Norm to distinguish it from the other methods we propose
1. Compute verification scores for the same test utterance and a cohort of impostor speaker models:
   - Reserve a cohort of impostor speakers {1, …, M}
   - Obtain MLLR speaker-adapted phonetic HMMs for those speakers
   - Compute verification scores for the same test utterance and those speaker models
2. Normalize the verification score using the mean and standard deviation of the impostor scores obtained (sketched below)
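A minimal sketch of this procedure (score_utterance is a hypothetical stand-in for the Viterbi scoring of the earlier slides):

```python
import numpy as np

def plain_tnorm(raw_score, cohort_scores):
    """Standardize an utterance-level score against the cohort impostor scores."""
    cohort_scores = np.asarray(cohort_scores, dtype=float)
    return (raw_score - cohort_scores.mean()) / cohort_scores.std()

# cohort_models: MLLR-adapted phonetic HMM sets for the M reserved impostors.
# cohort = [score_utterance(m, test_utt) for m in cohort_models]
# normalized = plain_tnorm(score_utterance(target_model, test_utt), cohort)
```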

10 Plain (Utterance-Level) T-Norm: Results (i)
Plain (Utterance-level) T-Norm vs. No T-Norm on YOHO
- Enrollment with only 6 utterances from 1 session, test with 1 utterance
- 10 male and 10 female speakers reserved as cohort and not included in the results
- Cohort = 20 speaker models
- MLLR adaptation
Utterance-level T-Norm (Plain T-Norm) produces slightly worse results than doing nothing
Perhaps due to the very small cohort?

11 Plain (Utterance-Level) T-Norm: Results (ii)
Perhaps due to the very small cohort? New experiment using a bigger cohort of models
- But not of speakers, due to the very limited number of speakers in YOHO (32 female)
- 4 speaker models per speaker in the cohort, each trained with the first 6 utterances of one of the sessions
Slightly better results, but the improvement achieved by T-Norm is still very small
Probably not only due to the small cohort

12 Plain (Utterance-Level) T-Norm: Results (iii)
Other causes for the limited performance of T-Norm?
- M. Hébert and D. Boies (ICASSP'05) analyzed the effect of lexical mismatch and proposed it as a cause for the poor performance
  - They proposed a smoothing mechanism that weighted the effect of T-Norm according to how well the cohort models the utterance to verify
Could we reduce the effect of the lexical mismatch in other ways?
- Reduce the lexical content of the test speech used to produce a speaker verification score to a single phoneme or sub-phoneme
- Then T-normalize these scores and combine them
This is the basic idea of Phoneme- and Sub-phoneme-level T-Norm

13 Phoneme-Level T-Norm: Procedure
Compute phoneme-based verification scores for the same test utterance, the speaker model and a cohort of impostor models
- Compute a verification score for each non-silence phoneme i, considering only the acoustic scores associated with phoneme i in the utterance
- Reserve a cohort of impostor speakers {1, …, M}
- Obtain MLLR speaker-adapted phonetic HMMs for those speakers
- For each non-silence phoneme i, compute verification scores for the same test utterance and those speaker models
Normalize each phoneme-based verification score using the mean and standard deviation of the corresponding impostor scores
Combine the normalized phoneme-based verification scores into an utterance verification score, taking phoneme lengths into account (see the sketch below)
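A sketch of the combination step under stated assumptions: the per-phoneme scores and frame counts come from the Viterbi alignment, and the length-weighted average is one plausible reading of "taking phoneme lengths into account" (the slides do not give the exact combination formula):

```python
import numpy as np

def phoneme_tnorm(target_scores, phone_frames, cohort_scores):
    """Phoneme-level T-Norm, combined into an utterance-level score.

    target_scores: dict phone -> raw verification score for the claimed speaker
    phone_frames:  dict phone -> number of frames aligned to that phone
    cohort_scores: dict phone -> array of M scores, one per cohort impostor model
    """
    normed, weights = [], []
    for phone, s in target_scores.items():
        imp = np.asarray(cohort_scores[phone], dtype=float)
        normed.append((s - imp.mean()) / (imp.std() + 1e-12))  # per-phoneme T-Norm
        weights.append(phone_frames[phone])                    # weight by duration
    # Length-weighted combination (assumed form); subphoneme/state-level T-Norm
    # is identical, with HMM states rather than phonemes as the keys.
    return float(np.average(normed, weights=weights))
```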

14 Phoneme-Level T-Norm: Results
Phoneme-level T-Norm vs. No T-Norm on YOHO
- Enrollment with only 6 utterances from 1 session, test with 1 utterance
- 10 male and 10 female speakers reserved as cohort and not included in the results
- Cohort = 20 speaker models
- MLLR adaptation
Phoneme-level T-Norm is clearly better than No T-Norm
Also clearly better than Utterance-level T-Norm
Can we do better by using even smaller units?

15 Subphoneme-Level T-Norm: Procedure & Results
Uses exactly the same idea as phoneme-level T-Norm, but with HMM states instead of phonemes
State-level T-Norm vs. No T-Norm on YOHO
- Enrollment with only 6 utterances from 1 session, test with 1 utterance
- 10 male and 10 female speakers reserved as cohort and not included in the results
- Cohort = 20 speaker models
- MLLR adaptation
Results are even better than with Phoneme-level T-Norm

16 4. Summary of Results
Utterance-level T-Norm performs worse than doing nothing
But the newly proposed Phoneme-level and State-level T-Norm provide relative improvements in EER close to 20% and over 25%, respectively

17 5. Discussion (i)
Phoneme- and State-level T-Norm work clearly better than Utterance-level T-Norm in text-dependent SR
- Utterance-level (or Plain) T-Norm suffers from lexical mismatch
- This mismatch is not totally avoided by Phoneme- or State-level T-Norm: it is still possible to have substantial differences in lexical content
- However, each phoneme/sub-phoneme in the test utterance now produces an independent speaker verification score, for which the mismatch is limited to the mismatch in a single phoneme/sub-phoneme of the training material
- This may reduce the influence of the lexical mismatch on the phoneme/sub-phoneme verification scores, making T-Norm less sensitive to this problem

18 5. Discussion (ii)
Another possible reason for the good performance of Phoneme- and State-level T-Norm is based on ideas from a recent paper [Subramanya et al., ICASSP'07]
- Subramanya computes speaker verification scores for each phoneme, treats those scores as the outputs of independent weak speaker recognizers, and combines them using boosting to yield improved performance
- This is conceptually similar to our approach: we combine phoneme or sub-phoneme verification scores, weighting them according to their means and variances on a cohort
- Different phonemes/sub-phonemes have different discriminating powers; T-Norm at the phoneme or sub-phoneme level could weight them appropriately

19 6. Conclusions
Applying T-Norm in text-dependent SR the way it is applied in text-independent SR does not work well
- This is Plain or Utterance-level T-Norm
The newly proposed T-Norm schemes working at sub-utterance levels work much better
- Phoneme-level T-Norm
- Subphoneme-level T-Norm
Possible reasons
- Reduction of the effect of lexical mismatch
- Better weighting/fusion of the information provided by the different phonemes or sub-phonemes

20 Thanks! IEEE Odyssey 2008, Cape Town, South Africa, Jan 08

21 Additional Slides IEEE Odyssey 2008, Cape Town, South Africa, Jan 08

22 Baum-Welch Re-estimation (YOHO)
- Phonetic HMMs with 1 to 5 Gaussians/state
- Baum-Welch re-estimation with 1 or 4 iterations
- 6 enrollment utterances (1 session)

23 MLLR Adaptation Results (YOHO)
- Phonetic HMMs with 5, 10, 20, 40 and 80 Gaussians/state
- MLLR adaptation with 1, 2, 4, 8, 16 or 32 regression classes
- 6 enrollment utterances (1 session)