9/20/2004 Speech Group Lunch Talk
Speaker ID Smorgasbord, or How I Spent My Summer at ICSI
Kofi A. Boakye, International Computer Science Institute



Outline
Keyword System Enhancements
Monophone System
Hybrid HMM/SVM
Score Combinations
Possible Directions

Keyword System: A Review
Motivation
I. Text-dependent systems have high performance but limited flexibility compared to text-independent systems. Capitalize on the advantages of text-dependent systems in this text-independent domain by limiting the words of interest to a select group: backchannels (yeah, uhhuh), filled pauses (um, uh), and discourse markers (like, well, now…) => high frequency and high speaker-characteristic quality
II. GMMs assume frames are independent and fail to take advantage of sequential information => use HMMs instead to model the evolution of speech in time

Keyword System: A Review
Approach
Model each speaker using a collection of keyword HMMs
Speaker models are generated via adaptation of background models trained on a development data set
Use the standard likelihood ratio approach: compute log-likelihood ratio scores using accumulated log probabilities from the keyword HMMs
Use a speech recognizer to:
1) Locate words in the speech stream
2) Align speech frames to the HMM
3) Generate acoustic likelihood scores
[Diagram: signal -> word extractor -> HMM-UBM 1…N -> score combination]
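The likelihood-ratio scoring step can be sketched as follows. This is a minimal illustration with hypothetical function and argument names; the actual system scores HTK alignments, and the frame-count normalization is an assumption for comparability across sides.

```python
def log_likelihood_ratio(target_logprobs, ubm_logprobs, n_frames):
    """Log-likelihood ratio score for one test conversation side.

    target_logprobs / ubm_logprobs: accumulated acoustic log-probabilities,
    one entry per detected keyword occurrence, from the speaker-adapted
    HMMs and the background (UBM) HMMs respectively.
    n_frames: total number of aligned frames, used to normalize the score
    so that sides of different lengths are comparable.
    """
    return (sum(target_logprobs) - sum(ubm_logprobs)) / n_frames
```

A positive score indicates the target-adapted models explain the keywords better than the background models do.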

Keyword System: A Review
Keywords
Discourse markers: {actually, anyway, like, see, well, now, you_know, you_see, i_think, i_mean}
Filled pauses: {um, uh}
Backchannels: {yeah, yep, okay, uhhuh, right, i_see, i_know}
Keyword Models
Simple left-to-right (whole-word) HMMs with self-loops and no skips
4 Gaussian components per state
Number of states related to the number of phones and the median number of frames for the word
HMMs trained and scored using HTK
Acoustic features: 19 mel-cepstra, the zeroth cepstrum, and their first differences
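The exact rule relating state count to phone count and median duration is not spelled out in the slide, so the formula below is an assumption, not the talk's actual rule; it just illustrates the kind of heuristic the slide describes.

```python
def num_states(n_phones, median_frames, states_per_phone=3):
    """Heuristic left-to-right state count for a whole-word HMM.

    Roughly a few states per phone, but never more states than the median
    number of frames observed for the word (each state must be visitable
    at least once in a left-to-right pass with no skips).
    """
    return max(1, min(states_per_phone * n_phones, median_frames))
```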

System Performance
Switchboard 1 Dev Set
Data partitioned into 6 splits
Tests use a jack-knifing procedure: test on splits 1–3 using a background model trained on splits 4–6 (and vice versa)
For development, tested primarily on split 1 with 8-side training
Result: EER = 0.83%
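The EER (equal error rate) figures quoted throughout are the operating point where the false-reject and false-accept rates cross. A minimal sketch of computing it from target and impostor trial scores, using a plain threshold sweep with no interpolation:

```python
def equal_error_rate(target_scores, impostor_scores):
    """Sweep a decision threshold over all observed scores and return the
    point where false-reject and false-accept rates are closest."""
    best_fr, best_fa, best_gap = 1.0, 0.0, float("inf")
    for thr in sorted(set(target_scores + impostor_scores)):
        # False rejects: targets scoring below threshold.
        fr = sum(s < thr for s in target_scores) / len(target_scores)
        # False accepts: impostors scoring at or above threshold.
        fa = sum(s >= thr for s in impostor_scores) / len(impostor_scores)
        if abs(fr - fa) < best_gap:
            best_fr, best_fa, best_gap = fr, fa, abs(fr - fa)
    return (best_fr + best_fa) / 2
```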

System Performance
Observations:
Well-performing bigrams have comparable EERs
Poorly performing bigrams suffer from a paucity of data
Suggests the possibility of a frequency threshold for performance
The single word 'yeah' yields an EER of 4.62%

Enhancements: Words
Examine the performance of other words
Sturim et al. propose word sets for a text-constrained GMM system:
1) Full set: 50 words that occur in > 70% of conversation sides
{and, I, that, yeah, you, just, like, uh, to, think, the, have, so, know, in, but, they, really, it, well, is, not, because, my, that's, on, its, about, do, for, was, don't, one, get, all, with, oh, a, we, be, there, of, this, I'm, what, out, or, if, are, at}
2) Min set: 11 words that yield the lowest word-specific EERs
{and, I, that, yeah, you, just, like, uh, to, think, the}

Enhancements: Words
Performance
Full set: EER = 1.16%
My set ∩ Full set = {yeah, like, uh, well, I, think, you}

Enhancements: Words
Observations:
Some poorly performing words occur quite frequently
Such words may simply not be highly discriminative in nature
The single word 'and' yields an EER of 2.48%!

Enhancements: Words
Performance
Min set: EER = 0.99%
My set ∩ Min set = {yeah, like, uh, I, you, think}

Enhancements: Words
Observations:
Except for 'and', min set words have comparable performance
Most fall into one of the three categories of filled pause, discourse marker, or backchannel, either in isolation or in conjunction

Enhancements: HNorm
Target model scores have different distributions depending on handset type
Perform mean and variance normalization of scores based on the estimated impostor score distribution
For split 1, use impostor utterances from splits 2 and 3 (75 females, 86 males)
[Diagram: LR scores for targets tgt1, tgt2 by handset type (elec, carb) -> HNorm scores]
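The handset-dependent normalization can be sketched like this. Names are illustrative; 'elec' and 'carb' stand for the electret and carbon-button handset types shown in the slide's diagram.

```python
import statistics

def hnorm(score, handset, impostor_scores_by_handset):
    """HNorm: z-normalize a trial score using the mean and standard
    deviation of impostor scores estimated for the same handset type."""
    imp = impostor_scores_by_handset[handset]
    mu = statistics.mean(imp)
    sigma = statistics.pstdev(imp)  # population std. dev. of impostor scores
    return (score - mu) / sigma
```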

Enhancements: HNorm
Performance
EER = 1.65%
Performance worsened! Possible issue in the HNorm implementation?

Enhancements: HNorm
Examine the effect of HNorm on particular speakers' scores
Speakers of interest: those generating the most errors
3 speakers each generate 4 errors

Enhancements: HNorm
[Three figure slides: per-speaker score distributions before and after HNorm, not captured in the transcript]

Enhancements: HNorm
Conclusion: HNorm works…but doesn't
One possibility: look at the computed standard deviations…the distributions are widening in some cases

Enhancements: Deltas
Problem: system performance differs significantly by gender
Hypothesis: the higher deltas for females may be noisier
Solution: use a longer window for delta computation to smooth them
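The window-size experiments below can be illustrated with the standard regression formula for delta coefficients. This is the common HTK-style definition with edge clamping; that the talk used exactly this smoothing is an assumption.

```python
def deltas(frames, window=2):
    """Regression-based delta coefficients over a window of half-width
    `window`; a larger window averages over more frames and smooths more.

    frames: per-frame feature values (one dimension, for clarity).
    """
    denom = 2 * sum(t * t for t in range(1, window + 1))
    n = len(frames)
    out = []
    for i in range(n):
        num = 0.0
        for t in range(1, window + 1):
            # Clamp indices at the utterance edges.
            fwd = frames[min(i + t, n - 1)]
            bwd = frames[max(i - t, 0)]
            num += t * (fwd - bwd)
        out.append(num / denom)
    return out
```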

Enhancements: Deltas
Extended window size from 2 -> 3
Result: EER = 0.83%; performance nearly indistinguishable

Enhancements: Deltas
Extended window size from 2 -> 3
Result: the male/female disparity remains

Enhancements: Deltas
Extended window size from 3 -> 5
Result: EER = 1.32%; performance worsens!

Enhancements: Deltas
Extended window size from 3 -> 5
Result: the male/female disparity widens; further investigation is necessary

Monophone System
Motivation
The keyword system, with its use of HMMs, appears to have good performance
However, we are only using a small amount (~10%) of the total data available
=> Get full coverage by using phone HMMs rather than word HMMs
The system represents a trade-off between token coverage and "sharpness" of modeling

Monophone System
Implementation
System implemented similarly to the keyword system, with phones replacing words
The background models differ in that:
1) All models have 3 states, with 128 Gaussians per state
2) Models are trained by successive splitting and Baum-Welch re-estimation, starting with a single Gaussian
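The successive-splitting scheme in item 2 doubles the Gaussian count at each step (1 -> 2 -> 4 -> … -> 128). A one-dimensional sketch of the split step; the perturbation size is illustrative, and Baum-Welch re-estimation would follow each split.

```python
def split_mixture(means, perturb=0.2):
    """Mixture splitting: replace each Gaussian mean with two copies
    perturbed in opposite directions, doubling the mixture size."""
    out = []
    for m in means:
        out.append([x - perturb for x in m])
        out.append([x + perturb for x in m])
    return out
```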

Monophone System
Performance
EER = 1.16%
Similar performance to the keyword system, but uses a lot more data!

Hybrid HMM/SVM System
Motivation
SVMs have been shown to yield good performance in speaker recognition systems
Features used include:
Frames
Phone and word n-gram counts/frequencies
Phone lattices

Hybrid HMM/SVM System
Motivation
The keyword system looks at the "distance" between target and background models as measured by log-probabilities
Look at the distance between models more explicitly
=> Use model parameters as features

Hybrid HMM/SVM System
Approach
Use concatenated mixture means as features for the SVM
Positive examples obtained by adapting the background HMM to each of the 8 training conversations
Negative examples obtained by adapting the background HMM to each conversation in the background set
Keyword-level SVM outputs are combined to give the final score
- Presently a simple linear combination with equal weighting is used (though clearly suboptimal)
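The "concatenated mixture means as features" step amounts to building one fixed-length supervector per adapted model. A minimal sketch with a hypothetical layout for the per-state means:

```python
def supervector(hmm_means):
    """Concatenate all Gaussian mixture means of an adapted HMM into one
    fixed-length feature vector for the SVM.

    hmm_means: {state_index: [mean_vector, ...]} -- an illustrative
    structure; states are visited in a fixed order so that vectors from
    different models are comparable dimension by dimension.
    """
    vec = []
    for state in sorted(hmm_means):
        for mean in hmm_means[state]:
            vec.extend(mean)
    return vec
```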

Hybrid HMM/SVM System
Performance
EER = 1.82%
A promising start

Score Combination
We have three independent systems, so let's see how they combine…
Perform a post facto (read: cheating) linear combination
Each best combination yields the same EER
=> Possibly approaching the EER limit for the data set
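The linear score combination can be sketched as below, with equal weights by default; the post facto ("cheating") variant would simply sweep the weights on the test scores themselves and keep the best-performing triple.

```python
def fuse(scores_by_system, weights=None):
    """Linear fusion of aligned per-trial scores from several systems.

    scores_by_system: list of per-system score lists, aligned by trial.
    weights: per-system weights; defaults to equal weighting.
    """
    n_sys = len(scores_by_system)
    if weights is None:
        weights = [1.0 / n_sys] * n_sys
    n_trials = len(scores_by_system[0])
    return [sum(w * s[i] for w, s in zip(weights, scores_by_system))
            for i in range(n_trials)]
```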

Possible Directions
Develop on SWB2
Create a word "master list" for the keyword system
TNorm
Modify features to address the gender-specific performance disparity
Score combination for the hybrid system
Modified hybrid system
Tuning
Plowing

Fin