
1 9/20/2004 Speech Group Lunch Talk
Speaker ID Smorgasbord, or How I Spent My Summer at ICSI
Kofi A. Boakye, International Computer Science Institute

2 Outline
Keyword System
Enhancements
Monophone System
Hybrid HMM/SVM
Score Combinations
Possible Directions

3 Keyword System: A Review
Motivation
I. Text-dependent systems have high performance but limited flexibility compared to text-independent systems. Capitalize on the advantages of text-dependent systems in this text-independent domain by limiting the words of interest to a select group: backchannels (yeah, uhhuh), filled pauses (um, uh), and discourse markers (like, well, now, ...) => high frequency and high speaker-characteristic quality.
II. GMMs assume frames are independent and fail to take advantage of sequential information => use HMMs instead to model the evolution of speech in time.

4 Keyword System: A Review
Approach
Model each speaker using a collection of keyword HMMs.
Speaker models are generated via adaptation of background models trained on a development data set.
Use the standard likelihood ratio approach: compute log likelihood ratio scores using accumulated log probabilities from the keyword HMMs.
Use a speech recognizer to: 1) locate the keywords in the speech stream, 2) align speech frames to the HMM, and 3) generate acoustic likelihood scores.
(Diagram: signal -> Word Extractor -> HMM-UBM 1, HMM-UBM 2, ..., HMM-UBM N -> Combination)
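A minimal sketch of the per-keyword log likelihood ratio scoring described on this slide. The model objects, their log_likelihood method, and the frame-count normalization are illustrative assumptions, not the actual HTK-based implementation.

```python
import numpy as np

def keyword_llr(frames, target_hmm, background_hmm):
    """Score one keyword token: accumulated log probability under the
    speaker-adapted HMM minus that under the background (UBM) HMM,
    normalized by the number of frames (normalization is an assumption)."""
    ll_target = target_hmm.log_likelihood(frames)          # assumed model method
    ll_background = background_hmm.log_likelihood(frames)  # assumed model method
    return (ll_target - ll_background) / len(frames)

def utterance_score(keyword_tokens, target_models, background_models):
    """Average the per-token LLRs over all keyword tokens found by the recognizer.
    keyword_tokens: iterable of (word, frames) pairs."""
    scores = [keyword_llr(frames, target_models[w], background_models[w])
              for w, frames in keyword_tokens]
    return float(np.mean(scores)) if scores else 0.0
```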

5 Keyword System: A Review
Keywords
Discourse markers: {actually, anyway, like, see, well, now, you_know, you_see, i_think, i_mean}
Filled pauses: {um, uh}
Backchannels: {yeah, yep, okay, uhhuh, right, i_see, i_know}
Keyword Models
Simple left-to-right (whole-word) HMMs with self-loops and no skips
4 Gaussian components per state
Number of states related to the number of phones and the median number of frames for the word
HMMs trained and scored using HTK
Acoustic features: 19 mel-cepstra, the zeroth cepstrum, and their first differences
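A hedged sketch of the left-to-right, no-skip topology named above: each state may only self-loop or advance to the next state. The self-loop probability and the state count are free parameters (chosen per keyword in the system); the 4-component output GMMs are not shown.

```python
import numpy as np

def left_to_right_transitions(n_states, self_loop_prob=0.6):
    """Transition matrix for a strictly left-to-right HMM with self-loops and no skips."""
    A = np.zeros((n_states, n_states))
    for i in range(n_states - 1):
        A[i, i] = self_loop_prob            # stay in the current state
        A[i, i + 1] = 1.0 - self_loop_prob  # advance to the next state
    A[-1, -1] = 1.0                         # final state absorbs remaining frames
    return A

print(left_to_right_transitions(3))
```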

6 System Performance
Switchboard 1 dev set
Data partitioned into 6 splits
Tests use a jack-knifing procedure: test on splits 1-3 using a background model trained on splits 4-6 (and vice versa)
For development, tested primarily on split 1 with 8-side training
Result: EER = 0.83%
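For reference, a minimal sketch of how an equal error rate (EER) such as the 0.83% above can be computed from target and impostor trial scores; the synthetic score arrays at the end are made up for illustration only.

```python
import numpy as np

def compute_eer(target_scores, impostor_scores):
    """Sweep thresholds and return the operating point where the
    false-accept and false-reject rates are (approximately) equal."""
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    best_gap, eer = np.inf, None
    for t in thresholds:
        far = np.mean(impostor_scores >= t)   # false-accept rate
        frr = np.mean(target_scores < t)      # false-reject rate
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

# Toy example with synthetic scores:
rng = np.random.default_rng(0)
print(compute_eer(rng.normal(2.0, 1.0, 500), rng.normal(0.0, 1.0, 5000)))
```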

7 System Performance
Observations:
Well-performing bigrams have comparable EERs
Poorly performing bigrams suffer from a paucity of data
This suggests the possibility of a frequency threshold for performance
The single word 'yeah' yields an EER of 4.62%

8-11 Enhancements: Words
Examine the performance of other words.
Sturim et al. propose word sets for a text-constrained GMM system:
1) Full set: 50 words that occur in > 70% of conversation sides
{and, I, that, yeah, you, just, like, uh, to, think, the, have, so, know, in, but, they, really, it, well, is, not, because, my, that's, on, its, about, do, for, was, don't, one, get, all, with, oh, a, we, be, there, of, this, I'm, what, out, or, if, are, at}
2) Min set: 11 words that yield the lowest word-specific EERs
{and, I, that, yeah, you, just, like, uh, to, think, the}

12 Enhancements: Words
Performance
Full set: EER = 1.16%
My set ∩ Full set = {yeah, like, uh, well, I, think, you}

13 Enhancements: Words
Observations:
Some poorly performing words occur quite frequently
Such words may simply not be highly discriminative in nature
The single word 'and' yields an EER of 2.48%!

14 Enhancements: Words
Performance
Min set: EER = 0.99%
My set ∩ Min set = {yeah, like, uh, I, you, think}

15 Enhancements: Words
Observations:
Except for 'and', the min set words have comparable performance
Most fall into one of the three categories of filled pause, discourse marker, or backchannel, either in isolation or in conjunction

16 Enhancements: HNorm
Target model scores have different distributions for utterances depending on handset type
Perform mean and variance normalization of scores based on an estimated impostor score distribution
For split 1, use impostor utterances from splits 2 and 3 (75 females, 86 males)
(Figure: LR scores and HNorm scores for example targets tgt1 and tgt2 by handset type, electret vs. carbon-button)
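A minimal sketch of handset-dependent score normalization (HNorm) as described on this slide: per target model and per handset type, estimate the impostor score mean and standard deviation, then z-normalize test scores. The data structures and handset labels are assumptions for illustration.

```python
import numpy as np

def estimate_hnorm_params(impostor_scores_by_handset):
    """impostor_scores_by_handset: dict like {'elec': [...], 'carb': [...]}
    built from impostor trials against one target model."""
    return {h: (np.mean(s), np.std(s)) for h, s in impostor_scores_by_handset.items()}

def hnorm(raw_score, handset, params):
    """Normalize so that impostor scores for this handset are roughly N(0, 1)."""
    mu, sigma = params[handset]
    return (raw_score - mu) / sigma

# Usage with made-up numbers:
params = estimate_hnorm_params({'elec': [0.10, -0.20, 0.05], 'carb': [0.40, 0.30, 0.50]})
print(hnorm(0.6, 'carb', params))
```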

17 Enhancements: HNorm
Performance
EER = 1.65%
Performance worsened!
Possible issue in the HNorm implementation?

18 Enhancements: HNorm
Examine the effect of HNorm on particular speaker scores
Speakers of interest: those generating the most errors
3 speakers, each generating 4 errors

19-21 Enhancements: HNorm
(Figure-only slides: per-speaker score plots; no transcribed text)

22 Enhancements: HNorm
Conclusion: HNorm works... but doesn't
One possibility: look at the computed standard deviations
The distributions are widening in some cases

23 Enhancements: Deltas
Problem: system performance differs significantly by gender
Hypothesis: the higher deltas for females may be noisier
Solution: use a longer window for the delta computation to smooth them (see the sketch below)
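A hedged sketch of the standard regression-style delta computation (as used in HTK-like front ends), where the window parameter is the half-width W: delta_t = sum_{k=1..W} k (c_{t+k} - c_{t-k}) / (2 sum_{k=1..W} k^2). Widening W from 2 to 3 or 5 averages over more frames and smooths the deltas; the example data are synthetic.

```python
import numpy as np

def deltas(cepstra, window=2):
    """cepstra: (num_frames, num_coeffs) array; returns delta features of the same shape."""
    T = len(cepstra)
    padded = np.pad(cepstra, ((window, window), (0, 0)), mode='edge')  # repeat edge frames
    denom = 2 * sum(k * k for k in range(1, window + 1))
    out = np.zeros_like(cepstra, dtype=float)
    for t in range(T):
        out[t] = sum(k * (padded[t + window + k] - padded[t + window - k])
                     for k in range(1, window + 1)) / denom
    return out

# Wider windows give smoother (lower-variance) delta trajectories:
c = np.random.default_rng(0).normal(size=(100, 20))
print(np.std(deltas(c, window=2)), np.std(deltas(c, window=5)))
```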

24 Enhancements: Deltas
Extended the window size from 2 to 3
Result: EER = 0.83%
Performance is nearly indistinguishable

25 Enhancements: Deltas
Extended the window size from 2 to 3
Result: the male/female disparity remains

26 Enhancements: Deltas
Extended the window size from 3 to 5
Result: EER = 1.32%
Performance worsens!

27 Enhancements: Deltas
Extended the window size from 3 to 5
Result: the male/female disparity widens
Further investigation is necessary

28 Monophone System
Motivation
The keyword system, with its use of HMMs, appears to have good performance
However, it uses only a small amount (~10%) of the total available data
=> Get full coverage by using phone HMMs rather than word HMMs
The system represents a trade-off between token coverage and the "sharpness" of the modeling

29 Monophone System
Implementation
The system is implemented similarly to the keyword system, with phones replacing words
The background models differ in that:
1) All models have 3 states, with 128 Gaussians per state
2) Models are trained by successive splitting and Baum-Welch re-estimation, starting with a single Gaussian (see the conceptual sketch below)
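A conceptual sketch, not the actual HTK recipe, of "successive splitting": start from one Gaussian per state, perturb each mean to double the mixture size, and re-estimate with Baum-Welch (EM) between splits until reaching 128 components. The splitting offset and diagonal-covariance assumption are illustrative choices.

```python
import numpy as np

def split_mixture(means, covars, weights, epsilon=0.2):
    """Double the number of components: perturb each mean by a fraction of its
    per-dimension standard deviation and halve the mixture weights."""
    offset = epsilon * np.sqrt(covars)
    new_means = np.concatenate([means + offset, means - offset])
    new_covars = np.concatenate([covars, covars])
    new_weights = np.concatenate([weights, weights]) / 2.0
    return new_means, new_covars, new_weights

def train_state_gmm(frames, target_components=128):
    """Grow a diagonal-covariance GMM for one HMM state from a single Gaussian."""
    means = frames.mean(axis=0, keepdims=True)
    covars = frames.var(axis=0, keepdims=True)
    weights = np.ones(1)
    while len(weights) < target_components:
        means, covars, weights = split_mixture(means, covars, weights)
        # Baum-Welch / EM re-estimation of means, covars, weights would run here.
    return means, covars, weights

# Toy usage: grow a 128-component mixture from synthetic frames.
frames = np.random.default_rng(0).normal(size=(1000, 20))
print(len(train_state_gmm(frames)[2]))
```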

30 Monophone System
Performance
EER = 1.16%
Similar performance to the keyword system
Uses a lot more data!

31 Hybrid HMM/SVM System
Motivation
SVMs have been shown to yield good performance in speaker recognition systems
Features used include: frames, phone and word n-gram counts/frequencies, and phone lattices

32 Hybrid HMM/SVM System
Motivation
The keyword system looks at the "distance" between target and background models as measured by log probabilities
Look at the distance between models more explicitly => use the model parameters themselves as features

33 Hybrid HMM/SVM System
Approach
Use concatenated mixture means as features for an SVM
Positive examples are obtained by adapting the background HMM to each of the 8 training conversations
Negative examples are obtained by adapting the background HMM to each conversation in the background set
Keyword-level SVM outputs are combined to give the final score
Presently a simple linear combination with equal weighting is used, though this is clearly suboptimal (see the sketch below)
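A hedged sketch of the per-keyword SVM step: each adapted keyword HMM is turned into a fixed-length vector by concatenating its Gaussian mixture means, and a linear SVM separates the target speaker's vectors from the background vectors. The supervector helper and the data layout are assumptions; scikit-learn is assumed to be available.

```python
import numpy as np
from sklearn.svm import SVC

def supervector(hmm_means):
    """hmm_means: list of per-state mixture-mean arrays -> one concatenated vector."""
    return np.concatenate([m.ravel() for m in hmm_means])

def train_keyword_svm(target_hmm_means_list, background_hmm_means_list):
    """Positive examples: per-conversation adapted target HMMs.
    Negative examples: per-conversation adapted background HMMs."""
    X = np.vstack([supervector(m)
                   for m in target_hmm_means_list + background_hmm_means_list])
    y = np.array([1] * len(target_hmm_means_list) + [0] * len(background_hmm_means_list))
    clf = SVC(kernel='linear')
    clf.fit(X, y)
    return clf

def combine_keyword_scores(scores):
    """Equal-weight linear combination of the per-keyword SVM outputs."""
    return float(np.mean(scores))
```

At test time, each keyword SVM's decision value for the test conversation's adapted model would be fed into combine_keyword_scores to produce the final score.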

34 Hybrid HMM/SVM System
Performance
EER = 1.82%
A promising start

35 Score Combination
We have three independent systems, so let's see how they combine
Perform a post facto (read: cheating) linear combination (see the sketch below)
Each best combination yields the same EER
=> Possibly approaching the EER limit for this data set
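An illustrative sketch of the post-hoc ("cheating") linear fusion: search fusion weights directly on the test scores to find the best achievable combined EER. It reuses the compute_eer helper sketched earlier; the per-system score arrays and the grid step are assumed inputs.

```python
import itertools
import numpy as np

def best_linear_fusion(system_scores_tgt, system_scores_imp, step=0.1):
    """system_scores_tgt / system_scores_imp: lists of per-system score arrays,
    one array per system, with trials in the same order."""
    grid = np.arange(0.0, 1.0 + step, step)
    best = (None, np.inf)
    for w in itertools.product(grid, repeat=len(system_scores_tgt)):
        if not np.isclose(sum(w), 1.0):
            continue  # restrict the search to convex combinations
        fused_tgt = sum(wi * s for wi, s in zip(w, system_scores_tgt))
        fused_imp = sum(wi * s for wi, s in zip(w, system_scores_imp))
        eer = compute_eer(fused_tgt, fused_imp)  # helper from the earlier EER sketch
        if eer < best[1]:
            best = (w, eer)
    return best  # (best weights, best EER)
```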

36 Possible Directions
Develop on SWB2
Create a word "master list" for the keyword system
TNorm
Modify features to address the gender-specific performance disparity
Score combination for the hybrid system
Modified hybrid system
Tuning
Plowing

37 Fin

