1 Bioinformatic Voice Applications: Speaker Recognition and Verification Andrew Rosenberg Biometric Seminar Day August 23, 2010
2 Outline Biometrics and Voice What can the Voice tell us about a Speaker Representing Speech Modeling Speakers Gaussian Mixture Model Universal Background Model
3 Biometrics and Voice Applications of Voice Biometrics Speaker Verification Are you who you say you are? Speaker Recognition Who are you? Diagnoses of Medical Pathologies and other Speaker States The voice can tell us other things about a speaker
4 Advantages of Voice Biometrics Minimally Intrusive Cheap Mechanisms to Collect Speech Data Established, low-risk, legal eavesdropping scenarios
5 Biometrics and Voice How does speech carry biometric information? How is speech produced? Articulators Vocal Tract First Language and Regional Influences Speech Pathologies Individual Differences
6 Production of Speech
7 It's ten below outside From the Queen's University Speech Production and Perception Laboratory
8 Production of Speech Why did Ken set the soggy net on top of his deck? From the Queen's University Speech Production and Perception Laboratory
9 Influences of Native Tongue Negative Language Transfer When speaking in a non-native tongue, speakers will use some characteristics from their native tongue. Very common in pronunciation /r/ vs. /l/ in Japanese and Chinese Cognates and false-cognates “elektrisch” = electric “embarazada” ≠ embarrassed Limited evidence of language transfer regarding grammar and word choice.
10 Assessment and Monitoring of medical problems How well is a patient coping with cancer treatment? Zellerman (2002) Is a patient clinically depressed? Alpert (2001) Moore (2003) Mundt (2007) Diagnosis of Schizophrenia through word choice Elvelag (2007 & 2009) Autism Spectrum Disorders demonstrated through lexical effects and “flat” prosody Rapin & Dunn (2003) Mesibov (1992) Le Normand (2008) Van Santen (2009)
11 Automatic Detection of Pathological Speech Apraxia Green (2004) Shriberg (2004) Spasmodic Dysphonia & Muscular Tension Dysphonia Schlotthauer (2006) Stuttering Howell (1997) Czyzewski (2003) Parkinson’s Little (2008) Hammen (1989) Dyslexia Schulte-Körne (1999)
12 Speaker Verification Are you who you say you are? Security Applications Banking Restricted Facility Entry Forensics Compare stored speech against test speech Statistical modeling
13 Text Dependent vs. Text Independent Text Dependent Everyone says the same short phrase Text Independent Speakers say whatever they want. Typically no impact of the words that are said Text Dependent approaches have higher performance Text Independent approaches are more widely applicable
14 Speaker Verification Schematic Pipeline. Training: speech data + known speaker identity → Speech Parameterization → Statistical Modeling → speaker model, background model. Testing: speech data + claimed speaker identity → Speech Parameterization → Statistical Models (speaker model, background model) → Score Normalization → Accept / Reject
15 Representation of Speech Mel-Frequency Cepstral Coefficients Typically taken every 10 ms Often 20 coefficients Also include ∆ and ∆∆ in the feature vector, for a vector of 60 elements Windowing → FFT → Filter Bank → Cepstral Transform (DCT)
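The 60-element figure above comes from stacking 20 static coefficients with their ∆ and ∆∆ values. A minimal sketch of that stacking step follows, assuming the MFCCs have already been computed. The central-difference delta and the toy input values are illustrative assumptions; real front ends often use a regression over a wider window.

```python
def deltas(frames):
    """Approximate the time derivative of each coefficient with a
    central difference; edge frames reuse their nearest neighbour."""
    n = len(frames)
    out = []
    for t in range(n):
        prev_frame = frames[max(t - 1, 0)]
        next_frame = frames[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev_frame, next_frame)])
    return out


def add_deltas(static):
    """Concatenate static, delta, and delta-delta coefficients per frame."""
    d1 = deltas(static)
    d2 = deltas(d1)
    return [s + a + b for s, a, b in zip(static, d1, d2)]


# Toy input (made-up numbers): 5 frames, one per 10 ms, 20 coefficients each.
static = [[float(t * c) for c in range(20)] for t in range(5)]
features = add_deltas(static)
print(len(features), len(features[0]))
```

Each frame now carries 20 + 20 + 20 = 60 values, which is the vector passed on to statistical modeling.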
16 Statistical Modeling How does statistical modeling work? Learn a function that produces a probability. (Training) These functions are commonly represented in a parametric form. Learn the parameters.
17 Gaussian Model Gaussian Model or Normal Distribution Common and Easy to Work With Has 2 parameters: mean, variance (or standard deviation)
18 Gaussian Models in Higher Dimensions Normal Distributions in higher dimensions require slightly more complicated math, but operate identically Two parameters: A mean vector with d elements, a d-by-d covariance matrix.
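For reference, the "slightly more complicated math" is the standard multivariate normal density, written here with the mean vector as \mu and the d-by-d covariance matrix as \Sigma (symbol names are our choice, not taken from the slides):

```latex
\mathcal{N}(x \mid \mu, \Sigma)
  = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}}
    \exp\!\left( -\tfrac{1}{2}\,(x-\mu)^{\top} \Sigma^{-1} (x-\mu) \right)
```

With d = 1 and \Sigma = \sigma^2 this reduces to the familiar one-dimensional Gaussian.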
19 Training a Gaussian Model The Gaussian Model that best fits a set of data (in the maximum-likelihood sense) has the sample mean and sample standard deviation of that data as its parameters. Can be proven with calculus, but we’re not going to today.
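A minimal sketch of this training step for the one-dimensional case. The function names and the toy data are illustrative; the variance divides by n, which is the maximum-likelihood estimate.

```python
import math


def fit_gaussian(xs):
    """Maximum-likelihood fit: sample mean and (biased) sample variance."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    return mean, var


def log_pdf(x, mean, var):
    """Log of the 1-D Gaussian density at x."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up observations
mean, var = fit_gaussian(data)
print(mean, math.sqrt(var))
print(log_pdf(5.0, mean, var), log_pdf(9.0, mean, var))
```

Points near the fitted mean receive a higher log-density than points far from it, which is what later lets the model score new speech.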
20 Gaussian Mixture Model But a lot of data is not actually normally distributed. A Mixture of Gaussian Models (GMM) allows us to add contributions from a number of Gaussians to best fit the data.
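Written out, a mixture of K Gaussians is a weighted sum of component densities. The symbols w_k, \mu_k, and \Sigma_k are our notation for the weight, mean, and covariance of component k:

```latex
p(x) = \sum_{k=1}^{K} w_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k),
\qquad w_k \ge 0, \qquad \sum_{k=1}^{K} w_k = 1
```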
21 Modeling with a Gaussian Mixture Model Fitting a GMM to data. There isn’t a closed form to find the best parameterization of a GMM. Expectation-Maximization Powerful iterative optimization approach. Can be slow Can fall into local optima Algorithm: Initialize Assign points to mixtures Estimate mixture parameters Repeat until convergence
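The algorithm above can be sketched for a one-dimensional GMM as follows. This is an illustration under assumptions: the function name, the variance floor, the tolerance, and the toy data are ours. In EM, "assigning points to mixtures" is a soft assignment, meaning each point gets a responsibility for every component.

```python
import math


def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


def em_gmm_1d(xs, means, variances, weights, max_iter=200, tol=1e-8, var_floor=1e-6):
    """Fit a 1-D GMM with EM, starting from the given initial parameters."""
    means, variances, weights = list(means), list(variances), list(weights)
    k = len(means)
    prev_ll = -math.inf
    for _ in range(max_iter):
        # E-step: soft-assign every point to every component.
        resp, ll = [], 0.0
        for x in xs:
            joint = [weights[j] * normal_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(joint)
            ll += math.log(total)
            resp.append([p / total for p in joint])
        # M-step: re-estimate each component from its weighted points.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(xs)
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var, var_floor)  # guard against collapse
        # Stop once the log-likelihood no longer improves.
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return means, variances, weights


# Toy data (made up): two well-separated clusters around 1 and 5.
xs = [0.9, 1.0, 1.1, 1.2, 0.8, 5.0, 5.1, 4.9, 5.2, 4.8]
fitted_means, fitted_vars, fitted_weights = em_gmm_1d(
    xs, means=[0.0, 6.0], variances=[1.0, 1.0], weights=[0.5, 0.5])
print(fitted_means, fitted_weights)
```

A different initialization can converge to a different local optimum, which is why practical systems usually try several starting points or initialize from a clustering step.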
23 Score normalization What does a score of 0.0005 mean? At what score should a system accept a user’s claim that they are who they say they are? We want to compare the likelihood that a speaker is who they say they are to the likelihood that they are another speaker. Universal Background Model
24 Speaker Verification with UBM score normalization For each speaker we have a GMM representing their voice. Additionally, we have one UBM-GMM that represents “speech” generally.
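One common way to combine the two models is an average log-likelihood ratio between the claimed speaker's GMM and the UBM, compared against a threshold. The sketch below assumes one-dimensional features and made-up model parameters. The function names and the threshold of 0.0 are illustrative, and a deployed threshold would be tuned on held-out data.

```python
import math


def gmm_log_likelihood(x, gmm):
    """log p(x) under a 1-D GMM given as (weight, mean, variance) triples."""
    return math.log(sum(
        w * math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2 * math.pi * v)
        for w, m, v in gmm))


def llr_score(frames, speaker_gmm, ubm):
    """Average per-frame log-likelihood ratio: speaker model vs. UBM."""
    ratios = [gmm_log_likelihood(x, speaker_gmm) - gmm_log_likelihood(x, ubm)
              for x in frames]
    return sum(ratios) / len(ratios)


def verify(frames, speaker_gmm, ubm, threshold=0.0):
    """Accept the identity claim when the normalized score clears the threshold."""
    return llr_score(frames, speaker_gmm, ubm) > threshold


# Hypothetical models: a broad UBM and a narrow model for the claimed speaker.
ubm = [(0.5, 0.0, 4.0), (0.5, 4.0, 4.0)]
speaker = [(1.0, 1.0, 0.25)]

genuine = [0.8, 1.1, 1.3, 0.9]   # frames resembling the claimed speaker
impostor = [3.9, 4.5, 3.6, 4.2]  # frames resembling someone else
print(verify(genuine, speaker, ubm), verify(impostor, speaker, ubm))
```

A positive score means the speech is better explained by the claimed speaker's model than by speech in general, so a raw likelihood such as 0.0005 never has to be interpreted on its own.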
26 Speaker Recognition Given speech from an unknown speaker can you tell me who it is? Requires some known material from the person in question. Now no longer a binary (True vs. False) question. Now a 1-of-N problem.
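As a sketch of the 1-of-N decision, the unknown speech can be scored against every enrolled speaker's model and the best-scoring speaker returned. The speaker names, one-dimensional features, and model parameters below are all hypothetical.

```python
import math


def avg_log_likelihood(frames, gmm):
    """Mean per-frame log p(x) under a 1-D GMM of (weight, mean, variance) triples."""
    total = 0.0
    for x in frames:
        total += math.log(sum(
            w * math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2 * math.pi * v)
            for w, m, v in gmm))
    return total / len(frames)


def identify(frames, speaker_models):
    """Return the enrolled speaker whose model best explains the frames."""
    return max(speaker_models,
               key=lambda name: avg_log_likelihood(frames, speaker_models[name]))


# Hypothetical enrolled speakers, one GMM each.
speaker_models = {
    "alice": [(1.0, 1.0, 0.5)],
    "bob": [(1.0, 3.0, 0.5)],
    "carol": [(0.5, -2.0, 0.5), (0.5, 6.0, 0.5)],
}

print(identify([2.8, 3.1, 3.3], speaker_models))
print(identify([-1.9, 6.2, -2.2], speaker_models))
```

Unlike verification, no threshold is needed in this closed-set form: the answer is always one of the N enrolled speakers.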
27 Verification vs. Recognition with GMMs
28 Speaker Recognition Overview. Training: speech data + known speaker identity → Speech Parameterization → Statistical Modeling → speaker models, background model. Testing: speech data + claimed speaker identity → Speech Parameterization → Statistical Models (speaker models, background model) → Score Normalization → Speaker Prediction
29 State-of-the-art Speaker Verification What we have works fine. There has been a significant improvement to the state-of-the-art. Rather than model a speaker directly... model how the speaker differs from the average speaker (UBM). How can we do this? Move the UBM to best fit the new speaker.
30 Maximum A Posteriori Adaptation Update the UBM model parameters to best fit the new speaker data.
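The slides do not give the update rule. A widely used form, assumed here, adapts only the component means and uses a relevance factor r to control how far each mean moves. For speaker frames x_1, ..., x_T and UBM component k with weight w_k, mean \mu_k, and covariance \Sigma_k:

```latex
\gamma_t(k) = \frac{w_k \, \mathcal{N}(x_t \mid \mu_k, \Sigma_k)}
                   {\sum_{j} w_j \, \mathcal{N}(x_t \mid \mu_j, \Sigma_j)},
\qquad
n_k = \sum_{t=1}^{T} \gamma_t(k),
\qquad
E_k = \frac{1}{n_k} \sum_{t=1}^{T} \gamma_t(k) \, x_t

\hat{\mu}_k = \alpha_k E_k + (1 - \alpha_k)\, \mu_k,
\qquad
\alpha_k = \frac{n_k}{n_k + r}
```

Components that see a lot of the new speaker's data (large n_k) move close to that data, while components that see little stay near the UBM.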
32 Maximum A Posteriori Adaptation Store the transformation (or new value) of each parameter. Construct a new feature vector. Classify using an SVM (or another classifier). “Supervectors”: Feature vectors of model parameters rather than speech features. Diagram: speech representation (MFCC) + UBM → MAP → supervectors → Classifier (SVM)
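A minimal sketch of building a supervector, assuming mean-only MAP adaptation with a relevance factor and a diagonal-covariance UBM. The relevance value of 16, the function names, and the toy data are illustrative choices rather than anything taken from the slides. The classifier step is omitted; the resulting vectors would be handed to an SVM or another classifier.

```python
import math


def diag_gauss_pdf(x, mean, var):
    """Density of a diagonal-covariance Gaussian at vector x."""
    log_p = 0.0
    for xi, mi, vi in zip(x, mean, var):
        log_p += -0.5 * (math.log(2 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return math.exp(log_p)


def map_adapt_means(frames, ubm, relevance=16.0):
    """Shift each UBM mean toward the speaker's frames.

    ubm is a list of (weight, mean_vector, variance_vector) tuples.
    """
    k, dim = len(ubm), len(ubm[0][1])
    counts = [0.0] * k                          # soft frame counts n_k
    sums = [[0.0] * dim for _ in range(k)]      # responsibility-weighted sums
    for x in frames:
        joint = [w * diag_gauss_pdf(x, m, v) for w, m, v in ubm]
        total = sum(joint)
        for j in range(k):
            g = joint[j] / total
            counts[j] += g
            for d in range(dim):
                sums[j][d] += g * x[d]
    # (sum + r * mu) / (n + r) equals alpha * E + (1 - alpha) * mu
    # with alpha = n / (n + r), and avoids dividing by n when n is 0.
    return [[(sums[j][d] + relevance * ubm[j][1][d]) / (counts[j] + relevance)
             for d in range(dim)] for j in range(k)]


def supervector(means):
    """Concatenate all adapted mean vectors into one long feature vector."""
    return [value for mean in means for value in mean]


# Toy UBM with two 2-D components, and 16 identical speaker frames (made up).
ubm = [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [5.0, 5.0], [1.0, 1.0])]
frames = [[1.0, 1.0] for _ in range(16)]
sv = supervector(map_adapt_means(frames, ubm))
print(sv)
```

Here the first component moves halfway toward the speaker data because it sees 16 frames with r = 16, while the second component stays almost at its UBM mean.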
33 UBM-MAP Overview. Training: speech data + known speaker identities → Speech Parameterization → UBM-MAP → training supervectors → SVM Training. Testing: speech data → Speech Parameterization → UBM-MAP → testing supervectors → SVM Testing → Speaker Prediction
34 Limitations of Current Speaker Verification and Adaptation Require Training material from the target. Can be slow to train. Best performance with Text-Dependent approaches
35 Summary of Voice Biometrics Speech carries speaker-specific information Physiology Native Language Interference Personality Speaker State Idiosyncrasies Speech is an attractive Biometric option. Inexpensive Technology requirements Minimally intrusive Low-risk surveillance GMM modeling is a powerful way to statistically model a speaker’s voice for recognition and verification. >85-95% classification accuracy
36 Questions? Feel free to