Speaker Verification: From Research to Reality
MIT Lincoln Laboratory / Nuance Communications, ICASSP'01 Tutorial, 5/7/01, © 2001 D.A. Reynolds and L.P. Heck


Speaker Verification: From Research to Reality

Douglas A. Reynolds, PhD, Senior Member of Technical Staff, M.I.T. Lincoln Laboratory
Larry P. Heck, PhD, Speaker Verification R&D, Nuance Communications

This work was sponsored by the Department of Defense under Air Force contract F19628-00-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the authors and are not necessarily endorsed by the Department of Defense.

ICASSP Tutorial, Salt Lake City, UT, 7 May 2001

This material may not be reproduced in whole or part without written permission from the authors.

Tutorial Outline

Part I: Background and Theory
– Overview of area
– Terminology
– Theory and structure of verification systems
– Channel compensation and adaptation

Part II: Evaluation and Performance
– Evaluation tools and metrics
– Evaluation design
– Publicly available corpora
– Performance survey

Part III: Applications and Deployments
– Brief overview of commercial speaker verification systems
– Design requirements for commercial verifiers
– Steps to deployment
– Examples of deployments

Goals of Tutorial

– Understand the major concepts behind modern speaker verification systems
– Identify the key elements in evaluating the performance of a speaker verification system
– Define the main issues and tasks in deploying a speaker verification system

Part I: Background and Theory. Outline

Overview of area
– Applications
– Terminology
General Theory
– Features for speaker recognition
– Speaker models
– Verification decision
Channel compensation
Adaptation
Combination of speech and speaker recognizers

Extracting Information from Speech

Goal: automatically extract information transmitted in the speech signal
– Speech recognition: words ("How are you?")
– Language recognition: language name (English)
– Speaker recognition: speaker name (James Wilson)

Evolution of Speaker Recognition

– 1930s onward: aural and spectrogram matching
– 1960s: template matching
– 1970s–1980s: dynamic time-warping, vector quantization
– 1980s–1990s: hidden Markov models, Gaussian mixture models
– 1990s–2001: commercial application of speaker recognition technology

The field has moved from small databases of clean, controlled speech to large databases of realistic, unconstrained speech. This tutorial will focus on techniques and performance of state-of-the-art systems.

Speaker Recognition Applications

– Access control: physical facilities; computer networks and websites
– Transaction authentication: telephone banking; remote credit card purchases
– Law enforcement: forensics; home parole
– Speech data management: voice mail browsing; speech skimming
– Personalization: intelligent answering machine; voice-web / device customization

Terminology

The general area of speaker recognition can be divided into two fundamental tasks: identification and verification. Any work on speaker recognition should identify which task is being addressed.

Terminology: Identification

– Determines who is talking from a set of known voices ("Whose voice is this?")
– No identity claim from the user (one-to-many mapping)
– Often assumed that the unknown voice must come from the set of known speakers; referred to as closed-set identification

Terminology: Verification / Authentication / Detection

– Determines whether a person is who they claim to be ("Is this Bob's voice?")
– User makes an identity claim (one-to-one mapping)
– The unknown voice could come from a large set of unknown speakers; referred to as open-set verification
– Adding a "none-of-the-above" option to closed-set identification gives open-set identification

Terminology: Segmentation and Clustering

– Determine when a speaker change has occurred in the speech signal (segmentation): where are the speaker changes?
– Group together speech segments from the same speaker (clustering): which segments are from the same speaker?
– Prior speaker information may or may not be available

Terminology: Speech Modalities

The application dictates the speech modality:

Text-dependent recognition
– The recognition system knows the text spoken by the person
– Examples: fixed phrase, prompted phrase
– Used for applications with strong control over user input
– Knowledge of the spoken text can improve system performance

Text-independent recognition
– The recognition system does not know the text spoken by the person
– Examples: user-selected phrase, conversational speech
– Used for applications with less control over user input
– More flexible system, but also a more difficult problem
– Speech recognition can provide knowledge of the spoken text

Terminology: Voice Biometric

Speaker verification is often referred to as a voice biometric. A biometric is a human-generated signal or attribute for authenticating a person's identity.

Voice is a popular biometric:
– natural signal to produce
– does not require a specialized input device
– ubiquitous: telephones and microphone-equipped PCs

A voice biometric can be combined with other forms of security; the strongest security combines all three:
– Something you have (e.g., a badge)
– Something you know (e.g., a password)
– Something you are (e.g., your voice)

Part I: Background and Theory. Outline

General Theory: Phases of a Speaker Verification System

There are two distinct phases to any speaker verification system:

Enrollment phase: enrollment speech from each speaker (e.g., Bob, Sally) passes through feature extraction and model training to produce a model (voiceprint) for each speaker.

Verification phase: given a claimed identity (e.g., Sally), the input speech passes through feature extraction and the verification decision, which accepts or rejects the claim.

General Theory: Features for Speaker Recognition

Humans use several levels of perceptual cues for speaker recognition, forming a hierarchy:
– High-level cues (learned traits): difficult to automatically extract
– Low-level cues (physical traits): easy to automatically extract

There are no exclusive speaker identity cues. Low-level acoustic cues are the most applicable for automatic systems.

General Theory: Features for Speaker Recognition

Desirable attributes of features for an automatic system (Wolf '72):
– Practical: occur naturally and frequently in speech; easily measurable
– Robust: do not change over time or with the speaker's health; not affected by reasonable background noise; do not depend on specific transmission characteristics
– Secure: not subject to mimicry

No feature has all these attributes, but features derived from the spectrum of speech have proven to be the most effective in automatic systems.

General Theory: Speech Production

Speech production follows a source-filter model: glottal pulses (the source) excite the vocal tract (the filter) to produce the speech signal. The speaker's anatomical structure (vocal tract and glottis) is conveyed in the speech spectrum.

General Theory: Features for Speaker Recognition

Different speakers will have different spectra for similar sounds (e.g., the vowels /AE/ and /I/ spoken by a male and a female speaker). The differences are in the location and magnitude of the peaks in the spectrum:
– The peaks are known as formants and represent resonances of the vocal cavity
– The spectrum captures the formant locations and, to some extent, pitch, without explicit formant or pitch tracking

General Theory: Features for Speaker Recognition

Speech is a continuous evolution of the vocal tract, so we need to extract a time series of spectra:
– Use a sliding window, typically a 20 ms window with a 10 ms shift
– Take the Fourier transform magnitude of each windowed segment

This produces the time-frequency evolution of the spectrum.

General Theory: Features for Speaker Recognition

The number of discrete Fourier transform samples representing the spectrum is reduced by averaging frequency bins together, typically with a simulated filterbank. A perceptually based filterbank is used, such as a Mel- or Bark-scale filterbank:
– Linearly spaced filters at low frequencies
– Logarithmically spaced filters at high frequencies
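The filter spacing above can be sketched in a few lines. This is an illustrative implementation using the common 2595·log10(1 + f/700) mel formula; the slides do not commit to a particular mel variant, so treat the constants as an assumption:

```python
import math

def hz_to_mel(f_hz):
    # O'Shaughnessy-style mel scale: approximately linear below ~1 kHz,
    # logarithmic above (constants are one common convention)
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_filter_centers(f_low, f_high, n_filters):
    # Filter center frequencies equally spaced on the mel scale,
    # then mapped back to Hz
    m_low, m_high = hz_to_mel(f_low), hz_to_mel(f_high)
    step = (m_high - m_low) / (n_filters + 1)
    return [700.0 * (10.0 ** ((m_low + i * step) / 2595.0) - 1.0)
            for i in range(1, n_filters + 1)]
```

Because the inverse mel map is convex, the centers come out nearly linearly spaced at low frequencies and progressively wider apart at high frequencies, matching the two bullets above.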

General Theory: Features for Speaker Recognition

The primary features used in speaker recognition systems are cepstral feature vectors, computed by the chain: Fourier transform, magnitude, log(), cosine transform, producing one feature vector every 10 ms.
– The log() function turns linear convolutional channel effects into additive biases, which are easy to remove using blind-deconvolution techniques
– The cosine transform helps decorrelate the elements of the feature vector, placing less burden on the model and giving empirically better performance
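The four-step chain above can be sketched directly. This is a deliberately naive, self-contained version (a direct DFT and a plain DCT-II, no windowing or filterbank) just to make the order of operations concrete:

```python
import cmath
import math

def cepstrum(frame, n_ceps):
    # 1) Fourier transform + 2) magnitude (naive DFT, keep positive bins)
    n = len(frame)
    mags = [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2)]
    # 3) log() turns convolutional channel effects into additive biases
    log_spec = [math.log(m + 1e-10) for m in mags]
    # 4) cosine transform (DCT-II) decorrelates the log-spectral values
    m = len(log_spec)
    return [sum(log_spec[j] * math.cos(math.pi * i * (j + 0.5) / m)
                for j in range(m))
            for i in range(n_ceps)]
```

A real front end would insert the perceptual filterbank between steps 2 and 3 (giving mel-frequency cepstral coefficients) and use an FFT, but the flow is the same.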


General Theory: Features for Speaker Recognition

Additional processing steps are applied for speaker recognition features:
– To help capture some temporal information about the spectra, delta cepstra are often computed and appended to the cepstral feature vector; a 1st-order linear fit over a 5-frame (50 ms) span is used
– For telephone speech, only the voice pass-band frequency region is used: only the outputs of filters in the range 300–3300 Hz
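The 1st-order linear fit over a 5-frame span is the standard delta-coefficient regression. A minimal sketch (edge frames are handled by repeating the first/last frame, which is one common convention, not something the slides specify):

```python
def delta(ceps, span=2):
    # 1st-order linear-regression slope over a (2*span+1)-frame window
    # (span=2 gives the 5-frame, 50 ms span from the slide)
    denom = 2 * sum(k * k for k in range(1, span + 1))
    n, dim = len(ceps), len(ceps[0])
    out = []
    for t in range(n):
        d = [0.0] * dim
        for k in range(1, span + 1):
            prev = ceps[max(t - k, 0)]        # repeat edges at the start
            nxt = ceps[min(t + k, n - 1)]     # repeat edges at the end
            for i in range(dim):
                d[i] += k * (nxt[i] - prev[i]) / denom
        out.append(d)
    return out
```

For a cepstral trajectory that rises by 1.0 per frame, interior deltas come out exactly 1.0, which is a quick sanity check on the regression weights.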

General Theory: Features for Speaker Recognition

To help remove channel convolutional effects, cepstral mean subtraction (CMS) or RASTA filtering is applied to the cepstral vectors:
– Some speaker information is lost, but CMS is generally highly beneficial to performance
– RASTA filtering is like a time-varying version of CMS (Hermansky, '92)
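CMS itself is one line of arithmetic: because the log step made a fixed linear channel an additive constant in every cepstral vector, subtracting the per-utterance mean removes it. A minimal sketch:

```python
def cms(ceps):
    # Cepstral mean subtraction: a fixed linear channel adds the same
    # bias to every cepstral vector, so subtracting the utterance mean
    # removes the channel (and, unavoidably, the speaker's own average)
    n, dim = len(ceps), len(ceps[0])
    mean = [sum(v[i] for v in ceps) / n for i in range(dim)]
    return [[v[i] - mean[i] for i in range(dim)] for v in ceps]
```

Adding any constant vector to every frame of the input leaves the output unchanged, which is exactly the channel invariance the slide describes.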

General Theory: Phases of a Speaker Verification System

General Theory: Speaker Models

Speaker models are used to represent the speaker-specific information conveyed in the feature vectors. Desirable attributes of a speaker model:
– Theoretical underpinning
– Generalizable to new data
– Parsimonious representation (size and computation)

Modern speaker verification systems employ some form of hidden Markov model (HMM):
– Statistical model for speech sound representation
– Solid theoretical basis
– Existing parameter estimation techniques

General Theory: Speaker Models

Treat the speaker as a hidden random source generating the observed feature vectors. The source has "states" corresponding to different speech sounds; each hidden speech state emits an observed feature vector.

General Theory: Speaker Models

Feature vectors generated from each state follow a Gaussian mixture distribution. Transitions between states are based on the modality of speech:
– The text-dependent case will have ordered states
– The text-independent case will allow all transitions

Model parameters:
– Transition probabilities
– State mixture parameters

Parameters are estimated from training speech using the Expectation-Maximization (EM) algorithm.

General Theory: Speaker Models

HMMs encode the temporal evolution of the features (spectrum): they represent the underlying statistical variations within each speech state (e.g., a phoneme) and the temporal changes of speech between states. This provides a statistical model of how a speaker produces sounds.

The designer needs to set:
– The topology (number of states and allowed transitions)
– The number of mixtures

General Theory: Speaker Models

The form of the HMM depends on the application:
– Fixed phrase (e.g., "Open sesame"): word/phrase models
– Prompted phrases/passwords: phoneme models (e.g., /t/ /e/ /n/)
– Text-independent general speech: single-state HMM, i.e., a Gaussian mixture model (GMM)
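In the text-independent case the single-state HMM reduces to a GMM, so scoring one feature vector is just a log-sum over weighted Gaussians. A minimal diagonal-covariance sketch (toy parameters; real verifiers use hundreds to thousands of mixture components):

```python
import math

def gmm_log_likelihood(x, weights, means, variances):
    # Log-likelihood of one feature vector x under a diagonal-covariance
    # Gaussian mixture: log sum_i w_i * N(x; mu_i, var_i)
    comp = []
    for w, mu, var in zip(weights, means, variances):
        log_g = -0.5 * sum(math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
                           for xi, m, v in zip(x, mu, var))
        comp.append(math.log(w) + log_g)
    hi = max(comp)  # log-sum-exp trick for numerical stability
    return hi + math.log(sum(math.exp(c - hi) for c in comp))
```

An utterance score is then the sum (or per-frame average) of this quantity over all feature vectors.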

General Theory: Speaker Models

The dominant model factor in speaker recognition performance is the number of mixtures used (Matsui and Furui, ICASSP '92). Selection of the mixture order depends on a number of factors:
– Topology of the HMM
– Amount of training data
– Desired model size

There is no good theoretical technique for picking the mixture order; it is usually set empirically. Parameter-tying techniques can help increase the effective number of Gaussians with a limited increase in total parameters.

General Theory: Speaker Models

The likelihood of an HMM given a sequence of feature vectors x(1), x(2), ..., x(T) can be computed in two ways over the state-time trellis:
– Full likelihood score: sum the joint likelihood over all possible state sequences (the forward algorithm)
– Viterbi (best-path) score: the likelihood of the single most likely state sequence

General Theory: Phases of a Speaker Verification System

General Theory: Verification Decision

The verification task is fundamentally a two-class hypothesis test:
– H0: the speech S is from an impostor
– H1: the speech S is from the claimed speaker

We select the more likely hypothesis (the Bayes test for minimum error); this is known as the likelihood ratio test.

General Theory: Verification Decision

Usually the log-likelihood ratio is used: the front-end features are scored against the claimed speaker model and against an impostor model, and the difference of the two log-likelihoods is compared to a threshold.
– The H1 likelihood is computed using the claimed speaker model
– An alternative (impostor) model is required for the H0 likelihood
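The decision rule can be sketched in a few lines, assuming per-frame log-likelihoods have already been produced by the speaker and background models (the function names here are illustrative, not from the tutorial):

```python
def log_lr_score(frame_ll_speaker, frame_ll_background):
    # Average per-frame log-likelihood difference between the claimed
    # speaker model (H1) and the background/impostor model (H0)
    diffs = [s - b for s, b in zip(frame_ll_speaker, frame_ll_background)]
    return sum(diffs) / len(diffs)

def verify(score, threshold):
    # Accept the identity claim when the log-likelihood ratio score
    # exceeds the decision threshold
    return score > threshold
```

Averaging over frames (rather than summing) keeps the score roughly independent of utterance length, which makes a single threshold easier to set.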

General Theory: Background Model

There are two main approaches to creating an alternative model for the likelihood ratio test:

Cohorts / likelihood sets / background sets (Higgins, DSP Journal '91)
– Use a collection of other speaker models
– The likelihood of the alternative hypothesis is some function, such as the average, of the individual impostor model likelihoods

General / world / universal background model (Carey, ICASSP '91)
– Use a single speaker-independent model
– Trained on speech from a large number of speakers to represent general speech patterns

General Theory: Background Model

The background model is crucial to good performance: it acts as a normalization that helps minimize non-speaker-related variability in the decision score. Using the speaker model's likelihood alone does not perform well:
– Too unstable for setting decision thresholds
– Influenced by too many non-speaker-dependent factors

The background model should be trained on speech representing the expected impostor speech:
– The same type of speech as speaker enrollment (modality, language, channel)
– Representative of the impostor genders and microphone types to be encountered

General Theory: Background Model

Selected highlights of research on background models:
– Near/far cohort selection (Reynolds, Speech Communication '95): select cohort speakers to cover the speaker space around the speaker model
– Phonetic-based cohort selection (Rosenberg, ICASSP '96): select speech and speakers to match the speech modality used for speaker enrollment
– Microphone-dependent background models (Heck, ICASSP '97): train the background model using speech from the same type of microphone as used for speaker enrollment
– Adapting the speaker model from the background model (Reynolds, Eurospeech '97, DSP Journal '00): use maximum a posteriori (MAP) estimation to derive the speaker model from a background model

General Theory: Components of a Speaker Verification System

The input speech ("My name is Bob") supplies both the identity claim and the speech to be verified. Feature extraction feeds the claimed speaker's model (Bob's model) and the impostor model(s); the difference of their log-likelihoods is compared against a threshold to decide ACCEPT or REJECT.

Part I: Background and Theory. Outline

Channel Compensation

The largest challenge to practical use of speaker verification systems is channel variability: changes in channel effects between enrollment and successive verification attempts. Channel effects encompass several factors:
– The microphone: carbon-button, electret, hands-free, etc.
– The acoustic environment: office, car, airport, etc.
– The transmission channel: landline, cellular, VoIP, etc.

Anything which affects the spectrum can cause problems: speaker and channel effects are bound together in the spectrum, and hence in the features used by speaker verifiers. Unlike speech recognizers, speaker verifiers cannot "average out" these effects using large amounts of speech, because enrollment speech is limited.

Channel Compensation Examples (spectral examples across handsets and environments)

Channel Compensation

Three areas where compensation has been applied:
– Feature-based approaches: CMS and RASTA; nonlinear mappings
– Model-based approaches: handset-dependent background models; Synthetic Model Synthesis (SMS)
– Score-based approaches: Hnorm, Tnorm

Using compensation techniques has driven down error rates in NIST evaluations: from roughly a factor of 20 worse under channel mismatch to roughly a factor of 2.5 worse.

Channel Compensation: Feature-based Approaches

CMS and RASTA only address linear channel effects on the features. Several approaches have looked at non-linear effects, modeling a handset as a cascade of a linear filter, a non-linear element, and another linear filter:
– Non-linear mapping (Quatieri, Trans. SAP 2000): use a Volterra series to map speech between different types of handsets (e.g., carbon-button speech to electret speech)
– Discriminative feature design (Heck, Speech Communication 2000): use a neural-net transform, trained discriminatively, to find features that discriminate speakers rather than channels

Channel Compensation: Model-based Approaches

It is generally difficult to get enrollment speech from all the microphone types that will be used. The Synthetic Model Synthesis (SMS) approach addresses this by synthetically generating speaker models as if they came from different microphones (Teunen, ICSLP 2000): a mapping of model parameters between different microphone types (e.g., electret, carbon-button, cellular) is applied.

Channel Compensation: Score-based Approaches

Speaker-model likelihood-ratio (LR) scores have different biases and scales for utterances from different handset types. Hnorm attempts to remove these bias and scale differences from the LR scores (Reynolds, NIST eval '96):
– Offline, estimate the mean and standard deviation of the speaker model's LR scores on impostor, same-sex utterances from each microphone type (e.g., electret vs. carbon-button)
– During verification, normalize the LR score based on the microphone label of the utterance
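The normalization step is a per-handset standardization of the score. A minimal sketch, assuming the handset label of the test utterance is known and the per-handset impostor statistics were estimated offline as the slide describes:

```python
def hnorm(raw_score, handset_label, stats):
    # Handset normalization: subtract the impostor-score mean and divide
    # by the impostor-score standard deviation estimated for this
    # speaker model on the labeled handset type, so one
    # speaker-independent threshold works across microphones
    mean, std = stats[handset_label]
    return (raw_score - mean) / std
```

Example usage: `hnorm(score, 'carb', {'carb': (0.5, 2.0), 'elec': (0.0, 1.0)})`, where the label names and statistics are illustrative.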

Channel Compensation: Score-based Approaches

Tnorm/HTnorm estimates bias and scale parameters for score normalization using a "cohort" set of speaker models (Auckenthaler, DSP Journal 2000):
– Test-time score normalization
– Normalizes the target score relative to a non-target model ensemble
– Similar to standard cohort normalization except for the standard-deviation scaling
– Uses cohorts of the same gender and channel as the speaker
– Can be used in conjunction with Hnorm

Part I: Background and Theory. Outline

Adaptation

Model adaptation is important for maintaining performance in speaker verification systems:
– Enrollment speech is limited
– The speaker and the speech environment change over time (Furui)

The most useful approach is unsupervised adaptation:
– Use the verifier's decision to select data for updating the speaker model
– Adjust the model parameters to better match the new data (MAP adaptation)

Adaptation

The adaptation parameter can be set in several ways:
– As a fixed value: continuous adaptation
– As a function of the likelihood score: adjust adaptation based on the certainty of the decision
– As a function of the number of verification sessions: adapt aggressively early and taper off later (the largest EER gain comes at the start)

Experiments have shown that adapting with N utterances produces performance comparable to having N extra utterances during initial training. Potential problems with adaptation:
– Impostor contamination
– Novel channels may be rejected and so never learned
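The MAP-style update referred to above can be sketched for the simplest case: adapting a single one-dimensional Gaussian mean toward newly accepted data. The relevance factor controls the taper (more observed data pulls the model further from its prior); its default value here is illustrative:

```python
def map_adapt_mean(old_mean, new_data, relevance=16.0):
    # MAP adaptation of a Gaussian mean (one component, one dimension):
    # interpolate between the prior mean and the new-data mean, with a
    # weight that grows as more adaptation data is observed
    n = len(new_data)
    if n == 0:
        return old_mean  # nothing accepted this session; model unchanged
    data_mean = sum(new_data) / n
    alpha = n / (n + relevance)  # data weight: small n -> trust the prior
    return alpha * data_mean + (1.0 - alpha) * old_mean
```

With n equal to the relevance factor, the update lands exactly halfway between the prior mean and the data mean; as n grows, the model tracks the new data more aggressively, which matches the "largest gain at the start, taper off later" behavior described above.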

Part I: Background and Theory. Outline

Combination of Speech and Speaker Recognizers

There are four basic ways speech recognition is used with speaker verifiers:
– For front-end speech segmentation
– For prompted-text verification
– For knowledge verification
– To extract idiolectal information

Speech and Speaker Recognizers: Front-end Segmentation

A speech recognizer is used to segment speech for training and verification:
– Depending on the task, different linguistic units are recognized: words, phones, broad phonetic classes
– The segmented units are scored against the corresponding speaker and impostor HMMs
– The recognized phrase could also provide the claimed identity to the verifier (e.g., an account number)

Speech and Speaker Recognizers: Prompted-Text Verification

Prompted-text systems are used to help thwart playback attacks:
– Need to verify both the voice and that the prompted text (e.g., "82-32-71") was said
– The text score and the speaker score are combined
– It is possible to have integrated speaker and text verification using speaker-dependent phrase decoding

Speech and Speaker Recognizers: Knowledge Verification

Compare the response to a personal question (e.g., "What is your date of birth?") against the known answer in a personal-information database. Also known as Verbal Information Verification (Q. Li, ICSLP '98).
– Can be used for initial enrollment speech collection: use KV for the first three accesses while collecting speech
– Can also be used as fall-back verification when the speaker verifier is unsure after some number of attempts

Speech and Speaker Recognizers: Idiolectal Information Extraction

Recent work by Doddington has found significant speaker information in n-grams of recognized words (Eurospeech 2001 and the NIST website):
– During training, create counts of n-grams (e.g., bigrams such as "uh-I", "uh-yeah", "uh-well") from the speaker's training data and from a collection of background speakers
– During verification, compute an LR score between the speaker and background n-gram models

This is a good example of using higher levels of speaker information.
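The bigram LR scoring can be sketched with two probability tables keyed by bigram strings. The flooring of unseen bigrams is an assumption added to keep the log well-defined; the slides do not describe the smoothing used:

```python
import math

def ngram_lr(utterance_bigrams, spk_probs, bkg_probs, floor=1e-4):
    # Per-bigram log-likelihood ratio between the speaker's n-gram
    # model and the background n-gram model, averaged over the
    # utterance; unseen bigrams are floored (illustrative smoothing)
    score = 0.0
    for bg in utterance_bigrams:
        p_spk = spk_probs.get(bg, floor)
        p_bkg = bkg_probs.get(bg, floor)
        score += math.log(p_spk / p_bkg)
    return score / max(len(utterance_bigrams), 1)
```

Using the slide's example numbers, an utterance full of "uh-I" (speaker 0.022 vs. background 0.001) scores positive, while one full of "uh-yeah" (speaker 0.001 vs. background 0.049) scores negative, as expected for a true-speaker versus impostor-like pattern.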

Speaker Verification: From Research to Reality
Part II: Evaluation and Performance

Part II: Evaluation and Performance. Outline

– Evaluation metrics
– Evaluation design
– Publicly available corpora
– Performance survey

Evaluation Metrics

In speaker verification, there are two types of errors:
– False reject: incorrectly rejecting a true speaker; also known as a miss or a Type I error
– False accept: incorrectly accepting an impostor; also known as a Type II error

The performance of a verification system is a measure of the trade-off between these two errors; the trade-off is usually controlled by adjusting the decision threshold. In an evaluation, N_true true trials (speech from the claimed speaker) and N_false false trials (speech from an impostor) are conducted, and the probabilities of false reject and false accept are estimated at different thresholds.
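Estimating the two error rates from trial scores is direct counting. A minimal sketch over lists of true-trial and impostor-trial scores at one threshold:

```python
def error_rates(true_scores, impostor_scores, threshold):
    # False reject (miss): a true-speaker trial scoring below threshold
    # False accept: an impostor trial scoring at or above threshold
    fr = sum(s < threshold for s in true_scores) / len(true_scores)
    fa = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
    return fr, fa
```

Sweeping the threshold and recording (fr, fa) pairs traces out the ROC/DET curve discussed on the following slides.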

66 Evaluation metrics Evaluation errors are estimates of the true errors using a finite number of trials
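Concretely, the estimates are just counts over the trial scores at a given threshold. A minimal sketch (the scores and threshold are made-up numbers; the convention here is to accept when score >= threshold):

```python
def error_rates(true_scores, impostor_scores, threshold):
    """Estimate Pr(false reject) and Pr(false accept) at one threshold.
    Accept a trial when its score >= threshold."""
    p_miss = sum(s < threshold for s in true_scores) / len(true_scores)
    p_fa = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
    return p_miss, p_fa

true_trials = [2.1, 1.7, 0.4, 3.0, 2.6]      # scores from claimed-speaker trials
false_trials = [-1.2, 0.6, -0.3, -2.0, 1.1]  # scores from impostor trials
```

Sweeping the threshold over the pooled scores yields the full error trade-off curve.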

67 Evaluation metrics ROC and DET Curves A plot of Pr(miss) vs. Pr(fa) shows system performance Receiver Operating Characteristic (ROC): decreasing the threshold traces out the curve; curves closer to the origin indicate better performance Detection Error Tradeoff (DET): plots Pr(miss) and Pr(fa) on a normal-deviate scale [Figure: ROC and DET curves; axes are probability of false reject vs. probability of false accept (in %)]
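The normal-deviate warping that makes DET curves (near-)linear is just the inverse normal CDF, which Python's standard library provides. A small sketch:

```python
from statistics import NormalDist

def to_normal_deviate(p):
    """Map an error probability (0 < p < 1) onto the normal-deviate
    (probit) scale used for DET curve axes."""
    return NormalDist().inv_cdf(p)
```

Plotting (to_normal_deviate(p_fa), to_normal_deviate(p_miss)) for each threshold gives the DET curve.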

68 Evaluation metrics DET Curve Equal Error Rate (EER), where Pr(fa) = Pr(miss), is often quoted as a summary performance measure (EER = 1% on the example curve) The application operating point depends on the relative costs of the two errors High security (e.g., wire transfer): false acceptance is very costly, and users may tolerate rejections for security High convenience (e.g., toll fraud): false rejections alienate customers, and any fraud rejection is beneficial [DET curve, probability of false reject vs. probability of false accept (in %), marking high-security, balanced, and high-convenience operating points]

69 Evaluation metrics Decision Cost Function In addition to EER, a decision cost function (DCF) is also used to measure performance: DCF = C(miss) · Pr(miss) · Pr(spkr) + C(fa) · Pr(fa) · Pr(imp) C(miss) = cost of a miss C(fa) = cost of a false alarm Pr(spkr) = prior probability of a true speaker attempt Pr(imp) = 1 − Pr(spkr) = prior probability of an impostor attempt For application-specific costs and priors, compare systems based on the minimum value of the DCF
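A sketch of the DCF in the form above; the costs C(miss)=10, C(fa)=1 and prior Pr(spkr)=0.01 are illustrative defaults, not values from the tutorial:

```python
def dcf(p_miss, p_fa, c_miss=10.0, c_fa=1.0, p_spkr=0.01):
    """Decision cost function: expected cost of the two error types,
    weighted by (illustrative) application costs and priors."""
    p_imp = 1.0 - p_spkr
    return c_miss * p_miss * p_spkr + c_fa * p_fa * p_imp

def min_dcf(operating_points):
    """Minimum DCF over a list of (p_miss, p_fa) operating points,
    i.e. the best achievable cost along the DET curve."""
    return min(dcf(pm, pf) for pm, pf in operating_points)
```

Comparing systems by min_dcf answers "how cheap could each system be run for this application" independently of how well its threshold was actually set.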

70 Evaluation metrics Thresholds A deployed verification system must make decisions – that is, it must set and use a priori thresholds –DET curves and EER are independent of threshold setting The DCF can be used as an objective target for setting and measuring the goodness of a priori thresholds –Set the threshold during development to minimize the DCF –Measure how close the threshold comes to the minimum DCF in evaluation For measuring system performance, speaker-independent thresholds should be used –Pr(miss) is computed by pooling all true-trial scores from all speakers in the evaluation –Pr(fa) is computed by pooling all false-trial scores from all impostors in the evaluation DETs computed with speaker-dependent thresholds give very optimistic performance that cannot be achieved in practice

71 Part II : Evaluation and Performance Outline Evaluation metrics Evaluation design Publicly available corpora Performance survey

72 Evaluation Design Data Selection Factors Performance numbers are only meaningful when the evaluation conditions are known Speech quality –Channel and microphone characteristics –Ambient noise level and type –Variability between enrollment and verification speech Speech modality –Fixed/prompted/user-selected phrases –Free text Speech duration –Duration and number of sessions of enrollment and verification speech Speaker population –Size and composition –Experience The evaluation data and design should match the target application domain of interest

73 Evaluation Design Sizing of Evaluation The overarching concern is to design an evaluation which produces statistically significant results –Number and composition of speakers –Number of true and false trials For the number of trials, we can use the “rule of 30,” based on the binomial distribution and an independence assumption (Doddington): to be 90% confident that the true error rate is within +/- 30% of the observed error rate, there must be at least 30 errors For performance goals of Pr(miss)=1% and Pr(fa)=0.1%, this implies –3,000 true trials: 0.7% < Pr(miss) < 1.3% with 90% confidence –30,000 impostor trials: 0.07% < Pr(fa) < 0.13% with 90% confidence Independence of trials is still an open issue
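The confidence intervals quoted above follow from the binomial distribution; under a normal approximation, the 90% relative confidence half-width of an error rate p estimated over n independent trials is about z·sqrt((1−p)/(n·p)) with z=1.645. A sketch reproducing the slide's numbers:

```python
import math

def relative_halfwidth(n_trials, p, z=1.645):
    """Approximate relative 90% confidence half-width for an error rate p
    estimated from n_trials independent trials (binomial, normal approx.)."""
    return z * math.sqrt((1.0 - p) / (n_trials * p))

# 3,000 true trials at Pr(miss)=1% -> ~30 expected errors -> roughly +/-30%
# relative uncertainty, i.e. about 0.7% < Pr(miss) < 1.3%
r_miss = relative_halfwidth(3000, 0.01)
r_fa = relative_halfwidth(30000, 0.001)  # same ~30 errors for Pr(fa)=0.1%
```

With 30 expected errors the half-width is close to 1.645/sqrt(30), about 0.30, which is exactly the rule of 30.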

74 Evaluation Design Trials True trials are the limiting factor in evaluation design False trials are easily generated by scoring all speaker models against all utterances –May not be possible for speaker-specific fixed phrases

Test utts \ Speaker models: Speaker 1 / Speaker 2 / Speaker 3
Utterance 1 (from speaker 1): true trial / false trial / false trial
Utterance 2 (from speaker 2): false trial / true trial / false trial
Utterance 3 (from speaker 3): false trial / false trial / true trial

It is important that each trial uses only the utterance and model under test –Otherwise the system is using “known” impostors (closed set) Trials can be designed to examine performance on sub-conditions –E.g., train on electret and test on carbon-button
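The all-models-against-all-utterances trial design above can be sketched as follows; the utterance and speaker ids are hypothetical:

```python
def make_trials(utterances):
    """Build the trial grid: each utterance is a true trial against its own
    speaker's model and a false trial against every other model.
    `utterances` maps utterance id -> id of the speaker who produced it."""
    speakers = sorted(set(utterances.values()))
    true_trials, false_trials = [], []
    for utt, spk in utterances.items():
        for model in speakers:
            trial = (model, utt)
            (true_trials if model == spk else false_trials).append(trial)
    return true_trials, false_trials

# Hypothetical ids matching the 3x3 grid on the slide
utts = {"utt1": "spk1", "utt2": "spk2", "utt3": "spk3"}
```

Each trial pairs only one model with one utterance, so no "known impostor" information leaks into the decision.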

75 Part II : Evaluation and Performance Outline Evaluation metrics Evaluation design Publicly available corpora Performance survey

76 Publicly Available Corpora Data Providers Linguistic Data Consortium http://www.ldc.upenn.edu/ European Language Resources Association http://www.icp.inpg.fr/ELRA/home.html

77 Publicly Available Corpora Partial Listing TIMIT et al. (LDC) – Not particularly good for evaluations SIVA (ELRA) – Italian telephone prompted speech PolyVar (ELRA) – French telephone prompted and spontaneous speech POLYCOST (ELRA) – European-language prompted and spontaneous speech KING (LDC) – Dual wideband and telephone monologs YOHO (LDC) – Office-environment combination-lock phrases Switchboard I-II & NIST Eval Subsets (LDC) – Telephone conversational speech Tactical Speaker Identification, TSID (LDC) – Military radio communications Speaker Recognition Corpus (OGI) – Long-term telephone prompted and spontaneous speech A summary of corpora characteristics can be found at http://www.apl.jhu.edu/Classes/Notes/Campbell/SpkrRec/

78 Part II : Evaluation and Performance Outline Evaluation metrics Evaluation design Publicly available corpora Performance survey

79 Performance Survey Range of Performance [DET-style chart, probability of false reject vs. probability of false accept (in %), with error rates falling as constraints increase:] ~0.1% – Text-dependent (combinations), clean data, single microphone, large amount of train/test speech ~1% – Text-dependent (digit strings), telephone data, multiple microphones, small amount of training data ~10% – Text-independent (conversational), telephone data, multiple microphones, moderate amount of training data ~25% – Text-independent (read sentences), military radio data, multiple radios & microphones, moderate amount of training data

80 Performance Survey NIST Speaker Recognition Evaluations Annual NIST evaluations of speaker verification technology (since 1995) Aim: Provide a common paradigm for comparing technologies Focus: Conversational telephone speech (text-independent) [Diagram: NIST (evaluation coordinator) and the Linguistic Data Consortium (data provider) run an evaluate/improve cycle with technology developers, comparing technologies on a common task] http://www.nist.gov/speech/tests/spk/index.htm

81 Performance Survey NIST Speaker Recognition Evaluation 2000 DET curves for 10 US and European sites –Variable-duration test segments (average 30 sec) –Two minutes of training speech per speaker –1003 speakers (546 female, 457 male) –6096 true trials, 66520 false trials Equal error rates range between 8% and 19% The dominant approach is an adapted Gaussian Mixture Model based system (single-state HMM)

82 Performance Survey Effect of Training and Testing Duration Results from the 1998 NIST evaluation [DET curves: performance improves both with increasing training data and with increasing testing data]

83 Performance Survey Effect of Microphone Mismatch In the NIST evaluation, performance was measured when speakers used the same and different telephone handset microphone types (carbon-button vs. electret) With microphone mismatch, the equal error rate increases by a factor of about 2.5

84 Performance Survey Effect of Speech Coding Recognition from reconstructed speech: error rate increases as bit rate decreases –GSM speech performs as well as uncoded speech Coder rates: T1 – 64.0 kb/s, GSM – 12.2 kb/s, G.729 – 8.0 kb/s, G.723 – 5.3 kb/s, MELP – 2.4 kb/s Recognition from speech coder parameters: negligible increase in EER, with increased computational efficiency

85 Performance Survey Human vs. Machine Motivation for comparing human to machine –Evaluating speech coders and potential forensic applications Schmidt-Nielsen and Crystal used the NIST evaluation (DSP Journal, January 2000) –Same amount of training data –Matched handset-type tests –Mismatched handset-type tests –Used 3-sec conversational utterances from telephone speech Humans have more robustness to channel variabilities –Use different levels of information [Chart of computer vs. human error rates: humans 44% better on one handset condition, 15% worse on the other]

86 Performance Survey Human Forensic Performance In 1986, the Federal Bureau of Investigation published a survey of two thousand voice identification comparisons made by FBI examiners –Forensic comparisons completed over a period of fifteen years, under actual law enforcement conditions –The examiners had a minimum of two years' experience and had completed over 100 actual cases –The examiners used both aural and spectrographic methods –http://www.owlinvestigations.com/forensic_articles/aural_spectrographic/fulltext.html#research

Decisions comparing a suspect to a recorded threat: No decision: 65.2% (1304) Non-match: 18.8% (378), false reject = 0.53% (2) Match: 15.9% (318), false accept = 0.31% (1)

From “Spectrographic voice identification: A forensic survey,” J. Acoust. Soc. Am., 79(6), June 1986, Bruce E. Koenig

87 Performance Survey Comparison to Other Biometrics Raw accuracy is generally not a good way to compare different biometric techniques –The application will dictate other important factors –See “Fundamentals of Biometric Technology” at http://www.engr.sjsu.edu/biometrics/publications_tech.html for a good discussion and comparison of biometrics

Error incidence by biometric – Fingerprints: dryness, dirt, age; Hand geometry: hand injury, age; Retina: glasses; Iris: poor lighting; Face: lighting, hair, glasses, age; Signature: changing signatures; Voice: microphones, channels, noise, colds [The full table also rates each biometric's ease of use, accuracy, user acceptance, and long-term stability, ranging from medium to very high]

From “A Practical Guide to Biometric Security Technology,” IEEE Computer Society, IT Pro – Security, Jan-Feb 2001, Simon Liu and Mark Silverman

88 Performance Survey Comparison to Other Biometrics From the CESG Biometric Test Programme Report (http://www.cesg.gov.uk/biometrics/)

89 Speaker Verification: From Research to Reality Part III : Applications and Deployments

90 Part III : Applications and Deployments Outline Brief overview of commercial speaker verification systems Design requirements for commercial verification systems –General considerations –Dialog design Steps to deploying speaker verification systems –Initial data collection –Tuning –Limited Deployment and Final Rollout Examples of real deployments

91 Part III : Applications and Deployments Outline Brief overview of commercial speaker verification systems Design requirements for commercial verification systems –General considerations –Dialog design Steps to deploying speaker verification systems –Initial data collection –Tuning –Limited Deployment and Final Rollout Examples of real deployments

92 Commercial Speaker Verification [Timeline, 1980–2001, spanning small-scale deployments (100s of users) to large-scale deployments (1M+ users):] Mid-1980s – Access control: TI corporate facility (Texas Instruments) Around 1990 – Law enforcement: home incarceration (ITT Industries); Telecom: Sprint's Voice FONCARD (Texas Instruments) Mid-1990s – Law enforcement: prison call monitoring (T-Netix) By 2001 – Commerce: Home Shopping Network (Nuance); Financial: Charles Schwab (Nuance); Telecom: Swisscom (Nuance); Access control: Mac OS 9 (Apple)

93 Applications and Deployments Commercial Speaker Verification Systems [Vendor logos, including the Nuance Verifier]

94 Part III : Applications and Deployments Outline Brief overview of commercial speaker verification systems Design requirements for commercial verification systems –General considerations –Dialog design Steps to deploying speaker verification systems –Initial data collection –Tuning –Limited Deployment and Final Rollout Examples of real deployments

95 Applications and Deployments Design Requirements in Commercial Verifiers Requirements: Fast –Example: 50 simultaneous verifications on a single PIII 500MHz processor Accurate –< 0.1% FAR @ < 5% FRR with ~1-5% reprompt rate Robust (channel/noise variability) Compact storage of speaker models –< 100KB/model with support for 1M+ users on standard DBs (e.g., Oracle) Scalable (1 million+ users with standard DBs) Easy to deploy International language/region support Variety of operating modes: –Text-independent, text-prompted, text-dependent Fully integrated with a state-of-the-art speech recognizer Biggest challenge: robustness to channel variability

96 Applications and Deployments Design Requirements: Online Adaptation Online unsupervised adaptation is one of the most powerful technologies to address robustness & ease of deployment Additional requirements: –Minimize cross-channel corruption Adapting on cellular should improve performance on an office phone –Minimize cross-channel effects with no growth in storage Save new information from additional channels in a single channel model –Minimize model corruption from impostor attack SMS with online adaptation (Heck, ICSLP 2000): –Addresses the above requirements –5222 speakers, 8 calls @ 12.5% impostor attack rate: 61% reduction in EER (unsupervised)

97 Applications and Deployments Scalability of Commercial Verifiers [Architecture diagram: clients and IVR system/functions connect through a speech-enabled application over an IP network to a resource manager, which dispatches to speech recognition and speaker recognition servers – scalable, per-utterance fault tolerant, and cost efficient]

98 Applications and Deployments Dialog Design: General Principles The dialog should be designed to be secure and convenient –Security is often compromised by users if the dialog is not convenient Example: 4-digit PIN Security = 1 out of 10,000 false accepts? No: users compromise the security of PINs to make them easier to remember (writing them down in a wallet, storing them on-line, etc.) The dialog should be maximally constrained but flexible –More constraints mean better accuracy for fixed-length training –Example: digit sequences balance constraints on the acoustic space with flexibility Dialog design goal: a constrained but flexible dialog that maximizes security while maintaining convenience

99 Applications and Deployments Dialog Design: Rules of Thumb Enrollment: –must be secure (e.g., rely on knowledge) –should be completed in a single session The identity claim should be: –unique (but perhaps not for multi-user accounts) –easy to recognize over large populations –useful for simultaneous verification Verification utterances should be: –easy to remember YES: SSN, DOB, home telephone number NO: PIN, password –easy to recognize (by both recognizer and verifier) –perceived as short, but containing lots of speech Names: “Smith S M I T H” Digits: “3 5 6 7, 3 5 6 7” –known only by the user –widely accepted by the user population (e.g., not too private) –difficult to record/synthesize

100 Applications and Deployments Dialog Design: Simultaneous ID Claim/Verification Simultaneous identity claim and verification: Buffer the identity-claim utterance (“My name is John Doe”) Recognize the identity claim and retrieve the corresponding model (e.g., “john_doe.model”) “Re-process” the buffered data by verifying the same utterance against the retrieved model

101 Applications and Deployments Dialog Design: Confidence-based Reprompting Confidence-based reprompting minimizes the average length of the authentication process Improve the effective FAR/FRR by reprompting when unsure The “reprompt rate” (RPR) is controlled by two new thresholds – an initial positive and an initial negative score threshold – with associated error rates (FAR1, FRR1) and (FAR2, FRR2): RPR = Pr(spkr) (FRR1 – FRR2) + Pr(imp) (FAR2 – FAR1) Example, 1st utterance: FAR = 0.07%, FRR = 0.2%, RPR = 5.7%
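The RPR formula can be checked directly. In this sketch the prior and the four operating-point error rates are placeholder values, not the slide's numbers:

```python
def reprompt_rate(frr1, frr2, far1, far2, p_spkr=0.5):
    """Expected reprompt rate when scores falling between the two initial
    thresholds trigger a second utterance (formula from the slide).
    (frr1, far1) and (frr2, far2) are the error rates at the two thresholds;
    p_spkr is the (assumed) prior probability of a true-speaker attempt."""
    p_imp = 1.0 - p_spkr
    return p_spkr * (frr1 - frr2) + p_imp * (far2 - far1)
```

Widening the gap between the two thresholds lowers the effective FAR/FRR but raises the fraction of callers who must speak again.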

102 Applications and Deployments Dialog Design: Knowledge Verification [Example dialog:] “Please enter your account number” – “5551234” “Say your date of birth” – “October 13, 1964” “You're accepted by the system” [Diagram: knowledge verification recognizes “what you know” against a knowledge base, while speaker verification recognizes “who you are” against voice prints; the two results are combined into an accept/reject decision]

103 Applications and Deployments Dialog Design: Knowledge Verification Methods to combine knowledge and speaker verification: Sequential (“and” of decisions – both must accept): FAR = FAR(sv) * FAR(kv) FRR = FRR(kv) + (1 - FRR(kv)) * FRR(sv) Parallel (“or” of decisions – either may accept): FAR = FAR(sv) + FAR(kv) FRR = FRR(sv) * FRR(kv) Weighted scores: combine the KV and SV scores before making a single decision

104 Applications and Deployments Dialog Design: Knowledge Verification Example: sequential combination of decisions – easy to implement, focuses on improving overall security FAR = FAR(sv) * FAR(kv): 0.01% = 0.1% * 10% FRR = FRR(kv) + (1 - FRR(kv)) * FRR(sv): 1.1% = 0.1% + (1 - 0.1%) * 1%
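The worked numbers above can be verified mechanically. The sequential formulas are exactly the slide's; for the parallel case this sketch uses the exact FAR, 1 - (1 - FAR(kv))(1 - FAR(sv)), which reduces to the slide's sum when the rates are small:

```python
def sequential(far_sv, frr_sv, far_kv, frr_kv):
    """Error rates when knowledge verification is followed by speaker
    verification and BOTH must accept ('and' of decisions)."""
    far = far_sv * far_kv
    frr = frr_kv + (1.0 - frr_kv) * frr_sv
    return far, frr

def parallel(far_sv, frr_sv, far_kv, frr_kv):
    """Error rates when the caller is accepted if EITHER verifier
    accepts ('or' of decisions); exact form of the FAR union."""
    far = far_kv + (1.0 - far_kv) * far_sv
    frr = frr_sv * frr_kv
    return far, frr

# Slide's numbers: FAR(sv)=0.1%, FRR(sv)=1%, FAR(kv)=10%, FRR(kv)=0.1%
seq_far, seq_frr = sequential(0.001, 0.01, 0.10, 0.001)
```

Sequential combination trades a small FRR increase (1% to 1.1%) for a large FAR reduction (0.1% to 0.01%).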

105 Applications and Deployments Dialog Design: Security Against Recordings Prompted text verification: prompt the user to repeat a random phrase –Example: “Please say 82-32-71, 82-32-71” –Serves as a “liveness” test Requires modification of the enrollment dialog –(typically) longer enrollment to adequately cover the acoustics [Diagram: the prompted phrase (e.g., 82-32-71) goes to a speech recognizer (phone models /a/ /b/ /c/ …) producing a text score, and to a speaker verifier (speaker HMMs vs. impostor HMMs) producing a speaker score; the two scores are combined]

106 Part III : Applications and Deployments Outline Brief overview of commercial speaker verification systems Design requirements for commercial verification systems –General considerations –Dialog design Steps to deploying speaker verification systems –Initial data collection –Tuning –Limited Deployment and Final Rollout Examples of real deployments

107 Applications and Deployments Deployment Steps [Pipeline: Initial Data Collection → Tune → Limited Deployment → Tune → Rollout]

108 Applications and Deployments Deployment Steps: Initial Data Collection How do you collect the data? “Probing/sampling” approach? –Employ persons to call the application under supervision –Not widely used (too difficult to collect enough data) –Need ~50 callers/gender, each enrolling/verifying several times (across multiple channels) –Need to observe 30 errors of each type/condition (“rule of 30”) Assessment from actual in-field data? –Much easier to get volumes of data, and more realistic –Impostor trials: use a common enrollment utterance for impostor trials –True speaker trials: sort scores and manually transcribe poor-scoring utterances

109 Applications and Deployments Deployment: Tuning What components can be tuned? Operating point (threshold) –Setting the operating point a priori is very difficult! –Speaker-independent and/or speaker-dependent thresholds? –Picking the correct operating point is key to a successful deployment! Dialog design –Customer feedback and/or usage patterns can be used to simplify the dialog design (e.g., removing confirmation steps, reducing the reprompt rate) Impostor models (acoustic) –Training with real application data results in more competitive impostor models with better representation of linguistics, noise, and channels

110 Applications and Deployments Deployment Steps: Limited Deployment/Rollout What steps are there to deployment? Begin with a limited set of actual users –Representative of the entire caller population –Representative sampling of the (telephone) network –Representative of noise and channel mismatch conditions After rollout, track the following statistics: –Successful enrollment sessions (# of speaker models) –Successful verification sessions –In-grammar/out-of-grammar analysis (recognition) –Verification rejects (correct & false) for each speaker –Duration of sessions

111 Part III : Applications and Deployments Outline Brief overview of commercial speaker verification systems Design requirements for commercial verification systems –General considerations –Dialog design Steps to deploying speaker verification systems –Initial data collection –Tuning –Limited Deployment and Final Rollout Examples of real deployments

112 Applications and Deployments First High-Volume Deployment Application: Speaker verification and identification based on home phone number Provides secure access to customer record & credit card information Implementation: Nuance Verifier (TM) Edify telephony platform Deployed July 1999 Benefits: Security Personalization Size & volume: 600k customers enrolled, currently @ 20K calls/day Full deployment: 5 million customers @ 170K calls/day

113 Applications and Deployments First High-Volume Deployment [Sample dialogs: a successful enrollment and a successful authentication]

114 Applications and Deployments Transaction Authentication Toll fraud prevention Telephone credit card purchases Telephone brokerage (e.g., stock trading)

115 Applications and Deployments 1st Large-Scale High-Security Deployment Charles Schwab “Service Broker” “No PIN to remember, no PIN to forget” –Built on the Nuance Verifier –Pilot (2000): 10,000 users (SF bay area, NY) –Deployment: ~3 million users –National rollout beginning Q3, 2001 [Dialog flow: the caller speaks an account number, then a random 4-digit phrase; if the verifier is not confident, a second utterance is requested, with a PIN as the final fallback before the decision]

116 Applications and Deployments Law Enforcement Monitoring –Remote time and attendance logging –Home parole verification –Prison telephone usage ITT: SpeakerKey – telephone-based home incarceration service Deployed at 10 sites in Wisconsin, Ohio, Pennsylvania, Georgia, and California More than 12,000 home incarceration sessions in June 1995 0.66% false acceptance, 4.3% false rejection T-Netix: Contain – validating the identity and location of a parolee PIN-LOCK – validating the identity of an inmate prior to allowing an outbound prison call Deployed in Arizona, Colorado, and Maryland 10K inmates using PIN-LOCK Roughly 25,000 - 30,000 verifications performed daily

117 Applications and Deployments Demonstrations Nuance 1-888-NUANCE-8 –http://www.nuance.com/demos/demo-shoppingnetwork.html T-Netix 1-800-443-2748 –http://www.t-netix.com/SpeakEZ/SpeakEZDemo.html ITT –http://www.buytel.com/WebKey/index.asp Voice Security –http://www.Voice-Security.com/KeyPad.html

118 Speaker Verification: From Research to Reality Recap Part I : Background and Theory –Major concepts behind theory and operation of modern speaker verification systems Part II : Evaluation and Performance –Key elements in evaluating performance of a speaker verification system Part III : Applications and Deployments –Main issues and tasks in deploying a speaker verification system

119 Conclusions Speaker recognition is one of the few recognition areas where machines can outperform humans Speaker recognition technology is a viable technique currently available for applications Speaker recognition can be augmented with other authentication techniques to increase security

120 Future Directions Speaker recognition technology will become an integral part of speech interfaces Research will focus on using speaker recognition in more unconstrained, uncontrolled situations –Audio search and retrieval –Increasing robustness to channel variability –Incorporating higher levels of knowledge into decisions –Personalization of services and devices –Unobtrusive protection of transactions and information

121 To Probe Further General Resources Conferences and Workshops: 1) 2001: A Speaker Odyssey – The Speaker Recognition Workshop, Crete, Greece, 2001 http://www.odyssey.westhost.com/ 2) Reconnaissance du Locuteur et ses Applications Commerciales et Criminalistiques (RLA2C), Avignon, France, 1998 [proceedings in English] 3) ESCA Workshop on Automatic Speaker Recognition, Identification, and Verification, Martigny, Switzerland, 1994 http://www.isca-speech.org/workshops.html 4) Audio and Visual Based Person Authentication (AVBPA) 1997, 1999, 2001 http://www.hh.se/avbpa/ 5) International Conference on Acoustics, Speech and Signal Processing (ICASSP), annual [sessions on speaker recognition] http://www.icassp2001.org/ 6) European Conference on Speech Communication and Technology (Eurospeech), biennial [sessions on speaker recognition] http://eurospeech2001.org/ 7) International Conference on Spoken Language Processing (ICSLP), biennial [sessions on speaker recognition] http://www.icslp2000.org/ Journals: 1) IEEE Transactions on Speech and Audio Processing http://www.ieee.org/organizations/society/sp/tsa.html 2) Speech Communication http://www.elsevier.nl/locate/specom 3) Computer, Speech & Language http://www.academicpress.com/www/journal/0/@/0/la.htm Web: 1) The Linguistic Data Consortium http://www.ldc.upenn.edu/ 2) European Language Resources Association http://www.icp.inpg.fr/ELRA/home.html 3) NIST Speaker Recognition Benchmarks http://www.nist.gov/speech/tests/spk/index.htm 4) Joe Campbell's Site for Speaker Recognition Speech Corpora http://www.apl.jhu.edu/Classes/Notes/Campbell/SpkrRec/ 5) The Biometric Consortium http://www.biometrics.org/ 6) Comp.Speech FAQ on speaker recognition http://www.speech.cs.cmu.edu/comp.speech/Section6/Q6.6.html 7) Search for “speaker verification” in the Google search engine http://www.google.com/

To Probe Further: Selected References

Tutorials:
1) B. Atal, "Automatic recognition of speakers from their voices," Proceedings of the IEEE, vol. 64, pp. 460-475, April 1976.
2) A. Rosenberg, "Automatic speaker verification: a review," Proceedings of the IEEE, vol. 64, pp. 475-487, April 1976.
3) G. Doddington, "Speaker recognition: identifying people by their voices," Proceedings of the IEEE, vol. 73, pp. 1651-1664, November 1985.
4) D. O'Shaughnessy, "Speaker recognition," IEEE ASSP Magazine, vol. 3, pp. 4-17, October 1986.
5) J. Naik, "Speaker verification: a tutorial," IEEE Communications Magazine, vol. 28, pp. 42-48, January 1990.
6) H. Gish and M. Schmidt, "Text-independent speaker identification," IEEE Signal Processing Magazine, vol. 11, pp. 18-32, October 1994.
7) S. Furui, "An overview of speaker recognition technology," in Automatic Speech and Speaker Recognition (C.-H. Lee and F.K. Soong, eds.), pp. 31-56, Kluwer Academic, 1996.
8) J. P. Campbell, "Speaker recognition: A tutorial," Proceedings of the IEEE, vol. 85, pp. 1437-1462, September 1997.
9) Special Issue on Speaker Recognition, Digital Signal Processing, vol. 10, January 2000. http://www.idealibrary.com/links/toc/dspr/10/1/0
10) G. Doddington, M. Przybocki, A. Martin, and D. A. Reynolds, "The NIST speaker recognition evaluation - overview, methodology, systems, results, perspective," Speech Communication, vol. 31, pp. 225-254, March 2000.

Technology:
1) M. Carey, E. Parris, and J. Bridle, "A speaker verification system using alphanets," in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, pp. 397-400, May 1991.
2) G. Doddington, "Speaker recognition based on idiolectal differences between speakers," in Proceedings of the European Conference on Speech Communication and Technology, 2001.
3) C. Fredouille, J. Mariethoz, C. Jaboulet, J. Hennebert, J.-F. Bonastre, C. Mokbel, and F. Bimbot, "Behavior of a Bayesian adaptation method for incremental enrollment in speaker verification," in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, 2000.
4) L. P. Heck and M. Weintraub, "Handset-dependent background models for robust text-independent speaker recognition," in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, pp. 1071-1073, April 1997.

To Probe Further: Selected References

5) L. Heck and N. Mirghafori, "On-line unsupervised adaptation for speaker verification," in Proceedings of the International Conference on Spoken Language Processing, 2000.
6) L. Heck, Y. Konig, M.K. Sonmez, and M. Weintraub, "Robustness to telephone handset distortion in speaker recognition by discriminative feature design," Speech Communication, vol. 31, pp. 181-192, 2000.
7) H. Hermansky, N. Morgan, A. Bayya, and P. Kohn, "RASTA-PLP speech analysis technique," in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, pp. I.121-I.124, March 1992.
8) A. Higgins, L. Bahler, and J. Porter, "Speaker verification using randomized phrase prompting," Digital Signal Processing, vol. 1, pp. 89-106, 1991.
9) B. Koenig, "Spectrographic voice identification: A forensic survey," Journal of the Acoustical Society of America, vol. 79, pp. 2088-2090, June 1986.
10) Q. Li and B.-H. Juang, "Speaker verification using verbal information verification for automatic enrollment," in Proceedings of the International Conference on Spoken Language Processing, 1998.
11) A. Martin, G. Doddington, T. Kamm, M. Ordowski, and M. Przybocki, "The DET curve in assessment of detection task performance," in Proceedings of the European Conference on Speech Communication and Technology, pp. 1895-1898, 1997.
12) T. Matsui and S. Furui, "Comparison of text-independent speaker recognition methods using VQ-distortion and discrete/continuous HMMs," in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, pp. II-157 to II-164, March 1992.
13) M. Newman, L. Gillick, Y. Ito, D. McAllaster, and B. Peskin, "Speaker verification through large vocabulary continuous speech recognition," in Proceedings of the International Conference on Spoken Language Processing, pp. 2419-2422, 1996.
14) T.F. Quatieri, D.A. Reynolds, and G.C. O'Leary, "Estimation of handset nonlinearity with application to speaker recognition," IEEE Transactions on Speech and Audio Processing, August 2000.
15) D. A. Reynolds, "Speaker identification and verification using Gaussian mixture speaker models," Speech Communication, vol. 17, pp. 91-108, August 1995.

To Probe Further: Selected References

16) D. A. Reynolds, "HTIMIT and LLHDB: Speech corpora for the study of handset transducer effects," in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, pp. 1535-1538, April 1997.
17) D. A. Reynolds, "Comparison of background normalization methods for text-independent speaker verification," in Proceedings of the European Conference on Speech Communication and Technology, pp. 963-967, September 1997.
18) D. Reynolds, M. Zissman, T. Quatieri, G. O'Leary, and B. Carlson, "The effects of telephone transmission degradations on speaker recognition performance," in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, pp. 329-332, May 1995.
19) D.A. Reynolds and B.A. Carlson, "Text-dependent speaker verification using decoupled and integrated speaker and speech recognizers," in Proceedings of the European Conference on Speech Communication and Technology, pp. 647-650, September 1995.
20) D.A. Reynolds, "The effects of handset variability on speaker recognition performance: Experiments on the Switchboard corpus," in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, pp. 113-116, May 1996.
21) A.E. Rosenberg and C.-H. Lee, "Connected word talker verification using whole word hidden Markov models," in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, pp. 381-384, 1991.
22) A. E. Rosenberg and S. Parthasarathy, "Speaker background models for connected digit password speaker verification," in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, pp. 81-84, May 1996.
23) F. Soong and A. Rosenberg, "On the use of instantaneous and transitional spectral information in speaker recognition," in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, pp. 877-880, 1986.
24) R. Teunen, B. Shahshahani, and L. Heck, "A model-based transformational approach to robust speaker recognition," in Proceedings of the International Conference on Spoken Language Processing, 2000.

