Download presentation
1
An Intro to Speaker Recognition
Nikki Mirghafori Acknowledgment: some slides borrowed from the Heck & Reynolds tutorial, and A. Stolcke.
2
Today’s class Interactive Measures of success for today:
You talk at least as much as I do You learn and remember the basics You feel you can do this stuff We all have fun with the material!
3
A 10-minute “Project Design”
You are experts with different backgrounds. Your previous startup companies were wildly successful. A large VC firm in the valley wants to fund YOUR next creation, as long as the project is in speaker recognition. The VC funding is yours, if you come up with some kind of a coherent plan/list of issues: What is your proposed application? What will be the sources of error and variability, i.e., technology challenges? What types of features will you use? What sorts of statistical modeling tools/techniques? What will be your data needs? Any other issues you can think of along your path?
4
Extracting Information from Speech
What’s noise? what’s signal? Orthogonal in many ways Use many of the same models and tools Goal: Automatically extract information transmitted in speech signal Speech Recognition Language Speaker Words Language Name Speaker Name “How are you?” English James Wilson Speech Signal Speaker conveys several levels of information. In addition to the message being spoken (words) there is also information about the language and the speaker. The aim in all automatic speech processing techniques is to extract these levels of information for further processing (database query, info for synthesis, etc.) Although shown as independent extraction paths, knowledge gained from the different levels of information can be used together for different application goals.
5
Speaker Recognition Applications
Access control Physical facilities Data and data networks Transaction authentication Telephone credit card purchases Bank wire transfers Fraud detection Monitoring Remote time and attendance logging Home parole verification Information retrieval Customer information for call centers Audio indexing (speech skimming device) Personalization Forensics Voice sample matching
6
Tasks Identification vs. verification
Closed set vs. open set identification Also, segmentation, clustering, tracking...
7
Speaker Model Database
Identification Test Speech Speaker Model Database Whose voice is it? Closed-set Speaker Identification
8
Speaker Model Database
Identification Test Speech Speaker Model Database Whose voice is it? None of the above Open-set Speaker Identification
9
Verification/Authentication/Detection
Test Speech Speaker Model Database “It’s me!” Does the voice match? Yes/No Verification requires claimant ID
10
Speech Modalities Text-dependent recognition
Recognition system knows text spoken by person Examples: fixed phrase, prompted phrase Used for applications with strong control over user input Knowledge of spoken text can improve system performance Text-independent recognition Recognition system does not know text spoken by person Examples: User selected phrase, conversational speech Used for applications with less control over user input More flexible system but also more difficult problem Speech recognition can provide knowledge of spoken text Text-Constrained recognition. Exercise for the reader. Definition of speech modalities of input speech.
11
Text-constrained Recognition
Basic idea: build speaker models for words rich in speaker information Example: “What time did you say? um... okay, I_think that’s a good plan.” Text-dependent strategy in a text- independent context by acoustics (in title) we mean low-level acoustics, i.e., cepstral feats.
12
Voice as a biometric Biometric: a human generated signal or attribute for authenticating a person’s identity Voice is a popular biometric: natural signal to produce does not require a specialized input device ubiquitous: telephones and microphone equipped PC Voice biometric with other forms of security Strongest security Definition of voice as a biometric. Potential lead in to slide describing the integration of voice and knowledge verification. Something you have - e.g., badge Are Something you know - e.g., password Know Have Something you are - e.g., voice
13
How to build a system? Feature choices:
low level (MFCC, PLP, LPC, F0, ...) and high level (words, phones, prosody, ...) Types of models: HMM, GMM, Support Vector Machines (SVM), DTW, Nearest Neighbor, Neural Nets Making decisions: Log Likelihood Thresholds, threshold setting for desired operating point Other issues: normalization (znorm, tnorm), optimal data selection to match expected conditions, channel variability, noise, etc.
14
Verification Performance
There are many factors to consider in design of an evaluation of a speaker verification system Speech quality Channel and microphone characteristics Noise level and type Variability between enrollment and verification speech Speech modality Fixed/prompted/user-selected phrases Free text Speech duration Duration and number of sessions of enrollment and verification speech Speaker population Size and composition Describe some of the dimensions of core speaker verification technology evaluation. Application specific evaluation, however, will depend on other practical considerations of throughput, ease of enrollment, etc. Most importantly: The evaluation data and design should match the target application domain of interest
15
Verification Performance
Text-independent (Read sentences) Military radio Data Multiple radios & microphones Moderate amount of training data Increasing constraints Text-independent (Conversational) Telephone Data Multiple microphones Moderate amount of training data Probability of False Reject (in %) Text-dependent (Combinations) Clean Data Single microphone Large amount of train/test speech Summary of performance variability of core verification system with respect to constraints (vocabulary and environment). This leads into application performance where various constraints and auxiliary knowledge are used to produce performance required. Text-dependent (Digit strings) Telephone Data Multiple microphones Small amount of training data Probability of False Accept (in %)
16
Verification Performance
Example Performance Curve Wire Transfer: False acceptance is very costly Users may tolerate rejections for security Application operating point depends on relative costs of the two error types High Security PROBABILITY OF FALSE REJECT (in %) Equal Error Rate (EER) = 1 % Example of DET (ROC) curve and where different applications may operate. Balance Customization: False rejections alienate customers Any customization is beneficial High Convenience PROBABILITY OF FALSE ACCEPT (in %)
17
Human vs. Machine Motivation for comparing human to machine
Humans 44% better Motivation for comparing human to machine Evaluating speech coders and potential forensic applications Schmidt-Nielsen and Crystal used NIST evaluation (DSP Journal, January 2000) Same amount of training data Matched Handset-type tests Mismatched Handset-type tests Used 3-sec conversational utterances from telephone speech Humans 15% worse Error Rates
18
Features Desirable attributes of features for an automatic system (Wolf ‘72) Occur naturally and frequently in speech Easily measurable Not change over time or be affected by speaker’s health Not be affected by reasonable background noise nor depend on specific transmission characteristics Not be subject to mimicry Practical Robust Secure Desirable attributes for automatic features. No feature has all these attributes
19
Training & Test Phases Enrollment Phase Feature Extraction Model
Model for each speaker Training speech for each speaker Feature Extraction Verification Decision Recognition Phase (e.g. Verification) ? “It’s me!” Accepted Rejected
20
Statistic computed on test utterance S as likelihood ratio:
Decision making Verification decision approaches have roots in signal detection theory 2-class Hypothesis test: H0: the speaker is an impostor H1: the speaker is indeed the claimed speaker. Statistic computed on test utterance S as likelihood ratio: Likelihood S came from speaker model Likelihood S did not come from speaker model L = log L < q reject Feature extraction Speaker Model Impostor Model Decision S + - > q accept Describe basic verification decision - likelihood ratio
21
Decision making Identification: pick model (of N) with best score
Verification: usual approach is via likelihood ratio tests, hypothesis testing, i.e.: By Bayes: P(target|x)/P(nontarget|x) = P(x|target)P(target)/P(x|nontarget)P(nontarget) accept if > threshold, reject otherwise Can’t sum over all non-target talkers -- world for SV! Use “cohorts” (collection of impostors) Train “universal”/”world”/”background” model (speaker independent, it’s trained on many speakers)
22
Spectral Based Approach
Traditional speaker recognition systems use Cepstral feaures Gaussian Mixture Models (GMMs) Speaker Model Background Model log likelihood ratio Feature Extraction Fourier Transform Magnitude Sliding window Cosine Log Adapt sliding window: 20 ms., overlapping every 10 ms Fourier transform: produces time-frequency evolution of the spectrum the number of discrete fourier transform samples by averaging frequency bins together mel or bark scale filters: low frequncy, linearly spread. high freq, logarithmic. log turns linear convolutional effects into additive biases cosine decorrelates elements in feature vector D.A. Reynolds, T.F. Quatieri, R.B. Dunn. “Speaker Verification using Adapted Gaussian Mixture Models,” Digital Signal Processing, 10(1--3), January/April/July 2000
23
Features: Levels of Information
Semantic Dialogic Idiolectal Phonetic Prosodic Spectral Low-level cues (physical characteristics) High-level cues (learned behaviors) Hierarchy of Perceptual Cues semantics, idiolects, pronunciations, idiosyncrasies socio-economic status, education, place of birth prosody, rhythm, speed intonation, volume modulation personality type, parental influence acoustic aspects of speech, nasal, deep, breathy, rough anatomical structure of vocal apparatus
24
Low level features Speech production model: source-filter interaction
Anatomical structure (vocal tract/glottis) conveyed in speech spectrum Glottal pulses Vocal tract Speech signal Describe link of speech signal to speaker specific information (anatomical structure)
25
Word N-gram Features … I_shall 0.002 I_think 0.025 I_would 0.012
Idea (Doddington 2001): Word usage can be idiosyncratic to a speaker Model speakers by relative frequencies of word N-grams Reflects vocabulary AND grammar Cf. similar approaches for authorship and plagiarism detection on text documents. First (unpublished) use in speaker recognition: Heck et al. (1998) Implementation: Get 1-best word recognition output Extract N-gram frequencies Model likelihood ratio OR Model frequency vectors by SVM I_shall 0.002 I_think 0.025 I_would 0.012 …
26
Open-loop phone recognition
Phone N-gram features Model the pattern of phone usage or “short term pronunciation” for a speaker Open-loop phone recognition Support Vector Machine (SVM) [ ] [ ] [ ] score
27
MLLR transform vectors as features
Speaker-dependent Speaker-independent Phone class B Phone class A Speaker-independent Speaker-dependent MLLR Transforms = Features
28
Models SVMs HMMs: text dep (could use whole word/phone model)
prompted (phone models) text ind’t (use LVCSR) -- or GMMs! templates DTW (if text-dependent system) nearest neighbor: frame level, training data as “model”, non-parametric neural nets: train explicitly discriminating models SVMs
29
Speaker Models -- HMM h-a-d
Speaker models (voiceprints) represent voice biometric in compact and generalizable form Modern speaker verification systems use Hidden Markov Models (HMMs) HMMs are statistical models of how a speaker produces sounds HMMs represent underlying statistical variations in the speech state (e.g., phoneme) and temporal changes of speech between the states. Fast training algorithms (EM) exist for HMMs with guaranteed convergence properties. h-a-d Describe basic model used to capture the speaker characteristics - HMM
30
Speaker Models – HMM/GMM
Form of HMM depends on the application “Open sesame” Fixed Phrase Word/phrase models /s/ /i/ /x/ Prompted phrases/passwords Phoneme models Describe the basic forms of how HMMs are used for different speech modalities General speech Text-independent single state HMM
31
Word N-gram Modeling: Likelihood Ratios
Model N-gram token log likelihood ratio Numerator: speaker language model estimated from enrollment data Denominator: background language model estimated from large speaker population Normalize by token count Choose all reasonably frequent bigrams or trigrams, or a weighted combination of both
32
Speaker Recognition with SVMs
Each speech sample (training or test) generates a point in a derived feature space The SVM is trained to separate the target sample from the impostor (= UBM) samples Scores are computed as the Euclidean distance from the decision hyperplane to the test sample point SVMs training is biased against misclassifying positive examples (typically very few, often just 1) Background sample Target sample Test sample
33
Feature Transforms for SVMs
SVMs have been a boon for higher-level (as well as cepstral speaker recognition) research – they allow great flexibility in the choice of features However, we need a “sequence kernel” Dominant approach: transform variable-length feature stream into fixed, finite-dimensional feature space Then use linear kernel All the action is in the feature transform!
34
Combination of Systems
Systems work best in combination, especially ones using “higher level” features Need to estimate optimal combination weight. E.g., use neural network Combination weights trained on a held-out development dataset GMM MMLR WordHMM PhoneNgram Neural Network Combiner
35
Variability: The Achilles Heel...
Variability (extrinsic & intrinsic) in the spectrum can cause error Data of focus has mainly been extrinsic “Channel” mismatch: Microphone carbon-button, hands-free,.. Acoustic environment Office, car, airport, ... Transmission channel Landline, cellular, VoIP, ... Three compensation approaches: Feature-based Model-based Score-based Compensation techniques help reduce error.
36
NIST Speaker Verification Evaluations
Annual NIST evaluations of speaker verification technology (since 1996) Aim: Provide a common paradigm for comparing technologies Focus: Conversational telephone speech (text-independent) Linguistic Data Consortium Data Provider Evaluation Coordinator Comparison of technologies on common task Evaluate Technology Developers Describe NIST text-independent speaker recognition evaluations.This is an example of the core research evaluation. Improve
37
The NIST Evaluation Task
Conversational telephone speech, interview Landline, cellular, hands-free, multiple-mics in room 5 min of conversations between two speakers Various conditions, e.g., Training: 8, 1, or other number of conversation sides Test: 1 conversation side, 30 secs, etc. Evaluation: Equal Error Rate (EER) Decision Cost Function (DCF) = (10, 1, 0.01)
38
The End What’s one interesting you learned today you may share with a friend over dinner conversation?
39
Backup slides
40
Word Conditional Models -- example
Boakye et al. (2004) 19 words and bi-grams Discourse markers: {actually, anyway, like, see, well, now, you_know, you_see, i_think, i_mean} Filled pauses: {um, uh} Backchannels: {yeah, yep, okay, uhhuh, right, i_see, i_know } Trained whole-word HMMs, instead of GMMs, to model evolution of speech in time Combines well with low- level (i.e., cepstral GMM) system, especially with more training data add paper ref
41
Phone N-Grams -- example
Idea (Hatch et al., ‘05): model the pattern of phone usage or “short term pronunciation” for a speaker Use open-loop phone recognition to obtain phone hypotheses Create models of relative frequencies of phone n- grams of the speaker vs. “others” Use SVM for modeling Combines well, esp. with increased data Works across languages add paper ref. add
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.