Speaker Authentication. Qi Li and Biing-Hwang Juang, Pattern Recognition in Speech and Language Processing, Chap. 7. Reporter: Chang Chih Hao.

2 Outline Introduction; Pattern Recognition Techniques; Speaker Verification System; Verbal Information Verification; Speaker Authentication by Combining SV and VIV

3 Introduction To ensure the security of and proper access to private information, important transactions, and computer and communication networks, passwords and personal identification numbers (PINs) have been used extensively in daily life. To further enhance both security and convenience, biometric features have also been considered. Among all biometric features, a person's voice is the most convenient for personal identification because it is easy to produce, capture, and transmit over the ubiquitous telephone network, without requiring special devices.

4 Introduction Speaker recognition – Speaker verification: verify whether an unknown speaker is the person he or she claims to be, i.e., a yes-no hypothesis testing problem. – Speaker identification: associate an unknown speaker with a member of a pre-registered group, i.e., a multiple-choice classification problem.

5 Introduction There are two approaches to speaker authentication: – Speaker verification (SV) verifies a speaker's identity based on his/her voice characteristics. – Verbal information verification (VIV) verifies a speaker's identity through verification of the content of his/her utterance.

6 Introduction Speaker Verification Two sessions (direct method): – Enrollment session: the user's identity, together with a pass-phrase, is assigned to the speaker, and a speaker-dependent (SD) model registering the speaker's speech characteristics is trained. – Test session: the speaker claims his/her identity, the system prompts the speaker to say the pass-phrase, and the pass-phrase utterance is compared against the stored SD model. Obviously, successful verification of a speaker relies upon correct recognition of the speech input.

7 Introduction Speaker Verification

8 Enrollment is an inconvenience to the user as well as to the system developer, who often has to supervise and ensure the quality of the collected data. The quality of the collected training data has a critical effect on the performance of an SV system: – the speaker may make a mistake; – there may be an acoustic mismatch between the training and testing environments.

9 Introduction Verbal Information Verification VIV is the process of verifying spoken utterances against the information stored in a given personal data profile. A VIV system may use a dialogue procedure to verify a user by asking questions (indirect method). Differences between SV and VIV: – Models: SD model vs. acoustic-phonetic models. – Stored data: voice data vs. personal data profile. – Rejecting an impostor: pre-trained SD model vs. the user's responsibility.

10 Introduction Verbal Information Verification

11 Pattern Recognition Bayesian Decision Theory M-class recognition problem: – Given an observation o and a set of classes designated as {C1, C2, ..., CM}. – We are asked to make a decision, i.e., to classify o into a class Ci; denote this as action αi.

12 Pattern Recognition Bayesian Decision Theory The zero-one loss function describes the loss incurred for taking action αi when the true class is Cj; the expected loss (risk) associated with taking action αi is obtained by averaging this loss over the class posteriors.
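The loss and risk equations on this slide appear to have been images that did not survive transcription; a standard reconstruction under the notation above (classes Cj, actions αi, observation o) is:

```latex
\lambda(\alpha_i \mid C_j) =
\begin{cases}
0, & i = j \\
1, & i \neq j
\end{cases}
\qquad
R(\alpha_i \mid \mathbf{o}) = \sum_{j=1}^{M} \lambda(\alpha_i \mid C_j)\, P(C_j \mid \mathbf{o})
= 1 - P(C_i \mid \mathbf{o})
```

The last equality shows that minimizing the zero-one risk is equivalent to maximizing the posterior P(Ci | o).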

13 Pattern Recognition Bayesian Decision Theory Minimum-error-rate classification – To minimize the loss, we take the action αi that maximizes the posterior probability. For a sequence of observations, assume the observations are independent and identically distributed (i.i.d.).
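The minimum-error-rate rule with i.i.d. observations can be sketched as follows; the two toy 1-D Gaussian classes and their parameter values are illustrative assumptions, not from the chapter.

```python
import math

# Toy class-conditional models: one 1-D Gaussian per class (illustrative values).
CLASSES = {
    "C1": {"prior": 0.5, "mean": 0.0, "var": 1.0},
    "C2": {"prior": 0.5, "mean": 3.0, "var": 1.0},
}

def log_gauss(x, mean, var):
    """Log density of a univariate Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def map_decision(observations):
    """Pick the class maximizing log P(C) + sum_t log p(o_t | C) (i.i.d. frames)."""
    best, best_score = None, -math.inf
    for name, c in CLASSES.items():
        score = math.log(c["prior"]) + sum(
            log_gauss(o, c["mean"], c["var"]) for o in observations)
        if score > best_score:
            best, best_score = name, score
    return best
```

Under the i.i.d. assumption the per-frame log-likelihoods simply add, so a frame sequence near 0 is classified as "C1" and one near 3 as "C2".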

14 Pattern Recognition Stochastic Models for Stationary Process Gaussian mixture model (GMM) – Characterizes the probability density function (pdf) of speech features.
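The density on this slide was likely an equation image; the standard GMM form, with mixture weights summing to one, is:

```latex
p(\mathbf{o} \mid \lambda) = \sum_{m=1}^{M} w_m \, \mathcal{N}(\mathbf{o}; \boldsymbol{\mu}_m, \boldsymbol{\Sigma}_m),
\qquad \sum_{m=1}^{M} w_m = 1, \quad w_m \ge 0
```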

15 Pattern Recognition Stochastic Models for Stationary Process The GMM parameters can be estimated iteratively using the Baum-Welch or EM algorithm.
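A minimal sketch of EM re-estimation for a 1-D GMM follows; the range-based initialization and fixed iteration count are simplifying assumptions, not the chapter's procedure.

```python
import math

def em_gmm_1d(data, n_components=2, n_iter=50):
    """Fit a one-dimensional GMM with EM (a minimal sketch, not production code)."""
    lo, hi = min(data), max(data)
    # Spread initial means across the data range; unit variances, uniform weights.
    mus = [lo + (hi - lo) * m / (n_components - 1) for m in range(n_components)]
    vars_ = [1.0] * n_components
    ws = [1.0 / n_components] * n_components
    for _ in range(n_iter):
        # E-step: responsibilities gamma[t][m] proportional to w_m * N(x_t; mu_m, var_m)
        gamma = []
        for x in data:
            dens = [w / math.sqrt(2 * math.pi * v) * math.exp(-(x - mu) ** 2 / (2 * v))
                    for w, mu, v in zip(ws, mus, vars_)]
            s = sum(dens)
            gamma.append([d / s for d in dens])
        # M-step: re-estimate mixture weights, means, and variances
        for m in range(n_components):
            nm = sum(g[m] for g in gamma)
            ws[m] = nm / len(data)
            mus[m] = sum(g[m] * x for g, x in zip(gamma, data)) / nm
            vars_[m] = max(sum(g[m] * (x - mus[m]) ** 2
                               for g, x in zip(gamma, data)) / nm, 1e-6)
    return ws, mus, vars_
```

On well-separated data (e.g. samples clustered near 0 and near 5) the means converge to the two cluster centers.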

16 Pattern Recognition Stochastic Models for Stationary Process One application of the above model is context-independent speaker identification, where each speaker's speech is assumed to be characterized only acoustically and is represented by one class. When a spoken utterance is long enough, it is reasonable to assume that its acoustic characteristics are independent of its content.

17 Pattern Recognition Stochastic Models for Non-Stationary Process Hidden Markov model (HMM) – Applied to characterize both the temporal structure and the corresponding statistical variations along the parameter trajectory of an utterance. – N-state, left-to-right model. – Within each state, a GMM is used to characterize the observed speech feature vectors as a multivariate distribution. – Three parameter sets: A, the state transition probabilities; B, the observation densities; π, the initial state probabilities.
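Decoding with such a left-to-right HMM can be sketched with a small Viterbi routine; the interface below (precomputed per-frame emission log-probabilities, transitions restricted to self-loop or next state, forced start and end states) is an illustrative assumption.

```python
import math

def viterbi_left_to_right(log_b, log_a):
    """Viterbi decoding for an N-state left-to-right HMM.

    log_b[t][i]: log observation density of frame t in state i.
    log_a[i][j]: log transition probability; only j == i (self-loop) or
                 j == i + 1 (advance) is considered.
    The path is forced to start in state 0 and end in state N - 1.
    """
    T, N = len(log_b), len(log_b[0])
    NEG = -math.inf
    delta = [[NEG] * N for _ in range(T)]
    back = [[0] * N for _ in range(T)]
    delta[0][0] = log_b[0][0]  # left-to-right: must start in state 0
    for t in range(1, T):
        for j in range(N):
            # Predecessors: stay in j, or advance from j - 1.
            cands = [(delta[t - 1][j] + log_a[j][j], j)]
            if j > 0:
                cands.append((delta[t - 1][j - 1] + log_a[j - 1][j], j - 1))
            best, arg = max(cands)
            delta[t][j] = best + log_b[t][j]
            back[t][j] = arg
    # Backtrack from the final state N - 1.
    path = [N - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path)), delta[T - 1][N - 1]
```

With two states and emissions favoring state 0 in the first frames and state 1 afterwards, the decoded path is the expected 0…0 1…1 segmentation.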

18 Pattern Recognition Stochastic Models for Non-Stationary Process

19 Speaker Verification System

20 Speaker Verification System

21 Speaker Verification System Test session – After a speaker claims his/her identity, the system expects the user to speak the same phrase as in the enrollment session. – The voice waveform is converted to the feature representation. – Forced alignment block: a sequence of speaker-independent phoneme models is constructed, and the model sequence is then used to segment and align the feature vector sequence via the Viterbi algorithm. – Cepstral mean subtraction block: silence frames are removed, and the mean vector is computed from the remaining speech frames.
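The cepstral mean subtraction block can be sketched as follows, assuming silence frames have already been flagged (e.g., by the alignment step):

```python
def cepstral_mean_subtraction(frames, is_speech):
    """Remove silence frames, then subtract the per-utterance cepstral mean.

    frames: list of feature vectors (lists of floats).
    is_speech: parallel list of booleans marking speech frames.
    A minimal sketch of the CMS block described above.
    """
    speech = [f for f, keep in zip(frames, is_speech) if keep]
    dim = len(speech[0])
    mean = [sum(f[d] for f in speech) / len(speech) for d in range(dim)]
    return [[f[d] - mean[d] for d in range(dim)] for f in speech]
```

Subtracting the utterance-level mean removes a constant channel bias from the cepstral features, which is why it helps against training/testing mismatch.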

22 Speaker Verification System Fixed-phrase system – A user-selected phrase is easy to remember. – Performs better than a text-prompted system. Model – SD left-to-right HMM. – Whole-word or whole-phrase models. Feature extraction – Sampled at 8 kHz. – 30 ms frames, overlapping by 10 ms. – Pre-emphasis and Hamming windowing. – 10th-order LPC analysis. – Converted to cepstral coefficients plus delta cepstral coefficients (24 dimensions).
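The front end of this feature extraction can be sketched as follows. Two assumptions: the slide's "overlapping 10 ms" is interpreted here as a 10 ms frame shift (a common convention), and the 0.97 pre-emphasis coefficient is a typical default; the LPC analysis and cepstral conversion steps are omitted.

```python
import math

def frame_signal(samples, sample_rate=8000, frame_ms=30, shift_ms=10, preemph=0.97):
    """Pre-emphasize, split into overlapping frames, and apply a Hamming window."""
    # Pre-emphasis: s'[n] = s[n] - a * s[n-1]
    emph = [samples[0]] + [samples[n] - preemph * samples[n - 1]
                           for n in range(1, len(samples))]
    flen = sample_rate * frame_ms // 1000    # 240 samples at 8 kHz
    shift = sample_rate * shift_ms // 1000   # 80 samples at 8 kHz
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (flen - 1))
              for n in range(flen)]
    frames = []
    for start in range(0, len(emph) - flen + 1, shift):
        frames.append([emph[start + n] * window[n] for n in range(flen)])
    return frames
```

For 100 ms of 8 kHz audio (800 samples) this yields 8 frames of 240 windowed samples each.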

23 Speaker Verification System Test session – Target score computation. – Background score computation. – Likelihood-ratio test.
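A sketch of the decision step: the target and background scores are combined into a duration-normalized log likelihood ratio and compared to a threshold. The per-frame normalization and the zero default threshold are assumptions for illustration.

```python
def likelihood_ratio_test(target_loglik, background_loglik, n_frames, threshold=0.0):
    """Duration-normalized log likelihood-ratio decision (a sketch).

    target_loglik: log p(O | SD model); background_loglik: log p(O | background model).
    Accept the claimed identity when the normalized ratio meets the threshold.
    """
    score = (target_loglik - background_loglik) / n_frames
    return score >= threshold, score
```

Normalizing by the frame count keeps the threshold comparable across utterances of different lengths.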

24 Speaker Verification System Experimentation Database – Training: 100 speakers, 51 male and 49 female; average utterance length of 2 seconds; five utterances from each speaker, recorded in one enrollment session. – Testing: 50 utterances recorded from each true speaker in different sessions, and 200 utterances recorded from the 51 or 49 same-gender impostors in different sessions.

25 Speaker Verification System Experimentation Models – SD model: left-to-right HMMs; the number of states depends on the total number of phonemes in the phrase; 4 Gaussian mixture components are associated with each state. – SI model: 43 HMMs corresponding to 43 phonemes; 3 states per model, 32 mixture components per state; a common variance is shared by all Gaussian components. – Adaptation: the second, fourth, sixth, and eighth test utterances from the true speaker, recorded at different times, are used to update the means and mixture weights of the SD HMM for verifying subsequent test utterances.

26 Speaker Verification System Experimentation In general, the longer the pass-phrase, the higher the accuracy. The actual system performance would be better when users choose their own, most likely different, pass-phrases.

27 Verbal Information Verification – Automatic speech recognition (ASR): the spoken input is transcribed into a sequence of words, and the transcribed words are then compared to the information pre-stored in the claimed speaker's personal profile. – Utterance verification (UV): the spoken input is verified against an expected sequence of words or subwords, taken from the personal data profile of the claimed individual.

28 Verbal Information Verification Utterance verification (single question) – Keyword spotting and non-keyword rejection. Three key modules – Utterance segmentation by forced decoding. – Subword testing. – Utterance-level confidence measure.

29 Verbal Information Verification

30 Verbal Information Verification Utterance Segmentation – Each piece of key information is represented by a sequence of words, S, which in turn is equivalently characterized by a concatenation of a sequence of phonemes or subwords, where N is the total number of subwords in the keyword sequence.

31 Verbal Information Verification Subword Hypothesis Testing – H0: the observed speech On consists of the actual sound of subword Sn. – H1: the alternative hypothesis. – Target model: trained using the data of subword Sn. – Anti-HMM: trained using the data of a set of subwords.
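In standard notation (λ_{S_n} the target HMM, λ̄_{S_n} the anti-HMM, τ a decision threshold; symbols assumed here, since the slide's equation did not survive transcription), the subword test is:

```latex
LR(O_n) = \frac{p(O_n \mid \lambda_{S_n})}{p(O_n \mid \bar{\lambda}_{S_n})}
\;\;
\begin{cases}
\ge \tau & \text{accept } H_0 \\
< \tau & \text{reject } H_0 \text{ (accept } H_1\text{)}
\end{cases}
```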

32 Verbal Information Verification Confidence Measure Calculation – Decisions are made at both the subword and the utterance level: at the subword level, a likelihood-ratio test can be conducted to accept or reject each subword; at the utterance level, a simple utterance score can be computed as the percentage of accepted subwords. – Normalized confidence measure.
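The utterance-level score as the fraction of accepted subwords can be sketched as follows; the zero default threshold on the per-subword log likelihood ratio is an assumption.

```python
def utterance_confidence(subword_log_lrs, subword_threshold=0.0):
    """Utterance-level confidence as the fraction of accepted subwords (a sketch).

    subword_log_lrs: per-subword log likelihood-ratio scores from the
    subword hypothesis tests.
    """
    accepted = sum(1 for s in subword_log_lrs if s >= subword_threshold)
    return accepted / len(subword_log_lrs)
```

For example, if three of four subwords pass their tests, the utterance score is 0.75.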

33 Verbal Information Verification Sequential Utterance Verification – Definition 1: the false rejection error on J utterances is the error when the system rejects a correct response in any one of the J hypothesis subtests. – Definition 2: the false acceptance error on J utterances is the error when the system accepts an incorrect set of responses after all of the J hypothesis subtests. – Definition 3: the equal-error rate on J utterances is the rate at which the false rejection and false acceptance error rates on J utterances are equal.
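Assuming the J subtests are statistically independent, these definitions imply the standard combination rules (a reconstruction, since the slide's formulas are not shown):

```latex
E_r(J) = 1 - \prod_{j=1}^{J} \bigl(1 - \varepsilon_r(j)\bigr)
\approx \sum_{j=1}^{J} \varepsilon_r(j),
\qquad
E_a(J) = \prod_{j=1}^{J} \varepsilon_a(j)
```

Rejections accumulate additively (any single failed subtest rejects a true speaker), while acceptances multiply (an impostor must pass every subtest), which is why sequential testing drives the false acceptance rate down sharply.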

34 Verbal Information Verification Example – A bank operator usually asks two kinds of personal questions when verifying a customer. When automatic VIV is applied to this procedure, suppose the average individual error rates on the two subtests are εr(1) = 0.1%, εa(1) = 5% and εr(2) = 0.2%, εa(2) = 6%, respectively. The sequential test then gives Er(2) ≈ 0.3% and Ea(2) = 0.3%.
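The example's arithmetic can be checked directly; independence of the two subtests is the working assumption.

```python
def sequential_error_rates(rejection_rates, acceptance_rates):
    """Combine per-subtest error rates, assuming independent subtests (a sketch)."""
    e_r = 1.0
    for r in rejection_rates:
        e_r *= (1.0 - r)
    e_r = 1.0 - e_r          # rejected if any subtest rejects
    e_a = 1.0
    for a in acceptance_rates:
        e_a *= a             # accepted only if every subtest accepts
    return e_r, e_a
```

With the slide's numbers, `sequential_error_rates([0.001, 0.002], [0.05, 0.06])` gives approximately (0.003, 0.003), matching Er(2) ≈ 0.3% and Ea(2) = 0.3%.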

35 Verbal Information Verification VIV Experimentation – Database: 26% of the speakers have a birth year in the 1950s and 24% in the 1960s. Among the city and state names, 39% are “New Jersey”, and 5% of the speakers used exactly the same answer. 38% of the telephone numbers start from “ ”, which means that at least 60% of the digits in those answers are identical. The same speaker is used as an impostor when the utterances are verified against other speakers' profiles; thus, for each true speaker, we have three utterances from the speaker and 99 × 3 utterances from the other 99 speakers as impostors.

36 Verbal Information Verification VIV Experimentation – Features: 12 LPC cepstral coefficients + 12 delta + 12 delta-delta (39). – Models: the target phone models are 1117 right-context-dependent HMMs; the anti-models are 41 context-independent anti-phone HMMs. – Three sequential subtests (J = 3): “In which year were you born?”; “In which city and state did you grow up?”; “May I have your telephone number, please?”

37 Verbal Information Verification

38 Speaker Authentication by Combining SV and VIV Motivation – SV: users often make mistakes during enrollment. – VIV: no speaker-specific voice characteristics are used in the verification process. Procedure – The uttered pass-phrase must pass the VIV tests; otherwise, the user is prompted to repeat it. – Verified utterances of the pass-phrase are then saved and used to train an SD model for SV. – The authentication system can then be switched from VIV to SV.

39 Speaker Authentication by Combining SV and VIV

40 Speaker Authentication by Combining SV and VIV Experimentation – Training data: 100 speakers, 51 male and 49 female. The fixed phrase, common to all speakers, is “I pledge allegiance to the flag” (about 2 seconds). Five utterances of the pass-phrase, recorded in five separate VIV sessions (different environments and channels), were used to train the SD HMM. – Test data: 40 utterances recorded from each true speaker in different sessions, and 192 utterances recorded from 50 same-gender impostors in different sessions. For model adaptation, the second, fourth, sixth, and eighth test utterances from the tested true speaker were used to update the associated HMMs for verifying subsequent test utterances incrementally.

41 Speaker Authentication by Combining SV and VIV

42 Speaker Authentication by Combining SV and VIV Advantages – The system is convenient for users. – The acoustic mismatch problem is mitigated to a certain degree. – The quality of the training data is ensured. – Better authentication performance.

43 Summary The theoretical foundation – Bayesian decision theory. – Hypothesis testing. Speaker verification – Verifies speakers by their voice characteristics. – A fixed phrase gives good performance (easy to remember and convenient to use). Verbal information verification – Verifies a speaker by verbal content. – Achieves very good accuracy by applying a sequential verification technique. – It is the users' responsibility to protect their personal information from impostors.

44 Summary SV + VIV – User convenience: no formal enrollment session and no waiting for model training. – System performance: verified training data are collected over different channels and environments, so the acoustic mismatch problem is mitigated. A good speaker authentication system for real applications could come from a proper integration of speaker verification, verbal information verification, speech recognition, and text-to-speech systems.

45 Thanks