8/12/2003 Text-Constrained Speaker Recognition Using Hidden Markov Models Kofi A. Boakye International Computer Science Institute.


Text-Constrained Speaker Recognition Using Hidden Markov Models Kofi A. Boakye International Computer Science Institute

Outline
- Introduction
- Design and System Description
- Initial Results
- System Enhancements: more words, higher-order cepstra, cepstral mean subtraction
- Conclusions
- Future Work

Introduction Speaker Recognition Problem: determine whether a spoken segment comes from the putative target speaker. Also referred to as speaker verification/authentication.

Introduction The method of solution requires two phases: a training phase and a testing phase. It is similar to speech recognition, except that the "noise" (inter-speaker variability) is now the signal. [Diagram: a speaker asserts a claimed identity ("Sally"), which the system verifies.]

Introduction Also like speech recognition, different domains exist. Two major divisions:
1) Text-dependent/text-constrained: highly constrained text spoken by the person. Examples: fixed phrase, prompted phrase.
2) Text-independent: unconstrained text spoken by the person. Example: conversational speech.

Introduction Text-dependent systems can achieve high performance because of the input constraints: more of the acoustic variation arises from speaker differences (rather than from differences between phones). Text-independent systems have greater flexibility.

Introduction Question: is it possible to capitalize on the advantages of text-dependent systems in text-independent domains? Answer: yes!

Introduction Idea: limit the words of interest to a select group.
- Words should have high frequency in the domain
- Words should have high speaker-discriminative quality
What kinds of words match these criteria for conversational speech?
1) Discourse markers (like, well, now, …)
2) Filled pauses (um, uh)
3) Backchannels (yeah, right, uhhuh, …)
These words are fairly spontaneous and represent an "involuntary speaking style" (Heck, WS2002).

Design The task is a detection problem, so use a likelihood ratio detector:
Λ = p(X|S) / p(X|UBM)
Accept if Λ > Θ; reject if Λ < Θ. In the implementation, the log-likelihood is used.
[Block diagram: signal → feature extraction → background model and speaker model (adapted from the background model) → likelihood ratio Λ → accept/reject.]
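As a minimal sketch of the decision rule above (not the actual system code; the scores and threshold below are hypothetical values for illustration):

```python
def llr_detect(logp_target, logp_ubm, theta):
    """Likelihood ratio detector in the log domain:
    LLR = log p(X|S) - log p(X|UBM); accept when the LLR exceeds theta."""
    llr = logp_target - logp_ubm
    return ("accept" if llr > theta else "reject"), llr

# Hypothetical utterance scores, for illustration only.
decision, llr = llr_detect(logp_target=-420.0, logp_ubm=-437.5, theta=2.0)
```

Working in the log domain turns the ratio into a subtraction and avoids numerical underflow when many frame probabilities are multiplied.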

Design State-of-the-art speaker recognition systems use Gaussian mixture models (GMMs): a speaker's acoustic space is represented by a many-component mixture of Gaussians. [Figure: overlapping Gaussian clouds for speaker 1 and speaker 2.]

Design Speaker models are obtained via adaptation of a Universal Background Model (UBM): probabilistically align the target training data to the UBM mixture components, then update the mixture weights, means, and variances based on the (soft) counts of occupancy in each component. This gives very good performance, but…
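A sketch of means-only MAP adaptation in the style described above (a common simplification of the full update; the relevance factor r and the toy two-component UBM are illustrative assumptions, not values from the slides):

```python
import numpy as np

def map_adapt_means(weights, means, variances, frames, r=16.0):
    """MAP-adapt the UBM component means toward target data (means only).
    weights: (M,), means/variances: (M, D) diagonal, frames: (T, D)."""
    diff = frames[:, None, :] - means[None, :, :]                  # (T, M, D)
    log_dens = -0.5 * np.sum(diff**2 / variances
                             + np.log(2 * np.pi * variances), axis=2)
    log_post = np.log(weights) + log_dens                          # (T, M)
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)   # align frames to UBM components
    n = post.sum(axis=0)                      # soft occupancy counts per component
    ex = (post.T @ frames) / np.maximum(n, 1e-10)[:, None]  # data means
    alpha = (n / (n + r))[:, None]            # data-dependent interpolation weight
    return alpha * ex + (1 - alpha) * means   # blend data mean with UBM mean

# Toy 1-D UBM with two components; the target data sits near the first one.
adapted = map_adapt_means(np.array([0.5, 0.5]),
                          np.array([[0.0], [10.0]]),
                          np.ones((2, 1)),
                          np.full((50, 1), 1.0))
```

Components with little aligned data keep their UBM parameters (alpha near 0), which is what makes the later LLR cancel when no adaptation data exists.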

Design Concern: GMMs use a "bag-of-frames" approach: frames are assumed independent, so sequential information is not really exploited. Alternative: use HMMs, performing the likelihood test on the output of the recognizer, which is an accumulated log-probability score. A text-independent system of this kind has been analyzed (Weber et al., Dragon Systems); let's try a text-dependent one!

System Word-level HMM-UBM detectors: the signal passes through a word extractor into per-word detectors (HMM-UBM 1, 2, …, N), whose scores are combined into Λ.
Topology: left-to-right HMM with self-loops and no skips; 4 Gaussian components per state; the number of states is related to the number of phones in the word and the median number of frames observed for it.
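The left-to-right, no-skip topology can be sketched as a transition matrix (the transition probabilities below are placeholders; in practice HTK estimates them during training):

```python
import numpy as np

def left_to_right_transitions(n_states, p_loop=0.5):
    """Transition matrix for a left-to-right HMM with self-loops and no
    skips: each state either repeats itself or advances to the next state."""
    A = np.zeros((n_states, n_states))
    for i in range(n_states - 1):
        A[i, i] = p_loop            # self-loop
        A[i, i + 1] = 1.0 - p_loop  # advance one state; no skips allowed
    A[-1, -1] = 1.0                 # final state loops until the word ends
    return A

A = left_to_right_transitions(4)
```

The self-loops absorb duration variation, while the strict left-to-right order is what encodes the sequential information a GMM discards.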

System The HMMs were implemented using the HMM Toolkit (HTK), normally used for speech recognition. Input features were 12 mel-cepstra, their first differences, and the zeroth-order cepstrum (an energy parameter). Adaptation: means were adapted using maximum a posteriori (MAP) adaptation. In cases with no adaptation data, the UBM itself was used, so the LLR score cancels.

Word Selection 13 words:
- Discourse markers: {actually, anyway, like, see, well, now}
- Filled pauses: {um, uh}
- Backchannels: {yeah, yep, okay, uhhuh, right}
These words account for approximately 8% of total tokens.

Recognition Task NIST Extended Data Evaluation: training on 1, 2, 4, 8, or 16 complete conversation sides and testing on one side (side duration ~2.5 minutes). Uses the Switchboard I corpus (conversational telephone speech). Cross-validation method in which the data is partitioned: test on one partition and use the others for background models and normalization. For this project, splits 4-6 were used for background and split 1 for testing, with 8-conversation training.

Scoring LLR(X) = log p(X|S) − log p(X|UBM)
- Target score: output of the adapted HMM scoring a forced-alignment recognition of the word from the true transcripts (aligned via the SRI recognizer).
- UBM score: output of the non-adapted HMM scoring the same forced alignment.
- Frame normalization: the LLR normalized by the number of frames. Word normalization: average of the word-level frame normalizations. N-best normalization: frame normalization on the n best-matching (i.e., highest log-probability) words.
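A small sketch of the normalizations as described (the per-frame log-likelihoods are hypothetical inputs, and the n-best selection here ranks words by their frame-normalized LLR, a simplifying assumption):

```python
def frame_norm(target_logps, ubm_logps):
    """Frame normalization: total LLR divided by the number of frames."""
    assert len(target_logps) == len(ubm_logps) > 0
    return sum(t - u for t, u in zip(target_logps, ubm_logps)) / len(target_logps)

def word_norm(word_scores):
    """Word normalization: average of the word-level frame normalizations."""
    return sum(word_scores) / len(word_scores)

def n_best_norm(word_frames, n):
    """N-best normalization: keep only the n best-scoring words.
    word_frames: list of (target_logps, ubm_logps) pairs, one per word."""
    per_word = sorted((frame_norm(t, u) for t, u in word_frames), reverse=True)
    return word_norm(per_word[:n])
```

Normalizing by frame count keeps long and short test segments on a comparable score scale before thresholding.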

Initial Results Observations: 1) the frame-normalization result equals the word-normalization result; 2) the EER of the n-best normalization decreases with increasing n, which suggests a benefit from an increase in data.

Initial Results Comparable results: the Sturim et al. text-dependent GMM system yielded an EER of 1.3%, with a larger word pool (50 words) and channel normalization.

Initial Results Observations: the EERs for most words lie in a small range around 7%, which suggests that the words, as a group, share some qualities; the last two may differ greatly partly because of data scarcity. The best single word ("yeah") yielded an EER of 4.63%, compared with 2.87% for all words combined.
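The equal error rate (EER) quoted throughout these results is the operating point where the miss and false-alarm rates coincide; a simple threshold sweep approximates it (the toy score lists below are illustrative, not evaluation data):

```python
def equal_error_rate(target_scores, impostor_scores):
    """Approximate EER: sweep thresholds over the observed scores and take
    the point where the larger of miss and false-alarm rates is smallest."""
    best = 1.0
    for thr in sorted(target_scores + impostor_scores):
        miss = sum(s < thr for s in target_scores) / len(target_scores)
        fa = sum(s >= thr for s in impostor_scores) / len(impostor_scores)
        best = min(best, max(miss, fa))  # crossover of the two error curves
    return best
```

With well-separated score distributions the EER approaches 0; complete overlap pushes it toward 50%.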

System Enhancements

System Enhancements: New Words Some discourse markers and backchannels are bigrams. 6 additional words:
- Discourse markers: {you_know, you_see, i_think, i_mean}
- Backchannels: {i_see, i_know}
Total coverage is ~10% with these additional words.

System Enhancements: New Words Results: EER reduced from 2.87% to 2.53%, a significant reduction, especially given the size of the coverage increase.

System Enhancements: New Words Results Observations: the well-performing bigrams have comparable EERs, while the poorly-performing bigrams suffer from a paucity of data. This suggests the possibility of a frequency threshold for performance.

System Enhancements: More Cepstra Idea: higher-order cepstra may possess more variability that can be used for speaker discrimination. The input features were modified from 12 to 19 mel-cepstra.

System Enhancements: More Cepstra Results: EER reduced from 2.87% to 1.88%.

System Enhancements: CMS Idea: the channel response may introduce undesirable variability (e.g., the same speaker on different handsets), so try to remove it. A common approach is cepstral mean subtraction (CMS). Convolutional effects in the time domain become additive effects in the log power domain:
X(ω,t) = S(ω,t) C(ω,t)
log|X(ω,t)|² = log|S(ω,t)|² + log|C(ω,t)|²
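Because a stationary channel adds a constant to every cepstral frame, subtracting the per-utterance mean removes it. A minimal sketch (the random frames stand in for real cepstra):

```python
import numpy as np

def cepstral_mean_subtraction(cepstra):
    """CMS: subtract each coefficient's mean over the utterance, cancelling
    any stationary additive (i.e., originally convolutional) channel term.
    cepstra: (T, D) array of cepstral frames."""
    return cepstra - cepstra.mean(axis=0, keepdims=True)

# Toy cepstral frames standing in for a real utterance.
x = np.random.default_rng(0).normal(size=(100, 13))
y = cepstral_mean_subtraction(x)
```

Note that any constant offset (a fixed channel) disappears entirely: cepstral_mean_subtraction(x + c) equals cepstral_mean_subtraction(x).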

System Enhancements: CMS Results: EER reduced from 2.87% to 1.35%. The poor performance in the low-false-alarm region is possibly due to the small number of data points there; CMS may also have removed 'good' channel information.

System Enhancements: Combined System Results: the "grab bag" system combining the enhancements yields an EER of 1.01%, but suffers from the same poor performance at low false-alarm rates.

Conclusions Well-performing text-dependent speaker recognition in an unconstrained speech domain is very feasible. The benefit of sequential information appears to have been established, and the benefits of higher-order cepstra and CMS for the input features have been demonstrated.

Future Work
- Analyze performance with ASR output
- Closer analysis of word frequency versus performance
- More words!
- Normalizations (Hnorm, Tnorm)
- Examine influence of word context (e.g., "well" as discourse marker and as adverb)

Fin