1
EE225D Final Project
Text-Constrained Speaker Recognition Using Hidden Markov Models
Kofi A. Boakye
2
Introduction
Speaker recognition problem: determine whether a spoken segment was produced by the putative target speaker
Method of solution requires two parts:
I. Training
II. Testing
Similar to speech recognition, though what was noise there (inter-speaker variability) is now the signal
3
Introduction
Also like speech recognition, different domains exist
Two major divisions:
1) Text-dependent/Text-constrained
2) Text-independent
Text-dependent systems can have high performance because of input constraints
-More of the acoustic variation is speaker-distinctive
4
Introduction
Question: Is it possible to capitalize on the advantages of text-dependent systems in text-independent domains?
Answer: Yes!
5
Introduction
Idea: Limit words of interest to a select group
-Words should have high frequency in the domain
-Words should have high speaker-discriminative quality
What kinds of words match these criteria for conversational speech?
1) Discourse markers (um, uh, like, …)
2) Backchannels (yeah, right, uhhuh, …)
These words are fairly spontaneous and represent an “involuntary speaking style” (Heck, WS2002)
6
Design
Task is a detection problem, so use a likelihood ratio detector:
Λ = p(X|S) / p(X|UBM)
-In implementation, the log-likelihood is used
[Figure: signal → Feature Extraction → Speaker Model and Background Model → ratio Λ; Λ > Θ: Accept, Λ < Θ: Reject]
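
The detector above can be sketched in a few lines. This is only an illustration of the decision rule in the log domain; the function names and the default threshold of 0.0 are assumptions, not values from the project.

```python
def log_likelihood_ratio(log_p_speaker: float, log_p_ubm: float) -> float:
    """Return log Λ = log p(X|S) - log p(X|UBM)."""
    return log_p_speaker - log_p_ubm


def decide(log_p_speaker: float, log_p_ubm: float,
           log_threshold: float = 0.0) -> str:
    """Accept when log Λ exceeds log Θ, otherwise reject.

    The threshold default is a placeholder (assumption); in practice it
    would be tuned on held-out data.
    """
    llr = log_likelihood_ratio(log_p_speaker, log_p_ubm)
    return "accept" if llr > log_threshold else "reject"
```

Working with differences of log-likelihoods avoids underflow when the per-frame probabilities are multiplied over a long segment.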
7
Design
State-of-the-art systems use Gaussian Mixture Models (GMMs)
Speaker’s acoustic space is represented by a many-component mixture of Gaussians
Gives very good performance, but…
[Figure: speaker 1, speaker 2, speaker 3]
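
As a rough illustration of the mixture representation, the sketch below scores a sequence of feature frames under a diagonal-covariance GMM. The diagonal covariance and all function names are assumptions made for the example; the slides do not specify them.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at vector x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def gmm_frame_loglik(x, weights, means, variances):
    """Log p(x) under the mixture, using log-sum-exp for stability."""
    terms = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))


def gmm_sequence_loglik(frames, weights, means, variances):
    """Sum of per-frame log-likelihoods; frames are treated as independent."""
    return sum(gmm_frame_loglik(x, weights, means, variances) for x in frames)
```

Because the sequence score is a plain sum over frames, shuffling the frames leaves it unchanged, which is the sense in which a GMM ignores sequential information.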
8
Design
Concern: GMMs utilize a “bag-of-frames” approach
-Frames are assumed to be independent
-Sequential information is not really utilized
Alternative: Use HMMs
-Do likelihood test on output from recognizer, which is an accumulated log-probability score
A text-independent system has been analyzed (Weber et al. from Dragon Systems)
Let’s try a text-dependent one!
9
System
Word-level HMM-UBM detectors
[Figure: signal → Word Extractor → HMM-UBM 1, HMM-UBM 2, …, HMM-UBM N → Combination → Λ]
Topology:
-Left-right HMM with self-loops and no skips
-4 mixture components per state
-Number of states related to number of phones and median number of frames for the word
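
The topology above (left-right, self-loops, no skips) corresponds to a transition matrix with only two non-zero entries per row. A minimal sketch follows; the initial self-loop probability of 0.5 and the explicit exit column are assumptions for illustration, and the number of states is taken as an input because the slides do not give the exact rule for choosing it.

```python
def left_right_transitions(num_states: int, self_loop_prob: float = 0.5):
    """Build a num_states x (num_states + 1) transition matrix.

    Row i is emitting state i. Column j < num_states is emitting state j,
    and the final column is a non-emitting exit state. Each state may only
    stay where it is or advance by exactly one state (no skips).
    """
    if num_states < 1:
        raise ValueError("num_states must be at least 1")
    if not 0.0 < self_loop_prob < 1.0:
        raise ValueError("self_loop_prob must lie strictly between 0 and 1")
    matrix = []
    for i in range(num_states):
        row = [0.0] * (num_states + 1)
        row[i] = self_loop_prob            # self-loop
        row[i + 1] = 1.0 - self_loop_prob  # advance to next state (or exit)
        matrix.append(row)
    return matrix
```

These values would only serve as an initialization; training re-estimates the non-zero entries while the zeros stay zero.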
10
System
HMMs implemented using the Hidden Markov Model Toolkit (HTK)
-Used for speech recognition
Input features were 12 cepstra, their first differences, and the zeroth-order cepstrum (energy parameter)
Adaptation: Means were adapted using Maximum A Posteriori (MAP) adaptation
In cases of no adaptation data, the UBM was used
-Target and UBM scores are then identical, so the LLR score cancels
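
A common form of MAP mean adaptation interpolates between the UBM mean and the mean of the adaptation data, weighted by how much data was seen. The sketch below follows that form for one Gaussian; the relevance factor of 16 and the function name are assumptions, since the slides do not state the settings used.

```python
def map_adapt_mean(ubm_mean, frames, occupancies, relevance=16.0):
    """MAP-adapt one Gaussian mean toward the adaptation data.

    frames      -- feature vectors aligned to this Gaussian
    occupancies -- soft counts (posteriors) of the Gaussian for each frame
    relevance   -- prior weight (assumed value, not from the slides)
    """
    n = sum(occupancies)
    if n == 0:
        # No adaptation data: keep the UBM mean unchanged.
        return list(ubm_mean)
    dim = len(ubm_mean)
    data_mean = [
        sum(g * x[d] for g, x in zip(occupancies, frames)) / n
        for d in range(dim)
    ]
    alpha = n / (n + relevance)  # data weight grows with the soft count
    return [alpha * e + (1.0 - alpha) * m for e, m in zip(data_mean, ubm_mean)]
```

With no data the adapted model equals the UBM, so the speaker and background log-likelihoods coincide and the log-likelihood ratio is zero, which matches the cancellation noted on the slide.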
11
Word Selection
13 words:
Discourse markers: {actually, anyway, like, see, well, now, um, uh}
Backchannels: {yeah, yep, okay, uhhuh, right}
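
The word-extraction step can be pictured as a filter over a time-aligned transcript. The sketch below assumes a hypothetical input format of (word, start, end) tuples; the actual alignment format used in the project is not described in the slides.

```python
TARGET_WORDS = frozenset({
    # Discourse markers
    "actually", "anyway", "like", "see", "well", "now", "um", "uh",
    # Backchannels
    "yeah", "yep", "okay", "uhhuh", "right",
})


def extract_target_tokens(aligned_words):
    """Keep only tokens whose word is in the 13-word list.

    aligned_words -- iterable of (word, start_time, end_time) tuples
                     (hypothetical format, assumed for this example)
    """
    return [
        (word.lower(), start, end)
        for word, start, end in aligned_words
        if word.lower() in TARGET_WORDS
    ]
```

A purely lexical filter like this cannot tell different uses of the same word apart.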
12
Recognition Task
NIST Extended Data Evaluation:
-Training on 1, 2, 4, 8, and 16 complete conversation sides and testing on one side (side duration ~2.5 min)
-Uses Switchboard I corpus (conversational telephone speech)
-Cross-validation method where data is partitioned
-Test on one partition; use others for background models and normalization
For the project, used splits 4-6 for background and split 1 for testing, with 8-conversation training
13
Scoring
Target score: output of adapted HMM to forced-alignment recognition of word from true transcripts and SRI recognizer
UBM score: output of non-adapted HMM to same forced alignment
Frame normalization: [formula not preserved in transcript]
Word normalization: average of word-level frame normalizations
N-best normalization: frame normalization on the n best-matching (i.e. high log-prob) words
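
One plausible reading of the three normalizations is sketched below. The frame-normalization formula is not available in this transcript, so the version here (total log-likelihood ratio divided by total frame count) is an assumption, as is ranking the n-best words by their accumulated target log-probability. Each token is a hypothetical (target_logprob, ubm_logprob, num_frames) tuple.

```python
def frame_norm(tokens):
    """Assumed form: sum of word LLRs divided by total number of frames."""
    total_llr = sum(t - u for t, u, _ in tokens)
    total_frames = sum(n for _, _, n in tokens)
    return total_llr / total_frames


def word_norm(tokens):
    """Average of the per-word frame-normalized LLRs."""
    return sum((t - u) / n for t, u, n in tokens) / len(tokens)


def n_best_norm(tokens, n):
    """Frame normalization over the n tokens with highest target log-prob.

    Ranking by the raw target score is an assumption about what
    "best matching" means on the slide.
    """
    best = sorted(tokens, key=lambda tok: tok[0], reverse=True)[:n]
    return frame_norm(best)
```

Under these definitions, frame normalization weights long words more heavily, while word normalization gives every token equal weight regardless of duration.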
14
Results
Observations:
1) Frame norm = word norm
2) EER of n-best decreases with increasing n
-Suggests benefit from an increase in data
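
For reference, the equal error rate (EER) is the operating point at which the miss rate and the false-alarm rate are equal. A simple way to estimate it from trial scores is to sweep the threshold over all observed scores and take the point where the two rates are closest. This is a minimal sketch with hypothetical names; it is not the official NIST scoring tool.

```python
def equal_error_rate(target_scores, impostor_scores):
    """Estimate the EER by sweeping thresholds over all observed scores.

    A trial is accepted when its score is strictly greater than the
    threshold. Returns the mean of miss and false-alarm rates at the
    threshold where they are closest.
    """
    if not target_scores or not impostor_scores:
        raise ValueError("both score lists must be non-empty")
    best_gap, best_eer = None, None
    for thr in sorted(set(target_scores) | set(impostor_scores)):
        miss = sum(s <= thr for s in target_scores) / len(target_scores)
        false_alarm = sum(s > thr for s in impostor_scores) / len(impostor_scores)
        gap = abs(miss - false_alarm)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (miss + false_alarm) / 2.0
    return best_eer
```

With few trials the two rates may never match exactly, which is why the sketch averages them at the closest point rather than interpolating.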
15
Results
Comparable results:
Sturim et al. text-dependent GMM yielded EER of 1.3%
-Larger word pool
-Channel normalization
16
Results
Observations:
1) EERs for most words lie in a small range of 7%
-Indicates words, as a group, share some qualities
-Last two may differ greatly partly because of data scarcity
2) Best word (“yeah”) yielded EER of 4.63% compared with 2.87% for all words
17
Conclusions
Well-performing text-dependent speaker recognition in an unconstrained speech domain is feasible
Benefit of sequential information applied in this fashion is unclear
-Can compete with GMM, but can it be superior?
18
Future Work
-Channel normalization
-Examine influence of word context (e.g., “well” as discourse marker and as adverb)
-Revise word list
19
Acknowledgements
-Barbara Peskin
-Chuck Wooters
-Yang Liu