Presentation for EEL6586 Automatic Speech Processing: Speaker Verification. Seth McNeill, 18 April 2003. Hello, I am Seth McNeill, and my project is speaker verification.
Outline of things to come: 1) Who am I? 2) Verification vs. ID 3) Features 4) Modeling 5) My Project. Here is an outline of the things to come in this presentation. First, who am I? Second, speaker verification vs. speaker ID. Third, what features should I use? Fourth, how do I model the data? Lastly, my project. 2/12
Who am I? First Year Graduate Student First Year at UF From SE Washington State 3/12
Note three things: there are no trees; there are hills (this hill is taller from top to bottom than the highest point in Florida); and there is no sign of rain, because it is dry.
Speaker Verification vs. Speaker ID. Verification: are you an imposter? ID: which of N speakers are you? Speaker verification asks the question, are you who you say you are? Speaker ID asks, which of all the people I know are you? 4/12
Features: Mel-Cepstrum, Delta Cepstrum, Cepstral Mean Subtraction (96%); Glottal Flow Derivative (95%); Liljencrants-Fant (LF) model (74%); Sub-Cepstrum. Note that the first feature set used in speaker verification is the same as in speech recognition; in fact, speaker ID is very similar to speaker-dependent speech recognition. Excitation features can also be used for speaker ID: the glottal flow derivative gives 95% accuracy, and the Liljencrants-Fant model gives 74% accuracy. One feature extraction method mentioned in the book that I had not seen before is the sub-cepstrum. It is a time-domain method of getting mel-cepstrum features: you convolve the time-domain impulse response of each of the mel filters with your signal to get the coefficients. 5/12
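The standard (frequency-domain) mel-cepstrum pipeline mentioned above can be sketched as follows. This is a minimal, illustrative Python/NumPy version, not the project's C++ code; the filter count, number of coefficients, and FFT framing here are assumptions for illustration only.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    # Triangular filters spaced evenly on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):
            fb[i, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i, k] = (right - k) / max(right - center, 1)
    return fb

def mel_cepstrum(frame, fs, n_filters=20, n_ceps=12):
    """Mel-cepstral coefficients of one windowed frame:
    power spectrum -> mel filterbank energies -> log -> DCT-II."""
    n_fft = len(frame)
    spec = np.abs(np.fft.rfft(frame)) ** 2          # power spectrum
    energies = mel_filterbank(n_filters, n_fft, fs) @ spec
    log_e = np.log(energies + 1e-10)                # avoid log(0)
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1)
                 / (2 * n_filters))
    return dct @ log_e
```

The sub-cepstrum variant described in the notes would instead convolve each mel filter's time-domain impulse response with the signal, avoiding the FFT step shown here.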
Gaussian Mixture Model (GMM): Loses Temporal Data; Person vs. "Background"; Person vs. Threshold. The Gaussian mixture model is used for speaker verification. This loses the temporal data, but that makes sense, because what you want to see is which parts of the feature space each speaker occupies. There are two methods of testing the model. You can compare the test data against both the person's model and a "background" model; the background model is made from many people who are not the person you are testing against, so this requires lots of data. The other method is simply to use a threshold: if the likelihood that the test data came from the model is greater than the threshold, the person is not an imposter. 6/12
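The two decision rules above (person vs. background, person vs. threshold) can be sketched as a likelihood test. This is an illustrative Python/NumPy sketch assuming diagonal-covariance GMMs already trained by EM; the function names and the zero default threshold are my assumptions, not part of the project.

```python
import numpy as np

def gmm_loglik(X, weights, means, variances):
    """Average per-frame log-likelihood of feature frames X under a
    diagonal-covariance GMM with component weights, means, variances."""
    ll = np.full(len(X), -np.inf)
    for w, mu, s2 in zip(weights, means, variances):
        comp = (np.log(w)
                - 0.5 * np.sum(np.log(2 * np.pi * s2))
                - 0.5 * np.sum((X - mu) ** 2 / s2, axis=1))
        ll = np.logaddexp(ll, comp)   # log-sum over mixture components
    return ll.mean()

def verify(X, claimant, background=None, threshold=0.0):
    """Accept the claimed identity if the claimant model explains the
    data better than the background model (likelihood ratio), or, with
    no background model, better than a fixed threshold."""
    score = gmm_loglik(X, *claimant)
    if background is not None:
        score -= gmm_loglik(X, *background)
    return score > threshold
```

With a background model the score becomes a log-likelihood ratio, which is less sensitive to recording conditions than a raw threshold; that is why the background method needs the extra training data mentioned above.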
GMM (continued) Here is a video to show what training the GMM is like using an expectation maximization algorithm. 7/12
My Project: C++ Implementation; Energy-Based Endpoint Detection; Mel-Cepstrum and Delta Cepstrum Coefficients; Single Window with Nearest Neighbor; Multiple Windows with GMM. I did a C++ implementation because I wanted my project to work on any Windows computer without having to buy Matlab. My project uses energy-based endpoint detection; as we know, this is not the best method, but it is quick and easy. I have chosen to use the mel-cepstrum and delta cepstrum coefficients. Depending on time, I will either use a single window (remember, temporal data doesn't matter) with nearest neighbor, or I will use multiple windows with a GMM. A GMM takes more data to train, so using multiple windows helps. 8/12
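The single-window nearest-neighbor option can be sketched in a few lines. This is an illustrative Python/NumPy sketch, not the project's C++ code; the Euclidean distance metric and the distance threshold are assumptions.

```python
import numpy as np

def nearest_neighbor_verify(test_vec, enrolled_vecs, threshold):
    """Single-window nearest-neighbor check: accept the claimed speaker
    if the test feature vector (e.g. mel-cepstral coefficients) lies
    within `threshold` Euclidean distance of any enrollment vector."""
    dists = np.linalg.norm(np.asarray(enrolled_vecs) - test_vec, axis=1)
    return dists.min() <= threshold
```

Because only one window is compared, no temporal ordering is used, which matches the note above that temporal data does not matter for this task.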
Current Progress: Data Capture from Sound Card; Endpoint Detection. Currently I have the data capture from the sound card working and endpoint detection running. 9/12
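The energy-based endpoint detection mentioned above can be sketched as a short-time-energy threshold on the captured samples. This is an illustrative Python/NumPy sketch, not the project's C++ implementation; the frame length and the energy-ratio threshold are assumptions.

```python
import numpy as np

def detect_endpoints(signal, frame_len=256, energy_ratio=0.1):
    """Energy-based endpoint detection: split the signal into frames,
    compute each frame's short-time energy, and mark the first and last
    frames whose energy exceeds a fraction of the peak frame energy.
    Returns (start_sample, end_sample), or None if no speech is found."""
    n_frames = len(signal) // frame_len
    frames = np.reshape(signal[:n_frames * frame_len], (n_frames, frame_len))
    energy = np.sum(frames.astype(float) ** 2, axis=1)
    active = np.where(energy > energy_ratio * energy.max())[0]
    if active.size == 0:
        return None
    return active[0] * frame_len, (active[-1] + 1) * frame_len
```

As the notes say, pure energy thresholding is not the best detector (it can clip weak fricatives and is fooled by noise bursts), but it is quick and easy.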
Demo 10/12
Future Progress: Feature Extraction; Modeling; Motion Detection; Text-to-Speech; Visual Verification. Future things to do are: feature extraction (I think I finally found a good way to do that) and modeling, using either nearest neighbor or a GMM. I think I have software that will help make it easier. 11/12
Speaker Verification: Questions or Comments? Any questions or comments? 12/12
Another GMM Video