
1 Speaker Detection Without Models
Dan Gillick, July 27, 2004

2 Motivation
Want to develop a speaker ID algorithm that:
– captures sequential information
– takes advantage of extended data
– combines well with existing baseline systems

3 The Algorithm
Rather than build models (GMM, HMM, etc.) to describe the information in the training data, we directly compare test data frames to training data frames. We compare sequences of frames because we believe there is information in sequences that systems like the GMM do not capture. The comparisons are guided by token-level alignments extracted from a speech recognizer.

4 Front-End
Using 40 MFCC features per 10 ms frame (a rough front-end sketch follows):
– 19 cepstra and energy (C0)
– their deltas
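The slide lists only the feature inventory; as a rough illustration, here is a minimal front-end sketch using librosa rather than the original SRI front-end. The 8 kHz sample rate, the 25 ms analysis window, and using librosa's C0 in place of the energy term are assumptions.

```python
# Hedged sketch: an approximate 40-dimensional front-end (20 static + 20 deltas).
# Sample rate, window length, and the C0-for-energy substitution are assumptions.
import librosa
import numpy as np

def front_end(wav_path, sr=8000):
    y, sr = librosa.load(wav_path, sr=sr)
    static = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20,       # C0 + 19 cepstra
                                  hop_length=int(0.010 * sr),  # 10 ms frame step
                                  n_fft=int(0.025 * sr))       # 25 ms window (assumed)
    deltas = librosa.feature.delta(static)                     # their deltas
    return np.vstack([static, deltas]).T                       # (num_frames, 40)
```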

5 The Algorithm: Overview
Cut the test and target data into tokens:
– use word- or phone-level time alignments from the SRI recognizer
– note that these alignments have lots of errors (both word errors and alignment errors)

6 The Algorithm: Overview
Compare test and target data (a code sketch follows this list):
1. Take the first test token.
2. Find every instance of this token in the target data.
3. Measure the distance between the test token and each target instance.
4. Move on to the next test token.
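A minimal sketch of this loop, assuming the tokens have already been cut out using the recognizer alignments from the previous slide. The dictionary layout, the function names, and pulling in the 1-best rule from slide 14 are illustrative choices, not the original implementation; the returned triples feed the scoring options on slide 15.

```python
def token_scores(test_tokens, target_tokens, distance):
    """test_tokens / target_tokens: dicts mapping a token label (e.g. the word
    "hello") to a list of frame sequences, each a (num_frames x 40) feature array.
    distance: a function comparing two frame sequences (see slide 13).
    Returns (label, num_test_frames, best_distance) triples."""
    scores = []
    for label, test_instances in test_tokens.items():
        target_instances = target_tokens.get(label, [])
        if not target_instances:
            continue  # this token never occurs in the target (training) data
        for test_seq in test_instances:
            # 1-best (slide 14): keep the smallest distance over all target instances
            best = min(distance(test_seq, tgt) for tgt in target_instances)
            scores.append((label, len(test_seq), best))
    return scores
```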

7-12 The Algorithm (illustrated)
[Figure: a token from the test data is matched against the training data.]
"Take the first test token": grab the sequence of frames corresponding to this token ("Hello") according to the recognizer output.
"Find every instance of this token in the target data": the training data contains Hello (1), Hello (2), and Hello (3).
"Measure the distance between the test token and each target instance": distance = the sum of the (Euclidean) distances between frames of the test and target instances. In the figure, Hello (1) gives Distance = 25, Hello (2) gives Distance = 40, and Hello (3) gives Distance = 18.

13 The Algorithm: Distance Function
But these instances have different lengths. How do we line up the frames? Here are some possibilities (a sketch of options 2 and 3 follows this list):
1. Line up the first frames and cut off the longer instance at the length of the shorter.
2. Use a sliding window approach: slide the shorter instance through the longer one, taking the best (smallest) total distance.
3. Use dynamic time warping (DTW).
[Figure: the test "Hello" compared to Hello (3) through the Euclidean distance function; Distance = 18.]
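A sketch of options 2 and 3, built on the per-frame Euclidean distance from the earlier slides; this is generic illustrative code, not the DTW implementation used in the actual system.

```python
import numpy as np

def frame_dist(a, b):
    # Euclidean distance between two 40-dimensional feature frames.
    return float(np.linalg.norm(a - b))

def sliding_window_distance(seq_a, seq_b):
    """Option 2: slide the shorter sequence through the longer one and keep the
    smallest total frame-by-frame distance."""
    short, longer = (seq_a, seq_b) if len(seq_a) <= len(seq_b) else (seq_b, seq_a)
    best = float("inf")
    for offset in range(len(longer) - len(short) + 1):
        total = sum(frame_dist(short[i], longer[offset + i]) for i in range(len(short)))
        best = min(best, total)
    return best

def dtw_distance(x, y):
    """Option 3: classic dynamic time warping (cumulative cost of the best alignment path)."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = frame_dist(x[i - 1], y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])
```

Either function can be passed as the `distance` argument of the token_scores sketch above.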

14 The Algorithm: Take the 1-Best
Now what do we do with these distances? There are a number of options, but we only keep the 1-best score. One motivation for this decision is that we are mainly interested in positive information.
[Figure: of the three candidate distances (25, 40, 18), the token score is the best one: Token Score = 18.]

15 The Algorithm: Scoring
So we accumulate a score for each token ("Hello": 18, "my": 16.5, "name": 21, etc.). What do we do with these? Some options (a sketch follows this list):
1. Average them, normalizing either by the number of tokens or by the total number of frames (Basic score).
2. Focus on some subset of the scores:
   a. Positive evidence (Hit score): ∑ [ (#frames) / (k^score) ]
   b. Negative evidence: ∑ [ (#frames * target count) / (k^(M - score)) ]
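A sketch of the scoring options as written on the slide, operating on the (label, #frames, score) triples from the earlier sketch. The constant k, the offset M, and the default normalization are placeholders, since the slide does not fix their values.

```python
def basic_score(scores, per_frame=True):
    """Option 1: average the token scores, normalizing by total frames or by token count."""
    total = sum(score for _, _, score in scores)
    denom = sum(n for _, n, _ in scores) if per_frame else len(scores)
    return total / denom

def hit_score(scores, k=1.1):
    """Option 2a, positive evidence: small distances (good matches) dominate the sum."""
    return sum(n_frames / (k ** score) for _, n_frames, score in scores)

def negative_score(scores, target_counts, k=1.1, M=100.0):
    """Option 2b, negative evidence: large distances (bad matches) dominate the sum.
    target_counts maps each token label to its count in the target data."""
    return sum(n_frames * target_counts[label] / (k ** (M - score))
               for label, n_frames, score in scores)
```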

16 Normalization
Most systems use a UBM (universal background model) to center the test scores.
– Since this system has no model, we create a background by lumping together speech from a number of different held-out speakers and running the algorithm with this group as training data.
ZNorm to center the "models" (a sketch of both steps follows):
– Find the mean score for each "model", or training set, by running a number of held-out imposters against each one.
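A sketch of the two normalization steps. Subtracting the pooled-background score and dividing by the impostor score spread follow common speaker-ID practice; the slide itself only mentions finding the mean, so treat the standard deviation and the exact arithmetic as assumptions.

```python
import numpy as np

def background_normalize(target_score, background_score):
    """Center a trial score by subtracting the score obtained when the same test
    data is matched against speech pooled from held-out background speakers."""
    return target_score - background_score

def znorm_params(impostor_scores):
    """Per-"model" ZNorm statistics from held-out impostor trials scored against
    that model's (i.e. that speaker's) training data."""
    return float(np.mean(impostor_scores)), float(np.std(impostor_scores))

def apply_znorm(score, mean, std):
    return (score - mean) / std
```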

17 Results
Results reported on split 1 (of 6) of Switchboard I (1624 test vs. target scores).

18-21 Results

TOKEN            STYLE  BKG  ZNORM  BSCR EER  HS EER  COMB EER  COMB DCF
word unigrams    sw     14   none   6.82      4.83    -         -
word unigrams    dtw    14   none   4.16      3.16    -         -
word unigrams    dtw    14   16     2.66      2.16    2.00      0.0416
word bigrams     sw     14   -      5.80      3.68    -         -
word bigrams     dtw    14   16     2.83      2.16    1.83      0.0447
phone unigrams   dtw    14   16     2.64      2.48    1.98      0.0560
phone bigrams    dtw    14   16     1.83      -       1.33      0.0333
phone trigrams   dtw    14   16     1.65      -       1.16      0.0345

For reference: GMM performance on the same data set: 0.67% EER; 0.0491 DCF.
Style: sw = sliding window, dtw = dynamic time warping; Bkg: # of speakers in the bkg set; Znorm: # of speakers in the znorm set.

22 Results
How do positive and negative evidence compare?
Word bigrams + bkg (positive evidence): 3.16% EER
Word bigrams + bkg (negative evidence): 26.5% EER

23 Results
How is the system affected by errorful recognizer transcripts?
Word bigrams + bkg + znorm (recognized transcripts): 1.83% EER
Word bigrams + bkg + znorm (true transcripts): 1.16% EER

24 Results
How does the system combine with the GMM? This experiment was done on the first half (splits 1, 2, 3) of Switchboard I. (A generic fusion sketch follows the table.)

System                    EER   DCF
SRI GMM system            0.97  0.04806
Best phone-bigram system  1.46  0.06110
GMM + phone-bigrams       0.49  0.02040
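The slides do not say how the GMM and token-matching scores were combined; a weighted linear fusion of per-trial scores is one common possibility, sketched here purely as an illustration (the weight is a placeholder, and both inputs are assumed to be already normalized to comparable scales).

```python
def fuse(gmm_score, token_score, w=0.5):
    # Simple linear fusion; w would normally be tuned on held-out data.
    return w * gmm_score + (1.0 - w) * token_score
```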

25 Future Stuff
– Try a larger background population and a larger znorm set
– Try other, non-Euclidean distance functions
– Change the front-end features (feature mapping)
– Run the system on Switchboard II and the 2004 eval data
– Dynamic token selection: while the system already works well, perhaps its real strength has not yet been exploited. Since there are no models, we might dynamically select the longest available frame sequences in the test and target data for scoring.

26 Thanks
– Steve (wrote all the DTW code, versions 1 through 5…)
– Barry (tried to make my slides fancy)
– Barbara
– Everyone else in the Speaker ID group

