1 Recent work on Language Identification
Pietro Laface, POLITECNICO di TORINO, Brno

2 Team
POLITECNICO di TORINO
  Pietro Laface, Professor
  Fabio Castaldo, Post-doc
  Sandro Cumani, PhD student
  Ivano Dalmasso, Thesis Student
LOQUENDO
  Claudio Vair, Senior Researcher
  Daniele Colibro, Researcher
  Emanuele Dalmasso, Post-doc

3 Outline
Acoustic models
  Fast discriminative training of GMMs
  Language factors
Phonetic models
  1-best tokenizers
  Lattice tokenizers
LRE09
  Incremental acquisition of segments for the development sets

This is the outline of my talk. I will illustrate the three main contributions of our work on the acoustic models of a LID system.
First: to reduce inter-speaker variability within the same language, we showed at the last ICASSP that a significant performance improvement in LID can be obtained by performing speaker compensation in feature space using the Generalized Linear Discriminant Sequence kernel approach. Here we use speaker compensation with Gaussian Mixture Models.
Second: since GMMs in combination with linear Support Vector Machine classifiers have been shown to give excellent classification accuracy in speaker recognition, we apply this approach to LID and compare its performance with the standard GMM-based techniques.
Third: in the GMM-SVM framework, a GMM is trained for each training or test utterance. Since it is difficult to train a model accurately with short utterances, in these conditions the standard GMMs perform better than the GMM-SVM models. To overcome this limitation, we present an extremely fast GMM discriminative training procedure that exploits the information given by the separation hyperplanes estimated by an SVM classifier.

4 Our technology progress
Inter-speaker compensation in feature space
  GLDS / SVM models (ICASSP 2007) - GMMs
SVM using GMM super-vectors (GMM-SVM)
  Introduced by MIT-LL for speaker recognition
Fast discriminative training of GMMs
  Alternative to MMIE
  Exploiting the GMM-SVM separation hyperplanes
  MIT discriminative GMMs
Language factors

5 Acoustic Language Identification
Task similar to text-independent Speaker Recognition
Gaussian Mixture Models (GMM) - MAP adapted from a Universal Background Model (UBM)
Figure: UBM -> MAP -> Language GMM

For LID we use the same core technology that we use for speaker recognition: Gaussian Mixture Models used in combination with Maximum A Posteriori adaptation. MAP adaptation is not necessary in language recognition, because every language GMM can be robustly trained by Maximum Likelihood estimation. However, we perform MAP estimation from a UBM also in LID, for three main reasons.
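As an illustration of MAP adaptation from a UBM, here is a minimal sketch of the usual mean-only update for a diagonal-covariance GMM. It is not taken from the talk: the function name, the array layout, and the relevance factor of 16 are assumptions chosen for the example.

```python
import numpy as np

def map_adapt_means(frames, ubm_weights, ubm_means, ubm_vars, relevance=16.0):
    """Mean-only MAP adaptation of a diagonal-covariance UBM.

    frames: (T, D); ubm_weights: (M,); ubm_means, ubm_vars: (M, D).
    `relevance` is a hypothetical relevance factor, not a value from the talk.
    """
    # Log-likelihood of every frame under every Gaussian, shape (T, M).
    diff = frames[:, None, :] - ubm_means[None, :, :]
    log_gauss = -0.5 * (np.sum(diff ** 2 / ubm_vars, axis=2)
                        + np.sum(np.log(2.0 * np.pi * ubm_vars), axis=1))
    log_post = np.log(ubm_weights) + log_gauss
    log_post -= log_post.max(axis=1, keepdims=True)   # numerical stability
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)           # occupation probabilities

    n = post.sum(axis=0)                               # soft counts, shape (M,)
    data_means = (post.T @ frames) / np.maximum(n, 1e-10)[:, None]
    alpha = (n / (n + relevance))[:, None]             # data-dependent weight
    # Gaussians with little data stay close to the UBM means.
    return alpha * data_means + (1.0 - alpha) * ubm_means
```

Only the means move; weights and variances stay those of the UBM, which matches the later statement that all language GMMs share the UBM weights and variances.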

6 GMM super-vectors
Appending the mean values of all Gaussians in a single stream we get a super-vector
We use GMM super-vectors
  Without normalization
    Inter-speaker/channel variation compensation
  With Kullback-Leibler normalization
    Training GMM-SVM models
    Training Discriminative GMMs

In all our approaches we make use of supervectors.

7 Using a UBM in LID
The frame-based inter-speaker variation compensation approach estimates the inter-speaker compensation factors using the UBM
In the GMM-SVM approach all language GMMs share the same weights and variances of the UBM
The UBM is used for fast selection of Gaussians

Our frame-based inter-speaker variation compensation approach computes its speaker factors using the UBM. Language models deriving from a common UBM are required by our GMM-SVM approach. A side benefit of this choice is that it allows fast selection of the Gaussians both in training and in testing. Thus, larger models can be trained discriminatively.

8 Speaker/channel compensation in feature space
This slide summarizes our feature-domain compensation technique. Each frame is compensated according to a formula whose terms are:
U, a low-rank matrix (estimated offline) projecting the speaker/channel factor subspace into the supervector domain
x(i), a low-dimensional vector, estimated using the UBM, holding the speaker/channel factors for the current utterance i
the occupation probability of the m-th Gaussian
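The compensation formula itself is not preserved in this transcript. A form that is consistent with the three terms listed above, given here as an assumption, subtracts from each frame the per-Gaussian offsets U_m x(i) weighted by the occupation probabilities: o'(t) = o(t) - sum_m gamma_m(t) U_m x(i). The sketch below implements that assumed form; the symbol gamma_m(t), the function name, and the Gaussian-major supervector layout are illustrative choices.

```python
import numpy as np

def compensate_frames(frames, posteriors, U, x):
    """Assumed feature-domain compensation.

    frames: (T, D) acoustic frames of one utterance.
    posteriors: (T, M) occupation probabilities of the M UBM Gaussians.
    U: (M * D, R) low-rank matrix; x: (R,) factors of the utterance.
    """
    T, D = frames.shape
    M = posteriors.shape[1]
    # U @ x is a supervector-sized offset; block m is the offset U_m x
    # for the m-th Gaussian (Gaussian-major layout is an assumption).
    offsets = (U @ x).reshape(M, D)
    # Each frame loses the posterior-weighted mix of the per-Gaussian offsets.
    return frames - posteriors @ offsets
```

When a frame is assigned entirely to one Gaussian, the compensation reduces to subtracting that Gaussian's offset.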

9 Estimating the U matrix
Speaker recognition: estimating the U matrix with a large set of differences between models generated using different utterances of the same speaker, we compensate the distortions due to inter-session variability
Language recognition: estimating the U matrix with a large set of differences between models generated using utterances of different speakers of the same language, we compensate the distortions due to inter-speaker/channel variability within the same language

10 GMM-SVM
A GMM is trained for each utterance, both in training and in test
Each GMM is represented by a normalized GMM super-vector
The normalization is necessary to define a meaningful comparison between GMM supervectors

I now summarize the GMM-SVM approach, focusing on the points that matter for understanding our discriminative training approach. The normalization is necessary to define a comparison between GMM supervectors in a suitable Euclidean space.

11 Kullback-Leibler divergence
Two GMMs (i and j) can be compared using an approximation of the Kullback-Leibler divergence

The interesting property of this measure is that …

12 Kullback-Leibler normalization
After normalizing each supervector component:
  The normalized UBM supervector defines the origin of a new space
  The KL divergence becomes a Euclidean distance
  The SVM language models are created using a linear kernel in this KL space

Since a translation does not alter the relative distances among the new means, the UBM mean can be dropped, and the supervector normalization term reduces to a simple scaling factor.
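The divergence approximation and the normalization formula are not preserved in this transcript. The sketch below assumes the form commonly used for mean-adapted GMMs that share the UBM weights w_m and diagonal variances: the divergence is approximated by 0.5 * sum_m w_m (mu_i,m - mu_j,m)^T Sigma_m^-1 (mu_i,m - mu_j,m), and each mean is scaled by sqrt(w_m) / sigma_m. Function names are illustrative.

```python
import numpy as np

def kl_normalize(means, weights, variances):
    """Scale each mean by sqrt(w_m) / sigma_m and stack into a supervector.

    means, variances: (M, D); weights: (M,). The UBM mean is dropped,
    so the normalization is a pure scaling.
    """
    return (np.sqrt(weights)[:, None] * means / np.sqrt(variances)).ravel()

def kl_approx(means_i, means_j, weights, variances):
    """Assumed approximation of the divergence between two mean-adapted GMMs."""
    d = means_i - means_j
    return 0.5 * np.sum(weights[:, None] * d ** 2 / variances)
```

Under these assumptions the squared Euclidean distance between two normalized supervectors equals twice the approximated divergence, which is why a linear kernel is meaningful in this space.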

13 GMM-SVM weakness
GMM-SVM models perform very well with rather long test utterances
It is difficult to estimate a robust GMM with a short test utterance
Exploit the discriminative information given by the GMM-SVM for fast estimation of discriminative GMMs

GMM-SVM models do not perform so well for short test utterances, because we cannot estimate a robust GMM with a short test utterance. Thus, to improve the performance for short sentences we have to rely on discriminative GMMs. Here comes the idea of avoiding the time-expensive MMIE training. Long sentences are available for training.

14 SVM discriminative directions
w: normal vector to the class-separation hyperplane

In particular, we exploit the discriminative directions given by the vectors w estimated by the SVM in the KL space (w1, w2, w3 in the figure).

15 GMM discriminative training
Shift each Gaussian of a language model along its discriminative direction, given by the vector normal to the class-separation hyperplane in the KL space
Figure panels: KL Space (left), Feature Space (right); legend: Language GMM, Utterance GMM

The red circles shown on the left side of the figure represent a two-dimensional projection of a set of utterance supervectors of language k mapped to the KL space. The green circles correspond to the utterance supervectors of one of the competitor languages, and the black circle is the UBM.
The figure shows, on its right side, a two-dimensional acoustic feature space. The black ellipses represent two Gaussians of the UBM. The red and the green ellipses represent the corresponding Gaussians of two languages, the red ones referring to language k. wk1 and wk2, in the acoustic feature space, are the rescaled components of supervector wk for the two Gaussians of the language k GMM shown in the figure.
The figure suggests that the Gaussians of a language k are moved away from the corresponding Gaussians of the other languages along different directions, and with different shift sizes. These directions are the ones that optimize the discrimination of that language in the KL space, i.e. the directions that maximize the distance of the GMM of language k from its competitor GMMs.
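The shift described above can be sketched as follows. The exact rescaling used by the authors is not given in this transcript; the sketch assumes it is the inverse of a sqrt(w_m) / sigma_m scaling, so that moving the means by alpha times the rescaled components is the same as moving the KL-space supervector by alpha * w. The function name and the array layout are illustrative.

```python
import numpy as np

def shift_means(means, w_kl, weights, variances, alpha):
    """Shift the Gaussians of a language GMM along its discriminative direction.

    means, variances: (M, D); weights: (M,).
    w_kl: (M * D,) normal vector of the SVM hyperplane in the KL space.
    alpha: shift size (to be tuned on a development set).
    """
    M, D = means.shape
    # Assumed rescaling: undo the sqrt(w_m) / sigma_m scaling so that each
    # per-Gaussian block of w is expressed in acoustic feature units.
    direction = w_kl.reshape(M, D) * np.sqrt(variances) / np.sqrt(weights)[:, None]
    return means + alpha * direction
```

Because every Gaussian has its own weight and variances, the same alpha produces shifts of different size and direction for different Gaussians, as the figure suggests.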

16 Rules for selection of αk
A discriminative GMM moves away from its original (MAP adapted) model, which best matches the training (and test) data
A large value of αk (shift size) gives a more discriminative model, but a worse likelihood than less discriminative models
Use a development set for estimating α

The first rule derives from this observation. This is also true for the MMIE framework, which optimizes a discrimination function, not the likelihood of the model.

17 Experiments with 2048 GMMs
Pooled EER (%) of Discriminative 2048 GMMs and GMM-SVM on the NIST LRE tasks. In parentheses, the average of the EERs of each language.

Year   Discriminative GMMs          GMM-SVM
       3s          10s              30s
1996   (13.71)     3.62 (4.92)      1.01 (1.37)
2003   (14.40)     5.50 (6.02)      1.42 (1.64)
2005   (17.85)     9.73 (11.07)     4.67 (5.81)

256-MMI (Brno University, 2006 IEEE Odyssey)
2005   17.1        8.6              4.6

To enable comparison with previously reported results, this table shows pooled EER scores and, in parentheses, the average of the per-language EERs. The gender-independent 2048-Gaussian GMM-SVM on the 30s tests, and the gender-dependent Discriminative GMMs on the shorter-duration tests, achieve performance comparable to that reported by Brno University, which is among the best presented so far on this database for an acoustic-only system.

18 Pushed GMMs (MIT-LL)
Another method, based on the use of the support vectors identified by the SVM, has been proposed in [8] for creating discriminative models. Since a support vector corresponds to a GMM supervector, the location of the positive boundaries of the SVM can be modeled by a weighted combination of the support vectors associated with positive Lagrange multipliers.
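A minimal sketch of this weighted combination, under stated assumptions: the SVM solver is assumed to expose signed dual coefficients (multiplier times class label), the weights are normalized to sum to one, and a symmetric "anti-model" is built from the negative-class support vectors. None of these details is confirmed by the slide, and the function name is illustrative.

```python
import numpy as np

def pushed_supervectors(support_vectors, dual_coefs):
    """Combine support vectors into a pushed model and anti-model.

    support_vectors: (N, K) GMM supervectors selected by the SVM.
    dual_coefs: (N,) signed coefficients, positive for the target language.
    """
    pos = dual_coefs > 0
    neg = dual_coefs < 0
    # Normalized weights (an assumption) keep the result in the
    # convex hull of the corresponding support vectors.
    pos_w = dual_coefs[pos] / dual_coefs[pos].sum()
    neg_w = dual_coefs[neg] / dual_coefs[neg].sum()
    model = pos_w @ support_vectors[pos]
    anti_model = neg_w @ support_vectors[neg]
    return model, anti_model
```

The resulting supervectors can be unstacked into GMM means, so that scoring is done with ordinary GMM likelihoods rather than with the SVM.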

19 Language Factors
Eigenvoice modeling, and the use of speaker factors as input features to SVMs, has recently been demonstrated to give good results for speaker recognition compared to the standard GMM-SVM approach (Dehak et al. ICASSP 2009).
By analogy: estimate an eigen-language space, and use the language factors as input features to SVM classifiers (Castaldo et al. submitted to Interspeech).

20 Language Factors: advantages
Language factors are low-dimensional vectors
Training and evaluating SVMs with different kernels is easy and fast: it requires the dot product of normalized language factors
Using a very large number of training examples is feasible
Small models give good performance

21 Toward an eigen-language space
After compensation of the nuisances of a GMM adapted from the UBM using a single utterance, residual information about the channel and the speaker remains. However, most of the undesired variation is removed, as demonstrated by the improvements obtained using this technique.

22 Speaker compensated eigenvoices
First approach
Estimating the principal directions of the GMM supervectors of all the training segments before inter-speaker nuisance compensation would produce a set of language independent, "universal" eigenvoices.
After nuisance removal, however, the speaker contribution to the principal components is reduced to the benefit of language discrimination.

23 Eigen-language space
Second approach
Computing the differences between the GMM supervectors obtained from utterances of a polyglot speaker would compensate the speaker characteristics and would enhance the acoustic components of a language with respect to the others.
We do not have labeled databases including polyglot speakers. However, we could in principle compute and collect the differences between GMM supervectors produced by utterances of speakers of two different languages, irrespective of the speakers' identity, which should have been already compensated in the feature domain.

24 Eigen-language space
The number of these differences would grow with the square of the number of utterances in the training set.
Perform Principal Component Analysis on the set of the differences between the supervectors of a language and the average supervector of every other language.
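The PCA step described above can be sketched as follows. This is an illustration, not the authors' implementation: the function name is hypothetical, and performing the PCA through an SVD of the uncentered difference matrix is an assumption.

```python
import numpy as np

def eigen_language_basis(supervectors, labels, rank):
    """Estimate an eigen-language basis from supervector differences.

    supervectors: (N, K) compensated GMM supervectors.
    labels: (N,) language identifier of each supervector.
    Returns a (K, rank) orthonormal basis and the leading eigenvalues.
    """
    languages = np.unique(labels)
    centroids = {l: supervectors[labels == l].mean(axis=0) for l in languages}
    diffs = []
    for l in languages:
        own = supervectors[labels == l]
        for other in languages:
            if other != l:
                # Differences against the average supervector of another
                # language grow linearly, not quadratically, with N.
                diffs.append(own - centroids[other])
    diffs = np.vstack(diffs)
    # PCA via SVD of the (uncentered) difference matrix.
    _, s, vt = np.linalg.svd(diffs, full_matrices=False)
    eigvals = s ** 2 / diffs.shape[0]
    return vt[:rank].T, eigvals[:rank]
```

Projecting a supervector onto the returned basis gives a low-dimensional vector that plays the role of the language factors.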

25 Training corpora
The same used for the LRE07 evaluation
All data of the 12 languages in the Callfriend corpus
Half of the NIST LRE07 development corpus
Half of the OHSU corpus provided by NIST for LRE05
The Russian through Switched Telephone Network corpus
  Automatic segmentation

26 Eigenvalues of two language subspaces
The language subspace has higher eigenvalues, and both curves show a sharp decrease for their first 13 eigenvalues, corresponding to the main language discrimination directions, whereas the remaining eigenvalues decrease slowly, indicating that the corresponding directions still contribute to language discrimination, even if they probably still account for residual channel and speaker characteristics.

27 Language factors' minDCF is always better and more stable
LRE07 30s closed set test
10% worse at 100 than at 600

28 Pushed GMMs (MIT-LL)

29 Pushed eigen-language GMMs
The same approach to obtain discriminative GMMs from the language factors

30 Min DCFs and (%EER)

Models                                     30s            10s            3s
GMM-SVM (KL kernel)                        0.029 (3.43)   0.085 (9.12)   0.201 (21.3)
GMM-SVM (Identity kernel)                  0.031 (3.72)   0.087 (9.51)   0.200 (21.0)
LF-SVM (KL kernel)                         0.026 (3.13)   0.083 (9.02)   0.186 (20.4)
LF-SVM (Identity kernel)                         (3.11)         (9.13)   0.187
Discriminative GMMs                        0.021 (2.56)   0.069 (7.49)   0.174 (18.45)
LF-Discriminative GMMs (KL kernel)         0.025 (2.97)   0.084 (9.04)         (19.9)
LF-Discriminative GMMs (Identity kernel)         (3.05)         (9.05)         (20.0)

The results of our reference system were obtained using the KL kernel in the GMM-SVM approach, and are shown in the first row of Table 1. The first set of experiments was done to evaluate the performance of the identity kernel SVM classifiers as a function of the modeling eigen-language matrices.

31 Loquendo-Polito LRE09 System
Model training
Acoustic features: SVM-GMMs, Pushed GMMs, MMIE GMMs
Phonetic transcriber: N-gram counts, TFLLR SVM

32 Phonetic models
ASR Recognizer
Output layer states for the language dependent phonetic units
  Stationary units
  Diphone units
Phone-loop grammar with diphone transition constraints

33 Phone transcribers
ASR Recognizer
Phone-loop grammar with diphone transition constraints
12 phone transcribers for French, German, Greek, Italian, Polish, Portuguese, Russian, Spanish, Swedish, Turkish, UK and US English
The statistics of the n-gram phone occurrences are collected from the best decoded string of each conversation segment

34 Phone transcribers
ANN models
Same phone-loop grammar, different engine
10 phone transcribers for Catalan, French, German, Greek, Italian, Polish, Portuguese, Russian, Spanish, Swedish, Turkish, UK and US English
The statistics of the n-gram phone occurrences are collected from the expected counts from a lattice of each conversation segment

35 Multigrams
Two different TFLLR kernels
  trigrams
  pruned multigrams
Multigrams can provide useful information about the language by capturing "word parts" within the string sequences

36 Pruned Multigrams
For each phonetic transcriber, we discard all the n-grams appearing in the training set less than 0.05% of the average occurrence of the unigrams

Total number of n-grams for 12 language transcribers
N-gram order   1     2       3        4        5       6
Pruned         461   11477   120200   114396   10738   443
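The pruning rule above can be sketched as follows, for one phonetic transcriber. The function names and the representation of n-grams as tuples of phone labels are assumptions made for the example; the 0.05% ratio is the one stated on the slide.

```python
from collections import Counter

def count_ngrams(phones, max_order=6):
    """Count all n-grams up to max_order in a decoded phone sequence."""
    counts = Counter()
    for n in range(1, max_order + 1):
        for i in range(len(phones) - n + 1):
            counts[tuple(phones[i:i + n])] += 1
    return counts

def prune_ngrams(counts, ratio=0.0005):
    """Discard n-grams seen less than `ratio` times the average unigram count.

    counts: mapping from n-gram tuples to training-set occurrences.
    ratio: 0.05% expressed as a fraction.
    """
    unigram_counts = [c for gram, c in counts.items() if len(gram) == 1]
    threshold = ratio * sum(unigram_counts) / len(unigram_counts)
    return {gram: c for gram, c in counts.items() if c >= threshold}
```

In practice the counts of all training segments of a transcriber would be accumulated into a single `Counter` before pruning.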

37 Scoring
The total number of models that we use for scoring an unknown segment is 34:
  11 channel dependent models (11 x 2)
  12 single channel models (2 telephone and 10 broadcast models only)
23 x 2 for MMIE GMMs (channel independent but M/F)

38 Calibration and fusion
Diagram sub-systems: Pushed GMMs, MMIE GMMs, 1-best 3-grams SVMs, 1-best n-grams SVMs, Lattice n-grams SVMs
Diagram blocks: max of the channel dependent scores, G. back-end, Multi-class FoCal, LLR, max, lre_detection
Numbers shown in the diagram: 34, 46, 23

39 Language pair recognition
For the language-pair evaluation only the back-ends have been re-trained, keeping unchanged the models of all the sub-systems.

40 Telephone development corpora
CALLFRIEND - Conversations split into slices of 150s
NIST 2003 and NIST 2005
LRE07 development corpus
Cantonese and Portuguese data in the 22 Language OGI corpus
RuSTeN - The Russian through Switched Telephone Network corpus

41 "Broadcast" development corpora
Incrementally created to include, as far as possible, the variability within a language due to channel, gender and speaker differences
The development data, further split into training, calibration and test subsets, should cover the mentioned variability

42 Problems with LRE09 dev data
Often same speaker segments
Scarcity of segments for some languages after filtering same speaker segments
Genders are not balanced
Excluding "French", the segments of all languages are either telephone or broadcast
No audited data available for Hindi, Russian, Spanish and Urdu on VOA3, only automatic segmentation was provided
No segmentation was provided in the first release of the development data for Cantonese, Korean, Mandarin, and Vietnamese
For these 8 missing languages only the language hypotheses provided by BUT were available for VOA2 data

43 Additional "audited" data
For the 8 languages lacking broadcast data, segments have been generated by accessing the VOA site and looking for the original MP3 files
Goal: collect ~300 broadcast segments per language, processed to detect narrowband fragments
The candidates were checked to eliminate segments including music, bad channel distortions, and fragments of other languages

44 Development data for bootstrap models
Telephone and audited/checked broadcast data
  Training (50 %)
  Development (25 %)
  Test (25 %)
The segments were distributed to these sets so that same speaker segments were included in the same set
A set of acoustic (pushed GMMs) bootstrap models has been trained

The telephone and the audited/checked broadcast data were evenly split into a train and a development set. The development set has been further divided into two parts, one devoted to estimating the back-end parameters and the other to testing.
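One simple way to obtain such a speaker-disjoint 50/25/25 split is sketched below. The greedy strategy, the set names, and the input format (pairs of segment id and speaker id) are assumptions; the talk does not say how the assignment was actually performed.

```python
from collections import defaultdict

def split_by_speaker(segment_speakers,
                     targets=(("train", 0.5), ("dev", 0.25), ("test", 0.25))):
    """Distribute segments so that each speaker falls in exactly one set.

    segment_speakers: list of (segment_id, speaker_id) pairs.
    targets: desired share of the segments for each set.
    """
    groups = defaultdict(list)
    for seg, spk in segment_speakers:
        groups[spk].append(seg)
    total = len(segment_speakers)
    sets = {name: [] for name, _ in targets}
    # Largest speakers first, each sent to the set that is currently
    # furthest below its target share.
    for spk in sorted(groups, key=lambda s: -len(groups[s])):
        name = max(targets, key=lambda t: t[1] - len(sets[t[0]]) / total)[0]
        sets[name].extend(groups[spk])
    return sets
```

With many small speakers the resulting proportions approach the targets; with a few very large speakers they can deviate, since a speaker is never split.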

45 Additional not-audited data from VOA3
Preliminary tests with the bootstrap models indicated the need for additional data
Selected from VOA3 to include new speakers in the train, calibration and test sets, assuming that the file label correctly identifies the corresponding language

46 Speaker selection
Performed by means of a speaker recognizer
We process the audited segments before the others
A new speaker model is added to the current set of speaker models whenever the best recognition score obtained by a segment is less than a threshold
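The incremental procedure above can be sketched as a single greedy pass. The `enroll` and `score` callables stand in for the speaker recognizer and are hypothetical, as is the choice of not updating a speaker model when a new segment is assigned to it.

```python
def select_speakers(segments, enroll, score, threshold):
    """Assign each segment to a known speaker or create a new speaker model.

    segments: iterable of segments, audited ones first.
    enroll(segment) -> speaker model (hypothetical callable).
    score(model, segment) -> recognition score, higher means more similar.
    Returns the list of speaker indices, one per segment.
    """
    models, assigned = [], []
    for seg in segments:
        scores = [score(m, seg) for m in models]
        best = max(range(len(models)), key=scores.__getitem__, default=None)
        if best is None or scores[best] < threshold:
            # No existing model matches well enough: this is a new speaker.
            models.append(enroll(seg))
            assigned.append(len(models) - 1)
        else:
            assigned.append(best)
    return assigned
```

The returned indices can then be used to keep all segments of the same (hypothesized) speaker in the same subset.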

47 Additional not-audited data from VOA2
Enriching the training set
Language recognition has been performed using a system combining the acoustic bootstrap models and a phonetic system
A segment has been selected only if the 1-best language hypothesis of our system
  had an associated score greater than a given (rather high) threshold
  matched the 1-best hypothesis provided by the BUT system
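The two-condition selection rule can be written as a small predicate. The function name and the dictionary of per-language scores are assumptions made for the example; the threshold value is not given in the talk.

```python
def keep_segment(our_scores, but_hypothesis, threshold):
    """Decide whether a not-audited segment enters the training set.

    our_scores: dict mapping language name to the score of our system.
    but_hypothesis: 1-best language provided by the BUT system.
    threshold: (rather high) minimum score for our 1-best hypothesis.
    """
    best_language = max(our_scores, key=our_scores.get)
    confident = our_scores[best_language] > threshold
    # Both conditions must hold: high confidence and agreement with BUT.
    return confident and best_language == but_hypothesis
```

For example, `keep_segment({"hindi": 3.2, "urdu": 1.0}, "hindi", 2.0)` accepts the segment, while the same scores with a BUT hypothesis of `"urdu"` reject it.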

48 Total number of segments for this evaluation

Set              voa3_A  voa2_A  ftp_C  voa3_S  voa2_S  ftp_S
Train               529     116    316    1955     590     66
Extended train      114      22     65    2483     574    151
Development         396      85    329    1866     449     45

Suffix: A = audited, C = checked, S = automatic segmentation
ftp:

49 Hausa - Decision Cost Function

50 Hindi - Decision Cost Function

51 Results on the development set
Average minDCF x 100 on 30s test segments

Test on                 Pushed GMMs   MMIE   3-grams   Multigrams   Lattice   Fusion
Broadcast & telephone   1.48          1.70   1.09      1.12         1.06      0.86
Broadcast subset        1.54          1.69   1.24      1.26         1.14      0.91
Telephone               2.00          2.51   1.45      1.49         1.42      1.21

52 Korean - score cumulative distribution
Curves: b-b, t-t, t-b, b-t

