Advances in WP2 Torino Meeting – 9-10 March
2 Activities on WP2 since last meeting
Study of innovative NN adaptation methods
–Models: Linear Hidden Networks
Test on project adaptation corpora:
–WSJ0 Adaptation component
–WSJ1 Spoke-3 component
–Hiwire Non-Native Corpus
3 Speech Databases for Speaker Adaptation
WSJ0: (standard ARPA, 1993, LDC, 1000$)
–Large vocabulary (5K words) continuous speech database
–Test set: 8 speakers, ~40 utterances each, read speech, bigram LM
–Adaptation set: the same 8 speakers, 40 utterances each
WSJ1: (1994, LDC, 1500$)
–Similar to WSJ0, same vocabulary and LM
–SPOKE-3: standard case study of adaptation to non-native speakers
–10 speakers, 40 adaptation utterances and 40 test utterances per speaker
Hiwire Non-Native Speaker database:
–Collected within the project
–80 speakers, each reads 100 utterances
4 LIN Adaptation for HMM/NN
LIN means “linear input network”
LIN is a classical technique for speaker and channel adaptation in HMM/NN systems [Neto 1996]
The LIN is placed before an MLP already trained in a speaker-independent way (SI-MLP)
The input space is rotated by a linear transform, to bring the target conditions closer to the training conditions
The linear transform is implemented with a linear neural network inserted between the input layer and the 1st hidden layer
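The placement of the LIN can be sketched in a few lines of Python. This is only an illustration: the toy SI-MLP below has a single hidden layer and made-up weights (`W1`, `b1`, `W2`, `b2`), and the names `si_mlp` and `lin_mlp` are hypothetical, not taken from the project software. It shows that a LIN set to the identity transform leaves the SI-MLP outputs unchanged.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def add(a, b):
    """Element-wise sum of two vectors."""
    return [u + v for u, v in zip(a, b)]

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-u)) for u in v]

def softmax(v):
    m = max(v)
    e = [math.exp(u - m) for u in v]
    s = sum(e)
    return [u / s for u in e]

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

# Toy SI-MLP (assumed, made-up weights): 3 inputs -> 4 hidden units -> 2 classes.
W1 = [[0.2, -0.1, 0.4], [0.5, 0.3, -0.2], [-0.3, 0.1, 0.1], [0.0, 0.6, -0.5]]
b1 = [0.0, 0.1, -0.1, 0.2]
W2 = [[0.3, -0.4, 0.2, 0.1], [-0.2, 0.5, 0.1, -0.3]]
b2 = [0.05, -0.05]

def si_mlp(x):
    """Speaker-independent MLP: input -> hidden -> class posteriors."""
    h = sigmoid(add(matvec(W1, x), b1))
    return softmax(add(matvec(W2, h), b2))

def lin_mlp(x, A, c):
    """Apply the LIN (x -> A x + c) first, then the unchanged SI-MLP."""
    return si_mlp(add(matvec(A, x), c))

x = [0.7, -1.2, 0.3]                 # one hypothetical feature frame
A, c = identity(3), [0.0, 0.0, 0.0]  # identity LIN: no adaptation yet
print(si_mlp(x))
print(lin_mlp(x, A, c))              # same posteriors as the line above
```

Any other choice of `A` and `c` changes what the SI-MLP sees at its input without touching its own weights.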
5 LIN Adaptation
[Figure: the speech signal parameters feed the LIN, which is followed by the speaker-independent MLP (SI-MLP: input layer, 1st hidden layer, 2nd hidden layer, output layer); the output layer gives the emission probabilities of the acoustic-phonetic units.]
6 LIN Training
The global SI-MLP+LIN system is trained with speech material from the target speaker
The LIN is initialized with an identity matrix
LIN weights are trained with error back-propagation through the global net
The original NN weights are kept frozen
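The training recipe above can be sketched as follows. This is a minimal illustration under stated assumptions: a toy frozen SI-MLP with one hidden layer, no biases and made-up weights (`W1`, `W2`), a cross-entropy loss, plain stochastic gradient descent, and hypothetical adaptation data. The names `forward`, `loss` and `train_lin` are illustrative, not the project's implementation.

```python
import math

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def matTvec(W, d):
    """Multiply the transpose of W by vector d."""
    return [sum(W[i][j] * d[i] for i in range(len(W))) for j in range(len(W[0]))]

def softmax(v):
    m = max(v)
    e = [math.exp(u - m) for u in v]
    s = sum(e)
    return [u / s for u in e]

# Frozen toy SI-MLP (made-up weights): 2 inputs -> 3 sigmoid units -> 2 classes.
W1 = [[1.5, -0.5], [-1.0, 1.0], [0.5, 0.5]]
W2 = [[1.0, -1.0, 0.5], [-1.0, 1.0, -0.5]]

def forward(A, x):
    z = matvec(A, x)                                         # LIN output
    h = [1.0 / (1.0 + math.exp(-u)) for u in matvec(W1, z)]  # hidden layer
    p = softmax(matvec(W2, h))                               # class posteriors
    return z, h, p

def loss(A, data):
    """Average cross-entropy of the SI-MLP+LIN system on the data."""
    return -sum(math.log(forward(A, x)[2][t]) for x, t in data) / len(data)

def train_lin(data, epochs=200, lr=0.1):
    A = [[1.0, 0.0], [0.0, 1.0]]                             # identity initialization
    for _ in range(epochs):
        for x, t in data:
            z, h, p = forward(A, x)
            d_out = [p[k] - (1.0 if k == t else 0.0) for k in range(len(p))]
            d_h = [g * a * (1.0 - a) for g, a in zip(matTvec(W2, d_out), h)]
            d_z = matTvec(W1, d_h)                           # error at the LIN output
            # Only the LIN weights are updated; W1 and W2 stay frozen.
            for i in range(len(A)):
                for j in range(len(x)):
                    A[i][j] -= lr * d_z[i] * x[j]
    return A

# Hypothetical adaptation data: (feature vector, target class index).
data = [([1.0, 0.2], 0), ([0.8, 0.4], 0), ([0.3, 0.9], 1), ([0.1, 1.0], 1)]
identity = [[1.0, 0.0], [0.0, 1.0]]
A = train_lin(data)
print(loss(identity, data), loss(A, data))
```

The error signal has to travel through the whole frozen network before it reaches the LIN, which is why the slide speaks of back-propagation through the global net.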
7 LHN Adaptation
LHN means “linear hidden network”
The activations of the last hidden layer are linearly transformed to improve acoustic matching of the adaptation material
The activation values of a hidden layer give an internal representation of the input pattern in a space more suitable for classification and adaptation
The linear transform is implemented with a linear neural network layer inserted between the last hidden layer and the output layer
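8 LHN Adaptation
[Figure: the speech signal parameters feed the speaker-independent MLP (SI-MLP: input layer, 1st hidden layer, 2nd hidden layer, output layer), with the LHN inserted between the last hidden layer and the output layer; the output layer gives the emission probabilities of the acoustic-phonetic units.]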
9 LHN Training
The global SI-MLP+LHN system is trained with speech material from the target speaker
The LHN is initialized with an identity matrix
LHN weights are trained with error back-propagation through the last layer of weights
The original NN weights are kept frozen
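A minimal sketch of LHN training, under the same kind of assumptions as any toy example: a frozen SI-MLP with one hidden layer, no biases and made-up weights (`W1`, `W2`), a cross-entropy loss, plain stochastic gradient descent, and hypothetical adaptation data. The names `hidden`, `posteriors` and `train_lhn` are illustrative only. Because the layers below the LHN are frozen, the hidden activations can be computed once, and the error only has to be propagated back through the output weights.

```python
import math

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def softmax(v):
    m = max(v)
    e = [math.exp(u - m) for u in v]
    s = sum(e)
    return [u / s for u in e]

# Frozen toy SI-MLP (made-up weights): 2 inputs -> 3 sigmoid units -> 2 classes.
W1 = [[1.5, -0.5], [-1.0, 1.0], [0.5, 0.5]]
W2 = [[1.0, -1.0, 0.5], [-1.0, 1.0, -0.5]]

def hidden(x):
    """Activations of the last hidden layer (never changed by adaptation)."""
    return [1.0 / (1.0 + math.exp(-u)) for u in matvec(W1, x)]

def posteriors(A, h):
    """Apply the LHN (h -> A h), then the frozen output layer."""
    return softmax(matvec(W2, matvec(A, h)))

def loss(A, feats):
    return -sum(math.log(posteriors(A, h)[t]) for h, t in feats) / len(feats)

def train_lhn(feats, epochs=200, lr=0.1):
    n = len(W1)
    A = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # identity
    for _ in range(epochs):
        for h, t in feats:
            p = posteriors(A, h)
            d_out = [p[k] - (1.0 if k == t else 0.0) for k in range(len(p))]
            # Back-propagate through the output weights only.
            d_g = [sum(W2[k][i] * d_out[k] for k in range(len(W2))) for i in range(n)]
            # Only the LHN weights are updated; W1 and W2 stay frozen.
            for i in range(n):
                for j in range(n):
                    A[i][j] -= lr * d_g[i] * h[j]
    return A

# Hypothetical adaptation data: (feature vector, target class index).
data = [([1.0, 0.2], 0), ([0.8, 0.4], 0), ([0.3, 0.9], 1), ([0.1, 1.0], 1)]
feats = [(hidden(x), t) for x, t in data]   # computed once, lower layers are frozen
eye = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
A = train_lhn(feats)
print(loss(eye, feats), loss(A, feats))
```

In this toy setting the adapted matrix has as many rows and columns as there are units in the last hidden layer, so its size depends on the hidden layer width rather than on the input feature dimension.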
10 Paper at ICASSP 2006
ADAPTATION OF HYBRID ANN/HMM MODELS USING LINEAR HIDDEN TRANSFORMATIONS AND CONSERVATIVE TRAINING
Roberto Gemello, Franco Mana, Stefano Scanzio, Pietro Laface and Renato De Mori
11 WSJ0 LIN-LHN Adaptation
Train: standard WSJ0 SI-84 train set, 16 kHz
SI Test: 8 speakers and ~40 sentences for each speaker
Vocabulary: 5K words, with a standard bigram LM
Adaptation: the same 8 speakers as the SI test, with 40 adaptation sentences for each of them
[Table: results per speaker (WV1_440, WV1_441, WV1_442, WV1_443, WV1_444, WV1_445, WV1_446, WV1_447) and on average, for the Baseline, LIN Adapt and LHN Adapt models. Average E.R.: LIN Adapt 10.5%, LHN Adapt 20.0%.]
12 WSJ1 – SPOKE-3 LIN-LHN Adaptation
Spoke-3 is the standard WSJ1 case study to evaluate adaptation to non-native speakers
There are 10 non-native speakers (40 adaptation sentences and ~40 test sentences)
Train: standard WSJ0 SI-84 train set, 16 kHz
Vocabulary is 5K words, with standard bigram LM
[Table: results per speaker (4N0, 4N1, 4N3, 4N4, 4N5, 4N8, 4N9, 4NA, 4NB, 4NC) and on average, for the Baseline, LIN Adapt and LHN Adapt models. Average E.R.: LIN Adapt 14.2%, LHN Adapt 43.5%.]
THE FEMALE PRODUCES A LITTER OF TWO TO FOUR YOUNG IN NOVEMBER AND DECEMBER
13 Comments on WSJ0 – WSJ1 Results
LIN does work for speaker adaptation: E.R. 10.5% on WSJ0 and 14.2% on WSJ1
However, with LIN performance in some cases does not improve, or even gets worse
LHN is a more powerful method: E.R. 20.0% on WSJ0 and 43.5% on WSJ1
With LHN performance always improves
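For reading the E.R. figures, a small sketch may help. It assumes that E.R. stands for the relative reduction of the error rate with respect to the baseline, which the slides do not spell out. The function name `error_reduction` and the two error rates in the example are hypothetical and are not results from the project.

```python
def error_reduction(baseline_error, adapted_error):
    """Relative error reduction in percent (assumed meaning of E.R.)."""
    return 100.0 * (baseline_error - adapted_error) / baseline_error

# Hypothetical numbers: a baseline error rate of 12.0% reduced to 9.0%.
print(error_reduction(12.0, 9.0))   # 25.0
```

Under this reading, a negative value would mean that adaptation made the error rate worse than the baseline.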
14 Hiwire Non-Native Corpus (1)
The database consists of English sentences uttered by non-native speakers.
The speakers are of French, Italian, Greek and Spanish origin (plus an additional set of extra-European speakers).
The uttered sentences belong to a command language used by aircraft pilots.
The vocabulary contains 134 words.
Each speaker pronounced one list of 100 sentences.
15 Hiwire Non-Native Corpus (2)
Corpus composition:
–French speakers: 31
–Italian speakers: 20
–Greek speakers: 20
–Spanish speakers: 10
–World speakers: 10
16 Experimental conditions
Starting models:
–standard Loquendo ASR EN-US
–Telephone models (8 kHz)
–Training set: LDC Macrophone
Adaptation: first 50 utterances of each speaker
Test: last 50 utterances of each speaker
LM: Hiwire grammar (134-word vocabulary)
Signal processing: down-sampling to 8 kHz
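The adaptation/test split described above can be sketched as follows. The utterance identifiers and the function name `split_speaker` are hypothetical; the only facts taken from the slides are that each speaker has 100 utterances, the first 50 used for adaptation and the last 50 for test.

```python
def split_speaker(utterances, n_adapt=50):
    """Split one speaker's ordered utterance list into adaptation and test parts."""
    return utterances[:n_adapt], utterances[n_adapt:]

# Hypothetical utterance identifiers for a single speaker.
utts = [f"utt_{i:03d}" for i in range(1, 101)]
adapt, test = split_speaker(utts)
print(len(adapt), len(test))   # 50 50
```

Keeping the two halves disjoint means that no sentence used for adaptation is reused when measuring recognition performance.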
17 Results on Hiwire corpus
Recognition model: ANN/HMM
Adaptation model: LIN - LHN
[Table: columns are Nationality, # of speakers, Default models, Adapted LIN and Adapted LHN, with WA and ER % sub-columns; rows are French, Italian, Greek, Spanish, World and Total.]
18 Discussion
Acoustic model adaptation is also effective for non-native speakers
State-of-the-art LIN is a feasible and practical way to adapt hybrid NN-HMM models
LHN (transformation of hidden layer activations) is a new NN adaptation method introduced in the project
LHN outperforms LIN
19 Experiments for year 2
Speaker adaptation tests on project test sets:
–WSJ0
–WSJ1 Spoke-3
–Hiwire non-native
Tests with different techniques:
–LIN
–New NN adaptation methods
20 Workplan
Selection of suitable benchmark databases (m6)
Baseline set-up for the selected databases (m8)
LIN adaptation method implemented and tested on the benchmarks (m12)
Experimental results on Hiwire database with LIN (m18)
Innovative NN adaptation methods and algorithms for acoustic modeling and experimental results (m21)
Further advances on new adaptation methods (m24)
Unsupervised adaptation: algorithms and experimentation (m33)