HELSINKI UNIVERSITY OF TECHNOLOGY
LABORATORY OF COMPUTER AND INFORMATION SCIENCE
NEURAL NETWORKS RESEARCH CENTRE

The development of the HTK Broadcast News transcription system: An overview
Paper by P. C. Woodland. Appeared in Speech Communication 37 (2002).
T Audio Mining, 17 October 2002.
HELSINKI UNIVERSITY OF TECHNOLOGY NEURAL NETWORKS RESEARCH CENTRE 17 October 2002 Mathias Creutz

Motivation
Transcription of broadcast radio and television news is challenging:
  – different speech styles: read, spontaneous, conversational speech
  – native and non-native speakers
  – high- and low-bandwidth channels
  – with or without background music or other background noise
Solving these problems is of great utility in more general tasks.
Procedure
Based on the HTK large vocabulary speech recognition system
  – by Woodland, Leggetter, Odell, Valtchev, Young, Gales, Pye (Cambridge University, Entropic Ltd., 1994)
Developed and evaluated in the NIST/DARPA Broadcast News and TREC SDR evaluations
  – 1996 (DARPA BN)
  – 1997 (DARPA BN)
  – 1998 (DARPA BN)
  – 1998, 10 x real time (TREC 7 spoken document retrieval)
  – 1999, 10 x real time (TREC 8 spoken document retrieval)
Standard HTK LVSR system (1)
Acoustic feature extraction
  – Initially designed for clean speech tasks
  – Standard Mel-frequency cepstral coefficients (MFCC): 39-dimensional feature vector
  – Cepstral mean normalization on an utterance-by-utterance basis
Pronunciation dictionary
  – Based on the LIMSI 1993 WSJ pronunciation dictionary
  – 46 phones
  – Vocabulary of 65k words
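Per-utterance cepstral mean normalization can be illustrated with a few lines of NumPy. This is only a minimal sketch, not the HTK implementation: the function name and the random "utterance" are made up for illustration, and normalizing all 39 dimensions alike is a simplifying assumption.

```python
import numpy as np

def cepstral_mean_normalize(features):
    """Subtract the utterance-level mean from every frame.

    features: array of shape (num_frames, num_coeffs), e.g. (T, 39).
    Assumption: all coefficients are normalized the same way.
    """
    features = np.asarray(features, dtype=float)
    return features - features.mean(axis=0, keepdims=True)

# Hypothetical utterance: 200 frames of 39-dimensional features with an offset.
rng = np.random.default_rng(0)
utterance = rng.normal(loc=3.0, scale=1.0, size=(200, 39))
normalized = cepstral_mean_normalize(utterance)
```

After normalization every coefficient has zero mean over the utterance, which removes a constant channel offset in the cepstral domain.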
Standard HTK LVSR system (2)
Acoustic modelling
  – Hidden Markov models (HMMs); states are implemented as Gaussian mixture models.
  – Embedded Baum-Welch re-estimation
  – Forced Viterbi alignment chooses between pronunciation variants in the dictionary, e.g., "the" = /ðeh/ or /ði/.
Transcription
  1. Monophones, e.g., "you speak" = sil j u sp s p i k sil
  2. Triphones: sil j+u j-u+s sp u-s+p s-p+i p-i+k i-k sil
  3. Quinphones: sil j+u+s j-u+s+p sp j-u-s+p+i u-s-p+i+k s-p-i+k p-i-k sil
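The monophone-to-triphone expansion in the example can be reproduced with a short sketch. The rules are inferred from the slide's example only (an assumption, not taken from the HTK tools): `sil` acts as a context boundary, while the short pause `sp` is kept in the output but skipped when looking for neighbours. The function names are hypothetical.

```python
def _neighbour(phones, i, step, boundary, skip):
    """Return the nearest context phone in direction `step`, or None."""
    j = i + step
    while 0 <= j < len(phones):
        if phones[j] == skip:       # short pause: look past it
            j += step
            continue
        if phones[j] == boundary:   # silence: no context across it
            return None
        return phones[j]
    return None

def to_triphones(phones, boundary="sil", skip="sp"):
    """Expand a monophone sequence into left-centre+right triphone names."""
    out = []
    for i, p in enumerate(phones):
        if p in (boundary, skip):
            out.append(p)           # sil and sp stay context-independent
            continue
        left = _neighbour(phones, i, -1, boundary, skip)
        right = _neighbour(phones, i, +1, boundary, skip)
        name = p
        if left is not None:
            name = f"{left}-{name}"
        if right is not None:
            name = f"{name}+{right}"
        out.append(name)
    return out

triphones = to_triphones("sil j u sp s p i k sil".split())
print(" ".join(triphones))
```

For the slide's input this yields the same triphone string as shown in item 2 above.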
Standard HTK LVSR system (3)
Language modelling (LM)
  – N-gram models using Katz backoff
  – Class-based models based on automatically derived word classes
  – Dynamic language models based on a cache model
Decoding
  – Time-synchronous decoders
  – Single pass, or generation or rescoring of word lattices
  – Early stages with triphone models and bigram or trigram LMs
  – Later stages with adapted quinphones and more advanced LMs
Standard HTK LVSR system (4)
Adaptation (new speaker or acoustic environment)
MLLR (Maximum Likelihood Linear Regression)
  – adjust Gaussian means (and optionally variances) in HMMs in order to increase the likelihood of the adaptation data
  – m_adapted = A · m_original + b
  – use a single, global transform for all Gaussians
  – or separate transforms for different clusters of Gaussians
  – can be used in combination with multi-pass decoding:
      1. decode
      2. adapt
      3. decode/rescore lattice with adapted models
      4. adapt again, etc.
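Applying a single global MLLR mean transform amounts to one affine map per Gaussian mean. The sketch below only shows this application step; estimating A and b from adaptation data is not covered, and the function name and toy values are hypothetical.

```python
import numpy as np

def apply_mllr_mean_transform(means, A, b):
    """Apply m_adapted = A @ m_original + b to every Gaussian mean.

    means: (num_gaussians, dim), A: (dim, dim), b: (dim,)
    """
    means = np.asarray(means, dtype=float)
    return means @ np.asarray(A, dtype=float).T + np.asarray(b, dtype=float)

# Toy values (not from the paper): two 3-dimensional Gaussian means.
means = np.array([[1.0, 2.0, 3.0],
                  [0.0, 0.0, 0.0]])
A = 2.0 * np.eye(3)
b = np.array([1.0, 0.0, -1.0])
adapted = apply_mllr_mean_transform(means, A, b)
```

With separate transforms for different clusters of Gaussians, the same function would simply be called once per cluster with that cluster's A and b.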
F-conditions
Data supplied by the Linguistic Data Consortium (LDC)
Part of the training data and all test data were hand labelled according to the "focus" or F-conditions:
  F0  Baseline broadcast speech (clean, planned)
  F1  Spontaneous broadcast speech (clean)
  F2  Low-fidelity speech (mainly narrowband)
  F3  Speech in the presence of background music
  F4  Speech under degraded acoustic conditions
  F5  Non-native speakers (clean, planned)
  FX  All other speech (e.g., spontaneous non-native)
Broadcast news data
  Training data:         35 hours + 37 hours + 71 hours
  Development test set:  2 shows + 4 shows
  Test set:              2 hours (4 shows), 3 hours (9 shows), 3 hours
  Text for LM:           (?) million words + transcr. of acoustic data million words
Green background = hand labelled according to F-conditions
Underlined text = pre-partitioned at the speaker turn
Initial experiments using 1995 BN data
Speech recognition: existing HTK SR system
  – Wall Street Journal (WSJ), triphone models, trigram LM
Data: broadcast news data from the radio show "Marketplace", marked according to
  – presence/absence of background music
  – full/reduced bandwidth
Goals:
  – Compare standard MFCC (Mel-Frequency Cepstral Coefficients) to MF-PLP (Mel-Frequency Perceptual Linear Prediction)
  – Try out unsupervised test-data MLLR adaptation
Results:
  – 12% word error rate reduction with MF-PLP
  – a further 26% using two-iteration MLLR adaptation
Good. Let's use these techniques!
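The two reductions can be compounded to estimate the overall gain. The sketch assumes that both figures are relative reductions and that the "further 26%" is measured against the MF-PLP result; the slide does not state this explicitly.

```python
def combined_relative_reduction(*reductions):
    """Compound successive relative error reductions (each in 0..1)."""
    remaining = 1.0
    for r in reductions:
        remaining *= (1.0 - r)   # fraction of the error that is left
    return 1.0 - remaining

# 12% from MF-PLP, then a further 26% from two-iteration MLLR adaptation.
total = combined_relative_reduction(0.12, 0.26)
print(f"Combined relative WER reduction: {100 * total:.1f}%")
```

Under that assumption the remaining error is 0.88 × 0.74 of the original, i.e. a combined relative reduction of roughly 35%.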
The 1996 BN evaluation (1)
Acoustic environment adaptation (training data)
  – Adapt WSJ triphone and quinphone models to each of the focus conditions using mean and variance MLLR, giving data-type specific model sets.
  – Automatically classify F2 (low-fidelity speech) as narrowband or wideband and adapt separate sets.
  – A couple of tricks for F5 (non-native) and FX (other speech), due to small amounts of data.
(Mainly) speaker adaptation (test data)
  – Unknown speaker identities.
  – Cluster (bottom-up) similar segments until sufficient data is available in each group, for robust unsupervised adaptation.
  – Each segment is represented by its mean and variance.
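One possible reading of the bottom-up segment clustering is sketched below. It is not the procedure from the paper: the distance measure (variance-scaled squared distance between means), the merge order (smallest cluster first) and the `min_frames` threshold are all assumptions chosen for illustration.

```python
import numpy as np

def merge(a, b):
    """Pool two clusters, combining frame counts, means and variances."""
    n = a["n"] + b["n"]
    mean = (a["n"] * a["mean"] + b["n"] * b["mean"]) / n
    second = (a["n"] * (a["var"] + a["mean"] ** 2)
              + b["n"] * (b["var"] + b["mean"] ** 2)) / n
    return {"n": n, "mean": mean, "var": second - mean ** 2,
            "members": a["members"] + b["members"]}

def distance(a, b):
    """Hypothetical distance: squared mean difference scaled by variance."""
    avg_var = 0.5 * (a["var"] + b["var"])
    return float(np.sum((a["mean"] - b["mean"]) ** 2 / avg_var))

def cluster_segments(segments, min_frames):
    """segments: list of (num_frames, mean_vector, variance_vector)."""
    clusters = [{"n": n, "mean": np.asarray(m, float),
                 "var": np.asarray(v, float), "members": [i]}
                for i, (n, m, v) in enumerate(segments)]
    while len(clusters) > 1:
        smallest = min(range(len(clusters)), key=lambda i: clusters[i]["n"])
        if clusters[smallest]["n"] >= min_frames:
            break                      # every cluster has enough data
        small = clusters.pop(smallest)
        nearest = min(range(len(clusters)),
                      key=lambda i: distance(small, clusters[i]))
        clusters[nearest] = merge(small, clusters[nearest])
    return clusters

# Toy segments: two acoustically distinct groups of two segments each.
segments = [(100, [0.0, 0.0], [1.0, 1.0]), (120, [0.1, -0.1], [1.0, 1.0]),
            (90, [5.0, 5.0], [1.0, 1.0]), (110, [5.2, 4.9], [1.0, 1.0])]
clusters = cluster_segments(segments, min_frames=200)
```

Each resulting cluster would then share one set of MLLR transforms estimated from its pooled data.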
The 1996 BN evaluation (2)
Language modelling
  – Static: bigram, trigram, 4-gram word-based LMs (Katz backoff)
  – Dynamic: unigram and bigram cache model (Woodland et al.)
      Based on last recognition output
      Operates on a per-show, per-focus-condition basis
      Includes future and previous words + other word forms with the same stem
      Excludes common words
      Interpolated with the static 4-gram language model
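A unigram cache interpolated with a static model can be sketched as follows. This is a simplified illustration only: the stem-based expansion and the bigram cache are left out, and the stop-word list, the static probability and the cache weight are hypothetical values, not figures from the paper.

```python
from collections import Counter

def cache_unigram(recent_words, stop_words):
    """Relative frequencies of recently recognized words, common words excluded."""
    counts = Counter(w for w in recent_words if w not in stop_words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}

def interpolate(p_static, p_cache, cache_weight):
    """Linear interpolation of the static LM and the cache model."""
    return (1.0 - cache_weight) * p_static + cache_weight * p_cache

# Hypothetical previous recognition output for one show.
recent = "the senate passed the budget and the senate adjourned".split()
cache = cache_unigram(recent, stop_words={"the", "and"})

# Hypothetical static 4-gram probability and cache weight.
p_senate = interpolate(p_static=0.001, p_cache=cache.get("senate", 0.0),
                       cache_weight=0.1)
```

Words that occurred recently in the same show get a boosted probability, while words absent from the cache fall back to a slightly discounted static estimate.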
The 1996 BN evaluation (3)
Multi-pass decoding:
  Pass        MLLR       HMMs     LM        %WER   %Imprv.
  P1          -          triph.   trigram   33.40
  P2          1 transf.  triph.   trigram
  P3          1 transf.  triph.   bigram
  P3 lat.rs.  1 transf.  triph.   fourgram
  P4 lat.rs.  1 transf.  quinph.  fourgram
  P5 lat.rs.  2 transf.  quinph.  fourgram
  P6 lat.rs.  4 transf.  quinph.  fourgram
  cache       4 transf.  quinph.  +cache
Towards the 1997 BN evaluation (1)
Information about data segmentation and type is not supplied.
Goal: compare the performance of condition-dependent and condition-independent models
Test data: 1996 development test set
Experiment with different acoustic models:
  1. Adapt WSJ models to each F-condition (cond.-dep.)
  2. Train models on 1996 BN training data (cond.-indep.)
  3. Train models on 1997 BN training data (cond.-indep.)
Results:
  – Condition-independent models are slightly better than adapted condition-dependent models! (WER: 32.0%, 31.7%, 29.6%)
Towards the 1997 BN evaluation (2)
Gender effect?
  – 2/3 male, 1/3 female speakers in BN data
  – 1/2 male, 1/2 female speakers in WSJ models
Use gender-dependent models:
  – the gender of the speakers in the data is known, so assume that perfect gender determination is possible
Results (1997 BN data):
  – Gender-indep.: All: 29.6%, Male: 28.8%, Female: 31.1%
  – Gender-dep.:   All: 28.1%, Male: 27.8%, Female: 28.8%
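The per-group gain from gender-dependent models is easier to compare as a relative reduction. The short calculation below uses only the WER figures from the slide; the helper name is hypothetical.

```python
def relative_reduction(baseline_wer, new_wer):
    """Relative WER reduction of new_wer with respect to baseline_wer."""
    return (baseline_wer - new_wer) / baseline_wer

# (gender-independent WER, gender-dependent WER) in percent, from the slide.
results = {"All": (29.6, 28.1), "Male": (28.8, 27.8), "Female": (31.1, 28.8)}
reductions = {group: relative_reduction(gi, gd)
              for group, (gi, gd) in results.items()}

for group, r in reductions.items():
    print(f"{group}: {100 * r:.1f}% relative WER reduction")
```

The female speakers, who are under-represented in the BN data, benefit most from the gender-dependent models.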
BN: Automatic segm. & clustering
Goal: convert the audio stream into clusters of reasonably sized homogeneous speech segments; each cluster shares a set of MLLR transforms.
  – The audio stream is first classified into 3 broad categories: wideband speech, narrowband speech, music (rejected).
  – Use a gender-dependent recognizer to locate silence portions and gender change points.
  – Cluster segments separately for each gender and bandwidth combination for use in MLLR adaptation.
Result: only 0.1% absolute higher WER than with manual segments.
The 1997 BN evaluation
Language modelling
  – Bigram, trigram, 4-gram word-based LMs (Katz backoff)
  – Category language model (Kneser & Ney ’93; Martin et al., ’95; Niesler et al., ’)
      automatically generated word classes based on word bigram statistics in the training set
      trigram model
  – Interpolation of word 4-gram and class trigram models, weights: 0.7 and 0.3
Hypothesis combination
  – Triphone and quinphone systems make different types of errors, so combine their results.
  – Use confidence scores and dynamic programming-based string alignment.
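The interpolation of the word 4-gram and the class trigram can be written out for a single word. The sketch assumes the usual class n-gram factorization, P(w | history) ≈ P(w | class of w) · P(class of w | previous two classes); the probabilities are toy values and the function names are hypothetical. Only the weights 0.7 and 0.3 come from the slide.

```python
def class_trigram_prob(p_word_given_class, p_class_given_history):
    """Class-based estimate: P(w | c(w)) * P(c(w) | c(w-2), c(w-1))."""
    return p_word_given_class * p_class_given_history

def interpolated_prob(p_word_4gram, p_class_trigram, w_word=0.7, w_class=0.3):
    """Linear interpolation of the word 4-gram and class trigram estimates."""
    return w_word * p_word_4gram + w_class * p_class_trigram

# Toy probabilities for one word in one context (not from the paper).
p_class = class_trigram_prob(p_word_given_class=0.1, p_class_given_history=0.3)
p_final = interpolated_prob(p_word_4gram=0.02, p_class_trigram=p_class)
```

The class model contributes a smoother estimate for word sequences that are rare or unseen in the word n-gram training data.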
The 1997 BN evaluation (2)
  Pass        MLLR         HMMs        LM                     %WER   %Imprv.
  P1          -            gi triph.   trigram                21.40
  P2          1 transf.    gd triph.   bigram
  P2 lat.rs.  1 transf.    gd triph.   trigram
  P2 lat.rs.  1 transf.    gd triph.   fourgram
  P2 lat.rs.  1 transf.    gd triph.   inp.w4c
  P3 lat.rs.  1 transf.    gd quin.    inp.w4c
  P4 lat.rs.  2 transf.    gd quin.    inp.w4c
  P5 lat.rs.  4 transf.    gd quin.    inp.w4c
  cache       4 transf.    gd quin.    +cache
  ROVER       1 tr./4 tr.  gd tri/qu.  inp.w4c conf. combine
The 1998 BN evaluation (1)
Vocal tract length normalization (VTLN)
  – Max. likelihood selection of best warp factor (parabolic search)
  – 0.4% lower absolute WER (MLLR-adapted quinphones)
Language modelling
  – Interpolate 3 separate word-based LMs (BN, newswire, acoustic data) instead of pooling them.
  – 0.5% lower absolute WER (adapted quinphones)
Full variance MLLR transforms
  – 0.2% lower absolute WER
Speaker-adaptive training
  – further 0.1% lower absolute WER (in combination with full variance transforms)
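The slide does not spell out the parabolic search, so the following is only a guess at its core step: fit a parabola through the log-likelihoods at three warp factors and take its peak as the next estimate. The likelihood function here is a toy stand-in (a real system would score the utterance with warped features), and all names are hypothetical.

```python
def parabola_peak(a, b, c, fa, fb, fc):
    """Abscissa of the extremum of the parabola through (a,fa), (b,fb), (c,fc)."""
    num = (b - a) ** 2 * (fb - fc) - (b - c) ** 2 * (fb - fa)
    den = (b - a) * (fb - fc) - (b - c) * (fb - fa)
    if den == 0:
        return b                 # points are collinear: keep the middle one
    return b - 0.5 * num / den

def toy_log_likelihood(warp):
    """Stand-in for the data log-likelihood, peaked at warp = 1.02."""
    return -(warp - 1.02) ** 2

warps = (0.9, 1.0, 1.1)
best_warp = parabola_peak(*warps, *(toy_log_likelihood(w) for w in warps))
```

Because the toy likelihood is itself a parabola, one step recovers its peak; with a real likelihood the step would be repeated around the new estimate.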
The 1998 BN evaluation (2)
  Pass        MLLR       HMMs          LM                     %WER   %Imprv.
  P1          -          gi -V trip.   trigram                19.90
  P2          -          gd +V trip.   trigram
  P3          1 tr. –FV  gd +V trip.   bigram
  P3 lat.rs.  1 tr. –FV  gd +V trip.   inp.w4c
  P4 lat.rs.  1 tr. –FV  gd +V qui.    inp.w4c
  P4 lat.rs.  1 tr. +FV  gd +V qui.    inp.w4c
  P6 lat.rs.  4 tr. +FV  gd +V qui.    inp.w4c
  ROVER       1-F/4+F    gd +V tr/q.   inp.w4c conf. combine
TREC 7 evaluation
Constraint: must operate in max. 10 x real time
1999 TREC 8: same architecture, larger vocab.
  Pass       MLLR     HMMs         LM        % 1997   % 1998
  P1         -        gi -V trip.  trigram
  P1 lat. 1  -        gi -V trip.  fourgram
  P2         1 tran.  gd -V trip.  trigram
  P2 lat. 1  1 tran.  gd -V trip.  inp.w4c
  Full time systems
Discussion and conclusion
The HTK system had either the lowest overall error rate in every evaluation or a value not significantly different from the lowest.
  – HTK was always the best for F0 speech (clean, planned).
  – In worse conditions, the applied adaptation methods were shown to significantly reduce the error.
Still a long way to go(?):
  – Word error rates for bulk transcription of BN data remain at about 20% for the best systems...
  – ...with very high WER for some audio conditions.
  – What about languages other than English?
Project work
Planned project: literature study on language models used in audio mining (broadcast news quality speech)
  – How do they work?
  – What is their contribution to the overall error reduction?
Home assignment
Briefly comment on the following claims in the light of Woodland's paper. (Simply answering true or false is not enough.)
  1. The better overall performance of the 1997 system compared to the 1996 system was mainly due to the doubling of the amount of training data.
  2. Triphone HMMs cannot be estimated unless there is a huge amount of training data available.
  3. Gender-dependent acoustic models are to be preferred over gender-independent models.
  4. Quinphone HMMs are not created through two-model re-estimation.
  5. MLLR (Maximum Likelihood Linear Regression) is an adaptation method that is sensitive to transcription errors.