HELSINKI UNIVERSITY OF TECHNOLOGY, LABORATORY OF COMPUTER AND INFORMATION SCIENCE, NEURAL NETWORKS RESEARCH CENTRE

The development of the HTK Broadcast News transcription system: An overview
Paper by P. C. Woodland, appeared in Speech Communication 37 (2002).
Presented by Mathias Creutz, T Audio Mining seminar, 17 October 2002.

Motivation

Transcription of broadcast radio and television news is challenging:
- different speech styles: read, spontaneous, and conversational speech
- native and non-native speakers
- high- and low-bandwidth channels
- with or without background music or other background noise
Solving these problems is of great utility in more general tasks.

Procedure

Based on the HTK large vocabulary speech recognition system
- by Woodland, Leggetter, Odell, Valtchev, Young, Gales, Pye; Cambridge University and Entropic Ltd., 1994
Developed and evaluated in the NIST/DARPA Broadcast News and TREC SDR evaluations:
- 1996 (DARPA BN)
- 1997 (DARPA BN)
- 1998 (DARPA BN)
- 1998, 10 x real time (TREC 7 spoken document retrieval)
- 1999, 10 x real time (TREC 8 spoken document retrieval)

Standard HTK LVSR system (1)

Acoustic feature extraction
- Initially designed for clean speech tasks
- Standard Mel-frequency cepstral coefficients (MFCC): 39-dimensional feature vector
- Cepstral mean normalization on an utterance-by-utterance basis

Pronunciation dictionary
- Based on the LIMSI 1993 WSJ pronunciation dictionary
- 46 phones
- Vocabulary of 65k words
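Per-utterance cepstral mean normalization can be sketched in a few lines. This is a minimal illustration, not HTK's implementation, and the function name is my own:

```python
def cepstral_mean_normalize(frames):
    """Subtract the utterance-level mean of each cepstral coefficient,
    removing stationary channel effects (e.g., microphone coloration)."""
    n = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    return [[f[d] - means[d] for d in range(dim)] for f in frames]
```

Because the mean is computed per utterance, each utterance is normalized independently, which is exactly what makes the method robust to channel changes between utterances.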

Standard HTK LVSR system (2)

Acoustic modelling
- Hidden Markov models (HMMs); states are implemented as Gaussian mixture models
- Embedded Baum-Welch re-estimation
- Forced Viterbi alignment chooses between pronunciation variants in the dictionary, e.g., "the" = /ðə/ or /ði/

Transcription of "you speak" at three context levels:
1. Monophones: sil j u sp s p i k sil
2. Triphones: sil j+u j-u+s sp u-s+p s-p+i p-i+k i-k sil
3. Quinphones: sil j+u+s j-u+s+p sp j-u-s+p+i u-s-p+i+k s-p-i+k p-i-k sil
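The monophone-to-triphone expansion can be sketched as follows. The sketch reproduces the triphone example above, treating sil and sp as context-independent and transparent to context, as the example implies; it is an illustration, not HTK's implementation:

```python
SILENCE = {"sil", "sp"}  # context-independent and transparent to context

def to_triphones(phones):
    """Expand a monophone sequence into triphone labels l-c+r.
    Silence models keep their monophone form and are skipped over
    when looking for left/right phonetic context."""
    out = []
    for i, p in enumerate(phones):
        if p in SILENCE:
            out.append(p)
            continue
        left = next((q for q in reversed(phones[:i]) if q not in SILENCE), None)
        right = next((q for q in phones[i + 1:] if q not in SILENCE), None)
        label = p
        if left is not None:
            label = left + "-" + label
        if right is not None:
            label = label + "+" + right
        out.append(label)
    return out
```

For example, `to_triphones("sil j u sp s p i k sil".split())` yields the triphone sequence shown on the slide. The quinphone case extends the same idea to two phones of context on each side.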

Standard HTK LVSR system (3)

Language modelling (LM)
- N-gram models using Katz backoff
- Class-based models built on automatically derived word classes
- Dynamic language models based on a cache model

Decoding
- Time-synchronous decoders
- Single pass, or generation and rescoring of word lattices
- Early stages use triphone models and bigram or trigram LMs
- Later stages use adapted quinphones and more advanced LMs
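The backoff structure of such an n-gram LM can be sketched with a toy bigram model. This uses absolute discounting as a simplified stand-in for Katz backoff, which instead derives its discounts from Good-Turing counts; the backoff mass is also not renormalized over unseen words here, as a full Katz model would do:

```python
from collections import Counter

def backoff_bigram(tokens, discount=0.5):
    """Bigram LM with absolute discounting and unigram backoff:
    seen bigrams get a discounted relative frequency; unseen bigrams
    fall back to a scaled unigram probability."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)

    def prob(word, prev):
        if (prev, word) in bigrams:
            return (bigrams[(prev, word)] - discount) / unigrams[prev]
        # probability mass freed by discounting bigrams with this history
        n_seen = sum(1 for bg in bigrams if bg[0] == prev)
        alpha = discount * n_seen / unigrams[prev] if prev in unigrams else 1.0
        return alpha * unigrams[word] / total

    return prob
```

The same recursion extends to the trigram and 4-gram models used in the later decoding passes, each order backing off to the next lower one.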

Standard HTK LVSR system (4)

Adaptation (to a new speaker or acoustic environment)
- MLLR (Maximum Likelihood Linear Regression): adjust the Gaussian means (and optionally variances) in the HMMs to increase the likelihood of the adaptation data
- m_adapted = A · m_original + b
- Use a single, global transform for all Gaussians, or separate transforms for different clusters of Gaussians
- Can be used in combination with multi-pass decoding:
  1. decode
  2. adapt
  3. decode / rescore lattices with the adapted models
  4. adapt again, etc.
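The mean transform above, m_adapted = A · m_original + b, is simply an affine map applied to every Gaussian mean in a regression class. A minimal sketch (names are illustrative; estimating A and b is the hard part and is omitted):

```python
def mllr_adapt_mean(A, b, mean):
    """Apply an MLLR mean transform: m_adapted = A @ m + b.
    In the real system A and b are estimated to maximize the likelihood
    of the adaptation data; here they are simply given."""
    return [sum(a * m for a, m in zip(row, mean)) + bi
            for row, bi in zip(A, b)]
```

With a single global transform, this one (A, b) pair is shared by all Gaussians; with regression clusters, each cluster of Gaussians gets its own pair.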

F-conditions

Data supplied by the Linguistic Data Consortium (LDC). Part of the training data and all test data were hand-labelled according to the "focus" or F-conditions:

F0  Baseline broadcast speech (clean, planned)
F1  Spontaneous broadcast speech (clean)
F2  Low-fidelity speech (mainly narrowband)
F3  Speech in the presence of background music
F4  Speech under degraded acoustical conditions
F5  Non-native speakers (clean, planned)
FX  All other speech (e.g., spontaneous non-native)

Broadcast news data

                   1996              1997              1998
Training data      35 hours          +37 hours         +71 hours
Dev. test set      2 shows           4 shows
Test set           2 hours           3 hours           3 hours
                   (4 shows)         (9 shows)
Text for LM        (?) million words, plus transcriptions of the acoustic data

(On the slide, a green background marked data hand-labelled according to the F-conditions, and underlining marked data pre-partitioned at the speaker turns.)

Initial experiments using 1995 BN data

Speech recognition: the existing HTK SR system
- Wall Street Journal (WSJ) models, triphones, trigram LM

Data: broadcast news from the radio show "Marketplace", marked according to
- presence/absence of background music
- full/reduced bandwidth

Goals:
- Compare standard MFCC (Mel-frequency cepstral coefficients) to MF-PLP (Mel-frequency perceptual linear prediction)
- Try out unsupervised test-data MLLR adaptation

Results:
- 12% word error rate reduction with MF-PLP
- a further 26% using two-iteration MLLR adaptation

Good. Let's use these techniques!

The 1996 BN evaluation (1)

Acoustic environment adaptation (training data)
- Adapt WSJ triphone and quinphone models to each of the focus conditions using mean and variance MLLR → data-type-specific model sets.
- Automatically classify F2 (low-fidelity speech) as narrowband or wideband, and adapt separate model sets for each.
- A couple of tricks for F5 (non-native) and FX (other speech), due to the small amounts of data.

(Mainly) speaker adaptation (test data)
- Speaker identities are unknown.
- Cluster similar segments bottom-up until sufficient data is available in each group → robust unsupervised adaptation.
- Each segment is represented by its mean and variance.
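The bottom-up segment clustering can be sketched as follows. For brevity each segment is reduced to a scalar mean plus a frame count; the real system represents segments by the mean and variance of their feature vectors, and the merge criterion and data threshold here are illustrative assumptions:

```python
def cluster_segments(segments, min_frames):
    """Bottom-up clustering: repeatedly merge the two closest clusters
    until every cluster holds enough frames for robust MLLR adaptation.
    Each input segment is (mean, n_frames); distance = |mean difference|."""
    clusters = [([i], m, n) for i, (m, n) in enumerate(segments)]
    while len(clusters) > 1 and any(n < min_frames for _, _, n in clusters):
        # find the pair of clusters whose means are closest
        a, b = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: abs(clusters[ij[0]][1] - clusters[ij[1]][1]),
        )
        ids_a, m_a, n_a = clusters[a]
        ids_b, m_b, n_b = clusters[b]
        merged = (ids_a + ids_b, (m_a * n_a + m_b * n_b) / (n_a + n_b), n_a + n_b)
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return clusters
```

Stopping only when every cluster has enough data is what makes the subsequent unsupervised MLLR estimation robust.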

The 1996 BN evaluation (2)

Language modelling
- Static: bigram, trigram, and 4-gram word-based LMs (Katz backoff)
- Dynamic: unigram and bigram cache model (Woodland et al.)
  - built from the last recognition output
  - operates on a per-show, per-focus-condition basis
  - includes future and previous words, plus other word forms with the same stem
  - excludes common words
  - interpolated with the static 4-gram language model
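The interpolation of a unigram cache with the static LM can be sketched like this. The interpolation weight of 0.1 and the function names are assumptions, and the stemming and common-word filtering described above are omitted:

```python
from collections import Counter

def with_cache(static_prob, recent_words, lam=0.1):
    """Interpolate a static LM with a unigram cache built from recent
    recognition output: p(w|h) = (1-lam)*p_static(w|h) + lam*p_cache(w)."""
    cache = Counter(recent_words)
    total = sum(cache.values())

    def prob(word, history):
        p_cache = cache[word] / total if total else 0.0
        return (1.0 - lam) * static_prob(word, history) + lam * p_cache

    return prob
```

The cache boosts words that have already occurred in the show, modelling the tendency of news stories to repeat their key vocabulary.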

The 1996 BN evaluation (3)

Multi-pass decoding:

Pass        MLLR        HMMs     LM         %WER   %Imprv.
P1          -           triph.   trigram    33.4   0
P2          1 transf.   triph.   trigram
P3          1 transf.   triph.   bigram
P3 lat.rs.  1 transf.   triph.   fourgram
P4 lat.rs.  1 transf.   quinph.  fourgram
P5 lat.rs.  2 transf.   quinph.  fourgram
P6 lat.rs.  4 transf.   quinph.  fourgram
+ cache     4 transf.   quinph.  + cache

Towards the 1997 BN evaluation (1)

Information about data segmentation and type is no longer supplied.

Goal: compare the performance of condition-dependent and condition-independent models.
Test data: the 1996 development test set.

Experiments with different acoustic models:
1. Adapt WSJ models to each F-condition (condition-dependent)
2. Train models on the 1996 BN training data (condition-independent)
3. Train models on the 1997 BN training data (condition-independent)

Results:
- Condition-independent models are slightly better than adapted condition-dependent models! (WER: 32.0%, 31.7%, 29.6%)

Towards the 1997 BN evaluation (2)

A gender effect?
- 2/3 male, 1/3 female speakers in the BN data
- 1/2 male, 1/2 female speakers in the WSJ models

Use gender-dependent models:
- The gender of the speakers in the data is known → assume that perfect gender determination is possible.

Results (1997 BN data):
- Gender-independent: All 29.6%, Male 28.8%, Female 31.1%
- Gender-dependent:   All 28.1%, Male 27.8%, Female 28.8%

BN: Automatic segmentation & clustering

Goal: convert the audio stream into clusters of reasonably sized homogeneous speech segments → each cluster shares a set of MLLR transforms.
- The audio stream is first classified into 3 broad categories: wideband speech, narrowband speech, and music (→ rejected).
- A gender-dependent recognizer locates silence portions and gender change points.
- Segments are clustered separately for each gender and bandwidth combination for use in MLLR adaptation.

Result: only 0.1% absolute higher WER than with manual segments.

The 1997 BN evaluation

Language modelling
- Bigram, trigram, and 4-gram word-based LMs (Katz backoff)
- Category language model (Kneser & Ney '93; Martin et al. '95; Niesler et al.)
  - automatically generated word classes based on word bigram statistics in the training set
  - trigram model
- Interpolation of word 4-gram and class trigram models (weights 0.7 and 0.3)

Hypothesis combination
- Different types of errors → combine triphone and quinphone results.
- Uses confidence scores and dynamic-programming-based string alignment.
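A class-based trigram decomposes the word probability through the word's class; combined with the 0.7/0.3 interpolation above, it can be sketched as follows (all function names and the toy component models are assumptions):

```python
def class_trigram_prob(word, history, word2class, p_class, p_word_in_class):
    """Class trigram: P(w | h) = P(c(w) | c(h)) * P(w | c(w)),
    where c(.) maps words to their automatically derived classes."""
    c = word2class[word]
    class_hist = tuple(word2class[v] for v in history[-2:])
    return p_class(c, class_hist) * p_word_in_class(word, c)

def interpolated_prob(word, history, p_word4, p_class_tri, w4=0.7, wc=0.3):
    """Interpolate word 4-gram and class trigram probabilities with the
    weights reported on the slide (0.7 / 0.3)."""
    return w4 * p_word4(word, history) + wc * p_class_tri(word, history)
```

Because class sequences are far less sparse than word sequences, the class model generalizes to word combinations the 4-gram has never seen, which is why the interpolation helps.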

The 1997 BN evaluation (2)

Pass        MLLR          HMMs        LM         %WER   %Imprv.
P1          -             gi triph.   trigram    21.4   0
P2          1 transf.     gd triph.   bigram
P2 lat.rs.  1 transf.     gd triph.   trigram
P2 lat.rs.  1 transf.     gd triph.   fourgram
P2 lat.rs.  1 transf.     gd triph.   inp.w4c
P3 lat.rs.  1 transf.     gd quin.    inp.w4c
P4 lat.rs.  2 transf.     gd quin.    inp.w4c
P5 lat.rs.  4 transf.     gd quin.    inp.w4c
+ cache     4 transf.     gd quin.    + cache
ROVER       1 tr./4 tr.   gd tri/qu.  inp.w4c (conf. combine)

The 1998 BN evaluation (1)

Vocal tract length normalization (VTLN)
- Maximum-likelihood selection of the best warp factor (parabolic search)
- 0.4% lower absolute WER (MLLR-adapted quinphones)

Language modelling
- Interpolate 3 separate word-based LMs (BN, newswire, acoustic data) instead of pooling them.
- 0.5% lower absolute WER (adapted quinphones)

Full-variance MLLR transforms
- 0.2% lower absolute WER

Speaker-adaptive training
- a further 0.1% lower absolute WER (in combination with full-variance transforms)
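The parabolic search for the VTLN warp factor can be illustrated by fitting a parabola through three (warp factor, log-likelihood) samples and taking its vertex. This sketches only the interpolation step; evaluating the likelihoods themselves requires the acoustic models, and the function name is my own:

```python
def parabolic_peak(xs, ys):
    """Vertex of the parabola through three (warp, log-likelihood) points:
    the refined maximum-likelihood warp-factor estimate."""
    (x0, x1, x2), (y0, y1, y2) = xs, ys
    denom = (x0 - x1) * (x0 - x2) * (x1 - x2)
    a = (x2 * (y1 - y0) + x1 * (y0 - y2) + x0 * (y2 - y1)) / denom
    b = (x2 ** 2 * (y0 - y1) + x1 ** 2 * (y2 - y0) + x0 ** 2 * (y1 - y2)) / denom
    return -b / (2 * a)
```

This lets the search refine the warp estimate beyond the coarse grid of factors actually evaluated, keeping the number of likelihood computations small.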

The 1998 BN evaluation (2)

Pass        MLLR         HMMs         LM         %WER   %Imprv.
P1          -            gi -V trip.  trigram    19.9   0
P2          -            gd +V trip.  trigram
P3          1 tr. -FV    gd +V trip.  bigram
P3 lat.rs.  1 tr. -FV    gd +V trip.  inp.w4c
P4 lat.rs.  1 tr. -FV    gd +V qui.   inp.w4c
P4 lat.rs.  1 tr. +FV    gd +V qui.   inp.w4c
P6 lat.rs.  4 tr. +FV    gd +V qui.   inp.w4c
ROVER       1-F/4+F      gd +V tr/q.  inp.w4c (conf. combine)

The 1998 TREC 7 evaluation

Constraint: the system must operate in at most 10 x real time.
1999 TREC 8: same architecture, larger vocabulary.

Pass      MLLR      HMMs         LM        % 1997   % 1998
P1        -         gi -V trip.  trigram
P1 lat.   -         gi -V trip.  fourgram
P2        1 tran.   gd -V trip.  trigram
P2 lat.   1 tran.   gd -V trip.  inp.w4c
(compared against the full-time systems)

Discussion and conclusion

- The HTK system had either the lowest overall error rate in every evaluation, or a rate not significantly different from the lowest.
- HTK was always the best for F0 speech (clean, planned).
- In worse conditions, the adaptation methods applied were shown to significantly reduce the error rate.
- Still a long way to go(?): the word error rate for bulk transcription of BN data remains at about 20% for the best systems, with very high WER for some audio conditions.
- What about languages other than English?

Project work

Planned project: a literature study on language models used in audio mining (broadcast-news-quality speech)
- How do they work?
- What is their contribution to the overall error reduction?

Home assignment

Briefly comment on the following claims in the light of Woodland's paper. (Simply answering true or false is not enough.)
1. The better overall performance of the 1997 system compared to the 1996 system was mainly due to the doubling of the amount of training data.
2. Triphone HMMs cannot be estimated unless a huge amount of training data is available.
3. Gender-dependent acoustic models are to be preferred over gender-independent models.
4. Quinphone HMMs are not created through two-model re-estimation.
5. MLLR (Maximum Likelihood Linear Regression) is an adaptation method that is sensitive to transcription errors.