The Development of the AMI System for the Transcription of Speech in Meetings. Thomas Hain, Lukas Burget, John Dines, Iain McCowan, Giulia Garau, Martin Karafiat, Mike Lincoln, Darren Moore, Vincent Wan, Roeland Ordelman, Steve Renals.


The Development of the AMI System for the Transcription of Speech in Meetings
Thomas Hain, Lukas Burget, John Dines, Iain McCowan, Giulia Garau, Martin Karafiat, Mike Lincoln, Darren Moore, Vincent Wan, Roeland Ordelman, Steve Renals
MLMI, Edinburgh, July 12, 2005

Outline
- Multi-site development
- Development strategy
- Resources
- Modelling
- System integration
- Results
- Conclusions
MLMI – The AMI ASR Meeting System, Thomas Hain / July 12, 2005

AMI ASR around the globe

Multi-site development
- Large-vocabulary ASR is complex and requires considerable resources
- Development effort split across multiple sites: DICT, LM, CORE, ADAPT, AUDIO-PREPROC
- Central storage and compute resources
- Communication: frequent telephone conferences, internet chat, "working phone calls" (VoIP), multiple workshops, wiki

Development paradigm
- Resource building (resource driven): dictionary, LM, acoustic data
- Bootstrap from conversational telephone speech (CTS)
- Generic technology selection: pick generic techniques with maximum gain (VTLN, HLDA, MPE, CN)
- Task-specific components: front-ends, language models

Resources
- Meeting resources are "sparse"
- Acoustic corpora: ICSI, ISL, NIST (LDC, VT); the AMI corpus (initial parts); 100 hours of meeting data
- Language model data: Broadcast News (220 MW); web data for CTS/AMI/meetings (600 MW); meetings (ICSI/ISL/NIST/AMI); CTS (Swbd/Fisher); …
- Dictionary: Edinburgh UNISYN

Dictionary
- Baseline dictionary based on UNISYN (Fitt, 2000) with 114,876 words
- Semi-automatic generation of pronunciations:
  - Part-word pronunciations initially guessed automatically from the existing pronunciations
  - Automatic CART-based letter-to-sound conversion trained on UNISYN
  - Hand correction/checking of all automatic hypotheses
- All words converted to British spellings
- An additional 11,595 words added using a combination of automatic and manual generation
- Pronunciation probabilities estimated from alignment of the training data
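The pronunciation-probability step above amounts to relative-frequency counting over forced-alignment output. A minimal sketch; the words and phone strings below are invented toy tokens, not entries from the AMI dictionary:

```python
from collections import Counter, defaultdict

def pron_probs(aligned_prons):
    """Pronunciation probabilities as relative frequencies of each variant
    per word, counted over (word, pronunciation) pairs from a forced
    alignment of the training data."""
    by_word = defaultdict(Counter)
    for word, variant in aligned_prons:
        by_word[word][variant] += 1
    return {w: {v: c / sum(cnt.values()) for v, c in cnt.items()}
            for w, cnt in by_word.items()}

# Invented alignment tokens for illustration
probs = pron_probs([("either", "iy dh er"), ("either", "ay dh er"),
                    ("either", "iy dh er"), ("the", "dh ah")])
print(probs["either"]["iy dh er"])  # 2/3
```

In a real system these counts would come from the Viterbi alignment of the acoustic training data against the multi-pronunciation dictionary.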

Vocabulary
Out-of-vocabulary (OOV) rates, with padding to 50k words from general Broadcast News data (rows: vocabulary source; columns: test data):

Source   ICSI   NIST   ISL    AMI
ICSI     0.01   0.47   0.58   0.57
NIST     0.43   0.09   0.59   0.66
ISL      0.41   0.37   0.03   0.53
AMI      0.30   –      –      –
ALL      0.16   0.42   0.55   –

No need for a task-specific vocabulary!
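The OOV figures above are token-level miss rates against the recognition vocabulary. A minimal sketch, with an invented four-word vocabulary:

```python
def oov_rate(vocab, test_tokens):
    """OOV rate: fraction of running words in the test data that are
    not covered by the recognition vocabulary."""
    misses = sum(1 for w in test_tokens if w not in vocab)
    return misses / len(test_tokens)

# Toy example with an invented vocabulary
vocab = {"the", "meeting", "starts", "now"}
print(oov_rate(vocab, ["the", "meeting", "adjourns", "now"]))  # 0.25
```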

Language modelling
- Interpolated trigram language models on meeting data, optimised for each domain (on independent dev data)
- Perplexity results (rows: LM source; columns: test data):

Source   ICSI    NIST    ISL     AMI     ALL
ICSI     68.2    74.6    73.8    77.1    68.0
NIST     105.9   100.9   102.0   106.0   101.3
ISL      104.7   99.5    98.5    106.4   102.9
AMI      115.6   114.3   114.4   88.9    94.1
ALL      107.5   105.7   90.6    92.7    –

- Meeting-resource-specific models outperform general models
- Translates into a 0.5% absolute word error rate improvement

Acoustic modelling
- Standard HMM-based framework: decision-tree state-clustered triphones, Hidden Markov Model Toolkit (HTK), maximum-likelihood training, approx. 70k Gaussians per model set
- MAP adaptation from CTS models
  - Bandwidth problem: CTS is narrowband (4 kHz), while meetings are recorded at 8 kHz bandwidth
  - Developed MLLR/MAP
- Front-end feature transform: SHLDA (Smoothed Heteroscedastic Linear Discriminant Analysis), typically 1.5% WER improvement
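The MAP adaptation of Gaussian means from CTS priors follows the standard relevance formulation. A sketch; tau=10 is an illustrative value, not the setting used in the AMI system:

```python
import numpy as np

def map_adapt_mean(prior_mean, frames, gammas, tau=10.0):
    """MAP re-estimation of a Gaussian mean: the occupancy-weighted average
    of the adaptation frames is shrunk toward the prior (CTS) mean, with
    tau controlling the weight of the prior."""
    gammas = np.asarray(gammas, dtype=float)
    frames = np.asarray(frames, dtype=float)
    occ = gammas.sum()                              # total state occupancy
    weighted_sum = (gammas[:, None] * frames).sum(axis=0)
    return (tau * np.asarray(prior_mean) + weighted_sum) / (tau + occ)
```

With little adaptation data the estimate stays near the prior; with much data it approaches the maximum-likelihood mean.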

Speaker/channel adaptation
- CMN/CVN (channel)
- Vocal tract length normalisation (VTLN): maximum-likelihood estimation in training and test; typically 3-4% WER gain
- MLLR: mean and variance transforms, separate for speech and silence; typically 1-2% improvement
- Figure: histograms of warp factors, female/male
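The CMN/CVN step can be sketched in a few lines; this is a generic per-segment normalisation, not the exact AMI implementation:

```python
import numpy as np

def cmn_cvn(features):
    """Cepstral mean and variance normalisation: shift each feature
    dimension to zero mean and scale it to unit variance over the
    segment, removing stationary channel effects."""
    mu = features.mean(axis=0)
    sigma = features.std(axis=0)
    return (features - mu) / np.maximum(sigma, 1e-8)  # guard zero variance

frames = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])  # toy 2-dim features
norm = cmn_cvn(frames)
```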

Front-ends
- Meeting recordings come from a variety of source types
- Microphone locations:
  - Close-talking: head-mounted / lapel
  - Distant: "arbitrary location", various array configurations
- Requires speech activity detection, speaker "grouping", and speaker and location tracking
- Objective: achieve "close-talking" performance with distant microphones
- Enhancement-type approach chosen for simplicity

IHM front-end processing
- Signal enhancement (cross-talk suppression): LMS echo cancellation applied to the IHM channel x, using the remaining IHM channels Yk as references, producing the enhanced signal x'
- Speech activity detection (SAD) on x': feature extraction (36-dimensional feature vector), MLP classification, and a Viterbi decoder with smoothing parameters (insertion penalty, minimum duration)
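The Viterbi smoothing of the MLP's speech/non-speech posteriors can be sketched with a two-state decoder, where a single switch penalty stands in for the insertion-penalty and minimum-duration parameters on the slide (a sketch, not the AMI system itself):

```python
import math

def smooth_sad(posteriors, switch_penalty=2.0):
    """Two-state Viterbi smoothing of per-frame speech posteriors:
    the switch penalty discourages rapid speech/non-speech flips."""
    # emission log-probs: state 0 = non-speech, state 1 = speech
    logp = [(math.log(max(1.0 - p, 1e-10)), math.log(max(p, 1e-10)))
            for p in posteriors]
    score = list(logp[0])
    back = []
    for frame in logp[1:]:
        ptr, new = [], []
        for s in (0, 1):
            stay = score[s]
            switch = score[1 - s] - switch_penalty
            ptr.append(s if stay >= switch else 1 - s)
            new.append(max(stay, switch) + frame[s])
        back.append(ptr)
        score = new
    # backtrace from the best final state
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]

# An isolated dip in the speech posterior is smoothed over
print(smooth_sad([0.9, 0.9, 0.4, 0.9, 0.9]))  # [1, 1, 1, 1, 1]
```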

IHM cross-talk suppression
- Multiple-reference LMS adaptive filtering with a 256-tap FIR filter
- Adaptation is frozen during periods of speech activity
- Automatic correction for channel timing misalignment
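The cancellation idea can be sketched with a single-reference normalised-LMS filter; the real system used multiple references, a 256-tap filter, and frozen adaptation during local speech activity:

```python
import numpy as np

def nlms_cancel(target, reference, taps=64, mu=0.5):
    """Normalised-LMS cross-talk canceller (single-reference sketch).
    An FIR filter over the reference channel is adapted to predict the
    cross-talk component of the target channel; the prediction error is
    the enhanced output signal."""
    w = np.zeros(taps)
    out = np.array(target, dtype=float)
    for n in range(taps, len(target)):
        x = reference[n - taps:n][::-1]   # most recent reference samples
        e = target[n] - w @ x             # error = enhanced sample
        w += mu * e * x / (x @ x + 1e-8)  # normalised weight update
        out[n] = e
    return out
```

Freezing adaptation during local speech, as on the slide, would simply skip the weight-update line whenever the SAD flags the target channel as active.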

Multiple distant microphones
- Gain calibration: simple calibration in which the maximum amplitude of each audio channel is normalised
- Noise removal: the noise spectrum of each input channel is estimated, and a Wiener filter is applied to each channel to remove stationary noise
- Delay estimation: computed per frame; scale factors from the ratio of channel energies; delays from peak finding in the cross-correlation
- Beamforming: superdirective beamformer filters, using the noise correlation matrix estimated above
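The per-frame delay estimation by peak finding in the cross-correlation can be sketched as follows (frame windowing and any correlation weighting are omitted):

```python
import numpy as np

def estimate_delay(ref, chan, max_lag=100):
    """Delay of `chan` relative to `ref` in samples, as the lag that
    maximises the cross-correlation between the two channels."""
    seg = ref[max_lag:len(ref) - max_lag]
    lags = list(range(-max_lag, max_lag + 1))
    corr = [np.dot(seg, chan[max_lag + l:len(chan) - max_lag + l])
            for l in lags]
    return lags[int(np.argmax(corr))]
```

In the full system such delays (and the energy-ratio scale factors) feed the superdirective beamformer, which aligns and weights the channels before summation.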

Towards a model set

Model initialisation (WER on ICSI only):

Training data   Bandwidth   Adapt        WER
CTS             NB          -            33.3
ICSI                        -            27.1
                WB          -            25.3
                            MAP          25.8
                            MLLR + MAP   24.6

More training data (WER by test subset):

Training data       TOT    ISL    ICSI   LDC    NIST   AMI-TOT   UEDIN   IDIAP
ICSI,NIST           50.4   56.2   24.1   61.1   36.9   59.1      60.2    58.4
ICSI,NIST,ISL       50.6   –      22.9   61.8   37.2   58.6      60.0    57.6
ICSI,NIST,ISL,AMI   50.3   54.5   27.4   61.3   36.2   57.3      59.0    –

System architecture
- Front-end (IHM/MDM): produces modified audio, segments, speaker info
- First-pass recognition: first recognition result
- Adaptation
- Lattice generation: word lattices
- LM rescoring: final word-level result

Results on rt05seval
CTS-adapted ML models, unadapted, trigram LM (first pass):

         TOT    Sub    Del    Ins    AMI    ISL    ICSI   NIST   VT
IHM      41.1   21.1   14.7   5.3    42.3   36.3   37.1   49.1   –
IHMREF   34.9   23.0   7.1    4.8    34.5   34.0   26.6   42.2   37.9
MDM      53.6   32.1   17.3   4.1    46.5   50.2   48.2   63.0   –
MDMREF   50.6   34.1   11.8   4.6    43.0   49.4   46.4   49.9   59.5

- MDM segmentation provided by ICSI/SRI; REF denotes reference segmentation and speaker labels
- The performance of the full system on the AMI subset above is 30.9% for IHM and 35.1% for MDM
- BUT: the difference on REF remains

Conclusions
- Multi-site development!
- Competitive ASR system in 10-11 months
- Meeting domains inhomogeneous?
- Good improvements with VTLN/SHLDA/MLLR
- Pre-processing needs to be sorted!
- Reasonable performance on seminar data