
1 The Development of the AMI System for the Transcription of Speech in Meetings
Thomas Hain, Lukas Burget, John Dines, Iain McCowan, Giulia Garau, Martin Karafiat, Mike Lincoln, Darren Moore, Vincent Wan, Roeland Ordelman, Steve Renals July 12, 2005 MLMI Edinburgh

2 Outline
Multi-site development
Development strategy
Resources
Modelling
System integration
Results
Conclusions

3 AMI ASR around the globe

4 Multi-site development
Large vocabulary ASR is complex and requires considerable resources
Split development effort across multiple sites: DICT, LM, CORE, ADAPT, AUDIO-PREPROC
Central storage and compute resources
Communication: frequent telephone conferences, internet chat, “working phone calls” (VoIP), multiple workshops, WIKI

5 Development paradigm
Resource building: dictionary, LM, acoustic data
Resource driven: bootstrap from conversational telephone speech (CTS)
Generic technology selection: pick generic techniques with maximum gain (VTLN, HLDA, MPE, CN)
Task-specific components: front-ends, language models

6 Resources
Meeting resources “sparse”
  Corpora: ICSI, ISL, NIST (LDC, VT); the AMI corpus (initial parts)
  100 hours of meeting data
Language model data
  Broadcast News (220 MW)
  Web data (CTS/AMI/Meeting, 600 MW)
  Meetings (ICSI/ISL/NIST/AMI)
  CTS (Swbd/Fisher)
Dictionary
  Edinburgh UNISYN

7 Dictionary
Baseline dictionary based upon UNISYN (Fitt, 2000) with 114,876 words
Semi-automatic generation of pronunciations:
  Part-word pronunciations initially automatically guessed from the existing pronunciations
  Automatic CART-based letter-to-sound conversion trained from UNISYN
  Hand correction/checking of all automatic hypotheses
Words were all converted to British spellings
An additional 11,595 words were added using a combination of automatic and manual generation
Pronunciation probabilities (estimated from alignment of the training data)
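The pronunciation probabilities mentioned on this slide are, in the usual approach, relative frequencies of each pronunciation variant in a forced alignment of the acoustic training data. A minimal sketch under that assumption; the function name, smoothing floor and toy variants are illustrative and not taken from the AMI dictionary:

```python
# Illustrative sketch: pronunciation probabilities as smoothed relative
# frequencies of forced-alignment counts (the floor value is an assumption).
from collections import defaultdict

def pronunciation_probs(aligned_prons, floor=0.1):
    """aligned_prons: iterable of (word, variant) pairs from a forced alignment."""
    counts = defaultdict(lambda: defaultdict(int))
    for word, variant in aligned_prons:
        counts[word][variant] += 1
    probs = {}
    for word, variants in counts.items():
        total = sum(variants.values()) + floor * len(variants)
        probs[word] = {v: (c + floor) / total for v, c in variants.items()}
    return probs

# Toy example: one word aligned 7 times to one variant and 3 times to another.
alignment = [("either", "iy dh er")] * 7 + [("either", "ay dh er")] * 3
print(pronunciation_probs(alignment)["either"])
```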

8 Vocabulary
Out-of-vocabulary (OOV) rates, with padding to 50k words from general Broadcast News data:

             Vocabulary source
Test data    ICSI    NIST    ISL     AMI
ICSI         0.01    0.47    0.58    0.57
NIST         0.43    0.09    0.59    0.66
ISL          0.41    0.37    0.03    0.53
AMI / ALL    (rows only partially recoverable from the transcript: 0.30; 0.16, 0.42, 0.55)

No need for specific vocabulary!
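As a rough illustration of how such OOV figures are computed, the sketch below builds a word list from one meeting source, pads it to 50k entries with the most frequent Broadcast News words, and counts test tokens that fall outside the list. The function names and toy data are assumptions, not the AMI tooling:

```python
# Hedged sketch of the OOV measurement: a source word list padded to 50k
# entries with frequent Broadcast News words, scored on test tokens.
from collections import Counter

def build_vocab(source_tokens, bn_tokens, target_size=50000):
    vocab = set(source_tokens)
    for word, _ in Counter(bn_tokens).most_common():
        if len(vocab) >= target_size:
            break
        vocab.add(word)
    return vocab

def oov_rate(test_tokens, vocab):
    tokens = list(test_tokens)
    misses = sum(1 for w in tokens if w not in vocab)
    return 100.0 * misses / max(len(tokens), 1)

# Toy usage with made-up word lists.
vocab = build_vocab(["meeting", "agenda"], ["the", "of", "news"], target_size=5)
print(oov_rate(["the", "agenda", "microphone"], vocab))  # 1 OOV out of 3 tokens
```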

9 Language modelling
Interpolated trigram language models on meeting data, optimised for each domain on independent dev data (weight tuning sketched below)
Perplexity results:

             Language model optimised for
Test data    ICSI     NIST     ISL      AMI      ALL
ICSI          68.2     74.6     73.8     77.1     68.0
NIST         105.9    100.9    102.0    106.0    101.3
ISL          104.7     99.5     98.5    106.4    102.9
AMI          115.6    114.3    114.4     88.9     94.1
(a further row is only partially recoverable from the transcript: 107.5, 105.7, 90.6, 92.7)

Meeting-resource-specific models outperform general models
Translates into a 0.5% abs Word Error Rate improvement
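The interpolation weights behind such perplexities are normally tuned on held-out data with an EM procedure. The sketch below shows that procedure generically; the AMI system would have used standard LM toolkit machinery, so the names, toy probabilities and iteration count here are only illustrative:

```python
# Generic EM tuning of interpolation weights for component LMs, plus the
# perplexity of the resulting mixture (illustrative, not the AMI scripts).
import math

def optimise_weights(component_probs, iters=50):
    """component_probs: one entry per dev-set word; each entry lists the
    probability assigned to that word by each component LM."""
    k = len(component_probs[0])
    w = [1.0 / k] * k
    for _ in range(iters):
        post = [0.0] * k
        for probs in component_probs:
            mix = sum(wi * pi for wi, pi in zip(w, probs))
            for i, pi in enumerate(probs):
                post[i] += w[i] * pi / mix
        w = [p / len(component_probs) for p in post]
    return w

def perplexity(component_probs, w):
    logprob = sum(math.log(sum(wi * pi for wi, pi in zip(w, probs)))
                  for probs in component_probs)
    return math.exp(-logprob / len(component_probs))

# Toy dev data: two component LMs scoring four words.
dev = [[0.02, 0.001], [0.015, 0.002], [0.001, 0.01], [0.02, 0.001]]
weights = optimise_weights(dev)
print(weights, perplexity(dev, weights))
```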

10 Acoustic modelling
Standard HMM-based framework
  Decision-tree state-clustered triphones
  Hidden Markov Model Toolkit (HTK)
  Maximum likelihood training
  Approx. 70k Gaussians per model set
MAP adaptation from CTS models (see the sketch below)
  Bandwidth problem: CTS is narrowband data (4 kHz), meetings are recorded at 8 kHz bandwidth
  Developed an MLLR/MAP scheme
Front-end feature transform
  SHLDA = Smoothed Heteroscedastic Linear Discriminant Analysis
  Typically 1.5% WER improvement
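The MAP adaptation mentioned above follows, in its usual form, the relevance-factor update for the Gaussian means: the CTS prior mean is interpolated with the meeting-data statistics in proportion to the state occupancy. A minimal sketch; tau and the toy statistics are assumptions, since the slide does not give the values used in the AMI system:

```python
# Standard MAP mean update: prior (CTS) mean interpolated with the
# gamma-weighted meeting-data statistics.
import numpy as np

def map_adapt_mean(prior_mean, occupancy, weighted_obs_sum, tau=20.0):
    """occupancy: summed Gaussian posteriors (gamma) over the adaptation data;
    weighted_obs_sum: sum of gamma-weighted observation vectors."""
    return (tau * prior_mean + weighted_obs_sum) / (tau + occupancy)

prior = np.zeros(3)                           # toy CTS-trained mean
gamma = 15.0                                  # toy occupancy from meeting data
obs_sum = gamma * np.array([0.5, -0.2, 1.0])  # toy first-order statistics
print(map_adapt_mean(prior, gamma, obs_sum))
```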

11 Speaker/channel adaptation
CMN/CVN (channel) (sketched below)
Vocal tract length normalisation (VTLN)
  Maximum likelihood, in training & test
  Typically 3-4% WER gain
MLLR
  Mean and variance transforms, for speech and silence
  Typically 1-2% improvement
[Figure: histograms of VTLN warp factors for female and male speakers]
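The CMN/CVN step amounts to per-channel (or per-speaker) mean and variance normalisation of the cepstral features. A minimal sketch, not the exact AMI front-end code:

```python
# Cepstral mean and variance normalisation over one channel's frames.
import numpy as np

def cmn_cvn(features):
    """features: array of shape (num_frames, num_coeffs) for one channel."""
    mean = features.mean(axis=0)
    std = features.std(axis=0) + 1e-8   # guard against constant dimensions
    return (features - mean) / std

feats = np.random.randn(100, 12) * 3.0 + 5.0   # toy cepstra
norm = cmn_cvn(feats)
print(norm.mean(axis=0).round(3), norm.std(axis=0).round(3))   # ~0 and ~1
```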

12 Front-ends
Meeting recordings come with a variety of source types
Microphone locations
  Close talking: head-mounted / lapel
  Distant: “arbitrary location”, various array configurations
Requires: speech activity detection, speaker “grouping”, speaker and location tracking
Objective: achieve “close-talking” performance with distant microphones
Enhancement-type approach for simplicity

13 IHM front-end processing
Signal enhancement (cross-talk suppression): LMS echo cancellation
Speech activity detection (SAD) using a Multi-Layer Perceptron (MLP)
Processing chain: x (IHM channel) and Yk (remaining IHM channels) feed cross-talk suppression, giving x’ (enhanced signal); feature extraction produces a 36-dimensional feature vector; MLP classification is followed by a Viterbi decoder with smoothing parameters (insertion penalty, minimum duration)
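A rough sketch of the smoothing stage that turns the frame-level MLP speech posteriors into segments: a two-state Viterbi pass in which a switching penalty plays the role of the insertion penalty named above. The penalty value is an assumption, and the minimum-duration constraint is left out for brevity:

```python
# Two-state (silence/speech) Viterbi smoothing of MLP speech posteriors.
import numpy as np

def viterbi_smooth(speech_post, switch_penalty=8.0):
    """speech_post: per-frame P(speech) from the MLP; returns a 0/1 speech mask."""
    eps = 1e-8
    ll = np.log(np.stack([1.0 - speech_post, speech_post], axis=1) + eps)
    n = len(speech_post)
    score = ll[0].copy()
    back = np.zeros((n, 2), dtype=int)
    for t in range(1, n):
        new_score = np.empty(2)
        for s in (0, 1):
            stay, switch = score[s], score[1 - s] - switch_penalty
            back[t, s] = s if stay >= switch else 1 - s
            new_score[s] = max(stay, switch) + ll[t, s]
        score = new_score
    path = np.empty(n, dtype=int)
    path[-1] = int(np.argmax(score))
    for t in range(n - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path

# Toy posteriors: a 3-frame blip is smoothed away, a 40-frame run survives.
post = np.concatenate([np.full(50, 0.2), np.full(3, 0.9),
                       np.full(50, 0.2), np.full(40, 0.85)])
print(viterbi_smooth(post).sum())   # close to 40
```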

14 IHM cross-talk suppression
Multiple-reference LMS adaptive filtering with a 256-tap FIR filter
Adaptation is frozen during periods of speech activity
Automatic correction for channel timing misalignment
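A hedged sketch of the idea: per-reference normalised-LMS FIR filters (256 taps, as on the slide) predict the cross-talk in the target headset channel from the other channels, the prediction is subtracted, and the filters are updated only while the target speaker is silent. The step size, the normalised update and the form of the speech-activity input are assumptions, and the channel-timing correction is omitted:

```python
# Multi-reference (N)LMS cross-talk suppression for one IHM channel.
import numpy as np

def lms_crosstalk_suppress(target, refs, speech_active, taps=256, mu=0.1):
    """target: target IHM channel; refs: (num_refs, num_samples) array of the
    remaining IHM channels; speech_active: per-sample bool for the wearer."""
    num_refs, n = refs.shape
    w = np.zeros((num_refs, taps))
    out = target.copy()
    for t in range(taps, n):
        frames = refs[:, t - taps + 1:t + 1][:, ::-1]   # most recent sample first
        est = np.sum(w * frames)                        # predicted cross-talk
        e = target[t] - est
        out[t] = e
        if not speech_active[t]:                        # adapt only while the
            for r in range(num_refs):                   # wearer is silent
                x = frames[r]
                w[r] += mu * e * x / (np.dot(x, x) + 1e-8)
    return out

# Toy check: the "cross-talk" is a fixed mix of two reference channels.
rng = np.random.default_rng(0)
refs = rng.standard_normal((2, 4000))
target = 0.3 * refs[0] + 0.1 * refs[1] + 0.01 * rng.standard_normal(4000)
clean = lms_crosstalk_suppress(target, refs, np.zeros(4000, dtype=bool), taps=8)
print(np.mean(target[2000:] ** 2), np.mean(clean[2000:] ** 2))   # residual shrinks
```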

15 Multiple distant microphones
Gain calibration: simple gain calibration in which the maximum amplitude of each audio channel is normalised
Noise removal: the noise spectrum of each input channel is estimated and a Wiener filter is applied to each channel to remove stationary noise
Delay estimation: computed per frame; scale factors from the ratio of energies; delay from peak finding in the cross-correlation (sketched below)
Beamformer: filters computed with a superdirective technique, using the noise correlation matrix estimated above
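An illustrative sketch of the per-frame delay estimation step: pick the lag that maximises the cross-correlation between a reference microphone and another channel, computed via the FFT. A plain cross-correlation is used here; whether the AMI system applied an additional weighting (e.g. a phase transform) is not stated on the slide:

```python
# Delay estimation by peak picking in the cross-correlation of two channels.
import numpy as np

def estimate_delay(ref_frame, chan_frame, max_lag=200):
    """Return the integer lag (in samples) with the strongest cross-correlation."""
    n = max(len(ref_frame), len(chan_frame))
    spec = np.fft.rfft(ref_frame, 2 * n) * np.conj(np.fft.rfft(chan_frame, 2 * n))
    xcorr = np.fft.irfft(spec)
    lags = np.concatenate([np.arange(0, max_lag + 1), np.arange(-max_lag, 0)])
    peaks = np.concatenate([xcorr[:max_lag + 1], xcorr[-max_lag:]])
    return int(lags[np.argmax(peaks)])

# Toy check: the second channel is the reference delayed by 40 samples.
rng = np.random.default_rng(1)
ref = rng.standard_normal(4096)
delayed = np.concatenate([np.zeros(40), ref[:-40]])
print(estimate_delay(ref, delayed))   # -40 under this sign convention
```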

16 Towards a model set
Model initialisation (WER on ICSI only):

Training data   Bandwidth   Adapt        WER
CTS             NB          -            33.3
ICSI            NB          -            27.1
ICSI            WB          -            25.3
ICSI            WB          MAP          25.8
ICSI            WB          MLLR + MAP   24.6

More training data (WER by test meeting source):

Training data        TOT    ISL    ICSI   LDC    NIST   AMI-TOT   UEDIN   IDIAP
ICSI,NIST            50.4   56.2   24.1   61.1   36.9   59.1      60.2    58.4
ICSI,NIST,ISL        50.6   -      22.9   61.8   37.2   58.6      60.0    57.6
ICSI,NIST,ISL,AMI    50.3   54.5   27.4   61.3   36.2   57.3      59.0    -

(Dashes mark values not recoverable from the transcript.)

17 First pass recognition
System architecture:
Front-end (IHM/MDM) → modified audio, segments, speaker info
First pass recognition → first recognition result
Adaptation
Lattice generation → word lattices
LM rescoring → final word-level result

18 Results on rt05seval
CTS-adapted ML models, unadapted, trigram LM (first pass); WER (%):

           TOT    Sub    Del    Ins    AMI    ISL    ICSI   NIST   VT
IHM        41.1   21.1   14.7   5.3    42.3   36.3   37.1   49.1   -
IHM REF    34.9   23.0   7.1    4.8    34.5   34.0   26.6   42.2   37.9
MDM        53.6   32.1   17.3   4.1    46.5   50.2   48.2   63.0   -
MDM REF    50.6   34.1   11.8   4.6    43.0   49.4   46.4   49.9   59.5

(Dashes mark values not recoverable from the transcript.)
MDM segmentation provided by ICSI/SRI
REF denotes reference segmentation and speaker labels
The performance of the full system on the above AMI subset is 30.9% for IHM and 35.1% for MDM
BUT: difference on REF remains

19 Conclusions
Multi-site development!
Competitive ASR system in months
Meeting domains inhomogeneous?
Good improvements with VTLN/SHLDA/MLLR
Pre-processing needs to be sorted!
Reasonable performance on Seminar data

