The State of the Art in ASR (and Beyond?)

Steve Young
Cambridge University Engineering Department, Speech Vision and Robotics Group
Microsoft Cambridge

Standard Model for Continuous Speech

[Figure: the generative hierarchy for the utterance "He bought it" - Words W ("he bought it") -> Phones Q ("h iy b ao t ih t") -> States S -> Features Y -> Acoustics A, with one feature vector every 10 msecs.]

Or if you prefer…

[Figure: the same hierarchy as three layers - Words, Phones, States - governed by the Language Model, the Word Model (Dictionary), and the Phone-level Acoustic Model respectively: the "beads on a string" model.]

Structure of a Simple Decoder

[Figure: block diagram - a Feature Extractor feeds the Decoder, which combines a Grammar (N-gram or FSN), a Dictionary (e.g. THE -> th ax, THIS -> th ih s, …), and Phone Models to produce output text such as "This is …".]

A few problems…

- Speech is not stationary over 10 msec intervals
- Every speaker is different
- Microphone, channel and background noise vary
- Segments are highly context dependent
- Pronunciations vary with speaker, style, context, etc.
- Speech is not a sequence of segments; articulators move asynchronously
- Prosody makes a difference
- The English language is not Markov

So considerable ingenuity is needed to make the "beads on a string" model work.

Building a State of the Art System (1)

(a) Front-End Processing
- Mel-spaced filter bank + cepstral transform (MFCC or PLP)
- If telephone, restrict bandwidth to the telephone passband (roughly 300-3400 Hz)
- Cepstral mean and variance normalisation
- First and second differences appended to static coefficients
- Vocal tract length normalisation (ML warping of frequency axis)

Result: a 39-D feature vector every 10 msecs, zero mean and unit variance, with the frequency axis warped to match the speaker.
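As an illustration of this front end, here is a minimal sketch using librosa (an assumption for illustration; systems of this era used HTK-style tools): 13 static MFCCs, first and second differences appended to give 39 dimensions, and per-utterance mean and variance normalisation. VTLN is omitted.

```python
import numpy as np
import librosa

def mfcc_frontend(wav_path):
    """39-D MFCC front end: statics + deltas + delta-deltas, with CMVN.

    A sketch of the pipeline described above; hop_length gives one
    frame every 10 msecs, and VTLN warping is not implemented here.
    """
    y, sr = librosa.load(wav_path, sr=16000)
    hop = int(0.010 * sr)                          # 10 msec frame shift
    static = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop)
    d1 = librosa.feature.delta(static, order=1)    # first differences
    d2 = librosa.feature.delta(static, order=2)    # second differences
    feats = np.vstack([static, d1, d2])            # shape (39, n_frames)
    # Cepstral mean and variance normalisation (per utterance)
    feats = (feats - feats.mean(axis=1, keepdims=True)) / \
            (feats.std(axis=1, keepdims=True) + 1e-8)
    return feats.T                                 # (n_frames, 39)
```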

Building a State of the Art System (2)

(b) Acoustic Models
- 50 to 250 hours of training data
- Observation probabilities are Gaussian mixture densities
- Phone models are context dependent (CD) - either triphone or quinphone
- CD models clustered according to phonetic context using decision trees
- State-based clustering with soft-tying between nearby states
- Gender dependent models
- Parameter estimation using MLE and MMIE

Result: a set of decision trees and one or more sets of context-dependent HMMs. Typically several thousand distinct states with multiple mixture components per state, i.e. ~20,000,000 parameters per set.
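To make the observation model concrete, here is a minimal sketch (illustrative, with hypothetical parameter arrays) of the per-state log-likelihood of one 39-D feature vector under a diagonal-covariance Gaussian mixture, which is what each HMM state evaluates at every frame:

```python
import numpy as np

def gmm_log_likelihood(x, weights, means, variances):
    """Log p(x | state) for a diagonal-covariance Gaussian mixture.

    x         : (D,)   one feature vector (D = 39 here)
    weights   : (M,)   mixture weights, summing to 1
    means     : (M, D) component means
    variances : (M, D) component (diagonal) variances
    """
    D = x.shape[0]
    # Per-component log N(x; mu_m, diag(var_m))
    log_norm = -0.5 * (D * np.log(2 * np.pi) + np.log(variances).sum(axis=1))
    log_exp = -0.5 * (((x - means) ** 2) / variances).sum(axis=1)
    log_comp = np.log(weights) + log_norm + log_exp
    # Log-sum-exp over mixture components for numerical stability
    m = log_comp.max()
    return m + np.log(np.exp(log_comp - m).sum())
```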

Decision Tree Clustering

[Figure: decision tree for the basic phone "iy" - for the instance of "iy" in "He said…" (h iy s eh d), yes/no questions about the phonetic context (R stop? L liquid? R fric? L "h"? R "t"?) are asked in turn until a leaf identifies which clustered "iy" model is needed.]

Virtually all of the phonetic/phonological variation is handled by these CD models. There are typically a few hundred per phone. Questions are selected to maximise the likelihood of the training data.
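A sketch of the question-selection criterion (my own illustrative implementation, not the actual HTK code): model each candidate cluster with a single diagonal Gaussian fit to its frames, score a split by the gain in training-data log-likelihood, and greedily pick the best question.

```python
import numpy as np

def cluster_log_likelihood(frames):
    """Log-likelihood of frames under a single diagonal Gaussian fit
    to those same frames: L = -N/2 * sum_d log(2*pi*e*var_d)."""
    n, _ = frames.shape
    var = frames.var(axis=0) + 1e-8
    return -0.5 * n * np.sum(np.log(2 * np.pi * np.e * var))

def best_question(frames, contexts, questions):
    """Pick the phonetic question whose yes/no split of this tree node
    most increases total log-likelihood.

    frames    : (N, D) feature vectors pooled at this node
    contexts  : length-N list of phonetic contexts, one per frame
    questions : dict of name -> predicate(context) returning bool
    """
    parent = cluster_log_likelihood(frames)
    best, best_gain = None, 0.0
    for name, pred in questions.items():
        mask = np.array([pred(c) for c in contexts])
        if mask.all() or not mask.any():   # question does not split the node
            continue
        gain = (cluster_log_likelihood(frames[mask]) +
                cluster_log_likelihood(frames[~mask]) - parent)
        if gain > best_gain:
            best, best_gain = name, gain
    return best, best_gain
```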

Building a State of the Art System (3)

(c) Word Models
- Typically one or two pronunciations per word
- Pronunciation dictionaries have a hand-crafted core
- Bulk of pronunciations generated by rule
- Probabilities estimated from a phone-alignment of the acoustic training set
- Type of word boundary juncture is significant

Result: a dictionary with multiple pronunciations per word, each pronunciation with a "probability". (We would like to do more here, but increasing the number of pronunciations also increases the confusability of words and typically degrades performance.)
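A minimal sketch of how pronunciation probabilities might be estimated from the phone alignment (the counting scheme is my assumption; the slide only says the probabilities come from aligning the training set): count which dictionary variant each aligned word token used, then normalise per word.

```python
from collections import Counter, defaultdict

def pronunciation_probs(aligned_tokens):
    """aligned_tokens: iterable of (word, pron) pairs, where pron is the
    phone string the aligner chose for that token, e.g.
    ("the", "th ax") or ("the", "th iy").
    Returns {word: {pron: probability}} by relative frequency."""
    counts = defaultdict(Counter)
    for word, pron in aligned_tokens:
        counts[word][pron] += 1
    return {
        word: {pron: n / sum(c.values()) for pron, n in c.items()}
        for word, c in counts.items()
    }

# e.g. pronunciation_probs([("the", "th ax"), ("the", "th ax"),
#                           ("the", "th iy")])
# -> {"the": {"th ax": 0.67, "th iy": 0.33}} (approximately)
```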

Building a State of the Art System (4)

(d) Language Model
- 100M to 1G words of training data
- Back-off word LM interpolated with a class-based LM
- Typically ~500 classes, derived from bigram statistics
- Vocabulary selected to minimise OOV rate
- Vocabulary size typically 64k

Result: a word-based LM with typically 20M n-grams, interpolated with a class-based LM with typically 1M n-grams.
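The interpolation itself is simple; here is a sketch (the function names and the fixed weight are my own illustration) combining a word n-gram probability with a class-based probability p(c(w)|c(h)) * p(w|c(w)):

```python
def interpolated_prob(w, history, word_lm, class_lm, word_class,
                      class_member_prob, lam=0.7):
    """p(w|h) = lam * p_word(w|h)
              + (1-lam) * p_class(c(w)|c(h)) * p(w|c(w)).

    word_lm(w, history)        : back-off word n-gram probability
    class_lm(c, class_history) : class n-gram probability
    word_class[w]              : class of word w
    class_member_prob(w)       : p(w | c(w))
    lam                        : interpolation weight (fixed here;
                                 in practice tuned on held-out data)
    """
    class_history = tuple(word_class[h] for h in history)
    p_word = word_lm(w, history)
    p_class = class_lm(word_class[w], class_history) * class_member_prob(w)
    return lam * p_word + (1.0 - lam) * p_class
```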

Building a State of the Art System (5)

(e) Decoding
- Multipass sentence-level MAP decoding strategy using lattices
- Initial pass typically simple triphones + bigram LM
- Subsequent passes refine the acoustic models (e.g. GD quinphones)
- Speaker adaptation of Gaussian means and variances using MLLR
- Global full-variance transform using "semi-tied" covariances
- Word posteriors and confidence scores computed from the lattice
- Final output determined using word-level MAP
- Final output selected from multiple model sets

Result: the best hypothesis selected for minimum WER, with confidence scores.
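To illustrate the MLLR step: adaptation applies a shared affine transform to the Gaussian means of each regression class (variance MLLR similarly transforms the covariances). A minimal sketch of applying an estimated mean transform is below; estimating the transform itself requires accumulating statistics over the adaptation data and is omitted.

```python
import numpy as np

def apply_mllr_mean_transform(means, A, b):
    """MLLR mean adaptation: mu' = A @ mu + b for every Gaussian mean.

    means : (G, D) stacked means of all Gaussians in one regression class
    A     : (D, D) rotation/scaling part of the transform
    b     : (D,)   bias part of the transform
    """
    return means @ A.T + b   # row-wise A @ mu + b
```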

Performance

Each refinement typically gives a few percent relative improvement, but if done carefully the improvements are additive.

Example - the CUED HTK 2000 Switchboard System (courtesy Hain, Woodland, Evermann, Povey):

Pass  | Models  | Ctxt | VTLN | MLLR | FV | MAP  | SwbdII | CHE
1     | GI-MLE  | Tri  |  -   |  …   | …  | Sent |   …    |  …
2     | MMIE    | Tri  | Yes  |  …   | …  | Sent |   …    |  …
3     | MMIE    | Tri  | Yes  |  …   | …  | Sent |   …    |  …
4a    | MMIE    | Tri  | Yes  |  …   | …  | Word |   …    |  …
4b    | GD-MLE  | Tri  | Yes  |  …   | …  | Word |   …    |  …
5a    | MMIE    | Quin | Yes  |  …   | …  | Word |   …    |  …
5b    | GD-MLE  | Quin | Yes  |  …   | …  | Word |   …    |  …
Final | Combine 4a+4b+5a+5b

[Pass numbers are inferred from the final combination row; the MLLR/FV settings and the SwbdII/CHE word error rates were lost in transcription and are shown as "…".]

Some Limitations of Current Systems

- Spectral sampling with uniform time-frequency resolution loses precision
- Useful information is discarded in the front-end, e.g. pitch
- Minimal use of prosody and segment duration
- Single-stream "string of beads" model deals inefficiently with many types of phonological effect

Moving to a parallel stream architecture provides more flexibility to investigate these issues:
- Multistream HMMs: Bourlard, Dupont & Ris, 1996
- Factorial HMMs: Ghahramani & Jordan, 1997
- Buried HMMs: Bilmes, 1998
- Mixed Memory HMMs: Saul & Jordan, 1999

Examples of pronunciation variability

- Feature spreading in coalescence: e.g. c ae n t -> c ae t, where the ae is nasalised
- Assimilation causing changes in place of articulation: e.g. n -> m before a labial stop, as in input, can be, grampa
- Asynchronous articulation errors causing stop insertions: e.g. warm[p]th, ten[t]th, on[t]ce, leng[k]th
- r-insertion in vowel-vowel transitions: e.g. stir [r]up, director [r]of
- Context dependent deletion: e.g. nex[t] week

Parallel Stream Processing Models

[Figure: a single-stream HMM contrasted with a parallel-stream model having K observation streams and J Markov chains.]

Architectural Issues

- Number and nature of input data streams
- Single sample-rate or multi-rate
- Number of distributed states
- Synchronisation points (phone, syllable, word, utterance?)
- State-transition coupling
- Observation coupling
- Stream weighting

General "Single Rate" Model

Observations: $\mathbf{y}_t = (y_t^{(1)}, \dots, y_t^{(K)})$, where $y_t^{(k)}$ is the frame from observation stream $k$.
States: $\mathbf{s}_t = (s_t^{(1)}, \dots, s_t^{(J)})$, where $s_t^{(j)}$ is the state of Markov chain $j$.

Assume conditional independence of observation and state components:
$$P(\mathbf{y}_t \mid \mathbf{s}_t) = \prod_{k=1}^{K} P(y_t^{(k)} \mid \mathbf{s}_t), \qquad P(\mathbf{s}_t \mid \mathbf{s}_{t-1}) = \prod_{j=1}^{J} P(s_t^{(j)} \mid \mathbf{s}_{t-1})$$
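A sketch of these two factored quantities in code (illustrative only; the per-stream and per-chain distributions are assumed to be supplied as callables, and the equations above are reconstructed from the surrounding slides since the originals were lost in transcription):

```python
def obs_log_prob(y_streams, meta_state, stream_log_pdfs):
    """log P(y_t | s_t) = sum_k log P(y_t^(k) | s_t):
    the K stream observations are independent given the meta-state."""
    return sum(stream_log_pdfs[k](y, meta_state)
               for k, y in enumerate(y_streams))

def trans_log_prob(meta_state, prev_meta_state, chain_log_trans):
    """log P(s_t | s_{t-1}) = sum_j log P(s_t^(j) | s_{t-1}):
    each chain j advances independently given the full previous meta-state."""
    return sum(chain_log_trans[j](s_j, prev_meta_state)
               for j, s_j in enumerate(meta_state))
```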

Example Architectures

[Figure: special cases of the general model - Standard HMMs, Independent Streams, Multiband HMMs, Factorial HMMs, and the Mixed Memory Approximation - drawn as different coupling patterns between chains and streams.]

Source Feature Extraction

- Single-stream (as in standard systems)
- Multiple subbands (typically 2 to 4 streams)
- Discriminative features (typically 10 to 15 streams)
- Hybrid (e.g. Mel-filter, auditory-filter, prosody, …)
- Multi-resolution (e.g. wavelets)

Training/Decoding Issues

The meta state space is huge, hence exact computation of the posterior probabilities is intractable. Some options:
- Gibbs sampling
- Variational methods
- Chain Viterbi
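To see why the meta state space explodes: J chains of N states each give N^J meta-states, so a fully coupled transition model has N^(2J) entries, whereas independent chains factor into J small per-chain matrices (their meta matrix is a Kronecker product). A small numpy illustration:

```python
import numpy as np

# Three independent 4-state chains. Each chain needs only a 4x4
# transition matrix (3 * 16 = 48 parameters in total).
rng = np.random.default_rng(0)
chains = []
for _ in range(3):
    A = rng.random((4, 4))
    chains.append(A / A.sum(axis=1, keepdims=True))  # row-stochastic

# The equivalent meta-state transition matrix is their Kronecker product:
A_meta = chains[0]
for A in chains[1:]:
    A_meta = np.kron(A_meta, A)

print(A_meta.shape)  # (64, 64): 4**3 = 64 meta-states, 4096 entries
# A fully coupled model has no such factorisation: all 4096 entries are
# free, and forward-backward over 64 meta-states per frame is what makes
# exact EM expensive - hence sampling, variational methods, or chain
# Viterbi approximations.
```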

Some Simple Experiments

- Isolet "alphabet" database
- Exact EM training (courtesy of Harriet Nock)

Topology                | 3-States | 6-States | 3-States | 6-States
Indep Streams           |    …     |    …     |    …     |    …
Obs Coupled             |    …     |    …     |    …     |    …
Fully Coupled           |    …     |    …     |    …     |    …
Multiband (… SubBands)  |    …     |    …     |    …     |    …
Multiband (3 SubBands)  |    …     |    …     |    …     |    …

[The numeric results and the exact column grouping were lost in transcription and are shown as "…".]

Similar results were obtained with Chain Viterbi training/decoding.

Conclusions

- State of the art "conventional HMM" systems continue to improve
- Many stages of complex processing, each giving a small improvement
- Pronunciation modelling in conventional systems is unsatisfactory
- And still no effective use of prosody
- Parallel stream processing models warrant continued study